Medium

Chapter 16: Multimodal / Vision-Language (Lighter) · Problem 2 of 3

Implement a projection adapter and interleave image $+$ text embeddings into one sequence with correct attention masking.
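The projection adapter maps each vision-encoder patch feature of width d_v into the language model's embedding width d_model. This lets image tokens sit in the same sequence as the text embeddings. Below is a minimal sketch. It assumes a two-layer MLP with a GELU in between (the LLaVA-1.5 choice; a single nn.Linear also satisfies the skeleton's interface) and inputs shaped (B, N, d_v), where N is the number of image patches.

```python
import torch
import torch.nn as nn


class Projector(nn.Module):
    """Maps vision features of width d_v into the LM embedding space of width d_model."""

    def __init__(self, d_v: int, d_model: int):
        super().__init__()
        # Assumption: 2-layer MLP with GELU; a single nn.Linear(d_v, d_model) is the simpler variant.
        self.net = nn.Sequential(
            nn.Linear(d_v, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, d_v) patch features -> (B, N, d_model) "image token" embeddings
        return self.net(x)
```

nn.Linear acts on the last dimension, so the same module handles any batch size and any number of patches N. Only the feature width changes.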

Implement the function/class skeleton below. Any correct approach is accepted.

solution.py

import torch
import torch.nn as nn

class Projector(nn.Module):

    def __init__(self, d_v, d_model):
        raise NotImplementedError

    def forward(self, x):
        raise NotImplementedError

def interleave(text_embeds, text_mask, img_embeds, image_token_pos):
    raise NotImplementedError
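One way to do the interleaving is sketched below. It fills in several details that the skeleton leaves open, and these are assumptions:

- text_embeds is (B, T, D).
- text_mask is (B, T), with True/1 for real tokens.
- img_embeds is (B, N, D) and has already been projected.
- image_token_pos is a (B,) integer tensor giving the text index before which the N image tokens are inserted. A scheme that replaces an image placeholder token would differ slightly.

The function returns the (B, T + N, D) sequence plus a (B, T + N) padding mask in which every image position is valid. The hypothetical helper build_attention_mask then combines that padding mask with a causal mask. It uses the boolean convention of torch.nn.functional.scaled_dot_product_attention, where True means "may attend"; nn.MultiheadAttention uses the opposite convention.

```python
import torch


def interleave(text_embeds, text_mask, img_embeds, image_token_pos):
    """Insert img_embeds into text_embeds before index image_token_pos[b] for each example b.

    Assumed shapes: text_embeds (B, T, D), text_mask (B, T), img_embeds (B, N, D),
    image_token_pos (B,) integer tensor with 0 <= pos <= T.
    Returns: embeds (B, T + N, D) and a boolean padding mask (B, T + N).
    """
    B, T, D = text_embeds.shape
    N = img_embeds.shape[1]
    device = text_embeds.device

    t_idx = torch.arange(T, device=device).unsqueeze(0)           # (1, T)
    pos = image_token_pos.to(device).long().reshape(B, 1)          # (B, 1)
    # Text tokens at or after the insertion point shift right by N slots.
    text_dest = t_idx + N * (t_idx >= pos).long()                  # (B, T)
    # Image tokens occupy slots pos, pos + 1, ..., pos + N - 1.
    img_dest = pos + torch.arange(N, device=device).unsqueeze(0)   # (B, N)

    embeds = text_embeds.new_zeros(B, T + N, D)
    embeds.scatter_(1, text_dest.unsqueeze(-1).expand(B, T, D), text_embeds)
    embeds.scatter_(1, img_dest.unsqueeze(-1).expand(B, N, D), img_embeds.to(embeds.dtype))

    mask = torch.zeros(B, T + N, dtype=torch.long, device=device)
    mask.scatter_(1, text_dest, text_mask.long())          # text keeps its padding pattern
    mask.scatter_(1, img_dest, torch.ones_like(img_dest))  # image tokens are always valid
    return embeds, mask.bool()


def build_attention_mask(pad_mask):
    """Hypothetical helper: (B, L) padding mask -> (B, L, L) boolean mask, True = may attend."""
    L = pad_mask.shape[1]
    causal = torch.ones(L, L, dtype=torch.bool, device=pad_mask.device).tril()
    # Query i may attend key j iff j <= i and key j is a real (non-padding) token.
    allowed = causal.unsqueeze(0) & pad_mask.unsqueeze(1)
    # Keep the diagonal open so no query row is fully masked (which would give NaN in softmax).
    eye = torch.eye(L, dtype=torch.bool, device=pad_mask.device)
    return allowed | eye.unsqueeze(0)
```

Scattering into precomputed destination indices avoids a Python loop over the batch and allows a different insertion point per example. Keeping the diagonal open matters mainly with left padding, where a padded query can otherwise see no valid key. The outputs at padded positions are ignored anyway. For multi-head attention, add a head dimension, e.g. attn_mask=build_attention_mask(mask).unsqueeze(1).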
