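Recent Advances and Applications in Multimodal Learning
Introduction
Multimodal learning aims to integrate information from different sensory channels, such as text, images, audio, and video. In recent years, progress in vision-language models has led to breakthrough advances in the field.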
CLIP: Connecting Vision and Language
CLIP Architecture
```python
import numpy as np
import torch
import torch.nn as nn


class CLIP(nn.Module):
    """CLIP: Contrastive Language-Image Pre-training"""

    def __init__(self, vision_model, text_model):
        super().__init__()
        self.vision_encoder = vision_model
        self.text_encoder = text_model
        # Project both modalities into a shared 512-dim embedding space.
        # Both encoders are assumed to output 768-dim features.
        self.vision_projection = nn.Linear(768, 512)
        self.text_projection = nn.Linear(768, 512)
        # Learnable temperature, stored in log space and initialized to log(1 / 0.07).
        self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))

    def encode_image(self, image):
        features = self.vision_encoder(image)
        return self.vision_projection(features)

    def encode_text(self, text):
        features = self.text_encoder(text)
        return self.text_projection(features)

    def forward(self, image, text):
        image_features = self.encode_image(image)
        text_features = self.encode_text(text)

        # L2-normalize so the dot product below is a cosine similarity.
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)

        # Scaled pairwise similarities: rows are images, columns are texts.
        logit_scale = self.logit_scale.exp()
        logits_per_image = logit_scale * image_features @ text_features.t()
        logits_per_text = logits_per_image.t()

        return logits_per_image, logits_per_text
```
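The forward pass above returns two logit matrices but does not show how they are trained. A minimal sketch of the symmetric contrastive objective commonly paired with this architecture follows, assuming the i-th image and the i-th text in a batch form the matching pair. The function name `clip_contrastive_loss` and the random stand-in features are illustrative, not part of the original listing.

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(logits_per_image, logits_per_text):
    """Symmetric cross-entropy over a batch of matched image-text pairs."""
    batch_size = logits_per_image.shape[0]
    # Assumption: pair i is (image i, text i), so the targets lie on the diagonal.
    targets = torch.arange(batch_size, device=logits_per_image.device)
    loss_image = F.cross_entropy(logits_per_image, targets)  # image -> text direction
    loss_text = F.cross_entropy(logits_per_text, targets)    # text -> image direction
    return (loss_image + loss_text) / 2


# Random unit vectors stand in for real encoder outputs (illustrative only).
torch.manual_seed(0)
image_features = F.normalize(torch.randn(4, 512), dim=-1)
text_features = F.normalize(torch.randn(4, 512), dim=-1)
logits_per_image = (1 / 0.07) * image_features @ text_features.t()
loss = clip_contrastive_loss(logits_per_image, logits_per_image.t())
print(loss.item())
```

Averaging the two directions keeps the objective symmetric: each image has to pick out its text among the batch, and each text has to pick out its image.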
Vision-Language Models
LLaVA Architecture
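```python
import torch
import torch.nn as nn


class LLaVA(nn.Module):
    """LLaVA: Large Language and Vision Assistant"""

    def __init__(self, llm, vision_tower, projector):
        super().__init__()
        self.llm = llm                    # language model that produces the response
        self.vision_tower = vision_tower  # image encoder
        self.projector = projector        # maps visual features into the LLM embedding space

    def prepare_inputs(self, image_features, text):
        # Not defined in the original listing. Assumption: `text` already holds
        # text embeddings of shape (batch, seq_len, hidden), and the projected
        # image tokens are prepended along the sequence dimension.
        return torch.cat([image_features, text], dim=1)

    def forward(self, image, text):
        # Encode the image and project it to the LLM's hidden size.
        image_features = self.vision_tower(image)
        image_features = self.projector(image_features)

        # Merge both modalities into one input sequence and generate.
        # The exact generate() signature depends on the LLM wrapper in use.
        inputs = self.prepare_inputs(image_features, text)
        outputs = self.llm.generate(inputs)

        return outputs
```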
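The class above takes the projector as a constructor argument without saying what it looks like. As a rough sketch, assuming a two-layer MLP projector and image tokens placed in front of the text embeddings, the shape flow could look like the following. The `build_projector` helper and every dimension here are hypothetical and chosen small so the example runs quickly.

```python
import torch
import torch.nn as nn

# Illustrative sizes only; real models use much larger dimensions.
VISION_DIM, LLM_DIM = 64, 128


def build_projector(vision_dim=VISION_DIM, llm_dim=LLM_DIM):
    """Hypothetical two-layer MLP mapping visual features to the LLM embedding size."""
    return nn.Sequential(
        nn.Linear(vision_dim, llm_dim),
        nn.GELU(),
        nn.Linear(llm_dim, llm_dim),
    )


projector = build_projector()
patch_features = torch.randn(2, 16, VISION_DIM)  # (batch, image patches, vision_dim)
text_embeds = torch.randn(2, 8, LLM_DIM)         # (batch, text tokens, llm_dim)

image_tokens = projector(patch_features)         # (batch, image patches, llm_dim)
# Assumed layout: image tokens first, then text tokens, along the sequence axis.
inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)
print(inputs_embeds.shape)  # torch.Size([2, 24, 128])
```

After projection the image patches behave like extra tokens, so the language model needs no architectural change to consume them.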
Multimodal Fusion Strategies
Early Fusion vs. Late Fusion
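| Strategy | Description | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Early fusion | Concatenates raw features | Rich cross-modal feature interaction | High computational cost |
| Late fusion | Processes each modality independently, then fuses the results | Flexible and efficient | Limited cross-modal interaction |
| Intermediate fusion | Fuses deep (intermediate-layer) features | Balances interaction and efficiency | Complex to implement |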
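To make the contrast between the first two rows concrete, here is a minimal sketch of early and late fusion for a classification task. It assumes each modality has already been reduced to a fixed-size feature vector. The class names, layer sizes, and the choice to average logits in the late-fusion case are illustrative assumptions, and intermediate fusion is left out for brevity.

```python
import torch
import torch.nn as nn


class EarlyFusion(nn.Module):
    """Concatenate the modality features first, then process them jointly."""

    def __init__(self, image_dim, text_dim, hidden_dim, num_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(image_dim + text_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, image_feat, text_feat):
        # Every later layer sees both modalities at once.
        return self.net(torch.cat([image_feat, text_feat], dim=-1))


class LateFusion(nn.Module):
    """Give each modality its own head and combine only the predictions."""

    def __init__(self, image_dim, text_dim, num_classes):
        super().__init__()
        self.image_head = nn.Linear(image_dim, num_classes)
        self.text_head = nn.Linear(text_dim, num_classes)

    def forward(self, image_feat, text_feat):
        # The modalities interact only through this final average.
        return (self.image_head(image_feat) + self.text_head(text_feat)) / 2


image_feat = torch.randn(8, 512)
text_feat = torch.randn(8, 256)
early = EarlyFusion(512, 256, hidden_dim=128, num_classes=10)
late = LateFusion(512, 256, num_classes=10)
print(early(image_feat, text_feat).shape)  # torch.Size([8, 10])
print(late(image_feat, text_feat).shape)   # torch.Size([8, 10])
```

In the late-fusion module either head can be trained or replaced on its own, which is where the flexibility listed in the table comes from. The price is that no layer ever models joint image-text features.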
Applications
Zero-Shot Image Classification
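```python
import torch


class ZeroShotClassifier:
    """Zero-shot classification with CLIP"""

    def __init__(self, clip_model, classes):
        self.model = clip_model
        self.classes = classes

    @torch.no_grad()
    def classify(self, image):
        # `image` is assumed to be one preprocessed image with a batch dimension of 1.
        image_features = self.model.encode_image(image)

        # One prompt per class. encode_text is assumed to accept raw strings,
        # that is, tokenization happens inside the text encoder.
        text_descriptions = [f"a photo of a {c}" for c in self.classes]
        text_features = self.model.encode_text(text_descriptions)

        # encode_* return unnormalized projections, so normalize both sides
        # to compare them by cosine similarity.
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)

        similarity = image_features @ text_features.t()
        return self.classes[similarity.argmax().item()]
```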
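The classifier above returns only the best-matching label. When a confidence score per class is useful, the similarities can be scaled and passed through a softmax. The sketch below assumes the features are already computed and uses toy vectors in place of real encoder outputs. The function name `zero_shot_probs` is illustrative, and the default scale of `1 / 0.07` simply mirrors the initial temperature in the CLIP listing.

```python
import torch
import torch.nn.functional as F


def zero_shot_probs(image_features, text_features, logit_scale=1 / 0.07):
    """Turn image-text cosine similarities into per-class probabilities."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = logit_scale * image_features @ text_features.t()
    return logits.softmax(dim=-1)


classes = ["cat", "dog", "car"]
# Toy embeddings stand in for real encoder outputs (illustrative only).
text_features = torch.eye(3)                      # one prompt embedding per class
image_features = torch.tensor([[0.1, 0.9, 0.2]])  # closest to the "dog" prompt
probs = zero_shot_probs(image_features, text_features)
print(classes[probs.argmax(dim=-1).item()])  # dog
```

A larger `logit_scale` sharpens the distribution toward the top class, while a smaller one spreads probability more evenly across classes.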
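Summary
Multimodal learning is becoming a mainstream direction in AI, and breakthroughs in vision-language models are laying the groundwork for artificial general intelligence.
Recommended reading: the original CLIP and LLaVA papers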