CLIP: Model Principles and Multimodal Learning

Posted on April 5, 2022

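CLIP (Contrastive Language-Image Pre-training) is a multimodal model proposed by OpenAI. It uses contrastive learning to map images and text into a shared feature space, and it established a new paradigm for vision-language pre-training.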

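1. The core idea of CLIP

Traditional computer vision models require labeled data to be collected for each task. CLIP instead performs contrastive learning on 400 million image-text pairs collected from the internet, which enables zero-shot transfer:

Traditional approach: labeled data → train a model → one specific task
CLIP approach: large-scale image-text pairs → contrastive pre-training → zero-shot transfer to downstream tasks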

2. Model architecture

CLIP consists of two encoders: an image encoder and a text encoder.

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLIPModel(nn.Module):
    def __init__(self, image_encoder, text_encoder,
                 embed_dim=512, temperature=0.07):
        super().__init__()
        self.image_encoder = image_encoder  # ViT or ResNet
        self.text_encoder = text_encoder    # Transformer

        # Projection heads into the shared embedding space.
        # This sketch uses a small MLP; the original CLIP uses a single
        # linear projection per modality.
        self.image_projection = nn.Sequential(
            nn.Linear(image_encoder.embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim)
        )
        self.text_projection = nn.Sequential(
            nn.Linear(text_encoder.embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim)
        )

        # Learnable temperature, stored in log space as log(1 / temperature)
        self.logit_scale = nn.Parameter(
            torch.ones([]) * np.log(1 / temperature)
        )

    def forward(self, images, texts):
        # Encode images and texts
        image_features = self.image_encoder(images)
        text_features = self.text_encoder(texts)

        # Project into the shared space
        image_embeddings = self.image_projection(image_features)
        text_embeddings = self.text_projection(text_features)

        # L2-normalize so that dot products are cosine similarities
        image_embeddings = F.normalize(image_embeddings, dim=-1)
        text_embeddings = F.normalize(text_embeddings, dim=-1)

        # Similarity matrix of shape [batch_size, batch_size]
        logit_scale = self.logit_scale.exp()
        logits = logit_scale * image_embeddings @ text_embeddings.T

        return logits, image_embeddings, text_embeddings
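To make the roles of L2 normalization and the learnable temperature more concrete, here is a minimal sketch that uses random tensors in place of real encoder outputs. The batch size, embedding dimension, and seed are arbitrary assumptions for illustration; only the `temperature=0.07` default is taken from the code above.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, for reproducibility only

batch_size, embed_dim = 4, 512  # illustrative sizes, not tuned values

# Stand-ins for projected encoder outputs (random, not real features)
image_embeddings = F.normalize(torch.randn(batch_size, embed_dim), dim=-1)
text_embeddings = F.normalize(torch.randn(batch_size, embed_dim), dim=-1)

# After L2 normalization, the dot product equals cosine similarity
cosine = image_embeddings @ text_embeddings.T  # shape [4, 4], values in [-1, 1]

# logit_scale is stored in log space; exp() recovers 1 / temperature
logit_scale = torch.tensor(math.log(1 / 0.07)).exp()
logits = logit_scale * cosine

print(cosine.shape)                  # torch.Size([4, 4])
print(round(logit_scale.item(), 2))  # 14.29
```

With a temperature of 0.07 the cosine similarities are multiplied by roughly 14.29 before the softmax, which sharpens the resulting distribution.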

3. Contrastive loss function

CLIP uses a symmetric contrastive loss that pushes matched image-text pairs to have the highest similarity:

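def clip_loss(logits):
    """
    Symmetric contrastive loss.
    logits: [batch_size, batch_size] similarity matrix.
    Diagonal entries are matched image-text pairs; all others are mismatched.
    """
    labels = torch.arange(logits.shape[0], device=logits.device)

    # Image-to-text direction
    loss_i2t = F.cross_entropy(logits, labels)
    # Text-to-image direction
    loss_t2i = F.cross_entropy(logits.T, labels)

    return (loss_i2t + loss_t2i) / 2

# Example: a batch containing 4 image-text pairs
# logits[i][j] is the similarity between image i and text j
# A good model makes logits[i][i] the largest entry (the diagonal dominates)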
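As a small worked example of the symmetric loss, the sketch below compares two hand-written 4x4 logit matrices: one where the diagonal dominates and one where every pair scores the same. The function name `symmetric_contrastive_loss` and the numeric values are illustrative assumptions, not model outputs.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(logits):
    """Average of image-to-text and text-to-image cross-entropy."""
    labels = torch.arange(logits.shape[0], device=logits.device)
    loss_i2t = F.cross_entropy(logits, labels)
    loss_t2i = F.cross_entropy(logits.T, labels)
    return (loss_i2t + loss_t2i) / 2

# Hand-written 4x4 logits (illustrative values only)
aligned = torch.zeros(4, 4)
aligned.fill_diagonal_(10.0)        # matched pairs score highest
uninformative = torch.zeros(4, 4)   # every pair scores the same

loss_aligned = symmetric_contrastive_loss(aligned)
loss_uninformative = symmetric_contrastive_loss(uninformative)

print(loss_aligned.item())
print(loss_uninformative.item())    # equals ln(4), about 1.386
```

When all logits are equal, each row's softmax is uniform over 4 candidates, so the loss is ln(4). A dominant diagonal drives the loss toward zero.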

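4. Zero-shot classification

CLIP's most notable capability is zero-shot classification: it can classify images without any task-specific training data.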

def zero_shot_classify(model, image, class_names, template):
    """
    Zero-shot image classification.

    Args:
        model: CLIP model
        image: input image tensor with a batch dimension of 1
        class_names: list of class names, e.g. ["cat", "dog", "bird"]
        template: prompt template, e.g. "A photo of a {}"
    """
    # Build one text prompt per class
    # (`tokenizer` is assumed to be defined in the enclosing scope)
    text_inputs = [template.format(name) for name in class_names]
    text_tokens = tokenizer(text_inputs)

    with torch.no_grad():
        # Encode the image
        image_features = model.image_encoder(image)
        image_embeddings = model.image_projection(image_features)
        image_embeddings = F.normalize(image_embeddings, dim=-1)

        # Encode all class prompts
        text_features = model.text_encoder(text_tokens)
        text_embeddings = model.text_projection(text_features)
        text_embeddings = F.normalize(text_embeddings, dim=-1)

    # Similarity between the image and each class prompt
    similarity = (image_embeddings @ text_embeddings.T).squeeze(0)
    probs = F.softmax(similarity * model.logit_scale.exp(), dim=-1)

    # Print the predicted probabilities
    for name, prob in zip(class_names, probs):
        print(f"{name}: {prob.item():.2%}")

    return class_names[probs.argmax().item()]

# Usage example (`model` and `image` are assumed to be defined already)
result = zero_shot_classify(
    model,
    image,
    ["cat", "dog", "bird", "fish"],
    "A photo of a {}"
)
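The classification step can also be checked in isolation, without a trained model. The sketch below uses hypothetical hand-written 3-dimensional embeddings in place of real encoder outputs; the vectors and class list are assumptions chosen so the result is easy to verify by eye.

```python
import torch
import torch.nn.functional as F

class_names = ["cat", "dog", "bird"]

# Hypothetical unit-length text embeddings, one per class prompt
text_embeddings = F.normalize(torch.tensor([
    [1.0, 0.0, 0.0],   # "A photo of a cat"
    [0.0, 1.0, 0.0],   # "A photo of a dog"
    [0.0, 0.0, 1.0],   # "A photo of a bird"
]), dim=-1)

# Hypothetical image embedding that points mostly toward the "dog" prompt
image_embedding = F.normalize(torch.tensor([[0.2, 0.9, 0.1]]), dim=-1)

logit_scale = 1 / 0.07  # same temperature default as the model above
similarity = (image_embedding @ text_embeddings.T).squeeze(0)
probs = F.softmax(logit_scale * similarity, dim=-1)

prediction = class_names[probs.argmax().item()]
print(prediction)  # dog
```

The class whose prompt embedding has the highest cosine similarity with the image embedding wins. The temperature only affects how peaked the probabilities are, not which class is chosen.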

5. Practical applications

5.1 Image retrieval

def image_text_retrieval(model, query_text, image_database, top_k=5):
    """Text-based image retrieval."""
    with torch.no_grad():
        # Compute embeddings for all images
        # (these could be precomputed once and cached)
        image_features = []
        for img in image_database:
            feat = model.image_encoder(img)
            emb = model.image_projection(feat)
            image_features.append(F.normalize(emb, dim=-1))

        image_features = torch.cat(image_features, dim=0)

        # Encode the query text
        # (`tokenizer` is assumed to be defined in the enclosing scope)
        text_tokens = tokenizer([query_text])
        text_feat = model.text_encoder(text_tokens)
        text_emb = F.normalize(model.text_projection(text_feat), dim=-1)

    # Similarity of the query to every image: [1, N] -> [N], then sort
    similarities = (text_emb @ image_features.T).squeeze(0)
    top_indices = similarities.argsort(descending=True)[:top_k]

    return [image_database[i] for i in top_indices.tolist()]
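Because the image embeddings do not depend on the query, a retrieval system would typically embed the database once and reuse the result. The following is a minimal sketch of that pattern using random tensors as placeholder embeddings; `build_index` and `search` are hypothetical helper names, and the database size and dimension are arbitrary.

```python
import torch
import torch.nn.functional as F

def build_index(image_embeddings):
    """Normalize once so each query is a single matrix product."""
    return F.normalize(image_embeddings, dim=-1)

def search(index, query_embedding, top_k=5):
    """Return (indices, scores) of the top_k most similar images."""
    query = F.normalize(query_embedding, dim=-1)
    scores = (query @ index.T).squeeze(0)   # shape [num_images]
    k = min(top_k, index.shape[0])          # guard against small databases
    top = scores.topk(k)
    return top.indices.tolist(), top.values.tolist()

torch.manual_seed(0)
database = torch.randn(100, 512)  # placeholder for 100 projected image features
index = build_index(database)

# A query built from image 42 plus small noise should retrieve image 42 first
query = database[42:43] + 0.01 * torch.randn(1, 512)
indices, scores = search(index, query, top_k=3)
print(indices[0])  # 42
```

Using `topk` avoids sorting the full score vector, which matters more as the database grows.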

5.2 Image captioning (selecting from candidates)

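def image_captioning(model, image, candidate_captions):
    """Select the best-matching caption from a list of candidates."""
    with torch.no_grad():
        # Embed the image with the same encoder, projection, and
        # normalization steps used in the functions above
        image_feat = model.image_encoder(image)
        image_emb = F.normalize(model.image_projection(image_feat), dim=-1)

        best_score = -float('inf')
        best_caption = None

        for caption in candidate_captions:
            # `tokenizer` is assumed to be defined in the enclosing scope
            text_feat = model.text_encoder(tokenizer([caption]))
            text_emb = F.normalize(model.text_projection(text_feat), dim=-1)
            score = (image_emb @ text_emb.T).item()
            if score > best_score:
                best_score = score
                best_caption = caption

    return best_caption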

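6. Limitations of CLIP and directions for improvement

Limitation	Description	Direction for improvement
Weak fine-grained recognition	Limited ability to distinguish subclasses	Fine-grained contrastive learning
Poor OCR ability	Struggles to read text in images	Incorporate OCR data
Weak counting ability	Struggles to count objects accurately	Add counting tasks
Resolution limits	Input image resolution is low	High-resolution adapters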

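7. Later developments building on CLIP

OpenCLIP: an open-source reproduction
EVA-CLIP: improved training strategies
SigLIP: replaces the softmax loss with a sigmoid loss
Chinese-CLIP: a Chinese-language multimodal model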
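To illustrate the sigmoid-versus-softmax distinction mentioned for SigLIP, here is a rough sketch of a pairwise sigmoid loss in which every image-text pair is treated as an independent binary classification. This is a simplified illustration, not the official implementation: the function name and the `scale` and `bias` values are assumptions chosen for the example.

```python
import torch
import torch.nn.functional as F

def sigmoid_pairwise_loss(image_embeddings, text_embeddings, scale, bias):
    """Sketch of a pairwise sigmoid loss over all image-text pairs in a batch."""
    logits = scale * image_embeddings @ text_embeddings.T + bias
    n = logits.shape[0]
    # +1 on the diagonal (matched pairs), -1 elsewhere (mismatched pairs)
    targets = 2 * torch.eye(n, device=logits.device) - 1
    # Each pair contributes an independent binary log-likelihood term
    return -F.logsigmoid(targets * logits).sum() / n

torch.manual_seed(0)
emb = F.normalize(torch.randn(4, 8), dim=-1)
shuffled = emb[[1, 2, 3, 0]]  # every text is paired with the wrong image

# scale and bias are illustrative assumptions, not recommended settings
loss_matched = sigmoid_pairwise_loss(emb, emb, scale=10.0, bias=-5.0)
loss_mismatched = sigmoid_pairwise_loss(emb, shuffled, scale=10.0, bias=-5.0)
print(loss_matched.item(), loss_mismatched.item())
```

Unlike the softmax-based loss in section 3, no normalization across the row or column is needed here, so each pair's term can be computed independently of the rest of the batch.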

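Summary

CLIP uses contrastive learning to map vision and language into a shared feature space, which gives it strong zero-shot transfer capability. It is more than a single model: it introduced a paradigm of training vision models with natural language as the supervision signal, laying the groundwork for later multimodal AI.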