Model Compression and Knowledge Distillation in Practice

Introduction

Deep learning models keep growing in size, which makes model compression a key technology for deploying them on edge devices. Knowledge distillation is one of the most effective compression methods.

Knowledge Distillation Principles

Core Idea
Teacher model → knowledge → Student model
        ↓ soft labels
        ↓ the temperature parameter T controls the degree of softening
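To make the role of the temperature T concrete, here is a minimal sketch; the logit values are made up purely for illustration:

```python
import torch
import torch.nn.functional as F

# Hypothetical teacher logits for a single 3-class example
logits = torch.tensor([[8.0, 2.0, 1.0]])

# T = 1 is the ordinary softmax: nearly all probability mass on the top class
print(F.softmax(logits / 1.0, dim=1))   # ≈ [0.997, 0.002, 0.001]

# T = 4 softens the distribution, exposing how the teacher ranks the other classes
print(F.softmax(logits / 4.0, dim=1))   # ≈ [0.72, 0.16, 0.12]
```

This softened distribution is the "knowledge" the student mimics in the KL term of the loss below.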
Distillation Loss Function
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeDistillationLoss(nn.Module):
    def __init__(self, temperature=4.0, alpha=0.7):
        super().__init__()
        self.T = temperature
        self.alpha = alpha

    def forward(self, student_logits, teacher_logits, labels):
        # Hard loss: ordinary cross-entropy against the ground-truth labels
        hard_loss = F.cross_entropy(student_logits, labels)

        # Soft loss: KL divergence between temperature-softened distributions
        soft_teacher = F.softmax(teacher_logits / self.T, dim=1)
        soft_student = F.log_softmax(student_logits / self.T, dim=1)
        soft_loss = F.kl_div(soft_student, soft_teacher, reduction='batchmean')

        # Multiply by T^2 so the soft-loss gradients keep a comparable scale
        soft_loss = soft_loss * (self.T ** 2)

        return self.alpha * hard_loss + (1 - self.alpha) * soft_loss
```
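A minimal training-step sketch using this loss; `teacher`, `student`, `dataloader`, and `optimizer` are placeholders for your own models, data, and optimizer:

```python
criterion = KnowledgeDistillationLoss(temperature=4.0, alpha=0.7)
teacher.eval()  # the teacher is frozen during distillation

for images, labels in dataloader:
    with torch.no_grad():
        teacher_logits = teacher(images)
    student_logits = student(images)

    loss = criterion(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```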
Advanced Distillation Strategies

Multi-Teacher Distillation
```python
class MultiTeacherDistillation:
    """Distill the averaged knowledge of several teachers into one student."""
    def __init__(self, teachers, student, temperature=4.0):
        self.teachers = teachers
        self.student = student
        self.T = temperature

    def distill(self, data):
        # Collect every teacher's logits without tracking gradients
        teacher_outputs = []
        for teacher in self.teachers:
            with torch.no_grad():
                out = teacher(data)
            teacher_outputs.append(out)

        # Combine the teachers' knowledge by simple averaging
        combined_knowledge = torch.stack(teacher_outputs).mean(dim=0)

        student_out = self.student(data)
        loss = self.compute_distillation_loss(student_out, combined_knowledge)
        return loss

    def compute_distillation_loss(self, student_logits, teacher_logits):
        # Temperature-softened KL divergence, as in KnowledgeDistillationLoss above
        soft_teacher = F.softmax(teacher_logits / self.T, dim=1)
        soft_student = F.log_softmax(student_logits / self.T, dim=1)
        return F.kl_div(soft_student, soft_teacher, reduction='batchmean') * (self.T ** 2)
```
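A usage sketch, assuming `teacher_a`, `teacher_b`, `student`, `dataloader`, and `optimizer` already exist and using the constructor shown above. Simple averaging treats all teachers equally; a common variant instead weights each teacher by its validation accuracy:

```python
distiller = MultiTeacherDistillation(
    teachers=[teacher_a, teacher_b],
    student=student,
    temperature=4.0,
)

for images, _ in dataloader:
    loss = distiller.distill(images)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```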
Intermediate-Layer Distillation
```python
class IntermediateDistillation:
    """Intermediate-layer feature distillation."""
    def __init__(self, teacher, student):
        self.teacher = teacher
        self.student = student

    def forward(self, x):
        # Both models are assumed to expose extract_features(), returning
        # a list of intermediate feature maps at matching depths
        with torch.no_grad():
            teacher_features = self.teacher.extract_features(x)
        student_features = self.student.extract_features(x)

        # Match each student feature map against the corresponding teacher map
        feat_loss = 0.0
        for t_feat, s_feat in zip(teacher_features, student_features):
            feat_loss = feat_loss + F.mse_loss(s_feat, t_feat)
        return feat_loss
```
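In practice the student's feature maps rarely have the same channel dimensions as the teacher's, so a learnable adapter is usually inserted before the MSE. A minimal sketch; the channel sizes below are made-up examples, not values from any particular model:

```python
import torch.nn as nn

# 1x1 convolutions projecting student features into the teacher's channel space
# (channel sizes are hypothetical; use the shapes of your own models)
adapters = nn.ModuleList([
    nn.Conv2d(in_channels=64,  out_channels=256, kernel_size=1),
    nn.Conv2d(in_channels=128, out_channels=512, kernel_size=1),
])

def adapted_feature_loss(student_features, teacher_features):
    loss = 0.0
    for adapter, s_feat, t_feat in zip(adapters, student_features, teacher_features):
        loss = loss + F.mse_loss(adapter(s_feat), t_feat)
    return loss
```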
Model Pruning

Structured Pruning
```python
class StructuredPruner:
    """Structured (filter-level) pruner."""
    def __init__(self, model, sparsity=0.5):
        self.model = model
        self.sparsity = sparsity

    def prune(self):
        for name, module in self.model.named_modules():
            if isinstance(module, nn.Conv2d):
                # Importance of each output filter: mean absolute weight
                # (weight shape is [out_channels, in_channels, kH, kW])
                importance = module.weight.abs().mean(dim=(1, 2, 3))

                # Zero out the filters whose importance falls below the sparsity quantile
                threshold = torch.quantile(importance, self.sparsity)
                mask = (importance > threshold).to(module.weight.dtype)
                module.weight.data = module.weight.data * mask.view(-1, 1, 1, 1)
```
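A usage sketch, where `model` stands in for your own network. Note that masking weights to zero does not by itself shrink the tensors; PyTorch's built-in torch.nn.utils.prune utilities perform the same kind of masking and are a common alternative:

```python
pruner = StructuredPruner(model, sparsity=0.5)
pruner.prune()

# Alternative: built-in structured pruning of whole filters (dim=0) by L1 norm
import torch.nn.utils.prune as prune
for module in model.modules():
    if isinstance(module, nn.Conv2d):
        prune.ln_structured(module, name='weight', amount=0.5, n=1, dim=0)
```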
Quantization

Dynamic Quantization
```python
class DynamicQuantizer:
    """Post-training dynamic quantization to INT8."""
    def quantize(self, model):
        # Weights of Linear and LSTM layers are stored as INT8;
        # activations are quantized on the fly at inference time
        quantized_model = torch.quantization.quantize_dynamic(
            model,
            {nn.Linear, nn.LSTM},
            dtype=torch.qint8
        )
        return quantized_model
```
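A usage sketch; `MyModel` is a placeholder for your own FP32 network. Dynamic quantization only converts the listed layer types (here Linear and LSTM), so convolution-heavy models benefit less from it:

```python
model = MyModel()   # hypothetical FP32 model
model.eval()

quantizer = DynamicQuantizer()
quantized = quantizer.quantize(model)

# Rough size comparison on disk
torch.save(model.state_dict(), "model_fp32.pt")
torch.save(quantized.state_dict(), "model_int8.pt")
```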
Performance Comparison
| Method | Compression ratio | Accuracy retained | Speedup |
| --- | --- | --- | --- |
| Knowledge distillation | 4-10x | 95%+ | 2-5x |
| Pruning | 2-10x | 90%+ | 1.5-3x |
| Quantization (INT8) | 4x | 98%+ | 2-4x |
| Quantization (INT4) | 8x | 90%+ | 4-8x |
Summary

Model compression techniques make it possible to run deep learning models efficiently on resource-constrained devices.

Recommended reading: "Knowledge Distillation: A Survey"