Independent Study Presentation

Title:  Adaptive Cross-architecture Mutual Knowledge Distillation


Presenter:  Jianyuan Ni


Advisor:  Dr. Shen


Date/Time:  Thursday, December 1st @2:00 p.m.





Knowledge distillation (KD), which transfers knowledge from a complex (teacher) network to a lightweight (student) network, has been actively studied in recent years. However, the key idea of KD fails to work properly if the student model is too weak to mimic the teacher's performance. In addition, dissimilar architectures between the teacher and student models in the traditional KD process usually lead to a substantial performance gap. Unlike previous works, which heavily relied on hand-designed KD losses or complicated training strategies to mitigate the performance gap, we investigate the possibility of reducing this gap from the perspective of heterogeneous architectures' inductive biases. To this end, we propose a novel cross-architecture knowledge distillation method, named Adaptive Cross-architecture Mutual Knowledge Distillation (ACMKD), which aims to eliminate the performance gap through a multi-student mutual learning strategy. Specifically, we adopt three mainstream models with different inductive biases (CNN, INN, and Transformer) as the student models. In addition, we propose an effective attention similarity mechanism that guides the student models toward the specific parts of the teacher model to mimic. Inspired by the Cannikin Law, we design a novel dynamic second-stage KD process so that the weakest student model has the opportunity to learn again from the stronger student models. We validate our method on the ImageNet and CIFAR-100 datasets, and the results confirm that our proposed ACMKD method achieves the lowest performance gap compared with other Transformer-based KD methods.
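The announcement does not give ACMKD's exact loss formulation, so the following is only a minimal illustrative sketch of the general idea it builds on: combining teacher-to-student distillation with student-to-student mutual learning, using temperature-softened distributions and KL divergence. All function names, the temperature, and the weighting `alpha` here are hypothetical choices, not the authors' method.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax; higher temperature yields softer distributions,
    # which is the standard trick in KD for exposing "dark knowledge".
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(p || q); assumes both inputs are strictly positive distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def mutual_kd_loss(teacher_logits, student_logits_list,
                   temperature=4.0, alpha=0.5):
    """Hypothetical sketch: each student is pulled toward the teacher's
    softened outputs, and students also learn from one another (mutual
    learning), averaged over all ordered student pairs."""
    t = softmax(teacher_logits, temperature)
    students = [softmax(s, temperature) for s in student_logits_list]
    # Teacher -> student distillation term, averaged over students.
    teacher_term = sum(kl_divergence(t, s) for s in students) / len(students)
    # Student <-> student mutual-learning term over ordered pairs.
    pairs = [(a, b) for i, a in enumerate(students)
             for j, b in enumerate(students) if i != j]
    mutual_term = sum(kl_divergence(a, b) for a, b in pairs) / len(pairs)
    return alpha * teacher_term + (1 - alpha) * mutual_term
```

In a real multi-student setup such as the one described above, each student would also carry its own supervised cross-entropy term, and the mutual term is what lets a weaker student keep learning from stronger peers after the teacher's signal is exhausted.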



Deadline: Dec. 16, 2022, midnight