
Knowledge Distill via NST


Approach

Knowledge Transfer (KT), which aims at training a smaller student network by transferring knowledge from a larger teacher model, is one of the popular solutions for compressing and accelerating deep networks. In this paper, we propose a novel knowledge transfer method by treating it as a distribution matching problem. In particular, we match the distributions of neuron selectivity patterns between the teacher and student networks. To achieve this goal, we devise a new KT loss function that minimizes the Maximum Mean Discrepancy (MMD) metric between these distributions.

  • Notations
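    Following the paper's notation: let $F^T \in \mathbb{R}^{C_T \times HW}$ and $F^S \in \mathbb{R}^{C_S \times HW}$ denote the feature maps of the teacher and the student at a chosen layer, where $H$ and $W$ are the spatial dimensions (assumed equal for the two networks) and $C_T$, $C_S$ are the numbers of channels. Each row $f^{k\cdot}$ of such a matrix is the flattened activation map of the $k$-th filter, and each column $f^{\cdot k}$ is the activation vector of one spatial position.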


  • Maximum Mean Discrepancy
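    Recall the definition of (squared) MMD, which the paper adopts as its distance between distributions. Given samples $\mathcal{X} = \{x^i\}_{i=1}^{N}$ and $\mathcal{Y} = \{y^j\}_{j=1}^{M}$ drawn from two distributions, it measures the distance between their mean embeddings under a feature map $\phi(\cdot)$:

    $$\mathcal{L}_{\mathrm{MMD}^2}(\mathcal{X}, \mathcal{Y}) = \Big\| \frac{1}{N} \sum_{i=1}^{N} \phi(x^i) - \frac{1}{M} \sum_{j=1}^{M} \phi(y^j) \Big\|_2^2$$

    Expanding the square and applying the kernel trick $k(x, y) = \phi(x)^\top \phi(y)$ yields

    $$\mathcal{L}_{\mathrm{MMD}^2}(\mathcal{X}, \mathcal{Y}) = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{i'=1}^{N} k(x^i, x^{i'}) + \frac{1}{M^2} \sum_{j=1}^{M} \sum_{j'=1}^{M} k(y^j, y^{j'}) - \frac{2}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} k(x^i, y^j)$$

    For a characteristic kernel, MMD is zero if and only if the two distributions coincide, so minimizing it pulls the student's distribution toward the teacher's.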

  • Neuron Selectivity Transfer
    The regions with high activations from a neuron may share some task-related similarities. To capture these similarities, the student network should also contain neurons that mimic these activation patterns.

Considering the activation at each spatial position as one feature, the flattened activation map of each filter is then a sample in the space of neuron selectivities, of dimension HW. This sample distribution reflects how a CNN interprets an input image: where does the CNN focus its attention?

Then we can define the Neuron Selectivity Transfer loss as:

$$\mathcal{L}_{\mathrm{NST}}(W_S) = \mathcal{H}(y_{\mathrm{true}}, p_S) + \frac{\lambda}{2} \mathcal{L}_{\mathrm{MMD}^2}(F_T, F_S)$$

where $\mathcal{H}$ is the cross-entropy loss, $p_S$ is the student's prediction, $\lambda$ balances the two terms, and each row of $F_T$ and $F_S$ (i.e., each filter's flattened activation map) is L2-normalized before the MMD term is computed. The kernel in the MMD term can be a linear kernel $k(x, y) = x^\top y$, a polynomial kernel $k(x, y) = (x^\top y + c)^d$ (the paper uses $d = 2$, $c = 0$), or a Gaussian kernel $k(x, y) = \exp(-\|x - y\|_2^2 / 2\sigma^2)$. Notably, with the linear kernel, NST reduces to matching the averaged normalized activation maps, which closely relates it to attention transfer.
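To make the computation concrete, below is a minimal PyTorch sketch of the MMD² term with a polynomial kernel. This is not the authors' released implementation; the function names (`poly_kernel`, `nst_mmd2`) and the per-image `(C, H, W)` input shape are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def poly_kernel(a: torch.Tensor, b: torch.Tensor, d: int = 2, c: float = 0.0) -> torch.Tensor:
    # Pairwise polynomial kernel matrix: a is (N, D), b is (M, D) -> (N, M).
    return (a @ b.t() + c) ** d


def nst_mmd2(ft: torch.Tensor, fs: torch.Tensor, kernel=poly_kernel) -> torch.Tensor:
    """Squared MMD between teacher and student neuron selectivity samples.

    ft: teacher feature map of shape (C_T, H, W)
    fs: student feature map of shape (C_S, H, W), same H and W as the teacher's
    """
    # Each filter's flattened activation map is one sample of dimension H*W;
    # L2-normalize every sample before applying the kernel, as in the paper.
    ft = F.normalize(ft.flatten(1), dim=1)  # (C_T, H*W)
    fs = F.normalize(fs.flatten(1), dim=1)  # (C_S, H*W)
    k_tt = kernel(ft, ft).mean()  # (1/N^2) * sum_{i,i'} k(x_i, x_i')
    k_ss = kernel(fs, fs).mean()  # (1/M^2) * sum_{j,j'} k(y_j, y_j')
    k_ts = kernel(ft, fs).mean()  # (1/(N*M)) * sum_{i,j} k(x_i, y_j)
    return k_tt + k_ss - 2.0 * k_ts
```

In training, this term would simply be added to the usual classification loss, e.g. `loss = F.cross_entropy(logits_s, labels) + lam / 2 * nst_mmd2(ft, fs)`, where `lam` is the weighting hyperparameter $\lambda$ above.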

Experiment


References:
Zehao Huang and Naiyan Wang. Like What You Like: Knowledge Distill via Neuron Selectivity Transfer. arXiv:1707.01219, 2017.
