Yigitcan Kaya, Sanghyun Hong, and Tudor Dumitras. Shallow-deep networks: Understanding and mitigating network overthinking. In Proceedings of the International Conference on Machine Learning, ICML, pages 3301–3310, 2019.
Wangchunshu Zhou, Canwen Xu, Tao Ge, Julian McAuley, Ke Xu, and Furu Wei. BERT loses patience: fast and robust inference with early exit. arXiv:2006.04152, 2020.
Surat Teerapittayanon, Bradley McDanel, and Hsiang-Tsung Kung. BranchyNet: Fast inference via early exiting from deep neural networks. In Proceedings of the International Conference on Pattern Recognition, ICPR, pages 2464–2469, 2016.
Simone Scardapane, Michele Scarpiniti, Enzo Baccarelli, and Aurelio Uncini. Why should we add early exits to neural networks? arXiv:2004.12814, 2020.
Konstantin Berestizshevsky and Guy Even. Dynamically sacrificing accuracy for reduced computation: cascaded inference based on softmax confidence. In Proceedings of the International Conference on Artificial Neural Networks, ICANN, pages 306–320. Springer, 2019.
Main text
One can directly take a pretrained model that has no early exits, freeze its parameters, and train only the internal classifiers (ICs) inserted on top of it, so the method is essentially plug-and-play (see the sketch below).
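As a concrete illustration, here is a minimal PyTorch sketch of this setup; the toy backbone, the tap points, and the IC heads are my own assumptions for illustration, not taken from the paper:

```python
import torch
import torch.nn as nn

# Toy backbone standing in for a pretrained model without early exits.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
)
for p in backbone.parameters():
    p.requires_grad_(False)  # backbone parameters stay fixed

num_classes = 10
# One small classifier head (IC) per tapped intermediate feature map.
ics = nn.ModuleList([
    nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, num_classes)),
    nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes)),
])

def ic_logits(x):
    """Run the frozen backbone once, tapping features for each IC."""
    feats = []
    for i, layer in enumerate(backbone):
        x = layer(x)
        if i in (1, 3):          # tap after each ReLU (illustrative choice)
            feats.append(x)
    return [ic(f) for ic, f in zip(ics, feats)]

# Only the IC parameters reach the optimizer, so training is plug-and-play.
optimizer = torch.optim.Adam(ics.parameters(), lr=1e-3)
logits = ic_logits(torch.randn(2, 3, 32, 32))   # one logit tensor per IC
```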
The paper points out that a geometric mean achieves better results than a weighted arithmetic mean. Concretely, the softmax outputs of multiple ICs are combined by a weighted geometric mean; reconstructing the formula from the definitions that follow, the combined prediction of the $m$-th IC for class $j$ is

$$\hat{p}_m(j \mid x) = \frac{1}{Z_m}\, b_j \prod_{i=1}^{m} p_i(j \mid x)^{w_i},$$

where the geometric-mean weights $w_i$ are learnable parameters (the paper reports that learned weights outperform manually chosen ones); the factor $b_j$ outside the product is an arithmetic per-class weight and is also learnable; and $Z_m$ is a normalization constant that makes a given IC's predicted probabilities sum to 1 over all classes.
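Since $1/Z_m$ only renormalizes over classes, the whole expression can be computed stably in log space, where the normalization becomes a softmax. Below is a minimal PyTorch sketch of this computation; the function name, tensor shapes, and initialization are my own assumptions for illustration:

```python
import torch

def geometric_ensemble(probs: torch.Tensor, w: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Weighted geometric mean of IC softmax outputs.

    probs: (m, C) tensor; row i is IC i's softmax output over C classes
    w:     (m,)  learnable geometric-mean weights
    b:     (C,)  learnable per-class weights (the factor outside the product)
    """
    # In log space: log q_j = log b_j + sum_i w_i * log p_i(j), up to log Z_m.
    log_q = b.log() + (w.unsqueeze(1) * probs.clamp_min(1e-12).log()).sum(dim=0)
    # Softmax over classes implements the 1/Z_m normalization exactly.
    return log_q.softmax(dim=0)

m, C = 3, 10
probs = torch.softmax(torch.randn(m, C), dim=1)   # fake IC outputs
w = torch.ones(m, requires_grad=True)             # learned during training
b = torch.ones(C, requires_grad=True)             # in practice, parametrize log b to keep b > 0
q = geometric_ensemble(probs, w, b)
print(q.sum())  # ~1.0: a valid distribution over classes
```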