[Failure Report] Distill Fine-tuned Transformers into Recurrent Neural Networks
Overview

Motivation

Transformer models [1] have been taking over the NLP field since the advent of BERT [2]. However, the large parameter counts and the self-attention mechanism, whose computation and memory costs scale quadratically with sequence length [3], mean that modern transformer models barely fit on a single consumer-grade GPU. Efforts have been made to alleviate this problem [3][4][5], but they are still far from ideal:

- No public models pre-trained on a BERT-scale corpus exist (at the time of writing). [3]
- The complexity of the public models is no smaller than that of existing transformer models. [4]
- They are just smaller versions of BERT; the self-attention still scales quadratically. [5]

To make inference possible on weaker machines, one of the more appealing solutions is to distill the knowledge of a fine-tuned transformer model into a much simpler model, e.g., an LSTM. Is that possible? Tang et al. [6] show that distillation combined with some data augmentation improves their BiLSTM baseline. Although the resulting accuracies still lag behind those of the transformer models, it is a promising direction (a sketch of the distillation objective follows below). ...
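To make the distillation setup concrete, here is a minimal PyTorch sketch, assuming a sentence-classification task and a frozen fine-tuned transformer teacher whose logits are available at training time. The `BiLSTMStudent` module and all hyperparameters are illustrative, not the exact architecture of this report; the logit-matching MSE objective follows Tang et al. [6], with an optional hard-label term added for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiLSTMStudent(nn.Module):
    """A small BiLSTM classifier serving as the distillation student."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)
        _, (hidden, _) = self.lstm(embedded)
        # Concatenate the final forward and backward hidden states.
        pooled = torch.cat([hidden[0], hidden[1]], dim=-1)
        return self.classifier(pooled)


def distillation_loss(student_logits, teacher_logits, labels, alpha=0.0):
    # Logit matching via MSE, as in Tang et al. [6]; optionally mix in
    # the hard-label cross-entropy with weight alpha.
    soft_loss = F.mse_loss(student_logits, teacher_logits)
    if alpha > 0.0:
        hard_loss = F.cross_entropy(student_logits, labels)
        return alpha * hard_loss + (1.0 - alpha) * soft_loss
    return soft_loss


# Toy setup (all sizes are illustrative).
student = BiLSTMStudent(vocab_size=100, embed_dim=32,
                        hidden_dim=64, num_classes=2)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
input_ids = torch.randint(0, 100, (8, 16))  # batch of 8 sequences, length 16
labels = torch.randint(0, 2, (8,))
# In practice, teacher_logits come from the frozen fine-tuned transformer,
# e.g.:  with torch.no_grad(): teacher_logits = teacher(input_ids).logits
teacher_logits = torch.randn(8, 2)

# One training step: only the student receives gradients.
student_logits = student(input_ids)
loss = distillation_loss(student_logits, teacher_logits, labels, alpha=0.1)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

One design note: matching raw logits with MSE sidesteps the softmax-temperature tuning required by the usual KL-divergence objective, and Tang et al. [6] report that it performed slightly better in their experiments.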