[Failure Report] Distill Fine-tuned Transformers into Recurrent Neural Networks

Motivation: Transformer models[1] have been taking over the NLP field since the advent of BERT[2]. However, the high number of parameters and the quadratically scaling self-attention, which is expensive in both computation and memory[3], mean that modern transformer models barely fit into a single consumer-grade GPU. Efforts have been made to alleviate this problem[3][4][5], but they are still far from ideal: [3] has no public models pre-trained on a BERT-scale corpus (at the time of writing); the public models of [4] are no less complex than existing transformer models; and [5] merely offers smaller versions of BERT, whose self-attention still scales quadratically. To make inference possible on weaker machines, one of the more appealing solutions is to distill the knowledge of a fine-tuned transformer model into a much simpler model, e.g., an LSTM. Is it possible? Tang et al.[6] show that they can improve a BiLSTM baseline with distillation and some data augmentation. Although their accuracy still lags behind that of the transformer models, it is a promising direction. ...
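The core recipe behind this kind of distillation is worth sketching: run the fine-tuned transformer (the teacher) over the training data to produce logits, then train a small BiLSTM (the student) against a mix of the gold labels and those teacher logits. Below is a minimal, hypothetical PyTorch sketch of such an objective; the MSE-on-logits term and the alpha weighting follow a common formulation (the one used by Tang et al.), not necessarily the exact setup in the post.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5):
    """Combine a hard-label cross-entropy term with a soft term that
    pushes the student's logits toward the teacher's.

    alpha=1.0 trains purely on the teacher's logits; alpha=0.0 is
    ordinary supervised training.
    """
    # Soft-target term: MSE between student and teacher logits
    # (Tang et al. regress on logits; KL divergence on temperature-softened
    # probabilities is a common alternative).
    soft_loss = F.mse_loss(student_logits, teacher_logits)
    # Hard-target term: standard cross-entropy on the gold labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss


# Usage sketch: `teacher` is the frozen fine-tuned transformer,
# `student` is a small BiLSTM classifier.
# with torch.no_grad():
#     teacher_logits = teacher(input_ids, attention_mask=mask).logits
# loss = distillation_loss(student(token_ids), teacher_logits, labels)
```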

June 16, 2020 · Ceshine Lee

Deploying EfficientNet Model using TorchServe

Introduction: AWS recently released TorchServe, an open-source model serving library for PyTorch. The production-readiness of TensorFlow has long been one of its competitive advantages, and TorchServe is the PyTorch community's response to that: it is supposed to be the PyTorch counterpart of TensorFlow Serving. So far, it seems to be off to a very strong start. This post from the AWS Machine Learning Blog and the TorchServe documentation should be more than enough to get you started. For advanced usage, however, the documentation is a bit chaotic and the example code sometimes suggests conflicting ways to do things. ...
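For orientation, most of the customization in TorchServe happens in a handler class that is packaged together with the model weights. The sketch below is my own illustration (not code from the post), assuming the built-in ImageClassifier handler as a base and an already-saved EfficientNet checkpoint; all file names are placeholders.

```python
# efficientnet_handler.py -- hypothetical custom handler for TorchServe.
import torch.nn.functional as F
from ts.torch_handler.image_classifier import ImageClassifier


class EfficientNetHandler(ImageClassifier):
    """Reuse the built-in image preprocessing but return the top-3
    class indices and probabilities as plain JSON-friendly dicts."""

    def postprocess(self, data):
        probs = F.softmax(data, dim=1)
        top_probs, top_idx = probs.topk(3, dim=1)
        # TorchServe expects one serializable item per request in the batch.
        return [
            {"classes": idx.tolist(), "probabilities": p.tolist()}
            for idx, p in zip(top_idx, top_probs)
        ]

# Packaged and served roughly like this (paths are placeholders):
#   torch-model-archiver --model-name efficientnet --version 1.0 \
#       --serialized-file efficientnet.pt --handler efficientnet_handler.py
#   torchserve --start --model-store model_store --models efficientnet.mar
```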

May 4, 2020 · Ceshine Lee

Fine-tuning BERT for Similarity Search

Synopsis: I have the task of finding similar entries among 8,000+ pieces of news, using their titles and edited short descriptions in Traditional Chinese. I tried LASER[1] first but later found that Universal Sentence Encoder[2] seemed to work slightly better. Results from these unsupervised approaches are already acceptable, but still show occasional confusion and hiccups. Not entirely satisfied with the unsupervised approaches, I collected and annotated 2,000 pairs of news and fine-tuned the BERT model on this dataset. This supervised approach is visibly better than the unsupervised ones, and it is also quite sample-efficient: three hundred and fifty training examples are already enough to beat Universal Sentence Encoder by a large margin. ...
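Whatever encoder produces the sentence vectors (LASER, Universal Sentence Encoder, or the fine-tuned BERT), the retrieval step itself is the same: embed every title-plus-description and look up nearest neighbours by cosine similarity. A small sketch, assuming a hypothetical encode() function that maps each news item to a fixed-size vector:

```python
import numpy as np

def top_k_similar(embeddings, k=5):
    """Given an (n, d) array of L2-normalized sentence embeddings,
    return, for each item, the indices of its k most similar items."""
    # Cosine similarity reduces to a dot product on normalized vectors.
    sims = embeddings @ embeddings.T
    np.fill_diagonal(sims, -np.inf)   # exclude self-matches
    return np.argsort(-sims, axis=1)[:, :k]

# Usage sketch (the `encode` function is assumed to come from the model):
# vecs = encode(["title + short description", ...])        # shape (n, d)
# vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
# neighbors = top_k_similar(vecs, k=5)
```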

November 28, 2019 · Ceshine Lee

Zero Shot Cross-Lingual Transfer with Multilingual BERT

Synopsis: Do you want multilingual sentence embeddings, but only have a training dataset in English? This post presents an experiment that fine-tuned a pre-trained multilingual BERT model ("BERT-Base, Multilingual Uncased"[1][2]) on the monolingual (English) AllNLI dataset[4] to create a sentence embedding model (one that maps a sentence to a fixed-size vector)[3]. The experiment shows that the fine-tuned multilingual BERT sentence embeddings generally perform better (i.e., have lower error rates) than the baselines on a multilingual similarity search task (the Tatoeba dataset[5]). However, the error rates are still significantly higher than those of specialized sentence embedding models trained on multilingual datasets[5]. ...
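A rough sketch of this kind of setup, using the sentence-transformers library: wrap multilingual BERT with mean pooling and fine-tune it on English NLI pairs with a softmax classification head, as in Sentence-BERT[3]. The model name, hyperparameters, and the toy training examples below are assumptions for illustration, not the exact configuration from the post.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, models, losses

# Multilingual BERT word embeddings + mean pooling -> one vector per sentence.
word_model = models.Transformer("bert-base-multilingual-uncased")
pooling = models.Pooling(word_model.get_word_embedding_dimension(),
                         pooling_mode_mean_tokens=True)
model = SentenceTransformer(modules=[word_model, pooling])

# English-only NLI pairs; in practice these come from the AllNLI dataset.
train_examples = [
    InputExample(texts=["A man is eating food.", "A man is eating."], label=0),
    InputExample(texts=["A man is eating food.", "A man is sleeping."], label=2),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Softmax classification head over the sentence embeddings, as in Sentence-BERT.
train_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,  # entailment / neutral / contradiction
)
model.fit(train_objectives=[(train_loader, train_loss)], epochs=1)

# The English-trained model can then encode sentences in other languages
# for the Tatoeba similarity search evaluation.
embeddings = model.encode(["Das ist ein Test.", "This is a test."])
```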

September 24, 2019 · Ceshine Lee

More Memory-Efficient Swish Activation Function

Update on 2020-08-22: using torch.cuda.max_memory_allocated() and torch.cuda.reset_peak_memory_stats() in newer versions (1.6+) of PyTorch is probably more accurate. (reference) Motivation: Recently I've been trying out EfficientNet models implemented in PyTorch. I've managed to successfully fine-tune pre-trained EfficientNet models on my dataset and reach accuracy on par with mainstream models like SE-ResNeXt-50. However, training the model from scratch has proven to be much harder. Fine-tuned EfficientNet models can reach the same accuracy with a much smaller number of parameters, but they seem to occupy much more GPU memory than they probably should (compared to the mainstream models). There is an open issue on the GitHub repository about this problem: [lukemelas/EfficientNet-PyTorch] Memory Issues. ...
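A large part of that overhead is commonly attributed to the Swish activation (x * sigmoid(x)) used throughout EfficientNet: a naive implementation keeps intermediate results alive for the backward pass at every layer. The approach the title refers to is a custom autograd Function that saves only the input and recomputes the sigmoid during backward. The sketch below follows the commonly used implementation (e.g., the one in lukemelas/EfficientNet-PyTorch), not necessarily the post's exact code.

```python
import torch

class SwishFunction(torch.autograd.Function):
    """Swish (x * sigmoid(x)) that saves only the input tensor and
    recomputes sigmoid(x) during backward, lowering peak activation memory."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * torch.sigmoid(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        sig = torch.sigmoid(x)
        # d/dx [x * sigmoid(x)] = sigmoid(x) * (1 + x * (1 - sigmoid(x)))
        return grad_output * (sig * (1 + x * (1 - sig)))


class MemoryEfficientSwish(torch.nn.Module):
    def forward(self, x):
        return SwishFunction.apply(x)


# Measuring the effect (PyTorch 1.6+), per the update note above:
# torch.cuda.reset_peak_memory_stats()
# ... run a forward/backward pass ...
# print(torch.cuda.max_memory_allocated())
```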

August 22, 2019 · Ceshine Lee