Using Julia to Do Whole Word Masking

In my last post, [Failure Report] Distill Fine-tuned Transformers into Recurrent Neural Networks, I tried to distill the knowledge of a fine-tuned BERT model into an LSTM or GRU model without any data augmentation and failed to achieve satisfactory results. As a follow-up, I tried to replicate the easiest-to-implement augmentation method used in [1] — masking — and see its effect. The masking described in [1] is called “whole word masking” [2], that is, masking the whole word instead of just a single word piece. ...
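The post implements this in Julia, but the idea is easy to sketch in Python: group WordPiece tokens back into whole words (continuation pieces start with “##”) and decide masking per word rather than per piece. The snippet below is a rough illustration under those assumptions, not the post’s actual code.

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Mask entire words: a WordPiece continuation ("##...") is always
    masked together with the piece that starts the word."""
    # Group token indices into words: a new word starts whenever the
    # piece does not begin with "##".
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    masked = list(tokens)
    for word in words:
        if random.random() < mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked

# "unbelievable results" tokenizes to ["un", "##believ", "##able", "results"];
# with mask_prob=1.0 the whole word "unbelievable" is masked as one unit.
print(whole_word_mask(["un", "##believ", "##able", "results"], mask_prob=1.0))
```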

June 28, 2020 · Ceshine Lee

[Failure Report] Distill Fine-tuned Transformers into Recurrent Neural Networks

Transformer models[1] have been taking over the NLP field since the advent of BERT[2]. However, the high number of parameters and the quadratically scaling self-attention, which is expensive in both computation and memory[3], mean that modern transformer models barely fit into a single consumer-grade GPU. Efforts have been made to alleviate this problem[3][4][5], but they are still far from ideal: no public models pre-trained on a BERT-scale corpus at the time of writing[3]; model complexity no smaller than that of existing transformer models[4]; or merely smaller versions of BERT whose self-attention still scales quadratically[5]. To make inference possible on weaker machines, a more appealing solution is to distill the knowledge of a fine-tuned transformer model into a much simpler model, e.g., an LSTM model. Is it possible? Tang et al.[6] show that they can improve a BiLSTM baseline with distillation and some data augmentation. Although their accuracies still lag behind those of the transformer models, it is a promising direction. ...
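For context, the distillation objective in Tang et al. trains the student to match the teacher’s logits; a common variant softens both distributions with a temperature and matches them with a KL term. Below is a minimal PyTorch-style sketch of both, purely illustrative and not the experiments from the post.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Two common ways to make a student mimic a fine-tuned teacher."""
    # Logit matching: mean squared error on the raw logits.
    mse = F.mse_loss(student_logits, teacher_logits)
    # Soft-target matching: KL divergence between temperature-softened
    # distributions, scaled by T^2 to keep gradient magnitudes comparable.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return mse, kl

teacher = torch.randn(4, 2)  # fake teacher logits (batch of 4, 2 classes)
student = torch.randn(4, 2)  # fake student logits
print(distillation_loss(student, teacher))
```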

June 16, 2020 · Ceshine Lee

TensorFlow 2.1 with TPU in Practice

TensorFlow has become much easier to use: as an experienced PyTorch developer who only knew a bit of TensorFlow 1.x, I was able to pick up TensorFlow 2.x in my spare time within 60 days and do competitive machine learning. TPU has never been more accessible: the new TPU interface in TensorFlow 2.1 works right out of the box in most cases and greatly reduces the development time required to make a model TPU-compatible, and using TPUs drastically increases the iteration speed of experiments. We present a case study of solving a Q&A labeling problem by fine-tuning the RoBERTa-base model from the huggingface/transformers library: codebase, Colab TPU training notebook, Kaggle inference kernel, and the high-level library TF-HelperBot, which provides more flexibility than the Keras interface. (TensorFlow 2.1 and TPU are also a very good fit for CV applications; a case study of solving an image classification problem will be published in about a month.) Acknowledgment: I was granted free access to Cloud TPUs for 60 days via the TensorFlow Research Cloud. The access was granted for the TensorFlow 2.0 Question Answering competition; I chose to do the simpler Google QUEST Q&A Labeling competition first but unfortunately couldn’t find enough time to go back and do the original one (sorry!). ...
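The generic TPU bootstrapping in TensorFlow 2.1 looks roughly like the sketch below. The post itself wraps training in TF-HelperBot rather than plain Keras, so treat this as a minimal illustration; the toy model inside the strategy scope is an assumption, not the RoBERTa setup from the case study.

```python
import tensorflow as tf

# Discover and initialize the TPU (works out of the box on Colab/Kaggle
# when a TPU runtime is attached).
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

# Variables must be created inside the strategy scope so they are
# replicated across the TPU cores.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(2, activation="softmax"),  # placeholder model
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```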

February 13, 2020 · Ceshine Lee

Create a Customized Text Annotation Tool in Two Days - Part 2

In Part 1 of this series, we discussed why building your own annotation tool can be a good idea and demonstrated a back-end API server based on FastAPI. In this Part 2, we’re going to build a front-end interface that interacts with the end user (the annotator). The front-end mainly needs to do three things: fetch a batch of sentence/paragraph pairs to be annotated from the back-end server; present the pairs to the annotator and provide a way for them to adjust the automatically generated labels; and send the annotated results back to the back-end server. Disclaimer: I’m relatively inexperienced in front-end development, and the code here may seem amateurish to professionals. However, I hope this post can serve as a reference or starting point for those with similar requirements. ...
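The post builds this flow as a browser front-end, but the request/response cycle against the FastAPI back-end can be sketched in a few lines of Python. The endpoint names and payload fields below are hypothetical stand-ins, not the actual API from Part 1.

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed address of the FastAPI back-end

# 1. Fetch a batch of sentence/paragraph pairs to annotate
#    ("/pairs" and its fields are illustrative only).
batch = requests.get(f"{BASE_URL}/pairs", params={"limit": 10}).json()

# 2. The annotator reviews each pair and adjusts the pre-generated label;
#    here we simply echo the suggested label back.
annotations = [
    {"id": pair["id"], "label": pair["suggested_label"]} for pair in batch
]

# 3. Send the annotated results back to the server.
requests.post(f"{BASE_URL}/annotations", json=annotations)
```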

December 17, 2019 · Ceshine Lee

Create a Customized Text Annotation Tool in Two Days - Part 1

In my previous post, Fine-tuning BERT for Similarity Search, I mentioned that I annotated 2,000 sentence pairs, but did not describe how I did it or what tool I used. In this two-part series, we’ll see how I created a customized text annotation tool that greatly speeds up the annotation process. The entire stack was developed in two days. You can probably do it a lot faster if you are familiar with the technology (the actual time I spent on it was about six hours tops). ...

December 16, 2019 · Ceshine Lee