Using Julia to Do Whole Word Masking

In my last post, [Failure Report] Distill Fine-tuned Transformers into Recurrent Neural Networks, I tried to distill the knowledge of a fine-tuned BERT model into an LSTM or GRU model without any data augmentation and failed to achieve satisfactory results. In this follow-up work, I tried to replicate masking, the easiest-to-implement augmentation method used in [1], and see its effect. The masking described in [1] is called “whole word masking” [2], that is, masking the whole word instead of just masking a single word piece. ...
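The post walks through a Julia implementation; as a quick illustration of the same idea, here is a minimal Python sketch of whole word masking over BERT-style WordPiece tokens. The function name `whole_word_mask` and the masking probability are assumptions made for the example, not code from the post; the "##" prefix marks continuation pieces, as in BERT's tokenizer.

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Mask whole words: if a word is chosen, mask all of its word pieces.

    Assumes BERT-style WordPiece tokens, where continuation pieces start with "##".
    """
    # Group token indices into words: a new word starts at any piece
    # that does not carry the "##" continuation prefix.
    words = []
    for i, token in enumerate(tokens):
        if token.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    masked = list(tokens)
    for word in words:
        if random.random() < mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked

# Example: either the whole word "un ##believ ##able" is masked, or none of its pieces are.
print(whole_word_mask(["un", "##believ", "##able", "results"], mask_prob=0.5))
```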

June 28, 2020 · Ceshine Lee

[Failure Report] Distill Fine-tuned Transformers into Recurrent Neural Networks

Transformer models[1] have been taking over the NLP field since the advent of BERT[2]. However, the large number of parameters and the quadratically scaling self-attention, which is expensive in both computation and memory[3], make modern transformer models barely fit into a single consumer-grade GPU. Efforts have been made to alleviate this problem[3][4][5], but they are still far from ideal: no public models pre-trained on a BERT-scale corpus are available (at the time of writing)[3]; the complexity of the public models is no smaller than that of existing transformer models[4]; and they are just smaller versions of BERT, so the self-attention still scales quadratically[5]. To make inference possible on weaker machines, one of the better solutions is to distill the knowledge of a fine-tuned transformer model into a much simpler model, e.g., an LSTM model. Is it possible? Tang et al.[6] show that they can improve a BiLSTM baseline with distillation and some data augmentation. Although their accuracy still lags behind that of the transformer models, it is a promising direction. ...
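For context on what distillation means here, the following is a minimal PyTorch-style sketch of one common distillation objective: a temperature-scaled soft-target loss from the teacher blended with the usual hard-label cross-entropy. The function name, the temperature, and the blending weight `alpha` are illustrative assumptions, and the exact objective used in the cited papers may differ.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target loss (teacher) with the usual hard-label loss."""
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard cross-entropy against the gold labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Example with random tensors (batch of 4, 3 classes):
student = torch.randn(4, 3)
teacher = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
print(distillation_loss(student, teacher, labels))
```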

June 16, 2020 · Ceshine Lee

Deploying EfficientNet Model using TorchServe

AWS recently released TorchServe, an open-source model serving library for PyTorch. The production-readiness of Tensorflow has long been one of its competitive advantages, and TorchServe is the PyTorch community’s response to that. It is supposed to be the PyTorch counterpart of Tensorflow Serving, and so far it seems to be off to a very strong start. This post from the AWS Machine Learning Blog and the documentation of TorchServe should be more than enough to get you started. But for advanced usage, the documentation is a bit chaotic, and the example code sometimes suggests conflicting ways to do things. ...
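As a rough sketch of what a TorchServe deployment involves, here is a minimal custom handler built on `ts.torch_handler.base_handler.BaseHandler`. The class name `EfficientNetHandler` and the preprocessing choices (224x224 center crop, ImageNet normalization) are assumptions for illustration, not the exact handler from the post.

```python
# A minimal custom handler sketch for TorchServe. BaseHandler takes care of
# loading the serialized model and running inference; this class only defines
# how raw request bytes become a batch of tensors and how outputs are returned.
import io

import torch
from PIL import Image
from torchvision import transforms
from ts.torch_handler.base_handler import BaseHandler


class EfficientNetHandler(BaseHandler):
    """Classify images sent as raw bytes in the request body."""

    image_processing = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    def preprocess(self, data):
        images = []
        for row in data:
            image_bytes = row.get("data") or row.get("body")
            image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
            images.append(self.image_processing(image))
        return torch.stack(images)

    def postprocess(self, inference_output):
        # Return the top predicted class index for each image in the batch.
        return inference_output.argmax(dim=1).tolist()
```

The handler file would then be referenced through torch-model-archiver's --handler flag when packaging the model archive.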

May 4, 2020 · Ceshine Lee

Tensorflow Profiler with Custom Training Loop

The Tensorflow Profiler in the upcoming Tensorflow 2.2 release is a much-welcomed addition to the ecosystem. For image-related tasks, the bottleneck is often the input pipeline, but you also don’t want to spend time optimizing it unless it is necessary. The Tensorflow Profiler makes pinpointing the bottleneck of the training process much easier, so you can decide where to put the optimization effort. ...
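For reference, here is a minimal sketch of wiring the profiler into a custom training loop with the `tf.profiler.experimental` API; the toy model, dataset, and log directory are stand-ins chosen for the example.

```python
import tensorflow as tf

logdir = "logs/profile"  # arbitrary log directory for the captured profile

@tf.function
def train_step(model, optimizer, loss_fn, x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# Toy model and dataset, just to have something to profile.
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal((256, 32)),
     tf.random.uniform((256,), maxval=10, dtype=tf.int64))
).batch(32)

# Capture a profile for the whole loop; each step is a named trace event.
tf.profiler.experimental.start(logdir)
for step, (x, y) in enumerate(dataset):
    with tf.profiler.experimental.Trace("train", step_num=step, _r=1):
        train_step(model, optimizer, loss_fn, x, y)
tf.profiler.experimental.stop()
# Inspect the result with: tensorboard --logdir logs/profile
```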

April 24, 2020 · Ceshine Lee

Monitor Python Script Cron Jobs using Telegram

Apache Airflow is great for managing scheduled workflows, but in a lot of cases it is overkill and brings unnecessary complexity to the overall solution. Cron jobs are much easier to set up, have built-in support in most systems, and have a very flat learning curve. However, the lack of monitoring features and the consequent silent failures can be the bane of system admins’ lives. We want a simple solution that helps admins monitor the health of cron jobs in simple scenarios that do not warrant Airflow. These simple scenarios have the following characteristics: ...
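A minimal sketch of the kind of wrapper this suggests: a Python script that runs the cron job's command and reports the result to a Telegram chat through the Bot API's sendMessage endpoint. The bot token, chat id, and command-line interface are placeholders, not the post's actual implementation.

```python
"""Wrap a cron job command and report its outcome to a Telegram chat."""
import subprocess
import sys

import requests

BOT_TOKEN = "YOUR_BOT_TOKEN"   # placeholder: token from @BotFather
CHAT_ID = "YOUR_CHAT_ID"       # placeholder: target chat id


def notify(text: str) -> None:
    # Telegram Bot API: sendMessage takes a chat_id and the message text.
    requests.post(
        f"https://api.telegram.org/bot{BOT_TOKEN}/sendMessage",
        data={"chat_id": CHAT_ID, "text": text},
        timeout=10,
    )


if __name__ == "__main__":
    command = sys.argv[1:]  # the wrapped job, e.g. ["python", "backup.py"]
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        notify(f"Cron job failed ({' '.join(command)}):\n{result.stderr[-500:]}")
    else:
        notify(f"Cron job succeeded: {' '.join(command)}")
```

A crontab entry would then call the wrapper instead of the job directly, e.g. `0 3 * * * python monitor.py python backup.py`.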

April 10, 2020 · Ceshine Lee