Posts

How to Reduce the Loading Time of Julia Scripts

Photo Credit Motivation Julia is a promising new language for scientific computing and data science. I’ve demonstrated that doing whole work masking in Julia can be a lot faster (up to 100x) than in Python in this post. The secret of Julia’s speed is from its use of JIT compilers (rather than interpreters used by R and Python). However, this design also impedes Julia’s ambition as a general-purpose language since ten seconds of precompiling time for a simple script is unacceptable for most use cases. ...

[Notes] Gradient Checkpointing with BERT

Photo Credit Overview Gradient checkpointing is a technique that reduces the memory footprint during model training (From O(n) to O(sqrt(n)) in the OpenAI example, n being the number of layers). The price is some computing overhead (multiple forward-pass on the same input). This post by Yaroslav Bulatov of OpenAI explains the mechanism behind it very well. Source: Gugger’s slides In many cases, what consumes the most memory is not the model itself but the intermediate activations and gradients of them, as this set of slides by Sylvain Gugger shows. Gradient checkpointing replaces the intermediate activations with checkpoints (the model is split into chunks by checkpoints) and recreates the activations between checkpoints by running another forward-pass in this chunk. Every activation is computed at most twice (once in the last chunk, twice in others). We only need to store the checkpoints (also a set of activations) and the activations of the active chunk in the memory during the backward-pass. If we’re using a model with n layers of equal size and we put a checkpoint every 10 layers (9 checkpoints, at layer 10, 20, …, 90.), memory consumption from activations and gradients of them is (9+10)kn comparing to 100kn without checkpointing (k is a constant). ...

[Paper] Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

Photo Credit Motivation The Adafactor optimizer, in my experience, can provide much better convergence when fine-tuning the T5 v1.1 and mT5[1] pre-trained models. However, I encountered problems when using a custom learning rate scheduler with the Adafactor implementation from the huggingface/transformer library. I combed through the paper and the source code to find and fix the cause of the problem, which turned into a tiny contribution to the library. To further squeeze value from the time I’ve invested, I wrote this post to introduce the key ideas of the Adafactor optimizer and analyze the corresponding chunk of code in the huggingface/transformer implementation (which was taken from the fairseq library). Working examples as Kaggle notebooks are also provided: T5 v1.1 and mT5. ...

Mistake I Made that Crippled My Streamlit App

Photo Credit Streamlit an increasingly popular tool that allows Python developers to turn data scripts into interactive web applications in a few lines of code. I recently developed and deployed a semantic search app for news articles in Chinese, and I made a mistake not caching the model loading code. The performance was abysmal, and the memory footprint was huge for a TinyBERT-4L model (had to allocate 1GB of memory for the app). ...

A Case Study of fastcore @patch_to

Photo Credit Motivation I recently came across this new image data augmentation technique called SnapMix. It looks like a very sensible improvement over CutMix, so I was eager to give it a try. The SnapMix author provides a PyTorch implementation. I made some adjustments to improve the numeric stability and converted it to a callback in PyTorch Lightning. I encountered one major obstacle during the process — SnapMix uses Class Activation Mapping(CAM) to calculate an augmented example’s label weights. It requires access to the final linear classifier’s weight and the model activations before the pooling operation. Some PyTorch pre-trained CV models do implement methods to access these two things, but the namings are inconsistent. We need a unified API to do this. ...