[Notes] Understanding XCiT - Part 1

XCiT: Cross-Covariance Image Transformers[1] is a paper from Facebook AI that proposes a “transposed” version of self-attention that operates across feature channels rather than tokens. This cross-covariance attention has linear complexity in the number of tokens (the original self-attention has quadratic complexity). When applied to images, as in vision transformers, this linear complexity allows the model to process higher-resolution images and to split the images into smaller patches, both of which are shown to improve performance. ...
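Here is a minimal, single-head PyTorch sketch of the idea (my own reconstruction from the description above, not the paper's code; the multi-head split, learnable temperature, and projection layers are omitted):

```python
import torch
import torch.nn.functional as F

def cross_covariance_attention(q, k, v, temperature=1.0):
    # q, k, v: (batch, n_tokens, n_channels)
    # L2-normalize queries and keys along the token dimension so the
    # channel-by-channel dot products stay bounded.
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    # "Transposed" attention map over channels: (batch, channels, channels)
    attn = (k.transpose(1, 2) @ q) / temperature
    attn = attn.softmax(dim=-1)
    # Mix the value channels: output is (batch, n_tokens, n_channels)
    return v @ attn
```

For N tokens and C channels, the attention map costs O(N·C²) rather than O(N²·C), which is why the complexity is linear in the number of tokens.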

July 24, 2021 · Ceshine Lee

[Notes] Gradient Checkpointing with BERT

Gradient checkpointing is a technique that reduces the memory footprint during model training (from O(n) to O(sqrt(n)) in the OpenAI example, n being the number of layers). The price is some computational overhead (multiple forward passes over the same input). This post by Yaroslav Bulatov of OpenAI explains the mechanism behind it very well. In many cases, what consumes the most memory is not the model itself but the intermediate activations and their gradients, as this set of slides by Sylvain Gugger shows. Gradient checkpointing replaces the intermediate activations with checkpoints (the checkpoints split the model into chunks) and recreates the activations between checkpoints by running another forward pass over that chunk. Every activation is computed at most twice (once in the last chunk, twice in the others). During the backward pass, we only need to keep the checkpoints (themselves a set of activations) and the activations of the active chunk in memory. For example, with a 100-layer model of equally sized layers and a checkpoint every 10 layers (9 checkpoints, at layers 10, 20, …, 90), the memory consumed by activations and their gradients is (9+10)k, compared with 100k without checkpointing (k being the per-layer cost). ...
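A minimal PyTorch sketch of the mechanism (the toy model and sizes are illustrative, not from the post): `torch.utils.checkpoint.checkpoint_sequential` splits a sequential model into segments, stores only the segment boundaries during the forward pass, and recomputes the inner activations during the backward pass.

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# A toy 100-layer model; with 10 segments, each 10-layer chunk sits
# between two checkpoints, mirroring the arithmetic above.
model = torch.nn.Sequential(*[torch.nn.Linear(256, 256) for _ in range(100)])
x = torch.randn(32, 256, requires_grad=True)

# Only the segment-boundary activations are kept; everything inside
# a segment is recomputed by an extra forward pass during backward.
out = checkpoint_sequential(model, 10, x)
out.sum().backward()
```

huggingface's BERT exposes the same mechanism through the `gradient_checkpointing` option in the model config (the API available at the time of this post).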

April 4, 2021 · Ceshine Lee

[Paper] Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

The Adafactor optimizer, in my experience, can provide much better convergence when fine-tuning the T5 v1.1 and mT5[1] pre-trained models. However, I encountered problems when using a custom learning rate scheduler with the Adafactor implementation from the huggingface/transformers library. I combed through the paper and the source code to find and fix the cause of the problem, which turned into a tiny contribution to the library. To further squeeze value from the time I’ve invested, I wrote this post to introduce the key ideas of the Adafactor optimizer and analyze the corresponding chunk of code in the huggingface/transformers implementation (which was taken from the fairseq library). Working examples as Kaggle notebooks are also provided: T5 v1.1 and mT5. ...
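As a rough sketch of what pairing Adafactor with an external scheduler looks like (the stand-in model and warmup length are mine, not from the post), the built-in relative-step schedule and parameter scaling have to be switched off so that the scheduler actually controls the learning rate:

```python
import torch
from transformers import Adafactor
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(10, 2)  # stand-in for a T5 v1.1 / mT5 model

# With relative_step=True (the default), Adafactor derives the learning
# rate from its internal step count and ignores an external scheduler.
optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,                # explicit base LR, required when relative_step=False
    relative_step=False,    # disable the built-in step-based schedule
    scale_parameter=False,  # don't rescale the LR by each parameter's RMS
    warmup_init=False,      # only meaningful when relative_step=True
)
# Example custom scheduler: linear warmup over the first 1,000 steps.
scheduler = LambdaLR(optimizer, lambda step: min(1.0, (step + 1) / 1000))
```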

March 18, 2021 · Ceshine Lee

Mistake I Made that Crippled My Streamlit App

Streamlit is an increasingly popular tool that allows Python developers to turn data scripts into interactive web applications with just a few lines of code. I recently developed and deployed a semantic search app for news articles in Chinese, and I made the mistake of not caching the model-loading code. The performance was abysmal, and the memory footprint was huge for a TinyBERT-4L model (I had to allocate 1GB of memory for the app). ...
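The fix is to wrap the loading code in Streamlit's caching decorator so the model is loaded once per server process instead of on every script rerun. A minimal sketch using the `st.cache` API available at the time (the checkpoint name is illustrative):

```python
import streamlit as st
from transformers import AutoModel, AutoTokenizer

# allow_output_mutation=True skips hashing the returned (unhashable)
# model object; without caching, this function would run again on
# every widget interaction.
@st.cache(allow_output_mutation=True)
def get_model():
    name = "huawei-noah/TinyBERT_General_4L_312D"  # illustrative checkpoint
    return AutoTokenizer.from_pretrained(name), AutoModel.from_pretrained(name)

tokenizer, model = get_model()
```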

March 14, 2021 · Ceshine Lee

A Case Study of fastcore @patch_to

I recently came across a new image data augmentation technique called SnapMix. It looks like a very sensible improvement over CutMix, so I was eager to give it a try. The SnapMix author provides a PyTorch implementation. I made some adjustments to improve the numeric stability and converted it into a callback in PyTorch Lightning. I encountered one major obstacle during the process: SnapMix uses Class Activation Mapping (CAM) to calculate an augmented example’s label weights, which requires access to the final linear classifier’s weights and to the model activations before the pooling operation. Some PyTorch pre-trained CV models do implement methods to access these two things, but the naming is inconsistent. We need a unified API to do this. ...
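This is exactly the kind of gap `@patch_to` can fill: it attaches a function to an existing class as a method, without subclassing or editing the class’s source. A minimal sketch (the accessor name `get_classifier` is hypothetical, not the post's actual API):

```python
from fastcore.basics import patch_to
from torchvision.models import ResNet, resnet18

# Attach a unified accessor to torchvision's ResNet after the fact;
# other model families would get their own patched accessor under the
# same name, giving callers a consistent API.
@patch_to(ResNet)
def get_classifier(self):
    return self.fc  # torchvision ResNets call their final linear layer `fc`

model = resnet18()
print(model.get_classifier())  # a regular bound method on every ResNet
```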

February 19, 2021 · Ceshine Lee