Posts

A Case Study of fastcore @patch_to

Motivation: I recently came across a new image data augmentation technique called SnapMix. It looks like a very sensible improvement over CutMix, so I was eager to give it a try. The SnapMix author provides a PyTorch implementation. I made some adjustments to improve its numerical stability and converted it to a PyTorch Lightning callback. I encountered one major obstacle during the process: SnapMix uses Class Activation Mapping (CAM) to calculate an augmented example’s label weights. This requires access to the final linear classifier’s weights and to the model activations before the pooling operation. Some pre-trained PyTorch CV models do implement methods to access these two things, but the naming is inconsistent. We need a unified API to do this. ...

[Paper] Rethinking Cooperative Rationalization: Introspective Extraction and Complement Control

Introduction: Model interpretability is crucial if we want to use AI models to make high-stakes decisions (e.g., making medical diagnoses or preventing suicides). In NLP, one common way to get interpretability is to extract information from trained models. For example, some use gradient-based input attribution techniques, some perturb the input to get explanations, and some use influence functions to find the training examples that most influenced the prediction for a particular input sequence. Another way is to make the model intrinsically explainable (e.g., a decision tree). ...

Reducing the SentencePiece Vocabulary Size of Pretrained NLP Models

Motivation: Q: Why and when would we want to trim down the vocabulary size of a pretrained model? A: When a large portion of the vocabulary isn’t used in your downstream task, it makes sense to remove the redundant part of the vocabulary to speed up the model. For example, mT5, Google’s multilingual version of T5, was pretrained on 101 languages. Imagine we only use English, Japanese, and Chinese in our downstream text generation task. We would waste a lot of time and space processing the rows in the embedding matrix and the LM head that correspond to tokens that never appear in the dataset. ...
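To make the saving concrete, here is a minimal plain-Python sketch of the idea rather than the post's actual code. The toy embedding matrix, the token IDs, and the helper name `trim_embedding` are all hypothetical. A real model would also need its tokenizer and LM head remapped with the same old-to-new ID mapping.

```python
def trim_embedding(embedding, used_token_ids):
    """Keep only the rows for tokens that appear in the downstream data.

    Returns the smaller matrix and a mapping from old token ID to new token ID.
    """
    kept = sorted(set(used_token_ids))
    old_to_new = {old: new for new, old in enumerate(kept)}
    trimmed = [embedding[old] for old in kept]
    return trimmed, old_to_new


# Toy 6-token vocabulary with 2-dimensional embeddings (made-up values).
embedding = [[float(i), float(i) * 10] for i in range(6)]
# Hypothetical token IDs observed in the downstream dataset.
used = [0, 2, 2, 5]

trimmed, old_to_new = trim_embedding(embedding, used)
```

In this toy case only three of the six rows survive, and `old_to_new` records how the remaining token IDs have to be renumbered.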

[Kaggle] Google Research Football 2020

(This post is an expansion of this Kaggle post.) My Solution: Thanks to Kaggle, Manchester City F.C., and Google Research for this fantastic competition. Working on this competition was the most fun I’ve had in a while. The tl;dr version of my solution is that I used an MLP model to stochastically imitate WeKick’s agents, with some rules to help it navigate unfamiliar waters. Why this Approach: After I got the GCP coupon, I looked at the competition timeline and thought that there was no way I could train a competitive RL agent from scratch in less than two weeks. I had to find some way to cut the training time. ...

[PyTorch Lightning] Log Training Losses when Accumulating Gradients

PyTorch Lightning reached 1.0.0 in October 2020. Until then, I wasn’t fully satisfied with the flexibility of its API, so I continued to use my pytorch-helper-bot. This has changed since the 1.0.0 release. Now I use PyTorch Lightning to develop training code that supports both single-GPU and multi-GPU training. However, one thing that bugged me was that the logging doesn’t work as expected when I set the number of gradient accumulation batches to more than one. The steps recorded in the training loop are still the raw step numbers, but those recorded during validation are divided by the number of gradient accumulation batches. The training loop gets flooded with warnings about inconsistent steps being recorded, and it becomes harder to compare the training and validation losses when they are not on the same step scale. ...
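The mismatch between the two step scales can be illustrated with a little arithmetic. This is a plain-Python sketch, not PyTorch Lightning code. The function name and the numbers are hypothetical, and it only assumes the two scales described above: raw batch steps versus raw steps divided by the number of accumulation batches.

```python
def to_optimizer_step(raw_step, accumulate_grad_batches):
    """Convert a raw batch count into the number of completed optimizer updates."""
    return raw_step // accumulate_grad_batches


accumulate = 4  # hypothetical accumulation setting
raw_steps = [4, 8, 12, 16]  # steps as counted per batch in the training loop
optimizer_steps = [to_optimizer_step(s, accumulate) for s in raw_steps]
```

Logging training losses against `optimizer_steps` instead of `raw_steps` would put them on the same scale as the validation losses.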