[Notes] Understanding XCiT - Part 1

XCiT: Cross-Covariance Image Transformers[1] is a paper from Facebook AI that proposes a “transposed” version of self-attention that operates across feature channels rather than tokens. This cross-covariance attention has linear complexity in the number of tokens (the original self-attention has quadratic complexity). When applied to images, as in vision transformers, the linear complexity allows the model to process higher-resolution images and to split them into smaller patches, both of which are shown to improve performance. ...
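The core idea fits in a few lines of PyTorch. Below is a minimal, single-head sketch of cross-covariance attention; the paper's actual block is multi-head with a learnable temperature, and the function name and shapes here are illustrative rather than taken from the official implementation:

```python
import torch
import torch.nn.functional as F

def xc_attention(q, k, v, temperature=1.0):
    # q, k, v: (batch, num_tokens, dim) -- single head, for clarity.
    # L2-normalize each feature channel across the tokens, as described in
    # the paper, so the cross-covariance matrix stays well-scaled.
    q_hat = F.normalize(q, dim=1)
    k_hat = F.normalize(k, dim=1)
    # (batch, dim, dim): attention is computed over channels,
    # so its size is independent of the number of tokens.
    attn = torch.softmax(k_hat.transpose(1, 2) @ q_hat / temperature, dim=-1)
    # Mix the channels of V with the attention map -> (batch, num_tokens, dim).
    return v @ attn

# The attention map for 4096 tokens is no larger than for 64 tokens.
q = k = v = torch.randn(2, 4096, 64)
print(xc_attention(q, k, v).shape)  # torch.Size([2, 4096, 64])
```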

July 24, 2021 · Ceshine Lee

How to Create a Documentation Website for Your Python Package

Sphinx is a tool that helps you create intelligent and beautiful documentation. I use it to generate documentation for the pytorch-lightning-spells project and publish it on readthedocs.io for free (hosting is free for open-source projects). Documentation is tremendously helpful to the users of your project (including yourself). As long as you maintain the good habit of writing docstrings in your code, Sphinx will convert those docstrings into web pages for you, drastically reducing the manual labor required. ...
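As a rough idea of how little configuration this takes, here is a minimal docs/conf.py sketch; the project name, author, and theme are placeholders, not the actual pytorch-lightning-spells configuration:

```python
# docs/conf.py -- minimal Sphinx configuration sketch (illustrative values).
project = "my_package"
author = "Your Name"

extensions = [
    "sphinx.ext.autodoc",   # pull API documentation straight out of docstrings
    "sphinx.ext.napoleon",  # parse Google/NumPy style docstrings
]

html_theme = "sphinx_rtd_theme"  # the familiar Read the Docs look
```

With this in place, an `.. automodule:: my_package` directive (with the `:members:` option) in an .rst page is enough for Sphinx to render the docstrings as web pages.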

June 13, 2021 · Ceshine Lee

Text Analysis using Julia

I tried to conduct some exploratory analysis on the title field of the “Shopee - Price Match Guarantee” dataset. I wanted to know how similar the titles are within the same group, so we can have a rough idea of how useful the field would be in determining whether two listings belong to the same group. I used StringDistances.jl for raw string analysis and WordTokenizers.jl for token analysis. Instead of using Jupyter Notebook, I used Pluto.jl to get reactive notebooks with a more presentable visual design right out of the box. The experience was a blast. Writing in Julia is not as hard as I expected, and the end result is very clean and blazing fast. ...
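The post itself works in Julia with StringDistances.jl and WordTokenizers.jl; purely to illustrate the within-group similarity question, here is a rough Python sketch of the same idea using the standard library's difflib (the sample titles below are made up, not taken from the dataset):

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical listing titles that belong to the same group.
titles = [
    "nike air zoom pegasus 38 running shoes",
    "Nike Air Zoom Pegasus 38 men running shoe",
    "air zoom pegasus38 original nike",
]

# Pairwise raw-string similarity within the group (0 = disjoint, 1 = identical).
for a, b in combinations(titles, 2):
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    print(f"{ratio:.2f}  {a!r} vs {b!r}")
```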

May 1, 2021 · Ceshine Lee

How to Reduce the Loading Time of Julia Scripts

Julia is a promising new language for scientific computing and data science. I’ve demonstrated in this post that doing whole word masking in Julia can be a lot faster (up to 100x) than in Python. The secret of Julia’s speed lies in its use of a JIT compiler (rather than the interpreters used by R and Python). However, this design also impedes Julia’s ambition to be a general-purpose language, since ten seconds of precompiling time for a simple script is unacceptable for most use cases. ...

April 18, 2021 · Ceshine Lee

[Notes] Gradient Checkpointing with BERT

Gradient checkpointing is a technique that reduces the memory footprint during model training (from O(n) to O(sqrt(n)) in the OpenAI example, n being the number of layers). The price is some computational overhead (extra forward passes over the same input). This post by Yaroslav Bulatov of OpenAI explains the mechanism behind it very well. In many cases, what consumes the most memory is not the model itself but the intermediate activations and their gradients, as this set of slides by Sylvain Gugger shows. Gradient checkpointing replaces the intermediate activations with checkpoints (the checkpoints split the model into chunks) and recreates the activations between checkpoints by running another forward pass within that chunk. Every activation is computed at most twice (once in the last chunk, twice in the others). During the backward pass, we only need to keep the checkpoints (themselves a set of activations) and the activations of the currently active chunk in memory. For example, with a 100-layer model whose layers are equal in size and a checkpoint every 10 layers (9 checkpoints, at layers 10, 20, …, 90), the memory consumed by activations and their gradients is (9+10)k, compared with 100k without checkpointing (k being the memory used by one layer's activations). ...
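This chunking scheme maps directly onto torch.utils.checkpoint in PyTorch. Here is a minimal sketch on a toy 100-layer stack, which is only a stand-in for the post's BERT setup:

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint_sequential

# Toy 100-layer stack standing in for a deep encoder (illustrative only).
layers = nn.Sequential(
    *[nn.Sequential(nn.Linear(256, 256), nn.ReLU()) for _ in range(100)]
)

x = torch.randn(32, 256, requires_grad=True)

# Split the stack into 10 segments: only the segment boundaries ("checkpoints")
# keep their activations during the forward pass; each segment is re-run
# during backprop to recreate the activations it needs.
out = checkpoint_sequential(layers, 10, x)
out.sum().backward()
```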

April 4, 2021 · Ceshine Lee