[Notes] Uncovering the Hidden Preprocessing Logic of ColPali

(Cover image generated by Nano Banana Pro.) Introduction: I recently came across a course called “Multi-Vector Image Retrieval” by DeepLearning.ai. The course mainly introduces ColPali [1], a vision-language model that generalizes the late-interaction retrieval paradigm pioneered by ColBERT [2] from text tokens alone to both text and visual tokens. It also contains a few tutorials on performance optimization techniques using Qdrant’s Python SDK. It is a great introductory resource, and I recommend it to anyone interested in visual document understanding and retrieval. ...
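To make the late-interaction idea concrete, here is a minimal sketch of MaxSim scoring, the mechanism ColBERT introduced and ColPali extends to image-patch tokens; the embedding dimensions and random vectors below are purely illustrative, not ColPali's actual shapes.

```python
# A minimal sketch of late-interaction (MaxSim) scoring; shapes and random
# embeddings are illustrative only.
import torch
import torch.nn.functional as F

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """query_emb: (n_query_tokens, dim); doc_emb: (n_doc_tokens, dim).

    Each query token keeps its best similarity over all document tokens
    (text tokens for ColBERT, image-patch tokens for ColPali), and the
    per-token maxima are summed into a single relevance score.
    """
    sim = query_emb @ doc_emb.T              # (n_query_tokens, n_doc_tokens)
    return sim.max(dim=1).values.sum()

# Toy usage with L2-normalized random vectors standing in for real embeddings.
query = F.normalize(torch.randn(8, 128), dim=-1)
page = F.normalize(torch.randn(1030, 128), dim=-1)
print(maxsim_score(query, page))
```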

January 22, 2026 · Ceshine Lee

Speed Up Your Python Scripts with Rust: A Levenshtein Distance Case Study

(Cover image generated by Nano Banana.) Disclaimer: a 50x speedup is not guaranteed; actual performance depends on the nature of the dataset and the hardware the code runs on. Please refer to the Benchmarks section below for more information. Introduction: Recently, I finally found some time to learn the Rust programming language. I find its memory-safety guarantees quite elegant, although they come with the trade-off of a steep learning curve, especially around Rust’s ownership and lifetime system. It is very appealing to someone like me, who mainly uses a scripting language and writes low-level code only from time to time; in such circumstances, writing C/C++ code can easily lead to unstable runtime behavior or unexpected results. ...
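For context, a pure-Python Levenshtein distance (two-row Wagner–Fischer dynamic programming) looks roughly like the sketch below; it is the kind of baseline a Rust extension would be benchmarked against, not the post's exact implementation or its Rust counterpart.

```python
# An illustrative pure-Python Levenshtein distance (two-row Wagner-Fischer DP);
# a sketch of the kind of baseline a Rust extension would be compared with.
def levenshtein(a: str, b: str) -> int:
    if len(a) < len(b):
        a, b = b, a                      # keep the inner loop over the shorter string
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```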

December 21, 2025 · Ceshine Lee

Text Analysis using Julia

Overview: I tried to conduct some exploratory analysis on the title field of the “Shopee - Price Match Guarantee” dataset. I wanted to know how similar the titles are within the same group, to get a rough idea of how useful the field would be in determining whether two listings belong to the same group. I used StringDistances.jl for raw string analysis and WordTokenizers.jl for token analysis. Instead of Jupyter Notebook, I used Pluto.jl to get reactive notebooks with a more presentable visual design right out of the box. The experience was a blast. Writing in Julia is not as hard as I expected, and the end result is very clean and blazing fast. ...
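As a rough illustration of the within-group similarity idea (the post itself does this in Julia with StringDistances.jl), the Python sketch below averages pairwise similarity over a group of titles using the standard library's difflib; the titles are made up for illustration.

```python
# A rough Python analogue of the within-group title-similarity check; the
# group of titles below is invented for illustration.
from difflib import SequenceMatcher
from itertools import combinations

group_titles = [
    "Wireless Bluetooth Earbuds with Charging Case",
    "Bluetooth Wireless Earbuds + Free Charging Case",
    "TWS Bluetooth 5.0 Earbuds, Waterproof",
]

# Average pairwise similarity (1.0 means identical strings) within the group.
pairs = list(combinations(group_titles, 2))
avg_similarity = sum(
    SequenceMatcher(None, a, b).ratio() for a, b in pairs
) / len(pairs)
print(f"average pairwise similarity within the group: {avg_similarity:.2f}")
```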

May 1, 2021 · Ceshine Lee

[Notes] Gradient Checkpointing with BERT

Overview: Gradient checkpointing is a technique that reduces the memory footprint during model training (from O(n) to O(sqrt(n)) in the OpenAI example, n being the number of layers). The price is some computational overhead (extra forward passes over the same inputs). This post by Yaroslav Bulatov of OpenAI explains the mechanism behind it very well. (Figure source: Gugger’s slides.) In many cases, what consumes the most memory is not the model itself but the intermediate activations and their gradients, as this set of slides by Sylvain Gugger shows. Gradient checkpointing replaces the intermediate activations with checkpoints (the checkpoints split the model into chunks) and recreates the activations between checkpoints by running another forward pass within that chunk. Every activation is computed at most twice (once in the last chunk, twice in the others). During the backward pass, we only need to keep the checkpoints (themselves a set of activations) and the activations of the active chunk in memory. If we use a model with 100 layers of equal size and place a checkpoint every 10 layers (9 checkpoints, at layers 10, 20, …, 90), the memory consumed by activations and their gradients is roughly (9+10)k, compared to 100k without checkpointing (k being a per-layer constant). ...
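The chunk-and-recompute scheme described above can be sketched with PyTorch's checkpoint_sequential; the toy 100-layer stack below mirrors the arithmetic in the excerpt and is not the post's actual BERT setup.

```python
# A toy sketch of the chunking scheme: only chunk-boundary activations
# ("checkpoints") are stored, and activations inside each chunk are
# recomputed during the backward pass.
import torch
from torch.utils.checkpoint import checkpoint_sequential

layers = torch.nn.Sequential(*[torch.nn.Linear(256, 256) for _ in range(100)])
x = torch.randn(32, 256, requires_grad=True)

out = checkpoint_sequential(layers, 10, x)  # 10 chunks of 10 layers each
out.sum().backward()                        # extra forward passes happen here
```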

April 4, 2021 · Ceshine Lee

[Paper] Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

Motivation: The Adafactor optimizer, in my experience, can provide much better convergence when fine-tuning the T5 v1.1 and mT5 [1] pre-trained models. However, I encountered problems when using a custom learning rate scheduler with the Adafactor implementation from the huggingface/transformers library. I combed through the paper and the source code to find and fix the cause of the problem, which turned into a tiny contribution to the library. To further squeeze value from the time I’ve invested, I wrote this post to introduce the key ideas of the Adafactor optimizer and analyze the corresponding chunk of code in the huggingface/transformers implementation (which was taken from the fairseq library). Working examples as Kaggle notebooks are also provided: T5 v1.1 and mT5. ...
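As a hedged sketch (not the post's notebooks), the snippet below shows the commonly documented way to pair the transformers Adafactor with an external learning-rate scheduler: disable its built-in relative-step schedule and pass an explicit lr. The tiny model, lr value, and scheduler choice are placeholders.

```python
# Pairing the transformers Adafactor with an external LR scheduler; the model,
# learning rate, and scheduler below are illustrative placeholders.
import torch
from transformers.optimization import Adafactor

model = torch.nn.Linear(128, 2)  # stand-in for a T5 v1.1 / mT5 model
optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,
    relative_step=False,    # let the external scheduler control the LR
    scale_parameter=False,
    warmup_init=False,
)
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.1, total_iters=100
)

for step in range(100):
    loss = model(torch.randn(4, 128)).pow(2).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```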

March 18, 2021 · Ceshine Lee