More Memory-Efficient Swish Activation Function

Update on 2020-08-22: using torch.cuda.max_memory_allocated() and torch.cuda.reset_peak_memory_stats() in newer versions (1.6+) of PyTorch is probably more accurate. (reference) Motivation Recently I’ve been trying out EfficientNet models implemented in PyTorch. I’ve managed to successfully fine-tune pretrained EfficientNet models on my data set and reach accuracy on par with mainstream models like SE-ResNeXt-50. However, training the model from scratch has proven to be much harder. Fine-tuned EfficientNet models can reach the same accuracy with a much smaller number of parameters, but they seem to occupy a lot more GPU memory than they probably should (compared to the mainstream ones). There is an open issue on the GitHub repository about this problem — [lukemelas/EfficientNet-PyTorch] Memory Issues. ...
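The idea behind the post’s title can be sketched roughly as follows (a minimal illustration only, assuming a custom autograd function that recomputes the sigmoid in the backward pass; the post’s actual implementation may differ), together with the peak-memory utilities mentioned in the update:

```python
import torch

class MemoryEfficientSwish(torch.autograd.Function):
    """Swish (x * sigmoid(x)) that saves only the input tensor for backward."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * torch.sigmoid(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        sx = torch.sigmoid(x)  # recomputed here instead of being stored in forward
        return grad_output * (sx * (1 + x * (1 - sx)))

# Peak GPU memory measurement on PyTorch 1.6+, per the update above
torch.cuda.reset_peak_memory_stats()
out = MemoryEfficientSwish.apply(torch.randn(1024, 1024, device="cuda", requires_grad=True))
out.sum().backward()
print(torch.cuda.max_memory_allocated())
```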

August 22, 2019 · Ceshine Lee

Smaller Docker Image using Multi-Stage Build

Why Use Multi-Stage Builds? Starting with Docker 17.05, users can utilize the new “multi-stage build” feature [1] to simplify their workflow and make the final Docker images smaller. It essentially streamlines the “builder pattern”: using a “builder” image to build the binary files, then copying those binaries into another runtime/production image. Although Python is an interpreted language, many Python libraries, especially those for scientific computing and machine learning, are built on components written in compiled languages (mostly C/C++). Therefore, the “builder pattern” can still be applied. ...

June 21, 2019 · Ceshine Lee

Mixed Precision Training on Tesla T4 and P100

tl;dr: the power of Tensor Cores is real. Also, make sure the CPU does not become the bottleneck. Motivation I’ve written about Apex in this previous post: Use NVIDIA Apex for Easy Mixed Precision Training in PyTorch. At that time I only had my GTX 1070 to experiment on, and as we learned in that post, pre-Volta NVIDIA cards do not benefit from half-precision arithmetic in terms of speed; it only saves some GPU memory. Therefore, I wasn’t able to personally evaluate how much of a speed boost we can get from mixed precision with Tensor Cores. ...

June 13, 2019 · Ceshine Lee

Use NVIDIA Apex for Easy Mixed Precision Training in PyTorch

Photo by Sam Power on Unsplash. The Apex project from NVIDIA is touted as a PyTorch extension that lets developers do mixed precision and distributed training “with 4 or fewer line changes to the existing code”. It has been out for a while (since around June 2018) and seems to be well received (huggingface/pytorch-pretrained-BERT uses Apex to do 16-bit training). So I decided to give it a try. This post documents what I’ve learned. ...
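For reference, the usage pattern described in the Apex documentation at the time looked roughly like the sketch below (a toy model purely for illustration, not code from the post):

```python
import torch
from torch import nn
from apex import amp  # requires NVIDIA Apex and a CUDA-capable GPU

model = nn.Linear(16, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

# One of the "few line changes": wrap the model and optimizer for mixed precision
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

inputs = torch.randn(8, 16, device="cuda")
targets = torch.randn(8, 1, device="cuda")

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
# Scale the loss so FP16 gradients do not underflow
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```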

March 26, 2019 · Ceshine Lee

Multilingual Similarity Search Using Pretrained Bidirectional LSTM Encoder

Photo by Steven Wei on Unsplash. Introduction Previously I’ve demonstrated how to use a pretrained BERT model to create a similarity measure between two documents in this post: News Topic Similarity Measure using Pretrained BERT Model. However, to find similar entries for N documents in a corpus A of size M, we need to run N×M feed-forward passes. A more efficient and widely used method is to use neural networks to generate sentence/document embeddings, and calculate cosine similarity scores between these embeddings. ...
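A minimal sketch of the embedding-based approach (the encoder below is a hypothetical stand-in for any model, such as the pretrained bidirectional LSTM encoder from the post, that maps documents to fixed-size vectors):

```python
import torch
import torch.nn.functional as F

def embed(encoder, batch):
    """Encode a batch of documents into L2-normalized embedding vectors."""
    with torch.no_grad():
        return F.normalize(encoder(batch), dim=-1)

# Toy encoder over pre-tokenized, fixed-length inputs; purely illustrative.
encoder = torch.nn.Sequential(
    torch.nn.Embedding(1000, 64),
    torch.nn.Flatten(),
    torch.nn.Linear(64 * 20, 128),
)
queries = torch.randint(0, 1000, (5, 20))   # N = 5 query documents, 20 tokens each
corpus = torch.randint(0, 1000, (100, 20))  # M = 100 corpus documents

query_emb = embed(encoder, queries)         # (N, d), computed once
corpus_emb = embed(encoder, corpus)         # (M, d), computed once
scores = query_emb @ corpus_emb.T           # (N, M) cosine similarity matrix
top5 = scores.topk(5, dim=-1).indices       # 5 most similar corpus entries per query
```

Because each document is encoded once rather than once per pair, the cost grows with N + M encoder passes instead of N×M.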

February 15, 2019 · Ceshine Lee