Pytorch

Zero Shot Cross-Lingual Transfer with Multilingual BERT

Photo Credit Synopsis Do you want multilingual sentence embeddings, but only have a training dataset in English? This post presents an experiment that fine-tuned a pre-trained multilingual BERT model(“BERT-Base, Multilingual Uncased” [1][2]) on monolingual(English) AllNLI dataset[4] to create sentence embeddings model(that maps a sentence to a fixed-size vector)[3]. The experiment shows that the fine-tuned multilingual BERT sentence embeddings have generally better performance (i.e. lower error rates) over baselines in a multilingual similarity search task (Tatoeba dataset[5]). However, the error rates are still significantly higher than the ones from specialized sentence embedding models trained with multilingual datasets[5]. ...

More Memory-Efficient Swish Activation Function

Photo Credit Update on 2020-08-22: using torch.cuda.max_memory_allocated() and torch.cuda.reset_peak_memory_stats() in the newer version (1.6+) of PyTorch is probably more accurate. (reference) Motivation Recently I’ve been trying out EfficientNet models implemented in PyTorch. I’ve managed to successfully fine-tune pretrained EfficientNet models on my data set and reach accuracy on par with the mainstream ones like SE-ResNeXt-50. However, training the model from scratch has proven to be much harder. Fine-tuned EfficientNet models can reach the same accuracy with much smaller number of parameters, but they seem to occupy a lot of GPU memory than it probably should (comparing to the mainstream ones). There is an open issue on the Github Repository about this problem — [lukemelas/EfficientNet-PyTorch] Memory Issues. ...

Smaller Docker Image using Multi-Stage Build

Photo Credit Why Use Mutli-Stage Build? Starting from Docker 17.05, users can utilize this new “multi-stage build” feature [1] to simplify their workflow and make the final Docker images smaller. It basically streamlines the “Builder pattern”, which means using a “builder” image to build the binary files, and copying those binary files to another runtime/production image. Despite being an interpreted programming language, many of Python libraries, especially the ones doing scientific computing and machine learning, are built upon pieces written in compiled languages (mostly C/C++). Therefore, the “Builder pattern” can still be applied. ...

Mixed Precision Training on Tesla T4 and P100

Photo Credit tl;dr: the power of Tensor Cores is real. Also, make sure the CPU does not become the bottleneck. Motivation I’ve written about Apex in this previous post: Use NVIDIA Apex for Easy Mixed Precision Training in PyTorch. At that time I only have my GTX 1070 to experiment on. And as we’ve learned in that post, pre-Volta nVidia cards does not benefit from half-precision arithmetic in terms of speed. It only saves some GPU memory. Therefore, I wasn’t able to personally evaluate how much speed boost we can get from mixed precision with Tensor Cores. ...

Use NVIDIA Apex for Easy Mixed Precision Training in PyTorch

Photo by Sam Power on Unsplash The Apex project from NVIDIA is touted as a PyTorch extension that let developers do mixed precision and distributed training “with 4 or fewer line changes to the existing code”. It’s been out for a while (circa June 2018) and seems to be well received (huggingface/pytorch-pretrained-BERT uses Apex to do 16-bit training). So I decided to give it a try. This post documents what I’ve learned. ...