[Paper] Language-agnostic BERT Sentence Embedding

This post on the Google AI Blog explains the premise, background, and related works of this paper pretty well. I'm not going to repeat them in this post. Instead, I'll try to fill in some of the gaps I see as someone who is familiar with this topic but does not follow the latest developments very closely. Firstly, I want to point out something in the Google AI post that confuses me. In the first paragraph, the authors stated: ...
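For context, LaBSE produces sentence embeddings whose similarities are comparable across languages. A minimal sketch of what that buys you, assuming the community sentence-transformers port of the released checkpoint (the original model was published on TF Hub):

```python
from sentence_transformers import SentenceTransformer, util

# "sentence-transformers/LaBSE" is the community port of the released
# checkpoint; the paper's original model was shipped via TF Hub.
model = SentenceTransformer("sentence-transformers/LaBSE")

sentences = [
    "The cat sits on the mat.",       # English
    "Die Katze sitzt auf der Matte.", # German translation
    "I like to eat apples.",          # unrelated English sentence
]
embeddings = model.encode(sentences, convert_to_tensor=True, normalize_embeddings=True)

# A translation pair should score much higher than an unrelated pair.
print(util.cos_sim(embeddings[0], embeddings[1]))
print(util.cos_sim(embeddings[0], embeddings[2]))
```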

August 19, 2020 · Ceshine Lee

[Competition] Jigsaw Multilingual Toxic Comment Classification

Jigsaw Multilingual Toxic Comment Classification is the third Jigsaw toxic comment classification competition hosted on Kaggle. I've covered both the first one in 2018 and the second one in 2019 on this blog. This time, Kagglers were asked to use English training corpora to create multilingual toxic comment classifiers that are tested on six other languages. I've been taking a break from Kaggle during the COVID pandemic, so I did not participate in this year's competition. However, reading top solutions is always very helpful whether you participated or not, and that is exactly what I'm doing in this post. Due to time limitations, I will only cover a small part of the shared solutions. I'll update the post if I find other interesting things later. ...
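At its core, the task is zero-shot cross-lingual transfer: fine-tune a multilingual encoder on English labels, then score comments in other languages with the same weights. A rough sketch, assuming XLM-RoBERTa as the backbone (the solutions covered in the post are considerably more elaborate):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# A multilingual encoder fine-tuned on English data can still score
# comments in other languages, thanks to its shared multilingual pre-training.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=2)

# Fine-tune on English comments only ...
english_batch = tokenizer(["you are an idiot", "have a nice day"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])
loss = model(**english_batch, labels=labels).loss
loss.backward()  # one illustrative step; real training uses a full optimizer loop

# ... then evaluate on non-English comments with the same weights (zero-shot transfer).
spanish_batch = tokenizer(["que tengas un buen día"], padding=True, return_tensors="pt")
with torch.no_grad():
    probs = model(**spanish_batch).logits.softmax(dim=-1)
print(probs)
```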

August 5, 2020 · Ceshine Lee

[Paper] Training Question Answering Models From Synthetic Data

"Training Question Answering Models From Synthetic Data" is an NLP paper from Nvidia that I found very interesting. Question answering (QA) data is expensive to obtain. If we can use the data we have to generate more data, that will be a huge time saver and create a lot of new possibilities. This paper shows some promising results in this direction. Some caveats: we need big models to get decent results (the paper reported question generation models with parameter counts ranging from 117M to 8.3B; see the ablation study in the following sections), and generated QA data is still not at the same level as real data (at least 3x+ more synthetic data is needed to reach the same level of accuracy). There is a lot of content in this paper, and it can be a bit overwhelming. I wrote down the parts of the paper that I think are most relevant in this post, and hopefully, it can be helpful to you as well. ...
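The pipeline generates answer candidates from raw text, generates questions for them, and then keeps only the pairs that survive a roundtrip consistency check against a trained QA model. A minimal sketch of just that filtering step, using a publicly available SQuAD-style checkpoint as a stand-in for the paper's own filter model:

```python
from transformers import pipeline

# Roundtrip filtering: keep a synthetic (question, answer) pair only if a
# trained QA model recovers (roughly) the same answer from the context.
# "deepset/roberta-base-squad2" is a public stand-in, not the paper's model.
qa_filter = pipeline("question-answering", model="deepset/roberta-base-squad2")

def keep_pair(context: str, question: str, answer: str, threshold: float = 0.5) -> bool:
    """Simplified consistency check: exact (case-insensitive) answer match plus a score floor."""
    pred = qa_filter(question=question, context=context)
    return pred["answer"].strip().lower() == answer.strip().lower() and pred["score"] >= threshold

context = "Megatron-LM was released by NVIDIA to train very large transformer language models."
print(keep_pair(context, "Who released Megatron-LM?", "NVIDIA"))
```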

July 23, 2020 · Ceshine Lee

[Tip] TorchScript Supports Half Precision

This is a short post describing how to use half precision in TorchScript. This can speed up models that were trained using mixed precision in PyTorch (using Apex Amp), and also some models trained using full precision (with some potential degradation of accuracy). "TorchScript is a way to create serializable and optimizable models from PyTorch code. Any TorchScript program can be saved from a Python process and loaded in a process where there is no Python dependency." (source) ...
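A minimal sketch of the idea, assuming a CUDA device and a torchvision ResNet standing in for the actual model: convert both the weights and the example input to FP16 before tracing, then save and load the TorchScript module as usual.

```python
import torch
import torchvision

# Convert a model and its example input to half precision before tracing.
model = torchvision.models.resnet18(pretrained=False).cuda().eval().half()
example_input = torch.randn(1, 3, 224, 224).cuda().half()

with torch.no_grad():
    traced = torch.jit.trace(model, example_input)

# The serialized TorchScript module keeps the FP16 weights.
traced.save("resnet18_fp16.pt")

# Later, possibly in a process without the original model code:
loaded = torch.jit.load("resnet18_fp16.pt")
output = loaded(example_input)
print(output.dtype)  # torch.float16
```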

July 11, 2020 · Ceshine Lee

Self-Supervised Domain Adaptation

Self-supervised learning made transfer learning possible in NLP [1] (by using language modeling as the pre-training task) and has started to show some potential in CV as well [2, 3, 4]. These methods make downstream tasks more label-efficient, that is, they require fewer labeled examples to achieve good prediction accuracy. In CV, we are already quite familiar with transfer learning from models pre-trained on the labeled ImageNet dataset. However, if the dataset used in the downstream task is significantly different from ImageNet, transfer learning/fine-tuning usually would not be very helpful. ...
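As a concrete, if simplified, illustration of what a self-supervised pretext task looks like, here is a rotation-prediction sketch on unlabeled images; the cited papers use different (and stronger) objectives, but the label-efficiency argument is the same: pre-train the backbone on the unlabeled target domain, then fine-tune on the few labels you have.

```python
import torch
import torch.nn as nn
import torchvision

# Pretext task: predict which of four rotations was applied to an unlabeled image.
backbone = torchvision.models.resnet18(pretrained=False)
backbone.fc = nn.Linear(backbone.fc.in_features, 4)  # 4 rotation classes

def rotate_batch(images: torch.Tensor):
    """Create (rotated image, rotation label) pairs from unlabeled images."""
    rotations, labels = [], []
    for k in range(4):
        rotations.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotations), torch.cat(labels)

images = torch.randn(8, 3, 224, 224)  # stand-in for unlabeled target-domain images
rotated, labels = rotate_batch(images)
loss = nn.functional.cross_entropy(backbone(rotated), labels)
loss.backward()  # pre-train on the unlabeled domain, then fine-tune on labeled data
```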

July 6, 2020 · Ceshine Lee