[Paper] Language-agnostic BERT Sentence Embedding

This post on the Google AI Blog explains the premise, background, and related works of this paper pretty well. I'm not going to repeat them in this post. Instead, I'll try to fill in some of the gaps I see as someone who is familiar with this topic but does not follow the latest developments very closely. Firstly, I want to point out something in the Google AI post that confuses me. In the first paragraph, the authors stated: ...
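For context, LaBSE produces sentence embeddings whose similarities are comparable across languages. A minimal sketch of what that buys you, assuming the community sentence-transformers port of the released checkpoint (the original model was published on TF Hub):

```python
from sentence_transformers import SentenceTransformer, util

# "sentence-transformers/LaBSE" is the community port of the released
# checkpoint; the paper's original model was shipped via TF Hub.
model = SentenceTransformer("sentence-transformers/LaBSE")

sentences = [
    "The cat sits on the mat.",       # English
    "Die Katze sitzt auf der Matte.", # German translation
    "I like to eat apples.",          # unrelated English sentence
]
embeddings = model.encode(sentences, convert_to_tensor=True, normalize_embeddings=True)

# A translation pair should score much higher than an unrelated pair.
print(util.cos_sim(embeddings[0], embeddings[1]))
print(util.cos_sim(embeddings[0], embeddings[2]))
```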

August 19, 2020 · Ceshine Lee

[Competition] Jigsaw Multilingual Toxic Comment Classification

Jigsaw Multilingual Toxic Comment Classification is the third Jigsaw toxic comment classification competition hosted on Kaggle. I've covered both the first one in 2018 and the second one in 2019 on this blog. This time, Kagglers were asked to use English training corpora to create multilingual toxic comment classifiers that are tested on six other languages. I've been taking a break from Kaggle during the COVID pandemic, so I did not participate in this year's competition. However, reading top solutions is always very helpful whether you participated or not, and that is exactly what I'm doing in this post. Due to time limitations, I will only cover a small part of the shared solutions. I'll update the post if I find other interesting things later. ...
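At its core, the task is zero-shot cross-lingual transfer: fine-tune a multilingual encoder on English labels, then score comments in other languages with the same weights. A rough sketch, assuming XLM-RoBERTa as the backbone (the solutions covered in the post are considerably more elaborate):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# A multilingual encoder fine-tuned on English data can still score
# comments in other languages, thanks to its shared multilingual pre-training.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=2)

# Fine-tune on English comments only ...
english_batch = tokenizer(["you are an idiot", "have a nice day"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])
loss = model(**english_batch, labels=labels).loss
loss.backward()  # one illustrative step; real training uses a full optimizer loop

# ... then evaluate on non-English comments with the same weights (zero-shot transfer).
spanish_batch = tokenizer(["que tengas un buen día"], padding=True, return_tensors="pt")
with torch.no_grad():
    probs = model(**spanish_batch).logits.softmax(dim=-1)
print(probs)
```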

August 5, 2020 · Ceshine Lee

[Paper] Training Question Answering Models From Synthetic Data

"Training Question Answering Models From Synthetic Data" is an NLP paper from Nvidia that I found very interesting. Question answering (QA) data is expensive to obtain. If we can use the data we have to generate more data, that will be a huge time saver and create a lot of new possibilities. This paper shows some promising results in this direction. Some caveats: we need big models to get decent results (the paper reported question generation models with parameter counts ranging from 117M to 8.3B; see the ablation study in the following sections), and generated QA data is still not at the same level as real data (at least 3x+ more synthetic data is needed to reach the same level of accuracy). There is a lot of content in this paper, and it can be a bit overwhelming. I wrote down the parts of the paper that I think are most relevant in this post, and hopefully, it can be helpful to you as well. ...
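The pipeline generates answer candidates from raw text, generates questions for them, and then keeps only the pairs that survive a roundtrip consistency check against a trained QA model. A minimal sketch of just that filtering step, using a publicly available SQuAD-style checkpoint as a stand-in for the paper's own filter model:

```python
from transformers import pipeline

# Roundtrip filtering: keep a synthetic (question, answer) pair only if a
# trained QA model recovers (roughly) the same answer from the context.
# "deepset/roberta-base-squad2" is a public stand-in, not the paper's model.
qa_filter = pipeline("question-answering", model="deepset/roberta-base-squad2")

def keep_pair(context: str, question: str, answer: str, threshold: float = 0.5) -> bool:
    """Simplified consistency check: exact (case-insensitive) answer match plus a score floor."""
    pred = qa_filter(question=question, context=context)
    return pred["answer"].strip().lower() == answer.strip().lower() and pred["score"] >= threshold

context = "Megatron-LM was released by NVIDIA to train very large transformer language models."
print(keep_pair(context, "Who released Megatron-LM?", "NVIDIA"))
```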

July 23, 2020 · Ceshine Lee

[Tip] TorchScript Supports Half Precision

This is a short post describing how to use half precision in TorchScript. This can speed up models that were trained using mixed precision in PyTorch (using Apex Amp), and also some models trained using full precision (with some potential degradation of accuracy). "TorchScript is a way to create serializable and optimizable models from PyTorch code. Any TorchScript program can be saved from a Python process and loaded in a process where there is no Python dependency." (source) ...
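A minimal sketch of the idea, assuming a CUDA device and a torchvision ResNet standing in for the actual model: convert both the weights and the example input to FP16 before tracing, then save and load the TorchScript module as usual.

```python
import torch
import torchvision

# Convert a model and its example input to half precision before tracing.
model = torchvision.models.resnet18(pretrained=False).cuda().eval().half()
example_input = torch.randn(1, 3, 224, 224).cuda().half()

with torch.no_grad():
    traced = torch.jit.trace(model, example_input)

# The serialized TorchScript module keeps the FP16 weights.
traced.save("resnet18_fp16.pt")

# Later, possibly in a process without the original model code:
loaded = torch.jit.load("resnet18_fp16.pt")
output = loaded(example_input)
print(output.dtype)  # torch.float16
```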

July 11, 2020 · Ceshine Lee

Self-Supervised Domain Adaptation

Self-supervised learning made transfer learning possible in NLP [1] (by using language modeling as the pre-training task) and has started to show some potential in CV as well [2, 3, 4]. These methods make downstream tasks more label-efficient, that is, they require fewer labeled examples to achieve good prediction accuracy. In CV, we are already quite familiar with transfer learning from models pre-trained on the labeled ImageNet dataset. However, if the dataset used in the downstream task is significantly different from ImageNet, transfer learning/fine-tuning usually would not be very helpful. ...
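As a concrete, if simplified, illustration of what a self-supervised pretext task looks like, here is a rotation-prediction sketch on unlabeled images; the cited papers use different (and stronger) objectives, but the label-efficiency argument is the same: pre-train the backbone on the unlabeled target domain, then fine-tune on the few labels you have.

```python
import torch
import torch.nn as nn
import torchvision

# Pretext task: predict which of four rotations was applied to an unlabeled image.
backbone = torchvision.models.resnet18(pretrained=False)
backbone.fc = nn.Linear(backbone.fc.in_features, 4)  # 4 rotation classes

def rotate_batch(images: torch.Tensor):
    """Create (rotated image, rotation label) pairs from unlabeled images."""
    rotations, labels = [], []
    for k in range(4):
        rotations.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotations), torch.cat(labels)

images = torch.randn(8, 3, 224, 224)  # stand-in for unlabeled target-domain images
rotated, labels = rotate_batch(images)
loss = nn.functional.cross_entropy(backbone(rotated), labels)
loss.backward()  # pre-train on the unlabeled domain, then fine-tune on labeled data
```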

July 6, 2020 · Ceshine Lee