Paper

[Paper] Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

Photo Credit Motivation The Adafactor optimizer, in my experience, can provide much better convergence when fine-tuning the T5 v1.1 and mT5[1] pre-trained models. However, I encountered problems when using a custom learning rate scheduler with the Adafactor implementation from the huggingface/transformer library. I combed through the paper and the source code to find and fix the cause of the problem, which turned into a tiny contribution to the library. To further squeeze value from the time I’ve invested, I wrote this post to introduce the key ideas of the Adafactor optimizer and analyze the corresponding chunk of code in the huggingface/transformer implementation (which was taken from the fairseq library). Working examples as Kaggle notebooks are also provided: T5 v1.1 and mT5. ...

[Paper] Rethinking Cooperative Rationalization: Introspective Extraction and Complement Control

Photo Credit Introduction Model interpretability is crucial if we want to use AI models to make high-stake decisions (e.g., making medical diagnoses, preventing suicides, etc.). In NLP, one common way to get interpretability is to extract information from the trained models. For example, some use gradient-based input attribution techniques, some perturb the input to get explanations, and some use influence functions to find the most influential training examples to this particular input sequence. Another way is to make the model intrinsically explainable (e.g., a decision tree). ...

[Paper] Are We Really Making Much Progress?

Photo Credit Introduction Today we’re examining this very interesting and alarming paper in the field of recommender systems — Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches. It also has an extended version still under review — A Troubling Analysis of Reproducibility and Progress in Recommender Systems Research. The first author of the papers also gave an overview and answered some questions to the first paper in this YouTube video (he also mentioned some of the contents in the extended version, e.g., the information leakage problem): ...

[Paper] Language-agnostic BERT Sentence Embedding

Photo Credit The Google AI Blog post This post on Google AI Blog explains the premise, background, and related works of this paper pretty well. I’m not going to repeat them in this post. Instead, I’ll try to fill in some of the gaps I see as someone that is familiar with this topic but does not follow very closely with the latest development. Firstly, I want to point out something in the Google AI post that confuses me. In the first paragraph the authors stated: ...

[Paper] Training Question Answering Models From Synthetic Data

Photo Credit Preamble “Training Question Answering Models From Synthetic Data” is an NLP paper from Nvidia that I found very interesting. Question and answer(QA) data is expansive to obtain. If we can use the data we have to generate more data, that will be a huge time saver and create a lot of new possibilities. This paper shows some promising results in this direction. Some caveats: We need big models to be able to get decent results. (The paper reported question generation models with the number of parameters from 117M to 8.3B. See the ablation study in the following sections.) Generated QA data is still not at the same level as the real data. (At least 3x+ more synthetic data is needed to reach the same level of accuracy.) There are a lot of contents in this paper, and it can be a bit overwhelming. I wrote down parts of the paper that I think is most relevant in this post, and hopefully, it can be helpful to you as well. ...