[Kaggle] Google Research Football 2020

(This post is an expansion of this Kaggle post.) My Solution Thanks to Kaggle, Manchester City F.C., and Google Research for this fantastic competition. Working on this competition was the most fun I’ve had in a while. The tl;dr version of my solution is that I used an MLP model to stochastically imitate WeKick’s agents, with some rules to help it navigate unfamiliar waters. Why this Approach After I got the GCP coupon, I looked at the competition timeline and thought that there was no way I could train a competitive RL agent from scratch in less than two weeks. I had to find some way to cut the training time. ...

December 28, 2020 · Ceshine Lee

[PyTorch Lightning] Log Training Losses when Accumulating Gradients

PyTorch Lightning reached 1.0.0 in October 2020. I wasn’t fully satisfied with the flexibility of its API, so I continued to use my pytorch-helper-bot. This has changed since the 1.0.0 release. Now I use PyTorch Lightning to develop training code that supports both single and multi-GPU training. However, one thing that bugged me is that the logging doesn’t work as expected when I set the number of gradient accumulation batches larger than one. The steps recorded in the training loop are still the raw step numbers, but those recorded in the validation loop are divided by the number of gradient accumulation batches. The training loop will be flooded with warnings of inconsistent steps being recorded. And it’ll be harder for you to compare the training and validation losses without the same step scale. ...
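To make the step mismatch described above concrete, here is a minimal sketch in plain Python. It does not call PyTorch Lightning at all; the helper name `to_optimizer_step` and the example values are assumptions made for illustration only. It shows how a raw batch counter and a counter divided by the number of gradient accumulation batches end up on different scales.

```python
def to_optimizer_step(raw_batch_step: int, accumulate_grad_batches: int) -> int:
    """Map a raw batch counter to the number of completed optimizer steps.

    Hypothetical helper for illustration; not part of any library.
    """
    if accumulate_grad_batches < 1:
        raise ValueError("accumulate_grad_batches must be at least 1")
    return raw_batch_step // accumulate_grad_batches


# Assumed setting: accumulate gradients over 4 batches.
accumulate_grad_batches = 4

# Pair each raw batch count with its value on the optimizer-step scale.
step_pairs = [
    (raw, to_optimizer_step(raw, accumulate_grad_batches))
    for raw in range(1, 13)
]

for raw, optimizer_step in step_pairs:
    print(f"raw batch step {raw:2d} -> optimizer step {optimizer_step}")
```

Logging both the training and validation losses against the same scale (either one) is what makes the two curves directly comparable.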

December 22, 2020 · Ceshine Lee

Generating Synthetic Tabular Data Using GAN

Introduction Recently I came across the article “How to Generate Synthetic Data? — A synthetic data generation dedicated repository”. The post introduces Wasserstein GAN[1] and demonstrates how to use it to generate synthetic (fake) data that looks very “real” (i.e., has statistical properties similar to those of the real data). This topic interests me as I’ve been wondering if we can reliably generate augmented data for tabular data. The author open-sourced the code on GitHub, so I decided to take some time to reproduce the results, make some improvements, and check if the quality of the synthetic data is good enough to augment or even replace the training data. ...

December 14, 2020 · Ceshine Lee

[Paper] Are We Really Making Much Progress?

Introduction Today we’re examining a very interesting and alarming paper in the field of recommender systems: Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches. It also has an extended version still under review: A Troubling Analysis of Reproducibility and Progress in Recommender Systems Research. The first author of the papers also gave an overview and answered some questions about the first paper in this YouTube video (he also mentioned some of the contents of the extended version, e.g., the information leakage problem): ...

December 4, 2020 · Ceshine Lee

Weird Behavior in the FiveThirtyEight 2020 Election Model

(This short post is just me writing down some of my thoughts after reading the analysis.) Andrew Gelman wrote this very interesting analysis of the FiveThirtyEight (538) model in October: “Reverse-engineering the problematic tail behavior of the Fivethirtyeight presidential election forecast.” Three weeks after election day, the 2020 election vote counts are almost finalized. We can see that the wider credible interval of the popular vote from the 538 model looks better on paper in hindsight (in contrast, for example, The Economist’s model has a much narrower confidence interval). But is it justified? After all, anyone can tune their model to make the uncertainty seem bigger, and explain the inaccuracy of their model with that artificially inflated uncertainty. (I’m not accusing 538 of doing so. I’m just saying that we shouldn’t blindly trust the computed uncertainty shown to us.) ...

November 23, 2020 · Ceshine Lee