Generating Synthetic Tabular Data Using GAN

Recently I came across the article “How to Generate Synthetic Data? — A synthetic data generation dedicated repository”. The post introduces Wasserstein GAN[1] and demonstrates how to use it to generate synthetic (fake) data that looks very “real” (i.e., has statistical properties similar to those of the real data). This topic interests me because I’ve been wondering whether we can reliably generate augmented data for tabular datasets. The author open-sourced the code on GitHub, so I decided to take some time to reproduce the results, make some improvements, and check whether the quality of the synthetic data is good enough to augment, or even replace, the training data. ...

December 14, 2020 · Ceshine Lee

[Paper] Are We Really Making Much Progress?

Today we’re examining a very interesting and alarming paper in the field of recommender systems: Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches. It also has an extended version that is still under review: A Troubling Analysis of Reproducibility and Progress in Recommender Systems Research. The first author of the papers also gave an overview of the first paper and answered some questions about it in this YouTube video (he also mentioned some of the contents of the extended version, e.g., the information leakage problem): ...

December 4, 2020 · Ceshine Lee

Weird Behavior in the FiveThirtyEight 2020 Election Model

(This short post is just me writing down some of my thoughts after reading the analysis.) Andrew Gelman wrote this very interesting analysis of the FiveThirtyEight (538) model in October: “Reverse-engineering the problematic tail behavior of the Fivethirtyeight presidential election forecast.” Three weeks after election day, the 2020 election vote counts are almost finalized. We can see that the wider credible interval of the popular vote from the 538 model looks better on paper in hindsight (in contrast, The Economist’s model, for example, has a much narrower interval). But is it justified? After all, anyone can tune their model to make the uncertainty seem bigger, and then explain away the model’s inaccuracy with that artificially inflated uncertainty. (I’m not accusing 538 of doing so. I’m just saying that we shouldn’t blindly trust the computed uncertainty shown to us.) ...

November 23, 2020 · Ceshine Lee

The Book Of Why: The New Science of Cause and Effect

I just finished The Book of Why by Judea Pearl. This book is one of those that I wish I had picked up a lot earlier. It makes a convincing case for what is missing in traditional probabilistic thinking and why causal models can help fill the gap. Although reading this book probably won’t help you find a job as a data scientist or AI/ML engineer, I genuinely think that every data scientist should read it to better understand the limitations of current statistical learning methods. Model-free approaches to AI are unlikely to bring us Artificial General Intelligence (AGI). Blindly throwing data at machine learning algorithms can only get us so far. (There already seems to be some research in reinforcement learning showing that world models that imitate how humans perceive the world can help build more intelligent agents. However, I’m not yet an expert in reinforcement learning, so my interpretation may be wrong.) ...

November 14, 2020 · Ceshine Lee

Automatic Testing Your SQLite Database with Great Expectations

If you are familiar with software engineering, you’ll know that automatic testing and continuous integration can save you a lot of debugging time when a project is complex enough and/or involves collaboration between contributors. They help you make sure that new code doesn’t break anything it’s not supposed to, and they quickly narrow down where things could have gone wrong when failures inevitably happen. As data scientists, we have to test not only the code but also the data to make sure our data pipelines are working correctly. Just as new code can break your software, new data can break your pipelines. Great Expectations is a tool that protects you from problematic new data: ...

October 17, 2020 · Ceshine Lee