The Book Of Why: The New Science of Cause and Effect

I just finished The Book of Why by Judea Pearl. This book is one of those I wish I had picked up a lot earlier. It makes a convincing case about what is missing in traditional probabilistic thinking and how causal models can fill the gap. Although reading this book probably won’t help you find a job as a data scientist or AI/ML engineer, I genuinely think that every data scientist should read it to better understand the limitations of current statistical learning methods. Model-free approaches to AI are unlikely to bring us Artificial General Intelligence (AGI); blindly throwing data at machine learning algorithms can only get us so far. (There already seems to be some research in reinforcement learning showing that world models which imitate how humans perceive the world can help build more intelligent agents. However, I’m not yet an expert in reinforcement learning, so my interpretation may be wrong.) ...

November 14, 2020 · Ceshine Lee

Automatically Testing Your SQLite Database with Great Expectations

If you are familiar with software engineering, you’ll know that automated testing and continuous integration can save you a lot of debugging time once a project is complex enough and/or involves collaboration between contributors. They help you make sure that new code doesn’t break anything it isn’t supposed to, and they quickly narrow down where things could have gone wrong when failures inevitably happen. As data scientists, we have to test not only our code but also our data to make sure our pipelines are working correctly. Just as new code can break your software, new data can break your pipelines. Great Expectations is a tool that protects you from problematic new data: ...
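To make "testing against data" concrete, here is a minimal, library-free sketch of the kind of check a tool like Great Expectations automates (it expresses checks like these declaratively, as "expectations"). The `orders` table and its rules are hypothetical, and only Python's built-in `sqlite3` module is used — this is not the Great Expectations API itself:

```python
import sqlite3

# Hypothetical example table; in practice this would be your real SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO orders (amount, status) VALUES (?, ?)",
    [(19.99, "paid"), (5.00, "pending"), (42.50, "paid")],
)

def check_orders(conn):
    """Return a list of failed data checks on the orders table."""
    failures = []
    # Expectation: no NULL amounts.
    (nulls,) = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE amount IS NULL"
    ).fetchone()
    if nulls:
        failures.append(f"{nulls} rows with NULL amount")
    # Expectation: amounts are strictly positive.
    (neg,) = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE amount <= 0"
    ).fetchone()
    if neg:
        failures.append(f"{neg} rows with non-positive amount")
    # Expectation: status comes from a known set of values.
    (bad,) = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE status NOT IN ('paid', 'pending', 'refunded')"
    ).fetchone()
    if bad:
        failures.append(f"{bad} rows with unknown status")
    return failures

print(check_orders(conn))  # [] when the data meets every expectation
```

Running checks like these in CI every time new data lands is what catches a broken pipeline before it reaches downstream consumers.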

October 17, 2020 · Ceshine Lee

[Tensorflow] Training CV Models on TPU without Using Cloud Storage

Recently I was asked this question (paraphrasing): I have a small image dataset that I want to train on Google Colab and its free TPU. Is there a way to do that without having to upload the dataset as TFRecord files to Cloud Storage? First of all, if your dataset is small, training on a GPU wouldn’t be much slower than on a TPU. But they were adamant that they wanted to see how fast training on a TPU can be. That’s fine, and the answer is yes, there is a way to do that. ...

October 11, 2020 · Ceshine Lee

Replicate Conda Environment in Docker

You’ve just finished developing your prototype in a Conda environment, and you are eager to share it with stakeholders, who may not have the knowledge required to recreate the environment and run your model on their end. Docker is a great tool for this kind of scenario (it can even utilize the GPU via nvidia-docker). Just create a Docker image and share it with the stakeholders, and your model will run on their machines the same way it runs on yours. ...
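As a rough sketch of the approach, here is a Dockerfile template (generated from Python for illustration) that rebuilds a Conda environment exported with `conda env export > environment.yml`. The base image `continuumio/miniconda3` is the usual starting point for Conda images; the environment name `my-env` and the entry point `app.py` are placeholders you would replace with your own:

```python
# Hypothetical sketch: a Dockerfile that recreates a Conda environment
# from an exported environment.yml. "my-env" and "app.py" are placeholders.
DOCKERFILE = """\
FROM continuumio/miniconda3

WORKDIR /app
# Recreate the Conda environment first so the layer is cached
# across code-only changes.
COPY environment.yml .
RUN conda env create -f environment.yml && conda clean -afy

COPY . .
# Run the entry point inside the recreated environment.
CMD ["conda", "run", "-n", "my-env", "python", "app.py"]
"""

with open("Dockerfile", "w") as f:
    f.write(DOCKERFILE)
```

Copying `environment.yml` before the rest of the code means Docker can reuse the (slow) `conda env create` layer whenever only your source files change.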

October 7, 2020 · Ceshine Lee

[Paper] Please Stop Permuting Features

This post summarizes the findings and suggestions of the paper “Please Stop Permuting Features ‒ An Explanation and Alternatives” by Giles Hooker and Lucas Mentch. (Note: permutation importance is covered in one of my previous posts: Feature Importance Measures for Tree Models — Part I.) TL;DR: Permutation importance (permuting features without retraining) is biased toward correlated features. Avoid using it, and use one of the following alternatives: ...
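To make the measure being critiqued concrete, here is a minimal pure-Python sketch of permutation importance exactly as described above: shuffle one feature’s column and measure how much the already-fitted model’s error grows, with no retraining. The toy linear model and data are hypothetical:

```python
import random

random.seed(0)

# Toy data: y depends on x0 only; x1 is pure noise.
n = 200
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(n)]
y = [3 * row[0] + random.gauss(0, 0.1) for row in X]

def model(row):
    # An already-"fitted" model; nothing below retrains it.
    return 3 * row[0]

def mse(X, y):
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)

def permutation_importance(X, y, feature):
    """Error increase after shuffling one feature's column (no retraining)."""
    baseline = mse(X, y)
    column = [row[feature] for row in X]
    random.shuffle(column)
    X_perm = [row[:feature] + [v] + row[feature + 1:] for row, v in zip(X, column)]
    return mse(X_perm, y) - baseline

print(permutation_importance(X, y, 0))  # large: the model relies on x0
print(permutation_importance(X, y, 1))  # exactly 0: x1 is never used
```

The paper’s point is that shuffling a column also destroys its correlation with the other features, forcing the model to extrapolate into regions it never saw during training, which is what distorts the resulting scores.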

September 8, 2020 · Ceshine Lee