Smaller Docker Image using Multi-Stage Build

Why Use Multi-Stage Build? Starting with Docker 17.05, users can use the new “multi-stage build” feature [1] to simplify their workflow and make the final Docker images smaller. It streamlines the “Builder pattern”: using a “builder” image to build the binary files, then copying those binary files to another runtime/production image. Although Python is an interpreted language, many Python libraries, especially those for scientific computing and machine learning, are built on pieces written in compiled languages (mostly C/C++). Therefore, the “Builder pattern” can still be applied. ...

June 21, 2019 · Ceshine Lee

Mixed Precision Training on Tesla T4 and P100

tl;dr: the power of Tensor Cores is real. Also, make sure the CPU does not become the bottleneck. Motivation I’ve written about Apex in a previous post: Use NVIDIA Apex for Easy Mixed Precision Training in PyTorch. At that time I only had my GTX 1070 to experiment on. And as we learned in that post, pre-Volta NVIDIA cards do not benefit from half-precision arithmetic in terms of speed; it only saves some GPU memory. Therefore, I wasn’t able to personally evaluate how much of a speed boost we can get from mixed precision with Tensor Cores. ...

June 13, 2019 · Ceshine Lee

[Notes] SHAP Values

Unlike other feature importance measures, SHAP values are fairly complicated and theoretically grounded. I kept forgetting the small details of how SHAP values work. These notes aim to make sure I understand the concept well enough and to be something I can refer back to once in a while. Hopefully they will also be helpful to you. Classic Shapley Value Estimation Shapley regression values: $$\phi_{i} = \sum_{S \subset F \backslash \{i\}} \frac{|S|!(|F|-|S|-1)!}{|F|!}[f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S)]$$ Shapley regression values are feature importances for linear models in the presence of multicollinearity. [1] ...
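The Shapley regression formula above can be sketched in a few lines of Python. This is only a minimal illustration, not the code from the post: `shapley_values` and `value` are hypothetical names, and `value(S)` stands in for $f_S(x_S)$, the output of a model that uses only the features in $S$. A toy additive function replaces real model retraining, so each feature's Shapley value should simply equal its own weight.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley values, enumerating every subset S of the other features.

    `value` is a hypothetical callable standing in for f_S(x_S): it takes a
    frozenset of feature names and returns a number.
    """
    n = len(features)
    phi = {}
    for i in features:
        others = [j for j in features if j != i]
        total = 0.0
        for size in range(n):  # |S| ranges over 0 .. |F| - 1
            # Weight |S|! (|F| - |S| - 1)! / |F|! from the formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to S.
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi


# Toy additive "model" (an assumption for illustration only).
weights = {"a": 1.0, "b": 2.0, "c": -0.5}


def additive_value(subset):
    return sum(weights[j] for j in subset)


phi = shapley_values(list(weights), additive_value)
print(phi)
```

The enumeration grows exponentially with the number of features, so this exact form is only practical for a handful of them.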

May 23, 2019 · Ceshine Lee

Detecting Chinese Characters in Unicode Strings

Motivation I need an automatic mechanism to remove texts in a dataset that are not in Chinese. The dataset contains characters from Traditional Chinese, Simplified Chinese, English, and on rare occasions French, Arabic, and other languages. General-purpose language detection packages (such as this one) produce a lot more false positives than expected. Texts with Chinese characters mixed with Latin characters are often classified as different languages. And quite often Chinese texts are classified as Korean, which is very interesting because the dataset does not have any Korean characters. ...

April 24, 2019 · Ceshine Lee

A First Look at Plotly Express

Plotly has a new high-level wrapper library for Python called Plotly Express. This post documents my trying out the new API and features, along with the new theming system introduced late last year. It also includes simple comparisons between the base Plotly.py API and Plotly Express, and my initial thoughts on Plotly Express. This post does not intend to cover all kinds of plots; only plots relevant to the particular dataset used here (basically bar charts) are covered. ...

April 9, 2019 · Ceshine Lee