[Notes] SHAP Values

Unlike other feature importance measures, SHAP values are fairly complicated and theoretically grounded. I kept forgetting the small details of how SHAP values work. These notes aim to make sure I understand the concept well enough, and to be something I can refer back to once in a while. Hopefully they will also be helpful to you.

Classic Shapley Value Estimation

Shapley regression values: $$\phi_{i} = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!(|F|-|S|-1)!}{|F|!}[f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S)]$$

Shapley regression values are feature importances for linear models in the presence of multicollinearity. [1] ...
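As a rough illustration of the formula above, here is a minimal brute-force sketch in Python. It is not taken from the original post: the function `shapley_values` and the toy `toy_value` function are hypothetical names, and `toy_value(S)` merely stands in for $f_S(x_S)$, the output of a model trained on the feature subset $S$. In practice each subset would need its own retrained model, which is why exact computation quickly becomes expensive.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley values by enumerating every subset S of F \\ {i}.

    `value` is an assumed callable mapping a frozenset of feature names
    to a number; it plays the role of f_S(x_S) in the formula.
    """
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for size in range(len(others) + 1):
            # Weight |S|! (|F| - |S| - 1)! / |F|! depends only on |S|.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to S.
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi


def toy_value(subset):
    """Hypothetical stand-in for f_S(x_S): additive effects plus one interaction."""
    contributions = {"a": 1.0, "b": 2.0, "c": 3.0}
    v = sum(contributions[f] for f in subset)
    if "a" in subset and "b" in subset:
        v += 1.0  # interaction term, shared between "a" and "b"
    return v


phi = shapley_values(["a", "b", "c"], toy_value)
print(phi)
```

In this toy setup the interaction term is split evenly between `a` and `b`, and the three values add up to the difference between the value of the full feature set and that of the empty set.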

May 23, 2019 · Ceshine Lee

Detecting Chinese Characters in Unicode Strings

Motivation: I have a situation where I need an automatic mechanism to remove texts that are not in Chinese from a dataset. The dataset contains characters from Traditional Chinese, Simplified Chinese, English, and on rare occasions French, Arabic, and other languages. General-purpose language detection packages (such as this one) produce a lot more false positives than expected. Texts with Chinese characters mixed with Latin characters are often classified as other languages. And quite often Chinese texts are classified as Korean, which is very interesting because the dataset does not contain any Korean characters. ...

April 24, 2019 · Ceshine Lee

A First Look at Plotly Express

Plotly has a new high-level wrapper library for Python called Plotly Express. This post documents me trying out its new API and features, along with the new theming system introduced late last year. It also includes simple comparisons between the base Plotly.py API and Plotly Express, and my initial thoughts on Plotly Express. This post does not intend to cover all kinds of plots. Only plots relevant to the particular dataset used here (basically bar charts) are covered. ...

April 9, 2019 · Ceshine Lee

Custom Image Augmentation with Keras

Photo by Josh Gordon on Unsplash

The new TensorFlow 2.0 is going to standardize on Keras as its high-level API. The existing Keras API will mostly remain the same, while TensorFlow features like eager execution, distributed training, and other deeper TensorFlow integrations will be added or improved. As someone who has switched to using PyTorch most of the time, I think it’s a good time to revisit Keras. ...

April 4, 2019 · Ceshine Lee

UMAP on RAPIDS (15x Speedup)

A_Different_Perspective from Pixabay

RAPIDS

RAPIDS is a collection of Python libraries from NVIDIA that enables users to run their data science pipelines entirely on GPUs. The two main components are cuDF and cuML. The cuDF library provides Pandas-like data frames, and cuML mimics scikit-learn. There’s also a graph analytics library, cuGraph, that was introduced in the latest release (0.6 on March 28).

The RAPIDS suite of open source software libraries gives you the freedom to execute end-to-end data science and analytics pipelines entirely on GPUs. RAPIDS is incubated by NVIDIA® based on years of accelerated data science experience. RAPIDS relies on NVIDIA CUDA® primitives for low-level compute optimization, and exposes GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces. ...

March 30, 2019 · Ceshine Lee