Quantile Regression — Part 1

I’m starting to think a prediction interval[1] should be a required output of every real-world regression model. You need to know the uncertainty behind each point estimate; otherwise the predictions are often not actionable. For example, suppose the historical sales of an item under certain circumstances are (10000, 10, 50, 100). The standard least-squares method gives you an estimate of 2540. If you restock based on that prediction, you’re likely going to significantly overstock 75% of the time. The prediction is almost useless. But if you estimate the quantiles of the data distribution, the estimated 5th, 50th, and 95th percentiles are 16, 75, and 8515, which are much more informative than the single estimate of 2540. This is the idea behind quantile regression. ...

July 12, 2018 · Ceshine Lee

[Review] Kaggle Toxic Comment Classification Challenge

Introduction: The Jigsaw toxic comment classification challenge features a multi-label text classification problem with a highly imbalanced dataset. The original test set was revealed to be already public on the Internet, so a new dataset was released mid-competition, and the evaluation metric was changed from log loss to AUC. I tried a few ideas after building up my PyTorch pipeline but did not find any innovative approach that looked promising. Text normalization was the only strategy I found that gave solid improvements, but it is very time-consuming. The final result (105th place, about top 3%) was quite fitting, IMO, given the (limited) time I spent on this competition. ...

March 24, 2018 · Ceshine Lee

Analyzing Tweets with R

Introduction: NLP (natural language processing) is hard, partly because humans are hard to understand. We need good tools to help us analyze texts. Even if the texts are eventually fed into a black-box model, doing exploratory analysis is very likely to help you get a better model. I’ve heard great things about an R package called tidytext and recently decided to give it a try. The package authors also wrote a book about it and kindly released it online: Text Mining with R: A guide to text analysis within the tidy data framework, using the tidytext package and other tidy tools. ...

February 27, 2018 · Ceshine Lee

Feature Importance Measures for Tree Models — Part I

2018-02-20 Update: Added two images (random forest and gradient boosting). 2019-05-25 Update: I’ve published a post covering another importance measure, SHAP values, on my personal blog and on Medium. This post is inspired by a Kaggle kernel and its discussions [1]. I’d like to do a brief review of common algorithms for measuring feature importance with tree-based models. We can interpret the results to check intuition (no surprisingly important features), do feature selection, and guide the direction of feature engineering. ...

October 28, 2017 · Ceshine Lee

[Learning Note] Single Shot MultiBox Detector with Pytorch — Part 3

(Reminder: the SSD paper and the PyTorch implementation used in this post. Also, the first and second parts of the series.) Training Objective / Loss Function: Every deep learning / neural network model needs a differentiable objective function to learn from. After pairing ground truths with default boxes, and marking the remaining default boxes as background, we’re ready to formulate the objective function of SSD: [Figure: Overall Objective, Formula (1) from the original paper] ...
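Paraphrasing the paper, formula (1) combines a confidence (classification) loss and a localization loss, averaged over the number of matched default boxes:

```latex
L(x, c, l, g) = \frac{1}{N}\Big( L_{conf}(x, c) + \alpha\, L_{loc}(x, l, g) \Big)
```

Here $x$ denotes the ground-truth-to-default-box match indicators, $c$ the class confidences, $l$ the predicted box offsets, $g$ the ground-truth boxes, and $N$ the number of matched default boxes (the loss is set to 0 when $N = 0$); $\alpha$ weights the localization term against the confidence term.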

July 27, 2017 · Ceshine Lee