[Notes] (Ir)Reproducible Machine Learning: A Case Study

I just read a draft paper titled “(Ir)Reproducible Machine Learning: A Case Study” (blog post; paper). It reviews 15 papers that predict civil war and evaluate their models using a train-test split. Of these 15 papers: 12 shared the complete code and data behind their results; 4 contain errors; 9 perform no hypothesis testing or uncertainty quantification (including 3 of the 4 papers with errors). Three of the papers with errors share the same dataset: Muchlinski et al. [1] created it, and Colaresi and Mahmood [2] and Wang [3] reused it without noticing a critical flaw in its construction, namely data leakage caused by imputing the training and test data together. ...
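The leakage described above is easy to reproduce and to fix. A minimal sketch (assuming scikit-learn and synthetic data, not the paper's actual civil-war dataset) contrasts imputing over the pooled data with fitting the imputer on the training split only:

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.2] = np.nan  # knock out ~20% of entries

X_train, X_test = X[:80], X[80:]

# Leaky: column means are computed over train AND test rows together,
# so information from the test set bleeds into the training features.
leaky = SimpleImputer(strategy="mean").fit_transform(np.vstack([X_train, X_test]))

# Correct: fit the imputer on the training split only, then apply the
# learned statistics to both splits.
imputer = SimpleImputer(strategy="mean").fit(X_train)
X_train_imp = imputer.transform(X_train)
X_test_imp = imputer.transform(X_test)
```

The two versions produce different filled-in values whenever the train and pooled column means differ, which is exactly the channel through which test information leaks.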

August 27, 2021 · Ceshine Lee

[Paper] Please Stop Permuting Features

This post summarizes the findings and suggestions of the paper “Please Stop Permuting Features: An Explanation and Alternatives” by Giles Hooker and Lucas Mentch. (Note: permutation importance is covered in one of my previous posts: Feature Importance Measures for Tree Models — Part I.) TL;DR Permutation importance (permuting features without retraining) is biased toward correlated features. Avoid using it, and use one of the following alternatives: ...
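For context, the scheme the paper cautions against is simple to state: shuffle one feature at a time and measure the drop in score of the already-fitted model. A minimal sketch on synthetic data (my own illustration, not code from the paper):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)  # only feature 0 matters

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
baseline = model.score(X, y)

importances = []
for j in range(X.shape[1]):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])  # break feature j's link to y
    importances.append(baseline - model.score(X_perm, y))
```

The paper's objection: when features are correlated, permuting one of them creates feature combinations far outside the training distribution, and the model is forced to extrapolate, which inflates the measured importance.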

September 8, 2020 · Ceshine Lee

Feature Importance Measures for Tree Models — Part I

2018–02–20 Update: Adds two images (random forest and gradient boosting). 2019–05–25 Update: I’ve published a post covering another importance measure — SHAP values — on my personal blog and on Medium. This post is inspired by a Kaggle kernel and its discussions [1]. I’d like to do a brief review of common algorithms for measuring feature importance with tree-based models. We can interpret the results to check intuition (no surprisingly important features), do feature selection, and guide the direction of feature engineering. ...
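The most readily available of these measures is the impurity-based (Gini) importance that scikit-learn exposes on fitted tree ensembles. A minimal sketch on synthetic data, as an assumed illustration of the kind of measure the post reviews:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic task with 5 features, only 2 of which carry signal.
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Mean decrease in impurity per feature, normalized to sum to 1.
for i, imp in enumerate(clf.feature_importances_):
    print(f"feature {i}: {imp:.3f}")
```

A quick sanity check on the output is that the importances are non-negative and sum to one, so they can be read as relative shares.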

October 28, 2017 · Ceshine Lee