Dataset

[Notes] (Ir)Reproducible Machine Learning: A Case Study

I just read a (draft) paper titled “(Ir)Reproducible Machine Learning: A Case Study” (blog post; paper). It reviews 15 papers that predict civil war and evaluate their models using a train-test split. Of these 15 papers, 12 shared the complete code and data for their results, 4 have errors, and 9 do not have hypothesis testing or uncertainty quantification (including 3 of the 4 papers with errors). Three of the papers with errors share the same dataset. Muchlinski et al.[1] created the dataset, and then Colaresi and Mahmood[2] and Wang[3] reused it without noticing the critical error in Muchlinski et al.’s dataset construction process: data leakage caused by imputing the training and test data together. ...
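To make the leakage concrete, here is a minimal sketch of the difference between imputing the two splits together and imputing them separately. The toy values and the helper names `fit_mean` and `impute` are hypothetical, and mean imputation is used purely for illustration; the excerpt above does not say which imputation method the papers used.

```python
from statistics import mean


def fit_mean(values):
    """Return the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    return mean(observed)


def impute(values, fill):
    """Replace every missing (None) entry with the given fill value."""
    return [fill if v is None else v for v in values]


# Hypothetical toy feature column; None marks a missing value.
train = [1.0, None, 3.0]
test = [None, 100.0]

# Leaky: the fill value is computed from train and test together,
# so the test value 100.0 influences the imputed training data.
leaky_fill = fit_mean(train + test)
leaky_train = impute(train, leaky_fill)

# Leak-free: the fill value is computed from the training split only
# and then reused, unchanged, on the test split.
clean_fill = fit_mean(train)
clean_train = impute(train, clean_fill)
clean_test = impute(test, clean_fill)

print("leaky train:", leaky_train)
print("clean train:", clean_train)
print("clean test: ", clean_test)
```

In the leaky version, the imputed training value depends on a test observation, so a model trained on it has indirectly seen the test set and its test score is likely to be optimistic.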

Automatic Testing Your SQLite Database with Great Expectations

If you are familiar with software engineering, you know that automated testing and continuous integration can save you a lot of debugging time when a project is complex enough or involves collaboration between contributors. They help you make sure that new code doesn’t break anything it’s not supposed to, and they quickly narrow down where things could have gone wrong when failures inevitably happen. As data scientists, we have to test not only code but also data to make sure our data pipelines are working correctly. Just as new code can break your software, new data can break your pipelines. Great Expectations is a tool that protects you from problematic new data: ...

Create a Customized Text Annotation Tool in Two Days - Part 2

In Part 1 of this series, we discussed why building your own annotation tool can be a good idea and demonstrated a back-end API server based on FastAPI. In Part 2, we’re going to build a front-end interface that interacts with the end user (the annotator). The front-end needs to do three main things: (1) fetch a batch of sentence/paragraph pairs to be annotated from the back-end server; (2) present the pairs to the annotator and provide a way for them to adjust the automatically generated labels; (3) send the annotated results to the back-end server. Disclaimer: I’m relatively inexperienced in front-end development, and the code here may seem amateurish to professionals. However, I hope this post can serve as a reference or starting point for those with similar requirements. ...

Create a Customized Text Annotation Tool in Two Days - Part 1

In my previous post, Fine-tuning BERT for Similarity Search, I mentioned that I annotated 2,000 sentence pairs, but did not describe how I did it or which tool I used. In this two-part series, we’ll see how I created a customized text annotation tool that greatly speeds up the annotation process. The entire stack was developed in two days. You can probably do it a lot faster if you are familiar with the technology (the actual time I spent on it was about 6 hours at most). ...