[Notes] Jigsaw Unintended Bias in Toxicity Classification

Jigsaw hosted a toxic comment classification competition[2] in 2018, and has also created an API service for detecting toxic comments[3]. However, it has been shown that models trained on this kind of dataset tend to be biased against minority groups. For example, a simple sentence “I am a black woman” would be classified as toxic, and as more toxic than the sentence “I am a woman”[4]. This year’s Jigsaw Unintended Bias in Toxicity Classification competition[1] introduces an innovative metric that aims to reduce such biases, and challenges Kagglers to find the best score achievable on this year’s new dataset. ...

August 4, 2019 · Ceshine Lee

[Notes] iMet Collection 2019 - FGVC6 (Part 1)

I started doing this competition (iMet Collection 2019 - FGVC6) seriously after hitting a wall in the Freesound competition. It was really late (only about one week before the competition ended), but by re-using a lot of code from the Freesound competition and using Kaggle Kernels to train models, I managed to get a decent submission with an F2 score of 0.622 on the private leaderboard (the top solution got 0.672, but used a hell of a lot more resources to train). ...

July 16, 2019 · Ceshine Lee

Dealing with Synthetic Data

Kaggle recently hosted a competition (Instant Gratification) to test their new “synchronous Kernel-only competition” format. It features a synthetic dataset, and the best way to achieve a high score on this dataset is to reverse-engineer the dataset creation algorithm. I did not really spend time on this competition, but after it was over I went back and checked the discussion forum for the solutions and insights shared, and found them quite interesting. There are quite a few lessons to be learned about how to create or deal with synthetic data. ...

June 25, 2019 · Ceshine Lee

Custom Image Augmentation with Keras

Photo by Josh Gordon on Unsplash

The new TensorFlow 2.0 is going to standardize on Keras as its high-level API. The existing Keras API will mostly remain the same, while TensorFlow features like eager execution, distributed training, and other deeper TensorFlow integration will be added or improved. As someone who had switched to using PyTorch most of the time, I think it’s a good time to revisit Keras. ...

April 4, 2019 · Ceshine Lee

Multilingual Similarity Search Using Pretrained Bidirectional LSTM Encoder

Photo by Steven Wei on Unsplash

Previously I’ve demonstrated how to use a pretrained BERT model to create a similarity measure between two documents in this post: News Topic Similarity Measure using Pretrained BERT Model. However, to find similar entries to N documents in a corpus A of size M, we need to run N × M feed-forward passes. A more efficient and widely used method is to use neural networks to generate sentence/document embeddings, and calculate cosine similarity scores between these embeddings. ...

February 15, 2019 · Ceshine Lee
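
As a rough illustration of the embedding approach mentioned in the excerpt above, here is a minimal sketch of cosine-similarity search over precomputed embeddings. It assumes each document has already been encoded once into a fixed-length vector (so N + M encoder passes instead of N × M pairwise ones). The function names and the toy 2-dimensional vectors are hypothetical and only for illustration; they are not taken from the post, and a real setup would use the encoder's actual output vectors.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity_matrix(queries, corpus):
    """Return scores[i][j] = cosine similarity of query i and corpus entry j."""
    q = [l2_normalize(v) for v in queries]
    c = [l2_normalize(v) for v in corpus]
    return [[sum(a * b for a, b in zip(qv, cv)) for cv in c] for qv in q]


def most_similar(queries, corpus):
    """For each query, return the index of the highest-scoring corpus entry."""
    scores = cosine_similarity_matrix(queries, corpus)
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Toy stand-ins for encoder outputs (hypothetical values).
query_embeddings = [[1.0, 0.0], [0.0, 2.0]]
corpus_embeddings = [[0.0, 1.0], [3.0, 0.0], [1.0, 1.0]]

best_matches = most_similar(query_embeddings, corpus_embeddings)
```

Normalizing first means vector length does not affect the score, so only the direction of each embedding matters. For large corpora the same computation is usually done as a single matrix product or with an approximate nearest-neighbor index rather than nested Python loops.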