Fine-tuning BERT for Similarity Search

Synopsis: I have the task of finding similar entries among 8,000+ pieces of news, using their titles and edited short descriptions in Traditional Chinese. I tried LASER[1] first but later found that Universal Sentence Encoder[2] seemed to work slightly better. Results from these unsupervised approaches are already acceptable, but they still suffer from occasional confusion and hiccups. Not entirely satisfied with the unsupervised approaches, I collected and annotated 2,000 pairs of news and fine-tuned the BERT model on this dataset. This supervised approach is visibly better than the unsupervised ones, and it is also quite sample-efficient: 350 training examples are already enough to beat Universal Sentence Encoder by a large margin. ...
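As a rough illustration of this kind of pairwise fine-tuning, here is a minimal sketch using the sentence-transformers library; the checkpoint name, pair texts, and similarity scores below are illustrative assumptions, not the post's actual configuration.

```python
# A minimal sketch of fine-tuning a BERT encoder on annotated news pairs
# with sentence-transformers; all data and names here are illustrative.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Assumption: load a raw multilingual BERT checkpoint; sentence-transformers
# wraps it with default mean pooling to produce fixed-size sentence vectors.
model = SentenceTransformer("bert-base-multilingual-cased")

# Each annotated pair carries a similarity score in [0, 1].
train_examples = [
    InputExample(texts=["新聞標題甲 ...", "新聞標題乙 ..."], label=0.9),
    InputExample(texts=["新聞標題甲 ...", "無關的新聞 ..."], label=0.1),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_loader, train_loss)], epochs=1)

# After fine-tuning, encode titles/descriptions and rank candidates
# by cosine similarity between their embeddings.
embeddings = model.encode(["某則新聞標題", "另一則新聞標題"])
```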

November 28, 2019 · Ceshine Lee

Zero Shot Cross-Lingual Transfer with Multilingual BERT

Synopsis: Do you want multilingual sentence embeddings but only have a training dataset in English? This post presents an experiment that fine-tuned a pre-trained multilingual BERT model (“BERT-Base, Multilingual Uncased” [1][2]) on the monolingual (English) AllNLI dataset[4] to create a sentence embedding model (one that maps a sentence to a fixed-size vector)[3]. The experiment shows that the fine-tuned multilingual BERT sentence embeddings generally perform better (i.e., have lower error rates) than baselines in a multilingual similarity search task (the Tatoeba dataset[5]). However, the error rates are still significantly higher than those from specialized sentence embedding models trained on multilingual datasets[5]. ...
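To make the setup concrete, here is a sketch of stacking a multilingual BERT encoder with mean pooling and training it on English NLI pairs with a softmax classification loss, using the sentence-transformers library; the exact training configuration in the post may differ, and the NLI pairs below are toy stand-ins for AllNLI.

```python
# A minimal sketch, assuming the sentence-transformers library and the
# multilingual BERT checkpoint; the training data here is illustrative.
from sentence_transformers import SentenceTransformer, models, losses, InputExample
from torch.utils.data import DataLoader

# Build a sentence embedding model: BERT encoder + mean pooling.
word_embedding = models.Transformer("bert-base-multilingual-uncased")
pooling = models.Pooling(
    word_embedding.get_word_embedding_dimension(), pooling_mode="mean"
)
model = SentenceTransformer(modules=[word_embedding, pooling])

# English-only NLI pairs; labels: 0=contradiction, 1=entailment, 2=neutral.
train_examples = [
    InputExample(texts=["A man is eating.", "A man eats food."], label=1),
    InputExample(texts=["A man is eating.", "The man is sleeping."], label=0),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)

# The classifier head is discarded after training; what transfers across
# languages is the shared encoder's sentence embedding space.
model.fit(train_objectives=[(train_loader, train_loss)], epochs=1)
```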

September 24, 2019 · Ceshine Lee

Customizing Spacy Sentence Segmentation

The Problem: Often in natural language processing (NLP), we want to split a large document into sentences so that we can analyze the individual sentences and the relationships between them. Spacy's pretrained neural models provide this functionality via their syntactic dependency parsers. Spacy also provides a rule-based Sentencizer, which is likely to fail on more complex sentences. While Spacy's statistical sentence segmentation works quite well in most cases, there are still some odd cases where it fails. One of them is the difficulty in handling “’s” tokens, which I noticed when using Spacy version 2.0.18 and model en_core_web_md version 2.0.0. ...
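One general way to customize segmentation is to set sentence boundaries in a custom pipeline component that runs before the parser; the parser then respects those decisions. Below is a minimal sketch using the modern Spacy 3 component API (the rule shown, suppressing sentence breaks at “’s” tokens, is a hypothetical illustration, not the post's exact fix).

```python
# A minimal sketch of overriding sentence boundaries in Spacy 3;
# the "'s" rule is a hypothetical example of such a correction.
import spacy
from spacy.language import Language

@Language.component("custom_sentence_boundaries")
def custom_sentence_boundaries(doc):
    for token in doc:
        # Never allow a new sentence to start at a "'s" token.
        if token.text == "'s":
            token.is_sent_start = False
    return doc

nlp = spacy.load("en_core_web_md")  # any pretrained pipeline with a parser
# Components placed before the parser can pre-set is_sent_start;
# the parser will not override those decisions.
nlp.add_pipe("custom_sentence_boundaries", before="parser")

doc = nlp("The company's CEO resigned. The board's response was swift.")
for sent in doc.sents:
    print(sent.text)
```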

August 14, 2019 · Ceshine Lee

[Notes] Jigsaw Unintended Bias in Toxicity Classification

Preamble: Jigsaw hosted a toxic comment classification competition[2] in 2018 and has also created an API service for detecting toxic comments[3]. However, it has been shown that models trained on this kind of dataset tend to have biases against minority groups. For example, the simple sentence “I am a black woman” would be classified as toxic, and as more toxic than the sentence “I am a woman”[4]. This year's Jigsaw Unintended Bias in Toxicity Classification competition[1] introduces an innovative metric that aims to reduce such biases and challenges Kagglers to find the best score achievable on this year's new dataset. ...
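For reference, the competition's metric combines the overall AUC with AUCs computed on specific slices of the data for each identity subgroup, aggregated with a generalized power mean that heavily penalizes the worst-performing subgroup. Here is a minimal sketch of those slice AUCs (function names and array conventions are illustrative; `y`, `p`, and `subgroup` are assumed to be NumPy arrays of labels, scores, and identity-mention masks):

```python
# A minimal sketch of the bias-aware AUC sub-metrics; all names illustrative.
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_auc(y, p, subgroup):
    # AUC restricted to comments that mention the identity.
    return roc_auc_score(y[subgroup], p[subgroup])

def bpsn_auc(y, p, subgroup):
    # Background Positive, Subgroup Negative: non-toxic subgroup comments
    # mixed with toxic background comments.
    mask = (subgroup & (y == 0)) | (~subgroup & (y == 1))
    return roc_auc_score(y[mask], p[mask])

def bnsp_auc(y, p, subgroup):
    # Background Negative, Subgroup Positive.
    mask = (subgroup & (y == 1)) | (~subgroup & (y == 0))
    return roc_auc_score(y[mask], p[mask])

def power_mean(values, power=-5):
    # Generalized mean; a strongly negative power is dominated by the
    # worst subgroup score, so one biased slice drags the metric down.
    values = np.asarray(values, dtype=float)
    return np.mean(values ** power) ** (1 / power)
```

The final leaderboard score averages the overall AUC with the power means of these three per-subgroup AUC families.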

August 4, 2019 · Ceshine Lee

Detecting Chinese Characters in Unicode Strings

Motivation: I need an automatic mechanism to remove texts in a dataset that are not in Chinese. The dataset contains characters from Traditional Chinese, Simplified Chinese, and English, and on rare occasions French, Arabic, and other languages. General-purpose language detection packages (such as this one) produce many more false positives than expected. Texts with Chinese characters mixed with Latin characters are often classified as other languages, and quite often Chinese texts are classified as Korean, which is very interesting because the dataset does not contain any Korean characters. ...
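The approach the title suggests, checking Unicode code points directly instead of running a language detector, can be sketched as follows (a minimal example assuming the common CJK ideograph blocks in the Basic Multilingual Plane cover this dataset; extension blocks above U+FFFF are omitted here):

```python
# A minimal sketch: detect Chinese text via Unicode code-point ranges.
import re

# CJK Unified Ideographs, Extension A, and Compatibility Ideographs.
# Assumption: these BMP blocks are sufficient for the dataset at hand.
CHINESE_CHAR = re.compile(r"[\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]")

def contains_chinese(text: str) -> bool:
    return CHINESE_CHAR.search(text) is not None

print(contains_chinese("This is English text"))      # False
print(contains_chinese("這是繁體中文 mixed with Latin"))  # True
```

Because the check operates on code points rather than statistical language models, mixing Latin (or Arabic) characters into a Chinese sentence cannot cause the Korean-style misclassifications described above.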

April 24, 2019 · Ceshine Lee