[Notes] “Statistical Inference Enables Bad Science; Statistical Thinking Enables Good Science”

This article by Christopher Tong got a lot of love from people I follow on Twitter, so I decided to read it. It was very enlightening. To be honest, though, I don’t fully understand quite a few of its arguments, probably because I lack experience with more rigorous scientific experiments and research. Nonetheless, I think writing down the parts I find interesting and putting them into a blog post will be beneficial for myself and other potential readers. Hopefully, it makes it easier to reflect on this material later. ...

November 9, 2019 · Ceshine Lee

Pro Tip: Use Shutdown Scripts to Detect Preemption on GCP

Motivation I was recently given $300 of Google Cloud Platform credit from Google for participating in a Kaggle competition. This gives me free access to powerful GPUs (T4, P100, even V100) to train models and thus opens the window to many new possibilities. However, the problem is that $300 can be used up rather quickly. For example, one Tesla P100 GPU costs $1.46 per hour, so $300 only buys about 200 hours, or 8.5 days. And don’t forget there are still other costs from CPU, memory, and disk storage. ...
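For context, here is a minimal sketch (mine, not from the post) of how a shutdown script on a preemptible GCE instance can ask the metadata server whether it is being preempted; the checkpoint-saving step is a hypothetical placeholder.

```python
# Hedged sketch: a GCE shutdown script that checks the metadata server for
# preemption. What to do on preemption (e.g. upload a checkpoint) is left
# as a placeholder.
import requests

PREEMPTED_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/preempted"
)

def was_preempted() -> bool:
    # The metadata server requires this header and answers "TRUE" or "FALSE".
    resp = requests.get(
        PREEMPTED_URL, headers={"Metadata-Flavor": "Google"}, timeout=2
    )
    return resp.text.strip() == "TRUE"

if __name__ == "__main__":
    if was_preempted():
        print("Preempted: upload the latest checkpoint before the VM stops.")
    else:
        print("Regular shutdown or manual stop.")
```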

October 25, 2019 · Ceshine Lee

Zero Shot Cross-Lingual Transfer with Multilingual BERT

Synopsis Do you want multilingual sentence embeddings, but only have a training dataset in English? This post presents an experiment that fine-tuned a pre-trained multilingual BERT model (“BERT-Base, Multilingual Uncased” [1][2]) on the monolingual (English) AllNLI dataset [4] to create a sentence embedding model (one that maps a sentence to a fixed-size vector) [3]. The experiment shows that the fine-tuned multilingual BERT sentence embeddings generally perform better (i.e., have lower error rates) than the baselines in a multilingual similarity search task (the Tatoeba dataset [5]). However, the error rates are still significantly higher than those of specialized sentence embedding models trained on multilingual datasets [5]. ...
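As a rough illustration (my own sketch, not the post’s training code), this is how a multilingual BERT checkpoint can be turned into fixed-size sentence vectors via mean pooling and used for a toy similarity search; the AllNLI fine-tuning described in the post is omitted here.

```python
# Minimal sketch, not the post's code: mean-pooled sentence vectors from the
# pre-trained "bert-base-multilingual-uncased" checkpoint (no AllNLI fine-tuning).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")
model = AutoModel.from_pretrained("bert-base-multilingual-uncased")
model.eval()

def embed(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # zero out padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # mean pooling -> fixed-size vectors

# Toy similarity search in the spirit of the Tatoeba evaluation.
emb = embed(["How are you?", "Wie geht es dir?", "The cat sat on the mat."])
print(torch.nn.functional.cosine_similarity(emb[0:1], emb[1:]))
```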

September 24, 2019 · Ceshine Lee

More Memory-Efficient Swish Activation Function

Update on 2020-08-22: using torch.cuda.max_memory_allocated() and torch.cuda.reset_peak_memory_stats() in the newer versions (1.6+) of PyTorch is probably more accurate. (reference) Motivation Recently I’ve been trying out EfficientNet models implemented in PyTorch. I’ve managed to successfully fine-tune pretrained EfficientNet models on my dataset and reach accuracy on par with mainstream models like SE-ResNeXt-50. However, training the model from scratch has proven to be much harder. Fine-tuned EfficientNet models can reach the same accuracy with a much smaller number of parameters, but they seem to occupy a lot more GPU memory than they probably should (compared to the mainstream models). There is an open issue on the GitHub repository about this problem — [lukemelas/EfficientNet-PyTorch] Memory Issues. ...
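To illustrate the kind of trick involved (a sketch under my own assumptions, not necessarily the exact implementation discussed in the post): a custom autograd Function can recompute the sigmoid in the backward pass instead of storing the intermediate activations, trading a little compute for a noticeably smaller memory footprint.

```python
import torch

class MemoryEfficientSwish(torch.autograd.Function):
    """Swish(x) = x * sigmoid(x), saving only the input for the backward pass."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)            # keep x only; recompute sigmoid later
        return x * torch.sigmoid(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        s = torch.sigmoid(x)
        # d/dx [x * sigmoid(x)] = sigmoid(x) * (1 + x * (1 - sigmoid(x)))
        return grad_output * (s * (1 + x * (1 - s)))

swish = MemoryEfficientSwish.apply
```

Peak usage before and after such a change can be compared with torch.cuda.reset_peak_memory_stats() and torch.cuda.max_memory_allocated(), as noted in the update above.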

August 22, 2019 · Ceshine Lee

Customizing Spacy Sentence Segmentation

The Problem Often in natural language processing (NLP), we want to split a large document into sentences so we can analyze the individual sentences and the relationships between them. spaCy’s pretrained neural models provide this functionality via their syntactic dependency parsers. spaCy also provides a rule-based Sentencizer, which is very likely to fail on more complex sentences. While spaCy’s statistical sentence segmentation works quite well in most cases, there are still some odd cases where it fails. One of them is the difficulty in handling ’s tokens, which I noticed when using Spacy version 1.0.18 and model en_core_web_md version 2.0.0. ...
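As an illustration (my own sketch, assuming spaCy 2.x pipeline conventions and a hypothetical component name, not the post’s exact solution), one way to customize segmentation is to add a component before the parser that forbids sentence breaks at possessive ’s tokens:

```python
import spacy

def no_split_on_possessive(doc):
    # Hypothetical rule: never start a new sentence at a possessive "'s" token.
    for token in doc[1:]:
        if token.text in ("'s", "’s"):
            token.is_sent_start = False
    return doc

nlp = spacy.load("en_core_web_md")
# In spaCy 2.x, the component must run before the dependency parser so the
# parser respects the preset is_sent_start values.
nlp.add_pipe(no_split_on_possessive, before="parser")

doc = nlp("The company's CEO resigned. The board's response was swift.")
for sent in doc.sents:
    print(sent.text)
```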

August 14, 2019 · Ceshine Lee