Posts

News Topic Similarity Measure using Pretrained BERT Model

In this post we establish a topic similarity measure among the news articles collected from the New York Times RSS feeds. The main purpose is to familiarize ourselves with the (PyTorch) BERT implementation and pretrained model(s). What is BERT? BERT stands for Bidirectional Encoder Representations from Transformers. It comes from a paper published by Google AI Language in 2018[1]. It is based on the idea that fine-tuning a pretrained language model can help the model achieve better results on downstream tasks[2][3]. ...

Playing with rstudio/gt R Package

Tables can be an effective way of communicating data. Though not as powerful as charts at telling stories, tables cram a lot of numbers into a limited space and can give readers accurate and potentially useful information that they can interpret in their own ways. I’ve come across this new R package gt (Easily generate information-rich, publication-quality tables from R) and decided to give it a try. ...

More Portable, Reproducible R Development Environment

R is awesome. In my opinion it’s the best (free) tool for telling great stories with data. My first post on Medium was about R. Although what I write here mostly involves Python, I still try to get back to R from time to time. I briefly mentioned my preferred R setup in the previous post “Analyzing Tweets with R” (in the “R tips” section), which includes Microsoft R Open (MRO) and the checkpoint package. Unfortunately, checkpoint doesn’t work well with RStudio, and some weird issues with MRO became more and more annoying to me. Therefore I decided to find a new setup that works more smoothly and reliably. After some trial and error, here is the configuration that I ended up most satisfied with: ...

Use TextRank to Extract Most Important Sentences in Article

Motivation: I’m trying to build an NLP system that can automatically highlight the important parts of an article to help people read long articles. The common practice is to start with a simple baseline model that is useful enough, and then incrementally improve the performance. The TextRank algorithm[1], which I also used as a baseline in a text summarization system, is a natural fit for this task. ...

Implementing Beam Search - Part 2

Overview: Part one, “Implementing Beam Search (Part 1) - A Source Code Analysis of OpenNMT-py”, gave an overview of how OpenNMT-py produces output sequences for a batch of input sequences (Translator._translate_batch method), and how it conducts beam searches (Beam objects). Now we turn our attention to some of the details we skipped over in part one: the advanced features that influence how the translator produces output candidates/hypotheses. They can be put into two categories: rule-based and number-based. ...