News Topic Similarity Measure using Pretrained BERT Model

In this post we establish a topic similarity measure among the news articles collected from the New York Times RSS feeds. The main purpose is to familiarize ourselves with the (PyTorch) BERT implementation and pretrained model(s). What is BERT? BERT stands for Bidirectional Encoder Representations from Transformers. It comes from a paper published by Google AI Language in 2018[1]. It is based on the idea that fine-tuning a pretrained language model can help the model achieve better results on downstream tasks[2][3]. ...
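
Below is a minimal sketch of the kind of similarity measure described above, using the Hugging Face `transformers` package (an assumption on my part; the original post used an earlier PyTorch BERT port). It embeds two headlines with a pretrained BERT model via mean pooling and compares them with cosine similarity; the model name and pooling choice are illustrative, not taken from the post.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Illustrative model choice; the post's exact pretrained weights are not specified here.
MODEL_NAME = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()


def embed(text: str) -> torch.Tensor:
    """Mean-pool the last hidden states of a pretrained BERT model."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_dim)
    mask = inputs["attention_mask"].unsqueeze(-1)   # ignore padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)


def similarity(a: str, b: str) -> float:
    return F.cosine_similarity(embed(a), embed(b)).item()


print(similarity("Senate passes the budget bill", "Congress approves new spending plan"))
print(similarity("Senate passes the budget bill", "Local team wins the championship"))
```

The first pair should score noticeably higher than the second, which is the basic behavior a topic similarity measure over news articles relies on.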

February 10, 2019 · Ceshine Lee

Implementing Beam Search - Part 2

Overview: Part one gave an overview of how OpenNMT-py produces output sequences for a batch of input sequences (the Translator._translate_batch method) and how it conducts beam searches (Beam objects): Implementing Beam Search (Part 1) - A Source Code Analysis of OpenNMT-py. Now we turn our attention to some of the details we skipped over in part one: the advanced features that influence how the translator produces output candidates/hypotheses. They fall into two categories: rule-based and number-based. ...
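
As one concrete illustration of a "number-based" adjustment found in OpenNMT-py-style translators, the sketch below applies a length penalty to hypothesis scores before ranking. It uses the Google NMT formulation (Wu et al., 2016) purely as an example; this is my assumption, not a claim about which exact features the post covers.

```python
# Google NMT style length penalty (Wu et al., 2016), shown as an illustration of a
# "number-based" feature: it keeps longer hypotheses from being unfairly punished
# for accumulating more negative log-probabilities.
def length_penalty(length: int, alpha: float = 0.6) -> float:
    return ((5.0 + length) / 6.0) ** alpha


def rescore(sum_log_prob: float, length: int, alpha: float = 0.6) -> float:
    """Normalize a hypothesis score by its length penalty before ranking."""
    return sum_log_prob / length_penalty(length, alpha)


# After rescoring, a longer hypothesis can overtake a shorter one even though
# its raw sum of log-probabilities is lower.
print(rescore(-4.0, length=5))   # short candidate
print(rescore(-4.5, length=12))  # longer candidate, wins after normalization
```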

November 7, 2018 · Ceshine Lee

Implementing Beam Search - Part 1

As hinted in the previous post “Building a Summary System in Minutes”, I’ll try to do some source code analysis of the OpenNMT-py project in this post. I’d like to start with its Beam Search implementation. It is widely used in seq2seq models, but I haven’t yet had a good grasp of its details. The translator/predictor of OpenNMT-py is also one of the most powerful I’ve seen, coming with a wide range of parameters and options. ...
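
For orientation before digging into OpenNMT-py's code, here is a minimal, self-contained beam search over a toy next-token distribution. It keeps only the core idea (expand every hypothesis, keep the top-k by cumulative log-probability) and ignores the advanced options covered in part two; the vocabulary and the fixed "model" are made up for the example and are not part of OpenNMT-py.

```python
import math
from typing import List, Tuple

# Toy "model": returns a fixed next-token distribution regardless of the prefix.
# In a real seq2seq decoder this would be a forward pass conditioned on the prefix.
VOCAB = ["<eos>", "a", "b", "c"]

def next_token_log_probs(prefix: List[str]) -> List[float]:
    probs = [0.1, 0.5, 0.3, 0.1]
    return [math.log(p) for p in probs]


def beam_search(beam_size: int = 2, max_len: int = 4) -> List[Tuple[float, List[str]]]:
    # Each hypothesis is (cumulative log-probability, token list).
    beams: List[Tuple[float, List[str]]] = [(0.0, [])]
    finished: List[Tuple[float, List[str]]] = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            for tok, lp in zip(VOCAB, next_token_log_probs(tokens)):
                candidates.append((score + lp, tokens + [tok]))
        # Keep the best `beam_size` unfinished candidates; set finished ones aside.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates:
            if tokens[-1] == "<eos>":
                finished.append((score, tokens))
            elif len(beams) < beam_size:
                beams.append((score, tokens))
    return sorted(finished + beams, key=lambda c: c[0], reverse=True)


for score, tokens in beam_search():
    print(round(score, 3), tokens)
```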

November 5, 2018 · Ceshine Lee

[Notes] Neural Language Models with PyTorch

Motivation: I was reading the paper “Character-Level Language Modeling with Deeper Self-Attention” by Al-Rfou et al., which describes some ways to use Transformer self-attention models to solve the language modeling problem. One big problem with Transformer models in this setting is that they cannot pass information from one batch to the next, so they have to make predictions based on limited contexts. This becomes a problem when we have to compare the results with “traditional” RNN-based models; what Al-Rfou et al. propose is to use only the outputs at the last position in the sequence from the Transformer when evaluating. If we ignore the first batch, a sequence of length N will require N batches to predict for Transformers, and only (N / M) batches for RNN models (M being the sequence length of a batch). ...
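
To make the N versus N / M comparison above concrete, here is a small back-of-the-envelope sketch: it only counts forward passes, assuming the Transformer scores just the last position of each M-token window while the RNN carries its hidden state across chunks. The numbers are illustrative, not measurements from the paper.

```python
import math

def transformer_eval_steps(n: int, m: int) -> int:
    # One forward pass per predicted position when only the last position of each
    # M-token window is scored (ignoring the very first window, as in the post).
    return n

def rnn_eval_steps(n: int, m: int) -> int:
    # The hidden state carries context across chunks, so each forward pass
    # scores a full chunk of M positions.
    return math.ceil(n / m)

N, M = 10_000, 128
print(transformer_eval_steps(N, M))  # 10000 forward passes
print(rnn_eval_steps(N, M))          # 79 forward passes
```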

October 13, 2018 · Ceshine Lee

Prepare Deep-Learning-Ready VMs on Google Cloud Platform

The 2nd YouTube-8M Video Understanding Challenge has just finished. Google generously handed out $300 of Google Cloud Platform (GCP) credits to the first 200 eligible people, and I was lucky enough to be one of them. I wouldn’t have been able to participate in this challenge at a higher level otherwise: my local hardware can barely handle the size of the dataset and is not powerful enough for the size of the model. The least I can do to return the favor is to write a short tutorial on how to set up deep-learning-ready VMs on GCP and share some tips that I’ve learned. ...

August 10, 2018 · Ceshine Lee