[Notes] Neural Language Models with PyTorch

With Notebook Examples Runnable on Google Colab

Oct 13, 2018 · 1590 words · 8 minute read machine_learning deep-learning nlp pytorch


Motivation

I was reading this paper, “Character-Level Language Modeling with Deeper Self-Attention” by Al-Rfou et al., which describes some ways to apply Transformer self-attention models to language modeling. One big limitation of Transformers in this setting is that they cannot pass information from one batch to the next, so they have to make predictions based on limited context.

This becomes an issue when comparing results with “traditional” RNN-based models. What Al-Rfou et al. propose is to use only the output at the last position of the sequence when evaluating the Transformer. Ignoring the first batch, a sequence of length N then requires N batches to evaluate with a Transformer, but only N / M batches with an RNN model (M being the sequence length of a batch).
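To make the difference concrete, here is a toy back-of-the-envelope sketch (my own, not from the paper) that simply counts forward passes under the two evaluation schemes:

```python
# Toy batch counting for the two evaluation schemes described above.
# Assumption: the Transformer predicts only the last position of each
# sliding window, while the RNN predicts all M positions per batch and
# carries its hidden state across batches.

def transformer_eval_batches(n_tokens: int) -> int:
    # One forward pass (window) per predicted token, ignoring the first batch.
    return n_tokens

def rnn_eval_batches(n_tokens: int, bptt_len: int) -> int:
    # Each batch yields predictions for bptt_len tokens at once.
    return n_tokens // bptt_len

print(transformer_eval_batches(100_000))       # 100000
print(rnn_eval_batches(100_000, bptt_len=35))  # 2857
```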

As I read the paper, I found that I was not really sure about some implementation details of RNN-based language models. It really bugged me, so I went back to the official PyTorch example and figured it out. The following sections are the notes I took during the process.

Theoretical Background

We’re not going to cover the theory in this post, but here are some resources if you’re interested:

  1. [Video] Lecture 8: Recurrent Neural Networks and Language Models — Taught by Richard Socher, who is an excellent teacher. Highly recommended.

  2. The Wikipedia page on Language Model

  3. Gentle Introduction to Statistical Language Modeling and Neural Language Models

  4. Language Model: A Survey of the State-of-the-Art Technology

Basically, a language model tries to predict the next token given the previous tokens, i.e., to estimate the conditional probability:

$$P(w_t \mid w_1, w_2, \dots, w_{t-1})$$
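For reference, here is a minimal PyTorch sketch of how an RNN language model estimates this probability (this is not the exact model from the example repo, and the sizes are made up): the network outputs a distribution over the vocabulary at every position and is trained with cross-entropy against the next token.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only.
vocab_size, emb_size, hidden_size = 10_000, 200, 200

class RNNLM(nn.Module):
    """Embed tokens, run an LSTM, project hidden states to vocab logits."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.rnn = nn.LSTM(emb_size, hidden_size, batch_first=True)
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens, hidden=None):
        output, hidden = self.rnn(self.embed(tokens), hidden)
        return self.decoder(output), hidden

model = RNNLM()
criterion = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (4, 36))    # (batch, seq_len + 1)
logits, _ = model(tokens[:, :-1])                 # contexts -> (4, 35, vocab)
loss = criterion(logits.reshape(-1, vocab_size),  # next-token targets
                 tokens[:, 1:].reshape(-1))
# loss is the average negative log P(w_t | w_1, ..., w_{t-1}) over the batch.
```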

Source Code

I forked the pytorch/examples GitHub repo, made some tiny changes, and added two notebooks. Here’s the link:

ceshine/examples

Dataset Preparation

This example comes with a copy of the WikiText-2 dataset. The text has already been tokenized at the word level and split into train, validation, and test sets.

No processing is needed other than replacing newlines with <eos> tokens.
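For concreteness, here is a simplified sketch of that step, roughly in the spirit of data.py in the example repo (the file path and the bare word-to-index dictionary below are illustrative, not the repo’s actual API):

```python
import torch

# Simplified tokenization sketch: the text is already word-tokenized,
# so we only split on whitespace and mark the end of each line with an
# <eos> token (i.e. replace the newline).
def tokenize(path, word2idx):
    ids = []
    with open(path, "r", encoding="utf8") as f:
        for line in f:
            for word in line.split() + ["<eos>"]:
                if word not in word2idx:
                    word2idx[word] = len(word2idx)
                ids.append(word2idx[word])
    return torch.tensor(ids, dtype=torch.long)

word2idx = {}
train_ids = tokenize("wikitext-2/train.txt", word2idx)  # hypothetical path
```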