[Notes] Neural Language Models with PyTorch
With Notebook Examples Runnable on Google Colab
Oct 13, 2018 · 1590 words · 8 minute read

Motivation
I was reading this paper titled “Character-Level Language Modeling with Deeper Self-Attention” by Al-Rfou et al., which describes some ways to use Transformer self-attention models to solve the language modeling problem. One big problem of Transformer models in this setting is that they cannot pass information from one batch to the next, so they have to make predictions based on limited contexts.
This becomes a problem when we have to compare results with "traditional" RNN-based models. What Al-Rfou et al. propose is to use only the Transformer's outputs at the last position of the sequence when evaluating. Ignoring the first batch, a sequence of length N then requires N batches of predictions from the Transformer, but only N / M batches from an RNN model (where M is the sequence length of a batch).
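To make the difference concrete, here is a rough back-of-the-envelope comparison. The numbers N and M below are made up for illustration; this is not code from the paper:

```python
# Hypothetical numbers, just to illustrate the batch-count argument above.
N = 100000  # tokens in the evaluation set
M = 64      # sequence length per batch

# A Transformer evaluated this way slides the context window one token at a
# time and keeps only the prediction at the last position, so (ignoring the
# first batch) it needs roughly one forward pass per token.
transformer_batches = N

# An RNN carries its hidden state across batches, so each batch of length M
# yields M predictions.
rnn_batches = N // M

print(transformer_batches, rnn_batches)  # 100000 vs 1562
```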
As I read the paper, I found that I was not really sure about some implementation details of RNN-based language models. It really bugged me, so I went back to the official PyTorch example and figured things out. The following sections are the notes I took during that process.
Theoretical Background
We’re not going to cover this in this post. But here are some resources for you if you’re interested:
- [Video] Lecture 8: Recurrent Neural Networks and Language Models — taught by Richard Socher, who is an excellent teacher. Highly recommended.
- Gentle Introduction to Statistical Language Modeling and Neural Language Models
Basically, a language model tries to predict the next token given the previous tokens, that is, to estimate the conditional probability:

$$P(w_t \mid w_1, w_2, \dots, w_{t-1})$$
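Written out for a whole sequence, this is the standard chain-rule factorization (standard notation, not taken from the lecture or the paper):

$$P(w_1, w_2, \dots, w_N) = \prod_{t=1}^{N} P(w_t \mid w_1, \dots, w_{t-1})$$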
Source Code
I forked the pytorch/examples GitHub repo, made some tiny changes, and added two notebooks. Here's the link:
Dataset Preparation
This example comes with a copy of the wikitext-2 dataset. The texts have already been tokenized at the word level and split into train, validation, and test sets.
No processing is needed other than replacing newlines with <eos> tokens.
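As a sketch of what that preparation looks like in practice (the function name and file paths below are my own, not the exact API of the example's data.py):

```python
import torch

def load_split(path, word2idx):
    """Read one pre-tokenized split, append <eos> at each newline, and map words to ids."""
    ids = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            # the text is already tokenized at word level, so a plain split is enough
            for word in line.split() + ["<eos>"]:
                if word not in word2idx:
                    word2idx[word] = len(word2idx)
                ids.append(word2idx[word])
    return torch.tensor(ids, dtype=torch.long)

word2idx = {}
train_ids = load_split("wikitext-2/train.txt", word2idx)  # hypothetical paths
valid_ids = load_split("wikitext-2/valid.txt", word2idx)
test_ids = load_split("wikitext-2/test.txt", word2idx)
```

In the official example, each split ends up as one long 1-D tensor of token ids, which is later reshaped into batches.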