Multilingual Similarity Search Using Pretrained Bidirectional LSTM Encoder

Evaluating LASER (Language-Agnostic SEntence Representations)

Feb 15, 2019 · 1669 words · 8 minute read machine_learning deep-learning pytorch nlp

Photo by Steven Wei on Unsplash

Introduction

Previously I demonstrated how to use a pretrained BERT model to create a similarity measure between two documents in this post: News Topic Similarity Measure using Pretrained BERT Model.

However, to find the entries most similar to each of N documents in a corpus A of size M, we need to run N × M feed-forward passes. A more efficient and widely used approach is to use neural networks to generate sentence/document embeddings once, and then calculate cosine similarity scores between those embeddings.
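As a rough sketch of that idea (plain NumPy, nothing LASER-specific; the function name and arguments below are made up for illustration):

import numpy as np

def cosine_top_k(query_emb, corpus_emb, k=3):
    # query_emb: (N, d), corpus_emb: (M, d) float32 arrays of precomputed embeddings.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    scores = q @ c.T  # (N, M) cosine similarities in one matrix multiplication
    top_k = np.argsort(-scores, axis=1)[:, :k]
    return top_k, np.take_along_axis(scores, top_k, axis=1)

The encoder runs once per document; after that, scoring any query against the whole corpus is a single matrix multiplication instead of N × M encoder passes.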

The LASER (Language-Agnostic SEntence Representations) project released by Facebook provides a pretrained sentence encoder that can handle 92 different languages. Sentences from all languages are mapped into the same embedding space, so embeddings from different languages are comparable.

Our system uses a single BiLSTM encoder with a shared BPE vocabulary for all languages, which is coupled with an auxiliary decoder and trained on publicly available parallel corpora. [1]

In this post we’ll try to reproduce the mapping between English and Chinese Mandarin sentences using the Tatoeba dataset created by [1]. After confirming we have the same results as reported in the paper, we’ll test if LASER can find the corresponding English titles to some (translated) articles from the New York Times Chinese version.

A Big Caveat

LASER is licensed under Attribution-NonCommercial 4.0 International, so you cannot use it for anything commercial. An update to the license seems to be in the works, but no timetable has been given yet.

You can write your own implementation, though: the training data is publicly available (see Appendix A in [1]). Training takes about 5 days on 16 V100 GPUs.

Our implementation is based on fairseq, and we make use of its multi-GPU support to train on 16 NVIDIA V100 GPUs with a total batch size of 128,000 tokens. Unless otherwise specified, we train our model for 17 epochs, which takes about 5 days. [1]

Installation Notes

The core encoder itself only depends on PyTorch 1.0, but tokenization, BPE, and similarity search require some third-party tools. Follow the official installation instructions to install the tokenization scripts from the Moses decoder and FastBPE, and install FAISS (a library for efficient similarity search and clustering of dense vectors) via conda or from source.

I recommend using my fork of LASER, since the original requires you to install the *transliterate* package, which is only used for Greek, whether or not you actually use Greek. My fork makes it an optional dependency: ceshine/LASER.

Model Overview

Taken from [1]

The pretraining architecture is not very different from a traditional sequence-to-sequence network [3]. The key differences:

  1. Bidirectional LSTM at every encoder layer.

  2. Max pooling on top of the last encoder layer (instead of taking the hidden states at the last time step).

  3. A linear transformation is applied to the pooled states, and the result is passed to the decoder to initialize the hidden states of its LSTM units.

  4. The pooled states are concatenated to the decoder input at every time step.

The sentence in the source language is encoded by the encoder and then translated into the target language (English or Spanish) by the decoder. The encoder does not know what the target language is; the target language is specified via the language ID inputs (L_id) to the decoder.

After pretraining, the encoder is extracted and used as-is (without any fine-tuning). It proves quite useful in zero-shot transfer tasks (training data in one language, test data in another).
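To make the encoder side concrete, here is a minimal PyTorch sketch of points 1 and 2 above. It is a simplification, not LASER's actual code: the class name and defaults are mine, and padding masks and BPE handling are omitted.

import torch
import torch.nn as nn

class MaxPoolBiLSTMEncoder(nn.Module):
    # Simplified sketch; sizes mirror the pretrained model shown below.
    def __init__(self, vocab_size=73640, embed_dim=320,
                 hidden_dim=512, num_layers=5, padding_idx=1):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, embed_dim,
                                         padding_idx=padding_idx)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            num_layers=num_layers, bidirectional=True)
    def forward(self, tokens):
        # tokens: (seq_len, batch) of BPE token ids
        x = self.embed_tokens(tokens)
        outputs, _ = self.lstm(x)  # (seq_len, batch, 2 * hidden_dim)
        # Max pooling over the time dimension gives a fixed-size,
        # 1024-dimensional sentence embedding. A real implementation
        # would mask out padded positions before pooling.
        sentence_emb, _ = outputs.max(dim=0)
        return sentence_emb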

Left: monolingual embedding space. Right: shared embedding space. Taken from [2].

Tatoeba English-Mandarin Dataset

Tatoeba Notebook.

There are 1,000 sentence pairs in English and Chinese Mandarin, stored in two text files tatoeba.cmn-eng.eng and tatoeba.cmn-eng.cmn.

Currently tokenization and BPE are basically shell commands wrapped in Python functions. We tokenize a text file by invoking Token:

Token(
    str(DATA_PATH / "tatoeba.cmn-eng.cmn"),
    str(CACHE_PATH / "tatoeba.cmn-eng.cmn"),
    lang="zh",
    romanize=False,
    lower_case=True, gzip=False,
    verbose=True, over_write=False)

Then run BPE using BPEfastApply:

bpe_codes = str(MODEL_PATH / "93langs.fcodes")
BPEfastApply(
    str(CACHE_PATH / "tatoeba.cmn-eng.eng"),
    str(CACHE_PATH / "tatoeba.cmn-eng.eng.bpe"),
    bpe_codes,
    verbose=True, over_write=False)

The pretrained encoder is loaded using the class SentenceEncoder:

encoder = SentenceEncoder(
    str(MODEL_PATH / "bilstm.93langs.2018-12-26.pt"),
    max_sentences=None,
    max_tokens=10000,
    cpu=False)

The encoder consists of an embedding matrix (73640x320) and a 5-layer bidirectional LSTM module:

Encoder(
  (embed_tokens): Embedding(73640, 320, padding_idx=1)
  (lstm): LSTM(320, 512, num_layers=5, bidirectional=True)
)

We compute the sentence embeddings and store them in yet another file:

EncodeFile(
    encoder,
    str(CACHE_PATH / "tatoeba.cmn-eng.cmn.bpe"),
    str(CACHE_PATH / "tatoeba.cmn-eng.cmn.enc"),
    verbose=True, over_write=False)
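If you want to peek at the embeddings themselves, the .enc file appears to be raw float32 data, one 1024-dimensional vector per sentence (2 × 512 BiLSTM states after max pooling). The snippet below is a sketch based on that assumption:

import numpy as np

# Assumes the .enc file stores raw float32 vectors of dimension 1024.
embeddings = np.fromfile(
    str(CACHE_PATH / "tatoeba.cmn-eng.cmn.enc"), dtype=np.float32)
embeddings = embeddings.reshape(-1, 1024)
print(embeddings.shape)  # expected: (1000, 1024) for the 1,000 sentences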

And finally we create a FAISS index from the file containing the embeddings:

data_zh, index_zh = IndexCreate(
    str(CACHE_PATH / "tatoeba.cmn-eng.cmn.enc"), 'FlatL2',
    verbose=True, save_index=False)

A lot of temporary files are involved, which can be a bit annoying. This could definitely be improved, for example by providing an end-to-end function that hides the temporary-file bookkeeping from the user.
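The error rates below can be computed along these lines. LASER ships its own helpers for this; similarity_error is my own illustration, and it assumes data_en and index_en were built from the English file the same way as the Chinese ones:

import numpy as np

def similarity_error(query_data, target_index):
    # Sentence i in one language should retrieve sentence i in the other;
    # any nearest neighbour with a different row index counts as an error.
    _, neighbours = target_index.search(query_data, 1)
    return 100.0 * (neighbours[:, 0] != np.arange(len(query_data))).mean()

print(f"similarity error zh=>en {similarity_error(data_zh, index_en):.2f}%")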

similarity error en=>zh 4.10%

similarity error zh=>en 5.00%

The model yielded exactly the same error rates as reported in the paper [1]. Let's take a look at some error cases. (The sentences are tokenized and BPE'd.)

English to Chinese:

source:  i 'm at a loss for words .
predict: 我@@ 興@@ 奮 得 說 不 出@@ 話 來   。
correct: 我 不 知道 應@@ 該 說 什麼 才 好   。

source:  i just don 't know what to say .
predict: 我 不 知道 應@@ 該 說 什麼 才 好   。
correct: 我 只是 不 知道 應@@ 該 說 什麼 而@@ 已   ..@@ ....

source:  you should sleep .
predict: 你 应该 睡@@ 觉   。
correct: 你 應@@ 該 去 睡@@ 覺 了 吧   。

source:  so fu@@ ck@@ in ' what .
predict: 這@@ 是 什麼 啊   ?
correct: 那 又 怎@@ 樣   ?

Chinese to English:

source:  我 不 知道 應@@ 該 說 什麼 才 好   。
predict: i just don 't know what to say .
correct: i 'm at a loss for words .

source:  你 應@@ 該 去 睡@@ 覺 了 吧   。
predict: you should go to bed .
correct: you should sleep .

source:  那 又 怎@@ 樣   ?
predict: what is this ?
correct: so fu@@ ck@@ in ' what .

source:  我們 之@@ 間 已@@ 經 沒@@ 有 感@@ 情 了   。
predict: it would be better for both of us not to see each other any@@ more .
correct: i don 't like him any more than he lik@@ es me .

Most predictions in these error cases are actually not far from the correct answers; in some cases they are almost semantically identical (e.g., “you should go to bed” vs. “you should sleep”). The results are quite impressive.

Chinese to English Mapping of Article Titles (the New York Times)

NYTimes Notebook.

Similar to the previous BERT post, we use RSS feeds from the New York Times to extract article titles. This time, however, I used the Feedly API to read the feeds, so you can try it on your end (no Feedly account required):

import requests

def fetch_latest(feed_url, count=500):
    # Query the public Feedly Streams API for the latest entries of a feed.
    res = requests.get(
        'https://cloud.feedly.com/v3/streams/contents'
        f'?streamId=feed/{feed_url}&count={count}')
    return res.json()
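For example (the feed URL below is purely illustrative, and the items/title fields assume the standard Feedly response shape):

# Hypothetical usage; substitute the RSS feed you actually want to read.
payload = fetch_latest("https://cn.nytimes.com/rss/", count=100)
titles = [entry["title"] for entry in payload.get("items", [])]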

Since not every article on the NYT Chinese site is translated from an English one, I downloaded the web pages and automatically extracted the corresponding English titles where they exist.

The nearest 3 (English) neighbors were taken as the top 3 predictions:

_, matched_indices = index_en.search(data_zh, 3)
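The accuracies below can be derived from matched_indices along these lines (a sketch assuming row i of data_zh corresponds to English title i):

import numpy as np

reference = np.arange(matched_indices.shape[0])
top1 = (matched_indices[:, 0] == reference).mean()
top3 = (matched_indices == reference[:, None]).any(axis=1).mean()
print(f"Top 1 Accuracy: {top1:.2%}")
print(f"Top 3 Accuracy: {top3:.2%}")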

And we got:

Top 1 Accuracy: 47.37%

Top 3 Accuracy: 57.89%

Around 50% accuracy may not look very good, but let's look at some error cases before jumping to conclusions:

Chinese:    美国将禁止中国设备进入5G市场
Correct:    Administration Readies Order to Keep China Out of Wireless Networks
Predict(1): Key Senator Warns of Dangers of Chinese Investment in 5G Networks
Predict(2): China Warns 2 American Warships in South China Sea
Predict(3): In 5G Race With China, U.S. Pushes Allies to Fight Huawei

--------------------

Chinese:    寻找艾衣提:抗议和诗歌的火种在维族音乐中燃烧
Correct:    MUSIC; In a Far-Flung Corner of China, a Folk Star
Predict(1): In China, Dolce & Gabbana Draws Fire and Accusations of Racism on Social Media
Predict(2): Album Review: Ariana Grande Is Living a Public Life. The Real Reveals Are in Her Music.
Predict(3): Ducking and Weaving: Corbyn’s Vanishing Act on Brexit

--------------------

Chinese:    劳工维权给习近平的“中国梦”蒙上阴影
Correct:    Workers’ Activism Rises as China’s Economy Slows. Xi Aims to Rein Them In.
Predict(1): Pessimism Looms Over Prospect of a Sweeping China Trade Deal
Predict(2): Wall Street Slides on Renewed U.S.-China Trade Fears
Predict(3): China’s Ambassador to Canada Blames ‘White Supremacy’ in Feud Over Arrests

--------------------

Chinese:    决定成功的“两种规则”
Correct:    The Two Codes Your Kids Need to Know
Predict(1): A Tale of Two Trumps
Predict(2): The Case Against ‘Border Security’
Predict(3): Personal Stories Behind the ‘Green Book’

The Chinese titles are mostly not direct translations of the English ones, so it's understandable that an encoder pretrained on translation tasks did not judge them to be near-identical. That said, the predictions can be really off sometimes, as in the last case.

Conclusions and Future Work

LASER provides a pretrained LSTM encoder that can take inputs from 92 languages (plus close siblings in the same language families) and map them into a shared embedding space.

The pretrained encoder by itself is already quite useful for similarity search between sentences in different languages. The semantic features it can map properly are, however, limited by the training corpus (see Appendix A in [1]). The news corpus used in training, Global Voices, is relatively small; some of the bizarre cases in the New York Times example can probably be attributed to this.

We did not evaluate zero-shot transfer tasks in this post because I haven't thought of an interesting dataset to try, other than XNLI [4] and MLDoc [5], which the paper already used [1]. I might write another post if I find one.

References

  1. Mikel Artetxe and Holger Schwenk, *Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond*, arXiv, 26 Dec 2018.

  2. Zero-shot transfer across 93 languages: Open-sourcing enhanced LASER library

  3. Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, *Sequence to Sequence Learning with Neural Networks*, NIPS, 2014.

  4. Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R. Bowman, Holger Schwenk and Veselin Stoyanov, *XNLI: Cross-lingual Sentence Understanding through Inference*, EMNLP, 2018.

  5. Holger Schwenk and Xian Li, *A Corpus for Multilingual Document Classification in Eight Languages*, LREC, pages 3548–3551, 2018.
