# Using Julia to Do Whole Word Masking

## Introduction

In my last post, [Failure Report] Distill Fine-tuned Transformers into Recurrent Neural Networks, I tried to distill the knowledge of a fine-tuned BERT model into an LSTM or GRU model without any data augmentation, and failed to achieve satisfactory results. As a follow-up, I tried to replicate the easiest-to-implement augmentation method used in [1], masking, and see its effect. The masking described in [1] is called "whole word masking" [2]: masking all the pieces of a word instead of just a single word piece.

It is non-trivial to implement whole word masking, as it requires the sampling process to be aware of which word pieces are whole words by themselves and which are parts of a word. As you may know, text processing in pure Python is quite slow compared to compiled languages. I recently picked up the Julia programming language, which promises the flexibility of a scripting language with the speed of a compiled language, and thought this was a good opportunity to put Julia to the test.

This post describes the Julia code I wrote for this task and shows that, for this specific task, the Julia code is as simple to write as Python while running up to 100x faster than its pure Python counterpart.

### The Algorithm

This is the algorithm I used to do whole word masking (given that the examples are already tokenized to word pieces):

1. For each example, mark all the word pieces that are either a whole word or the first piece of a word (by using a mask).
2. Randomly sample N marked pieces for each example (N is a hyper-parameter).
3. Replace the selected pieces with "[MASK]".
4. Check if the next piece is a part of this word (tokens start with “##” in BERT tokenizer). If so, also replace it with “[MASK]”.
5. Repeat step 4 until the condition is false or the end of the example is reached.
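The steps above can be sketched in a few lines of Python. This is only a toy sketch operating on a single already-tokenized example; `mask_whole_word` is a hypothetical helper, not code from [1] or [2]:

```python
import random

def mask_whole_word(pieces, n=1):
    # Step 1: mark pieces that start a word ("##" marks a continuation piece)
    first = [not p.startswith("##") for p in pieces]
    # Step 2: randomly sample n marked positions
    starts = [i for i, f in enumerate(first) if f]
    for pos in random.sample(starts, n):
        # Step 3: replace the selected piece with "[MASK]"
        pieces[pos] = "[MASK]"
        # Steps 4-5: keep masking while the next piece belongs to the same word
        while pos + 1 < len(pieces) and not first[pos + 1]:
            pos += 1
            pieces[pos] = "[MASK]"
    return pieces

# "playing" and "football" each tokenize into two word pieces
print(mask_whole_word(["play", "##ing", "foot", "##ball"], n=1))
```

Whichever word is drawn, both of its pieces end up masked, which is exactly the property that plain word-piece masking lacks.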

## Benchmarks

Notebook used in this section:

### Summary

(Mean run times are compared here, as the `%timeit` magic does not report medians.)

- Tokenizing examples: 15 seconds (shared by both the Python and Julia pipelines)
- Adding Special Tokens
  - Python: 42 ms (estimated)
  - Julia: 41 ms
- Marking First Pieces
  - Python: 326 ms
- Sample One Word to Mask
  - Python: 8.2 s (using `numpy.random.choice`)
  - Julia: 69 ms
- Masking
  - Python: 725 ms (copying the examples)
  - Julia: 426 ms (copying the examples)
  - Python: 300 ms (estimated)
  - Julia: 10 ms
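The Python timings come from the `%timeit` magic inside a notebook; outside a notebook, a comparable measurement can be made with the standard `timeit` module. A sketch, where `step` is a stand-in for whichever pipeline stage is being timed:

```python
import timeit

def step(masks):
    # stand-in for a pipeline stage, e.g. counting first pieces
    return [sum(m) for m in masks]

masks = [[True, False, True]] * 1000
# run the callable many times and take the best of several repeats,
# similar in spirit to what %timeit reports
runs = timeit.repeat(lambda: step(masks), number=100, repeat=3)
print(min(runs) / 100)  # seconds per call
```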

#### Remarks

- The most time-consuming part is tokenizing the examples, so in practice optimizing the tokenizer has the most potential. (That is why Hugging Face re-implemented the word-piece tokenizers in Rust.)
- Still, the eight seconds saved on sampling by switching to Julia is a significant improvement, and it took only a few lines to implement.
- Copying the examples takes around 300 to 500 ms and is the most expensive operation besides tokenization, so avoid it if possible. (If you need to augment the same dataset multiple times, though, you have no choice but to copy the examples.)
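To illustrate the last point: when the original examples are not needed afterwards, the copy can be skipped by mutating them directly. A minimal in-place sketch in Python (`masking_inplace` is a hypothetical variant, not code from the notebook):

```python
def masking_inplace(rows, first_piece_masks, masking_points):
    # mutates rows directly; no deepcopy, so the originals are overwritten
    for idx, points in enumerate(masking_points):
        for pos in points:
            rows[idx][pos] = "[MASK]"
            # also mask the remaining pieces of the same word
            while pos + 1 < len(first_piece_masks[idx]) and not first_piece_masks[idx][pos + 1]:
                pos += 1
                rows[idx][pos] = "[MASK]"
    return rows
```

The trade-off is that each dataset can only be augmented once this way, which is why the benchmarked version pays for the copy.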

### Adding Special Tokens

A simple operation that adds "[CLS]" to the head and "[SEP]" to the tail of each example. Python and Julia are equally fast in this one.

#### Python

```python
from copy import deepcopy

def add_special_tokens(sentence):
    sentence.insert(0, "[CLS]")
    sentence.append("[SEP]")

tmp = deepcopy(sentences)
for sentence in tmp:
    add_special_tokens(sentence)
```


#### Julia

```julia
function add_special_tokens!(sentence)
    pushfirst!(sentence, "[CLS]")
    push!(sentence, "[SEP]")
end

tmp = deepcopy(sentences)
add_special_tokens!.(tmp)
```


### Marking First Pieces

Create binary masks that filter out the word pieces that are not the first piece of a word. Julia starts to outperform Python here.

#### Python

```python
def is_first_piece(tokens):
    return [not token.startswith("##") for token in tokens]

first_piece_masks = [is_first_piece(sent) for sent in sentences]
```


#### Julia

```julia
function is_first_piece(arr::Array{String,1})
    return .!startswith.(arr, "##")
end

results = is_first_piece.(sentences)
```


A multi-threaded version is also provided, which can sometimes be faster depending on your hardware:

```julia
results = [Bool[] for _ in 1:length(sentences)]
Threads.@threads for i in 1:length(sentences)
    results[i] = is_first_piece(sentences[i])
end
```


### Sampling

Randomly sample one word from each example to be masked. Since I can't think of any simple way to vectorize this in Python, a naive for-loop approach is used. Vectorizing in Julia, on the other hand, is fairly straightforward. As a result, the Julia version is vastly faster (100x) than the Python one.

Note: I used NumPy in the Python implementation, so it's not really "pure Python" in this case.

#### Python

```python
import numpy as np

def sample(first_piece_masks, n=1):
    results = []
    for mask in first_piece_masks:
        candidates = np.nonzero(mask)[0]
        if len(candidates) < n:
            results.append([])
            continue
        results.append(np.random.choice(candidates, n, replace=False))
    return results
```


#### Julia

```julia
using StatsBase

function sample_words(first_piece_mask, n=1)
    candidates = findall(first_piece_mask)
    if length(candidates) < n
        return Int64[]
    end
    return sample(candidates, n, replace=false)
end

masking_points = sample_words.(first_piece_masks)
```


### Masking

Whole word masking itself. This one inevitably has to use some loop to scan the example. For loops are not a problem for Julia, so the Julia version is much faster (30x) than the Python one.

The implementation presented here copies the examples inside the function so the original examples can be augmented multiple times.

#### Python

```python
from copy import deepcopy

def masking(rows, first_piece_masks, masking_points):
    augmented_rows = deepcopy(rows)
    for idx, points in enumerate(masking_points):
        for pos in points:
            augmented_rows[idx][pos] = "[MASK]"
            # also mask the following pieces of the same word
            while pos + 1 < len(first_piece_masks[idx]) and not first_piece_masks[idx][pos + 1]:
                pos += 1
                augmented_rows[idx][pos] = "[MASK]"
    return augmented_rows
```


#### Julia

```julia
function masking(rows::Vector{Vector{String}}, first_piece_masks::Vector{Vector{Bool}}, masking_points::Vector{Vector{Int64}})
    augmented_rows = deepcopy(rows)
    for (idx, points) in enumerate(masking_points)
        for pos in points
            augmented_rows[idx][pos] = "[MASK]"
            # also mask the following pieces of the same word
            while pos + 1 <= length(first_piece_masks[idx]) && first_piece_masks[idx][pos + 1] == 0
                pos += 1
                augmented_rows[idx][pos] = "[MASK]"
            end
        end
    end
    return augmented_rows
end
```