Text Analysis using Julia

Overview: I conducted some exploratory analysis on the title field of the “Shopee - Price Match Guarantee” dataset. I wanted to know how similar the titles are within the same group, to get a rough idea of how useful the field would be in determining whether two listings belong to the same group. I used StringDistances.jl for raw string analysis and WordTokenizers.jl for token analysis. Instead of Jupyter Notebook, I used Pluto.jl to get reactive notebooks with a more presentable visual design right out of the box. The experience was a blast. Writing in Julia was not as hard as I expected, and the end result is very clean and blazing fast. ...

May 1, 2021 · Ceshine Lee

How to Reduce the Loading Time of Julia Scripts

Motivation: Julia is a promising new language for scientific computing and data science. In an earlier post, I demonstrated that whole word masking can be a lot faster (up to 100x) in Julia than in Python. Julia's speed comes from its use of JIT compilation (rather than the interpreters used by R and Python). However, this design also impedes Julia’s ambition as a general-purpose language, since ten seconds of precompiling time for a simple script is unacceptable for most use cases. ...

April 18, 2021 · Ceshine Lee

Using Julia to Do Whole Word Masking

Introduction: In my last post, [Failure Report] Distill Fine-tuned Transformers into Recurrent Neural Networks, I tried to distill the knowledge of a fine-tuned BERT model into an LSTM or GRU model without any data augmentation and failed to achieve satisfactory results. In the follow-up work, I tried to replicate the easiest-to-implement augmentation method used in [1], masking, and see its effect. The masking described in [1] is called “whole word masking” [2], that is, masking the whole word instead of just a single word piece. ...

June 28, 2020 · Ceshine Lee