Random Sampling at the Command Line

When you receive a large dataset to analyze, you’ll probably want to take a look at the data before fitting any models on it. But what if the dataset is too big to fit into memory? One way to deal with this is to take much smaller random samples from the dataset, which gives you a rough idea of what’s going on. (Of course, you cannot get the global maximum or minimum through this approach, but those kinds of statistics can easily be obtained in linear time with minimal memory requirements.) ...

January 22, 2015 · Ceshine Lee
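
To make the sampling idea in the excerpt above concrete, here is a minimal sketch of reservoir sampling (Algorithm R) in Python. It is not taken from the post, which works at the command line; the function name `reservoir_sample` and its parameters are hypothetical. It keeps only k items in memory while reading the data once, so it suits files that do not fit into memory.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream.

    Hypothetical helper for illustration: one pass, O(k) memory.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for index, item in enumerate(items):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, index]; the new item survives
            # only if that slot falls inside the reservoir.
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = item
    return reservoir
```

Because a file object yields its lines lazily, passing an open file (for example `reservoir_sample(open(path), 1000)`, with `path` being whatever file you want to inspect) reads the data line by line rather than loading it all at once.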

Implement FTRL-Proximal Algorithm in Go - Part 2

I actually finished the concurrent version of the algorithm a while ago, right after the previous post. Unfortunately, my laptop broke and it took almost a month to repair. Now I finally get to publish the result here. I know that the code is neither elegant nor properly documented, but it’s a start. You’ll need to set the core variable in the main function to the number of cores of your CPU. The program will simultaneously train that many models and predict the average of the predictions from each model. ...

January 2, 2015 · Ceshine Lee
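
The "train several models at once, then average their predictions" idea from the excerpt above can be sketched roughly as follows. This is an illustration in Python, not the post's Go code. The excerpt does not say how the models differ, so giving each model its own data shard is an assumption, and `train_constant_model`, `train_models`, and `ensemble_predict` are hypothetical names. The trainer is a trivial stand-in, not FTRL-Proximal.

```python
from concurrent.futures import ThreadPoolExecutor


def train_constant_model(labels):
    """Hypothetical stand-in trainer: 'learns' the mean label of its shard."""
    mean_label = sum(labels) / len(labels)
    # The returned model ignores its input and always predicts the mean.
    return lambda features: mean_label


def train_models(shards, workers):
    """Train one model per shard, running up to `workers` trainers at a time."""
    # Threads are used only to show the structure; a CPU-bound trainer in
    # CPython would more likely use processes to occupy several cores.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(train_constant_model, shards))


def ensemble_predict(models, features):
    """Average the predictions of all models for one input."""
    predictions = [model(features) for model in models]
    return sum(predictions) / len(predictions)
```

Here `workers` plays the role that the core variable plays in the post: it bounds how many models are trained at the same time.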

Implement FTRL-Proximal Algorithm in Go - Part 1

For the sake of practice, I’ve rewritten tinrtgu’s solution to the Avazu challenge on Kaggle in Go. I’ve made some changes to save more memory, but the underlying algorithm is basically the same. (See this paper, where the algorithm comes from, for more information.) The Go code has been put on GitHub Gist. Any constructive comments are welcome on that gist page, as I haven’t added a comment section to this blog. (I haven’t even set up Google Analytics, so I have no idea how many people are reading this blog.) I’m also working on a concurrent version that uses Go’s built-in support for concurrency, so theoretically it should run faster in a multi-core environment. ...

December 9, 2014 · Ceshine Lee

The Power of PyPy

PyPy is an alternative Python implementation that emphasizes speed and memory usage. I didn’t take it seriously until I wrote a Python script for a Kaggle competition that required hours to run. I read a post on the Kaggle forum suggesting that everyone give PyPy a try. I did, and it worked like magic. A speed boost of roughly 2 to 5 times can be achieved just by substituting python with pypy when you run a Python script. I don’t have an accurate number for that, but it was significantly faster. This is critical because now you have more time to try different models and hence get a better score in the competition. ...

November 29, 2014 · Ceshine Lee

Tip for using iPython Notebooks in virtualenv

When trying to install ipython and the dependencies of its notebook feature via pip, I got stuck. Even though I’d already installed pyzmq, I still got the message "ImportError: IPython.zmq requires pyzmq". It was quite frustrating until I found this post on Stack Overflow. It turns out this can be solved by installing pyzmq with an extra parameter: pip install pyzmq --install-option="--zmq=bundled"

April 29, 2014 · Ceshine Lee