Use MPIRE to Parallelize PostgreSQL Queries

Parallel programming is hard, and in most cases you probably should not use a low-level API to do it (I’d argue that Python’s built-in multiprocessing package is low-level). I’ve been using Joblib’s Parallel class for embarrassingly parallel tasks, and it works wonderfully. However, sometimes the task at hand is not simple enough for the Parallel class (e.g., you need to share something from the main process that is not picklable, or you want to maintain state in each child process). I recently found a library, MPIRE (MultiProcessing Is Really Easy), that significantly mitigates this lack of flexibility while still offering a high-level, user-friendly API. ...

January 7, 2022 · Ceshine Lee

Random Sampling Data with Header

I’ve mentioned a handy script called sample, which randomly samples row/record-based data with a given probability. One major problem with this script is that it doesn’t account for data with a header row specifying field names: it samples the header row like every other row. It’s not the end of the world, though; two lines of head and cat commands can easily fix that. But it has become more and more annoying to do this every time. ...

July 31, 2015 · Ceshine Lee

Random Sampling at the Command Line

When you receive a large dataset to analyze, you’ll probably want to take a look at the data before fitting any models to it. But what if the dataset is too big to fit into memory? One way to deal with this is to take much smaller random samples from the dataset, which gives you a rough idea of what’s going on. (Of course, you cannot get the global maximum or minimum this way, but statistics of that kind can easily be obtained in linear time with minimal memory requirements.) ...

January 22, 2015 · Ceshine Lee

A simple script to automate MySQLdump backups

I just moved my MySQL database to an OpenVZ VPS, which doesn’t support snapshot backups, so I had to set up a backup mechanism myself. The solution I came up with is to use BitTorrent Sync to sync my backups to another server. It turns out to be much faster than transferring backups using scp and much easier (and perhaps more secure) than using FTP. I highly recommend BitTorrent Sync. ...

March 5, 2014 · Ceshine Lee