Random Sampling at the Command Line

Taking random samples from a large set

Jan 22, 2015 · 166 words · 1 minute read data_eng

When you receive a large dataset to analyze, you probably want to take a look at the data before fitting any models to it. But what if the dataset is too big to fit into memory?

One way to deal with this is to take much smaller random samples from the dataset. They will give you a rough idea of what's going on. (Of course, you cannot recover the global maximum or minimum this way, but those kinds of statistics can be computed in linear time with minimal memory anyway.)
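For instance, a single-pass minimum and maximum over one numeric column can be sketched with awk; the comma delimiter and the choice of the first column are just assumptions for illustration:

    # single pass: min and max of the first column (assumed numeric, header skipped)
    tail -n +2 training_data.csv \
      | awk -F',' 'NR == 1 {min = $1; max = $1}
                   {if ($1 < min) min = $1; if ($1 > max) max = $1}
                   END {print "min:", min, "max:", max}'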

Two command-line utilities I've found quite useful: shuf and sample (the latter from Data Science at the Command Line).

shuf can be used to randomly pick N records from the population:

    shuf -n 100 training_data.csv
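
One caveat: shuf treats every line the same, so on a CSV it may shuffle the header row into (or out of) the sample. A minimal workaround (the output filename sample.csv is only for illustration) is to copy the header first and sample from the remaining lines:

    # keep the header, then append 100 randomly chosen data rows
    head -n 1 training_data.csv > sample.csv
    tail -n +2 training_data.csv | shuf -n 100 >> sample.csv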

sample can be used to randomly keep a given fraction of the lines, here 30%:

    sample -r 0.3 training_data.csv
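
If sample isn't installed, a rough equivalent of the same per-line sampling can be sketched with awk, which keeps each line independently with probability 0.3 (note it doesn't protect the header row either):

    # keep each line with probability 0.3; srand() seeds from the current time
    awk 'BEGIN {srand()} rand() < 0.3' training_data.csv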

These are simple examples, but they should suffice in most cases. You can check out the documentation yourself and get creative if you feel like it.
