Random Sampling Data with Header

Jul 31, 2015 · 119 words · 1 minute read data_eng

I’ve previously mentioned a handy script called sample, which randomly samples row/record-based data with a given probability. One major problem with this script is that it doesn’t account for data with a header row specifying field names: it samples the header row like every other row. It’s not the end of the world, though; two lines of head and cat commands can easily fix that. But it has become more and more annoying to do this every time.
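For reference, here is one way to do that by hand. The exact commands aren't shown in this post, so this is only a sketch of the idea: pass the first line through untouched, then run any command on the remaining lines. The function name with_header is hypothetical and not part of the sample script.

```shell
# Hypothetical helper (not part of the sample script):
# print the first line of stdin unchanged, then run the given
# command on whatever is left of stdin.
with_header() {
  IFS= read -r header          # consume only the header line
  printf '%s\n' "$header"      # emit it as-is
  "$@"                         # the command sees the remaining rows
}
```

With the original script on the PATH, it could be called as `with_header sample -r 10% < input.csv > output.csv`.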

So I’ve modified the original script to take headers into account. The code is on GitHub Gist.

Simply add --header to the command whenever you need to ensure the header is included in the output:

sample -r 10% --header < input.csv > output.csv
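The Gist itself isn't reproduced here, so the following is just a rough sketch of the idea behind --header, assuming each data row is kept independently with the given probability. It uses awk as a stand-in; the function name sample_with_header and its 0-to-1 probability argument are assumptions for illustration, not the script's actual interface.

```shell
# Hypothetical stand-in for header-aware sampling (not the actual script):
# always keep line 1, keep every other line with probability $1.
sample_with_header() {
  # $1: probability between 0 and 1, e.g. 0.1 for 10%
  awk -v p="$1" '
    BEGIN { srand() }            # seed the random generator from the clock
    NR == 1 || rand() < p        # header always passes; other rows are sampled
  '
}
```

For example, `sample_with_header 0.1 < input.csv > output.csv` would keep the header plus roughly 10% of the rows.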