First Step of Web Scraping in Go

A fair amount of web scraping is often required for web-related data science projects. Python has a well-known scraping framework called Scrapy, which aims to accommodate all kinds of scenarios. For those who want more control over the process and don’t mind getting their hands dirty, GRequests (or the good old Requests) combined with BeautifulSoup can also be a solid solution. However, multi-threading in Python can be a real pain in the neck. Scrapy also depends on Twisted, which is not yet Python 3-ready, and there is no clear roadmap for when the project will finish migrating to Python 3.x. These constraints made me start looking for faster and more robust alternatives. ...

August 29, 2015 · Ceshine Lee

Plotly Example: Deaths Caused By Cancer in Taiwan

I’ve been looking for a way to easily develop and share data visualizations. I don’t want static image files because of their inflexibility, and creating every plot with D3.js seems like overkill. Plotly, a web service that creates plots based on D3 and provides APIs for both Python and R, has so far been a very good match for my needs. To get started, you can read this tutorial for R, or the official documentation. ...

August 13, 2015 · Ceshine Lee

Random Sampling Data with Header

I’ve mentioned a handy script called sample, which can randomly sample row/record-based data with a given probability. One major problem with this script is that it doesn’t account for data with a header row specifying field names. It samples the header row like every other row. It’s not the end of the world though; two lines of head and cat commands can easily fix that. But it has become more and more annoying to do this every time. ...

July 31, 2015 · Ceshine Lee

Migrated the Blog from Pelican to Hugo

I’ve been using Pelican to build blog.ceshine.net for about two years, and as you can see, I’ve not been very productive. Part of the reason is that I spent more time tuning the code than actually writing. Recently the Go-based Hugo caught my attention. Go can easily compile multi-platform binary executables, which makes deployment much easier. Hugo also provides a decent built-in web server whose performance is good enough for some small-scale production use. So after some experimenting, I decided to replace the old Pelican site with Hugo. ...

July 28, 2015 · Ceshine Lee

Bayesian Logistic Regression using PyMC3

I’ve been reading the amazing (and free) book Bayesian Methods for Hackers. I was halfway through it in early 2015, but dropped it because of some nuisances. When I finally started reading it again, I found that the break might have been a good thing. Now I have more appreciation for Bayesian methods and enough mathematical understanding to fully grasp the ideas the book is trying to convey. (To be honest, I was quite confused about some concepts, like MAP, in my first round of reading.) ...

July 11, 2015 · Ceshine Lee