How to Extract a Webpage’s Main Article Content: The Unicode Edition

When I originally wrote html2text.py, I only cared about extracting English text from webpages, so I gave little thought to Unicode; handling ASCII alone was enough. Recently, however, I was commissioned to extend the script to support Unicode, so that it could extract text in nearly any language found on the Web. The modifications turned out to be relatively straightforward, helped along by the handy chardet package, which guesses the character encoding of raw bytes. I also cleaned up the command-line interface to make the various parameters more accessible.
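To give a rough idea of how chardet fits in, here is a minimal sketch of decoding a downloaded page into Unicode text. It is an illustration, not the code from the script: the helper name decode_html and the UTF-8 fallback are assumptions of mine, and it is written for Python 3. The only chardet call used is chardet.detect, which takes bytes and returns a dict containing an "encoding" guess (possibly None).

```python
try:
    import chardet  # third-party package: pip install chardet
except ImportError:
    chardet = None  # the sketch still runs, using only the fallback encoding


def decode_html(raw, fallback="utf-8"):
    """Decode raw page bytes to text, guessing the encoding with chardet.

    Hypothetical helper for illustration; not taken from the actual script.
    """
    encoding = None
    if chardet is not None:
        # detect() returns a dict with an "encoding" key; the value
        # is None when chardet cannot make a guess (e.g. empty input).
        encoding = chardet.detect(raw).get("encoding")
    try:
        # errors="replace" swaps undecodable bytes for U+FFFD
        # instead of raising, since the guess may be wrong.
        return raw.decode(encoding or fallback, errors="replace")
    except LookupError:
        # chardet named a codec this Python build does not know.
        return raw.decode(fallback, errors="replace")
```

Calling decode_html on the bytes returned by an HTTP request yields a str that the rest of the extraction can work on. Detection is a statistical guess, so very short inputs may be misidentified; when a page declares its charset in the HTTP headers or a meta tag, that declaration is usually the better first choice.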

You can download the code here.

Update 2011-12-17: To avoid naming conflicts with similar scripts, I’ve renamed the script to webarticle2text and published it to github.com.
