How to Extract a Webpage’s Main Article Content: The Unicode Edition
Posted on January 3rd, 2011 by Chris
When I originally wrote html2text.py, my focus was only on extracting English text from webpages, so I didn’t give much thought to handling Unicode; ignoring everything but ASCII was sufficient. However, I was recently commissioned to extend the script to handle Unicode, so that it could extract text in nearly any language found on the [...]
Filed under: Python