How to Extract a Webpage’s Main Article Content: The Unicode Edition

When I originally wrote html2text.py, I focused only on extracting English text from webpages, so I didn’t give much thought to Unicode; ignoring everything but ASCII was good enough. However, I was recently commissioned to extend the script to handle Unicode, so it could extract text in nearly any language found on the [...]