How to Extract a Webpage’s Main Article Content
The Idea
I had an idea to make a personalized news feed reader. Basically, I’d register a bunch of feeds with the application, and rate a few stories as either “good” or “bad”. The application would then use my ratings and the article text to generate a statistical model, apply that model to future articles, and only recommend those it predicted I would rate as “good”. It sounded like a plausible idea. I decided to start a pet project.
I soon learned that this idea wasn’t original, and in fact had been attempted by quite a few companies. The first to seriously implement this idea was Findory, later followed by Thoof, Tiinker, Persai, and probably others I’m not aware of. As of this writing, only Persai is still in business. Apparently, personalized news feeds aren’t terribly profitable. Why they’re not a commercial hit is a whole article in itself, so I won’t go into it now. However, before I admitted to myself that this project was doomed to failure, I decided to implement a few components to get a better feel for how the system would work. This is a review of a few interesting things I learned along the way.
The Problem
Obviously, the first step to any would-be social news indexing service is to download the target webpages and extract their article text. After you read a lot of non-standard HTML, you start developing a bit of respect for web browser developers who have to make sense of this mess. Most webpages have a pretty low signal-to-noise ratio. There are banners, navigation links, side-bars, headers, footers, meta tags, related-story clips, unrelated story clips, and a lot of other junk we don’t care about. So how do we take this mess and extract not just the text, but the main article text? It’s pretty easy to look for all the <div> or <p> tags in an HTML document and extract their contents, but we’ll get a lot of this junk we don’t want.
Cleaning Up the Mess
In an ideal world, all webpages would be XML documents that could be easily validated and parsed. In the real world they’re a mess, but they’re close enough to XML that we can usually clean them up and parse away. With the help of the excellent tool Tidy, this turns out to be surprisingly easy. With Tidy, I can take the ugliest webpage and turn it into something I can feed into an XML parser, which can then extract the document’s text while ignoring the HTML markup, JavaScript, and other useless information.
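To give a feel for the text-extraction step, here is a minimal sketch. It does not use Tidy; as a stand-in it uses Python's standard-library html.parser, which tolerates unclosed tags, and simply collects text while skipping script and style contents. The names TextExtractor, extract_text, and the sample page are made up for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text chunks, skipping <script> and <style> contents.

    Illustrative stand-in only: the article's pipeline uses Tidy plus an
    XML parser, not this class.
    """

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the non-empty text chunks found in an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical messy page: unclosed <p> and <div>, inline script and style.
messy = """
<html><head>
<script>var tracking = "not article text";</script>
<style>p { color: red }</style></head>
<body><p>First paragraph<p>Second, with an unclosed tag
<div>Footer &amp; copyright</body>
"""

for chunk in extract_text(messy):
    print(chunk)
```

This gets the text out, but note that it has no idea which chunk belongs to the article and which to the footer. That is the part the rest of this post deals with.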
Enter the DOM
So now we have a clean document that we’ve fed to an XML parser to build a DOM, and we’ve extracted all the nodes that contain text. How do we distinguish the text used in the navigation bar, side-bar, or footer from the main article text? It turns out we can make an assumption that holds in 99% of the cases we’re interested in: the largest “clump” of text is usually the main article. By “clump” I mean text nodes that are grouped together based on their relative position in the DOM.
For example, suppose we have a document whose body is something like:
<body>
  <div>
    <ul>
      <li><a href="/home">Home</a></li>
      <li><a href="/politics">Politics</a></li>
      <li><a href="/health">Health</a></li>
      <li><a href="/travel">Travel</a></li>
      <li><a href="/about">About</a></li>
    </ul>
  </div>
  <div>
    <div>
      <p><b>MIAMI, Florida (CNN) </b> -- Hurricane Ike weakened slightly...
      <p>Ike hit Turks and Caicos Islands Sunday morning, leaving a trail of...
      <p>"It pretty much looks like an episode of 'The Twilight Zone,' " said...
      <p>Aftwood estimates at least 90 percent of homes he saw on the island were...
      <p>The possibility of similar devastation prompted state and local officials...
      <p>"Let's hope it's all a false alarm," Louisiana Gov. Bobby Jindal said...
    </div>
    <div>
      <p>Some side-story that we don't really care about.</p>
      <p>Another paragraph for this story.</p>
    </div>
    <div>
      <p>Yet another semi-related side-story that we still don't care about.</p>
      <p>Another paragraph for this story.</p>
      <p>Another paragraph for this story.</p>
      <p>Yet another paragraph for this story.</p>
    </div>
  </div>
  <div>© 2008 Cable News Network.</div>
</body>
Clearly, we don’t care about the navigation link text, or the two side-stories. Let’s break it down based on DOM location. We have six <p> tags in the first <div> tag of the second <div> tag of the body. We’ll represent this location as a list of indexes, like (2,1,*), where the * stands for every child at that level. If we group all the text nodes in this fashion, and track how much text each group contains, we get a table like:
location = characters
(1,1,1,1) = 4
(1,1,2,1) = 8
(1,1,3,1) = 6
(1,1,4,1) = 6
(1,1,5,1) = 5
(2,1,*) = 500
(2,2,*) = 100
(2,3,*) = 250
(3) = 26
Using this, we can safely assume that either the main article text is located at position (2,1,*), or the document has more ads/spam/comments/side-stories/etc. than actual content. Once we know where the main text is located, we strip it out and move on.
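The grouping above can be sketched in a few lines of Python. This is one possible reading of the idea, not the code behind the demo. It assumes the page has already been cleaned into well-formed, namespace-free XHTML, and it uses the standard-library xml.etree.ElementTree. Every element that directly holds text is keyed by the index path of its parent followed by '*', so sibling paragraphs land in the same group. The function names and the sample document are invented for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def direct_text(element):
    """Text owned by this element itself, excluding text inside child elements."""
    parts = [element.text or ""]
    parts.extend(child.tail or "" for child in element)
    return " ".join(part.strip() for part in parts if part.strip())


def clump_sizes(body):
    """Group text by DOM location; return (char counts, collected texts)."""
    sizes = defaultdict(int)
    texts = defaultdict(list)

    def walk(element, path):
        text = direct_text(element)
        # Text sitting directly in <body> has an empty path and is ignored.
        if text and path:
            key = path[:-1] + ("*",)  # same parent means same group
            sizes[key] += len(text)
            texts[key].append(text)
        for index, child in enumerate(element, start=1):
            walk(child, path + (index,))

    walk(body, ())
    return sizes, texts


def main_article(xhtml):
    """Return (location key, text) of the largest clump under <body>."""
    body = ET.fromstring(xhtml).find("body")
    sizes, texts = clump_sizes(body)
    best = max(sizes, key=sizes.get)
    return best, "\n".join(texts[best])


# Hypothetical, already-tidied sample with placeholder article text.
SAMPLE = """<html><body>
<div><ul>
  <li><a href="/home">Home</a></li>
  <li><a href="/politics">Politics</a></li>
</ul></div>
<div>
  <div>
    <p>First paragraph of the main article, long enough to matter.</p>
    <p>Second paragraph of the main article, adding more weight.</p>
  </div>
  <div>
    <p>Some side-story that we don't really care about.</p>
  </div>
</div>
<div>© 2008 Cable News Network.</div>
</body></html>"""

location, article = main_article(SAMPLE)
print(location)
print(article)
```

On this sample the winning key is (2, 1, '*'), the two main paragraphs. One limitation of this sketch is that text inside inline tags such as <b> is counted in a separate, smaller group from the paragraph around it, so a fuller version would probably fold inline elements into their parent.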
Conclusion
This method proved to be a fairly simple and robust way to extract text from webpages. The main exceptions I found were news aggregators like Digg/Reddit/Slashdot, where the text blurb may be overshadowed by hundreds of user comments.
I implemented a simple Django web app with a JavaScript widget frontend to demo this method.
Hi Chris,
Nice article. Glad to see that somebody else cares about this problem too. Your approach is very clean. We implemented something that makes several passes over the DOM and snips out nodes driven by a few simple heuristics like “text weight” and such. We found that stuff like that works well for maybe 70% of the web pages out there. To reach the other ~30%, we took a data-driven approach (i.e. using the data we already crawled to figure out what markup is).
Still, we can’t get everything right, but it works Well Enough ™.
Also, the semantic search company Twine has attempted something like this.
Cheers,
Ted
(of Persai, now Pressflip)
Nice idea. Any chance of opening up the code so that people can learn? There are similar things for Perl, etc. but nothing for Python. So some code would be appreciated. Thanks.
Yes it seems strange you would go to all this effort, but not publish the code.
I too would appreciate looking at the code
seems to only pull out the comments for some pages. example:
http://www.cbsnews.com/blogs/2009/10/02/politics/politicalhotsheet/entry5359359.shtml
I implemented a very similar idea in Java a while ago. It doesn't come close to this one, but it might help people:
http://www.redbrick.dcu.ie/~gleesog4/Projects/page.html#extractmainarticle