<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments for chrisspenblog</title>
	<atom:link href="http://www.chrisspen.com/blog/comments/feed" rel="self" type="application/rss+xml" />
	<link>http://www.chrisspen.com/blog</link>
	<description>Just another WordPress weblog</description>
	<pubDate>Tue, 06 Jan 2009 13:38:11 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5.1</generator>
	<item>
		<title>Comment on How to Extract a Webpage&#8217;s Main Article Content by High-quality personal filtering &#171; Sri Spot</title>
		<link>http://www.chrisspen.com/blog/how-to-extract-a-webpages-main-article-content.html#comment-3</link>
		<dc:creator>High-quality personal filtering &#171; Sri Spot</dc:creator>
		<pubDate>Wed, 17 Sep 2008 14:24:24 +0000</pubDate>
		<guid isPermaLink="false">http://www.chrisspen.com/blog/?p=6#comment-3</guid>
		<description>[...] How to Extract a Webpage&#8217;s Main Article Content I had an idea to make a personalized news feed reader. Basically, I’d register a bunch of feeds with the application, and rate a few stories as either “good” or “bad”. The application would then use my ratings and the article text to generate a statistical model, apply that model to future articles, and only recommend those it predicted I would rate as “good”. It sounded like a plausible idea. I decided to start a pet project. [...]</description>
		<content:encoded><![CDATA[<p>[...] How to Extract a Webpage&#8217;s Main Article Content I had an idea to make a personalized news feed reader. Basically, I’d register a bunch of feeds with the application, and rate a few stories as either “good” or “bad”. The application would then use my ratings and the article text to generate a statistical model, apply that model to future articles, and only recommend those it predicted I would rate as “good”. It sounded like a plausible idea. I decided to start a pet project. [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on How to Extract a Webpage&#8217;s Main Article Content by Ted Dziuba</title>
		<link>http://www.chrisspen.com/blog/how-to-extract-a-webpages-main-article-content.html#comment-2</link>
		<dc:creator>Ted Dziuba</dc:creator>
		<pubDate>Tue, 16 Sep 2008 15:02:18 +0000</pubDate>
		<guid isPermaLink="false">http://www.chrisspen.com/blog/?p=6#comment-2</guid>
		<description>Hi Chris,

Nice article.  Glad to see that somebody else cares about this problem too.  Your approach is very clean.  We implemented something that makes several passes over the DOM and snips out nodes driven by a few simple heuristics like "text weight" and such.  We found that stuff like that works well for maybe 70% of the web pages out there.  To reach the other ~30%, we took a data-driven approach (i.e. using the data we already crawled to figure out what markup is).

Still, we can't get everything right, but it works Well Enough (tm).

Also, the semantic search company Twine has attempted something like this.

Cheers,

Ted
(of Persai, now Pressflip)</description>
		<content:encoded><![CDATA[<p>Hi Chris,</p>
<p>Nice article.  Glad to see that somebody else cares about this problem too.  Your approach is very clean.  We implemented something that makes several passes over the DOM and snips out nodes driven by a few simple heuristics like &#8220;text weight&#8221; and such.  We found that stuff like that works well for maybe 70% of the web pages out there.  To reach the other ~30%, we took a data-driven approach (i.e. using the data we already crawled to figure out what markup is).</p>
<p>Still, we can&#8217;t get everything right, but it works Well Enough &#8482;.</p>
<p>Also, the semantic search company Twine has attempted something like this.</p>
<p>Cheers,</p>
<p>Ted<br />
(of Persai, now Pressflip)</p>
]]></content:encoded>
	</item>
</channel>
</rss>
