<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>

<channel>
	<title>chrisspenblog</title>
	<atom:link href="http://www.chrisspen.com/blog/feed" rel="self" type="application/rss+xml" />
	<link>http://www.chrisspen.com/blog</link>
	<description>Just another WordPress weblog</description>
	<pubDate>Wed, 02 Dec 2009 04:42:45 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5.1</generator>
	<language>en</language>
			<item>
		<title>Implementing a Simple Machine Learning Intelligence to Play Rock-Paper-Scissors in Javascript</title>
		<link>http://www.chrisspen.com/blog/implementing-a-simple-machine-learning-intelligence-to-play-rock-paper-scissors-in-javascript.html</link>
		<comments>http://www.chrisspen.com/blog/implementing-a-simple-machine-learning-intelligence-to-play-rock-paper-scissors-in-javascript.html#comments</comments>
		<pubDate>Wed, 02 Dec 2009 04:42:45 +0000</pubDate>
		<dc:creator>Chris</dc:creator>
		
		<category><![CDATA[Javascript]]></category>

		<category><![CDATA[machine learning]]></category>

		<category><![CDATA[rock paper scissors]]></category>

		<guid isPermaLink="false">http://www.chrisspen.com/blog/?p=15</guid>
		<description><![CDATA[Rock-paper-scissors is not a complicated game. However, there are some that not only enjoy the game, but take pride in professing skill at it. But is it possible to have skill at rock-paper-scissors?
In the real world, where a human plays against a human, &#8220;cheating&#8221; is relatively easy. A player can gain an advantage by subtly [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://en.wikipedia.org/wiki/Rock-paper-scissors">Rock-paper-scissors</a> is not a complicated game. However, there are some that not only <a href="http://www.worldrps.com/">enjoy the game</a>, but take pride in <a href="http://www.usarps.com/">professing skill</a> at it. But is it possible to have skill at rock-paper-scissors?</p>
<p>In the real world, where a human plays against a human, &#8220;cheating&#8221; is relatively easy. A player can gain an advantage by subtly delaying their release just long enough to read their opponent&#8217;s body language and choose a winning gesture. For the sake of argument, let&#8217;s ignore this and assume the players play fairly, where each is completely unaware of their opponent&#8217;s move.</p>
<p>In this case, optimal play between two players making truly random gestures ends in a tie (on average). The only way to repeatedly win is to somehow predict your opponent&#8217;s gesture, which is impossible if your opponent is truly random. Only if your opponent has a bias can statistics be used to potentially model and predict their gesture.</p>
<p>To demonstrate this, I implemented three simple prediction routines.</p>
<p>To play a round against each routine, just click a link representing the gesture you&#8217;d like to make. The routine will then calculate its own gesture (not looking at your current choice of course), and then display the result.<br />
<script type="text/javascript" src="/ropasc/jquery-1.3.2.min.js"></script><br />
<script type="text/javascript" src="/ropasc/ropasc.js"></script></p>
<link media="screen" type="text/css" href="/ropasc/ropasc.css" rel="stylesheet" />
<p>1. Conditional-0<br />
All the opponent&#8217;s past gestures are counted, and these counts are used as weights when pseudo-randomly selecting a gesture to beat the one the opponent is expected to play. For example, if you&#8217;ve picked rock once, paper four times, and scissors five times, then there will be a 10% chance that the routine will choose paper (which beats rock), 40% for scissors (which beats paper), and 50% for rock (which beats scissors).</p>
<p></p>
<div id="frequency-brain"></div>
<p><script type="text/javascript">
$(document).ready(function(){
    g1 = (new ROPASC.Game('#frequency-brain', ROPASC.FrequencyBrain));
});
</script></p>
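<p>The weighting described for Conditional-0 can be sketched in a few lines. This is a minimal illustration in Python rather than the Javascript behind the widget, and the names BEATS, counter_weights, and choose are my own assumptions, not taken from ropasc.js.</p>

```python
import random
from collections import Counter

# Maps each gesture to the gesture that beats it.
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}


def counter_weights(history):
    """Return the probability of playing each gesture against this history."""
    counts = Counter(history)
    total = sum(counts.values())
    # The chance of playing a gesture equals how often the opponent
    # has shown the gesture that it beats.
    return {BEATS[g]: counts[g] / total for g in BEATS}


def choose(history, rng=random):
    """Pick a gesture at random, weighted by the opponent's past frequencies."""
    if not history:
        return rng.choice(list(BEATS))  # no data yet, so play uniformly
    weights = counter_weights(history)
    gestures = list(weights)
    return rng.choices(gestures, weights=[weights[g] for g in gestures])[0]


# The example from the text: rock once, paper four times, scissors five times.
history = ["rock"] + ["paper"] * 4 + ["scissors"] * 5
print(counter_weights(history))  # {'paper': 0.1, 'scissors': 0.4, 'rock': 0.5}
```

<p>Playing uniformly when there is no history yet is a choice made for this sketch; the real routine may start differently.</p>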
<p>2. Conditional-1<br />
This method tracks the last two gestures made by the opponent, and counts the number of each transition. These counts are again used to weight the random selection of a gesture. This allows it to see simple sequential patterns where an opponent&#8217;s next gesture is dependent on their last gesture.</p>
<p></p>
<div id="conditional-1-brain"></div>
<p><script type="text/javascript">
$(document).ready(function(){
    g2 = (new ROPASC.Game('#conditional-1-brain', ROPASC.Conditional1Brain));
});
</script></p>
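<p>The transition counting behind Conditional-1 might look roughly like the following. Again this is a Python sketch under my own assumptions (the function names and the sample history are made up for illustration), not the code in ropasc.js.</p>

```python
from collections import Counter, defaultdict

# Maps each gesture to the gesture that beats it.
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}


def transition_counts(history):
    """Count how often each gesture followed each previous gesture."""
    table = defaultdict(Counter)
    for previous, current in zip(history, history[1:]):
        table[previous][current] += 1
    return table


def counter_weights(history):
    """Weights for our next gesture, given the opponent's last gesture."""
    if not history:
        return {}
    seen = transition_counts(history)[history[-1]]
    # Weight each counter-gesture by how often the gesture it beats
    # followed the opponent's most recent gesture.
    return {BEATS[g]: n for g, n in seen.items()}


# After rock, this opponent played paper twice and scissors once.
history = ["rock", "paper", "rock", "paper", "rock", "scissors", "rock"]
print(counter_weights(history))  # {'scissors': 2, 'rock': 1}
```

<p>The resulting weights can be fed to the same weighted random pick used for Conditional-0.</p>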
<p>3. Conditional-2<br />
Similar to the previous method, this one tracks a combination of the opponent&#8217;s last gesture, and its own next-to-last gesture. This allows it to see simple dependence between the opponent&#8217;s gesture and its own.</p>
<p></p>
<div id="conditional-2-brain"></div>
<p><script type="text/javascript">
$(document).ready(function(){
    g3 = (new ROPASC.Game('#conditional-2-brain', ROPASC.Conditional2Brain));
});
</script></p>
<p>It&#8217;s a strange game in that the player with the most entropy wins in the long run. To see the code <a href="/ropasc/ropasc.js">click here</a>.</p>
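<p>One way to put a number on &#8220;entropy&#8221; here is the Shannon entropy of a player&#8217;s gesture frequencies. That reading is my assumption, and the Python sketch below only measures frequency bias, so it would not notice the sequential patterns that Conditional-1 looks for.</p>

```python
import math
from collections import Counter


def gesture_entropy(history):
    """Shannon entropy, in bits, of the gesture frequencies in history."""
    counts = Counter(history)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# An evenly mixed player versus one who almost always plays rock.
uniform = ["rock", "paper", "scissors"] * 10
biased = ["rock"] * 28 + ["paper", "scissors"]
print(gesture_entropy(uniform), gesture_entropy(biased))
```

<p>The evenly mixed player reaches the maximum for three gestures, log2(3) bits, while the rock-heavy player scores well below it and is correspondingly easier to predict.</p>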
]]></content:encoded>
			<wfw:commentRss>http://www.chrisspen.com/blog/implementing-a-simple-machine-learning-intelligence-to-play-rock-paper-scissors-in-javascript.html/feed</wfw:commentRss>
		</item>
		<item>
		<title>Handling PostgreSQL Integrity Errors in Django</title>
		<link>http://www.chrisspen.com/blog/handling-postgresql-integrity-errors-in-django.html</link>
		<comments>http://www.chrisspen.com/blog/handling-postgresql-integrity-errors-in-django.html#comments</comments>
		<pubDate>Fri, 18 Sep 2009 13:00:05 +0000</pubDate>
		<dc:creator>Chris</dc:creator>
		
		<category><![CDATA[Django]]></category>

		<category><![CDATA[MySQL]]></category>

		<category><![CDATA[PostgreSQL]]></category>

		<guid isPermaLink="false">http://www.chrisspen.com/blog/?p=14</guid>
		<description><![CDATA[I have some basic Django code that attempts to insert a new record into a PostgreSQL database. It&#8217;s wrapped in a try/except statement in case a unique constraint is violated, in which case I&#8217;ll typically ignore it and move on.
So the general pattern is:
from django.db import IntegrityError
import models
try:
    myNewRecord = models.MyTable(data=blah)
  [...]]]></description>
			<content:encoded><![CDATA[<p>I have some basic <a href="http://www.djangoproject.com/">Django</a> code that attempts to insert a new record into a <a href="http://www.postgresql.org/">PostgreSQL</a> database. It&#8217;s wrapped in a try/except statement in case a unique constraint is violated, in which case I&#8217;ll typically ignore it and move on.</p>
<p>So the general pattern is:</p>
<pre>from django.db import IntegrityError
import models
try:
    myNewRecord = models.MyTable(data=blah)
    myNewRecord.save()
except IntegrityError:
    pass

myOtherRecord = models.MyOtherTable(data=blah)
myOtherRecord.save() # Throws an error if an IntegrityError was previously thrown.</pre>
<p>I&#8217;d been previously using <a href="http://www.mysql.com/">MySQL</a>, but after switching to PostgreSQL I started getting a strange error when I tried to do any model access after an IntegrityError had been thrown, even though this access was outside the try/except:</p>
<p>&#8220;ProgrammingError: current transaction is aborted, commands ignored until end of transaction block&#8221;</p>
<p>What I eventually discovered, with a little <a href="http://www.errorhelp.com/search/details/69807/programmingerror-current-transaction-is-aborted-commands-ignored-until-end-of-transaction-block-appears-in-django-when-the-current-session-is-blocked">help</a>, is that PostgreSQL marks the whole transaction as aborted as soon as one statement in it fails, and rejects every later command until that transaction is ended. Catching the exception in Python does nothing to end it, so the failed transaction stays open until it is rolled back or the connection is closed.</p>
<p>The simple fix was to add connection.close() to my except clause:</p>
<pre>from django.db import IntegrityError, connection
import models
try:
    myNewRecord = models.MyTable(data=blah)
    myNewRecord.save()
except IntegrityError:
    connection.close() # Required to clear PostgreSQL's failed transaction.

myOtherRecord = models.MyOtherTable(data=blah)
myOtherRecord.save() # Now works correctly.</pre>
]]></content:encoded>
			<wfw:commentRss>http://www.chrisspen.com/blog/handling-postgresql-integrity-errors-in-django.html/feed</wfw:commentRss>
		</item>
		<item>
		<title>Enabling Implicit Cast From Integer To Boolean in PostgreSQL</title>
		<link>http://www.chrisspen.com/blog/enabling-implicit-cast-from-integer-to-boolean-in-postgresql.html</link>
		<comments>http://www.chrisspen.com/blog/enabling-implicit-cast-from-integer-to-boolean-in-postgresql.html#comments</comments>
		<pubDate>Tue, 18 Aug 2009 17:36:06 +0000</pubDate>
		<dc:creator>Chris</dc:creator>
		
		<category><![CDATA[PostgreSQL]]></category>

		<category><![CDATA[data conversion]]></category>

		<category><![CDATA[MySQL]]></category>

		<category><![CDATA[postgres]]></category>

		<guid isPermaLink="false">http://www.chrisspen.com/blog/?p=13</guid>
		<description><![CDATA[I&#8217;ve been using MySQL for a while with my pet projects, but recently I&#8217;ve been writing some more complicated queries, and I&#8217;m running into a few obnoxious limitations, mostly involving known view performance problems. I&#8217;ve decided to start testing the waters in PostgreSQL, so I tried importing my data into PG by running a MySQL-generated [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been using <a href="http://www.mysql.com">MySQL</a> for a while with my pet projects, but recently I&#8217;ve been writing some more complicated queries, and I&#8217;m running into a few obnoxious limitations, mostly involving known <a href="http://www.mysqlperformanceblog.com/2007/08/12/mysql-view-as-performance-troublemaker/">view performance problems</a>. I&#8217;ve decided to start testing the waters in <a href="http://www.postgresql.org">PostgreSQL</a>, so I tried importing my data into PG by running a MySQL-generated insert statement, only to run into a PostgreSQL newbie gotcha: &#8220;Column &lt;name&gt; is of type boolean but expression is of type integer&#8221;.</p>
<p>By default, PG does not automatically interpret &#8220;1&#8221; or &#8220;0&#8221; as a boolean. The argument in favor of this limitation is that it prevents &#8220;unintended&#8221; conversions. However, I&#8217;ve been developing in Python for years, which does do this automatic conversion, and it&#8217;s never bitten me.</p>
<p>The code to lift this restriction is relatively simple, although I can&#8217;t find anyone who&#8217;s mentioned it, so I post it here:</p>
<p><code>update pg_cast set castcontext = 'i' where oid in (<br />
	select c.oid<br />
	from pg_cast c<br />
	inner join pg_type src on src.oid = c.castsource<br />
	inner join pg_type tgt on tgt.oid = c.casttarget<br />
	where src.typname like 'int%' and tgt.typname like 'bool%'<br />
)</code></p>
<p>This SQL updates PG&#8217;s int-to-bool cast path to &#8220;implicit&#8221; mode, meaning PG will now automatically cast an integer to boolean if the target type is a boolean. Otherwise, in the default &#8220;explicit&#8221; mode, you&#8217;d have to use the CAST() syntax.</p>
<p>After the update, the insert statement runs perfectly.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.chrisspen.com/blog/enabling-implicit-cast-from-integer-to-boolean-in-postgresql.html/feed</wfw:commentRss>
		</item>
		<item>
		<title>Google App Engine Patch Accepted</title>
		<link>http://www.chrisspen.com/blog/google-app-engine-patch-accepted.html</link>
		<comments>http://www.chrisspen.com/blog/google-app-engine-patch-accepted.html#comments</comments>
		<pubDate>Sat, 20 Dec 2008 18:38:23 +0000</pubDate>
		<dc:creator>Chris</dc:creator>
		
		<category><![CDATA[App Engine]]></category>

		<category><![CDATA[Python]]></category>

		<category><![CDATA[patch]]></category>

		<guid isPermaLink="false">http://www.chrisspen.com/blog/?p=11</guid>
		<description><![CDATA[It&#8217;s a trivial fix, and it took them over a month to get around to it (hey, I&#8217;m sure they&#8217;re busy), but Google&#8217;s App Engine team finally accepted my patch. With that, the development datastore can be specified with a relative path, making app-engine related scripts easier to share with others who might not have [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s a trivial fix, and it took them over a month to get around to it (hey, I&#8217;m sure they&#8217;re busy), but Google&#8217;s App Engine team finally accepted my <a href="http://code.google.com/p/googleappengine/issues/detail?id=845">patch</a>. With that, the development datastore can be specified with a relative path, making App Engine-related scripts easier to share with others who might not have the identical path structure.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.chrisspen.com/blog/google-app-engine-patch-accepted.html/feed</wfw:commentRss>
		</item>
		<item>
		<title>Simple Cross Browser Rounded Corners with Drop Shadow</title>
		<link>http://www.chrisspen.com/blog/simple-cross-browser-rounded-corners-with-drop-shadow.html</link>
		<comments>http://www.chrisspen.com/blog/simple-cross-browser-rounded-corners-with-drop-shadow.html#comments</comments>
		<pubDate>Sat, 13 Dec 2008 00:36:31 +0000</pubDate>
		<dc:creator>Chris</dc:creator>
		
		<category><![CDATA[CSS]]></category>

		<category><![CDATA[Javascript]]></category>

		<category><![CDATA[jQuery]]></category>

		<category><![CDATA[drop shadow]]></category>

		<category><![CDATA[rounded corners]]></category>

		<category><![CDATA[VML]]></category>

		<guid isPermaLink="false">http://www.chrisspen.com/blog/?p=10</guid>
		<description><![CDATA[I found a convenient method for creating elements with rounded corners and a drop shadow. Previously, I had been using Steffen Rusitschka&#8217;s ShadedBorder script to accomplish this effect with minimal overhead. While his script is an impressive piece of work, it can be a little slow to render since it accomplishes the effect by creating [...]]]></description>
			<content:encoded><![CDATA[<p>I found a convenient method for creating elements with rounded corners <b>and</b> a drop shadow. Previously, I had been using Steffen Rusitschka&#8217;s <a href="http://www.ruzee.com/blog/shadedborder">ShadedBorder</a> script to accomplish this effect with minimal overhead. While his script is an impressive piece of work, it can be a little slow to render since it accomplishes the effect by creating several elements in order to simulate the anti-aliasing and rounding.</p>
<p>The primary attraction of scripts like Steffen&#8217;s lies mostly in IE&#8217;s perceived lack of functionality. Most other browsers, like Firefox and Safari, natively support rounded elements with a simple CSS declaration. IE, perpetually behind the curve, effectively had no easy way to accomplish rounded corners until a rounding method was <a href="http://snook.ca/archives/html_and_css/rounded_corners_experiment_ie/">discovered</a> that makes use of Microsoft&#8217;s own <a href="http://en.wikipedia.org/wiki/Vector_Markup_Language">VML</a> implementation. <a href="http://www.dillerdesign.com/experiment/DD_roundies/">Drew Diller</a> then went on to create a simple wrapper that lets you easily apply the effect across multiple browsers. Why Microsoft&#8217;s own <a href="http://msdn.microsoft.com/en-us/library/bb250413(VS.85).aspx">rounded howto</a> still recommends using tables and GIF images is anyone&#8217;s guess.</p>
<p>For the drop shadow, I made use of Larry Stevens&#8217; excellent <a href="http://eyebulb.com/dropshadow/">jQuery plugin</a>. It accomplishes the effect in basically the same manner as Steffen&#8217;s script, but does so in a convenient jQuery interface.</p>
<p>And finally, a <a href="/roundshadow/index.html">demo</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.chrisspen.com/blog/simple-cross-browser-rounded-corners-with-drop-shadow.html/feed</wfw:commentRss>
		</item>
		<item>
		<title>Swarmastica: A Flash Game</title>
		<link>http://www.chrisspen.com/blog/swarmastica-a-flash-game.html</link>
		<comments>http://www.chrisspen.com/blog/swarmastica-a-flash-game.html#comments</comments>
		<pubDate>Tue, 14 Oct 2008 18:55:09 +0000</pubDate>
		<dc:creator>Chris</dc:creator>
		
		<category><![CDATA[Swarmastica]]></category>

		<category><![CDATA[Actionscript]]></category>

		<category><![CDATA[Flash]]></category>

		<category><![CDATA[games]]></category>

		<guid isPermaLink="false">http://www.chrisspen.com/blog/?p=8</guid>
		<description><![CDATA[I decided to try my hand at making a Flash game using Actionscript 3. I have mixed feelings towards Adobe technologies, but AS3 is a significant improvement over AS2. The syntax is now much closer to that of Java, with cleaner object-orientation and strict type checking. They also released their Java-based AS3 compiler under an [...]]]></description>
			<content:encoded><![CDATA[<p>I decided to try my hand at making a Flash game using Actionscript 3. I have mixed feelings towards Adobe technologies, but AS3 is a significant improvement over AS2. The syntax is now much closer to that of Java, with cleaner object-orientation and strict type checking. They also released their Java-based AS3 compiler under an open source license, allowing developers to write and compile code in practically any environment they want.</p>
<p>The first step in developing any Flash game is coming up with an entertaining idea. Since this is my first game, I wanted the idea to be simple enough to quickly implement, but I still wanted it to be complex enough to be interesting. <a href="http://www.handdrawngames.com/DesktopTD/Game.asp">Tower Defense</a> is a great example of a popular game that&#8217;s relatively simple to implement, but also has enough complexity to be fun.</p>
<p>I&#8217;m a fan of real-time strategy games that shun micro-management and busy work. Granted, this isn&#8217;t always an easy task. If you remove too much work from the user through automation, they become bored. And if you require them to handle every minor detail, they become overwhelmed. The under-appreciated <a href="http://www.globulation2.org">Globulation</a> is an obscure but innovative approach to this problem. Most RTS games require the user to manually create, direct, and assign every sprite for some arbitrary task. Globulation instead automates most of these functions and lets the user focus on the big picture, like where your armies should be directed and what percentage of your resources should be devoted to various tasks. For example, if you wanted to attack your opponent, instead of selecting a group of soldiers, a target, and clicking &#8220;attack!&#8221;, you&#8217;d create an attack point near your target and give it a weight proportional to how many soldiers you want to invest. That proportion of your soldiers will then automatically start attacking. That is, if they&#8217;re not overworked, hungry, or suicidal. I made the novice mistake of directing all my soldiers to do too many things at once. Imagine my surprise when more than a few wandered out to the desert to die. There&#8217;s a great managerial lesson in there somewhere.</p>
<p>These two games outline my main inspirations. I decided to make something slightly more complex than Tower Defense, but significantly simpler than Globulation. Both have autonomous critters as a common theme. In Tower Defense, this autonomy is limited to simply firing when the enemy is near, but gets much more complex with Globulation.</p>
<p>So here&#8217;s the basic idea. In my game, there are 5 types of critters that the user can buy and use to fight the opponent. Each critter has different strengths and weaknesses. Like Tower Defense, most critter types attack by firing projectiles. However, unlike Tower Defense, the critters can all move and pursue their own simple objectives independently of the user.</p>
<p>The user decides where to place critters on the screen, and can tell critters to gravitate towards a particular spot, but otherwise each critter acts completely on its own. They attack when an enemy&#8217;s detected. They retreat when injured. They wander around when bored. Killing enemy critters earns you money, which you can use to buy more critters. I purposefully kept the critter design very abstract, partially for aesthetics and also to speed up rendering time.</p>
<p>The five critter types are:</p>
<ul>
<li><b>Dragoon</b>
<ul>
<li>Pros: Strong long-range attack.</li>
<li>Cons: Slow speed and weak shields.</li>
</ul>
</li>
<li><b>Mine-Layer</b>
<ul>
<li>Pros: Creates a wide field of powerful mines. Strong shields.</li>
<li>Cons: Cannot directly attack.</li>
</ul>
</li>
<li><b>Puppet-Master</b>
<ul>
<li>Pros: Can control an enemy critter from long range<br />(except Wraith and other Puppet-Masters).</li>
<li>Cons: Cannot directly attack.</li>
</ul>
</li>
<li><b>Soldier</b>
<ul>
<li>Pros: Strong weapons and shields. Decent speed. Good at close range attacks.</li>
<li>Cons: Prey to long range attacks.</li>
</ul>
</li>
<li><b>Wraith</b>
<ul>
<li>Pros: Invisible when not attacking.</li>
<li>Cons: Weak shields.</li>
</ul>
</li>
</ul>
<p>It&#8217;s still rough around the edges. I&#8217;m still tweaking the game play and working on an in-game guided-walkthrough to lower the learning curve, but most of the features are there.</p>
<p>I used the <a href="http://box2dflash.sourceforge.net/">AS3 port of Box2D</a> for the physics simulation and <a href="http://puremvc.org/">PureMVC</a> as the application framework.</p>
<p>Please feel free to try out the <a href="/swarmastica-beta/">Beta</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.chrisspen.com/blog/swarmastica-a-flash-game.html/feed</wfw:commentRss>
		</item>
		<item>
		<title>How to Make a Generic Javascript Method Closure</title>
		<link>http://www.chrisspen.com/blog/how-to-make-a-generic-javascript-method-closure.html</link>
		<comments>http://www.chrisspen.com/blog/how-to-make-a-generic-javascript-method-closure.html#comments</comments>
		<pubDate>Wed, 01 Oct 2008 00:33:57 +0000</pubDate>
		<dc:creator>Chris</dc:creator>
		
		<category><![CDATA[Javascript]]></category>

		<category><![CDATA[closure]]></category>

		<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.chrisspen.com/blog/?p=7</guid>
		<description><![CDATA[The Problem
I recently found myself writing a custom Javascript API to integrate a Flash widget written by an external vendor. I was making the API object oriented, so it could be instantiated and controlled simply like:

var myWidget = new FlashWidget()
myWidget.doStuff()

Things were going well, until I realized I had overlooked a minor detail in the Flash [...]]]></description>
			<content:encoded><![CDATA[<h3>The Problem</h3>
<p>I recently found myself writing a custom Javascript API to integrate a Flash widget written by an external vendor. I was making the API object oriented, so it could be instantiated and controlled simply like:</p>
<pre style="color: green;">
var myWidget = new FlashWidget()
myWidget.doStuff()
</pre>
<p>Things were going well, until I realized I had overlooked a minor detail in the Flash widget. For several actions, the widget accepts function names, which it will later evaluate and call when the actions complete. The detail I missed was that it only accepts <b>function</b> names visible in the global scope. Since all my methods are bound to myWidget, and not the global scope, the Flash widget won&#8217;t be able to directly call anything in my nice API. I had hoped it would be smart enough to accept a qualified name, like &#8220;myWidget.doStuffCallback&#8221;, and resolve it correctly, but no such luck.</p>
<h3>I Need Closure</h3>
<p>Fortunately, I&#8217;m well acquainted with the concept of Javascript closures, so the solution turned out to be a few extra lines of code to expose my callbacks in a way that the Flash widget would accept. Essentially, I needed to create an anonymous function which would call the given method, remaining bound to the myWidget scope, and include any additional arguments the widget might pass over.</p>
<p>Some Javascript frameworks, like the well-known <a href="http://www.prototypejs.org/">Prototype</a> framework, have various helper functions to accomplish something like this, but they usually boil down to this simple bit of code:</p>
<pre style="color: green;">
function methodClosure(obj, funcName){
    return function(){
        return obj[funcName].apply(obj, arguments)
    }
}
</pre>
<p>With this, I can expose each callback as a global function like so:</p>
<pre style="color: green;">
doStuffCallback = methodClosure(myWidget, 'onDoStuff')
</pre>
]]></content:encoded>
			<wfw:commentRss>http://www.chrisspen.com/blog/how-to-make-a-generic-javascript-method-closure.html/feed</wfw:commentRss>
		</item>
		<item>
		<title>How to Extract a Webpage&#8217;s Main Article Content</title>
		<link>http://www.chrisspen.com/blog/how-to-extract-a-webpages-main-article-content.html</link>
		<comments>http://www.chrisspen.com/blog/how-to-extract-a-webpages-main-article-content.html#comments</comments>
		<pubDate>Tue, 16 Sep 2008 11:39:07 +0000</pubDate>
		<dc:creator>Chris</dc:creator>
		
		<category><![CDATA[Django]]></category>

		<category><![CDATA[Python]]></category>

		<category><![CDATA[DOM]]></category>

		<category><![CDATA[Tidy]]></category>

		<guid isPermaLink="false">http://www.chrisspen.com/blog/?p=6</guid>
		<description><![CDATA[The Idea
I had an idea to make a personalized news feed reader. Basically, I&#8217;d register a bunch of feeds with the application, and rate a few stories as either &#8220;good&#8221; or &#8220;bad&#8221;. The application would then use my ratings and the article text to generate a statistical model, apply that model to future articles, and [...]]]></description>
			<content:encoded><![CDATA[<h3>The Idea</h3>
<p>I had an idea to make a personalized news feed reader. Basically, I&#8217;d register a bunch of feeds with the application, and rate a few stories as either &#8220;good&#8221; or &#8220;bad&#8221;. The application would then use my ratings and the article text to generate a statistical model, apply that model to future articles, and only recommend those it predicted I would rate as &#8220;good&#8221;. It sounded like a plausible idea. I decided to start a pet project.</p>
<p>I soon learned that this idea wasn&#8217;t original, and in fact had been attempted by quite a few companies. The first to seriously implement this idea was <a href="http://www.findory.com">Findory</a>, later followed by Thoof, Tiinker, <a href="http://pressflip.com/">Persai</a>, and probably others I&#8217;m not aware of. As of this writing, only Persai is still in business. Apparently, personalized news feeds aren&#8217;t terribly profitable. Why they&#8217;re not a commercial hit is a whole article in itself, so I won&#8217;t go into it now. However, before I admitted to myself that this project was doomed to failure, I decided to implement a few components to get a better feel for how the system would work. This is a review of a few interesting things I learned along the way.</p>
<h3>The Problem</h3>
<p>Obviously, the first step to any would-be social news indexing service is to download the target webpages and extract their article text. After you read a lot of non-standard HTML, you start developing a bit of respect for web browser developers who have to make sense of this mess. Most webpages have a pretty low signal-to-noise ratio. There are banners, navigation links, side-bars, headers, footers, meta tags, related-story clips, unrelated story clips, and a lot of other junk we don&#8217;t care about. So how do we take this mess and extract not just the text, but the main article text? It&#8217;s pretty easy to look for all the &lt;div&gt; or &lt;p&gt; tags in an HTML document and extract their contents, but we&#8217;ll get a lot of this junk we don&#8217;t want.</p>
<h3>Cleaning Up the Mess</h3>
<p>In an ideal world, all webpages would be XML documents that could be easily validated and parsed. In the real world they&#8217;re a mess, but they&#8217;re close enough to XML that we can potentially clean them up and parse away. With the help of the excellent tool <a href="http://tidy.sourceforge.net/">Tidy</a>, this turns out to be surprisingly easy. With Tidy, I can take the ugliest webpage and turn it into something I can easily feed into an XML parser, which can then extract the document&#8217;s text, ignoring all the HTML markup, Javascript, and other useless information.</p>
<h3>Enter the DOM</h3>
<p>So now we have a clean document that we&#8217;ve fed to an XML parser to build a DOM and extracted all nodes that contain text. How do we distinguish the text used in the navigation bar/side-bar/footer from the main article text? It turns out we can make an assumption in 99% of the cases we&#8217;re interested in. This assumption is that the largest &#8220;clump&#8221; of text is usually the main article. What I mean by &#8220;clump&#8221; are nodes of text that are grouped together based on their relative position in the DOM.</p>
<p>For example, suppose we have a document whose body is something like:</p>
<pre style="color: green;">
&lt;body&gt;
    &lt;div&gt;
        &lt;ul&gt;
            &lt;li&gt;&lt;a href=&quot;/home&quot;&gt;Home&lt;/a&gt;&lt;/li&gt;
            &lt;li&gt;&lt;a href=&quot;/politics&quot;&gt;Politics&lt;/a&gt;&lt;/li&gt;
            &lt;li&gt;&lt;a href=&quot;/health&quot;&gt;Health&lt;/a&gt;&lt;/li&gt;
            &lt;li&gt;&lt;a href=&quot;/travel&quot;&gt;Travel&lt;/a&gt;&lt;/li&gt;
            &lt;li&gt;&lt;a href=&quot;/about&quot;&gt;About&lt;/a&gt;&lt;/li&gt;
        &lt;/ul&gt;
    &lt;/div&gt;
    &lt;div&gt;
        &lt;div&gt;
            &lt;p&gt;&lt;b&gt;MIAMI, Florida (CNN) &lt;/b&gt; -- Hurricane Ike weakened slightly...
            &lt;p&gt;Ike hit Turks and Caicos Islands Sunday morning, leaving a trail of...
            &lt;p&gt;"It pretty much looks like an episode of 'The Twilight Zone,' " said...
            &lt;p&gt;Aftwood estimates at least 90 percent of homes he saw on the island were...
            &lt;p&gt;The possibility of similar devastation prompted state and local officials...
            &lt;p&gt;"Let's hope it's all a false alarm," Louisiana Gov. Bobby Jindal said...
        &lt;/div&gt;
        &lt;div&gt;
            &lt;p&gt;Some side-story that we don't really care about.&lt;/p&gt;
            &lt;p&gt;Another paragraph for this story.&lt;/p&gt;
        &lt;/div&gt;
        &lt;div&gt;
            &lt;p&gt;Yet another semi-related side-story that we still don't care about.&lt;/p&gt;
            &lt;p&gt;Another paragraph for this story.&lt;/p&gt;
            &lt;p&gt;Another paragraph for this story.&lt;/p&gt;
            &lt;p&gt;Yet another paragraph for this story.&lt;/p&gt;
        &lt;/div&gt;
    &lt;/div&gt;
    &lt;div&gt;&copy; 2008 Cable News Network.&lt;/div&gt;
&lt;/body&gt;
</pre>
<p>Clearly, we don&#8217;t care about the navigation link text, or the two side-stories. Let&#8217;s break it down based on DOM location. We have six &lt;p&gt; tags in the first &lt;div&gt; tag of the second &lt;div&gt; tag of the body. We&#8217;ll represent this location as a list of indexes, like (2,1,*). If we group all the text nodes in this fashion, and track how much text each group contains, we get a table like:</p>
<pre style="color: green;">
location = characters
(1,1,1,1) = 4
(1,1,2,1) = 8
(1,1,3,1) = 6
(1,1,4,1) = 6
(1,1,5,1) = 5
(2,1,*) = 500
(2,2,*) = 100
(2,3,*) = 250
(3) = 26
</pre>
<p>Using this, we can safely assume either the main article text is located at position (2,1,*), or the document has more ads/spam/comments/side-stories/etc than actual content. Once we know where the main text is located, we strip it out and move on.</p>
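<p>The grouping step can be sketched in Python with the standard library&#8217;s ElementTree. This is an illustration under my own assumptions, not the code behind the demo: text is credited to the parent of the tag that directly holds it, the names clump_sizes and el are made up, and the sample tree is built programmatically and is much smaller than the page above.</p>

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def clump_sizes(root):
    """Total text length per clump, keyed by the container's index path."""
    sizes = defaultdict(int)

    def walk(node, path):
        # Text held directly by this node: its own text plus child tails.
        direct = (node.text or "") + "".join(c.tail or "" for c in node)
        if direct.strip():
            # Credit the text to the node's parent so siblings share a clump.
            sizes[path[:-1]] += len(direct.strip())
        for index, child in enumerate(node, start=1):
            walk(child, path + (index,))

    walk(root, ())
    return dict(sizes)


def el(tag, *children, text=""):
    """Hypothetical helper: build an element without writing markup."""
    node = ET.Element(tag)
    node.text = text
    node.extend(children)
    return node


# A tiny stand-in page: navigation links, a main story, a side-story, a footer.
nav = [el("li", el("a", text=t)) for t in ("Home", "Politics", "Health")]
body = el(
    "body",
    el("div", el("ul", *nav)),
    el(
        "div",
        el("div", *[el("p", text="x" * 150) for _ in range(3)]),
        el("div", *[el("p", text="y" * 50) for _ in range(2)]),
    ),
    el("div", text="(c) 2008"),
)

sizes = clump_sizes(body)
main = max(sizes, key=sizes.get)
print(main, sizes[main])  # (2, 1) 450
```

<p>Here the key (2, 1) plays the role of (2,1,*) in the table: every paragraph directly under that container is summed into one clump, and the largest clump is taken to be the article.</p>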
<h3>Conclusion</h3>
<p>This method proved to be a fairly simple and robust way to extract text from webpages. The main exceptions I found were news aggregators like Digg/Reddit/Slashdot, where the text blurb may be overshadowed by hundreds of user comments.</p>
<p>I implemented a simple <a href="http://www.djangoproject.com/">Django</a> web app with a Javascript widget frontend to demo this method. Try it out for yourself by pasting a URL below.<br />
<script type="text/javascript" src="/textscrapit/loader/"></script></p>
]]></content:encoded>
			<wfw:commentRss>http://www.chrisspen.com/blog/how-to-extract-a-webpages-main-article-content.html/feed</wfw:commentRss>
		</item>
		<item>
		<title>A Simple Pylons Wordnet Name Generator</title>
		<link>http://www.chrisspen.com/blog/a-simple-pylons-wordnet-name-generator.html</link>
		<comments>http://www.chrisspen.com/blog/a-simple-pylons-wordnet-name-generator.html#comments</comments>
		<pubDate>Wed, 03 Sep 2008 04:03:49 +0000</pubDate>
		<dc:creator>Chris</dc:creator>
		
		<category><![CDATA[Pylons]]></category>

		<category><![CDATA[Python]]></category>

		<category><![CDATA[Wordnet]]></category>

		<category><![CDATA[python pylons wordnet]]></category>

		<guid isPermaLink="false">http://www.chrisspen.com/blog/?p=4</guid>
		<description><![CDATA[To familiarize myself with the Pylons framework, I wrote a simple app that uses Wordnet to randomly generate very obscure names in the format &#8220;adjective noun&#8221;. Try it out below by clicking &#8220;Generate!&#8221;.

]]></description>
			<content:encoded><![CDATA[<p>To familiarize myself with the <a href="http://pylonshq.com/">Pylons</a> framework, I wrote a simple app that uses <a href="http://wordnet.princeton.edu/">Wordnet</a> to randomly generate very obscure names in the format &#8220;adjective noun&#8221;. Try it out below by clicking &#8220;Generate!&#8221;.</p>
<p><iframe src="/namegen" width="200" height="100" frameborder="0"></iframe></p>
]]></content:encoded>
			<wfw:commentRss>http://www.chrisspen.com/blog/a-simple-pylons-wordnet-name-generator.html/feed</wfw:commentRss>
		</item>
	</channel>
</rss>
