<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Two Interactions With Amazon</title>
	<atom:link href="http://push.cx/2009/two-interactions-with-amazon/feed" rel="self" type="application/rss+xml" />
	<link>http://push.cx/2009/two-interactions-with-amazon</link>
	<description>A tea-drinking web geek's coffee-flavored blog</description>
	<lastBuildDate>Tue, 27 Jul 2010 00:47:39 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
	<item>
		<title>By: Peter Harkins</title>
		<link>http://push.cx/2009/two-interactions-with-amazon/comment-page-1#comment-89782</link>
		<dc:creator>Peter Harkins</dc:creator>
		<pubDate>Sun, 18 Jan 2009 17:33:18 +0000</pubDate>
		<guid isPermaLink="false">http://push.cx/?p=450#comment-89782</guid>
		<description>I did a quick &lt;kbd&gt;aptitude search ocr&lt;/kbd&gt; and found gocr, ocrad, and tesseract-ocr. I spent an hour or so testing each package with different options on samples from the Clinton Schedule, which I extracted with

&lt;kbd&gt;gs -sDEVICE=pnggray -sOutputFile=page.png -r600 -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -dBATCH -dNOPAUSE -dSAFER -dFirstPage=&lt;i&gt;42&lt;/i&gt; -dLastPage=&lt;i&gt;42&lt;/i&gt; -f hrcsked.pdf -c quit &amp;&amp; convert page.png page00042.tiff&lt;/kbd&gt; (I also used ImageMagick to crop the image down and got better results because the staticky margins were gone, but I can&#039;t immediately find my notes on that; the IM docs explain how to do it.)

I found that tesseract worked best, but I strongly suspect that the different OCR programs will rank differently depending on the material.</description>
		<content:encoded><![CDATA[<p>I did a quick <kbd>aptitude search ocr</kbd> and found gocr, ocrad, and tesseract-ocr. I spent an hour or so testing each package with different options on samples from the Clinton Schedule, which I extracted with</p>
<p><kbd>gs -sDEVICE=pnggray -sOutputFile=page.png -r600 -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -dBATCH -dNOPAUSE -dSAFER -dFirstPage=<i>42</i> -dLastPage=<i>42</i> -f hrcsked.pdf -c quit &#038;&#038; convert page.png page00042.tiff</kbd> (I also used ImageMagick to crop the image down and got better results because the staticky margins were gone, but I can&#8217;t immediately find my notes on that; the IM docs explain how to do it.)</p>
<p>I found that tesseract worked best, but I strongly suspect that the different OCR programs will rank differently depending on the material.</p>
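<p>For more than one page, the extraction command above could be wrapped in a loop. This is only a minimal sketch, assuming a POSIX shell with Ghostscript and ImageMagick installed; the function names page_name, extract_page, and extract_range are hypothetical, and the gs flags are copied unchanged from the command above.</p>

```shell
#!/bin/sh
# Sketch: repeat the gs + convert steps from the comment over a page range.

# Build the zero-padded output name, e.g. 42 -> page00042.tiff.
# Pass plain decimal numbers; a leading zero would be read as octal by printf.
page_name() {
    printf 'page%05d.tiff\n' "$1"
}

# Rasterize one page of a PDF to a 600 dpi grayscale PNG, then convert to TIFF.
# page.png is a scratch file that is overwritten on every call.
extract_page() {
    src=$1
    n=$2
    gs -sDEVICE=pnggray -sOutputFile=page.png -r600 \
       -dTextAlphaBits=4 -dGraphicsAlphaBits=4 \
       -dBATCH -dNOPAUSE -dSAFER \
       -dFirstPage="$n" -dLastPage="$n" \
       -f "$src" -c quit &&
    convert page.png "$(page_name "$n")"
}

# Extract an inclusive range of pages, stopping at the first failure.
extract_range() {
    pdf=$1
    page=$2
    last=$3
    while [ "$page" -le "$last" ]; do
        extract_page "$pdf" "$page" || return 1
        page=$((page + 1))
    done
}
```

<p>After sourcing the functions, something like <kbd>extract_range hrcsked.pdf 42 44</kbd> would be expected to leave page00042.tiff through page00044.tiff in the current directory, ready to feed to whichever OCR program is being tested.</p>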
]]></content:encoded>
	</item>
	<item>
		<title>By: Simon Willison</title>
		<link>http://push.cx/2009/two-interactions-with-amazon/comment-page-1#comment-89777</link>
		<dc:creator>Simon Willison</dc:creator>
		<pubDate>Sun, 18 Jan 2009 08:35:43 +0000</pubDate>
		<guid isPermaLink="false">http://push.cx/?p=450#comment-89777</guid>
		<description>A bit off topic for the post, but I&#039;d love to know what software you use for processing scanned documents on EC2 :)</description>
		<content:encoded><![CDATA[<p>A bit off topic for the post, but I&#8217;d love to know what software you use for processing scanned documents on EC2 :)</p>
]]></content:encoded>
	</item>
</channel>
</rss>
