<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Elastic Web Mining &#124; Bixo Labs</title>
	<atom:link href="http://bixolabs.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://bixolabs.com</link>
	<description></description>
	<lastBuildDate>Fri, 18 Jun 2010 14:50:38 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='bixolabs.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://www.gravatar.com/blavatar/308be8690ccbdf7ea4dff361b3bbc457?s=96&#038;d=http://s2.wp.com/i/buttonw-com.png</url>
		<title>Elastic Web Mining &#124; Bixo Labs</title>
		<link>http://bixolabs.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://bixolabs.com/osd.xml" title="Elastic Web Mining &#124; Bixo Labs" />
	<atom:link rel='hub' href='http://bixolabs.com/?pushpress=hub'/>
		<item>
		<title>Focused web crawling</title>
		<link>http://bixolabs.com/2010/06/18/focused-web-crawling/</link>
		<comments>http://bixolabs.com/2010/06/18/focused-web-crawling/#comments</comments>
		<pubDate>Fri, 18 Jun 2010 14:50:38 +0000</pubDate>
		<dc:creator>kkrugler</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://bixolabs.com/?p=277</guid>
		<description><![CDATA[Recently some customers have been asking for a more concrete description of how we handle &#8220;focused web crawling&#8221; at Bixo Labs. After answering the same questions a few times, it seemed like a good idea to post details to our web site &#8211; thus the new page titled Focused Crawling. The basic concepts are straightforward, [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bixolabs.com&blog=10127813&post=277&subd=bixolabs&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<p>Recently some customers have been asking for a more concrete description of how we handle &#8220;focused web crawling&#8221; at Bixo Labs.</p>
<p>After answering the same questions a few times, it seemed like a good idea to post details to our web site &#8211; thus the new page titled <a href="about/focused-crawler" target="_self">Focused Crawling</a>.</p>
<p>The basic concepts are straightforward, and very similar to what we did at Krugle to efficiently find web pages that were likely to be of interest to software developers. In Bixo Labs we&#8217;ve generalized the concept a bit, and implemented it using Bixo and a Cascading workflow. This gives us a lot more flexibility when it comes to customizing the behavior, as well as making it easier for us to work with customer-provided code for extension points such as scoring pages.</p>
]]></content:encoded>
			<wfw:commentRss>http://bixolabs.com/2010/06/18/focused-web-crawling/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/cee2819ae7c1633e11ad2ef33de7794f?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">kkrugler</media:title>
		</media:content>
	</item>
		<item>
		<title>Hadoop User Group Meetup Talk</title>
		<link>http://bixolabs.com/2010/04/22/hadoop-user-group-meetup-talk/</link>
		<comments>http://bixolabs.com/2010/04/22/hadoop-user-group-meetup-talk/#comments</comments>
		<pubDate>Fri, 23 Apr 2010 01:35:34 +0000</pubDate>
		<dc:creator>kkrugler</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[public terabyte dataset]]></category>
		<category><![CDATA[cascading]]></category>
		<category><![CDATA[simpledb]]></category>
		<category><![CDATA[avro]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[elastic mapreduce]]></category>

		<guid isPermaLink="false">http://bixolabs.com/?p=239</guid>
		<description><![CDATA[Last night I did a presentation at the April Hadoop Bay Area User Group meetup, hosted by Yahoo. 250+ people in attendance, so the interest in Hadoop continues to grow. Dekel has posted the slides of my talk, as well as a (very quiet) video. My talk was on the status of the Public Terabyte [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bixolabs.com&blog=10127813&post=239&subd=bixolabs&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<p>Last night I did a presentation at the <a href="http://www.meetup.com/hadoop/calendar/13002132/" target="_blank">April Hadoop Bay Area User Group meetup</a>, hosted by Yahoo. 250+ people in attendance, so the interest in Hadoop continues to grow.</p>
<p>Dekel has posted the <a href="http://www.slideshare.net/hadoopusergroup/bixo-hug-talk" target="_blank">slides</a> of my talk, as well as a (very quiet) <a href="http://www.youtube.com/watch?v=VIIi8DjQbzI&amp;feature=channel" target="_blank">video</a>.</p>
<p>My talk was on the status of the <a href="http://bixolabs.com/datasets/public-terabyte-dataset-project/" target="_blank">Public Terabyte Dataset (PTD) project</a>, and advice on running jobs in Amazon&#8217;s <a href="http://aws.amazon.com/elasticmapreduce/" target="_blank">Elastic MapReduce</a> (EMR) cloud. As part of the PTD architecture, we wound up using Amazon&#8217;s <a href="http://aws.amazon.com/simpledb/" target="_blank">SimpleDB</a> for storing the crawl DB, thus one section of my talk was on what we learned about using that to efficiently and inexpensively save persistent data (crawl state) while still using EMR for bursty processing. I&#8217;d previously blogged about our <a href="http://bixolabs.com/2010/03/16/simpledb-tap-for-cascading/" target="_blank">SimpleDB tap &amp; scheme for Cascading</a>, and our use of it for PTD has helped shake out some bugs.</p>
<p>As well, we decided to use <a href="http://hadoop.apache.org/avro/" target="_blank">Apache Avro</a> for our output format. This meant creating a Cascading scheme, which would have been pretty painful but for the fortuitous, just-in-time release of Hadoop mapreduce support code in the Avro project (thanks to Doug &amp; Scott for that). Vivek mentioned this new project in his recent blog post about our <a href="http://bixolabs.com/2010/04/21/first-sample-of-public-terabyte-dataset/" target="_blank">first release of PTD data</a>, and we&#8217;re looking forward to others using this to read/write Avro files.</p>
<p>The real-world use case I described in my talk was analyzing the quality of the <a href="http://lucene.apache.org/tika/" target="_blank">Tika</a> charset detection, using HTML data from our initial crawl dataset. The results showed plenty of room for improvement <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<div id="attachment_242" class="wp-caption alignnone" style="width: 610px"><a href="http://bixolabs.files.wordpress.com/2010/04/charset-analysis.png"><img class="size-full wp-image-242" title="charset-analysis" src="http://bixolabs.files.wordpress.com/2010/04/charset-analysis.png?w=600&#038;h=388" alt="" width="600" height="388" /></a><p class="wp-caption-text">Tika accuracy detecting character sets</p></div>
<p>The real point of this use case wasn&#8217;t to point out problems with Tika, but rather to demonstrate how easy it is to use the dataset to perform this type of analysis. Which means it&#8217;s also easy to compare alternative algorithms, and improve the Tika support with a large enough dataset to inspire confidence in the end results.</p>
<p>As an aside, Ted Dunning might be using this data &amp; Mahout to train a better charset and/or language classifier, which would be a really nice addition to the Tika project. The same thing could obviously be done for language detection, which currently also suffers from similar accuracy issues, as well as being a CPU cycle hog.</p>
]]></content:encoded>
			<wfw:commentRss>http://bixolabs.com/2010/04/22/hadoop-user-group-meetup-talk/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/cee2819ae7c1633e11ad2ef33de7794f?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">kkrugler</media:title>
		</media:content>

		<media:content url="http://bixolabs.files.wordpress.com/2010/04/charset-analysis.png" medium="image">
			<media:title type="html">charset-analysis</media:title>
		</media:content>
	</item>
		<item>
		<title>First Sample of Public Terabyte Dataset</title>
		<link>http://bixolabs.com/2010/04/21/first-sample-of-public-terabyte-dataset/</link>
		<comments>http://bixolabs.com/2010/04/21/first-sample-of-public-terabyte-dataset/#comments</comments>
		<pubDate>Wed, 21 Apr 2010 16:01:35 +0000</pubDate>
		<dc:creator>kkrugler</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[avro]]></category>
		<category><![CDATA[cascading]]></category>
		<category><![CDATA[public terabyte dataset]]></category>

		<guid isPermaLink="false">http://bixolabs.com/?p=230</guid>
		<description><![CDATA[We are excited that the Public Terabyte Dataset project is starting to release data. We decided to go with the Avro file format, instead of WARC, as Avro is more efficient (easily splittable by Hadoop) and cross-language. Since we&#8217;re using Cascading for this project, we have also released a Cascading Avro Scheme to read and [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bixolabs.com&blog=10127813&post=230&subd=bixolabs&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<p>We are excited that the <a href="http://bixolabs.com/datasets/public-terabyte-dataset-project/">Public Terabyte Dataset</a> project is starting to release data. We decided to go with the <a href="http://hadoop.apache.org/avro/">Avro</a> file format, instead of WARC, as Avro is more efficient (easily splittable by Hadoop) and cross-language. Since we&#8217;re using <a href="http://www.cascading.org/">Cascading</a> for this project, we have also released a Cascading <a href="http://github.com/bixolabs/cascading.avro">Avro Scheme</a> to read and write Avro files.</p>
<p>To help you get started with this dataset, we have posted a small sample in S3, in the bixolabs-ptd-demo bucket. Along with it is the <a href="http://s3.amazonaws.com/bixolabs-ptd-demo/ptd-sample.json">Avro JSON</a> schema needed to read the file. For those unfamiliar with Avro files, here&#8217;s a sample snippet that illustrates one way of reading them:<br />
<code><br />
Schema schema = Schema.parse(jsonSchemaFile);<br />
DataFileReader&lt;Object&gt;  reader = new DataFileReader&lt;Object&gt;(avroFile, new GenericDatumReader&lt;Object&gt;(schema));<br />
while (reader.hasNext()) {<br />
GenericData.Record obj = (GenericData.Record) reader.next();<br />
// You can access the fields in this object like this...<br />
System.out.println(obj.get("AvroDatum-url"));<br />
}<br />
</code><br />
Please take a look, and let us know if there&#8217;s any missing raw content that you&#8217;d want. We&#8217;ve intentionally avoided doing post-processing of the results &#8211; this is source data for exactly that type of activity.</p>
]]></content:encoded>
			<wfw:commentRss>http://bixolabs.com/2010/04/21/first-sample-of-public-terabyte-dataset/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/cee2819ae7c1633e11ad2ef33de7794f?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">kkrugler</media:title>
		</media:content>
	</item>
		<item>
		<title>SimpleDB Tap for Cascading</title>
		<link>http://bixolabs.com/2010/03/16/simpledb-tap-for-cascading/</link>
		<comments>http://bixolabs.com/2010/03/16/simpledb-tap-for-cascading/#comments</comments>
		<pubDate>Tue, 16 Mar 2010 16:28:37 +0000</pubDate>
		<dc:creator>kkrugler</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[cascading]]></category>
		<category><![CDATA[simpledb]]></category>

		<guid isPermaLink="false">http://bixolabs.com/?p=202</guid>
		<description><![CDATA[Recently we&#8217;ve been running a number of large, multi-phase web mining applications in Amazon&#8217;s EC2 &#38; Elastic MapReduce (EMR), and we needed a better way to maintain state than pushing sequence files back and forth between HDFS and S3. One option was to set up an HBase cluster, but then we&#8217;d be paying 24&#215;7 for [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bixolabs.com&blog=10127813&post=202&subd=bixolabs&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<p>Recently we&#8217;ve been running a number of large, multi-phase web mining applications in Amazon&#8217;s <a href="http://aws.amazon.com/ec2/" target="_blank">EC2</a> &amp; <a href="http://aws.amazon.com/elasticmapreduce/" target="_blank">Elastic MapReduce (EMR)</a>, and we needed a better way to maintain state than pushing sequence files back and forth between <a href="http://hadoop.apache.org/hdfs/" target="_blank">HDFS</a> and <a href="https://s3.amazonaws.com/" target="_blank">S3</a>.</p>
<p>One option was to set up an <a href="http://hadoop.apache.org/hbase/" target="_blank">HBase</a> cluster, but then we&#8217;d be paying 24&#215;7 for servers that we&#8217;d only need for a few minutes each day. We could also set up MySQL with persistent storage on an Amazon <a href="http://aws.amazon.com/ebs/" target="_blank">EBS</a> volume, but then we&#8217;d have to configure &amp; launch MySQL on our cluster master for each mining &#8220;loop&#8221;, and for really big jobs it wouldn&#8217;t scale well to 100M+ items.</p>
<p>So we spent some time creating a <a href="http://www.cascading.org/" target="_blank">Cascading</a> <a href="http://www.cascading.org/documentation/overview.html" target="_blank">tap &amp; scheme</a> that lets us use Amazon&#8217;s <a href="http://aws.amazon.com/simpledb/" target="_blank">SimpleDB</a> to maintain the state of web mining &amp; crawling jobs. It&#8217;s still pretty rough, but usable. The code is publicly available at GitHub &#8211; check out <a href="http://github.com/bixolabs/cascading.simpledb" target="_blank">http://github.com/bixolabs/cascading.simpledb</a>.</p>
<p>There&#8217;s also a README to help you get started, which I&#8217;ve copied below since it contains useful information about the project.</p>
<p>As to the big question on performance &#8211; not sure yet how well it handles a SimpleDB with 100M+ entries, but we&#8217;re heading there fast on one project, so more details to follow. Enjoy&#8230;</p>
<h2>README</h2>
<h3>Introduction</h3>
<p>cascading.simpledb is a Cascading Tap &amp; Scheme for Amazon&#8217;s SimpleDB.</p>
<p>This means you can use SimpleDB as the source of tuples for a Cascading<br />
flow, and as a sink for saving results. This is particularly useful when<br />
you need a scalable, persistent store for Cascading jobs being run in<br />
Amazon&#8217;s EC2/EMR cloud environment.</p>
<p>Information about SimpleDB is available from <a href="http://aws.amazon.com/simpledb/">http://aws.amazon.com/simpledb/</a><br />
and also <a href="http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/">http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/</a></p>
<p>Note that you will need to be signed up to use both AWS and SimpleDB, and<br />
have valid AWS access key and secret key values before using this code. See<br />
<a href="http://docs.amazonwebservices.com/AmazonSimpleDB/2009-04-15/GettingStartedGuide/GettingSetUp.html">http://docs.amazonwebservices.com/AmazonSimpleDB/2009-04-15/GettingStartedGuide/GettingSetUp.html</a></p>
<h3>Design</h3>
<p>In order to get acceptable performance, the cascading.simpledb scheme splits<br />
each virtual &#8220;table&#8221; of data in SimpleDB across multiple shards. A shard<br />
corresponds to what SimpleDB calls a domain. This allows most requests to<br />
be run in parallel across multiple mappers, without having to worry about<br />
duplicate records being returned for the same request.</p>
<p>Each record (Cascading tuple, or SimpleDB item) has an implicit field called<br />
SimpleDBUtils-itemHash. This is a zero-padded hash of the record&#8217;s key, which<br />
SimpleDB calls the item name. Every SimpleDB record has a unique item name, used<br />
to read it directly.</p>
<p>Records (items) are split between shards using partitions of this hash value. This<br />
implies that once a table has been created and populated with items, there is no<br />
easy way to change the number of shards; you essentially have to build a new<br />
table and copy all of the values.</p>
<p>The implicit itemHash field could also be used to parallelize search requests within<br />
a single shard, by further partitioning. This performance boost is not yet<br />
implemented by cascading.simpledb, however.</p>
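<p>The sharding idea described above can be sketched roughly as follows. This is only an illustration, not the cascading.simpledb code: the class and method names (ShardingSketch, hashOf, itemHash, shardFor), the use of CRC32 as the hash, and the 10-digit padding width are all assumptions made for the example.</p>

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.CRC32;

public class ShardingSketch {
    // CRC32 yields an unsigned 32-bit value, so hashes fall in [0, 2^32).
    static final long HASH_RANGE = 1L << 32;

    // Hypothetical hash function; the real project may hash keys differently.
    static long hashOf(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // Zero-pad to 10 digits so that string order matches numeric order.
    static String itemHash(String key) {
        return String.format(Locale.ROOT, "%010d", hashOf(key));
    }

    // Each shard owns one contiguous slice of the hash range.
    static int shardFor(String key, int numShards) {
        if (numShards < 1) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        return (int) (hashOf(key) * numShards / HASH_RANGE);
    }

    public static void main(String[] args) {
        String[] keys = {"alice", "bob", "carol"};
        for (String key : keys) {
            System.out.println(key + " -> hash " + itemHash(key)
                + ", shard " + shardFor(key, 20) + " of 20"
                + ", shard " + shardFor(key, 30) + " of 30");
        }
    }
}
```

<p>Because the shard index depends on the shard count, the same key can land in a different shard when the count changes, which is why re-sharding means rebuilding the table.</p>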
<h3>Example</h3>
<pre>  // Specify fields in SimpleDB item, and which field is to be used as the key.
  Fields itemFields = new Fields("name", "birthday", "occupation", "status");
  SimpleDBScheme sourceScheme = new SimpleDBScheme(itemFields, new Fields("name"));</pre>
<pre>  // Load people that haven't yet been contacted, up to 1000
  String query = "`status` = \"NOT_CONTACTED\"";
  sourceScheme.setQuery(query);
  sourceScheme.setSelectLimit(1000);</pre>
<pre>  int numShards = 20;
  String tableName = "people";
  Tap sourceTap = new SimpleDBTap(sourceScheme, accessKey, secretKey, tableName, numShards);</pre>
<pre>  Pipe processingPipe = new Each("generate email pipe", new SendEmailOperation());</pre>
<pre>  // Use the same scheme as the source - query &amp; limit are ignored
  Tap sinkTap = new SimpleDBTap(sourceScheme, accessKey, secretKey, tableName, numShards);</pre>
<pre>  Flow flow = new FlowConnector().connect(sourceTap, sinkTap, processingPipe);
  flow.complete();</pre>
<h3>Limitations</h3>
<p>All limitations of the underlying SimpleDB system obviously apply. That means<br />
things like the maximum number of shards (100), the maximum size of any one<br />
item (1MB), the maximum size of any field value (1024 bytes) and so on. See details<br />
at <a href="http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/SDBLimits.html">http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/SDBLimits.html</a></p>
<p>Given the 1K max field value length, SimpleDB is most useful for storing small<br />
chunks of data, or references to bigger blobs that can be saved in S3.</p>
<p>In addition, all values are stored as strings. This means all fields must be<br />
round-trippable as text.</p>
<p>Finally, SimpleDB <del datetime="2010-04-19T02:45:18+00:00">does not guarantee immediate consistency</del> can guarantee consistency when reading back<br />
the results of a write (or doing queries for the same), but this imposes a significant performance penalty, so the tap uses the default &#8220;eventually consistent&#8221; mode. This typically isn&#8217;t a problem due to the batch-oriented nature of most Cascading workflows, but could be an issue if you had multiple jobs writing to and reading from the same table.</p>
<h3>Known Issues</h3>
<p>Currently you need to correctly specify the number of shards for a table when<br />
you define the tap. This is error prone, and should only be necessary when creating (or<br />
re-creating) the table from scratch.</p>
<p>Tuple values that are null will not be updated in the table, which means you<br />
can&#8217;t delete values, only add or update them.</p>
<p>Some operations are not multi-threaded, and thus take longer than they should.<br />
For example, calculating the splits for a read will make a series of requests<br />
to SimpleDB to get the item counts.</p>
<p>Numeric fields should automatically be stored as zero-padded strings to ensure<br />
proper sort behavior, but currently this is only done for the implicit hash field.</p>
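<p>To show why zero-padding matters when every value is stored and compared as a string, here is a minimal sketch. The class name ZeroPadSketch, the pad helper, and the 19-digit width (enough for any non-negative Java long) are assumptions for illustration, not part of cascading.simpledb.</p>

```java
import java.util.Arrays;
import java.util.Locale;

public class ZeroPadSketch {
    // Assumed width: Long.MAX_VALUE has 19 digits, so no non-negative long overflows it.
    static final int WIDTH = 19;

    // Convert a non-negative number to a fixed-width, zero-padded string.
    static String pad(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values would need an offset; not handled here");
        }
        return String.format(Locale.ROOT, "%0" + WIDTH + "d", value);
    }

    public static void main(String[] args) {
        String[] raw = {"9", "10", "100"};
        String[] padded = {pad(9), pad(10), pad(100)};
        // Both arrays are sorted as strings, the way a string-only store compares them.
        Arrays.sort(raw);
        Arrays.sort(padded);
        System.out.println("raw:    " + Arrays.toString(raw));
        System.out.println("padded: " + Arrays.toString(padded));
    }
}
```

<p>Sorted as plain strings, the raw values come out as 10, 100, 9, while the padded values keep their numeric order.</p>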
<h3>Building</h3>
<p>You need Apache Ant 1.7 or higher, and a git client.</p>
<p>1. Download source from GitHub</p>
<pre>% git clone git://github.com/bixolabs/cascading.simpledb.git
% cd cascading.simpledb</pre>
<p>2. Set appropriate credentials for testing</p>
<pre>% edit src/test/resources/aws-keys.txt</pre>
<p>Enter valid AWS access key and secret key values for the two corresponding properties.</p>
<p>3. Build the jar</p>
<pre>% ant clean jar</pre>
<p>or to build and install the jar in your local Maven repo:</p>
<pre>% ant clean install</pre>
<p>4. Create Eclipse project files</p>
<pre>% ant eclipse</pre>
<p>Then, from Eclipse follow the standard procedure to import an existing Java project into your Workspace.</p>
]]></content:encoded>
			<wfw:commentRss>http://bixolabs.com/2010/03/16/simpledb-tap-for-cascading/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/cee2819ae7c1633e11ad2ef33de7794f?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">kkrugler</media:title>
		</media:content>
	</item>
		<item>
		<title>Crawler-commons project gets started</title>
		<link>http://bixolabs.com/2009/12/03/crawler-commons-project-gets-started/</link>
		<comments>http://bixolabs.com/2009/12/03/crawler-commons-project-gets-started/#comments</comments>
		<pubDate>Fri, 04 Dec 2009 04:41:35 +0000</pubDate>
		<dc:creator>kkrugler</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://bixolabs.com/?p=188</guid>
		<description><![CDATA[Back in November we helped put together a small gathering for web crawler developers at ApacheCon 2009. One of the key topics was how to share development efforts, versus each project independently implementing similar functionality. Out of this was born the crawler-commons project. As the main page says: The purpose of this project is to [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bixolabs.com&blog=10127813&post=188&subd=bixolabs&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<table>
<tr>
<td width="70"><img class="alignnone size-full wp-image-190" title="crawlercommons-logo" src="http://bixolabs.files.wordpress.com/2009/12/crawlercommons-logo.png?w=57&#038;h=48" alt="" width="57" height="48" /></td>
<td>Back in November we helped put together a small gathering for web crawler developers at ApacheCon 2009. One of the key topics was how to share development efforts, versus each project independently implementing similar functionality.</td>
</tr>
</table>
<p>Out of this was born the <a href="http://code.google.com/p/crawler-commons/" target="_blank">crawler-commons project</a>. As the main page says:</p>
<blockquote><p>The purpose of this project is to develop a set of reusable Java components that implement functionality common to any web crawler. These components would benefit from collaboration among various existing web crawler projects, and reduce duplication of effort.</p></blockquote>
<p>There&#8217;s a long list of functionality that is identical, or nearly so, between the various projects. The <a href="http://code.google.com/p/crawler-commons/w/list" target="_blank">project wiki</a> has a more detailed <a href="http://code.google.com/p/crawler-commons/wiki/ApacheCon2009Meetup" target="_blank">write-up from the ApacheCon meeting</a>, but a short list includes:</p>
<ul>
<li>robots.txt parsing</li>
<li>URL normalization</li>
<li>URL filtering</li>
<li>Domain name manipulation</li>
<li>HTML page cleaning</li>
<li>HttpClient configuration</li>
<li>Text similarity</li>
</ul>
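<p>To give a flavor of one item on that list, here is a minimal sketch of URL normalization using only java.net.URI. It is not the crawler-commons implementation; the class name UrlNormalizeSketch and the particular rules chosen (lower-casing scheme and host, dropping default ports and fragments, resolving dot segments, discarding any user info) are assumptions for illustration.</p>

```java
import java.net.URI;
import java.util.Locale;

public class UrlNormalizeSketch {
    // Reduce equivalent spellings of an absolute http/https URL to one form.
    static String normalize(String url) {
        // URI.normalize() resolves "." and ".." path segments.
        URI uri = URI.create(url.trim()).normalize();
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("absolute URL with a host required: " + url);
        }
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        int port = uri.getPort();
        boolean isDefaultPort = (scheme.equals("http") && port == 80)
            || (scheme.equals("https") && port == 443);
        // Keep the raw (still percent-encoded) path and query untouched.
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        StringBuilder out = new StringBuilder(scheme).append("://").append(host);
        if (port != -1 && !isDefaultPort) {
            out.append(':').append(port);
        }
        out.append(path);
        if (uri.getRawQuery() != null) {
            out.append('?').append(uri.getRawQuery());
        }
        // The fragment is dropped on purpose: it never reaches the server.
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("HTTP://Example.COM:80/a/./b/../c?q=1#frag"));
        System.out.println(normalize("https://example.com"));
    }
}
```

<p>A real crawler would likely need more rules than this, such as handling session IDs or sorting query parameters, which is part of why sharing one implementation is attractive.</p>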
<p>It&#8217;s still early, but some initial code has been submitted to the <a href="http://code.google.com/p/crawler-commons/source/browse/#svn/trunk" target="_blank">Google Code SVN repository</a>. And anybody with an interest in the area of Java web crawlers should use this <a href="http://code.google.com/feeds/p/crawler-commons/updates/basic" target="_blank">feed</a> to track project updates.</p>
]]></content:encoded>
			<wfw:commentRss>http://bixolabs.com/2009/12/03/crawler-commons-project-gets-started/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/cee2819ae7c1633e11ad2ef33de7794f?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">kkrugler</media:title>
		</media:content>

		<media:content url="http://bixolabs.files.wordpress.com/2009/12/crawlercommons-logo.png" medium="image">
			<media:title type="html">crawlercommons-logo</media:title>
		</media:content>
	</item>
		<item>
		<title>Public web crawler projects</title>
		<link>http://bixolabs.com/2009/12/02/public-web-crawler-projects/</link>
		<comments>http://bixolabs.com/2009/12/02/public-web-crawler-projects/#comments</comments>
		<pubDate>Wed, 02 Dec 2009 21:26:05 +0000</pubDate>
		<dc:creator>kkrugler</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[public terabyte dataset]]></category>
		<category><![CDATA[web crawler]]></category>
		<category><![CDATA[heritrix]]></category>
		<category><![CDATA[nutch]]></category>

		<guid isPermaLink="false">http://bixolabs.com/?p=184</guid>
		<description><![CDATA[Several people have pointed me to other public/non-profit projects doing large-scale public web crawls, so I thought I&#8217;d summarize the ones I now know about below. And if you know of others, please add your comments and I&#8217;ll update the list. Wayback Machine &#8211; A time-series snapshot of important web pages, from 1996 to present. [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bixolabs.com&blog=10127813&post=184&subd=bixolabs&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<p>Several people have pointed me to other public/non-profit projects doing large-scale public web crawls, so I thought I&#8217;d summarize the ones I now know about below. And if you know of others, please add your comments and I&#8217;ll update the list.</p>
<ul>
<li><a href="http://www.archive.org/web/web.php" target="_blank">Wayback Machine</a> &#8211; A time-series snapshot of important web pages, from 1996 to present. 150B pages crawled in total as of 2009. The data is searchable, but not available in raw format AFAIK. The work is part of the Internet Archive organization, and uses Heritrix for crawling.</li>
<li><a href="http://webarchives.cdlib.org/" target="_blank">CDL Web Archiving Service</a> &#8211; The California Digital Library provides the Web Archiving Service to enable librarians and scholars to create archives of captured web sites and publications. Similar to the Wayback Machine, they use Heritrix and other software from the Internet Archive, and the results are searchable but not available in raw format.</li>
<li><a href="http://www.commoncrawl.org/" target="_blank">CommonCrawl</a> &#8211; Their goal is to build, maintain and make widely available a comprehensive crawl of the Internet. They use Nutch (useragent is ccBot). I&#8217;ve seen Ahad Rana post to the Nutch list. So far I haven&#8217;t seen any actual search or raw data results from this project. They do have a cool public &#8220;<a href="http://www.commoncrawl.org/crawlstats.html" target="_blank">crawl stats</a>&#8221; page.</li>
<li><a href="http://www.webarchive.org.uk/ukwa/" target="_blank">UK Web Archive</a> &#8211; A &#8220;Wayback Machine&#8221; for UK web sites. Provided by the British Library. Searchable, but no raw data that I can see. They in turn sponsor the <a href="http://webcurator.sourceforge.net/" target="_blank">Web Curator Tool</a>, which is an open-source workflow management application for selective web archiving (driver for Heritrix).</li>
<li><a href="http://www.isara.org/search/" target="_blank">Isara Search</a> &#8211; A project sponsored by Isara Charity Organization to build the world&#8217;s first non-profit search engine. Based in Thailand, using Nutch. No search/data available yet.</li>
<li><a href="http://boston.lti.cs.cmu.edu/Data/clueweb09/" target="_blank">ClueWeb09</a> &#8211; The ClueWeb09 dataset was created by the <a href="http://www.lti.cs.cmu.edu/">Language Technologies Institute</a> at <a href="http://www.cmu.edu/">Carnegie Mellon University</a> to support research on information retrieval and related human language technologies.  The dataset consists of 1 billion web pages, in ten languages, collected in January and February 2009. The data is available to researchers who sign a legal agreement and pay $750 for the hard disks needed to store the data.</li>
<li><a href="http://diglib.stanford.edu:8091/~testbed/doc2/WebBase/" target="_blank">WebBase</a> &#8211; The Stanford WebBase project has been collecting topic focused snapshots of Web sites. All the resulting archives are available to the public via fast <a href="http://wb3.stanford.edu/%7Etestbed/cgi-bin/crawlStreamingControls.pl" target="_blank">download streams</a>. The useragent is WebVac (was Pita). There&#8217;s also a <a href="http://wb8.stanford.edu/~testbed/cgi-bin/crawlStreamingControls.pl" target="_blank">web GUI</a> for fetching specific crawl sets.</li>
<li><a href="http://law.dsi.unimi.it/" target="_blank">Laboratory for Web Algorithmics</a> &#8211; Uses <a href="http://law.dsi.unimi.it/index.php?option=com_content;task=view;id=34;Itemid=42" target="_blank">UbiCrawler</a> to create large-scale <a href="http://law.dsi.unimi.it/index.php?option=com_include&amp;Itemid=65" target="_blank">link graph datasets</a> that can be freely downloaded.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://bixolabs.com/2009/12/02/public-web-crawler-projects/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/cee2819ae7c1633e11ad2ef33de7794f?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">kkrugler</media:title>
		</media:content>
	</item>
		<item>
		<title>Proposals for Big Data web mining talk</title>
		<link>http://bixolabs.com/2009/11/16/proposals-for-big-data-web-mining-talk/</link>
		<comments>http://bixolabs.com/2009/11/16/proposals-for-big-data-web-mining-talk/#comments</comments>
		<pubDate>Mon, 16 Nov 2009 19:44:05 +0000</pubDate>
		<dc:creator>kkrugler</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[acm]]></category>
		<category><![CDATA[data mining]]></category>
		<category><![CDATA[mahout]]></category>
		<category><![CDATA[public terabyte dataset]]></category>

		<guid isPermaLink="false">http://bixolabs.com/?p=175</guid>
		<description><![CDATA[I&#8217;m going to be giving a talk at the Bay Area ACM data mining SIG in December, and I need to finalize my topic soon &#8211; like today I was going to expand on my Elastic Web Mining talk (&#8220;Web mining for SEO keywords&#8221;) from the ACM data mining unconference a few weeks back. But [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m going to be giving a talk at the <a href="http://sfbayacm.org/dmsig.php" target="_blank">Bay Area ACM data mining SIG</a> in December, and I need to finalize my topic soon &#8211; like today <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>I was going to expand on my <a href="/2009/11/02/elastic-web-mining-talk/">Elastic Web Mining talk</a> (&#8220;Web mining for SEO keywords&#8221;) from the <a href="http://events.linkedin.com/events/142420/clickthru" target="_blank">ACM data mining unconference</a> a few weeks back.</p>
<p>But the fact that I&#8217;ll have data from 10s to 100s of millions of web pages to work with, from the <a href="/datasets/public-terabyte-dataset-project/">public terabyte dataset</a> crawl, makes me want to apply <a href="http://lucene.apache.org/mahout/" target="_blank">Mahout</a> to the data.</p>
<p>I tossed out one idea on the Mahout list, looking for input:</p>
<ul>
<li>I&#8217;d like to automatically generate a timeline of events.</li>
<li>I can extract potential dates from web pages, using simple patterns.</li>
<li>I can extract 2-to-4 word terms (skipping those which start/end with stop words) from pages that have extracted dates.</li>
<li>And then by the miracle of LDA (<a href="http://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html" target="_blank">latent dirichlet allocation</a>), I get clusters of date+terms.</li>
</ul>
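<p>The two extraction steps in the list above could be sketched roughly as follows. This is an illustration only, not code from the crawl: the date pattern, the tiny stop-word list, and the function names <code>extract_dates</code> and <code>extract_terms</code> are all assumptions made for the example.</p>

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one would be far longer.
STOP_WORDS = {"a", "an", "and", "at", "in", "of", "on", "the", "to"}

MONTHS = (
    "january|february|march|april|may|june|july|"
    "august|september|october|november|december"
)
# Assumed "simple pattern": "January 20, 2009" or "20 January 2009".
DATE_PATTERN = re.compile(
    rf"\b(?:(?:{MONTHS})\s+\d{{1,2}},?\s+\d{{4}}|\d{{1,2}}\s+(?:{MONTHS})\s+\d{{4}})\b",
    re.IGNORECASE,
)


def extract_dates(text):
    """Return every substring of text that matches the date pattern."""
    return DATE_PATTERN.findall(text)


def extract_terms(text):
    """Return 2-to-4 word terms that neither start nor end with a stop word."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    terms = []
    for size in range(2, 5):
        for start in range(len(words) - size + 1):
            gram = words[start:start + size]
            if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                continue
            terms.append(" ".join(gram))
    return terms
```

<p>Terms would only be extracted from pages where <code>extract_dates</code> finds something. Note that this sketch ignores sentence boundaries, so some terms will straddle two sentences.</p>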
<p>But in this example, I don&#8217;t actually need LDA &#8211; I have my &#8220;topic&#8221;, which is the date. So it might not be a very good example. And will LDA scale to 100M web pages (which implies many billions of terms)? And how will I handle the same term (e.g. &#8220;barack inauguration&#8221;) being associated with a cluster of dates, since stories from a range of dates before/after the event will contain that same term?</p>
<p>So it could be a non-starter &#8211; I&#8217;m hoping for input on feasibility and level of effort. And if somebody else has a suggestion for something simple that could provide interesting/obvious results, I&#8217;m all ears.</p>
<p>Thanks!</p>
<p>&#8211; Ken</p>
<p>PS &#8211; my current fall-back is to just do brute-force map-reduce to come up with lists of terms per unique date, pick the top N, and maybe do some filtering for top-level terms that have too many associated unique dates. Which unfortunately wouldn&#8217;t use Mahout, but would be an example of crunching lots of data.</p>
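<p>To make the fall-back concrete, here is a single-machine sketch of the same counting logic. It is not a Hadoop job: the function name <code>top_terms_per_date</code> and both thresholds are hypothetical, and a real run over the crawl would spread the two passes across map and reduce tasks.</p>

```python
from collections import Counter, defaultdict


def top_terms_per_date(records, top_n=3, max_dates_per_term=2):
    """records: iterable of (date, term) pairs, one per term occurrence."""
    # "Map" side: count terms per date and remember which dates each term touches.
    counts_by_date = defaultdict(Counter)
    dates_by_term = defaultdict(set)
    for date, term in records:
        counts_by_date[date][term] += 1
        dates_by_term[term].add(date)

    # "Reduce" side: drop terms tied to too many unique dates, keep the top N.
    result = {}
    for date, counts in counts_by_date.items():
        kept = Counter({
            term: count for term, count in counts.items()
            if max_dates_per_term >= len(dates_by_term[term])
        })
        result[date] = [term for term, _ in kept.most_common(top_n)]
    return result
```

<p>The <code>max_dates_per_term</code> cut-off is what removes generic terms that show up next to almost every date.</p>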
]]></content:encoded>
			<wfw:commentRss>http://bixolabs.com/2009/11/16/proposals-for-big-data-web-mining-talk/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/cee2819ae7c1633e11ad2ef33de7794f?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">kkrugler</media:title>
		</media:content>
	</item>
		<item>
		<title>Web Miners vs Web Masters &#8211; An Uneasy Truce</title>
		<link>http://bixolabs.com/2009/11/11/web-miners-vs-web-masters-an-uneasy-truce/</link>
		<comments>http://bixolabs.com/2009/11/11/web-miners-vs-web-masters-an-uneasy-truce/#comments</comments>
		<pubDate>Wed, 11 Nov 2009 17:10:03 +0000</pubDate>
		<dc:creator>kkrugler</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[polite crawling]]></category>
		<category><![CDATA[robots]]></category>
		<category><![CDATA[web masters]]></category>
		<category><![CDATA[web mining]]></category>

		<guid isPermaLink="false">http://bixolabs.com/?p=163</guid>
		<description><![CDATA[The life of a webmaster is hard, and web crawlers make it harder http://www.flickr.com/photos/absolutely_loverly/ / CC BY 2.0 &#160; There&#8217;s the daily drama of keeping both web site users and web site developers happy. Now mix in the unpredictable side effects of having automated agents hitting the site, and you can see why webmasters might [...]]]></description>
			<content:encoded><![CDATA[<h2>The life of a webmaster is hard, and web crawlers make it harder</h2>
<table>
<tbody>
<tr>
<td><a href="http://www.flickr.com/photos/absolutely_loverly/2953035408/"><img class="alignnone size-full wp-image-167" title="Angry Face" src="http://bixolabs.files.wordpress.com/2009/11/angry-face.png?w=206&#038;h=206" alt="Angry Face" width="206" height="206" /></a>
<div><a rel="cc:attributionURL" href="http://www.flickr.com/photos/absolutely_loverly/">http://www.flickr.com/photos/absolutely_loverly/</a> / <a rel="license" href="http://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a></div>
<p>&nbsp;</p>
</td>
<td align="top">There&#8217;s the daily drama of keeping both web site users and web site developers happy. Now mix in the unpredictable side effects of having automated agents hitting the site, and you can see why webmasters might think many <a href="http://www.webmasterworld.com/forum39/4119.htm" target="_blank">web crawlers are evil</a>.</td>
</tr>
</tbody>
</table>
<p>But web crawlers serve a very real, important role in the life of a successful site, and it&#8217;s all about <strong>traffic</strong>. Without search engines like Google and Yahoo/Bing, most sites would be invisible to most users.</p>
<h2>Implicit Contracts</h2>
<p>An unwritten agreement exists between webmasters and web crawlers, and it reads something like this: you don&#8217;t overload my site, and you bring traffic my way. In return, I&#8217;ll give you free access to lots of valuable content that I host.</p>
<p>And that&#8217;s worked reasonably well, for the past 15 years. Yes, there are crawlers that ignore the <a href="http://en.wikipedia.org/wiki/Robots_exclusion_standard" target="_blank">Robots Exclusion Standard</a>. And there are crawlers that overload the site by hammering it with lots of simultaneous requests for hours on end. And sometimes a crawler goes a little crazy and spends hours trying to fetch non-existent pages using bogus URLs that it incorrectly derived from content on the site&#8217;s pages. For the most part, though, web crawlers try to do the Right Thing, and webmasters can always block rogue crawlers by IP address.</p>
<h2>Web Mining != Search Index</h2>
<p>But now you&#8217;ve got web miners &#8211; automated agents that collect data which often doesn&#8217;t wind up in a search index. And that means no traffic from searches. And thus the implicit contract has been broken.</p>
<p>It hasn&#8217;t happened yet, but I can see a day when many sites set up their robots.txt to allow the major search engines access, and then block everybody else.</p>
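<p>A robots.txt along those lines might look like the string in the sketch below. The three crawler tokens stand in for &#8220;the major search engines&#8221;, <code>ExampleMinerBot</code> is a hypothetical web miner, and the check uses the <code>urllib.robotparser</code> module from Python&#8217;s standard library.</p>

```python
from urllib.robotparser import RobotFileParser

# Assumed policy: named search engine crawlers may fetch everything,
# every other user agent is blocked from the whole site.
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: Slurp
User-agent: bingbot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, url="http://example.com/some/page.html"):
    """Return True if the given user agent may fetch url under ROBOTS_TXT."""
    return parser.can_fetch(agent, url)
```

<p>With this file in place, <code>allowed("Googlebot")</code> is true while <code>allowed("ExampleMinerBot")</code> is false, which is exactly the outcome a polite web miner would have to live with.</p>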
<p>What does this mean for the web eco-system? Three things, one for each participant:</p>
<ol>
<li>Web miners need to <strong>crawl extra-super-politely</strong>.</li>
<li>Customers need to work with key sites to <strong>pick good crawl times</strong>.</li>
<li>Web sites need to <strong>offer for-fee APIs</strong> for data mining.</li>
</ol>
<p>The first point is the easiest one to solve &#8211; never hit a site with more than one simultaneous request, never fetch more than a handful of pages a minute, and respect all robots.txt restrictions.</p>
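<p>The first two rules can be sketched as a small per-host scheduler. This is a minimal illustration under stated assumptions, not how any particular crawler implements politeness: the class name <code>PoliteScheduler</code> is hypothetical, &#8220;a handful&#8221; is taken to mean five pages a minute, and robots.txt checks (which Python&#8217;s <code>urllib.robotparser</code> can handle) are left out.</p>

```python
from urllib.parse import urlsplit


class PoliteScheduler:
    """Tracks, per host, whether a new request may start right now."""

    def __init__(self, pages_per_minute=5):
        # Seconds that must pass between requests to the same host.
        self.min_delay = 60.0 / pages_per_minute
        self.next_allowed = {}  # host -> earliest start time of the next request
        self.in_flight = set()  # hosts with a request currently outstanding

    @staticmethod
    def _host(url):
        return urlsplit(url).netloc.lower()

    def try_start(self, url, now):
        """Return True and reserve the host if url may be fetched at time now."""
        host = self._host(url)
        if host in self.in_flight:
            return False  # never more than one simultaneous request per host
        if self.next_allowed.get(host, 0.0) > now:
            return False  # too soon after the previous request to this host
        self.in_flight.add(host)
        return True

    def finish(self, url, now):
        """Record that the request for url completed at time now."""
        host = self._host(url)
        self.in_flight.discard(host)
        self.next_allowed[host] = now + self.min_delay
```

<p>Passing the time in as <code>now</code> keeps the sketch easy to test; a real fetcher would pass <code>time.monotonic()</code> and re-queue any URL for which <code>try_start</code> returns false.</p>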
<p>The second is a bit harder, as it currently requires person-to-person contact with the web site in question. There are work-arounds: it&#8217;s possible to derive these &#8220;good crawl times&#8221; by varying the request rate and watching how the site&#8217;s response performance changes. But eventually I expect to see an extension to robots.txt that lets the site owner provide additional clues to web crawlers about good and bad times for crawling.</p>
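<p>One way to picture that work-around is a delay that reacts to how quickly the site answers. The sketch below is an assumption about how this could be done, not a description of an existing crawler: the function name <code>next_delay</code> and every threshold in it are hypothetical.</p>

```python
def next_delay(current_delay, response_seconds,
               slow_threshold=2.0, min_delay=12.0, max_delay=600.0):
    """Return the delay in seconds to wait before the next request to a site."""
    if response_seconds > slow_threshold:
        # The site looks busy: back off by doubling the delay, up to a cap.
        return min(current_delay * 2.0, max_delay)
    # The site is answering quickly: ease back toward the minimum delay.
    return max(current_delay * 0.9, min_delay)
```

<p>Logging the delay this function settles on for each hour of the day would give a rough picture of when a site has spare capacity.</p>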
<p>The last point, about providing APIs, is the most long-term but also the most powerful. There are many web APIs out there, some of which provide access to valuable web data, but few offer a pay-to-play model. Most are rate limited, where you need to cut special deals if you exceed some relatively low daily threshold. Many have serious terms of use restrictions that limit a caller&#8217;s ability to actually mine the response data &#8211; often the only option is to republish it, with links/attribution back to the originating site.</p>
<p>What would be great is if everybody had a model like Amazon&#8217;s <a href="http://aws.amazon.com/awis/" target="_blank">AWIS</a>, where X requests cost N dollars. You can decide how much or how little to spend. There aren&#8217;t many restrictions on rate/volume or usage. And as a huge added bonus, the data comes back structured, so you don&#8217;t have to waste time hand-crafting some fragile, error-prone HTML scraping code.</p>
<p>And a side-note to companies thinking about the API issue &#8211; if you don&#8217;t provide one, and you block web miners, then you&#8217;ll get crawled anyway, in stealth mode by less scrupulous firms. So then everybody loses, since you&#8217;ll still be giving free access while taking a performance hit, while companies that need the data pay more to these &#8220;stealth crawlers&#8221; and get worse results.</p>
]]></content:encoded>
			<wfw:commentRss>http://bixolabs.com/2009/11/11/web-miners-vs-web-masters-an-uneasy-truce/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/cee2819ae7c1633e11ad2ef33de7794f?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">kkrugler</media:title>
		</media:content>

		<media:content url="http://bixolabs.files.wordpress.com/2009/11/angry-face.png" medium="image">
			<media:title type="html">Angry Face</media:title>
		</media:content>
	</item>
		<item>
		<title>Paul O&#8217;Rorke summary of elastic web mining talk</title>
		<link>http://bixolabs.com/2009/11/04/paul-ororke-summary-of-elastic-web-mining-talk/</link>
		<comments>http://bixolabs.com/2009/11/04/paul-ororke-summary-of-elastic-web-mining-talk/#comments</comments>
		<pubDate>Wed, 04 Nov 2009 18:01:49 +0000</pubDate>
		<dc:creator>kkrugler</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[cascading]]></category>
		<category><![CDATA[elastic web mining]]></category>
		<category><![CDATA[workflow]]></category>

		<guid isPermaLink="false">http://bixolabs.com/?p=132</guid>
		<description><![CDATA[Paul posted a nice summary of my elastic web mining talk over at his blog. He captured one of the key points I was trying to make when he said: It was impressive to see how much of the processing was generated by Bixo and Cascading and how only a small fraction of the code [...]]]></description>
			<content:encoded><![CDATA[<p>Paul posted a nice summary of my <a title="web mining talk" href="http://ororke.com/paul/blog/?p=261">elastic web mining</a> talk over at his blog. He captured one of the key points I was trying to make when he said:</p>
<blockquote><p>It was impressive to see how much of the processing was generated by Bixo and Cascading and how only a small fraction of the code needed to be custom coded “by hand.”</p></blockquote>
<p>That&#8217;s a recurring theme when I show workflow graphs (dot files generated by Cascading) for example web mining applications that I&#8217;ve created. The real work is in figuring out what needs to be done (defining the workflow), not the coding to create the workflow or the custom bits that need to be added.</p>
<div id="attachment_133" class="wp-caption alignnone" style="width: 310px"><a href="http://bixolabs.files.wordpress.com/2009/11/workflow-dag.png"><img class="size-medium wp-image-133" title="Workflow Graph" src="http://bixolabs.files.wordpress.com/2009/11/workflow-dag.png?w=300&#038;h=203" alt="Workflow Graph" width="300" height="203" /></a><p class="wp-caption-text">Web mining app workflow</p></div>
<p>In the above graph, the purple ovals represent custom code, and of those six I could have cut out two by using existing Cascading operators with some regular expression juju. Add in the new Bixo utility operator for loading URLs into the workflow plus new Tika support for parsing mbox files, and you&#8217;re down to two custom operators &#8211; parsing the top-level &#8220;mailbox archives&#8221; page to find the monthly mailbox archives, and scoring the emails.</p>
<p>The blue and yellow ovals are pre-defined Cascading &amp; Bixo operators (respectively).</p>
<p>And while the total workflow looks very complex, this was defined in about a page of Java code.</p>
]]></content:encoded>
			<wfw:commentRss>http://bixolabs.com/2009/11/04/paul-ororke-summary-of-elastic-web-mining-talk/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/cee2819ae7c1633e11ad2ef33de7794f?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">kkrugler</media:title>
		</media:content>

		<media:content url="http://bixolabs.files.wordpress.com/2009/11/workflow-dag.png?w=300" medium="image">
			<media:title type="html">Workflow Graph</media:title>
		</media:content>
	</item>
		<item>
		<title>Elastic Web Mining Talk</title>
		<link>http://bixolabs.com/2009/11/02/elastic-web-mining-talk/</link>
		<comments>http://bixolabs.com/2009/11/02/elastic-web-mining-talk/#comments</comments>
		<pubDate>Tue, 03 Nov 2009 02:32:20 +0000</pubDate>
		<dc:creator>kkrugler</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[acm]]></category>
		<category><![CDATA[data mining]]></category>
		<category><![CDATA[presentation]]></category>
		<category><![CDATA[web mining]]></category>

		<guid isPermaLink="false">http://bixolabs.com/?p=125</guid>
		<description><![CDATA[Here&#8217;s the presentation I gave at the ACM data mining unconference on elastic web mining &#8211; how to create scalable, reliable and cost effective web mining solutions using an open source stack (Hadoop, Cascading, Bixo) running in Amazon&#8217;s Elastic Compute Cloud (EC2). But I don&#8217;t see my notes showing up, so here&#8217;s the PDF version [...]]]></description>
			<content:encoded><![CDATA[<p>Here&#8217;s the presentation I gave at the ACM data mining unconference on elastic web mining &#8211; how to create scalable, reliable and cost effective web mining solutions using an open source stack (Hadoop, Cascading, Bixo) running in Amazon&#8217;s Elastic Compute Cloud (EC2).</p>
<p><object type='application/x-shockwave-flash' wmode='opaque' data='http://static.slideshare.net/swf/ssplayer2.swf?id=2407600&#038;doc=acmuctalk-091102194640-phpapp02' width='600' height='492'><param name='movie' value='http://static.slideshare.net/swf/ssplayer2.swf?id=2407600&#038;doc=acmuctalk-091102194640-phpapp02' /><param name='allowFullScreen' value='true' /><param name='allowScriptAccess' value='always' /></object></p>
<p>But I don&#8217;t see my notes showing up, so here&#8217;s the PDF version with full notes, which make the resulting slides a lot more meaningful.</p>
<p><object style='margin: 0px;' width='600' height='492'><param name='movie' value='http://static.slidesharecdn.com/swf/ssplayerd.swf?doc=acmtalk-slideshare-091102203022-phpapp02' /><param name='allowFullScreen' value='true' /><param name='allowScriptAccess' value='always' /><param name='wmode' value='opaque' /><embed src='http://static.slidesharecdn.com/swf/ssplayerd.swf?doc=acmtalk-slideshare-091102203022-phpapp02' type='application/x-shockwave-flash' allowscriptaccess='always' allowfullscreen='true' wmode='opaque' width='600' height='492'></embed></object></p>
]]></content:encoded>
			<wfw:commentRss>http://bixolabs.com/2009/11/02/elastic-web-mining-talk/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/cee2819ae7c1633e11ad2ef33de7794f?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">kkrugler</media:title>
		</media:content>
	</item>
	</channel>
</rss>