<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Elastic Web Mining &#124; Bixolabs</title>
	<atom:link href="http://bixolabs.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://bixolabs.com</link>
	<description></description>
	<lastBuildDate>Fri, 04 Dec 2009 04:45:04 +0000</lastBuildDate>
	<generator>http://wordpress.com/</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<cloud domain='bixolabs.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://www.gravatar.com/blavatar/308be8690ccbdf7ea4dff361b3bbc457?s=96&#038;d=http://s2.wp.com/i/buttonw-com.png</url>
		<title>Elastic Web Mining &#124; Bixolabs</title>
		<link>http://bixolabs.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://bixolabs.com/osd.xml" title="Elastic Web Mining | Bixolabs" />
	<atom:link rel='hub' href='http://bixolabs.com/?pushpress=hub'/>
		<item>
		<title>Crawler-commons project gets started</title>
		<link>http://bixolabs.com/2009/12/03/crawler-commons-project-gets-started/</link>
		<comments>http://bixolabs.com/2009/12/03/crawler-commons-project-gets-started/#comments</comments>
		<pubDate>Fri, 04 Dec 2009 04:41:35 +0000</pubDate>
		<dc:creator>kkrugler</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://bixolabs.com/?p=188</guid>
		<description><![CDATA[


Back in November we helped put together a small gathering for web crawler developers at ApacheCon 2009. One of the key topics was how to share development efforts, versus each project independently implementing similar functionality.


Out of this was born the crawler-commons project. As the main page says:
The purpose of this project is to develop a [...]]]></description>
			<content:encoded><![CDATA[<table>
<tr>
<td width="70"><img class="alignnone size-full wp-image-190" title="crawlercommons-logo" src="http://bixolabs.files.wordpress.com/2009/12/crawlercommons-logo.png?w=57&#038;h=48" alt="" width="57" height="48" /></td>
<td>Back in November we helped put together a small gathering for web crawler developers at ApacheCon 2009. One of the key topics was how to share development efforts, versus each project independently implementing similar functionality.</td>
</tr>
</table>
<p>Out of this was born the <a href="http://code.google.com/p/crawler-commons/" target="_blank">crawler-commons project</a>. As the main page says:</p>
<blockquote><p>The purpose of this project is to develop a set of reusable Java components that implement functionality common to any web crawler. These components would benefit from collaboration among various existing web crawler projects, and reduce duplication of effort.</p></blockquote>
<p>There&#8217;s a long list of functionality that is identical, or nearly so, between the various projects. The <a href="http://code.google.com/p/crawler-commons/w/list" target="_blank">project wiki</a> has a more detailed <a href="http://code.google.com/p/crawler-commons/wiki/ApacheCon2009Meetup" target="_blank">write-up from the ApacheCon meeting</a>, but a short list includes:</p>
<ul>
<li>robots.txt parsing</li>
<li>URL normalization</li>
<li>URL filtering</li>
<li>Domain name manipulation</li>
<li>HTML page cleaning</li>
<li>HttpClient configuration</li>
<li>Text similarity</li>
</ul>
<p>It&#8217;s still early, but some initial code has been submitted to the <a href="http://code.google.com/p/crawler-commons/source/browse/#svn/trunk" target="_blank">Google Code SVN repository</a>. And anybody with an interest in the area of Java web crawlers should use this <a href="http://code.google.com/feeds/p/crawler-commons/updates/basic" target="_blank">feed</a> to track project updates.</p>
]]></content:encoded>
			<wfw:commentRss>http://bixolabs.com/2009/12/03/crawler-commons-project-gets-started/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/cee2819ae7c1633e11ad2ef33de7794f?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">kkrugler</media:title>
		</media:content>

		<media:content url="http://bixolabs.files.wordpress.com/2009/12/crawlercommons-logo.png" medium="image">
			<media:title type="html">crawlercommons-logo</media:title>
		</media:content>
	</item>
		<item>
		<title>Public web crawler projects</title>
		<link>http://bixolabs.com/2009/12/02/public-web-crawler-projects/</link>
		<comments>http://bixolabs.com/2009/12/02/public-web-crawler-projects/#comments</comments>
		<pubDate>Wed, 02 Dec 2009 21:26:05 +0000</pubDate>
		<dc:creator>kkrugler</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[heritrix]]></category>
		<category><![CDATA[nutch]]></category>
		<category><![CDATA[public terabyte dataset]]></category>
		<category><![CDATA[web crawler]]></category>

		<guid isPermaLink="false">http://bixolabs.com/?p=184</guid>
		<description><![CDATA[Several people have pointed me to other public/non-profit projects doing large-scale public web crawls, so I thought I&#8217;d summarize the ones I now know about below. And if you know of others, please add your comments and I&#8217;ll update the list.

Wayback Machine &#8211; A time-series snapshot of important web pages, from 1996 to present. 150B [...]]]></description>
			<content:encoded><![CDATA[<p>Several people have pointed me to other public/non-profit projects doing large-scale public web crawls, so I thought I&#8217;d summarize the ones I now know about below. And if you know of others, please add your comments and I&#8217;ll update the list.</p>
<ul>
<li><a href="http://www.archive.org/web/web.php" target="_blank">Wayback Machine</a> &#8211; A time-series snapshot of important web pages, from 1996 to present. 150B pages crawled in total as of 2009. The data is searchable, but not available in raw format AFAIK. The work is part of the Internet Archive organization, and uses Heritrix for crawling.</li>
<li><a href="http://webarchives.cdlib.org/" target="_blank">CDL Web Archiving Service</a> &#8211; The California Digital Library provides the Web Archiving Service to enable librarians and scholars to create archives of captured web sites and publications. Similar to the Wayback Machine, they use Heritrix and other software from the Internet Archive, and the results are searchable but not available in raw format.</li>
<li><a href="http://www.commoncrawl.org/" target="_blank">CommonCrawl</a> &#8211; Their goal is to build, maintain and make widely available a comprehensive crawl of the Internet. They use Nutch (useragent is ccBot). I&#8217;ve seen Ahad Rana post to the Nutch list. So far I haven&#8217;t seen any actual search or raw data results from this project. They do have a cool public &#8220;<a href="http://www.commoncrawl.org/crawlstats.html" target="_blank">crawl stats</a>&#8221; page.</li>
<li><a href="http://www.webarchive.org.uk/ukwa/" target="_blank">UK Web Archive</a> &#8211; A &#8220;Wayback Machine&#8221; for UK web sites. Provided by the British Library. Searchable, but no raw data that I can see. They in turn sponsor the <a href="http://webcurator.sourceforge.net/" target="_blank">Web Curator Tool</a>, which is an open-source workflow management application for selective web archiving (driver for Heritrix).</li>
<li><a href="http://www.isara.org/search/" target="_blank">Isara Search</a> &#8211; A project sponsored by Isara Charity Organization to build the world&#8217;s first non-profit search engine. Based in Thailand, using Nutch. No search/data available yet.</li>
<li><a href="http://boston.lti.cs.cmu.edu/Data/clueweb09/" target="_blank">ClueWeb09</a> &#8211; The ClueWeb09 dataset was created by the <a href="http://www.lti.cs.cmu.edu/">Language Technologies Institute</a> at <a href="http://www.cmu.edu/">Carnegie Mellon University</a> to support research on information retrieval and related human language technologies.  The dataset consists of 1 billion web pages, in ten languages, collected in January and February 2009. The data is available to researchers who sign a legal agreement and pay $750 for the hard disks needed to store the data.</li>
<li><a href="http://diglib.stanford.edu:8091/~testbed/doc2/WebBase/" target="_blank">WebBase</a> &#8211; The Stanford WebBase project has been collecting topic focused snapshots of Web sites. All the resulting archives are available to the public via fast <a href="http://wb3.stanford.edu/%7Etestbed/cgi-bin/crawlStreamingControls.pl" target="_blank">download streams</a>. The useragent is WebVac (was Pita). There&#8217;s also a <a href="http://wb8.stanford.edu/~testbed/cgi-bin/crawlStreamingControls.pl" target="_blank">web GUI</a> for fetching specific crawl sets.</li>
<li><a href="http://law.dsi.unimi.it/" target="_blank">Laboratory for Web Algorithmics</a> &#8211; Uses <a href="http://law.dsi.unimi.it/index.php?option=com_content;task=view;id=34;Itemid=42" target="_blank">UbiCrawler</a> to create large-scale <a href="http://law.dsi.unimi.it/index.php?option=com_include&amp;Itemid=65" target="_blank">link graph datasets</a> that can be freely downloaded.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://bixolabs.com/2009/12/02/public-web-crawler-projects/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/cee2819ae7c1633e11ad2ef33de7794f?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">kkrugler</media:title>
		</media:content>
	</item>
		<item>
		<title>Proposals for Big Data web mining talk</title>
		<link>http://bixolabs.com/2009/11/16/proposals-for-big-data-web-mining-talk/</link>
		<comments>http://bixolabs.com/2009/11/16/proposals-for-big-data-web-mining-talk/#comments</comments>
		<pubDate>Mon, 16 Nov 2009 19:44:05 +0000</pubDate>
		<dc:creator>kkrugler</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[acm]]></category>
		<category><![CDATA[data mining]]></category>
		<category><![CDATA[mahout]]></category>
		<category><![CDATA[public terabyte dataset]]></category>

		<guid isPermaLink="false">http://bixolabs.com/?p=175</guid>
		<description><![CDATA[I&#8217;m going to be giving a talk at the Bay Area ACM data mining SIG in December, and I need to finalize my topic soon &#8211; like today  
I was going to expand on my Elastic Web Mining talk (&#8220;Web mining for SEO keywords&#8221;) from the ACM data mining unconference a few weeks back.
But [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m going to be giving a talk at the <a href="http://sfbayacm.org/dmsig.php" target="_blank">Bay Area ACM data mining SIG</a> in December, and I need to finalize my topic soon &#8211; like today <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>I was going to expand on my <a href="/2009/11/02/elastic-web-mining-talk/">Elastic Web Mining talk</a> (&#8220;Web mining for SEO keywords&#8221;) from the <a href="http://events.linkedin.com/events/142420/clickthru" target="_blank">ACM data mining unconference</a> a few weeks back.</p>
<p>But the fact that I&#8217;ll have tens to hundreds of millions of web pages to work with, from the <a href="/datasets/public-terabyte-dataset-project/">public terabyte dataset</a> crawl, makes me want to apply <a href="http://lucene.apache.org/mahout/" target="_blank">Mahout</a> to the data.</p>
<p>I tossed out one idea on the Mahout list, looking for input:</p>
<ul>
<li>I&#8217;d like to automatically generate a timeline of events.</li>
<li>I can extract potential dates from web pages, using simple patterns.</li>
<li>I can extract 2-to-4 word terms (skipping those which start/end with stop words) from pages that have extracted dates.</li>
<li>And then by the miracle of LDA (<a href="http://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html" target="_blank">latent dirichlet allocation</a>), I get clusters of date+terms.</li>
</ul>
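<p>As a rough illustration of the two extraction steps in the list above, here is a minimal Python sketch. The date pattern (only &#8220;Month D, YYYY&#8221;), the tiny stop-word list, and the function names are all assumptions made for the example, not code from the actual project.</p>

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the", "to", "was"}

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Assumed pattern: matches dates such as "January 20, 2009" and nothing else.
DATE_PATTERN = re.compile(r"\b(" + MONTHS + r") (\d{1,2}), (\d{4})\b")


def extract_dates(text):
    """Return every 'Month D, YYYY' date found in the text, as strings."""
    return [m.group(0) for m in DATE_PATTERN.finditer(text)]


def extract_terms(text, min_len=2, max_len=4):
    """Return 2-to-4 word terms that neither start nor end with a stop word."""
    words = re.findall(r"[a-z]+", text.lower())
    terms = []
    for n in range(min_len, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            # Skip terms whose first or last word is a stop word.
            if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                continue
            terms.append(" ".join(gram))
    return terms
```

<p>Terms would only be extracted from pages where <code>extract_dates</code> returns at least one date, so each term can be paired with the date(s) found on the same page.</p>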
<p>But in this example, I don&#8217;t actually need LDA &#8211; I have my &#8220;topic&#8221;, which is the date. So it might not be a very good example. And will LDA scale to 100M web pages (which implies many billions of terms)? And how will I handle the same term (e.g. &#8220;barack inauguration&#8221;) being associated with a cluster of dates, since stories from a range of dates before/after the event will contain that same term?</p>
<p>So it could be a non-starter &#8211; I&#8217;m hoping for input on feasibility and level of effort. And if somebody else has a suggestion for something simple that could provide interesting/obvious results, I&#8217;m all ears.</p>
<p>Thanks!</p>
<p>&#8211; Ken</p>
<p>PS &#8211; my current fall-back is to just do brute-force map-reduce to come up with lists of terms per unique date, pick the top N, and maybe do some filtering for top-level terms that have too many associated unique dates. Which unfortunately wouldn&#8217;t use Mahout, but would be an example of crunching lots of data.</p>
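<p>The fall-back described in the PS can be sketched in a few lines of single-machine Python. This only mimics the shape of a map-reduce job (group by date, count, take the top N, filter overly common terms); the function name, the <code>(date, term)</code> record layout, and the thresholds are assumptions for illustration.</p>

```python
from collections import Counter, defaultdict


def top_terms_per_date(records, top_n=3, max_dates_per_term=2):
    """records: iterable of (date, term) pairs, one per term occurrence on a dated page."""
    # "Map" side: count terms per date and remember which dates each term appears with.
    counts_by_date = defaultdict(Counter)
    dates_by_term = defaultdict(set)
    for date, term in records:
        counts_by_date[date][term] += 1
        dates_by_term[term].add(date)

    # "Reduce" side: drop terms tied to too many unique dates, then keep the top N per date.
    result = {}
    for date, counts in counts_by_date.items():
        kept = Counter({t: c for t, c in counts.items()
                        if len(dates_by_term[t]) <= max_dates_per_term})
        result[date] = [t for t, _ in kept.most_common(top_n)]
    return result
```

<p>A term such as &#8220;web site&#8221; that shows up alongside many different dates is filtered out, while a term concentrated on one or two dates survives.</p>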
]]></content:encoded>
			<wfw:commentRss>http://bixolabs.com/2009/11/16/proposals-for-big-data-web-mining-talk/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/cee2819ae7c1633e11ad2ef33de7794f?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">kkrugler</media:title>
		</media:content>
	</item>
		<item>
		<title>Web Miners vs Web Masters &#8211; An Uneasy Truce</title>
		<link>http://bixolabs.com/2009/11/11/web-miners-vs-web-masters-an-uneasy-truce/</link>
		<comments>http://bixolabs.com/2009/11/11/web-miners-vs-web-masters-an-uneasy-truce/#comments</comments>
		<pubDate>Wed, 11 Nov 2009 17:10:03 +0000</pubDate>
		<dc:creator>kkrugler</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[polite crawling]]></category>
		<category><![CDATA[robots]]></category>
		<category><![CDATA[web masters]]></category>
		<category><![CDATA[web mining]]></category>

		<guid isPermaLink="false">http://bixolabs.com/?p=163</guid>
		<description><![CDATA[The life of a webmaster is hard, and web crawlers make it harder




http://www.flickr.com/photos/absolutely_loverly/ / CC BY 2.0
&#160;

There&#8217;s the daily drama of keeping both web site users and web site developers happy. Now mix in the unpredictable side effects of having automated agents hitting the site, and you can see why webmasters might think many web [...]]]></description>
			<content:encoded><![CDATA[<h2>The life of a webmaster is hard, and web crawlers make it harder</h2>
<table>
<tbody>
<tr>
<td><a href="http://www.flickr.com/photos/absolutely_loverly/2953035408/"><img class="alignnone size-full wp-image-167" title="Angry Face" src="http://bixolabs.files.wordpress.com/2009/11/angry-face.png?w=206&#038;h=206" alt="Angry Face" width="206" height="206" /></a>
<div><a rel="cc:attributionURL" href="http://www.flickr.com/photos/absolutely_loverly/">http://www.flickr.com/photos/absolutely_loverly/</a> / <a rel="license" href="http://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a></div>
<p>&nbsp;</p>
</td>
<td align="top">There&#8217;s the daily drama of keeping both web site users and web site developers happy. Now mix in the unpredictable side effects of having automated agents hitting the site, and you can see why webmasters might think many <a href="http://www.webmasterworld.com/forum39/4119.htm" target="_blank">web crawlers are evil</a>.</td>
</tr>
</tbody>
</table>
<p>But web crawlers serve a very real, important role in the life of a successful site, and it&#8217;s all about <strong>traffic</strong>. Without search engines like Google and Yahoo/Bing, most sites would be invisible to most users.</p>
<h2>Implicit Contracts</h2>
<p>An unwritten agreement exists between webmasters and web crawlers, and it reads something like this: you don&#8217;t overload my site, and you bring traffic my way. In return, I&#8217;ll give you free access to lots of valuable content that I host.</p>
<p>And that&#8217;s worked reasonably well, for the past 15 years. Yes, there are crawlers that ignore the <a href="http://en.wikipedia.org/wiki/Robots_exclusion_standard" target="_blank">Robots Exclusion Standard</a>. And there are crawlers that overload the site by hammering it with lots of simultaneous requests for hours on end. And sometimes a crawler goes a little crazy and spends hours trying to fetch non-existent pages using bogus URLs that it incorrectly derived from content on the site&#8217;s pages. For the most part, though, web crawlers try to do the Right Thing, and webmasters can always block rogue crawlers by IP address.</p>
<h2>Web Mining != Search Index</h2>
<p>But now you&#8217;ve got web miners &#8211; automated agents that collect data which often doesn&#8217;t wind up in a search index. And that means no traffic from searches. And thus the implicit contract has been broken.</p>
<p>It hasn&#8217;t happened yet, but I can see a day when many sites set up their robots.txt to allow the major search engines access, and then block everybody else.</p>
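<p>To make that scenario concrete, here is a small Python sketch that feeds such a robots.txt to the standard library&#8217;s <code>urllib.robotparser</code>. The robots.txt contents, the user-agent tokens, and the <code>ExampleMinerBot</code> name are illustrative assumptions, not taken from any real site.</p>

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: named search engine crawlers may fetch everything,
# every other agent is blocked from the whole site.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: *
Disallow: /
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the given user agent may fetch the URL under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

<p>With this file, <code>is_allowed("Googlebot", "http://example.com/page.html")</code> is true, while the same call for a web miner identifying itself as <code>ExampleMinerBot</code> is false.</p>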
<p>What does this mean for the web eco-system? Three things, one for each participant:</p>
<ol>
<li>Web miners need to <strong>crawl extra-super-politely</strong>.</li>
<li>Customers need to work with key sites to <strong>pick good crawl times</strong>.</li>
<li>Web sites need to <strong>offer for-fee APIs</strong> for data mining.</li>
</ol>
<p>The first point is the easiest one to solve &#8211; never hit a site with more than one simultaneous request, never fetch more than a handful of pages a minute, and respect all robots.txt restrictions.</p>
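<p>One way to sketch the rate-limiting part of that rule in Python is a per-host scheduler that a single-threaded fetch loop consults before each request. The class name, the default of five pages a minute, and the choice to measure the interval from the end of the previous fetch are assumptions for the example; robots.txt checking is left out.</p>

```python
class PoliteScheduler:
    """Tracks, per host, the earliest time the next request may start."""

    def __init__(self, max_pages_per_minute=5):
        # Minimum gap in seconds between the end of one fetch and the start of the next.
        self.min_interval = 60.0 / max_pages_per_minute
        self.next_allowed = {}  # host -> earliest allowed start time, in seconds

    def wait_time(self, host, now):
        """Seconds to wait before fetching from host at time `now`."""
        return max(0.0, self.next_allowed.get(host, now) - now)

    def record_fetch(self, host, finished_at):
        """Call after a fetch completes. Because the caller fetches sequentially,
        there is never more than one in-flight request per host."""
        self.next_allowed[host] = finished_at + self.min_interval
```

<p>A fetch loop would call <code>wait_time</code>, sleep for that long, fetch the page, and then call <code>record_fetch</code> with the completion time. Passing the clock in explicitly keeps the class easy to test.</p>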
<p>The second is a bit harder, as it currently requires person-to-person contact with the web site in question. It&#8217;s possible to derive these &#8220;good crawl times&#8221; by varying the request rate with the response performance, so there are work-arounds. But eventually I expect to see an extension to robots.txt that lets the site owner provide additional clues to web crawlers about good and bad times for crawling.</p>
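<p>The work-around of adjusting the request rate to the site&#8217;s response performance might look roughly like the Python sketch below. The specific policy (double the delay when a response takes more than twice a baseline, otherwise shrink it by 10%) and the delay bounds are assumptions chosen for illustration, not a tested tuning.</p>

```python
def next_delay(current_delay, response_seconds, baseline_seconds,
               min_delay=12.0, max_delay=600.0):
    """Return the delay (in seconds) to use before the next request to a site."""
    if response_seconds > 2 * baseline_seconds:
        # The site looks busy: halve the request rate.
        proposed = current_delay * 2
    else:
        # The site looks responsive: ease the rate back up gradually.
        proposed = current_delay * 0.9
    # Never go faster than min_delay or slower than max_delay.
    return min(max_delay, max(min_delay, proposed))
```

<p>Logging the delay over a day or two would show when the site tends to respond quickly, which approximates the &#8220;good crawl times&#8221; without any contact with the site owner.</p>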
<p>The last point, about providing APIs, is the most long-term but also the most powerful. There are many web APIs out there, some of which provide access to valuable web data, but few offer a pay-to-play model. Most are rate limited, where you need to cut special deals if you exceed some relatively low daily threshold. Many have serious terms of use restrictions that limit a caller&#8217;s ability to actually mine the response data &#8211; often the only option is to republish it, with links/attribution back to the originating site.</p>
<p>What would be great is if everybody had a model like Amazon&#8217;s <a href="http://aws.amazon.com/awis/" target="_blank">AWIS</a>, where X requests cost N dollars. You can decide how much or how little to spend. There aren&#8217;t many restrictions on rate/volume or usage. And as a huge added bonus, the data comes back structured, so you don&#8217;t have to waste time hand-crafting some fragile, error-prone HTML scraping code.</p>
<p>And a side-note to companies thinking about the API issue &#8211; if you don&#8217;t provide one, and you block web miners, then you&#8217;ll get crawled anyway, in stealth mode by less scrupulous firms. So then everybody loses, since you&#8217;ll still be giving free access while taking a performance hit, while companies that need the data pay more to these &#8220;stealth crawlers&#8221; and get worse results.</p>
]]></content:encoded>
			<wfw:commentRss>http://bixolabs.com/2009/11/11/web-miners-vs-web-masters-an-uneasy-truce/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/cee2819ae7c1633e11ad2ef33de7794f?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">kkrugler</media:title>
		</media:content>

		<media:content url="http://bixolabs.files.wordpress.com/2009/11/angry-face.png" medium="image">
			<media:title type="html">Angry Face</media:title>
		</media:content>
	</item>
		<item>
		<title>Paul O&#8217;Rorke summary of elastic web mining talk</title>
		<link>http://bixolabs.com/2009/11/04/paul-ororke-summary-of-elastic-web-mining-talk/</link>
		<comments>http://bixolabs.com/2009/11/04/paul-ororke-summary-of-elastic-web-mining-talk/#comments</comments>
		<pubDate>Wed, 04 Nov 2009 18:01:49 +0000</pubDate>
		<dc:creator>kkrugler</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[cascading]]></category>
		<category><![CDATA[elastic web mining]]></category>
		<category><![CDATA[workflow]]></category>

		<guid isPermaLink="false">http://bixolabs.com/?p=132</guid>
		<description><![CDATA[Paul posted a nice summary of my elastic web mining talk over at his blog. He captured one of the key points I was trying to make when he said:
It was impressive to see how much of the processing was generated by Bixo and Cascading and how only a small fraction of the code needed [...]]]></description>
			<content:encoded><![CDATA[<p>Paul posted a nice summary of my <a title="web mining talk" href="http://ororke.com/paul/blog/?p=261">elastic web mining</a> talk over at his blog. He captured one of the key points I was trying to make when he said:</p>
<blockquote><p>It was impressive to see how much of the processing was generated by Bixo and Cascading and how only a small fraction of the code needed to be custom coded “by hand.”</p></blockquote>
<p>That&#8217;s a recurring theme when I show workflow graphs (dot files generated by Cascading) for example web mining applications that I&#8217;ve created. The real work is in figuring out what needs to be done (defining the workflow), not the coding to create the workflow or the custom bits that need to be added.</p>
<div id="attachment_133" class="wp-caption alignnone" style="width: 310px"><a href="http://bixolabs.files.wordpress.com/2009/11/workflow-dag.png"><img class="size-medium wp-image-133" title="Workflow Graph" src="http://bixolabs.files.wordpress.com/2009/11/workflow-dag.png?w=300&#038;h=203" alt="Workflow Graph" width="300" height="203" /></a><p class="wp-caption-text">Web mining app workflow</p></div>
<p>In the above graph, the purple ovals represent custom code, and of those six I could have cut out two by using existing Cascading operators with some regular expression juju. Add in the new Bixo utility operator for loading URLs into the workflow plus new Tika support for parsing mbox files, and you&#8217;re down to two custom operators &#8211; parsing the top-level &#8220;mailbox archives&#8221; page to find the monthly mailbox archives, and scoring the emails.</p>
<p>The blue and yellow ovals are pre-defined Cascading &amp; Bixo operators (respectively).</p>
<p>And while the total workflow looks very complex, this was defined in about a page of Java code.</p>
]]></content:encoded>
			<wfw:commentRss>http://bixolabs.com/2009/11/04/paul-ororke-summary-of-elastic-web-mining-talk/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/cee2819ae7c1633e11ad2ef33de7794f?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">kkrugler</media:title>
		</media:content>

		<media:content url="http://bixolabs.files.wordpress.com/2009/11/workflow-dag.png?w=300" medium="image">
			<media:title type="html">Workflow Graph</media:title>
		</media:content>
	</item>
		<item>
		<title>Elastic Web Mining Talk</title>
		<link>http://bixolabs.com/2009/11/02/elastic-web-mining-talk/</link>
		<comments>http://bixolabs.com/2009/11/02/elastic-web-mining-talk/#comments</comments>
		<pubDate>Tue, 03 Nov 2009 02:32:20 +0000</pubDate>
		<dc:creator>kkrugler</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[acm]]></category>
		<category><![CDATA[data mining]]></category>
		<category><![CDATA[presentation]]></category>
		<category><![CDATA[web mining]]></category>

		<guid isPermaLink="false">http://bixolabs.com/?p=125</guid>
		<description><![CDATA[Here&#8217;s the presentation I gave at the ACM data mining unconference on elastic web mining &#8211; how to create scalable, reliable and cost effective web mining solutions using an open source stack (Hadoop, Cascading, Bixo) running in Amazon&#8217;s Elastic Compute Cloud (EC2).

But I don&#8217;t see my notes showing up, so here&#8217;s the PDF version with [...]]]></description>
			<content:encoded><![CDATA[<p>Here&#8217;s the presentation I gave at the ACM data mining unconference on elastic web mining &#8211; how to create scalable, reliable and cost effective web mining solutions using an open source stack (Hadoop, Cascading, Bixo) running in Amazon&#8217;s Elastic Compute Cloud (EC2).</p>
<p><object type='application/x-shockwave-flash' wmode='opaque' data='http://static.slideshare.net/swf/ssplayer2.swf?id=2407600&#038;doc=acmuctalk-091102194640-phpapp02' width='600' height='492'><param name='movie' value='http://static.slideshare.net/swf/ssplayer2.swf?id=2407600&#038;doc=acmuctalk-091102194640-phpapp02' /><param name='allowFullScreen' value='true' /><param name='allowScriptAccess' value='always' /></object></p>
<p>But I don&#8217;t see my notes showing up, so here&#8217;s the PDF version with full notes, which make the resulting slides a lot more meaningful.</p>
<p><object style='margin: 0px;' width='600' height='492'><param name='movie' value='http://static.slidesharecdn.com/swf/ssplayerd.swf?doc=acmtalk-slideshare-091102203022-phpapp02' /><param name='allowFullScreen' value='true' /><param name='allowScriptAccess' value='always' /><param name='wmode' value='opaque' /><embed src='http://static.slidesharecdn.com/swf/ssplayerd.swf?doc=acmtalk-slideshare-091102203022-phpapp02' type='application/x-shockwave-flash' allowscriptaccess='always' allowfullscreen='true' wmode='opaque' width='600' height='492'></embed></object></p>
]]></content:encoded>
			<wfw:commentRss>http://bixolabs.com/2009/11/02/elastic-web-mining-talk/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/cee2819ae7c1633e11ad2ef33de7794f?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">kkrugler</media:title>
		</media:content>
	</item>
		<item>
		<title>Session writeups for ACM data mining unconference</title>
		<link>http://bixolabs.com/2009/11/02/session-writeups-for-acm-data-mining-unconference/</link>
		<comments>http://bixolabs.com/2009/11/02/session-writeups-for-acm-data-mining-unconference/#comments</comments>
		<pubDate>Tue, 03 Nov 2009 00:53:02 +0000</pubDate>
		<dc:creator>kkrugler</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://bixolabs.com/?p=122</guid>
		<description><![CDATA[I wound up being the scribe for two sessions at this past Sunday&#8217;s ACM data mining unconference.
The first session was on open/public datasets, which are very useful for people working on data mining algorithms.
The second session (last one of the day) was on open source data mining tools. Lots of people at this one, with [...]]]></description>
			<content:encoded><![CDATA[<p>I wound up being the scribe for two sessions at this past Sunday&#8217;s ACM data mining unconference.</p>
<p>The first session was on <a href="/datasets/public-datasets/">open/public datasets</a>, which are very useful for people working on data mining algorithms.</p>
<p>The second session (last one of the day) was on <a href="/oss/open-source-data-mining-tools/">open source data mining tools</a>. Lots of people at this one, with a nice demo of <a href="http://www.knime.org/" target="_blank">KNIME</a> and a good discussion of the R language&#8217;s pros and cons for data mining tasks.</p>
]]></content:encoded>
			<wfw:commentRss>http://bixolabs.com/2009/11/02/session-writeups-for-acm-data-mining-unconference/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/cee2819ae7c1633e11ad2ef33de7794f?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">kkrugler</media:title>
		</media:content>
	</item>
		<item>
		<title>Announcing the Public Terabyte Dataset project</title>
		<link>http://bixolabs.com/2009/11/01/announcing-the-public-terabyte-dataset-project/</link>
		<comments>http://bixolabs.com/2009/11/01/announcing-the-public-terabyte-dataset-project/#comments</comments>
		<pubDate>Sun, 01 Nov 2009 14:58:43 +0000</pubDate>
		<dc:creator>kkrugler</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[amazon]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[emr]]></category>
		<category><![CDATA[public terabyte dataset]]></category>
		<category><![CDATA[web mining]]></category>

		<guid isPermaLink="false">http://bixolabs.com/?p=88</guid>
		<description><![CDATA[We&#8217;re very excited to announce the Public Terabyte Dataset project.
This is a high-quality crawl of top web sites, using AWS&#8217;s Elastic MapReduce, Concurrent&#8217;s Cascading workflow API, and Bixolabs&#8217; elastic web mining platform.
Amazon will host the resulting dataset in S3, where it will be freely available to all EC2 users.
In addition, the code [...]]]></description>
			<content:encoded><![CDATA[<p>We&#8217;re very excited to announce the <a href="/datasets/public-terabyte-dataset-project/" target="_self">Public Terabyte Dataset project</a>.</p>
<p>This is a high-quality crawl of top web sites, using AWS&#8217;s <a href="http://aws.amazon.com/elasticmapreduce/" target="_blank">Elastic MapReduce</a>, Concurrent&#8217;s <a href="http://www.cascading.org/" target="_blank">Cascading</a> workflow API, and Bixolabs&#8217; elastic <a href="/">web mining platform</a>.</p>
<p>Amazon will host the resulting dataset in <a href="https://s3.amazonaws.com/" target="_blank">S3</a>, where it will be freely available to all <a href="http://aws.amazon.com/ec2/" target="_blank">EC2</a> users.</p>
<p>In addition, the code used to create and process the dataset will be available for download from <a href="http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=263" target="_blank">http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=263</a></p>
<p>Questions and input on the project can be submitted at <a title="Public Terabyte Dataset form" href="http://bixolabs.com/PTD/">http://bixolabs.com/PTD/</a></p>
]]></content:encoded>
			<wfw:commentRss>http://bixolabs.com/2009/11/01/announcing-the-public-terabyte-dataset-project/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/cee2819ae7c1633e11ad2ef33de7794f?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">kkrugler</media:title>
		</media:content>
	</item>
		<item>
		<title>Presenting at 2009 Silicon Valley Data Mining Camp</title>
		<link>http://bixolabs.com/2009/10/30/presenting-at-2009-silicon-valley-data-mining-camp/</link>
		<comments>http://bixolabs.com/2009/10/30/presenting-at-2009-silicon-valley-data-mining-camp/#comments</comments>
		<pubDate>Fri, 30 Oct 2009 17:42:03 +0000</pubDate>
		<dc:creator>kkrugler</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[acm]]></category>
		<category><![CDATA[data mining]]></category>
		<category><![CDATA[web mining]]></category>

		<guid isPermaLink="false">http://bixolabs.com/?p=68</guid>
		<description><![CDATA[This coming Sunday is the big Bay Area data mining &#8220;unconference&#8221;, and with more than 200 people already signed up, it&#8217;s going to be a lot of fun.
I&#8217;ll be presenting at some point during the day &#8211; since it&#8217;s an unconference, you don&#8217;t really know who&#8217;s going to be talking about what/when. My topic is [...]]]></description>
			<content:encoded><![CDATA[<p>This coming Sunday is the big Bay Area data mining &#8220;<a href="http://en.wikipedia.org/wiki/Unconference" target="_blank">unconference</a>&#8221;, and with more than 200 people already signed up, it&#8217;s going to be a lot of fun.</p>
<p>I&#8217;ll be presenting at some point during the day &#8211; since it&#8217;s an unconference, you don&#8217;t really know who&#8217;s going to be talking about what/when. My topic is &#8220;<a href="http://www.sfbayacm.org/?p=894&amp;cpage=1#comment-37" target="_blank">Elastic web mining using open source (Hadoop/Cascading/Bixo) in Amazon’s EC2 cloud</a>&#8221;.</p>
<p>If you scan the list of attendees (click the &#8220;RSVPs&#8221; tab near the top of the <a href="http://events.linkedin.com/ACM-Silicon-Valley-Data-Mining-Camp/pub/142420" target="_blank">LinkedIn event page</a>) you&#8217;ll see a lot of high-powered executives, consultants and researchers, so I&#8217;m looking forward to some really great lobby conversations.</p>
<p>Many thanks to the San Francisco Bay Area Chapter of the ACM for helping out with the <a href="http://www.sfbayacm.org/?p=894" target="_blank">event</a>, which is taking place from noon to 7:30pm at the <a href="http://hackerdojo.pbworks.com/" target="_blank">Hacker Dojo</a> in Mountain View. <a href="http://www.linkedin.com/in/gregmakowski" target="_blank">Greg Makowski</a> is the organizer, so he&#8217;s probably going a little bit crazy right now <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://bixolabs.com/2009/10/30/presenting-at-2009-silicon-valley-data-mining-camp/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/cee2819ae7c1633e11ad2ef33de7794f?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">kkrugler</media:title>
		</media:content>
	</item>
		<item>
		<title>Amazon Drops EC2 Prices</title>
		<link>http://bixolabs.com/2009/10/28/amazon-drops-ec2-prices/</link>
		<comments>http://bixolabs.com/2009/10/28/amazon-drops-ec2-prices/#comments</comments>
		<pubDate>Wed, 28 Oct 2009 22:40:08 +0000</pubDate>
		<dc:creator>kkrugler</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://bixolabs.com/?p=36</guid>
		<description><![CDATA[And that&#8217;s good news for our customers! In their official announcement, Amazon said:
Finally, we are also lowering prices on all Amazon EC2 On-Demand compute instances, effective on November 1st. Charges for Linux-based instances will drop 15% &#8212; a small Linux instance will now cost just 8.5 cents per hour of computing, compared to the previous [...]]]></description>
			<content:encoded><![CDATA[<p>And that&#8217;s good news for our customers! In their official announcement, Amazon said:</p>
<blockquote><p>Finally, we are also lowering prices on all Amazon EC2 On-Demand compute instances, effective on November 1st. Charges for Linux-based instances will drop 15% &#8212; a small Linux instance will now cost just 8.5 cents per hour of computing, compared to the previous price of 10 cents per hour.</p></blockquote>
<p>Your Bixolabs usage fee is based on AWS pricing, and we use Linux-based instances for web mining, so you benefit directly from this reduction. Visit the <a href="http://aws.amazon.com/ec2/#pricing" target="_blank">EC2 pricing page</a> for more details on the base price we use for calculating your usage fees.</p>
]]></content:encoded>
			<wfw:commentRss>http://bixolabs.com/2009/10/28/amazon-drops-ec2-prices/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/cee2819ae7c1633e11ad2ef33de7794f?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">kkrugler</media:title>
		</media:content>
	</item>
	</channel>
</rss>