<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
		>
<channel>
	<title>Comments on: Announcing the Public Terabyte Dataset project</title>
	<atom:link href="http://bixolabs.com/2009/11/01/announcing-the-public-terabyte-dataset-project/feed/" rel="self" type="application/rss+xml" />
	<link>http://bixolabs.com/2009/11/01/announcing-the-public-terabyte-dataset-project/</link>
	<description></description>
	<lastBuildDate>Wed, 03 Mar 2010 15:09:34 +0000</lastBuildDate>
	<generator>http://wordpress.com/</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: BotchagalupeMarks for November 13th - 11:23 &#124; IT Management and Cloud Blog</title>
		<link>http://bixolabs.com/2009/11/01/announcing-the-public-terabyte-dataset-project/#comment-18</link>
		<dc:creator>BotchagalupeMarks for November 13th - 11:23 &#124; IT Management and Cloud Blog</dc:creator>
		<pubDate>Sat, 14 Nov 2009 06:26:50 +0000</pubDate>
		<guid isPermaLink="false">http://bixolabs.com/?p=88#comment-18</guid>
		<description>[...] Announcing the Public Terabyte Dataset project &#171; Elastic Web Mining &#124; Bixolabs - This is a high quality crawl of top web sites, using AWS&#8217;s Elastic Map Reduce, Concurrent&#8217;s Cascading workflow API, and Bixolab&#8217;s elastic web mining platform. [...]</description>
		<content:encoded><![CDATA[<p>[...] Announcing the Public Terabyte Dataset project &laquo; Elastic Web Mining | Bixolabs &#8211; This is a high quality crawl of top web sites, using AWS&rsquo;s Elastic Map Reduce, Concurrent&rsquo;s Cascading workflow API, and Bixolab&rsquo;s elastic web mining platform. [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: kkrugler</title>
		<link>http://bixolabs.com/2009/11/01/announcing-the-public-terabyte-dataset-project/#comment-16</link>
		<dc:creator>kkrugler</dc:creator>
		<pubDate>Tue, 10 Nov 2009 20:39:35 +0000</pubDate>
		<guid isPermaLink="false">http://bixolabs.com/?p=88#comment-16</guid>
		<description>Hi Brewster,

Thanks for the offer to host the crawl at the Internet Archive. 

I&#039;ll take you up on that offer, and follow up with Alexis.

-- Ken</description>
		<content:encoded><![CDATA[<p>Hi Brewster,</p>
<p>Thanks for the offer to host the crawl at the Internet Archive. </p>
<p>I&#8217;ll take you up on that offer, and follow up with Alexis.</p>
<p>&#8211; Ken</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Brewster Kahle</title>
		<link>http://bixolabs.com/2009/11/01/announcing-the-public-terabyte-dataset-project/#comment-13</link>
		<dc:creator>Brewster Kahle</dc:creator>
		<pubDate>Sun, 08 Nov 2009 04:43:20 +0000</pubDate>
		<guid isPermaLink="false">http://bixolabs.com/?p=88#comment-13</guid>
		<description>Thank you for doing this crawl!

The Internet Archive would be happy to host this collection for free public access. You can push it to the Internet Archive using our implementation of the S3 API.

If this interests you, please contact Alexis Rossi (alexis at archive).

-brewster</description>
		<content:encoded><![CDATA[<p>Thank you for doing this crawl!</p>
<p>The Internet Archive would be happy to host this collection for free public access. You can push it to the Internet Archive using our implementation of the S3 API.</p>
<p>If this interests you, please contact Alexis Rossi (alexis at archive).</p>
<p>-brewster</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Michael Nielsen &#187; Biweekly links for 11/06/2009</title>
		<link>http://bixolabs.com/2009/11/01/announcing-the-public-terabyte-dataset-project/#comment-12</link>
		<dc:creator>Michael Nielsen &#187; Biweekly links for 11/06/2009</dc:creator>
		<pubDate>Fri, 06 Nov 2009 10:53:17 +0000</pubDate>
		<guid isPermaLink="false">http://bixolabs.com/?p=88#comment-12</guid>
		<description>[...] The Public Terabyte Dataset project « Elastic Web Mining &#124; Bixolabs [...]</description>
		<content:encoded><![CDATA[<p>[...] The Public Terabyte Dataset project « Elastic Web Mining | Bixolabs [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: kkrugler</title>
		<link>http://bixolabs.com/2009/11/01/announcing-the-public-terabyte-dataset-project/#comment-8</link>
		<dc:creator>kkrugler</dc:creator>
		<pubDate>Tue, 03 Nov 2009 23:08:15 +0000</pubDate>
		<guid isPermaLink="false">http://bixolabs.com/?p=88#comment-8</guid>
		<description>Hi Otis,

The dataset will be pushed to S3 as a set of compressed WARC (Web ARChive) files. I&#039;m still working out how much additional data to include; the parsed output would be too big, but I might include compressed term vectors for the Mahout-ers out there.

The link you reference is where the sample code will be posted, both for generating the data (the crawl code) and for processing the data files.

-- Ken</description>
		<content:encoded><![CDATA[<p>Hi Otis,</p>
<p>The dataset will be pushed to S3 as a set of compressed WARC (Web ARChive) files. I&#8217;m still working out how much additional data to include; the parsed output would be too big, but I might include compressed term vectors for the Mahout-ers out there.</p>
<p>The link you reference is where the sample code will be posted, both for generating the data (the crawl code) and for processing the data files.</p>
<p>&#8211; Ken</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Otis Gospodnetic</title>
		<link>http://bixolabs.com/2009/11/01/announcing-the-public-terabyte-dataset-project/#comment-7</link>
		<dc:creator>Otis Gospodnetic</dc:creator>
		<pubDate>Tue, 03 Nov 2009 15:19:19 +0000</pubDate>
		<guid isPermaLink="false">http://bixolabs.com/?p=88#comment-7</guid>
		<description>Ken, I&#039;m looking at http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=263 but I don&#039;t see PTD there.
What exactly is in the dataset?  A bunch of raw, unparsed HTML pages, right?</description>
		<content:encoded><![CDATA[<p>Ken, I&#8217;m looking at <a href="http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=263" rel="nofollow">http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=263</a> but I don&#8217;t see PTD there.<br />
What exactly is in the dataset?  A bunch of raw, unparsed HTML pages, right?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Notional Slurry &#187; links for 2009-11-02</title>
		<link>http://bixolabs.com/2009/11/01/announcing-the-public-terabyte-dataset-project/#comment-5</link>
		<dc:creator>Notional Slurry &#187; links for 2009-11-02</dc:creator>
		<pubDate>Tue, 03 Nov 2009 06:03:15 +0000</pubDate>
		<guid isPermaLink="false">http://bixolabs.com/?p=88#comment-5</guid>
		<description>[...] Announcing the Public Terabyte Dataset project « Elastic Web Mining &#124; Bixolabs &quot;We’re very excited to announce the Public Terabyte Dataset project.&quot; (tags: data datasets mapreduce S3) [...]</description>
		<content:encoded><![CDATA[<p>[...] Announcing the Public Terabyte Dataset project « Elastic Web Mining | Bixolabs &quot;We’re very excited to announce the Public Terabyte Dataset project.&quot; (tags: data datasets mapreduce S3) [...]</p>
]]></content:encoded>
	</item>
</channel>
</rss>
