<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
		>
<channel>
	<title>Comments on: Public Terabyte Dataset Project</title>
	<atom:link href="http://bixolabs.com/datasets/public-terabyte-dataset-project/feed/" rel="self" type="application/rss+xml" />
	<link>http://bixolabs.com</link>
	<description></description>
	<lastBuildDate>Wed, 03 Mar 2010 15:09:34 +0000</lastBuildDate>
	<generator>http://wordpress.com/</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: kkrugler</title>
		<link>http://bixolabs.com/datasets/public-terabyte-dataset-project/#comment-68</link>
		<dc:creator>kkrugler</dc:creator>
		<pubDate>Wed, 03 Mar 2010 15:09:34 +0000</pubDate>
		<guid isPermaLink="false">http://bixolabs.com/?page_id=101#comment-68</guid>
		<description>@elhoim - I hadn&#039;t found the domain lists on premiumdrops.com ...good stuff.

Sounds like you&#039;ve been poking around this space a bit in the past :)

-- Ken</description>
		<content:encoded><![CDATA[<p>@elhoim &#8211; I hadn&#8217;t found the domain lists on premiumdrops.com &#8230;good stuff.</p>
<p>Sounds like you&#8217;ve been poking around this space a bit in the past <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>&#8211; Ken</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: elhoim</title>
		<link>http://bixolabs.com/datasets/public-terabyte-dataset-project/#comment-67</link>
		<dc:creator>elhoim</dc:creator>
		<pubDate>Wed, 03 Mar 2010 14:40:54 +0000</pubDate>
		<guid isPermaLink="false">http://bixolabs.com/?page_id=101#comment-67</guid>
		<description>premiumdrops.com also offers copies of the .com/.net/.org zones if you want to build a wide seed list.</description>
		<content:encoded><![CDATA[<p>premiumdrops.com also offers copies of the .com/.net/.org zones if you want to build a wide seed list.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: kkrugler</title>
		<link>http://bixolabs.com/datasets/public-terabyte-dataset-project/#comment-66</link>
		<dc:creator>kkrugler</dc:creator>
		<pubDate>Wed, 03 Mar 2010 14:35:00 +0000</pubDate>
		<guid isPermaLink="false">http://bixolabs.com/?page_id=101#comment-66</guid>
		<description>Hi Joel,

Excellent input, thanks! The Alexa web service API offers a bit more information (e.g. whether a site is &quot;adult&quot;), but it feels out of date (e.g. no walmart.com data). By merging these together, I&#039;ll have a much better seed list.

-- Ken</description>
		<content:encoded><![CDATA[<p>Hi Joel,</p>
<p>Excellent input, thanks! The Alexa web service API offers a bit more information (e.g. whether a site is &#8220;adult&#8221;), but it feels out of date (e.g. no walmart.com data). By merging these together, I&#8217;ll have a much better seed list.</p>
<p>&#8211; Ken</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: elhoim</title>
		<link>http://bixolabs.com/datasets/public-terabyte-dataset-project/#comment-65</link>
		<dc:creator>elhoim</dc:creator>
		<pubDate>Wed, 03 Mar 2010 09:27:47 +0000</pubDate>
		<guid isPermaLink="false">http://bixolabs.com/?page_id=101#comment-65</guid>
		<description>Alexa and Quantcast both have a free &quot;top 1 million&quot; list that is updated daily.</description>
		<content:encoded><![CDATA[<p>Alexa and Quantcast both have a free &#8220;top 1 million&#8221; list that is updated daily.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: kkrugler</title>
		<link>http://bixolabs.com/datasets/public-terabyte-dataset-project/#comment-61</link>
		<dc:creator>kkrugler</dc:creator>
		<pubDate>Mon, 25 Jan 2010 20:45:30 +0000</pubDate>
		<guid isPermaLink="false">http://bixolabs.com/?page_id=101#comment-61</guid>
		<description>Hi Dan,

1. Top sites by traffic come from the Alexa web services API.
2. The dataset focuses on these sites, but expands to include others.
3. Amazon is helping because they are interested in useful public datasets and in examples of using EMR effectively.
4. The dataset has the highest value for people working on text processing algorithms, and for performance baselining and optimization.

-- Ken</description>
		<content:encoded><![CDATA[<p>Hi Dan,</p>
<p>1. Top sites by traffic come from the Alexa web services API.<br />
2. The dataset focuses on these sites, but expands to include others.<br />
3. Amazon is helping because they are interested in useful public datasets and in examples of using EMR effectively.<br />
4. The dataset has the highest value for people working on text processing algorithms, and for performance baselining and optimization.</p>
<p>&#8211; Ken</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Dan</title>
		<link>http://bixolabs.com/datasets/public-terabyte-dataset-project/#comment-60</link>
		<dc:creator>Dan</dc:creator>
		<pubDate>Mon, 25 Jan 2010 20:20:28 +0000</pubDate>
		<guid isPermaLink="false">http://bixolabs.com/?page_id=101#comment-60</guid>
		<description>How did you get the top 100K websites by traffic?
Does the dataset include all pages from those websites?
How did you get Amazon to host the data?
What do you view as the value of this dataset?</description>
		<content:encoded><![CDATA[<p>How did you get the top 100K websites by traffic?<br />
Does the dataset include all pages from those websites?<br />
How did you get Amazon to host the data?<br />
What do you view as the value of this dataset?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: The web is an endless series of edge cases &#171; Ken&#39;s Techno Tidbits</title>
		<link>http://bixolabs.com/datasets/public-terabyte-dataset-project/#comment-49</link>
		<dc:creator>The web is an endless series of edge cases &#171; Ken&#39;s Techno Tidbits</dc:creator>
		<pubDate>Thu, 17 Dec 2009 20:31:52 +0000</pubDate>
		<guid isPermaLink="false">http://bixolabs.com/?page_id=101#comment-49</guid>
		<description>[...] that same week, I was debugging a weird problem with my Elastic MapReduce web crawling job for the Public Terabyte Dataset project. At some point during one of the steps, I was getting LeaseExpiredExceptions in the logs, and the [...]</description>
		<content:encoded><![CDATA[<p>[...] that same week, I was debugging a weird problem with my Elastic MapReduce web crawling job for the Public Terabyte Dataset project. At some point during one of the steps, I was getting LeaseExpiredExceptions in the logs, and the [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Bixolabs goes public &#171; Ken&#39;s Techno Tidbits</title>
		<link>http://bixolabs.com/datasets/public-terabyte-dataset-project/#comment-4</link>
		<dc:creator>Bixolabs goes public &#171; Ken&#39;s Techno Tidbits</dc:creator>
		<pubDate>Tue, 03 Nov 2009 00:48:12 +0000</pubDate>
		<guid isPermaLink="false">http://bixolabs.com/?page_id=101#comment-4</guid>
		<description>[...] gave a talk at the ACM Data Mining Unconference on Sunday, where I also announced the Public Terabyte Dataset project, so the timing was [...]</description>
		<content:encoded><![CDATA[<p>[...] gave a talk at the ACM Data Mining Unconference on Sunday, where I also announced the Public Terabyte Dataset project, so the timing was [...]</p>
]]></content:encoded>
	</item>
</channel>
</rss>
