Public Terabyte Dataset Project
This page has more details on the Public Terabyte Dataset project, which was recently announced at the ACM data mining unconference.
- The data comes from a crawl of 50-200M pages from the 100K top (by traffic) English language domains.
- The crawl is done by a custom Bixo workflow created by Bixolabs, built on top of Cascading/Hadoop and running in EC2 using Amazon’s Elastic MapReduce service.
- We’ll be trying hard to avoid spam/adult content, though getting totally clean results is of course impossible.
- The resulting data will be stored as compressed WARC files in S3.
- Hosting for the dataset is being provided by Amazon.
- Access to the data is free, assuming you’re running code in EC2.
- The code used to run the crawl, as well as code to access the crawl data, will be available at http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=263.
There’s a form where you can request information and provide input on the crawl.
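Since the crawl output will be compressed WARC files, here is a rough sketch of how such a file could be read with nothing but the Python standard library. It is not the project's access code. It assumes gzip compression and the standard WARC record layout (a `WARC/` version line, headers, a blank line, then `Content-Length` bytes of payload); the function names are made up for this example, and the dataset's exact layout may differ.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    Assumes the standard record layout; header names are lower-cased.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separate one record from the next
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)

        # Read "Name: value" header lines up to the blank line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()

        # The payload is exactly Content-Length bytes long.
        length = int(headers.get("content-length", 0))
        yield headers, stream.read(length)


def iter_warc_file(path):
    """Stream records from a gzipped WARC file on local disk."""
    with gzip.open(path, "rb") as stream:
        yield from iter_warc_records(stream)


def iter_warc_bytes(data):
    """Read records from gzipped WARC data that is already in memory."""
    return iter_warc_records(io.BytesIO(gzip.decompress(data)))
```

Something like `for headers, payload in iter_warc_file("sample.warc.gz")` would then walk a file copied down from S3, where `sample.warc.gz` is a placeholder name. Python's gzip module reads files made of several concatenated gzip members, which is how WARC files are commonly compressed (one member per record).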
8 Responses

How did you get the top 100K websites by traffic?
Does the dataset include all pages from those websites?
How did you get Amazon to host the data?
What do you view as the value of this dataset?
Hi Dan,
1. Top sites by traffic come from Alexa web services API.
2. The dataset focuses on these sites, but expands to include others.
3. Amazon is helping because they are interested in useful public datasets and examples of effectively using EMR.
4. The dataset has highest value for people working on text processing algorithms, and for doing performance baselining/optimizations.
– Ken
Alexa and Quantcast both have a free “top 1 million” list that is updated daily.
premiumdrops.com is also offering copies of the .com/.net/.org zones if you want to constitute a wide seed list.
@elhoim – I hadn’t found the domain lists on premiumdrops.com… good stuff.
Sounds like you’ve been poking around this space a bit in the past.
– Ken
Hi Joel,
Excellent input, thanks! The Alexa web service API offers a bit more information (e.g. whether a site is “adult”), but it feels out of date (e.g. no walmart.com data). By merging these together, I’ll have a much better seed list.
– Ken
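For anyone curious what merging the lists might look like, here is a minimal sketch. It is only an illustration, not the code used for the crawl. It assumes each source has already been loaded as a plain list of domains ordered by rank, and that adult domains are available as a separate set; the function names and sample domains are invented for the example.

```python
def normalize_domain(domain):
    """Lower-case a domain and drop a leading "www." so duplicates line up."""
    domain = domain.strip().lower()
    if domain.startswith("www."):
        domain = domain[4:]
    return domain


def merge_seed_lists(ranked_lists, adult_domains=(), limit=None):
    """Merge several ranked domain lists into one de-duplicated seed list.

    Each domain keeps its best (lowest) rank across all lists. Domains in
    adult_domains are dropped. Ties are broken alphabetically.
    """
    blocked = {normalize_domain(d) for d in adult_domains}
    best_rank = {}
    for ranked in ranked_lists:
        for rank, domain in enumerate(ranked, start=1):
            domain = normalize_domain(domain)
            if not domain or domain in blocked:
                continue
            if domain not in best_rank or rank < best_rank[domain]:
                best_rank[domain] = rank
    merged = sorted(best_rank, key=lambda d: (best_rank[d], d))
    return merged[:limit]


# Hypothetical sample data, not real rankings.
alexa = ["google.com", "Yahoo.com", "adult-example.com"]
quantcast = ["www.google.com", "walmart.com", "yahoo.com"]
seeds = merge_seed_lists([alexa, quantcast], adult_domains={"adult-example.com"})
```

Taking the best rank per domain means a site missing from one source (walmart.com in the sample above) still lands near the top if the other source ranks it highly.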