Announcing the Public Terabyte Dataset project

November 1, 2009

We’re very excited to announce the Public Terabyte Dataset project.

This is a high-quality crawl of top web sites, built using AWS’s Elastic MapReduce, Concurrent’s Cascading workflow API, and Bixo Labs’ elastic web mining platform.

Amazon will host the resulting dataset in S3, where it will be freely available to all EC2 users.

In addition, the code used to create and process the dataset will be available for download from http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=263

Questions and input on the project can be submitted at http://bixolabs.com/PTD/


7 Comments
  1. November 3, 2009 8:19 am

    Ken, I’m looking at http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=263 but I don’t see PTD there.
    What exactly is in the dataset? A bunch of raw, unparsed HTML pages, right?

    • November 3, 2009 4:08 pm

      Hi Otis,

      The dataset will be pushed to S3 as a set of compressed WARC (web archive) files. I’m still working out how much additional data to include, but the parse output would be too big… I might include compressed term vectors for the Mahout-ers out there.

      The link you reference is where the sample code will be posted, for both the generation of the data (the crawl code) and examples of how to process the data files.

      – Ken

      • November 7, 2009 9:43 pm

        Thank you for doing this crawl!

        The Internet Archive would be happy to host this collection for free public access. You can push it to the Internet Archive using our implementation of the S3 API.

        If this is interesting to you, please contact Alexis Rossi (alexis at archive).

        -brewster

  2. November 10, 2009 1:39 pm

    Hi Brewster,

    Thanks for the offer to host the crawl at the Internet Archive.

    I’ll take you up on that offer, and follow up with Alexis.

    – Ken

Trackbacks

  1. Notional Slurry » links for 2009-11-02
  2. Michael Nielsen » Biweekly links for 11/06/2009
  3. BotchagalupeMarks for November 13th - 11:23 | IT Management and Cloud Blog
