Announcing the Public Terabyte Dataset project

November 1, 2009

We’re very excited to announce the Public Terabyte Dataset project.

This is a high-quality crawl of top web sites, run using AWS’s Elastic MapReduce, Concurrent’s Cascading workflow API, and Bixolabs’ elastic web mining platform.

Amazon will host the resulting dataset in S3, where it will be freely available to all EC2 users.

In addition, the code used to create and process the dataset will be available for download from http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=263

Questions and input on the project can be submitted at http://bixolabs.com/PTD/

7 Responses
  1. November 3, 2009

    Ken, I’m looking at http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=263 but I don’t see PTD there.
    What exactly is in the dataset? A bunch of raw, unparsed HTML pages, right?

    • November 3, 2009

      Hi Otis,

      The dataset will be pushed to S3 as a set of compressed WARC (web archive) files. I’m still working out how much additional data to include. The parsed output would be too big, but it might include compressed term vectors for the Mahout-ers out there.

      The link you reference is where the sample code will be posted, for both the generation of the data (the crawl code) and examples of how to process the data files.

      – Ken

      • November 7, 2009

        Thank you for doing this crawl!

        The Internet Archive would be happy to host this collection for free public access. You can push it to the Internet Archive using our implementation of the S3 API.

        If this is interesting to you, please contact Alexis Rossi (alexis at archive).

        -brewster

  2. November 10, 2009

    Hi Brewster,

    Thanks for the offer to host the crawl at the Internet Archive.

    I’ll take you up on that offer, and follow up with Alexis.

    – Ken
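
Ken’s reply above describes the dataset as a set of compressed WARC (web archive) files. As a rough illustration of what processing such files might look like, here is a minimal sketch in Python that uses only the standard library. It is not the project’s sample code. It assumes each record starts with a `WARC/` version line, followed by `Name: value` headers, a blank line, and a body of `Content-Length` bytes, and it ignores folded (continuation) header lines. The function names `iter_warc_records` and `iter_warc_gz` and the sample record are hypothetical.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # skip the blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line at record start: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers["content-length"])
        yield headers, stream.read(length)


def iter_warc_gz(source):
    """Open a gzip-compressed WARC file (path or file object) and yield its records."""
    with gzip.open(source, "rb") as stream:
        yield from iter_warc_records(stream)


# Tiny in-memory example: a fabricated record, not taken from the dataset.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)

for headers, payload in iter_warc_records(io.BytesIO(sample)):
    print(headers["warc-target-uri"], len(payload))
```

For a real file, `iter_warc_gz` can be pointed at a local copy of one of the compressed files. Python’s `gzip` module reads concatenated gzip members as one stream, so the sketch should work whether the file is compressed as a whole or record by record.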

Trackbacks & Pingbacks

  1. Notional Slurry » links for 2009-11-02
  2. Michael Nielsen » Biweekly links for 11/06/2009
  3. BotchagalupeMarks for November 13th - 11:23 | IT Management and Cloud Blog
