Crawler-commons project gets started

December 3, 2009
by kkrugler
Back in November we helped put together a small gathering for web crawler developers at ApacheCon 2009. One of the key topics was how to share development efforts, versus each project independently implementing similar functionality.

Out of this was born the crawler-commons project. As the main page says:

The purpose of this project is to develop a set of reusable Java components that implement functionality common to any web crawler. These components would benefit from collaboration among various existing web crawler projects, and reduce duplication of effort.

There’s a long list of functionality that is identical, or nearly so, between the various projects. The project wiki has a more detailed write-up from the ApacheCon meeting, but a short list includes:

  • robots.txt parsing
  • URL normalization
  • URL filtering
  • Domain name manipulation
  • HTML page cleaning
  • HttpClient configuration
  • Text similarity

It’s still early, but some initial code has been submitted to the Google Code SVN repository. And anybody with an interest in the area of Java web crawlers should use this feed to track project updates.

Public web crawler projects

December 2, 2009

Several people have pointed me to other public/non-profit projects doing large-scale public web crawls, so I thought I’d summarize the ones I now know about below. And if you know of others, please add your comments and I’ll update the list.

  • Wayback Machine – A time-series snapshot of important web pages, from 1996 to present. 150B pages crawled in total as of 2009. The data is searchable, but not available in raw format AFAIK. The work is part of the Internet Archive organization, and uses Heritrix for crawling.
  • CDL Web Archiving Service – The California Digital Library provides the Web Archiving Service to enable librarians and scholars to create archives of captured web sites and publications. Similar to the Wayback Machine, they use Heritrix and other software from the Internet Archive, and the results are searchable but not available in raw format.
  • CommonCrawl – Their goal is to build, maintain and make widely available a comprehensive crawl of the Internet. They use Nutch (useragent is ccBot). I’ve seen Ahad Rana post to the Nutch list. So far I haven’t seen any actual search or raw data results from this project. The do have a cool public “crawl stats” page.
  • UK Web Archive – A “Wayback Machine” for UK web sites. Provided by the British Library. Searchable, but no raw data that I can see. They in turn sponsor the Web Curator Tool, which is an open-source workflow management application for selective web archiving (driver for Heritrix).
  • Isara Search – A project sponsored by Isara Charity Organization to build the world’s first non-profit search engine. Based in Thailand, using Nutch. No search/data available yet.
  • ClueWeb09 – The ClueWeb09 dataset was created by the Language Technologies Institute at Carnegie Mellon University to support research on information retrieval and related human language technologies. The dataset consists of 1 billion web pages, in ten languages, collected in January and February 2009. The data is available to researchers who sign a legal agreement and pay $750 for the hard disks needed to store the data.
  • WebBase – The Stanford WebBase project has been collecting topic focused snapshots of Web sites. All the resulting archives are available to the public via fast download streams. The useragent is WebVac (was Pita). There’s also a web GUI for fetching specific crawl sets.
  • Laboratory for Web Algorithmics – Uses UbiCrawler to create large-scale link graph datasets that can be freely downloaded.

Proposals for Big Data web mining talk

November 16, 2009

I’m going to be giving a talk at the Bay Area ACM data mining SIG in December, and I need to finalize my topic soon – like today :)

I was going to expand on my Elastic Web Mining talk (“Web mining for SEO keywords”) from the ACM data mining unconference a few weeks back.

But the fact that I’ll have 10s to 100s of millions of web page data to work with, from the public terabyte dataset crawl, makes me want to apply Mahout to the data.

I tossed out one idea on the Mahout list, looking for input:

  • I’d like to automatically generate a timeline of events.
  • I can extract potential dates from web pages, using simple patterns.
  • I can extract 2-to-4 word terms (skipping those which start/end with stop words) from pages that have extracted dates.
  • And then by the miracle of LDA (latent dirichlet allocation), I get clusters of date+terms.

But in this example, I don’t actually need LDA – I have my “topic”, which is the date. So it might not be a very good example. And will LDA scale to 100M web pages (which implies many billions of terms)? And how will I handle the same term (e.g. “barack inauguration”) being associated with a cluster of dates, since stories from a range of dates before/after the event will contain that same term?

So it could be a non-starter – I’m hoping for input on feasibility, level of effort, or if somebody else has a suggestion for something simple that could provide interesting/obvious results, I’m all ears.

Thanks!

– Ken

PS – my current fall-back is to just do brute-force map-reduce to come up with lists of terms per unique date, pick the top N, and maybe do some filtering for top-level terms that have too many associated unique dates. Which unfortunately wouldn’t use Mahout, but would be an example of crunching lots of data.

Web Miners vs Web Masters – An Uneasy Truce

November 11, 2009

The life of a webmaster is hard, and web crawlers make it harder

Angry Face

 

There’s the daily drama of keeping both web site users and web site developers happy. Now mix in the unpredictable side effects of having automated agents hitting the site, and you can see why webmasters might think many web crawlers are evil.

But web crawlers serve a very real, important role in the life of a successful site, and it’s all about traffic. Without search engines like Google and Yahoo/Bing, most sites would be invisible to most users.

Implicit Contracts

An unwritten agreement exists between webmasters and web crawlers, and it reads something like this: you don’t overload my site, and you bring traffic my way. In return, I’ll give you free access lots of valuable content that I host.

And that’s worked reasonably well, for the past 15 years. Yes, there are crawlers that ignore the Robots Exclusion Standard. And there are crawlers that overload the site by hammering it with lots of simultaneous requests for hours on end. And sometimes a crawler goes a little crazy and spends hours trying to fetch non-existent pages using bogus URLs that it incorrectly derived from content on the site’s pages. For the most part, though, web crawlers try to do the Right Thing, and webmasters can always block rogue crawlers by IP address.

Web Mining != Search Index

But now you’ve got web miners – automated agents that collect data which often doesn’t wind up in a search index. And that means no traffic from searches. And thus the implicit contract has been broken.

It hasn’t happened yet, but I can see a day when many sites set up their robots.txt to allow the major search engines access, and then block everybody else.

What does this mean for the web eco-system? Three things, one for each participant:

  1. Web miners need to crawl extra-super-politely.
  2. Customers need to work with key sites to pick good crawl times.
  3. Web sites need to offer for-fee APIs for data mining.

The first point is the easiest one to solve – never hit a site with more than one simultaneous request, never fetch more than a handful of pages a minute, and respect all robots.txt restrictions.

The second is a bit harder, as it currently requires person-to-person contact with the web site in question. It’s possible to derive these “good crawl times” by varying the request rate with the response performance, so there are work-arounds. But eventually I expect to see an extension to robots.txt that lets the site owner provide additional clues to web crawlers about good and bad times for crawling.

The last point, about providing APIs, is the most long-term but also the most powerful. There are many web APIs out there, some of which provide access to valuable web data, but few offer a pay-to-play model. Most are rate limited, where you need to cut special deals if you exceed some relatively low daily threshold. Many have serious terms of use restrictions that limit a caller’s ability to actually mine the response data – often the only option is to republish it, with links/attribution back to the originating site.

What would be great is if everybody had a model like Amazon’s AWIS, where X requests cost N dollars. You can decide how much or how little to spend. There aren’t many restrictions on rate/volume or usage. And as a huge added bonus, the data comes back structured, so you don’t have to waste time hand-crafting some fragile, error-prone HTML scraping code.

And a side-note to companies thinking about the API issue – if you don’t provide one, and you block web miners, then you’ll get crawled anyway, in stealth mode by less scrupulous firms. So then everybody loses, since you’ll still be giving free access while taking a performance hit, while companies that need the data pay more to these “stealth crawlers” and get worse results.

Paul O’Rorke summary of elastic web mining talk

November 4, 2009
by kkrugler

Paul posted a nice summary of my elastic web mining talk over at his blog. He captured one of the key points I was trying to make when he said:

It was impressive to see how much of the processing was generated by Bixo and Cascading and how only a small fraction of the code needed to be custom coded “by hand.”

That’s a recurring theme when I show workflow graphs (dot files generated by Cascading) for example web mining applications that I’ve created. The real work is in figuring out what needs to be done (defining the workflow), not the coding to create the workflow or the custom bits that need to added.

Workflow Graph

Web mining app workflow

In the above graph, the purple ovals represent custom code, and of those six I could have cut out two by using existing Cascading operators with some regular expression juju. Add in the new Bixo utility operator for loading URLs into the workflow plus new Tika support for parsing mbox files, and you’re down to two custom operators – parsing the top-level “mailbox archives” page to find the monthly mailbox archives, and scoring the emails.

The blue and yellow ovals are pre-defined Cascading & Bixo operators (respectively).

And while the total workflow looks very complex, this was defined in about a page of Java code.

Elastic Web Mining Talk

November 2, 2009
by kkrugler

Here’s the presentation I gave at the ACM data mining unconference on elastic web mining – how to create scalable, reliable and cost effective web mining solutions using an open source stack (Hadoop, Cascading, Bixo) running in Amazon’s Elastic Compute Cloud (EC2).

But I don’t see my notes showing up, so here’s the PDF version with full notes, which make the resulting slides a lot more meaningful.

Session writeups for ACM data mining unconference

November 2, 2009
by kkrugler

I wound up being the scribe for two sessions at this past Sunday’s ACM data mining unconference.

The first session was on open/public datasets, which are very useful for people working on data mining algorithms.

The second session (last one of the day) was on open source data mining tools. Lots of people at this one, with a nice demo on KNIME and a good discussion of the R language pros/cons for data mining tasks.

 

Announcing the Public Terabyte Dataset project

November 1, 2009

We’re very excited to announce the Public Terabyte Dataset project.

This is a high quality crawl of top web sites, using AWS’s Elastic Map Reduce, Concurrent’s Cascading workflow API, and Bixolab’s elastic web mining platform.

Hosting for the resulting dataset will be provided by Amazon in S3, and freely available to all EC2 users.

In addition, the code used to create and process the dataset will be available for download from http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=263

Questions and input on the project can be submitted at http://bixolabs.com/PTD/

Presenting at 2009 Silicon Valley Data Mining Camp

October 30, 2009
by kkrugler

This coming Sunday is the big Bay Area data mining “unconference“, and with more than 200 people already signed up, it’s going to be a lot of fun.

I’ll be presenting at some point during the day – since it’s an unconference, you don’t really know who’s going to be talking about what/when. My topic is “Elastic web mining using open source (Hadoop/Cascading/Bixo) in Amazon’s EC2 cloud“.

If you scan the list of attendees (click the “RSVPs” tab near the top of the LinkedIn event page) you’ll see a lot of high powered executives, consultants and researchers, so I’m looking forward to really great lobby conversations.

Many thanks to the San Francisco Bay Area Chapter of the ACM for helping out with the event, which is taking place from noon to 7:30pm at the Hacker’s Dojo in Mountain View. Greg Makowski is the organizer, so he’s probably going a little bit crazy right now :)

 

Amazon Drops EC2 Prices

October 28, 2009
by kkrugler

And that’s good news for our customers! In their official announcement, Amazon said:

Finally, we are also lowering prices on all Amazon EC2 On-Demand compute instances, effective on November 1st. Charges for Linux-based instances will drop 15% — a small Linux instance will now cost just 8.5 cents per hour of computing, compared to the previous price of 10 cents per hour.

Since your Bixolabs usage fee is based on AWS pricing, you directly benefit from this reduction, since Linux-based instances are what we use for web mining. Visit the EC2 pricing page for more details on the base price we use for calculating your usage fees.