Open Source Data Mining Tools

Below is a report on the open source data mining tools session at the ACM data mining unconference this past Sunday (01 Nov 2009).

This only covers tools that the panelists had used, so it’s not a survey of the available tools. See Jeff Dalton’s blog post on Java Open Source NLP and Text Mining tools for an example of a more complete list of a closely related group of tools.

Weka

Paul O’Rorke talked about Weka, a collection of machine learning algorithms for data mining tasks. There were concerns about whether it’s still viable. One person said that pieces of it are still viable for clustering and feature selection.

An attendee mentioned MOA. MOA is a framework for data stream mining. Includes tools for evaluation and a collection of machine learning algorithms. Related to the WEKA project, also written in Java, while scaling to more demanding problems.
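To make "data stream mining" a bit more concrete, here is a minimal Python sketch of test-then-train (prequential) evaluation, where each example is first used to test the model and then to update it, and is never stored. This is not MOA's API; the function name, the perceptron learner, and the learning rate are all assumptions chosen for illustration.

```python
def prequential_accuracy(stream, learning_rate=0.1):
    """Test-then-train over (features, label) pairs with labels in {0, 1}.

    Hypothetical helper for illustration only; not part of MOA or Weka.
    """
    weights, bias = None, 0.0
    correct = total = 0
    for features, label in stream:
        if weights is None:
            # Size the weight vector from the first example seen.
            weights = [0.0] * len(features)
        # Test first: predict with the model as it stands before using the label.
        activation = bias + sum(w * x for w, x in zip(weights, features))
        prediction = 1 if activation > 0 else 0
        correct += prediction == label
        total += 1
        # Then train: perceptron update, which only changes the model on mistakes.
        error = label - prediction
        if error:
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias += learning_rate * error
    return correct / total if total else 0.0
```

Because the stream is consumed one example at a time, memory use stays constant no matter how long the stream runs.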

R Language

David Smith talked about R. It’s possible to get results quickly by using building blocks from other users. Data is often prepared before being processed by R. On the back end are presentation tools; Sweave is a report generation back end that works well with R. Lots of research is going on into out-of-memory modeling, to handle larger data sets, and there is also lots of work on parallel processing. The bigmemory package handles large data sets. Paul mentioned that R has a steep learning curve. David agreed that R is quirky, especially in terms of memory usage. See David’s blog post about the event.

Attendee asked about comparing MATLAB & R, with respect to viability in a production environment. He’d run into memory problems with MATLAB. David said that R was similar, and recommended doing scoring outside of R. He estimated that R requires 3-6x more memory than C++.

Attendee said many people use R for prototyping and generating models, but use something else in production. Examples would be NumPy and SciPy.
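As a rough illustration of scoring outside of R, the sketch below applies a logistic regression model in plain Python. It assumes the coefficients were fitted in R and exported by hand; the feature names and values are made up for the example. The same arithmetic could be vectorized with NumPy for batch scoring.

```python
import math

# Hypothetical coefficients, as if copied from a logistic regression fitted in R.
# The feature names and values are invented for illustration.
INTERCEPT = -1.5
COEFFICIENTS = {"age": 0.03, "visits": 0.4}


def score(record):
    """Return the predicted probability for one record (a dict of feature values)."""
    # Linear predictor: intercept plus the weighted sum of the features.
    z = INTERCEPT + sum(weight * record[name]
                        for name, weight in COEFFICIENTS.items())
    # The logistic (inverse logit) function maps the linear predictor to (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    print(score({"age": 50, "visits": 2}))
```

The production side then only needs the coefficients, not an R runtime, which avoids the memory overhead discussed above.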

Paul mentioned that R provides a very compact representation of data mining tasks. (Ken – so it’s the APL of data mining?)

KNIME

Nicolas Cebron talked about KNIME (pronounced “naim”), a modular data exploration platform. Started in 2004. knime.org has full details. He demonstrated the KNIME application, which has a nice GUI for working with data sets. The model can be output as PMML.

Attendee asked about long-term viability of KNIME. Nicolas said that it’s been around for 4 years, has a vibrant community, and there are commercial companies creating modules.

Mahout

Ted Dunning talked about Mahout, an Apache open source project with the goal of scalable machine learning/data mining. Java is the main language; Hadoop & Lucene are foundation technologies. Currently has good clustering algorithms, such as k-means, and reasonably good classifiers (supervised learning algorithms). Also has a recommendation framework called Taste. Very young project. Has support for sparse matrix math, and might pool efforts with the Apache Commons Math project. Mahout is mature enough for some types of machine learning problems.
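For readers unfamiliar with k-means, here is a minimal single-machine sketch in Python of the basic algorithm (Lloyd's iteration). It is not Mahout's implementation, which is written in Java and distributed over Hadoop; the function name and the naive "first k points" initialization are assumptions made to keep the example short and deterministic.

```python
import math


def kmeans(points, k, iterations=20):
    """Cluster points (tuples of numbers) with Lloyd's algorithm.

    Returns (centroids, labels). Illustrative sketch only.
    """
    # Naive initialization: use the first k points as the starting centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # no centroid moved, so the assignment is stable
        centroids = new_centroids
    return centroids, labels
```

The assignment step is independent per point, which is what makes the algorithm a natural fit for a map/reduce framework such as Hadoop.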

Hadoop

Attendee asked about comparing Hadoop distributed file system (HDFS) and Sun distributed file system. Chris Wensel from Concurrent explained that HDFS is very specialized, optimized for streaming reads. Can’t do random updates to files. Scales to 1000s of servers. Very fault tolerant. Ted confirmed that it’s very reliable, with a humorous story about a cluster of the world’s worst servers.

Bixo

Ken Krugler (your faithful scribe) talked about the HECB (Hadoop, EC2, Cascading, Bixo) stack for web mining. Focus is on the collection and initial processing/reduction of the data, not hard core machine learning & data mining.

4 Responses
  1. November 3, 2009

    I should have mentioned Jeff Dalton’s excellent list of Java text mining tools (also NLP) at http://www.searchenginecaffe.com/2007/03/java-open-source-text-mining-and.html

    – Ken

  2. November 8, 2009

    Another powerful, flexible, and comprehensive open source data mining software is RapidMiner, which is freely available at http://www.rapid-i.com/

    According to the yearly KDnuggets polls in 2007, 2008, and 2009, RapidMiner was the most widely used open source data mining solution among data mining experts:
    http://www.kdnuggets.com/polls/2009/data-mining-tools-used.htm
    http://www.kdnuggets.com/news/2009/n10/1i.html

    • November 8, 2009

      Hi Frank,

      Thanks for the information on RapidMiner!

      I’ll also add some information at the beginning of the write-up to clarify that we were only covering tools that panel members knew about, so it’s not any kind of general survey.

      – Ken

  3. November 18, 2009

    Ken, thanks for the comment you posted to my Open Source Text Analytics article pointing me to this page.

    Readers of this page: you might find my article useful, http://www.b-eye-network.com/view/9516

    Seth
