Hacking through the Amazon with a shiny new MachetEC2

Hold on to your pith helmets: the Infochimps are releasing an Amazon Machine Image designed for data processing, analysis, and visualization.

Amazon’s Elastic Compute Cloud (EC2) lets users instantiate a virtual computer from a shared image (an “Amazon Machine Image”, or AMI) that comes with a pre-installed operating system, software packages, and up to 1 TB of data loaded on disk, ready to work with.
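
For example, here is a minimal sketch of launching an instance from an AMI with the boto Python library; the image ID and keypair name below are placeholders, not the real MachetEC2 values:

    import boto

    # Connect to EC2 using credentials from the environment
    # (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY).
    conn = boto.connect_ec2()

    # Launch one instance from a (hypothetical) AMI ID.
    reservation = conn.run_instances(
        'ami-00000000',            # placeholder image ID
        key_name='my-keypair',     # an SSH keypair you have already registered
        instance_type='m1.large',
    )
    instance = reservation.instances[0]
    print(instance.id)
    print(instance.state)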

MachetEC2 is an effort by a group of Infochimps to create an AMI for data processing, analysis, and visualization. If you create an instance of MachetEC2, you’ll have an environment with tools designed for working with data, ready to go. You can load in your own data, grab one of our datasets, or pull data from one of Amazon’s Public Data Sets. No matter what, you’ll be hacking in minutes.
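
As a sketch (again assuming boto, with placeholder IDs), attaching one of the Public Data Sets to a running instance amounts to creating an EBS volume from the data set’s snapshot and attaching it:

    import boto

    conn = boto.connect_ec2()

    # Create an EBS volume from a (placeholder) public data set snapshot;
    # the volume must live in the same availability zone as the instance.
    volume = conn.create_volume(size=100, zone='us-east-1a',
                                snapshot='snap-00000000')

    # Attach it to a (placeholder) running instance as /dev/sdf,
    # then mount it from within the instance.
    conn.attach_volume(volume.id, 'i-00000000', '/dev/sdf')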

We’re taking suggestions for what software the community would be most interested in having installed on the image (peek inside to see what we’ve thought of so far…)

We’ve thought of including some subset of:

  • Ruby, Python, Erlang, R
  • MySQL, PostgreSQL
  • AllegroGraph, CouchDB
  • Hadoop, Hive, Pig
  • Cytoscape, Gruff
  • Processing, Prefuse/Flare, Modest Maps
  • NLTK, SciPy

What other software would you like to see? Operating system preferences? Know of any similar AMIs? (Please suggest only free and open software!)

When we feel that the AMI is getting too bloated, we’ll split it up: MachetEC2-ML (machine learning), MachetEC2-viz, MachetEC2-lang, MachetEC2-bio, &c.

(Also check out a similar discussion on the forums at FlowingData. We’ll reply to comments both here and there.)

Comments

  1. M. Edward (Ed) Borasky March 13, 2009 at 9:23 pm

    My *hard* requirements for data analysis / visualization software:

    1. 64-bit Gnu/Linux, 2.6.25 kernel or later. I don’t much care which distro, although I personally use openSUSE 11.1. Ubuntu Intrepid Ibex or later, Fedora 9 or later will also work. Hardy Heron’s kernel is too old. If Gentoo ever gets their release engineering act together, I’d consider it. CentOS / RHEL 5 kernel is too old. Debian Lenny might work, but I’ve never used it and Ubuntu seems to be more tuned to a workstation than Debian, which is mostly a server distro.

    2. R 2.8.1-patched or later. This has access to all the repositories; older versions don’t know about RForge. 2.9.0 is almost ready for release; I’ll be alpha testing it daily.

    3. *All* of the CRAN task views! That includes the dependencies. For example, if you have Rgraphviz, you need graphviz and graphviz-devel (on openSUSE).

    4. GGobi 2.1.8 or later.

    5. PostgreSQL 8.3.5 — sorry, dolphins. You can have MySQL but I won’t use it. pgadmin3 is also required.

    6. LyX 1.6.1 or later. This integrates with R for publication-quality graphics, “literate programming”, and “reproducible research”.

    That’s what I’ve got on my workstation, and that’s what I’d expect as a minimum for software.

    I can give you openSUSE 11.1 build scripts; the standard openSUSE 11.1 and even 11.2 (“Factory”) packages are behind the state of the art for R, GGobi and LyX, so I build them from upstream source. I’ve got most of this working on two different machines, and the scripts are up on Github, but I don’t have an Amazon AMI to test with yet. I could probably build an image here locally with the tools openSUSE has (Kiwi, Xen).

    Bonus points if there’s a way to build ATLAS (Automatically Tuned Linear Algebra Software) on a virtual machine. :)

  2. Pingback: infochimps Amazon Machine Image for data analysis and viz

  3. Pingback: Start hacking: machetEC2 released! « blog.infochimps.org - Organizing Huge Information Sources

  4. Stephen February 5, 2009 at 7:51 pm

    It would be nice if it had Condor or the Globus Toolkit, and also some kind of script to make it easy to add these nodes to a Condor cluster.

  5. Philip (flip) Kromer January 30, 2009 at 12:54 pm

    Ooh, I also like that, Dhruv — the .icss files (Infochimps Stupid Schema — all the data types, notes, links, and other metadata off the infochimps page) can be live-updated at one’s preference from there.

  6. Philip (flip) Kromer January 30, 2009 at 12:51 pm

    Cool idea on the spell-checking lists, Neal.

    We can put in volume images that have “smaller” but related datasets — say, the BSD etc/words, the panoply of spellchecking programs, the BNC corpus word-frequency list, and the gutenberg.org word lists and dictionaries.

    NLTK’s corpora will probably live on their own.

  7. Neal Richter January 30, 2009 at 1:19 am

    Software: Solr, Lucene, Mahout, memcache, memcachedb, memcacheq

    Packages: Every spell checking package you can put in there (for the word lists!)

    Data: Wikipedia categories dump, Freebase, DMOZ category dump. NIST text data if you can get it.

  8. dhruvbansal January 29, 2009 at 2:06 pm

    We’ll post the build scripts we used for the image when we make the image public.

    @Pete, your list of Python packages was indispensable! And thanks for the info about startup scripts — looks like a good way to manage fast-changing resources (like the Infinite Monkeywrench…).

    Check http://machetec2.org for updates (right now it just points at this blog entry but soon will be its own wiki).

  9. Pete Skomoroch January 29, 2009 at 9:04 am

    Unless you already have this built in, you might want to consider a startup boot script to optionally install updated versions of libraries and pull code from an infochimps repo.

    http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1403

    If the list of installs grows fairly large and you are building things from source on new versions of the AMI, you can hit issues with unavailable repos or mirrors. A good strategy there is to build those non-apt packages from an infochimps src repo, just in case they move in the future.

    I ended up using Puppet for a lot of this package-management stuff at work, but that might be a pain for public AMIs. A simple AMI installation bash script often does the job.
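
    A rough sketch of the kind of boot-time install hook Pete describes — run once at startup to refresh packages and pull fresh code, so the baked-in AMI contents can age gracefully. The repository URL and package list here are made up for illustration:

        import subprocess

        # Hypothetical fast-moving packages and code repository to refresh at boot.
        APT_PACKAGES = ['git-core', 'python-numpy']
        CODE_REPO = 'git://example.com/infochimps/monkeywrench.git'

        def run(cmd):
            """Run a shell command, raising if it fails."""
            subprocess.check_call(cmd, shell=True)

        run('apt-get update')
        run('apt-get install -y ' + ' '.join(APT_PACKAGES))
        run('git clone %s /usr/local/src/monkeywrench' % CODE_REPO)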

  10. Pete Skomoroch January 29, 2009 at 8:41 am

    Hope that package list helps… I released a much more Python-focused AMI last year based on that list that allows you to fire up an MPI cluster in a similar fashion to the Hadoop cluster bash scripts. That is useful for running existing MPI code like the Parallel Boost Graph Library, etc.

    http://code.google.com/p/elasticwulf/

    My image is way out of date at this point, and doesn’t do things like automatically load EBS volumes. I’ve been planning on updating it, but maybe I’ll hijack the infochimps image if you publish build scripts.

    One thing I would suggest is to make sure that you have a 64-bit version of the AMI, and that you use the corresponding 64-bit version of Python before building numpy/scipy — otherwise you will hit a 2 GB memory limit when processing data with scipy.
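
    A quick way to sanity-check that point on the instance itself — a small sketch, not specific to any particular AMI:

        import platform
        import struct

        # platform.machine() reports the OS/hardware architecture (e.g. 'x86_64');
        # struct.calcsize('P') reports the Python build's pointer size in bytes:
        # 8 for a 64-bit Python, 4 for a 32-bit one.
        print(platform.machine())
        print(struct.calcsize('P') * 8)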

  11. dhruvbansal January 28, 2009 at 11:33 pm

    Update: @peteskomoroch is working on a similar AMI, details at http://tinyurl.com/ckl2zr