Start hacking: machetEC2 released!

machetEC2, the Infochimps Amazon Machine Image (AMI) designed for data processing, analysis, and visualization, has been released!

Amazon’s Cloud Computing services give you transformatively cheap and scalable computing power, and their Public Data Sets (AWS/PDS) collection (which Infochimps is contributing to) is helping to put the world of free, open data at your fingertips.  MachetEC2 lets you summon a “batteries included” computer, or a hundred computers, from the cloud.  As soon as it loads, you’re ready to start crunching, transforming, and visualizing data, whether from AWS/PDS or your own pool.

When you SSH into an instance of machetEC2 (brief instructions after the jump), check the README files: they describe what’s installed, how to deal with volumes and Amazon Public Data Sets, and how to use X11-based applications.  You can also visit the machetEC2 GitHub page to see the full list of packages installed, the list of gems, and the list of programs installed from source.

This machete is only as sharp as it is complete. If there’s software that you find indispensable, we encourage you to suggest it here, or even better to help add it to the toolkit (instructions are within).

To launch an instance of machetEC2, log into the AWS Console, click “AMIs”, search for “machetEC2” or ami-29ef0840, and click “Launch”. If you’re on the command line, simply run

$ ec2-run-instances ami-29ef0840 -k [your-keypair-name]
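Once the instance reports `running`, `ec2-describe-instances` will show its public DNS name, which you can then SSH to. A minimal sketch (the INSTANCE line below is fabricated sample output, and the field position, key path, and root login are assumptions):

```shell
# Sample INSTANCE line from `ec2-describe-instances` (hypothetical values);
# the public DNS name is the fourth whitespace-separated field.
line="INSTANCE i-0123abcd ami-29ef0840 ec2-75-101-128-30.compute-1.amazonaws.com ip-10-0-0-1.ec2.internal running"
dns=$(echo "$line" | awk '{print $4}')
echo "$dns"

# Then connect (key path is a placeholder):
# ssh -i ~/.ec2/your-keypair.pem root@"$dns"
```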

By the time you’ve grabbed some coffee, you’ll be able to access an EC2 instance with all the tools you need for working with data already installed, configured, and ready to hack.

You can obtain a copy of the machetEC2 build scripts at the Infochimps machetEC2 GitHub page. If you improve them, send us a pull request on GitHub and we’ll include your contributions in the next build of machetEC2!

This is our first build of machetEC2 and we’re very excited to have the community’s input on what’s missing and what needs to be improved. We’ve incorporated many of the suggestions from our RFC post, but not all of them made it into this initial release, for reasons of either time or (disk) space.  We’re tracking things to add, though, so either post below (comments on this post will become entries in a wiki we’ll be hosting soon) or, as we said, send along a pull request.


  1. Pingback: How Google and Facebook are using R :Health Fitness Wealth

  2. Philip (flip) Kromer March 25, 2009 at 4:51 pm

    Stamen’s Mike Migurski and Shawn Allen have a cartography AMI, available at

    Joe Bob says check it out.

  3. Pingback: Amazon Web Services hosts DBpedia, Freebase data sets « - Organizing Huge Information Sources

  4. Pingback: Crunch that data : business|bytes|genes|molecules

  5. Pingback: How Google and Facebook use R for Analytics : Data Evolution

  6. dhruvbansal February 9, 2009 at 7:19 pm

    The image doesn’t have any data (and no volumes) attached to it; it’s just a huge collection of tools for working with data.

    @Neil is right: housing data on EBS allows for data persistence as well as striping. We’re going to add some commands in the next version of machetEC2 which make creating volumes from Amazon Public Data Sets and other sources very easy. Till then, we’re leaving the handling of volumes entirely up to the user.
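Until those commands ship, the manual workflow looks roughly like this (the snapshot id, zone, instance id, device, and mount point below are all placeholders, and the `ec2-*` and `mount` commands are shown commented out since they need live AWS credentials and root access on the instance):

```shell
# Sketch: create an EBS volume from a Public Data Set snapshot, attach it,
# and mount it. Every identifier here is a placeholder; substitute your own.
SNAPSHOT=snap-XXXXXXXX     # the Public Data Set snapshot id
ZONE=us-east-1a            # must match your instance's availability zone
INSTANCE=i-XXXXXXXX        # your running machetEC2 instance
DEVICE=/dev/sdf
MOUNTPOINT=/mnt/pds

# ec2-create-volume --snapshot "$SNAPSHOT" -z "$ZONE"   # note the vol-... id it returns
# ec2-attach-volume vol-XXXXXXXX -i "$INSTANCE" -d "$DEVICE"
# mkdir -p "$MOUNTPOINT" && mount "$DEVICE" "$MOUNTPOINT"   # run this on the instance
```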

  7. Neil February 9, 2009 at 5:41 pm

    Very nice,

    I’m just starting to get my head around the ‘cloud’ and why it is cool. Currently deploying an Oracle server... the easiest install I’ve ever done!

    Will look into the machetEC2 image once I get used to the way this works, and plan out some costs. For storage, do you guys use EBS so that you don’t always need your instance running, and so you get striping, etc.?

  8. Michael E Driscoll February 6, 2009 at 7:16 pm

    Flip – Truly fantastic! Glad that you guys included the r-core stuff in there, but being an R nut – would love to see some of the other Ubuntu R packages included (all with the “r-cran” prefix), esp. Lattice and Cairo (for pretty graphics).

  9. mrflip February 6, 2009 at 6:50 pm

    Side notes: this release doesn’t have Hadoop yet, as the current 0.19.0 version has been deprecated and we’re waiting for it to stabilize.

    We’ll be adding Condor and probably Disco too. I think the spelling-list packages are actually best installed on an EBS volume and set up as a Public Data Sets image.

    If you’d like to accelerate any of those changes, the scripts Dhruv wrote (inspired by code @esh and @peterskomoroch wrote) are really elegant, and should be very easy to hook into. Fork the project from our GitHub repository
    git clone git://
    and just ping infochimps with a pull request once you’re happy with the new, improved AMI.