Products & Features

Announcing Support for OpenStack and the Rackspace Cloud

Infochimps is happy to announce that we now support the next generation Rackspace Cloud, based on OpenStack. Through integration with the OpenStack API, the Infochimps Platform can now power big data applications based in the Rackspace Cloud, expanding the reach of the Infochimps Platform and making it quick and easy for a broader range of users to run complex big data infrastructure.

Rackspace customers running the new OpenStack-based Rackspace Cloud Servers can use the Infochimps Platform to spin up Hadoop clusters to power their big data applications in as little as 20 minutes, with a single command. With the power of Ironfan, Infochimps’ open source provisioning tool, and Dashpot, Infochimps’ visualization and operations dashboard, customers can easily monitor and manage their Big Data operations on an ongoing basis, or leave it to Infochimps to manage them on the Rackspace Cloud.

Check out this demo of Infochimps Platform running in the Rackspace Cloud:

Why OpenStack and Rackspace?
From the beginning, the Infochimps Platform has been built on a foundation of open source tools for managing data, aimed at simplifying the experience of working with complex technologies such as Hadoop or Cassandra. Within the Infochimps Platform, Wukong, Ironfan, and Swineherd are major open-source components of the stack. OpenStack supports our open source tradition with its strong open source ecosystem. It is used and contributed to not only by Rackspace, but by organizations such as NASA, Canonical, Red Hat, Dell, HP, and AT&T, so its architecture serves a multitude of needs rather than bending to the whims of a single provider.

OpenStack also encourages standardization among Infrastructure as a Service providers, which ultimately benefits everyone in the market. Clients can make (and remake) decisions based on their businesses’ current day-to-day needs, without needing a crystal ball to predict which provider will be best for them in the long term. By sharing open, standard interfaces, cloud providers can compete on current quality and value, instead of fighting to lock in customers based on promises.

The modular design of OpenStack is part of what makes standards possible without blocking innovation. There is a set of core APIs that every provider will support, and extensions for added capabilities that not every provider will want to allow. The contracts these APIs provide can be (and often are) fulfilled by different back-end providers, letting each provider make different architectural choices without requiring customers to completely retool to take advantage of them. All of this allows apples-to-apples comparison of provider architectures, without making orange sales impossible.

What does OpenStack mean for Infochimps?
The work we’ve done to support this announcement has given us a level of abstraction from the Amazon Web Services environment, so we can deploy our platform in a cloud-agnostic way. Many of our customers have asked for implementations on their in-house cloud environments – our OpenStack support allows those implementations to be airlifted in using a common set of APIs that sits on top of whatever infrastructure already exists, instead of one-off installations that require more custom development and introduce brittleness.

Interested in learning more about Infochimps, Rackspace, and OpenStack? Contact us today for more information!

Announcing Dashpot, our Analytics & Operations Dashboard for the Infochimps Platform

Infochimps is happy to announce Dashpot, an easy-to-use analytics and operations dashboard that provides business metrics and visualization, cluster management capabilities, and system monitoring on top of the Infochimps Platform. Dashpot gives you real-time visibility and control of your Big Data stack running with Infochimps, helping you go from input to insight faster, with our best-in-class Big Data infrastructure and tools.

Here are some of Dashpot’s key features:

  • Business Metrics – Dashpot’s in-stream visualization gives business users the ability to capture and visualize business metrics on the fly as data is being ingested into their Infochimps Platform. Because data can be decorated in-stream through our Flume-based Data Delivery Service, Infochimps enables quick introspection into how a data or business process is performing. Organizations can view spikes or drops in key system or business metrics in near real-time, enabling quicker response to changing business conditions, saving time, and helping ensure higher quality and more valuable information in the organization’s ultimate datastore. Infochimps business metrics are designed to provide an intermediate data visualization capability alongside an organization’s existing investments in traditional business intelligence solutions.
  • Cluster Management – Built on the power of Ironfan, Dashpot offers simple Big Data system automation and management with a quick glance view into the servers and clusters currently running. Operations users can easily spin them up and down with a simple button click as their processing needs change, creating significant, easy-to-attain cost savings in machine usage.
  • Systems Monitoring – Dashpot integrates with popular monitoring packages to provide users with at-a-glance views of Big Data system performance, availability, system integrity, and more. Designed to easily integrate with any monitoring product, Infochimps has implemented the popular open source product Zabbix as its initial reference monitoring solution, integrating Zabbix graphs of system performance and availability into the Dashpot dashboard.

Implementing and operating Big Data architectures can be difficult, requiring significant investment of resources and time. By choosing to use the Infochimps Platform, enterprises needn’t worry about the time and hassle of building and maintaining their own infrastructure. When combined with our tools, such as Ironfan and DDS, Dashpot’s simple visualizations and management tools help organizations keep their Big Data system humming, with little operational overhead. Best of all, Dashpot’s in-stream visualizations help provide the insights businesses need to get the most value out of their Big Data infrastructure investment.

Interested in talking about how we can help simplify your Big Data stack?  Contact us today for more information!

How to Build a Hadoop Cluster in 20 Minutes

If you’ve ever tried your hand at manually provisioning, configuring and deploying a Hadoop cluster, you know that it can take days or weeks to create a fully functional system. With tools like Chef, this time can be cut down to a matter of hours or days (depending on the size of the cluster). In this video, Dhruv Bansal, Chief Science Officer of Infochimps, builds a Hadoop cluster in 20 minutes with Ironfan.

Ironfan is the foundation for your Big Data stack, making provisioning and configuring your Big Data infrastructure simple. Spin up clusters when you need them, kill them when you don’t, so you can spend your time, money, and engineering focus on finding insights, not getting your machines ready. To learn more about how Ironfan enables The Infochimps Platform, check out our white paper.
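For the curious, Ironfan works as a plugin to Chef’s knife command-line tool, so “a single command” really is the day-to-day experience. A launch might look roughly like this (the cluster name `demo_hadoop` is a hypothetical example, standing in for whatever cluster you’ve defined):

```shell
# Launch every node in a hypothetical cluster definition named "demo_hadoop"
knife cluster launch demo_hadoop

# ...and tear the whole cluster down when you no longer need it
knife cluster kill demo_hadoop
```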

Explore Foursquare with Infochimps

Today, Foursquare announced the launch of a web version of Explore, their tool for discovering interesting places.  Leveraging the power of 1.5 billion check-ins, this recommendation engine does not spit out one-size-fits-all answers.  Instead, it intelligently compares your own check-in history with those of your friends and others to help you answer questions like…

  • What’s the best sushi restaurant in my town that I haven’t been to before?
  • What food trailer on East 6th Street will offer me the fastest service at 1am when I have had too much to drink and need a delicious mobile food option NOW?
  • Where can I get Golden Monkey beer near the Infochimps HQ during happy hour?

[Screenshot: the web version of Foursquare Explore]

Foursquare Explore is a great illustration of a favorite saying of our CTO, Flip Kromer – “the solution to the too-much-data problem is more data!”  With the massive amount of check-in data and comments/tips left by Foursquare users, we can suddenly begin to get reliable answers to our strangely difficult-to-answer everyday questions.

Interested in building a tool similar to Foursquare Explore or augmenting an existing places recommendation engine?  You too can unlock the power of Big Data with some of these great Infochimps APIs:


Foursquare Venues, Wikipedia Articles, Census Data and More… All With Just an IP Address!


Greetings from deep in the Data Mine here at Infochimps. This week the team rolled out new features that combine one of our most popular APIs with our Geo API platform, unlocking the ability to geolocate based on an IP Address with any of our Geo APIs.

The idea is based on one of our more popular mashups, our MaxMind GeoLite IP to Census API, which blends IP geolocation functionality with Census data. This allows you to find out not just where an IP address maps to, but also some high-level information about that area – ideal for websites that do geotargeting and for people looking for a deeper understanding of their visitor audience. The data it draws on has become a bit dated, though (it uses the 2000 Census), and it covers a relatively narrow band of properties. Enter our Geo API platform, which offers richer and more current data from a variety of sources.

A great advantage of our new Geo API platform is our ability to perform two-step queries internally, essentially converting a parameter into another parameter behind the scenes. It’s the key technology behind our ability to geolocate using an address: our geocoder first converts the address into latitude/longitude before making a secondary query against our data store to retrieve the response values.
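The two-step pattern described above can be sketched in a few lines. Note that `geocode` and `query_datastore` here are hypothetical stand-ins for the internal services (with a made-up address and coordinates), purely to illustrate the parameter-conversion idea, not actual Infochimps code:

```python
def geocode(address):
    """Hypothetical first step: resolve an address to (lat, lng).
    A real implementation would call a geocoding service; here a
    tiny lookup table stands in for it."""
    known = {"123 Example St, Austin, TX": (30.2711, -97.7559)}
    return known[address]

def query_datastore(lat, lng, radius_km):
    """Hypothetical second step: fetch records near a point."""
    return {"center": (lat, lng), "radius_km": radius_km, "results": []}

def two_step_query(address, radius_km=3):
    """Convert one parameter (an address) into another (lat/lng)
    behind the scenes, then run the real query against the store."""
    lat, lng = geocode(address)
    return query_datastore(lat, lng, radius_km)

print(two_step_query("123 Example St, Austin, TX")["center"])
```

The caller only ever supplies an address; the latitude/longitude conversion happens invisibly in the middle, which is exactly what lets the same trick work for IP addresses.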

By using the same principle with IP geolocation instead of address geocoding, we have unlocked the ability for our users to query any of our Geo APIs with an IP address as the geolocator, returning data as if the request had used a latitude/longitude. So now you can use an updated IP to Census API and also a more detailed drilldown version. Furthermore, you can now go from IP to Foursquare Venue, Zillow Neighborhood, Wikipedia Article, and so on.

To use the new IP-Geolocation feature, just pass in the parameter g.ip_address with an IP address, along with a g.radius.  Check out this example query (substitute your own API key), which will help you locate banks and credit unions in our Foursquare database that are within 3 km (about 2 miles) of the Infochimps office in Austin, TX.

For client-side geo application developers we’ve also added another feature along with g.ip_address. With any of these APIs you can now pass “g.get_ip_address=true” instead, and our Geo API will determine the IP address of the machine calling our API and use that IP address as the geolocator. This new flag makes it easy to ask questions of our API like “tell me about venues near me” without ever having to know what your longitude is or how to interpret a quadkey.
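As a rough sketch of what assembling such a request might look like: the base URL below is a hypothetical placeholder (consult the API docs for the real endpoint and the units of g.radius), while `g.ip_address`, `g.radius`, and `g.get_ip_address` are the parameters described above:

```python
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only
BASE = "https://api.example.com/geo/location/foursquare_venues.json"

def build_query(api_key, **geo_params):
    """Assemble a Geo API query string from the documented parameters."""
    params = {"apikey": api_key}
    params.update(geo_params)
    return BASE + "?" + urlencode(params)

# Server-side: geolocate an explicit IP, searching within a given radius
url = build_query("YOUR_API_KEY",
                  **{"g.ip_address": "198.51.100.7", "g.radius": 3000})

# Client-side: let the API detect the calling machine's own IP
url_me = build_query("YOUR_API_KEY", **{"g.get_ip_address": "true"})
```

Either URL can then be fetched with any HTTP client; the response comes back as if the request had carried a latitude/longitude.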

All in the spirit of making Geo data more accessible and easy to use!

A Designer Writes an App Using Our Geo API

Jim England is Infochimps’ new Director of User Experience. He’s well-versed in CSS, HTML, and great design, but without a hardcore programming background, he was the perfect candidate to put our Geo API to the test. Did we create a product that was versatile, powerful, and so easy-to-use that a UX guy could create a useful app in just a few days?

[Screenshot: the Gibbon Travel Guide app]

The release of the Infochimps Geo API was an excellent opportunity to sharpen up my programming skills by developing a fun sample application. In only a few days, I was able to build the Gibbon Travel Guide, which calls the Foursquare Places and Wikipedia Articles APIs to show interesting places to visit in a city. It defaults to Austin, Texas but be sure to try out other cities!

As I developed the app, the Infochimps documentation was there to steer me in the right direction. The Getting Started with Geo guide described the available APIs, taught me the basic structure of the API calls, and showed how to limit the search results. With this knowledge, I added “f.q=museum” and “f.q=park” filters on the Wikipedia API to limit the results to those categories.

Once my query was constructed, I used the code examples in the Infochimps Ruby library to have my app access the Infochimps API.  If you want to see the source code, check out my git repo.

The project was a fun experience and really showed just how easy it was to create something cool in a very short period of time with our new Geo API.  If you’re building an app on top of the Geo API, feel free to get in touch with me if you have any questions!

The Summarizer: The Infochimps Cure for Geolocation Overload

Last week, we revealed our brand new Infochimps Geo APIs. Not only are our APIs chock full of millions of points of interest and contextual data, but the schema is also dead simple to learn and implement. And since all the new Geo APIs are unified under the same schema, no matter what location data you are looking to access, the API should always work exactly as you’d expect it to.

In the process of developing our new Geo APIs, we developed one very important and useful feature: The Summarizer. It makes presenting venues user-friendly by intelligently clustering locations, and you can take advantage of it automatically. Let’s dive in a little deeper so you can see exactly what we’re talking about.

[Map: all churches in Texas, color-coded by city]

The image above is a plot of all the churches in Texas, taken from Geonames Places. Cyan is Dallas, blue is Austin, green is San Antonio, and purple is Houston. As you can easily see, there are way more points being plotted than it makes sense to present to the user. The common remedy is to present a smaller sample of data. In practice though, that sample ends up being more random than not.
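Infochimps hasn’t spelled out the Summarizer’s internals in this post, but the basic idea of collapsing too many markers into a few representative ones can be sketched with simple grid binning. This is an illustrative stand-in, not the actual Summarizer algorithm:

```python
from collections import defaultdict

def grid_summarize(points, cell_deg=0.5):
    """Bin (lat, lng) points into grid cells and return one summary
    marker per cell: the cell's centroid and its point count.
    cell_deg controls the cell size in degrees."""
    cells = defaultdict(list)
    for lat, lng in points:
        key = (int(lat // cell_deg), int(lng // cell_deg))
        cells[key].append((lat, lng))
    summaries = []
    for pts in cells.values():
        n = len(pts)
        lat_c = sum(p[0] for p in pts) / n
        lng_c = sum(p[1] for p in pts) / n
        summaries.append({"center": (lat_c, lng_c), "count": n})
    return summaries

# Two downtown-Austin churches collapse into one weighted marker;
# the Houston one stays separate.
churches = [(30.27, -97.74), (30.28, -97.75), (29.76, -95.37)]
print(grid_summarize(churches))
```

Unlike random sampling, every cluster in the data is still represented, and the counts preserve a sense of density.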


Back to School Data Pop Quiz

With a record-setting heat wave still raging through the United States, it’s hard to believe that it’s nearly back to school season.  Many students will be returning to classrooms as early as Monday of next week, and we’ve rounded up some of our most interesting education-related data sets.  In the spirit of our fine institutions of learning, we’ve decided to have a little two-question pop quiz!

Be the first person to leave correct answers to the following questions in comments and we’ll send you a sweet Infochimps belt buckle.  Yee-haw!

Enrollment in Public and Private Schools: 1960 to 2005
Between which years did total enrollment in public and private schools remain flat? (Hint: there may be more than one instance of this.)

Most Dangerous Colleges 2010 – The Daily Beast Ranking
Which school ranks #1 in most instances of arson? (Bonus points if you can come up with a reasonable reason why)

And this one isn’t on the test, but a good data set to check out anyway…

US Colleges and Universities

Want to build a university finder app?  It would probably help to have a database of all US colleges and universities, as of 2010 (9,350 total).

We’ve got over 15,100 more where that came from.  Visit our site today and search for the data you want.  Can’t find what you need? Let us know on UserVoice!

Gigantopithecus & Other Huge (Data) Apes

Meet Gigantopithecus.

… or at least a lifelike artist’s rendering of the giant ape, extinct for some 100,000 years*. Based on the few fossils that have been unearthed, our best guess is that Gigantopithecus stood about 10 feet tall and weighed in at 1,200 lbs.  Fossils are few and far between, and much about this creature remains unknown due to the lack of complete data.

What does this have to do with huge data sets?

Once upon a time, huge data sets were hard to find and even harder to download, cleanup and analyze.  Mythical beasts, such as historical records of Twitter users & conversations or the mapping of the Human Genome, proved difficult to locate, let alone interact with in the wild.

We receive tons of emails from folks looking for large data sets.  Whether they’re looking to test a new algorithm or storage solution or simply want to try some new tools, requests for 100+ MB data sets are common and we’ve pulled together a list of our biggest and best.

Check out our huge data sets, including one that’s 150 TB!

* Yes, this means that we likely cohabited the Earth with these guys… and still may according to some cryptozoologists. 

New Data: Nerd Out with WoW, M:TG and More

Last week, we showcased data sets and APIs that could help you navigate the joys and perils of monkey love.  This week, we’re nerding it up!  From WoW census data to lists of M:TG cards and Legos to tons of baseball stats, we’ve got your nerd-out needs covered.

World of Warcraft Census Data
Which server has the most players?  Which server has the most Level 60 characters?  How many orcs are on Tichondrius?  These and other burning WoW questions can be answered by this running census of the number of characters per server, per level, and per race, playing World of Warcraft.  Census data provided by

Magic: The Gathering Card Lists
With over 12,000 unique cards, Magic: The Gathering is an amazingly rich, complex game that can be played again and again with renewed excitement.  (Author’s Note: Yes, sometimes I stay up ’til the wee hours of the morning playing Magic with friends.  Don’t judge.)  Enjoy this nerdy bounty of lists and spoilers of every M:TG card produced from Alpha through 2009.

Master List of Lego Parts
You’ve spent months building your Lego replica of the cast of Harry Potter and are missing that one piece to finish the detailing in Harry’s lightning bolt scar… what to do?  Does that piece even exist?  Check out this list of part numbers and descriptions of Lego parts to help you attain your OCD dreams.

Pennant Baseball History (coming soon)
Batter up! The practice of keeping stats of player achievements was started in the 19th century by Henry Chadwick.  Today, sports fans and data nerds alike flock to this extraordinarily well-documented game.  Our Pennant Baseball History API isn’t quite up yet, but when it is, you’ll be able to enter an MLB team name and get back its team_id and a list of the years it played between 1960 and 2010. We’ve got more great baseball APIs coming soon as well, which can return team statistics, records, game_ids, and full at-bat data!

We’ve got over 14,700 more where that came from.  Visit our site today and search for the data you want.  Can’t find what you need? Let us know on UserVoice!