Products & Features

How to Build a Hadoop Cluster in 20 Minutes

If you’ve ever tried your hand at manually provisioning, configuring and deploying a Hadoop cluster, you know that it can take days or weeks to create a fully functional system. With tools like Chef, this time can be cut down to a matter of hours or days (depending on the size of the cluster). In this video, Dhruv Bansal, Chief Science Officer of Infochimps, builds a Hadoop cluster in 20 minutes with Ironfan.

Ironfan is the foundation for your Big Data stack, making provisioning and configuring your Big Data infrastructure simple. Spin up clusters when you need them, kill them when you don’t, so you can spend your time, money, and engineering focus on finding insights, not getting your machines ready. To learn more about how Ironfan enables The Infochimps Platform, check out our white paper.

Explore Foursquare with Infochimps

Today, Foursquare announced the launch of a web version of Explore, their tool for discovering interesting places. Leveraging the power of 1.5 billion check-ins, this recommendation engine does not spit out one-size-fits-all answers. Instead, it intelligently compares your own check-in history with those of your friends and others to help you answer questions like…

  • What’s the best sushi restaurant in my town that I haven’t been to before?
  • What food trailer on East 6th Street will offer me the fastest service at 1am when I have had too much to drink and need a delicious mobile food option NOW?
  • Where can I get Golden Monkey beer near the Infochimps HQ during happy hour?


Foursquare Explore is a great illustration of a favorite saying of our CTO, Flip Kromer – “the solution to the too-much-data problem is more data!” With the massive amount of check-in data and comments/tips left by Foursquare users, we can suddenly begin to get reliable answers to everyday questions that are strangely difficult to answer.

Interested in building a tool similar to Foursquare Explore or augmenting an existing places recommendation engine?  You too can unlock the power of Big Data with some of these great Infochimps APIs:

 

Foursquare Venues, Wikipedia Articles, Census Data and More… All With Just an IP Address!


Greetings from deep in the Data Mine here at Infochimps. This week the team rolled out new features that combine one of our most popular APIs with our Geo API platform, unlocking the ability to geolocate based on an IP Address with any of our Geo APIs.

The idea is based on one of our more popular mashups, our MaxMind GeoLite IP to Census API, which blends IP geolocation functionality with Census data. This allows you to find out not just where an IP address maps to, but also some high-level information about that area – ideal for websites that do geotargeting and for people looking for a deeper understanding of their visitor audience. The data it draws on has become a bit dated, though (it uses the 2000 Census), and it covers a relatively narrow band of properties. Enter our Geo API platform, our platform for richer and more current data from a variety of sources.

A great advantage of our new Geo API platform is our ability to perform two-step queries internally, essentially converting a parameter into another parameter behind the scenes. It’s the key technology behind our ability to geolocate using an address: our geocoder first converts the address into latitude/longitude before making a secondary query against our data store to retrieve the response values.

By using the same principle with IP Geolocation instead of address geocoding, we have unlocked the ability for our users to query any of our Geo APIs with an IP Address as the geolocator, returning data as if the request had used a latitude/longitude. So now you can use an updated IP to Census API and also a more detailed drilldown version. Furthermore you can now go from IP to Foursquare Venue, Zillow Neighborhood, Wikipedia Article, and so on.
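The two-step pattern described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Infochimps implementation: the lookup table, the place records, the function names, and the coordinates are all hypothetical stand-ins (the IP address comes from a range reserved for documentation).

```python
import math

# Hypothetical stand-ins for an IP-geolocation table and a places data store.
IP_TO_LATLNG = {"203.0.113.7": (30.27, -97.74)}  # illustrative value only
PLACES = [
    {"name": "Example Bank", "lat": 30.28, "lng": -97.74},
    {"name": "Faraway Cafe", "lat": 32.78, "lng": -96.80},
]


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two lat/lng points."""
    r = 6_371_000  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lng2 - lng1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def places_near_ip(ip_address, radius_m):
    """Step 1: convert the IP into lat/lng. Step 2: run the usual radius search."""
    lat, lng = IP_TO_LATLNG[ip_address]
    return [
        p["name"]
        for p in PLACES
        if haversine_m(lat, lng, p["lat"], p["lng"]) <= radius_m
    ]


nearby = places_near_ip("203.0.113.7", 3000)
```

The point of the design is that the second step never needs to know how the latitude/longitude was obtained, so any geolocator (address, IP, or something else) can sit in front of the same search.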

To use the new IP-Geolocation feature, just pass in the parameter g.ip_address with an IP address, along with a g.radius. Check out this example query, which will help you locate banks and credit unions in our Foursquare database that are within 3 km (just under 2 miles) of the Infochimps office in Austin, TX.

http://api.infochimps.com/geo/location/foursquare/places/search?&f._type=business.bank_or_credit_union&g.ip_address=67.78.118.7&g.radius=3000&apikey=[YOUR API KEY HERE]
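If you would rather assemble that query in code than by hand, here is a minimal Python sketch. The endpoint and parameter names are taken from the example URL above; the helper name `build_ip_query` and the `YOUR_API_KEY` placeholder are just for illustration. The snippet only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

# Endpoint copied from the example query above.
BASE_URL = "http://api.infochimps.com/geo/location/foursquare/places/search"


def build_ip_query(ip_address, radius_m, api_key, place_type=None):
    """Build a Geo API search URL that uses an IP address as the geolocator."""
    params = {}
    if place_type:
        params["f._type"] = place_type  # optional filter on the place type
    params["g.ip_address"] = ip_address
    params["g.radius"] = radius_m
    params["apikey"] = api_key
    return BASE_URL + "?" + urlencode(params)


url = build_ip_query(
    "67.78.118.7", 3000, "YOUR_API_KEY",
    place_type="business.bank_or_credit_union",
)
```

Using `urlencode` rather than string concatenation keeps the query valid if a filter value ever contains spaces or other characters that need escaping.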

For client-side geo application developers we’ve also added another feature along with g.ip_address. With any of these APIs you can now pass “g.get_ip_address=true” instead, and our Geo API will determine the IP address of the machine calling our API and use that IP address as the geolocator. This new flag makes it easy to ask questions of our API like “tell me about venues near me” without ever having to know what your longitude is or how to interpret a quadkey.
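As a rough sketch of how a client might choose between the two geolocator options, the Python helper below returns g.ip_address when an address is known and falls back to g.get_ip_address=true otherwise. The function name is hypothetical, and pairing the flag with g.radius is an assumption carried over from the g.ip_address example.

```python
from urllib.parse import urlencode


def geolocator_params(ip_address=None):
    """Use an explicit IP if given; otherwise ask the API to detect the caller's IP."""
    if ip_address:
        return {"g.ip_address": ip_address}
    return {"g.get_ip_address": "true"}


# "Venues near me": no IP supplied, so the flag is sent instead.
near_me = urlencode({**geolocator_params(), "g.radius": 3000})

# Same search, but anchored to a known IP address.
near_ip = urlencode({**geolocator_params("67.78.118.7"), "g.radius": 3000})
```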

All in the spirit of making Geo data more accessible and easy to use!

A Designer Writes an App Using Our Geo API

Jim England is Infochimps’ new Director of User Experience. He’s well-versed in CSS, HTML and great design, but without a hardcore programming background, he was the perfect candidate to put our Geo API to the test. Did we create a product that was versatile, powerful and so easy to use that a UX guy could create a useful app in just a few days?


The release of the Infochimps Geo API was an excellent opportunity to sharpen up my programming skills by developing a fun sample application. In only a few days, I was able to build the Gibbon Travel Guide, which calls the Foursquare Places and Wikipedia Articles APIs to show interesting places to visit in a city. It defaults to Austin, Texas but be sure to try out other cities!

As I developed the app, the Infochimps documentation was there to steer me in the right direction. The Getting Started with Geo guide described the available APIs, taught me the basic structure of the API calls, and showed how to limit the search results. With this knowledge, I added “f.q=museum” and “f.q=park” filters on the Wikipedia API to limit the results to those categories.

Once my query was constructed, I used the code examples in the Infochimps Ruby library to have my app access the Infochimps API.  If you want to see the source code, check out my git repo.

The project was a fun experience and really showed just how easy it was to create something cool in a very short period of time with our new Geo API.  If you’re building an app on top of the Geo API, feel free to get in touch with me if you have any questions!

The Summarizer: The Infochimps Cure for Geolocation Overload

Last week, we revealed our brand new Infochimps Geo APIs. Not only are our APIs chock full of millions of points of interest and contextual data, but the schema is also dead simple to learn and implement. And since all the new Geo APIs are unified under the same schema, no matter what location data you are looking to access, the API should always work exactly as you’d expect it to.

In the process of developing our new Geo APIs, we developed one very important and useful feature: The Summarizer. It makes presenting venues user-friendly by intelligently clustering locations, and you can take advantage of it automatically. Let’s dive in a little deeper so you can see exactly what we’re talking about.

[Image: plot of all the churches in Texas]

The image above is a plot of all the churches in Texas, taken from Geonames Places. Cyan is Dallas, blue is Austin, green is San Antonio, and purple is Houston. As you can easily see, there are way more points being plotted than it makes sense to present to the user. The common remedy is to present a smaller sample of data. In practice though, that sample ends up being more random than not.
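To make the contrast with random sampling concrete, here is a small Python sketch of one simple way to cluster points: bin them into a latitude/longitude grid and report one centroid and a count per cell. This is a stand-in technique chosen for illustration. The Summarizer’s actual algorithm is not described in this excerpt, and the function name, cell size, and sample coordinates are all assumptions.

```python
from collections import defaultdict


def summarize(points, cell_deg=1.0):
    """Group (lat, lng) points into grid cells.

    Returns one (centroid_lat, centroid_lng, count) tuple per non-empty cell,
    largest clusters first.
    """
    cells = defaultdict(list)
    for lat, lng in points:
        # Floor division assigns each point to the cell that contains it.
        key = (lat // cell_deg, lng // cell_deg)
        cells[key].append((lat, lng))

    summary = []
    for members in cells.values():
        n = len(members)
        centroid_lat = sum(p[0] for p in members) / n
        centroid_lng = sum(p[1] for p in members) / n
        summary.append((centroid_lat, centroid_lng, n))
    return sorted(summary, key=lambda c: -c[2])


# Illustrative coordinates only: two points share a cell, one sits elsewhere.
clusters = summarize([(30.2, -97.7), (30.4, -97.9), (29.7, -95.4)])
```

Unlike a random sample, every input point contributes to exactly one cluster, so dense areas show up as large counts instead of being thinned out arbitrarily.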

(more…)

Back to School Data Pop Quiz

With a record-setting heat wave still raging through the United States, it’s hard to believe that it’s nearly back-to-school season. Many students will be returning to classrooms as early as Monday of next week, and we’ve rounded up some of our most interesting education-related data sets. In the spirit of our fine institutions of learning, we’ve decided to have a little two-question pop quiz!

Be the first person to leave correct answers to the following questions in comments and we’ll send you a sweet Infochimps belt buckle.  Yee-haw!

Enrollment in Public and Private Schools: 1960 to 2005
Between which years did total enrollment in public and private schools remain flat? (Hint: there may be more than one instance of this.)

Most Dangerous Colleges 2010 – The Daily Beast Ranking
Which school ranks #1 in most instances of arson? (Bonus points if you can come up with a plausible explanation why.)

And this one isn’t on the test, but a good data set to check out anyway…


US Colleges and Universities

Want to build a university finder app? It would probably help to have a database of all US colleges and universities, as of 2010 (9,350 total).

We’ve got over 15,100 more where that came from.  Visit our site today and search for the data you want.  Can’t find what you need? Let us know on UserVoice!

Gigantopithecus & Other Huge (Data) Apes

Meet Gigantopithecus.

… or at least a lifelike artist’s rendering of the giant ape, now extinct for some 100,000 years*. Based on the few fossils that have been unearthed, our best guess is that Gigantopithecus stood about 10 feet tall and weighed in at 1200 lbs. Fossils are few and far between, and much about this creature remains unknown due to lack of complete data.

What does this have to do with huge data sets?

Once upon a time, huge data sets were hard to find and even harder to download, clean up and analyze. Mythical beasts, such as historical records of Twitter users & conversations or the mapping of the Human Genome, proved difficult to locate, let alone interact with in the wild.

We receive tons of emails from folks looking for large data sets.  Whether they’re looking to test a new algorithm or storage solution or simply want to try some new tools, requests for 100+ MB data sets are common and we’ve pulled together a list of our biggest and best.

Check out our huge data sets, including one that’s 150 TB!

* Yes, this means that we likely cohabited the Earth with these guys… and still may according to some cryptozoologists. 

New Data: Nerd Out with WoW, M:TG and More

Last week, we showcased data sets and APIs that could help you navigate the joys and perils of monkey love. This week, we’re nerding it up! From WoW census data to lists of M:TG cards and Legos to tons of baseball stats, we’ve got your nerd-out needs covered.

World of Warcraft Census Data
Which server has the most players?  Which server has the most Level 60 characters?  How many orcs are on Tichondrius?  These and other burning WoW questions can be answered by this running census of the number of characters per server, per level, and per race, playing World of Warcraft.  Census data provided by WarcraftRealms.com.

Magic: The Gathering Card Lists
With over 12,000 unique cards, Magic: The Gathering is an amazingly rich, complex game that can be played again and again with renewed excitement. (Author’s Note: Yes, sometimes I stay up til the wee hours of the morning playing Magic with friends. Don’t judge.) Enjoy this nerdy bounty of lists and spoilers of every M:TG card produced from Alpha through 2009.

Master List of Lego Parts
You’ve spent months building your Lego replica of the cast of Harry Potter and are missing that one piece to finish the detailing in Harry’s lightning bolt scar… what to do? Does that piece even exist? Check out this list of part numbers and descriptions of Lego parts to help you attain your OCD dreams.

Pennant Baseball History (coming soon)
Batter up! The practice of keeping stats of player achievements was started in the 19th century by Henry Chadwick. Today, sports fans and data nerds alike flock to this extraordinarily well-documented game. Our Pennant Baseball History API isn’t quite up yet, but when it is, you can enter an MLB team name and in return get its team_id and a list of the years played between 1960 and 2010. We’ve got more great baseball APIs coming soon as well that can return team statistics, records, game_ids and full at-bat data!

We’ve got over 14,700 more where that came from.  Visit our site today and search for the data you want.  Can’t find what you need? Let us know on UserVoice!

New Data: Dinner, Drinks & The Breakup

So, you think you’ve met the perfect chimp companion? It takes more than just a shiny coat to attract the ladies and gentlemen of your species… you’ve got to know how to cook… make tasty drinks. But what if all your best-laid plans fail? Then you’ve got to end it… but how? We’ve got all the APIs and datasets to help you navigate the sticky world of monkey love.

Cook the Perfect Dinner
They say the fastest way to one’s heart is through their stomach.  You, my chimpy friend, can take advantage of this old adage with the Punchfork API, which will let you easily integrate recipes into your website or app by providing direct access to recipe data from all sorts of publishers – from bloggers to top recipe sites.  With their API, you can get top rated recipes to help you prepare that perfect simian supper to woo your potential mating partner.

Make Good Mixed Drinks
Have no clue what to do with triple sec, amaretto and bitters?  Some smart chimps scraped the data from a site with over 9000 different popular drink combinations and ran taste trials with their peers to determine which drinks tasted good together, and which ones were not compatible.  Now you can pretend to be Tom Cruise from that scene in Cocktail (trick bartending skills not included).

Relationship Breakup Mechanisms
Has your date turned out to be a dud despite your best efforts? Well, we’ve got a list of ways to break up with your not-so-significant-anymore other that can serve as a great starting point for ending things.

We’ve got over 14,700 more where that came from.  Visit our site today and search for the data you want.  Can’t find what you need? Let us know on UserVoice!

New Data Sets: Alcohol, Free WiFi & How Long You’ve Got Left to Live

Our chimps have been busily scouring the data jungle, and thanks to users like AggData and vanceinteriors, we got over 1000 delicious new datasets in just the last two weeks! Today, we’ll highlight a few of our favorites and answer some of your most burning questions.

How Long Do I Have Left to Live?
How long have I got, Doc? In an interesting measure from the US Census, this free dataset gives the average number of years an individual in the US has left to live, given their age, sex and race.

Where Can I Get Free WiFi?
Ever find yourself in a new city wondering where you can get free WiFi?  This dataset contains over 63,000 locations throughout the entire US with the latitude/longitude and business name.

What Bar in Austin Sells the Most Mixed Booze?
Curious what the hottest bars in town are (based on mixed drinks sales)?  The Texas Alcoholic Beverage Commission has got your answer!  This free dataset contains the trade name, address and reported tax on mixed drink sales for bars throughout Texas.  

Be the first to answer this question in our comments and we’ll send you a package of sweet Chimpy stuff (stainless steel water bottle, stickers and Startup: The Hackering!): What bar in Austin had the highest reported mixed drink sales in May 2011?

We’ve got over 14,700 more where that came from.  Visit our site today and search for the data you want.  Can’t find what you need?  Let us know on UserVoice!