Data Mine

Splice Data Scientist DNA Into Your Existing Team

As organizations continue to grapple with Big Data demands, they may find that business managers who understand data meet their “data scientist” needs better than hard-core data technologists.

There’s little doubt that data-derived insight will be a key differentiator in business success, and even less doubt that those who produce such insight are going to be in very high demand. Harvard Business Review called “data scientist” the “sexiest” job of the 21st century, and McKinsey predicts a shortfall of about 140,000 by 2018. Yet most companies are still clueless as to how they’re going to meet this shortfall.

Unfortunately, the job description for a data scientist has become quite lofty. Unless your company is Google-level cool, you’re going to struggle to hire your Big Data dream team (well, at least right now), and few firms out there could recruit them for you. Ultimately, most organizations will need to enlist the support of existing staff to achieve their data-driven goals, and train them to become data scientists. To accomplish this, you must determine the basic elements of data scientist “DNA” and strategically splice it into the right people.


Serial entrepreneur Jim Kaskade, CEO of Infochimps, the company that is bringing Big Data to the cloud, has been leading startups from their founding to acquisition for more than ten years of his 25 years in technology. Prior to Infochimps, Jim was an Entrepreneur-in-Residence at PARC, a Xerox company, where he established PARC’s Big Data program, and helped build its Private Cloud platform. Jim also served as the SVP, General Manager and Chief of Cloud at SIOS Technology, where he led global cloud strategy. Jim started his analytics and data-warehousing career working at Teradata for 10 years, where he initiated the company’s in-database analytics and data mining programs.







Big Data and Banking – More than Hadoop


Fraud is definitely top of mind for all banks. Steve Rosenbush at the Wall Street Journal recently wrote about Visa’s new Big Data analytic engine which has changed the way the company combats fraud. Visa estimates that its new Big Data fraud platform has identified $2 billion in potential annual incremental fraud savings. With Big Data, their new analytic engine can study as many as 500 aspects of a transaction at once. That’s a sharp improvement from the company’s previous analytic engine, which could study only 40 aspects at once. And instead of using just one analytic model, Visa now operates 16 models, covering different segments of its market, such as geographic regions.

Do you think Visa, or any bank for that matter, uses just batch analytics to provide fraud detection? Hadoop can play a significant role in building models. However, only a real-time solution will allow you to take those models and apply them in a timeframe that can make an impact.
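The split between batch model building and real-time scoring can be sketched in a few lines. This is only an illustration, not Visa's method or any bank's: the per-account z-score rule, the `build_model` and `score` names, and the sample transactions are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_model(history):
    """Batch step: summarize each account's historical spending."""
    amounts = defaultdict(list)
    for account, amount in history:
        amounts[account].append(amount)
    return {acct: (mean(vals), pstdev(vals)) for acct, vals in amounts.items()}


def score(model, account, amount, threshold=3.0):
    """Real-time step: flag a transaction far outside the account's usual range."""
    if account not in model:
        return True  # unknown account: flag for review
    mu, sigma = model[account]
    if sigma == 0:
        return amount != mu
    return abs(amount - mu) / sigma > threshold


# Hypothetical history, processed offline (the kind of job Hadoop is suited to).
history = [("A", 20.0), ("A", 25.0), ("A", 22.0), ("A", 23.0),
           ("B", 500.0), ("B", 520.0)]
model = build_model(history)

# Hypothetical live transactions, scored one at a time as they arrive.
incoming = [("A", 24.0), ("A", 900.0), ("B", 510.0)]
flags = [score(model, acct, amt) for acct, amt in incoming]
print(flags)
```

The model is rebuilt occasionally and in bulk, while `score` does constant work per transaction, which is what makes it usable inside the authorization window.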

The banking industry is based on data – the products and services in banking have no physical presence – and as a consequence, banks have to contend with ever-increasing volumes (and velocity, and variety) of data. Beyond the basic transactional data concerning debits/credits and payments, banks now:

  • Gather data from many external sources (including news) to gain insight into their risk position;
  • Chart their brand’s reputation in social media and other online forums.

This data is both structured and unstructured, as well as very time-critical. And, of course, in all cases financial data is highly sensitive and often subject to extensive regulation. By applying advanced analytics, the bank can turn this volume, velocity, and variety of data into actionable, real-time and secure intelligence with applications including:

  • Customer experience
  • Risk Management
  • Operations Optimization

It’s important to note that applying new technologies like Hadoop is only a start (it addresses 20% of the solution). Turning your insights into real-time actions will require additional Big Data technologies that help you “operationalize” the output of your batch analytics.

Customer Experience

Banks are trying to become more focused on the specific needs of their customers and less on the products that they offer. They need to:

  • Engage customers in interactive/personalized conversations (real-time)
  • Provide a consistent, cross-channel experience including real-time touch points like web and mobile
  • Act at critical moments in the customer sales cycle (in the moment)
  • Market and sell based on customer real-time activities

Noticing a general theme here? Big Data can assist banks with this transformation and reduce the cost of customer acquisition, increase retention, increase customer acceptance of marketing offers, increase sales through targeted marketing activities, and increase brand loyalty and trust. Big Data presents a phenomenal opportunity. However, the definition of Big Data HAS to be broader than Hadoop.

Big Data promises the following technology solutions to help with this transformation:

  • Single View of Customer (all detailed data in one location)
  • Targeted Marketing with micro-segmentation (sophisticated analytics on ALL of the data)
  • Multichannel Customer Experience (operationalizing back out to all the customer touch points)

Risk Management

Risk management is also critically important to the bank. Risk management needs to be pervasive within the organizational culture and operating model of the bank in order to make risk-aware business decisions, allocate capital appropriately, and reduce the cost of compliance. Ultimately, this means making data analytics as accessible as it is at Yahoo! If the bank could provide a “data playground” where all data sources were readily available with tools that were easy to use…well, let’s just say that new risk management products would be popping up left and right.

Big Data promises a way of providing the organization integrated risk management solutions, covering:

 

  • Financial Risk (Risk Architecture, Data Architecture, Risk Analytics, Performance & reporting)
  • Operational Risk & Compliance
  • Financial Crimes (AML, Fraud, Case Management)
  • IT Risk (Security, Business Continuity and Resilience)

The key is to focus on one use-case first, and expand from there. But no matter which risk use-case you attack first, you will need batch, ad hoc, and real-time analytics.

Operations Optimization

Large banks often become unwieldy organizations through many acquisitions. Increasing flexibility and streamlining operations is therefore even more important in today’s more competitive banking industry. A bank that is able to increase its flexibility and streamline operations by transforming its core functions will be able to drive higher growth and profits; develop more modular back-office systems; and respond quickly to changing business needs in a highly flexible environment.

This means that banks need new core infrastructure solutions. Examples might involve reducing loan origination times by standardizing its loan processes across all entities using Big Data. Streamlining and automating these business processes will result in higher loan profitability, while complying with new government mandates.

Operational leverage improves when banks can deliver global, regional and local transaction and payment services efficiently and also when they use transaction insights to deliver the right services at the right price to the right clients.

Many banks are seeking to innovate in the areas of processing, data management and supply chain optimization. For example, in the past, when new payment business needs would arise, the bank would often build a payments solution from scratch to address it, leading to a fragmented and complex payments infrastructure. With Big Data technologies, the bank can develop an enterprise payments hub solution that gives a better understanding of product and payments platform utilization and improved efficiency.

Are you a bank and interested in new Big Data technologies like Hadoop, NoSQL datastores, and real-time stream processing? Interested in one integrated platform of all three?

Jim Kaskade serves as CEO of Austin-based Infochimps, the leading Big Data Platform-as-a-Service provider. Jim is a visionary leader within both large as well as small company environments with over 25 years of experience building hi-tech businesses, leading startups in cloud computing enterprise software, software as a service (SaaS), online and mobile digital media, online and mobile advertising, and semiconductors from their founding to acquisition.







Customized, Intelligent, Vertical Applications – The Future of Big Data?


The Ideal Big Data Application Development Environment

Let’s assume that your entire organization had access to the following building blocks:

  • Data: All sources of data from the enterprise (at rest and in motion)
  • Analytics: Any/All Queries, Algorithms, Machine Learning Models
  • Application Business Logic: Domain specific use-cases / business problems
  • Actionable Insights: Knowledge of how to apply analytics against data through the use of application business logic to produce a positive impact to the business
  • Infrastructure Configuration: Highly scalable, distributed, enterprise-class infrastructure capable of combining data and analytics with app logic to produce actionable insights

Imagine if your entire organization was empowered to produce data-driven applications tailored specifically for your vertical use-cases?

Data-Driven Vertical Apps

Banking

You are a regional bank who is under heavier regulation, focused on risk management, and expanding your mobile offerings. You are seeking ways to get ahead of your competition through the use of Big Data by optimizing financial decisions and yields.

What if there was an easy and automated way to define new data sources, create new algorithms, apply these to gain better insight into your risk position, and ultimately operationalize all this by improving your ability to reject and accept loans?

Retail

You are a retailer who is being affected by the economic downturn, demographic shifts, and new competition from online sources. You are seeking ways of leveraging the fact that your customers are empowered by mobile and social by transforming the shopping experience through the use of Big Data.

What if there was an easy and automated way to capture all customer touch points, create new segmentation and customer experience analytics, apply these to create a customized cross-channel solution which integrates online shopping with social media, personalized promotions, and relevant content?

Telecommunications

You are a fixed line operator, wireless network provider, or fixed broadband provider who is in the middle of convergence of both services and networks, and feeling price pressures of existing services. You are seeking ways to leverage cloud and Big Data to create smarter networks (autonomous and self-analyzing), smarter operations (improving working efficiency and capacity of day-to-day operations), and ways to leverage subscriber demographic data to create new data products and services to partners.

What if there was an easy and automated way to start by consuming additional data across the organization, deploy segmentation analytics to better target customers and increase ARPU?

It Starts With The “Infrastructure Recipe”

OK. You are a member of the application development team. All you have to do is create a data-driven application “deploy package.” It’s your recipe of all the data sources, analytics, and application logic needed to insert into this magical cloud service that produces your industry and use-case specific application. You don’t need to be an analytics expert. You don’t need to be a DBA, an ETL expert or even a Big Data technologist. All you need is a clear understanding of your business problem, and you can assemble the parts through a simple-to-use “recipe” which is abstracted from the details of the infrastructure used to execute on that recipe.

Any Data Source

Imagine an environment where your enterprise data is at your fingertips – no heavy ETL tools, no database exports, no Hadoop Flume or Sqoop jobs. Access to data is as simple as defining “nouns” in a sentence. Where your data lives is not a worry. You are equipped with the magic ability to simply define what the data source is and where it lives, and accessing it is automated. You also need not care whether the data is some large historic volume living in a relational database or real-time streaming event data.

Analytics Made Easy

Imagine a world where you can pick from literally thousands of algorithms and apply them to any of the above data sources in part or in combination. You create one algorithm and can apply it to years of historic data and/or a stream of live real-time data. Also, imagine a world where configuring your data in a format that your algorithms can consume is made seamless. Lastly, your algorithms execute on infrastructure in a parallel, distributed, highly scalable way. Getting excited yet?
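To make “one algorithm, historic or live data” concrete, here is a minimal sketch in which the algorithm is written once against a plain iterable, so the same function accepts a stored list or a generator that yields events as they arrive. The `running_average` function and both data sources are hypothetical stand-ins, not part of any particular platform.

```python
from typing import Iterable, Iterator


def running_average(values: Iterable[float]) -> Iterator[float]:
    """Yield the average of everything seen so far, one result per input."""
    total = 0.0
    for count, value in enumerate(values, start=1):
        total += value
        yield total / count


# Batch: years of history already sitting in storage (here, a small list).
historic = [10.0, 20.0, 30.0]
batch_result = list(running_average(historic))[-1]


# Stream: readings that arrive one at a time (here, a generator).
def live_feed():
    for reading in (4.0, 8.0):
        yield reading


stream_results = list(running_average(live_feed()))
print(batch_result, stream_results)
```

Because the function never asks how many values exist or where they live, the choice between batch and streaming execution stays outside the algorithm itself.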

Focus on Applications With Actionable Insights


Now let’s embody this combination of analytics and data in a way that can actually be consumed and acted upon. Imagine a world where you can produce your insights and report on them with your BI tool of choice. That’s kind of exciting.

But what’s even more exciting is the ability to deploy your insights operationally through an application that leverages your domain expertise and understanding of the business logic associated with the targeted use-case you are solving. Translation – you can code up a Java, Python, PHP, or Ruby application that is light, simple, and easy to build/maintain. Why? Because the underlying logic normally embedded in ETL tools, separate analytics software tools, MapReduce code, NoSQL queries and stream processing logic is pushed up into the hands of application developers. Drooling yet?  Wait, it gets better.

Big Data, Cloud and The Enterprise


Let’s take this entire application paradigm and automate it within an elastic cloud service purpose-built for the organization. You have the ability to submit your application “deploy packages” to be instantly processed without having to understand the compute infrastructure and, better yet, without having to understand the underlying data analytic services required to process your various data sources in real-time, near real-time or in batch modes.

Ok…if we had such an environment, we’d all be producing a ton of next-generation applications…data-driven, highly intelligent and specific to our industry and use-cases.

I’m ready…are you?

Jim Kaskade serves as CEO of Austin-based Infochimps, the leading Big Data Platform-as-a-Service provider. Jim is a visionary leader within both large as well as small company environments with over 25 years of experience building hi-tech businesses, leading startups in cloud computing enterprise software, software as a service (SaaS), online and mobile digital media, online and mobile advertising, and semiconductors from their founding to acquisition.







Fact or Fiction: Big Data and Marketing Myth Busters

  • Amanda McGuckin Hager


As the uses of Big Data continue to evolve with the creation of platforms and dashboards that promise in-the-moment marketing feedback, much skepticism arises as to whether they can deliver on their promise: real-time decision-making analytics that are actionable and accessible without a team of data scientists.

Though advancements are being made every day and with an infinite future of refinement to come, there naturally exists some uncertainty around Big Data, and what it can actually offer marketers. Here are a few of the most common misconceptions fueling such apprehension:

Myth: Campaigns take weeks, if not months to execute.

Truth: Big Data makes real-time campaigns a reality.

As stated simply by GigaOm’s Ravi Mhatre, “Big Data is useless unless it’s also fast”. At a time when social and mobile walk hand-in-hand, marketing departments must be agile, and capable of acting at the drop of a hat (or tweet). The fact is that Big Data has entered an era where “real-time” is possible, and business dashboards power crucial decisions in-the-moment.

Myth: Marketers must still rely on “gut” decisions, which may or may not be reliable.

Truth: Big Data powers success through simple data-driven decisions.

Another common theme among skeptics is the notion that the only people equipped to understand Big Data insights are the data scientists that are siloed in departments and organizations far away from management and marketing. This is simply not the case.

Widely available tools allow marketers and other business experts to derive data-driven insight without having the technical expertise of a data scientist. Marketers can now perform sophisticated analytics to deliver truly actionable information about efficiencies (or lack thereof) within a business, as well as tangible insights about customers.

Myth: Much data is useless.

Truth: All data is powerful; Big Data makes it possible for a business to find unexpected stories and insights.

With traditional techniques, data storage is expensive, and therefore finite. Consequently, companies have had to pick and choose which data is important enough to keep, and have thrown away data which actually could have yielded valuable insight. With new Big Data technologies drastically reducing the storage and management price tag, companies now have the freedom to save and analyze everything – those who do will quickly begin to uncover gems.

As buzz builds around potential enterprise Big Data use-cases, so does hesitation and concern that this is just a fleeting trend; but this couldn’t be farther from the truth.

While we’ve finally reached the point where it is feasible for businesses to tap and start to understand the data that streams from their market, operations, and customers, there’s still much room for refinement. Although one can justifiably state that we’ve entered the era of Big Data, where campaigns can be executed quickly and insights can be pulled in real-time, we’ve just hit the tip of the iceberg in terms of the untapped potential of these technologies.

Amanda McGuckin Hager is a high-tech marketing professional with over 17 years of experience focused on driving demand through strategic marketing programs. She is the Director of Marketing at Infochimps. Follow Amanda on Twitter.

Image Source: brightonwoman.blogspot.com







The 3 Waypoints of a Data Exploration

Part of our goal is to unlock the big data stack for exploratory analytics.

How do you know when you’ve found the right questions? That you’ve gone deep enough to trust the answers? Here’s one sign.

The 3 Waypoints of a Data Exploration:

  • What you knew: is it validated by the data?
  • What you suspect: how do your hypotheses agree with reality?
  • What you would never have suspected: something unpredictable in advance.

In Practice:
A while back, a friend asked me about signals in the Twitter stream for things like “Spanglish” (multiple languages mixed in the same message). I did a simple exploration of tweets from around the world (simplifying at first to non-English languages) to see how easy such messages are to find.

I took 100 million tweets and looked for only those “non-keyboard” characters — é (e with acute accent) or 猿 (Kanji character meaning ‘ape’) or even ☃ (snowman).

Using all the cases where there were two non-keyboard characters in the same message, I assembled the following graph.
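The counting step behind that graph can be sketched as follows. The original code is not shown in this post, so this is an assumption about its shape: “non-keyboard” is approximated here as “any non-ASCII, non-whitespace character”, and the three sample messages stand in for the 100 million tweets.

```python
from collections import Counter
from itertools import combinations


def non_keyboard_chars(message):
    """Return the distinct non-ASCII, non-whitespace characters in a message."""
    return sorted({ch for ch in message if ord(ch) > 127 and not ch.isspace()})


def cooccurrence(messages):
    """Count how often each pair of non-keyboard characters shares a message."""
    pairs = Counter()
    for message in messages:
        chars = non_keyboard_chars(message)
        # Sorting first means each pair is always stored in the same order.
        pairs.update(combinations(chars, 2))
    return pairs


sample = ["café ☃", "情報 猿", "猿 ☃ 情"]  # hypothetical messages
edges = cooccurrence(sample)
for (a, b), weight in edges.most_common(3):
    print(a, b, weight)
```

Each key in `edges` is one edge of the graph and its count is the edge weight, which is the only input a layout tool needs.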

Imagine tying a little rubber band between every pair of characters, as strong as the number of times they were seen hanging out together; also, give every character the desire for a bit of personal space so they don’t just pile on top of each other. It’s a super-simple model that tools like Cytoscape or Gephi will do out-of-the-box.
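For readers who want to see the rubber-band model itself rather than rely on a tool, here is a deliberately tiny force-directed layout. It is a sketch of the general idea, not the algorithm Cytoscape or Gephi actually run; the constants, the circular starting positions, and the three-node example are all assumptions.

```python
import math


def layout(nodes, edges, steps=200, step_size=0.05, spring=0.1, repulsion=1.0):
    """Return {node: [x, y]} after a fixed number of spring/repulsion updates."""
    # Start on a circle so the result is deterministic.
    pos = {
        n: [math.cos(2 * math.pi * i / len(nodes)),
            math.sin(2 * math.pi * i / len(nodes))]
        for i, n in enumerate(nodes)
    }
    for _ in range(steps):
        force = {n: [0.0, 0.0] for n in nodes}
        # Personal space: every pair pushes apart, more strongly when close.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                push = repulsion / dist ** 2
                force[a][0] += push * dx / dist
                force[a][1] += push * dy / dist
                force[b][0] -= push * dx / dist
                force[b][1] -= push * dy / dist
        # Rubber bands: each edge pulls its endpoints together, scaled by weight.
        for (a, b), weight in edges.items():
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            force[a][0] += spring * weight * dx
            force[a][1] += spring * weight * dy
            force[b][0] -= spring * weight * dx
            force[b][1] -= spring * weight * dy
        for n in nodes:
            pos[n][0] += step_size * force[n][0]
            pos[n][1] += step_size * force[n][1]
    return pos


def distance(pos, a, b):
    return math.hypot(pos[a][0] - pos[b][0], pos[a][1] - pos[b][1])


# Hypothetical graph: "a" and "b" co-occur often, "c" never appears with either.
positions = layout(["a", "b", "c"], {("a", "b"): 5})
print(distance(positions, "a", "b"), distance(positions, "a", "c"))
```

Characters joined by a heavy edge settle close together while unconnected ones drift apart, which is the effect that produces the islands described next.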

That gave this picture (I left out the edges for clarity and hand-arranged the clusters at the bottom):

This “map” of the world — the composition of each island, and the arrangement of the large central archipelago — popped out of this super-simplistic model. It had no information about human languages other than “sometimes, when a person says 情報 they also say 猿.” Any time the data is this dense and connected, I’ve found it speaks for itself.

Now let’s look at the 3 Waypoints.

What We Knew: What I really mean by “knew”  is “if this isn’t the case, I’m going to suspect my methods much more strongly than the results”:

  • Most messages are in a single language, but there are some crossovers. After the fact, I colored each character by its “script” type from the Unicode standard (e.g., Hangul is in cyan). As you can see, most of the clouds have a single color.
  • Languages with large alphabets have tighter-bound clouds, because there are more “pairs” to find (e.g., the Hiragana character cloud is denser than the Arabic cloud).
  • Languages with smaller representation don’t show up as strongly (e.g., there are not as many Malayalam tweeters as Russian (Cyrillic) tweeters).
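Tagging each character with a script can be approximated with Python's standard library. The `unicodedata` module does not expose the Unicode Script property directly, so this sketch assumes that the first word of the character's official name is a good enough proxy; `rough_script` is a hypothetical helper, and the approximation breaks down for symbols, whose names do not begin with a script.

```python
import unicodedata


def rough_script(ch):
    """Approximate a character's script by the first word of its Unicode name."""
    name = unicodedata.name(ch, "UNKNOWN")
    return name.split(" ", 1)[0].split("-", 1)[0]


for ch in ["\u00e9", "\u0434", "\ud55c", "\u30c4", "\u733f"]:
    print(ch, unicodedata.name(ch, "UNKNOWN"), "->", rough_script(ch))
```

A stricter version would read the Script property from the Unicode Character Database files instead of parsing names.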

What We Suspected:

First, about the clusters themselves:

  • Characters from Latin scripts (the accented versions of the characters English speakers are familiar with) do indeed cluster together, and group within that cluster. Many languages use ö, but only subsets of them use Å or ß. You can see rough groups for Scandinavian, Romance and Eastern-European scripts.
  • Japanese and Chinese are mashed together, because both use characters from the Han script.

Second, about the binds between languages. Clusters will arrange themselves in the large based on how many co-usages were found. A separated character dragged out in the open is especially interesting — somehow no single language “owns” that character.

Things we suspected about the connections:

  • Nearby countries will show more “mashups”.  Indeed, Greek and Cyrillic are tightly bound to each other, and loosely bound to European scripts; Korean has strong ties to European and Japanese/Chinese scripts. This initial assumption was partially incorrect though — Thai appears to have stronger ties to European than to Japanese/Chinese scripts.
  • Punctuation, Math and Music are universal. Look closely and you’ll see the fringe of brownish characters pulled out into “international waters”.

What We Never Suspected in Advance: There were two standouts that slapped me in the face when taking a closer look.

The first is the island in the lower right, off the coast of Europe. It’s a bizarre menagerie of Amharic, International Phonetic Alphabet and other scripts. What’s going on? These are characters that taken together look like upside-down English text: “¡pnolɔ ǝɥʇ uı ɐʇɐp ƃıq“. (Try it out yourself: http://www.revfad.com/flip.html) My friend Steve Watt’s reaction was, “so you’re saying that within the complexity of the designed-for-robots Unicode standard, people found some novel, human, way to communicate? Enterprises and Three Letter Agencies dedicate tons of resources for such findings”.

As soon as you’ve found a new question within your answers you’ve reached Waypoint 3 — a good sign for confidence in your results.

However, my favorite is the one single blue (Katakana) character that every language binds to (see close-up below). Why is Unicode code point U+30C4, the Katakana “Tsu” character, so fascinating?


Because ツ looks like a smiley face.
The common bond across all of humanity is a smile.




S3Chimp: Information Science in Action

I’m Selene, Infochimps’ new Analyst. Prior to my new position, I was an Infochimps intern. I recently graduated from the School of Information at the University of Texas with a Master of Science in Information Studies. As part of my MSIS degree plan, I completed a semester-long project entitled “Developing and Integrating a Lightweight Metadata System into a Data Ingestion Workflow” here at Infochimps, Inc.

The main ingredients of the project were Ruby on Rails, MongoDB, and everyone’s favorite, Amazon Web Services. The result is an alpha stage of the tentatively named S3Chimp. It is an addition to Dashpot, our Analytics & Operations Dashboard for the Infochimps Platform. Dashpot boasts an easy-to-use analytics and operations dashboard that provides business metrics and visualization, cluster management capabilities, and system monitoring on top of the Infochimps Platform. Integrating a lightweight metadata system into the workflow makes it possible for Dashpot to also track and organize distributed massive-scale data assets. What was once time-consuming (according to us as well as various people in the industry), can now be a dynamic part of an organization’s internal analytics.

Before I could begin making S3Chimp, organizing the Infochimps Amazon S3 Buckets was key. Perhaps a company that boasts about its command of data should have a beautifully organized set of buckets? Perhaps…. But let’s pretend that is not the case. And let us imagine that a young and excited Information Studies graduate student decides to tackle the S3 clutter. The essential steps in such a scenario include designing a well-thought-out schema guideline tailored to the company’s needs and data types, and incessantly enforcing those guidelines.

Next on the list was learning Ruby on Rails, over several weeks. It was a baptism by fire. I learned the very basics of Ruby on Rails and how to love the MVC trinity. Ruby on Rails is a smart and fun web app framework and it was an enjoyable experience, relative to PHP. Relative to a Saturday afternoon at Barton Springs? Not so much.

With a snazzy script written in the enchanted Infochimps Data Mine, I was able to take the most exciting leap: taking metadata from the now beautifully organized S3 buckets and injecting it into MongoDB, a NoSQL database. The result is the S3Chimp genesis. S3Chimp is a system that tells you what data, and how much of it, is in AWS, all from your analytics dashboard. Future plans for this product include making a tool to capture provenance metadata, and other goodies.
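The script itself is not reproduced in this post, but the core transformation is easy to sketch: roll an object listing up into one summary document per bucket and top-level prefix, ready to be written to a document store. Everything here is an assumption made for illustration: the record fields, the bucket and key names, and the `summarize` helper. The sketch deliberately stops short of calling AWS or MongoDB.

```python
from collections import defaultdict


def summarize(objects):
    """Group object metadata by bucket and top-level prefix."""
    totals = defaultdict(lambda: {"objects": 0, "bytes": 0})
    for obj in objects:
        prefix = obj["key"].split("/", 1)[0]
        entry = totals[(obj["bucket"], prefix)]
        entry["objects"] += 1
        entry["bytes"] += obj["size"]
    # One dictionary per group, shaped like a document for a NoSQL store.
    return [
        {"bucket": bucket, "prefix": prefix, **counts}
        for (bucket, prefix), counts in sorted(totals.items())
    ]


# Hypothetical listing; in practice this would come from the S3 API.
listing = [
    {"bucket": "example-data", "key": "weather/2011/stations.tsv", "size": 1200},
    {"bucket": "example-data", "key": "weather/2012/stations.tsv", "size": 800},
    {"bucket": "example-data", "key": "geo/places.json", "size": 500},
]
documents = summarize(listing)
for doc in documents:
    print(doc)
```

This only works cleanly once keys follow a consistent naming scheme, which is why organizing the buckets had to come first.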

You can find me at the upcoming MongoDB NYC conference, if you’d like to ask me about our awesome new Ironfan Platform, Dashpot, or my Capstone project.

I’d like to thank my Field Supervisor, Flip Kromer as well as my Faculty Adviser, Dr. Melanie Feinberg.

Keep an eye out for my next blog post, where I will be chronicling my personal Ruby on Rails adventure that is near and dear to my librarian heart. Travis Dempsey and I will make an in-house database of our office library’s catalog. The Bukfin Repository’s catalog is currently housed in LibraryThing.

Why Real-Time Analytics? [Free White Paper]


When you think Big Data, the first words that come to mind are often Hadoop and NoSQL, but what do these technologies actually mean for your business? Different Big Data technologies have different use cases where they work best. For your real-time Big Data challenges, a very different class of tools must often be implemented.

In this free white paper, we’ll explore:

  • How to create a flexible architecture that allows you to use the best Big Data tools and technologies for the job at hand
  • Where Hadoop analysis and NoSQL databases work and where they can fall short
  • How Hadoop differs from real-time analytics and stream processing approaches
  • Visual representations of how real-time analytics works and real world use cases
  • How to leverage the Infochimps Platform to perform real-time analytics

How to Build a Hadoop Cluster in 20 Minutes

If you’ve ever tried your hand at manually provisioning, configuring and deploying a Hadoop cluster, you know that it can take days or weeks to create a fully functional system. With tools like Chef, this time can be cut down to a matter of hours or days (depending on the size of the cluster). In this video, Dhruv Bansal, Chief Science Officer of Infochimps, builds a Hadoop cluster in 20 minutes with Ironfan.

Ironfan is the foundation for your Big Data stack, making provisioning and configuring your Big Data infrastructure simple. Spin up clusters when you need them, kill them when you don’t, so you can spend your time, money, and engineering focus on finding insights, not getting your machines ready. To learn more about how Ironfan enables The Infochimps Platform, check out our white paper.

Foursquare Venues, Wikipedia Articles, Census Data and More… All With Just an IP Address!


Greetings from deep in the Data Mine here at Infochimps. This week the team rolled out new features that combine one of our most popular APIs with our Geo API platform, unlocking the ability to geolocate based on an IP Address with any of our Geo APIs.

The idea is based on one of our more popular mashups, our MaxMind GeoLite IP to Census API  which blends IP geolocation functionality with Census data. This allows you to find out not just where an IP address maps to, but also some high level information about that area – ideal for websites that do geotargeting and for people looking for a deeper understanding about their visitor audience. The data it draws on has become a bit dated though (it uses the 2000 Census), and the data covers a relatively narrow band of properties. Enter our Geo API platform, our platform for richer and more current data from a variety of sources.

A great advantage of our new Geo API platform is our ability to perform two-step queries internally, essentially converting a parameter into another parameter behind the scenes. It’s the key technology behind our ability to geolocate using an address: our geocoder first converts the address into latitude/longitude before making a secondary query against our data store to retrieve the response values.

By using the same principle with IP Geolocation instead of address geocoding, we have unlocked the ability for our users to query any of our Geo APIs with an IP Address as the geolocator, returning data as if the request had used a latitude/longitude. So now you can use an updated IP to Census API and also a more detailed drilldown version. Furthermore you can now go from IP to Foursquare Venue, Zillow Neighborhood, Wikipedia Article, and so on.
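The two-step idea can be sketched without any of the real service behind it. In this illustration the IP lookup table, the place records, the `lat`/`lon` fallback keys, and the `search` function are all hypothetical; only the `g.ip_address` and `g.radius` parameter names come from this post.

```python
import math

# Hypothetical IP-to-coordinate table (the address is from a documentation range).
IP_TABLE = {"203.0.113.7": (30.27, -97.74)}

# Hypothetical data store.
PLACES = [
    {"name": "Example Credit Union", "lat": 30.268, "lon": -97.743},
    {"name": "Faraway Bank", "lat": 32.78, "lon": -96.80},
]


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters (haversine formula)."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def search(params):
    # Step 1: convert whatever geolocator was supplied into latitude/longitude.
    if "g.ip_address" in params:
        lat, lon = IP_TABLE[params["g.ip_address"]]
    else:
        lat, lon = params["lat"], params["lon"]
    # Step 2: run the ordinary radius query against the data store.
    radius = params["g.radius"]
    return [p["name"] for p in PLACES
            if distance_m(lat, lon, p["lat"], p["lon"]) <= radius]


results = search({"g.ip_address": "203.0.113.7", "g.radius": 3000})
print(results)
```

Because step 2 never knows how the coordinates were obtained, adding a new kind of geolocator only touches step 1.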

To use the new IP-Geolocation feature, just pass in the parameter g.ip_address with an IP address, along with a g.radius. Check out this example query, which will help you locate banks and credit unions in our Foursquare database that are within 3 km (about 1.9 miles) of the Infochimps office in Austin, TX.

http://api.infochimps.com/geo/location/foursquare/places/search?&f._type=business.bank_or_credit_union&g.ip_address=67.78.118.7&g.radius=3000&apikey=[YOUR API KEY HERE]

For client-side geo application developers we’ve also added another feature along with g.ip_address. With any of these APIs you can now pass “g.get_ip_address=true” instead, and our Geo API will determine the IP address of the machine calling our API and use that IP address as the geolocator. This new flag makes it easy to ask questions of our API like “tell me about venues near me” without ever having to know what your longitude is or how to interpret a quadkey.
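If you are assembling these queries in code, a small helper keeps the two variants straight. This sketch only builds the URL string from the parameters shown above and makes no request; the `build_query` function name and the placeholder key are assumptions, and combining g.get_ip_address with the other parameters in exactly this way is inferred from the description rather than confirmed.

```python
from urllib.parse import urlencode

BASE = "http://api.infochimps.com/geo/location/foursquare/places/search"


def build_query(api_key, ip_address=None, radius=3000):
    """Build a search URL, geolocating by a given IP or by the caller's own IP."""
    params = {
        "f._type": "business.bank_or_credit_union",
        "g.radius": radius,
        "apikey": api_key,
    }
    if ip_address is None:
        # Let the service use the address of the machine making the call.
        params["g.get_ip_address"] = "true"
    else:
        params["g.ip_address"] = ip_address
    return BASE + "?" + urlencode(params)


explicit_url = build_query("YOUR_API_KEY", ip_address="67.78.118.7")
caller_url = build_query("YOUR_API_KEY")
print(explicit_url)
print(caller_url)
```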

All in the spirit of making Geo data more accessible and easy to use!

Where Does The Weather (Data) Come From? Visualizations of Worldwide Weather Stations

This post was written by Hohyon Ryu, who interned with us this past summer as a Catalog Engineer.  He’s currently pursuing a PhD at the University of Texas’ School of Information.

The idea for this project started from one of the simplest and most essential questions in computer science. How close is the nearest X from where I stand?  To explore how to answer this question, I used our NCDC Weather Station API and attempted to answer, “What is the closest weather station from where I stand”?


A brute-force algorithm that calculates the distance to every weather station has to go through 2.5 million stations. It works, but it takes a long time.
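A minimal sketch of the brute-force approach looks like this. The haversine formula is a standard way to get great-circle distance; the station coordinates below are rough values chosen for illustration, not records from the NCDC data.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def nearest_station(lat, lon, stations):
    """Scan every station and keep the closest: O(n) work per query."""
    return min(stations, key=lambda s: haversine_km(lat, lon, s[1], s[2]))


# Approximate coordinates, for illustration only.
stations = [("Baton Rouge", 30.45, -91.15), ("Slidell", 30.28, -89.78)]
nearest = nearest_station(29.95, -90.07, stations)  # roughly New Orleans
print(nearest[0])
```

The answer is always correct, but every query touches every station, which is the cost the following approaches try to avoid.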


One better solution is to divide the earth into a grid. We can divide the globe into small tiles and find the closest station in the tile I’m standing in. This solution is very fast, but there’s another problem. In the map below, let’s say I’m standing in New Orleans. There are 2 stations: one in Baton Rouge and one in Slidell. The closest station is the one in Slidell, but it is in a different tile, so this algorithm would pick the one in Baton Rouge as the closest point.
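The tile-boundary problem is easy to reproduce. This sketch uses made-up planar coordinates and a made-up tile size rather than real latitudes and longitudes: “west” plays the role of the station that shares my tile, and “east” the closer station just across the boundary.

```python
import math
from collections import defaultdict

TILE = 10.0  # tile edge length, in arbitrary planar units


def tile_of(x, y):
    return (math.floor(x / TILE), math.floor(y / TILE))


def build_grid(stations):
    """Index stations by the tile that contains them."""
    grid = defaultdict(list)
    for name, x, y in stations:
        grid[tile_of(x, y)].append((name, x, y))
    return grid


def nearest_in_tile(x, y, grid):
    """Fast lookup that only considers stations in the query point's own tile."""
    candidates = grid.get(tile_of(x, y), [])
    if not candidates:
        return None
    return min(candidates, key=lambda s: math.hypot(s[1] - x, s[2] - y))[0]


stations = [("west", 1.0, 5.0), ("east", 11.0, 5.0)]
grid = build_grid(stations)

me = (9.0, 5.0)
grid_answer = nearest_in_tile(me[0], me[1], grid)
true_answer = min(stations, key=lambda s: math.hypot(s[1] - me[0], s[2] - me[1]))[0]
print(grid_answer, true_answer)
```

The grid lookup returns the station 8 units away and never sees the one 2 units away, because the closer station lives in the neighboring tile.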


So I came up with this solution: a Voronoi diagram for all the stations in the world! It looks as if a very complex calculation should be involved to generate a map like the following, but it takes only a few minutes to build the world-scale map with 2.5 million points. Each station has a polygon that indicates the range it covers.

The best solution for us was grid + Voronoi lattice. Now let’s go back to the New Orleans problem. We’re in New Orleans, which is in a grid tile that intersects the Voronoi polygons of Slidell and Baton Rouge. So now we know that we have 2 candidate stations, and the one in Slidell is the closest.
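The combined lookup can be sketched as a precomputed table from each tile to its candidate stations. Note the substitution: instead of intersecting real Voronoi polygons with tiles, as the library mentioned below presumably does, this sketch approximates the same candidate set by sampling points inside a tile and recording which station is nearest to each sample. That shortcut can miss a thin sliver of a polygon, and the planar coordinates, tile size, and sample count are all made up.

```python
import math

TILE = 10.0    # tile edge length, in arbitrary planar units
SAMPLES = 5    # sample points per tile edge


def nearest(x, y, stations):
    return min(stations, key=lambda s: math.hypot(s[1] - x, s[2] - y))


def tile_candidates(tx, ty, stations):
    """Approximate the stations whose Voronoi cells overlap tile (tx, ty)."""
    found = set()
    for i in range(SAMPLES):
        for j in range(SAMPLES):
            x = (tx + i / (SAMPLES - 1)) * TILE
            y = (ty + j / (SAMPLES - 1)) * TILE
            found.add(nearest(x, y, stations))
    return sorted(found)


def lookup(x, y, index):
    """Query time: compare distances only among the tile's few candidates."""
    tile = (math.floor(x / TILE), math.floor(y / TILE))
    return nearest(x, y, index[tile])[0]


stations = [("west", 1.0, 5.0), ("east", 11.0, 5.0)]
# Built once, ahead of time; only one tile is indexed in this toy example.
index = {(0, 0): tile_candidates(0, 0, stations)}
answer = lookup(9.0, 5.0, index)
print([name for name, _, _ in index[(0, 0)]], answer)
```

The expensive work moves to index-building time, and each query checks only the handful of stations whose regions reach into its tile, including ones located in neighboring tiles.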

Want to try out making your own visualization of our weather station data?  You can find the NCDC Weather Stations API here and the Voronoi Lattice library written in Python is available at Github.