Community

Overview of Open Government Budget Crisis

It’s hard to say what will become of Data.gov and USAspending.gov. Researcher and scholar Vivek Wadhwa claims the sites have plenty of support from government officials, but do they have enough support from lawmakers to stay afloat? Reports claim that the budget for Data.gov and USAspending.gov will plummet from $35 million to $2 million.

If there’s one thing we like to do at Infochimps, it’s collecting interesting nuggets of information for you to use. So here are some useful posts on the matter. Please share them with your friends so we can ensure support for open government:

A Data Driven Race to Solve America’s Health Care Woes

Over $30 billion was spent on unnecessary hospital admissions in 2006. Each of these unnecessary admissions took away one hospital bed from someone else who needed it more. Rather than waiting for politicians to settle their arguments about how to implement health care reform, health care provider Heritage Provider Network teamed up with data modeling and prediction competition network Kaggle to offer a very interesting solution.

Heritage Provider Network launched the Heritage Health Prize with one goal in mind: to develop a breakthrough algorithm that uses available patient data, including health records and claims data, to predict and prevent unnecessary hospitalizations. They’ve invited data scientists to help crack the problem, and the winner will receive $3 million.

$3 million sounds like a lot, but it could spare Heritage Provider Network a considerable number of superfluous claims and make our healthcare system much more efficient. How effective do you think data algorithms can be at distinguishing life-saving visits from unnecessary ones? What data and precautions could be crucial for this contest to be a success?

To register your interest in the Heritage Health Prize that begins on April 4, please visit the official website. Be sure to check out other current and upcoming competitions at kaggle.com.

Help Release Over 40,000 Songs with Lyrics at Kickstarter.com

My friend Tahir Hemphill has built the Hip Hop Word Count, a searchable database of over 40,000 songs with lyrics and metadata – including dates and geolocation of the artists.  Check out Tahir talking about the project:

He was picked up in ReadWriteWeb recently and he’s raised over $6,000 through his Kickstarter campaign, from the likes of Clay Shirky no less, to launch the service publicly. He’s also started to share his data on Infochimps: you can now download a pack of Jay-Z lyrics. You can find similar data on Infochimps by searching the music tag.

Show your support for another developer/artist that’s doing something cool with data, and contribute to his fundraising campaign. Tahir will be using the proceeds to release the data, and his tool, to the public.

Stay tuned next week for a release of data from the Million Song Dataset project, a massive dataset that catalogs the features of a million songs. Music data like this, and like that from the HHWC project, helps create web services like Pandora and neat graphics about whether crunk was first used in the South, and makes the dreams of us data hobbyists come true.
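For the curious, here is a minimal sketch of how a "where was this word first used?" question could be asked of lyric data with dates and regions. The records, the field names (`year`, `region`, `lyrics`) and the function `first_use_by_region` are all made up for illustration; they are not the Hip Hop Word Count schema or data.

```python
import re

# Made-up placeholder records; the field names are assumptions,
# not the actual Hip Hop Word Count schema.
SONGS = [
    {"title": "Song A", "year": 1995, "region": "South", "lyrics": "get crunk tonight"},
    {"title": "Song B", "year": 1999, "region": "East", "lyrics": "Crunk, crunk in the city"},
    {"title": "Song C", "year": 1993, "region": "West", "lyrics": "nothing to see here"},
]

def first_use_by_region(songs, word):
    """Return {region: earliest year in which `word` appears in a lyric}."""
    # Whole-word, case-insensitive match.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    earliest = {}
    for song in songs:
        if pattern.search(song["lyrics"]):
            region = song["region"]
            if region not in earliest or song["year"] < earliest[region]:
                earliest[region] = song["year"]
    return earliest

result = first_use_by_region(SONGS, "crunk")
print(result)
```

With real data, the same loop would run over tens of thousands of songs, and the resulting years per region are what a graphic would plot.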

Sharing the Love

Data visualizations are like houses and neighborhoods, monuments even, built on the foundation that Infochimps is laying with our big data gathering and processing. We love it when people do really cool things with the information that we have on our site and just wanted to share a recent example with you. One of our users, Kennedy Elliott (@kennelliott), found subway trend data on our site and used it to make a really cool holiday greeting card that she sent to us. :)

To Open or Not to Open Data: A Private Organization’s Dilemma

Open data has thus far largely been associated with government data. Though government data is indeed valuable, the potential of the data that private organizations gather has been overlooked. These organizations usually don’t realize the potential that their data holds.

At the Data Cluster last month, our own Dhruv Bansal and Gil Elbaz of Factual led the Open Data Birds-of-a-feather session. Using insights from that discussion, and some of our own, we want to highlight some pros and cons of this process to help organizations determine whether opening their data is the right move:

Pros
1. Profit generation – Almost all data will have some value to someone else, whether an organization realizes it or not. Putting up data for sale would help these organizations realize how valuable their data is and may even provide another revenue stream from this latent resource. For example, a firm with data on parking meter locations and occupancy rates can sell it to a firm building an iPhone app to help you reliably find parking in our nation’s downtowns.
2. Crowd-sourced curation – Gil commented that a lot can be gained from crowd-sourced curation. Firstly, the organization avoids the costs of curating the data itself. Secondly, the pool of brains working on the data can lead to incredible products that were not immediately evident, especially when your data is mashed with others’. In this Factual table of Nationwide Restaurants, geo data is mashed with information and reviews of restaurants from sites such as Yelp, Yahoo!, Citysearch and Zagat to make an interactive search table.
3. Potential uses – There are many different uses for data that range from cool informational data visualizations to applications to mining for insights. The organization avoids the costs of having to set up infrastructure and gather manpower to translate the data into these products by opening their data for others to use.
Some examples of what has already been done with open government data can be found in a previous blog post, “Open Data Applications.”
4. Exposure – Organizations can gain exposure from opening their data, especially now while it’s still relatively uncommon, positioning themselves on the cutting edge of the data sphere. Additionally, transparency is demanded more these days, and this is one of the ways to achieve that. Best Buy has an open API of their product catalog called the Best Buy Remix. With this open API, they not only leave the development of apps to others, but they also gain exposure and generate business from apps that would, for example, allow users to search for products they want and get details on them (location, price, specs, etc.).

Cons
1. Historically difficult – The development of the market for alternative data is relatively new. Opening data used to be incredibly difficult, expensive and labor-intensive. Large amounts of data took a lot of time and were extremely hard, if not impossible, to process. However, things such as cloud computing and processing tools like Hadoop have helped address these problems, making the whole data process a lot easier.
2. Privacy concerns – These fall under two types. First, some companies might be concerned about certain data being accessed by their competitors. This problem can be avoided since companies can choose what data they open and keep more sensitive data secret. In the end, these organizations might find that crowd-sourcing their data results in interesting insights that further develop their product or service. Second, there are also concerns about users’ personal data. Efforts need to be made to ensure that users understand how their data is being used, how it is kept secure, and how to opt out if they choose to do so.
3. Data processing – Some organizations don’t have the capabilities to process the data for public consumption, but if they really do have valuable data, then a cost-benefit analysis might show that setting up the required infrastructure is worth it. If a company just doesn’t have the resources for this, as mentioned earlier, it can leave some of the data processing to the crowd.
4. Reservations about crowd-sourcing – Someone from Wolfram Alpha pointed out that companies may believe that expert curation is better than crowd-sourcing. What these companies fail to realize is that more and more people are fluent in data. Crowd-sourcing their many talents and ideas means that a lot more can be done with the data, including things that one expert alone may overlook.

Verdict? Open your data! The data market is growing and infrastructure is developing alongside. The traditional hindrances to opening data, such as the scarcity of people who can curate data, the difficulty of identifying buyers, and the impossibility of handling large amounts of data, are dissipating. Instead, a lot of potential lies in the data, from financial gains to the increase of brand recognition. With all this in mind, companies need to take a second look at their data and evaluate its worth.

Open Data Applications

With President Obama’s Open Government Directive and news about Data.gov’s overhaul, more and more people have been talking about the benefits of open data. Yes, this includes greater transparency and a more accountable government, but it also gives birth to useful apps that use these newly available datasets.

A lot of these apps have been made for competitions like Sunlight Labs’ Apps for America and various cities’ own initiatives like NYC BigApps. Understandably, they provide appealing incentives for programmers. (If not the recognition, then the cash prizes.)

All that said, these competitions have spawned very useful apps. Here are 5 that we feel are great examples of the good that can be done with government data:

1. This We Know (www.thisweknow.org)
This We Know is an excellent tool that provides a wealth of information sourced mainly from Data.gov. You name a place and it tells you what we know about that location – things like demographics or the number of factories in the area. It’s also presented in a very clear fashion, condensing data into an easily understandable and still useful format.

2. StumbleSafely (www.outsideindc.com/stumblesafely)
This app from DC literally helps you stumble safely home. It uses data on crime and geography to map out safe routes from the more (in)famous bars in the city, no matter what time you like to party – day, evening or night.

3. NYC Way (www.nycway.com)
An iPhone app, NYC Way puts a plethora of useful information at the fingertips of locals and tourists alike. Location aware, it draws from a variety of datasets in the NYC.gov Data Mine and gives you facts about nearby zoos, wi-fi spots, emergency rooms, and a lot of other useful places to help you find your way in the big city.

4. EveryBlock (www.everyblock.com)
This one’s not yet available in Austin, but it does have versions for 15 cities across the nation. EveryBlock provides you with a newsfeed of things going on around a user-specified address or location in these cities. It also allows you to browse by topic and track trends over time.

5. iKidNY (www.ikidny.com)
Not all apps are useful just for adults – this iPhone app, iKidNY, helps you find kid-friendly places all over NYC. It provides you with locations and information about activities, kid-friendly restaurants, playgrounds, and even changing tables and subway elevators.
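
For readers wondering how an app like StumbleSafely might combine crime and geography data into a "safe route," here is a minimal sketch. It is only a guess at the general idea, not the app’s actual method: a shortest-path search where each street segment costs its length plus a penalty per reported incident. The street graph, the incident counts, the `risk_weight` parameter and the function `safest_route` are all hypothetical.

```python
import heapq

# Hypothetical street graph: node -> list of (neighbor, distance_m, incident_count).
STREETS = {
    "bar": [("alley", 100, 9), ("main_st", 250, 1)],
    "alley": [("home", 100, 9)],
    "main_st": [("home", 250, 1)],
    "home": [],
}

def safest_route(graph, start, goal, risk_weight=50):
    """Dijkstra's algorithm over cost = distance + risk_weight * incidents."""
    queue = [(0, start, [start])]
    best = {}
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in best and best[node] <= cost:
            continue  # already reached this node more cheaply
        best[node] = cost
        for neighbor, distance, incidents in graph[node]:
            step = distance + risk_weight * incidents
            heapq.heappush(queue, (cost + step, neighbor, path + [neighbor]))
    return None  # goal not reachable

safe = safest_route(STREETS, "bar", "home")
shortest = safest_route(STREETS, "bar", "home", risk_weight=0)
print(safe)
print(shortest)
```

In this toy graph the alley is the shorter walk, but once incidents are penalized the route along the main street wins. Setting `risk_weight=0` falls back to plain shortest distance.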

If you want to look at more apps, these competitions’ submission galleries are worth a look:
Apps for America 2
Apps for Democracy
NYC BigApps
DataSF

Did we miss your favorite app? Let us know! We’d love to check it out.

The data landscape (Part 2), and Microsoft

The data platform industry has a new entrant this week!  Yesterday Microsoft announced a data store of their own at their developer conference.  Called Dallas, their offering is another example of a data marketplace.  The market for selling data online in an open way is still young (how many platforms besides ours and Microsoft’s do you know?) and so it is validating to see another entrant in this space.  We know that Microsoft will encourage the developer community to explore what these new platforms make possible.

Like many other services, Dallas meters out data through an API which is helpful to programmers with limited resources.  With Infochimps, however, developers get full datasets in bulk, which is better for many applications and essential for any kind of analytic work.
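
To make the difference concrete, here is a toy sketch of metered API access versus a bulk download. Nothing here models Dallas or Infochimps: the dataset, the `MeteredAPI` class, the page size and the call quota are all hypothetical stand-ins.

```python
# Hypothetical dataset; 1,000 rows of made-up values.
DATASET = [{"id": i, "value": i * i} for i in range(1000)]

class MeteredAPI:
    """Toy stand-in for a metered, paginated data API (assumed behavior)."""

    def __init__(self, rows, page_size=100, quota=5):
        self.rows = rows
        self.page_size = page_size
        self.calls_left = quota

    def get_page(self, page):
        if self.calls_left == 0:
            raise RuntimeError("quota exhausted")
        self.calls_left -= 1
        start = page * self.page_size
        return self.rows[start:start + self.page_size]

def fetch_via_api(api):
    """Pull pages until the data or the quota runs out."""
    rows, page = [], 0
    while True:
        try:
            batch = api.get_page(page)
        except RuntimeError:
            break
        if not batch:
            break
        rows.extend(batch)
        page += 1
    return rows

api_rows = fetch_via_api(MeteredAPI(DATASET))
bulk_rows = list(DATASET)  # a bulk download hands over every row at once

print(len(api_rows), len(bulk_rows))
```

With a quota of five calls of 100 rows each, the API client sees only half of the dataset, which is fine for a lookup-style app but skews any statistic computed over the whole thing.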

Both our marketplaces have the same value proposition: open up your data and profit.  When trying to convince an organization to open up its data, APIs can be an easier sell.  Even though they are costly to build and run, organizations may prefer the control they get over what people can access when compared to our simple and cheap bulk solution.

It is still unclear what the size and format restrictions are on Dallas.  If they are like other services out there (Socrata, Factual), they need data that comes in a structured, rectangular format.  These constraints enable these services to display their data live online.  While Infochimps doesn’t have that feature (yet!), we can handle datasets at the terabyte scale as well as those that don’t fit the spreadsheet paradigm, such as social network graphs.
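
As a rough illustration of what "doesn’t fit the spreadsheet paradigm" means, compare a rectangular table with a small social graph. The names, the edges and the helper functions below are made up for the example.

```python
from collections import defaultdict

# Rectangular data: every record has the same columns.
TABLE = [
    {"name": "ann", "city": "Austin"},
    {"name": "bob", "city": "Dallas"},
]

# A social graph can be flattened into an edge list, but the interesting
# questions are about paths between people, not about individual rows.
EDGES = [("ann", "bob"), ("bob", "cat"), ("cat", "dan")]

def build_adjacency(edges):
    """Turn an undirected edge list into {person: set of friends}."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency

def friends_of_friends(adjacency, person):
    """People exactly two hops away from `person`."""
    direct = adjacency[person]
    result = set()
    for friend in direct:
        result |= adjacency[friend]
    return result - direct - {person}

columns = {key for row in TABLE for key in row}
adjacency = build_adjacency(EDGES)
print(columns)
print(friends_of_friends(adjacency, "ann"))
```

The table can be shown in a grid as is. The graph first needs a structure like `adjacency` before even a simple question such as "who are my friends’ friends?" can be answered.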

Dallas is also part of a platform that forces users to integrate with other Microsoft services.  Infochimps’ mission is simply to connect people with the data they’re looking for, and we let anyone download data without having to register for an account.

We are proud to be a part of a strong community that’s grown over the past year, and to continue our commitment to an open data commons.  On the commercial side, we are narrowing focus on the right verticals after months of talking with this new market about what is possible.  That ultimately is what this is about – enabling something that couldn’t be done before, and connecting buyers to sellers and people to knowledge.

Eric Ries’ Startup Lessons Learned

In June the Infochimps attended an event in Austin where Eric Ries gave a talk about the Lean Startup. His ideas inspired further reading, and we have been applying his methodology to making Infochimps.org a sustainable and profitable web service. Here is a breakdown of two of the ideas Eric writes about, which also cross over with Steve Blank’s wonderful book, The Four Steps to the Epiphany.

1) Product development vs. customer development: In product development the team builds a product that they spec’d out themselves in the early stages. Customer development instead is about developing the market. It is a more holistic approach to building a company and launching a product. And customer development deeply integrates with agile software development. Every code deploy happens for a reason – it is in the service of some story that solves an identified need of the customer or users. How do you know what those needs are? You need to have talked to real customers and users.

Our site is built by two physics researchers – scientists intimately familiar with the problems of finding and sharing data on the web. They have thought well into the future about how our site can solve these issues. Our feature list is long and describes a killer application. Problems arise, however, when we try to organize and prioritize this list. User testing helps tremendously. Observing how people use the site teaches us which features our users have trouble with and which features we can neglect because they aren’t being used. For example, user testing showed that Search is our most important feature, and that browsing by categories was less important.

Once we started talking to customers, our organizational priorities became much clearer as well. Through talking to Data Suppliers, we learned what features are most important to them on the site, which clauses of our Data Supplier Agreement they had most trouble with, and what the best way is to talk to them about selling their data on our site.

2) What type of market are you in? Steve Blank drives this point home in nearly every chapter of his book. Is your product competing in a market that already exists? If so, does it resegment that market by price or niche? Or is your product creating a new market?

Steve’s clearest example of this is the PDA market. When the first PDA came out, it created a new market. People could now do something they had never been able to do before – that is, sync their computer with a handheld device and work on the go. Marketing and PR efforts had to go towards educating people on these new tools and what they could do, rather than talking about product features. Once PDAs became an existing market with multiple players, marketing and PR efforts had to switch goals, and the conversations became less about the new possibilities and more about individual features, like whether this PDA had 8MB of memory and a 10in screen.

Infochimps has to split our pitch between the existing markets we resegment and the new markets we create. Data is already sold in the Market Research and Finance industries – our website resegments this existing industry by offering different features and benefits. When we spoke to Zogby we didn’t have to tell them they could sell their data, because they already do this. We just had to show them why Infochimps is a different and better solution. Businesses in other industries do not yet sell their data, but our website enables just this. It is much harder to talk a taxicab company into selling their data – we first have to make the case that this is a profitable possibility. Our job is to educate this mainstream market about the new opportunities they can take advantage of with their data.

Start-up Checklist

On Jessica Hagy’s “Indexed”, the “Start-up Checklist”:

[Image: “Start-up Checklist” card from Indexed]

Amazon Web Services hosts DBpedia, Freebase data sets

The Infochimps.org community played a part in pushing the DBpedia and Freebase data sets to Amazon Web Services.  This is an auxiliary effort by Infochimps.org to increase access to data.  It is important to have the data in places where there are the right tools for people to use it.  AWS is that place: look at creating an Amazon Machine Image to start working with the new data sets.  Our MachetEC2 can help, so please let us know how your experience was in using it.

Thanks to Kingsley Idehen with Linked Open Data for being a good point of contact. 

We will upload more data sets to AWS in the near future.  Any requests?