How to create datasets that the rest of the world needs

We recently created a dataset for the site that maps IP addresses to zip codes and Census demographic information. The work involved is representative of the kind of community contribution we want to see at Infochimps in the future. The people who will find this dataset useful – web site owners, internet advertisers – are not always the same people who can create such a dataset. This division of labor can only happen when experts at data gathering can share their data in a place where the people who want to use it can find it.

Our social media expert Maegan recently interviewed Carl, a member of our data team, to talk about this dataset creation process. You can find the IP-Census data he’s talking about here: http://infochimps.org/collections/ip-address-to-us-census-data.

M: Hi Carl, would you start by introducing yourself and telling us what you do for Infochimps?

C: I’m a member of the data team here at Infochimps. Basically, we’re the team in charge of gathering data that’s available on the web, cleaning it up and making it more useful for other people who are looking for this sort of data.

M: I can imagine how appealing that data is to a lot of people. Speaking of useful data, I heard that you recently came up with a collection of datasets that link IP addresses to Census information. Can you tell me more about it?

C: Well, we heard from a few people that that sort of thing might be interesting. There are a lot of people out there who want to know more about the people that come to their website. Using this dataset, they can get demographic details from the IP addresses of their visitors. That way they can improve their understanding of their audience and target the content on their website better. The dataset that we have links IP addresses to zip codes, and then zip codes to all sorts of demographic data from the Census.
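
To make the two-step lookup Carl describes concrete, here is a minimal Python sketch. It is not the code used to build the dataset: the table layouts, the function names `build_index` and `lookup_demographics`, the zip codes, and the demographic figures are all made up for illustration, and the IP addresses come from ranges reserved for documentation.

```python
import bisect
import ipaddress

# Hypothetical sample rows: (first IP in block, last IP in block, zip code).
# The addresses are reserved documentation ranges; the zip codes are placeholders.
IP_BLOCKS = [
    ("192.0.2.0", "192.0.2.255", "00001"),
    ("198.51.100.0", "198.51.100.255", "00002"),
]

# Hypothetical demographic rows keyed by zip code (the numbers are made up).
CENSUS_BY_ZIP = {
    "00001": {"total_population": 1200, "median_age": 34.5},
    "00002": {"total_population": 800, "median_age": 41.0},
}


def build_index(blocks):
    """Convert the blocks to integers and sort them by starting address."""
    rows = sorted(
        (int(ipaddress.ip_address(start)), int(ipaddress.ip_address(end)), zip_code)
        for start, end, zip_code in blocks
    )
    starts = [row[0] for row in rows]
    return starts, rows


def lookup_demographics(ip, starts, rows, census_by_zip):
    """Step 1: IP address -> zip code. Step 2: zip code -> demographic row."""
    value = int(ipaddress.ip_address(ip))
    # Find the last block whose starting address is <= the visitor's address.
    pos = bisect.bisect_right(starts, value) - 1
    if pos < 0:
        return None
    _, end, zip_code = rows[pos]
    if value > end:
        return None  # the address falls in a gap between known blocks
    return census_by_zip.get(zip_code)


STARTS, ROWS = build_index(IP_BLOCKS)
```

With these sample tables, `lookup_demographics("192.0.2.77", STARTS, ROWS, CENSUS_BY_ZIP)` returns the row for the placeholder zip code `00001`, and an address outside every block returns `None`.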

M: I saw that you have so many different types of information from the Census. Where did you go to find the data to mash together?

C: For the Census data, that’s a fairly well-known source. The US government has a Census website, Factfinder.census.gov, where you can go to download all sorts of information. As far as the IP to geolocation data, there are a lot of datasets available. We were looking for one that had good coverage of IP addresses, was available for free, and had a license that allowed us to take that data, do what we wanted with it and make it available on our site.

M: Is this a new kind of dataset? Or is it available elsewhere?

C: The IP to geolocation dataset is available from where we got it – at MaxMind. Linking that to the Census data is something that I don’t think we’ve seen elsewhere.

M: How did the process work once you had the data?

C: The Census data is divided into a lot of different geographic segments – national, state, city, county and all those sorts of things, but the IP geolocation data only uses zip codes. We wanted just the data from the Census that’s associated with the zip codes, so I had to comb through the Census data and pull out just the lines of the data that are associated with zip codes and then use that to match up to the IP addresses in the geolocation data.

M: Is it just how they’re organized?

C: Yeah, it’s more of how it’s organized. The Census data is organized into a few different files. You have one file that lists all the different breakdowns of how the data is divided up – like I was saying, by state, city, zip code or the country. Each of those breakdowns is associated with a logical record number. Then, the actual Census data files have the logical record number at the beginning of each line, followed by all the numbers for the different fields. I had to pick out just the logical record numbers that were associated with zip codes in the first file and then pull those lines out of the Census data files to match them to the zip codes from the IP addresses.
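
The filtering step described here can be sketched roughly as follows. The file layouts are simplified assumptions, not the real Census formats: the geographic file is imagined as `record number, level, zip code` and the data file as `record number, field values`, both comma-separated. The function names and the sample values are hypothetical.

```python
import csv
import io

# Simplified stand-in for the geographic file: record number, level, zip code.
GEO_FILE = io.StringIO(
    "0000001,state,\n"
    "0000002,zip,00001\n"
    "0000003,county,\n"
    "0000004,zip,00002\n"
)

# Simplified stand-in for a data file: record number, then the data fields.
DATA_FILE = io.StringIO(
    "0000001,50000,36.1\n"
    "0000002,1200,34.5\n"
    "0000003,9000,38.2\n"
    "0000004,800,41.0\n"
)


def zip_record_numbers(geo_lines):
    """Map logical record number -> zip code, keeping zip-level rows only."""
    return {
        recno: zip_code
        for recno, level, zip_code in csv.reader(geo_lines)
        if level == "zip"
    }


def zip_rows(data_lines, recno_to_zip):
    """Yield (zip code, data fields) one line at a time, skipping other geographies."""
    for row in csv.reader(data_lines):
        recno, fields = row[0], row[1:]
        if recno in recno_to_zip:
            yield recno_to_zip[recno], fields


RECNO_TO_ZIP = zip_record_numbers(GEO_FILE)
ZIP_ROWS = list(zip_rows(DATA_FILE, RECNO_TO_ZIP))
```

In this sketch the data file is read one line at a time, so only the record-number map has to be held in memory while the larger file streams past.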

M: I would imagine that Census data would involve big files – did this make them difficult to manage?

C: Yeah, the Census data files are really large and so it took a lot of space to load everything into memory. Then, I made a list of what data we needed from the Census data files and searched through them line by line to match zip codes to demographic information.

M: That sounds like a lot of work. Did you have to do anything else to process the data?

C: The other thing that I did was figure out the column headings to make it more useful. The way it was presented by the US Census bureau is that each column of data has a column heading that is just a code that you look up somewhere else to figure out what it actually meant. I went through and did a lot of manual editing to make the column headings more readable. Now if you just look at it, you have a better idea of what’s actually going on and it’s not just meaningless code.
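
Carl did this renaming by hand, but once a code-to-label table exists, applying it is mechanical. The short Python sketch below shows one way to do that; the codes, the labels, and the `rename_header` helper are placeholders invented for this example rather than actual Census field codes.

```python
# Hypothetical code-to-label lookup; real codes come from the Census documentation.
READABLE_NAMES = {
    "P0001": "total_population",
    "P0002": "median_age",
}


def rename_header(header, names):
    """Replace known codes with readable labels, leaving unknown codes untouched."""
    return [names.get(code, code) for code in header]


HEADER = ["zip", "P0001", "P0002", "P0003"]
RENAMED = rename_header(HEADER, READABLE_NAMES)
```

Codes that are missing from the lookup table pass through unchanged, so a partially filled table still produces a usable header.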

M: How did you find data with licenses that actually let you mash them?

C: We were looking for datasets whose licenses let you freely download the data, mash and mix it up with other datasets, and sell it on your own site or do anything commercial with it. Of course, most of these licenses have attribution requirements, so we made sure to list all our sources in the dataset. The final dataset that we have available clearly says that this data originally came from the US Census Bureau and from MaxMind.

M: In the end, what licenses did you put on the dataset that you made?

C: The license that is on there now is a very open license that lets users use the data for whatever they need. It is the Open Database License.

M: Are there any other difficulties you faced?

C: One thing we wanted to make sure of was that the IP address data we got was reliable and had broad coverage of general IP addresses. We did a quick test with the logs from our own website: we took the IP addresses from six months’ worth of page visits and ran them all through the IP address database. It turned out that it matched over 90% of the IP addresses that we had, which was a pretty good indication that the IP address dataset was fairly complete and had very good coverage compared to others, which we heard would have only 50% coverage.
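
A coverage check of this kind could look something like the Python sketch below. It is only an illustration: the block list, the log addresses, and the function names are hypothetical, and counting each distinct address once is an assumption, since the interview does not say whether repeat visits were counted separately.

```python
import ipaddress

# Hypothetical blocks known to the geolocation data (reserved documentation ranges).
KNOWN_BLOCKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

# Hypothetical visitor addresses pulled from web server logs.
LOG_IPS = ["192.0.2.10", "192.0.2.10", "198.51.100.7", "203.0.113.9", "192.0.2.200"]


def is_covered(ip, blocks):
    """Return True if the address falls inside any known block."""
    address = ipaddress.ip_address(ip)
    return any(address in block for block in blocks)


def coverage(ips, blocks):
    """Fraction of distinct addresses that the geolocation data can place."""
    unique = set(ips)
    if not unique:
        return 0.0
    matched = sum(1 for ip in unique if is_covered(ip, blocks))
    return matched / len(unique)


COVERAGE = coverage(LOG_IPS, KNOWN_BLOCKS)
```

Here three of the four distinct sample addresses fall inside a known block, so `COVERAGE` is 0.75. A linear scan over the blocks is fine for a toy list; a real geolocation table would call for a sorted index instead.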

M: Is the availability of the IP addresses a privacy concern?

C: I don’t think it’s a privacy concern because it’s not matching it up to a specific address, but it’s matching it up to a zip code. Since zip codes have a very large number of people, it’s hard to determine if that IP address is coming from one specific person or even one specific household.

M: OK, thank you very much, Carl.