Monthly Archives: November 2009

Twitter data, open questions to Developers, Academics, and Data Geeks

We are excited to announce the re-release of the Twitter datasets, and a discount on the Twitter API Map dataset.  Again, the datasets are the Twitter API ID Mapping dataset and the Conversation Metrics (Token Count) dataset, both described in the posts below.

This time the data is being released with Twitter’s approval.  We are talking with them about how we can increase access to more and more bulk data, and need your help in showing them how useful this data really is.

We want to make clear to people with privacy concerns that we absolutely hear and respect your points, and so does Twitter.  These datasets contain NO personally identifiable information, they do NOT contain whole tweets, and they meet the guidelines laid out in this EFF document (on personally id’able info).

We encourage everybody to take advantage of this weekend’s discount and go build great things with this data.  Let’s show Twitter and the world what is possible when one has access to bulk data:

  • Data geeks and Visualization studs: what would you do if you could run jobs across our massive crawl (or the full Twitter graph)?
  • App devs: what data do you want those nerds to extract?  How would it improve the experience of Twitter or enable new things?
  • Businesses: how can this data improve your services?  How can this data make you money?
  • Academic researchers: what amazing things will you uncover by exploring the social network’s deep structure?

Reach out to us in the comments or send us ideas at info@infochimps.org.

The data landscape (Part 2), and Microsoft

The data platform industry has a new entrant this week!  Yesterday Microsoft announced a data store of their own at their developer conference.  Called Dallas, their offering is another example of a data marketplace.  The market for selling data online in an open way is still young (how many platforms besides ours and Microsoft’s do you know?) and so it is validating to see another entrant in this space.  We know that Microsoft will encourage the developer community to explore what these new platforms make possible.

Like many other services, Dallas meters out data through an API, which is helpful to programmers with limited resources.  With Infochimps, however, developers get full datasets in bulk, which is better for many applications and essential for any kind of analytic work.

Both marketplaces have the same value proposition: open up your data and profit.  When trying to convince an organization to open up its data, APIs can be an easier sell.  Even though they are costly to build and run, organizations may prefer the control they give over what people can access, compared with our simple and cheap bulk solution.

It is still unclear what the size and format restrictions are on Dallas.  If it is like other services out there (Socrata, Factual), it will need data that comes in a structured, rectangular format.  These constraints let such services display their data live online.  While Infochimps doesn’t have that feature (yet!), we can handle datasets at the terabyte scale, as well as those that don’t fit the spreadsheet paradigm, such as social network graphs.

Dallas is also part of a platform that forces users to integrate with other Microsoft services.  Infochimps’ mission is simply to connect people with the data they’re looking for, and we let anyone download data without having to register for an account.

We are proud to be a part of a strong community that’s grown over the past year, and to continue our commitment to an open data commons.  On the commercial side, we are narrowing our focus to the right verticals after months of talking with this new market about what is possible.  That ultimately is what this is about – enabling something that couldn’t be done before, and connecting buyers to sellers and people to knowledge.

Twitter data update

Our launch of the Twitter data was a great success, and we thank Marshall Kirkpatrick at ReadWriteWeb (also) and Jordan Golson at GigaOM for their coverage. The community reaction has been overwhelming and energizing. We accomplished our two main goals: crack open some issues close to our hearts and kick-start the conversation about sharing data online.

Twitter has advanced some reasonable concerns, however, and has asked us to take the datasets down. We have temporarily disabled downloads while we discuss licensing terms. The outcome of these discussions will, we hope, encourage more internet services to open up and share data in bulk. The two biggest issues this data release highlighted are third-party redistribution and user privacy.

Redistribution rights. Twitter maintains a legendarily open API:

“Except as permitted through the Services (or these Terms), you have to use the Twitter API if you want to reproduce, modify, create derivative works, distribute, sell, transfer, publicly display, publicly perform, transmit, or otherwise use the Content or Services.
“We encourage and permit broad re-use of Content. The Twitter API exists to enable this.” [highlighting added by us]

However, Twitter wants to more closely control who has access to data at massive scale and to prevent its malicious use. We understand this concern — innovation is always a double-edged sword. The applications and services that can use this data to make the world a better place far outnumber those with bad intentions, however, and good people need better access to this type of data. The best solution is to apply a reasonable license to the data. We are addressing this in our talks with Twitter, and we expect to have a resolution soon.

User privacy. What little criticism we heard from the community concerned the potential for a breach of user privacy. This is an issue with many types of internet data, and one we take seriously. We ensured that the datasets released posed no such dangers. The Token Count data contained no personally identifying information, only what the entire mass of Twitter users was discussing over time. The API ID Mapping Dataset is simply a sort of phone book for the Twitter APIs: it converts screen names to numeric IDs and reveals absolutely nothing about the corresponding user. Infochimps.org’s policy is to not host any personally identifying information of non-consenting individuals — we apply this rule to any data that goes on the site from any source.

These are hard issues and it took a bold move to bring them into the open. It will take further sharing and discussion to establish best practices for these concerns so that Twitter and other internet services (Facebook, Amazon, etc.) can share their data to the benefit of the greater online community. Stay tuned while we agree upon appropriate licensing for open sharing of this social data.

Twitter Census: Publishing the First of Many Datasets

As useful as the Twitter API is, developers, designers, and researchers have long clamored for more than the trickle of data that service currently allows. We agree — some of the sexiest uses of data require processing not just all that is now, but the vast historical record. Twitter doesn’t provide the only use case for this, but until now its historical bulk data has been hard to find.

Today we are publishing a few items collected from our large scrape of Twitter’s API. The data was collected, cleaned, and packaged over twelve months and contains almost the entire history of Twitter: 35 million users, one billion relationships, and half a billion Tweets, reaching back to March 2006. The initial datasets are a part of our Twitter Census collection.

The first dataset, a Token Count, counts the number of tokens (hashtags, smileys, and URLs) that have been tweeted. The data is available for free by month and for pay by hour. Think about comparing this data to the stock market, new movies, new video games, or even trendingtopics.org. For example, use it to track the adoption of Google Wave through the rate of its mentions. On one payload’s page you will find a snippet with a sample taken during Kanye West’s outburst in September, and on another’s you can see that the “:)” emoticon has been used 135,000 times.

The second dataset solves a large problem developers have when they use Twitter’s Search API and the Twitter API, as each API gives back a different unique identifier for every user on Twitter. This dataset maps user IDs between the two APIs for 24.5 million users. This mapping should be a godsend to Twitter app developers, as it allows them to easily combine data from each API, letting API calls for friends lists mix easily with searches on the Twitter Search API.
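
To illustrate how such a mapping could be used (the file names and the two-column tab-separated layout below are assumptions, not the dataset’s documented format), a developer might join it against their own search results with standard Unix tools:

    # hypothetical layout: search_api_id <TAB> rest_api_id  (actual columns may differ)
    sort -t$'\t' -k1,1 twitter_api_id_mapping.tsv > mapping.sorted.tsv
    sort -t$'\t' -k1,1 my_search_results.tsv      > results.sorted.tsv
    # attach the REST API id to each search-result row, joining on the first column
    join -t$'\t' -1 1 -2 1 results.sorted.tsv mapping.sorted.tsv > results_with_rest_ids.tsv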

These datasets are only views from the massive collection we have been growing over the last year. We will be releasing additional datasets regularly over the next few weeks so please check back for updates. If you’d like a custom slice or analysis done on this data, please get in touch at imw@infochimps.org.

With the release of this data, we hope to send a signal that this data is valuable and useful to real-time search engines, Twitter apps, and social media researchers. This should start a conversation about where the value really lies in this type of data and the various ownership and privacy issues that arise, and show that Infochimps.org is the place to go to find data. We invite interested parties to get in touch and begin uploading their data (try invite code “newsupplier”) today as part of the Infochimps marketplace.

Eric Ries’ Startup Lessons Learned

In June the Infochimps attended an event in Austin where Eric Ries gave a talk about the Lean Startup. His ideas inspired further reading, and we have been applying his methodology to making Infochimps.org a sustainable and profitable web service. Here is a breakdown of two of the ideas Eric writes about, which also cross over with Steve Blank’s wonderful book, The Four Steps to the Epiphany.

1) Product development vs. customer development: In product development the team builds a product that they spec’d out themselves in the early stages. Customer development instead is about developing the market. It is a more holistic approach to building a company and launching a product. And customer development deeply integrates with agile software development. Every code deploy happens for a reason – it is in the service of some story that solves an identified need of the customer or users. How do you know what those needs are? You need to have talked to real customers and users.

Our site is built by two physics researchers – scientists intimately familiar with the problems of finding and sharing data on the web. They have thought well into the future about how our site can solve these issues. Our feature list is long and describes a killer application. Problems arise, however, when we try to organize and prioritize this list. User testing helps tremendously. Observing how people use the site teaches us which features our users have trouble with and which features we can neglect because they aren’t being used. For example, user testing showed that Search is our most important feature, and that browsing by categories was less important.

Once we started talking to customers, our organizational priorities became much clearer as well. Through talking to Data Suppliers, we learned which features are most important to them on the site, which clauses of our Data Supplier Agreement gave them the most trouble, and how best to talk with them about selling their data on our site.

2) What type of market are you in? Steve Blank drives this point home in nearly every chapter of his book. Is your product competing in a market that already exists? If so, does it resegment that market by price or niche? Or is your product creating a new market?

Steve’s clearest example of this is the PDA market. When the first PDA came out, it created a new market. People could now do something they had never been able to do before – that is, sync their computer with a handheld device and work on the go. Marketing and PR efforts had to go towards educating people on these new tools and what they could do, not towards talking about product features. Once PDAs became an existing market with multiple players, marketing and PR efforts had to switch goals, and the conversations became less about new possibilities and more about individual features, like whether this PDA had 8MB of memory and a 10-inch screen.

Infochimps has to split its pitch between the existing markets we resegment and the new markets we create. Data is already sold in the Market Research and Finance industries – our website resegments this existing industry by offering different features and benefits. When we spoke to Zogby we didn’t have to tell them they could sell their data; they already do this. We just had to show them why Infochimps is different and a better solution. Data is not yet sold by businesses everywhere, but our website is enabling just that. It is much harder to talk a taxicab company into selling its data – we first have to make the case that this is a profitable possibility. Our job is to educate this mainstream market about the new opportunities they can take advantage of with their data.

Little Tips

    # Run a command in each git submodule (show status)
    for foo in `find . -iname .git -type d ` ; do repo=`dirname $foo` ; ( echo "                == $repo ==" ; cd $repo ; git status ) ; done
    # Run a command in each git submodule (show remote URL)
    for foo in `find . -iname .git -type d ` ; do repo=`dirname $foo` ; ( cd $repo ; url=`git remote show origin | grep URL | cut -c 8-`; printf "%-47s\t%s\n" "$repo" "$url" ) ; done
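
A possible alternative (not from the original tip): newer git releases include git submodule foreach, which runs a shell command inside every registered submodule.

    # possible alternative using git's built-in submodule iterator
    git submodule foreach 'git status'
    git submodule foreach 'git remote show origin | grep URL'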

Make your command-line history extend to the beginning of time

I save my entire command-line history in date-stamped archive files, and have a shell script that lets me search back through it — if I need to recall the command-line parameters for an ssh tunnel, or to make curl do a form POST, I can pull them up from that time in June when I figured it out.

  # no limit on history file size
  unset  HISTFILESIZE
  # 10k lines limit on in-memory history
  export HISTSIZE=10000
  # make sure the archive directory exists
  mkdir -p "$HOME/.history-bash"
  # name the history file after the current year and week number
  export HISTFILE=$HOME/.history-bash/"hist-`date +%Y-%W`.hist"
  # if starting a brand-new history file
  if [[ ! -f $HISTFILE ]]; then
    # seed new history file with the last part of the most recent history file
    LASTHIST=~/.history-bash/`/bin/ls -1tr ~/.history-bash/ | tail -n 1`;
    if [[ -f "$LASTHIST" ]]; then tail -n 1000 "$LASTHIST" > $HISTFILE  ; fi
  fi
  # seed history buffer from history file
  history -n $HISTFILE
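
The search script itself isn’t shown in the post; here is a minimal sketch of how one might work (the histgrep name and everything in it are an assumption, not the author’s actual script):

    # histgrep: search every archived history file for a pattern
    # (a hypothetical helper; the post's actual search script isn't shown)
    histgrep () {
      grep -h "$@" ~/.history-bash/*.hist
    }
    # e.g. dig up that curl form-POST invocation from months ago:
    #   histgrep 'curl .*--data'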

Password safety from the command line

For many commands — mysql, curl/wget, others — it’s convenient to pass your credentials on the command line rather than (unsafely) storing them in a file or (inconveniently) entering them each time. There’s a danger, though, that you’ll accidentally save that password in your .history for anyone with passing access to find.

In my .bashrc, I set export HISTCONTROL=ignorespace — now any command entered with leading spaces on the command line is NOT saved in the history buffer (use ignoreboth to also ignore repeated commands). If I know I’m going to be running repeated commands that require a password on the command line, I can just set an environment variable in an ignored line, and then recall the password variable:

womper ~/wukong$      DBPASS=my.sekritpass1234
womper ~/wukong$ mysql -u ics --password=$DBPASS ics_dev

or for another example,

womper ~/wukong$       twuserpass="myusername:twitterpassword"
womper ~/wukong$ curl -s -u $twuserpass http://stream.twitter.com/1/statuses/sample.json

TSV / Hadoop Streaming Fu

Hadoop streaming uses tab-separated text files.

Quickie histogram:

I keep around a few useful little helpers for slicing tab-separated files. For example:

    cat file.tsv | cuttab 3 | cutc 6 | sort | uniq -c


This takes file.tsv and extracts the third column (cuttab 3), takes the first six characters (cutc 6, giving YYYYMM); then sorts (putting all identical entries together in a run); and counts each run. Its output is

   4245 200904
  14660 200905
   7654 200906
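
The post doesn’t show how cuttab and cutc are defined; based on the description above, they could be something like this sketch (the exact definitions are an assumption):

    # possible definitions for the helpers above (an assumption, not the author's exact code)
    cuttab () { cut -f "$@" ; }     # cuttab 3 -> third column; tab is cut's default delimiter
    cutc   () { cut -c "1-$1" ; }   # cutc 6   -> first six characters of each line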

A few other useful hadoop commands:

A filename of ‘-’ to the -put command makes it use STDIN. For example, this creates a file on the HDFS with a recursive listing of the somedir directory:

    hadoop fs -lsr somedir | hadoop fs -put - tmp/listing


Wukong’s hdp-du command produces tab-separated output, and the same trick works with it:

    hdp-du somedir | hdp-put - tmp/listing


So you can also run an HDFS file through a quick filter:

   hdp-cat somefile.tsv | cuttab 2 | hdp-put - somefile-col2.tsv 


(If you brainfart and use ‘… > hdp-put …’, know that I’ve done so a dozen times too).