Monthly Archives: August 2010

Infochimps notes from Lone Star Ruby Conference 2010

Notes and repos from @jessecrouch at LSRC 2010:

Find the notes in /presentation for stronglinks-example. All slides are done with S5/Operashow and can be viewed with Firefox and Opera by opening the show.html file and pressing F11. Use pageup/pagedown to navigate.


Infochimps is ecstatic to promote the following proposed 2011 SXSW Interactive panels. We’ve categorized them loosely according to topic. To the data geeks of the world: go forth and vote!

Seeing Data

Beautiful Data: Interactive Visualization of Social Media

What are the different ways data can be displayed, and what tools are used to create those displays? What are the benefits and practical uses of presenting data visually? Finally, what are the most exciting and innovative specimens of data visualization built around social data?

Social Media Data Visualization: Mapping the World’s Conversations

All about Infographics. How are Infographics constructed and what information can they convey beyond that of raw data?

Exploring Data

Data Overload: Probabilistic Computing For Breakthrough Data Analytics

What is probabilistic computing and how does it differ from more common types of programming? How does probabilistic computing fit into other data analysis tools?

Making Sense of Social Media Data

Explains the ins-and-outs of social media monitoring tools, the techniques that work and realistic expectations of what they can deliver.

Managing Data

Big Data for Everyone (No Data Scientists Required)

What makes Big Data so darn big? Topics of discussion include (the necessity of) non-traditional solutions for handling Big Data, how those solutions fit into existing architecture, and common pitfalls encountered.

A Showdown at the Database Corral

Oh yes, there are new sheriffs in town. They answer to Cassandra, Drivel, and Drupal. Panelists will talk about use cases for each and their relation to traditional, distributed, and non-relational databases, in addition to other topics of interest for folks with their heads in the clouds.

Data Nerds, Is Big Data Crushing the Web?

How does a business discern differences between Hadoop, bulk raw data and web crawlers as big data solutions? How does the average non-programmer tap into big data’s value? What sorts of tools are available to access big data, and what are their differences? What problems with our current business systems can be fixed to handle big data more manageably? Is it feasible to make big data repositories open source?

Humanizing Data

What the F*** is the Semantic Web

Good question! In this panel geared for everyone with a soupçon of curiosity and a brain, the Semantic Web is defined and discussed. How do web developers become part of it, and what are the business opportunities?

Open Data & What It Means For You

Does the mere thought of open data cause you to quiver in excitement? You’re not alone. More on the open data movement than you could shake a stick at!

Paying with Data: how free services aren’t free

How is your Facebook information being used and how could it be used in the future? How concerned should we all be? Panelists will also discuss current policy on privacy issues online.

Hadoop World 2010 & New Propaganda

Yay! Infochimps is going to Hadoop World 2010. Watch out New York! I (flip) am giving a talk titled “Millionfold Mashups” — I’ll talk about how we store, process and analyze massively numerous datasets and datasets of massive size.

We’re going to order propaganda stickers to give out, and we want to get your feedback on which to print.

Favorites? Terrible puns of your own to add? Want us to send you a set? Let us know in the comments!

  • Live Fast and Leave a Beautiful Corpus at
  • Where Hot Singles come to Dataset
  • Upload Yours.
  • Hadoop-de-doo for you
  • Dammit, No, the Other NLP
  • I’m Consistently Available. Want to see my Partition?
  • Intoxication by Miners is OK at
  • Fit your Curves at
  • Head in the Clouds?
  • Expose your Bits at
  • Support Vector Machines!
  • Free Variables
  • Everyone at our Datacenter has a Nice Rack
  • Bayesians Against Discrimination
  • Map Reduce, Map Reuse, Map Recycle
  • PAXOS in our time
  • Pro Axiom of Choice
  • Big Chimpin’
  • We have the most Cunning Linguists
  • P = NP
  • P != NP

Several of the slogans were shamelessly stolen from this protest by CMU Machine Learning researchers, which I love so much it hurts.

Refreshed Datasets!

By popular demand, we have refreshed our massive corpus of Twitter data. As part of the facelift, some of our API fields have been eliminated, and many more have been added. Trstrank, for instance, will include a new field called Trstquotient, or TQ, which can be used as a spam indicator. (For details on how that works, stay tuned for a forthcoming blog post.) The fields we chose to eliminate from Trstrank (followers, following, and statuses) can be readily accessed via Twitter’s API.
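
For anyone updating client code, here is a minimal Python sketch of reading a Trstrank record in a way that tolerates both the old and the new field sets. The key names (trstrank, tq, followers, following, statuses) and the sample records are assumptions made for illustration only; this post does not show the actual response format, so check a real response before relying on any of them.

```python
# Hypothetical key names throughout: the real Trstrank response format
# is not shown in this post, so treat these as placeholders.
REMOVED_FIELDS = ("followers", "following", "statuses")


def read_trstrank(record):
    """Split a decoded JSON record into (trstrank, tq, leftovers).

    tq is None when the record has no TQ field (e.g. an older call).
    leftovers holds any of the removed fields that are still present.
    """
    trstrank = record.get("trstrank")
    tq = record.get("tq")
    leftovers = {key: record[key] for key in REMOVED_FIELDS if key in record}
    return trstrank, tq, leftovers


# Made-up sample records, one per assumed field set.
old_style = {
    "screen_name": "example",
    "trstrank": 4.2,
    "followers": 120,
    "following": 80,
    "statuses": 950,
}
new_style = {"screen_name": "example", "trstrank": 4.2, "tq": 87}

for record in (old_style, new_style):
    print(read_trstrank(record))
```

Reading fields with a default instead of indexing them directly means the same code keeps working while the old calls are still being served and after they are gone.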

Our new datasets will provide the most accurate and up-to-date reflection of a Twitter user’s measure of influence (Trstrank), activity level (Influencer Metrics), and interactions between two given users (Conversations). The datasets that changed the most, Influencer Metrics and Conversations, have lots of new fields. Influencer Metrics is now a more rigorous way to measure retweets and @replies, both incoming and outgoing, and Conversations gives a full summary of the interactions between two users.

We’re versioning the new API calls, to prevent the unpleasantness that could accompany a rapid switcheroo, but our old calls will be phased out quickly. We welcome your feedback on this exciting update!