- May 21, 2012
I’m Selene, Infochimps’ new Analyst. Prior to my new position, I was an Infochimps intern. I recently graduated from the School of Information at the University of Texas with a Master of Science in Information Studies. As part of my MSIS degree plan, I completed a semester-long project entitled Developing and Integrating a Lightweight Metadata System into a Data Ingestion Workflow here at Infochimps, Inc.
The main ingredients of the project were Ruby on Rails, MongoDB, and everyone’s favorite, Amazon Web Services. The result is an alpha stage of the tentatively named S3Chimp. It is an addition to Dashpot, our Analytics & Operations Dashboard for the Infochimps Platform. Dashpot is an easy-to-use dashboard that provides business metrics and visualization, cluster management capabilities, and system monitoring on top of the Infochimps Platform. Integrating a lightweight metadata system into the workflow makes it possible for Dashpot to also track and organize distributed massive-scale data assets. What was once time-consuming (according to us as well as various people in the industry) can now be a dynamic part of an organization’s internal analytics.
Before I could begin making S3Chimp, organizing the Infochimps Amazon S3 Buckets was key. Perhaps a company that boasts about its command of data should have a beautifully organized set of buckets? Perhaps…. But let’s pretend that is not the case. And let us imagine that a young and excited Information Studies graduate student decides to tackle the S3 clutter. The essential steps in such a scenario include designing a thought-out schema guideline tailored to the company’s needs and data types, and incessantly enforcing those guidelines.
Next on the list was learning Ruby on Rails, over several weeks. It was a baptism by fire. I learned the very basics of Ruby on Rails and how to love the MVC trinity. Ruby on Rails is a smart and fun web app framework and it was an enjoyable experience, relative to PHP. Relative to a Saturday afternoon at Barton Springs? Not so much.
With a snazzy script written in the enchanted Infochimps Data Mine, I was able to take the most exciting leap: taking metadata from the now beautifully organized S3 buckets and injecting it into MongoDB, a NoSQL database. The result is the S3Chimp genesis. S3Chimp is a system that tells you what data, and how much of it, is in AWS, all from your analytics dashboard. Future plans for this product include making a tool to capture provenance metadata, and other goodies.
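To make the idea concrete, here is a minimal Ruby sketch of that kind of step. It is not the actual S3Chimp script: the bucket name, the sample listing, the field names, and the helpers `metadata_doc` and `bytes_by_category` are all hypothetical. It assumes each S3 object is described by its key, size, and last-modified time, and that the first path segment of a key names the data category. It makes no AWS or MongoDB calls; in a real setup the listing would come from an S3 client and the resulting documents would be written with a MongoDB driver.

```ruby
require 'time'

# Hypothetical sample of an S3 object listing (key, size in bytes, timestamp).
LISTING = [
  { key: 'geo/census/2010/part-0000.tsv', size: 1_048_576, last_modified: '2012-03-01T12:00:00Z' },
  { key: 'geo/census/2010/part-0001.tsv', size: 524_288,   last_modified: '2012-03-01T12:05:00Z' },
  { key: 'social/twitter/sample.json',    size: 2_048,     last_modified: '2012-04-15T08:30:00Z' }
].freeze

# Build one metadata document per object, shaped for a document store.
# Assumption: the first path segment of the key is the data category.
def metadata_doc(bucket, entry)
  {
    bucket: bucket,
    key: entry[:key],
    category: entry[:key].split('/').first,
    format: File.extname(entry[:key]).delete('.'),
    bytes: entry[:size],
    last_modified: Time.iso8601(entry[:last_modified])
  }
end

# Answer "what data, and how much of it": total bytes per category.
def bytes_by_category(docs)
  docs.each_with_object(Hash.new(0)) do |doc, totals|
    totals[doc[:category]] += doc[:bytes]
  end
end

docs = LISTING.map { |entry| metadata_doc('example-bucket', entry) }
totals = bytes_by_category(docs)
totals.each { |category, bytes| puts "#{category}: #{bytes} bytes" }
```

Keeping the metadata as plain hashes means each one maps directly onto a MongoDB document, and the per-category totals are the sort of figure a dashboard could display.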
I’d like to thank my Field Supervisor, Flip Kromer, as well as my Faculty Adviser, Dr. Melanie Feinberg.
Keep an eye out for my next blog post, where I will be chronicling my personal Ruby on Rails adventure that is near and dear to my librarian heart. Travis Dempsey and I will make an in-house database of our office library’s catalog. The Bukfin Repostiry’s catalog is currently housed in LibraryThing.