Data Mine

Does the Big Data Solution Exist?

What is a Big Data solution and what does it take to make a project successful? Perform your own experiment by posing this question to technology companies in the Big Data space. Then pose the same question to the pure service providers that are focused on Big Data. Finally, pose the same question to a few customers. Here is what I have found:

Technology providers will talk in terms of their specific contribution to the solution. Let’s think of the architectural stack from the bottom up. In the simplest terms, the Big Data solution is enabled by the infrastructure, the platform on which the analytics are performed, data software (which includes everything from data ingestion to statistical analysis), the visualization of the data, and the applications that depend on this solution. It is the sum of these parts, which no one vendor has, that makes up the enabling technologies known as “Big Data.”

Service providers will talk in terms of business needs, aiming to understand what value there is in the data (e.g., use case discoveries, data science engagements, proof-of-value offerings, implementation assistance, and application development).

Customers interested in Big Data are looking to simplify things to get to the incremental and previously unattainable insights that are the promise of Big Data. That journey, however, is a very complex one and one that is not without risk. The customer answer depends on whom you ask. Ask IT and they may talk technology and the partners they prefer. Ask the application team or analytics team and your answers will straddle both the business value discussions and the technology needed to get to those answers. Lastly, the more progressive line-of-business decision makers aren’t interested in the complexities that make up a Big Data solution, but they are interested in the game-changing insight that will allow them to create new service offerings or make the business more efficient as a result of the analytics being performed.

Is it now time to say that all of these answers combined are what make up a Big Data solution? Not quite. Compliance and security are considerations businesses must address. Add to this the deployment options, which include on-premise bare metal, on-premise private cloud, a private secured cloud, a hybrid approach with both data center and cloud resources available, and finally public options like Amazon, Google, AT&T, and others. Not to mention that the talent needed to do all this in-house isn’t readily available to customers of all sizes.

The war to win in the Big Data space is being waged and customers are in the middle of it. Continuing the analogy further, customers would like to sit the war out and have the Big Data solution provided to them, removing the confusion, complexity and concern.

Now ask yourself the question, “What is a Big Data solution and what does it take to make your project successful?” The answer is easier than you think. Ask yourself who has the technology expertise, services capabilities, and customer proof points, who provides flexibility in deployment, and who has the option to provide all of this as a managed service so that you pay for just what you use. Those who provide “The Big Data Solution” exist. You just need to ask the right questions and look in the right places for those answers.

Alan Geary, VP of business development at Infochimps, a CSC Big Data Business, has focused on business and channel development at software and technology companies that have grown through partnering. Alan has a unique combination of Big Data and Cloud experience by working over the last decade at both a Hadoop distribution company and VMware. Both companies doubled revenue year over year with the partnerships playing a significant role in the adoption of both Hadoop and virtualization respectively.

Data Science: State of the Industry

O’Reilly has released their 2013 Data Science Salary Survey, and it’s a treasure trove of interesting information about the work of data science.

One of the most informative things I found was a breakdown of the data tools that were used most often by data scientists.

This confirms a lot of hunches about the state of the industry:

  • SQL is the mack daddy of data science. It is used literally twice as much as Hadoop.

  • Excel and R are the analysis tools of choice. Since both of these tools can do multiple things (analysis and visualization), it makes sense that these would be more popular than single-use tools.

  • Scripting is widespread and diverse. Python, R, JavaScript, and Ruby are the glue of data science, with an especially strong showing for Python.

The big surprise to me was the relative unpopularity of SAS/SPSS. I think this effect may be exaggerated by the nature of the survey population (it was limited to people attending the Strata conference). However, a 4x disparity between R and the legacy vendors really highlights what I see as an accelerating trend towards open tools.

Another fascinating visualization was the breakdown of how different tools are used together by data scientists.

In geek speak, this is a graph that describes the positive and negative correlations between tool usage. Visually, this separates into the traditional I/T world (in blue) and the new Hadoop world (in orange). “Visualization” might be a way to describe the red cluster, although Weka really breaks the mold.
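To make "positive and negative correlations between tool usage" concrete, here is a minimal sketch of how such correlations could be computed from yes/no survey answers. The respondents below are invented toy data, not the O’Reilly survey results, and the helper name `pearson` is only for illustration.

```python
from itertools import combinations
from math import sqrt

# Invented toy data: each row is one hypothetical respondent, 1 = uses the tool.
# These values are NOT taken from the O'Reilly survey.
respondents = [
    {"SQL": 1, "Excel": 1, "Hadoop": 0, "Hive": 0},
    {"SQL": 1, "Excel": 1, "Hadoop": 0, "Hive": 0},
    {"SQL": 1, "Excel": 0, "Hadoop": 1, "Hive": 1},
    {"SQL": 0, "Excel": 0, "Hadoop": 1, "Hive": 1},
    {"SQL": 1, "Excel": 1, "Hadoop": 0, "Hive": 1},
    {"SQL": 0, "Excel": 0, "Hadoop": 1, "Hive": 0},
]


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


tools = sorted(respondents[0])

# One correlation per pair of tools; each pair is an edge in the graph.
correlations = {
    (a, b): pearson([r[a] for r in respondents], [r[b] for r in respondents])
    for a, b in combinations(tools, 2)
}

# Print pairs from most negative to most positive correlation.
for (a, b), r in sorted(correlations.items(), key=lambda item: item[1]):
    print(f"{a:>6} ~ {b:<6} {r:+.2f}")
```

Pairs with a strongly positive value would be drawn close together in a graph like the one described, and pairs with a negative value would land in different clusters.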

What this tells me is that there is a definite geography to the work of data science. If traditional I/T is North America and Hadoop is South America, Tableau would be the Panama Canal, the conduit between the two continents. Also, this picture makes it easy to see why SQL is so popular. Like Starbucks, there’s at least one SQL-like tool in each of the clusters (Hive, MySQL, PostgreSQL, SQL, and SQL Server), with more on the way soon.

Looking at the big picture, this tells us three important things:

  1. Data science can come from anywhere. Innovation does not require the resources of the Fortune 500, nor the specialization of Silicon Valley. The work can leverage the strengths of either environment, and the best people can work anywhere.

  2. Virtually any company either already has or can inexpensively acquire the tools to do data science. If you can download RStudio and have a SQL database, you can start working like the pros.

  3. Data science isn’t thinking about real-time analytics, yet. Storm, Spark, and other tools are still cutting edge. Watch out for this in the 2014 survey.

Thanks, O’Reilly, for the insight into data science and data scientists!

Dhruv Bansal is the chief science officer and co-founder of Infochimps, a CSC Big Data Business. He holds a B.A. in math and physics from Columbia University in New York and attended graduate school for physics at The University of Texas at Austin. For more information, follow him on Twitter at @dhruvbansal.

Nothing so Practical as a Good Theory

The most common error I have encountered among new data science practitioners is forgetting that the goal is not simply knowledge, but actionable insight. This isn’t limited to data scientists. Many analysts get carried away with the wrong metrics, tracking what is easy to measure rather than what is correct to measure. New data scientists get carried away with the latest statistical method or machine learning algorithm, because that’s much more fun than acknowledging that key data are missing.

To create actionable insight, we must start from the action, a choice. Data science is useless if it is not used to make decisions. When starting a project, I first ask how we will measure our progress towards our goals. As my colleague Morgan said last week, this often boils down to revenue, cost, and risk. An economist might bundle that up as time-discounted risk-adjusted future profits. My second task is identifying what decisions we will make in the process of accomplishing these goals.

The choices we make might be between different types of actions or might be between different intensities of an action: which advertising campaign, how much to spend, etc. These choices usually benefit from information. Some choices, such as selecting “red” or “black” at the roulette table, do not benefit from information. The outcome of most choices is partially dependent on information. Knowledge gives us power, but there is some randomness too. We might have hundreds of observations of every American’s response to our spokesperson’s call to action, but the predictive model we generate from that data might not help us after the spokesperson’s embarrassing incident at the golf course. The business case for data science is the estimation of how much information we can gain from our data and how much that information will improve the time-discounted, risk-adjusted benefit of our decisions.
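As a rough illustration of that business case, the sketch below compares the expected, time-discounted profit of a decision made with and without a predictive model. Every number here (the success probabilities, payoffs, cost, and discount rate) is a made-up assumption, and treating "risk-adjusted" as a simple probability weighting is a simplification for the example.

```python
def discounted_value(cash_flows, rate):
    """Present value of yearly cash flows received at the end of years 1, 2, ..."""
    return sum(cf / (1 + rate) ** year for year, cf in enumerate(cash_flows, start=1))


def expected_profit(p_success, payoff_flows, upfront_cost, rate):
    """Probability-weighted, time-discounted payoff minus the upfront cost."""
    return p_success * discounted_value(payoff_flows, rate) - upfront_cost


# Hypothetical campaign: pays 100 per year for three years if it succeeds.
payoffs = [100, 100, 100]
cost = 80
rate = 0.10

# Assumed success rates: 30% when choosing blindly, 45% when a model guides the choice.
baseline = expected_profit(0.30, payoffs, cost, rate)
informed = expected_profit(0.45, payoffs, cost, rate)

# The difference is what the information is worth to this decision.
value_of_information = informed - baseline
print(f"baseline: {baseline:.2f}, informed: {informed:.2f}, "
      f"value of information: {value_of_information:.2f}")
```

If the data science costs more than the value of information it produces, the project does not pay for itself, no matter how elegant the model.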

The third task is picking what metrics to use. A management consultant might call this developing key performance indicators. A statistician might call this variable selection. A machine learning practitioner might call this feature engineering. We transform, combine, filter, and aggregate our data in clever and complex ways. Most critical is picking a good dependent variable, or explained variable. This is the metric you are predicting. This will be the distillation of all our knowledge to a single number.

To pick a good dependent variable, a data scientist must consider the quality of the data available and what predictions they might support, but more importantly, the data scientist must consider the decision improved by our prediction. When choosing whether to eat outside for lunch, we prefer to know the temperature at noon rather than the average temperature for the day. More important would be the chance of rain. The exact temperature to the fraction of a degree is unnecessary. Best of all would be a direct estimate of lunchtime happiness for outside versus inside on a scale of, “Yes, go outside” or “No, stay inside.” Unfortunately, we often cannot pick the most directly representative variable, because it is too difficult to measure. Lunchtime surveys would be expensive to conduct and self-reported happiness might be unreliable. A good dependent variable balances predictive power with decision relevance.

After we have built a great predictive model, the last step is figuring out how to operationalize the knowledge we gained. This is where the data science stops and the traditional engineering, or big data engineering, starts. No matter how great our product recommendations are, they are useless if we do not share those recommendations with the customer in a timely manner. In large enterprises, operationalizing insights often requires complex coordination across teams and business units, as hard a problem as the data science. Keeping this operation in mind from the start of the project will ensure the data science has business value.

Michael Selik is a data scientist at Infochimps. Over his career, he has worked for major enterprises and venture-backed startups delivering sophisticated analysis and technology project management services from hyperlocal demographics inference to market share forecasting. With Infochimps, Michael helps organizations deploy fast, scalable data services. He received an MS in Economics, a BS in Computer Science, and a BS in International Affairs from the Georgia Institute of Technology; he likes bicycles and semi-colons.

Data Science and the Personal Optimization Problem

“What gets measured gets done” is a common refrain. And, to a large extent, that is how the business world works. As Data Scientists, we have an outsized influence on what gets measured (and by extension, what gets done) in a business. This is especially true with the advent of predictive analytics. We have a lot of responsibility, and we need to use it wisely.

Data Scientists need to be proactive to ensure that what we model and predict and measure provides quantifiable value for our organization.  But how can we do this, realistically?  After all, the numbers are the numbers, we are just drawing conclusions from them.  Right?  The truth is that you can have two Data Scientists develop models with the same tools against the same data and one analysis can be significantly more valuable to the people paying the bills.  It is our own personal optimization problem.

A salesperson usually has a number of accounts that bring in revenue. A typical consultant has one or more projects that they can bill hours to. However, if you are in R&D or on staff in a support role, how can you ensure that your data science is valuable to your organization?

As a Data Scientist, the best barometer for the business value of your work is how well it:

  1. Generates Revenue
  2. Reduces Cost
  3. Eliminates Risk

That sounds great, but how does a Data Scientist know that what they are working on is valuable? This can be especially hard to figure out if you are working in a supporting role or are in a shared service environment, such as a centralized data science team in a large organization. My colleagues and I have had long discussions on this subject, and it seems that there is little consensus on how to do this effectively.

However, I have one sure-fire way to make sure that your data science is as valuable to your organization as you are.

Personal Optimization for Data Scientists

For every project that you work on, imagine that your part is going to be used as an entry on your resume in a section marked “Major Accomplishments” (there are lots of resume guides available that talk about how to do this).  Now, think about a hiring manager who is looking at your resume; not some bozo or corporate drone who is just there to fill bodies. Imagine a shark, someone who knows the industry inside and out and wants only to hire the best; someone who knows the data and the math and can sniff out a phony a mile away.

The hiring manager is going to grill you for detailed answers about your major accomplishments.  They want to know what you know and how you learned it.  They want to know what went well and what didn’t.  They want to know if you can do the same (or better) work for them.  They want to make sure that you know the theory and the application, and can deliver on the goods in a timely manner.  This is the definition of the bottom line.

Can you comfortably sit down in front of this person and talk about your major accomplishments?  Is your data science adding to your list of accomplishments?

Making Data Science Count

Data science has some really fantastic tools such as machine learning, data mining, statistics, and predictive modeling.  They are only going to get better in the future. However, we have to remember that these are just tools at our disposal.  Having skilled craftsmen using the best tools is key, but the most important thing we can do is to make sure that we are building the right things.

One of the things I like best about the Infochimps Cloud is that it takes care of all the infrastructure and architecture work in building a Big Data solution, and lets me focus on really figuring out how to make a valuable solution.  I don’t have to worry about building a Hadoop cluster for batch analytics, or stitching together Storm and Elasticsearch and Kibana to deliver real-time visualizations.  I also don’t have to worry about scaling things up if and when my data volume goes through the roof.

When I build with Infochimps, I know that my effort is being harnessed to build out major accomplishments; not to build sandboxes or dither with infrastructure issues. If you would like to learn more about Infochimps and the value of real-time data science, come by and see us at Strata in New York on October 28-30.  See you there!

Morgan Goeller is a Data Scientist at Infochimps, a CSC company. He is a longtime numbers guy with a B.S. in Mathematics and background in Hadoop, ETL, and Data Warehousing. Morgan lives in Austin, Texas with his wife, sons, and many cats and dogs.

Splice Data Scientist DNA Into Your Existing Team

As organizations continue to grapple with Big Data demands, they may find that business managers who understand data may meet their “data scientist” needs better than the hard-core data technologists.

There’s little doubt that data-derived insight will be a key differentiator in business success, and even less doubt that those who produce such insight are going to be in very high demand. Harvard Business Review called “data scientist” the “sexiest” job of the 21st century, and McKinsey predicts a shortfall of about 140,000 by 2018. Yet most companies are still clueless as to how they’re going to meet this shortfall.

Unfortunately, the job description for a data scientist has become quite lofty. Unless your company is Google-level cool, you’re going to struggle to hire your Big Data dream team (well, at least right now), and few firms out there could recruit them for you. Ultimately, most organizations will need to enlist the support of existing staff to achieve their data-driven goals, and train them to become data scientists. To accomplish this, you must determine the basic elements of data scientist “DNA” and strategically splice it into the right people.

Serial entrepreneur Jim Kaskade, CEO of Infochimps, the company that is bringing Big Data to the cloud, has been leading startups from their founding to acquisition for more than ten years of his 25 years in technology. Prior to Infochimps, Jim was an Entrepreneur-in-Residence at PARC, a Xerox company, where he established PARC’s Big Data program, and helped build its Private Cloud platform. Jim also served as the SVP, General Manager and Chief of Cloud at SIOS Technology, where he led global cloud strategy. Jim started his analytics and data-warehousing career working at Teradata for 10 years, where he initiated the company’s in-database analytics and data mining programs.

Big Data and Banking – More than Hadoop

Fraud is definitely top of mind for all banks. Steve Rosenbush at the Wall Street Journal recently wrote about Visa’s new Big Data analytic engine which has changed the way the company combats fraud. Visa estimates that its new Big Data fraud platform has identified $2 billion in potential annual incremental fraud savings. With Big Data, their new analytic engine can study as many as 500 aspects of a transaction at once. That’s a sharp improvement from the company’s previous analytic engine, which could study only 40 aspects at once. And instead of using just one analytic model, Visa now operates 16 models, covering different segments of its market, such as geographic regions.

Do you think Visa, or any bank for that matter, uses just batch analytics to provide fraud detection? Hadoop can play a significant role in building models. However, only a real-time solution will allow you to take those models and apply them in a timeframe that can make an impact.
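The split between building a model in batch and applying it in real time can be sketched in a few lines. This is only a toy illustration: the z-score rule, the threshold, and the transaction amounts are assumptions made up for the example, and they say nothing about how Visa's models actually work.

```python
from statistics import mean, stdev


def fit_amount_model(historical_amounts):
    """Batch step: summarize past transaction amounts (the kind of job Hadoop might run)."""
    return {"mean": mean(historical_amounts), "stdev": stdev(historical_amounts)}


def score_stream(model, transactions, z_threshold=3.0):
    """Real-time step: flag each transaction as it arrives, using the batch-built model."""
    for tx in transactions:
        z = (tx["amount"] - model["mean"]) / model["stdev"]
        yield tx["id"], abs(z) > z_threshold


# Made-up history of ordinary purchase amounts.
history = [20.0, 35.5, 18.2, 42.0, 27.9, 31.4, 25.0, 38.6]
model = fit_amount_model(history)

# An iterator stands in for a live feed of incoming transactions.
incoming = iter([
    {"id": "t1", "amount": 29.0},
    {"id": "t2", "amount": 950.0},
])

flags = dict(score_stream(model, incoming))
print(flags)
```

The model can be rebuilt nightly in batch, but the scoring loop has to run while the customer is still at the register, which is the part batch analytics alone cannot deliver.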

The banking industry is based on data – the products and services in banking have no physical presence – and as a consequence, banks have to contend with ever-increasing volumes (and velocity, and variety) of data. Beyond the basic transactional data concerning debits/credits and payments, banks now:

  • Gather data from many external sources (including news) to gain insight into their risk position;
  • Chart their brand’s reputation in social media and other online forums.

This data is both structured and unstructured, as well as very time-critical. And, of course, in all cases financial data is highly sensitive and often subject to extensive regulation. By applying advanced analytics, the bank can turn this volume, velocity, and variety of data into actionable, real-time and secure intelligence with applications including:

  • Customer experience
  • Risk Management
  • Operations Optimization

It’s important to note that applying new technologies like Hadoop is only a start (it addresses 20% of the solution). Turning your insights into real-time actions will require additional Big Data technologies that help you “operationalize” the output of your batch analytics.

Customer Experience

Banks are trying to become more focused on the specific needs of their customers and less on the products that they offer. They need to:

  • Engage customers in interactive/personalized conversations (real-time)
  • Provide a consistent, cross-channel experience including real-time touch points like web and mobile
  • Act at critical moments in the customer sales cycle (in the moment)
  • Market and sell based on customer real-time activities

Noticing a general theme here? Big Data can assist banks with this transformation and reduce the cost of customer acquisition, increase retention, increase customer acceptance of marketing offers, increase sales by targeted marketing activities, and increase brand loyalty and trust. Big Data presents a phenomenal opportunity. However, the definition of Big Data HAS to be broader than Hadoop.

Big Data promises the following technology solutions to help with this transformation:

  • Single View of Customer (all detailed data in one location)
  • Targeted Marketing with micro-segmentation (sophisticated analytics on ALL of the data)
  • Multichannel Customer Experience (operationalizing back out to all the customer touch points)

Risk Management

Risk management is also critically important to the bank. Risk management needs to be pervasive within the organizational culture and operating model of the bank in order to make risk-aware business decisions, allocate capital appropriately, and reduce the cost of compliance. Ultimately, this means making data analytics as accessible as it is at Yahoo! If the bank could provide a “data playground” where all data sources were readily available with tools that were easy to use…well, let’s just say that new risk management products would be popping up left and right.

Big Data promises a way of providing the organization integrated risk management solutions, covering:


  • Financial Risk (Risk Architecture, Data Architecture, Risk Analytics, Performance & reporting)
  • Operational Risk & Compliance
  • Financial Crimes (AML, Fraud, Case Management)
  • IT Risk (Security, Business Continuity and Resilience)

The key is to focus on one use-case first, and expand from there. But no matter which risk use-case you attack first, you will need batch, ad hoc, and real-time analytics.

Operations Optimization

Large banks often become unwieldy organizations through many acquisitions. Increasing flexibility and streamlining operations is therefore even more important in today’s more competitive banking industry. A bank that is able to increase its flexibility and streamline operations by transforming its core functions will be able to drive higher growth and profits; develop more modular back-office systems; and respond quickly to changing business needs in a highly flexible environment.

This means that banks need new core infrastructure solutions. One example might be a bank reducing loan origination times by using Big Data to standardize its loan processes across all entities. Streamlining and automating these business processes will result in higher loan profitability, while complying with new government mandates.

Operational leverage improves when banks can deliver global, regional and local transaction and payment services efficiently and also when they use transaction insights to deliver the right services at the right price to the right clients.

Many banks are seeking to innovate in the areas of processing, data management and supply chain optimization. For example, in the past, when new payment business needs would arise, the bank would often build a payments solution from scratch to address it, leading to a fragmented and complex payments infrastructure. With Big Data technologies, the bank can develop an enterprise payments hub solution that gives a better understanding of product and payments platform utilization and improved efficiency.

Are you a bank and interested in new Big Data technologies like Hadoop, NoSQL datastores, and real-time stream processing? Interested in one integrated platform of all three?

Jim Kaskade serves as CEO of Austin-based Infochimps, the leading Big Data Platform-as-a-Service provider. Jim is a visionary leader in both large and small company environments with over 25 years of experience building hi-tech businesses, leading startups in cloud computing enterprise software, software as a service (SaaS), online and mobile digital media, online and mobile advertising, and semiconductors from their founding to acquisition.

Customized, Intelligent, Vertical Applications – The Future of Big Data?

The Ideal Big Data Application Development Environment

Let’s assume that your entire organization had access to the following building blocks:

  • Data: All sources of data from the enterprise (at rest and in motion)
  • Analytics: Any/All Queries, Algorithms, Machine Learning Models
  • Application Business Logic: Domain specific use-cases / business problems
  • Actionable Insights: Knowledge of how to apply analytics against data through the use of application business logic to produce a positive impact to the business
  • Infrastructure Configuration: Highly scalable, distributed, enterprise-class infrastructure capable of combining data and analytics with app logic to produce actionable insights

Imagine if your entire organization was empowered to produce data-driven applications tailored specifically for your vertical use-cases?

Data-Driven Vertical Apps

Banking

You are a regional bank that is under heavier regulation, focused on risk management, and expanding your mobile offerings. You are seeking ways to get ahead of your competition through the use of Big Data by optimizing financial decisions and yields.

What if there was an easy and automated way to define new data sources, create new algorithms, apply these to gain better insight into your risk position, and ultimately operationalize all this by improving your ability to reject and accept loans?

Retailer

You are a retailer who is being affected by the economic downturn, demographic shifts, and new competition from online sources. You are seeking ways of leveraging the fact that your customers are empowered by mobile and social by transforming the shopping experience through the use of Big Data.

What if there was an easy and automated way to capture all customer touch points, create new segmentation and customer experience analytics, apply these to create a customized cross-channel solution which integrates online shopping with social media, personalized promotions, and relevant content?

Telecommunications

You are a fixed line operator, wireless network provider, or fixed broadband provider who is in the middle of convergence of both services and networks, and feeling price pressures of existing services. You are seeking ways to leverage cloud and Big Data to create smarter networks (autonomous and self-analyzing), smarter operations (improving working efficiency and capacity of day-to-day operations), and ways to leverage subscriber demographic data to create new data products and services to partners.

What if there was an easy and automated way to start by consuming additional data across the organization, and then deploy segmentation analytics to better target customers and increase ARPU?

It Starts With The “Infrastructure Recipe”

OK. You are a member of the application development team. All you have to do is create a data-driven application “deploy package.” It’s your recipe of all the data sources, analytics, and application logic needed to insert into this magical cloud service that produces your industry and use-case specific application. You don’t need to be an analytics expert. You don’t need to be a DBA, an ETL expert or even a Big Data technologist. All you need is a clear understanding of your business problem, and you can assemble the parts through a simple-to-use “recipe” which is abstracted from the details of the infrastructure used to execute on that recipe.
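To picture what such a recipe might contain, here is a purely hypothetical sketch. The class names, fields, and location strings are invented for illustration and do not describe any real product's package format.

```python
from dataclasses import dataclass, field


@dataclass
class DataSource:
    name: str
    location: str  # where the data lives (hypothetical URI-style string)
    mode: str      # "batch" for historic data, "stream" for live events


@dataclass
class DeployPackage:
    use_case: str
    sources: list = field(default_factory=list)    # DataSource entries
    analytics: list = field(default_factory=list)  # names of algorithms to apply
    app_logic: str = ""                            # entry point of the business logic

    def problems(self):
        """Return a list of reasons the recipe is incomplete (empty if it looks fine)."""
        found = []
        if not self.sources:
            found.append("no data sources declared")
        if not self.analytics:
            found.append("no analytics declared")
        if not self.app_logic:
            found.append("no application logic declared")
        for source in self.sources:
            if source.mode not in ("batch", "stream"):
                found.append(f"unknown mode for source {source.name!r}: {source.mode!r}")
        return found


# A made-up recipe for the regional bank scenario.
loan_package = DeployPackage(
    use_case="loan approval",
    sources=[
        DataSource("loan_history", "warehouse://loans", "batch"),
        DataSource("applications", "queue://new-applications", "stream"),
    ],
    analytics=["default_risk_score"],
    app_logic="loan_app.decide",
)
print(loan_package.problems())
```

The point of the sketch is that the recipe names the what (sources, analytics, logic) and leaves the how (clusters, queues, schedulers) to the service that executes it.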

Any Data Source

Imagine an environment where your enterprise data is at your fingertips – no heavy ETL tools, no database exports, no Hadoop Flume or Sqoop jobs. Access to data is as simple as defining “nouns” in a sentence. Where your data lives is not a worry. You are equipped with the magic ability to simply define what the data source is and where it lives, and accessing it is automated. You also care less whether the data is some large historic volume living in a relational database or whether it is real-time streaming event data.

Analytics Made Easy

Imagine a world where you can pick from literally thousands of algorithms and apply them to any of the above data sources in part or in combination. You create one algorithm and can apply it to years of historic data and/or a stream of live real-time data. Also, imagine a world where configuring your data in a format that your algorithms can consume is made seamless. Lastly, your algorithms execute on infrastructure in a parallel, distributed, highly scalable way. Getting excited yet?
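The "one algorithm, historic or live data" idea can be sketched with a function that only assumes its input can be iterated. The running average and the sample values are assumptions chosen for the example; a real platform would handle distribution and scale behind the scenes.

```python
def running_average(values):
    """Yield the average of everything seen so far, one result per input value."""
    total = 0.0
    for count, value in enumerate(values, start=1):
        total += value
        yield total / count


# Historic data: a list that is already fully available.
historic = [10, 20, 30]
batch_result = list(running_average(historic))[-1]


def live_feed():
    """Stand-in for a real-time source; values arrive one at a time."""
    for value in (10, 20, 30):
        yield value


# Live data: the same algorithm consumes the stream as values arrive.
stream_results = list(running_average(live_feed()))

print(batch_result, stream_results)
```

Because the algorithm never asks where the values come from, the same code serves the batch job and the streaming job.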

Focus on Applications With Actionable Insights

Now let’s embody this combination of analytics and data in a way that can actually be consumed and acted upon. Imagine a world where you can produce your insights and report on them with your BI tool of choice. That’s kind of exciting.

But what’s even more exciting is the ability to deploy your insights operationally through an application that leverages your domain expertise and understanding of the business logic associated with the targeted use-case you are solving. Translation – you can code up a Java, Python, PHP, or Ruby application that is light, simple, and easy to build/maintain. Why? Because the underlying logic normally embedded in ETL tools, separate analytics software tools, MapReduce code, NoSQL queries and stream processing logic is pushed up into the hands of application developers. Drooling yet?  Wait, it gets better.

Big Data, Cloud and The Enterprise

Let’s take this entire application paradigm and automate it within an elastic cloud service purpose-built for the organization. You have the ability to submit your application “deploy packages” to be instantly processed without having to understand the compute infrastructure and, better yet, without having to understand the underlying data analytic services required to process your various data sources in real-time, near real-time or in batch modes.

Ok…if we had such an environment, we’d all be producing a ton of next-generation applications…data-driven, highly intelligent and specific to our industry and use-cases.

I’m ready…are you?

Jim Kaskade serves as CEO of Austin-based Infochimps, the leading Big Data Platform-as-a-Service provider. Jim is a visionary leader in both large and small company environments with over 25 years of experience building hi-tech businesses, leading startups in cloud computing enterprise software, software as a service (SaaS), online and mobile digital media, online and mobile advertising, and semiconductors from their founding to acquisition.

Fact or Fiction: Big Data and Marketing Myth Busters

  • Amanda McGuckin Hager

As the uses of Big Data continue to evolve with the creation of platforms and dashboards that promise in-the-moment marketing feedback, skepticism remains about whether they can deliver on that promise: real-time, decision-making analytics that are actionable and accessible without a team of data scientists.

Though advancements are being made every day, with plenty of refinement still to come, some uncertainty naturally remains around Big Data and what it can actually offer marketers. Here are a few of the most common misconceptions fueling such apprehension:

Myth: Campaigns take weeks, if not months to execute.

Truth: Big Data makes real-time campaigns a reality.

As stated simply by GigaOm’s Ravi Mhatre, “Big Data is useless unless it’s also fast”. At a time when social and mobile walk hand-in-hand, marketing departments must be agile, and capable of acting at the drop of a hat (or tweet). The fact is that Big Data has entered an era where “real-time” is possible, and business dashboards power crucial decisions in-the-moment.

Myth: Marketers must still rely on “gut” decisions, which may or may not be reliable.

Truth: Big Data powers success through simple data-driven decisions.

Another common theme among skeptics is the notion that the only people equipped to understand Big Data insights are the data scientists that are siloed in departments and organizations far away from management and marketing. This is simply not the case.

Widely available tools allow marketers and other business experts to derive data-driven insight without having the technical expertise of a data scientist. Marketers can now perform sophisticated analytics to deliver truly actionable information about efficiencies (or lack thereof) within a business, as well as tangible insights about customers.

Myth: Much data is useless.

Truth: All data is powerful; Big Data makes it possible for a business to find unexpected stories and insights.

With traditional techniques, data storage is expensive, and therefore finite. Consequently, companies have had to pick and choose which data is important enough to keep, and have thrown away data which actually could have yielded valuable insight. With new Big Data technologies drastically reducing the storage and management price tag, companies now have the freedom to save and analyze everything – those who do will quickly begin to uncover gems.

As buzz builds around potential enterprise Big Data use-cases, so do hesitation and concern that this is just a fleeting trend; but this couldn’t be further from the truth.

While we’ve finally reached the point where it is feasible for businesses to tap and start to understand the data that streams from their market, operations, and customers, there’s still much room for refinement. Although one can justifiably state that we’ve entered the era of Big Data, where campaigns can be executed quickly and insights can be pulled in real-time, we’ve just hit the tip of the iceberg in terms of the untapped potential of these technologies.

Amanda McGuckin Hager is a high-tech marketing professional with over 17 years of experience focused on driving demand through strategic marketing programs. She is the Director of Marketing at Infochimps. Follow Amanda on Twitter.

The 3 Waypoints of a Data Exploration

Part of our goal is to unlock the big data stack for exploratory analytics.

How do you know when you’ve found the right questions? That you’ve gone deep enough to trust the answers? Here’s one sign.

The 3 Waypoints of a Data Exploration:

  • What you knew: is it validated by the data?
  • What you suspect: how do your hypotheses agree with reality?
  • What you would have never suspected: something unpredictable in advance?

In Practice:
A while back, a friend asked me about signals in the Twitter stream for things like “Spanglish”: multiple languages mixed in the same message. I did a simple exploration of tweets from around the world (simplifying at first to non-English languages) to see how easy such messages are to find.

I took 100 million tweets and looked for only those “non-keyboard” characters — é (e with acute accent) or 猿 (Kanji character meaning ‘ape’) or even ☃ (snowman).

Using all the cases where there were two non-keyboard characters in the same message, I assembled the following graph.

Imagine tying a little rubber band between every pair of characters, as strong as the number of times they were seen hanging out together; also, give every character the desire for a bit of personal space so they don’t just pile on top of each other. It’s a super-simple model that tools like Cytoscape or Gephi will do out-of-the-box.

That gave this picture (I left out the edges for clarity and hand-arranged the clusters at the bottom):

[Image: 3 Waypoints character map]

This “map” of the world — the composition of each island, and the arrangement of the large central archipelago — popped out of this super-simplistic model. It had no information about human languages other than “sometimes, when a person says 情報 they also say 猿.” Any time the data is this dense and connected, I’ve found it speaks for itself.
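
The pair-counting step behind a map like this can be sketched in a few lines of Python. This is a minimal illustration, not the script used for the original exploration: it assumes “non-keyboard” simply means non-ASCII, and the function names and sample messages are made up for the example.

```python
from collections import Counter
from itertools import combinations


def non_keyboard_chars(message):
    """Return the distinct non-ASCII, non-whitespace characters in a message, sorted."""
    # Assumption: "non-keyboard" is approximated as "outside the ASCII range".
    return sorted({ch for ch in message if ord(ch) > 127 and not ch.isspace()})


def cooccurrence_counts(messages):
    """Count how many messages each unordered pair of non-keyboard characters shares."""
    counts = Counter()
    for message in messages:
        chars = non_keyboard_chars(message)
        # combinations() over a sorted list yields each unordered pair once,
        # always in the same order, so (a, b) and (b, a) share one key.
        counts.update(combinations(chars, 2))
    return counts


# Hypothetical sample messages standing in for the tweet stream.
sample = ["情報 と 猿", "猿 ☃ 情報", "plain ascii only"]
edges = cooccurrence_counts(sample)

for (a, b), weight in edges.most_common(3):
    print(a, b, weight)
```

Each entry in `edges` is one “rubber band”: the pair of characters is the edge and the count is its strength. Written out as a weighted edge list, that is the kind of input a force-directed layout in a tool like Gephi or Cytoscape works from.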

Now let’s look at the 3 Waypoints.

What We Knew: What I really mean by “knew” is “if this isn’t the case, I’m going to suspect my methods much more strongly than the results”:

  • Most messages are in a single language, but there are some crossovers. After the fact, I colored each character by its “script” type from the Unicode standard (e.g., Hangul is in cyan). As you can see, most of the clouds have a single color.
  • Languages with large alphabets have tighter-bound clouds, because there are more “pairs” to find (e.g., the Hiragana character cloud is denser than the Arabic cloud).
  • Languages with smaller representation don’t show up as strongly (e.g., there are not as many Malayalam tweeters as Russian (Cyrillic) tweeters).
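
The script coloring mentioned in the first bullet can be roughly approximated with Python’s standard library. Note the assumption: the `unicodedata` module does not expose the Unicode Script property itself, so this sketch uses the first word of each character’s Unicode name as a stand-in. That works for most letters (LATIN, CYRILLIC, HANGUL, KATAKANA) but not for symbols, whose names carry no script. The helper names are hypothetical.

```python
import unicodedata
from collections import defaultdict


def rough_script(ch):
    """Approximate a character's script from the first word of its Unicode name."""
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        # Unassigned code points and control characters have no name.
        return "UNKNOWN"


def group_by_script(chars):
    """Bucket characters by their approximate script, preserving input order."""
    groups = defaultdict(list)
    for ch in chars:
        groups[rough_script(ch)].append(ch)
    return dict(groups)


print(group_by_script("жд한ツ猿"))
```

Assigning one color per bucket gives the kind of single-color clouds described above. A faithful version would read the Script property from the Unicode Character Database rather than guessing from names.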

What We Suspected:

First, about the clusters themselves:

  • Characters from Latin scripts (the accented versions of the characters English speakers are familiar with) do indeed cluster together, and group within that cluster. Many languages use ö, but only subsets of them use Å or ß. You can see rough groups for Scandinavian, Romance and Eastern-European scripts.
  • Japanese and Chinese are mashed together, because both use characters from the Han script.

Second, about the binds between languages. Clusters will arrange themselves in the large based on how many co-usages were found. A separated character dragged out in the open is especially interesting — somehow no single language “owns” that character.

Things we suspected about the connections:

  • Nearby countries will show more “mashups”.  Indeed, Greek and Cyrillic are tightly bound to each other, and loosely bound to European scripts; Korean has strong ties to European and Japanese/Chinese scripts. This initial assumption was partially incorrect though — Thai appears to have stronger ties to European than to Japanese/Chinese scripts.
  • Punctuation, Math and Music are universal. Look closely and you’ll see the fringe of brownish characters pulled out into “international waters”.

What We Never Suspected in Advance: There were two standouts that slapped me in the face when taking a closer look.

The first is the island in the lower right, off the coast of Europe. It’s a bizarre menagerie of Amharic, International Phonetic Alphabet and other scripts. What’s going on? These are characters that, taken together, look like upside-down English text: “¡pnolɔ ǝɥʇ uı ɐʇɐp ƃıq”. (Try it out yourself.) My friend Steve Watt’s reaction was, “So you’re saying that within the complexity of the designed-for-robots Unicode standard, people found some novel, human way to communicate? Enterprises and Three Letter Agencies dedicate tons of resources for such findings.”

As soon as you’ve found a new question within your answers you’ve reached Waypoint 3 — a good sign for confidence in your results.

However, my favorite is the one single blue (Katakana) character that every language binds to (see close-up below). Why is Unicode code point U+30C4, the Katakana “Tsu” character, so fascinating?

[Image: 3 Waypoints smiley close-up]

Because ツ looks like a smiley face.
The common bond across all of humanity is a smile.

S3Chimp: Information Science in Action

I’m Selene Arrazolo, Infochimps’ new Analyst. Prior to my new position, I was an Infochimps intern. I recently graduated from the School of Information at the University of Texas with a Master of Science in Information Studies. As part of my MSIS degree plan, I completed a semester-long project entitled “Developing and Integrating a Lightweight Metadata System into a Data Ingestion Workflow” here at Infochimps, Inc.

The main ingredients of the project were Ruby on Rails, MongoDB, and everyone’s favorite, Amazon Web Services. The result is an alpha stage of the tentatively named S3Chimp. It is an addition to Dashpot, our Analytics & Operations Dashboard for the Infochimps Platform. Dashpot boasts an easy-to-use analytics and operations dashboard that provides business metrics and visualization, cluster management capabilities, and system monitoring on top of the Infochimps Platform. Integrating a lightweight metadata system into the workflow makes it possible for Dashpot to also track and organize distributed massive-scale data assets. What was once time-consuming (according to us as well as various people in the industry) can now be a dynamic part of an organization’s internal analytics.

Before I could begin making S3Chimp, organizing the Infochimps Amazon S3 buckets was key. Perhaps a company that boasts about its command of data should have a beautifully organized set of buckets? Perhaps…. But let’s pretend that is not the case. And let us imagine that a young and excited Information Studies graduate student decides to tackle the S3 clutter. The essential steps in such a scenario include designing a well-thought-out schema guideline tailored to the company’s needs and data types, and incessantly enforcing those guidelines.

Next on the list was learning Ruby on Rails, over several weeks. It was a baptism by fire. I learned the very basics of Ruby on Rails and how to love the MVC trinity. Ruby on Rails is a smart and fun web app framework and it was an enjoyable experience, relative to PHP. Relative to a Saturday afternoon at Barton Springs? Not so much.

With a snazzy script written in the enchanted Infochimps Data Mine, I was able to take the most exciting leap, which was taking metadata from the now beautifully organized S3 buckets and injecting it into MongoDB, a NoSQL database. The result is the S3Chimp genesis. S3Chimp is a system that tells you what data, and how much of it, is in AWS, all from your analytics dashboard. Future plans for this product include making a tool to capture provenance metadata, and other goodies.
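
To give a feel for the “what data and how much of it” roll-up, here is a minimal Python sketch. It is not the S3Chimp script itself, and the document schema, bucket name, and sample keys are all hypothetical. It assumes the object listing has already been fetched and that each entry carries `Key` and `Size` fields, as entries returned by boto3’s `list_objects_v2` do. The resulting documents are plain dictionaries of the sort that could be handed to a MongoDB collection.

```python
from collections import defaultdict


def summarize_by_prefix(bucket, objects):
    """Roll up S3 object listings into one metadata document per top-level prefix."""
    totals = defaultdict(lambda: {"object_count": 0, "total_bytes": 0})
    for obj in objects:
        key = obj["Key"]
        # Keys without a "/" sit at the bucket root; group them under "".
        prefix = key.split("/", 1)[0] if "/" in key else ""
        totals[prefix]["object_count"] += 1
        totals[prefix]["total_bytes"] += obj["Size"]
    # Hypothetical document schema: one dictionary per (bucket, prefix).
    return [
        {"bucket": bucket, "prefix": prefix, **stats}
        for prefix, stats in sorted(totals.items())
    ]


# Made-up listing standing in for a real bucket.
listing = [
    {"Key": "twitter/2012/01/part-0000.json", "Size": 1200},
    {"Key": "twitter/2012/02/part-0000.json", "Size": 800},
    {"Key": "geo/places.tsv", "Size": 500},
    {"Key": "README.txt", "Size": 20},
]
docs = summarize_by_prefix("example-bucket", listing)

for doc in docs:
    print(doc)
```

Grouping by top-level prefix only pays off once the buckets follow a consistent naming scheme, which is why the cleanup step came first.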

You can find me at the upcoming MongoDB NYC conference if you’d like to ask me about our awesome new Ironfan Platform, Dashpot, or my Capstone project.

I’d like to thank my Field Supervisor, Flip Kromer as well as my Faculty Adviser, Dr. Melanie Feinberg.

Keep an eye out for my next blog post, where I will be chronicling my personal Ruby on Rails adventure that is near and dear to my librarian heart. Travis Dempsey and I will make an in-house database of our office library’s catalog. The Bukfin Repostiry’s catalog is currently housed in LibraryThing.