Data wrangling is the latest clever way to describe the lengthy, cumbersome data preparation process required to make modern complex data (a.k.a. post-relational, multi-structured, semi-structured…) fit legacy (read: relational) analytics tools. But let’s face it: we are analysts, product managers, and software architects responsible for real outcomes. We are not digital cowboys out on the data “range.”
“‘Data Wrangling’ and ‘Data Preparation’ have exploded over the last few years for one simple reason: data has gotten much more complex, and our analytics tooling has not kept up, period!”
While the basic idea of preparing data may not seem so bad, it is a step backwards. In the years before the rise of Data Wrangling, companies leveraged BI solutions like Business Objects, MicroStrategy, and others. Once in place, they just worked: users could fire up the tool and start asking questions against just about any RDBMS data source. Our new data, the so-called post-relational kinds, adds a layer of complexity that those tools were never designed to handle and that is not compatible with the way things have always been done.
The Wrangle Wrestle That Rankles
If you’re going to “wrangle data,” get ready for as many as six individual steps to force data that is not uniform or tabular, such as JSON, XML, or ragged CSV files, into flat, completely homogeneous tables. Only at that point can legacy analytics tools consume the data and produce charts and graphs for end users; a sketch of that before-and-after follows below. Today, it is estimated that data scientists spend 50-90% of their time on Data Wrangling and data prep, not actual analysis.
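To make those steps concrete, here is a minimal sketch (every name invented for illustration, not taken from any particular tool) of what flattening actually demands: a single nested JSON order, and the homogeneous table a wrangling pipeline has to shred it into before a legacy BI tool will touch it.

```sql
-- One JSON order as it lives in the source system (illustrative):
-- {
--   "orderId": 17,
--   "customer": { "name": "Ada", "city": "Austin" },
--   "items": [ { "sku": "A1", "qty": 2 }, { "sku": "B7", "qty": 1 } ]
-- }

-- The flat, homogeneous target a legacy BI tool requires. The nested
-- customer object becomes prefixed columns, and the items array is
-- exploded into one row per element, duplicating the parent fields.
CREATE TABLE orders_flat (
  order_id      INTEGER,
  customer_name VARCHAR(100),
  customer_city VARCHAR(100),
  item_sku      VARCHAR(20),
  item_qty      INTEGER
);

-- Two array elements become two rows for the same order:
-- (17, 'Ada', 'Austin', 'A1', 2)
-- (17, 'Ada', 'Austin', 'B7', 1)
```

Multiply that by every nested object, every array, and every document whose shape does not quite match its neighbors, and the 50-90% figure starts to look plausible.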
While there are certainly use cases for this kind of process, like a data scientist working across a variety of complex data sets, in many cases users are just trying to make fairly small and simple (not “Big Data”) files fit a form their outdated analytics tools can understand. People want insight. Sooner rather than later.
Send the Analytics to the Data: There’s No Toll That Way
The more obvious answer is to perform the analysis on the data natively, as it actually exists, without any data wrangling at all. Why not use tools that understand modern data and all its complexity, without all the manual intervention? A sketch of what that looks like follows below.
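For instance, here is a hedged sketch of querying the nested order documents from the earlier example in place, written in the SQL²-style dialect discussed later in this piece. The path and field names are invented, and the exact syntax varies by engine version; the point is the shape of the interaction.

```sql
-- Query the documents as they actually exist: dot-paths reach into
-- nested objects, and [*] walks array elements in place. No upfront
-- flattening pass, no throwaway staging tables.
SELECT customer.name, items[*].sku, items[*].qty
FROM "/warehouse/orders"
WHERE customer.city = "Austin"
```

The "wrangling" here is zero steps: the query engine, not a human, absorbs the complexity of the data's shape.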
Wrangle At Your Own Risk
So what happens to the data when you “wrangle” it from its native form into a new one? In many cases, you lose actual data. In most cases, your analytic fidelity goes down. At its worst, this kind of transformation can make it impossible to answer entire classes of queries, as the sketch below illustrates. So why do it? Why waste the time and money? In most cases, the short answer may simply be that folks never imagined an alternative was possible. The past has a way of quietly dictating the future.
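A hedged illustration of that fidelity loss, reusing the hypothetical orders_flat table from the earlier sketch: once the items array has been exploded into per-item rows, a simple document-level question becomes a reassembly job, and it only works at all if the prep step happened to preserve the right keys.

```sql
-- "Which orders contain more than two items?" Against the native
-- documents, this is a direct question about array length. Against
-- the flattened table, it means re-gluing each shredded document:
SELECT order_id
FROM orders_flat
GROUP BY order_id
HAVING COUNT(*) > 2;

-- If the wrangling step dropped order_id, deduplicated rows, or
-- truncated the array, this entire class of queries becomes
-- unanswerable. The information is simply gone.
```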
Popularity Contests With the Same Old Contestants
So why are data wrangling apps popular? To answer this, you need to look at the end goal: gaining insight from the data. Current analytics tools universally use the same relational data model and analytics plumbing that was invented in the 1970s. Frankly, at the lowest level, analytics plumbing has changed very little in the last 40 years; what HAS changed is the data. Modern data structures like JSON are the norm for most new applications. IoT, social media, and SaaS applications almost universally leverage modern NoSQL databases as the underlying data engine. This works great for building apps and terribly for doing any kind of analytics on the resulting data. Hence the rise of ETL, Data Wrangling, and lots of other tricks and traps to smash the data into our legacy tooling. If you need analytic value from your modern data, you have to go through this process.
Common Sense: Snow Tires for Snowy Roads. NoSQL Analytics for NoSQL Data
What if there were a new way? How about building plumbing that works on the actual new data models? What a novel concept. SlamData is built on the Quasar engine, an open-source project that provides first-class support for JSON, XML, and any other kind of non-tabular data model you wish to analyze. It supports SQL2, a souped-up version of SQL that understands heavily nested, non-uniform data right out of the box. It is fully generalized and supports all major SQL SELECT operations; users can create complex, powerful queries of any kind. A sample follows below.
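As a taste, here is a hedged example of the kind of query this enables: grouping and aggregating directly on a field inside a nested object. The path and schema are invented for the example, and exact SQL2 syntax varies by version.

```sql
-- Group on a nested field with no flattening pass. A legacy tool
-- would demand a customer_city column be manufactured first.
SELECT customer.city AS city, COUNT(*) AS orders
FROM "/warehouse/orders"
GROUP BY customer.city
ORDER BY orders DESC
```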
Quasar: The Open Source Engine Powering the Future of Post-Relational Analytics
Quasar is the first attempt to bring a sensible approach to analyzing modern data. It starts from the ground up with a mathematical formalism called MRA (pronounced “Murray”), which stands for Multi-Dimensional Relational Algebra. While this is a mouthful, it is incredibly important. Other solutions, even so-called “Big Data” solutions like Spark (SparkSQL), still rely on the same relational algebra created over 40 years ago at the dawn of the relational database era.
Yawn.
“Seems kind of odd with all the changes and innovations we have had since then that nobody tackled this thorny problem. Well, better late than never.”
MRA is mathematics updated for modern data, and it packs all the power you would expect to slice and dice data, no matter how complex or messy. With MRA as its foundation, Quasar and SlamData (the front end we built for Quasar) can work directly against ANY kind of data, not just nice, neat tabular data.
Ahem: Welcome to the Future
First, a few good-byes. Good-bye wasted time. Good-bye blood, sweat, and tears. Good-bye wasted money. Good-bye “Data Wrangling.”
OK, so let me tone it down for a sec. Data Wrangling and prep have their place. They helped. But they’re transitional technologies, and the outcomes are akin to cultivating your precious rose garden with a weed whacker. Ghastly results, right?
That’s yesterday. SlamData is tomorrow.