There are many tools out there offering easy ways to pull together a variety of data from disparate sources and create unique visualizations. The charts are very, very pretty. However, in the age of modern data (NoSQL, JSON) these “pretty” charts can be wildly inaccurate. Sorry cloud analytics companies, close your eyes because I’m about to throw you under the bus.
ETL and data mapping for modern data = Satan
There, I said it. ETL has been around longer than I’ve been alive. And it was great. Until it wasn’t. Let’s look at the standard workflow for cloud analytics providers:
- Pull data out of a NoSQL datastore (like MongoDB).
- Send it to a warehouse such as Redshift or Google BigQuery.
- Map it to your relational schema, making arbitrary decisions about what counts as schema.
- Execute some analytics in the warehouse (Redshift, GBQ) and suck the rest back into some black-box in-memory store provided by the analytics provider.
- Create pretty, albeit probably inaccurate, charts.
That’s ETL/data mapping at its pinnacle. Er, I should say its nadir… What are we getting out of this process? See the picture up top? The photocopy of the Mona Lisa? That’s the right metaphor. You invested in NoSQL for a good reason: it captures something that your relational databases don’t. Complexity. Texture. Dynamism. Variability. So then why do you send your beautiful data through the meat grinder? What comes out is simply not the same data you had in the beginning.
See, ETL’s “relationalizing” nature is a BIG problem.
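To make that concrete, here is a minimal Python sketch of what "relationalizing" can do to a number on a chart. Everything in it is made up for illustration: the `orders` documents, the `to_row` mapping, and the choice to keep only the first line item and ignore the `coupon` field are assumptions, not how any particular vendor maps data. The point is only that a one-row-per-document mapping has to make choices like these, and each one can quietly change the answer.

```python
# Hypothetical order documents, shaped the way they might sit in a NoSQL store.
orders = [
    {"_id": 1, "customer": "ann",
     "items": [{"sku": "A", "qty": 2, "price": 10.0},
               {"sku": "B", "qty": 1, "price": 5.0}]},
    {"_id": 2, "customer": "bob",
     "items": [{"sku": "A", "qty": 1, "price": 10.0}],
     "coupon": {"code": "SPRING", "off": 2.0}},  # field only some documents have
    {"_id": 3, "customer": "cy", "items": []},
]


def to_row(doc):
    """Illustrative (and lossy) mapping: one flat row per order.

    Assumed choices: keep only the first line item, and drop any field
    (such as "coupon") that is not a column in the target schema.
    """
    first = doc["items"][0] if doc["items"] else {}
    return {
        "order_id": doc["_id"],
        "customer": doc["customer"],
        "sku": first.get("sku"),
        "qty": first.get("qty", 0),
        "price": first.get("price", 0.0),
    }


def revenue_from_rows(rows):
    """Revenue as the warehouse sees it, computed from the flattened rows."""
    return sum(r["qty"] * r["price"] for r in rows)


def revenue_from_documents(docs):
    """Revenue computed on the documents as they actually exist."""
    total = 0.0
    for doc in docs:
        total += sum(i["qty"] * i["price"] for i in doc["items"])
        total -= doc.get("coupon", {}).get("off", 0.0)
    return total


flat_revenue = revenue_from_rows([to_row(d) for d in orders])
nested_revenue = revenue_from_documents(orders)

print(flat_revenue)    # 30.0
print(nested_revenue)  # 33.0
```

Same data, two different totals. A smarter mapping (say, a separate line-items table) would recover the second item, but it would still have to decide up front what to do with fields like `coupon` that only show up in some documents.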
The truth is everyone has been looking at the low-fi version of Mona and thinking they’re looking at the left side. Just. Say. No.
Relocation also cuts your data off at the source. Once you move your data, it’s like cutting flowers. You can get them into a nice vase for a few days, but after that they’re decaying.
Well, you might blame me for having 20/20 hindsight, looking back over 30 years of database development… Not true. My vision is 20/20 because we committed ourselves to figuring out how to “push down” analytics to the data, as it actually exists, IN your DB (“Hey, ETL, thanks for all your years of hard work…”), and it works.
So, think about this alternative NoSQL analytics workflow.
- Connect SlamData to your source.
- Query to your heart’s content: look, find, discover, explore and more, all on live data, leaving the data right where it is.
It’s like one day realizing you can actually breathe under water or maybe walk on water. It’s no miracle; it’s just multidimensional relational algebra, also known as the SlamData “secret sauce”.
Final Notes
If you’re sticking up for data warehouses because they deliver on bringing disparate datasets together in one common place, well, we have that solved too. It’s called SQL². SQL is the lingua franca of databases. SQL² is a small set of adaptations for SQL to make it work across any NoSQL datastore. We call it query federation and we’re excited to start talking about it soon. In the meantime, ask yourself: “Am I looking at the photocopy of the Mona Lisa or the real thing?”