Working with Data Sucks.


We’ve all heard the joke – “80% of data science is data cleaning”. While this is true to some extent, I’d like to refine it to “90% of data work is in the basic implementation”.

I started working with data in 2016, originally with pandas and sklearn to build small models for fun. I came from a programming background, so this never bothered me much. However, as I progressed, something became obvious – “This takes forever.”

The fun quickly wore off, and the annoyance set in.

Why is working with data so hard?

Because of code.

As a software engineer, I should be biting my tongue.

I won’t pick on any particular environment, but it’s basically the same with R, Python, or SQL. Every possible change or operation is code. You get the worst parts of software engineering, with none of the benefits.

Almost every action is ad hoc, manual, slow, and error-prone.

  • Want to visualize a binary format like Parquet? cute.
  • Want to re-order some columns? type out each column in a list
  • Need to work with remote S3 data? navigate to AWS, download it, load it into R or Jupyter, modify it, upload it again
  • Make a logic mistake? good luck noticing it; if you do, repeat everything you just did
  • Need to change the type of a column? well, depends on the format type, library and version
  • Need to visualize anything? spend forever fooling around with Matplotlib or Seaborn, or jumping between Tableau and your exported CSV
  • Want to load a CSV with corrupt chars? see you next week when you finish aligning everything
  • Want to run a simple model? write 100 lines of code to transform the data, but 5 to run the model
  • Want to do things at the same time? no, wait until your un-optimized code cell finishes
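
To make two of these complaints concrete, here is a minimal sketch of what "re-order some columns" and "change the type of a column" look like as code. It uses only Python's standard library rather than pandas, and the sample data, the column names, and the `reorder_and_cast` helper are all made up for illustration. Treat it as one possible shape of the problem, not how any particular library does it.

```python
import csv
import io

# Hypothetical sample data, invented purely for illustration.
RAW = "id,name,price\n1,apple,0.50\n2,pear,0.75\n"


def reorder_and_cast(text, column_order, casts):
    """Parse CSV text, emit columns in `column_order`, and convert values per `casts`."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for record in reader:
        row = {}
        # Every column has to be named explicitly, even to move just one.
        for column in column_order:
            convert = casts.get(column, str)  # values stay strings by default
            row[column] = convert(record[column])
        rows.append(row)
    return rows


rows = reorder_and_cast(
    RAW,
    column_order=["price", "id", "name"],  # moving one column means listing all three
    casts={"price": float, "id": int},     # types are spelled out by hand as well
)
print(rows[0])
```

Moving `price` to the front still means typing out all three columns, and nothing warns you if the list drifts out of sync with the file.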

There’s zero unification. Zero continuity. Absolutely nothing is optimized for the user’s workflow. I would say the user experience (UX/UI) of data tooling is among the worst in the world.

“Data” is not special, nor hard

Many in the industry have been incorrectly convinced that data science or analytics is “special”. It’s “too complicated” for traditional software. It’s too mathematically complex, bespoke, and so on.

I call B.S.

Exhibit A: Video Editing

Various video editing software programs overlaid on top of each other

Software like Final Cut Pro, DaVinci Resolve, and Premiere Pro is extremely flexible. You can edit a home video or a full feature film with the same software. It works with the 4K HDR 120 fps footage from your iPhone, or with 8K RAW footage.

Sure, you might need a beefier computer for 8K, but you’re not running to set up a Spark cluster on AWS.

This is also all done - you guessed it - with zero code. And the math? So much more complicated. Even Pixar’s animated films are built with amazing visual software, RenderMan.

This is the correct bar for user-interface quality.

Unfortunately, in the data space, the bar is on the floor.

Continuity is Everything.

I interviewed ~150 data scientists and analysts before starting Coco Alemana. The most common frustration was the lack of continuity among different systems. Everything was a separate tool.

No tool had everything in one place. There was no application that handled the “whole workflow”, or even a large part of it.

This lack of continuity kills productivity.

When continuity is done right, you don’t even notice it. You feel it. It’s so well woven into the fabric of the product that your cognitive load actively drops. You think less, while doing more.

Apple is the king of this. Their product suite is so seamlessly woven together that it feels as though you’re using a single product.

So, what should we do?

My belief is that User Experience and Design Continuity are the only things that matter for working with data.

The user shouldn’t be thinking about which execution engine they’re using. They shouldn’t be bothered by format types or compression. They shouldn’t worry about optimization. They shouldn’t be fighting the software.

They should worry about applying their domain expertise. The software should do the rest.

It’s that simple.