Monthly Archives: July 2014

Data Cleaning

The Issue isn’t “Big Data”, it’s “Clean Data”

Hoo boy! Wow, I started this blog "back up" in the middle of June! Since then, I've moved to Oshkosh, Wisconsin, and been busy as heck. My blogging has lagged, but I don't want that to become a habit. My goal is definitely one post a week (interspersed with whatever Quora/Basketball stuff I get in), as well as the weekly Boxscore Geeks podcast (on Thursdays). Bug me if you don't see it :)

Big Data?

In graduate school I was running face recognition on a "huge set of data". Be ready to laugh: I believe it was somewhere around 2-4 gigs' worth. Since then, data has exploded in both how easy it is to get and how easy it is to store. I have gigs and gigs of data on my Amazon AWS account, and about once a month they send me a bill for a little under a dollar. If you're not talking in terabytes, or heck, petabytes, it's small potatoes. Yet, interestingly, the issue I see with data is not big data. No, it's clean data!

On a great podcast with Ari Caroline about healthcare, this issue came up. In sports, we have lots of data, and most of it is in useful, tabular formats. Want to know which hand a player shoots with? That's a check box in a column on some site. In healthcare, it gets more complex. From any set of doctors' notes, you could easily infer some information, and you can easily store an endless amount of notes in the cloud. But to turn those notes into robust, browsable data sets, you'd need to be able to ask the notes questions. And there the data gets trickier.

This isn’t uncommon. In fact, on a recent Freakonomics podcast, Steven Levitt said this was a problem he noticed at many of the companies he consulted for.

I never would have thought this before I started working with companies. I never would have imagined that it is an I.T. problem that you simply cannot get the data you want, and the data are held in 27 different data sets that have different identifiers, so you simply…So sometimes when my little consulting firm TGG comes into a company we’ll spend something like three or six person months working with a company of trying to just put together a data set to do a basic analysis that I think many listeners would think wow I would think that a big, fancy company would be able to do this with the push of a button. But it really is… the I.T. support and the complexity in these big firms blows your mind about how hard it is to do the littlest, simple things.

The issue isn’t that companies don’t have the data. It’s that they don’t have the data in easy-to-digest formats!
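Levitt's "27 different data sets that have different identifiers" problem is easy to sketch. Here's a minimal, hypothetical example (the field names and values are invented, not from any real company): two record sets describe the same customers, but one keys them by a numeric ID and the other by a formatted account string, so you have to normalize the identifiers before you can join anything.

```python
# Two hypothetical data sets about the same customers, keyed differently.
billing = [
    {"cust_id": 1042, "plan": "pro"},
    {"cust_id": 1043, "plan": "free"},
]
support = [
    {"account": "CUST-001042", "tickets": 3},
    {"account": "CUST-001043", "tickets": 0},
]

def normalize(account: str) -> int:
    """Map an account string like 'CUST-001042' onto the numeric ID 1042."""
    return int(account.split("-")[1])

# Index the support records on the shared key, then join.
tickets_by_id = {normalize(row["account"]): row["tickets"] for row in support}
merged = [{**row, "tickets": tickets_by_id.get(row["cust_id"])} for row in billing]
# merged now pairs each billing record with its ticket count.
```

With two tiny tables this is trivial; with 27 data sets, each with its own identifier scheme and its own quirks, those few person-months of consulting time start to make sense.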

I noticed an example of this first-hand at one of the big companies I worked for. I had a side project I wanted to work on. In the middle of one of those team-building sessions, I was talking with a co-worker, and a project he was working on lined up perfectly! I excitedly told my boss about it in our next one-on-one. And... he was completely confused. He didn't even realize my co-worker was working on a related project. This was baffling to me. Seriously? My boss wasn't aware of what one of his own employees was working on? And consider the implications: if instead of randomly talking with my co-worker, I'd asked my boss "Who would know best about...?", could he have answered?

It used to be really hard to collect data. That’s changed. Write a scraper, sign up for a cloud account, and go! If you’re inside a company, the data’s probably somewhere… Now, the issue is how to make sure the data is in a usable state to glean information from it. It’s a much harder problem, and one I hope gets as much buzz and press as “Big Data.”
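What "a usable state" means in practice is often just this: the same value recorded three different ways has to be collapsed into one canonical form before you can count on it. A hedged sketch, with entirely invented data, might look like:

```python
# Hypothetical raw rows where the same city and state appear in
# inconsistent forms (casing, whitespace, abbreviation vs. full name).
raw_rows = [
    {"city": "Oshkosh", "state": "WI"},
    {"city": "oshkosh ", "state": "Wisconsin"},
    {"city": "OSHKOSH", "state": " wisconsin"},
]

# Map every known spelling of the state onto one canonical code.
STATE_ALIASES = {"wisconsin": "WI", "wi": "WI"}

def clean(row: dict) -> dict:
    """Normalize casing, strip stray whitespace, canonicalize the state."""
    return {
        "city": row["city"].strip().title(),
        "state": STATE_ALIASES[row["state"].strip().lower()],
    }

cleaned = [clean(r) for r in raw_rows]
# All three rows now agree: {'city': 'Oshkosh', 'state': 'WI'}
```

Multiply this by every column in every data set a company owns, and "clean data" starts looking like the hard problem, not the storage.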

-Dre