Thursday 11 December 2014

Are you STILL here? Big Data's bad smell

Rambling introduction
Big data, big data, big data, big data. Yes, we're all probably sick to death of hearing the term 'big data' now* and somewhere along the line the meaning has disappeared and whatever feelings the term originally engendered have now morphed into angst, disillusionment, embarrassment and general scoffing. Maybe this is because the famous 3 Vs of big data (vanity, vanity and vanity**) have not actually produced many good examples that can help explain why and how big data is useful. Actually, this kind of thing is pretty common in tech, academic, business and all sorts of other fields. A new concept comes along, people start jumping on the bandwagon, people start jumping off the bandwagon, bandwagon crashes and burns, and everyone says 'I was never on the bandwagon and, anyway, it was going in the wrong direction'. Despite the sarcasm and general rambling nature of this opening paragraph, below the floppy disk I'm going to make the case for not abandoning 'big data' at all - even if we do decide to stop using the term. I may also throw in even more bad metaphors, analogies and bad writing.

A 'big data' enabled storage device

Shark Jumping
In 2011 an article appeared saying 'big data' had 'jumped the shark' (see below, and here for definition) and it makes for interesting reading because much of what it says is true. However, like most articles about big data, the core of the critique is not always directed towards the data or the analytical process but towards the hype. The comments section of this article is also very interesting because Doug Laney, the originator of the much-cited 3 Vs concept in big data, has a few things to say. Fast forward to September 2014 and Techsling ask whether big data has jumped the shark (conclusion seems to be yes, probably), whereas in 2013 Wired said not to worry because big data had definitely not jumped the shark. However, my favourite article in this vein has to be the syncsort piece entitled 'has big data nuked the fridge', which actually contains a lot of common sense from a real 'big data' person. As for me, I completely agree that the hype has gone far too far. However, let's keep working on large datasets with powerful machines and then let people know when we get some useful or interesting results. And let's not use big data as an excuse for clever, anti-social people to avoid speaking to real people.

Big data didn't make it

Voices of reason
When you are immersed in hype, it's very important to find some voices of reason. Two prominent voices that I like the sound of are Rob Kitchin and David Lazer. Rob Kitchin is a professor at Maynooth University in Ireland and has written extensively on the need to approach big data sensibly and with a healthy degree of critique, most notably in his 2014 book The Data Revolution. My personal favourite is his piece from June 2014 entitled 'Big Data, new epistemologies and paradigm shifts' where he explores Anderson's 'end of theory' piece and argues both that big data is disruptive and that there is 'an urgent need for wider critical reflection'. Rob's stance is particularly interesting from a social science perspective but actually I find his conclusions resonate much more widely.

My other voice of reason in big data is David Lazer, professor in Political Science and Computer and Information Science at Northeastern University and Visiting Scholar at the Kennedy School at Harvard, who wrote a great piece with colleagues on 'The Parable of Google Flu: Traps in Big Data Analysis' for Science in March 2014. Most people with an interest in big data probably know the story of Google Flu Trends because it made headlines for the wrong reasons in February 2013. Lazer et al. use this story to bring some reason to the big data debate and critique 'big data hubris'. Interestingly, they also talk about the need to incorporate 'small data':

"However, traditional “small data” often offer information that is not contained (or containable) in big data, and the very factors that have enabled big data are enabling more traditional data collection."

The conclusion of the Lazer et al. piece is not that we should abandon big data but rather that we need to understand what the recent data revolution means and then use innovative analytics to move towards a clearer understanding of our world.

Big dog, small dog
I have posted a photo below of a big dog and a small dog. They are both dogs. I can see why some people would get excited about big dogs. They can fetch bigger sticks. They can keep burglars at bay more easily and they are stronger, but they do take up more space. But the small dog has a really loud bark, can go places big dogs can't, knows just as much as the big dog and takes up less space in your house. They do require different approaches in relation to being looked after, but that's another issue altogether. Someone has even produced a nice visual representation of different kinds of big and small dogs.

They are both dogs

How to proceed
Apologies for the big dog, small dog nonsense above but I've been sucked into the big data debate over the past few years and it always makes me think of this. I don't even have a dog. But I do have lots of data and a fancy computer and this is what I have in common with lots of other people who are 'doing big data'. So, to conclude and by way of trying to say something useful about big data, here are a few final bullet points...

  • Let's accept that the hype around big data has gone too far and put that to one side. It's not novel or useful to say that 'big data has jumped the shark', 'big data is all hype', 'big data is dead' or other similar comments. The people who are working with big data and have a critical mind (Kitchin, Lazer et al.) already know all this.
  • Let's try to take a more nuanced approach to understanding what big data is***, what it is not and what it can and cannot do - along the lines of what Kitchin refers to as a 'contextually nuanced epistemology'.
  • We ought to understand that 'big data' emerged because enhanced processing power in computers arrived at roughly the same time as access to very large datasets. This created the ability to ask questions of data that we previously could not answer because of problems of 'small tools'. But it hasn't really led to many transformative developments that people know about. This needs to change.
  • Let's start with big questions rather than big data. A very obvious point but the criticism that big data so far has been a solution in search of a problem is in some cases justified. 
  • Let's let the term 'big data' fade into the distance and keep working with large datasets and powerful computers on big societal challenges that we need to find the answers to (i.e. keep doing big data but stop calling it that).
  • Finally, let's keep in mind this statement from a Financial Times Magazine piece on big data from early 2014: "a theory-free analysis of mere correlations is inevitably fragile. If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down".

* or 1, 3, 5, 7, 10 years ago depending upon how far ahead of the curve you are.
** I think that's right...
*** data is, data are... I like is, even if some say it's wrong
This blog is written in a somewhat rambling, tongue-in-cheek style just to make a point