Data hoarding: Would a Marie Kondo approach to data storage bring you joy?

By Gareth Stokes on June 24, 2019

Posted in Artificial Intelligence, Data Governance

There is something immensely satisfying about identifying a real bargain in a charity shop or thrift store. In among the 17 copies of last year’s must-read summer-holiday novel, and the three cups with saucers that don’t match, you might stumble across a hidden gem. My best-ever find was a set of three Star Wars original release laserdiscs, priced at £3 each, that, because of the coincidence of shape and size, were nestled in amongst the records at the charity shop supporting a local hospice – right there between Val Doonican and Phil Collins. Conscience wouldn’t permit me to pay the well-under-market-value sticker price, so after a quick search of online auction sites we settled on a fair rate.

Charity shops are now creaking under the influx of donations from those following Marie Kondo’s decluttering show. A simple question – does it bring you joy? – and if not, out it goes!

If you’re instinctively a hoarder (and I will unburden my soul by confessing that I certainly am) it can be hard to watch. But, as witnessed on the show, the unbearable lightness of being that seems to go with the unbearable lightness of shelving is undeniable.

The promise of new technologies – big data analytics and AI in particular – has turned many businesses into data hoarders. Collect and retain everything indefinitely in the hope that you might derive some insight that gives you a key advantage over the competition.

Techniques for analysing unstructured data sets, and the ability to buy vast amounts of cloud storage and cloud compute power to hold and interrogate the data, mean that it is easier than ever to deploy ‘big data’ processes. As ever though, just because you can store every bit of data indefinitely doesn’t mean you should.

First, the indiscriminate collection of data for some undefined future use case is inefficient. It will often require the implementation of logging systems (or, even worse, additional ‘busywork’ steps in manual workflows) that consume time and resources. Storing the data then adds infrastructure cost, which keeps rising because the dataset will often grow faster than the price per gigabyte of cloud storage falls.
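To make the arithmetic concrete, here is a minimal sketch of how a storage bill behaves when data volume grows faster than unit prices fall. Every figure in it (starting volume, price per gigabyte, growth and decline rates) is a hypothetical assumption chosen for illustration, not a quoted cloud tariff.

```python
def projected_storage_costs(initial_gb, price_per_gb, data_growth, price_decline, years):
    """Return the projected monthly storage bill for year 0 through `years`.

    All inputs are illustrative assumptions:
    data_growth   -- yearly growth in stored volume (0.40 means +40% a year)
    price_decline -- yearly fall in price per gigabyte (0.10 means -10% a year)
    """
    costs = []
    for year in range(years + 1):
        volume_gb = initial_gb * (1 + data_growth) ** year
        unit_price = price_per_gb * (1 - price_decline) ** year
        costs.append(volume_gb * unit_price)
    return costs


# Hypothetical scenario: 10 TB today, data growing 40% a year,
# storage prices falling 10% a year.
costs = projected_storage_costs(
    initial_gb=10_000,
    price_per_gb=0.02,
    data_growth=0.40,
    price_decline=0.10,
    years=5,
)

for year, cost in enumerate(costs):
    print(f"year {year}: {cost:,.2f} per month")
```

Under these assumed rates the bill is multiplied by 1.4 × 0.9 = 1.26 each year, so it rises by roughly a quarter annually even though the unit price keeps falling.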

This ‘keep everything and we’ll figure out what to do with it later’ approach also presupposes that a serious effort will be made at some undefined future point to use it to find insight. Instead of kicking the can down the road, the wise choice would be to spend the effort now working out what insight(s) would be useful, and designing systems to capture the specific data needed to deliver them. Otherwise, the accumulation of lots of data that is not useful creates a haystack within which the needles of useful insight must later be found.

This requires a collection of different skills to be brought together:

people with a deep understanding of your current business operations;
visionaries who can imagine the brave new world that could be realised in future;
change experts who can steer the transformation; and
data scientists who can interrogate current data resources, identify which additional data points will be useful in guiding decisions about change, and undertake the analysis that will deliver data-driven transformational change.

Not only is this structured approach commercially far more efficient than indiscriminate data hoarding, it allows a narrative around the direction of change to be communicated throughout the enterprise whilst meaningfully clearing out the clutter – the genuinely useless data clogging up the corridors and bookshelves of data stores.

Finally, no article by a lawyer considering the risks attendant on indiscriminately collecting data would be complete without mentioning data regulations and cybersecurity. Inevitably the indiscriminate collection of data will include much that may constitute personal data for the purposes of the General Data Protection Regulation in Europe, and trigger similar privacy-related legislation in other jurisdictions. Mass storage, particularly if data from global operations is being funnelled into one or a small number of locations, will breach data minimisation principles, and entail cross-border transfers on a grand scale. Storing the data on the least costly third-party cloud infrastructure adds a further layer of risk and complexity. However undertaken, the impact of such data concentration and the technical and organisational measures, and contractual protections required, are likely to be far from straightforward. Last, and by no means least, storage for some unspecified future use is very unlikely to meet GDPR requirements for a clear purpose for processing to be communicated to data subjects.

Similarly, the risks associated with data concentration and the potential for an accidental or deliberate data breach cannot be easily dismissed. Risks from regulatory fines, contractual claims for breach of confidentiality, the possibility of class actions from affected data subjects, and the negative publicity and loss of public trust in the organisation all grow by the gigabyte. Data hoarding that leads to massive data concentration also creates an attractive target for hackers.

All of these are additional legal and regulatory risks that can be minimised by capturing specific data for specific purposes, and communicating that purpose to data subjects.
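One way to picture ‘specific data for specific purposes’ in a system is to declare each purpose up front, with the fields it needs and how long they are kept, and to discard everything else at the point of capture. The sketch below is a minimal illustration under assumed names: the `Purpose` class, the `delivery` purpose, its field list and the 90-day retention period are all hypothetical, and nothing here is drawn from the GDPR text itself.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Purpose:
    """A declared reason for processing (hypothetical structure)."""
    name: str
    allowed_fields: frozenset
    retention_days: int


def minimise(record: dict, purpose: Purpose) -> dict:
    """Keep only the fields the declared purpose actually needs."""
    return {k: v for k, v in record.items() if k in purpose.allowed_fields}


def is_expired(collected_on: date, purpose: Purpose, today: date) -> bool:
    """True once the record has outlived the purpose's retention period."""
    return today > collected_on + timedelta(days=purpose.retention_days)


# Hypothetical purpose: delivering an order, with data kept for 90 days.
delivery = Purpose(
    name="delivery",
    allowed_fields=frozenset({"name", "address"}),
    retention_days=90,
)

raw = {
    "name": "A. Customer",
    "address": "1 Example Street",
    "browsing_history": ["..."],    # not needed for delivery, so dropped
    "device_fingerprint": "abc123"  # not needed for delivery, so dropped
}

stored = minimise(raw, delivery)
print(stored)
print(is_expired(date(2019, 1, 1), delivery, today=date(2019, 6, 24)))
```

The design choice is that nothing is stored unless a named purpose asks for it, which also gives a ready-made description of the purpose to communicate to data subjects.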

For more on big data, analytics, machine learning and the legal and regulatory issues that touch upon them, come along to DLA Piper’s European Tech Summit in London on 15th October. More details at dlapipertechsummit.com.