Eating data science for breakfast

I got up nice and early this morning to chair a data science and data protection breakfast in Soho. Nothing like a sprinkling of support vector machines on your granola and a couple of slices of regulation on the side to get you going on a Friday.

The event was organised by data strategists the Ammonite Group, and it was held under the Chatham House Rule, so I can’t be too specific about who was there or what was said. But a really interesting collection of data scientists was in the house, from industries including publishing, motor and gambling as well as straight-up tech, so it made for a lively discussion.

While each of us was grappling with very different data problems, it was fascinating to discover how united we were by the questions we were asking about the ways in which big data and machine learning models were going to be affected by the arrival of GDPR.

My own company, Hospify, which provides compliant messaging for healthcare, is very much predicated on the existence of this piece of legislation. We’re all about making sure that the kind of thing we’ve seen happen to people’s Facebook data as a result of the Cambridge Analytica debacle doesn’t happen to their medical data too.

Handling data in a compliant way is Hospify’s stock-in-trade, but like other businesses we’re looking to wring value from that data for our users by using the latest machine learning tools. The trouble is, the compliance part of the equation makes that very difficult for us to do.

Machine learning technology (which, beneath all the hype about AI, amounts to adding a layer of robust feature recognition and associated transformations to the compute stack) is arriving just at the moment that the world is waking up to the ways in which the great open data experiment is making us all very vulnerable to whole new kinds of attack.

Cybersecurity, however, is just one of the challenges we face. As we know at Hospify, things can be highly secure and still not be compliant, as compliance deals with a whole raft of requirements, from data storage and consumer opt-in to data subject access requests and the right to be forgotten. All of these can be problematic for a business at the best of times, let alone when the data concerned have been passed through a machine-learning model.

And that’s before we even get to the right to explanation. There are already conflicting interpretations about what this even means. If you’ve been turned down for insurance because of a decision that was made by an algorithm, what does having a right to know how that decision was made amount to? Should you be given access to every weight in the matrix of a multilayer neural net, which would not only be hard to deliver but also pretty much meaningless? Or do you just have the right to be told the methodologies involved? And if so, which ones, and to what extent?

On top of this, the amount of data that the digital operations side of any business needs to retain in order to properly do its job is increasing all the time. User profiles, mobile apps, metrics from IoT devices, APIs and analytics of all kinds are moving beyond the realm of the human and into that of the algorithmic just by dint of the sheer volume of information they generate.

Equally, and driven by smaller chips, better batteries, the need to remove data bottlenecks, and the security risks inherent in putting any information in transit, more processing power is moving out of the cloud and back towards the edges of the network. This creates challenges of its own around tracking, implementation and security, and is something that I personally am particularly interested in.

This morning’s conversation ranged across all these topics, and generated useful insights into quite a few of them. As a group we felt that a lot of the demand for “right to explanation” could be satisfied by demonstrating best practice in data collection and pre-processing, and that unpicking actual models might be much less necessary than it initially appears to be.

Where that wouldn’t be sufficient, there were some innovative suggestions for using input-output correlations to give a very human level of insight into decisions around individual cases. Keeping clear separation, when possible, between customer data and transaction data was another top tip; it was also salutary to hear the extent to which the group felt that the third-party data market was already disappearing.
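To make the input-output idea a little more concrete, here is a minimal sketch of one way it could work: treat the model as a black box, nudge one input at a time for a single case, and report how far the output moves. Everything named here is an assumption for illustration, not something presented at the breakfast: the function `input_output_sensitivity`, the toy scoring function and its feature names are all hypothetical stand-ins for a real model.

```python
from typing import Callable, Dict

Case = Dict[str, float]


def input_output_sensitivity(
    model: Callable[[Case], float],
    case: Case,
    deltas: Dict[str, float],
) -> Dict[str, float]:
    """Change one input at a time and record how much the output shifts.

    The model is treated as a black box: only its inputs and outputs
    are inspected, never its internal weights.
    """
    baseline = model(case)
    effects = {}
    for feature, delta in deltas.items():
        perturbed = dict(case)          # copy so the original case is untouched
        perturbed[feature] += delta     # nudge a single feature
        effects[feature] = model(perturbed) - baseline
    return effects


def toy_insurance_score(applicant: Case) -> float:
    # Hypothetical stand-in for an opaque model; higher means riskier.
    return (
        10 * applicant["claims_last_5y"]
        + 2 * applicant["annual_mileage_k"]
        - 3 * applicant["years_licensed"]
    )


applicant = {"claims_last_5y": 2, "annual_mileage_k": 12, "years_licensed": 10}
one_unit_each = {name: 1 for name in applicant}
effects = input_output_sensitivity(toy_insurance_score, applicant, one_unit_each)

# List the inputs that moved this applicant's score the most.
for name, change in sorted(effects.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {change:+}")
```

The result is a statement along the lines of "one more claim would have moved your score this much", which says something about an individual decision without exposing a single weight. A real model with interacting features would need more care than one-at-a-time nudges.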

One particularly thorny area concerned the tension between the regulatory need to identify problematic customer behaviour in certain sectors, and the need to exercise the right to be forgotten. Another red flag was raised about the dangers of introducing bias into data sets via hidden correlations in otherwise innocuous data sources: questionnaires, for example, whose question sets inadvertently encourage particular types of answer, or put off particular categories of person. As a former psychology student, I’m very familiar with this particular species of difficulty, and know well how tough it can be to eradicate it.
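As a rough illustration of what hunting for hidden correlations might look like in practice, the sketch below checks each apparently innocuous column against a sensitive attribute and flags any that track it closely. The data, the column names, the helper `flag_proxies` and the 0.7 threshold are all made up for the example. A simple Pearson correlation will only catch linear relationships, so this is a first-pass screen rather than a proper bias audit.

```python
from math import sqrt
from typing import Dict, List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either column is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def flag_proxies(
    features: Dict[str, List[float]],
    sensitive: List[float],
    threshold: float = 0.7,  # arbitrary cut-off chosen for this example
) -> Dict[str, float]:
    """Return the features whose correlation with the sensitive attribute
    is at least `threshold` in absolute value."""
    flagged = {}
    for name, column in features.items():
        r = pearson(column, sensitive)
        if abs(r) >= threshold:
            flagged[name] = r
    return flagged


# Invented data: six respondents, with a 0/1 sensitive attribute.
sensitive_attribute = [0, 0, 0, 1, 1, 1]
features = {
    "postcode_band": [1, 1, 2, 5, 6, 5],   # tracks the sensitive attribute
    "session_length": [3, 7, 4, 5, 3, 8],  # largely unrelated
}

flagged = flag_proxies(features, sensitive_attribute)
for name, r in flagged.items():
    print(f"{name} may be acting as a proxy (r = {r:.2f})")
```

A flagged column is not proof of bias, only a prompt to look harder at how that field was collected and whether the model should be seeing it at all.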

Another spectre that loomed over the meeting was the feeling that different pieces of legislation often contradicted one another, making it impossible to be sure that you were complying with everything. We talked a lot about how transparency of process and clear opt-outs/opt-ins for users and consumers would help mitigate the chances of falling foul of many of the new rules, but that in quite a lot of situations best practice wouldn’t really be established until after GDPR was in place and some edge cases had been tested in the courts.

One question we did settle, though, before we went our separate ways: whoever had final sign-off on the GDPR, they probably weren’t a data scientist!

Writer, co-founder of Hospify, plaything of the gods