In this special guest feature, Stephen Bailey, Director of Data & Analytics at Immuta, argues that data engineers should take on the responsibility of data privacy, since they are the ones who build the systems that collect the data. Stephen strives to implement privacy best practices while delivering business value from data. He loves to teach and learn on just about any subject. He holds a PhD in educational cognitive neuroscience from Vanderbilt and enjoys reading philosophy.
Data has been called the “new gold” for its ability to transform and automate business processes. At the same time, it has been called the “new uranium” for its ability to violate the human right to privacy on a massive scale. Just as nuclear engineers can effortlessly enumerate the fundamental differences between gold and uranium, data engineers must learn to instinctively identify and separate dangerous data from the benign.
Take, for instance, the famous 1997 “link attack” that re-identified the medical records of high-profile patients in Massachusetts. The state’s Group Insurance Commission (GIC) had released hospital-visit records for roughly 135,000 state employees and their families, with direct identifiers such as names and Social Security numbers stripped out. Despite the precautions, researcher Latanya Sweeney, then a graduate student at MIT, was able to join publicly available voter rolls to the anonymized medical records on three quasi-identifiers: ZIP code, birth date, and sex. That combination left Sweeney with only a handful of candidate records to sift through, and she re-identified many individuals, most notably then-Governor William Weld.
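To make the mechanics concrete, below is a minimal sketch of a link attack in Python. The tables, names, and values are made up for illustration; the point is that neither dataset names a diagnosis on its own, yet an ordinary join on shared quasi-identifiers does.

```python
# A toy reconstruction of a link attack. All data below is made up; the point
# is that neither table alone connects a person to a diagnosis, but an
# ordinary join on shared quasi-identifiers does.
import pandas as pd

# "Anonymized" medical release: direct identifiers stripped, quasi-identifiers kept
medical = pd.DataFrame({
    "zip":        ["02138", "02139", "02140"],
    "birth_date": ["1945-07-31", "1962-01-15", "1970-03-02"],
    "sex":        ["M", "F", "M"],
    "diagnosis":  ["cardiac arrhythmia", "asthma", "fracture"],
})

# Public voter roll: names sit right next to the same quasi-identifiers
voters = pd.DataFrame({
    "name":       ["W. Weld", "J. Doe"],
    "zip":        ["02138", "02139"],
    "birth_date": ["1945-07-31", "1962-01-15"],
    "sex":        ["M", "F"],
})

# The join is the attack: any quasi-identifier combination that is unique in
# both tables produces a named medical record.
linked = medical.merge(voters, on=["zip", "birth_date", "sex"])
print(linked[["name", "diagnosis"]])
```

Sweeney later showed that ZIP code, birth date, and sex alone uniquely identify roughly 87% of the U.S. population, which is why data engineers should treat that particular triple as radioactive.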
More than twenty years later, every business is a GIC, and every person with internet access is a potential Latanya Sweeney. Yet we all want a world where data is handled responsibly, shared cautiously, and leveraged only for the right purposes. Our greatest limitation in realizing that world is not one of possibility, but responsibility.
It’s not just a question of “How?” but “Who?”
It’s tempting to pin the challenge of respecting privacy on “whoever gets fired,” but the reality is that even if one team — legal, compliance, security — is ultimately accountable, the chain of responsibility extends much deeper. To get maximum value out of data, it must be moved, transformed, joined, and studied. Personal data is piped through multiple systems, from marketing automation tools to customer relationship management tools and on to business intelligence tools, each with its own owner and its own standards. Once it is inside the organizational walls, personal data spreads like wildfire.
There are a number of teams that could reasonably be expected to try to contain this fire, including:
- Legal and compliance
- Information security
- Lines of business
- Individual system administrators
Ultimately, though, the modern privacy problem is a problem caused by technology, and it requires a solution that starts with a solid technical foundation. Asking teams who do not work “in the trenches” to design solutions will produce surface-level solutions; asking domain owners who lack a holistic view of the ecosystem will produce localized solutions that break down as soon as data moves beyond their boundaries. Managing privacy loss is a systemic problem demanding systemic solutions — and data engineers build the systems.
Architecting systems that rely on fluctuating business logic (such as data regulations or access control policy) is not easy. Data engineers need to challenge the business to define the requirements in terms that can be operationalized. What metadata must be collected from every system? What is the specific vocabulary we are using? In what situations is raw data suitable for release, and when does it need to be transformed? When are we authorized to say, “no!” to replicating data? Controlling the re-identifiability of records in a single dashboard is good analytics hygiene, but preserving privacy in the platforms delivering the data is crucial.
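What might “operationalized” look like in code? Below is a hypothetical sketch, with invented tags, purposes, and field names rather than any real policy engine, of how such requirements could be encoded as machine-readable rules that every pipeline consults before moving or releasing data.

```python
# Hypothetical sketch: business requirements expressed as data, not prose, so
# every pipeline answers "may I release this?" the same way. All tags, purposes,
# and actions here are illustrative assumptions, not a real policy engine.
from dataclasses import dataclass

@dataclass(frozen=True)
class ColumnMeta:
    name: str
    tag: str  # shared vocabulary: "direct_identifier", "quasi_identifier", "benign"

# What each tag requires before data may leave the platform
REQUIRED_ACTION = {
    "direct_identifier": "drop",
    "quasi_identifier": "generalize",  # e.g. 5-digit ZIP -> 3-digit prefix
    "benign": "allow",
}

def release_plan(columns: list[ColumnMeta], purpose: str) -> dict[str, str]:
    """Return a per-column action, or refuse outright for unapproved purposes."""
    if purpose not in {"analytics", "billing"}:  # assumed approved purposes
        raise PermissionError(f"Not authorized to replicate data for {purpose!r}")
    return {c.name: REQUIRED_ACTION[c.tag] for c in columns}

plan = release_plan(
    [ColumnMeta("patient_id", "direct_identifier"),
     ColumnMeta("zip", "quasi_identifier"),
     ColumnMeta("visit_count", "benign")],
    purpose="analytics",
)
print(plan)  # {'patient_id': 'drop', 'zip': 'generalize', 'visit_count': 'allow'}
```

The details are invented, but the design point stands: once the vocabulary lives in metadata and code rather than in a policy document, “no” becomes something the platform can say automatically.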
Enabling Engineers to Change the Future of Privacy
The mandate to protect privacy presents exciting new technical challenges. How can we quantify the degree of privacy protection we are providing? How can we rebuild data products — and guarantee they still function — after an individual requests that their data be deleted? How can we translate sprawling legal regulations into comprehensible data policies while satisfying data-hungry consumers?
It will be critical to formulate a new set of engineering best practices that extend beyond the familiar domains of security and system design. Determining what is best practice requires much actual practice, though. It’s essential that engineering leaders push their teams to understand and address the pertinent issues: the strengths and weaknesses of data masking, anonymization techniques like k-anonymization and differential privacy, and emerging technologies such as federated learning. Ultimately, data engineers should know the practice of privacy by design as intuitively as they do the principle of least privilege.
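To make two of those techniques less abstract, here is a toy, textbook-formula sketch of a k-anonymity check and a differentially private count; the data and parameters are illustrative assumptions, not production settings.

```python
# Textbook illustrations of two techniques named above. Data and parameters
# are toy values chosen for readability, not production settings.
import numpy as np
from collections import Counter

def is_k_anonymous(quasi_id_rows: list[tuple], k: int) -> bool:
    """k-anonymity: every combination of quasi-identifier values must be
    shared by at least k records, so no one is unique on those columns."""
    return all(count >= k for count in Counter(quasi_id_rows).values())

def dp_count(true_count: int, epsilon: float) -> float:
    """Laplace mechanism for a counting query (sensitivity 1): adding
    Laplace(0, 1/epsilon) noise yields an epsilon-differentially-private
    answer. Smaller epsilon means stronger privacy and a noisier count."""
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# A GIC-style release fails the check: one (zip, birth_year, sex) tuple is unique.
rows = [("021**", "1945", "M"), ("021**", "1945", "M"), ("021**", "1962", "F")]
print(is_k_anonymous(rows, k=2))   # False: the ("021**", "1962", "F") row is unique
print(dp_count(100, epsilon=0.5))  # e.g. 98.7: useful in aggregate, private per person
```

Differential privacy’s epsilon is also one answer to the quantification question above: it turns “how much privacy protection are we providing?” into a tunable, auditable budget rather than a vague promise.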
The alternative, if history is any guide, is a world in which institutions publish “anonymized” data to the world, and clever people and organizations reconstruct and repurpose private data for their own ends. Managing privacy, far from being an abstract concept for just philosophers and lawyers, has become a concrete problem perfectly suited for data engineers. It’s time they made it their own.