Assessing the practicality of differential privacy applications at an early-stage startup.
Truework manages sensitive employment and income information with the mission of empowering consumers to own and control their personal data. We process tens of thousands of employment and income verifications every month. Each of these verifications is someone processing a loan, obtaining a mortgage, paying off a car, or reaching one of many other lifetime milestones. Ensuring these verifications are completed promptly and accurately is critical to our network. Truework makes this possible through a combination of manual and automated quality assurance. We want to leverage recent innovations in job title standardization and salary information to improve our quality assurance layer. To build these systems, our data scientists need a way to draw out the patterns seen in job title and salary data. This data is extremely sensitive, so when we build models that ingest it, we need to ensure that our data scientists cannot compromise individuals' privacy while the integrity of the data is preserved.
Whenever an employee on Truework's network wants to mortgage a home, obtain a car loan, or reach a similar milestone, they need to obtain a verification of income (VOI). For these VOIs, the most commonly reported errors have to do with salary information. Income data is generally very reliable, given its importance, but errors may nevertheless arise when this data is transferred. Sometimes the data needs to be manually transcribed, which inherently introduces issues such as misplaced or misused decimals, missing or extra zeroes, and typos. Other times, the data is transferred automatically via crawls. Occasionally, there are unexpected variations in the way data is given to us, or an error arises in our crawling. Failures due to these issues are few and far between, but any one of them means a mortgage is not closed, a loan cannot be procured, and the individual being verified needs to put their life on hold to go through the process again. Understanding the patterns that cause these errors, and building the models and tools to address them, allows Truework's verification product to be robust.
Differential Privacy (DP) is a technique that offers strong privacy assurances, preventing data leaks and re-identification of individuals in a dataset. DP guarantees that any individual will be exposed to essentially the same privacy risk whether or not their data is included in a differentially private analysis. In this context, we think of the privacy risk associated with a data release as the potential harm that an individual might experience due to a belief that an observer forms based on that data release. A practical system that utilizes differential privacy techniques will act as an intermediary layer between the query source (analyst, BI tool, code) and the statistical database being queried.
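For readers who want the guarantee in symbols, the standard formal statement of ε-differential privacy is given here. The notation is ours and does not appear elsewhere in this post: M is the randomized analysis, D and D' are any two datasets that differ in one individual's record, and S is any set of possible outputs.

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S]
```

When ε is small, e^ε is close to 1, so the output distribution is nearly the same whether or not any one person's record is present. This is the precise sense in which the risk is "essentially the same".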
DP operates by adding a small amount of random noise to the result of a statistical query and returning the differentially private result to the agent that submitted the query. The techniques contained in DP are used by government statistical bodies such as the Department of Labor for releasing statistics on job markets. They are also used by industry leaders like Google, Microsoft, and Uber to support their own analysts. DP is a relatively new technology, but its use need not be restricted to organizations with large resources. DP gives any organization, small or large, a sensible means for its analysts to work with sensitive data.
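To make the noise-adding step concrete, here is a minimal sketch of the Laplace mechanism applied to a counting query. It is an illustration under our own assumptions, not a description of the system we run: the function names `laplace_noise` and `private_count` are hypothetical, and the fixed seed is there only to make the example reproducible (a real deployment would need a cryptographically secure, unseeded noise source).

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count using the Laplace mechanism.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Seeded only so the sketch is reproducible; do not seed in production.
rng = random.Random(0)
noisy = private_count(true_count=1000, epsilon=0.1, rng=rng)
print(f"true count: 1000, private count: {noisy:.1f}")
```

The analyst only ever sees the noisy value. A smaller epsilon means a larger noise scale and therefore stronger protection at the cost of accuracy.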
Differential Privacy makes strong guarantees, but it still must be assessed practically. How will using DP affect our business? This question needs to be broken into several areas of consideration. We are principally concerned with accuracy of analysis, the "privacy budget", and legal ramifications.
There are many measures for accuracy. Here, we define error as the deviation of a differentially private result from a true result as a fraction of the true result. If the true result is 10 and the differentially private result is 9.2, then the error of our private result is 0.08.
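The error measure defined above can be written as a small helper. The function name `relative_error` is ours, chosen for illustration.

```python
def relative_error(true_value: float, private_value: float) -> float:
    """Deviation of the private result from the true result,
    expressed as a fraction of the true result."""
    if true_value == 0:
        raise ValueError("relative error is undefined when the true value is 0")
    return abs(private_value - true_value) / abs(true_value)


# The worked example from the text: true result 10, private result 9.2.
print(round(relative_error(10, 9.2), 2))  # 0.08
```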
The random noise that DP computations introduce to hide the effect of any single individual inherently decreases the accuracy of a statistical measure taken on the noisy data. Two primary factors affect the error: the number of records in the dataset and the privacy parameter ε. There are other sources of error, namely sampling and the type of query being performed, but these are beyond the scope of this blog post. The graphs below were generated by counting the occurrences of a particular job title in a simulated dataset. We simulated frequencies of occurrence at 100, 250, 300, 400, 500, 750, 1000, 1500, 2000, and 3000.
As the number of records in the dataset increases, the error of the differentially private results decreases. As a rule of thumb, no utility should be expected if the dataset contains fewer than 1⁄ε records. Ideally, a system using DP techniques will omit results for groups where the release of the results could potentially risk privacy. Where this threshold lies depends heavily on the metadata for your private data. In our simulated dataset, this threshold was around 350 observations. This is why the graph below does not include lines for the titles with 100, 250, and 300 occurrences. Above this threshold, we find a pronounced decrease in the error, from around 0.1 to around 0.05, as the number of records increases beyond 1000. This is, for us, one of the most critical observations. We regularly see datasets of interest with fewer than 1000 records. In these datasets, we can expect up to 10% error in our calculations, and in larger ones, we expect 5% or less. For the graph below, we held epsilon constant at a value of 0.1.
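The relationship between record count and error can be explored with a small simulation. The sketch below is an assumption-laden stand-in rather than the code behind our graphs: it uses a plain Laplace mechanism on a count with sensitivity 1, and it applies no suppression threshold, so its absolute error values will not match the figures quoted above. What it does show is the same qualitative trend: at a fixed epsilon, relative error shrinks as the true count grows. The helper names are hypothetical.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def mean_relative_error(true_count: int, epsilon: float,
                        trials: int, rng: random.Random) -> float:
    """Average |noisy - true| / true over repeated private counts."""
    total = 0.0
    for _ in range(trials):
        noisy = true_count + laplace_noise(1.0 / epsilon, rng)
        total += abs(noisy - true_count) / true_count
    return total / trials


rng = random.Random(0)  # seeded only for reproducibility of the sketch
epsilon = 0.1           # held constant, as in the text
for n in (100, 250, 300, 400, 500, 750, 1000, 1500, 2000, 3000):
    err = mean_relative_error(n, epsilon, trials=5000, rng=rng)
    print(f"count={n:>5}  mean relative error={err:.4f}")
```

For this particular mechanism the expected relative error is 1/(ε·n), which is why doubling the number of records roughly halves the error.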
The privacy parameter (ε), also called the "privacy budget", determines by "how much" the risk to an individual's privacy can increase in a data release. A lower ε implies better protection. However, a computation with ε = 0 will also have no utility. Privacy must be relinquished to some small degree for patterns to show themselves. It is important to note that while a single analysis can consume the entirety of the privacy budget, if multiple analyses are to be done, each may use only a fraction of the total. Take, for example, an analyst with a total privacy budget of 0.1. They may conduct a single analysis with ε = 0.1, or they may conduct 10 analyses with ε = 0.01 each. This must be taken into careful consideration when conducting repeated analysis on the same dataset. Even with a differential privacy system in place, millions of queries to the same dataset will render the protection useless, because the true result can be estimated by observing the distribution of the private results. The legend below includes the number of true occurrences beside the title (just a number in our simulated dataset), revealing how the effects of varying epsilon depend on the number of records as well. It can be seen that the variance in error is much lower for larger numbers of records.
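One way to enforce a privacy budget is a simple accountant that refuses further analyses once the budget is spent. The sketch below assumes basic sequential composition, where the ε values of successive analyses on the same data add up. The `PrivacyBudget` class is hypothetical and is not taken from any particular library. With a total budget of 0.1 and ε = 0.01 per analysis, it permits 10 analyses.

```python
class PrivacyBudget:
    """Tracks epsilon spent under basic sequential composition."""

    def __init__(self, total_epsilon: float):
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total = total_epsilon
        self.spent = 0.0

    def remaining(self) -> float:
        return max(self.total - self.spent, 0.0)

    def spend(self, epsilon: float) -> None:
        """Charge one analysis against the budget, or refuse it."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance so floating-point rounding does not reject
        # an analysis that exactly exhausts the budget.
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon


budget = PrivacyBudget(total_epsilon=0.1)
completed = 0
try:
    while True:
        budget.spend(0.01)  # each analysis uses epsilon = 0.01
        completed += 1
except RuntimeError:
    pass
print(completed)  # 10
```

Refusing queries outright is the simplest policy. It is what prevents an analyst from averaging away the noise by repeating the same query many times.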
Finally, DP's guarantees offer protections against privacy attacks of the sort named in laws and regulations surrounding sensitive data. Few things are generally protected by law, and the language describing them can be ill-defined. Typically, only PII (personally identifying information) such as social security numbers, emails, and similar pieces of data is explicitly protected by law. DP's guarantee, which protects an individual's privacy to a degree equivalent to that individual's opt-out scenario, provably resists linkage attacks. The statistical noise added makes it very difficult for an outside observer to use externally obtained data to reveal information about individuals in the dataset. Inference attacks are resisted in much the same way as linkage attacks. In fact, several organizations cite differential privacy as a key part of their GDPR compliance, as it provably resists re-identification-style attacks and prevents aggregate data from being tied to an individual's information.
Risk is a term commonly used in legal documents but is not formally defined. DP enables a formal quantification of risk. The privacy parameter controls this risk, and DP guarantees that the risk to any individual included in the dataset is essentially the same as in the scenario where that individual was excluded. "Consent" and "opt-out" policies are also frequently cited. These policies can actually have detrimental effects on privacy, as even those individuals who are excluded from a data release can have their privacy threatened in a worst-case scenario. Differential privacy offers an included individual essentially the same risk as an excluded one.
Differential Privacy is a relatively new set of techniques that provide a mathematical definition for privacy. It is on the cutting edge, implemented in practice by industry leaders such as Google, Microsoft, and Uber. These tech giants have huge resources and huge data, but these are not necessary for differential privacy to be useful and effective. In fact, we argue that the techniques of differential privacy can be made useful, effective, and practical for early-stage startups. Because of DP, we are able to provide secure, productive access for our data scientists to some of the most sensitive data relating to individuals' employment and income information. This allows data scientists to better understand patterns in the data, enabling the development of better models and systems to protect our customers from errors. In the end, this is all to ensure our customers receive their verifications quickly, reliably, and accurately, and in turn allow our network to move forward with lifetime milestones.
We use DP for access to highly sensitive datasets relating to income and title information for verifications. Google has implemented a framework for access to Google Chrome web traffic analytics, and we would like to do something similar with our web traffic and app usage analytics. DP is an area of active development in the broader research community. Advances are being made in differentially private implementations of linear regression, clustering, and other techniques that sit in any data scientist's toolbox. Further developments are in the works to improve the privacy-accuracy trade-off. Altogether, these developments will continue to improve the utility, practicality, and ease of entry for differential privacy. We hope that these advances and the promise of differential privacy will encourage other organizations to make the investment in their customers' privacy as well.
Kobbi Nissim, et al. Differential Privacy: A Primer for a Non-technical Audience. February 14, 2018.