Video details

PEPR '22 - Compiling Python Programs into Differentially Private Ones

Python
08.22.2022
English

Johan Leduc and Nicolas Grislain, Sarus Technologies
Working with privacy-sensitive data today, whether it is in healthcare, insurance or any other industry, is a complex and slow process often involving long manual reviews by compliance teams. The recent development of differential privacy helped standardize what privacy protection means. As such, it has the potential to unlock the automation and scaling of data analysis on privacy-sensitive data. To help realize this promise, we designed and built a framework in which an analyst can write data analysis jobs with common data-science tools and languages (SQL, NumPy, pandas, scikit-learn) and have them compiled into differentially private jobs executed remotely on the sensitive data. In this talk, we will describe how a user expresses his job declaratively in Python and how his Python code is analyzed and compiled, before it is run and a result is eventually returned.
Johan Leduc is a senior data scientist at Sarus Technologies. He graduated from Ecole Polytechnique in 2014. He started his career in the energy sector and switched to data science in 2019. He joined Sarus (YC W22) in 2020 as the first employee and has been working on private synthetic data generation and private data analysis.
Nicolas Grislain is Chief Science Officer at Sarus Technologies. He graduated from École Normale Supérieure de Lyon in Mathematics and Computer Science. Nicolas started his career in economics and finance modeling at the French Treasury and then at Société Générale. He co-founded his first company, AlephD, in 2012, where he also led Research and Development. AlephD was acquired by Yahoo in 2016. In 2020 he co-founded Sarus Technologies (YC W22) with the same founding team as AlephD.
View the full PEPR '22 program at https://www.usenix.org/conference/pepr22/conference-program

Transcript

Hello, and welcome to our talk about compiling Python programs into differentially private ones. This talk will go through the different stages of the process. We will start with data onboarding. We will then move on to the description of the data job before diving into the specifics of the differentially private compilation. Finally, we will go through a step-by-step application to make things clearer.

So imagine that you are the owner of a sensitive dataset and you want someone to perform data analysis on it without ever actually accessing the real data. Well, this is possible through our tool. What you can do is import the sensitive data into our platform. The sensitive data can be a CSV file or a SQL-like database. With access to the data, we will then explore it: get the number of tables and the number of columns, infer the data types and the value ranges. We will then compute estimates of the marginals using differential privacy. And we will in turn use these marginals to train a generative model, using differential privacy too. Now, the trained generative model will be used to generate a synthetic dataset equivalent to the original one. But this synthetic dataset is actually private and can be shared with anyone. In particular, it can be shared with the data practitioner to iterate on the definition of the data job.

So now that we have onboarded our sensitive dataset onto the Sarus platform, we would like someone to perform a data job on it. We call this person the data practitioner. This is usually someone who does not have access to the real data but would like to work on it, and this person is usually a data scientist who works in a Python environment. In the real world, on non-sensitive data, they would just get access to the data and start working on it. But in our case, they don't. Instead, what they get is a symbolic representation along with differentially private marginals and synthetic data.

Now, to describe the job itself in Python, there are already plenty of libraries to do that, such as pandas, NumPy and TensorFlow, and we didn't want to write another library from scratch. So what we chose to do instead is to implement stubs of these widespread libraries that just register the operations performed on the original dataset. So while this looks like standard data science code, what it does instead is register a computational graph.

Let's go through this example quickly. Here we can see that we have a dataset that we choose to view as a pandas DataFrame. From then on, we can use the pandas primitives to describe the operations. So we extract the features, we extract the labels, we get the mean and the standard deviation of the features, which we use to normalize the features. Then we declare a model, which we then fit on the normalized data. And we have an equivalent computational graph derived directly from the code: we have a DataFrame, we extract the features, we extract the labels. Then we compute the mean of the features and the standard deviation. We subtract the mean and divide everything by the standard deviation to get a normalized version of the features. Then we declare the model and we fit it on the data.

So now we have a description of the job to be performed, but we would like to have a fitted model with differential privacy guarantees. The question is: how do we do that? Before diving into the specifics of the compilation, let's first introduce some notions about differential privacy.
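To make the graph-registration idea concrete, here is a minimal, self-contained toy sketch (an illustration of the stub idea only, not the actual Sarus implementation) in which pandas-like objects record the operations applied to them instead of executing anything:

    # Toy "stub" objects: every operation registers a node in a
    # computational graph instead of computing a result.
    class Node:
        def __init__(self, op, *parents):
            self.op = op            # name of the operation, e.g. "mean"
            self.parents = parents  # upstream nodes in the graph

        def __getitem__(self, column):
            return Node(f"select[{column}]", self)

        def __sub__(self, other):
            return Node("sub", self, other)

        def __truediv__(self, other):
            return Node("div", self, other)

        def mean(self):
            return Node("mean", self)

        def std(self):
            return Node("std", self)

        def describe(self, indent=0):
            # Print the registered graph as an indented tree.
            print(" " * indent + self.op)
            for parent in self.parents:
                parent.describe(indent + 2)

    # The dataset viewed as a DataFrame, and the job from the talk:
    df = Node("dataframe")
    X = df["features"]                   # extract the features
    y = df["label"]                      # extract the labels
    X_norm = (X - X.mean()) / X.std()    # normalization registers mean/std/sub/div nodes
    fit = Node("fit_model", X_norm, y)   # declaring and fitting the model adds one more node
    fit.describe()                       # walk and print the computational graph

Nothing is computed on the data here; the object only accumulates a graph that can later be analyzed and compiled into a differentially private job.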
One of the important notions is that of adjacent datasets: two datasets are adjacent if they differ only by one entity. An entity can be a person, a company or a group of people. This notion is important because it is at the very core of the definition of differential privacy: a differentially private mechanism applied on two adjacent datasets will have very similar results. That is, a differentially private mechanism does not depend too much on the contribution of a single entity.

To formalize this important notion, let's first introduce some vocabulary. We call the protected entity the entity that we want to protect. It is first identified during the onboarding stage, where we ask: what is the protected entity? Is it any row in the table? Is it identified by a column in a table? Once we have that, a very important property that a dataset can have is the property that we call PE-preserving, that is, protected-entity preserving. A dataset has this property if we can link each row of the dataset to at most one protected entity. When you have this property, you have a natural notion of adjacency: if you add or remove a single protected entity from the dataset, you simply add or remove all the rows that you can link to this protected entity. And this is very important because then you can easily measure the contribution of a single entity, which you need in order to apply a differentially private mechanism.

A few key points to have in mind: while you can link a single row to only one protected entity, you can link many rows to a single protected entity, so a protected entity can have many rows associated to it. Another detail is that if we cannot link a row to a protected entity, which can be the case, for example, if you have a table with country names, then we consider these rows public.

Once we have identified the PE-preserving property in the original dataset, we try to keep track of this property in the transformed datasets. Some transformations preserve this property: for instance, if you select a subset of columns, then you are still able to trace back each row of the result to the original protected entity. But if you perform certain kinds of aggregations, then the result will have mixed contributions from several protected entities, so you have lost the PE-preserving property.

So let's go back to our computation graph and our original question: how do we get a differentially private estimate of the fitted model? Our algorithm works roughly like this: if we want a differentially private estimate of the fitted model, it would be nice to train the model with a differentially private mechanism such as DP-SGD. But to apply this mechanism we need the inputs to be PE-preserving, and if one of the inputs is not PE-preserving, then we can ask the algorithm recursively to compile a PE-preserving version of that input. It turns out that to compile a PE-preserving version, we may need to compile other differentially private results upstream in the graph, so that in the end, to get a differentially private estimate of our fitted model, we may need to recursively compile different nodes in the graph into either differentially private equivalents or PE-preserving equivalents. So let's quickly go through an example to make things clearer.
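To illustrate the PE-preserving property with a toy example (plain pandas, not Sarus code, assuming the protected entity is identified by a user_id column):

    import pandas as pd

    df = pd.DataFrame({
        "user_id": [1, 1, 2, 3],   # one protected entity may own several rows
        "age":     [34, 34, 51, 27],
        "income":  [40_000, 42_000, 55_000, 30_000],
    })

    # Selecting a subset of columns is PE-preserving: each output row
    # still maps to exactly one protected entity.
    selected = df[["user_id", "income"]]

    # A global aggregation is not PE-preserving: the single output value
    # mixes contributions from every protected entity, so it can no
    # longer be linked to one of them.
    mean_income = df["income"].mean()

    # Adjacency: removing one protected entity removes all of its rows.
    adjacent = df[df["user_id"] != 1]
    print(selected, mean_income, adjacent, sep="\n\n")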
Here we have the same graph as before, where we want to train a model using differential privacy. As you recall, the idea would be to use DP-SGD to train the model, but to do that we would need all the inputs to be PE-preserving, and that is not the case here because there are some aggregates in this computation graph. So we ask the algorithm to compute a PE-preserving version of X_norm, which in turn requires a PE-preserving version of the standard deviation. A way to do that is to compute a differentially private version of the standard deviation and publish it: we consume privacy, but we get an estimate that is no longer linked to any protected entity. We can then divide each row of X by this estimate without losing the link to the original protected entity. We can do that with the standard deviation, and we can also do that with the mean. If we do that, we have a PE-preserving estimate of X_norm, which we can use to train the model using DP-SGD.

So, as you have seen, by recursively computing alternatives, either differentially private or PE-preserving, for some nodes in the graph, we have managed to get a fitted model with differential privacy guarantees. This algorithm is quite simple and does not terminate in every case, which is why we always have the possibility to fall back to the private synthetic data to get a private estimate of any request. So that was it. Thank you very much for your attention, and we will be glad to answer your questions.
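A toy numeric sketch of this rewriting (illustrative only, not the actual compiler output; it assumes features already clipped to [0, 1], one row per protected entity, and a deliberately simplified sensitivity calibration for the Laplace mechanism):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(0.0, 1.0, size=(1000, 4))  # stand-in for the sensitive features
    sensitivity = 1.0 / X.shape[0]             # simplified: one row shifts a mean by at most 1/n

    def laplace_release(value, epsilon):
        # Publish a statistic with Laplace noise; this consumes privacy budget.
        return value + rng.laplace(scale=sensitivity / epsilon, size=value.shape)

    # 1. Publish differentially private estimates of the mean and standard deviation.
    mu_dp = laplace_release(X.mean(axis=0), epsilon=0.5)
    sigma_dp = laplace_release(X.std(axis=0), epsilon=0.5)

    # 2. Normalizing each row with these published constants is PE-preserving:
    #    every output row still comes from exactly one protected entity.
    X_norm = (X - mu_dp) / sigma_dp

    # 3. X_norm can now be fed to a DP training mechanism such as DP-SGD
    #    (for example via a library like Opacus) to obtain the fitted model.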