Anonymized Synthetic Data Generation by DP-GAN

Overview: a brief review of the Differentially Private Generative Adversarial Network (DP-GAN), a generative model for synthesizing anonymized data samples.

1. Why can generating synthetic data preserve data privacy?

Many industries, from insurance companies and financial institutions to healthcare providers, have access to an ocean of information (in the form of structured tabular data) that can be used to make more informed decisions and to identify new strategies and policies that not only increase their profits but also improve the quality of their services, resulting in higher customer satisfaction. Machine learning and data mining techniques can effectively extract data-driven insights from such large and rich datasets, provided the datasets can be shared with third-party research institutions.

Notably, these datasets usually contain an enormous amount of sensitive personal information, and mishandling them can seriously threaten customers' data privacy. Thus, before sharing such sensitive data with any third-party research institution, the data holders have to guarantee the privacy of their customers' information.

Data anonymization is one approach to preserving data privacy: it aims to make re-identification attacks very hard or, preferably, impossible. At the same time, the anonymized data should preserve the statistical properties and patterns that exist in the original, unprotected data, so that ML techniques can extract and learn these patterns for a variety of tasks, such as building predictive models. This leads to a dilemma: privacy vs. utility.

Figure 1: Utility vs. privacy. Image credit: https://aircloak.com/explaining-differential-privacy/

To address both criteria (privacy vs. utility), one needs to devise a method for generating synthetic anonymized data that I) retains the statistical properties of the original data (measured by a utility metric) while II) preserving the privacy of customers' information (which can be quantified by a privacy metric, e.g. differential privacy (DP)).
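For reference, the DP privacy metric mentioned above is usually stated in the following standard form. The symbols $\mathcal{M}$ (a randomized mechanism, here the procedure that trains the generator), $\mathcal{D}$ and $\mathcal{D}'$ (two datasets differing in a single record), and $S$ (any set of possible outputs) are notation introduced here for illustration and do not appear elsewhere in this article:

$$
\Pr[\mathcal{M}(\mathcal{D}) \in S] \;\le\; e^{\epsilon}\,\Pr[\mathcal{M}(\mathcal{D}') \in S] + \delta
$$

A mechanism satisfying this inequality for all such $\mathcal{D}$, $\mathcal{D}'$ and $S$ is called $(\epsilon,\delta)$-differentially private. Smaller $\epsilon$ and $\delta$ mean that the presence or absence of any single customer's record changes the output distribution less, i.e. stronger privacy, typically at some cost in utility.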

To create such synthetic anonymized data, one can generate either fully or partially synthetic data. In the former, all the features (attributes) of a given dataset are considered sensitive, so analysts generate fully synthetic records to be used in place of the original records. In the latter, only some features are regarded as sensitive, and analysts either synthesize values for these attributes or censor them, aiming to keep the privacy risk (e.g. identity disclosure) low without hurting utility.


2. MedGAN: a variant of GAN to generate data of different types

Generally speaking, a vanilla GAN learns to estimate the data distribution; sampling from this estimated distribution then yields as many synthetic samples as needed. However, GAN was originally proposed for image datasets, whose samples contain real-valued features, and it is trained without regard to any privacy measure. In contrast, we aim to generate privacy-preserving synthetic tabular data with categorical, discrete, binary or mixed features (data attributes). MedGAN [1] was proposed to generate such discrete-valued anonymized features, particularly for structured medical datasets.
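As a reminder of what "estimating the data distribution" means for a vanilla GAN, its standard minimax objective can be written as below. The discriminator symbol $Dis(\cdot)$ and the distribution names $p_{data}$ and $p_{\mathbf{z}}$ are notation assumed here for illustration ($Dis$ is used instead of the more common $D$ to avoid a clash with the dimension $D$ used in this article):

$$
\min_{G}\,\max_{Dis}\; \mathbb{E}_{\mathbf{x}\sim p_{data}}\big[\log Dis(\mathbf{x})\big] + \mathbb{E}_{\mathbf{z}\sim p_{\mathbf{z}}}\big[\log\big(1 - Dis(G(\mathbf{z}))\big)\big]
$$

The discriminator is rewarded for assigning high probability to real samples and low probability to generated ones, while the generator is rewarded for fooling it. Nothing in this objective limits how much the generator may memorize individual training records, which is why privacy has to be handled separately.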

To generate synthetic discrete values, the authors of MedGAN incorporate an encoder-decoder (autoencoder). The pre-trained encoder ($Enc(\cdot)$) maps each real record, represented by $\mathbf{x}\in\mathbb{Z}_{+}^{D}$ (a $D$-dimensional discrete-valued space), into a continuous feature space, and the decoder ($Dec(\cdot)$) maps it back to the discrete-valued space. The generator $G(\cdot)$ takes a random prior $\mathbf{z}$ and generates a continuous-valued feature vector $G(\mathbf{z})$, which is then mapped back to the discrete-valued space by $Dec(G(\mathbf{z}))$. Finally, the discriminator is trained to distinguish the generated samples from the real samples.
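The data flow described above can be illustrated with a minimal, untrained sketch in plain Python. It is not the MedGAN implementation: the layer sizes, the random weights standing in for learned parameters, the single-layer networks, the binary attributes, and the function names `enc`, `dec`, `gen`, `dis` and `discretize` are all assumptions made for illustration. Rounding the decoder output only at sampling time is likewise an assumption of this sketch.

```python
import math
import random

random.seed(0)

D = 6        # number of discrete (here: binary) attributes per record
H = 3        # size of the continuous feature space shared by Enc, Dec and G
Z_DIM = 4    # size of the random prior z


def random_matrix(rows, cols):
    """Stand-in for learned weights: small random values."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def linear(weights, vector):
    """Matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


W_ENC = random_matrix(H, D)      # Enc: discrete record -> continuous code
W_DEC = random_matrix(D, H)      # Dec: continuous code -> per-attribute probability
W_GEN = random_matrix(H, Z_DIM)  # G: random prior -> continuous code
W_DIS = random_matrix(1, D)      # discriminator: record -> probability of "real"


def enc(x):
    return [math.tanh(h) for h in linear(W_ENC, x)]


def dec(h):
    return [sigmoid(o) for o in linear(W_DEC, h)]


def gen(z):
    return [math.tanh(h) for h in linear(W_GEN, z)]


def dis(x):
    return sigmoid(linear(W_DIS, x)[0])


def discretize(probabilities):
    """Round decoder outputs to {0, 1} to obtain a discrete record."""
    return [1 if p >= 0.5 else 0 for p in probabilities]


real_record = [1, 0, 0, 1, 1, 0]
reconstruction = dec(enc(real_record))        # autoencoder path (pre-training)

z = [random.gauss(0.0, 1.0) for _ in range(Z_DIM)]
fake_probabilities = dec(gen(z))              # generator path: Dec(G(z))
fake_record = discretize(fake_probabilities)  # discrete synthetic record

print("synthetic record:", fake_record)
print("discriminator score (real):", dis(real_record))
print("discriminator score (fake):", dis(fake_probabilities))
```

In an actual training loop, the autoencoder would first be fitted to reconstruct real records, and the generator and discriminator would then be updated adversarially; the sketch only shows which component consumes which input.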

The privacy of the data generated by MedGAN has been assessed empirically with different privacy metrics (Choi et al., 2017; Goncalves et al., 2020), but it would be preferable to train the generative model explicitly to encourage privacy.

Figure 2: MedGAN for generating discrete-type (categorical) data by incorporating an AutoEncoder

3. Experiment

3.1. Dataset

We use the Home Credit dataset, a Kaggle dataset for predicting the repayment ability of customers from a set of attributes. The dataset is first pre-processed as follows:

  1. Outlier removal: outliers in each attribute are detected by their z-score (z-score > 3) and removed.
  2. Data imputation: missing values of real-valued attributes are imputed with the mean, and missing values of categorical attributes with the most frequent value.
  3. Encoding categorical attributes: categorical attributes given as strings are encoded as integer-typed categorical data.
  4. Data standardization: float-type attributes are rescaled to the interval $[0,1]$.
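The four steps above can be sketched in plain Python on a toy table. This is an illustration under stated assumptions, not the pipeline actually used: the column names `income` and `contract`, the toy values, the use of the absolute z-score, and min-max scaling as the way to reach $[0,1]$ are all hypothetical choices.

```python
import statistics


def remove_outliers(rows, column, threshold=3.0):
    """Step 1: drop rows whose |z-score| in `column` exceeds the threshold."""
    observed = [r[column] for r in rows if r[column] is not None]
    mean = statistics.fmean(observed)
    std = statistics.pstdev(observed)
    if std == 0:
        return list(rows)
    return [r for r in rows
            if r[column] is None or abs(r[column] - mean) / std <= threshold]


def impute(rows, column, strategy):
    """Step 2: fill missing values with the mean or the most frequent value."""
    observed = [r[column] for r in rows if r[column] is not None]
    if strategy == "mean":
        fill = statistics.fmean(observed)
    else:
        fill = statistics.mode(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def encode_categories(rows, column):
    """Step 3: map string categories to integer codes."""
    codes = {label: i for i, label in enumerate(sorted({r[column] for r in rows}))}
    return [{**r, column: codes[r[column]]} for r in rows], codes


def min_max_scale(rows, column):
    """Step 4: rescale a float column to the interval [0, 1]."""
    values = [r[column] for r in rows]
    low, high = min(values), max(values)
    span = (high - low) or 1.0   # avoid division by zero for constant columns
    return [{**r, column: (r[column] - low) / span} for r in rows]


# Toy table: 20 ordinary records, two missing values, one extreme income.
rows = [{"income": 30.0 + i, "contract": "cash" if i % 3 else "revolving"}
        for i in range(20)]
rows[4]["income"] = None
rows[7]["contract"] = None
rows.append({"income": 5000.0, "contract": "cash"})

rows = remove_outliers(rows, "income")
rows = impute(rows, "income", "mean")
rows = impute(rows, "contract", "most_frequent")
rows, contract_codes = encode_categories(rows, "contract")
rows = min_max_scale(rows, "income")

print(len(rows), "records kept; contract codes:", contract_codes)
```

Rows with a missing value are kept during outlier removal so that they can be imputed in the next step; the z-score statistics are computed on the observed values only.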

4. Reference

  1. Edward Choi, Siddharth Biswal, Bradley Malin, Jon Duke, Walter F. Stewart, Jimeng Sun, "Generating Multi-label Discrete Patient Records using Generative Adversarial Networks", Machine Learning for Healthcare Conference, 2017.