Comparison of machine learning models applied on anonymized data with different techniques
Team: Streulea Lucian, Turda Robert, Stoia Teodora
Article: 2305.07415.pdf (arxiv.org)
INTRODUCTION
This paper addresses the increasing importance of data anonymization techniques in the context of the massive amount of personal data generated daily due to digitization. The study recognizes the need for secure protocols and privacy-preserving techniques in data science pipelines, particularly when dealing with sensitive data that can identify individuals. The paper distinguishes identifiers, quasi-identifiers, and sensitive attributes in data, with a focus on how these can be anonymized while preserving data utility for analysis. Four different anonymization methods are applied to a classical dataset and an inference process is performed using four machine learning models: k-Nearest Neighbors, Random Forest, Adaptive Boosting, and Gradient Tree Boosting. The paper seeks to balance data anonymization with the usefulness of data for analysis, ultimately providing insights into the trade-off between data privacy and utility.
ANONYMIZATION TECHNIQUES
The study applies four classical anonymity techniques to obfuscate quasi-identifiers in a dataset using value generalization hierarchies (VGH). The process involves replacing original values with less specific ones or suppressing them with a special character. The four techniques are:
- k-anonymity: This is a widely used and easily implemented technique. A database is k-anonymous if there are at least k records in each equivalence class of the database, reducing the probability of individual identification to 1/k.
- ℓ-diversity: This is another commonly used method, especially effective for preserving the privacy of sensitive attributes and preventing homogeneity attacks. ℓ-diversity ensures that there are at least ℓ different values for the sensitive attribute in each equivalence class of the database.
- t-closeness: This technique ensures that the distribution of sensitive attribute values in each equivalence class is no more than a distance t from the overall distribution in the database. The Earth Mover's distance (for numerical attributes) or equal distance (for categorical attributes) is used to measure the distance between distributions.
- δ-disclosure privacy: This is satisfied if, for every sensitive value s and every equivalence class, the absolute value of the logarithm of the ratio between the distribution of s in the equivalence class and its distribution in the whole database is less than δ.
The study emphasizes the importance of using multiple anonymity techniques in tandem, as each one can prevent a different type of database attack.
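The equivalence-class definitions above can be checked mechanically. The sketch below, using only the Python standard library and a hypothetical toy dataset (not the paper's), groups records by their quasi-identifier values and tests the k-anonymity and ℓ-diversity conditions:

```python
from collections import defaultdict

def equivalence_classes(records, qi_idx):
    """Group records by the values of their quasi-identifier columns."""
    classes = defaultdict(list)
    for rec in records:
        key = tuple(rec[i] for i in qi_idx)
        classes[key].append(rec)
    return classes

def is_k_anonymous(records, qi_idx, k):
    """True if every equivalence class holds at least k records."""
    return all(len(c) >= k
               for c in equivalence_classes(records, qi_idx).values())

def is_l_diverse(records, qi_idx, s_idx, l):
    """True if every equivalence class shows at least l distinct sensitive values."""
    return all(len({rec[s_idx] for rec in c}) >= l
               for c in equivalence_classes(records, qi_idx).values())

# toy generalized records: (age range, suppressed ZIP, sensitive diagnosis)
data = [
    ("30-40", "482**", "flu"),
    ("30-40", "482**", "cold"),
    ("30-40", "482**", "flu"),
    ("40-50", "483**", "asthma"),
    ("40-50", "483**", "flu"),
]
print(is_k_anonymous(data, qi_idx=(0, 1), k=2))         # True
print(is_l_diverse(data, qi_idx=(0, 1), s_idx=2, l=2))  # True
```

With two equivalence classes of sizes 3 and 2, the toy table is 2-anonymous and 2-diverse, but would fail either check for k = 3 or ℓ = 3.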
USED MACHINE LEARNING MODELS
This research applies four supervised machine learning models to a classical classification problem, using quasi-identifiers as features and the sensitive attribute as the label. The models are k-Nearest Neighbors (kNN), Random Forest (RF), Adaptive Boosting (AB), and Gradient Tree Boosting (GB). They are trained and tested using the Python library scikit-learn (version 1.2.0), with optimal parameters selected via a 5-fold cross-validated grid search. Performance is evaluated based on accuracy and area under the ROC curve. Although the anonymization process is applied to the entire database, a stratified random train-test split (75%-25%) is used when training the models.
- k-Nearest Neighbors (kNN): This is a non-parametric method used for classification and regression. In kNN classification, an object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors.
- Random Forest (RF): This is an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
- Adaptive Boosting (AB): Also known as AdaBoost, this is a boosting algorithm used as an ensemble method to improve the stability and accuracy of machine learning algorithms. It combines multiple weak predictors to build a strong predictor, fitting a sequence of weak learners on repeatedly re-weighted versions of the training data.
- Gradient Tree Boosting (GB): Also known as Gradient Boosting, this is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion and generalizes other boosting methods by allowing optimization of an arbitrary differentiable loss function.
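The evaluation setup described above can be sketched in scikit-learn. This is an illustrative reconstruction, not the paper's code: the dataset is synthetic and the parameter grids are placeholder assumptions, but the pipeline mirrors the stated protocol (stratified 75%-25% split, 5-fold cross-validated grid search, accuracy and ROC-AUC):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in for the (anonymized) quasi-identifier features / sensitive label
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)  # stratified 75%-25% split

# illustrative parameter grids; the paper's actual grids are not given here
models = {
    "kNN": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}),
    "RF":  (RandomForestClassifier(random_state=0), {"n_estimators": [50, 100]}),
    "AB":  (AdaBoostClassifier(random_state=0), {"n_estimators": [50, 100]}),
    "GB":  (GradientBoostingClassifier(random_state=0), {"n_estimators": [50, 100]}),
}

for name, (clf, grid) in models.items():
    search = GridSearchCV(clf, grid, cv=5)  # 5-fold cross-validated grid search
    search.fit(X_tr, y_tr)
    pred = search.predict(X_te)
    proba = search.predict_proba(X_te)[:, 1]
    print(f"{name}: acc={accuracy_score(y_te, pred):.3f} "
          f"auc={roc_auc_score(y_te, proba):.3f}")
```

Running the same loop once per anonymized variant of the dataset would reproduce the paper's privacy-versus-utility comparison.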
Results
Data Encoding For Healthcare Data Democratisation and Information Leakage Prevention
Article: https://arxiv.org/pdf/2305.03710.pdf
Introduction
The article discusses the potential for deep learning to impact healthcare but highlights several hindrances in its development and acceptance. One of the major hindrances is the lack of data democratization and latent information leakage. Data democratization refers to the difficulty in making healthcare data available to AI researchers due to data privacy laws. Latent information leakage refers to the sensitive patient information that deep learning models can infer, potentially violating patient privacy. The article emphasizes the need for mechanisms to mask private information while retaining the data semantics to enable data sharing or democratization.
Method
The authors of this study suggest a framework that divides the time-series into segments of length n and applies a transformation operation f() on every segment, which results in an encoded version of the segment. The dimensions of the transformed and input segments are the same, and each encoded segment of length n is temporally concatenated to obtain the encoded version of the signal. The framework uses random projection and random quantum encoding as data transformation operations to transform the input data into a random subspace, making the data imperceptible. Random projection is a method of projecting input data into a random subspace using a random projection matrix, while random quantum encoding refers to a process of data transformation through the use of a quantum circuit containing multiple gates with random parameters.
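The segment-wise random-projection encoding can be sketched in a few lines of NumPy. The details here are illustrative assumptions (segment length n = 4, a Gaussian random projection matrix shared across segments); the paper's exact matrix construction and quantum-encoding variant are not reproduced:

```python
import numpy as np

def encode_signal(x, n, rng):
    """Split x into length-n segments, project each with a random n-by-n
    matrix R, and temporally concatenate the encoded segments."""
    R = rng.standard_normal((n, n)) / np.sqrt(n)  # random projection matrix
    segments = x.reshape(-1, n)                   # one segment per row
    return (segments @ R).reshape(-1)             # encode, then re-concatenate

rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 8 * np.pi, 64))    # toy time series
encoded = encode_signal(signal, n=4, rng=rng)
print(encoded.shape)  # same length as the input: (64,)
```

Note that the output has the same dimensions as the input, as the framework requires, while the values themselves live in a random subspace unknown to anyone without R.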
Models
The following neural network architectures were used for the prediction tasks:
- Long short-term memory (LSTM): an LSTM with 256 recurrent units followed by a linear layer with 1 node and sigmoid activation for binary prediction.
- Temporal convolutional neural network (TCN): uses 1-d convolution operations for modeling the input time series.
- Multi-branch temporal convolutional network (Multi-TCN): two multi-branch temporal blocks followed by a linear layer with 1 node and sigmoid activation.
- Transformer and Vision Transformer: the fourth and fifth models, both designed to capture global dependencies in the input signals.
Results
The analysis of the results highlights that the models trained on the encoded data exhibit less latent information leakage than the models trained on the original data. On average, MIMIC-III models trained on data encoded using quantum circuits and random projections (rather than original data) exhibited relative drops of 20.11 (±2.45)% and 23.52 (±3.98)% in performance for the latent gender prediction task. The PhysioNet models likewise exhibited relative drops of 22.66 (±5.45)% and 28.21 (±8.98)% for the data encoded using the quantum circuit and the random projections, respectively. Similar behavior is observed for the eICU models. Encoding the data also resulted in a drop in the performance of the ethnicity prediction tasks, and a similar trend is observed for patient disorder prediction from the MIMIC-III models, where quantum encoding and random projections resulted in relative drops of 12.5 (±3.79)% and 18.75 (±5.45)% in the average macro AUROC score.