
BUDS+: Better Privacy with Converger and Noisy Shuffling

Published: 15 February 2022

Abstract

Advances in machine learning and data science involve collecting tremendous amounts of data for research and analysis. As a result, a growing number of users have become aware of how sensitive their data are, and privacy protection has gained significant attention. Differential privacy is one of the most popular techniques for ensuring data protection, but it has two major issues: first, the utility-privacy tradeoff, where users are asked to choose between the two; second, real-time implementations of such systems on high-dimensional data are missing. In this work, we propose BUDS+, a novel differential privacy framework that achieves an impressive privacy budget of 0.03. It combines iterative shuffling, embedding-based data encoding, and a converger function in a novel comparison system that converges the privacy threshold among the aggregated differentially private and noisy reports, further shrinking the time window available to an attack model.

1 Introduction

Nowadays, most organizations, institutes, and big companies such as Google, Apple, and IBM collect various data containing sensitive information from multiple sources for internal usage, analysis, or other purposes [30]. After collecting the data, they transmit it to a trusted data holder or curator for storage through an insecure channel that is directly or indirectly exposed to third parties. Such insecure channels are a source of information leakage: breaches by computer specialists [9], careless interactions of workers and co-workers with strangers, accidents, and intentional leaks are among the real reasons behind it. Unwanted data breaches compromise users' private information such as names, bank details, transaction histories, personal phone numbers, medical conditions, and so on. This can be restricted through a secure data-processing algorithm. Unfortunately, most existing solutions for preventing information leakage require third-party access to the dataset, which can be risky; in the worst case, it can itself be the cause of leakage [35].
To avoid third-party involvement, Dwork introduced "Differential Privacy" in 2006 [15, 17, 35], which works without third-party assessment. Differential Privacy (DP) is now a standard for database and big-data scientists to prevent information leakage [35]. DP provides a mathematical and statistical framework for anonymizing users' data. It gives a high-assurance, analytic guarantee for data containing sensitive attributes that require uncompromising privacy preservation. DP ensures that, regardless of whether an individual's information is included in the data, a particular query on the data always returns approximately the same result. The exact definition of differential privacy is discussed thoroughly in the next section, but as a promise, DP can be thought of as a restriction on mechanisms that deliver aggregate distributional information about a statistical database, limiting the leakage of sensitive information. The 2006 definition of \(\epsilon\)-differential privacy guarantees that an individual's privacy cannot be compromised by any sudden attack or planned statistical release. DP produces a distributional report that does not depend on any particular individual, i.e., it yields essentially the same aggregate report whether or not a particular individual is present in the given dataset. The goal of DP is therefore to give each individual roughly the same privacy that would result from having their data removed: statistical functions run on the database should not overly depend on the data of any one individual.
Although differential privacy shows promise, there is a fundamental tradeoff between privacy and accuracy (utility). Researchers have investigated various mechanisms for adding noise under the privacy constraints \(\epsilon\) and \(\delta\) to find the minimum amount of noise that preserves differential privacy while maximizing utility and accuracy. The authors of Reference [36] use additive noise and random projections to maintain this balance based on conditional mutual information. Reference [26] considers a privacy-utility tradeoff negotiated with the user by asking whether they wish to disclose some information to increase utility. The authors of Reference [7] explore parameterizations of privacy settings that balance the extremes of maximum utility with minimum privacy and minimum utility with maximum privacy.

1.1 Our Contributions

In this work, we propose a new differential privacy framework based on BUDS [34]. BUDS+ introduces a novel architecture that fixes several earlier issues (discussed in Section 2) regarding the privacy-utility tradeoff, the longitudinal privacy guarantee, and high-dimensional input. In addition, BUDS+ addresses the lack-of-memory problem and shortens the time window available to strangers, which prevents sudden attacks on a database containing individuals' sensitive information. This work gives a faithful account of the privacy-utility tradeoff while keeping a longitudinal privacy guarantee. Our contributions are as follows:
Encoding with Embedding: applies a word-embedding technique [2, 29, 32] to encode the data and thereby handle datasets with high-cardinality attributes.
Iterative Shuffling (IS): This novel iterative shuffling technique reduces the chance of linkage attacks and of attacks through backtracking.
Data Discarding Method: Using this novel technique, BUDS+ further strengthens the privacy guarantee by storing, for the next hour, only the noisy distributional reports generated for particular queries instead of the whole input dataset. After storing the generated noisy distributional final output, the system discards the whole dataset and crowdsources the data again when needed; every time a new output is produced, the model is updated and the dataset is discarded. This technique not only mitigates the lack-of-memory problem but also keeps the model up to date. It also shrinks the time window available to attackers: they have only a very short window in which to break the privacy and trace back to the original data, which is practically impossible, as the machine uses Iterative Shuffling to shuffle the information of the dataset.
Better Utility with Converger: This function adds a bias to the noisy aggregate report so it converges to the report generated from the noiseless shuffled data, yielding strong privacy-utility measures. This raises the utility bound of the data and strikes an optimal balance between privacy and utility.
Longitudinal Privacy Guarantee: The inclusion of optimal noise in the data helps maintain the longitudinal privacy guarantee.
Optimal Privacy-Utility Tradeoff: Last but not least, the framework gives a high utility guarantee with a strong privacy budget by finding an optimal randomization function with the help of the Risk Minimizer.
Section 2 surveys related work and briefly discusses its pros and cons.

1.2 Organization

Related work is detailed in Section 2, and Section 3 compares BUDS and BUDS+. Section 4 then discusses the architecture of BUDS+ in detail, along with its pipeline; all the necessary proofs relating to the total privacy budget, utility measure, and bias bound are given there. Section 5 contains real-life experiments with BUDS+ in which its utility is compared with the RAPPOR and ARA algorithms under the same privacy budgets. Section 6 gives an empirical analysis of the proposed BUDS+. Finally, Section 7 concludes the work with a discussion of promising future directions. All the background theorems and lemmas can be found in the Appendix.

2 Related Work

The main aim of a statistical database containing sensitive information has always been to protect the "privacy" of "individuals" from a sudden attack. Differential Privacy (DP) [17, 35] prevents such attacks by securing databases holding sensitive information and is suitable for real-life practice and experiments. A later work of Dwork [18] provides concentrated differential privacy, a relaxation of DP that achieves better accuracy than pure DP and its popular "\((\epsilon , \delta)\)" relaxation without inflating the cumulative privacy loss over multiple computations. All the fundamental theorems and definitions related to DP can be found in [17] and [35]. Many existing DP models provide a strong privacy guarantee, and these mechanisms fall into two broad categories: Local Differential Privacy (LDP) and Central Differential Privacy (CDP). In this section, some current state-of-the-art algorithms are discussed briefly.
RAPPOR (Randomized Aggregatable Privacy-Preserving Ordinal Response) [21] is the first industrial implementation of DP or LDP. It introduces a mechanism that applies a Bloom filter with multiple hash functions and delivers noisy reports to provide a strong privacy guarantee on the dataset while maintaining a longitudinal privacy guarantee. Its drawback is that, due to the extremely noisy reports, the mechanism fails to provide enough utility. In 2017, PROCHLO [11] was introduced, in which the authors proposed an algorithm that uses the Encode, Shuffle, and Analyze (ESA) technique and gives a strong CDP guarantee. The ESA framework uses attribute fragmentation in the encoding stage, after which the dataset chooses a random shuffler with separate channels and is shuffled to generate the final report. In 2018, another theoretical CDP mechanism called Amplification by Shuffling [20] was proposed; it uses oblivious shuffling so that an \(\epsilon\)-LDP mechanism satisfies an \((\mathcal {O}(\epsilon \sqrt {\log (1/\delta)/n}), \delta)\) centralized differential privacy (CDP) guarantee. However, this mechanism fails to assure a longitudinal privacy guarantee for the dataset.
In 2019, A Privacy-preserving Randomized Gossip Algorithm via Controlled Noise Insertion [23] introduced a dual algorithm based on average consensus (AC) that provides an iteration-complexity bound and extensive numerical experiments while protecting the information about the initial value stored in each node. Later in the same year, by extending earlier work by McMahan et al. [28] on the moments accountant of Abadi et al. [3] for the sub-sampled Gaussian mechanism, Reference [27] (also by McMahan et al.) introduced a modular approach that minimizes the changes required to a training algorithm through various configuration strategies for the privacy mechanism, providing an integrated privacy mechanism that differs in the granularity of the privacy guarantee offered and in the method of privacy accounting. Although the authors claim that this algorithm provides a good balance between privacy and utility, it has certain drawbacks: no utility bound is given to justify that balance, and the distribution is defined over a finite domain. Another LDP work [5] provides more accurate location-privacy recommendations using matrix factorization. The authors also determined the best granularity of a location privacy preference and built a method that predicts location privacy preferences accurately, which inspires us as well. However, it is purely LDP-based, has lower utility than ordinary matrix factorization, and leaves finer temporal granularity as future work.
Paul et al. introduce a new algorithm [31] that provides CDP on RAPPOR data using a Term Frequency (TF) and Inverse Document Frequency (IDF) aggregation technique. This mechanism is the fastest CDP implementation to date, yet it fails to attain a sufficient utility guarantee (only 52.28%) because of its excessively noisy reports. The authors of the ESA framework extended their mechanism and proposed a new version in Reference [19], which applies one-hot encoding to encode the dataset and uses a report-fragmentation technique to generate the final report. This mechanism holds a strong longitudinal privacy budget while retaining a high utility guarantee on the dataset. Continuous Release of Data Streams Under Both Centralized and Local Differential Privacy [38] is another algorithm introduced in 2020. It proposes an exponential mechanism with a quality function along with two algorithms, ToPS (threshold optimizer, perturber, and smoother) and ToPL (ToPS applied in LDP), which solve the problem of publishing a stream of real-valued data under DP. Its disadvantages are that the PAK setting used in the threshold optimizer assumes the distribution stays the same; if further information, such as slow or regular change of the data, is available, the algorithm must be modified.
Another algorithm, Learning with Privacy at Scale [37], was proposed in 2020. It comprises three stages: privatization, ingestion, and aggregation. In the privatization stage, the system defines a per-event privacy parameter that limits the number of privatized data items that can be transmitted to the server in each use case. In early 2020, another algorithm, BUDS: Balancing Utility and Differential Privacy by Shuffling [34], was introduced to solve these problems by maintaining an optimal balance between privacy and utility via an iterative shuffling technique. Although this mechanism achieves a high utility guarantee with strong privacy using minimal noise inclusion, it still has some drawbacks:
(1)
Without applying ARA [31] in the aggregation process, the shuffling technique by itself does not provide privacy, because the data remain noiseless. An attacker with enough time can therefore backtrack to the original data.
(2)
The algorithm suffers from a lack-of-memory problem and does not work well with high-dimensional data.
(3)
The architecture does not perform well with respect to the longitudinal privacy guarantee.
To mitigate the above problems, we introduce BUDS+: Better Privacy with Converger and Noisy Shuffling, an advanced architecture built on the BUDS differential privacy framework [34]. It is decomposed into multiple phases: Encoding, Iterative Shuffling (IS), Noiseless Temporary Report (NTR) generation, Noisy Distributional Report storing, discarding of the input dataset, and a Converger function, which together provide the highest utility guarantee while maintaining a strong privacy guarantee.

3 BUDS Vs. BUDS+

Our first proposal, BUDS, was indeed a strong mechanism for achieving an optimal privacy-utility guarantee, but it had problems with data dimensionality. Moreover, the architecture itself does not push noise into the report, so although it reduces linkage attacks, an attacker can still recover the original information because of the limited noise insertion. In addition, BUDS does not impose any time limit on the attacker, who can therefore carry out malicious activity at leisure. Figures 2(a) and 2(b) show the improvements over BUDS made in our new proposed algorithm, BUDS+. BUDS uses one-hot encoding, which does not work well with high-dimensional datasets; to solve this, BUDS+ uses encoding with embedding, denoted by the red box labeled \(Improvement 1\) in Figure 2(b). To achieve a higher privacy guarantee than BUDS, BUDS+ adds Gaussian noise to the data to obtain a noisy output, while the insertion of a bias helps maximize utility through the Risk Minimizer. This noise-and-bias insertion solves BUDS's problem of a low, noiseless DP guarantee and is marked by the red box labeled \(Improvement 2\) in Figure 2(b). In BUDS, the attacker is not given any time bound, and with enough time and capability could recover the original data because of the low noise. To impose this time bound, BUDS+ introduces the data-discarding method (the red box labeled \(Improvement 3\) in Figure 2(b)), which discards the whole dataset after producing the final report. BUDS+ stores the final report in a local cache for some time and then discards it as well. When a query is asked, the system searches the local cache for the answer; if it is there, it delivers the answer directly, otherwise it crowdsources the data again and runs the whole process to produce the report. This technique is marked as \(Improvement 4\) in Figure 2(b). Table 1 shows a feature comparison between BUDS and BUDS+.
Table 1.
Features | BUDS | BUDS+
Encoding | One hot | Embedding
Query Function | Applied | Applied
Iterative Shuffling | Present | Present
Noise and Bias Insertion | Absent | Present
Risk Minimizer | Operates | Operates
Data Discarding Method | Absent | Present
Time Bound for Attackers | Not provided | Provided
Optimal Privacy-utility bound | Good | Better
Table 1. Features Comparison between BUDS and BUDS+

4 BUDS+

This section gives an overview of the novel architecture, covering the encoding method, iterative shuffling, the noisy distributional report storing technique, the data-discarding method, and the converger function that converges the noisy reports towards the optimal NTR, together with theorems and proofs supporting our claims. At a high level, the BUDS+ pipeline can be described as follows:
Collection of data \(\rightarrow\) Encoding with GloVe embedding \(\rightarrow\) Application of the query function \(\rightarrow\) Iterative Shuffling \(\rightarrow\) Generation of the noiseless report, stored temporarily \(\rightarrow\) Noise insertion into the data count \(\rightarrow\) Application of the converger function to converge the report with the BUDS report \(\rightarrow\) Storage and delivery of the final noisy distributional report to the clients \(\rightarrow\) Discarding of the whole dataset as well as the BUDS report \(\rightarrow\) For a new query, crowdsource again, repeat the previous process, update the reports, and discard the dataset again.

4.1 Encoding with Embedding

As attributes are mostly "words," using word embeddings to encode them is an efficient choice. One-hot encoding was used in our previous work BUDS, but with it, whenever a large number of distinct words flows in, the dimension of the encoding grows and the data structure becomes hard to manage due to limited memory. Here, we therefore use a word-embedding technique to encode the attributes. It not only handles high-dimensional datasets but also helps control the lack-of-memory problem.
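A minimal illustration of the difference, assuming a toy vocabulary and a randomly initialized embedding table (the actual framework would load pretrained GloVe vectors); the point is that the encoded width stays fixed at the embedding dimension instead of growing with the number of distinct attribute values.

```python
import numpy as np

attribute_values = ["engineer", "doctor", "teacher", "engineer"]

# One-hot: width grows with the number of distinct values (vocabulary size).
vocab = sorted(set(attribute_values))
one_hot = np.eye(len(vocab))[[vocab.index(v) for v in attribute_values]]
print(one_hot.shape)          # (4, 3) -- one column per distinct value already

# Embedding: width is a fixed dimension d, regardless of vocabulary size.
# The table is random here for illustration only; BUDS+ uses pretrained (GloVe) vectors.
d = 5
embedding_table = {v: np.random.randn(d) for v in vocab}
embedded = np.stack([embedding_table[v] for v in attribute_values])
print(embedded.shape)         # (4, 5) -- stays (n, d) no matter how many values appear
```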

4.2 Query Function

In the first phase of this work, a query function is applied to the dataset to obtain the names of the attributes relevant to a particular query. When a query is asked, the function identifies the attributes relevant to the final report and checks whether the answer to the query is already stored. If it is, then the system returns the stored answer to the clients; otherwise, a new report is generated from the crowdsourced data, as in the following example:
Example 1.
Consider a database containing the names, ages, sexes, heights, and weights of individuals, where \(name, age,\ sex, \: height,\:\) and \(\: weight\) are five attributes, and the query is "How many females in the database are less than 38 years old?" After applying the query function, the attributes "sex" and "age" are returned, indicating that only these two attributes matter for generating the final report to that particular query.
In the general case, suppose the dataset has n rows and k attributes. After the whole dataset is collected, the query function is applied and returns the attributes that matter for generating the answer. If m attributes turn out to be important for the terminal report, then these m attributes are tied together and behave as a single attribute, so the reduced number of attributes is \(g=k-m+1\). If g is divisible by S, then these g attributes are divided into S groups of \(g/S\) attributes each; otherwise, the extra e attributes each choose a group at random without replacement. The whole dataset is then divided into t batches, where the ith batch contains \(n_i\) rows and \(n_i \simeq n_j; \forall \: 1 \le i,j \le t\), i.e., the number of rows in each batch is almost equal. After that, each batch goes for independent shuffling, as sketched below.
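A small sketch of the grouping step just described, under the assumption that the query function has already returned the m relevant attributes; the helper names (group_attributes, make_batches) are illustrative, not part of the published algorithm.

```python
import random

def group_attributes(attributes, relevant, S, seed=0):
    """Tie the relevant attributes into one, then split the g reduced attributes into S groups."""
    tied = ":".join(relevant)                                       # m relevant attrs behave as one
    reduced = [a for a in attributes if a not in relevant] + [tied]  # g = k - m + 1
    rng = random.Random(seed)
    rng.shuffle(reduced)                     # random placement so extra attributes land in random groups
    base, extra = divmod(len(reduced), S)
    groups, start = [], 0
    for i in range(S):
        size = base + (1 if i < extra else 0)
        groups.append(reduced[start:start + size])
        start += size
    return groups

def make_batches(rows, t):
    """Split n rows into t batches of (almost) equal size."""
    return [rows[i::t] for i in range(t)]

attrs = ["name", "age", "sex", "height", "weight"]
print(group_attributes(attrs, relevant=["sex", "age"], S=2))
print([len(b) for b in make_batches(list(range(10)), t=3)])   # e.g., [4, 3, 3]
```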

4.3 Iterative Shuffling

The dataset now has n users and S groups of attributes, where each group of attributes chooses a random shuffler without replacement and goes for independent shuffling. There are S shufflers, each with a separate channel for its attributes, and each shuffler has its own shuffle technique for performing independent shuffling.
Here, the first batch with \(n_1\) rows is shuffled first, then the second batch with \(n_2\) rows, and so on, until the last batch with \(n_t\) rows. The shuffling occurs iteratively over the batches, which is why it is called Iterative Shuffling (IS). This technique is used here as the final shuffling mechanism, and it not only gives a strong privacy budget but also maintains a high utility guarantee for the given dataset. If the mechanism is considered only up to IS and the NTR generated for the particular query, this sub-architecture is quite similar to BUDS except for the encoding part. So, applying all the theorems and lemmas from BUDS, the privacy guarantee for IS with the noiseless dataset is:
\begin{equation} \epsilon _\text{NTR} = \ln {\frac{t}{(n_1 - 1)^S}} . \end{equation}
(1)
Example 2.
Let us take a dataset containing the names, ages, heights, and weights of six individuals, where \(name, age, height\), and weight are four attributes, and the query is "How many people in the database are less than \(40 \ years\) old and weigh more than \(60 \ kg\)?" The query function returns the attributes "weight" and "age" as relevant for generating the final report, so these two attributes are tied up. After that, the groups of attributes go to the shufflers for secret IS, where only the rows of each attribute group of each batch are shuffled by a secret technique. Thus the relevant attribute pair (weight:age) always keeps a person's weight with their age in each row, while each weight:age pair changes its unique ID due to IS. Rows of the other, irrelevant attributes ("name" and "height"), however, not only change their unique IDs but also change the weight:age pairs they are attached to, so that one can never go back to the actual database from the generated report and recover all the information about a particular individual stored in the database. Table 2 shows the database before and after shuffling.
Table 2.
Name (Before IS) | Age | Height:Weight | Name (After IS) | Age | Height:Weight
Shivansh | 22 | 5.1”:48 | Brintika | 71 | 5.1”:48
Krishanu | 71 | 4.9”:42 | Shivansh | 22 | 4.9”:42
Brintika | 28 | 6.3”:70 | Krishanu | 28 | 6.3”:70
Vandita | 30 | 6.00”:80 | Mithila | 15 | 6.00”:80
Mithila | 61 | 5.9”:55 | Vandita | 61 | 6.11”:64
Aswin | 15 | 6.11”:64 | Aswin | 30 | 5.9”:55
Table 2. Empirical Scenario of Example 2 Before and After Shuffling
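A minimal sketch of the shuffling that produces a table like Table 2: each attribute group goes to its own shuffler, and within every batch the rows of each group are permuted independently, so an entry's unique ID no longer links the groups together. The column layout mirrors Table 2, but the code is illustrative and not the authors' implementation.

```python
import random

def iterative_shuffle(batches, seed=0):
    """Independently permute each attribute group's rows, batch by batch (IS)."""
    rng = random.Random(seed)
    shuffled_batches = []
    for batch in batches:                          # batches are processed one after another
        out = {}
        for group_name, column in batch.items():   # one shuffler (permutation) per attribute group
            perm = column[:]
            rng.shuffle(perm)
            out[group_name] = perm
        shuffled_batches.append(out)
    return shuffled_batches

# A single batch shaped like Table 2, with the tied column kept together.
batch = {
    "Name": ["Shivansh", "Krishanu", "Brintika", "Vandita", "Mithila", "Aswin"],
    "Age": [22, 71, 28, 30, 61, 15],
    "Height:Weight": ['5.1":48', '4.9":42', '6.3":70', '6.00":80', '5.9":55', '6.11":64'],
}
print(iterative_shuffle([batch])[0])
```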

4.4 Noiseless Temporary Report (NTR) Generation

After completing the IS, the mechanism uses an aggregation method to generate the NTR. In BUDS, the mechanism delivered reports with minimal noise directly to the clients. Here, the proposed mechanism first generates the NTR, then adds noise to the data to obtain a noisy distributional output, and finally converges the noisy output with the NTR. This section concentrates on the NTR generation technique. There are g reduced attributes and n users in the dataset. Let the whole dataset be represented by a vector \(V = (v_1, v_2,\ldots , v_n)\), where \(v_j; j=1:n\) denotes the information vector of the jth user. Each \(v_j\) can be written as \(v_j= (v^1_j, v^2_j, \ldots , v^{g}_j)\), where \(v^i_j; i= 1:g, j=1:n\) denotes the ith record of the jth user. First, a subset of query-related attributes, taken with probability q from the jth user's attributes and denoted \(\mathcal {X}= (v_j^1,\ldots ,v_j^{k^{\prime }})\), is selected; then each \(v_j^i; \ i\, \epsilon \, \mathcal {X}\) is clipped to have maximum \(L_2\) norm \(S^*\) with an aggregator function \(\pi _{S^*}\), where \(\pi _{S^*}(v_j) = v_j \cdot \min (1, \frac{S^*}{||v_j||})\). For the jth user, the estimated aggregated report in BUDS is:
\begin{equation} \hat{v}_\text{jNTR} = \frac{1}{qk^{\prime }}\sum _{i\epsilon \mathcal {X}} \pi _{S^*}(v^i)_j . \end{equation}
(2)
The estimated average aggregate NTR for the whole dataset will be:
\begin{equation} \hat{V}_\text{NTR} = \frac{1}{nqk^{\prime }}\sum _{j=1}^{n}\sum _{i\epsilon \mathcal {X}} \pi _{S^*}(v^i)_j . \end{equation}
(3)
Here, \(S^*\) is the user-provided clipping parameter, i.e., the upper bound on the \(L_2\) norm, and \(E(\mathcal {X})= qk^{\prime }\). This report is saved temporarily but not delivered as the final report. If this report were treated as the outcome and delivered to the clients, it could cause leakage: in the worst case, attackers with enough time and capability could trace back to the actual dataset and leak the data easily, as the dataset contains negligible noise. To avoid this problem, the architecture stores this report only temporarily, until the converger uses it for the privacy-utility tradeoff. The mechanism uses the concept of a separate time window in which the whole dataset along with the NTR is deleted at the end of a particular window, after the converged noisy distributional reports are delivered. This gives attackers only a limited time window in which to go back to the actual data before the dataset is deleted permanently, and going back to the original data within such a short tenure is very difficult, as the architecture uses the notion of IS.
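A minimal sketch of the NTR aggregation in Equations (2) and (3), assuming toy per-user record vectors; the clipping function \(\pi_{S^*}\) and the \(1/(qk')\) scaling follow the definitions above, while the values of q, k', and S* are chosen arbitrarily for illustration.

```python
import numpy as np

def clip_l2(v, s_star):
    """pi_{S*}(v) = v * min(1, S*/||v||_2): clip a record vector to L2 norm S*."""
    norm = np.linalg.norm(v)
    return v * min(1.0, s_star / norm) if norm > 0 else v

def ntr_estimate(user_records, q, k_prime, s_star):
    """Estimated average aggregate NTR over all users (Equation (3))."""
    per_user = [sum(clip_l2(np.asarray(v, dtype=float), s_star) for v in records)
                for records in user_records]            # inner sum over i in X for each user j
    return np.mean(per_user, axis=0) / (q * k_prime)    # 1/(n q k') times the double sum

# Two users, each with k' = 2 query-relevant records (toy numbers).
users = [[[3.0, 4.0], [0.5, 0.5]],
         [[1.0, 0.0], [0.0, 2.0]]]
print(ntr_estimate(users, q=0.5, k_prime=2, s_star=1.0))
```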

4.4.1 NTR Utility.

Utility can be described as the amount of real information we can gain from the input data by asking a particular query or series of queries. In this section, the aim is to derive a strict bound for the loss function, which has a great influence on utility. In addition, we try to establish an optimal randomization function through minimization of the empirical risk. This technique helps to gain maximum utility while keeping an \(\epsilon\) privacy guarantee.
Presume the dataset contains n individuals and k attributes. The query function gives a reduced set of g attributes. After that, the dataset is distributed into \(1, 2,\ldots , t\) batches.
Consider a query, "How many persons are involved with an affair E within a time horizon \([d]\)?", where \(E \ \epsilon \ \mathcal {E}\); \(\mathcal {E} = \lbrace E_1, E_2,\ldots \rbrace\) is the set of all possible affairs and \([d] = \lbrace 1, 2, \ldots , d\rbrace\). Let X and Y be the input and output database at any point of time T. Before Iterative Shuffling, if a count of the people engaged in affair \(E\, \epsilon \, \mathcal {E}\) is taken on database X, it must be \(c = \Sigma _\text{i $\epsilon $ [t]} \Sigma _\text{T $\epsilon $ [d]} X_i [T]\), where i denotes the batch number and \(\Sigma _\text{T $\epsilon $ [d]} X_i [T]\) denotes the number of persons involved with the affair \(E\, \epsilon \, \mathcal {E}\) within the time horizon \([d]\) in the ith batch. The count c gives the number of persons involved in the affair \(E\, \epsilon \, \mathcal {E}\) for the input database within the time horizon \([d]\), which is the noiseless answer to that particular query. The whole input dataset can then be thought of as a vector matrix \(V^* = (v^*_1, v^*_2,\ldots , v^*_n)\), where \(v^*_j; j=1:n\) is the jth user's information vector, and \(v^*_j =(v_j^{*1},v_j^{*2},\ldots , v_j^{*k})\), where \(v_j^{*i}; j =1:n, i=1:k\) denotes the ith record of the jth user. The input dataset count c can then be written as \(c \simeq \hat{V}_\text{Raw} = \frac{1}{nqk^{\prime }}\Sigma _{j=1}^{n}\Sigma _{i\epsilon \mathcal {X}} \pi _{S^*}(v^{*i})_j\), where \(\mathcal {X} =(v_j^1, v_j^2,\ldots , v_j^{k^{\prime }})\) is the query-related subset defined previously. If the same count is taken on the output database Y, with vector matrix \(V =(v_1, v_2,\ldots , v_n)\) and \(v_j = (v_j^1, v_j^2,\ldots , v_j^k)\), then it gives the output count \(c^{\prime } = \Sigma _\text{i $\epsilon $ [t]} \Sigma _\text{T $\epsilon $ [d]} Y_i [T] \simeq \hat{V}_\text{NTR} = \frac{1}{nqk^{\prime }}\Sigma _{j=1}^{n}\Sigma _{i\epsilon \mathcal {X}} \pi _{S^*}(v^i)_j\), which is the NTR count. The main proposal is to find the optimal stretch between these two counts, which has a great bearing on the utility measure. If the two counts are extremely close, then the utility is good enough. The aim is to minimize this distance, which can be measured by the loss function.
For general cases, when the query returns m relevant attributes, the above conditions also apply. To generate the report for a query, the algorithm concentrates only on the tied relevant attributes; this is called the sub-database. The input sub-database \(D_X \subseteq X\) must behave like an adjacent database of the output sub-database \(D_Y \subseteq Y\): the rows of \(D_X\) carry the same information as those of \(D_Y\), the only difference being that the rows of \(D_Y\) are shuffled, i.e., each individual's information no longer belongs to its unique ID. With this in mind, if one aggregates the count of people involved in an affair \(E\, \epsilon \, \mathcal {E}\) over the two adjacent databases \(D_X\) and \(D_Y\), then the count \(c^{\prime }\) must lie in the neighborhood of the count c. As the mechanism adds minimal noise to the data, the distance between c and \(c^{\prime }\) must be negligible, and the two counts differ only by a factor of \(e^{\epsilon _\text{NTR}}\), where \(\epsilon _\text{NTR}\) is the privacy measure of the sub-mechanism of BUDS+ that deals with the almost noiseless dataset.
Thus, we get the following bound:
\begin{equation} c \le e^{\epsilon _\text{NTR}} c^{\prime } , \end{equation}
(4)
when \(\epsilon _\text{NTR} = 0\) (i.e., \(e^{\epsilon _\text{NTR}} = 1\)), \(c = c^{\prime }\), and the utility \(\mathcal {U}(X,Y)\) reaches its highest value. If we take the range of utility to be \([0,1]\), then \(\mathcal {U}(X,Y) = 1\) when \(\epsilon _\text{NTR} = 0\). In this work,
\begin{equation} c \le e^\text{$\ln {[\tfrac{t}{(n_1 - 1)^S}]}$} c^{\prime } . \end{equation}
(5)
By subtracting \(c^{\prime }\) from both sides of Equation (5) and taking absolute values, we get:
\begin{equation} |c - c^{\prime }| \le c^{\prime } \bigg |e^\text{$\ln {[\tfrac{t}{(n_1 - 1)^S}]}$} - 1\bigg | . \end{equation}
(6)
We define the loss function \(\mathcal {L}(c,c^{\prime })\) = \(|c - c^{\prime }|\) and get:
\begin{equation} \mathcal {L}(c,c^{\prime }) \le c^{\prime } \bigg |e^\text{$\ln {[\tfrac{t}{(n_1 - 1)^S}]}$} - 1\bigg | . \end{equation}
(7)
Here, we can write as:
\begin{equation} \mathcal {L}(\hat{V}_\text{Raw},\hat{V}_\text{NTR}) \le \hat{V}_\text{NTR}\ .\ \bigg |e^\text{$\ln {[\tfrac{t}{(n_1 - 1)^S}]}$} - 1\bigg |. \end{equation}
(8)
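A small numeric illustration of Equations (1) and (8), using the hypothetical parameters t = 1,000, n_1 = 11, S = 3 (one of the configurations that also appears in Table 3); the loss bound is expressed as a multiple of the NTR count.

```python
import math

def epsilon_ntr(t, n1, S):
    """epsilon_NTR = ln( t / (n1 - 1)^S )   (Equation (1))."""
    return math.log(t / (n1 - 1) ** S)

def ntr_loss_bound(v_ntr, t, n1, S):
    """Loss bound |c - c'| <= V_NTR * |e^{eps_NTR} - 1|   (Equation (8))."""
    return v_ntr * abs(math.exp(epsilon_ntr(t, n1, S)) - 1.0)

eps = epsilon_ntr(t=1_000, n1=11, S=3)                      # (11-1)^3 = 1000, so eps = ln(1) = 0
print(eps)                                                  # 0.0
print(ntr_loss_bound(v_ntr=500.0, t=1_000, n1=11, S=3))     # 0.0: the counts coincide when eps = 0
```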

4.5 Storing of Noisy Distributional Report (NDR)

BUDS+ deletes the whole dataset after delivering a noisy distributional query report whose privacy budget is strong and whose utility measure is approximately the same as the original NTR privacy and utility guarantee. After IS, BUDS+ uses a noise inserter that applies noise to the shuffled dataset, and the NDR is sent to the Converger, where a bias is added. The Converger makes the distributional reports better ordered through risk minimization, so that the final aggregate report converges to the previous NTR with a high utility bound.

4.5.1 Noise Insertion.

Gaussian noise is used here to perturb the database. First, the individual aggregates are prepared with the aggregator function \(\pi\) for every user of the dataset, and then Gaussian noise is applied to each aggregate, with distribution mean 0 and user-provided variance \(\sigma _j^2\) for the jth user. After that, the estimated average aggregated NDR is computed and sent to the converger.
After adding Gaussian noise, the individual estimated NDR is:
\begin{equation} \hat{v}_\text{jNew} = \frac{1}{qk^{\prime }}\sum _{i\epsilon \mathcal {X}} \pi _{S^*}(v^i)_j + \mathcal {N}(0; \sigma _j^2. I), \end{equation}
(9)
where \(S^*\) and \(\sigma _j^2\) are the user-provided clipping parameter and Gaussian distribution variance for jth user, respectively. As \(E(\mathcal {X}) = qk^{\prime }\), the average aggregate is scaled by \(qk^{\prime }\). Now, this equation can be written as:
\begin{equation} \hat{v}_\text{jNew} = \frac{1}{qk^{\prime }}\sum _{i\epsilon \mathcal {X}} \pi _{S^*}(v^i)_j + \eta _j . \end{equation}
(10)
\(\eta _j\) is a random variable drawn from the distribution \(\mathcal {N}(0;\sigma _j^2)\), restricted to the interval \((-l\sigma _j,l\sigma _j)\) to provide consistent noise to the aggregate at all times. The value of l is chosen so as to achieve the optimal bounds for the bias and the loss function and to maximize the utility.
Now, the estimated average aggregated NDR for all users is:
\begin{equation} \hat{V}_\text{New} = \frac{1}{nqk^{\prime }}\sum _{j=1}^{n}\sum _{i\epsilon \mathcal {X}} \pi _{S^*}(v^i)_j + \frac{1}{n}\sum _{j=1}^{n} \eta _j . \end{equation}
(11)
When \(\eta _j = l\sigma _j\), the upper bound of the estimated average aggregate is:
\begin{equation} \hat{V}_\text{New} = \frac{1}{nqk^{\prime }}\sum _{j=1}^{n}\sum _{i\epsilon \mathcal {X}} \pi _{S^*}(v^i)_j + \frac{l}{n}\sum _{j=1}^{n} \sigma _j, \end{equation}
(12)
and the lower bound will be:
\begin{equation} \hat{V}_\text{New} = \frac{1}{nqk^{\prime }}\sum _{j=1}^{n}\sum _{i\epsilon \mathcal {X}} \pi _{S^*}(v^i)_j - \frac{l}{n}\sum _{j=1}^{n} \sigma _j. \end{equation}
(13)
Here, the final noise applied to the estimated average report for the whole dataset is \(\eta ^{\prime } = \frac{1}{n}\Sigma _{j=1}^{n} \eta _j\). Let \(S_n = \Sigma _{j=1}^{n} \sigma _j\).
Now, if \(\eta _j\, \epsilon \, (-l\sigma _j, l\sigma _j)\), then for the independent sequence \((\eta _1, \eta _2,\ldots , \eta _n)\), the limit \(\lim _{n\rightarrow \infty } \frac{1}{S_n^\text{2+ $\delta $}} \Sigma _{j=1}^{n} E(|\eta _j|^\text{2+$\delta $})\) exists for \(\delta \gt 0\), and the expression is bounded above by \(\frac{l^\text{2+$\delta $}}{S_n^\text{2+$\delta $}} \Sigma \sigma _j^\text{2+$\delta $}\). For the sake of simplicity, taking \(\delta = 1\), this bound becomes \(\frac{l^3}{S_n^3} \Sigma \sigma _j^3\). So \(\eta _j\) satisfies the Lyapunov central limit theorem, from which it follows that, for large n, \(\eta ^{\prime } \sim \mathcal {N}(0; S_n^2)\).
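A sketch of the per-user noise insertion of Equations (9)-(11), with each \(\eta_j\) drawn from \(\mathcal{N}(0, \sigma_j^2)\) and restricted to \((-l\sigma_j, l\sigma_j)\) by simple rejection sampling; the values of \(\sigma_j\) and l below are illustrative choices, not values prescribed by the paper.

```python
import numpy as np

def truncated_gaussian(sigma, l, rng):
    """Draw eta ~ N(0, sigma^2) conditioned on |eta| < l*sigma (rejection sampling)."""
    while True:
        eta = rng.normal(0.0, sigma)
        if abs(eta) < l * sigma:
            return eta

def noisy_average_aggregate(per_user_aggregates, sigmas, l, seed=0):
    """hat{V}_New: mean of per-user aggregates plus the mean of the truncated Gaussian noises."""
    rng = np.random.default_rng(seed)
    noises = [truncated_gaussian(s, l, rng) for s in sigmas]
    return float(np.mean(per_user_aggregates) + np.mean(noises))

aggregates = [1.05, 1.15, 0.90, 1.00]   # toy per-user aggregates, already scaled by 1/(q k')
sigmas = [0.1, 0.1, 0.2, 0.2]           # user-provided noise scales sigma_j
print(noisy_average_aggregate(aggregates, sigmas, l=2.0))
```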

4.5.2 Better Utility with Converger.

The main aim of the Converger is to add a bias, denoted \(\beta\), to the estimated average aggregated NDR and to converge this Noisy Distributional Report to the previously stored Noiseless Temporary Report (NTR), maximizing utility while preserving a strong privacy guarantee. As the noise of the report follows a Gaussian distribution with mean 0 and variance \(S_n^2\), the bias \(\beta\) should be a random variable from the same distribution to achieve proper convergence of the reports. After adding the bias to the NDR, the estimated average aggregated report is:
\begin{equation} \hat{V}_\text{cn} = \frac{1}{nqk^{\prime }}\sum _{j=1}^{n}\sum _{i\epsilon \mathcal {X}} \pi _{S^*}(v^i)_j + \frac{1}{n}\sum _{j=1}^{n} \eta _j + \beta . \end{equation}
(14)
Here, the ratio of the applied noise to the \(l_2\)-sensitivity of the query is \(Z^{\prime } = \frac{S_n}{S^*}\), and if \(Z^{\prime } = \frac{1}{\epsilon }\sqrt {2\ln (1.25/\delta)}\), then the mechanism provides \((q\epsilon , q\delta)\)-DP for the whole database.
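A short check of the calibration stated above: if the noise-to-sensitivity ratio must equal \(Z' = \sqrt{2\ln(1.25/\delta)}/\epsilon\), then the required total noise scale is \(S_n = S^* \cdot Z'\). The parameter values below are hypothetical.

```python
import math

def gaussian_noise_scale(s_star, epsilon, delta):
    """S_n = S* * sqrt(2 ln(1.25/delta)) / epsilon, the scale needed for the (q*eps, q*delta)-DP condition."""
    return s_star * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

print(gaussian_noise_scale(s_star=1.0, epsilon=0.5, delta=1e-5))   # approximately 9.69
```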
Now, the aim is to find an optimal bound \(\beta ^{\prime }\) of the Bias \(\beta\) for which the new report converges to the NTR.
Bound of the Bias: To converge the new report and the NTR, the following condition must hold:
\begin{equation} |\hat{V}_\text{cn} -\hat{V}_\text{NTR} | \rightarrow 0. \end{equation}
(15)
That means
\begin{equation} \left|\frac{1}{n}\sum _{j=1}^{n} \eta _j + \beta \right| \rightarrow 0. \end{equation}
(16)
Now,
\begin{equation} \left|\frac{1}{n}\sum _{j=1}^{n} \eta _j + \beta \right| \ge 0. \end{equation}
(17)
So, as the noise is added within the interval \((-l\sigma _j, l\sigma _j)\), then from Equations (12) and (13) the bound finally appears as:
\begin{equation} - \frac{l}{n}\sum _{j=1}^{n}\sigma _j \le \beta \le \frac{l}{n}\sum _{j=1}^{n}\sigma _j. \end{equation}
(18)
It can be written as
\begin{equation} - \frac{l}{n} S_n \le \beta \le \frac{l}{n}S_n. \end{equation}
(19)
Risk Minimizing: Let \(\varrho\) be a randomized function applied to the BUDS reports that adds the noise and the bias to them, i.e., \(\varrho (\hat{v}_\text{jNTR}) = \hat{v}_\text{jcn}\). The loss function is \(\mathcal {L}(\varrho (\hat{v}_\text{jNTR}),\hat{v}_\text{jcn})\) and the risk function is \(Risk_\text{TRUE}(\varrho (\hat{v}_\text{jNTR}),\hat{v}_\text{jcn}) \triangleq E[\mathcal {L}(\varrho (\hat{v}_\text{jNTR}),\hat{v}_\text{jcn})]\). The target is to find the optimal randomizer \(\varrho ^*\) that minimizes the risk, from which the optimal bias \(\hat{\beta }\) can be found to achieve maximum utility, approximately the same as the NTR utility. To minimize the risk function, the empirical risk is computed as \(Risk_\text{EMP}(\varrho) = \frac{1}{n} \Sigma _{j=1}^{n} \mathcal {L}(\varrho (\hat{v}_\text{jNTR}),\hat{v}_\text{jcn}) + \lambda G(\varrho)\), where G is a regularization term and \(\lambda\) controls the strength of the complexity penalty. An optimal randomization function \(\varrho ^*\) can then be found as \(\varrho ^* = \text{argmin}_\text{$\varrho \epsilon \mathcal {H}$} Risk_\text{EMP}(\varrho)\), where \(\mathcal {H}\) is the hypothesis class containing all possible randomization functions. After finding the best randomization function \(\varrho ^*\), which gives the optimal bias \(\hat{\beta }\), the generated final report is:
\begin{equation} \hat{V}_\text{Final} = \frac{1}{nqk^{\prime }}\sum _{j=1}^{n}\sum _{i\epsilon \mathcal {X}} \pi _{S^*}(v^i)_j + \frac{1}{n}\sum _{j=1}^{n} \eta _j + \hat{\beta }. \end{equation}
(20)
Ultimately, when the final report is generated, it is stored for the next hour and delivered to the users, while the whole dataset is deleted. This report belongs to a particular query, and the whole process is executed within a particular time window. Consequently, if an attacker wants to attack the process and leak the dataset, he or she has only the limited duration of that window; combined with the use of IS, and provided the window is not unexpectedly long, this makes it considerably harder for the attacker to leak the data.
As the dataset is deleted after the final report of a particular query is delivered, for the next query the query function finds the attributes related to the current query and tries to match them with the attributes of previously asked queries. If a match occurs, i.e., the two queries are the same, then the mechanism provides the previously stored noisy report to the user. Otherwise, the system crowdsources the data, applies BUDS+, generates the final report, delivers it to the user, and stores it. In the end, the system deletes the dataset again.
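A minimal sketch of the converger step: the bias \(\beta\) is kept inside the bound derived above and chosen so that the biased noisy report moves toward the stored NTR (Equations (14)-(19)). The clipping to \([-(l/n)S_n, (l/n)S_n]\) is applied explicitly, and all numbers are illustrative, not the authors' settings.

```python
import numpy as np

def converge_report(v_new, v_ntr, sigmas, l, n):
    """Pick a bias inside [-(l/n)S_n, (l/n)S_n] that pulls the noisy report toward the NTR."""
    s_n = float(np.sum(sigmas))                    # S_n = sum_j sigma_j
    bound = l * s_n / n
    beta = np.clip(v_ntr - v_new, -bound, bound)   # the ideal bias would cancel the average noise
    return v_new + beta, beta

v_ntr = 1.025                                      # noiseless temporary report (toy value)
v_new = 1.090                                      # noisy distributional report (toy value)
converged, beta = converge_report(v_new, v_ntr, sigmas=[0.1, 0.1, 0.2, 0.2], l=2.0, n=4)
print(beta, converged)                             # beta stays within +/- (l/n)*S_n = 0.3
```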

4.6 Discarding Database

After the final report is generated, BUDS+ stores all the distributional noisy reports and discards the whole dataset. This technique not only solves the lack-of-memory problem but also shortens the time bound for adversarial attackers. Once a noisy report is generated, it is stored in memory and the input dataset is removed. For the next query within a particular time interval, the query function checks whether this query has been asked previously. If it matches a previous one, then the answer must be stored in memory; the query function searches for this answer according to the importance of the attributes present in the queries and delivers the existing report to the clients. If the query is new and does not match any other, or concerns a different time interval for which the information has not yet been collected, the system crowdsources again to collect a fresh dataset. All the previous mechanisms are then applied to the new input database to generate the NDR for the new query; the report is stored and delivered to the clients, and the dataset is discarded again. This makes the data more secure against a sudden attack: anyone who wants to attack the database by backtracking or by linking with other datasets has only a very short time in which to do so, as the system discards the dataset immediately after report generation. Meanwhile, repeated crowdsourcing and data updates collect all the current information over time, and the presence of noise maintains the longitudinal privacy guarantee. This technique contributes to the following (a minimal sketch of the control flow is given after this list):
Solves the problem of lack of memory.
Makes dataset more secure by reducing time bound for the attacker.
Helps to keep the data up to date.
Maintains longitudinal privacy guarantee.
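A sketch of the cache-then-discard control flow described above; generate_noisy_report is a hypothetical stand-in for the full encode-shuffle-noise-converge pipeline and simply perturbs a count here, so the focus is on reusing stored reports and dropping the raw dataset after each answer.

```python
import random

def generate_noisy_report(dataset, predicate):
    """Stand-in for the full BUDS+ pipeline: count matching rows and perturb the count."""
    true_count = sum(1 for row in dataset if predicate(row))
    return true_count + random.gauss(0.0, 1.0)

def answer_query(query_key, predicate, crowdsource, cache):
    """Serve from the report cache if possible; otherwise crowdsource, report, and discard."""
    if query_key in cache:                 # previously asked query: reuse the stored noisy report
        return cache[query_key]
    dataset = crowdsource()                # fresh collection for a new query
    report = generate_noisy_report(dataset, predicate)
    cache[query_key] = report              # only the noisy report is retained
    del dataset                            # the raw data are discarded immediately
    return report

cache = {}
crowdsource = lambda: [{"age": random.randint(20, 70)} for _ in range(1000)]
print(answer_query("age<40", lambda r: r["age"] < 40, crowdsource, cache))
print(answer_query("age<40", lambda r: r["age"] < 40, crowdsource, cache))  # cache hit, no recollection
```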

4.6.1 Privacy Guarantee.

Here, the privacy guarantee is calculated considering the application of noise and bias. Let X and Y be two random variables, where X denotes the event that the response changes significantly due to the application of noise and Y denotes the event that the noisy response changes significantly due to the application of the bias. Let \(P(X) = P(Response\ will\ be \ changed\ by\ noise) = P\) and \(P(Y) = P(Response\ will\ be\ changed\ by\ the\ Bias) = Q\), where \(0\lt P\lt 1, 0\lt Q\lt 1\). If P and Q take the two extreme values 0 and 1, then the system becomes biased and not robust, so our theory is based on these two probabilities belonging to the open set (0, 1). Then \(P(Response\ will\ not\ be\ changed\ by\ noise) = 1-P(X) = 1-P\) and \(P(Response\ will\ not\ be\ changed\ by\ the\ Bias) = 1-P(Y) = 1-Q\). Now, for the final report, \(P(Response \ is \ TRUE\ |\ Response\ of\ NTR\ was\ TRUE)\) covers two cases: either the NTR response is changed by neither the noise nor the bias, or it is changed by both (the two changes cancel).
Then, \(P(Response \ is \ TRUE\ |\ Response \ of \ NTR \ was\ TRUE) = P(X).P(Y) + [1-P(X)].[1-P(Y)] = P.Q + (1-P).(1-Q)\). Similarly, we can say, \(P(Response\ is\ FALSE\ |\ Response\ of\ NTR\ was\ TRUE)\) contains two cases, the response is changed by the noise but not changed by the Bias and vice versa.
Then, \(P(Response \ is \ FALSE\ |\ Response \ of \ NTR\ was \ TRUE) = P(X).[1-P(Y)] + [1-P(X)].P(Y) = P.(1-Q) + (1-P).Q\). Now,
\begin{equation} RR_\text{$\mathcal {N}$. $\hat{\beta }$} = \frac{P(BUDS+ \ Response = TRUE \ | \ NTR \ response = TRUE)}{P(BUDS+ \ Response= FALSE\ |\ NTR\ response= TRUE)} . \end{equation}
(21)
Then,
\begin{equation} RR_\text{$\mathcal {N}$. $\hat{\beta }$} = \frac{P.Q + (1-P).(1-Q)}{ P.(1-Q) + (1-P).Q} . \end{equation}
(22)
So, the privacy guarantee for this is:
\begin{equation} \epsilon ^{\prime }_\text{$\mathcal {N}$. $\hat{\beta }$} = \ln {\bigg [\frac{P.Q + (1-P).(1-Q)}{ P.(1-Q) + (1-P).Q}\bigg ]} . \end{equation}
(23)
Here, \(q\epsilon ^{\prime }_\text{$\mathcal {N}$. $\hat{\beta }$}\) will be the privacy parameter of this mechanism, which is greater than 0 but significantly small. Then, we can say,
\begin{equation} \epsilon _\text{$\mathcal {N}$. $\hat{\beta }$} = q\ln {\bigg [\frac{P.Q + (1-P).(1-Q)}{ P.(1-Q) + (1-P).Q}\bigg ]} . \end{equation}
(24)
\(\epsilon _\text{$\mathcal {N}$. $\hat{\beta }$}\) is the privacy parameter of NDR.

4.7 Total Privacy Guarantee

The NTR privacy guarantee is \(\epsilon _\text{NTR} = \ln {[\frac{t}{(n_1 - 1)^S}]}\), where t is the number of batches, \(n_1\) is the batch size, and S is the number of shufflers. The privacy guarantee of the next mechanism, which adds the noise and bias to the aggregate, is \(\epsilon _\text{$\mathcal {N}$. $\hat{\beta }$} = q\ln {[\frac{P.Q + (1-P).(1-Q)}{ P.(1-Q) + (1-P).Q}]}\) from Equation (24). Then the total privacy guarantee for BUDS+ is:
\begin{equation} \epsilon _\text{Final} = \ln {\bigg [\bigg (\frac{t}{(n_1 - 1)^S}\bigg) \times \bigg (\frac{P.Q + (1-P).(1-Q)}{ P.(1-Q) + (1-P).Q}\bigg)^q\bigg ] } . \end{equation}
(25)
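A short helper computing the total budget of Equation (25) from its components. The parameter values below (t = 1,000, n_1 = 11, S = 3, P = 0.30, Q = 0.50, q = 0.70) are one of the hypothetical configurations listed in the experiments; the formula gives \(\epsilon_\text{Final} = 0\) for them, which lines up with the corresponding zero-\(\epsilon\) row of Table 4.

```python
import math

def epsilon_final(t, n1, S, P, Q, q):
    """epsilon_Final = ln[ (t/(n1-1)^S) * ((PQ + (1-P)(1-Q)) / (P(1-Q) + (1-P)Q))^q ]."""
    shuffle_term = t / (n1 - 1) ** S
    noise_bias_ratio = (P * Q + (1 - P) * (1 - Q)) / (P * (1 - Q) + (1 - P) * Q)
    return math.log(shuffle_term * noise_bias_ratio ** q)

print(epsilon_final(t=1_000, n1=11, S=3, P=0.30, Q=0.50, q=0.70))   # 0.0
```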
Theorem 4.1.
The final noisy aggregated count report holds the differential privacy guarantee.
Proof.
Let us assume we have a dataset with n individuals' information, and for a particular query \(\mathcal {Q}\) the mechanism gives the noisy report r. Consider another dataset with the same n individuals' information along with an additional row containing the information of one extra person, so this second database contains \(n+1\) rows. For the same query \(\mathcal {Q}\), let the mechanism give the noisy answer \(r^{\prime }\) on the second database. As these two databases differ in only one row, they are neighboring datasets of each other. Both answers contain noise drawn from the same Gaussian distribution with mean 0 and variance \(S^2_n\), and the amount of noise in these counts determines how much the two counts differ from each other. The privacy parameter \(\epsilon _{Final}\) of the mechanism plays an important role here. As the two databases differ in only one row and the count contains some extra noise \(|\eta ^{\prime }| = |\frac{1}{n+1} \Sigma _{j=1}^{n+1} \eta _j - \frac{1}{n} \Sigma _{j=1}^{n} \eta _j |\), the counts cannot reveal the exact answers from these datasets, so for two neighboring datasets it is hard to tell the two counts apart. The closeness of the two answers depends indirectly on the privacy budget of the mechanism, as the relation between the privacy budget and the noise insertion is prominent and was proved in the previous section. Keeping these facts in mind, it can be said that \(P(r =r_1 | r \ is \ the \ answer \ count\ from\ the\ first\ database) \le e^{\epsilon _{Final}} P(r^{\prime }=r_1 | r^{\prime } \ is \ the\ answer\ count\ from\ the\ second\ database) + q\delta\), as this mechanism gives an \((\epsilon _{Final}, q\delta)\) privacy guarantee. When \(\epsilon \rightarrow 0\) and \(\delta \rightarrow 0\), r and \(r^{\prime }\) are almost the same, i.e., for two neighboring datasets the proposed mechanism gives almost the same answer count for a given query \(\mathcal {Q}\). So it can be stated that the final answer count produced by this algorithm holds a differential privacy guarantee.□

4.8 Total Utility Bound and Risk Minimization

Here, the input database is \(I=(i_1,i_2,\ldots , i_n)\) and the output database is \(O=(o_1,o_2,\ldots , o_n)\). Let a randomization function \(\mathcal {R}(\varrho ())\) be applied to the input dataset to obtain the final output dataset, i.e., \(\mathcal {R}(\varrho (i_j)) = o_j\) for \(j = 1:n\). Before this algorithm is applied to the dataset, the raw aggregate count for a particular query is \(\hat{V}_\text{Raw}\); let the final aggregate count be \(\hat{V}_\text{Final}\). Then the total utility bound for BUDS+ is:
\begin{equation} \hat{V}_\text{Raw} \le \hat{V}_\text{Final} . e^\text{$ \ln {[(\tfrac{t}{(n_1 - 1)^S}) \times (\tfrac{P.Q + (1-P).(1-Q)}{ P.(1-Q) + (1-P).Q})^q] } $} + q\delta . \end{equation}
(26)
And the bound for loss function is:
\begin{align} & \mathcal {L}(\hat{V}_\text{Raw},\hat{V}_\text{Final}) \end{align}
(27)
\begin{align} & = | \hat{V}_\text{Raw} - \hat{V}_\text{Final}| \end{align}
(28)
\begin{align} &\le \hat{V}_\text{Final} . \bigg [e^\text{$ \ln {[(\tfrac{t}{(n_1 - 1)^S}) \times (\tfrac{P.Q + (1-P).(1-Q)}{ P.(1-Q) + (1-P).Q})^q] }$} - 1\bigg ]+q\delta . \end{align}
(29)
The aim is to minimize the risk in order to maximize the utility. For that, the risk function is computed here:
\begin{align} & Risk_\text{True}(\mathcal {R}(\varrho ())) \end{align}
(30)
\begin{align} &= E[\mathcal {L}(\hat{V}_\text{Raw},\hat{V}_\text{Final})] \end{align}
(31)
\begin{align} & = \int \int P(\hat{V}_\text{Raw},\hat{V}_\text{Final}^{\prime }) \mathcal {L}(\hat{V}_\text{Raw},\hat{V}_\text{Final}) . d\hat{V}_\text{Raw} . d\hat{V}_\text{Final} . \end{align}
(32)
Here, \(P(\hat{V}_\text{Raw},\hat{V}_\text{Final}^{\prime })\) is the distribution of a sample dataset containing n data points drawn randomly from a population following the distribution \(\mu (Z)\) over \(Z = \hat{V}_\text{Raw} \times \hat{V}_\text{Final}: \lbrace (\hat{V}_\text{1Raw} , \hat{V}_\text{1Final}), (\hat{V}_\text{2Raw} , \hat{V}_\text{2Final}),\ldots , (\hat{V}_\text{nRaw} , \hat{V}_\text{nFinal})\rbrace\), and
\begin{equation} P(\hat{V}_\text{Raw},\hat{V}_\text{Final}^{\prime }) = P(\hat{V}_\text{Raw}|\hat{V}_\text{Final}^{\prime }) . P(\hat{V}_\text{Raw}). \end{equation}
(33)
The empirical risk will be,
\begin{align} & Risk_\text{EMP}(\mathcal {R}(\varrho ())) \end{align}
(34)
\begin{align} &= \frac{1}{n} \sum _{j=1}^{n} \mathcal {L}(\hat{V}_\text{jRaw},\hat{V}_\text{jFinal}) + \lambda G(\mathcal {R}(\varrho ())) \end{align}
(35)
\begin{align} &\le \frac{1}{n} \sum _{j=1}^{n} \hat{V}_\text{jFinal} . e^{\epsilon _\text{Final}} + \lambda G(\mathcal {R}(\varrho ())), \end{align}
(36)
where G is the regularization parameter and \(\lambda\) controls the strength of the complexity penalty.
The process finds the best randomization function \(\mathcal {R^*}(\varrho ^*())\) for which the balance between the privacy and the utility of the mechanism is well maintained:
\begin{equation} \mathcal {R^*}(\varrho ^*()) = \mathop{\text{argmin}}\limits_\text{$\mathcal {R}(\varrho ()) \epsilon \mathcal {H}$} Risk_\text{EMP}(\mathcal {R}(\varrho ())), \end{equation}
(37)
where \(\mathcal {H}\) is the hypothesis class containing all possible randomization functions. By finding the optimal randomized function, the minimum values for \(\epsilon _\text{Final}\) and \(\delta\) can be achieved, where for \(\mathcal {R^*}(\varrho ^*())\), \(\epsilon _\text{Final}\rightarrow 0\), \(\delta \rightarrow 0\), and \(\mathcal {U}(X,Y)\rightarrow 1\). Here, X and Y are the input and output databases.
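A toy sketch of the empirical risk minimization in Equations (34)-(37): a small hypothesis class of candidate randomizers (parameterized here only by a bias value) is scored by average loss plus a regularization term λG, and the argmin is returned. The candidate class, loss, and regularizer G(β) = β² are illustrative stand-ins, not the authors' choices.

```python
import numpy as np

def empirical_risk(bias, raw_counts, noisy_counts, lam):
    """Risk_EMP = (1/n) * sum_j |raw_j - (noisy_j + bias)| + lambda * G(bias), with G(bias) = bias^2."""
    losses = np.abs(np.asarray(raw_counts) - (np.asarray(noisy_counts) + bias))
    return float(np.mean(losses) + lam * bias ** 2)

def best_randomizer(raw_counts, noisy_counts, candidate_biases, lam=0.01):
    """argmin over a finite hypothesis class of bias values."""
    risks = [empirical_risk(b, raw_counts, noisy_counts, lam) for b in candidate_biases]
    return candidate_biases[int(np.argmin(risks))]

raw = [100.0, 52.0, 73.0]                        # hat{V}_jRaw (toy)
noisy = [101.8, 53.9, 74.7]                      # noisy counts before the bias is chosen (toy)
candidates = np.linspace(-3.0, 3.0, 61)
print(best_randomizer(raw, noisy, candidates))   # around -1.8: the bias that minimizes the empirical risk
```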
The architecture of the proposed extension is in Figure 1:
Fig. 1.
Fig. 1. Architecture of BUDS+. It consists of the following parts: Data collections and encoding, Query function application, Iterative Shuffling, Noise, and Bias insertion, Report generation, Data Discarding method, and report storing.
Fig. 2.
Fig. 2. Improvement over BUDS in BUDS+.

5 Experiments

The privacy and utility benefits of the BUDS+ scheme were compared against the state-of-the-art methods ARA [31] and RAPPOR [21]. In this section, we describe the dataset used and the experiments performed.

5.1 Dataset

The dataset used was the Toy Dataset from Kaggle [1]. This dataset is long-tailed and biased, which makes it suitable for privacy analysis and for techniques like RAPPOR. It consists of 150,000 rows and 6 columns, namely:
(1)
Number: index number for row
(2)
City: location of the person (8 cities in total)
(3)
Gender: Male or Female
(4)
Age: ranging from 25 to 65 years
(5)
Income: ranging from \(-\)674 to 17,717
(6)
Illness: yes or no.

5.2 Results

The experimental results are shown in Figure 3. BUDS+ shows an improvement in utility over ARA and RAPPOR for the same privacy settings. Utility is measured as recall, i.e., how relevant the aggregated differentially private reports are with respect to the true values. For RAPPOR, privacy is twofold: one budget for the permanent report and one for the instantaneous report. We tried 2, 4, 8, 16, and 32 hash functions; better results were obtained with fewer hash functions. The RAPPOR outputs are used in ARA for aggregation, and the corresponding epsilon values from RAPPOR were used in ARA for a fair comparison. In BUDS+, Gaussian noise and bias were introduced on top of the differentially private reports along with the iterative shuffling. The looped application of risk minimization increases the utility of the differentially private reports generated by BUDS+, as demonstrated in Figure 3.
Fig. 3.
Fig. 3. Experimental results comparing BUDS+ with ARA [31] and RAPPOR [21].

6 Discussion

Table 3 shows how the risk minimizer obtains the best privacy parameter by varying the values of the parameters in the final privacy-budget equations, and Table 4 shows the optimal final results obtained by the risk minimizer.
Table 3.
n | t, \(n_1\), S, \(\epsilon _\text{BUDS}\) | P | Q | q | \(\delta\) | \(\epsilon _\text{Final}\) | Loss | Utility
11,000 | 500, 22, 3, 0.02 | 0.01 | 0.99 | 0.02 | 0.02 | 0.21 | 0.233 \(\hat{V}_\text{Final}\) + 0.0004 | Medium
 | | 0.80 | 0.20 | 0.30 | 0.001 | 0.87 | 1.387 \(\hat{V}_\text{Final}\) + 0.0003 | Low
 | | 0.99 | 0.99 | 0.01 | 0.02 | 0.16 | 0.173 \(\hat{V}_\text{Final}\) + 0.0002 | High
 | 1,000, 11, 3, 0 | 0.01 | 0.99 | 0.02 | 0.04 | 0.09 | 0.094 \(\hat{V}_\text{Final}\) + 0.008 | Medium
 | | 0.20 | 0.20 | 0.50 | 0.00 | 0.38 | 0.462 \(\hat{V}_\text{Final}\) + 0.00 | Low
 | | 0.30 | 0.50 | 0.70 | 0.001 | 0.000 | 0.00 \(\hat{V}_\text{Final}\) + 0.007 | High
100,000 | 5,500, 18, 3, 0.11 | 0.01 | 0.99 | 0.02 | 0.02 | 0.20 | 0.221 \(\hat{V}_\text{Final}\) + 0.0004 | Medium
 | | 0.20 | 0.20 | 0.50 | 0.001 | 0.49 | 0.632 \(\hat{V}_\text{Final}\) + 0.0005 | Low
 | | 0.30 | 0.50 | 0.70 | 0.001 | 0.10 | 0.105 \(\hat{V}_\text{Final}\) + 0.0007 | High
 | 2,200, 45, 2, 0.1 | 0.01 | 0.99 | 0.02 | 0.05 | 0.19 | 0.209 \(\hat{V}_\text{Final}\) + 0.001 | Medium
 | | 0.80 | 0.20 | 0.30 | 0.001 | 0.85 | 1.339 \(\hat{V}_\text{Final}\) + 0.0003 | Low
 | | 0.99 | 0.99 | 0.01 | 0.01 | 0.14 | 0.010 \(\hat{V}_\text{Final}\) + 0.0001 | High
1,000,000 | 31,000, 32, 3, 0.03 | 0.01 | 0.99 | 0.02 | 0.02 | 0.12 | 0.127 \(\hat{V}_\text{Final}\) + 0.0004 | Medium
 | | 0.20 | 0.20 | 0.50 | 0.001 | 0.41 | 0.506 \(\hat{V}_\text{Final}\) + 0.0005 | Low
 | | 0.30 | 0.50 | 0.70 | 0.001 | 0.03 | 0.030 \(\hat{V}_\text{Final}\) + 0.0007 | High
 | 1,000, 100, 2, 0.2 | 0.01 | 0.99 | 0.02 | 0.02 | 0.11 | 0.116 \(\hat{V}_\text{Final}\) + 0.0004 | Medium
 | | 0.80 | 0.20 | 0.30 | 0.001 | 0.77 | 1.159 \(\hat{V}_\text{Final}\) + 0.0003 | Low
 | | 0.99 | 0.99 | 0.01 | 0.01 | 0.06 | 0.062 \(\hat{V}_\text{Final}\) + 0.0001 | High
100,000,000 | 1,000,000, 99, 3, 0.03 | 0.01 | 0.99 | 0.02 | 0.02 | 0.12 | 0.127 \(\hat{V}_\text{Final}\) + 0.0004 | Medium
 | | 0.20 | 0.20 | 0.50 | 0.001 | 0.41 | 0.506 \(\hat{V}_\text{Final}\) + 0.0005 | Low
 | | 0.30 | 0.50 | 0.70 | 0.001 | 0.03 | 0.030 \(\hat{V}_\text{Final}\) + 0.0007 | High
 | 21,800, 458, 2, 0.04 | 0.01 | 0.99 | 0.02 | 0.02 | 0.13 | 0.138 \(\hat{V}_\text{Final}\) + 0.0004 | Medium
 | | 0.80 | 0.20 | 0.30 | 0.001 | 0.79 | 1.203 \(\hat{V}_\text{Final}\) + 0.0003 | Low
 | | 0.99 | 0.99 | 0.01 | 0.01 | 0.08 | 0.083 \(\hat{V}_\text{Final}\) + 0.0001 | High
Table 3. Empirical Results of Risk Minimizer to Obtain the Best Privacy Budget
Table 4.
n | t, \(n_1\), S, \(\epsilon _\text{BUDS}\) | P | Q | q | \(\delta\) | \(\epsilon _\text{Final}\) | Utility
11,000 | 500, 22, 3, 0.02 | 0.99 | 0.99 | 0.01 | 0.02 | 0.16 | High
 | 1,000, 11, 3, 0 | 0.30 | 0.50 | 0.70 | 0.001 | 0.00 | High
100,000 | 5,500, 18, 3, 0.11 | 0.30 | 0.50 | 0.70 | 0.001 | 0.10 | High
 | 2,200, 45, 2, 0.1 | 0.99 | 0.99 | 0.01 | 0.01 | 0.14 | High
1,000,000 | 31,000, 32, 3, 0.03 | 0.30 | 0.50 | 0.70 | 0.001 | 0.03 | High
 | 1,000, 100, 2, 0.2 | 0.99 | 0.99 | 0.01 | 0.01 | 0.06 | High
100,000,000 | 1,000,000, 99, 3, 0.03 | 0.30 | 0.50 | 0.70 | 0.001 | 0.03 | High
 | 21,800, 458, 2, 0.04 | 0.99 | 0.99 | 0.01 | 0.01 | 0.08 | High
Table 4. Optimal Result Achieved by Risk Minimizer from Table 3
From the preceding tables, it can be seen that as the dimension of the dataset increases, the privacy parameter of this mechanism takes more reliable values; that is, a more faithful optimal privacy guarantee can be achieved with a big dataset. This result also agrees with the converger function's working principle, since it is supported by the Lyapunov Central Limit Theorem, which holds for large datasets. Moreover, the mechanism works with Gaussian noise, which gives the dataset (\(\epsilon\), \(\delta\)) DP; this usually works well with large databases, where the chance of a privacy leak is high and the sensitivity may not be low. So we can say that the proposed mechanism works sufficiently well with high-dimensional datasets and gives an optimal privacy-utility tradeoff.

6.1 Reason behind Consideration of Gaussian Noise

Gaussian, Laplacian, and exponential distributions are the three most popular distributions used for noise insertion when providing differential privacy to client databases. In Reference [17], Dwork et al. proved that Laplacian and exponential noise provide (\(\epsilon\), 0) differential privacy (i.e., \(\delta = 0\), where \(\delta\) is the probability of accidental leakage of user data), which is preferable only for databases where the chance of accidental leakage of user information is very low. In a very high-dimensional dataset, that chance is high, and for this type of database an (\(\epsilon\), \(\delta\)) guarantee with \(\delta \ne 0\) is the better option. Proper tuning of \(\delta\) not only gives a good bound on accidental leakage in high-dimensional databases but also works well on smaller datasets. The proposed algorithm therefore uses Gaussian noise, which provides (\(\epsilon\), \(\delta\)) DP [17] and thus an excellent privacy guarantee irrespective of the cardinality of the database. Gaussian noise is also preferred because any absolutely continuous distribution whose space is topologically equivalent to the Gaussian space can be transformed into a Gaussian distribution through a suitable transformation (for example, the Box-Cox or Box-Muller transformation) [12, 13, 22]. In the case of the Laplace (double exponential) mechanism, which satisfies the conditions of absolute continuity and topological equivalence [22], the corresponding Laplace distribution can easily be converted into a Gaussian distribution and used in the proposed algorithm. An exponential distribution can first be transformed into a uniform distribution [6] and then into a Gaussian distribution via the Box-Muller transformation [33]. Table 5 compares the different noise distributions.
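To make this transformation chain concrete, the following sketch (an illustration assuming unit-rate exponential samples, not part of the BUDS+ implementation) maps exponential samples to uniform ones via the probability integral transform and then to Gaussian samples with the Box-Muller transformation [6, 33]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Exponential samples with rate 1 (assumed for illustration).
x = rng.exponential(scale=1.0, size=100_000)

# Probability integral transform: U1 = exp(-x) is Uniform(0, 1) when x ~ Exp(1).
u1 = np.exp(-x)
u2 = rng.uniform(size=x.size)          # an independent uniform stream

# Box-Muller transformation: two independent uniforms -> one standard normal sample.
z = np.sqrt(-2.0 * np.log(u1)) * np.cos(2.0 * np.pi * u2)

print(z.mean(), z.std())               # approximately 0 and 1
```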
| | Gaussian | Laplace | Exponential |
|---|---|---|---|
| Differential privacy | Provides (\(\epsilon , \delta\)) DP | Provides (\(\epsilon , 0\)) DP | Provides (\(\epsilon , 0\)) DP |
| Dimension of the dataset | Compatible with all database dimensions | Preferable for smaller-dimensional datasets | Works well with smaller-dimensional datasets |
| Ease of transformation | Gaussian transformation of other distributions is comparatively easy | Laplacian transformation is more complex | Very few distributions can be transformed into an exponential |
| Sensitivity | \(l_2\) | \(l_1\) | \(l_1\) |

Table 5. Noise Comparison
Figure 4 shows what happens when the different noise distributions are applied to the same base distribution. Here, a standard normal distribution is chosen, and Gaussian, Laplace, and exponential noise is added in Figures 4(a), 4(b), and 4(c), respectively.
Fig. 4. Gaussian (4(a)), Laplace (4(b)), and Exponential (4(c)) noise application on a normal distribution.
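A minimal way to reproduce an experiment of this kind (assuming standard NumPy samplers and unit scale parameters, not necessarily the exact settings used for Figure 4) is to add samples from each noise distribution to the same standard-normal base data:

```python
import numpy as np

rng = np.random.default_rng(42)
base = rng.standard_normal(10_000)          # standard normal "data"

# Add Gaussian, Laplace, and exponential noise to the same base samples.
noisy = {
    "gaussian":    base + rng.normal(0.0, 1.0, base.size),
    "laplace":     base + rng.laplace(0.0, 1.0, base.size),
    "exponential": base + rng.exponential(1.0, base.size),
}

for name, values in noisy.items():
    print(f"{name:12s} mean={values.mean():+.3f} std={values.std():.3f}")
```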

7 Conclusion

Differential privacy has shown a lot of potential for safeguarding user privacy by adding noise to the database. The BUDS framework was introduced to provide one of the best utility-privacy tradeoffs, using iterative shuffling and attribute merging to improve utility while providing comparable privacy guarantees. BUDS+ is an improved framework based on BUDS [34] that further optimizes the privacy-utility tradeoff, making it practical for real-life use. Iterative shuffling with the noisy report, data discarding, and the converger function together contribute to the boost in utility. Thanks to its modularity, the framework also opens up research problems in individual modules, such as the representation of the query function and new embedding techniques that could further enhance its robustness, along with the promise of extensive experiments with various options for the different BUDS+ modules.

A Background Theorems

This section collects some definitions and theorems that are used throughout this work.
Definition A.1 (Local Differential Privacy (LDP) [14, 17, 35]).
This model is based on the randomized response technique introduced in 1965, a simple technique in which each user perturbs his or her own answer according to a coin-toss probability. The model was formally introduced in Reference [24]. In this model, the distribution of the rest of the data is assumed to remain stable even when a single user suddenly changes his or her response. This simple trust model, which requires no third-party assurance at all, is the main attraction of LDP and the reason it is widely adopted in industrial implementations today.
Definition A.2 (Central Differential Privacy (CDP) [17, 35]).
Here, a trusted data curator plays the main role: the curator adds uncertainty to the answers by injecting random noise, which yields differential privacy. The entire process takes place in response to the queries of an untrusted data analyst; because the answer to each query carries only a small fraction of the information in the whole centralized dataset, differential privacy can be established.
Definition A.3 (One-Hot Encoding).
This type of encoding has a strong impact on the utility of a differentially private algorithm. Let D be a dictionary of elements. If the size of D is not too large, then a data input x can be encoded with one-hot encoding, where each data record x holds one element of D. When D is large, one-hot encoding is not a suitable choice, and a sketching algorithm can be used instead.
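As a generic illustration (the dictionary below is hypothetical and unrelated to any dataset used in this work), a small dictionary D can be one-hot encoded as follows:

```python
# One-hot encoding over a small, hypothetical dictionary D.
D = ["cold", "flu", "diabetes", "asthma"]            # dictionary of elements
index = {element: i for i, element in enumerate(D)}  # element -> position

def one_hot(x, dictionary_size=len(D)):
    """Encode a single record x as a 0/1 vector of length |D|."""
    vector = [0] * dictionary_size
    vector[index[x]] = 1
    return vector

print(one_hot("flu"))   # [0, 1, 0, 0]
```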
Definition A.4 (Probability Simplex).
Given a discrete set \(\mathcal {B}\), the probability simplex over \(\mathcal {B}\), denoted \(\Delta (\mathcal {B})\), is defined to be
\begin{equation} \Delta (\mathcal {B}) = \left\lbrace x \in \mathbb {R}^{|\mathcal {B}|} : x_i \ge 0 \text{ for all } i, \ \sum _{i=1}^{|\mathcal {B}|} x_i = 1 \right\rbrace . \end{equation}
(38)
Definition A.5 (Randomized Algorithm [15, 17, 35]).
A randomized algorithm \(\mathcal {M}\) with domain \(\mathcal {A}\) and discrete range \(\mathcal {B}\) is associated with a mapping \(\mathcal {M} : \mathcal {A} \rightarrow \Delta (\mathcal {B})\). On input \(a \in \mathcal {A}\), the algorithm \(\mathcal {M}\) outputs \(\mathcal {M}(a) = b\) with probability \((\mathcal {M}(a))_b\) for each \(b \in \mathcal {B}\). The probability space is over the coin flips of the algorithm \(\mathcal {M}\).
Definition A.6 (Differential Privacy [15, 16, 17, 25, 35]).
A randomized algorithm \(\mathcal {M}\) with domain \(\mathbb {N}^{|\mathcal {X^*}|}\) is (\(\epsilon , \delta\))-differentially private if for all \(S^{\prime } \subseteq \text{Range}(\mathcal {M})\) and for all \(x, y \in \mathbb {N}^{|\mathcal {X^*}|}\) such that \(\Vert x - y\Vert _1 \le 1\):
\begin{equation} P[\mathcal {M}(x) \in S^{\prime }] \le \exp (\epsilon)\, P[\mathcal {M}(y) \in S^{\prime }] + \delta . \end{equation}
(39)
If \(\delta = 0\), then we say that \(\mathcal {M}\) is \(\epsilon\)-differentially private.
Definition A.7 (Randomized Response Technique [17, 35]).
Let E be an event and the query be "Are you engaged with E?" To answer, the participant performs the following steps:
1.
Flip a coin.
2.
If the outcome is tails, answer truthfully.
3.
If the outcome is heads, flip a second coin.
4.
If the second outcome is also heads, respond untruthfully.
This procedure is called the Randomized Response Technique.
For an (\(\epsilon , \delta\))-differentially private mechanism, the value of \(\epsilon\) is obtained from \(RR = \frac{P[\text{Response} = \text{``Yes''} \mid \text{Truth} = \text{``Yes''}]}{P[\text{Response} = \text{``Yes''} \mid \text{Truth} = \text{``No''}]}\),
where \(\epsilon = \ln {(RR)}\).

| Symbol | Description |
|---|---|
| \(X_d\) | Distribution of sample \(d \in \mathbb {N}\) |
| \(\mathbb {N}\) | Set of natural numbers |
| \(\mathcal {X^*}\) | Set of all distributions on \(X_d\) |
| \(\mathcal {M}\) | Randomized algorithm |
| X | Input dataset |
| Y | Output dataset |
| \(\Delta f\) | \(l_1\) sensitivity |
| \(\Delta _2 f\) | \(l_2\) sensitivity |
| \(\mathbb {R}\) | Set of real numbers |
| \(Z^*\) | Standard normal random variable |
| \(Q^*\) | Query |
| S | Number of shufflers and number of groups of attributes |
| k | Number of attributes in the input dataset |
| n | Number of users |
| g | Number of reduced attributes after applying the query function |
| m | Number of attribute names given by the query function |
| t | Number of batches |
| \(n_i\) | Size of the ith batch, \(i = 1:t\) |
| \(\mathcal {X}\) | Attribute subset containing the related attributes of a particular user for a particular query |
| \(S^*\) | User-given clipping parameter and upper bound of the \(l_2\) norm |
| \(\sigma ^2_j\) | Variance of the Gaussian noise provided by the jth user |
| \(Z^{\prime }\) | Ratio of noise applied according to the \(l_2\) sensitivity |
| l | Stringency of the noise bound, for consistency of the noise |
| q | Probability of selecting an attribute for the subset \(\mathcal {X}\) |
| \(\pi _{S^*}\) | Aggregator function |
| \(k^{\prime }\) | Number of attributes in the subset \(\mathcal {X}\) |
| \(\epsilon\) | Privacy budget |
| \(\delta\) | Probability of accidental information leakage |
| \(\varepsilon\) | Set of all events |
| E | An event |
| \(V^*\) | Input dataset vector matrix |
| \(v^{*i}_j\) | ith record of the jth user, \(i=1:k^{\prime }; j=1:n\) |
| V | Output dataset vector matrix |
| \(\hat{V}_\text{Raw}\) | Average aggregate of the input dataset before applying any randomized algorithm |
| \(\hat{V}_\text{NTR}\) | Average aggregate Noiseless Temporary Report |
| \(\hat{V}_\text{New}\) | Average aggregate of the noisy distributional report |
| \(\hat{V}_\text{cn}\) | Average aggregate report provided by the Converger |
| \(V_\text{Final}\) | Final report |
| \(\epsilon _\text{Final}\) | Total privacy budget of the whole mechanism |
| \(\varrho\) | Randomized function providing noise and bias |
| \(\mathcal {R}(\varrho ())\) | Randomized function for the whole proposed mechanism |
| \(\eta _j\) | Gaussian noise for the jth user, \(j=1:n\) |
| \(\beta\) | Bias |
| \(\hat{\beta }\) | Optimal bias |
| \(\mathcal {U}\) | Utility function |

Table 6. Symbols and Descriptions
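The following simulation sketch illustrates Definition A.7 and the \(\epsilon = \ln (RR)\) computation, under the assumption of fair coins and a truthful answer when the second flip lands tails; with these assumptions, \(\epsilon\) comes out to roughly \(\ln 3\):

```python
import math
import random

def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Follow the steps of Definition A.7 with fair coins:
    tails -> answer truthfully; heads twice -> answer untruthfully."""
    if rng.random() < 0.5:          # first coin: tails
        return truth
    if rng.random() < 0.5:          # second coin: tails (assumed truthful)
        return truth
    return not truth                # heads twice: lie

rng = random.Random(0)
trials = 200_000
yes_given_yes = sum(randomized_response(True, rng) for _ in range(trials)) / trials
yes_given_no = sum(randomized_response(False, rng) for _ in range(trials)) / trials

rr = yes_given_yes / yes_given_no
print(f"RR ~ {rr:.3f}, epsilon = ln(RR) ~ {math.log(rr):.3f}")  # about ln(3) ~ 1.099
```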
Definition A.8 (\(l_2\)-Sensitivity [17, 35]).
The \(l_2\) sensitivity of a function \(f : \mathbb {N}^{|\mathcal {X^*}|} \rightarrow \mathbb {R}^k\) is
\begin{equation} \Delta _2 f = \max _{\substack{x, y \in \mathbb {N}^{|\mathcal {X^*}|} \\ \Vert x - y \Vert _1 = 1}} \Vert f(x) - f(y) \Vert _2 . \end{equation}
(40)
Definition A.9 (Gaussian Mechanism [17, 35]).
Just like the Laplace mechanism, the Gaussian mechanism adds noise, here drawn from the Gaussian distribution, whose variance is calibrated according to the sensitivity and privacy parameters:
\begin{equation} \mathcal {M}_\text{Gauss}(x,f,\epsilon ,\delta) = f(x) + \mathcal {N}^d \bigg (\mu = 0, \sigma ^2 = \frac{2 \ln {(1.25/\delta)} . (\Delta _2 f)^2}{\epsilon ^2}\bigg). \end{equation}
(41)
Claim: \(\mathcal {M}_\text{Gauss}\) satisfies (\(\epsilon , \delta\))-differential privacy [17, 35].
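The calibration in Equation (41) can be written out directly. The sketch below assumes a scalar counting query with \(\Delta _2 f = 1\) and is meant only to illustrate the formula, not the full BUDS+ pipeline:

```python
import math
import numpy as np

def gaussian_mechanism(value, l2_sensitivity, epsilon, delta, rng):
    """Add Gaussian noise calibrated as in Equation (41):
    sigma^2 = 2 ln(1.25/delta) * (Delta_2 f)^2 / epsilon^2."""
    sigma = math.sqrt(2.0 * math.log(1.25 / delta)) * l2_sensitivity / epsilon
    return value + rng.normal(0.0, sigma)

rng = np.random.default_rng(7)
true_count = 1234.0   # f(x) for a counting query with Delta_2 f = 1 (assumed)
print(gaussian_mechanism(true_count, 1.0, epsilon=0.5, delta=1e-5, rng=rng))
```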
Theorem A.10 (Lindeberg-Levi Central Limit Theorem [10]).
Suppose \(\lbrace X_1, X_2,\ldots , X_n\rbrace\) is a sequence of i.i.d. random variables with \(E[X_i]=\mu\) and \(Var[X_i]= \sigma ^2 \lt \infty\). Let \(Z^*=\frac{\bar{X} - \mu }{\sigma /\sqrt {n}}\). Then, as \(n\rightarrow \infty\), the distribution of \(Z^*\) converges to the standard normal distribution, i.e., \(\mathcal {N}(0;1)\).
Theorem A.11 (Lyapunov Central Limit Theorem [8, 10]).
Suppose \(\lbrace X_1, X_2,\ldots , X_n\rbrace\) is a sequence of independent random variables, each with finite mean \(\mu _i\) and variance \(\sigma _i^2\). Define \(S_n^2 = \Sigma _{i=1}^{n}\sigma _i^2\). If for some \(\delta \gt 0\) Lyapunov's condition \(\lim _{n\rightarrow \infty }\frac{1}{S_n^{2+\delta }} \Sigma _{i=1}^{n}E[|X_i-\mu _i|^{2+\delta }] =0\) is satisfied, then as \(n\) approaches infinity, \(\Sigma _{i=1}^{n}\frac{X_i-\mu _i}{S_n}\) converges in distribution to a standard normal, i.e., \(\mathcal {N}(0;1)\).
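As a quick empirical sanity check of this convergence (an illustrative simulation only, with arbitrarily chosen non-identical uniform variables), the standardized sum can be computed numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Independent but non-identically distributed variables: Uniform(0, b_i), varying b_i.
b = rng.uniform(0.5, 3.0, size=n)
mu = b / 2.0                       # mean of Uniform(0, b_i)
var = b ** 2 / 12.0                # variance of Uniform(0, b_i)
s_n = np.sqrt(var.sum())           # S_n

def standardized_sum() -> float:
    x = rng.uniform(0.0, b)        # one draw from each X_i
    return (x - mu).sum() / s_n

samples = np.array([standardized_sum() for _ in range(2_000)])
print(samples.mean(), samples.std())   # close to 0 and 1, as Lyapunov's CLT predicts
```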
Definition A.12 (Accuracy).
We say that an algorithm that outputs a stream of answers \(a_1, a_2, \ldots \in (\top ,\bot)^*\) in response to a stream of \(h\) queries \(Q^*_1,\ldots , Q^*_h\) is \((\alpha , \beta)\)-accurate with respect to a threshold \(T\) if, except with probability at most \(\beta\), the algorithm does not halt before \(Q^*_h\), and
\(\forall a_i = \top\):
\begin{equation} Q_i (D) \ge T -\alpha , \end{equation}
(42)
and, \(\forall a_i = \bot\):
\begin{equation} Q_i (D) \le T + \alpha . \end{equation}
(43)
Definition A.13 (Utility [4]).
The utility of a specific model is the amount of information about the real answer that can be obtained from the reported one.

References

[2]
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP’14).
[3]
Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep learning with differential privacy. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS’16). Association for Computing Machinery. DOI:
[4]
Mário S. Alvim, Miguel E. Andrés, Konstantinos Chatzikokolakis, Pierpaolo Degano, and Catuscia Palamidessi. 2011. Differential privacy: On the trade-off between utility and information leakage. In Proceedings of the International Workshop on Formal Aspects in Security and Trust. Springer, 39–54.
[5]
Maho Asada, Masatoshi Yoshikawa, and Yang Cao. 2019. When and where do you want to hide? Recommendation of location privacy preferences with local differential privacy. 11559 (2019), 164–176. arXiv:1904.10578 [cs].
[6]
Robert B. Ash and Melvin F. Gardner. 2014. Topics in Stochastic Processes: Probability and Mathematical Statistics: A Series of Monographs and Textbooks, Vol. 27. Academic Press.
[7]
Thomas Asikis and Evangelos Pournaras. 2018. Optimization of privacy-utility trade-offs under informational self-determination. Fut. Gen. Comput. Syst. (2018).
[8]
Imre Bárány, Van Vu, et al. 2007. Central limit theorems for Gaussian polytopes. Ann. Probab. 35, 4 (2007), 1593–1621.
[9]
Ghazaleh Beigi and Huan Liu. 2020. A survey on privacy in social media: Identification, mitigation, and applications. ACM/IMS Trans. Data Sci. 1, 1 (Mar.2020), 7:1–7:38. DOI:
[10]
P. Billingsley. 1995. Probability and Measure (3rd ed.). Wiley, New York.
[11]
Andrea Bittau, Úlfar Erlingsson, Petros Maniatis, Ilya Mironov, Ananth Raghunathan, David Lie, Mitch Rudominer, Ushasree Kode, Julien Tinnes, and Bernhard Seefeld. 2017. PROCHLO: Strong privacy for analytics in the crowd. In Proceedings of the 26th Symposium on Operating Systems Principles. 441–459. Retrieved from https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46411.pdf.
[12]
Vladimir I. Bogachev. 2007. Measure Theory, Vol. 1. Springer Science & Business Media.
[13]
Donald L. Cohn. 2013. Measure Theory. Springer.
[14]
Graham Cormode, Somesh Jha, Tejas Kulkarni, Ninghui Li, Divesh Srivastava, and Tianhao Wang. 2018. Privacy at scale: Local differential privacy in practice. In Proceedings of the International Conference on Management of Data. 1655–1658.
[15]
Cynthia Dwork. 2008. Differential privacy: A survey of results. In Proceedings of the International Conference on Theory and Applications of Models of Computation. Springer, 1–19.
[16]
Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. 2006. Our data, ourselves: Privacy via distributed noise generation. In Proceedings of the Annual International Conference on the Theory and Applications of Cryptographic Techniques. Springer, 486–503.
[17]
Cynthia Dwork and Aaron Roth. 2013. The algorithmic foundations of differential privacy. FNT Theoret. Comput. Sci. 9, 3–4 (2013), 211–407. DOI:
[18]
Cynthia Dwork and Guy N. Rothblum. 2016. Concentrated differential privacy. arXiv:1603.01887 [cs] (March2016).
[19]
Úlfar Erlingsson, Vitaly Feldman, Ilya Mironov, Ananth Raghunathan, Shuang Song, Kunal Talwar, and Abhradeep Thakurta. 2020. Encode, shuffle, analyze privacy revisited: Formalizations and empirical evaluation. arXiv:2001.03618 [cs] (Jan.2020).
[20]
Úlfar Erlingsson, Vitaly Feldman, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Abhradeep Thakurta. 2018. Amplification by shuffling: From local to central differential privacy via anonymity. arXiv:1811.12469 [cs, stat] (Nov.2018).
[21]
Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. 2014. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS’14). Association for Computing Machinery, 1054–1067. DOI:
[22]
Paul R. Halmos. 2013. Measure Theory, Vol. 18. Springer.
[23]
Filip Hanzely, Jakub Konečný, Nicolas Loizou, Peter Richtárik, and Dmitry Grishchenko. 2019. A privacy preserving randomized gossip algorithm via controlled noise insertion. arXiv:1901.09367 [cs, math] (Jan.2019).
[24]
Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. 2011. What can we learn privately? SIAM J. Comput. 40, 3 (2011), 793–826.
[25]
Min Lyu, Dong Su, and Ninghui Li. 2016. Understanding the sparse vector technique for differential privacy. arXiv preprint arXiv:1603.01699 (2016).
[26]
Ali Makhdoumi, Salman Salamatian, Nadia Fawaz, and Muriel Médard. 2014. From the information bottleneck to the privacy funnel. In Proceedings of the IEEE Information Theory Workshop (ITW’14). IEEE, 501–505.
[27]
H. Brendan McMahan, Galen Andrew, Ulfar Erlingsson, Steve Chien, Ilya Mironov, Nicolas Papernot, and Peter Kairouz. 2019. A general approach to adding differential privacy to iterative training procedures. arXiv:1812.06210 [cs, stat] (March2019).
[28]
H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. 2017. Communication-efficient learning of deep networks from decentralized data. arXiv:1602.05629 [cs] (Feb.2017).
[29]
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 3111–3119. Retrieved from http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf.
[30]
Mona Nashaat, Aindrila Ghosh, James Miller, and Shaikh Quader. 2020. Asterisk: Generating large training datasets with automatic active supervision. ACM/IMS Trans. Data Sci. 1, 2 (May2020), 13:1–13:25. DOI:
[31]
Sudipta Paul and Subhankar Mishra. 2019. ARA: Aggregated RAPPOR and analysis for centralized differential privacy. SN Comput. Sci. 1, 1 (Sept.2019), 22. DOI:
[32]
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 1532–1543. Retrieved from https://www.aclweb.org/anthology/D14-1162.pdf.
[33]
David W. Scott. 2011. Box–Muller transformation. Wiley Interdisc. Rev.: Computat. Statist. 3, 2 (2011), 177–179.
[34]
Poushali Sengupta, Sudipta Paul, and Subhankar Mishra. 2020. BUDS: Balancing utility and differential privacy by shuffling. arXiv preprint arXiv:2006.04125 (2020).
[35]
Poushali Sengupta, Sudipta Paul, and Subhankar Mishra. 2020. Learning with differential privacy. Retrieved from https://arxiv.org/pdf/2006.05609.pdf.
[36]
Mehrdad Showkatbakhsh, Can Karakus, and Suhas Diggavi. 2018. Privacy-utility trade-off of linear regression under random projections and additive noise. In Proceedings of the IEEE International Symposium on Information Theory (ISIT). IEEE, 186–190.
[37]
ADP Team et al. 2017. Learning with privacy at scale. Apple Mach. Learn. J. 1, 8 (2017). Retrieved from https://machinelearning.apple.com/docs/learning-with-privacy-at-scale/appledifferentialprivacysystem.pdf.
[38]
Tianhao Wang, Joann Qiongna Chen, Zhikun Zhang, Dong Su, Yueqiang Cheng, Zhou Li, Ninghui Li, and Somesh Jha. 2020. Continuous release of data streams under both centralized and local differential privacy. arXiv:2005.11753 [cs] (May2020).
