
Belt and Braces: When Federated Learning Meets Differential Privacy

Building federated learning with differential privacy to train and refine machine-learning models with more comprehensive datasets can help exploit the potential of machine learning to its fullest.


With the development of advanced algorithms, computing capabilities, and available datasets, machine learning (ML) has been widely adopted to solve real-world problems in various application domains. The success of ML often relies on large amounts of application-specific training data, especially for large models such as ChatGPT. However, this data is often generated and scattered across enormous numbers of network edges or users’ end devices, and can be too sensitive and impractical to move to a central location as a result of regulatory laws (for example, GDPR) or privacy concerns.8 This fact has created an inconvenient dilemma between large-scale ML and increasingly severe data isolation. The conflict between data hunger and privacy awareness is becoming increasingly prominent in the artificial intelligence (AI) era.

Key Insights

  • Federated learning (FL) can help learn collaboratively from massive scattered datasets without direct raw data exposure, but it still lacks a rigorous privacy guarantee against indirect information inferences.

  • Differential privacy (DP) can mathematically formulate and limit the indirect privacy leakage in various learning tasks, but it may suffer from the low signal-to-noise ratio issue when having a small number of learning samples.

  • There is an ongoing and growing body of research on the mutual complementarity and benefits of FL and DP; this article summarizes that research and explores optimization principles.

  • We outline a set of new research challenges and related investigation dimensions for achieving usable FL with DP in emerging applications.

Google proposed federated learning (FL) as a potential solution to the above issue.26 Through coordination between a central server and clients (devices participating in FL), FL collaboratively trains ML models over extensive data across geographies, bridging the gap between the ideal of big-data utilization and the reality of data fragmentation everywhere. By sharing locally trained models instead of raw data, FL not only minimizes the risks of raw data exposure but also reduces client-server communication costs. Since its proposal, FL has been seen as a rising star in AI technology; its recent use in fine-tuning large language models (LLMs) has confirmed that again.

FL’s advantage in privacy protection stems from its restriction on raw data sharing. This alone is far from sufficient, however, as the gradients of deep models can still expose the privacy of training data,39 and FL gives no formal privacy guarantee. Fortunately, differential privacy (DP), proposed by Dwork,10 allows a controllable privacy guarantee by formally bounding the information that can be derived from private data. By adding proper noise, DP guarantees that a query result does not disclose much information about any individual’s data. Because of its rigorous formulation, DP has become the de facto standard of privacy and has been applied in both ML and FL.

As privacy-by-design technologies, DP and FL greatly encourage real-world data sharing and utilization. On one hand, by restricting raw data exposure, FL enables ML model training over massively fragmented data and significantly enriches ML applications in extensive distributed scenarios. On the other hand, by rigorously limiting indirect information leakage, DP can strengthen the privacy of trained models with provable guarantees. The complementarity of FL and DP in privacy suggests a promising future for their combination, which can significantly extend the applicable areas of both techniques and bring privacy-preserving large-scale ML to reality. Specifically, FL has advantages in fusing geographically isolated datasets, while DP can offer provable guarantees and thus encourage sensitive data sharing. To exploit the potential of ML to its fullest, it is highly desirable and essential to build FL with DP to train and refine ML models with more comprehensive datasets.

In both FL and DP, the benefit of privacy protection comes at a cost in data utility, among other issues. FL clients often have limited capabilities and distribution-skewed datasets, causing insufficient and/or unbalanced training of global models with low utility. DP algorithms hide the presence of any individual sample or client by adding noise to model parameters, which also leads to possible utility loss. Therefore, utility optimization, that is, improving model utility as much as possible under a given privacy guarantee, is an essential problem in combining FL and DP. Given the great potential, studies on this problem have expanded rapidly in recent years. However, they are often conducted on various FL and DP paradigms with different security assumptions (for example, whether the server is trustworthy) and levels of privacy granularity (for instance, sample or client). Without a systematic review and clear categorization of existing paradigms, it is hard to precisely evaluate and compare their utility performance. On the other hand, despite the paradigm differences, the utility optimization principles are quite similar; yet current studies often focus on specific algorithm designs for particular paradigms of FL with DP, and common pathways to follow are lacking. Meanwhile, the few existing surveys on the intersection of DP and FL either focus on issues other than utility or lack high-level insights into future challenges.

This article aims to provide a systematic overview of DP-enabled FL, focusing on high-level perspectives on its utility-optimization techniques. We begin by introducing FL and DP respectively, highlighting the benefits of their combination. We then summarize research advances by categorizing the paradigms and software frameworks of FL with DP. Aiming at usable analytic results, we present the high-level principles and primary technical challenges of utility optimization in several emerging scenarios. Finally, we discuss some topics related to FL with DP that also affect the achieved data utility. Our review offers the general audience a systematic understanding of the developments and achievements on this topic. The perspectives on utility optimization for DP-enabled FL can offer insights into research opportunities and challenges for usable, privacy-protecting AI services in both academia and industry.

Federated Learning

Overview of federated learning.  An FL system is essentially a distributed ML (or DML) system coordinated by a central server,a which helps multiple remote clients with separate datasets collaboratively train an ML model under the privacy constraint that no client exposes its raw data. There are two popular FL frameworks.26 Federated stochastic gradient descent (FedSGD) is the federated version of the stochastic gradient descent (SGD) algorithm. In SGD for centralized ML, gradients are computed on a random subset of the total dataset and then used to make one step of gradient descent. FedSGD uses a random fraction of clients and all their local data. The gradients are averaged by the server proportionally to the number of training samples on each client and used to make a gradient-descent step. To overcome the communication bottleneck, federated averaging (FedAvg) allows clients to perform more than one batch update on the local dataset and exchange the updated parameters rather than the gradients.20 FedAvg is a generalization of FedSGD, since averaging the gradients is equivalent to averaging the parameters themselves if all the clients begin with the same initialization. In general, FL works as follows:

  1. Each participating client performs a local training procedure on its own dataset and sends the gradients or model updates to the server.

  2. The server securely aggregates the received gradients or model updates, and updates the global model accordingly.

  3. The server sends back the new global model to the corresponding clients.

  4. Clients update their local models and prepare for the next iteration.

These procedures are repeated until the global model converges or a sufficient number of iterations is reached. FL is classified into cross-device FL, which leverages up to millions of devices in the wide-area network, and cross-silo FL, which connects a handful of edge nodes with reliable backbone networks.
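To make the workflow above concrete, the following is a minimal FedAvg sketch in Python/NumPy. It is only an illustration, not a production implementation: it assumes each client’s dataset is a list of (x, y) NumPy pairs, and the hypothetical local_train helper uses a linear squared-loss model as a stand-in for the real local training step.

```python
import numpy as np

def local_train(weights, data, lr=0.01, local_epochs=1):
    """Hypothetical local update: a few passes of plain SGD over the client's
    (x, y) pairs, using a squared-loss linear model as a stand-in."""
    w = weights.copy()
    for _ in range(local_epochs):
        for x, y in data:
            grad = (w @ x - y) * x      # gradient of 0.5 * (w.x - y)^2
            w -= lr * grad
    return w

def fedavg_round(global_w, clients, client_fraction=0.1, rng=None):
    """One FedAvg round: sample a fraction of clients, train locally,
    and average the returned models weighted by local sample counts."""
    rng = rng or np.random.default_rng()
    m = max(1, int(client_fraction * len(clients)))
    selected = rng.choice(len(clients), size=m, replace=False)
    total = sum(len(clients[i]) for i in selected)
    new_w = np.zeros_like(global_w)
    for i in selected:
        local_w = local_train(global_w, clients[i])
        new_w += (len(clients[i]) / total) * local_w   # weighted model average
    return new_w
```

Setting local_epochs to 1 and exchanging gradients instead of parameters would recover FedSGD under the same initialization.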

Comparison with traditional DML.  Despite being a typical DML paradigm, when compared with traditional DML in data centers for ML speedup, FL has many distinct characteristics (as shown in Figure 1):

Figure 1.  Building blocks of FL systems.
  • Privacy requirement: Unlike traditional DML in the datacenters (where data can be arbitrarily scheduled among computing nodes), ensuring privacy protection lies at the center of FL, which strictly prohibits raw data sharing.

  • Data partitioning: Data in FL is generated naturally or obtained from individual users, and thus is often non-IID and imbalanced. In contrast, data in traditional DML is usually scheduled manually so that it is almost shuffled and balanced.

  • On-device learning: In datacenters, DML computing nodes are homogeneous, deployed centrally, and powerful. In contrast, FL is implemented with tens to millions of distributed clients with heterogeneous and limited computing capacities.

  • Communication: Traditional DML in datacenters enjoys gigabytes-per-second bandwidth and communicates in a peer-to-peer manner. FL clients, however, are usually connected to the server over the wide-area network and are bandwidth constrained.

  • Model aggregation: Aggregation fuses training results (for example, local models) from distributed nodes. Compared to the homogeneous sub-models in traditional DML, one challenge in FL is the prominent heterogeneity among local models due to either ‘non-IIDness’ or varied training progress.

  • System actors: Unlike the closed and fixed system of traditional DML, FL is often conceived as an open and scalable system consisting of massive clients owned by different individuals/organizations seeking different benefits.

Privacy threats in federated learning.  Due to the above characteristics (for example, its geographically distributed nature, open architecture, and complicated interactions), various attacks can be mounted against FL in both model training and serving (that is, inference). Rather than attacks that degrade system availability or compromise data integrity (for example, poisoning attacks), we focus on privacy threats that snoop on private information in FL.

Privacy adversaries.  Privacy may be disclosed to or inferred by anyone who has access to the information flow in FL. Compared with ML over centralized data or traditional DML centrally deployed in datacenters, the mutually distrusting entities in FL may all be viewed as privacy adversaries inferring others’ private information. Possible adversaries can be classified as insiders and outsiders. The former include the server and participating clients; the latter include eavesdroppers over communication channels and third-party analysts (users) who consume the final model. Compared with outsiders, which are more likely to have black-box access (that is, can only query via APIs) to the final model, insiders are generally more capable, as they often have white-box access (that is, full access with prior knowledge) and substantially impact FL model training. Insiders can be further classified as semi-honest or malicious. The former are also known as honest-but-curious, that is, following the protocol correctly but trying to learn other entities’ private state. The latter may actively deviate from the protocol (for example, modifying data or colluding with others) to achieve their goal.

Privacy attacks.  Considering the above adversaries, the following privacy attacks may exist in FL (shown in Figure 2):

Figure 2.  Privacy threats in FL training.

Membership inference targeting a model aims to predict whether a given data sample was in its training set.32 It works by training multiple customized inference models to recognize noticeable patterns in the target model’s outputs for the given sample. In traditional, centrally deployed ML, membership inference is normally mounted by third-party users. In FL, it can be carried out not only by third-party users but also by communication eavesdroppers and even by participating clients and the server, because the local, aggregated, accumulated, and final forms of gradients or model parameters may all expose private information about the training data. Moreover, active attackers disguised as clients can selectively alter their gradient updates to significantly enhance the attack accuracy against victim clients.

Class representative inference tries to generate class representatives from the underlying distribution of the training data that the targeted model could have been trained on. In traditional ML, third-party users can achieve this goal with black-box access to the targeted model, either by iteratively modifying the features of a random sample until maximal confidence is reached or by training an inverse model. In FL, while an honest-but-curious server may partially recover some samples of honest clients simply by observing their uploaded gradients, active malicious clients or a passive malicious server can exploit generative adversarial networks (GANs) to construct class representatives not only from the global data distribution but also from specific clients.

Other privacy attacks include inferences of properties and even of the exact training data (both inputs and labels). Unlike the above inferences, which target properties characterizing an entire class, property inference aims to infer properties independent of the class’s characteristic features. With some auxiliary data, a passive adversary trains a binary property classifier to predict whether the observed updates were based on data with the property, while an active adversary can exploit multi-task learning to simultaneously conduct the main FL training and infer the targeted property state with enhanced capability. Inferring exact training data has also been demonstrated to be possible under deep leakage from gradients, which optimizes dummy inputs and labels by minimizing the difference between the dummy and targeted gradients for differentiable models.39

Related privacy-preserving techniques.  Cryptographic primitives and protocols can restrict unauthorized access to confidential information, thus reducing the chances of privacy leakage.19 For instance, homomorphic encryption (HE) supports dedicated operations on multiple encrypted data items to produce ciphertexts that can be decrypted to yield the desired functional outcomes of the original plaintexts. Functional encryption (FE) authorizes the holder of a key associated with a specified function to directly learn the function’s output over encrypted data and nothing else. Using secure multi-party computation (SMC), a set of parties jointly compute a function of their inputs without relying on a trusted third party or learning each other’s inputs. Cryptography implemented in software still requires an error-free environment for execution and uncompromised storage of secret keys, which naturally calls for hardware-assisted security. Trusted execution environments (TEEs) create an isolated operating environment that ensures the confidentiality of the data and code within, while enabling remote authentication and attestation. In FL training, the above technologies can be adopted alone or in combination to guarantee the desired confidentiality of the processed models.

However, note that privacy is essentially orthogonal to confidentiality. Whatever secure protocols and trusted systems are used, a final model will eventually be trained for consumption. Even if only inference APIs are provided, model predictions may still reveal sensitive information, as ML models inevitably carry some knowledge of their training samples.11 In general, models with poor generalization tend to leak more; overfitting is a sufficient condition for membership inference attacks.28 Therefore, another line of defensive approaches properly suppresses fine-grained model utility. For instance, regularization can undermine inference attacks by reducing overfitting. For deep learning, two useful strategies are model compression (or sparsification), which sets gradients below a threshold to zero, and weight quantization, which limits parameter precision. However, these approaches provide only intuitive protection without a rigorous guarantee.31

Differential Privacy

With the provable guarantee of limiting privacy leakage even in securely aggregated results, differential privacy promises to complement the above technologies and strengthen FL.

Overview of differential privacy.  By establishing a formal measure of privacy loss, DP allows rigorous control of the (worst-case) information leakage. Informally, it guarantees that an algorithm’s output does not change much between two datasets differing by a single entry.10 To achieve DP, the basic idea is to properly randomize the relationship between the data input and the algorithmic output, for example, by adding noise.

DP has various models, as noise can be added to different components or phases of algorithms.11 Conventional DP assumes a trustworthy aggregator and adds minor noise to the algorithm output, which is known as centralized DP (CDP). Assuming an honest-but-curious aggregator, local DP (LDP) randomizes data at the users’ end before collection and reconstructs utility from the perturbed data of multiple users. From CDP to LDP, the trust model is weakened under the same DP parameter, while data uncertainty and accuracy loss become larger. To bridge the trust-accuracy gap, distributed DP (DDP) exploits cryptography to obtain high accuracy without a trusted aggregator.35 There are currently two DDP paradigms, based on secure shuffling and secure aggregation, respectively. Secure shuffling uses an anonymous communication channel to alleviate the identification risks of messages, thereby relaxing the trust model. Secure aggregation replaces the trusted aggregator with secure computation protocols and thus can reduce noise and attain the same utility as the centralized model.
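As a rough illustration of this trust-accuracy gap, the following sketch compares adding Gaussian noise once at a trusted aggregator (CDP) with adding it at every user before collection (LDP) for a simple sum query over values in [0, 1]. The calibration formula and parameter choices are standard textbook values used here for illustration only.

```python
import numpy as np

def gaussian_sigma(sensitivity, eps, delta):
    # Standard Gaussian-mechanism calibration for (eps, delta)-DP.
    return sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / eps

rng = np.random.default_rng(0)
values = rng.random(10_000)          # each user holds one value in [0, 1]
eps, delta = 1.0, 1e-6
sigma = gaussian_sigma(1.0, eps, delta)

# Centralized DP: a trusted aggregator adds noise once to the exact sum.
cdp_sum = values.sum() + rng.normal(0, sigma)

# Local DP: every user perturbs locally, so n noise terms accumulate in the sum.
ldp_sum = (values + rng.normal(0, sigma, size=values.size)).sum()

print(abs(cdp_sum - values.sum()), abs(ldp_sum - values.sum()))
# The LDP error grows roughly with sqrt(n), illustrating the trust-accuracy gap
# that distributed DP aims to close without a trusted aggregator.
```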

Figure 3. Approaches to achieve DP for ML.

The prevalence of DP also stems from several elegant properties. The post-processing property preserves an algorithm’s privacy guarantee under arbitrary downstream workflows. Composition theorems help quantify the composed privacy guarantee of a series of sub-algorithms and enable building complicated algorithms from simple operations.

Differential privacy for ML.  DP has been applied in ML to prevent adversaries with access to the model from inferring the training data. While intrinsic privacy can be achieved for free in some ML models with inherent randomness,16 adding noise to different components of ML algorithms provides viable pathways to privacy-preserving ML with DP, as shown in Figure 3.

Output perturbation adds calibrated noise to the parameters of the final model, which, however, may have large (even unbounded) sensitivities and lead to severe utility loss. Input perturbation randomizes the training data and then constructs an approximate learning model on it,9 which usually has limited model utility due to the much stronger protection. Objective perturbation perturbs the objective function of the optimization problem in ML. Although functional mechanisms38 allow its use for complicated model functions, it is often infeasible to explicitly express the loss functions of most ML models, especially in deep learning. Gradient perturbation, which sanitizes parameter gradients during training,1 can ensure DP even for nonconvex objectives, making it quite useful for deep models. Differentially private SGD (DP-SGD), which has become the common practice for privacy-preserving ML,1 samples a mini-batch, clips the l2 norm of the gradient computed on each sample, aggregates the clipped gradients, and adds Gaussian noise in each iteration. By incorporating gradient clipping, it avoids the issue of unknown gradient sensitivity. In addition, it is often combined with a moments accountant to track a tighter privacy-loss bound.
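The core DP-SGD step of per-sample clipping followed by Gaussian noise can be sketched as below. This is a minimal NumPy illustration, not a drop-in for any specific library: the per-sample gradient function grad_fn is supplied by the caller, and the privacy accounting is omitted.

```python
import numpy as np

def dp_sgd_step(w, batch, grad_fn, lr=0.05, clip=1.0, noise_mult=1.1, rng=None):
    """One DP-SGD step: clip each per-sample gradient to l2 norm `clip`,
    sum the clipped gradients, and perturb the sum with Gaussian noise."""
    rng = rng or np.random.default_rng()
    summed = np.zeros_like(w)
    for x, y in batch:
        g = grad_fn(w, x, y)                         # per-sample gradient
        g = g / max(1.0, np.linalg.norm(g) / clip)   # clip l2 norm to `clip`
        summed += g
    noise = rng.normal(0, noise_mult * clip, size=w.shape)
    return w - lr * (summed + noise) / len(batch)    # noisy average-gradient step
```

Because every per-sample gradient is forced to have norm at most clip, the sensitivity of the sum is bounded, which is what allows the Gaussian noise scale to be calibrated.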

Federated Learning with Differential Privacy

The wide application of DP in privacy-preserving ML shows the great potential of privacy-preserving FL with DP.

Benefits of FL with DP.  DP, with its rigorous guarantee, has become an essential technology for privacy-preserving data analysis and ML. Although it has been successfully integrated into distributed systems for data querying and analysis,30 there is still no DP-enhanced framework for large-scale distributed ML over massively scattered datasets. FL supports flexible ML tasks with extensive models and scalable training over such datasets; but despite avoiding direct data exposure by sharing only intermediate parameters, it lacks a formal privacy guarantee and may leak information indirectly. When combined, FL with DP can therefore realize large-scale and flexible distributed learning while preventing both direct and indirect privacy leakage.

The combination of FL and DP, each complementing the other in encouraging the utilization of massive confidential and sensitive data, can bring substantial benefits for reliable privacy protection.

FL empowers and prospers DP-based ML over large-scale siloed datasets.   DP-based ML (especially deep learning) in the centralized setting has made rapid progress. However, data centralization and privacy regulations strongly hinder its further development, while DP-based ML needs large-scale data and data-intensive applications. Fortunately, FL naturally enables DP-based ML over massively scattered data, thus greatly prospering its success.

DP completes and strengthens the reliability of FL by offering rigorous guarantees.  The mission of FL is to train and refine ML models with more comprehensive end-user data, which is subject to the willingness of data owners. Hence, a provable privacy guarantee is key to the popularization of FL systems. Beyond connecting isolated datasets, privacy-preserving FL systems may encourage users to contribute more sensitive datasets.

Research advances on FL with DP.  Due to the above benefits, marrying FL with DP has attracted extensive interest from both academia and industry. We systematically review the advances according to different paradigms and privacy notions.

FL with centralized DP.  It is natural to extend differentially private ML algorithms in the centralized setting (for example, DP-SGD) to the context of FL, to prevent information leakage from the training iterations and the final model against malicious clients or third-party users.

DP can be defined at different granularities, depending on the precise definition of neighboring datasets. Unlike DP-SGD, which provides sample-level DP to hide the existence of any single sample, it is often more meaningful to provide client-level DP in FL, which ensures that all the training data of a single client is protected. This also fits the FL setting, where each client computes a single model update from all its local data. Assuming a trusted central server, a straightforward idea is to apply DP to the aggregation of model updates from participating clients and hide any client’s influence on the model update at the server. DP-SGD can be adapted to both FedAvg and FedSGD, which yields two DP variants: DP-FedAvg and DP-FedSGD.27 At a high level, they work as follows:

  • Sampling a group of clients, each of which trains a local model on all its local data

  • Clipping each client’s model update to bound the norm of its contribution

  • Averaging the clipped updates

  • Adding calibrated Gaussian noise to the average update

Privacy amplification via subsampling and the moments accountant still apply to compose the privacy loss.14 However, when providing a formal DP guarantee, particular attention should be paid to the client-dropout issue, which may violate the uniform-sampling assumption. Fortunately, recent studies show the possibility of addressing it in theory or bypassing it with new frameworks. Although noise exists in both the intermediate model updates and the final model, their privacy guarantees are quite different, as they are quantified from different perspectives.
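A minimal sketch of the server-side aggregation implied by the steps above, assuming a trusted server, pre-computed client update vectors (local weights minus global weights), and Gaussian noise; the exact noise calibration and the accompanying privacy accounting are deliberately omitted.

```python
import numpy as np

def dp_fedavg_aggregate(global_w, client_updates, clip=1.0, noise_mult=1.0, rng=None):
    """Client-level DP aggregation (sketch): clip each client's update,
    average the clipped updates, and add Gaussian noise at the trusted server."""
    rng = rng or np.random.default_rng()
    clipped = []
    for delta in client_updates:                       # delta = local_w - global_w
        norm = np.linalg.norm(delta)
        clipped.append(delta / max(1.0, norm / clip))  # bound each client's influence
    avg = np.mean(clipped, axis=0)
    # Noise scale is tied to clip / n: after clipping, one client can move the
    # average by at most clip / n, which bounds the sensitivity of the average.
    noise = rng.normal(0, noise_mult * clip / len(client_updates), size=global_w.shape)
    return global_w + avg + noise
```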

FL with local DP.  LDP implemented on local models can defend against untrusted servers or malicious clients. Related studies can be categorized into two lines based on the FL architecture.

Noise before aggregation: Considering an untrusted central server in practice, LDP can be applied to perturb the gradients or model updates of individual clients in each iteration. A simple approach is to add Gaussian noise to individuals’ updates before uploading, also known as noising before model aggregation FL.36 For example, DP-FedSGD or DP-FedAvg can be adapted to the LDP setting by offloading the Gaussian noise addition to the clients’ side. Since the sum of multiple Gaussian noises still follows a Gaussian distribution, the privacy loss at both individual clients and the central server can be tracked simultaneously. FL algorithms with LDP (for example, LDP-FedSGD) face the critical problem of the dimension dependency of communication and privacy: besides communication overheads, for given privacy parameters, the required noise is essentially proportional to the dimension of the model parameter vector. By selecting a fraction of important dimensions, both noise variance and communication overhead can be significantly reduced; dimension reduction is therefore commonly used for large models. For instance, a subset of updated gradients can be sampled to reduce communication and their values truncated to limit noise variance.31
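A minimal sketch of this noising-before-aggregation idea: each client clips and perturbs its own update locally, so the server only ever sees noisy reports. The choice of sigma relative to the clip bound would come from the local privacy target, which is not modeled here.

```python
import numpy as np

def ldp_perturb_update(delta, clip=1.0, sigma=2.0, rng=None):
    """Client-side LDP (sketch): clip the local update and add Gaussian noise
    before uploading, so the server never observes the exact update."""
    rng = rng or np.random.default_rng()
    delta = delta / max(1.0, np.linalg.norm(delta) / clip)
    return delta + rng.normal(0, sigma, size=delta.shape)

# At the server, averaging n such reports leaves Gaussian noise of scale
# sigma / sqrt(n) on the mean update, which is why both the local and the
# central privacy loss can be tracked from the same perturbation.
```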

Blind flooding with noise: FL can also be implemented in a fully decentralized form without any central entity, thus avoiding a single point of failure and improving efficiency for heterogeneous systems. Its main feature is the use of peer-to-peer (P2P) communications rather than a client-server architecture. A reasonable way to ensure model convergence with full information is to broadcast parameters to nearby neighbors, which, informally, poses an even higher privacy risk than an untrusted server. Moreover, in some opportunistic networks (for example, mobile crowd sensing or autonomous vehicle networks), the communication topology may be time-varying and clients may frequently meet unfamiliar neighbors. In such cases, LDP is necessary and effective for preserving the privacy of messages exchanged among individual clients. This leads to the problem of decentralized optimization with LDP, which aims to ensure model convergence over a sparse P2P network with noisy local models. However, lacking a coordinating server, autonomous clients often have to adopt an asynchronous update pattern, which brings new challenges to decentralized optimization in practice. Nonetheless, it has been demonstrated that a differentially private asynchronous decentralized parallel SGD can converge at the same optimal rate as SGD and achieve model utility comparable to the synchronous mode, while attaining relatively higher efficiency.37

FL with distributed DP.  As discussed before, DDP can bridge the utility-trust gap between LDP and CDP while eliminating the assumption of a trusted server via two cryptographic techniques.

Privacy amplification by shuffling: One line of DDP studies for FL concentrates on the aforementioned secure-shuffling technique, which improves the privacy-utility trade-off through additional anonymization. Before being forwarded to the untrusted server, locally perturbed models with minor noise are first randomly permuted by one or more trusted (that is, secure) shufflers, which can be implemented as a trusted proxy or with dedicated cryptographic primitives, to strip their client identities. By adapting the classic encoder-shuffler-aggregator (ESA) framework to FL, LDP-SGD with secure shuffling can achieve both strong iteration-level LDP and good overall CDP for the final model, without noticeable accuracy loss.12 For the high-dimensional parameters of deep models, shuffling client identities alone may still be vulnerable to linkage attacks via side channels; one solution is to split the parameter vector and then shuffle the fragments to enhance anonymity.34 To further trade off privacy and utility, subsampling is another important direction, which should take dimension importance into account.24 Given the benefits of Rényi DP (RDP) and its tighter composition of privacy loss, a natural extension beyond the RDP of subsampled mechanisms is to further analyze and exploit RDP and RDP composition in the shuffled model.15
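The shuffler’s role can be illustrated with the toy sketch below. This is not a secure implementation (a real shuffler would be a trusted proxy or a cryptographic mixnet); it only shows the two operations discussed above: breaking the link between a report and its sender, and the splitting variant that shuffles parameter fragments instead of whole vectors.

```python
import numpy as np

def shuffle_reports(reports, seed=None):
    """Toy shuffler: randomly permute the clients' perturbed reports so the
    server cannot link a report to its sender."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(reports))
    return [reports[i] for i in order]

def split_and_shuffle(reports, num_splits=4, seed=None):
    """Parameter-splitting variant: cut each report into contiguous fragments,
    tag each fragment with its starting index, then shuffle all fragments."""
    fragments = []
    for r in reports:
        start = 0
        for piece in np.array_split(r, num_splits):
            fragments.append((start, piece))   # (offset, fragment values)
            start += piece.size
    return shuffle_reports(fragments, seed)
```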

Secure aggregation of small noise: The secure aggregation protocols of Bonawitz et al.5 overcome the practical issue of random client dropouts in cross-device FL, paving the way for FL with DDP via secure aggregation. However, such protocols often involve modular arithmetic, requiring the quantization of communicated contents (or discrete-valued inputs) for acceptable complexity. The noise protecting local models must then also be generated over discrete values. One solution is to generate and add minor discrete noise to the discretized parameters of individual clients before secure aggregation, so that the aggregate parameters carry moderate noise equivalent to the CDP model. Binomial or Poisson noise can approach a utility-privacy trade-off similar to that of the Gaussian mechanism,3 but it does not achieve RDP or enjoy state-of-the-art composition and amplification. Simply using discrete Gaussian noise can yield RDP with sharp composition and subsampling-based amplification,17 but it relies on an uncommon sampling mechanism when implemented in software packages; moreover, the discrete Gaussian is not closed under summation, which may cause privacy degradation. Recently, the Skellam mechanism was proposed to generate noise distributed as the difference of two independent Poisson random variables.2 Skellam noise is closed under summation and can leverage common Poisson sampling tools to obtain privacy amplification and a sharper RDP bound in theory. However, developing a practical protocol for production-level FL systems remains an important problem.
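A minimal sketch of the discrete-noise idea using Skellam noise (the difference of two independent Poisson draws). The fixed-point scale, the noise parameter mu, and the secure-aggregation protocol itself are placeholders chosen for illustration, not calibrated values.

```python
import numpy as np

def skellam_noise(mu, size, rng):
    """Skellam noise: the difference of two independent Poisson(mu) variables.
    Its variance is 2*mu, and sums of Skellam noise remain Skellam-distributed."""
    return rng.poisson(mu, size) - rng.poisson(mu, size)

def client_report(update, scale=1000, mu=50.0, rng=None):
    """Discretize a local update with fixed-point encoding and add integer
    Skellam noise before handing it to secure aggregation (sketch)."""
    rng = rng or np.random.default_rng()
    quantized = np.rint(update * scale).astype(np.int64)
    return quantized + skellam_noise(mu, quantized.shape, rng)

# Server side, after the secure-aggregation protocol sums the integer reports:
#   aggregate_update = summed_reports / (scale * num_clients)
```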

Platforms and tools for FL with DP.  Toward usable FL with DP, many software frameworks and platforms have been developed to support research-oriented simulations or production-oriented applications. For private deep learning, PySyftb is a Python library that supports FL and DP and decouples model training from private data; its current version mainly focuses on SMC and HE rather than DP implementation. Dedicated to fair evaluation of FL algorithms for the research community, FedMLc provides an open research library and a standardized benchmark with diverse FL paradigms and configurations; its current version integrates only weak DP but provides low-level APIs for security primitives. Similarly, by providing a high-level interface, PaddleFLd supports FL model development with DP and offers a baseline DP-SGD implementation. Furthermore, despite considering practical FL settings and recognizing privacy issues, other FL frameworks (such as FATEe and LEAFf) still lack deep and flexible support for DP implementation. Recently, Sherpa.ai FL developed a unified framework for FL with DP, featuring comprehensive support for DP mechanisms and optimization techniques.29 Nevertheless, it mainly offers algorithm-level optimization and does not consider practical system implementation. TensorFlow includes DP and FL implementations in its TensorFlow Privacy and TensorFlow Federated libraries,g respectively. Both libraries integrate seamlessly with existing TensorFlow models and allow training personalized models with DP. However, the integrated DP mechanisms are relatively fixed in design and do not support customized, flexible optimization. Opacush is a scalable and efficient library for PyTorch model training with DP. It introduces the abstraction of a privacy engine that attaches to the standard PyTorch optimizer, which makes DP-SGD implementation much easier without explicitly calling low-level APIs. Beyond ML in PyTorch, it can easily be used in PySyft FL workflows to implement FL with DP.

Improving model utility for FL with DP.  Existing work underpins the baseline frameworks of FL with DP. Aiming at usable FL with DP, it is essential to pursue a better trade-off between model utility and privacy. Reviewing common techniques in the fields of DP, ML, and FL, we summarize some optimization principles below.

Optimization from the perspective of DP.  To seek better trade-offs, there are two directions: reducing unnecessary noise addition and tracking privacy loss tightly.

Clipping-bound estimation: Sensitivity calibration, which determines the proper noise amplitude by correctly bounding the sensitivity value, is crucial for minimizing noise variance while guaranteeing a given DP level. As mentioned, a common practice in DP-SGD, and thus also in SGD-based FL with DP, is to bound gradient sensitivity by gradient clipping and then add noise accordingly.1 However, an underestimated clipping threshold may cause gradient bias and even model divergence, while an overestimated one results in excessive noise. Thus, it is important to understand the impact of gradient clipping and dynamically identify the proper clipping bounds during training.7 For instance, adaptive gradient clipping via divergence analysis or heuristic estimation can provably or empirically reduce noise and produce models with higher utility, as the sketch below illustrates.23
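One heuristic in this spirit is quantile-based adaptive clipping: nudge the bound so that a target fraction of update norms falls below it. The sketch below is a simplified, non-private version of that idea; in a full DP treatment the observed fraction itself would also have to be privatized.

```python
import numpy as np

def update_clip_bound(clip, update_norms, target_quantile=0.5, lr=0.2):
    """Quantile-based adaptive clipping (sketch): multiplicatively adjust the
    clipping bound toward the target quantile of observed update norms."""
    frac_below = np.mean(np.asarray(update_norms) <= clip)
    # Shrink the bound if too many updates already fit under it, grow it if too few do.
    return clip * np.exp(-lr * (frac_below - target_quantile))

# Example: with clip = 1.0 and observed norms [0.2, 0.4, 3.0, 5.0], half the norms
# are below the bound, so it stays put; with mostly small norms it decays toward them.
```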

Noise-distribution optimization: This aims to reduce noise variance by reshaping the noise distribution, thus avoiding unnecessary noise addition in DP. Considerable effort has been invested here. For instance, in traditional DP research, discrete noise distributions and staircase noise distributions derived via segmentation techniques have been used in DP algorithms to lessen the necessary noise scale while meeting the DP requirement. In fact, Laplace and Gaussian noise are only two instances in the whole family of distributions satisfying DP definitions (as shown in Figure 4). Besides, to incorporate encryption primitives with lower overheads, the discretization and quantization of data content should also be applied to noise generation for LDP and DDP.

Figure 4. Illustration of utility optimization techniques.

Privacy-loss composition: The composition property of DP allows building complex FL models from DP primitives while composing the privacy loss. Traditionally, both sequential and advanced composition offer fairly loose bounds. The moments accountant analyzes the detailed distribution of the composed privacy-loss variable and derives a much tighter bound from higher-order moments; it yields acceptable utility with quite small privacy loss for DP-SGD when combined with amplification techniques.1,14 Privacy-loss composition contributes to optimizing the privacy-utility trade-off by tightly tracking the privacy loss composed across multiple independent noise additions in DP mechanisms.40 A related but opposite angle is to fix the privacy budget and add correlated noise via careful budget division. For instance, classic tree-aggregation techniques add correlated rather than independent noise for repeated computations, which achieves high utility while guaranteeing a given DP level. Inspired by this idea, an amplification-free algorithm adds correlated noise to the accumulation of mini-batch gradients, achieving a good trade-off for DP-SGD without any amplification technique (and with no uniform-sampling or shuffling requirement).18
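As a small illustration of tighter composition, the sketch below composes T Gaussian mechanisms in Rényi DP and converts the result back to (ε, δ)-DP with the standard conversion; subsampling amplification, which the moments accountant additionally exploits, is not modeled here.

```python
import numpy as np

def gaussian_rdp(alpha, sigma):
    """RDP of the Gaussian mechanism with sensitivity 1 and noise std sigma."""
    return alpha / (2 * sigma ** 2)

def rdp_to_dp(rdp_eps, alpha, delta):
    """Standard conversion from (alpha, rdp_eps)-RDP to (eps, delta)-DP."""
    return rdp_eps + np.log(1 / delta) / (alpha - 1)

def composed_epsilon(steps, sigma, delta, alphas=tuple(range(2, 64))):
    """RDP adds up linearly across steps; take the best (eps, delta) conversion."""
    return min(rdp_to_dp(steps * gaussian_rdp(a, sigma), a, delta) for a in alphas)

# Example: 1,000 Gaussian steps with sigma = 4 at delta = 1e-5. The resulting eps
# is far tighter than naive per-step composition; subsampling amplification,
# not modeled here, would tighten it further.
print(composed_epsilon(1000, 4.0, 1e-5))
```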

Intrinsic DP computation: Many studies have shown that noise-free DP can be achieved by leveraging the inherent randomness of certain models or training algorithms, instead of adding extra techniques or system components. Aware of the intrinsic DP level, the designer or developer can save much of the privacy budget and add less noise, thus gaining utility without privacy degradation. For instance, by mapping the sampling process to an equivalent exponential mechanism, intrinsic DP in graph models can be effectively measured and leveraged in DP algorithm design. A federated model-distillation framework can provide provable noise-free DP via random data sampling.33 It has also been proved that data sketching for communication reduction in FL inherently guarantees DP.22 Nonetheless, intrinsic privacy is not very common and exists only in certain models or algorithms.

Optimization from the perspective of FL.  The massive number of FL clients and the pervasive spatiotemporal sparsity of model parameters offer opportunities to extract acceptable utility without significantly weakening the privacy guarantee.

Updating frequency reduction: DP-enhanced FL suffers from noise accumulation over excessive training epochs, and too many training epochs also consume much network bandwidth. It is therefore highly desirable to reduce the model-update frequency. Compared with FedSGD, FedAvg allows clients to perform multiple local updates before aggregation, thus reducing the global update frequency.26 A similar technique has been widely adopted in DP applications with dynamic datasets or time-series data: the data curator publishes perturbed data with DP noise at timestamps with frequent changes while releasing approximate data, without consuming privacy budget, at non-changing timestamps.

Model parameter compression: Like frequent parameter updating, a long parameter vector heavily consumes the privacy budget (or incurs much noise under a fixed budget) and burdens the limited communication channel. To this end, many of the aforementioned model-compression approaches, including parameter filtering, low-rank approximation, random projection, gradient quantization, and compressive sensing, have been applied to deep-learning models. Similar studies include sampling and truncating a subset of gradient parameters in FL with CDP,31 selecting the top-K dimensions with large contributions in FL with LDP,25 and sampling dimensions in FL with DDP.24 All these methods empirically reduce both communication-bandwidth consumption and noise variance. However, lossy compression can, on the one hand, effectively improve model utility by reducing DP noise; on the other hand, it may lose utility as some parameter information is eliminated. An immediate question is how to find the optimal compression rate for the best utility-privacy trade-off.
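A minimal sketch of top-K compression combined with value perturbation is given below. Note that in a rigorous treatment the selection of indices itself leaks information and must also be privatized (for example, with the exponential mechanism), which this sketch omits.

```python
import numpy as np

def topk_sparsify(update, k):
    """Keep only the k largest-magnitude coordinates of a local update.
    Shorter vectors need less total DP noise and less upload bandwidth."""
    idx = np.argpartition(np.abs(update), -k)[-k:]
    return idx, update[idx]

def perturb_topk(update, k, sigma, rng=None):
    """Combine top-k selection with Gaussian perturbation of the kept values."""
    rng = rng or np.random.default_rng()
    idx, vals = topk_sparsify(update, k)
    return idx, vals + rng.normal(0, sigma, size=k)
```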

Participating clients sampling: Besides reducing the update frequency and the size of parameters, sampling the clients participating in DP-based FL training is also a promising approach to saving privacy budget, communication overhead, and energy consumption. The rationale comes from the amplification effect of sampling for DP: by randomly sampling the DP-protected FL clients in each training epoch, much stronger privacy protection can be achieved while reducing the average communication, computation, and privacy consumption. However, in practical cross-device FL, the set of available clients is usually dynamic, without prior knowledge of the population. Moreover, as will be discussed, participating clients may drop out randomly. These issues make the uniform-sampling assumption unrealistic and cause severe challenges for gaining privacy-utility trade-offs.
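The benefit can be quantified with the standard amplification-by-subsampling bound for ε-DP mechanisms, sketched below; it assumes uniform random sampling, which is exactly the assumption that client dynamics and dropouts threaten in practice.

```python
import numpy as np

def amplified_eps(eps, sampling_rate):
    """Privacy amplification by uniform subsampling: a mechanism that is eps-DP
    on the sampled set is log(1 + q * (e^eps - 1))-DP over the full population."""
    return np.log(1 + sampling_rate * (np.exp(eps) - 1))

# Example: eps = 1.0 applied to a 1% random sample of clients per round.
print(amplified_eps(1.0, 0.01))   # ~0.017, roughly a 58x reduction in epsilon
```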

Challenges and Discussions

Despite the great potential and opportunities of DP-enhanced FL, there are still challenges in achieving usable FL with DP guarantees in emerging applications.

Vertical/transfer federation.  FL can also be categorized according to different data-partition strategies. The generic form of FL discussed above mainly considers the horizontal data partition (HFL), where each client holds a set of samples with the same feature space. Vertical FL (VFL), where each party holds different features of the same set of samples, has recently gained increasing attention.13 However, many existing studies on VFL rely on SMC to protect confidentiality without considering privacy leakage in the final results. To achieve provable resistance to membership-inference or reconstruction attacks, DP must be employed to safeguard VFL. This is more challenging than in HFL for two reasons. One is that VFL algorithm design varies across tasks and models and often requires case-by-case development. The other is that correlations among distributed attributes are more difficult to identify without spreading individual information to other parties. Beyond the vertical federation, there are also scenarios where different parties hold datasets with non-overlapping features and users. Federated transfer learning (FTL) can eliminate the shifts of feature spaces in this scenario by combining FL and domain adaptation. However, similar to VFL, achieving DP for FTL is still challenging, as the gradients of individual instances must be exchanged between participants.

Large language models.  With the emergence of large language models (LLMs) such as ChatGPT, both FL and DP have begun to demonstrate a promising future in fine-tuning LLMs while preserving the privacy of private domain data. However, these LLMs often have several billion to hundreds of billions of parameters. When applying DP and FL to LLMs, multiple challenges arise from this huge number of parameters, beyond the extra communication and computation burdens on resource-constrained participants. Regardless of the DP model, the total amount of privacy noise has to grow with the number of parameters to enforce DP on the model, which would lead to huge utility loss. Besides, fine-tuning pre-trained LLMs also differs from conventional model training: the theoretical privacy guarantees in ML (for example, for DP-SGD) often assume models are learned from scratch over many training iterations, rather than fine-tuned over far fewer iterations. It is therefore necessary to investigate new frameworks for applying both DP and FL and to develop new theories for proper privacy guarantees in LLMs.

FL over streams.  In many realistic scenarios, training data is continuously generated in the form of streams at distributed clients. In such cases, FL systems have to conduct repetitive analyses on distributed streams. By inheriting from online learning (OL), online federated learning (OFL) can be naturally derived to avoid retraining models from scratch each time a new data fragment arrives. However, achieving DP for OFL brings multiple challenges. The first is how to define privacy in the OFL setting, as the general DP notion works for static datasets only. Although existing privacy notions for data streams and FL seem to apply, they still need to be clarified and formulated rigorously for OFL. The second is efficient algorithm design. Taking event-level LDP (that is, ensuring ε-LDP at each time instance) as an example, frequent uploading of local model updates accumulates huge communication costs and great utility loss, as the noise is proportional to the size of the communicated data. How to achieve communication and privacy efficiency without degrading overall model performance is thus an important but unsolved research problem.

Apart from adapting to these new settings, building usable DP-enhanced FL systems still requires improvements in robustness, fairness, and privacy (allowing data to be forgotten).

Robustness.  A robust FL system should be resilient to various failures and attacks caused by misbehaving participants. Due to limited capabilities (for example, battery limits), FL clients (for instance, smartphones) may drop out of FL training unexpectedly at any time. Random client dropouts present severe challenges to the practical design of differentially private FL. Besides requiring a more sophisticated design of secure aggregation protocols,5 some important assumptions may no longer hold for correctly measuring DP in FL. For instance, DP amplification via shuffling and via subsampling both rely on the assumption that clients correctly follow the protocol. Despite recent progress in theory,4,5 building practical FL systems that address these impacts simultaneously is still challenging. Beyond robustness to dropouts from unintended client failures, defending against robustness attacks (for example, model poisoning for Byzantine and backdoor attacks) mounted by malicious participants is much more challenging.28 Specifically, both data heterogeneity and model privacy protection in FL would prevent the server from accurately detecting anomalies and tracking specific participants.

Fairness.  Privacy protection is only the first step to encouraging data sharing among a large population. Fairness enforcement helps to mitigate the unintended bias on individuals with heterogeneous data. However, the dilemma is that DP aims to obscure identifiable attributes while fairness requires the knowledge of individuals’ sensitive attribute values to avoid biased results. Gradient clipping and noise addition in DP can exacerbate unfairness by decreasing the accuracy of the model over underrepresented classes and subgroups. So, the general tension between privacy and fairness calls for ethically sensitive FL algorithms that respect both issues. Meanwhile, gradient clipping and noise addition can also enhance robustness to some extent, as discussed above. This is consistent with the conclusion that there is a tension between fairness and robustness in FL.21 The constraints of fairness and robustness compete with each other, as robustness enhancement demands filtering out informative updates with significant model differences. Therefore, there is a subtle relationship between privacy, fairness, and robustness in FL. While existing studies concentrate on each of them separately, it would be significant to unify the interplay of the three simultaneously.

Right to be forgotten.  Privacy rights include the “right to be forgotten,” that is, users can opt out of contributing private data without leaving any trace. Because ML models memorize much specific information about training samples, the concept of machine “unlearning” was proposed to eliminate a specific private sample’s influence on trained models so that it is truly forgotten. However, on the one hand, machine unlearning in the context of FL, that is, federated unlearning, faces distinct challenges. Specifically, it is much harder to erase the influence of a client’s data, as the global model iteratively carries all participating clients’ information. A straightforward idea is to record clients’ historical parameter updates at the server, which may incur significant complexity. On the other hand, existing machine unlearning has been demonstrated to leak privacy through the differences between the original and unlearned models.6 DP seems to be one of the promising countermeasures. Therefore, how to realize efficient and privacy-preserving solutions for federated unlearning remains open.

Conclusion

With both privacy awareness and regulatory compliance, the meeting of FL and DP will promote the development of AI by removing a key bottleneck of large-scale ML. This article presents a comprehensive overview of the developments, a clear categorization of current advances, and high-level perspectives on the utility-optimization principles of FL with DP. The review aims to help the community better understand the achievements of different ways of combining FL with DP, and the challenges of usable FL with rigorous privacy guarantees. Although FL and DP show increasing promise for safeguarding private data in the AI era, their combination still faces severe challenges in emerging AI applications and needs further consideration and improvement on other practical issues.

Acknowledgments

This work was supported in part by the National Key Research and Development Program of China under Grant 2020YFA0713900; in part by the National Natural Science Foundation of China under Grants 62172329, U21A6005, and 61802298; and in part by the Science and Technology Plan Project of Henan Province under Grant 232102211007.

    References

    • 1. Abadi, M. et al. Deep learning with differential privacy. In Proceedings of ACM CCS (2016), 308–318.
    • 2. Agarwal, N., Kairouz, P., and Liu, Z. The Skellam mechanism for differentially private federated learning. In Proceedings of NeurIPS 34 (2021).
    • 3. Agarwal, N. et al. cpSGD: Communication-efficient and differentially-private distributed SGD. In Proceedings of NeurIPS (2018), 7564–7575.
    • 4. Balle, B. et al. Privacy amplification via random check-ins. In Proceedings of NeurIPS 33 (2020).
    • 5. Bonawitz, K. et al. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of ACM CCS (2017), 1175–1191.
    • 6. Chen, M. et al. When machine unlearning jeopardizes privacy. In Proceedings of ACM CCS (2021), 896–911.
    • 7. Chen, X., Wu, S.Z., and Hong, M. Understanding gradient clipping in private SGD: A geometric perspective. In Proceedings of NeurIPS 33 (2020), 13773–13782.
    • 8. Cheng, Y., Liu, Y., Chen, T., and Yang, Q. Federated learning for privacy-preserving AI. Commun. ACM 63, 12 (2020), 33–36.
    • 9. Duchi, J.C., Jordan, M.I., and Wainwright, M.J. Privacy aware learning. J. ACM 61, 6 (2014), 1–57.
    • 10. Dwork, C. A firm foundation for private data analysis. Commun. ACM 54, 1 (2011), 86–95.
    • 11. Dwork, C. et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science 9, 3–4 (2014), 211–407.
    • 12. Erlingsson, Ú. et al. Encode, shuffle, analyze privacy revisited: Formalizations and empirical evaluation. arXiv:2001.03618 (2020).
    • 13. Fink, O., Netland, T., and Feuerriegel, S. Artificial intelligence across company borders. Commun. ACM 65, 1 (2021), 34–36.
    • 14. Geyer, R.C., Klein, T., and Nabi, M. Differentially private federated learning: A client level perspective. arXiv:1712.07557 (2017).
    • 15. Girgis, A.M. et al. On the Rényi differential privacy of the shuffle model. In Proceedings of ACM CCS (2021), 2321–2341.
    • 16. Hyland, S.L. and Tople, S. An empirical study on the intrinsic privacy of SGD. In Theory and Practice of Differential Privacy (CCS Workshop) (2020).
    • 17. Kairouz, P., Liu, Z., and Steinke, T. The distributed discrete Gaussian mechanism for federated learning with secure aggregation. In Proceedings of ICML (2021), 5201–5212.
    • 18. Kairouz, P. et al. Practical and private (deep) learning without sampling or shuffling. In Proceedings of ICML (2021), 5213–5225.
    • 19. Kairouz, P. et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning 14, 1–2 (2021), 1–210.
    • 20. Konecný, J. et al. Federated learning: Strategies for improving communication efficiency. In Proceedings of NeurIPS (2016), 5–10.
    • 21. Li, T., Hu, S., Beirami, A., and Smith, V. Ditto: Fair and robust federated learning through personalization. In Proceedings of ICML (2021), 6357–6368.
    • 22. Li, T., Liu, Z., Sekar, V., and Smith, V. Privacy for free: Communication-efficient learning with differential privacy using sketches. arXiv:1911.00972 (2019).
    • 23. Li, Y. et al. Multi-stage asynchronous federated learning with adaptive differential privacy. IEEE Trans. Pattern Anal. Mach. Intell. 46, 2 (2024), 1243–1256.
    • 24. Liu, R. et al. Flame: Differentially private federated learning in the shuffle model. In Proceedings of AAAI 35, 10 (2021), 8688–8696.
    • 25. Liu, R., Cao, Y., Yoshikawa, M., and Chen, H. FedSel: Federated SGD under local differential privacy with top-k dimension selection. In Proceedings of DASFAA (2020), 485–501.
    • 26. McMahan, B. et al. Communication-efficient learning of deep networks from decentralized data. In Proceedings of AISTATS (2017), 1273–1282.
    • 27. McMahan, H.B., Ramage, D., Talwar, K., and Zhang, L. Learning differentially private recurrent language models. In Proceedings of ICLR (2018), 1–10.
    • 28. Rigaki, M. and Garcia, S. A survey of privacy attacks in machine learning. ACM Comput. Surv. 56, 4 (2023), 1–34.
    • 29. Rodríguez-Barroso, N. et al. Federated learning and differential privacy: Software tools analysis, the Sherpa.ai framework and methodological guidelines for preserving data privacy. Information Fusion 64 (2020), 270–292.
    • 30. Roy Chowdhury, A. et al. Cryptϵ: Crypto-assisted differential privacy on untrusted servers. In Proceedings of ACM SIGMOD (2020), 603–619.
    • 31. Shokri, R. and Shmatikov, V. Privacy-preserving deep learning. In Proceedings of ACM CCS (2015), 1310–1321.
    • 32. Shokri, R., Stronati, M., Song, C., and Shmatikov, V. Membership inference attacks against machine learning models. In Proceedings of IEEE S&P (2017), 3–18.
    • 33. Sun, L. and Lyu, L. Federated model distillation with noise-free differential privacy. In Proceedings of IJCAI (2021), 1563–1570.
    • 34. Sun, L., Qian, J., and Chen, X. LDP-FL: Practical private aggregation in federated learning with local differential privacy. In Proceedings of IJCAI (2021), 1571–1578.
    • 35. Wagh, S., He, X., Machanavajjhala, A., and Mittal, P. DP-cryptography: Marrying differential privacy and cryptography in emerging applications. Commun. ACM 64, 2 (2021), 84–93.
    • 36. Wei, K. et al. Federated learning with differential privacy: Algorithms and performance analysis. IEEE Trans. Inf. Forensics Security 15 (2020), 3454–3469.
    • 37. Xu, J., Zhang, W., and Wang, F. A(DP)2SGD: Asynchronous decentralized parallel stochastic gradient descent with differential privacy. IEEE Trans. Pattern Anal. Mach. Intell. 44, 11 (2022), 8036–8047.
    • 38. Zhang, J. et al. Functional mechanism: Regression analysis under differential privacy. Proceedings of the VLDB Endowment 5, 11 (2012), 1364–1375.
    • 39. Zhu, L., Liu, Z., and Han, S. Deep leakage from gradients. In Proceedings of NeurIPS (2019), 14774–14784.
    • 40. Zhu, Y. and Wang, Y.-X. Poisson subsampled Rényi differential privacy. In Proceedings of ICML. PMLR (2019), 7634–7642.
