Machine learning (ML) technologies have achieved remarkable success in powering practical artificial intelligence (AI) applications, such as automatic speech recognition and computer vision. However, we face two major challenges in adopting AI today. One is that data in most industries exists in the form of isolated islands. The other is the ever-increasing demand for privacy-preserving AI. Conventional AI approaches based on centralized data collection cannot meet these challenges. How to solve the problem of data fragmentation and isolation while complying with privacy-protection laws and regulations is a major challenge for AI researchers and practitioners.
On the legal front, lawmakers and regulatory bodies are coming up with new laws ruling how data shall be managed and used.3 One prominent example is the adoption of the General Data Protection Regulation (GDPR) by the European Union in 2018. In the United States, the California Consumer Privacy Act will be enacted in 2020. China’s Cyber Security Law, which came into effect in 2017, also imposes strict controls on data collection and transactions.
Under this new legislative landscape, collecting and sharing data among different organizations is becoming increasingly difficult, if not outright impossible. In addition, the sensitive nature of certain data (for example, financial transactions and medical records) prohibits free data circulation and forces the data to exist in data silos. Due to competition, user privacy, data security, and complicated administrative procedures, even data integration among different departments of the same company faces heavy resistance. As the old privacy-intrusive way of collecting and sharing data is no longer allowed, data consolidation involving different data owners is becoming extremely challenging.10
Data silos and privacy concerns are two of the most challenging impediments to the progress of AI. It is thus natural to seek solutions for building ML models that do not rely on collecting data into a centralized storage where model training takes place. One attractive idea is to train a sub-model at each location with only local data, and then let the parties at different sites communicate their respective sub-models in order to reach a consensus for a global model. To ensure user privacy and data confidentiality, the communication process is carefully engineered so that no site can reverse-engineer the private data of any other site. Meanwhile, the model is built as if the data sources were combined. This is the idea behind “federated machine learning,” or “federated learning (FL)” for short.7,9
FL was practiced by Google for next-word prediction on mobile devices.2,7 Google’s FL system serves as an example of a secure distributed learning environment for B2C (business to consumer) applications where all parties share the same data features and collaboratively train an ML model. Besides the B2C paradigm, the FL framework has been extended to support “cross-silos” scenarios and B2B (business-to-business) applications by the AI researchers in WeBank,a where each party has different sets of data features.5,6,9
In a nutshell, a fundamental change in algorithmic design with FL is that, instead of transferring raw data from site to site or to a server, we transfer ML model parameters in a secure way,2,10 so that parties cannot access the content of others’ data. FL is an algorithmic framework for building ML models that can be characterized by the following features.
- There are two or more parties interested in jointly building a model.
- Each party holds some data that it can use for model training.
- In the model-training process, the data held by each party does not leave that party.
- The model can be transferred in part from one party to another under an encryption scheme, such that no party can reverse-engineer the data of other parties.
- The performance of the federated model is a good approximation of an ideal model built with centralized data.
Techniques for privacy-preserving ML have been extensively studied,1 such as employing differential privacy (DP)4 and secure multi-party computation.10 DP involves adding noise to the training data, or obscuring certain sensitive features through generalization, until a third party can no longer identify an individual, making it infeasible to restore the original data and thereby protecting user privacy. However, DP still requires that the data be transmitted elsewhere, and it involves a trade-off between accuracy and privacy.
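As a concrete illustration of the DP idea, the sketch below releases a mean after clipping each record and adding Laplace noise calibrated to the query's sensitivity. The function names and parameters are illustrative, not drawn from any particular DP library:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample from Laplace(0, scale) via the inverse-CDF method."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_mean(values, lower, upper, epsilon):
    """Release the mean of `values` with epsilon-differential privacy.

    Each value is clipped to [lower, upper], so one record can change
    the mean by at most (upper - lower) / n -- the sensitivity that
    calibrates the Laplace noise scale.
    """
    n = len(values)
    clipped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / n
    return sum(clipped) / n + laplace_noise(sensitivity / epsilon)
```

Smaller values of epsilon add more noise, which is exactly the accuracy-versus-privacy trade-off noted above.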
FL can be classified into horizontal FL (HFL),7 vertical FL (VFL),9 and federated transfer learning (FTL),6 according to how data is distributed among the participating parties in the feature and sample spaces.10 Figures 1a–1c illustrate the three FL categories respectively for a two-party scenario.
Figure 1. Categorization of FL.10
HFL refers to the scenarios where the parties share overlapping data features, but differ in data samples. Different from HFL, VFL applies to the scenarios where the parties share overlapping data samples, but differ in data features. FTL is applicable to the scenarios where there is little overlap in either data samples or features. We also refer to HFL as sample-partitioned FL,10 or example-partitioned FL,5 as in a matrix form, samples correspond to the rows and features correspond to the columns (see Figure 1a). HFL is carried out across different horizontal rows, that is, data is partitioned by samples. We also refer to VFL as feature-partitioned FL,5,10 as VFL is carried out across different vertical columns, that is, data is partitioned by features (see Figure 1b).
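The row-versus-column distinction can be made concrete with a small sketch; the toy matrix and party names are illustrative:

```python
# Toy dataset: 4 samples (rows) x 3 features (columns).
data = [
    [1, 10, 100],
    [2, 20, 200],
    [3, 30, 300],
    [4, 40, 400],
]

# HFL (sample-partitioned): parties hold the same features for
# disjoint sets of samples -- a horizontal split across the rows.
hfl_party_a = data[:2]                   # samples 0-1, all features
hfl_party_b = data[2:]                   # samples 2-3, all features

# VFL (feature-partitioned): parties hold different features for the
# same (aligned) samples -- a vertical split across the columns.
vfl_party_a = [row[:2] for row in data]  # all samples, features 0-1
vfl_party_b = [row[2:] for row in data]  # all samples, feature 2
```

FTL corresponds to the remaining case, where neither the rows nor the columns of the two parties overlap to any useful degree.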
For example, when two organizations provide different services (for example, a bank and an e-commerce company) but have a large intersection of common customers (that is, aligned data samples), they may collaborate on the different data features they respectively own to achieve better ML models using VFL.5,9
An FL system architecture can employ the client-server model, as shown in Figure 2a. The coordinator C can be played by an authority such as a government department or replaced by a secure computing node.10 The communication between the coordinator C and the data owners A and B (a.k.a. parties) may be encrypted (for example, using homomorphic encryption2,10) to further defend against any information leakage. Further, the coordinator C may also be a logical entity and be located in either A or B. An FL system architecture can also employ the peer-to-peer model, without a coordinator, as illustrated in Figure 2b. The data owners A and B communicate directly without the help of a third party. While there are only two data owners in Figure 2, an FL system may generally include two or more data owners.7,10
Figure 2. Examples of VFL Architecture.10
Taking the client-server model shown in Figure 2a as an example, we summarize the encrypted and secure model training with VFL into the following four steps, after aligning the data samples between the two data owners.10
- Step 1: C creates encryption key pairs, and sends the public key to A and B.
- Step 2: A and B encrypt and exchange intermediate computation results for gradient and loss calculations.
- Step 3: A and B compute encrypted gradients and add an additional mask, respectively. B also computes the encrypted loss. A and B send encrypted results to C.
- Step 4: C decrypts gradients and loss and sends the corresponding results back to A and B. A and B unmask the gradients, and update their model parameters accordingly.
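The four steps above can be sketched with a textbook Paillier cryptosystem, whose additive homomorphism lets C aggregate values it cannot read. The tiny key size, the integer "partial gradients," and the single mask are illustrative simplifications, not a production protocol:

```python
import math
import random

def keygen(p=293, q=433):
    """Toy Paillier key generation (tiny primes for illustration only;
    real deployments use keys of 2048 bits or more)."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)          # valid because the generator g = n + 1
    return (n,), (lam, mu, n)     # (public key), (private key)

def encrypt(pub, m):
    (n,) = pub
    n2 = n * n
    while True:                   # pick a random r coprime to n
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(priv, c):
    lam, mu, n = priv
    x = pow(c, lam, n * n)
    return (((x - 1) // n) * mu) % n

def add_enc(pub, c1, c2):
    """Homomorphic addition: Dec(c1 * c2 mod n^2) = m1 + m2 (mod n)."""
    (n,) = pub
    return (c1 * c2) % (n * n)

# Step 1: C creates the key pair and shares only the public key.
pub, priv = keygen()

# Step 2: A and B encrypt and exchange intermediate results
# (toy integers stand in for the real gradient/loss quantities).
grad_a, grad_b = 7, 12
enc_a, enc_b = encrypt(pub, grad_a), encrypt(pub, grad_b)

# Step 3: the partial gradients are combined under encryption and a
# random mask is added, so C sees only the masked aggregate.
mask = 5
enc_masked = add_enc(pub, add_enc(pub, enc_a, enc_b), encrypt(pub, mask))

# Step 4: C decrypts and returns the masked value; the mask is
# removed locally to recover the aggregated gradient.
gradient = decrypt(priv, enc_masked) - mask   # 7 + 12 = 19
```

Because C only ever decrypts masked aggregates, it learns neither party's raw intermediate results, matching the information-flow constraints of the protocol above.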
Readers can find more information about the FL model training and inference procedures, such as the convergence speeds of the existing training algorithms, in the existing works5,10 and references therein.
FL enables us to build cross-enterprise, cross-data and cross-domain AI applications while complying with data protection laws and regulations. It has potential applications in finance, insurance, healthcare, education, smart city, and edge computing, and so forth.10 We present here two FL applications selected from the use casesb that have been deployed in practice by WeBank.
Use Case 1: FedRiskCtrl
The first use case is an FL application in finance. It is an example of federated risk control (FedRiskCtrl) for small and micro enterprise (SME) loans deployed by WeBank.c
There is an invoice agency A, which holds M invoice-related data features, x1(k), …, xM(k), for the kth SME. There is a bank B, which holds credit-related data features, xM+1(k), …, xN(k), and the label Y(k) for the kth SME, with N > M. The agency A and bank B collaboratively build a risk control model for SME loans using VFL.10
Before model training, we need to find the common SMEs served by A and B to align the training data samples, which is called private set intersection or secure entity alignment.10 After determining the aligned data samples between A and B, we can then follow the steps shown in Figure 2 for training a risk control model for SME loans.
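A simplified sketch of this entity alignment step: each party hashes its customer IDs with a shared secret salt and the hash sets are intersected. (Production PSI protocols use cryptographic blinding such as Diffie-Hellman-style exchanges rather than a shared salt; all names and IDs here are illustrative.)

```python
import hashlib

def blind(ids, salt):
    """Map each entity ID to a salted SHA-256 digest.

    The salted hash is a simplified stand-in for the cryptographic
    blinding used by real private set intersection protocols.
    """
    return {hashlib.sha256((salt + i).encode()).hexdigest(): i
            for i in ids}

def align(my_ids, their_hashes, salt):
    """Return my IDs whose blinded form also appears on the other side."""
    mine = blind(my_ids, salt)
    return sorted(mine[h] for h in mine.keys() & their_hashes)

salt = "shared-secret"                    # agreed out of band
bank_ids = ["sme01", "sme02", "sme03"]    # held by bank B
agency_ids = ["sme02", "sme03", "sme04"]  # held by agency A

agency_hashes = set(blind(agency_ids, salt))
common = align(bank_ids, agency_hashes, salt)   # the aligned SMEs
```

Only the SMEs in `common` participate in the subsequent VFL training; IDs outside the intersection are never revealed in the clear to the other party.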
FedRiskCtrl is implemented with the FATE (Federated AI Technology Enabler) platform.d With VFL, the agency A and the bank B do not need to expose their private data to each other, and the model built with FL is expected to perform as well as a model built with a centralized dataset. The model built with FL performs significantly better than the model built only with the bank B’s data.
Use Case 2: FedVision
The second use case is an FL application in edge computing. It is an example of federated computer vision (FedVision) for object detection deployed by WeBank.e
Due to privacy concerns and high cost of transmitting video data, it is difficult to centrally collect surveillance video data for model training in practice. With FedVision, surveillance video data collected and stored in the edge cloud of each surveillance company are no longer required to be uploaded to a central cloud for centralized model training.10 In FedVision, an initial object detection model is sent from the FL server to each surveillance company (that is, to each edge cloud), which then uses the locally stored data to train the object detection model. After a few local training epochs, the model parameters from each surveillance company are encrypted and sent to the FL server. The local model parameters are aggregated into a global federated model by the FL server and sent back to each surveillance company. This process iterates until the stopping criterion is met.
The model training process in FedVision is very similar to the federated averaging procedure for HFL model training.2,7 The final global federated model will be distributed to the participating surveillance companies in the federation, to be used for object detection, such as fire detection.
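A minimal sketch of the federated averaging loop described above, using a one-parameter least-squares model as a stand-in for the object detection network; all names, data, and hyperparameters are illustrative:

```python
def local_update(weight, data, lr=0.1, epochs=3):
    """A few local epochs of gradient descent on the toy model
    y = w * x, run entirely on one edge cloud's private data."""
    w = weight
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2 * x * (w * x - y)   # d/dw of (w*x - y)^2
    return w, len(data)

def fed_avg(updates):
    """Server step: average client weights, weighted by sample count."""
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total

# Each "edge cloud" holds its own surveillance data (here y = 2x),
# which never leaves the client.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
]

w_global = 0.0
for _ in range(10):   # iterate until the stopping criterion is met
    updates = [local_update(w_global, d) for d in clients]
    w_global = fed_avg(updates)   # aggregate, then redistribute
```

In the deployed system the exchanged updates are additionally encrypted in transit, and the final `w_global` plays the role of the global federated model distributed back to the surveillance companies.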
FL can overcome the challenges of data silos, small data, privacy issues, and lead us toward privacy-preserving AI. It will become the foundation of next-generation ML that caters to technological and societal needs for responsible AI development and applications.10
While FL has great potential, it also faces several practical challenges.5,10 The communication links between the local data owners and the coordinator may be slow and unstable. There may be a very large number of local data owners (for example, mobile devices) to manage. Data from different data owners in an FL system may follow non-identical distributions, and different data owners may hold unbalanced numbers of data samples, which may result in a biased model or even failure of model training.5 Incentivizing mobile device owners or organizations to participate in FL also needs further study. Incentive mechanisms for FL should be designed to make the federation fair and sustainable.8