[Paper] Communication-Efficient Learning
This is a review of the paper "Communication-Efficient Learning of Deep Networks from Decentralized Data".
The original paper can be found at the link below.
Communication-Efficient Learning of Deep Networks from Decentralized Data
Client(Node) – Server(Aggregator)
Contribution
- Identification of the problem of training on decentralized data from mobile devices as an important research direction
- Algorithm
- The FederatedAveraging (FedAvg) algorithm, which combines local SGD on each client with server-side model averaging
Federated Optimization Issues
- Non-IID & Unbalanced
- Massively distributed: the number of clients is much larger than the average number of examples per client
- Limited communication: clients are frequently offline or on slow, expensive connections
- (Practical) client datasets change over time as data is added and deleted
- (Practical) client availability correlates with the local data distribution
- Communication costs dominate
K: number of clients
C: fraction of clients selected each round; C controls the global batch size, with C = 1 corresponding to full-batch (non-stochastic) gradient descent over all client data
Non-convex Objectives
Purpose: Decrease the number of rounds of communication
- Increasing parallelism: more clients working independently between communication rounds
- Increasing computation on each client
- One-shot averaging
- FederatedAveraging Algorithm
Large batch synchronous SGD
The amount of computation is controlled by
- C: the fraction of clients that perform computation each round
- E: the number of training passes (epochs) each client makes over its local data each round
- B: the local minibatch size used for the client updates
FederatedSGD (FedSGD): the baseline, with B = ∞ (the full local dataset as a single batch) and E = 1, i.e., one gradient step per client per round
FederatedAveraging (FedAvg): the algorithm of interest, obtained by adding more local computation (E > 1 and/or B < ∞); a minimal sketch follows below
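Below is a minimal, self-contained sketch of the FedAvg loop in the paper's notation (K, C, E, B), written against a toy linear-regression task. The data generation, learning rate, and helper names are my own illustration, not the paper's code.

```python
import numpy as np

# Minimal FedAvg sketch (illustrative, not the paper's implementation).
# Toy task: linear regression; the "model" is just a weight vector.
rng = np.random.default_rng(0)

def make_clients(K=10, n_k=100, d=5):
    """Generate K clients, each holding n_k (x, y) pairs from a shared linear model."""
    w_true = rng.normal(size=d)
    clients = []
    for _ in range(K):
        X = rng.normal(size=(n_k, d))
        y = X @ w_true + 0.1 * rng.normal(size=n_k)
        clients.append((X, y))
    return clients

def client_update(w, data, E, B, eta):
    """Run E epochs of minibatch SGD on one client's local data, return the new weights."""
    X, y = data
    n_k = len(y)
    w = w.copy()
    for _ in range(E):
        idx = rng.permutation(n_k)
        for start in range(0, n_k, B):
            batch = idx[start:start + B]
            Xb, yb = X[batch], y[batch]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)  # MSE gradient
            w -= eta * grad
    return w

def fed_avg(clients, rounds=50, C=0.5, E=5, B=10, eta=0.05):
    """Server loop: sample max(C*K, 1) clients, run local SGD, average weighted by n_k."""
    K = len(clients)
    d = clients[0][0].shape[1]
    w_global = np.zeros(d)
    for _ in range(rounds):
        m = max(int(C * K), 1)
        selected = rng.choice(K, size=m, replace=False)
        new_ws, sizes = [], []
        for k in selected:
            new_ws.append(client_update(w_global, clients[k], E, B, eta))
            sizes.append(len(clients[k][1]))
        # Weighted average of the returned models, proportional to local data size n_k.
        w_global = np.average(np.stack(new_ws), axis=0, weights=np.array(sizes, dtype=float))
    return w_global

clients = make_clients()
print("learned weights:", np.round(fed_avg(clients), 3))
```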
RESULT
Model Architecture
Data : MNIST digit recognition
Model 1) MNIST 2NN
A multilayer perceptron with 2 hidden layers of 200 units each and ReLU activations (199,210 total parameters)
Model 2) CNN
Two 5×5 convolutional layers (32 and 64 channels, each followed by 2×2 max pooling), then a fully connected layer with 512 units and ReLU, and a 10-class softmax output (1,663,370 total parameters)
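As a reading aid, here is a sketch of the two MNIST models in PyTorch (my choice of framework, not the paper's); the padding and bias conventions are assumptions, so the CNN's exact parameter count may differ slightly from the figure quoted above.

```python
import torch.nn as nn

class MNIST2NN(nn.Module):
    """MLP: 2 hidden layers of 200 units with ReLU, 10-way output (~199K params)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 200), nn.ReLU(),
            nn.Linear(200, 200), nn.ReLU(),
            nn.Linear(200, 10),
        )

    def forward(self, x):
        return self.net(x)

class MNISTCNN(nn.Module):
    """Two 5x5 conv layers (32, 64 channels), each followed by 2x2 max pooling,
    then a 512-unit fully connected layer and a 10-way output (~1.66M params)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(7 * 7 * 64, 512), nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, x):
        return self.net(x)

for model in (MNIST2NN(), MNISTCNN()):
    print(type(model).__name__, sum(p.numel() for p in model.parameters()))
```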
Data Distribution
- IID: data shuffled and partitioned across 100 clients with 600 examples each
- Non-IID: data sorted by digit label and divided into 200 shards of size 300; each of the 100 clients receives 2 shards, so most clients see only two digits (a partition sketch follows after this list)
- LSTM language model: a stacked character-level LSTM trained on text from the works of Shakespeare, with each speaking role treated as a client (a naturally unbalanced, non-IID split)
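A rough sketch of the pathological non-IID MNIST partition described above, using placeholder labels in place of the real MNIST labels:

```python
import numpy as np

# Sort examples by digit label, cut into 200 shards of 300 examples,
# and give each of 100 clients 2 shards (so most clients see only 2 digits).
rng = np.random.default_rng(0)
n_clients, n_shards, shard_size = 100, 200, 300

labels = rng.integers(0, 10, size=n_shards * shard_size)    # stand-in for MNIST labels
order = np.argsort(labels)                                   # example indices sorted by digit
shards = order.reshape(n_shards, shard_size)                 # 200 shards of 300 sorted indices

shard_ids = rng.permutation(n_shards).reshape(n_clients, 2)  # 2 random shards per client
client_indices = {c: np.concatenate(shards[shard_ids[c]]) for c in range(n_clients)}

print(len(client_indices[0]), "examples on client 0,",
      "digits:", sorted(set(labels[client_indices[0]])))
```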
Parallelism
With B = ∞ (full local batch), there is only a small advantage to increasing the client fraction C
With smaller B, increasing C gives a significant improvement, especially on the non-IID split
Computation per Client
For fixed C = 0.1, adding more computation per client (decreasing B, increasing E, or both) means more local SGD updates per round, which reduces communication costs.
- Expected number of updates per client per round: u = (E · n_k) / B; with n_k ≈ n / K this gives u = nE / (KB) (n: total number of training examples, n_k: examples on client k)
- Increasing u by varying either E or B is effective
- As long as B is large enough to exploit the parallelism available on the client hardware, there is essentially no cost in computation time for lowering it, so B is the parameter to tune first
- More computation per client decreases the number of rounds needed; FedAvg outperforms the FedSGD baseline
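As a quick sanity check on the formula above, a small sketch (assuming the MNIST IID split of n = 60,000 examples over K = 100 clients; the E and B values below are example configurations):

```python
# Expected local updates per client per round: u = E * n_k / B.
n, K = 60_000, 100
n_k = n // K  # 600 examples per client under the IID split

def updates_per_round(E, B):
    """u = E * n_k / B, the number of local SGD steps a client takes per round."""
    return E * n_k / B

print(updates_per_round(E=1, B=n_k))  # FedSGD-style baseline: 1 update per round
print(updates_per_round(E=5, B=10))   # more local work (E=5, B=10): 300 updates per round
```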