[Paper] Communication-Efficient Learning
This is a review of the paper "Communication-Efficient Learning of Deep Networks from Decentralized Data".
The original paper can be found at the link below.
Communication-Efficient Learning of Deep Networks from Decentralized Data
Client(Node) – Server(Aggregator)
Contribution
- Identification of the problem of training on decentralized data from mobile devices as an important research direction
- Algorithm
- The FederatedAveraging (FedAvg) algorithm, which combines local SGD on each client with server-side model averaging
Federated Optimization Issues
- Non-IID & Unbalanced
- Massively distributed: the number of clients is much larger than the average number of examples per client
- Limited communication: clients are frequently offline or on slow, expensive connections
- (Practical) client datasets change over time as data is added and deleted
- (Practical) client availability correlates with the local data distribution
- Communication costs dominate
K: number of clients
C: fraction of clients selected each round; C controls the global batch size, with C = 1 corresponding to full-batch (non-stochastic) gradient descent over all client data
Non-convex Objectives
Purpose: Decrease the number of rounds of communication
- Increasing parallelism: more clients working independently between communication rounds
- Increasing computation on each client
- One-shot averaging
- FederatedAveraging Algorithm
Large batch synchronous SGD
The amount of computation is controlled by
- C: the fraction of clients that perform computation each round
- E: the number of training passes (epochs) each client makes over its local data each round
- B: the local minibatch size used for the client updates
FederatedSGD (FedSGD): the baseline, with B = ∞ (the full local dataset as a single batch) and E = 1, i.e., one gradient step per client per round
FederatedAveraging (FedAvg): the algorithm of interest, obtained by adding more local computation (E > 1 and/or B < ∞); a minimal sketch follows below
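Below is a minimal, self-contained sketch of the FedAvg loop in the paper's notation (K, C, E, B), written against a toy linear-regression task. The data generation, learning rate, and helper names are my own illustration, not the paper's code.

```python
import numpy as np

# Minimal FedAvg sketch (illustrative, not the paper's implementation).
# Toy task: linear regression; the "model" is just a weight vector.
rng = np.random.default_rng(0)

def make_clients(K=10, n_k=100, d=5):
    """Generate K clients, each holding n_k (x, y) pairs from a shared linear model."""
    w_true = rng.normal(size=d)
    clients = []
    for _ in range(K):
        X = rng.normal(size=(n_k, d))
        y = X @ w_true + 0.1 * rng.normal(size=n_k)
        clients.append((X, y))
    return clients

def client_update(w, data, E, B, eta):
    """Run E epochs of minibatch SGD on one client's local data, return the new weights."""
    X, y = data
    n_k = len(y)
    w = w.copy()
    for _ in range(E):
        idx = rng.permutation(n_k)
        for start in range(0, n_k, B):
            batch = idx[start:start + B]
            Xb, yb = X[batch], y[batch]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)  # MSE gradient
            w -= eta * grad
    return w

def fed_avg(clients, rounds=50, C=0.5, E=5, B=10, eta=0.05):
    """Server loop: sample max(C*K, 1) clients, run local SGD, average weighted by n_k."""
    K = len(clients)
    d = clients[0][0].shape[1]
    w_global = np.zeros(d)
    for _ in range(rounds):
        m = max(int(C * K), 1)
        selected = rng.choice(K, size=m, replace=False)
        new_ws, sizes = [], []
        for k in selected:
            new_ws.append(client_update(w_global, clients[k], E, B, eta))
            sizes.append(len(clients[k][1]))
        # Weighted average of the returned models, proportional to local data size n_k.
        w_global = np.average(np.stack(new_ws), axis=0, weights=np.array(sizes, dtype=float))
    return w_global

clients = make_clients()
print("learned weights:", np.round(fed_avg(clients), 3))
```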
RESULT
Model Architecture
Data : MNIST digit recognition
Model 1) MNIST 2NN
A multilayer perceptron with 2 hidden layers of 200 units each and ReLU activations (199,210 total parameters)
Model 2) CNN
Two 5×5 convolutional layers (32 and 64 channels, each followed by 2×2 max pooling), then a fully connected layer with 512 units and ReLU, and a 10-class softmax output (1,663,370 total parameters)
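As a reading aid, here is a sketch of the two MNIST models in PyTorch (my choice of framework, not the paper's); the padding and bias conventions are assumptions, so the CNN's exact parameter count may differ slightly from the figure quoted above.

```python
import torch.nn as nn

class MNIST2NN(nn.Module):
    """MLP: 2 hidden layers of 200 units with ReLU, 10-way output (~199K params)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 200), nn.ReLU(),
            nn.Linear(200, 200), nn.ReLU(),
            nn.Linear(200, 10),
        )

    def forward(self, x):
        return self.net(x)

class MNISTCNN(nn.Module):
    """Two 5x5 conv layers (32, 64 channels), each followed by 2x2 max pooling,
    then a 512-unit fully connected layer and a 10-way output (~1.66M params)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(7 * 7 * 64, 512), nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, x):
        return self.net(x)

for model in (MNIST2NN(), MNISTCNN()):
    print(type(model).__name__, sum(p.numel() for p in model.parameters()))
```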
Data Distribution
- IID: data shuffled and partitioned across 100 clients with 600 examples each
- Non-IID: data sorted by digit label and divided into 200 shards of size 300; each of the 100 clients receives 2 shards, so most clients see only two digits (a partition sketch follows after this list)
- LSTM language model: a stacked character-level LSTM trained on text from the works of Shakespeare, with each speaking role treated as a client (a naturally unbalanced, non-IID split)
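A rough sketch of the pathological non-IID MNIST partition described above, using placeholder labels in place of the real MNIST labels:

```python
import numpy as np

# Sort examples by digit label, cut into 200 shards of 300 examples,
# and give each of 100 clients 2 shards (so most clients see only 2 digits).
rng = np.random.default_rng(0)
n_clients, n_shards, shard_size = 100, 200, 300

labels = rng.integers(0, 10, size=n_shards * shard_size)    # stand-in for MNIST labels
order = np.argsort(labels)                                   # example indices sorted by digit
shards = order.reshape(n_shards, shard_size)                 # 200 shards of 300 sorted indices

shard_ids = rng.permutation(n_shards).reshape(n_clients, 2)  # 2 random shards per client
client_indices = {c: np.concatenate(shards[shard_ids[c]]) for c in range(n_clients)}

print(len(client_indices[0]), "examples on client 0,",
      "digits:", sorted(set(labels[client_indices[0]])))
```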
Parallelism
With B = ∞ (full local batch), there is only a small advantage to increasing the client fraction C
With smaller B, increasing C gives a significant improvement, especially on the non-IID split
Computation per Client
For fixed C = 0.1, adding more computation per client (decreasing B, increasing E, or both) means more local SGD updates per round, which reduces communication costs.
- Expected number of updates per client per round: u = (E · n_k) / B; with n_k ≈ n / K this gives u = nE / (KB) (n: total number of training examples, n_k: examples on client k)
- Increasing u by varying either E or B is effective
- As long as B is large enough to exploit the parallelism available on the client hardware, there is essentially no cost in computation time for lowering it, so B is the parameter to tune first
- More computation per client decreases the number of rounds needed; FedAvg outperforms the FedSGD baseline
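As a quick sanity check on the formula above, a small sketch (assuming the MNIST IID split of n = 60,000 examples over K = 100 clients; the E and B values below are example configurations):

```python
# Expected local updates per client per round: u = E * n_k / B.
n, K = 60_000, 100
n_k = n // K  # 600 examples per client under the IID split

def updates_per_round(E, B):
    """u = E * n_k / B, the number of local SGD steps a client takes per round."""
    return E * n_k / B

print(updates_per_round(E=1, B=n_k))  # FedSGD-style baseline: 1 update per round
print(updates_per_round(E=5, B=10))   # more local work (E=5, B=10): 300 updates per round
```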