Section: New Software and Platforms
Participants: Corentin Hardy, Bruno Sericola
Our recent work, in collaboration with Technicolor, on deep learning and distributed learning led us to study a form of data parallelism called the Parameter Server model. In this model, the training of a deep neural network is shared among many devices (called the workers) through a centralized Parameter Server (PS). We deployed a platform that allows us to experiment with different state-of-the-art algorithms based on the PS model. The platform consists of a single powerful machine running many Linux containers (LXC); each container executes a TensorFlow session and can act as a worker or as the PS. The first experiments were used to validate the correct functioning of the platform, to better understand its limitations, and to determine what can be measured in an unbiased way. Further experiments helped us understand the role of the different parameters of the overall model, mainly those related to the distribution over user devices, and their impact on learning (accuracy of the model, number of iterations needed to learn it). During these experiments, we observed that the main bottleneck is the ingress traffic of the PS during the learning phase. To reduce this ingress traffic, we chose to compress the messages sent by the workers to the PS. We proposed in  a method that reduces this ingress traffic by up to two orders of magnitude while preserving a good accuracy of the learned model. This new method, called AdaComp, is available on GitHub (https://github.com/Hardy-c/AdaComp).
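The worker/PS exchange described above can be sketched as follows. This is a minimal illustrative sketch, not the experimental platform itself: a single PS object holds the parameters, and each worker pulls the current parameters, computes a gradient, and pushes it back for an SGD update. The `ParameterServer` class, the toy one-dimensional objective, and all constants are assumptions introduced for illustration.

```python
# Minimal sketch of the Parameter Server (PS) model of data parallelism.
# Workers pull parameters, compute gradients, and push them to the PS,
# which applies the update. Toy 1-D objective; not the actual platform.

class ParameterServer:
    def __init__(self, params, lr=0.25):
        self.params = list(params)   # current model parameters
        self.lr = lr                 # learning rate

    def pull(self):
        # Workers fetch the latest parameters from the PS.
        return list(self.params)

    def push(self, grads):
        # Workers send gradients (the PS ingress traffic); the PS
        # applies a plain SGD step.
        self.params = [p - self.lr * g for p, g in zip(self.params, grads)]


def worker_gradient(params, target):
    # Toy gradient of the loss (p - target)^2, standing in for backprop.
    return [2.0 * (p - target) for p in params]


ps = ParameterServer(params=[0.0])
for step in range(20):
    for target in [1.0, 1.0, 1.0]:   # three "workers", same data here
        grads = worker_gradient(ps.pull(), target)
        ps.push(grads)
print(ps.params[0])                  # converges toward 1.0
```

In the real platform each worker is an LXC container running its own TensorFlow session, and the pull/push calls become network messages, which is why the PS ingress traffic dominates.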
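The compression idea can be illustrated with a simple sparsifier. This is a generic top-k scheme given only as a sketch, not the actual AdaComp algorithm: the worker sends only the k largest-magnitude gradient components as (index, value) pairs, and the PS treats the missing components as zero. All function names here are hypothetical.

```python
# Hedged sketch of gradient compression to reduce PS ingress traffic.
# Generic top-k sparsification, NOT the exact AdaComp method: keep only
# the k largest-magnitude components and send them as (index, value).

def compress_topk(grad, k):
    # Indices of the k entries with largest absolute value.
    idx = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    return [(i, grad[i]) for i in sorted(idx)]

def decompress(pairs, dim):
    # The PS reconstructs a dense gradient, zeros elsewhere.
    dense = [0.0] * dim
    for i, v in pairs:
        dense[i] = v
    return dense

grad = [0.01, -0.9, 0.05, 0.8, -0.02, 0.003]
msg = compress_topk(grad, k=2)       # 2 pairs sent instead of 6 floats
restored = decompress(msg, len(grad))
print(msg)                           # [(1, -0.9), (3, 0.8)]
```

Sending k pairs instead of a full dense gradient is what makes order-of-magnitude reductions in ingress traffic possible; the accuracy question is whether the discarded small components matter, which is what the experiments on the platform evaluate.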