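Distributed TensorFlow Neural Network Training Benchmark Reference
▪ Prepare the drivers, kernel software, training framework, and collective communication software
▪ Configure the network, servers, and container platform
▪ Run the distributed cluster training job through the NCCL and Horovod collective communication frameworks
https://docs.mellanox.com/pages/releaseview.action?pageId=15049828
https://docs.mellanox.com/pages/releaseview.action?pageId=15049840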
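As a rough illustration of the last step (launching a distributed training job with Horovod), the sketch below assembles a `horovodrun` command line. The `-np` (total process count) and `-H` (host:slots list) options are standard `horovodrun` flags; the helper name `build_horovodrun_command`, the host names `node1`/`node2`, and the script name `train.py` are hypothetical placeholders, not taken from the referenced guides.

```python
import shlex


def build_horovodrun_command(hosts, script, script_args=()):
    """Build a horovodrun argument list.

    hosts: dict mapping a host name to the number of worker slots
           (typically one per GPU) on that host. Hypothetical values.
    script: path of the training script to launch (hypothetical).
    """
    if not hosts:
        raise ValueError("at least one host is required")
    # Total number of worker processes across all hosts.
    total_procs = sum(hosts.values())
    # horovodrun expects hosts as "name:slots" pairs separated by commas.
    host_list = ",".join(f"{name}:{slots}" for name, slots in hosts.items())
    return [
        "horovodrun",
        "-np", str(total_procs),
        "-H", host_list,
        "python", script,
        *script_args,
    ]


if __name__ == "__main__":
    cmd = build_horovodrun_command({"node1": 4, "node2": 4}, "train.py")
    # Print the command instead of executing it, so the sketch runs anywhere.
    print(shlex.join(cmd))
```

The printed command can then be run on a cluster where Horovod, NCCL, and the RDMA drivers are already installed as described in the guides above.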
More AI Benchmark Reference Deployment Guides
▪ TensorFlow solutions on https://community.mellanox.com/s/topic/0TO50000000g1umGAA/tensorflow
▪ Reference Deployment Guide for RDMA over Ethernet (RoCE) accelerated TensorFlow 1.6 with an NVIDIA GPU Card over Mellanox 100 GbE Network https://community.mellanox.com/s/article/reference-deployment-guide-for-rdma-over-ethernet-roce--accelerated-tensorflow-1-6-with-an-nvidia-gpu-card-over-mellanox-100-gbe-network
▪ RDG for distributed, dockerised, RDMA accelerated Horovod training framework on HPE Apollo 6500 servers and 100Gb InfiniBand fabric https://docs.mellanox.com/pages/releaseview.action?pageId=15049840
▪ How To build and run RDMA / RoCE accelerated Horovod framework Docker https://docs.mellanox.com/pages/releaseview.action?pageId=15049724
▪ RDG for Accelerated ML and HPC Applications on K8s Cluster over Ethernet with RoCE https://docs.mellanox.com/pages/releaseview.action?pageId=15049828
Original document download link (HPC GPU RDMA AI DIST)