TePDist: An Automatic Distributed Training System Infrastructure

TePDist (TEnsor Program Distributed) is an automatic distributed training system infrastructure for deep learning models, not merely an algorithm. The system operates in client/server mode: the client can be any front end capable of generating XLA HLO, while the server is responsible for distributed policy planning and for automatically launching the distributed tasks. Decoupling the client from the server is intended to ease future integration with different front-end frameworks. TePDist has its own runtime graph and task scheduler for distributed execution. The TePDist system is currently based on earlier versions of community TensorFlow…
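The division of responsibilities described above can be sketched in a few lines. This is a hypothetical illustration only: the class names, methods, and round-robin "policy" below are invented for clarity and are not the real TePDist API, which communicates over RPC using serialized XLA HLO modules.

```python
from dataclasses import dataclass, field


@dataclass
class HloModule:
    """Stand-in for an XLA HLO module produced by any front end."""
    name: str
    instructions: list = field(default_factory=list)


class Client:
    """Front-end side: only emits HLO; knows nothing about the
    cluster layout or the partitioning strategy."""

    def compile(self, name, ops):
        return HloModule(name=name, instructions=list(ops))


class Server:
    """Server side: owns distributed policy planning and, in the
    real system, the runtime graph and task scheduler."""

    def plan(self, module, num_devices):
        # Placeholder policy: round-robin instructions over devices.
        # The real server searches for a genuine sharding strategy.
        return {op: i % num_devices
                for i, op in enumerate(module.instructions)}


client = Client()
server = Server()
hlo = client.compile("mlp", ["dot.0", "add.1", "tanh.2", "dot.3"])
plan = server.plan(hlo, num_devices=2)
print(plan)  # each HLO instruction mapped to a device id
```

Because the client's only obligation is to produce HLO, any framework that lowers to XLA (TensorFlow, JAX, PyTorch/XLA, and so on) could in principle drive the same server.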


