PyTorch multi-GPU distributed training
Problem: the all_gather_object call (surfacing in torch._C._distributed_c10d) hangs in a blocking wait and deadlocks during multi-GPU training.
The fix is to call torch.cuda.set_device(local_rank) before any inter-process communication takes place.
The PyTorch documentation for all_gather_object explains why: "For NCCL-based process groups, internal tensor representations of objects must be moved to the GPU device before communication takes place. In this case, the device used is given by torch.cuda.current_device() and it is the user's responsibility to ensure that this is set so that each rank has an individual GPU, via torch.cuda.set_device()."
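A minimal sketch of the fix, assuming the script is launched with torchrun (which sets the LOCAL_RANK environment variable); the object being gathered here is just an illustrative placeholder:

```python
import os
import torch
import torch.distributed as dist


def main():
    # torchrun sets LOCAL_RANK for each spawned process.
    local_rank = int(os.environ["LOCAL_RANK"])

    # Call set_device BEFORE any NCCL communication, so that
    # torch.cuda.current_device() points at this rank's own GPU.
    # Without this, every rank may default to GPU 0 and the
    # collective can block forever.
    torch.cuda.set_device(local_rank)

    dist.init_process_group(backend="nccl")

    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # all_gather_object pickles the Python object; with NCCL the
    # intermediate tensors live on torch.cuda.current_device().
    obj = {"rank": rank, "msg": f"hello from rank {rank}"}
    gathered = [None] * world_size
    dist.all_gather_object(gathered, obj)

    if rank == 0:
        print(gathered)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Run it with, for example, `torchrun --nproc_per_node=4 demo.py`. The key point is only the ordering: set_device(local_rank) must come before init_process_group / the first collective, so each rank binds to a distinct GPU.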