• Errors when training a PyTorch model under nohup, and basic tmux usage


    Problem:

    When a PyTorch model is trained in the background with nohup, closing the SSH window sometimes produces the following error:

    WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
    WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4156332 closing signal SIGHUP
    WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4156333 closing signal SIGHUP
    Traceback (most recent call last):
      File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
        main()
      File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
        launch(args)
      File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
        run(args)
      File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
        )(*cmd_args)
      File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 252, in launch_agent
        result = agent.run()
      File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
        result = f(*args, **kwargs)
      File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
        result = self._invoke_run(role)
      File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py", line 843, in _invoke_run
        time.sleep(monitor_interval)
      File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
        raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
    torch.distributed.elastic.multiprocessing.api.SignalException: Process 4156314 got signal: 1

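    This is often described as a nohup bug, but nohup itself behaves as designed: it only starts the command with SIGHUP (signal 1) set to ignored. The traceback shows the torch.distributed elastic launcher handling signal 1 in _terminate_process_handler, which indicates that the launcher installs its own SIGHUP handler and so replaces the ignore setting from nohup. When the SSH window closes, the launcher receives SIGHUP and shuts down the workers. A simple way around this is to run the training inside tmux instead of nohup.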
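
A small experiment can show why nohup alone may not protect a program that handles SIGHUP itself. The sketch below assumes `python3` and `nohup` are on the PATH. `demo_sighup_override` and the inline `on_hup` handler are illustrative names that stand in for the launcher; they are not PyTorch code. The process starts under nohup, checks that SIGHUP is ignored at startup, installs a handler, and then sends itself SIGHUP.

```shell
#!/bin/sh
# Illustration only: nohup starts a command with SIGHUP ignored, but the
# command can still replace that setting by installing its own handler.
demo_sighup_override() {
    # "| cat" keeps stdout a pipe, so nohup does not redirect it to nohup.out.
    nohup python3 -c '
import os
import signal

def on_hup(signum, frame):
    # Stand-in for a program-level handler such as the one in the traceback.
    print("handler ran for signal", int(signum))

# Under nohup the inherited SIGHUP disposition is "ignore".
print("ignored at start:", signal.getsignal(signal.SIGHUP) == signal.SIG_IGN)

# Installing a handler replaces the ignore setting from nohup.
signal.signal(signal.SIGHUP, on_hup)
os.kill(os.getpid(), signal.SIGHUP)
print("done")
' 2>/dev/null | cat
}
```

Calling `demo_sighup_override` should report that SIGHUP was ignored at startup and that the handler ran anyway. Inside tmux the situation does not come up in the same way, because the training process is attached to the terminal that the tmux server provides, and that terminal stays open when the SSH window closes.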

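    Solution:

    Ruan Yifeng's tutorial is detailed and clear, and takes only a few minutes to work through: Tmux 使用教程 - 阮一峰的网络日志 (ruanyifeng.com)

    The tmux commands are summarized below. For simple background training, these few are enough: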

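    sudo apt-get install tmux   # install (Debian/Ubuntu)
    tmux                        # start tmux and enter a tmux window
    exit                        # leave the tmux window, or press [ Ctrl+d ]
    tmux new -s <session-name>  # create a session with the given session name
    # [ Ctrl+b ] is the tmux prefix key: press it first, then press another key to run the matching command
    [ Ctrl+b ] [ d ]                         # detach the session from the window, or run: tmux detach
    tmux ls                                  # list all sessions, same as: tmux list-sessions
    tmux attach -t <session-name>            # attach the terminal window to the named session
    tmux kill-session -t <session-name>      # kill the named session
    tmux switch -t <session-name>            # switch to the named session (run from inside tmux)
    tmux rename-session -t 0 <session-name>  # rename session 0 to <session-name>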
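
The create and attach commands above can be combined into one helper, so the same call works whether or not the session already exists. This is a minimal sketch: `session_name_for` and `attach_or_create` are hypothetical names, and replacing every character outside letters, digits, `_` and `-` is a conservative choice made here (tmux gives `.` and `:` a special meaning in target names).

```shell
#!/bin/sh
# Turn an arbitrary label (for example a script name) into a conservative
# tmux session name: anything outside A-Z a-z 0-9 _ - becomes "_".
session_name_for() {
    printf '%s' "$1" | tr -c 'A-Za-z0-9_-' '_'
}

# Attach to the session if it already exists, otherwise create it.
attach_or_create() {
    name=$(session_name_for "$1")
    if tmux has-session -t "$name" 2>/dev/null; then
        tmux attach -t "$name"
    else
        tmux new -s "$name"
    fi
}
```

For example, `attach_or_create train.py` would open or reattach to a session named `train_py`.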
    

    A simple tmux workflow:

    [terminal]: tmux new -s train_model       # create a session named train_model
    [tmux]: conda activate env_name           # inside the tmux session, activate the conda environment to use
    [tmux]: python train.py                   # inside the tmux session, start training the model
    [tmux]: [ Ctrl+b ] [ d ]                  # detach the session from the window
    [terminal]: tmux ls                       # list the session just created
    [terminal]: watch -n 1 -c gpustat --color # monitor GPU status
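
The same workflow can also be scripted so that no interactive attach and detach is needed. The sketch below is one possible way to do it, not part of the original workflow: `start_training` and `show_progress` are hypothetical helper names, and it assumes `conda activate` works in the shell that tmux starts, as in the manual steps. It creates the session already detached and types the commands into it with `tmux send-keys`.

```shell
#!/bin/sh
# Start training in a detached tmux session.
# Usage: start_training <session> <conda-env> <script>
start_training() {
    session=$1
    env_name=$2
    script=$3
    # -d creates the session without attaching the current terminal to it.
    tmux new-session -d -s "$session" || return 1
    # send-keys types each command into the session; Enter submits it.
    tmux send-keys -t "$session" "conda activate $env_name" Enter
    tmux send-keys -t "$session" "python $script" Enter
}

# Print the last lines currently visible in the session without attaching.
# Usage: show_progress <session> [lines]
show_progress() {
    tmux capture-pane -p -t "$1" | tail -n "${2:-20}"
}
```

For example, `start_training train_model env_name train.py` starts the run, `show_progress train_model` prints its recent output, and `tmux attach -t train_model` opens the session when a closer look is needed.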
    
  • Original article: https://blog.csdn.net/qq_39435411/article/details/127131681