Preface

This post records my experiments with the official Horovod image, the official Ignite image, and a manually built Ignite image.

Two machines were used for these experiments:

|                 | Device 1        | Device 2             |
| --------------- | --------------- | -------------------- |
| IP address      | 192.168.134.154 | 121.248.201.6        |
| Overlay address | 10.3.108.2      | 10.3.108.3           |
| GPUs            | 3090 24GiB × 2  | 3070 Laptop 8GiB × 1 |
| Alias           | 154Server       | WHYLenovo            |

Testing the official Horovod image

Only local single-node multi-GPU (horovod) and remote single-node multi-GPU (mpirun) worked; multi-node multi-GPU $\textcolor{red}{\text{would not run}}$!

Tutorials referenced: the official tutorial and a blog tutorial

  1. Start the Docker image; launch the container on both machines with the command below

    docker run \
    --gpus all \
    --network=host \
    --name horovod \
    --hostname horovod \
    -it --rm horovod/horovod
  2. First run the program below once on both machines: first to check that the GPU can actually be used, and second to download the dataset

    # To use two GPUs on device 1, change localhost:1 to localhost:2 when running this on device 1
    horovodrun -np 1 -H localhost:1 python ./pytorch/pytorch_mnist.py
  3. Set up passwordless SSH in both directions between the two machines

    mkdir ~/.ssh
    # Create the private key. For convenience I reuse a previously generated key here; in production, generate a fresh one for safety
    echo '
    -----BEGIN OPENSSH PRIVATE KEY-----
    b3BlbnNzaC1rZXktdjEAAAAABG5vbmUAAAAEbm9uZQAAAAAAAAABAAABlwAAAAdzc2gtcn
    NhAAAAAwEAAQAAAYEAzA9CWCJ2x9GRqBubTQN6j8jS0QvRXZBmJ08dyXGvV8vG9IFsr6eB
    lFuOb4BVf+OyJPY4B5a+Q/gNHChfpXCB3KpGoWEPuA7rPoFuKs3yNtx6D+/yaIJTaGHA3p
    gmbBRTxxq+QsUkv/tsUY21B5mDoZqra+BJ/geWAL50b1/qJzsLw29D5yB9Bl8GqqV/NpTk
    glqEskM03xDPl1xR0KixTn1rOrr9GAinv+J7AvVdjk+qsDF/dtmyiYl+NRHA9TSjTmcOrt
    8J1NVJ5aR4cDzINYGN1F+iryQf/szyeehIfdRnbioZOElh9t9/VU9fZa/cQtCJ0naxGQxA
    0aZlq+99cCEX1uiwqKLWDxCc34vevHyyscy980/sChMjMolFdFtJa4qWJHBaJQ41g72Yn5
    WGjP9leX1eHExU5dL3wpYgPb28cqvgqFGuST9phTQMn5gPvVJ9Qg1lVJ7QeSzuaKDTe6TS
    qseoWxFp5coNCHTGIfHFxdMUElffnwgl3cYRjAclAAAFiGq6Q4VqukOFAAAAB3NzaC1yc2
    EAAAGBAMwPQlgidsfRkagbm00Deo/I0tEL0V2QZidPHclxr1fLxvSBbK+ngZRbjm+AVX/j
    siT2OAeWvkP4DRwoX6VwgdyqRqFhD7gO6z6BbirN8jbceg/v8miCU2hhwN6YJmwUU8cavk
    LFJL/7bFGNtQeZg6Gaq2vgSf4HlgC+dG9f6ic7C8NvQ+cgfQZfBqqlfzaU5IJahLJDNN8Q
    z5dcUdCosU59azq6/RgIp7/iewL1XY5PqrAxf3bZsomJfjURwPU0o05nDq7fCdTVSeWkeH
    A8yDWBjdRfoq8kH/7M8nnoSH3UZ24qGThJYfbff1VPX2Wv3ELQidJ2sRkMQNGmZavvfXAh
    F9bosKii1g8QnN+L3rx8srHMvfNP7AoTIzKJRXRbSWuKliRwWiUONYO9mJ+Vhoz/ZXl9Xh
    xMVOXS98KWID29vHKr4KhRrkk/aYU0DJ+YD71SfUINZVSe0Hks7mig03uk0qrHqFsRaeXK
    DQh0xiHxxcXTFBJX358IJd3GEYwHJQAAAAMBAAEAAAGAN6mt6ka0aftTpSyqp05cn14ji5
    ySptgd1XkyYeHd97ABfG7Vi/DAWwzChM3YBMPCs2xqij9ndTjzsouc048mDWBxVdIZLJb9
    Opapy4lUGfz4WuKUGEf8oouPxehxCqhc1gIIhkQqqyfVO0XRbNpGWs3LFukepenB1EAfmM
    XsJHlp0wzF1AU7tYI0WlY8plHlJ12ztsC4amS2i85GDwoFG6kAmAurwGOUBrar4Xm25Hv8
    zoUiBPSLTBMyVx2ZqgmKSYzH5SR9WYkeexMDbY4QDK++Gts0PNhPhErrLyAkjNwwOGr5dy
    iezHX20En8ZtyhsyCBEXINIwFfIVhsTQAXvMSiH5Cnx4aSSrXsOiA80jxkfGghm9tYCK4Y
    tsZsU5e6L6dLMcB/JW3C31OUFAOyAdBJ2NN7y7Coc7JLI4kEliRhB0qfav25HqnxIoomUc
    EqV3c8XSanM4XimW/ZpZHxwE/3keZPEj6SdTJhR9Bq5Ie1XsBAJ8CErNFsFKoSIPQ9AAAA
    wQCFdQQTW9Ds9kwIQFxZpdZuQzCsUHaxxBiaFEFrzByPNHFOfDg9EYdW5ODIcTr4wk8rYd
    I4X5r43VBtnVnqCnMQf9MzP/7RxbZuoG7UlOhk5oLEj8mRTORvnbJmpmqg82i2XoiLXxez
    0yjfzs2on0nlYYuzkDGTcBExNkbZsbCJ1PU7iZodTYibp4U1+DnSftJzXngCJklZmGI3Bp
    rh7IK3aLXAM0FNyy8MJNdJL4wUttdyzaJNLwlZatpBxAKeitMAAADBAPKwcXq7iu8pxoIA
    AaZ08MtP3whquAyaHofA1Xx/+OaALs9E711ns4mVzBRTjpAkJU264ZuZkzFyCv2nUFeObq
    Wa7s8eN0tqwomuqEu5EBlZ0BcES120/AToLRw+cAnfIrwnbwa00/qU6Xe/nUwqd4Cr4qdZ
    jBN16vmXdvE4sWOcUHOx1NPxAoSfmHyHlGMnm+sfEcUsCUegtWOJ1V/0GpddMxh1ozkrGY
    cI6oSAH7weZGU3uNwERcHiTNCniKziGwAAAMEA10BskuObViIVTttf0zI7IQzTbuaCWvz2
    TRGl9fTS+lAnMuPvAwxTwEMF64Y5RczkQVYJkBrQTaspzaULMwH56hgyregu7QB9A1MbOg
    P/VUK5Afbx1S/h23dEbx4nT+9NULIzqO2WqejU8g1QEEZMgKwvj/Ki60KV6LLzBdwxTVYI
    CELah/azaNFAmD4isHfIbexlDXSfR+7LvybXXs0uZgm70Aqp3NbLlVhKbDehEknAZhjNL4
    d5/zMYapWQ60+/AAAADHJvb3RAaG9yb3ZvZAECAwQFBg==
    -----END OPENSSH PRIVATE KEY-----
    ' > ~/.ssh/id_rsa
    # Set up passwordless login
    echo '
    ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDMD0JYInbH0ZGoG5tNA3qPyNLRC9FdkGYnTx3Jca9Xy8b0gWyvp4GUW45vgFV/47Ik9jgHlr5D+A0cKF+lcIHcqkahYQ+4Dus+gW4qzfI23HoP7/JoglNoYcDemCZsFFPHGr5CxSS/+2xRjbUHmYOhmqtr4En+B5YAvnRvX+onOwvDb0PnIH0GXwaqpX82lOSCWoSyQzTfEM+XXFHQqLFOfWs6uv0YCKe/4nsC9V2OT6qwMX922bKJiX41EcD1NKNOZw6u3wnU1UnlpHhwPMg1gY3UX6KvJB/+zPJ56Eh91GduKhk4SWH2339VT19lr9xC0InSdrEZDEDRpmWr731wIRfW6LCootYPEJzfi968fLKxzL3zT+wKEyMyiUV0W0lripYkcFolDjWDvZiflYaM/2V5fV4cTFTl0vfCliA9vbxyq+CoUa5JP2mFNAyfmA+9Un1CDWVUntB5LO5ooNN7pNKqx6hbEWnlyg0IdMYh8cXF0xQSV9+fCCXdxhGMByU= root@horovod
    ' > ~/.ssh/authorized_keys
    # Configure host aliases to make interconnection easier
    echo '
    Host WHYLenovo
     HostName 121.248.201.6
     Port 12345
    
    Host 154Server
     HostName 192.168.134.154
     Port 12345
    ' > ~/.ssh/config
    # Fix the permissions, otherwise passwordless login fails
    chmod 700 ~/.ssh
    chmod 600 ~/.ssh/id_rsa
    chmod 640 ~/.ssh/authorized_keys
    # Start the sshd service; press Ctrl + C afterwards to exit, the sshd process keeps running
    bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
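OpenSSH is strict about these modes (it refuses a private key readable by group or others). As a sanity check, here is a minimal, self-contained sketch of the expected permission bits, using a temp directory rather than the real ~/.ssh:

```python
import os
import stat
import tempfile

def mode_of(path: str) -> int:
    """Return just the permission bits of a path as an octal int (e.g. 0o600)."""
    return stat.S_IMODE(os.stat(path).st_mode)

# Reproduce the layout in a temp dir to illustrate the expected modes.
root = tempfile.mkdtemp()
ssh_dir = os.path.join(root, ".ssh")
os.mkdir(ssh_dir)
os.chmod(ssh_dir, 0o700)          # like `chmod 700 ~/.ssh`
key = os.path.join(ssh_dir, "id_rsa")
with open(key, "w") as f:
    f.write("key material\n")
os.chmod(key, 0o600)              # like `chmod 600 ~/.ssh/id_rsa`

assert mode_of(ssh_dir) == 0o700  # drwx------
assert mode_of(key) == 0o600      # -rw-------
```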
  4. Following the official steps, the next move is to run different commands on the master node and the worker nodes
    Master node: device 1 is used as the master here

    # The number after -np is the total GPU count; the number after each colon is how many GPUs to use on the host named before it
    horovodrun -np 3 -H 154Server:2,WHYLenovo:1 python ./pytorch/pytorch_mnist.py
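The relationship between -np and the -H host list can be sketched in plain Python (a toy parser for illustration only, not Horovod's real one):

```python
def parse_hosts(host_arg: str) -> dict:
    """Split an -H style 'host:slots,host:slots' string into a mapping."""
    hosts = {}
    for item in host_arg.split(","):
        name, slots = item.rsplit(":", 1)
        hosts[name] = int(slots)
    return hosts

hosts = parse_hosts("154Server:2,WHYLenovo:1")
# -np must equal the total number of slots across all hosts
assert hosts == {"154Server": 2, "WHYLenovo": 1}
assert sum(hosts.values()) == 3
```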

    Then it errored out hard, so I next tried driving a single remote GPU

    # --verbose prints more detail
    root@horovod:/horovod/examples# horovodrun --verbose --start-timeout 60 -np 1 -H 154Server:1 python ./pytorch/pytorch_mnist.py
    """
    Filtering local host names.
    Remote host found: 154Server
    Checking ssh on all remote hosts.
    SSH was successful into all the remote hosts.
    Testing interfaces on all the hosts.
    Launched horovod server.
    Launching horovod task function: ssh -o PasswordAuthentication=no -o StrictHostKeyChecking=no 154Server    /usr/bin/python -m horovod.runner.task_fn gAVLAC4= gAVLAS4= ...
    Attempted to launch horovod task servers.
    Waiting for the hosts to acknowledge.
    Traceback (most recent call last):
      File "/usr/local/bin/horovodrun", line 8, in <module>
     sys.exit(run_commandline())
      File "/usr/local/lib/python3.8/dist-packages/horovod/runner/launch.py", line 837, in run_commandline
     _run(args)
      File "/usr/local/lib/python3.8/dist-packages/horovod/runner/launch.py", line 827, in _run
     return _run_static(args)
      File "/usr/local/lib/python3.8/dist-packages/horovod/runner/launch.py", line 659, in _run_static
     nics = driver_service.get_common_interfaces(settings, all_host_names,
      File "/usr/local/lib/python3.8/dist-packages/horovod/runner/driver/driver_service.py", line 252, in get_common_interfaces
     nics = _driver_fn(all_host_names, local_host_names, settings, fn_cache=fn_cache)
      File "/usr/local/lib/python3.8/dist-packages/horovod/runner/util/cache.py", line 119, in wrap_f
     results = func(*args, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/horovod/runner/driver/driver_service.py", line 191, in _driver_fn
     return _run_probe(driver, settings, num_hosts)
      File "/usr/local/lib/python3.8/dist-packages/horovod/runner/driver/driver_service.py", line 126, in _run_probe
     driver.wait_for_initial_registration(settings.start_timeout)
      File "/usr/local/lib/python3.8/dist-packages/horovod/runner/common/service/driver_service.py", line 166, in wait_for_initial_registration
     timeout.check_time_out_for('tasks to start')
      File "/usr/local/lib/python3.8/dist-packages/horovod/runner/common/util/timeout.py", line 39, in check_time_out_for
     raise TimeoutException(
    horovod.runner.common.util.timeout.TimeoutException: Timed out waiting for tasks to start. Please check connectivity between servers. You may need to increase the --start-timeout parameter if you have too many servers. Timeout after 60 seconds.
    """

    A similar issue on the official tracker seems to say the NICs on the two machines have different names??? Commenters found that driving a remote GPU can be done with mpirun instead

    # -mca btl_tcp_if_include: commenters suggested adding this flag, but in practice adding it caused errors here
    mpirun --allow-run-as-root --verbose -np 1 -H 154Server:1 python ./pytorch/pytorch_mnist.py

    This did successfully use the remote GPU, but trying to use remote and local GPUs together still errored out and would not run

Summary: no luck. At best, Horovod can drive local single-node multi-GPU and mpirun can drive remote single-node multi-GPU, but multi-node multi-GPU simply would not work

Testing the official Ignite image

$\textcolor{red}{\text{Successfully}}$ ran local single-node multi-GPU, remote single-node multi-GPU, and multi-node multi-GPU!

  1. Launch the container on both machines

    # Set shm-size according to your RAM; without it the default allocation is tiny and you run out of memory very easily
    docker run --gpus all -it --network=host --name ignite --hostname ignite --shm-size 4G --rm pytorchignite/base:latest /bin/bash
  2. First run the program in both containers, to test the setup and download the dataset

    torchrun --nnodes=1 --nproc_per_node=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=2222 pytorch-ignite-examples/examples/cifar10/main.py run --backend="nccl"
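torchrun exposes the topology to each worker process through environment variables (RANK, LOCAL_RANK, WORLD_SIZE, among others), which is how main.py knows which GPU to use. A minimal sketch, with a hand-built dict simulating the environment torchrun would set for the single-process run above:

```python
def topology_from_env(env: dict) -> tuple:
    """Read (rank, local_rank, world_size) the way a worker script would."""
    return int(env["RANK"]), int(env["LOCAL_RANK"]), int(env["WORLD_SIZE"])

# Simulated env for --nnodes=1 --nproc_per_node=1 --node_rank=0
fake_env = {"RANK": "0", "LOCAL_RANK": "0", "WORLD_SIZE": "1"}
rank, local_rank, world_size = topology_from_env(fake_env)
assert (rank, local_rank, world_size) == (0, 0, 1)
```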
  3. (This should error out and fail to run) Run the commands on the master node (device 1) and the secondary node (device 2)

    # Run on the master node
    torchrun --nnodes=2 --nproc_per_node=2 --node_rank=0 --master_addr=127.0.0.1 --master_port=2222 pytorch-ignite-examples/examples/cifar10/main.py run --backend="nccl"
    
    # Run on the secondary node
    torchrun --nnodes=2 --nproc_per_node=1 --node_rank=1 --master_addr=192.168.134.154 --master_port=2222 pytorch-ignite-examples/examples/cifar10/main.py run --backend="nccl"
    
    # --nnodes: total number of machines
    # --nproc_per_node: number of GPUs on the current machine (2 on device 1, 1 on device 2)
    # --node_rank: index of the current machine; the master is best set to 0, the others just need unique indices
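The three flags above determine each worker's global rank. One natural contiguous assignment (an illustration only, not torchrun's actual rendezvous logic) looks like this:

```python
def global_ranks(nnodes: int, procs_per_node: list) -> dict:
    """Map node index -> list of global ranks, given per-node process counts."""
    assert len(procs_per_node) == nnodes
    ranks, nxt = {}, 0
    for node in range(nnodes):
        ranks[node] = list(range(nxt, nxt + procs_per_node[node]))
        nxt += procs_per_node[node]
    return ranks

# Device 1 (node_rank=0) with 2 GPUs, device 2 (node_rank=1) with 1 GPU:
layout = global_ranks(2, [2, 1])
assert layout == {0: [0, 1], 1: [2]}   # world_size == 3
```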

    It fails to run, with this error:

    [rank2]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
    [rank2]: Last error:
    [rank2]: socketStartConnect: Connect to fe80::4fa:a5ff:fe23:a740%virbr0<54173> failed : Network is unreachable

    The cause is that no NIC was specified for NCCL, which even wandered off to IPv6???

  4. Configure the NIC; this has to be done on both machines

    # Install ip a to look at the NICs. Strictly optional: since the container uses --network=host, you can just check which NIC the host uses
    apt update && apt install iproute2 -y
    ip a
    # Show detailed error information
    export NCCL_DEBUG=INFO
    # Manually specify the NIC used for communication
    export NCCL_SOCKET_IFNAME=eno1
    export GLOO_SOCKET_IFNAME=eno1
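The NIC name (eno1 here) varies between machines. Even without installing iproute2, the Python stdlib can already list candidate interfaces; a small sketch (pick whichever NIC actually carries traffic between the hosts):

```python
import socket

def list_interfaces() -> list:
    """Return the names of all network interfaces visible to this host."""
    return [name for _, name in socket.if_nameindex()]

ifaces = list_interfaces()
assert len(ifaces) >= 1    # at least the loopback device exists
print(ifaces)              # e.g. ['lo', 'eno1', 'virbr0']
```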

    Re-run the commands from step 3; this time it runs successfully!

Summary: multi-node multi-GPU does work, but the container needs --network=host, which I don't like. If one physical machine runs several containers like this, won't the networking become a mess?

Manually built Ignite image

The answer is WireGuard: I added WireGuard inside the container.
WireGuard is a virtual overlay-network tool. With WireGuard networking inside the container, you don't even need to map any ports to the container. Great!
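For the overlay to work, both peers' addresses (10.3.108.2 and 10.3.108.3 from the table above) must land in the same WireGuard subnet. Assuming a /24 prefix (my assumption; the actual prefix depends on your wg0.conf), that can be checked with the stdlib:

```python
import ipaddress

# Overlay addresses from the device table above; the /24 prefix is an assumption.
subnet = ipaddress.ip_network("10.3.108.0/24")
peers = [ipaddress.ip_address("10.3.108.2"), ipaddress.ip_address("10.3.108.3")]

# Both peers must sit inside the same overlay subnet to reach each other over wg0.
assert all(addr in subnet for addr in peers)
```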

The image is built from ubuntu:20.04 (every generation has its own Win7)

Build and test process

Dockerfile

# Base image
FROM ubuntu:20.04
# UTF-8 (Chinese) support
ENV LANG="C.UTF-8"
# Set the time zone, in case some software needs a definite time zone
ENV TZ=Asia/Shanghai
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && \
    echo "$TZ" > /etc/timezone

# Install cmake
COPY ./depend_file/cmake-3.28.3.tar.gz \
     install_cmake \
     /root/
RUN cat /root/install_cmake | bash && \
    rm -rf /root/install_cmake

# Depends on cmake
# Install Python 3.10
COPY ./depend_file/Python-3.10.0.tgz \
     install_python \
     write_python \
     /root/
RUN cat /root/install_python | bash && \
    rm -rf /root/install_python

# Install CUDA; only works on Ubuntu 20.04
COPY ./depend_file/cuda-ubuntu2004.pin \
     ./depend_file/cuda-repo-ubuntu2004-12-8-local_12.8.1-570.124.06-1_amd64.deb \
     ./install_cuda \
     ./write_cuda \
     /root/
RUN cat /root/install_cuda | bash && \
    rm -rf /root/install_cuda

# Depends on CUDA, but has no OS dependency
# Install cuDNN
COPY ./depend_file/cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz \
     install_cudnn \
     /root/
RUN cat /root/install_cudnn | bash && \
    rm -rf /root/install_cudnn

# Install NCCL; apparently cmake is not actually needed here?
COPY ./depend_file/nccl_git.tar.gz \
     install_nccl \
     write_nccl \
     /root/
RUN cat /root/install_nccl | bash && \
    rm -rf /root/install_nccl

# Install torch
COPY ./depend_file/torch_whl \
     install_torch \
     /root/
RUN cat /root/install_torch | bash && \
    rm -rf /root/install_torch

# Depends on torch
# Install ignite
COPY ./depend_file/ignite_whl \
     install_ignite \
     /root/
RUN cat /root/install_ignite | bash && \
    rm -rf /root/install_ignite

# Install ssh and set up its auto-start script
COPY install_ssh write_ssh /root/
RUN cat /root/install_ssh | bash && \
    rm -rf /root/install_ssh

# Install wireguard
COPY install_wireguard \
     write_wireguard \
     /root/
RUN cat /root/install_wireguard | bash && \
    rm -rf /root/install_wireguard

  1. Launch the container on device 1 and device 2 respectively

    # Start the container, granting it permission to use the kernel's networking facilities
    docker run \
    --cap-add NET_ADMIN \
    --gpus all \
    --name ignite_distribute \
    --hostname node1 \
    --restart always \
    -v /mnt/path:/etc/wireguard \
    -it ubuntu:build250319
    # WireGuard communicates through the kernel, so this flag is needed to let the container run network-admin commands
    
    --cap-add  list # grant the listed capabilities
    --cap-drop list # drop the listed capabilities
  2. Initialize the configuration in both containers, pointing communication at the WireGuard NIC

    export NCCL_SOCKET_IFNAME=wg0
    export GLOO_SOCKET_IFNAME=wg0
  3. Run the program in both containers, to test the setup and download the dataset

    torchrun --nnodes=1 --nproc_per_node=1 --node_rank=0 --master_addr=10.3.108.2 --master_port=2222 pytorch-ignite-examples/examples/cifar10/main.py run --backend="nccl"
  4. Run the two containers' GPUs together, with node1 as the master node

    # Run on the master node
    torchrun --nnodes=2 --nproc_per_node=2 --node_rank=0 --master_addr=10.3.108.2 --master_port=2222 pytorch-ignite-examples/examples/cifar10/main.py run --backend="nccl"
    
    # Run on the secondary node
    torchrun --nnodes=2 --nproc_per_node=1 --node_rank=1 --master_addr=10.3.108.2 --master_port=2222 pytorch-ignite-examples/examples/cifar10/main.py run --backend="nccl"

    And this time it runs successfully!

Last modified: March 20, 2025
赛博讨口子