## Preface
This post records my attempts at distributed training with the official Horovod image, the official ignite image, and an ignite image I built by hand.
The experiments use two machines:
| | Device 1 | Device 2 |
|---|---|---|
| IP address | 192.168.134.154 | 121.248.201.6 |
| Overlay network address | 10.3.108.2 | 10.3.108.3 |
| GPU count and model | 2 × 3090 24 GiB | 1 × 3070 Laptop 8 GiB |
| Alias | 154Server | WHYLenovo |
## Horovod official image test
Only local single-node multi-GPU (horovod) and remote single-node multi-GPU (mpirun) work; multi-node multi-GPU $\textcolor{red}{\text{fails to run}}$!
Launch the container on both machines with the following command:

```bash
docker run \
    --gpus all \
    --network=host \
    --name horovod \
    --hostname horovod \
    -it --rm horovod/horovod
```

First run the program below once on both machines: partly to check that the GPUs can actually be used, and partly to download the dataset.
```bash
# To use both GPUs on device 1, change localhost:1 to localhost:2 when running this command there
horovodrun -np 1 -H localhost:1 python ./pytorch/pytorch_mnist.py
```
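For reference, `pytorch_mnist.py` follows the standard Horovod PyTorch pattern. A minimal sketch of that pattern, assuming Horovod is installed with the PyTorch backend (this is not the actual example script, just the same structure on toy data):

```python
# Sketch of the usual Horovod PyTorch setup: toy model and data,
# not the real pytorch_mnist.py shipped in the image.
import torch
import torch.nn as nn
import horovod.torch as hvd

hvd.init()                               # one process per GPU
torch.cuda.set_device(hvd.local_rank())  # pin this process to its GPU

model = nn.Linear(784, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Average gradients across ranks and start all ranks from identical state
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(10):
    x = torch.randn(32, 784).cuda()
    y = torch.randint(0, 10, (32,)).cuda()
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    if hvd.rank() == 0:
        print(f"step {step} loss {loss.item():.4f}")
```

Next, set up passwordless SSH in both directions between the two machines so `horovodrun` can reach the remote hosts: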
```bash
mkdir ~/.ssh
# Create the private key. For convenience I reuse a previously generated key here;
# in production, generate a fresh key pair to stay secure
echo '
-----BEGIN OPENSSH PRIVATE KEY-----
b3BlbnNzaC1rZXktdjEAAAAABG5vbmUAAAAEbm9uZQAAAAAAAAABAAABlwAAAAdzc2gtcn
NhAAAAAwEAAQAAAYEAzA9CWCJ2x9GRqBubTQN6j8jS0QvRXZBmJ08dyXGvV8vG9IFsr6eB
lFuOb4BVf+OyJPY4B5a+Q/gNHChfpXCB3KpGoWEPuA7rPoFuKs3yNtx6D+/yaIJTaGHA3p
gmbBRTxxq+QsUkv/tsUY21B5mDoZqra+BJ/geWAL50b1/qJzsLw29D5yB9Bl8GqqV/NpTk
glqEskM03xDPl1xR0KixTn1rOrr9GAinv+J7AvVdjk+qsDF/dtmyiYl+NRHA9TSjTmcOrt
8J1NVJ5aR4cDzINYGN1F+iryQf/szyeehIfdRnbioZOElh9t9/VU9fZa/cQtCJ0naxGQxA
0aZlq+99cCEX1uiwqKLWDxCc34vevHyyscy980/sChMjMolFdFtJa4qWJHBaJQ41g72Yn5
WGjP9leX1eHExU5dL3wpYgPb28cqvgqFGuST9phTQMn5gPvVJ9Qg1lVJ7QeSzuaKDTe6TS
qseoWxFp5coNCHTGIfHFxdMUElffnwgl3cYRjAclAAAFiGq6Q4VqukOFAAAAB3NzaC1yc2
EAAAGBAMwPQlgidsfRkagbm00Deo/I0tEL0V2QZidPHclxr1fLxvSBbK+ngZRbjm+AVX/j
siT2OAeWvkP4DRwoX6VwgdyqRqFhD7gO6z6BbirN8jbceg/v8miCU2hhwN6YJmwUU8cavk
LFJL/7bFGNtQeZg6Gaq2vgSf4HlgC+dG9f6ic7C8NvQ+cgfQZfBqqlfzaU5IJahLJDNN8Q
z5dcUdCosU59azq6/RgIp7/iewL1XY5PqrAxf3bZsomJfjURwPU0o05nDq7fCdTVSeWkeH
A8yDWBjdRfoq8kH/7M8nnoSH3UZ24qGThJYfbff1VPX2Wv3ELQidJ2sRkMQNGmZavvfXAh
F9bosKii1g8QnN+L3rx8srHMvfNP7AoTIzKJRXRbSWuKliRwWiUONYO9mJ+Vhoz/ZXl9Xh
xMVOXS98KWID29vHKr4KhRrkk/aYU0DJ+YD71SfUINZVSe0Hks7mig03uk0qrHqFsRaeXK
DQh0xiHxxcXTFBJX358IJd3GEYwHJQAAAAMBAAEAAAGAN6mt6ka0aftTpSyqp05cn14ji5
ySptgd1XkyYeHd97ABfG7Vi/DAWwzChM3YBMPCs2xqij9ndTjzsouc048mDWBxVdIZLJb9
Opapy4lUGfz4WuKUGEf8oouPxehxCqhc1gIIhkQqqyfVO0XRbNpGWs3LFukepenB1EAfmM
XsJHlp0wzF1AU7tYI0WlY8plHlJ12ztsC4amS2i85GDwoFG6kAmAurwGOUBrar4Xm25Hv8
zoUiBPSLTBMyVx2ZqgmKSYzH5SR9WYkeexMDbY4QDK++Gts0PNhPhErrLyAkjNwwOGr5dy
iezHX20En8ZtyhsyCBEXINIwFfIVhsTQAXvMSiH5Cnx4aSSrXsOiA80jxkfGghm9tYCK4Y
tsZsU5e6L6dLMcB/JW3C31OUFAOyAdBJ2NN7y7Coc7JLI4kEliRhB0qfav25HqnxIoomUc
EqV3c8XSanM4XimW/ZpZHxwE/3keZPEj6SdTJhR9Bq5Ie1XsBAJ8CErNFsFKoSIPQ9AAAA
wQCFdQQTW9Ds9kwIQFxZpdZuQzCsUHaxxBiaFEFrzByPNHFOfDg9EYdW5ODIcTr4wk8rYd
I4X5r43VBtnVnqCnMQf9MzP/7RxbZuoG7UlOhk5oLEj8mRTORvnbJmpmqg82i2XoiLXxez
0yjfzs2on0nlYYuzkDGTcBExNkbZsbCJ1PU7iZodTYibp4U1+DnSftJzXngCJklZmGI3Bp
rh7IK3aLXAM0FNyy8MJNdJL4wUttdyzaJNLwlZatpBxAKeitMAAADBAPKwcXq7iu8pxoIA
AaZ08MtP3whquAyaHofA1Xx/+OaALs9E711ns4mVzBRTjpAkJU264ZuZkzFyCv2nUFeObq
Wa7s8eN0tqwomuqEu5EBlZ0BcES120/AToLRw+cAnfIrwnbwa00/qU6Xe/nUwqd4Cr4qdZ
jBN16vmXdvE4sWOcUHOx1NPxAoSfmHyHlGMnm+sfEcUsCUegtWOJ1V/0GpddMxh1ozkrGY
cI6oSAH7weZGU3uNwERcHiTNCniKziGwAAAMEA10BskuObViIVTttf0zI7IQzTbuaCWvz2
TRGl9fTS+lAnMuPvAwxTwEMF64Y5RczkQVYJkBrQTaspzaULMwH56hgyregu7QB9A1MbOg
P/VUK5Afbx1S/h23dEbx4nT+9NULIzqO2WqejU8g1QEEZMgKwvj/Ki60KV6LLzBdwxTVYI
CELah/azaNFAmD4isHfIbexlDXSfR+7LvybXXs0uZgm70Aqp3NbLlVhKbDehEknAZhjNL4
d5/zMYapWQ60+/AAAADHJvb3RAaG9yb3ZvZAECAwQFBg==
-----END OPENSSH PRIVATE KEY-----
' > ~/.ssh/id_rsa
# Set up passwordless login
echo '
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDMD0JYInbH0ZGoG5tNA3qPyNLRC9FdkGYnTx3Jca9Xy8b0gWyvp4GUW45vgFV/47Ik9jgHlr5D+A0cKF+lcIHcqkahYQ+4Dus+gW4qzfI23HoP7/JoglNoYcDemCZsFFPHGr5CxSS/+2xRjbUHmYOhmqtr4En+B5YAvnRvX+onOwvDb0PnIH0GXwaqpX82lOSCWoSyQzTfEM+XXFHQqLFOfWs6uv0YCKe/4nsC9V2OT6qwMX922bKJiX41EcD1NKNOZw6u3wnU1UnlpHhwPMg1gY3UX6KvJB/+zPJ56Eh91GduKhk4SWH2339VT19lr9xC0InSdrEZDEDRpmWr731wIRfW6LCootYPEJzfi968fLKxzL3zT+wKEyMyiUV0W0lripYkcFolDjWDvZiflYaM/2V5fV4cTFTl0vfCliA9vbxyq+CoUa5JP2mFNAyfmA+9Un1CDWVUntB5LO5ooNN7pNKqx6hbEWnlyg0IdMYh8cXF0xQSV9+fCCXdxhGMByU= root@horovod
' > ~/.ssh/authorized_keys
# Configure host aliases to make connecting easier
echo '
Host WHYLenovo
    HostName 121.248.201.6
    Port 12345

Host 154Server
    HostName 192.168.134.154
    Port 12345
' > ~/.ssh/config
# Fix permissions, otherwise passwordless login fails
chmod 700 ~/.ssh
chmod 600 ~/.ssh/id_rsa
chmod 640 ~/.ssh/authorized_keys
# Start sshd. Press Ctrl + C afterwards to get the shell back; sshd keeps running
bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
```

Following the official steps, the next move is to run different commands on the head node and the worker nodes.
Head node: device 1 serves as the head node here.

```bash
# The number after -np is the total GPU count; the number after each colon
# is how many of that host's GPUs to use
horovodrun -np 3 -H 154Server:2,WHYLenovo:1 python ./pytorch/pytorch_mnist.py
```

This blew up spectacularly, so the next attempt was to drive just a single remote GPU:
```bash
# --verbose shows more detail
root@horovod:/horovod/examples# horovodrun --verbose --start-timeout 60 -np 1 -H 154Server:1 python ./pytorch/pytorch_mnist.py
"""
Filtering local host names.
Remote host found: 154Server
Checking ssh on all remote hosts.
SSH was successful into all the remote hosts.
Testing interfaces on all the hosts.
Launched horovod server.
Launching horovod task function: ssh -o PasswordAuthentication=no -o StrictHostKeyChecking=no 154Server /usr/bin/python -m horovod.runner.task_fn gAVLAC4= gAVLAS4= ...
Attempted to launch horovod task servers.
Waiting for the hosts to acknowledge.
Traceback (most recent call last):
  File "/usr/local/bin/horovodrun", line 8, in <module>
    sys.exit(run_commandline())
  File "/usr/local/lib/python3.8/dist-packages/horovod/runner/launch.py", line 837, in run_commandline
    _run(args)
  File "/usr/local/lib/python3.8/dist-packages/horovod/runner/launch.py", line 827, in _run
    return _run_static(args)
  File "/usr/local/lib/python3.8/dist-packages/horovod/runner/launch.py", line 659, in _run_static
    nics = driver_service.get_common_interfaces(settings, all_host_names,
  File "/usr/local/lib/python3.8/dist-packages/horovod/runner/driver/driver_service.py", line 252, in get_common_interfaces
    nics = _driver_fn(all_host_names, local_host_names, settings, fn_cache=fn_cache)
  File "/usr/local/lib/python3.8/dist-packages/horovod/runner/util/cache.py", line 119, in wrap_f
    results = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/horovod/runner/driver/driver_service.py", line 191, in _driver_fn
    return _run_probe(driver, settings, num_hosts)
  File "/usr/local/lib/python3.8/dist-packages/horovod/runner/driver/driver_service.py", line 126, in _run_probe
    driver.wait_for_initial_registration(settings.start_timeout)
  File "/usr/local/lib/python3.8/dist-packages/horovod/runner/common/service/driver_service.py", line 166, in wait_for_initial_registration
    timeout.check_time_out_for('tasks to start')
  File "/usr/local/lib/python3.8/dist-packages/horovod/runner/common/util/timeout.py", line 39, in check_time_out_for
    raise TimeoutException(
horovod.runner.common.util.timeout.TimeoutException: Timed out waiting for tasks to start. Please check connectivity between servers. You may need to increase the --start-timeout parameter if you have too many servers. Timeout after 60 seconds.
"""
```

A similar issue on the official tracker seems to say the problem is that the NICs the two machines communicate over have different names(?). Some users then found that a remote GPU can still be driven via mpirun.
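Since the suspicion is mismatched NIC names, a quick way to check is to list every interface and its IPv4 address on both hosts and compare. A small Linux-only Python sketch, not from the original post (`SIOCGIFADDR` is the standard ioctl for reading an interface's address):

```python
# List NIC names and IPv4 addresses, to see whether the two hosts
# expose differently named interfaces (Linux only).
import fcntl
import socket
import struct

SIOCGIFADDR = 0x8915  # ioctl: get an interface's IPv4 address

def ipv4_of(ifname: str) -> str:
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        packed = fcntl.ioctl(
            s.fileno(), SIOCGIFADDR,
            struct.pack("256s", ifname[:15].encode()))
        return socket.inet_ntoa(packed[20:24])
    except OSError:
        return "no IPv4 address"
    finally:
        s.close()

for _, name in socket.if_nameindex():
    print(f"{name}: {ipv4_of(name)}")
```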
```bash
# Users suggested also passing -mca btl_tcp_if_include, but in practice adding it caused errors
mpirun --allow-run-as-root --verbose -np 1 -H 154Server:1 python ./pytorch/pytorch_mnist.py
```

This did successfully use the remote GPU, but as soon as I tried to use remote and local GPUs at the same time, it errored out again.
Summary: no luck. At best, Horovod can drive local single-node multi-GPU and mpirun can drive remote single-node multi-GPU, but multi-node multi-GPU is a complete dead end here.
## ignite official image test
Local single-node multi-GPU, remote single-node multi-GPU, and multi-node multi-GPU all completed $\textcolor{red}{\text{successfully}}$!
Launch the container on both machines:
```bash
# Set shm-size according to how much RAM you have; the default shared memory
# allocation is tiny and it is very easy to run out
docker run --gpus all -it --network=host --name ignite --hostname ignite --shm-size 4G --rm pytorchignite/base:latest /bin/bash
```

First run the program in both containers, to test the setup and download the dataset:
```bash
torchrun --nnodes=1 --nproc_per_node=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=2222 pytorch-ignite-examples/examples/cifar10/main.py run --backend="nccl"
```

(Expect this to error out and not run.) Then run the commands on the head node (device 1) and the worker node (device 2):
```bash
# Run on the head node
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=2222 pytorch-ignite-examples/examples/cifar10/main.py run --backend="nccl"
# Run on the worker node
torchrun --nnodes=2 --nproc_per_node=2 --node_rank=1 --master_addr=192.168.134.154 --master_port=2222 pytorch-ignite-examples/examples/cifar10/main.py run --backend="nccl"

# --nnodes          total number of machines
# --nproc_per_node  number of GPUs on the current machine
# --node_rank       index of the current machine; the head node should be 0,
#                   and the other nodes just need unique indices
```

This does not run; it errors out:
```bash
[rank2]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank2]: Last error:
[rank2]: socketStartConnect: Connect to fe80::4fa:a5ff:fe23:a740%virbr0<54173> failed : Network is unreachable
```

The cause is that NCCL was never told which NIC to use; it even went off and picked an IPv6 link-local address on the virbr0 bridge(?!).
Configure the NIC. This has to be done on both machines:
```bash
# Install iproute2 to inspect the NICs with ip a. Strictly optional: since the
# container uses --network=host, you can just check which NIC the host uses
apt update && apt install iproute2 -y
ip a
# Show NCCL debug output
export NCCL_DEBUG=INFO
# Manually pin the communication NIC
export NCCL_SOCKET_IFNAME=eno1
export GLOO_SOCKET_IFNAME=eno1
```

Rerun the two torchrun commands above, and this time they succeed!
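As an optional sanity check for future debugging, a tiny `torch.distributed` smoke test confirms that the two containers can actually rendezvous over the pinned NIC before a full training run. A sketch, launched with the same `torchrun` flags as above but pointing at this script instead of `main.py` (the filename `smoke_test.py` is made up); it uses the `gloo` backend, so the `GLOO_SOCKET_IFNAME` exported above decides which NIC carries the traffic:

```python
# smoke_test.py: minimal cross-node connectivity check.
# torchrun supplies MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE via env vars,
# so init_process_group needs nothing beyond the backend name.
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")
rank = dist.get_rank()
world = dist.get_world_size()

t = torch.ones(1) * rank
dist.all_reduce(t, op=dist.ReduceOp.SUM)  # sum of 0..world-1 on every rank
print(f"rank {rank}/{world}: all_reduce -> {t.item()} "
      f"(expected {world * (world - 1) // 2})")
dist.destroy_process_group()
```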
Summary: multi-node multi-GPU really does work, but the container needs --network=host, which I don't like. Run a few containers like this on one physical machine and the networking turns into a mess, doesn't it?
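For context on what the `torchrun` flags above actually feed into the script: `torchrun` exports `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` into each worker's environment, and a DDP-style script consumes them roughly like this (a generic sketch, not the ignite cifar10 `main.py`):

```python
# Generic pattern behind a torchrun-launched NCCL job; not the actual
# ignite example, just what the flags above translate into.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = int(os.environ["LOCAL_RANK"])  # set per process by torchrun
dist.init_process_group(backend="nccl")     # reads MASTER_ADDR/PORT, RANK, WORLD_SIZE
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(32, 2).cuda()
model = DDP(model, device_ids=[local_rank])  # gradient sync over NCCL

x = torch.randn(8, 32).cuda()
model(x).sum().backward()                    # the all-reduce fires here
if dist.get_rank() == 0:
    print("DDP step done across", dist.get_world_size(), "processes")
dist.destroy_process_group()
```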
## ignite with a manually built image
The answer is WireGuard: I baked WireGuard into the container.
WireGuard is a virtual overlay-network tool. With WireGuard doing the networking inside the containers, you don't even have to map a single port on the container. Excellent!
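The post does not show the actual WireGuard configuration, but for illustration, a `wg0.conf` on device 1 matching the overlay addresses from the table above might look like the sketch below. The keys and the listen/endpoint port are placeholders; real keys come from `wg genkey` / `wg pubkey`:

```ini
# Hypothetical /etc/wireguard/wg0.conf on device 1 (10.3.108.2);
# device 2 mirrors it with addresses and keys swapped.
[Interface]
Address = 10.3.108.2/24
PrivateKey = <device1-private-key>
ListenPort = 51820

[Peer]
# device 2 (10.3.108.3)
PublicKey = <device2-public-key>
Endpoint = 121.248.201.6:51820
AllowedIPs = 10.3.108.3/32
```

The interface is then brought up inside the container with `wg-quick up wg0`, which is why the container later needs the `NET_ADMIN` capability.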
The image is built from ubuntu:20.04 (every generation has its own Windows 7).
### Build and test process
```dockerfile
# Base image
FROM ubuntu:20.04

# Chinese language support
ENV LANG="C.UTF-8"

# Set the time zone, in case some software needs one
ENV TZ=Asia/Shanghai
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && \
    echo "$TZ" > /etc/timezone

# Install cmake
COPY ./depend_file/cmake-3.28.3.tar.gz \
     install_cmake \
     /root/
RUN cat /root/install_cmake | bash && \
    rm -rf /root/install_cmake

# Depends on cmake
# Install python3.10
COPY ./depend_file/Python-3.10.0.tgz \
     install_python \
     write_python \
     /root/
RUN cat /root/install_python | bash && \
    rm -rf /root/install_python

# Install CUDA; this only works on ubuntu 20.04
COPY ./depend_file/cuda-ubuntu2004.pin \
     ./depend_file/cuda-repo-ubuntu2004-12-8-local_12.8.1-570.124.06-1_amd64.deb \
     ./install_cuda \
     ./write_cuda \
     /root/
RUN cat /root/install_cuda | bash && \
    rm -rf /root/install_cuda

# Depends on CUDA, but has no OS dependency
# Install cudnn
COPY ./depend_file/cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz \
     install_cudnn \
     /root/
RUN cat /root/install_cudnn | bash && \
    rm -rf /root/install_cudnn

# Install nccl; apparently cmake is not needed here?
COPY ./depend_file/nccl_git.tar.gz \
     install_nccl \
     write_nccl \
     /root/
RUN cat /root/install_nccl | bash && \
    rm -rf /root/install_nccl

# Install torch
COPY ./depend_file/torch_whl \
     install_torch \
     /root/
RUN cat /root/install_torch | bash && \
    rm -rf /root/install_torch

# Depends on torch
# Install ignite
COPY ./depend_file/ignite_whl \
     install_ignite \
     /root/
RUN cat /root/install_ignite | bash && \
    rm -rf /root/install_ignite

# Install ssh and set up an auto-start script
COPY install_ssh write_ssh /root/
RUN cat /root/install_ssh | bash && \
    rm -rf /root/install_ssh

# Install wireguard
COPY install_wireguard \
     write_wireguard \
     /root/
RUN cat /root/install_wireguard | bash && \
    rm -rf /root/install_wireguard
```

Launch the container on device 1 and device 2:
```bash
# Start the container, granting it access to the kernel networking stack
docker run \
    --cap-add NET_ADMIN \
    --gpus all \
    --name ignite_distribute \
    --hostname node1 \
    --restart always \
    -v /mnt/path:/etc/wireguard \
    -it ubuntu:build250319

# WireGuard talks to the kernel, so the container must be allowed to run
# network-related admin operations:
#   --cap-add list   # grant the listed capabilities
#   --cap-drop list  # drop the listed capabilities
```

Initialize the configuration in both containers, pinning communication to the WireGuard NIC:
```bash
export NCCL_SOCKET_IFNAME=wg0
export GLOO_SOCKET_IFNAME=wg0
```

Run the program in both containers, to test the setup and download the dataset:
```bash
torchrun --nnodes=1 --nproc_per_node=1 --node_rank=0 --master_addr=10.3.108.2 --master_port=2222 pytorch-ignite-examples/examples/cifar10/main.py run --backend="nccl"
```

Now link the GPUs in the two containers together, with node1 as the head node:
```bash
# Run on the head node
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 --master_addr=10.3.108.2 --master_port=2222 pytorch-ignite-examples/examples/cifar10/main.py run --backend="nccl"
# Run on the worker node
torchrun --nnodes=2 --nproc_per_node=2 --node_rank=1 --master_addr=10.3.108.2 --master_port=2222 pytorch-ignite-examples/examples/cifar10/main.py run --backend="nccl"
```

And this time it runs successfully!