PyTorch installed via conda pins the affinity of forked processes to core 0

April 09, 2024

Problem description

For background on what affinity is, see Process Affinity.

PyTorch consults the process's CPU affinity, so in an affected environment this warning shows up frequently:

This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 1, which is smaller than what this DataLoader is going to create.
Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.

On top of that, even with num_workers set above 1, the CPU still uses only one core, and GPU utilization stays stuck below 20%.
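
A quick way to confirm the symptom is to compare the set of CPUs the process is allowed to run on with the machine's CPU count. The sketch below assumes Linux, where `os.sched_getaffinity` is available; `affinity_report` is a hypothetical helper name, not part of PyTorch. As far as I can tell, the DataLoader derives its "suggested max number of worker" from the size of this same affinity set, which would explain why it reports 1.

```python
import os


def affinity_report():
    """Return the CPUs this process may run on, next to the total CPU count."""
    total = os.cpu_count() or 1
    if hasattr(os, "sched_getaffinity"):
        # pid 0 means "the calling process" (Linux only)
        allowed = sorted(os.sched_getaffinity(0))
    else:
        # No affinity API on this platform: assume every CPU is usable
        allowed = list(range(total))
    return {
        "allowed_cpus": allowed,
        "total_cpus": total,
        "restricted": len(allowed) < total,
    }


if __name__ == "__main__":
    print(affinity_report())
```

Calling it once in the training process and once inside a DataLoader worker should show where the allowed set shrinks to a single core. Note that a container with a CPU limit can also report `restricted` for perfectly legitimate reasons.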

Workaround

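In short, LLVM's OpenMP implementation causes the process affinity to be set to the first core only (i.e. core 0). Possible workarounds:

export KMP_AFFINITY=disabled, so that MKL (through its OpenMP runtime) does not apply any affinity setting
Downgrade llvm-openmp to a version below 16 to avoid the incorrectly set affinity
Switching to intel-openmp apparently also solves it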
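
If exporting the variable in the shell is inconvenient, the first workaround can also be applied from Python. This is only a sketch, under the assumption that the OpenMP runtime reads `KMP_AFFINITY` when it is first loaded, so the variable has to be set before `import torch`; `disable_kmp_affinity` is a hypothetical helper name.

```python
import os


def disable_kmp_affinity():
    """Set KMP_AFFINITY=disabled unless the user already chose a value."""
    # setdefault leaves an explicit setting from the shell untouched
    return os.environ.setdefault("KMP_AFFINITY", "disabled")


if __name__ == "__main__":
    print("KMP_AFFINITY =", disable_kmp_affinity())
    # Import torch only after this point, so the OpenMP runtime
    # (assumed to read the variable at load time) sees the setting.
    # import torch
```

Using `setdefault` rather than a plain assignment is a design choice: it keeps the script from silently overriding a value someone set on purpose.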

According to the referenced issue, you can also reset the affinity inside each worker:

import os

def worker_init_fn(worker_id):
    # pid 0 means the calling process, i.e. this worker; allow it on every CPU again
    os.sched_setaffinity(0, range(os.cpu_count()))
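
For context, here is how such a hook might be wired into a DataLoader. This is a minimal sketch assuming Linux and an installed PyTorch; `reset_affinity` and `build_loader` are hypothetical names, and the `TensorDataset` is a made-up toy dataset. `worker_init_fn` is the DataLoader parameter that gets called once in each worker process with the worker id.

```python
import os


def reset_affinity(worker_id):
    """worker_init_fn: let this worker run on every CPU again."""
    if hasattr(os, "sched_setaffinity"):  # Linux only
        # pid 0 means "the calling process", i.e. this worker
        os.sched_setaffinity(0, range(os.cpu_count() or 1))


def build_loader(num_workers=8):
    # Imported lazily so the affinity helper works without PyTorch installed
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    dataset = TensorDataset(torch.arange(1024, dtype=torch.float32))  # toy data
    return DataLoader(
        dataset,
        batch_size=32,
        num_workers=num_workers,
        worker_init_fn=reset_affinity,  # called once in each worker process
    )


if __name__ == "__main__":
    for (batch,) in build_loader():
        pass  # training step would go here
```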

Related Issues & PRs

PyTorch

ray

Trainers/Dataloaders from separate tasks impede each other #42135

Affected versions

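conda package llvm-openmp >= 16; in my testing, 17.xx is affected as well
Before PyTorch's fixup PR, PyTorch did not set an upper bound on the version of its llvm-openmp dependency, so nearly every install ran into the problem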

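04-09 update: llvm-openmp 18.xx is still not fixed. Also, the pytorch package in conda-forge has not changed its dependency version constraint, so you can still end up installing an affected version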
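
To check whether an environment falls in the range described above, something like the following could be used. It is a sketch, not an official check: `is_affected` and `installed_llvm_openmp` are hypothetical helpers, the ">= 16" threshold only reflects what this post observed up to 18.xx (newer releases may behave differently), and it assumes the `conda` executable is on PATH so that `conda list --json` can be queried.

```python
import json
import subprocess


def is_affected(version):
    """True if an llvm-openmp version string falls in the range reported above."""
    major = int(version.split(".")[0])
    return major >= 16


def installed_llvm_openmp():
    """Return the llvm-openmp version in the active conda env, or None."""
    out = subprocess.run(
        ["conda", "list", "--json", "llvm-openmp"],
        capture_output=True, text=True, check=True,
    ).stdout
    for pkg in json.loads(out):
        if pkg["name"] == "llvm-openmp":  # the argument is a regex, so match exactly
            return pkg["version"]
    return None


if __name__ == "__main__":
    version = installed_llvm_openmp()
    if version is None:
        print("llvm-openmp is not installed in this environment")
    else:
        print(version, "affected" if is_affected(version) else "probably fine")
```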