ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable MASTER_ADDR expected, but not set
#68
by
mahi22muki
- opened
I am trying to run the script across 2 servers (each 4GPU*2) using mpirun with Horovod.
I am getting this error. The rank, world size, and local size are not detected automatically, and MASTER_ADDR and MASTER_PORT are not picked up either.
Please help me resolve the error.
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'
I tried several different approaches, but nothing worked.
Did you ever find a solution?
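Not a confirmed fix for this issue, but one thing that may be worth trying: the env:// rendezvous reads RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT from the environment of every process, and mpirun does not set those names itself. If the launcher is Open MPI, each process does get OMPI_COMM_WORLD_RANK, OMPI_COMM_WORLD_SIZE and OMPI_COMM_WORLD_LOCAL_RANK, so they can be copied over before the process group is initialized. Also, 'localhost' as MASTER_ADDR only works when all ranks run on one machine; with two servers every rank needs the address of the host running rank 0. The sketch below assumes Open MPI; the helper name `export_torch_env_from_mpi` and the hostname `node0.example.com` are made up for illustration.

```python
import os


def export_torch_env_from_mpi(env, master_addr, master_port=29500):
    """Copy Open MPI's per-process variables into the names env:// reads.

    `env` is a mutable mapping, normally os.environ.
    `master_addr` must be reachable from every node (the rank-0 host).
    """
    # torch.distributed name -> Open MPI name (assumes Open MPI's mpirun)
    mapping = {
        "RANK": "OMPI_COMM_WORLD_RANK",
        "WORLD_SIZE": "OMPI_COMM_WORLD_SIZE",
        "LOCAL_RANK": "OMPI_COMM_WORLD_LOCAL_RANK",
    }
    for torch_name, mpi_name in mapping.items():
        if mpi_name not in env:
            raise RuntimeError(
                f"{mpi_name} is not set; was the script started with mpirun?"
            )
        env[torch_name] = env[mpi_name]
    env["MASTER_ADDR"] = master_addr
    env["MASTER_PORT"] = str(master_port)
    return env


# Intended use at the very top of the training script, before any
# torch.distributed initialization (hostname is a placeholder):
#
#   export_torch_env_from_mpi(os.environ, "node0.example.com")
#   torch.distributed.init_process_group(backend="nccl", init_method="env://")
```

The variables have to be set in each worker process before the process group is initialized. If they are set later, or only in the shell on one server, the same "expected, but not set" error would still appear.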