Training on a Single GPU
You can use tools/ to train a model on a single machine with a CPU and optionally a GPU.
Here is the full usage of the script:
python tools/ ${CONFIG_FILE} [ARGS]
By default, MMOCR prefers GPU to CPU. To train a model on CPU, set CUDA_VISIBLE_DEVICES to an empty value or to -1 so that no GPU is visible to the program. Note that CPU training requires MMCV >= 1.4.4.
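As a minimal sketch of the CPU-only setup described above, the following Python helper builds an environment in which CUDA_VISIBLE_DEVICES is set to -1. The helper name `cpu_only_env` is hypothetical and not part of MMOCR; it only prepares the environment and does not launch training itself.

```python
import os


def cpu_only_env(base_env=None):
    """Return a copy of the environment with all GPUs hidden.

    Hypothetical helper for illustration; not part of MMOCR.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Setting the variable to "-1" hides every GPU from CUDA-based programs.
    env["CUDA_VISIBLE_DEVICES"] = "-1"
    return env


if __name__ == "__main__":
    env = cpu_only_env()
    print(env["CUDA_VISIBLE_DEVICES"])
```

The returned dictionary could then be passed as the `env` argument of `subprocess.run` when starting the training command, so the change does not leak into the parent shell.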
ARGS | Type | Description |
--work-dir | str | The target folder to save logs and checkpoints. Defaults to ./work_dirs. |
--load-from | str | Path to the pre-trained model, which will be used to initialize the network parameters. |
--resume-from | str | Resume training from a previously saved checkpoint, which will inherit the training epoch and optimizer parameters. |
--no-validate | bool | Disable checkpoint evaluation during training. Defaults to False. |
--gpus | int | Deprecated, please use --gpu-id. Number of GPUs to use. Only applicable to non-distributed training. |
--gpu-ids | int*N | Deprecated, please use --gpu-id. A list of GPU ids to use. Only applicable to non-distributed training. |
--gpu-id | int | The GPU id to use. Only applicable to non-distributed training. |
--seed | int | Random seed. |
--diff_seed | bool | Whether or not to set different seeds for different ranks. |
--deterministic | bool | Whether to set deterministic options for the CUDNN backend. |
--cfg-options | str | Override some settings in the used config; key-value pairs in xxx=yyy format will be merged into the config file. If the value to be overwritten is a list, it should take the form key="[a,b]" or key=a,b. The argument also allows nested list/tuple values, e.g. key="[(a,b),(c,d)]". Note that the quotation marks are necessary and that no white space is allowed. |
--launcher | 'none', 'pytorch', 'slurm', 'mpi' | Options for job launcher. |
--local_rank | int | Used for distributed training. |
--mc-config | str | Memory cache config for image loading speed-up during training. |
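The value format expected by --cfg-options can be illustrated with a small sketch that renders Python values as whitespace-free key=value strings. The function names and the example key `key` are hypothetical; this only mirrors the formatting rules in the table and is not MMOCR's own parser.

```python
def format_cfg_value(value):
    """Render a value without whitespace, keeping list/tuple brackets."""
    if isinstance(value, (list, tuple)):
        left, right = ("[", "]") if isinstance(value, list) else ("(", ")")
        inner = ",".join(format_cfg_value(item) for item in value)
        return f"{left}{inner}{right}"
    return str(value)


def format_cfg_options(overrides):
    """Turn a dict of overrides into a list of key=value strings."""
    return [f"{key}={format_cfg_value(value)}" for key, value in overrides.items()]


if __name__ == "__main__":
    options = format_cfg_options({"total_epochs": 1, "key": [("a", "b"), ("c", "d")]})
    print(" ".join(options))
```

When the resulting strings are typed into a shell by hand, the bracketed values still need the quotation marks mentioned in the table; passing them as a list to `subprocess.run` avoids shell parsing altogether.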
Training on Multiple GPUs
MMOCR implements distributed training with MMDistributedDataParallel. (Please refer to the dataset documentation to prepare your datasets.)
Arguments | Type | Description |
int | The master port that will be used by the machine with rank 0. Defaults to 29500. Note: If you are launching multiple distributed training jobs on a single machine, you need to specify a different port for each job to avoid port conflicts. |
str | Arguments to be parsed by tools/ . |
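To avoid the port conflicts mentioned above, one option is to ask the operating system for an unused port before launching each job. The sketch below uses only the Python standard library; `find_free_port` is a hypothetical helper name and not part of MMOCR.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a TCP port that is currently unused on this host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Binding to port 0 lets the OS pick any free port.
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    print(find_free_port())
```

The port is released as soon as the function returns, so another process could in principle claim it before the training job starts; for a handful of jobs on one machine this is usually acceptable, but fixed, distinct ports are the more predictable choice.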
Training on Multiple Machines
MMOCR relies on torch.distributed package for distributed training. Thus, as a basic usage, one can launch distributed training via PyTorch’s launch utility.
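As a rough sketch of what such a launch looks like, the helper below assembles the command line for PyTorch's `torch.distributed.launch` module on one node. The function name and the `train_script` and `config_file` arguments are placeholders chosen for illustration; the exact script path and any extra arguments depend on your setup. Newer PyTorch releases recommend `torchrun` instead, so treat this as an assumption about the environment rather than the only way to start a job.

```python
import sys


def build_launch_command(train_script, config_file, nnodes, node_rank,
                         nproc_per_node, master_addr, master_port=29500):
    """Build the argv list for one node of a multi-machine job.

    Hypothetical helper; the same command is run on every node with a
    different node_rank.
    """
    return [
        sys.executable, "-m", "torch.distributed.launch",
        f"--nnodes={nnodes}",                  # total number of machines
        f"--node_rank={node_rank}",            # index of this machine
        f"--nproc_per_node={nproc_per_node}",  # processes (GPUs) on this machine
        f"--master_addr={master_addr}",        # address of the rank 0 machine
        f"--master_port={master_port}",        # port on the rank 0 machine
        train_script, config_file,
        "--launcher", "pytorch",
    ]


if __name__ == "__main__":
    cmd = build_launch_command("TRAIN_SCRIPT", "CONFIG_FILE", nnodes=2,
                               node_rank=0, nproc_per_node=8,
                               master_addr="10.0.0.1")
    print(" ".join(cmd))
```

The list can be passed directly to `subprocess.run`. Every node must use the same master address and port, and only `node_rank` changes from machine to machine.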
Training with Slurm
If you run MMOCR on a cluster managed with Slurm, you can use the script
Arguments | Type | Description |
int | The number of GPUs to be used by this task. Defaults to 8. |
int | The number of GPUs to be allocated per node. Defaults to 8. |
int | The number of CPUs to be allocated per task. Defaults to 5. |
str | Arguments to be parsed by srun. Available options can be found here. |
str | Arguments to be parsed by tools/ . |
Here is an example of using 8 GPUs to train a text detection model on the dev partition.
./tools/ dev psenet-ic15 configs/textdet/psenet/ /nfs/xxxx/psenet-ic15
Running Multiple Training Jobs on a Single Machine
If you are launching multiple training jobs on a single machine with Slurm, you may need to modify the port in configs to avoid communication conflicts.
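For example, give each job's config a different port:
dist_params = dict(backend='nccl', port=29500)  # config of the first job
dist_params = dict(backend='nccl', port=29501)  # config of the second job
Then you can launch the two jobs, one with each config.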
Commonly Used Training Configs
Here we list some configs that are frequently used during training for quick reference.
total_epochs = 1200
data = dict(
    # Note: General settings for the train, val and test dataloaders can be specified here.
    # However, their values can be overridden in each dataloader's config.
    samples_per_gpu=8,  # Batch size per GPU
    workers_per_gpu=4,  # Number of workers to process data for each GPU
    train_dataloader=dict(samples_per_gpu=10, drop_last=True),  # Batch size = 10, workers_per_gpu = 4
    val_dataloader=dict(samples_per_gpu=6, workers_per_gpu=1),  # Batch size = 6, workers_per_gpu = 1
    test_dataloader=dict(workers_per_gpu=16),  # Batch size = 8, workers_per_gpu = 16
)
# Evaluation
evaluation = dict(interval=1, by_epoch=True) # Evaluate the model every epoch
# Saving and Logging
checkpoint_config = dict(interval=1) # Save a checkpoint every epoch
log_config = dict(
    interval=5,  # Print out the model's performance every 5 iterations
)
# Optimizer
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001) # Supports all optimizers in PyTorch and shares the same parameters
optimizer_config = dict(grad_clip=None)  # Parameters for the optimizer hook; see its implementation for details
# Learning policy
lr_config = dict(policy='poly', power=0.9, min_lr=1e-7, by_epoch=True)
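To give a feel for what the 'poly' learning policy does with the values above, here is a minimal sketch that assumes the common polynomial decay formula, in which the learning rate falls from the base value to `min_lr` over the course of training. The function `poly_lr` is hypothetical and written for illustration; the actual hook in MMCV may differ in details such as warmup handling.

```python
def poly_lr(base_lr, epoch, total_epochs, power=0.9, min_lr=1e-7):
    """Polynomial decay from base_lr at epoch 0 to min_lr at total_epochs.

    Assumed formula, for illustration only:
        lr = (base_lr - min_lr) * (1 - epoch / total_epochs) ** power + min_lr
    """
    coeff = (1 - epoch / total_epochs) ** power
    return (base_lr - min_lr) * coeff + min_lr


if __name__ == "__main__":
    # Values taken from the config above: lr=0.02, total_epochs=1200.
    for epoch in (0, 300, 600, 900, 1200):
        print(epoch, poly_lr(0.02, epoch, 1200))
```

With `by_epoch=True` the schedule is stepped once per epoch, which is why the sketch takes an epoch index and not an iteration count.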