[2024-09-15 15:26:38,023] torch.distributed.run: [WARNING] 
[2024-09-15 15:26:38,023] torch.distributed.run: [WARNING] *****************************************
[2024-09-15 15:26:38,023] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-09-15 15:26:38,023] torch.distributed.run: [WARNING] *****************************************
[2024-09-15 15:26:40,361] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:26:40,380] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:26:40,383] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:26:40,413] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
[2024-09-15 15:26:42,413] [INFO] [comm.py:637:init_distributed] cdb=None
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
[2024-09-15 15:26:42,447] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-15 15:26:42,482] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-15 15:26:42,518] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-15 15:26:42,518] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
09/15/2024 15:26:42 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: False
09/15/2024 15:26:42 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: False
09/15/2024 15:26:42 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False
09/15/2024 15:26:42 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
09/15/2024 15:26:42 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=4,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=/workspace/InternVL/internvl_chat/zero_stage1_config.json,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=True,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=4e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/runs/Sep15_15-26-42_1fe2e16e3880,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1.0,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_torch,
optim_args=None,
output_dir=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=4,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=200,
save_strategy=steps,
save_total_limit=1,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.03,
warmup_steps=0,
weight_decay=0.01,
)
09/15/2024 15:26:42 - INFO - __main__ - Loading Tokenizer: /workspace/pretrained/InternVL2-8B
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:26:42,601 >> loading file ./tokenizer.model
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:26:42,602 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:26:42,602 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:26:42,602 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:26:42,602 >> loading file tokenizer.json
[WARNING|logging.py:314] 2024-09-15 15:26:42,725 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:26:42,742 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:26:42,769 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:26:42,796 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
09/15/2024 15:26:42 - INFO - __main__ - Loading InternVLChatModel...
[INFO|configuration_utils.py:727] 2024-09-15 15:26:42,947 >> loading configuration file /workspace/pretrained/InternVL2-8B/config.json
[INFO|configuration_utils.py:792] 2024-09-15 15:26:42,949 >> Model config InternVLChatConfig {
  "_commit_hash": null,
  "architectures": [
    "InternVLChatModel"
  ],
  "auto_map": {
    "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
    "AutoModel": "modeling_internvl_chat.InternVLChatModel",
    "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel"
  },
  "downsample_ratio": 0.5,
  "dynamic_image_size": true,
  "force_image_size": 448,
  "llm_config": {
    "_name_or_path": "internlm/internlm2_5-7b-chat",
    "add_cross_attention": false,
    "architectures": [
      "InternLM2ForCausalLM"
    ],
    "attn_implementation": "flash_attention_2",
    "auto_map": {
      "AutoConfig": "configuration_internlm2.InternLM2Config",
      "AutoModel": "modeling_internlm2.InternLM2ForCausalLM",
      "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM"
    },
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bias": false,
    "bos_token_id": 1,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 2,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "silu",
    "hidden_size": 4096,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_range": 0.02,
    "intermediate_size": 14336,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "length_penalty": 1.0,
    "max_length": 20,
    "max_position_embeddings": 32768,
    "min_length": 0,
    "model_type": "internlm2",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 32,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_hidden_layers": 32,
    "num_key_value_heads": 8,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": 2,
    "prefix": null,
    "pretraining_tp": 1,
    "problem_type": null,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "rms_norm_eps": 1e-05,
    "rope_scaling": {
      "factor": 2.0,
      "type": "dynamic"
    },
    "rope_theta": 1000000,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": false,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.37.2",
    "typical_p": 1.0,
    "use_bfloat16": true,
    "use_cache": true,
    "vocab_size": 92553
  },
  "max_dynamic_patch": 12,
  "min_dynamic_patch": 1,
  "model_type": "internvl_chat",
  "pad2square": false,
  "ps_version": "v2",
  "select_layer": -1,
  "template": "internlm2-chat",
  "torch_dtype": "bfloat16",
  "transformers_version": null,
  "use_backbone_lora": 0,
  "use_llm_lora": 0,
  "use_thumbnail": true,
  "vision_config": {
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": [
      "InternVisionModel"
    ],
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "drop_path_rate": 0.0,
    "dropout": 0.0,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_size": 1024,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "image_size": 448,
    "initializer_factor": 1.0,
    "initializer_range": 0.02,
    "intermediate_size": 4096,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_eps": 1e-06,
    "length_penalty": 1.0,
    "max_length": 20,
    "min_length": 0,
    "model_type": "intern_vit_6b",
    "no_repeat_ngram_size": 0,
    "norm_type": "layer_norm",
    "num_attention_heads": 16,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_channels": 3,
    "num_hidden_layers": 24,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "patch_size": 14,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "qk_normalization": false,
    "qkv_bias": true,
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.37.2",
    "typical_p": 1.0,
    "use_bfloat16": true,
    "use_flash_attn": true
  }
}

09/15/2024 15:26:42 - INFO - __main__ - Using flash_attention_2 for InternLM
[INFO|modeling_utils.py:3473] 2024-09-15 15:26:42,951 >> loading weights file /workspace/pretrained/InternVL2-8B/model.safetensors.index.json
[INFO|modeling_utils.py:1426] 2024-09-15 15:26:42,951 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:826] 2024-09-15 15:26:42,952 >> Generate config GenerationConfig {}

[INFO|configuration_utils.py:826] 2024-09-15 15:26:42,990 >> Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 2
}

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]Loading checkpoint shards:  25%|██▌       | 1/4 [00:02<00:08,  2.67s/it]Loading checkpoint shards:  25%|██▌       | 1/4 [00:02<00:08,  2.97s/it]Loading checkpoint shards:  25%|██▌       | 1/4 [00:02<00:08,  2.97s/it]Loading checkpoint shards:  25%|██▌       | 1/4 [00:02<00:08,  2.98s/it]Loading checkpoint shards:  50%|█████     | 2/4 [00:05<00:05,  2.71s/it]Loading checkpoint shards:  50%|█████     | 2/4 [00:05<00:05,  2.96s/it]Loading checkpoint shards:  50%|█████     | 2/4 [00:05<00:05,  2.90s/it]Loading checkpoint shards:  50%|█████     | 2/4 [00:05<00:05,  2.99s/it]Loading checkpoint shards:  75%|███████▌  | 3/4 [00:08<00:02,  2.70s/it]Loading checkpoint shards:  75%|███████▌  | 3/4 [00:08<00:02,  2.85s/it]Loading checkpoint shards:  75%|███████▌  | 3/4 [00:08<00:02,  2.94s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00,  1.97s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00,  2.24s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:08<00:02,  2.97s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00,  2.09s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00,  2.39s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00,  2.15s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00,  2.44s/it]
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00,  2.17s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00,  2.47s/it]
[INFO|modeling_utils.py:4350] 2024-09-15 15:26:54,253 >> All model checkpoint weights were used when initializing InternVLChatModel.

[INFO|modeling_utils.py:4358] 2024-09-15 15:26:54,253 >> All the weights of InternVLChatModel were initialized from the model checkpoint at /workspace/pretrained/InternVL2-8B.
If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training.
[INFO|configuration_utils.py:779] 2024-09-15 15:26:54,259 >> loading configuration file /workspace/pretrained/InternVL2-8B/generation_config.json
[INFO|configuration_utils.py:826] 2024-09-15 15:26:54,259 >> Generate config GenerationConfig {
  "eos_token_id": [
    92542,
    92543
  ]
}

09/15/2024 15:26:54 - INFO - __main__ - Finished
09/15/2024 15:26:54 - INFO - __main__ - model.config.force_image_size: 448
09/15/2024 15:26:54 - INFO - __main__ - data_args.force_image_size: 448
09/15/2024 15:26:54 - INFO - __main__ - model.config.vision_config.image_size: 448
09/15/2024 15:26:54 - INFO - __main__ - [Dataset] num_image_token: 256
09/15/2024 15:26:54 - INFO - __main__ - [Dataset] dynamic_image_size: True
09/15/2024 15:26:54 - INFO - __main__ - [Dataset] use_thumbnail: True
09/15/2024 15:26:54 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6
09/15/2024 15:26:54 - INFO - __main__ - Formatting inputs...Skip in lazy mode
09/15/2024 15:26:54 - INFO - __main__ - Add dataset: amazon-product-dataset with length: 9996
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w2.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w2.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w2.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w2.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w2.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w2.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w2.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w2.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w2.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w2.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w2.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w2.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w2.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w2.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w2.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w2.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w2.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w2.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w2.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w2.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w2.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w2.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w2.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w2.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w2.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w2.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w2.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w2.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w2.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w2.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w2.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w2.lora_B.default.weight
[INFO|trainer.py:571] 2024-09-15 15:26:55,235 >> Using auto half precision backend
[2024-09-15 15:26:55,443] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.13.5, git-hash=unknown, git-branch=unknown
[2024-09-15 15:26:58,856] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu118/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.08728218078613281 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10238933563232422 seconds
Loading extension module fused_adam...
Loading extension module fused_adam...
Time to load fused_adam op: 0.1020348072052002 seconds
Time to load fused_adam op: 0.10272526741027832 seconds
[2024-09-15 15:26:59,612] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2024-09-15 15:26:59,612] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-09-15 15:26:59,651] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2024-09-15 15:26:59,651] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2024-09-15 15:26:59,651] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 1 optimizer
[2024-09-15 15:26:59,651] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 1000000000
[2024-09-15 15:26:59,651] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 1000000000
[2024-09-15 15:26:59,651] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False
[2024-09-15 15:26:59,652] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False
[2024-09-15 15:26:59,955] [INFO] [utils.py:800:see_memory_usage] Before initializing optimizer states
[2024-09-15 15:26:59,956] [INFO] [utils.py:801:see_memory_usage] MA 15.65 GB         Max_MA 15.67 GB         CA 15.85 GB         Max_CA 16 GB 
[2024-09-15 15:26:59,957] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 87.25 GB, percent = 9.2%
[2024-09-15 15:27:00,146] [INFO] [utils.py:800:see_memory_usage] After initializing optimizer states
[2024-09-15 15:27:00,147] [INFO] [utils.py:801:see_memory_usage] MA 15.65 GB         Max_MA 15.68 GB         CA 15.89 GB         Max_CA 16 GB 
[2024-09-15 15:27:00,148] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 87.35 GB, percent = 9.2%
[2024-09-15 15:27:00,148] [INFO] [stage_1_and_2.py:539:__init__] optimizer state initialized
[2024-09-15 15:27:00,327] [INFO] [utils.py:800:see_memory_usage] After initializing ZeRO optimizer
[2024-09-15 15:27:00,327] [INFO] [utils.py:801:see_memory_usage] MA 15.65 GB         Max_MA 15.65 GB         CA 15.89 GB         Max_CA 16 GB 
[2024-09-15 15:27:00,328] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 87.37 GB, percent = 9.3%
[2024-09-15 15:27:00,330] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
[2024-09-15 15:27:00,330] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client callable to create LR scheduler
[2024-09-15 15:27:00,330] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7fdf40dcf490>
[2024-09-15 15:27:00,330] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[[0.9, 0.999]]
[2024-09-15 15:27:00,335] [INFO] [config.py:996:print] DeepSpeedEngine configuration:
[2024-09-15 15:27:00,335] [INFO] [config.py:1000:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2024-09-15 15:27:00,335] [INFO] [config.py:1000:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-09-15 15:27:00,335] [INFO] [config.py:1000:print]   amp_enabled .................. False
[2024-09-15 15:27:00,335] [INFO] [config.py:1000:print]   amp_params ................... False
[2024-09-15 15:27:00,335] [INFO] [config.py:1000:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2024-09-15 15:27:00,335] [INFO] [config.py:1000:print]   bfloat16_enabled ............. True
[2024-09-15 15:27:00,335] [INFO] [config.py:1000:print]   bfloat16_immediate_grad_update  False
[2024-09-15 15:27:00,335] [INFO] [config.py:1000:print]   checkpoint_parallel_write_pipeline  False
[2024-09-15 15:27:00,335] [INFO] [config.py:1000:print]   checkpoint_tag_validation_enabled  True
[2024-09-15 15:27:00,335] [INFO] [config.py:1000:print]   checkpoint_tag_validation_fail  False
[2024-09-15 15:27:00,335] [INFO] [config.py:1000:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fdf40dcea10>
[2024-09-15 15:27:00,335] [INFO] [config.py:1000:print]   communication_data_type ...... None
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   compile_config ............... enabled=False backend='inductor' kwargs={}
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   curriculum_enabled_legacy .... False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   curriculum_params_legacy ..... False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   data_efficiency_enabled ...... False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   dataloader_drop_last ......... False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   disable_allgather ............ False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   dump_state ................... False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   dynamic_loss_scale_args ...... None
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   eigenvalue_enabled ........... False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   eigenvalue_gas_boundary_resolution  1
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   eigenvalue_layer_num ......... 0
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   eigenvalue_max_iter .......... 100
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   eigenvalue_stability ......... 1e-06
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   eigenvalue_tol ............... 0.01
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   eigenvalue_verbose ........... False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   elasticity_enabled ........... False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   fp16_auto_cast ............... None
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   fp16_enabled ................. False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   fp16_master_weights_and_gradients  False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   global_rank .................. 0
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   grad_accum_dtype ............. None
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   gradient_accumulation_steps .. 1
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   gradient_clipping ............ 1.0
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   gradient_predivide_factor .... 1.0
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   graph_harvesting ............. False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   initial_dynamic_scale ........ 1
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   load_universal_checkpoint .... False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   loss_scale ................... 1.0
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   memory_breakdown ............. False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   mics_hierarchial_params_gather  False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   mics_shard_size .............. -1
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   optimizer_legacy_fusion ...... False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   optimizer_name ............... adamw
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   optimizer_params ............. {'lr': 4e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.01}
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   pld_enabled .................. False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   pld_params ................... False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   prescale_gradients ........... False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print]   scheduler_name ............... None
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print]   scheduler_params ............. None
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print]   seq_parallel_communication_data_type  torch.float32
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print]   sparse_attention ............. None
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print]   sparse_gradients_enabled ..... False
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print]   steps_per_print .............. inf
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print]   train_batch_size ............. 16
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print]   train_micro_batch_size_per_gpu  4
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print]   use_data_before_expert_parallel_  False
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print]   use_node_local_storage ....... False
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print]   wall_clock_breakdown ......... True
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print]   weight_quantization_config ... None
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print]   world_size ................... 4
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print]   zero_allow_untested_optimizer  False
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print]   zero_config .................. stage=1 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print]   zero_enabled ................. True
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print]   zero_force_ds_cpu_optimizer .. True
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print]   zero_optimization_stage ...... 1
[2024-09-15 15:27:00,337] [INFO] [config.py:986:print_user_config]   json = {
    "zero_optimization": {
        "stage": 1, 
        "allgather_partitions": true, 
        "allgather_bucket_size": 1.000000e+09, 
        "overlap_comm": true, 
        "reduce_scatter": true, 
        "reduce_bucket_size": 1.000000e+09, 
        "contiguous_gradients": true
    }, 
    "fp16": {
        "enabled": false, 
        "auto_cast": true, 
        "loss_scale": 0, 
        "initial_scale_power": 32, 
        "loss_scale_window": 1000, 
        "hysteresis": 2, 
        "min_loss_scale": 1
    }, 
    "bf16": {
        "enabled": true
    }, 
    "optimizer": {
        "type": "AdamW", 
        "params": {
            "lr": 4e-05, 
            "betas": [0.9, 0.999], 
            "eps": 1e-08, 
            "weight_decay": 0.01
        }
    }, 
    "gradient_accumulation_steps": 1, 
    "gradient_clipping": 1.0, 
    "steps_per_print": inf, 
    "train_batch_size": 16, 
    "train_micro_batch_size_per_gpu": 4, 
    "wall_clock_breakdown": true
}
[INFO|trainer.py:1721] 2024-09-15 15:27:00,337 >> ***** Running training *****
[INFO|trainer.py:1722] 2024-09-15 15:27:00,337 >>   Num examples = 9,996
[INFO|trainer.py:1723] 2024-09-15 15:27:00,337 >>   Num Epochs = 3
[INFO|trainer.py:1724] 2024-09-15 15:27:00,337 >>   Instantaneous batch size per device = 4
[INFO|trainer.py:1727] 2024-09-15 15:27:00,337 >>   Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1728] 2024-09-15 15:27:00,337 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1729] 2024-09-15 15:27:00,337 >>   Total optimization steps = 1,875
[INFO|trainer.py:1730] 2024-09-15 15:27:00,343 >>   Number of trainable parameters = 37,748,736
  0%|          | 0/1875 [00:00<?, ?it/s][2024-09-15 15:27:03,156] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:27:03,184] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:27:03,207] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:27:03,210] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:27:06,914] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:27:07,033] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:27:07,093] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:27:07,099] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:27:10,617] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:27:10,738] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:27:10,847] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:27:10,867] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:27:14,357] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:27:14,588] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:27:14,637] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:27:14,691] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
dynamic ViT batch size: 20, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 15:27:19,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.79 | optimizer_gradients: 2.96 | optimizer_step: 1.06
[2024-09-15 15:27:19,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1012.23 | bwd_microstep: 1624.24 | bwd_inner_microstep: 1334.22 | bwd_allreduce_microstep: 289.88 | step_microstep: 19.82
[2024-09-15 15:27:19,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 1012.20 | bwd: 1624.24 | bwd_inner: 1334.24 | bwd_allreduce: 289.89 | step: 19.82
  0%|          | 1/1875 [00:18<9:46:39, 18.78s/it]                                                  {'loss': 3.6001, 'learning_rate': 7.017543859649123e-07, 'epoch': 0.0}
  0%|          | 1/1875 [00:18<9:46:39, 18.78s/it]dynamic ViT batch size: 20, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 15:27:21,227] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.41 | optimizer_gradients: 0.35 | optimizer_step: 0.39
[2024-09-15 15:27:21,227] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 741.57 | bwd_microstep: 1330.30 | bwd_inner_microstep: 1314.87 | bwd_allreduce_microstep: 15.25 | step_microstep: 7.05
[2024-09-15 15:27:21,227] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 741.55 | bwd: 1330.31 | bwd_inner: 1314.96 | bwd_allreduce: 15.24 | step: 7.05
  0%|          | 2/1875 [00:20<4:40:07,  8.97s/it]                                                  {'loss': 3.6638, 'learning_rate': 1.4035087719298246e-06, 'epoch': 0.0}
  0%|          | 2/1875 [00:20<4:40:07,  8.97s/it]dynamic ViT batch size: 20, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 15:27:23,346] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.60 | optimizer_gradients: 0.39 | optimizer_step: 0.39
[2024-09-15 15:27:23,347] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 748.44 | bwd_microstep: 1335.68 | bwd_inner_microstep: 1322.71 | bwd_allreduce_microstep: 12.90 | step_microstep: 7.53
[2024-09-15 15:27:23,347] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 748.43 | bwd: 1335.69 | bwd_inner: 1322.72 | bwd_allreduce: 12.90 | step: 7.53
  0%|          | 3/1875 [00:23<3:02:19,  5.84s/it]                                                  {'loss': 3.505, 'learning_rate': 2.105263157894737e-06, 'epoch': 0.0}
  0%|          | 3/1875 [00:23<3:02:19,  5.84s/it]dynamic ViT batch size: 20, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 15:27:26,055] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.23 | optimizer_gradients: 0.48 | optimizer_step: 0.40
[2024-09-15 15:27:26,056] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 741.20 | bwd_microstep: 1932.35 | bwd_inner_microstep: 1306.54 | bwd_allreduce_microstep: 625.68 | step_microstep: 7.47
[2024-09-15 15:27:26,056] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 741.19 | bwd: 1932.37 | bwd_inner: 1306.58 | bwd_allreduce: 625.69 | step: 7.47
  0%|          | 4/1875 [00:25<2:23:38,  4.61s/it]                                                  {'loss': 4.205, 'learning_rate': 2.8070175438596493e-06, 'epoch': 0.01}
  0%|          | 4/1875 [00:25<2:23:38,  4.61s/it]dynamic ViT batch size: 20, images per sample: 5.0, dynamic token length: 1900
[2024-09-15 15:27:28,794] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.45 | optimizer_gradients: 0.39 | optimizer_step: 0.39
[2024-09-15 15:27:28,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 962.25 | bwd_microstep: 1737.43 | bwd_inner_microstep: 1724.44 | bwd_allreduce_microstep: 12.90 | step_microstep: 7.80
[2024-09-15 15:27:28,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 962.23 | bwd: 1737.44 | bwd_inner: 1724.47 | bwd_allreduce: 12.90 | step: 7.80
  0%|          | 5/1875 [00:28<2:02:34,  3.93s/it]                                                  {'loss': 3.5304, 'learning_rate': 3.5087719298245615e-06, 'epoch': 0.01}
  0%|          | 5/1875 [00:28<2:02:34,  3.93s/it]dynamic ViT batch size: 18, images per sample: 4.5, dynamic token length: 1878
[2024-09-15 15:27:31,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.60 | optimizer_gradients: 0.39 | optimizer_step: 0.37
[2024-09-15 15:27:31,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 942.43 | bwd_microstep: 1716.42 | bwd_inner_microstep: 1703.58 | bwd_allreduce_microstep: 12.73 | step_microstep: 7.86
[2024-09-15 15:27:31,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 942.42 | bwd: 1716.43 | bwd_inner: 1703.62 | bwd_allreduce: 12.73 | step: 7.86
  0%|          | 6/1875 [00:31<1:49:23,  3.51s/it]                                                  {'loss': 3.9124, 'learning_rate': 4.210526315789474e-06, 'epoch': 0.01}
  0%|          | 6/1875 [00:31<1:49:23,  3.51s/it]dynamic ViT batch size: 20, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 15:27:33,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.48 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 15:27:33,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 742.47 | bwd_microstep: 1324.64 | bwd_inner_microstep: 1310.04 | bwd_allreduce_microstep: 14.45 | step_microstep: 7.17
[2024-09-15 15:27:33,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 742.46 | bwd: 1324.65 | bwd_inner: 1310.08 | bwd_allreduce: 14.44 | step: 7.18
  0%|          | 7/1875 [00:33<1:34:59,  3.05s/it]                                                  {'loss': 4.0505, 'learning_rate': 4.912280701754386e-06, 'epoch': 0.01}
  0%|          | 7/1875 [00:33<1:34:59,  3.05s/it]dynamic ViT batch size: 22, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 15:27:36,349] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.51 | optimizer_gradients: 0.38 | optimizer_step: 0.37
[2024-09-15 15:27:36,349] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 976.87 | bwd_microstep: 1745.74 | bwd_inner_microstep: 1732.90 | bwd_allreduce_microstep: 12.79 | step_microstep: 7.75
[2024-09-15 15:27:36,349] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 976.86 | bwd: 1745.75 | bwd_inner: 1732.90 | bwd_allreduce: 12.81 | step: 7.75
  0%|          | 8/1875 [00:36<1:32:02,  2.96s/it]                                                  {'loss': 3.5936, 'learning_rate': 5.6140350877192985e-06, 'epoch': 0.01}
  0%|          | 8/1875 [00:36<1:32:02,  2.96s/it]dynamic ViT batch size: 20, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 15:27:38,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.52 | optimizer_gradients: 0.37 | optimizer_step: 0.37
[2024-09-15 15:27:38,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 744.14 | bwd_microstep: 1326.17 | bwd_inner_microstep: 1312.95 | bwd_allreduce_microstep: 13.17 | step_microstep: 7.13
[2024-09-15 15:27:38,455] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 744.12 | bwd: 1326.18 | bwd_inner: 1312.95 | bwd_allreduce: 13.17 | step: 7.13
  0%|          | 9/1875 [00:38<1:23:42,  2.69s/it]                                                  {'loss': 3.6596, 'learning_rate': 6.31578947368421e-06, 'epoch': 0.01}
  0%|          | 9/1875 [00:38<1:23:42,  2.69s/it]dynamic ViT batch size: 20, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 15:27:41,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.42 | optimizer_gradients: 0.40 | optimizer_step: 0.38
[2024-09-15 15:27:41,498] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 746.58 | bwd_microstep: 2257.36 | bwd_inner_microstep: 1317.28 | bwd_allreduce_microstep: 939.97 | step_microstep: 7.37
[2024-09-15 15:27:41,498] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 746.56 | bwd: 2257.37 | bwd_inner: 1317.31 | bwd_allreduce: 939.98 | step: 7.37
  1%|          | 10/1875 [00:41<1:27:01,  2.80s/it]                                                   {'loss': 3.9084, 'learning_rate': 7.017543859649123e-06, 'epoch': 0.02}
  1%|          | 10/1875 [00:41<1:27:01,  2.80s/it]dynamic ViT batch size: 20, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 15:27:44,240] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.40 | optimizer_step: 0.38
[2024-09-15 15:27:44,240] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 749.90 | bwd_microstep: 1958.15 | bwd_inner_microstep: 1332.21 | bwd_allreduce_microstep: 625.84 | step_microstep: 7.11
[2024-09-15 15:27:44,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 749.89 | bwd: 1958.16 | bwd_inner: 1332.23 | bwd_allreduce: 625.85 | step: 7.11
  1%|          | 11/1875 [00:43<1:26:26,  2.78s/it]                                                   {'loss': 3.4137, 'learning_rate': 7.719298245614036e-06, 'epoch': 0.02}
  1%|          | 11/1875 [00:43<1:26:26,  2.78s/it]dynamic ViT batch size: 20, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 15:27:46,984] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.53 | optimizer_gradients: 0.40 | optimizer_step: 0.37
[2024-09-15 15:27:46,985] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 741.25 | bwd_microstep: 1967.18 | bwd_inner_microstep: 1315.89 | bwd_allreduce_microstep: 651.18 | step_microstep: 7.52
[2024-09-15 15:27:46,985] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 741.23 | bwd: 1967.19 | bwd_inner: 1315.92 | bwd_allreduce: 651.20 | step: 7.52
  1%|          | 12/1875 [00:46<1:26:01,  2.77s/it]                                                   {'loss': 3.7522, 'learning_rate': 8.421052631578948e-06, 'epoch': 0.02}
  1%|          | 12/1875 [00:46<1:26:01,  2.77s/it]dynamic ViT batch size: 21, images per sample: 5.25, dynamic token length: 1900
[2024-09-15 15:27:49,748] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.63 | optimizer_gradients: 0.40 | optimizer_step: 0.38
[2024-09-15 15:27:49,748] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 975.06 | bwd_microstep: 1753.00 | bwd_inner_microstep: 1740.20 | bwd_allreduce_microstep: 12.75 | step_microstep: 7.91
[2024-09-15 15:27:49,748] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 975.04 | bwd: 1753.01 | bwd_inner: 1740.20 | bwd_allreduce: 12.77 | step: 7.92
  1%|          | 13/1875 [00:49<1:25:55,  2.77s/it]                                                   {'loss': 3.32, 'learning_rate': 9.12280701754386e-06, 'epoch': 0.02}
  1%|          | 13/1875 [00:49<1:25:55,  2.77s/it]dynamic ViT batch size: 20, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 15:28:14,978] torch.distributed.run: [WARNING] 
[2024-09-15 15:28:14,978] torch.distributed.run: [WARNING] *****************************************
[2024-09-15 15:28:14,978] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-09-15 15:28:14,978] torch.distributed.run: [WARNING] *****************************************
[2024-09-15 15:28:17,324] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:28:17,354] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:28:17,359] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:28:17,376] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
[2024-09-15 15:28:19,422] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-15 15:28:19,447] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-15 15:28:19,447] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-09-15 15:28:19,454] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-15 15:28:19,487] [INFO] [comm.py:637:init_distributed] cdb=None
09/15/2024 15:28:19 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False
09/15/2024 15:28:19 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
09/15/2024 15:28:19 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=4,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=/workspace/InternVL/internvl_chat/zero_stage1_config.json,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=0,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=True,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=4e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/runs/Sep15_15-28-19_1fe2e16e3880,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1.0,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_torch,
optim_args=None,
output_dir=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=200,
save_strategy=steps,
save_total_limit=1,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.03,
warmup_steps=0,
weight_decay=0.01,
)
09/15/2024 15:28:19 - INFO - __main__ - Loading Tokenizer: /workspace/pretrained/InternVL2-8B
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:28:19,533 >> loading file ./tokenizer.model
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:28:19,533 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:28:19,533 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:28:19,533 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:28:19,533 >> loading file tokenizer.json
09/15/2024 15:28:19 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: False
09/15/2024 15:28:19 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: False
[WARNING|logging.py:314] 2024-09-15 15:28:19,691 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:28:19,711 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:28:19,722 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:28:19,733 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
09/15/2024 15:28:19 - INFO - __main__ - Loading InternVLChatModel...
[INFO|configuration_utils.py:727] 2024-09-15 15:28:19,861 >> loading configuration file /workspace/pretrained/InternVL2-8B/config.json
[INFO|configuration_utils.py:792] 2024-09-15 15:28:19,862 >> Model config InternVLChatConfig {
  "_commit_hash": null,
  "architectures": [
    "InternVLChatModel"
  ],
  "auto_map": {
    "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
    "AutoModel": "modeling_internvl_chat.InternVLChatModel",
    "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel"
  },
  "downsample_ratio": 0.5,
  "dynamic_image_size": true,
  "force_image_size": 448,
  "llm_config": {
    "_name_or_path": "internlm/internlm2_5-7b-chat",
    "add_cross_attention": false,
    "architectures": [
      "InternLM2ForCausalLM"
    ],
    "attn_implementation": "flash_attention_2",
    "auto_map": {
      "AutoConfig": "configuration_internlm2.InternLM2Config",
      "AutoModel": "modeling_internlm2.InternLM2ForCausalLM",
      "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM"
    },
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bias": false,
    "bos_token_id": 1,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 2,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "silu",
    "hidden_size": 4096,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_range": 0.02,
    "intermediate_size": 14336,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "length_penalty": 1.0,
    "max_length": 20,
    "max_position_embeddings": 32768,
    "min_length": 0,
    "model_type": "internlm2",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 32,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_hidden_layers": 32,
    "num_key_value_heads": 8,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": 2,
    "prefix": null,
    "pretraining_tp": 1,
    "problem_type": null,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "rms_norm_eps": 1e-05,
    "rope_scaling": {
      "factor": 2.0,
      "type": "dynamic"
    },
    "rope_theta": 1000000,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": false,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.37.2",
    "typical_p": 1.0,
    "use_bfloat16": true,
    "use_cache": true,
    "vocab_size": 92553
  },
  "max_dynamic_patch": 12,
  "min_dynamic_patch": 1,
  "model_type": "internvl_chat",
  "pad2square": false,
  "ps_version": "v2",
  "select_layer": -1,
  "template": "internlm2-chat",
  "torch_dtype": "bfloat16",
  "transformers_version": null,
  "use_backbone_lora": 0,
  "use_llm_lora": 0,
  "use_thumbnail": true,
  "vision_config": {
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": [
      "InternVisionModel"
    ],
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "drop_path_rate": 0.0,
    "dropout": 0.0,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_size": 1024,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "image_size": 448,
    "initializer_factor": 1.0,
    "initializer_range": 0.02,
    "intermediate_size": 4096,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_eps": 1e-06,
    "length_penalty": 1.0,
    "max_length": 20,
    "min_length": 0,
    "model_type": "intern_vit_6b",
    "no_repeat_ngram_size": 0,
    "norm_type": "layer_norm",
    "num_attention_heads": 16,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_channels": 3,
    "num_hidden_layers": 24,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "patch_size": 14,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "qk_normalization": false,
    "qkv_bias": true,
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.37.2",
    "typical_p": 1.0,
    "use_bfloat16": true,
    "use_flash_attn": true
  }
}

09/15/2024 15:28:19 - INFO - __main__ - Using flash_attention_2 for InternLM
[INFO|modeling_utils.py:3473] 2024-09-15 15:28:19,864 >> loading weights file /workspace/pretrained/InternVL2-8B/model.safetensors.index.json
[INFO|modeling_utils.py:1426] 2024-09-15 15:28:19,864 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:826] 2024-09-15 15:28:19,865 >> Generate config GenerationConfig {}

[INFO|configuration_utils.py:826] 2024-09-15 15:28:19,905 >> Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 2
}

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]Loading checkpoint shards:  25%|██▌       | 1/4 [00:02<00:08,  2.67s/it]Loading checkpoint shards:  25%|██▌       | 1/4 [00:02<00:07,  2.64s/it]Loading checkpoint shards:  25%|██▌       | 1/4 [00:02<00:07,  2.54s/it]Loading checkpoint shards:  25%|██▌       | 1/4 [00:02<00:08,  2.90s/it]Loading checkpoint shards:  50%|█████     | 2/4 [00:04<00:04,  2.48s/it]Loading checkpoint shards:  50%|█████     | 2/4 [00:05<00:05,  2.61s/it]Loading checkpoint shards:  50%|█████     | 2/4 [00:05<00:05,  2.60s/it]Loading checkpoint shards:  50%|█████     | 2/4 [00:05<00:05,  2.84s/it]Loading checkpoint shards:  75%|███████▌  | 3/4 [00:07<00:02,  2.47s/it]Loading checkpoint shards:  75%|███████▌  | 3/4 [00:07<00:02,  2.58s/it]Loading checkpoint shards:  75%|███████▌  | 3/4 [00:07<00:02,  2.58s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00,  1.82s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00,  2.07s/it]
[INFO|modeling_utils.py:4350] 2024-09-15 15:28:29,699 >> All model checkpoint weights were used when initializing InternVLChatModel.

[INFO|modeling_utils.py:4358] 2024-09-15 15:28:29,699 >> All the weights of InternVLChatModel were initialized from the model checkpoint at /workspace/pretrained/InternVL2-8B.
If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training.
[INFO|configuration_utils.py:779] 2024-09-15 15:28:29,704 >> loading configuration file /workspace/pretrained/InternVL2-8B/generation_config.json
[INFO|configuration_utils.py:826] 2024-09-15 15:28:29,705 >> Generate config GenerationConfig {
  "eos_token_id": [
    92542,
    92543
  ]
}

09/15/2024 15:28:29 - INFO - __main__ - Finished
09/15/2024 15:28:29 - INFO - __main__ - model.config.force_image_size: 448
09/15/2024 15:28:29 - INFO - __main__ - data_args.force_image_size: 448
09/15/2024 15:28:29 - INFO - __main__ - model.config.vision_config.image_size: 448
09/15/2024 15:28:29 - INFO - __main__ - [Dataset] num_image_token: 256
09/15/2024 15:28:29 - INFO - __main__ - [Dataset] dynamic_image_size: True
09/15/2024 15:28:29 - INFO - __main__ - [Dataset] use_thumbnail: True
09/15/2024 15:28:29 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6
09/15/2024 15:28:29 - INFO - __main__ - Formatting inputs...Skip in lazy mode
Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00,  1.89s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00,  2.16s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00,  1.90s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00,  2.15s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:08<00:02,  2.84s/it]09/15/2024 15:28:30 - INFO - __main__ - Add dataset: amazon-product-dataset with length: 9996
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w2.lora_B.default.weight
[INFO|trainer.py:571] 2024-09-15 15:28:30,663 >> Using auto half precision backend
Traceback (most recent call last):
  File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 847, in <module>
    main()
  File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 832, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1580, in _inner_training_loop
    num_update_steps_per_epoch = len_dataloader // args.gradient_accumulation_steps
ZeroDivisionError: integer division or modulo by zero
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00,  2.06s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00,  2.35s/it]
Traceback (most recent call last):
  File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 847, in <module>
    main()
  File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 832, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1580, in _inner_training_loop
    num_update_steps_per_epoch = len_dataloader // args.gradient_accumulation_steps
ZeroDivisionError: integer division or modulo by zero
Traceback (most recent call last):
  File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 847, in <module>
    main()
  File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 832, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1580, in _inner_training_loop
    num_update_steps_per_epoch = len_dataloader // args.gradient_accumulation_steps
ZeroDivisionError: integer division or modulo by zero
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
Traceback (most recent call last):
  File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 847, in <module>
    main()
  File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 832, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1580, in _inner_training_loop
    num_update_steps_per_epoch = len_dataloader // args.gradient_accumulation_steps
ZeroDivisionError: integer division or modulo by zero
[2024-09-15 15:28:35,003] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 8326) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-09-15_15:28:35
  host      : 1fe2e16e3880
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 8327)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-09-15_15:28:35
  host      : 1fe2e16e3880
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 8328)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-09-15_15:28:35
  host      : 1fe2e16e3880
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 8329)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-15_15:28:35
  host      : 1fe2e16e3880
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 8326)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[2024-09-15 15:29:04,013] torch.distributed.run: [WARNING] 
[2024-09-15 15:29:04,013] torch.distributed.run: [WARNING] *****************************************
[2024-09-15 15:29:04,013] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-09-15 15:29:04,013] torch.distributed.run: [WARNING] *****************************************
[2024-09-15 15:29:06,305] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:29:06,333] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:29:06,351] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:29:06,402] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
[2024-09-15 15:29:08,338] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-15 15:29:08,338] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-09-15 15:29:08,385] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-15 15:29:08,394] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-15 15:29:08,418] [INFO] [comm.py:637:init_distributed] cdb=None
09/15/2024 15:29:08 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
09/15/2024 15:29:08 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=4,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=/workspace/InternVL/internvl_chat/zero_stage1_config.json,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=0,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=True,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=4e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/runs/Sep15_15-29-08_1fe2e16e3880,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1.0,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_torch,
optim_args=None,
output_dir=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=6,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=200,
save_strategy=steps,
save_total_limit=1,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.03,
warmup_steps=0,
weight_decay=0.01,
)
09/15/2024 15:29:08 - INFO - __main__ - Loading Tokenizer: /workspace/pretrained/InternVL2-8B
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:29:08,448 >> loading file ./tokenizer.model
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:29:08,448 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:29:08,448 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:29:08,448 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:29:08,448 >> loading file tokenizer.json
09/15/2024 15:29:08 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False
09/15/2024 15:29:08 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: False
09/15/2024 15:29:08 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: False
[WARNING|logging.py:314] 2024-09-15 15:29:08,622 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:29:08,639 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:29:08,645 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:29:08,660 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
09/15/2024 15:29:08 - INFO - __main__ - Loading InternVLChatModel...
[INFO|configuration_utils.py:727] 2024-09-15 15:29:08,764 >> loading configuration file /workspace/pretrained/InternVL2-8B/config.json
[INFO|configuration_utils.py:792] 2024-09-15 15:29:08,766 >> Model config InternVLChatConfig {
  "_commit_hash": null,
  "architectures": [
    "InternVLChatModel"
  ],
  "auto_map": {
    "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
    "AutoModel": "modeling_internvl_chat.InternVLChatModel",
    "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel"
  },
  "downsample_ratio": 0.5,
  "dynamic_image_size": true,
  "force_image_size": 448,
  "llm_config": {
    "_name_or_path": "internlm/internlm2_5-7b-chat",
    "add_cross_attention": false,
    "architectures": [
      "InternLM2ForCausalLM"
    ],
    "attn_implementation": "flash_attention_2",
    "auto_map": {
      "AutoConfig": "configuration_internlm2.InternLM2Config",
      "AutoModel": "modeling_internlm2.InternLM2ForCausalLM",
      "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM"
    },
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bias": false,
    "bos_token_id": 1,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 2,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "silu",
    "hidden_size": 4096,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_range": 0.02,
    "intermediate_size": 14336,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "length_penalty": 1.0,
    "max_length": 20,
    "max_position_embeddings": 32768,
    "min_length": 0,
    "model_type": "internlm2",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 32,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_hidden_layers": 32,
    "num_key_value_heads": 8,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": 2,
    "prefix": null,
    "pretraining_tp": 1,
    "problem_type": null,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "rms_norm_eps": 1e-05,
    "rope_scaling": {
      "factor": 2.0,
      "type": "dynamic"
    },
    "rope_theta": 1000000,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": false,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.37.2",
    "typical_p": 1.0,
    "use_bfloat16": true,
    "use_cache": true,
    "vocab_size": 92553
  },
  "max_dynamic_patch": 12,
  "min_dynamic_patch": 1,
  "model_type": "internvl_chat",
  "pad2square": false,
  "ps_version": "v2",
  "select_layer": -1,
  "template": "internlm2-chat",
  "torch_dtype": "bfloat16",
  "transformers_version": null,
  "use_backbone_lora": 0,
  "use_llm_lora": 0,
  "use_thumbnail": true,
  "vision_config": {
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": [
      "InternVisionModel"
    ],
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "drop_path_rate": 0.0,
    "dropout": 0.0,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_size": 1024,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "image_size": 448,
    "initializer_factor": 1.0,
    "initializer_range": 0.02,
    "intermediate_size": 4096,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_eps": 1e-06,
    "length_penalty": 1.0,
    "max_length": 20,
    "min_length": 0,
    "model_type": "intern_vit_6b",
    "no_repeat_ngram_size": 0,
    "norm_type": "layer_norm",
    "num_attention_heads": 16,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_channels": 3,
    "num_hidden_layers": 24,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "patch_size": 14,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "qk_normalization": false,
    "qkv_bias": true,
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.37.2",
    "typical_p": 1.0,
    "use_bfloat16": true,
    "use_flash_attn": true
  }
}

09/15/2024 15:29:08 - INFO - __main__ - Using flash_attention_2 for InternLM
[INFO|modeling_utils.py:3473] 2024-09-15 15:29:08,768 >> loading weights file /workspace/pretrained/InternVL2-8B/model.safetensors.index.json
[INFO|modeling_utils.py:1426] 2024-09-15 15:29:08,768 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:826] 2024-09-15 15:29:08,770 >> Generate config GenerationConfig {}

[INFO|configuration_utils.py:826] 2024-09-15 15:29:08,808 >> Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 2
}

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]Loading checkpoint shards:  25%|██▌       | 1/4 [00:02<00:07,  2.66s/it]Loading checkpoint shards:  25%|██▌       | 1/4 [00:02<00:08,  2.69s/it]Loading checkpoint shards:  25%|██▌       | 1/4 [00:02<00:08,  2.87s/it]Loading checkpoint shards:  25%|██▌       | 1/4 [00:03<00:09,  3.01s/it]Loading checkpoint shards:  50%|█████     | 2/4 [00:05<00:05,  2.61s/it]Loading checkpoint shards:  50%|█████     | 2/4 [00:05<00:05,  2.66s/it]Loading checkpoint shards:  50%|█████     | 2/4 [00:05<00:05,  2.83s/it]Loading checkpoint shards:  50%|█████     | 2/4 [00:05<00:05,  2.96s/it]Loading checkpoint shards:  75%|███████▌  | 3/4 [00:07<00:02,  2.58s/it]Loading checkpoint shards:  75%|███████▌  | 3/4 [00:07<00:02,  2.62s/it]Loading checkpoint shards:  75%|███████▌  | 3/4 [00:08<00:02,  2.80s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00,  1.90s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00,  2.16s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00,  1.92s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00,  2.19s/it]
[INFO|modeling_utils.py:4350] 2024-09-15 15:29:18,959 >> All model checkpoint weights were used when initializing InternVLChatModel.

[INFO|modeling_utils.py:4358] 2024-09-15 15:29:18,959 >> All the weights of InternVLChatModel were initialized from the model checkpoint at /workspace/pretrained/InternVL2-8B.
If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training.
[INFO|configuration_utils.py:779] 2024-09-15 15:29:18,964 >> loading configuration file /workspace/pretrained/InternVL2-8B/generation_config.json
[INFO|configuration_utils.py:826] 2024-09-15 15:29:18,965 >> Generate config GenerationConfig {
  "eos_token_id": [
    92542,
    92543
  ]
}

09/15/2024 15:29:18 - INFO - __main__ - Finished
09/15/2024 15:29:18 - INFO - __main__ - model.config.force_image_size: 448
09/15/2024 15:29:18 - INFO - __main__ - data_args.force_image_size: 448
09/15/2024 15:29:18 - INFO - __main__ - model.config.vision_config.image_size: 448
09/15/2024 15:29:18 - INFO - __main__ - [Dataset] num_image_token: 256
09/15/2024 15:29:18 - INFO - __main__ - [Dataset] dynamic_image_size: True
09/15/2024 15:29:18 - INFO - __main__ - [Dataset] use_thumbnail: True
09/15/2024 15:29:18 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6
09/15/2024 15:29:18 - INFO - __main__ - Formatting inputs...Skip in lazy mode
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:08<00:02,  2.93s/it]09/15/2024 15:29:19 - INFO - __main__ - Add dataset: amazon-product-dataset with length: 9996
Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00,  2.05s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00,  2.33s/it]
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w2.lora_B.default.weight
[INFO|trainer.py:571] 2024-09-15 15:29:19,927 >> Using auto half precision backend
Traceback (most recent call last):
  File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 847, in <module>
    main()
  File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 832, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1580, in _inner_training_loop
    num_update_steps_per_epoch = len_dataloader // args.gradient_accumulation_steps
ZeroDivisionError: integer division or modulo by zero
Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00,  2.14s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00,  2.44s/it]
Traceback (most recent call last):
  File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 847, in <module>
    main()
  File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 832, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1580, in _inner_training_loop
    num_update_steps_per_epoch = len_dataloader // args.gradient_accumulation_steps
ZeroDivisionError: integer division or modulo by zero
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
Traceback (most recent call last):
  File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 847, in <module>
    main()
  File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 832, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1580, in _inner_training_loop
    num_update_steps_per_epoch = len_dataloader // args.gradient_accumulation_steps
ZeroDivisionError: integer division or modulo by zero
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
Traceback (most recent call last):
  File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 847, in <module>
    main()
  File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 832, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1580, in _inner_training_loop
    num_update_steps_per_epoch = len_dataloader // args.gradient_accumulation_steps
ZeroDivisionError: integer division or modulo by zero
[2024-09-15 15:29:24,040] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 8453) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-09-15_15:29:24
  host      : 1fe2e16e3880
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 8454)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-09-15_15:29:24
  host      : 1fe2e16e3880
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 8455)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-09-15_15:29:24
  host      : 1fe2e16e3880
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 8456)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-15_15:29:24
  host      : 1fe2e16e3880
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 8453)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[2024-09-15 15:30:31,684] torch.distributed.run: [WARNING] 
[2024-09-15 15:30:31,684] torch.distributed.run: [WARNING] *****************************************
[2024-09-15 15:30:31,684] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-09-15 15:30:31,684] torch.distributed.run: [WARNING] *****************************************
[2024-09-15 15:30:34,079] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:30:34,121] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:30:34,218] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:30:34,263] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
[2024-09-15 15:30:36,205] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-15 15:30:36,233] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-15 15:30:36,247] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-15 15:30:36,247] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-09-15 15:30:36,271] [INFO] [comm.py:637:init_distributed] cdb=None
09/15/2024 15:30:36 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: False
09/15/2024 15:30:36 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
09/15/2024 15:30:36 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False
09/15/2024 15:30:36 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=4,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=/workspace/InternVL/internvl_chat/zero_stage1_config.json,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=0,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=True,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=4e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/runs/Sep15_15-30-36_1fe2e16e3880,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1.0,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_torch,
optim_args=None,
output_dir=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=200,
save_strategy=steps,
save_total_limit=1,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.03,
warmup_steps=0,
weight_decay=0.01,
)
09/15/2024 15:30:36 - INFO - __main__ - Loading Tokenizer: /workspace/pretrained/InternVL2-8B
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:30:36,322 >> loading file ./tokenizer.model
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:30:36,322 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:30:36,322 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:30:36,322 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:30:36,322 >> loading file tokenizer.json
09/15/2024 15:30:36 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: False
[WARNING|logging.py:314] 2024-09-15 15:30:36,487 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:30:36,511 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:30:36,520 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:30:36,536 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
09/15/2024 15:30:36 - INFO - __main__ - Loading InternVLChatModel...
[INFO|configuration_utils.py:727] 2024-09-15 15:30:36,660 >> loading configuration file /workspace/pretrained/InternVL2-8B/config.json
[INFO|configuration_utils.py:792] 2024-09-15 15:30:36,662 >> Model config InternVLChatConfig {
  "_commit_hash": null,
  "architectures": [
    "InternVLChatModel"
  ],
  "auto_map": {
    "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
    "AutoModel": "modeling_internvl_chat.InternVLChatModel",
    "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel"
  },
  "downsample_ratio": 0.5,
  "dynamic_image_size": true,
  "force_image_size": 448,
  "llm_config": {
    "_name_or_path": "internlm/internlm2_5-7b-chat",
    "add_cross_attention": false,
    "architectures": [
      "InternLM2ForCausalLM"
    ],
    "attn_implementation": "flash_attention_2",
    "auto_map": {
      "AutoConfig": "configuration_internlm2.InternLM2Config",
      "AutoModel": "modeling_internlm2.InternLM2ForCausalLM",
      "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM"
    },
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bias": false,
    "bos_token_id": 1,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 2,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "silu",
    "hidden_size": 4096,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_range": 0.02,
    "intermediate_size": 14336,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "length_penalty": 1.0,
    "max_length": 20,
    "max_position_embeddings": 32768,
    "min_length": 0,
    "model_type": "internlm2",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 32,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_hidden_layers": 32,
    "num_key_value_heads": 8,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": 2,
    "prefix": null,
    "pretraining_tp": 1,
    "problem_type": null,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "rms_norm_eps": 1e-05,
    "rope_scaling": {
      "factor": 2.0,
      "type": "dynamic"
    },
    "rope_theta": 1000000,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": false,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.37.2",
    "typical_p": 1.0,
    "use_bfloat16": true,
    "use_cache": true,
    "vocab_size": 92553
  },
  "max_dynamic_patch": 12,
  "min_dynamic_patch": 1,
  "model_type": "internvl_chat",
  "pad2square": false,
  "ps_version": "v2",
  "select_layer": -1,
  "template": "internlm2-chat",
  "torch_dtype": "bfloat16",
  "transformers_version": null,
  "use_backbone_lora": 0,
  "use_llm_lora": 0,
  "use_thumbnail": true,
  "vision_config": {
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": [
      "InternVisionModel"
    ],
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "drop_path_rate": 0.0,
    "dropout": 0.0,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_size": 1024,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "image_size": 448,
    "initializer_factor": 1.0,
    "initializer_range": 0.02,
    "intermediate_size": 4096,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_eps": 1e-06,
    "length_penalty": 1.0,
    "max_length": 20,
    "min_length": 0,
    "model_type": "intern_vit_6b",
    "no_repeat_ngram_size": 0,
    "norm_type": "layer_norm",
    "num_attention_heads": 16,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_channels": 3,
    "num_hidden_layers": 24,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "patch_size": 14,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "qk_normalization": false,
    "qkv_bias": true,
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.37.2",
    "typical_p": 1.0,
    "use_bfloat16": true,
    "use_flash_attn": true
  }
}

09/15/2024 15:30:36 - INFO - __main__ - Using flash_attention_2 for InternLM
[INFO|modeling_utils.py:3473] 2024-09-15 15:30:36,664 >> loading weights file /workspace/pretrained/InternVL2-8B/model.safetensors.index.json
[INFO|modeling_utils.py:1426] 2024-09-15 15:30:36,664 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:826] 2024-09-15 15:30:36,665 >> Generate config GenerationConfig {}

[INFO|configuration_utils.py:826] 2024-09-15 15:30:36,702 >> Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 2
}

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]Loading checkpoint shards:  25%|██▌       | 1/4 [00:02<00:08,  2.67s/it]Loading checkpoint shards:  25%|██▌       | 1/4 [00:02<00:07,  2.63s/it]Loading checkpoint shards:  25%|██▌       | 1/4 [00:02<00:08,  2.91s/it]Loading checkpoint shards:  25%|██▌       | 1/4 [00:02<00:08,  2.81s/it]Loading checkpoint shards:  50%|█████     | 2/4 [00:05<00:05,  2.62s/it]Loading checkpoint shards:  50%|█████     | 2/4 [00:05<00:05,  2.65s/it]Loading checkpoint shards:  50%|█████     | 2/4 [00:05<00:05,  2.92s/it]Loading checkpoint shards:  50%|█████     | 2/4 [00:05<00:05,  2.85s/it]Loading checkpoint shards:  75%|███████▌  | 3/4 [00:07<00:02,  2.58s/it]Loading checkpoint shards:  75%|███████▌  | 3/4 [00:07<00:02,  2.61s/it]Loading checkpoint shards:  75%|███████▌  | 3/4 [00:08<00:02,  2.77s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00,  1.90s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00,  2.16s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00,  1.91s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00,  2.18s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:08<00:02,  2.91s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00,  2.01s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00,  2.30s/it]
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00,  2.12s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00,  2.41s/it]
[INFO|modeling_utils.py:4350] 2024-09-15 15:30:47,773 >> All model checkpoint weights were used when initializing InternVLChatModel.

[INFO|modeling_utils.py:4358] 2024-09-15 15:30:47,773 >> All the weights of InternVLChatModel were initialized from the model checkpoint at /workspace/pretrained/InternVL2-8B.
If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training.
[INFO|configuration_utils.py:779] 2024-09-15 15:30:47,779 >> loading configuration file /workspace/pretrained/InternVL2-8B/generation_config.json
[INFO|configuration_utils.py:826] 2024-09-15 15:30:47,779 >> Generate config GenerationConfig {
  "eos_token_id": [
    92542,
    92543
  ]
}

09/15/2024 15:30:47 - INFO - __main__ - Finished
09/15/2024 15:30:47 - INFO - __main__ - model.config.force_image_size: 448
09/15/2024 15:30:47 - INFO - __main__ - data_args.force_image_size: 448
09/15/2024 15:30:47 - INFO - __main__ - model.config.vision_config.image_size: 448
09/15/2024 15:30:47 - INFO - __main__ - [Dataset] num_image_token: 256
09/15/2024 15:30:47 - INFO - __main__ - [Dataset] dynamic_image_size: True
09/15/2024 15:30:47 - INFO - __main__ - [Dataset] use_thumbnail: True
09/15/2024 15:30:47 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6
09/15/2024 15:30:47 - INFO - __main__ - Formatting inputs...Skip in lazy mode
Traceback (most recent call last):
  File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 847, in <module>
    main()
  File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 832, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1580, in _inner_training_loop
    num_update_steps_per_epoch = len_dataloader // args.gradient_accumulation_steps
ZeroDivisionError: integer division or modulo by zero
Traceback (most recent call last):
  File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 847, in <module>
    main()
  File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 832, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1580, in _inner_training_loop
    num_update_steps_per_epoch = len_dataloader // args.gradient_accumulation_steps
ZeroDivisionError: integer division or modulo by zero
09/15/2024 15:30:48 - INFO - __main__ - Add dataset: amazon-product-dataset with length: 9996
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
Traceback (most recent call last):
  File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 847, in <module>
    main()
  File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 832, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1580, in _inner_training_loop
    num_update_steps_per_epoch = len_dataloader // args.gradient_accumulation_steps
ZeroDivisionError: integer division or modulo by zero
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w2.lora_B.default.weight
[INFO|trainer.py:571] 2024-09-15 15:30:48,754 >> Using auto half precision backend
Traceback (most recent call last):
  File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 847, in <module>
    main()
  File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 832, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1580, in _inner_training_loop
    num_update_steps_per_epoch = len_dataloader // args.gradient_accumulation_steps
ZeroDivisionError: integer division or modulo by zero
[2024-09-15 15:30:51,713] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 8574) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-09-15_15:30:51
  host      : 1fe2e16e3880
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 8575)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-09-15_15:30:51
  host      : 1fe2e16e3880
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 8576)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-09-15_15:30:51
  host      : 1fe2e16e3880
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 8577)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-15_15:30:51
  host      : 1fe2e16e3880
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 8574)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[2024-09-15 15:32:34,497] torch.distributed.run: [WARNING] 
[2024-09-15 15:32:34,497] torch.distributed.run: [WARNING] *****************************************
[2024-09-15 15:32:34,497] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-09-15 15:32:34,497] torch.distributed.run: [WARNING] *****************************************
[2024-09-15 15:32:36,778] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:32:36,787] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:32:36,843] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:32:36,849] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
[2024-09-15 15:32:38,756] [INFO] [comm.py:637:init_distributed] cdb=None
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
[2024-09-15 15:32:38,813] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-15 15:32:38,822] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-15 15:32:38,847] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-15 15:32:38,847] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
09/15/2024 15:32:38 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: False
09/15/2024 15:32:38 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False
09/15/2024 15:32:38 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: False
09/15/2024 15:32:38 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
09/15/2024 15:32:38 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=4,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=/workspace/InternVL/internvl_chat/zero_stage1_config.json,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=True,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=4e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/runs/Sep15_15-32-38_1fe2e16e3880,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1.0,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_torch,
optim_args=None,
output_dir=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=200,
save_strategy=steps,
save_total_limit=1,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.03,
warmup_steps=0,
weight_decay=0.01,
)
09/15/2024 15:32:38 - INFO - __main__ - Loading Tokenizer: /workspace/pretrained/InternVL2-8B
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:32:38,917 >> loading file ./tokenizer.model
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:32:38,917 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:32:38,917 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:32:38,917 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:32:38,917 >> loading file tokenizer.json
[WARNING|logging.py:314] 2024-09-15 15:32:39,038 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:32:39,068 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:32:39,090 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:32:39,103 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
09/15/2024 15:32:39 - INFO - __main__ - Loading InternVLChatModel...
[INFO|configuration_utils.py:727] 2024-09-15 15:32:39,253 >> loading configuration file /workspace/pretrained/InternVL2-8B/config.json
[INFO|configuration_utils.py:792] 2024-09-15 15:32:39,255 >> Model config InternVLChatConfig {
  "_commit_hash": null,
  "architectures": [
    "InternVLChatModel"
  ],
  "auto_map": {
    "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
    "AutoModel": "modeling_internvl_chat.InternVLChatModel",
    "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel"
  },
  "downsample_ratio": 0.5,
  "dynamic_image_size": true,
  "force_image_size": 448,
  "llm_config": {
    "_name_or_path": "internlm/internlm2_5-7b-chat",
    "add_cross_attention": false,
    "architectures": [
      "InternLM2ForCausalLM"
    ],
    "attn_implementation": "flash_attention_2",
    "auto_map": {
      "AutoConfig": "configuration_internlm2.InternLM2Config",
      "AutoModel": "modeling_internlm2.InternLM2ForCausalLM",
      "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM"
    },
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bias": false,
    "bos_token_id": 1,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 2,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "silu",
    "hidden_size": 4096,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_range": 0.02,
    "intermediate_size": 14336,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "length_penalty": 1.0,
    "max_length": 20,
    "max_position_embeddings": 32768,
    "min_length": 0,
    "model_type": "internlm2",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 32,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_hidden_layers": 32,
    "num_key_value_heads": 8,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": 2,
    "prefix": null,
    "pretraining_tp": 1,
    "problem_type": null,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "rms_norm_eps": 1e-05,
    "rope_scaling": {
      "factor": 2.0,
      "type": "dynamic"
    },
    "rope_theta": 1000000,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": false,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.37.2",
    "typical_p": 1.0,
    "use_bfloat16": true,
    "use_cache": true,
    "vocab_size": 92553
  },
  "max_dynamic_patch": 12,
  "min_dynamic_patch": 1,
  "model_type": "internvl_chat",
  "pad2square": false,
  "ps_version": "v2",
  "select_layer": -1,
  "template": "internlm2-chat",
  "torch_dtype": "bfloat16",
  "transformers_version": null,
  "use_backbone_lora": 0,
  "use_llm_lora": 0,
  "use_thumbnail": true,
  "vision_config": {
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": [
      "InternVisionModel"
    ],
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "drop_path_rate": 0.0,
    "dropout": 0.0,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_size": 1024,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "image_size": 448,
    "initializer_factor": 1.0,
    "initializer_range": 0.02,
    "intermediate_size": 4096,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_eps": 1e-06,
    "length_penalty": 1.0,
    "max_length": 20,
    "min_length": 0,
    "model_type": "intern_vit_6b",
    "no_repeat_ngram_size": 0,
    "norm_type": "layer_norm",
    "num_attention_heads": 16,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_channels": 3,
    "num_hidden_layers": 24,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "patch_size": 14,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "qk_normalization": false,
    "qkv_bias": true,
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.37.2",
    "typical_p": 1.0,
    "use_bfloat16": true,
    "use_flash_attn": true
  }
}

09/15/2024 15:32:39 - INFO - __main__ - Using flash_attention_2 for InternLM
[INFO|modeling_utils.py:3473] 2024-09-15 15:32:39,256 >> loading weights file /workspace/pretrained/InternVL2-8B/model.safetensors.index.json
[INFO|modeling_utils.py:1426] 2024-09-15 15:32:39,257 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:826] 2024-09-15 15:32:39,258 >> Generate config GenerationConfig {}

[INFO|configuration_utils.py:826] 2024-09-15 15:32:39,297 >> Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 2
}

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]Loading checkpoint shards:  25%|██▌       | 1/4 [00:02<00:08,  2.73s/it]Loading checkpoint shards:  25%|██▌       | 1/4 [00:02<00:08,  2.94s/it]Loading checkpoint shards:  25%|██▌       | 1/4 [00:02<00:08,  2.99s/it]Loading checkpoint shards:  25%|██▌       | 1/4 [00:03<00:09,  3.03s/it]Loading checkpoint shards:  50%|█████     | 2/4 [00:05<00:05,  2.68s/it]Loading checkpoint shards:  50%|█████     | 2/4 [00:05<00:05,  2.84s/it]Loading checkpoint shards:  50%|█████     | 2/4 [00:05<00:05,  2.85s/it]Loading checkpoint shards:  50%|█████     | 2/4 [00:05<00:05,  2.98s/it]Loading checkpoint shards:  75%|███████▌  | 3/4 [00:07<00:02,  2.63s/it]Loading checkpoint shards:  75%|███████▌  | 3/4 [00:08<00:02,  2.78s/it]Loading checkpoint shards:  75%|███████▌  | 3/4 [00:08<00:02,  2.77s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00,  1.93s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00,  2.20s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:08<00:02,  2.95s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00,  2.05s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00,  2.34s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00,  2.04s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00,  2.34s/it]
[INFO|modeling_utils.py:4350] 2024-09-15 15:32:50,042 >> All model checkpoint weights were used when initializing InternVLChatModel.

[INFO|modeling_utils.py:4358] 2024-09-15 15:32:50,042 >> All the weights of InternVLChatModel were initialized from the model checkpoint at /workspace/pretrained/InternVL2-8B.
If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training.
[INFO|configuration_utils.py:779] 2024-09-15 15:32:50,048 >> loading configuration file /workspace/pretrained/InternVL2-8B/generation_config.json
[INFO|configuration_utils.py:826] 2024-09-15 15:32:50,048 >> Generate config GenerationConfig {
  "eos_token_id": [
    92542,
    92543
  ]
}

09/15/2024 15:32:50 - INFO - __main__ - Finished
09/15/2024 15:32:50 - INFO - __main__ - model.config.force_image_size: 448
09/15/2024 15:32:50 - INFO - __main__ - data_args.force_image_size: 448
09/15/2024 15:32:50 - INFO - __main__ - model.config.vision_config.image_size: 448
09/15/2024 15:32:50 - INFO - __main__ - [Dataset] num_image_token: 256
09/15/2024 15:32:50 - INFO - __main__ - [Dataset] dynamic_image_size: True
09/15/2024 15:32:50 - INFO - __main__ - [Dataset] use_thumbnail: True
09/15/2024 15:32:50 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6
09/15/2024 15:32:50 - INFO - __main__ - Formatting inputs...Skip in lazy mode
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
09/15/2024 15:32:50 - INFO - __main__ - Add dataset: amazon-product-dataset with length: 9996
Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00,  2.15s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00,  2.45s/it]
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wo.lora_A.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wo.lora_B.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wo.lora_A.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wo.lora_B.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wo.lora_A.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wo.lora_B.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wo.lora_A.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wo.lora_B.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wo.lora_A.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wo.lora_B.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w2.lora_B.default.weight
[INFO|trainer.py:571] 2024-09-15 15:32:51,022 >> Using auto half precision backend
[2024-09-15 15:32:51,228] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.13.5, git-hash=unknown, git-branch=unknown
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
[2024-09-15 15:32:54,896] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu118/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.0755770206451416 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10229706764221191 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10167860984802246 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10164499282836914 seconds
[2024-09-15 15:32:55,650] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2024-09-15 15:32:55,651] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-09-15 15:32:55,691] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2024-09-15 15:32:55,691] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2024-09-15 15:32:55,691] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 1 optimizer
[2024-09-15 15:32:55,692] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 1000000000
[2024-09-15 15:32:55,692] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 1000000000
[2024-09-15 15:32:55,692] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False
[2024-09-15 15:32:55,692] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False
[2024-09-15 15:32:55,968] [INFO] [utils.py:800:see_memory_usage] Before initializing optimizer states
[2024-09-15 15:32:55,969] [INFO] [utils.py:801:see_memory_usage] MA 15.65 GB         Max_MA 15.67 GB         CA 15.85 GB         Max_CA 16 GB 
[2024-09-15 15:32:55,969] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 90.14 GB, percent = 9.5%
[2024-09-15 15:32:56,150] [INFO] [utils.py:800:see_memory_usage] After initializing optimizer states
[2024-09-15 15:32:56,151] [INFO] [utils.py:801:see_memory_usage] MA 15.65 GB         Max_MA 15.68 GB         CA 15.89 GB         Max_CA 16 GB 
[2024-09-15 15:32:56,151] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 90.19 GB, percent = 9.5%
[2024-09-15 15:32:56,151] [INFO] [stage_1_and_2.py:539:__init__] optimizer state initialized
[2024-09-15 15:32:56,333] [INFO] [utils.py:800:see_memory_usage] After initializing ZeRO optimizer
[2024-09-15 15:32:56,334] [INFO] [utils.py:801:see_memory_usage] MA 15.65 GB         Max_MA 15.65 GB         CA 15.89 GB         Max_CA 16 GB 
[2024-09-15 15:32:56,334] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 90.2 GB, percent = 9.6%
[2024-09-15 15:32:56,336] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
[2024-09-15 15:32:56,336] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client callable to create LR scheduler
[2024-09-15 15:32:56,336] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7ff956fdf4f0>
[2024-09-15 15:32:56,336] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[[0.9, 0.999]]
[2024-09-15 15:32:56,341] [INFO] [config.py:996:print] DeepSpeedEngine configuration:
[2024-09-15 15:32:56,341] [INFO] [config.py:1000:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2024-09-15 15:32:56,341] [INFO] [config.py:1000:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-09-15 15:32:56,341] [INFO] [config.py:1000:print]   amp_enabled .................. False
[2024-09-15 15:32:56,341] [INFO] [config.py:1000:print]   amp_params ................... False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   bfloat16_enabled ............. True
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   bfloat16_immediate_grad_update  False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   checkpoint_parallel_write_pipeline  False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   checkpoint_tag_validation_enabled  True
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   checkpoint_tag_validation_fail  False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7ff956fdea70>
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   communication_data_type ...... None
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   compile_config ............... enabled=False backend='inductor' kwargs={}
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   curriculum_enabled_legacy .... False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   curriculum_params_legacy ..... False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   data_efficiency_enabled ...... False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   dataloader_drop_last ......... False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   disable_allgather ............ False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   dump_state ................... False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   dynamic_loss_scale_args ...... None
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   eigenvalue_enabled ........... False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   eigenvalue_gas_boundary_resolution  1
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   eigenvalue_layer_num ......... 0
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   eigenvalue_max_iter .......... 100
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   eigenvalue_stability ......... 1e-06
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   eigenvalue_tol ............... 0.01
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   eigenvalue_verbose ........... False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   elasticity_enabled ........... False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   fp16_auto_cast ............... None
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   fp16_enabled ................. False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   fp16_master_weights_and_gradients  False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   global_rank .................. 0
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   grad_accum_dtype ............. None
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   gradient_accumulation_steps .. 1
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   gradient_clipping ............ 1.0
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   gradient_predivide_factor .... 1.0
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   graph_harvesting ............. False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   initial_dynamic_scale ........ 1
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print]   load_universal_checkpoint .... False
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print]   loss_scale ................... 1.0
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print]   memory_breakdown ............. False
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print]   mics_hierarchial_params_gather  False
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print]   mics_shard_size .............. -1
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print]   optimizer_legacy_fusion ...... False
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print]   optimizer_name ............... adamw
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print]   optimizer_params ............. {'lr': 4e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.01}
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print]   pld_enabled .................. False
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print]   pld_params ................... False
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print]   prescale_gradients ........... False
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print]   scheduler_name ............... None
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print]   scheduler_params ............. None
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print]   seq_parallel_communication_data_type  torch.float32
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print]   sparse_attention ............. None
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print]   sparse_gradients_enabled ..... False
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print]   steps_per_print .............. inf
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print]   train_batch_size ............. 32
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print]   train_micro_batch_size_per_gpu  8
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print]   use_data_before_expert_parallel_  False
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print]   use_node_local_storage ....... False
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print]   wall_clock_breakdown ......... True
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print]   weight_quantization_config ... None
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print]   world_size ................... 4
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print]   zero_allow_untested_optimizer  False
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print]   zero_config .................. stage=1 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print]   zero_enabled ................. True
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print]   zero_force_ds_cpu_optimizer .. True
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print]   zero_optimization_stage ...... 1
[2024-09-15 15:32:56,343] [INFO] [config.py:986:print_user_config]   json = {
    "zero_optimization": {
        "stage": 1, 
        "allgather_partitions": true, 
        "allgather_bucket_size": 1.000000e+09, 
        "overlap_comm": true, 
        "reduce_scatter": true, 
        "reduce_bucket_size": 1.000000e+09, 
        "contiguous_gradients": true
    }, 
    "fp16": {
        "enabled": false, 
        "auto_cast": true, 
        "loss_scale": 0, 
        "initial_scale_power": 32, 
        "loss_scale_window": 1000, 
        "hysteresis": 2, 
        "min_loss_scale": 1
    }, 
    "bf16": {
        "enabled": true
    }, 
    "optimizer": {
        "type": "AdamW", 
        "params": {
            "lr": 4e-05, 
            "betas": [0.9, 0.999], 
            "eps": 1e-08, 
            "weight_decay": 0.01
        }
    }, 
    "gradient_accumulation_steps": 1, 
    "gradient_clipping": 1.0, 
    "steps_per_print": inf, 
    "train_batch_size": 32, 
    "train_micro_batch_size_per_gpu": 8, 
    "wall_clock_breakdown": true
}
[INFO|trainer.py:1721] 2024-09-15 15:32:56,344 >> ***** Running training *****
[INFO|trainer.py:1722] 2024-09-15 15:32:56,344 >>   Num examples = 9,996
[INFO|trainer.py:1723] 2024-09-15 15:32:56,344 >>   Num Epochs = 3
[INFO|trainer.py:1724] 2024-09-15 15:32:56,344 >>   Instantaneous batch size per device = 8
[INFO|trainer.py:1727] 2024-09-15 15:32:56,344 >>   Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:1728] 2024-09-15 15:32:56,344 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1729] 2024-09-15 15:32:56,344 >>   Total optimization steps = 939
[INFO|trainer.py:1730] 2024-09-15 15:32:56,349 >>   Number of trainable parameters = 37,748,736
  0%|          | 0/939 [00:00<?, ?it/s][2024-09-15 15:32:59,158] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:32:59,193] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:32:59,213] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:32:59,225] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:33:02,859] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:33:02,965] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:33:03,026] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:33:03,062] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:33:06,624] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:33:06,776] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:33:06,801] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:33:06,830] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:33:10,440] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:33:10,517] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:33:10,597] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:33:10,632] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 15:33:18,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.42 | optimizer_gradients: 3.47 | optimizer_step: 1.26
[2024-09-15 15:33:18,172] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1969.21 | bwd_microstep: 2927.51 | bwd_inner_microstep: 2581.46 | bwd_allreduce_microstep: 345.98 | step_microstep: 22.63
[2024-09-15 15:33:18,172] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 1969.10 | bwd: 2927.51 | bwd_inner: 2581.45 | bwd_allreduce: 346.00 | step: 22.64
  0%|          | 1/939 [00:21<5:41:15, 21.83s/it]                                                 {'loss': 3.6312, 'learning_rate': 1.3793103448275862e-06, 'epoch': 0.0}
  0%|          | 1/939 [00:21<5:41:15, 21.83s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 15:33:24,406] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.53 | optimizer_step: 0.39
[2024-09-15 15:33:24,406] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1448.78 | bwd_microstep: 4746.95 | bwd_inner_microstep: 2555.43 | bwd_allreduce_microstep: 2191.44 | step_microstep: 9.14
[2024-09-15 15:33:24,407] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 1448.77 | bwd: 4746.97 | bwd_inner: 2555.43 | bwd_allreduce: 2191.47 | step: 9.14
  0%|          | 2/939 [00:28<3:17:40, 12.66s/it]                                                 {'loss': 3.8409, 'learning_rate': 2.7586206896551725e-06, 'epoch': 0.01}
  0%|          | 2/939 [00:28<3:17:40, 12.66s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1900
[2024-09-15 15:33:29,778] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.55 | optimizer_gradients: 0.38 | optimizer_step: 0.40
[2024-09-15 15:33:29,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1908.35 | bwd_microstep: 3423.61 | bwd_inner_microstep: 3411.14 | bwd_allreduce_microstep: 12.42 | step_microstep: 8.00
[2024-09-15 15:33:29,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 1908.34 | bwd: 3423.62 | bwd_inner: 3411.14 | bwd_allreduce: 12.43 | step: 8.01
  0%|          | 3/939 [00:33<2:25:32,  9.33s/it]                                                 {'loss': 3.7117, 'learning_rate': 4.137931034482759e-06, 'epoch': 0.01}
  0%|          | 3/939 [00:33<2:25:32,  9.33s/it]dynamic ViT batch size: 41, images per sample: 5.125, dynamic token length: 1876
[2024-09-15 15:33:35,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.46 | optimizer_gradients: 0.37 | optimizer_step: 0.40
[2024-09-15 15:33:35,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1870.03 | bwd_microstep: 3385.96 | bwd_inner_microstep: 3373.86 | bwd_allreduce_microstep: 12.05 | step_microstep: 7.73
[2024-09-15 15:33:35,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 1870.02 | bwd: 3385.97 | bwd_inner: 3373.86 | bwd_allreduce: 12.07 | step: 7.73
  0%|          | 4/939 [00:38<2:00:33,  7.74s/it]                                                 {'loss': 3.821, 'learning_rate': 5.517241379310345e-06, 'epoch': 0.01}
  0%|          | 4/939 [00:38<2:00:33,  7.74s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 15:33:41,310] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.45 | optimizer_gradients: 0.42 | optimizer_step: 0.40
[2024-09-15 15:33:41,311] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1880.53 | bwd_microstep: 4320.93 | bwd_inner_microstep: 3392.51 | bwd_allreduce_microstep: 928.31 | step_microstep: 7.61
[2024-09-15 15:33:41,311] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 1880.52 | bwd: 4320.94 | bwd_inner: 3392.54 | bwd_allreduce: 928.33 | step: 7.61
  1%|          | 5/939 [00:44<1:52:01,  7.20s/it]                                                 {'loss': 3.8334, 'learning_rate': 6.896551724137932e-06, 'epoch': 0.02}
  1%|          | 5/939 [00:44<1:52:01,  7.20s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 15:33:46,637] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.43 | optimizer_gradients: 0.38 | optimizer_step: 0.39
[2024-09-15 15:33:46,638] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1882.99 | bwd_microstep: 3407.10 | bwd_inner_microstep: 3394.88 | bwd_allreduce_microstep: 12.17 | step_microstep: 7.55
[2024-09-15 15:33:46,638] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 1882.98 | bwd: 3407.11 | bwd_inner: 3394.88 | bwd_allreduce: 12.19 | step: 7.55
  1%|          | 6/939 [00:50<1:42:01,  6.56s/it]                                                 {'loss': 3.6848, 'learning_rate': 8.275862068965518e-06, 'epoch': 0.02}
  1%|          | 6/939 [00:50<1:42:01,  6.56s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1900
[2024-09-15 15:33:52,070] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.45 | optimizer_gradients: 0.37 | optimizer_step: 0.39
[2024-09-15 15:33:52,071] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1931.11 | bwd_microstep: 3458.90 | bwd_inner_microstep: 3446.59 | bwd_allreduce_microstep: 12.26 | step_microstep: 8.18
[2024-09-15 15:33:52,071] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 1931.10 | bwd: 3458.91 | bwd_inner: 3446.59 | bwd_allreduce: 12.28 | step: 8.18
  1%|          | 7/939 [00:55<1:36:10,  6.19s/it]                                                 {'loss': 3.4049, 'learning_rate': 9.655172413793105e-06, 'epoch': 0.02}
  1%|          | 7/939 [00:55<1:36:10,  6.19s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1877
[2024-09-15 15:33:57,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.47 | optimizer_gradients: 0.38 | optimizer_step: 0.40
[2024-09-15 15:33:57,455] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1910.90 | bwd_microstep: 3435.02 | bwd_inner_microstep: 3419.80 | bwd_allreduce_microstep: 15.17 | step_microstep: 7.83
[2024-09-15 15:33:57,455] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 1910.89 | bwd: 3435.03 | bwd_inner: 3419.80 | bwd_allreduce: 15.18 | step: 7.83
  1%|          | 8/939 [01:01<1:32:05,  5.93s/it]                                                 {'loss': 3.8125, 'learning_rate': 1.103448275862069e-05, 'epoch': 0.03}
  1%|          | 8/939 [01:01<1:32:05,  5.93s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 15:34:02,762] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.27 | optimizer_gradients: 0.42 | optimizer_step: 0.43
[2024-09-15 15:34:02,763] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1456.94 | bwd_microstep: 3813.71 | bwd_inner_microstep: 2564.04 | bwd_allreduce_microstep: 1249.62 | step_microstep: 7.73
[2024-09-15 15:34:02,763] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 1456.93 | bwd: 3813.73 | bwd_inner: 2564.04 | bwd_allreduce: 1249.64 | step: 7.73
  1%|          | 9/939 [01:06<1:28:57,  5.74s/it]                                                 {'loss': 3.4812, 'learning_rate': 1.2413793103448277e-05, 'epoch': 0.03}
  1%|          | 9/939 [01:06<1:28:57,  5.74s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1881
[2024-09-15 15:34:08,130] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.53 | optimizer_gradients: 0.36 | optimizer_step: 0.38
[2024-09-15 15:34:08,131] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1898.02 | bwd_microstep: 3427.46 | bwd_inner_microstep: 3415.02 | bwd_allreduce_microstep: 12.39 | step_microstep: 7.79
[2024-09-15 15:34:08,131] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 1898.01 | bwd: 3427.47 | bwd_inner: 3415.02 | bwd_allreduce: 12.40 | step: 7.80
  1%|          | 10/939 [01:11<1:27:04,  5.62s/it]                                                  {'loss': 3.5974, 'learning_rate': 1.3793103448275863e-05, 'epoch': 0.03}
  1%|          | 10/939 [01:11<1:27:04,  5.62s/it][2024-09-15 15:34:46,456] torch.distributed.run: [WARNING] 
[2024-09-15 15:34:46,456] torch.distributed.run: [WARNING] *****************************************
[2024-09-15 15:34:46,456] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-09-15 15:34:46,456] torch.distributed.run: [WARNING] *****************************************
[2024-09-15 15:34:48,796] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:34:48,855] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:34:48,856] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:34:48,857] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
[2024-09-15 15:34:50,926] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-15 15:34:50,947] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-15 15:34:50,947] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-09-15 15:34:51,008] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-15 15:34:51,010] [INFO] [comm.py:637:init_distributed] cdb=None
09/15/2024 15:34:51 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False
09/15/2024 15:34:51 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
09/15/2024 15:34:51 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=4,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=/workspace/InternVL/internvl_chat/zero_stage1_config.json,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=2,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=True,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=4e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/runs/Sep15_15-34-50_1fe2e16e3880,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1.0,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_torch,
optim_args=None,
output_dir=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=200,
save_strategy=steps,
save_total_limit=1,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.03,
warmup_steps=0,
weight_decay=0.01,
)
09/15/2024 15:34:51 - INFO - __main__ - Loading Tokenizer: /workspace/pretrained/InternVL2-8B
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:34:51,047 >> loading file ./tokenizer.model
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:34:51,047 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:34:51,047 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:34:51,047 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:34:51,047 >> loading file tokenizer.json
09/15/2024 15:34:51 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: False
09/15/2024 15:34:51 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: False
[WARNING|logging.py:314] 2024-09-15 15:34:51,204 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:34:51,236 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:34:51,253 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:34:51,255 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
09/15/2024 15:34:51 - INFO - __main__ - Loading InternVLChatModel...
[INFO|configuration_utils.py:727] 2024-09-15 15:34:51,387 >> loading configuration file /workspace/pretrained/InternVL2-8B/config.json
[INFO|configuration_utils.py:792] 2024-09-15 15:34:51,389 >> Model config InternVLChatConfig {
  "_commit_hash": null,
  "architectures": [
    "InternVLChatModel"
  ],
  "auto_map": {
    "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
    "AutoModel": "modeling_internvl_chat.InternVLChatModel",
    "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel"
  },
  "downsample_ratio": 0.5,
  "dynamic_image_size": true,
  "force_image_size": 448,
  "llm_config": {
    "_name_or_path": "internlm/internlm2_5-7b-chat",
    "add_cross_attention": false,
    "architectures": [
      "InternLM2ForCausalLM"
    ],
    "attn_implementation": "flash_attention_2",
    "auto_map": {
      "AutoConfig": "configuration_internlm2.InternLM2Config",
      "AutoModel": "modeling_internlm2.InternLM2ForCausalLM",
      "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM"
    },
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bias": false,
    "bos_token_id": 1,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 2,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "silu",
    "hidden_size": 4096,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_range": 0.02,
    "intermediate_size": 14336,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "length_penalty": 1.0,
    "max_length": 20,
    "max_position_embeddings": 32768,
    "min_length": 0,
    "model_type": "internlm2",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 32,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_hidden_layers": 32,
    "num_key_value_heads": 8,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": 2,
    "prefix": null,
    "pretraining_tp": 1,
    "problem_type": null,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "rms_norm_eps": 1e-05,
    "rope_scaling": {
      "factor": 2.0,
      "type": "dynamic"
    },
    "rope_theta": 1000000,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": false,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.37.2",
    "typical_p": 1.0,
    "use_bfloat16": true,
    "use_cache": true,
    "vocab_size": 92553
  },
  "max_dynamic_patch": 12,
  "min_dynamic_patch": 1,
  "model_type": "internvl_chat",
  "pad2square": false,
  "ps_version": "v2",
  "select_layer": -1,
  "template": "internlm2-chat",
  "torch_dtype": "bfloat16",
  "transformers_version": null,
  "use_backbone_lora": 0,
  "use_llm_lora": 0,
  "use_thumbnail": true,
  "vision_config": {
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": [
      "InternVisionModel"
    ],
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "drop_path_rate": 0.0,
    "dropout": 0.0,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_size": 1024,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "image_size": 448,
    "initializer_factor": 1.0,
    "initializer_range": 0.02,
    "intermediate_size": 4096,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_eps": 1e-06,
    "length_penalty": 1.0,
    "max_length": 20,
    "min_length": 0,
    "model_type": "intern_vit_6b",
    "no_repeat_ngram_size": 0,
    "norm_type": "layer_norm",
    "num_attention_heads": 16,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_channels": 3,
    "num_hidden_layers": 24,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "patch_size": 14,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "qk_normalization": false,
    "qkv_bias": true,
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.37.2",
    "typical_p": 1.0,
    "use_bfloat16": true,
    "use_flash_attn": true
  }
}

09/15/2024 15:34:51 - INFO - __main__ - Using flash_attention_2 for InternLM
[INFO|modeling_utils.py:3473] 2024-09-15 15:34:51,390 >> loading weights file /workspace/pretrained/InternVL2-8B/model.safetensors.index.json
[INFO|modeling_utils.py:1426] 2024-09-15 15:34:51,391 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:826] 2024-09-15 15:34:51,392 >> Generate config GenerationConfig {}

[INFO|configuration_utils.py:826] 2024-09-15 15:34:51,429 >> Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 2
}

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]Loading checkpoint shards:  25%|██▌       | 1/4 [00:02<00:08,  2.68s/it]Loading checkpoint shards:  25%|██▌       | 1/4 [00:02<00:08,  2.68s/it]Loading checkpoint shards:  25%|██▌       | 1/4 [00:02<00:08,  2.99s/it]Loading checkpoint shards:  25%|██▌       | 1/4 [00:02<00:08,  2.95s/it]Loading checkpoint shards:  50%|█████     | 2/4 [00:05<00:05,  2.64s/it]Loading checkpoint shards:  50%|█████     | 2/4 [00:05<00:05,  2.64s/it]Loading checkpoint shards:  50%|█████     | 2/4 [00:05<00:05,  2.96s/it]Loading checkpoint shards:  50%|█████     | 2/4 [00:05<00:05,  2.90s/it]Loading checkpoint shards:  75%|███████▌  | 3/4 [00:07<00:02,  2.60s/it]Loading checkpoint shards:  75%|███████▌  | 3/4 [00:07<00:02,  2.60s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00,  1.90s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00,  2.17s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00,  1.91s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00,  2.17s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:08<00:02,  2.85s/it]Loading checkpoint shards:  75%|███████▌  | 3/4 [00:08<00:02,  2.94s/it]trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00,  2.09s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00,  2.38s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00,  2.15s/it]Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00,  2.45s/it]
[INFO|modeling_utils.py:4350] 2024-09-15 15:35:02,655 >> All model checkpoint weights were used when initializing InternVLChatModel.

[INFO|modeling_utils.py:4358] 2024-09-15 15:35:02,656 >> All the weights of InternVLChatModel were initialized from the model checkpoint at /workspace/pretrained/InternVL2-8B.
If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training.
[INFO|configuration_utils.py:779] 2024-09-15 15:35:02,661 >> loading configuration file /workspace/pretrained/InternVL2-8B/generation_config.json
[INFO|configuration_utils.py:826] 2024-09-15 15:35:02,661 >> Generate config GenerationConfig {
  "eos_token_id": [
    92542,
    92543
  ]
}

09/15/2024 15:35:02 - INFO - __main__ - Finished
09/15/2024 15:35:02 - INFO - __main__ - model.config.force_image_size: 448
09/15/2024 15:35:02 - INFO - __main__ - data_args.force_image_size: 448
09/15/2024 15:35:02 - INFO - __main__ - model.config.vision_config.image_size: 448
09/15/2024 15:35:02 - INFO - __main__ - [Dataset] num_image_token: 256
09/15/2024 15:35:02 - INFO - __main__ - [Dataset] dynamic_image_size: True
09/15/2024 15:35:02 - INFO - __main__ - [Dataset] use_thumbnail: True
09/15/2024 15:35:02 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6
09/15/2024 15:35:02 - INFO - __main__ - Formatting inputs...Skip in lazy mode
09/15/2024 15:35:03 - INFO - __main__ - Add dataset: amazon-product-dataset with length: 9996
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w2.lora_B.default.weight
[INFO|trainer.py:571] 2024-09-15 15:35:03,664 >> Using auto half precision backend
[2024-09-15 15:35:03,871] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.13.5, git-hash=unknown, git-branch=unknown
[2024-09-15 15:35:07,120] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu118/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.06660223007202148 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10187005996704102 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10172486305236816 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10187482833862305 seconds
[2024-09-15 15:35:07,875] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2024-09-15 15:35:07,875] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-09-15 15:35:07,915] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2024-09-15 15:35:07,915] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2024-09-15 15:35:07,915] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 1 optimizer
[2024-09-15 15:35:07,915] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 1000000000
[2024-09-15 15:35:07,915] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 1000000000
[2024-09-15 15:35:07,915] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False
[2024-09-15 15:35:07,915] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False
[2024-09-15 15:35:08,212] [INFO] [utils.py:800:see_memory_usage] Before initializing optimizer states
[2024-09-15 15:35:08,213] [INFO] [utils.py:801:see_memory_usage] MA 15.65 GB         Max_MA 15.67 GB         CA 15.85 GB         Max_CA 16 GB 
[2024-09-15 15:35:08,213] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 83.43 GB, percent = 8.8%
[2024-09-15 15:35:08,404] [INFO] [utils.py:800:see_memory_usage] After initializing optimizer states
[2024-09-15 15:35:08,405] [INFO] [utils.py:801:see_memory_usage] MA 15.65 GB         Max_MA 15.68 GB         CA 15.89 GB         Max_CA 16 GB 
[2024-09-15 15:35:08,406] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 83.54 GB, percent = 8.8%
[2024-09-15 15:35:08,406] [INFO] [stage_1_and_2.py:539:__init__] optimizer state initialized
[2024-09-15 15:35:08,586] [INFO] [utils.py:800:see_memory_usage] After initializing ZeRO optimizer
[2024-09-15 15:35:08,587] [INFO] [utils.py:801:see_memory_usage] MA 15.65 GB         Max_MA 15.65 GB         CA 15.89 GB         Max_CA 16 GB 
[2024-09-15 15:35:08,587] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 83.57 GB, percent = 8.8%
[2024-09-15 15:35:08,589] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
[2024-09-15 15:35:08,590] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client callable to create LR scheduler
[2024-09-15 15:35:08,590] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7f124e5eb550>
[2024-09-15 15:35:08,590] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[[0.9, 0.999]]
[2024-09-15 15:35:08,594] [INFO] [config.py:996:print] DeepSpeedEngine configuration:
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print]   amp_enabled .................. False
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print]   amp_params ................... False
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print]   bfloat16_enabled ............. True
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print]   bfloat16_immediate_grad_update  False
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print]   checkpoint_parallel_write_pipeline  False
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print]   checkpoint_tag_validation_enabled  True
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print]   checkpoint_tag_validation_fail  False
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f124e5eaad0>
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print]   communication_data_type ...... None
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print]   compile_config ............... enabled=False backend='inductor' kwargs={}
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print]   curriculum_enabled_legacy .... False
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print]   curriculum_params_legacy ..... False
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print]   data_efficiency_enabled ...... False
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print]   dataloader_drop_last ......... False
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print]   disable_allgather ............ False
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print]   dump_state ................... False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   dynamic_loss_scale_args ...... None
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   eigenvalue_enabled ........... False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   eigenvalue_gas_boundary_resolution  1
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   eigenvalue_layer_num ......... 0
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   eigenvalue_max_iter .......... 100
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   eigenvalue_stability ......... 1e-06
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   eigenvalue_tol ............... 0.01
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   eigenvalue_verbose ........... False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   elasticity_enabled ........... False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   fp16_auto_cast ............... None
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   fp16_enabled ................. False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   fp16_master_weights_and_gradients  False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   global_rank .................. 0
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   grad_accum_dtype ............. None
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   gradient_accumulation_steps .. 2
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   gradient_clipping ............ 1.0
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   gradient_predivide_factor .... 1.0
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   graph_harvesting ............. False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   initial_dynamic_scale ........ 1
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   load_universal_checkpoint .... False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   loss_scale ................... 1.0
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   memory_breakdown ............. False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   mics_hierarchial_params_gather  False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   mics_shard_size .............. -1
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   optimizer_legacy_fusion ...... False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   optimizer_name ............... adamw
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   optimizer_params ............. {'lr': 4e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.01}
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   pld_enabled .................. False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   pld_params ................... False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   prescale_gradients ........... False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   scheduler_name ............... None
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   scheduler_params ............. None
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   seq_parallel_communication_data_type  torch.float32
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   sparse_attention ............. None
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   sparse_gradients_enabled ..... False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   steps_per_print .............. inf
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   train_batch_size ............. 64
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   train_micro_batch_size_per_gpu  8
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   use_data_before_expert_parallel_  False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   use_node_local_storage ....... False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   wall_clock_breakdown ......... True
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   weight_quantization_config ... None
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print]   world_size ................... 4
[2024-09-15 15:35:08,597] [INFO] [config.py:1000:print]   zero_allow_untested_optimizer  False
[2024-09-15 15:35:08,597] [INFO] [config.py:1000:print]   zero_config .................. stage=1 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-09-15 15:35:08,597] [INFO] [config.py:1000:print]   zero_enabled ................. True
[2024-09-15 15:35:08,597] [INFO] [config.py:1000:print]   zero_force_ds_cpu_optimizer .. True
[2024-09-15 15:35:08,597] [INFO] [config.py:1000:print]   zero_optimization_stage ...... 1
[2024-09-15 15:35:08,597] [INFO] [config.py:986:print_user_config]   json = {
    "zero_optimization": {
        "stage": 1, 
        "allgather_partitions": true, 
        "allgather_bucket_size": 1.000000e+09, 
        "overlap_comm": true, 
        "reduce_scatter": true, 
        "reduce_bucket_size": 1.000000e+09, 
        "contiguous_gradients": true
    }, 
    "fp16": {
        "enabled": false, 
        "auto_cast": true, 
        "loss_scale": 0, 
        "initial_scale_power": 32, 
        "loss_scale_window": 1000, 
        "hysteresis": 2, 
        "min_loss_scale": 1
    }, 
    "bf16": {
        "enabled": true
    }, 
    "optimizer": {
        "type": "AdamW", 
        "params": {
            "lr": 4e-05, 
            "betas": [0.9, 0.999], 
            "eps": 1e-08, 
            "weight_decay": 0.01
        }
    }, 
    "gradient_accumulation_steps": 2, 
    "gradient_clipping": 1.0, 
    "steps_per_print": inf, 
    "train_batch_size": 64, 
    "train_micro_batch_size_per_gpu": 8, 
    "wall_clock_breakdown": true
}
[INFO|trainer.py:1721] 2024-09-15 15:35:08,597 >> ***** Running training *****
[INFO|trainer.py:1722] 2024-09-15 15:35:08,597 >>   Num examples = 9,996
[INFO|trainer.py:1723] 2024-09-15 15:35:08,597 >>   Num Epochs = 3
[INFO|trainer.py:1724] 2024-09-15 15:35:08,597 >>   Instantaneous batch size per device = 8
[INFO|trainer.py:1727] 2024-09-15 15:35:08,597 >>   Total train batch size (w. parallel, distributed & accumulation) = 64
[INFO|trainer.py:1728] 2024-09-15 15:35:08,597 >>   Gradient Accumulation steps = 2
[INFO|trainer.py:1729] 2024-09-15 15:35:08,597 >>   Total optimization steps = 468
[INFO|trainer.py:1730] 2024-09-15 15:35:08,602 >>   Number of trainable parameters = 37,748,736
  0%|          | 0/468 [00:00<?, ?it/s][2024-09-15 15:35:11,456] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:35:11,521] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:35:11,535] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:35:11,538] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:35:15,183] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:35:15,311] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:35:15,340] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:35:15,363] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:35:18,943] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:35:19,141] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:35:19,167] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:35:19,199] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:35:22,642] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:35:22,828] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:35:22,985] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:35:23,010] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 15:35:33,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2692.38 | bwd_microstep: 5120.97 | bwd_inner_microstep: 5120.91 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.08
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 15:35:41,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 4.32 | optimizer_gradients: 3.19 | optimizer_step: 1.10
[2024-09-15 15:35:41,096] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1428.64 | bwd_microstep: 6570.22 | bwd_inner_microstep: 2592.98 | bwd_allreduce_microstep: 3977.18 | step_microstep: 20.40
[2024-09-15 15:35:41,096] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4120.99 | bwd: 11691.19 | bwd_inner: 7713.88 | bwd_allreduce: 3977.21 | step: 20.47
  0%|          | 1/468 [00:32<4:12:57, 32.50s/it]                                                 {'loss': 3.7352, 'learning_rate': 2.666666666666667e-06, 'epoch': 0.01}
  0%|          | 1/468 [00:32<4:12:57, 32.50s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1900
[2024-09-15 15:35:46,587] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.63 | bwd_microstep: 3346.61 | bwd_inner_microstep: 3346.57 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.11
dynamic ViT batch size: 39, images per sample: 4.875, dynamic token length: 1368
[2024-09-15 15:35:51,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.45 | optimizer_step: 0.40
[2024-09-15 15:35:51,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1431.24 | bwd_microstep: 3775.07 | bwd_inner_microstep: 2525.83 | bwd_allreduce_microstep: 1249.18 | step_microstep: 7.44
[2024-09-15 15:35:51,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3545.84 | bwd: 7121.73 | bwd_inner: 5872.40 | bwd_allreduce: 1249.21 | step: 7.55
  0%|          | 2/468 [00:43<2:32:58, 19.70s/it]                                                 {'loss': 3.7667, 'learning_rate': 5.333333333333334e-06, 'epoch': 0.01}
  0%|          | 2/468 [00:43<2:32:58, 19.70s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1876
[2024-09-15 15:35:57,158] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1902.06 | bwd_microstep: 3397.31 | bwd_inner_microstep: 3397.28 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 15:36:02,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.23 | optimizer_gradients: 0.43 | optimizer_step: 0.38
[2024-09-15 15:36:02,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1449.11 | bwd_microstep: 3756.23 | bwd_inner_microstep: 2554.96 | bwd_allreduce_microstep: 1201.22 | step_microstep: 7.18
[2024-09-15 15:36:02,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3351.14 | bwd: 7153.56 | bwd_inner: 5952.24 | bwd_allreduce: 1201.25 | step: 7.23
  1%|          | 3/468 [00:53<2:00:19, 15.53s/it]                                                 {'loss': 3.7797, 'learning_rate': 8.000000000000001e-06, 'epoch': 0.02}
  1%|          | 3/468 [00:53<2:00:19, 15.53s/it]dynamic ViT batch size: 41, images per sample: 5.125, dynamic token length: 1900
[2024-09-15 15:36:07,753] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1908.96 | bwd_microstep: 3419.12 | bwd_inner_microstep: 3419.09 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1367
[2024-09-15 15:36:11,827] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.43 | optimizer_gradients: 0.42 | optimizer_step: 0.39
[2024-09-15 15:36:11,827] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1456.12 | bwd_microstep: 2581.32 | bwd_inner_microstep: 2568.45 | bwd_allreduce_microstep: 12.82 | step_microstep: 7.76
[2024-09-15 15:36:11,827] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3365.04 | bwd: 6000.45 | bwd_inner: 5987.55 | bwd_allreduce: 12.84 | step: 7.99
  1%|          | 4/468 [01:03<1:41:28, 13.12s/it]                                                 {'loss': 3.6592, 'learning_rate': 1.0666666666666667e-05, 'epoch': 0.03}
  1%|          | 4/468 [01:03<1:41:28, 13.12s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1881
[2024-09-15 15:36:17,164] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1902.27 | bwd_microstep: 3402.54 | bwd_inner_microstep: 3402.51 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 15:36:22,494] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.49 | optimizer_gradients: 0.45 | optimizer_step: 0.40
[2024-09-15 15:36:22,494] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1460.39 | bwd_microstep: 3838.39 | bwd_inner_microstep: 2570.19 | bwd_allreduce_microstep: 1268.14 | step_microstep: 7.96
[2024-09-15 15:36:22,494] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3362.63 | bwd: 7240.95 | bwd_inner: 5972.70 | bwd_allreduce: 1268.17 | step: 8.02
  1%|          | 5/468 [01:13<1:34:25, 12.24s/it]                                                 {'loss': 3.6968, 'learning_rate': 1.3333333333333333e-05, 'epoch': 0.03}
  1%|          | 5/468 [01:13<1:34:25, 12.24s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 15:36:27,837] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1902.86 | bwd_microstep: 3409.65 | bwd_inner_microstep: 3409.60 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.26
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1880
[2024-09-15 15:36:33,560] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.23 | optimizer_gradients: 0.44 | optimizer_step: 0.38
[2024-09-15 15:36:33,560] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1892.98 | bwd_microstep: 3795.64 | bwd_inner_microstep: 3401.18 | bwd_allreduce_microstep: 394.40 | step_microstep: 7.40
[2024-09-15 15:36:33,561] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3795.81 | bwd: 7205.34 | bwd_inner: 6810.79 | bwd_allreduce: 394.43 | step: 7.66
  1%|▏         | 6/468 [01:24<1:31:09, 11.84s/it]                                                 {'loss': 3.7754, 'learning_rate': 1.6000000000000003e-05, 'epoch': 0.04}
  1%|▏         | 6/468 [01:24<1:31:09, 11.84s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 15:36:38,899] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1900.01 | bwd_microstep: 3407.13 | bwd_inner_microstep: 3407.10 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 15:36:44,274] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.44 | optimizer_gradients: 0.46 | optimizer_step: 0.40
[2024-09-15 15:36:44,274] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1485.18 | bwd_microstep: 3853.58 | bwd_inner_microstep: 2612.02 | bwd_allreduce_microstep: 1241.49 | step_microstep: 8.50
[2024-09-15 15:36:44,274] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3385.16 | bwd: 7260.72 | bwd_inner: 6019.13 | bwd_allreduce: 1241.52 | step: 8.74
  1%|▏         | 7/468 [01:35<1:28:07, 11.47s/it]                                                 {'loss': 3.4281, 'learning_rate': 1.866666666666667e-05, 'epoch': 0.04}
  1%|▏         | 7/468 [01:35<1:28:07, 11.47s/it]dynamic ViT batch size: 32, images per sample: 4.0, dynamic token length: 1388
[2024-09-15 15:36:48,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1422.53 | bwd_microstep: 2543.25 | bwd_inner_microstep: 2543.08 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.21
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 15:36:54,934] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.46 | optimizer_step: 0.41
[2024-09-15 15:36:54,934] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1461.08 | bwd_microstep: 5172.00 | bwd_inner_microstep: 2576.72 | bwd_allreduce_microstep: 2595.21 | step_microstep: 7.57
[2024-09-15 15:36:54,935] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2883.58 | bwd: 7715.28 | bwd_inner: 5119.80 | bwd_allreduce: 2595.36 | step: 7.78
  2%|▏         | 8/468 [01:46<1:25:57, 11.21s/it]                                                 {'loss': 3.6136, 'learning_rate': 2.1333333333333335e-05, 'epoch': 0.05}
  2%|▏         | 8/468 [01:46<1:25:57, 11.21s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1905
[2024-09-15 15:37:00,312] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1908.26 | bwd_microstep: 3438.19 | bwd_inner_microstep: 3438.16 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 15:37:05,689] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.57 | optimizer_gradients: 0.43 | optimizer_step: 0.40
[2024-09-15 15:37:05,689] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1907.78 | bwd_microstep: 3438.20 | bwd_inner_microstep: 3425.71 | bwd_allreduce_microstep: 12.44 | step_microstep: 8.47
[2024-09-15 15:37:05,690] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3816.01 | bwd: 6876.40 | bwd_inner: 6863.88 | bwd_allreduce: 12.46 | step: 8.53
  2%|▏         | 9/468 [01:57<1:24:40, 11.07s/it]                                                 {'loss': 3.0098, 'learning_rate': 2.4e-05, 'epoch': 0.06}
  2%|▏         | 9/468 [01:57<1:24:40, 11.07s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 15:37:11,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1909.10 | bwd_microstep: 3422.41 | bwd_inner_microstep: 3422.39 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1879
[2024-09-15 15:37:16,433] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.29 | optimizer_gradients: 0.42 | optimizer_step: 0.40
[2024-09-15 15:37:16,433] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1911.58 | bwd_microstep: 3439.21 | bwd_inner_microstep: 3425.38 | bwd_allreduce_microstep: 13.78 | step_microstep: 7.83
[2024-09-15 15:37:16,433] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3820.66 | bwd: 6861.63 | bwd_inner: 6847.77 | bwd_allreduce: 13.80 | step: 7.88
  2%|▏         | 10/468 [02:07<1:23:43, 10.97s/it]                                                  {'loss': 2.7693, 'learning_rate': 2.6666666666666667e-05, 'epoch': 0.06}
  2%|▏         | 10/468 [02:07<1:23:43, 10.97s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 15:37:21,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1906.43 | bwd_microstep: 3408.52 | bwd_inner_microstep: 3408.28 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.20
dynamic ViT batch size: 34, images per sample: 4.25, dynamic token length: 1367
[2024-09-15 15:37:25,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.25 | optimizer_gradients: 0.45 | optimizer_step: 0.39
[2024-09-15 15:37:25,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1421.51 | bwd_microstep: 2649.81 | bwd_inner_microstep: 2533.64 | bwd_allreduce_microstep: 116.10 | step_microstep: 7.46
[2024-09-15 15:37:25,886] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3327.92 | bwd: 6058.35 | bwd_inner: 5941.97 | bwd_allreduce: 116.23 | step: 7.66
  2%|▏         | 11/468 [02:17<1:20:00, 10.50s/it]                                                  {'loss': 2.2559, 'learning_rate': 2.9333333333333333e-05, 'epoch': 0.07}
  2%|▏         | 11/468 [02:17<1:20:00, 10.50s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 15:37:30,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1488.54 | bwd_microstep: 2624.72 | bwd_inner_microstep: 2624.69 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 15:37:35,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.50 | optimizer_gradients: 0.41 | optimizer_step: 0.41
[2024-09-15 15:37:35,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1907.24 | bwd_microstep: 3448.17 | bwd_inner_microstep: 3434.46 | bwd_allreduce_microstep: 13.66 | step_microstep: 7.79
[2024-09-15 15:37:35,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3395.75 | bwd: 6072.90 | bwd_inner: 6059.16 | bwd_allreduce: 13.68 | step: 7.87
  3%|▎         | 12/468 [02:26<1:17:34, 10.21s/it]                                                  {'loss': 2.1567, 'learning_rate': 3.2000000000000005e-05, 'epoch': 0.08}
  3%|▎         | 12/468 [02:26<1:17:34, 10.21s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 15:37:40,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1898.25 | bwd_microstep: 3421.09 | bwd_inner_microstep: 3421.03 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 15:37:46,108] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.29 | optimizer_gradients: 0.43 | optimizer_step: 0.39
[2024-09-15 15:37:46,108] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.83 | bwd_microstep: 3840.85 | bwd_inner_microstep: 2590.34 | bwd_allreduce_microstep: 1250.46 | step_microstep: 7.49
[2024-09-15 15:37:46,109] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3366.05 | bwd: 7261.98 | bwd_inner: 6011.37 | bwd_allreduce: 1250.49 | step: 7.75
  3%|▎         | 13/468 [02:37<1:18:31, 10.36s/it]                                                  {'loss': 1.7343, 'learning_rate': 3.466666666666667e-05, 'epoch': 0.08}
  3%|▎         | 13/468 [02:37<1:18:31, 10.36s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1877
[2024-09-15 15:37:51,449] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1896.95 | bwd_microstep: 3413.05 | bwd_inner_microstep: 3413.02 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1370
[2024-09-15 15:37:56,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.45 | optimizer_step: 0.40
[2024-09-15 15:37:56,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.56 | bwd_microstep: 3880.59 | bwd_inner_microstep: 2589.86 | bwd_allreduce_microstep: 1290.66 | step_microstep: 7.43
[2024-09-15 15:37:56,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3364.48 | bwd: 7293.65 | bwd_inner: 6002.89 | bwd_allreduce: 1290.69 | step: 7.50
  3%|▎         | 14/468 [02:48<1:19:12, 10.47s/it]                                                  {'loss': 1.677, 'learning_rate': 3.733333333333334e-05, 'epoch': 0.09}
  3%|▎         | 14/468 [02:48<1:19:12, 10.47s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1387
[2024-09-15 15:38:00,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1488.70 | bwd_microstep: 2623.23 | bwd_inner_microstep: 2623.20 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.17
dynamic ViT batch size: 39, images per sample: 4.875, dynamic token length: 1366
[2024-09-15 15:38:08,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.41 | optimizer_gradients: 0.44 | optimizer_step: 0.39
[2024-09-15 15:38:08,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1458.45 | bwd_microstep: 5855.29 | bwd_inner_microstep: 2576.15 | bwd_allreduce_microstep: 3279.08 | step_microstep: 7.63
[2024-09-15 15:38:08,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2947.08 | bwd: 8478.53 | bwd_inner: 5199.36 | bwd_allreduce: 3279.11 | step: 7.83
  3%|▎         | 15/468 [02:59<1:21:21, 10.78s/it]                                                  {'loss': 1.2854, 'learning_rate': 4e-05, 'epoch': 0.1}
  3%|▎         | 15/468 [02:59<1:21:21, 10.78s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 15:38:12,467] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1487.74 | bwd_microstep: 2620.17 | bwd_inner_microstep: 2620.14 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 15:38:19,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 4.10 | optimizer_gradients: 0.43 | optimizer_step: 0.40
[2024-09-15 15:38:19,044] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1461.94 | bwd_microstep: 5082.62 | bwd_inner_microstep: 2585.74 | bwd_allreduce_microstep: 2496.82 | step_microstep: 8.35
[2024-09-15 15:38:19,044] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2949.65 | bwd: 7702.82 | bwd_inner: 5205.89 | bwd_allreduce: 2496.85 | step: 8.41
  3%|▎         | 16/468 [03:10<1:21:02, 10.76s/it]                                                  {'loss': 1.1381, 'learning_rate': 3.999951904814875e-05, 'epoch': 0.1}
  3%|▎         | 16/468 [03:10<1:21:02, 10.76s/it]dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1389
[2024-09-15 15:38:23,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1455.34 | bwd_microstep: 2583.00 | bwd_inner_microstep: 2582.97 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 15:38:29,686] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.54 | optimizer_step: 0.40
[2024-09-15 15:38:29,686] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1914.07 | bwd_microstep: 4612.74 | bwd_inner_microstep: 3426.25 | bwd_allreduce_microstep: 1186.37 | step_microstep: 8.59
[2024-09-15 15:38:29,686] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3369.38 | bwd: 7195.77 | bwd_inner: 6009.22 | bwd_allreduce: 1186.44 | step: 8.63
  4%|▎         | 17/468 [03:21<1:20:36, 10.72s/it]                                                  {'loss': 0.7843, 'learning_rate': 3.999807621572648e-05, 'epoch': 0.11}
  4%|▎         | 17/468 [03:21<1:20:36, 10.72s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 15:38:35,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1907.25 | bwd_microstep: 3427.56 | bwd_inner_microstep: 3427.53 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 15:38:40,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.48 | optimizer_gradients: 0.40 | optimizer_step: 0.37
[2024-09-15 15:38:40,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1911.72 | bwd_microstep: 3449.69 | bwd_inner_microstep: 3436.10 | bwd_allreduce_microstep: 13.53 | step_microstep: 8.29
[2024-09-15 15:38:40,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3818.95 | bwd: 6877.25 | bwd_inner: 6863.63 | bwd_allreduce: 13.55 | step: 8.35
  4%|▍         | 18/468 [03:31<1:20:30, 10.73s/it]                                                  {'loss': 0.7415, 'learning_rate': 3.999567157212646e-05, 'epoch': 0.12}
  4%|▍         | 18/468 [03:31<1:20:30, 10.73s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 15:38:44,590] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1490.53 | bwd_microstep: 2619.43 | bwd_inner_microstep: 2619.40 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 15:38:51,133] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 4.21 | optimizer_gradients: 0.46 | optimizer_step: 0.38
[2024-09-15 15:38:51,133] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.86 | bwd_microstep: 4596.09 | bwd_inner_microstep: 3431.55 | bwd_allreduce_microstep: 1164.47 | step_microstep: 10.73
[2024-09-15 15:38:51,134] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3403.36 | bwd: 7215.53 | bwd_inner: 6050.96 | bwd_allreduce: 1164.50 | step: 10.78
  4%|▍         | 19/468 [03:42<1:20:13, 10.72s/it]                                                  {'loss': 0.6684, 'learning_rate': 3.999230523300049e-05, 'epoch': 0.12}
  4%|▍         | 19/468 [03:42<1:20:13, 10.72s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 15:38:55,211] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1463.89 | bwd_microstep: 2582.58 | bwd_inner_microstep: 2582.55 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 15:39:00,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.20 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 15:39:00,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1911.68 | bwd_microstep: 3443.24 | bwd_inner_microstep: 3429.54 | bwd_allreduce_microstep: 13.63 | step_microstep: 7.59
[2024-09-15 15:39:00,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3375.55 | bwd: 6025.84 | bwd_inner: 6012.09 | bwd_allreduce: 13.67 | step: 7.82
  4%|▍         | 20/468 [03:52<1:17:14, 10.34s/it]                                                  {'loss': 0.4267, 'learning_rate': 3.998797736025326e-05, 'epoch': 0.13}
  4%|▍         | 20/468 [03:52<1:17:14, 10.34s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 15:39:04,749] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1490.26 | bwd_microstep: 2625.82 | bwd_inner_microstep: 2625.79 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 46, images per sample: 5.75, dynamic token length: 1878
[2024-09-15 15:39:10,229] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.51 | optimizer_gradients: 0.37 | optimizer_step: 0.37
[2024-09-15 15:39:10,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1944.82 | bwd_microstep: 3493.35 | bwd_inner_microstep: 3481.05 | bwd_allreduce_microstep: 12.26 | step_microstep: 8.54
[2024-09-15 15:39:10,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3435.04 | bwd: 6119.18 | bwd_inner: 6106.84 | bwd_allreduce: 12.28 | step: 8.77
  4%|▍         | 21/468 [04:01<1:15:27, 10.13s/it]                                                  {'loss': 0.5993, 'learning_rate': 3.9982688162034624e-05, 'epoch': 0.13}
  4%|▍         | 21/468 [04:01<1:15:27, 10.13s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1367
[2024-09-15 15:39:14,290] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1454.96 | bwd_microstep: 2574.76 | bwd_inner_microstep: 2574.73 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1877
[2024-09-15 15:39:20,905] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.39 | optimizer_gradients: 0.44 | optimizer_step: 0.40
[2024-09-15 15:39:20,906] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1936.46 | bwd_microstep: 4646.49 | bwd_inner_microstep: 3463.18 | bwd_allreduce_microstep: 1183.25 | step_microstep: 9.22
[2024-09-15 15:39:20,906] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3391.40 | bwd: 7221.26 | bwd_inner: 6037.91 | bwd_allreduce: 1183.28 | step: 9.27
  5%|▍         | 22/468 [04:12<1:16:30, 10.29s/it]                                                  {'loss': 0.555, 'learning_rate': 3.997643789272954e-05, 'epoch': 0.14}
  5%|▍         | 22/468 [04:12<1:16:30, 10.29s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1390
[2024-09-15 15:39:25,062] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1491.09 | bwd_microstep: 2629.22 | bwd_inner_microstep: 2629.19 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 15:39:31,636] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.55 | optimizer_gradients: 0.43 | optimizer_step: 0.39
[2024-09-15 15:39:31,636] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.54 | bwd_microstep: 4625.08 | bwd_inner_microstep: 3440.71 | bwd_allreduce_microstep: 1184.31 | step_microstep: 7.65
[2024-09-15 15:39:31,637] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3408.61 | bwd: 7254.31 | bwd_inner: 6069.90 | bwd_allreduce: 1184.35 | step: 7.71
  5%|▍         | 23/468 [04:23<1:17:18, 10.42s/it]                                                  {'loss': 0.5142, 'learning_rate': 3.996922685294587e-05, 'epoch': 0.15}
  5%|▍         | 23/468 [04:23<1:17:18, 10.42s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 15:39:37,015] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.79 | bwd_microstep: 3431.91 | bwd_inner_microstep: 3431.88 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1876
[2024-09-15 15:39:42,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.51 | optimizer_gradients: 0.51 | optimizer_step: 0.69
[2024-09-15 15:39:42,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1898.70 | bwd_microstep: 3429.28 | bwd_inner_microstep: 3416.03 | bwd_allreduce_microstep: 13.19 | step_microstep: 10.30
[2024-09-15 15:39:42,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3814.47 | bwd: 6861.21 | bwd_inner: 6847.92 | bwd_allreduce: 13.22 | step: 10.35
  5%|▌         | 24/468 [04:33<1:17:50, 10.52s/it]                                                  {'loss': 0.3841, 'learning_rate': 3.9961055389499904e-05, 'epoch': 0.15}
  5%|▌         | 24/468 [04:33<1:17:50, 10.52s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 15:39:47,764] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.61 | bwd_microstep: 3437.97 | bwd_inner_microstep: 3437.94 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.06
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 15:39:53,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.43 | optimizer_gradients: 0.41 | optimizer_step: 0.42
[2024-09-15 15:39:53,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.21 | bwd_microstep: 3451.29 | bwd_inner_microstep: 3439.09 | bwd_allreduce_microstep: 12.15 | step_microstep: 8.03
[2024-09-15 15:39:53,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3836.79 | bwd: 6889.27 | bwd_inner: 6877.03 | bwd_allreduce: 12.18 | step: 8.10
  5%|▌         | 25/468 [04:44<1:18:15, 10.60s/it]                                                  {'loss': 0.3057, 'learning_rate': 3.9951923895399696e-05, 'epoch': 0.16}
  5%|▌         | 25/468 [04:44<1:18:15, 10.60s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1881
[2024-09-15 15:39:58,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1916.89 | bwd_microstep: 3442.11 | bwd_inner_microstep: 3442.09 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 15:40:03,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.54 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 15:40:03,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1920.84 | bwd_microstep: 3452.66 | bwd_inner_microstep: 3440.01 | bwd_allreduce_microstep: 12.61 | step_microstep: 8.13
[2024-09-15 15:40:03,964] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3837.69 | bwd: 6894.79 | bwd_inner: 6882.10 | bwd_allreduce: 12.63 | step: 8.36
  6%|▌         | 26/468 [04:55<1:18:31, 10.66s/it]                                                  {'loss': 0.4327, 'learning_rate': 3.9941832809826136e-05, 'epoch': 0.17}
  6%|▌         | 26/468 [04:55<1:18:31, 10.66s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1878
[2024-09-15 15:40:09,287] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1887.69 | bwd_microstep: 3404.81 | bwd_inner_microstep: 3404.78 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.17
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 15:40:14,731] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.60 | optimizer_gradients: 0.37 | optimizer_step: 0.41
[2024-09-15 15:40:14,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1933.66 | bwd_microstep: 3474.72 | bwd_inner_microstep: 3462.07 | bwd_allreduce_microstep: 12.60 | step_microstep: 8.20
[2024-09-15 15:40:14,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3821.33 | bwd: 6879.54 | bwd_inner: 6866.85 | bwd_allreduce: 12.63 | step: 8.41
  6%|▌         | 27/468 [05:06<1:18:35, 10.69s/it]                                                  {'loss': 0.3207, 'learning_rate': 3.993078261811186e-05, 'epoch': 0.17}
  6%|▌         | 27/468 [05:06<1:18:35, 10.69s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 15:40:20,160] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1934.74 | bwd_microstep: 3462.58 | bwd_inner_microstep: 3462.55 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1368
[2024-09-15 15:40:24,233] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.77 | optimizer_gradients: 0.38 | optimizer_step: 0.37
[2024-09-15 15:40:24,234] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1457.10 | bwd_microstep: 2584.90 | bwd_inner_microstep: 2572.07 | bwd_allreduce_microstep: 12.78 | step_microstep: 8.05
[2024-09-15 15:40:24,234] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3391.81 | bwd: 6047.49 | bwd_inner: 6034.62 | bwd_allreduce: 12.80 | step: 8.10
  6%|▌         | 28/468 [05:15<1:15:47, 10.34s/it]                                                  {'loss': 0.454, 'learning_rate': 3.991877385171789e-05, 'epoch': 0.18}
  6%|▌         | 28/468 [05:15<1:15:47, 10.34s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 15:40:29,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.59 | bwd_microstep: 3442.66 | bwd_inner_microstep: 3442.63 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1880
[2024-09-15 15:40:35,037] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.48 | optimizer_gradients: 0.33 | optimizer_step: 0.38
[2024-09-15 15:40:35,037] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1920.55 | bwd_microstep: 3459.17 | bwd_inner_microstep: 3446.78 | bwd_allreduce_microstep: 12.34 | step_microstep: 8.38
[2024-09-15 15:40:35,037] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3839.12 | bwd: 6901.84 | bwd_inner: 6889.41 | bwd_allreduce: 12.36 | step: 8.44
  6%|▌         | 29/468 [05:26<1:16:38, 10.48s/it]                                                  {'loss': 0.2997, 'learning_rate': 3.990580708820805e-05, 'epoch': 0.19}
  6%|▌         | 29/468 [05:26<1:16:38, 10.48s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1876
[2024-09-15 15:40:40,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1934.84 | bwd_microstep: 3467.60 | bwd_inner_microstep: 3467.37 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.17
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1870
[2024-09-15 15:40:45,830] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.48 | optimizer_gradients: 0.33 | optimizer_step: 0.37
[2024-09-15 15:40:45,830] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1900.06 | bwd_microstep: 3425.70 | bwd_inner_microstep: 3412.00 | bwd_allreduce_microstep: 13.64 | step_microstep: 7.65
[2024-09-15 15:40:45,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3834.88 | bwd: 6893.33 | bwd_inner: 6879.41 | bwd_allreduce: 13.76 | step: 7.82
  6%|▋         | 30/468 [05:37<1:17:10, 10.57s/it]                                                  {'loss': 0.3583, 'learning_rate': 3.9891882951221246e-05, 'epoch': 0.19}
  6%|▋         | 30/468 [05:37<1:17:10, 10.57s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 15:40:49,989] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1493.05 | bwd_microstep: 2635.61 | bwd_inner_microstep: 2635.59 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 15:40:55,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.51 | optimizer_gradients: 0.34 | optimizer_step: 0.39
[2024-09-15 15:40:55,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1920.52 | bwd_microstep: 3457.26 | bwd_inner_microstep: 3444.62 | bwd_allreduce_microstep: 12.59 | step_microstep: 8.38
[2024-09-15 15:40:55,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3413.55 | bwd: 6092.89 | bwd_inner: 6080.21 | bwd_allreduce: 12.61 | step: 8.44
  7%|▋         | 31/468 [05:46<1:14:48, 10.27s/it]                                                  {'loss': 0.2823, 'learning_rate': 3.9877002110441424e-05, 'epoch': 0.2}
  7%|▋         | 31/468 [05:46<1:14:48, 10.27s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1900
[2024-09-15 15:41:00,814] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1928.82 | bwd_microstep: 3453.57 | bwd_inner_microstep: 3453.55 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 15:41:06,229] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.55 | optimizer_gradients: 0.34 | optimizer_step: 0.38
[2024-09-15 15:41:06,229] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.19 | bwd_microstep: 3463.36 | bwd_inner_microstep: 3450.16 | bwd_allreduce_microstep: 13.11 | step_microstep: 8.27
[2024-09-15 15:41:06,229] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3845.98 | bwd: 6916.94 | bwd_inner: 6903.74 | bwd_allreduce: 13.12 | step: 8.43
  7%|▋         | 32/468 [05:57<1:15:51, 10.44s/it]                                                  {'loss': 0.2659, 'learning_rate': 3.986116528156537e-05, 'epoch': 0.2}
  7%|▋         | 32/468 [05:57<1:15:51, 10.44s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 15:41:11,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1922.75 | bwd_microstep: 3443.03 | bwd_inner_microstep: 3443.00 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 15:41:17,056] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.51 | optimizer_gradients: 0.49 | optimizer_step: 0.44
[2024-09-15 15:41:17,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1927.59 | bwd_microstep: 3463.47 | bwd_inner_microstep: 3450.13 | bwd_allreduce_microstep: 13.28 | step_microstep: 10.30
[2024-09-15 15:41:17,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3850.31 | bwd: 6906.51 | bwd_inner: 6893.13 | bwd_allreduce: 13.31 | step: 10.36
  7%|▋         | 33/468 [06:08<1:16:32, 10.56s/it]                                                  {'loss': 0.3158, 'learning_rate': 3.9844373226268305e-05, 'epoch': 0.21}
  7%|▋         | 33/468 [06:08<1:16:32, 10.56s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1877
[2024-09-15 15:41:22,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1935.51 | bwd_microstep: 3467.38 | bwd_inner_microstep: 3467.35 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 15:41:27,957] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.49 | optimizer_gradients: 0.34 | optimizer_step: 0.39
[2024-09-15 15:41:27,958] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1940.57 | bwd_microstep: 3488.35 | bwd_inner_microstep: 3475.80 | bwd_allreduce_microstep: 12.50 | step_microstep: 8.49
[2024-09-15 15:41:27,958] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3876.06 | bwd: 6955.74 | bwd_inner: 6943.15 | bwd_allreduce: 12.52 | step: 8.55
  7%|▋         | 34/468 [06:19<1:17:05, 10.66s/it]                                                  {'loss': 0.2704, 'learning_rate': 3.982662675216723e-05, 'epoch': 0.22}
  7%|▋         | 34/468 [06:19<1:17:05, 10.66s/it]dynamic ViT batch size: 46, images per sample: 5.75, dynamic token length: 1900
[2024-09-15 15:41:33,484] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1974.08 | bwd_microstep: 3520.76 | bwd_inner_microstep: 3520.74 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.06
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 15:41:38,880] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.37 | optimizer_gradients: 0.42 | optimizer_step: 0.42
[2024-09-15 15:41:38,881] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1909.27 | bwd_microstep: 3451.28 | bwd_inner_microstep: 3438.16 | bwd_allreduce_microstep: 13.06 | step_microstep: 12.32
[2024-09-15 15:41:38,881] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3883.31 | bwd: 6972.06 | bwd_inner: 6958.90 | bwd_allreduce: 13.08 | step: 12.39
  7%|▋         | 35/468 [06:30<1:17:29, 10.74s/it]                                                  {'loss': 0.3642, 'learning_rate': 3.9807926712782115e-05, 'epoch': 0.22}
  7%|▋         | 35/468 [06:30<1:17:29, 10.74s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1877
[2024-09-15 15:41:44,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1938.51 | bwd_microstep: 3475.27 | bwd_inner_microstep: 3475.25 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1870
[2024-09-15 15:41:49,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.60 | optimizer_gradients: 0.40 | optimizer_step: 0.42
[2024-09-15 15:41:49,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1905.07 | bwd_microstep: 3434.24 | bwd_inner_microstep: 3421.17 | bwd_allreduce_microstep: 13.02 | step_microstep: 8.88
[2024-09-15 15:41:49,704] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3843.55 | bwd: 6909.53 | bwd_inner: 6896.42 | bwd_allreduce: 13.04 | step: 8.94
  8%|▊         | 36/468 [06:41<1:17:29, 10.76s/it]                                                  {'loss': 0.3058, 'learning_rate': 3.978827400749481e-05, 'epoch': 0.23}
  8%|▊         | 36/468 [06:41<1:17:29, 10.76s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 15:41:53,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1497.85 | bwd_microstep: 2632.68 | bwd_inner_microstep: 2632.65 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.21
dynamic ViT batch size: 39, images per sample: 4.875, dynamic token length: 1876
[2024-09-15 15:41:59,237] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.66 | optimizer_gradients: 0.43 | optimizer_step: 0.40
[2024-09-15 15:41:59,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1901.02 | bwd_microstep: 3435.61 | bwd_inner_microstep: 3422.66 | bwd_allreduce_microstep: 12.90 | step_microstep: 8.58
[2024-09-15 15:41:59,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3398.84 | bwd: 6068.30 | bwd_inner: 6055.31 | bwd_allreduce: 12.92 | step: 8.80
  8%|▊         | 37/468 [06:50<1:14:40, 10.40s/it]                                                  {'loss': 0.3343, 'learning_rate': 3.976766958150581e-05, 'epoch': 0.24}
  8%|▊         | 37/468 [06:50<1:14:40, 10.40s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1870
[2024-09-15 15:42:04,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1902.65 | bwd_microstep: 3416.69 | bwd_inner_microstep: 3416.67 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 15:42:10,011] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.56 | optimizer_gradients: 0.33 | optimizer_step: 0.39
[2024-09-15 15:42:10,011] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1923.57 | bwd_microstep: 3457.34 | bwd_inner_microstep: 3445.02 | bwd_allreduce_microstep: 12.26 | step_microstep: 8.17
[2024-09-15 15:42:10,011] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3826.19 | bwd: 6874.04 | bwd_inner: 6861.69 | bwd_allreduce: 12.28 | step: 8.22
  8%|▊         | 38/468 [07:01<1:15:18, 10.51s/it]                                                  {'loss': 0.2987, 'learning_rate': 3.97461144257888e-05, 'epoch': 0.24}
  8%|▊         | 38/468 [07:01<1:15:18, 10.51s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1389
[2024-09-15 15:42:14,138] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1481.44 | bwd_microstep: 2615.04 | bwd_inner_microstep: 2615.02 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.15
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 15:42:19,482] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.20 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 15:42:19,483] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1477.48 | bwd_microstep: 3830.75 | bwd_inner_microstep: 2604.62 | bwd_allreduce_microstep: 1226.07 | step_microstep: 7.99
[2024-09-15 15:42:19,483] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2958.88 | bwd: 6445.81 | bwd_inner: 5219.64 | bwd_allreduce: 1226.10 | step: 8.15
  8%|▊         | 39/468 [07:10<1:12:54, 10.20s/it]                                                  {'loss': 0.2469, 'learning_rate': 3.972360957704298e-05, 'epoch': 0.25}
  8%|▊         | 39/468 [07:10<1:12:54, 10.20s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 15:42:23,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1495.27 | bwd_microstep: 2634.91 | bwd_inner_microstep: 2634.88 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1877
[2024-09-15 15:42:30,221] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.68 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 15:42:30,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1945.76 | bwd_microstep: 4598.78 | bwd_inner_microstep: 3473.61 | bwd_allreduce_microstep: 1125.11 | step_microstep: 7.98
[2024-09-15 15:42:30,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3441.00 | bwd: 7233.70 | bwd_inner: 6108.49 | bwd_allreduce: 1125.14 | step: 8.04
  9%|▊         | 40/468 [07:21<1:13:53, 10.36s/it]                                                  {'loss': 0.359, 'learning_rate': 3.970015611764323e-05, 'epoch': 0.26}
  9%|▊         | 40/468 [07:21<1:13:53, 10.36s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 15:42:34,386] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1494.48 | bwd_microstep: 2638.47 | bwd_inner_microstep: 2638.45 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 15:42:39,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.54 | optimizer_gradients: 0.41 | optimizer_step: 0.42
[2024-09-15 15:42:39,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1921.29 | bwd_microstep: 3464.03 | bwd_inner_microstep: 3450.83 | bwd_allreduce_microstep: 13.15 | step_microstep: 9.16
[2024-09-15 15:42:39,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3415.74 | bwd: 6102.51 | bwd_inner: 6089.27 | bwd_allreduce: 13.17 | step: 9.40
  9%|▉         | 41/468 [07:31<1:12:04, 10.13s/it]                                                  {'loss': 0.2903, 'learning_rate': 3.9675755175588006e-05, 'epoch': 0.26}
  9%|▉         | 41/468 [07:31<1:12:04, 10.13s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 15:42:45,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1921.00 | bwd_microstep: 3444.36 | bwd_inner_microstep: 3444.33 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 15:42:50,495] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.19 | optimizer_gradients: 0.53 | optimizer_step: 0.43
[2024-09-15 15:42:50,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1474.76 | bwd_microstep: 3779.40 | bwd_inner_microstep: 2600.54 | bwd_allreduce_microstep: 1178.80 | step_microstep: 8.47
[2024-09-15 15:42:50,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3395.73 | bwd: 7223.77 | bwd_inner: 6044.87 | bwd_allreduce: 1178.83 | step: 8.54
  9%|▉         | 42/468 [07:41<1:13:06, 10.30s/it]                                                  {'loss': 0.3881, 'learning_rate': 3.9650407924445147e-05, 'epoch': 0.27}
  9%|▉         | 42/468 [07:41<1:13:06, 10.30s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1888
[2024-09-15 15:42:55,875] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1905.81 | bwd_microstep: 3439.16 | bwd_inner_microstep: 3439.00 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.17
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 15:42:59,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.46 | optimizer_gradients: 0.35 | optimizer_step: 0.41
[2024-09-15 15:42:59,989] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.70 | bwd_microstep: 2609.69 | bwd_inner_microstep: 2595.45 | bwd_allreduce_microstep: 14.19 | step_microstep: 7.36
[2024-09-15 15:42:59,989] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3377.48 | bwd: 6048.88 | bwd_inner: 6034.48 | bwd_allreduce: 14.28 | step: 7.51
  9%|▉         | 43/468 [07:51<1:11:13, 10.05s/it]                                                  {'loss': 0.2877, 'learning_rate': 3.9624115583295375e-05, 'epoch': 0.27}
  9%|▉         | 43/468 [07:51<1:11:13, 10.05s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1900
[2024-09-15 15:43:05,406] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1931.44 | bwd_microstep: 3456.55 | bwd_inner_microstep: 3456.53 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 15:43:10,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.74 | optimizer_gradients: 0.43 | optimizer_step: 0.40
[2024-09-15 15:43:10,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1475.42 | bwd_microstep: 3885.86 | bwd_inner_microstep: 2603.08 | bwd_allreduce_microstep: 1282.72 | step_microstep: 8.80
[2024-09-15 15:43:10,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3406.83 | bwd: 7342.43 | bwd_inner: 6059.61 | bwd_allreduce: 1282.75 | step: 8.85
  9%|▉         | 44/468 [08:02<1:12:39, 10.28s/it]                                                  {'loss': 0.2628, 'learning_rate': 3.959687941667372e-05, 'epoch': 0.28}
  9%|▉         | 44/468 [08:02<1:12:39, 10.28s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 15:43:14,969] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1494.83 | bwd_microstep: 2636.84 | bwd_inner_microstep: 2636.50 | bwd_allreduce_microstep: 0.14 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 15:43:21,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 15:43:21,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1498.41 | bwd_microstep: 5109.16 | bwd_inner_microstep: 2639.43 | bwd_allreduce_microstep: 2469.67 | step_microstep: 7.46
[2024-09-15 15:43:21,611] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2993.21 | bwd: 7746.05 | bwd_inner: 5276.01 | bwd_allreduce: 2469.83 | step: 7.71
 10%|▉         | 45/468 [08:13<1:13:36, 10.44s/it]                                                  {'loss': 0.2789, 'learning_rate': 3.9568700734508645e-05, 'epoch': 0.29}
 10%|▉         | 45/468 [08:13<1:13:36, 10.44s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 15:43:26,984] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1911.87 | bwd_microstep: 3430.56 | bwd_inner_microstep: 3430.53 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 15:43:32,401] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.42 | optimizer_gradients: 0.35 | optimizer_step: 0.41
[2024-09-15 15:43:32,401] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.24 | bwd_microstep: 3463.16 | bwd_inner_microstep: 3448.21 | bwd_allreduce_microstep: 14.90 | step_microstep: 8.19
[2024-09-15 15:43:32,401] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3830.08 | bwd: 6893.73 | bwd_inner: 6878.74 | bwd_allreduce: 14.92 | step: 8.35
 10%|▉         | 46/468 [08:23<1:14:10, 10.55s/it]                                                  {'loss': 0.2164, 'learning_rate': 3.9539580892059086e-05, 'epoch': 0.29}
 10%|▉         | 46/468 [08:23<1:14:10, 10.55s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 15:43:37,833] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1935.59 | bwd_microstep: 3464.59 | bwd_inner_microstep: 3464.56 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 15:43:41,950] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.77 | optimizer_gradients: 0.34 | optimizer_step: 0.40
[2024-09-15 15:43:41,950] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.57 | bwd_microstep: 2611.60 | bwd_inner_microstep: 2596.16 | bwd_allreduce_microstep: 15.39 | step_microstep: 7.95
[2024-09-15 15:43:41,950] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3404.14 | bwd: 6076.20 | bwd_inner: 6060.72 | bwd_allreduce: 15.41 | step: 8.19
 10%|█         | 47/468 [08:33<1:11:53, 10.25s/it]                                                  {'loss': 0.217, 'learning_rate': 3.950952128984927e-05, 'epoch': 0.3}
 10%|█         | 47/468 [08:33<1:11:53, 10.25s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1901
[2024-09-15 15:43:47,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1940.27 | bwd_microstep: 3473.30 | bwd_inner_microstep: 3473.27 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.13
dynamic ViT batch size: 46, images per sample: 5.75, dynamic token length: 1881
[2024-09-15 15:43:52,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.42 | optimizer_gradients: 0.41 | optimizer_step: 0.39
[2024-09-15 15:43:52,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1952.80 | bwd_microstep: 3506.00 | bwd_inner_microstep: 3493.42 | bwd_allreduce_microstep: 12.52 | step_microstep: 8.34
[2024-09-15 15:43:52,889] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3893.05 | bwd: 6979.31 | bwd_inner: 6966.70 | bwd_allreduce: 12.55 | step: 8.48
 10%|█         | 48/468 [08:44<1:13:11, 10.46s/it]                                                  {'loss': 0.2196, 'learning_rate': 3.9478523373601325e-05, 'epoch': 0.31}
 10%|█         | 48/468 [08:44<1:13:11, 10.46s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 15:43:58,292] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.28 | bwd_microstep: 3447.97 | bwd_inner_microstep: 3447.94 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 15:44:03,694] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.53 | optimizer_gradients: 0.35 | optimizer_step: 0.41
[2024-09-15 15:44:03,695] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1914.98 | bwd_microstep: 3452.10 | bwd_inner_microstep: 3437.07 | bwd_allreduce_microstep: 14.98 | step_microstep: 8.28
[2024-09-15 15:44:03,695] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3834.24 | bwd: 6900.08 | bwd_inner: 6885.02 | bwd_allreduce: 15.00 | step: 8.45
 10%|█         | 49/468 [08:55<1:13:44, 10.56s/it]                                                  {'loss': 0.3036, 'learning_rate': 3.944658863416575e-05, 'epoch': 0.31}
 10%|█         | 49/468 [08:55<1:13:44, 10.56s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1366
[2024-09-15 15:44:07,761] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1459.83 | bwd_microstep: 2576.27 | bwd_inner_microstep: 2576.24 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 15:44:13,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.43 | optimizer_gradients: 0.36 | optimizer_step: 0.38
[2024-09-15 15:44:13,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1921.71 | bwd_microstep: 3492.87 | bwd_inner_microstep: 3446.64 | bwd_allreduce_microstep: 46.19 | step_microstep: 7.58
[2024-09-15 15:44:13,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3381.49 | bwd: 6069.17 | bwd_inner: 6022.88 | bwd_allreduce: 46.21 | step: 7.81
 11%|█         | 50/468 [09:04<1:11:23, 10.25s/it]                                                  {'loss': 0.2398, 'learning_rate': 3.941371860744978e-05, 'epoch': 0.32}
 11%|█         | 50/468 [09:04<1:11:23, 10.25s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1390
[2024-09-15 15:44:17,344] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1481.09 | bwd_microstep: 2615.18 | bwd_inner_microstep: 2615.10 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 15:44:23,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.37 | optimizer_step: 0.39
[2024-09-15 15:44:23,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1916.91 | bwd_microstep: 4631.83 | bwd_inner_microstep: 3442.15 | bwd_allreduce_microstep: 1189.61 | step_microstep: 7.20
[2024-09-15 15:44:23,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3397.96 | bwd: 7247.05 | bwd_inner: 6057.26 | bwd_allreduce: 1189.66 | step: 7.47
 11%|█         | 51/468 [09:15<1:12:11, 10.39s/it]                                                  {'loss': 0.2441, 'learning_rate': 3.937991487434342e-05, 'epoch': 0.33}
 11%|█         | 51/468 [09:15<1:12:11, 10.39s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1878
[2024-09-15 15:44:29,258] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1888.85 | bwd_microstep: 3408.90 | bwd_inner_microstep: 3408.87 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.19
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 15:44:33,381] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.27 | optimizer_gradients: 0.46 | optimizer_step: 0.39
[2024-09-15 15:44:33,382] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1469.59 | bwd_microstep: 2615.89 | bwd_inner_microstep: 2598.01 | bwd_allreduce_microstep: 17.81 | step_microstep: 9.42
[2024-09-15 15:44:33,382] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3358.41 | bwd: 6024.80 | bwd_inner: 6006.89 | bwd_allreduce: 17.84 | step: 9.64
 11%|█         | 52/468 [09:24<1:10:05, 10.11s/it]                                                  {'loss': 0.2922, 'learning_rate': 3.934517906064348e-05, 'epoch': 0.33}
 11%|█         | 52/468 [09:24<1:10:05, 10.11s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1901
[2024-09-15 15:44:38,837] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1939.30 | bwd_microstep: 3476.40 | bwd_inner_microstep: 3476.37 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 15:44:44,294] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.46 | optimizer_gradients: 0.34 | optimizer_step: 0.39
[2024-09-15 15:44:44,295] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1936.04 | bwd_microstep: 3486.14 | bwd_inner_microstep: 3471.12 | bwd_allreduce_microstep: 14.97 | step_microstep: 7.56
[2024-09-15 15:44:44,295] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3875.31 | bwd: 6962.55 | bwd_inner: 6947.49 | bwd_allreduce: 14.99 | step: 7.80
 11%|█▏        | 53/468 [09:35<1:11:34, 10.35s/it]                                                  {'loss': 0.3127, 'learning_rate': 3.930951283697534e-05, 'epoch': 0.34}
 11%|█▏        | 53/468 [09:35<1:11:34, 10.35s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 15:44:48,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1472.80 | bwd_microstep: 2598.11 | bwd_inner_microstep: 2598.07 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1368
[2024-09-15 15:44:54,906] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.23 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 15:44:54,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1441.77 | bwd_microstep: 5034.22 | bwd_inner_microstep: 2565.14 | bwd_allreduce_microstep: 2469.02 | step_microstep: 7.15
[2024-09-15 15:44:54,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2914.55 | bwd: 7632.37 | bwd_inner: 5163.22 | bwd_allreduce: 2469.05 | step: 7.37
 12%|█▏        | 54/468 [09:46<1:11:56, 10.43s/it]                                                  {'loss': 0.2994, 'learning_rate': 3.927291791871264e-05, 'epoch': 0.35}
 12%|█▏        | 54/468 [09:46<1:11:56, 10.43s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1389
[2024-09-15 15:44:59,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1477.08 | bwd_microstep: 2612.19 | bwd_inner_microstep: 2612.17 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1876
[2024-09-15 15:45:04,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.43 | optimizer_gradients: 0.33 | optimizer_step: 0.38
[2024-09-15 15:45:04,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1899.88 | bwd_microstep: 3436.57 | bwd_inner_microstep: 3421.77 | bwd_allreduce_microstep: 14.76 | step_microstep: 7.69
[2024-09-15 15:45:04,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3376.94 | bwd: 6048.78 | bwd_inner: 6033.93 | bwd_allreduce: 14.78 | step: 7.91
 12%|█▏        | 55/468 [09:55<1:09:50, 10.15s/it]                                                  {'loss': 0.2534, 'learning_rate': 3.923539606589473e-05, 'epoch': 0.35}
 12%|█▏        | 55/468 [09:55<1:09:50, 10.15s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1396
[2024-09-15 15:45:08,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1507.66 | bwd_microstep: 2666.51 | bwd_inner_microstep: 2666.48 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1369
[2024-09-15 15:45:15,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 4.05 | optimizer_gradients: 0.40 | optimizer_step: 0.38
[2024-09-15 15:45:15,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1440.14 | bwd_microstep: 4980.32 | bwd_inner_microstep: 2561.03 | bwd_allreduce_microstep: 2419.23 | step_microstep: 10.49
[2024-09-15 15:45:15,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2947.77 | bwd: 7646.84 | bwd_inner: 5227.51 | bwd_allreduce: 2419.26 | step: 10.71
 12%|█▏        | 56/468 [10:06<1:10:44, 10.30s/it]                                                  {'loss': 0.2855, 'learning_rate': 3.919694908314209e-05, 'epoch': 0.36}
 12%|█▏        | 56/468 [10:06<1:10:44, 10.30s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 15:45:20,441] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.64 | bwd_microstep: 3435.70 | bwd_inner_microstep: 3435.68 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 15:45:25,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.33 | optimizer_gradients: 0.36 | optimizer_step: 0.39
[2024-09-15 15:45:25,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1911.62 | bwd_microstep: 3444.66 | bwd_inner_microstep: 3429.91 | bwd_allreduce_microstep: 14.70 | step_microstep: 8.00
[2024-09-15 15:45:25,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3825.23 | bwd: 6880.38 | bwd_inner: 6865.59 | bwd_allreduce: 14.72 | step: 8.13
 12%|█▏        | 57/468 [10:17<1:11:31, 10.44s/it]                                                  {'loss': 0.2889, 'learning_rate': 3.9157578819569455e-05, 'epoch': 0.36}
 12%|█▏        | 57/468 [10:17<1:11:31, 10.44s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 15:45:31,205] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1914.92 | bwd_microstep: 3428.03 | bwd_inner_microstep: 3428.00 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.14
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1372
[2024-09-15 15:45:36,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.30 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 15:45:36,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1442.50 | bwd_microstep: 3909.83 | bwd_inner_microstep: 2561.29 | bwd_allreduce_microstep: 1348.48 | step_microstep: 7.56
[2024-09-15 15:45:36,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3357.40 | bwd: 7337.87 | bwd_inner: 5989.29 | bwd_allreduce: 1348.51 | step: 7.73
 12%|█▏        | 58/468 [10:27<1:12:00, 10.54s/it]                                                  {'loss': 0.3228, 'learning_rate': 3.9117287168696956e-05, 'epoch': 0.37}
 12%|█▏        | 58/468 [10:27<1:12:00, 10.54s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 15:45:40,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1488.36 | bwd_microstep: 2628.89 | bwd_inner_microstep: 2628.86 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 15:45:45,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.37 | optimizer_step: 0.37
[2024-09-15 15:45:45,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1469.78 | bwd_microstep: 3728.93 | bwd_inner_microstep: 2595.93 | bwd_allreduce_microstep: 1132.95 | step_microstep: 7.17
[2024-09-15 15:45:45,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2958.11 | bwd: 6357.83 | bwd_inner: 5224.79 | bwd_allreduce: 1132.98 | step: 7.39
 13%|█▎        | 59/468 [10:37<1:09:28, 10.19s/it]                                                  {'loss': 0.2197, 'learning_rate': 3.907607606835899e-05, 'epoch': 0.38}
 13%|█▎        | 59/468 [10:37<1:09:28, 10.19s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 15:45:51,343] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1911.47 | bwd_microstep: 3426.91 | bwd_inner_microstep: 3426.88 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.12
dynamic ViT batch size: 39, images per sample: 4.875, dynamic token length: 1368
[2024-09-15 15:45:56,689] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.38 | optimizer_gradients: 0.36 | optimizer_step: 0.38
[2024-09-15 15:45:56,689] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1459.67 | bwd_microstep: 3851.68 | bwd_inner_microstep: 2584.06 | bwd_allreduce_microstep: 1267.56 | step_microstep: 7.51
[2024-09-15 15:45:56,690] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3371.11 | bwd: 7278.60 | bwd_inner: 6010.95 | bwd_allreduce: 1267.59 | step: 7.66
 13%|█▎        | 60/468 [10:48<1:10:22, 10.35s/it]                                                  {'loss': 0.3422, 'learning_rate': 3.903394750061106e-05, 'epoch': 0.38}
 13%|█▎        | 60/468 [10:48<1:10:22, 10.35s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 15:46:00,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1488.11 | bwd_microstep: 2629.30 | bwd_inner_microstep: 2629.28 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 15:46:07,370] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 4.04 | optimizer_gradients: 0.39 | optimizer_step: 0.38
[2024-09-15 15:46:07,370] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.80 | bwd_microstep: 4578.62 | bwd_inner_microstep: 3440.25 | bwd_allreduce_microstep: 1138.30 | step_microstep: 10.30
[2024-09-15 15:46:07,371] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3403.88 | bwd: 7207.93 | bwd_inner: 6069.53 | bwd_allreduce: 1138.33 | step: 10.37
 13%|█▎        | 61/468 [10:58<1:10:52, 10.45s/it]                                                  {'loss': 0.2788, 'learning_rate': 3.899090349163444e-05, 'epoch': 0.39}
 13%|█▎        | 61/468 [10:58<1:10:52, 10.45s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 15:46:11,519] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1488.72 | bwd_microstep: 2628.60 | bwd_inner_microstep: 2628.57 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.26
dynamic ViT batch size: 46, images per sample: 5.75, dynamic token length: 1878
[2024-09-15 15:46:18,062] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.37 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 15:46:18,063] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1941.83 | bwd_microstep: 4565.39 | bwd_inner_microstep: 3488.63 | bwd_allreduce_microstep: 1076.70 | step_microstep: 7.78
[2024-09-15 15:46:18,063] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3430.53 | bwd: 7194.00 | bwd_inner: 6117.20 | bwd_allreduce: 1076.73 | step: 8.04
 13%|█▎        | 62/468 [11:09<1:11:11, 10.52s/it]                                                  {'loss': 0.3245, 'learning_rate': 3.8946946111638696e-05, 'epoch': 0.4}
 13%|█▎        | 62/468 [11:09<1:11:11, 10.52s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1876
[2024-09-15 15:46:23,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1899.16 | bwd_microstep: 3416.19 | bwd_inner_microstep: 3416.16 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 15:46:28,821] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.30 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 15:46:28,822] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1465.00 | bwd_microstep: 3912.84 | bwd_inner_microstep: 2589.69 | bwd_allreduce_microstep: 1323.09 | step_microstep: 7.55
[2024-09-15 15:46:28,822] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3364.13 | bwd: 7329.07 | bwd_inner: 6005.86 | bwd_allreduce: 1323.12 | step: 7.79
 13%|█▎        | 63/468 [11:20<1:11:30, 10.59s/it]                                                  {'loss': 0.336, 'learning_rate': 3.8902077474762155e-05, 'epoch': 0.4}
 13%|█▎        | 63/468 [11:20<1:11:30, 10.59s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 15:46:32,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1465.43 | bwd_microstep: 2588.53 | bwd_inner_microstep: 2588.50 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.21
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1366
[2024-09-15 15:46:39,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.31 | optimizer_gradients: 0.35 | optimizer_step: 0.36
[2024-09-15 15:46:39,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1437.67 | bwd_microstep: 5091.91 | bwd_inner_microstep: 2554.08 | bwd_allreduce_microstep: 2537.78 | step_microstep: 7.00
[2024-09-15 15:46:39,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2903.07 | bwd: 7680.46 | bwd_inner: 5142.58 | bwd_allreduce: 2537.80 | step: 7.22
 14%|█▎        | 64/468 [11:30<1:11:26, 10.61s/it]                                                  {'loss': 0.2951, 'learning_rate': 3.8856299738970225e-05, 'epoch': 0.41}
 14%|█▎        | 64/468 [11:30<1:11:26, 10.61s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 15:46:44,881] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1924.79 | bwd_microstep: 3455.20 | bwd_inner_microstep: 3455.17 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.08
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 15:46:48,986] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.54 | optimizer_gradients: 0.33 | optimizer_step: 0.38
[2024-09-15 15:46:48,986] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1465.29 | bwd_microstep: 2605.53 | bwd_inner_microstep: 2590.26 | bwd_allreduce_microstep: 15.23 | step_microstep: 7.76
[2024-09-15 15:46:48,986] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3390.05 | bwd: 6060.74 | bwd_inner: 6045.43 | bwd_allreduce: 15.25 | step: 7.87
 14%|█▍        | 65/468 [11:40<1:09:03, 10.28s/it]                                                  {'loss': 0.3156, 'learning_rate': 3.880961510595158e-05, 'epoch': 0.42}
 14%|█▍        | 65/468 [11:40<1:09:03, 10.28s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 15:46:53,132] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1488.12 | bwd_microstep: 2627.10 | bwd_inner_microstep: 2627.01 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 15:46:58,535] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.46 | optimizer_gradients: 0.36 | optimizer_step: 0.40
[2024-09-15 15:46:58,536] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.95 | bwd_microstep: 3454.12 | bwd_inner_microstep: 3439.14 | bwd_allreduce_microstep: 14.93 | step_microstep: 7.70
[2024-09-15 15:46:58,536] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3402.04 | bwd: 6081.26 | bwd_inner: 6066.15 | bwd_allreduce: 14.98 | step: 7.93
 14%|█▍        | 66/468 [11:49<1:07:24, 10.06s/it]                                                  {'loss': 0.2534, 'learning_rate': 3.876202582101229e-05, 'epoch': 0.42}
 14%|█▍        | 66/468 [11:49<1:07:24, 10.06s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1876
[2024-09-15 15:47:03,853] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1884.09 | bwd_microstep: 3402.12 | bwd_inner_microstep: 3402.09 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 15:47:09,283] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.39 | optimizer_step: 0.38
[2024-09-15 15:47:09,284] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1470.45 | bwd_microstep: 3926.39 | bwd_inner_microstep: 2596.01 | bwd_allreduce_microstep: 1330.31 | step_microstep: 7.20
[2024-09-15 15:47:09,284] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3354.51 | bwd: 7328.55 | bwd_inner: 5998.10 | bwd_allreduce: 1330.35 | step: 7.43
 14%|█▍        | 67/468 [12:00<1:08:37, 10.27s/it]                                                  {'loss': 0.2775, 'learning_rate': 3.8713534172967815e-05, 'epoch': 0.43}
 14%|█▍        | 67/468 [12:00<1:08:37, 10.27s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1908
[2024-09-15 15:47:15,446] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2375.78 | bwd_microstep: 3755.55 | bwd_inner_microstep: 3755.53 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 15:47:20,854] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.40 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 15:47:20,854] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1916.31 | bwd_microstep: 3455.26 | bwd_inner_microstep: 3440.27 | bwd_allreduce_microstep: 14.93 | step_microstep: 7.99
[2024-09-15 15:47:20,854] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4292.06 | bwd: 7210.82 | bwd_inner: 7195.80 | bwd_allreduce: 14.96 | step: 8.22
 15%|█▍        | 68/468 [12:12<1:11:03, 10.66s/it]                                                  {'loss': 0.3381, 'learning_rate': 3.866414249403295e-05, 'epoch': 0.43}
 15%|█▍        | 68/468 [12:12<1:11:03, 10.66s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 15:47:25,005] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1491.26 | bwd_microstep: 2628.94 | bwd_inner_microstep: 2628.80 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1877
[2024-09-15 15:47:31,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.34 | optimizer_gradients: 0.36 | optimizer_step: 0.38
[2024-09-15 15:47:31,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1904.07 | bwd_microstep: 4691.59 | bwd_inner_microstep: 3425.35 | bwd_allreduce_microstep: 1266.19 | step_microstep: 7.19
[2024-09-15 15:47:31,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3395.31 | bwd: 7320.57 | bwd_inner: 6054.14 | bwd_allreduce: 1266.30 | step: 7.44
 15%|█▍        | 69/468 [12:23<1:11:07, 10.70s/it]                                                  {'loss': 0.2593, 'learning_rate': 3.861385315970964e-05, 'epoch': 0.44}
 15%|█▍        | 69/468 [12:23<1:11:07, 10.70s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 15:47:37,016] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.02 | bwd_microstep: 3438.89 | bwd_inner_microstep: 3438.86 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.09
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1877
[2024-09-15 15:47:42,361] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.30 | optimizer_gradients: 0.34 | optimizer_step: 0.38
[2024-09-15 15:47:42,362] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1886.56 | bwd_microstep: 3424.47 | bwd_inner_microstep: 3409.39 | bwd_allreduce_microstep: 15.04 | step_microstep: 7.84
[2024-09-15 15:47:42,362] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3798.55 | bwd: 6863.37 | bwd_inner: 6848.25 | bwd_allreduce: 15.06 | step: 7.96
 15%|█▍        | 70/468 [12:33<1:11:00, 10.70s/it]                                                  {'loss': 0.166, 'learning_rate': 3.856266858867273e-05, 'epoch': 0.45}
 15%|█▍        | 70/468 [12:33<1:11:00, 10.70s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 15:47:46,518] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1492.47 | bwd_microstep: 2632.70 | bwd_inner_microstep: 2632.67 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 15:47:51,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 15:47:51,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1469.14 | bwd_microstep: 3806.80 | bwd_inner_microstep: 2595.76 | bwd_allreduce_microstep: 1210.98 | step_microstep: 7.19
[2024-09-15 15:47:51,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2961.59 | bwd: 6439.51 | bwd_inner: 5228.43 | bwd_allreduce: 1211.01 | step: 7.41
 15%|█▌        | 71/468 [12:43<1:08:22, 10.33s/it]                                                  {'loss': 0.3453, 'learning_rate': 3.851059124265363e-05, 'epoch': 0.45}
 15%|█▌        | 71/468 [12:43<1:08:22, 10.33s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 15:47:57,216] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.84 | bwd_microstep: 3442.15 | bwd_inner_microstep: 3442.12 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 15:48:02,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.38 | optimizer_gradients: 0.36 | optimizer_step: 0.39
[2024-09-15 15:48:02,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1934.79 | bwd_microstep: 3481.43 | bwd_inner_microstep: 3466.64 | bwd_allreduce_microstep: 14.74 | step_microstep: 8.04
[2024-09-15 15:48:02,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3848.61 | bwd: 6923.59 | bwd_inner: 6908.76 | bwd_allreduce: 14.76 | step: 8.30
 15%|█▌        | 72/468 [12:54<1:09:12, 10.49s/it]                                                  {'loss': 0.2238, 'learning_rate': 3.8457623626321944e-05, 'epoch': 0.46}
 15%|█▌        | 72/468 [12:54<1:09:12, 10.49s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 15:48:06,761] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.16 | bwd_microstep: 2594.36 | bwd_inner_microstep: 2594.33 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 15:48:13,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 15:48:13,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.99 | bwd_microstep: 5120.81 | bwd_inner_microstep: 2596.79 | bwd_allreduce_microstep: 2523.96 | step_microstep: 7.50
[2024-09-15 15:48:13,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2937.12 | bwd: 7715.18 | bwd_inner: 5191.12 | bwd_allreduce: 2524.00 | step: 7.73
 16%|█▌        | 73/468 [13:04<1:09:29, 10.55s/it]                                                  {'loss': 0.2773, 'learning_rate': 3.840376828716499e-05, 'epoch': 0.47}
 16%|█▌        | 73/468 [13:04<1:09:29, 10.55s/it]dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1389
[2024-09-15 15:48:17,468] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1461.37 | bwd_microstep: 2591.75 | bwd_inner_microstep: 2591.72 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 15:48:24,044] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.31 | optimizer_gradients: 0.36 | optimizer_step: 0.37
[2024-09-15 15:48:24,045] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1914.39 | bwd_microstep: 4622.85 | bwd_inner_microstep: 3443.40 | bwd_allreduce_microstep: 1179.40 | step_microstep: 7.01
[2024-09-15 15:48:24,045] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3375.73 | bwd: 7214.61 | bwd_inner: 6035.12 | bwd_allreduce: 1179.42 | step: 7.24
 16%|█▌        | 74/468 [13:15<1:09:31, 10.59s/it]                                                  {'loss': 0.2777, 'learning_rate': 3.834902781536527e-05, 'epoch': 0.47}
 16%|█▌        | 74/468 [13:15<1:09:31, 10.59s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 15:48:29,424] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1911.39 | bwd_microstep: 3436.48 | bwd_inner_microstep: 3436.46 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 5.75, dynamic token length: 1880
[2024-09-15 15:48:34,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.49 | optimizer_gradients: 0.36 | optimizer_step: 0.39
[2024-09-15 15:48:34,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1947.33 | bwd_microstep: 3503.51 | bwd_inner_microstep: 3488.52 | bwd_allreduce_microstep: 14.95 | step_microstep: 7.94
[2024-09-15 15:48:34,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3858.70 | bwd: 6940.00 | bwd_inner: 6924.97 | bwd_allreduce: 14.97 | step: 8.08
 16%|█▌        | 75/468 [13:26<1:09:53, 10.67s/it]                                                  {'loss': 0.3032, 'learning_rate': 3.8293404843675904e-05, 'epoch': 0.48}
 16%|█▌        | 75/468 [13:26<1:09:53, 10.67s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 15:48:40,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1901.98 | bwd_microstep: 3425.59 | bwd_inner_microstep: 3425.56 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 15:48:44,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.19 | optimizer_gradients: 0.50 | optimizer_step: 0.43
[2024-09-15 15:48:44,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1469.14 | bwd_microstep: 2628.08 | bwd_inner_microstep: 2593.95 | bwd_allreduce_microstep: 34.07 | step_microstep: 9.59
[2024-09-15 15:48:44,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3371.09 | bwd: 6053.70 | bwd_inner: 6019.52 | bwd_allreduce: 34.10 | step: 9.83
 16%|█▌        | 76/468 [13:35<1:07:24, 10.32s/it]                                                  {'loss': 0.2138, 'learning_rate': 3.8236902047294015e-05, 'epoch': 0.49}
 16%|█▌        | 76/468 [13:35<1:07:24, 10.32s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1876
[2024-09-15 15:48:49,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1884.69 | bwd_microstep: 3407.66 | bwd_inner_microstep: 3407.64 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1366
[2024-09-15 15:48:53,864] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.26 | optimizer_gradients: 0.39 | optimizer_step: 0.37
[2024-09-15 15:48:53,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1437.89 | bwd_microstep: 2660.77 | bwd_inner_microstep: 2558.05 | bwd_allreduce_microstep: 102.66 | step_microstep: 7.25
[2024-09-15 15:48:53,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3322.55 | bwd: 6068.46 | bwd_inner: 5965.69 | bwd_allreduce: 102.69 | step: 7.47
 16%|█▋        | 77/468 [13:45<1:05:33, 10.06s/it]                                                  {'loss': 0.2605, 'learning_rate': 3.817952214373206e-05, 'epoch': 0.49}
 16%|█▋        | 77/468 [13:45<1:05:33, 10.06s/it]dynamic ViT batch size: 46, images per sample: 5.75, dynamic token length: 1900
[2024-09-15 15:48:59,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1970.16 | bwd_microstep: 3518.65 | bwd_inner_microstep: 3518.62 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.08
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 15:49:04,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.32 | optimizer_gradients: 0.34 | optimizer_step: 0.38
[2024-09-15 15:49:04,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.66 | bwd_microstep: 3457.78 | bwd_inner_microstep: 3443.09 | bwd_allreduce_microstep: 14.63 | step_microstep: 7.93
[2024-09-15 15:49:04,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3889.80 | bwd: 6976.43 | bwd_inner: 6961.71 | bwd_allreduce: 14.66 | step: 8.04
 17%|█▋        | 78/468 [13:56<1:07:05, 10.32s/it]                                                  {'loss': 0.3205, 'learning_rate': 3.812126789268712e-05, 'epoch': 0.5}
 17%|█▋        | 78/468 [13:56<1:07:05, 10.32s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 15:49:08,957] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1492.76 | bwd_microstep: 2636.31 | bwd_inner_microstep: 2636.28 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 15:49:14,375] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.41 | optimizer_step: 0.41
[2024-09-15 15:49:14,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1469.73 | bwd_microstep: 3912.74 | bwd_inner_microstep: 2598.56 | bwd_allreduce_microstep: 1314.12 | step_microstep: 7.26
[2024-09-15 15:49:14,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2962.47 | bwd: 6549.06 | bwd_inner: 5234.84 | bwd_allreduce: 1314.15 | step: 7.48
 17%|█▋        | 79/468 [14:05<1:05:28, 10.10s/it]                                                  {'loss': 0.27, 'learning_rate': 3.806214209590819e-05, 'epoch': 0.5}
 17%|█▋        | 79/468 [14:05<1:05:28, 10.10s/it]dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1389
[2024-09-15 15:49:18,461] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1460.01 | bwd_microstep: 2595.41 | bwd_inner_microstep: 2595.27 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.22
dynamic ViT batch size: 46, images per sample: 5.75, dynamic token length: 1878
[2024-09-15 15:49:23,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.46 | optimizer_gradients: 0.35 | optimizer_step: 0.39
[2024-09-15 15:49:23,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1949.34 | bwd_microstep: 3503.56 | bwd_inner_microstep: 3488.68 | bwd_allreduce_microstep: 14.84 | step_microstep: 7.73
[2024-09-15 15:49:23,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3409.32 | bwd: 6099.00 | bwd_inner: 6083.95 | bwd_allreduce: 14.93 | step: 7.96
 17%|█▋        | 80/468 [14:15<1:04:17,  9.94s/it]                                                  {'loss': 0.2882, 'learning_rate': 3.80021475970614e-05, 'epoch': 0.51}
 17%|█▋        | 80/468 [14:15<1:04:17,  9.94s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1390
[2024-09-15 15:49:28,107] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1492.10 | bwd_microstep: 2634.97 | bwd_inner_microstep: 2634.94 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 15:49:33,518] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.47 | optimizer_gradients: 0.35 | optimizer_step: 0.38
[2024-09-15 15:49:33,519] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1916.62 | bwd_microstep: 3459.52 | bwd_inner_microstep: 3444.59 | bwd_allreduce_microstep: 14.88 | step_microstep: 7.56
[2024-09-15 15:49:33,519] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3408.70 | bwd: 6094.52 | bwd_inner: 6079.53 | bwd_allreduce: 14.90 | step: 7.79
 17%|█▋        | 81/468 [14:24<1:03:24,  9.83s/it]                                                  {'loss': 0.3081, 'learning_rate': 3.7941287281593284e-05, 'epoch': 0.52}
 17%|█▋        | 81/468 [14:24<1:03:24,  9.83s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 15:49:38,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1920.22 | bwd_microstep: 3446.49 | bwd_inner_microstep: 3446.46 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1877
[2024-09-15 15:49:44,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.42 | optimizer_gradients: 0.34 | optimizer_step: 0.38
[2024-09-15 15:49:44,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1889.51 | bwd_microstep: 3432.12 | bwd_inner_microstep: 3410.97 | bwd_allreduce_microstep: 21.10 | step_microstep: 8.47
[2024-09-15 15:49:44,274] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3809.71 | bwd: 6878.64 | bwd_inner: 6857.43 | bwd_allreduce: 21.13 | step: 8.69
 18%|█▊        | 82/468 [14:35<1:05:01, 10.11s/it]                                                  {'loss': 0.3268, 'learning_rate': 3.787956407659198e-05, 'epoch': 0.52}
 18%|█▊        | 82/468 [14:35<1:05:01, 10.11s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 15:49:48,438] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1494.26 | bwd_microstep: 2630.69 | bwd_inner_microstep: 2630.66 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 15:49:53,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.39 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 15:49:53,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1469.03 | bwd_microstep: 3782.71 | bwd_inner_microstep: 2596.64 | bwd_allreduce_microstep: 1186.01 | step_microstep: 7.45
[2024-09-15 15:49:53,726] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2963.26 | bwd: 6413.44 | bwd_inner: 5227.31 | bwd_allreduce: 1186.04 | step: 7.68
 18%|█▊        | 83/468 [14:45<1:03:35,  9.91s/it]                                                  {'loss': 0.2018, 'learning_rate': 3.781698095064647e-05, 'epoch': 0.53}
 18%|█▊        | 83/468 [14:45<1:03:35,  9.91s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1877
[2024-09-15 15:49:59,079] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1899.31 | bwd_microstep: 3423.54 | bwd_inner_microstep: 3423.52 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 15:50:04,536] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.55 | optimizer_gradients: 0.34 | optimizer_step: 0.41
[2024-09-15 15:50:04,536] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1934.03 | bwd_microstep: 3486.75 | bwd_inner_microstep: 3471.72 | bwd_allreduce_microstep: 14.98 | step_microstep: 7.93
[2024-09-15 15:50:04,536] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3833.31 | bwd: 6910.30 | bwd_inner: 6895.24 | bwd_allreduce: 15.00 | step: 8.16
 18%|█▊        | 84/468 [14:55<1:05:09, 10.18s/it]                                                  {'loss': 0.2624, 'learning_rate': 3.775354091370376e-05, 'epoch': 0.54}
 18%|█▊        | 84/468 [14:55<1:05:09, 10.18s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 15:50:08,693] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1493.63 | bwd_microstep: 2632.69 | bwd_inner_microstep: 2632.66 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 15:50:14,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.51 | optimizer_gradients: 0.34 | optimizer_step: 0.38
[2024-09-15 15:50:14,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1938.49 | bwd_microstep: 3485.89 | bwd_inner_microstep: 3470.55 | bwd_allreduce_microstep: 15.30 | step_microstep: 7.68
[2024-09-15 15:50:14,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3432.09 | bwd: 6118.59 | bwd_inner: 6103.21 | bwd_allreduce: 15.32 | step: 7.76
 18%|█▊        | 85/468 [15:05<1:03:54, 10.01s/it]                                                  {'loss': 0.2042, 'learning_rate': 3.7689247016924186e-05, 'epoch': 0.54}
 18%|█▊        | 85/468 [15:05<1:03:54, 10.01s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1876
[2024-09-15 15:50:19,589] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1936.19 | bwd_microstep: 3466.48 | bwd_inner_microstep: 3466.45 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1884
[2024-09-15 15:50:25,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.39 | optimizer_gradients: 0.35 | optimizer_step: 0.38
[2024-09-15 15:50:25,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.88 | bwd_microstep: 3464.91 | bwd_inner_microstep: 3449.81 | bwd_allreduce_microstep: 15.06 | step_microstep: 7.85
[2024-09-15 15:50:25,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3856.04 | bwd: 6931.40 | bwd_inner: 6916.26 | bwd_allreduce: 15.08 | step: 8.03
 18%|█▊        | 86/468 [15:16<1:05:20, 10.26s/it]                                                  {'loss': 0.3193, 'learning_rate': 3.7624102352534615e-05, 'epoch': 0.55}
 18%|█▊        | 86/468 [15:16<1:05:20, 10.26s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1877
[2024-09-15 15:50:30,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1932.66 | bwd_microstep: 3467.30 | bwd_inner_microstep: 3467.27 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 15:50:35,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.42 | optimizer_gradients: 0.33 | optimizer_step: 0.38
[2024-09-15 15:50:35,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1920.76 | bwd_microstep: 3461.96 | bwd_inner_microstep: 3447.02 | bwd_allreduce_microstep: 14.88 | step_microstep: 7.85
[2024-09-15 15:50:35,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3853.39 | bwd: 6929.28 | bwd_inner: 6914.29 | bwd_allreduce: 14.91 | step: 7.98
 19%|█▊        | 87/468 [15:27<1:06:17, 10.44s/it]                                                  {'loss': 0.2487, 'learning_rate': 3.755811005367974e-05, 'epoch': 0.56}
 19%|█▊        | 87/468 [15:27<1:06:17, 10.44s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1901
[2024-09-15 15:50:41,342] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1957.00 | bwd_microstep: 3496.40 | bwd_inner_microstep: 3496.37 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.11
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1368
[2024-09-15 15:50:46,639] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.40 | optimizer_gradients: 0.38 | optimizer_step: 0.40
[2024-09-15 15:50:46,640] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1457.52 | bwd_microstep: 3804.60 | bwd_inner_microstep: 2579.78 | bwd_allreduce_microstep: 1224.77 | step_microstep: 7.48
[2024-09-15 15:50:46,640] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3414.49 | bwd: 7301.01 | bwd_inner: 6076.15 | bwd_allreduce: 1224.80 | step: 7.62
 19%|█▉        | 88/468 [15:38<1:06:46, 10.54s/it]                                                  {'loss': 0.1408, 'learning_rate': 3.7491273294271386e-05, 'epoch': 0.56}
 19%|█▉        | 88/468 [15:38<1:06:46, 10.54s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1877
[2024-09-15 15:50:52,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1899.79 | bwd_microstep: 3430.95 | bwd_inner_microstep: 3430.92 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.21
dynamic ViT batch size: 46, images per sample: 5.75, dynamic token length: 1878
[2024-09-15 15:50:57,495] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.48 | optimizer_gradients: 0.36 | optimizer_step: 0.38
[2024-09-15 15:50:57,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1950.46 | bwd_microstep: 3508.02 | bwd_inner_microstep: 3493.36 | bwd_allreduce_microstep: 14.61 | step_microstep: 8.53
[2024-09-15 15:50:57,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3850.23 | bwd: 6938.98 | bwd_inner: 6924.29 | bwd_allreduce: 14.64 | step: 8.75
 19%|█▉        | 89/468 [15:48<1:07:11, 10.64s/it]                                                  {'loss': 0.276, 'learning_rate': 3.742359528883588e-05, 'epoch': 0.57}
 19%|█▉        | 89/468 [15:48<1:07:11, 10.64s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1877
[2024-09-15 15:51:02,819] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1886.38 | bwd_microstep: 3406.33 | bwd_inner_microstep: 3406.30 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 39, images per sample: 4.875, dynamic token length: 1877
[2024-09-15 15:51:08,235] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.41 | optimizer_gradients: 0.38 | optimizer_step: 0.39
[2024-09-15 15:51:08,236] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1896.67 | bwd_microstep: 3485.36 | bwd_inner_microstep: 3410.95 | bwd_allreduce_microstep: 74.36 | step_microstep: 7.35
[2024-09-15 15:51:08,236] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3783.02 | bwd: 6891.70 | bwd_inner: 6817.25 | bwd_allreduce: 74.39 | step: 7.57
 19%|█▉        | 90/468 [15:59<1:07:12, 10.67s/it]                                                  {'loss': 0.2293, 'learning_rate': 3.735507929235941e-05, 'epoch': 0.58}
 19%|█▉        | 90/468 [15:59<1:07:12, 10.67s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 15:51:13,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.06 | bwd_microstep: 3432.18 | bwd_inner_microstep: 3432.15 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 15:51:19,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.47 | optimizer_gradients: 0.33 | optimizer_step: 0.38
[2024-09-15 15:51:19,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.61 | bwd_microstep: 3461.74 | bwd_inner_microstep: 3446.60 | bwd_allreduce_microstep: 15.09 | step_microstep: 8.38
[2024-09-15 15:51:19,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3830.64 | bwd: 6893.92 | bwd_inner: 6878.75 | bwd_allreduce: 15.11 | step: 8.64
 19%|█▉        | 91/468 [16:10<1:07:15, 10.70s/it]                                                  {'loss': 0.2223, 'learning_rate': 3.7285728600131535e-05, 'epoch': 0.58}
 19%|█▉        | 91/468 [16:10<1:07:15, 10.70s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 15:51:23,189] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1496.79 | bwd_microstep: 2634.88 | bwd_inner_microstep: 2634.85 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.15
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 15:51:29,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.27 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 15:51:29,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1494.43 | bwd_microstep: 5141.10 | bwd_inner_microstep: 2638.01 | bwd_allreduce_microstep: 2503.02 | step_microstep: 7.53
[2024-09-15 15:51:29,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2991.19 | bwd: 7775.99 | bwd_inner: 5272.87 | bwd_allreduce: 2503.06 | step: 7.71
 20%|█▉        | 92/468 [16:21<1:07:19, 10.74s/it]                                                  {'loss': 0.3011, 'learning_rate': 3.7215546547586596e-05, 'epoch': 0.59}
 20%|█▉        | 92/468 [16:21<1:07:19, 10.74s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 15:51:35,244] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.34 | bwd_microstep: 3438.62 | bwd_inner_microstep: 3438.60 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.27
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 15:51:40,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.44 | optimizer_gradients: 0.34 | optimizer_step: 0.38
[2024-09-15 15:51:40,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1934.29 | bwd_microstep: 3483.80 | bwd_inner_microstep: 3468.69 | bwd_allreduce_microstep: 15.06 | step_microstep: 7.79
[2024-09-15 15:51:40,698] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3847.61 | bwd: 6922.44 | bwd_inner: 6907.29 | bwd_allreduce: 15.08 | step: 8.03
 20%|█▉        | 93/468 [16:32<1:07:19, 10.77s/it]                                                  {'loss': 0.2439, 'learning_rate': 3.7144536510143436e-05, 'epoch': 0.59}
 20%|█▉        | 93/468 [16:32<1:07:19, 10.77s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 15:51:46,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.97 | bwd_microstep: 3438.48 | bwd_inner_microstep: 3438.45 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 15:51:50,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.60 | optimizer_gradients: 0.36 | optimizer_step: 0.39
[2024-09-15 15:51:50,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.44 | bwd_microstep: 2613.09 | bwd_inner_microstep: 2597.88 | bwd_allreduce_microstep: 15.17 | step_microstep: 8.30
[2024-09-15 15:51:50,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3383.38 | bwd: 6051.58 | bwd_inner: 6036.33 | bwd_allreduce: 15.19 | step: 8.52
 20%|██        | 94/468 [16:41<1:04:46, 10.39s/it]                                                  {'loss': 0.2061, 'learning_rate': 3.707270190304294e-05, 'epoch': 0.6}
 20%|██        | 94/468 [16:41<1:04:46, 10.39s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 15:51:55,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1914.17 | bwd_microstep: 3426.07 | bwd_inner_microstep: 3426.04 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.19
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1880
[2024-09-15 15:52:00,986] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.51 | optimizer_gradients: 0.34 | optimizer_step: 0.38
[2024-09-15 15:52:00,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.95 | bwd_microstep: 3460.83 | bwd_inner_microstep: 3445.74 | bwd_allreduce_microstep: 15.04 | step_microstep: 7.77
[2024-09-15 15:52:00,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3834.10 | bwd: 6886.91 | bwd_inner: 6871.79 | bwd_allreduce: 15.06 | step: 7.99
 20%|██        | 95/468 [16:52<1:05:20, 10.51s/it]                                                  {'loss': 0.1888, 'learning_rate': 3.7000046181183834e-05, 'epoch': 0.61}
 20%|██        | 95/468 [16:52<1:05:20, 10.51s/it]dynamic ViT batch size: 41, images per sample: 5.125, dynamic token length: 1877
[2024-09-15 15:52:06,353] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1910.87 | bwd_microstep: 3423.95 | bwd_inner_microstep: 3423.93 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 15:52:11,767] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.54 | optimizer_gradients: 0.36 | optimizer_step: 0.40
[2024-09-15 15:52:11,767] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.36 | bwd_microstep: 3459.84 | bwd_inner_microstep: 3445.12 | bwd_allreduce_microstep: 14.66 | step_microstep: 7.97
[2024-09-15 15:52:11,768] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3830.20 | bwd: 6883.80 | bwd_inner: 6869.05 | bwd_allreduce: 14.68 | step: 8.22
 21%|██        | 96/468 [17:03<1:05:39, 10.59s/it]                                                  {'loss': 0.2698, 'learning_rate': 3.692657283895651e-05, 'epoch': 0.61}
 21%|██        | 96/468 [17:03<1:05:39, 10.59s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 15:52:17,161] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.46 | bwd_microstep: 3443.50 | bwd_inner_microstep: 3443.47 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.21
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 15:52:22,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.45 | optimizer_gradients: 0.34 | optimizer_step: 0.38
[2024-09-15 15:52:22,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1903.83 | bwd_microstep: 3441.69 | bwd_inner_microstep: 3426.70 | bwd_allreduce_microstep: 14.94 | step_microstep: 8.22
[2024-09-15 15:52:22,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3823.27 | bwd: 6885.20 | bwd_inner: 6870.18 | bwd_allreduce: 14.96 | step: 8.46
 21%|██        | 97/468 [17:13<1:05:49, 10.65s/it]                                                  {'loss': 0.1813, 'learning_rate': 3.6852285410074974e-05, 'epoch': 0.62}
 21%|██        | 97/468 [17:13<1:05:49, 10.65s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 15:52:27,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.09 | bwd_microstep: 3444.38 | bwd_inner_microstep: 3444.36 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.29
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 15:52:33,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.43 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 15:52:33,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1469.51 | bwd_microstep: 3745.76 | bwd_inner_microstep: 2598.35 | bwd_allreduce_microstep: 1147.36 | step_microstep: 7.88
[2024-09-15 15:52:33,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3387.57 | bwd: 7190.16 | bwd_inner: 6042.70 | bwd_allreduce: 1147.39 | step: 8.17
 21%|██        | 98/468 [17:24<1:05:38, 10.65s/it]                                                  {'loss': 0.3824, 'learning_rate': 3.6777187467406857e-05, 'epoch': 0.63}
 21%|██        | 98/468 [17:24<1:05:38, 10.65s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 15:52:38,564] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.24 | bwd_microstep: 3433.24 | bwd_inner_microstep: 3433.22 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.29
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 15:52:43,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.46 | optimizer_gradients: 0.34 | optimizer_step: 0.38
[2024-09-15 15:52:43,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.22 | bwd_microstep: 3462.08 | bwd_inner_microstep: 3446.91 | bwd_allreduce_microstep: 15.13 | step_microstep: 8.11
[2024-09-15 15:52:43,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3829.43 | bwd: 6895.34 | bwd_inner: 6880.13 | bwd_allreduce: 15.15 | step: 8.34
 21%|██        | 99/468 [17:35<1:05:44, 10.69s/it]                                                  {'loss': 0.2891, 'learning_rate': 3.6701282622801626e-05, 'epoch': 0.63}
 21%|██        | 99/468 [17:35<1:05:44, 10.69s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 15:52:49,371] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.55 | bwd_microstep: 3440.86 | bwd_inner_microstep: 3440.83 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.19
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 15:52:54,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.46 | optimizer_gradients: 0.36 | optimizer_step: 0.38
[2024-09-15 15:52:54,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.47 | bwd_microstep: 3461.07 | bwd_inner_microstep: 3445.71 | bwd_allreduce_microstep: 15.32 | step_microstep: 8.14
[2024-09-15 15:52:54,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3838.99 | bwd: 6901.94 | bwd_inner: 6886.54 | bwd_allreduce: 15.34 | step: 8.37
 21%|██▏       | 100/468 [17:46<1:05:46, 10.73s/it]                                                   {'loss': 0.1906, 'learning_rate': 3.662457452691682e-05, 'epoch': 0.64}
 21%|██▏       | 100/468 [17:46<1:05:46, 10.73s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 15:52:58,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1493.22 | bwd_microstep: 2638.29 | bwd_inner_microstep: 2638.26 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1881
[2024-09-15 15:53:04,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.60 | optimizer_gradients: 0.34 | optimizer_step: 0.40
[2024-09-15 15:53:04,412] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1940.54 | bwd_microstep: 3484.33 | bwd_inner_microstep: 3469.01 | bwd_allreduce_microstep: 15.27 | step_microstep: 7.90
[2024-09-15 15:53:04,412] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3433.74 | bwd: 6122.65 | bwd_inner: 6107.27 | bwd_allreduce: 15.29 | step: 8.12
 22%|██▏       | 101/468 [17:55<1:03:34, 10.39s/it]                                                   {'loss': 0.2127, 'learning_rate': 3.6547066869042524e-05, 'epoch': 0.65}
 22%|██▏       | 101/468 [17:55<1:03:34, 10.39s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1870
[2024-09-15 15:53:09,752] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1899.51 | bwd_microstep: 3411.10 | bwd_inner_microstep: 3411.07 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.21
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1369
[2024-09-15 15:53:15,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.52 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 15:53:15,079] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1457.84 | bwd_microstep: 3834.01 | bwd_inner_microstep: 2577.16 | bwd_allreduce_microstep: 1256.79 | step_microstep: 7.58
[2024-09-15 15:53:15,079] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3357.32 | bwd: 7245.12 | bwd_inner: 5988.23 | bwd_allreduce: 1256.82 | step: 7.79
 22%|██▏       | 102/468 [18:06<1:03:54, 10.48s/it]                                                   {'loss': 0.3177, 'learning_rate': 3.6468763376923886e-05, 'epoch': 0.65}
 22%|██▏       | 102/468 [18:06<1:03:54, 10.48s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1870
[2024-09-15 15:53:20,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1895.29 | bwd_microstep: 3415.27 | bwd_inner_microstep: 3415.24 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1368
[2024-09-15 15:53:25,717] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.45 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 15:53:25,718] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1455.84 | bwd_microstep: 3807.82 | bwd_inner_microstep: 2578.60 | bwd_allreduce_microstep: 1229.16 | step_microstep: 7.43
[2024-09-15 15:53:25,718] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3351.10 | bwd: 7223.10 | bwd_inner: 5993.85 | bwd_allreduce: 1229.19 | step: 7.67
 22%|██▏       | 103/468 [18:17<1:04:01, 10.53s/it]                                                   {'loss': 0.1738, 'learning_rate': 3.638966781658187e-05, 'epoch': 0.66}
 22%|██▏       | 103/468 [18:17<1:04:01, 10.53s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1900
[2024-09-15 15:53:31,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1921.72 | bwd_microstep: 3455.70 | bwd_inner_microstep: 3455.67 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.16
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 15:53:35,244] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.39 | optimizer_gradients: 0.39 | optimizer_step: 0.42
[2024-09-15 15:53:35,245] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1469.45 | bwd_microstep: 2613.28 | bwd_inner_microstep: 2599.95 | bwd_allreduce_microstep: 13.28 | step_microstep: 8.06
[2024-09-15 15:53:35,245] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3391.15 | bwd: 6068.99 | bwd_inner: 6055.62 | bwd_allreduce: 13.30 | step: 8.25
 22%|██▏       | 104/468 [18:26<1:02:02, 10.23s/it]                                                   {'loss': 0.1801, 'learning_rate': 3.630978399213206e-05, 'epoch': 0.66}
 22%|██▏       | 104/468 [18:26<1:02:02, 10.23s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 15:53:39,406] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1491.56 | bwd_microstep: 2636.48 | bwd_inner_microstep: 2636.45 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 15:53:44,720] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.35 | optimizer_gradients: 0.44 | optimizer_step: 0.40
[2024-09-15 15:53:44,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1469.09 | bwd_microstep: 3811.89 | bwd_inner_microstep: 2595.82 | bwd_allreduce_microstep: 1216.01 | step_microstep: 8.29
[2024-09-15 15:53:44,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2960.62 | bwd: 6448.39 | bwd_inner: 5232.27 | bwd_allreduce: 1216.05 | step: 8.35
 22%|██▏       | 105/468 [18:36<1:00:30, 10.00s/it]                                                   {'loss': 0.2335, 'learning_rate': 3.622911574560181e-05, 'epoch': 0.67}
 22%|██▏       | 105/468 [18:36<1:00:30, 10.00s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1870
[2024-09-15 15:53:50,071] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1898.15 | bwd_microstep: 3417.01 | bwd_inner_microstep: 3416.56 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.27
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1366
[2024-09-15 15:53:55,513] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 15:53:55,513] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1455.22 | bwd_microstep: 3931.00 | bwd_inner_microstep: 2571.10 | bwd_allreduce_microstep: 1359.84 | step_microstep: 7.53
[2024-09-15 15:53:55,513] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3353.34 | bwd: 7348.03 | bwd_inner: 5987.69 | bwd_allreduce: 1360.20 | step: 7.80
 23%|██▎       | 106/468 [18:46<1:01:46, 10.24s/it]                                                   {'loss': 0.2521, 'learning_rate': 3.6147666956745364e-05, 'epoch': 0.68}
 23%|██▎       | 106/468 [18:46<1:01:46, 10.24s/it]dynamic ViT batch size: 46, images per sample: 5.75, dynamic token length: 1878
[2024-09-15 15:54:00,975] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1945.92 | bwd_microstep: 3484.29 | bwd_inner_microstep: 3484.26 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.16
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 15:54:06,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.41 | optimizer_gradients: 0.35 | optimizer_step: 0.41
[2024-09-15 15:54:06,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.80 | bwd_microstep: 3459.70 | bwd_inner_microstep: 3444.23 | bwd_allreduce_microstep: 15.42 | step_microstep: 8.11
[2024-09-15 15:54:06,389] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3863.70 | bwd: 6944.00 | bwd_inner: 6928.49 | bwd_allreduce: 15.44 | step: 8.30
 23%|██▎       | 107/468 [18:57<1:02:44, 10.43s/it]                                                   {'loss': 0.204, 'learning_rate': 3.60654415428573e-05, 'epoch': 0.68}
 23%|██▎       | 107/468 [18:57<1:02:44, 10.43s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1878
[2024-09-15 15:54:11,713] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1890.93 | bwd_microstep: 3403.51 | bwd_inner_microstep: 3403.37 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.25
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 15:54:17,031] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.29 | optimizer_gradients: 0.43 | optimizer_step: 0.40
[2024-09-15 15:54:17,032] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.25 | bwd_microstep: 3809.97 | bwd_inner_microstep: 2598.69 | bwd_allreduce_microstep: 1211.22 | step_microstep: 8.18
[2024-09-15 15:54:17,032] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3362.16 | bwd: 7213.53 | bwd_inner: 6002.07 | bwd_allreduce: 1211.33 | step: 8.44
 23%|██▎       | 108/468 [19:08<1:02:57, 10.49s/it]                                                   {'loss': 0.2199, 'learning_rate': 3.598244345858412e-05, 'epoch': 0.69}
 23%|██▎       | 108/468 [19:08<1:02:57, 10.49s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 15:54:21,186] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1492.01 | bwd_microstep: 2631.32 | bwd_inner_microstep: 2631.20 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 15:54:27,750] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 15:54:27,750] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.55 | bwd_microstep: 4608.48 | bwd_inner_microstep: 3443.91 | bwd_allreduce_microstep: 1164.50 | step_microstep: 7.47
[2024-09-15 15:54:27,751] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3411.54 | bwd: 7239.84 | bwd_inner: 6075.11 | bwd_allreduce: 1164.59 | step: 7.73
 23%|██▎       | 109/468 [19:19<1:03:11, 10.56s/it]                                                   {'loss': 0.322, 'learning_rate': 3.589867669573404e-05, 'epoch': 0.7}
 23%|██▎       | 109/468 [19:19<1:03:11, 10.56s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 15:54:33,137] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.41 | bwd_microstep: 3438.26 | bwd_inner_microstep: 3438.23 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 15:54:38,557] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.44 | optimizer_gradients: 0.35 | optimizer_step: 0.39
[2024-09-15 15:54:38,557] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1920.03 | bwd_microstep: 3462.43 | bwd_inner_microstep: 3446.78 | bwd_allreduce_microstep: 15.61 | step_microstep: 7.85
[2024-09-15 15:54:38,557] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3837.41 | bwd: 6900.70 | bwd_inner: 6885.01 | bwd_allreduce: 15.63 | step: 8.03
 24%|██▎       | 110/468 [19:29<1:03:27, 10.63s/it]                                                   {'loss': 0.2733, 'learning_rate': 3.5814145283085055e-05, 'epoch': 0.7}
 24%|██▎       | 110/468 [19:29<1:03:27, 10.63s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 15:54:43,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.50 | bwd_microstep: 3432.95 | bwd_inner_microstep: 3432.92 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 15:54:49,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.35 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 15:54:49,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.49 | bwd_microstep: 3935.98 | bwd_inner_microstep: 2595.38 | bwd_allreduce_microstep: 1340.54 | step_microstep: 7.90
[2024-09-15 15:54:49,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3383.97 | bwd: 7368.94 | bwd_inner: 6028.30 | bwd_allreduce: 1340.57 | step: 8.16
 24%|██▎       | 111/468 [19:40<1:03:36, 10.69s/it]                                                   {'loss': 0.3102, 'learning_rate': 3.5728853286191075e-05, 'epoch': 0.71}
 24%|██▎       | 111/468 [19:40<1:03:36, 10.69s/it]dynamic ViT batch size: 39, images per sample: 4.875, dynamic token length: 1366
[2024-09-15 15:54:53,446] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1459.72 | bwd_microstep: 2579.38 | bwd_inner_microstep: 2579.25 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.25
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 15:54:58,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.29 | optimizer_gradients: 0.39 | optimizer_step: 0.41
[2024-09-15 15:54:58,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.62 | bwd_microstep: 3849.14 | bwd_inner_microstep: 2595.38 | bwd_allreduce_microstep: 1253.69 | step_microstep: 7.57
[2024-09-15 15:54:58,801] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2928.32 | bwd: 6428.55 | bwd_inner: 5174.64 | bwd_allreduce: 1253.78 | step: 7.83
 24%|██▍       | 112/468 [19:50<1:01:10, 10.31s/it]                                                   {'loss': 0.2365, 'learning_rate': 3.56428048071865e-05, 'epoch': 0.72}
 24%|██▍       | 112/468 [19:50<1:01:10, 10.31s/it]dynamic ViT batch size: 39, images per sample: 4.875, dynamic token length: 1878
[2024-09-15 15:55:04,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1892.89 | bwd_microstep: 3410.73 | bwd_inner_microstep: 3410.66 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25
dynamic ViT batch size: 37, images per sample: 4.625, dynamic token length: 1878
[2024-09-15 15:55:09,465] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.46 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 15:55:09,466] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1884.25 | bwd_microstep: 3411.06 | bwd_inner_microstep: 3398.52 | bwd_allreduce_microstep: 12.49 | step_microstep: 8.13
[2024-09-15 15:55:09,466] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3777.11 | bwd: 6821.83 | bwd_inner: 6809.18 | bwd_allreduce: 12.52 | step: 8.38
 24%|██▍       | 113/468 [20:00<1:01:37, 10.42s/it]                                                   {'loss': 0.1718, 'learning_rate': 3.555600398458885e-05, 'epoch': 0.72}
 24%|██▍       | 113/468 [20:00<1:01:37, 10.42s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 15:55:14,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1916.06 | bwd_microstep: 3432.56 | bwd_inner_microstep: 3432.53 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.16
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 15:55:20,298] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.59 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 15:55:20,298] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1936.77 | bwd_microstep: 3481.07 | bwd_inner_microstep: 3468.56 | bwd_allreduce_microstep: 12.47 | step_microstep: 8.68
[2024-09-15 15:55:20,298] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3852.80 | bwd: 6913.64 | bwd_inner: 6901.09 | bwd_allreduce: 12.49 | step: 8.87
 24%|██▍       | 114/468 [20:11<1:02:11, 10.54s/it]                                                   {'loss': 0.198, 'learning_rate': 3.546845499309976e-05, 'epoch': 0.73}
 24%|██▍       | 114/468 [20:11<1:02:11, 10.54s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1389
[2024-09-15 15:55:24,424] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1481.81 | bwd_microstep: 2613.20 | bwd_inner_microstep: 2613.17 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 15:55:30,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.25 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 15:55:30,914] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1470.56 | bwd_microstep: 4987.68 | bwd_inner_microstep: 2597.68 | bwd_allreduce_microstep: 2389.93 | step_microstep: 7.52
[2024-09-15 15:55:30,914] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2952.33 | bwd: 7600.89 | bwd_inner: 5210.86 | bwd_allreduce: 2389.96 | step: 7.57
 25%|██▍       | 115/468 [20:22<1:02:08, 10.56s/it]                                                   {'loss': 0.2557, 'learning_rate': 3.538016204340418e-05, 'epoch': 0.73}
 25%|██▍       | 115/468 [20:22<1:02:08, 10.56s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 15:55:35,059] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1490.28 | bwd_microstep: 2624.60 | bwd_inner_microstep: 2624.57 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 15:55:40,340] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.41 | optimizer_step: 0.40
[2024-09-15 15:55:40,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1472.46 | bwd_microstep: 3775.45 | bwd_inner_microstep: 2596.63 | bwd_allreduce_microstep: 1178.76 | step_microstep: 7.53
[2024-09-15 15:55:40,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2962.71 | bwd: 6400.06 | bwd_inner: 5221.20 | bwd_allreduce: 1178.79 | step: 7.60
 25%|██▍       | 116/468 [20:31<59:58, 10.22s/it]                                                   {'loss': 0.1825, 'learning_rate': 3.529112938196787e-05, 'epoch': 0.74}
 25%|██▍       | 116/468 [20:31<59:58, 10.22s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1901
[2024-09-15 15:55:45,782] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1943.36 | bwd_microstep: 3467.14 | bwd_inner_microstep: 3467.11 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1368
[2024-09-15 15:55:49,856] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.60 | optimizer_gradients: 0.34 | optimizer_step: 0.42
[2024-09-15 15:55:49,856] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1455.35 | bwd_microstep: 2588.29 | bwd_inner_microstep: 2573.26 | bwd_allreduce_microstep: 14.98 | step_microstep: 7.70
[2024-09-15 15:55:49,856] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3398.69 | bwd: 6055.44 | bwd_inner: 6040.38 | bwd_allreduce: 15.00 | step: 7.75
 25%|██▌       | 117/468 [20:41<58:33, 10.01s/it]                                                 {'loss': 0.3099, 'learning_rate': 3.5201361290833165e-05, 'epoch': 0.75}
 25%|██▌       | 117/468 [20:41<58:33, 10.01s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 15:55:54,014] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1495.40 | bwd_microstep: 2631.79 | bwd_inner_microstep: 2631.76 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1389
[2024-09-15 15:56:00,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.52 | optimizer_gradients: 0.39 | optimizer_step: 0.42
[2024-09-15 15:56:00,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1464.33 | bwd_microstep: 5158.54 | bwd_inner_microstep: 2597.42 | bwd_allreduce_microstep: 2561.06 | step_microstep: 7.93
[2024-09-15 15:56:00,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2959.71 | bwd: 7790.35 | bwd_inner: 5229.18 | bwd_allreduce: 2561.10 | step: 8.16
 25%|██▌       | 118/468 [20:52<59:47, 10.25s/it]                                                 {'loss': 0.3273, 'learning_rate': 3.511086208741303e-05, 'epoch': 0.75}
 25%|██▌       | 118/468 [20:52<59:47, 10.25s/it]dynamic ViT batch size: 46, images per sample: 5.75, dynamic token length: 1878
[2024-09-15 15:56:06,129] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1944.58 | bwd_microstep: 3483.23 | bwd_inner_microstep: 3483.21 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1876
[2024-09-15 15:56:11,493] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.56 | optimizer_gradients: 0.35 | optimizer_step: 0.37
[2024-09-15 15:56:11,494] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1900.71 | bwd_microstep: 3432.62 | bwd_inner_microstep: 3417.52 | bwd_allreduce_microstep: 15.06 | step_microstep: 8.30
[2024-09-15 15:56:11,494] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3845.27 | bwd: 6915.87 | bwd_inner: 6900.73 | bwd_allreduce: 15.08 | step: 8.35
 25%|██▌       | 119/468 [21:02<1:00:37, 10.42s/it]                                                   {'loss': 0.4337, 'learning_rate': 3.501963612428341e-05, 'epoch': 0.76}
 25%|██▌       | 119/468 [21:02<1:00:37, 10.42s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 15:56:16,877] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.03 | bwd_microstep: 3435.90 | bwd_inner_microstep: 3435.87 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.07
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 15:56:22,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.63 | optimizer_gradients: 0.38 | optimizer_step: 0.39
[2024-09-15 15:56:22,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.89 | bwd_microstep: 3841.45 | bwd_inner_microstep: 2594.16 | bwd_allreduce_microstep: 1247.23 | step_microstep: 8.01
[2024-09-15 15:56:22,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3388.88 | bwd: 7277.37 | bwd_inner: 6030.04 | bwd_allreduce: 1247.26 | step: 8.11
 26%|██▌       | 120/468 [21:13<1:00:59, 10.52s/it]                                                   {'loss': 0.2754, 'learning_rate': 3.492768778897388e-05, 'epoch': 0.77}
 26%|██▌       | 120/468 [21:13<1:00:59, 10.52s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 15:56:27,647] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1931.46 | bwd_microstep: 3459.48 | bwd_inner_microstep: 3459.46 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 15:56:32,997] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.58 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 15:56:32,997] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1470.35 | bwd_microstep: 3844.17 | bwd_inner_microstep: 2590.15 | bwd_allreduce_microstep: 1253.96 | step_microstep: 8.30
[2024-09-15 15:56:32,997] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3401.78 | bwd: 7303.67 | bwd_inner: 6049.61 | bwd_allreduce: 1253.99 | step: 8.55
 26%|██▌       | 121/468 [21:24<1:01:15, 10.59s/it]                                                   {'loss': 0.2226, 'learning_rate': 3.483502150375665e-05, 'epoch': 0.77}
 26%|██▌       | 121/468 [21:24<1:01:15, 10.59s/it]dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1366
[2024-09-15 15:56:37,014] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1436.71 | bwd_microstep: 2550.04 | bwd_inner_microstep: 2550.01 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 15:56:43,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.34 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 15:56:43,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.11 | bwd_microstep: 4614.35 | bwd_inner_microstep: 3440.54 | bwd_allreduce_microstep: 1173.74 | step_microstep: 7.64
[2024-09-15 15:56:43,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3353.79 | bwd: 7164.40 | bwd_inner: 5990.56 | bwd_allreduce: 1173.77 | step: 7.80
 26%|██▌       | 122/468 [21:34<1:01:04, 10.59s/it]                                                   {'loss': 0.2698, 'learning_rate': 3.474164172543386e-05, 'epoch': 0.78}
 26%|██▌       | 122/468 [21:34<1:01:04, 10.59s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1881
[2024-09-15 15:56:49,014] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1933.55 | bwd_microstep: 3465.18 | bwd_inner_microstep: 3465.15 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 15:56:54,423] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.45 | optimizer_gradients: 0.35 | optimizer_step: 0.41
[2024-09-15 15:56:54,423] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.81 | bwd_microstep: 3455.80 | bwd_inner_microstep: 3440.81 | bwd_allreduce_microstep: 14.93 | step_microstep: 8.04
[2024-09-15 15:56:54,423] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3851.32 | bwd: 6920.99 | bwd_inner: 6905.96 | bwd_allreduce: 14.97 | step: 8.11
 26%|██▋       | 123/468 [21:45<1:01:19, 10.67s/it]                                                   {'loss': 0.2, 'learning_rate': 3.464755294512325e-05, 'epoch': 0.79}
 26%|██▋       | 123/468 [21:45<1:01:19, 10.67s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 15:56:58,516] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.12 | bwd_microstep: 2591.67 | bwd_inner_microstep: 2591.49 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.29
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 15:57:05,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.28 | optimizer_gradients: 0.68 | optimizer_step: 0.46
[2024-09-15 15:57:05,205] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.32 | bwd_microstep: 5178.84 | bwd_inner_microstep: 2592.53 | bwd_allreduce_microstep: 2586.19 | step_microstep: 10.44
[2024-09-15 15:57:05,205] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2942.41 | bwd: 7770.60 | bwd_inner: 5184.02 | bwd_allreduce: 2586.37 | step: 10.72
 26%|██▋       | 124/468 [21:56<1:01:20, 10.70s/it]                                                   {'loss': 0.1865, 'learning_rate': 3.455275968804212e-05, 'epoch': 0.79}
 26%|██▋       | 124/468 [21:56<1:01:20, 10.70s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1900
[2024-09-15 15:57:10,571] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1906.96 | bwd_microstep: 3428.61 | bwd_inner_microstep: 3428.59 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.11
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 15:57:15,895] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.26 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 15:57:15,896] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.46 | bwd_microstep: 3823.76 | bwd_inner_microstep: 2589.61 | bwd_allreduce_microstep: 1234.09 | step_microstep: 7.50
[2024-09-15 15:57:15,896] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3374.39 | bwd: 7252.38 | bwd_inner: 6018.20 | bwd_allreduce: 1234.12 | step: 7.63
 27%|██▋       | 125/468 [22:07<1:01:09, 10.70s/it]                                                   {'loss': 0.2586, 'learning_rate': 3.445726651328971e-05, 'epoch': 0.8}
 27%|██▋       | 125/468 [22:07<1:01:09, 10.70s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 15:57:21,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1931.69 | bwd_microstep: 3457.20 | bwd_inner_microstep: 3457.18 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 15:57:26,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.39 | optimizer_step: 0.42
[2024-09-15 15:57:26,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.78 | bwd_microstep: 3736.06 | bwd_inner_microstep: 2589.73 | bwd_allreduce_microstep: 1146.26 | step_microstep: 7.47
[2024-09-15 15:57:26,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3400.44 | bwd: 7193.28 | bwd_inner: 6046.91 | bwd_allreduce: 1146.30 | step: 7.75
 27%|██▋       | 126/468 [22:17<1:00:54, 10.69s/it]                                                   {'loss': 0.3064, 'learning_rate': 3.4361078013627945e-05, 'epoch': 0.81}
 27%|██▋       | 126/468 [22:17<1:00:54, 10.69s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 15:57:31,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1898.61 | bwd_microstep: 3413.99 | bwd_inner_microstep: 3413.97 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 15:57:37,362] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.31 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 15:57:37,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.74 | bwd_microstep: 3964.98 | bwd_inner_microstep: 2591.20 | bwd_allreduce_microstep: 1373.72 | step_microstep: 7.56
[2024-09-15 15:57:37,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3367.32 | bwd: 7378.98 | bwd_inner: 6005.16 | bwd_allreduce: 1373.75 | step: 7.62
 27%|██▋       | 127/468 [22:28<1:00:56, 10.72s/it]                                                   {'loss': 0.21, 'learning_rate': 3.426419881526052e-05, 'epoch': 0.81}
 27%|██▋       | 127/468 [22:28<1:00:56, 10.72s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1901
[2024-09-15 15:57:42,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1937.47 | bwd_microstep: 3461.73 | bwd_inner_microstep: 3461.71 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1878
[2024-09-15 15:57:48,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.37 | optimizer_gradients: 0.41 | optimizer_step: 0.40
[2024-09-15 15:57:48,219] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1875.41 | bwd_microstep: 3515.56 | bwd_inner_microstep: 3398.02 | bwd_allreduce_microstep: 117.48 | step_microstep: 7.94
[2024-09-15 15:57:48,219] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3812.85 | bwd: 6977.33 | bwd_inner: 6859.73 | bwd_allreduce: 117.51 | step: 8.18
 27%|██▋       | 128/468 [22:39<1:00:59, 10.76s/it]                                                   {'loss': 0.1701, 'learning_rate': 3.4166633577610425e-05, 'epoch': 0.82}
 27%|██▋       | 128/468 [22:39<1:00:59, 10.76s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 15:57:53,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.76 | bwd_microstep: 3435.00 | bwd_inner_microstep: 3434.97 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.19
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 15:57:59,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.49 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 15:57:59,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.12 | bwd_microstep: 3451.75 | bwd_inner_microstep: 3436.86 | bwd_allreduce_microstep: 14.84 | step_microstep: 8.29
[2024-09-15 15:57:59,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3835.86 | bwd: 6886.76 | bwd_inner: 6871.83 | bwd_allreduce: 14.86 | step: 8.51
 28%|██▊       | 129/468 [22:50<1:00:51, 10.77s/it]                                                   {'loss': 0.2909, 'learning_rate': 3.4068386993095806e-05, 'epoch': 0.82}
 28%|██▊       | 129/468 [22:50<1:00:51, 10.77s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 15:58:04,402] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.22 | bwd_microstep: 3436.31 | bwd_inner_microstep: 3436.29 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1880
[2024-09-15 15:58:09,815] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.52 | optimizer_gradients: 0.37 | optimizer_step: 0.41
[2024-09-15 15:58:09,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1921.55 | bwd_microstep: 3455.83 | bwd_inner_microstep: 3440.61 | bwd_allreduce_microstep: 15.17 | step_microstep: 8.35
[2024-09-15 15:58:09,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3840.74 | bwd: 6892.15 | bwd_inner: 6876.90 | bwd_allreduce: 15.19 | step: 8.59
 28%|██▊       | 130/468 [23:01<1:00:44, 10.78s/it]                                                   {'loss': 0.2558, 'learning_rate': 3.396946378690435e-05, 'epoch': 0.83}
 28%|██▊       | 130/468 [23:01<1:00:44, 10.78s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 15:58:13,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1493.04 | bwd_microstep: 2629.58 | bwd_inner_microstep: 2629.55 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.14
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 15:58:19,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.49 | optimizer_gradients: 0.35 | optimizer_step: 0.41
[2024-09-15 15:58:19,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1937.65 | bwd_microstep: 3482.48 | bwd_inner_microstep: 3467.26 | bwd_allreduce_microstep: 15.17 | step_microstep: 7.66
[2024-09-15 15:58:19,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3430.67 | bwd: 6112.07 | bwd_inner: 6096.81 | bwd_allreduce: 15.20 | step: 7.82
 28%|██▊       | 131/468 [23:10<58:34, 10.43s/it]                                                   {'loss': 0.1833, 'learning_rate': 3.386986871676597e-05, 'epoch': 0.84}
 28%|██▊       | 131/468 [23:10<58:34, 10.43s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 15:58:24,812] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1920.04 | bwd_microstep: 3439.25 | bwd_inner_microstep: 3439.22 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 15:58:28,924] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.84 | optimizer_gradients: 0.42 | optimizer_step: 0.40
[2024-09-15 15:58:28,925] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1469.48 | bwd_microstep: 2606.88 | bwd_inner_microstep: 2593.61 | bwd_allreduce_microstep: 13.22 | step_microstep: 9.11
[2024-09-15 15:58:28,925] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3389.49 | bwd: 6046.14 | bwd_inner: 6032.83 | bwd_allreduce: 13.24 | step: 9.34
 28%|██▊       | 132/468 [23:20<56:51, 10.15s/it]                                                 {'loss': 0.184, 'learning_rate': 3.3769606572724e-05, 'epoch': 0.84}
 28%|██▊       | 132/468 [23:20<56:51, 10.15s/it]dynamic ViT batch size: 32, images per sample: 4.0, dynamic token length: 1878
[2024-09-15 15:58:34,164] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1842.72 | bwd_microstep: 3361.09 | bwd_inner_microstep: 3361.07 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.14
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1879
[2024-09-15 15:58:39,623] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.60 | optimizer_gradients: 0.35 | optimizer_step: 0.42
[2024-09-15 15:58:39,624] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1941.97 | bwd_microstep: 3486.50 | bwd_inner_microstep: 3471.45 | bwd_allreduce_microstep: 15.01 | step_microstep: 7.97
[2024-09-15 15:58:39,624] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3784.67 | bwd: 6847.61 | bwd_inner: 6832.51 | bwd_allreduce: 15.03 | step: 8.12
 28%|██▊       | 133/468 [23:31<57:35, 10.32s/it]                                                 {'loss': 0.2481, 'learning_rate': 3.366868217690482e-05, 'epoch': 0.85}
 28%|██▊       | 133/468 [23:31<57:35, 10.32s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 15:58:45,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1922.33 | bwd_microstep: 3444.45 | bwd_inner_microstep: 3444.42 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 15:58:50,287] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.27 | optimizer_gradients: 0.39 | optimizer_step: 0.41
[2024-09-15 15:58:50,287] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1470.87 | bwd_microstep: 3760.89 | bwd_inner_microstep: 2596.74 | bwd_allreduce_microstep: 1164.09 | step_microstep: 7.53
[2024-09-15 15:58:50,287] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3393.17 | bwd: 7205.39 | bwd_inner: 6041.17 | bwd_allreduce: 1164.12 | step: 7.79
 29%|██▊       | 134/468 [23:41<58:00, 10.42s/it]                                                 {'loss': 0.1628, 'learning_rate': 3.3567100383285925e-05, 'epoch': 0.86}
 29%|██▊       | 134/468 [23:41<58:00, 10.42s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1901
[2024-09-15 15:58:55,768] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1959.94 | bwd_microstep: 3489.39 | bwd_inner_microstep: 3489.36 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.16
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 15:59:01,229] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.45 | optimizer_gradients: 0.34 | optimizer_step: 0.40
[2024-09-15 15:59:01,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1939.42 | bwd_microstep: 3484.50 | bwd_inner_microstep: 3469.10 | bwd_allreduce_microstep: 15.35 | step_microstep: 8.13
[2024-09-15 15:59:01,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3899.33 | bwd: 6973.90 | bwd_inner: 6958.46 | bwd_allreduce: 15.38 | step: 8.33
 29%|██▉       | 135/468 [23:52<58:42, 10.58s/it]                                                 {'loss': 0.1711, 'learning_rate': 3.346486607746249e-05, 'epoch': 0.86}
 29%|██▉       | 135/468 [23:52<58:42, 10.58s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1876
[2024-09-15 15:59:06,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1903.69 | bwd_microstep: 3422.48 | bwd_inner_microstep: 3422.46 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 15:59:12,050] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.46 | optimizer_gradients: 0.35 | optimizer_step: 0.41
[2024-09-15 15:59:12,050] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1938.70 | bwd_microstep: 3485.98 | bwd_inner_microstep: 3473.29 | bwd_allreduce_microstep: 12.64 | step_microstep: 8.18
[2024-09-15 15:59:12,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3842.37 | bwd: 6908.47 | bwd_inner: 6895.75 | bwd_allreduce: 12.66 | step: 8.24
 29%|██▉       | 136/468 [24:03<58:55, 10.65s/it]                                                 {'loss': 0.2293, 'learning_rate': 3.336198417641238e-05, 'epoch': 0.87}
 29%|██▉       | 136/468 [24:03<58:55, 10.65s/it]dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1389
[2024-09-15 15:59:16,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1464.41 | bwd_microstep: 2598.14 | bwd_inner_microstep: 2598.11 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.06
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 15:59:22,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.39 | optimizer_step: 0.41
[2024-09-15 15:59:22,811] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1920.50 | bwd_microstep: 4716.45 | bwd_inner_microstep: 3448.47 | bwd_allreduce_microstep: 1267.93 | step_microstep: 7.47
[2024-09-15 15:59:22,811] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3384.88 | bwd: 7314.61 | bwd_inner: 6046.58 | bwd_allreduce: 1267.96 | step: 7.53
 29%|██▉       | 137/468 [24:14<58:56, 10.68s/it]                                                 {'loss': 0.2515, 'learning_rate': 3.325845962825966e-05, 'epoch': 0.88}
 29%|██▉       | 137/468 [24:14<58:56, 10.68s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1902
[2024-09-15 15:59:28,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1939.54 | bwd_microstep: 3472.22 | bwd_inner_microstep: 3472.19 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.07
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1367
[2024-09-15 15:59:33,488] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.28 | optimizer_gradients: 0.40 | optimizer_step: 0.41
[2024-09-15 15:59:33,488] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1470.33 | bwd_microstep: 3728.91 | bwd_inner_microstep: 2596.10 | bwd_allreduce_microstep: 1132.75 | step_microstep: 7.64
[2024-09-15 15:59:33,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3409.84 | bwd: 7201.14 | bwd_inner: 6068.30 | bwd_allreduce: 1132.78 | step: 7.73
 29%|██▉       | 138/468 [24:24<58:44, 10.68s/it]                                                 {'loss': 0.1741, 'learning_rate': 3.315429741203666e-05, 'epoch': 0.88}
 29%|██▉       | 138/468 [24:24<58:44, 10.68s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 15:59:37,646] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1493.03 | bwd_microstep: 2633.45 | bwd_inner_microstep: 2633.42 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 15:59:43,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.51 | optimizer_gradients: 0.35 | optimizer_step: 0.41
[2024-09-15 15:59:43,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1920.40 | bwd_microstep: 3459.69 | bwd_inner_microstep: 3446.17 | bwd_allreduce_microstep: 13.48 | step_microstep: 7.56
[2024-09-15 15:59:43,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3413.40 | bwd: 6093.15 | bwd_inner: 6079.59 | bwd_allreduce: 13.50 | step: 7.69
 30%|██▉       | 139/468 [24:34<56:44, 10.35s/it]                                                 {'loss': 0.1595, 'learning_rate': 3.304950253744443e-05, 'epoch': 0.89}
 30%|██▉       | 139/468 [24:34<56:44, 10.35s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1876
[2024-09-15 15:59:48,381] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1889.90 | bwd_microstep: 3400.22 | bwd_inner_microstep: 3400.15 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 46, images per sample: 5.75, dynamic token length: 1877
[2024-09-15 15:59:53,875] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.28 | optimizer_gradients: 0.35 | optimizer_step: 0.41
[2024-09-15 15:59:53,875] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1953.56 | bwd_microstep: 3505.64 | bwd_inner_microstep: 3490.51 | bwd_allreduce_microstep: 15.08 | step_microstep: 7.83
[2024-09-15 15:59:53,876] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3843.43 | bwd: 6905.89 | bwd_inner: 6890.65 | bwd_allreduce: 15.12 | step: 8.07
 30%|██▉       | 140/468 [24:45<57:20, 10.49s/it]                                                 {'loss': 0.1889, 'learning_rate': 3.294408004461188e-05, 'epoch': 0.89}
 30%|██▉       | 140/468 [24:45<57:20, 10.49s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 15:59:58,034] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1494.48 | bwd_microstep: 2633.58 | bwd_inner_microstep: 2633.49 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:00:04,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.42 | optimizer_step: 0.40
[2024-09-15 16:00:04,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1939.83 | bwd_microstep: 4602.09 | bwd_inner_microstep: 3466.51 | bwd_allreduce_microstep: 1135.52 | step_microstep: 7.48
[2024-09-15 16:00:04,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3434.29 | bwd: 7235.71 | bwd_inner: 6100.01 | bwd_allreduce: 1135.58 | step: 7.73
 30%|███       | 141/468 [24:56<57:34, 10.56s/it]                                                 {'loss': 0.2864, 'learning_rate': 3.283803500385332e-05, 'epoch': 0.9}
 30%|███       | 141/468 [24:56<57:34, 10.56s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:00:08,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1493.52 | bwd_microstep: 2628.72 | bwd_inner_microstep: 2628.69 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:00:14,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.50 | optimizer_gradients: 0.35 | optimizer_step: 0.41
[2024-09-15 16:00:14,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1920.68 | bwd_microstep: 3459.19 | bwd_inner_microstep: 3443.88 | bwd_allreduce_microstep: 15.27 | step_microstep: 7.49
[2024-09-15 16:00:14,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3414.17 | bwd: 6087.94 | bwd_inner: 6072.57 | bwd_allreduce: 15.29 | step: 7.73
 30%|███       | 142/468 [25:05<55:46, 10.27s/it]                                                 {'loss': 0.1941, 'learning_rate': 3.27313725154246e-05, 'epoch': 0.91}
 30%|███       | 142/468 [25:05<55:46, 10.27s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:00:19,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1921.49 | bwd_microstep: 3440.51 | bwd_inner_microstep: 3440.44 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:00:24,991] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.42 | optimizer_gradients: 0.36 | optimizer_step: 0.42
[2024-09-15 16:00:24,991] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1920.58 | bwd_microstep: 3459.79 | bwd_inner_microstep: 3444.81 | bwd_allreduce_microstep: 14.93 | step_microstep: 8.54
[2024-09-15 16:00:24,992] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3842.05 | bwd: 6900.33 | bwd_inner: 6885.26 | bwd_allreduce: 14.95 | step: 8.79
 31%|███       | 143/468 [25:16<56:29, 10.43s/it]                                                 {'loss': 0.2526, 'learning_rate': 3.2624097709277855e-05, 'epoch': 0.91}
 31%|███       | 143/468 [25:16<56:29, 10.43s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1390
[2024-09-15 16:00:29,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1497.55 | bwd_microstep: 2634.02 | bwd_inner_microstep: 2633.98 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:00:35,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.28 | optimizer_gradients: 0.39 | optimizer_step: 0.41
[2024-09-15 16:00:35,657] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.33 | bwd_microstep: 4996.13 | bwd_inner_microstep: 2597.04 | bwd_allreduce_microstep: 2399.04 | step_microstep: 7.57
[2024-09-15 16:00:35,657] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2968.86 | bwd: 7630.19 | bwd_inner: 5231.02 | bwd_allreduce: 2399.07 | step: 7.80
 31%|███       | 144/468 [25:27<56:41, 10.50s/it]                                                 {'loss': 0.1784, 'learning_rate': 3.251621574481475e-05, 'epoch': 0.92}
 31%|███       | 144/468 [25:27<56:41, 10.50s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:00:41,046] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.76 | bwd_microstep: 3440.40 | bwd_inner_microstep: 3440.37 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:00:46,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.47 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:00:46,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1934.85 | bwd_microstep: 3479.54 | bwd_inner_microstep: 3465.31 | bwd_allreduce_microstep: 14.18 | step_microstep: 8.37
[2024-09-15 16:00:46,498] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3852.59 | bwd: 6919.95 | bwd_inner: 6905.69 | bwd_allreduce: 14.20 | step: 8.61
 31%|███       | 145/468 [25:37<57:04, 10.60s/it]                                                 {'loss': 0.3217, 'learning_rate': 3.240773181063834e-05, 'epoch': 0.93}
 31%|███       | 145/468 [25:37<57:04, 10.60s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1877
[2024-09-15 16:00:51,819] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1890.27 | bwd_microstep: 3400.48 | bwd_inner_microstep: 3400.46 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:00:57,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.55 | optimizer_gradients: 0.36 | optimizer_step: 0.40
[2024-09-15 16:00:57,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1923.36 | bwd_microstep: 3459.27 | bwd_inner_microstep: 3444.02 | bwd_allreduce_microstep: 15.21 | step_microstep: 8.49
[2024-09-15 16:00:57,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3813.61 | bwd: 6859.78 | bwd_inner: 6844.47 | bwd_allreduce: 15.23 | step: 8.73
 31%|███       | 146/468 [25:48<57:07, 10.64s/it]                                                 {'loss': 0.2438, 'learning_rate': 3.229865112430352e-05, 'epoch': 0.93}
 31%|███       | 146/468 [25:48<57:07, 10.64s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1877
[2024-09-15 16:01:02,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1938.26 | bwd_microstep: 3461.13 | bwd_inner_microstep: 3461.10 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.13
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1881
[2024-09-15 16:01:08,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.27 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:01:08,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1893.79 | bwd_microstep: 3421.50 | bwd_inner_microstep: 3406.21 | bwd_allreduce_microstep: 15.24 | step_microstep: 7.87
[2024-09-15 16:01:08,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3832.03 | bwd: 6882.64 | bwd_inner: 6867.31 | bwd_allreduce: 15.26 | step: 8.03
 31%|███▏      | 147/468 [25:59<57:10, 10.69s/it]                                                 {'loss': 0.2797, 'learning_rate': 3.218897893206608e-05, 'epoch': 0.94}
 31%|███▏      | 147/468 [25:59<57:10, 10.69s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:01:12,128] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1473.11 | bwd_microstep: 2598.20 | bwd_inner_microstep: 2598.17 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:01:18,772] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:01:18,772] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1472.27 | bwd_microstep: 5137.56 | bwd_inner_microstep: 2594.08 | bwd_allreduce_microstep: 2543.42 | step_microstep: 7.44
[2024-09-15 16:01:18,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2945.36 | bwd: 7735.79 | bwd_inner: 5192.26 | bwd_allreduce: 2543.45 | step: 7.69
 32%|███▏      | 148/468 [26:10<57:05, 10.70s/it]                                                 {'loss': 0.2358, 'learning_rate': 3.2078720508630427e-05, 'epoch': 0.95}
 32%|███▏      | 148/468 [26:10<57:05, 10.70s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:01:24,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1933.64 | bwd_microstep: 3461.00 | bwd_inner_microstep: 3460.97 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:01:29,608] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.42 | optimizer_gradients: 0.37 | optimizer_step: 0.40
[2024-09-15 16:01:29,608] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.13 | bwd_microstep: 3454.23 | bwd_inner_microstep: 3439.14 | bwd_allreduce_microstep: 15.05 | step_microstep: 8.07
[2024-09-15 16:01:29,608] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3852.74 | bwd: 6915.24 | bwd_inner: 6900.11 | bwd_allreduce: 15.07 | step: 8.31
 32%|███▏      | 149/468 [26:21<57:07, 10.74s/it]                                                 {'loss': 0.2403, 'learning_rate': 3.196788115689584e-05, 'epoch': 0.95}
 32%|███▏      | 149/468 [26:21<57:07, 10.74s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:01:34,997] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.77 | bwd_microstep: 3438.64 | bwd_inner_microstep: 3438.62 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:01:40,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.34 | optimizer_gradients: 0.41 | optimizer_step: 0.40
[2024-09-15 16:01:40,274] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1469.93 | bwd_microstep: 3767.86 | bwd_inner_microstep: 2592.34 | bwd_allreduce_microstep: 1175.46 | step_microstep: 7.65
[2024-09-15 16:01:40,274] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3389.67 | bwd: 7206.52 | bwd_inner: 6030.96 | bwd_allreduce: 1175.49 | step: 7.90
 32%|███▏      | 150/468 [26:31<56:49, 10.72s/it]                                                 {'loss': 0.2295, 'learning_rate': 3.185646620770146e-05, 'epoch': 0.96}
 32%|███▏      | 150/468 [26:31<56:49, 10.72s/it]dynamic ViT batch size: 45, images per sample: 5.625, dynamic token length: 1878
[2024-09-15 16:01:45,720] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1939.84 | bwd_microstep: 3474.34 | bwd_inner_microstep: 3474.31 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.28
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1368
[2024-09-15 16:01:50,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.31 | optimizer_gradients: 0.42 | optimizer_step: 0.40
[2024-09-15 16:01:50,856] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1440.08 | bwd_microstep: 3661.92 | bwd_inner_microstep: 2554.65 | bwd_allreduce_microstep: 1107.22 | step_microstep: 7.64
[2024-09-15 16:01:50,856] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3379.89 | bwd: 7136.27 | bwd_inner: 6028.96 | bwd_allreduce: 1107.23 | step: 7.94
 32%|███▏      | 151/468 [26:42<56:25, 10.68s/it]                                                 {'loss': 0.2063, 'learning_rate': 3.1744481019569885e-05, 'epoch': 0.96}
 32%|███▏      | 151/468 [26:42<56:25, 10.68s/it]dynamic ViT batch size: 46, images per sample: 5.75, dynamic token length: 1878
[2024-09-15 16:01:56,317] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1946.85 | bwd_microstep: 3482.23 | bwd_inner_microstep: 3482.19 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1880
[2024-09-15 16:02:01,767] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.44 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:02:01,767] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1934.67 | bwd_microstep: 3480.12 | bwd_inner_microstep: 3465.22 | bwd_allreduce_microstep: 14.85 | step_microstep: 8.21
[2024-09-15 16:02:01,767] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3881.50 | bwd: 6962.39 | bwd_inner: 6947.42 | bwd_allreduce: 14.87 | step: 8.47
 32%|███▏      | 152/468 [26:53<56:36, 10.75s/it]                                                 {'loss': 0.2654, 'learning_rate': 3.163193097844949e-05, 'epoch': 0.97}
 32%|███▏      | 152/468 [26:53<56:36, 10.75s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:02:07,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.88 | bwd_microstep: 3435.25 | bwd_inner_microstep: 3435.23 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:02:12,563] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.43 | optimizer_gradients: 0.37 | optimizer_step: 0.41
[2024-09-15 16:02:12,564] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1920.40 | bwd_microstep: 3456.84 | bwd_inner_microstep: 3443.19 | bwd_allreduce_microstep: 13.60 | step_microstep: 8.11
[2024-09-15 16:02:12,564] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3838.25 | bwd: 6892.10 | bwd_inner: 6878.41 | bwd_allreduce: 13.62 | step: 8.26
 33%|███▎      | 153/468 [27:03<56:30, 10.76s/it]                                                 {'loss': 0.2071, 'learning_rate': 3.1518821497455326e-05, 'epoch': 0.98}
 33%|███▎      | 153/468 [27:03<56:30, 10.76s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1881
[2024-09-15 16:02:17,954] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1922.68 | bwd_microstep: 3437.62 | bwd_inner_microstep: 3437.59 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.08
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1880
[2024-09-15 16:02:23,378] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.46 | optimizer_gradients: 0.34 | optimizer_step: 0.40
[2024-09-15 16:02:23,379] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1923.92 | bwd_microstep: 3464.17 | bwd_inner_microstep: 3449.08 | bwd_allreduce_microstep: 15.05 | step_microstep: 8.13
[2024-09-15 16:02:23,379] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3846.58 | bwd: 6901.80 | bwd_inner: 6886.67 | bwd_allreduce: 15.07 | step: 8.25
 33%|███▎      | 154/468 [27:14<56:24, 10.78s/it]                                                 {'loss': 0.2035, 'learning_rate': 3.1405158016608806e-05, 'epoch': 0.98}
 33%|███▎      | 154/468 [27:14<56:24, 10.78s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1394
[2024-09-15 16:02:27,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1507.36 | bwd_microstep: 2663.19 | bwd_inner_microstep: 2663.16 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.27
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1369
[2024-09-15 16:02:34,045] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:02:34,045] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1458.72 | bwd_microstep: 4972.41 | bwd_inner_microstep: 2575.10 | bwd_allreduce_microstep: 2397.24 | step_microstep: 7.49
[2024-09-15 16:02:34,046] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2966.06 | bwd: 7635.61 | bwd_inner: 5238.27 | bwd_allreduce: 2397.28 | step: 7.76
 33%|███▎      | 155/468 [27:25<56:03, 10.75s/it]                                                 {'loss': 0.2398, 'learning_rate': 3.129094600257611e-05, 'epoch': 0.99}
 33%|███▎      | 155/468 [27:25<56:03, 10.75s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1881
[2024-09-15 16:02:39,430] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.18 | bwd_microstep: 3439.16 | bwd_inner_microstep: 3439.13 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1366
[2024-09-15 16:02:44,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.28 | optimizer_gradients: 0.41 | optimizer_step: 0.41
[2024-09-15 16:02:44,811] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1438.78 | bwd_microstep: 3907.01 | bwd_inner_microstep: 2552.79 | bwd_allreduce_microstep: 1354.12 | step_microstep: 7.57
[2024-09-15 16:02:44,811] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3353.93 | bwd: 7346.20 | bwd_inner: 5991.93 | bwd_allreduce: 1354.12 | step: 7.81
 33%|███▎      | 156/468 [27:36<55:54, 10.75s/it]                                                 {'loss': 0.1623, 'learning_rate': 3.1176190948405194e-05, 'epoch': 1.0}
 33%|███▎      | 156/468 [27:36<55:54, 10.75s/it]petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
dynamic ViT batch size: 41, images per sample: 5.125, dynamic token length: 1878
[2024-09-15 16:02:51,276] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1906.34 | bwd_microstep: 3418.42 | bwd_inner_microstep: 3418.39 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
[2024-09-15 16:02:53,613] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 16:02:53,631] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 16:02:53,661] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 16:02:53,681] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 16:02:57,384] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 16:02:57,441] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 16:02:57,530] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 16:02:57,531] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 16:03:01,109] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 16:03:01,156] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 16:03:01,280] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 16:03:01,311] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 16:03:04,874] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 16:03:04,891] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 16:03:05,004] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 16:03:05,026] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:03:12,877] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.59 | optimizer_gradients: 0.34 | optimizer_step: 0.39
[2024-09-15 16:03:12,877] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1900.40 | bwd_microstep: 3420.65 | bwd_inner_microstep: 3405.84 | bwd_allreduce_microstep: 14.77 | step_microstep: 7.88
[2024-09-15 16:03:12,877] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3806.71 | bwd: 6839.12 | bwd_inner: 6824.22 | bwd_allreduce: 14.79 | step: 8.13
 34%|███▎      | 157/468 [28:04<1:22:39, 15.95s/it]                                                   {'loss': 0.2313, 'learning_rate': 3.106089837326161e-05, 'epoch': 1.0}
 34%|███▎      | 157/468 [28:04<1:22:39, 15.95s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1877
[2024-09-15 16:03:18,209] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1894.64 | bwd_microstep: 3406.46 | bwd_inner_microstep: 3406.43 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1902
[2024-09-15 16:03:23,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.50 | optimizer_gradients: 0.34 | optimizer_step: 0.38
[2024-09-15 16:03:23,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.36 | bwd_microstep: 3447.44 | bwd_inner_microstep: 3432.88 | bwd_allreduce_microstep: 14.52 | step_microstep: 8.27
[2024-09-15 16:03:23,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3811.98 | bwd: 6853.91 | bwd_inner: 6839.31 | bwd_allreduce: 14.53 | step: 8.32
 34%|███▍      | 158/468 [28:15<1:14:17, 14.38s/it]                                                   {'loss': 0.2687, 'learning_rate': 3.094507382216312e-05, 'epoch': 1.01}
 34%|███▍      | 158/468 [28:15<1:14:17, 14.38s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:03:27,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1460.55 | bwd_microstep: 2577.07 | bwd_inner_microstep: 2577.04 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.08
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:03:33,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.45 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:03:33,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1461.29 | bwd_microstep: 3918.14 | bwd_inner_microstep: 2578.81 | bwd_allreduce_microstep: 1339.27 | step_microstep: 7.46
[2024-09-15 16:03:33,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2921.81 | bwd: 6495.22 | bwd_inner: 5155.85 | bwd_allreduce: 1339.30 | step: 7.57
 34%|███▍      | 159/468 [28:24<1:06:29, 12.91s/it]                                                   {'loss': 0.246, 'learning_rate': 3.082872286571295e-05, 'epoch': 1.02}
 34%|███▍      | 159/468 [28:24<1:06:29, 12.91s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:03:37,157] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1460.00 | bwd_microstep: 2577.85 | bwd_inner_microstep: 2577.81 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:03:42,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.50 | optimizer_gradients: 0.38 | optimizer_step: 0.39
[2024-09-15 16:03:42,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1486.74 | bwd_microstep: 3886.88 | bwd_inner_microstep: 2620.29 | bwd_allreduce_microstep: 1266.53 | step_microstep: 7.57
[2024-09-15 16:03:42,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2946.71 | bwd: 6464.77 | bwd_inner: 5198.10 | bwd_allreduce: 1266.56 | step: 7.80
 34%|███▍      | 160/468 [28:33<1:00:59, 11.88s/it]                                                   {'loss': 0.191, 'learning_rate': 3.0711851099831885e-05, 'epoch': 1.02}
 34%|███▍      | 160/468 [28:33<1:00:59, 11.88s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:03:46,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1459.82 | bwd_microstep: 2578.18 | bwd_inner_microstep: 2578.15 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:03:51,965] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.38 | optimizer_step: 0.39
[2024-09-15 16:03:51,965] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1486.73 | bwd_microstep: 3808.83 | bwd_inner_microstep: 2620.55 | bwd_allreduce_microstep: 1188.22 | step_microstep: 7.26
[2024-09-15 16:03:51,966] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2946.52 | bwd: 6387.02 | bwd_inner: 5198.71 | bwd_allreduce: 1188.25 | step: 7.49
 34%|███▍      | 161/468 [28:43<56:58, 11.14s/it]                                                   {'loss': 0.1543, 'learning_rate': 3.059446414548915e-05, 'epoch': 1.03}
 34%|███▍      | 161/468 [28:43<56:58, 11.14s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:03:56,035] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1460.03 | bwd_microstep: 2578.69 | bwd_inner_microstep: 2578.66 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:04:02,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.23 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:04:02,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1909.78 | bwd_microstep: 4687.16 | bwd_inner_microstep: 3433.56 | bwd_allreduce_microstep: 1253.54 | step_microstep: 7.18
[2024-09-15 16:04:02,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3369.79 | bwd: 7265.88 | bwd_inner: 6012.22 | bwd_allreduce: 1253.57 | step: 7.42
 35%|███▍      | 162/468 [28:54<56:07, 11.01s/it]                                                 {'loss': 0.1978, 'learning_rate': 3.047656764843203e-05, 'epoch': 1.04}
 35%|███▍      | 162/468 [28:54<56:07, 11.01s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:04:06,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1461.64 | bwd_microstep: 2582.79 | bwd_inner_microstep: 2582.77 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:04:13,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.45 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:04:13,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1896.70 | bwd_microstep: 4611.47 | bwd_inner_microstep: 3411.83 | bwd_allreduce_microstep: 1199.58 | step_microstep: 7.58
[2024-09-15 16:04:13,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3358.32 | bwd: 7194.28 | bwd_inner: 5994.60 | bwd_allreduce: 1199.61 | step: 7.63
 35%|███▍      | 163/468 [29:04<55:21, 10.89s/it]                                                 {'loss': 0.2698, 'learning_rate': 3.0358167278914387e-05, 'epoch': 1.04}
 35%|███▍      | 163/468 [29:04<55:21, 10.89s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:04:17,356] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1461.19 | bwd_microstep: 2583.97 | bwd_inner_microstep: 2583.90 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:04:24,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.51 | optimizer_gradients: 0.39 | optimizer_step: 0.39
[2024-09-15 16:04:24,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1465.77 | bwd_microstep: 5170.78 | bwd_inner_microstep: 2586.94 | bwd_allreduce_microstep: 2583.78 | step_microstep: 7.52
[2024-09-15 16:04:24,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2926.94 | bwd: 7754.79 | bwd_inner: 5170.84 | bwd_allreduce: 2583.82 | step: 7.75
 35%|███▌      | 164/468 [29:15<54:57, 10.85s/it]                                                 {'loss': 0.2351, 'learning_rate': 3.023926873142391e-05, 'epoch': 1.05}
 35%|███▌      | 164/468 [29:15<54:57, 10.85s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1367
[2024-09-15 16:04:28,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1460.72 | bwd_microstep: 2582.52 | bwd_inner_microstep: 2582.49 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:04:34,686] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.59 | optimizer_gradients: 0.37 | optimizer_step: 0.37
[2024-09-15 16:04:34,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1911.35 | bwd_microstep: 4637.38 | bwd_inner_microstep: 3433.83 | bwd_allreduce_microstep: 1203.49 | step_microstep: 7.54
[2024-09-15 16:04:34,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3372.05 | bwd: 7219.91 | bwd_inner: 6016.32 | bwd_allreduce: 1203.52 | step: 7.76
 35%|███▌      | 165/468 [29:26<54:29, 10.79s/it]                                                 {'loss': 0.1925, 'learning_rate': 3.011987772440825e-05, 'epoch': 1.05}
 35%|███▌      | 165/468 [29:26<54:29, 10.79s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:04:40,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1908.70 | bwd_microstep: 3425.95 | bwd_inner_microstep: 3425.93 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.06
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:04:44,211] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.52 | optimizer_gradients: 0.33 | optimizer_step: 0.38
[2024-09-15 16:04:44,211] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1487.93 | bwd_microstep: 2636.76 | bwd_inner_microstep: 2621.82 | bwd_allreduce_microstep: 14.90 | step_microstep: 7.56
[2024-09-15 16:04:44,212] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3396.61 | bwd: 6062.73 | bwd_inner: 6047.75 | bwd_allreduce: 14.92 | step: 7.65
 35%|███▌      | 166/468 [29:35<52:23, 10.41s/it]                                                 {'loss': 0.2246, 'learning_rate': 3.0000000000000004e-05, 'epoch': 1.06}
 35%|███▌      | 166/468 [29:35<52:23, 10.41s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1368
[2024-09-15 16:04:48,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1451.78 | bwd_microstep: 2566.07 | bwd_inner_microstep: 2566.04 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.21
dynamic ViT batch size: 37, images per sample: 4.625, dynamic token length: 1901
[2024-09-15 16:04:54,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.50 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:04:54,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1900.55 | bwd_microstep: 4650.43 | bwd_inner_microstep: 3414.70 | bwd_allreduce_microstep: 1235.67 | step_microstep: 7.63
[2024-09-15 16:04:54,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3352.30 | bwd: 7216.51 | bwd_inner: 5980.74 | bwd_allreduce: 1235.70 | step: 7.85
 36%|███▌      | 167/468 [29:46<52:33, 10.48s/it]                                                 {'loss': 0.2093, 'learning_rate': 2.9879641323740505e-05, 'epoch': 1.07}
 36%|███▌      | 167/468 [29:46<52:33, 10.48s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:05:00,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1910.16 | bwd_microstep: 3431.95 | bwd_inner_microstep: 3431.92 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.18
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:05:05,577] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.44 | optimizer_gradients: 0.39 | optimizer_step: 0.38
[2024-09-15 16:05:05,578] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1491.12 | bwd_microstep: 3832.54 | bwd_inner_microstep: 2628.18 | bwd_allreduce_microstep: 1204.28 | step_microstep: 7.85
[2024-09-15 16:05:05,578] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3401.25 | bwd: 7264.50 | bwd_inner: 6060.10 | bwd_allreduce: 1204.33 | step: 8.06
 36%|███▌      | 168/468 [29:56<52:46, 10.55s/it]                                                 {'loss': 0.2332, 'learning_rate': 2.9758807484302566e-05, 'epoch': 1.07}
 36%|███▌      | 168/468 [29:56<52:46, 10.55s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:05:09,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1465.14 | bwd_microstep: 2585.40 | bwd_inner_microstep: 2585.38 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.16
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1389
[2024-09-15 16:05:16,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.38 | optimizer_step: 0.41
[2024-09-15 16:05:16,304] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1460.22 | bwd_microstep: 5150.13 | bwd_inner_microstep: 2590.45 | bwd_allreduce_microstep: 2559.62 | step_microstep: 7.23
[2024-09-15 16:05:16,304] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2925.34 | bwd: 7735.55 | bwd_inner: 5175.83 | bwd_allreduce: 2559.65 | step: 7.42
 36%|███▌      | 169/468 [30:07<52:51, 10.61s/it]                                                 {'loss': 0.1983, 'learning_rate': 2.963750429321208e-05, 'epoch': 1.08}
 36%|███▌      | 169/468 [30:07<52:51, 10.61s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:05:20,375] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1460.72 | bwd_microstep: 2580.05 | bwd_inner_microstep: 2580.02 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:05:26,975] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.49 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:05:26,975] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1909.57 | bwd_microstep: 4654.52 | bwd_inner_microstep: 3424.90 | bwd_allreduce_microstep: 1229.57 | step_microstep: 7.49
[2024-09-15 16:05:26,975] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3370.27 | bwd: 7234.59 | bwd_inner: 6004.92 | bwd_allreduce: 1229.60 | step: 7.74
 36%|███▋      | 170/468 [30:18<52:46, 10.63s/it]                                                 {'loss': 0.2707, 'learning_rate': 2.9515737584568463e-05, 'epoch': 1.09}
 36%|███▋      | 170/468 [30:18<52:46, 10.63s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1876
[2024-09-15 16:05:32,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1927.24 | bwd_microstep: 3454.36 | bwd_inner_microstep: 3454.33 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:05:37,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.47 | optimizer_gradients: 0.34 | optimizer_step: 0.38
[2024-09-15 16:05:37,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1910.00 | bwd_microstep: 3440.29 | bwd_inner_microstep: 3424.99 | bwd_allreduce_microstep: 15.25 | step_microstep: 8.16
[2024-09-15 16:05:37,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3837.22 | bwd: 6894.66 | bwd_inner: 6879.33 | bwd_allreduce: 15.27 | step: 8.34
 37%|███▋      | 171/468 [30:29<52:51, 10.68s/it]                                                 {'loss': 0.2485, 'learning_rate': 2.939351321476412e-05, 'epoch': 1.09}
 37%|███▋      | 171/468 [30:29<52:51, 10.68s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1877
[2024-09-15 16:05:43,113] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1896.59 | bwd_microstep: 3412.59 | bwd_inner_microstep: 3412.57 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:05:48,426] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.42 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:05:48,427] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1489.49 | bwd_microstep: 3793.80 | bwd_inner_microstep: 2625.49 | bwd_allreduce_microstep: 1168.25 | step_microstep: 7.73
[2024-09-15 16:05:48,427] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3386.05 | bwd: 7206.41 | bwd_inner: 6038.06 | bwd_allreduce: 1168.28 | step: 7.77
 37%|███▋      | 172/468 [30:39<52:38, 10.67s/it]                                                 {'loss': 0.1947, 'learning_rate': 2.927083706220274e-05, 'epoch': 1.1}
 37%|███▋      | 172/468 [30:39<52:38, 10.67s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1876
[2024-09-15 16:05:53,734] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1882.98 | bwd_microstep: 3394.15 | bwd_inner_microstep: 3394.13 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.11
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1366
[2024-09-15 16:05:59,176] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.31 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:05:59,176] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1451.02 | bwd_microstep: 3956.21 | bwd_inner_microstep: 2569.25 | bwd_allreduce_microstep: 1386.91 | step_microstep: 7.33
[2024-09-15 16:05:59,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3333.97 | bwd: 7350.38 | bwd_inner: 5963.38 | bwd_allreduce: 1386.93 | step: 7.47
 37%|███▋      | 173/468 [30:50<52:34, 10.69s/it]                                                 {'loss': 0.1949, 'learning_rate': 2.9147715027016593e-05, 'epoch': 1.11}
 37%|███▋      | 173/468 [30:50<52:34, 10.69s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:06:03,257] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1463.95 | bwd_microstep: 2585.34 | bwd_inner_microstep: 2585.31 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:06:09,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.40 | optimizer_step: 0.37
[2024-09-15 16:06:09,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1491.28 | bwd_microstep: 5087.23 | bwd_inner_microstep: 2623.72 | bwd_allreduce_microstep: 2463.45 | step_microstep: 7.21
[2024-09-15 16:06:09,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2955.20 | bwd: 7672.58 | bwd_inner: 5209.03 | bwd_allreduce: 2463.48 | step: 7.43
 37%|███▋      | 174/468 [31:01<52:23, 10.69s/it]                                                 {'loss': 0.1907, 'learning_rate': 2.902415303078275e-05, 'epoch': 1.11}
 37%|███▋      | 174/468 [31:01<52:23, 10.69s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1368
[2024-09-15 16:06:13,911] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1448.29 | bwd_microstep: 2562.83 | bwd_inner_microstep: 2562.81 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:06:20,659] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.23 | optimizer_gradients: 0.40 | optimizer_step: 0.39
[2024-09-15 16:06:20,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1487.01 | bwd_microstep: 5226.84 | bwd_inner_microstep: 2623.00 | bwd_allreduce_microstep: 2603.77 | step_microstep: 7.22
[2024-09-15 16:06:20,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2935.28 | bwd: 7789.70 | bwd_inner: 5185.81 | bwd_allreduce: 2603.80 | step: 7.44
 37%|███▋      | 175/468 [31:12<52:21, 10.72s/it]                                                 {'loss': 0.1801, 'learning_rate': 2.8900157016238296e-05, 'epoch': 1.12}
 37%|███▋      | 175/468 [31:12<52:21, 10.72s/it]dynamic ViT batch size: 39, images per sample: 4.875, dynamic token length: 1368
[2024-09-15 16:06:24,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1456.13 | bwd_microstep: 2570.42 | bwd_inner_microstep: 2570.40 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.12
dynamic ViT batch size: 39, images per sample: 4.875, dynamic token length: 1878
[2024-09-15 16:06:31,473] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:06:31,474] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1888.23 | bwd_microstep: 4834.89 | bwd_inner_microstep: 3403.05 | bwd_allreduce_microstep: 1431.77 | step_microstep: 7.18
[2024-09-15 16:06:31,474] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3344.33 | bwd: 7405.32 | bwd_inner: 5973.45 | bwd_allreduce: 1431.80 | step: 7.33
 38%|███▊      | 176/468 [31:22<52:19, 10.75s/it]                                                 {'loss': 0.2489, 'learning_rate': 2.8775732946994508e-05, 'epoch': 1.12}
 38%|███▊      | 176/468 [31:22<52:19, 10.75s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1878
[2024-09-15 16:06:36,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1881.44 | bwd_microstep: 3400.15 | bwd_inner_microstep: 3400.13 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.08
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:06:42,109] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.44 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:06:42,110] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1486.70 | bwd_microstep: 3801.65 | bwd_inner_microstep: 2618.68 | bwd_allreduce_microstep: 1182.91 | step_microstep: 7.43
[2024-09-15 16:06:42,110] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3368.12 | bwd: 7201.81 | bwd_inner: 6018.81 | bwd_allreduce: 1182.94 | step: 7.54
 38%|███▊      | 177/468 [31:33<51:58, 10.72s/it]                                                 {'loss': 0.1969, 'learning_rate': 2.8650886807250024e-05, 'epoch': 1.13}
 38%|███▊      | 177/468 [31:33<51:58, 10.72s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:06:46,183] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1462.73 | bwd_microstep: 2580.63 | bwd_inner_microstep: 2580.60 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.06
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:06:51,577] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.60 | optimizer_gradients: 0.33 | optimizer_step: 0.38
[2024-09-15 16:06:51,578] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.78 | bwd_microstep: 3446.98 | bwd_inner_microstep: 3432.30 | bwd_allreduce_microstep: 14.64 | step_microstep: 7.58
[2024-09-15 16:06:51,578] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3375.48 | bwd: 6027.62 | bwd_inner: 6012.90 | bwd_allreduce: 14.66 | step: 7.67
 38%|███▊      | 178/468 [31:42<49:59, 10.34s/it]                                                 {'loss': 0.1724, 'learning_rate': 2.8525624601503055e-05, 'epoch': 1.14}
 38%|███▊      | 178/468 [31:42<49:59, 10.34s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:06:56,955] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.46 | bwd_microstep: 3434.63 | bwd_inner_microstep: 3434.60 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.11
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:07:01,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.73 | optimizer_gradients: 0.33 | optimizer_step: 0.38
[2024-09-15 16:07:01,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1489.22 | bwd_microstep: 2644.73 | bwd_inner_microstep: 2632.24 | bwd_allreduce_microstep: 12.45 | step_microstep: 7.90
[2024-09-15 16:07:01,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3401.66 | bwd: 6079.38 | bwd_inner: 6066.84 | bwd_allreduce: 12.47 | step: 8.05
 38%|███▊      | 179/468 [31:52<48:39, 10.10s/it]                                                 {'loss': 0.1797, 'learning_rate': 2.8399952354262566e-05, 'epoch': 1.14}
 38%|███▊      | 179/468 [31:52<48:39, 10.10s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1877
[2024-09-15 16:07:06,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1930.56 | bwd_microstep: 3456.84 | bwd_inner_microstep: 3456.81 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:07:11,956] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.60 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:07:11,956] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.00 | bwd_microstep: 3452.29 | bwd_inner_microstep: 3436.80 | bwd_allreduce_microstep: 15.45 | step_microstep: 8.25
[2024-09-15 16:07:11,957] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3847.53 | bwd: 6909.14 | bwd_inner: 6893.61 | bwd_allreduce: 15.47 | step: 8.38
 38%|███▊      | 180/468 [32:03<49:32, 10.32s/it]                                                 {'loss': 0.1341, 'learning_rate': 2.8273876109758568e-05, 'epoch': 1.15}
 38%|███▊      | 180/468 [32:03<49:32, 10.32s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:07:16,042] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.59 | bwd_microstep: 2587.26 | bwd_inner_microstep: 2587.23 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:07:21,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.63 | optimizer_gradients: 0.39 | optimizer_step: 0.38
[2024-09-15 16:07:21,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1916.06 | bwd_microstep: 3466.14 | bwd_inner_microstep: 3442.07 | bwd_allreduce_microstep: 24.02 | step_microstep: 7.87
[2024-09-15 16:07:21,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3383.63 | bwd: 6053.41 | bwd_inner: 6029.30 | bwd_allreduce: 24.04 | step: 8.04
 39%|███▊      | 181/468 [32:12<48:11, 10.08s/it]                                                 {'loss': 0.2223, 'learning_rate': 2.8147401931651363e-05, 'epoch': 1.16}
 39%|███▊      | 181/468 [32:12<48:11, 10.08s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:07:25,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1491.32 | bwd_microstep: 2628.48 | bwd_inner_microstep: 2628.46 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.09
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:07:30,964] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.44 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:07:30,965] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1493.72 | bwd_microstep: 3824.99 | bwd_inner_microstep: 2627.56 | bwd_allreduce_microstep: 1197.37 | step_microstep: 7.46
[2024-09-15 16:07:30,965] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2985.01 | bwd: 6453.49 | bwd_inner: 5256.02 | bwd_allreduce: 1197.40 | step: 7.58
 39%|███▉      | 182/468 [32:22<47:12,  9.90s/it]                                                 {'loss': 0.2221, 'learning_rate': 2.802053590273997e-05, 'epoch': 1.16}
 39%|███▉      | 182/468 [32:22<47:12,  9.90s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:07:36,344] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.70 | bwd_microstep: 3435.32 | bwd_inner_microstep: 3435.30 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.17
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:07:41,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.63 | optimizer_gradients: 0.39 | optimizer_step: 0.38
[2024-09-15 16:07:41,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.76 | bwd_microstep: 3499.15 | bwd_inner_microstep: 3435.21 | bwd_allreduce_microstep: 63.89 | step_microstep: 7.77
[2024-09-15 16:07:41,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3831.44 | bwd: 6934.49 | bwd_inner: 6870.51 | bwd_allreduce: 63.92 | step: 7.97
 39%|███▉      | 183/468 [32:33<48:22, 10.18s/it]                                                 {'loss': 0.2019, 'learning_rate': 2.789328412466953e-05, 'epoch': 1.17}
 39%|███▉      | 183/468 [32:33<48:22, 10.18s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:07:45,883] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.59 | bwd_microstep: 2588.58 | bwd_inner_microstep: 2588.55 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.08
dynamic ViT batch size: 37, images per sample: 4.625, dynamic token length: 1388
[2024-09-15 16:07:52,553] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.34 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:07:52,553] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.81 | bwd_microstep: 5162.99 | bwd_inner_microstep: 2600.48 | bwd_allreduce_microstep: 2562.45 | step_microstep: 7.32
[2024-09-15 16:07:52,553] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2940.37 | bwd: 7751.58 | bwd_inner: 5189.04 | bwd_allreduce: 2562.47 | step: 7.43
 39%|███▉      | 184/468 [32:43<49:00, 10.36s/it]                                                 {'loss': 0.1336, 'learning_rate': 2.7765652717637873e-05, 'epoch': 1.18}
 39%|███▉      | 184/468 [32:43<49:00, 10.36s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:07:57,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1911.23 | bwd_microstep: 3434.42 | bwd_inner_microstep: 3434.40 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:08:03,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.64 | optimizer_gradients: 0.33 | optimizer_step: 0.38
[2024-09-15 16:08:03,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1916.05 | bwd_microstep: 3449.56 | bwd_inner_microstep: 3436.12 | bwd_allreduce_microstep: 13.39 | step_microstep: 7.93
[2024-09-15 16:08:03,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3827.26 | bwd: 6883.99 | bwd_inner: 6870.52 | bwd_allreduce: 13.42 | step: 8.11
 40%|███▉      | 185/468 [32:54<49:26, 10.48s/it]                                                 {'loss': 0.2361, 'learning_rate': 2.763764782010116e-05, 'epoch': 1.18}
 40%|███▉      | 185/468 [32:54<49:26, 10.48s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:08:08,712] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.40 | bwd_microstep: 3435.99 | bwd_inner_microstep: 3435.96 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.21
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1389
[2024-09-15 16:08:12,849] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.58 | optimizer_gradients: 0.34 | optimizer_step: 0.37
[2024-09-15 16:08:12,850] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1477.86 | bwd_microstep: 2624.53 | bwd_inner_microstep: 2609.43 | bwd_allreduce_microstep: 15.06 | step_microstep: 7.78
[2024-09-15 16:08:12,850] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3393.23 | bwd: 6060.53 | bwd_inner: 6045.39 | bwd_allreduce: 15.08 | step: 8.01
 40%|███▉      | 186/468 [33:04<47:54, 10.19s/it]                                                 {'loss': 0.2023, 'learning_rate': 2.7509275588478606e-05, 'epoch': 1.19}
 40%|███▉      | 186/468 [33:04<47:54, 10.19s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:08:16,944] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1470.38 | bwd_microstep: 2590.58 | bwd_inner_microstep: 2590.55 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.18
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1876
[2024-09-15 16:08:22,310] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.59 | optimizer_gradients: 0.33 | optimizer_step: 0.38
[2024-09-15 16:08:22,310] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1900.99 | bwd_microstep: 3430.22 | bwd_inner_microstep: 3415.51 | bwd_allreduce_microstep: 14.66 | step_microstep: 7.80
[2024-09-15 16:08:22,311] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3371.35 | bwd: 6020.80 | bwd_inner: 6006.06 | bwd_allreduce: 14.68 | step: 8.01
 40%|███▉      | 187/468 [33:13<46:42,  9.97s/it]                                                 {'loss': 0.2428, 'learning_rate': 2.738054219685647e-05, 'epoch': 1.19}
 40%|███▉      | 187/468 [33:13<46:42,  9.97s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:08:26,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1469.42 | bwd_microstep: 2592.84 | bwd_inner_microstep: 2592.81 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.13
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:08:31,853] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.57 | optimizer_gradients: 0.33 | optimizer_step: 0.38
[2024-09-15 16:08:31,854] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1935.43 | bwd_microstep: 3479.57 | bwd_inner_microstep: 3467.12 | bwd_allreduce_microstep: 12.41 | step_microstep: 7.88
[2024-09-15 16:08:31,854] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3404.82 | bwd: 6072.42 | bwd_inner: 6059.93 | bwd_allreduce: 12.43 | step: 8.04
 40%|████      | 188/468 [33:23<45:56,  9.84s/it]                                                 {'loss': 0.2394, 'learning_rate': 2.725145383669106e-05, 'epoch': 1.2}
 40%|████      | 188/468 [33:23<45:56,  9.84s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1877
[2024-09-15 16:08:37,284] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1934.97 | bwd_microstep: 3464.73 | bwd_inner_microstep: 3464.71 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.19
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1877
[2024-09-15 16:08:42,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.60 | optimizer_gradients: 0.35 | optimizer_step: 0.38
[2024-09-15 16:08:42,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1933.44 | bwd_microstep: 3479.11 | bwd_inner_microstep: 3464.37 | bwd_allreduce_microstep: 14.69 | step_microstep: 8.28
[2024-09-15 16:08:42,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3868.38 | bwd: 6943.85 | bwd_inner: 6929.08 | bwd_allreduce: 14.71 | step: 8.51
 40%|████      | 189/468 [33:34<47:13, 10.15s/it]                                                 {'loss': 0.1464, 'learning_rate': 2.712201671651094e-05, 'epoch': 1.21}
 40%|████      | 189/468 [33:34<47:13, 10.15s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1880
[2024-09-15 16:08:48,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.48 | bwd_microstep: 3441.15 | bwd_inner_microstep: 3441.12 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:08:53,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.39 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:08:53,410] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1494.84 | bwd_microstep: 3760.77 | bwd_inner_microstep: 2630.87 | bwd_allreduce_microstep: 1129.84 | step_microstep: 7.42
[2024-09-15 16:08:53,410] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3410.30 | bwd: 7201.93 | bwd_inner: 6071.99 | bwd_allreduce: 1129.87 | step: 7.64
 41%|████      | 190/468 [33:44<47:46, 10.31s/it]                                                 {'loss': 0.2485, 'learning_rate': 2.699223706161839e-05, 'epoch': 1.21}
 41%|████      | 190/468 [33:44<47:46, 10.31s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:08:57,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.38 | bwd_microstep: 2591.60 | bwd_inner_microstep: 2591.57 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1366
[2024-09-15 16:09:03,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.28 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:09:03,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1455.44 | bwd_microstep: 4868.93 | bwd_inner_microstep: 2574.80 | bwd_allreduce_microstep: 2294.07 | step_microstep: 7.24
[2024-09-15 16:09:03,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2923.80 | bwd: 7460.54 | bwd_inner: 5166.37 | bwd_allreduce: 2294.09 | step: 7.47
 41%|████      | 191/468 [33:55<47:47, 10.35s/it]                                                 {'loss': 0.3039, 'learning_rate': 2.6862121113789917e-05, 'epoch': 1.22}
 41%|████      | 191/468 [33:55<47:47, 10.35s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1870
[2024-09-15 16:09:09,189] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1893.23 | bwd_microstep: 3403.88 | bwd_inner_microstep: 3403.86 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:09:14,632] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.62 | optimizer_gradients: 0.33 | optimizer_step: 0.41
[2024-09-15 16:09:14,632] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1929.96 | bwd_microstep: 3475.59 | bwd_inner_microstep: 3460.97 | bwd_allreduce_microstep: 14.57 | step_microstep: 7.96
[2024-09-15 16:09:14,632] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3823.16 | bwd: 6879.49 | bwd_inner: 6864.82 | bwd_allreduce: 14.59 | step: 8.18
 41%|████      | 192/468 [34:06<48:12, 10.48s/it]                                                 {'loss': 0.1387, 'learning_rate': 2.673167513097613e-05, 'epoch': 1.23}
 41%|████      | 192/468 [34:06<48:12, 10.48s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1881
[2024-09-15 16:09:20,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1935.83 | bwd_microstep: 3462.09 | bwd_inner_microstep: 3462.07 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.07
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1388
[2024-09-15 16:09:25,447] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.51 | optimizer_gradients: 0.38 | optimizer_step: 0.39
[2024-09-15 16:09:25,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1462.64 | bwd_microstep: 3888.61 | bwd_inner_microstep: 2590.93 | bwd_allreduce_microstep: 1297.62 | step_microstep: 7.86
[2024-09-15 16:09:25,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3398.44 | bwd: 7350.72 | bwd_inner: 6052.99 | bwd_allreduce: 1297.65 | step: 7.96
 41%|████      | 193/468 [34:16<48:29, 10.58s/it]                                                 {'loss': 0.2657, 'learning_rate': 2.6600905387000716e-05, 'epoch': 1.23}
 41%|████      | 193/468 [34:16<48:29, 10.58s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:09:30,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1914.59 | bwd_microstep: 3437.84 | bwd_inner_microstep: 3437.81 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.13
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1901
[2024-09-15 16:09:36,328] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.58 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:09:36,328] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1955.26 | bwd_microstep: 3506.07 | bwd_inner_microstep: 3491.13 | bwd_allreduce_microstep: 14.85 | step_microstep: 7.86
[2024-09-15 16:09:36,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3869.82 | bwd: 6943.94 | bwd_inner: 6928.95 | bwd_allreduce: 14.89 | step: 8.02
 41%|████▏     | 194/468 [34:27<48:43, 10.67s/it]                                                 {'loss': 0.176, 'learning_rate': 2.6469818171258723e-05, 'epoch': 1.24}
 41%|████▏     | 194/468 [34:27<48:43, 10.67s/it]dynamic ViT batch size: 48, images per sample: 6.0, dynamic token length: 1878
[2024-09-15 16:09:41,837] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1969.53 | bwd_microstep: 3507.52 | bwd_inner_microstep: 3507.50 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.08
dynamic ViT batch size: 46, images per sample: 5.75, dynamic token length: 1908
[2024-09-15 16:09:47,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.53 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:09:47,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1971.54 | bwd_microstep: 3540.79 | bwd_inner_microstep: 3526.02 | bwd_allreduce_microstep: 14.72 | step_microstep: 7.77
[2024-09-15 16:09:47,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3941.04 | bwd: 7048.32 | bwd_inner: 7033.52 | bwd_allreduce: 14.74 | step: 7.88
 42%|████▏     | 195/468 [34:38<49:04, 10.79s/it]                                                 {'loss': 0.1876, 'learning_rate': 2.633841978841406e-05, 'epoch': 1.25}
 42%|████▏     | 195/468 [34:38<49:04, 10.79s/it]dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1369
[2024-09-15 16:09:51,413] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1440.09 | bwd_microstep: 2558.21 | bwd_inner_microstep: 2558.18 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1900
[2024-09-15 16:09:58,168] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.30 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:09:58,168] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1925.70 | bwd_microstep: 4793.04 | bwd_inner_microstep: 3453.32 | bwd_allreduce_microstep: 1339.66 | step_microstep: 7.29
[2024-09-15 16:09:58,169] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3365.77 | bwd: 7351.29 | bwd_inner: 6011.50 | bwd_allreduce: 1339.69 | step: 7.52
 42%|████▏     | 196/468 [34:49<48:53, 10.79s/it]                                                 {'loss': 0.1726, 'learning_rate': 2.620671655809627e-05, 'epoch': 1.25}
 42%|████▏     | 196/468 [34:49<48:53, 10.79s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:10:02,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.28 | bwd_microstep: 2589.23 | bwd_inner_microstep: 2589.20 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1878
[2024-09-15 16:10:08,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.33 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:10:08,817] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1887.74 | bwd_microstep: 4639.37 | bwd_inner_microstep: 3404.96 | bwd_allreduce_microstep: 1234.35 | step_microstep: 7.36
[2024-09-15 16:10:08,817] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3355.00 | bwd: 7228.61 | bwd_inner: 5994.17 | bwd_allreduce: 1234.38 | step: 7.57
 42%|████▏     | 197/468 [35:00<48:31, 10.74s/it]                                                 {'loss': 0.2352, 'learning_rate': 2.60747148145966e-05, 'epoch': 1.26}
 42%|████▏     | 197/468 [35:00<48:31, 10.74s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:10:12,900] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1466.11 | bwd_microstep: 2586.48 | bwd_inner_microstep: 2586.45 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1878
[2024-09-15 16:10:19,481] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.56 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:10:19,482] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1887.88 | bwd_microstep: 4659.02 | bwd_inner_microstep: 3401.38 | bwd_allreduce_microstep: 1257.58 | step_microstep: 7.84
[2024-09-15 16:10:19,482] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3353.96 | bwd: 7245.51 | bwd_inner: 5987.83 | bwd_allreduce: 1257.61 | step: 8.08
 42%|████▏     | 198/468 [35:10<48:14, 10.72s/it]                                                 {'loss': 0.254, 'learning_rate': 2.594242090656335e-05, 'epoch': 1.27}
 42%|████▏     | 198/468 [35:10<48:14, 10.72s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:10:24,903] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1930.05 | bwd_microstep: 3459.92 | bwd_inner_microstep: 3459.89 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.12
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1870
[2024-09-15 16:10:30,280] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.56 | optimizer_gradients: 0.37 | optimizer_step: 0.37
[2024-09-15 16:10:30,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1883.97 | bwd_microstep: 3458.54 | bwd_inner_microstep: 3400.03 | bwd_allreduce_microstep: 58.46 | step_microstep: 8.04
[2024-09-15 16:10:30,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3814.00 | bwd: 6918.47 | bwd_inner: 6859.92 | bwd_allreduce: 58.48 | step: 8.19
 43%|████▎     | 199/468 [35:21<48:10, 10.74s/it]                                                 {'loss': 0.2125, 'learning_rate': 2.5809841196696504e-05, 'epoch': 1.27}
 43%|████▎     | 199/468 [35:21<48:10, 10.74s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1879
[2024-09-15 16:10:35,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1899.87 | bwd_microstep: 3414.85 | bwd_inner_microstep: 3414.83 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.18
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1388
[2024-09-15 16:10:41,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.33 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:10:41,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1478.35 | bwd_microstep: 3925.63 | bwd_inner_microstep: 2610.56 | bwd_allreduce_microstep: 1315.02 | step_microstep: 8.32
[2024-09-15 16:10:41,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3378.20 | bwd: 7340.50 | bwd_inner: 6025.39 | bwd_allreduce: 1315.05 | step: 8.53
 43%|████▎     | 200/468 [35:32<48:02, 10.76s/it]                                                 {'loss': 0.2049, 'learning_rate': 2.5676982061441763e-05, 'epoch': 1.28}
 43%|████▎     | 200/468 [35:32<48:02, 10.76s/it][INFO|trainer.py:2936] 2024-09-15 16:10:48,754 >> Saving model checkpoint to work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200
[INFO|configuration_utils.py:473] 2024-09-15 16:10:48,756 >> Configuration saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/config.json
[INFO|configuration_utils.py:594] 2024-09-15 16:10:48,756 >> Configuration saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/generation_config.json
[INFO|modeling_utils.py:2501] 2024-09-15 16:11:02,093 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2433] 2024-09-15 16:11:02,095 >> tokenizer config file saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/tokenizer_config.json
[INFO|tokenization_utils_base.py:2442] 2024-09-15 16:11:02,095 >> Special tokens file saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/special_tokens_map.json
[INFO|tokenization_utils_base.py:2493] 2024-09-15 16:11:02,095 >> added tokens file saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/added_tokens.json
[2024-09-15 16:11:02,728] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step200 is about to be saved!
[2024-09-15 16:11:02,755] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/global_step200/mp_rank_00_model_states.pt
[2024-09-15 16:11:02,755] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/global_step200/mp_rank_00_model_states.pt...
[2024-09-15 16:11:18,868] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/global_step200/mp_rank_00_model_states.pt.
[2024-09-15 16:11:18,871] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/global_step200/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2024-09-15 16:11:19,023] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/global_step200/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2024-09-15 16:11:19,024] [INFO] [engine.py:3488:_save_zero_checkpoint] zero checkpoint saved work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/global_step200/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2024-09-15 16:11:19,024] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step200 is ready now!
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:11:24,334] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1888.77 | bwd_microstep: 3398.08 | bwd_inner_microstep: 3398.05 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.17
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:11:29,718] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.61 | optimizer_gradients: 0.33 | optimizer_step: 0.38
[2024-09-15 16:11:29,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1909.78 | bwd_microstep: 3438.14 | bwd_inner_microstep: 3423.26 | bwd_allreduce_microstep: 14.83 | step_microstep: 7.87
[2024-09-15 16:11:29,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3798.52 | bwd: 6836.23 | bwd_inner: 6821.32 | bwd_allreduce: 14.85 | step: 8.07
 43%|████▎     | 201/468 [36:21<1:38:27, 22.13s/it]                                                   {'loss': 0.1865, 'learning_rate': 2.5543849890683813e-05, 'epoch': 1.28}
 43%|████▎     | 201/468 [36:21<1:38:27, 22.13s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:11:35,046] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1897.09 | bwd_microstep: 3399.42 | bwd_inner_microstep: 3399.39 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.18
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:11:39,191] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.70 | optimizer_gradients: 0.37 | optimizer_step: 0.41
[2024-09-15 16:11:39,192] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1481.94 | bwd_microstep: 2627.42 | bwd_inner_microstep: 2612.37 | bwd_allreduce_microstep: 15.00 | step_microstep: 7.97
[2024-09-15 16:11:39,192] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3379.00 | bwd: 6026.85 | bwd_inner: 6011.77 | bwd_allreduce: 15.02 | step: 8.18
 43%|████▎     | 202/468 [36:30<1:21:15, 18.33s/it]                                                   {'loss': 0.2203, 'learning_rate': 2.5410451087439075e-05, 'epoch': 1.29}
 43%|████▎     | 202/468 [36:30<1:21:15, 18.33s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:11:44,541] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1904.99 | bwd_microstep: 3413.54 | bwd_inner_microstep: 3413.51 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:11:49,728] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.40 | optimizer_gradients: 0.42 | optimizer_step: 0.39
[2024-09-15 16:11:49,728] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1485.10 | bwd_microstep: 3663.35 | bwd_inner_microstep: 2617.09 | bwd_allreduce_microstep: 1046.20 | step_microstep: 7.82
[2024-09-15 16:11:49,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3390.06 | bwd: 7076.90 | bwd_inner: 6030.60 | bwd_allreduce: 1046.23 | step: 8.04
 43%|████▎     | 203/468 [36:41<1:10:37, 15.99s/it]                                                   {'loss': 0.2002, 'learning_rate': 2.5276792067547672e-05, 'epoch': 1.3}
 43%|████▎     | 203/468 [36:41<1:10:37, 15.99s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:11:55,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1905.03 | bwd_microstep: 3420.18 | bwd_inner_microstep: 3420.15 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1368
[2024-09-15 16:12:00,414] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.39 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:12:00,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1449.37 | bwd_microstep: 3845.98 | bwd_inner_microstep: 2563.93 | bwd_allreduce_microstep: 1281.99 | step_microstep: 7.80
[2024-09-15 16:12:00,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3354.37 | bwd: 7266.17 | bwd_inner: 5984.08 | bwd_allreduce: 1282.02 | step: 8.04
 44%|████▎     | 204/468 [36:51<1:03:21, 14.40s/it]                                                   {'loss': 0.1524, 'learning_rate': 2.514287925936492e-05, 'epoch': 1.3}
 44%|████▎     | 204/468 [36:51<1:03:21, 14.40s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:12:05,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1927.69 | bwd_microstep: 3450.13 | bwd_inner_microstep: 3450.10 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.17
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:12:11,213] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.48 | optimizer_gradients: 0.40 | optimizer_step: 0.39
[2024-09-15 16:12:11,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.64 | bwd_microstep: 3441.08 | bwd_inner_microstep: 3426.04 | bwd_allreduce_microstep: 15.00 | step_microstep: 8.34
[2024-09-15 16:12:11,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3840.31 | bwd: 6891.22 | bwd_inner: 6876.14 | bwd_allreduce: 15.01 | step: 8.55
 44%|████▍     | 205/468 [37:02<58:23, 13.32s/it]                                                   {'loss': 0.2059, 'learning_rate': 2.500871910345212e-05, 'epoch': 1.31}
 44%|████▍     | 205/468 [37:02<58:23, 13.32s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:12:15,293] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1463.02 | bwd_microstep: 2577.56 | bwd_inner_microstep: 2577.54 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:12:21,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.56 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:12:21,899] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1485.53 | bwd_microstep: 5083.45 | bwd_inner_microstep: 2617.76 | bwd_allreduce_microstep: 2465.61 | step_microstep: 7.55
[2024-09-15 16:12:21,899] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2948.52 | bwd: 7661.03 | bwd_inner: 5195.29 | bwd_allreduce: 2465.66 | step: 7.60
 44%|████▍     | 206/468 [37:13<54:42, 12.53s/it]                                                 {'loss': 0.1885, 'learning_rate': 2.4874318052266794e-05, 'epoch': 1.32}
 44%|████▍     | 206/468 [37:13<54:42, 12.53s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1367
[2024-09-15 16:12:25,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1465.21 | bwd_microstep: 2578.06 | bwd_inner_microstep: 2578.04 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.11
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:12:32,544] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.58 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:12:32,545] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1486.51 | bwd_microstep: 5051.05 | bwd_inner_microstep: 2618.45 | bwd_allreduce_microstep: 2432.54 | step_microstep: 7.87
[2024-09-15 16:12:32,545] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2951.70 | bwd: 7629.13 | bwd_inner: 5196.49 | bwd_allreduce: 2432.57 | step: 8.01
 44%|████▍     | 207/468 [37:23<52:02, 11.96s/it]                                                 {'loss': 0.1837, 'learning_rate': 2.473968256985238e-05, 'epoch': 1.32}
 44%|████▍     | 207/468 [37:23<52:02, 11.96s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:12:37,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1930.49 | bwd_microstep: 3442.51 | bwd_inner_microstep: 3442.48 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:12:43,333] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.64 | optimizer_gradients: 0.33 | optimizer_step: 0.39
[2024-09-15 16:12:43,333] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1907.89 | bwd_microstep: 3440.34 | bwd_inner_microstep: 3424.89 | bwd_allreduce_microstep: 15.40 | step_microstep: 8.64
[2024-09-15 16:12:43,333] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3838.33 | bwd: 6882.85 | bwd_inner: 6867.37 | bwd_allreduce: 15.42 | step: 8.82
 44%|████▍     | 208/468 [37:34<50:18, 11.61s/it]                                                 {'loss': 0.1938, 'learning_rate': 2.460481913152734e-05, 'epoch': 1.33}
 44%|████▍     | 208/468 [37:34<50:18, 11.61s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1366
[2024-09-15 16:12:47,372] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1448.32 | bwd_microstep: 2560.37 | bwd_inner_microstep: 2560.26 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.23
dynamic ViT batch size: 37, images per sample: 4.625, dynamic token length: 1901
[2024-09-15 16:12:54,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.30 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:12:54,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1896.65 | bwd_microstep: 4928.88 | bwd_inner_microstep: 3409.12 | bwd_allreduce_microstep: 1519.70 | step_microstep: 7.36
[2024-09-15 16:12:54,233] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3344.95 | bwd: 7489.29 | bwd_inner: 5969.39 | bwd_allreduce: 1519.78 | step: 7.59
 45%|████▍     | 209/468 [37:45<49:12, 11.40s/it]                                                 {'loss': 0.1497, 'learning_rate': 2.4469734223573703e-05, 'epoch': 1.34}
 45%|████▍     | 209/468 [37:45<49:12, 11.40s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:12:58,306] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1462.24 | bwd_microstep: 2580.32 | bwd_inner_microstep: 2580.30 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1877
[2024-09-15 16:13:04,965] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 4.31 | optimizer_gradients: 0.43 | optimizer_step: 0.38
[2024-09-15 16:13:04,966] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1867.51 | bwd_microstep: 4753.66 | bwd_inner_microstep: 3384.61 | bwd_allreduce_microstep: 1368.99 | step_microstep: 10.62
[2024-09-15 16:13:04,966] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3329.72 | bwd: 7334.00 | bwd_inner: 5964.91 | bwd_allreduce: 1369.02 | step: 10.84
 45%|████▍     | 210/468 [37:56<48:09, 11.20s/it]                                                 {'loss': 0.2644, 'learning_rate': 2.4334434342925133e-05, 'epoch': 1.34}
 45%|████▍     | 210/468 [37:56<48:09, 11.20s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:13:10,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1907.95 | bwd_microstep: 3428.14 | bwd_inner_microstep: 3428.11 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.21
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:13:15,647] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.51 | optimizer_gradients: 0.42 | optimizer_step: 0.38
[2024-09-15 16:13:15,647] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1488.35 | bwd_microstep: 3792.19 | bwd_inner_microstep: 2623.09 | bwd_allreduce_microstep: 1169.04 | step_microstep: 7.63
[2024-09-15 16:13:15,647] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3396.28 | bwd: 7220.34 | bwd_inner: 6051.20 | bwd_allreduce: 1169.07 | step: 7.88
 45%|████▌     | 211/468 [38:07<47:18, 11.04s/it]                                                 {'loss': 0.1937, 'learning_rate': 2.4198925996854422e-05, 'epoch': 1.35}
 45%|████▌     | 211/468 [38:07<47:18, 11.04s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:13:19,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1462.75 | bwd_microstep: 2581.94 | bwd_inner_microstep: 2581.92 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:13:26,313] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 4.28 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:13:26,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.66 | bwd_microstep: 4639.89 | bwd_inner_microstep: 3435.09 | bwd_allreduce_microstep: 1204.74 | step_microstep: 10.22
[2024-09-15 16:13:26,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3375.39 | bwd: 7221.84 | bwd_inner: 6017.01 | bwd_allreduce: 1204.77 | step: 10.45
 45%|████▌     | 212/468 [38:17<46:38, 10.93s/it]                                                 {'loss': 0.2791, 'learning_rate': 2.4063215702660564e-05, 'epoch': 1.35}
 45%|████▌     | 212/468 [38:17<46:38, 10.93s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:13:30,391] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1464.49 | bwd_microstep: 2582.70 | bwd_inner_microstep: 2582.68 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:13:37,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.54 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:13:37,011] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1914.02 | bwd_microstep: 4670.40 | bwd_inner_microstep: 3436.67 | bwd_allreduce_microstep: 1233.67 | step_microstep: 7.47
[2024-09-15 16:13:37,011] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3378.49 | bwd: 7253.11 | bwd_inner: 6019.34 | bwd_allreduce: 1233.70 | step: 7.69
 46%|████▌     | 213/468 [38:28<46:09, 10.86s/it]                                                 {'loss': 0.24, 'learning_rate': 2.392730998735529e-05, 'epoch': 1.36}
 46%|████▌     | 213/468 [38:28<46:09, 10.86s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:13:41,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1462.30 | bwd_microstep: 2584.98 | bwd_inner_microstep: 2584.92 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:13:47,735] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.28 | optimizer_gradients: 0.37 | optimizer_step: 0.40
[2024-09-15 16:13:47,735] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1911.33 | bwd_microstep: 4700.18 | bwd_inner_microstep: 3432.19 | bwd_allreduce_microstep: 1267.93 | step_microstep: 7.24
[2024-09-15 16:13:47,735] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3373.61 | bwd: 7285.21 | bwd_inner: 6017.11 | bwd_allreduce: 1267.96 | step: 7.48
 46%|████▌     | 214/468 [38:39<45:48, 10.82s/it]                                                 {'loss': 0.1574, 'learning_rate': 2.379121538734912e-05, 'epoch': 1.37}
 46%|████▌     | 214/468 [38:39<45:48, 10.82s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1373
[2024-09-15 16:13:51,815] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1464.96 | bwd_microstep: 2584.33 | bwd_inner_microstep: 2584.30 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:13:57,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.65 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:13:57,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1909.29 | bwd_microstep: 3436.46 | bwd_inner_microstep: 3421.46 | bwd_allreduce_microstep: 14.95 | step_microstep: 7.83
[2024-09-15 16:13:57,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3374.23 | bwd: 6020.80 | bwd_inner: 6005.77 | bwd_allreduce: 14.97 | step: 8.01
 46%|████▌     | 215/468 [38:48<43:54, 10.41s/it]                                                 {'loss': 0.1896, 'learning_rate': 2.3654938448137062e-05, 'epoch': 1.37}
 46%|████▌     | 215/468 [38:48<43:54, 10.41s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:14:01,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1463.61 | bwd_microstep: 2582.80 | bwd_inner_microstep: 2582.77 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:14:06,665] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.64 | optimizer_gradients: 0.34 | optimizer_step: 0.37
[2024-09-15 16:14:06,666] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1911.47 | bwd_microstep: 3446.49 | bwd_inner_microstep: 3431.73 | bwd_allreduce_microstep: 14.71 | step_microstep: 7.99
[2024-09-15 16:14:06,666] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3375.05 | bwd: 6029.29 | bwd_inner: 6014.50 | bwd_allreduce: 14.73 | step: 8.22
 46%|████▌     | 216/468 [38:58<42:32, 10.13s/it]                                                 {'loss': 0.1865, 'learning_rate': 2.351848572398371e-05, 'epoch': 1.38}
 46%|████▌     | 216/468 [38:58<42:32, 10.13s/it]dynamic ViT batch size: 39, images per sample: 4.875, dynamic token length: 1367
[2024-09-15 16:14:10,743] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1460.14 | bwd_microstep: 2578.63 | bwd_inner_microstep: 2578.60 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:14:17,381] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.52 | optimizer_gradients: 0.38 | optimizer_step: 0.39
[2024-09-15 16:14:17,381] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1490.65 | bwd_microstep: 5111.97 | bwd_inner_microstep: 2626.37 | bwd_allreduce_microstep: 2485.54 | step_microstep: 7.59
[2024-09-15 16:14:17,382] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2950.76 | bwd: 7690.62 | bwd_inner: 5204.97 | bwd_allreduce: 2485.58 | step: 7.84
 46%|████▋     | 217/468 [39:08<43:06, 10.31s/it]                                                 {'loss': 0.1336, 'learning_rate': 2.338186377760811e-05, 'epoch': 1.39}
 46%|████▋     | 217/468 [39:08<43:06, 10.31s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:14:21,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1463.53 | bwd_microstep: 2581.25 | bwd_inner_microstep: 2581.22 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:14:28,064] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.52 | optimizer_gradients: 0.38 | optimizer_step: 0.39
[2024-09-15 16:14:28,065] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1490.09 | bwd_microstep: 5085.47 | bwd_inner_microstep: 2625.43 | bwd_allreduce_microstep: 2459.98 | step_microstep: 7.53
[2024-09-15 16:14:28,065] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2953.59 | bwd: 7666.73 | bwd_inner: 5206.66 | bwd_allreduce: 2460.01 | step: 7.76
 47%|████▋     | 218/468 [39:19<43:24, 10.42s/it]                                                 {'loss': 0.2047, 'learning_rate': 2.3245079179868054e-05, 'epoch': 1.39}
 47%|████▋     | 218/468 [39:19<43:24, 10.42s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:14:32,139] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1462.15 | bwd_microstep: 2578.45 | bwd_inner_microstep: 2578.42 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:14:38,794] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.47 | optimizer_gradients: 0.41 | optimizer_step: 0.38
[2024-09-15 16:14:38,794] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.65 | bwd_microstep: 4707.21 | bwd_inner_microstep: 3431.82 | bwd_allreduce_microstep: 1275.32 | step_microstep: 7.51
[2024-09-15 16:14:38,794] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3374.77 | bwd: 7285.66 | bwd_inner: 6010.24 | bwd_allreduce: 1275.35 | step: 7.73
 47%|████▋     | 219/468 [39:30<43:37, 10.51s/it]                                                 {'loss': 0.31, 'learning_rate': 2.31081385094441e-05, 'epoch': 1.4}
 47%|████▋     | 219/468 [39:30<43:37, 10.51s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:14:42,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1461.28 | bwd_microstep: 2582.30 | bwd_inner_microstep: 2582.19 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.17
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:14:49,640] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.23 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:14:49,640] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1466.10 | bwd_microstep: 5271.56 | bwd_inner_microstep: 2587.27 | bwd_allreduce_microstep: 2684.23 | step_microstep: 7.28
[2024-09-15 16:14:49,641] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2927.36 | bwd: 7853.90 | bwd_inner: 5169.47 | bwd_allreduce: 2684.30 | step: 7.45
 47%|████▋     | 220/468 [39:41<43:51, 10.61s/it]                                                 {'loss': 0.2096, 'learning_rate': 2.297104835252314e-05, 'epoch': 1.41}
 47%|████▋     | 220/468 [39:41<43:51, 10.61s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:14:53,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1463.19 | bwd_microstep: 2577.99 | bwd_inner_microstep: 2577.96 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1900
[2024-09-15 16:14:59,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.67 | optimizer_gradients: 0.34 | optimizer_step: 0.38
[2024-09-15 16:14:59,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.35 | bwd_microstep: 3455.39 | bwd_inner_microstep: 3440.55 | bwd_allreduce_microstep: 14.80 | step_microstep: 7.96
[2024-09-15 16:14:59,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3382.51 | bwd: 6033.40 | bwd_inner: 6018.52 | bwd_allreduce: 14.82 | step: 8.18
 47%|████▋     | 221/468 [39:50<42:17, 10.27s/it]                                                 {'loss': 0.2593, 'learning_rate': 2.283381530248165e-05, 'epoch': 1.41}
 47%|████▋     | 221/468 [39:50<42:17, 10.27s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1366
[2024-09-15 16:15:03,173] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1451.15 | bwd_microstep: 2565.63 | bwd_inner_microstep: 2565.51 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.26
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1900
[2024-09-15 16:15:09,775] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.23 | optimizer_gradients: 0.39 | optimizer_step: 0.38
[2024-09-15 16:15:09,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1936.84 | bwd_microstep: 4629.06 | bwd_inner_microstep: 3469.00 | bwd_allreduce_microstep: 1159.99 | step_microstep: 7.07
[2024-09-15 16:15:09,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3387.96 | bwd: 7194.73 | bwd_inner: 6034.52 | bwd_allreduce: 1160.08 | step: 7.34
 47%|████▋     | 222/468 [40:01<42:35, 10.39s/it]                                                 {'loss': 0.1585, 'learning_rate': 2.2696445959568577e-05, 'epoch': 1.42}
 47%|████▋     | 222/468 [40:01<42:35, 10.39s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:15:13,856] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1464.07 | bwd_microstep: 2585.49 | bwd_inner_microstep: 2585.46 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.18
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1877
[2024-09-15 16:15:20,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.20 | optimizer_gradients: 0.43 | optimizer_step: 0.40
[2024-09-15 16:15:20,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1932.95 | bwd_microstep: 4716.70 | bwd_inner_microstep: 3457.85 | bwd_allreduce_microstep: 1258.78 | step_microstep: 7.44
[2024-09-15 16:15:20,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3397.00 | bwd: 7302.21 | bwd_inner: 6043.32 | bwd_allreduce: 1258.82 | step: 7.63
 48%|████▊     | 223/468 [40:11<42:52, 10.50s/it]                                                 {'loss': 0.1781, 'learning_rate': 2.2558946930587907e-05, 'epoch': 1.42}
 48%|████▊     | 223/468 [40:11<42:52, 10.50s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:15:25,956] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1929.33 | bwd_microstep: 3456.91 | bwd_inner_microstep: 3456.86 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.29
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:15:31,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:15:31,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1491.92 | bwd_microstep: 3819.40 | bwd_inner_microstep: 2626.52 | bwd_allreduce_microstep: 1192.82 | step_microstep: 7.49
[2024-09-15 16:15:31,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3421.23 | bwd: 7276.36 | bwd_inner: 6083.38 | bwd_allreduce: 1192.85 | step: 7.78
 48%|████▊     | 224/468 [40:22<43:01, 10.58s/it]                                                 {'loss': 0.1462, 'learning_rate': 2.2421324828580877e-05, 'epoch': 1.43}
 48%|████▊     | 224/468 [40:22<43:01, 10.58s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:15:35,381] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1464.55 | bwd_microstep: 2583.18 | bwd_inner_microstep: 2583.06 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.23
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:15:42,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:15:42,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1492.47 | bwd_microstep: 5175.76 | bwd_inner_microstep: 2624.19 | bwd_allreduce_microstep: 2551.50 | step_microstep: 7.44
[2024-09-15 16:15:42,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2956.99 | bwd: 7758.97 | bwd_inner: 5207.26 | bwd_allreduce: 2551.59 | step: 7.68
 48%|████▊     | 225/468 [40:33<43:05, 10.64s/it]                                                 {'loss': 0.1687, 'learning_rate': 2.2283586272507975e-05, 'epoch': 1.44}
 48%|████▊     | 225/468 [40:33<43:05, 10.64s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:15:47,462] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1910.51 | bwd_microstep: 3432.18 | bwd_inner_microstep: 3432.10 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:15:52,950] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.34 | optimizer_gradients: 0.42 | optimizer_step: 0.40
[2024-09-15 16:15:52,950] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1899.49 | bwd_microstep: 3552.76 | bwd_inner_microstep: 3417.01 | bwd_allreduce_microstep: 135.69 | step_microstep: 7.58
[2024-09-15 16:15:52,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3809.98 | bwd: 6984.99 | bwd_inner: 6849.11 | bwd_allreduce: 135.75 | step: 7.84
 48%|████▊     | 226/468 [40:44<43:11, 10.71s/it]                                                 {'loss': 0.1716, 'learning_rate': 2.214573788693054e-05, 'epoch': 1.44}
 48%|████▊     | 226/468 [40:44<43:11, 10.71s/it]dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1369
[2024-09-15 16:15:56,969] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1438.09 | bwd_microstep: 2550.26 | bwd_inner_microstep: 2550.18 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.19
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:16:02,453] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.38 | optimizer_gradients: 0.41 | optimizer_step: 0.41
[2024-09-15 16:16:02,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1914.36 | bwd_microstep: 3537.15 | bwd_inner_microstep: 3427.87 | bwd_allreduce_microstep: 109.22 | step_microstep: 7.71
[2024-09-15 16:16:02,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3352.42 | bwd: 6087.45 | bwd_inner: 5978.06 | bwd_allreduce: 109.29 | step: 7.90
 49%|████▊     | 227/468 [40:53<41:33, 10.35s/it]                                                 {'loss': 0.2694, 'learning_rate': 2.2007786301692205e-05, 'epoch': 1.45}
 49%|████▊     | 227/468 [40:53<41:33, 10.35s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:16:07,837] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.47 | bwd_microstep: 3437.27 | bwd_inner_microstep: 3437.24 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:16:13,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.72 | optimizer_gradients: 0.33 | optimizer_step: 0.39
[2024-09-15 16:16:13,242] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1916.50 | bwd_microstep: 3453.21 | bwd_inner_microstep: 3439.63 | bwd_allreduce_microstep: 13.53 | step_microstep: 8.02
[2024-09-15 16:16:13,242] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3831.95 | bwd: 6890.49 | bwd_inner: 6876.88 | bwd_allreduce: 13.55 | step: 8.25
 49%|████▊     | 228/468 [41:04<41:54, 10.48s/it]                                                 {'loss': 0.1615, 'learning_rate': 2.18697381516e-05, 'epoch': 1.46}
 49%|████▊     | 228/468 [41:04<41:54, 10.48s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:16:17,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.14 | bwd_microstep: 2589.30 | bwd_inner_microstep: 2589.27 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.17
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:16:22,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:16:22,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1492.56 | bwd_microstep: 3881.80 | bwd_inner_microstep: 2628.96 | bwd_allreduce_microstep: 1252.78 | step_microstep: 7.20
[2024-09-15 16:16:22,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2959.67 | bwd: 6471.11 | bwd_inner: 5218.23 | bwd_allreduce: 1252.81 | step: 7.40
 49%|████▉     | 229/468 [41:14<40:33, 10.18s/it]                                                 {'loss': 0.1891, 'learning_rate': 2.1731600076105264e-05, 'epoch': 1.46}
 49%|████▉     | 229/468 [41:14<40:33, 10.18s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:16:28,113] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1911.96 | bwd_microstep: 3432.45 | bwd_inner_microstep: 3432.43 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.19
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1900
[2024-09-15 16:16:33,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.65 | optimizer_gradients: 0.36 | optimizer_step: 0.39
[2024-09-15 16:16:33,568] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1938.92 | bwd_microstep: 3478.65 | bwd_inner_microstep: 3465.14 | bwd_allreduce_microstep: 13.46 | step_microstep: 7.91
[2024-09-15 16:16:33,568] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3850.86 | bwd: 6911.12 | bwd_inner: 6897.57 | bwd_allreduce: 13.49 | step: 8.13
 49%|████▉     | 230/468 [41:24<41:09, 10.38s/it]                                                 {'loss': 0.1321, 'learning_rate': 2.159337871898431e-05, 'epoch': 1.47}
 49%|████▉     | 230/468 [41:24<41:09, 10.38s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:16:37,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.03 | bwd_microstep: 2589.43 | bwd_inner_microstep: 2589.40 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:16:43,068] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.67 | optimizer_gradients: 0.37 | optimizer_step: 0.39
[2024-09-15 16:16:43,069] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1920.86 | bwd_microstep: 3449.79 | bwd_inner_microstep: 3434.56 | bwd_allreduce_microstep: 15.18 | step_microstep: 8.04
[2024-09-15 16:16:43,069] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3387.84 | bwd: 6039.23 | bwd_inner: 6023.96 | bwd_allreduce: 15.20 | step: 8.27
 49%|████▉     | 231/468 [41:34<39:57, 10.11s/it]                                                 {'loss': 0.1706, 'learning_rate': 2.145508072801888e-05, 'epoch': 1.48}
 49%|████▉     | 231/468 [41:34<39:57, 10.11s/it]dynamic ViT batch size: 34, images per sample: 4.25, dynamic token length: 1369
[2024-09-15 16:16:47,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1429.03 | bwd_microstep: 2539.21 | bwd_inner_microstep: 2539.05 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:16:52,513] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.45 | optimizer_gradients: 0.38 | optimizer_step: 0.41
[2024-09-15 16:16:52,513] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.32 | bwd_microstep: 3492.77 | bwd_inner_microstep: 3438.61 | bwd_allreduce_microstep: 54.11 | step_microstep: 7.90
[2024-09-15 16:16:52,513] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3346.33 | bwd: 6032.02 | bwd_inner: 5977.67 | bwd_allreduce: 54.23 | step: 8.15
 50%|████▉     | 232/468 [41:43<38:59,  9.91s/it]                                                 {'loss': 0.2391, 'learning_rate': 2.131671275467647e-05, 'epoch': 1.48}
 50%|████▉     | 232/468 [41:43<38:59,  9.91s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:16:57,906] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1922.43 | bwd_microstep: 3440.12 | bwd_inner_microstep: 3440.10 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.17
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1876
[2024-09-15 16:17:03,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.92 | optimizer_gradients: 0.37 | optimizer_step: 0.41
[2024-09-15 16:17:03,364] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1938.29 | bwd_microstep: 3481.33 | bwd_inner_microstep: 3466.33 | bwd_allreduce_microstep: 14.96 | step_microstep: 9.84
[2024-09-15 16:17:03,364] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3860.69 | bwd: 6921.47 | bwd_inner: 6906.42 | bwd_allreduce: 14.98 | step: 10.04
 50%|████▉     | 233/468 [41:54<39:55, 10.19s/it]                                                 {'loss': 0.1462, 'learning_rate': 2.1178281453790358e-05, 'epoch': 1.49}
 50%|████▉     | 233/468 [41:54<39:55, 10.19s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:17:08,755] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1920.93 | bwd_microstep: 3440.04 | bwd_inner_microstep: 3440.01 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.15
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:17:12,928] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.58 | optimizer_gradients: 0.35 | optimizer_step: 0.39
[2024-09-15 16:17:12,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1494.32 | bwd_microstep: 2644.28 | bwd_inner_microstep: 2631.54 | bwd_allreduce_microstep: 12.70 | step_microstep: 7.90
[2024-09-15 16:17:12,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3415.22 | bwd: 6084.33 | bwd_inner: 6071.55 | bwd_allreduce: 12.72 | step: 8.08
 50%|█████     | 234/468 [42:04<39:01, 10.01s/it]                                                 {'loss': 0.1928, 'learning_rate': 2.1039793483239607e-05, 'epoch': 1.5}
 50%|█████     | 234/468 [42:04<39:01, 10.01s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:17:17,023] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1472.20 | bwd_microstep: 2591.32 | bwd_inner_microstep: 2591.29 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:17:22,391] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.41 | optimizer_step: 0.40
[2024-09-15 16:17:22,391] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1496.53 | bwd_microstep: 3836.83 | bwd_inner_microstep: 2633.20 | bwd_allreduce_microstep: 1203.57 | step_microstep: 7.44
[2024-09-15 16:17:22,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2968.70 | bwd: 6428.16 | bwd_inner: 5224.50 | bwd_allreduce: 1203.60 | step: 7.67
 50%|█████     | 235/468 [42:13<38:13,  9.84s/it]                                                 {'loss': 0.3621, 'learning_rate': 2.090125550362879e-05, 'epoch': 1.5}
 50%|█████     | 235/468 [42:13<38:13,  9.84s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:17:27,765] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.23 | bwd_microstep: 3427.94 | bwd_inner_microstep: 3427.91 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.14
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:17:31,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.66 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:17:31,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1495.42 | bwd_microstep: 2649.08 | bwd_inner_microstep: 2633.41 | bwd_allreduce_microstep: 15.62 | step_microstep: 8.00
[2024-09-15 16:17:31,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3410.62 | bwd: 6077.03 | bwd_inner: 6061.33 | bwd_allreduce: 15.64 | step: 8.17
 50%|█████     | 236/468 [42:23<37:43,  9.76s/it]                                                 {'loss': 0.1987, 'learning_rate': 2.0762674177967676e-05, 'epoch': 1.51}
 50%|█████     | 236/468 [42:23<37:43,  9.76s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:17:37,344] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1921.67 | bwd_microstep: 3442.75 | bwd_inner_microstep: 3442.72 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:17:42,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.51 | optimizer_gradients: 0.35 | optimizer_step: 0.42
[2024-09-15 16:17:42,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1923.86 | bwd_microstep: 3454.49 | bwd_inner_microstep: 3439.41 | bwd_allreduce_microstep: 15.01 | step_microstep: 8.02
[2024-09-15 16:17:42,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3845.50 | bwd: 6897.25 | bwd_inner: 6882.13 | bwd_allreduce: 15.04 | step: 8.28
 51%|█████     | 237/468 [42:34<38:46, 10.07s/it]                                                 {'loss': 0.2071, 'learning_rate': 2.0624056171350785e-05, 'epoch': 1.51}
 51%|█████     | 237/468 [42:34<38:46, 10.07s/it]dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1369
[2024-09-15 16:17:46,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1443.05 | bwd_microstep: 2559.30 | bwd_inner_microstep: 2559.07 | bwd_allreduce_microstep: 0.12 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:17:53,484] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:17:53,484] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1470.74 | bwd_microstep: 5187.07 | bwd_inner_microstep: 2596.75 | bwd_allreduce_microstep: 2590.25 | step_microstep: 7.48
[2024-09-15 16:17:53,484] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2913.77 | bwd: 7746.40 | bwd_inner: 5155.82 | bwd_allreduce: 2590.42 | step: 7.72
 51%|█████     | 238/468 [42:44<39:21, 10.27s/it]                                                 {'loss': 0.1817, 'learning_rate': 2.0485408150636804e-05, 'epoch': 1.52}
 51%|█████     | 238/468 [42:44<39:21, 10.27s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:17:58,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1934.20 | bwd_microstep: 3457.89 | bwd_inner_microstep: 3457.86 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:18:04,336] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.55 | optimizer_gradients: 0.41 | optimizer_step: 0.40
[2024-09-15 16:18:04,336] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.27 | bwd_microstep: 3475.18 | bwd_inner_microstep: 3442.42 | bwd_allreduce_microstep: 32.71 | step_microstep: 8.42
[2024-09-15 16:18:04,336] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3851.44 | bwd: 6933.08 | bwd_inner: 6900.28 | bwd_allreduce: 32.73 | step: 8.70
 51%|█████     | 239/468 [42:55<39:51, 10.44s/it]                                                 {'loss': 0.1615, 'learning_rate': 2.0346736784127955e-05, 'epoch': 1.53}
 51%|█████     | 239/468 [42:55<39:51, 10.44s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:18:08,427] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1470.77 | bwd_microstep: 2589.94 | bwd_inner_microstep: 2589.92 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 39, images per sample: 4.875, dynamic token length: 1369
[2024-09-15 16:18:13,906] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.40 | optimizer_step: 0.41
[2024-09-15 16:18:13,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1463.65 | bwd_microstep: 3980.90 | bwd_inner_microstep: 2580.97 | bwd_allreduce_microstep: 1399.86 | step_microstep: 7.45
[2024-09-15 16:18:13,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2934.40 | bwd: 6570.86 | bwd_inner: 5170.89 | bwd_allreduce: 1399.90 | step: 7.72
 51%|█████▏    | 240/468 [43:05<38:41, 10.18s/it]                                                 {'loss': 0.1521, 'learning_rate': 2.0208048741249288e-05, 'epoch': 1.53}
 51%|█████▏    | 240/468 [43:05<38:41, 10.18s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:18:17,990] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.23 | bwd_microstep: 2584.87 | bwd_inner_microstep: 2584.85 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.06
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1876
[2024-09-15 16:18:24,630] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.25 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:18:24,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1899.41 | bwd_microstep: 4706.68 | bwd_inner_microstep: 3422.92 | bwd_allreduce_microstep: 1283.70 | step_microstep: 7.15
[2024-09-15 16:18:24,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3366.62 | bwd: 7291.56 | bwd_inner: 6007.77 | bwd_allreduce: 1283.73 | step: 7.24
 51%|█████▏    | 241/468 [43:16<39:08, 10.34s/it]                                                 {'loss': 0.156, 'learning_rate': 2.006935069222789e-05, 'epoch': 1.54}
 51%|█████▏    | 241/468 [43:16<39:08, 10.34s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:18:28,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1464.91 | bwd_microstep: 2590.28 | bwd_inner_microstep: 2590.25 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.17
dynamic ViT batch size: 32, images per sample: 4.0, dynamic token length: 1876
[2024-09-15 16:18:35,428] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.31 | optimizer_gradients: 0.41 | optimizer_step: 0.38
[2024-09-15 16:18:35,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1841.31 | bwd_microstep: 4837.85 | bwd_inner_microstep: 3358.68 | bwd_allreduce_microstep: 1479.11 | step_microstep: 7.36
[2024-09-15 16:18:35,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3306.19 | bwd: 7428.14 | bwd_inner: 5948.93 | bwd_allreduce: 1479.14 | step: 7.56
 52%|█████▏    | 242/468 [43:26<39:28, 10.48s/it]                                                 {'loss': 0.2438, 'learning_rate': 1.9930649307772114e-05, 'epoch': 1.55}
 52%|█████▏    | 242/468 [43:26<39:28, 10.48s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:18:40,807] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.35 | bwd_microstep: 3434.43 | bwd_inner_microstep: 3434.40 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.06
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:18:46,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.61 | optimizer_gradients: 0.33 | optimizer_step: 0.39
[2024-09-15 16:18:46,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.73 | bwd_microstep: 3446.04 | bwd_inner_microstep: 3431.17 | bwd_allreduce_microstep: 14.83 | step_microstep: 8.24
[2024-09-15 16:18:46,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3827.05 | bwd: 6880.48 | bwd_inner: 6865.57 | bwd_allreduce: 14.85 | step: 8.33
 52%|█████▏    | 243/468 [43:37<39:37, 10.57s/it]                                                 {'loss': 0.2432, 'learning_rate': 1.979195125875072e-05, 'epoch': 1.55}
 52%|█████▏    | 243/468 [43:37<39:37, 10.57s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:18:50,290] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.64 | bwd_microstep: 2589.70 | bwd_inner_microstep: 2589.68 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.08
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1390
[2024-09-15 16:18:55,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.23 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:18:55,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1490.75 | bwd_microstep: 3904.87 | bwd_inner_microstep: 2629.24 | bwd_allreduce_microstep: 1275.57 | step_microstep: 7.22
[2024-09-15 16:18:55,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2958.37 | bwd: 6494.58 | bwd_inner: 5218.92 | bwd_allreduce: 1275.60 | step: 7.33
 52%|█████▏    | 244/468 [43:47<38:16, 10.25s/it]                                                 {'loss': 0.1292, 'learning_rate': 1.9653263215872048e-05, 'epoch': 1.56}
 52%|█████▏    | 244/468 [43:47<38:16, 10.25s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:19:01,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1932.06 | bwd_microstep: 3457.57 | bwd_inner_microstep: 3457.54 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.06
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1900
[2024-09-15 16:19:06,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.63 | optimizer_gradients: 0.33 | optimizer_step: 0.39
[2024-09-15 16:19:06,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1937.89 | bwd_microstep: 3482.87 | bwd_inner_microstep: 3470.59 | bwd_allreduce_microstep: 12.23 | step_microstep: 8.37
[2024-09-15 16:19:06,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3869.94 | bwd: 6940.45 | bwd_inner: 6928.13 | bwd_allreduce: 12.25 | step: 8.46
 52%|█████▏    | 245/468 [43:58<38:48, 10.44s/it]                                                 {'loss': 0.2293, 'learning_rate': 1.9514591849363203e-05, 'epoch': 1.57}
 52%|█████▏    | 245/468 [43:58<38:48, 10.44s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:19:10,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.10 | bwd_microstep: 2589.55 | bwd_inner_microstep: 2589.52 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.06
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:19:17,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.31 | optimizer_gradients: 0.37 | optimizer_step: 0.39
[2024-09-15 16:19:17,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1491.64 | bwd_microstep: 5220.79 | bwd_inner_microstep: 2630.21 | bwd_allreduce_microstep: 2590.49 | step_microstep: 7.28
[2024-09-15 16:19:17,435] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2959.71 | bwd: 7810.34 | bwd_inner: 5219.73 | bwd_allreduce: 2590.53 | step: 7.37
 53%|█████▎    | 246/468 [44:08<39:04, 10.56s/it]                                                 {'loss': 0.1932, 'learning_rate': 1.9375943828649215e-05, 'epoch': 1.57}
 53%|█████▎    | 246/468 [44:08<39:04, 10.56s/it]dynamic ViT batch size: 34, images per sample: 4.25, dynamic token length: 1870
[2024-09-15 16:19:22,671] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1850.22 | bwd_microstep: 3357.03 | bwd_inner_microstep: 3357.01 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 37, images per sample: 4.625, dynamic token length: 1622
[2024-09-15 16:19:28,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.23 | optimizer_gradients: 0.40 | optimizer_step: 0.41
[2024-09-15 16:19:28,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1664.19 | bwd_microstep: 3649.47 | bwd_inner_microstep: 2988.41 | bwd_allreduce_microstep: 661.00 | step_microstep: 7.64
[2024-09-15 16:19:28,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3514.36 | bwd: 7006.54 | bwd_inner: 6345.42 | bwd_allreduce: 661.04 | step: 7.86
 53%|█████▎    | 247/468 [44:19<38:55, 10.57s/it]                                                 {'loss': 0.1508, 'learning_rate': 1.923732582203233e-05, 'epoch': 1.58}
 53%|█████▎    | 247/468 [44:19<38:55, 10.57s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:19:33,402] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1914.66 | bwd_microstep: 3432.16 | bwd_inner_microstep: 3432.13 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.07
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:19:38,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.62 | optimizer_gradients: 0.34 | optimizer_step: 0.38
[2024-09-15 16:19:38,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1899.23 | bwd_microstep: 3430.29 | bwd_inner_microstep: 3415.46 | bwd_allreduce_microstep: 14.78 | step_microstep: 8.30
[2024-09-15 16:19:38,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3813.86 | bwd: 6862.46 | bwd_inner: 6847.59 | bwd_allreduce: 14.80 | step: 8.40
 53%|█████▎    | 248/468 [44:30<38:56, 10.62s/it]                                                 {'loss': 0.2317, 'learning_rate': 1.909874449637122e-05, 'epoch': 1.58}
 53%|█████▎    | 248/468 [44:30<38:56, 10.62s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:19:44,188] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1935.56 | bwd_microstep: 3454.95 | bwd_inner_microstep: 3454.92 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.21
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1369
[2024-09-15 16:19:49,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.44 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:19:49,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1454.85 | bwd_microstep: 3877.99 | bwd_inner_microstep: 2570.42 | bwd_allreduce_microstep: 1307.52 | step_microstep: 7.79
[2024-09-15 16:19:49,557] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3390.38 | bwd: 7332.96 | bwd_inner: 6025.34 | bwd_allreduce: 1307.55 | step: 8.03
 53%|█████▎    | 249/468 [44:40<38:57, 10.67s/it]                                                 {'loss': 0.218, 'learning_rate': 1.8960206516760396e-05, 'epoch': 1.59}
 53%|█████▎    | 249/468 [44:40<38:57, 10.67s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:19:53,638] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1464.39 | bwd_microstep: 2586.04 | bwd_inner_microstep: 2586.02 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.21
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:20:00,277] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.40 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:20:00,278] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1491.56 | bwd_microstep: 5114.77 | bwd_inner_microstep: 2627.83 | bwd_allreduce_microstep: 2486.88 | step_microstep: 7.28
[2024-09-15 16:20:00,278] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2955.92 | bwd: 7700.83 | bwd_inner: 5213.84 | bwd_allreduce: 2486.92 | step: 7.52
 53%|█████▎    | 250/468 [44:51<38:49, 10.69s/it]                                                 {'loss': 0.1777, 'learning_rate': 1.8821718546209646e-05, 'epoch': 1.6}
 53%|█████▎    | 250/468 [44:51<38:49, 10.69s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1369
[2024-09-15 16:20:04,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1450.51 | bwd_microstep: 2567.64 | bwd_inner_microstep: 2567.60 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:20:10,956] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:20:10,957] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1916.38 | bwd_microstep: 4678.51 | bwd_inner_microstep: 3434.70 | bwd_allreduce_microstep: 1243.76 | step_microstep: 7.51
[2024-09-15 16:20:10,957] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3366.86 | bwd: 7246.20 | bwd_inner: 6002.30 | bwd_allreduce: 1243.79 | step: 7.74
 54%|█████▎    | 251/468 [45:02<38:38, 10.68s/it]                                                 {'loss': 0.1577, 'learning_rate': 1.8683287245323536e-05, 'epoch': 1.6}
 54%|█████▎    | 251/468 [45:02<38:38, 10.68s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1371
[2024-09-15 16:20:15,041] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1466.68 | bwd_microstep: 2587.14 | bwd_inner_microstep: 2587.12 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:20:20,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.54 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:20:20,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1916.24 | bwd_microstep: 3471.73 | bwd_inner_microstep: 3437.58 | bwd_allreduce_microstep: 34.10 | step_microstep: 7.72
[2024-09-15 16:20:20,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3382.89 | bwd: 6058.88 | bwd_inner: 6024.70 | bwd_allreduce: 34.12 | step: 7.95
 54%|█████▍    | 252/468 [45:11<37:11, 10.33s/it]                                                 {'loss': 0.2977, 'learning_rate': 1.8544919271981125e-05, 'epoch': 1.61}
 54%|█████▍    | 252/468 [45:11<37:11, 10.33s/it]dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1366
[2024-09-15 16:20:24,482] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1438.88 | bwd_microstep: 2549.52 | bwd_inner_microstep: 2549.40 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.23
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1877
[2024-09-15 16:20:31,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.30 | optimizer_gradients: 0.43 | optimizer_step: 0.41
[2024-09-15 16:20:31,157] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1935.14 | bwd_microstep: 4703.71 | bwd_inner_microstep: 3463.29 | bwd_allreduce_microstep: 1240.36 | step_microstep: 7.65
[2024-09-15 16:20:31,157] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3373.99 | bwd: 7253.27 | bwd_inner: 6012.69 | bwd_allreduce: 1240.46 | step: 7.89
 54%|█████▍    | 253/468 [45:22<37:24, 10.44s/it]                                                 {'loss': 0.2277, 'learning_rate': 1.84066212810157e-05, 'epoch': 1.62}
 54%|█████▍    | 253/468 [45:22<37:24, 10.44s/it]dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1372
[2024-09-15 16:20:35,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1439.40 | bwd_microstep: 2554.83 | bwd_inner_microstep: 2554.80 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:20:41,847] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.42 | optimizer_step: 0.41
[2024-09-15 16:20:41,848] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1492.26 | bwd_microstep: 5140.47 | bwd_inner_microstep: 2625.83 | bwd_allreduce_microstep: 2514.58 | step_microstep: 7.57
[2024-09-15 16:20:41,848] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2931.62 | bwd: 7695.32 | bwd_inner: 5180.63 | bwd_allreduce: 2514.62 | step: 7.79
 54%|█████▍    | 254/468 [45:33<37:30, 10.52s/it]                                                 {'loss': 0.1574, 'learning_rate': 1.8268399923894736e-05, 'epoch': 1.62}
 54%|█████▍    | 254/468 [45:33<37:30, 10.52s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:20:45,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1466.20 | bwd_microstep: 2584.09 | bwd_inner_microstep: 2584.06 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:20:52,579] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.42 | optimizer_step: 0.40
[2024-09-15 16:20:52,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1491.79 | bwd_microstep: 5125.29 | bwd_inner_microstep: 2625.60 | bwd_allreduce_microstep: 2499.63 | step_microstep: 7.48
[2024-09-15 16:20:52,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2957.96 | bwd: 7709.42 | bwd_inner: 5209.66 | bwd_allreduce: 2499.66 | step: 7.71
 54%|█████▍    | 255/468 [45:43<37:33, 10.58s/it]                                                 {'loss': 0.3187, 'learning_rate': 1.8130261848399996e-05, 'epoch': 1.63}
 54%|█████▍    | 255/468 [45:43<37:33, 10.58s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1366
[2024-09-15 16:20:56,621] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1449.25 | bwd_microstep: 2561.72 | bwd_inner_microstep: 2561.63 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:21:03,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:21:03,410] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.78 | bwd_microstep: 4839.59 | bwd_inner_microstep: 3432.58 | bwd_allreduce_microstep: 1406.95 | step_microstep: 7.53
[2024-09-15 16:21:03,410] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3362.01 | bwd: 7401.35 | bwd_inner: 5994.21 | bwd_allreduce: 1407.02 | step: 7.77
 55%|█████▍    | 256/468 [45:54<37:38, 10.66s/it]                                                 {'loss': 0.1758, 'learning_rate': 1.7992213698307795e-05, 'epoch': 1.64}
 55%|█████▍    | 256/468 [45:54<37:38, 10.66s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1881
[2024-09-15 16:21:08,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.48 | bwd_microstep: 3431.87 | bwd_inner_microstep: 3431.85 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1387
[2024-09-15 16:21:14,206] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:21:14,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1490.19 | bwd_microstep: 3898.96 | bwd_inner_microstep: 2620.66 | bwd_allreduce_microstep: 1278.23 | step_microstep: 7.44
[2024-09-15 16:21:14,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3402.65 | bwd: 7330.84 | bwd_inner: 6052.51 | bwd_allreduce: 1278.26 | step: 7.52
 55%|█████▍    | 257/468 [46:05<37:37, 10.70s/it]                                                 {'loss': 0.2118, 'learning_rate': 1.7854262113069468e-05, 'epoch': 1.64}
 55%|█████▍    | 257/468 [46:05<37:37, 10.70s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:21:19,575] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1910.55 | bwd_microstep: 3427.73 | bwd_inner_microstep: 3427.70 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.19
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:21:25,011] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.44 | optimizer_gradients: 0.38 | optimizer_step: 0.40
[2024-09-15 16:21:25,011] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1490.06 | bwd_microstep: 3910.98 | bwd_inner_microstep: 2626.17 | bwd_allreduce_microstep: 1284.75 | step_microstep: 7.72
[2024-09-15 16:21:25,012] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3400.58 | bwd: 7338.72 | bwd_inner: 6053.87 | bwd_allreduce: 1284.78 | step: 7.94
 55%|█████▌    | 258/468 [46:16<37:33, 10.73s/it]                                                 {'loss': 0.1539, 'learning_rate': 1.7716413727492035e-05, 'epoch': 1.65}
 55%|█████▌    | 258/468 [46:16<37:33, 10.73s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:21:29,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1463.51 | bwd_microstep: 2587.34 | bwd_inner_microstep: 2587.32 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:21:34,481] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.46 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:21:34,481] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1489.36 | bwd_microstep: 3862.68 | bwd_inner_microstep: 2628.35 | bwd_allreduce_microstep: 1234.27 | step_microstep: 7.48
[2024-09-15 16:21:34,482] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2952.84 | bwd: 6450.06 | bwd_inner: 5215.67 | bwd_allreduce: 1234.30 | step: 7.71
 55%|█████▌    | 259/468 [46:25<36:03, 10.35s/it]                                                 {'loss': 0.1885, 'learning_rate': 1.757867517141913e-05, 'epoch': 1.65}
 55%|█████▌    | 259/468 [46:25<36:03, 10.35s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:21:38,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1463.16 | bwd_microstep: 2587.52 | bwd_inner_microstep: 2587.49 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.27
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:21:45,212] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.59 | optimizer_gradients: 0.39 | optimizer_step: 0.39
[2024-09-15 16:21:45,213] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1488.31 | bwd_microstep: 5124.55 | bwd_inner_microstep: 2628.01 | bwd_allreduce_microstep: 2496.47 | step_microstep: 9.10
[2024-09-15 16:21:45,213] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2951.45 | bwd: 7712.10 | bwd_inner: 5215.50 | bwd_allreduce: 2496.50 | step: 9.37
 56%|█████▌    | 260/468 [46:36<36:16, 10.47s/it]                                                 {'loss': 0.1915, 'learning_rate': 1.7441053069412103e-05, 'epoch': 1.66}
 56%|█████▌    | 260/468 [46:36<36:16, 10.47s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:21:50,628] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1927.62 | bwd_microstep: 3452.11 | bwd_inner_microstep: 3452.09 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:21:56,092] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.58 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:21:56,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1895.36 | bwd_microstep: 3534.30 | bwd_inner_microstep: 3415.17 | bwd_allreduce_microstep: 119.07 | step_microstep: 7.69
[2024-09-15 16:21:56,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3822.96 | bwd: 6986.43 | bwd_inner: 6867.26 | bwd_allreduce: 119.10 | step: 7.92
 56%|█████▌    | 261/468 [46:47<36:32, 10.59s/it]                                                 {'loss': 0.1716, 'learning_rate': 1.7303554040431426e-05, 'epoch': 1.67}
 56%|█████▌    | 261/468 [46:47<36:32, 10.59s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:22:01,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.47 | bwd_microstep: 3434.54 | bwd_inner_microstep: 3434.51 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1389
[2024-09-15 16:22:06,815] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.44 | optimizer_gradients: 0.40 | optimizer_step: 0.38
[2024-09-15 16:22:06,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1476.11 | bwd_microstep: 3832.31 | bwd_inner_microstep: 2607.99 | bwd_allreduce_microstep: 1224.26 | step_microstep: 8.20
[2024-09-15 16:22:06,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3388.56 | bwd: 7266.86 | bwd_inner: 6042.50 | bwd_allreduce: 1224.29 | step: 8.45
 56%|█████▌    | 262/468 [46:58<36:29, 10.63s/it]                                                 {'loss': 0.2217, 'learning_rate': 1.7166184697518352e-05, 'epoch': 1.67}
 56%|█████▌    | 262/468 [46:58<36:29, 10.63s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1369
[2024-09-15 16:22:10,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1450.19 | bwd_microstep: 2568.85 | bwd_inner_microstep: 2568.78 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:22:17,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.25 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:22:17,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1930.05 | bwd_microstep: 4762.21 | bwd_inner_microstep: 3460.89 | bwd_allreduce_microstep: 1301.26 | step_microstep: 7.28
[2024-09-15 16:22:17,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3380.22 | bwd: 7331.10 | bwd_inner: 6029.67 | bwd_allreduce: 1301.30 | step: 7.51
 56%|█████▌    | 263/468 [47:08<36:28, 10.67s/it]                                                 {'loss': 0.3266, 'learning_rate': 1.7028951647476862e-05, 'epoch': 1.68}
 56%|█████▌    | 263/468 [47:08<36:28, 10.67s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1877
[2024-09-15 16:22:22,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1896.53 | bwd_microstep: 3415.85 | bwd_inner_microstep: 3415.82 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1877
[2024-09-15 16:22:28,382] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 4.24 | optimizer_gradients: 0.34 | optimizer_step: 0.38
[2024-09-15 16:22:28,382] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1929.19 | bwd_microstep: 3476.65 | bwd_inner_microstep: 3461.58 | bwd_allreduce_microstep: 15.02 | step_microstep: 11.85
[2024-09-15 16:22:28,382] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3825.69 | bwd: 6892.51 | bwd_inner: 6877.41 | bwd_allreduce: 15.04 | step: 11.92
 56%|█████▋    | 264/468 [47:19<36:24, 10.71s/it]                                                 {'loss': 0.2631, 'learning_rate': 1.6891861490555906e-05, 'epoch': 1.69}
 56%|█████▋    | 264/468 [47:19<36:24, 10.71s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:22:32,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1466.71 | bwd_microstep: 2591.00 | bwd_inner_microstep: 2590.97 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:22:39,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.47 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:22:39,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.76 | bwd_microstep: 4653.12 | bwd_inner_microstep: 3440.73 | bwd_allreduce_microstep: 1212.32 | step_microstep: 7.50
[2024-09-15 16:22:39,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3382.44 | bwd: 7244.13 | bwd_inner: 6031.70 | bwd_allreduce: 1212.36 | step: 7.72
 57%|█████▋    | 265/468 [47:30<36:12, 10.70s/it]                                                 {'loss': 0.2481, 'learning_rate': 1.6754920820131946e-05, 'epoch': 1.69}
 57%|█████▋    | 265/468 [47:30<36:12, 10.70s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:22:44,455] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1911.35 | bwd_microstep: 3438.35 | bwd_inner_microstep: 3438.32 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25
dynamic ViT batch size: 46, images per sample: 5.75, dynamic token length: 1900
[2024-09-15 16:22:49,986] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.43 | optimizer_gradients: 0.36 | optimizer_step: 0.39
[2024-09-15 16:22:49,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1967.31 | bwd_microstep: 3527.33 | bwd_inner_microstep: 3512.46 | bwd_allreduce_microstep: 14.82 | step_microstep: 8.09
[2024-09-15 16:22:49,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3878.64 | bwd: 6965.69 | bwd_inner: 6950.78 | bwd_allreduce: 14.84 | step: 8.36
 57%|█████▋    | 266/468 [47:41<36:14, 10.77s/it]                                                 {'loss': 0.2521, 'learning_rate': 1.6618136222391893e-05, 'epoch': 1.7}
 57%|█████▋    | 266/468 [47:41<36:14, 10.77s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:22:54,076] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.75 | bwd_microstep: 2591.28 | bwd_inner_microstep: 2591.13 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.23
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:22:59,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.44 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:22:59,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1492.78 | bwd_microstep: 3825.57 | bwd_inner_microstep: 2632.39 | bwd_allreduce_microstep: 1193.12 | step_microstep: 7.51
[2024-09-15 16:22:59,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2960.51 | bwd: 6416.89 | bwd_inner: 5223.52 | bwd_allreduce: 1193.24 | step: 7.74
 57%|█████▋    | 267/468 [47:50<34:44, 10.37s/it]                                                 {'loss': 0.2022, 'learning_rate': 1.6481514276016297e-05, 'epoch': 1.71}
 57%|█████▋    | 267/468 [47:50<34:44, 10.37s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1367
[2024-09-15 16:23:03,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1466.16 | bwd_microstep: 2591.28 | bwd_inner_microstep: 2591.25 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:23:08,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.35 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:23:08,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.06 | bwd_microstep: 3816.86 | bwd_inner_microstep: 2595.20 | bwd_allreduce_microstep: 1221.61 | step_microstep: 7.47
[2024-09-15 16:23:08,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2934.20 | bwd: 6408.16 | bwd_inner: 5186.45 | bwd_allreduce: 1221.64 | step: 7.70
 57%|█████▋    | 268/468 [48:00<33:36, 10.08s/it]                                                 {'loss': 0.2055, 'learning_rate': 1.634506155186295e-05, 'epoch': 1.71}
 57%|█████▋    | 268/468 [48:00<33:36, 10.08s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:23:12,924] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1462.98 | bwd_microstep: 2591.38 | bwd_inner_microstep: 2591.34 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1389
[2024-09-15 16:23:19,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.43 | optimizer_gradients: 0.38 | optimizer_step: 0.39
[2024-09-15 16:23:19,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1461.41 | bwd_microstep: 5146.82 | bwd_inner_microstep: 2595.77 | bwd_allreduce_microstep: 2550.99 | step_microstep: 7.51
[2024-09-15 16:23:19,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2924.37 | bwd: 7738.24 | bwd_inner: 5187.11 | bwd_allreduce: 2551.01 | step: 7.74
 57%|█████▋    | 269/468 [48:10<34:04, 10.28s/it]                                                 {'loss': 0.2257, 'learning_rate': 1.6208784612650883e-05, 'epoch': 1.72}
 57%|█████▋    | 269/468 [48:10<34:04, 10.28s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1878
[2024-09-15 16:23:24,878] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1881.65 | bwd_microstep: 3399.46 | bwd_inner_microstep: 3399.43 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:23:30,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.74 | optimizer_gradients: 0.33 | optimizer_step: 0.38
[2024-09-15 16:23:30,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.13 | bwd_microstep: 3453.78 | bwd_inner_microstep: 3438.73 | bwd_allreduce_microstep: 15.00 | step_microstep: 8.00
[2024-09-15 16:23:30,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3794.75 | bwd: 6853.27 | bwd_inner: 6838.17 | bwd_allreduce: 15.03 | step: 8.23
 58%|█████▊    | 270/468 [48:21<34:20, 10.41s/it]                                                 {'loss': 0.2634, 'learning_rate': 1.6072690012644717e-05, 'epoch': 1.73}
 58%|█████▊    | 270/468 [48:21<34:20, 10.41s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:23:35,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1911.89 | bwd_microstep: 3435.28 | bwd_inner_microstep: 3435.26 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1390
[2024-09-15 16:23:39,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.69 | optimizer_gradients: 0.33 | optimizer_step: 0.40
[2024-09-15 16:23:39,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1478.06 | bwd_microstep: 2624.48 | bwd_inner_microstep: 2609.43 | bwd_allreduce_microstep: 15.00 | step_microstep: 7.92
[2024-09-15 16:23:39,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3389.92 | bwd: 6059.77 | bwd_inner: 6044.69 | bwd_allreduce: 15.02 | step: 8.18
 58%|█████▊    | 271/468 [48:31<33:17, 10.14s/it]                                                 {'loss': 0.2195, 'learning_rate': 1.593678429733944e-05, 'epoch': 1.73}
 58%|█████▊    | 271/468 [48:31<33:17, 10.14s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:23:43,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.10 | bwd_microstep: 2592.67 | bwd_inner_microstep: 2592.64 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:23:50,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:23:50,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.78 | bwd_microstep: 4659.90 | bwd_inner_microstep: 3440.25 | bwd_allreduce_microstep: 1219.60 | step_microstep: 7.21
[2024-09-15 16:23:50,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3385.85 | bwd: 7252.61 | bwd_inner: 6032.89 | bwd_allreduce: 1219.63 | step: 7.43
 58%|█████▊    | 272/468 [48:41<33:40, 10.31s/it]                                                 {'loss': 0.2063, 'learning_rate': 1.5801074003145585e-05, 'epoch': 1.74}
 58%|█████▊    | 272/468 [48:41<33:40, 10.31s/it]dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1369
[2024-09-15 16:23:54,525] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1437.33 | bwd_microstep: 2556.19 | bwd_inner_microstep: 2556.09 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:24:01,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.48 | optimizer_gradients: 0.38 | optimizer_step: 0.39
[2024-09-15 16:24:01,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1899.58 | bwd_microstep: 4745.28 | bwd_inner_microstep: 3425.19 | bwd_allreduce_microstep: 1320.03 | step_microstep: 7.60
[2024-09-15 16:24:01,205] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3336.89 | bwd: 7301.51 | bwd_inner: 5981.28 | bwd_allreduce: 1320.10 | step: 7.82
 58%|█████▊    | 273/468 [48:52<33:53, 10.43s/it]                                                 {'loss': 0.2601, 'learning_rate': 1.5665565657074874e-05, 'epoch': 1.74}
 58%|█████▊    | 273/468 [48:52<33:53, 10.43s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:24:06,569] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1908.38 | bwd_microstep: 3425.33 | bwd_inner_microstep: 3425.30 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1387
[2024-09-15 16:24:11,928] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.24 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:24:11,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1489.56 | bwd_microstep: 3833.44 | bwd_inner_microstep: 2628.83 | bwd_allreduce_microstep: 1204.55 | step_microstep: 7.51
[2024-09-15 16:24:11,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3397.92 | bwd: 7258.78 | bwd_inner: 6054.14 | bwd_allreduce: 1204.58 | step: 7.74
 59%|█████▊    | 274/468 [49:03<34:00, 10.52s/it]                                                 {'loss': 0.1633, 'learning_rate': 1.5530265776426294e-05, 'epoch': 1.75}
 59%|█████▊    | 274/468 [49:03<34:00, 10.52s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:24:17,355] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1934.72 | bwd_microstep: 3459.21 | bwd_inner_microstep: 3459.18 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:24:22,755] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 4.25 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:24:22,756] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.35 | bwd_microstep: 3448.09 | bwd_inner_microstep: 3433.14 | bwd_allreduce_microstep: 14.91 | step_microstep: 11.43
[2024-09-15 16:24:22,756] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3848.05 | bwd: 6907.31 | bwd_inner: 6892.32 | bwd_allreduce: 14.92 | step: 11.66
 59%|█████▉    | 275/468 [49:14<34:07, 10.61s/it]                                                 {'loss': 0.2333, 'learning_rate': 1.5395180868472662e-05, 'epoch': 1.76}
 59%|█████▉    | 275/468 [49:14<34:07, 10.61s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:24:28,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1935.06 | bwd_microstep: 3464.28 | bwd_inner_microstep: 3464.25 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.06
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1366
[2024-09-15 16:24:33,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.29 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:24:33,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1441.92 | bwd_microstep: 3729.41 | bwd_inner_microstep: 2558.73 | bwd_allreduce_microstep: 1170.62 | step_microstep: 7.57
[2024-09-15 16:24:33,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3376.96 | bwd: 7193.70 | bwd_inner: 6022.98 | bwd_allreduce: 1170.65 | step: 7.66
 59%|█████▉    | 276/468 [49:24<33:58, 10.62s/it]                                                 {'loss': 0.149, 'learning_rate': 1.5260317430147627e-05, 'epoch': 1.76}
 59%|█████▉    | 276/468 [49:24<33:58, 10.62s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1368
[2024-09-15 16:24:37,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1453.66 | bwd_microstep: 2572.05 | bwd_inner_microstep: 2571.96 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1900
[2024-09-15 16:24:44,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.39 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:24:44,118] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1941.00 | bwd_microstep: 4693.81 | bwd_inner_microstep: 3471.86 | bwd_allreduce_microstep: 1221.89 | step_microstep: 7.46
[2024-09-15 16:24:44,118] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3394.64 | bwd: 7265.92 | bwd_inner: 6043.82 | bwd_allreduce: 1221.95 | step: 7.68
 59%|█████▉    | 277/468 [49:35<33:54, 10.65s/it]                                                 {'loss': 0.2587, 'learning_rate': 1.512568194773322e-05, 'epoch': 1.77}
 59%|█████▉    | 277/468 [49:35<33:54, 10.65s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:24:48,206] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1466.20 | bwd_microstep: 2590.33 | bwd_inner_microstep: 2590.31 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1877
[2024-09-15 16:24:53,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.66 | optimizer_gradients: 0.33 | optimizer_step: 0.38
[2024-09-15 16:24:53,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1932.36 | bwd_microstep: 3475.76 | bwd_inner_microstep: 3460.82 | bwd_allreduce_microstep: 14.90 | step_microstep: 7.86
[2024-09-15 16:24:53,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3398.52 | bwd: 6066.11 | bwd_inner: 6051.13 | bwd_allreduce: 14.92 | step: 7.92
 59%|█████▉    | 278/468 [49:45<32:39, 10.31s/it]                                                 {'loss': 0.1951, 'learning_rate': 1.4991280896547893e-05, 'epoch': 1.78}
 59%|█████▉    | 278/468 [49:45<32:39, 10.31s/it]dynamic ViT batch size: 48, images per sample: 6.0, dynamic token length: 1879
[2024-09-15 16:24:59,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1965.62 | bwd_microstep: 3507.94 | bwd_inner_microstep: 3507.92 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:25:03,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.60 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:25:03,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1493.23 | bwd_microstep: 2646.31 | bwd_inner_microstep: 2631.14 | bwd_allreduce_microstep: 15.12 | step_microstep: 7.94
[2024-09-15 16:25:03,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3458.84 | bwd: 6154.27 | bwd_inner: 6139.06 | bwd_allreduce: 15.14 | step: 8.20
 60%|█████▉    | 279/468 [49:54<31:53, 10.12s/it]                                                 {'loss': 0.2757, 'learning_rate': 1.4857120740635084e-05, 'epoch': 1.78}
 60%|█████▉    | 279/468 [49:54<31:53, 10.12s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1881
[2024-09-15 16:25:08,757] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1935.30 | bwd_microstep: 3460.26 | bwd_inner_microstep: 3460.24 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1900
[2024-09-15 16:25:14,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.57 | optimizer_gradients: 0.36 | optimizer_step: 0.40
[2024-09-15 16:25:14,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1941.68 | bwd_microstep: 3488.02 | bwd_inner_microstep: 3473.04 | bwd_allreduce_microstep: 14.93 | step_microstep: 8.30
[2024-09-15 16:25:14,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3876.97 | bwd: 6948.29 | bwd_inner: 6933.28 | bwd_allreduce: 14.95 | step: 8.53
 60%|█████▉    | 280/468 [50:05<32:26, 10.36s/it]                                                 {'loss': 0.2321, 'learning_rate': 1.472320793245233e-05, 'epoch': 1.79}
 60%|█████▉    | 280/468 [50:05<32:26, 10.36s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:25:18,321] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.96 | bwd_microstep: 2593.94 | bwd_inner_microstep: 2593.91 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:25:24,844] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.82 | optimizer_gradients: 0.42 | optimizer_step: 0.41
[2024-09-15 16:25:24,844] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1495.07 | bwd_microstep: 4991.37 | bwd_inner_microstep: 2633.57 | bwd_allreduce_microstep: 2357.74 | step_microstep: 10.55
[2024-09-15 16:25:24,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2967.01 | bwd: 7585.35 | bwd_inner: 5227.48 | bwd_allreduce: 2357.77 | step: 10.80
 60%|██████    | 281/468 [50:16<32:31, 10.43s/it]                                                 {'loss': 0.2177, 'learning_rate': 1.4589548912560932e-05, 'epoch': 1.8}
 60%|██████    | 281/468 [50:16<32:31, 10.43s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:25:30,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.02 | bwd_microstep: 3425.62 | bwd_inner_microstep: 3425.59 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.16
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1877
[2024-09-15 16:25:35,588] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.53 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:25:35,588] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1902.69 | bwd_microstep: 3435.61 | bwd_inner_microstep: 3420.44 | bwd_allreduce_microstep: 15.12 | step_microstep: 8.26
[2024-09-15 16:25:35,589] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3815.68 | bwd: 6861.24 | bwd_inner: 6846.03 | bwd_allreduce: 15.15 | step: 8.44
 60%|██████    | 282/468 [50:26<32:38, 10.53s/it]                                                 {'loss': 0.1363, 'learning_rate': 1.4456150109316192e-05, 'epoch': 1.8}
 60%|██████    | 282/468 [50:26<32:38, 10.53s/it]dynamic ViT batch size: 34, images per sample: 4.25, dynamic token length: 1880
[2024-09-15 16:25:40,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1862.89 | bwd_microstep: 3377.19 | bwd_inner_microstep: 3376.95 | bwd_allreduce_microstep: 0.14 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:25:46,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.39 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:25:46,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.98 | bwd_microstep: 3451.42 | bwd_inner_microstep: 3436.46 | bwd_allreduce_microstep: 14.91 | step_microstep: 8.10
[2024-09-15 16:25:46,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3782.85 | bwd: 6828.65 | bwd_inner: 6813.41 | bwd_allreduce: 15.07 | step: 8.34
 60%|██████    | 283/468 [50:37<32:35, 10.57s/it]                                                 {'loss': 0.1428, 'learning_rate': 1.4323017938558245e-05, 'epoch': 1.81}
 60%|██████    | 283/468 [50:37<32:35, 10.57s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1881
[2024-09-15 16:25:51,700] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1938.91 | bwd_microstep: 3463.31 | bwd_inner_microstep: 3463.29 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.31
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:25:57,079] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.36 | optimizer_gradients: 0.36 | optimizer_step: 0.40
[2024-09-15 16:25:57,080] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1905.37 | bwd_microstep: 3439.46 | bwd_inner_microstep: 3424.64 | bwd_allreduce_microstep: 14.78 | step_microstep: 8.06
[2024-09-15 16:25:57,080] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3844.25 | bwd: 6902.79 | bwd_inner: 6887.93 | bwd_allreduce: 14.80 | step: 8.40
 61%|██████    | 284/468 [50:48<32:38, 10.64s/it]                                                 {'loss': 0.2112, 'learning_rate': 1.4190158803303498e-05, 'epoch': 1.81}
 61%|██████    | 284/468 [50:48<32:38, 10.64s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1367
[2024-09-15 16:26:01,178] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.60 | bwd_microstep: 2595.56 | bwd_inner_microstep: 2595.46 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:26:07,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.52 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:26:07,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.30 | bwd_microstep: 4655.48 | bwd_inner_microstep: 3442.48 | bwd_allreduce_microstep: 1212.95 | step_microstep: 7.80
[2024-09-15 16:26:07,789] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3390.88 | bwd: 7251.08 | bwd_inner: 6037.94 | bwd_allreduce: 1213.01 | step: 8.04
 61%|██████    | 285/468 [50:59<32:31, 10.66s/it]                                                 {'loss': 0.2269, 'learning_rate': 1.4057579093436653e-05, 'epoch': 1.82}
 61%|██████    | 285/468 [50:59<32:31, 10.66s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:26:13,176] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.23 | bwd_microstep: 3438.48 | bwd_inner_microstep: 3438.40 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:26:18,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.37 | optimizer_gradients: 0.38 | optimizer_step: 0.40
[2024-09-15 16:26:18,543] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1492.62 | bwd_microstep: 3839.42 | bwd_inner_microstep: 2631.24 | bwd_allreduce_microstep: 1208.12 | step_microstep: 7.76
[2024-09-15 16:26:18,543] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3410.83 | bwd: 7277.94 | bwd_inner: 6069.65 | bwd_allreduce: 1208.17 | step: 8.00
 61%|██████    | 286/468 [51:09<32:25, 10.69s/it]                                                 {'loss': 0.1712, 'learning_rate': 1.3925285185403406e-05, 'epoch': 1.83}
 61%|██████    | 286/468 [51:09<32:25, 10.69s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:26:22,636] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.34 | bwd_microstep: 2591.51 | bwd_inner_microstep: 2591.48 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.34
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:26:29,216] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.47 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:26:29,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1495.09 | bwd_microstep: 5051.04 | bwd_inner_microstep: 2630.83 | bwd_allreduce_microstep: 2420.15 | step_microstep: 7.76
[2024-09-15 16:26:29,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2966.39 | bwd: 7642.58 | bwd_inner: 5222.31 | bwd_allreduce: 2420.18 | step: 8.11
 61%|██████▏   | 287/468 [51:20<32:14, 10.69s/it]                                                 {'loss': 0.1182, 'learning_rate': 1.3793283441903737e-05, 'epoch': 1.83}
 61%|██████▏   | 287/468 [51:20<32:14, 10.69s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:26:33,304] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.32 | bwd_microstep: 2588.99 | bwd_inner_microstep: 2588.94 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1878
[2024-09-15 16:26:39,943] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.52 | optimizer_gradients: 0.41 | optimizer_step: 0.40
[2024-09-15 16:26:39,944] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1890.14 | bwd_microstep: 4714.75 | bwd_inner_microstep: 3399.82 | bwd_allreduce_microstep: 1314.87 | step_microstep: 7.90
[2024-09-15 16:26:39,944] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3357.42 | bwd: 7303.78 | bwd_inner: 5988.76 | bwd_allreduce: 1314.90 | step: 8.16
 62%|██████▏   | 288/468 [51:31<32:05, 10.70s/it]                                                 {'loss': 0.1667, 'learning_rate': 1.3661580211585947e-05, 'epoch': 1.84}
 62%|██████▏   | 288/468 [51:31<32:05, 10.70s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1880
[2024-09-15 16:26:45,364] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1932.70 | bwd_microstep: 3456.86 | bwd_inner_microstep: 3456.83 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1876
[2024-09-15 16:26:50,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.60 | optimizer_gradients: 0.36 | optimizer_step: 0.40
[2024-09-15 16:26:50,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1936.41 | bwd_microstep: 3472.38 | bwd_inner_microstep: 3457.56 | bwd_allreduce_microstep: 14.78 | step_microstep: 7.93
[2024-09-15 16:26:50,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3869.07 | bwd: 6929.25 | bwd_inner: 6914.39 | bwd_allreduce: 14.80 | step: 8.16
 62%|██████▏   | 289/468 [51:42<32:03, 10.75s/it]                                                 {'loss': 0.2887, 'learning_rate': 1.3530181828741285e-05, 'epoch': 1.85}
 62%|██████▏   | 289/468 [51:42<32:03, 10.75s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:26:56,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.28 | bwd_microstep: 3438.92 | bwd_inner_microstep: 3438.88 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.30
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1877
[2024-09-15 16:27:01,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.52 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:27:01,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1890.66 | bwd_microstep: 3420.55 | bwd_inner_microstep: 3404.82 | bwd_allreduce_microstep: 15.67 | step_microstep: 8.52
[2024-09-15 16:27:01,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3807.90 | bwd: 6859.50 | bwd_inner: 6843.71 | bwd_allreduce: 15.70 | step: 8.82
 62%|██████▏   | 290/468 [51:52<31:52, 10.74s/it]                                                 {'loss': 0.1624, 'learning_rate': 1.3399094612999291e-05, 'epoch': 1.85}
 62%|██████▏   | 290/468 [51:52<31:52, 10.74s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:27:06,895] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1901.96 | bwd_microstep: 3420.08 | bwd_inner_microstep: 3419.90 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:27:12,308] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.57 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:27:12,309] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.53 | bwd_microstep: 3456.11 | bwd_inner_microstep: 3441.29 | bwd_allreduce_microstep: 14.78 | step_microstep: 8.59
[2024-09-15 16:27:12,309] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3821.44 | bwd: 6876.23 | bwd_inner: 6861.19 | bwd_allreduce: 14.92 | step: 8.84
 62%|██████▏   | 291/468 [52:03<31:42, 10.75s/it]                                                 {'loss': 0.2356, 'learning_rate': 1.3268324869023878e-05, 'epoch': 1.86}
 62%|██████▏   | 291/468 [52:03<31:42, 10.75s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:27:16,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.35 | bwd_microstep: 2592.63 | bwd_inner_microstep: 2592.61 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:27:21,744] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.41 | optimizer_step: 0.40
[2024-09-15 16:27:21,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1473.35 | bwd_microstep: 3833.16 | bwd_inner_microstep: 2595.77 | bwd_allreduce_microstep: 1237.34 | step_microstep: 7.56
[2024-09-15 16:27:21,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2944.67 | bwd: 6425.81 | bwd_inner: 5188.37 | bwd_allreduce: 1237.37 | step: 7.79
 62%|██████▏   | 292/468 [52:13<30:22, 10.36s/it]                                                 {'loss': 0.2557, 'learning_rate': 1.313787888621009e-05, 'epoch': 1.87}
 62%|██████▏   | 292/468 [52:13<30:22, 10.36s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:27:27,130] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.19 | bwd_microstep: 3437.21 | bwd_inner_microstep: 3437.14 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:27:32,541] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.45 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:27:32,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1920.56 | bwd_microstep: 3454.98 | bwd_inner_microstep: 3439.66 | bwd_allreduce_microstep: 15.28 | step_microstep: 7.88
[2024-09-15 16:27:32,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3837.72 | bwd: 6892.23 | bwd_inner: 6876.80 | bwd_allreduce: 15.31 | step: 8.13
 63%|██████▎   | 293/468 [52:23<30:35, 10.49s/it]                                                 {'loss': 0.2203, 'learning_rate': 1.3007762938381619e-05, 'epoch': 1.87}
 63%|██████▎   | 293/468 [52:23<30:35, 10.49s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1876
[2024-09-15 16:27:37,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1936.69 | bwd_microstep: 3463.36 | bwd_inner_microstep: 3463.33 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1870
[2024-09-15 16:27:43,335] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.66 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:27:43,335] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1900.99 | bwd_microstep: 3427.06 | bwd_inner_microstep: 3411.81 | bwd_allreduce_microstep: 15.21 | step_microstep: 8.00
[2024-09-15 16:27:43,336] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3837.65 | bwd: 6890.44 | bwd_inner: 6875.14 | bwd_allreduce: 15.23 | step: 8.25
 63%|██████▎   | 294/468 [52:34<30:40, 10.58s/it]                                                 {'loss': 0.3142, 'learning_rate': 1.2877983283489062e-05, 'epoch': 1.88}
 63%|██████▎   | 294/468 [52:34<30:40, 10.58s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:27:48,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1922.01 | bwd_microstep: 3444.34 | bwd_inner_microstep: 3444.22 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.23
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1876
[2024-09-15 16:27:54,108] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.56 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:27:54,108] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1904.01 | bwd_microstep: 3437.07 | bwd_inner_microstep: 3422.35 | bwd_allreduce_microstep: 14.65 | step_microstep: 8.14
[2024-09-15 16:27:54,109] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3826.00 | bwd: 6881.45 | bwd_inner: 6866.57 | bwd_allreduce: 14.75 | step: 8.37
 63%|██████▎   | 295/468 [52:45<30:40, 10.64s/it]                                                 {'loss': 0.16, 'learning_rate': 1.2748546163308947e-05, 'epoch': 1.88}
 63%|██████▎   | 295/468 [52:45<30:40, 10.64s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:27:59,507] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1921.93 | bwd_microstep: 3445.40 | bwd_inner_microstep: 3445.35 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:28:04,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.61 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:28:04,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1941.45 | bwd_microstep: 3483.19 | bwd_inner_microstep: 3468.04 | bwd_allreduce_microstep: 15.10 | step_microstep: 8.72
[2024-09-15 16:28:04,969] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3863.35 | bwd: 6928.62 | bwd_inner: 6913.39 | bwd_allreduce: 15.12 | step: 8.97
 63%|██████▎   | 296/468 [52:56<30:41, 10.70s/it]                                                 {'loss': 0.1647, 'learning_rate': 1.261945780314354e-05, 'epoch': 1.89}
 63%|██████▎   | 296/468 [52:56<30:41, 10.70s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:28:09,064] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1472.38 | bwd_microstep: 2592.22 | bwd_inner_microstep: 2592.10 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.26
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1366
[2024-09-15 16:28:14,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.23 | optimizer_gradients: 0.41 | optimizer_step: 0.41
[2024-09-15 16:28:14,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1458.05 | bwd_microstep: 3811.62 | bwd_inner_microstep: 2577.30 | bwd_allreduce_microstep: 1234.26 | step_microstep: 7.55
[2024-09-15 16:28:14,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2930.41 | bwd: 6403.88 | bwd_inner: 5169.40 | bwd_allreduce: 1234.36 | step: 7.82
 63%|██████▎   | 297/468 [53:05<29:23, 10.31s/it]                                                 {'loss': 0.2373, 'learning_rate': 1.2490724411521406e-05, 'epoch': 1.9}
 63%|██████▎   | 297/468 [53:05<29:23, 10.31s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1367
[2024-09-15 16:28:18,462] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1470.07 | bwd_microstep: 2592.64 | bwd_inner_microstep: 2592.62 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.17
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1389
[2024-09-15 16:28:25,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.53 | optimizer_gradients: 0.39 | optimizer_step: 0.39
[2024-09-15 16:28:25,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1483.35 | bwd_microstep: 5103.65 | bwd_inner_microstep: 2613.21 | bwd_allreduce_microstep: 2490.38 | step_microstep: 7.83
[2024-09-15 16:28:25,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2953.39 | bwd: 7696.30 | bwd_inner: 5205.82 | bwd_allreduce: 2490.41 | step: 8.03
 64%|██████▎   | 298/468 [53:16<29:33, 10.43s/it]                                                 {'loss': 0.1701, 'learning_rate': 1.2362352179898855e-05, 'epoch': 1.9}
 64%|██████▎   | 298/468 [53:16<29:33, 10.43s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:28:29,170] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.55 | bwd_microstep: 2586.13 | bwd_inner_microstep: 2586.10 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1881
[2024-09-15 16:28:34,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.59 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:28:34,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.96 | bwd_microstep: 3459.40 | bwd_inner_microstep: 3443.82 | bwd_allreduce_microstep: 15.54 | step_microstep: 7.94
[2024-09-15 16:28:34,586] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3388.48 | bwd: 6045.55 | bwd_inner: 6029.92 | bwd_allreduce: 15.56 | step: 8.18
 64%|██████▍   | 299/468 [53:25<28:36, 10.15s/it]                                                 {'loss': 0.1522, 'learning_rate': 1.2234347282362129e-05, 'epoch': 1.91}
 64%|██████▍   | 299/468 [53:25<28:36, 10.15s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:28:39,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.80 | bwd_microstep: 3436.63 | bwd_inner_microstep: 3436.61 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1901
[2024-09-15 16:28:45,432] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.61 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:28:45,433] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1941.63 | bwd_microstep: 3482.97 | bwd_inner_microstep: 3468.05 | bwd_allreduce_microstep: 14.87 | step_microstep: 8.59
[2024-09-15 16:28:45,433] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3861.41 | bwd: 6919.62 | bwd_inner: 6904.65 | bwd_allreduce: 14.90 | step: 8.82
 64%|██████▍   | 300/468 [53:36<29:00, 10.36s/it]                                                 {'loss': 0.1613, 'learning_rate': 1.2106715875330475e-05, 'epoch': 1.92}
 64%|██████▍   | 300/468 [53:36<29:00, 10.36s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:28:50,825] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.95 | bwd_microstep: 3442.01 | bwd_inner_microstep: 3441.89 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.24
dynamic ViT batch size: 39, images per sample: 4.875, dynamic token length: 1388
[2024-09-15 16:28:54,984] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.56 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:28:54,985] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1488.36 | bwd_microstep: 2637.07 | bwd_inner_microstep: 2621.65 | bwd_allreduce_microstep: 15.37 | step_microstep: 7.92
[2024-09-15 16:28:54,985] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3407.29 | bwd: 6079.12 | bwd_inner: 6063.55 | bwd_allreduce: 15.45 | step: 8.16
 64%|██████▍   | 301/468 [53:46<28:09, 10.12s/it]                                                 {'loss': 0.1883, 'learning_rate': 1.1979464097260039e-05, 'epoch': 1.92}
 64%|██████▍   | 301/468 [53:46<28:09, 10.12s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:28:59,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.20 | bwd_microstep: 2591.80 | bwd_inner_microstep: 2591.71 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 45, images per sample: 5.625, dynamic token length: 1901
[2024-09-15 16:29:05,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:29:05,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1969.42 | bwd_microstep: 4708.98 | bwd_inner_microstep: 3510.22 | bwd_allreduce_microstep: 1198.70 | step_microstep: 7.50
[2024-09-15 16:29:05,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3440.60 | bwd: 7300.82 | bwd_inner: 6101.93 | bwd_allreduce: 1198.77 | step: 7.75
 65%|██████▍   | 302/468 [53:57<28:34, 10.33s/it]                                                 {'loss': 0.2338, 'learning_rate': 1.1852598068348642e-05, 'epoch': 1.93}
 65%|██████▍   | 302/468 [53:57<28:34, 10.33s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:29:11,184] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.97 | bwd_microstep: 3441.17 | bwd_inner_microstep: 3441.14 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.26
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:29:15,362] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.57 | optimizer_gradients: 0.35 | optimizer_step: 0.41
[2024-09-15 16:29:15,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1496.79 | bwd_microstep: 2646.90 | bwd_inner_microstep: 2631.76 | bwd_allreduce_microstep: 15.10 | step_microstep: 7.99
[2024-09-15 16:29:15,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3415.74 | bwd: 6088.15 | bwd_inner: 6072.90 | bwd_allreduce: 15.12 | step: 8.26
 65%|██████▍   | 303/468 [54:06<27:46, 10.10s/it]                                                 {'loss': 0.2248, 'learning_rate': 1.1726123890241439e-05, 'epoch': 1.94}
 65%|██████▍   | 303/468 [54:06<27:46, 10.10s/it]dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1368
[2024-09-15 16:29:19,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1438.55 | bwd_microstep: 2553.49 | bwd_inner_microstep: 2553.23 | bwd_allreduce_microstep: 0.13 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:29:26,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.41 | optimizer_step: 0.42
[2024-09-15 16:29:26,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1903.94 | bwd_microstep: 4820.79 | bwd_inner_microstep: 3424.10 | bwd_allreduce_microstep: 1396.62 | step_microstep: 7.52
[2024-09-15 16:29:26,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3342.46 | bwd: 7374.32 | bwd_inner: 5977.34 | bwd_allreduce: 1396.78 | step: 7.76
 65%|██████▍   | 304/468 [54:17<28:09, 10.30s/it]                                                 {'loss': 0.1811, 'learning_rate': 1.1600047645737433e-05, 'epoch': 1.94}
 65%|██████▍   | 304/468 [54:17<28:09, 10.30s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:29:30,231] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.09 | bwd_microstep: 2589.19 | bwd_inner_microstep: 2589.05 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:29:36,758] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.26 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:29:36,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1495.09 | bwd_microstep: 4997.41 | bwd_inner_microstep: 2630.23 | bwd_allreduce_microstep: 2367.11 | step_microstep: 7.48
[2024-09-15 16:29:36,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2963.16 | bwd: 7586.64 | bwd_inner: 5219.28 | bwd_allreduce: 2367.23 | step: 7.72
 65%|██████▌   | 305/468 [54:28<28:14, 10.40s/it]                                                 {'loss': 0.1893, 'learning_rate': 1.1474375398496948e-05, 'epoch': 1.95}
 65%|██████▌   | 305/468 [54:28<28:14, 10.40s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:29:40,846] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1469.16 | bwd_microstep: 2587.71 | bwd_inner_microstep: 2587.66 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:29:46,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.40 | optimizer_step: 0.41
[2024-09-15 16:29:46,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1491.87 | bwd_microstep: 3858.80 | bwd_inner_microstep: 2626.43 | bwd_allreduce_microstep: 1232.31 | step_microstep: 7.59
[2024-09-15 16:29:46,233] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2961.01 | bwd: 6446.56 | bwd_inner: 5214.09 | bwd_allreduce: 1232.34 | step: 7.83
 65%|██████▌   | 306/468 [54:37<27:19, 10.12s/it]                                                 {'loss': 0.203, 'learning_rate': 1.1349113192749986e-05, 'epoch': 1.96}
 65%|██████▌   | 306/468 [54:37<27:19, 10.12s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:29:50,317] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1466.30 | bwd_microstep: 2587.91 | bwd_inner_microstep: 2587.84 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1901
[2024-09-15 16:29:57,024] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:29:57,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1940.36 | bwd_microstep: 4730.97 | bwd_inner_microstep: 3464.08 | bwd_allreduce_microstep: 1266.83 | step_microstep: 7.53
[2024-09-15 16:29:57,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3406.62 | bwd: 7318.92 | bwd_inner: 6051.92 | bwd_allreduce: 1266.88 | step: 7.78
 66%|██████▌   | 307/468 [54:48<27:41, 10.32s/it]                                                 {'loss': 0.1425, 'learning_rate': 1.1224267053005504e-05, 'epoch': 1.96}
 66%|██████▌   | 307/468 [54:48<27:41, 10.32s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:30:02,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1929.58 | bwd_microstep: 3458.23 | bwd_inner_microstep: 3458.18 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:30:07,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.26 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:30:07,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1469.03 | bwd_microstep: 3888.51 | bwd_inner_microstep: 2589.86 | bwd_allreduce_microstep: 1298.59 | step_microstep: 7.51
[2024-09-15 16:30:07,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3398.58 | bwd: 7346.79 | bwd_inner: 6048.04 | bwd_allreduce: 1298.63 | step: 7.75
 66%|██████▌   | 308/468 [54:59<27:54, 10.47s/it]                                                 {'loss': 0.1337, 'learning_rate': 1.1099842983761712e-05, 'epoch': 1.97}
 66%|██████▌   | 308/468 [54:59<27:54, 10.47s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:30:11,920] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1466.17 | bwd_microstep: 2586.80 | bwd_inner_microstep: 2586.62 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:30:18,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:30:18,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.45 | bwd_microstep: 4672.12 | bwd_inner_microstep: 3430.12 | bwd_allreduce_microstep: 1241.94 | step_microstep: 7.51
[2024-09-15 16:30:18,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3379.60 | bwd: 7258.96 | bwd_inner: 6016.74 | bwd_allreduce: 1242.09 | step: 7.76
 66%|██████▌   | 309/468 [55:09<27:55, 10.54s/it]                                                 {'loss': 0.1684, 'learning_rate': 1.0975846969217258e-05, 'epoch': 1.97}
 66%|██████▌   | 309/468 [55:09<27:55, 10.54s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:30:23,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.70 | bwd_microstep: 3432.48 | bwd_inner_microstep: 3432.45 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.19
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:30:29,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.23 | optimizer_gradients: 0.39 | optimizer_step: 0.41
[2024-09-15 16:30:29,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1492.29 | bwd_microstep: 3685.70 | bwd_inner_microstep: 2629.68 | bwd_allreduce_microstep: 1055.96 | step_microstep: 7.45
[2024-09-15 16:30:29,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3404.96 | bwd: 7118.20 | bwd_inner: 6062.13 | bwd_allreduce: 1055.99 | step: 7.64
 66%|██████▌   | 310/468 [55:20<27:47, 10.55s/it]                                                 {'loss': 0.165, 'learning_rate': 1.0852284972983415e-05, 'epoch': 1.98}
 66%|██████▌   | 310/468 [55:20<27:47, 10.55s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:30:34,508] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.96 | bwd_microstep: 3434.26 | bwd_inner_microstep: 3434.23 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.26
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:30:39,912] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.56 | optimizer_gradients: 0.36 | optimizer_step: 0.40
[2024-09-15 16:30:39,912] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.16 | bwd_microstep: 3452.32 | bwd_inner_microstep: 3437.44 | bwd_allreduce_microstep: 14.84 | step_microstep: 8.29
[2024-09-15 16:30:39,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3833.09 | bwd: 6886.62 | bwd_inner: 6871.66 | bwd_allreduce: 14.86 | step: 8.56
 66%|██████▋   | 311/468 [55:31<27:47, 10.62s/it]                                                 {'loss': 0.2462, 'learning_rate': 1.0729162937797257e-05, 'epoch': 1.99}
 66%|██████▋   | 311/468 [55:31<27:47, 10.62s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:30:44,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.67 | bwd_microstep: 2588.25 | bwd_inner_microstep: 2588.03 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.25
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1870
[2024-09-15 16:30:50,730] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.74 | optimizer_gradients: 0.41 | optimizer_step: 0.41
[2024-09-15 16:30:50,730] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1899.75 | bwd_microstep: 4793.11 | bwd_inner_microstep: 3412.72 | bwd_allreduce_microstep: 1380.33 | step_microstep: 10.63
[2024-09-15 16:30:50,731] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3368.40 | bwd: 7381.41 | bwd_inner: 6000.75 | bwd_allreduce: 1380.52 | step: 10.89
 67%|██████▋   | 312/468 [55:42<27:46, 10.68s/it]                                                 {'loss': 0.2298, 'learning_rate': 1.0606486785235879e-05, 'epoch': 1.99}
 67%|██████▋   | 312/468 [55:42<27:46, 10.68s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
[2024-09-15 16:30:56,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1916.72 | bwd_microstep: 3437.20 | bwd_inner_microstep: 3437.10 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.24
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:31:02,474] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.57 | optimizer_gradients: 0.34 | optimizer_step: 0.41
[2024-09-15 16:31:02,475] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.93 | bwd_microstep: 3448.45 | bwd_inner_microstep: 3433.22 | bwd_allreduce_microstep: 15.18 | step_microstep: 7.94
[2024-09-15 16:31:02,475] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3829.62 | bwd: 6885.69 | bwd_inner: 6870.32 | bwd_allreduce: 15.25 | step: 8.19
 67%|██████▋   | 313/468 [55:53<28:25, 11.00s/it]                                                 {'loss': 0.2047, 'learning_rate': 1.0484262415431536e-05, 'epoch': 2.0}
 67%|██████▋   | 313/468 [55:53<28:25, 11.00s/it][2024-09-15 16:31:04,793] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 16:31:04,816] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 16:31:04,816] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 16:31:04,825] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 16:31:08,575] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 16:31:08,712] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 16:31:08,730] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 16:31:08,730] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 16:31:12,269] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 16:31:12,465] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 16:31:12,554] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 16:31:12,572] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 16:31:16,009] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 16:31:16,153] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 16:31:16,227] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 16:31:16,309] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1870
[2024-09-15 16:31:23,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1887.95 | bwd_microstep: 3400.97 | bwd_inner_microstep: 3400.88 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.24
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1372
[2024-09-15 16:31:28,116] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.28 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:31:28,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1453.96 | bwd_microstep: 2957.88 | bwd_inner_microstep: 2568.75 | bwd_allreduce_microstep: 389.07 | step_microstep: 7.37
[2024-09-15 16:31:28,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3341.88 | bwd: 6358.89 | bwd_inner: 5969.63 | bwd_allreduce: 389.13 | step: 7.62
 67%|██████▋   | 314/468 [56:19<39:30, 15.39s/it]                                                 {'loss': 0.2365, 'learning_rate': 1.0362495706787923e-05, 'epoch': 2.01}
 67%|██████▋   | 314/468 [56:19<39:30, 15.39s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:31:32,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1464.18 | bwd_microstep: 2584.16 | bwd_inner_microstep: 2584.13 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.20
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:31:38,728] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.39 | optimizer_gradients: 0.39 | optimizer_step: 0.42
[2024-09-15 16:31:38,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1466.22 | bwd_microstep: 5032.01 | bwd_inner_microstep: 2584.18 | bwd_allreduce_microstep: 2447.77 | step_microstep: 7.75
[2024-09-15 16:31:38,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2930.37 | bwd: 7616.18 | bwd_inner: 5168.32 | bwd_allreduce: 2447.80 | step: 7.96
 67%|██████▋   | 315/468 [56:30<35:35, 13.96s/it]                                                 {'loss': 0.1738, 'learning_rate': 1.0241192515697432e-05, 'epoch': 2.01}
 67%|██████▋   | 315/468 [56:30<35:35, 13.96s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1877
[2024-09-15 16:31:44,062] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1894.56 | bwd_microstep: 3407.84 | bwd_inner_microstep: 3407.77 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:31:49,450] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.40 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:31:49,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1916.08 | bwd_microstep: 3435.64 | bwd_inner_microstep: 3420.32 | bwd_allreduce_microstep: 15.27 | step_microstep: 8.13
[2024-09-15 16:31:49,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3810.62 | bwd: 6843.53 | bwd_inner: 6828.09 | bwd_allreduce: 15.30 | step: 8.38
 68%|██████▊   | 316/468 [56:40<32:54, 12.99s/it]                                                 {'loss': 0.2467, 'learning_rate': 1.0120358676259508e-05, 'epoch': 2.02}
 68%|██████▊   | 316/468 [56:40<32:54, 12.99s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:31:53,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1485.88 | bwd_microstep: 2613.65 | bwd_inner_microstep: 2613.57 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:32:00,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.40 | optimizer_step: 0.41
[2024-09-15 16:32:00,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1464.92 | bwd_microstep: 5043.00 | bwd_inner_microstep: 2581.96 | bwd_allreduce_microstep: 2460.98 | step_microstep: 7.53
[2024-09-15 16:32:00,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2950.78 | bwd: 7656.68 | bwd_inner: 5195.53 | bwd_allreduce: 2461.02 | step: 7.78
 68%|██████▊   | 317/468 [56:51<30:56, 12.29s/it]                                                 {'loss': 0.1459, 'learning_rate': 1.0000000000000006e-05, 'epoch': 2.03}
 68%|██████▊   | 317/468 [56:51<30:56, 12.29s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:32:04,268] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1493.23 | bwd_microstep: 2618.81 | bwd_inner_microstep: 2618.78 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:32:09,488] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.25 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:32:09,488] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1463.09 | bwd_microstep: 3722.95 | bwd_inner_microstep: 2581.90 | bwd_allreduce_microstep: 1141.00 | step_microstep: 7.48
[2024-09-15 16:32:09,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2956.30 | bwd: 6341.78 | bwd_inner: 5200.68 | bwd_allreduce: 1141.03 | step: 7.53
 68%|██████▊   | 318/468 [57:00<28:32, 11.41s/it]                                                 {'loss': 0.1594, 'learning_rate': 9.880122275591752e-06, 'epoch': 2.03}
 68%|██████▊   | 318/468 [57:00<28:32, 11.41s/it]dynamic ViT batch size: 46, images per sample: 5.75, dynamic token length: 1878
[2024-09-15 16:32:14,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1946.46 | bwd_microstep: 3469.60 | bwd_inner_microstep: 3469.58 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:32:19,032] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.57 | optimizer_gradients: 0.35 | optimizer_step: 0.41
[2024-09-15 16:32:19,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1465.44 | bwd_microstep: 2596.79 | bwd_inner_microstep: 2581.42 | bwd_allreduce_microstep: 15.32 | step_microstep: 7.93
[2024-09-15 16:32:19,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3411.88 | bwd: 6066.41 | bwd_inner: 6051.00 | bwd_allreduce: 15.35 | step: 8.17
 68%|██████▊   | 319/468 [57:10<26:57, 10.85s/it]                                                 {'loss': 0.1145, 'learning_rate': 9.760731268576095e-06, 'epoch': 2.04}
 68%|██████▊   | 319/468 [57:10<26:57, 10.85s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1390
[2024-09-15 16:32:23,139] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1474.84 | bwd_microstep: 2601.19 | bwd_inner_microstep: 2601.16 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1876
[2024-09-15 16:32:29,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.28 | optimizer_gradients: 0.39 | optimizer_step: 0.38
[2024-09-15 16:32:29,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1929.34 | bwd_microstep: 4637.85 | bwd_inner_microstep: 3458.78 | bwd_allreduce_microstep: 1179.00 | step_microstep: 7.29
[2024-09-15 16:32:29,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3404.16 | bwd: 7239.05 | bwd_inner: 6059.95 | bwd_allreduce: 1179.03 | step: 7.57
 68%|██████▊   | 320/468 [57:21<26:39, 10.81s/it]                                                 {'loss': 0.1485, 'learning_rate': 9.64183272108562e-06, 'epoch': 2.04}
 68%|██████▊   | 320/468 [57:21<26:39, 10.81s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:32:33,883] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1488.05 | bwd_microstep: 2621.91 | bwd_inner_microstep: 2621.89 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:32:40,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.43 | optimizer_step: 0.40
[2024-09-15 16:32:40,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.65 | bwd_microstep: 4612.33 | bwd_inner_microstep: 3432.27 | bwd_allreduce_microstep: 1179.98 | step_microstep: 7.46
[2024-09-15 16:32:40,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3400.67 | bwd: 7234.26 | bwd_inner: 6054.16 | bwd_allreduce: 1180.04 | step: 7.62
 69%|██████▊   | 321/468 [57:31<26:24, 10.78s/it]                                                 {'loss': 0.0932, 'learning_rate': 9.523432351567979e-06, 'epoch': 2.05}
 69%|██████▊   | 321/468 [57:31<26:24, 10.78s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1388
[2024-09-15 16:32:44,548] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1475.44 | bwd_microstep: 2598.86 | bwd_inner_microstep: 2598.69 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.24
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1367
[2024-09-15 16:32:51,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.24 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:32:51,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1438.74 | bwd_microstep: 5104.20 | bwd_inner_microstep: 2549.39 | bwd_allreduce_microstep: 2554.75 | step_microstep: 7.58
[2024-09-15 16:32:51,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2914.16 | bwd: 7703.10 | bwd_inner: 5148.08 | bwd_allreduce: 2554.87 | step: 7.83
 69%|██████▉   | 322/468 [57:42<26:09, 10.75s/it]                                                 {'loss': 0.1819, 'learning_rate': 9.405535854510863e-06, 'epoch': 2.06}
 69%|██████▉   | 322/468 [57:42<26:09, 10.75s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:32:56,483] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1907.94 | bwd_microstep: 3418.51 | bwd_inner_microstep: 3418.48 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:33:01,877] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.33 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:33:01,878] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.28 | bwd_microstep: 3446.27 | bwd_inner_microstep: 3431.45 | bwd_allreduce_microstep: 14.77 | step_microstep: 7.70
[2024-09-15 16:33:01,878] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3821.20 | bwd: 6864.79 | bwd_inner: 6849.93 | bwd_allreduce: 14.79 | step: 7.95
 69%|██████▉   | 323/468 [57:53<25:58, 10.75s/it]                                                 {'loss': 0.1534, 'learning_rate': 9.288148900168122e-06, 'epoch': 2.06}
 69%|██████▉   | 323/468 [57:53<25:58, 10.75s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1878
[2024-09-15 16:33:07,193] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1884.07 | bwd_microstep: 3400.93 | bwd_inner_microstep: 3400.87 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1387
[2024-09-15 16:33:12,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.13 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:33:12,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1491.21 | bwd_microstep: 3882.04 | bwd_inner_microstep: 2624.54 | bwd_allreduce_microstep: 1257.44 | step_microstep: 7.46
[2024-09-15 16:33:12,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3375.25 | bwd: 7283.01 | bwd_inner: 6025.41 | bwd_allreduce: 1257.48 | step: 7.71
 69%|██████▉   | 324/468 [58:04<25:46, 10.74s/it]                                                 {'loss': 0.1875, 'learning_rate': 9.171277134287057e-06, 'epoch': 2.07}
 69%|██████▉   | 324/468 [58:04<25:46, 10.74s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1374
[2024-09-15 16:33:16,657] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1453.35 | bwd_microstep: 2570.34 | bwd_inner_microstep: 2570.27 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:33:23,393] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:33:23,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.00 | bwd_microstep: 4785.58 | bwd_inner_microstep: 3436.78 | bwd_allreduce_microstep: 1348.74 | step_microstep: 7.45
[2024-09-15 16:33:23,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3368.33 | bwd: 7355.97 | bwd_inner: 6007.05 | bwd_allreduce: 1348.79 | step: 7.69
 69%|██████▉   | 325/468 [58:14<25:38, 10.76s/it]                                                 {'loss': 0.1416, 'learning_rate': 9.054926177836878e-06, 'epoch': 2.08}
 69%|██████▉   | 325/468 [58:14<25:38, 10.76s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:33:27,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1488.60 | bwd_microstep: 2619.29 | bwd_inner_microstep: 2619.20 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:33:32,932] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.51 | optimizer_gradients: 0.34 | optimizer_step: 0.41
[2024-09-15 16:33:32,933] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1916.30 | bwd_microstep: 3449.93 | bwd_inner_microstep: 3435.09 | bwd_allreduce_microstep: 14.79 | step_microstep: 7.69
[2024-09-15 16:33:32,933] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3404.87 | bwd: 6069.26 | bwd_inner: 6054.29 | bwd_allreduce: 14.84 | step: 7.93
 70%|██████▉   | 326/468 [58:24<24:35, 10.39s/it]                                                 {'loss': 0.1717, 'learning_rate': 8.939101626738395e-06, 'epoch': 2.08}
 70%|██████▉   | 326/468 [58:24<24:35, 10.39s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:33:37,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1490.04 | bwd_microstep: 2621.66 | bwd_inner_microstep: 2621.59 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25
dynamic ViT batch size: 46, images per sample: 5.75, dynamic token length: 1884
[2024-09-15 16:33:43,642] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.24 | optimizer_gradients: 0.39 | optimizer_step: 0.43
[2024-09-15 16:33:43,642] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1949.86 | bwd_microstep: 4581.95 | bwd_inner_microstep: 3483.13 | bwd_allreduce_microstep: 1098.76 | step_microstep: 7.50
[2024-09-15 16:33:43,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3439.88 | bwd: 7203.66 | bwd_inner: 6104.72 | bwd_allreduce: 1098.79 | step: 7.75
 70%|██████▉   | 327/468 [58:35<24:38, 10.49s/it]                                                 {'loss': 0.1975, 'learning_rate': 8.823809051594816e-06, 'epoch': 2.09}
 70%|██████▉   | 327/468 [58:35<24:38, 10.49s/it]dynamic ViT batch size: 50, images per sample: 6.25, dynamic token length: 1900
[2024-09-15 16:33:49,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2001.28 | bwd_microstep: 3555.57 | bwd_inner_microstep: 3555.55 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.19
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:33:54,625] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.41 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:33:54,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1914.98 | bwd_microstep: 3444.63 | bwd_inner_microstep: 3429.28 | bwd_allreduce_microstep: 15.31 | step_microstep: 8.09
[2024-09-15 16:33:54,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3916.24 | bwd: 7000.21 | bwd_inner: 6984.82 | bwd_allreduce: 15.33 | step: 8.32
 70%|███████   | 328/468 [58:46<24:49, 10.64s/it]                                                 {'loss': 0.1706, 'learning_rate': 8.70905399742389e-06, 'epoch': 2.1}
 70%|███████   | 328/468 [58:46<24:49, 10.64s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1908
[2024-09-15 16:34:00,068] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1938.04 | bwd_microstep: 3472.99 | bwd_inner_microstep: 3472.96 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.27
dynamic ViT batch size: 41, images per sample: 5.125, dynamic token length: 1880
[2024-09-15 16:34:05,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.49 | optimizer_gradients: 0.37 | optimizer_step: 0.41
[2024-09-15 16:34:05,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.21 | bwd_microstep: 3445.41 | bwd_inner_microstep: 3430.38 | bwd_allreduce_microstep: 14.97 | step_microstep: 7.79
[2024-09-15 16:34:05,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3851.22 | bwd: 6918.42 | bwd_inner: 6903.34 | bwd_allreduce: 15.00 | step: 8.05
 70%|███████   | 329/468 [58:56<24:46, 10.70s/it]                                                 {'loss': 0.1869, 'learning_rate': 8.594841983391196e-06, 'epoch': 2.1}
 70%|███████   | 329/468 [58:56<24:46, 10.70s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:34:09,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1493.61 | bwd_microstep: 2627.93 | bwd_inner_microstep: 2627.81 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:34:16,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:34:16,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.17 | bwd_microstep: 4999.58 | bwd_inner_microstep: 2592.44 | bwd_allreduce_microstep: 2407.07 | step_microstep: 7.49
[2024-09-15 16:34:16,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2964.75 | bwd: 7627.55 | bwd_inner: 5220.25 | bwd_allreduce: 2407.16 | step: 7.74
 71%|███████   | 330/468 [59:07<24:34, 10.68s/it]                                                 {'loss': 0.2313, 'learning_rate': 8.481178502544684e-06, 'epoch': 2.11}
 71%|███████   | 330/468 [59:07<24:34, 10.68s/it]dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1369
[2024-09-15 16:34:20,141] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1437.65 | bwd_microstep: 2552.45 | bwd_inner_microstep: 2552.18 | bwd_allreduce_microstep: 0.14 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:34:26,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.43 | optimizer_step: 0.40
[2024-09-15 16:34:26,864] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.40 | bwd_microstep: 5220.41 | bwd_inner_microstep: 2590.15 | bwd_allreduce_microstep: 2630.20 | step_microstep: 7.54
[2024-09-15 16:34:26,864] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2906.02 | bwd: 7772.90 | bwd_inner: 5142.34 | bwd_allreduce: 2630.36 | step: 7.79
 71%|███████   | 331/468 [59:18<24:26, 10.70s/it]                                                 {'loss': 0.1843, 'learning_rate': 8.368069021550516e-06, 'epoch': 2.12}
 71%|███████   | 331/468 [59:18<24:26, 10.70s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1900
[2024-09-15 16:34:32,226] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1905.06 | bwd_microstep: 3426.68 | bwd_inner_microstep: 3426.66 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:34:37,621] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.30 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:34:37,621] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.23 | bwd_microstep: 3447.24 | bwd_inner_microstep: 3432.32 | bwd_allreduce_microstep: 14.87 | step_microstep: 7.62
[2024-09-15 16:34:37,621] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3822.27 | bwd: 6873.93 | bwd_inner: 6858.98 | bwd_allreduce: 14.89 | step: 7.67
 71%|███████   | 332/468 [59:29<24:17, 10.72s/it]                                                 {'loss': 0.2359, 'learning_rate': 8.255518980430115e-06, 'epoch': 2.12}
 71%|███████   | 332/468 [59:29<24:17, 10.72s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:34:43,004] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1916.97 | bwd_microstep: 3434.79 | bwd_inner_microstep: 3434.76 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:34:48,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.28 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:34:48,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1469.03 | bwd_microstep: 3879.54 | bwd_inner_microstep: 2589.49 | bwd_allreduce_microstep: 1289.98 | step_microstep: 7.55
[2024-09-15 16:34:48,389] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3385.96 | bwd: 7314.34 | bwd_inner: 6024.25 | bwd_allreduce: 1290.02 | step: 7.62
 71%|███████   | 333/468 [59:39<24:09, 10.73s/it]                                                 {'loss': 0.1243, 'learning_rate': 8.143533792298545e-06, 'epoch': 2.13}
 71%|███████   | 333/468 [59:39<24:09, 10.73s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:34:53,812] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1934.48 | bwd_microstep: 3457.41 | bwd_inner_microstep: 3457.39 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.26
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:34:59,244] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:34:59,245] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1469.74 | bwd_microstep: 3926.45 | bwd_inner_microstep: 2589.85 | bwd_allreduce_microstep: 1336.54 | step_microstep: 7.45
[2024-09-15 16:34:59,245] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3404.19 | bwd: 7383.87 | bwd_inner: 6047.24 | bwd_allreduce: 1336.57 | step: 7.74
 71%|███████▏  | 334/468 [59:50<24:03, 10.77s/it]                                                 {'loss': 0.1325, 'learning_rate': 8.032118843104164e-06, 'epoch': 2.13}
 71%|███████▏  | 334/468 [59:50<24:03, 10.77s/it]dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1390
[2024-09-15 16:35:03,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1460.93 | bwd_microstep: 2588.63 | bwd_inner_microstep: 2588.49 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.25
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1366
[2024-09-15 16:35:08,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.44 | optimizer_step: 0.40
[2024-09-15 16:35:08,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1438.35 | bwd_microstep: 3870.36 | bwd_inner_microstep: 2552.32 | bwd_allreduce_microstep: 1317.98 | step_microstep: 7.48
[2024-09-15 16:35:08,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2899.25 | bwd: 6459.03 | bwd_inner: 5140.81 | bwd_allreduce: 1318.09 | step: 7.74
 72%|███████▏  | 335/468 [1:00:00<22:58, 10.37s/it]                                                   {'loss': 0.099, 'learning_rate': 7.921279491369575e-06, 'epoch': 2.14}
 72%|███████▏  | 335/468 [1:00:00<22:58, 10.37s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:35:14,049] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.20 | bwd_microstep: 3435.38 | bwd_inner_microstep: 3435.35 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:35:19,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.31 | optimizer_gradients: 0.40 | optimizer_step: 0.41
[2024-09-15 16:35:19,306] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.36 | bwd_microstep: 3753.24 | bwd_inner_microstep: 2586.56 | bwd_allreduce_microstep: 1166.62 | step_microstep: 7.81
[2024-09-15 16:35:19,306] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3382.54 | bwd: 7188.63 | bwd_inner: 6021.91 | bwd_allreduce: 1166.65 | step: 8.06
 72%|███████▏  | 336/468 [1:00:10<22:59, 10.45s/it]                                                   {'loss': 0.2295, 'learning_rate': 7.811021067933919e-06, 'epoch': 2.15}
 72%|███████▏  | 336/468 [1:00:10<22:59, 10.45s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1901
[2024-09-15 16:35:24,715] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1923.76 | bwd_microstep: 3454.65 | bwd_inner_microstep: 3454.61 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:35:28,822] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.49 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:35:28,822] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.88 | bwd_microstep: 2604.14 | bwd_inner_microstep: 2589.00 | bwd_allreduce_microstep: 15.07 | step_microstep: 7.67
[2024-09-15 16:35:28,822] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3391.61 | bwd: 6058.84 | bwd_inner: 6043.61 | bwd_allreduce: 15.10 | step: 7.91
 72%|███████▏  | 337/468 [1:00:20<22:12, 10.17s/it]                                                   {'loss': 0.1494, 'learning_rate': 7.701348875696486e-06, 'epoch': 2.15}
 72%|███████▏  | 337/468 [1:00:20<22:12, 10.17s/it]dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1390
[2024-09-15 16:35:32,902] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1462.01 | bwd_microstep: 2587.76 | bwd_inner_microstep: 2587.73 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:35:38,317] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.35 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:35:38,318] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.44 | bwd_microstep: 3459.67 | bwd_inner_microstep: 3444.00 | bwd_allreduce_microstep: 15.60 | step_microstep: 7.88
[2024-09-15 16:35:38,318] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3381.43 | bwd: 6047.47 | bwd_inner: 6031.72 | bwd_allreduce: 15.64 | step: 8.10
 72%|███████▏  | 338/468 [1:00:29<21:35,  9.97s/it]                                                   {'loss': 0.1063, 'learning_rate': 7.59226818936166e-06, 'epoch': 2.16}
 72%|███████▏  | 338/468 [1:00:29<21:35,  9.97s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:35:43,694] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1914.04 | bwd_microstep: 3426.54 | bwd_inner_microstep: 3426.51 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 5.75, dynamic token length: 1878
[2024-09-15 16:35:49,183] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.39 | optimizer_gradients: 0.38 | optimizer_step: 0.43
[2024-09-15 16:35:49,184] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1951.06 | bwd_microstep: 3501.52 | bwd_inner_microstep: 3486.38 | bwd_allreduce_microstep: 15.10 | step_microstep: 8.11
[2024-09-15 16:35:49,184] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3865.07 | bwd: 6928.07 | bwd_inner: 6912.89 | bwd_allreduce: 15.12 | step: 8.24
 72%|███████▏  | 339/468 [1:00:40<22:00, 10.24s/it]                                                   {'loss': 0.1389, 'learning_rate': 7.483784255185249e-06, 'epoch': 2.17}
 72%|███████▏  | 339/468 [1:00:40<22:00, 10.24s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1374
[2024-09-15 16:35:53,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1473.84 | bwd_microstep: 2593.15 | bwd_inner_microstep: 2593.09 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1373
[2024-09-15 16:35:59,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.40 | optimizer_step: 0.39
[2024-09-15 16:35:59,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1474.91 | bwd_microstep: 5076.75 | bwd_inner_microstep: 2597.50 | bwd_allreduce_microstep: 2479.18 | step_microstep: 7.45
[2024-09-15 16:35:59,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2948.72 | bwd: 7669.94 | bwd_inner: 5190.59 | bwd_allreduce: 2479.22 | step: 7.70
 73%|███████▎  | 340/468 [1:00:51<22:07, 10.37s/it]                                                   {'loss': 0.176, 'learning_rate': 7.375902290722146e-06, 'epoch': 2.17}
 73%|███████▎  | 340/468 [1:00:51<22:07, 10.37s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1870
[2024-09-15 16:36:05,167] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1881.34 | bwd_microstep: 3386.29 | bwd_inner_microstep: 3386.26 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1369
[2024-09-15 16:36:10,550] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.37 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:36:10,550] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1440.24 | bwd_microstep: 3910.17 | bwd_inner_microstep: 2553.18 | bwd_allreduce_microstep: 1356.92 | step_microstep: 7.69
[2024-09-15 16:36:10,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3321.56 | bwd: 7296.47 | bwd_inner: 5939.44 | bwd_allreduce: 1356.96 | step: 7.93
 73%|███████▎  | 341/468 [1:01:01<22:08, 10.46s/it]                                                   {'loss': 0.142, 'learning_rate': 7.268627484575406e-06, 'epoch': 2.18}
 73%|███████▎  | 341/468 [1:01:01<22:08, 10.46s/it]dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1388
[2024-09-15 16:36:14,629] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1461.44 | bwd_microstep: 2586.98 | bwd_inner_microstep: 2586.85 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:36:19,998] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.26 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:36:19,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1494.16 | bwd_microstep: 3839.03 | bwd_inner_microstep: 2626.77 | bwd_allreduce_microstep: 1212.20 | step_microstep: 7.50
[2024-09-15 16:36:19,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2955.58 | bwd: 6426.04 | bwd_inner: 5213.62 | bwd_allreduce: 1212.30 | step: 7.75
 73%|███████▎  | 342/468 [1:01:11<21:20, 10.16s/it]                                                   {'loss': 0.2029, 'learning_rate': 7.161964996146689e-06, 'epoch': 2.19}
 73%|███████▎  | 342/468 [1:01:11<21:20, 10.16s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:36:25,373] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1914.41 | bwd_microstep: 3429.40 | bwd_inner_microstep: 3429.37 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.19
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:36:30,775] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.27 | optimizer_gradients: 0.44 | optimizer_step: 0.44
[2024-09-15 16:36:30,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1916.93 | bwd_microstep: 3451.04 | bwd_inner_microstep: 3435.95 | bwd_allreduce_microstep: 15.05 | step_microstep: 7.85
[2024-09-15 16:36:30,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3831.31 | bwd: 6880.45 | bwd_inner: 6865.32 | bwd_allreduce: 15.07 | step: 8.07
 73%|███████▎  | 343/468 [1:01:22<21:33, 10.34s/it]                                                   {'loss': 0.1416, 'learning_rate': 7.055919955388122e-06, 'epoch': 2.19}
 73%|███████▎  | 343/468 [1:01:22<21:33, 10.34s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:36:36,159] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.07 | bwd_microstep: 3435.85 | bwd_inner_microstep: 3435.83 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:36:40,268] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.49 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:36:40,268] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.94 | bwd_microstep: 2605.07 | bwd_inner_microstep: 2590.67 | bwd_allreduce_microstep: 14.35 | step_microstep: 7.77
[2024-09-15 16:36:40,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3385.98 | bwd: 6040.93 | bwd_inner: 6026.50 | bwd_allreduce: 14.37 | step: 8.01
 74%|███████▎  | 344/468 [1:01:31<20:51, 10.09s/it]                                                   {'loss': 0.1515, 'learning_rate': 6.95049746255557e-06, 'epoch': 2.2}
 74%|███████▎  | 344/468 [1:01:31<20:51, 10.09s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:36:45,653] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.47 | bwd_microstep: 3435.65 | bwd_inner_microstep: 3435.63 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:36:51,035] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.42 | optimizer_gradients: 0.38 | optimizer_step: 0.40
[2024-09-15 16:36:51,036] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1904.91 | bwd_microstep: 3441.38 | bwd_inner_microstep: 3426.40 | bwd_allreduce_microstep: 14.93 | step_microstep: 8.15
[2024-09-15 16:36:51,036] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3823.36 | bwd: 6877.04 | bwd_inner: 6862.03 | bwd_allreduce: 14.95 | step: 8.41
 74%|███████▎  | 345/468 [1:01:42<21:05, 10.29s/it]                                                   {'loss': 0.1573, 'learning_rate': 6.845702587963352e-06, 'epoch': 2.2}
 74%|███████▎  | 345/468 [1:01:42<21:05, 10.29s/it]dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1388
[2024-09-15 16:36:55,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1464.86 | bwd_microstep: 2590.22 | bwd_inner_microstep: 2590.15 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:37:01,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:37:01,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.37 | bwd_microstep: 4663.44 | bwd_inner_microstep: 3441.11 | bwd_allreduce_microstep: 1222.27 | step_microstep: 7.48
[2024-09-15 16:37:01,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3384.20 | bwd: 7253.70 | bwd_inner: 6031.26 | bwd_allreduce: 1222.31 | step: 7.73
 74%|███████▍  | 346/468 [1:01:53<21:10, 10.42s/it]                                                   {'loss': 0.2249, 'learning_rate': 6.741540371740347e-06, 'epoch': 2.21}
 74%|███████▍  | 346/468 [1:01:53<21:10, 10.42s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:37:07,090] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1901.43 | bwd_microstep: 3419.86 | bwd_inner_microstep: 3419.84 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1367
[2024-09-15 16:37:12,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.31 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:37:12,386] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.76 | bwd_microstep: 3791.30 | bwd_inner_microstep: 2591.95 | bwd_allreduce_microstep: 1199.29 | step_microstep: 7.74
[2024-09-15 16:37:12,386] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3370.17 | bwd: 7211.19 | bwd_inner: 6011.79 | bwd_allreduce: 1199.32 | step: 7.99
 74%|███████▍  | 347/468 [1:02:03<21:08, 10.49s/it]                                                   {'loss': 0.2207, 'learning_rate': 6.6380158235876335e-06, 'epoch': 2.22}
 74%|███████▍  | 347/468 [1:02:03<21:08, 10.49s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:37:16,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1490.46 | bwd_microstep: 2626.55 | bwd_inner_microstep: 2626.52 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:37:23,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.35 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:37:23,178] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.29 | bwd_microstep: 5135.34 | bwd_inner_microstep: 2590.35 | bwd_allreduce_microstep: 2544.93 | step_microstep: 7.48
[2024-09-15 16:37:23,178] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2958.72 | bwd: 7761.90 | bwd_inner: 5216.87 | bwd_allreduce: 2544.96 | step: 7.71
 74%|███████▍  | 348/468 [1:02:14<21:09, 10.58s/it]                                                   {'loss': 0.1194, 'learning_rate': 6.535133922537513e-06, 'epoch': 2.22}
 74%|███████▍  | 348/468 [1:02:14<21:09, 10.58s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:37:28,549] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1910.14 | bwd_microstep: 3430.95 | bwd_inner_microstep: 3430.92 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.15
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:37:33,897] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.25 | optimizer_gradients: 0.39 | optimizer_step: 0.39
[2024-09-15 16:37:33,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.36 | bwd_microstep: 3846.70 | bwd_inner_microstep: 2587.97 | bwd_allreduce_microstep: 1258.67 | step_microstep: 7.48
[2024-09-15 16:37:33,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3378.47 | bwd: 7277.66 | bwd_inner: 6018.89 | bwd_allreduce: 1258.70 | step: 7.66
 75%|███████▍  | 349/468 [1:02:25<21:03, 10.62s/it]                                                   {'loss': 0.1655, 'learning_rate': 6.4328996167140786e-06, 'epoch': 2.23}
 75%|███████▍  | 349/468 [1:02:25<21:03, 10.62s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1877
[2024-09-15 16:37:39,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1930.70 | bwd_microstep: 3456.30 | bwd_inner_microstep: 3456.25 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:37:44,615] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:37:44,616] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.88 | bwd_microstep: 3797.45 | bwd_inner_microstep: 2588.53 | bwd_allreduce_microstep: 1208.86 | step_microstep: 7.47
[2024-09-15 16:37:44,616] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3398.54 | bwd: 7253.79 | bwd_inner: 6044.78 | bwd_allreduce: 1208.89 | step: 7.72
 75%|███████▍  | 350/468 [1:02:36<20:56, 10.65s/it]                                                   {'loss': 0.1652, 'learning_rate': 6.331317823095184e-06, 'epoch': 2.24}
 75%|███████▍  | 350/468 [1:02:36<20:56, 10.65s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1374
[2024-09-15 16:37:48,704] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.70 | bwd_microstep: 2589.07 | bwd_inner_microstep: 2588.95 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.34
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:37:54,147] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.18 | optimizer_gradients: 0.39 | optimizer_step: 0.42
[2024-09-15 16:37:54,147] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.53 | bwd_microstep: 3941.29 | bwd_inner_microstep: 2586.49 | bwd_allreduce_microstep: 1354.74 | step_microstep: 7.41
[2024-09-15 16:37:54,147] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2935.20 | bwd: 6530.39 | bwd_inner: 5175.44 | bwd_allreduce: 1354.82 | step: 7.75
 75%|███████▌  | 351/468 [1:02:45<20:06, 10.31s/it]                                                   {'loss': 0.2378, 'learning_rate': 6.230393427276e-06, 'epoch': 2.24}
 75%|███████▌  | 351/468 [1:02:45<20:06, 10.31s/it]dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1390
[2024-09-15 16:37:58,227] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1461.33 | bwd_microstep: 2588.43 | bwd_inner_microstep: 2588.36 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:38:03,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:38:03,671] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.74 | bwd_microstep: 3492.64 | bwd_inner_microstep: 3429.12 | bwd_allreduce_microstep: 63.47 | step_microstep: 7.62
[2024-09-15 16:38:03,671] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3377.05 | bwd: 6081.10 | bwd_inner: 6017.48 | bwd_allreduce: 63.51 | step: 7.85
 75%|███████▌  | 352/468 [1:02:55<19:28, 10.08s/it]                                                   {'loss': 0.229, 'learning_rate': 6.130131283234031e-06, 'epoch': 2.25}
 75%|███████▌  | 352/468 [1:02:55<19:28, 10.08s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:38:09,039] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.67 | bwd_microstep: 3423.00 | bwd_inner_microstep: 3422.97 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:38:13,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.49 | optimizer_gradients: 0.40 | optimizer_step: 0.41
[2024-09-15 16:38:13,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1470.81 | bwd_microstep: 2612.71 | bwd_inner_microstep: 2591.08 | bwd_allreduce_microstep: 21.58 | step_microstep: 7.90
[2024-09-15 16:38:13,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3384.46 | bwd: 6035.72 | bwd_inner: 6014.05 | bwd_allreduce: 21.61 | step: 8.14
 75%|███████▌  | 353/468 [1:03:04<18:58,  9.90s/it]                                                   {'loss': 0.1397, 'learning_rate': 6.0305362130956504e-06, 'epoch': 2.26}
 75%|███████▌  | 353/468 [1:03:04<18:58,  9.90s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:38:17,309] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1493.78 | bwd_microstep: 2629.02 | bwd_inner_microstep: 2628.95 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:38:23,938] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.32 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:38:23,939] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.27 | bwd_microstep: 4673.70 | bwd_inner_microstep: 3442.50 | bwd_allreduce_microstep: 1231.14 | step_microstep: 7.64
[2024-09-15 16:38:23,939] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3413.03 | bwd: 7302.76 | bwd_inner: 6071.45 | bwd_allreduce: 1231.19 | step: 7.89
 76%|███████▌  | 354/468 [1:03:15<19:18, 10.16s/it]                                                   {'loss': 0.1139, 'learning_rate': 5.931613006904196e-06, 'epoch': 2.26}
 76%|███████▌  | 354/468 [1:03:15<19:18, 10.16s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:38:28,031] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1469.76 | bwd_microstep: 2591.72 | bwd_inner_microstep: 2591.66 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:38:33,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.45 | optimizer_step: 0.41
[2024-09-15 16:38:33,360] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.71 | bwd_microstep: 3826.98 | bwd_inner_microstep: 2593.09 | bwd_allreduce_microstep: 1233.83 | step_microstep: 7.50
[2024-09-15 16:38:33,360] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2938.44 | bwd: 6418.74 | bwd_inner: 5184.75 | bwd_allreduce: 1233.87 | step: 7.75
 76%|███████▌  | 355/468 [1:03:24<18:43,  9.94s/it]                                                   {'loss': 0.1611, 'learning_rate': 5.8333664223895906e-06, 'epoch': 2.27}
 76%|███████▌  | 355/468 [1:03:24<18:43,  9.94s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:38:37,505] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1490.21 | bwd_microstep: 2624.64 | bwd_inner_microstep: 2624.61 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:38:44,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.45 | optimizer_gradients: 0.36 | optimizer_step: 0.39
[2024-09-15 16:38:44,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.69 | bwd_microstep: 5084.88 | bwd_inner_microstep: 2593.08 | bwd_allreduce_microstep: 2491.74 | step_microstep: 7.35
[2024-09-15 16:38:44,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2958.87 | bwd: 7709.53 | bwd_inner: 5217.69 | bwd_allreduce: 2491.77 | step: 7.60
 76%|███████▌  | 356/468 [1:03:35<19:00, 10.18s/it]                                                   {'loss': 0.1784, 'learning_rate': 5.735801184739489e-06, 'epoch': 2.27}
 76%|███████▌  | 356/468 [1:03:35<19:00, 10.18s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1387
[2024-09-15 16:38:48,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1474.10 | bwd_microstep: 2600.57 | bwd_inner_microstep: 2600.54 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.27
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:38:54,718] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.31 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:38:54,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1900.57 | bwd_microstep: 4582.26 | bwd_inner_microstep: 3420.33 | bwd_allreduce_microstep: 1161.86 | step_microstep: 7.51
[2024-09-15 16:38:54,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3374.64 | bwd: 7182.85 | bwd_inner: 6020.88 | bwd_allreduce: 1161.89 | step: 7.78
 76%|███████▋  | 357/468 [1:03:46<19:04, 10.31s/it]                                                   {'loss': 0.2004, 'learning_rate': 5.638921986372064e-06, 'epoch': 2.28}
 76%|███████▋  | 357/468 [1:03:46<19:04, 10.31s/it]dynamic ViT batch size: 41, images per sample: 5.125, dynamic token length: 1870
[2024-09-15 16:39:00,062] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1902.23 | bwd_microstep: 3410.34 | bwd_inner_microstep: 3410.31 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.11
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1367
[2024-09-15 16:39:05,358] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:39:05,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1466.59 | bwd_microstep: 3794.28 | bwd_inner_microstep: 2588.35 | bwd_allreduce_microstep: 1205.87 | step_microstep: 7.48
[2024-09-15 16:39:05,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3368.79 | bwd: 7204.63 | bwd_inner: 5998.66 | bwd_allreduce: 1205.90 | step: 7.62
 76%|███████▋  | 358/468 [1:03:56<19:05, 10.41s/it]                                                   {'loss': 0.1553, 'learning_rate': 5.542733486710299e-06, 'epoch': 2.29}
 76%|███████▋  | 358/468 [1:03:56<19:05, 10.41s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:39:10,723] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1910.09 | bwd_microstep: 3423.51 | bwd_inner_microstep: 3423.48 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:39:16,170] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.42 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:39:16,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1934.17 | bwd_microstep: 3477.10 | bwd_inner_microstep: 3461.07 | bwd_allreduce_microstep: 15.98 | step_microstep: 8.07
[2024-09-15 16:39:16,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3844.24 | bwd: 6900.62 | bwd_inner: 6884.55 | bwd_allreduce: 16.00 | step: 8.31
 77%|███████▋  | 359/468 [1:04:07<19:07, 10.53s/it]                                                   {'loss': 0.1787, 'learning_rate': 5.447240311957891e-06, 'epoch': 2.29}
 77%|███████▋  | 359/468 [1:04:07<19:07, 10.53s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:39:21,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.28 | bwd_microstep: 3431.16 | bwd_inner_microstep: 3431.14 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:39:26,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.41 | optimizer_gradients: 0.36 | optimizer_step: 0.42
[2024-09-15 16:39:26,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1914.68 | bwd_microstep: 3445.48 | bwd_inner_microstep: 3429.84 | bwd_allreduce_microstep: 15.58 | step_microstep: 7.88
[2024-09-15 16:39:26,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3832.93 | bwd: 6876.66 | bwd_inner: 6860.98 | bwd_allreduce: 15.61 | step: 8.13
 77%|███████▋  | 360/468 [1:04:18<19:05, 10.60s/it]                                                   {'loss': 0.1856, 'learning_rate': 5.352447054876755e-06, 'epoch': 2.3}
 77%|███████▋  | 360/468 [1:04:18<19:05, 10.60s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:39:31,096] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1491.73 | bwd_microstep: 2626.96 | bwd_inner_microstep: 2626.90 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1880
[2024-09-15 16:39:37,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.45 | optimizer_gradients: 0.36 | optimizer_step: 0.37
[2024-09-15 16:39:37,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1937.44 | bwd_microstep: 4628.76 | bwd_inner_microstep: 3462.23 | bwd_allreduce_microstep: 1166.47 | step_microstep: 7.93
[2024-09-15 16:39:37,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3429.14 | bwd: 7255.76 | bwd_inner: 6089.13 | bwd_allreduce: 1166.50 | step: 8.16
 77%|███████▋  | 361/468 [1:04:29<18:59, 10.65s/it]                                                   {'loss': 0.2069, 'learning_rate': 5.258358274566142e-06, 'epoch': 2.31}
 77%|███████▋  | 361/468 [1:04:29<18:59, 10.65s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:39:41,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1469.42 | bwd_microstep: 2590.14 | bwd_inner_microstep: 2590.07 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1870
[2024-09-15 16:39:48,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.50 | optimizer_gradients: 0.36 | optimizer_step: 0.38
[2024-09-15 16:39:48,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1901.48 | bwd_microstep: 4608.62 | bwd_inner_microstep: 3410.97 | bwd_allreduce_microstep: 1197.59 | step_microstep: 7.34
[2024-09-15 16:39:48,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3370.86 | bwd: 7198.81 | bwd_inner: 6001.04 | bwd_allreduce: 1197.64 | step: 7.59
 77%|███████▋  | 362/468 [1:04:39<18:48, 10.64s/it]                                                   {'loss': 0.2023, 'learning_rate': 5.164978496243354e-06, 'epoch': 2.31}
 77%|███████▋  | 362/468 [1:04:39<18:48, 10.64s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:39:53,753] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1931.41 | bwd_microstep: 3459.42 | bwd_inner_microstep: 3459.39 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.18
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:39:59,159] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.43 | optimizer_gradients: 0.36 | optimizer_step: 0.40
[2024-09-15 16:39:59,160] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.35 | bwd_microstep: 3451.30 | bwd_inner_microstep: 3436.34 | bwd_allreduce_microstep: 14.91 | step_microstep: 7.61
[2024-09-15 16:39:59,160] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3850.73 | bwd: 6910.73 | bwd_inner: 6895.73 | bwd_allreduce: 14.93 | step: 7.82
 78%|███████▊  | 363/468 [1:04:50<18:43, 10.70s/it]                                                   {'loss': 0.1453, 'learning_rate': 5.072312211026125e-06, 'epoch': 2.32}
 78%|███████▊  | 363/468 [1:04:50<18:43, 10.70s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:40:03,311] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1492.51 | bwd_microstep: 2627.24 | bwd_inner_microstep: 2627.14 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:40:08,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:40:08,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.64 | bwd_microstep: 3767.44 | bwd_inner_microstep: 2593.08 | bwd_allreduce_microstep: 1174.31 | step_microstep: 7.45
[2024-09-15 16:40:08,586] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2964.12 | bwd: 6394.73 | bwd_inner: 5220.21 | bwd_allreduce: 1174.38 | step: 7.71
 78%|███████▊  | 364/468 [1:04:59<17:52, 10.32s/it]                                                   {'loss': 0.1377, 'learning_rate': 4.980363875716592e-06, 'epoch': 2.33}
 78%|███████▊  | 364/468 [1:04:59<17:52, 10.32s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1877
[2024-09-15 16:40:13,909] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1888.39 | bwd_microstep: 3404.61 | bwd_inner_microstep: 3404.52 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.26
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:40:19,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.38 | optimizer_gradients: 0.36 | optimizer_step: 0.40
[2024-09-15 16:40:19,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.99 | bwd_microstep: 3465.37 | bwd_inner_microstep: 3438.17 | bwd_allreduce_microstep: 27.15 | step_microstep: 7.69
[2024-09-15 16:40:19,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3808.34 | bwd: 6870.02 | bwd_inner: 6842.68 | bwd_allreduce: 27.21 | step: 7.96
 78%|███████▊  | 365/468 [1:05:10<17:55, 10.45s/it]                                                   {'loss': 0.1887, 'learning_rate': 4.889137912586972e-06, 'epoch': 2.33}
 78%|███████▊  | 365/468 [1:05:10<17:55, 10.45s/it]dynamic ViT batch size: 34, images per sample: 4.25, dynamic token length: 1870
[2024-09-15 16:40:24,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1855.53 | bwd_microstep: 3371.39 | bwd_inner_microstep: 3371.24 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.24
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1366
[2024-09-15 16:40:30,054] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:40:30,055] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1456.00 | bwd_microstep: 3978.00 | bwd_inner_microstep: 2572.46 | bwd_allreduce_microstep: 1405.46 | step_microstep: 7.53
[2024-09-15 16:40:30,055] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3311.50 | bwd: 7349.43 | bwd_inner: 5943.71 | bwd_allreduce: 1405.58 | step: 7.78
 78%|███████▊  | 366/468 [1:05:21<17:53, 10.53s/it]                                                   {'loss': 0.0932, 'learning_rate': 4.7986387091668365e-06, 'epoch': 2.34}
 78%|███████▊  | 366/468 [1:05:21<17:53, 10.53s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:40:34,205] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1491.45 | bwd_microstep: 2627.64 | bwd_inner_microstep: 2627.51 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:40:40,724] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.43 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:40:40,724] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.96 | bwd_microstep: 4565.83 | bwd_inner_microstep: 3439.38 | bwd_allreduce_microstep: 1126.38 | step_microstep: 7.43
[2024-09-15 16:40:40,724] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3410.37 | bwd: 7193.51 | bwd_inner: 6066.89 | bwd_allreduce: 1126.48 | step: 7.67
 78%|███████▊  | 367/468 [1:05:32<17:47, 10.57s/it]                                                   {'loss': 0.1506, 'learning_rate': 4.708870618032133e-06, 'epoch': 2.35}
 78%|███████▊  | 367/468 [1:05:32<17:47, 10.57s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:40:46,105] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.88 | bwd_microstep: 3434.03 | bwd_inner_microstep: 3434.00 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:40:51,515] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.62 | optimizer_gradients: 0.34 | optimizer_step: 0.39
[2024-09-15 16:40:51,515] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.84 | bwd_microstep: 3456.55 | bwd_inner_microstep: 3441.49 | bwd_allreduce_microstep: 15.02 | step_microstep: 8.25
[2024-09-15 16:40:51,515] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3831.69 | bwd: 6890.59 | bwd_inner: 6875.49 | bwd_allreduce: 15.04 | step: 8.48
 79%|███████▊  | 368/468 [1:05:42<17:43, 10.64s/it]                                                   {'loss': 0.1293, 'learning_rate': 4.619837956595825e-06, 'epoch': 2.35}
 79%|███████▊  | 368/468 [1:05:42<17:43, 10.64s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:40:56,903] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.98 | bwd_microstep: 3437.30 | bwd_inner_microstep: 3437.28 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.17
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1878
[2024-09-15 16:41:02,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 4.39 | optimizer_gradients: 0.34 | optimizer_step: 0.38
[2024-09-15 16:41:02,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1889.14 | bwd_microstep: 3424.04 | bwd_inner_microstep: 3408.87 | bwd_allreduce_microstep: 15.11 | step_microstep: 11.36
[2024-09-15 16:41:02,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3809.09 | bwd: 6861.36 | bwd_inner: 6846.15 | bwd_allreduce: 15.14 | step: 11.55
 79%|███████▉  | 369/468 [1:05:53<17:36, 10.67s/it]                                                   {'loss': 0.1505, 'learning_rate': 4.531545006900244e-06, 'epoch': 2.36}
 79%|███████▉  | 369/468 [1:05:53<17:36, 10.67s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:41:07,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1936.30 | bwd_microstep: 3464.38 | bwd_inner_microstep: 3464.35 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.07
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:41:12,985] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.34 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:41:12,985] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1470.74 | bwd_microstep: 3794.39 | bwd_inner_microstep: 2594.10 | bwd_allreduce_microstep: 1200.22 | step_microstep: 7.33
[2024-09-15 16:41:12,985] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3407.02 | bwd: 7258.77 | bwd_inner: 6058.45 | bwd_allreduce: 1200.25 | step: 7.43
 79%|███████▉  | 370/468 [1:06:04<17:27, 10.69s/it]                                                   {'loss': 0.1176, 'learning_rate': 4.443996015411151e-06, 'epoch': 2.36}
 79%|███████▉  | 370/468 [1:06:04<17:27, 10.69s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1901
[2024-09-15 16:41:18,424] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1942.39 | bwd_microstep: 3465.91 | bwd_inner_microstep: 3465.88 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.16
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:41:23,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.47 | optimizer_gradients: 0.36 | optimizer_step: 0.40
[2024-09-15 16:41:23,837] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1920.23 | bwd_microstep: 3455.25 | bwd_inner_microstep: 3439.98 | bwd_allreduce_microstep: 15.22 | step_microstep: 8.20
[2024-09-15 16:41:23,837] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3862.59 | bwd: 6921.17 | bwd_inner: 6905.87 | bwd_allreduce: 15.24 | step: 8.39
 79%|███████▉  | 371/468 [1:06:15<17:21, 10.74s/it]                                                   {'loss': 0.2187, 'learning_rate': 4.357195192813504e-06, 'epoch': 2.37}
 79%|███████▉  | 371/468 [1:06:15<17:21, 10.74s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:41:27,992] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1494.27 | bwd_microstep: 2629.80 | bwd_inner_microstep: 2629.62 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.25
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:41:33,406] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.35 | optimizer_gradients: 0.38 | optimizer_step: 0.42
[2024-09-15 16:41:33,407] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1905.80 | bwd_microstep: 3474.49 | bwd_inner_microstep: 3433.71 | bwd_allreduce_microstep: 40.73 | step_microstep: 7.61
[2024-09-15 16:41:33,407] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3400.05 | bwd: 6104.33 | bwd_inner: 6063.34 | bwd_allreduce: 40.82 | step: 7.87
 79%|███████▉  | 372/468 [1:06:24<16:37, 10.39s/it]                                                   {'loss': 0.1538, 'learning_rate': 4.271146713808927e-06, 'epoch': 2.38}
 79%|███████▉  | 372/468 [1:06:24<16:37, 10.39s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:41:37,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1495.89 | bwd_microstep: 2631.95 | bwd_inner_microstep: 2631.83 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.24
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1878
[2024-09-15 16:41:42,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.32 | optimizer_gradients: 0.35 | optimizer_step: 0.41
[2024-09-15 16:41:42,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1892.44 | bwd_microstep: 3421.97 | bwd_inner_microstep: 3408.15 | bwd_allreduce_microstep: 13.77 | step_microstep: 7.67
[2024-09-15 16:41:42,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3388.30 | bwd: 6053.96 | bwd_inner: 6039.99 | bwd_allreduce: 13.85 | step: 7.92
 80%|███████▉  | 373/468 [1:06:34<16:01, 10.12s/it]                                                   {'loss': 0.2165, 'learning_rate': 4.185854716914952e-06, 'epoch': 2.38}
 80%|███████▉  | 373/468 [1:06:34<16:01, 10.12s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1902
[2024-09-15 16:41:48,360] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1945.27 | bwd_microstep: 3467.13 | bwd_inner_microstep: 3467.10 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.33
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1369
[2024-09-15 16:41:53,665] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:41:53,665] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1459.08 | bwd_microstep: 3814.07 | bwd_inner_microstep: 2575.31 | bwd_allreduce_microstep: 1238.70 | step_microstep: 7.70
[2024-09-15 16:41:53,666] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3404.32 | bwd: 7281.21 | bwd_inner: 6042.41 | bwd_allreduce: 1238.73 | step: 8.03
 80%|███████▉  | 374/468 [1:06:45<16:09, 10.31s/it]                                                   {'loss': 0.1662, 'learning_rate': 4.1013233042659606e-06, 'epoch': 2.39}
 80%|███████▉  | 374/468 [1:06:45<16:09, 10.31s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:41:59,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1920.50 | bwd_microstep: 3439.76 | bwd_inner_microstep: 3439.68 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 39, images per sample: 4.875, dynamic token length: 1876
[2024-09-15 16:42:04,410] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.39 | optimizer_gradients: 0.35 | optimizer_step: 0.39
[2024-09-15 16:42:04,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1895.03 | bwd_microstep: 3424.18 | bwd_inner_microstep: 3409.16 | bwd_allreduce_microstep: 14.98 | step_microstep: 8.17
[2024-09-15 16:42:04,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3815.51 | bwd: 6863.97 | bwd_inner: 6848.83 | bwd_allreduce: 15.02 | step: 8.42
 80%|████████  | 375/468 [1:06:55<16:11, 10.44s/it]                                                   {'loss': 0.143, 'learning_rate': 4.017556541415888e-06, 'epoch': 2.4}
 80%|████████  | 375/468 [1:06:55<16:11, 10.44s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:42:08,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1496.60 | bwd_microstep: 2632.19 | bwd_inner_microstep: 2632.03 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:42:15,133] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:42:15,134] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.58 | bwd_microstep: 4608.11 | bwd_inner_microstep: 3443.29 | bwd_allreduce_microstep: 1164.76 | step_microstep: 7.48
[2024-09-15 16:42:15,134] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3416.15 | bwd: 7240.34 | bwd_inner: 6075.32 | bwd_allreduce: 1164.89 | step: 7.74
 80%|████████  | 376/468 [1:07:06<16:08, 10.53s/it]                                                   {'loss': 0.1883, 'learning_rate': 3.9345584571427055e-06, 'epoch': 2.4}
 80%|████████  | 376/468 [1:07:06<16:08, 10.53s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:42:19,226] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.75 | bwd_microstep: 2592.15 | bwd_inner_microstep: 2592.07 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:42:24,590] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:42:24,590] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1469.59 | bwd_microstep: 3859.26 | bwd_inner_microstep: 2594.41 | bwd_allreduce_microstep: 1264.80 | step_microstep: 7.51
[2024-09-15 16:42:24,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2938.31 | bwd: 6451.44 | bwd_inner: 5186.48 | bwd_allreduce: 1264.84 | step: 7.76
 81%|████████  | 377/468 [1:07:15<15:28, 10.21s/it]                                                   {'loss': 0.1624, 'learning_rate': 3.852333043254639e-06, 'epoch': 2.41}
 81%|████████  | 377/468 [1:07:15<15:28, 10.21s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:42:28,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1491.89 | bwd_microstep: 2625.76 | bwd_inner_microstep: 2625.73 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:42:34,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.42 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:42:34,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.59 | bwd_microstep: 3454.08 | bwd_inner_microstep: 3438.21 | bwd_allreduce_microstep: 15.82 | step_microstep: 7.73
[2024-09-15 16:42:34,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3411.46 | bwd: 6079.85 | bwd_inner: 6063.94 | bwd_allreduce: 15.85 | step: 7.79
 81%|████████  | 378/468 [1:07:25<15:00, 10.01s/it]                                                   {'loss': 0.1824, 'learning_rate': 3.7708842543981928e-06, 'epoch': 2.42}
 81%|████████  | 378/468 [1:07:25<15:00, 10.01s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:42:38,296] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1494.69 | bwd_microstep: 2627.58 | bwd_inner_microstep: 2627.47 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:42:43,714] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.67 | optimizer_gradients: 0.34 | optimizer_step: 0.38
[2024-09-15 16:42:43,715] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1920.70 | bwd_microstep: 3463.30 | bwd_inner_microstep: 3447.99 | bwd_allreduce_microstep: 15.26 | step_microstep: 7.79
[2024-09-15 16:42:43,715] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3415.36 | bwd: 6090.92 | bwd_inner: 6075.45 | bwd_allreduce: 15.34 | step: 8.03
 81%|████████  | 379/468 [1:07:35<14:39,  9.88s/it]                                                   {'loss': 0.1933, 'learning_rate': 3.690216007867944e-06, 'epoch': 2.42}
 81%|████████  | 379/468 [1:07:35<14:39,  9.88s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:42:47,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1494.19 | bwd_microstep: 2628.85 | bwd_inner_microstep: 2628.82 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 46, images per sample: 5.75, dynamic token length: 1878
[2024-09-15 16:42:53,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.42 | optimizer_gradients: 0.34 | optimizer_step: 0.37
[2024-09-15 16:42:53,358] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1950.14 | bwd_microstep: 3504.02 | bwd_inner_microstep: 3488.45 | bwd_allreduce_microstep: 15.52 | step_microstep: 7.67
[2024-09-15 16:42:53,358] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3444.31 | bwd: 6132.89 | bwd_inner: 6117.28 | bwd_allreduce: 15.54 | step: 7.90
 81%|████████  | 380/468 [1:07:44<14:23,  9.81s/it]                                                   {'loss': 0.1557, 'learning_rate': 3.6103321834181437e-06, 'epoch': 2.43}
 81%|████████  | 380/468 [1:07:44<14:23,  9.81s/it]dynamic ViT batch size: 39, images per sample: 4.875, dynamic token length: 1369
[2024-09-15 16:42:57,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1461.96 | bwd_microstep: 2583.74 | bwd_inner_microstep: 2583.71 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:43:02,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.62 | optimizer_gradients: 0.34 | optimizer_step: 0.38
[2024-09-15 16:43:02,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.82 | bwd_microstep: 3453.62 | bwd_inner_microstep: 3438.61 | bwd_allreduce_microstep: 14.96 | step_microstep: 7.61
[2024-09-15 16:43:02,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3380.76 | bwd: 6037.37 | bwd_inner: 6022.32 | bwd_allreduce: 14.99 | step: 7.84
 81%|████████▏ | 381/468 [1:07:54<14:04,  9.71s/it]                                                   {'loss': 0.143, 'learning_rate': 3.5312366230761154e-06, 'epoch': 2.43}
 81%|████████▏ | 381/468 [1:07:54<14:04,  9.71s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:43:06,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1494.56 | bwd_microstep: 2632.87 | bwd_inner_microstep: 2632.84 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 6.0, dynamic token length: 1901
[2024-09-15 16:43:12,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.58 | optimizer_gradients: 0.35 | optimizer_step: 0.41
[2024-09-15 16:43:12,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1993.16 | bwd_microstep: 3555.40 | bwd_inner_microstep: 3540.40 | bwd_allreduce_microstep: 14.95 | step_microstep: 7.84
[2024-09-15 16:43:12,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3487.70 | bwd: 6188.28 | bwd_inner: 6173.25 | bwd_allreduce: 14.97 | step: 8.07
 82%|████████▏ | 382/468 [1:08:03<13:55,  9.72s/it]                                                   {'loss': 0.1588, 'learning_rate': 3.452933130957481e-06, 'epoch': 2.44}
 82%|████████▏ | 382/468 [1:08:03<13:55,  9.72s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:43:17,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.90 | bwd_microstep: 3443.21 | bwd_inner_microstep: 3443.18 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:43:23,358] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:43:23,358] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1474.22 | bwd_microstep: 3870.21 | bwd_inner_microstep: 2596.86 | bwd_allreduce_microstep: 1273.28 | step_microstep: 7.54
[2024-09-15 16:43:23,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3394.10 | bwd: 7313.43 | bwd_inner: 6040.05 | bwd_allreduce: 1273.31 | step: 7.79
 82%|████████▏ | 383/468 [1:08:14<14:13, 10.04s/it]                                                   {'loss': 0.1964, 'learning_rate': 3.375425473083185e-06, 'epoch': 2.45}
 82%|████████▏ | 383/468 [1:08:14<14:13, 10.04s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1390
[2024-09-15 16:43:27,514] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1494.38 | bwd_microstep: 2630.47 | bwd_inner_microstep: 2630.44 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:43:32,830] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.23 | optimizer_gradients: 0.38 | optimizer_step: 0.40
[2024-09-15 16:43:32,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1473.68 | bwd_microstep: 3807.60 | bwd_inner_microstep: 2596.95 | bwd_allreduce_microstep: 1210.60 | step_microstep: 7.25
[2024-09-15 16:43:32,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2968.03 | bwd: 6438.08 | bwd_inner: 5227.39 | bwd_allreduce: 1210.62 | step: 7.48
 82%|████████▏ | 384/468 [1:08:24<13:48,  9.87s/it]                                                   {'loss': 0.1688, 'learning_rate': 3.2987173771983816e-06, 'epoch': 2.45}
 82%|████████▏ | 384/468 [1:08:24<13:48,  9.87s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:43:38,229] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1920.75 | bwd_microstep: 3446.84 | bwd_inner_microstep: 3446.81 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:43:43,637] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.43 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:43:43,638] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1921.19 | bwd_microstep: 3451.13 | bwd_inner_microstep: 3435.91 | bwd_allreduce_microstep: 15.17 | step_microstep: 8.15
[2024-09-15 16:43:43,638] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3841.91 | bwd: 6897.98 | bwd_inner: 6882.72 | bwd_allreduce: 15.19 | step: 8.39
 82%|████████▏ | 385/468 [1:08:35<14:02, 10.15s/it]                                                   {'loss': 0.1837, 'learning_rate': 3.2228125325931514e-06, 'epoch': 2.46}
 82%|████████▏ | 385/468 [1:08:35<14:02, 10.15s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:43:49,013] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.53 | bwd_microstep: 3428.44 | bwd_inner_microstep: 3428.41 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1877
[2024-09-15 16:43:54,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.51 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:43:54,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1902.27 | bwd_microstep: 3436.67 | bwd_inner_microstep: 3424.11 | bwd_allreduce_microstep: 12.51 | step_microstep: 7.39
[2024-09-15 16:43:54,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3817.77 | bwd: 6865.15 | bwd_inner: 6852.53 | bwd_allreduce: 12.53 | step: 7.64
 82%|████████▏ | 386/468 [1:08:45<14:07, 10.33s/it]                                                   {'loss': 0.1463, 'learning_rate': 3.1477145899250326e-06, 'epoch': 2.47}
 82%|████████▏ | 386/468 [1:08:45<14:07, 10.33s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:43:58,549] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1498.56 | bwd_microstep: 2632.33 | bwd_inner_microstep: 2632.14 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.26
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1366
[2024-09-15 16:44:05,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.25 | optimizer_gradients: 0.42 | optimizer_step: 0.40
[2024-09-15 16:44:05,208] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1458.35 | bwd_microstep: 5164.17 | bwd_inner_microstep: 2577.02 | bwd_allreduce_microstep: 2587.09 | step_microstep: 7.52
[2024-09-15 16:44:05,208] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2956.89 | bwd: 7796.54 | bwd_inner: 5209.16 | bwd_allreduce: 2587.24 | step: 7.78
 83%|████████▎ | 387/468 [1:08:56<14:08, 10.48s/it]                                                   {'loss': 0.2133, 'learning_rate': 3.073427161043492e-06, 'epoch': 2.47}
 83%|████████▎ | 387/468 [1:08:56<14:08, 10.48s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:44:09,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1492.46 | bwd_microstep: 2626.25 | bwd_inner_microstep: 2626.22 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:44:14,765] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.40 | optimizer_gradients: 0.36 | optimizer_step: 0.40
[2024-09-15 16:44:14,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.31 | bwd_microstep: 3455.06 | bwd_inner_microstep: 3440.01 | bwd_allreduce_microstep: 15.00 | step_microstep: 7.81
[2024-09-15 16:44:14,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3409.75 | bwd: 6081.31 | bwd_inner: 6066.23 | bwd_allreduce: 15.02 | step: 8.04
 83%|████████▎ | 388/468 [1:09:06<13:36, 10.20s/it]                                                   {'loss': 0.1345, 'learning_rate': 2.9999538188161705e-06, 'epoch': 2.48}
 83%|████████▎ | 388/468 [1:09:06<13:36, 10.20s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:44:18,919] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1493.29 | bwd_microstep: 2629.55 | bwd_inner_microstep: 2629.47 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:44:25,528] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:44:25,528] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1921.66 | bwd_microstep: 4652.83 | bwd_inner_microstep: 3442.99 | bwd_allreduce_microstep: 1209.77 | step_microstep: 7.49
[2024-09-15 16:44:25,529] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3414.93 | bwd: 7282.42 | bwd_inner: 6072.47 | bwd_allreduce: 1209.82 | step: 7.74
 83%|████████▎ | 389/468 [1:09:16<13:39, 10.37s/it]                                                   {'loss': 0.2226, 'learning_rate': 2.927298096957063e-06, 'epoch': 2.49}
 83%|████████▎ | 389/468 [1:09:16<13:39, 10.37s/it]dynamic ViT batch size: 46, images per sample: 5.75, dynamic token length: 1905
[2024-09-15 16:44:31,049] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1968.89 | bwd_microstep: 3520.78 | bwd_inner_microstep: 3520.75 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1368
[2024-09-15 16:44:36,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:44:36,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1442.78 | bwd_microstep: 3698.57 | bwd_inner_microstep: 2558.77 | bwd_allreduce_microstep: 1139.73 | step_microstep: 7.51
[2024-09-15 16:44:36,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3411.64 | bwd: 7219.36 | bwd_inner: 6079.53 | bwd_allreduce: 1139.76 | step: 7.58
 83%|████████▎ | 390/468 [1:09:27<13:36, 10.47s/it]                                                   {'loss': 0.1664, 'learning_rate': 2.8554634898565668e-06, 'epoch': 2.49}
 83%|████████▎ | 390/468 [1:09:27<13:36, 10.47s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:44:40,373] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1492.10 | bwd_microstep: 2625.86 | bwd_inner_microstep: 2625.80 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1877
[2024-09-15 16:44:45,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.57 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:44:45,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1937.47 | bwd_microstep: 3476.78 | bwd_inner_microstep: 3461.67 | bwd_allreduce_microstep: 15.06 | step_microstep: 7.89
[2024-09-15 16:44:45,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3429.54 | bwd: 6102.68 | bwd_inner: 6087.47 | bwd_allreduce: 15.09 | step: 8.13
 84%|████████▎ | 391/468 [1:09:37<13:05, 10.21s/it]                                                   {'loss': 0.1388, 'learning_rate': 2.784453452413405e-06, 'epoch': 2.5}
 84%|████████▎ | 391/468 [1:09:37<13:05, 10.21s/it]dynamic ViT batch size: 43, images per sample: 5.375, dynamic token length: 1878
[2024-09-15 16:44:51,235] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1928.31 | bwd_microstep: 3451.60 | bwd_inner_microstep: 3451.57 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1880
[2024-09-15 16:44:56,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.38 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:44:56,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1904.35 | bwd_microstep: 3438.69 | bwd_inner_microstep: 3423.73 | bwd_allreduce_microstep: 14.92 | step_microstep: 8.54
[2024-09-15 16:44:56,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3832.63 | bwd: 6890.31 | bwd_inner: 6875.31 | bwd_allreduce: 14.94 | step: 8.78
 84%|████████▍ | 392/468 [1:09:48<13:09, 10.38s/it]                                                   {'loss': 0.2065, 'learning_rate': 2.714271399868473e-06, 'epoch': 2.5}
 84%|████████▍ | 392/468 [1:09:48<13:09, 10.38s/it]dynamic ViT batch size: 46, images per sample: 5.75, dynamic token length: 1878
[2024-09-15 16:45:02,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1952.48 | bwd_microstep: 3486.77 | bwd_inner_microstep: 3486.74 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.16
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:45:07,485] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.70 | optimizer_gradients: 0.34 | optimizer_step: 0.38
[2024-09-15 16:45:07,486] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.57 | bwd_microstep: 3448.34 | bwd_inner_microstep: 3433.45 | bwd_allreduce_microstep: 14.81 | step_microstep: 8.25
[2024-09-15 16:45:07,486] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3871.03 | bwd: 6935.12 | bwd_inner: 6920.19 | bwd_allreduce: 14.85 | step: 8.44
 84%|████████▍ | 393/468 [1:09:58<13:09, 10.53s/it]                                                   {'loss': 0.1353, 'learning_rate': 2.6449207076405857e-06, 'epoch': 2.51}
 84%|████████▍ | 393/468 [1:09:58<13:09, 10.53s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:45:12,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1903.98 | bwd_microstep: 3417.56 | bwd_inner_microstep: 3417.53 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:45:18,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.24 | optimizer_gradients: 0.47 | optimizer_step: 0.41
[2024-09-15 16:45:18,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1905.62 | bwd_microstep: 3449.65 | bwd_inner_microstep: 3436.72 | bwd_allreduce_microstep: 12.88 | step_microstep: 10.02
[2024-09-15 16:45:18,231] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3809.57 | bwd: 6867.21 | bwd_inner: 6854.24 | bwd_allreduce: 12.90 | step: 10.09
 84%|████████▍ | 394/468 [1:10:09<13:04, 10.60s/it]                                                   {'loss': 0.1601, 'learning_rate': 2.57640471116412e-06, 'epoch': 2.52}
 84%|████████▍ | 394/468 [1:10:09<13:04, 10.60s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:45:23,636] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.75 | bwd_microstep: 3451.42 | bwd_inner_microstep: 3451.40 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.18
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:45:29,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.48 | optimizer_gradients: 0.33 | optimizer_step: 0.39
[2024-09-15 16:45:29,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1921.27 | bwd_microstep: 3459.02 | bwd_inner_microstep: 3444.43 | bwd_allreduce_microstep: 14.54 | step_microstep: 8.36
[2024-09-15 16:45:29,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3839.99 | bwd: 6910.45 | bwd_inner: 6895.83 | bwd_allreduce: 14.56 | step: 8.57
 84%|████████▍ | 395/468 [1:10:20<12:58, 10.66s/it]                                                   {'loss': 0.2478, 'learning_rate': 2.508726705728617e-06, 'epoch': 2.52}
 84%|████████▍ | 395/468 [1:10:20<12:58, 10.66s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:45:33,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1493.15 | bwd_microstep: 2638.98 | bwd_inner_microstep: 2638.95 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:45:38,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 4.09 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:45:38,632] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1921.59 | bwd_microstep: 3458.62 | bwd_inner_microstep: 3445.35 | bwd_allreduce_microstep: 13.22 | step_microstep: 9.04
[2024-09-15 16:45:38,632] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3414.72 | bwd: 6097.61 | bwd_inner: 6084.30 | bwd_allreduce: 13.24 | step: 9.21
 85%|████████▍ | 396/468 [1:10:30<12:24, 10.34s/it]                                                   {'loss': 0.1515, 'learning_rate': 2.441889946320266e-06, 'epoch': 2.53}
 85%|████████▍ | 396/468 [1:10:30<12:24, 10.34s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:45:42,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1494.67 | bwd_microstep: 2636.36 | bwd_inner_microstep: 2636.34 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:45:48,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.72 | optimizer_gradients: 0.36 | optimizer_step: 0.37
[2024-09-15 16:45:48,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1472.66 | bwd_microstep: 3762.03 | bwd_inner_microstep: 2597.70 | bwd_allreduce_microstep: 1164.27 | step_microstep: 7.52
[2024-09-15 16:45:48,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2967.30 | bwd: 6398.41 | bwd_inner: 5234.04 | bwd_allreduce: 1164.30 | step: 7.57
 85%|████████▍ | 397/468 [1:10:39<11:54, 10.06s/it]                                                   {'loss': 0.11, 'learning_rate': 2.3758976474653904e-06, 'epoch': 2.54}
 85%|████████▍ | 397/468 [1:10:39<11:54, 10.06s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:45:52,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1494.61 | bwd_microstep: 2632.80 | bwd_inner_microstep: 2632.77 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:45:57,538] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.26 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:45:57,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.99 | bwd_microstep: 3816.40 | bwd_inner_microstep: 2602.68 | bwd_allreduce_microstep: 1213.65 | step_microstep: 7.48
[2024-09-15 16:45:57,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2966.58 | bwd: 6449.21 | bwd_inner: 5235.46 | bwd_allreduce: 1213.68 | step: 7.53
 85%|████████▌ | 398/468 [1:10:48<11:32,  9.89s/it]                                                   {'loss': 0.179, 'learning_rate': 2.310752983075819e-06, 'epoch': 2.54}
 85%|████████▌ | 398/468 [1:10:48<11:32,  9.89s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:46:02,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.91 | bwd_microstep: 3429.15 | bwd_inner_microstep: 3429.12 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:46:08,275] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.40 | optimizer_step: 0.44
[2024-09-15 16:46:08,275] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.65 | bwd_microstep: 3856.48 | bwd_inner_microstep: 2594.67 | bwd_allreduce_microstep: 1261.75 | step_microstep: 7.72
[2024-09-15 16:46:08,275] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3389.53 | bwd: 7285.64 | bwd_inner: 6023.79 | bwd_allreduce: 1261.78 | step: 7.77
 85%|████████▌ | 399/468 [1:10:59<11:39, 10.14s/it]                                                   {'loss': 0.1746, 'learning_rate': 2.2464590862962443e-06, 'epoch': 2.55}
 85%|████████▌ | 399/468 [1:10:59<11:39, 10.14s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1390
[2024-09-15 16:46:12,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1495.03 | bwd_microstep: 2634.76 | bwd_inner_microstep: 2634.65 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:46:19,047] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:46:19,048] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1904.23 | bwd_microstep: 4673.80 | bwd_inner_microstep: 3427.19 | bwd_allreduce_microstep: 1246.54 | step_microstep: 7.53
[2024-09-15 16:46:19,048] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3399.24 | bwd: 7308.60 | bwd_inner: 6061.84 | bwd_allreduce: 1246.63 | step: 7.78
 85%|████████▌ | 400/468 [1:11:10<11:42, 10.33s/it]                                                   {'loss': 0.1632, 'learning_rate': 2.1830190493535385e-06, 'epoch': 2.56}
 85%|████████▌ | 400/468 [1:11:10<11:42, 10.33s/it][INFO|trainer.py:2936] 2024-09-15 16:46:26,282 >> Saving model checkpoint to work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400
[INFO|configuration_utils.py:473] 2024-09-15 16:46:26,284 >> Configuration saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/config.json
[INFO|configuration_utils.py:594] 2024-09-15 16:46:26,284 >> Configuration saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/generation_config.json
[INFO|modeling_utils.py:2501] 2024-09-15 16:46:47,703 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2433] 2024-09-15 16:46:47,705 >> tokenizer config file saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/tokenizer_config.json
[INFO|tokenization_utils_base.py:2442] 2024-09-15 16:46:47,705 >> Special tokens file saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/special_tokens_map.json
[INFO|tokenization_utils_base.py:2493] 2024-09-15 16:46:47,705 >> added tokens file saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/added_tokens.json
[2024-09-15 16:46:48,662] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step400 is about to be saved!
[2024-09-15 16:46:48,690] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/global_step400/mp_rank_00_model_states.pt
[2024-09-15 16:46:48,690] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/global_step400/mp_rank_00_model_states.pt...
[2024-09-15 16:47:06,164] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/global_step400/mp_rank_00_model_states.pt.
[2024-09-15 16:47:06,167] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/global_step400/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2024-09-15 16:47:06,326] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/global_step400/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2024-09-15 16:47:06,326] [INFO] [engine.py:3488:_save_zero_checkpoint] zero checkpoint saved work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/global_step400/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2024-09-15 16:47:06,326] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step400 is ready now!
[INFO|trainer.py:3028] 2024-09-15 16:47:06,338 >> Deleting older checkpoint [work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/checkpoint-200] due to args.save_total_limit
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1877
[2024-09-15 16:47:13,743] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1870.20 | bwd_microstep: 3387.36 | bwd_inner_microstep: 3387.31 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:47:17,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.28 | optimizer_gradients: 0.36 | optimizer_step: 0.40
[2024-09-15 16:47:17,817] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1452.23 | bwd_microstep: 2585.92 | bwd_inner_microstep: 2562.88 | bwd_allreduce_microstep: 23.00 | step_microstep: 8.05
[2024-09-15 16:47:17,817] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3322.40 | bwd: 5973.32 | bwd_inner: 5950.18 | bwd_allreduce: 23.02 | step: 8.30
 86%|████████▌ | 401/468 [1:12:09<27:45, 24.86s/it]                                                   {'loss': 0.2318, 'learning_rate': 2.1204359234080196e-06, 'epoch': 2.56}
 86%|████████▌ | 401/468 [1:12:09<27:45, 24.86s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1870
[2024-09-15 16:47:23,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1878.75 | bwd_microstep: 3375.62 | bwd_inner_microstep: 3375.60 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1367
[2024-09-15 16:47:28,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.35 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:47:28,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1455.01 | bwd_microstep: 3775.43 | bwd_inner_microstep: 2566.75 | bwd_allreduce_microstep: 1208.63 | step_microstep: 7.84
[2024-09-15 16:47:28,369] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3333.73 | bwd: 7151.07 | bwd_inner: 5942.34 | bwd_allreduce: 1208.66 | step: 8.09
 86%|████████▌ | 402/468 [1:12:19<22:37, 20.57s/it]                                                   {'loss': 0.2058, 'learning_rate': 2.058712718406719e-06, 'epoch': 2.57}
 86%|████████▌ | 402/468 [1:12:19<22:37, 20.57s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:47:33,743] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.32 | bwd_microstep: 3425.20 | bwd_inner_microstep: 3425.17 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:47:39,148] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.36 | optimizer_gradients: 0.36 | optimizer_step: 0.40
[2024-09-15 16:47:39,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1921.05 | bwd_microstep: 3447.91 | bwd_inner_microstep: 3433.17 | bwd_allreduce_microstep: 14.69 | step_microstep: 8.50
[2024-09-15 16:47:39,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3839.32 | bwd: 6873.11 | bwd_inner: 6858.34 | bwd_allreduce: 14.71 | step: 8.75
 86%|████████▌ | 403/468 [1:12:30<19:06, 17.63s/it]                                                   {'loss': 0.1229, 'learning_rate': 1.9978524029386026e-06, 'epoch': 2.58}
 86%|████████▌ | 403/468 [1:12:30<19:06, 17.63s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:47:43,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1483.12 | bwd_microstep: 2608.19 | bwd_inner_microstep: 2608.16 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:47:49,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.49 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:47:49,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1907.29 | bwd_microstep: 4624.91 | bwd_inner_microstep: 3416.40 | bwd_allreduce_microstep: 1208.45 | step_microstep: 7.90
[2024-09-15 16:47:49,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3390.38 | bwd: 7233.13 | bwd_inner: 6024.56 | bwd_allreduce: 1208.48 | step: 8.15
 86%|████████▋ | 404/468 [1:12:41<16:35, 15.55s/it]                                                   {'loss': 0.1819, 'learning_rate': 1.937857904091818e-06, 'epoch': 2.58}
 86%|████████▋ | 404/468 [1:12:41<16:35, 15.55s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:47:53,964] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1483.96 | bwd_microstep: 2610.58 | bwd_inner_microstep: 2610.52 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.30
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1881
[2024-09-15 16:48:00,523] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 4.46 | optimizer_gradients: 0.42 | optimizer_step: 0.40
[2024-09-15 16:48:00,524] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.45 | bwd_microstep: 4609.68 | bwd_inner_microstep: 3420.48 | bwd_allreduce_microstep: 1189.13 | step_microstep: 11.76
[2024-09-15 16:48:00,524] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3396.39 | bwd: 7220.29 | bwd_inner: 6031.00 | bwd_allreduce: 1189.17 | step: 12.01
 87%|████████▋ | 405/468 [1:12:51<14:47, 14.09s/it]                                                   {'loss': 0.1619, 'learning_rate': 1.8787321073128817e-06, 'epoch': 2.59}
 87%|████████▋ | 405/468 [1:12:51<14:47, 14.09s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1368
[2024-09-15 16:48:04,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1449.55 | bwd_microstep: 2557.73 | bwd_inner_microstep: 2557.62 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:48:11,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.35 | optimizer_gradients: 0.39 | optimizer_step: 0.39
[2024-09-15 16:48:11,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1910.16 | bwd_microstep: 4673.80 | bwd_inner_microstep: 3425.76 | bwd_allreduce_microstep: 1247.97 | step_microstep: 7.79
[2024-09-15 16:48:11,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3359.69 | bwd: 7231.56 | bwd_inner: 5983.39 | bwd_allreduce: 1248.05 | step: 8.04
 87%|████████▋ | 406/468 [1:13:02<13:29, 13.06s/it]                                                   {'loss': 0.1799, 'learning_rate': 1.8204778562679437e-06, 'epoch': 2.59}
 87%|████████▋ | 406/468 [1:13:02<13:29, 13.06s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1901
[2024-09-15 16:48:16,637] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1949.22 | bwd_microstep: 3474.08 | bwd_inner_microstep: 3474.00 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:48:20,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.81 | optimizer_gradients: 0.45 | optimizer_step: 0.40
[2024-09-15 16:48:20,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1463.32 | bwd_microstep: 2597.06 | bwd_inner_microstep: 2581.63 | bwd_allreduce_microstep: 15.38 | step_microstep: 8.30
[2024-09-15 16:48:20,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3412.53 | bwd: 6071.18 | bwd_inner: 6055.63 | bwd_allreduce: 15.42 | step: 8.55
 87%|████████▋ | 407/468 [1:13:12<12:12, 12.01s/it]                                                   {'loss': 0.1823, 'learning_rate': 1.7630979527059877e-06, 'epoch': 2.6}
 87%|████████▋ | 407/468 [1:13:12<12:12, 12.01s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1880
[2024-09-15 16:48:26,041] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1887.36 | bwd_microstep: 3391.14 | bwd_inner_microstep: 3391.01 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:48:31,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.48 | optimizer_gradients: 0.34 | optimizer_step: 0.40
[2024-09-15 16:48:31,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1911.45 | bwd_microstep: 3445.07 | bwd_inner_microstep: 3429.93 | bwd_allreduce_microstep: 15.09 | step_microstep: 8.01
[2024-09-15 16:48:31,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3798.78 | bwd: 6836.24 | bwd_inner: 6820.94 | bwd_allreduce: 15.19 | step: 8.26
 87%|████████▋ | 408/468 [1:13:22<11:36, 11.62s/it]                                                   {'loss': 0.1946, 'learning_rate': 1.7065951563241022e-06, 'epoch': 2.61}
 87%|████████▋ | 408/468 [1:13:22<11:36, 11.62s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1388
[2024-09-15 16:48:35,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1475.73 | bwd_microstep: 2601.75 | bwd_inner_microstep: 2601.69 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:48:42,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.32 | optimizer_gradients: 0.38 | optimizer_step: 0.41
[2024-09-15 16:48:42,122] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1914.41 | bwd_microstep: 4629.74 | bwd_inner_microstep: 3434.94 | bwd_allreduce_microstep: 1194.73 | step_microstep: 7.61
[2024-09-15 16:48:42,122] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3390.12 | bwd: 7231.53 | bwd_inner: 6036.63 | bwd_allreduce: 1194.77 | step: 7.84
 87%|████████▋ | 409/468 [1:13:33<11:08, 11.34s/it]                                                   {'loss': 0.1903, 'learning_rate': 1.6509721846347382e-06, 'epoch': 2.61}
 87%|████████▋ | 409/468 [1:13:33<11:08, 11.34s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:48:47,499] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.93 | bwd_microstep: 3431.68 | bwd_inner_microstep: 3431.65 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:48:52,899] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.47 | optimizer_gradients: 0.35 | optimizer_step: 0.42
[2024-09-15 16:48:52,899] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.53 | bwd_microstep: 3446.66 | bwd_inner_microstep: 3431.56 | bwd_allreduce_microstep: 15.05 | step_microstep: 8.32
[2024-09-15 16:48:52,900] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3831.44 | bwd: 6878.35 | bwd_inner: 6863.22 | bwd_allreduce: 15.07 | step: 8.56
 88%|████████▊ | 410/468 [1:13:44<10:47, 11.17s/it]                                                   {'loss': 0.1675, 'learning_rate': 1.5962317128350147e-06, 'epoch': 2.62}
 88%|████████▊ | 410/468 [1:13:44<10:47, 11.17s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1900
[2024-09-15 16:48:58,264] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1908.65 | bwd_microstep: 3425.70 | bwd_inner_microstep: 3425.67 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:49:03,684] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.59 | optimizer_gradients: 0.39 | optimizer_step: 0.37
[2024-09-15 16:49:03,685] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1914.44 | bwd_microstep: 3473.70 | bwd_inner_microstep: 3434.33 | bwd_allreduce_microstep: 39.31 | step_microstep: 8.20
[2024-09-15 16:49:03,685] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3823.06 | bwd: 6899.41 | bwd_inner: 6860.00 | bwd_allreduce: 39.34 | step: 8.25
 88%|████████▊ | 411/468 [1:13:55<10:30, 11.05s/it]                                                   {'loss': 0.1545, 'learning_rate': 1.5423763736780583e-06, 'epoch': 2.63}
 88%|████████▊ | 411/468 [1:13:55<10:30, 11.05s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1870
[2024-09-15 16:49:09,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1896.69 | bwd_microstep: 3409.53 | bwd_inner_microstep: 3409.50 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:49:14,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.55 | optimizer_gradients: 0.45 | optimizer_step: 0.42
[2024-09-15 16:49:14,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1470.92 | bwd_microstep: 3890.89 | bwd_inner_microstep: 2597.80 | bwd_allreduce_microstep: 1293.02 | step_microstep: 8.73
[2024-09-15 16:49:14,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3367.59 | bwd: 7300.43 | bwd_inner: 6007.30 | bwd_allreduce: 1293.05 | step: 8.79
 88%|████████▊ | 412/468 [1:14:05<10:13, 10.96s/it]                                                   {'loss': 0.1292, 'learning_rate': 1.4894087573463734e-06, 'epoch': 2.63}
 88%|████████▊ | 412/468 [1:14:05<10:13, 10.96s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:49:18,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1492.58 | bwd_microstep: 2630.79 | bwd_inner_microstep: 2630.76 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.19
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:49:25,240] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.66 | optimizer_gradients: 0.38 | optimizer_step: 0.84
[2024-09-15 16:49:25,240] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1916.85 | bwd_microstep: 4705.93 | bwd_inner_microstep: 3438.64 | bwd_allreduce_microstep: 1267.23 | step_microstep: 8.60
[2024-09-15 16:49:25,240] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3409.40 | bwd: 7336.73 | bwd_inner: 6069.40 | bwd_allreduce: 1267.26 | step: 8.82
 88%|████████▊ | 413/468 [1:14:16<10:00, 10.92s/it]                                                   {'loss': 0.1341, 'learning_rate': 1.437331411327274e-06, 'epoch': 2.64}
 88%|████████▊ | 413/468 [1:14:16<10:00, 10.92s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1900
[2024-09-15 16:49:30,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1937.95 | bwd_microstep: 3458.61 | bwd_inner_microstep: 3458.59 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.06
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:49:35,991] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.41 | optimizer_gradients: 0.36 | optimizer_step: 0.36
[2024-09-15 16:49:35,992] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.91 | bwd_microstep: 3822.49 | bwd_inner_microstep: 2594.24 | bwd_allreduce_microstep: 1228.19 | step_microstep: 7.44
[2024-09-15 16:49:35,992] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3406.83 | bwd: 7281.12 | bwd_inner: 6052.83 | bwd_allreduce: 1228.22 | step: 7.53
 88%|████████▊ | 414/468 [1:14:27<09:46, 10.87s/it]                                                   {'loss': 0.1419, 'learning_rate': 1.3861468402903634e-06, 'epoch': 2.65}
 88%|████████▊ | 414/468 [1:14:27<09:46, 10.87s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1900
[2024-09-15 16:49:41,423] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1935.92 | bwd_microstep: 3464.79 | bwd_inner_microstep: 3464.77 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:49:46,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.27 | optimizer_gradients: 0.40 | optimizer_step: 0.45
[2024-09-15 16:49:46,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1470.76 | bwd_microstep: 3829.91 | bwd_inner_microstep: 2591.20 | bwd_allreduce_microstep: 1238.65 | step_microstep: 7.63
[2024-09-15 16:49:46,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3406.66 | bwd: 7294.72 | bwd_inner: 6055.97 | bwd_allreduce: 1238.68 | step: 7.87
 89%|████████▊ | 415/468 [1:14:38<09:34, 10.84s/it]                                                   {'loss': 0.2009, 'learning_rate': 1.3358575059670532e-06, 'epoch': 2.65}
 89%|████████▊ | 415/468 [1:14:38<09:34, 10.84s/it]dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1388
[2024-09-15 16:49:50,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1459.37 | bwd_microstep: 2587.13 | bwd_inner_microstep: 2586.88 | bwd_allreduce_microstep: 0.15 | step_microstep: 0.25
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:49:57,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.44 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:49:57,502] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1469.77 | bwd_microstep: 5160.56 | bwd_inner_microstep: 2590.45 | bwd_allreduce_microstep: 2570.06 | step_microstep: 7.54
[2024-09-15 16:49:57,502] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2929.12 | bwd: 7747.74 | bwd_inner: 5177.33 | bwd_allreduce: 2570.23 | step: 7.79
 89%|████████▉ | 416/468 [1:14:48<09:22, 10.81s/it]                                                   {'loss': 0.1272, 'learning_rate': 1.2864658270321905e-06, 'epoch': 2.66}
 89%|████████▉ | 416/468 [1:14:48<09:22, 10.81s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1877
[2024-09-15 16:50:02,833] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1895.45 | bwd_microstep: 3403.39 | bwd_inner_microstep: 3403.36 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.20
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:50:08,340] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.53 | optimizer_gradients: 0.38 | optimizer_step: 0.39
[2024-09-15 16:50:08,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.70 | bwd_microstep: 4003.07 | bwd_inner_microstep: 2588.72 | bwd_allreduce_microstep: 1414.30 | step_microstep: 7.97
[2024-09-15 16:50:08,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3363.13 | bwd: 7406.47 | bwd_inner: 5992.08 | bwd_allreduce: 1414.33 | step: 8.19
 89%|████████▉ | 417/468 [1:14:59<09:11, 10.82s/it]                                                   {'loss': 0.1821, 'learning_rate': 1.2379741789877175e-06, 'epoch': 2.66}
 89%|████████▉ | 417/468 [1:14:59<09:11, 10.82s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:50:12,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1463.62 | bwd_microstep: 2586.35 | bwd_inner_microstep: 2586.32 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:50:19,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:50:19,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1464.50 | bwd_microstep: 5104.68 | bwd_inner_microstep: 2587.84 | bwd_allreduce_microstep: 2516.77 | step_microstep: 7.24
[2024-09-15 16:50:19,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2928.10 | bwd: 7691.04 | bwd_inner: 5174.17 | bwd_allreduce: 2516.80 | step: 7.46
 89%|████████▉ | 418/468 [1:15:10<08:58, 10.78s/it]                                                   {'loss': 0.1756, 'learning_rate': 1.1903848940484241e-06, 'epoch': 2.67}
 89%|████████▉ | 418/468 [1:15:10<08:58, 10.78s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:50:24,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1909.15 | bwd_microstep: 3431.46 | bwd_inner_microstep: 3431.43 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.21
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1367
[2024-09-15 16:50:29,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.39 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:50:29,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1435.89 | bwd_microstep: 3959.05 | bwd_inner_microstep: 2547.99 | bwd_allreduce_microstep: 1411.00 | step_microstep: 7.66
[2024-09-15 16:50:29,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3345.02 | bwd: 7390.52 | bwd_inner: 5979.42 | bwd_allreduce: 1411.03 | step: 7.89
 90%|████████▉ | 419/468 [1:15:21<08:48, 10.79s/it]                                                   {'loss': 0.1342, 'learning_rate': 1.1437002610297787e-06, 'epoch': 2.68}
 90%|████████▉ | 419/468 [1:15:21<08:48, 10.79s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:50:35,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1909.26 | bwd_microstep: 3428.96 | bwd_inner_microstep: 3428.94 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.15
dynamic ViT batch size: 39, images per sample: 4.875, dynamic token length: 1368
[2024-09-15 16:50:39,301] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.62 | optimizer_gradients: 0.35 | optimizer_step: 0.38
[2024-09-15 16:50:39,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1458.61 | bwd_microstep: 2609.67 | bwd_inner_microstep: 2573.57 | bwd_allreduce_microstep: 36.04 | step_microstep: 7.86
[2024-09-15 16:50:39,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3367.84 | bwd: 6038.64 | bwd_inner: 6002.51 | bwd_allreduce: 36.07 | step: 8.05
 90%|████████▉ | 420/468 [1:15:30<08:18, 10.39s/it]                                                   {'loss': 0.2193, 'learning_rate': 1.097922525237849e-06, 'epoch': 2.68}
 90%|████████▉ | 420/468 [1:15:30<08:18, 10.39s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:50:44,677] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.25 | bwd_microstep: 3430.69 | bwd_inner_microstep: 3430.67 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1870
[2024-09-15 16:50:50,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.48 | optimizer_gradients: 0.35 | optimizer_step: 0.38
[2024-09-15 16:50:50,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1882.46 | bwd_microstep: 3425.17 | bwd_inner_microstep: 3386.92 | bwd_allreduce_microstep: 38.20 | step_microstep: 8.16
[2024-09-15 16:50:50,022] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3794.69 | bwd: 6855.88 | bwd_inner: 6817.59 | bwd_allreduce: 38.22 | step: 8.40
 90%|████████▉ | 421/468 [1:15:41<08:13, 10.49s/it]                                                   {'loss': 0.1873, 'learning_rate': 1.0530538883613129e-06, 'epoch': 2.69}
 90%|████████▉ | 421/468 [1:15:41<08:13, 10.49s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:50:55,389] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.28 | bwd_microstep: 3423.58 | bwd_inner_microstep: 3423.55 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.18
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:51:00,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.52 | optimizer_gradients: 0.34 | optimizer_step: 0.38
[2024-09-15 16:51:00,761] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1900.15 | bwd_microstep: 3434.64 | bwd_inner_microstep: 3419.88 | bwd_allreduce_microstep: 14.71 | step_microstep: 8.34
[2024-09-15 16:51:00,761] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3813.41 | bwd: 6858.23 | bwd_inner: 6843.44 | bwd_allreduce: 14.73 | step: 8.55
 90%|█████████ | 422/468 [1:15:52<08:05, 10.56s/it]                                                   {'loss': 0.2289, 'learning_rate': 1.0090965083655657e-06, 'epoch': 2.7}
 90%|█████████ | 422/468 [1:15:52<08:05, 10.56s/it]dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1388
[2024-09-15 16:51:04,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1461.31 | bwd_microstep: 2588.03 | bwd_inner_microstep: 2587.84 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.23
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1368
[2024-09-15 16:51:11,481] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.36 | optimizer_step: 0.37
[2024-09-15 16:51:11,482] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1440.53 | bwd_microstep: 5166.57 | bwd_inner_microstep: 2555.92 | bwd_allreduce_microstep: 2610.52 | step_microstep: 7.24
[2024-09-15 16:51:11,482] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2901.82 | bwd: 7754.64 | bwd_inner: 5143.80 | bwd_allreduce: 2610.65 | step: 7.47
 90%|█████████ | 423/468 [1:16:02<07:57, 10.61s/it]                                                   {'loss': 0.1694, 'learning_rate': 9.660524993889386e-07, 'epoch': 2.7}
 90%|█████████ | 423/468 [1:16:02<07:57, 10.61s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1389
[2024-09-15 16:51:15,592] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1476.37 | bwd_microstep: 2603.62 | bwd_inner_microstep: 2603.59 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:51:22,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.38 | optimizer_step: 0.39
[2024-09-15 16:51:22,215] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.69 | bwd_microstep: 4676.15 | bwd_inner_microstep: 3434.07 | bwd_allreduce_microstep: 1242.02 | step_microstep: 7.14
[2024-09-15 16:51:22,215] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3389.03 | bwd: 7279.78 | bwd_inner: 6037.66 | bwd_allreduce: 1242.05 | step: 7.35
 91%|█████████ | 424/468 [1:16:13<07:48, 10.65s/it]                                                   {'loss': 0.1551, 'learning_rate': 9.239239316410109e-07, 'epoch': 2.71}
 91%|█████████ | 424/468 [1:16:13<07:48, 10.65s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1366
[2024-09-15 16:51:26,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1449.72 | bwd_microstep: 2564.19 | bwd_inner_microstep: 2564.16 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.20
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1368
[2024-09-15 16:51:31,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:51:31,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1454.29 | bwd_microstep: 3968.27 | bwd_inner_microstep: 2568.99 | bwd_allreduce_microstep: 1399.22 | step_microstep: 7.50
[2024-09-15 16:51:31,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2903.98 | bwd: 6532.47 | bwd_inner: 5133.16 | bwd_allreduce: 1399.25 | step: 7.72
 91%|█████████ | 425/468 [1:16:23<07:23, 10.30s/it]                                                   {'loss': 0.2198, 'learning_rate': 8.827128313030453e-07, 'epoch': 2.72}
 91%|█████████ | 425/468 [1:16:23<07:23, 10.30s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:51:35,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1489.97 | bwd_microstep: 2627.42 | bwd_inner_microstep: 2627.28 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:51:42,552] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:51:42,553] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.22 | bwd_microstep: 4733.20 | bwd_inner_microstep: 3431.86 | bwd_allreduce_microstep: 1301.27 | step_microstep: 7.48
[2024-09-15 16:51:42,553] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3408.17 | bwd: 7360.66 | bwd_inner: 6059.14 | bwd_allreduce: 1301.39 | step: 7.74
 91%|█████████ | 426/468 [1:16:33<07:19, 10.46s/it]                                                   {'loss': 0.1434, 'learning_rate': 8.42421180430546e-07, 'epoch': 2.72}
 91%|█████████ | 426/468 [1:16:33<07:19, 10.46s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:51:47,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1907.38 | bwd_microstep: 3423.59 | bwd_inner_microstep: 3423.57 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1373
[2024-09-15 16:51:53,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.27 | optimizer_gradients: 0.43 | optimizer_step: 0.39
[2024-09-15 16:51:53,274] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.74 | bwd_microstep: 3852.14 | bwd_inner_microstep: 2593.45 | bwd_allreduce_microstep: 1258.64 | step_microstep: 7.43
[2024-09-15 16:51:53,274] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3379.10 | bwd: 7275.75 | bwd_inner: 6017.01 | bwd_allreduce: 1258.67 | step: 7.68
 91%|█████████ | 427/468 [1:16:44<07:12, 10.54s/it]                                                   {'loss': 0.2212, 'learning_rate': 8.03050916857917e-07, 'epoch': 2.73}
 91%|█████████ | 427/468 [1:16:44<07:12, 10.54s/it]dynamic ViT batch size: 41, images per sample: 5.125, dynamic token length: 1876
[2024-09-15 16:51:58,628] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1904.72 | bwd_microstep: 3419.12 | bwd_inner_microstep: 3419.09 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:52:02,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.63 | optimizer_gradients: 0.33 | optimizer_step: 0.39
[2024-09-15 16:52:02,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.93 | bwd_microstep: 2607.11 | bwd_inner_microstep: 2590.27 | bwd_allreduce_microstep: 16.80 | step_microstep: 7.76
[2024-09-15 16:52:02,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3372.63 | bwd: 6026.24 | bwd_inner: 6009.36 | bwd_allreduce: 16.82 | step: 7.99
 91%|█████████▏| 428/468 [1:16:54<06:48, 10.22s/it]                                                   {'loss': 0.214, 'learning_rate': 7.646039341052747e-07, 'epoch': 2.73}
 91%|█████████▏| 428/468 [1:16:54<06:48, 10.22s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:52:08,106] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.89 | bwd_microstep: 3424.20 | bwd_inner_microstep: 3424.18 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.07
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:52:13,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.36 | optimizer_gradients: 0.33 | optimizer_step: 0.39
[2024-09-15 16:52:13,518] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.35 | bwd_microstep: 3456.82 | bwd_inner_microstep: 3442.16 | bwd_allreduce_microstep: 14.62 | step_microstep: 7.97
[2024-09-15 16:52:13,518] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3831.21 | bwd: 6881.04 | bwd_inner: 6866.33 | bwd_allreduce: 14.64 | step: 8.07
 92%|█████████▏| 429/468 [1:17:04<06:45, 10.39s/it]                                                   {'loss': 0.1959, 'learning_rate': 7.270820812873714e-07, 'epoch': 2.74}
 92%|█████████▏| 429/468 [1:17:04<06:45, 10.39s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1366
[2024-09-15 16:52:17,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1454.77 | bwd_microstep: 2571.14 | bwd_inner_microstep: 2571.11 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:52:24,101] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.28 | optimizer_gradients: 0.38 | optimizer_step: 0.39
[2024-09-15 16:52:24,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.16 | bwd_microstep: 4574.58 | bwd_inner_microstep: 3442.44 | bwd_allreduce_microstep: 1132.08 | step_microstep: 7.45
[2024-09-15 16:52:24,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3371.91 | bwd: 7145.73 | bwd_inner: 6013.56 | bwd_allreduce: 1132.11 | step: 7.69
 92%|█████████▏| 430/468 [1:17:15<06:36, 10.45s/it]                                                   {'loss': 0.1571, 'learning_rate': 6.904871630246646e-07, 'epoch': 2.75}
 92%|█████████▏| 430/468 [1:17:15<06:36, 10.45s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:52:29,481] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.88 | bwd_microstep: 3436.11 | bwd_inner_microstep: 3436.09 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.07
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:52:33,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.70 | optimizer_gradients: 0.38 | optimizer_step: 0.42
[2024-09-15 16:52:33,600] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.09 | bwd_microstep: 2608.36 | bwd_inner_microstep: 2593.14 | bwd_allreduce_microstep: 15.17 | step_microstep: 8.01
[2024-09-15 16:52:33,600] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3383.94 | bwd: 6044.48 | bwd_inner: 6029.23 | bwd_allreduce: 15.19 | step: 8.11
 92%|█████████▏| 431/468 [1:17:25<06:15, 10.16s/it]                                                   {'loss': 0.1282, 'learning_rate': 6.548209393565241e-07, 'epoch': 2.75}
 92%|█████████▏| 431/468 [1:17:25<06:15, 10.16s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:52:38,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1903.24 | bwd_microstep: 3419.45 | bwd_inner_microstep: 3419.40 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:52:44,340] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.49 | optimizer_gradients: 0.36 | optimizer_step: 0.41
[2024-09-15 16:52:44,340] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1905.12 | bwd_microstep: 3445.32 | bwd_inner_microstep: 3430.35 | bwd_allreduce_microstep: 14.93 | step_microstep: 8.30
[2024-09-15 16:52:44,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3808.34 | bwd: 6864.81 | bwd_inner: 6849.75 | bwd_allreduce: 14.95 | step: 8.54
 92%|█████████▏| 432/468 [1:17:35<06:12, 10.34s/it]                                                   {'loss': 0.0944, 'learning_rate': 6.200851256565799e-07, 'epoch': 2.76}
 92%|█████████▏| 432/468 [1:17:35<06:12, 10.34s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:52:48,437] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.74 | bwd_microstep: 2595.51 | bwd_inner_microstep: 2595.40 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:52:53,723] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.40 | optimizer_step: 0.45
[2024-09-15 16:52:53,723] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.08 | bwd_microstep: 3778.52 | bwd_inner_microstep: 2593.46 | bwd_allreduce_microstep: 1185.01 | step_microstep: 7.56
[2024-09-15 16:52:53,723] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2942.80 | bwd: 6374.07 | bwd_inner: 5188.86 | bwd_allreduce: 1185.09 | step: 7.81
 93%|█████████▎| 433/468 [1:17:45<05:51, 10.05s/it]                                                   {'loss': 0.1417, 'learning_rate': 5.862813925502209e-07, 'epoch': 2.77}
 93%|█████████▎| 433/468 [1:17:45<05:51, 10.05s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:52:59,105] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.13 | bwd_microstep: 3434.25 | bwd_inner_microstep: 3434.23 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.14
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1368
[2024-09-15 16:53:04,467] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.43 | optimizer_gradients: 0.42 | optimizer_step: 0.38
[2024-09-15 16:53:04,468] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1441.09 | bwd_microstep: 3887.32 | bwd_inner_microstep: 2557.62 | bwd_allreduce_microstep: 1329.64 | step_microstep: 7.54
[2024-09-15 16:53:04,468] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3358.19 | bwd: 7321.58 | bwd_inner: 5991.85 | bwd_allreduce: 1329.67 | step: 7.71
 93%|█████████▎| 434/468 [1:17:55<05:48, 10.26s/it]                                                   {'loss': 0.2177, 'learning_rate': 5.53411365834251e-07, 'epoch': 2.77}
 93%|█████████▎| 434/468 [1:17:55<05:48, 10.26s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:53:08,617] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1491.42 | bwd_microstep: 2627.14 | bwd_inner_microstep: 2627.02 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.23
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1881
[2024-09-15 16:53:15,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:53:15,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1888.53 | bwd_microstep: 4554.88 | bwd_inner_microstep: 3399.28 | bwd_allreduce_microstep: 1155.54 | step_microstep: 7.19
[2024-09-15 16:53:15,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3379.92 | bwd: 7182.06 | bwd_inner: 6026.30 | bwd_allreduce: 1155.62 | step: 7.43
 93%|█████████▎| 435/468 [1:18:06<05:42, 10.37s/it]                                                   {'loss': 0.1399, 'learning_rate': 5.214766263986848e-07, 'epoch': 2.78}
 93%|█████████▎| 435/468 [1:18:06<05:42, 10.37s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1900
[2024-09-15 16:53:20,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1920.54 | bwd_microstep: 3443.81 | bwd_inner_microstep: 3443.78 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:53:25,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.57 | optimizer_gradients: 0.34 | optimizer_step: 0.38
[2024-09-15 16:53:25,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1899.96 | bwd_microstep: 3434.30 | bwd_inner_microstep: 3420.47 | bwd_allreduce_microstep: 13.78 | step_microstep: 8.54
[2024-09-15 16:53:25,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3820.47 | bwd: 6878.11 | bwd_inner: 6864.25 | bwd_allreduce: 13.80 | step: 8.61
 93%|█████████▎| 436/468 [1:18:17<05:35, 10.49s/it]                                                   {'loss': 0.1264, 'learning_rate': 4.904787101507324e-07, 'epoch': 2.79}
 93%|█████████▎| 436/468 [1:18:17<05:35, 10.49s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:53:31,243] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1916.57 | bwd_microstep: 3436.52 | bwd_inner_microstep: 3436.50 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:53:36,647] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.56 | optimizer_gradients: 0.37 | optimizer_step: 0.41
[2024-09-15 16:53:36,648] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.24 | bwd_microstep: 3454.63 | bwd_inner_microstep: 3442.04 | bwd_allreduce_microstep: 12.55 | step_microstep: 8.56
[2024-09-15 16:53:36,648] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3834.78 | bwd: 6891.16 | bwd_inner: 6878.53 | bwd_allreduce: 12.57 | step: 8.61
 93%|█████████▎| 437/468 [1:18:28<05:27, 10.58s/it]                                                   {'loss': 0.1468, 'learning_rate': 4.604191079409126e-07, 'epoch': 2.79}
 93%|█████████▎| 437/468 [1:18:28<05:27, 10.58s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1389
[2024-09-15 16:53:40,768] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1478.38 | bwd_microstep: 2611.11 | bwd_inner_microstep: 2611.07 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:53:47,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:53:47,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1472.10 | bwd_microstep: 5055.96 | bwd_inner_microstep: 2591.54 | bwd_allreduce_microstep: 2464.36 | step_microstep: 7.15
[2024-09-15 16:53:47,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2950.45 | bwd: 7667.12 | bwd_inner: 5202.61 | bwd_allreduce: 2464.39 | step: 7.37
 94%|█████████▎| 438/468 [1:18:38<05:18, 10.61s/it]                                                   {'loss': 0.1082, 'learning_rate': 4.3129926549136057e-07, 'epoch': 2.8}
 94%|█████████▎| 438/468 [1:18:38<05:18, 10.61s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:53:52,706] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.46 | bwd_microstep: 3430.83 | bwd_inner_microstep: 3430.80 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.19
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:53:56,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.61 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:53:56,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.49 | bwd_microstep: 2606.21 | bwd_inner_microstep: 2591.13 | bwd_allreduce_microstep: 15.04 | step_microstep: 8.31
[2024-09-15 16:53:56,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3380.92 | bwd: 6037.05 | bwd_inner: 6021.93 | bwd_allreduce: 15.06 | step: 8.53
 94%|█████████▍| 439/468 [1:18:48<04:57, 10.27s/it]                                                   {'loss': 0.1996, 'learning_rate': 4.031205833262863e-07, 'epoch': 2.81}
 94%|█████████▍| 439/468 [1:18:48<04:57, 10.27s/it]dynamic ViT batch size: 41, images per sample: 5.125, dynamic token length: 1870
[2024-09-15 16:54:02,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1906.98 | bwd_microstep: 3413.13 | bwd_inner_microstep: 3413.10 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:54:07,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.59 | optimizer_gradients: 0.34 | optimizer_step: 0.38
[2024-09-15 16:54:07,581] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.16 | bwd_microstep: 3455.85 | bwd_inner_microstep: 3440.91 | bwd_allreduce_microstep: 14.89 | step_microstep: 8.33
[2024-09-15 16:54:07,581] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3824.10 | bwd: 6868.99 | bwd_inner: 6854.01 | bwd_allreduce: 14.91 | step: 8.57
 94%|█████████▍| 440/468 [1:18:58<04:51, 10.42s/it]                                                   {'loss': 0.159, 'learning_rate': 3.7588441670462827e-07, 'epoch': 2.81}
 94%|█████████▍| 440/468 [1:18:58<04:51, 10.42s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:54:11,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1491.38 | bwd_microstep: 2630.44 | bwd_inner_microstep: 2630.42 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:54:17,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.55 | optimizer_gradients: 0.39 | optimizer_step: 0.38
[2024-09-15 16:54:17,079] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1470.08 | bwd_microstep: 3838.79 | bwd_inner_microstep: 2596.96 | bwd_allreduce_microstep: 1241.77 | step_microstep: 7.63
[2024-09-15 16:54:17,079] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2961.42 | bwd: 6469.25 | bwd_inner: 5227.38 | bwd_allreduce: 1241.80 | step: 7.86
 94%|█████████▍| 441/468 [1:19:08<04:33, 10.14s/it]                                                   {'loss': 0.1389, 'learning_rate': 3.4959207555485873e-07, 'epoch': 2.82}
 94%|█████████▍| 441/468 [1:19:08<04:33, 10.14s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:54:22,461] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.18 | bwd_microstep: 3437.65 | bwd_inner_microstep: 3437.62 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.20
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1369
[2024-09-15 16:54:27,897] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.50 | optimizer_gradients: 0.39 | optimizer_step: 0.38
[2024-09-15 16:54:27,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1440.35 | bwd_microstep: 3962.55 | bwd_inner_microstep: 2556.61 | bwd_allreduce_microstep: 1405.88 | step_microstep: 7.88
[2024-09-15 16:54:27,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3353.50 | bwd: 7400.21 | bwd_inner: 5994.23 | bwd_allreduce: 1405.91 | step: 8.10
 94%|█████████▍| 442/468 [1:19:19<04:28, 10.35s/it]                                                   {'loss': 0.1808, 'learning_rate': 3.242448244119967e-07, 'epoch': 2.82}
 94%|█████████▍| 442/468 [1:19:19<04:28, 10.35s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:54:32,046] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1490.43 | bwd_microstep: 2627.18 | bwd_inner_microstep: 2627.15 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.07
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:54:38,475] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.39 | optimizer_gradients: 0.37 | optimizer_step: 0.40
[2024-09-15 16:54:38,475] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1901.09 | bwd_microstep: 4492.62 | bwd_inner_microstep: 3419.48 | bwd_allreduce_microstep: 1073.06 | step_microstep: 7.41
[2024-09-15 16:54:38,476] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3391.49 | bwd: 7119.81 | bwd_inner: 6046.64 | bwd_allreduce: 1073.10 | step: 7.51
 95%|█████████▍| 443/468 [1:19:29<04:20, 10.42s/it]                                                   {'loss': 0.1019, 'learning_rate': 2.99843882356774e-07, 'epoch': 2.83}
 95%|█████████▍| 443/468 [1:19:29<04:20, 10.42s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1900
[2024-09-15 16:54:43,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.59 | bwd_microstep: 3447.69 | bwd_inner_microstep: 3447.67 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.06
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:54:49,267] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 4.25 | optimizer_gradients: 0.33 | optimizer_step: 0.38
[2024-09-15 16:54:49,267] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1914.53 | bwd_microstep: 3439.27 | bwd_inner_microstep: 3424.46 | bwd_allreduce_microstep: 14.76 | step_microstep: 11.66
[2024-09-15 16:54:49,268] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3834.10 | bwd: 6886.97 | bwd_inner: 6872.13 | bwd_allreduce: 14.78 | step: 11.75
 95%|█████████▍| 444/468 [1:19:40<04:12, 10.53s/it]                                                   {'loss': 0.1137, 'learning_rate': 2.7639042295702245e-07, 'epoch': 2.84}
 95%|█████████▍| 444/468 [1:19:40<04:12, 10.53s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1901
[2024-09-15 16:54:54,648] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1910.79 | bwd_microstep: 3439.81 | bwd_inner_microstep: 3439.79 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.13
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:54:59,997] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.48 | optimizer_gradients: 0.39 | optimizer_step: 0.38
[2024-09-15 16:54:59,998] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.90 | bwd_microstep: 3842.01 | bwd_inner_microstep: 2592.58 | bwd_allreduce_microstep: 1249.37 | step_microstep: 8.30
[2024-09-15 16:54:59,998] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3382.66 | bwd: 7281.84 | bwd_inner: 6032.37 | bwd_allreduce: 1249.40 | step: 8.46
 95%|█████████▌| 445/468 [1:19:51<04:03, 10.59s/it]                                                   {'loss': 0.2141, 'learning_rate': 2.5388557421120564e-07, 'epoch': 2.84}
 95%|█████████▌| 445/468 [1:19:51<04:03, 10.59s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:55:04,145] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1490.23 | bwd_microstep: 2627.46 | bwd_inner_microstep: 2627.43 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.08
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:55:10,740] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.39 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:55:10,740] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.32 | bwd_microstep: 4642.11 | bwd_inner_microstep: 3439.25 | bwd_allreduce_microstep: 1202.80 | step_microstep: 7.47
[2024-09-15 16:55:10,740] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3407.52 | bwd: 7269.58 | bwd_inner: 6066.68 | bwd_allreduce: 1202.83 | step: 7.58
 95%|█████████▌| 446/468 [1:20:02<03:53, 10.64s/it]                                                   {'loss': 0.1763, 'learning_rate': 2.3233041849419547e-07, 'epoch': 2.85}
 95%|█████████▌| 446/468 [1:20:02<03:53, 10.64s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1387
[2024-09-15 16:55:14,853] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1475.94 | bwd_microstep: 2606.73 | bwd_inner_microstep: 2606.70 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:55:21,365] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.36 | optimizer_gradients: 0.38 | optimizer_step: 0.39
[2024-09-15 16:55:21,366] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.71 | bwd_microstep: 4560.68 | bwd_inner_microstep: 3439.87 | bwd_allreduce_microstep: 1120.75 | step_microstep: 7.96
[2024-09-15 16:55:21,366] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3391.63 | bwd: 7167.42 | bwd_inner: 6046.57 | bwd_allreduce: 1120.78 | step: 8.09
 96%|█████████▌| 447/468 [1:20:12<03:43, 10.63s/it]                                                   {'loss': 0.1588, 'learning_rate': 2.1172599250519398e-07, 'epoch': 2.86}
 96%|█████████▌| 447/468 [1:20:12<03:43, 10.63s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1388
[2024-09-15 16:55:25,476] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1474.43 | bwd_microstep: 2605.28 | bwd_inner_microstep: 2605.26 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.15
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1368
[2024-09-15 16:55:30,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.57 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:55:30,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1453.77 | bwd_microstep: 3887.98 | bwd_inner_microstep: 2570.81 | bwd_allreduce_microstep: 1317.11 | step_microstep: 7.66
[2024-09-15 16:55:30,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2928.17 | bwd: 6493.28 | bwd_inner: 5176.07 | bwd_allreduce: 1317.14 | step: 7.84
 96%|█████████▌| 448/468 [1:20:22<03:25, 10.29s/it]                                                   {'loss': 0.2067, 'learning_rate': 1.9207328721788653e-07, 'epoch': 2.86}
 96%|█████████▌| 448/468 [1:20:22<03:25, 10.29s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1881
[2024-09-15 16:55:36,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1932.23 | bwd_microstep: 3457.35 | bwd_inner_microstep: 3457.32 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.06
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:55:41,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.50 | optimizer_gradients: 0.42 | optimizer_step: 0.41
[2024-09-15 16:55:41,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1916.27 | bwd_microstep: 3450.88 | bwd_inner_microstep: 3436.11 | bwd_allreduce_microstep: 14.71 | step_microstep: 8.75
[2024-09-15 16:55:41,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3848.48 | bwd: 6908.23 | bwd_inner: 6893.43 | bwd_allreduce: 14.74 | step: 8.85
 96%|█████████▌| 449/468 [1:20:33<03:18, 10.45s/it]                                                   {'loss': 0.1241, 'learning_rate': 1.7337324783276878e-07, 'epoch': 2.87}
 96%|█████████▌| 449/468 [1:20:33<03:18, 10.45s/it]dynamic ViT batch size: 32, images per sample: 4.0, dynamic token length: 1369
[2024-09-15 16:55:45,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1407.94 | bwd_microstep: 2516.55 | bwd_inner_microstep: 2516.53 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.10
dynamic ViT batch size: 41, images per sample: 5.125, dynamic token length: 1877
[2024-09-15 16:55:51,169] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.54 | optimizer_gradients: 0.39 | optimizer_step: 0.38
[2024-09-15 16:55:51,170] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1910.02 | bwd_microstep: 3591.66 | bwd_inner_microstep: 3426.61 | bwd_allreduce_microstep: 164.99 | step_microstep: 7.67
[2024-09-15 16:55:51,170] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3317.92 | bwd: 6108.22 | bwd_inner: 5943.13 | bwd_allreduce: 165.02 | step: 7.79
 96%|█████████▌| 450/468 [1:20:42<03:02, 10.16s/it]                                                   {'loss': 0.1489, 'learning_rate': 1.5562677373169855e-07, 'epoch': 2.88}
 96%|█████████▌| 450/468 [1:20:42<03:02, 10.16s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:55:55,322] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1494.36 | bwd_microstep: 2627.59 | bwd_inner_microstep: 2627.56 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:56:01,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.51 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:56:01,875] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1936.16 | bwd_microstep: 4580.75 | bwd_inner_microstep: 3466.33 | bwd_allreduce_microstep: 1114.37 | step_microstep: 7.67
[2024-09-15 16:56:01,875] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3430.49 | bwd: 7208.35 | bwd_inner: 6093.89 | bwd_allreduce: 1114.40 | step: 7.90
 96%|█████████▋| 451/468 [1:20:53<02:55, 10.33s/it]                                                   {'loss': 0.136, 'learning_rate': 1.388347184346328e-07, 'epoch': 2.88}
 96%|█████████▋| 451/468 [1:20:53<02:55, 10.33s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:56:07,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.60 | bwd_microstep: 3432.64 | bwd_inner_microstep: 3432.61 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.18
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1367
[2024-09-15 16:56:12,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:56:12,636] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1469.02 | bwd_microstep: 3877.83 | bwd_inner_microstep: 2593.00 | bwd_allreduce_microstep: 1284.77 | step_microstep: 7.21
[2024-09-15 16:56:12,636] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3384.59 | bwd: 7310.49 | bwd_inner: 6025.62 | bwd_allreduce: 1284.80 | step: 7.42
 97%|█████████▋| 452/468 [1:21:04<02:47, 10.46s/it]                                                   {'loss': 0.1723, 'learning_rate': 1.2299788955857817e-07, 'epoch': 2.89}
 97%|█████████▋| 452/468 [1:21:04<02:47, 10.46s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:56:16,784] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1489.86 | bwd_microstep: 2627.31 | bwd_inner_microstep: 2627.25 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:56:22,087] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.38 | optimizer_gradients: 0.42 | optimizer_step: 0.39
[2024-09-15 16:56:22,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.85 | bwd_microstep: 3799.95 | bwd_inner_microstep: 2591.84 | bwd_allreduce_microstep: 1208.05 | step_microstep: 7.84
[2024-09-15 16:56:22,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2958.69 | bwd: 6427.31 | bwd_inner: 5219.09 | bwd_allreduce: 1208.08 | step: 8.07
 97%|█████████▋| 453/468 [1:21:13<02:32, 10.15s/it]                                                   {'loss': 0.227, 'learning_rate': 1.0811704877875528e-07, 'epoch': 2.89}
 97%|█████████▋| 453/468 [1:21:13<02:32, 10.15s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:56:27,462] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1911.81 | bwd_microstep: 3430.89 | bwd_inner_microstep: 3430.86 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:56:32,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.65 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:56:32,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.60 | bwd_microstep: 3452.58 | bwd_inner_microstep: 3440.22 | bwd_allreduce_microstep: 12.31 | step_microstep: 8.66
[2024-09-15 16:56:32,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3827.38 | bwd: 6883.48 | bwd_inner: 6871.08 | bwd_allreduce: 12.34 | step: 8.89
 97%|█████████▋| 454/468 [1:21:24<02:24, 10.34s/it]                                                   {'loss': 0.1729, 'learning_rate': 9.419291179195267e-08, 'epoch': 2.9}
 97%|█████████▋| 454/468 [1:21:24<02:24, 10.34s/it]dynamic ViT batch size: 37, images per sample: 4.625, dynamic token length: 1366
[2024-09-15 16:56:36,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1447.58 | bwd_microstep: 2561.77 | bwd_inner_microstep: 2561.74 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:56:43,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.56 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:56:43,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.30 | bwd_microstep: 5180.67 | bwd_inner_microstep: 2594.48 | bwd_allreduce_microstep: 2586.13 | step_microstep: 8.02
[2024-09-15 16:56:43,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2918.86 | bwd: 7742.44 | bwd_inner: 5156.22 | bwd_allreduce: 2586.16 | step: 8.09
 97%|█████████▋| 455/468 [1:21:34<02:15, 10.46s/it]                                                   {'loss': 0.1533, 'learning_rate': 8.122614828211861e-08, 'epoch': 2.91}
 97%|█████████▋| 455/468 [1:21:34<02:15, 10.46s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1876
[2024-09-15 16:56:49,006] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1927.42 | bwd_microstep: 3455.13 | bwd_inner_microstep: 3455.10 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.06
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1368
[2024-09-15 16:56:54,334] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.50 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:56:54,334] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1454.28 | bwd_microstep: 3838.30 | bwd_inner_microstep: 2572.00 | bwd_allreduce_microstep: 1266.25 | step_microstep: 7.53
[2024-09-15 16:56:54,335] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3381.67 | bwd: 7293.44 | bwd_inner: 6027.10 | bwd_allreduce: 1266.28 | step: 7.62
 97%|█████████▋| 456/468 [1:21:45<02:06, 10.54s/it]                                                   {'loss': 0.1285, 'learning_rate': 6.921738188814254e-08, 'epoch': 2.91}
 97%|█████████▋| 456/468 [1:21:45<02:06, 10.54s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:56:59,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1911.22 | bwd_microstep: 3431.65 | bwd_inner_microstep: 3431.63 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:57:05,105] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.56 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:57:05,105] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.15 | bwd_microstep: 3448.51 | bwd_inner_microstep: 3433.72 | bwd_allreduce_microstep: 14.72 | step_microstep: 8.31
[2024-09-15 16:57:05,106] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3824.35 | bwd: 6880.19 | bwd_inner: 6865.35 | bwd_allreduce: 14.75 | step: 8.43
 98%|█████████▊| 457/468 [1:21:56<01:56, 10.61s/it]                                                   {'loss': 0.2263, 'learning_rate': 5.816719017386785e-08, 'epoch': 2.92}
 98%|█████████▊| 457/468 [1:21:56<01:56, 10.61s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1870
[2024-09-15 16:57:10,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1896.16 | bwd_microstep: 3403.59 | bwd_inner_microstep: 3403.57 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:57:15,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.61 | optimizer_gradients: 0.34 | optimizer_step: 0.39
[2024-09-15 16:57:15,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1916.36 | bwd_microstep: 3452.04 | bwd_inner_microstep: 3437.22 | bwd_allreduce_microstep: 14.77 | step_microstep: 8.62
[2024-09-15 16:57:15,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3812.49 | bwd: 6855.65 | bwd_inner: 6840.79 | bwd_allreduce: 14.79 | step: 8.74
 98%|█████████▊| 458/468 [1:22:07<01:46, 10.65s/it]                                                   {'loss': 0.1303, 'learning_rate': 4.807610460030976e-08, 'epoch': 2.93}
 98%|█████████▊| 458/468 [1:22:07<01:46, 10.65s/it]dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1366
[2024-09-15 16:57:19,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1438.47 | bwd_microstep: 2552.65 | bwd_inner_microstep: 2552.63 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:57:25,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.44 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:57:25,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.18 | bwd_microstep: 3584.28 | bwd_inner_microstep: 3444.45 | bwd_allreduce_microstep: 139.77 | step_microstep: 7.98
[2024-09-15 16:57:25,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3355.62 | bwd: 6136.95 | bwd_inner: 5997.08 | bwd_allreduce: 139.80 | step: 8.03
 98%|█████████▊| 459/468 [1:22:16<01:32, 10.32s/it]                                                   {'loss': 0.1302, 'learning_rate': 3.894461050010012e-08, 'epoch': 2.93}
 98%|█████████▊| 459/468 [1:22:16<01:32, 10.32s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:57:29,550] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1493.47 | bwd_microstep: 2631.11 | bwd_inner_microstep: 2631.09 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1876
[2024-09-15 16:57:36,098] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.40 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:57:36,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1933.89 | bwd_microstep: 4583.55 | bwd_inner_microstep: 3461.77 | bwd_allreduce_microstep: 1121.71 | step_microstep: 7.37
[2024-09-15 16:57:36,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3427.34 | bwd: 7214.68 | bwd_inner: 6092.85 | bwd_allreduce: 1121.75 | step: 7.43
 98%|█████████▊| 460/468 [1:22:27<01:23, 10.44s/it]                                                   {'loss': 0.1507, 'learning_rate': 3.077314705413503e-08, 'epoch': 2.94}
 98%|█████████▊| 460/468 [1:22:27<01:23, 10.44s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:57:41,465] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1911.26 | bwd_microstep: 3424.02 | bwd_inner_microstep: 3423.99 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:57:46,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.42 | optimizer_gradients: 0.37 | optimizer_step: 0.40
[2024-09-15 16:57:46,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.20 | bwd_microstep: 3455.27 | bwd_inner_microstep: 3440.20 | bwd_allreduce_microstep: 15.03 | step_microstep: 8.41
[2024-09-15 16:57:46,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3828.44 | bwd: 6879.30 | bwd_inner: 6864.19 | bwd_allreduce: 15.05 | step: 8.48
 99%|█████████▊| 461/468 [1:22:38<01:13, 10.54s/it]                                                   {'loss': 0.2156, 'learning_rate': 2.356210727046504e-08, 'epoch': 2.95}
 99%|█████████▊| 461/468 [1:22:38<01:13, 10.54s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:57:51,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1493.49 | bwd_microstep: 2629.05 | bwd_inner_microstep: 2629.02 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1367
[2024-09-15 16:57:57,484] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.36 | optimizer_gradients: 0.42 | optimizer_step: 0.38
[2024-09-15 16:57:57,485] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1439.52 | bwd_microstep: 4983.96 | bwd_inner_microstep: 2557.41 | bwd_allreduce_microstep: 2426.49 | step_microstep: 7.65
[2024-09-15 16:57:57,485] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2932.98 | bwd: 7613.02 | bwd_inner: 5186.43 | bwd_allreduce: 2426.52 | step: 7.88
 99%|█████████▊| 462/468 [1:22:48<01:03, 10.56s/it]                                                   {'loss': 0.2267, 'learning_rate': 1.7311837965379164e-08, 'epoch': 2.95}
 99%|█████████▊| 462/468 [1:22:48<01:03, 10.56s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:58:01,628] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1488.17 | bwd_microstep: 2624.95 | bwd_inner_microstep: 2624.84 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:58:07,014] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.53 | optimizer_gradients: 0.34 | optimizer_step: 0.38
[2024-09-15 16:58:07,014] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.28 | bwd_microstep: 3437.81 | bwd_inner_microstep: 3422.66 | bwd_allreduce_microstep: 15.10 | step_microstep: 7.78
[2024-09-15 16:58:07,015] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3401.42 | bwd: 6062.79 | bwd_inner: 6047.50 | bwd_allreduce: 15.17 | step: 8.02
 99%|█████████▉| 463/468 [1:22:58<00:51, 10.25s/it]                                                   {'loss': 0.2884, 'learning_rate': 1.202263974674045e-08, 'epoch': 2.96}
 99%|█████████▉| 463/468 [1:22:58<00:51, 10.25s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:58:12,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.49 | bwd_microstep: 3437.11 | bwd_inner_microstep: 3437.09 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.12
dynamic ViT batch size: 46, images per sample: 5.75, dynamic token length: 1878
[2024-09-15 16:58:17,882] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.48 | optimizer_gradients: 0.34 | optimizer_step: 0.39
[2024-09-15 16:58:17,882] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1949.19 | bwd_microstep: 3498.57 | bwd_inner_microstep: 3483.62 | bwd_allreduce_microstep: 14.90 | step_microstep: 7.61
[2024-09-15 16:58:17,882] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3864.66 | bwd: 6935.70 | bwd_inner: 6920.71 | bwd_allreduce: 14.92 | step: 7.77
 99%|█████████▉| 464/468 [1:23:09<00:41, 10.44s/it]                                                   {'loss': 0.163, 'learning_rate': 7.694766999513104e-09, 'epoch': 2.96}
 99%|█████████▉| 464/468 [1:23:09<00:41, 10.44s/it]dynamic ViT batch size: 39, images per sample: 4.875, dynamic token length: 1369
[2024-09-15 16:58:21,958] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1462.31 | bwd_microstep: 2582.87 | bwd_inner_microstep: 2582.84 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:58:28,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:58:28,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.02 | bwd_microstep: 4795.97 | bwd_inner_microstep: 3445.55 | bwd_allreduce_microstep: 1350.35 | step_microstep: 7.20
[2024-09-15 16:58:28,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3379.30 | bwd: 7378.85 | bwd_inner: 6028.39 | bwd_allreduce: 1350.39 | step: 7.25
 99%|█████████▉| 465/468 [1:23:20<00:31, 10.55s/it]                                                   {'loss': 0.1972, 'learning_rate': 4.328427873541152e-09, 'epoch': 2.97}
 99%|█████████▉| 465/468 [1:23:20<00:31, 10.55s/it]dynamic ViT batch size: 41, images per sample: 5.125, dynamic token length: 1878
[2024-09-15 16:58:34,068] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1905.90 | bwd_microstep: 3428.86 | bwd_inner_microstep: 3428.83 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.31
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:58:39,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.53 | optimizer_gradients: 0.35 | optimizer_step: 0.39
[2024-09-15 16:58:39,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.20 | bwd_microstep: 3458.78 | bwd_inner_microstep: 3443.79 | bwd_allreduce_microstep: 14.94 | step_microstep: 8.57
[2024-09-15 16:58:39,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3821.08 | bwd: 6887.65 | bwd_inner: 6872.62 | bwd_allreduce: 14.96 | step: 8.89
100%|█████████▉| 466/468 [1:23:30<00:21, 10.62s/it]                                                   {'loss': 0.1847, 'learning_rate': 1.9237842735275737e-09, 'epoch': 2.98}
100%|█████████▉| 466/468 [1:23:30<00:21, 10.62s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:58:43,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1493.78 | bwd_microstep: 2632.13 | bwd_inner_microstep: 2632.11 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:58:50,308] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.43 | optimizer_gradients: 0.38 | optimizer_step: 0.39
[2024-09-15 16:58:50,308] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.65 | bwd_microstep: 4719.08 | bwd_inner_microstep: 3445.05 | bwd_allreduce_microstep: 1273.97 | step_microstep: 8.04
[2024-09-15 16:58:50,308] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3412.41 | bwd: 7351.23 | bwd_inner: 6077.15 | bwd_allreduce: 1274.01 | step: 8.27
100%|█████████▉| 467/468 [1:23:41<00:10, 10.68s/it]                                                   {'loss': 0.2029, 'learning_rate': 4.809518512494116e-10, 'epoch': 2.98}
100%|█████████▉| 467/468 [1:23:41<00:10, 10.68s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1877
[2024-09-15 16:58:55,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1931.11 | bwd_microstep: 3462.44 | bwd_inner_microstep: 3462.41 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.18
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:59:00,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.34 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:59:00,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.85 | bwd_microstep: 3700.56 | bwd_inner_microstep: 2593.14 | bwd_allreduce_microstep: 1107.36 | step_microstep: 7.98
[2024-09-15 16:59:00,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3399.93 | bwd: 7163.01 | bwd_inner: 6055.55 | bwd_allreduce: 1107.39 | step: 8.19
100%|██████████| 468/468 [1:23:52<00:00, 10.67s/it]                                                   {'loss': 0.1615, 'learning_rate': 0.0, 'epoch': 2.99}
100%|██████████| 468/468 [1:23:52<00:00, 10.67s/it]petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
[INFO|trainer.py:1962] 2024-09-15 16:59:01,963 >> 

Training completed. Do not forget to share your model on huggingface.co/models =)


                                                   {'train_runtime': 5033.3609, 'train_samples_per_second': 5.958, 'train_steps_per_second': 0.093, 'train_loss': 0.3061195729762061, 'epoch': 2.99}
100%|██████████| 468/468 [1:23:53<00:00, 10.67s/it]100%|██████████| 468/468 [1:23:53<00:00, 10.76s/it]
[INFO|trainer.py:2936] 2024-09-15 16:59:09,984 >> Saving model checkpoint to work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora
[INFO|configuration_utils.py:473] 2024-09-15 16:59:09,985 >> Configuration saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/config.json
[INFO|configuration_utils.py:594] 2024-09-15 16:59:09,986 >> Configuration saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/generation_config.json
[INFO|modeling_utils.py:2501] 2024-09-15 16:59:25,737 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2433] 2024-09-15 16:59:25,738 >> tokenizer config file saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tokenizer_config.json
[INFO|tokenization_utils_base.py:2442] 2024-09-15 16:59:25,738 >> Special tokens file saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/special_tokens_map.json
[INFO|tokenization_utils_base.py:2493] 2024-09-15 16:59:25,739 >> added tokens file saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/added_tokens.json
***** train metrics *****
  epoch                    =       2.99
  train_loss               =     0.3061
  train_runtime            = 1:23:53.36
  train_samples            =       9996
  train_samples_per_second =      5.958
  train_steps_per_second   =      0.093