[2024-09-15 15:26:38,023] torch.distributed.run: [WARNING] *****************************************
[2024-09-15 15:26:38,023] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-09-15 15:26:38,023] torch.distributed.run: [WARNING] *****************************************
[2024-09-15 15:26:40,361] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:26:40,380] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:26:40,383] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:26:40,413] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
(the three petrel_client lines above are printed verbatim once per rank; the three repeats are elided)
[2024-09-15 15:26:42,413] [INFO] [comm.py:637:init_distributed] cdb=None
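torchrun's warning above means it defaulted OMP_NUM_THREADS to 1 for every worker. A common starting point when retuning it is to split the host's physical cores evenly across the local GPU workers; the helper below is a hypothetical illustration of that rule of thumb, not code from this run.

```python
def omp_threads_per_worker(total_cores: int, nproc_per_node: int) -> int:
    """Evenly split physical CPU cores across local workers, never below 1."""
    return max(1, total_cores // nproc_per_node)

# e.g. a hypothetical 32-core host feeding the 4 GPU workers seen in this log:
print(omp_threads_per_worker(32, 4))  # 8
```

You would then export the result as OMP_NUM_THREADS before launching and profile, since the optimum is workload dependent.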
[2024-09-15 15:26:42,447] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-15 15:26:42,482] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-15 15:26:42,518] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-15 15:26:42,518] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
09/15/2024 15:26:42 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, 16-bits training: False
09/15/2024 15:26:42 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, 16-bits training: False
09/15/2024 15:26:42 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: False
09/15/2024 15:26:42 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False
09/15/2024 15:26:42 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=4,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=/workspace/InternVL/internvl_chat/zero_stage1_config.json,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=True,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=4e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/runs/Sep15_15-26-42_1fe2e16e3880,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1.0,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_torch,
optim_args=None,
output_dir=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=4,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=200,
save_strategy=steps,
save_total_limit=1,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.03,
warmup_steps=0,
weight_decay=0.01,
)
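From the arguments above (per_device_train_batch_size=4, gradient_accumulation_steps=1) and the four data-parallel ranks reported earlier, the effective global batch size can be worked out directly; this is a small illustrative check, not code from the training script.

```python
def global_batch_size(per_device: int, grad_accum: int, world_size: int) -> int:
    """Samples contributing to one optimizer step across all data-parallel ranks."""
    return per_device * grad_accum * world_size

# values from this log: 4 per GPU, no accumulation, 4 ranks
print(global_batch_size(4, 1, 4))  # 16
```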
09/15/2024 15:26:42 - INFO - __main__ - Loading Tokenizer: /workspace/pretrained/InternVL2-8B
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:26:42,601 >> loading file ./tokenizer.model
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:26:42,602 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:26:42,602 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:26:42,602 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:26:42,602 >> loading file tokenizer.json
[WARNING|logging.py:314] 2024-09-15 15:26:42,725 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:26:42,742 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:26:42,769 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:26:42,796 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
09/15/2024 15:26:42 - INFO - __main__ - Loading InternVLChatModel...
[INFO|configuration_utils.py:727] 2024-09-15 15:26:42,947 >> loading configuration file /workspace/pretrained/InternVL2-8B/config.json
[INFO|configuration_utils.py:792] 2024-09-15 15:26:42,949 >> Model config InternVLChatConfig {
  "_commit_hash": null,
  "architectures": ["InternVLChatModel"],
  "auto_map": {
    "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
    "AutoModel": "modeling_internvl_chat.InternVLChatModel",
    "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel"
  },
  "downsample_ratio": 0.5,
  "dynamic_image_size": true,
  "force_image_size": 448,
  "llm_config": {
    "_name_or_path": "internlm/internlm2_5-7b-chat",
    "add_cross_attention": false,
    "architectures": ["InternLM2ForCausalLM"],
    "attn_implementation": "flash_attention_2",
    "auto_map": {
      "AutoConfig": "configuration_internlm2.InternLM2Config",
      "AutoModel": "modeling_internlm2.InternLM2ForCausalLM",
      "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM"
    },
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bias": false,
    "bos_token_id": 1,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 2,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "silu",
    "hidden_size": 4096,
    "id2label": {"0": "LABEL_0", "1": "LABEL_1"},
    "initializer_range": 0.02,
    "intermediate_size": 14336,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {"LABEL_0": 0, "LABEL_1": 1},
    "length_penalty": 1.0,
    "max_length": 20,
    "max_position_embeddings": 32768,
    "min_length": 0,
    "model_type": "internlm2",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 32,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_hidden_layers": 32,
    "num_key_value_heads": 8,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": 2,
    "prefix": null,
    "pretraining_tp": 1,
    "problem_type": null,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "rms_norm_eps": 1e-05,
    "rope_scaling": {"factor": 2.0, "type": "dynamic"},
    "rope_theta": 1000000,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": false,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.37.2",
    "typical_p": 1.0,
    "use_bfloat16": true,
    "use_cache": true,
    "vocab_size": 92553
  },
  "max_dynamic_patch": 12,
  "min_dynamic_patch": 1,
  "model_type": "internvl_chat",
  "pad2square": false,
  "ps_version": "v2",
  "select_layer": -1,
  "template": "internlm2-chat",
  "torch_dtype": "bfloat16",
  "transformers_version": null,
  "use_backbone_lora": 0,
  "use_llm_lora": 0,
  "use_thumbnail": true,
  "vision_config": {
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": ["InternVisionModel"],
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "drop_path_rate": 0.0,
    "dropout": 0.0,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_size": 1024,
    "id2label": {"0": "LABEL_0", "1": "LABEL_1"},
    "image_size": 448,
    "initializer_factor": 1.0,
    "initializer_range": 0.02,
    "intermediate_size": 4096,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {"LABEL_0": 0, "LABEL_1": 1},
    "layer_norm_eps": 1e-06,
    "length_penalty": 1.0,
    "max_length": 20,
    "min_length": 0,
    "model_type": "intern_vit_6b",
    "no_repeat_ngram_size": 0,
    "norm_type": "layer_norm",
    "num_attention_heads": 16,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_channels": 3,
    "num_hidden_layers": 24,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "patch_size": 14,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "qk_normalization": false,
    "qkv_bias": true,
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.37.2",
    "typical_p": 1.0,
    "use_bfloat16": true,
    "use_flash_attn": true
  }
}
09/15/2024 15:26:42 - INFO - __main__ - Using flash_attention_2 for InternLM
[INFO|modeling_utils.py:3473] 2024-09-15 15:26:42,951 >> loading weights file /workspace/pretrained/InternVL2-8B/model.safetensors.index.json
[INFO|modeling_utils.py:1426] 2024-09-15 15:26:42,951 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:826] 2024-09-15 15:26:42,952 >> Generate config GenerationConfig {}
[INFO|configuration_utils.py:826] 2024-09-15 15:26:42,990 >> Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 2
}
Loading checkpoint shards: 4/4 (tqdm progress output elided)
All model checkpoint weights were used when initializing InternVLChatModel.
[INFO|modeling_utils.py:4358] 2024-09-15 15:26:54,253 >> All the weights of InternVLChatModel were initialized from the model checkpoint at /workspace/pretrained/InternVL2-8B.
If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training.
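The vision tower settings above (image_size 448, patch_size 14) together with the top-level downsample_ratio of 0.5 account for the 256 visual tokens per tile reported further down in this log: 448/14 = 32 patches per side, and the pixel-shuffle downsample halves each side to 16. A small sketch of that arithmetic; the function name is illustrative, not from the codebase.

```python
def tokens_per_tile(image_size: int, patch_size: int, downsample_ratio: float) -> int:
    patches_per_side = image_size // patch_size       # 448 // 14 = 32
    side = int(patches_per_side * downsample_ratio)   # pixel shuffle: 32 * 0.5 = 16
    return side * side                                # 16 * 16 = 256

print(tokens_per_tile(448, 14, 0.5))  # 256
```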
[INFO|configuration_utils.py:779] 2024-09-15 15:26:54,259 >> loading configuration file /workspace/pretrained/InternVL2-8B/generation_config.json
[INFO|configuration_utils.py:826] 2024-09-15 15:26:54,259 >> Generate config GenerationConfig {
  "eos_token_id": [92542, 92543]
}
09/15/2024 15:26:54 - INFO - __main__ - Finished
09/15/2024 15:26:54 - INFO - __main__ - model.config.force_image_size: 448
09/15/2024 15:26:54 - INFO - __main__ - data_args.force_image_size: 448
09/15/2024 15:26:54 - INFO - __main__ - model.config.vision_config.image_size: 448
09/15/2024 15:26:54 - INFO - __main__ - [Dataset] num_image_token: 256
09/15/2024 15:26:54 - INFO - __main__ - [Dataset] dynamic_image_size: True
09/15/2024 15:26:54 - INFO - __main__ - [Dataset] use_thumbnail: True
09/15/2024 15:26:54 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6
09/15/2024 15:26:54 - INFO - __main__ - Formatting inputs...Skip in lazy mode
09/15/2024 15:26:54 - INFO - __main__ - Add dataset: amazon-product-dataset with length: 9996
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
(the trainable-params line is printed verbatim once per rank; repeats elided)
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wqkv.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wqkv.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wo.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wo.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w1.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w1.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w3.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w3.lora_B.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w2.lora_A.default.weight
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w2.lora_B.default.weight
(the same ten LoRA tensors, attention.wqkv, attention.wo, feed_forward.w1, feed_forward.w3 and feed_forward.w2, each with lora_A and lora_B, are logged for layers 1 through 16; the log excerpt is truncated there, mid-record)
INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w1.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w3.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w3.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w2.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w2.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wqkv.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wqkv.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wo.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wo.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w1.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w1.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w3.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w3.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w2.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w2.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wqkv.lora_A.default.weight 09/15/2024 
15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wqkv.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wo.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wo.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w1.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w1.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w3.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w3.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w2.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w2.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wqkv.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wqkv.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wo.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wo.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w1.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w1.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w3.lora_A.default.weight 
09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w3.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w2.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w2.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wqkv.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wqkv.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wo.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wo.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w1.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w1.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w3.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w3.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w2.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w2.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wqkv.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wqkv.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - 
language_model.base_model.model.model.layers.21.attention.wo.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wo.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w1.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w1.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w3.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w3.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w2.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w2.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wqkv.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wqkv.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wo.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wo.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w1.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w1.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w3.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w3.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ 
- language_model.base_model.model.model.layers.22.feed_forward.w2.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w2.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wqkv.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wqkv.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wo.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wo.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w1.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w1.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w3.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w3.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w2.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w2.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wqkv.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wqkv.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wo.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wo.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ 
- language_model.base_model.model.model.layers.24.feed_forward.w1.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w1.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w3.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w3.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w2.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w2.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wqkv.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wqkv.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wo.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wo.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w1.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w1.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w3.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w3.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w2.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w2.lora_B.default.weight 09/15/2024 15:26:55 - INFO - 
__main__ - language_model.base_model.model.model.layers.26.attention.wqkv.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wqkv.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wo.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wo.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w1.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w1.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w3.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w3.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w2.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w2.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wqkv.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wqkv.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wo.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wo.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w1.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w1.lora_B.default.weight 09/15/2024 15:26:55 - INFO - 
__main__ - language_model.base_model.model.model.layers.27.feed_forward.w3.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w3.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w2.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w2.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wqkv.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wqkv.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wo.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wo.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w1.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w1.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w3.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w3.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w2.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w2.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wqkv.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wqkv.lora_B.default.weight 09/15/2024 15:26:55 - 
INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wo.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wo.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w1.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w1.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w3.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w3.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w2.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w2.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wqkv.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wqkv.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wo.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wo.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w1.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w1.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w3.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w3.lora_B.default.weight 09/15/2024 15:26:55 
- INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w2.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w2.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wqkv.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wqkv.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wo.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wo.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w1.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w1.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w3.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w3.lora_B.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w2.lora_A.default.weight 09/15/2024 15:26:55 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w2.lora_B.default.weight [INFO|trainer.py:571] 2024-09-15 15:26:55,235 >> Using auto half precision backend [2024-09-15 15:26:55,443] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.13.5, git-hash=unknown, git-branch=unknown [2024-09-15 15:26:58,856] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root... Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root... 
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu118/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.08728218078613281 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10238933563232422 seconds
Loading extension module fused_adam...
Loading extension module fused_adam...
Time to load fused_adam op: 0.1020348072052002 seconds
Time to load fused_adam op: 0.10272526741027832 seconds
[2024-09-15 15:26:59,612] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2024-09-15 15:26:59,612] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-09-15 15:26:59,651] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2024-09-15 15:26:59,651] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=
[2024-09-15 15:26:59,651] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 1 optimizer
[2024-09-15 15:26:59,651] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 1000000000
[2024-09-15 15:26:59,651] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 1000000000
[2024-09-15 15:26:59,651] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False
[2024-09-15 15:26:59,652] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False
[2024-09-15 15:26:59,955] [INFO] [utils.py:800:see_memory_usage] Before initializing optimizer states
[2024-09-15 15:26:59,956] [INFO] [utils.py:801:see_memory_usage] MA 15.65 GB Max_MA 15.67 GB CA 15.85 GB Max_CA 16 GB
[2024-09-15 15:26:59,957] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 87.25 GB, percent = 9.2%
[2024-09-15 15:27:00,146] [INFO] [utils.py:800:see_memory_usage] After initializing optimizer states
[2024-09-15 15:27:00,147] [INFO] [utils.py:801:see_memory_usage] MA 15.65 GB Max_MA 15.68 GB CA 15.89 GB Max_CA 16 GB
[2024-09-15 15:27:00,148] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 87.35 GB, percent = 9.2%
[2024-09-15 15:27:00,148] [INFO] [stage_1_and_2.py:539:__init__] optimizer state initialized
[2024-09-15 15:27:00,327] [INFO] [utils.py:800:see_memory_usage] After initializing ZeRO optimizer
[2024-09-15 15:27:00,327] [INFO] [utils.py:801:see_memory_usage] MA 15.65 GB Max_MA 15.65 GB CA 15.89 GB Max_CA 16 GB
[2024-09-15 15:27:00,328] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 87.37 GB, percent = 9.3%
[2024-09-15 15:27:00,330] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
[2024-09-15 15:27:00,330] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client callable to create LR scheduler
[2024-09-15 15:27:00,330] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler =
[2024-09-15 15:27:00,330] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[[0.9, 0.999]]
[2024-09-15 15:27:00,335] [INFO] [config.py:996:print] DeepSpeedEngine configuration:
[2024-09-15 15:27:00,335] [INFO] [config.py:1000:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false }
[2024-09-15 15:27:00,335] [INFO] [config.py:1000:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-09-15 15:27:00,335] [INFO] [config.py:1000:print] amp_enabled .................. False
[2024-09-15 15:27:00,335] [INFO] [config.py:1000:print] amp_params ................... False
[2024-09-15 15:27:00,335] [INFO] [config.py:1000:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 }
[2024-09-15 15:27:00,335] [INFO] [config.py:1000:print] bfloat16_enabled ............. True
[2024-09-15 15:27:00,335] [INFO] [config.py:1000:print] bfloat16_immediate_grad_update False
[2024-09-15 15:27:00,335] [INFO] [config.py:1000:print] checkpoint_parallel_write_pipeline False
[2024-09-15 15:27:00,335] [INFO] [config.py:1000:print] checkpoint_tag_validation_enabled True
[2024-09-15 15:27:00,335] [INFO] [config.py:1000:print] checkpoint_tag_validation_fail False
[2024-09-15 15:27:00,335] [INFO] [config.py:1000:print] comms_config .................
[2024-09-15 15:27:00,335] [INFO] [config.py:1000:print] communication_data_type ...... None
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] compile_config ............... enabled=False backend='inductor' kwargs={}
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] curriculum_enabled_legacy .... False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] curriculum_params_legacy ..... False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] data_efficiency_enabled ...... False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] dataloader_drop_last ......... False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] disable_allgather ............ False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] dump_state ................... False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] dynamic_loss_scale_args ...... None
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] eigenvalue_enabled ........... False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] eigenvalue_gas_boundary_resolution 1
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] eigenvalue_layer_name ........ bert.encoder.layer
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] eigenvalue_layer_num ......... 0
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] eigenvalue_max_iter .......... 100
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] eigenvalue_stability ......... 1e-06
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] eigenvalue_tol ............... 0.01
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] eigenvalue_verbose ........... False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] elasticity_enabled ........... False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null }
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] fp16_auto_cast ............... None
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] fp16_enabled ................. False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] fp16_master_weights_and_gradients False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] global_rank .................. 0
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] grad_accum_dtype ............. None
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] gradient_accumulation_steps .. 1
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] gradient_clipping ............ 1.0
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] gradient_predivide_factor .... 1.0
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] graph_harvesting ............. False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] initial_dynamic_scale ........ 1
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] load_universal_checkpoint .... False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] loss_scale ................... 1.0
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] memory_breakdown ............. False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] mics_hierarchial_params_gather False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] mics_shard_size .............. -1
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null }
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] optimizer_legacy_fusion ...... False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] optimizer_name ............... adamw
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] optimizer_params ............. {'lr': 4e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.01}
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] pld_enabled .................. False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] pld_params ................... False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] prescale_gradients ........... False
[2024-09-15 15:27:00,336] [INFO] [config.py:1000:print] scheduler_name ............... None
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print] scheduler_params ............. None
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print] seq_parallel_communication_data_type torch.float32
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print] sparse_attention ............. None
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print] sparse_gradients_enabled ..... False
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print] steps_per_print .............. inf
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print] train_batch_size ............. 16
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print] train_micro_batch_size_per_gpu 4
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print] use_data_before_expert_parallel_ False
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print] use_node_local_storage ....... False
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print] wall_clock_breakdown ......... True
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print] weight_quantization_config ... None
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print] world_size ................... 4
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print] zero_allow_untested_optimizer False
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print] zero_config .................. stage=1 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print] zero_enabled ................. True
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print] zero_force_ds_cpu_optimizer .. True
[2024-09-15 15:27:00,337] [INFO] [config.py:1000:print] zero_optimization_stage ......
1 [2024-09-15 15:27:00,337] [INFO] [config.py:986:print_user_config] json = { "zero_optimization": { "stage": 1, "allgather_partitions": true, "allgather_bucket_size": 1.000000e+09, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 1.000000e+09, "contiguous_gradients": true }, "fp16": { "enabled": false, "auto_cast": true, "loss_scale": 0, "initial_scale_power": 32, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": true }, "optimizer": { "type": "AdamW", "params": { "lr": 4e-05, "betas": [0.9, 0.999], "eps": 1e-08, "weight_decay": 0.01 } }, "gradient_accumulation_steps": 1, "gradient_clipping": 1.0, "steps_per_print": inf, "train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "wall_clock_breakdown": true } [INFO|trainer.py:1721] 2024-09-15 15:27:00,337 >> ***** Running training ***** [INFO|trainer.py:1722] 2024-09-15 15:27:00,337 >> Num examples = 9,996 [INFO|trainer.py:1723] 2024-09-15 15:27:00,337 >> Num Epochs = 3 [INFO|trainer.py:1724] 2024-09-15 15:27:00,337 >> Instantaneous batch size per device = 4 [INFO|trainer.py:1727] 2024-09-15 15:27:00,337 >> Total train batch size (w. 
parallel, distributed & accumulation) = 16
[INFO|trainer.py:1728] 2024-09-15 15:27:00,337 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1729] 2024-09-15 15:27:00,337 >> Total optimization steps = 1,875
[INFO|trainer.py:1730] 2024-09-15 15:27:00,343 >> Number of trainable parameters = 37,748,736
0%| | 0/1875 [00:00<?, ?it/s]
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=4e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/runs/Sep15_15-28-19_1fe2e16e3880,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1.0,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_torch,
optim_args=None,
output_dir=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=200,
save_strategy=steps,
save_total_limit=1,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.03,
warmup_steps=0,
weight_decay=0.01,
)
09/15/2024 15:28:19 - INFO - __main__ - Loading Tokenizer: /workspace/pretrained/InternVL2-8B
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:28:19,533 >> loading file ./tokenizer.model
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:28:19,533 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:28:19,533 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:28:19,533 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:28:19,533 >> loading file tokenizer.json
09/15/2024 15:28:19 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: False
09/15/2024 15:28:19 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: False
[WARNING|logging.py:314] 2024-09-15 15:28:19,691 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:28:19,711 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:28:19,722 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:28:19,733 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
09/15/2024 15:28:19 - INFO - __main__ - Loading InternVLChatModel...
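The totals the trainer reports above fit together arithmetically. A standalone sketch (not part of the training script) reproducing them from the values echoed in this log:

```python
import math

# Values echoed earlier in this log.
num_examples = 9_996        # "Num examples = 9,996"
num_epochs = 3              # "Num Epochs = 3"
micro_batch_per_gpu = 4     # "Instantaneous batch size per device = 4"
grad_accum_steps = 1        # "Gradient Accumulation steps = 1"
world_size = 4              # four ranks (cuda:0..cuda:3) in this run

# DeepSpeed's invariant: global batch = micro-batch x accumulation x world size.
train_batch_size = micro_batch_per_gpu * grad_accum_steps * world_size
print(train_batch_size)     # 16, matching "Total train batch size ... = 16"

# 9,996 examples / global batch of 16 -> 625 steps per epoch, times 3 epochs.
steps_per_epoch = math.ceil(num_examples / train_batch_size)
total_steps = steps_per_epoch * num_epochs
print(total_steps)          # 1875, matching "Total optimization steps = 1,875"
```

Note that `per_device_train_batch_size=8` in the second TrainingArguments dump differs from the effective micro-batch of 4; DeepSpeed's `train_micro_batch_size_per_gpu` from the JSON config is what actually governs the run.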
[INFO|configuration_utils.py:727] 2024-09-15 15:28:19,861 >> loading configuration file /workspace/pretrained/InternVL2-8B/config.json
[INFO|configuration_utils.py:792] 2024-09-15 15:28:19,862 >> Model config InternVLChatConfig {
  "_commit_hash": null,
  "architectures": [
    "InternVLChatModel"
  ],
  "auto_map": {
    "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
    "AutoModel": "modeling_internvl_chat.InternVLChatModel",
    "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel"
  },
  "downsample_ratio": 0.5,
  "dynamic_image_size": true,
  "force_image_size": 448,
  "llm_config": {
    "_name_or_path": "internlm/internlm2_5-7b-chat",
    "add_cross_attention": false,
    "architectures": [
      "InternLM2ForCausalLM"
    ],
    "attn_implementation": "flash_attention_2",
    "auto_map": {
      "AutoConfig": "configuration_internlm2.InternLM2Config",
      "AutoModel": "modeling_internlm2.InternLM2ForCausalLM",
      "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM"
    },
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bias": false,
    "bos_token_id": 1,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 2,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "silu",
    "hidden_size": 4096,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_range": 0.02,
    "intermediate_size": 14336,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "length_penalty": 1.0,
    "max_length": 20,
    "max_position_embeddings": 32768,
    "min_length": 0,
    "model_type": "internlm2",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 32,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_hidden_layers": 32,
    "num_key_value_heads": 8,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": 2,
    "prefix": null,
    "pretraining_tp": 1,
    "problem_type": null,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "rms_norm_eps": 1e-05,
    "rope_scaling": {
      "factor": 2.0,
      "type": "dynamic"
    },
    "rope_theta": 1000000,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": false,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.37.2",
    "typical_p": 1.0,
    "use_bfloat16": true,
    "use_cache": true,
    "vocab_size": 92553
  },
  "max_dynamic_patch": 12,
  "min_dynamic_patch": 1,
  "model_type": "internvl_chat",
  "pad2square": false,
  "ps_version": "v2",
  "select_layer": -1,
  "template": "internlm2-chat",
  "torch_dtype": "bfloat16",
  "transformers_version": null,
  "use_backbone_lora": 0,
  "use_llm_lora": 0,
  "use_thumbnail": true,
  "vision_config": {
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": [
      "InternVisionModel"
    ],
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "drop_path_rate": 0.0,
    "dropout": 0.0,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_size": 1024,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "image_size": 448,
    "initializer_factor": 1.0,
    "initializer_range": 0.02,
    "intermediate_size": 4096,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_eps": 1e-06,
    "length_penalty": 1.0,
    "max_length": 20,
    "min_length": 0,
    "model_type": "intern_vit_6b",
    "no_repeat_ngram_size": 0,
    "norm_type": "layer_norm",
    "num_attention_heads": 16,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_channels": 3,
    "num_hidden_layers": 24,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "patch_size": 14,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "qk_normalization": false,
    "qkv_bias": true,
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.37.2",
    "typical_p": 1.0,
    "use_bfloat16": true,
    "use_flash_attn": true
  }
}
09/15/2024 15:28:19 - INFO - __main__ - Using flash_attention_2 for InternLM
[INFO|modeling_utils.py:3473] 2024-09-15 15:28:19,864 >> loading weights file /workspace/pretrained/InternVL2-8B/model.safetensors.index.json
[INFO|modeling_utils.py:1426] 2024-09-15 15:28:19,864 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:826] 2024-09-15 15:28:19,865 >> Generate config GenerationConfig {}
[INFO|configuration_utils.py:826] 2024-09-15 15:28:19,905 >> Generate config GenerationConfig { "bos_token_id": 1, "eos_token_id": 2, "pad_token_id": 2 }
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]
>> All model checkpoint weights were used when initializing InternVLChatModel.
[INFO|modeling_utils.py:4358] 2024-09-15 15:28:29,699 >> All the weights of InternVLChatModel were initialized from the model checkpoint at /workspace/pretrained/InternVL2-8B.
If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training.
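The vision config above (448-px input, 14-px patches) combined with the chat config's `downsample_ratio` of 0.5 determines the 256 image tokens per tile that the dataset setup reports later in this log. A quick standalone check:

```python
# Values from the config dump above.
image_size = 448        # vision_config.image_size
patch_size = 14         # vision_config.patch_size
downsample_ratio = 0.5  # InternVLChatConfig.downsample_ratio (pixel-shuffle)

patches_per_side = image_size // patch_size                  # 448 / 14 = 32
tokens_per_side = int(patches_per_side * downsample_ratio)   # 32 * 0.5 = 16
num_image_token = tokens_per_side ** 2
print(num_image_token)  # 256, matching "[Dataset] num_image_token: 256"
```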
[INFO|configuration_utils.py:779] 2024-09-15 15:28:29,704 >> loading configuration file /workspace/pretrained/InternVL2-8B/generation_config.json
[INFO|configuration_utils.py:826] 2024-09-15 15:28:29,705 >> Generate config GenerationConfig {
  "eos_token_id": [
    92542,
    92543
  ]
}
09/15/2024 15:28:29 - INFO - __main__ - Finished
09/15/2024 15:28:29 - INFO - __main__ - model.config.force_image_size: 448
09/15/2024 15:28:29 - INFO - __main__ - data_args.force_image_size: 448
09/15/2024 15:28:29 - INFO - __main__ - model.config.vision_config.image_size: 448
09/15/2024 15:28:29 - INFO - __main__ - [Dataset] num_image_token: 256
09/15/2024 15:28:29 - INFO - __main__ - [Dataset] dynamic_image_size: True
09/15/2024 15:28:29 - INFO - __main__ - [Dataset] use_thumbnail: True
09/15/2024 15:28:29 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6
09/15/2024 15:28:29 - INFO - __main__ - Formatting inputs...Skip in lazy mode
Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00, 1.89s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00, 2.16s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00, 1.90s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00, 2.15s/it]
Loading checkpoint shards: 75%|███████▌ | 3/4 [00:08<00:02, 2.84s/it]
09/15/2024 15:28:30 - INFO - __main__ - Add dataset: amazon-product-dataset with length: 9996
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO -
__main__ - language_model.base_model.model.model.layers.0.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w1.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w1.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w3.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w3.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w2.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w2.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wqkv.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wqkv.lora_B.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wo.lora_A.default.weight
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wo.lora_B.default.weight
09/15/2024 15:28:30 - INFO
- __main__ - language_model.base_model.model.model.layers.16.feed_forward.w1.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w1.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w3.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w3.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w2.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w2.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wqkv.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wqkv.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wo.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wo.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w1.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w1.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w3.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w3.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w2.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w2.lora_B.default.weight 09/15/2024 
15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wqkv.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wqkv.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wo.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wo.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w1.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w1.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w3.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w3.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w2.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w2.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wqkv.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wqkv.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wo.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wo.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w1.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w1.lora_B.default.weight 09/15/2024 
15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w3.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w3.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w2.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w2.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wqkv.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wqkv.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wo.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wo.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w1.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w1.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w3.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w3.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w2.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w2.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wqkv.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wqkv.lora_B.default.weight 
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wo.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wo.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w1.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w1.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w3.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w3.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w2.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w2.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wqkv.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wqkv.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wo.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wo.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w1.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w1.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w3.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - 
language_model.base_model.model.model.layers.22.feed_forward.w3.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w2.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w2.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wqkv.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wqkv.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wo.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wo.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w1.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w1.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w3.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w3.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w2.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w2.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wqkv.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wqkv.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wo.lora_A.default.weight 09/15/2024 15:28:30 - INFO - 
__main__ - language_model.base_model.model.model.layers.24.attention.wo.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w1.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w1.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w3.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w3.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w2.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w2.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wqkv.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wqkv.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wo.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wo.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w1.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w1.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w3.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w3.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w2.lora_A.default.weight 09/15/2024 15:28:30 - 
INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w2.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wqkv.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wqkv.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wo.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wo.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w1.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w1.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w3.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w3.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w2.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w2.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wqkv.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wqkv.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wo.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wo.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w1.lora_A.default.weight 09/15/2024 15:28:30 - 
INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w1.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w3.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w3.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w2.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w2.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wqkv.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wqkv.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wo.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wo.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w1.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w1.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w3.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w3.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w2.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w2.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wqkv.lora_A.default.weight 09/15/2024 
15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wqkv.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wo.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wo.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w1.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w1.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w3.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w3.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w2.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w2.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wqkv.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wqkv.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wo.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wo.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w1.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w1.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w3.lora_A.default.weight 
09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w3.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w2.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w2.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wqkv.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wqkv.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wo.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wo.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w1.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w1.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w3.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w3.lora_B.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w2.lora_A.default.weight 09/15/2024 15:28:30 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w2.lora_B.default.weight [INFO|trainer.py:571] 2024-09-15 15:28:30,663 >> Using auto half precision backend Traceback (most recent call last): File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 847, in main() File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 832, in main train_result = 
trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1580, in _inner_training_loop
    num_update_steps_per_epoch = len_dataloader // args.gradient_accumulation_steps
ZeroDivisionError: integer division or modulo by zero
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00,  2.06s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00,  2.35s/it]
[... the same ZeroDivisionError traceback and "trainable params" line repeat verbatim for the remaining ranks ...]
[2024-09-15 15:28:35,003] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 8326) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time       : 2024-09-15_15:28:35
  host       : 1fe2e16e3880
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 8327)
  error_file :
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time       : 2024-09-15_15:28:35
  host       : 1fe2e16e3880
  rank       : 2 (local_rank: 2)
  exitcode   : 1 (pid: 8328)
  error_file :
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time       : 2024-09-15_15:28:35
  host       : 1fe2e16e3880
  rank       : 3 (local_rank: 3)
  exitcode   : 1 (pid: 8329)
  error_file :
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2024-09-15_15:28:35
  host       : 1fe2e16e3880
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 8326)
  error_file :
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[2024-09-15 15:29:04,013] torch.distributed.run: [WARNING]
[2024-09-15 15:29:04,013] torch.distributed.run: [WARNING] *****************************************
[2024-09-15 15:29:04,013] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-09-15 15:29:04,013] torch.distributed.run: [WARNING] *****************************************
[2024-09-15 15:29:06,305] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:29:06,333] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:29:06,351] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:29:06,402] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
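Every rank died with the same exception: transformers' Trainer evaluates num_update_steps_per_epoch = len_dataloader // args.gradient_accumulation_steps (trainer.py, line 1580 in this install), and the TrainingArguments dump in this log shows gradient_accumulation_steps=0, so the division raises before the first step. A minimal sketch of the failing computation; the validating helper below is hypothetical (transformers does not guard this at that line):

```python
# Sketch of the division that raises ZeroDivisionError in this log.
# The explicit check is a hypothetical guard, not transformers' own code.

def num_update_steps_per_epoch(len_dataloader: int, gradient_accumulation_steps: int) -> int:
    if gradient_accumulation_steps < 1:
        # gradient_accumulation_steps=0 (as in the TrainingArguments dump)
        # would otherwise hit "integer division or modulo by zero" here.
        raise ValueError(
            f"gradient_accumulation_steps must be >= 1, got {gradient_accumulation_steps}"
        )
    return len_dataloader // gradient_accumulation_steps
```

Setting gradient_accumulation_steps back to 1 or higher in the launch script avoids the crash; the relaunch captured below still carries the zero value, so it fails the same way.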
[2024-09-15 15:29:08,338] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-15 15:29:08,338] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-09-15 15:29:08,385] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-15 15:29:08,394] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-15 15:29:08,418] [INFO] [comm.py:637:init_distributed] cdb=None
09/15/2024 15:29:08 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False
09/15/2024 15:29:08 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=4,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=/workspace/InternVL/internvl_chat/zero_stage1_config.json,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=0,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=True,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=4e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/runs/Sep15_15-29-08_1fe2e16e3880,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1.0,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_torch,
optim_args=None,
output_dir=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=6,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=200,
save_strategy=steps,
save_total_limit=1,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.03,
warmup_steps=0,
weight_decay=0.01,
)
09/15/2024 15:29:08 - INFO - __main__ - Loading Tokenizer: /workspace/pretrained/InternVL2-8B
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:29:08,448 >> loading file ./tokenizer.model
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:29:08,448 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:29:08,448 >>
loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:29:08,448 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:29:08,448 >> loading file tokenizer.json
09/15/2024 15:29:08 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False
09/15/2024 15:29:08 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: False
09/15/2024 15:29:08 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: False
[WARNING|logging.py:314] 2024-09-15 15:29:08,622 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:29:08,639 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:29:08,645 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:29:08,660 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
09/15/2024 15:29:08 - INFO - __main__ - Loading InternVLChatModel...
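The TrainingArguments above select lr_scheduler_type=cosine with warmup_ratio=0.03 and learning_rate=4e-05. A minimal sketch of that schedule shape in plain Python (this is an illustration, not the transformers implementation; `lr_at_step` and the step counts are hypothetical):

```python
import math

def lr_at_step(step, total_steps, base_lr=4e-05, warmup_ratio=0.03):
    """Cosine decay with linear warmup, matching the shape selected by
    lr_scheduler_type=cosine and warmup_ratio=0.03 in the log above."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # linear warmup from 0 to base_lr
        return base_lr * step / max(1, warmup_steps)
    # cosine decay from base_lr down to 0
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000  # illustrative total step count, not taken from this run
print(lr_at_step(0, total))      # 0.0 at the first step
print(lr_at_step(30, total))     # peak lr (4e-05) at the end of warmup
print(lr_at_step(total, total))  # decays to ~0 at the end
```

Note warmup_steps=0 in the arguments defers to warmup_ratio, so the actual warmup length depends on the total optimizer-step count of the run.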
[INFO|configuration_utils.py:727] 2024-09-15 15:29:08,764 >> loading configuration file /workspace/pretrained/InternVL2-8B/config.json
[INFO|configuration_utils.py:792] 2024-09-15 15:29:08,766 >> Model config InternVLChatConfig {
  "_commit_hash": null,
  "architectures": [
    "InternVLChatModel"
  ],
  "auto_map": {
    "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
    "AutoModel": "modeling_internvl_chat.InternVLChatModel",
    "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel"
  },
  "downsample_ratio": 0.5,
  "dynamic_image_size": true,
  "force_image_size": 448,
  "llm_config": {
    "_name_or_path": "internlm/internlm2_5-7b-chat",
    "add_cross_attention": false,
    "architectures": [
      "InternLM2ForCausalLM"
    ],
    "attn_implementation": "flash_attention_2",
    "auto_map": {
      "AutoConfig": "configuration_internlm2.InternLM2Config",
      "AutoModel": "modeling_internlm2.InternLM2ForCausalLM",
      "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM"
    },
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bias": false,
    "bos_token_id": 1,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 2,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "silu",
    "hidden_size": 4096,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_range": 0.02,
    "intermediate_size": 14336,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "length_penalty": 1.0,
    "max_length": 20,
    "max_position_embeddings": 32768,
    "min_length": 0,
    "model_type": "internlm2",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 32,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_hidden_layers": 32,
    "num_key_value_heads": 8,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": 2,
    "prefix": null,
    "pretraining_tp": 1,
    "problem_type": null,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "rms_norm_eps": 1e-05,
    "rope_scaling": {
      "factor": 2.0,
      "type": "dynamic"
    },
    "rope_theta": 1000000,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": false,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.37.2",
    "typical_p": 1.0,
    "use_bfloat16": true,
    "use_cache": true,
    "vocab_size": 92553
  },
  "max_dynamic_patch": 12,
  "min_dynamic_patch": 1,
  "model_type": "internvl_chat",
  "pad2square": false,
  "ps_version": "v2",
  "select_layer": -1,
  "template": "internlm2-chat",
  "torch_dtype": "bfloat16",
  "transformers_version": null,
  "use_backbone_lora": 0,
  "use_llm_lora": 0,
  "use_thumbnail": true,
  "vision_config": {
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": [
      "InternVisionModel"
    ],
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "drop_path_rate": 0.0,
    "dropout": 0.0,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_size": 1024,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "image_size": 448,
    "initializer_factor": 1.0,
    "initializer_range": 0.02,
    "intermediate_size": 4096,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_eps": 1e-06,
    "length_penalty": 1.0,
    "max_length": 20,
    "min_length": 0,
    "model_type": "intern_vit_6b",
    "no_repeat_ngram_size": 0,
    "norm_type": "layer_norm",
    "num_attention_heads": 16,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_channels": 3,
    "num_hidden_layers": 24,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "patch_size": 14,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "qk_normalization": false,
    "qkv_bias": true,
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.37.2",
    "typical_p": 1.0,
    "use_bfloat16": true,
    "use_flash_attn": true
  }
}
09/15/2024 15:29:08 - INFO - __main__ - Using flash_attention_2 for InternLM
[INFO|modeling_utils.py:3473] 2024-09-15 15:29:08,768 >> loading weights file /workspace/pretrained/InternVL2-8B/model.safetensors.index.json
[INFO|modeling_utils.py:1426] 2024-09-15 15:29:08,768 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:826] 2024-09-15 15:29:08,770 >> Generate config GenerationConfig {}
[INFO|configuration_utils.py:826] 2024-09-15 15:29:08,808 >> Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 2
}
Loading checkpoint shards: 0%| | 0/4 [00:00> All model checkpoint weights were used when initializing InternVLChatModel.
[INFO|modeling_utils.py:4358] 2024-09-15 15:29:18,959 >> All the weights of InternVLChatModel were initialized from the model checkpoint at /workspace/pretrained/InternVL2-8B.
If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training.
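Several numbers that appear further down in this log can be reproduced from the config values printed above. The `[Dataset] num_image_token: 256` follows from force_image_size 448, patch_size 14 and downsample_ratio 0.5; the `trainable params: 37,748,736` reported for the LoRA adapters is consistent with rank-16 adapters on wqkv, wo, w1, w2 and w3 across all 32 LLM layers. A sketch of the arithmetic (the LoRA rank of 16 is an assumption; it is not printed in this excerpt):

```python
# Reproduce numbers seen later in the log from the config above.
# NOTE: lora_rank = 16 is an assumption; the rank is not printed in this excerpt.

image_size, patch_size, downsample_ratio = 448, 14, 0.5
num_image_token = int((image_size // patch_size * downsample_ratio) ** 2)
print(num_image_token)  # 256, matching "[Dataset] num_image_token: 256"

hidden, intermediate = 4096, 14336        # llm_config hidden/intermediate size
heads, kv_heads, layers = 32, 8, 32       # attention heads, KV heads, layers
head_dim = hidden // heads
wqkv_out = (heads + 2 * kv_heads) * head_dim  # packed q/k/v projection width

lora_rank = 16
def lora_params(d_in, d_out, r=lora_rank):
    # lora_A is (r x d_in), lora_B is (d_out x r)
    return r * d_in + d_out * r

per_layer = (lora_params(hidden, wqkv_out)        # attention.wqkv
             + lora_params(hidden, hidden)        # attention.wo
             + lora_params(hidden, intermediate)  # feed_forward.w1
             + lora_params(intermediate, hidden)  # feed_forward.w2
             + lora_params(hidden, intermediate)) # feed_forward.w3
print(layers * per_layer)  # 37748736 trainable params, as logged
```

37,748,736 of 7,775,531,008 total parameters is the logged trainable fraction of 0.4855%.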
[INFO|configuration_utils.py:779] 2024-09-15 15:29:18,964 >> loading configuration file /workspace/pretrained/InternVL2-8B/generation_config.json
[INFO|configuration_utils.py:826] 2024-09-15 15:29:18,965 >> Generate config GenerationConfig {
  "eos_token_id": [
    92542,
    92543
  ]
}
09/15/2024 15:29:18 - INFO - __main__ - Finished
09/15/2024 15:29:18 - INFO - __main__ - model.config.force_image_size: 448
09/15/2024 15:29:18 - INFO - __main__ - data_args.force_image_size: 448
09/15/2024 15:29:18 - INFO - __main__ - model.config.vision_config.image_size: 448
09/15/2024 15:29:18 - INFO - __main__ - [Dataset] num_image_token: 256
09/15/2024 15:29:18 - INFO - __main__ - [Dataset] dynamic_image_size: True
09/15/2024 15:29:18 - INFO - __main__ - [Dataset] use_thumbnail: True
09/15/2024 15:29:18 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6
09/15/2024 15:29:18 - INFO - __main__ - Formatting inputs...Skip in lazy mode
Loading checkpoint shards: 75%|███████▌ | 3/4 [00:08<00:02, 2.93s/it]
09/15/2024 15:29:19 - INFO - __main__ - Add dataset: amazon-product-dataset with length: 9996
Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00, 2.05s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00, 2.33s/it]
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ -
language_model.base_model.model.model.layers.0.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w3.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w2.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w2.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wqkv.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wqkv.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wo.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wo.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w1.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w1.lora_B.default.weight
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w3.lora_A.default.weight
09/15/2024 15:29:19 - INFO - __main__ -
language_model.base_model.model.model.layers.22.feed_forward.w3.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w2.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w2.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wqkv.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wqkv.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wo.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wo.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w1.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w1.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w3.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w3.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w2.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w2.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wqkv.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wqkv.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wo.lora_A.default.weight 09/15/2024 15:29:19 - INFO - 
__main__ - language_model.base_model.model.model.layers.24.attention.wo.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w1.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w1.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w3.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w3.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w2.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w2.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wqkv.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wqkv.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wo.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wo.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w1.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w1.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w3.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w3.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w2.lora_A.default.weight 09/15/2024 15:29:19 - 
INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w2.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wqkv.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wqkv.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wo.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wo.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w1.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w1.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w3.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w3.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w2.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w2.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wqkv.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wqkv.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wo.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wo.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w1.lora_A.default.weight 09/15/2024 15:29:19 - 
INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w1.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w3.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w3.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w2.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w2.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wqkv.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wqkv.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wo.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wo.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w1.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w1.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w3.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w3.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w2.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w2.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wqkv.lora_A.default.weight 09/15/2024 
15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wqkv.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wo.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wo.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w1.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w1.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w3.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w3.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w2.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w2.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wqkv.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wqkv.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wo.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wo.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w1.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w1.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w3.lora_A.default.weight 
09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w3.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w2.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w2.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wqkv.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wqkv.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wo.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wo.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w1.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w1.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w3.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w3.lora_B.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w2.lora_A.default.weight 09/15/2024 15:29:19 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w2.lora_B.default.weight [INFO|trainer.py:571] 2024-09-15 15:29:19,927 >> Using auto half precision backend Traceback (most recent call last): File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 847, in main() File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 832, in main train_result = 
trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1580, in _inner_training_loop
    num_update_steps_per_epoch = len_dataloader // args.gradient_accumulation_steps
ZeroDivisionError: integer division or modulo by zero
Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00, 2.14s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00, 2.44s/it]
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
[the same ZeroDivisionError traceback was raised on the other three ranks; verbatim repeats omitted]
[2024-09-15 15:29:24,040] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 8453) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time       : 2024-09-15_15:29:24
  host       : 1fe2e16e3880
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 8454)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time       : 2024-09-15_15:29:24
  host       : 1fe2e16e3880
  rank       : 2 (local_rank: 2)
  exitcode   : 1 (pid: 8455)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time       : 2024-09-15_15:29:24
  host       : 1fe2e16e3880
  rank       : 3 (local_rank: 3)
  exitcode   : 1 (pid: 8456)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2024-09-15_15:29:24
  host       : 1fe2e16e3880
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 8453)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[2024-09-15 15:30:31,684] torch.distributed.run: [WARNING]
[2024-09-15 15:30:31,684] torch.distributed.run: [WARNING] *****************************************
[2024-09-15 15:30:31,684] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-09-15 15:30:31,684] torch.distributed.run: [WARNING] *****************************************
[2024-09-15 15:30:34,079] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:30:34,121] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:30:34,218] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:30:34,263] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. [2024-09-15 15:30:36,205] [INFO] [comm.py:637:init_distributed] cdb=None [2024-09-15 15:30:36,233] [INFO] [comm.py:637:init_distributed] cdb=None [2024-09-15 15:30:36,247] [INFO] [comm.py:637:init_distributed] cdb=None [2024-09-15 15:30:36,247] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl [2024-09-15 15:30:36,271] [INFO] [comm.py:637:init_distributed] cdb=None 09/15/2024 15:30:36 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: False 09/15/2024 15:30:36 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False 09/15/2024 15:30:36 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False 09/15/2024 15:30:36 - INFO - __main__ - Training/evaluation parameters TrainingArguments( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=4, dataloader_persistent_workers=False, dataloader_pin_memory=True, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=/workspace/InternVL/internvl_chat/zero_stage1_config.json, disable_tqdm=False, 
dispatch_batches=None, do_eval=False, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=no, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=0, gradient_checkpointing=False, gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=True, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=<HUB_TOKEN>, ignore_data_skip=False, include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=4e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/runs/Sep15_15-30-36_1fe2e16e3880, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=1.0, logging_strategy=steps, lr_scheduler_kwargs={}, lr_scheduler_type=cosine, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=3.0, optim=adamw_torch, optim_args=None, output_dir=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=8, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=<PUSH_TO_HUB_TOKEN>, ray_scope=last, remove_unused_columns=True, report_to=['tensorboard'], resume_from_checkpoint=None, 
run_name=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=200, save_strategy=steps, save_total_limit=1, seed=42, skip_memory_metrics=True, split_batches=False, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.03, warmup_steps=0, weight_decay=0.01, ) 09/15/2024 15:30:36 - INFO - __main__ - Loading Tokenizer: /workspace/pretrained/InternVL2-8B [INFO|tokenization_utils_base.py:2025] 2024-09-15 15:30:36,322 >> loading file ./tokenizer.model [INFO|tokenization_utils_base.py:2025] 2024-09-15 15:30:36,322 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2025] 2024-09-15 15:30:36,322 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2025] 2024-09-15 15:30:36,322 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2025] 2024-09-15 15:30:36,322 >> loading file tokenizer.json 09/15/2024 15:30:36 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: False [WARNING|logging.py:314] 2024-09-15 15:30:36,487 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [WARNING|logging.py:314] 2024-09-15 15:30:36,511 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [WARNING|logging.py:314] 2024-09-15 15:30:36,520 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [WARNING|logging.py:314] 2024-09-15 15:30:36,536 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
09/15/2024 15:30:36 - INFO - __main__ - Loading InternVLChatModel... [INFO|configuration_utils.py:727] 2024-09-15 15:30:36,660 >> loading configuration file /workspace/pretrained/InternVL2-8B/config.json [INFO|configuration_utils.py:792] 2024-09-15 15:30:36,662 >> Model config InternVLChatConfig { "_commit_hash": null, "architectures": [ "InternVLChatModel" ], "auto_map": { "AutoConfig": "configuration_internvl_chat.InternVLChatConfig", "AutoModel": "modeling_internvl_chat.InternVLChatModel", "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel" }, "downsample_ratio": 0.5, "dynamic_image_size": true, "force_image_size": 448, "llm_config": { "_name_or_path": "internlm/internlm2_5-7b-chat", "add_cross_attention": false, "architectures": [ "InternLM2ForCausalLM" ], "attn_implementation": "flash_attention_2", "auto_map": { "AutoConfig": "configuration_internlm2.InternLM2Config", "AutoModel": "modeling_internlm2.InternLM2ForCausalLM", "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM" }, "bad_words_ids": null, "begin_suppress_tokens": null, "bias": false, "bos_token_id": 1, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": 2, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "silu", "hidden_size": 4096, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "initializer_range": 0.02, "intermediate_size": 14336, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 32768, "min_length": 0, "model_type": "internlm2", "no_repeat_ngram_size": 0, "num_attention_heads": 32, "num_beam_groups": 1, "num_beams": 1, "num_hidden_layers": 32, "num_key_value_heads": 8, "num_return_sequences": 1, 
"output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": 2, "prefix": null, "pretraining_tp": 1, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "rms_norm_eps": 1e-05, "rope_scaling": { "factor": 2.0, "type": "dynamic" }, "rope_theta": 1000000, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": false, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.37.2", "typical_p": 1.0, "use_bfloat16": true, "use_cache": true, "vocab_size": 92553 }, "max_dynamic_patch": 12, "min_dynamic_patch": 1, "model_type": "internvl_chat", "pad2square": false, "ps_version": "v2", "select_layer": -1, "template": "internlm2-chat", "torch_dtype": "bfloat16", "transformers_version": null, "use_backbone_lora": 0, "use_llm_lora": 0, "use_thumbnail": true, "vision_config": { "_name_or_path": "", "add_cross_attention": false, "architectures": [ "InternVisionModel" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "drop_path_rate": 0.0, "dropout": 0.0, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "gelu", "hidden_size": 1024, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "image_size": 448, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 4096, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "layer_norm_eps": 
1e-06, "length_penalty": 1.0, "max_length": 20, "min_length": 0, "model_type": "intern_vit_6b", "no_repeat_ngram_size": 0, "norm_type": "layer_norm", "num_attention_heads": 16, "num_beam_groups": 1, "num_beams": 1, "num_channels": 3, "num_hidden_layers": 24, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "patch_size": 14, "prefix": null, "problem_type": null, "pruned_heads": {}, "qk_normalization": false, "qkv_bias": true, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.37.2", "typical_p": 1.0, "use_bfloat16": true, "use_flash_attn": true } }
09/15/2024 15:30:36 - INFO - __main__ - Using flash_attention_2 for InternLM
[INFO|modeling_utils.py:3473] 2024-09-15 15:30:36,664 >> loading weights file /workspace/pretrained/InternVL2-8B/model.safetensors.index.json
[INFO|modeling_utils.py:1426] 2024-09-15 15:30:36,664 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:826] 2024-09-15 15:30:36,665 >> Generate config GenerationConfig {}
[INFO|configuration_utils.py:826] 2024-09-15 15:30:36,702 >> Generate config GenerationConfig { "bos_token_id": 1, "eos_token_id": 2, "pad_token_id": 2 }
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
[...]
>> All model checkpoint weights were used when initializing InternVLChatModel.
[INFO|modeling_utils.py:4358] 2024-09-15 15:30:47,773 >> All the weights of InternVLChatModel were initialized from the model checkpoint at /workspace/pretrained/InternVL2-8B.
If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training.
[INFO|configuration_utils.py:779] 2024-09-15 15:30:47,779 >> loading configuration file /workspace/pretrained/InternVL2-8B/generation_config.json
[INFO|configuration_utils.py:826] 2024-09-15 15:30:47,779 >> Generate config GenerationConfig {
  "eos_token_id": [
    92542,
    92543
  ]
}
09/15/2024 15:30:47 - INFO - __main__ - Finished
09/15/2024 15:30:47 - INFO - __main__ - model.config.force_image_size: 448
09/15/2024 15:30:47 - INFO - __main__ - data_args.force_image_size: 448
09/15/2024 15:30:47 - INFO - __main__ - model.config.vision_config.image_size: 448
09/15/2024 15:30:47 - INFO - __main__ - [Dataset] num_image_token: 256
09/15/2024 15:30:47 - INFO - __main__ - [Dataset] dynamic_image_size: True
09/15/2024 15:30:47 - INFO - __main__ - [Dataset] use_thumbnail: True
09/15/2024 15:30:47 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6
09/15/2024 15:30:47 - INFO - __main__ - Formatting inputs...Skip in lazy mode
Traceback (most recent call last):
  File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 847, in <module>
    main()
  File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 832, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1580, in _inner_training_loop
    num_update_steps_per_epoch = len_dataloader // args.gradient_accumulation_steps
ZeroDivisionError: integer division or modulo by zero
09/15/2024 15:30:48 - INFO - __main__ - Add dataset: amazon-product-dataset with length: 9996
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w1.lora_A.default.weight
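The ZeroDivisionError in the traceback above comes from the floor division at transformers/trainer.py line 1580: `num_update_steps_per_epoch = len_dataloader // args.gradient_accumulation_steps`. That expression can only raise if `gradient_accumulation_steps` is 0 when the line executes; the TrainingArguments dump earlier in this log shows it parsed as 1, so the value is evidently being zeroed out somewhere between argument parsing and `trainer.train()` (this log does not show where). A minimal sketch of the failing computation with a guard; the helper name and the guard are illustrative, not transformers code:

```python
# Sketch of the computation that fails at trainer.py:1580 in the traceback above.
# The guard and helper name are illustrative; transformers divides directly.
def num_update_steps_per_epoch(len_dataloader: int, gradient_accumulation_steps: int) -> int:
    if gradient_accumulation_steps < 1:
        # This is the condition that produces the ZeroDivisionError in the log.
        raise ValueError(
            f"gradient_accumulation_steps must be >= 1, got {gradient_accumulation_steps}"
        )
    # Matches the failing line: len_dataloader // args.gradient_accumulation_steps,
    # clamped to at least one update step per epoch.
    return max(len_dataloader // gradient_accumulation_steps, 1)

print(num_update_steps_per_epoch(1000, 4))  # 250
```

Checking the effective `gradient_accumulation_steps` (both in TrainingArguments and in the resolved DeepSpeed config) right before `trainer.train()` is one way to localize where the value becomes 0.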
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - 
__main__ - language_model.base_model.model.model.layers.26.attention.wqkv.lora_A.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wqkv.lora_B.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wo.lora_A.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wo.lora_B.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w1.lora_A.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w1.lora_B.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w3.lora_A.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w3.lora_B.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w2.lora_A.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w2.lora_B.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wqkv.lora_A.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wqkv.lora_B.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wo.lora_A.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wo.lora_B.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w1.lora_A.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w1.lora_B.default.weight 09/15/2024 15:30:48 - INFO - 
__main__ - language_model.base_model.model.model.layers.27.feed_forward.w3.lora_A.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w3.lora_B.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w2.lora_A.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w2.lora_B.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wqkv.lora_A.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wqkv.lora_B.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wo.lora_A.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wo.lora_B.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w1.lora_A.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w1.lora_B.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w3.lora_A.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w3.lora_B.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w2.lora_A.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w2.lora_B.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wqkv.lora_A.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wqkv.lora_B.default.weight 09/15/2024 15:30:48 - 
INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wo.lora_A.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wo.lora_B.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w1.lora_A.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w1.lora_B.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w3.lora_A.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w3.lora_B.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w2.lora_A.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w2.lora_B.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wqkv.lora_A.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wqkv.lora_B.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wo.lora_A.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wo.lora_B.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w1.lora_A.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w1.lora_B.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w3.lora_A.default.weight 09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w3.lora_B.default.weight 09/15/2024 15:30:48 
- INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w2.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wqkv.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wqkv.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wo.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wo.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w1.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w1.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w3.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w3.lora_B.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w2.lora_A.default.weight
09/15/2024 15:30:48 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w2.lora_B.default.weight
[INFO|trainer.py:571] 2024-09-15 15:30:48,754 >> Using auto half precision backend
Traceback (most recent call last):
  File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 847, in <module>
    main()
  File "/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 832, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1580, in _inner_training_loop
    num_update_steps_per_epoch = len_dataloader // args.gradient_accumulation_steps
ZeroDivisionError: integer division or modulo by zero
[2024-09-15 15:30:51,713] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 8574) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/workspace/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-09-15_15:30:51
  host      : 1fe2e16e3880
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 8575)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-09-15_15:30:51
  host      : 1fe2e16e3880
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 8576)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-09-15_15:30:51
  host      : 1fe2e16e3880
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 8577)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-15_15:30:51
  host      : 1fe2e16e3880
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 8574)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[2024-09-15 15:32:34,497] torch.distributed.run: [WARNING]
[2024-09-15 15:32:34,497] torch.distributed.run: [WARNING] *****************************************
[2024-09-15 15:32:34,497] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-09-15 15:32:34,497] torch.distributed.run: [WARNING] *****************************************
[2024-09-15 15:32:36,778] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:32:36,787] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:32:36,843] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-15 15:32:36,849] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
[2024-09-15 15:32:38,756] [INFO] [comm.py:637:init_distributed] cdb=None
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
[2024-09-15 15:32:38,813] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-15 15:32:38,822] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-15 15:32:38,847] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-15 15:32:38,847] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
09/15/2024 15:32:38 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, 16-bits training: False
09/15/2024 15:32:38 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: False
09/15/2024 15:32:38 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, 16-bits training: False
09/15/2024 15:32:38 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False
09/15/2024 15:32:38 - INFO - __main__ - Training/evaluation parameters TrainingArguments(_n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=4, dataloader_persistent_workers=False, dataloader_pin_memory=True, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=/workspace/InternVL/internvl_chat/zero_stage1_config.json, disable_tqdm=False, dispatch_batches=None, do_eval=False, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=no, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=1, gradient_checkpointing=False, gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=True, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=<HUB_TOKEN>, ignore_data_skip=False, include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=4e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/runs/Sep15_15-32-38_1fe2e16e3880, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=1.0, logging_strategy=steps, lr_scheduler_kwargs={}, lr_scheduler_type=cosine, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=3.0, optim=adamw_torch, optim_args=None, output_dir=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=8, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=<PUSH_TO_HUB_TOKEN>, ray_scope=last, remove_unused_columns=True, report_to=['tensorboard'], resume_from_checkpoint=None, run_name=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=200, save_strategy=steps, save_total_limit=1, seed=42, skip_memory_metrics=True, split_batches=False, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.03, warmup_steps=0, weight_decay=0.01, )
09/15/2024 15:32:38 - INFO - __main__ - Loading Tokenizer: /workspace/pretrained/InternVL2-8B
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:32:38,917 >> loading file ./tokenizer.model
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:32:38,917 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:32:38,917 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:32:38,917 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:32:38,917 >> loading file tokenizer.json
[WARNING|logging.py:314] 2024-09-15 15:32:39,038 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:32:39,068 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:32:39,090 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:32:39,103 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
09/15/2024 15:32:39 - INFO - __main__ - Loading InternVLChatModel...
[INFO|configuration_utils.py:727] 2024-09-15 15:32:39,253 >> loading configuration file /workspace/pretrained/InternVL2-8B/config.json [INFO|configuration_utils.py:792] 2024-09-15 15:32:39,255 >> Model config InternVLChatConfig { "_commit_hash": null, "architectures": [ "InternVLChatModel" ], "auto_map": { "AutoConfig": "configuration_internvl_chat.InternVLChatConfig", "AutoModel": "modeling_internvl_chat.InternVLChatModel", "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel" }, "downsample_ratio": 0.5, "dynamic_image_size": true, "force_image_size": 448, "llm_config": { "_name_or_path": "internlm/internlm2_5-7b-chat", "add_cross_attention": false, "architectures": [ "InternLM2ForCausalLM" ], "attn_implementation": "flash_attention_2", "auto_map": { "AutoConfig": "configuration_internlm2.InternLM2Config", "AutoModel": "modeling_internlm2.InternLM2ForCausalLM", "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM" }, "bad_words_ids": null, "begin_suppress_tokens": null, "bias": false, "bos_token_id": 1, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": 2, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "silu", "hidden_size": 4096, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "initializer_range": 0.02, "intermediate_size": 14336, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 32768, "min_length": 0, "model_type": "internlm2", "no_repeat_ngram_size": 0, "num_attention_heads": 32, "num_beam_groups": 1, "num_beams": 1, "num_hidden_layers": 32, "num_key_value_heads": 8, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": 
false, "pad_token_id": 2, "prefix": null, "pretraining_tp": 1, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "rms_norm_eps": 1e-05, "rope_scaling": { "factor": 2.0, "type": "dynamic" }, "rope_theta": 1000000, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": false, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.37.2", "typical_p": 1.0, "use_bfloat16": true, "use_cache": true, "vocab_size": 92553 }, "max_dynamic_patch": 12, "min_dynamic_patch": 1, "model_type": "internvl_chat", "pad2square": false, "ps_version": "v2", "select_layer": -1, "template": "internlm2-chat", "torch_dtype": "bfloat16", "transformers_version": null, "use_backbone_lora": 0, "use_llm_lora": 0, "use_thumbnail": true, "vision_config": { "_name_or_path": "", "add_cross_attention": false, "architectures": [ "InternVisionModel" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "drop_path_rate": 0.0, "dropout": 0.0, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "gelu", "hidden_size": 1024, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "image_size": 448, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 4096, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "layer_norm_eps": 1e-06, "length_penalty": 1.0, "max_length": 20, "min_length": 0, "model_type": 
"intern_vit_6b", "no_repeat_ngram_size": 0, "norm_type": "layer_norm", "num_attention_heads": 16, "num_beam_groups": 1, "num_beams": 1, "num_channels": 3, "num_hidden_layers": 24, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "patch_size": 14, "prefix": null, "problem_type": null, "pruned_heads": {}, "qk_normalization": false, "qkv_bias": true, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.37.2", "typical_p": 1.0, "use_bfloat16": true, "use_flash_attn": true } }
09/15/2024 15:32:39 - INFO - __main__ - Using flash_attention_2 for InternLM
[INFO|modeling_utils.py:3473] 2024-09-15 15:32:39,256 >> loading weights file /workspace/pretrained/InternVL2-8B/model.safetensors.index.json
[INFO|modeling_utils.py:1426] 2024-09-15 15:32:39,257 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:826] 2024-09-15 15:32:39,258 >> Generate config GenerationConfig {}
[INFO|configuration_utils.py:826] 2024-09-15 15:32:39,297 >> Generate config GenerationConfig { "bos_token_id": 1, "eos_token_id": 2, "pad_token_id": 2 }
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
>> All model checkpoint weights were used when initializing InternVLChatModel.
[INFO|modeling_utils.py:4358] 2024-09-15 15:32:50,042 >> All the weights of InternVLChatModel were initialized from the model checkpoint at /workspace/pretrained/InternVL2-8B. If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training.
[INFO|configuration_utils.py:779] 2024-09-15 15:32:50,048 >> loading configuration file /workspace/pretrained/InternVL2-8B/generation_config.json
[INFO|configuration_utils.py:826] 2024-09-15 15:32:50,048 >> Generate config GenerationConfig { "eos_token_id": [ 92542, 92543 ] }
09/15/2024 15:32:50 - INFO - __main__ - Finished
09/15/2024 15:32:50 - INFO - __main__ - model.config.force_image_size: 448
09/15/2024 15:32:50 - INFO - __main__ - data_args.force_image_size: 448
09/15/2024 15:32:50 - INFO - __main__ - model.config.vision_config.image_size: 448
09/15/2024 15:32:50 - INFO - __main__ - [Dataset] num_image_token: 256
09/15/2024 15:32:50 - INFO - __main__ - [Dataset] dynamic_image_size: True
09/15/2024 15:32:50 - INFO - __main__ - [Dataset] use_thumbnail: True
09/15/2024 15:32:50 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6
09/15/2024 15:32:50 - INFO - __main__ - Formatting inputs...Skip in lazy mode
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
09/15/2024 15:32:50 - INFO - __main__ - Add dataset: amazon-product-dataset with length: 9996
Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00, 2.15s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00, 2.45s/it]
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ -
language_model.base_model.model.model.layers.0.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w1.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w1.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w3.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w3.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w2.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w2.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wqkv.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wqkv.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wo.lora_A.default.weight
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wo.lora_B.default.weight
09/15/2024 15:32:50 - INFO - __main__ -
language_model.base_model.model.model.layers.8.feed_forward.w1.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w1.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w3.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w3.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w2.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w2.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wqkv.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wqkv.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wo.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wo.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w1.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w1.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w3.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w3.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w2.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w2.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - 
language_model.base_model.model.model.layers.10.attention.wqkv.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wqkv.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wo.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wo.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w1.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w1.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w3.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w3.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w2.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w2.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wqkv.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wqkv.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wo.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wo.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w1.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w1.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - 
language_model.base_model.model.model.layers.11.feed_forward.w3.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w3.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w2.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w2.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wqkv.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wqkv.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wo.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wo.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w1.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w1.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w3.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w3.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w2.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w2.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wqkv.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wqkv.lora_B.default.weight 09/15/2024 15:32:50 - INFO - 
__main__ - language_model.base_model.model.model.layers.13.attention.wo.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wo.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w1.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w1.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w3.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w3.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w2.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w2.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wqkv.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wqkv.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wo.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wo.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w1.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w1.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w3.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w3.lora_B.default.weight 09/15/2024 15:32:50 - INFO 
- __main__ - language_model.base_model.model.model.layers.14.feed_forward.w2.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w2.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wqkv.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wqkv.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wo.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wo.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w1.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w1.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w3.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w3.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w2.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w2.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wqkv.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wqkv.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wo.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wo.lora_B.default.weight 09/15/2024 15:32:50 - INFO 
- __main__ - language_model.base_model.model.model.layers.16.feed_forward.w1.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w1.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w3.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w3.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w2.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w2.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wqkv.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wqkv.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wo.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wo.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w1.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w1.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w3.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w3.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w2.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w2.lora_B.default.weight 09/15/2024 
15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wqkv.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wqkv.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wo.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wo.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w1.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w1.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w3.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w3.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w2.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w2.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wqkv.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wqkv.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wo.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wo.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w1.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w1.lora_B.default.weight 09/15/2024 
15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w3.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w3.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w2.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w2.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wqkv.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wqkv.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wo.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wo.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w1.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w1.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w3.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w3.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w2.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w2.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wqkv.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wqkv.lora_B.default.weight 
09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wo.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wo.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w1.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w1.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w3.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w3.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w2.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w2.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wqkv.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wqkv.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wo.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wo.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w1.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w1.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w3.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - 
language_model.base_model.model.model.layers.22.feed_forward.w3.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w2.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w2.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wqkv.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wqkv.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wo.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wo.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w1.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w1.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w3.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w3.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w2.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w2.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wqkv.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wqkv.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wo.lora_A.default.weight 09/15/2024 15:32:50 - INFO - 
__main__ - language_model.base_model.model.model.layers.24.attention.wo.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w1.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w1.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w3.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w3.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w2.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w2.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wqkv.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wqkv.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wo.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wo.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w1.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w1.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w3.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w3.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w2.lora_A.default.weight 09/15/2024 15:32:50 - 
INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w2.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wqkv.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wqkv.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wo.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wo.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w1.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w1.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w3.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w3.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w2.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w2.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wqkv.lora_A.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wqkv.lora_B.default.weight 09/15/2024 15:32:50 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wo.lora_A.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wo.lora_B.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w1.lora_A.default.weight 09/15/2024 15:32:51 - 
INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w1.lora_B.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w3.lora_A.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w3.lora_B.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w2.lora_A.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w2.lora_B.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wqkv.lora_A.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wqkv.lora_B.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wo.lora_A.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wo.lora_B.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w1.lora_A.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w1.lora_B.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w3.lora_A.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w3.lora_B.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w2.lora_A.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w2.lora_B.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wqkv.lora_A.default.weight 09/15/2024 
15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wqkv.lora_B.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wo.lora_A.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wo.lora_B.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w1.lora_A.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w1.lora_B.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w3.lora_A.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w3.lora_B.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w2.lora_A.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w2.lora_B.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wqkv.lora_A.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wqkv.lora_B.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wo.lora_A.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wo.lora_B.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w1.lora_A.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w1.lora_B.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w3.lora_A.default.weight 
09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w3.lora_B.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w2.lora_A.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w2.lora_B.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wqkv.lora_A.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wqkv.lora_B.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wo.lora_A.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wo.lora_B.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w1.lora_A.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w1.lora_B.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w3.lora_A.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w3.lora_B.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w2.lora_A.default.weight 09/15/2024 15:32:51 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w2.lora_B.default.weight [INFO|trainer.py:571] 2024-09-15 15:32:51,022 >> Using auto half precision backend [2024-09-15 15:32:51,228] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.13.5, git-hash=unknown, git-branch=unknown trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855 [2024-09-15 15:32:54,896] [INFO] [logging.py:96:log_dist] [Rank 0] 
DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu118/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.0755770206451416 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10229706764221191 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10167860984802246 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10164499282836914 seconds
[2024-09-15 15:32:55,650] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2024-09-15 15:32:55,651] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-09-15 15:32:55,691] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2024-09-15 15:32:55,691] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=
[2024-09-15 15:32:55,691] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 1 optimizer
[2024-09-15 15:32:55,692] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 1000000000
[2024-09-15 15:32:55,692] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 1000000000
[2024-09-15 15:32:55,692] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False
[2024-09-15 15:32:55,692] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False
[2024-09-15 15:32:55,968] [INFO] [utils.py:800:see_memory_usage] Before initializing optimizer states
[2024-09-15 15:32:55,969] [INFO] [utils.py:801:see_memory_usage] MA 15.65 GB Max_MA 15.67 GB CA 15.85 GB Max_CA 16 GB
[2024-09-15 15:32:55,969] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 90.14 GB, percent = 9.5%
[2024-09-15 15:32:56,150] [INFO] [utils.py:800:see_memory_usage] After initializing optimizer states
[2024-09-15 15:32:56,151] [INFO] [utils.py:801:see_memory_usage] MA 15.65 GB Max_MA 15.68 GB CA 15.89 GB Max_CA 16 GB
[2024-09-15 15:32:56,151] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 90.19 GB, percent = 9.5%
[2024-09-15 15:32:56,151] [INFO] [stage_1_and_2.py:539:__init__] optimizer state initialized
[2024-09-15 15:32:56,333] [INFO] [utils.py:800:see_memory_usage] After initializing ZeRO optimizer
[2024-09-15 15:32:56,334] [INFO] [utils.py:801:see_memory_usage] MA 15.65 GB Max_MA 15.65 GB CA 15.89 GB Max_CA 16 GB
[2024-09-15 15:32:56,334] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 90.2 GB, percent = 9.6%
[2024-09-15 15:32:56,336] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
[2024-09-15 15:32:56,336] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client callable to create LR scheduler
[2024-09-15 15:32:56,336] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler =
[2024-09-15 15:32:56,336] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[[0.9, 0.999]]
[2024-09-15 15:32:56,341] [INFO] [config.py:996:print] DeepSpeedEngine configuration:
[2024-09-15 15:32:56,341] [INFO] [config.py:1000:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false }
[2024-09-15 15:32:56,341] [INFO] [config.py:1000:print] aio_config
................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-09-15 15:32:56,341] [INFO] [config.py:1000:print] amp_enabled .................. False
[2024-09-15 15:32:56,341] [INFO] [config.py:1000:print] amp_params ................... False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 }
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] bfloat16_enabled ............. True
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] bfloat16_immediate_grad_update False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] checkpoint_parallel_write_pipeline False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] checkpoint_tag_validation_enabled True
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] checkpoint_tag_validation_fail False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] comms_config .................
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] communication_data_type ...... None
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] compile_config ............... enabled=False backend='inductor' kwargs={}
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] curriculum_enabled_legacy .... False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] curriculum_params_legacy ..... False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] data_efficiency_enabled ...... False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] dataloader_drop_last ......... False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] disable_allgather ............ False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] dump_state ................... False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] dynamic_loss_scale_args ...... None
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] eigenvalue_enabled ........... False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] eigenvalue_gas_boundary_resolution 1
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] eigenvalue_layer_name ........ bert.encoder.layer
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] eigenvalue_layer_num ......... 0
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] eigenvalue_max_iter .......... 100
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] eigenvalue_stability ......... 1e-06
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] eigenvalue_tol ............... 0.01
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] eigenvalue_verbose ........... False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] elasticity_enabled ........... False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null }
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] fp16_auto_cast ............... None
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] fp16_enabled ................. False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] fp16_master_weights_and_gradients False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] global_rank .................. 0
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] grad_accum_dtype ............. None
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] gradient_accumulation_steps .. 1
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] gradient_clipping ............ 1.0
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] gradient_predivide_factor .... 1.0
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] graph_harvesting ............. False
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] initial_dynamic_scale ........ 1
[2024-09-15 15:32:56,342] [INFO] [config.py:1000:print] load_universal_checkpoint .... False
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print] loss_scale ................... 1.0
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print] memory_breakdown ............. False
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print] mics_hierarchial_params_gather False
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print] mics_shard_size .............. -1
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null }
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print] optimizer_legacy_fusion ...... False
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print] optimizer_name ............... adamw
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print] optimizer_params ............. {'lr': 4e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.01}
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print] pld_enabled .................. False
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print] pld_params ................... False
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print] prescale_gradients ........... False
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print] scheduler_name ............... None
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print] scheduler_params ............. None
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print] seq_parallel_communication_data_type torch.float32
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print] sparse_attention ............. None
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print] sparse_gradients_enabled ..... False
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print] steps_per_print .............. inf
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print] train_batch_size ............. 32
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print] train_micro_batch_size_per_gpu 8
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print] use_data_before_expert_parallel_ False
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print] use_node_local_storage ....... False
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print] wall_clock_breakdown ......... True
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print] weight_quantization_config ... None
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print] world_size ................... 4
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print] zero_allow_untested_optimizer False
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print] zero_config .................. stage=1 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print] zero_enabled ................. True
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print] zero_force_ds_cpu_optimizer .. True
[2024-09-15 15:32:56,343] [INFO] [config.py:1000:print] zero_optimization_stage ......
1
[2024-09-15 15:32:56,343] [INFO] [config.py:986:print_user_config] json = {
  "zero_optimization": {
    "stage": 1,
    "allgather_partitions": true,
    "allgather_bucket_size": 1.000000e+09,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 1.000000e+09,
    "contiguous_gradients": true
  },
  "fp16": {
    "enabled": false, "auto_cast": true, "loss_scale": 0, "initial_scale_power": 32,
    "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1
  },
  "bf16": { "enabled": true },
  "optimizer": {
    "type": "AdamW",
    "params": { "lr": 4e-05, "betas": [0.9, 0.999], "eps": 1e-08, "weight_decay": 0.01 }
  },
  "gradient_accumulation_steps": 1,
  "gradient_clipping": 1.0,
  "steps_per_print": inf,
  "train_batch_size": 32,
  "train_micro_batch_size_per_gpu": 8,
  "wall_clock_breakdown": true
}
[INFO|trainer.py:1721] 2024-09-15 15:32:56,344 >> ***** Running training *****
[INFO|trainer.py:1722] 2024-09-15 15:32:56,344 >> Num examples = 9,996
[INFO|trainer.py:1723] 2024-09-15 15:32:56,344 >> Num Epochs = 3
[INFO|trainer.py:1724] 2024-09-15 15:32:56,344 >> Instantaneous batch size per device = 8
[INFO|trainer.py:1727] 2024-09-15 15:32:56,344 >> Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:1728] 2024-09-15 15:32:56,344 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1729] 2024-09-15 15:32:56,344 >> Total optimization steps = 939
[INFO|trainer.py:1730] 2024-09-15 15:32:56,349 >> Number of trainable parameters = 37,748,736
0%| | 0/939 [00:00
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=4e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/runs/Sep15_15-34-50_1fe2e16e3880,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1.0,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_torch,
optim_args=None,
output_dir=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=200,
save_strategy=steps,
save_total_limit=1,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.03,
warmup_steps=0,
weight_decay=0.01,
)
09/15/2024 15:34:51 - INFO - __main__ - Loading Tokenizer: /workspace/pretrained/InternVL2-8B
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:34:51,047 >> loading file ./tokenizer.model
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:34:51,047 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:34:51,047 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:34:51,047 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2025] 2024-09-15 15:34:51,047 >> loading file tokenizer.json
09/15/2024 15:34:51 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, 16-bits training: False
09/15/2024 15:34:51 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, 16-bits training: False
[WARNING|logging.py:314] 2024-09-15 15:34:51,204 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:34:51,236 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:34:51,253 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-09-15 15:34:51,255 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
09/15/2024 15:34:51 - INFO - __main__ - Loading InternVLChatModel...
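The numbers in the "***** Running training *****" banner above are mutually consistent and can be reproduced directly from the per-device batch size, the 4-GPU world size, and the dataset length reported in this log:

```python
import math

# Values taken from this log's training banner and DeepSpeed config.
per_device_batch = 8   # Instantaneous batch size per device
world_size = 4         # one process per GPU (cuda:0..cuda:3)
grad_accum = 1         # Gradient Accumulation steps
num_examples = 9996    # Num examples
epochs = 3             # Num Epochs

total_batch = per_device_batch * world_size * grad_accum
steps_per_epoch = math.ceil(num_examples / total_batch)  # 9996 / 32 -> 313
total_steps = steps_per_epoch * epochs

print(total_batch)   # 32, the "Total train batch size"
print(total_steps)   # 939, the "Total optimization steps"
```

Note that with save_steps=200 and save_total_limit=1, only the most recent of the periodic checkpoints is kept over the 939 steps.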
[INFO|configuration_utils.py:727] 2024-09-15 15:34:51,387 >> loading configuration file /workspace/pretrained/InternVL2-8B/config.json
[INFO|configuration_utils.py:792] 2024-09-15 15:34:51,389 >> Model config InternVLChatConfig {
  "_commit_hash": null,
  "architectures": [ "InternVLChatModel" ],
  "auto_map": {
    "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
    "AutoModel": "modeling_internvl_chat.InternVLChatModel",
    "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel"
  },
  "downsample_ratio": 0.5,
  "dynamic_image_size": true,
  "force_image_size": 448,
  "llm_config": {
    "_name_or_path": "internlm/internlm2_5-7b-chat",
    "add_cross_attention": false,
    "architectures": [ "InternLM2ForCausalLM" ],
    "attn_implementation": "flash_attention_2",
    "auto_map": {
      "AutoConfig": "configuration_internlm2.InternLM2Config",
      "AutoModel": "modeling_internlm2.InternLM2ForCausalLM",
      "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM"
    },
    "bad_words_ids": null, "begin_suppress_tokens": null, "bias": false,
    "bos_token_id": 1, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null,
    "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false,
    "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": 2,
    "exponential_decay_length_penalty": null, "finetuning_task": null,
    "forced_bos_token_id": null, "forced_eos_token_id": null,
    "hidden_act": "silu", "hidden_size": 4096,
    "id2label": { "0": "LABEL_0", "1": "LABEL_1" },
    "initializer_range": 0.02, "intermediate_size": 14336,
    "is_decoder": false, "is_encoder_decoder": false,
    "label2id": { "LABEL_0": 0, "LABEL_1": 1 },
    "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 32768,
    "min_length": 0, "model_type": "internlm2", "no_repeat_ngram_size": 0,
    "num_attention_heads": 32, "num_beam_groups": 1, "num_beams": 1,
    "num_hidden_layers": 32, "num_key_value_heads": 8, "num_return_sequences": 1,
    "output_attentions": false, "output_hidden_states": false, "output_scores": false,
    "pad_token_id": 2, "prefix": null, "pretraining_tp": 1, "problem_type": null,
    "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0,
    "return_dict": true, "return_dict_in_generate": false, "rms_norm_eps": 1e-05,
    "rope_scaling": { "factor": 2.0, "type": "dynamic" }, "rope_theta": 1000000,
    "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null,
    "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false,
    "tie_word_embeddings": false, "tokenizer_class": null, "top_k": 50, "top_p": 1.0,
    "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.37.2",
    "typical_p": 1.0, "use_bfloat16": true, "use_cache": true, "vocab_size": 92553
  },
  "max_dynamic_patch": 12,
  "min_dynamic_patch": 1,
  "model_type": "internvl_chat",
  "pad2square": false,
  "ps_version": "v2",
  "select_layer": -1,
  "template": "internlm2-chat",
  "torch_dtype": "bfloat16",
  "transformers_version": null,
  "use_backbone_lora": 0,
  "use_llm_lora": 0,
  "use_thumbnail": true,
  "vision_config": {
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": [ "InternVisionModel" ],
    "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null,
    "bos_token_id": null, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null,
    "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false,
    "drop_path_rate": 0.0, "dropout": 0.0, "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0, "eos_token_id": null,
    "exponential_decay_length_penalty": null, "finetuning_task": null,
    "forced_bos_token_id": null, "forced_eos_token_id": null,
    "hidden_act": "gelu", "hidden_size": 1024,
    "id2label": { "0": "LABEL_0", "1": "LABEL_1" },
    "image_size": 448, "initializer_factor": 1.0, "initializer_range": 0.02,
    "intermediate_size": 4096, "is_decoder": false, "is_encoder_decoder": false,
    "label2id": { "LABEL_0": 0, "LABEL_1": 1 },
    "layer_norm_eps": 1e-06, "length_penalty": 1.0, "max_length": 20, "min_length": 0,
    "model_type": "intern_vit_6b", "no_repeat_ngram_size": 0, "norm_type": "layer_norm",
    "num_attention_heads": 16, "num_beam_groups": 1, "num_beams": 1, "num_channels": 3,
    "num_hidden_layers": 24, "num_return_sequences": 1,
    "output_attentions": false, "output_hidden_states": false, "output_scores": false,
    "pad_token_id": null, "patch_size": 14, "prefix": null, "problem_type": null,
    "pruned_heads": {}, "qk_normalization": false, "qkv_bias": true,
    "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true,
    "return_dict_in_generate": false, "sep_token_id": null, "suppress_tokens": null,
    "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false,
    "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null,
    "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false,
    "transformers_version": "4.37.2", "typical_p": 1.0, "use_bfloat16": true,
    "use_flash_attn": true
  }
}
09/15/2024 15:34:51 - INFO - __main__ - Using flash_attention_2 for InternLM
[INFO|modeling_utils.py:3473] 2024-09-15 15:34:51,390 >> loading weights file /workspace/pretrained/InternVL2-8B/model.safetensors.index.json
[INFO|modeling_utils.py:1426] 2024-09-15 15:34:51,391 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:826] 2024-09-15 15:34:51,392 >> Generate config GenerationConfig {}
[INFO|configuration_utils.py:826] 2024-09-15 15:34:51,429 >> Generate config GenerationConfig { "bos_token_id": 1, "eos_token_id": 2, "pad_token_id": 2 }
Loading checkpoint shards: 0%| | 0/4 [00:00
All model checkpoint weights were used when initializing InternVLChatModel.
[INFO|modeling_utils.py:4358] 2024-09-15 15:35:02,656 >> All the weights of InternVLChatModel were initialized from the model checkpoint at /workspace/pretrained/InternVL2-8B.
If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training.
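The dataset lines that follow report num_image_token: 256. That value follows from the model config above (image_size 448, patch_size 14, downsample_ratio 0.5); the 0.5 ratio corresponds to InternVL's pixel-shuffle step, which halves the token grid on each side:

```python
# How num_image_token: 256 is derived from the config values in this log.
image_size = 448        # force_image_size / vision_config.image_size
patch_size = 14         # vision_config.patch_size
downsample_ratio = 0.5  # pixel shuffle halves each side of the token grid

patches_per_side = image_size // patch_size                  # 32 ViT patches per side
tokens_per_side = int(patches_per_side * downsample_ratio)   # 16 after downsampling
num_image_token = tokens_per_side ** 2
print(num_image_token)   # 256 visual tokens per 448x448 tile
```

With dynamic_image_size and use_thumbnail enabled and max_dynamic_patch capped at 6 by the data args, a single image can therefore contribute up to 7 tiles, i.e. 7 × 256 = 1792 visual tokens.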
[INFO|configuration_utils.py:779] 2024-09-15 15:35:02,661 >> loading configuration file /workspace/pretrained/InternVL2-8B/generation_config.json
[INFO|configuration_utils.py:826] 2024-09-15 15:35:02,661 >> Generate config GenerationConfig { "eos_token_id": [ 92542, 92543 ] }
09/15/2024 15:35:02 - INFO - __main__ - Finished
09/15/2024 15:35:02 - INFO - __main__ - model.config.force_image_size: 448
09/15/2024 15:35:02 - INFO - __main__ - data_args.force_image_size: 448
09/15/2024 15:35:02 - INFO - __main__ - model.config.vision_config.image_size: 448
09/15/2024 15:35:02 - INFO - __main__ - [Dataset] num_image_token: 256
09/15/2024 15:35:02 - INFO - __main__ - [Dataset] dynamic_image_size: True
09/15/2024 15:35:02 - INFO - __main__ - [Dataset] use_thumbnail: True
09/15/2024 15:35:02 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6
09/15/2024 15:35:02 - INFO - __main__ - Formatting inputs...Skip in lazy mode
09/15/2024 15:35:03 - INFO - __main__ - Add dataset: amazon-product-dataset with length: 9996
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ -
language_model.base_model.model.model.layers.0.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w3.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w3.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w2.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w2.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wqkv.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wqkv.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wo.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wo.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w1.lora_A.default.weight
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w1.lora_B.default.weight
09/15/2024 15:35:03 - INFO - __main__ -
language_model.base_model.model.model.layers.8.feed_forward.w3.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w3.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w2.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w2.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wqkv.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wqkv.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wo.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wo.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w1.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w1.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w3.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w3.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w2.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w2.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wqkv.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wqkv.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - 
language_model.base_model.model.model.layers.10.attention.wo.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wo.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w1.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w1.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w3.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w3.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w2.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w2.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wqkv.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wqkv.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wo.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wo.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w1.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w1.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w3.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w3.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ 
- language_model.base_model.model.model.layers.11.feed_forward.w2.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w2.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wqkv.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wqkv.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wo.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wo.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w1.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w1.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w3.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w3.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w2.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w2.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wqkv.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wqkv.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wo.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wo.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ 
- language_model.base_model.model.model.layers.13.feed_forward.w1.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w1.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w3.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w3.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w2.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w2.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wqkv.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wqkv.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wo.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wo.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w1.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w1.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w3.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w3.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w2.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w2.lora_B.default.weight 09/15/2024 15:35:03 - INFO - 
__main__ - language_model.base_model.model.model.layers.15.attention.wqkv.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wqkv.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wo.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wo.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w1.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w1.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w3.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w3.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w2.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w2.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wqkv.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wqkv.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wo.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wo.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w1.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w1.lora_B.default.weight 09/15/2024 15:35:03 - INFO - 
__main__ - language_model.base_model.model.model.layers.16.feed_forward.w3.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w3.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w2.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w2.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wqkv.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wqkv.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wo.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wo.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w1.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w1.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w3.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w3.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w2.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w2.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wqkv.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wqkv.lora_B.default.weight 09/15/2024 15:35:03 - 
INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wo.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wo.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w1.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w1.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w3.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w3.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w2.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w2.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wqkv.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wqkv.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wo.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wo.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w1.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w1.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w3.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w3.lora_B.default.weight 09/15/2024 15:35:03 
- INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w2.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w2.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wqkv.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wqkv.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wo.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wo.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w1.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w1.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w3.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w3.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w2.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w2.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wqkv.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wqkv.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wo.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wo.lora_B.default.weight 09/15/2024 15:35:03 
- INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w1.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w1.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w3.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w3.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w2.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w2.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wqkv.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wqkv.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wo.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wo.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w1.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w1.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w3.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w3.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w2.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w2.lora_B.default.weight 09/15/2024 
15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wqkv.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wqkv.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wo.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wo.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w1.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w1.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w3.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w3.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w2.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w2.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wqkv.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wqkv.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wo.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wo.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w1.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w1.lora_B.default.weight 09/15/2024 
15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w3.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w3.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w2.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w2.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wqkv.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wqkv.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wo.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wo.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w1.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w1.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w3.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w3.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w2.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w2.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wqkv.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wqkv.lora_B.default.weight 
09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wo.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wo.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w1.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w1.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w3.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w3.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w2.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w2.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wqkv.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wqkv.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wo.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wo.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w1.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w1.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w3.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - 
language_model.base_model.model.model.layers.27.feed_forward.w3.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w2.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w2.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wqkv.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wqkv.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wo.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wo.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w1.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w1.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w3.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w3.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w2.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w2.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wqkv.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wqkv.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wo.lora_A.default.weight 09/15/2024 15:35:03 - INFO - 
__main__ - language_model.base_model.model.model.layers.29.attention.wo.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w1.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w1.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w3.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w3.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w2.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w2.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wqkv.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wqkv.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wo.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wo.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w1.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w1.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w3.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w3.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w2.lora_A.default.weight 09/15/2024 15:35:03 - 
INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w2.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wqkv.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wqkv.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wo.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wo.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w1.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w1.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w3.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w3.lora_B.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w2.lora_A.default.weight 09/15/2024 15:35:03 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w2.lora_B.default.weight [INFO|trainer.py:571] 2024-09-15 15:35:03,664 >> Using auto half precision backend [2024-09-15 15:35:03,871] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.13.5, git-hash=unknown, git-branch=unknown [2024-09-15 15:35:07,120] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root... Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root... Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root... Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root... 
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu118/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.06660223007202148 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10187005996704102 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10172486305236816 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10187482833862305 seconds
[2024-09-15 15:35:07,875] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2024-09-15 15:35:07,875] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-09-15 15:35:07,915] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2024-09-15 15:35:07,915] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=
[2024-09-15 15:35:07,915] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 1 optimizer
[2024-09-15 15:35:07,915] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 1000000000
[2024-09-15 15:35:07,915] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 1000000000
[2024-09-15 15:35:07,915] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False
[2024-09-15 15:35:07,915] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False
[2024-09-15 15:35:08,212] [INFO] [utils.py:800:see_memory_usage] Before initializing optimizer states
[2024-09-15 15:35:08,213] [INFO] [utils.py:801:see_memory_usage] MA 15.65 GB Max_MA 15.67 GB CA 15.85 GB Max_CA 16 GB
[2024-09-15 15:35:08,213] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 83.43 GB, percent = 8.8%
[2024-09-15 15:35:08,404] [INFO] [utils.py:800:see_memory_usage] After initializing optimizer states
[2024-09-15 15:35:08,405] [INFO] [utils.py:801:see_memory_usage] MA 15.65 GB Max_MA 15.68 GB CA 15.89 GB Max_CA 16 GB
[2024-09-15 15:35:08,406] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 83.54 GB, percent = 8.8%
[2024-09-15 15:35:08,406] [INFO] [stage_1_and_2.py:539:__init__] optimizer state initialized
[2024-09-15 15:35:08,586] [INFO] [utils.py:800:see_memory_usage] After initializing ZeRO optimizer
[2024-09-15 15:35:08,587] [INFO] [utils.py:801:see_memory_usage] MA 15.65 GB Max_MA 15.65 GB CA 15.89 GB Max_CA 16 GB
[2024-09-15 15:35:08,587] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 83.57 GB, percent = 8.8%
[2024-09-15 15:35:08,589] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
[2024-09-15 15:35:08,590] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client callable to create LR scheduler
[2024-09-15 15:35:08,590] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler =
[2024-09-15 15:35:08,590] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[[0.9, 0.999]]
[2024-09-15 15:35:08,594] [INFO] [config.py:996:print] DeepSpeedEngine configuration:
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false }
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print] amp_enabled .................. False
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print] amp_params ...................
False
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 }
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print] bfloat16_enabled ............. True
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print] bfloat16_immediate_grad_update False
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print] checkpoint_parallel_write_pipeline False
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print] checkpoint_tag_validation_enabled True
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print] checkpoint_tag_validation_fail False
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print] comms_config .................
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print] communication_data_type ...... None
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print] compile_config ............... enabled=False backend='inductor' kwargs={}
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print] curriculum_enabled_legacy .... False
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print] curriculum_params_legacy ..... False
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print] data_efficiency_enabled ...... False
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print] dataloader_drop_last ......... False
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print] disable_allgather ............ False
[2024-09-15 15:35:08,595] [INFO] [config.py:1000:print] dump_state ...................
False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] dynamic_loss_scale_args ...... None
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] eigenvalue_enabled ........... False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] eigenvalue_gas_boundary_resolution 1
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] eigenvalue_layer_name ........ bert.encoder.layer
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] eigenvalue_layer_num ......... 0
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] eigenvalue_max_iter .......... 100
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] eigenvalue_stability ......... 1e-06
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] eigenvalue_tol ............... 0.01
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] eigenvalue_verbose ........... False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] elasticity_enabled ........... False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null }
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] fp16_auto_cast ............... None
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] fp16_enabled ................. False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] fp16_master_weights_and_gradients False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] global_rank .................. 0
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] grad_accum_dtype ............. None
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] gradient_accumulation_steps .. 2
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] gradient_clipping ............ 1.0
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] gradient_predivide_factor .... 1.0
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] graph_harvesting ............. False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] initial_dynamic_scale ........ 1
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] load_universal_checkpoint .... False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] loss_scale ................... 1.0
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] memory_breakdown ............. False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] mics_hierarchial_params_gather False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] mics_shard_size .............. -1
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null }
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] optimizer_legacy_fusion ...... False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] optimizer_name ............... adamw
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] optimizer_params ............. {'lr': 4e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.01}
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] pipeline .....................
{'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] pld_enabled .................. False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] pld_params ................... False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] prescale_gradients ........... False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] scheduler_name ............... None
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] scheduler_params ............. None
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] seq_parallel_communication_data_type torch.float32
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] sparse_attention ............. None
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] sparse_gradients_enabled ..... False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] steps_per_print .............. inf
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] train_batch_size ............. 64
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] train_micro_batch_size_per_gpu 8
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] use_data_before_expert_parallel_ False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] use_node_local_storage ....... False
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] wall_clock_breakdown ......... True
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] weight_quantization_config ... None
[2024-09-15 15:35:08,596] [INFO] [config.py:1000:print] world_size ................... 4
[2024-09-15 15:35:08,597] [INFO] [config.py:1000:print] zero_allow_untested_optimizer False
[2024-09-15 15:35:08,597] [INFO] [config.py:1000:print] zero_config .................. stage=1 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-09-15 15:35:08,597] [INFO] [config.py:1000:print] zero_enabled ................. True
[2024-09-15 15:35:08,597] [INFO] [config.py:1000:print] zero_force_ds_cpu_optimizer .. True
[2024-09-15 15:35:08,597] [INFO] [config.py:1000:print] zero_optimization_stage ......
1
[2024-09-15 15:35:08,597] [INFO] [config.py:986:print_user_config] json = {
    "zero_optimization": {
        "stage": 1,
        "allgather_partitions": true,
        "allgather_bucket_size": 1.000000e+09,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 1.000000e+09,
        "contiguous_gradients": true
    },
    "fp16": {
        "enabled": false,
        "auto_cast": true,
        "loss_scale": 0,
        "initial_scale_power": 32,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 4e-05,
            "betas": [0.9, 0.999],
            "eps": 1e-08,
            "weight_decay": 0.01
        }
    },
    "gradient_accumulation_steps": 2,
    "gradient_clipping": 1.0,
    "steps_per_print": inf,
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 8,
    "wall_clock_breakdown": true
}
[INFO|trainer.py:1721] 2024-09-15 15:35:08,597 >> ***** Running training *****
[INFO|trainer.py:1722] 2024-09-15 15:35:08,597 >> Num examples = 9,996
[INFO|trainer.py:1723] 2024-09-15 15:35:08,597 >> Num Epochs = 3
[INFO|trainer.py:1724] 2024-09-15 15:35:08,597 >> Instantaneous batch size per device = 8
[INFO|trainer.py:1727] 2024-09-15 15:35:08,597 >> Total train batch size (w. parallel, distributed & accumulation) = 64
[INFO|trainer.py:1728] 2024-09-15 15:35:08,597 >> Gradient Accumulation steps = 2
[INFO|trainer.py:1729] 2024-09-15 15:35:08,597 >> Total optimization steps = 468
[INFO|trainer.py:1730] 2024-09-15 15:35:08,602 >> Number of trainable parameters = 37,748,736
0%|          | 0/468 [00:00<?, ?it/s]
>> Saving model checkpoint to work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200
[INFO|configuration_utils.py:473] 2024-09-15 16:10:48,756 >> Configuration saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/config.json
[INFO|configuration_utils.py:594] 2024-09-15 16:10:48,756 >> Configuration saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/generation_config.json
[INFO|modeling_utils.py:2501] 2024-09-15 16:11:02,093 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2433] 2024-09-15 16:11:02,095 >> tokenizer config file saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/tokenizer_config.json
[INFO|tokenization_utils_base.py:2442] 2024-09-15 16:11:02,095 >> Special tokens file saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/special_tokens_map.json
[INFO|tokenization_utils_base.py:2493] 2024-09-15 16:11:02,095 >> added tokens file saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/added_tokens.json
[2024-09-15 16:11:02,728] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step200 is about to be saved!
[2024-09-15 16:11:02,755] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/global_step200/mp_rank_00_model_states.pt
[2024-09-15 16:11:02,755] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/global_step200/mp_rank_00_model_states.pt...
[2024-09-15 16:11:18,868] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/global_step200/mp_rank_00_model_states.pt.
[2024-09-15 16:11:18,871] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/global_step200/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2024-09-15 16:11:19,023] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/global_step200/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2024-09-15 16:11:19,024] [INFO] [engine.py:3488:_save_zero_checkpoint] zero checkpoint saved work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/global_step200/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2024-09-15 16:11:19,024] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step200 is ready now!
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:11:24,334] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1888.77 | bwd_microstep: 3398.08 | bwd_inner_microstep: 3398.05 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.17
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:11:29,718] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.61 | optimizer_gradients: 0.33 | optimizer_step: 0.38
[2024-09-15 16:11:29,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1909.78 | bwd_microstep: 3438.14 | bwd_inner_microstep: 3423.26 | bwd_allreduce_microstep: 14.83 | step_microstep: 7.87
[2024-09-15 16:11:29,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3798.52 | bwd: 6836.23 | bwd_inner: 6821.32 | bwd_allreduce: 14.85 | step: 8.07
43%|████▎ | 201/468 [36:21<1:38:27, 22.13s/it] {'loss': 0.1865, 'learning_rate': 2.5543849890683813e-05, 'epoch': 1.28}
43%|████▎ | 201/468 [36:21<1:38:27, 22.13s/it]
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:11:35,046] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1897.09 | bwd_microstep: 3399.42 | bwd_inner_microstep: 3399.39 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.18
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:11:39,191] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.70 | optimizer_gradients: 0.37 | optimizer_step: 0.41
[2024-09-15 16:11:39,192] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1481.94 | bwd_microstep: 2627.42 | bwd_inner_microstep: 2612.37 | bwd_allreduce_microstep: 15.00 | step_microstep: 7.97
[2024-09-15 16:11:39,192] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3379.00 | bwd: 6026.85 | bwd_inner: 6011.77 | bwd_allreduce: 15.02 | step: 8.18
43%|████▎ | 202/468 [36:30<1:21:15, 18.33s/it] {'loss': 0.2203, 'learning_rate': 2.5410451087439075e-05, 'epoch': 1.29}
43%|████▎ | 202/468 [36:30<1:21:15, 18.33s/it]
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:11:44,541] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1904.99 | bwd_microstep: 3413.54 | bwd_inner_microstep: 3413.51 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:11:49,728] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.40 | optimizer_gradients: 0.42 | optimizer_step: 0.39
[2024-09-15 16:11:49,728] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1485.10 | bwd_microstep: 3663.35 | bwd_inner_microstep: 2617.09 | bwd_allreduce_microstep: 1046.20 | step_microstep: 7.82
[2024-09-15 16:11:49,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3390.06 | bwd: 7076.90 | bwd_inner: 6030.60 | bwd_allreduce: 1046.23 | step: 8.04
43%|████▎ | 203/468 [36:41<1:10:37, 15.99s/it] {'loss': 0.2002, 'learning_rate': 2.5276792067547672e-05, 'epoch': 1.3}
43%|████▎ | 203/468 [36:41<1:10:37, 15.99s/it]
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:11:55,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1905.03 | bwd_microstep: 3420.18 | bwd_inner_microstep: 3420.15 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1368
[2024-09-15 16:12:00,414] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.39 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:12:00,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1449.37 | bwd_microstep: 3845.98 | bwd_inner_microstep: 2563.93 | bwd_allreduce_microstep: 1281.99 | step_microstep: 7.80
[2024-09-15 16:12:00,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3354.37 | bwd: 7266.17 | bwd_inner: 5984.08 | bwd_allreduce: 1282.02 | step: 8.04
44%|████▎ | 204/468 [36:51<1:03:21, 14.40s/it] {'loss': 0.1524, 'learning_rate': 2.514287925936492e-05, 'epoch': 1.3}
44%|████▎ | 204/468 [36:51<1:03:21, 14.40s/it]
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:12:05,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1927.69 | bwd_microstep: 3450.13 | bwd_inner_microstep: 3450.10 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.17
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:12:11,213] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.48 | optimizer_gradients: 0.40 | optimizer_step: 0.39
[2024-09-15 16:12:11,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.64 | bwd_microstep: 3441.08 | bwd_inner_microstep: 3426.04 | bwd_allreduce_microstep: 15.00 | step_microstep: 8.34
[2024-09-15 16:12:11,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3840.31 | bwd: 6891.22 | bwd_inner: 6876.14 | bwd_allreduce: 15.01 | step: 8.55
44%|████▍ | 205/468 [37:02<58:23, 13.32s/it] {'loss': 0.2059, 'learning_rate': 2.500871910345212e-05, 'epoch': 1.31}
44%|████▍ | 205/468 [37:02<58:23, 13.32s/it]
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:12:15,293] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1463.02 | bwd_microstep: 2577.56 | bwd_inner_microstep: 2577.54 |
bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:12:21,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.56 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:12:21,899] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1485.53 | bwd_microstep: 5083.45 | bwd_inner_microstep: 2617.76 | bwd_allreduce_microstep: 2465.61 | step_microstep: 7.55
[2024-09-15 16:12:21,899] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2948.52 | bwd: 7661.03 | bwd_inner: 5195.29 | bwd_allreduce: 2465.66 | step: 7.60
44%|████▍ | 206/468 [37:13<54:42, 12.53s/it] {'loss': 0.1885, 'learning_rate': 2.4874318052266794e-05, 'epoch': 1.32}
44%|████▍ | 206/468 [37:13<54:42, 12.53s/it]
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1367
[2024-09-15 16:12:25,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1465.21 | bwd_microstep: 2578.06 | bwd_inner_microstep: 2578.04 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.11
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:12:32,544] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.58 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:12:32,545] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1486.51 | bwd_microstep: 5051.05 | bwd_inner_microstep: 2618.45 | bwd_allreduce_microstep: 2432.54 | step_microstep: 7.87
[2024-09-15 16:12:32,545] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2951.70 | bwd: 7629.13 | bwd_inner: 5196.49 | bwd_allreduce: 2432.57 | step: 8.01
44%|████▍ | 207/468 [37:23<52:02, 11.96s/it] {'loss': 0.1837, 'learning_rate': 2.473968256985238e-05, 'epoch': 1.32}
44%|████▍ | 207/468 [37:23<52:02, 11.96s/it]
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:12:37,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1930.49 | bwd_microstep: 3442.51 | bwd_inner_microstep: 3442.48 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:12:43,333] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.64 | optimizer_gradients: 0.33 | optimizer_step: 0.39
[2024-09-15 16:12:43,333] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1907.89 | bwd_microstep: 3440.34 | bwd_inner_microstep: 3424.89 | bwd_allreduce_microstep: 15.40 | step_microstep: 8.64
[2024-09-15 16:12:43,333] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3838.33 | bwd: 6882.85 | bwd_inner: 6867.37 | bwd_allreduce: 15.42 | step: 8.82
44%|████▍ | 208/468 [37:34<50:18, 11.61s/it] {'loss': 0.1938, 'learning_rate': 2.460481913152734e-05, 'epoch': 1.33}
44%|████▍ | 208/468 [37:34<50:18, 11.61s/it]
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1366
[2024-09-15 16:12:47,372] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1448.32 | bwd_microstep: 2560.37 | bwd_inner_microstep: 2560.26 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.23
dynamic ViT batch size: 37, images per sample: 4.625, dynamic token length: 1901
[2024-09-15 16:12:54,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.30 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:12:54,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1896.65 | bwd_microstep: 4928.88 | bwd_inner_microstep: 3409.12 | bwd_allreduce_microstep: 1519.70 | step_microstep: 7.36
[2024-09-15 16:12:54,233] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3344.95 | bwd: 7489.29 | bwd_inner: 5969.39 | bwd_allreduce: 1519.78 | step: 7.59
45%|████▍ | 209/468 [37:45<49:12, 11.40s/it] {'loss': 0.1497, 'learning_rate': 2.4469734223573703e-05, 'epoch': 1.34}
45%|████▍ | 209/468 [37:45<49:12, 11.40s/it]
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:12:58,306] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1462.24 | bwd_microstep: 2580.32 | bwd_inner_microstep: 2580.30 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1877
[2024-09-15 16:13:04,965] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 4.31 | optimizer_gradients: 0.43 | optimizer_step: 0.38
[2024-09-15 16:13:04,966] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1867.51 | bwd_microstep: 4753.66 | bwd_inner_microstep: 3384.61 | bwd_allreduce_microstep: 1368.99 | step_microstep: 10.62
[2024-09-15 16:13:04,966] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3329.72 | bwd: 7334.00 | bwd_inner: 5964.91 | bwd_allreduce: 1369.02 | step: 10.84
45%|████▍ | 210/468 [37:56<48:09, 11.20s/it] {'loss': 0.2644, 'learning_rate': 2.4334434342925133e-05, 'epoch': 1.34}
45%|████▍ | 210/468 [37:56<48:09, 11.20s/it]
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:13:10,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1907.95 | bwd_microstep: 3428.14 | bwd_inner_microstep: 3428.11 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.21
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:13:15,647] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.51 | optimizer_gradients: 0.42 | optimizer_step: 0.38
[2024-09-15 16:13:15,647] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1488.35 | bwd_microstep: 3792.19 | bwd_inner_microstep: 2623.09 | bwd_allreduce_microstep: 1169.04 | step_microstep: 7.63
[2024-09-15 16:13:15,647] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3396.28 | bwd: 7220.34 | bwd_inner: 6051.20 |
bwd_allreduce: 1169.07 | step: 7.88
45%|████▌ | 211/468 [38:07<47:18, 11.04s/it] {'loss': 0.1937, 'learning_rate': 2.4198925996854422e-05, 'epoch': 1.35}
45%|████▌ | 211/468 [38:07<47:18, 11.04s/it]
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:13:19,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1462.75 | bwd_microstep: 2581.94 | bwd_inner_microstep: 2581.92 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:13:26,313] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 4.28 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:13:26,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.66 | bwd_microstep: 4639.89 | bwd_inner_microstep: 3435.09 | bwd_allreduce_microstep: 1204.74 | step_microstep: 10.22
[2024-09-15 16:13:26,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3375.39 | bwd: 7221.84 | bwd_inner: 6017.01 | bwd_allreduce: 1204.77 | step: 10.45
45%|████▌ | 212/468 [38:17<46:38, 10.93s/it] {'loss': 0.2791, 'learning_rate': 2.4063215702660564e-05, 'epoch': 1.35}
45%|████▌ | 212/468 [38:17<46:38, 10.93s/it]
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:13:30,391] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1464.49 | bwd_microstep: 2582.70 | bwd_inner_microstep: 2582.68 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:13:37,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.54 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:13:37,011] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1914.02 | bwd_microstep: 4670.40 | bwd_inner_microstep: 3436.67 | bwd_allreduce_microstep: 1233.67 | step_microstep: 7.47
[2024-09-15 16:13:37,011] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3378.49 | bwd: 7253.11 | bwd_inner: 6019.34 | bwd_allreduce: 1233.70 | step: 7.69
46%|████▌ | 213/468 [38:28<46:09, 10.86s/it] {'loss': 0.24, 'learning_rate': 2.392730998735529e-05, 'epoch': 1.36}
46%|████▌ | 213/468 [38:28<46:09, 10.86s/it]
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:13:41,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1462.30 | bwd_microstep: 2584.98 | bwd_inner_microstep: 2584.92 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:13:47,735] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.28 | optimizer_gradients: 0.37 | optimizer_step: 0.40
[2024-09-15 16:13:47,735] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1911.33 | bwd_microstep: 4700.18 | bwd_inner_microstep: 3432.19 | bwd_allreduce_microstep: 1267.93 | step_microstep: 7.24
[2024-09-15 16:13:47,735] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3373.61 | bwd: 7285.21 | bwd_inner: 6017.11 | bwd_allreduce: 1267.96 | step: 7.48
46%|████▌ | 214/468 [38:39<45:48, 10.82s/it] {'loss': 0.1574, 'learning_rate': 2.379121538734912e-05, 'epoch': 1.37}
46%|████▌ | 214/468 [38:39<45:48, 10.82s/it]
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1373
[2024-09-15 16:13:51,815] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1464.96 | bwd_microstep: 2584.33 | bwd_inner_microstep: 2584.30 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:13:57,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.65 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:13:57,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1909.29 | bwd_microstep: 3436.46 | bwd_inner_microstep: 3421.46 | bwd_allreduce_microstep: 14.95 | step_microstep: 7.83
[2024-09-15 16:13:57,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3374.23 | bwd: 6020.80 | bwd_inner: 6005.77 | bwd_allreduce: 14.97 | step: 8.01
46%|████▌ | 215/468 [38:48<43:54, 10.41s/it] {'loss': 0.1896, 'learning_rate': 2.3654938448137062e-05, 'epoch': 1.37}
46%|████▌ | 215/468 [38:48<43:54, 10.41s/it]
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:14:01,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1463.61 | bwd_microstep: 2582.80 | bwd_inner_microstep: 2582.77 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:14:06,665] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.64 | optimizer_gradients: 0.34 | optimizer_step: 0.37
[2024-09-15 16:14:06,666] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1911.47 | bwd_microstep: 3446.49 | bwd_inner_microstep: 3431.73 | bwd_allreduce_microstep: 14.71 | step_microstep: 7.99
[2024-09-15 16:14:06,666] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3375.05 | bwd: 6029.29 | bwd_inner: 6014.50 | bwd_allreduce: 14.73 | step: 8.22
46%|████▌ | 216/468 [38:58<42:32, 10.13s/it] {'loss': 0.1865, 'learning_rate': 2.351848572398371e-05, 'epoch': 1.38}
46%|████▌ | 216/468 [38:58<42:32, 10.13s/it]
dynamic ViT batch size: 39, images per sample: 4.875, dynamic token length: 1367
[2024-09-15 16:14:10,743] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1460.14 | bwd_microstep: 2578.63 | bwd_inner_microstep: 2578.60 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:14:17,381] [INFO]
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.52 | optimizer_gradients: 0.38 | optimizer_step: 0.39 [2024-09-15 16:14:17,381] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1490.65 | bwd_microstep: 5111.97 | bwd_inner_microstep: 2626.37 | bwd_allreduce_microstep: 2485.54 | step_microstep: 7.59 [2024-09-15 16:14:17,382] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2950.76 | bwd: 7690.62 | bwd_inner: 5204.97 | bwd_allreduce: 2485.58 | step: 7.84 46%|████▋ | 217/468 [39:08<43:06, 10.31s/it] {'loss': 0.1336, 'learning_rate': 2.338186377760811e-05, 'epoch': 1.39} 46%|████▋ | 217/468 [39:08<43:06, 10.31s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368 [2024-09-15 16:14:21,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1463.53 | bwd_microstep: 2581.25 | bwd_inner_microstep: 2581.22 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389 [2024-09-15 16:14:28,064] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.52 | optimizer_gradients: 0.38 | optimizer_step: 0.39 [2024-09-15 16:14:28,065] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1490.09 | bwd_microstep: 5085.47 | bwd_inner_microstep: 2625.43 | bwd_allreduce_microstep: 2459.98 | step_microstep: 7.53 [2024-09-15 16:14:28,065] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2953.59 | bwd: 7666.73 | bwd_inner: 5206.66 | bwd_allreduce: 2460.01 | step: 7.76 47%|████▋ | 218/468 [39:19<43:24, 10.42s/it] {'loss': 0.2047, 'learning_rate': 2.3245079179868054e-05, 'epoch': 1.39} 47%|████▋ | 218/468 [39:19<43:24, 10.42s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368 [2024-09-15 16:14:32,139] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1462.15 | bwd_microstep: 2578.45 | bwd_inner_microstep: 2578.42 | 
bwd_allreduce_microstep: 0.01 | step_microstep: 0.20 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877 [2024-09-15 16:14:38,794] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.47 | optimizer_gradients: 0.41 | optimizer_step: 0.38 [2024-09-15 16:14:38,794] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.65 | bwd_microstep: 4707.21 | bwd_inner_microstep: 3431.82 | bwd_allreduce_microstep: 1275.32 | step_microstep: 7.51 [2024-09-15 16:14:38,794] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3374.77 | bwd: 7285.66 | bwd_inner: 6010.24 | bwd_allreduce: 1275.35 | step: 7.73 47%|████▋ | 219/468 [39:30<43:37, 10.51s/it] {'loss': 0.31, 'learning_rate': 2.31081385094441e-05, 'epoch': 1.4} 47%|████▋ | 219/468 [39:30<43:37, 10.51s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366 [2024-09-15 16:14:42,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1461.28 | bwd_microstep: 2582.30 | bwd_inner_microstep: 2582.19 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.17 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369 [2024-09-15 16:14:49,640] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.23 | optimizer_gradients: 0.37 | optimizer_step: 0.38 [2024-09-15 16:14:49,640] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1466.10 | bwd_microstep: 5271.56 | bwd_inner_microstep: 2587.27 | bwd_allreduce_microstep: 2684.23 | step_microstep: 7.28 [2024-09-15 16:14:49,641] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2927.36 | bwd: 7853.90 | bwd_inner: 5169.47 | bwd_allreduce: 2684.30 | step: 7.45 47%|████▋ | 220/468 [39:41<43:51, 10.61s/it] {'loss': 0.2096, 'learning_rate': 2.297104835252314e-05, 'epoch': 1.41} 47%|████▋ | 220/468 [39:41<43:51, 10.61s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368 [2024-09-15 
16:14:53,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1463.19 | bwd_microstep: 2577.99 | bwd_inner_microstep: 2577.96 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1900 [2024-09-15 16:14:59,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.67 | optimizer_gradients: 0.34 | optimizer_step: 0.38 [2024-09-15 16:14:59,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.35 | bwd_microstep: 3455.39 | bwd_inner_microstep: 3440.55 | bwd_allreduce_microstep: 14.80 | step_microstep: 7.96 [2024-09-15 16:14:59,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3382.51 | bwd: 6033.40 | bwd_inner: 6018.52 | bwd_allreduce: 14.82 | step: 8.18 47%|████▋ | 221/468 [39:50<42:17, 10.27s/it] {'loss': 0.2593, 'learning_rate': 2.283381530248165e-05, 'epoch': 1.41} 47%|████▋ | 221/468 [39:50<42:17, 10.27s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1366 [2024-09-15 16:15:03,173] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1451.15 | bwd_microstep: 2565.63 | bwd_inner_microstep: 2565.51 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.26 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1900 [2024-09-15 16:15:09,775] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.23 | optimizer_gradients: 0.39 | optimizer_step: 0.38 [2024-09-15 16:15:09,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1936.84 | bwd_microstep: 4629.06 | bwd_inner_microstep: 3469.00 | bwd_allreduce_microstep: 1159.99 | step_microstep: 7.07 [2024-09-15 16:15:09,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3387.96 | bwd: 7194.73 | bwd_inner: 6034.52 | bwd_allreduce: 1160.08 | step: 7.34 47%|████▋ | 222/468 [40:01<42:35, 10.39s/it] {'loss': 0.1585, 'learning_rate': 2.2696445959568577e-05, 'epoch': 
1.42} 47%|████▋ | 222/468 [40:01<42:35, 10.39s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368 [2024-09-15 16:15:13,856] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1464.07 | bwd_microstep: 2585.49 | bwd_inner_microstep: 2585.46 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.18 dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1877 [2024-09-15 16:15:20,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.20 | optimizer_gradients: 0.43 | optimizer_step: 0.40 [2024-09-15 16:15:20,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1932.95 | bwd_microstep: 4716.70 | bwd_inner_microstep: 3457.85 | bwd_allreduce_microstep: 1258.78 | step_microstep: 7.44 [2024-09-15 16:15:20,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3397.00 | bwd: 7302.21 | bwd_inner: 6043.32 | bwd_allreduce: 1258.82 | step: 7.63 48%|████▊ | 223/468 [40:11<42:52, 10.50s/it] {'loss': 0.1781, 'learning_rate': 2.2558946930587907e-05, 'epoch': 1.42} 48%|████▊ | 223/468 [40:11<42:52, 10.50s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878 [2024-09-15 16:15:25,956] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1929.33 | bwd_microstep: 3456.91 | bwd_inner_microstep: 3456.86 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.29 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389 [2024-09-15 16:15:31,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.40 | optimizer_step: 0.40 [2024-09-15 16:15:31,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1491.92 | bwd_microstep: 3819.40 | bwd_inner_microstep: 2626.52 | bwd_allreduce_microstep: 1192.82 | step_microstep: 7.49 [2024-09-15 16:15:31,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3421.23 | bwd: 7276.36 | bwd_inner: 6083.38 | 
bwd_allreduce: 1192.85 | step: 7.78 48%|████▊ | 224/468 [40:22<43:01, 10.58s/it] {'loss': 0.1462, 'learning_rate': 2.2421324828580877e-05, 'epoch': 1.43} 48%|████▊ | 224/468 [40:22<43:01, 10.58s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368 [2024-09-15 16:15:35,381] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1464.55 | bwd_microstep: 2583.18 | bwd_inner_microstep: 2583.06 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.23 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389 [2024-09-15 16:15:42,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.39 | optimizer_step: 0.40 [2024-09-15 16:15:42,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1492.47 | bwd_microstep: 5175.76 | bwd_inner_microstep: 2624.19 | bwd_allreduce_microstep: 2551.50 | step_microstep: 7.44 [2024-09-15 16:15:42,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2956.99 | bwd: 7758.97 | bwd_inner: 5207.26 | bwd_allreduce: 2551.59 | step: 7.68 48%|████▊ | 225/468 [40:33<43:05, 10.64s/it] {'loss': 0.1687, 'learning_rate': 2.2283586272507975e-05, 'epoch': 1.44} 48%|████▊ | 225/468 [40:33<43:05, 10.64s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876 [2024-09-15 16:15:47,462] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1910.51 | bwd_microstep: 3432.18 | bwd_inner_microstep: 3432.10 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878 [2024-09-15 16:15:52,950] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.34 | optimizer_gradients: 0.42 | optimizer_step: 0.40 [2024-09-15 16:15:52,950] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1899.49 | bwd_microstep: 3552.76 | bwd_inner_microstep: 3417.01 | bwd_allreduce_microstep: 135.69 | 
step_microstep: 7.58 [2024-09-15 16:15:52,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3809.98 | bwd: 6984.99 | bwd_inner: 6849.11 | bwd_allreduce: 135.75 | step: 7.84 48%|████▊ | 226/468 [40:44<43:11, 10.71s/it] {'loss': 0.1716, 'learning_rate': 2.214573788693054e-05, 'epoch': 1.44} 48%|████▊ | 226/468 [40:44<43:11, 10.71s/it]dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1369 [2024-09-15 16:15:56,969] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1438.09 | bwd_microstep: 2550.26 | bwd_inner_microstep: 2550.18 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.19 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870 [2024-09-15 16:16:02,453] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.38 | optimizer_gradients: 0.41 | optimizer_step: 0.41 [2024-09-15 16:16:02,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1914.36 | bwd_microstep: 3537.15 | bwd_inner_microstep: 3427.87 | bwd_allreduce_microstep: 109.22 | step_microstep: 7.71 [2024-09-15 16:16:02,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3352.42 | bwd: 6087.45 | bwd_inner: 5978.06 | bwd_allreduce: 109.29 | step: 7.90 49%|████▊ | 227/468 [40:53<41:33, 10.35s/it] {'loss': 0.2694, 'learning_rate': 2.2007786301692205e-05, 'epoch': 1.45} 49%|████▊ | 227/468 [40:53<41:33, 10.35s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878 [2024-09-15 16:16:07,837] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.47 | bwd_microstep: 3437.27 | bwd_inner_microstep: 3437.24 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.20 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878 [2024-09-15 16:16:13,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.72 | optimizer_gradients: 0.33 | optimizer_step: 0.39 [2024-09-15 16:16:13,242] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1916.50 | bwd_microstep: 3453.21 | bwd_inner_microstep: 3439.63 | bwd_allreduce_microstep: 13.53 | step_microstep: 8.02 [2024-09-15 16:16:13,242] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3831.95 | bwd: 6890.49 | bwd_inner: 6876.88 | bwd_allreduce: 13.55 | step: 8.25 49%|████▊ | 228/468 [41:04<41:54, 10.48s/it] {'loss': 0.1615, 'learning_rate': 2.18697381516e-05, 'epoch': 1.46} 49%|████▊ | 228/468 [41:04<41:54, 10.48s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369 [2024-09-15 16:16:17,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.14 | bwd_microstep: 2589.30 | bwd_inner_microstep: 2589.27 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.17 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389 [2024-09-15 16:16:22,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.37 | optimizer_step: 0.38 [2024-09-15 16:16:22,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1492.56 | bwd_microstep: 3881.80 | bwd_inner_microstep: 2628.96 | bwd_allreduce_microstep: 1252.78 | step_microstep: 7.20 [2024-09-15 16:16:22,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2959.67 | bwd: 6471.11 | bwd_inner: 5218.23 | bwd_allreduce: 1252.81 | step: 7.40 49%|████▉ | 229/468 [41:14<40:33, 10.18s/it] {'loss': 0.1891, 'learning_rate': 2.1731600076105264e-05, 'epoch': 1.46} 49%|████▉ | 229/468 [41:14<40:33, 10.18s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878 [2024-09-15 16:16:28,113] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1911.96 | bwd_microstep: 3432.45 | bwd_inner_microstep: 3432.43 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.19 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1900 [2024-09-15 16:16:33,567] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.65 | optimizer_gradients: 0.36 | optimizer_step: 0.39 [2024-09-15 16:16:33,568] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1938.92 | bwd_microstep: 3478.65 | bwd_inner_microstep: 3465.14 | bwd_allreduce_microstep: 13.46 | step_microstep: 7.91 [2024-09-15 16:16:33,568] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3850.86 | bwd: 6911.12 | bwd_inner: 6897.57 | bwd_allreduce: 13.49 | step: 8.13 49%|████▉ | 230/468 [41:24<41:09, 10.38s/it] {'loss': 0.1321, 'learning_rate': 2.159337871898431e-05, 'epoch': 1.47} 49%|████▉ | 230/468 [41:24<41:09, 10.38s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368 [2024-09-15 16:16:37,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.03 | bwd_microstep: 2589.43 | bwd_inner_microstep: 2589.40 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878 [2024-09-15 16:16:43,068] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.67 | optimizer_gradients: 0.37 | optimizer_step: 0.39 [2024-09-15 16:16:43,069] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1920.86 | bwd_microstep: 3449.79 | bwd_inner_microstep: 3434.56 | bwd_allreduce_microstep: 15.18 | step_microstep: 8.04 [2024-09-15 16:16:43,069] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3387.84 | bwd: 6039.23 | bwd_inner: 6023.96 | bwd_allreduce: 15.20 | step: 8.27 49%|████▉ | 231/468 [41:34<39:57, 10.11s/it] {'loss': 0.1706, 'learning_rate': 2.145508072801888e-05, 'epoch': 1.48} 49%|████▉ | 231/468 [41:34<39:57, 10.11s/it]dynamic ViT batch size: 34, images per sample: 4.25, dynamic token length: 1369 [2024-09-15 16:16:47,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1429.03 | bwd_microstep: 2539.21 | bwd_inner_microstep: 2539.05 | bwd_allreduce_microstep: 0.08 
| step_microstep: 0.24 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876 [2024-09-15 16:16:52,513] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.45 | optimizer_gradients: 0.38 | optimizer_step: 0.41 [2024-09-15 16:16:52,513] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.32 | bwd_microstep: 3492.77 | bwd_inner_microstep: 3438.61 | bwd_allreduce_microstep: 54.11 | step_microstep: 7.90 [2024-09-15 16:16:52,513] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3346.33 | bwd: 6032.02 | bwd_inner: 5977.67 | bwd_allreduce: 54.23 | step: 8.15 50%|████▉ | 232/468 [41:43<38:59, 9.91s/it] {'loss': 0.2391, 'learning_rate': 2.131671275467647e-05, 'epoch': 1.48} 50%|████▉ | 232/468 [41:43<38:59, 9.91s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878 [2024-09-15 16:16:57,906] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1922.43 | bwd_microstep: 3440.12 | bwd_inner_microstep: 3440.10 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.17 dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1876 [2024-09-15 16:17:03,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.92 | optimizer_gradients: 0.37 | optimizer_step: 0.41 [2024-09-15 16:17:03,364] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1938.29 | bwd_microstep: 3481.33 | bwd_inner_microstep: 3466.33 | bwd_allreduce_microstep: 14.96 | step_microstep: 9.84 [2024-09-15 16:17:03,364] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3860.69 | bwd: 6921.47 | bwd_inner: 6906.42 | bwd_allreduce: 14.98 | step: 10.04 50%|████▉ | 233/468 [41:54<39:55, 10.19s/it] {'loss': 0.1462, 'learning_rate': 2.1178281453790358e-05, 'epoch': 1.49} 50%|████▉ | 233/468 [41:54<39:55, 10.19s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876 [2024-09-15 16:17:08,755] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1920.93 | bwd_microstep: 3440.04 | bwd_inner_microstep: 3440.01 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.15 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388 [2024-09-15 16:17:12,928] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.58 | optimizer_gradients: 0.35 | optimizer_step: 0.39 [2024-09-15 16:17:12,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1494.32 | bwd_microstep: 2644.28 | bwd_inner_microstep: 2631.54 | bwd_allreduce_microstep: 12.70 | step_microstep: 7.90 [2024-09-15 16:17:12,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3415.22 | bwd: 6084.33 | bwd_inner: 6071.55 | bwd_allreduce: 12.72 | step: 8.08 50%|█████ | 234/468 [42:04<39:01, 10.01s/it] {'loss': 0.1928, 'learning_rate': 2.1039793483239607e-05, 'epoch': 1.5} 50%|█████ | 234/468 [42:04<39:01, 10.01s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366 [2024-09-15 16:17:17,023] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1472.20 | bwd_microstep: 2591.32 | bwd_inner_microstep: 2591.29 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389 [2024-09-15 16:17:22,391] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.41 | optimizer_step: 0.40 [2024-09-15 16:17:22,391] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1496.53 | bwd_microstep: 3836.83 | bwd_inner_microstep: 2633.20 | bwd_allreduce_microstep: 1203.57 | step_microstep: 7.44 [2024-09-15 16:17:22,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2968.70 | bwd: 6428.16 | bwd_inner: 5224.50 | bwd_allreduce: 1203.60 | step: 7.67 50%|█████ | 235/468 [42:13<38:13, 9.84s/it] {'loss': 0.3621, 'learning_rate': 2.090125550362879e-05, 'epoch': 1.5} 50%|█████ | 235/468 
[42:13<38:13, 9.84s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870 [2024-09-15 16:17:27,765] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.23 | bwd_microstep: 3427.94 | bwd_inner_microstep: 3427.91 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.14 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389 [2024-09-15 16:17:31,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.66 | optimizer_gradients: 0.35 | optimizer_step: 0.40 [2024-09-15 16:17:31,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1495.42 | bwd_microstep: 2649.08 | bwd_inner_microstep: 2633.41 | bwd_allreduce_microstep: 15.62 | step_microstep: 8.00 [2024-09-15 16:17:31,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3410.62 | bwd: 6077.03 | bwd_inner: 6061.33 | bwd_allreduce: 15.64 | step: 8.17 50%|█████ | 236/468 [42:23<37:43, 9.76s/it] {'loss': 0.1987, 'learning_rate': 2.0762674177967676e-05, 'epoch': 1.51} 50%|█████ | 236/468 [42:23<37:43, 9.76s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878 [2024-09-15 16:17:37,344] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1921.67 | bwd_microstep: 3442.75 | bwd_inner_microstep: 3442.72 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876 [2024-09-15 16:17:42,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.51 | optimizer_gradients: 0.35 | optimizer_step: 0.42 [2024-09-15 16:17:42,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1923.86 | bwd_microstep: 3454.49 | bwd_inner_microstep: 3439.41 | bwd_allreduce_microstep: 15.01 | step_microstep: 8.02 [2024-09-15 16:17:42,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3845.50 | bwd: 6897.25 | bwd_inner: 6882.13 | bwd_allreduce: 15.04 | step: 8.28 
51%|█████ | 237/468 [42:34<38:46, 10.07s/it] {'loss': 0.2071, 'learning_rate': 2.0624056171350785e-05, 'epoch': 1.51} 51%|█████ | 237/468 [42:34<38:46, 10.07s/it]dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1369 [2024-09-15 16:17:46,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1443.05 | bwd_microstep: 2559.30 | bwd_inner_microstep: 2559.07 | bwd_allreduce_microstep: 0.12 | step_microstep: 0.24 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369 [2024-09-15 16:17:53,484] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.40 | optimizer_step: 0.40 [2024-09-15 16:17:53,484] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1470.74 | bwd_microstep: 5187.07 | bwd_inner_microstep: 2596.75 | bwd_allreduce_microstep: 2590.25 | step_microstep: 7.48 [2024-09-15 16:17:53,484] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2913.77 | bwd: 7746.40 | bwd_inner: 5155.82 | bwd_allreduce: 2590.42 | step: 7.72 51%|█████ | 238/468 [42:44<39:21, 10.27s/it] {'loss': 0.1817, 'learning_rate': 2.0485408150636804e-05, 'epoch': 1.52} 51%|█████ | 238/468 [42:44<39:21, 10.27s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878 [2024-09-15 16:17:58,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1934.20 | bwd_microstep: 3457.89 | bwd_inner_microstep: 3457.86 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.27 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876 [2024-09-15 16:18:04,336] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.55 | optimizer_gradients: 0.41 | optimizer_step: 0.40 [2024-09-15 16:18:04,336] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.27 | bwd_microstep: 3475.18 | bwd_inner_microstep: 3442.42 | bwd_allreduce_microstep: 32.71 | step_microstep: 8.42 [2024-09-15 
16:18:04,336] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3851.44 | bwd: 6933.08 | bwd_inner: 6900.28 | bwd_allreduce: 32.73 | step: 8.70 51%|█████ | 239/468 [42:55<39:51, 10.44s/it] {'loss': 0.1615, 'learning_rate': 2.0346736784127955e-05, 'epoch': 1.53} 51%|█████ | 239/468 [42:55<39:51, 10.44s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369 [2024-09-15 16:18:08,427] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1470.77 | bwd_microstep: 2589.94 | bwd_inner_microstep: 2589.92 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24 dynamic ViT batch size: 39, images per sample: 4.875, dynamic token length: 1369 [2024-09-15 16:18:13,906] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.40 | optimizer_step: 0.41 [2024-09-15 16:18:13,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1463.65 | bwd_microstep: 3980.90 | bwd_inner_microstep: 2580.97 | bwd_allreduce_microstep: 1399.86 | step_microstep: 7.45 [2024-09-15 16:18:13,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2934.40 | bwd: 6570.86 | bwd_inner: 5170.89 | bwd_allreduce: 1399.90 | step: 7.72 51%|█████▏ | 240/468 [43:05<38:41, 10.18s/it] {'loss': 0.1521, 'learning_rate': 2.0208048741249288e-05, 'epoch': 1.53} 51%|█████▏ | 240/468 [43:05<38:41, 10.18s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366 [2024-09-15 16:18:17,990] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.23 | bwd_microstep: 2584.87 | bwd_inner_microstep: 2584.85 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.06 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1876 [2024-09-15 16:18:24,630] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.25 | optimizer_gradients: 0.37 | optimizer_step: 0.38 [2024-09-15 16:18:24,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 1899.41 | bwd_microstep: 4706.68 | bwd_inner_microstep: 3422.92 | bwd_allreduce_microstep: 1283.70 | step_microstep: 7.15
[2024-09-15 16:18:24,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3366.62 | bwd: 7291.56 | bwd_inner: 6007.77 | bwd_allreduce: 1283.73 | step: 7.24
51%|█████▏ | 241/468 [43:16<39:08, 10.34s/it] {'loss': 0.156, 'learning_rate': 2.006935069222789e-05, 'epoch': 1.54}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:18:28,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1464.91 | bwd_microstep: 2590.28 | bwd_inner_microstep: 2590.25 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.17
dynamic ViT batch size: 32, images per sample: 4.0, dynamic token length: 1876
[2024-09-15 16:18:35,428] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.31 | optimizer_gradients: 0.41 | optimizer_step: 0.38
[2024-09-15 16:18:35,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1841.31 | bwd_microstep: 4837.85 | bwd_inner_microstep: 3358.68 | bwd_allreduce_microstep: 1479.11 | step_microstep: 7.36
[2024-09-15 16:18:35,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3306.19 | bwd: 7428.14 | bwd_inner: 5948.93 | bwd_allreduce: 1479.14 | step: 7.56
52%|█████▏ | 242/468 [43:26<39:28, 10.48s/it] {'loss': 0.2438, 'learning_rate': 1.9930649307772114e-05, 'epoch': 1.55}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:18:40,807] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.35 | bwd_microstep: 3434.43 | bwd_inner_microstep: 3434.40 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.06
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:18:46,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.61 | optimizer_gradients: 0.33 | optimizer_step: 0.39
[2024-09-15 16:18:46,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.73 | bwd_microstep: 3446.04 | bwd_inner_microstep: 3431.17 | bwd_allreduce_microstep: 14.83 | step_microstep: 8.24
[2024-09-15 16:18:46,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3827.05 | bwd: 6880.48 | bwd_inner: 6865.57 | bwd_allreduce: 14.85 | step: 8.33
52%|█████▏ | 243/468 [43:37<39:37, 10.57s/it] {'loss': 0.2432, 'learning_rate': 1.979195125875072e-05, 'epoch': 1.55}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:18:50,290] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.64 | bwd_microstep: 2589.70 | bwd_inner_microstep: 2589.68 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.08
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1390
[2024-09-15 16:18:55,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.23 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:18:55,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1490.75 | bwd_microstep: 3904.87 | bwd_inner_microstep: 2629.24 | bwd_allreduce_microstep: 1275.57 | step_microstep: 7.22
[2024-09-15 16:18:55,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2958.37 | bwd: 6494.58 | bwd_inner: 5218.92 | bwd_allreduce: 1275.60 | step: 7.33
52%|█████▏ | 244/468 [43:47<38:16, 10.25s/it] {'loss': 0.1292, 'learning_rate': 1.9653263215872048e-05, 'epoch': 1.56}
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:19:01,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1932.06 | bwd_microstep: 3457.57 | bwd_inner_microstep: 3457.54 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.06
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1900
[2024-09-15 16:19:06,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.63 | optimizer_gradients: 0.33 | optimizer_step: 0.39
[2024-09-15 16:19:06,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1937.89 | bwd_microstep: 3482.87 | bwd_inner_microstep: 3470.59 | bwd_allreduce_microstep: 12.23 | step_microstep: 8.37
[2024-09-15 16:19:06,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3869.94 | bwd: 6940.45 | bwd_inner: 6928.13 | bwd_allreduce: 12.25 | step: 8.46
52%|█████▏ | 245/468 [43:58<38:48, 10.44s/it] {'loss': 0.2293, 'learning_rate': 1.9514591849363203e-05, 'epoch': 1.57}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:19:10,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.10 | bwd_microstep: 2589.55 | bwd_inner_microstep: 2589.52 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.06
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:19:17,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.31 | optimizer_gradients: 0.37 | optimizer_step: 0.39
[2024-09-15 16:19:17,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1491.64 | bwd_microstep: 5220.79 | bwd_inner_microstep: 2630.21 | bwd_allreduce_microstep: 2590.49 | step_microstep: 7.28
[2024-09-15 16:19:17,435] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2959.71 | bwd: 7810.34 | bwd_inner: 5219.73 | bwd_allreduce: 2590.53 | step: 7.37
53%|█████▎ | 246/468 [44:08<39:04, 10.56s/it] {'loss': 0.1932, 'learning_rate': 1.9375943828649215e-05, 'epoch': 1.57}
dynamic ViT batch size: 34, images per sample: 4.25, dynamic token length: 1870
[2024-09-15 16:19:22,671] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1850.22 | bwd_microstep: 3357.03 | bwd_inner_microstep: 3357.01 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 37, images per sample: 4.625, dynamic token length: 1622
[2024-09-15 16:19:28,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.23 | optimizer_gradients: 0.40 | optimizer_step: 0.41
[2024-09-15 16:19:28,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1664.19 | bwd_microstep: 3649.47 | bwd_inner_microstep: 2988.41 | bwd_allreduce_microstep: 661.00 | step_microstep: 7.64
[2024-09-15 16:19:28,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3514.36 | bwd: 7006.54 | bwd_inner: 6345.42 | bwd_allreduce: 661.04 | step: 7.86
53%|█████▎ | 247/468 [44:19<38:55, 10.57s/it] {'loss': 0.1508, 'learning_rate': 1.923732582203233e-05, 'epoch': 1.58}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:19:33,402] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1914.66 | bwd_microstep: 3432.16 | bwd_inner_microstep: 3432.13 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.07
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:19:38,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.62 | optimizer_gradients: 0.34 | optimizer_step: 0.38
[2024-09-15 16:19:38,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1899.23 | bwd_microstep: 3430.29 | bwd_inner_microstep: 3415.46 | bwd_allreduce_microstep: 14.78 | step_microstep: 8.30
[2024-09-15 16:19:38,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3813.86 | bwd: 6862.46 | bwd_inner: 6847.59 | bwd_allreduce: 14.80 | step: 8.40
53%|█████▎ | 248/468 [44:30<38:56, 10.62s/it] {'loss': 0.2317, 'learning_rate': 1.909874449637122e-05, 'epoch': 1.58}
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:19:44,188] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1935.56 | bwd_microstep: 3454.95 | bwd_inner_microstep: 3454.92 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.21
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1369
[2024-09-15 16:19:49,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.44 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:19:49,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1454.85 | bwd_microstep: 3877.99 | bwd_inner_microstep: 2570.42 | bwd_allreduce_microstep: 1307.52 | step_microstep: 7.79
[2024-09-15 16:19:49,557] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3390.38 | bwd: 7332.96 | bwd_inner: 6025.34 | bwd_allreduce: 1307.55 | step: 8.03
53%|█████▎ | 249/468 [44:40<38:57, 10.67s/it] {'loss': 0.218, 'learning_rate': 1.8960206516760396e-05, 'epoch': 1.59}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:19:53,638] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1464.39 | bwd_microstep: 2586.04 | bwd_inner_microstep: 2586.02 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.21
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:20:00,277] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.40 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:20:00,278] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1491.56 | bwd_microstep: 5114.77 | bwd_inner_microstep: 2627.83 | bwd_allreduce_microstep: 2486.88 | step_microstep: 7.28
[2024-09-15 16:20:00,278] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2955.92 | bwd: 7700.83 | bwd_inner: 5213.84 | bwd_allreduce: 2486.92 | step: 7.52
53%|█████▎ | 250/468 [44:51<38:49, 10.69s/it] {'loss': 0.1777, 'learning_rate': 1.8821718546209646e-05, 'epoch': 1.6}
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1369
[2024-09-15 16:20:04,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1450.51 | bwd_microstep: 2567.64 | bwd_inner_microstep: 2567.60 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:20:10,956] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:20:10,957] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1916.38 | bwd_microstep: 4678.51 | bwd_inner_microstep: 3434.70 | bwd_allreduce_microstep: 1243.76 | step_microstep: 7.51
[2024-09-15 16:20:10,957] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3366.86 | bwd: 7246.20 | bwd_inner: 6002.30 | bwd_allreduce: 1243.79 | step: 7.74
54%|█████▎ | 251/468 [45:02<38:38, 10.68s/it] {'loss': 0.1577, 'learning_rate': 1.8683287245323536e-05, 'epoch': 1.6}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1371
[2024-09-15 16:20:15,041] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1466.68 | bwd_microstep: 2587.14 | bwd_inner_microstep: 2587.12 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:20:20,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.54 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:20:20,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1916.24 | bwd_microstep: 3471.73 | bwd_inner_microstep: 3437.58 | bwd_allreduce_microstep: 34.10 | step_microstep: 7.72
[2024-09-15 16:20:20,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3382.89 | bwd: 6058.88 | bwd_inner: 6024.70 | bwd_allreduce: 34.12 | step: 7.95
54%|█████▍ | 252/468 [45:11<37:11, 10.33s/it] {'loss': 0.2977, 'learning_rate': 1.8544919271981125e-05, 'epoch': 1.61}
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1366
[2024-09-15 16:20:24,482] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1438.88 | bwd_microstep: 2549.52 | bwd_inner_microstep: 2549.40 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.23
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1877
[2024-09-15 16:20:31,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.30 | optimizer_gradients: 0.43 | optimizer_step: 0.41
[2024-09-15 16:20:31,157] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1935.14 | bwd_microstep: 4703.71 | bwd_inner_microstep: 3463.29 | bwd_allreduce_microstep: 1240.36 | step_microstep: 7.65
[2024-09-15 16:20:31,157] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3373.99 | bwd: 7253.27 | bwd_inner: 6012.69 | bwd_allreduce: 1240.46 | step: 7.89
54%|█████▍ | 253/468 [45:22<37:24, 10.44s/it] {'loss': 0.2277, 'learning_rate': 1.84066212810157e-05, 'epoch': 1.62}
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1372
[2024-09-15 16:20:35,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1439.40 | bwd_microstep: 2554.83 | bwd_inner_microstep: 2554.80 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:20:41,847] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.42 | optimizer_step: 0.41
[2024-09-15 16:20:41,848] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1492.26 | bwd_microstep: 5140.47 | bwd_inner_microstep: 2625.83 | bwd_allreduce_microstep: 2514.58 | step_microstep: 7.57
[2024-09-15 16:20:41,848] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2931.62 | bwd: 7695.32 | bwd_inner: 5180.63 | bwd_allreduce: 2514.62 | step: 7.79
54%|█████▍ | 254/468 [45:33<37:30, 10.52s/it] {'loss': 0.1574, 'learning_rate': 1.8268399923894736e-05, 'epoch': 1.62}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:20:45,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1466.20 | bwd_microstep: 2584.09 | bwd_inner_microstep: 2584.06 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:20:52,579] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.42 | optimizer_step: 0.40
[2024-09-15 16:20:52,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1491.79 | bwd_microstep: 5125.29 | bwd_inner_microstep: 2625.60 | bwd_allreduce_microstep: 2499.63 | step_microstep: 7.48
[2024-09-15 16:20:52,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2957.96 | bwd: 7709.42 | bwd_inner: 5209.66 | bwd_allreduce: 2499.66 | step: 7.71
54%|█████▍ | 255/468 [45:43<37:33, 10.58s/it] {'loss': 0.3187, 'learning_rate': 1.8130261848399996e-05, 'epoch': 1.63}
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1366
[2024-09-15 16:20:56,621] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1449.25 | bwd_microstep: 2561.72 | bwd_inner_microstep: 2561.63 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:21:03,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:21:03,410] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.78 | bwd_microstep: 4839.59 | bwd_inner_microstep: 3432.58 | bwd_allreduce_microstep: 1406.95 | step_microstep: 7.53
[2024-09-15 16:21:03,410] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3362.01 | bwd: 7401.35 | bwd_inner: 5994.21 | bwd_allreduce: 1407.02 | step: 7.77
55%|█████▍ | 256/468 [45:54<37:38, 10.66s/it] {'loss': 0.1758, 'learning_rate': 1.7992213698307795e-05, 'epoch': 1.64}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1881
[2024-09-15 16:21:08,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.48 | bwd_microstep: 3431.87 | bwd_inner_microstep: 3431.85 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1387
[2024-09-15 16:21:14,206] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:21:14,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1490.19 | bwd_microstep: 3898.96 | bwd_inner_microstep: 2620.66 | bwd_allreduce_microstep: 1278.23 | step_microstep: 7.44
[2024-09-15 16:21:14,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3402.65 | bwd: 7330.84 | bwd_inner: 6052.51 | bwd_allreduce: 1278.26 | step: 7.52
55%|█████▍ | 257/468 [46:05<37:37, 10.70s/it] {'loss': 0.2118, 'learning_rate': 1.7854262113069468e-05, 'epoch': 1.64}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:21:19,575] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1910.55 | bwd_microstep: 3427.73 | bwd_inner_microstep: 3427.70 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.19
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:21:25,011] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.44 | optimizer_gradients: 0.38 | optimizer_step: 0.40
[2024-09-15 16:21:25,011] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1490.06 | bwd_microstep: 3910.98 | bwd_inner_microstep: 2626.17 | bwd_allreduce_microstep: 1284.75 | step_microstep: 7.72
[2024-09-15 16:21:25,012] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3400.58 | bwd: 7338.72 | bwd_inner: 6053.87 | bwd_allreduce: 1284.78 | step: 7.94
55%|█████▌ | 258/468 [46:16<37:33, 10.73s/it] {'loss': 0.1539, 'learning_rate': 1.7716413727492035e-05, 'epoch': 1.65}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:21:29,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1463.51 | bwd_microstep: 2587.34 | bwd_inner_microstep: 2587.32 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:21:34,481] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.46 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:21:34,481] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1489.36 | bwd_microstep: 3862.68 | bwd_inner_microstep: 2628.35 | bwd_allreduce_microstep: 1234.27 | step_microstep: 7.48
[2024-09-15 16:21:34,482] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2952.84 | bwd: 6450.06 | bwd_inner: 5215.67 | bwd_allreduce: 1234.30 | step: 7.71
55%|█████▌ | 259/468 [46:25<36:03, 10.35s/it] {'loss': 0.1885, 'learning_rate': 1.757867517141913e-05, 'epoch': 1.65}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:21:38,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1463.16 | bwd_microstep: 2587.52 | bwd_inner_microstep: 2587.49 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.27
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:21:45,212] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.59 | optimizer_gradients: 0.39 | optimizer_step: 0.39
[2024-09-15 16:21:45,213] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1488.31 | bwd_microstep: 5124.55 | bwd_inner_microstep: 2628.01 | bwd_allreduce_microstep: 2496.47 | step_microstep: 9.10
[2024-09-15 16:21:45,213] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2951.45 | bwd: 7712.10 | bwd_inner: 5215.50 | bwd_allreduce: 2496.50 | step: 9.37
56%|█████▌ | 260/468 [46:36<36:16, 10.47s/it] {'loss': 0.1915, 'learning_rate': 1.7441053069412103e-05, 'epoch': 1.66}
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:21:50,628] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1927.62 | bwd_microstep: 3452.11 | bwd_inner_microstep: 3452.09 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:21:56,092] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.58 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:21:56,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1895.36 | bwd_microstep: 3534.30 | bwd_inner_microstep: 3415.17 | bwd_allreduce_microstep: 119.07 | step_microstep: 7.69
[2024-09-15 16:21:56,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3822.96 | bwd: 6986.43 | bwd_inner: 6867.26 | bwd_allreduce: 119.10 | step: 7.92
56%|█████▌ | 261/468 [46:47<36:32, 10.59s/it] {'loss': 0.1716, 'learning_rate': 1.7303554040431426e-05, 'epoch': 1.67}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:22:01,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.47 | bwd_microstep: 3434.54 | bwd_inner_microstep: 3434.51 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1389
[2024-09-15 16:22:06,815] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.44 | optimizer_gradients: 0.40 | optimizer_step: 0.38
[2024-09-15 16:22:06,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1476.11 | bwd_microstep: 3832.31 | bwd_inner_microstep: 2607.99 | bwd_allreduce_microstep: 1224.26 | step_microstep: 8.20
[2024-09-15 16:22:06,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3388.56 | bwd: 7266.86 | bwd_inner: 6042.50 | bwd_allreduce: 1224.29 | step: 8.45
56%|█████▌ | 262/468 [46:58<36:29, 10.63s/it] {'loss': 0.2217, 'learning_rate': 1.7166184697518352e-05, 'epoch': 1.67}
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1369
[2024-09-15 16:22:10,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1450.19 | bwd_microstep: 2568.85 | bwd_inner_microstep: 2568.78 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:22:17,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.25 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:22:17,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1930.05 | bwd_microstep: 4762.21 | bwd_inner_microstep: 3460.89 | bwd_allreduce_microstep: 1301.26 | step_microstep: 7.28
[2024-09-15 16:22:17,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3380.22 | bwd: 7331.10 | bwd_inner: 6029.67 | bwd_allreduce: 1301.30 | step: 7.51
56%|█████▌ | 263/468 [47:08<36:28, 10.67s/it] {'loss': 0.3266, 'learning_rate': 1.7028951647476862e-05, 'epoch': 1.68}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1877
[2024-09-15 16:22:22,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1896.53 | bwd_microstep: 3415.85 | bwd_inner_microstep: 3415.82 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1877
[2024-09-15 16:22:28,382] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 4.24 | optimizer_gradients: 0.34 | optimizer_step: 0.38
[2024-09-15 16:22:28,382] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1929.19 | bwd_microstep: 3476.65 | bwd_inner_microstep: 3461.58 | bwd_allreduce_microstep: 15.02 | step_microstep: 11.85
[2024-09-15 16:22:28,382] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3825.69 | bwd: 6892.51 | bwd_inner: 6877.41 | bwd_allreduce: 15.04 | step: 11.92
56%|█████▋ | 264/468 [47:19<36:24, 10.71s/it] {'loss': 0.2631, 'learning_rate': 1.6891861490555906e-05, 'epoch': 1.69}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:22:32,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1466.71 | bwd_microstep: 2591.00 | bwd_inner_microstep: 2590.97 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:22:39,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.47 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:22:39,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.76 | bwd_microstep: 4653.12 | bwd_inner_microstep: 3440.73 | bwd_allreduce_microstep: 1212.32 | step_microstep: 7.50
[2024-09-15 16:22:39,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3382.44 | bwd: 7244.13 | bwd_inner: 6031.70 | bwd_allreduce: 1212.36 | step: 7.72
57%|█████▋ | 265/468 [47:30<36:12, 10.70s/it] {'loss': 0.2481, 'learning_rate': 1.6754920820131946e-05, 'epoch': 1.69}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:22:44,455] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1911.35 | bwd_microstep: 3438.35 | bwd_inner_microstep: 3438.32 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25
dynamic ViT batch size: 46, images per sample: 5.75, dynamic token length: 1900
[2024-09-15 16:22:49,986] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.43 | optimizer_gradients: 0.36 | optimizer_step: 0.39
[2024-09-15 16:22:49,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1967.31 | bwd_microstep: 3527.33 | bwd_inner_microstep: 3512.46 | bwd_allreduce_microstep: 14.82 | step_microstep: 8.09
[2024-09-15 16:22:49,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3878.64 | bwd: 6965.69 | bwd_inner: 6950.78 | bwd_allreduce: 14.84 | step: 8.36
57%|█████▋ | 266/468 [47:41<36:14, 10.77s/it] {'loss': 0.2521, 'learning_rate': 1.6618136222391893e-05, 'epoch': 1.7}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:22:54,076] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.75 | bwd_microstep: 2591.28 | bwd_inner_microstep: 2591.13 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.23
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:22:59,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.44 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:22:59,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1492.78 | bwd_microstep: 3825.57 | bwd_inner_microstep: 2632.39 | bwd_allreduce_microstep: 1193.12 | step_microstep: 7.51
[2024-09-15 16:22:59,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2960.51 | bwd: 6416.89 | bwd_inner: 5223.52 | bwd_allreduce: 1193.24 | step: 7.74
57%|█████▋ | 267/468 [47:50<34:44, 10.37s/it] {'loss': 0.2022, 'learning_rate': 1.6481514276016297e-05, 'epoch': 1.71}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1367
[2024-09-15 16:23:03,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1466.16 | bwd_microstep: 2591.28 | bwd_inner_microstep: 2591.25 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:23:08,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.35 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:23:08,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.06 | bwd_microstep: 3816.86 | bwd_inner_microstep: 2595.20 | bwd_allreduce_microstep: 1221.61 | step_microstep: 7.47
[2024-09-15 16:23:08,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2934.20 | bwd: 6408.16 | bwd_inner: 5186.45 | bwd_allreduce: 1221.64 | step: 7.70
57%|█████▋ | 268/468 [48:00<33:36, 10.08s/it] {'loss': 0.2055, 'learning_rate': 1.634506155186295e-05, 'epoch': 1.71}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:23:12,924] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1462.98 | bwd_microstep: 2591.38 | bwd_inner_microstep: 2591.34 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1389
[2024-09-15 16:23:19,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.43 | optimizer_gradients: 0.38 | optimizer_step: 0.39
[2024-09-15 16:23:19,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1461.41 | bwd_microstep: 5146.82 | bwd_inner_microstep: 2595.77 | bwd_allreduce_microstep: 2550.99 | step_microstep: 7.51
[2024-09-15 16:23:19,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2924.37 | bwd: 7738.24 | bwd_inner: 5187.11 | bwd_allreduce: 2551.01 | step: 7.74
57%|█████▋ | 269/468 [48:10<34:04, 10.28s/it] {'loss': 0.2257, 'learning_rate': 1.6208784612650883e-05, 'epoch': 1.72}
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1878
[2024-09-15 16:23:24,878] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1881.65 | bwd_microstep: 3399.46 | bwd_inner_microstep: 3399.43 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:23:30,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.74 | optimizer_gradients: 0.33 | optimizer_step: 0.38
[2024-09-15 16:23:30,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.13 | bwd_microstep: 3453.78 | bwd_inner_microstep: 3438.73 | bwd_allreduce_microstep: 15.00 | step_microstep: 8.00
[2024-09-15 16:23:30,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3794.75 | bwd: 6853.27 | bwd_inner: 6838.17 | bwd_allreduce: 15.03 | step: 8.23
58%|█████▊ | 270/468 [48:21<34:20, 10.41s/it] {'loss': 0.2634, 'learning_rate': 1.6072690012644717e-05, 'epoch': 1.73}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:23:35,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1911.89 | bwd_microstep: 3435.28 | bwd_inner_microstep: 3435.26 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1390
[2024-09-15 16:23:39,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.69 | optimizer_gradients: 0.33 | optimizer_step: 0.40
[2024-09-15 16:23:39,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1478.06 | bwd_microstep: 2624.48 | bwd_inner_microstep: 2609.43 | bwd_allreduce_microstep: 15.00 | step_microstep: 7.92
[2024-09-15 16:23:39,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3389.92 | bwd: 6059.77 | bwd_inner: 6044.69 | bwd_allreduce: 15.02 | step: 8.18
58%|█████▊ | 271/468 [48:31<33:17, 10.14s/it] {'loss': 0.2195, 'learning_rate': 1.593678429733944e-05, 'epoch': 1.73}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:23:43,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.10 | bwd_microstep: 2592.67 | bwd_inner_microstep: 2592.64 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:23:50,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:23:50,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.78 | bwd_microstep: 4659.90 | bwd_inner_microstep: 3440.25 | bwd_allreduce_microstep: 1219.60 | step_microstep: 7.21
[2024-09-15 16:23:50,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3385.85 | bwd: 7252.61 | bwd_inner: 6032.89 | bwd_allreduce: 1219.63 | step: 7.43
58%|█████▊ | 272/468 [48:41<33:40, 10.31s/it] {'loss': 0.2063, 'learning_rate': 1.5801074003145585e-05, 'epoch': 1.74}
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1369
[2024-09-15 16:23:54,525] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1437.33 | bwd_microstep: 2556.19 | bwd_inner_microstep: 2556.09 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:24:01,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.48 | optimizer_gradients: 0.38 | optimizer_step: 0.39
[2024-09-15 16:24:01,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1899.58 | bwd_microstep: 4745.28 | bwd_inner_microstep: 3425.19 | bwd_allreduce_microstep: 1320.03 | step_microstep: 7.60
[2024-09-15 16:24:01,205] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3336.89 | bwd: 7301.51 | bwd_inner: 5981.28 | bwd_allreduce: 1320.10 | step: 7.82
58%|█████▊ | 273/468 [48:52<33:53, 10.43s/it] {'loss': 0.2601, 'learning_rate': 1.5665565657074874e-05, 'epoch': 1.74}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:24:06,569] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1908.38 | bwd_microstep: 3425.33 | bwd_inner_microstep: 3425.30 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1387
[2024-09-15 16:24:11,928] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.24 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:24:11,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1489.56 | bwd_microstep: 3833.44 | bwd_inner_microstep: 2628.83 | bwd_allreduce_microstep: 1204.55 | step_microstep: 7.51
[2024-09-15 16:24:11,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3397.92 | bwd: 7258.78 | bwd_inner: 6054.14 | bwd_allreduce: 1204.58 | step: 7.74
59%|█████▊ | 274/468 [49:03<34:00, 10.52s/it] {'loss': 0.1633, 'learning_rate': 1.5530265776426294e-05, 'epoch': 1.75}
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:24:17,355] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1934.72 | bwd_microstep: 3459.21 | bwd_inner_microstep: 3459.18 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:24:22,755] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 4.25 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:24:22,756] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.35 | bwd_microstep: 3448.09 | bwd_inner_microstep: 3433.14 | bwd_allreduce_microstep: 14.91 | step_microstep: 11.43
[2024-09-15 16:24:22,756] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3848.05 | bwd: 6907.31 | bwd_inner: 6892.32 | bwd_allreduce: 14.92 | step: 11.66
59%|█████▉ | 275/468 [49:14<34:07, 10.61s/it] {'loss': 0.2333, 'learning_rate': 1.5395180868472662e-05, 'epoch': 1.76}
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:24:28,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1935.06 | bwd_microstep: 3464.28 | bwd_inner_microstep: 3464.25 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.06
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1366
[2024-09-15 16:24:33,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.29 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:24:33,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1441.92 | bwd_microstep: 3729.41 | bwd_inner_microstep: 2558.73 | bwd_allreduce_microstep: 1170.62 | step_microstep: 7.57
[2024-09-15 16:24:33,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3376.96 | bwd: 7193.70 | bwd_inner: 6022.98 | bwd_allreduce: 1170.65 | step: 7.66
59%|█████▉ | 276/468 [49:24<33:58, 10.62s/it] {'loss': 0.149, 'learning_rate': 1.5260317430147627e-05, 'epoch': 1.76}
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1368
[2024-09-15 16:24:37,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1453.66 | bwd_microstep: 2572.05 | bwd_inner_microstep: 2571.96 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1900
[2024-09-15 16:24:44,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.39 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:24:44,118] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1941.00 | bwd_microstep: 4693.81 | bwd_inner_microstep: 3471.86 | bwd_allreduce_microstep: 1221.89 | step_microstep: 7.46
[2024-09-15 16:24:44,118] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3394.64 | bwd: 7265.92 | bwd_inner: 6043.82 | bwd_allreduce: 1221.95 | step: 7.68
59%|█████▉ | 277/468 [49:35<33:54, 10.65s/it] {'loss': 0.2587, 'learning_rate': 1.512568194773322e-05, 'epoch': 1.77}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:24:48,206] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1466.20 | bwd_microstep: 2590.33 | bwd_inner_microstep: 2590.31 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1877
[2024-09-15 16:24:53,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.66 | optimizer_gradients: 0.33 | optimizer_step: 0.38
[2024-09-15 16:24:53,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1932.36 | bwd_microstep: 3475.76 | bwd_inner_microstep: 3460.82 | bwd_allreduce_microstep: 14.90 | step_microstep: 7.86
[2024-09-15 16:24:53,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3398.52 | bwd: 6066.11 | bwd_inner: 6051.13 | bwd_allreduce: 14.92 | step: 7.92
59%|█████▉ | 278/468 [49:45<32:39, 10.31s/it] {'loss': 0.1951, 'learning_rate': 1.4991280896547893e-05, 'epoch': 1.78}
dynamic ViT batch size: 48, images per sample: 6.0, dynamic token length: 1879
[2024-09-15 16:24:59,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1965.62 | bwd_microstep: 3507.94 | bwd_inner_microstep: 3507.92 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:25:03,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.60 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:25:03,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1493.23 | bwd_microstep: 2646.31 | bwd_inner_microstep: 2631.14 | bwd_allreduce_microstep: 15.12 | step_microstep: 7.94
[2024-09-15 16:25:03,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3458.84 | bwd: 6154.27 | bwd_inner: 6139.06 | bwd_allreduce: 15.14 | step: 8.20
60%|█████▉ | 279/468 [49:54<31:53, 10.12s/it] {'loss': 0.2757, 'learning_rate': 1.4857120740635084e-05, 'epoch': 1.78}
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1881
[2024-09-15 16:25:08,757] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1935.30 | bwd_microstep: 3460.26 | bwd_inner_microstep: 3460.24 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1900
[2024-09-15 16:25:14,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.57 | optimizer_gradients: 0.36 | optimizer_step: 0.40
[2024-09-15 16:25:14,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1941.68 | bwd_microstep: 3488.02 | bwd_inner_microstep: 3473.04 |
bwd_allreduce_microstep: 14.93 | step_microstep: 8.30 [2024-09-15 16:25:14,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3876.97 | bwd: 6948.29 | bwd_inner: 6933.28 | bwd_allreduce: 14.95 | step: 8.53 60%|█████▉ | 280/468 [50:05<32:26, 10.36s/it] {'loss': 0.2321, 'learning_rate': 1.472320793245233e-05, 'epoch': 1.79} 60%|█████▉ | 280/468 [50:05<32:26, 10.36s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369 [2024-09-15 16:25:18,321] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.96 | bwd_microstep: 2593.94 | bwd_inner_microstep: 2593.91 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388 [2024-09-15 16:25:24,844] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.82 | optimizer_gradients: 0.42 | optimizer_step: 0.41 [2024-09-15 16:25:24,844] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1495.07 | bwd_microstep: 4991.37 | bwd_inner_microstep: 2633.57 | bwd_allreduce_microstep: 2357.74 | step_microstep: 10.55 [2024-09-15 16:25:24,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2967.01 | bwd: 7585.35 | bwd_inner: 5227.48 | bwd_allreduce: 2357.77 | step: 10.80 60%|██████ | 281/468 [50:16<32:31, 10.43s/it] {'loss': 0.2177, 'learning_rate': 1.4589548912560932e-05, 'epoch': 1.8} 60%|██████ | 281/468 [50:16<32:31, 10.43s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870 [2024-09-15 16:25:30,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.02 | bwd_microstep: 3425.62 | bwd_inner_microstep: 3425.59 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.16 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1877 [2024-09-15 16:25:35,588] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.53 | optimizer_gradients: 0.35 | optimizer_step: 0.40 [2024-09-15 
16:25:35,588] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1902.69 | bwd_microstep: 3435.61 | bwd_inner_microstep: 3420.44 | bwd_allreduce_microstep: 15.12 | step_microstep: 8.26 [2024-09-15 16:25:35,589] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3815.68 | bwd: 6861.24 | bwd_inner: 6846.03 | bwd_allreduce: 15.15 | step: 8.44 60%|██████ | 282/468 [50:26<32:38, 10.53s/it] {'loss': 0.1363, 'learning_rate': 1.4456150109316192e-05, 'epoch': 1.8} 60%|██████ | 282/468 [50:26<32:38, 10.53s/it]dynamic ViT batch size: 34, images per sample: 4.25, dynamic token length: 1880 [2024-09-15 16:25:40,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1862.89 | bwd_microstep: 3377.19 | bwd_inner_microstep: 3376.95 | bwd_allreduce_microstep: 0.14 | step_microstep: 0.23 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876 [2024-09-15 16:25:46,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.39 | optimizer_gradients: 0.35 | optimizer_step: 0.40 [2024-09-15 16:25:46,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.98 | bwd_microstep: 3451.42 | bwd_inner_microstep: 3436.46 | bwd_allreduce_microstep: 14.91 | step_microstep: 8.10 [2024-09-15 16:25:46,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3782.85 | bwd: 6828.65 | bwd_inner: 6813.41 | bwd_allreduce: 15.07 | step: 8.34 60%|██████ | 283/468 [50:37<32:35, 10.57s/it] {'loss': 0.1428, 'learning_rate': 1.4323017938558245e-05, 'epoch': 1.81} 60%|██████ | 283/468 [50:37<32:35, 10.57s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1881 [2024-09-15 16:25:51,700] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1938.91 | bwd_microstep: 3463.31 | bwd_inner_microstep: 3463.29 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.31 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878 [2024-09-15 16:25:57,079] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.36 | optimizer_gradients: 0.36 | optimizer_step: 0.40 [2024-09-15 16:25:57,080] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1905.37 | bwd_microstep: 3439.46 | bwd_inner_microstep: 3424.64 | bwd_allreduce_microstep: 14.78 | step_microstep: 8.06 [2024-09-15 16:25:57,080] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3844.25 | bwd: 6902.79 | bwd_inner: 6887.93 | bwd_allreduce: 14.80 | step: 8.40 61%|██████ | 284/468 [50:48<32:38, 10.64s/it] {'loss': 0.2112, 'learning_rate': 1.4190158803303498e-05, 'epoch': 1.81} 61%|██████ | 284/468 [50:48<32:38, 10.64s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1367 [2024-09-15 16:26:01,178] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.60 | bwd_microstep: 2595.56 | bwd_inner_microstep: 2595.46 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878 [2024-09-15 16:26:07,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.52 | optimizer_gradients: 0.39 | optimizer_step: 0.40 [2024-09-15 16:26:07,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.30 | bwd_microstep: 4655.48 | bwd_inner_microstep: 3442.48 | bwd_allreduce_microstep: 1212.95 | step_microstep: 7.80 [2024-09-15 16:26:07,789] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3390.88 | bwd: 7251.08 | bwd_inner: 6037.94 | bwd_allreduce: 1213.01 | step: 8.04 61%|██████ | 285/468 [50:59<32:31, 10.66s/it] {'loss': 0.2269, 'learning_rate': 1.4057579093436653e-05, 'epoch': 1.82} 61%|██████ | 285/468 [50:59<32:31, 10.66s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877 [2024-09-15 16:26:13,176] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.23 | bwd_microstep: 3438.48 | bwd_inner_microstep: 3438.40 | 
bwd_allreduce_microstep: 0.01 | step_microstep: 0.23 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389 [2024-09-15 16:26:18,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.37 | optimizer_gradients: 0.38 | optimizer_step: 0.40 [2024-09-15 16:26:18,543] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1492.62 | bwd_microstep: 3839.42 | bwd_inner_microstep: 2631.24 | bwd_allreduce_microstep: 1208.12 | step_microstep: 7.76 [2024-09-15 16:26:18,543] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3410.83 | bwd: 7277.94 | bwd_inner: 6069.65 | bwd_allreduce: 1208.17 | step: 8.00 61%|██████ | 286/468 [51:09<32:25, 10.69s/it] {'loss': 0.1712, 'learning_rate': 1.3925285185403406e-05, 'epoch': 1.83} 61%|██████ | 286/468 [51:09<32:25, 10.69s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368 [2024-09-15 16:26:22,636] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.34 | bwd_microstep: 2591.51 | bwd_inner_microstep: 2591.48 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.34 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388 [2024-09-15 16:26:29,216] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.47 | optimizer_gradients: 0.39 | optimizer_step: 0.40 [2024-09-15 16:26:29,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1495.09 | bwd_microstep: 5051.04 | bwd_inner_microstep: 2630.83 | bwd_allreduce_microstep: 2420.15 | step_microstep: 7.76 [2024-09-15 16:26:29,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2966.39 | bwd: 7642.58 | bwd_inner: 5222.31 | bwd_allreduce: 2420.18 | step: 8.11 61%|██████▏ | 287/468 [51:20<32:14, 10.69s/it] {'loss': 0.1182, 'learning_rate': 1.3793283441903737e-05, 'epoch': 1.83} 61%|██████▏ | 287/468 [51:20<32:14, 10.69s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369 [2024-09-15 
16:26:33,304] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.32 | bwd_microstep: 2588.99 | bwd_inner_microstep: 2588.94 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25 dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1878 [2024-09-15 16:26:39,943] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.52 | optimizer_gradients: 0.41 | optimizer_step: 0.40 [2024-09-15 16:26:39,944] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1890.14 | bwd_microstep: 4714.75 | bwd_inner_microstep: 3399.82 | bwd_allreduce_microstep: 1314.87 | step_microstep: 7.90 [2024-09-15 16:26:39,944] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3357.42 | bwd: 7303.78 | bwd_inner: 5988.76 | bwd_allreduce: 1314.90 | step: 8.16 62%|██████▏ | 288/468 [51:31<32:05, 10.70s/it] {'loss': 0.1667, 'learning_rate': 1.3661580211585947e-05, 'epoch': 1.84} 62%|██████▏ | 288/468 [51:31<32:05, 10.70s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1880 [2024-09-15 16:26:45,364] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1932.70 | bwd_microstep: 3456.86 | bwd_inner_microstep: 3456.83 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22 dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1876 [2024-09-15 16:26:50,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.60 | optimizer_gradients: 0.36 | optimizer_step: 0.40 [2024-09-15 16:26:50,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1936.41 | bwd_microstep: 3472.38 | bwd_inner_microstep: 3457.56 | bwd_allreduce_microstep: 14.78 | step_microstep: 7.93 [2024-09-15 16:26:50,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3869.07 | bwd: 6929.25 | bwd_inner: 6914.39 | bwd_allreduce: 14.80 | step: 8.16 62%|██████▏ | 289/468 [51:42<32:03, 10.75s/it] {'loss': 0.2887, 'learning_rate': 1.3530181828741285e-05, 
'epoch': 1.85} 62%|██████▏ | 289/468 [51:42<32:03, 10.75s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876 [2024-09-15 16:26:56,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.28 | bwd_microstep: 3438.92 | bwd_inner_microstep: 3438.88 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.30 dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1877 [2024-09-15 16:27:01,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.52 | optimizer_gradients: 0.35 | optimizer_step: 0.40 [2024-09-15 16:27:01,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1890.66 | bwd_microstep: 3420.55 | bwd_inner_microstep: 3404.82 | bwd_allreduce_microstep: 15.67 | step_microstep: 8.52 [2024-09-15 16:27:01,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3807.90 | bwd: 6859.50 | bwd_inner: 6843.71 | bwd_allreduce: 15.70 | step: 8.82 62%|██████▏ | 290/468 [51:52<31:52, 10.74s/it] {'loss': 0.1624, 'learning_rate': 1.3399094612999291e-05, 'epoch': 1.85} 62%|██████▏ | 290/468 [51:52<31:52, 10.74s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878 [2024-09-15 16:27:06,895] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1901.96 | bwd_microstep: 3420.08 | bwd_inner_microstep: 3419.90 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.24 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878 [2024-09-15 16:27:12,308] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.57 | optimizer_gradients: 0.35 | optimizer_step: 0.40 [2024-09-15 16:27:12,309] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.53 | bwd_microstep: 3456.11 | bwd_inner_microstep: 3441.29 | bwd_allreduce_microstep: 14.78 | step_microstep: 8.59 [2024-09-15 16:27:12,309] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3821.44 | bwd: 6876.23 | bwd_inner: 
6861.19 | bwd_allreduce: 14.92 | step: 8.84 62%|██████▏ | 291/468 [52:03<31:42, 10.75s/it] {'loss': 0.2356, 'learning_rate': 1.3268324869023878e-05, 'epoch': 1.86} 62%|██████▏ | 291/468 [52:03<31:42, 10.75s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369 [2024-09-15 16:27:16,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.35 | bwd_microstep: 2592.63 | bwd_inner_microstep: 2592.61 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369 [2024-09-15 16:27:21,744] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.41 | optimizer_step: 0.40 [2024-09-15 16:27:21,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1473.35 | bwd_microstep: 3833.16 | bwd_inner_microstep: 2595.77 | bwd_allreduce_microstep: 1237.34 | step_microstep: 7.56 [2024-09-15 16:27:21,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2944.67 | bwd: 6425.81 | bwd_inner: 5188.37 | bwd_allreduce: 1237.37 | step: 7.79 62%|██████▏ | 292/468 [52:13<30:22, 10.36s/it] {'loss': 0.2557, 'learning_rate': 1.313787888621009e-05, 'epoch': 1.87} 62%|██████▏ | 292/468 [52:13<30:22, 10.36s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877 [2024-09-15 16:27:27,130] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.19 | bwd_microstep: 3437.21 | bwd_inner_microstep: 3437.14 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877 [2024-09-15 16:27:32,541] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.45 | optimizer_gradients: 0.35 | optimizer_step: 0.40 [2024-09-15 16:27:32,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1920.56 | bwd_microstep: 3454.98 | bwd_inner_microstep: 3439.66 | bwd_allreduce_microstep: 
15.28 | step_microstep: 7.88 [2024-09-15 16:27:32,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3837.72 | bwd: 6892.23 | bwd_inner: 6876.80 | bwd_allreduce: 15.31 | step: 8.13 63%|██████▎ | 293/468 [52:23<30:35, 10.49s/it] {'loss': 0.2203, 'learning_rate': 1.3007762938381619e-05, 'epoch': 1.87} 63%|██████▎ | 293/468 [52:23<30:35, 10.49s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1876 [2024-09-15 16:27:37,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1936.69 | bwd_microstep: 3463.36 | bwd_inner_microstep: 3463.33 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1870 [2024-09-15 16:27:43,335] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.66 | optimizer_gradients: 0.40 | optimizer_step: 0.40 [2024-09-15 16:27:43,335] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1900.99 | bwd_microstep: 3427.06 | bwd_inner_microstep: 3411.81 | bwd_allreduce_microstep: 15.21 | step_microstep: 8.00 [2024-09-15 16:27:43,336] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3837.65 | bwd: 6890.44 | bwd_inner: 6875.14 | bwd_allreduce: 15.23 | step: 8.25 63%|██████▎ | 294/468 [52:34<30:40, 10.58s/it] {'loss': 0.3142, 'learning_rate': 1.2877983283489062e-05, 'epoch': 1.88} 63%|██████▎ | 294/468 [52:34<30:40, 10.58s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878 [2024-09-15 16:27:48,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1922.01 | bwd_microstep: 3444.34 | bwd_inner_microstep: 3444.22 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.23 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1876 [2024-09-15 16:27:54,108] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.56 | optimizer_gradients: 0.35 | optimizer_step: 0.40 [2024-09-15 16:27:54,108] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1904.01 | bwd_microstep: 3437.07 | bwd_inner_microstep: 3422.35 | bwd_allreduce_microstep: 14.65 | step_microstep: 8.14 [2024-09-15 16:27:54,109] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3826.00 | bwd: 6881.45 | bwd_inner: 6866.57 | bwd_allreduce: 14.75 | step: 8.37 63%|██████▎ | 295/468 [52:45<30:40, 10.64s/it] {'loss': 0.16, 'learning_rate': 1.2748546163308947e-05, 'epoch': 1.88} 63%|██████▎ | 295/468 [52:45<30:40, 10.64s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878 [2024-09-15 16:27:59,507] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1921.93 | bwd_microstep: 3445.40 | bwd_inner_microstep: 3445.35 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24 dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878 [2024-09-15 16:28:04,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.61 | optimizer_gradients: 0.35 | optimizer_step: 0.40 [2024-09-15 16:28:04,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1941.45 | bwd_microstep: 3483.19 | bwd_inner_microstep: 3468.04 | bwd_allreduce_microstep: 15.10 | step_microstep: 8.72 [2024-09-15 16:28:04,969] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3863.35 | bwd: 6928.62 | bwd_inner: 6913.39 | bwd_allreduce: 15.12 | step: 8.97 63%|██████▎ | 296/468 [52:56<30:41, 10.70s/it] {'loss': 0.1647, 'learning_rate': 1.261945780314354e-05, 'epoch': 1.89} 63%|██████▎ | 296/468 [52:56<30:41, 10.70s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366 [2024-09-15 16:28:09,064] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1472.38 | bwd_microstep: 2592.22 | bwd_inner_microstep: 2592.10 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.26 dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1366 [2024-09-15 16:28:14,368] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.23 | optimizer_gradients: 0.41 | optimizer_step: 0.41 [2024-09-15 16:28:14,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1458.05 | bwd_microstep: 3811.62 | bwd_inner_microstep: 2577.30 | bwd_allreduce_microstep: 1234.26 | step_microstep: 7.55 [2024-09-15 16:28:14,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2930.41 | bwd: 6403.88 | bwd_inner: 5169.40 | bwd_allreduce: 1234.36 | step: 7.82 63%|██████▎ | 297/468 [53:05<29:23, 10.31s/it] {'loss': 0.2373, 'learning_rate': 1.2490724411521406e-05, 'epoch': 1.9} 63%|██████▎ | 297/468 [53:05<29:23, 10.31s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1367 [2024-09-15 16:28:18,462] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1470.07 | bwd_microstep: 2592.64 | bwd_inner_microstep: 2592.62 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.17 dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1389 [2024-09-15 16:28:25,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.53 | optimizer_gradients: 0.39 | optimizer_step: 0.39 [2024-09-15 16:28:25,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1483.35 | bwd_microstep: 5103.65 | bwd_inner_microstep: 2613.21 | bwd_allreduce_microstep: 2490.38 | step_microstep: 7.83 [2024-09-15 16:28:25,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2953.39 | bwd: 7696.30 | bwd_inner: 5205.82 | bwd_allreduce: 2490.41 | step: 8.03 64%|██████▎ | 298/468 [53:16<29:33, 10.43s/it] {'loss': 0.1701, 'learning_rate': 1.2362352179898855e-05, 'epoch': 1.9} 64%|██████▎ | 298/468 [53:16<29:33, 10.43s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368 [2024-09-15 16:28:29,170] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.55 | bwd_microstep: 2586.13 | bwd_inner_microstep: 2586.10 | 
bwd_allreduce_microstep: 0.01 | step_microstep: 0.23 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1881 [2024-09-15 16:28:34,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.59 | optimizer_gradients: 0.35 | optimizer_step: 0.40 [2024-09-15 16:28:34,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.96 | bwd_microstep: 3459.40 | bwd_inner_microstep: 3443.82 | bwd_allreduce_microstep: 15.54 | step_microstep: 7.94 [2024-09-15 16:28:34,586] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3388.48 | bwd: 6045.55 | bwd_inner: 6029.92 | bwd_allreduce: 15.56 | step: 8.18 64%|██████▍ | 299/468 [53:25<28:36, 10.15s/it] {'loss': 0.1522, 'learning_rate': 1.2234347282362129e-05, 'epoch': 1.91} 64%|██████▍ | 299/468 [53:25<28:36, 10.15s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877 [2024-09-15 16:28:39,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.80 | bwd_microstep: 3436.63 | bwd_inner_microstep: 3436.61 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1901 [2024-09-15 16:28:45,432] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.61 | optimizer_gradients: 0.35 | optimizer_step: 0.40 [2024-09-15 16:28:45,433] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1941.63 | bwd_microstep: 3482.97 | bwd_inner_microstep: 3468.05 | bwd_allreduce_microstep: 14.87 | step_microstep: 8.59 [2024-09-15 16:28:45,433] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3861.41 | bwd: 6919.62 | bwd_inner: 6904.65 | bwd_allreduce: 14.90 | step: 8.82 64%|██████▍ | 300/468 [53:36<29:00, 10.36s/it] {'loss': 0.1613, 'learning_rate': 1.2106715875330475e-05, 'epoch': 1.92} 64%|██████▍ | 300/468 [53:36<29:00, 10.36s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878 [2024-09-15 
16:28:50,825] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.95 | bwd_microstep: 3442.01 | bwd_inner_microstep: 3441.89 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.24 dynamic ViT batch size: 39, images per sample: 4.875, dynamic token length: 1388 [2024-09-15 16:28:54,984] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.56 | optimizer_gradients: 0.35 | optimizer_step: 0.40 [2024-09-15 16:28:54,985] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1488.36 | bwd_microstep: 2637.07 | bwd_inner_microstep: 2621.65 | bwd_allreduce_microstep: 15.37 | step_microstep: 7.92 [2024-09-15 16:28:54,985] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3407.29 | bwd: 6079.12 | bwd_inner: 6063.55 | bwd_allreduce: 15.45 | step: 8.16 64%|██████▍ | 301/468 [53:46<28:09, 10.12s/it] {'loss': 0.1883, 'learning_rate': 1.1979464097260039e-05, 'epoch': 1.92} 64%|██████▍ | 301/468 [53:46<28:09, 10.12s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366 [2024-09-15 16:28:59,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.20 | bwd_microstep: 2591.80 | bwd_inner_microstep: 2591.71 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24 dynamic ViT batch size: 45, images per sample: 5.625, dynamic token length: 1901 [2024-09-15 16:29:05,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.40 | optimizer_step: 0.40 [2024-09-15 16:29:05,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1969.42 | bwd_microstep: 4708.98 | bwd_inner_microstep: 3510.22 | bwd_allreduce_microstep: 1198.70 | step_microstep: 7.50 [2024-09-15 16:29:05,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3440.60 | bwd: 7300.82 | bwd_inner: 6101.93 | bwd_allreduce: 1198.77 | step: 7.75 65%|██████▍ | 302/468 [53:57<28:34, 10.33s/it] {'loss': 0.2338, 'learning_rate': 1.1852598068348642e-05, 
'epoch': 1.93} 65%|██████▍ | 302/468 [53:57<28:34, 10.33s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876 [2024-09-15 16:29:11,184] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.97 | bwd_microstep: 3441.17 | bwd_inner_microstep: 3441.14 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.26 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388 [2024-09-15 16:29:15,362] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.57 | optimizer_gradients: 0.35 | optimizer_step: 0.41 [2024-09-15 16:29:15,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1496.79 | bwd_microstep: 2646.90 | bwd_inner_microstep: 2631.76 | bwd_allreduce_microstep: 15.10 | step_microstep: 7.99 [2024-09-15 16:29:15,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3415.74 | bwd: 6088.15 | bwd_inner: 6072.90 | bwd_allreduce: 15.12 | step: 8.26 65%|██████▍ | 303/468 [54:06<27:46, 10.10s/it] {'loss': 0.2248, 'learning_rate': 1.1726123890241439e-05, 'epoch': 1.94} 65%|██████▍ | 303/468 [54:06<27:46, 10.10s/it]dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1368 [2024-09-15 16:29:19,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1438.55 | bwd_microstep: 2553.49 | bwd_inner_microstep: 2553.23 | bwd_allreduce_microstep: 0.13 | step_microstep: 0.24 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878 [2024-09-15 16:29:26,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.41 | optimizer_step: 0.42 [2024-09-15 16:29:26,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1903.94 | bwd_microstep: 4820.79 | bwd_inner_microstep: 3424.10 | bwd_allreduce_microstep: 1396.62 | step_microstep: 7.52 [2024-09-15 16:29:26,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3342.46 | bwd: 7374.32 | bwd_inner: 
5977.34 | bwd_allreduce: 1396.78 | step: 7.76 65%|██████▍ | 304/468 [54:17<28:09, 10.30s/it] {'loss': 0.1811, 'learning_rate': 1.1600047645737433e-05, 'epoch': 1.94} 65%|██████▍ | 304/468 [54:17<28:09, 10.30s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368 [2024-09-15 16:29:30,231] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.09 | bwd_microstep: 2589.19 | bwd_inner_microstep: 2589.05 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.24 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388 [2024-09-15 16:29:36,758] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.26 | optimizer_gradients: 0.39 | optimizer_step: 0.40 [2024-09-15 16:29:36,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1495.09 | bwd_microstep: 4997.41 | bwd_inner_microstep: 2630.23 | bwd_allreduce_microstep: 2367.11 | step_microstep: 7.48 [2024-09-15 16:29:36,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2963.16 | bwd: 7586.64 | bwd_inner: 5219.28 | bwd_allreduce: 2367.23 | step: 7.72 65%|██████▌ | 305/468 [54:28<28:14, 10.40s/it] {'loss': 0.1893, 'learning_rate': 1.1474375398496948e-05, 'epoch': 1.95} 65%|██████▌ | 305/468 [54:28<28:14, 10.40s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369 [2024-09-15 16:29:40,846] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1469.16 | bwd_microstep: 2587.71 | bwd_inner_microstep: 2587.66 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388 [2024-09-15 16:29:46,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.40 | optimizer_step: 0.41 [2024-09-15 16:29:46,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1491.87 | bwd_microstep: 3858.80 | bwd_inner_microstep: 2626.43 | bwd_allreduce_microstep: 
1232.31 | step_microstep: 7.59 [2024-09-15 16:29:46,233] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2961.01 | bwd: 6446.56 | bwd_inner: 5214.09 | bwd_allreduce: 1232.34 | step: 7.83 65%|██████▌ | 306/468 [54:37<27:19, 10.12s/it] {'loss': 0.203, 'learning_rate': 1.1349113192749986e-05, 'epoch': 1.96} 65%|██████▌ | 306/468 [54:37<27:19, 10.12s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368 [2024-09-15 16:29:50,317] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1466.30 | bwd_microstep: 2587.91 | bwd_inner_microstep: 2587.84 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1901 [2024-09-15 16:29:57,024] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.40 | optimizer_step: 0.40 [2024-09-15 16:29:57,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1940.36 | bwd_microstep: 4730.97 | bwd_inner_microstep: 3464.08 | bwd_allreduce_microstep: 1266.83 | step_microstep: 7.53 [2024-09-15 16:29:57,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3406.62 | bwd: 7318.92 | bwd_inner: 6051.92 | bwd_allreduce: 1266.88 | step: 7.78 66%|██████▌ | 307/468 [54:48<27:41, 10.32s/it] {'loss': 0.1425, 'learning_rate': 1.1224267053005504e-05, 'epoch': 1.96} 66%|██████▌ | 307/468 [54:48<27:41, 10.32s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878 [2024-09-15 16:30:02,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1929.58 | bwd_microstep: 3458.23 | bwd_inner_microstep: 3458.18 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368 [2024-09-15 16:30:07,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.26 | optimizer_gradients: 0.39 | optimizer_step: 0.40 [2024-09-15 16:30:07,836] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1469.03 | bwd_microstep: 3888.51 | bwd_inner_microstep: 2589.86 | bwd_allreduce_microstep: 1298.59 | step_microstep: 7.51 [2024-09-15 16:30:07,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3398.58 | bwd: 7346.79 | bwd_inner: 6048.04 | bwd_allreduce: 1298.63 | step: 7.75 66%|██████▌ | 308/468 [54:59<27:54, 10.47s/it] {'loss': 0.1337, 'learning_rate': 1.1099842983761712e-05, 'epoch': 1.97} 66%|██████▌ | 308/468 [54:59<27:54, 10.47s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366 [2024-09-15 16:30:11,920] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1466.17 | bwd_microstep: 2586.80 | bwd_inner_microstep: 2586.62 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.24 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870 [2024-09-15 16:30:18,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.40 | optimizer_step: 0.40 [2024-09-15 16:30:18,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.45 | bwd_microstep: 4672.12 | bwd_inner_microstep: 3430.12 | bwd_allreduce_microstep: 1241.94 | step_microstep: 7.51 [2024-09-15 16:30:18,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3379.60 | bwd: 7258.96 | bwd_inner: 6016.74 | bwd_allreduce: 1242.09 | step: 7.76 66%|██████▌ | 309/468 [55:09<27:55, 10.54s/it] {'loss': 0.1684, 'learning_rate': 1.0975846969217258e-05, 'epoch': 1.97} 66%|██████▌ | 309/468 [55:09<27:55, 10.54s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877 [2024-09-15 16:30:23,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.70 | bwd_microstep: 3432.48 | bwd_inner_microstep: 3432.45 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.19 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389 [2024-09-15 16:30:29,126] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.23 | optimizer_gradients: 0.39 | optimizer_step: 0.41 [2024-09-15 16:30:29,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1492.29 | bwd_microstep: 3685.70 | bwd_inner_microstep: 2629.68 | bwd_allreduce_microstep: 1055.96 | step_microstep: 7.45 [2024-09-15 16:30:29,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3404.96 | bwd: 7118.20 | bwd_inner: 6062.13 | bwd_allreduce: 1055.99 | step: 7.64 66%|██████▌ | 310/468 [55:20<27:47, 10.55s/it] {'loss': 0.165, 'learning_rate': 1.0852284972983415e-05, 'epoch': 1.98} 66%|██████▌ | 310/468 [55:20<27:47, 10.55s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877 [2024-09-15 16:30:34,508] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.96 | bwd_microstep: 3434.26 | bwd_inner_microstep: 3434.23 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.26 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877 [2024-09-15 16:30:39,912] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.56 | optimizer_gradients: 0.36 | optimizer_step: 0.40 [2024-09-15 16:30:39,912] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.16 | bwd_microstep: 3452.32 | bwd_inner_microstep: 3437.44 | bwd_allreduce_microstep: 14.84 | step_microstep: 8.29 [2024-09-15 16:30:39,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3833.09 | bwd: 6886.62 | bwd_inner: 6871.66 | bwd_allreduce: 14.86 | step: 8.56 66%|██████▋ | 311/468 [55:31<27:47, 10.62s/it] {'loss': 0.2462, 'learning_rate': 1.0729162937797257e-05, 'epoch': 1.99} 66%|██████▋ | 311/468 [55:31<27:47, 10.62s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366 [2024-09-15 16:30:44,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.67 | bwd_microstep: 2588.25 | bwd_inner_microstep: 2588.03 | 
bwd_allreduce_microstep: 0.05 | step_microstep: 0.25 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1870 [2024-09-15 16:30:50,730] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.74 | optimizer_gradients: 0.41 | optimizer_step: 0.41 [2024-09-15 16:30:50,730] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1899.75 | bwd_microstep: 4793.11 | bwd_inner_microstep: 3412.72 | bwd_allreduce_microstep: 1380.33 | step_microstep: 10.63 [2024-09-15 16:30:50,731] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3368.40 | bwd: 7381.41 | bwd_inner: 6000.75 | bwd_allreduce: 1380.52 | step: 10.89 67%|██████▋ | 312/468 [55:42<27:46, 10.68s/it] {'loss': 0.2298, 'learning_rate': 1.0606486785235879e-05, 'epoch': 1.99} 67%|██████▋ | 312/468 [55:42<27:46, 10.68s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877 petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. [2024-09-15 16:30:56,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1916.72 | bwd_microstep: 3437.20 | bwd_inner_microstep: 3437.10 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876 [2024-09-15 16:31:02,474] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.57 | optimizer_gradients: 0.34 | optimizer_step: 0.41 [2024-09-15 16:31:02,475] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.93 | bwd_microstep: 3448.45 | bwd_inner_microstep: 3433.22 | bwd_allreduce_microstep: 15.18 | step_microstep: 7.94 [2024-09-15 16:31:02,475] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3829.62 | bwd: 6885.69 | bwd_inner: 6870.32 | bwd_allreduce: 15.25 | step: 8.19 67%|██████▋ | 313/468 [55:53<28:25, 11.00s/it] {'loss': 0.2047, 'learning_rate': 1.0484262415431536e-05, 'epoch': 2.0} 67%|██████▋ | 313/468 [55:53<28:25, 11.00s/it][2024-09-15 16:31:04,793] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1870 [2024-09-15 16:31:23,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1887.95 | bwd_microstep: 3400.97 | bwd_inner_microstep: 3400.88 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.24 dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1372 [2024-09-15 16:31:28,116] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.28 | optimizer_gradients: 0.38 | optimizer_step: 0.38 [2024-09-15 16:31:28,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1453.96 | bwd_microstep: 2957.88 | bwd_inner_microstep: 2568.75 | bwd_allreduce_microstep: 389.07 | step_microstep: 7.37 [2024-09-15 16:31:28,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3341.88 | bwd: 6358.89 | bwd_inner: 5969.63 | bwd_allreduce: 389.13 | step: 7.62 67%|██████▋ | 314/468 [56:19<39:30, 15.39s/it] {'loss': 0.2365, 'learning_rate': 1.0362495706787923e-05, 'epoch': 2.01} 67%|██████▋ | 314/468 [56:19<39:30, 15.39s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369 [2024-09-15 16:31:32,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1464.18 | bwd_microstep: 2584.16 |
bwd_inner_microstep: 2584.13 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.20 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366 [2024-09-15 16:31:38,728] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.39 | optimizer_gradients: 0.39 | optimizer_step: 0.42 [2024-09-15 16:31:38,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1466.22 | bwd_microstep: 5032.01 | bwd_inner_microstep: 2584.18 | bwd_allreduce_microstep: 2447.77 | step_microstep: 7.75 [2024-09-15 16:31:38,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2930.37 | bwd: 7616.18 | bwd_inner: 5168.32 | bwd_allreduce: 2447.80 | step: 7.96 67%|██████▋ | 315/468 [56:30<35:35, 13.96s/it] {'loss': 0.1738, 'learning_rate': 1.0241192515697432e-05, 'epoch': 2.01} 67%|██████▋ | 315/468 [56:30<35:35, 13.96s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1877 [2024-09-15 16:31:44,062] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1894.56 | bwd_microstep: 3407.84 | bwd_inner_microstep: 3407.77 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876 [2024-09-15 16:31:49,450] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.40 | optimizer_gradients: 0.35 | optimizer_step: 0.40 [2024-09-15 16:31:49,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1916.08 | bwd_microstep: 3435.64 | bwd_inner_microstep: 3420.32 | bwd_allreduce_microstep: 15.27 | step_microstep: 8.13 [2024-09-15 16:31:49,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3810.62 | bwd: 6843.53 | bwd_inner: 6828.09 | bwd_allreduce: 15.30 | step: 8.38 68%|██████▊ | 316/468 [56:40<32:54, 12.99s/it] {'loss': 0.2467, 'learning_rate': 1.0120358676259508e-05, 'epoch': 2.02} 68%|██████▊ | 316/468 [56:40<32:54, 12.99s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic 
token length: 1388 [2024-09-15 16:31:53,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1485.88 | bwd_microstep: 2613.65 | bwd_inner_microstep: 2613.57 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368 [2024-09-15 16:32:00,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.40 | optimizer_step: 0.41 [2024-09-15 16:32:00,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1464.92 | bwd_microstep: 5043.00 | bwd_inner_microstep: 2581.96 | bwd_allreduce_microstep: 2460.98 | step_microstep: 7.53 [2024-09-15 16:32:00,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2950.78 | bwd: 7656.68 | bwd_inner: 5195.53 | bwd_allreduce: 2461.02 | step: 7.78 68%|██████▊ | 317/468 [56:51<30:56, 12.29s/it] {'loss': 0.1459, 'learning_rate': 1.0000000000000006e-05, 'epoch': 2.03} 68%|██████▊ | 317/468 [56:51<30:56, 12.29s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389 [2024-09-15 16:32:04,268] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1493.23 | bwd_microstep: 2618.81 | bwd_inner_microstep: 2618.78 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366 [2024-09-15 16:32:09,488] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.25 | optimizer_gradients: 0.39 | optimizer_step: 0.40 [2024-09-15 16:32:09,488] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1463.09 | bwd_microstep: 3722.95 | bwd_inner_microstep: 2581.90 | bwd_allreduce_microstep: 1141.00 | step_microstep: 7.48 [2024-09-15 16:32:09,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2956.30 | bwd: 6341.78 | bwd_inner: 5200.68 | bwd_allreduce: 1141.03 | step: 7.53 68%|██████▊ | 318/468 [57:00<28:32, 11.41s/it] {'loss': 0.1594, 
'learning_rate': 9.880122275591752e-06, 'epoch': 2.03} 68%|██████▊ | 318/468 [57:00<28:32, 11.41s/it]dynamic ViT batch size: 46, images per sample: 5.75, dynamic token length: 1878 [2024-09-15 16:32:14,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1946.46 | bwd_microstep: 3469.60 | bwd_inner_microstep: 3469.58 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366 [2024-09-15 16:32:19,032] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.57 | optimizer_gradients: 0.35 | optimizer_step: 0.41 [2024-09-15 16:32:19,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1465.44 | bwd_microstep: 2596.79 | bwd_inner_microstep: 2581.42 | bwd_allreduce_microstep: 15.32 | step_microstep: 7.93 [2024-09-15 16:32:19,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3411.88 | bwd: 6066.41 | bwd_inner: 6051.00 | bwd_allreduce: 15.35 | step: 8.17 68%|██████▊ | 319/468 [57:10<26:57, 10.85s/it] {'loss': 0.1145, 'learning_rate': 9.760731268576095e-06, 'epoch': 2.04} 68%|██████▊ | 319/468 [57:10<26:57, 10.85s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1390 [2024-09-15 16:32:23,139] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1474.84 | bwd_microstep: 2601.19 | bwd_inner_microstep: 2601.16 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22 dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1876 [2024-09-15 16:32:29,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.28 | optimizer_gradients: 0.39 | optimizer_step: 0.38 [2024-09-15 16:32:29,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1929.34 | bwd_microstep: 4637.85 | bwd_inner_microstep: 3458.78 | bwd_allreduce_microstep: 1179.00 | step_microstep: 7.29 [2024-09-15 16:32:29,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
3404.16 | bwd: 7239.05 | bwd_inner: 6059.95 | bwd_allreduce: 1179.03 | step: 7.57 68%|██████▊ | 320/468 [57:21<26:39, 10.81s/it] {'loss': 0.1485, 'learning_rate': 9.64183272108562e-06, 'epoch': 2.04} 68%|██████▊ | 320/468 [57:21<26:39, 10.81s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389 [2024-09-15 16:32:33,883] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1488.05 | bwd_microstep: 2621.91 | bwd_inner_microstep: 2621.89 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.13 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877 [2024-09-15 16:32:40,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.43 | optimizer_step: 0.40 [2024-09-15 16:32:40,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.65 | bwd_microstep: 4612.33 | bwd_inner_microstep: 3432.27 | bwd_allreduce_microstep: 1179.98 | step_microstep: 7.46 [2024-09-15 16:32:40,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3400.67 | bwd: 7234.26 | bwd_inner: 6054.16 | bwd_allreduce: 1180.04 | step: 7.62 69%|██████▊ | 321/468 [57:31<26:24, 10.78s/it] {'loss': 0.0932, 'learning_rate': 9.523432351567979e-06, 'epoch': 2.05} 69%|██████▊ | 321/468 [57:31<26:24, 10.78s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1388 [2024-09-15 16:32:44,548] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1475.44 | bwd_microstep: 2598.86 | bwd_inner_microstep: 2598.69 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.24 dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1367 [2024-09-15 16:32:51,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.24 | optimizer_gradients: 0.40 | optimizer_step: 0.40 [2024-09-15 16:32:51,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1438.74 | bwd_microstep: 5104.20 | bwd_inner_microstep: 
2549.39 | bwd_allreduce_microstep: 2554.75 | step_microstep: 7.58 [2024-09-15 16:32:51,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2914.16 | bwd: 7703.10 | bwd_inner: 5148.08 | bwd_allreduce: 2554.87 | step: 7.83 69%|██████▉ | 322/468 [57:42<26:09, 10.75s/it] {'loss': 0.1819, 'learning_rate': 9.405535854510863e-06, 'epoch': 2.06} 69%|██████▉ | 322/468 [57:42<26:09, 10.75s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870 [2024-09-15 16:32:56,483] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1907.94 | bwd_microstep: 3418.51 | bwd_inner_microstep: 3418.48 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877 [2024-09-15 16:33:01,877] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.33 | optimizer_gradients: 0.35 | optimizer_step: 0.40 [2024-09-15 16:33:01,878] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.28 | bwd_microstep: 3446.27 | bwd_inner_microstep: 3431.45 | bwd_allreduce_microstep: 14.77 | step_microstep: 7.70 [2024-09-15 16:33:01,878] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3821.20 | bwd: 6864.79 | bwd_inner: 6849.93 | bwd_allreduce: 14.79 | step: 7.95 69%|██████▉ | 323/468 [57:53<25:58, 10.75s/it] {'loss': 0.1534, 'learning_rate': 9.288148900168122e-06, 'epoch': 2.06} 69%|██████▉ | 323/468 [57:53<25:58, 10.75s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1878 [2024-09-15 16:33:07,193] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1884.07 | bwd_microstep: 3400.93 | bwd_inner_microstep: 3400.87 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1387 [2024-09-15 16:33:12,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.13 | optimizer_gradients: 0.40 | optimizer_step: 0.40 
[2024-09-15 16:33:12,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1491.21 | bwd_microstep: 3882.04 | bwd_inner_microstep: 2624.54 | bwd_allreduce_microstep: 1257.44 | step_microstep: 7.46 [2024-09-15 16:33:12,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3375.25 | bwd: 7283.01 | bwd_inner: 6025.41 | bwd_allreduce: 1257.48 | step: 7.71 69%|██████▉ | 324/468 [58:04<25:46, 10.74s/it] {'loss': 0.1875, 'learning_rate': 9.171277134287057e-06, 'epoch': 2.07} 69%|██████▉ | 324/468 [58:04<25:46, 10.74s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1374 [2024-09-15 16:33:16,657] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1453.35 | bwd_microstep: 2570.34 | bwd_inner_microstep: 2570.27 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877 [2024-09-15 16:33:23,393] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.39 | optimizer_step: 0.40 [2024-09-15 16:33:23,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.00 | bwd_microstep: 4785.58 | bwd_inner_microstep: 3436.78 | bwd_allreduce_microstep: 1348.74 | step_microstep: 7.45 [2024-09-15 16:33:23,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3368.33 | bwd: 7355.97 | bwd_inner: 6007.05 | bwd_allreduce: 1348.79 | step: 7.69 69%|██████▉ | 325/468 [58:14<25:38, 10.76s/it] {'loss': 0.1416, 'learning_rate': 9.054926177836878e-06, 'epoch': 2.08} 69%|██████▉ | 325/468 [58:14<25:38, 10.76s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389 [2024-09-15 16:33:27,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1488.60 | bwd_microstep: 2619.29 | bwd_inner_microstep: 2619.20 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877 
[2024-09-15 16:33:32,932] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.51 | optimizer_gradients: 0.34 | optimizer_step: 0.41 [2024-09-15 16:33:32,933] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1916.30 | bwd_microstep: 3449.93 | bwd_inner_microstep: 3435.09 | bwd_allreduce_microstep: 14.79 | step_microstep: 7.69 [2024-09-15 16:33:32,933] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3404.87 | bwd: 6069.26 | bwd_inner: 6054.29 | bwd_allreduce: 14.84 | step: 7.93 70%|██████▉ | 326/468 [58:24<24:35, 10.39s/it] {'loss': 0.1717, 'learning_rate': 8.939101626738395e-06, 'epoch': 2.08} 70%|██████▉ | 326/468 [58:24<24:35, 10.39s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388 [2024-09-15 16:33:37,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1490.04 | bwd_microstep: 2621.66 | bwd_inner_microstep: 2621.59 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25 dynamic ViT batch size: 46, images per sample: 5.75, dynamic token length: 1884 [2024-09-15 16:33:43,642] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.24 | optimizer_gradients: 0.39 | optimizer_step: 0.43 [2024-09-15 16:33:43,642] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1949.86 | bwd_microstep: 4581.95 | bwd_inner_microstep: 3483.13 | bwd_allreduce_microstep: 1098.76 | step_microstep: 7.50 [2024-09-15 16:33:43,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3439.88 | bwd: 7203.66 | bwd_inner: 6104.72 | bwd_allreduce: 1098.79 | step: 7.75 70%|██████▉ | 327/468 [58:35<24:38, 10.49s/it] {'loss': 0.1975, 'learning_rate': 8.823809051594816e-06, 'epoch': 2.09} 70%|██████▉ | 327/468 [58:35<24:38, 10.49s/it]dynamic ViT batch size: 50, images per sample: 6.25, dynamic token length: 1900 [2024-09-15 16:33:49,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2001.28 | bwd_microstep: 3555.57 | 
bwd_inner_microstep: 3555.55 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.19 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870 [2024-09-15 16:33:54,625] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.41 | optimizer_gradients: 0.35 | optimizer_step: 0.40 [2024-09-15 16:33:54,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1914.98 | bwd_microstep: 3444.63 | bwd_inner_microstep: 3429.28 | bwd_allreduce_microstep: 15.31 | step_microstep: 8.09 [2024-09-15 16:33:54,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3916.24 | bwd: 7000.21 | bwd_inner: 6984.82 | bwd_allreduce: 15.33 | step: 8.32 70%|███████ | 328/468 [58:46<24:49, 10.64s/it] {'loss': 0.1706, 'learning_rate': 8.70905399742389e-06, 'epoch': 2.1} 70%|███████ | 328/468 [58:46<24:49, 10.64s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1908 [2024-09-15 16:34:00,068] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1938.04 | bwd_microstep: 3472.99 | bwd_inner_microstep: 3472.96 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.27 dynamic ViT batch size: 41, images per sample: 5.125, dynamic token length: 1880 [2024-09-15 16:34:05,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.49 | optimizer_gradients: 0.37 | optimizer_step: 0.41 [2024-09-15 16:34:05,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.21 | bwd_microstep: 3445.41 | bwd_inner_microstep: 3430.38 | bwd_allreduce_microstep: 14.97 | step_microstep: 7.79 [2024-09-15 16:34:05,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3851.22 | bwd: 6918.42 | bwd_inner: 6903.34 | bwd_allreduce: 15.00 | step: 8.05 70%|███████ | 329/468 [58:56<24:46, 10.70s/it] {'loss': 0.1869, 'learning_rate': 8.594841983391196e-06, 'epoch': 2.1} 70%|███████ | 329/468 [58:56<24:46, 10.70s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token 
length: 1388 [2024-09-15 16:34:09,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1493.61 | bwd_microstep: 2627.93 | bwd_inner_microstep: 2627.81 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366 [2024-09-15 16:34:16,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.40 | optimizer_step: 0.40 [2024-09-15 16:34:16,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.17 | bwd_microstep: 4999.58 | bwd_inner_microstep: 2592.44 | bwd_allreduce_microstep: 2407.07 | step_microstep: 7.49 [2024-09-15 16:34:16,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2964.75 | bwd: 7627.55 | bwd_inner: 5220.25 | bwd_allreduce: 2407.16 | step: 7.74 71%|███████ | 330/468 [59:07<24:34, 10.68s/it] {'loss': 0.2313, 'learning_rate': 8.481178502544684e-06, 'epoch': 2.11} 71%|███████ | 330/468 [59:07<24:34, 10.68s/it]dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1369 [2024-09-15 16:34:20,141] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1437.65 | bwd_microstep: 2552.45 | bwd_inner_microstep: 2552.18 | bwd_allreduce_microstep: 0.14 | step_microstep: 0.24 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368 [2024-09-15 16:34:26,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.43 | optimizer_step: 0.40 [2024-09-15 16:34:26,864] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.40 | bwd_microstep: 5220.41 | bwd_inner_microstep: 2590.15 | bwd_allreduce_microstep: 2630.20 | step_microstep: 7.54 [2024-09-15 16:34:26,864] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2906.02 | bwd: 7772.90 | bwd_inner: 5142.34 | bwd_allreduce: 2630.36 | step: 7.79 71%|███████ | 331/468 [59:18<24:26, 10.70s/it] {'loss': 0.1843, 'learning_rate': 
8.368069021550516e-06, 'epoch': 2.12} 71%|███████ | 331/468 [59:18<24:26, 10.70s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1900 [2024-09-15 16:34:32,226] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1905.06 | bwd_microstep: 3426.68 | bwd_inner_microstep: 3426.66 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878 [2024-09-15 16:34:37,621] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.30 | optimizer_gradients: 0.35 | optimizer_step: 0.40 [2024-09-15 16:34:37,621] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.23 | bwd_microstep: 3447.24 | bwd_inner_microstep: 3432.32 | bwd_allreduce_microstep: 14.87 | step_microstep: 7.62 [2024-09-15 16:34:37,621] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3822.27 | bwd: 6873.93 | bwd_inner: 6858.98 | bwd_allreduce: 14.89 | step: 7.67 71%|███████ | 332/468 [59:29<24:17, 10.72s/it] {'loss': 0.2359, 'learning_rate': 8.255518980430115e-06, 'epoch': 2.12} 71%|███████ | 332/468 [59:29<24:17, 10.72s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878 [2024-09-15 16:34:43,004] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1916.97 | bwd_microstep: 3434.79 | bwd_inner_microstep: 3434.76 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368 [2024-09-15 16:34:48,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.28 | optimizer_gradients: 0.40 | optimizer_step: 0.40 [2024-09-15 16:34:48,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1469.03 | bwd_microstep: 3879.54 | bwd_inner_microstep: 2589.49 | bwd_allreduce_microstep: 1289.98 | step_microstep: 7.55 [2024-09-15 16:34:48,389] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3385.96 | bwd: 
7314.34 | bwd_inner: 6024.25 | bwd_allreduce: 1290.02 | step: 7.62 71%|███████ | 333/468 [59:39<24:09, 10.73s/it] {'loss': 0.1243, 'learning_rate': 8.143533792298545e-06, 'epoch': 2.13} 71%|███████ | 333/468 [59:39<24:09, 10.73s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878 [2024-09-15 16:34:53,812] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1934.48 | bwd_microstep: 3457.41 | bwd_inner_microstep: 3457.39 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.26 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369 [2024-09-15 16:34:59,244] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.39 | optimizer_step: 0.40 [2024-09-15 16:34:59,245] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1469.74 | bwd_microstep: 3926.45 | bwd_inner_microstep: 2589.85 | bwd_allreduce_microstep: 1336.54 | step_microstep: 7.45 [2024-09-15 16:34:59,245] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3404.19 | bwd: 7383.87 | bwd_inner: 6047.24 | bwd_allreduce: 1336.57 | step: 7.74 71%|███████▏ | 334/468 [59:50<24:03, 10.77s/it] {'loss': 0.1325, 'learning_rate': 8.032118843104164e-06, 'epoch': 2.13} 71%|███████▏ | 334/468 [59:50<24:03, 10.77s/it]dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1390 [2024-09-15 16:35:03,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1460.93 | bwd_microstep: 2588.63 | bwd_inner_microstep: 2588.49 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.25 dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1366 [2024-09-15 16:35:08,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.44 | optimizer_step: 0.40 [2024-09-15 16:35:08,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1438.35 | bwd_microstep: 3870.36 | bwd_inner_microstep: 2552.32 | 
bwd_allreduce_microstep: 1317.98 | step_microstep: 7.48 [2024-09-15 16:35:08,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2899.25 | bwd: 6459.03 | bwd_inner: 5140.81 | bwd_allreduce: 1318.09 | step: 7.74 72%|███████▏ | 335/468 [1:00:00<22:58, 10.37s/it] {'loss': 0.099, 'learning_rate': 7.921279491369575e-06, 'epoch': 2.14} 72%|███████▏ | 335/468 [1:00:00<22:58, 10.37s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877 [2024-09-15 16:35:14,049] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.20 | bwd_microstep: 3435.38 | bwd_inner_microstep: 3435.35 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366 [2024-09-15 16:35:19,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.31 | optimizer_gradients: 0.40 | optimizer_step: 0.41 [2024-09-15 16:35:19,306] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.36 | bwd_microstep: 3753.24 | bwd_inner_microstep: 2586.56 | bwd_allreduce_microstep: 1166.62 | step_microstep: 7.81 [2024-09-15 16:35:19,306] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3382.54 | bwd: 7188.63 | bwd_inner: 6021.91 | bwd_allreduce: 1166.65 | step: 8.06 72%|███████▏ | 336/468 [1:00:10<22:59, 10.45s/it] {'loss': 0.2295, 'learning_rate': 7.811021067933919e-06, 'epoch': 2.15} 72%|███████▏ | 336/468 [1:00:10<22:59, 10.45s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1901 [2024-09-15 16:35:24,715] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1923.76 | bwd_microstep: 3454.65 | bwd_inner_microstep: 3454.61 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368 [2024-09-15 16:35:28,822] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.49 | optimizer_gradients: 0.35 | optimizer_step: 0.40 
[2024-09-15 16:35:28,822] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.88 | bwd_microstep: 2604.14 | bwd_inner_microstep: 2589.00 | bwd_allreduce_microstep: 15.07 | step_microstep: 7.67 [2024-09-15 16:35:28,822] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3391.61 | bwd: 6058.84 | bwd_inner: 6043.61 | bwd_allreduce: 15.10 | step: 7.91 72%|███████▏ | 337/468 [1:00:20<22:12, 10.17s/it] {'loss': 0.1494, 'learning_rate': 7.701348875696486e-06, 'epoch': 2.15} 72%|███████▏ | 337/468 [1:00:20<22:12, 10.17s/it]dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1390 [2024-09-15 16:35:32,902] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1462.01 | bwd_microstep: 2587.76 | bwd_inner_microstep: 2587.73 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878 [2024-09-15 16:35:38,317] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.35 | optimizer_gradients: 0.35 | optimizer_step: 0.40 [2024-09-15 16:35:38,318] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.44 | bwd_microstep: 3459.67 | bwd_inner_microstep: 3444.00 | bwd_allreduce_microstep: 15.60 | step_microstep: 7.88 [2024-09-15 16:35:38,318] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3381.43 | bwd: 6047.47 | bwd_inner: 6031.72 | bwd_allreduce: 15.64 | step: 8.10 72%|███████▏ | 338/468 [1:00:29<21:35, 9.97s/it] {'loss': 0.1063, 'learning_rate': 7.59226818936166e-06, 'epoch': 2.16} 72%|███████▏ | 338/468 [1:00:29<21:35, 9.97s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870 [2024-09-15 16:35:43,694] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1914.04 | bwd_microstep: 3426.54 | bwd_inner_microstep: 3426.51 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.11 dynamic ViT batch size: 46, images per sample: 5.75, dynamic token length: 1878 
[2024-09-15 16:35:49,183] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.39 | optimizer_gradients: 0.38 | optimizer_step: 0.43 [2024-09-15 16:35:49,184] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1951.06 | bwd_microstep: 3501.52 | bwd_inner_microstep: 3486.38 | bwd_allreduce_microstep: 15.10 | step_microstep: 8.11 [2024-09-15 16:35:49,184] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3865.07 | bwd: 6928.07 | bwd_inner: 6912.89 | bwd_allreduce: 15.12 | step: 8.24 72%|███████▏ | 339/468 [1:00:40<22:00, 10.24s/it] {'loss': 0.1389, 'learning_rate': 7.483784255185249e-06, 'epoch': 2.17} 72%|███████▏ | 339/468 [1:00:40<22:00, 10.24s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1374 [2024-09-15 16:35:53,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1473.84 | bwd_microstep: 2593.15 | bwd_inner_microstep: 2593.09 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1373 [2024-09-15 16:35:59,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.40 | optimizer_step: 0.39 [2024-09-15 16:35:59,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1474.91 | bwd_microstep: 5076.75 | bwd_inner_microstep: 2597.50 | bwd_allreduce_microstep: 2479.18 | step_microstep: 7.45 [2024-09-15 16:35:59,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2948.72 | bwd: 7669.94 | bwd_inner: 5190.59 | bwd_allreduce: 2479.22 | step: 7.70 73%|███████▎ | 340/468 [1:00:51<22:07, 10.37s/it] {'loss': 0.176, 'learning_rate': 7.375902290722146e-06, 'epoch': 2.17} 73%|███████▎ | 340/468 [1:00:51<22:07, 10.37s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1870 [2024-09-15 16:36:05,167] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1881.34 | bwd_microstep: 3386.29 | 
bwd_inner_microstep: 3386.26 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24 dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1369 [2024-09-15 16:36:10,550] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.37 | optimizer_gradients: 0.39 | optimizer_step: 0.40 [2024-09-15 16:36:10,550] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1440.24 | bwd_microstep: 3910.17 | bwd_inner_microstep: 2553.18 | bwd_allreduce_microstep: 1356.92 | step_microstep: 7.69 [2024-09-15 16:36:10,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3321.56 | bwd: 7296.47 | bwd_inner: 5939.44 | bwd_allreduce: 1356.96 | step: 7.93 73%|███████▎ | 341/468 [1:01:01<22:08, 10.46s/it] {'loss': 0.142, 'learning_rate': 7.268627484575406e-06, 'epoch': 2.18} 73%|███████▎ | 341/468 [1:01:01<22:08, 10.46s/it]dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1388 [2024-09-15 16:36:14,629] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1461.44 | bwd_microstep: 2586.98 | bwd_inner_microstep: 2586.85 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.24 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388 [2024-09-15 16:36:19,998] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.26 | optimizer_gradients: 0.40 | optimizer_step: 0.40 [2024-09-15 16:36:19,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1494.16 | bwd_microstep: 3839.03 | bwd_inner_microstep: 2626.77 | bwd_allreduce_microstep: 1212.20 | step_microstep: 7.50 [2024-09-15 16:36:19,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2955.58 | bwd: 6426.04 | bwd_inner: 5213.62 | bwd_allreduce: 1212.30 | step: 7.75 73%|███████▎ | 342/468 [1:01:11<21:20, 10.16s/it] {'loss': 0.2029, 'learning_rate': 7.161964996146689e-06, 'epoch': 2.19} 73%|███████▎ | 342/468 [1:01:11<21:20, 10.16s/it]dynamic ViT batch size: 42, images per sample: 
5.25, dynamic token length: 1876 [2024-09-15 16:36:25,373] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1914.41 | bwd_microstep: 3429.40 | bwd_inner_microstep: 3429.37 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.19 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876 [2024-09-15 16:36:30,775] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.27 | optimizer_gradients: 0.44 | optimizer_step: 0.44 [2024-09-15 16:36:30,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1916.93 | bwd_microstep: 3451.04 | bwd_inner_microstep: 3435.95 | bwd_allreduce_microstep: 15.05 | step_microstep: 7.85 [2024-09-15 16:36:30,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3831.31 | bwd: 6880.45 | bwd_inner: 6865.32 | bwd_allreduce: 15.07 | step: 8.07 73%|███████▎ | 343/468 [1:01:22<21:33, 10.34s/it] {'loss': 0.1416, 'learning_rate': 7.055919955388122e-06, 'epoch': 2.19} 73%|███████▎ | 343/468 [1:01:22<21:33, 10.34s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878 [2024-09-15 16:36:36,159] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.07 | bwd_microstep: 3435.85 | bwd_inner_microstep: 3435.83 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368 [2024-09-15 16:36:40,268] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.49 | optimizer_gradients: 0.35 | optimizer_step: 0.40 [2024-09-15 16:36:40,268] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.94 | bwd_microstep: 2605.07 | bwd_inner_microstep: 2590.67 | bwd_allreduce_microstep: 14.35 | step_microstep: 7.77 [2024-09-15 16:36:40,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3385.98 | bwd: 6040.93 | bwd_inner: 6026.50 | bwd_allreduce: 14.37 | step: 8.01 74%|███████▎ | 344/468 [1:01:31<20:51, 10.09s/it] {'loss': 
0.1515, 'learning_rate': 6.95049746255557e-06, 'epoch': 2.2} 74%|███████▎ | 344/468 [1:01:31<20:51, 10.09s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876 [2024-09-15 16:36:45,653] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.47 | bwd_microstep: 3435.65 | bwd_inner_microstep: 3435.63 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878 [2024-09-15 16:36:51,035] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.42 | optimizer_gradients: 0.38 | optimizer_step: 0.40 [2024-09-15 16:36:51,036] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1904.91 | bwd_microstep: 3441.38 | bwd_inner_microstep: 3426.40 | bwd_allreduce_microstep: 14.93 | step_microstep: 8.15 [2024-09-15 16:36:51,036] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3823.36 | bwd: 6877.04 | bwd_inner: 6862.03 | bwd_allreduce: 14.95 | step: 8.41 74%|███████▎ | 345/468 [1:01:42<21:05, 10.29s/it] {'loss': 0.1573, 'learning_rate': 6.845702587963352e-06, 'epoch': 2.2} 74%|███████▎ | 345/468 [1:01:42<21:05, 10.29s/it]dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1388 [2024-09-15 16:36:55,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1464.86 | bwd_microstep: 2590.22 | bwd_inner_microstep: 2590.15 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876 [2024-09-15 16:37:01,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.39 | optimizer_step: 0.40 [2024-09-15 16:37:01,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.37 | bwd_microstep: 4663.44 | bwd_inner_microstep: 3441.11 | bwd_allreduce_microstep: 1222.27 | step_microstep: 7.48 [2024-09-15 16:37:01,738] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | fwd: 3384.20 | bwd: 7253.70 | bwd_inner: 6031.26 | bwd_allreduce: 1222.31 | step: 7.73 74%|███████▍ | 346/468 [1:01:53<21:10, 10.42s/it] {'loss': 0.2249, 'learning_rate': 6.741540371740347e-06, 'epoch': 2.21} 74%|███████▍ | 346/468 [1:01:53<21:10, 10.42s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878 [2024-09-15 16:37:07,090] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1901.43 | bwd_microstep: 3419.86 | bwd_inner_microstep: 3419.84 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1367 [2024-09-15 16:37:12,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.31 | optimizer_gradients: 0.39 | optimizer_step: 0.40 [2024-09-15 16:37:12,386] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.76 | bwd_microstep: 3791.30 | bwd_inner_microstep: 2591.95 | bwd_allreduce_microstep: 1199.29 | step_microstep: 7.74 [2024-09-15 16:37:12,386] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3370.17 | bwd: 7211.19 | bwd_inner: 6011.79 | bwd_allreduce: 1199.32 | step: 7.99 74%|███████▍ | 347/468 [1:02:03<21:08, 10.49s/it] {'loss': 0.2207, 'learning_rate': 6.6380158235876335e-06, 'epoch': 2.22} 74%|███████▍ | 347/468 [1:02:03<21:08, 10.49s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388 [2024-09-15 16:37:16,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1490.46 | bwd_microstep: 2626.55 | bwd_inner_microstep: 2626.52 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368 [2024-09-15 16:37:23,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.35 | optimizer_gradients: 0.37 | optimizer_step: 0.38 [2024-09-15 16:37:23,178] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.29 | bwd_microstep: 
5135.34 | bwd_inner_microstep: 2590.35 | bwd_allreduce_microstep: 2544.93 | step_microstep: 7.48 [2024-09-15 16:37:23,178] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2958.72 | bwd: 7761.90 | bwd_inner: 5216.87 | bwd_allreduce: 2544.96 | step: 7.71 74%|███████▍ | 348/468 [1:02:14<21:09, 10.58s/it] {'loss': 0.1194, 'learning_rate': 6.535133922537513e-06, 'epoch': 2.22} 74%|███████▍ | 348/468 [1:02:14<21:09, 10.58s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877 [2024-09-15 16:37:28,549] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1910.14 | bwd_microstep: 3430.95 | bwd_inner_microstep: 3430.92 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.15 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368 [2024-09-15 16:37:33,897] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.25 | optimizer_gradients: 0.39 | optimizer_step: 0.39 [2024-09-15 16:37:33,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.36 | bwd_microstep: 3846.70 | bwd_inner_microstep: 2587.97 | bwd_allreduce_microstep: 1258.67 | step_microstep: 7.48 [2024-09-15 16:37:33,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3378.47 | bwd: 7277.66 | bwd_inner: 6018.89 | bwd_allreduce: 1258.70 | step: 7.66 75%|███████▍ | 349/468 [1:02:25<21:03, 10.62s/it] {'loss': 0.1655, 'learning_rate': 6.4328996167140786e-06, 'epoch': 2.23} 75%|███████▍ | 349/468 [1:02:25<21:03, 10.62s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1877 [2024-09-15 16:37:39,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1930.70 | bwd_microstep: 3456.30 | bwd_inner_microstep: 3456.25 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368 [2024-09-15 16:37:44,615] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | 
optimizer_gradients: 0.39 | optimizer_step: 0.40 [2024-09-15 16:37:44,616] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.88 | bwd_microstep: 3797.45 | bwd_inner_microstep: 2588.53 | bwd_allreduce_microstep: 1208.86 | step_microstep: 7.47 [2024-09-15 16:37:44,616] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3398.54 | bwd: 7253.79 | bwd_inner: 6044.78 | bwd_allreduce: 1208.89 | step: 7.72 75%|███████▍ | 350/468 [1:02:36<20:56, 10.65s/it] {'loss': 0.1652, 'learning_rate': 6.331317823095184e-06, 'epoch': 2.24} 75%|███████▍ | 350/468 [1:02:36<20:56, 10.65s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1374 [2024-09-15 16:37:48,704] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.70 | bwd_microstep: 2589.07 | bwd_inner_microstep: 2588.95 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.34 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366 [2024-09-15 16:37:54,147] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.18 | optimizer_gradients: 0.39 | optimizer_step: 0.42 [2024-09-15 16:37:54,147] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.53 | bwd_microstep: 3941.29 | bwd_inner_microstep: 2586.49 | bwd_allreduce_microstep: 1354.74 | step_microstep: 7.41 [2024-09-15 16:37:54,147] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2935.20 | bwd: 6530.39 | bwd_inner: 5175.44 | bwd_allreduce: 1354.82 | step: 7.75 75%|███████▌ | 351/468 [1:02:45<20:06, 10.31s/it] {'loss': 0.2378, 'learning_rate': 6.230393427276e-06, 'epoch': 2.24} 75%|███████▌ | 351/468 [1:02:45<20:06, 10.31s/it]dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1390 [2024-09-15 16:37:58,227] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1461.33 | bwd_microstep: 2588.43 | bwd_inner_microstep: 2588.36 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23 dynamic ViT batch size: 42, 
images per sample: 5.25, dynamic token length: 1878 [2024-09-15 16:38:03,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.39 | optimizer_step: 0.40 [2024-09-15 16:38:03,671] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.74 | bwd_microstep: 3492.64 | bwd_inner_microstep: 3429.12 | bwd_allreduce_microstep: 63.47 | step_microstep: 7.62 [2024-09-15 16:38:03,671] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3377.05 | bwd: 6081.10 | bwd_inner: 6017.48 | bwd_allreduce: 63.51 | step: 7.85 75%|███████▌ | 352/468 [1:02:55<19:28, 10.08s/it] {'loss': 0.229, 'learning_rate': 6.130131283234031e-06, 'epoch': 2.25} 75%|███████▌ | 352/468 [1:02:55<19:28, 10.08s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870 [2024-09-15 16:38:09,039] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.67 | bwd_microstep: 3423.00 | bwd_inner_microstep: 3422.97 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368 [2024-09-15 16:38:13,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.49 | optimizer_gradients: 0.40 | optimizer_step: 0.41 [2024-09-15 16:38:13,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1470.81 | bwd_microstep: 2612.71 | bwd_inner_microstep: 2591.08 | bwd_allreduce_microstep: 21.58 | step_microstep: 7.90 [2024-09-15 16:38:13,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3384.46 | bwd: 6035.72 | bwd_inner: 6014.05 | bwd_allreduce: 21.61 | step: 8.14 75%|███████▌ | 353/468 [1:03:04<18:58, 9.90s/it] {'loss': 0.1397, 'learning_rate': 6.0305362130956504e-06, 'epoch': 2.26} 75%|███████▌ | 353/468 [1:03:04<18:58, 9.90s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389 [2024-09-15 16:38:17,309] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 1493.78 | bwd_microstep: 2629.02 | bwd_inner_microstep: 2628.95 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878 [2024-09-15 16:38:23,938] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.32 | optimizer_gradients: 0.39 | optimizer_step: 0.40 [2024-09-15 16:38:23,939] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.27 | bwd_microstep: 4673.70 | bwd_inner_microstep: 3442.50 | bwd_allreduce_microstep: 1231.14 | step_microstep: 7.64 [2024-09-15 16:38:23,939] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3413.03 | bwd: 7302.76 | bwd_inner: 6071.45 | bwd_allreduce: 1231.19 | step: 7.89 76%|███████▌ | 354/468 [1:03:15<19:18, 10.16s/it] {'loss': 0.1139, 'learning_rate': 5.931613006904196e-06, 'epoch': 2.26} 76%|███████▌ | 354/468 [1:03:15<19:18, 10.16s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369 [2024-09-15 16:38:28,031] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1469.76 | bwd_microstep: 2591.72 | bwd_inner_microstep: 2591.66 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368 [2024-09-15 16:38:33,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.45 | optimizer_step: 0.41 [2024-09-15 16:38:33,360] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.71 | bwd_microstep: 3826.98 | bwd_inner_microstep: 2593.09 | bwd_allreduce_microstep: 1233.83 | step_microstep: 7.50 [2024-09-15 16:38:33,360] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2938.44 | bwd: 6418.74 | bwd_inner: 5184.75 | bwd_allreduce: 1233.87 | step: 7.75 76%|███████▌ | 355/468 [1:03:24<18:43, 9.94s/it] {'loss': 0.1611, 'learning_rate': 5.8333664223895906e-06, 'epoch': 2.27} 76%|███████▌ | 355/468 [1:03:24<18:43, 
9.94s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388 [2024-09-15 16:38:37,505] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1490.21 | bwd_microstep: 2624.64 | bwd_inner_microstep: 2624.61 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369 [2024-09-15 16:38:44,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.45 | optimizer_gradients: 0.36 | optimizer_step: 0.39 [2024-09-15 16:38:44,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.69 | bwd_microstep: 5084.88 | bwd_inner_microstep: 2593.08 | bwd_allreduce_microstep: 2491.74 | step_microstep: 7.35 [2024-09-15 16:38:44,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2958.87 | bwd: 7709.53 | bwd_inner: 5217.69 | bwd_allreduce: 2491.77 | step: 7.60 76%|███████▌ | 356/468 [1:03:35<19:00, 10.18s/it] {'loss': 0.1784, 'learning_rate': 5.735801184739489e-06, 'epoch': 2.27} 76%|███████▌ | 356/468 [1:03:35<19:00, 10.18s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1387 [2024-09-15 16:38:48,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1474.10 | bwd_microstep: 2600.57 | bwd_inner_microstep: 2600.54 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.27 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878 [2024-09-15 16:38:54,718] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.31 | optimizer_gradients: 0.39 | optimizer_step: 0.40 [2024-09-15 16:38:54,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1900.57 | bwd_microstep: 4582.26 | bwd_inner_microstep: 3420.33 | bwd_allreduce_microstep: 1161.86 | step_microstep: 7.51 [2024-09-15 16:38:54,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3374.64 | bwd: 7182.85 | bwd_inner: 6020.88 | bwd_allreduce: 1161.89 | step: 7.78 
76%|███████▋ | 357/468 [1:03:46<19:04, 10.31s/it] {'loss': 0.2004, 'learning_rate': 5.638921986372064e-06, 'epoch': 2.28} 76%|███████▋ | 357/468 [1:03:46<19:04, 10.31s/it]dynamic ViT batch size: 41, images per sample: 5.125, dynamic token length: 1870 [2024-09-15 16:39:00,062] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1902.23 | bwd_microstep: 3410.34 | bwd_inner_microstep: 3410.31 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.11 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1367 [2024-09-15 16:39:05,358] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.39 | optimizer_step: 0.40 [2024-09-15 16:39:05,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1466.59 | bwd_microstep: 3794.28 | bwd_inner_microstep: 2588.35 | bwd_allreduce_microstep: 1205.87 | step_microstep: 7.48 [2024-09-15 16:39:05,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3368.79 | bwd: 7204.63 | bwd_inner: 5998.66 | bwd_allreduce: 1205.90 | step: 7.62 76%|███████▋ | 358/468 [1:03:56<19:05, 10.41s/it] {'loss': 0.1553, 'learning_rate': 5.542733486710299e-06, 'epoch': 2.29} 76%|███████▋ | 358/468 [1:03:56<19:05, 10.41s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870 [2024-09-15 16:39:10,723] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1910.09 | bwd_microstep: 3423.51 | bwd_inner_microstep: 3423.48 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24 dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878 [2024-09-15 16:39:16,170] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.42 | optimizer_gradients: 0.35 | optimizer_step: 0.40 [2024-09-15 16:39:16,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1934.17 | bwd_microstep: 3477.10 | bwd_inner_microstep: 3461.07 | bwd_allreduce_microstep: 15.98 | step_microstep: 8.07 
[2024-09-15 16:39:16,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3844.24 | bwd: 6900.62 | bwd_inner: 6884.55 | bwd_allreduce: 16.00 | step: 8.31 77%|███████▋ | 359/468 [1:04:07<19:07, 10.53s/it] {'loss': 0.1787, 'learning_rate': 5.447240311957891e-06, 'epoch': 2.29} 77%|███████▋ | 359/468 [1:04:07<19:07, 10.53s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876 [2024-09-15 16:39:21,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.28 | bwd_microstep: 3431.16 | bwd_inner_microstep: 3431.14 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870 [2024-09-15 16:39:26,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.41 | optimizer_gradients: 0.36 | optimizer_step: 0.42 [2024-09-15 16:39:26,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1914.68 | bwd_microstep: 3445.48 | bwd_inner_microstep: 3429.84 | bwd_allreduce_microstep: 15.58 | step_microstep: 7.88 [2024-09-15 16:39:26,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3832.93 | bwd: 6876.66 | bwd_inner: 6860.98 | bwd_allreduce: 15.61 | step: 8.13 77%|███████▋ | 360/468 [1:04:18<19:05, 10.60s/it] {'loss': 0.1856, 'learning_rate': 5.352447054876755e-06, 'epoch': 2.3} 77%|███████▋ | 360/468 [1:04:18<19:05, 10.60s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389 [2024-09-15 16:39:31,096] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1491.73 | bwd_microstep: 2626.96 | bwd_inner_microstep: 2626.90 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23 dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1880 [2024-09-15 16:39:37,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.45 | optimizer_gradients: 0.36 | optimizer_step: 0.37 [2024-09-15 16:39:37,697] [INFO] [logging.py:96:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 1937.44 | bwd_microstep: 4628.76 | bwd_inner_microstep: 3462.23 | bwd_allreduce_microstep: 1166.47 | step_microstep: 7.93 [2024-09-15 16:39:37,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3429.14 | bwd: 7255.76 | bwd_inner: 6089.13 | bwd_allreduce: 1166.50 | step: 8.16 77%|███████▋ | 361/468 [1:04:29<18:59, 10.65s/it] {'loss': 0.2069, 'learning_rate': 5.258358274566142e-06, 'epoch': 2.31} 77%|███████▋ | 361/468 [1:04:29<18:59, 10.65s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369 [2024-09-15 16:39:41,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1469.42 | bwd_microstep: 2590.14 | bwd_inner_microstep: 2590.07 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1870 [2024-09-15 16:39:48,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.50 | optimizer_gradients: 0.36 | optimizer_step: 0.38 [2024-09-15 16:39:48,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1901.48 | bwd_microstep: 4608.62 | bwd_inner_microstep: 3410.97 | bwd_allreduce_microstep: 1197.59 | step_microstep: 7.34 [2024-09-15 16:39:48,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3370.86 | bwd: 7198.81 | bwd_inner: 6001.04 | bwd_allreduce: 1197.64 | step: 7.59 77%|███████▋ | 362/468 [1:04:39<18:48, 10.64s/it] {'loss': 0.2023, 'learning_rate': 5.164978496243354e-06, 'epoch': 2.31} 77%|███████▋ | 362/468 [1:04:39<18:48, 10.64s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878 [2024-09-15 16:39:53,753] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1931.41 | bwd_microstep: 3459.42 | bwd_inner_microstep: 3459.39 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.18 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878 [2024-09-15 16:39:59,159] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.43 | optimizer_gradients: 0.36 | optimizer_step: 0.40 [2024-09-15 16:39:59,160] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.35 | bwd_microstep: 3451.30 | bwd_inner_microstep: 3436.34 | bwd_allreduce_microstep: 14.91 | step_microstep: 7.61 [2024-09-15 16:39:59,160] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3850.73 | bwd: 6910.73 | bwd_inner: 6895.73 | bwd_allreduce: 14.93 | step: 7.82 78%|███████▊ | 363/468 [1:04:50<18:43, 10.70s/it] {'loss': 0.1453, 'learning_rate': 5.072312211026125e-06, 'epoch': 2.32} 78%|███████▊ | 363/468 [1:04:50<18:43, 10.70s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388 [2024-09-15 16:40:03,311] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1492.51 | bwd_microstep: 2627.24 | bwd_inner_microstep: 2627.14 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368 [2024-09-15 16:40:08,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.39 | optimizer_step: 0.40 [2024-09-15 16:40:08,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.64 | bwd_microstep: 3767.44 | bwd_inner_microstep: 2593.08 | bwd_allreduce_microstep: 1174.31 | step_microstep: 7.45 [2024-09-15 16:40:08,586] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2964.12 | bwd: 6394.73 | bwd_inner: 5220.21 | bwd_allreduce: 1174.38 | step: 7.71 78%|███████▊ | 364/468 [1:04:59<17:52, 10.32s/it] {'loss': 0.1377, 'learning_rate': 4.980363875716592e-06, 'epoch': 2.33} 78%|███████▊ | 364/468 [1:04:59<17:52, 10.32s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1877 [2024-09-15 16:40:13,909] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1888.39 | bwd_microstep: 3404.61 | bwd_inner_microstep: 3404.52 | 
bwd_allreduce_microstep: 0.01 | step_microstep: 0.26 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877 [2024-09-15 16:40:19,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.38 | optimizer_gradients: 0.36 | optimizer_step: 0.40 [2024-09-15 16:40:19,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.99 | bwd_microstep: 3465.37 | bwd_inner_microstep: 3438.17 | bwd_allreduce_microstep: 27.15 | step_microstep: 7.69 [2024-09-15 16:40:19,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3808.34 | bwd: 6870.02 | bwd_inner: 6842.68 | bwd_allreduce: 27.21 | step: 7.96 78%|███████▊ | 365/468 [1:05:10<17:55, 10.45s/it] {'loss': 0.1887, 'learning_rate': 4.889137912586972e-06, 'epoch': 2.33} 78%|███████▊ | 365/468 [1:05:10<17:55, 10.45s/it]dynamic ViT batch size: 34, images per sample: 4.25, dynamic token length: 1870 [2024-09-15 16:40:24,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1855.53 | bwd_microstep: 3371.39 | bwd_inner_microstep: 3371.24 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.24 dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1366 [2024-09-15 16:40:30,054] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.40 | optimizer_step: 0.40 [2024-09-15 16:40:30,055] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1456.00 | bwd_microstep: 3978.00 | bwd_inner_microstep: 2572.46 | bwd_allreduce_microstep: 1405.46 | step_microstep: 7.53 [2024-09-15 16:40:30,055] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3311.50 | bwd: 7349.43 | bwd_inner: 5943.71 | bwd_allreduce: 1405.58 | step: 7.78 78%|███████▊ | 366/468 [1:05:21<17:53, 10.53s/it] {'loss': 0.0932, 'learning_rate': 4.7986387091668365e-06, 'epoch': 2.34} 78%|███████▊ | 366/468 [1:05:21<17:53, 10.53s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388 
[2024-09-15 16:40:34,205] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1491.45 | bwd_microstep: 2627.64 | bwd_inner_microstep: 2627.51 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:40:40,724] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.43 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:40:40,724] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.96 | bwd_microstep: 4565.83 | bwd_inner_microstep: 3439.38 | bwd_allreduce_microstep: 1126.38 | step_microstep: 7.43
[2024-09-15 16:40:40,724] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3410.37 | bwd: 7193.51 | bwd_inner: 6066.89 | bwd_allreduce: 1126.48 | step: 7.67
78%|███████▊ | 367/468 [1:05:32<17:47, 10.57s/it] {'loss': 0.1506, 'learning_rate': 4.708870618032133e-06, 'epoch': 2.35}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:40:46,105] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.88 | bwd_microstep: 3434.03 | bwd_inner_microstep: 3434.00 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:40:51,515] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.62 | optimizer_gradients: 0.34 | optimizer_step: 0.39
[2024-09-15 16:40:51,515] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.84 | bwd_microstep: 3456.55 | bwd_inner_microstep: 3441.49 | bwd_allreduce_microstep: 15.02 | step_microstep: 8.25
[2024-09-15 16:40:51,515] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3831.69 | bwd: 6890.59 | bwd_inner: 6875.49 | bwd_allreduce: 15.04 | step: 8.48
79%|███████▊ | 368/468 [1:05:42<17:43, 10.64s/it] {'loss': 0.1293, 'learning_rate': 4.619837956595825e-06, 'epoch': 2.35}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:40:56,903] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.98 | bwd_microstep: 3437.30 | bwd_inner_microstep: 3437.28 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.17
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1878
[2024-09-15 16:41:02,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 4.39 | optimizer_gradients: 0.34 | optimizer_step: 0.38
[2024-09-15 16:41:02,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1889.14 | bwd_microstep: 3424.04 | bwd_inner_microstep: 3408.87 | bwd_allreduce_microstep: 15.11 | step_microstep: 11.36
[2024-09-15 16:41:02,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3809.09 | bwd: 6861.36 | bwd_inner: 6846.15 | bwd_allreduce: 15.14 | step: 11.55
79%|███████▉ | 369/468 [1:05:53<17:36, 10.67s/it] {'loss': 0.1505, 'learning_rate': 4.531545006900244e-06, 'epoch': 2.36}
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:41:07,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1936.30 | bwd_microstep: 3464.38 | bwd_inner_microstep: 3464.35 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.07
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:41:12,985] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.34 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:41:12,985] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1470.74 | bwd_microstep: 3794.39 | bwd_inner_microstep: 2594.10 | bwd_allreduce_microstep: 1200.22 | step_microstep: 7.33
[2024-09-15 16:41:12,985] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3407.02 | bwd: 7258.77 | bwd_inner: 6058.45 | bwd_allreduce: 1200.25 | step: 7.43
79%|███████▉ | 370/468 [1:06:04<17:27, 10.69s/it] {'loss': 0.1176, 'learning_rate': 4.443996015411151e-06, 'epoch': 2.36}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1901
[2024-09-15 16:41:18,424] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1942.39 | bwd_microstep: 3465.91 | bwd_inner_microstep: 3465.88 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.16
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:41:23,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.47 | optimizer_gradients: 0.36 | optimizer_step: 0.40
[2024-09-15 16:41:23,837] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1920.23 | bwd_microstep: 3455.25 | bwd_inner_microstep: 3439.98 | bwd_allreduce_microstep: 15.22 | step_microstep: 8.20
[2024-09-15 16:41:23,837] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3862.59 | bwd: 6921.17 | bwd_inner: 6905.87 | bwd_allreduce: 15.24 | step: 8.39
79%|███████▉ | 371/468 [1:06:15<17:21, 10.74s/it] {'loss': 0.2187, 'learning_rate': 4.357195192813504e-06, 'epoch': 2.37}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:41:27,992] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1494.27 | bwd_microstep: 2629.80 | bwd_inner_microstep: 2629.62 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.25
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:41:33,406] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.35 | optimizer_gradients: 0.38 | optimizer_step: 0.42
[2024-09-15 16:41:33,407] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1905.80 | bwd_microstep: 3474.49 | bwd_inner_microstep: 3433.71 | bwd_allreduce_microstep: 40.73 | step_microstep: 7.61
[2024-09-15 16:41:33,407] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3400.05 | bwd: 6104.33 | bwd_inner: 6063.34 | bwd_allreduce: 40.82 | step: 7.87
79%|███████▉ | 372/468 [1:06:24<16:37, 10.39s/it] {'loss': 0.1538, 'learning_rate': 4.271146713808927e-06, 'epoch': 2.38}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:41:37,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1495.89 | bwd_microstep: 2631.95 | bwd_inner_microstep: 2631.83 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.24
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1878
[2024-09-15 16:41:42,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.32 | optimizer_gradients: 0.35 | optimizer_step: 0.41
[2024-09-15 16:41:42,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1892.44 | bwd_microstep: 3421.97 | bwd_inner_microstep: 3408.15 | bwd_allreduce_microstep: 13.77 | step_microstep: 7.67
[2024-09-15 16:41:42,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3388.30 | bwd: 6053.96 | bwd_inner: 6039.99 | bwd_allreduce: 13.85 | step: 7.92
80%|███████▉ | 373/468 [1:06:34<16:01, 10.12s/it] {'loss': 0.2165, 'learning_rate': 4.185854716914952e-06, 'epoch': 2.38}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1902
[2024-09-15 16:41:48,360] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1945.27 | bwd_microstep: 3467.13 | bwd_inner_microstep: 3467.10 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.33
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1369
[2024-09-15 16:41:53,665] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:41:53,665] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1459.08 | bwd_microstep: 3814.07 | bwd_inner_microstep: 2575.31 | bwd_allreduce_microstep: 1238.70 | step_microstep: 7.70
[2024-09-15 16:41:53,666] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3404.32 | bwd: 7281.21 | bwd_inner: 6042.41 | bwd_allreduce: 1238.73 | step: 8.03
80%|███████▉ | 374/468 [1:06:45<16:09, 10.31s/it] {'loss': 0.1662, 'learning_rate': 4.1013233042659606e-06, 'epoch': 2.39}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:41:59,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1920.50 | bwd_microstep: 3439.76 | bwd_inner_microstep: 3439.68 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 39, images per sample: 4.875, dynamic token length: 1876
[2024-09-15 16:42:04,410] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.39 | optimizer_gradients: 0.35 | optimizer_step: 0.39
[2024-09-15 16:42:04,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1895.03 | bwd_microstep: 3424.18 | bwd_inner_microstep: 3409.16 | bwd_allreduce_microstep: 14.98 | step_microstep: 8.17
[2024-09-15 16:42:04,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3815.51 | bwd: 6863.97 | bwd_inner: 6848.83 | bwd_allreduce: 15.02 | step: 8.42
80%|████████ | 375/468 [1:06:55<16:11, 10.44s/it] {'loss': 0.143, 'learning_rate': 4.017556541415888e-06, 'epoch': 2.4}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:42:08,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1496.60 | bwd_microstep: 2632.19 | bwd_inner_microstep: 2632.03 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:42:15,133] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:42:15,134] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.58 | bwd_microstep: 4608.11 | bwd_inner_microstep: 3443.29 | bwd_allreduce_microstep: 1164.76 | step_microstep: 7.48
[2024-09-15 16:42:15,134] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3416.15 | bwd: 7240.34 | bwd_inner: 6075.32 | bwd_allreduce: 1164.89 | step: 7.74
80%|████████ | 376/468 [1:07:06<16:08, 10.53s/it] {'loss': 0.1883, 'learning_rate': 3.9345584571427055e-06, 'epoch': 2.4}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:42:19,226] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.75 | bwd_microstep: 2592.15 | bwd_inner_microstep: 2592.07 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:42:24,590] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:42:24,590] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1469.59 | bwd_microstep: 3859.26 | bwd_inner_microstep: 2594.41 | bwd_allreduce_microstep: 1264.80 | step_microstep: 7.51
[2024-09-15 16:42:24,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2938.31 | bwd: 6451.44 | bwd_inner: 5186.48 | bwd_allreduce: 1264.84 | step: 7.76
81%|████████ | 377/468 [1:07:15<15:28, 10.21s/it] {'loss': 0.1624, 'learning_rate': 3.852333043254639e-06, 'epoch': 2.41}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:42:28,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1491.89 | bwd_microstep: 2625.76 | bwd_inner_microstep: 2625.73 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:42:34,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.42 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:42:34,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.59 | bwd_microstep: 3454.08 | bwd_inner_microstep: 3438.21 | bwd_allreduce_microstep: 15.82 | step_microstep: 7.73
[2024-09-15 16:42:34,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3411.46 | bwd: 6079.85 | bwd_inner: 6063.94 | bwd_allreduce: 15.85 | step: 7.79
81%|████████ | 378/468 [1:07:25<15:00, 10.01s/it] {'loss': 0.1824, 'learning_rate': 3.7708842543981928e-06, 'epoch': 2.42}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:42:38,296] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1494.69 | bwd_microstep: 2627.58 | bwd_inner_microstep: 2627.47 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:42:43,714] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.67 | optimizer_gradients: 0.34 | optimizer_step: 0.38
[2024-09-15 16:42:43,715] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1920.70 | bwd_microstep: 3463.30 | bwd_inner_microstep: 3447.99 | bwd_allreduce_microstep: 15.26 | step_microstep: 7.79
[2024-09-15 16:42:43,715] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3415.36 | bwd: 6090.92 | bwd_inner: 6075.45 | bwd_allreduce: 15.34 | step: 8.03
81%|████████ | 379/468 [1:07:35<14:39, 9.88s/it] {'loss': 0.1933, 'learning_rate': 3.690216007867944e-06, 'epoch': 2.42}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:42:47,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1494.19 | bwd_microstep: 2628.85 | bwd_inner_microstep: 2628.82 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 46, images per sample: 5.75, dynamic token length: 1878
[2024-09-15 16:42:53,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.42 | optimizer_gradients: 0.34 | optimizer_step: 0.37
[2024-09-15 16:42:53,358] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1950.14 | bwd_microstep: 3504.02 | bwd_inner_microstep: 3488.45 | bwd_allreduce_microstep: 15.52 | step_microstep: 7.67
[2024-09-15 16:42:53,358] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3444.31 | bwd: 6132.89 | bwd_inner: 6117.28 | bwd_allreduce: 15.54 | step: 7.90
81%|████████ | 380/468 [1:07:44<14:23, 9.81s/it] {'loss': 0.1557, 'learning_rate': 3.6103321834181437e-06, 'epoch': 2.43}
dynamic ViT batch size: 39, images per sample: 4.875, dynamic token length: 1369
[2024-09-15 16:42:57,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1461.96 | bwd_microstep: 2583.74 | bwd_inner_microstep: 2583.71 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:43:02,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.62 | optimizer_gradients: 0.34 | optimizer_step: 0.38
[2024-09-15 16:43:02,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.82 | bwd_microstep: 3453.62 | bwd_inner_microstep: 3438.61 | bwd_allreduce_microstep: 14.96 | step_microstep: 7.61
[2024-09-15 16:43:02,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3380.76 | bwd: 6037.37 | bwd_inner: 6022.32 | bwd_allreduce: 14.99 | step: 7.84
81%|████████▏ | 381/468 [1:07:54<14:04, 9.71s/it] {'loss': 0.143, 'learning_rate': 3.5312366230761154e-06, 'epoch': 2.43}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:43:06,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1494.56 | bwd_microstep: 2632.87 | bwd_inner_microstep: 2632.84 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 6.0, dynamic token length: 1901
[2024-09-15 16:43:12,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.58 | optimizer_gradients: 0.35 | optimizer_step: 0.41
[2024-09-15 16:43:12,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1993.16 | bwd_microstep: 3555.40 | bwd_inner_microstep: 3540.40 | bwd_allreduce_microstep: 14.95 | step_microstep: 7.84
[2024-09-15 16:43:12,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3487.70 | bwd: 6188.28 | bwd_inner: 6173.25 | bwd_allreduce: 14.97 | step: 8.07
82%|████████▏ | 382/468 [1:08:03<13:55, 9.72s/it] {'loss': 0.1588, 'learning_rate': 3.452933130957481e-06, 'epoch': 2.44}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:43:17,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.90 | bwd_microstep: 3443.21 | bwd_inner_microstep: 3443.18 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:43:23,358] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:43:23,358] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1474.22 | bwd_microstep: 3870.21 | bwd_inner_microstep: 2596.86 | bwd_allreduce_microstep: 1273.28 | step_microstep: 7.54
[2024-09-15 16:43:23,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3394.10 | bwd: 7313.43 | bwd_inner: 6040.05 | bwd_allreduce: 1273.31 | step: 7.79
82%|████████▏ | 383/468 [1:08:14<14:13, 10.04s/it] {'loss': 0.1964, 'learning_rate': 3.375425473083185e-06, 'epoch': 2.45}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1390
[2024-09-15 16:43:27,514] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1494.38 | bwd_microstep: 2630.47 | bwd_inner_microstep: 2630.44 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:43:32,830] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.23 | optimizer_gradients: 0.38 | optimizer_step: 0.40
[2024-09-15 16:43:32,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1473.68 | bwd_microstep: 3807.60 | bwd_inner_microstep: 2596.95 | bwd_allreduce_microstep: 1210.60 | step_microstep: 7.25
[2024-09-15 16:43:32,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2968.03 | bwd: 6438.08 | bwd_inner: 5227.39 | bwd_allreduce: 1210.62 | step: 7.48
82%|████████▏ | 384/468 [1:08:24<13:48, 9.87s/it] {'loss': 0.1688, 'learning_rate': 3.2987173771983816e-06, 'epoch': 2.45}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:43:38,229] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1920.75 | bwd_microstep: 3446.84 | bwd_inner_microstep: 3446.81 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:43:43,637] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.43 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:43:43,638] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1921.19 | bwd_microstep: 3451.13 | bwd_inner_microstep: 3435.91 | bwd_allreduce_microstep: 15.17 | step_microstep: 8.15
[2024-09-15 16:43:43,638] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3841.91 | bwd: 6897.98 | bwd_inner: 6882.72 | bwd_allreduce: 15.19 | step: 8.39
82%|████████▏ | 385/468 [1:08:35<14:02, 10.15s/it] {'loss': 0.1837, 'learning_rate': 3.2228125325931514e-06, 'epoch': 2.46}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:43:49,013] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.53 | bwd_microstep: 3428.44 | bwd_inner_microstep: 3428.41 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1877
[2024-09-15 16:43:54,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.51 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:43:54,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1902.27 | bwd_microstep: 3436.67 | bwd_inner_microstep: 3424.11 | bwd_allreduce_microstep: 12.51 | step_microstep: 7.39
[2024-09-15 16:43:54,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3817.77 | bwd: 6865.15 | bwd_inner: 6852.53 | bwd_allreduce: 12.53 | step: 7.64
82%|████████▏ | 386/468 [1:08:45<14:07, 10.33s/it] {'loss': 0.1463, 'learning_rate': 3.1477145899250326e-06, 'epoch': 2.47}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:43:58,549] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1498.56 | bwd_microstep: 2632.33 | bwd_inner_microstep: 2632.14 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.26
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1366
[2024-09-15 16:44:05,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.25 | optimizer_gradients: 0.42 | optimizer_step: 0.40
[2024-09-15 16:44:05,208] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1458.35 | bwd_microstep: 5164.17 | bwd_inner_microstep: 2577.02 | bwd_allreduce_microstep: 2587.09 | step_microstep: 7.52
[2024-09-15 16:44:05,208] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2956.89 | bwd: 7796.54 | bwd_inner: 5209.16 | bwd_allreduce: 2587.24 | step: 7.78
83%|████████▎ | 387/468 [1:08:56<14:08, 10.48s/it] {'loss': 0.2133, 'learning_rate': 3.073427161043492e-06, 'epoch': 2.47}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:44:09,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1492.46 | bwd_microstep: 2626.25 | bwd_inner_microstep: 2626.22 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:44:14,765] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.40 | optimizer_gradients: 0.36 | optimizer_step: 0.40
[2024-09-15 16:44:14,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.31 | bwd_microstep: 3455.06 | bwd_inner_microstep: 3440.01 | bwd_allreduce_microstep: 15.00 | step_microstep: 7.81
[2024-09-15 16:44:14,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3409.75 | bwd: 6081.31 | bwd_inner: 6066.23 | bwd_allreduce: 15.02 | step: 8.04
83%|████████▎ | 388/468 [1:09:06<13:36, 10.20s/it] {'loss': 0.1345, 'learning_rate': 2.9999538188161705e-06, 'epoch': 2.48}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:44:18,919] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1493.29 | bwd_microstep: 2629.55 | bwd_inner_microstep: 2629.47 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:44:25,528] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:44:25,528] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1921.66 | bwd_microstep: 4652.83 | bwd_inner_microstep: 3442.99 | bwd_allreduce_microstep: 1209.77 | step_microstep: 7.49
[2024-09-15 16:44:25,529] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3414.93 | bwd: 7282.42 | bwd_inner: 6072.47 | bwd_allreduce: 1209.82 | step: 7.74
83%|████████▎ | 389/468 [1:09:16<13:39, 10.37s/it] {'loss': 0.2226, 'learning_rate': 2.927298096957063e-06, 'epoch': 2.49}
dynamic ViT batch size: 46, images per sample: 5.75, dynamic token length: 1905
[2024-09-15 16:44:31,049] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1968.89 | bwd_microstep: 3520.78 | bwd_inner_microstep: 3520.75 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1368
[2024-09-15 16:44:36,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:44:36,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1442.78 | bwd_microstep: 3698.57 | bwd_inner_microstep: 2558.77 | bwd_allreduce_microstep: 1139.73 | step_microstep: 7.51
[2024-09-15 16:44:36,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3411.64 | bwd: 7219.36 | bwd_inner: 6079.53 | bwd_allreduce: 1139.76 | step: 7.58
83%|████████▎ | 390/468 [1:09:27<13:36, 10.47s/it] {'loss': 0.1664, 'learning_rate': 2.8554634898565668e-06, 'epoch': 2.49}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:44:40,373] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1492.10 | bwd_microstep: 2625.86 | bwd_inner_microstep: 2625.80 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1877
[2024-09-15 16:44:45,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.57 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:44:45,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1937.47 | bwd_microstep: 3476.78 | bwd_inner_microstep: 3461.67 | bwd_allreduce_microstep: 15.06 | step_microstep: 7.89
[2024-09-15 16:44:45,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3429.54 | bwd: 6102.68 | bwd_inner: 6087.47 | bwd_allreduce: 15.09 | step: 8.13
84%|████████▎ | 391/468 [1:09:37<13:05, 10.21s/it] {'loss': 0.1388, 'learning_rate': 2.784453452413405e-06, 'epoch': 2.5}
dynamic ViT batch size: 43, images per sample: 5.375, dynamic token length: 1878
[2024-09-15 16:44:51,235] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1928.31 | bwd_microstep: 3451.60 | bwd_inner_microstep: 3451.57 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1880
[2024-09-15 16:44:56,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.38 | optimizer_gradients: 0.35 | optimizer_step: 0.40
[2024-09-15 16:44:56,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1904.35 | bwd_microstep: 3438.69 | bwd_inner_microstep: 3423.73 | bwd_allreduce_microstep: 14.92 | step_microstep: 8.54
[2024-09-15 16:44:56,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3832.63 | bwd: 6890.31 | bwd_inner: 6875.31 | bwd_allreduce: 14.94 | step: 8.78
84%|████████▍ | 392/468 [1:09:48<13:09, 10.38s/it] {'loss': 0.2065, 'learning_rate': 2.714271399868473e-06, 'epoch': 2.5}
dynamic ViT batch size: 46, images per sample: 5.75, dynamic token length: 1878
[2024-09-15 16:45:02,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1952.48 | bwd_microstep: 3486.77 | bwd_inner_microstep: 3486.74 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.16
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:45:07,485] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.70 | optimizer_gradients: 0.34 | optimizer_step: 0.38
[2024-09-15 16:45:07,486] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.57 | bwd_microstep: 3448.34 | bwd_inner_microstep: 3433.45 | bwd_allreduce_microstep: 14.81 | step_microstep: 8.25
[2024-09-15 16:45:07,486] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3871.03 | bwd: 6935.12 | bwd_inner: 6920.19 | bwd_allreduce: 14.85 | step: 8.44
84%|████████▍ | 393/468 [1:09:58<13:09, 10.53s/it] {'loss': 0.1353, 'learning_rate': 2.6449207076405857e-06, 'epoch': 2.51}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:45:12,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1903.98 | bwd_microstep: 3417.56 | bwd_inner_microstep: 3417.53 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:45:18,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.24 | optimizer_gradients: 0.47 | optimizer_step: 0.41
[2024-09-15 16:45:18,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1905.62 | bwd_microstep: 3449.65 | bwd_inner_microstep: 3436.72 | bwd_allreduce_microstep: 12.88 | step_microstep: 10.02
[2024-09-15 16:45:18,231] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3809.57 | bwd: 6867.21 | bwd_inner: 6854.24 | bwd_allreduce: 12.90 | step: 10.09
84%|████████▍ | 394/468 [1:10:09<13:04, 10.60s/it] {'loss': 0.1601, 'learning_rate': 2.57640471116412e-06, 'epoch': 2.52}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:45:23,636] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.75 | bwd_microstep: 3451.42 | bwd_inner_microstep: 3451.40 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.18
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:45:29,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.48 | optimizer_gradients: 0.33 | optimizer_step: 0.39
[2024-09-15 16:45:29,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1921.27 | bwd_microstep: 3459.02 | bwd_inner_microstep: 3444.43 | bwd_allreduce_microstep: 14.54 | step_microstep: 8.36
[2024-09-15 16:45:29,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3839.99 | bwd: 6910.45 | bwd_inner: 6895.83 | bwd_allreduce: 14.56 | step: 8.57
84%|████████▍ | 395/468 [1:10:20<12:58, 10.66s/it] {'loss': 0.2478, 'learning_rate': 2.508726705728617e-06, 'epoch': 2.52}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:45:33,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1493.15 | bwd_microstep: 2638.98 | bwd_inner_microstep: 2638.95 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:45:38,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 4.09 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:45:38,632] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1921.59 | bwd_microstep: 3458.62 | bwd_inner_microstep: 3445.35 | bwd_allreduce_microstep: 13.22 | step_microstep: 9.04
[2024-09-15 16:45:38,632] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3414.72 | bwd: 6097.61 | bwd_inner: 6084.30 | bwd_allreduce: 13.24 | step: 9.21
85%|████████▍ | 396/468 [1:10:30<12:24, 10.34s/it] {'loss': 0.1515, 'learning_rate': 2.441889946320266e-06, 'epoch': 2.53}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:45:42,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1494.67 | bwd_microstep: 2636.36 | bwd_inner_microstep: 2636.34 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:45:48,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.72 | optimizer_gradients: 0.36 | optimizer_step: 0.37
[2024-09-15 16:45:48,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1472.66 | bwd_microstep: 3762.03 | bwd_inner_microstep: 2597.70 | bwd_allreduce_microstep: 1164.27 | step_microstep: 7.52
[2024-09-15 16:45:48,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2967.30 | bwd: 6398.41 | bwd_inner: 5234.04 | bwd_allreduce: 1164.30 | step: 7.57
85%|████████▍ | 397/468 [1:10:39<11:54, 10.06s/it] {'loss': 0.11, 'learning_rate': 2.3758976474653904e-06, 'epoch': 2.54}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:45:52,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1494.61 | bwd_microstep: 2632.80 | bwd_inner_microstep: 2632.77 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:45:57,538] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.26 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:45:57,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.99 | bwd_microstep: 3816.40 | bwd_inner_microstep: 2602.68 | bwd_allreduce_microstep: 1213.65 | step_microstep: 7.48
[2024-09-15 16:45:57,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2966.58 | bwd: 6449.21 | bwd_inner: 5235.46 | bwd_allreduce: 1213.68 | step: 7.53
85%|████████▌ | 398/468 [1:10:48<11:32, 9.89s/it] {'loss': 0.179, 'learning_rate': 2.310752983075819e-06, 'epoch': 2.54}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:46:02,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.91 | bwd_microstep: 3429.15 | bwd_inner_microstep: 3429.12 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:46:08,275] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.40 | optimizer_step: 0.44
[2024-09-15 16:46:08,275] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.65 | bwd_microstep: 3856.48 | bwd_inner_microstep: 2594.67 | bwd_allreduce_microstep: 1261.75 | step_microstep: 7.72
[2024-09-15 16:46:08,275] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3389.53 | bwd: 7285.64 | bwd_inner: 6023.79 | bwd_allreduce: 1261.78 | step: 7.77
85%|████████▌ | 399/468 [1:10:59<11:39, 10.14s/it] {'loss': 0.1746, 'learning_rate': 2.2464590862962443e-06, 'epoch': 2.55}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1390
[2024-09-15 16:46:12,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1495.03 | bwd_microstep: 2634.76 | bwd_inner_microstep: 2634.65 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:46:19,047] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:46:19,048] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1904.23 | bwd_microstep: 4673.80 | bwd_inner_microstep: 3427.19 | bwd_allreduce_microstep: 1246.54 | step_microstep: 7.53
[2024-09-15 16:46:19,048] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3399.24 | bwd: 7308.60 | bwd_inner: 6061.84 | bwd_allreduce: 1246.63 | step: 7.78
85%|████████▌ | 400/468 [1:11:10<11:42, 10.33s/it] {'loss': 0.1632, 'learning_rate': 2.1830190493535385e-06, 'epoch': 2.56}
[INFO|trainer.py:2936] 2024-09-15 16:46:26,282 >> Saving model checkpoint to work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400
[INFO|configuration_utils.py:473] 2024-09-15 16:46:26,284 >> Configuration saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/config.json
[INFO|configuration_utils.py:594] 2024-09-15 16:46:26,284 >> Configuration saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/generation_config.json
[INFO|modeling_utils.py:2501] 2024-09-15 16:46:47,703 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2433] 2024-09-15 16:46:47,705 >> tokenizer config file saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/tokenizer_config.json
[INFO|tokenization_utils_base.py:2442] 2024-09-15 16:46:47,705 >> Special tokens file saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/special_tokens_map.json
[INFO|tokenization_utils_base.py:2493] 2024-09-15 16:46:47,705 >> added tokens file saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/added_tokens.json
[2024-09-15 16:46:48,662] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step400 is about to be saved!
[2024-09-15 16:46:48,690] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/global_step400/mp_rank_00_model_states.pt
[2024-09-15 16:46:48,690] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/global_step400/mp_rank_00_model_states.pt...
[2024-09-15 16:47:06,164] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/global_step400/mp_rank_00_model_states.pt.
[2024-09-15 16:47:06,167] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/global_step400/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2024-09-15 16:47:06,326] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/global_step400/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2024-09-15 16:47:06,326] [INFO] [engine.py:3488:_save_zero_checkpoint] zero checkpoint saved work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/global_step400/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2024-09-15 16:47:06,326] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step400 is ready now!
[INFO|trainer.py:3028] 2024-09-15 16:47:06,338 >> Deleting older checkpoint [work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/checkpoint-200] due to args.save_total_limit
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1877
[2024-09-15 16:47:13,743] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1870.20 | bwd_microstep: 3387.36 | bwd_inner_microstep: 3387.31 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.25
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:47:17,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.28 | optimizer_gradients: 0.36 | optimizer_step: 0.40
[2024-09-15 16:47:17,817] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1452.23 | bwd_microstep: 2585.92 | bwd_inner_microstep: 2562.88 | bwd_allreduce_microstep: 23.00 | step_microstep: 8.05
[2024-09-15 16:47:17,817] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3322.40 | bwd: 5973.32 | bwd_inner: 5950.18 | bwd_allreduce: 23.02 | step: 8.30
86%|████████▌ | 401/468 [1:12:09<27:45, 24.86s/it] {'loss': 0.2318, 'learning_rate': 2.1204359234080196e-06, 'epoch': 2.56}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1870
[2024-09-15 16:47:23,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1878.75 | bwd_microstep: 3375.62 | bwd_inner_microstep: 3375.60 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1367
[2024-09-15 16:47:28,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.35 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:47:28,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1455.01 | bwd_microstep: 3775.43 | bwd_inner_microstep: 2566.75 | bwd_allreduce_microstep: 1208.63 | step_microstep: 7.84
[2024-09-15 16:47:28,369] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3333.73 | bwd: 7151.07 | bwd_inner: 5942.34 | bwd_allreduce: 1208.66 | step: 8.09
86%|████████▌ | 402/468 [1:12:19<22:37, 20.57s/it] {'loss': 0.2058, 'learning_rate': 2.058712718406719e-06, 'epoch': 2.57}
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:47:33,743] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.32 | bwd_microstep: 3425.20 | bwd_inner_microstep: 3425.17 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:47:39,148] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.36 | optimizer_gradients: 0.36 | optimizer_step: 0.40
[2024-09-15 16:47:39,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1921.05 | bwd_microstep: 3447.91 | bwd_inner_microstep: 3433.17 | bwd_allreduce_microstep: 14.69 | step_microstep: 8.50
[2024-09-15 16:47:39,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3839.32 | bwd: 6873.11 | bwd_inner: 6858.34 | bwd_allreduce: 14.71 | step: 8.75
86%|████████▌ | 403/468 [1:12:30<19:06, 17.63s/it] {'loss': 0.1229, 'learning_rate': 1.9978524029386026e-06, 'epoch': 2.58}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:47:43,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1483.12 | bwd_microstep: 2608.19 | bwd_inner_microstep: 2608.16 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:47:49,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.49 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:47:49,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1907.29 | bwd_microstep: 4624.91 | bwd_inner_microstep: 3416.40 | bwd_allreduce_microstep: 1208.45 | step_microstep: 7.90
[2024-09-15 16:47:49,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3390.38 | bwd: 7233.13 | bwd_inner: 6024.56 | bwd_allreduce: 1208.48 | step: 8.15
86%|████████▋ | 404/468 [1:12:41<16:35, 15.55s/it] {'loss': 0.1819, 'learning_rate': 1.937857904091818e-06, 'epoch': 2.58}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:47:53,964] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1483.96 | bwd_microstep: 2610.58 | bwd_inner_microstep: 2610.52 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.30
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1881
[2024-09-15 16:48:00,523] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 4.46 | optimizer_gradients: 0.42 | optimizer_step: 0.40
[2024-09-15 16:48:00,524] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.45 | bwd_microstep: 4609.68 | bwd_inner_microstep: 3420.48 | bwd_allreduce_microstep: 1189.13 | step_microstep: 11.76
[2024-09-15 16:48:00,524] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3396.39 | bwd: 7220.29 | bwd_inner: 6031.00 | bwd_allreduce: 1189.17 | step: 12.01
87%|████████▋ | 405/468 [1:12:51<14:47, 14.09s/it] {'loss': 0.1619, 'learning_rate': 1.8787321073128817e-06, 'epoch': 2.59}
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1368
[2024-09-15 16:48:04,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1449.55 | bwd_microstep: 2557.73 | bwd_inner_microstep: 2557.62 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:48:11,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.35 | optimizer_gradients: 0.39 | optimizer_step: 0.39
[2024-09-15 16:48:11,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1910.16 | bwd_microstep: 4673.80 | bwd_inner_microstep: 3425.76 | bwd_allreduce_microstep: 1247.97 | step_microstep: 7.79
[2024-09-15 16:48:11,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3359.69 | bwd: 7231.56 | bwd_inner: 5983.39 | bwd_allreduce: 1248.05 | step: 8.04
87%|████████▋ | 406/468 [1:13:02<13:29, 13.06s/it] {'loss': 0.1799, 'learning_rate': 1.8204778562679437e-06, 'epoch': 2.59}
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1901
[2024-09-15 16:48:16,637] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1949.22 | bwd_microstep: 3474.08 | bwd_inner_microstep: 3474.00 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:48:20,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.81 | optimizer_gradients: 0.45 | optimizer_step: 0.40
[2024-09-15 16:48:20,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1463.32 | bwd_microstep: 2597.06 | bwd_inner_microstep: 2581.63 | bwd_allreduce_microstep: 15.38 | step_microstep: 8.30
[2024-09-15 16:48:20,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3412.53 | bwd: 6071.18 | bwd_inner: 6055.63 | bwd_allreduce: 15.42 | step: 8.55
87%|████████▋ | 407/468 [1:13:12<12:12, 12.01s/it] {'loss': 0.1823, 'learning_rate': 1.7630979527059877e-06, 'epoch': 2.6}
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1880
[2024-09-15 16:48:26,041] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1887.36 | bwd_microstep: 3391.14 | bwd_inner_microstep: 3391.01 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:48:31,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.48 | optimizer_gradients: 0.34 | optimizer_step: 0.40
[2024-09-15 16:48:31,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1911.45 | bwd_microstep: 3445.07 | bwd_inner_microstep: 3429.93 | bwd_allreduce_microstep: 15.09 | step_microstep: 8.01
[2024-09-15 16:48:31,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3798.78 | bwd: 6836.24 | bwd_inner: 6820.94 | bwd_allreduce: 15.19 | step: 8.26
87%|████████▋ | 408/468 [1:13:22<11:36, 11.62s/it] {'loss': 0.1946, 'learning_rate': 1.7065951563241022e-06, 'epoch': 2.61}
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1388
[2024-09-15 16:48:35,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1475.73 | bwd_microstep: 2601.75 | bwd_inner_microstep: 2601.69 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:48:42,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.32 | optimizer_gradients: 0.38 | optimizer_step: 0.41
[2024-09-15 16:48:42,122] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1914.41 | bwd_microstep: 4629.74 | bwd_inner_microstep: 3434.94 | bwd_allreduce_microstep: 1194.73 | step_microstep: 7.61
[2024-09-15 16:48:42,122] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3390.12 | bwd: 7231.53 | bwd_inner: 6036.63 | bwd_allreduce: 1194.77 | step: 7.84
87%|████████▋ | 409/468 [1:13:33<11:08, 11.34s/it] {'loss': 0.1903, 'learning_rate': 1.6509721846347382e-06, 'epoch': 2.61}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:48:47,499] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.93 | bwd_microstep: 3431.68 | bwd_inner_microstep: 3431.65 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:48:52,899] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.47 | optimizer_gradients: 0.35 | optimizer_step: 0.42
[2024-09-15 16:48:52,899] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.53 | bwd_microstep: 3446.66 | bwd_inner_microstep: 3431.56 | bwd_allreduce_microstep: 15.05 | step_microstep: 8.32
[2024-09-15 16:48:52,900] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3831.44 | bwd: 6878.35 | bwd_inner: 6863.22 | bwd_allreduce: 15.07 | step: 8.56
88%|████████▊ | 410/468 [1:13:44<10:47, 11.17s/it] {'loss': 0.1675, 'learning_rate': 1.5962317128350147e-06, 'epoch': 2.62}
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1900
[2024-09-15 16:48:58,264] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1908.65 | bwd_microstep: 3425.70 | bwd_inner_microstep: 3425.67 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:49:03,684] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.59 | optimizer_gradients: 0.39 | optimizer_step: 0.37
[2024-09-15 16:49:03,685] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1914.44 | bwd_microstep: 3473.70 | bwd_inner_microstep: 3434.33 | bwd_allreduce_microstep: 39.31 | step_microstep: 8.20
[2024-09-15 16:49:03,685] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3823.06 | bwd: 6899.41 | bwd_inner: 6860.00 | bwd_allreduce: 39.34 | step: 8.25
88%|████████▊ | 411/468 [1:13:55<10:30, 11.05s/it] {'loss': 0.1545, 'learning_rate': 1.5423763736780583e-06, 'epoch': 2.63}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1870
[2024-09-15 16:49:09,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1896.69 | bwd_microstep: 3409.53 | bwd_inner_microstep: 3409.50 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:49:14,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.55 | optimizer_gradients: 0.45 | optimizer_step: 0.42
[2024-09-15 16:49:14,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1470.92 | bwd_microstep: 3890.89 | bwd_inner_microstep: 2597.80 | bwd_allreduce_microstep: 1293.02 | step_microstep: 8.73
[2024-09-15 16:49:14,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3367.59 | bwd: 7300.43 | bwd_inner: 6007.30 | bwd_allreduce: 1293.05 | step: 8.79
88%|████████▊ | 412/468 [1:14:05<10:13, 10.96s/it] {'loss': 0.1292, 'learning_rate': 1.4894087573463734e-06, 'epoch': 2.63}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:49:18,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1492.58 | bwd_microstep: 2630.79 | bwd_inner_microstep: 2630.76 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.19
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:49:25,240] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.66 | optimizer_gradients: 0.38 | optimizer_step: 0.84
[2024-09-15 16:49:25,240] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1916.85 | bwd_microstep: 4705.93 | bwd_inner_microstep: 3438.64 | bwd_allreduce_microstep: 1267.23 | step_microstep: 8.60
[2024-09-15 16:49:25,240] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3409.40 | bwd: 7336.73 | bwd_inner: 6069.40 | bwd_allreduce: 1267.26 | step: 8.82
88%|████████▊ | 413/468 [1:14:16<10:00, 10.92s/it] {'loss': 0.1341, 'learning_rate': 1.437331411327274e-06, 'epoch': 2.64}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1900
[2024-09-15 16:49:30,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1937.95 | bwd_microstep: 3458.61 | bwd_inner_microstep: 3458.59 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.06
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:49:35,991] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.41 | optimizer_gradients: 0.36 | optimizer_step: 0.36
[2024-09-15 16:49:35,992] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.91 | bwd_microstep: 3822.49 | bwd_inner_microstep: 2594.24 | bwd_allreduce_microstep: 1228.19 | step_microstep: 7.44
[2024-09-15 16:49:35,992] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3406.83 | bwd: 7281.12 | bwd_inner: 6052.83 | bwd_allreduce: 1228.22 | step: 7.53
88%|████████▊ | 414/468 [1:14:27<09:46, 10.87s/it] {'loss': 0.1419, 'learning_rate': 1.3861468402903634e-06, 'epoch': 2.65}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1900
[2024-09-15 16:49:41,423] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1935.92 | bwd_microstep: 3464.79 | bwd_inner_microstep: 3464.77 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:49:46,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.27 | optimizer_gradients: 0.40 | optimizer_step: 0.45
[2024-09-15 16:49:46,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1470.76 | bwd_microstep: 3829.91 | bwd_inner_microstep: 2591.20 | bwd_allreduce_microstep: 1238.65 | step_microstep: 7.63
[2024-09-15 16:49:46,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3406.66 | bwd: 7294.72 | bwd_inner: 6055.97 | bwd_allreduce: 1238.68 | step: 7.87
89%|████████▊ | 415/468 [1:14:38<09:34, 10.84s/it] {'loss': 0.2009, 'learning_rate': 1.3358575059670532e-06, 'epoch': 2.65}
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1388
[2024-09-15 16:49:50,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1459.37 | bwd_microstep: 2587.13 | bwd_inner_microstep: 2586.88 | bwd_allreduce_microstep: 0.15 | step_microstep: 0.25
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:49:57,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.44 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:49:57,502] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1469.77 | bwd_microstep: 5160.56 | bwd_inner_microstep: 2590.45 | bwd_allreduce_microstep: 2570.06 | step_microstep: 7.54
[2024-09-15 16:49:57,502] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2929.12 | bwd: 7747.74 | bwd_inner: 5177.33 | bwd_allreduce: 2570.23 | step: 7.79
89%|████████▉ | 416/468 [1:14:48<09:22, 10.81s/it] {'loss': 0.1272, 'learning_rate': 1.2864658270321905e-06, 'epoch': 2.66}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1877
[2024-09-15 16:50:02,833] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1895.45 | bwd_microstep: 3403.39 | bwd_inner_microstep: 3403.36 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.20
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:50:08,340] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.53 | optimizer_gradients: 0.38 | optimizer_step: 0.39
[2024-09-15 16:50:08,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.70 | bwd_microstep: 4003.07 | bwd_inner_microstep: 2588.72 | bwd_allreduce_microstep: 1414.30 | step_microstep: 7.97
[2024-09-15 16:50:08,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3363.13 | bwd: 7406.47 | bwd_inner: 5992.08 | bwd_allreduce: 1414.33 | step: 8.19
89%|████████▉ | 417/468 [1:14:59<09:11, 10.82s/it] {'loss': 0.1821, 'learning_rate': 1.2379741789877175e-06, 'epoch': 2.66}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:50:12,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1463.62 | bwd_microstep: 2586.35 | bwd_inner_microstep: 2586.32 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:50:19,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:50:19,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1464.50 | bwd_microstep: 5104.68 | bwd_inner_microstep: 2587.84 | bwd_allreduce_microstep: 2516.77 | step_microstep: 7.24
[2024-09-15 16:50:19,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2928.10 | bwd: 7691.04 | bwd_inner: 5174.17 | bwd_allreduce: 2516.80 | step: 7.46
89%|████████▉ | 418/468 [1:15:10<08:58, 10.78s/it] {'loss': 0.1756, 'learning_rate': 1.1903848940484241e-06, 'epoch': 2.67}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:50:24,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1909.15 | bwd_microstep: 3431.46 | bwd_inner_microstep: 3431.43 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.21
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1367
[2024-09-15 16:50:29,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.39 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:50:29,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1435.89 | bwd_microstep: 3959.05 | bwd_inner_microstep: 2547.99 | bwd_allreduce_microstep: 1411.00 | step_microstep: 7.66
[2024-09-15 16:50:29,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3345.02 | bwd: 7390.52 | bwd_inner: 5979.42 | bwd_allreduce: 1411.03 | step: 7.89
90%|████████▉ | 419/468 [1:15:21<08:48, 10.79s/it] {'loss': 0.1342, 'learning_rate': 1.1437002610297787e-06, 'epoch': 2.68}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:50:35,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1909.26 | bwd_microstep: 3428.96 | bwd_inner_microstep: 3428.94 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.15
dynamic ViT batch size: 39, images per sample: 4.875, dynamic token length: 1368
[2024-09-15 16:50:39,301] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.62 | optimizer_gradients: 0.35 | optimizer_step: 0.38
[2024-09-15 16:50:39,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1458.61 | bwd_microstep: 2609.67 | bwd_inner_microstep: 2573.57 | bwd_allreduce_microstep: 36.04 | step_microstep: 7.86
[2024-09-15 16:50:39,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3367.84 | bwd: 6038.64 | bwd_inner: 6002.51 | bwd_allreduce: 36.07 | step: 8.05
90%|████████▉ | 420/468 [1:15:30<08:18, 10.39s/it] {'loss': 0.2193, 'learning_rate': 1.097922525237849e-06, 'epoch': 2.68}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:50:44,677] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.25 | bwd_microstep: 3430.69 | bwd_inner_microstep: 3430.67 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1870
[2024-09-15 16:50:50,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.48 | optimizer_gradients: 0.35 | optimizer_step: 0.38
[2024-09-15 16:50:50,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1882.46 | bwd_microstep: 3425.17 | bwd_inner_microstep: 3386.92 | bwd_allreduce_microstep: 38.20 | step_microstep: 8.16
[2024-09-15 16:50:50,022] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3794.69 | bwd: 6855.88 | bwd_inner: 6817.59 | bwd_allreduce: 38.22 | step: 8.40
90%|████████▉ | 421/468 [1:15:41<08:13, 10.49s/it] {'loss': 0.1873, 'learning_rate': 1.0530538883613129e-06, 'epoch': 2.69}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:50:55,389] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.28 | bwd_microstep: 3423.58 | bwd_inner_microstep: 3423.55 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.18
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:51:00,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.52 | optimizer_gradients: 0.34 | optimizer_step: 0.38
[2024-09-15 16:51:00,761] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1900.15 | bwd_microstep: 3434.64 | bwd_inner_microstep: 3419.88 | bwd_allreduce_microstep: 14.71 | step_microstep: 8.34
[2024-09-15 16:51:00,761] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3813.41 | bwd: 6858.23 | bwd_inner: 6843.44 | bwd_allreduce: 14.73 | step: 8.55
90%|█████████ | 422/468 [1:15:52<08:05, 10.56s/it] {'loss': 0.2289, 'learning_rate': 1.0090965083655657e-06, 'epoch': 2.7}
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1388
[2024-09-15 16:51:04,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1461.31 | bwd_microstep: 2588.03 | bwd_inner_microstep: 2587.84 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.23
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1368
[2024-09-15 16:51:11,481] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.36 | optimizer_step: 0.37
[2024-09-15 16:51:11,482] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1440.53 | bwd_microstep: 5166.57 | bwd_inner_microstep: 2555.92 | bwd_allreduce_microstep: 2610.52 | step_microstep: 7.24
[2024-09-15 16:51:11,482] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2901.82 | bwd: 7754.64 | bwd_inner: 5143.80 | bwd_allreduce: 2610.65 | step: 7.47
90%|█████████ | 423/468 [1:16:02<07:57, 10.61s/it] {'loss': 0.1694, 'learning_rate': 9.660524993889386e-07, 'epoch': 2.7}
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1389
[2024-09-15 16:51:15,592] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1476.37 | bwd_microstep: 2603.62 | bwd_inner_microstep: 2603.59 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:51:22,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.38 | optimizer_step: 0.39
[2024-09-15 16:51:22,215] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.69 | bwd_microstep: 4676.15 | bwd_inner_microstep: 3434.07 | bwd_allreduce_microstep: 1242.02 | step_microstep: 7.14
[2024-09-15 16:51:22,215] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3389.03 | bwd: 7279.78 | bwd_inner: 6037.66 | bwd_allreduce: 1242.05 | step: 7.35
91%|█████████ | 424/468 [1:16:13<07:48, 10.65s/it] {'loss': 0.1551, 'learning_rate': 9.239239316410109e-07, 'epoch': 2.71}
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1366
[2024-09-15 16:51:26,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1449.72 | bwd_microstep: 2564.19 | bwd_inner_microstep: 2564.16 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.20
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1368
[2024-09-15 16:51:31,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.40 | optimizer_step: 0.40
[2024-09-15 16:51:31,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1454.29 | bwd_microstep: 3968.27 | bwd_inner_microstep: 2568.99 | bwd_allreduce_microstep: 1399.22 | step_microstep: 7.50
[2024-09-15 16:51:31,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2903.98 | bwd: 6532.47 | bwd_inner: 5133.16 | bwd_allreduce: 1399.25 | step: 7.72
91%|█████████ | 425/468 [1:16:23<07:23, 10.30s/it] {'loss': 0.2198, 'learning_rate': 8.827128313030453e-07, 'epoch': 2.72}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:51:35,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1489.97 | bwd_microstep: 2627.42 | bwd_inner_microstep: 2627.28 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:51:42,552] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.39 | optimizer_step: 0.40
[2024-09-15 16:51:42,553] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.22 | bwd_microstep: 4733.20 | bwd_inner_microstep: 3431.86 | bwd_allreduce_microstep: 1301.27 | step_microstep: 7.48
[2024-09-15 16:51:42,553] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3408.17 | bwd: 7360.66 | bwd_inner: 6059.14 | bwd_allreduce: 1301.39 | step: 7.74
91%|█████████ | 426/468 [1:16:33<07:19, 10.46s/it] {'loss': 0.1434, 'learning_rate': 8.42421180430546e-07, 'epoch': 2.72}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:51:47,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1907.38 | bwd_microstep: 3423.59 | bwd_inner_microstep: 3423.57 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1373
[2024-09-15 16:51:53,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.27 | optimizer_gradients: 0.43 | optimizer_step: 0.39
[2024-09-15 16:51:53,274] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.74 | bwd_microstep: 3852.14 | bwd_inner_microstep: 2593.45 | bwd_allreduce_microstep: 1258.64 | step_microstep: 7.43
[2024-09-15 16:51:53,274] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3379.10 | bwd: 7275.75 | bwd_inner: 6017.01 | bwd_allreduce: 1258.67 | step: 7.68
91%|█████████ | 427/468 [1:16:44<07:12, 10.54s/it] {'loss': 0.2212, 'learning_rate': 8.03050916857917e-07, 'epoch': 2.73}
dynamic ViT batch size: 41, images per sample: 5.125, dynamic token length: 1876
[2024-09-15 16:51:58,628]
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1904.72 | bwd_microstep: 3419.12 | bwd_inner_microstep: 3419.09 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368 [2024-09-15 16:52:02,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.63 | optimizer_gradients: 0.33 | optimizer_step: 0.39 [2024-09-15 16:52:02,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.93 | bwd_microstep: 2607.11 | bwd_inner_microstep: 2590.27 | bwd_allreduce_microstep: 16.80 | step_microstep: 7.76 [2024-09-15 16:52:02,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3372.63 | bwd: 6026.24 | bwd_inner: 6009.36 | bwd_allreduce: 16.82 | step: 7.99 91%|█████████▏| 428/468 [1:16:54<06:48, 10.22s/it] {'loss': 0.214, 'learning_rate': 7.646039341052747e-07, 'epoch': 2.73} 91%|█████████▏| 428/468 [1:16:54<06:48, 10.22s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870 [2024-09-15 16:52:08,106] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.89 | bwd_microstep: 3424.20 | bwd_inner_microstep: 3424.18 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.07 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877 [2024-09-15 16:52:13,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.36 | optimizer_gradients: 0.33 | optimizer_step: 0.39 [2024-09-15 16:52:13,518] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.35 | bwd_microstep: 3456.82 | bwd_inner_microstep: 3442.16 | bwd_allreduce_microstep: 14.62 | step_microstep: 7.97 [2024-09-15 16:52:13,518] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3831.21 | bwd: 6881.04 | bwd_inner: 6866.33 | bwd_allreduce: 14.64 | step: 8.07 92%|█████████▏| 429/468 [1:17:04<06:45, 10.39s/it] {'loss': 0.1959, 'learning_rate': 7.270820812873714e-07, 'epoch': 
2.74} 92%|█████████▏| 429/468 [1:17:04<06:45, 10.39s/it]dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1366 [2024-09-15 16:52:17,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1454.77 | bwd_microstep: 2571.14 | bwd_inner_microstep: 2571.11 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877 [2024-09-15 16:52:24,101] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.28 | optimizer_gradients: 0.38 | optimizer_step: 0.39 [2024-09-15 16:52:24,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.16 | bwd_microstep: 4574.58 | bwd_inner_microstep: 3442.44 | bwd_allreduce_microstep: 1132.08 | step_microstep: 7.45 [2024-09-15 16:52:24,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3371.91 | bwd: 7145.73 | bwd_inner: 6013.56 | bwd_allreduce: 1132.11 | step: 7.69 92%|█████████▏| 430/468 [1:17:15<06:36, 10.45s/it] {'loss': 0.1571, 'learning_rate': 6.904871630246646e-07, 'epoch': 2.75} 92%|█████████▏| 430/468 [1:17:15<06:36, 10.45s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877 [2024-09-15 16:52:29,481] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1912.88 | bwd_microstep: 3436.11 | bwd_inner_microstep: 3436.09 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.07 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368 [2024-09-15 16:52:33,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.70 | optimizer_gradients: 0.38 | optimizer_step: 0.42 [2024-09-15 16:52:33,600] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.09 | bwd_microstep: 2608.36 | bwd_inner_microstep: 2593.14 | bwd_allreduce_microstep: 15.17 | step_microstep: 8.01 [2024-09-15 16:52:33,600] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3383.94 | bwd: 6044.48 | 
bwd_inner: 6029.23 | bwd_allreduce: 15.19 | step: 8.11
 92%|█████████▏| 431/468 [1:17:25<06:15, 10.16s/it] {'loss': 0.1282, 'learning_rate': 6.548209393565241e-07, 'epoch': 2.75}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:52:38,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1903.24 | bwd_microstep: 3419.45 | bwd_inner_microstep: 3419.40 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:52:44,340] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.49 | optimizer_gradients: 0.36 | optimizer_step: 0.41
[2024-09-15 16:52:44,340] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1905.12 | bwd_microstep: 3445.32 | bwd_inner_microstep: 3430.35 | bwd_allreduce_microstep: 14.93 | step_microstep: 8.30
[2024-09-15 16:52:44,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3808.34 | bwd: 6864.81 | bwd_inner: 6849.75 | bwd_allreduce: 14.95 | step: 8.54
 92%|█████████▏| 432/468 [1:17:35<06:12, 10.34s/it] {'loss': 0.0944, 'learning_rate': 6.200851256565799e-07, 'epoch': 2.76}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:52:48,437] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.74 | bwd_microstep: 2595.51 | bwd_inner_microstep: 2595.40 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.24
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:52:53,723] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.40 | optimizer_step: 0.45
[2024-09-15 16:52:53,723] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.08 | bwd_microstep: 3778.52 | bwd_inner_microstep: 2593.46 | bwd_allreduce_microstep: 1185.01 | step_microstep: 7.56
[2024-09-15 16:52:53,723] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2942.80 | bwd: 6374.07 | bwd_inner: 5188.86 | bwd_allreduce: 1185.09 | step: 7.81
 93%|█████████▎| 433/468 [1:17:45<05:51, 10.05s/it] {'loss': 0.1417, 'learning_rate': 5.862813925502209e-07, 'epoch': 2.77}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:52:59,105] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.13 | bwd_microstep: 3434.25 | bwd_inner_microstep: 3434.23 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.14
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1368
[2024-09-15 16:53:04,467] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.43 | optimizer_gradients: 0.42 | optimizer_step: 0.38
[2024-09-15 16:53:04,468] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1441.09 | bwd_microstep: 3887.32 | bwd_inner_microstep: 2557.62 | bwd_allreduce_microstep: 1329.64 | step_microstep: 7.54
[2024-09-15 16:53:04,468] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3358.19 | bwd: 7321.58 | bwd_inner: 5991.85 | bwd_allreduce: 1329.67 | step: 7.71
 93%|█████████▎| 434/468 [1:17:55<05:48, 10.26s/it] {'loss': 0.2177, 'learning_rate': 5.53411365834251e-07, 'epoch': 2.77}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:53:08,617] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1491.42 | bwd_microstep: 2627.14 | bwd_inner_microstep: 2627.02 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.23
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1881
[2024-09-15 16:53:15,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:53:15,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1888.53 | bwd_microstep: 4554.88 | bwd_inner_microstep: 3399.28 | bwd_allreduce_microstep: 1155.54 | step_microstep: 7.19
[2024-09-15 16:53:15,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3379.92 | bwd: 7182.06 | bwd_inner: 6026.30 | bwd_allreduce: 1155.62 | step: 7.43
 93%|█████████▎| 435/468 [1:18:06<05:42, 10.37s/it] {'loss': 0.1399, 'learning_rate': 5.214766263986848e-07, 'epoch': 2.78}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1900
[2024-09-15 16:53:20,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1920.54 | bwd_microstep: 3443.81 | bwd_inner_microstep: 3443.78 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:53:25,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.57 | optimizer_gradients: 0.34 | optimizer_step: 0.38
[2024-09-15 16:53:25,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1899.96 | bwd_microstep: 3434.30 | bwd_inner_microstep: 3420.47 | bwd_allreduce_microstep: 13.78 | step_microstep: 8.54
[2024-09-15 16:53:25,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3820.47 | bwd: 6878.11 | bwd_inner: 6864.25 | bwd_allreduce: 13.80 | step: 8.61
 93%|█████████▎| 436/468 [1:18:17<05:35, 10.49s/it] {'loss': 0.1264, 'learning_rate': 4.904787101507324e-07, 'epoch': 2.79}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:53:31,243] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1916.57 | bwd_microstep: 3436.52 | bwd_inner_microstep: 3436.50 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:53:36,647] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.56 | optimizer_gradients: 0.37 | optimizer_step: 0.41
[2024-09-15 16:53:36,648] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.24 | bwd_microstep: 3454.63 | bwd_inner_microstep: 3442.04 | bwd_allreduce_microstep: 12.55 | step_microstep: 8.56
[2024-09-15 16:53:36,648] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3834.78 | bwd: 6891.16 | bwd_inner: 6878.53 | bwd_allreduce: 12.57 | step: 8.61
 93%|█████████▎| 437/468 [1:18:28<05:27, 10.58s/it] {'loss': 0.1468, 'learning_rate': 4.604191079409126e-07, 'epoch': 2.79}
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1389
[2024-09-15 16:53:40,768] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1478.38 | bwd_microstep: 2611.11 | bwd_inner_microstep: 2611.07 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:53:47,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:53:47,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1472.10 | bwd_microstep: 5055.96 | bwd_inner_microstep: 2591.54 | bwd_allreduce_microstep: 2464.36 | step_microstep: 7.15
[2024-09-15 16:53:47,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2950.45 | bwd: 7667.12 | bwd_inner: 5202.61 | bwd_allreduce: 2464.39 | step: 7.37
 94%|█████████▎| 438/468 [1:18:38<05:18, 10.61s/it] {'loss': 0.1082, 'learning_rate': 4.3129926549136057e-07, 'epoch': 2.8}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:53:52,706] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.46 | bwd_microstep: 3430.83 | bwd_inner_microstep: 3430.80 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.19
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1366
[2024-09-15 16:53:56,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.61 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:53:56,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1467.49 | bwd_microstep: 2606.21 | bwd_inner_microstep: 2591.13 | bwd_allreduce_microstep: 15.04 | step_microstep: 8.31
[2024-09-15 16:53:56,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3380.92 | bwd: 6037.05 | bwd_inner: 6021.93 | bwd_allreduce: 15.06 | step: 8.53
 94%|█████████▍| 439/468 [1:18:48<04:57, 10.27s/it] {'loss': 0.1996, 'learning_rate': 4.031205833262863e-07, 'epoch': 2.81}
dynamic ViT batch size: 41, images per sample: 5.125, dynamic token length: 1870
[2024-09-15 16:54:02,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1906.98 | bwd_microstep: 3413.13 | bwd_inner_microstep: 3413.10 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:54:07,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.59 | optimizer_gradients: 0.34 | optimizer_step: 0.38
[2024-09-15 16:54:07,581] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.16 | bwd_microstep: 3455.85 | bwd_inner_microstep: 3440.91 | bwd_allreduce_microstep: 14.89 | step_microstep: 8.33
[2024-09-15 16:54:07,581] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3824.10 | bwd: 6868.99 | bwd_inner: 6854.01 | bwd_allreduce: 14.91 | step: 8.57
 94%|█████████▍| 440/468 [1:18:58<04:51, 10.42s/it] {'loss': 0.159, 'learning_rate': 3.7588441670462827e-07, 'epoch': 2.81}
dynamic ViT batch size:
40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:54:11,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1491.38 | bwd_microstep: 2630.44 | bwd_inner_microstep: 2630.42 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:54:17,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.55 | optimizer_gradients: 0.39 | optimizer_step: 0.38
[2024-09-15 16:54:17,079] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1470.08 | bwd_microstep: 3838.79 | bwd_inner_microstep: 2596.96 | bwd_allreduce_microstep: 1241.77 | step_microstep: 7.63
[2024-09-15 16:54:17,079] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2961.42 | bwd: 6469.25 | bwd_inner: 5227.38 | bwd_allreduce: 1241.80 | step: 7.86
 94%|█████████▍| 441/468 [1:19:08<04:33, 10.14s/it] {'loss': 0.1389, 'learning_rate': 3.4959207555485873e-07, 'epoch': 2.82}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:54:22,461] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.18 | bwd_microstep: 3437.65 | bwd_inner_microstep: 3437.62 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.20
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1369
[2024-09-15 16:54:27,897] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.50 | optimizer_gradients: 0.39 | optimizer_step: 0.38
[2024-09-15 16:54:27,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1440.35 | bwd_microstep: 3962.55 | bwd_inner_microstep: 2556.61 | bwd_allreduce_microstep: 1405.88 | step_microstep: 7.88
[2024-09-15 16:54:27,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3353.50 | bwd: 7400.21 | bwd_inner: 5994.23 | bwd_allreduce: 1405.91 | step: 8.10
 94%|█████████▍| 442/468 [1:19:19<04:28, 10.35s/it] {'loss': 0.1808, 'learning_rate': 3.242448244119967e-07, 'epoch': 2.82}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:54:32,046] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1490.43 | bwd_microstep: 2627.18 | bwd_inner_microstep: 2627.15 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.07
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1878
[2024-09-15 16:54:38,475] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.39 | optimizer_gradients: 0.37 | optimizer_step: 0.40
[2024-09-15 16:54:38,475] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1901.09 | bwd_microstep: 4492.62 | bwd_inner_microstep: 3419.48 | bwd_allreduce_microstep: 1073.06 | step_microstep: 7.41
[2024-09-15 16:54:38,476] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3391.49 | bwd: 7119.81 | bwd_inner: 6046.64 | bwd_allreduce: 1073.10 | step: 7.51
 95%|█████████▍| 443/468 [1:19:29<04:20, 10.42s/it] {'loss': 0.1019, 'learning_rate': 2.99843882356774e-07, 'epoch': 2.83}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1900
[2024-09-15 16:54:43,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1919.59 | bwd_microstep: 3447.69 | bwd_inner_microstep: 3447.67 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.06
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:54:49,267] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 4.25 | optimizer_gradients: 0.33 | optimizer_step: 0.38
[2024-09-15 16:54:49,267] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1914.53 | bwd_microstep: 3439.27 | bwd_inner_microstep: 3424.46 | bwd_allreduce_microstep: 14.76 | step_microstep: 11.66
[2024-09-15 16:54:49,268] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3834.10 | bwd: 6886.97 | bwd_inner: 6872.13 | bwd_allreduce: 14.78 | step: 11.75
 95%|█████████▍| 444/468 [1:19:40<04:12, 10.53s/it] {'loss': 0.1137, 'learning_rate': 2.7639042295702245e-07, 'epoch': 2.84}
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1901
[2024-09-15 16:54:54,648] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1910.79 | bwd_microstep: 3439.81 | bwd_inner_microstep: 3439.79 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.13
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:54:59,997] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.48 | optimizer_gradients: 0.39 | optimizer_step: 0.38
[2024-09-15 16:54:59,998] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.90 | bwd_microstep: 3842.01 | bwd_inner_microstep: 2592.58 | bwd_allreduce_microstep: 1249.37 | step_microstep: 8.30
[2024-09-15 16:54:59,998] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3382.66 | bwd: 7281.84 | bwd_inner: 6032.37 | bwd_allreduce: 1249.40 | step: 8.46
 95%|█████████▌| 445/468 [1:19:51<04:03, 10.59s/it] {'loss': 0.2141, 'learning_rate': 2.5388557421120564e-07, 'epoch': 2.84}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:55:04,145] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1490.23 | bwd_microstep: 2627.46 | bwd_inner_microstep: 2627.43 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.08
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:55:10,740] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.39 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:55:10,740] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.32 | bwd_microstep: 4642.11 | bwd_inner_microstep: 3439.25 | bwd_allreduce_microstep: 1202.80 | step_microstep: 7.47
[2024-09-15 16:55:10,740] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3407.52 | bwd: 7269.58 | bwd_inner: 6066.68 | bwd_allreduce: 1202.83 | step: 7.58
 95%|█████████▌| 446/468 [1:20:02<03:53, 10.64s/it] {'loss': 0.1763, 'learning_rate': 2.3233041849419547e-07, 'epoch': 2.85}
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1387
[2024-09-15 16:55:14,853] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1475.94 | bwd_microstep: 2606.73 | bwd_inner_microstep: 2606.70 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877
[2024-09-15 16:55:21,365] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.36 | optimizer_gradients: 0.38 | optimizer_step: 0.39
[2024-09-15 16:55:21,366] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.71 | bwd_microstep: 4560.68 | bwd_inner_microstep: 3439.87 | bwd_allreduce_microstep: 1120.75 | step_microstep: 7.96
[2024-09-15 16:55:21,366] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3391.63 | bwd: 7167.42 | bwd_inner: 6046.57 | bwd_allreduce: 1120.78 | step: 8.09
 96%|█████████▌| 447/468 [1:20:12<03:43, 10.63s/it] {'loss': 0.1588, 'learning_rate': 2.1172599250519398e-07, 'epoch': 2.86}
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1388
[2024-09-15 16:55:25,476] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1474.43 | bwd_microstep: 2605.28 | bwd_inner_microstep: 2605.26 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.15
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1368
[2024-09-15 16:55:30,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.57 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:55:30,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1453.77 | bwd_microstep: 3887.98 | bwd_inner_microstep: 2570.81 | bwd_allreduce_microstep: 1317.11 | step_microstep: 7.66
[2024-09-15 16:55:30,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2928.17 | bwd: 6493.28 | bwd_inner: 5176.07 | bwd_allreduce: 1317.14 | step: 7.84
 96%|█████████▌| 448/468 [1:20:22<03:25, 10.29s/it] {'loss': 0.2067, 'learning_rate': 1.9207328721788653e-07, 'epoch': 2.86}
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1881
[2024-09-15 16:55:36,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1932.23 | bwd_microstep: 3457.35 | bwd_inner_microstep: 3457.32 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.06
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:55:41,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.50 | optimizer_gradients: 0.42 | optimizer_step: 0.41
[2024-09-15 16:55:41,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1916.27 | bwd_microstep: 3450.88 | bwd_inner_microstep: 3436.11 | bwd_allreduce_microstep: 14.71 | step_microstep: 8.75
[2024-09-15 16:55:41,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3848.48 | bwd: 6908.23 | bwd_inner: 6893.43 | bwd_allreduce: 14.74 | step: 8.85
 96%|█████████▌| 449/468 [1:20:33<03:18, 10.45s/it] {'loss': 0.1241, 'learning_rate': 1.7337324783276878e-07, 'epoch': 2.87}
dynamic ViT batch size: 32, images per sample: 4.0, dynamic token length: 1369
[2024-09-15 16:55:45,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1407.94 | bwd_microstep: 2516.55 | bwd_inner_microstep: 2516.53 | bwd_allreduce_microstep:
0.01 | step_microstep: 0.10
dynamic ViT batch size: 41, images per sample: 5.125, dynamic token length: 1877
[2024-09-15 16:55:51,169] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.54 | optimizer_gradients: 0.39 | optimizer_step: 0.38
[2024-09-15 16:55:51,170] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1910.02 | bwd_microstep: 3591.66 | bwd_inner_microstep: 3426.61 | bwd_allreduce_microstep: 164.99 | step_microstep: 7.67
[2024-09-15 16:55:51,170] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3317.92 | bwd: 6108.22 | bwd_inner: 5943.13 | bwd_allreduce: 165.02 | step: 7.79
 96%|█████████▌| 450/468 [1:20:42<03:02, 10.16s/it] {'loss': 0.1489, 'learning_rate': 1.5562677373169855e-07, 'epoch': 2.88}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:55:55,322] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1494.36 | bwd_microstep: 2627.59 | bwd_inner_microstep: 2627.56 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1878
[2024-09-15 16:56:01,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.51 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:56:01,875] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1936.16 | bwd_microstep: 4580.75 | bwd_inner_microstep: 3466.33 | bwd_allreduce_microstep: 1114.37 | step_microstep: 7.67
[2024-09-15 16:56:01,875] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3430.49 | bwd: 7208.35 | bwd_inner: 6093.89 | bwd_allreduce: 1114.40 | step: 7.90
 96%|█████████▋| 451/468 [1:20:53<02:55, 10.33s/it] {'loss': 0.136, 'learning_rate': 1.388347184346328e-07, 'epoch': 2.88}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:56:07,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.60 | bwd_microstep: 3432.64 | bwd_inner_microstep: 3432.61 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.18
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1367
[2024-09-15 16:56:12,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.22 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:56:12,636] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1469.02 | bwd_microstep: 3877.83 | bwd_inner_microstep: 2593.00 | bwd_allreduce_microstep: 1284.77 | step_microstep: 7.21
[2024-09-15 16:56:12,636] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3384.59 | bwd: 7310.49 | bwd_inner: 6025.62 | bwd_allreduce: 1284.80 | step: 7.42
 97%|█████████▋| 452/468 [1:21:04<02:47, 10.46s/it] {'loss': 0.1723, 'learning_rate': 1.2299788955857817e-07, 'epoch': 2.89}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388
[2024-09-15 16:56:16,784] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1489.86 | bwd_microstep: 2627.31 | bwd_inner_microstep: 2627.25 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1369
[2024-09-15 16:56:22,087] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.38 | optimizer_gradients: 0.42 | optimizer_step: 0.39
[2024-09-15 16:56:22,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.85 | bwd_microstep: 3799.95 | bwd_inner_microstep: 2591.84 | bwd_allreduce_microstep: 1208.05 | step_microstep: 7.84
[2024-09-15 16:56:22,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2958.69 | bwd: 6427.31 | bwd_inner: 5219.09 | bwd_allreduce: 1208.08 | step: 8.07
 97%|█████████▋| 453/468 [1:21:13<02:32, 10.15s/it] {'loss': 0.227, 'learning_rate': 1.0811704877875528e-07, 'epoch': 2.89}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:56:27,462] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1911.81 | bwd_microstep: 3430.89 | bwd_inner_microstep: 3430.86 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:56:32,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.65 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:56:32,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.60 | bwd_microstep: 3452.58 | bwd_inner_microstep: 3440.22 | bwd_allreduce_microstep: 12.31 | step_microstep: 8.66
[2024-09-15 16:56:32,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3827.38 | bwd: 6883.48 | bwd_inner: 6871.08 | bwd_allreduce: 12.34 | step: 8.89
 97%|█████████▋| 454/468 [1:21:24<02:24, 10.34s/it] {'loss': 0.1729, 'learning_rate': 9.419291179195267e-08, 'epoch': 2.9}
dynamic ViT batch size: 37, images per sample: 4.625, dynamic token length: 1366
[2024-09-15 16:56:36,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1447.58 | bwd_microstep: 2561.77 | bwd_inner_microstep: 2561.74 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368
[2024-09-15 16:56:43,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.56 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:56:43,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.30 | bwd_microstep: 5180.67 | bwd_inner_microstep: 2594.48 | bwd_allreduce_microstep: 2586.13 | step_microstep: 8.02
[2024-09-15 16:56:43,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2918.86 | bwd: 7742.44 | bwd_inner: 5156.22 | bwd_allreduce: 2586.16 | step: 8.09
 97%|█████████▋| 455/468 [1:21:34<02:15, 10.46s/it] {'loss': 0.1533, 'learning_rate': 8.122614828211861e-08, 'epoch': 2.91}
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1876
[2024-09-15 16:56:49,006] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1927.42 | bwd_microstep: 3455.13 | bwd_inner_microstep: 3455.10 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.06
dynamic ViT batch size: 38, images per sample: 4.75, dynamic token length: 1368
[2024-09-15 16:56:54,334] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.50 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:56:54,334] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1454.28 | bwd_microstep: 3838.30 | bwd_inner_microstep: 2572.00 | bwd_allreduce_microstep: 1266.25 | step_microstep: 7.53
[2024-09-15 16:56:54,335] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3381.67 | bwd: 7293.44 | bwd_inner: 6027.10 | bwd_allreduce: 1266.28 | step: 7.62
 97%|█████████▋| 456/468 [1:21:45<02:06, 10.54s/it] {'loss': 0.1285, 'learning_rate': 6.921738188814254e-08, 'epoch': 2.91}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:56:59,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1911.22 | bwd_microstep: 3431.65 | bwd_inner_microstep: 3431.63 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:57:05,105] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.56 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:57:05,105] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.15 | bwd_microstep: 3448.51 | bwd_inner_microstep: 3433.72 | bwd_allreduce_microstep: 14.72 | step_microstep: 8.31
[2024-09-15 16:57:05,106] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3824.35 | bwd: 6880.19 | bwd_inner: 6865.35 | bwd_allreduce: 14.75 | step: 8.43
 98%|█████████▊| 457/468 [1:21:56<01:56, 10.61s/it] {'loss': 0.2263, 'learning_rate': 5.816719017386785e-08, 'epoch': 2.92}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1870
[2024-09-15 16:57:10,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1896.16 | bwd_microstep: 3403.59 | bwd_inner_microstep: 3403.57 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876
[2024-09-15 16:57:15,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.61 | optimizer_gradients: 0.34 | optimizer_step: 0.39
[2024-09-15 16:57:15,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1916.36 | bwd_microstep: 3452.04 | bwd_inner_microstep: 3437.22 | bwd_allreduce_microstep: 14.77 | step_microstep: 8.62
[2024-09-15 16:57:15,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3812.49 | bwd: 6855.65 | bwd_inner: 6840.79 | bwd_allreduce: 14.79 | step: 8.74
 98%|█████████▊| 458/468 [1:22:07<01:46, 10.65s/it] {'loss': 0.1303, 'learning_rate': 4.807610460030976e-08, 'epoch': 2.93}
dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1366
[2024-09-15 16:57:19,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1438.47 | bwd_microstep: 2552.65 | bwd_inner_microstep: 2552.63 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878
[2024-09-15 16:57:25,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.44 | optimizer_gradients: 0.38 | optimizer_step: 0.38
[2024-09-15 16:57:25,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.18 | bwd_microstep: 3584.28 | bwd_inner_microstep: 3444.45 | bwd_allreduce_microstep: 139.77 | step_microstep: 7.98
[2024-09-15 16:57:25,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3355.62 | bwd: 6136.95 | bwd_inner: 5997.08 | bwd_allreduce: 139.80 | step: 8.03
 98%|█████████▊| 459/468 [1:22:16<01:32, 10.32s/it] {'loss': 0.1302, 'learning_rate': 3.894461050010012e-08, 'epoch': 2.93}
dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389
[2024-09-15 16:57:29,550] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1493.47 | bwd_microstep: 2631.11 | bwd_inner_microstep: 2631.09 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04
dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1876
[2024-09-15 16:57:36,098] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.40 | optimizer_gradients: 0.37 | optimizer_step: 0.38
[2024-09-15 16:57:36,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1933.89 | bwd_microstep: 4583.55 | bwd_inner_microstep: 3461.77 | bwd_allreduce_microstep: 1121.71 | step_microstep: 7.37
[2024-09-15 16:57:36,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3427.34 | bwd: 7214.68 | bwd_inner: 6092.85 | bwd_allreduce: 1121.75 | step: 7.43
 98%|█████████▊| 460/468 [1:22:27<01:23, 10.44s/it] {'loss': 0.1507, 'learning_rate': 3.077314705413503e-08, 'epoch': 2.94}
dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870
[2024-09-15 16:57:41,465] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1911.26 | bwd_microstep: 3424.02 | bwd_inner_microstep: 3423.99 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.05
dynamic ViT batch
size: 42, images per sample: 5.25, dynamic token length: 1876 [2024-09-15 16:57:46,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.42 | optimizer_gradients: 0.37 | optimizer_step: 0.40 [2024-09-15 16:57:46,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.20 | bwd_microstep: 3455.27 | bwd_inner_microstep: 3440.20 | bwd_allreduce_microstep: 15.03 | step_microstep: 8.41 [2024-09-15 16:57:46,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3828.44 | bwd: 6879.30 | bwd_inner: 6864.19 | bwd_allreduce: 15.05 | step: 8.48 99%|█████████▊| 461/468 [1:22:38<01:13, 10.54s/it] {'loss': 0.2156, 'learning_rate': 2.356210727046504e-08, 'epoch': 2.95} 99%|█████████▊| 461/468 [1:22:38<01:13, 10.54s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389 [2024-09-15 16:57:51,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1493.49 | bwd_microstep: 2629.05 | bwd_inner_microstep: 2629.02 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.23 dynamic ViT batch size: 36, images per sample: 4.5, dynamic token length: 1367 [2024-09-15 16:57:57,484] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.36 | optimizer_gradients: 0.42 | optimizer_step: 0.38 [2024-09-15 16:57:57,485] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1439.52 | bwd_microstep: 4983.96 | bwd_inner_microstep: 2557.41 | bwd_allreduce_microstep: 2426.49 | step_microstep: 7.65 [2024-09-15 16:57:57,485] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2932.98 | bwd: 7613.02 | bwd_inner: 5186.43 | bwd_allreduce: 2426.52 | step: 7.88 99%|█████████▊| 462/468 [1:22:48<01:03, 10.56s/it] {'loss': 0.2267, 'learning_rate': 1.7311837965379164e-08, 'epoch': 2.95} 99%|█████████▊| 462/468 [1:22:48<01:03, 10.56s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1389 [2024-09-15 16:58:01,628] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 1488.17 | bwd_microstep: 2624.95 | bwd_inner_microstep: 2624.84 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.23 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1870 [2024-09-15 16:58:07,014] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.53 | optimizer_gradients: 0.34 | optimizer_step: 0.38 [2024-09-15 16:58:07,014] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1913.28 | bwd_microstep: 3437.81 | bwd_inner_microstep: 3422.66 | bwd_allreduce_microstep: 15.10 | step_microstep: 7.78 [2024-09-15 16:58:07,015] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3401.42 | bwd: 6062.79 | bwd_inner: 6047.50 | bwd_allreduce: 15.17 | step: 8.02 99%|█████████▉| 463/468 [1:22:58<00:51, 10.25s/it] {'loss': 0.2884, 'learning_rate': 1.202263974674045e-08, 'epoch': 2.96} 99%|█████████▉| 463/468 [1:22:58<00:51, 10.25s/it]dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878 [2024-09-15 16:58:12,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.49 | bwd_microstep: 3437.11 | bwd_inner_microstep: 3437.09 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.12 dynamic ViT batch size: 46, images per sample: 5.75, dynamic token length: 1878 [2024-09-15 16:58:17,882] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.48 | optimizer_gradients: 0.34 | optimizer_step: 0.39 [2024-09-15 16:58:17,882] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1949.19 | bwd_microstep: 3498.57 | bwd_inner_microstep: 3483.62 | bwd_allreduce_microstep: 14.90 | step_microstep: 7.61 [2024-09-15 16:58:17,882] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3864.66 | bwd: 6935.70 | bwd_inner: 6920.71 | bwd_allreduce: 14.92 | step: 7.77 99%|█████████▉| 464/468 [1:23:09<00:41, 10.44s/it] {'loss': 0.163, 'learning_rate': 7.694766999513104e-09, 'epoch': 2.96} 99%|█████████▉| 464/468 
[1:23:09<00:41, 10.44s/it]dynamic ViT batch size: 39, images per sample: 4.875, dynamic token length: 1369 [2024-09-15 16:58:21,958] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1462.31 | bwd_microstep: 2582.87 | bwd_inner_microstep: 2582.84 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.04 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1878 [2024-09-15 16:58:28,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.21 | optimizer_gradients: 0.37 | optimizer_step: 0.38 [2024-09-15 16:58:28,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1917.02 | bwd_microstep: 4795.97 | bwd_inner_microstep: 3445.55 | bwd_allreduce_microstep: 1350.35 | step_microstep: 7.20 [2024-09-15 16:58:28,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3379.30 | bwd: 7378.85 | bwd_inner: 6028.39 | bwd_allreduce: 1350.39 | step: 7.25 99%|█████████▉| 465/468 [1:23:20<00:31, 10.55s/it] {'loss': 0.1972, 'learning_rate': 4.328427873541152e-09, 'epoch': 2.97} 99%|█████████▉| 465/468 [1:23:20<00:31, 10.55s/it]dynamic ViT batch size: 41, images per sample: 5.125, dynamic token length: 1878 [2024-09-15 16:58:34,068] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1905.90 | bwd_microstep: 3428.86 | bwd_inner_microstep: 3428.83 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.31 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1876 [2024-09-15 16:58:39,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.53 | optimizer_gradients: 0.35 | optimizer_step: 0.39 [2024-09-15 16:58:39,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1915.20 | bwd_microstep: 3458.78 | bwd_inner_microstep: 3443.79 | bwd_allreduce_microstep: 14.94 | step_microstep: 8.57 [2024-09-15 16:58:39,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3821.08 | bwd: 6887.65 | bwd_inner: 6872.62 | bwd_allreduce: 
14.96 | step: 8.89 100%|█████████▉| 466/468 [1:23:30<00:21, 10.62s/it] {'loss': 0.1847, 'learning_rate': 1.9237842735275737e-09, 'epoch': 2.98} 100%|█████████▉| 466/468 [1:23:30<00:21, 10.62s/it]dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1388 [2024-09-15 16:58:43,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1493.78 | bwd_microstep: 2632.13 | bwd_inner_microstep: 2632.11 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.22 dynamic ViT batch size: 42, images per sample: 5.25, dynamic token length: 1877 [2024-09-15 16:58:50,308] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.43 | optimizer_gradients: 0.38 | optimizer_step: 0.39 [2024-09-15 16:58:50,308] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1918.65 | bwd_microstep: 4719.08 | bwd_inner_microstep: 3445.05 | bwd_allreduce_microstep: 1273.97 | step_microstep: 8.04 [2024-09-15 16:58:50,308] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3412.41 | bwd: 7351.23 | bwd_inner: 6077.15 | bwd_allreduce: 1274.01 | step: 8.27 100%|█████████▉| 467/468 [1:23:41<00:10, 10.68s/it] {'loss': 0.2029, 'learning_rate': 4.809518512494116e-10, 'epoch': 2.98} 100%|█████████▉| 467/468 [1:23:41<00:10, 10.68s/it]dynamic ViT batch size: 44, images per sample: 5.5, dynamic token length: 1877 [2024-09-15 16:58:55,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1931.11 | bwd_microstep: 3462.44 | bwd_inner_microstep: 3462.41 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.18 dynamic ViT batch size: 40, images per sample: 5.0, dynamic token length: 1368 [2024-09-15 16:59:00,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.34 | optimizer_gradients: 0.38 | optimizer_step: 0.38 [2024-09-15 16:59:00,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1468.85 | bwd_microstep: 3700.56 | bwd_inner_microstep: 2593.14 | bwd_allreduce_microstep: 1107.36 
| step_microstep: 7.98 [2024-09-15 16:59:00,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3399.93 | bwd: 7163.01 | bwd_inner: 6055.55 | bwd_allreduce: 1107.39 | step: 8.19 100%|██████████| 468/468 [1:23:52<00:00, 10.67s/it] {'loss': 0.1615, 'learning_rate': 0.0, 'epoch': 2.99} 100%|██████████| 468/468 [1:23:52<00:00, 10.67s/it]petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. 
Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. [INFO|trainer.py:1962] 2024-09-15 16:59:01,963 >> Training completed. 
Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 5033.3609, 'train_samples_per_second': 5.958, 'train_steps_per_second': 0.093, 'train_loss': 0.3061195729762061, 'epoch': 2.99}
100%|██████████| 468/468 [1:23:53<00:00, 10.76s/it]
[INFO|trainer.py:2936] 2024-09-15 16:59:09,984 >> Saving model checkpoint to work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora
[INFO|configuration_utils.py:473] 2024-09-15 16:59:09,985 >> Configuration saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/config.json
[INFO|configuration_utils.py:594] 2024-09-15 16:59:09,986 >> Configuration saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/generation_config.json
[INFO|modeling_utils.py:2501] 2024-09-15 16:59:25,737 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2433] 2024-09-15 16:59:25,738 >> tokenizer config file saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/tokenizer_config.json
[INFO|tokenization_utils_base.py:2442] 2024-09-15 16:59:25,738 >> Special tokens file saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/special_tokens_map.json
[INFO|tokenization_utils_base.py:2493] 2024-09-15 16:59:25,739 >> added tokens file saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora/added_tokens.json
***** train metrics *****
  epoch                    =       2.99
  train_loss               =     0.3061
  train_runtime            = 1:23:53.36
  train_samples            =       9996
  train_samples_per_second =      5.958
  train_steps_per_second   =      0.093
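The throughput figures in the summary can be cross-checked against the logged runtime and step count. A minimal sketch, assuming the Hugging Face Trainer convention of train_samples_per_second = train_samples × num_epochs / train_runtime, with 3 configured epochs and 468 total optimizer steps (156 per epoch) taken from the log:

```python
# Sanity check of the final Trainer summary metrics above.
# Assumption: samples/sec is computed over all configured epochs
# (num_train_epochs = 3), as in transformers' speed_metrics helper.
train_runtime = 5033.3609   # seconds, from the log
train_samples = 9996        # from the log
num_epochs = 3              # 468 total steps / 156 steps per epoch
total_steps = 468

samples_per_second = train_samples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(round(samples_per_second, 3))  # 5.958, matching the logged value
print(round(steps_per_second, 3))    # 0.093, matching the logged value
```

Both values agree with the logged summary, which also confirms that the 468 steps span all three epochs rather than one.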