Size of hidden layers and sliding window attention: the dimension is the same, 4096. Is that for a reason?
#153 by keval-sha
Looking at the configuration of Mistral-7B-v0.1:
Model configuration: MistralConfig {
"_name_or_path": "mistralai/Mistral-7B-v0.1",
"architectures": [
"MistralForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
> "hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 32768,
"model_type": "mistral",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"rms_norm_eps": 1e-05,
"rope_theta": 10000.0,
> "sliding_window": 4096,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.41.1",
"use_cache": true,
"vocab_size": 32000
}
The hidden_size attribute of the hidden layers and the sliding_window token length for local attention are exactly the same. Curious: why is this the case?
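For reference, a quick way to check these two values directly; a minimal sketch, assuming the transformers library is installed:

```python
from transformers import AutoConfig

# Fetch only the configuration file, not the model weights.
config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")

print(config.hidden_size)     # 4096 -- width of the token embeddings / residual stream
print(config.sliding_window)  # 4096 -- number of past tokens visible to local attention
```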
Those two values aren't related; the sliding window refers to the attention context and is related to max_position_embeddings, not to hidden_size.
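To make the distinction concrete, here is a toy sketch (assuming PyTorch, with made-up small sizes): the sliding window only constrains which sequence positions can attend to each other, while hidden_size only sets the width of the hidden states.

```python
import torch

# Toy sizes for illustration; in Mistral-7B-v0.1 these would be
# window = 4096, hidden = 4096, and seq_len up to max_position_embeddings (32768).
seq_len, window, hidden = 10, 4, 8

# Sliding-window causal mask: position i may attend to positions i-window+1 .. i.
idx = torch.arange(seq_len)
mask = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < window)

# hidden_size only determines the width of the hidden states, not the mask.
hidden_states = torch.randn(seq_len, hidden)

print(mask.int())           # (seq_len, seq_len): depends on window, not on hidden
print(hidden_states.shape)  # (seq_len, hidden): depends on hidden, not on window
```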
Yeah, that makes sense. I'm just wondering why they are both exactly 4096. Interesting architecture choices.