Size of hidden layers and sliding window attention: the dimension is the same, 4096. Is that for a reason?
#153 by keval-sha
Looking at the configuration of Mistral-7B-v0.1:
Model configuration: MistralConfig {
"_name_or_path": "mistralai/Mistral-7B-v0.1",
"architectures": [
"MistralForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
> "hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 32768,
"model_type": "mistral",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"rms_norm_eps": 1e-05,
"rope_theta": 10000.0,
> "sliding_window": 4096,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.41.1",
"use_cache": true,
"vocab_size": 32000
}
The hidden_size attribute of the hidden layers and the sliding_window token length for local attention are exactly the same. Curious: why is this the case?
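For reference, a quick way to check these two values directly; a minimal sketch, assuming the transformers library is installed:

```python
from transformers import AutoConfig

# Fetch only the configuration file, not the model weights.
config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")

print(config.hidden_size)     # 4096 -- width of the token embeddings / residual stream
print(config.sliding_window)  # 4096 -- number of past tokens visible to local attention
```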
Those two values aren't related; the sliding window refers to the attention context and is related to max_position_embeddings, not to hidden_size.
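To make the distinction concrete, here is a toy sketch (assuming PyTorch, with made-up small sizes): the sliding window only constrains which sequence positions can attend to each other, while hidden_size only sets the width of the hidden states.

```python
import torch

# Toy sizes for illustration; in Mistral-7B-v0.1 these would be
# window = 4096, hidden = 4096, and seq_len up to max_position_embeddings (32768).
seq_len, window, hidden = 10, 4, 8

# Sliding-window causal mask: position i may attend to positions i-window+1 .. i.
idx = torch.arange(seq_len)
mask = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < window)

# hidden_size only determines the width of the hidden states, not the mask.
hidden_states = torch.randn(seq_len, hidden)

print(mask.int())           # (seq_len, seq_len): depends on window, not on hidden
print(hidden_states.shape)  # (seq_len, hidden): depends on hidden, not on window
```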
Yeah, that makes sense. I'm just wondering why they are both exactly 4096. Interesting architecture choices.