Upload ZettHypernet

Files changed (5)

README.md +199 -0
config.json +56 -0
configuration_hypernet.py +56 -0
model.safetensors +3 -0
modeling_hypernet.py +267 -0

README.md ADDED

	@@ -0,0 +1,199 @@

+---
+library_name: transformers
+tags: []
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]

config.json ADDED

	@@ -0,0 +1,56 @@

+{
+  "_name_or_path": "mistralai/Mistral-7B-v0.1",
+  "architectures": [
+    "ZettHypernet"
+  ],
+  "attention_dropout": 0.0,
+  "auto_map": {
+    "AutoConfig": "configuration_hypernet.ZettHypernetConfig",
+    "AutoModel": "modeling_hypernet.ZettHypernet"
+  },
+  "bos_token_id": 1,
+  "eos_token_id": 2,
+  "hidden_act": "silu",
+  "hidden_size": 4096,
+  "hn_add_inter_token_attention": false,
+  "hn_concat_last_hidden_state": false,
+  "hn_embed_lang_id": false,
+  "hn_embed_target_priors": false,
+  "hn_embed_using_source_embeddings": true,
+  "hn_hidden_size": 4096,
+  "hn_inter_token_attention_bias_by_priors": true,
+  "hn_inter_token_attention_bias_scaler": 1.0,
+  "hn_intermediate_size": 8192,
+  "hn_language_adapter_bottleneck_dim": 0,
+  "hn_model_name_or_path": "roberta-base",
+  "hn_model_type": "roberta",
+  "hn_n_extra_tokens": 522,
+  "hn_n_inter_token_blocks": 16,
+  "hn_n_layers": 3,
+  "hn_num_attention_heads": 32,
+  "hn_predict_bias": true,
+  "hn_rescale_embeddings": true,
+  "hn_single_head": false,
+  "hn_surface_maxlen": 7,
+  "initializer_range": 0.02,
+  "intermediate_size": 14336,
+  "max_position_embeddings": 32768,
+  "n_embd": 4096,
+  "n_langs": 7,
+  "name": "v7:mistral7b_en+code:lw=0.5_long",
+  "num_attention_heads": 32,
+  "num_hidden_layers": 32,
+  "num_key_value_heads": 8,
+  "original_vocab_size": 32000,
+  "pad_token_id": 2,
+  "rms_norm_eps": 1e-05,
+  "rope_theta": 10000.0,
+  "separate_out_embeddings": true,
+  "sliding_window": 4096,
+  "tie_word_embeddings": false,
+  "torch_dtype": "float32",
+  "transformers_version": "4.39.0.dev0",
+  "use_cache": true,
+  "use_unigram_bias": true,
+  "vocab_size": 32896
+}
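The dimensions that `modeling_hypernet.py` derives from this config can be checked by hand. The sketch below restates that arithmetic in plain Python. The `derived_dims` helper and the dict layout are illustrative assumptions, not part of the commit, and only a subset of the config keys is copied.

```python
# Subset of values copied from the config.json in this commit.
cfg = {
    "n_embd": 4096,
    "separate_out_embeddings": True,
    "hn_hidden_size": 4096,
    "hn_intermediate_size": 8192,
    "hn_num_attention_heads": 32,
    "hn_n_extra_tokens": 522,
    "original_vocab_size": 32000,
}


def derived_dims(cfg):
    """Hypothetical helper mirroring the size logic in ZettHypernet.__init__."""
    n_embd = cfg["n_embd"]
    # With separate output embeddings, the hypernetwork input is twice as wide
    # (n_in_embd = n_embd * 2), while each prediction head emits n_embd values.
    separate = cfg.get("separate_out_embeddings", False)
    n_in_embd = n_embd * 2 if separate else n_embd

    # The head count falls back to hn_hidden_size // 64 when it is unset.
    heads = cfg.get("hn_num_attention_heads")
    if heads is None:
        heads = cfg["hn_hidden_size"] // 64

    return {
        "n_in_embd": n_in_embd,
        "n_out_embd": n_embd,
        "num_heads": heads,
        "head_dim": cfg["hn_hidden_size"] // heads,
        # The fallback table always has at least one row.
        "fallback_rows": max(cfg["hn_n_extra_tokens"], 1),
        # IDs at or above this value are looked up in the fallback table.
        "first_fallback_id": cfg["original_vocab_size"],
    }


dims = derived_dims(cfg)
for name, value in dims.items():
    print(f"{name}: {value}")
```

With the uploaded values, the explicit `hn_num_attention_heads` of 32 takes precedence over the `hn_hidden_size // 64` fallback, which would otherwise give 64 heads.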

configuration_hypernet.py ADDED

	@@ -0,0 +1,56 @@

+from transformers import PretrainedConfig
+class ZettHypernetConfig(PretrainedConfig):
+    def __init__(
+        self,
+        hn_model_name_or_path: str = "roberta-base",
+        hn_surface_maxlen: int = 16,
+        hn_n_layers: int = 3,
+        n_embd: int = 768,
+        hn_hidden_size: int = None,
+        hn_intermediate_size: int = None,
+        hn_rescale_embeddings: bool = False,
+        use_unigram_bias: bool = False,
+        hn_embed_target_priors: bool = False,
+        hn_add_inter_token_attention: bool = False,
+        hn_inter_token_attention_bias_by_priors: bool = False,
+        hn_inter_token_attention_bias_scaler: float = 1.0,
+        hn_n_inter_token_blocks: int = 16,
+        hn_language_adapter_bottleneck_dim: int = 0,
+        hn_embed_using_source_embeddings: bool = False,
+        hn_concat_last_hidden_state: bool = False,
+        hn_single_head: bool = False,
+        hn_predict_bias: bool = True,
+        hn_num_attention_heads: int = None,
+        hn_embed_lang_id: bool = False,
+        hn_model_type: str = "roberta",
+        n_langs: int = None,  # set in train.py
+        **kwargs
+    ):
+        super().__init__(**kwargs)
+        self.model_type = "zett_hypernetwork"
+        self.hn_model_name_or_path = hn_model_name_or_path
+        self.hn_surface_maxlen = hn_surface_maxlen
+        self.hn_n_layers = hn_n_layers
+        self.n_embd = n_embd
+        self.hn_hidden_size = hn_hidden_size
+        self.hn_intermediate_size = hn_intermediate_size
+        self.hn_rescale_embeddings = hn_rescale_embeddings
+        self.use_unigram_bias = use_unigram_bias
+        self.hn_embed_target_priors = hn_embed_target_priors
+        self.hn_add_inter_token_attention = hn_add_inter_token_attention
+        self.hn_inter_token_attention_bias_by_priors = (
+            hn_inter_token_attention_bias_by_priors
+        )
+        self.hn_inter_token_attention_bias_scaler = hn_inter_token_attention_bias_scaler
+        self.hn_n_inter_token_blocks = hn_n_inter_token_blocks
+        self.hn_language_adapter_bottleneck_dim = hn_language_adapter_bottleneck_dim
+        self.hn_embed_using_source_embeddings = hn_embed_using_source_embeddings
+        self.hn_concat_last_hidden_state = hn_concat_last_hidden_state
+        self.hn_single_head = hn_single_head
+        self.hn_predict_bias = hn_predict_bias
+        self.hn_num_attention_heads = hn_num_attention_heads
+        self.hn_embed_lang_id = hn_embed_lang_id
+        self.hn_model_type = hn_model_type
+        self.n_langs = n_langs
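Many of the constructor defaults above are overridden by the uploaded `config.json`. A minimal sketch of that comparison follows, using a hand-copied subset of both. The `DEFAULTS`, `UPLOADED`, and `overridden` names are hypothetical and exist only for this illustration.

```python
# Constructor defaults copied from ZettHypernetConfig.__init__ (subset).
DEFAULTS = {
    "hn_surface_maxlen": 16,
    "hn_n_layers": 3,
    "n_embd": 768,
    "hn_hidden_size": None,
    "hn_intermediate_size": None,
    "hn_rescale_embeddings": False,
    "hn_embed_using_source_embeddings": False,
    "hn_num_attention_heads": None,
    "n_langs": None,
}

# Matching values copied from the config.json in this commit.
UPLOADED = {
    "hn_surface_maxlen": 7,
    "hn_n_layers": 3,
    "n_embd": 4096,
    "hn_hidden_size": 4096,
    "hn_intermediate_size": 8192,
    "hn_rescale_embeddings": True,
    "hn_embed_using_source_embeddings": True,
    "hn_num_attention_heads": 32,
    "n_langs": 7,
}


def overridden(defaults, uploaded):
    """Return {key: (default, uploaded)} for keys whose values differ."""
    return {
        key: (defaults[key], uploaded[key])
        for key in defaults
        if key in uploaded and defaults[key] != uploaded[key]
    }


changes = overridden(DEFAULTS, UPLOADED)
for key, (default, value) in changes.items():
    print(f"{key}: {default!r} -> {value!r}")
```

The `hn_embed_using_source_embeddings` override matters in practice: the forward pass in `modeling_hypernet.py` raises `NotImplementedError` when that flag is false, so a config built from the defaults alone would not run.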

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e6f9e3a501e86bd36e217f0f6dc8066a3dbeb78df4c483e0079bbe0ee7cdfbe1
+size 2710971844
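The three lines above are a Git LFS pointer, not the weights themselves. As a rough sanity check, the pointer can be parsed and its `size` field turned into an upper bound on the parameter count. This is a sketch under two assumptions: every tensor is stored as 4-byte float32 (consistent with `"torch_dtype": "float32"` in `config.json`, but not verified against the file), and the safetensors header is small enough to ignore. The `parse_lfs_pointer` helper is hypothetical.

```python
# Pointer text copied from the diff, without the leading "+" markers.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:e6f9e3a501e86bd36e217f0f6dc8066a3dbeb78df4c483e0079bbe0ee7cdfbe1
size 2710971844
"""


def parse_lfs_pointer(text):
    """Split each 'key value' line of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


info = parse_lfs_pointer(POINTER)

# Upper bound only: the file also contains a JSON header, and the
# float32 assumption (4 bytes per parameter) is not checked here.
approx_params = info["size"] // 4
size_gib = info["size"] / 2**30

print(f"{info['hash_algo']} digest, {size_gib:.2f} GiB")
print(f"at most about {approx_params / 1e6:.0f}M float32 parameters")
```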

modeling_hypernet.py ADDED

	@@ -0,0 +1,267 @@

+from .configuration_hypernet import ZettHypernetConfig
+from transformers import PreTrainedModel, RobertaConfig, RobertaModel
+from functools import partial
+from torch import nn as nn
+import torch
+from torch.nn import functional as F
+class Rescaler(nn.Module):
+    """Fixed (non-trainable) elementwise affine map: ``w * x + b``."""
+    def __init__(self, dim: int):
+        super().__init__()
+        self.dim = dim
+        self.w = nn.Parameter(torch.ones((1, self.dim)), requires_grad=False)
+        self.b = nn.Parameter(torch.ones((1, self.dim)), requires_grad=False)
+    def forward(self, x):
+        return self.w * x + self.b
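+class ProjectorBlock(nn.Module):
+    """Two-layer GELU MLP with a residual connection and LayerNorm.
+
+    The residual ``h + x`` requires ``input_dim == dim``.
+    """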
+    def __init__(self, input_dim: int, dim: int, intermediate_dim: int):
+        super().__init__()
+        self.input_dim = input_dim
+        self.dim = dim
+        self.intermediate_dim = intermediate_dim
+        self.dense1 = nn.Linear(self.input_dim, self.intermediate_dim)
+        self.dense2 = nn.Linear(self.intermediate_dim, self.dim)
+        self.ln = nn.LayerNorm(self.dim, eps=1e-6)
+    def forward(self, x):
+        h = F.gelu(
+            self.dense2(F.gelu(self.dense1(x), approximate="tanh")),
+            approximate="tanh",
+        )
+        return self.ln(h + x)
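+class ZettHypernet(PreTrainedModel):
+    """Hypernetwork that maps each row of ``target_surface_forms`` to a predicted
+    input embedding, an optional output embedding, and a scalar bias."""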
+    config_class = ZettHypernetConfig
+    def __init__(self, config: ZettHypernetConfig):
+        super().__init__(config)
+        self.config = config
+        self.has_separate_out_embeddings = getattr(
+            self.config, "separate_out_embeddings", False
+        )
+        if self.config.hn_embed_lang_id:
+            self.lang_embeddings = nn.Embedding(
+                self.config.n_langs, self.config.hn_hidden_size
+            )
+        if self.has_separate_out_embeddings:
+            n_in_embd = self.config.n_embd * 2
+            n_out_embd = self.config.n_embd
+        else:
+            n_in_embd = self.config.n_embd
+            n_out_embd = self.config.n_embd
+        if self.config.hn_model_type == "roberta":
+            config = RobertaConfig.from_pretrained(
+                self.config.hn_model_name_or_path
+            )
+            config.num_hidden_layers = self.config.hn_n_layers
+            config.hidden_size = self.config.hn_hidden_size
+            config.intermediate_size = self.config.hn_intermediate_size
+            if getattr(self.config, "hn_num_attention_heads", None) is None:
+                self.config.hn_num_attention_heads = self.config.hn_hidden_size // 64
+            config.num_attention_heads = self.config.hn_num_attention_heads
+            self.embed_init_range = config.initializer_range
+            module_class = partial(RobertaModel, add_pooling_layer=False)
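+        else:
+            # "t5" and any other backbone type are not supported here; failing
+            # early avoids a NameError on `module_class` further down
+            raise NotImplementedError(
+                f"unsupported hn_model_type: {self.config.hn_model_type!r}"
+            )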
+        if self.config.hn_embed_using_source_embeddings:
+            # no full embedding table is needed since inputs_embeds is always
+            # passed; keep just enough rows to cover pad_token_id
+            config.vocab_size = self.config.pad_token_id + 1
+        if (
+            self.config.hn_add_inter_token_attention
+            or self.config.hn_embed_target_priors
+        ):
+            raise NotImplementedError()
+        self.pad_token_id = self.config.pad_token_id
+        assert self.pad_token_id is not None
+        self.model = module_class(config)
+        # need at least one embedding
+        self.fallback_embeddings = nn.Embedding(
+            max(self.config.hn_n_extra_tokens, 1), n_in_embd
+        )
+        if self.config.hn_embed_using_source_embeddings:
+            self.input_projection = nn.Sequential(
+                *[
+                    nn.Linear(n_in_embd, self.config.hn_hidden_size),
+                    ProjectorBlock(
+                        self.config.hn_hidden_size,
+                        self.config.hn_hidden_size,
+                        self.config.hn_intermediate_size,
+                    ),
+                ]
+            )
+        if self.config.hn_single_head:
+            self.output_projection = nn.Sequential(
+                *[
+                    ProjectorBlock(
+                        self.config.hn_hidden_size,
+                        self.config.hn_hidden_size,
+                        self.config.hn_intermediate_size,
+                    ),
+                    nn.Linear(self.config.hn_hidden_size, n_in_embd),
+                ]
+            )
+        else:
+            self.output_projection = nn.Sequential(
+                *[
+                    ProjectorBlock(
+                        self.config.hn_hidden_size,
+                        self.config.hn_hidden_size,
+                        self.config.hn_intermediate_size,
+                    ),
+                    nn.Linear(self.config.hn_hidden_size, n_out_embd),
+                ]
+            )
+            if self.has_separate_out_embeddings:
+                self.output_projection_out = nn.Sequential(
+                    *[
+                        ProjectorBlock(
+                            self.config.hn_hidden_size,
+                            self.config.hn_hidden_size,
+                            self.config.hn_intermediate_size,
+                        ),
+                        nn.Linear(self.config.hn_hidden_size, self.config.n_embd),
+                    ]
+                )
+        if self.config.hn_rescale_embeddings:
+            self.in_scaler = Rescaler(n_in_embd)
+            self.scaler = Rescaler(n_out_embd)
+            if self.has_separate_out_embeddings:
+                self.out_scaler = Rescaler(self.config.n_embd)
+        if getattr(self.config, "hn_predict_bias", False):
+            self.bias_projection = nn.Linear(self.config.hn_hidden_size, 1)
+    def forward(
+        self,
+        target_surface_forms,
+        target_priors=None,
+        source_embeddings=None,
+        lang_index=None,
+        deterministic: bool = True,
+    ):
+        if target_priors is not None:
+            raise NotImplementedError()
+        if not self.config.hn_embed_using_source_embeddings:
+            raise NotImplementedError()
+        use_fallback = target_surface_forms >= self.config.original_vocab_size
+        main_ids = torch.minimum(
+            target_surface_forms, torch.tensor(self.config.original_vocab_size - 1, device=self.device)
+        )
+        fallback_ids = torch.maximum(
+            target_surface_forms - self.config.original_vocab_size, torch.tensor(0, device=self.device)
+        )
+        source_embeds = F.embedding(main_ids, weight=source_embeddings)
+        if self.config.hn_rescale_embeddings:
+            source_embeds = self.in_scaler(source_embeds)
+        inputs_embeds = torch.where(
+            use_fallback[..., None],
+            self.fallback_embeddings(fallback_ids),
+            source_embeds,
+        )
+        inputs_embeds = self.input_projection(inputs_embeds)
+        attention_mask = target_surface_forms != self.pad_token_id
+        if self.config.hn_embed_lang_id:
+            lang_embedding = self.lang_embeddings(lang_index).squeeze()
+            # the PyTorch RobertaEmbeddings adds position and token-type embeddings
+            # to inputs_embeds, so subtract them here to keep the language
+            # embedding unchanged after that addition
+            lang_embedding -= self.model.embeddings.token_type_embeddings(
+                torch.tensor(0, device=self.device)
+            ) + self.model.embeddings.position_embeddings(
+                torch.tensor(attention_mask.shape[1], device=self.device)
+            )
+            lang_embedding = lang_embedding[None, None, :].expand(
+                inputs_embeds.shape[0], -1, -1
+            )
+            inputs_embeds = torch.cat(
+                [
+                    inputs_embeds,
+                    lang_embedding,
+                ],
+                dim=1,
+            )
+            attention_mask = torch.cat(
+                [
+                    attention_mask,
+                    torch.ones(lang_embedding.shape[:-1], dtype=torch.bool, device=self.device),
+                ],
+                dim=1,
+            )
+        position_ids = torch.broadcast_to(
+            torch.arange(torch.atleast_2d(attention_mask).shape[-1], device=self.device),
+            attention_mask.shape,
+        )
+        hidden_states = self.model(
+            inputs_embeds=inputs_embeds,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+        ).last_hidden_state
+        if self.config.hn_concat_last_hidden_state:
+            hidden_states = hidden_states.reshape(target_surface_forms.shape[0], -1)
+        else:
+            hidden_states = hidden_states[:, 0]
+        predicted_embeddings = self.output_projection(hidden_states)
+        if self.config.hn_single_head:
+            predicted_embeddings_in = predicted_embeddings[..., : self.config.n_embd]
+            if self.has_separate_out_embeddings:
+                predicted_embeddings_out = predicted_embeddings[
+                    ..., self.config.n_embd :
+                ]
+            else:
+                predicted_embeddings_out = None
+        else:
+            predicted_embeddings_in = predicted_embeddings
+            if self.has_separate_out_embeddings:
+                predicted_embeddings_out = self.output_projection_out(hidden_states)
+            else:
+                predicted_embeddings_out = None
+        if self.config.hn_rescale_embeddings:
+            predicted_embeddings_in = self.scaler(predicted_embeddings_in)
+            if predicted_embeddings_out is not None:
+                predicted_embeddings_out = self.out_scaler(predicted_embeddings_out)
+        if getattr(self.config, "hn_predict_bias", False):
+            predicted_bias = self.bias_projection(hidden_states)[..., 0]
+        else:
+            predicted_bias = torch.zeros_like(
+                target_surface_forms[..., 0], dtype=self.dtype
+            )
+        return predicted_embeddings_in, predicted_embeddings_out, predicted_bias
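The first step of the forward pass splits the incoming IDs into two lookups: IDs below `original_vocab_size` index the caller-supplied `source_embeddings`, and the rest index the learned `fallback_embeddings`. The sketch below reproduces just that routing on toy tensors. All sizes and the example IDs are assumptions chosen for readability, and the rescaling and input projection that follow in the real model are omitted.

```python
import torch
from torch import nn
from torch.nn import functional as F

# Toy sizes (assumptions); the uploaded config uses 32000, 522 and 8192.
original_vocab_size = 5
n_extra_tokens = 3
dim = 4

# Stand-in for the source model's embedding matrix passed in by the caller.
source_embeddings = torch.arange(
    original_vocab_size * dim, dtype=torch.float32
).reshape(original_vocab_size, dim)
fallback_embeddings = nn.Embedding(n_extra_tokens, dim)

# Two rows of IDs; values >= original_vocab_size refer to extra tokens.
target_surface_forms = torch.tensor([[0, 4, 5], [7, 2, 6]])

use_fallback = target_surface_forms >= original_vocab_size
# Clip both index tensors so each lookup stays in range; torch.where
# below discards whichever lookup does not apply at a given position.
main_ids = torch.minimum(
    target_surface_forms, torch.tensor(original_vocab_size - 1)
)
fallback_ids = torch.maximum(
    target_surface_forms - original_vocab_size, torch.tensor(0)
)

source_embeds = F.embedding(main_ids, weight=source_embeddings)
inputs_embeds = torch.where(
    use_fallback[..., None],  # broadcast the mask over the embedding dim
    fallback_embeddings(fallback_ids),
    source_embeds,
)

print(use_fallback.tolist())
print(main_ids.tolist())
print(fallback_ids.tolist())
print(tuple(inputs_embeds.shape))
```

Both lookups run for every position, which is why the indices are clipped rather than masked: an out-of-range ID would raise an indexing error even at positions where `torch.where` later ignores the result.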