Add new SentenceTransformer model.

Files changed (11)

1_Pooling/config.json +10 -0
README.md +370 -0
config.json +32 -0
config_sentence_transformers.json +10 -0
model.safetensors +3 -0
modules.json +14 -0
sentence_bert_config.json +4 -0
special_tokens_map.json +37 -0
tokenizer.json +0 -0
tokenizer_config.json +64 -0
vocab.txt +0 -0

1_Pooling/config.json ADDED

	@@ -0,0 +1,10 @@

+{
+  "word_embedding_dimension": 768,
+  "pooling_mode_cls_token": false,
+  "pooling_mode_mean_tokens": true,
+  "pooling_mode_max_tokens": false,
+  "pooling_mode_mean_sqrt_len_tokens": false,
+  "pooling_mode_weightedmean_tokens": false,
+  "pooling_mode_lasttoken": false,
+  "include_prompt": true
+}
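With `pooling_mode_mean_tokens` as the only enabled mode, the sentence vector is the average of the token vectors at non-padding positions. A minimal plain-Python sketch of that idea follows; the function name `mean_pool` and the toy two-dimensional vectors are illustrative assumptions, not the library's implementation, which operates on batched tensors of width 768.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the embeddings of the positions whose mask value is 1."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    # Guard against an all-padding input to avoid dividing by zero.
    count = max(count, 1)
    return [value / count for value in total]


# Toy example: three token positions, the last one is padding and is ignored.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)
```

Masking matters because padded positions would otherwise pull the average toward whatever the encoder emits for `[PAD]`.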

README.md ADDED

	@@ -0,0 +1,370 @@

+---
+language: []
+library_name: sentence-transformers
+tags:
+- sentence-transformers
+- sentence-similarity
+- feature-extraction
+- dataset_size:10K<n<100K
+- loss:ContrastiveLoss
+base_model: raquelsilveira/legalbertpt_fp
+widget:
+- source_sentence: Alteração, fixação, jornada de trabalho, psicólogo.
+  sentences:
+  - "Alteração, lei federal, definição, jornada de trabalho, psicólogo.\r\n\r\n"
+  - Concessão, Pensão especial, pessoa, Sequela, Coronavírus, sujeição, tratamento
+    médico, ineficácia, diretrizes.
+  - Alteração, Código Civil, garantia, companheiro, direito real, habitação, imóvel
+    residencial, inventário.
+- source_sentence: Criação, Fundo Garantidor, empresa, alimentação.
+  sentences:
+  - 'Critérios, concessão, auxíio financeiro, Municípios, compensação, redução, cota,
+    Fundo de Participação dos Municípios (FPM).  '
+  - Alteração, Lei dos Crimes Hediondos, inclusão, crime hediondo, concussão, corrupção
+    ativa, corrupção passiva.
+  - Constituição federal (1988), Direitos e garantias fundamentais, acesso, Internet,
+    inviolabilidade, sigilo, comunicação eletrônica.
+- source_sentence: Fixação, preço, Gás Liquefeito de Petróleo (GLP).
+  sentences:
+  - Autorização, Porto do Forno, município, Arraial do Cabo, (RJ), importação, exportação,
+    biocombustível.
+  - 'Obrigatoriedade, instalação, agência lotérica, banheiro feminino, banheiro masculino,
+    bebedouro, consumidor. '
+  - Proibição, empresa, telefonia móvel, mensagem, cobrança, inadimplência, ligação,
+    cliente.
+- source_sentence: Fixação, prazo, mandato, membro, Tribunal de Contas.
+  sentences:
+  - 'Constituição Federal (1988), criação, mandato coletivo, mandato parlamentar.  '
+  - 'Alteração, Lei Antifumo, teor alcóolico, proibição, propaganda comercial, bebida
+    alcoólica, comunicação de massa. '
+  - Obrigatoriedade, restaurante, concessão, desconto, cliente, cirurgia bariátrica,
+    gastroplastia endoscópica, descumprimento, multa.
+- source_sentence: Regulamentação, profissão, designer de interiores.
+  sentences:
+  - Regulamentação profissional, Influenciador digital, criação, geração, Conteúdo
+    digital, Rede social, Mídia social, atribuição, deveres.
+  - 'Proibição, nomeação, homem, Cargo em comissão, Administração federal, condenação,
+    crime, violência contra mulher. '
+  - 'Alteração, Código Penal,  crime contra a liberdade sexual, tipicidade penal,
+    violação sexual mediante fraude, utilização, sedação, reclusão. '
+pipeline_tag: sentence-similarity
+---
+# SentenceTransformer based on raquelsilveira/legalbertpt_fp
+This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [raquelsilveira/legalbertpt_fp](https://huggingface.co/raquelsilveira/legalbertpt_fp). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+## Model Details
+### Model Description
+- **Model Type:** Sentence Transformer
+- **Base model:** [raquelsilveira/legalbertpt_fp](https://huggingface.co/raquelsilveira/legalbertpt_fp) <!-- at revision c6d8158c5561e78815d354efce6ff77a9e6730c7 -->
+- **Maximum Sequence Length:** 512 tokens
+- **Output Dimensionality:** 768 dimensions
+- **Similarity Function:** Cosine Similarity
+<!-- - **Training Dataset:** Unknown -->
+<!-- - **Language:** Unknown -->
+<!-- - **License:** Unknown -->
+### Model Sources
+- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+### Full Model Architecture
+```
+SentenceTransformer(
+  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
+  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+)
+```
+## Usage
+### Direct Usage (Sentence Transformers)
+First install the Sentence Transformers library:
+```bash
+pip install -U sentence-transformers
+```
+Then you can load this model and run inference.
+```python
+from sentence_transformers import SentenceTransformer
+# Download from the 🤗 Hub
+model = SentenceTransformer("josedossantos/urf-txtIndexacao-legalbertpt")
+# Run inference
+sentences = [
+    'Regulamentação, profissão, designer de interiores.',
+    'Regulamentação profissional, Influenciador digital, criação, geração, Conteúdo digital, Rede social, Mídia social, atribuição, deveres.',
+    'Proibição, nomeação, homem, Cargo em comissão, Administração federal, condenação, crime, violência contra mulher. ',
+]
+embeddings = model.encode(sentences)
+print(embeddings.shape)
+# (3, 768)
+# Get the similarity scores for the embeddings
+similarities = model.similarity(embeddings, embeddings)
+print(similarities.shape)
+# torch.Size([3, 3])
+```
+<!--
+### Direct Usage (Transformers)
+<details><summary>Click to see the direct usage in Transformers</summary>
+</details>
+-->
+<!--
+### Downstream Usage (Sentence Transformers)
+You can finetune this model on your own dataset.
+<details><summary>Click to expand</summary>
+</details>
+-->
+<!--
+### Out-of-Scope Use
+*List how the model may foreseeably be misused and address what users ought not to do with the model.*
+-->
+<!--
+## Bias, Risks and Limitations
+*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+-->
+<!--
+### Recommendations
+*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+-->
+## Training Details
+### Training Dataset
+#### Unnamed Dataset
+* Size: 10,962 training samples
+* Columns: <code>sentence_0</code>, <code>sentence_1</code>, and <code>label</code>
+* Approximate statistics based on the first 1000 samples:
+  |         | sentence_0                                                                         | sentence_1                                                                         | label                                           |
+  |:--------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:------------------------------------------------|
+  | type    | string                                                                             | string                                                                             | int                                             |
+  | details | <ul><li>min: 9 tokens</li><li>mean: 49.68 tokens</li><li>max: 249 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 53.11 tokens</li><li>max: 421 tokens</li></ul> | <ul><li>0: ~49.90%</li><li>1: ~50.10%</li></ul> |
+* Samples:
+  | sentence_0                                                                                                                                                                                                                                                                                     | sentence_1                                                                                                                                                                                                                                                                                                                         | label          |
+  |:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------|
+  | <code>Alteração, Lei de Benefícios da Previdência Social, criação, disciplinamento, auxílio-cuidador, segurado, Regime Geral de Previdência Social (RGPS),  familiar, exercício, atividade, cuidador de deficientes.</code>                                                                    | <code>Alteração, Estatuto do Idoso, requisito, exercício profissional, cuidador de idosos. _Poder público, estímulo, adoção, idoso, campanha educativa. </code>                                                                                                                                                                    | <code>1</code> |
+  | <code>Equiparação, doença, Lúpus Eritematoso Sistêmico, deficiência física, deficiência intelectual, efeito jurídico.</code>                                                                                                                                                                   | <code>Criação, Política Nacional de Conscientização e Orientação sobre LES, combate, doença grave, campanha educativa, tratamento médico, informações, coleta, dados, portador, doença, pesquisa científica, garantia, acesso, medicamentos, inclusão, cosméticos, bloqueador solar, proteção, radiação ultravioleta, pele.</code> | <code>0</code> |
+  | <code>Alteração, Lei de Isenção do IPI para Compra de Automóveis, critério, isenção tributária, Imposto sobre Produtos Industrializados (IPI), aquisição, Automóvel, motorista, Transporte individual, transporte de passageiro, Motorista de aplicativo, benefício fiscal, tributação.</code> | <code>Alteração, Lei de Isenção do IPI para Compra de Automóveis, isenção,  Imposto sobre Produtos Industrializados (IPI), motorista de aplicativo, aquisição, veículo de passageiro, tributação.</code>                                                                                                                           | <code>1</code> |
+* Loss: [<code>ContrastiveLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#contrastiveloss) with these parameters:
+  ```json
+  {
+      "distance_metric": "SiameseDistanceMetric.COSINE_DISTANCE",
+      "margin": 0.5,
+      "size_average": true
+  }
+  ```
+### Training Hyperparameters
+#### Non-Default Hyperparameters
+- `per_device_train_batch_size`: 2
+- `per_device_eval_batch_size`: 2
+- `num_train_epochs`: 1
+- `multi_dataset_batch_sampler`: round_robin
+#### All Hyperparameters
+<details><summary>Click to expand</summary>
+- `overwrite_output_dir`: False
+- `do_predict`: False
+- `prediction_loss_only`: True
+- `per_device_train_batch_size`: 2
+- `per_device_eval_batch_size`: 2
+- `per_gpu_train_batch_size`: None
+- `per_gpu_eval_batch_size`: None
+- `gradient_accumulation_steps`: 1
+- `eval_accumulation_steps`: None
+- `learning_rate`: 5e-05
+- `weight_decay`: 0.0
+- `adam_beta1`: 0.9
+- `adam_beta2`: 0.999
+- `adam_epsilon`: 1e-08
+- `max_grad_norm`: 1
+- `num_train_epochs`: 1
+- `max_steps`: -1
+- `lr_scheduler_type`: linear
+- `lr_scheduler_kwargs`: {}
+- `warmup_ratio`: 0.0
+- `warmup_steps`: 0
+- `log_level`: passive
+- `log_level_replica`: warning
+- `log_on_each_node`: True
+- `logging_nan_inf_filter`: True
+- `save_safetensors`: True
+- `save_on_each_node`: False
+- `save_only_model`: False
+- `no_cuda`: False
+- `use_cpu`: False
+- `use_mps_device`: False
+- `seed`: 42
+- `data_seed`: None
+- `jit_mode_eval`: False
+- `use_ipex`: False
+- `bf16`: False
+- `fp16`: False
+- `fp16_opt_level`: O1
+- `half_precision_backend`: auto
+- `bf16_full_eval`: False
+- `fp16_full_eval`: False
+- `tf32`: None
+- `local_rank`: 0
+- `ddp_backend`: None
+- `tpu_num_cores`: None
+- `tpu_metrics_debug`: False
+- `debug`: []
+- `dataloader_drop_last`: False
+- `dataloader_num_workers`: 0
+- `dataloader_prefetch_factor`: None
+- `past_index`: -1
+- `disable_tqdm`: False
+- `remove_unused_columns`: True
+- `label_names`: None
+- `load_best_model_at_end`: False
+- `ignore_data_skip`: False
+- `fsdp`: []
+- `fsdp_min_num_params`: 0
+- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+- `fsdp_transformer_layer_cls_to_wrap`: None
+- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True}
+- `deepspeed`: None
+- `label_smoothing_factor`: 0.0
+- `optim`: adamw_torch
+- `optim_args`: None
+- `adafactor`: False
+- `group_by_length`: False
+- `length_column_name`: length
+- `ddp_find_unused_parameters`: None
+- `ddp_bucket_cap_mb`: None
+- `ddp_broadcast_buffers`: False
+- `dataloader_pin_memory`: True
+- `dataloader_persistent_workers`: False
+- `skip_memory_metrics`: True
+- `use_legacy_prediction_loop`: False
+- `push_to_hub`: False
+- `resume_from_checkpoint`: None
+- `hub_model_id`: None
+- `hub_strategy`: every_save
+- `hub_private_repo`: False
+- `hub_always_push`: False
+- `gradient_checkpointing`: False
+- `gradient_checkpointing_kwargs`: None
+- `include_inputs_for_metrics`: False
+- `fp16_backend`: auto
+- `push_to_hub_model_id`: None
+- `push_to_hub_organization`: None
+- `mp_parameters`:
+- `auto_find_batch_size`: False
+- `full_determinism`: False
+- `torchdynamo`: None
+- `ray_scope`: last
+- `ddp_timeout`: 1800
+- `torch_compile`: False
+- `torch_compile_backend`: None
+- `torch_compile_mode`: None
+- `dispatch_batches`: None
+- `split_batches`: None
+- `include_tokens_per_second`: False
+- `include_num_input_tokens_seen`: False
+- `neftune_noise_alpha`: None
+- `optim_target_modules`: None
+- `batch_sampler`: batch_sampler
+- `multi_dataset_batch_sampler`: round_robin
+</details>
+### Training Logs
+| Epoch  | Step | Training Loss |
+|:------:|:----:|:-------------:|
+| 0.0912 | 500  | 0.0278        |
+| 0.1824 | 1000 | 0.0242        |
+| 0.2737 | 1500 | 0.0226        |
+| 0.3649 | 2000 | 0.0201        |
+| 0.4561 | 2500 | 0.0189        |
+| 0.5473 | 3000 | 0.0165        |
+| 0.6386 | 3500 | 0.0148        |
+| 0.7298 | 4000 | 0.0135        |
+| 0.8210 | 4500 | 0.0122        |
+| 0.9122 | 5000 | 0.0128        |
+### Framework Versions
+- Python: 3.10.14
+- Sentence Transformers: 3.0.0
+- Transformers: 4.39.3
+- PyTorch: 2.2.0
+- Accelerate: 0.30.1
+- Datasets: 2.14.4
+- Tokenizers: 0.15.1
+## Citation
+### BibTeX
+#### Sentence Transformers
+```bibtex
+@inproceedings{reimers-2019-sentence-bert,
+    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+    author = "Reimers, Nils and Gurevych, Iryna",
+    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+    month = "11",
+    year = "2019",
+    publisher = "Association for Computational Linguistics",
+    url = "https://arxiv.org/abs/1908.10084",
+}
+```
+#### ContrastiveLoss
+```bibtex
+@inproceedings{hadsell2006dimensionality,
+    author={Hadsell, R. and Chopra, S. and LeCun, Y.},
+    booktitle={2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06)},
+    title={Dimensionality Reduction by Learning an Invariant Mapping},
+    year={2006},
+    volume={2},
+    number={},
+    pages={1735-1742},
+    doi={10.1109/CVPR.2006.100}
+}
+```
+<!--
+## Glossary
+*Clearly define terms in order to be accessible across audiences.*
+-->
+<!--
+## Model Card Authors
+*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+-->
+<!--
+## Model Card Contact
+*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+-->
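The README lists `ContrastiveLoss` with cosine distance, a margin of 0.5, and `size_average` set to true. The sketch below illustrates the contrastive formulation from the cited Hadsell et al. paper under those settings: pairs labeled 1 are penalized by their squared distance, and pairs labeled 0 are penalized only when they fall inside the margin. The helper names and the toy vectors are assumptions made for illustration, and the code is a plain-Python approximation rather than the library's tensor implementation.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_distance(u: Vector, v: Vector) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def contrastive_loss(pairs: List[Tuple[Vector, Vector]], labels: List[int], margin: float = 0.5) -> float:
    """Mean contrastive loss over labeled pairs (label 1 = similar, 0 = dissimilar)."""
    losses = []
    for (u, v), label in zip(pairs, labels):
        d = cosine_distance(u, v)
        positive_term = label * d ** 2
        negative_term = (1 - label) * max(0.0, margin - d) ** 2
        losses.append(0.5 * (positive_term + negative_term))
    # size_average = true corresponds to averaging over the batch.
    return sum(losses) / len(losses)


# Identical vectors labeled dissimilar incur the maximum margin penalty.
worst_negative = contrastive_loss([([1.0, 0.0], [1.0, 0.0])], [0])
# Orthogonal vectors labeled dissimilar are already outside the margin.
easy_negative = contrastive_loss([([1.0, 0.0], [0.0, 1.0])], [0])
print(worst_negative, easy_negative)
```

With a margin of 0.5, a dissimilar pair stops contributing to the loss once its cosine similarity drops to 0.5 or below.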

config.json ADDED

	@@ -0,0 +1,32 @@

+{
+  "_name_or_path": "sentence-transformers/models/urf/txtIndexacao_raq/",
+  "architectures": [
+    "BertModel"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "directionality": "bidi",
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "output_past": true,
+  "pad_token_id": 0,
+  "pooler_fc_size": 768,
+  "pooler_num_attention_heads": 12,
+  "pooler_num_fc_layers": 3,
+  "pooler_size_per_head": 128,
+  "pooler_type": "first_token_transform",
+  "position_embedding_type": "absolute",
+  "torch_dtype": "float32",
+  "transformers_version": "4.42.4",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 29794
+}

config_sentence_transformers.json ADDED

	@@ -0,0 +1,10 @@

+{
+  "__version__": {
+    "sentence_transformers": "3.0.1",
+    "transformers": "4.42.4",
+    "pytorch": "2.3.1+cu118"
+  },
+  "prompts": {},
+  "default_prompt_name": null,
+  "similarity_fn_name": null
+}

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:91615a54dfe7c1cb09e9528a7eecef95391985e41be45f83b57a196071c87897
+size 435714904
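As a sanity check, the LFS size above can be compared with a parameter count derived from `config.json`. The sketch below assumes the standard `BertModel` layout (embeddings, 12 encoder layers, and a pooler head) stored as 4-byte floats, as `torch_dtype` suggests; the variable names are illustrative. The small remainder would be consistent with the safetensors metadata header, though that is not verified here.

```python
# Values copied from config.json in this commit.
vocab_size = 29794
hidden_size = 768
intermediate_size = 3072
max_position_embeddings = 512
type_vocab_size = 2
num_hidden_layers = 12


def linear(n_in: int, n_out: int) -> int:
    """Parameter count of a dense layer with bias."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden_size  # scale and shift vectors

# Word, position, and token-type embeddings plus one LayerNorm.
embeddings = (vocab_size + max_position_embeddings + type_vocab_size) * hidden_size + layer_norm

# Query, key, value, and output projections, the feed-forward block, two LayerNorms.
encoder_layer = (
    4 * linear(hidden_size, hidden_size)
    + linear(hidden_size, intermediate_size)
    + linear(intermediate_size, hidden_size)
    + 2 * layer_norm
)

pooler = linear(hidden_size, hidden_size)

total_params = embeddings + num_hidden_layers * encoder_layer + pooler
weight_bytes = total_params * 4  # float32
file_size = 435714904  # from the LFS pointer above

print(total_params, weight_bytes, file_size - weight_bytes)
```

Under these assumptions the weights account for all but roughly 22 KB of the file.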

modules.json ADDED

	@@ -0,0 +1,14 @@

+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "",
+    "type": "sentence_transformers.models.Transformer"
+  },
+  {
+    "idx": 1,
+    "name": "1",
+    "path": "1_Pooling",
+    "type": "sentence_transformers.models.Pooling"
+  }
+]
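The module list above defines the order in which the pipeline stages run: the Transformer encoder first, then the Pooling layer. The sketch below parses the same JSON and prints that order. It assumes that an empty `path` refers to the repository root; the variable names are illustrative and the library's own loader is not reproduced.

```python
import json

# Contents of modules.json from this commit.
MODULES_JSON = """
[
  {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"},
  {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"}
]
"""

modules = sorted(json.loads(MODULES_JSON), key=lambda m: m["idx"])

pipeline = []
for module in modules:
    # Keep only the class name from the dotted import path.
    class_name = module["type"].rsplit(".", 1)[-1]
    # Assumption: an empty path means the files sit in the repository root.
    folder = module["path"] or "."
    pipeline.append((module["idx"], class_name, folder))

for idx, class_name, folder in pipeline:
    print(f"{idx}: {class_name} (config in {folder}/)")
```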

sentence_bert_config.json ADDED

	@@ -0,0 +1,4 @@

+{
+  "max_seq_length": 512,
+  "do_lower_case": false
+}

special_tokens_map.json ADDED

	@@ -0,0 +1,37 @@

+{
+  "cls_token": {
+    "content": "[CLS]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "[MASK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "[PAD]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "[SEP]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "[UNK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,64 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_basic_tokenize": true,
+  "do_lower_case": false,
+  "mask_token": "[MASK]",
+  "max_length": 512,
+  "model_max_length": 512,
+  "never_split": null,
+  "pad_to_multiple_of": null,
+  "pad_token": "[PAD]",
+  "pad_token_type_id": 0,
+  "padding_side": "right",
+  "sep_token": "[SEP]",
+  "stride": 0,
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "truncation_side": "right",
+  "truncation_strategy": "longest_first",
+  "unk_token": "[UNK]"
+}
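The tokenizer settings above imply how an encoded input is framed: `[CLS]` (id 101) in front, `[SEP]` (id 102) at the end, truncation and padding on the right, and a cap of 512 positions. The sketch below illustrates only that framing step. The function `frame_ids` and the WordPiece ids in the example are hypothetical, and the WordPiece tokenization itself is not implemented.

```python
from typing import Iterable, List, Optional, Tuple

# Ids taken from added_tokens_decoder in tokenizer_config.json.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0
MODEL_MAX_LENGTH = 512


def frame_ids(wordpiece_ids: Iterable[int], pad_to: Optional[int] = None) -> Tuple[List[int], List[int]]:
    """Wrap ids in [CLS] ... [SEP], truncating and padding on the right."""
    # Reserve two positions for [CLS] and [SEP], then cut from the right.
    body = list(wordpiece_ids)[: MODEL_MAX_LENGTH - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    if pad_to is not None and pad_to > len(input_ids):
        padding = pad_to - len(input_ids)
        input_ids += [PAD_ID] * padding
        attention_mask += [0] * padding
    return input_ids, attention_mask


# Hypothetical WordPiece ids for a short input, padded to six positions.
ids, mask = frame_ids([2000, 2001, 2002], pad_to=6)
# An over-long input is cut so that the framed sequence fits in 512 positions.
long_ids, long_mask = frame_ids(range(1000, 1600))
print(ids, mask, len(long_ids))
```

The attention mask produced here is what a masked mean-pooling step would use to skip the padded positions.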

vocab.txt ADDED

The diff for this file is too large to render. See raw diff