Text Classification · sentence-transformers · PyTorch · setfit · Spanish · mpnet
mserras committed · Commit c2b344d · 1 Parent(s): 9c442e6

Files changed (2):
  1. README.md +22 -45
  2. config.json +1 -1
README.md CHANGED
@@ -1,33 +1,18 @@
  ---
+ license: apache-2.0
  tags:
  - setfit
  - sentence-transformers
  - text-classification
  pipeline_tag: text-classification
- datasets:
- - mserras/alpaca-es-hackaton
- - somosnlp/somos-clean-alpaca-es
- language:
- - es
  ---

- # mserras/setfit-alpaca-es-unprocessable-sample-detection
+ # hackathon-somos-nlp-2023/setfit-alpaca-es-unprocessable-sample-detection

- This is a [SetFit model](https://github.com/huggingface/setfit) that can be used for filtering the Alpaca ES instruction dataset.
+ This is a [SetFit model](https://github.com/huggingface/setfit) that can be used for text classification. The model has been trained using an efficient few-shot learning technique that involves:

- The base model is [paraphrase-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-mpnet-base-v2) from Sentence Transformers.
-
- This model was developed during the 2023 hackathon organized by [SomosNLP](https://somosnlp.org/) ([HF organization](https://huggingface.co/somosnlp)), with GPUs provided by [Q Blocks](https://www.qblocks.cloud).
-
- This model was trained on "unprocessable" samples of the translated [Clean Alpaca Es](https://huggingface.co/datasets/somosnlp/somos-clean-alpaca-es) dataset from
- the HF [Argilla](https://argilla.io) space https://huggingface.co/spaces/mserras/somos-alpaca-es.
-
- To this end, a custom tag is proposed: "unprocessable", which corresponds to instruction/input/output triplets that require processing images, fetching information from the
- open web, and similar tasks where the LLM has no capability to act, thus ending in hallucinations or strange outcomes.
-
- As this model was trained on samples of Alpaca, which were generated using ChatGPT 3.5, it **cannot be used for commercial purposes or to compete against OpenAI**.
-
- The scores are dumped in the dataset in the metadata field "sf-unprocessable-score".
+ 1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
+ 2. Training a classification head with features from the fine-tuned Sentence Transformer.

  ## Usage

@@ -41,32 +26,24 @@ You can then run inference as follows:

  ```python
  from setfit import SetFitModel
- import argilla as rg
-

  # Download from Hub and run inference
- model = SetFitModel.from_pretrained("mserras/setfit-alpaca-es-unprocessable-sample-detection")
-
- def instruct_fields_to_text(field_instruction: str, field_input: str, field_output: str):
-     """Given the instruction, input and output fields, return a text to be used by setfit"""
-     return f"INSTRUCTION:\n{field_instruction}\nINPUT:\n{field_input}\nOUTPUT:\n{field_output}\n"
-
- def sample_to_text(sample: rg.TextClassificationRecord) -> str:
-     """Converts an Argilla TextClassificationRecord to a text to be used by setfit"""
-     return instruct_fields_to_text(sample.inputs["1-instruction"], sample.inputs["2-input"], sample.inputs["3-output"])
-
- # For a given Argilla record, take the probability of the positive ("unprocessable") class:
-
- unprocessable_score = model.predict_proba([sample_to_text(argilla_record)])[0].tolist()[1]
-
+ model = SetFitModel.from_pretrained("hackathon-somos-nlp-2023/setfit-alpaca-es-unprocessable-sample-detection")
+ # Run inference
+ preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
  ```

- ## Evaluation
-
- *Disclaimer*: No formal evaluation was performed; the results were only checked informally by inspecting the data and the outcomes.
-
- ## Changelog
-
- - [09/04/2023] SQL code generation, date conversion, percentage discounts and renewable energies are no longer detected as unprocessable.
- - [06/04/2023] Password generation is no longer detected as unprocessable.
-
+ ## BibTeX entry and citation info
+
+ ```bibtex
+ @article{https://doi.org/10.48550/arxiv.2209.11055,
+   doi = {10.48550/ARXIV.2209.11055},
+   url = {https://arxiv.org/abs/2209.11055},
+   author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
+   keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
+   title = {Efficient Few-Shot Learning Without Prompts},
+   publisher = {arXiv},
+   year = {2022},
+   copyright = {Creative Commons Attribution 4.0 International}
+ }
+ ```
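The updated card summarizes SetFit's two training steps, but this commit contains no training script. For orientation only, here is a minimal sketch of that two-step recipe using setfit's pre-1.0 `SetFitTrainer` API and the paraphrase-mpnet-base-v2 base mentioned in the removed card; the example texts, labels, and hyperparameters below are hypothetical, not the hackathon's actual training setup.

```python
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

# Tiny hypothetical labeled set: 1 = "unprocessable" triplet, 0 = processable.
train_ds = Dataset.from_dict({
    "text": [
        "INSTRUCTION:\nDescribe the attached image.\nINPUT:\n[image]\nOUTPUT:\n...",
        "INSTRUCTION:\nTranslate 'good morning' to Spanish.\nINPUT:\n\nOUTPUT:\nBuenos días.",
    ],
    "label": [1, 0],
})

# Step 1: the trainer fine-tunes the sentence-transformer body on contrastive pairs.
# Step 2: it then fits a classification head on the resulting embeddings.
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    num_iterations=20,  # number of contrastive pairs generated per labeled example
    num_epochs=1,
)
trainer.train()

# predict_proba returns per-class probabilities; index 1 is the "unprocessable" score.
score = model.predict_proba(["INSTRUCTION:\nSummarize this text.\nINPUT:\n...\nOUTPUT:\n..."])[0].tolist()[1]
```

Label index 1 plays the role of "unprocessable" here, matching how the removed usage snippet reads `predict_proba(...)[0].tolist()[1]`.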
config.json CHANGED
@@ -1,5 +1,5 @@
  {
- "_name_or_path": "/home/mserras/Downloads/setfit-model-nomulti/backup-model-setfit-unprocessable/",
+ "_name_or_path": "mserras/setfit-alpaca-es-unprocessable-sample-detection/",
  "architectures": [
    "MPNetModel"
  ],
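The removed card notes that scores are dumped into the dataset's "sf-unprocessable-score" metadata field. A hedged sketch of how such a write-back could look: the Argilla 1.x client calls (`rg.load`, `rg.log`), the dataset name, and the assumption that `rg.init(...)` has already been run against the Argilla space are all guesses about that annotation setup, not code from this repo.

```python
import argilla as rg
from setfit import SetFitModel

# Assumes rg.init(api_url=..., api_key=...) has already been called for the Argilla space.
model = SetFitModel.from_pretrained("hackathon-somos-nlp-2023/setfit-alpaca-es-unprocessable-sample-detection")

def record_to_text(record: rg.TextClassificationRecord) -> str:
    """Rebuild the INSTRUCTION/INPUT/OUTPUT text layout the model was trained on."""
    return (
        f"INSTRUCTION:\n{record.inputs['1-instruction']}\n"
        f"INPUT:\n{record.inputs['2-input']}\n"
        f"OUTPUT:\n{record.inputs['3-output']}\n"
    )

records = rg.load("somos-clean-alpaca-es")  # assumed dataset name inside the Argilla space
for record in records:
    score = model.predict_proba([record_to_text(record)])[0].tolist()[1]
    # Store the probability of the "unprocessable" class in the record metadata.
    record.metadata = {**(record.metadata or {}), "sf-unprocessable-score": score}

rg.log(records, name="somos-clean-alpaca-es")  # push the scored records back
```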