Text Classification
sentence-transformers
PyTorch
setfit
Spanish
mpnet
mserras committed on
Commit
dceb97e
1 Parent(s): 9b04f35

Update README.md

  1. README.md +40 -6
README.md CHANGED
@@ -5,14 +5,25 @@ tags:
5
  - sentence-transformers
6
  - text-classification
7
  pipeline_tag: text-classification
8
  ---
9
 
10
  # mserras/setfit-alpaca-es-unprocessable-sample-detection
11
 
12
- This is a [SetFit model](https://github.com/huggingface/setfit) that can be used for text classification. The model has been trained using an efficient few-shot learning technique that involves:
13
 
14
- 1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
15
- 2. Training a classification head with features from the fine-tuned Sentence Transformer.
16
 
17
  ## Usage
18
 
@@ -26,13 +37,36 @@ You can then run inference as follows:
26
 
27
  ```python
28
  from setfit import SetFitModel
 
 
29
 
30
  # Download from Hub and run inference
31
  model = SetFitModel.from_pretrained("mserras/setfit-alpaca-es-unprocessable-sample-detection")
32
- # Run inference
33
- preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
34
  ```
35
 
36
  ## BibTeX entry and citation info
37
 
38
  ```bibtex
@@ -46,4 +80,4 @@ publisher = {arXiv},
46
  year = {2022},
47
  copyright = {Creative Commons Attribution 4.0 International}
48
  }
49
- ```
 
5
  - sentence-transformers
6
  - text-classification
7
  pipeline_tag: text-classification
8
+ datasets:
9
+ - mserras/alpaca-es-hackaton
10
+ - somosnlp/somos-clean-alpaca-es
11
+ language:
12
+ - es
13
  ---
14
 
15
  # mserras/setfit-alpaca-es-unprocessable-sample-detection
16
 
17
+ This is a [SetFit model](https://github.com/huggingface/setfit) that can be used for filtering the Alpaca ES instruction dataset.
18
+
19
+ This model was developed during the 2023 Hackathon organized by [SomosNLP](https://somosnlp.org/) ([HF Card](https://huggingface.co/somosnlp)), using GPUs provided by [Q Blocks](https://www.qblocks.cloud).
20
+
21
+ This model was trained on samples tagged "unprocessable" in the translated [Clean Alpaca Es](https://huggingface.co/datasets/somosnlp/somos-clean-alpaca-es) dataset, taken from
22
+ the HF [Argilla](https://argilla.io) space https://huggingface.co/spaces/mserras/somos-alpaca-es.
23
+
24
+ To this end, a custom tag, "unprocessable", is proposed. It covers instruction/input/output triplets that require processing images, fetching information from the
25
+ open web, and similar tasks that the LLM has no capability to perform, and which therefore end in hallucinations or strange outcomes.
26
 
27
 
28
  ## Usage
29
 
 
37
 
38
  ```python
39
  from setfit import SetFitModel
40
+ import argilla as rg
41
+
42
 
43
  # Download from Hub and run inference
44
  model = SetFitModel.from_pretrained("mserras/setfit-alpaca-es-unprocessable-sample-detection")
45
+
46
+ def instruct_fields_to_text(field_instruction: str, field_input: str, field_output: str):
47
+     """Given the instruction, input and output fields, return a text to be used by setfit"""
48
+     return f"INSTRUCTION:\n{field_instruction}\nINPUT:\n{field_input}\nOUTPUT:\n{field_output}\n"
49
+
50
+ def sample_to_text(sample: rg.TextClassificationRecord) -> str:
51
+     """Converts an Argilla TextClassificationRecord to a text to be used by setfit"""
52
+     return instruct_fields_to_text(sample.inputs["1-instruction"], sample.inputs["2-input"], sample.inputs["3-output"])
53
+
54
+ # For a given Argilla record, take the probability at class index 1 as the "unprocessable" score:
55
+
56
+ unprocessable_score = model.predict_proba([sample_to_text(argilla_record)])[0].tolist()[1]
57
+
58
  ```
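
The usage snippet above yields one score per record but does not show how the score might be used for filtering. The following is a minimal sketch of that step, not part of the commit: the `split_by_score` helper, the 0.5 threshold, the two sample triplets and the placeholder scores are all assumptions made for illustration. In practice the scores would come from `model.predict_proba`, as in the README.

```python
def instruct_fields_to_text(field_instruction: str, field_input: str, field_output: str) -> str:
    """Same template as the README helper: one labelled section per field."""
    return f"INSTRUCTION:\n{field_instruction}\nINPUT:\n{field_input}\nOUTPUT:\n{field_output}\n"


def split_by_score(samples, scores, threshold=0.5):
    """Hypothetical helper: partition samples into (kept, flagged) by score."""
    kept, flagged = [], []
    for sample, score in zip(samples, scores):
        # Scores at or above the threshold are treated as "unprocessable".
        (flagged if score >= threshold else kept).append(sample)
    return kept, flagged


# Two made-up triplets using the field names from the README snippet.
samples = [
    {"1-instruction": "Describe la imagen adjunta.", "2-input": "", "3-output": "La imagen muestra un gato."},
    {"1-instruction": "Suma 2 y 3.", "2-input": "", "3-output": "5"},
]
texts = [instruct_fields_to_text(s["1-instruction"], s["2-input"], s["3-output"]) for s in samples]

# Placeholder scores for illustration only, not real model output.
scores = [0.9, 0.1]
kept, flagged = split_by_score(texts, scores, threshold=0.5)
```

The threshold is a free parameter here; since the model card reports no formal evaluation, any cut-off would need to be chosen by inspecting the scored records.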
59
 
60
+ ## Evaluation
61
+
62
+ *Disclaimer*: No formal evaluation was done; the model was only checked by people manually inspecting the data and the outcomes.
63
+
64
+ ## Changelog
65
+
66
+ - [09/04/2023] SQL code generation, date conversion, percentage discounts and renewable energies are no longer detected as unprocessable.
67
+ - [06/04/2023] It no longer detects password generation as unprocessable.
68
+
69
+
70
  ## BibTeX entry and citation info
71
 
72
  ```bibtex
 
80
  year = {2022},
81
  copyright = {Creative Commons Attribution 4.0 International}
82
  }
83
+ ```