somosnlp-hackathon-2023/setfit-alpaca-es-unprocessable-sample-detection

@@ -1,33 +1,18 @@
 ---
 tags:
 - setfit
 - sentence-transformers
 - text-classification
 pipeline_tag: text-classification
-datasets:
-- mserras/alpaca-es-hackaton
-- somosnlp/somos-clean-alpaca-es
-language:
-- es
 ---
-# mserras/setfit-alpaca-es-unprocessable-sample-detection
-This is a [SetFit model](https://github.com/huggingface/setfit) that can be used for filtering the Alpaca ES instruction dataset.
-The base model is [Paraphrase mpnet base v2](https://huggingface.co/sentence-transformers/paraphrase-mpnet-base-v2) from Sentence Transformers.
-This model was developed during the 2023 Hackathon organized by [SomosNLP](https://somosnlp.org/) ([HF Card](https://huggingface.co/somosnlp)), using GPUs provided by [Q Blocks](https://www.qblocks.cloud).
-This model was trained on "unprocessable" samples of the translated [Clean Alpaca Es](https://huggingface.co/datasets/somosnlp/somos-clean-alpaca-es) dataset from
-the HF [Argilla](https://argilla.io) space https://huggingface.co/spaces/mserras/somos-alpaca-es.
-To this end, a custom tag is proposed: "unprocessable". It marks instruction/input/output triplets that require processing images, fetching information from the
-open web, or similar tasks that the LLM has no capability to perform, and which therefore end in hallucinations or strange outcomes.
-As this model was trained on samples of Alpaca, which were generated using ChatGPT3.5, this model **cannot be used for commercial purposes or to compete against OpenAI**.
-The scores are stored in the dataset's metadata field "sf-unprocessable-score".
 ## Usage
@@ -41,32 +26,24 @@ You can then run inference as follows:
 ```python
 from setfit import SetFitModel
-import argilla as rg
 # Download from Hub and run inference
-model = SetFitModel.from_pretrained("mserras/setfit-alpaca-es-unprocessable-sample-detection")
-def instruct_fields_to_text(field_instruction: str, field_input: str, field_output: str):
-    """Given the instruction, input and output fields, return a text to be used by setfit"""
-    return f"INSTRUCTION:\n{field_instruction}\nINPUT:\n{field_input}\nOUTPUT:\n{field_output}\n"
-def sample_to_text(sample: rg.TextClassificationRecord) -> str:
-    """Converts an Argilla TextClassificationRecord to a text to be used by setfit"""
-    return instruct_fields_to_text(sample.inputs["1-instruction"], sample.inputs["2-input"], sample.inputs["3-output"])
-# For a given Argilla record:
-unprocessable_score = model.predict_proba([sample_to_text(argilla_record)])[0].tolist()[1]
 ```
-## Evaluation
-*Disclaimer*: There was no formal evaluation done, just a bunch of guys looking at the data & the outcomes.
-## Changelog
-- [09/04/2023] SQL code generation, date conversion, percentage discounts and renewable energies are no longer detected as unprocessable.
-- [06/04/2023] It no longer detects password generation as unprocessable.

 ---
+license: apache-2.0
 tags:
 - setfit
 - sentence-transformers
 - text-classification
 pipeline_tag: text-classification
 ---
+# hackathon-somos-nlp-2023/setfit-alpaca-es-unprocessable-sample-detection
+This is a [SetFit model](https://github.com/huggingface/setfit) that can be used for text classification. The model has been trained using an efficient few-shot learning technique that involves:
+1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
+2. Training a classification head with features from the fine-tuned Sentence Transformer.
 ## Usage
 ```python
 from setfit import SetFitModel
 # Download from Hub and run inference
+model = SetFitModel.from_pretrained("hackathon-somos-nlp-2023/setfit-alpaca-es-unprocessable-sample-detection")
+# Run inference
+preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
 ```
+## BibTeX entry and citation info
+```bibtex
+@article{https://doi.org/10.48550/arxiv.2209.11055,
+doi = {10.48550/ARXIV.2209.11055},
+url = {https://arxiv.org/abs/2209.11055},
+author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
+keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
+title = {Efficient Few-Shot Learning Without Prompts},
+publisher = {arXiv},
+year = {2022},
+copyright = {Creative Commons Attribution 4.0 International}
+}
+```
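
Note on the removed usage snippet: the old README flattened each instruction/input/output triplet into one string before scoring it, while the new template example passes raw sentences. The sketch below reproduces only that text-building step in plain Python so it can be checked without Argilla or the model. `instruct_fields_to_text` and the input keys come from the removed lines; `record_inputs_to_text` and the sample values are hypothetical and only illustrate the shape of the input.

```python
from typing import Mapping


def instruct_fields_to_text(field_instruction: str, field_input: str, field_output: str) -> str:
    """Join the three Alpaca fields into one text, as in the removed README snippet."""
    return f"INSTRUCTION:\n{field_instruction}\nINPUT:\n{field_input}\nOUTPUT:\n{field_output}\n"


def record_inputs_to_text(inputs: Mapping[str, str]) -> str:
    """Hypothetical helper: takes a plain mapping instead of an Argilla record."""
    return instruct_fields_to_text(
        inputs["1-instruction"], inputs["2-input"], inputs["3-output"]
    )


# Made-up sample, for illustration only (not taken from the dataset).
example_inputs = {
    "1-instruction": "Describe la imagen adjunta.",
    "2-input": "",
    "3-output": "La imagen muestra un gato.",
}

text = record_inputs_to_text(example_inputs)
print(text)
```

The resulting string is what the removed snippet passed to `model.predict_proba([...])`, reading the score from index 1 of the first row.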

config.json

@@ -1,5 +1,5 @@
 {
-  "_name_or_path": "/home/mserras/Downloads/setfit-model-nomulti/backup-model-setfit-unprocessable/",
   "architectures": [
     "MPNetModel"
   ],

 {
+  "_name_or_path": "mserras/setfit-alpaca-es-unprocessable-sample-detection/",
   "architectures": [
     "MPNetModel"
   ],