somosnlp-hackathon-2023/setfit-alpaca-es-unprocessable-sample-detection

@@ -1,18 +1,31 @@
 ---
-license: apache-2.0
 tags:
 - setfit
 - sentence-transformers
 - text-classification
 pipeline_tag: text-classification
 ---
-# hackathon-somos-nlp-2023/setfit-alpaca-es-unprocessable-sample-detection
-This is a [SetFit model](https://github.com/huggingface/setfit) that can be used for text classification. The model has been trained using an efficient few-shot learning technique that involves:
-1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
-2. Training a classification head with features from the fine-tuned Sentence Transformer.
 ## Usage
@@ -26,24 +39,32 @@ You can then run inference as follows:
 ```python
 from setfit import SetFitModel
 # Download from Hub and run inference
-model = SetFitModel.from_pretrained("hackathon-somos-nlp-2023/setfit-alpaca-es-unprocessable-sample-detection")
-# Run inference
-preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
-```
-## BibTeX entry and citation info
-```bibtex
-@article{https://doi.org/10.48550/arxiv.2209.11055,
-doi = {10.48550/ARXIV.2209.11055},
-url = {https://arxiv.org/abs/2209.11055},
-author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
-keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
-title = {Efficient Few-Shot Learning Without Prompts},
-publisher = {arXiv},
-year = {2022},
-copyright = {Creative Commons Attribution 4.0 International}
-}
 ```

 ---
 tags:
 - setfit
 - sentence-transformers
 - text-classification
 pipeline_tag: text-classification
+datasets:
+- mserras/alpaca-es-hackaton
+- somosnlp/somos-clean-alpaca-es
+language:
+- es
 ---
+# mserras/setfit-alpaca-es-unprocessable-sample-detection
+This is a [SetFit model](https://github.com/huggingface/setfit) that can be used for filtering the Alpaca ES instruction dataset.
+The base model is the multilingual [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) model from Sentence Transformers.
+This model was developed during the 2023 Hackathon organized by [SomosNLP](https://somosnlp.org/) ([HF Card](https://huggingface.co/somosnlp)), with GPUs provided by [Q Blocks](https://www.qblocks.cloud).
+This model was trained on "unprocessable" samples of the translated [Clean Alpaca Es](https://huggingface.co/datasets/somosnlp/somos-clean-alpaca-es) dataset from
+the HF [Argilla](https://argilla.io) space https://huggingface.co/spaces/mserras/somos-alpaca-es.
+To this end, a custom tag is proposed: "unprocessable", which marks instruction/input/output triplets that require processing images, fetching information from the
+open web, or similar tasks that the LLM has no capability to perform and that therefore end in hallucinations or strange outcomes.
+Because this model was trained on samples of Alpaca, which were generated using OpenAI's models, it **cannot be used for commercial purposes or to compete against OpenAI**.
 ## Usage
 ```python
 from setfit import SetFitModel
+import argilla as rg
 # Download from Hub and run inference
+model = SetFitModel.from_pretrained("mserras/setfit-alpaca-es-unprocessable-sample-detection")
+def instruct_fields_to_text(field_instruction: str, field_input: str, field_output: str):
+    """Given the instruction, input and output fields, return a text to be used by setfit"""
+    return f"INSTRUCTION:\n{field_instruction}\nINPUT:\n{field_input}\nOUTPUT:\n{field_output}\n"
+def sample_to_text(sample: rg.TextClassificationRecord) -> str:
+    """Converts an Argilla TextClassificationRecord to a text to be used by setfit"""
+    return instruct_fields_to_text(sample.inputs["1-instruction"], sample.inputs["2-input"], sample.inputs["3-output"])
+# For a given Argilla record:
+unprocessable_score = model.predict_proba([sample_to_text(argilla_record)])[0].tolist()[1]
 ```
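As a minimal sketch of how the score from the snippet above might be used to filter a dataset, the example below keeps only the samples whose "unprocessable" score falls below a cut-off. The dictionary keys (`instruction`, `input`, `output`), the `filter_processable` helper, the 0.5 threshold, and the keyword-based `stub_score_fn` are all hypothetical and stand in for the real model so that the example runs without downloading anything.

```python
from typing import Callable, Dict, List, Sequence


def instruct_fields_to_text(field_instruction: str, field_input: str, field_output: str) -> str:
    """Same text template as in the usage snippet above."""
    return f"INSTRUCTION:\n{field_instruction}\nINPUT:\n{field_input}\nOUTPUT:\n{field_output}\n"


def filter_processable(
    samples: Sequence[Dict[str, str]],
    score_fn: Callable[[List[str]], List[float]],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the model card
) -> List[Dict[str, str]]:
    """Keep the samples whose "unprocessable" score is below the threshold."""
    texts = [
        instruct_fields_to_text(s["instruction"], s["input"], s["output"])
        for s in samples
    ]
    scores = score_fn(texts)
    return [s for s, score in zip(samples, scores) if score < threshold]


def stub_score_fn(texts: List[str]) -> List[float]:
    """Stand-in scorer for illustration only: flags texts that mention an image."""
    return [0.9 if "imagen" in t.lower() else 0.1 for t in texts]


# With the real model, the scorer could instead be something like the line
# below (an assumption: predict_proba returns one row per text, with the
# "unprocessable" probability at index 1, as the usage snippet suggests):
# score_fn = lambda texts: [row[1] for row in model.predict_proba(texts).tolist()]

samples = [
    {"instruction": "Describe la imagen adjunta.", "input": "", "output": "La imagen muestra un gato."},
    {"instruction": "Suma 2 y 3.", "input": "", "output": "5"},
]
kept = filter_processable(samples, stub_score_fn)
print([s["instruction"] for s in kept])
```

Passing the scorer as a function keeps the filtering logic independent of SetFit, so the threshold can be tuned on a few hand-checked samples before running the real model over the full dataset.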
+## Evaluation
+*Disclaimer*: No formal evaluation was done, just a few people looking at the data and the model's outcomes.
+## Changelog
+- [09/04/2023] SQL code generation, date conversion, percentage discounts and renewable energies are no longer detected as unprocessable.
+- [06/04/2023] It no longer detects password generation as unprocessable.