---
library_name: transformers
tags:
- embeddings
- darija
- arabic
- DarijaBERT
- camelbert
- fine-tuning
datasets:
- HANTIFARAH/combined_darija_dataset_cleaned
language:
- ar
metrics:
- accuracy
base_model:
- SI2M-Lab/DarijaBERT
pipeline_tag: fill-mask
---

# Model Card for Fine-Tuned SI2M_DarijaBERT and CamelBERT

This model card describes the fine-tuning of **SI2M_DarijaBERT** on a subset of a large Moroccan Darija dataset scraped from YouTube transcriptions and other websites, available at https://huggingface.co/datasets/HANTIFARAH/combined_darija_dataset_cleaned. The model was fine-tuned to generate embeddings for Moroccan Darija text, with the aim of improving performance on downstream NLP tasks. Its embeddings were evaluated on text classification tasks.

## Model Details

### Model Description

The **SI2M_DarijaBERT** model has been fine-tuned on Moroccan Darija text. It is based on the BERT architecture and specializes in generating embeddings for text classification tasks in Moroccan Darija.

- **Developed by:** BAGUENNA Mohammed-Amine
- **Model type:** Transformer-based (BERT architecture)
- **Language(s) (NLP):** Moroccan Darija (Arabic dialect)
- **Finetuned from model:** SI2M_DarijaBERT

### Recommendations

Users should make sure their data falls within the domain of Moroccan Darija text. For domain-specific applications (e.g., medical language), further fine-tuning on more specialized data is recommended.

## How to Get Started with the Model

You can load the model and its tokenizer with the following code:

```python
from transformers import AutoModel, AutoTokenizer

# Load the fine-tuned encoder and its matching tokenizer from the Hugging Face Hub
model = AutoModel.from_pretrained("bagamine/SI2M_DarijaBERTV1")
tokenizer = AutoTokenizer.from_pretrained("bagamine/SI2M_DarijaBERTV1")
```
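
Since the model is intended for embedding generation, the sketch below shows one possible way to turn its token-level outputs into a single vector per sentence. This card does not state which pooling strategy was used during evaluation, so attention-mask-weighted mean pooling is an assumption here, and the helper names `mean_pool` and `embed`, the `max_length=128` setting, and the example sentences are illustrative only.

```python
import numpy as np


def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sentence, ignoring padding positions."""
    mask = attention_mask[..., None].astype(hidden_states.dtype)  # (batch, seq, 1)
    summed = (hidden_states * mask).sum(axis=1)                   # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                # avoid division by zero
    return summed / counts


def embed(texts, model_id="bagamine/SI2M_DarijaBERTV1"):
    """Return one embedding per input text (downloads the model on first use)."""
    # Heavy dependencies are imported lazily so mean_pool stays usable on its own.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()  # disable dropout for deterministic embeddings

    # max_length=128 is an assumed value, not taken from the model card.
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=128, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**batch)
    return mean_pool(outputs.last_hidden_state.numpy(),
                     batch["attention_mask"].numpy())


if __name__ == "__main__":
    vectors = embed(["السلام عليكم", "كيداير؟"])
    print(vectors.shape)  # one row per input text
```

The resulting array can be passed to any standard classifier (for example logistic regression) for text classification. Other pooling choices, such as taking the `[CLS]` token vector, may work as well and are worth comparing on your own data.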