---
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
---

This model has been pushed to the Hub using the [PyTorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:

- Library: [More Information Needed]
- Docs: [More Information Needed]

## Model

This model is based on [google/muril-base-cased](https://huggingface.co/google/muril-base-cased), fine-tuned on the Hindi WiC dataset of Dubossarsky and Dairkee (2024) using the Siamese WordTransformer architecture from [pierluigic/xl-lexeme](https://huggingface.co/pierluigic/xl-lexeme) by Cassotti et al. (2023).

## Usage (WordTransformer)

To recreate our setup, first install the WordTransformer architecture from [pierluigic/xl-lexeme](https://huggingface.co/pierluigic/xl-lexeme):

```shell
git clone git@github.com:pierluigic/xl-lexeme.git
cd xl-lexeme
pip3 install .
```

Then add `PyTorchModelHubMixin` to the `WordTransformer` class definition in xl-lexeme/WordTransformer/WordTransformer:

```python
from huggingface_hub import PyTorchModelHubMixin

class WordTransformer(nn.Sequential, PyTorchModelHubMixin):
```

To load the model:

```python
from WordTransformer import WordTransformer, InputExample

model = WordTransformer.from_pretrained("Roksana/hindi_wic_muril")
```

Alternatively, load it as a plain embedding model:

```python
from transformers import BertModel, BertTokenizer

model_name = "Roksana/hindi_wic_muril"

# Load pre-trained model and tokenizer
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)
```

## Citations and Acknowledgements

```
@inproceedings{dubossarsky-dairkee-2024-strengthening-wic,
    title = "Strengthening the {W}i{C}: New Polysemy Dataset in {H}indi and Lack of Cross Lingual Transfer",
    author = "Dubossarsky, Haim and
      Dairkee, Farheen",
    editor = "Calzolari, Nicoletta and
      Kan, Min-Yen and
      Hoste, Veronique and
      Lenci, Alessandro and
      Sakti, Sakriani and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation 
(LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.1332",
    pages = "15341--15349",
    abstract = "This study addresses the critical issue of Natural Language Processing in low-resource languages such as Hindi, which, despite having substantial number of speakers, is limited in linguistic resources. The paper focuses on Word Sense Disambiguation, a fundamental NLP task that deals with polysemous words. It introduces a novel Hindi WSD dataset in the modern WiC format, enabling the training and testing of contextualized models. The primary contributions of this work lie in testing the efficacy of multilingual models to transfer across languages and hence to handle polysemy in low-resource languages, and in providing insights into the minimum training data required for a viable solution. Experiments compare different contextualized models on the WiC task via transfer learning from English to Hindi. Models purely transferred from English yield poor 55{\%} accuracy, while fine-tuning on Hindi dramatically improves performance to 90{\%} accuracy. This demonstrates the need for language-specific tuning and resources like the introduced Hindi WiC dataset to drive advances in Hindi NLP. The findings offer valuable insights into addressing the NLP needs of widely spoken yet low-resourced languages, shedding light on the problem of transfer learning in these contexts.",
}
```
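As a usage note: either loading route above yields one vector per target-word occurrence, and the WiC decision reduces to comparing the two occurrence vectors. Below is a minimal, hypothetical sketch of that comparison step using NumPy stand-in vectors in place of real model outputs (in practice the vectors would come from `model.encode(InputExample(...))`; the `0.5` threshold is illustrative, not a tuned value):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def same_sense(emb_a: np.ndarray, emb_b: np.ndarray, threshold: float = 0.5) -> bool:
    """WiC decision: True if the two occurrence embeddings are close enough."""
    return cosine_similarity(emb_a, emb_b) >= threshold

# Stand-in vectors; real ones would be the target-word embeddings
# produced for the two contexts being compared.
emb_a = np.array([0.9, 0.1, 0.0])
emb_b = np.array([0.8, 0.2, 0.1])
print(same_sense(emb_a, emb_b))  # similar directions -> True
```

The threshold would normally be calibrated on the WiC development set rather than fixed a priori.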