robbiemu/salamandra-2b

@@ -175,7 +175,7 @@ Feel free to click the expand button below to see the full list of sources.
 | MC4-Legal                                                                  | bg, cs, da, de, el, en, es, et, fi, fr, ga, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv | [Link](https://huggingface.co/datasets/joelito/legal-mc4)                                                      |
 | CURLICAT Corpus                                                            | bg, hr, hu, pl, ro, sk, sl                                                                 | (Váradi et al., 2022)                                                                                           |
 | CATalog                                                                    | ca                                                                                         | (Palomar-Giner et al., 2024)                                                                                     |
-| Spanish Crawling                                                           | ca, es, eu, gl                                                                             | -                                                                                                              |
 | Starcoder                                                                  | code                                                                                       | (Li et al., 2023)                                                                                               |
 | SYN v9: large corpus of written Czech                                      | cs                                                                                         | (Křen et al., 2021)                                                                                             |
 | Welsh-GOV                                                                  | cy                                                                                         | Crawling from [Link](https://www.llyw.cymru)                                                                     |
@@ -197,13 +197,13 @@ Feel free to click the expand button below to see the full list of sources.
 | The Pile (PhilPapers subset)                                               | en                                                                                         | (Gao et al., 2021)                                                                                              |
 | Spanish Legal Domain Corpora                                               | es                                                                                         | (Gutiérrez-Fandiño et al., 2021)                                                                                 |
 | HPLTDatasets v1 - Spanish                                                   | es                                                                                         | (de Gibert et al., 2024)                                                                                         |
-| Legal                                                                       | es                                                                                         | BOE, BORME, Senado, Congreso, sentencias (ULPGC)                                                                |
-| Biomedical                                                                  | es                                                                                         | -                                                                                                             |
-| Scientific                                                                  | es                                                                                         | -                                                                                                             |
 | Estonian National Corpus 2021                                               | et                                                                                         | (Koppel & Kallas, 2022)                                                                                          |
 | Estonian Reference Corpus                                                  | et                                                                                         | [Link](https://www.cl.ut.ee/korpused/segakorpus/)                                                               |
 | EusCrawl (filtered: no Wikipedia, no NC-licenses)                          | eu                                                                                         | (Artetxe et al., 2022)                                                                                           |
-| GAITU                                                                      | eu                                                                                         | Compilation of CulturaX, Booktegi, some dumps of Colossal Oscar, Egunkaria, Euscrawl, HPLT and Wikipedia in Basque. |
 | Yle Finnish News Archive                                                    | fi                                                                                         | [Link](http://urn.fi/urn:nbn:fi:lb-2021050401)                                                                 |
 | CaBeRnet: a New French Balanced Reference Corpus                          | fr                                                                                         | (Popa-Fabre et al., 2020)                                                                                       |
 | French Public Domain Newspapers                                            | fr                                                                                         | [Link](https://huggingface.co/datasets/PleIAs/French-PD-Newspapers)                                            |
@@ -217,7 +217,7 @@ Feel free to click the expand button below to see the full list of sources.
 | Korpus Malti                                                                | mt                                                                                         | (Micallef et al., 2022)                                                                                         |
 | SoNaR Corpus NC 1.2                                                         | nl                                                                                         | [Link](https://taalmaterialen.ivdnt.org/download/tstc-sonar-corpus/)                                           |
 | Norwegian Colossal Corpus                                                   | nn, no                                                                                     | (Kummervold et al., 2021)                                                                                       |
-| Occitan Corpus                                                              | oc                                                                                         | -                                                                                                             |
 | Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego              | pl                                                                                         | (Ogrodniczuk, 2018)                                                                                             |
 | NKJP-PodkorpusMilionowy-1.2 (National Corpus of Polish)                    | pl                                                                                         | (Lewandowska-Tomaszczyk et al., 2013)                                                                           |
 | Brazilian Portuguese Web as Corpus                                          | pt                                                                                         | (Wagner Filho et al., 2018)                                                                                     |
@@ -243,6 +243,7 @@ Feel free to click the expand button below to see the full list of sources.
 - Dodge, J., Sap, M., Marasović, A., Agnew, W., Ilharco, G., Groeneveld, D., Mitchell, M., & Gardner, M. (2021). Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus. In M.-F. Moens, X. Huang, L. Specia, & S. W. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 1286–1305). Association for Computational Linguistics. [Link](https://doi.org/10.18653/v1/2021.emnlp-main.98)
 - Erjavec, T., Ljubešić, N., & Logar, N. (2015). The slWaC corpus of the Slovene web. Informatica (Slovenia), 39, 35–42.
 - Erjavec, T., Ogrodniczuk, M., Osenova, P., Ljubešić, N., Simov, K., Grigorova, V., Rudolf, M., Pančur, A., Kopp, M., Barkarson, S., Steingrímsson, S. hór, van der Pol, H., Depoorter, G., de Does, J., Jongejan, B., Haltrup Hansen, D., Navarretta, C., Calzada Pérez, M., de Macedo, L. D., … Rayson, P. (2021). Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 2.1. [Link](http://hdl.handle.net/11356/1431)
 - Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., & Leahy, C. (2021). The Pile: An 800GB Dataset of Diverse Text for Language Modeling. CoRR, abs/2101.00027. [Link](https://arxiv.org/abs/2101.00027)
 - Gutiérrez-Fandiño, A., Armengol-Estapé, J., Gonzalez-Agirre, A., & Villegas, M. (2021). Spanish Legalese Language Model and Corpora.
 - Hansen, D. H. (2018). The Danish Parliament Corpus 2009—2017, v1. [Link](http://hdl.handle.net/20.500.12115/8)

 | MC4-Legal                                                                  | bg, cs, da, de, el, en, es, et, fi, fr, ga, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv | [Link](https://huggingface.co/datasets/joelito/legal-mc4)                                                      |
 | CURLICAT Corpus                                                            | bg, hr, hu, pl, ro, sk, sl                                                                 | (Váradi et al., 2022)                                                                                           |
 | CATalog                                                                    | ca                                                                                         | (Palomar-Giner et al., 2024)                                                                                     |
+| Spanish Crawling                                                           | ca, es, eu, gl                                                                             | Crawling of relevant Spanish websites                                                                           |
 | Starcoder                                                                  | code                                                                                       | (Li et al., 2023)                                                                                               |
 | SYN v9: large corpus of written Czech                                      | cs                                                                                         | (Křen et al., 2021)                                                                                             |
 | Welsh-GOV                                                                  | cy                                                                                         | Crawling from [Link](https://www.llyw.cymru)                                                                     |
 | The Pile (PhilPapers subset)                                               | en                                                                                         | (Gao et al., 2021)                                                                                              |
 | Spanish Legal Domain Corpora                                               | es                                                                                         | (Gutiérrez-Fandiño et al., 2021)                                                                                 |
 | HPLTDatasets v1 - Spanish                                                   | es                                                                                         | (de Gibert et al., 2024)                                                                                         |
+| Legal                                                                       | es                                                                                         | Internally generated legal dataset: BOE, BORME, Senado, Congreso, Spanish court orders, DOGC                     |
+| Biomedical                                                                  | es                                                                                         | Internally generated scientific dataset: Dialnet, Scielo, CSIC, TDX, BSC, UCM                                    |
+| Scientific                                                                  | es                                                                                         | Internally generated scientific dataset: Wikipedia LS, Pubmed, MeSpEn, patents, clinical cases, medical crawler  |
 | Estonian National Corpus 2021                                               | et                                                                                         | (Koppel & Kallas, 2022)                                                                                          |
 | Estonian Reference Corpus                                                  | et                                                                                         | [Link](https://www.cl.ut.ee/korpused/segakorpus/)                                                               |
 | EusCrawl (filtered: no Wikipedia, no NC-licenses)                          | eu                                                                                         | (Artetxe et al., 2022)                                                                                           |
+| Latxa Corpus v1.1                                                          | eu                                                                                         | (Etxaniz et al., 2024) [Link](https://huggingface.co/datasets/HiTZ/latxa-corpus-v1.1) |
 | Yle Finnish News Archive                                                    | fi                                                                                         | [Link](http://urn.fi/urn:nbn:fi:lb-2021050401)                                                                 |
 | CaBeRnet: a New French Balanced Reference Corpus                          | fr                                                                                         | (Popa-Fabre et al., 2020)                                                                                       |
 | French Public Domain Newspapers                                            | fr                                                                                         | [Link](https://huggingface.co/datasets/PleIAs/French-PD-Newspapers)                                            |
 | Korpus Malti                                                                | mt                                                                                         | (Micallef et al., 2022)                                                                                         |
 | SoNaR Corpus NC 1.2                                                         | nl                                                                                         | [Link](https://taalmaterialen.ivdnt.org/download/tstc-sonar-corpus/)                                           |
 | Norwegian Colossal Corpus                                                   | nn, no                                                                                     | (Kummervold et al., 2021)                                                                                       |
+| Occitan Corpus                                                              | oc                                                                                         | Provided by [IEA](https://www.institutestudisaranesi.cat/)                                                      |
 | Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego              | pl                                                                                         | (Ogrodniczuk, 2018)                                                                                             |
 | NKJP-PodkorpusMilionowy-1.2 (National Corpus of Polish)                    | pl                                                                                         | (Lewandowska-Tomaszczyk et al., 2013)                                                                           |
 | Brazilian Portuguese Web as Corpus                                          | pt                                                                                         | (Wagner Filho et al., 2018)                                                                                     |
 - Dodge, J., Sap, M., Marasović, A., Agnew, W., Ilharco, G., Groeneveld, D., Mitchell, M., & Gardner, M. (2021). Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus. In M.-F. Moens, X. Huang, L. Specia, & S. W. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 1286–1305). Association for Computational Linguistics. [Link](https://doi.org/10.18653/v1/2021.emnlp-main.98)
 - Erjavec, T., Ljubešić, N., & Logar, N. (2015). The slWaC corpus of the Slovene web. Informatica (Slovenia), 39, 35–42.
 - Erjavec, T., Ogrodniczuk, M., Osenova, P., Ljubešić, N., Simov, K., Grigorova, V., Rudolf, M., Pančur, A., Kopp, M., Barkarson, S., Steingrímsson, S. hór, van der Pol, H., Depoorter, G., de Does, J., Jongejan, B., Haltrup Hansen, D., Navarretta, C., Calzada Pérez, M., de Macedo, L. D., … Rayson, P. (2021). Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 2.1. [Link](http://hdl.handle.net/11356/1431)
+- Etxaniz, J., Sainz, O., Perez, N., Aldabe, I., Rigau, G., Agirre, E., Ormazabal, A., Artetxe, M., & Soroa, A. (2024). Latxa: An Open Language Model and Evaluation Suite for Basque. [Link](https://arxiv.org/abs/2403.20266)
 - Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., & Leahy, C. (2021). The Pile: An 800GB Dataset of Diverse Text for Language Modeling. CoRR, abs/2101.00027. [Link](https://arxiv.org/abs/2101.00027)
 - Gutiérrez-Fandiño, A., Armengol-Estapé, J., Gonzalez-Agirre, A., & Villegas, M. (2021). Spanish Legalese Language Model and Corpora.
 - Hansen, D. H. (2018). The Danish Parliament Corpus 2009—2017, v1. [Link](http://hdl.handle.net/20.500.12115/8)