mapama247 committed on
Commit
e312cd4
1 Parent(s): c46c980

Update README.md

Files changed (1)
  1. README.md +7 -6
README.md CHANGED
@@ -175,7 +175,7 @@ Feel free to click the expand button below to see the full list of sources.
175
  | MC4-Legal | bg, cs, da, de, el, en, es, et, fi, fr, ga, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv | [Link](https://huggingface.co/datasets/joelito/legal-mc4) |
176
  | CURLICAT Corpus | bg, hr, hu, pl, ro, sk, sl | (Váradi et al., 2022) |
177
  | CATalog | ca | (Palomar-Giner et al., 2024) |
178
- | Spanish Crawling | ca, es, eu, gl | - |
179
  | Starcoder | code | (Li et al., 2023) |
180
  | SYN v9: large corpus of written Czech | cs | (Křen et al., 2021) |
181
  | Welsh-GOV | cy | Crawling from [Link](https://www.llyw.cymru) |
@@ -197,13 +197,13 @@ Feel free to click the expand button below to see the full list of sources.
197
  | The Pile (PhilPapers subset) | en | (Gao et al., 2021) |
198
  | Spanish Legal Domain Corpora | es | (Gutiérrez-Fandiño et al., 2021) |
199
  | HPLTDatasets v1 - Spanish | es | (de Gibert et al., 2024) |
200
- | Legal | es | BOE, BORME, Senado, Congreso, sentencias (ULPGC) |
201
- | Biomedical | es | - |
202
- | Scientific | es | - |
203
  | Estonian National Corpus 2021 | et | (Koppel & Kallas, 2022) |
204
  | Estonian Reference Corpus | et | [Link](https://www.cl.ut.ee/korpused/segakorpus/) |
205
  | EusCrawl (filtered: no Wikipedia, no NC-licenses) | eu | (Artetxe et al., 2022) |
206
- | GAITU | eu | Compilation of CulturaX, Booktegi, some dumps of Colossal Oscar, Egunkaria, Euscrawl, HPLT and Wikipedia in Basque. |
207
  | Yle Finnish News Archive | fi | [Link](http://urn.fi/urn:nbn:fi:lb-2021050401) |
208
  | CaBeRnet: a New French Balanced Reference Corpus | fr | (Popa-Fabre et al., 2020) |
209
  | French Public Domain Newspapers | fr | [Link](https://huggingface.co/datasets/PleIAs/French-PD-Newspapers) |
@@ -217,7 +217,7 @@ Feel free to click the expand button below to see the full list of sources.
217
  | Korpus Malti | mt | (Micallef et al., 2022) |
218
  | SoNaR Corpus NC 1.2 | nl | [Link](https://taalmaterialen.ivdnt.org/download/tstc-sonar-corpus/) |
219
  | Norwegian Colossal Corpus | nn, no | (Kummervold et al., 2021) |
220
- | Occitan Corpus | oc | - |
221
  | Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego | pl | (Ogrodniczuk, 2018) |
222
  | NKJP-PodkorpusMilionowy-1.2 (National Corpus of Polish) | pl | (Lewandowska-Tomaszczyk et al., 2013) |
223
  | Brazilian Portuguese Web as Corpus | pt | (Wagner Filho et al., 2018) |
@@ -243,6 +243,7 @@ Feel free to click the expand button below to see the full list of sources.
243
  - Dodge, J., Sap, M., Marasović, A., Agnew, W., Ilharco, G., Groeneveld, D., Mitchell, M., & Gardner, M. (2021). Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus. In M.-F. Moens, X. Huang, L. Specia, & S. W. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 1286–1305). Association for Computational Linguistics. [Link](https://doi.org/10.18653/v1/2021.emnlp-main.98)
244
  - Erjavec, T., Ljubešić, N., & Logar, N. (2015). The slWaC corpus of the Slovene web. Informatica (Slovenia), 39, 35–42.
245
  - Erjavec, T., Ogrodniczuk, M., Osenova, P., Ljubešić, N., Simov, K., Grigorova, V., Rudolf, M., Pančur, A., Kopp, M., Barkarson, S., Steingrímsson, S., van der Pol, H., Depoorter, G., de Does, J., Jongejan, B., Haltrup Hansen, D., Navarretta, C., Calzada Pérez, M., de Macedo, L. D., … Rayson, P. (2021). Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 2.1. [Link](http://hdl.handle.net/11356/1431)
 
246
  - Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., & Leahy, C. (2021). The Pile: An 800GB Dataset of Diverse Text for Language Modeling. CoRR, abs/2101.00027. [Link](https://arxiv.org/abs/2101.00027)
247
  - Gutiérrez-Fandiño, A., Armengol-Estapé, J., Gonzalez-Agirre, A., & Villegas, M. (2021). Spanish Legalese Language Model and Corpora.
248
  - Hansen, D. H. (2018). The Danish Parliament Corpus 2009—2017, v1. [Link](http://hdl.handle.net/20.500.12115/8)
 
175
  | MC4-Legal | bg, cs, da, de, el, en, es, et, fi, fr, ga, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv | [Link](https://huggingface.co/datasets/joelito/legal-mc4) |
176
  | CURLICAT Corpus | bg, hr, hu, pl, ro, sk, sl | (Váradi et al., 2022) |
177
  | CATalog | ca | (Palomar-Giner et al., 2024) |
178
+ | Spanish Crawling | ca, es, eu, gl | Relevant Spanish websites crawling |
179
  | Starcoder | code | (Li et al., 2023) |
180
  | SYN v9: large corpus of written Czech | cs | (Křen et al., 2021) |
181
  | Welsh-GOV | cy | Crawling from [Link](https://www.llyw.cymru) |
 
197
  | The Pile (PhilPapers subset) | en | (Gao et al., 2021) |
198
  | Spanish Legal Domain Corpora | es | (Gutiérrez-Fandiño et al., 2021) |
199
  | HPLTDatasets v1 - Spanish | es | (de Gibert et al., 2024) |
200
+ | Legal | es | Internally generated legal dataset: BOE, BORME, Senado, Congreso, Spanish court orders, DOGC |
201
+ | Biomedical | es | Internally generated scientific dataset: Dialnet, Scielo, CSIC, TDX, BSC, UCM |
202
+ | Scientific | es | Internally generated scientific dataset: Wikipedia LS, Pubmed, MeSpEn, patents, clinical cases, medical crawler |
203
  | Estonian National Corpus 2021 | et | (Koppel & Kallas, 2022) |
204
  | Estonian Reference Corpus | et | [Link](https://www.cl.ut.ee/korpused/segakorpus/) |
205
  | EusCrawl (filtered: no Wikipedia, no NC-licenses) | eu | (Artetxe et al., 2022) |
206
+ | Latxa Corpus v1.1 | eu | (Etxaniz et al., 2024) [Link](https://huggingface.co/datasets/HiTZ/latxa-corpus-v1.1) |
207
  | Yle Finnish News Archive | fi | [Link](http://urn.fi/urn:nbn:fi:lb-2021050401) |
208
  | CaBeRnet: a New French Balanced Reference Corpus | fr | (Popa-Fabre et al., 2020) |
209
  | French Public Domain Newspapers | fr | [Link](https://huggingface.co/datasets/PleIAs/French-PD-Newspapers) |
 
217
  | Korpus Malti | mt | (Micallef et al., 2022) |
218
  | SoNaR Corpus NC 1.2 | nl | [Link](https://taalmaterialen.ivdnt.org/download/tstc-sonar-corpus/) |
219
  | Norwegian Colossal Corpus | nn, no | (Kummervold et al., 2021) |
220
+ | Occitan Corpus | oc | Provided by [IEA](https://www.institutestudisaranesi.cat/) |
221
  | Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego | pl | (Ogrodniczuk, 2018) |
222
  | NKJP-PodkorpusMilionowy-1.2 (National Corpus of Polish) | pl | (Lewandowska-Tomaszczyk et al., 2013) |
223
  | Brazilian Portuguese Web as Corpus | pt | (Wagner Filho et al., 2018) |
 
243
  - Dodge, J., Sap, M., Marasović, A., Agnew, W., Ilharco, G., Groeneveld, D., Mitchell, M., & Gardner, M. (2021). Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus. In M.-F. Moens, X. Huang, L. Specia, & S. W. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 1286–1305). Association for Computational Linguistics. [Link](https://doi.org/10.18653/v1/2021.emnlp-main.98)
244
  - Erjavec, T., Ljubešić, N., & Logar, N. (2015). The slWaC corpus of the Slovene web. Informatica (Slovenia), 39, 35–42.
245
  - Erjavec, T., Ogrodniczuk, M., Osenova, P., Ljubešić, N., Simov, K., Grigorova, V., Rudolf, M., Pančur, A., Kopp, M., Barkarson, S., Steingrímsson, S., van der Pol, H., Depoorter, G., de Does, J., Jongejan, B., Haltrup Hansen, D., Navarretta, C., Calzada Pérez, M., de Macedo, L. D., … Rayson, P. (2021). Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 2.1. [Link](http://hdl.handle.net/11356/1431)
246
+ - Etxaniz, J., Sainz, O., Perez, N., Aldabe, I., Rigau, G., Agirre, E., Ormazabal, A., Artetxe, M., & Soroa, A. (2024). Latxa: An Open Language Model and Evaluation Suite for Basque. [Link](https://arxiv.org/abs/2403.20266)
247
  - Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., & Leahy, C. (2021). The Pile: An 800GB Dataset of Diverse Text for Language Modeling. CoRR, abs/2101.00027. [Link](https://arxiv.org/abs/2101.00027)
248
  - Gutiérrez-Fandiño, A., Armengol-Estapé, J., Gonzalez-Agirre, A., & Villegas, M. (2021). Spanish Legalese Language Model and Corpora.
249
  - Hansen, D. H. (2018). The Danish Parliament Corpus 2009—2017, v1. [Link](http://hdl.handle.net/20.500.12115/8)