Update README.md
README.md
CHANGED
@@ -175,7 +175,7 @@ Feel free to click the expand button below to see the full list of sources.
 | MC4-Legal | bg, cs, da, de, el, en, es, et, fi, fr, ga, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv | [Link](https://huggingface.co/datasets/joelito/legal-mc4) |
 | CURLICAT Corpus | bg, hr, hu, pl, ro, sk, sl | (Váradi et al., 2022) |
 | CATalog | ca | (Palomar-Giner et al., 2024) |
-| Spanish Crawling | ca, es, eu, gl |
+| Spanish Crawling | ca, es, eu, gl | Relevant Spanish websites crawling |
 | Starcoder | code | (Li et al., 2023) |
 | SYN v9: large corpus of written Czech | cs | (Křen et al., 2021) |
 | Welsh-GOV | cy | Crawling from [Link](https://www.llyw.cymru) |
@@ -197,13 +197,13 @@ Feel free to click the expand button below to see the full list of sources.
 | The Pile (PhilPapers subset) | en | (Gao et al., 2021) |
 | Spanish Legal Domain Corpora | es | (Gutiérrez-Fandiño et al., 2021) |
 | HPLTDatasets v1 - Spanish | es | (de Gibert et al., 2024) |
-| Legal | es | BOE, BORME, Senado, Congreso,
-| Biomedical | es |
-| Scientific | es |
+| Legal | es | Internally generated legal dataset: BOE, BORME, Senado, Congreso, Spanish court orders, DOGC |
+| Biomedical | es | Internally generated scientific dataset: Dialnet, Scielo, CSIC, TDX, BSC, UCM |
+| Scientific | es | Internally generated scientific dataset: Wikipedia LS, Pubmed, MeSpEn, patents, clinical cases, medical crawler |
 | Estonian National Corpus 2021 | et | (Koppel & Kallas, 2022) |
 | Estonian Reference Corpus | et | [Link](https://www.cl.ut.ee/korpused/segakorpus/) |
 | EusCrawl (filtered: no Wikipedia, no NC-licenses) | eu | (Artetxe et al., 2022) |
-|
+| Latxa Corpus v1.1 | eu | (Etxaniz et al., 2024) [Link](https://huggingface.co/datasets/HiTZ/latxa-corpus-v1.1)|
 | Yle Finnish News Archive | fi | [Link](http://urn.fi/urn:nbn:fi:lb-2021050401) |
 | CaBeRnet: a New French Balanced Reference Corpus | fr | (Popa-Fabre et al., 2020) |
 | French Public Domain Newspapers | fr | [Link](https://huggingface.co/datasets/PleIAs/French-PD-Newspapers) |
@@ -217,7 +217,7 @@ Feel free to click the expand button below to see the full list of sources.
 | Korpus Malti | mt | (Micallef et al., 2022) |
 | SoNaR Corpus NC 1.2 | nl | [Link](https://taalmaterialen.ivdnt.org/download/tstc-sonar-corpus/) |
 | Norwegian Colossal Corpus | nn, no | (Kummervold et al., 2021) |
-| Occitan Corpus | oc |
+| Occitan Corpus | oc | Provided by [IEA](https://www.institutestudisaranesi.cat/) |
 | Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego | pl | (Ogrodniczuk, 2018) |
 | NKJP-PodkorpusMilionowy-1.2 (National Corpus of Polish) | pl | (Lewandowska-Tomaszczyk et al., 2013) |
 | Brazilian Portuguese Web as Corpus | pt | (Wagner Filho et al., 2018) |
@@ -243,6 +243,7 @@ Feel free to click the expand button below to see the full list of sources.
 - Dodge, J., Sap, M., Marasović, A., Agnew, W., Ilharco, G., Groeneveld, D., Mitchell, M., & Gardner, M. (2021). Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus. In M.-F. Moens, X. Huang, L. Specia, & S. W. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 1286–1305). Association for Computational Linguistics. [Link](https://doi.org/10.18653/v1/2021.emnlp-main.98)
 - Erjavec, T., Ljubešić, N., & Logar, N. (2015). The slWaC corpus of the Slovene web. Informatica (Slovenia), 39, 35–42.
 - Erjavec, T., Ogrodniczuk, M., Osenova, P., Ljubešić, N., Simov, K., Grigorova, V., Rudolf, M., Pančur, A., Kopp, M., Barkarson, S., Steingrímsson, S. hór, van der Pol, H., Depoorter, G., de Does, J., Jongejan, B., Haltrup Hansen, D., Navarretta, C., Calzada Pérez, M., de Macedo, L. D., … Rayson, P. (2021). Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 2.1. [Link](http://hdl.handle.net/11356/1431)
+- Etxaniz, J., Sainz, O., Perez, N., Aldabe, I., Rigau, G., Agirre, E., Ormazabal, A., Artetxe, M., & Soroa, A. (2024). Latxa: An Open Language Model and Evaluation Suite for Basque. [Link] (https://arxiv.org/abs/2403.20266)
 - Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., & Leahy, C. (2021). The Pile: An 800GB Dataset of Diverse Text for Language Modeling. CoRR, abs/2101.00027. [Link](https://arxiv.org/abs/2101.00027)
 - Gutiérrez-Fandiño, A., Armengol-Estapé, J., Gonzalez-Agirre, A., & Villegas, M. (2021). Spanish Legalese Language Model and Corpora.
 - Hansen, D. H. (2018). The Danish Parliament Corpus 2009—2017, v1. [Link](http://hdl.handle.net/20.500.12115/8)