List splits and configurations
Datasets typically have splits and may also have configurations. A split is a subset of the dataset, like train and test, that is used during different stages of training and evaluating a model. A configuration is a sub-dataset contained within a larger dataset. Configurations are especially common in multilingual speech datasets where there may be a different configuration for each language. If you’re interested in learning more about splits and configurations, check out the conceptual guide on “Splits and configurations”!
This guide shows you how to use the dataset viewer’s /splits endpoint to retrieve a dataset’s splits and configurations programmatically. Feel free to also try it out with Postman, RapidAPI, or ReDoc.
The /splits endpoint accepts the dataset name as its query parameter:
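import os
import requests

# Assumes your Hugging Face access token is stored in the HF_TOKEN environment variable
API_TOKEN = os.environ["HF_TOKEN"]
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://datasets-server.huggingface.co/splits?dataset=ibm/duorc"

def query():
    # Send a GET request to the /splits endpoint and return the parsed JSON body
    response = requests.get(API_URL, headers=headers)
    return response.json()

data = query()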
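If you want to query more than one dataset, it can help to build the URL from the dataset name instead of hard-coding it. The following is a minimal sketch using only the standard library; the helper name build_splits_url and the constant BASE_URL are illustrative and not part of the dataset viewer API. Note that urlencode percent-encodes the slash in the dataset name, which is the standard encoding for a query parameter value.

```python
from urllib.parse import urlencode

# Base address of the /splits endpoint, taken from the example above
BASE_URL = "https://datasets-server.huggingface.co/splits"

def build_splits_url(dataset: str) -> str:
    """Return the /splits URL for a dataset, percent-encoding its name.

    Hypothetical helper for illustration only.
    """
    return f"{BASE_URL}?{urlencode({'dataset': dataset})}"

url = build_splits_url("ibm/duorc")
print(url)
```

The resulting string can be passed to requests.get in place of API_URL.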
The endpoint response is a JSON object containing a list of the dataset’s splits and configurations. For example, the ibm/duorc dataset has six splits and two configurations:
{
  "splits": [
    { "dataset": "ibm/duorc", "config": "ParaphraseRC", "split": "train" },
    { "dataset": "ibm/duorc", "config": "ParaphraseRC", "split": "validation" },
    { "dataset": "ibm/duorc", "config": "ParaphraseRC", "split": "test" },
    { "dataset": "ibm/duorc", "config": "SelfRC", "split": "train" },
    { "dataset": "ibm/duorc", "config": "SelfRC", "split": "validation" },
    { "dataset": "ibm/duorc", "config": "SelfRC", "split": "test" }
  ],
  "pending": [],
  "failed": []
}
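Because each entry in "splits" pairs one configuration with one split, a common next step is to group the split names by configuration. The sketch below does this on a hard-coded copy of the response shown above, so it runs without a network call; the function name group_splits_by_config is illustrative and not part of any library.

```python
from collections import defaultdict

def group_splits_by_config(data: dict) -> dict:
    """Map each configuration name to the list of its split names.

    Hypothetical helper; expects a dict shaped like the /splits response.
    """
    grouped = defaultdict(list)
    for entry in data.get("splits", []):
        grouped[entry["config"]].append(entry["split"])
    return dict(grouped)

# Copy of the example response for ibm/duorc shown above
sample = {
    "splits": [
        {"dataset": "ibm/duorc", "config": "ParaphraseRC", "split": "train"},
        {"dataset": "ibm/duorc", "config": "ParaphraseRC", "split": "validation"},
        {"dataset": "ibm/duorc", "config": "ParaphraseRC", "split": "test"},
        {"dataset": "ibm/duorc", "config": "SelfRC", "split": "train"},
        {"dataset": "ibm/duorc", "config": "SelfRC", "split": "validation"},
        {"dataset": "ibm/duorc", "config": "SelfRC", "split": "test"},
    ],
    "pending": [],
    "failed": [],
}

by_config = group_splits_by_config(sample)
for config, splits in by_config.items():
    print(config, splits)
```

For the sample above, this yields two configurations, each with the train, validation, and test splits.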