Hub documentation

SQL Console: Query Hugging Face datasets in your browser

Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

SQL Console: Query Hugging Face datasets in your browser

You can run SQL queries on the dataset in the browser using the SQL Console. The SQL Console is powered by DuckDB WASM and runs entirely in the browser. You can access the SQL Console from the dataset page by clicking on the SQL Console badge.

To learn more about the SQL Console, see the SQL Console blog post.

Through the SQL Console, you can:

  • Run DuckDB SQL queries on the dataset (checkout SQL Snippets for useful queries)
  • Share results of the query with others via a link (check out this example)
  • Download the results of the query to a parquet file
  • Embed the results of the query in your own webpage using an iframe
You can also use the DuckDB locally through the CLI to query the dataset via the `hf://` protocol. See the DuckDB Datasets documentation for more information. The SQL Console provides a convenient `Copy to DuckDB CLI` button that generates the SQL query for creating views and executing your query in the DuckDB CLI.

Examples

Filtering

The SQL Console makes filtering datasets really easy. For example, if you want to filter the SkunkworksAI/reasoning-0.01 dataset for instructions and responses with a reasoning length of at least 10, you can use the following query:

In the query, we can use the len function to get the length of the reasoning_chains column and the bar function to create a bar chart of the reasoning lengths.

SELECT len(reasoning_chains) AS reason_len, bar(reason_len, 0, 100), * 
FROM train
WHERE reason_len > 10
ORDER BY reason_len DESC

The bar function is a neat built-in DuckDB function that creates a bar chart of the reasoning lengths.

Histogram

Many dataset authors choose to include statistics about the distribution of the data in the dataset. Using the DuckDB histogram function, we can plot a histogram of a column’s values.

For example, to plot a histogram of the reason_len column in the SkunkworksAI/reasoning-0.01 dataset, you can use the following query:

Learn more about the `histogram` function and parameters here.

FROM histogram(train, len(reasoning_chains))

Regex Matching

One of the most powerful features of DuckDB is the deep support for regular expressions. You can use the regexp function to match patterns in your data.

Using the regexp_matches function, we can filter the SkunkworksAI/reasoning-0.01 dataset for instructions that contain markdown code blocks.

Learn more about the DuckDB regex functions here.

SELECT * 
FROM train
WHERE regexp_matches(instruction, '```[a-z]*\n')
limit 100

Leakage Detection

Leakage detection is the process of identifying whether data in a dataset is present in multiple splits, for example, whether the test set is present in the training set.

Learn more about leakage detection here.

WITH
    overlapping_rows AS (
        SELECT COALESCE(
            (SELECT COUNT(*) AS overlap_count
             FROM train
             INTERSECT
             SELECT COUNT(*) AS overlap_count
             FROM test),
            0
        ) AS overlap_count
    ),
    total_unique_rows AS (
        SELECT COUNT(*) AS total_count
        FROM (
            SELECT * FROM train
            UNION
            SELECT * FROM test
        ) combined
    )
SELECT
    overlap_count,
    total_count,
    CASE 
        WHEN total_count > 0 THEN (overlap_count * 100.0 / total_count)
        ELSE 0
    END AS overlap_percentage
FROM overlapping_rows, total_unique_rows;
< > Update on GitHub