The Beginner's Guide to Cleaning a Dataset

Community Article Published November 18, 2024

Datasets often arrive raw and unfiltered, not quite ready for training or other downstream tasks. Cleaning is one of the most common steps in getting a dataset ready to use.

Microsoft Orca Agent Instruct Dataset

One example is Maxime Labonne's cleaned version of the Microsoft Orca Agent Instruct dataset, which can be found at mlabonne/orca-agentinstruct-1M-v1-cleaned.

The Microsoft Orca Agent Instruct dataset contains ~1M instruction examples across a wide range of tasks, including text editing, creative writing, coding, reading comprehension, and brain teasers.

Cleaning Steps

In the dataset card, Maxime outlined the cleaning process:

Here's what I changed:

  1. Splits are unified into one, with a new "split" column
  2. Strings were converted into lists of dicts to ensure compatibility with most frameworks
  3. Empty system prompts were removed so you don't get weird errors

In this post, we will walk through how to do all of this without leaving the dataset page. We will break down the steps above and iteratively build up the SQL to accomplish them in the SQL Console.

Open the SQL Console

To start, click the SQL Console button on the dataset page. This opens the SQL Console and automatically loads a subset and split.


There are a couple of things you will notice. First, the console shows 1/15 splits loaded, which means only the creative_content split has been loaded as a view and there are 14 other splits that have not. We can simply click on the other splits to automatically load them into DuckDB so we can query them.
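Once a split is loaded as a view, we can query it directly by name. For example, here's a minimal sketch that previews the first few rows of the creative_content view:

-- preview a few rows of the loaded creative_content split
SELECT * FROM creative_content LIMIT 5;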

1. Converting strings into STRUCTs

Converting strings into lists of dictionaries is fairly simple. We know that the messages field is a list of conversation messages, each with the keys role and content. We can represent that in DuckDB as a list of STRUCTs and use CAST to convert the VARCHAR column to that type.

CAST(messages AS STRUCT(role VARCHAR, content VARCHAR)[]) as messages
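To see the cast in action, here is a minimal sketch that applies it to the creative_content view the console loaded for us and previews a few rows:

-- cast the messages column to a list of structs and preview the result
SELECT
  CAST(messages AS STRUCT(role VARCHAR, content VARCHAR)[]) AS messages
FROM creative_content
LIMIT 5;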

2. Removing empty system prompts

Next, let's filter out the empty messages so the conversation starts directly with the user message.

Since messages is now a LIST type, we can use DuckDB's handy list_filter function to remove the empty messages.

We can start by expanding on our code above and using list_filter:

-- filter out if content is null or empty string
list_filter(messages, x -> x.content != '' AND x.content IS NOT NULL) AS messages

This will filter out all empty messages. If we wanted to be more specific and filter out only empty system prompts, we can update our lambda to do so.
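For example, a minimal sketch of a stricter lambda that drops only empty system prompts (and keeps everything else) could look like this:

-- drop only system messages whose content is NULL or empty
list_filter(
  messages,
  x -> NOT (x.role = 'system' AND (x.content IS NULL OR x.content = ''))
) AS messages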

3. Joining splits together

Now that we are converting messages from a string to a list of structs and removing empty messages, let's join two splits together and add a column indicating which split each row came from.

Let's start by:

  1. Loading all the splits we want to join as views
  2. Using the SQL UNION ALL operator to combine the rows from all of the splits

We will only join two splits here, since the full dataset is too large to process entirely in the browser. For smaller datasets, though, the SQL Console is perfectly capable of joining every split.

WITH joined_data AS (
  SELECT *, 'creative_content' AS split FROM creative_content
    UNION ALL
  SELECT *, 'text_modification' AS split FROM text_modification
)

SELECT * FROM joined_data

We use UNION ALL so that every row from both splits is kept; plain UNION would also deduplicate rows, which we don't need here.

Finally, we can put the pieces together to get the final cleaned, joined result:

-- join the splits together
WITH joined_data AS (
  SELECT *, 'creative_content' AS split FROM creative_content
    UNION ALL
  SELECT *, 'text_modification' AS split FROM text_modification
)

SELECT
  list_filter(
    CAST(messages AS STRUCT(role VARCHAR, content VARCHAR)[]), -- cast string to list of structs
    x -> x.content != '' AND x.content IS NOT NULL -- filter out empty messages
  ) AS messages,
  split
FROM joined_data


Now we have a cleaned dataset that we can download and use!
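If you later want to run the same query outside the browser, here is a minimal sketch of exporting the cleaned result from a local DuckDB session, assuming you have created the creative_content and text_modification views yourself; the output filename is just a placeholder:

-- write the cleaned, combined result to a Parquet file (local DuckDB)
COPY (
  WITH joined_data AS (
    SELECT *, 'creative_content' AS split FROM creative_content
      UNION ALL
    SELECT *, 'text_modification' AS split FROM text_modification
  )
  SELECT
    list_filter(
      CAST(messages AS STRUCT(role VARCHAR, content VARCHAR)[]),
      x -> x.content != '' AND x.content IS NOT NULL
    ) AS messages,
    split
  FROM joined_data
) TO 'orca_agentinstruct_cleaned.parquet' (FORMAT PARQUET);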

Resources