---
license: cc-by-sa-4.0
language:
- en
metrics:
- accuracy
pipeline_tag: text-classification
tags:
- sports
datasets:
- Chrisneverdie/OnlySports_Dataset
base_model: Snowflake/snowflake-arctic-embed-xs
---

# Sports Text Classifier

## Overview

This Sports Text Classifier is a core component of the OnlySports Dataset creation pipeline. It identifies and extracts sports-related documents from a large corpus of web content.

## Model Architecture

- Base model: [Snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs)
- Additional layer: Binary classification layer
- Training: 10 epochs with a learning rate of 3e-4
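
As a rough illustration of the setup listed above, the sketch below attaches a two-class head to the base encoder with the `transformers` library. It is an assumption that the head can be reproduced with `AutoModelForSequenceClassification`; the original training code may differ (for example, it may freeze the encoder). `TrainConfig` and `build_classifier` are hypothetical names. Only the base model ID, the epoch count, and the learning rate come from this card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters stated in this card; everything else is an assumption."""
    base_model: str = "Snowflake/snowflake-arctic-embed-xs"
    num_labels: int = 2            # sports vs. non-sports
    num_epochs: int = 10
    learning_rate: float = 3e-4


def build_classifier(cfg: TrainConfig = TrainConfig()):
    """Load the base encoder with a freshly initialized classification head.

    The import is local so the config can be inspected without transformers
    installed. Calling this function downloads the base model weights.
    """
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(cfg.base_model)
    model = AutoModelForSequenceClassification.from_pretrained(
        cfg.base_model, num_labels=cfg.num_labels
    )
    return tokenizer, model
```

The classification head returned by `build_classifier` is untrained; it would still need fine-tuning on labeled data for the stated number of epochs before it is useful.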

## Performance

The figure below reports the classifier's accuracy in distinguishing between sports and non-sports documents:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/656590bd40440ddcc051ade7/hK_a183i2_H5AfUF6ZXd6.png)

## Training Data

The classifier was trained on a labeled dataset of sports and non-sports content, about 100k documents in total:

- 64k samples from seven prestigious sports websites
- 36k text documents labeled as non-sports using GPT-3.5

## Usage

This classifier is primarily used in the creation of the OnlySports Dataset, presented in this [paper](https://arxiv.org/abs/2409.00286). It can be applied to filter large text corpora for sports-related content with high accuracy.
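
A minimal sketch of such a filtering step is shown below. It is written against a generic scorer callable so that it runs offline. `filter_sports`, `keyword_scorer`, the `"sports"` label name, and the 0.5 threshold are all illustrative assumptions, not part of the released model.

```python
from typing import Callable, Iterable, Iterator, Tuple

# A scorer maps a document to (label, confidence). The label names are assumptions.
Scorer = Callable[[str], Tuple[str, float]]


def filter_sports(docs: Iterable[str], scorer: Scorer,
                  positive_label: str = "sports",
                  threshold: float = 0.5) -> Iterator[str]:
    """Yield only documents the scorer labels as sports with enough confidence."""
    for doc in docs:
        label, score = scorer(doc)
        if label == positive_label and score >= threshold:
            yield doc


def keyword_scorer(doc: str) -> Tuple[str, float]:
    """Toy stand-in for the real classifier so the example runs without a model."""
    lowered = doc.lower()
    hit = any(word in lowered for word in ("goal", "match", "league"))
    return ("sports", 1.0) if hit else ("other", 1.0)


corpus = ["The league match ended 2-1.", "Quarterly earnings beat forecasts."]
print(list(filter_sports(corpus, keyword_scorer)))
# prints ['The league match ended 2-1.']
```

To use the actual classifier, the toy scorer could be swapped for a thin wrapper around a `transformers` text-classification pipeline, which returns a label and a score for each input. The label strings depend on the model's configuration, so `positive_label` would need to be adjusted accordingly.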

## Integration

The classifier is integrated into a MapReduce architecture for efficient processing of large-scale datasets. It's used in conjunction with URL keyword filtering to create a comprehensive sports text dataset.
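
The sketch below shows one way such a pipeline could be laid out. It assumes the URL keyword filter runs first as a cheap pre-filter and the classifier then confirms each surviving document; the actual ordering and combination logic are not specified here. The keyword list, the record layout, and all function names are hypothetical. Shards are processed sequentially for simplicity, whereas a real MapReduce job would distribute the map step across workers.

```python
from functools import reduce
from typing import Callable, Dict, Iterable, List

# Hypothetical keyword list; the real pipeline's keywords are not given here.
URL_KEYWORDS = ("sport", "nba", "nfl", "soccer")

Record = Dict[str, str]  # assumed layout: {"url": ..., "text": ...}


def url_matches(url: str) -> bool:
    """Cheap first-pass filter on the URL string."""
    lowered = url.lower()
    return any(keyword in lowered for keyword in URL_KEYWORDS)


def map_shard(shard: List[Record], is_sports: Callable[[str], bool]) -> List[Record]:
    """Map step: keep records that pass the URL filter and then the classifier."""
    return [r for r in shard if url_matches(r["url"]) and is_sports(r["text"])]


def reduce_shards(left: List[Record], right: List[Record]) -> List[Record]:
    """Reduce step: concatenate the per-shard outputs."""
    return left + right


def run_pipeline(shards: Iterable[List[Record]],
                 is_sports: Callable[[str], bool]) -> List[Record]:
    """Apply the map step to every shard, then fold the results together."""
    mapped = (map_shard(shard, is_sports) for shard in shards)
    return reduce(reduce_shards, mapped, [])
```

Here `is_sports` stands in for a call to the classifier; any function that takes a document's text and returns a boolean can be passed in.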

## Related Projects

This classifier is part of the larger OnlySports collection, which includes:

- [OnlySports Dataset](https://huggingface.co/collections/Chrisneverdie/onlysports-66b3e5cf595eb81220cc27a6)
- [OnlySportsLM](https://huggingface.co/Chrisneverdie/OnlySportsLM_196M)

For more information, check our [paper](https://arxiv.org/abs/2409.00286) or email [email protected].