Sports Text Classifier
Overview
This Sports Text Classifier is a crucial component of the OnlySports Dataset creation pipeline. It's designed to accurately identify and extract sports-related documents from a large corpus of web content.
Model Architecture
- Base model: Snowflake-arctic-embed-xs
- Additional layer: Binary classification layer
- Training: 10 epochs with a learning rate of 3e-4
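The architecture above can be sketched as a single linear layer with a sigmoid applied to a pooled document embedding. This is a minimal illustration, not the released training code. The names `BinaryHead`, `train`, and `EMBED_DIM` are hypothetical, the embedding width of 384 is an assumption about the base model, and plain SGD stands in for whatever optimizer was actually used. Whether the encoder weights were frozen or fine-tuned is not stated here, so the sketch only covers the classification layer.

```python
import math
import random

EMBED_DIM = 384  # assumed embedding width of the base model; adjust to the encoder


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class BinaryHead:
    """Hypothetical binary classification layer over a pooled embedding."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.w = [rng.uniform(-0.01, 0.01) for _ in range(dim)]
        self.b = 0.0

    def predict_proba(self, emb):
        # Probability that the document is sports-related.
        z = sum(wi * xi for wi, xi in zip(self.w, emb)) + self.b
        return sigmoid(z)

    def sgd_step(self, emb, label, lr=3e-4):
        # For binary cross-entropy, d(loss)/d(logit) = p - y.
        err = self.predict_proba(emb) - label
        self.w = [wi - lr * err * xi for wi, xi in zip(self.w, emb)]
        self.b -= lr * err


def train(head, examples, epochs=10, lr=3e-4):
    """Run plain SGD over (embedding, label) pairs; label 1 = sports."""
    for _ in range(epochs):
        for emb, label in examples:
            head.sgd_step(emb, label, lr)
    return head
```

The defaults `epochs=10` and `lr=3e-4` mirror the training settings listed above; everything else is illustrative.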
Performance
The classifier achieves exceptional accuracy in distinguishing between sports and non-sports documents.
Training Data
The classifier was trained on a balanced dataset of sports and non-sports content:
- 64k samples from seven prestigious sports websites
- 36k non-sports text documents classified using GPT-3.5
Usage
This classifier is primarily used in the creation of the OnlySports Dataset, presented in the accompanying paper. It can be applied to filter large text corpora for sports-related content with high accuracy.
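As a rough illustration of corpus filtering, the sketch below keeps documents whose sports probability meets a threshold. The function `filter_sports`, the `threshold` default of 0.5, and the keyword-based `toy_score` are all hypothetical; `toy_score` is only a stand-in so the example runs without the model, and in practice it would be replaced by a function that returns the classifier's sports probability for a text.

```python
from typing import Callable, Iterable, Iterator


def filter_sports(
    docs: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> Iterator[str]:
    """Yield documents whose sports probability is at least `threshold`."""
    for text in docs:
        if score_fn(text) >= threshold:
            yield text


def toy_score(text: str) -> float:
    """Placeholder scorer (assumption): swap in the real classifier output."""
    sports_terms = {"match", "league", "goal", "playoffs", "tournament"}
    words = set(text.lower().split())
    return 1.0 if words & sports_terms else 0.0
```

Because `filter_sports` is a generator, it can stream over a corpus that does not fit in memory; raising `threshold` trades recall for precision.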
Integration
The classifier is integrated into a MapReduce architecture for efficient processing of large-scale datasets. It's used in conjunction with URL keyword filtering to create a comprehensive sports text dataset.
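One way to picture this integration is sketched below: a map step filters each shard independently, and a reduce step merges the per-shard results. It is a simplified, single-process illustration under several assumptions: the keyword list `URL_KEYWORDS`, the document layout (`url` and `text` keys), the function names, and the choice to apply the URL filter as a cheap pre-filter before the classifier are all hypothetical and are not taken from the actual pipeline.

```python
from typing import Callable, Dict, List, Sequence

Doc = Dict[str, str]  # assumed layout: {"url": ..., "text": ...}

URL_KEYWORDS = ("sport", "nba", "nfl", "football")  # hypothetical keyword list


def url_matches(url: str, keywords: Sequence[str] = URL_KEYWORDS) -> bool:
    """Cheap pre-filter: does the URL contain any sports keyword?"""
    url = url.lower()
    return any(k in url for k in keywords)


def map_shard(
    shard: List[Doc],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> List[Doc]:
    """Map step: keep docs that pass the URL filter and the classifier."""
    return [
        d for d in shard
        if url_matches(d["url"]) and score_fn(d["text"]) >= threshold
    ]


def reduce_shards(mapped: List[List[Doc]]) -> List[Doc]:
    """Reduce step: concatenate the per-shard results in shard order."""
    merged: List[Doc] = []
    for shard in mapped:
        merged.extend(shard)
    return merged


def run_pipeline(
    shards: List[List[Doc]],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> List[Doc]:
    # In a real deployment each map_shard call would run on its own worker.
    mapped = [map_shard(s, score_fn, threshold) for s in shards]
    return reduce_shards(mapped)
```

Running the URL check first means the comparatively expensive classifier is only called on documents that already look promising, which is one plausible reason to pair the two filters.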
Related Projects
This classifier is part of the larger OnlySports collection. For more information, check our paper.
Model tree for Chrisneverdie/OnlySports_Classifier
Base model
Snowflake/snowflake-arctic-embed-xs