
SocBERT model

Pretrained model on 20GB of English tweets and 72GB of Reddit comments using a masked language modeling (MLM) objective. The tweets come from the Archive and were collected via the Twitter Streaming API. The Reddit comments were randomly sampled from all subreddits from 2015 to 2019. SocBERT-base was pretrained on 819M sequence blocks for 100K steps. SocBERT-final was pretrained on 929M (819M + 110M) sequence blocks for 112K (100K + 12K) steps. We benchmarked SocBERT on 40 text classification tasks with social media data.
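Below is a minimal sketch of loading the model for masked-token prediction with the Hugging Face transformers library. The repo id "sarkerlab/SocBERT-base" is an assumption for illustration; substitute the model id shown on this page.

from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

model_id = "sarkerlab/SocBERT-base"  # assumed repo id; replace with the actual one

# Load the pretrained MLM head and matching tokenizer.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Use the fill-mask pipeline to predict a masked token in a social-media-style sentence.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
text = f"I love scrolling {tokenizer.mask_token} all day."
for pred in fill_mask(text):
    print(pred["token_str"], round(pred["score"], 3))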

The experimental results can be found in our paper:

@inproceedings{socbert:2023,
title     = {{SocBERT: A Pretrained Model for Social Media Text}},
author    = {Yuting Guo and Abeed Sarker},
booktitle = {Proceedings of the Fourth Workshop on Insights from Negative Results in NLP},
year      = {2023}
}

A base version of the model can be found at SocBERT-base.
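Since the model was benchmarked on text classification tasks, here is a hedged sketch of fine-tuning it for one such task with the transformers Trainer API. The dataset choice ("tweet_eval"/"irony"), label count, hyperparameters, and repo id are placeholders for illustration, not details from the paper.

from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import load_dataset

model_id = "sarkerlab/SocBERT-base"  # assumed repo id; replace with the actual one

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Binary classification head; adjust num_labels to the task at hand.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Example public Twitter classification dataset (placeholder, not from the paper).
dataset = load_dataset("tweet_eval", "irony")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="socbert-finetuned",
                           per_device_train_batch_size=16,
                           num_train_epochs=3),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
)
trainer.train()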
