Model Card for RoBERTa Social Roles Classifier
This model is a token classifier that extracts social roles from explicit expressions of self-identification in sentences, e.g. I am a designer, entrepreneur, and mother.
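A minimal sketch of how a token classifier like this one might be called through the Hugging Face `transformers` token-classification pipeline. The model identifier `MODEL_ID`, the helper names `load_tagger` and `extract_roles`, the label name `"ROLE"`, and the 0.5 score threshold are all illustrative assumptions, not values confirmed by this card; check the model's `config.id2label` for the real label names.

```python
from typing import Dict, List, Set

# Hypothetical placeholder: replace with the real Hub ID or local path of this model.
MODEL_ID = "path/to/roberta-social-roles"


def load_tagger(model_id: str = MODEL_ID):
    """Build a token-classification pipeline (needs `transformers` and a backend such as PyTorch)."""
    # Imported lazily so that extract_roles() below can be used without transformers installed.
    from transformers import pipeline

    # aggregation_strategy="simple" merges sub-word pieces into whole-word spans.
    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",
    )


def extract_roles(
    predictions: List[Dict],
    role_labels: Set[str],
    min_score: float = 0.5,
) -> List[str]:
    """Keep the text of predicted spans whose label is a role label.

    `predictions` is expected to have the shape returned by an aggregated
    token-classification pipeline: dicts with "entity_group", "score", "word".
    `role_labels` is an assumption; the actual label names depend on the model config.
    """
    roles = []
    for pred in predictions:
        if pred["entity_group"] in role_labels and pred["score"] >= min_score:
            roles.append(pred["word"].strip())
    return roles


# Example usage (commented out because it downloads or loads a model):
# tagger = load_tagger()
# print(tagger.model.config.id2label)  # inspect the real label names first
# preds = tagger("I am a designer, entrepreneur, and mother.")
# print(extract_roles(preds, role_labels={"ROLE"}))
```

Keeping the post-processing in a separate function makes it possible to adjust the label set or score threshold without reloading the model.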
Model Details
Model Description
We first continue pretraining RoBERTa-base for 10 epochs on individuals' about pages, a subset of Common Crawl that can be accessed here.
Then, we finetune the model on hand-annotated token-level labels, as described in this paper. We use a train-dev-test split of 600/200/200 labeled sentences.
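The 600/200/200 split can be illustrated with a short sketch. The card does not say how sentences were assigned to each split, so the seeded random shuffle and the function name `split_sentences` below are assumptions made only for illustration.

```python
import random
from typing import List, Sequence, Tuple


def split_sentences(
    sentences: Sequence[str],
    sizes: Tuple[int, int, int] = (600, 200, 200),
    seed: int = 0,
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle sentences with a fixed seed and cut them into train/dev/test.

    Assumption: a uniform random split. The original procedure is not documented here.
    """
    if sum(sizes) != len(sentences):
        raise ValueError("split sizes must add up to the number of sentences")

    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # local RNG, so global random state is untouched

    n_train, n_dev, _ = sizes
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test
```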
We define a "role" or "occupation" on an about page as any singular noun that refers to the subject of the bio. This includes roles and occupations that the subject held in the past, e.g. Throughout my life I have been a teacher, a startup founder, and a seashell collector.
Subject of the about page
- First person biographies: the subject is I, me, my, mine.
- Third person biographies: we assume the bio’s subject is the main person referenced in the excerpt sentence.
Positive examples of self-identification
- I am a chef, author, and mom living in Virginia.
- As an award-winning geologist, Sebastian has given talks around the world.
- Knitter, blogger, & dreamer.
In the last example above, the sentence’s relation to the subject of the bio is implied rather than stated.
Negative examples
- My wife loves beekeeping as well.
- Janice works hard to accommodate every client.
Language(s) (NLP): English
License: Apache 2.0
Uses
We use tagged social roles in web pages to assess the social impact of LLM pretraining data curation decisions. Text linked to descriptions of its creators can also facilitate other areas of research, including self-presentation and language variation.
Evaluation
On our test set, we achieve a precision score of 0.856, recall score of 0.945, and F1 score of 0.898.
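As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported F1 can be recomputed from the other two numbers. The helper name `f1_score` below is ours and is not tied to any particular library.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported on the test set above.
f1 = f1_score(precision=0.856, recall=0.945)
print(round(f1, 3))  # 0.898
```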
Citation
@misc{lucy2024aboutme,
title={AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters},
author={Li Lucy and Suchin Gururangan and Luca Soldaini and Emma Strubell and David Bamman and Lauren Klein and Jesse Dodge},
year={2024},
eprint={2401.06408},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Contact