Embedding model for Labor Space
This repository contains a fine-tuned BERT model for the paper Labor Space: A Unifying Representation of the Labor Market via Large Language Models.
Model description
LABERT (Labor market + BERT) is a BERT-based sentence-transformers model fine-tuned on a domain-specific corpus of labor market text. We fine-tune the original BERT model in two ways to capture the latent structure of the labor market. More precisely, it was fine-tuned with two objectives:
Context learning: We use Hugging Face's "fill-mask" pipeline with a description for each entity to capture labor-market context at the individual word-token level. We concatenate (1) 308 NAICS 4-digit industry descriptions, (2) O*NET's descriptions for 36 skills, 25 knowledge domains, 46 abilities, and 1,016 occupations, (3) ESCO's descriptions for 15,000 skills and 3,000 occupations, and (4) 489 Crunchbase S&P 500 firm descriptions, excluding their labels. A minimal sketch of this stage is shown after these two objectives.
Relation learning: We add a second fine-tuning stage to incorporate inter-entity relatedness. Different types of labor market entities are intertwined with one another. For example, industry-specific occupational employment quantifies the relatedness between industries and occupations and tells us which occupations are conceptually close to a specific industry. Relation learning makes our embedding space capture this inter-entity relatedness: after this stage, each entity's embedding lies closer to its highly associated entities than to unassociated ones. For more detail, see Section 3.4, "Fine-tuning for relation learning," in the paper. An illustrative sketch of this stage is also shown below.
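The following is a minimal sketch of what the context-learning (fill-mask) stage could look like with Hugging Face's masked-language-modeling utilities. The file name, hyperparameters, and base checkpoint here are illustrative placeholders, not the exact settings used for LABERT.

# Illustrative sketch of the context-learning stage: masked language modeling
# on concatenated entity descriptions. Values below are placeholders.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# One entity description per line (NAICS, O*NET, ESCO, Crunchbase), labels removed.
dataset = load_dataset("text", data_files={"train": "entity_descriptions.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

# Randomly mask tokens so the model learns labor-market context at the token level.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="labert-mlm", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()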
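Below is a rough illustration of relation-learning-style fine-tuning with sentence-transformers, where a relatedness score for a pair of entity descriptions (e.g., derived from industry-specific occupational employment) supervises their embedding similarity. The example pairs, scores, and loss choice are hypothetical; the actual procedure is described in Section 3.4 of the paper.

# Illustrative sketch of relation learning: pull related entities together in the
# embedding space. Scores and loss choice are placeholders, not the paper's exact method.
from sentence_transformers import SentenceTransformer, models, InputExample, losses
from torch.utils.data import DataLoader

word_embedding_model = models.Transformer("seongwoon/LAbert")
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Each example pairs two entity descriptions with a relatedness score in [0, 1],
# e.g., from industry-occupation employment statistics (hypothetical values here).
train_examples = [
    InputExample(texts=["Software Publishers (NAICS 5112) ...",
                        "Software Developers design and develop applications ..."], label=0.9),
    InputExample(texts=["Software Publishers (NAICS 5112) ...",
                        "Dancers perform dances on stage ..."], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# CosineSimilarityLoss pushes the cosine similarity of each pair toward its label.
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)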
How to use
Using this model is straightforward once you have sentence-transformers installed:
pip install -U sentence-transformers
Then you can use the model like this:
from sentence_transformers import SentenceTransformer, models
base_model = "seongwoon/LAbert"
embedding_model = models.Transformer(base_model) ## Step 1: use an existing language model
pooling_model = models.Pooling(embedding_model.get_word_embedding_dimension()) ## Step 2: use a pool function over the token embeddings
pooling_model.pooling_mode_mean_tokens = True
pooling_model.pooling_mode_cls_token = False
pooling_model.pooling_mode_max_tokens = False
model = SentenceTransformer(modules=[embedding_model, pooling_model]) ## Join steps 1 and 2 using the modules argument
dancer_description = "Perform dances. May perform on stage, for broadcasting, or for video recording"
embedding_of_dancer_description = model.encode(dancer_description, convert_to_tensor=True)
print(embedding_of_dancer_description)
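Once encoded, entity descriptions can be compared with cosine similarity. A short illustration, continuing from the code above (the choreographer description is an illustrative snippet, not taken from this repository):

from sentence_transformers import util

# Compare the dancer description with an illustrative choreographer description.
choreographer_description = "Create new dance routines. Rehearse performance of routines."
embedding_of_choreographer_description = model.encode(choreographer_description, convert_to_tensor=True)
print(util.cos_sim(embedding_of_dancer_description, embedding_of_choreographer_description))  # higher scores indicate closer entities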
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
Citing & Authors
@inproceedings{kim2024labor,
title={Labor Space: A Unifying Representation of the Labor Market via Large Language Models},
author={Kim, Seongwoon and Ahn, Yong-Yeol and Park, Jaehyuk},
booktitle={Proceedings of the ACM on Web Conference 2024},
pages={2441--2451},
year={2024}
}