Model Card for PhysBERT
PhysBERT is a specialized text embedding model for physics, designed to improve information retrieval, citation classification, and clustering of physics literature. Trained on 1.2 million physics papers, it outperforms general-purpose models on physics-specific tasks.
Model Description
PhysBERT is a BERT-based text embedding model for physics, fine-tuned using SimCSE to improve performance on physics text. It supports retrieval, categorization, and analysis of physics literature, with higher relevance and accuracy than general-purpose models on domain-specific NLP tasks. This card describes the uncased model (thellert/physbert_uncased); a cased version is also available.
- Developed by: Thorsten Hellert, João Montenegro, Andrea Pollastro
- Funded by: US Department of Energy, Lawrence Berkeley National Laboratory
- Model type: Text embedding model (BERT-based)
- Language(s) (NLP): English
- Paper: PhysBERT: A Text Embedding Model for Physics Scientific Literature
Training Data
The model was trained on a 40 GB corpus of 1.2 million physics publications from arXiv, refined for scientific accuracy.
Training Procedure
The model was pre-trained using Masked Language Modeling (MLM) and fine-tuned with SimCSE for sentence embeddings.
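For orientation, the contrastive objective that SimCSE optimizes can be written as below. This is the generic in-batch form from the SimCSE paper, not a statement of the exact configuration used for PhysBERT: the temperature, the batch size, and the way positive pairs were constructed are not given in this card and are left symbolic here.

```latex
\ell_i = -\log
  \frac{\exp\!\left(\mathrm{sim}(\mathbf{h}_i, \mathbf{h}_i^{+}) / \tau\right)}
       {\sum_{j=1}^{N} \exp\!\left(\mathrm{sim}(\mathbf{h}_i, \mathbf{h}_j^{+}) / \tau\right)}
```

Here $\mathbf{h}_i$ is the embedding of the $i$-th sentence in a batch of $N$, $\mathbf{h}_i^{+}$ is the embedding of its positive counterpart, $\mathrm{sim}$ is cosine similarity, and $\tau$ is a temperature. Minimizing $\ell_i$ pulls each sentence toward its positive and pushes it away from the other sentences in the batch.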
Example of Usage
from transformers import AutoTokenizer, AutoModel
import torch
# Load PhysBERT tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("thellert/physbert_uncased")
model = AutoModel.from_pretrained("thellert/physbert_uncased")
# Sample text to embed
sample_text = "Electrons exhibit both particle and wave-like behavior."
# Tokenize the input text and pass it through the model (no gradients needed for inference)
inputs = tokenizer(sample_text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Extract the token embeddings
token_embeddings = outputs.last_hidden_state
# Drop the [CLS] and [SEP] tokens, then take the mean for the sentence embedding
# (this slicing assumes a single, unpadded input sequence)
token_embeddings = token_embeddings[:, 1:-1, :]
sentence_embedding = token_embeddings.mean(dim=1)
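Since the model is aimed at information retrieval, a common next step is to rank candidate texts by the cosine similarity of their embeddings to a query embedding. The following is a minimal sketch of that ranking step, not part of the PhysBERT release: the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical, and the 3-dimensional vectors are toy values standing in for real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity to anything
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, candidates):
    """Return (index, score) pairs for the candidates, most similar first."""
    scores = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for real sentence embeddings (illustrative values only)
query = [1.0, 0.0, 1.0]
candidates = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]

ranking = rank_by_similarity(query, candidates)
best_index = ranking[0][0]  # index of the most similar candidate
```

With real model output, a vector such as `sentence_embedding[0].tolist()` could be passed in place of the toy lists. For large collections, batched tensor operations such as `torch.nn.functional.cosine_similarity` are likely to be far faster than this pure-Python loop.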
Citation
If you find this work useful, please consider citing the following paper:
@article{10.1063/5.0238090,
author = {Hellert, Thorsten and Montenegro, João and Pollastro, Andrea},
title = "{PhysBERT: A text embedding model for physics scientific literature}",
journal = {APL Machine Learning},
volume = {2},
number = {4},
pages = {046105},
year = {2024},
month = {10},
issn = {2770-9019},
doi = {10.1063/5.0238090},
url = {https://doi.org/10.1063/5.0238090},
eprint = {https://pubs.aip.org/aip/aml/article-pdf/doi/10.1063/5.0238090/20227307/046105\_1\_5.0238090.pdf},
}
Model Card Authors
Thorsten Hellert, João Montenegro, Andrea Pollastro
Model Card Contact
Thorsten Hellert, Lawrence Berkeley National Laboratory