Model Card for PhysBERT

PhysBERT is a specialized text embedding model for physics, designed to improve information retrieval, citation classification, and clustering of physics literature. Trained on 1.2 million physics papers, it outperforms general-purpose models on physics-specific tasks.

Model Description

PhysBERT is a BERT-based text embedding model for physics, fine-tuned with SimCSE to produce sentence embeddings suited to physics text. It enables efficient retrieval, categorization, and analysis of physics literature, with higher relevance and accuracy than general-purpose models on domain-specific NLP tasks. This is the uncased variant; the cased version can be found here.

Developed by: Thorsten Hellert, João Montenegro, Andrea Pollastro
Funded by: US Department of Energy, Lawrence Berkeley National Laboratory
Model type: Text embedding model (BERT-based)
Language(s) (NLP): English
Paper: PhysBERT: A Text Embedding Model for Physics Scientific Literature

Training Data

PhysBERT was trained on a 40 GB corpus of arXiv physics publications comprising 1.2 million documents, refined for scientific accuracy.

Training Procedure

The model was pre-trained using Masked Language Modeling (MLM) and fine-tuned with SimCSE for sentence embeddings.
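The SimCSE stage can be illustrated with a minimal sketch of its contrastive objective. This is not the authors' training code: the function name simcse_loss and the temperature of 0.05 are assumptions chosen for illustration, and the model card does not say whether the unsupervised or supervised SimCSE variant was used. The sketch shows the in-batch loss, where row i of z1 and row i of z2 form a positive pair and every other row in the batch acts as a negative.

```python
import torch
import torch.nn.functional as F


def simcse_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """Contrastive (InfoNCE) loss with in-batch negatives.

    z1, z2: (batch, hidden) embeddings; z1[i] and z2[i] are a positive pair.
    temperature: assumed value for illustration, not taken from the model card.
    """
    # Unit-normalize so the dot product below is cosine similarity.
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    # (batch, batch) similarity matrix, scaled by the temperature.
    sim = z1 @ z2.T / temperature
    # The matching pair for row i sits on the diagonal, at column i.
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)
```

In the unsupervised variant of SimCSE, z1 and z2 would come from two forward passes of the same batch of sentences with dropout active, so that each sentence is paired with a slightly perturbed embedding of itself.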

Example of Usage

from transformers import AutoTokenizer, AutoModel
import torch

# Load the PhysBERT tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("thellert/physbert_uncased")
model = AutoModel.from_pretrained("thellert/physbert_uncased")
model.eval()  # ensure inference mode (dropout disabled)

# Sample text to embed
sample_text = "Electrons exhibit both particle and wave-like behavior."

# Tokenize the input text and run it through the model without tracking gradients
inputs = tokenizer(sample_text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Extract the token embeddings, shape (1, sequence_length, hidden_size)
token_embeddings = outputs.last_hidden_state
# Drop the [CLS] and [SEP] tokens (first and last positions of this single,
# unpadded input), then average over tokens for the sentence embedding
token_embeddings = token_embeddings[:, 1:-1, :]
sentence_embedding = token_embeddings.mean(dim=1)
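The slice [:, 1:-1, :] above only removes the right tokens for a single, unpadded input. When several texts are embedded together, padding moves the [SEP] token to a different position in each row. The following is a rough sketch of one way to get the same "mean over non-special tokens" pooling for a batch. The helper names masked_mean_pool and embed are hypothetical and not part of PhysBERT or transformers; the tokenizer arguments used (padding, truncation, return_special_tokens_mask) are standard transformers options.

```python
import torch


def masked_mean_pool(hidden: torch.Tensor, keep_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over positions where keep_mask is 1.

    hidden: (batch, seq_len, hidden_size); keep_mask: (batch, seq_len) of 0/1.
    """
    mask = keep_mask.unsqueeze(-1).to(hidden.dtype)
    summed = (hidden * mask).sum(dim=1)
    # Clamp to avoid division by zero if a row keeps no tokens.
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts


def embed(texts, tokenizer, model):
    """Hypothetical helper: batch-embed texts, ignoring padding, [CLS], and [SEP]."""
    enc = tokenizer(
        texts,
        padding=True,
        truncation=True,
        return_tensors="pt",
        return_special_tokens_mask=True,
    )
    # The model does not accept this key, so remove it before the forward pass.
    special = enc.pop("special_tokens_mask")
    keep = enc["attention_mask"] * (1 - special)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state
    return masked_mean_pool(hidden, keep)
```

With the tokenizer and model loaded as above, embed(["first abstract", "second abstract"], tokenizer, model) would return one vector per text, and torch.nn.functional.cosine_similarity can then compare them for retrieval or clustering.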

Citation

If you find this work useful, please consider citing the following paper:

@article{10.1063/5.0238090,
    author = {Hellert, Thorsten and Montenegro, João and Pollastro, Andrea},
    title = "{PhysBERT: A text embedding model for physics scientific literature}",
    journal = {APL Machine Learning},
    volume = {2},
    number = {4},
    pages = {046105},
    year = {2024},
    month = {10},
    issn = {2770-9019},
    doi = {10.1063/5.0238090},
    url = {https://doi.org/10.1063/5.0238090},
    eprint = {https://pubs.aip.org/aip/aml/article-pdf/doi/10.1063/5.0238090/20227307/046105\_1\_5.0238090.pdf},
}

Model Card Authors

Thorsten Hellert, João Montenegro, Andrea Pollastro

Model Card Contact

Thorsten Hellert, Lawrence Berkeley National Laboratory, [email protected]
