Material SciBERT (TPU): Improving language understanding in materials science
Work in progress
Introduction
A SciBERT-based model pre-trained on the full text of materials science articles.
Authors
Luca Foppiano, Pedro Ortiz Suarez
TLDR
- Collected the full text of ~700,000 articles provided by the National Institute for Materials Science (NIMS) TDM platform (https://dice.nims.go.jp/services/TDM-PF/en/); the resulting dataset is called ScienceCorpus (SciCorpus)
- Added to the SciBERT vocabulary (32k tokens) 100 domain-specific unknown words extracted from SciCorpus with a keyword extraction tool (KeyBERT)
- Starting conditions: original SciBERT weights
- Pre-trained the model MatTPUSciBERT on Google Cloud with TPUs (Tensor Processing Units) as follows:
  - 800,000 steps with batch_size: 256, max_seq_length: 512
  - 100,000 steps with batch_size: 2048, max_seq_length: 128
- Fine-tuned and tested on NER for superconductors (https://github.com/lfoppiano/grobid-superconductors) and physical quantities (https://github.com/kermitt2/grobid-quantities)
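The vocabulary extension and the two-phase schedule listed above can be sketched in plain Python. This is an illustration only, not the authors' pipeline: `select_new_tokens`, `Phase`, and the toy keyword list are hypothetical names introduced here, the KeyBERT extraction step is replaced by a pre-ranked list, and "unknown" is assumed to mean that the whole word is not already a vocabulary entry. The step counts, batch sizes, and sequence lengths are the ones given in the list.

```python
from dataclasses import dataclass


def select_new_tokens(ranked_keywords, vocab, limit=100):
    """Return up to `limit` keywords absent from `vocab`, preserving rank order.

    Assumption: `ranked_keywords` stands in for the output of a keyword
    extractor such as KeyBERT, sorted from most to least relevant.
    """
    selected = []
    seen = set()
    for word in ranked_keywords:
        if word in vocab or word in seen:
            continue  # already known to the tokenizer, or a repeated candidate
        seen.add(word)
        selected.append(word)
        if len(selected) == limit:
            break
    return selected


@dataclass(frozen=True)
class Phase:
    """One pre-training phase (hypothetical helper for bookkeeping)."""
    steps: int
    batch_size: int
    max_seq_length: int

    @property
    def sequences(self):
        # Number of training sequences consumed by this phase
        return self.steps * self.batch_size

    @property
    def token_positions(self):
        # Upper bound on processed tokens: padding positions are counted too
        return self.sequences * self.max_seq_length


# Values taken from the list above, in the order they are listed
PHASES = [
    Phase(steps=800_000, batch_size=256, max_seq_length=512),
    Phase(steps=100_000, batch_size=2048, max_seq_length=128),
]

if __name__ == "__main__":
    toy_vocab = {"the", "of", "oxide"}
    toy_keywords = ["cuprate", "oxide", "pnictide", "cuprate"]
    print(select_new_tokens(toy_keywords, toy_vocab))
    for phase in PHASES:
        print(phase, phase.sequences, phase.token_positions)
```

By this arithmetic both phases consume the same number of sequences (204.8 million each); they differ only in sequence length and therefore in the number of token positions processed.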
Related work
BERT Implementations
- BERT (the original) https://arxiv.org/abs/1810.04805
- RoBERTa (Re-implementation by Facebook) https://arxiv.org/abs/1907.11692
Relevant models
- SciBERT: BERT, from scratch, scientific articles (biology + CS) https://github.com/allenai/scibert
- MatSciBERT (Gupta): RoBERTa, from scratch, SciBERT vocab and weights, ~150K papers limited to 4 MS families http://github.com/m3rg-iitd/matscibert
- MaterialBERT: Not yet published
- MatBERT (CEDER): BERT, from scratch, 2M documents on materials science (~60M paragraphs) https://github.com/lbnlp/MatBERT
- BatteryBERT (Cole): BERT, mixed from scratch and with predefined weights https://github.com/ShuHuang/batterybert/
Results
Results obtained via 10-fold cross-validation, using DeLFT (https://github.com/kermitt2/delft)
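For readers unfamiliar with the protocol, a minimal sketch of how a 10-fold split partitions a dataset is given below. It is a generic illustration and makes no claim about how DeLFT implements cross-validation internally; `k_fold_splits`, the seed, and the dataset size are hypothetical.

```python
import random


def k_fold_splits(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Every item lands in exactly one test fold; the remaining k-1 folds
    form the training set for that round.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducible folds
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    # Hypothetical dataset of 25 annotated documents
    for round_no, (train, test) in enumerate(k_fold_splits(25, k=10), start=1):
        print(round_no, len(train), len(test))
```

Each model is trained and evaluated once per fold, and the per-fold scores are then aggregated into a single figure per metric.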
NER Superconductors
Model | Precision | Recall | F1 |
---|---|---|---|
SciBERT (baseline) | 81.62% | 84.23% | 82.90% |
MatSciBERT (Gupta) | 81.45% | 84.36% | 82.88% |
MatTPUSciBERT | 82.13% | 85.15% | 83.61% |
MatBERT (Ceder) | 81.25% | 83.99% | 82.60% |
BatteryScibert-cased | 81.09% | 84.14% | 82.59% |
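As a sanity check on the table above, F1 can be recomputed as the harmonic mean of precision and recall. The sketch below uses the reported superconductors values; `f1_score` is a helper defined here, not part of DeLFT. If the reported figures are averages over folds, the recomputed value may differ from the reported one in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) in percent, copied from the table above
SUPERCONDUCTORS = {
    "SciBERT (baseline)": (81.62, 84.23, 82.90),
    "MatSciBERT (Gupta)": (81.45, 84.36, 82.88),
    "MatTPUSciBERT": (82.13, 85.15, 83.61),
    "MatBERT (Ceder)": (81.25, 83.99, 82.60),
    "BatteryScibert-cased": (81.09, 84.14, 82.59),
}

if __name__ == "__main__":
    for name, (p, r, reported) in SUPERCONDUCTORS.items():
        print(f"{name}: recomputed {f1_score(p, r):.2f}, reported {reported:.2f}")
```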
NER Quantities
Model | Precision | Recall | F1 |
---|---|---|---|
SciBERT (baseline) | 88.73% | 86.76% | 87.73% |
MatSciBERT (Gupta) | 84.98% | 90.12% | 87.47% |
MatTPUSciBERT | 88.62% | 86.33% | 87.46% |
MatBERT (Ceder) | 85.08% | 89.93% | 87.44% |
BatteryScibert-cased | 85.02% | 89.30% | 87.11% |
References
This work was supported by Google, through the researchers program https://cloud.google.com/edu/researchers
Acknowledgements
TBA