Model Card for INDUS-Small (nasa-smd-ibm-distil-v0.1)

INDUS-Small (nasa-smd-ibm-distil-v0.1) is a distilled version of INDUS (nasa-impact/nasa-smd-ibm-v0.1), a RoBERTa-based, encoder-only transformer model domain-adapted for NASA Science Mission Directorate (SMD) applications. It is fine-tuned on scientific journals and articles relevant to NASA SMD and aims to enhance natural language technologies such as information retrieval and intelligent search.

We trained the smaller model, INDUS_SMALL, with 38M parameters through knowledge distillation, using INDUS as the teacher. INDUS_SMALL follows a 4-layer architecture recommended by the Neural Architecture Search engine (Trivedi et al., 2023), chosen for its optimal trade-off between performance and latency. We adopted the distillation objective proposed in MiniLMv2 (Wang et al., 2021) to transfer fine-grained self-attention relations, an approach that has been shown to be the current state of the art (Udagawa et al., 2023). Using this objective, we trained the model for 500K steps with an effective batch size of 480 on 30 V100 GPUs.
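
The self-attention relation objective mentioned above can be illustrated with a small sketch. This is not the training code used for INDUS_SMALL; it is a minimal pure-Python illustration of the general MiniLMv2 idea, assuming that teacher and student query, key and value vectors are split into the same number of relation heads and compared with a KL divergence. The function names, the number of relation heads and the toy sizes are all hypothetical.

```python
import math
import random


def split_heads(x, num_heads):
    """Split [seq_len][hidden] vectors into [num_heads][seq_len][head_dim]."""
    hidden = len(x[0])
    assert hidden % num_heads == 0, "hidden size must be divisible by num_heads"
    d = hidden // num_heads
    return [[row[h * d:(h + 1) * d] for row in x] for h in range(num_heads)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_relation(x, num_heads):
    """Per-head scaled dot product of x with itself, softmax over each row."""
    relations = []
    for head in split_heads(x, num_heads):
        scale = math.sqrt(len(head[0]))
        relations.append([
            softmax([sum(a * b for a, b in zip(xi, xj)) / scale for xj in head])
            for xi in head
        ])
    return relations  # shape: [num_heads][seq_len][seq_len]


def relation_kl(teacher_x, student_x, num_heads):
    """Mean KL(teacher || student) over relation heads and token positions."""
    total = 0.0
    for t_head, s_head in zip(self_relation(teacher_x, num_heads),
                              self_relation(student_x, num_heads)):
        for t_row, s_row in zip(t_head, s_head):
            total += sum(p * math.log(p / max(q, 1e-12))
                         for p, q in zip(t_row, s_row) if p > 0)
    return total / (num_heads * len(teacher_x))


def distillation_loss(teacher_qkv, student_qkv, num_heads):
    """Sum the relation loss over the Q-Q, K-K and V-V self-relations."""
    return sum(relation_kl(t, s, num_heads)
               for t, s in zip(teacher_qkv, student_qkv))


# Toy example: 3 tokens, teacher hidden size 8, student hidden size 4 (hypothetical).
rng = random.Random(0)
teacher = [[[rng.gauss(0, 1) for _ in range(8)] for _ in range(3)] for _ in range(3)]
student = [[[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)] for _ in range(3)]
loss = distillation_loss(teacher, student, num_heads=2)
```

Because the relation matrices have shape [heads][seq_len][seq_len] regardless of hidden size, the teacher and student can be compared even though their hidden dimensions differ, which is what makes this kind of objective convenient for distilling into a smaller architecture.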

Model Details

Base Model: INDUS
Tokenizer: Custom
Original Version Parameters: 125M
Pretraining Strategy: Masked Language Modeling (MLM)
Distilled Version Parameters: 38M

Training Data

Wikipedia English (Feb 1, 2020)
AGU Publications
AMS Publications
Scientific papers from Astrophysics Data Systems (ADS)
PubMed abstracts
PubMedCentral (PMC) (commercial license subset)

Training Procedure

Framework: fairseq 0.12.1 with PyTorch 1.9.1
transformers Version: 4.2.0
Strategy: Masked Language Modeling (MLM)

Evaluation

BLURB benchmark

(Standard deviation across 10 random seeds in parentheses. Macro avg. is reported across datasets, and micro avg. is computed by averaging scores on each task and then averaging across task averages.)
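
The two averaging schemes described in the note can differ when tasks contain different numbers of datasets. The sketch below illustrates the difference; the task names and scores are made-up placeholders, not results for this model, and the function names are hypothetical.

```python
from statistics import mean


def dataset_level_average(scores_by_task):
    """Average over every dataset score, ignoring task boundaries."""
    all_scores = [s for scores in scores_by_task.values() for s in scores]
    return mean(all_scores)


def task_level_average(scores_by_task):
    """Average the scores within each task, then average the task averages."""
    return mean(mean(scores) for scores in scores_by_task.values())


# Hypothetical scores: one task with two datasets, one task with a single dataset.
example = {"task_a": [80.0, 90.0], "task_b": [60.0]}

across_datasets = dataset_level_average(example)  # (80 + 90 + 60) / 3
across_tasks = task_level_average(example)        # (85 + 60) / 2 = 72.5
```

With these placeholder numbers, the task with more datasets pulls the dataset-level average upward, while the task-level average weights both tasks equally.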

Climate Change NER and NASA-QA benchmarks

(Climate Change NER and NASA-QA benchmark results. Standard deviation over multiple runs given in parentheses.)

Please refer to the following dataset cards for further benchmarks and evaluation:

NASA-IR Benchmark - https://huggingface.co/datasets/nasa-impact/nasa-smd-IR-benchmark
NASA-QA Benchmark - https://huggingface.co/datasets/nasa-impact/nasa-smd-qa-benchmark
Climate Change NER Benchmark - https://huggingface.co/datasets/ibm/Climate-Change-NER

Uses

Named Entity Recognition (NER)
Information Retrieval
Sentence Transformers
Extractive QA

For NASA SMD-related scientific use cases.
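
As an illustration of the information retrieval use case, the sketch below ranks documents by cosine similarity between embeddings. Several things here are assumptions rather than documented behavior: the Hub id is inferred from the model name in the title, mean pooling over the last hidden state is one common way to obtain sentence vectors and is not prescribed by this card, and the helper names are hypothetical. Without task-specific fine-tuning, embedding quality may be limited.

```python
import math

# Assumed Hub id, inferred from the model name; adjust if the repository differs.
MODEL_ID = "nasa-impact/nasa-smd-ibm-distil-v0.1"


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query, documents, embed):
    """Return (score, document) pairs sorted by similarity to the query."""
    vectors = embed([query] + list(documents))
    query_vec, doc_vecs = vectors[0], vectors[1:]
    scored = [(cosine(query_vec, vec), doc) for vec, doc in zip(doc_vecs, documents)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def load_embedder(model_id=MODEL_ID):
    """Build an embed(texts) function from mean-pooled encoder outputs."""
    # Imported lazily so that rank() can be used without these packages installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    def embed(texts):
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**batch).last_hidden_state
        # Zero out padding positions before averaging over the sequence.
        mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
        return pooled.tolist()

    return embed
```

A typical call would be `rank("solar wind composition", documents, load_embedder())`, which downloads the model on first use. Any other function that maps a list of strings to a list of vectors can be passed as `embed` instead.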

Note

This model is released in support of the training and evaluation of the encoder language model "Indus".

The accompanying paper can be found here: https://arxiv.org/abs/2405.10725

Citation

If you find this work useful, please cite it using the following BibTeX entry:

@misc {nasa-impact_2023,
    author       = {Masayasu Muraoka and Bishwaranjan Bhattacharjee and Muthukumaran Ramasubramanian and Iksha Gurung and Rahul Ramachandran and Manil Maskey and Kaylin Bugbee and Rong Zhang and Yousef El Kurdi and Bharath Dandala and Mike Little and Elizabeth Fancher and Lauren Sanders and Sylvain Costes and Sergi Blanco-Cuaresma and Kelly Lockhart and Thomas Allen and Felix Grazes and Megan Ansdell and Alberto Accomazzi and Sanaz Vahidinia and Ryan McGranaghan and Armin Mehrabian and Tsendgar Lee},
    title        = { nasa-smd-ibm-v0.1 (Revision f01d42f) },
    year         = 2023,
    url          = { https://huggingface.co/nasa-impact/nasa-smd-ibm-v0.1 },
    doi          = { 10.57967/hf/1429 },
    publisher    = { Hugging Face }
}

Attribution

IBM Research

Masayasu Muraoka
Bishwaranjan Bhattacharjee
Rong Zhang
Yousef El Kurdi
Bharath Dandala

NASA SMD

Muthukumaran Ramasubramanian
Iksha Gurung
Rahul Ramachandran
Manil Maskey
Kaylin Bugbee
Mike Little
Elizabeth Fancher
Lauren Sanders
Sylvain Costes
Sergi Blanco-Cuaresma
Kelly Lockhart
Thomas Allen
Felix Grazes
Megan Ansdell
Alberto Accomazzi
Sanaz Vahidinia
Ryan McGranaghan
Armin Mehrabian
Tsendgar Lee

Disclaimer

This encoder-only model is currently in an experimental phase. We are working to improve the model's capabilities and performance, and as we progress, we invite the community to engage with this model, provide feedback, and contribute to its evolution.