Commit 9c1989c by Muthukumaran (parent: 0e7fa9a): add readme

README.md (added, 126 lines)

---
license: apache-2.0
language:
- en
library_name: sentence-transformers
tags:
- earth science
- climate
- biology
pipeline_tag: sentence-similarity
---

# Model Card for nasa-smd-ibm-st-v2

`nasa-smd-ibm-st.38m` is a bi-encoder sentence transformer model fine-tuned from the nasa-smd-ibm-v0.1 encoder model. It is a smaller version of `nasa-smd-ibm-st` that achieves better performance with fewer parameters (see Model Details). It was trained on 271 million examples along with a domain-specific dataset of 2.6 million examples from documents curated by the NASA Science Mission Directorate (SMD). With this model, we aim to enhance natural language technologies such as information retrieval and intelligent search as they apply to SMD NLP applications.

## Model Details
- **Base Encoder Model**: nasa-smd-ibm-v0.1
- **Tokenizer**: Custom
- **Parameters**: 38M
- **Training Strategy**: Sentence pairs, each with a score indicating relevance. The model encodes the two sentences of a pair independently and computes their cosine similarity, which is then optimized against the relevance score.

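The training strategy described above can be sketched in plain Python. This is only an illustration: the card does not state the exact loss, so the squared-error objective below is an assumption, and the toy vectors stand in for real encoder outputs. The function names `cosine_similarity` and `pair_loss` are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pair_loss(emb_a, emb_b, relevance):
    """Squared error between cosine similarity and the relevance score (assumed loss)."""
    return (cosine_similarity(emb_a, emb_b) - relevance) ** 2

# Toy embeddings standing in for the two independently encoded sentences.
emb_a = [1.0, 0.0]
emb_b = [1.0, 1.0]

sim = cosine_similarity(emb_a, emb_b)  # angle of 45 degrees between the vectors
loss = pair_loss(emb_a, emb_b, relevance=1.0)
print(f"similarity={sim:.4f} loss={loss:.4f}")
```

During training, a gradient step would move the two embeddings so that the similarity approaches the relevance score; the sketch only shows how a single pair would be scored.
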
## Training Data

![image/png](https://cdn-uploads.huggingface.co/production/uploads/61099e5d86580d4580767226/fcsd0fEY_EoMA1F_CsEbD.png)

Figure: Dataset sources for sentence transformers (269M examples in total)

Additionally, 2.6M abstract + title pairs were collected from NASA SMD documents.

## Training Procedure
- **Framework**: PyTorch 1.9.1
- **sentence-transformers version**: 4.30.2
- **Strategy**: Sentence Pairs

## Evaluation
The following models are evaluated:

1. All-MiniLM-l6-v2 [sentence-transformers/all-MiniLM-L6-v2]
2. BGE-base [BAAI/bge-base-en-v1.5]
3. RoBERTa-base [roberta-base]
4. nasa-smd-ibm-rtvr_v0.1 [nasa-impact/nasa-smd-ibm-st]

![image/png](https://cdn-uploads.huggingface.co/production/uploads/61099e5d86580d4580767226/0e83srGhSH7-n11tezzHV.png)

Figure: BEIR Evaluation Metrics

![image/png](https://cdn-uploads.huggingface.co/production/uploads/61099e5d86580d4580767226/KerkB8PvDDPTcj9JBWtwG.png)

Figure: NASA QA Retrieval Benchmark Evaluation

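The card does not name the metric behind the figures above. BEIR results are commonly reported as nDCG@10, so, under that assumption, the following sketch shows how such a score could be computed for one query. It uses linear gain and, as a simplification, derives the ideal ranking from the retrieved list only; the relevance grades are hypothetical, not results from this model.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k):
    """DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance grades of retrieved passages, in ranked order.
perfect = ndcg_at_k([1, 1, 0, 0], k=10)  # relevant passages ranked first
swapped = ndcg_at_k([0, 1, 0, 0], k=10)  # the only relevant passage at rank 2
print(perfect, swapped)
```

A benchmark score would average this value over all queries in a dataset.
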
## Uses
- Information Retrieval
- Sentence Similarity Search

For NASA SMD-related scientific use cases.

### Usage

```python
from sentence_transformers import SentenceTransformer, util

# Load the model from the Hugging Face Hub
model = SentenceTransformer("nasa-impact/nasa-smd-ibm-st-v2")

# Queries carry the "query: " prefix
input_queries = [
    "query: how much protein should a female eat",
    "query: summit define",
]
input_passages = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. "
    "But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. "
    "Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments.",
]

# Encode queries and passages independently (bi-encoder)
query_embeddings = model.encode(input_queries)
passage_embeddings = model.encode(input_passages)

# Cosine similarity matrix of shape (num_queries, num_passages)
print(util.cos_sim(query_embeddings, passage_embeddings))
```

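For retrieval, the similarity matrix is usually turned into a ranking per query. The sketch below shows one way to do that in plain Python; `rank_passages` is a hypothetical helper, and the scores are made-up values rather than real model output.

```python
def rank_passages(scores, top_k=1):
    """Return, for each query row, the indices of the top_k highest-scoring passages."""
    ranked = []
    for row in scores:
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        ranked.append(order[:top_k])
    return ranked

# Hypothetical similarity values (rows: queries, columns: passages).
example_scores = [
    [0.82, 0.11],
    [0.09, 0.77],
]
best = rank_passages(example_scores, top_k=1)
print(best)
```

The tensor returned by `util.cos_sim` can be converted to the nested list this helper expects with `.tolist()`.
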
## Citation
If you find this work useful, please cite using the following bibtex citation:

```bibtex
@misc {nasa-impact_2024,
    author       = { {NASA-IMPACT} },
    title        = { nasa-smd-ibm-st-v2 (Revision d249d84) },
    year         = 2024,
    url          = { https://huggingface.co/nasa-impact/nasa-smd-ibm-st-v2 },
    doi          = { 10.57967/hf/1800 },
    publisher    = { Hugging Face }
}
```

## Attribution

IBM Research
- Aashka Trivedi
- Masayasu Maraoka
- Bishwaranjan Bhattacharjee

NASA SMD
- Muthukumaran Ramasubramanian
- Iksha Gurung
- Rahul Ramachandran
- Manil Maskey
- Kaylin Bugbee
- Mike Little
- Elizabeth Fancher
- Lauren Sanders
- Sylvain Costes
- Sergi Blanco-Cuaresma
- Kelly Lockhart
- Thomas Allen
- Felix Grazes
- Megan Ansdell
- Alberto Accomazzi
- Sanaz Vahidinia
- Ryan McGranaghan
- Armin Mehrabian
- Tsendgar Lee

## Disclaimer

This sentence-transformer model is currently in an experimental phase. We are working to improve the model's capabilities and performance, and as we progress, we invite the community to engage with this model, provide feedback, and contribute to its evolution.