---
license: apache-2.0
language:
- en
library_name: sentence-transformers
tags:
- earth science
- climate
- biology
pipeline_tag: sentence-similarity
---

# Model Card for nasa-smd-ibm-st-v2

`nasa-smd-ibm-st.38m` is a bi-encoder sentence transformer model fine-tuned from the nasa-smd-ibm-v0.1 encoder model. It is a smaller version of `nasa-smd-ibm-st` that achieves better performance with fewer parameters (shown below). It was trained on 271 million examples along with a domain-specific dataset of 2.6 million examples from documents curated by the NASA Science Mission Directorate (SMD). With this model, we aim to enhance natural language technologies such as information retrieval and intelligent search as they apply to SMD NLP applications.

## Model Details
- **Base Encoder Model**: nasa-smd-ibm-v0.1
- **Tokenizer**: Custom
- **Parameters**: 38M
- **Training Strategy**: Sentence pairs with a score indicating their relevance. The model encodes the two sentences of each pair independently, computes the cosine similarity of the embeddings, and optimizes that similarity against the relevance score.
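
The cosine-similarity scoring at the heart of this training strategy can be sketched in plain Python (an illustrative stand-alone version, not taken from the model's training code):

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Parallel embeddings score ~1.0; orthogonal embeddings score 0.0.
print(round(cos_sim([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 6))  # 1.0
print(round(cos_sim([1.0, 0.0], [0.0, 1.0]), 6))            # 0.0
```

During training, pairs with higher relevance scores are pushed toward higher cosine similarity; `sentence_transformers.util.cos_sim` computes the same quantity over whole batches of embeddings.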

## Training Data

![image/png](https://cdn-uploads.huggingface.co/production/uploads/61099e5d86580d4580767226/fcsd0fEY_EoMA1F_CsEbD.png)

Figure: Dataset sources for sentence transformers (269M in total)

Additionally, 2.6M abstract-title pairs were collected from NASA SMD documents.

## Training Procedure
- **Framework**: PyTorch 1.9.1
- **sentence-transformers version**: 4.30.2
- **Strategy**: Sentence Pairs

## Evaluation
The following models were evaluated:

1. all-MiniLM-L6-v2 [sentence-transformers/all-MiniLM-L6-v2]
2. BGE-base [BAAI/bge-base-en-v1.5]
3. RoBERTa-base [roberta-base]
4. nasa-smd-ibm-rtvr_v0.1 [nasa-impact/nasa-smd-ibm-st]

![image/png](https://cdn-uploads.huggingface.co/production/uploads/61099e5d86580d4580767226/0e83srGhSH7-n11tezzHV.png)

Figure: BEIR Evaluation Metrics

![image/png](https://cdn-uploads.huggingface.co/production/uploads/61099e5d86580d4580767226/KerkB8PvDDPTcj9JBWtwG.png)

Figure: NASA QA Retrieval Benchmark Evaluation

## Uses
- Information Retrieval
- Sentence Similarity Search

Intended for NASA SMD-related scientific use cases.

### Usage

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nasa-impact/nasa-smd-ibm-st-v2")

input_queries = [
    'query: how much protein should a female eat',
    'query: summit define',
]
input_passages = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments.",
]

query_embeddings = model.encode(input_queries)
passage_embeddings = model.encode(input_passages)

print(util.cos_sim(query_embeddings, passage_embeddings))
```

## Citation
If you find this work useful, please cite using the following BibTeX citation:

```bibtex
@misc{nasa-impact_2024,
  author    = {{NASA-IMPACT}},
  title     = {nasa-smd-ibm-st-v2 (Revision d249d84)},
  year      = 2024,
  url       = {https://huggingface.co/nasa-impact/nasa-smd-ibm-st-v2},
  doi       = {10.57967/hf/1800},
  publisher = {Hugging Face}
}
```

## Attribution

IBM Research
- Aashka Trivedi
- Masayasu Maraoka
- Bishwaranjan Bhattacharjee

NASA SMD
- Muthukumaran Ramasubramanian
- Iksha Gurung
- Rahul Ramachandran
- Manil Maskey
- Kaylin Bugbee
- Mike Little
- Elizabeth Fancher
- Lauren Sanders
- Sylvain Costes
- Sergi Blanco-Cuaresma
- Kelly Lockhart
- Thomas Allen
- Felix Grazes
- Megan Ansdell
- Alberto Accomazzi
- Sanaz Vahidinia
- Ryan McGranaghan
- Armin Mehrabian
- Tsendgar Lee

## Disclaimer

This sentence-transformer model is currently in an experimental phase. We are working to improve the model's capabilities and performance, and as we progress, we invite the community to engage with this model, provide feedback, and contribute to its evolution.