Muthukumaran committed
Commit 0c6fde4
1 Parent(s): d7d2441

Update README.md

Files changed (1): README.md (+116 -1)

README.md CHANGED

---
license: apache-2.0
language:
- en
library_name: sentence-transformers
tags:
- earth science
- climate
- biology
pipeline_tag: sentence-similarity
---

# Model Card for nasa-smd-ibm-st-v2

`nasa-smd-ibm-st-v2` is an improved version of the bi-encoder sentence transformer model `nasa-smd-ibm-st`, fine-tuned from the `nasa-smd-ibm-v0.1` encoder model. It is trained on 271 million examples along with a domain-specific dataset of 2.6 million examples from documents curated by the NASA Science Mission Directorate (SMD). With this model, we aim to enhance natural language technologies such as information retrieval and intelligent search as they apply to SMD NLP applications.

## Model Details
- **Base Model**: nasa-smd-ibm-v0.1
- **Tokenizer**: Custom
- **Parameters**: 125M
- **Training Strategy**: Sentence pairs with a score indicating relevance. The model encodes the two sentences of each pair independently, computes the cosine similarity between the embeddings, and optimizes that similarity against the relevance score (see the training sketch after this list).
- **Distilled Version**: A distilled version of the model (30 million parameters) can be downloaded here: https://drive.google.com/file/d/19s2Vv9WlmlRhh_AhzdP-s__0spQCG8cQ/view?usp=sharing
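
The bullet above describes a standard bi-encoder regression setup. The sketch below is a minimal, illustrative reproduction of that training loop with the `sentence-transformers` library; the example pairs, scores, and the `path_to_base_encoder` placeholder are assumptions, not the authors' actual training data or configuration.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Hypothetical (sentence, sentence, relevance score) training examples
train_examples = [
    InputExample(texts=["solar wind speed measurements", "in-situ heliophysics observations"], label=0.9),
    InputExample(texts=["solar wind speed measurements", "ocean salinity climatology"], label=0.1),
]

# Wrap the base encoder as a sentence transformer (replace the placeholder with a real model path)
model = SentenceTransformer("path_to_base_encoder")

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# Cosine similarity of the two independently encoded sentences is regressed to the relevance score
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```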

## Training Data

![image/png](https://cdn-uploads.huggingface.co/production/uploads/61099e5d86580d4580767226/ZjcHW24iKsvUYBhoL7eMM.png)

Figure: Open dataset sources for sentence transformers (269M examples in total)

Additionally, 2.6M abstract and title pairs were collected from NASA SMD documents.

## Training Procedure
- **Framework**: PyTorch 1.9.1
- **sentence-transformers version**: 4.30.2
- **Strategy**: Sentence pairs

## Evaluation
The following models were evaluated:

1. All-MiniLM-L6-v2 [sentence-transformers/all-MiniLM-L6-v2]
2. BGE-base [BAAI/bge-base-en-v1.5]
3. RoBERTa-base [roberta-base]
4. nasa-smd-ibm-rtvr_v0.1 [nasa-impact/nasa-smd-ibm-st]

![image/png](https://cdn-uploads.huggingface.co/production/uploads/61099e5d86580d4580767226/QvuEkZJjDGNllRyzl3Oh6.png)

Figure: BEIR Evaluation Metrics

![image/png](https://cdn-uploads.huggingface.co/production/uploads/61099e5d86580d4580767226/J3iuPWaGp_qTbllPFpchi.png)

Figure: Retrieval Benchmark Evaluation
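
An evaluation of this kind can be run with the `beir` toolkit; the sketch below is an assumed setup (the SciFact dataset, batch size, and `path_to_model` placeholder are illustrative choices, not the authors' exact protocol).

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download one BEIR dataset (SciFact chosen only as an example)
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Wrap the sentence transformer as a dense retriever scored by cosine similarity
retriever = EvaluateRetrieval(DRES(models.SentenceBERT("path_to_model"), batch_size=64),
                              score_function="cos_sim")

results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)  # nDCG@k, the headline BEIR metric
```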

## Uses
- Information Retrieval
- Sentence Similarity Search
- Retrieval Augmented Generation

Intended for NASA SMD-related scientific use cases.

### Usage

```python
from sentence_transformers import SentenceTransformer, util

# Load the bi-encoder; replace 'path_to_model' with a local path or Hugging Face repo id
model = SentenceTransformer('path_to_model')

input_queries = [
    'query: how much protein should a female eat',
    'query: summit define',
]
input_passages = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. "
    "But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. "
    "Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. "
    ": 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments.",
]

# Encode queries and passages independently, then score every query-passage pair with cosine similarity
query_embeddings = model.encode(input_queries)
passage_embeddings = model.encode(input_passages)
print(util.cos_sim(query_embeddings, passage_embeddings))
```
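
For retrieval-style use (including as the retriever in a retrieval augmented generation pipeline), the same embeddings can be ranked per query with `util.semantic_search`; this short follow-up is a sketch that reuses the variables from the block above, with `top_k=2` chosen only for illustration.

```python
# Rank the passages for each query and keep the top-k hits
hits = util.semantic_search(query_embeddings, passage_embeddings, top_k=2)

for query, query_hits in zip(input_queries, hits):
    print(query)
    for hit in query_hits:
        # Each hit carries the passage index ('corpus_id') and its cosine similarity score
        print(f"  passage {hit['corpus_id']}: score={hit['score']:.3f}")
```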

## Citation
If you find this work useful, please cite it using the following BibTeX entry:

```bibtex
@misc{nasa-impact_2023,
  author    = { Aashka Trivedi and Bishwaranjan Bhattacharjee and Muthukumaran Ramasubramanian and Iksha Gurung and Masayasu Maraoka and Rahul Ramachandran and Manil Maskey and Kaylin Bugbee and Mike Little and Elizabeth Fancher and Lauren Sanders and Sylvain Costes and Sergi Blanco-Cuaresma and Kelly Lockhart and Thomas Allen and Felix Grazes and Megan Ansdell and Alberto Accomazzi and Sanaz Vahidinia and Ryan McGranaghan and Armin Mehrabian and Tsendgar Lee },
  title     = { nasa-smd-ibm-st (Revision 08ac2b4) },
  year      = 2023,
  url       = { https://huggingface.co/nasa-impact/nasa-smd-ibm-st },
  doi       = { 10.57967/hf/1441 },
  publisher = { Hugging Face }
}
```

## Attribution

IBM Research
- Aashka Trivedi
- Masayasu Maraoka
- Bishwaranjan Bhattacharjee

NASA SMD
- Muthukumaran Ramasubramanian
- Iksha Gurung
- Rahul Ramachandran
- Manil Maskey
- Kaylin Bugbee
- Mike Little
- Elizabeth Fancher
- Lauren Sanders
- Sylvain Costes
- Sergi Blanco-Cuaresma
- Kelly Lockhart
- Thomas Allen
- Felix Grazes
- Megan Ansdell
- Alberto Accomazzi
- Sanaz Vahidinia
- Ryan McGranaghan
- Armin Mehrabian
- Tsendgar Lee

## Disclaimer

This sentence-transformer model is currently in an experimental phase. We are working to improve the model's capabilities and performance, and as we progress, we invite the community to engage with this model, provide feedback, and contribute to its evolution.