Committed by Luca Foppiano
Commit 4e6f989
1 Parent(s): 048eb6f

Fix typo, acknowledge more contributors

README.md CHANGED
@@ -19,13 +19,13 @@ license: apache-2.0
19   ## Introduction
20
21   Question/Answering on scientific documents using LLMs: ChatGPT-3.5-turbo, Mistral-7b-instruct and Zephyr-7b-beta.
22 - The streamlit application
23 -
24 - We target only the full-text using [Grobid](https://github.com/kermitt2/grobid)
25
26   Additionally, this frontend provides the visualisation of named entities on LLM responses to extract <span style="color:yellow">physical quantities, measurements</span> (with [grobid-quantities](https://github.com/kermitt2/grobid-quantities)) and <span style="color:blue">materials</span> mentions (with [grobid-superconductors](https://github.com/lfoppiano/grobid-superconductors)).
27
28 - The conversation is kept in memory
29
30   (The image on the right was generated with https://huggingface.co/spaces/stabilityai/stable-diffusion)
31

@@ -35,9 +35,9 @@ The conversation is kept in memory up by a buffered sliding window memory (top 4
35
36   ## Getting started
37
38 - - Select the model+embedding combination you want
39   - Enter your API Key ([Open AI](https://platform.openai.com/account/api-keys) or [Huggingface](https://huggingface.co/docs/hub/security-tokens)).
40 - - Upload a scientific article as PDF document. You will see a spinner or loading indicator while the processing is in progress.
41   - Once the spinner stops, you can proceed to ask your questions
42
43   ![screenshot2.png](docs%2Fimages%2Fscreenshot2.png)

@@ -53,9 +53,9 @@ With default settings, each question uses around 1000 tokens.
53
54   ### Chunks size
55   When uploaded, each document is split into blocks of a determined size (250 tokens by default).
56 - This setting
57 - Smaller blocks will result in smaller context, yielding more precise sections of the document.
58 - Larger blocks will result in larger context less constrained around the question.
59
60   ### Query mode
61   Indicates whether the question is sent to the LLM (Large Language Model) or to the vector storage.

@@ -65,7 +65,7 @@ Indicates whether sending a question to the LLM (Language Model) or to the vecto
65   ### NER (Named Entity Recognition)
66
67   This feature is specifically crafted for people working with scientific documents in materials science.
68 - It enables to run NER on the response from the LLM, to identify materials mentions and properties (quantities,
69   This feature leverages both [grobid-quantities](https://github.com/kermitt2/grobid-quantities) and [grobid-superconductors](https://github.com/lfoppiano/grobid-superconductors) external services.
70
71

@@ -78,7 +78,9 @@ To release a new version:
78
79   To use docker:
80
81 - - docker run `lfoppiano/document-insights-qa:
82
83   To install the library with PyPI:
84

@@ -88,6 +90,9 @@ To install the library with Pypi:
88   ## Acknowledgement
89
90   This project is developed at the [National Institute for Materials Science](https://www.nims.go.jp) (NIMS) in Japan in collaboration with the [Lambard-ML-Team](https://github.com/Lambard-ML-Team).
91
92
93

19   ## Introduction
20
21   Question/Answering on scientific documents using LLMs: ChatGPT-3.5-turbo, Mistral-7b-instruct and Zephyr-7b-beta.
22 + This Streamlit application demonstrates an implementation of RAG (Retrieval Augmented Generation) on scientific documents that we are developing at NIMS (National Institute for Materials Science) in Tsukuba, Japan.
23 + Unlike most similar projects, we focus on scientific articles.
24 + We target only the full text, extracted with [Grobid](https://github.com/kermitt2/grobid), which provides cleaner results than a raw PDF2Text converter (which is comparable to what most other solutions use).
25
26   Additionally, this frontend provides the visualisation of named entities on LLM responses to extract <span style="color:yellow">physical quantities, measurements</span> (with [grobid-quantities](https://github.com/kermitt2/grobid-quantities)) and <span style="color:blue">materials</span> mentions (with [grobid-superconductors](https://github.com/lfoppiano/grobid-superconductors)).
27
28 + The conversation is kept in memory by a buffered sliding window memory (the 4 most recent messages), and the messages are injected into the context as "previous messages".
29
30   (The image on the right was generated with https://huggingface.co/spaces/stabilityai/stable-diffusion)
31

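
The buffered sliding-window memory mentioned in the Introduction can be pictured with a minimal Python sketch. This is not the project's code: the class `SlidingWindowMemory`, its method names, and the prompt layout are hypothetical, chosen only to illustrate keeping the 4 most recent messages and injecting them into the context as "previous messages".

```python
from collections import deque


class SlidingWindowMemory:
    """Keep only the most recent messages (hypothetical helper, not the project's API)."""

    def __init__(self, window_size=4):
        # A deque with maxlen discards the oldest entry once it is full.
        self._messages = deque(maxlen=window_size)

    def add(self, message):
        """Record a new message, evicting the oldest one if the window is full."""
        self._messages.append(message)

    def messages(self):
        """Return the retained messages, oldest first."""
        return list(self._messages)

    def build_prompt(self, context, question):
        """Assemble a prompt (assumed layout) with the retained messages."""
        previous = "\n".join(self._messages)
        return (
            f"Context:\n{context}\n\n"
            f"Previous messages:\n{previous}\n\n"
            f"Question: {question}"
        )


memory = SlidingWindowMemory(window_size=4)
for i in range(1, 7):
    memory.add(f"message {i}")
print(memory.messages())  # messages 3 to 6 remain; 1 and 2 were dropped
```

A bounded `deque` keeps the eviction logic out of the calling code: older turns fall out of the window automatically, which caps how much conversation history is added to each prompt.
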
35
36   ## Getting started
37
38 + - Select the model+embedding combination you want to use
39   - Enter your API Key ([Open AI](https://platform.openai.com/account/api-keys) or [Huggingface](https://huggingface.co/docs/hub/security-tokens)).
40 + - Upload a scientific article as a PDF document. You will see a spinner or loading indicator while the processing is in progress.
41   - Once the spinner stops, you can proceed to ask your questions
42
43   ![screenshot2.png](docs%2Fimages%2Fscreenshot2.png)

53
54   ### Chunks size
55   When uploaded, each document is split into blocks of a determined size (250 tokens by default).
56 + This setting allows users to modify the size of such blocks.
57 + Smaller blocks will result in a smaller context, yielding more precise sections of the document.
58 + Larger blocks will result in a larger context that is less constrained around the question.
59
60   ### Query mode
61   Indicates whether the question is sent to the LLM (Large Language Model) or to the vector storage.

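
The effect of the chunk-size setting described above can be illustrated with a small Python sketch. It is only an approximation under stated assumptions: the function `split_into_chunks` is hypothetical, and whitespace-separated words stand in for tokens, whereas the application presumably counts tokens with a real tokenizer.

```python
def split_into_chunks(text, chunk_size=250):
    """Split text into consecutive blocks of at most chunk_size tokens.

    Assumption: whitespace-separated words are used as a stand-in for tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), chunk_size)
    ]


# A synthetic 1000-word document to compare two settings.
document = " ".join(f"w{i}" for i in range(1000))
small = split_into_chunks(document, chunk_size=100)
large = split_into_chunks(document, chunk_size=500)
print(len(small), len(large))  # 10 2
```

With the smaller setting, each retrieved block covers a narrower part of the document; with the larger one, each block carries more surrounding text into the context.
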
65   ### NER (Named Entity Recognition)
66
67   This feature is specifically crafted for people working with scientific documents in materials science.
68 + It runs NER on the response from the LLM to identify mentions of materials and properties (quantities, measurements).
69   This feature leverages both [grobid-quantities](https://github.com/kermitt2/grobid-quantities) and [grobid-superconductors](https://github.com/lfoppiano/grobid-superconductors) external services.
70
71

78
79   To use docker:
80
81 + - docker run `lfoppiano/document-insights-qa:{latest_version}`
82 +
83 + - docker run `lfoppiano/document-insights-qa:latest-develop` for the latest development version
84
85   To install the library with PyPI:
86

90   ## Acknowledgement
91
92   This project is developed at the [National Institute for Materials Science](https://www.nims.go.jp) (NIMS) in Japan in collaboration with the [Lambard-ML-Team](https://github.com/Lambard-ML-Team).
93 + Contributed by Pedro Ortiz Suarez (@pjox) and Tomoya Mato (@t29mato).
94 + Thanks also to [Patrice Lopez](https://www.science-miner.com), the author of [Grobid](https://github.com/kermitt2/grobid).
95 +
96
97
98