Spaces: lfoppiano/document-qa

Luca Foppiano committed on Nov 29, 2023

Commit 4e6f989 • 1 Parent(s): 048eb6f

Fix typo, acknowledge more contributors

README.md +16 -11

@@ -19,13 +19,13 @@ license: apache-2.0
 ## Introduction
 Question/Answering on scientific documents using LLMs: ChatGPT-3.5-turbo, Mistral-7b-instruct and Zephyr-7b-beta.
-The streamlit application demonstrate the implementaiton of a RAG (Retrieval Augmented Generation) on scientific documents, that we are developing at NIMS (National Institute for Materials Science), in Tsukuba, Japan.
-Differently to most of the projects, we focus on scientific articles.
-We target only the full-text using [Grobid](https://github.com/kermitt2/grobid) that provide and cleaner results than the raw PDF2Text converter (which is comparable with most of other solutions).
 Additionally, this frontend provides the visualisation of named entities on LLM responses to extract <span stype="color:yellow">physical quantities, measurements</span> (with [grobid-quantities](https://github.com/kermitt2/grobid-quantities)) and <span stype="color:blue">materials</span> mentions (with [grobid-superconductors](https://github.com/lfoppiano/grobid-superconductors)).
-The conversation is kept in memory up by a buffered sliding window memory (top 4 more recent messages) and the messages are injected in the context as "previous messages".
 (The image on the right was generated with https://huggingface.co/spaces/stabilityai/stable-diffusion)
@@ -35,9 +35,9 @@ The conversation is kept in memory up by a buffered sliding window memory (top 4
 ## Getting started
-- Select the model+embedding combination you want ot use
 - Enter your API Key ([Open AI](https://platform.openai.com/account/api-keys) or [Huggingface](https://huggingface.co/docs/hub/security-tokens)).
-- Upload a scientific article as PDF document. You will see a spinner or loading indicator while the processing is in progress.
 - Once the spinner stops, you can proceed to ask your questions
  ![screenshot2.png](docs%2Fimages%2Fscreenshot2.png)
@@ -53,9 +53,9 @@ With default settings, each question uses around 1000 tokens.
 ### Chunks size
 When uploaded, each document is split into blocks of a determined size (250 tokens by default).
-This setting allow users to modify the size of such blocks.
-Smaller blocks will result in smaller context, yielding more precise sections of the document.
-Larger blocks will result in larger context less constrained around the question.
 ### Query mode
 Indicates whether the question is sent to the LLM (Large Language Model) or to the vector storage.
@@ -65,7 +65,7 @@ Indicates whether sending a question to the LLM (Language Model) or to the vecto
 ### NER (Named Entities Recognition)
 This feature is specifically crafted for people working with scientific documents in materials science.
-It enables to run NER on the response from the LLM, to identify materials mentions and properties (quantities, masurements).
 This feature leverages both [grobid-quantities](https://github.com/kermitt2/grobid-quantities) and [grobid-superconductors](https://github.com/lfoppiano/grobid-superconductors) external services.
@@ -78,7 +78,9 @@ To release a new version:
 To use docker:
-- docker run `lfoppiano/document-insights-qa:latest`
 To install the library with Pypi:
@@ -88,6 +90,9 @@ To install the library with Pypi:
 ## Acknowledgement
 This project is developed at the [National Institute for Materials Science](https://www.nims.go.jp) (NIMS) in Japan in collaboration with the [Lambard-ML-Team](https://github.com/Lambard-ML-Team).

 ## Introduction
 Question/Answering on scientific documents using LLMs: ChatGPT-3.5-turbo, Mistral-7b-instruct and Zephyr-7b-beta.
+The Streamlit application demonstrates the implementation of RAG (Retrieval Augmented Generation) on scientific documents, which we are developing at NIMS (National Institute for Materials Science), in Tsukuba, Japan.
+Unlike most similar projects, we focus on scientific articles.
+We target only the full text, extracted using [Grobid](https://github.com/kermitt2/grobid), which provides cleaner results than the raw PDF2Text converter (which is comparable with most other solutions).
 Additionally, this frontend provides the visualisation of named entities on LLM responses to extract <span stype="color:yellow">physical quantities, measurements</span> (with [grobid-quantities](https://github.com/kermitt2/grobid-quantities)) and <span stype="color:blue">materials</span> mentions (with [grobid-superconductors](https://github.com/lfoppiano/grobid-superconductors)).
+The conversation is kept in a buffered sliding-window memory (the 4 most recent messages), and these messages are injected into the context as "previous messages".
 (The image on the right was generated with https://huggingface.co/spaces/stabilityai/stable-diffusion)
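
Editor's note: the sliding-window memory described in the Introduction can be illustrated with a minimal standard-library sketch. It is not the application's actual code: the class name `SlidingWindowMemory`, its methods, and the layout of the "previous messages" block are hypothetical and chosen only for illustration.

```python
from collections import deque


class SlidingWindowMemory:
    """Keep only the k most recent messages (hypothetical helper, not the app's code)."""

    def __init__(self, k: int = 4):
        # A deque with maxlen drops the oldest entry automatically once it is full.
        self.messages = deque(maxlen=k)

    def add(self, role: str, text: str) -> None:
        self.messages.append((role, text))

    def as_context(self) -> str:
        """Render the current window as a 'previous messages' block for the prompt."""
        lines = [f"{role}: {text}" for role, text in self.messages]
        return "Previous messages:\n" + "\n".join(lines)


memory = SlidingWindowMemory(k=4)
for i in range(1, 7):
    memory.add("user", f"question {i}")

# Only the last four messages (question 3 to question 6) remain in the window.
print(memory.as_context())
```

With a window of 4, earlier turns are forgotten, so a follow-up question can only rely on what the last four messages contain.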
 ## Getting started
+- Select the model+embedding combination you want to use
 - Enter your API Key ([Open AI](https://platform.openai.com/account/api-keys) or [Huggingface](https://huggingface.co/docs/hub/security-tokens)).
+- Upload a scientific article as a PDF document. You will see a spinner or loading indicator while the processing is in progress.
 - Once the spinner stops, you can proceed to ask your questions
  ![screenshot2.png](docs%2Fimages%2Fscreenshot2.png)
 ### Chunks size
 When uploaded, each document is split into blocks of a determined size (250 tokens by default).
+This setting allows users to modify the size of such blocks.
+Smaller blocks will result in a smaller context, yielding more precise sections of the document.
+Larger blocks will result in a larger context less constrained around the question.
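
Editor's note: the effect of the chunk size setting can be sketched as below. This is an illustration under a loud assumption: tokens are approximated by whitespace-separated words, whereas the application presumably counts model tokens. The function name `split_into_chunks` is hypothetical.

```python
def split_into_chunks(text: str, chunk_size: int = 250) -> list[str]:
    """Split text into consecutive blocks of at most chunk_size tokens.

    Assumption: a "token" here is a whitespace-separated word, which only
    approximates the tokenizer a real LLM pipeline would use.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]


# A synthetic 600-word document.
document = " ".join(f"w{i}" for i in range(600))

small = split_into_chunks(document, chunk_size=100)  # more, narrower blocks
default = split_into_chunks(document)                # fewer, broader blocks

print(len(small), len(default))
```

Smaller chunks give the retriever more, narrower candidates to match against the question; larger chunks give fewer candidates that each carry more surrounding text.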
 ### Query mode
 Indicates whether the question is sent to the LLM (Large Language Model) or to the vector storage.
 ### NER (Named Entities Recognition)
 This feature is specifically crafted for people working with scientific documents in materials science.
+It runs NER on the response from the LLM to identify material mentions and properties (quantities, measurements).
 This feature leverages both [grobid-quantities](https://github.com/kermitt2/grobid-quantities) and [grobid-superconductors](https://github.com/lfoppiano/grobid-superconductors) external services.
 To use docker:
+- docker run `lfoppiano/document-insights-qa:{latest_version}`
+- docker run `lfoppiano/document-insights-qa:latest-develop` for the latest development version
 To install the library with Pypi:
 ## Acknowledgement
 This project is developed at the [National Institute for Materials Science](https://www.nims.go.jp) (NIMS) in Japan in collaboration with the [Lambard-ML-Team](https://github.com/Lambard-ML-Team).
+Contributed by Pedro Ortiz Suarez (@pjox) and Tomoya Mato (@t29mato).
+Thanks also to [Patrice Lopez](https://www.science-miner.com), the author of [Grobid](https://github.com/kermitt2/grobid).