Merge branch 'main' into pdf-render
Files changed:
- CHANGELOG.md (+31 -10)
- README.md (+7 -3)
- document_qa/document_qa_engine.py (+66 -25)
- document_qa/grobid_processors.py (+1 -1)
- pyproject.toml (+1 -1)
- streamlit_app.py (+15 -11)
CHANGELOG.md
CHANGED

Updated content:

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

## [0.3.1] - 2023-11-22

### Added

+ Include biblio in embeddings by @lfoppiano in #21

### Fixed

+ Fix conversational memory by @lfoppiano in #20

## [0.3.0] - 2023-11-18

### Added

+ Add zephyr-7b by @lfoppiano in #15
+ Add conversational memory in #18

## [0.2.1] - 2023-11-01

### Fixed

+ Fix env variables by @lfoppiano in #9

## [0.2.0] - 2023-10-31

### Added

+ Selection of the chunk size on which embeddings are created
+ Mistral model, usable freely via the Huggingface free API

### Changed

+ Improved documentation, adding a privacy statement
+ Moved settings to the sidebar
+ Disabled NER extraction by default, allowing the user to activate it
+ Read the API key from the environment variables and, if present, avoid asking the user for it
+ Avoid changing the model after an update

## [0.1.3] - 2023-10-30

### Fixed

+ ChromaDb accumulating information even when new papers were uploaded

## [0.1.2] - 2023-10-26

[...]

### Fixed

+ GitHub action build
+ Dependencies of langchain and chromadb

## [0.1.0] - 2023-10-26

[...]

+ Kick off application
+ Support for GPT-3.5
+ Support for Mistral + SentenceTransformer
+ Streamlit application
+ Docker image
+ PyPI package

<!-- markdownlint-disable-file MD024 MD033 -->
README.md
CHANGED

Updated content:

**Work in progress** :construction_worker:

<img src="https://github.com/lfoppiano/document-qa/assets/15426/f0a04a86-96b3-406e-8303-904b93f00015" width=300 align="right" />

## Introduction

Question/Answering on scientific documents using LLMs: ChatGPT-3.5-turbo, Mistral-7b-instruct and Zephyr-7b-beta.

[...]

Additionally, this frontend provides the visualisation of named entities on LLM responses to extract <span style="color:yellow">physical quantities, measurements</span> (with [grobid-quantities](https://github.com/kermitt2/grobid-quantities)) and <span style="color:blue">materials</span> mentions (with [grobid-superconductors](https://github.com/lfoppiano/grobid-superconductors)).

The conversation is kept in memory by a buffered sliding window (the 4 most recent messages), and these messages are injected into the context as "previous messages".

(The image on the right was generated with https://huggingface.co/spaces/stabilityai/stable-diffusion)

**Demos**:
- (stable version): https://lfoppiano-document-qa.hf.space/
- (unstable version): https://document-insights.streamlit.app/

## Getting started

[...]
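The sliding-window memory described in the README above maps naturally onto LangChain's `ConversationBufferWindowMemory`. The following is a minimal, hypothetical sketch, not part of the commit; the class and the `k=4` window are assumptions based on the description, and the example question/answer strings are invented:

```python
# Minimal sketch of a 4-message sliding-window memory (assumed wiring).
from langchain.memory import ConversationBufferWindowMemory

memory = ConversationBufferWindowMemory(k=4)  # keep only the 4 most recent exchanges

# After each question/answer turn, the exchange is stored in the window...
memory.save_context({"input": "What is the main finding of the paper?"},
                    {"output": "An illustrative answer produced by the LLM."})

# ...and the buffered window can be rendered as text and injected into the
# prompt as "previous messages".
print(memory.buffer_as_str)
```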
document_qa/document_qa_engine.py
CHANGED

Updated content:

from pathlib import Path
from typing import Union, Any

from document_qa.grobid_processors import GrobidProcessor
from grobid_client.grobid_client import GrobidClient
from langchain.chains import create_extraction_chain, ConversationChain, ConversationalRetrievalChain
from langchain.chains.question_answering import load_qa_chain, stuff_prompt, refine_prompts, map_reduce_prompt, \
    map_rerank_prompt
from langchain.prompts import SystemMessagePromptTemplate, HumanMessagePromptTemplate, ChatPromptTemplate
from langchain.retrievers import MultiQueryRetriever
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from tqdm import tqdm


class DocumentQAEngine:
    llm = None

[...]

    embeddings_map_from_md5 = {}
    embeddings_map_to_md5 = {}

    default_prompts = {
        'stuff': stuff_prompt,
        'refine': refine_prompts,
        "map_reduce": map_reduce_prompt,
        "map_rerank": map_rerank_prompt
    }

    def __init__(self,
                 llm,
                 embedding_function,
                 qa_chain_type="stuff",
                 embeddings_root_path=None,
                 grobid_url=None,
                 memory=None
                 ):
        self.embedding_function = embedding_function
        self.llm = llm
        self.memory = memory
        self.chain = load_qa_chain(llm, chain_type=qa_chain_type)

        if embeddings_root_path is not None:

[...]

        return self.embeddings_map_from_md5[md5]

    def query_document(self, query: str, doc_id, output_parser=None, context_size=4, extraction_schema=None,
                       verbose=False) -> (
            Any, str):
        # self.load_embeddings(self.embeddings_root_path)

        if verbose:
            print(query)

        response = self._run_query(doc_id, query, context_size=context_size)
        response = response['output_text'] if 'output_text' in response else response

        if verbose:

[...]

        return parsed_output

    def _run_query(self, doc_id, query, context_size=4):
        relevant_documents = self._get_context(doc_id, query, context_size)
        response = self.chain.run(input_documents=relevant_documents,
                                  question=query)

        if self.memory:
            self.memory.save_context({"input": query}, {"output": response})
        return response

    def _get_context(self, doc_id, query, context_size=4):
        db = self.embeddings_dict[doc_id]
        retriever = db.as_retriever(search_kwargs={"k": context_size})
        relevant_documents = retriever.get_relevant_documents(query)
        if self.memory and len(self.memory.buffer_as_messages) > 0:
            relevant_documents.append(
                Document(
                    page_content="""Following, the previous question and answers. Use these information only when in the question there are unspecified references:\n{}\n\n""".format(
                        self.memory.buffer_as_str))
            )
        return relevant_documents

    def get_all_context_by_document(self, doc_id):

[...]

        relevant_documents = multi_query_retriever.get_relevant_documents(query)
        return relevant_documents

    def get_text_from_document(self, pdf_file_path, chunk_size=-1, perc_overlap=0.1, include=(), verbose=False):
        """
        Extract text from documents using Grobid, if chunk_size is < 0 it keeps each paragraph separately
        """
        if verbose:
            print("File", pdf_file_path)
        filename = Path(pdf_file_path).stem

[...]

        texts = []
        metadatas = []
        ids = []

        if chunk_size < 0:
            for passage in structure['passages']:
                biblio_copy = copy.copy(biblio)

[...]

        metadatas = [biblio for _ in range(len(texts))]
        ids = [id for id, t in enumerate(texts)]

        if "biblio" in include:
            biblio_metadata = copy.copy(biblio)
            biblio_metadata['type'] = "biblio"
            biblio_metadata['section'] = "header"
            for key in ['title', 'authors', 'publication_year']:
                if key in biblio_metadata:
                    texts.append("{}: {}".format(key, biblio_metadata[key]))
                    metadatas.append(biblio_metadata)
                    ids.append(key)

        return texts, metadatas, ids

    def create_memory_embeddings(self, pdf_path, doc_id=None, chunk_size=500, perc_overlap=0.1, include_biblio=False):
        include = ["biblio"] if include_biblio else []
        texts, metadata, ids = self.get_text_from_document(
            pdf_path,
            chunk_size=chunk_size,
            perc_overlap=perc_overlap,
            include=include)
        if doc_id:
            hash = doc_id
        else:
            hash = metadata[0]['hash']

        if hash not in self.embeddings_dict.keys():
            self.embeddings_dict[hash] = Chroma.from_texts(texts,
                                                           embedding=self.embedding_function,
                                                           metadatas=metadata,
                                                           collection_name=hash)
        else:
            # if 'documents' in self.embeddings_dict[hash].get() and len(self.embeddings_dict[hash].get()['documents']) == 0:
            #     self.embeddings_dict[hash].delete(ids=self.embeddings_dict[hash].get()['ids'])
            self.embeddings_dict[hash].delete_collection()
            self.embeddings_dict[hash] = Chroma.from_texts(texts,
                                                           embedding=self.embedding_function,
                                                           metadatas=metadata,
                                                           collection_name=hash)

        self.embeddings_root_path = None

        return hash

    def create_embeddings(self, pdfs_dir_path: Path, chunk_size=500, perc_overlap=0.1, include_biblio=False):
        input_files = []
        for root, dirs, files in os.walk(pdfs_dir_path, followlinks=False):
            for file_ in files:

[...]

            if os.path.exists(data_path):
                print(data_path, "exists. Skipping it ")
                continue
            include = ["biblio"] if include_biblio else []
            texts, metadata, ids = self.get_text_from_document(
                input_file,
                chunk_size=chunk_size,
                perc_overlap=perc_overlap,
                include=include)
            filename = metadata[0]['filename']

            vector_db_document = Chroma.from_texts(texts,

[...]
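For orientation, here is a hedged usage sketch of the engine as changed above. It is not part of the commit: `OpenAIEmbeddings`, the local Grobid URL, and the file name are assumptions made for illustration; `ChatOpenAI` and the sliding-window memory mirror what streamlit_app.py and the README suggest.

```python
# Hypothetical end-to-end usage of DocumentQAEngine (assumptions noted above).
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.memory import ConversationBufferWindowMemory

from document_qa.document_qa_engine import DocumentQAEngine

engine = DocumentQAEngine(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0),
    embedding_function=OpenAIEmbeddings(),
    grobid_url="http://localhost:8070",  # assumed local Grobid instance
    memory=ConversationBufferWindowMemory(k=4),
)

# Embed one PDF; include_biblio=True adds title/authors/publication_year as
# extra chunks so bibliographic questions can be answered as well.
doc_id = engine.create_memory_embeddings("paper.pdf",
                                         chunk_size=500,
                                         perc_overlap=0.1,
                                         include_biblio=True)

# Query the document; the engine appends the memory buffer to the retrieved
# context and saves the new question/answer pair back into memory.
_, answer = engine.query_document("Who are the authors?", doc_id, context_size=4)
print(answer)
```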
document_qa/grobid_processors.py
CHANGED

Updated content:

        }
        try:
            year = dateparser.parse(doc_biblio.header.date).year
            biblio["publication_year"] = year
        except:
            pass
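The added line stores the parsed year under `publication_year`, which the engine above exposes through the biblio chunks. A small illustration of the `dateparser` call it relies on; the date strings are assumed examples of what Grobid may return:

```python
import dateparser

# dateparser tolerates both full and partial date strings
print(dateparser.parse("2023-10-31").year)   # 2023
print(dateparser.parse("October 2023").year)  # 2023
```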
pyproject.toml
CHANGED

Updated content:

build-backend = "setuptools.build_meta"

[tool.bumpversion]
current_version = "0.3.2"
commit = "true"
tag = "true"
tag_name = "v{new_version}"
streamlit_app.py
CHANGED

Updated content:

# @st.cache_resource
def init_qa(model, api_key=None):
    ## For debug add: callbacks=[PromptLayerCallbackHandler(pl_tags=["langchain", "chatgpt", "document-qa"])])
    if model == 'chatgpt-3.5-turbo':
        if api_key:
            chat = ChatOpenAI(model_name="gpt-3.5-turbo",

[...]

        st.stop()
        return

    return DocumentQAEngine(chat, embeddings, grobid_url=os.environ['GROBID_URL'], memory=st.session_state['memory'])


@st.cache_resource

[...]

    st.button(
        'Reset chat memory.',
        key="reset-memory-button",
        on_click=clear_memory,
        help="Clear the conversational memory. Currently implemented to retain the 4 most recent messages.")

    left_column, right_column = st.columns([1, 1])

[...]

    st.markdown(
        ":warning: Do not upload sensitive data. We **temporarily** store text from the uploaded PDF documents solely for the purpose of processing your request, and we **do not assume responsibility** for any subsequent use or handling of the data submitted to third-party LLMs.")

    uploaded_file = st.file_uploader("Upload an article",
                                     type=("pdf", "txt"),
                                     on_change=new_file,
                                     disabled=st.session_state['model'] is not None and st.session_state['model'] not in
                                              st.session_state['api_keys'],
                                     help="The full-text is extracted using Grobid.")

[...]

        st.session_state['doc_id'] = hash = st.session_state['rqa'][model].create_memory_embeddings(tmp_file.name,
                                                                                                    chunk_size=chunk_size,
                                                                                                    perc_overlap=0.1,
                                                                                                    include_biblio=True)
        st.session_state['loaded_embeddings'] = True
        st.session_state.messages = []

[...]

    elif mode == "LLM":
        with st.spinner("Generating response..."):
            _, text_response = st.session_state['rqa'][model].query_document(question, st.session_state.doc_id,
                                                                             context_size=context_size)

    if not text_response:
        st.error("Something went wrong. Contact Luca Foppiano ([email protected]) to report the issue.")

[...]

        st.write(text_response)
        st.session_state.messages.append({"role": "assistant", "mode": mode, "content": text_response})

        # if len(st.session_state.messages) > 1:
        #     last_answer = st.session_state.messages[len(st.session_state.messages) - 1]
        #     if last_answer['role'] == "assistant":
        #         last_question = st.session_state.messages[len(st.session_state.messages) - 2]
        #         st.session_state.memory.save_context({"input": last_question['content']}, {"output": last_answer['content']})

    elif st.session_state.loaded_embeddings and st.session_state.doc_id:
        play_old_messages()
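`init_qa` now passes `st.session_state['memory']` into the engine, and the reset button calls `clear_memory()`; their definitions are not shown in these hunks. A plausible sketch of that wiring, under the same `ConversationBufferWindowMemory` assumption as above:

```python
# Hypothetical initialization of the session-level memory (not shown in this diff).
import streamlit as st
from langchain.memory import ConversationBufferWindowMemory

if 'memory' not in st.session_state:
    # one sliding window (4 most recent exchanges) per user session
    st.session_state['memory'] = ConversationBufferWindowMemory(k=4)


def clear_memory():
    # wired to the "Reset chat memory." button via on_click=clear_memory
    st.session_state['memory'].clear()
```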