vinod chandrashekaran committed
Commit • 6f57442
Parent(s):
b2d1286
add application and other files
- Dockerfile +11 -0
- README.md +27 -0
- app_v1.py +225 -0
- myutils/rag_pipeline_utils.py +289 -0
- myutils/ragas_pipeline.py +86 -0
- requirements.txt +18 -0
Dockerfile
ADDED
@@ -0,0 +1,11 @@
+FROM python:3.11
+RUN useradd -m -u 1000 user
+USER user
+ENV HOME=/home/user \
+    PATH=/home/user/.local/bin:$PATH
+WORKDIR $HOME/app
+COPY --chown=user . $HOME/app
+COPY ./requirements.txt $HOME/app/requirements.txt
+RUN pip install -r requirements.txt
+COPY . .
+CMD ["chainlit", "run", "app_v1.py", "--port", "7860"]
README.md
CHANGED
@@ -9,3 +9,30 @@ license: apache-2.0
 ---
 
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+# Summary
+
+This is my app for the AI Engineering Cohort #4 Midterm Assignment.
+
+With this application, you can chat with these two uploaded PDFs:
+https://www.whitehouse.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf
+AND
+https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf
+
+Code features:
+1. Text splitter - RecursiveCharacterTextSplitter from LangChain
+2. Vector store - an in-memory Qdrant db
+3. Retrieval chain built using LCEL syntax
+4. Chat model - OpenAI's gpt-4o-mini
+5. There are two variants of the app that can be deployed:
+   a. an early prototype built using OpenAI's text-embedding-3-small embeddings
+   b. a more advanced prototype using a fine-tuned version of `Snowflake/snowflake-arctic-embed-m`;
+      the fine-tuned model is available at `vincha77/finetuned_arctic`
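The two deployment variants in item 5 of the feature list differ only in the embedding model and its dimension. Below is a minimal sketch of that switch, mirroring the `APP_MODE` logic in `app_v1.py`; model names and dimensions are taken from the app code, while the `USE_FINETUNED` flag is an illustrative name, not one from the repo.

```python
# Sketch of the embedding switch behind the two app variants.
# Model names/dimensions come from app_v1.py; USE_FINETUNED is illustrative.
from langchain_openai import OpenAIEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings

USE_FINETUNED = False  # flip to deploy the advanced prototype

if USE_FINETUNED:
    # fine-tuned Snowflake/snowflake-arctic-embed-m, published as vincha77/finetuned_arctic
    embeddings = HuggingFaceEmbeddings(model_name="vincha77/finetuned_arctic")
    embed_dim = 768   # arctic-embed-m vectors are 768-dimensional
else:
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    embed_dim = 1536  # text-embedding-3-small vectors are 1536-dimensional
```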
app_v1.py
ADDED
@@ -0,0 +1,225 @@
+"""
+app_v1.py
+
+1. This app loads two pdf documents and allows the user to ask questions about these documents.
+   The documents that are used are:
+
+   https://www.whitehouse.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf
+   AND
+   https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf
+
+2. The two documents are pre-processed on start. Brief details on the pre-processing:
+   a. Text is split into chunks using the langchain RecursiveCharacterTextSplitter.
+   b. The text in each chunk is converted to an embedding using OpenAI text-embedding-3-small embeddings.
+      Each embedding produced by this model has dimension 1536, so each chunk is
+      represented by an embedding of dimension 1536.
+   c. The embeddings for all chunks, along with metadata, are saved/indexed in a vector database.
+   d. For this exercise, I use an in-memory instance of the Qdrant vector db.
+
+3. The next step is to build a RAG pipeline to answer questions. It is implemented as follows:
+   a. A simple prompt instructs the model to answer from retrieved contexts.
+   b. First, the user query is encoded using the same embedding model as the documents.
+   c. Second, the retriever efficiently searches the vector db and returns the
+      most relevant chunks.
+   d. Third, the user query and retrieved contexts are passed to a chat-enabled LLM.
+      I use OpenAI's gpt-4o-mini throughout this exercise.
+   e. Fourth, the chat model processes the user query and contexts along with the prompt
+      and generates a response that is returned to the user.
+
+4. The cl.on_chat_start decorator initiates the conversation with the user.
+
+5. The cl.on_message decorator wraps the main function, which:
+   a. receives the query that the user types in
+   b. runs the RAG pipeline
+   c. sends results back to the UI for display
+
+Additional Notes:
+   a. note the use of async functions and async/await syntax throughout the module
+   b. note the use of yield rather than return in certain key functions
+   c. note the use of streaming capabilities where needed
+"""
+
+import os
+from typing import List
+from dotenv import load_dotenv
+
+# chainlit imports
+import chainlit as cl
+
+# langchain imports
+# document loaders
+from langchain_community.document_loaders import PyPDFLoader, PyMuPDFLoader
+# text splitter
+from langchain_text_splitters import RecursiveCharacterTextSplitter
+# embeddings model to embed each chunk of text in doc
+from langchain_openai import OpenAIEmbeddings
+# llm for text generation using prompt plus retrieved context plus query
+from langchain_openai import ChatOpenAI
+# templates to create custom prompts
+from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
+# chains - LCEL Runnable Passthrough
+from langchain_core.runnables import RunnablePassthrough
+# to parse output from llm
+from langchain_core.output_parsers import StrOutputParser
+from langchain_core.documents import Document
+from langchain_huggingface import HuggingFaceEmbeddings
+
+from sentence_transformers import SentenceTransformer
+
+from myutils.rag_pipeline_utils import SimpleTextSplitter, SemanticTextSplitter, VectorStore, AdvancedRetriever
+from myutils.ragas_pipeline import RagasPipeline
+from myutils.rag_pipeline_utils import load_all_pdfs, set_up_rag_pipeline
+
+
+load_dotenv()
+
+# Flag to indicate whether pdfs should be loaded directly from URLs.
+# If True, get pdfs from urls; if False, get them from a local copy.
+LOAD_PDF_DIRECTLY_FROM_URL = True
+
+# set the APP_MODE - one of two choices:
+# early_prototype means use OpenAI embeddings
+# advanced_prototype means use finetuned model embeddings
+APP_MODE = "early_prototype"
+
+if APP_MODE == "early_prototype":
+    embeddings = OpenAIEmbeddings(model='text-embedding-3-small')
+    embed_dim = 1536
+    appendix_to_user_message = "This chatbot is built using OpenAI Embeddings as a fast prototype."
+else:
+    finetuned_model_id = "vincha77/finetuned_arctic"
+    arctic_finetuned_model = SentenceTransformer(finetuned_model_id)
+    embeddings = HuggingFaceEmbeddings(model_name="vincha77/finetuned_arctic")
+    appendix_to_user_message = "Our Tech team finetuned snowflake-arctic-embed-m to bring you this chatbot!!"
+    embed_dim = 768
+
+rag_template = """
+You are an assistant for question-answering tasks.
+You will be given documents on the risks of AI, and on frameworks and
+policies formulated by various governmental agencies to articulate
+these risks and to safeguard against them.
+
+Use the following pieces of retrieved context to answer
+the question.
+
+You must answer the question based only on the context provided.
+
+If you don't know the answer, or if the context does not provide sufficient information,
+then say that you don't know.
+
+Think through your answer step-by-step.
+
+Context:
+{context}
+
+Question:
+{question}
+"""
+
+rag_prompt = ChatPromptTemplate.from_template(template=rag_template)
+
+# parameters to manage text splitting/chunking
+chunk_kwargs = {
+    'chunk_size': 1000,
+    'chunk_overlap': 300
+}
+
+retrieval_chain_kwargs = {
+    'location': ":memory:",
+    'collection_name': 'End_to_End_Prototype',
+    'embeddings': embeddings,
+    'embed_dim': embed_dim,
+    'prompt': rag_prompt,
+    'qa_llm': ChatOpenAI(model_name="gpt-4o-mini", temperature=0)
+}
+
+urls_for_pdfs = [
+    "https://www.whitehouse.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf",
+    "https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf"
+]
+
+pdf_file_paths = [
+    './data/docs_for_rag/Blueprint-for-an-AI-Bill-of-Rights.pdf',
+    './data/docs_for_rag/NIST.AI.600-1.pdf'
+]
+
+# if flag is True, then pass in pointers to URLs
+# if flag is False, then pass in file pointers
+if LOAD_PDF_DIRECTLY_FROM_URL:
+    docpathlist = urls_for_pdfs
+else:
+    docpathlist = pdf_file_paths
+
+
+class RetrievalAugmentedQAPipelineWithLangchain:
+    def __init__(self,
+                 list_of_documents,
+                 chunk_kwargs,
+                 retrieval_chain_kwargs):
+        self.list_of_documents = list_of_documents
+        self.chunk_kwargs = chunk_kwargs
+        self.retrieval_chain_kwargs = retrieval_chain_kwargs
+
+        self.load_documents()
+        self.split_text()
+        self.set_up_rag_pipeline()
+        return
+
+    def load_documents(self):
+        self.documents = load_all_pdfs(self.list_of_documents)
+        return self
+
+    def split_text(self):
+        baseline_text_splitter = \
+            SimpleTextSplitter(**self.chunk_kwargs, documents=self.documents)
+        # split text for baseline case
+        self.baseline_text_splits = baseline_text_splitter.split_text()
+        return self
+
+    def set_up_rag_pipeline(self):
+        self.retrieval_chain = set_up_rag_pipeline(
+            **self.retrieval_chain_kwargs,
+            text_splits=self.baseline_text_splits
+        )
+        return self
+
+
+RETRIEVAL_CHAIN = \
+    RetrievalAugmentedQAPipelineWithLangchain(
+        list_of_documents=docpathlist,
+        chunk_kwargs=chunk_kwargs,
+        retrieval_chain_kwargs=retrieval_chain_kwargs
+    ).retrieval_chain
+
+
+@cl.on_chat_start
+async def on_chat_start():
+
+    msg = cl.Message(content=f"""
+Hello dear colleague! Welcome to this chatbot! In recent weeks, many of you have shared that you'd like to understand how AI is evolving. What better way to help you understand the implications of AI than to "use AI to answer questions about AI"! Your colleagues in Technology have worked hard to create this chatbot. We've loaded a few key policy and framework proposals from the US government, which this chatbot searches to respond to your question. Occasionally, the chatbot may respond with "I don't know". If it does that, try a more specific variation of your question. Oh! And one more thing: {appendix_to_user_message}...Please go ahead and enter your question...
+"""
+    )
+
+    await msg.send()
+    cl.user_session.set("retrieval_chain", RETRIEVAL_CHAIN)
+
+
+@cl.on_message
+async def main(message):
+    retrieval_chain = cl.user_session.get("retrieval_chain")
+
+    msg = cl.Message(content="")
+
+    # run the (synchronous) chain in a worker thread so the event loop is not blocked
+    result = await cl.make_async(retrieval_chain.invoke)({"question": message.content})
+
+    # the chain returns the full answer; stream it to the UI one character at a time
+    for stream_resp in result["response"].content:
+        await msg.stream_token(stream_resp)
+
+    await msg.send()
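Because `app_v1.py` builds `RETRIEVAL_CHAIN` at import time, the chain can be smoke-tested outside the Chainlit UI. A minimal sketch, assuming `OPENAI_API_KEY` is set in the environment; the question is a placeholder:

```python
# Quick smoke test of the chain outside Chainlit.
# Importing app_v1 triggers the full document load / index build.
from app_v1 import RETRIEVAL_CHAIN

result = RETRIEVAL_CHAIN.invoke(
    {"question": "What principles does the Blueprint for an AI Bill of Rights articulate?"}
)

# The chain returns a dict: "response" is the chat model's message,
# "context" is the list of retrieved chunks.
print(result["response"].content)
for doc in result["context"]:
    print(doc.metadata.get("source"), doc.metadata.get("page"))
```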
myutils/rag_pipeline_utils.py
ADDED
@@ -0,0 +1,289 @@
+"""
+rag_pipeline_utils.py
+
+This python script implements various classes useful for a RAG pipeline.
+
+Currently implemented:
+
+Text splitting
+    SimpleTextSplitter: uses RecursiveCharacterTextSplitter
+    SemanticTextSplitter: uses SemanticChunker (different threshold types can be used)
+
+VectorStore
+    currently only sets up a Qdrant vector store in memory
+
+AdvancedRetriever
+    a simple similarity retriever is a special case;
+    for an advanced retriever, MultiQueryRetriever is currently implemented
+"""
+
+from operator import itemgetter
+from typing import List
+
+from langchain_core.runnables import RunnablePassthrough
+from langchain_text_splitters import RecursiveCharacterTextSplitter
+from langchain_experimental.text_splitter import SemanticChunker
+from langchain_openai.embeddings import OpenAIEmbeddings
+from langchain_qdrant import QdrantVectorStore
+
+from qdrant_client import QdrantClient
+from qdrant_client.http.models import Distance, VectorParams
+
+from langchain.retrievers.multi_query import MultiQueryRetriever
+from langchain_community.document_loaders import PyMuPDFLoader
+from langchain_core.documents import Document
+from datasets import Dataset
+
+from ragas import evaluate
+
+
+def load_all_pdfs(list_of_pdf_files: List[str]) -> List[Document]:
+    alldocs = []
+    for pdffile in list_of_pdf_files:
+        thisdoc = PyMuPDFLoader(file_path=pdffile).load()
+        print(f'loaded {pdffile} with {len(thisdoc)} pages')
+        alldocs.extend(thisdoc)
+    print(f'loaded all files: total number of pages: {len(alldocs)}')
+    return alldocs
+
+
+class SimpleTextSplitter:
+    def __init__(self,
+                 chunk_size,
+                 chunk_overlap,
+                 documents):
+        self.chunk_size = chunk_size
+        self.chunk_overlap = chunk_overlap
+        self.documents = documents
+        return
+
+    def split_text(self):
+        text_splitter = RecursiveCharacterTextSplitter(
+            chunk_size=self.chunk_size,
+            chunk_overlap=self.chunk_overlap
+        )
+        all_splits = text_splitter.split_documents(self.documents)
+        return all_splits
+
+
+class SemanticTextSplitter:
+    def __init__(self,
+                 llm_embeddings=OpenAIEmbeddings(),
+                 threshold_type="interquartile",
+                 documents=None):
+        self.llm_embeddings = llm_embeddings
+        self.threshold_type = threshold_type
+        self.documents = documents
+        return
+
+    def split_text(self):
+        # use the configured threshold type rather than a hard-coded value
+        text_splitter = SemanticChunker(
+            embeddings=self.llm_embeddings,
+            breakpoint_threshold_type=self.threshold_type
+        )
+
+        print(f'loaded {len(self.documents)} docs to be split')
+        all_splits = text_splitter.split_documents(self.documents)
+        print(f'returning docs split into {len(all_splits)} chunks')
+        return all_splits
+
+
+class VectorStore:
+    def __init__(self,
+                 location,
+                 name,
+                 documents,
+                 size,
+                 embedding=OpenAIEmbeddings()):
+        self.location = location
+        self.name = name
+        self.size = size
+        self.documents = documents
+        self.embedding = embedding
+
+        self.qdrant_client = QdrantClient(self.location)
+        self.qdrant_client.create_collection(
+            collection_name=self.name,
+            vectors_config=VectorParams(size=self.size, distance=Distance.COSINE),
+        )
+        return
+
+    def set_up_vectorstore(self):
+        self.qdrant_vector_store = QdrantVectorStore(
+            client=self.qdrant_client,
+            collection_name=self.name,
+            embedding=self.embedding
+        )
+
+        self.qdrant_vector_store.add_documents(self.documents)
+        return self
+
+
+class AdvancedRetriever:
+    def __init__(self,
+                 vectorstore):
+        self.vectorstore = vectorstore
+        return
+
+    def set_up_simple_retriever(self):
+        simple_retriever = self.vectorstore.as_retriever(
+            search_type='similarity',
+            search_kwargs={
+                'k': 5
+            }
+        )
+        return simple_retriever
+
+    def set_up_multi_query_retriever(self, llm):
+        retriever = self.set_up_simple_retriever()
+        advanced_retriever = MultiQueryRetriever.from_llm(
+            retriever=retriever, llm=llm
+        )
+        return advanced_retriever
+
+
+def run_and_eval_rag_pipeline(location, collection_name, embed_dim, text_splits, embeddings,
+                              prompt, qa_llm, metrics, test_df):
+    """
+    Helper function that runs and evaluates different rag pipelines
+    based on different text_splits presented to the pipeline
+    """
+    # vector store
+    vs = VectorStore(location=location,
+                     name=collection_name,
+                     documents=text_splits,
+                     size=embed_dim,
+                     embedding=embeddings)
+
+    qdvs = vs.set_up_vectorstore().qdrant_vector_store
+
+    # retriever
+    retriever = AdvancedRetriever(vectorstore=qdvs).set_up_simple_retriever()
+
+    # q&a chain using LCEL
+    retrieval_chain = (
+        {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
+        | RunnablePassthrough.assign(context=itemgetter("context"))
+        | {"response": prompt | qa_llm, "context": itemgetter("context")}
+    )
+
+    # get questions and ground truth
+    test_questions = test_df["question"].values.tolist()
+    test_groundtruths = test_df["ground_truth"].values.tolist()
+
+    # run RAG pipeline
+    answers = []
+    contexts = []
+
+    for question in test_questions:
+        response = retrieval_chain.invoke({"question": question})
+        answers.append(response["response"].content)
+        contexts.append([context.page_content for context in response["context"]])
+
+    # Save RAG pipeline results to HF Dataset object
+    response_dataset = Dataset.from_dict({
+        "question": test_questions,
+        "answer": answers,
+        "contexts": contexts,
+        "ground_truth": test_groundtruths
+    })
+
+    # Run RAGAS evaluation using the given metrics
+    results = evaluate(response_dataset, metrics)
+
+    # save results to df
+    results_df = results.to_pandas()
+
+    return results, results_df
+
+
+def set_up_rag_pipeline(location, collection_name,
+                        embeddings, embed_dim,
+                        prompt, qa_llm,
+                        text_splits):
+    """
+    Helper function that sets up a RAG pipeline
+    Inputs
+        location: in-memory or persistent store
+        collection_name: name of collection, string
+        embeddings: embeddings object to be used
+        embed_dim: embedding dimension
+        prompt: prompt used in RAG pipeline
+        qa_llm: LLM used to generate the response
+        text_splits: list containing text splits
+
+    Returns a retrieval chain
+    """
+    # vector store
+    vs = VectorStore(location=location,
+                     name=collection_name,
+                     documents=text_splits,
+                     size=embed_dim,
+                     embedding=embeddings)
+
+    qdvs = vs.set_up_vectorstore().qdrant_vector_store
+
+    # retriever
+    retriever = AdvancedRetriever(vectorstore=qdvs).set_up_simple_retriever()
+
+    # q&a chain using LCEL
+    retrieval_chain = (
+        {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
+        | RunnablePassthrough.assign(context=itemgetter("context"))
+        | {"response": prompt | qa_llm, "context": itemgetter("context")}
+    )
+
+    return retrieval_chain
+
+
+def test_rag_pipeline(retrieval_chain, list_of_questions):
+    """
+    Tests RAG pipeline
+    Inputs
+        retrieval_chain: retrieval chain
+        list_of_questions: list of questions to use to test RAG pipeline
+    Output
+        List of RAG-pipeline-generated responses to each question
+    """
+    all_answers = []
+    for i, question in enumerate(list_of_questions):
+        response = retrieval_chain.invoke({'question': question})
+        answer = response["response"].content
+        all_answers.append(answer)
+    return all_answers
+
+
+def get_vibe_check_on_list_of_questions(collection_name,
+                                        embeddings, embed_dim,
+                                        prompt, llm, text_splits,
+                                        list_of_questions):
+    """
+    HELPER FUNCTION
+    set up a retrieval chain for each scenario and print out the results
+    of the q_and_a for any list of questions
+    """
+
+    # set up baseline retriever
+    retrieval_chain = \
+        set_up_rag_pipeline(location=":memory:", collection_name=collection_name,
+                            embeddings=embeddings, embed_dim=embed_dim,
+                            prompt=prompt, qa_llm=llm,
+                            text_splits=text_splits)
+
+    # run RAG pipeline and get responses
+    answers = test_rag_pipeline(retrieval_chain, list_of_questions)
+
+    # create (question, answer) tuples
+    q_and_a = [(x, y) for x, y in zip(list_of_questions, answers)]
+
+    # print question/answer pairs to review the performance of the pipeline
+    for i, item in enumerate(q_and_a):
+        print('=================')
+        print(f'=====question number: {i} =============')
+        print(item[0])
+        print(item[1])
+
+    return retrieval_chain, q_and_a
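Note that `AdvancedRetriever.set_up_multi_query_retriever` is implemented but not wired into `set_up_rag_pipeline`, which always uses the simple similarity retriever. A minimal sketch of swapping it in, reusing the names from this module; the choice of gpt-4o-mini for query generation is an assumption, and `set_up_multi_query_rag_pipeline` is a hypothetical helper, not one defined in the repo:

```python
# Sketch: build the same LCEL chain, but with the MultiQueryRetriever variant.
from operator import itemgetter

from langchain_openai import ChatOpenAI
from langchain_core.runnables import RunnablePassthrough

from myutils.rag_pipeline_utils import VectorStore, AdvancedRetriever

def set_up_multi_query_rag_pipeline(location, collection_name, embeddings,
                                    embed_dim, prompt, qa_llm, text_splits):
    vs = VectorStore(location=location, name=collection_name,
                     documents=text_splits, size=embed_dim, embedding=embeddings)
    qdvs = vs.set_up_vectorstore().qdrant_vector_store

    # the only change vs. set_up_rag_pipeline: an LLM generates query variants
    # and the union of their hits is retrieved
    retriever = AdvancedRetriever(vectorstore=qdvs).set_up_multi_query_retriever(
        llm=ChatOpenAI(model_name="gpt-4o-mini", temperature=0)
    )

    return (
        {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
        | RunnablePassthrough.assign(context=itemgetter("context"))
        | {"response": prompt | qa_llm, "context": itemgetter("context")}
    )
```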
myutils/ragas_pipeline.py
ADDED
@@ -0,0 +1,86 @@
+"""
+ragas_pipeline.py
+
+Implements the core pipeline to generate a test set for RAGAS.
+"""
+
+from langchain_openai import ChatOpenAI, OpenAIEmbeddings
+from ragas.testset.generator import TestsetGenerator
+from ragas import evaluate
+
+from datasets import Dataset
+
+from myutils.rag_pipeline_utils import SimpleTextSplitter, SemanticTextSplitter, VectorStore, AdvancedRetriever
+
+
+class RagasPipeline:
+    def __init__(self, generator_llm_model, critic_llm_model, embedding_model,
+                 number_of_qa_pairs,
+                 chunk_size, chunk_overlap, documents,
+                 distributions):
+        self.generator_llm = ChatOpenAI(model=generator_llm_model)
+        self.critic_llm = ChatOpenAI(model=critic_llm_model)
+        self.embeddings = OpenAIEmbeddings(model=embedding_model)
+        self.number_of_qa_pairs = number_of_qa_pairs
+
+        self.chunk_size = chunk_size
+        self.chunk_overlap = chunk_overlap
+        self.documents = documents
+
+        self.distributions = distributions
+
+        self.generator = TestsetGenerator.from_langchain(
+            self.generator_llm,
+            self.critic_llm,
+            self.embeddings
+        )
+        return
+
+    def generate_testset(self):
+        text_splitter = SimpleTextSplitter(
+            chunk_size=self.chunk_size,
+            chunk_overlap=self.chunk_overlap,
+            documents=self.documents
+        )
+        ragas_text_splits = text_splitter.split_text()
+
+        testset = self.generator.generate_with_langchain_docs(
+            ragas_text_splits,
+            self.number_of_qa_pairs,
+            self.distributions
+        )
+
+        testset_df = testset.to_pandas()
+        return testset_df
+
+    def ragas_eval_of_rag_pipeline(self, retrieval_chain, ragas_questions, ragas_groundtruths, ragas_metrics):
+        """
+        Helper function that runs a rag pipeline on RAGAS test questions
+        and evaluates the results against the given RAGAS metrics
+        """
+
+        # run RAG pipeline on RAGAS synthetic questions
+        answers = []
+        contexts = []
+
+        for question in ragas_questions:
+            response = retrieval_chain.invoke({"question": question})
+            answers.append(response["response"].content)
+            contexts.append([context.page_content for context in response["context"]])
+
+        # Save RAG pipeline results to HF Dataset object
+        response_dataset = Dataset.from_dict({
+            "question": ragas_questions,
+            "answer": answers,
+            "contexts": contexts,
+            "ground_truth": ragas_groundtruths
+        })
+
+        # Run RAGAS evaluation using the given metrics
+        results = evaluate(response_dataset, ragas_metrics)
+
+        # save results to df
+        results_df = results.to_pandas()
+
+        return results, results_df
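A minimal sketch of driving this class end to end. It assumes a ragas 0.1.x release that matches the `TestsetGenerator.from_langchain` API used above; the model names, test-set size, metric list, and `distributions` dict are illustrative assumptions, not values fixed by this repo:

```python
# Sketch: generate a synthetic test set, then evaluate a retrieval chain on it.
# Assumes ragas 0.1.x; models, sizes, metrics, and distributions are assumptions.
from ragas.testset.evolutions import simple, reasoning, multi_context
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

from myutils.ragas_pipeline import RagasPipeline
from myutils.rag_pipeline_utils import load_all_pdfs

docs = load_all_pdfs(["https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf"])

ragas_pipe = RagasPipeline(
    generator_llm_model="gpt-4o-mini",
    critic_llm_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
    number_of_qa_pairs=20,
    chunk_size=1000,
    chunk_overlap=300,
    documents=docs,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)

testset_df = ragas_pipe.generate_testset()

# retrieval_chain would come from set_up_rag_pipeline (see rag_pipeline_utils.py)
# results, results_df = ragas_pipe.ragas_eval_of_rag_pipeline(
#     retrieval_chain,
#     testset_df["question"].tolist(),
#     testset_df["ground_truth"].tolist(),
#     [faithfulness, answer_relevancy, context_precision, context_recall],
# )
```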
requirements.txt
ADDED
@@ -0,0 +1,18 @@
+langchain
+langchain-openai
+langchain_core==0.2.38
+langchain-community
+langchainhub
+langchain-qdrant
+langchain_huggingface
+langchain-text-splitters
+langchain_experimental
+ragas
+openai
+pymupdf
+faiss-cpu
+sentence_transformers
+datasets
+pyarrow==14.0.1
+chainlit==0.7.700
+python-dotenv==1.0.0