vinod chandrashekaran committed
Commit
6f57442
1 Parent(s): b2d1286

add application and other files

Files changed (6)
  1. Dockerfile +11 -0
  2. README.md +27 -0
  3. app_v1.py +225 -0
  4. myutils/rag_pipeline_utils.py +289 -0
  5. myutils/ragas_pipeline.py +86 -0
  6. requirements.txt +18 -0
Dockerfile ADDED
@@ -0,0 +1,11 @@
+ FROM python:3.11
+ RUN useradd -m -u 1000 user
+ USER user
+ ENV HOME=/home/user \
+     PATH=/home/user/.local/bin:$PATH
+ WORKDIR $HOME/app
+ COPY --chown=user . $HOME/app
+ COPY ./requirements.txt $HOME/app/requirements.txt
+ RUN pip install -r requirements.txt
+ COPY . .
+ CMD ["chainlit", "run", "app_v1.py", "--port", "7860"]
README.md CHANGED
@@ -9,3 +9,30 @@ license: apache-2.0
 ---
 
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+ # Summary
+
+ This is my app for the AI Engineering Cohort #4 Midterm Assignment.
+
+ With this application, you can chat with these TWO uploaded PDFs:
+ https://www.whitehouse.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf
+ AND
+ https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf
+
+ Code features:
+ 1. Text splitter - RecursiveCharacterTextSplitter from LangChain
+ 2. Vector store - Qdrant db (in-memory)
+ 3. Retrieval chain using LCEL syntax
+ 4. Chat model - OpenAI's gpt-4o-mini
+ 5. There are two variants of the app that can be deployed:
+    a. an early prototype built using OpenAI's text-embedding-3-small embeddings
+    b. a more advanced prototype using a finetuned version of `Snowflake/snowflake-arctic-embed-m`;
+       the finetuned model is available at `vincha77/finetuned_arctic`
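To make feature 1 concrete, here is a minimal sketch of the chunking step as configured in this app (the values mirror `chunk_kwargs` in app_v1.py; `documents` is a placeholder for the loaded PDF pages):

```python
# Minimal sketch of the text-splitting step (feature 1 above).
# chunk_size/chunk_overlap mirror the chunk_kwargs set in app_v1.py;
# `documents` is a placeholder for the list of loaded PDF pages.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=300)
chunks = splitter.split_documents(documents)
```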
app_v1.py ADDED
@@ -0,0 +1,225 @@
+ """
+ app_v1.py
+ 
+ 1. This app loads two pdf documents and allows the user to ask questions about them.
+    The documents used are:
+ 
+    https://www.whitehouse.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf
+    AND
+    https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf
+ 
+ 2. The two documents are pre-processed on start. Brief details on the pre-processing:
+    a. Text is split into chunks using langchain's RecursiveCharacterTextSplitter.
+    b. The text in each chunk is converted to an embedding using OpenAI's text-embedding-3-small model.
+       Each embedding produced by this model has dimension 1536,
+       so each chunk is represented by an embedding of dimension 1536.
+    c. The embeddings for all chunks, along with metadata, are indexed in a vector database.
+    d. For this exercise, I use an in-memory instance of the Qdrant vector db.
+ 
+ 3. The next step is to build a RAG pipeline to answer questions. This is implemented as follows:
+    a. I use a simple prompt that answers a user query based only on retrieved contexts.
+    b. First, the user query is encoded using the same embedding model as the documents.
+    c. Second, the retriever efficiently searches the vector db
+       and returns the most relevant chunks.
+    d. Third, the user query and retrieved contexts are passed to a chat-enabled LLM.
+       I use OpenAI's gpt-4o-mini throughout this exercise.
+    e. Fourth, the chat model processes the user query and context along with the prompt and
+       generates a response that is then passed to the user.
+ 
+ 4. The cl.on_chat_start decorator initiates the conversation with the user.
+ 
+ 5. The cl.on_message decorator wraps the main function, which:
+    a. receives the query that the user types in
+    b. runs the RAG pipeline
+    c. sends results back to the UI for display
+ 
+ Additional notes:
+    a. note the use of async functions and async/await syntax throughout the module
+    b. note the use of yield rather than return in certain key functions
+    c. note the use of streaming capabilities where needed
+ """
+ 
+ import os
+ from typing import List
+ from dotenv import load_dotenv
+ 
+ # chainlit imports
+ import chainlit as cl
+ 
+ # langchain imports
+ # document loaders
+ from langchain_community.document_loaders import PyPDFLoader, PyMuPDFLoader
+ # text splitter
+ from langchain_text_splitters import RecursiveCharacterTextSplitter
+ # embeddings model to embed each chunk of text in doc
+ from langchain_openai import OpenAIEmbeddings
+ # llm for text generation using prompt plus retrieved context plus query
+ from langchain_openai import ChatOpenAI
+ # templates to create custom prompts
+ from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
+ # LCEL Runnable Passthrough
+ from langchain_core.runnables import RunnablePassthrough
+ # to parse output from llm
+ from langchain_core.output_parsers import StrOutputParser
+ from langchain_core.documents import Document
+ from langchain_huggingface import HuggingFaceEmbeddings
+ 
+ from sentence_transformers import SentenceTransformer
+ 
+ from myutils.rag_pipeline_utils import SimpleTextSplitter, SemanticTextSplitter, VectorStore, AdvancedRetriever
+ from myutils.ragas_pipeline import RagasPipeline
+ from myutils.rag_pipeline_utils import load_all_pdfs, set_up_rag_pipeline
+ 
+ 
+ load_dotenv()
+ 
+ # Flag to indicate whether pdfs should be loaded directly from URLs.
+ # If True, get pdfs from urls; if False, get them from a local copy.
+ LOAD_PDF_DIRECTLY_FROM_URL = True
+ 
+ # set the APP_MODE; one of two choices:
+ #   early_prototype means use OpenAI embeddings
+ #   advanced_prototype means use finetuned model embeddings
+ APP_MODE = "early_prototype"
+ 
+ if APP_MODE == "early_prototype":
+     embeddings = OpenAIEmbeddings(model='text-embedding-3-small')
+     embed_dim = 1536
+     appendix_to_user_message = "This chatbot is built using OpenAI Embeddings as a fast prototype."
+ else:
+     finetuned_model_id = "vincha77/finetuned_arctic"
+     arctic_finetuned_model = SentenceTransformer(finetuned_model_id)
+     embeddings = HuggingFaceEmbeddings(model_name=finetuned_model_id)
+     appendix_to_user_message = "Our Tech team finetuned snowflake-arctic-embed-m to bring you this chatbot!!"
+     embed_dim = 768
+ 
+ rag_template = """
+ You are an assistant for question-answering tasks.
+ You will be given documents on the risks of AI, and on the frameworks and
+ policies formulated by various governmental agencies to articulate
+ these risks and to safeguard against them.
+ 
+ Use the following pieces of retrieved context to answer
+ the question.
+ 
+ You must answer the question based only on the context provided.
+ 
+ If you don't know the answer, or if the context does not provide sufficient information,
+ then say that you don't know.
+ 
+ Think through your answer step-by-step.
+ 
+ Context:
+ {context}
+ 
+ Question:
+ {question}
+ """
+ 
+ rag_prompt = ChatPromptTemplate.from_template(template=rag_template)
+ 
+ # parameters to manage text splitting/chunking
+ chunk_kwargs = {
+     'chunk_size': 1000,
+     'chunk_overlap': 300
+ }
+ 
+ retrieval_chain_kwargs = {
+     'location': ":memory:",
+     'collection_name': 'End_to_End_Prototype',
+     'embeddings': embeddings,
+     'embed_dim': embed_dim,
+     'prompt': rag_prompt,
+     'qa_llm': ChatOpenAI(model_name="gpt-4o-mini", temperature=0)
+ }
+ 
+ urls_for_pdfs = [
+     "https://www.whitehouse.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf",
+     "https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf"
+ ]
+ 
+ pdf_file_paths = [
+     './data/docs_for_rag/Blueprint-for-an-AI-Bill-of-Rights.pdf',
+     './data/docs_for_rag/NIST.AI.600-1.pdf'
+ ]
+ 
+ # if flag is True, pass in pointers to URLs;
+ # if flag is False, pass in local file paths
+ if LOAD_PDF_DIRECTLY_FROM_URL:
+     docpathlist = urls_for_pdfs
+ else:
+     docpathlist = pdf_file_paths
+ 
+ 
+ class RetrievalAugmentedQAPipelineWithLangchain:
+     def __init__(self,
+                  list_of_documents,
+                  chunk_kwargs,
+                  retrieval_chain_kwargs):
+         self.list_of_documents = list_of_documents
+         self.chunk_kwargs = chunk_kwargs
+         self.retrieval_chain_kwargs = retrieval_chain_kwargs
+ 
+         self.load_documents()
+         self.split_text()
+         self.set_up_rag_pipeline()
+         return
+ 
+     def load_documents(self):
+         self.documents = load_all_pdfs(self.list_of_documents)
+         return self
+ 
+     def split_text(self):
+         # split text for baseline case
+         baseline_text_splitter = \
+             SimpleTextSplitter(**self.chunk_kwargs, documents=self.documents)
+         self.baseline_text_splits = baseline_text_splitter.split_text()
+         return self
+ 
+     def set_up_rag_pipeline(self):
+         self.retrieval_chain = set_up_rag_pipeline(
+             **self.retrieval_chain_kwargs,
+             text_splits=self.baseline_text_splits
+         )
+         return self
+ 
+ 
+ RETRIEVAL_CHAIN = \
+     RetrievalAugmentedQAPipelineWithLangchain(
+         list_of_documents=docpathlist,
+         chunk_kwargs=chunk_kwargs,
+         retrieval_chain_kwargs=retrieval_chain_kwargs
+     ).retrieval_chain
+ 
+ 
+ @cl.on_chat_start
+ async def on_chat_start():
+     msg = cl.Message(content=f"""
+ Hello dear colleague! Welcome to this chatbot! In recent weeks, many of you have shared that you'd like to understand how AI is evolving. What better way to help you understand the implications of AI than to "use AI to answer questions about AI"? Your colleagues in Technology have worked hard to create this chatbot. We've loaded a few key policy and framework proposals from the US government that this chatbot searches for a response to your question. Occasionally, the chatbot may respond with "I don't know". If it does that, try a more specific variation of your question. Oh! And one more thing: {appendix_to_user_message}...Please go ahead and enter your question...
+ """
+     )
+ 
+     await msg.send()
+     cl.user_session.set("retrieval_chain", RETRIEVAL_CHAIN)
+ 
+ 
+ @cl.on_message
+ async def main(message):
+     retrieval_chain = cl.user_session.get("retrieval_chain")
+ 
+     msg = cl.Message(content="")
+ 
+     # run the (synchronous) chain without blocking the event loop
+     result = await cl.make_async(retrieval_chain.invoke)({"question": message.content})
+ 
+     # stream the generated answer to the UI, one character at a time
+     for stream_resp in result["response"].content:
+         await msg.stream_token(stream_resp)
+ 
+     await msg.send()
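A quick way to exercise this module outside of Chainlit is to import the prebuilt chain and invoke it directly. The snippet below is a minimal smoke-test sketch, assuming OPENAI_API_KEY is set in the environment; the question text is illustrative:

```python
# Hypothetical smoke test for the RAG chain built in app_v1.py.
# Importing the module runs the full pre-processing (load, split, index).
from app_v1 import RETRIEVAL_CHAIN

result = RETRIEVAL_CHAIN.invoke({"question": "What is the AI Bill of Rights blueprint about?"})
print(result["response"].content)   # generated answer
for doc in result["context"]:       # retrieved chunks passed to the LLM
    print(doc.page_content[:100])
```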
myutils/rag_pipeline_utils.py ADDED
@@ -0,0 +1,289 @@
+ """
+ rag_pipeline_utils.py
+ 
+ This python script implements various classes useful for a RAG pipeline.
+ 
+ Currently implemented:
+ 
+ Text splitting
+     SimpleTextSplitter: uses RecursiveCharacterTextSplitter
+     SemanticTextSplitter: uses SemanticChunker (different threshold types can be used)
+ 
+ VectorStore
+     currently only sets up a Qdrant vector store in memory
+ 
+ AdvancedRetriever
+     a simple similarity-search retriever is a special case;
+     currently implements MultiQueryRetriever as the advanced option
+ """
+ 
+ from operator import itemgetter
+ from typing import List
+ 
+ from langchain_core.runnables import RunnablePassthrough
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
+ from langchain_experimental.text_splitter import SemanticChunker
+ from langchain_openai.embeddings import OpenAIEmbeddings
+ from langchain_qdrant import QdrantVectorStore
+ 
+ from qdrant_client import QdrantClient
+ from qdrant_client.http.models import Distance, VectorParams
+ 
+ from langchain.retrievers.multi_query import MultiQueryRetriever
+ from langchain_community.document_loaders import PyMuPDFLoader
+ from langchain_core.documents import Document
+ from datasets import Dataset
+ 
+ from ragas import evaluate
+ 
+ 
+ def load_all_pdfs(list_of_pdf_files: List[str]) -> List[Document]:
+     """Load each pdf (local path or URL) and return the combined list of page documents."""
+     alldocs = []
+     for pdffile in list_of_pdf_files:
+         thisdoc = PyMuPDFLoader(file_path=pdffile).load()
+         print(f'loaded {pdffile} with {len(thisdoc)} pages')
+         alldocs.extend(thisdoc)
+     print(f'loaded all files: total number of pages: {len(alldocs)}')
+     return alldocs
+ 
+ 
+ class SimpleTextSplitter:
+     def __init__(self,
+                  chunk_size,
+                  chunk_overlap,
+                  documents):
+         self.chunk_size = chunk_size
+         self.chunk_overlap = chunk_overlap
+         self.documents = documents
+         return
+ 
+     def split_text(self):
+         text_splitter = RecursiveCharacterTextSplitter(
+             chunk_size=self.chunk_size,
+             chunk_overlap=self.chunk_overlap
+         )
+         all_splits = text_splitter.split_documents(self.documents)
+         return all_splits
+ 
+ 
+ class SemanticTextSplitter:
+     def __init__(self,
+                  llm_embeddings=OpenAIEmbeddings(),
+                  threshold_type="interquartile",
+                  documents=None):
+         self.llm_embeddings = llm_embeddings
+         self.threshold_type = threshold_type
+         self.documents = documents
+         return
+ 
+     def split_text(self):
+         # use the threshold type passed to the constructor
+         text_splitter = SemanticChunker(
+             embeddings=self.llm_embeddings,
+             breakpoint_threshold_type=self.threshold_type
+         )
+ 
+         print(f'loaded {len(self.documents)} docs to be split')
+         all_splits = text_splitter.split_documents(self.documents)
+         print(f'returning docs split into {len(all_splits)} chunks')
+         return all_splits
+ 
+ 
+ class VectorStore:
+     def __init__(self,
+                  location,
+                  name,
+                  documents,
+                  size,
+                  embedding=OpenAIEmbeddings()):
+         self.location = location
+         self.name = name
+         self.size = size
+         self.documents = documents
+         self.embedding = embedding
+ 
+         self.qdrant_client = QdrantClient(self.location)
+         self.qdrant_client.create_collection(
+             collection_name=self.name,
+             vectors_config=VectorParams(size=self.size, distance=Distance.COSINE),
+         )
+         return
+ 
+     def set_up_vectorstore(self):
+         self.qdrant_vector_store = QdrantVectorStore(
+             client=self.qdrant_client,
+             collection_name=self.name,
+             embedding=self.embedding
+         )
+ 
+         self.qdrant_vector_store.add_documents(self.documents)
+         return self
+ 
+ 
+ class AdvancedRetriever:
+     def __init__(self, vectorstore):
+         self.vectorstore = vectorstore
+         return
+ 
+     def set_up_simple_retriever(self):
+         simple_retriever = self.vectorstore.as_retriever(
+             search_type='similarity',
+             search_kwargs={
+                 'k': 5
+             }
+         )
+         return simple_retriever
+ 
+     def set_up_multi_query_retriever(self, llm):
+         retriever = self.set_up_simple_retriever()
+         advanced_retriever = MultiQueryRetriever.from_llm(
+             retriever=retriever, llm=llm
+         )
+         return advanced_retriever
+ 
+ 
+ def run_and_eval_rag_pipeline(location, collection_name, embed_dim, text_splits, embeddings,
+                               prompt, qa_llm, metrics, test_df):
+     """
+     Helper function that runs and evaluates different rag pipelines
+     based on the text_splits presented to the pipeline
+     """
+     # vector store
+     vs = VectorStore(location=location,
+                      name=collection_name,
+                      documents=text_splits,
+                      size=embed_dim,
+                      embedding=embeddings)
+ 
+     qdvs = vs.set_up_vectorstore().qdrant_vector_store
+ 
+     # retriever
+     retriever = AdvancedRetriever(vectorstore=qdvs).set_up_simple_retriever()
+ 
+     # q&a chain using LCEL
+     retrieval_chain = (
+         {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
+         | RunnablePassthrough.assign(context=itemgetter("context"))
+         | {"response": prompt | qa_llm, "context": itemgetter("context")}
+     )
+ 
+     # get questions and ground truth
+     test_questions = test_df["question"].values.tolist()
+     test_groundtruths = test_df["ground_truth"].values.tolist()
+ 
+     # run RAG pipeline
+     answers = []
+     contexts = []
+ 
+     for question in test_questions:
+         response = retrieval_chain.invoke({"question": question})
+         answers.append(response["response"].content)
+         contexts.append([context.page_content for context in response["context"]])
+ 
+     # Save RAG pipeline results to HF Dataset object
+     response_dataset = Dataset.from_dict({
+         "question": test_questions,
+         "answer": answers,
+         "contexts": contexts,
+         "ground_truth": test_groundtruths
+     })
+ 
+     # Run RAGAS evaluation using the supplied metrics
+     results = evaluate(response_dataset, metrics)
+ 
+     # save results to df
+     results_df = results.to_pandas()
+ 
+     return results, results_df
+ 
+ 
+ def set_up_rag_pipeline(location, collection_name,
+                         embeddings, embed_dim,
+                         prompt, qa_llm,
+                         text_splits):
+     """
+     Helper function that sets up a RAG pipeline
+     Inputs
+         location: in-memory or persistent store
+         collection_name: name of collection, string
+         embeddings: embeddings object to be used
+         embed_dim: embedding dimension
+         prompt: prompt used in RAG pipeline
+         qa_llm: LLM used to generate response
+         text_splits: list containing text splits
+     Returns a retrieval chain
+     """
+     # vector store
+     vs = VectorStore(location=location,
+                      name=collection_name,
+                      documents=text_splits,
+                      size=embed_dim,
+                      embedding=embeddings)
+ 
+     qdvs = vs.set_up_vectorstore().qdrant_vector_store
+ 
+     # retriever
+     retriever = AdvancedRetriever(vectorstore=qdvs).set_up_simple_retriever()
+ 
+     # q&a chain using LCEL
+     retrieval_chain = (
+         {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
+         | RunnablePassthrough.assign(context=itemgetter("context"))
+         | {"response": prompt | qa_llm, "context": itemgetter("context")}
+     )
+ 
+     return retrieval_chain
+ 
+ 
+ def test_rag_pipeline(retrieval_chain, list_of_questions):
+     """
+     Tests a RAG pipeline
+     Inputs
+         retrieval_chain: retrieval chain
+         list_of_questions: list of questions to use to test the RAG pipeline
+     Output
+         list of RAG-pipeline-generated responses to each question
+     """
+     all_answers = []
+     for question in list_of_questions:
+         response = retrieval_chain.invoke({'question': question})
+         answer = response["response"].content
+         all_answers.append(answer)
+     return all_answers
+ 
+ 
+ def get_vibe_check_on_list_of_questions(collection_name,
+                                         embeddings, embed_dim,
+                                         prompt, llm, text_splits,
+                                         list_of_questions):
+     """
+     HELPER FUNCTION
+     sets up a retrieval chain for each scenario and prints out the results
+     of the q-and-a for any list of questions
+     """
+ 
+     # set up baseline retriever
+     retrieval_chain = \
+         set_up_rag_pipeline(location=":memory:", collection_name=collection_name,
+                             embeddings=embeddings, embed_dim=embed_dim,
+                             prompt=prompt, qa_llm=llm,
+                             text_splits=text_splits)
+ 
+     # run RAG pipeline and get responses
+     answers = test_rag_pipeline(retrieval_chain, list_of_questions)
+ 
+     # create (question, answer) tuples
+     q_and_a = [(x, y) for x, y in zip(list_of_questions, answers)]
+ 
+     # print out question/answer pairs to review the performance of the pipeline
+     for i, item in enumerate(q_and_a):
+         print('=================')
+         print(f'=====question number: {i} =============')
+         print(item[0])
+         print(item[1])
+ 
+     return retrieval_chain, q_and_a
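For reference, a minimal end-to-end use of these helpers might look like the sketch below. The collection name, prompt wording, and question are illustrative, and it assumes OPENAI_API_KEY is set:

```python
# Illustrative wiring of the helpers above into a working retrieval chain.
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from myutils.rag_pipeline_utils import load_all_pdfs, SimpleTextSplitter, set_up_rag_pipeline

docs = load_all_pdfs(["https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf"])
splits = SimpleTextSplitter(chunk_size=1000, chunk_overlap=300, documents=docs).split_text()

chain = set_up_rag_pipeline(
    location=":memory:",
    collection_name="demo_collection",   # illustrative name
    embeddings=OpenAIEmbeddings(model="text-embedding-3-small"),
    embed_dim=1536,
    prompt=ChatPromptTemplate.from_template("Context:\n{context}\n\nQuestion:\n{question}"),
    qa_llm=ChatOpenAI(model_name="gpt-4o-mini", temperature=0),
    text_splits=splits,
)
print(chain.invoke({"question": "What risks does this framework identify?"})["response"].content)
```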
myutils/ragas_pipeline.py ADDED
@@ -0,0 +1,86 @@
+ """
+ ragas_pipeline.py
+ 
+ Implements the core pipeline to generate a test set for RAGAS,
+ and to run a RAGAS evaluation of a RAG pipeline.
+ """
+ 
+ from langchain_openai import ChatOpenAI, OpenAIEmbeddings
+ from ragas.testset.generator import TestsetGenerator
+ from ragas import evaluate
+ 
+ from datasets import Dataset
+ 
+ from myutils.rag_pipeline_utils import SimpleTextSplitter
+ 
+ 
+ class RagasPipeline:
+     def __init__(self, generator_llm_model, critic_llm_model, embedding_model,
+                  number_of_qa_pairs,
+                  chunk_size, chunk_overlap, documents,
+                  distributions):
+         self.generator_llm = ChatOpenAI(model=generator_llm_model)
+         self.critic_llm = ChatOpenAI(model=critic_llm_model)
+         self.embeddings = OpenAIEmbeddings(model=embedding_model)
+         self.number_of_qa_pairs = number_of_qa_pairs
+ 
+         self.chunk_size = chunk_size
+         self.chunk_overlap = chunk_overlap
+         self.documents = documents
+ 
+         self.distributions = distributions
+ 
+         self.generator = TestsetGenerator.from_langchain(
+             self.generator_llm,
+             self.critic_llm,
+             self.embeddings
+         )
+         return
+ 
+     def generate_testset(self):
+         """Split the documents and generate a synthetic test set of QA pairs."""
+         text_splitter = SimpleTextSplitter(
+             chunk_size=self.chunk_size,
+             chunk_overlap=self.chunk_overlap,
+             documents=self.documents
+         )
+         ragas_text_splits = text_splitter.split_text()
+ 
+         testset = self.generator.generate_with_langchain_docs(
+             ragas_text_splits,
+             self.number_of_qa_pairs,
+             self.distributions
+         )
+ 
+         testset_df = testset.to_pandas()
+         return testset_df
+ 
+     def ragas_eval_of_rag_pipeline(self, retrieval_chain, ragas_questions, ragas_groundtruths, ragas_metrics):
+         """
+         Helper function that runs and evaluates a rag pipeline
+         on RAGAS test questions
+         """
+         # run RAG pipeline on RAGAS synthetic questions
+         answers = []
+         contexts = []
+ 
+         for question in ragas_questions:
+             response = retrieval_chain.invoke({"question": question})
+             answers.append(response["response"].content)
+             contexts.append([context.page_content for context in response["context"]])
+ 
+         # Save RAG pipeline results to HF Dataset object
+         response_dataset = Dataset.from_dict({
+             "question": ragas_questions,
+             "answer": answers,
+             "contexts": contexts,
+             "ground_truth": ragas_groundtruths
+         })
+ 
+         # Run RAGAS evaluation using the supplied metrics
+         results = evaluate(response_dataset, ragas_metrics)
+ 
+         # save results to df
+         results_df = results.to_pandas()
+ 
+         return results, results_df
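A sketch of how this class might be driven: the model names, pair count, and distributions dict below are assumptions, consistent with the ragas 0.1.x API that `TestsetGenerator.from_langchain` comes from, not part of this commit:

```python
# Illustrative use of RagasPipeline; parameter values are assumptions.
from ragas.testset.evolutions import simple, reasoning, multi_context
from myutils.rag_pipeline_utils import load_all_pdfs
from myutils.ragas_pipeline import RagasPipeline

docs = load_all_pdfs(["https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf"])
ragas_pipe = RagasPipeline(
    generator_llm_model="gpt-4o-mini",
    critic_llm_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
    number_of_qa_pairs=10,
    chunk_size=1000,
    chunk_overlap=300,
    documents=docs,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)
testset_df = ragas_pipe.generate_testset()   # pandas DataFrame of questions + ground truth
```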
requirements.txt ADDED
@@ -0,0 +1,18 @@
+ langchain
+ langchain-openai
+ langchain_core==0.2.38
+ langchain-community
+ langchainhub
+ langchain-qdrant
+ langchain_huggingface
+ langchain-text-splitters
+ langchain_experimental
+ ragas
+ openai
+ pymupdf
+ faiss-cpu
+ sentence_transformers
+ datasets
+ pyarrow==14.0.1
+ chainlit==0.7.700
+ python-dotenv==1.0.0