Towards a Fully Arabic Retrieval-Augmented Generation (RAG) Pipeline

Published November 30, 2024


Can We Build a Fully Arabic RAG Pipeline?

Building a fully Arabic Retrieval-Augmented Generation (RAG) pipeline is an ambitious yet achievable goal. This requires combining advancements in retrieval systems, reranking techniques, and generative models, all tailored to handle the unique complexities of the Arabic language. While each component serves a distinct role, they must work seamlessly together to deliver high-quality results in downstream NLP applications. To create a comprehensive Arabic RAG pipeline, we need three main components:

1. Arabic Retrieval System [GATE Embeddings]

The retrieval system is the foundation of the pipeline, responsible for efficiently identifying relevant documents or text chunks from a large corpus. For an Arabic-specific retrieval system, we must focus on:

✨ Semantic Understanding: Leveraging embeddings that capture the nuances of Arabic morphology, syntax, and semantics.

✨ Pretrained Models: Using state-of-the-art embeddings fine-tuned for Arabic retrieval tasks.

Semantic embeddings are the foundation of this retrieval process: they transform textual data into high-dimensional vectors that encode semantic information. These embeddings are generated by neural network architectures such as BERT, which are pre-trained on large corpora and fine-tuned for specific tasks. However, current semantic embedding models face several challenges, including handling contextually rich queries and ensuring efficient retrieval without compromising accuracy. Additionally, most embeddings are designed primarily for English or multilingual contexts, with limited focus on Arabic, which reduces their effectiveness for Arabic-specific applications.
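To make this concrete, the short sketch below encodes two Arabic sentences and compares them with cosine similarity. This is a minimal illustration assuming the sentence-transformers library; it uses the GATE model introduced in the next subsection, but any Arabic sentence-embedding model would work the same way.

from sentence_transformers import SentenceTransformer, util

# Load an Arabic sentence-embedding model (here: the GATE model discussed below).
model = SentenceTransformer("Omartificial-Intelligence-Space/GATE-AraBert-v1")

sentences = [
    "ما هي عاصمة المملكة العربية السعودية؟",  # "What is the capital of Saudi Arabia?"
    "الرياض هي عاصمة السعودية.",              # "Riyadh is the capital of Saudi Arabia."
]

# Encode both sentences into dense vectors and compare them.
embeddings = model.encode(sentences)
score = util.cos_sim(embeddings[0], embeddings[1])
print(float(score))  # closer to 1.0 means more semantically similar

Semantically related pairs like these score high even when they share few surface tokens, which is exactly the property retrieval needs.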

What Have We Done So Far to Enhance Arabic Retrieval?

To improve Arabic retrieval, we focused on creating high-quality embeddings tailored to the unique characteristics of the Arabic language. One significant milestone in this effort is the development of Omartificial-Intelligence-Space/GATE-AraBERT-v1, a powerful model designed specifically for Arabic text embedding.

About GATE-AraBERT-v1:

GATE, or General Arabic Text Embedding, is a specialized embedding model trained using the SentenceTransformers framework in a multi-task setup. This setup ensures the model performs well across multiple related tasks, enhancing its versatility and robustness.

To make GATE-AraBERT-v1 highly effective for semantic retrieval, it was trained on:

🔸 AllNLI (All Natural Language Inference): this dataset helps the model understand relationships between sentences, such as entailment, contradiction, and neutrality, and ensures the embeddings capture contextual and semantic relationships between text pairs.

🔸 STS (Semantic Textual Similarity): this dataset enhances the model's ability to quantify how similar two pieces of text are in meaning.

🔸 Triplet and Semantic Data: the model was also trained with a triplet objective, which optimizes embeddings to minimize the distance between semantically similar texts while maximizing the distance between dissimilar ones (a minimal sketch follows below).
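For intuition, here is what triplet-based fine-tuning looks like with the SentenceTransformers API. The base model and the toy triplet are illustrative assumptions, not the exact GATE training recipe.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Illustrative Arabic base encoder (an assumption, not necessarily GATE's exact base).
model = SentenceTransformer("aubmindlab/bert-base-arabertv02")

# Each triplet: (anchor, semantically similar positive, unrelated negative).
train_examples = [
    InputExample(texts=["أين تقع الرياض؟", "الرياض تقع في وسط المملكة.", "القطط حيوانات أليفة."]),
    # ... many more triplets in practice
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# TripletLoss pulls anchor/positive embeddings together and pushes anchor/negative apart.
train_loss = losses.TripletLoss(model=model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)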

This training approach ensures the model captures deep semantic nuances in Arabic text, which is crucial for effective retrieval.

Key Advantages

The development and fine-tuning of GATE-AraBERT-v1 have led to significant improvements in Arabic retrieval capabilities:

🔹 Semantic Understanding: The embeddings are fine-tuned to grasp the complex morphology, syntax, and semantics of Arabic.

🔹 Enhanced Document Retrieval: By leveraging semantic similarity, the model retrieves documents that are highly relevant to the user's query, even when exact keyword matches are absent.

Applications and Impact

The GATE-AraBERT-v1 embeddings are a cornerstone for building systems that require precise and semantically rich Arabic retrieval. From question-answering systems to information retrieval platforms, this model significantly boosts the quality of results and user satisfaction.

By focusing on embedding quality and training with diverse and semantically rich datasets, we have laid a solid foundation for enhancing Arabic retrieval. This progress brings us closer to realizing a fully integrated Arabic RAG pipeline.

2. A Good Reranker (Optional but Critical for Enhanced Precision)

Reranking is a vital step in Retrieval-Augmented Generation (RAG) pipelines, as it refines the set of documents retrieved in the initial stage. While the initial retrieval focuses on speed, it often returns a mix of highly relevant and tangential results.

Reranking ensures that only the most relevant information is prioritized, reducing noise and irrelevant content.

By feeding the LLM (Large Language Model) a more precise and contextually appropriate set of documents, reranking enhances the accuracy and coherence of responses. Without reranking, the LLM risks processing less relevant data, increasing the chance of generating inaccurate or ungrounded information. This makes reranking essential for optimizing both the efficiency and reliability of RAG systems.

What Have We Done So Far to Enhance Arabic Reranking?

To improve Arabic reranking in Retrieval-Augmented Generation (RAG) systems, we developed Omartificial-Intelligence-Space/ARA-Reranker-V1, also known as ARM-V1.

This model is specifically designed to handle Arabic language reranking tasks with precision. Unlike embedding models that generate vector representations, ARM-V1 evaluates the direct similarity between a query and a passage, producing a relevance score that identifies the most contextually suitable documents. The output score is normalized to a range of [0, 1] using a sigmoid function, ensuring an interpretable relevance metric.
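In practice, scoring with the reranker is a one-liner once the model is loaded. The sketch below assumes the sentence-transformers CrossEncoder API, which typically applies the sigmoid automatically for single-label models.

from sentence_transformers import CrossEncoder

# Load the Arabic reranker; each (query, passage) pair gets a score in [0, 1].
reranker = CrossEncoder("Omartificial-Intelligence-Space/ARA-Reranker-V1", max_length=512)

query = "متى تأسست الدولة الأموية؟"
passages = [
    "تأسست الدولة الأموية بعد مقتل الإمام علي بن أبي طالب.",  # relevant
    "الرياض هي عاصمة المملكة العربية السعودية.",              # irrelevant
]

# Score each (query, passage) pair; higher means more relevant.
scores = reranker.predict([(query, p) for p in passages])
print(scores)  # the first passage should score far higher than the second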

Training Process

ARM-V1 is trained on a rich dataset of positive and hard negative query-passage pairs, enabling it to excel in distinguishing relevant results from distractors. Key aspects of the training process include:

▪️Leveraging a dataset with multiple hard negatives per query to improve robustness.

▪️Using a CrossEncoder architecture that directly computes relevance scores.

▪️Employing batch training with thousands of samples, ensuring high performance even in challenging scenarios.

Below is a simplified overview of the training:

▫️ Dataset: queries, positive examples, and multiple negatives are loaded and processed into training samples using InputExample.

▫️ A CrossEncoder model is initialized with a continuous output range for relevance scores.

▫️ The model is trained using a DataLoader for batching and a reranking evaluator (CERerankingEvaluator) to monitor performance. Warmup steps and automatic mixed precision (AMP) are used to optimize training efficiency.

▫️ The model is fine-tuned over three epochs with frequent evaluations to ensure consistent improvement.
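Putting these steps together, a simplified version of the training setup might look like the sketch below. It assumes the SentenceTransformers CrossEncoder API; the base model, batch size, and toy samples are illustrative, not the exact ARM-V1 recipe.

from torch.utils.data import DataLoader
from sentence_transformers import InputExample
from sentence_transformers.cross_encoder import CrossEncoder
from sentence_transformers.cross_encoder.evaluation import CERerankingEvaluator

# Positive pairs get label 1.0; hard negatives get label 0.0.
train_samples = [
    InputExample(texts=["متى تأسست الدولة الأموية؟",
                        "تأسست الدولة الأموية بعد مقتل الإمام علي."], label=1.0),
    InputExample(texts=["متى تأسست الدولة الأموية؟",
                        "الرياض هي عاصمة السعودية."], label=0.0),
    # ... thousands of samples, with several hard negatives per query
]

# Held-out queries for the reranking evaluator.
dev_samples = [{
    "query": "متى تأسست الدولة الأموية؟",
    "positive": ["تأسست الدولة الأموية بعد مقتل الإمام علي."],
    "negative": ["الرياض هي عاصمة السعودية."],
}]

# num_labels=1 yields a single continuous relevance score per pair.
model = CrossEncoder("aubmindlab/bert-base-arabertv02", num_labels=1, max_length=512)

train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=32)
evaluator = CERerankingEvaluator(dev_samples, name="dev")

model.fit(
    train_dataloader=train_dataloader,
    evaluator=evaluator,
    epochs=3,
    warmup_steps=1000,
    use_amp=True,  # automatic mixed precision
)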

The ARM-V1 model has become a cornerstone of our Arabic RAG pipeline, drastically improving the precision and reliability of document reranking. By narrowing down results to the most relevant documents, ARM-V1 ensures that the generative model receives high-quality input, reducing noise and enhancing response accuracy.

3. A Good Generative Arabic LLM

At the heart of the RAG pipeline lies the generative model, responsible for synthesizing coherent and contextually accurate responses based on retrieved information. However, when it comes to Arabic, there is a significant gap in reliable, open-source Large Language Models (LLMs) specifically designed for this step.

Currently, most systems rely on costly API-based solutions like OpenAI's GPT models, Anthropic's Claude, or Google's Gemini, which offer robust capabilities but come with challenges such as cost, limited customization, and data privacy concerns. Reaching a truly effective Arabic-specific generative model requires significant investment in language-focused pretraining, fine-tuning, and alignment with user preferences.

The Promise of ALLaM

A promising development in this domain is ALLaM (Arabic Large Language Model), a state-of-the-art LLM specifically designed for Arabic. ALLaM represents a significant step forward for Arabic NLP, offering capabilities that make it highly competitive with other global models.

Key features of ALLaM include:

🔸 Language Alignment and Knowledge Transfer: ALLaM employs vocabulary expansion and a mixture of Arabic and English pretraining to enable seamless second-language acquisition, ensuring strong Arabic capabilities without losing its English proficiency.

🔸 Parallel/Translated Data Usage: The model leverages parallel datasets to align knowledge between Arabic and English, enhancing its ability to understand and generate Arabic text.

🔸 Human Preference Alignment: Extensive fine-tuning with human feedback has significantly improved ALLaM's performance, even surpassing larger models that lack such alignment.

🔸 State-of-the-Art Performance: ALLaM achieves top results on several Arabic NLP benchmarks, including MMLU Arabic, ACVA, and Arabic Exams, demonstrating its superior capabilities across various tasks.

While ALLaM showcases remarkable potential, it is not open-source, which limits its accessibility for broader use cases and community-driven enhancements.

The Need for Open-Source Arabic Generative Models

To truly enable robust Arabic RAG pipelines, there is an urgent need to develop open-source LLMs tailored to the Arabic language. Such models would allow for:

▪️ Customization: Adapting the model for specific use cases or domains.

▪️ Cost-Effectiveness: Reducing reliance on expensive APIs.

▪️ Data Privacy: Ensuring sensitive data does not leave organizational boundaries.

▪️ Community Contributions: Encouraging collaborative improvements and innovation within the Arabic NLP ecosystem.

While models like ALLaM are paving the way, the future of Arabic NLP hinges on the development of accessible and open-source solutions that cater specifically to the linguistic richness and complexity of Arabic.

The Integration Challenge: Setting the Stage for an Arabic RAG Pipeline

Building a cohesive Arabic RAG pipeline that integrates retrieval, reranking, and generation requires meticulous engineering and alignment across all components. Each plays a critical role, and their interplay determines the overall effectiveness of the system. As we move into the implementation phase, here’s what to consider:

Key Integration Requirements

⏺ Seamless Interoperability: Ensuring that the retrieval system and reranker produce output formats that align effortlessly with the input expectations of the generative model. This smooth handoff is crucial for maintaining pipeline efficiency and accuracy.

⏺ Performance Optimization: Addressing computational demands and reducing latency is vital to achieve real-time or near-real-time responses, especially for user-facing applications.

⏺ Arabic-Specific Evaluation Metrics: Measuring the pipeline's success involves using benchmarks tailored to Arabic NLP tasks for retrieval, ranking, and generation. These metrics ensure the system is meeting language-specific challenges effectively; the small sketch below illustrates one standard ranking metric.
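For the retrieval and reranking stages, ranking metrics such as Mean Reciprocal Rank (MRR) are a natural fit, provided they are computed on Arabic evaluation sets. A minimal, library-free sketch (the document IDs are hypothetical):

def mean_reciprocal_rank(rankings, relevant):
    """rankings: one ranked list of doc IDs per query; relevant: one set of relevant IDs per query."""
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(rankings)

# Two queries: the relevant document appears at rank 1 and rank 2 respectively.
print(mean_reciprocal_rank([["d1", "d2"], ["d3", "d4"]], [{"d1"}, {"d4"}]))  # 0.75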

Implementing the Arabic RAG Pipeline

In the next section, we’ll demonstrate how to build a fully functional Arabic RAG pipeline using LangChain, our custom retrieval system and reranker, and GPT-4o mini as the generative model.

This practical implementation showcases:

⏺ How the retrieval system quickly fetches potentially relevant documents.

⏺ How the reranker filters and prioritizes these documents, ensuring the generative model only receives the most contextually appropriate information.

⏺ How the LLM synthesizes coherent and precise responses based on high-quality inputs.

The Role of Reranking and Retrieval in Action

By combining these components, the pipeline demonstrates the power of reranking and retrieval in ensuring that users receive the most relevant and accurate outputs. This workflow not only highlights the advanced capabilities of our Arabic-specific retrieval and reranker but also sets a benchmark for creating efficient and reliable Arabic NLP solutions. In the following application, you’ll see how the synergy between these components drives exceptional performance, delivering contextually rich and accurate results tailored to user queries.

Building an Arabic RAG Pipeline: A Hands-On Guide

This hands-on guide demonstrates how to build an Arabic Retrieval-Augmented Generation (RAG) pipeline using LangChain, custom retrieval embeddings, a reranker, and the GPT-4o mini model for generation.

Dataset Overview:

The dataset contains historical information about the establishment of the Umayyad Caliphate, formatted as a two-page Arabic PDF.

Pipeline Components:

  • PDF Loader: Extracts text from the dataset. [Arabic Content]
  • Semantic Chunker: Splits the text into semantically meaningful chunks using Arabic embeddings. [GATE Embeddings]
  • Vectorstore Retriever: Retrieves the most relevant documents based on the query.
  • Cross-Encoder Reranker: Reranks the retrieved documents for relevance. [ARM-V1]
  • LLM: Generates answers based on the retrieved and reranked context. [GPT-4o-mini]

Implementation

1. Loading the Dataset

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("/path/to/your/dataset.pdf")
docs = loader.load()

This code loads the dataset as a list of documents. Each document is structured with metadata (e.g., page numbers, source).
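A quick sanity check confirms the load (the printed values are illustrative):

# Inspect the loaded documents.
print(len(docs))                    # number of pages, e.g. 2
print(docs[0].metadata)             # e.g. {'source': '/path/to/your/dataset.pdf', 'page': 0}
print(docs[0].page_content[:200])   # first characters of the Arabic text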

2. Splitting Documents into Chunks

from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="Omartificial-Intelligence-Space/GATE-AraBert-v1")
splitter = SemanticChunker(embeddings)
chunks = splitter.split_documents(docs)

This step ensures that the documents are divided into manageable, semantically meaningful chunks for retrieval.
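You can verify the chunking before building the index (output is illustrative):

# How many chunks did the semantic splitter produce, and what do they look like?
print(len(chunks))
print(chunks[0].page_content[:100])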

3. Creating the Vectorstore Retriever

from langchain_community.vectorstores import Chroma

vectorstore = Chroma.from_documents(chunks, embeddings)
vectorstore_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

The vectorstore uses semantic embeddings to retrieve the top 3 relevant documents for a given query.
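For example, the retriever can be queried directly, before any reranking is applied:

# Fetch the top-3 chunks for a sample query.
retrieved_docs = vectorstore_retriever.invoke("متى تأسست الدولة الأموية؟")
for doc in retrieved_docs:
    print(doc.metadata, doc.page_content[:80])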

4. Reranking the Retrieved Documents

from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain.retrievers import ContextualCompressionRetriever

model = HuggingFaceCrossEncoder(model_name="Omartificial-Intelligence-Space/ARA-Reranker-V1")
compressor = CrossEncoderReranker(model=model, top_n=3)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=vectorstore_retriever
)

The reranker reorders the retrieved documents based on their relevance scores, improving the quality of inputs for the LLM.
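Calling the compression retriever with the same query makes the difference easy to compare:

# The same query now flows through retrieval followed by reranking.
reranked_docs = compression_retriever.invoke("متى تأسست الدولة الأموية؟")
for doc in reranked_docs:
    print(doc.metadata, doc.page_content[:80])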

5. Generating the Final Answer

from langchain_community.chat_models import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

llm = ChatOpenAI(temperature=0.7, max_tokens=256, model_name="gpt-4o-mini")

template = """
<|system|>
أنت مساعد ذكي تجيب على الأسئلة باللغة العربية وبشكل واضح بدون أي إضافات
context: {context}
</s>
<|user|>
{query}
</s>
<|assistant|>
""" 

prompt = ChatPromptTemplate.from_template(template)
output_parser = StrOutputParser()

def format_docs(docs):
    # Join the reranked chunks into a single context string for the prompt.
    return "\n\n".join(doc.page_content for doc in docs)

qa_chain = (
    {"context": compression_retriever | format_docs, "query": RunnablePassthrough()}
    | prompt
    | llm
    | output_parser
)

query = "متى تأسست الدولة الأموية؟"
result = qa_chain.invoke(query)

This chain integrates all components: the reranked chunks are joined into a single context string, which the prompt and LLM use to produce the final answer.

Pipeline Execution

1. Query

متى تأسست الدولة الأموية؟

2. Retrieved and Reranked Contexts

This section shows the top contexts retrieved for the query "متى تأسست الدولة الأموية؟" and how they change after reranking, along with their associated metadata. The comparison demonstrates how reranking improves the relevance of the retrieved information. Here are the top 3 contexts returned by the vectorstore retriever:


Vectorstore Retrieved Contexts

| Context ID | Page Content (Snippet) | Metadata |
|---|---|---|
| 1 | تأسست الدولة الأموية مباشرة بعد مقتل الإمام علي بن أبي طالب... | {'page': 0, 'source': '/path/to/test2.pdf'} |
| 2 | مراحل الحكم في الدولة الأموية كان نظام الحكم مختلفًا تمامًا... | {'page': 0, 'source': '/path/to/test2.pdf'} |
| 3 | تأسست الدولة الأموية بعد معركة الجماعة التي وحدت الأمة... | {'page': 0, 'source': '/path/to/test2.pdf'} |

Reranked Contexts

| Context ID | Page Content (Snippet) | Metadata |
|---|---|---|
| 1 | تأسست الدولة الأموية مباشرة بعد مقتل الإمام علي بن أبي طالب... | {'page': 0, 'source': '/path/to/test2.pdf'} |
| 2 | تم إنشاء الدولة الأموية بعد أن اختار أهل الشام معاوية بن أبي سفيان... | {'page': 0, 'source': '/path/to/test2.pdf'} |
| 3 | مراحل الحكم في الدولة الأموية تُظهر انتقال الخلافة إلى شكل وراثي... | {'page': 0, 'source': '/path/to/test2.pdf'} |

Key Differences Between Retrieval and Reranking

| Aspect | Vectorstore Retrieval | Reranking |
|---|---|---|
| Purpose | Fetches documents semantically related to the query. | Filters and prioritizes documents by their specific relevance to the query. |
| Relevance | Retrieves broad contexts, which may include tangential information. | Refines the list so the most contextually accurate chunks come first. |
| Metadata | Contains references to the source PDF and page number. | Same as retrieval, but focused on the most relevant parts of the text. |
| Focus | Broader scope of related information. | Sharp focus on query-specific details. |

In summary:

  • Vectorstore Retrieval provides a broad range of semantically relevant contexts.
  • Reranking enhances focus and precision by prioritizing the most query-relevant content.
  • This synergy ensures the generative model produces coherent, accurate answers.

3. Final Answer Generated by the Pipeline

تأسست الدولة الأموية مباشرة بعد مقتل الإمام علي بن أبي طالب في عام 661 ميلادي، حيث تم انتخاب معاوية بن أبي سفيان كخليفة للمسلمين.

(Translation: The Umayyad state was founded immediately after the assassination of Imam Ali ibn Abi Talib in 661 CE, when Muawiya ibn Abi Sufyan was chosen as caliph of the Muslims.)

Conclusion

In this hands-on guide, we demonstrated how to build an effective Arabic Retrieval-Augmented Generation (RAG) pipeline by integrating advanced retrieval, reranking, and generative components. This pipeline highlights the following key insights:

⏺ Robust Retrieval: The use of GATE-AraBERT embeddings and a semantic chunker enabled efficient document retrieval, laying a strong foundation for the pipeline.

⏺ Enhanced Precision via Reranking: Incorporating the ARA-Reranker significantly improved the relevance of retrieved contexts, ensuring the generative model operates with highly focused and meaningful inputs.

⏺ Accurate Generation: By combining precise retrieval and reranking with the power of GPT-4o mini, the pipeline delivered clear and contextually accurate answers in Arabic.

⏺ Arabic-Specific Optimization: The pipeline addressed challenges unique to Arabic NLP, such as semantic complexity, by leveraging specialized tools and models.

Key Takeaway

This hands-on implementation underscores the importance of synergy between retrieval, reranking, and generation in producing high-quality outputs for Arabic language applications. By enhancing each component of the pipeline, we move closer to building efficient, scalable, and language-specific RAG systems.

As the ecosystem of Arabic NLP tools continues to grow, further optimizations—such as open-source Arabic LLMs—will unlock new possibilities, making these pipelines even more accessible and impactful.


Omer Nacar