Build a Complete OpenSource LLM RAG QA Chatbot — Choose the Model

In this second episode, we’re diving into selecting the Open Source model for our RAG Application. But how do we precisely go about choosing an open source model?

To begin this journey, the Chatbot Arena leaderboard hosted on HuggingFace (https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard) stands out as an excellent starting point. It offers an overview of many chatbot models, although not all of them are open source, so it's crucial to check the licensing details before proceeding.

HuggingFace Leaderboard

What's the next step? Personally, I'm inclined towards experimenting with the Mistral-Instruct model or the Zephyr model. Why these choices? They've undergone extensive testing by developers over several months and, in contrast to the Llama models, they come with a more permissive license. However, throughout this series, you're encouraged to select the model that aligns best with your preferences. For the initial tests and the series, I'll opt for the Mistral-7B-Instruct model.

Before diving into our testing phase, we require additional information. How can we effectively run a massive model on a modest PC? This is where quantization becomes pivotal.

What is Quantization?

Quantization, in the context of language models, refers to the process of reducing the precision of the numerical values within the model, and with it the model's size. This reduction is often applied to both the model's parameters and activations, which are typically stored as high-precision floating-point numbers.

The goal of quantization is to decrease the computational resources required for running the model while trying to preserve its performance as much as possible. By converting these high-precision values into lower bit representations (like 8-bit integers), the model’s size can be significantly reduced, leading to faster inference times and decreased memory usage.

For instance, instead of representing numbers with 32 bits (single-precision floating-point), quantization might use 8-bit integers to represent weights, biases, and activations. However, this reduction in precision can potentially affect the model’s accuracy and performance. Hence, techniques like quantization-aware training or fine-tuning after quantization might be used to mitigate these effects by training the model with awareness of the reduced precision.

Overall, quantization is a technique used to make language models more efficient for deployment on devices with limited computational resources like mobile phones or edge devices, enabling faster inference and reduced memory footprint while aiming to maintain performance.
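To make this concrete, here is a minimal NumPy sketch of symmetric 8-bit uniform quantization applied to a tiny weight tensor. It is an illustration of the idea only, not code from any particular quantization library, and the values are made up for the example:

import numpy as np

# Toy "weights" stored as 32-bit floats, as in an unquantized model.
weights = np.array([0.82, -1.47, 0.03, 2.15, -0.66], dtype=np.float32)

# Symmetric uniform quantization to signed 8-bit integers:
# map the range [-max|w|, +max|w|] onto [-127, 127] with a single scale factor.
scale = np.abs(weights).max() / 127.0
q_weights = np.round(weights / scale).astype(np.int8)  # 1 byte each instead of 4

# At inference time the integers are rescaled back to approximate floats.
dequantized = q_weights.astype(np.float32) * scale

print("int8 weights :", q_weights)
print("round-trip   :", dequantized)
print("max abs error:", np.abs(weights - dequantized).max())

The scale factor and the round-trip error shown here are exactly the two quantities that the different quantization schemes below trade off against each other.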

Quantization can be performed with several different types of algorithms:

Quantization algorithms aim to reduce the memory and computational requirements of models by converting their parameters and activations into lower bit representations. Here are a few different quantization algorithms commonly used in deep learning:

- Fixed-Point Quantization: This method represents numbers using fixed-point arithmetic, typically using integers instead of floating-point numbers. It's a straightforward approach where floating-point values are scaled and rounded to fit within a fixed range, often using a power-of-two scaling factor.
- Uniform Quantization: In uniform quantization, the range of values is divided into equal intervals. For instance, in 8-bit quantization, the range of values would be divided into 256 equally spaced intervals.
- Non-Uniform Quantization: Unlike uniform quantization, this method allows for variable-sized intervals. It assigns more bits to regions that require higher precision and fewer bits to regions where precision isn't as critical. This adaptive approach can improve the trade-off between accuracy and model size.
- Quantization-Aware Training: Instead of post-training quantization, this technique involves training models with awareness of quantization. During training, it introduces quantization-related constraints or considerations, ensuring that the model adapts to lower bit precision from the start, potentially maintaining better accuracy.
- Vector Quantization: This technique clusters similar vectors together and represents them with a single codebook entry. It's often used in scenarios where the quantization process benefits from identifying and grouping similar values together.
- Weight Sharing: Weight sharing quantization methods aim to reduce the number of unique weights in a model. This involves grouping similar or near-identical weights together, storing them only once, and referencing them in multiple locations within the model.
- Sparsity and Zero-Centered Quantization: These techniques exploit the sparsity of models where many parameters are close to zero. By focusing on the non-zero elements and using zero-centered quantization techniques, they aim to quantize mainly the non-zero values while representing zeros more efficiently.
- Dynamic Quantization: Dynamic quantization adjusts the quantization levels dynamically during inference, allowing for more flexibility based on the actual distribution of values encountered during runtime (a short sketch of this appears below).

Each algorithm comes with its advantages and limitations, and the choice often depends on the specific use case, the model architecture, desired inference speed, memory constraints, and acceptable trade-offs in accuracy. Experimentation and fine-tuning are often necessary to find the best quantization method for a particular scenario.
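As a practical example of one of these approaches, PyTorch exposes post-training dynamic quantization as a single call. The sketch below quantizes the linear layers of a toy model to int8; the model itself is just a stand-in for illustration and is not something we use later in the article:

import torch
import torch.nn as nn

# A toy model standing in for a much larger network.
model = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)

# Post-training dynamic quantization: weights of the listed layer types are
# stored as int8, while activations are quantized on the fly during inference.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized_model(x).shape)  # same interface, smaller weight storage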

Understanding quantization presents us with two distinct paths to explore:

- Self-hosted model: While self-hosting a quantized model presents a route to maintain control and customization, it's crucial to weigh the associated costs and performance considerations. For instance, initial setup costs can be high, encompassing infrastructure requirements, maintenance, and potential scalability challenges. Additionally, on systems with limited resources, the performance of self-hosted models might not meet desired benchmarks, impacting the overall user experience.
- APIs: Despite reservations about the encapsulated nature of some APIs, the prospect of utilizing open source models through more affordable API services brings an intriguing angle to the table. This approach could potentially bridge the gap between accessibility and control, offering the benefits of established infrastructure and cost-effectiveness. Leveraging tools like Perplexity and LlamaIndex opens avenues to tap into the power of open source models through APIs, providing a compromise between the 'black box' nature of certain services and the desire for transparency and flexibility.

Testing these methods within the Google Colab environment using a T4 runtime allows us to simulate real-world scenarios in a controlled setting. It empowers us to evaluate the trade-offs between these two approaches — weighing factors such as performance, resource utilization, scalability, and cost-effectiveness. By doing so, we gain invaluable insights that guide us toward making an informed decision aligned with our project’s objectives.
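If you are following along, it is worth confirming that the Colab session actually has a T4 attached before loading anything heavy. A quick check (using PyTorch, which comes preinstalled on Colab) looks like this:

import torch

# Optional sanity check for the Colab runtime
# (Runtime -> Change runtime type -> T4 GPU).
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # should report a Tesla T4 on a T4 runtime
else:
    print("No GPU detected - switch the Colab runtime type before continuing.")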

In essence, this exploration not only determines the most suitable method for our current needs but also lays the groundwork for potential scalability and adaptability as the project progresses.

Local Model

!pip install git+https://github.com/run-llama/llama_index
!pip install transformers accelerate bitsandbytes

We have installed the dependencies.

from llama_index.readers import BeautifulSoupWebReader

url = "https://www.theverge.com/2023/9/29/23895675/ai-bot-social-network-openai-meta-chatbots"

documents = BeautifulSoupWebReader().load_data([url])

Now we load some data for the tests.

import torch
from transformers import BitsAndBytesConfig
from llama_index.prompts import PromptTemplate
from llama_index.llms import HuggingFaceLLM

# Load the model in 4-bit NF4 with double quantization so it fits on a single T4
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

llm = HuggingFaceLLM(
    model_name="mistralai/Mistral-7B-Instruct-v0.1",
    tokenizer_name="mistralai/Mistral-7B-Instruct-v0.1",
    # Wrap every query in the Mistral instruction format
    query_wrapper_prompt=PromptTemplate("<s>[INST] {query_str} [/INST] </s>\n"),
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"quantization_config": quantization_config},
    # tokenizer_kwargs={},
    generate_kwargs={"temperature": 0.2, "top_k": 5, "top_p": 0.95},
    device_map="auto",
)

Load the Mistral model with 4-bit quantization.

from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(llm=llm, embed_model="local:BAAI/bge-small-en-v1.5")

Create the ServiceContext and choose the embedding model; for this example the embedding model is BAAI/bge-small-en-v1.5.
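The "local:" prefix tells LlamaIndex to run the embedding model on our own hardware instead of calling a remote embedding API. As an optional check, you can embed a sentence directly and inspect the vector size; this is a small sketch that assumes the resolved model is reachable as service_context.embed_model, as in the 0.9-era llama_index used here:

# Optional: inspect what the local embedding model produces.
embed_model = service_context.embed_model
vector = embed_model.get_text_embedding("Open source LLMs are fun to work with.")
print(len(vector))  # bge-small-en-v1.5 produces 384-dimensional vectors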

from llama_index import VectorStoreIndex

vector_index = VectorStoreIndex.from_documents(documents, service_context=service_context)

Building a local VectorStoreIndex.

from llama_index.response.notebook_utils import display_response

query_engine = vector_index.as_query_engine(response_mode="compact")

response = query_engine.query("How do OpenAI and Meta differ on AI tools?")

display_response(response)

The generated response captures the key details from the article and answers the question well. However, while the response quality is solid, the generation time is noticeably long, averaging around 10 seconds.

Optimizing generation time while maintaining response quality is pivotal, especially in real-time or time-sensitive applications. Techniques such as model caching, parallelization, or leveraging more optimized hardware might aid in mitigating this delay.

Striking a balance between response excellence and swift generation remains a key consideration. This trade-off often demands a meticulous approach to ensure efficient processing without compromising on the richness and accuracy of the generated content.

The generation of the final response took approximately 10 seconds.
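If you want to measure the latency yourself rather than rely on the notebook output, a simple wall-clock timer around the query call is enough. This reuses the query_engine built above; exact timings will vary with hardware and runtime load:

import time

question = "How do OpenAI and Meta differ on AI tools?"

start = time.perf_counter()
response = query_engine.query(question)
elapsed = time.perf_counter() - start

print(f"Generated in {elapsed:.1f} s")
print(response)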

You can try these steps yourself, and more, in the original LlamaIndex notebook: https://colab.research.google.com/drive/1ZAdrabTJmZ_etDp10rjij_zME2Q3umAQ?usp=sharing#scrollTo=ghqk6C04TD3b

Shifting gears, we now explore the Perplexity APIs using a similar test structure. This involves reloading the same data and, just as before, installing the LlamaIndex library.

However, this time, a notable departure lies in our approach. Instead of relying on a locally hosted model, we’re tapping into the power of the LLM model through Perplexity’s APIs. This transition introduces a shift from local resource utilization to leveraging remote infrastructure, potentially altering our considerations around latency, accessibility, and data transfer speeds.

By interfacing with the LLM model via Perplexity’s APIs, we’re afforded a unique opportunity to evaluate the trade-offs between local and remote model utilization. This includes considerations of computational resources, network latency, and the overall responsiveness of the API-based approach compared to locally hosted models.

Furthermore, while this shift introduces potential advantages in terms of scalability and accessibility, it also necessitates a thorough evaluation of API performance and reliability. Our aim remains to assess not only the quality of responses but also the efficiency and speed of generating those responses using the Perplexity APIs.

I will not repeat the LlamaIndex library installation.

Now we load the LLM from Perplexity's API instead of a local model:

from llama_index.llms import Perplexity

pplx_api_key = "your_key"

llm = Perplexity(
    api_key=pplx_api_key, model="mistral-7b-instruct", temperature=0.5
)

Read the same documents as before:

from llama_index.readers import BeautifulSoupWebReader

url = "https://www.theverge.com/2023/9/29/23895675/ai-bot-social-network-openai-meta-chatbots"

documents = BeautifulSoupWebReader().load_data([url])

We also use the same embedding model:

from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(llm=llm, embed_model="local:BAAI/bge-small-en-v1.5")

Rebuild the index:

from llama_index import VectorStoreIndex

vector_index = VectorStoreIndex.from_documents(documents, service_context=service_context)

And generate the response:

from llama_index.response.notebook_utils import display_response

query_engine = vector_index.as_query_engine(response_mode="compact")

response = query_engine.query("How do OpenAI and Meta differ on AI tools?")

display_response(response)

And now the results! The response is also good, and even more concise, but the best part is the latency: about one second to generate the response, which is very fast.

Again, the full notebook to reproduce this is here: https://colab.research.google.com/drive/1Z49ulP_uIU6HikzrpLVj1cGT7rntz5ZV?usp=sharing (this time made by me).

In navigating the landscape of open source models and their implementation, this exploration has revealed intriguing pathways for leveraging these models within the context of the RAG Application. The journey began with a careful evaluation of various models, considering their performance, licensing, and developer testing.

The dichotomy between self-hosted models and API-based approaches illuminated diverse considerations. While self-hosting offers control and customization, API utilization introduces accessibility and potential cost-effectiveness. The nuanced evaluation between local and remote model utilization, including considerations of performance, scalability, and responsiveness, became paramount.

Quantization emerged as a pivotal tool, allowing models to be compressed for efficient execution on varying hardware and mitigating resource constraints while largely preserving quality.

The article dived into testing methodologies within Google Colab, harnessing the power of T4 runtimes to simulate real-world scenarios. This testing ground provided invaluable insights into the performance, latency, and generation times of models, underpinning informed decision-making.

Transitioning to the Perplexity APIs opened vistas to remote model access, broadening scalability possibilities while warranting a critical assessment of API performance and reliability.

Ultimately, the article underscores the dynamic interplay between accessibility, performance, and control in the realm of open source models. The choices made here serve not only the immediate project needs but also lay the groundwork for adaptability and evolution in the ever-evolving landscape of AI implementation.

As technology continues to advance, this journey remains an evolving narrative — a testament to the ongoing quest for optimizing AI model utilization within real-world applications.

In the next episode, we will set up the Flask server and configure the Perplexity model on it.

If you have questions, feel free to leave a comment and if you enjoy the story leave a clap!

If I see there are a lot of questions about Perplexity or costs, I will write dedicated guides on those topics!
