Build a Complete OpenSource LLM RAG QA Chatbot — Flask Server

Previous Articles:

An In-depth Journey: the Model:

Introduction and Objective:

In this third segment, our aim is to establish a robust Flask API Backend that will serve responses for user inquiries in REST API format. To accomplish this, we’ll utilize Python, specifically leveraging the Flask library. While our previous discussions covered the use of MongoDB and Pinecone for the database, this episode will focus primarily on crafting the Flask architecture. Our key objectives include setting up the Pinecone VectorStore Index and retrieving initial sample responses through REST using tools like Postman.

Noteworthy Update:

An exciting development worth noting is the recent announcement by Perplexity. They’ve made their APIs accessible to everyone, providing a stable version free from usage limitations. This signifies a major leap forward, enabling users to deploy these APIs confidently in production environments.

Setting Up Foundational Files: index.py and app.py

Let’s take a closer look at the index.py file:

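from conf import load_env

# Load .env / .flaskenv before any other module reads os.getenv().
load_env()

print('Loading Flask app...')

# Importing the module registers its routes on the shared Flask app.
from apis import chats

from app import getFlaskApp

app = getFlaskApp()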

The index.py file serves as the starting point for our backend system. Here’s a breakdown of what each step accomplishes:

Loading Environment Variables: The load_env() function loads the required environment variables before any other module reads them, so the application starts with a complete configuration.

App Initialization: The print statement confirms that the Flask app has started loading.

Importing Modules: Here, we import the chat APIs from the apis package, which in turn pull the necessary configuration from the app.py file, to integrate them into our Flask application.

Initializing the Flask App: Using the getFlaskApp() function from app.py, we obtain the Flask app instance, ready for execution.

This file acts as the coordinator, bringing together various components, loading essential configurations, and ultimately setting the stage for the successful launch of our Flask API Backend.

Let’s expand on the app.py file and shed more light on its significance within our backend architecture.

from flask_limiter.util import get_remote_address
from flask_limiter import Limiter
from flask_cors import CORS
from flask import Flask

import os

app = Flask(__name__)

app.config["MONGO_URI"] = os.getenv("MONGODB_URI")
app.config['SECRET_KEY'] = os.getenv("SECRET_KEY")
app.config['CORS_HEADERS'] = 'Content-Type'

cors = CORS(app)

# The Limiter arguments are cut off in the source listing. Keying on the
# client IP via get_remote_address is assumed from the import above and
# from the description below.
limiter = Limiter(key_func=get_remote_address, app=app)

def getFlaskApp():
    return app

def getLimiter():
    return limiter

Flask App Initialization: The file begins by initializing a Flask application. This instance serves as the backbone of our API backend.

Configuration Setup: Key configurations such as the MongoDB URI, the secret key, and the CORS headers are set within the app’s configuration, fetched from environment variables for security and flexibility.

CORS Implementation: Cross-Origin Resource Sharing (CORS) is enabled using the flask_cors extension. This allows the API to be called by web pages served from different domains.

Rate Limiting with Limiter: An essential aspect of securing our API is rate limiting. The Limiter from flask_limiter is used here. It restricts the number of requests a client can make within a specific time frame, based on the client’s IP address. This mitigates abuse such as bots or request floods, which could inflate costs, especially for LLM (Large Language Model) API calls.

Significance in Production:

The role of this file goes beyond simple initialization. It lays the foundation for security measures, such as CORS handling and rate limiting, crucial for a production-grade application. Rate limiting, specifically, acts as a safeguard against abuse, ensuring responsible and legitimate usage of the API. This is pivotal since excessive calls to certain APIs might incur costs, and rate limiting helps prevent such occurrences.

These security measures, along with those we’ll discuss in subsequent articles, collectively fortify our backend, making it more resilient against potential threats, ensuring cost-efficiency, and maintaining a secure environment for serving chat responses.
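To make the idea of "N requests per time window, keyed by client IP" concrete, here is a minimal standard-library sketch. It is not how Flask-Limiter is implemented; the class name FixedWindowLimiter and its parameters are hypothetical and exist only for illustration.

```python
import time


class FixedWindowLimiter:
    """Illustrative per-key limiter: at most `limit` hits per `window` seconds."""

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the behaviour can be checked in tests
        self._state = {}    # key -> (window_start, hits_in_window)

    def allow(self, key):
        """Return True if the request identified by `key` may proceed."""
        now = self.clock()
        start, hits = self._state.get(key, (now, 0))
        if now - start >= self.window:
            # The previous window has expired: open a fresh one.
            start, hits = now, 0
        if hits >= self.limit:
            self._state[key] = (start, hits)
            return False
        self._state[key] = (start, hits + 1)
        return True
```

With limit=10 and window=60, each IP address gets its own counter, so one noisy client is rejected without affecting the others.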

Now it is the turn of the file that reads the environment variables. We create a file named conf.py:

import os

from dotenv import load_dotenv

def load_env():
    # Look for each candidate file in the current working directory.
    for env_file in ('.env', '.flaskenv'):
        env = os.path.join(os.getcwd(), env_file)

        if os.path.exists(env):
            load_dotenv(env)

Environment Variable Loading: The file handles the environment variables required for the application’s configuration.

Usage of the dotenv Library: dotenv is used here to load environment variables from files named .env and .flaskenv.

Iterating Through Files: The script iterates through the list of potential environment files (.env and .flaskenv) and looks for them in the current working directory.

Loading Variables: When a file is found (as indicated by the os.path.exists check), the load_dotenv() function loads the variables from that file.

Importance of Environment Variables:

Environment variables, loaded here with the dotenv library, are the right place for sensitive information such as API keys, database URIs, and other configuration. Keeping these values in the .env file means they are not hardcoded within the codebase and thus not exposed inadvertently, provided the .env file itself is kept out of version control.

This abstraction helps maintain security, allows for easy configuration changes without modifying the code, and enhances portability by separating sensitive information from the application logic.

The conf.py file, by handling the loading of these environment variables, ensures that the application can access the necessary configuration securely and efficiently.
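A small companion check can make a missing variable fail at startup instead of at the first request. The sketch below is an assumption layered on top of the article’s load_env(): the helper name require_env is hypothetical, while the variable names are the ones used elsewhere in this article.

```python
import os


def require_env(names, environ=None):
    """Return {name: value} for `names`, raising if any is unset or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}
```

It could be called right after load_env(), for example as require_env(["MONGODB_URI", "SECRET_KEY", "PERPLEXITY_API_KEY"]).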

The next step is setting up the Pinecone VectorStorage to save our future parsed documents. Here’s a step-by-step guide:

Go to the Pinecone dashboard and register. This creates your first project; be sure to choose the free tier (the option labeled free-tier), otherwise you will have to pay.

After that, create your first index.

It’s crucial to make informed choices regarding the size and method of index creation to attain optimal results. Considering the embeddings model selected (e.g., bge-small-en from a previous article), it’s essential to determine the appropriate size for the index. This information can often be found on the Hugging Face model page.

From that page we see that the embedding size is 384, and we keep cosine (the standard choice) as the similarity metric; with those values we create the index. Remember to also note down, from the Pinecone online console, the index name, the API key, and the environment name (on the free tier, gcp-starter is the default).
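The two index settings matter because every stored vector must have exactly the index dimension, and cosine is the score used to rank matches. The following standard-library sketch illustrates both points; the helper names are hypothetical, and Pinecone performs these computations on its side.

```python
import math

INDEX_DIMENSION = 384  # embedding size of bge-small-en, as chosen for the index


def check_dimension(vector, expected=INDEX_DIMENSION):
    """Raise if an embedding cannot be stored in an index of `expected` size."""
    if len(vector) != expected:
        raise ValueError(f"expected {expected} dimensions, got {len(vector)}")
    return vector


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction score 1.0 regardless of their length, and orthogonal vectors score 0.0, which is why cosine suits comparing text embeddings.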

After that we are ready to build our index. Create a file named index_manager.py:

from llama_index import (
    LLMPredictor,
    ServiceContext,
    StorageContext,
    VectorStoreIndex,
)

from llama_index.indices.loading import load_index_from_storage
from utils.vector_database import build_pinecone_vector_store, build_mongo_index
from mongodb.index import getExistingLlamaIndexes
from llama_index import SimpleDirectoryReader

from llama_index.llms import Perplexity
import os

llm = Perplexity(
    api_key=os.getenv("PERPLEXITY_API_KEY"), model="mistral-7b-instruct", temperature=0.5
)

index_store = build_mongo_index()
vector_store = build_pinecone_vector_store()

llm_predictor = LLMPredictor(llm=llm)

# NOTE: several argument lists below are cut off in the source listing.
# Arguments marked "assumed" are inferred from the names defined above.
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,  # assumed; any embedding model setting is not shown in the source
)

storage_context = StorageContext.from_defaults(
    index_store=index_store, vector_store=vector_store  # assumed
)

mongoIndex = None

def initialize_index():
    existing_indexes = getExistingLlamaIndexes()

    global mongoIndex

    if len(existing_indexes) > 0:
        print("Loading existing index...")

        mongoIndex = load_index_from_storage(
            storage_context=storage_context, service_context=service_context  # assumed
        )

        return createQueryEngine(mongoIndex)

    print("Building index...")

    mongoIndex = buildVectorIndex()

    return createQueryEngine(mongoIndex)

def createQueryEngine(index):
    # similarity_top_k sets how many chunks the retriever returns (library default: 2).
    return index.as_query_engine(response_mode="simple_summarize", similarity_top_k=3)

def get_service_context():
    return service_context

def buildVectorIndex():
    reader = SimpleDirectoryReader(
        input_files=["./data/rules.pdf"])

    documents = reader.load_data()

    index = VectorStoreIndex.from_documents(
        documents, storage_context=storage_context, service_context=service_context  # assumed
    )

    return index

The file starts by initializing the Perplexity language model using the Perplexity() constructor, which reads the previously loaded environment variable PERPLEXITY_API_KEY for authentication.

Following this, the index_store is configured to manage the remote index. This index, residing in remote storage, serves as a crucial component for efficient document retrieval. The specific steps involved in building and managing this remote index will be detailed later in the article.

Within the initialize_index() function, the script verifies the existence of an index. If an index is already present, it is loaded; otherwise, the script proceeds to construct a new index. This functionality ensures efficient management of existing indexes and guarantees the availability of an index for storing and retrieving documents.
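The load-or-build decision described above can be isolated into a few lines. This is a sketch with hypothetical callables standing in for the MongoDB lookup, the LlamaIndex loader, and the index builder, so that the control flow can be exercised without any external service.

```python
def load_or_build(list_existing, load, build):
    """Return (index, was_built): reuse a stored index if any, else build one.

    list_existing: callable returning the stored index records (may be empty)
    load:          callable taking those records and returning a loaded index
    build:         callable creating a new index from the source documents
    """
    existing = list_existing()
    if existing:
        return load(existing), False
    return build(), True
```

The second element of the result tells the caller whether the expensive build step (parsing and embedding the documents) actually ran.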

The createQueryEngine() function is responsible for constructing a query engine on top of the index. The engine is configured with response_mode set to simple_summarize. With this setting the retrieved chunks are truncated to fit into a single prompt and the engine makes one single LLM (Large Language Model) call over that combined context, which is fast but can drop detail when the chunks are long. Additionally, the number of retrieved chunks (top_k, which LlamaIndex’s retriever exposes as similarity_top_k) is set to 3 to include more contextual information than the default value of 2. These settings help the responses draw on more of the relevant context.
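As an approximation of what these two settings do, the sketch below picks the best-scoring chunks and packs them into a single prompt. It is not LlamaIndex’s implementation (which budgets by tokens, not characters); the function names and the prompt wording are hypothetical.

```python
import heapq


def top_k_chunks(scored_chunks, k=3):
    """Return the k chunk texts with the highest similarity score, best first.

    scored_chunks: iterable of (score, text) pairs.
    """
    best = heapq.nlargest(k, scored_chunks, key=lambda pair: pair[0])
    return [text for _score, text in best]


def build_single_prompt(question, chunks, max_chars=2000):
    """Pack the chunks into one prompt, trimming each to an equal share."""
    share = max_chars // max(len(chunks), 1)
    context = "\n".join(chunk[:share] for chunk in chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```

Raising k gives the model more material to answer from, but each chunk then gets a smaller share of the single prompt.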

The buildVectorIndex() function is designed to create the index from a straightforward data directory. For instance, in this demonstration, a board game rules PDF is chosen as the data source. For a complete example, a repository with this file and further details will be available on GitHub. Future episodes will explore dynamic data sources with real-time updates. However, in this context, the focus remains on setting up the Python Flask server rather than diving deeply into RAG (Retrieval Augmented Generation) specifics.

The MongoDB connection, defined in the index.py file inside a dedicated mongodb directory, serves a pivotal purpose. LlamaIndex is designed not only to store vectors within the Pinecone database but also to retain index-related information within MongoDB.

This MongoDB connection plays a crucial role in managing and verifying whether the index has been previously constructed. The script checks for existing index information in a MongoDB collection. This verification process aids in determining if the index has already been built or needs to be initiated.

To facilitate this functionality, it’s recommended to create a directory structure that includes a dedicated mongodb directory. Within this directory, an index.py file should be created. This Python script will handle the MongoDB connection and interaction, enabling the retrieval of index-related information from the MongoDB collection.

By leveraging MongoDB in parallel with Pinecone’s VectorStorage, the system gains the ability to cross-reference and verify the existence of indexes. This approach streamlines the index initialization process, ensuring efficiency in managing and utilizing stored document vectors. This combination of Pinecone for vector storage and MongoDB for index-related data management creates a robust foundation for handling document retrieval and index status verification within the application.

Let’s create a mongodb folder with an index.py file.


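from pymongo import MongoClient

import os

# get_database() without a name uses the default database from the URI.
db = MongoClient(host=os.getenv("MONGODB_URI")).get_database()

def getExistingLlamaIndexes():
    """
    Get the existing Llama Indexes
    """
    indexes = []

    cursor = db['index_store/data'].find({})

    for item in cursor:
        # The loop body is cut off in the source listing; collecting each
        # stored document is assumed here.
        indexes.append(item)

    return indexes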

Now for the last step: it’s time to create our first API. Let’s create an apis directory with a chats.py file:

from index_manager import initialize_index, get_service_context
from flask import request, jsonify
from app import getFlaskApp, getLimiter

app = getFlaskApp()
limiter = getLimiter()

# Build (or load) the index once at import time and reuse the query engine.
query_engine = initialize_index()

@app.route("/chats/<chatId>/answer", methods=["GET"])
@limiter.limit("10 per minute")
def query_index(chatId):
    # The user's question arrives in the "answer" query parameter.
    answer = request.args.get('answer')

    result = query_engine.query(answer)

    return jsonify({
        "answer": result.response,
        "sources": result.get_formatted_sources()
    })

In this file, we import the previously built query engine. We define the endpoint ‘/chats/<chatId>/answer’ to respond to user-provided questions, which are passed in the ‘answer’ query parameter. Throughout the upcoming chapters, we’ll consistently use the ‘chatId’ parameter. We employ a limiter to ensure that a user cannot make more than 10 calls per minute. We then use the query engine to respond to the user’s question and to identify the sources from which the response is derived.

In the upcoming chapters, we’ll delve into a more robust chat engine. For now, this setup is primarily for testing purposes. Let’s test its functionality! Please ensure all dependencies are installed and run the server using the commands below.


pip install -r requirements.txt   # install the deps
flask --app index run             # run the server in dev mode

And now let’s try our endpoint with Postman:
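The same request can also be sent from Python using only the standard library. This is a sketch: the helper names are hypothetical, and the base URL http://127.0.0.1:5000 assumes Flask’s default development host and port.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def build_answer_url(base_url, chat_id, question):
    """Build the GET URL for /chats/<chatId>/answer with the question encoded."""
    path = f"/chats/{quote(str(chat_id), safe='')}/answer"
    return base_url.rstrip("/") + path + "?" + urlencode({"answer": question})


def ask(base_url, chat_id, question, timeout=60):
    """Send the question to a running server and return the decoded JSON body."""
    url = build_answer_url(base_url, chat_id, question)
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

With the server running, ask("http://127.0.0.1:5000", 1, "What is the game called?") should return a dictionary with the "answer" and "sources" keys produced by the endpoint.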

We start with a basic question about the game’s name, and it performs well. Let’s try a more complicated question and ask how to perform the first turn:

It’s performing impressively well! The extended, meaningful responses show that the setup works. Notably, we got here by focusing on the Flask infrastructure, without any extensive fine-tuning of the very basic RAG pipeline.

Full example code link:

Here’s a glimpse of what’s coming up in the next episodes:

Fine-tuning the RAG for a genuinely efficient ChatEngine, optimizing for low latency and high performance.

Utilizing MongoDB entities such as chats and single messages to store data for bot history and facilitate admin data analysis.

Implementing dynamic document indexing, fetching documents from MongoDB instead of a local folder. This dynamic approach allows for real-time addition and removal of sources from Pinecone, potentially accessible from an external source like a back office.

Preparing everything for production deployment.

Plus, a lot more exciting developments in store!

Stay tuned for more! If you enjoyed the article or have suggestions for upcoming episodes, show your support by leaving a clap or dropping a comment!

Build a complete OpenSource LLM RAG QA Chatbot — Flask Server was originally published in Level Up Coding on Medium, where people are continuing the conversation by highlighting and responding to this story.

