Python and LLM for Stock Market Analysis Part IV — ElasticSearch for Stock Symbol/Ticker accuracy

This post is a continuation of our previous post and should be read after completing it.

Oftentimes, a trading system is required to read a multitude of finance articles, quarterly results, or, at the very least, the recent official announcements from companies.

In our last post, we aggregated news articles, used an LLM to extract the symbol and sentiment of the news, and then combined that with technical indicators. While NLP and LLMs have great potential to perform all of these tasks, it’s crucial to tie every news article back to the correct organization/symbol. This ensures that we can combine the news with fundamentals and technical data from other sources.

Is obtaining the accurate symbol an AI problem?

While it is common to use an LLM or NLP model to identify a symbol from a stock’s name, this isn’t really an AI problem; it can be addressed with a software engineering approach. Additionally, the symbol is not something we want hallucinations to affect, and we know that LLMs are adept at inventing fictional information.

Well then, if I give the complete organization/symbol database to an LLM and ask it to look for errors, wouldn’t that solve the problem for me without it making things up?

Considering cost, scalability, and performance, an LLM may not be the ideal option for this problem.

We will use LLM or NLP models to get the sentiment of the news articles and extract the name of the organization. We then use Elasticsearch to obtain the matching trading symbol.

What is ElasticSearch?

Elasticsearch is an open-source full-text search engine built on top of the Apache Lucene library. Its benefits are many; one is that it can search text even when we do not have an exactly matching term to search for. This is called fuzzy search. Fuzzy search allows Elasticsearch to find documents that match a given search term even if there are spelling errors or slight variations in the terms. This is especially useful when LLM/NLP models don’t give us the exact name of the organization, or when the source article itself has only a partial or misspelled name. There are other benefits, which we will look at as we implement.

Recently, Elasticsearch introduced the Elasticsearch Query Language (ES|QL), which aims to simplify data investigation with advanced search capabilities and streamlined workflows.

1. The first thing we need is a data source for the list of stocks. We can get it from the official NSE and BSE sites.

https://www.nseindia.com/market-data/securities-available-for-trading

2. Download the file and store it as a CSV: EQUITY_L.csv

3. Next, we need an Elasticsearch instance. I usually run such instances as Docker containers. Follow the link below to install Elasticsearch on Docker.

https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html

Note: Just one node is sufficient unless it is a production-like environment.

4. Let’s write a simple script to index the stock data into Elasticsearch. We need a class that will be used for both loading and searching the data.

In the class definition, we will set some basic Elasticsearch configuration details and the index name.

#elastic_interface.py
import pandas as pd
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
import json

class Elastic:
    def __init__(self):
        # Elasticsearch connection settings (replace with your own credentials)
        self.es_user = 'elastic'
        self.es_password = 'mfb6RIIrWNrg7ors2LhA'

        # Create Elasticsearch connection with authentication
        # (newer 8.x clients prefer basic_auth over http_auth)
        self.es = Elasticsearch(
            "https://localhost:9200",
            http_auth=[self.es_user, self.es_password],
            verify_certs=False
        )
        # Specify your index (without doc_type for recent Elasticsearch versions)
        self.index_name = 'stocks'

    def load_data(self):
        # Read CSV file into a DataFrame
        csv_file_path = 'EQUITY_L.csv'
        df = pd.read_csv(csv_file_path)

        # Convert DataFrame to JSON with orient='records'
        json_data = df.to_json(orient='records')

        # Convert JSON data to a list of dictionaries
        documents = json.loads(json_data)

        # Use the bulk API to index the data
        actions = [
            {"_op_type": "index", "_index": self.index_name, "_source": doc}
            for doc in documents
        ]

        success, failed = bulk(self.es, actions)
        print(f"Successfully indexed {success} documents. Failed to index {failed} documents.")

Finally execute the method by initializing the object

elastic = Elastic()
#elastic.load_data()

Uncomment the line elastic.load_data(), run the file with python3 elastic_interface.py, and then comment the line out again. We don’t need to load the data more than once (unless the container needs to be started again).
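If toggling a comment feels error-prone, a command-line flag is one alternative. The sketch below is not part of the original script: the flag name --load and the functions parse_args and main are hypothetical, and the call to elastic.load_data() is only indicated by a comment so that the example runs without a cluster.

```python
import argparse

def parse_args(argv=None):
    """Parse flags; --load is a hypothetical flag name chosen for this sketch."""
    parser = argparse.ArgumentParser(description="Stock index helper")
    parser.add_argument(
        "--load",
        action="store_true",
        help="index EQUITY_L.csv into Elasticsearch",
    )
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    if args.load:
        # In elastic_interface.py this is where elastic.load_data() would run.
        return "loading data"
    return "skipping load"

# Explicit argument lists keep the demo independent of the real command line.
print(main(["--load"]))
print(main([]))
```

Calling main() with no arguments makes argparse read the real command line, so something like python3 elastic_interface.py --load would trigger the load only when asked.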

Note: verify_certs=False should be used only if you are running Elasticsearch locally. If you are running on a server, especially in a production-like environment, use CA certificates.

A short explanation of the code:

1. Load the CSV and convert it to JSON.

2. Loop through the JSON and prepare each document for indexing with the following keys:

_op_type: The operation type, which can be “index,” “create,” “update,” or “delete.” In our case, we are using “index” to indicate that we want to index (insert or update) a document.

_index: The name of the index where the document should be stored.

_source: The actual document data.

We are using the bulk API of the elasticsearch library to index the data, which is why step 2 is needed.

3. Store the success and failure results in the two values that the bulk API returns by default.
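To make the shape of these actions concrete, here is a minimal sketch that builds them without pandas or a running cluster. The two CSV rows are made up, the column names Symbol and NAME OF COMPANY simply mirror the fields used later in this post (your downloaded file may use different headers), and build_actions is a hypothetical helper, not part of the article's class.

```python
import csv
import io

# Two made-up rows in the shape this post assumes for EQUITY_L.csv.
# The column names are assumptions; check them against your own file.
SAMPLE_CSV = """Symbol,NAME OF COMPANY
INFY,Infosys Limited
RELIANCE,Reliance Industries Limited
"""

def build_actions(csv_text, index_name):
    """Turn CSV text into the list of action dicts that helpers.bulk() accepts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {"_op_type": "index", "_index": index_name, "_source": row}
        for row in reader
    ]

actions = build_actions(SAMPLE_CSV, "stocks")
for action in actions:
    print(action)
```

Each printed dictionary is one document as the bulk helper would receive it: the operation, the target index, and the row itself under _source.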

The script above produces output like the following:

Successfully indexed 1941 documents. Failed to index [] documents.
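The [] in that message appears because, with its default settings, the bulk helper returns a count of successes and a list of errors rather than two numbers. If a count reads better, a small formatter like the one below would do; summarize_bulk_result is a hypothetical helper, and the tuple passed to it is a stand-in for the real return value, so no cluster is contacted.

```python
def summarize_bulk_result(result):
    """Format the (success_count, errors) tuple that helpers.bulk() returns by default."""
    success_count, errors = result
    return (
        f"Successfully indexed {success_count} documents. "
        f"Failed to index {len(errors)} documents."
    )

# Stand-in for the value returned by bulk(); nothing is sent to Elasticsearch.
print(summarize_bulk_result((1941, [])))
```

Note that with the helper's defaults a failed document normally raises an exception instead of landing in this list, so a non-empty list is mostly seen when error raising is switched off.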

Now that our Elasticsearch is equipped with all the stock symbols and names, we are prepared for the main action: identifying stock symbols from the names.

But before that, it’s important to note that Yahoo Finance itself provides an API to retrieve symbols from names. However, this API may expect a generic organization name and might not account for spelling errors or hallucinations that LLMs could introduce. Additionally, it appears that this API still has some bugs. For instance, passing Indian Bank returns the symbol for South Indian Bank. They are 2 different banks! Therefore, relying solely on the Yahoo Finance search API may not be foolproof, but it can serve as a good solution that we can enhance with our own Elasticsearch.

We are going to update the file yahoo_finance.py that we created in the previous article. Let’s add some dependencies first.

#yahoo_finance.py
from elastic_interface import elastic
import requests

We will create a new method, get_symbol_from_name, that uses both the Yahoo Finance API and Elasticsearch to ensure the right symbol is identified.

#yahoo_finance.py

    # Add this method inside the class created in the previous article
    def get_symbol_from_name(self, name, symbol):
        data = None  # stays None if the request itself fails
        try:
            yfinance = "https://query2.finance.yahoo.com/v1/finance/search"
            user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
            params = {"q": name, "quotes_count": 1, "country": "India"}
            res = requests.get(url=yfinance, params=params, headers={'User-Agent': user_agent})
            data = res.json()
            symbol = next((row['symbol'] for row in data['quotes'] if row['exchange'] == 'NSI' or row['exchange'] == 'BSE'), None)
            symbol = next(row['symbol'] for row in data['quotes']) if not symbol else symbol
        except Exception:
            # Fall back to Elasticsearch when Yahoo returned a response without a usable symbol
            symbol = elastic.perform_search(name) if data else ""
        symbol = symbol or ""  # perform_search returns None when nothing matches
        symbol = symbol + ('.NS' if symbol and not symbol.endswith('.NS') and not symbol.endswith('.BO') else '')
        return symbol

We are making an API request to the URL https://query2.finance.yahoo.com/v1/finance/search

Note how the exception path defaults to Elasticsearch. Every time Yahoo Finance can’t find a symbol for a company, the code takes the exception path, which leads to Elasticsearch.

Let’s think about a few scenarios here.

1. The article is about a company that is not listed in the market.

2. The article is about a company that is not listed in the Indian market.

3. The LLM has provided the name of the company, but it isn’t an exact match with the Yahoo Finance database, or there are multiple companies with similar names and very few ways to tell them apart.

4. We have loaded only the Indian equity list into Elasticsearch, and it cannot handle companies from other countries.

Considering all the above factors, once we make the search API call, we do the following:

symbol = next((row['symbol'] for row in data['quotes'] if row['exchange'] == 'NSI' or row['exchange'] == 'BSE'), None)
symbol = next(row['symbol'] for row in data['quotes']) if not symbol else symbol

What are we really doing here?

Line 1: If the company is listed in the Indian stock market (checking BSE or NSI), we take the symbol; otherwise, we set it to None.

Line 2: If the company is not listed in the Indian stock market but is listed in some other part of the world, we still take the symbol (we’ll explore why shortly). If not, we take the exception path.

It’s noteworthy that we intentionally don’t set a default None value in line 2. We want it to take the exception path when it reaches line 2, whereas in line 1, we don’t want that because we want line 2 to execute before it can go to the exception path.
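The difference between the two next() calls is easier to see in isolation. The sketch below wraps the same two lines in a standalone function and runs it against made-up quote lists; pick_symbol and the sample exchange codes other than NSI and BSE are assumptions for illustration, not real API responses.

```python
def pick_symbol(quotes):
    """Prefer an NSE/BSE listing; otherwise take the first quote of any exchange.

    Raises StopIteration when `quotes` is empty, which is what sends the
    calling code down its except path.
    """
    # With a default of None, next() never raises here.
    symbol = next(
        (row["symbol"] for row in quotes if row["exchange"] in ("NSI", "BSE")),
        None,
    )
    if symbol:
        return symbol
    # No default: an exhausted generator makes next() raise StopIteration.
    return next(row["symbol"] for row in quotes)

# Made-up response fragments for illustration only.
indian = [
    {"symbol": "INFY", "exchange": "NYQ"},
    {"symbol": "INFY.NS", "exchange": "NSI"},
]
foreign = [{"symbol": "BA", "exchange": "NYQ"}]

print(pick_symbol(indian))
print(pick_symbol(foreign))
try:
    pick_symbol([])
except StopIteration:
    print("no quotes: fall back to Elasticsearch")
```

The Indian listing wins even though it is not the first quote, the foreign listing is still returned when nothing Indian exists, and only an empty list triggers the fallback.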

Towards the end, we deliberately add .NS to the symbol even if it’s not an Indian company. This is because Yahoo Finance will treat such a symbol as non-existent and will not return any data. We don’t want Yahoo to return technical data when the stock is not part of BSE or NSE, as our focus here is only on stocks listed on NSE and BSE, so we follow this approach. If you want to use this for other exchanges, you may have to change this part slightly.
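The suffix rule can be sketched on its own as follows. with_nse_suffix is a hypothetical helper name, the sample symbols are only illustrative, and returning an empty string for a missing symbol is an assumption about what the calling code expects.

```python
def with_nse_suffix(symbol):
    """Append '.NS' unless the symbol is empty or already ends in '.NS' or '.BO'."""
    if not symbol:
        return ""
    if symbol.endswith((".NS", ".BO")):
        return symbol
    return symbol + ".NS"

for candidate in ["INFY", "INFY.NS", "500209.BO", "BA", None]:
    print(repr(candidate), "->", repr(with_nse_suffix(candidate)))
```

A foreign symbol such as BA becomes BA.NS, which is the deliberate trick described above: the lookup for technical data is then expected to come back empty rather than return data for a non-Indian stock.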

Note that we have also imported the elastic object and called elastic.perform_search(name) in the above code.

Let’s write the definition of perform_search.

#elastic_interface.py

    # Add this method inside the Elastic class
    def perform_search(self, search_term):
        if not search_term:
            return
        print(search_term)
        q = {
            "query": {
                "match": {
                    "NAME OF COMPANY": {
                        "query": search_term,
                        "fuzziness": "AUTO"
                    }
                }
            }
        }
        search_results = self.es.search(index=self.index_name, body=q)

        # Return the symbol of the top hit (None if there are no hits);
        # the key must match the CSV column header exactly
        for hit in search_results['hits']['hits']:
            return hit['_source']['Symbol']

This is the Elasticsearch way of querying, similar to an SQL query. As you can see, we’ve used a match query with fuzziness on the field NAME OF COMPANY. Fuzzy search leaves room for mistakes. If INFOSYS is misspelled as INSOFYS, fuzzy search uses Levenshtein distance (the number of single-character edits between two terms) to tolerate such small errors, and we still get the right symbol from Elasticsearch.
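To get a feel for the numbers involved, here is a plain-Python sketch of Levenshtein distance together with the edit budget that the default AUTO setting grants by term length. It only illustrates the idea: Elasticsearch does this work internally on analyzed terms and, by default, also counts a swap of two adjacent characters as a single edit. The function names are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def auto_fuzziness(term):
    """Edits allowed by the default AUTO setting: 0 for 1-2 chars, 1 for 3-5, 2 beyond."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2

distance = levenshtein("INFOSYS", "INSOFYS")
allowed = auto_fuzziness("INSOFYS")
print(distance, allowed, distance <= allowed)
```

INSOFYS is two substitutions away from INFOSYS, and a seven-letter term is allowed two edits under AUTO, so the misspelling still matches.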

This is particularly crucial when dealing with LLMs because for the same prompt, an LLM might respond differently at different times. It might say RELIANCE LTD once and RELIANCE INDUSTRIES LIMITED another time, and we cannot rely on an exact text match. This feature is akin to what helps us find the right movie even when we type a partial or incorrect term in a Netflix search.

While there are pros to this Elasticsearch approach, there are also cons. For instance, a news article about Boeing Ltd. is identified by Elasticsearch as Bombay Dyeing!

It’s funny and embarrassing at first. Overcoming this issue took some time (DM to know what it takes to avoid such issues). To keep such things from happening, we should make sure our search criteria tolerate only small errors, not large ones.
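One plausible reason for the Boeing mix-up is that BOEING and DYEING differ by only two letters, which fits inside the two-edit budget AUTO gives a six-letter term. The sketch below shows one way to tighten the match query, using the fuzziness and prefix_length options. This is not the fix used in this project: build_company_query is a hypothetical helper, and the values 1 and 1 are assumptions that would need tuning against your own data.

```python
def build_company_query(search_term, fuzziness=1, prefix_length=1):
    """Build a match query that tolerates fewer typos than fuzziness="AUTO".

    fuzziness: maximum number of edits allowed per term.
    prefix_length: leading characters of each term that must match exactly.
    """
    return {
        "query": {
            "match": {
                "NAME OF COMPANY": {
                    "query": search_term,
                    "fuzziness": fuzziness,
                    "prefix_length": prefix_length,
                }
            }
        }
    }

print(build_company_query("Boeing"))
```

With a prefix length of 1, a term starting with B can no longer fuzzily match one starting with D. The trade-off is that a typo in the first letter of a name will no longer be forgiven.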

One last change is to incorporate the changes for Yahoo Finance search and Elasticsearch in our main file that produces news-based recommendations.

Add the following line after we summarize and extract the result from the LLM (refer to the previous article for clarity).

#news_tech_trader.py
symbol = yfi.get_symbol_from_name(data['name'],data['symbol']) if data else ""

Note: Since this file was developed as part of our previous article, please refer to it or to the GitHub repository.

Now that we are done writing the code, let’s see it in action. Let’s rerun news_tech_trader.py, the code we created in the previous article, by executing python3 news_tech_trader.py in a terminal.

The result looks something like below

Not bad? The quality of the results has largely improved since our last article: the symbol predictions and the technical data have fewer errors than when we first implemented them. So far, we have combined news articles, some technical indicators, and sentiment prediction with an LLM. Here is the complete code on GitHub.

Next, we’re assembling a user-friendly UI so that manual execution in the terminal is not required. Additionally, we’ll transition from Google BARD to other LLM models, specifically trained on finance datasets.

Thanks for reading this far. If you find this post useful, please leave a clap or two. If you have suggestions or feedback, please feel free to comment. It would mean a lot to me!

In case of queries or for more details, please feel free to connect with me on LinkedIn or X (formerly Twitter).

Python and LLM for Stock Market Analysis Part IV — ElasticSearch for Stock Symbol/Ticker accuracy was originally published in Level Up Coding on Medium, where people are continuing the conversation by highlighting and responding to this story.

