Making all my blog posts and YouTube videos searchable in real-time.

In my previous blog post, I went over how I added a smart search feature to my website that lets my audience search across all my blog posts and YouTube videos.

In this post, I will focus on the real-time pipeline that is powering the smart search, so that every time I publish a new blog post or video, the content is automatically “indexed”.

Let’s get started.

A Quick Overview

If you don’t have the time to read my previous post, I will try to summarize the overall architecture for you:

  • Blogs are stored as Markdown files inside my NextJS app
  • Blog metadata (URL, title, slug, category, etc.) are stored in a Postgres DB
  • YouTube video data is refreshed using YouTube’s Data API
  • YouTube metadata (URL, title, category, thumbnail, etc.) are stored in a Postgres DB
  • Vector embeddings are generated through OpenAI for both blogs and videos
  • Generated embeddings are stored with the metadata

The end product: all blog and video metadata are stored in a Postgres DB called content, along with the vector embeddings used to search for similar content.

When a user performs a search, say “productivity”, an embedding for the search term is generated in real time. Then, using some Postgres functions, its cosine similarity against every video and blog embedding is computed, and the 10 most similar pieces of content are returned.
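To make the similarity step concrete, here is a minimal pure-Python sketch of a cosine similarity ranking. In the setup described above this work happens inside Postgres; the function names `cosine_similarity` and `top_k_similar`, and the `(url, embedding)` row shape, are assumptions made for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k_similar(query_embedding, rows, k=10):
    """Rank (url, embedding) rows by similarity to the query, best first."""
    scored = [(url, cosine_similarity(query_embedding, emb)) for url, emb in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Real embeddings have hundreds or thousands of dimensions, so doing this in the database next to the stored vectors avoids shipping every embedding to the application on each search.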

What Needs to be Realtime?

Every time I publish a piece of content — whether a blog or video — I want something to automatically index my content, and make it discoverable by smart search.

I publish 2–3 times a week. With such a low cadence, I could have easily updated the indexes manually, but the engineer in me hates not automating repetitive tasks.

That’s why I had to figure out a way to automate the process.

So, this is what needed to happen every time I hit “Publish”.

For my blog posts:

  • The process starts when a new Markdown file is added to my NextJS app
  • The file (blog) needed to be read and parsed
  • OpenAI Embeddings had to be generated and stored in Postgres (note: this step only applied to the new blog post. Generating embeddings for historical content is expensive and unnecessary)

For my YouTube videos, a similar process:

  • The process starts whenever a new video is found using the YouTube API
  • The video title and description are parsed and concatenated into a single string
  • The string is passed to OpenAI for Embeddings to be generated
  • Embeddings for the new video are stored in Postgres
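The concatenation step for videos could look roughly like the sketch below. The function name `build_embedding_input` and the blank-line separator are assumptions; the post only says that the title and description are joined into a single string.

```python
def build_embedding_input(title, description):
    """Join a video's title and description into one string for embedding."""
    parts = [title.strip(), description.strip()]
    # Skip empty parts so a video without a description has no trailing separator.
    return "\n\n".join(part for part in parts if part)
```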

If you understand better with illustrations, here you go.

First Step: FastAPI Endpoint

There were a few common pieces of logic that applied to both blogs and videos:

  • Only generate embeddings for content not available in Postgres already
  • For duplicate requests to OpenAI, serve from the cache to save money
  • Expose the ability to “force refresh” if I want to forcefully generate new embeddings for historical content

Creating a new private endpoint to encapsulate the above made the most sense.

The endpoint took 3 boolean parameters:

  • youtube: if a new YouTube video needs to be processed
  • blog: if a new blog post needs to be processed
  • force_refresh: if an already processed blog or video needs to be re-processed
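One way to model these three flags is sketched below with a plain dataclass rather than the actual FastAPI code, which the post does not show. The class name `EmbeddingRequest`, the `False` defaults, and the validation rules are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingRequest:
    """The three flags accepted by the endpoint (defaults are assumed)."""
    youtube: bool = False
    blog: bool = False
    force_refresh: bool = False

    @classmethod
    def from_dict(cls, payload):
        """Build a request from a decoded JSON object, rejecting bad input."""
        allowed = {"youtube", "blog", "force_refresh"}
        unknown = set(payload) - allowed
        if unknown:
            raise ValueError(f"unknown parameters: {sorted(unknown)}")
        for key, value in payload.items():
            if not isinstance(value, bool):
                raise ValueError(f"{key} must be true or false")
        return cls(**payload)
```

In a FastAPI app, a Pydantic model with the same three fields would typically play this role and perform the validation automatically.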

To guarantee not processing already “seen” content, I only look for unique URLs.

A new blog or video only gets processed when its URL does not already exist in stored_video_urls.
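The URL check can be sketched as a small filter. This is an illustration under assumptions, not the actual endpoint code: the function name `urls_to_process` and its signature are hypothetical, and only the idea of comparing against the stored URLs (plus the force_refresh override) comes from the description above.

```python
def urls_to_process(candidate_urls, stored_urls, force_refresh=False):
    """Return the candidate URLs that still need embeddings, in order."""
    if force_refresh:
        # Re-process everything requested, dropping only repeated candidates.
        return list(dict.fromkeys(candidate_urls))
    seen = set(stored_urls)
    pending = []
    for url in candidate_urls:
        if url not in seen:
            pending.append(url)
            seen.add(url)  # also skips duplicates within the same request
    return pending
```

For example, `urls_to_process(new_urls, stored_video_urls)` would return an empty list when nothing new has been published, so no OpenAI call is made at all.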

The last important logic I want to focus on has to do with caching.

Requests to OpenAI get more expensive as the number of input tokens grows. Because of that, I don’t want to make API calls for input_text that I have already seen.

Python’s built-in functools.lru_cache does the job for now. Later on, I plan on migrating to something like Redis.
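A minimal sketch of the caching idea follows. The paid API call is replaced here by a stand-in, `fake_embedding_call`, which is purely hypothetical and only counts invocations; `get_embedding` and the `maxsize` value are also assumptions.

```python
from functools import lru_cache


def fake_embedding_call(input_text):
    """Stand-in for the paid API call; counts how often it is invoked."""
    fake_embedding_call.calls += 1
    return tuple(float(ord(ch)) for ch in input_text[:4])


fake_embedding_call.calls = 0


@lru_cache(maxsize=1024)
def get_embedding(input_text):
    """Return the embedding for input_text, reusing earlier results."""
    # A tuple is returned because cached values are shared between callers
    # and should not be mutated.
    return fake_embedding_call(input_text)
```

Calling `get_embedding("productivity")` twice triggers the underlying call only once. Note that lru_cache lives in the memory of a single process, so the cache is empty again after every restart, which is one reason an external store such as Redis is a natural next step.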

Okay, so now that you know about the important parts of the endpoint, let’s talk about how the API is called in the real-time pipeline.

Second Step: Deployment Pipelines

Whenever I want to publish a new blog post, I run a bash script on my production server.

The bash script does the following:

  • Pulls the latest changes on GitHub
  • Kicks off a new Node process for my NextJS app
  • Makes an API request to the endpoint above to kick off processing for the new blog post

So, whenever a new blog post is added to my website, the API endpoint is invoked automatically, which in turn updates the Postgres DB with embeddings for the new blog post.

Moving to my YouTube videos now, the process is slightly different.

I upload my YouTube videos through the YouTube Studio UI. So, my production server does not get notified immediately.

Instead, I have a cron job that runs every 10 minutes on my server. It does one thing only: it pulls the latest channel video using the YouTube API and calls the /create_embeddings endpoint above if there’s a new video to process.

At worst, there will be a 10-minute delay between uploading a new video and it being available through Smart Search.
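The decision the cron job makes on each run can be sketched as follows. The function and parameter names are hypothetical, and the YouTube lookup and the endpoint call are passed in as callables so the sketch stays independent of any particular client library.

```python
def check_for_new_video(fetch_latest_video_url, stored_video_urls, trigger_processing):
    """Run one polling cycle; return True if processing was triggered."""
    latest_url = fetch_latest_video_url()
    if latest_url is None or latest_url in stored_video_urls:
        return False  # nothing uploaded, or the latest video is already indexed
    trigger_processing()
    return True
```

Since the endpoint itself skips URLs it has already seen, a check like this is mainly an optimization that avoids a pointless request every 10 minutes.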

Third Step: Periodic Refreshes

Lastly, I have set up a weekly cron job that invokes the API endpoint as follows:

curl -X 'POST' \
  'http://127.0.0.1:8000/create_embeddings' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "youtube": true,
  "blog": true,
  "force_refresh": false
}'

In other words, it invokes the internally running API to create embeddings for the latest blog post and video if, for some reason, they weren’t already created in step 2.
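For anyone who prefers to trigger the refresh from Python instead of curl, the same request can be sketched with the standard library. The URL, headers, and payload mirror the curl command; the function names and the timeout value are assumptions.

```python
import json
import urllib.request


def build_refresh_request(base_url="http://127.0.0.1:8000"):
    """Build the same POST request that the weekly curl command sends."""
    payload = {"youtube": True, "blog": True, "force_refresh": False}
    return urllib.request.Request(
        url=f"{base_url}/create_embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def trigger_refresh(timeout=30):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(build_refresh_request(), timeout=timeout) as response:
        return response.status
```

Keeping the request construction separate from the network call makes it possible to inspect what would be sent without the API running.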

With this fallback, I know that even if my automated pipeline fails, there’s a backup that will make everything discoverable to smart search by the end of the week.

Closing Thoughts

If you have made it this far, I hope you found this valuable.

If you want to keep up with my work, here are a few ways to do so: follow me on Medium, subscribe to my website, or follow me on YouTube.

I Built a Realtime Smart Search Pipeline was originally published in Level Up Coding on Medium, where people are continuing the conversation by highlighting and responding to this story.

