Exploring the good and bad about GIN Indices in PostgreSQLGenerated via Copilot

Table of contents

MotivationIntroductionGin Indexing InternalsPractical ExampleLessons from GitLab’s GIN Trigram IndexConclusionReferences

1. Motivation

Recently, I found myself grappling with a formidable task: enabling text-based searches on a table housing millions of rows.It was like trying to sift through a mountain of information with bare hands.That’s when I stumbled upon GIN indexing in PostgreSQL — a hidden gem that promised to untangle the mess and streamline the search process.However, some precautions also need to be taken to ensure that GIN indexing doesn’t cause write overhead as that will result in a bad user experience especially if a majority of the users use it frequently.I decided to compile my findings, experiments, safety measures and conclusion in this blog to provide a comprehensive guide for anyone who is trying to understand the GIN Index in PostgreSQL.

2. Introduction

PostgreSQL offers a variety of index types, and one of the most powerful and flexible is the Generalized Inverted Index (GIN).As per Tom Lane, “The GIN index type was designed to deal with data types that are subdividable and you want to search for individual component values (array elements, lexemes in a text document, etc)”The initial motivation for the GIN index was a full-text search. Before GIN was added, there was no way to index full-text search in Postgres, instead requiring a very slow sequential scan of the table.However, later its application was extended to more complex datatypes like Arrays, JSONB, tsvector, hstore, etc.

3. Gin Indexing Internals

On a high level, a GIN index essentially stores a set of (key, posting list) pairs.The key represents the value being indexed.The posting list contains row IDs where the key occurs.An item can contain more than one key, so that the same row ID can appear in multiple posting lists.Each key value is stored only once, making GIN indexes compact when the same key appears many times.Now, let’s dive deeper into it’s internal structure.Source — PostgreSQL Documentation

Meta page

In a GIN index, similar to a B-tree index, the initial page is termed the “meta page,” which holds essential information about the index.The meta page includes information such as the number of levels in the index, the total number of entries, the root block pointer, and any other index-specific details required for its management and traversal.

Entry tree

The entry tree in a GIN index is a B-tree structure formed from the indexed entries, facilitating efficient retrieval of associated data.The GIN index is structured internally with a B-tree index. This B-tree index is built over keys, where each key corresponds to one or more indexed items.For instance, if you’re indexing an array, each element of that array becomes a key.

Posting tree and Posting list

As it can be seen in the above diagram, in the leaf pages of this B-tree index, there are tuples. Each tuple either points to a B-tree of heap pointers (known as a “posting tree”) or contains a simple list of heap pointers (referred to as a “posting list”).The choice between these structures depends on the size of the list of pointers. If the list is small enough, it fits into a single index tuple along with the key value.When it comes to multicolumn GIN indexes, they are implemented by constructing a single B-tree over composite values. These composite values consist of the column number and the corresponding key value.It’s important to note that the key values for different columns can be of different types, providing flexibility in indexing various data structures.

Pending List

The GIN index employs a special mechanism called “fastupdate” to optimize index updates. By default, fastupdate is enabled for GIN indexes. It defers index updates, consolidating them to occur at points where multiple updates need to be made.This approach reduces the overhead for individual UPDATE operations but requires the deferred work to be completed later.The data awaiting processing is stored in a designated area known as the “pending list.” This pending list serves as a buffer for holding index updates until they are ready to be incorporated into the main index structure. The pending list is flushed to the main index structure under one of three conditions:When the gin_pending_list_limit is reached during a standard index update operation. By default, this limit is set to 4MB.When there is an explicit call to the gin_clean_pending_list function.During an autovacuum process on the table containing the GIN index. GIN pending list cleanup occurs automatically at the conclusion of the vacuum operation.This pending list mechanism ensures efficient management of index updates, balancing the trade-off between immediate update performance and reduced overhead for individual operations.

Sample Structure —

Source — Link

Example —

Let’s consider that for an array data structure, we have a query seeking items related to compensation and accelerometers.The index indicates the positions of these terms within the indexed items.For compensation, the positions are 30 and 68, while for accelerometers, the positions are 5, 10, 25, 28, 30, 36, 58, 59, 61, 73, and 74.Utilizing GIN indexing, the index efficiently organizes these positions for each term, enabling quick retrieval of matching items.The query returns item 30 as it contains both compensation (at positions 30 and 68) and accelerometers (at position 30).This illustrates how GIN indexing facilitates rapid and precise retrieval of relevant items based on multiple search terms.

4. Practical Example

The internals are fine, but does a GIN index really make queries faster?Let’s find out via a practical example with an address field for an e-commerce company. We’ll use randomized data for the addresses and perform searches on the indexed and unindexed versions of the field.

Steps —

1 Extension Installation: First, we need to ensures that the pg_trgm extension, which supports trigram matching for text data, is available in the PostgreSQL database.

CREATE EXTENSION IF NOT EXISTS pg_trgm;

2. Table Creation: We begin by creating a table named customer_addresses to store customer address data, including fields for customer_id, address, and address_indexed.

CREATE TABLE customer_addresses (
customer_id SERIAL PRIMARY KEY,
address text,
address_indexed text
);

3. Index Creation: With the extension installed, we proceed to establish a gin index named customer_addresses_search_idx on the address_indexed column using the gin_trgm_ops operator class, facilitating trigram-based indexing.

CREATE INDEX customer_addresses_search_idx ON customer_addresses USING gin (address_indexed gin_trgm_ops);

4. Data Seeding Function: To generate diverse test cases for indexing, we define a random_address() function responsible for producing randomized address data.

CREATE OR REPLACE FUNCTION random_address()
RETURNS TEXT AS $$
BEGIN
RETURN CONCAT(
(SELECT md5(random()::text) FROM generate_series(1,1)),
‘, ‘,
(SELECT md5(random()::text) FROM generate_series(1,1))
);
END;
$$ LANGUAGE plpgsql;

5. Data Seeding: Utilizing the random_address() function, we populate the customer_addresses table with randomized address data, simulating real-world scenarios.

INSERT INTO customer_addresses (address, address_indexed)
SELECT
random_address(),
random_address()
FROM generate_series(1,100000) AS id;

6. Data Display: Following data population, we inspect the contents of the customer_addresses table to ensure successful data insertion and indexing.

SELECT * FROM customer_addresses LIMIT 5;

The table looks like this —

PgAdmin output screenshot

7. Query Analysis (Unindexed): Subsequently, we conduct an analysis of query performance without indexing, illustrating the impact of indexing on search efficiency.

EXPLAIN ANALYZE SELECT * FROM customer_addresses WHERE address ILIKE ‘%913%’;PgAdmin output screenshotFor theunindexed column, the query is performing a sequential scan and the execution time is 90.577 ms.

8. Query Analysis (Indexed): Finally, we examine query performance with trigram indexing, showcasing the efficiency gains achieved through optimized indexing strategies.

EXPLAIN ANALYZE SELECT * FROM customer_addresses WHERE address_indexed ILIKE ‘%913%’;PgAdmin output screenshotFor the indexed column, the query is performing a heap scan and the execution time is 4.191 ms which is a 20x improvement 🚀 over the sequential scan.

9. Index Size: Let’s also check the size of the index.

SELECT pg_size_pretty(pg_relation_size(‘customer_addresses_search_idx’)) AS index_size;PgAdmin output screenshotThe size of the index is 19 MB. This indicates the additional storage required to maintain the index for efficient querying of address data.

5. Lessons from GitLab’s GIN Trigram Index

GitLab’s discussion about an issue related to updating shed light on a GIN trigram index causing occasional slow merge requests.Despite minimal locking statements, the team uncovered that the culprit was the GIN pending list.Cleaning this list for the description field incurred significant overhead, especially with numerous pending entries.Manual observations showed cleaning durations ranging from 465 ms to 3155 ms.Further investigation revealed that the GIN pending list was flushed at a high frequency during business hours, filling up approximately once every 2.7 seconds.You can read more about it in this section of the PgAnalyze blog.The insights gleaned from GitLab’s discussions offer valuable lessons in the realm of database optimization, particularly concerning GIN trigram indexes.By examining the challenges faced with slow merge requests, attributed to the GIN pending list, we uncover the critical importance of fine-tuning index maintenance processes

6. Conclusion

As we wrap up our exploration of GIN indexing, we’ve uncovered its inner workings, practical applications, and the impact it can have on database performance.Through hands-on experiments and examples, we’ve seen how GIN indexing can revolutionize search operations, making queries faster and more efficient. From simple searches to complex pattern matching, GIN indexing offers a versatile solution for optimizing database performance.While GIN indexing can significantly improve search speed, it’s essential to be mindful of potential challenges, such as index maintenance overhead. By understanding these nuances, we can fine-tune our indexing strategies for maximum effectiveness.

7. References

Official PostgreSQL documentation — LinkFull-text search presentation by Oleg Bartunov and Alexander Korotkov— LinkPgAnalyze Blog— LinkLouise Grandjonc’s Blog — LinkGreekDataGuy’s Blog — Link

👏 Your 3 claps mean the world to me! If you found value in this article, a simple tap on those clapping hands would make my day.

🚀 Please consider following for more tech-related content.

🌟 I’m genuinely grateful for your support. It’s what keeps me inspired to share more insightful content with you.

Thank you for being a part of this journey! 🙏

Generalized Inverted Index in PostgreSQL was originally published in Level Up Coding on Medium, where people are continuing the conversation by highlighting and responding to this story.

​ Level Up Coding – Medium

about Infinite Loop Digital

We support businesses by identifying requirements and helping clients integrate AI seamlessly into their operations.

Gartner
Gartner Digital Workplace Summit Generative Al

GenAI sessions:

  • 4 Use Cases for Generative AI and ChatGPT in the Digital Workplace
  • How the Power of Generative AI Will Transform Knowledge Management
  • The Perils and Promises of Microsoft 365 Copilot
  • How to Be the Generative AI Champion Your CIO and Organization Need
  • How to Shift Organizational Culture Today to Embrace Generative AI Tomorrow
  • Mitigate the Risks of Generative AI by Enhancing Your Information Governance
  • Cultivate Essential Skills for Collaborating With Artificial Intelligence
  • Ask the Expert: Microsoft 365 Copilot
  • Generative AI Across Digital Workplace Markets
10 – 11 June 2024

London, U.K.