PostgreSQL Deep Dive Part 11 — Understanding Database Indexing Fundamentals | Abhishek Kumar Singh

Access complete, unfiltered study notes maintained on GitHub for easily maintaing Second Brain Dir

The Foundation of Database Indexing 🔍

Think of PostgreSQL's indexing system as a sophisticated library card catalog. Just as a library catalog helps you find books without scanning every shelf, database indexes help PostgreSQL find rows without scanning every page. But there's much more to this story.

Core Index Characteristics

Indexes in PostgreSQL are separate data structures that maintain a specialized copy of your data. This separation is crucial for understanding both their benefits and costs:

sql

-- Creating a basic index
CREATE INDEX idx_user_email ON users(email);
CREATE INDEX idx_user_email on users using btree(email); -- this is an alternate way of creating index
 
-- Behind the scenes, PostgreSQL:
-- 1. Creates a new file on disk (to understand more check Part 13)
-- 2. Organizes email values in B-tree structure
-- 3. Maintains pointers (CTIDs) back to the heap

sql

-- Creating a basic index
CREATE INDEX idx_user_email ON users(email);
CREATE INDEX idx_user_email on users using btree(email); -- this is an alternate way of creating index
 
-- Behind the scenes, PostgreSQL:
-- 1. Creates a new file on disk (to understand more check Part 13)
-- 2. Organizes email values in B-tree structure
-- 3. Maintains pointers (CTIDs) back to the heap

Why this matters:

Each index is an independent structure
Most commonly implemented as B-trees
Contains a strategic subset of table data
Maintains pointers back to the full rows

The index doesn't just make a simple copy — it organizes data in a way that optimizes for specific types of access patterns.

Understanding the Heap and CTIDs: PostgreSQL's Storage Foundation 📚

The heap in PostgreSQL is where the actual data lives, but its organization might surprise developers coming from other databases. Let's dive deep into how PostgreSQL manages data at the physical level:

The Heap Structure

sql

-- Create a sample table
CREATE TABLE books (
    id BIGINT PRIMARY KEY,
    title TEXT,
    author TEXT,
    published_date DATE
);
 
-- Examining the heap structure
SELECT *, ctid FROM books;
 
-- Result:
/*
| id     | title      | author     | published_date     | ctid      |
|--------|------------|------------|--------------------|-----------|
| 1      | Book A     | Auth A     | 2024-01-01         | (0,1)     |
| 2      | Book B     | Auth B     | 2024-01-02         | (0,2)     |
*/

sql

-- Create a sample table
CREATE TABLE books (
    id BIGINT PRIMARY KEY,
    title TEXT,
    author TEXT,
    published_date DATE
);
 
-- Examining the heap structure
SELECT *, ctid FROM books;
 
-- Result:
/*
| id     | title      | author     | published_date     | ctid      |
|--------|------------|------------|--------------------|-----------|
| 1      | Book A     | Auth A     | 2024-01-01         | (0,1)     |
| 2      | Book B     | Auth B     | 2024-01-02         | (0,2)     |
*/

In this output, the CTID (0,1) tells us:

0: The page number in the heap file
1: The item number within that page

Key concepts about heap organization:

Fixed-size pages (typically 8KB)
Equal chunks within pages
Row version information
MVCC (Multi-Version Concurrency Control) support: MVCC is PostgreSQL's mechanism for managing concurrent database access by maintaining multiple versions of rows, allowing readers and writers to operate simultaneously without blocking each other.

Why CTIDs matter:

sql

-- CTIDs can change after VACUUM
-- (will be talking about VACUUM in coming parts)
VACUUM books;
 
SELECT *, ctid FROM books;
-- Notice how CTIDs might be different now

sql

-- CTIDs can change after VACUUM
-- (will be talking about VACUUM in coming parts)
VACUUM books;
 
SELECT *, ctid FROM books;
-- Notice how CTIDs might be different now

The B-tree Index: PostgreSQL's Workhorse 🌳

B-tree (Balanced Tree) indexes are PostgreSQL's default, and understanding their structure is crucial for optimization. Let's examine how they work:

B-tree Structure Fundamentals

                Root Node
                   |
         |-------------------|
    Internal Node      Internal Node
         |                  |
   |-----------|      |-----------|
Leaf Node  Leaf Node  Leaf Node  Leaf Node

🎄

                Root Node
                   |
         |-------------------|
    Internal Node      Internal Node
         |                  |
   |-----------|      |-----------|
Leaf Node  Leaf Node  Leaf Node  Leaf Node

🎄

Each node contains:

Key values
Pointers to child nodes (internal nodes)
CTIDs (leaf nodes)

sql

-- Creating a B-tree index with visualization options
CREATE INDEX CONCURRENTLY idx_books_date ON books USING btree(published_date);
 
-- Examining index statistics
SELECT * FROM pg_stats
WHERE tablename = 'books'
			AND
			attname = 'published_date';
 
-- Regular Index Creation
CREATE INDEX idx_name ON users(email);
-- Takes exclusive lock and blocks writes (INSERT/UPDATE/DELETE)
-- Faster to create but requires planned downtime for production systems
 
-- Concurrent Index Creation
CREATE INDEX CONCURRENTLY idx_name ON users(email);
-- Allows writes to continue during index creation, perfect for production
-- Takes ~3x longer but prevents application downtime during creation

sql

-- Creating a B-tree index with visualization options
CREATE INDEX CONCURRENTLY idx_books_date ON books USING btree(published_date);
 
-- Examining index statistics
SELECT * FROM pg_stats
WHERE tablename = 'books'
			AND
			attname = 'published_date';
 
-- Regular Index Creation
CREATE INDEX idx_name ON users(email);
-- Takes exclusive lock and blocks writes (INSERT/UPDATE/DELETE)
-- Faster to create but requires planned downtime for production systems
 
-- Concurrent Index Creation
CREATE INDEX CONCURRENTLY idx_name ON users(email);
-- Allows writes to continue during index creation, perfect for production
-- Takes ~3x longer but prevents application downtime during creation

B-tree Operation Details

When PostgreSQL executes a query using a B-tree index:

Starts at root node
Traverses internal nodes based on comparison operations
Reaches leaf node containing target value
Use CTID to fetch full row from heap

Primary Key vs Secondary Indexes: The PostgreSQL Way 🔑

Unlike other databases, PostgreSQL takes a unique approach to indexing that affects both performance and design decisions.

The Secondary Index Architecture

sql

-- PostgreSQL's Unique Heap Architecture: Understanding Primary Keys 🎯
 
-- Traditional Thinking: Data Organization
-- Most developers from MySQL expect:
--   "Primary Key = Data Organization"
-- Because in MySQL, the primary key determines data layout on disk
 
-- PostgreSQL's Different Philosophy:
-- 1. The Heap Approach
--    - All data lives in "heap" storage
--    - Think of heap as a large storage room for data pages
--    - Data placement is independent of indexes
--    - Even primary keys don't affect physical storage
 
-- 2. Index Navigation
--    Step 1: Look up value in index (including primary key)
--    Step 2: Get CTID (physical location pointer)
--    Step 3: Jump to heap to fetch the row
--    This process applies to ALL indexes!
 
-- Primary Keys in PostgreSQL: More Than Meets the Eye
-- When you create a primary key:
CREATE TABLE products (
    id BIGINT PRIMARY KEY,  -- This does multiple things!
    name TEXT
);
 
-- Behind the scenes, PostgreSQL:
-- 1. Creates a UNIQUE constraint
-- 2. Adds a NOT NULL constraint
-- 3. Creates a standard B-tree index
-- 4. Set up system-level primary key tracking
 
-- The Freedom of Secondary Indexes
-- Since primary keys don't determine storage:
CREATE INDEX idx_products_name ON products(name);  -- Just another index
CREATE INDEX idx_products_id ON products(id);      -- Structurally same as PK!
 
-- All indexes (including PK) are "secondary" because:
-- - They point to heap via CTIDs
-- - They don't determine physical data layout
-- - They have equal lookup overhead (index → heap)
 
-- MySQL vs PostgreSQL: A Clear Comparison
-- MySQL's Approach:
-- - Primary key creates a clustered index
-- - Data physically organizes based on this index
-- - Table becomes one large B-tree where:
--   - Primary key orders the data
--   - Complete rows live in leaf nodes
--   - Other indexes must reference the clustered index
 
-- PostgreSQL's Approach:
-- - All data lives in the heap (think: organized storage room)
-- - Every index (including primary keys) is secondary
-- - Each index has CTIDs (like map coordinates) pointing to heap
-- - Primary keys are simply unique, non-null indexes
 
-- The Key Distinction:
-- MySQL: Primary key determines data organization (clustered)
-- PostgreSQL: Data organization stays independent (heap-based)
-- Result: PostgreSQL treats all indexes equally, with direct data pointers

sql

-- PostgreSQL's Unique Heap Architecture: Understanding Primary Keys 🎯
 
-- Traditional Thinking: Data Organization
-- Most developers from MySQL expect:
--   "Primary Key = Data Organization"
-- Because in MySQL, the primary key determines data layout on disk
 
-- PostgreSQL's Different Philosophy:
-- 1. The Heap Approach
--    - All data lives in "heap" storage
--    - Think of heap as a large storage room for data pages
--    - Data placement is independent of indexes
--    - Even primary keys don't affect physical storage
 
-- 2. Index Navigation
--    Step 1: Look up value in index (including primary key)
--    Step 2: Get CTID (physical location pointer)
--    Step 3: Jump to heap to fetch the row
--    This process applies to ALL indexes!
 
-- Primary Keys in PostgreSQL: More Than Meets the Eye
-- When you create a primary key:
CREATE TABLE products (
    id BIGINT PRIMARY KEY,  -- This does multiple things!
    name TEXT
);
 
-- Behind the scenes, PostgreSQL:
-- 1. Creates a UNIQUE constraint
-- 2. Adds a NOT NULL constraint
-- 3. Creates a standard B-tree index
-- 4. Set up system-level primary key tracking
 
-- The Freedom of Secondary Indexes
-- Since primary keys don't determine storage:
CREATE INDEX idx_products_name ON products(name);  -- Just another index
CREATE INDEX idx_products_id ON products(id);      -- Structurally same as PK!
 
-- All indexes (including PK) are "secondary" because:
-- - They point to heap via CTIDs
-- - They don't determine physical data layout
-- - They have equal lookup overhead (index → heap)
 
-- MySQL vs PostgreSQL: A Clear Comparison
-- MySQL's Approach:
-- - Primary key creates a clustered index
-- - Data physically organizes based on this index
-- - Table becomes one large B-tree where:
--   - Primary key orders the data
--   - Complete rows live in leaf nodes
--   - Other indexes must reference the clustered index
 
-- PostgreSQL's Approach:
-- - All data lives in the heap (think: organized storage room)
-- - Every index (including primary keys) is secondary
-- - Each index has CTIDs (like map coordinates) pointing to heap
-- - Primary keys are simply unique, non-null indexes
 
-- The Key Distinction:
-- MySQL: Primary key determines data organization (clustered)
-- PostgreSQL: Data organization stays independent (heap-based)
-- Result: PostgreSQL treats all indexes equally, with direct data pointers

Key differences from other databases:

No clustered indexes
All indexes reference heap via CTIDs
Primary key is a unique constraint + index

Performance Implications

This architecture has important implications:

sql

-- Every index lookup requires heap access
EXPLAIN SELECT * FROM products WHERE id = 1;
 
-- Unless it's an index-only scan
EXPLAIN SELECT id FROM products WHERE id = 1;

sql

-- Every index lookup requires heap access
EXPLAIN SELECT * FROM products WHERE id = 1;
 
-- Unless it's an index-only scan
EXPLAIN SELECT id FROM products WHERE id = 1;

The heap-centric architecture means:

More flexible index usage
Consistent behavior across all indexes
Additional I/O for heap access

Understanding Index Storage and Memory 💾

Let's examine the storage implications of PostgreSQL's indexing approach:

sql

-- Checking index sizes
SELECT
    pg_relation_size('products') as table_size,
    pg_relation_size('products_pkey') as pk_index_size,
    pg_relation_size('idx_products_name') as name_index_size;
 
-- Analyzing index statistics
SELECT
    schemaname,
    tablename,
    indexrelname,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
WHERE tablename = 'products';

sql

-- Checking index sizes
SELECT
    pg_relation_size('products') as table_size,
    pg_relation_size('products_pkey') as pk_index_size,
    pg_relation_size('idx_products_name') as name_index_size;
 
-- Analyzing index statistics
SELECT
    schemaname,
    tablename,
    indexrelname,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
WHERE tablename = 'products';

Real-World Implications and Best Practices 🎯

Understanding these fundamentals leads to better indexing decisions:

Key considerations for index management:

Space vs. Speed
- Each index increases storage requirements
- Must justify its existence through query performance
- Regular monitoring of usage patterns
Memory Pressure
- Buffer cache utilization

This deep understanding of PostgreSQL's indexing internals forms the foundation for effective database optimization. In Part 12, we'll build upon these concepts to explore advanced indexing strategies including selectivity, composite indexes, and specialized indexing techniques. Stay tuned!

🗂️

PostgreSQL Deep Dive Part 11 — Understanding Database Indexing Fundamentals