PostgreSQL Deep Dive Part 2 - Character Types, Binary Data, and UUIDs | Abhishek Kumar Singh

Access complete, unfiltered study notes maintained on GitHub for easily maintaing Second Brain Dir

In our journey through PostgreSQL's type system, we're now venturing into territory that often trips up even seasoned developers. Character types, binary data, and UUIDs might seem straightforward at first glance, but the devils hiding in the details can make or break your database design.

Character Types: The String Theory 📝

Let's dive into PostgreSQL's character types and uncover the subtle nuances that can impact your application's performance and reliability.

The Three Musketeers of Text Storage

sql

CREATE TABLE text_comparison (
    fixed_char CHAR(10),
    var_char VARCHAR(10),
    unlimited_text TEXT
);
 
-- Let's see how they handle different input lengths
INSERT INTO text_comparison VALUES
    ('short', 'short', 'short'),
    ('exactly_ten', 'exactly_ten', 'exactly_ten'),
    ('this_is_longer_than_ten', 'this_is_longer_than_ten', 'this_is_longer_than_ten'); -- Error: as char and varchar won't allow higher size
 
INSERT INTO text_comparison VALUES
    ('short', 'short', 'short'),
    ('exactlyTen', 'exactlyTen', 'exactly_ten'),
    ('exactlyTen','exactlyTen','this_is_longer_than_ten');
 
--| fixed_char  | var_char   | unlimited_text          |
--|-------------|------------|-------------------------|
--| 'short    ' | short      | short                   |
--| exactlyTen  | exactlyTen | exactly_ten             |
--| exactlyTen  | exactlyTen | this_is_longer_than_ten |

sql

CREATE TABLE text_comparison (
    fixed_char CHAR(10),
    var_char VARCHAR(10),
    unlimited_text TEXT
);
 
-- Let's see how they handle different input lengths
INSERT INTO text_comparison VALUES
    ('short', 'short', 'short'),
    ('exactly_ten', 'exactly_ten', 'exactly_ten'),
    ('this_is_longer_than_ten', 'this_is_longer_than_ten', 'this_is_longer_than_ten'); -- Error: as char and varchar won't allow higher size
 
INSERT INTO text_comparison VALUES
    ('short', 'short', 'short'),
    ('exactlyTen', 'exactlyTen', 'exactly_ten'),
    ('exactlyTen','exactlyTen','this_is_longer_than_ten');
 
--| fixed_char  | var_char   | unlimited_text          |
--|-------------|------------|-------------------------|
--| 'short    ' | short      | short                   |
--| exactlyTen  | exactlyTen | exactly_ten             |
--| exactlyTen  | exactlyTen | this_is_longer_than_ten |

Let's break down what happens behind the scenes:

CHAR(n): The Fixed-Width Trap

sql

SELECT concat('|', fixed_char, '|')
FROM text_comparison
WHERE fixed_char = 'short';
-- Output: |short     |  -- Notice the padding spaces

sql

SELECT concat('|', fixed_char, '|')
FROM text_comparison
WHERE fixed_char = 'short';
-- Output: |short     |  -- Notice the padding spaces

VARCHAR(n): The Constrained Variable

sql

SELECT pg_column_size(var_char) as storage_size, var_char
FROM text_comparison;
-- Reveals variable storage size based on content length
 
--| storage_size | var_char   |
--|--------------|------------|
--|            6 | short      |
--|           11 | exactlyTen |
--|           11 | exactlyTen |

sql

SELECT pg_column_size(var_char) as storage_size, var_char
FROM text_comparison;
-- Reveals variable storage size based on content length
 
--| storage_size | var_char   |
--|--------------|------------|
--|            6 | short      |
--|           11 | exactlyTen |
--|           11 | exactlyTen |

TEXT: The Flexible Friend, VARCHAR has a declared maximum length limit, while TEXT can store strings of unlimited length

Here's a table comparing CHAR, VARCHAR, and TEXT in PostgreSQL:

Feature	CHAR	VARCHAR	TEXT
Length	Fixed-length, padded with spaces	Variable-length up to limit	Unlimited length
Max Size	1GB	1GB with declared limit	1GB
Storage	Uses full declared length	Uses actual content length + 1 byte	Uses actual content length + 1 byte
Performance	Slightly faster for fixed-length data	Efficient for variable-length	Same as VARCHAR
Use Case	Fixed-length codes, known-length data	Data with max length constraint	Large text without length limits
Space Efficiency	Lower due to padding	High	High
TOAST	Only if length > 2KB	Only if length > 2KB	Only if length > 2KB

In practice, TEXT and VARCHAR perform the same, so use TEXT unless you need the length constraint of VARCHAR.

Character Set and Collation Deep Dive

Character set and collation in PostgreSQL control how text data is stored and compared:

Character Set (encoding):

Defines how characters are stored in the database
Common options: UTF8 (default), LATIN1, WIN1252, etc
Set during database creation with CREATE DATABASE dbname ENCODING 'encoding'

Collation:

Determines how text values are sorted and compared
Affects ORDER BY, comparisons, and text pattern matching
Format: language_territory.encoding
Example: 'en_US.UTF8' for US English sorting rules
Set at database, table, or column level:

sql

\l -- database list does have column to show encoding.
 
show client_encoding; -- UTF8
show server_encoding; -- UTF8
 
-- Creating a case-insensitive collation
CREATE COLLATION case_insensitive (
    provider = icu, -- this stands for International Components for Unicode
    locale = 'en-US-u-ks-level1',
    deterministic = false
);
 
-- Practical application
CREATE TABLE user_tags (
    tag_name TEXT COLLATE case_insensitive UNIQUE
);
--This creates a table user_tags with a single column tag_name that:
--Is case-insensitive for comparisons and sorting
--Must contain unique values (case-insensitive)
 
INSERT INTO user_tags VALUES ('Programming');
INSERT INTO user_tags VALUES ('programming'); -- This will fail due to case-insensitive uniqueness:

sql

\l -- database list does have column to show encoding.
 
show client_encoding; -- UTF8
show server_encoding; -- UTF8
 
-- Creating a case-insensitive collation
CREATE COLLATION case_insensitive (
    provider = icu, -- this stands for International Components for Unicode
    locale = 'en-US-u-ks-level1',
    deterministic = false
);
 
-- Practical application
CREATE TABLE user_tags (
    tag_name TEXT COLLATE case_insensitive UNIQUE
);
--This creates a table user_tags with a single column tag_name that:
--Is case-insensitive for comparisons and sorting
--Must contain unique values (case-insensitive)
 
INSERT INTO user_tags VALUES ('Programming');
INSERT INTO user_tags VALUES ('programming'); -- This will fail due to case-insensitive uniqueness:

Binary Data: The Raw Reality 🔧

Binary storage in PostgreSQL opens up a world of possibilities - and potential pitfalls.

The BYTEA Type In Action

sql

CREATE TABLE binary_storage (
    file_name TEXT,
    file_data BYTEA,
    file_hash BYTEA
);
 
-- Storing different types of binary data
INSERT INTO binary_storage (file_name, file_data, file_hash) VALUES
    ('example.txt',
    decode('48656C6C6F20576F726C64', 'hex'),
    sha256('Hello World'));
--| file_name   |       file_data    |      file_hash     |
--|-------------|--------------------|--------------------|
--| example.txt | BINARY - 11 bytes  | BINARY - 32 bytes  |

sql

CREATE TABLE binary_storage (
    file_name TEXT,
    file_data BYTEA,
    file_hash BYTEA
);
 
-- Storing different types of binary data
INSERT INTO binary_storage (file_name, file_data, file_hash) VALUES
    ('example.txt',
    decode('48656C6C6F20576F726C64', 'hex'),
    sha256('Hello World'));
--| file_name   |       file_data    |      file_hash     |
--|-------------|--------------------|--------------------|
--| example.txt | BINARY - 11 bytes  | BINARY - 32 bytes  |

Binary Data Best Practices

sql

-- Compare different hash storage methods and sizes
select pg_typeof(md5('hello world')); -- text
select pg_typeof(sha256('hello world')); -- bytea
 
-- to convert md5 to bytea
select pg_typeof(decode(md5('hello world'), 'hex')); -- bytea
 
SELECT
    pg_column_size(digest('test data', 'md5')) as md5_bytea_size, -- bytea
    pg_column_size(md5('test data')) as md5_text_size, -- md5
    pg_column_size(md5('test data')::uuid) as md5_uuid_size; -- uuid
 
--| md5_bytea_size | md5_text_size | md5_uuid_size |
--|----------------|---------------|---------------|
--|             20 |            36 |            16 |

sql

-- Compare different hash storage methods and sizes
select pg_typeof(md5('hello world')); -- text
select pg_typeof(sha256('hello world')); -- bytea
 
-- to convert md5 to bytea
select pg_typeof(decode(md5('hello world'), 'hex')); -- bytea
 
SELECT
    pg_column_size(digest('test data', 'md5')) as md5_bytea_size, -- bytea
    pg_column_size(md5('test data')) as md5_text_size, -- md5
    pg_column_size(md5('test data')::uuid) as md5_uuid_size; -- uuid
 
--| md5_bytea_size | md5_text_size | md5_uuid_size |
--|----------------|---------------|---------------|
--|             20 |            36 |            16 |

UUID is the smallest representation of hash.
This is not recommended to store larger size of data, like images, etc. use document DBs for that stuff.
Binary types support up-to 1GB of data.

UUIDs: The Universal Identifiers 🎯

UUIDs deserve special attention in modern database design, especially in distributed systems.

UUID Implementation Strategies

sql

CREATE TABLE distributed_entities (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
    data JSONB
);
 
INSERT INTO
	distributed_entities (data)
VALUES
	('{"a": "b"}');
 
--| id                       | created_at           | data       |
--|--------------------------|----------------------|------------|
--| b1b8c4d6-304c-4ca6-98... | 2024-12-24 10:17:... | {"a": "b"} |
 
-- Different UUID generation methods
SELECT
    gen_random_uuid() as random_uuid,
    uuid_generate_v1() as timestamp_based_uuid,
    uuid_generate_v4() as random_uuid_v4;
 
-- uuid_generate_v1 and uuid_generate_v4 will need the below extension
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

sql

CREATE TABLE distributed_entities (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
    data JSONB
);
 
INSERT INTO
	distributed_entities (data)
VALUES
	('{"a": "b"}');
 
--| id                       | created_at           | data       |
--|--------------------------|----------------------|------------|
--| b1b8c4d6-304c-4ca6-98... | 2024-12-24 10:17:... | {"a": "b"} |
 
-- Different UUID generation methods
SELECT
    gen_random_uuid() as random_uuid,
    uuid_generate_v1() as timestamp_based_uuid,
    uuid_generate_v4() as random_uuid_v4;
 
-- uuid_generate_v1 and uuid_generate_v4 will need the below extension
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

Storage Efficiency

sql

-- Compare storage sizes
SELECT
    pg_column_size('12345678-1234-5678-1234-567812345678'::uuid) as uuid_size,
    pg_column_size('12345678-1234-5678-1234-567812345678') as text_size;
 
--| uuid_size | text_size |
--|-----------|-----------|
--|        16 |        37 |
 
-- UUID only takes 16 bytes
-- If we convert UUID to Text, then it changes to ~40 bytes.
-- If you are storing UUID, store in UUID type.

sql

-- Compare storage sizes
SELECT
    pg_column_size('12345678-1234-5678-1234-567812345678'::uuid) as uuid_size,
    pg_column_size('12345678-1234-5678-1234-567812345678') as text_size;
 
--| uuid_size | text_size |
--|-----------|-----------|
--|        16 |        37 |
 
-- UUID only takes 16 bytes
-- If we convert UUID to Text, then it changes to ~40 bytes.
-- If you are storing UUID, store in UUID type.

This deep dive into character types, binary data, and UUIDs reveals the complexity hidden beneath these seemingly simple types. In our next instalment, we'll explore PostgreSQL's date, time, and timezone handling - another area rich with subtle complexities and gotchas.

Ready for Part 3?

🚁

PostgreSQL Deep Dive Part 2 - Character Types, Binary Data, and UUIDs