👋
Welcome to my blog!

PostgreSQL Deep Dive Part 2 - Character Types, Binary Data, and UUIDs

Explore PostgreSQL's character types, binary data storage, and UUID implementation. Learn best practices for handling text, BYTEA, and unique identifiers in production.

PostgreSQL Deep Dive Part 2 - Character Types, Binary Data, and UUIDs
Database
Postgres
Development

Published At

12/24/2024

Reading Time

~ 10 min read

📚 Raw Postgres Study Notes

Access complete, unfiltered study notes maintained on GitHub for easily maintaing Second Brain Dir

In our journey through PostgreSQL's type system, we're now venturing into territory that often trips up even seasoned developers. Character types, binary data, and UUIDs might seem straightforward at first glance, but the devils hiding in the details can make or break your database design.

Character Types: The String Theory 📝

Let's dive into PostgreSQL's character types and uncover the subtle nuances that can impact your application's performance and reliability.

The Three Musketeers of Text Storage

sql
CREATE TABLE text_comparison (
    fixed_char CHAR(10),
    var_char VARCHAR(10),
    unlimited_text TEXT
);
 
-- Let's see how they handle different input lengths
INSERT INTO text_comparison VALUES
    ('short', 'short', 'short'),
    ('exactly_ten', 'exactly_ten', 'exactly_ten'),
    ('this_is_longer_than_ten', 'this_is_longer_than_ten', 'this_is_longer_than_ten'); -- Error: as char and varchar won't allow higher size
 
INSERT INTO text_comparison VALUES
    ('short', 'short', 'short'),
    ('exactlyTen', 'exactlyTen', 'exactly_ten'),
    ('exactlyTen','exactlyTen','this_is_longer_than_ten');
 
--| fixed_char  | var_char   | unlimited_text          |
--|-------------|------------|-------------------------|
--| 'short    ' | short      | short                   |
--| exactlyTen  | exactlyTen | exactly_ten             |
--| exactlyTen  | exactlyTen | this_is_longer_than_ten |
sql
CREATE TABLE text_comparison (
    fixed_char CHAR(10),
    var_char VARCHAR(10),
    unlimited_text TEXT
);
 
-- Let's see how they handle different input lengths
INSERT INTO text_comparison VALUES
    ('short', 'short', 'short'),
    ('exactly_ten', 'exactly_ten', 'exactly_ten'),
    ('this_is_longer_than_ten', 'this_is_longer_than_ten', 'this_is_longer_than_ten'); -- Error: as char and varchar won't allow higher size
 
INSERT INTO text_comparison VALUES
    ('short', 'short', 'short'),
    ('exactlyTen', 'exactlyTen', 'exactly_ten'),
    ('exactlyTen','exactlyTen','this_is_longer_than_ten');
 
--| fixed_char  | var_char   | unlimited_text          |
--|-------------|------------|-------------------------|
--| 'short    ' | short      | short                   |
--| exactlyTen  | exactlyTen | exactly_ten             |
--| exactlyTen  | exactlyTen | this_is_longer_than_ten |

Let's break down what happens behind the scenes:

CHAR(n): The Fixed-Width Trap

sql
SELECT concat('|', fixed_char, '|')
FROM text_comparison
WHERE fixed_char = 'short';
-- Output: |short     |  -- Notice the padding spaces
sql
SELECT concat('|', fixed_char, '|')
FROM text_comparison
WHERE fixed_char = 'short';
-- Output: |short     |  -- Notice the padding spaces

VARCHAR(n): The Constrained Variable

sql
SELECT pg_column_size(var_char) as storage_size, var_char
FROM text_comparison;
-- Reveals variable storage size based on content length
 
--| storage_size | var_char   |
--|--------------|------------|
--|            6 | short      |
--|           11 | exactlyTen |
--|           11 | exactlyTen |
sql
SELECT pg_column_size(var_char) as storage_size, var_char
FROM text_comparison;
-- Reveals variable storage size based on content length
 
--| storage_size | var_char   |
--|--------------|------------|
--|            6 | short      |
--|           11 | exactlyTen |
--|           11 | exactlyTen |

TEXT: The Flexible Friend, VARCHAR has a declared maximum length limit, while TEXT can store strings of unlimited length

Here's a table comparing CHAR, VARCHAR, and TEXT in PostgreSQL:

FeatureCHARVARCHARTEXT
LengthFixed-length, padded with spacesVariable-length up to limitUnlimited length
Max Size1GB1GB with declared limit1GB
StorageUses full declared lengthUses actual content length + 1 byteUses actual content length + 1 byte
PerformanceSlightly faster for fixed-length dataEfficient for variable-lengthSame as VARCHAR
Use CaseFixed-length codes, known-length dataData with max length constraintLarge text without length limits
Space EfficiencyLower due to paddingHighHigh
TOASTOnly if length > 2KBOnly if length > 2KBOnly if length > 2KB

In practice, TEXT and VARCHAR perform the same, so use TEXT unless you need the length constraint of VARCHAR.

Character Set and Collation Deep Dive

Character set and collation in PostgreSQL control how text data is stored and compared:

Character Set (encoding):

  • Defines how characters are stored in the database
  • Common options: UTF8 (default), LATIN1, WIN1252, etc
  • Set during database creation with CREATE DATABASE dbname ENCODING 'encoding'

Collation:

  • Determines how text values are sorted and compared
  • Affects ORDER BY, comparisons, and text pattern matching
  • Format: language_territory.encoding
  • Example: 'en_US.UTF8' for US English sorting rules
  • Set at database, table, or column level:
sql
\l -- database list does have column to show encoding.
 
show client_encoding; -- UTF8
show server_encoding; -- UTF8
 
-- Creating a case-insensitive collation
CREATE COLLATION case_insensitive (
    provider = icu, -- this stands for International Components for Unicode
    locale = 'en-US-u-ks-level1',
    deterministic = false
);
 
-- Practical application
CREATE TABLE user_tags (
    tag_name TEXT COLLATE case_insensitive UNIQUE
);
--This creates a table user_tags with a single column tag_name that:
--Is case-insensitive for comparisons and sorting
--Must contain unique values (case-insensitive)
 
INSERT INTO user_tags VALUES ('Programming');
INSERT INTO user_tags VALUES ('programming'); -- This will fail due to case-insensitive uniqueness:
sql
\l -- database list does have column to show encoding.
 
show client_encoding; -- UTF8
show server_encoding; -- UTF8
 
-- Creating a case-insensitive collation
CREATE COLLATION case_insensitive (
    provider = icu, -- this stands for International Components for Unicode
    locale = 'en-US-u-ks-level1',
    deterministic = false
);
 
-- Practical application
CREATE TABLE user_tags (
    tag_name TEXT COLLATE case_insensitive UNIQUE
);
--This creates a table user_tags with a single column tag_name that:
--Is case-insensitive for comparisons and sorting
--Must contain unique values (case-insensitive)
 
INSERT INTO user_tags VALUES ('Programming');
INSERT INTO user_tags VALUES ('programming'); -- This will fail due to case-insensitive uniqueness:

Binary Data: The Raw Reality 🔧

Binary storage in PostgreSQL opens up a world of possibilities - and potential pitfalls.

The BYTEA Type In Action

sql
CREATE TABLE binary_storage (
    file_name TEXT,
    file_data BYTEA,
    file_hash BYTEA
);
 
-- Storing different types of binary data
INSERT INTO binary_storage (file_name, file_data, file_hash) VALUES
    ('example.txt',
    decode('48656C6C6F20576F726C64', 'hex'),
    sha256('Hello World'));
--| file_name   |       file_data    |      file_hash     |
--|-------------|--------------------|--------------------|
--| example.txt | BINARY - 11 bytes  | BINARY - 32 bytes  |
sql
CREATE TABLE binary_storage (
    file_name TEXT,
    file_data BYTEA,
    file_hash BYTEA
);
 
-- Storing different types of binary data
INSERT INTO binary_storage (file_name, file_data, file_hash) VALUES
    ('example.txt',
    decode('48656C6C6F20576F726C64', 'hex'),
    sha256('Hello World'));
--| file_name   |       file_data    |      file_hash     |
--|-------------|--------------------|--------------------|
--| example.txt | BINARY - 11 bytes  | BINARY - 32 bytes  |

Binary Data Best Practices

sql
-- Compare different hash storage methods and sizes
select pg_typeof(md5('hello world')); -- text
select pg_typeof(sha256('hello world')); -- bytea
 
-- to convert md5 to bytea
select pg_typeof(decode(md5('hello world'), 'hex')); -- bytea
 
SELECT
    pg_column_size(digest('test data', 'md5')) as md5_bytea_size, -- bytea
    pg_column_size(md5('test data')) as md5_text_size, -- md5
    pg_column_size(md5('test data')::uuid) as md5_uuid_size; -- uuid
 
--| md5_bytea_size | md5_text_size | md5_uuid_size |
--|----------------|---------------|---------------|
--|             20 |            36 |            16 |
sql
-- Compare different hash storage methods and sizes
select pg_typeof(md5('hello world')); -- text
select pg_typeof(sha256('hello world')); -- bytea
 
-- to convert md5 to bytea
select pg_typeof(decode(md5('hello world'), 'hex')); -- bytea
 
SELECT
    pg_column_size(digest('test data', 'md5')) as md5_bytea_size, -- bytea
    pg_column_size(md5('test data')) as md5_text_size, -- md5
    pg_column_size(md5('test data')::uuid) as md5_uuid_size; -- uuid
 
--| md5_bytea_size | md5_text_size | md5_uuid_size |
--|----------------|---------------|---------------|
--|             20 |            36 |            16 |
  • UUID is the smallest representation of hash.
  • This is not recommended to store larger size of data, like images, etc. use document DBs for that stuff.
  • Binary types support up-to 1GB of data.

UUIDs: The Universal Identifiers 🎯

UUIDs deserve special attention in modern database design, especially in distributed systems.

UUID Implementation Strategies

sql
CREATE TABLE distributed_entities (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
    data JSONB
);
 
INSERT INTO
	distributed_entities (data)
VALUES
	('{"a": "b"}');
 
--| id                       | created_at           | data       |
--|--------------------------|----------------------|------------|
--| b1b8c4d6-304c-4ca6-98... | 2024-12-24 10:17:... | {"a": "b"} |
 
-- Different UUID generation methods
SELECT
    gen_random_uuid() as random_uuid,
    uuid_generate_v1() as timestamp_based_uuid,
    uuid_generate_v4() as random_uuid_v4;
 
-- uuid_generate_v1 and uuid_generate_v4 will need the below extension
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
sql
CREATE TABLE distributed_entities (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
    data JSONB
);
 
INSERT INTO
	distributed_entities (data)
VALUES
	('{"a": "b"}');
 
--| id                       | created_at           | data       |
--|--------------------------|----------------------|------------|
--| b1b8c4d6-304c-4ca6-98... | 2024-12-24 10:17:... | {"a": "b"} |
 
-- Different UUID generation methods
SELECT
    gen_random_uuid() as random_uuid,
    uuid_generate_v1() as timestamp_based_uuid,
    uuid_generate_v4() as random_uuid_v4;
 
-- uuid_generate_v1 and uuid_generate_v4 will need the below extension
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
  1. Storage Efficiency
sql
-- Compare storage sizes
SELECT
    pg_column_size('12345678-1234-5678-1234-567812345678'::uuid) as uuid_size,
    pg_column_size('12345678-1234-5678-1234-567812345678') as text_size;
 
--| uuid_size | text_size |
--|-----------|-----------|
--|        16 |        37 |
 
-- UUID only takes 16 bytes
-- If we convert UUID to Text, then it changes to ~40 bytes.
-- If you are storing UUID, store in UUID type.
sql
-- Compare storage sizes
SELECT
    pg_column_size('12345678-1234-5678-1234-567812345678'::uuid) as uuid_size,
    pg_column_size('12345678-1234-5678-1234-567812345678') as text_size;
 
--| uuid_size | text_size |
--|-----------|-----------|
--|        16 |        37 |
 
-- UUID only takes 16 bytes
-- If we convert UUID to Text, then it changes to ~40 bytes.
-- If you are storing UUID, store in UUID type.

This deep dive into character types, binary data, and UUIDs reveals the complexity hidden beneath these seemingly simple types. In our next instalment, we'll explore PostgreSQL's date, time, and timezone handling - another area rich with subtle complexities and gotchas.

Ready for Part 3?

🚁

Do you have any questions, or simply wish to contact me privately? Don't hesitate to shoot me a DM on Twitter.

Have a wonderful day.
Abhishek 🙏

Join My Exclusive Newsletter Community

Step into a world where creativity intersects with technology. By subscribing, you'll get a front-row seat to my latest musings, full-stack development resources, and exclusive previews of future posts. Each email is a crafted experience that includes:

  • In-depth looks at my covert projects and musings to ignite your imagination.
  • Handpicked frontend development resources and current explorations, aimed at expanding your developer toolkit.
  • A monthly infusion of inspiration with my personal selection of quotes, books, and music.

Embrace the confluence of words and wonder, curated thoughtfully and sent straight to your inbox.

No fluff. Just the highest caliber of ideas.