PostgreSQL Deep Dive Part 2 - Character Types, Binary Data, and UUIDs
Explore PostgreSQL's character types, binary data storage, and UUID implementation. Learn best practices for handling text, BYTEA, and unique identifiers in production.
Published At
12/24/2024
Reading Time
~ 10 min read
Access complete, unfiltered study notes maintained on GitHub for easily maintaing Second Brain Dir
In our journey through PostgreSQL's type system, we're now venturing into territory that often trips up even seasoned developers. Character types, binary data, and UUIDs might seem straightforward at first glance, but the devils hiding in the details can make or break your database design.
Character Types: The String Theory 📝
Let's dive into PostgreSQL's character types and uncover the subtle nuances that can impact your application's performance and reliability.
The Three Musketeers of Text Storage
CREATE TABLE text_comparison (
fixed_char CHAR(10),
var_char VARCHAR(10),
unlimited_text TEXT
);
-- Let's see how they handle different input lengths
INSERT INTO text_comparison VALUES
('short', 'short', 'short'),
('exactly_ten', 'exactly_ten', 'exactly_ten'),
('this_is_longer_than_ten', 'this_is_longer_than_ten', 'this_is_longer_than_ten'); -- Error: as char and varchar won't allow higher size
INSERT INTO text_comparison VALUES
('short', 'short', 'short'),
('exactlyTen', 'exactlyTen', 'exactly_ten'),
('exactlyTen','exactlyTen','this_is_longer_than_ten');
--| fixed_char | var_char | unlimited_text |
--|-------------|------------|-------------------------|
--| 'short ' | short | short |
--| exactlyTen | exactlyTen | exactly_ten |
--| exactlyTen | exactlyTen | this_is_longer_than_ten |
CREATE TABLE text_comparison (
fixed_char CHAR(10),
var_char VARCHAR(10),
unlimited_text TEXT
);
-- Let's see how they handle different input lengths
INSERT INTO text_comparison VALUES
('short', 'short', 'short'),
('exactly_ten', 'exactly_ten', 'exactly_ten'),
('this_is_longer_than_ten', 'this_is_longer_than_ten', 'this_is_longer_than_ten'); -- Error: as char and varchar won't allow higher size
INSERT INTO text_comparison VALUES
('short', 'short', 'short'),
('exactlyTen', 'exactlyTen', 'exactly_ten'),
('exactlyTen','exactlyTen','this_is_longer_than_ten');
--| fixed_char | var_char | unlimited_text |
--|-------------|------------|-------------------------|
--| 'short ' | short | short |
--| exactlyTen | exactlyTen | exactly_ten |
--| exactlyTen | exactlyTen | this_is_longer_than_ten |
Let's break down what happens behind the scenes:
CHAR(n)
: The Fixed-Width Trap
SELECT concat('|', fixed_char, '|')
FROM text_comparison
WHERE fixed_char = 'short';
-- Output: |short | -- Notice the padding spaces
SELECT concat('|', fixed_char, '|')
FROM text_comparison
WHERE fixed_char = 'short';
-- Output: |short | -- Notice the padding spaces
VARCHAR(n)
: The Constrained Variable
SELECT pg_column_size(var_char) as storage_size, var_char
FROM text_comparison;
-- Reveals variable storage size based on content length
--| storage_size | var_char |
--|--------------|------------|
--| 6 | short |
--| 11 | exactlyTen |
--| 11 | exactlyTen |
SELECT pg_column_size(var_char) as storage_size, var_char
FROM text_comparison;
-- Reveals variable storage size based on content length
--| storage_size | var_char |
--|--------------|------------|
--| 6 | short |
--| 11 | exactlyTen |
--| 11 | exactlyTen |
TEXT
: The Flexible Friend, VARCHAR has a declared maximum length limit, while TEXT can store strings of unlimited length
Here's a table comparing CHAR, VARCHAR, and TEXT in PostgreSQL:
Feature | CHAR | VARCHAR | TEXT |
---|---|---|---|
Length | Fixed-length, padded with spaces | Variable-length up to limit | Unlimited length |
Max Size | 1GB | 1GB with declared limit | 1GB |
Storage | Uses full declared length | Uses actual content length + 1 byte | Uses actual content length + 1 byte |
Performance | Slightly faster for fixed-length data | Efficient for variable-length | Same as VARCHAR |
Use Case | Fixed-length codes, known-length data | Data with max length constraint | Large text without length limits |
Space Efficiency | Lower due to padding | High | High |
TOAST | Only if length > 2KB | Only if length > 2KB | Only if length > 2KB |
In practice, TEXT and VARCHAR perform the same, so use TEXT unless you need the length constraint of VARCHAR.
Character Set and Collation Deep Dive
Character set and collation in PostgreSQL control how text data is stored and compared:
Character Set (encoding):
- Defines how characters are stored in the database
- Common options: UTF8 (default), LATIN1, WIN1252, etc
- Set during database creation with
CREATE DATABASE dbname ENCODING 'encoding'
Collation:
- Determines how text values are sorted and compared
- Affects ORDER BY, comparisons, and text pattern matching
- Format: language_territory.encoding
- Example: 'en_US.UTF8' for US English sorting rules
- Set at database, table, or column level:
\l -- database list does have column to show encoding.
show client_encoding; -- UTF8
show server_encoding; -- UTF8
-- Creating a case-insensitive collation
CREATE COLLATION case_insensitive (
provider = icu, -- this stands for International Components for Unicode
locale = 'en-US-u-ks-level1',
deterministic = false
);
-- Practical application
CREATE TABLE user_tags (
tag_name TEXT COLLATE case_insensitive UNIQUE
);
--This creates a table user_tags with a single column tag_name that:
--Is case-insensitive for comparisons and sorting
--Must contain unique values (case-insensitive)
INSERT INTO user_tags VALUES ('Programming');
INSERT INTO user_tags VALUES ('programming'); -- This will fail due to case-insensitive uniqueness:
\l -- database list does have column to show encoding.
show client_encoding; -- UTF8
show server_encoding; -- UTF8
-- Creating a case-insensitive collation
CREATE COLLATION case_insensitive (
provider = icu, -- this stands for International Components for Unicode
locale = 'en-US-u-ks-level1',
deterministic = false
);
-- Practical application
CREATE TABLE user_tags (
tag_name TEXT COLLATE case_insensitive UNIQUE
);
--This creates a table user_tags with a single column tag_name that:
--Is case-insensitive for comparisons and sorting
--Must contain unique values (case-insensitive)
INSERT INTO user_tags VALUES ('Programming');
INSERT INTO user_tags VALUES ('programming'); -- This will fail due to case-insensitive uniqueness:
Binary Data: The Raw Reality 🔧
Binary storage in PostgreSQL opens up a world of possibilities - and potential pitfalls.
The BYTEA Type In Action
CREATE TABLE binary_storage (
file_name TEXT,
file_data BYTEA,
file_hash BYTEA
);
-- Storing different types of binary data
INSERT INTO binary_storage (file_name, file_data, file_hash) VALUES
('example.txt',
decode('48656C6C6F20576F726C64', 'hex'),
sha256('Hello World'));
--| file_name | file_data | file_hash |
--|-------------|--------------------|--------------------|
--| example.txt | BINARY - 11 bytes | BINARY - 32 bytes |
CREATE TABLE binary_storage (
file_name TEXT,
file_data BYTEA,
file_hash BYTEA
);
-- Storing different types of binary data
INSERT INTO binary_storage (file_name, file_data, file_hash) VALUES
('example.txt',
decode('48656C6C6F20576F726C64', 'hex'),
sha256('Hello World'));
--| file_name | file_data | file_hash |
--|-------------|--------------------|--------------------|
--| example.txt | BINARY - 11 bytes | BINARY - 32 bytes |
Binary Data Best Practices
-- Compare different hash storage methods and sizes
select pg_typeof(md5('hello world')); -- text
select pg_typeof(sha256('hello world')); -- bytea
-- to convert md5 to bytea
select pg_typeof(decode(md5('hello world'), 'hex')); -- bytea
SELECT
pg_column_size(digest('test data', 'md5')) as md5_bytea_size, -- bytea
pg_column_size(md5('test data')) as md5_text_size, -- md5
pg_column_size(md5('test data')::uuid) as md5_uuid_size; -- uuid
--| md5_bytea_size | md5_text_size | md5_uuid_size |
--|----------------|---------------|---------------|
--| 20 | 36 | 16 |
-- Compare different hash storage methods and sizes
select pg_typeof(md5('hello world')); -- text
select pg_typeof(sha256('hello world')); -- bytea
-- to convert md5 to bytea
select pg_typeof(decode(md5('hello world'), 'hex')); -- bytea
SELECT
pg_column_size(digest('test data', 'md5')) as md5_bytea_size, -- bytea
pg_column_size(md5('test data')) as md5_text_size, -- md5
pg_column_size(md5('test data')::uuid) as md5_uuid_size; -- uuid
--| md5_bytea_size | md5_text_size | md5_uuid_size |
--|----------------|---------------|---------------|
--| 20 | 36 | 16 |
- UUID is the smallest representation of hash.
- This is not recommended to store larger size of data, like images, etc. use document DBs for that stuff.
- Binary types support up-to 1GB of data.
UUIDs: The Universal Identifiers 🎯
UUIDs deserve special attention in modern database design, especially in distributed systems.
UUID Implementation Strategies
CREATE TABLE distributed_entities (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
data JSONB
);
INSERT INTO
distributed_entities (data)
VALUES
('{"a": "b"}');
--| id | created_at | data |
--|--------------------------|----------------------|------------|
--| b1b8c4d6-304c-4ca6-98... | 2024-12-24 10:17:... | {"a": "b"} |
-- Different UUID generation methods
SELECT
gen_random_uuid() as random_uuid,
uuid_generate_v1() as timestamp_based_uuid,
uuid_generate_v4() as random_uuid_v4;
-- uuid_generate_v1 and uuid_generate_v4 will need the below extension
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE TABLE distributed_entities (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
data JSONB
);
INSERT INTO
distributed_entities (data)
VALUES
('{"a": "b"}');
--| id | created_at | data |
--|--------------------------|----------------------|------------|
--| b1b8c4d6-304c-4ca6-98... | 2024-12-24 10:17:... | {"a": "b"} |
-- Different UUID generation methods
SELECT
gen_random_uuid() as random_uuid,
uuid_generate_v1() as timestamp_based_uuid,
uuid_generate_v4() as random_uuid_v4;
-- uuid_generate_v1 and uuid_generate_v4 will need the below extension
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
- Storage Efficiency
-- Compare storage sizes
SELECT
pg_column_size('12345678-1234-5678-1234-567812345678'::uuid) as uuid_size,
pg_column_size('12345678-1234-5678-1234-567812345678') as text_size;
--| uuid_size | text_size |
--|-----------|-----------|
--| 16 | 37 |
-- UUID only takes 16 bytes
-- If we convert UUID to Text, then it changes to ~40 bytes.
-- If you are storing UUID, store in UUID type.
-- Compare storage sizes
SELECT
pg_column_size('12345678-1234-5678-1234-567812345678'::uuid) as uuid_size,
pg_column_size('12345678-1234-5678-1234-567812345678') as text_size;
--| uuid_size | text_size |
--|-----------|-----------|
--| 16 | 37 |
-- UUID only takes 16 bytes
-- If we convert UUID to Text, then it changes to ~40 bytes.
-- If you are storing UUID, store in UUID type.
This deep dive into character types, binary data, and UUIDs reveals the complexity hidden beneath these seemingly simple types. In our next instalment, we'll explore PostgreSQL's date, time, and timezone handling - another area rich with subtle complexities and gotchas.
Ready for Part 3?
🚁
Do you have any questions, or simply wish to contact me privately? Don't hesitate to shoot me a DM on Twitter.
Have a wonderful day.
Abhishek 🙏