Elizabeth Christensen | CrunchyData Blog

Postgres Security Checklist from the Center for Internet Security

Elizabeth.Christensen@crunchydata.com (Elizabeth Christensen) — Tue, 25 Mar 2025 11:00:00 EDT

The Center for Internet Security (CIS) releases security benchmarks to cover a wide variety of infrastructure used in modern applications, including databases, operating systems, cloud services, containerized services, and even networking. Since 2016 Crunchy Data has collaborated with CIS to provide this security resource for those deploying Postgres. The output of this collaboration is a checklist for folks to follow and improve the security posture of Postgres deployments.

The PostgreSQL CIS Benchmark™ for PostgreSQL 17 was recently released.

The Center for Internet Security

The Center for Internet Security (CIS) is a nonprofit organization that collaborates with government and commercial entities to develop best practices for securing IT systems and data. CIS Benchmarks are community driven and help provide configuration recommendations in the form of security checklists. CIS allows public contributions, reviews, and an open discussion forum on the benchmarks to make sure they meet broader community standards.

The CIS Benchmark for Postgres is a free, community supported, security checklist for Postgres.

Getting started with the Postgres benchmark

The CIS Benchmark for Postgres is a PDF, freely available for non-commercial use, with recommendations alongside Postgres configurations. The PDF is 200+ pages of descriptions, rationale, and sample code to verify Postgres configurations.

In addition to manual verification, teams that standardize on this benchmark incorporate these settings into their infrastructure deployment tools. Using infrastructure-as-code tools with the benchmarks ensures deployments across an organization meet these security specifications.

For commercial use of CIS Benchmarks, CIS has membership and tools to automatically run the benchmarks.

What is in the CIS Postgres benchmark security checklist?

The benchmark covers a variety of topics for Postgres deployment and configurations, including:

Postgres install and file permission settings
Recommended settings for logs
User access, role creation, passwords, and authorization
Guidance for using key Postgres extensions and tools like pgAudit, set_user, pgcrypto, and pgBackRest

The document is very hands-on. In many cases, CIS provides specific scripts to do the security check. For example, this looks for a PGPASSWORD environment variable stored in shell profile files, which is something to avoid:

# grep PGPASSWORD --no-messages /home/*/.{bashrc,profile,bash_profile} 
# grep PGPASSWORD --no-messages /root/.{bashrc,profile,bash_profile} 
# grep PGPASSWORD --no-messages /etc/environment

There are also several statements and queries to help with role and user validation. This SQL query creates a role tree that is pretty neat. It creates a view that shows all roles with login access, superuser configuration, and more:

CREATE OR REPLACE VIEW roletree AS
WITH RECURSIVE roltree AS (
  SELECT 
    u.rolname AS rolname, 
    u.oid AS roloid, 
    u.rolcanlogin, 
    u.rolsuper, 
    '{}' :: name[] AS rolparents, 
    NULL :: oid AS parent_roloid, 
    NULL :: name AS parent_rolname 
  FROM 
    pg_catalog.pg_authid u 
    LEFT JOIN pg_catalog.pg_auth_members m on u.oid = m.member 
    LEFT JOIN pg_catalog.pg_authid g on m.roleid = g.oid 
  WHERE 
    g.oid IS NULL 
  UNION ALL 
  SELECT 
    u.rolname AS rolname, 
    u.oid AS roloid, 
    u.rolcanlogin, 
    u.rolsuper, 
    t.rolparents || g.rolname AS rolparents, 
    g.oid AS parent_roloid, 
    g.rolname AS parent_rolname 
  FROM 
    pg_catalog.pg_authid u 
    JOIN pg_catalog.pg_auth_members m on u.oid = m.member 
    JOIN pg_catalog.pg_authid g on m.roleid = g.oid 
    JOIN roltree t on t.roloid = g.oid
) 
SELECT 
  r.rolname, 
  r.roloid, 
  r.rolcanlogin, 
  r.rolsuper, 
  r.rolparents 
FROM 
  roltree r 
ORDER BY 
  1;

Updating the benchmark for new Postgres versions

Crunchy Data helps update the benchmark with every major Postgres version. With each release we ask: are there new features that should be in the benchmark? Are there features to be wary of?

In this last release a few notable changes were made:

Addition of a recommendation for passwordcheck
Addition of a recommendation for password complexity
Revisions of the Logging, Monitoring, and Auditing section
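
For context, passwordcheck is a contrib module that ships with Postgres and is loaded through shared_preload_libraries. The benchmark has its own audit and remediation steps, so treat the following as a rough sketch of what enabling it can look like, assuming nothing else is preloaded on the server:

```sql
-- See what is preloaded today.
SHOW shared_preload_libraries;

-- ALTER SYSTEM replaces the whole list, so keep any existing entries,
-- for example: ALTER SYSTEM SET shared_preload_libraries = 'pgaudit', 'passwordcheck';
ALTER SYSTEM SET shared_preload_libraries = 'passwordcheck';

-- The setting only takes effect after a server restart.
-- Afterwards a weak password should be rejected (app_user is a hypothetical role):
-- ALTER ROLE app_user PASSWORD 'abc';
```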

Final notes

The CIS benchmark is a fantastic resource for anyone working with security around Postgres. If you need an even deeper security resource, we also work with the United States Department of Defense on a Postgres Security Technical Implementation Guide (STIG).

Need help with Postgres security? Contact our team.

Validating Data Types from Semi-Structured Data Loads in Postgres with pg_input_is_valid

Elizabeth.Christensen@crunchydata.com (Elizabeth Christensen) — Wed, 05 Mar 2025 09:30:00 EST

Working on big data loads or data type changes can be tricky - especially finding and correcting individual errors across a large data set. Postgres versions 16, 17, and newer have a new function to help with data validation: pg_input_is_valid.

pg_input_is_valid is a SQL function that determines whether a given input can be parsed into a specific type like numeric, date, JSON, etc. Here’s a super basic query to ask if ‘123’ is a valid integer.

SELECT pg_input_is_valid('123', 'integer');
 pg_input_is_valid
-------------------
 t

This function gives a t (true) or f (false) response. So if I asked SELECT pg_input_is_valid('123', 'date'); the answer would be 'f', since that's not a date.
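
To make that concrete, here is a small sketch that asks about a few literals and types in one go. The column aliases are just labels I picked for readability:

```sql
-- The same function, asked about several input/type combinations at once.
SELECT
    pg_input_is_valid('123', 'integer')     AS int_ok,      -- t
    pg_input_is_valid('123', 'date')        AS date_ok,     -- f
    pg_input_is_valid('2025-03-05', 'date') AS iso_date_ok, -- t
    pg_input_is_valid('{"a": 1}', 'jsonb')  AS jsonb_ok;    -- t
```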

This does not require special error handling or special scripts. It is built right into Postgres and can be used with standard SQL. At Crunchy Data we’ve seen some nice use cases with this where you can validate data before importing it. Generally this works best with a staging table or a temporary table, where the validation is done and offending rows are identified before a final data copy or import is run. Let’s take a look today at a few examples of how the input validation function might help.

Validating data for column changes

There are a lot of occasions when a database administrator needs to change data types. You can check something like text to integer easily. You might want to use newer JSON features and move away from old formatting. For moving columns to JSON, pg_input_is_valid can query existing rows to see if they’d conform to JSONB.

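-- data_column is assumed to be text; a bytea column would need convert_from(data_column, 'UTF8') first
SELECT pg_input_is_valid(data_column, 'jsonb')
FROM bytea_table;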
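
The text to integer case can follow the same pattern: count the rows that would fail before attempting the type change. This is only a sketch, and the inventory table and its qty_text column are hypothetical names made up for the example:

```sql
-- Hypothetical table with quantities stored as text.
CREATE TEMP TABLE inventory (item text, qty_text text);
INSERT INTO inventory VALUES ('widget', '12'), ('gadget', 'twelve');

-- How many rows would break a cast to integer? (1 here, the 'twelve' row)
SELECT count(*) AS rows_that_would_fail
FROM inventory
WHERE NOT pg_input_is_valid(qty_text, 'integer');

-- Fix the offending row, then change the column type.
UPDATE inventory SET qty_text = '12' WHERE item = 'gadget';

ALTER TABLE inventory
    ALTER COLUMN qty_text TYPE integer
    USING qty_text::integer;
```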

You might also want to use pg_input_is_valid to check text columns you want to use for integer or date. You can use a regular validity check for this or could create a new date column with only data that is valid.

UPDATE test_data
SET
    actual_date = CASE
        WHEN pg_input_is_valid (maybe_date, 'date') THEN maybe_date::date
        ELSE NULL
    END;

SELECT * from test_data ;

   name    | maybe_date   | actual_date
-----------+--------------+-------------
 David     | 2023-01-02   | 2023-01-02
 Elizabeth | Jan 1, 2024  |

Validating data for data load

Let’s say you have a CSV file containing customer data that you want to import into a table named customers. Before importing, it is a good idea to ensure that the data in the CSV file adheres to the expected format, particularly for the age and signup_date columns.

The table has the following structure:

CREATE TABLE customers (
    customer_id SERIAL PRIMARY KEY,
    name TEXT,
    email TEXT,
    age INTEGER,
    signup_date DATE
);

Create a staging table

Import the CSV data into a staging table without data type casting yet. Everything will go in as text:

CREATE TEMP TABLE staging_customers (
    customer_id TEXT,
    name TEXT,
    email TEXT,
    age TEXT,
    signup_date TEXT
);

-- copy in the data to the temp table
COPY staging_customers FROM '/path/to/customers.csv' CSV HEADER;

Use pg_input_is_valid to validate data types

Now we can write queries to identify rows with invalid data. For example, validate that the age column can be an integer and that the signup_date column can be a date.

SELECT *
FROM staging_customers
WHERE NOT pg_input_is_valid(age, 'integer')
   OR NOT pg_input_is_valid(signup_date, 'date');

This query will return all rows with either an invalid age or signup_date.
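
If you also want to know why a row failed, Postgres 16 added a companion function, pg_input_error_info, that returns the error message the cast would have raised. A minimal sketch against the same staging table (the output column aliases are my own):

```sql
-- For each bad row, show the parser's complaint for age and signup_date.
-- pg_input_error_info returns NULL fields when the input is valid.
SELECT
    s.customer_id,
    s.age,
    age_err.message  AS age_problem,
    s.signup_date,
    date_err.message AS signup_date_problem
FROM staging_customers AS s
CROSS JOIN LATERAL pg_input_error_info(s.age, 'integer') AS age_err
CROSS JOIN LATERAL pg_input_error_info(s.signup_date, 'date') AS date_err
WHERE NOT pg_input_is_valid(s.age, 'integer')
   OR NOT pg_input_is_valid(s.signup_date, 'date');
```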

Exclude invalid rows and copy data to your final table

Once the problematic rows have been identified, the rows can be manually fixed or removed. Sometimes an even cleaner option is to use pg_input_is_valid to skip bad rows as the data is copied to the table and insert only valid rows.

INSERT INTO customers (name, email, age, signup_date)
SELECT name, email, age::integer, signup_date::date
FROM staging_customers
WHERE pg_input_is_valid(age, 'integer')
  AND pg_input_is_valid(signup_date, 'date');
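
Before or after running the insert, it can be worth checking how many staged rows actually qualify, so that skipped rows do not go unnoticed. A small sketch using the same filter:

```sql
-- Compare total staged rows with the rows that pass both validity checks.
SELECT
    count(*) AS staged_rows,
    count(*) FILTER (
        WHERE pg_input_is_valid(age, 'integer')
          AND pg_input_is_valid(signup_date, 'date')
    ) AS loadable_rows
FROM staging_customers;
```

One thing to watch for: as far as I can tell, pg_input_is_valid returns NULL for NULL input, so rows with an empty age or signup_date are left out of the insert and are also not flagged by a WHERE NOT check. If NULLs are acceptable in the final table, handle them explicitly.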

Conclusion

pg_input_is_valid is a great recent addition to the Postgres toolkit for data manipulation - moving data or changing data types. In general, where I’ve seen the best use of pg_input_is_valid is a two-step data import with a staging table, a validation step to check for errors, and a final migration of data. Since this is built right into Postgres itself, whether you’re working with small datasets or millions of rows, pg_input_is_valid is a scalable, performant, and reliable way to clean and validate your data.

Indexing Materialized Views in Postgres

Elizabeth.Christensen@crunchydata.com (Elizabeth Christensen) — Mon, 03 Feb 2025 10:30:00 EST

Materialized views are widely used in Postgres today. Many of us are working with connected systems through foreign data wrappers, separate analytics systems like data warehouses, and merging data from different locations with Postgres queries. Materialized views let you precompile a query or partial table, for both local and remote data. Materialized views are static and have to be refreshed.

One of the things that can be really important for using materialized views efficiently is indexing.

Adding indexes to Postgres in general is critical for operation and query performance. Adding indexes for materialized views is also generally recommended for a few different reasons.

Materialized views are typically used for larger data sets where queries or large joins benefit from being precompiled and indexes add an additional performance boost.
Even if an underlying table has indexes, those indexes will not be used in a materialized view. The materialized view is saved on disk separately, so it needs a standalone index.
When querying a materialized view, Postgres treats it like a regular table. There’s no special query planner for materialized views. If it is the type of data that would benefit from an index if it were a table, the materialized view would benefit from an index.

Views vs Materialized views

In case you have not thought about Postgres views and Postgres materialized views recently, let’s just do a quick refresher.

A view is a saved query. It is not stored on the disk. It dynamically fetches data from the underlying tables whenever queried. Since views do not have their own storage, views cannot have indexes.
Materialized views do not dynamically fetch data from underlying tables - they are stored on disk - and must be explicitly refreshed to update the contents. This makes them ideal for scenarios involving complex queries or frequent access to relatively static datasets. Because they are stored on disk, materialized views can be indexed.

Building a materialized view with indexes

I have an example in a materialized view tutorial on the Postgres developer tutorials site. There are three tables from a demo ecommerce site - products, orders, product_orders. We’d like to show on our demo site how often each product has been purchased. This is helpful for marketing the product but we don’t need to recalculate this from the database every time a product is displayed. So static information in a materialized view is perfect for this use case. Pre-joining tables and having this ready to go will make queries by sku really easy.

Here’s a sample materialized view that shows recent product sales by sku.

CREATE MATERIALIZED VIEW recent_product_sales AS
SELECT
    p.sku,
    SUM(po.qty) AS total_quantity
FROM
    products p
    JOIN product_orders po ON p.sku = po.sku
    JOIN orders o ON po.order_id = o.order_id
WHERE
    o.status = 'Shipped'
GROUP BY
    p.sku
ORDER BY
    2 DESC;

We will likely be looking this up by sku, so we can add a simple b-tree index, calling out the materialized view like we would a table.

CREATE INDEX sku_index ON recent_product_sales (sku);
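
To confirm the index is actually being picked up, EXPLAIN works on a materialized view the same way it does on a table. A quick sketch, where the sku value is a made-up placeholder:

```sql
-- 'SKU-0001' is a hypothetical value; substitute a sku that exists in your data.
EXPLAIN
SELECT total_quantity
FROM recent_product_sales
WHERE sku = 'SKU-0001';
```

On a very small materialized view the planner may still choose a sequential scan, since reading a handful of rows directly is cheaper than going through an index.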

Creating indexes for materialized views works exactly like it does with tables. Postgres supports all the major index types, B-tree, hash, GiST, GIN, BRIN, and others on materialized views. If you need a basic intro to indexing I have a blog on Postgres Indexing types.

Refreshing our materialized view and indexes

Materialized views are static, so to add new data, we have to refresh them. There are two ways Postgres can refresh a materialized view: a regular refresh and one done concurrently.

Non-Concurrent (locking) refresh

This refresh completely replaces the content of the materialized view. The index you built prior to this remains and Postgres will recreate the index with the refreshed data.

Postgres acquires an exclusive lock on the materialized view during this refresh, preventing any reads or writes. This is the fastest option but it often won’t work for production systems with live reads coming in.

REFRESH MATERIALIZED VIEW recent_product_sales;

In addition to building the materialized view, Postgres will have to rebuild the index from scratch. Depending on data size, this can be a pretty long operation.

Concurrent (non-locking) refresh

This refresh updates the materialized view without blocking reads, letting you keep querying it while the refresh is happening. Instead of swapping in a whole new copy, Postgres computes the fresh result, compares it with the existing rows, and applies only the differences, so the indexes are maintained as rows change. This is normally slower than a regular refresh due to the incremental approach, but allowing reads during the process makes it the favorable option for production databases.

Concurrent refresh requires a unique index on the materialized view to function. The unique index ensures that each row in the materialized view can be uniquely identified. The b-tree index we added earlier has not been explicitly declared as unique, so we can add a new unique index and drop the old one.

CREATE UNIQUE INDEX unique_idx_recent_product_sales ON recent_product_sales(sku);

DROP INDEX sku_index;

Now we can do our concurrent refresh:

REFRESH MATERIALIZED VIEW CONCURRENTLY recent_product_sales;

Materialized views with no column, or combination of columns, that uniquely identifies each row cannot use unique indexes - and cannot use the concurrent refresh option. In that case, you’ll have to work around it with the regular refresh.
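
The unique index does not have to be on a single column. When the view groups by more than one column, a unique index across the grouping columns is usually enough. Here is a sketch with a hypothetical monthly variant of the view above; it assumes the orders table has an order_date column, which the tutorial schema may or may not have:

```sql
-- Hypothetical view: sales per sku per month. Each (sku, sales_month) pair
-- appears once because those are exactly the GROUP BY columns.
CREATE MATERIALIZED VIEW monthly_product_sales AS
SELECT
    po.sku,
    date_trunc('month', o.order_date) AS sales_month,
    SUM(po.qty) AS total_quantity
FROM
    product_orders po
    JOIN orders o ON po.order_id = o.order_id
GROUP BY
    po.sku,
    date_trunc('month', o.order_date);

-- A multi-column unique index satisfies the concurrent refresh requirement.
CREATE UNIQUE INDEX unique_idx_monthly_product_sales
    ON monthly_product_sales (sku, sales_month);

REFRESH MATERIALIZED VIEW CONCURRENTLY monthly_product_sales;
```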

Summary

There are always additional considerations with indexing. Each project needs its own review of index usage based on query patterns, refresh frequency, and the materialized view’s size to maximize efficiency. Indexes are stored on disk so they require their own additional storage. They can have a performance impact, especially with large or complex views and indexes. You can monitor how long refreshes and index rebuilds are taking with \timing in psql or with the logs.
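
For the storage side, the built-in size functions work on materialized views too. A small sketch using the view from this post:

```sql
-- Disk space used by the materialized view's data and by all of its indexes.
SELECT
    pg_size_pretty(pg_relation_size('recent_product_sales')) AS matview_size,
    pg_size_pretty(pg_indexes_size('recent_product_sales'))  AS index_size;
```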

Summary notes:

Even if an underlying table has indexes, they have to be created again on the materialized view.
Materialized views often benefit from indexing.
Materialized views are static and need to be refreshed. A regular refresh will lock the view from reads; a concurrent refresh will not.
When using the refresh concurrently option, the materialized view has to have a UNIQUE index.

Name Collision of the Year: Vector

Elizabeth.Christensen@crunchydata.com (Elizabeth Christensen) — Thu, 26 Dec 2024 08:30:00 EST

I can’t get through a Zoom call, a conference talk, or an afternoon scroll through LinkedIn without hearing about vectors. Do you feel like the term vector is everywhere this year? It is. Vector actually means several different things and it's confusing. Vector means AI data, GIS locations, digital graphics, a type of query optimization, and more. The terms and uses are related, sure. They all stem from the same original concept. However, their practical applications are quite different. So “Vector” is my choice for this year’s name collision of the year.

In this post I want to break down the vector: the history of the vector, how vectors were used in the past, and how they evolved to what they are today (with examples!).

The original vector

The idea that vectors are based on goes back to the 1600s when René Descartes first developed the Cartesian XY coordinate system to represent points in space. Descartes didn't use the word vector but he did develop a numerical representation of a location and direction. Numerical locations are the foundational concept of the vector - used for measuring spatial relationships.

The first use of the term vector was in the 1840s by an Irish mathematician named William Rowan Hamilton. Hamilton defined a vector as a quantity with both magnitude and direction in three-dimensional space. He used it to describe geometric directions and distances, like arrows in 3D space. Hamilton combined his vectors with several other math terms to solve problems with rotation and three dimensional units.

The word Hamilton chose, vector, comes from the Latin word vehere meaning ‘to carry’ or ‘conveyor’ (yes, same origin for the word vehicle). We assume Hamilton chose this Latin word origin to emphasize the idea of a vector carrying a point from one location to another.

There’s a book about the history of vectors published just this year, and a nice summary here. I’ve already let Santa know this is on my list this year.

Mathematical vectors

Building upon Hamilton’s work, vectors have been used extensively in linear algebra, both before and after computational math. If it has been 20 years since you took a math class, here’s a quick refresher.

Linear algebra is a branch of mathematics that focuses on vectors, matrices, and arrays of numbers. Here’s a super simple mathematical vector equation. We have two points on an XY coordinate system, point A at (1, 2) and point B at (4, 6). The vector from A to B is found by subtracting A’s coordinates from B’s, which gives (3, 4).
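
Written out as a worked equation, using the two points from the example (the magnitude at the end is an extra step added for illustration):

```latex
\vec{AB} = B - A = (4 - 1,\; 6 - 2) = (3,\; 4),
\qquad
\lVert \vec{AB} \rVert = \sqrt{3^2 + 4^2} = 5
```

So the vector carries point A three units along X and four units along Y, for a total distance of five.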

Linear algebra of much more complicated forms is used in solving systems of linear differential equations. Vector equations have practical use cases in physics and engineering for things we use every day like heat conduction, fluids, and electrical circuits.

Computer science vectors

Early computer scientists made heavy use of the vector in a variety of ways. A computational vector can be similar to the example above or even just a simple numeric array of fixed size where the numbers have related values. In early computer programming, simple operations like addition or subtraction would be applied to a set of vectors.

A basic example of this could be financial portfolio analysis where you have two vectors: (1) portfolio weights, v1, showing the proportion of investment in different stocks and (2) market impact adjustments, v2, that adjust the weights based on current market values. This code sample in C calculates the adjusted weights for each stock in the portfolio by adding the two vectors.

#include <stdio.h>

#define STOCKS 8

typedef float Portfolio[STOCKS];

int main() {
    // Portfolio weights (in percentages, out of 100)
    Portfolio portfolioWeights = {10.0, 20.0, 15.0, 25.0, 5.0, 10.0, 10.0, 5.0};
    // Market impact adjustments (positive or negative percentages)
    Portfolio marketAdjustments = {0.5, -0.3, 1.0, -0.5, 0.2, -0.1, 0.0, 0.7};
    Portfolio adjustedWeights;

    // Perform vector addition
    for (int i = 0; i < STOCKS; i++) {
        adjustedWeights[i] = portfolioWeights[i] + marketAdjustments[i];
    }

    // Print adjusted weights
    printf("Adjusted Portfolio Weights: <");
    for (int i = 0; i < STOCKS; i++) {
        printf("%s%.1f%%", i > 0 ? ", " : "", adjustedWeights[i]);
    }
    printf(">\n");

    return 0;
}

Modern computer science builds on similar concepts of organizing and processing collections. The std::vector in C++ and Vec<T> in Rust are general-purpose dynamic arrays. They can hold virtually any data type to help manage or compute collections of elements.

Graphics and vectors

Vector graphics were used in early arcade and video game development. Think of something like Spacewar! or Asteroids. Vectors could be used to draw lines and shapes like ships and stars.

Here’s a super simple example of how vectors could be used to draw a triangle.

// Placeholder macro: a real program would call a graphics library's line-drawing routine here.
#define DrawLine(pt1, pt2)

typedef struct Point {
    int x, y;
} Point;

typedef struct Line {
    Point start;
    Point end;
} Line;

Line lines[3] = {
    {{0, 0}, {100, 100}},  // Line 1
    {{100, 100}, {200, 50}}, // Line 2
    {{200, 50}, {0, 0}}    // Line 3
};

// Loop through these points to draw our triangle on the screen.
int main()
{
    for (int i = 0; i < 3; i++)
    {
        DrawLine(lines[i].start, lines[i].end);
    }
    return 0;
}

These early xy arrays and computerized graphics paved the way for modern computer graphics which make use of vectors in even more advanced ways. When you play a modern 3D video game, many characters, objects, and movement you see on the screen are powered by linear algebra vectors.

The Graphics Processing Unit (GPU) is a specialized processor developed in the 1990s and improved on in the decades since. GPUs handle the millions of vector operations required to create 3D graphics in real time. GPUs now are used for far more than 3D graphics. Vector-based assembly operations can operate on a contiguous block of memory, doing the same operation across different chunks of memory.

Scalable vector graphics (SVG)

SVGs are 2D vector graphics that have become a de facto image format in web design and development. There’s a vector standard that allows SVG graphics to be created with a series of numbers that represent shapes and paths that work across devices and web browsers. SVG graphics display logos, icons, charts, and animations. Their popularity took off in the mid 2010s and continues to grow thanks to their performance and lightweight nature.

SVGs use some number of vector numbers to describe the object they represent. A simple SVG with a few shapes might contain dozens of numbers. A more complex SVG like one for a detailed icon or map might include thousands of numbers.

Here’s what the SVG of the Crunchy Data hippo logo looks like:

<svg
	id="aad9811e-aeeb-4dae-a064-7d889077489a"
	data-name="Layer 6"
	xmlns="http://www.w3.org/2000/svg"
	viewBox="0 0 1407.15 1158.38"
>
	<path
		d="M553.21,651l124.3,122.4-154.9-89Zm-304.5-496.6-54.6,148.9L35.71,415.19,6.81,523.49l-6.5,67.9,83.1,65.2h0l208.7-10.3,114.1-155.7,3.6-166,199.3-200.5-104.7-41.9Zm0,0,360.4-30.3m-104.7-41.9-114.1,61.4-130.7,213.5-105.5,150.5-70.8,149m322.9-166-145.9-135.4-222.5,62.1M294.21,642l-140.1-135.1L1,586.39m36.1-171.2,116.3,91,190.8-73.1m-95.5-278.7L259.61,357m150.1-32.4-19.4-181m218.8-19.5,14.7,196.7-59.5,137.4-49.1,104-92.7,47.2-128.8,35.9,139.8,39.3L621.21,632l62.4-196.3,16.7-174.4-92.4-136.9M621.21,632l-215-141.5,26.7,194-349.6-28m617-395.2-294.1,229.3,215,141.5m-217.1,50.2,8.6,306.7-17.5,35.7,6.1,52.8,101.7-4.8,63.5-63.9,6-47.9L588.41,792h0l89.2-18.4,97.2,23.4,84.2,19.7-2.1,46.5,10.5,30.4-19,28.9,28.1,1.9,1.6-.8,6,105.5-15.1,40.1,25.3,88.7,132.1-33-6.1-50.6,65.5-306.8,49.5-12.2,57-43,29,41.1,2.4,88.3,5.8,61.8-18.6,46.2,23.5,38.7,96.5-12.4,44.3-43.5-21.1-28.8,13.8-216.9,4-65.5,34.6-116.4-23.4-120.4-332.8-215.1L842,135l-151.2,47.5m119.9,84.8-202.4-143.1m202.4,143.1L849,552.39l134.2-214.2ZM1164,453.09l-180.8-115-42.6,277Zm-486.5,320.4,263-158.4L849,552.39Zm133.2-506.2-110.6-4-4.6,48.5,115-42.3m-133,504-154.9-89,65.7,107.4Zm170.3-25.9,35.1,87,57.6-219.4Zm117.7,83.3-25-215.8-57.6,219.4Zm-24.9-215.8,25,215.8,120.2-63.5Zm12.7,418.8,94-83.9-81.9-119.1Zm-105.5-285.6-170.3,25.2,200,47.7ZM1164,453.09l-70.6,270.3,141.1-114Zm70.5,156.3,77.8-132.8L1195,262.89Zm-251.3-271.3,180.8,115,31.1-190.2Zm67.1-168.8-67.1,168.8,211.9-75.2ZM842,135l-151.2,47.5,359.5-13.9Zm244.2,633.2,7.2-44.8m167.2-63.1,51.8-183.7-77.9,132.8Zm0,0-26.1-50.9-99.3,145.8Zm0,0,84.1-88.7-32.4-95Zm84.1-88.7-84.1,88.7,42.4-7.6Zm-22.6-226.7-9.8,131.7,32.4,95Zm0,0,22.6,226.7,62-69Zm46.3,339.3-65.3-30.2,56.7,161.5Zm-114.7,122.3,77.3-31.9-28.1-121.8Zm49.2-153.7,28.1,121.8,28.9,40.9Zm69.3-32.3-27.5-48.9,23.7,112.6ZM1331,774.59l-4.7,123.7,33.6-82.7Zm-93.9,213.3,94.5-12.7-5.4-78.4Zm16.6-181.4-30,35.1,13.4,139.9,63.4-138.2Zm0,0-33.1-115.9,3.1,150.6Zm-32.8-115.2,82.2-37.2m-73.5,249.3,7.6,84.6m94.5-12.8,43.7-42.9-49.1-35.5Zm-5.8
-79.2,29.1,7.3m-942.3,85.6-11.4,88.5,63.4-55.8Zm51.2,31.9,38.7,52.5,63.8-64.5Zm556,53.9-66.6-40.8-59.2,123.9Zm-431.6-282.8-112.2,70.4-11.4,159.3Zm-178.6,89.3,2.9,107.7,63.5-126.6Zm238-729.1,40.7-57.4L702,45.29l-13.6-32L650.11.49l-13.6,2.6-31.2,41.3-10.3,73,14.1,6.7ZM650,.49l-48.6,74.7,81.4-45.9Zm32.7,28.4L702,45.19m-19.1-15.3,5.5,64.8L647.31,110l-38.2,14.1m0,0-7.7-48.9m87-61.9-5.5,16.6L650,.59m-269.3,116-4.1-59.1-45-22.9-43.7,26.8,2.7,42.8,11.5,35.3M346.21,81l-14.6-46.5-41,69.7L346.21,81l-43.8,58.5m74.2-82.1L346.21,81l34.5,35.6m486.4,777.9,10.9,29m4.9-90.7-15.6,60.6,10.7,30.1Zm-407,32,46.7-180.3-112.9,196.7m23.2-196.6,89.7-.1,30.6-33.4M744.81,394l-10.6,113.9L849,552.39Zm-75.5,84.8L621.21,632l113.1-124.1Zm64.9,29.1-56.7,265.6m0,0,27.2-133.3-83.6-8.1Zm68.1-380.1-59.2,18m9-99.7,49.4,82.3,65.7-124.6Zm-289.2,178.9,277.3-54.9m200.3,594.7,31-31.4,50.7-168.1m-82.6,1.9,31.9,166.1,38.5,34.9M1331,774.59l-30.4,68.7,25.8,53.5M287.91,61.39l23.9,6.7"
		fill="none"
		stroke="currentColor"
		stroke-linejoin="bevel"
	/>
</svg>

GIS vector data

In modern computational GIS, vectors are used to represent geometric data types like points, line-strings, and polygons. Like any other x,y,z vector coordinate system the vectors refer to specific global points or objects. There are quite a few different spatial reference systems that can be used. The vectors are typically stored in PostGIS using Well-Known Binary (WKB), a standardized binary encoding for geometries. Vectorization also powers many of the key functions in modern geospatial data processing like intersections, distance calculations, joins, and proximity analysis.

Here’s the vector binary for (imho) the best BBQ restaurant in the world:

 restaurant_name |                        geom
-----------------+----------------------------------------------------
 Gates Bar B Q   | 0101000020E610000082E673EE76A557C007B47405DB884340
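
That hex string is not meant to be read by humans, but PostGIS can turn it back into something readable. A small sketch, assuming the PostGIS extension is installed in the database:

```sql
-- Cast the hex-encoded binary back to a geometry, then print its
-- spatial reference ID and its Well-Known Text form.
SELECT
    ST_SRID(g)   AS srid,
    ST_AsText(g) AS wkt
FROM (
    SELECT '0101000020E610000082E673EE76A557C007B47405DB884340'::geometry AS g
) AS t;
```

By my reading of the bytes, this is a point with SRID 4326 at roughly longitude -94.585, latitude 39.069.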

AI Vectors

AI vectors emerged from the mathematical and computational foundations of vectors that I covered above. Through advancements in hardware and in machine learning algorithms, vectors can be used as a system to describe virtually anything. Large Language Models (LLMs) convert data like text, images, or other inputs into vectors through a process called embedding. LLMs use layers of neural networks to process the embeddings in a specific context. So the vectors numerically represent relationships between objects within the context they were created with.

You’ve probably heard of the pgvector extension that is used for storing and querying AI related embedding data. pgvector adds a custom data type vector for storing fixed-length arrays of floating-point numbers. pgvector stores up to 16k dimensions.

My colleague Karen Jex has a great embedding talk she does about AI called “What’s the Opposite of a Corn Dog”. The vector embedding for a corn dog from an OpenAI menu dataset is an array of a staggering 1536 numbers. Here’s a snippet.

// vector of a Corn Dog
[0.0045576594,-0.00088141876,-0.014024569,-0.011641564,0.0038251784,0.010306821,-0.01265076,-0.013672978,-0.01582159,-0.041670028,0.0044274405,.........0.040185533,-0.010463083,0.004326521,-0.019571891,0.01853014,0.025770308,-0.017787892,0.0018572462]

In AI and machine learning, a vector is an ordered list of numbers that represents data for literally anything. Really what “AI” is doing is turning anything and everything into a vector and then comparing that vector with other vectors in the same matrix.
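
Here is roughly what that comparison looks like with pgvector. This is a toy sketch: real embeddings have hundreds or thousands of dimensions, while the three-number vectors and the menu_items table below are invented purely for illustration.

```sql
-- Requires the pgvector extension to be installed on the server.
CREATE EXTENSION IF NOT EXISTS vector;

-- Toy 3-dimensional embeddings; the numbers are made up.
CREATE TEMP TABLE menu_items (
    name      text,
    embedding vector(3)
);

INSERT INTO menu_items (name, embedding) VALUES
    ('corn dog',     '[0.90, 0.10, 0.20]'),
    ('hot dog',      '[0.85, 0.15, 0.25]'),
    ('garden salad', '[0.10, 0.90, 0.30]');

-- <=> is pgvector's cosine distance operator: smaller means more similar.
SELECT name, embedding <=> '[0.90, 0.10, 0.20]' AS cosine_distance
FROM menu_items
ORDER BY cosine_distance
LIMIT 3;
```

With these made-up numbers the corn dog comes back first, the hot dog is a close second, and the salad is far behind.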

Vectorized queries

As computational vectors have become so popular along with machine learning, the underlying methods and CPU hardware for processing vector data are now used to process other kinds of data.

There are several databases on the market now like DuckDB, BigQuery, Snowflake, and Crunchy Data Warehouse that make use of vectorized query execution to speed up analytics queries. Vectorized execution processes data in batches of values of the same type, applying the same operation across a whole chunk at once. In a way, they’re treating columns of data like mathematical vectors. This can be much more powerful than reading data row by row. The power here also comes from the parallelization and effective CPU and IO usage.

The values processed with vectorized execution are typically treated as vectors in the sense that they’re contiguous batches of data elements. Surprisingly, they do not need to represent mathematical vectors—they can be any kind of data that fits the processing model.

Vectors are everywhere!

Vectors are everywhere and they can mean virtually anything in a computerized context - especially now with AI - everything is or can be a vector.

Vectors and their uses are one of the main characters in the story of modern computing, an evolution from pen and ink math to modern ML algorithms. The beauty of the vector is in its infinite use of numeric representation, from simple concepts like a point on the globe to computerized graphics and animation, and AI embeddings for any text or image.

Vector use summary:

Attributions

Hamilton’s Lecture on Vectors

Easy Totals and Subtotals in Postgres with Rollup and Cube

Elizabeth.Christensen@crunchydata.com (Elizabeth Christensen) — Mon, 18 Nov 2024 09:30:00 EST

Postgres is being used more and more for analytical workloads. There are a few hidden gems I recently ran across that are really handy for doing SQL for data analysis: ROLLUP and CUBE. Rollup and cube don’t get a lot of attention, but follow along with me in this post to see how they can save you a few steps and enhance your date binning and summary reporting.

We also have a web based tutorial that covers Postgres Functions for Rolling Up Data by Date if you want to try it yourself with a sample data set.

Superpowered Group By

Before we dig into rollup and cube, let’s look at GROUP BY statements. Here’s an example query where I want to get totals for all my months of sales and categories.

Using to_char with date_trunc I can roll up things by month. With GROUP BY I can roll up data by category.

-- Get totals for each month and category
SELECT
    to_char(date_trunc('month', order_date), 'FMMonth YYYY') AS month,
    category,
    COUNT(*) AS total_orders,
    SUM(total_amount) AS total_amount
FROM
    orders
GROUP BY
    date_trunc('month', order_date), category
ORDER BY
    date_trunc('month', order_date), category;

     month      |  category   | total_orders | total_amount
----------------+-------------+--------------+--------------
 October 2021   | Books       |            3 |      2375.73
 October 2021   | Clothing    |           18 |     13770.09
 October 2021   | Computers   |           17 |     13005.87

If I wanted to get subtotals for each month and a grand total, I would have to pull this into Excel or write separate select statements and unions. You would have to add a lot more SQL in snippets and sections like this:

-- Get total for each month
....
UNION ALL

SELECT
    to_char(date_trunc('month', order_date), 'FMMonth YYYY') AS month,
    NULL AS category,
    COUNT(*) AS total_orders,
    SUM(total_amount) AS total_amount
FROM
    orders
GROUP BY
    date_trunc('month', order_date)
-- no ORDER BY here: in a UNION it can only follow the final SELECT

-- Get grand total across all months and categories
...
UNION ALL

SELECT
    NULL AS month,
    NULL AS category,
    COUNT(*) AS total_orders,
    SUM(total_amount) AS total_amount
FROM
    orders

ROLLUP and CUBE will do all of these subtotals for you, though! Let’s take a closer look.

GROUP BY ROLLUP

ROLLUP is an extension you can add to the GROUP BY clause. When you use it, ROLLUP gives you the individual bins of totals, a sub-total for each value of the first grouping column (here, each month), and a grand total. Here’s an example where I just add ROLLUP to my GROUP BY.

SELECT
    to_char(date_trunc('month', order_date), 'FMMonth YYYY') AS month,
    category,
    COUNT(*) AS total_orders,
    SUM(total_amount) AS total_amount
FROM
    orders
GROUP BY
    ROLLUP (date_trunc('month', order_date), category)
ORDER BY
    date_trunc('month', order_date), category;

Notice that in this example, the sub-total row has a null value for the category.

     month      |  category   | total_orders | total_amount
----------------+-------------+--------------+--------------
 October 2021   | Books       |            3 |      2375.73
 October 2021   | Clothing    |           18 |     13770.09
 October 2021   | Computers   |           17 |     13005.87
 October 2021   | Electronics |           25 |     16358.96
 October 2021   |             |           63 |     45510.65

These null values represent the subtotals.
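
One way to see where those rows come from: Postgres also accepts an explicit GROUPING SETS list, and ROLLUP (a, b) is shorthand for the sets (a, b), (a), and (). The sketch below should return the same rows as the ROLLUP query above. The empty set produces one grand total row with both columns null, which lands at the bottom because nulls sort last in ascending order by default.

```sql
-- ROLLUP (month, category) written out as explicit grouping sets
SELECT
    to_char(date_trunc('month', order_date), 'FMMonth YYYY') AS month,
    category,
    COUNT(*) AS total_orders,
    SUM(total_amount) AS total_amount
FROM
    orders
GROUP BY
    GROUPING SETS (
        (date_trunc('month', order_date), category),  -- one row per month and category
        (date_trunc('month', order_date)),            -- one sub-total row per month
        ()                                            -- one grand total row
    )
ORDER BY
    date_trunc('month', order_date), category;
```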

You can probably see that this is a really handy feature for reporting. Rollup gives you the detail rows and the sub-total rows, the ones with NULL values, in a single result. If you want to do a quick survey of your products or data by certain categories, rollup is a great tool. You can combine that with date_trunc to get a rollup of categories by any date bin.
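
As a sketch of the "any date bin" idea, here is the same orders table rolled up by year, then month, then category. The year and month column aliases are just names I picked for illustration. Because ROLLUP works left to right, this should give a sub-total per month, a sub-total per year, and a grand total.

```sql
-- Three-level rollup: year > month > category
SELECT
    to_char(date_trunc('year', order_date), 'YYYY')     AS year,
    to_char(date_trunc('month', order_date), 'FMMonth') AS month,
    category,
    COUNT(*) AS total_orders,
    SUM(total_amount) AS total_amount
FROM
    orders
GROUP BY
    ROLLUP (
        date_trunc('year', order_date),
        date_trunc('month', order_date),
        category
    )
ORDER BY
    date_trunc('year', order_date),
    date_trunc('month', order_date),
    category;
```

Rows with a null category are monthly sub-totals, rows with a null month and category are yearly sub-totals, and the row with all three null is the grand total.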

GROUP BY CUBE

CUBE takes the rollup one step further and computes sub-totals for every combination of the dimensions you’ve grouped by, plus the grand total. Very similar to ROLLUP, we can look at both dates and categories of sales, but now we get a total for each category across all months in addition to the sub-total for each month. Again, in this example, I just add CUBE to the GROUP BY clause.

SELECT
    to_char(date_trunc('month', order_date), 'FMMonth YYYY') AS month,
    category,
    COUNT(*) AS total_orders,
    SUM(total_amount) AS total_amount
FROM
    orders
GROUP BY
    CUBE (date_trunc('month', order_date), category)
ORDER BY
    date_trunc('month', order_date), category;

     month      |  category   | total_orders | total_amount
----------------+-------------+--------------+--------------
 October 2024   | Books       |            9 |      5574.92
 October 2024   | Clothing    |           19 |     11856.80
 October 2024   | Computers   |           22 |     13002.10
 October 2024   | Electronics |           50 |     34251.83
 October 2024   |             |          100 |     64685.65
                | Books       |          521 |    328242.79
                | Clothing    |         1133 |    739866.25
                | Computers   |         1069 |    680817.70
                | Electronics |         2709 |   1707713.80
                |             |         5432 |   3456640.54
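
To make the difference from ROLLUP concrete, here is a sketch of the same CUBE written as explicit GROUPING SETS. CUBE (a, b) expands to (a, b), (a), (b), and (), so the only addition over ROLLUP is the category-only set, which is where the rows with an empty month come from.

```sql
-- CUBE (month, category) written out as explicit grouping sets
SELECT
    to_char(date_trunc('month', order_date), 'FMMonth YYYY') AS month,
    category,
    COUNT(*) AS total_orders,
    SUM(total_amount) AS total_amount
FROM
    orders
GROUP BY
    GROUPING SETS (
        (date_trunc('month', order_date), category),  -- one row per month and category
        (date_trunc('month', order_date)),            -- sub-total per month
        (category),                                   -- total per category across all months
        ()                                            -- grand total
    )
ORDER BY
    date_trunc('month', order_date), category;
```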

Like ROLLUP, these sub-totals aren’t labeled. They have null values in whichever column was totaled across.

Label ROLLUP and CUBE totals with COALESCE

I find these null values for the sub-totals kind of strange. If you’re like me, or you’re sharing these raw reports with several people, you might want labels instead of these null values. You can use the COALESCE function to do some basic renaming of the null values. COALESCE is commonly used in cases like these when you want to handle NULL values in queries.

Here’s a sample where COALESCE wraps the time bin and the category, so when we group by cube below, the sub-totals are labeled.

SELECT
    COALESCE(to_char(date_trunc('month', order_date), 'FMMonth YYYY'), 'Grand Total') AS month,
    COALESCE(category, 'Subtotal') AS category,
    COUNT(*) AS total_orders,
    SUM(total_amount) AS total_amount
FROM
    orders
GROUP BY
    CUBE (date_trunc('month', order_date), category)
ORDER BY
    date_trunc('month', order_date), category;

     month      |  category   | total_orders | total_amount
----------------+-------------+--------------+--------------
 October 2024   | Books       |            9 |      5574.92
 October 2024   | Clothing    |           19 |     11856.80
 October 2024   | Computers   |           22 |     13002.10
 October 2024   | Electronics |           50 |     34251.83
 October 2024   | Subtotal    |          100 |     64685.65
 Grand Total    | Books       |          521 |    328242.79
 Grand Total    | Clothing    |         1133 |    739866.25
 Grand Total    | Computers   |         1069 |    680817.70
 Grand Total    | Electronics |         2709 |   1707713.80
 Grand Total    | Subtotal    |         5432 |   3456640.54
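
One caveat with COALESCE: it can’t tell a sub-total row apart from an order whose category really is NULL, so both would end up labeled the same. If that could happen in your data, a possible alternative is the GROUPING function, which returns 1 when a column has been totaled across in that row and 0 otherwise. This is a sketch that reuses the same labels as above:

```sql
-- Label sub-total rows with GROUPING() instead of COALESCE
SELECT
    CASE WHEN GROUPING(date_trunc('month', order_date)) = 1
         THEN 'Grand Total'
         ELSE to_char(date_trunc('month', order_date), 'FMMonth YYYY')
    END AS month,
    CASE WHEN GROUPING(category) = 1
         THEN 'Subtotal'
         ELSE category          -- a genuinely NULL category stays NULL
    END AS category,
    COUNT(*) AS total_orders,
    SUM(total_amount) AS total_amount
FROM
    orders
GROUP BY
    CUBE (date_trunc('month', order_date), category)
ORDER BY
    date_trunc('month', order_date), category;
```

The arguments to GROUPING have to match the expressions in the GROUP BY clause exactly, which is why date_trunc('month', order_date) is repeated there.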

Summary

If you need to do date binning or rollups for your data by date, check out rollup and cube. They’re super easy additions to the GROUP BY clause that will do your subtotals and grand totals.