Performance Tips Using Postgres and pgvector

Note: pgvector 0.5 released HNSW indexes, which improved performance significantly. Read more about it in HNSW Indexes with Postgres and pgvector. We have additional articles in this Postgres AI series.

As we've been helping people get started with AI in Postgres with pgvector, there have been a few questions around performance. At a basic level, pgvector performance relies on three things:

  1. Are your queries using indexes?
  2. Are you setting your list size appropriately for your data set?
  3. Do you have enough memory for your indexes + ability to change settings?

For an intro to using pgvector, see What’s Postgres Got To Do With AI. In it, we discuss the vector datatype, querying, and indexing options. Throughout this blog post, we will refer to a “recipes” dataset: in the prior post, we built an AI-powered recipe recommendation engine on top of it.

Do you want an index?

You probably do. It is important to note that vector indexes perform “approximate nearest neighbor” (ANN) searches. So if you have a hard requirement that a query return absolutely 100% of all nearby vectors, you will be stuck with full scans, which are slow on large data sets.

However, most vector use cases are all about finding things that are “kind of similar to” other things, and a loose index supports that use case fine. Most users use ANN indexes and are happy with the results.
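If a particular query does need exact results even after an index exists, one option is to use a standard Postgres planner setting, rather than anything pgvector-specific, to disable index scans inside a single transaction. A minimal sketch, reusing the recipe lookup from later in this post:

BEGIN;
-- Applies only to this transaction: forces an exact full scan for this query
SET LOCAL enable_indexscan = off;
SELECT id
FROM recipes
WHERE id != 142508
ORDER BY embedding <-> (SELECT embedding FROM recipes WHERE id = 142508)
LIMIT 5;
COMMIT;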

How vector indexes use lists

As with other data types, indexes on the ‘vector’ type provide the system a shortcut to finding the records the query is looking for, by accessing a “more organized” form of the data.

For numbers and strings, that organization takes the form of a balanced tree. For vectors, the organization is a partitioning of the data into a set of “lists”, each of which covers a distinct partition of the multi-dimensional space vectors are embedded in.

When initially building an index, the system takes a sample of vectors and runs a K-means clustering of the sample, to generate the space partitions for each list.

[Diagram: K-means clustering partitioning sampled vectors into lists]

A bigger K means more lists, which means fewer records per list, which means a more fine-grained index. But it also means more calculation time to generate those K means!

pgvector uses ivfflat indexes with three distance operator classes: vector_l2_ops, vector_ip_ops, and vector_cosine_ops. From a performance standpoint, these behave similarly.
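Note that each opclass pairs with a specific distance operator, and a query is only eligible to use the index when its ORDER BY uses the matching operator:

-- vector_l2_ops      pairs with <->  (L2 / Euclidean distance)
-- vector_ip_ops      pairs with <#>  (negative inner product)
-- vector_cosine_ops  pairs with <=>  (cosine distance)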

For this post, we used the following index:

CREATE INDEX ON recipes USING ivfflat (embedding vector_l2_ops)
WITH (lists = 500);

Are your queries using indexes?

Once your index is built, you still need to make sure your query uses the index. Vector indexes are “approximate nearest neighbor” indexes, so the first thing to note is that your SQL will have to be structured to use the “nearest neighbor operator” in the “ORDER BY” clause in just the right way.

When using vector data, be prepared to re-engineer your queries a bit so that they hit indexes. For instance, the following two queries return the same result, but one does not use the index while the other does.

A query that takes 500ms, and does not use indexes:

SELECT
    r1.id
FROM recipes r0, recipes r1
WHERE r0.id = 142508
    AND r1.id != r0.id
ORDER BY r0.embedding <-> r1.embedding
LIMIT 1;

A query that takes 5ms, and does use indexes:

SELECT
    r1.id
FROM recipes r1
WHERE id != 142508
ORDER BY r1.embedding <-> (SELECT embedding FROM recipes WHERE id = 142508)
LIMIT 1;

Both of these queries return exactly one row, and the same row. Without indexes, both have nearly identical performance, yet their EXPLAIN query paths differ.

Be ready to adjust the queries you are writing so that the optimizer executes them differently. To understand query execution with vector data, get familiar with EXPLAIN.
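For example, you can run EXPLAIN on the faster query from above:

EXPLAIN SELECT
    r1.id
FROM recipes r1
WHERE id != 142508
ORDER BY r1.embedding <-> (SELECT embedding FROM recipes WHERE id = 142508)
LIMIT 1;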

When using EXPLAIN, this is the type of line you are looking for that signifies you are using the index for your query:

->  Index Scan using recipes_embedding_idx on recipes r1  (cost=204.38..11300.37 rows=99999 width=12)

If queries do not use an existing index, refactor your query. Or, the index’s list size may not be large enough. Read about that below.

Choosing a list size for your index

The pgvector docs recommend that your list size equal the number of rows / 1000. But what are the trade-offs? Below is a chart from real nearest neighbor queries run at different settings:

[Chart: nearest neighbor query time and index build time at different list sizes]

So, what does this mean? For the dataset we were experimenting with and the query we were running, the best list-size-to-performance trade-off was about lists = 500. With lists = 500, the index build took 28 seconds, and queries took 5ms. Compared to lists = 250, queries at lists = 500 ran 5x faster.

If 500 is good, then 2000 is better, right? Wrong! When we set lists = 2000, queries continued to range from 5ms to 6ms, but the build took 3 minutes. Additionally, because vector indexes are approximations built on lists, larger list sizes are more likely to return a different answer than an exact scan would. For this scenario, lists = 500 balanced out the performance needs of the database.

Once the list size is too small for your data, the query optimizer will no longer use the index. The dotted line between lists = 62 and lists = 125 marks where the database quit using the indexes.

Because the right list size depends on the number of rows, what is good today may not be good tomorrow. For a given dataset and query, expect a roughly linear relationship between list size and build time, and expect one list value to provide the best performance. As your data grows, that value will change.

If your application is querying in a way that uses indexes, increasing the list size may improve query performance, but the index will be much, much slower to build.
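As a sketch of that ongoing upkeep, assuming the index is named recipes_embedding_idx as in the EXPLAIN output above, you can compare the row count against the rows / 1000 guideline and rebuild when they drift apart:

-- The pgvector guideline for a starting point: lists = rows / 1000
SELECT ceil(count(*) / 1000.0) AS suggested_lists FROM recipes;

-- Changing the list size requires rebuilding the index,
-- which re-runs the K-means clustering on a fresh sample
DROP INDEX IF EXISTS recipes_embedding_idx;
CREATE INDEX recipes_embedding_idx ON recipes
USING ivfflat (embedding vector_l2_ops) WITH (lists = 500);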

Enough memory for your indexes + ability to change memory settings

Keeping indexes in memory is essential, no matter the data set. You’ll need the memory for two reasons:

Have enough RAM for the entire index size. Indexes on vector data sets can be sizeable, and they grow as the data grows. In our tests, index sizes were as large as the table sizes. Surprisingly, larger list sizes did not significantly impact the size of the index: lists = 500 and lists = 2000 generated similar index sizes (only a 1.5% difference).

Check indexes sizes by running the following command in psql:

\di+ index_name
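Or check from SQL with a standard Postgres function, assuming the index name from the EXPLAIN output above:

SELECT pg_size_pretty(pg_relation_size('recipes_embedding_idx'));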

Have enough RAM to build new indexes. Building indexes with larger lists requires higher settings for maintenance_work_mem; if you do not have enough memory, you’ll get an error. Building the lists = 2000 index above required 1.3GB of maintenance_work_mem:

recipes=# CREATE INDEX ON recipes USING ivfflat (embedding vector_l2_ops) WITH (lists = 2000);
ERROR:  memory required is 1390 MB, maintenance_work_mem is 1024 MB

Ability to change maintenance_work_mem settings. You can't even build indexes with a larger list size unless you can increase the maintenance_work_mem. So, make sure your Postgres provider gives you the ability to tune settings.
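If your provider does allow it, a minimal sketch is to raise the setting for the current session before building (2GB comfortably covers the 1390 MB required in the error above):

SET maintenance_work_mem = '2GB';

CREATE INDEX ON recipes USING ivfflat (embedding vector_l2_ops)
WITH (lists = 2000);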

Recap

pgvector provides a comprehensive, performant, and 100% open source database for vector data. As a developer, the keys to getting performance from pgvector are:

  1. Ensure your query is using the indexes
  2. Ensure your indexes have the optimal list size
  3. Ensure you have enough memory for the index

This isn’t a one-time exercise. As you continue generating additional data through OpenAI, keep reindexing, reviewing your list size settings, and checking memory. Do you need to do this with 10% more data? Probably not. But at 50% more data, rinse and repeat. This post discussed achieving performance with vector data; next, we will discuss scaling.

Want to try out some AI data inside Postgres? Try pgvector on Crunchy Bridge today!


Written by Christopher Winslett

May 5, 2023