As we've been helping people get started with AI in Postgres with
there have been few questions around performance. At a basic level, pgvector
performance relies on 3 things:
- Are your queries using indexes?
- Are you setting your
listsize appropriately for your data set?
- Do you have enough memory for your indexes + ability to change settings?
For an intro to using pgvector, see What’s Postgres Got To Do With AI. In it, we discuss the vector datatype, querying, and indexing options. During this blog post, we will refer to a “recipes”. In the prior blog post, we built an AI powered recipe recommendation engine.
Probably you do. It is important to note that vector indexes allow “approximate nearest neighbor” (ANN) searching. So if you have a hard requirement that a query return absolutely 100% of all nearby vectors, you are going to be stuck with full scans, which will be slow on large data sets.
However, most vector use cases are all about finding things that are “kind of similar to” other things, and a loose index supports that use case fine. Most users use ANN indexes and are happy with the results.
As with other data types, indexes on the ‘vector’ type provide the system a shortcut to finding records the query is looking forward, by accessing a “more organized” form of the data.
For numbers and strings, that organization takes the form of a balanced tree. For vectors, the organization is a partitioning of the data into a set of “lists”, each of which covers a distinct partition of the multi-dimensional space vectors are embedded in.
When initially building an index, the system takes a sample of vectors and runs a K-means clustering of the sample, to generate the space partitions for each list.
A bigger K means more lists which means fewer records per list which means a more sensitive index. But it also means more calculation time to generate those K means!
ivfflat indexes with three distance calculation algorithms:
vector_cosine_ops. These behave similarly.
For this post, we used the following index:
CREATE INDEX ON recipes USING ivfflat (embedding vector_l2_ops)
WITH (lists = 500);
Once your index is built, you still need to make sure your query uses the index. Vector indexes are “approximate nearest neighbor” indexes, so the first thing to note is your SQL will have to be structured to use the “nearest neighbor operator” in the “ORDER BY” clause in just the right way.
vector data, be prepared to re-engineer your queries a bit so that
they hit indexes. For instance, the following two queries return the same
results, however one does not use the index, but the other does.
A query that takes 500ms, and does not use indexes:
FROM recipes r0, recipes r1
WHERE r0.id = 142508
AND r1.id != r0.id
ORDER BY r0.embedding <-> r1.embedding
A query that takes 5ms, and does use indexes:
FROM recipes r1
WHERE id != 142508
ORDER BY r1.embedding <-> (SELECT embedding FROM recipes WHERE id = 142508)
Both of these queries return exactly one result, and the same result. Without indexes, both have nearly identical performance, yet they have a different EXPLAIN query path.
Be ready to adjust the queries you are writing so that the optimizer executes
them differently. To understand query execution with vector data, get familiar
EXPLAIN, this is the type of line you are looking for that
signifies you are using the index for your query:
-> Index Scan using recipes_embedding_idx on recipes r1 (cost=204.38..11300.37 rows=99999 width=12)
If queries do not use an existing index, refactor your query. Or, the index’s list size may not be large enough. Read about that below.
pgvector docs, it recommends that your list size equal
# of rows / 1000. But, what are the trade-offs? Below is a chart from real
nearest neighbor queries on different settings:
So, what does this mean? For the dataset that we were experimenting with and the query we were running, the best list size to performance trade off was about lists = 500. When lists = 500, the index build took 28 seconds, and the queries took 5 ms. When using lists = 500 versus lists = 250, the query ran 5x faster.
If 500 is good, then 2000 is better, right? Wrong! When we set lists = 2000, queries continued to range from 5ms to 6ms, but the build time took 3 minutes. Additionally, because vector indexes are approximations that use lists, larger list sizes are more likely to give the a different answer. For this scenario, by using lists = 500, we balance out the performance needs of the database.
Once the list size is too small for your data, the query optimizer will no longer use indexes. The dotted line between lists = 62 and lists = 125 is the when the database quits using the indexes.
Because the list size is dependent on the number of rows, what is good for today may not be good for tomorrow. At a specific dataset and query, expect a linear relationship between the list size and the build time, and expect a list value to provide the best performance for your dataset and query. But that will change tomorrow.
If your application is querying in a way that uses indexes, increasing list size may improve query performance, but will be much, much slower to build.
Keeping indexes in memory is essential, no matter the data set. You’ll need the memory for two reasons:
Have enough RAM for entire index size. Indexes of vector data sets can be sizeable, and increase in size as the data size increases. In our tests, the indexes sizes were as large as the table sizes. Surprisingly, larger list sizes did not significant impact on the data size of the index — lists = 500 and lists = 2000 generated similar index sizes (only a 1.5% difference).
Check indexes sizes by running the following command in
Have enough RAM to build new indexes. Building indexes with larger lists
requires higher settings for
maintenance_work_mem — if you do not have the
enough memory you’ll get an error. When building the lists = 2000 index above,
maintenance_work_mem required 1.3GB of RAM.
recipes=# CREATE INDEX ON recipes USING ivfflat (embedding vector_l2_ops) WITH (lists = 2000);
ERROR: memory required is 1390 MB, maintenance_work_mem is 1024 MB
Ability to change
maintenance_work_mem settings. You can't even build
indexes with a larger list size unless you can increase the
maintenance_work_mem. So, make sure your Postgres provider gives you the
ability to tune settings.
pgvector provides a comprehensive, performant, and 100% open source database for vector data. As a developer, the key to getting performance from pgvector are:
- Ensure your query is using the indexes
- Ensure your indexes have the optimal list size
- Ensure you have enough memory for the index
This isn’t a static idea. As you continue to build additional data through OpenAI, continue reindexing, reviewing your list size settings, and memory. Do you need to do with this 10% more data? Probably not. But at 50% more data, rinse and repeat. This post discussed achieving performance with vector data, next we will discuss scaling.
Want to try out some AI data inside Postgres? Try
Crunchy Bridge today!
May 5, 2023 •More by this author