What's Postgres Got To Do With AI?

Note: We have additional articles in this Postgres AI series.

In the past month at Crunchy Data, we have talked to a steady stream of customers & community folks wanting to know how to augment their data platforms for AI. Fortunately, Postgres is equipped, nearly out of the box, and ready for the task of storing and querying this data. Through the magic of OpenAI’s API we can easily send data for classification and return the values.

Alongside this post, I created a sample code-base and data packet here. If you want to jump straight to the code and data, head straight there. We will reference back to the code and the data a few times throughout this post. This favors simplicity for understanding over comprehensiveness. We’ll follow up this post with comprehensive information later.

Using OpenAI in Postgres

For an AI-powered simple recommendation engine, the typical solution will pre-process the data with the following steps:

  1. Send data to an OpenAI model
  2. Receive a tokened set of values called embeddings
  3. Store embeddings in Postgres on the records
  4. Then, we run SQL queries that measure distance between embeddings.

Here’s a diagram of the data going out, going to OpenAI, and coming back as embeddings.


A diagram of how embeddings can be measured in terms of closeness.


AI with Postgres by example

In this example, we'll use Postgres + OpenAI as a backbone to solve a common problem: Choosing new food recipes that a picky child would enjoy? If we don't, then we'll have pizza all week and turn in crime fighting ninja turtles.

My child has a diverse, but picky palate. He enjoys lamb curry. He does not enjoy things like tomato soup yet, he loves tomato in paste form on a pizza. In this tutorial, we'll use a combination of AI data plus traditional query filters to try to nail down a new recipe that we'll try this week.

Let's see if we can over-engineer a way to make my picky eater happy!

Queries are powered by pgvector

Powering our queries is a Postgres extension called pgvector . It gives Postgres the power to natively store, index, and query the OpenAI embeddings. An OpenAI embedding is a set of tokenized float values that allow us to measure the relatedness of text strings. These embeddings look like this:


pgvector can be downloaded from the Github repo and built locally. We have it ready to go on Crunchy Bridge and you could use our $10 / month plan for a quick test.

Loading Sample Data

For my example, I used the ArmedForcesRecipes.xml. We’ll process the XML and load it into a Postgres database using a parser file. The parser file looks like this:

connection_string = ENV['DATABASE_URL']
DB = Sequel.connect(connection_string)

DB.create_table? :recipes do
  primary_key :id
  String :name
  String :description
  Vector :embedding

recipes_xml = Nokogiri::XML(File.read('ArmedForcesRecipes.xml'))

for recipe_xml in recipes_xml.xpath('/*/recipe')
  recipe_id = DB[:recipes].insert(
    name: recipe_xml["description"],
    description: recipe_xml.xpath(".//XML_MEMO1")[0]&.text

In the parser.rb file, we build the recipes table in our Postgres database. This table contains a name, description, and embedding columns. Notice that embedding is a vector type.

Table "public.recipes"
   Column    |  Type   | Collation | Nullable |             Default
 id          | integer |           | not null | generated by default as identity
 name        | text    |           |          |
 description | text    |           |          |
 embedding   | vector  |           |          |
    "recipes_pkey" PRIMARY KEY, btree (id)

The description contains the long-form process, ingredients, and notes that are typically found in a recipe book. This is the data that we will send to OpenAI for classification. The contents of which are a bit long to list in this blog post, but you can see one by running SELECT description FROM recipes WHERE name = 'Pizza';

Classifying data with OpenAI

For our OpenAI classification, we use the recipes.description column. This column is a fairly standardized set of text that contains the process of cooking the recipe as well as the ingredients.

Using the classifier.rb file, I can send the description to OpenAI API and receive back a data object in the form of a tokenized set of values (called embedding). A sample embedding looks like this:

The classifier.rb file looks like this:

openai = OpenAI::Client.new(access_token: ENV['OPENAI_API_KEY'])

while recipe = DB[:recipes].where(embedding: nil).exclude(description: nil).first do
  submitted_value = recipe[:description].gsub(/\n/, ' ')

  response = openai.embeddings(
    parameters: {
      model: 'text-embedding-ada-002',
      input: submitted_value

    embedding_value = response["data"][0]["embedding"].to_s
    DB[:recipes].where(id: recipe[:id]).update(embedding: embedding_value)
    puts [$!, response].inspect

  sleep 1.2

We run a loop that plucks out a single recipe without an embedding, then format the description to remove new line characters. Then, send the value to the OpenAI API using the openai Ruby gem.

From the response, we pluck out the embedding_value and store it with minimal modification directly into the embedding column.

Running this file will take a little while, as OpenAI has a 60 requests / minute limit. Thus, we wait 1.2 seconds between requests. After running this file, the embedding column on the recipe will have data. If you run the following query, you'll find a set of float values: SELECT embedding FROM recipes WHERE embedding IS NOT NULL LIMIT 1;. The response will look like this:

-[ RECORD 1 ]--------------------------------------------------------------------------------------------
embedding | [0.019761972,-0.0066673,0.0073532923,-0.0010382808,-0.007948044,-0.010725804,-0.0038692644...

Querying for similarities

The crux of finding similar recipes is using the <=> operator, which is supplied by the pgvector extension. This operator will find the distance between the embedding values returned by OpenAI. So, will be doing something like the following, which will yield a distance as measured by a float.

recipe_1.embedding <=> recipe_2.embedding

To find similar recipes, we can find the recipes with the shortest distance between their embedding values.

Let’s find the closest recipe to pizza. There is a recipe in this dataset with just the term Pizza with id = 431. This is the Pizza recipe that my son loves, so we will use it as our reference recipe.

Let's find the closest recipes to Pizza:

FROM (SELECT * FROM recipes WHERE name = 'Pizza') recipe_1,
        recipes AS recipe_2
ORDER BY recipe_1.embedding <=> recipe_2.embedding


 id  | name  | id  |       name
 431 | Pizza | 431 | Pizza
 431 | Pizza | 433 | Pizza, 12 in, fzn
 431 | Pizza | 126 | Chicken, parmesan
 431 | Pizza | 435 | Pizza, treats
(4 rows)

With the query above, the closest recipe to pizza is more pizza. Let's use SQL's NOT LIKE conditional statement to filter out other pizza:

FROM (SELECT * FROM recipes WHERE name = 'Pizza') recipe_1,
	recipes AS recipe_2
WHERE LOWER(recipe_2.name) NOT LIKE '%pizza%'
ORDER BY recipe_1.embedding <=> recipe_2.embedding


 id  | name  | id  |                name
 431 | Pizza | 126 | Chicken, parmesan
 431 | Pizza |  31 | Beef, ground, hamburger, w/parmesan
 431 | Pizza | 211 | Dish, eggplant, parmesan
 431 | Pizza | 229 | Dish, lasagna
(4 rows)

In my scenario, I returned "Chicken, parmesan" as the best option! Excellent! I say "in my scenario" because we are using a few layers of non-deterministic outcomes -- data sent to OpenAI models may return different values, using the <=> operator may return different distances.

Yes, I think my son would love chicken parmesan. When using these recipes, remember to reduce portion sizes, else you'll make food for all of your neighbors and neighbor's neighbors.

Side quest: querying for opposites

What is the opposite of a corn dog? We can flip the order from ASC to DESC and find the maximum distance.

FROM (SELECT * FROM recipes WHERE name = 'Corn Dog' LIMIT 1) recipe_1,
        recipes AS recipe_2
  recipe_2.description IS NOT NULL
ORDER BY recipe_1.embedding <=> recipe_2.embedding DESC


 id  |   name   | id  |           name
 165 | Corn Dog | 538 | Salad, green, tossed
 165 | Corn Dog | 519 | Salad, chicken, tropical
(2 rows)

According to the embedding distance, the opposite of Corn Dog is “Salad, green, tossed”. Yeah, that makes sense.

AI in Postgres

The tools we have available today make AI approachable for nearly all scenarios. Over the past 20 years, I can think of many different scenarios where I wish I had access to OpenAI capabilities. pgvector has the first commits starting in 2021. So many things are coming together in a nice-to-use AI toolbox for the typical application developer. It’s awesome to see how Postgres will power developers to integrate AI capabilities in their products.

Hungry for more? Try pgvector on Crunchy Bridge today!

Avatar for Christopher Winslett

Written by

Christopher Winslett

February 22, 2023 More by this author