For many developers, databases are basically magic. Like Penn & Teller, this blog post is here to break the illusion. Databases are like any other code: they have algorithms and processes. These algorithms and processes are meant to improve performance, but they can cause limitations if you do not expect them.
Disclaimer: it is okay to break the rules. Sometimes, you may choose to have slow database interactions, because you know they are rare events.
Assuming a well-designed database infrastructure (which is what we launch at Crunchy Data), database performance is reactive to the queries, reads, and writes sent to it. Databases do not blow themselves up on purpose; in fact, they are continuously trying to self-right after each interaction so they can return to equilibrium.
In this post, we'll describe things conceptually (want more? check out Postgres Tips), then give examples using a physical library as the metaphor. Libraries used these algorithms in physical form long before databases used them in logical form, so the examples are applicable for understanding performance.
Although we are showing tables below, the same concepts apply to most NoSQL databases.
Write your code and queries with knowledge of the following topics, and you'll dodge quite a few causes of performance issues.
Basically, think of indexes as a sorted list of keys with values. Indexes speed up reads at the expense of write speed. An index can use a simple single key or a multi-dimensional (composite) key. Composite keys are often useful when a query filters on two values together, or always sorts on a second key after an initial filter.
To execute a write, a database must update every index associated with the row. The more indexes that have to be updated to perform a write, the slower the write. Additionally, not all index entries have the same performance. The fastest index writes are those that append "to the end" of the index. The slowest writes update values in the middle of an index, particularly when they land in the middle of multiple composite indexes.
Library Example: Using a card-catalog to find a book is faster than searching all shelves. Why? the card-catalog is alphabetically sorted, and reduces physical movement. Yet, when the librarian puts a new book into a shelf at the library, it is slower because they also have to update the card catalog. The more card catalogs that a librarian has to update, the slower the process (subject cards, author cards, topic cards, title cards). The fastest card catalogs to update would be a "latest books" catalog, because the librarian could just throw the new book's card on the end of the list. The slowest would be any alphabetic card because the library has to find the specific spot in the list-o-cards to inject the specific card. Assuming a library has 5 card catalogs for different characteristics, then to add a book to a shelf, the librarian must perform 6 actions (add the book + update 5 catalogs).
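A minimal sketch of these ideas in SQL, using a hypothetical books table (all names here are illustrative, not from a real schema):

```sql
-- Hypothetical table of library books.
CREATE TABLE books (
    id           bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    isbn         text NOT NULL,
    author       text NOT NULL,
    topic        text NOT NULL,
    publish_date date NOT NULL
);

-- A single-key index: one extra "card catalog" to maintain per write.
CREATE INDEX books_author_idx ON books (author);

-- A composite (multi-dimensional) index, useful when queries always
-- filter by topic and then sort by publish_date.
CREATE INDEX books_topic_date_idx ON books (topic, publish_date);
```

Every INSERT into this table now updates the primary key plus both indexes, which is the librarian updating several card catalogs for a single new book.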
Want a deeper dive into indexes? Check out Postgres Indexes for Newbies!
The keyword to know for indexes is "cardinality". Cardinality is the inverse of "the number of records returned per value." High cardinality returns one record per index value, and low cardinality returns many records per index value. Efficient indexes are high-cardinality indexes. Primary keys are the best example of a high-cardinality index. String-based tags are an example of a low-cardinality index because there are typically many objects assigned to a single tag.
Library Example: You are looking for an author with the last name of "Smith" (a highly common name in the United States). Go to the card catalog for authors, look for "Smith," and you will find numerous cards. This is a low-cardinality lookup. There is a high probability that you will have to comb through many cards to find the book you are looking for. Then again, it may be near the front. Low-cardinality indexes create unpredictability in lookup performance. Now let's say you have the ISBN for a book: you go to the catalog and find it instantly. That is a high-cardinality lookup, and it is quick.
Build the indexes to match your query needs, and note that high-cardinality is faster. Do not over-index a table.
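To gauge cardinality in practice, you can count rows per value. A sketch, assuming a hypothetical books table with author and isbn columns:

```sql
-- Roughly how many rows share each value? Fewer rows per value
-- means higher cardinality.
SELECT author, count(*) AS books_per_author
FROM books
GROUP BY author
ORDER BY books_per_author DESC
LIMIT 10;

-- Or compare distinct values to total rows: a ratio near 1.0 means
-- the column is a good high-cardinality index candidate.
SELECT count(DISTINCT isbn)::numeric / count(*) AS selectivity
FROM books;
```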
When reading from a database, if there is no matching index or the index does not offer high cardinality, the database falls back to table scans. It quits using highly optimized indexes and starts reading directly from the table data to determine whether the query conditions are matched. As you can imagine, this is much slower.
During a table scan, the database starts at the beginning of a set of rows, loops through each row checking conditions, and ends at the bottom. The only way to cut a table scan short is a LIMIT clause (such as LIMIT 1), which lets the scan stop once enough matches are found. While scanning, the database pulls records into RAM to compare values. When a database uses table scans, a few things happen that affect performance: 1) table scans use RAM that could otherwise hold indexes, 2) table scans perform random disk reads, which are much, much slower than RAM, 3) CPU load increases because the database has to perform the comparisons.
In a world of SSDs, this isn't as punitive as it was in a spindle world. Spindle disks were around 200 I/O operations per second (IOPs), and SSDs start at 8,300 IOPs and can exceed 1M IOPs.
Library Example: From the card catalog, you find 10 possible books that match your needs. Now, you go to the shelves and pull each book to see if they fit your criteria. You loop through all of your books until you find the correct one. Now, imagine you find 1,000 possible books, and it becomes linearly slower. In a different scenario, If you only need 1 book, then you can throw a LIMIT 1 on your query. Then, take the first book that matches your criteria (even if 1,000 were potentially matching from the card catalog).
The difference between 2010-era spindle disks and modern SSDs is the equivalent of looking up books on Google versus looking up books in a card catalog. Much faster, but still limitations.
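You can watch the planner make this choice with EXPLAIN. A sketch, assuming a hypothetical books table with an index on author but none on isbn:

```sql
-- An equality match on an indexed column can use an Index Scan:
EXPLAIN SELECT * FROM books WHERE author = 'Smith';

-- With no index on isbn, Postgres reports a Seq Scan, reading
-- every row and checking the condition against each one:
EXPLAIN SELECT * FROM books WHERE isbn = '9780000000000';

-- LIMIT lets even a sequential scan stop early once it finds a match:
EXPLAIN SELECT * FROM books WHERE isbn = '9780000000000' LIMIT 1;
```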
There are a few different gotchas that come to mind, but here are two to consider:
Consider your dataset when deciding between "filter with an index, then sort the matches" and "scan in sorted order, then filter."
Library Example: At the library, if you have a topic with a limited number of books, it is better to go to a card catalog to search for the topic, then go to the books to find the newest. However, if you are searching for a relatively common topic with many books, it may be faster to go to the latest-books catalog and search for the topic there. If you run this query enough, it is probably better to have a new card catalog that sorts by "topic" then by "date published" (i.e. create a new index).
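In SQL terms, the index that resolves this tension might look like the following. A sketch, assuming a hypothetical books table with topic and publish_date columns:

```sql
-- Plan A: filter by topic via an index, then sort the matches.
-- Plan B: walk rows newest-first, then filter by topic.
-- If this query is frequent, an index matching both the filter and
-- the sort order lets Postgres skip the separate sort entirely:
CREATE INDEX books_topic_newest_idx ON books (topic, publish_date DESC);

SELECT *
FROM books
WHERE topic = 'databases'
ORDER BY publish_date DESC
LIMIT 10;
```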
Avoid modifying the indexed column's side of the conditional.
publish_date < now() - interval '3 days'            -- is always faster than
publish_date + interval '3 days' < now()
Library Example: At the library, if you run the second query above, it is the equivalent of going to each book's card, finding the publish_date, then adding 3 days to this publish_date, and comparing this calculated date to now. Slow right? If you run the first query, it's the equivalent of doing a calculation, writing it down, then looping through each card to compare the calculated date to the publish_date.
This is a date example, but goes for all value types — strings, INTs, dates, bools. Explain is a powerful tool for understanding how Postgres is processing your queries, check out Get Started with Explain & Analyze.
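A sketch of the two forms, assuming a hypothetical books table with an index on publish_date; only the first form can use the index:

```sql
-- The indexed column stands alone, so the planner can compute
-- now() - interval '3 days' once and seek into the index.
SELECT * FROM books
WHERE publish_date < now() - interval '3 days';

-- publish_date is wrapped in arithmetic, so Postgres must evaluate
-- the expression for every row, forcing a table scan.
SELECT * FROM books
WHERE publish_date + interval '3 days' < now();
```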
All databases employ some level of copy-on-write. Postgres writes a new version of a row each time it inserts or updates a value. Other databases have a block-size allocated for a value, then, if an updated value exceeds the size, it writes to a new location. Other databases reject updates, claim immutability, and push the tracking of updates off to the application developer.
These new rows leave gaps in the data, which uses storage inefficiently and makes table scans more inefficient. Databases perform "vacuums" or "compactions" to reorganize around the gaps and reclaim disk space.
Deleting rows is similar, except no new data is inserted.
Library Example: you receive a new edition of a book, and you must remove the old edition of the book. The operations included are: 1) insert the new cards for the card catalog, 2) delete the old cards in the card catalog, 3) remove the old edition (while leaving the shelf empty where it was with a pointer to the new book), 4) add the new book to the shelf. After a year, you need to go back and reclaim all the empty-shelf space. This requires moving books and updating indexes to point to the new locations.
Look at how your database-of-choice performs these types of operations. For Postgres, if you need these types of quick writes that create a need to vacuum, look at how to do HOT updates.
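A sketch of inspecting this churn in Postgres, assuming a hypothetical books table (pg_stat_user_tables, fillfactor, and VACUUM are real Postgres features):

```sql
-- Dead tuples are the "empty shelf space" left behind by updates
-- and deletes.
SELECT n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'books';

-- Leaving free space in each page gives updated rows room to stay
-- on the same page, making HOT (heap-only tuple) updates possible
-- when no indexed column changes.
ALTER TABLE books SET (fillfactor = 90);

-- Reclaim the gaps manually (autovacuum usually handles this).
VACUUM books;
```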
Hardware can overcome poor database choices, up to a point. Given the meteoric performance of SSDs over the past 10 years, sins of the past can look like a mere blip in performance. But at scale, these things matter.
Library Example: a small 100-book library is fast and efficient, can probably get by without an index, and generally relies on the caretaker's knowledge to find a book; even scanning every shelf is quick. A 10M-book library had better have indexes, or reads will be slow. If a library is receiving 10k new books per day, well-designed indexes and card catalogs are important. At some point, it may be best to start splitting a library into different buildings based on topics (i.e. sharding, but that is another blog post).
Additionally, one person in a library can move around freely, but imagine you had 10,000 people at one time. Obviously, there are scheduling issues around who uses the card catalog first, and which books are pulled from the shelf for scanning.
Performance issues are rarely noticed with a single query, but are quickly noticed once load increases.
As you think about your data and build your queries, keep these tips in mind.
Even as someone with experience optimizing queries, I do not always know what is best for a database. No change happens in isolation. Experiment with query performance improvements to see whether a change improves the entire system. If it doesn't, roll back. Experimentation is the fastest way to understanding.
May 6, 2022