Postgres Full-Text Search: A Search Engine in a Database
Early in on my SQL journey, I thought that searching for a piece of text in the database mostly involved querying like this:
SELECT col FROM table WHERE col LIKE '%some_value%';
Then I would throw in some wildcard operators or regular expressions if I wanted to get more specific.
Later on, I worked with a client who wanted search functionality in an app, so “LIKE” and regex weren't going to cut it. What I had known all along was just pattern matching. It works perfectly fine for certain purposes, but what happens when it's not just a matter of checking for a straightforward pattern in a single text field?
For example, what if you wanted to search across multiple fields? How about
returning possible matches even if the search term happens to be misspelled?
Also, what if you have very large amounts of data to search on? Sure, you can
create indexes for columns on which you want to query for pattern matches, but
that will have limitations (for instance, the B-tree index doesn't work for
col LIKE '%substring%').
So when we say PostgreSQL is the "batteries included database," this is just one reason why. With Postgres, you don't need to immediately look farther than your own database management system for a full-text search solution. If you haven't yet given Postgres' built-in full-text search a try, read on for a simple intro.
Postgres Full-Text Search Basics for the Uninitiated
Core Postgres includes the following full-text search capabilities. To name a few:
- Ignore stop words (common words such as "the" or "an").
- Stemming, where search matches can be based on a "root" form, or stem, of a word (“run” matches “runs” and “running” and even “ran”).
- Weight and rank search matches (so best matches can be sorted to the top of a result list).
Before we go further, let's also get ourselves familiarized with the following concepts:
- A document is a set of data on which you want to carry out your full-text search. In Postgres, this could be built from a single column, or a combination of columns, even from multiple tables.
- The document is parsed into tokens, which are small fragments (e.g. words, phrases, etc) from the document's text. Tokens are then converted to more meaningful units of text called lexemes.
- In Postgres, this conversion is done with dictionaries -- there are built-in ones, but custom dictionaries can be created if necessary. These dictionaries help determine stop words that should get ignored, and whether differently-derived words have the same stem. Most dictionaries are for a specific language (English, German, etc) but you could also have ones that are for a specific domain.
- The sorted list of lexemes from the document is stored in the tsvector data type.
Example: Searching Storm Event Details
I have a table that contains storm events data gathered by the U.S. National Weather Service. For simplicity's sake I won't include all possible fields in the statement below, but there's a copy of the data and some further information available in this repository.
CREATE TABLE se_details ( episode_id int, event_id int primary key, state text, event_type text, begin_date_time timestamp, episode_narrative text, event_narrative text, ... );
Let's also say that we want to carry out a full-text search on the data on the event_narrative column. We could add a new column to the table to store the preprocessed search document (i.e. the list of lexemes):
ALTER TABLE se_details ADD COLUMN ts tsvector GENERATED ALWAYS AS (to_tsvector('english', event_narrative)) STORED;
ts is a generated column (new as of Postgres 12), and it's automatically synced with the source data.
We can then create a GIN index on ts:
CREATE INDEX ts_idx ON se_details USING GIN (ts);
And then we can query like so:
SELECT state, begin_date_time, event_type, event_narrative FROM se_details WHERE ts @@ to_tsquery('english', 'tornado');
tsquery is the other full-text search data type in Postgres. It represents search terms that have also been processed as lexemes, so we'll pass in our input term to the to_tsquery function in order to optimize our query for full-text search. (@@ is a match operator.)
What we get with this query are records where "tornado" is somewhere in the text string, but in addition to that, here are a couple of records in the result set where there are also matches for "tornado" as lexeme ("tornado-like" and "tornadoes"):
state | KENTUCKY begin_date_time | 2018-04-03 18:08:00 event_type | Thunderstorm Wind event_narrative | A 1.5 mile wide swath of winds gusting to around 95 mph created **tornado-like** damage along Kentucky Highway 259 in Edmons on County. The winds, extending 3/4 of a mile north and south of Bee Spring, destroyed or heavily damaged several small outbuildings, tore part of the roof off of one home, uprooted and snapped the trunks of numerous trees, and snapped around a dozen power poles. Several othe r homes sustained roof damage, and wind-driven hail shredded vinyl siding on a number of buildings.
state | WISCONSIN begin_date_time | 2018-08-28 15:30:00 event_type | Thunderstorm Wind event_narrative | A swath of widespread tree and crop damage across the southern portion of the county. Sections of trees and crops compl etely flattened, and some structural damage from fallen trees or due to the strong downburst winds. Various roads closed due to fallen tre es. Two semi-trucks were overturned on highway 57 in Waldo. The widespread wind damage and tornadoes caused structural damage to many home s with 70 homes sustaining affected damage, 3 homes with minor damage, 2 homes with major damage, one home destroyed, and 2 businesses wit h minor damage.
Searching for Phrases
One way to handle phrases as search terms is to use the & (AND) or <-> (FOLLOWED BY) Boolean operators with the tsquery.
For example, if we want to search for the phrase "rain of debris":
SELECT state, begin_date_time, event_type, event_narrative FROM se_details WHERE ts @@ to_tsquery('english', **'rain & of & debris'**);
The search phrase gets normalized to 'rain' & 'debri'. The order doesn't matter as long as both 'rain' and 'debri' have matches in the document, such as this example:
A debris flow caused by heavy rain on a saturated hillside blocked the Omak River Road one mile south of the intersection with State Route 97.
If we do
to_tsquery('english', 'rain <-> of <-> debris') the tsquery value is
'rain' <2> 'debri', meaning it will only match where 'rain' is followed by
'debri' precisely two positions away, such as here:
Heavy rain caused debris flows on the Coal Hollow Fire and Tank Hollow Fire burn scars.
(This was actually the only match, so using the <-> operator is a little bit more restrictive.)
function can also parse the phrase itself, and inserts
phraseto_tsquery('english', 'rain of debris')
The tsquery value is 'rain' <2> 'debri' like above, so phraseto_tsquery also accounts for positioning.
Functions for Weighting and Ranking Search Results
One very common use case for assigning different weights and ranking is searching on articles. For example, you may want to merge the article title and abstract or content together for search, but want matches on title to be considered more relevant and thus rank higher.
Going back to our storm events example, our data table also has an episode*narrative column in addition to eventnarrative. For storm data, an _event is an individual type of storm event (e.g. flood, hail), while an episode is an entire storm system and could contain many different types of events.
Let's say we want to be able to carry out a full-text search on event as well as episode narratives, but have decided that the event narrative should weigh more than the episode narratives. We could define the ts column like this instead:
ALTER TABLE se_details ADD COLUMN ts tsvector GENERATED ALWAYS AS **(setweight(to_tsvector('english', coalesce(event_narrative, '')), 'A') ||** **setweight(to_tsvector('english', coalesce(episode_narrative, '')), 'B'))** STORED;
setweight is a full-text function that assigns a weight to the components of a document. The function takes the characters 'A', 'B', 'C', or 'D' (most weight to least, in that order). We're also using a coalesce here so that the concatenation doesn't result in nulls if either episode_narrative or event_narrative contain null values.
You could then use the ts_rank function in an ORDER BY clause to return results from most relevant to less.
SELECT … **ORDER BY ts_rank(ts, to_tsquery('english', 'tornado')) DESC**;
So, this record is ranked higher in the search results:
state | MISSISSIPPI begin_date_time | 2018-04-06 22:18:00 event_type | Tornado event_narrative | This tornado touched down near the Jefferson Davis-Covington County line along Lucas Hollow Road. It continued southeast, crossing the county line. Some large limbs and trees were snapped and uprooted at this location. It then crossed Lucas Hollow Road again before crossing Leonard Road. A tornado debris signature was indicated on radar in these locations. The tornado uprooted and snapped many trees in this region. It also overturned a sm all tractor trailer on Oakvale Road and caused some minor shingle damage to a home. After crossing Oakvale Road twice, the tornado lifted before crossing Highway 35. The maximum winds in this tornado was 105mph and total path length was 2.91 miles. The maximum path width was 440 yards. **episode_narrative |** A warm front was stretched across the region on April 6th. As a disturbance rode along this stalled front, it brought copious amounts of rain to the region thanks to ample moisture in place. As daytime heating occurred, some storms developed which brought severe weather to the region.
Compared to this, where there is a match for "tornado" in episode_narrative but not event_narrative:
state | NEBRASKA begin_date_time | 2018-06-06 18:10:00 event_type | Hail event_narrative | Hail predominately penny size with some quarter size hail mixed in. episode_narrative |** Severe storms developed in the Nebraska Panhandle during the early evening hours of Jun e 6th. As this activity tracked east, a broken line of strong to severe thunderstorms developed. Hail up to the size of ping pong balls, thunderstorm wind gusts to 70 MPH and a brief tornado touchdown were reported . Heavy rain also fell leading to flash flooding in western Keith county.
Tip: ts_rank returns a floating-point value, so you could include the expression in your SELECT to see how these matches score. In my case I get around a 0.890 for the Mississippi event, and 0.243 for the Nebraska event.
Yes, You Can Keep Full-Text Search in Postgres
You can get even deeper and make your Postgres full-text search even more robust, by implementing features such as highlighting results, or writing your own custom dictionaries or functions. You could also look into enabling extensions such as unaccent (remove diacritic signs from lexemes) or pg_trgm (for fuzzy search). Speaking of extensions, those were just two of the extensions supported in Crunchy Bridge. We've built our managed cloud Postgres service such that you can dive right in and take advantage of all these Postgres features.
With all that said: as you can see, you don't need a very involved setup to get started. It's a good idea to try out whether you are just beginning to explore a full-text search solution, or even just reevaluating whether you need to go all out for a dedicated full-text search service, especially if you already have Postgres in your stack.
To be fair, Postgres doesn't have some search features that are available with platforms such as Elasticsearch. But a major advantage is that you won't have to maintain and sync a separate data store. If you don't quite need search at super scale, there might be more for you to gain by minimizing dependencies. Plus, the Postgres query syntax that you already know with the addition of some new functions and operators, can get you pretty far. Got any other questions or thoughts about full-text search with Postgres? We're happy to hear them on @crunchydata.
July 27, 2021 •More by this author