Early on in my SQL journey, I thought that searching for a piece of text in the database mostly involved querying like this:
```sql
SELECT col FROM table WHERE col LIKE '%some_value%';
```
Then I would throw in some wildcard operators or regular expressions if I wanted to get more specific.
Later on, I worked with a client who wanted search functionality in an app, so LIKE and regex weren't going to cut it. What I had known all along was just pattern matching. It works perfectly fine for certain purposes, but what happens when it's not just a matter of checking for a straightforward pattern in a single text field?
For example, what if you wanted to search across multiple fields? How about
returning possible matches even if the search term happens to be misspelled?
Also, what if you have very large amounts of data to search? Sure, you can create indexes for columns on which you want to query for pattern matches, but that will have limitations (for instance, a B-tree index doesn't work for col LIKE '%substring%').
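To see that limitation in action, you could compare query plans. This is a sketch using a hypothetical items table and column, purely for illustration:

```sql
-- Hypothetical table and B-tree index for illustration
CREATE TABLE items (id int PRIMARY KEY, description text);
CREATE INDEX items_desc_idx ON items (description text_pattern_ops);

-- A left-anchored pattern can use the B-tree index
-- (given enough rows for the planner to prefer it):
EXPLAIN SELECT * FROM items WHERE description LIKE 'storm%';

-- A pattern with a leading wildcard cannot, so this
-- falls back to a sequential scan:
EXPLAIN SELECT * FROM items WHERE description LIKE '%storm%';
```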
So when we say PostgreSQL is the "batteries included database," this is just one reason why. With Postgres, you don't need to immediately look farther than your own database management system for a full-text search solution. If you haven't yet given Postgres' built-in full-text search a try, read on for a simple intro.
Core Postgres includes the following full-text search capabilities, to name a few:
- Ignoring stop words (common words such as "the" or "an").
- Stemming, where search matches can be based on a "root" form, or stem, of a word ("run" matches "runs" and "running").
- Weighting and ranking search matches (so the best matches can be sorted to the top of a result list).
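A quick way to see stop-word removal and stemming at work is to call to_tsvector directly, as in this classic example from the Postgres documentation:

```sql
-- "The" is a stop word and is dropped; "Fat" and "Rats"
-- are lowercased and stemmed, keeping their positions
SELECT to_tsvector('english', 'The Fat Rats');
--    to_tsvector
-- -----------------
--  'fat':2 'rat':3
```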
Before we go further, let's also get ourselves familiarized with the following concepts:
- A document is a set of data on which you want to carry out your full-text search. In Postgres, this could be built from a single column, or a combination of columns, even from multiple tables.
- The document is parsed into tokens, which are small fragments (e.g. words, phrases, etc) from the document's text. Tokens are then converted to more meaningful units of text called lexemes.
- In Postgres, this conversion is done with dictionaries -- there are built-in ones, but custom dictionaries can be created if necessary. These dictionaries help determine stop words that should get ignored, and whether differently-derived words have the same stem. Most dictionaries are for a specific language (English, German, etc) but you could also have ones that are for a specific domain.
- The sorted list of lexemes from the document is stored in the tsvector data type.
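You can also ask a dictionary directly how it normalizes a given token with the ts_lexize function:

```sql
-- The Snowball stemmer reduces "stars" to the lexeme "star"...
SELECT ts_lexize('english_stem', 'stars');
--  {star}

-- ...and returns an empty array for a stop word such as "a"
SELECT ts_lexize('english_stem', 'a');
--  {}
```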
I have a table that contains storm events data gathered by the U.S. National Weather Service. For simplicity's sake I won't include all possible fields in the statement below, but there's a copy of the data and some further information available in this repository.
```sql
CREATE TABLE se_details (
    episode_id int,
    event_id int primary key,
    state text,
    event_type text,
    begin_date_time timestamp,
    episode_narrative text,
    event_narrative text,
    ...
);
```
Let's also say that we want to carry out a full-text search on the data in the event_narrative column. We could add a new column to the table to store the preprocessed search document (i.e. the list of lexemes):
```sql
ALTER TABLE se_details
ADD COLUMN ts tsvector
GENERATED ALWAYS AS (to_tsvector('english', event_narrative)) STORED;
```
ts is a generated column (new as of Postgres 12), and it's automatically synced with the source data.
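As a quick sanity check, inserting a row is enough to populate ts; no trigger is required. The values below are made up for illustration:

```sql
-- event_id 999999 is a hypothetical value
INSERT INTO se_details (event_id, event_narrative)
VALUES (999999, 'Numerous tornadoes touched down across the county.');

-- The generated column is computed on write; the stored
-- tsvector should contain the stemmed lexeme 'tornado'
SELECT ts FROM se_details WHERE event_id = 999999;
```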
We can then create a GIN index on ts:
```sql
CREATE INDEX ts_idx ON se_details USING GIN (ts);
```
And then we can query like so:
```sql
SELECT state, begin_date_time, event_type, event_narrative
FROM se_details
WHERE ts @@ to_tsquery('english', 'tornado');
```
tsquery is the other full-text search data type in Postgres. It represents search terms that have also been processed as lexemes, so we pass our input term to the to_tsquery function in order to optimize our query for full-text search. (@@ is a match operator.)
What we get with this query are records where "tornado" is somewhere in the text string, but in addition to that, here are a couple of records in the result set where there are also matches for "tornado" as a lexeme ("tornado-like" and "tornadoes"):
state           | KENTUCKY
begin_date_time | 2018-04-03 18:08:00
event_type      | Thunderstorm Wind
event_narrative | A 1.5 mile wide swath of winds gusting to around 95 mph created **tornado-like** damage along Kentucky Highway 259 in Edmonson County. The winds, extending 3/4 of a mile north and south of Bee Spring, destroyed or heavily damaged several small outbuildings, tore part of the roof off of one home, uprooted and snapped the trunks of numerous trees, and snapped around a dozen power poles. Several other homes sustained roof damage, and wind-driven hail shredded vinyl siding on a number of buildings.
state           | WISCONSIN
begin_date_time | 2018-08-28 15:30:00
event_type      | Thunderstorm Wind
event_narrative | A swath of widespread tree and crop damage across the southern portion of the county. Sections of trees and crops completely flattened, and some structural damage from fallen trees or due to the strong downburst winds. Various roads closed due to fallen trees. Two semi-trucks were overturned on highway 57 in Waldo. The widespread wind damage and **tornadoes** caused structural damage to many homes with 70 homes sustaining affected damage, 3 homes with minor damage, 2 homes with major damage, one home destroyed, and 2 businesses with minor damage.
One way to handle phrases as search terms is to use the & (AND) or <-> (FOLLOWED BY) Boolean operators with the to_tsquery function.
For example, if we want to search for the phrase "rain of debris":
```sql
SELECT state, begin_date_time, event_type, event_narrative
FROM se_details
WHERE ts @@ to_tsquery('english', 'rain & of & debris');
```
The search phrase gets normalized to 'rain' & 'debri'. The order doesn't matter as long as both 'rain' and 'debri' have matches in the document, such as this example:
A debris flow caused by heavy rain on a saturated hillside blocked the Omak River Road one mile south of the intersection with State Route 97.
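You can inspect that normalization by running the function on its own:

```sql
-- "of" is a stop word, so it drops out of the parsed query
SELECT to_tsquery('english', 'rain & of & debris');
--     to_tsquery
-- ------------------
--  'rain' & 'debri'
```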
If we do to_tsquery('english', 'rain <-> of <-> debris') instead, the tsquery value is 'rain' <2> 'debri', meaning it will only match where 'rain' is followed by 'debri' precisely two positions away, such as here:
Heavy rain caused debris flows on the Coal Hollow Fire and Tank Hollow Fire burn scars.
(This was actually the only match, so using the <-> operator is a little bit more restrictive.)
The phraseto_tsquery function can also parse the phrase itself, and inserts <N> between lexemes, where N is the integer position of the next lexeme when counting from the preceding one. Unlike to_tsquery, this function doesn't recognize operators; for example, we can just pass in the entire phrase like so: phraseto_tsquery('english', 'rain of debris'). The resulting tsquery value is 'rain' <2> 'debri' like above, so phraseto_tsquery also accounts for positioning.
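Running the two forms side by side makes the equivalence easy to see:

```sql
-- Explicit FOLLOWED BY operators...
SELECT to_tsquery('english', 'rain <-> of <-> debris');
--  'rain' <2> 'debri'

-- ...versus letting the function parse the whole phrase
SELECT phraseto_tsquery('english', 'rain of debris');
--  'rain' <2> 'debri'
```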
One very common use case for assigning different weights and ranking is searching on articles. For example, you may want to merge the article title and abstract or content together for search, but want matches on title to be considered more relevant and thus rank higher.
Going back to our storm events example, our data table also has an episode_narrative column in addition to event_narrative. For storm data, an event is an individual type of storm event (e.g. flood, hail), while an episode is an entire storm system and could contain many different types of events.
Let's say we want to be able to carry out a full-text search on event as well as episode narratives, but have decided that the event narrative should weigh more than the episode narrative. We could define the ts column like this instead:
```sql
ALTER TABLE se_details
ADD COLUMN ts tsvector
GENERATED ALWAYS AS (
    setweight(to_tsvector('english', coalesce(event_narrative, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(episode_narrative, '')), 'B')
) STORED;
```
setweight is a full-text search function that assigns a weight to the components of a document. The function takes the characters 'A', 'B', 'C', or 'D' (most weight to least, in that order). We're also using coalesce here so that the concatenation doesn't result in nulls if either event_narrative or episode_narrative contains null values.
You could then use the ts_rank function in an ORDER BY clause to return results from most to least relevant:
```sql
SELECT … ORDER BY ts_rank(ts, to_tsquery('english', 'tornado')) DESC;
```
So, this record is ranked higher in the search results:
state             | MISSISSIPPI
begin_date_time   | 2018-04-06 22:18:00
event_type        | Tornado
event_narrative   | This tornado touched down near the Jefferson Davis-Covington County line along Lucas Hollow Road. It continued southeast, crossing the county line. Some large limbs and trees were snapped and uprooted at this location. It then crossed Lucas Hollow Road again before crossing Leonard Road. A tornado debris signature was indicated on radar in these locations. The tornado uprooted and snapped many trees in this region. It also overturned a small tractor trailer on Oakvale Road and caused some minor shingle damage to a home. After crossing Oakvale Road twice, the tornado lifted before crossing Highway 35. The maximum winds in this tornado was 105mph and total path length was 2.91 miles. The maximum path width was 440 yards.
episode_narrative | A warm front was stretched across the region on April 6th. As a disturbance rode along this stalled front, it brought copious amounts of rain to the region thanks to ample moisture in place. As daytime heating occurred, some storms developed which brought severe weather to the region.
Compared to this, where there is a match for "tornado" in the episode_narrative only:
state             | NEBRASKA
begin_date_time   | 2018-06-06 18:10:00
event_type        | Hail
event_narrative   | Hail predominately penny size with some quarter size hail mixed in.
episode_narrative | Severe storms developed in the Nebraska Panhandle during the early evening hours of June 6th. As this activity tracked east, a broken line of strong to severe thunderstorms developed. Hail up to the size of ping pong balls, thunderstorm wind gusts to 70 MPH and a brief tornado touchdown were reported. Heavy rain also fell leading to flash flooding in western Keith county.
Tip: ts_rank returns a floating-point value, so you could include the expression in your SELECT list to see how these matches score. In my case I get around 0.890 for the Mississippi event, and 0.243 for the Nebraska event.
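Putting that together, a query along these lines surfaces the scores alongside the matches (a sketch based on the weighted ts column above):

```sql
-- Rank weighted matches on 'tornado', highest score first
SELECT state, event_type,
       ts_rank(ts, to_tsquery('english', 'tornado')) AS rank
FROM se_details
WHERE ts @@ to_tsquery('english', 'tornado')
ORDER BY rank DESC;
```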
You can go even deeper and make your Postgres full-text search even more robust, by implementing features such as highlighting results, or writing your own custom dictionaries or functions. You could also look into enabling extensions such as unaccent (remove diacritic signs from lexemes) or pg_trgm (for fuzzy search). Speaking of extensions, those are just two of the extensions supported on Crunchy Bridge. We've built our managed cloud Postgres service such that you can dive right in and take advantage of all these Postgres features.
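For instance, a minimal pg_trgm sketch for fuzzy matching might look like this; the misspelled search term is just an example:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Trigram similarity between a misspelling and the intended
-- word; similarity() returns a value between 0 and 1
SELECT similarity('tornado', 'tornadoe');

-- The % operator matches when the similarity exceeds the
-- pg_trgm.similarity_threshold setting (0.3 by default)
SELECT event_narrative FROM se_details
WHERE event_narrative % 'tornadoe';
```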
With all that said: as you can see, you don't need a very involved setup to get started. It's worth a try whether you are just beginning to explore a full-text search solution, or reevaluating whether you need to go all out for a dedicated full-text search service, especially if you already have Postgres in your stack.
To be fair, Postgres doesn't have some search features that are available with platforms such as Elasticsearch. But a major advantage is that you won't have to maintain and sync a separate data store. If you don't quite need search at super scale, there might be more for you to gain by minimizing dependencies. Plus, the Postgres query syntax that you already know, with the addition of some new functions and operators, can get you pretty far. Got any other questions or thoughts about full-text search with Postgres? We're happy to hear them on @crunchydata.
July 27, 2021