8.11. Text Search Types

PostgreSQL provides two data types that are designed to support full text search, which is the activity of searching through a collection of natural-language documents to locate those that best match a query. The tsvector type represents a document in a form suited for text search, while the tsquery type similarly represents a query. Chapter 12 provides a detailed explanation of this facility, and Section 9.13 summarizes the related functions and operators.

8.11.1. tsvector

A tsvector value is a sorted list of distinct lexemes, which are words that have been normalized to make different variants of the same word look alike (see Chapter 12 for details). Sorting and duplicate-elimination are done automatically during input, as shown in this example:

SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
                      tsvector
----------------------------------------------------
 'a' 'on' 'and' 'ate' 'cat' 'fat' 'mat' 'rat' 'sat'

(As the example shows, the sorting is first by length and then alphabetically, but that detail is seldom important.) To represent lexemes containing whitespace or punctuation, surround them with quotes:

SELECT $$the lexeme '    ' contains spaces$$::tsvector;
                 tsvector                  
-------------------------------------------
 'the' '    ' 'lexeme' 'spaces' 'contains'

(We use dollar-quoted string literals in this example and the next one, to avoid confusing matters by having to double quote marks within the literals.) Embedded quotes and backslashes must be doubled:

SELECT $$the lexeme 'Joe''s' contains a quote$$::tsvector;
                    tsvector                    
------------------------------------------------
 'a' 'the' 'Joe''s' 'quote' 'lexeme' 'contains'

Optionally, integer position(s) can be attached to any or all of the lexemes:

SELECT 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12'::tsvector;
                                  tsvector
-------------------------------------------------------------------------------
 'a':1,6,10 'on':5 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4

A position normally indicates the source word's location in the document. Positional information can be used for proximity ranking. Position values can range from 1 to 16383; larger numbers are silently clamped to 16383. Duplicate positions for the same lexeme are discarded.

Lexemes that have positions can further be labeled with a weight, which can be A, B, C, or D. D is the default and hence is not shown on output:

SELECT 'a:1A fat:2B,4C cat:5D'::tsvector;
          tsvector          
----------------------------
 'a':1A 'cat':5 'fat':2B,4C

Weights are typically used to reflect document structure, for example by marking title words differently from body words. Text search ranking functions can assign different priorities to the different weight markers.

It is important to understand that the tsvector type itself does not perform any normalization; it assumes that the words it is given are normalized appropriately for the application. For example,

select 'The Fat Rats'::tsvector;
      tsvector      
--------------------
 'Fat' 'The' 'Rats'

For most English-text-searching applications the above words would be considered non-normalized, but tsvector doesn't care. Raw document text should usually be passed through to_tsvector to normalize the words appropriately for searching:

SELECT to_tsvector('english', 'The Fat Rats');         
   to_tsvector   
-----------------
 'fat':2 'rat':3

Again, see Chapter 12 for more detail.

8.11.2. tsquery

A tsquery value stores lexemes that are to be searched for, and combines them using the boolean operators & (AND), | (OR), and ! (NOT). Parentheses can be used to enforce grouping of the operators:

 SELECT 'fat & rat'::tsquery;
    tsquery    
---------------
 'fat' & 'rat'

SELECT 'fat & (rat | cat)'::tsquery;
          tsquery          
---------------------------
 'fat' & ( 'rat' | 'cat' )

SELECT 'fat & rat & ! cat'::tsquery;
        tsquery         
------------------------
 'fat' & 'rat' & !'cat'

In the absence of parentheses, ! (NOT) binds most tightly, and & (AND) binds more tightly than | (OR).

Optionally, lexemes in a tsquery can be labeled with one or more weight letters, which restricts them to match only tsvector lexemes with one of those weights:

SELECT 'fat:ab & cat'::tsquery;
    tsquery
------------------
 'fat':AB & 'cat'

Quoting rules for lexemes are the same as described above for lexemes in tsvector; and, as with tsvector, any required normalization of words must be done before putting them into the tsquery type. The to_tsquery function is convenient for performing such normalization:

SELECT to_tsquery('Fat:ab & Cats');
    to_tsquery    
------------------
 'fat':AB & 'cat'