PostgreSQL full-text index default type: GiST or GIN

Hi,

I am working on an upgrade from 3.8.x to 4.4.x and I noticed the
following comment in the UPGRADING-4.2 file:

The full-text indexing defaults for PostgreSQL have changed; GiST is now
the suggested index, as well as storing data in a separate
AttachmentsIndex table. Both changes improve lookup speed. For
improved search performance, you may wish to drop existing C
and C indexes on C, and re-generate the index using
C.

I am curious about the benchmarks you used to make this determination?
In particular, the GIN fastscan option can dramatically improve search
performance in versions 9.4 and above. Here is a nice discussion:

http://blog.pgaddict.com/posts/performance-since-postgresql-7-4-to-9-4-fulltext

In addition, GIN indexes are much smaller in PostgreSQL 9.4 and above.
So I am curious about the data that motivated the change to GiST.

Regards,
Ken

Hi,

I am working on an upgrade from 3.8.x to 4.4.x and I noticed the
following comment in the UPGRADING-4.2 file:

The full-text indexing defaults for PostgreSQL have changed; GiST is now
the suggested index, as well as storing data in a separate
AttachmentsIndex table. Both changes improve lookup speed. For
improved search performance, you may wish to drop existing C
and C indexes on C, and re-generate the index using
C.

I am curious about the benchmarks you used to make this determination?
In particular, the GIN fastscan option can dramatically improve search
performance in versions 9.4 and above. Here is a nice discussion:

http://blog.pgaddict.com/posts/performance-since-postgresql-7-4-to-9-4-fulltext

In addition, GIN indexes are much smaller in PostgreSQL 9.4 and above.
So I am curious about the data that motivated the change to GiST.

Hi Ken,

Here is a correspondence between myself and a BP engineer you may find helpful:

I’ve enabled full-text searching on our 4.2 instance.

How often are folks running the sbin/rt-fulltext-indexer to keep the
index up-to-date?

Since the pg updates are incremental, they are safe to run rather
frequently. Depending on ticket volume, they’ve been run anywhere from
every minute to every hour. Unlike the Sphinx indexes for MySQL,
there’s no need to hold back and run them daily. Indeed, we’ve
considered updating the Pg index in realtime as attachments are created
– it’s sufficiently fast.

We are using Pg 9.1 on the backend with GIN indexing. We get ~100
tickets created per day and I am not sure how many transactions are
created per day.

Any best practices or anecdotal data would be helpful. Also, how
"strong" of a default recommendation is the GiST indexing? Am I being
foolish for choosing GIN?

The GiST recommendation is very much a soft one. We’re not aware of any
particular cases where it has been explicitly required, nor have we done
comprehensive comparisons between the two on RT’s data at scale.

—end of email—

There is also some good documentation on the Pg site regarding the
index choices:

http://www.postgresql.org/docs/9.5/static/textsearch-indexes.html

-m

I am curious about the benchmarks you used to make this determination?
In particular, the GIN fastscan option can dramatically improve search
performance in versions 9.4 and above. Here is a nice discussion:

http://blog.pgaddict.com/posts/performance-since-postgresql-7-4-to-9-4-fulltext

In addition, GIN indexes are much smaller in PostgreSQL 9.4 and above.
So I am curious about the data that motivated the change to GiST.

You are quite correct; GIN indexes are superior in virtually every
way for our use case. And the documentation is unfortunately in error;
the default switched in the other direction, from GiST to GIN. See
https://github.com/bestpractical/rt/commit/e103f6da for the actual
behavior change.

This mistake is entirely my fault, and my only excuse for getting it
backwards is that the documentation commit was written several months
after the code was. Apologies!

Here is a correspondence between myself and a BP engineer you may find helpful:
[snip]

That correspondence is out of date; at the time, GiST was the
default, based on no particular research. The change to default to
GIN was made late 2014/early 2015 after I conducted additional research
on real-world data – as well as the Postgres documentation, which is
quite clear that static data should use GIN.

You can see the results of that research at
https://chmrr.net/fts-charts/pg.html and
https://chmrr.net/fts-charts/query.html#pg

Those are from performing speed analysis on rebuilding the index on a
clone of Best Practical’s own ticketing system, which was taken as
being a representative real-life sample of data. It was run on
Postgres 9.3, and resulted in the following branch:


The last ~7 commits are the most Postgres-specific.

The end result is an indexing engine which, in its default
configuration, indexes data an order of magnitude faster, as well as
provides results an order of magnitude faster. Wins all around!

I’m happy to supply a patch to fix the documentation, but I suspect
Shawn or someone at BPS will be faster to simply fix it directly. :slight_smile:

  • Alex

I am curious about the benchmarks you used to make this determination?
In particular, the GIN fastscan option can dramatically improve search
performance in versions 9.4 and above. Here is a nice discussion:

http://blog.pgaddict.com/posts/performance-since-postgresql-7-4-to-9-4-fulltext

In addition, GIN indexes are much smaller in PostgreSQL 9.4 and above.
So I am curious about the data that motivated the change to GiST.

You are quite correct; GIN indexes are superior in virtually every
way for our use case. And the documentation is unfortunately in error;
the default switched in the other direction, from GiST to GIN. See
https://github.com/bestpractical/rt/commit/e103f6da for the actual
behavior change.

Hi Alex,

I am glad to find out it is only a documentation problem. I had not had a
chance to look at the actual code yet.

Regards,
Ken