'.' as delimiter/boundary breaks domain name searches

[ Similar, but unrelated to my other message from 10 minutes ago. ]

It appears any ‘.’ is interpreted as a word boundary with
Pg full-text indexing turned on.

Is that known to be true, or am I wrong?

This breaks searches for FQDNs names in ticket contents.

Searching for ‘foobar’ will hit foobar.org

Searching for ‘foobar.org’ will not hit ‘foobar.org

[ Similar, but unrelated to my other message from 10 minutes ago. ]

It appears any ‘.’ is interpreted as a word boundary with
Pg full-text indexing turned on.

Is that known to be true, or am I wrong?

This breaks searches for FQDNs names in ticket contents.

Searching for ‘foobar’ will hit foobar.org

Searching for ‘foobar.org’ will not hit ‘foobar.org

What FTS will match/return is dictated by your database and its
configuration.

Have you reviewed the Postgres full text search documentation for your
release of Pg?

http://www.postgresql.org/docs/8.4/static/textsearch.html

-kevin

[ Similar, but unrelated to my other message from 10 minutes ago. ]

It appears any ‘.’ is interpreted as a word boundary with
Pg full-text indexing turned on.

Is that known to be true, or am I wrong?

This breaks searches for FQDNs names in ticket contents.

Searching for ‘foobar’ will hit foobar.org

Searching for ‘foobar.org’ will not hit ‘foobar.org

What FTS will match/return is dictated by your database and its
configuration.

Have you reviewed the Postgres full text search documentation for your
release of Pg?

http://www.postgresql.org/docs/8.4/static/textsearch.html

-kevin

Wow! PostgreSQL 8.4, 4 major releases back! I cannot be certain that I
am recalling this correctly, but the default parser in older versions
of PostgreSQL did have that behavior. I do not know when they made the
change to fix it. What do you get when you run:

rt3=# select plainto_tsquery(‘rice.edu’);
plainto_tsquery
rice.edu
(1 row)

I seem to recall that in the older version when I saw this issue, it
returned:

plainto_tsquery
’rice’ & ‘edu’

You may be able to make a custom config for your text search using
the definitions from the current release. I just ended up searching
for ‘rice’ instead of ‘rice.edu’, for example.

Regards,
Ken

I’m reviving this one time in case anyone has further ideas.

  • PostgreSQL 8.4.20 (RHEL 6.6) with FTS does the right
    thing when parsing an email address[1]

  • An RT 4.2.12 search for the same string returns 0 results[2].
    I’m 98% certain the ticket is indexed though as other queries
    return it[2].

  • The rt-fulltext-indexer script runs every 10 minutes and has
    no errors.

Jeff

Footnote 1:

rt4=# SELECT alias, description, token FROM ts_debug(‘foo@domain.com’);
alias | description | token
email | Email address | foo@domain.com
(1 row)

rt4=# SELECT alias, description, token FROM ts_debug(‘foo@domain.com’);
alias | description | token
blank | Space symbols | <
email | Email address | foo@domain.com
blank | Space symbols | >
(3 rows)

rt4=#

Footnote 2:

Content LIKE ‘foo@domain.com’

0 results

Content LIKE ‘domain.com

0 results

Content LIKE ‘domain’

0 results

Content LIKE ‘another unique string in the ticket I am trying to hit’

1 resultOn 10/2/2014 11:08 AM, ktm@rice.edu wrote:

On Thu, Oct 02, 2014 at 10:56:56AM -0400, Kevin Falcone wrote:

On Wed, Oct 01, 2014 at 03:50:43PM -0400, Jeff Blaine wrote:

[ Similar, but unrelated to my other message from 10 minutes ago. ]

It appears any ‘.’ is interpreted as a word boundary with
Pg full-text indexing turned on.

Is that known to be true, or am I wrong?

This breaks searches for FQDNs names in ticket contents.

Searching for ‘foobar’ will hit foobar.org

Searching for ‘foobar.org’ will not hit ‘foobar.org

What FTS will match/return is dictated by your database and its
configuration.

Have you reviewed the Postgres full text search documentation for your
release of Pg?

http://www.postgresql.org/docs/8.4/static/textsearch.html

-kevin

Wow! PostgreSQL 8.4, 4 major releases back! I cannot be certain that I
am recalling this correctly, but the default parser in older versions
of PostgreSQL did have that behavior. I do not know when they made the
change to fix it. What do you get when you run:

rt3=# select plainto_tsquery(‘rice.edu’);
plainto_tsquery

rice.edu
(1 row)

I seem to recall that in the older version when I saw this issue, it
returned:

plainto_tsquery

‘rice’ & ‘edu’

You may be able to make a custom config for your text search using
the definitions from the current release. I just ended up searching
for ‘rice’ instead of ‘rice.edu’, for example.

Regards,
Ken

Jeff Blaine
kickflop.net
PGP/GnuPG Key ID: 0x0C8EDD02

I’m reviving this one time in case anyone has further ideas.

  • PostgreSQL 8.4.20 (RHEL 6.6) with FTS does the right
    thing when parsing an email address[1]

I question that.

I know that with Pg 9.0 the FTS indexer does arguably “right” things in
parsing email addresses, hostnames, and IP addresses BUT that behavior
effectively hides the octets inside an IP, the host and domain name
elements in a hostname, and the local and domain parts of an email
address. For our business (customized hosting & related services) this
made the indexing do more harm than good. We have adequate capacity (and
suitable config) on our DB server to handle most of the searches we need
to do in reasonable time without the indexing, so we disabled it.

I’m reviving this one time in case anyone has further ideas.

  • PostgreSQL 8.4.20 (RHEL 6.6) with FTS does the right
    thing when parsing an email address[1]

  • An RT 4.2.12 search for the same string returns 0 results[2].
    I’m 98% certain the ticket is indexed though as other queries
    return it[2].

  • The rt-fulltext-indexer script runs every 10 minutes and has
    no errors.

Is the ticket in question a merged ticket? You had a mail from the
same timeframe which dealt with merged ticket FTS, which is a known
bug: https://issues.bestpractical.com/Ticket/Display.html?id=9370
Sadly, there’s not good solution to that bug at this time.

Otherwise, seeing the query RT is generating, and stripping clauses
out of it until it matches, is likely your best bet. You can see the
SQL generated from a TicketSQL query by running:

perl -MRT=-init -le '$t=RT::Tickets->new(RT->SystemUser);$t->FromSQL("@ARGV");print $t->BuildSelectQuery' \
    "Content LIKE 'foo@domain.com'"
  • Alex
  • An RT 4.2.12 search for the same string returns 0 results[2].
    I’m 98% certain the ticket is indexed though as other queries
    return it[2].

Is the ticket in question a merged ticket? You had a mail from the
same timeframe which dealt with merged ticket FTS, which is a known
bug: https://issues.bestpractical.com/Ticket/Display.html?id=9370
Sadly, there’s not good solution to that bug at this time.

I’m losing what little is left of my mind, apparently. I looked last
week at the problem ticket and could swear it had no merging in its
history. Looking today, I see it does.

perl -MRT=-init -le '$t=RT::Tickets->new(RT->SystemUser);$t->FromSQL("@ARGV");print $t->BuildSelectQuery' \
    "Content LIKE 'foo@domain.com'"

Handy! Used this to generate the SQL. Pasted the SQL into psql for kicks
and got the unfortunately now-expected 0 results. Filed this snippet
away for later.

Thanks Alex.

Jeff