Mysql & sphinx

Hi,

I’m going to setup full text search with mysql 5.5, sphinxse 2.1 and sphinxd
0.9.9.

max_matches worries me, from docs:

“Take, for example, the instance where Sphinx is configured to return a
maximum of three results, and tickets 1, 2, 3, 4, and 5 contain the
string “target”, but only ticket 5 is in status “Open”. A search for
"Content LIKE ‘target’ AND Status = ‘Open’” may return no results,
despite ticket 5 matching those criteria, as Sphinx will only return
tickets 1, 2, and 3 as possible matches.
"

This means that sphinx will never ever return new matching tickets that are
above max_matches :-/ Would be acceptable if it use max_matches counting down
from latest one but this doc suggests it’s count from first one.

I assume there is no solution for this other that use some huge number as
max_matches?

Arkadiusz Miśkiewicz PLD/Linux Team
arekm / maven.pl http://ftp.pld-linux.org/

I’m going to setup full text search with mysql 5.5, sphinxse 2.1 and sphinxd
0.9.9.
[snip]
This means that sphinx will never ever return new matching tickets that are
above max_matches :-/ Would be acceptable if it use max_matches counting down
from latest one but this doc suggests it’s count from first one.

That is not quite what it means. To make the limitation clearer, assume
there are 100,000 tickets in the database, and only five tickets contain
the word “target” (once each): ticket ids 3, 44, 555, 6666, and 77777.
If max_matches is set to 5, and the search is for “Content LIKE
’target’”, all five tickets will be returned. If max_matches is set to
3, only tickets 3, 44, and 555 will be returned. That is, max_matches
need not be set to 100,000 to return results in tickets that high; it
should be set comfortably higher than the number of occurrences of the
words you expect to be searching for[*].

Given the same scenario, but with a query, of “Content LIKE ‘target’ AND
Status = ‘Open’” and only ticket 77777 in the Open status, a max_matches
of 5 would suffice to return that one result. A max_matches of 3 would
return no results, as Sphinx would return only three results to RT (3,
44, and 555) which would then be filtered to only open tickets, which is
a null set.

Does that help to clarify the limitation? To be sure, it is still an
irritating limitation, and one that i wish we could work around somehow.
Unfortunately, short of pushing more of the search parameters down into
sphinx, which would be a rather complicated piece of work, I see little
way around it.

  • Alex

[*] To complicate matters, this is technically the number of
attachments matching the full-text criteria, not the number of
tickets. That is, ticket 3 contained 500 emails, each of which
contained the word “target”, then (contrary to the above example)
max_matches would need to be 501 in order for the results to contain
more than just ticket id 3.

Does that help to clarify the limitation?

Yes, it does. Thanks.

More questions follows. I see that sphinx is learning
only attachments with ContentType = ‘text/plain’ entries
which looks unfortunate since I have tons of html email.

Did indexing text/html and having html_strip=1 [1]
in sphinx produce any problems that caused only text/plain
to be choosen for indexation in rt-setup-fulltext-index?

  1. http://sphinxsearch.com/docs/current.html#conf-html-strip

ps. here is my setup with delta indexes, could be useful for
other prople

source rt {
type = mysql

sql_host        = localhost
sql_db          = rt3
sql_user        = 
sql_pass        = 

sql_query_pre   = SET NAMES utf8
sql_query_pre   = REPLACE INTO SphinxCounters SELECT 1, MAX(id) FROM Attachments
sql_query       = \
    SELECT a.id, a.content FROM Attachments a \
    JOIN Transactions txn ON a.TransactionId = txn.id AND txn.ObjectType = 'RT::Ticket' \
    JOIN Tickets t ON txn.ObjectId = t.id \
    WHERE a.ContentType = 'text/plain' AND t.Status != 'deleted' \
    AND a.id<=( SELECT max_doc_id FROM SphinxCounters WHERE counter_id=1 )

sql_query_info  = SELECT * FROM Attachments WHERE id=$id

}

source rt_delta : rt {
sql_query_pre = SET NAMES utf8
sql_query =
SELECT a.id, a.content FROM Attachments a
JOIN Transactions txn ON a.TransactionId = txn.id AND txn.ObjectType = ‘RT::Ticket’
JOIN Tickets t ON txn.ObjectId = t.id
WHERE a.ContentType = ‘text/plain’ AND t.Status != ‘deleted’
AND a.id>( SELECT max_doc_id FROM SphinxCounters WHERE counter_id=1 )

sql_query_info  = SELECT * FROM Attachments WHERE id=$id
sql_query_post_index    =

}

index rt {
source = rt
path = /var/lib/sphinx/rt.index
docinfo = extern
charset_type = utf-8
charset_table = 0…9, A…Z->a…z, a…z, U+0143->U+0144, U+0104->U+0105, U+0106->U+0107, U+0118->U+0119, U+0141->U+0142, U+00D3->U+00F3, U+015A->U+015B,
U+0179->U+017A, U+017B->U+017C, U+0105, U+0107, U+0119, U+0142, U+00F3, U+015B, U+017A, U+017C, U+0144
}

(various charset_table are required for many non-english languages)

index rt_delta : rt
{
source = rt_delta
path = /var/lib/sphinx/rt.delta.index
}

CREATE TABLE AttachmentsIndex (
id int(10) unsigned NOT NULL,
weight int(11) NOT NULL,
query varchar(3072) NOT NULL,
KEY query (query(255))
) ENGINE=SPHINX DEFAULT CHARSET=utf8 CONNECTION=‘sphinx://127.0.0.1:3312/rt,rt_delta’

(note, two indexes in CONNECTION)

CREATE TABLE SphinxCounters (
counter_id int(11) NOT NULL,
max_doc_id int(11) NOT NULL,
PRIMARY KEY (counter_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

all indexes updated daily, rt_delta updated every 15 minutes

cat /etc/cron.d/sphinx

@daily root /usr/bin/indexer --quiet --rotate --all
*/15 * * * * root /usr/bin/indexer --quiet --rotate rt_delta

Arkadiusz Miśkiewicz PLD/Linux Team
arekm / maven.pl http://ftp.pld-linux.org/

Did indexing text/html and having html_strip=1 [1]
in sphinx produce any problems that caused only text/plain
to be choosen for indexation in rt-setup-fulltext-index?

With the caveats that you’ll need additional html_strip=1 indexes for
both the main and the delta index (otherwise will be stripped
from text/plain content), and that we’ve not tested it locally, it
should probably work, yes. We opted to provide the text/plain
configuration as the default because most html mail has a text/plain
alternative. But as rt-setup-fulltext-index states:

    Below is a simple Sphinx configuration which can be used to
    index all text/plain attachments in your database.  This
    configuration is not ideal; you should read the Sphinx
    documentation to understand how to configure it to better suit
    your needs.
  • Alex