SearchBuilder DISTINCTs

Ian_Grant · February 20, 2003, 10:14am

Sorry about the spurious Subject: header and the duplicate signatures on my
previous message.

I suspect that SearchBuilder’s query building probably needs to be
refactored to call out to methods in Handle, to allow easier
customization

Is this for reasons other than the particular problem I know about?

…It’s a fairly serious refactoring job, if you’re up for
it.

I’m not the best person to do this sort of thing - I am not good at
object-oriented programming.

The first is that SearchBuilder does SELECTs always using the DISTINCT
keyword. This is a problem because SQL Server will not allow DISTINCT with
large object (type TEXT) fields. (They can be up to 2GB and I suppose the
comparisons could get expensive.)

My question is: Does anyone know, or, failing that, can anyone suggest a way
to find out where the DISTINCT clauses are really necessary?

Er. Just about everywhere. DBIx::SearchBuilder desperately needs the set
of rows returned to be unique.

Having stared at this a little harder I am now of the opinion that the
DISTINCT argument is redundant in SELECT DISTINCT main.* FROM table main [,
…] because every table has a column ‘id’ which contains an integer unique
to that row in that table. This is sufficient to make every row distinct. As
far as I can see this the only place in SearchBuilder.pm where it’s a problem
for SQL Server.

Ian Grant, Computer Lab., William Gates Building, JJ Thomson Ave., Cambridge
Phone: +44 1223 334420

Robert_Spier · February 20, 2003, 5:58pm

Having stared at this a little harder I am now of the opinion that
the DISTINCT argument is redundant in SELECT DISTINCT main.* FROM
table main [, …] because every table has a column ‘id’ which
contains an integer unique to that row in that table. This is
sufficient to make every row distinct. As far as I can see this the
only place in SearchBuilder.pm where it’s a problem for SQL Server.

Ian-

The reason you cannot get rid of the DISTINCTs is because of
the joins.  They will lead to multiple duplicate rows.

If you can come up with a solution to this problem that works
on complicated queries, (of which I can provide you many),
then we can get rid of DISTINCT.  

By subclassing the query building stuff, you can eliminate the
DISTINCTs for SQL Server, and take a performance hit by doing
dupe checking on the perl side.  But every other database
doesn't have to deal with it.  

If you have 2GB attachments in the database things are going
to be slow anyway-  Maybe SQL Server has another data type
that can be used?

-R

Matt_Knopp · February 20, 2003, 6:17pm

Dude! It’s like rocket surgery! No really! Add a LimitDistinctP() method to
SearchBuilder which sets an instance variable, say ‘UseDistinctP’. Then in
_DoSearch change:

$QueryString =
"SELECT DISTINCT main.* FROM " . $self->_TableAliases . " ";

To something like:

$QueryString =
“SELECT " . ($this->{‘UseDistinctP’} ? “DISTINCT” : “”) .
” main.* from " . $self->_TableAliases . " ";

Sure, it requires the SB user to know that they want distinct values, but
really that’s okay. It’s worse to force distinct values on someone who may
not have expected it.

-Matt

Jesse_Vincent · February 20, 2003, 6:25pm

Dude! It’s like rocket surgery! No really! Add a LimitDistinctP() method to
SearchBuilder which sets an instance variable, say ‘UseDistinctP’. Then in
_DoSearch change:

$QueryString =
"SELECT DISTINCT main.* FROM " . $self->_TableAliases . " ";

To something like:

$QueryString =
"SELECT " . ($this->{‘UseDistinctP’} ? “DISTINCT” : “”) .
" main.* from " . $self->_TableAliases . " ";

This is a somewhat shortsighted way to do this stuff, since different
databases have different needs, which is why I recommended a proper
refactoring.

Sure, it requires the SB user to know that they want distinct values, but
really that’s okay. It’s worse to force distinct values on someone who may
not have expected it.

It’s a core assumption of how the SearchBuilder OO-RDBMS mapper works.
SearchBuilder is not about returning a bunch of possibly redundant
database rows. it’s about finding the set of objects that match the
query.

-Matt

rt-devel mailing list
rt-devel@lists.fsck.com
http://lists.fsck.com/mailman/listinfo/rt-devel

Request Tracker... So much more than a help desk — Best Practical Solutions – Trouble Ticketing. Free.

Matt_Knopp · February 20, 2003, 7:09pm

Sure, but you are forcing constraints on the query that can not be altered
or influenced using the query-altering parts of the API, ala Limit. I would
not have made that design decision, I think it’s bogus.

I’m curious what problems you are envisioning with removing distinct from
a majority of the queries? For anything other then joins it seems like your
pkeys should be sufficient, and in the case of joins you (the caller)
ultimately must understand the implications of the query you are executing.
So I don’t think it is unreasonable to force them to set a distinct flag.

Matt

Robert_Spier · February 20, 2003, 7:42pm

Sure, but you are forcing constraints on the query that can not be altered
or influenced using the query-altering parts of the API, ala Limit. I would
not have made that design decision, I think it’s bogus.

By definition SearchBuilder only returns unique objects. If the same
object/instance is returned twice in a single query, that is wrong,
for how SearchBuilder was designed.

I’m curious what problems you are envisioning with removing distinct from
a majority of the queries? For anything other then joins it seems like your
pkeys should be sufficient, and in the case of joins you (the caller)
ultimately must understand the implications of the query you are executing.
So I don’t think it is unreasonable to force them to set a distinct flag.

Joins exist in all but the simplest queries in SearchBuilder and most
other OO-RDBMS mappers. And, the whole purpose of an OO-RDBMS mapper
is to insulate the the user/programmer from the RDBMS.

There are several ways to change how SearchBuilder works such that
DISTINCT is not needed, but the penalty is performance.

-R

Ian_Grant · February 21, 2003, 10:05am

Having stared at this a little harder I am now of the opinion that
the DISTINCT argument is redundant in SELECT DISTINCT main.* FROM
table main [, …] because every table has a column ‘id’ which
contains an integer unique to that row in that table. This is
sufficient to make every row distinct. As far as I can see this the
only place in SearchBuilder.pm where it’s a problem for SQL Server.
…
The reason you cannot get rid of the DISTINCTs is because of
the joins. They will lead to multiple duplicate rows.

Ah yes, I didn’t think of that. The problem is when there are joins and the
unique id column in the joined table is not selected.

If you can come up with a solution to this problem that works
on complicated queries, (of which I can provide you many),
then we can get rid of DISTINCT.

There might be a way to detect these queries and log them. Quite a bit of RT
does work without the DISTINCT clauses. My interest is in getting RT usable
with our database. If that can be done without major changes SearchBuilder
then I would rather do that. I have a limited amount of time to spend on this.

By subclassing the query building stuff, you can eliminate the
DISTINCTs for SQL Server, and take a performance hit by doing
dupe checking on the perl side. But every other database
doesn’t have to deal with it.

If you have 2GB attachments in the database things are going
to be slow anyway-

It’s not my attachments that are the problem - I have had to limit them to 8K
for other reasons (the Sybase DBD driver allocates MAX_LENGTH buffers for
every returned column.) I was simply imagining M$'s justification for the
restriction, albeit a weak one. They don’t offer any explanation:

  Maybe SQL Server has another data type

that can be used?

Only VARCHAR which is limited to 256 bytes.
Ian Grant, Computer Lab., William Gates Building, JJ Thomson Ave., Cambridge
Phone: +44 1223 334420

Robert_Spier · February 21, 2003, 4:20pm

Only VARCHAR which is limited to 256 bytes.

VARCHAR2 goes much higher, iirc.

-R

Ian_Grant · February 21, 2003, 4:28pm

Only VARCHAR which is limited to 256 bytes.

I was wrong. VARCHAR can go up to 8,000.

VARCHAR2 goes much higher, iirc.

It’s not mentioned in M$ documentation.

Thanks,
Ian
Ian Grant, Computer Lab., William Gates Building, JJ Thomson Ave., Cambridge
Phone: +44 1223 334420