I'm so close I can taste it.
I have a table called "links" that has a bunch of stuff relevant to links into my forum. Two of those columns are thread_id and lastpost, as well as link_id which is unique.
I have 16 tables called "lkX" where X is a hex digit that is used for keyword-to-link mapping. In the query I'll show you, I determine, programatically, which one to use. So when you see that I use lk7 to look up one particular keyword, presume that I did that right. They're split up because each one has a little over 12 million rows! lk stands for "link-keyword" and all these tables do is map keyword_id values to link_id values. Many to many, of course.
I have a table called "keywords" that contains a bunch of varchar(30) keywords and gives each one a unique keyword_id.
So let's say I want to say: show me all result sets where keywords "escort" and "silly" are there, but EXCLUDE those where "fear" is there.
Here's the query I came up with.
SELECT l.thread_id,l.lastpost,l.link_id,w0.weight,w1.weight
FROM links AS l
JOIN keywords AS k0 ON k0.keyword='escort'
JOIN lk7 AS w0 ON w0.keyword_id=k0.keyword_id
JOIN keywords AS k1 ON k1.keyword='silly'
JOIN lk6 AS w1 ON w1.keyword_id=k1.keyword_id
JOIN keywords AS k2 ON k2.keyword='fear'
JOIN lke AS w2 ON w2.keyword_id=k2.keyword_id
WHERE (w0.link_id=l.link_id AND w1.link_id=l.link_id AND w2.link_id != l.link_id)
GROUP BY l.link_id
So you can see that I get the keyword_id for each word, link that up with the JOIN to the appropriate lkX table, and then link those all up with the links table.
This works fine for the first two (looking for all results that have "escort" and "silly" but when I add the third condition (not "fear") it takes a VERY long time and returns no rows.
I thought doing the simple w2.link_id != l.link_id would have the desired result, but apparently not.
How do I craft this SQL to say that I want link_ids excluded if there's an entry for "fear" for them? That is, I want to EXCLUDE any results for "fear" as a keyword.
Playing with it a little, I removed the second join for the excluded word (since the table isn't needed), kept the join on the keyword table to get the keyword_id, and then replaced the last conditional in the WHERE with this:
AND l.link_id NOT IN (SELECT link_id FROM db66698_mm_forum.lke AS w2 WHERE w2.keyword_id=k2.keyword_id)
That cut the query time down to a reasonable 19 seconds and appears to return the proper result set. So... is that the most efficient way to do it?
So my full query now is:
SELECT l.thread_id,l.lastpost,l.link_id,w0.weight,w1.weight
FROM links AS l
JOIN keywords AS k0 ON k0.keyword='escort'
JOIN lk7 AS w0 ON w0.keyword_id=k0.keyword_id
JOIN keywords AS k1 ON k1.keyword='silly'
JOIN lk6 AS w1 ON w1.keyword_id=k1.keyword_id
JOIN keywords AS k2 ON k2.keyword='fear'
WHERE
(w0.link_id=l.link_id AND
w1.link_id=l.link_id AND
l.link_id NOT IN
(SELECT link_id FROM db66698_mm_forum.lke AS w2 WHERE w2.keyword_id=k2.keyword_id))
GROUP BY l.link_idOn that last select (the subselect), I'm getting back about 3000 rows on about 2200 unique link_id values. Is there value in my adding a DISTINCT and GROUP BY to that subselect? Is the cost to do that less than the cost of the NOT IN that is comparing against the same link_id about 800 times?
Am I into "Advanced SQL" yet? :-)