mysql - Get all rows within a specific range/radius (Document Term Matrix)
I store a document-term matrix in MySQL and want to run queries like these:
Example: get all rows with token_id '1' and token_id '2' (but maybe more than 2) that are within a range of 10 words of each other.
My table:
dt_matrix_token_id int(11) pk auto_increment, token_id int(11), storage_data_id int(11), position int(11)
So token_id identifies the token, and position gives the position in the original text where the token occurred.
Selecting the rows by token_id is not the problem; the problem is how to express inside the query that both words must be within a specific "radius/range" of each other.
    SELECT * FROM dt_matrix_token WHERE token_id IN (1,2) AND ???
The ??? is where I am stuck: how can I tell the query that it should check against the values it has already found? When the result contains a row with position = 12, all other valid rows should have position >= 2 AND position <= 22.
BTW: is this similar to a geo-location query within a radius?
Edit: here is my actual progress with sample data: http://sqlfiddle.com/#!2/52f48/2
The query works fine but is not complete yet: if token 1 matches twice in a document, that counts as a "valid" result, which is of course false. It is only correct when all of the given tokens are present. Also, the solution must be extendable to 3+ tokens.
I would start with a query on the dt_matrix_token table joined to a second instance of the dt_matrix_token table, where both instances have a token_id in the range of values you are interested in, but they can't both have the same value.
They should also have a matching storage_data_id (i.e. they're in the same document), and the position of the second token must be greater than or equal to that of the first.
    SELECT mt1.dt_matrix_token_id, mt1.storage_data_id,
           mt1.token_id AS token_id1, mt2.token_id AS token_id2,
           mt1.position AS position1, mt2.position AS position2
    FROM dt_matrix_token mt1
    JOIN dt_matrix_token mt2
      ON mt1.token_id IN (1,2,3)
     AND mt2.token_id IN (1,2,3)
     AND mt1.token_id <> mt2.token_id
     AND mt1.storage_data_id = mt2.storage_data_id
     AND mt2.position >= mt1.position
This gives you every sequential pair of the tokens you care about.
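To see what the pair query returns, here is a minimal sketch that runs the same join against an in-memory SQLite database through Python's sqlite3 module. SQLite only stands in for MySQL here, and the sample rows, `make_db`, and `sequential_pairs` are hypothetical names and data made up for illustration, not taken from the linked fiddle.

```python
import sqlite3

# Hypothetical sample rows: (token_id, storage_data_id, position).
ROWS = [
    (1, 1, 5), (2, 1, 8), (3, 1, 12), (1, 1, 40),  # document 1
    (1, 2, 3), (2, 2, 50),                          # document 2
    (3, 3, 1), (2, 3, 4), (1, 3, 30),               # document 3
]


def make_db():
    """Create an in-memory table shaped like dt_matrix_token and fill it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dt_matrix_token ("
        " dt_matrix_token_id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " token_id INTEGER, storage_data_id INTEGER, position INTEGER)"
    )
    conn.executemany(
        "INSERT INTO dt_matrix_token (token_id, storage_data_id, position)"
        " VALUES (?, ?, ?)",
        ROWS,
    )
    return conn


def sequential_pairs(conn, tokens):
    """Return every (first token, later different token) pair per document."""
    marks = ",".join("?" * len(tokens))  # e.g. "?,?,?" for three tokens
    sql = f"""
        SELECT mt1.dt_matrix_token_id, mt1.storage_data_id,
               mt1.token_id, mt2.token_id, mt1.position, mt2.position
        FROM dt_matrix_token mt1
        JOIN dt_matrix_token mt2
          ON mt2.storage_data_id = mt1.storage_data_id
         AND mt2.token_id <> mt1.token_id
         AND mt2.position >= mt1.position
        WHERE mt1.token_id IN ({marks}) AND mt2.token_id IN ({marks})
        ORDER BY mt1.dt_matrix_token_id, mt2.position
    """
    return conn.execute(sql, list(tokens) * 2).fetchall()


pairs = sequential_pairs(make_db(), [1, 2, 3])
for row in pairs:
    print(row)
```

Each printed row is one occurrence of a token together with a different token that appears at or after it in the same document.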
Now if you group by the dt_matrix_token_id from the first table, combined with the token_id from the second table, you narrow the set of results down to one of each token_id from the second table for every token in the first.
And when grouping the results from the second table, it's the minimum position that you care about. Since the second token always follows the first, this gives you the position nearest to the first token.
    SELECT mt1.dt_matrix_token_id, mt1.storage_data_id,
           mt1.token_id AS token_id1, mt2.token_id AS token_id2,
           mt1.position AS position1, MIN(mt2.position) AS position2
    FROM dt_matrix_token mt1
    JOIN dt_matrix_token mt2
      ON mt1.token_id IN (1,2,3)
     AND mt2.token_id IN (1,2,3)
     AND mt2.token_id <> mt1.token_id
     AND mt2.storage_data_id = mt1.storage_data_id
     AND mt2.position >= mt1.position
    GROUP BY mt1.dt_matrix_token_id, mt2.token_id
So now, for every instance of a token you care about, you have the nearest position of each of the other tokens that follow it in the same document.
But what you want is the maximum distance from the first token to any of the tokens that follow it. So you need to group by dt_matrix_token_id again, and calculate the distance to the maximum of the second positions (i.e. the maximum of the minimums for each token_id).
    SELECT dt_matrix_token_id, storage_data_id,
           MAX(position2) - position1 AS distance
    FROM (
        SELECT mt1.dt_matrix_token_id, mt1.storage_data_id,
               mt1.position AS position1, MIN(mt2.position) AS position2
        FROM dt_matrix_token mt1
        JOIN dt_matrix_token mt2
          ON mt1.token_id IN (1,2,3)
         AND mt2.token_id IN (1,2,3)
         AND mt2.token_id <> mt1.token_id
         AND mt2.storage_data_id = mt1.storage_data_id
         AND mt2.position >= mt1.position
        GROUP BY mt1.dt_matrix_token_id, mt2.token_id
    ) temp
    GROUP BY dt_matrix_token_id
However, not every token from the first table will have been followed by all of the other tokens you care about. So you need to make sure the count of results in each group is equal to the number of tokens you care about minus 1 (1 token in the first table, and n-1 tokens in the second). You can do this with a HAVING clause - HAVING COUNT(*) = 3-1 - where the 3 in that expression represents the number of tokens you are searching for.
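The grouping and the COUNT(*) check can be tried out with the sketch below, again using Python's sqlite3 with SQLite standing in for MySQL. The sample rows and the `covering_distances` helper are hypothetical. One deliberate difference from the query as written in the answer: both GROUP BY clauses list every non-aggregated column, which is an assumption made so that stricter SQL modes would also accept the statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dt_matrix_token ("
    " dt_matrix_token_id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " token_id INTEGER, storage_data_id INTEGER, position INTEGER)"
)
# Hypothetical sample rows: (token_id, storage_data_id, position).
conn.executemany(
    "INSERT INTO dt_matrix_token (token_id, storage_data_id, position)"
    " VALUES (?, ?, ?)",
    [(1, 1, 5), (2, 1, 8), (3, 1, 12), (1, 1, 40),   # document 1
     (1, 2, 3), (2, 2, 50),                           # document 2
     (3, 3, 1), (2, 3, 4), (1, 3, 30)],               # document 3
)


def covering_distances(conn, tokens):
    """For each token occurrence followed by all other tokens, return
    (dt_matrix_token_id, storage_data_id, distance covering them all)."""
    marks = ",".join("?" * len(tokens))
    sql = f"""
        SELECT dt_matrix_token_id, storage_data_id,
               MAX(position2) - position1 AS distance
        FROM (
            -- nearest following position of each other token
            SELECT mt1.dt_matrix_token_id, mt1.storage_data_id,
                   mt1.position AS position1,
                   MIN(mt2.position) AS position2
            FROM dt_matrix_token mt1
            JOIN dt_matrix_token mt2
              ON mt2.storage_data_id = mt1.storage_data_id
             AND mt2.token_id <> mt1.token_id
             AND mt2.position >= mt1.position
            WHERE mt1.token_id IN ({marks}) AND mt2.token_id IN ({marks})
            GROUP BY mt1.dt_matrix_token_id, mt1.storage_data_id,
                     mt1.position, mt2.token_id
        ) AS temp
        GROUP BY dt_matrix_token_id, storage_data_id, position1
        HAVING COUNT(*) = ?          -- n-1 other tokens must follow
        ORDER BY dt_matrix_token_id
    """
    params = list(tokens) * 2 + [len(tokens) - 1]
    return conn.execute(sql, params).fetchall()


distances = covering_distances(conn, [1, 2, 3])
print(distances)  # [(1, 1, 7), (2, 1, 32), (7, 3, 29)]
```

Document 2 drops out entirely because it never contains token 3, so no occurrence there reaches the required count of 2.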
Now for every instance of a token you care about that is followed by all of the other tokens you care about (in the same document), you have the shortest distance that covers all of them.
But there may well be multiple results for each document, and you only need to know the shortest one in each case. So you need to group by storage_data_id and calculate the minimum distance in each group.
    SELECT storage_data_id, MIN(distance) AS distance
    FROM (
        SELECT dt_matrix_token_id, storage_data_id,
               MAX(position2) - position1 AS distance
        FROM (
            SELECT mt1.dt_matrix_token_id, mt1.storage_data_id,
                   mt1.position AS position1, MIN(mt2.position) AS position2
            FROM dt_matrix_token mt1
            JOIN dt_matrix_token mt2
              ON mt1.token_id IN (1,2,3)
             AND mt2.token_id IN (1,2,3)
             AND mt2.token_id <> mt1.token_id
             AND mt2.storage_data_id = mt1.storage_data_id
             AND mt2.position >= mt1.position
            GROUP BY mt1.dt_matrix_token_id, mt2.token_id
        ) temp
        GROUP BY dt_matrix_token_id
        HAVING COUNT(*) = 3-1
    ) temp
    GROUP BY storage_data_id
This gives you each document that contains all of the tokens you care about, and the minimum distance that covers all of those tokens. To limit the results to distances within a specific range, you can add another HAVING clause to the outermost query: HAVING distance <= 20
The number of results from that query should then tell you how many documents contain all of the tokens you care about within the specified range.
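Putting the pieces together, the following is a rough end-to-end sketch of the complete query with the token list and the maximum distance passed in as parameters. It uses Python's sqlite3 with SQLite standing in for MySQL; the sample rows, the `documents_within` helper, and the derived-table aliases `nearest` and `spans` are hypothetical names chosen for illustration. The final HAVING repeats the aggregate rather than the alias, an assumption made to avoid relying on alias resolution.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dt_matrix_token ("
    " dt_matrix_token_id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " token_id INTEGER, storage_data_id INTEGER, position INTEGER)"
)
# Hypothetical sample rows: (token_id, storage_data_id, position).
conn.executemany(
    "INSERT INTO dt_matrix_token (token_id, storage_data_id, position)"
    " VALUES (?, ?, ?)",
    [(1, 1, 5), (2, 1, 8), (3, 1, 12), (1, 1, 40),   # document 1
     (1, 2, 3), (2, 2, 50),                           # document 2
     (3, 3, 1), (2, 3, 4), (1, 3, 30)],               # document 3
)


def documents_within(conn, tokens, max_distance):
    """Return (storage_data_id, shortest covering distance) for documents
    in which all tokens occur within max_distance positions."""
    marks = ",".join("?" * len(tokens))
    sql = f"""
        SELECT storage_data_id, MIN(distance) AS min_distance
        FROM (
            SELECT dt_matrix_token_id, storage_data_id,
                   MAX(position2) - position1 AS distance
            FROM (
                SELECT mt1.dt_matrix_token_id, mt1.storage_data_id,
                       mt1.position AS position1,
                       MIN(mt2.position) AS position2
                FROM dt_matrix_token mt1
                JOIN dt_matrix_token mt2
                  ON mt2.storage_data_id = mt1.storage_data_id
                 AND mt2.token_id <> mt1.token_id
                 AND mt2.position >= mt1.position
                WHERE mt1.token_id IN ({marks})
                  AND mt2.token_id IN ({marks})
                GROUP BY mt1.dt_matrix_token_id, mt1.storage_data_id,
                         mt1.position, mt2.token_id
            ) AS nearest
            GROUP BY dt_matrix_token_id, storage_data_id, position1
            HAVING COUNT(*) = ?
        ) AS spans
        GROUP BY storage_data_id
        HAVING MIN(distance) <= ?
        ORDER BY storage_data_id
    """
    params = list(tokens) * 2 + [len(tokens) - 1, max_distance]
    return conn.execute(sql, params).fetchall()


print(documents_within(conn, [1, 2, 3], 20))  # [(1, 7)]
print(documents_within(conn, [1, 2, 3], 30))  # [(1, 7), (3, 29)]
```

Because the token count is derived from the list, the same helper handles two tokens or more than three without changing the SQL text by hand.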