hadoop - Extracting non-matching records between files in Pig Latin
I am a beginner learning Pig Latin. I need to extract records from a file. I have created two files, t1 and t2. Some tuples are common to both files, so I need to extract the tuples that are present only in t1 and omit the tuples common to t1 and t2. Can someone please help me?
Thanks
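Firstly, you'll want to picture a Venn diagram of the two files. What you want is everything but the middle bit. So first you need to do a full outer join on the data. Then, since nulls are created in an outer join when the key is not common, you will want to filter the result of the join so that it only contains lines that have one null (the non-intersecting part of the Venn diagram).
This is how it would look in a Pig script: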
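    -- t1 and t2 are the two sets of tuples you are using; their schemas are:
    -- t1: {t: (num1: int, num2: int)}
    -- t2: {t: (num1: int, num2: int)}
    -- Yours will be different, but the principle is the same

    -- Full outer join: rows without a match get a null on the other side
    b = JOIN t1 BY t FULL, t2 BY t ;
    -- Keep only the rows where one side is null
    c = FILTER b BY t1::t IS NULL OR t2::t IS NULL ;
    -- Keep whichever side is not null
    d = FOREACH c GENERATE (t1::t IS NOT NULL ? t1::t : t2::t) ;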
Walking through the steps using this sample input:

    t1:      t2:
    (1,2)    (4,5)
    (3,4)    (1,2)
b is the full outer join, resulting in:

    b: {t1::t: (num1: int,num2: int),t2::t: (num1: int,num2: int)}
    ((1,2),(1,2))
    (,(4,5))
    ((3,4),)
t1 is the left tuple, and t2 is the right tuple. We have to use :: to identify which t we mean, since they have the same name.
Now, c filters b so that only the lines with a null are kept, resulting in:

    c: {t1::t: (num1: int,num2: int),t2::t: (num1: int,num2: int)}
    (,(4,5))
    ((3,4),)
This is the output you want, but it is a little messy to use. d uses a bincond (the ?:) to remove the null. So the final output will be:

    d: {t1::t: (num1: int,num2: int)}
    ((4,5))
    ((3,4))
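To try the walkthrough end to end, the steps can be collected into one script. This is only a sketch: the file names t1.txt and t2.txt are hypothetical, and it assumes each line of those files holds one tuple written as text, such as (1,2), so that PigStorage can read it into a tuple field declared in the schema.

```pig
-- Hypothetical input files; each line is assumed to hold one tuple, e.g. (1,2)
t1 = LOAD 't1.txt' AS (t:tuple(num1:int, num2:int));
t2 = LOAD 't2.txt' AS (t:tuple(num1:int, num2:int));

-- Full outer join: rows without a match get a null on the other side
b = JOIN t1 BY t FULL OUTER, t2 BY t;

-- Keep only the rows where one side is null (tuples not common to both files)
c = FILTER b BY t1::t IS NULL OR t2::t IS NULL;

-- Collapse each row to whichever side is not null
d = FOREACH c GENERATE (t1::t IS NOT NULL ? t1::t : t2::t) AS t;

-- Print the schema of d, then its contents
DESCRIBE d;
DUMP d;
```

DESCRIBE is handy after each step for checking the ::-qualified field names. The order of the rows printed by DUMP is not guaranteed.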
Update:
If you want to keep only the left (t1) side of the join (or the right (t2) side, if you switch things around), you can do this:

    -- b is the same

    -- We only want to keep tuples where the t2 tuple is null
    c = FILTER b BY t2::t IS NULL ;
    -- Generate t1::t to get rid of the null t2::t
    d = FOREACH c GENERATE t1::t ;
However, looking back at the original Venn diagram, using a full join is unnecessary. If you look at a different Venn diagram, the one for a left outer join, you can see that it covers the set you want without any extra operations. Therefore, you should change b to:

    b = JOIN t1 BY t LEFT, t2 BY t ;
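Combining the left join with the filter and projection from the update, a complete script that returns only the tuples unique to t1 might look like the sketch below. The file names are hypothetical, and the input is assumed to hold one tuple per line, such as (1,2).

```pig
-- Hypothetical input files; each line is assumed to hold one tuple, e.g. (1,2)
t1 = LOAD 't1.txt' AS (t:tuple(num1:int, num2:int));
t2 = LOAD 't2.txt' AS (t:tuple(num1:int, num2:int));

-- Left outer join: every t1 row is kept; t2::t is null when there is no match
b = JOIN t1 BY t LEFT OUTER, t2 BY t;

-- Rows with no match in t2 are the tuples present only in t1
c = FILTER b BY t2::t IS NULL;

-- Drop the null t2 column
d = FOREACH c GENERATE t1::t AS t;

DUMP d;
```

With the sample input from the walkthrough, only (3,4) has no match in t2, so it should be the single tuple left in d.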