hadoop - Extracting non-matching records between files in Pig Latin


I am a beginner learning Pig Latin, and I need to extract records from a file. I have created two files, t1 and t2. Some tuples are common to both files. I need to extract the tuples that are present only in t1 and omit the tuples common to t1 and t2. Can anyone please help me?

Thanks

First, picture a Venn diagram of the two files. You want everything except the middle bit (the intersection). So first you need to do a full outer join on the data. Then, since nulls are created in an outer join when a key is not common to both sides, you filter the result of the join so that it only contains lines that have one null (the non-intersecting part of the Venn diagram).

This is how it would look in a Pig script:

    -- t1 and t2 are the two sets of tuples you are using; their schemas are:
    -- t1: {t: (num1: int, num2: int)}
    -- t2: {t: (num1: int, num2: int)}
    -- Yours will be different, but the principle is the same.

    b = JOIN t1 BY t FULL, t2 BY t;
    c = FILTER b BY t1::t IS NULL OR t2::t IS NULL;
    d = FOREACH c GENERATE (t1::t IS NOT NULL ? t1::t : t2::t);
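As a rough, self-contained sketch of the same idea, the script below loads two comma-separated files and joins on flat fields instead of a nested tuple. The file names `t1.txt` and `t2.txt`, the `PigStorage(',')` delimiter, and the flat `(num1, num2)` schema are assumptions made for illustration; they are not part of the original question. It also assumes the key fields never contain nulls in the input data.

```pig
-- Hypothetical inputs: one "num1,num2" pair per line in each file
t1 = LOAD 't1.txt' USING PigStorage(',') AS (num1:int, num2:int);
t2 = LOAD 't2.txt' USING PigStorage(',') AS (num1:int, num2:int);

-- Full outer join: rows without a partner get nulls on the missing side
b = JOIN t1 BY (num1, num2) FULL OUTER, t2 BY (num1, num2);

-- Keep only rows where one side is missing (the non-intersecting part)
c = FILTER b BY t1::num1 IS NULL OR t2::num1 IS NULL;

-- Collapse each row to whichever side is present
d = FOREACH c GENERATE
        (t1::num1 IS NOT NULL ? t1::num1 : t2::num1) AS num1,
        (t1::num2 IS NOT NULL ? t1::num2 : t2::num2) AS num2;

DUMP d;
```

Saved under a name of your choice, a script like this can be tried on small files with `pig -x local` before running it on a cluster.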

Walking through the steps using this sample input:

    t1:      t2:
    (1,2)    (4,5)
    (3,4)    (1,2)

b is the full outer join, resulting in:

    b: {t1::t: (num1: int,num2: int),t2::t: (num1: int,num2: int)}
    ((1,2),(1,2))
    (,(4,5))
    ((3,4),)

t1 is the left tuple, and t2 is the right tuple. We have to use :: to identify which t we mean, since both have the same name.

Now, c filters b so that only lines with a null are kept, resulting in:

    c: {t1::t: (num1: int,num2: int),t2::t: (num1: int,num2: int)}
    (,(4,5))
    ((3,4),)

This is the output you want, but it is a little messy to use. d uses a bincond (the ?:) to remove the null. So the final output will be:

    d: {t1::t: (num1: int,num2: int)}
    ((4,5))
    ((3,4))

Update:
If you want to keep only the left (t1) side of the join (or the right (t2) side, if you switch things around), you can do this:

    -- b is the same

    -- We only want to keep tuples where the t2 tuple is null
    c = FILTER b BY t2::t IS NULL;
    -- Generate t1::t to get rid of the null t2::t
    d = FOREACH c GENERATE t1::t;

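However, looking at the original Venn diagram, using a full join is unnecessary. If you look at the Venn diagram for a left join instead, you can see that it covers the set you want without any extra operations. The filter and foreach steps stay the same. Therefore, you should change b to: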

    b = JOIN t1 BY t LEFT, t2 BY t;
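Putting the left-join version together, a minimal end-to-end sketch for "tuples only in t1" might look like the following. As before, the file names `t1.txt` and `t2.txt`, the comma delimiter, and the flat `(num1, num2)` schema are assumptions for illustration only, and the sketch assumes the key fields never contain nulls in the input data.

```pig
-- Hypothetical inputs: one "num1,num2" pair per line in each file
t1 = LOAD 't1.txt' USING PigStorage(',') AS (num1:int, num2:int);
t2 = LOAD 't2.txt' USING PigStorage(',') AS (num1:int, num2:int);

-- Left outer join: every t1 row is kept; t2 fields are null when there is no match
b = JOIN t1 BY (num1, num2) LEFT OUTER, t2 BY (num1, num2);

-- A null t2 key means the t1 row has no partner in t2
c = FILTER b BY t2::num1 IS NULL;

-- Drop the all-null t2 columns
d = FOREACH c GENERATE t1::num1 AS num1, t1::num2 AS num2;

DUMP d;
```

With the sample input from the walkthrough, the only t1 tuple without a match in t2 is (3,4), so that is the row this sketch is meant to keep.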
