java - HBase Schema Design - Best Practice


I have switched from an RDBMS to HBase for handling millions of records. Being a newbie, I am not sure about the most efficient way of designing an HBase schema. The scenario: I have text files containing hundreds of thousands to millions of records which I have to read and store in HBase. There are two sets of text files (a raw data file and a label file) that are linked to each other and belong to the same user, so I have made two separate tables (rawdata and label) and I am storing the information there. The raw data file and the rawdata table look like this:

(screenshots of the raw data file and the rawdata table)

So, as you can see, in the rawdata table the row key is the file name of the text file (01-01-all-data.txt) plus the row number of each row of the text file, the column family is simply 'r', the column qualifiers are the columns of the text file, and the values are the values of those columns. That is how I am inserting records into the table. I also have a third table (mapfile) which stores the name of the text file as the row key, the user id as the column qualifier, and the total number of records in the text file as the value. It looks like this:

            01-01-all-data.txt       column=m:1, timestamp=1375189274467, value=146209   

I use the mapfile table in order to read the rawdata table row by row.
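Roughly, the read path looks like this (a simplified sketch using the HBase Java client; the composite key layout "filename-rowNumber" and storing the count as a string are how I currently do it):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MapfileDrivenRead {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table mapfile = conn.getTable(TableName.valueOf("mapfile"));
                 Table rawdata = conn.getTable(TableName.valueOf("rawdata"))) {

                // Look up how many rows the text file contributed (value of m:1,
                // stored as a string in my case).
                Result meta = mapfile.get(new Get(Bytes.toBytes("01-01-all-data.txt")));
                long totalRows = Long.parseLong(
                        Bytes.toString(meta.getValue(Bytes.toBytes("m"), Bytes.toBytes("1"))));

                // Read the rawdata table row by row using the composite row key.
                for (long i = 1; i <= totalRows; i++) {
                    Result row = rawdata.get(new Get(Bytes.toBytes("01-01-all-data.txt-" + i)));
                    // ... process the columns under family 'r' ...
                }
            }
        }
    }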

What is your suggestion for this kind of HBase schema? Is it the proper way, or does it not make sense in terms of HBase concepts?

Furthermore, it is worth mentioning that it is taking around 3 minutes to insert a 21 MB file with 146,207 rows into HBase.

Please advise.

Thanks

Although I don't find anything wrong with your current schema, whether it is appropriate or not can only be decided after analyzing your use case and your most frequent access patterns. It is correct but perhaps not appropriate, IMHO. Since I don't have that information, my suggestions may sound incorrect; please let me know if that is the case and I'll update the answer accordingly. Here you go:

Does it make sense (keeping your data and access patterns in mind) to have a single table with three column families:

  • rd - for the raw data file, holding the columns of that file,
  • lf - for the label file, holding the columns of that file, and
  • mf - for the mapfile, having one column that holds the number of records in the text file.

Use the userid as the rowkey: it is unique and doesn't seem too lengthy. This design bypasses the overhead of shunting from one table to another while fetching the data.
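Something along these lines (just a sketch with the current Java client; the table name, qualifiers and values are made up to illustrate the layout):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SingleTablePutExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("userdata"))) {

                // One row per user: the userid is the row key.
                Put put = new Put(Bytes.toBytes("user-42"));

                // rd: columns coming from the raw data file (qualifiers are illustrative).
                put.addColumn(Bytes.toBytes("rd"), Bytes.toBytes("col1"), Bytes.toBytes("value1"));
                put.addColumn(Bytes.toBytes("rd"), Bytes.toBytes("col2"), Bytes.toBytes("value2"));

                // lf: columns coming from the label file.
                put.addColumn(Bytes.toBytes("lf"), Bytes.toBytes("label"), Bytes.toBytes("A"));

                // mf: one column holding the number of records of the source text file.
                put.addColumn(Bytes.toBytes("mf"), Bytes.toBytes("count"), Bytes.toBytes(146209L));

                table.put(put);
            }
        }
    }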

A few more suggestions:

  • If the userids are monotonically increasing, hash your rowkeys so that you don't suffer from regionserver hotspotting (see the sketch after this list).
  • You could create pre-split tables in order to get a better distribution.
  • Shorten column names if possible.
  • Keep the number of versions as low as possible.
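For example, a rough sketch of hashed rowkeys plus a pre-split table with the admin API (table name, family names and split points are only illustrative):

    import java.security.MessageDigest;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitExample {
        // Prefix the userid with a couple of MD5 bytes so that consecutive ids
        // land in different regions (the original id stays readable in the key).
        static byte[] hashedRowKey(String userId) throws Exception {
            byte[] digest = MessageDigest.getInstance("MD5").digest(Bytes.toBytes(userId));
            return Bytes.add(Bytes.copy(digest, 0, 2), Bytes.toBytes(userId));
        }

        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = conn.getAdmin()) {

                TableDescriptorBuilder builder = TableDescriptorBuilder
                        .newBuilder(TableName.valueOf("userdata"))
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("rd"))
                                .setMaxVersions(1)   // keep the number of versions low
                                .build())
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("lf"))
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("mf"));

                // Pre-split on the hash-prefix byte range so writes spread across regions.
                byte[][] splits = new byte[][] {
                        new byte[] { (byte) 0x40 }, new byte[] { (byte) 0x80 }, new byte[] { (byte) 0xC0 }
                };
                admin.createTable(builder.build(), splits);
            }
        }
    }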

"Furthermore, it is worth mentioning that it is taking around 3 minutes to insert a 21 MB file with 146,207 rows into HBase."

How are you inserting the data? MapReduce or the plain Java + HBase API? What is your cluster size? What about the configuration and specs?
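If it is the plain Java API with one put per row, buffering the puts usually helps a lot, since it cuts down the number of RPCs. A rough sketch (the table/family names and key layout follow your question; the file path and tab delimiter are assumptions):

    import java.io.BufferedReader;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.BufferedMutator;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BufferedLoadExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 BufferedMutator mutator = conn.getBufferedMutator(TableName.valueOf("rawdata"));
                 BufferedReader reader = Files.newBufferedReader(
                         Paths.get("01-01-all-data.txt"), StandardCharsets.UTF_8)) {

                String line;
                long rowNumber = 0;
                while ((line = reader.readLine()) != null) {
                    rowNumber++;
                    String[] fields = line.split("\t");   // assumed delimiter
                    Put put = new Put(Bytes.toBytes("01-01-all-data.txt-" + rowNumber));
                    for (int i = 0; i < fields.length; i++) {
                        put.addColumn(Bytes.toBytes("r"), Bytes.toBytes("c" + i), Bytes.toBytes(fields[i]));
                    }
                    mutator.mutate(put);                  // buffered, flushed in batches
                }
                mutator.flush();                          // push any remaining buffered puts
            }
        }
    }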

You might find these links useful:

hth

