In Hive, partitioning and bucketing are two techniques used for organizing and optimizing data storage and querying.
Hive Partition with bucket Example
Scenario: Consider a dataset containing drug sales details with six columns: pid, pname, drug, gender, tot_amt, country.
Agenda: partition by the country column and bucket by the pid column.
In this table, we can partition by the country column. Beyond that we cannot partition further, since each remaining record is unique, so instead we bucket by the pid column into 3 buckets.
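Conceptually, Hive assigns each row to a bucket by hashing the clustering column modulo the bucket count; for an INT column, hash(pid) is simply pid. As a sketch (using the staging table defined later in this post), you can preview which bucket each pid would land in with Hive's built-in hash() and pmod() functions:

```sql
-- Sketch: bucket id = hash(pid) mod 3; for an INT column, hash(pid) is pid itself.
-- Assumes stage_pwb_table, the staging table created later in this post.
SELECT pid, pmod(hash(pid), 3) AS bucket_id
FROM stage_pwb_table;
```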
Note: a small data set is used here; if you want more details about choosing the bucket count, read this blog.
Create table partition with bucket
CREATE TABLE partition_bucketed(
pid INT,
pname STRING,
drug STRING,
gender STRING,
tot_amt INT
)
PARTITIONED BY (country STRING)
CLUSTERED BY (pid) INTO 3 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
-- insert records from the staging table (dynamic partition on country)
INSERT INTO TABLE partition_bucketed PARTITION (country)
SELECT
pid,
pname,
drug,
gender,
tot_amt,
country
FROM stage_pwb_table
DISTRIBUTE BY country;
Set the properties below before running the insert, to enable dynamic partitioning and enforce bucketing.
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;
set hive.enforce.bucketing = true;
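Once the table is loaded, both techniques pay off at query time: a predicate on country prunes whole partition directories, and TABLESAMPLE can read a single bucket file instead of scanning all three. A sketch (the country value 'usa' is a placeholder for whatever values your data contains):

```sql
-- Partition pruning: only the country='usa' directory is scanned.
SELECT pname, tot_amt
FROM partition_bucketed
WHERE country = 'usa';

-- Bucket sampling: read roughly one third of the data (bucket 1 of 3 on pid).
SELECT pid, drug
FROM partition_bucketed TABLESAMPLE (BUCKET 1 OUT OF 3 ON pid)
WHERE country = 'usa';
```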
Normal (staging) table:
Here, I have created a table to load the data from a text file. You can also use a temporary table.
CREATE TABLE stage_pwb_table (
pid INT,
pname STRING,
drug STRING,
gender STRING,
tot_amt INT,
country STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Load the data from the text file:
load data local inpath '/home/navin/Documents/data1.txt'
into table stage_pwb_table;
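To verify the result, list the partitions and inspect one partition directory; each partition should contain three bucket files (000000_0, 000001_0, 000002_0). The warehouse path below assumes a default Hive setup:

```sql
-- List the dynamically created partitions.
SHOW PARTITIONS partition_bucketed;

-- Inspect one partition directory from the Hive CLI; expect 3 bucket files.
dfs -ls /user/hive/warehouse/partition_bucketed/country=usa;
```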
Conclusion:
Do a small POC to understand this better, and use a large data set to see the performance benefits.