In Hive, partitioning and bucketing are two techniques used for organizing and optimizing data storage and querying.
Hive Partition with bucket Example
Scenario: Consider a dataset containing drug sales details with six columns: pid, pname, drug, gender, tot_amt, country.
Agenda: partition by the country column and bucket by the pid column.
In this table, we can partition by the country column. Beyond that we cannot partition further, since each remaining record is unique, so instead we bucket by the pid column into 3 buckets.
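Conceptually, Hive assigns each row to a bucket by hashing the clustering column modulo the bucket count; for an INT column, hash(pid) is simply pid. As a sketch (using the staging table defined later in this post), you can preview which bucket each pid would land in with Hive's built-in hash() and pmod() functions:

```sql
-- Sketch: bucket id = hash(pid) mod 3; for an INT column, hash(pid) is pid itself.
-- Assumes stage_pwb_table, the staging table created later in this post.
SELECT pid, pmod(hash(pid), 3) AS bucket_id
FROM stage_pwb_table;
```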
Note: a small data set is used here; if you want more details about choosing the bucket count, read this blog.
Create table partition with bucket
CREATE TABLE partition_bucketed(
pid INT,
pname STRING,
drug STRING,
gender STRING,
tot_amt INT
)
PARTITIONED BY (country STRING)
CLUSTERED BY (pid) INTO 3 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
-- insert records from the staging table (dynamic partition on country)
INSERT INTO TABLE partition_bucketed PARTITION (country)
SELECT
pid,
pname,
drug,
gender,
tot_amt,
country
FROM stage_pwb_table
DISTRIBUTE BY country;
Set the properties below before running the insert, to enable dynamic partitioning and enforce bucketing.
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;
set hive.enforce.bucketing = true;
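Once the table is loaded, both techniques pay off at query time: a predicate on country prunes whole partition directories, and TABLESAMPLE can read a single bucket file instead of scanning all three. A sketch (the country value 'usa' is a placeholder for whatever values your data contains):

```sql
-- Partition pruning: only the country='usa' directory is scanned.
SELECT pname, tot_amt
FROM partition_bucketed
WHERE country = 'usa';

-- Bucket sampling: read roughly one third of the data (bucket 1 of 3 on pid).
SELECT pid, drug
FROM partition_bucketed TABLESAMPLE (BUCKET 1 OUT OF 3 ON pid)
WHERE country = 'usa';
```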
Normal (staging) table:
Here, I have created a table to load the data from a text file. You can also use a temporary table.
CREATE TABLE stage_pwb_table (
pid INT,
pname STRING,
drug STRING,
gender STRING,
tot_amt INT,
country STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Load the data from the text file:
load data local inpath '/home/navin/Documents/data1.txt'
into table stage_pwb_table;
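To verify the result, list the partitions and inspect one partition directory; each partition should contain three bucket files (000000_0, 000001_0, 000002_0). The warehouse path below assumes a default Hive setup:

```sql
-- List the dynamically created partitions.
SHOW PARTITIONS partition_bucketed;

-- Inspect one partition directory from the Hive CLI; expect 3 bucket files.
dfs -ls /user/hive/warehouse/partition_bucketed/country=usa;
```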
Conclusion:
Do a small POC to understand this better, and use a large data set to see the performance benefits.