Sunday, June 10, 2012

JDBC Storage Handler for Hive

I was able to complete the implementation of Hive JDBC Storage Handler with basic functionality. Therefore I thought to write a blog post describing the usage with some sample queries. Currently It supports writing into any database and reading from major databases (MySql, MsSql, Oracle, H2, PostgreSQL). This feature comes with WSO2 BAM 2.0.0 release. 

Setting up the BAM to use Hive jdbc-handler. 


Please add your jdbc-driver to $BAM_HOME/repository/component/lib directory, before starting the server. 

Web UI for executing Hive queries.

BAM2 comes with a web ui for executing the Hive queries. Also there is a option to schedule the script

 
User interface for writing Hive Queries


User interface for scheduling hive script

Sample on writing analyzed data into JDBC 


Here I am going to demonstrate the functionality of writing the analyzed data into JDBC storage. In this simple example, We'll fetch records from a file then analyze it using hive and finally store those analyzed data into MySQL database. 

Records - These are the records that we are going to analyze.



bread   12      12/01/2012
sugar   20      12/01/2012
milk    5       12/01/2012
tea     33      12/01/2012
soap    10      12/01/2012
tea     9       13/01/2012
bread   21      13/01/2012
sugar   9       13/01/2012
milk    14      13/01/2012
soap    8       13/01/2012
biscuit 10      14/01/2012


Hive Queries



//drop tables if already exist
 
drop table productTable;
 
drop table summarizedTable;
 
CREATE TABLE productTable (product STRING, noOfItems INT, dateOfSold STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
 
//Load the file with above records
 
load data local inpath '/opt/sample/data/productInfo.txt' into table productTable;
 
CREATE EXTERNAL TABLE IF NOT EXISTS
summarizedTable( product STRING, itemsSold INT) 
STORED BY 'org.wso2.carbon.hadoop.hive.jdbc.storage.JDBCStorageHandler'
    TBLPROPERTIES (
                'mapred.jdbc.driver.class' = 'com.mysql.jdbc.Driver',
                'mapred.jdbc.url' = 'jdbc:mysql://localhost/test',
                'mapred.jdbc.username' = 'username',
                'mapred.jdbc.password' = 'password',
                'hive.jdbc.update.on.duplicate'= 'true',
                'hive.jdbc.primary.key.fields'='product',
                'hive.jdbc.table.create.query''CREATE TABLE productSummary (product VARCHAR(50) NOT NULL PRIMARY KEY, itemsSold INT NOT NULL)');
 
insert overwrite table summarizedTable SELECT product, sum(noOfItems) FROM productTable GROUP BY product;


View the result in mysql.

mysql> select * from productSummary;
+---------+-----------+
| product | itemsSold |
+---------+-----------+
| biscuit |        10 |
| bread   |        33 |
| milk    |        19 |
| soap    |        18 |
| sugar   |        29 |
| tea     |        42 |
+---------+-----------+
6 rows in set (0.00 sec)


Detail description on TBLPROPERTIES in storage handler.



Property name Required Detail
mapred.jdbc.driver.class Yes
The classname for the JDBC Driver to use. This should be available on Hive's classpath.
mapred.jdbc.url  YesThe connection url for the database.
mapred.jdbc.username NoThe database username, if it's required.
mapred.jdbc.password  No The database Password, if it's required.
hive.jdbc.table.create.query No
If table already exist in the database, then you don't need this. Otherwise you should provide the sql query for creating the table in the database.

mapred.jdbc.output.table.name  No
The name of the table in the database. It does not have to be the same as the name of the table in Hive. If you have specified the sql query for creating the table, handler will pick the table name from query. Otherwise you need to specify this if your meta table name is different from the table in database.
hive.jdbc.primary.key.fields YesIf you have any primary keys in the database table
hive.jdbc.update.on.duplicate No
Expected values are either "true" or "false". If "true" then the storage handler will update the records with duplicate keys. Otherwise it will insert all data. 

hive.jdbc.output.upsert.queryNo
This can be use to optimize the update operation. The default implementation is  to use insert or update statement after the select statement. So there will be two database round trips. But we can reduce it to one by using db specific upsert statement. Example query for mysql database is 'INSERT INTO productSummary (product, itemsSold) values (?,?) ON DUPLICATE KEY UPDATE itemsSold=?'

hive.jdbc.upsert.query.values.order No
If you are using an upsert query then this is mandatory. sample values for above query will be 'product,itemsSold,itemsSold' //values order for each question mark 

hive.jdbc.input.columns.mapping No
This is mandatory if your field names in meta table and database tables are different. Provide the field names in database table in the same order as the field names in meta table with ',' separated values. example: productNames,noOfItemsSold. These will map to your meta table with product,itemsSold field names.

mapred.jdbc.input.table.name No
Used when reading from a database table. This is needed if the meta table name and database table name are different.



Sample on reading from JDBC.


Now I am going to read the previously saved records from mysql using hive jdbc-handler.

Hive queries



//drop table if already exists
drop table savedRecords;
 
CREATE EXTERNAL TABLE IF NOT EXISTS savedRecords( product STRING, itemsSold INT) 
STORED BY 'org.wso2.carbon.hadoop.hive.jdbc.storage.JDBCStorageHandler'        
             TBLPROPERTIES (                
                    'mapred.jdbc.driver.class' = 'com.mysql.jdbc.Driver',
                    'mapred.jdbc.url' = 'jdbc:mysql://localhost/test', 
                    'mapred.jdbc.username' = 'username',     
                    'mapred.jdbc.password' = 'password',
                    'mapred.jdbc.input.table.name' = 'productSummary');
SELECT product,itemsSold FROM savedRecords ORDER BY itemsSold;

This will give all the records in the productSummary table.