Information Integration Blog: Steps to Kerberize HDFS in Cloudera Manager and access it from Information Server

Overview:

This post describes how to Kerberize HDFS (and, for that matter, all other services) in Cloudera Manager and how to access the data stored in HDFS from Information Server for profiling, data quality analysis, data integration, and so on.

Prerequisites:

Install Cloudera Manager on one server (for example, “blr01.ibm.com”).
Install InfoSphere Information Server on a different server (for example, “blr02.ibm.com”).
Install and set up the Kerberos server on “blr02.ibm.com”. (This can also be done on any other server, as long as all Kerberos clients refer to it in their configurations.)
Install and set up the Kerberos client on “blr01.ibm.com” (where Cloudera Manager is installed) and on “blr02.ibm.com” (if the Kerberos server is not set up on that machine).

KDC Infrastructure setup:

The main steps for installing and setting up the KDC are:

1. Install Kerberos V5 server/client libraries

2. Install Master KDC

Edit Configuration Files
Create Database
Add Administrators to the ACL File
Add Administrators to the Kerberos Database
Create a kadmind Keytab
Start the Kerberos Daemons on the Master KDC

3. Install Slave KDCs (optional but highly recommended)

4. Propagate Database to Slave KDCs

5. Create Stash Files and Start krb5kdc Daemons on Slave KDCs

6. Configure Kerberos Client machines
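
The master KDC steps listed above map roughly onto the MIT Kerberos commands below. This is only a sketch under several assumptions: a RHEL-style layout under /var/kerberos/krb5kdc, RHEL-style service names, the IPS.COM realm from the samples in this post, and an administrator principal called admin/admin (a hypothetical name). The run helper is illustrative; it only prints each command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch of the master KDC steps. Commands are only printed unless DRY_RUN=0.
# Assumptions: MIT Kerberos, RHEL-style paths and service names,
# and a hypothetical administrator principal "admin/admin".

REALM="${REALM:-IPS.COM}"
KDC_DIR="${KDC_DIR:-/var/kerberos/krb5kdc}"

# run: print the command in dry-run mode, otherwise execute it
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

setup_master_kdc() {
    # Create the database and a stash file (-s) for the master key
    run kdb5_util create -r "$REALM" -s
    # Add an administrator to the Kerberos database (prompts for a password)
    run kadmin.local -q "addprinc admin/admin@$REALM"
    # Create the kadmind keytab referenced by admin_keytab in kdc.conf
    run kadmin.local -q "ktadd -k $KDC_DIR/kadm5.keytab kadmin/admin kadmin/changepw"
    # Start the Kerberos daemons
    run service krb5kdc start
    run service kadmin start
}

# ACL line that grants every */admin principal full rights;
# it belongs in $KDC_DIR/kadm5.acl before kadmind is started.
acl_entry() {
    echo "*/admin@$REALM *"
}

setup_master_kdc
```

In dry-run mode the quoting of the -q arguments is not shown in the printed output; when executed, each -q argument is passed to kadmin.local as a single string.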

Sample kdc.conf

[kdcdefaults]

kdc_ports = 88

kdc_tcp_ports = 88

[realms]

IPS.COM = {

master_key_type = aes128-cts

max_life = 2d

max_renewable_life = 2w

acl_file = /var/kerberos/krb5kdc/kadm5.acl

dict_file = /usr/share/dict/words

admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab

supported_enctypes = aes128-cts:normal des3-hmac-sha1:normal arcfour-hmac:normal des-hmac-sha1:normal des-cbc-md5:normal des-cbc-crc:normal

}

Sample krb5.conf

[logging]

default = FILE:/var/log/krb5libs.log

kdc = FILE:/var/log/krb5kdc.log

admin_server = FILE:/var/log/kadmind.log

[libdefaults]

default_realm = IPS.COM

dns_lookup_realm = false

dns_lookup_kdc = false

ticket_lifetime = 2d

renew_lifetime = 2w

kdc_timeout = 10s

forwardable = true

renewable = true

[realms]

IPS.COM = {

kdc = blr02.ibm.com:88

admin_server = blr02.ibm.com:749

}

[domain_realm]

.ibm.com = IPS.COM

[kdc]

profile=/var/kerberos/krb5kdc/kdc.conf
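
A realm or KDC host that differs between machines is an easy mistake to make, so it can help to print what a given krb5.conf actually says. The following is a minimal sketch; krb5_value is a hypothetical helper written for this post (not part of any Kerberos distribution), and it only handles simple "key = value" lines whose values contain no "=".

```shell
#!/bin/sh
# Sketch: read a value from a named section of a krb5.conf-style file.
# krb5_value is a hypothetical helper, not a Kerberos tool.

# krb5_value FILE SECTION KEY: print the first value of KEY inside [SECTION]
krb5_value() {
    awk -F'=' -v section="$2" -v key="$3" '
        /^[ \t]*\[/ {
            # Remember the current section name without its brackets
            sub(/^[ \t]*\[/, ""); sub(/\].*$/, "")
            current = $0
            next
        }
        current == section {
            k = $1; gsub(/^[ \t]+|[ \t]+$/, "", k)
            if (k == key) {
                v = $2; gsub(/^[ \t]+|[ \t]+$/, "", v)
                print v
                exit
            }
        }' "$1"
}

conf="${KRB5_CONFIG:-/etc/krb5.conf}"
if [ -r "$conf" ]; then
    echo "default realm: $(krb5_value "$conf" libdefaults default_realm)"
    echo "kdc:           $(krb5_value "$conf" realms kdc)"
fi
```

The section argument matters because the sample file has a kdc key both in [logging] and inside the realm definition in [realms].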

The following links give an overview of the KDC infrastructure and describe its installation and setup:

http://tldp.org/HOWTO/Kerberos-Infrastructure-HOWTO/overview.html

http://tldp.org/HOWTO/Kerberos-Infrastructure-HOWTO/install.html

http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.2/bk_using_Ambari_book/content/ambari-kerb.html

Enable Kerberos Security in Cloudera Manager:

Log on to the Cloudera Manager admin console (for example: http://blr01.ibm.com:7180/).

To enable Hadoop security, follow the steps described in http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/latest/Configuring-Hadoop-Security-with-Cloudera-Manager/Configuring-Hadoop-Security-with-Cloudera-Manager.html

These steps create the required keytab files on the server where Cloudera Manager is installed. One of these keytab files needs to be transferred to the client (blr02.ibm.com, where Information Server is installed).

Drivers used:

Cloudera ODBC Driver for Apache Hive

(http://www.cloudera.com/content/cloudera-content/cloudera-docs/Connectors/PDF/Cloudera-ODBC-Driver-for-Apache-Hive-Install-Guide.pdf)

This driver needs to be installed on the server where Information Server is installed (blr02.ibm.com).

Note: Always use the latest version of the Cloudera ODBC Driver for Apache Hive to avoid performance issues.

Mandatory steps on the server where the Information Server (IS) Engine is installed:

Add a DSN entry to the .odbc.ini file:

[Hive_Cloudera]

Driver=/opt/cloudera/hiveodbc/lib/64/libclouderahiveodbc64.so

Host=blr01.ibm.com

Port=10000

Schema=default

DefaultStringColumnLength=255

FastSQLPrepare=0

UseNativeQuery=0

HiveServerType=2

AuthMech=1

HS2AuthMech=1

HS2HostFQDN=blr01.ibm.com

HS2KrbServiceName=hive

HS2KrbRealm=IPS.COM
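
Since a missing Kerberos key in the DSN usually only shows up as a failed connection, a quick check of the section can save time. The sketch below lists keys from the sample above that are absent from a DSN section. missing_dsn_keys is a hypothetical helper, the key list simply mirrors this post's sample rather than any documented driver requirement, and the default file location ($ODBCINI, falling back to ~/.odbc.ini) is an assumption.

```shell
#!/bin/sh
# Sketch: list Kerberos-related keys that are missing from a DSN section.
# missing_dsn_keys is a hypothetical helper written for this post.

# missing_dsn_keys FILE DSN: print each expected key that is absent from [DSN]
missing_dsn_keys() {
    for key in Driver Host Port HiveServerType AuthMech \
               HS2HostFQDN HS2KrbServiceName HS2KrbRealm; do
        awk -F'=' -v dsn="[$2]" -v key="$key" '
            /^\[/  { in_dsn = ($0 == dsn); next }   # track the current section
            in_dsn { k = $1; gsub(/[ \t]/, "", k); if (k == key) found = 1 }
            END    { exit !found }' "$1" || echo "$key"
    done
}

# Assumed location of the ODBC configuration file
ini="${ODBCINI:-$HOME/.odbc.ini}"
if [ -r "$ini" ]; then
    missing_dsn_keys "$ini" Hive_Cloudera
fi
```

No output means every key in the list was found in the [Hive_Cloudera] section.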

Create the file /opt/IBM/InformationServer/Server/DSEngine/.cloudera.hiveodbc.ini with the following contents:

[Driver]

## - Note that this default DriverManagerEncoding of UTF-32 is for iODBC.

## - unixODBC uses UTF-16 by default.

## - If unixODBC was compiled with -DSQL_WCHART_CONVERT, then UTF-32 is the correct value.

## Execute 'odbc_config --cflags' to determine if you need UTF-32 or UTF-16 on unixODBC

## - SimbaDM can be used with UTF-8 or UTF-16.

## The DriverUnicodeEncoding setting will cause SimbaDM to run in UTF-8 when set to 2 or UTF-16 when set to 1.

DriverManagerEncoding=UTF-8

ErrorMessagesPath=/opt/cloudera/hiveodbc/ErrorMessages/

LogLevel=0

LogPath=/tmp

## - Uncomment the ODBCInstLib corresponding to the Driver Manager being used.

## - Note that the path to your ODBC Driver Manager must be specified in LD_LIBRARY_PATH (LIBPATH for AIX).

## - Note that AIX has a different format for specifying its shared libraries.

# Generic ODBCInstLib

# iODBC

ODBCInstLib=/opt/IBM/InformationServer/Server/branded_odbc/lib/libodbcinst.so

# SimbaDM / unixODBC

#ODBCInstLib=libodbcinst.so

# AIX specific ODBCInstLib

# iODBC

#ODBCInstLib=libiodbcinst.a(libiodbcinst.so.2)

# SimbaDM

#ODBCInstLib=libodbcinst.a(odbcinst.so)

# unixODBC

#ODBCInstLib=libodbcinst.a(libodbcinst.so.1)

Add the following entry to the /opt/IBM/InformationServer/Server/DSEngine/dsenv file:

export SIMBAINI=/opt/IBM/InformationServer/Server/DSEngine/.cloudera.hiveodbc.ini

Locate the hive.keytab file for HiveServer2 on the server where Cloudera Manager is installed (here, blr01.ibm.com) and transfer it to the machine where the IS Engine is installed (blr02.ibm.com).

ls -alt `find . -name hive.keytab`

Pick the first one listed (ls -t sorts the most recently modified file first) and transfer it to the /opt/cloudera/ folder on the IS Engine machine.
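
The find and transfer steps can be scripted along the following lines. This is a sketch, not a tested procedure: newest_keytab is a hypothetical helper, the search root under /var/run/cloudera-scm-agent/process is an assumption about where the Cloudera Manager agent keeps its keytabs, and the use of scp with the dsadm account is also an assumption. The copy command is only printed so that it can be reviewed before it is run.

```shell
#!/bin/sh
# Sketch: pick the most recently modified hive.keytab under a directory.
# newest_keytab is a hypothetical helper; the search root is an assumption.

# newest_keytab DIR: print the newest hive.keytab found below DIR
newest_keytab() {
    # ls -t sorts newest first; head keeps only the first entry.
    # Assumes the paths contain no whitespace.
    find "$1" -name hive.keytab -exec ls -t {} + 2>/dev/null | head -n 1
}

# Assumed location of the Cloudera Manager agent's process directories
SEARCH_ROOT="${SEARCH_ROOT:-/var/run/cloudera-scm-agent/process}"
keytab=$(newest_keytab "$SEARCH_ROOT")
if [ -n "$keytab" ]; then
    # Print the copy command instead of running it
    echo "scp $keytab dsadm@blr02.ibm.com:/opt/cloudera/hive.keytab"
fi
```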

Modify the file permissions so that the dsadm user can read the keytab:

chmod 777 /opt/cloudera/hive.keytab
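
A keytab works like a stored password, so a tighter alternative to mode 777 is to make the file readable only by the account that runs kinit. The sketch below assumes that account is dsadm, as in the next step; restrict_keytab is a hypothetical helper, and changing the owner normally requires root.

```shell
#!/bin/sh
# Sketch: restrict a keytab to its owner instead of opening it to everyone.

# restrict_keytab FILE [OWNER]: make FILE readable and writable by its owner only.
# When OWNER is given, ownership is changed first (this usually requires root).
restrict_keytab() {
    if [ -n "${2:-}" ]; then
        chown "$2" "$1" || return 1
    fi
    chmod 600 "$1"
}

keytab=/opt/cloudera/hive.keytab
if [ -f "$keytab" ]; then
    restrict_keytab "$keytab" dsadm
fi
```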

Log on as the dsadm user and run the kinit command:

kinit -k -t /opt/cloudera/hive.keytab hive/blr01.ibm.com@IPS.COM

Verify the ticket information by issuing the klist -e command.
Log on to the Administrator client and define two environment variables, KRB5CCNAME and KRB5_CONFIG.

Note: Alternatively, you can add these two environment variables to the /opt/IBM/InformationServer/Server/DSEngine/dsenv file.
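
With ticket_lifetime set to 2d in the sample krb5.conf, the ticket obtained by kinit eventually expires and connections start to fail until kinit is run again. The following sketch obtains a new ticket from the keytab only when the cache holds no valid one. ensure_ticket is a hypothetical helper; the keytab path and principal are the ones used in this post, and KRB5CCNAME should already be exported so that klist and kinit use the intended cache.

```shell
#!/bin/sh
# Sketch: obtain a ticket from the keytab only when the cache has no valid one.
# ensure_ticket is a hypothetical helper; paths and principal follow the post.

KEYTAB="${KEYTAB:-/opt/cloudera/hive.keytab}"
PRINCIPAL="${PRINCIPAL:-hive/blr01.ibm.com@IPS.COM}"

ensure_ticket() {
    # klist -s prints nothing and exits non-zero when the cache
    # cannot be read or the ticket has expired
    if klist -s; then
        echo "valid ticket found"
    else
        kinit -k -t "$KEYTAB" "$PRINCIPAL" && echo "ticket obtained for $PRINCIPAL"
    fi
}
```

One possible use is to call ensure_ticket as the dsadm user before running jobs, for example from a scheduled script.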

Export the two environment variables and test the connection with the DataDirect example program:

[root@blr02 example]# export KRB5CCNAME=/tmp/krb5cc_0

[root@blr02 example]# export KRB5_CONFIG=/etc/krb5.conf

[root@blr02 example]# . /opt/IBM/InformationServer/Server/DSEngine/dsenv

[root@blr02 example]# cd /opt/IBM/InformationServer/Server/branded_odbc/samples/example

[root@blr02 example]# ./example

./example DataDirect Technologies, Inc. ODBC Example Application.

Enter the data source name : Hive_Cloudera

Enter the user name : <leave blank>

Enter the password : <leave blank>

Connecting...

JDK_Connecting..

JDK_Connected..

Connected

Enter SQL statements (Press ENTER to QUIT)

SQL> show databases

database_name

default

test

ued_qbo

Enter SQL statements (Press ENTER to QUIT)

SQL> use default

Enter SQL statements (Press ENTER to QUIT)

SQL> show tables

tab_name

inttypes

jdk

tab2

Enter SQL statements (Press ENTER to QUIT)

SQL> select * from tab2

col1 col2

1 ABCD

2 EFG

3 HIJK

4 LMNOP

5 QRST

6 UVWX

7 YZabc

8 defghi

9 klmnop

10 qrstuvwxyz

Enter SQL statements (Press ENTER to QUIT)

SQL>

Using beeline to connect to Hive:

You can also use beeline to connect to Hive and check the contents, as follows:

[hive@blr01 run]$ beeline

Beeline version 0.10.0-cdh4.4.0 by Apache Hive

beeline> !connect jdbc:hive2://blr01.ibm.com:10000/default;principal=hive/blr01.ibm.com@IPS.COM

scan complete in 6ms

Connecting to jdbc:hive2://blr01.ibm.com:10000/default;principal=hive/blr01.ibm.com@IPS.COM

Enter username for jdbc:hive2://blr01.ibm.com:10000/default;principal=hive/blr01.ibm.com@IPS.COM:

Enter password for jdbc:hive2://blr01.ibm.com:10000/default;principal=hive/blr01.ibm.com@IPS.COM:

Connected to: Hive (version 0.10.0)

Driver: Hive (version 0.10.0-cdh4.4.0)

Transaction isolation: TRANSACTION_REPEATABLE_READ

0: jdbc:hive2://blr01.ibm.com:10000/def> use default;

No rows affected (1.168 seconds)

0: jdbc:hive2://blr01.ibm.com:10000/def> select col1, col2 from tab2;

+-------+-------------+
| col1  | col2        |
+-------+-------------+
| 1     | ABCD        |
| 2     | EFG         |
| 3     | HIJK        |
| 4     | LMNOP       |
| 5     | QRST        |
| 6     | UVWX        |
| 7     | YZabc       |
| 8     | defghi      |
| 9     | klmnop      |
| 10    | qrstuvwxyz  |
+-------+-------------+

10 rows selected (26.746 seconds)

0: jdbc:hive2://blr01.ibm.com:10000/def>
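
The same check can be run without the interactive prompt. The sketch below builds the JDBC URL used in the session above and prints a beeline command line for it. It assumes that the installed beeline supports the -u (connection URL) and -e (statement to execute) options, as current Hive releases do, and that a valid Kerberos ticket is already cached. hive_jdbc_url is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: build the HiveServer2 JDBC URL and print a one-shot beeline command.
# hive_jdbc_url is a hypothetical helper; host, port and principal follow the post.

# hive_jdbc_url HOST PORT DATABASE PRINCIPAL
hive_jdbc_url() {
    echo "jdbc:hive2://$1:$2/$3;principal=$4"
}

url=$(hive_jdbc_url blr01.ibm.com 10000 default hive/blr01.ibm.com@IPS.COM)

# The command is printed rather than executed so it can be reviewed first.
# The URL must stay quoted because it contains a semicolon.
echo "beeline -u \"$url\" -e \"select col1, col2 from tab2;\""
```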

Usage of this data in Information Server:

Create an import area and a data connection to this data source in InfoSphere Metadata Asset Manager (IMAM), and import the metadata.
Register the metadata created in the previous step in an Information Analyzer project to perform data profiling and data quality analysis.
The same metadata can be used for data integration in DataStage or in any other component of InfoSphere Information Server.

Forums:

For specific questions related to Cloudera Manager, you can also register with the Google group at https://groups.google.com/a/cloudera.org/forum/#!forum/scm-users and ask for help there.

Disclaimer: The postings on this site are those of the authors and don’t necessarily represent IBM’s positions, strategies or opinions.


Wednesday, 2 July 2014
