Wednesday, 2 July 2014

Steps to Kerberize HDFS in Cloudera Manager and access it from Information Server



Overview:
This post describes how to Kerberize HDFS (and, for that matter, all other services) in Cloudera Manager and how to access it from Information Server for data profiling, data quality analysis, data integration and similar work on the data stored in HDFS.
Prerequisites:
  1. Install Cloudera Manager on the server (let us say on “blr01.ibm.com”)
  2. Install InfoSphere Information Server on a different server (let us say on “blr02.ibm.com”)
  3. Install and set up the Kerberos server on “blr02.ibm.com”. (This can also be done on any other server, as long as all Kerberos clients refer to that server in their configuration.)
  4. Install and set up the Kerberos client on “blr01.ibm.com” (where Cloudera Manager is installed) and on “blr02.ibm.com” (if the Kerberos server is not set up on that machine).
KDC Infrastructure setup:
The following are the main steps for installing and setting up the KDC:
1. Install Kerberos V5 server/client libraries
2. Install Master KDC
  • Edit Configuration Files
  • Create Database
  • Add Administrators to the ACL File
  • Add Administrators to the Kerberos Database
  • Create a kadmind Keytab
  • Start the Kerberos Daemons on the Master KDC
3. Install Slave KDCs (optional but highly recommended)
4. Propagate Database to Slave KDCs
5. Create Stash Files and Start krb5kdc Daemons on Slave KDCs
6. Configure Kerberos Client machines
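
As a rough sketch, the master KDC steps above map onto the following MIT Kerberos commands. The realm and the /var/kerberos/krb5kdc paths match the sample kdc.conf in this post; the admin/admin principal name and the ACL line are assumptions for illustration, and package installation and service start-up differ between distributions. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch of the master KDC steps for MIT Kerberos.
# DRY_RUN=1 (the default) only prints each command; DRY_RUN=0 executes it.
REALM="IPS.COM"
KDC_DIR="/var/kerberos/krb5kdc"
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

setup_master_kdc() {
    # Create the database; -s also writes a stash file for unattended start-up.
    run kdb5_util create -r "$REALM" -s
    # Grant full rights to every */admin principal (assumed ACL policy).
    run sh -c "echo '*/admin@$REALM *' >> $KDC_DIR/kadm5.acl"
    # Add an administrator to the Kerberos database (prompts for a password).
    run kadmin.local -q "addprinc admin/admin@$REALM"
    # Create the kadmind keytab referenced by admin_keytab in kdc.conf.
    run kadmin.local -q "ktadd -k $KDC_DIR/kadm5.keytab kadmin/admin kadmin/changepw"
    # Start the Kerberos daemons on the master KDC.
    run krb5kdc
    run kadmind
}

setup_master_kdc
```

Review the printed commands first, then rerun as root with DRY_RUN=0 once kdc.conf and krb5.conf are in place.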

Sample kdc.conf
[kdcdefaults]
 kdc_ports = 88
 kdc_tcp_ports = 88

[realms]
 IPS.COM = {
  master_key_type = aes128-cts
  max_life = 2d
  max_renewable_life = 2w
  acl_file = /var/kerberos/krb5kdc/kadm5.acl
  dict_file = /usr/share/dict/words
  admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
  supported_enctypes = aes128-cts:normal des3-hmac-sha1:normal arcfour-hmac:normal des-hmac-sha1:normal des-cbc-md5:normal des-cbc-crc:normal
 }


Sample krb5.conf
[logging]
  default = FILE:/var/log/krb5libs.log
  kdc = FILE:/var/log/krb5kdc.log
  admin_server = FILE:/var/log/kadmind.log

[libdefaults]
  default_realm = IPS.COM
  dns_lookup_realm = false
  dns_lookup_kdc = false
  ticket_lifetime = 2d
  renew_lifetime = 2w
  kdc_timeout = 10s
  forwardable = true
  renewable = true

[realms]
  IPS.COM = {
    kdc = blr02.ibm.com:88
    admin_server = blr02.ibm.com:749
}

[domain_realm]
 .ibm.com = IPS.COM
 ibm.com = IPS.COM

[kdc]
  profile=/var/kerberos/krb5kdc/kdc.conf
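
A quick way to confirm that a client machine picked up the intended realm and KDC is to read them back from krb5.conf. The following is a minimal sketch using awk; the function names are hypothetical, and it assumes the simple one-setting-per-line layout shown in the sample above.

```shell
# Print the default_realm value from a krb5.conf style file ($1).
krb5_default_realm() {
    awk -F'=' '/^[ \t]*default_realm[ \t]*=/ {
        gsub(/[ \t]/, "", $2); print $2; exit
    }' "$1"
}

# Print every kdc entry (host:port) found in the [realms] section of $1.
# The [logging] section also has a "kdc =" key, so the section is tracked.
krb5_kdcs() {
    awk -F'=' '
        /^[ \t]*\[/ {
            section = $0
            sub(/^[ \t]*\[/, "", section)
            sub(/\].*$/, "", section)
            next
        }
        section == "realms" && /^[ \t]*kdc[ \t]*=/ {
            gsub(/[ \t]/, "", $2); print $2
        }' "$1"
}
```

For example, `krb5_default_realm /etc/krb5.conf` and `krb5_kdcs /etc/krb5.conf` should print the realm and the KDC host:port that the Hive principal and the ODBC DSN are expected to use.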

The following links give an overview of the KDC infrastructure and its installation and setup steps.

Enable Kerberos Security in Cloudera Manager:
Log on to the Cloudera Manager admin console (for example: http://blr01.ibm.com:7180/) to enable Hadoop security.
These steps create the required keytab files on the server where Cloudera Manager is installed. One of these keytab files needs to be moved to the client (blr02.ibm.com, where Information Server is installed).
Drivers used:
Cloudera ODBC Driver for Apache Hive
This driver needs to be installed on the server where Information Server is installed (blr02.ibm.com).
P.S. Always use the latest version of the Cloudera ODBC Driver for Apache Hive to avoid performance issues.
Mandatory steps to be followed on the server where the Information Server (IS) Engine is installed:
  1. Add a DSN entry to the .odbc.ini file
[Hive_Cloudera]
Driver=/opt/cloudera/hiveodbc/lib/64/libclouderahiveodbc64.so
Host=blr01.ibm.com
Port=10000
Schema=default
DefaultStringColumnLength=255
FastSQLPrepare=0
UseNativeQuery=0
HiveServerType=2
AuthMech=1
HS2AuthMech=1
HS2HostFQDN=blr01.ibm.com
HS2KrbServiceName=hive
HS2KrbRealm=IPS.COM
  2. Create the file /opt/IBM/InformationServer/Server/DSEngine/.cloudera.hiveodbc.ini with the following contents
[Driver]

## - Note that this default DriverManagerEncoding of UTF-32 is for iODBC.
## - unixODBC uses UTF-16 by default.
## - If unixODBC was compiled with -DSQL_WCHART_CONVERT, then UTF-32 is the correct value.
##   Execute 'odbc_config --cflags' to determine if you need UTF-32 or UTF-16 on unixODBC
## - SimbaDM can be used with UTF-8 or UTF-16.
##   The DriverUnicodeEncoding setting will cause SimbaDM to run in UTF-8 when set to 2 or UTF-16 when set to 1.

DriverManagerEncoding=UTF-8
ErrorMessagesPath=/opt/cloudera/hiveodbc/ErrorMessages/
LogLevel=0
LogPath=/tmp

## - Uncomment the ODBCInstLib corresponding to the Driver Manager being used.
## - Note that the path to your ODBC Driver Manager must be specified in LD_LIBRARY_PATH (LIBPATH for AIX).
## - Note that AIX has a different format for specifying its shared libraries.

# Generic ODBCInstLib
#   iODBC
ODBCInstLib=/opt/IBM/InformationServer/Server/branded_odbc/lib/libodbcinst.so

#   SimbaDM / unixODBC
#ODBCInstLib=libodbcinst.so

# AIX specific ODBCInstLib
#   iODBC
#ODBCInstLib=libiodbcinst.a(libiodbcinst.so.2)

#   SimbaDM
#ODBCInstLib=libodbcinst.a(odbcinst.so)

#   unixODBC
#ODBCInstLib=libodbcinst.a(libodbcinst.so.1)
  3. Add the following entry to the /opt/IBM/InformationServer/Server/DSEngine/dsenv file
export SIMBAINI=/opt/IBM/InformationServer/Server/DSEngine/.cloudera.hiveodbc.ini
  4. Locate the hive.keytab file for HiveServer2 on the server where Cloudera Manager is installed (blr01.ibm.com here) and transfer it to the machine where the IS Engine is installed (i.e. blr02.ibm.com).
ls -alt `find . -name hive.keytab`

Pick the first file in the listing (the most recently modified one, because ls -t sorts newest first) and transfer it to the /opt/cloudera/ folder on the IS Engine machine.
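  5. Restrict the file permissions so that only the dsadm user, who runs kinit in the next step, can read the keytab (a keytab holds the principal's long-term key, so it should not be readable or writable by everyone)
chown dsadm /opt/cloudera/hive.keytab
chmod 600 /opt/cloudera/hive.keytab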
  6. Log on as the dsadm user and run the kinit command
kinit -k -t /opt/cloudera/hive.keytab hive/blr01.ibm.com@IPS.COM
  7. Verify the ticket information by issuing the klist -e command
  8. Log on to the Administrator client and define the two environment variables (KRB5CCNAME and KRB5_CONFIG)



P.S. Alternatively you can also add these two environment variables to /opt/IBM/InformationServer/Server/DSEngine/dsenv file.
  9. Export the following two environment variables and test the connection with the DataDirect example program
[root@blr02 example]# export KRB5CCNAME=/tmp/krb5cc_0
[root@blr02 example]# export KRB5_CONFIG=/etc/krb5.conf

[root@blr02 example]# . /opt/IBM/InformationServer/Server/DSEngine/dsenv
[root@blr02 example]# cd /opt/IBM/InformationServer/Server/branded_odbc/samples/example

[root@blr02 example]# ./example
./example DataDirect Technologies, Inc. ODBC Example Application.
Enter the data source name : Hive_Cloudera

Enter the user name        : <leave blank>

Enter the password         : <leave blank>

Connecting...
JDK_Connecting..
JDK_Connected..

Connected

Enter SQL statements (Press ENTER to QUIT)
SQL> show databases

database_name
default
test
ued_qbo


Enter SQL statements (Press ENTER to QUIT)
SQL> use default

Enter SQL statements (Press ENTER to QUIT)
SQL> show tables

tab_name
inttypes
jdk
tab2


Enter SQL statements (Press ENTER to QUIT)
SQL> select * from tab2

col1    col2
1       ABCD
2       EFG
3       HIJK
4       LMNOP
5       QRST
6       UVWX
7       YZabc
8       defghi
9       klmnop
10      qrstuvwxyz


Enter SQL statements (Press ENTER to QUIT)
SQL>
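
Before running the example program or a DataStage job, it can help to confirm that the two environment variables point at real files. The following is a minimal sketch; check_krb_env is a hypothetical helper name and is not part of Information Server or Kerberos.

```shell
# Return 0 only if KRB5_CONFIG and the ticket cache named by KRB5CCNAME
# both point at readable files; otherwise print the reason and return 1.
check_krb_env() {
    if [ -z "${KRB5_CONFIG:-}" ] || [ ! -r "$KRB5_CONFIG" ]; then
        echo "KRB5_CONFIG is unset or not readable" >&2
        return 1
    fi
    # KRB5CCNAME may carry a FILE: prefix; strip it before testing the path.
    cache="${KRB5CCNAME:-}"
    cache="${cache#FILE:}"
    if [ -z "$cache" ] || [ ! -r "$cache" ]; then
        echo "KRB5CCNAME is unset or the ticket cache is not readable" >&2
        return 1
    fi
    return 0
}
```

A possible use is `check_krb_env && klist -s && ./example`, where klist -s (MIT Kerberos) exits with a non-zero status if the cache holds no valid ticket, so the test program is only started when a ticket is actually available.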
Using beeline to connect to Hive:
Alternatively, you can use beeline to connect to Hive and check the contents as follows:

[hive@blr01 run]$ beeline

Beeline version 0.10.0-cdh4.4.0 by Apache Hive
beeline> !connect jdbc:hive2://blr01.ibm.com:10000/default;principal=hive/blr01.ibm.com@IPS.COM
scan complete in 6ms
Connecting to jdbc:hive2://blr01.ibm.com:10000/default;principal=hive/blr01.ibm.com@IPS.COM
Enter username for jdbc:hive2://blr01.ibm.com:10000/default;principal=hive/blr01.ibm.com@IPS.COM:
Enter password for jdbc:hive2://blr01.ibm.com:10000/default;principal=hive/blr01.ibm.com@IPS.COM:
Connected to: Hive (version 0.10.0)
Driver: Hive (version 0.10.0-cdh4.4.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://blr01.ibm.com:10000/def> use default;
No rows affected (1.168 seconds)

0: jdbc:hive2://blr01.ibm.com:10000/def> select col1, col2 from tab2;
+-------+-------------+
| col1  |    col2     |
+-------+-------------+
| 1     | ABCD        |
| 2     | EFG         |
| 3     | HIJK        |
| 4     | LMNOP       |
| 5     | QRST        |
| 6     | UVWX        |
| 7     | YZabc       |
| 8     | defghi      |
| 9     | klmnop      |
| 10    | qrstuvwxyz  |
+-------+-------------+
10 rows selected (26.746 seconds)

0: jdbc:hive2://blr01.ibm.com:10000/def>
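
The same check can be scripted instead of typed interactively. The sketch below builds the JDBC URL used in the session above and, only if beeline is on the PATH, runs one query with beeline's -u (URL) and -e (query) options. The hive_jdbc_url helper is a hypothetical name, and the sketch assumes a valid Kerberos ticket already exists (from kinit) and that your beeline version supports these options.

```shell
# Build the HiveServer2 JDBC URL for a Kerberos-secured connection.
# $1 host, $2 port, $3 database, $4 HiveServer2 service principal
hive_jdbc_url() {
    printf 'jdbc:hive2://%s:%s/%s;principal=%s\n' "$1" "$2" "$3" "$4"
}

URL=$(hive_jdbc_url blr01.ibm.com 10000 default hive/blr01.ibm.com@IPS.COM)
echo "$URL"

# Run the query only where beeline is installed.
if command -v beeline >/dev/null 2>&1; then
    beeline -u "$URL" -e "select col1, col2 from tab2;"
fi
```

Quoting the URL matters here, because the semicolon before principal= would otherwise end the shell command.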

Usage of this data in Information Server:
  • Create an import area and a data connection to this data source in InfoSphere Metadata Asset Manager (IMAM), and import the metadata.
  • Register the metadata imported in the previous step in an Information Analyzer project to perform data profiling and data quality analysis.
  • The same metadata can be used for data integration in DataStage, or by any other component of InfoSphere Information Server.

Forums:
You can also register with the Google group at https://groups.google.com/a/cloudera.org/forum/#!forum/scm-users to seek help with specific questions about Cloudera Manager.

Disclaimer: The postings on this site are those of the authors and don’t necessarily represent IBM’s positions, strategies or opinions.
