Connect to Hive, Impala and HDFS

Anaconda Enterprise contains numerous example projects, including a Spark/Hadoop project. This project includes the Python libraries needed to connect to Hive, Impala and HDFS, as well as example notebooks that connect to these services.

Impala

To connect to an Impala cluster you need the address and port of a running Impala daemon, normally port 21050.

To use Impyla, open a Python notebook based on the anaconda50_impyla environment and run:

from impala.dbapi import connect

# Replace <Impala Daemon> with the hostname of a running Impala daemon
conn = connect('<Impala Daemon>', port=21050)
cursor = conn.cursor()
cursor.execute('SHOW DATABASES')
cursor.fetchall()
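Query results can also be loaded into a pandas DataFrame with Impyla's as_pandas helper. A minimal sketch, assuming the same <Impala Daemon> placeholder and that pandas is installed in the anaconda50_impyla environment:

```python
from impala.dbapi import connect
from impala.util import as_pandas  # helper that ships with Impyla

conn = connect('<Impala Daemon>', port=21050)
cursor = conn.cursor()
cursor.execute('SHOW DATABASES')

# Convert the cursor's result set into a DataFrame, one row per database
df = as_pandas(cursor)
```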

Hive

To connect to a Hive cluster you need the address and port of a running HiveServer2 instance, normally port 10000.

To use PyHive, open a Python notebook based on the anaconda50_hadoop environment and run:

from pyhive import hive

# Replace <Hive Server 2> with the HiveServer2 hostname
conn = hive.connect('<Hive Server 2>', port=10000)
cursor = conn.cursor()
cursor.execute('SHOW DATABASES')
cursor.fetchall()
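Because PyHive connections follow the Python DB-API, they can also be handed to pandas directly. A sketch, assuming pandas is available in the anaconda50_hadoop environment:

```python
import pandas as pd
from pyhive import hive

conn = hive.connect('<Hive Server 2>', port=10000)

# pandas runs the query over the DB-API connection and
# returns the result set as a DataFrame
df = pd.read_sql('SHOW DATABASES', conn)
```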

HDFS

To connect to an HDFS cluster you need the address and port of the HDFS NameNode, normally port 50070.

To use the hdfscli command line, configure the ~/.hdfscli.cfg file:

[global]
default.alias = dev

[dev.alias]
url = http://<Namenode>:50070

Once the library is configured, you can use it to perform actions on HDFS from the command line. Start a terminal based on the anaconda50_hadoop environment and run the hdfscli command. For example:

$ hdfscli

Welcome to the interactive HDFS python shell.
The HDFS client is available as `CLIENT`.

In [1]: CLIENT.list("/")
Out[1]: ['hbase', 'solr', 'tmp', 'user']
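The same library can be used from a notebook. As a sketch (assuming the alias configuration above, or substituting your NameNode address for the <Namenode> placeholder):

```python
from hdfs import Config, InsecureClient

# Use the 'dev' alias defined in ~/.hdfscli.cfg ...
client = Config().get_client('dev')

# ... or point a client at the NameNode's WebHDFS endpoint directly
client = InsecureClient('http://<Namenode>:50070')

# List the contents of the HDFS root directory
client.list('/')
```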

Kerberos

To authenticate and connect to Kerberized services, you need a working Kerberos configuration; then run the kinit command in a terminal on the project session to obtain a ticket before connecting.

NOTE: For more information, see the authentication section in Kerberos Configuration.

The Hadoop/Spark example project also includes example notebooks that show how to use these libraries with Kerberos authentication.
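As one sketch of the pattern, a Kerberized Impyla connection obtains a ticket first and then authenticates with GSSAPI (the principal and service name below are assumptions; adjust them to your cluster):

```python
# First obtain a Kerberos ticket in a terminal, e.g.:
#   $ kinit user@EXAMPLE.COM

from impala.dbapi import connect

# auth_mechanism='GSSAPI' tells Impyla to authenticate with the
# cached Kerberos ticket; 'impala' is the usual service principal
# name for Impala daemons (an assumption here)
conn = connect('<Impala Daemon>', port=21050,
               auth_mechanism='GSSAPI',
               kerberos_service_name='impala')
```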