2. Hive on Spark

With "Hive on Spark," Spark operates as an execution backend for Hive queries.

The following example reads from and writes to HDFS under the Hive directories, using the built-in UDF collect_list(col), which returns a list of objects with duplicates.

In a production environment this type of operation would run under an account with appropriate HDFS permissions; the following example uses the hdfs user.

  1. Launch the Spark Shell on a YARN cluster:

    su hdfs
    ./bin/spark-shell --num-executors 2 --executor-memory 512m --master yarn-client

  2. As of Spark 1.5, a HiveContext is created automatically when the shell starts and is assigned to the variable sqlContext. If you have existing hiveContext code, you can optionally rename the variable to sqlContext and remove the context-creation code; otherwise, create the context explicitly as shown below, since the remaining steps refer to hiveContext.
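
    If you prefer to keep the hiveContext name used in the remaining steps, a minimal sketch for creating the context explicitly with the Spark 1.x API is:

    scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)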

  3. Create a Hive table:

    scala> hiveContext.sql("CREATE TABLE IF NOT EXISTS TestTable (key INT, value STRING)")

    You should see output similar to the following:

    ...
    15/11/10 14:40:02 INFO log.PerfLogger: </PERFLOG method=Driver.run start=1447184401403 end=1447184402898 duration=1495 from=org.apache.hadoop.hive.ql.Driver>
    res8: org.apache.spark.sql.DataFrame = [result: string]
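
    To confirm that the table is registered in the Hive metastore, an optional check (assuming the default database) is to list the tables:

    scala> hiveContext.sql("SHOW TABLES").collect().foreach(println)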

  4. Load sample data from kv1.txt into the table:

    scala> hiveContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE TestTable")
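
    As an optional sanity check (not part of the original walkthrough), you can count the rows that were loaded:

    scala> hiveContext.sql("SELECT count(*) FROM TestTable").collect().foreach(println)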

  5. Invoke the Hive collect_list UDF:

    scala> hiveContext.sql("FROM TestTable SELECT key, collect_list(value) GROUP BY key ORDER BY key").collect().foreach(println)
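
    If you also want to persist the aggregated result as a new table under the Hive warehouse directory, a minimal sketch (the table name TestTableAgg and the column alias value_list are illustrative, not part of the original example) is:

    scala> hiveContext.sql("CREATE TABLE IF NOT EXISTS TestTableAgg AS SELECT key, collect_list(value) AS value_list FROM TestTable GROUP BY key")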