With "Hive on Spark," Spark operates as an execution backend for Hive queries.
The following example reads from and writes to HDFS under the Hive directories, using the built-in UDF
collect_list(col), which returns a list of objects with duplicates.
In a production environment this type of operation would run under an account with
appropriate HDFS permissions; the following example uses the hdfs user.
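To make the behavior of collect_list concrete before the walkthrough: given a hypothetical table t containing the rows (1, 'a'), (1, 'b'), and (1, 'a'), a query such as the sketch below would return key 1 paired with the list ["a", "b", "a"], preserving the duplicate (the table t is illustrative only and is not created in this example; use collect_set(col) instead if you want duplicates removed):

scala> hiveContext.sql("SELECT key, collect_list(value) FROM t GROUP BY key").collect.foreach(println)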
Launch the Spark Shell on a YARN cluster:
su hdfs
./bin/spark-shell --num-executors 2 --executor-memory 512m --master yarn-client
As of Spark 1.5, a HiveContext is created automatically and is named sqlContext. If you have existing hiveContext code, you can optionally change it to sqlContext and remove the context-creation code.
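If you are running a Spark version earlier than 1.5, or you prefer an explicitly created context, you can construct the Hive-enabled context yourself; a minimal sketch (the variable name hiveContext simply matches the code that follows):

scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)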
scala> hiveContext.sql("CREATE TABLE IF NOT EXISTS TestTable (key INT, value STRING)")You should see output similar to the following:
... 15/11/10 14:40:02 INFO log.PerfLogger: </PERFLOG method=Driver.run start=1447184401403 end=1447184402898 duration=1495 from=org.apache.hadoop.hive.ql.Driver> res8: org.apache.spark.sql.DataFrame = [result: string]
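Optionally, you can confirm that the table is registered in the Hive metastore before loading data; this verification step is not part of the original walkthrough:

scala> hiveContext.sql("SHOW TABLES").collect.foreach(println)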
Load sample data from kv1.txt into the table:

scala> hiveContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE TestTable")

Invoke the Hive collect_list UDF:

scala> hiveContext.sql("FROM TestTable SELECT key, collect_list(value) GROUP BY key ORDER BY key").collect.foreach(println)
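Because the example is also meant to write under the Hive warehouse directories, you could persist the aggregated result as a new Hive-managed table; a minimal sketch (the table name TestTableAgg and the column alias value_list are illustrative, not part of the original example):

scala> hiveContext.sql("CREATE TABLE IF NOT EXISTS TestTableAgg AS SELECT key, collect_list(value) AS value_list FROM TestTable GROUP BY key")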

