Running Shark on Tachyon

The additional prerequisite for this part is Shark. We also assume that the user has set up Tachyon and Hadoop in accordance with the Local Mode or Cluster Mode guide.

Shark 0.7 adds a new storage format that supports reading data efficiently from Tachyon, which enables data sharing and isolation across instances of Shark. Our meetup slide gives a good overview of the benefits of using Tachyon to cache Shark’s tables. In summary, the following are the four major ones:

Shark Compatibility

Tachyon Version    Shark Version
0.2.1              0.7.x
0.3.0              0.8.1
0.4.0              0.9.0
0.4.1              0.9.1+
0.5.0              0.9.1+

Setup

To run Shark on Tachyon, you need to set up Tachyon with HDFS first, in either Local Mode or Cluster Mode.

Then add the following lines to shark-env.sh.

export TACHYON_MASTER="tachyon://TachyonMasterHost:TachyonMasterPort"
export TACHYON_WAREHOUSE_PATH=/sharktables
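
As an illustration, the two lines above might be filled in as follows for a single-node setup. This is only a sketch: the host name localhost and the port 19998 are assumptions, not values taken from this guide, so substitute the address of your own Tachyon master.

```shell
# Hypothetical values for a single-node setup. Both are assumptions:
# replace them with the host and port of your own Tachyon master.
TACHYON_HOST="localhost"   # assumed master host
TACHYON_PORT="19998"       # assumed master port

export TACHYON_MASTER="tachyon://${TACHYON_HOST}:${TACHYON_PORT}"
export TACHYON_WAREHOUSE_PATH="/sharktables"

# Print the effective settings as a quick sanity check.
echo "TACHYON_MASTER=${TACHYON_MASTER}"
echo "TACHYON_WAREHOUSE_PATH=${TACHYON_WAREHOUSE_PATH}"
```

Keeping the host and port in separate variables is just a convenience for editing; shark-env.sh only needs the two exported variables.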

Caching Shark tables in Tachyon

There are a couple of ways to create tables that are cached on Tachyon. Running these queries requires some data to already be on the filesystem or loaded into Shark.

Specify TBLPROPERTIES("shark.cache" = "tachyon"), for example:
    CREATE TABLE data TBLPROPERTIES("shark.cache" = "tachyon") AS SELECT a, b, c FROM data_on_disk WHERE month="May";
Give the table a name ending with _tachyon, for example:
    CREATE TABLE orders_tachyon AS SELECT * FROM orders;

After creating the table in Tachyon, you can query it like any other table.
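
When several tables need to be cached, the _tachyon naming convention lends itself to generating the statements. The following is a minimal sketch, not part of Shark or Tachyon: the helper cache_in_tachyon_sql and the table names orders and customers are hypothetical, and the script only prints HiveQL without contacting Shark.

```shell
# Hypothetical helper: prints HiveQL that caches each named table in Tachyon
# by using the _tachyon suffix. It does not connect to Shark.
cache_in_tachyon_sql() {
  for table in "$@"; do
    # The _tachyon suffix marks the new table as cached in Tachyon.
    printf 'CREATE TABLE %s_tachyon AS SELECT * FROM %s;\n' "$table" "$table"
    # Once created, the cached table is queried like any other table.
    printf 'SELECT COUNT(*) FROM %s_tachyon;\n' "$table"
  done
}

# Example with two assumed table names.
cache_in_tachyon_sql orders customers
```

The printed statements can then be pasted into a Shark session or saved to a file for later use.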