The feed entity defines the data set that Falcon replicates. Reference your cluster entities to determine which clusters the feed uses.
Create an XML file for the Feed entity.
<?xml version="1.0"?>
Describe the feed.
<?xml version="1.0"?>
<feed description="$rawInputFeed" name="testFeed" xmlns="uri:falcon:feed:0.1">
</feed>
Specify the frequency of the feed.
<?xml version="1.0"?>
<feed description="$rawInputFeed" name="testFeed" xmlns="uri:falcon:feed:0.1">
  <!-- Feed run frequency -->
  <frequency>hours(1)</frequency>
</feed>
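The frequency value is a time unit followed by a count in parentheses. As a sketch, assuming the feed should arrive every 15 minutes instead of hourly (the 15-minute interval is illustrative and not taken from this procedure):

```xml
<?xml version="1.0"?>
<feed description="$rawInputFeed" name="testFeed" xmlns="uri:falcon:feed:0.1">
  <!-- Feed run frequency: one instance every 15 minutes (illustrative value) -->
  <frequency>minutes(15)</frequency>
</feed>
```

The same position also accepts days(n) and months(n).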
Choose a retention policy that determines how long the data remains on the cluster. Retention is set with a retention element inside each cluster element, which you define in a later step.
For example, to delete data older than 7 days:
<retention limit="days(7)" action="delete"/>
(Optional) Set a late-arrival cut-off policy. The supported policies for late data handling are backoff, exp-backoff (default), and final.
For example, to set a late-arrival cut-off of 6 hours:
<?xml version="1.0"?>
<feed description="$rawInputFeed" name="testFeed" xmlns="uri:falcon:feed:0.1">
  <!-- Feed run frequency -->
  <frequency>hours(1)</frequency>
  <!-- Late arrival cut-off -->
  <late-arrival cut-off="hours(6)"/>
</feed>
Define your source and target clusters for the feed.
For example, for a source cluster named MyDataCenter and a target cluster named MyDataCenter-secondary (replace $date and $n with your own values):
<?xml version="1.0"?>
<feed description="$rawInputFeed" name="testFeed" xmlns="uri:falcon:feed:0.1">
  <!-- Feed run frequency -->
  <frequency>hours(1)</frequency>
  <!-- Late arrival cut-off -->
  <late-arrival cut-off="hours(6)"/>
  <!-- Target clusters for retention and replication -->
  <clusters>
    <cluster name="MyDataCenter" type="source">
      <validity start="$date" end="$date"/>
      <!-- Currently delete is the only action available -->
      <retention limit="days($n)" action="delete"/>
    </cluster>
    <cluster name="MyDataCenter-secondary" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(7)" action="delete"/>
      <!-- Cluster-specific data location -->
      <locations>
        <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      </locations>
    </cluster>
  </clusters>
</feed>
Specify the HDFS path locations or Hive table locations for the feed. For example, to specify the HDFS weblogs location:
<?xml version="1.0"?>
<feed description="$rawInputFeed" name="testFeed" xmlns="uri:falcon:feed:0.1">
  <!-- Feed run frequency -->
  <frequency>hours(1)</frequency>
  <!-- Late arrival cut-off -->
  <late-arrival cut-off="hours(6)"/>
  <!-- Target clusters for retention and replication -->
  <clusters>
    <cluster name="MyDataCenter" type="source">
      <validity start="$date" end="$date"/>
      <!-- Currently delete is the only action available -->
      <retention limit="days($n)" action="delete"/>
    </cluster>
    <cluster name="MyDataCenter-secondary" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(7)" action="delete"/>
      <!-- Cluster-specific data location -->
      <locations>
        <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      </locations>
    </cluster>
  </clusters>
  <!-- Global location across clusters - HDFS paths or Hive tables -->
  <locations>
    <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
</feed>
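For a feed stored in a Hive table rather than in HDFS paths, a table element takes the place of the global locations element. The fragment below is a minimal sketch, not a tested configuration: the database name default, the table name weblogs, and the partition key ds are all assumed for illustration.

```xml
<!-- Takes the place of the global <locations> element when the feed data lives in a Hive table. -->
<!-- Assumed names: database "default", table "weblogs", partition key "ds". -->
<table uri="catalog:default:weblogs#ds=${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
```

The URI follows the pattern catalog:database:table#partition-key=partition-value, so the date variables select one partition per feed instance.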
Specify HDFS ACLs. Set the owner, group, and level of permissions for HDFS. For example:
<?xml version="1.0"?>
<feed description="$rawInputFeed" name="testFeed" xmlns="uri:falcon:feed:0.1">
  <!-- Feed run frequency -->
  <frequency>hours(1)</frequency>
  <!-- Late arrival cut-off -->
  <late-arrival cut-off="hours(6)"/>
  <!-- Target clusters for retention and replication -->
  <clusters>
    <cluster name="MyDataCenter" type="source">
      <validity start="$date" end="$date"/>
      <!-- Currently delete is the only action available -->
      <retention limit="days($n)" action="delete"/>
    </cluster>
    <cluster name="MyDataCenter-secondary" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(7)" action="delete"/>
      <!-- Cluster-specific data location -->
      <locations>
        <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      </locations>
    </cluster>
  </clusters>
  <!-- Global location across clusters - HDFS paths or Hive tables -->
  <locations>
    <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <!-- Required for HDFS -->
  <ACL owner="hdfs" group="users" permission="0755"/>
</feed>
Specify the location of the schema file for the feed and the schema provider, such as protobuf or thrift. For example:
<?xml version="1.0"?>
<feed description="$rawInputFeed" name="testFeed" xmlns="uri:falcon:feed:0.1">
  <!-- Feed run frequency -->
  <frequency>hours(1)</frequency>
  <!-- Late arrival cut-off -->
  <late-arrival cut-off="hours(6)"/>
  <!-- Target clusters for retention and replication -->
  <clusters>
    <cluster name="MyDataCenter" type="source">
      <validity start="$date" end="$date"/>
      <!-- Currently delete is the only action available -->
      <retention limit="days($n)" action="delete"/>
    </cluster>
    <cluster name="MyDataCenter-secondary" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(7)" action="delete"/>
      <!-- Cluster-specific data location -->
      <locations>
        <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      </locations>
    </cluster>
  </clusters>
  <!-- Global location across clusters - HDFS paths or Hive tables -->
  <locations>
    <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <!-- Required for HDFS -->
  <ACL owner="hdfs" group="users" permission="0755"/>
  <!-- Schema file location and provider -->
  <schema location="/schema" provider="protobuf"/>
</feed>