Chapter 42. Overview

Introduction

What is a Subgraph?

A subgraph is a user-defined reusable component whose logic is implemented as an ETL graph instead of Java code.

A subgraph definition is a regular ETL graph and may use any graph elements (components, connections, lookups, sequences, or parameters).

Subgraphs can be nested; a subgraph definition may use other subgraphs.

A subgraph definition is stored in a separate file with the *.sgrf extension. In the default CloverETL project layout, a directory ${PROJECT}/graph/subgraph is created for storing subgraph files. You can reference this directory via the ${SUBGRAPH_DIR} parameter.
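
To make the relationship between the two parameters concrete, the following is a minimal sketch of how a ${SUBGRAPH_DIR} reference could expand to a file path. It is only an illustration of placeholder substitution, not CloverETL's own parameter resolver; the project path, the file name ReadOrders.sgrf, and the resolve function are hypothetical.

```python
import re

# Hypothetical parameter values. PROJECT is an assumed path; the SUBGRAPH_DIR
# value mirrors the default project layout described above.
PARAMS = {
    "PROJECT": "/home/user/workspace/MyProject",
    "SUBGRAPH_DIR": "${PROJECT}/graph/subgraph",
}

_PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def resolve(text, params, _depth=0):
    """Expand ${NAME} placeholders, following nested references."""
    if _depth > 10:
        raise ValueError("parameter references nest too deeply (possible cycle)")

    def _substitute(match):
        name = match.group(1)
        if name not in params:
            return match.group(0)  # leave unknown placeholders untouched
        return resolve(params[name], params, _depth + 1)

    return _PLACEHOLDER.sub(_substitute, text)


print(resolve("${SUBGRAPH_DIR}/ReadOrders.sgrf", PARAMS))
# prints /home/user/workspace/MyProject/graph/subgraph/ReadOrders.sgrf
```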

Use the Subgraph component to reference a subgraph in a regular ETL graph. Once configured with a subgraph file, the Subgraph component automatically updates its ports to match the ports of the subgraph definition.

What are Subgraphs Good for?

Simplifying Complex ETL Logic

Use subgraphs to visually reduce the number of components in complex ETL graphs and highlight important processing logic.

Creating Reusable Blocks of Logic

Subgraphs let you develop prefabricated blocks of logic that other members of the development team can use. This approach to ETL development promotes reusability and standardization.

Creating Connectors

Subgraphs provide an easy way to create new connectors for web services or databases. Web services communicate over HTTP and provide data in JSON or XML format, which needs to be preprocessed before use in ETL logic. Subgraphs can hide the parsing logic and provide data in an easy-to-consume format.

Similarly, for databases with a complex relational structure, DBAs can develop tuned queries that access data via optimized views and indices, then publish the queries as subgraphs that serve as easy-to-use connectors to common data entities.
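
The connector idea can be illustrated outside of CloverETL with a minimal Python sketch: callers receive flat records and never see the nested response structure. The payload, the field names, and the customer_connector function are all hypothetical; in a real subgraph the same role would be played by parsing components placed in the subgraph body.

```python
import json

# Hypothetical payload standing in for the body of an HTTP response.
RAW_RESPONSE = """
{
  "status": "ok",
  "result": {
    "customers": [
      {"id": 1, "name": {"first": "Ada", "last": "Lovelace"}, "tags": ["vip"]},
      {"id": 2, "name": {"first": "Alan", "last": "Turing"}, "tags": []}
    ]
  }
}
"""


def customer_connector(raw):
    """Hide the response structure and yield one flat record per customer."""
    payload = json.loads(raw)
    for item in payload["result"]["customers"]:
        yield {
            "id": item["id"],
            "first_name": item["name"]["first"],
            "last_name": item["name"]["last"],
            "is_vip": "vip" in item["tags"],
        }


# Consumers work with flat records only.
for record in customer_connector(RAW_RESPONSE):
    print(record)
```

If the service later changes its response layout, only the connector needs to be updated; consumers of the flat records stay unchanged.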

Design & Execution

  • Create the body of a subgraph in the same way as an ordinary graph. You can use the same components, structure, and overall approach.

  • Use connections, lookup tables, dictionary, etc. All these features are available in subgraphs as well as in graphs.

  • Define an input and output interface. The interface (the input and output ports of the Subgraph component) is defined by the SubgraphInput and SubgraphOutput components.

  • Launch as a single unit or from a graph. A subgraph can be launched as a standalone graph or as a component of a parent graph.

Anatomy of Subgraphs

An ETL graph that defines a subgraph contains the following sections:

Figure 42.1. Subgraph Layout


  • SubgraphInput
    • Represents inputs of subgraph

    • Each subgraph contains exactly one instance of the SubgraphInput component

    • The number of its output ports defines the number of the subgraph's inputs

  • SubgraphOutput
    • Represents outputs of subgraph

    • Each subgraph contains exactly one instance of the SubgraphOutput component

    • The number of its input ports defines the number of the subgraph's outputs

  • Body of Subgraph
    • Contains implementation of subgraph logic

    • The subgraph body can contain components (e.g. a reader) that are not connected to SubgraphInput or SubgraphOutput, for accessing external data sources or static data sets

    • The body of a subgraph may contain multiple phases and define component allocation for execution control. Phases and allocation are applied separately from the parent graph. For phases, this means that when the subgraph is started in a phase of its parent graph, the subgraph's own phases run in order: first, second, third, and so on. After all phases of the subgraph finish, the parent graph considers the subgraph finished, and the next phase of the parent graph can start.

    • Components in the subgraph body can use their own connections, lookups, metadata, and parameters

  • Debug Inputs
    • Any components connected to the input ports of the SubgraphInput component

    • Can be used to generate test data when developing and testing subgraph logic

    • Components in the debug input section are automatically disabled when the subgraph is executed from a parent graph; this is visualized by graying out these components

  • Debug Outputs
    • Any components connected to an output port of the SubgraphOutput component, or placed in a higher phase than SubgraphOutput

    • Can be used to inspect and store test data when developing and testing subgraph

    • Components in the debug output section are automatically disabled when the subgraph is executed from a parent graph; this is visualized by graying out these components
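
The phase ordering and the handling of debug components described in the list above can be sketched as a small simulation. This is a simplified model, not CloverETL's execution engine: components within one phase actually run in parallel, whereas the log below lists them one after another. All component names, the phase numbers, and the functions run_subgraph and run_parent are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    phase: int
    debug: bool = False  # True for components in the debug input/output sections


def run_subgraph(components, from_parent):
    """Return the start order of subgraph components, phase by phase."""
    # Debug components are skipped only when a parent graph runs the subgraph.
    active = [c for c in components if not (from_parent and c.debug)]
    log = []
    for phase in sorted({c.phase for c in active}):
        for c in active:
            if c.phase == phase:
                log.append(f"subgraph phase {phase}: {c.name}")
    return log


def run_parent(parent_phases, subgraph):
    """Run parent phases in order; the subgraph's phases nest inside one of them."""
    log = []
    for phase in sorted(parent_phases):
        for name in parent_phases[phase]:
            log.append(f"parent phase {phase}: {name}")
            if name == "Subgraph":
                log.extend("  " + line for line in run_subgraph(subgraph, True))
    return log


# Hypothetical subgraph: a lookup is filled in phase 0, records flow in phase 1.
SUBGRAPH = [
    Component("DataGenerator", 1, debug=True),  # debug input
    Component("LoadLookup", 0),
    Component("SubgraphInput", 1),
    Component("Reformat", 1),
    Component("SubgraphOutput", 1),
    Component("Trash", 1, debug=True),          # debug output
]

# Hypothetical parent graph: phase 1 starts only after phase 0, including
# every phase of the subgraph, has finished.
PARENT = {0: ["Reader", "Subgraph", "Writer"], 1: ["Cleanup"]}

print("\n".join(run_parent(PARENT, SUBGRAPH)))
```

Calling run_subgraph(SUBGRAPH, False) models a standalone launch, in which the debug components take part as well.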

Figure 42.2. Example of subgraph with multiple output ports


Subgraphs vs. Jobflow

While both Subgraphs and Jobflow provide a way of creating reusable processing logic, they serve different purposes.

Subgraphs behave the same as other built-in ETL components; they stream data to the parent graph. When used in an ETL graph, they execute in parallel with the other ETL components running in the graph.

Use a subgraph when you need to create a new component that is used in ETL processing and exchanges large amounts of data with other components.

Jobflow, by its nature, provides step-by-step sequential processing. Individual steps in a jobflow do not exchange large amounts of data; instead, they pass status and configuration parameters to each other.

If you need to create logic that should be executed as one of several processing steps or you want to react to job status after its execution, create an ETL graph and call it from Jobflow via ExecuteGraph.
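
The difference between the two styles can be sketched in Python as an analogy. A generator stands in for a subgraph that passes records downstream as they arrive, while plain functions stand in for jobflow steps that run to completion and hand over only a small status dictionary. Python generators interleave rather than run in parallel, so this models the data exchange pattern only; all function names and the status values are hypothetical.

```python
def read_numbers(limit):
    """Reader-like source: emits records one at a time."""
    for n in range(limit):
        yield n


def double_subgraph(records):
    """Subgraph-like stage: consumes and emits records as a stream."""
    for r in records:
        yield r * 2


def load_step():
    """Jobflow-like step: runs the whole pipeline, then reports a status."""
    total = sum(double_subgraph(read_numbers(5)))
    return {"status": "ok", "total": total}


def notify_step(previous):
    """Jobflow-like step: reacts to the status of the previous step."""
    if previous["status"] == "ok":
        return f"load succeeded, total={previous['total']}"
    return "load failed"


# Steps run strictly one after another and exchange only status data.
print(notify_step(load_step()))  # prints: load succeeded, total=20
```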

Note

Graphs and subgraphs cannot contain cycles (Jobflow can). Thus subgraphs cannot be called recursively.
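
To see why recursion is ruled out, nested subgraph references can be treated as a directed graph and checked for cycles. The following is a minimal sketch of such a check using depth-first search; it is not how CloverETL itself validates graphs, and the file names and the find_cycle function are hypothetical.

```python
def find_cycle(references):
    """Return one reference cycle as a list of names, or None if there is none.

    `references` maps each graph or subgraph file to the subgraph files it uses.
    """
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {}
    path = []

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for child in references.get(node, []):
            child_state = state.get(child, UNVISITED)
            if child_state == IN_PROGRESS:
                # The child is already on the current path: a cycle.
                return path[path.index(child):] + [child]
            if child_state == UNVISITED:
                found = visit(child)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for name in references:
        if state.get(name, UNVISITED) == UNVISITED:
            cycle = visit(name)
            if cycle:
                return cycle
    return None


# Nesting without recursion is fine.
nested = {"Main.grf": ["A.sgrf"], "A.sgrf": ["B.sgrf"], "B.sgrf": []}
print(find_cycle(nested))     # prints: None

# A subgraph that, directly or indirectly, uses itself forms a cycle.
recursive = {"A.sgrf": ["B.sgrf"], "B.sgrf": ["A.sgrf"]}
print(find_cycle(recursive))  # prints: ['A.sgrf', 'B.sgrf', 'A.sgrf']
```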