ApproximativeJoin

Not available in Community Designer

Deprecated Component

Short Description
Ports
Metadata
ApproximativeJoin Attributes
Details
Best Practices
Compatibility
See also

Short Description

ApproximativeJoin merges sorted data from two data sources on a common matching key. Afterwards, it distributes records to the output based on a user-specified Conformity limit.

Component Same input metadata Sorted inputs Slave inputs Outputs Output for drivers without slave Output for slaves without driver Joining based on equality Auto-propagated metadata
ApproximativeJoin
no
yes
12-4
yes
yes
yes
yes

Icon

Ports

ApproximativeJoin receives data through two input ports, each of which may have a different metadata structure.

The conformity is then computed for matched data records. The records with greater conformity are sent to the first output port. Those with smaller conformity are sent to the second output port. The third output port can optionally be used to capture unmatched master records. The fourth output port can optionally be used to capture unmatched slave records.

Port typeNumberRequiredDescriptionMetadata
Input0
yes
Master input portAny
1
yes
Slave input portAny
Output0
yes
Output port for the joined data with greater conformityAny, optionally including additional fields: _total_conformity_ and _keyName_conformity_. See Metadata.
1
yes
Output port for the joined data with smaller conformityAny, optionally including additional fields: _total_conformity_ and _keyName_conformity_. See Metadata.
2
no
Optional output port for master data records without slave matchesInput 0
3
no
Optional output port for slave data records without master matchesInput 1

Metadata

ApproximativeJoin propagates metadata from the first input port to the third output port (from left to right) and from the second input port to the fourth output port (from left to right).

Additional fields

Metadata on the first and second output ports can contain additional fields of numeric data type. Their names must be the following: "_total_conformity_" and some number of "_keyName_conformity_" fields.

In the last field names, you must use the field names of the Join key attribute as the keyName in these additional field names. To these additional fields the values of computed conformity (total or that for keyName) will be written.

ApproximativeJoin Attributes

AttributeReqDescriptionPossible values
Basic
Join keyyesKey according to which the incoming data flows with the same value of Matching key are compared and distributed between the first and the second output port. Depending on the specified Conformity limit. See Join key.  
Matching keyyesThis key serves to match master and slave records. See Matching key.  
Transform [1] Transformation in CTL or Java defined in the graph for records with greater conformity.  
Transform URL[1]External file defining the transformation in CTL or Java for records with greater conformity. 
Transform class[1]External transformation class for records with greater conformity. 
Transform for suspicious [2] Transformation in CTL or Java defined in the graph for records with smaller conformity. 
Transform URL for suspicious[2]External file defining the transformation in CTL or Java for records with smaller conformity. 
Transform class for suspicious[2]External transformation class for records with smaller conformity. 
Conformity limit (0,1) This attribute defines the limit of conformity for pairs of records. To the records with conformity higher than this value the transformation is applied, to those with conformity less than this value, the transformation for suspicious is applied. See Conformity limit 0.75 (default) | between 0 and 1
Transform source charset 

Encoding of external file defining the transformation.

The default encoding depends on DEFAULT_SOURCE_CODE_CHARSET in defaultProperties.

E.g. UTF-8
Deprecated
Locale Locale to be used when internationalization is used. 
Case sensitive If set to true, upper and lower cases of characters are considered different. By default, they are processed as if they were equal to each other. false (default) | true
Error actions Definition of the action that should be performed when the specified transformation returns some Error code. See Return Values of Transformations.  
Error log URL of the file to which error messages for specified Error actions should be written. If not set, they are written to Console.  
Slave override key In older versions of CloverETL, slave part of Join key. Join key was defined as the sequence of individual expressions consisting of master field names each of them was followed by parentheses containing the 6 parameters mentioned below. These individual expressions were separated by semicolon. The Slave override key was a sequence of slave counterparts of the master Join key fields. Thus, in the case mentioned above, Slave override key would be fname;lname, whereas Join key would be first_name(3 0.8 true false false false);last_name(4 0.2 true false false false).  
Slave override matching key In older versions of CloverETL, slave part of Matching key. Matching key was defined as a master field name. Slave override matching key was its slave counterpart. Thus, in the case mentioned above ($masterField=$slaveField), Slave override matching key would be this slaveField only. And Matching key would be this masterField.  

[1]  One of these must be set. These transformation attributes must be specified. Any of these transformation attributes must use a common CTL template for Joiners or implement a RecordTransform interface.

[2]  One of these must be set. These transformation attributes must be specified. Any of these transformation attributes must use a common CTL template for Joiners or implement a RecordTransform interface.

See CTL Scripting Specifics or Java Interfaces for more information.

See also Defining Transformations for detailed information about transformations.

Details

ApproximativeJoin is a fuzzy joiner that is usually used in quite special situations. It requires the input to be sorted and is very fast as it processes data in the memory. However, it should be avoided in case of large inputs as its memory requirements may be proportional to the size of the input.

The data attached to the first input port is called master as in the other Joiners. The second input port is called slave.

Unlike other joiners, this component uses two keys for joining. First of all, the records are matched in a standard way using Matching Key. Each pair of these matched records is then reviewed again and the conformity (similarity) of these two records is computed using Join key and a user-defined algorithm. The conformity level is then compared to Conformity limit and each record is sent either to the first (greater conformity) or to the second output port (smaller conformity). The rest of the records is sent to the third and fourth output port.

Join key

You can define the Join key with the help of the Join key wizard. When you open the Join key wizard, you can see two tabs: Master key tab and Slave key tab.

Join Key Wizard (Master Key Tab)

Figure 62.1. Join Key Wizard (Master Key Tab)


In the Master key tab, you must select the driver (master) fields in the Fields pane on the left and drag and drop them to the Master key pane on the right. (You can also use the arrow buttons.)

Join Key Wizard (Slave Key Tab)

Figure 62.2. Join Key Wizard (Slave Key Tab)


In the Slave key tab, you can see the Fields pane (containing all slave fields) on the left and the Key mapping pane on the right.

You must select some of these slave fields and drag and drop them to the Slave key field column at the right from the Master key field column (containing the master fields selected in the Master key tab in the first step). In addition to these two columns, there are other six columns that should be defined: Maximum changes, Weight and the last four representing strength of comparison.

Maximum Changes

The maximum changes property contains the integer number that is equal to the number of letters that should be changed so as to convert one data value to another value. The maximum changes property serves to compute the conformity. The conformity between two strings is 0, if more letters must be changed so as to convert one string to the other.

Weight

The weight property defines the weight of the field in computing the similarity. Weight of each field difference is computed as the quotient of the weight defined by user and the sum of the weights defined by user.

Strength of Comparison

The strength of comparison can be identical, tertiary, secondary or primary.

identical

Only identical letters are considered equal.

tertiary

Upper and lower case letters are considered equal.

secondary

Diacritic letters and their Latin equivalents are considered equal.

primary

Letters with additional features such as a peduncle, pink, ring and their Latin equivalents are considered equal.

In the wizard, you can change any boolean value of these columns by simply clicking. This switches true to false, and vice versa. You can also change any numeric value by clicking and typing the desired value.

When you click OK, you will obtain a sequence of assignments of driver (master) fields and slave fields preceded by dollar sign and separated by semicolon. Each slave field is followed by parentheses containing six mentioned parameters separated by white spaces. The sequence will look like this:

$driver_field1=$slave_field1(parameters);...;$driver_fieldN=$slave_fieldN(parameters)
An Example of the Join Key Attribute in ApproximativeJoin Component

Figure 62.3. An Example of the Join Key Attribute in ApproximativeJoin Component


Example 62.1. Join Key for ApproximativeJoin

$first_name=$fname(3 0.8 true false false false);$last_name=$lname(4 0.2 true false false false). In this Join key, first_name and last_name are fields from the first (master) data flow and fname and lname are fields from the second (slave) data flow.


Matching key

The Matching key is defined using the Matching key wizard. You only need to select the desired master (driver) field in the Master key pane on the left and drag and drop it to the Master key pane on the right in the Master key tab. (You can also use the provided arrow buttons.)

Matching Key Wizard (Master Key Tab)

Figure 62.4. Matching Key Wizard (Master Key Tab)


In the Slave key tab, you must select one of the slave fields in the Fields pane on the left and drag and drop it to the Slave key field column at the right from the Master key field column (containing the master field the Master key tab) in the Key mapping pane.

Matching Key Wizard (Slave Key Tab)

Figure 62.5. Matching Key Wizard (Slave Key Tab)


Example 62.2. Matching Key

Matching key looks like this:

$master_field=$slave_field

Conformity limit

You have to define the limit of conformity (Conformity limit (0,1)). The defined value distributes incoming records according to their conformity. The conformity can be greater or smaller than the specified limit. You have to define transformations for either group. The records with smaller conformity are marked "suspicious" and sent to port 1, while records with higher conformity go to port 0 ("good match").

The conformity calculation is a challenge so let us try to explain at least in basic terms. First, groups of records are made based on Matching key. Afterwards, all records in a single group are compared to each other according to the Join Key specification. The strength of comparison selected in particular Join key fields determines what "penalty" characters get (for comparison strength, see Join key):

  • Identical - is a character-by-character comparison. The penalty is given for each different character (similar to String.equals()).

  • Tertiary - ignores differences in lower/upper case (similar to String.equalsIgnoreCase()), if it is the only comparison strength activated. If activated together with Identical, then a difference in diacritic (e.g. 'c' vs. ' ') is a full penalty and a difference in case (e.g. 'a' vs. 'A') is half a penalty.

  • Secondary - a plain letter and its diacritic derivatives for the same language are considered equal. The language used during comparison is taken from the metadata on the field. When no metadata is set on the field, it is treated as en and should work identically to Primary (i.e. derivatives are treated as equal).

    Example:

    language=sk: 'a', 'á', 'ä' are equal because they are all Slovak characters

    language= sk: 'a', ' ' are different because ' ' is a Polish (and not Slovak) character

  • Primary - all diacritic-derivatives are considered equal regardless of language settings.

    Example:

    language=any: 'a', 'á', 'ä', ' ' are equal because they are all derivatives of 'a'

As a final step, the total conformity is calculated as a weighted average of field conformities.

CTL Scripting Specifics

When you define your join attributes you must specify a transformation that maps fields from input data sources to the output. This can be done using the Transformations tab of the Transform Editor. However, you may find that you are unable to specify more advanced transformations using this easiest approach. This is when you need to use CTL scripting.

For detailed information about Clover Transformation Language see Part X, CTL2 - CloverETL Transformation Language.

CTL scripting allows you to specify custom field mapping using the simple CTL scripting language.

All Joiners share the same transformation template which can be found in CTL Templates for Joiners.

Java Interfaces

If you define your transformation in Java, it must implement the following interface that is common for all Joiners:

Java Interfaces for Joiners

Best Practices

If you specify the transformation in an external file (with Transform URL), we recommend you to explicitly specify Transform source charset.

Compatibility

ApproximativeJoin was deprecated in CloverETL 4.1.0-M1.

See also

Common Properties of Components
Specific Attribute Types
Common Properties of Joiners
Joiners Comparison
Chapter 62, Deprecated