Ab Initio Component | CONCATENATE

 Purpose of CONCATENATE


  • CONCATENATE appends multiple flow partitions of records one after another. The in port for CONCATENATE is ordered.


Parameters for CONCATENATE

  • CONCATENATE has no parameters.


Runtime behavior of CONCATENATE

CONCATENATE performs the following operations:

1. It reads all records from the first flow connected to the in port (counting from top to bottom on the graph) and copies them to the out port.

2. It then reads all records from the next flow connected to the in port and appends them to the records from the previously processed flow.


    It repeats step 2 for each subsequent flow connected to the in port.
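The ordering described above can be sketched in Python. This is only an illustration of the behavior, not Ab Initio code: the function name `concatenate` and the sample flows are hypothetical, and each flow is modeled as a plain list of records.

```python
from itertools import chain


def concatenate(*flows):
    """Yield every record of the first flow, then the second, and so on."""
    # chain() exhausts each iterable completely before moving to the next,
    # which mirrors the ordered in port: flows are read top to bottom.
    return chain(*flows)


# Hypothetical sample flows, listed in the order they are connected.
flow_a = [{"id": 1}, {"id": 2}]
flow_b = [{"id": 3}]
flow_c = [{"id": 4}, {"id": 5}]

result = list(concatenate(flow_a, flow_b, flow_c))
print([r["id"] for r in result])  # [1, 2, 3, 4, 5]
```

Records inside each flow keep their order, and no record of a later flow appears before the previous flow is finished.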
    
    


Ab Initio Component | BROADCAST

 Purpose of BROADCAST


  • BROADCAST combines all the records it receives, in arbitrary order, into a single flow and writes a copy of that flow to each of its output flow partitions.

  • You can use BROADCAST to increase data parallelism when you have connected a single fan-out flow to the out port, or to increase component parallelism when you have connected multiple straight flows to the out port.


Parameters for BROADCAST


  • BROADCAST has no parameters.


Runtime behavior of BROADCAST


BROADCAST does the following:

    1. It reads records from all flows on the in port.

    2. Then combines the records in an arbitrary order into a single flow.

    3. Then copies all the records to all the flow partitions connected to the out port.
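
As a rough model of these three steps, the Python sketch below treats each flow as a list. The name `broadcast` and its `num_partitions` argument are hypothetical; the real component combines inputs in an arbitrary order, while this sketch simply uses the order in which the flows are passed.

```python
def broadcast(in_flows, num_partitions):
    """Combine all input flows, then give every output partition a full copy."""
    combined = []
    for flow in in_flows:
        # Steps 1 and 2: read every flow and merge the records into one flow.
        combined.extend(flow)
    # Step 3: each output partition receives its own copy of all the records.
    return [list(combined) for _ in range(num_partitions)]


partitions = broadcast([["r1", "r2"], ["r3"]], num_partitions=3)
for number, records in enumerate(partitions):
    print(number, records)  # every partition holds r1, r2 and r3
```

The point to notice is that the data volume is multiplied by the number of output partitions: every partition sees every record.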

Ab Initio Component | REPLICATE

Purpose of REPLICATE


  • REPLICATE arbitrarily combines all records it receives into a single flow and writes a copy of that flow to each of its output flows. Use REPLICATE to support component parallelism — such as when you want to perform more than one operation on a flow of records coming from an active component.

  • The input port of REPLICATE has an implicit gather. If this port receives more than one input stream, REPLICATE does not preserve sort order.

  • REPLICATE is not required when you want to read data from an INPUT FILE component on multiple flows. Rather, you can connect multiple flows directly to the INPUT FILE. The downstream components each read the data independently, which sometimes results in improved graph performance.

Parameters for REPLICATE

  • REPLICATE has no parameters.

Runtime behavior of REPLICATE

REPLICATE performs the following operations:

1. It arbitrarily combines the records from all the flows on the in port into a single flow.

2. It then copies that flow to all the flows connected to the out port.

Note:

REPLICATE does not support implicit reformat, so you cannot use it to change the record format associated with a particular flow. For that reason, you must make the record format of the in and out ports identical. If you do not, execution of the graph stops when it reaches REPLICATE.

Ab Initio Component | LEADING RECORDS

Purpose of LEADING RECORDS


  • LEADING RECORDS copies a specified number of records from its in port to its out port, counting from the first record in the input flow.


Parameters for LEADING RECORDS


num_records (integer, required)


  • This parameter specifies the number of records to copy from the in port to the out port. If you enter a value of -1, it specifies that all records appearing on the component’s in port should be passed through to its out port.


early-close (boolean, required)


  • When this parameter is set to True, LEADING RECORDS closes its output immediately after reaching the value specified in num_records, which speeds downstream processing. In a non-continuous graph, always set early-close to True.


    Default is False. Use the default (False) only for continuous graphs.
    

Runtime behavior of LEADING RECORDS

  • When this component is connected to an INPUT FILE component, LEADING RECORDS has a useful optimization: it stops reading the input file as soon as it reaches the specified number of records. To get this optimization, make sure the INPUT FILE and the LEADING RECORDS use the same layout.

  • When this component is connected to an INPUT TABLE component, this optimization does not apply. With an INPUT TABLE, it is most efficient to select the desired records using a SELECT statement.

  • LEADING RECORDS does not have a port to which to send unused records.

  • LEADING RECORDS supports implicit reformat. 
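
The num_records behavior and the "stop reading early" optimization can be pictured with a small Python sketch. Everything here is illustrative: `leading_records` and `counting_source` are hypothetical names, and a generator stands in for the INPUT FILE so that we can count how many records were actually read.

```python
from itertools import islice


def leading_records(records, num_records):
    """Yield the first num_records records; -1 passes every record through."""
    if num_records == -1:
        yield from records
    else:
        # islice stops pulling from the source once num_records were yielded.
        yield from islice(records, num_records)


def counting_source(total, reads):
    """Hypothetical input that notes each record it hands out."""
    for i in range(total):
        reads.append(i)
        yield i


reads = []
first_three = list(leading_records(counting_source(1000, reads), 3))
print(first_three)  # [0, 1, 2]
print(len(reads))   # 3
```

Only three of the thousand available records are read, which is the same idea as LEADING RECORDS no longer reading the input file once it has enough records.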

Ab Initio Component | Filter by Expression: Part 2

 ...continue from part 1....


ramp (real, required)

  • This parameter defines the rate at which reject events are tolerated, relative to the number of records processed.

  •  When the reject-threshold parameter is set to Use limit/ramp, the component uses the values of the ramp and limit parameters in a formula to determine the component’s tolerance for reject events.

    Default is 0.0.
 

logging (boolean, optional)

  • This parameter  specifies whether the component logs certain events.  
 

log_input  (choice, optional)

  • This parameter specifies how often the component sends an input record to its log port. The logging parameter must be set to True for this parameter to be available.

  • For example, if you select 100, the component sends every 100th input record to its log port.
 

log_output   (choice, optional)

  • This parameter specifies how often the component sends an output record to its log port. The logging parameter must be set to True for this parameter to be available.

  • For example, if you select 100, the component sends every 100th output record to its log port.

log_reject   (choice, optional)

  • This parameter specifies how often the component sends a reject record to its log port. The logging parameter must be set to True for this parameter to be available.

 

Runtime behavior of FILTER BY EXPRESSION


FILTER BY EXPRESSION performs the following operations:

    1. It reads data records from the in port.

    2. If the use_package parameter is false, it applies the expression in the select_expr parameter to each record. It routes records as follows, based on how the expression evaluates:

  • For a non-0 value, FILTER BY EXPRESSION writes the record to the out port.

  • For 0, FILTER BY EXPRESSION writes the record to the deselect port. If you do not connect a flow to the deselect port, FILTER BY EXPRESSION discards the records.

  • For NULL, FILTER BY EXPRESSION writes the record to the reject port and a descriptive error message to the error port.

    3. If the use_package parameter is true, it executes the functions defined in the package:

  • If the select function returns 1, the component writes the record to the out port.

  • If the select function returns 0, the component writes the record to the deselect port.


    4. If output_for_error or make_error is defined, it executes them whenever an error event occurs. If log_error is defined and logging of rejects is turned on, it executes log_error.

    5. FILTER BY EXPRESSION stops execution of the graph according to the reject-threshold parameter. If its value is Use limit/ramp, the graph stops when the number of reject events exceeds the result of the following formula:

        limit + (ramp *  number_of_records_processed_so_far)
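
To make the routing and the limit/ramp formula concrete, here is a minimal Python sketch. It is not DML and not how the component is implemented: `filter_by_expression` is a hypothetical function, Python's `None` stands in for NULL, and counting the current record in number_of_records_processed_so_far is an assumption.

```python
def filter_by_expression(records, select_expr, limit=0, ramp=0.0):
    """Route records to out / deselect / reject lists, as described above."""
    out, deselect, reject = [], [], []
    for processed, record in enumerate(records, start=1):
        value = select_expr(record)
        if value is None:
            # NULL result: a reject event.
            reject.append(record)
            # Stop when rejects exceed limit + (ramp * records processed).
            if len(reject) > limit + ramp * processed:
                raise RuntimeError("reject threshold exceeded")
        elif value:
            out.append(record)       # non-0 result goes to the out port
        else:
            deselect.append(record)  # 0 result goes to the deselect port
    return out, deselect, reject


# Hypothetical expression: select amounts above 100; a missing amount is NULL.
def amount_over_100(record):
    if record["amount"] is None:
        return None
    return int(record["amount"] > 100)


records = [{"amount": 150}, {"amount": 50}, {"amount": None}]
out, deselect, reject = filter_by_expression(records, amount_over_100, limit=1)
print(len(out), len(deselect), len(reject))  # 1 1 1
```

With limit=1 and ramp=0.0 a single reject is tolerated. With the defaults (limit 0, ramp 0.0) the same input would stop on the first reject, because 1 reject exceeds a threshold of 0.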

Ab Initio Component | Filter by Expression: Part 1

 Purpose of FILTER BY EXPRESSION

  • FILTER BY EXPRESSION filters records according to a DML expression or transform function, which specifies the selection criteria.
  • FILTER BY EXPRESSION can also be used to create a subset, or sample, of the data. For example, you can configure FILTER BY EXPRESSION to select a certain percentage of records, or to select every third (or fourth, or fifth, and so on) record. 
 

Parameters for FILTER BY EXPRESSION


select_expr   (expression, required when use_package is false)
 
  • This parameter filters records according to the DML expression you specify.

use_package  (boolean, optional)

  • This parameter controls whether the component uses the select_expr parameter or the package to specify the filter criteria. When the value is true, it uses the package.
  • When false (the default), you may still use the package to customize the component’s handling of error and log information.

package   (filename or embedded string, optional)

  • This parameter is a package that can include a select function (required when use_package is true). It also allows you to customize the component’s handling of error and log information.
 
error_group  (string, optional)

  • This parameter defines the name of the error group to which this component belongs. The component sends its error output to the HANDLE ERRORS component with a matching error_group value.
    
log_group  (string, optional)

  • This parameter defines the name of the log group to which this component belongs. The component sends its log output to the HANDLE LOGS component with a matching log_group value.

reject-threshold  (choice, required)

  • This parameter specifies the component’s tolerance for reject events.

limit  (integer, required)

  • This parameter sets the absolute number of reject events the component tolerates.

  • When the reject-threshold parameter is set to Use limit/ramp, the component uses the values of the ramp and limit parameters in a formula to determine the component’s tolerance for reject events.

    Default is 0.

 



Ab Initio Component | SCAN:Part 3

 ....Continue from Part 2.....

Runtime behavior of SCAN

 

SCAN performs the following operations for each group of records:

 

1. Performing Input selection:

  • If you have defined the input_select function, SCAN filters the input records accordingly.

  • If you have not defined the input_select function in your transform, SCAN processes all records.

2. Performing Key change (for sorted input only):

  • For every record except the first, SCAN checks whether a key change has occurred:

  • SCAN compares the current record’s key value to the previous record’s key value, unless the key_change function is defined.

  • If the key_change function is defined, SCAN calls that function to check for a key change.

3. Performing Temporary initialization:

  • SCAN passes the first record in each group to the initialize transform function.

4. Performing Computation:

  • SCAN calls the scan transform function for each record in a group, including the first, using the input record and the temporary record for the group to which the input record belongs. The scan transform function returns a new temporary record.

5. Finalizing the output:

  • SCAN calls the finalize transform function once for every input record. SCAN passes the input record and the temporary record that the scan function returned to the finalize transform function. The finalize transform function produces an output record for each input record.

  • SCAN stops execution of the graph when the number of reject events exceeds the result of the following formula:

           limit+(ramp* number_of_records_processed_so_far)

6. Output selection:

  • If you have defined the output_select transform function, SCAN filters the output records.
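
The initialize / scan / finalize sequence for sorted input can be sketched in Python as follows. This is an illustration only: the function `scan` and the callback names are hypothetical stand-ins for the transform functions, key changes are detected with `itertools.groupby`, and input_select, output_select and reject handling are left out.

```python
from itertools import groupby


def scan(records, key, initialize, scan_fn, finalize):
    """Emit one output record per input record, with a running temporary."""
    for _, group in groupby(records, key=key):  # a new group on each key change
        first = True
        for record in group:
            if first:
                temp = initialize(record)  # first record of the group
                first = False
            temp = scan_fn(temp, record)   # updated temporary record
            yield finalize(temp, record)   # called for every input record


# Hypothetical sample data, already sorted on the key field "cust".
records = [
    {"cust": "a", "amount": 10},
    {"cust": "a", "amount": 5},
    {"cust": "b", "amount": 7},
]

running_totals = list(
    scan(
        records,
        key=lambda r: r["cust"],
        initialize=lambda r: 0,
        scan_fn=lambda temp, r: temp + r["amount"],
        finalize=lambda temp, r: (r["cust"], temp),
    )
)
print(running_totals)  # [('a', 10), ('a', 15), ('b', 7)]
```

The total restarts at the key change from "a" to "b", and there are as many output records as input records.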

Ab Initio Component | SCAN:Part 2

 .....Continued from Part 1.....


maintain-order(boolean, required)

  • This parameter is available only when the sorted-input parameter is set to False.

  • When the input is too large to fit within the memory limit specified by max-core, the maintain-order parameter, when set to True, stops the graph, ensuring that records are not reordered.

  • When the parameter is set to False (the default), the component stores some of its intermediate results in temporary files on disk. This alters the order of records.

Default is False.

grouped-input (boolean, optional)

  • This parameter is available only when the sorted-input parameter is set to False.

  • Set this parameter to Data is grouped by a major key in order to specify the major-key by which the input is sorted or grouped. In this case, the key parameter becomes the minor key: it is the field (or fields) to be scanned.

  • When you specify a major key, SCAN is more efficient in its use of memory: SCAN clears its in-memory table of intermediate results at the end of each major-key group of input records.

Default is Data is not grouped by a major key.

major-key(key specifier, optional)

  • This parameter is available only when the grouped-input parameter is set to Data is grouped by a major key. Specifies a field or set of fields by which the input data is sorted or grouped. 

 check-sort(boolean, optional)

  • This parameter is available only when the sorted-input parameter is set to True and the key-method parameter is set to Use key specifier.

  • This parameter indicates whether the component should fail when it first encounters an input record that is out of sorted order. Setting this parameter to False effectively treats every key change as a change in group.

Default is True.
 

reject-threshold (choice, required)

  • Specifies the component’s tolerance for reject events.

Ab Initio Component | SCAN:Part 1

 Purpose of SCAN

  • For every input record, SCAN generates an output record that consists of a running cumulative summary for the group to which the input record belongs, up to and including the current record.
  • SCAN is similar to ROLLUP. The difference between the two is that SCAN produces one output record for each input record, while ROLLUP produces one output record for each key group.
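
A quick way to see the difference is with plain Python on a single key group. This is only an analogy (the variable names are made up): `itertools.accumulate` behaves like a SCAN that sums a field, and `sum` behaves like the matching ROLLUP.

```python
from itertools import accumulate

# Amounts of one key group, in input order (hypothetical values).
amounts = [10, 5, 7]

scan_like = list(accumulate(amounts))  # one result per input record
rollup_like = sum(amounts)             # one result for the whole group

print(scan_like)    # [10, 15, 22]
print(rollup_like)  # 22
```

The last value of the running summary is the value ROLLUP would produce for the group.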

Two modes to use SCAN 

Like ROLLUP, SCAN can be used in template mode and expanded mode:

  • Template mode — You define a simple scan function that typically includes aggregation functions.

  • Expanded mode — You create a transform using an expanded scan package. This mode allows for scans that do not necessarily use regular aggregation functions. 

Parameters for SCAN (Not all Parameters are covered)

 

sorted-input(boolean, required)
  • This parameter specifies whether the component accepts unsorted (or ungrouped) input.
  • If you want to process ungrouped input/data, set this parameter to False.
Default is True.
 
key-method(choice, optional)

  • This parameter specifies how the component determines the boundary between one group of records and the next. The choices are as follows:
  • Use key specifier — The component uses one or more of the fields in the input record as the grouping key.
  • Use key_change function — Instead of using fields from the input record to group the input, the component uses the key_change transform function to determine when a new group begins. 

 key(key specifier, required when key-method is Use key specifier)

  • This parameter contains the names of the key fields that the component can use to group or define groups of records. 
 
transform (filename or string, required)
  • This parameter contains either the name of the file containing the types and transform functions, or a transform string. 
 
 max-core (integer, required)

  • This parameter defines the maximum memory usage in bytes.
  • This parameter is available only when the sorted-input parameter is set to False.
  • If the total size of the intermediate results that the component holds in memory exceeds the number of bytes specified in the max-core parameter, the component writes temporary files to disk.

Default is 67108864 bytes (64 MB).

 

Ab Initio Component | ROLLUP : Part 3

 ....Continue from Part 2....

 

Functions used in expanded mode 

  • Expanded mode provides more control over the transform. It lets you edit the expanded package, so you can specify transformations that are not possible with template mode.

  • An expanded ROLLUP package must define the following items:

  • A DML type named temporary_type

  • An initialize function that returns a temporary_type record

  • A rollup function that takes two input arguments (an input record and a temporary_type record) and returns an updated temporary_type record

  • A finalize function that returns an output record

     

Runtime behavior of ROLLUP 

ROLLUP performs the following operations for each group of records: 

1. Performing Input selection:

  • If you have not defined the input_select function in your transform, ROLLUP processes all records.

  • If you have defined the input_select function, ROLLUP filters the input records accordingly.

2. Performing Key change (for sorted input only):

  • For every record except the first, ROLLUP checks whether a key change has occurred:

  • ROLLUP compares the current record’s key value to the previous record’s key value, unless the key_change function is defined.

  • If the key_change function is defined, ROLLUP calls that function to check for a key change.

3. Temporary initialization:

  • ROLLUP passes the first record in each group to the initialize transform function.

4. Performing Computation:

  • ROLLUP calls the rollup transform function for each input record.
  • The input to the rollup transform function is the input record and the temporary record for the group to which the input record belongs.
  • The rollup transform function returns an updated temporary record for that input group. 

5. Finalizing the output:

With sorted-input set to True:

  • ROLLUP calls the finalize transform function after it processes all the input records in each group.

  • ROLLUP passes the temporary record for the group and the last input record in the group to the finalize transform function.

  • The finalize transform function produces an output record for the group.

Note:

  • With sorted-input set to False, ROLLUP first processes all the input records. It then calls the finalize transform function with the temporary record for each group and an arbitrary input record from each group as arguments.

  • ROLLUP repeats this procedure with each group.

  • The finalize transform function then produces an output record for each group.

  • The component stops the execution of the graph when the number of reject events exceeds the result of the following formula:

limit+(ramp* number_of_records_processed_so_far)

6. Output selection:

  • If you have defined the output_select transform function, it filters the output records.
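
For sorted input, the steps above can be modeled with a short Python sketch. It is an illustration under assumptions, not the component itself: `rollup` and its callbacks are hypothetical names for the initialize, rollup and finalize transform functions, and input_select, output_select and reject handling are omitted.

```python
from itertools import groupby


def rollup(records, key, initialize, rollup_fn, finalize):
    """Emit one output record per key group (input sorted on the key)."""
    for _, group in groupby(records, key=key):  # a new group on each key change
        first = True
        for record in group:
            if first:
                temp = initialize(record)  # first record of the group
                first = False
            temp = rollup_fn(temp, record)  # updated temporary record
            last = record
        # After the whole group: temporary record plus last input record.
        yield finalize(temp, last)


# Hypothetical sample data, already sorted on the key field "cust".
records = [
    {"cust": "a", "amount": 10},
    {"cust": "a", "amount": 5},
    {"cust": "b", "amount": 7},
]

totals = list(
    rollup(
        records,
        key=lambda r: r["cust"],
        initialize=lambda r: 0,
        rollup_fn=lambda temp, r: temp + r["amount"],
        finalize=lambda temp, r: (r["cust"], temp),
    )
)
print(totals)  # [('a', 15), ('b', 7)]
```

Three input records in two groups give two output records, one per group.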

Ab Initio Component | ROLLUP : Part 2

 ...Continue from part 1......

max-core(integer, required)

  • This parameter defines the maximum memory usage in bytes.

  • It is available only when the sorted-input parameter is set to False.

  • If the total size of the intermediate results that the component holds in memory exceeds the number of bytes specified in the max-core parameter, the component writes temporary files to disk.

Default is 67108864 (64 MB).

reject-threshold(choice, required)

  • Specifies the component’s tolerance for reject events, that is, how many reject events the component accepts before it aborts.

check-sort(boolean, optional)

  • This parameter is available only when the sorted-input parameter is set to True and the key-method parameter is set to Use key specifier.

 Difference between using unsorted and sorted data

With unsorted data

  • When the input data is not sorted (and the sorted-input parameter is set to False), the function outputs an arbitrary record from each group. This might not be particularly useful.
  • To get the first or last record in the unsorted data, you can use the first or last aggregation function.

With sorted data

  • When the input data is sorted (and the sorted-input parameter is set to True), the function outputs the last record from each group.
  • In this case, the function is equivalent to the following, which uses the last aggregation function 


Ab Initio Component | ROLLUP : Part 1

 Purpose of ROLLUP

  • ROLLUP is used to process groups of input records that have the same key, generating one output record for each group. 

  • Typically, the output record summarizes or aggregates the data in some way; for example, a simple ROLLUP can be used to calculate a sum or average of one or more input fields.

  • ROLLUP can also be used to select certain information from each group; for example, it might output the largest value in a field, or accumulate a vector of values that conform to specific criteria.

Two modes to use ROLLUP

You can use a ROLLUP component in two modes, depending on how you define the transform parameter:

1. Template mode — You define a simple rollup function that may include aggregation functions. Template mode is the most common/simple way to use ROLLUP.

2. Expanded mode — You create a transformation using an expanded rollup package. This mode allows for rollups that do not necessarily use regular aggregation functions.


Parameters for ROLLUP (Not all parameters are covered.)

 sorted-input(boolean, required)

  • This parameter specifies whether the component accepts unsorted (or ungrouped) input.

  • If you want to process ungrouped input, set this parameter to False.

Default is True.

key-method (choice, optional)

  • This parameter specifies how the component determines the boundary between one group of records and the next. The choices are as follows:

1. Use key specifier — The component uses one or more of the fields in the input record as the grouping key.

2. Use key_change function — Instead of using fields from the input record to group the input, the component uses the key_change transform function to determine when a new group begins.


Default is Use key specifier.

key(key specifier, required when key-method is Use key specifier)

  • This parameter contains the name(s) of the key fields that the component can use to group or define groups of records.

transform (filename or string, required)

  • This parameter contains either the name of the file containing the types and transform functions, or a transform string.

output_without_input(choice, optional)

  • This parameter specifies the event that, when received, triggers the component to call the output_without_input function, if no input records have been received since the last such event or since the component started. The choices are as follows:

Never — The function will not be called.

At each computepoint — The function is called at each computepoint event.

At each checkpoint — The function is called at each checkpoint event.

At component shutdown — The function is called when the component is shutdown.

Default is Never.

Ab Initio Component | JOIN: Part 3

 .....Continue from part 2.....


maintain-order  (boolean, required)

  • Set to True to ensure that records remain in the original order of the driving input. (The driving input is the largest input, as specified by the driving parameter.)
  • Available only when the sorted-input parameter is set to False. If the sorted-input parameter is set to True and all inputs are sorted on the fields given in the key parameter, the output maintains the sort order on that key without the use of this parameter.
  • If any inputs other than the driving input are too large to fit within the memory limit specified by max-core, the behavior of the component depends on the setting of maintain-order:
  • False — The component stores some of its intermediate results in temporary files on disk. This alters the order of records in the driving input.
  •  True — The component stops execution of the graph.
   
    Default is False.


max-core (integer, required)

  • Maximum memory usage in bytes. Available only when the sorted-input parameter is set to False.
  • If the total size of the non-driving inputs that the component holds in memory exceeds the number of bytes specified in the max-core parameter, the component writes temporary files to disk.
    Default value is 67108864 bytes (64 MB). 


Runtime behavior of JOIN

 

JOIN performs the following operations:

1. Reads data records from multiple inn ports. Depending on the setting of the sorted-input parameter, it does one of the following:

  • If the input is sorted, it reads records in the order in which they arrive.
  • If the input is unsorted, it loads all records from all inputs except the driving input into main memory. Once the non-driving inputs are loaded, it reads records from the driving input in the order in which they arrive.

2. Applies the expression in any defined selectn parameter to the records on the corresponding inn port:

  • If the expression evaluates to 0 for a record, JOIN does not process the record, and the record does not appear on any output port.
  • If the expression evaluates to anything other than 0 or NULL for a record, JOIN processes the record.
  • If you do not supply an expression for a selectn parameter, JOIN processes all the records on the corresponding inn port.

3. Removes any duplicate records that have made it through the select, if the dedupn parameter is set to True.

4. Operates on records that have matching key values using a multi-input transform function.

If the transform function returns NULL, then JOIN:
  • Writes each input record to the corresponding rejectn port, then stops execution of the graph when the number of reject events exceeds the result of the following formula:

        limit + (ramp * number_of_records_processed_so_far)
  • Writes an error message to the corresponding errorn port. If no flows are connected to the rejectn or errorn ports, JOIN discards the information.

5. Writes the non-NULL return record from the transform function to the out port.
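
The unsorted case (non-driving input loaded into memory, driving input streamed) can be sketched in Python for two inputs. This is a simplified model under assumptions: `join_unsorted` is a hypothetical name, only an inner join is shown, `None` stands in for a NULL transform result, and the selectn, dedupn, unused, reject and error handling are left out.

```python
from collections import defaultdict


def join_unsorted(driving, other, key, transform):
    """Inner-join two inputs: hold `other` in memory, stream `driving`."""
    lookup = defaultdict(list)
    for record in other:
        # Step 1: the non-driving input is loaded completely into memory.
        lookup[key(record)].append(record)
    for rec0 in driving:
        # The driving input is read in arrival order.
        for rec1 in lookup.get(key(rec0), []):
            result = transform(rec0, rec1)  # records with matching keys
            if result is not None:          # only non-NULL results go to out
                yield result


# Hypothetical sample data: orders is the (larger) driving input.
orders = [
    {"cust_id": 2, "amount": 30},
    {"cust_id": 1, "amount": 10},
    {"cust_id": 3, "amount": 5},
]
customers = [{"cust_id": 1, "name": "Ann"}, {"cust_id": 2, "name": "Bob"}]

joined = list(
    join_unsorted(
        orders,
        customers,
        key=lambda r: r["cust_id"],
        transform=lambda o, c: (c["name"], o["amount"]),
    )
)
print(joined)  # [('Bob', 30), ('Ann', 10)]
```

The output follows the order of the driving input, and the order for cust_id 3 produces nothing because there is no matching customer. Holding only the smaller input in memory is the reason the driving input is normally the largest one.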

 






Ab Initio Component | JOIN: Part 2

 .....Continue from part 1....

record-requiredn (boolean, required)

  • This parameter is available only when the parameter-interface parameter is set to legacy (or in a pre-Version 3.2.1 JOIN component) and the join-type parameter is set to Explicit.

    The default is True. 

record-match-requiredn (boolean, required)

  • This parameter is available only when the join-type parameter is set to Explicit.

  • It  is used to specify whether a record is required or whether to substitute a null for a missing record.

    The default is True.     

    To use this parameter, note the following points:

  •  When there are two inputs, set record-match-requiredn to True on the input port for which you want to call the transform for every record, regardless of whether there is a matching record on the other input port.

  •  When there are more than two inputs, set record-match-requiredn to True when you want to call the transform only when there are records with matching keys on all input ports for which record-match-requiredn is True.

 dedupn(boolean, required)

  • Set the dedupn parameter to Dedup this input before joining to remove duplicates from the corresponding inn port before joining. This allows you to choose only one record from a group with matching key values as the argument to the transform function.

  • There is one dedupn parameter associated with each inn port. Unused duplicates are sent to the unusedn port.

    Default is Do not dedup this input.

selectn  (expression, optional)

  • Filters records before the join. There is one selectn parameter per inn port; n represents the number of the in port. If you use selectn with dedupn, the JOIN component performs the select first, then removes the duplicate records that made it through the select. 

 max-memory (integer, required)

  • Maximum memory usage in bytes before the component writes temporary files to disk. Available only when the sorted-input parameter is set to True.

    The default value is 8388608 bytes (8 MB).
    

    
check-sort  (boolean, required)

  • Available only when the sorted-input parameter is set to True.

  • If set to True, stops the graph on the first input record that is out of sorted order (according to the key).

  • The default is False. In this case, JOIN does not necessarily stop or issue an error when it encounters unsorted inputs. If sorted input is a requirement, set check-sort to True.

 

 

Ab Initio Component | JOIN : Part 1

 Purpose of JOIN Components

  • JOIN reads data from two or more input ports, combines records with matching keys according to the transform you specify, and sends the transformed records to the output port.
  • Its additional ports can also be used to collect rejected and unused records.

 

Parameters for JOIN (Not all parameters are covered.)


count (integer, required)

  • It is an integer n specifying the total number of inputs (in ports) to join. The number of input ports also determines the number of the following ports and parameters:

        unused ports

        reject ports

        error ports

        record-match-required parameters

        dedup parameters

        select parameters

        override-key parameters

    Default is 2.

    Each in port (always two or more) has a number n appended. Each outn, unusedn, rejectn, and errorn port corresponds to an inn port.
 
 
sorted-input (boolean, required)

  • When this parameter is set to False, the component accepts unsorted input and permits the use of the maintain-order parameter.
  • When this parameter is set to True, the component requires sorted input. In this case, consider setting the check-sort parameter to True.
    Default is True. 

key(key specifier, required)
 
  • Name(s) of the field(s) in the input records that must have matching values for JOIN to call the transform function. The types of the fields in the different inputs must be compatible.
 
transform (filename or string, required)

  • Either the name of the file containing the transform function, or a transform string. 

join-type (choice, required)

    Choose one of the following options:

  • Inner join (default) — Sets the record-match-requiredn parameters for all ports to True. The GDE does not display the record-match-requiredn parameters, because they all have the same value.
  •  Outer join — Sets the record-match-requiredn parameters for all ports to False. The GDE does not display the record-match-requiredn parameters, because they all have the same value.
  •  Explicit — Allows you to set the record-match-requiredn parameter for each port individually.

    If you set the dedupn parameter to True on the driving input, set the join-type parameter to Inner join. (The driving input is the largest input, as specified by the driving parameter.)

    If you remove duplicates on this input port before joining it to the driving input, set the record-match-requiredn parameter to True on all other ports.
 
 
parameter-interface (choice, required)

  • This parameter is available only after you update a pre-Version 3.2.1 JOIN component to Version 3.2.2 or higher. It is not available for new components.
  • Controls whether to use a legacy or improved parameter interface. The choices are the following:

  •  legacy — Displays the record-requiredn parameter whose boolean value specifies whether to use an inner or outer join and whether a record is required or substitute a null for missing records. This parameter has inverted booleans. The default for pre-Version 3.2.1 components.
  • version-3-2-2 — Displays the record-match-requiredn parameter whose boolean value specifies whether to use an inner or outer join. This parameter has normal booleans 

 

 
 
 
 

Ab Initio Component | DEDUP SORTED : Part 2

 ...Continue from Part 1...


check-sort (boolean, optional)

  • Defines whether you want processing to abort on the first record that is out of sorted order. True causes processing to abort on the first record out of order.

    Default is True.

logging (boolean, optional)

  • Defines whether the component logs certain events.

    Default is False.

log_input (choice, optional)

  •  Defines how often the component sends an input record to its log port. Available only when the logging parameter is set to True.


log_output (choice, optional)

  • Defines how often the component sends an output record to its log port. Available only when the logging parameter is set to True.


log_reject (choice, optional)

  • Defines how often the component sends a reject record to its log port. Available only when the logging parameter is set to True.

    

Runtime behavior of DEDUP SORTED

    
DEDUP SORTED performs the following operations:

   1.  Reads a grouped flow of records from the in port.

    2.  If a select expression is specified, does one of the following:

  • If the expression evaluates to 0 for a particular record, does not process the record.
  • If the expression produces NULL for a particular record, writes the record to the reject port and writes a descriptive error message to the error port.
  • If the expression evaluates to anything other than 0 or NULL for a particular record, processes the record.

  • If you do not supply an expression for the select parameter, DEDUP SORTED processes all records on the in port.


    3. Processes groups of records as follows:

  • It considers any consecutive records with the same key value to be in the same group.

  • If a group consists of one record, writes that record to the out port.

  •  If a group consists of more than one record, uses the value of the keep parameter to determine which record — if any — to write to the out port, and which record or records to write to the dup port.

  •  If you have chosen unique-only for the keep parameter, does not write records to the out port from any groups consisting of more than one record.
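
The grouping rules above can be sketched in Python. This is only an illustration of the described behavior, not Ab Initio code: the function name dedup_sorted, the tuple records, and the returned (out, dup) lists are assumptions made for the example, and the select, reject, and error handling is left out.

```python
from itertools import groupby


def dedup_sorted(records, key, keep="first"):
    """Return (out, dup) for records that are already grouped by key."""
    out, dup = [], []
    # groupby merges only *consecutive* records with equal keys,
    # which mirrors how the groups are described above.
    for _, group in groupby(records, key=key):
        group = list(group)
        if len(group) == 1:
            out.append(group[0])      # a one-record group always goes to out
        elif keep == "first":
            out.append(group[0])
            dup.extend(group[1:])
        elif keep == "last":
            out.append(group[-1])
            dup.extend(group[:-1])
        elif keep == "unique-only":
            dup.extend(group)         # nothing from this group reaches out
        else:
            raise ValueError(f"unknown keep value: {keep!r}")
    return out, dup


rows = [("a", 1), ("a", 2), ("b", 3), ("c", 4), ("c", 5)]
first_out, first_dup = dedup_sorted(rows, key=lambda r: r[0], keep="first")
unique_out, unique_dup = dedup_sorted(rows, key=lambda r: r[0], keep="unique-only")
```

With keep set to first, the sketch sends one record per key to out; with unique-only, only the record for key "b" reaches out because it is the only group of size one.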

Ab Initio Component | DEDUP SORTED: Part 1

 Purpose of DEDUP SORTED

  • DEDUP SORTED is used to separate one specified record in each group of records from the rest of the records in the group.

 

Parameters for DEDUP SORTED 

 
key (key specifier, required)
  • Name(s) of the key field(s) the component uses to determine groups of data records.

select (expression, optional)

  • An expression that selects (filters) records before the component separates duplicates.
 
keep (choice, required)

  • Specifies which records the component keeps to write to the out port. Choose one of the following options:

        first — Keeps the first record of a group

        last — Keeps the last record of a group

        unique-only — Keeps only records with unique key values

    The component writes the remaining records of each group to the dup port.

    Default is first.
    
package (transform, optional)

  •  Allows you to define this component’s log- and error-handling functions.
    
error_group  (string, optional)

  • Defines the name of the error group to which this component belongs. The component sends its error output to the HANDLE ERRORS component with a matching error_group value.

log_group  (string, optional)


  • Defines the name of the log group to which this component belongs. The component sends its log output to the HANDLE LOGS component with a matching log_group value.

reject-threshold (choice, optional)

  • The component’s tolerance for reject events.

limit (integer, optional)

  •  A number representing reject events.
  • When the reject-threshold parameter is set to Use limit/ramp, the component uses the values of the ramp and limit parameters in a formula to determine the component’s tolerance for reject events.
    Default is 0.

ramp (real, optional)

  •  Rate of tolerance for reject events in the number of records processed.
  • When the reject-threshold parameter is set to Use limit/ramp, the component uses the values of the ramp and limit parameters in a formula to determine the component’s tolerance for reject events.
    Default is 0.0.

Note:

  • When you set the reject-threshold parameter of a component to Use limit/ramp, the limit and ramp parameters become available. The component then uses the limit and ramp parameters together in a formula to control the component’s tolerance for reject events:
  • The component stops execution of the graph if the number of reject events exceeds the result of the following formula:

limit + (ramp *  number_of_records_processed_so_far)
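
As a worked illustration of this formula, here is a minimal Python sketch. The function name and argument names are hypothetical; only the comparison itself comes from the formula above.

```python
def exceeds_reject_tolerance(rejects, processed, limit=0, ramp=0.0):
    """True when the reject count exceeds limit + ramp * records processed."""
    return rejects > limit + ramp * processed


# With limit=10 and ramp=0.5, after 100 records the tolerance is 10 + 50 = 60.
at_tolerance = exceeds_reject_tolerance(60, 100, limit=10, ramp=0.5)    # False
over_tolerance = exceeds_reject_tolerance(61, 100, limit=10, ramp=0.5)  # True
```

With the default values (limit 0, ramp 0.0) the tolerance is zero, so the very first reject event already exceeds it.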

Ab Initio Component | SORT

 Purpose of SORT Components

 
  • The SORT component sorts and merges records. You can use it to order records before you send them to a component that requires grouped or sorted records.
  • SORT accepts fan-in and all-to-all flows (for partitioned data) on its in port. Because SORT performs a gather operation on its in port, there is no need to gather the data before sending it to SORT.
  • SORT is not guaranteed to be stable: records with identical key values may not keep their relative order after being sorted.

 Note:

  • If the sort key contains NULL values, the records with NULL keys are listed first in ascending sort order and last in descending sort order.
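
The NULL ordering in this note can be mimicked in Python, with None standing in for NULL. This is only an analogy to make the rule concrete; the helper name sort_with_nulls is made up for the example.

```python
def sort_with_nulls(values, descending=False):
    """Sort values so that None is first when ascending and last when descending."""
    # (False, ...) sorts before (True, ...), so None leads in ascending order;
    # reverse=True flips the whole ordering, which moves None to the end.
    return sorted(
        values,
        key=lambda v: (v is not None, 0 if v is None else v),
        reverse=descending,
    )


ascending = sort_with_nulls([3, None, 1, 2])
descending = sort_with_nulls([3, None, 1, 2], descending=True)
```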
 

Parameters for SORT 

 
key (key specifier, required) 
 
  • Name(s) of the key field(s) and the sequence specifier(s) the component uses when it orders records.
 
 max-core (integer, required)
 
  • Maximum memory usage in bytes.
    Default is 100663296 (96 MB). 


Runtime behavior of SORT

 SORT performs the following operations:
 
1. Reads the records from all flows connected to the in port and splits them into temporary files, each smaller than the number of bytes specified by the max-core parameter.

2. Sorts the records in each temporary file according to the sort key.

3. Stores the temporary files in the working directories specified in its layout.

4. Repeats steps 1 through 3 until it has read all records.
 
5. Merges all temporary files, maintaining the sort order.
 
6. Writes the result to the out port.
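
The steps above follow the general shape of an external merge sort. The Python sketch below shows that shape under two simplifying assumptions: the run size is a record count (max_records_per_run, a made-up stand-in for the byte limit in max-core), and the sorted runs are kept as in-memory lists rather than temporary files in the layout's working directories.

```python
import heapq
from itertools import islice


def external_sort(records, key, max_records_per_run):
    """Sort records by building bounded sorted runs and merging them."""
    it = iter(records)
    runs = []
    while True:
        # Read at most max_records_per_run records, like one bounded read.
        chunk = list(islice(it, max_records_per_run))
        if not chunk:
            break
        # Each sorted chunk stands in for one sorted temporary file.
        runs.append(sorted(chunk, key=key))
    # Merge the sorted runs while maintaining the sort order.
    return list(heapq.merge(*runs, key=key))


result = external_sort([5, 3, 8, 1, 9, 2], key=lambda x: x, max_records_per_run=2)
```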


 
 
 
 
 
 
 
 
 
 
 

Ab Initio Component | REDEFINE FORMAT

 Purpose of Redefine Format:

  • REDEFINE FORMAT copies records from its in port to its out port without changing the values in the records.

  • It can also be used to improve graph performance by reducing the number of fields in the input records, which it does by redefining (renaming) the fields.

REDEFINE FORMAT has no parameters.

 Runtime behavior of REDEFINE FORMAT

  • Reads the records arriving at the in port.
  • Writes the records to the out port with the fields renamed according to the record format of the out port.

 

REDEFINE FORMAT is designed not to support implicit reformat, so that you can use it to change the record format associated with a particular flow.
To do this, you have to assign a record format to the out port different from the record format on the in port.

If you use REDEFINE FORMAT to change a record format, make sure that the output record format you specify is compatible with the input record format. For example, if you combine several fields into one, that one field must have the same number of bytes as the total of the original fields. The following example shows this.

Suppose the input record format is:

record
   string(10)    first_name;
   string(10)    last_name;
   string(30)    address;
   decimal(5)    zipcode;
   decimal(8.2)  income;
end;

You can reduce the number of fields by specifying an output record format of:

record
   string(55)   person_info;
   decimal(8.2) income;
end;
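
The byte arithmetic behind this example (10 + 10 + 30 + 5 = 55) can be checked with a small Python sketch that slices the same fixed-width record two ways. It assumes that each field occupies exactly as many bytes as its declared width, including decimal(8.2) as 8 text bytes; the sample values and the helper slice_fields are made up for illustration.

```python
def slice_fields(raw, layout):
    """Split a fixed-width record into named byte fields."""
    if sum(width for _, width in layout) != len(raw):
        raise ValueError("record format does not cover the record exactly")
    fields, offset = {}, 0
    for name, width in layout:
        fields[name] = raw[offset:offset + width]
        offset += width
    return fields


IN_FORMAT = [("first_name", 10), ("last_name", 10), ("address", 30),
             ("zipcode", 5), ("income", 8)]
OUT_FORMAT = [("person_info", 55), ("income", 8)]

# One 63-byte record; ljust pads each string field with spaces to its width.
raw = (b"Ada".ljust(10) + b"Lovelace".ljust(10)
       + b"12 St James Square".ljust(30) + b"02139" + b"00123.45")

as_input = slice_fields(raw, IN_FORMAT)
as_output = slice_fields(raw, OUT_FORMAT)
```

The bytes are never changed; only the boundaries used to interpret them differ, which is why the two formats must cover the same total width.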

Ab Initio Component | REFORMAT: Part 3

  ..Continued from Part 2..

 

 Runtime behavior of REFORMAT

  • REFORMAT first reads the number of ports (n) from the count parameter. The n in outn gives each out port a unique number. Each outn port has a corresponding rejectn and errorn port. REFORMAT then does the following:
  1. REFORMAT reads a record from the in port.
  2. If the select parameter has an expression specified, REFORMAT uses the expression to evaluate the input record:
  • If the expression evaluates to false (0), REFORMAT discards the input record and starts over with step 1.
  • If the expression produces NULL, REFORMAT writes a descriptive error message and stops execution of the graph.
  • If the expression evaluates to true (anything other than 0 or NULL), REFORMAT begins processing the input record.
3. If the select parameter does not have a value, REFORMAT begins processing the input record.
 
4.  REFORMAT determines whether a transform function is specified in either the output-index or output-indexes parameter:

  • If neither output-index nor output-indexes has a value (the usual case when there is only one out port), REFORMAT sends the input record to every transform-out port pair, beginning with out0 and progressing sequentially.
  • If output-index or output-indexes has a value, REFORMAT evaluates the specified index transform. If output-indexes is defined, it should return a vector of port index values. If output-index is defined, it should return a single port index value.
  •  REFORMAT uses one or more values from the index transform to determine the appropriate transform-output port pair or pairs for the input record. If the index transform returns more than one value, REFORMAT sends the record to each of the appropriate ports, starting with the lowest numbered port and progressing to the other ports sequentially.

 5. REFORMAT determines whether each outn port has a transform function.

  •  If an out port does not have a transform function, REFORMAT uses implicit reformat to process the input record. For more information, see “Implicit reformat”.
  • If the input record is sent to more than one port, the transforms are evaluated sequentially: REFORMAT calls the transform function on each port in order, starting with the lowest-numbered port. For example, if the record is to be sent to port0 and port2, it is sent to port0 first, and then to port2. The evaluation of the second transform can depend on the side effects of the first transform, which means you could make successive calls to a function like next_in_sequence from sequential transforms for the same input record.
  • If a transform function results in an error or returns NULL, REFORMAT writes the following:
  • An error message to the corresponding error port
  • The current input record to the corresponding reject port
  • The component stops execution of the graph when the number of reject events exceeds the reject threshold.
  • If the reject or error ports do not have flows attached to them, REFORMAT discards the record.
6. REFORMAT writes the record to the out port of each successful transform, and then begins processing the next input record.
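
The record loop described above can be approximated in Python as follows. This is a simplified sketch, not the component's implementation: it assumes no output-index or output-indexes transform (so every record goes to every port), it omits implicit reformat and the reject threshold, and it uses None for NULL. The function and argument names are hypothetical.

```python
def reformat(records, transforms, select=None):
    """Run each record through one transform per out port.

    Returns (outs, rejects, errors), each a list with one list per port.
    """
    n = len(transforms)
    outs = [[] for _ in range(n)]
    rejects = [[] for _ in range(n)]
    errors = [[] for _ in range(n)]
    for rec in records:
        if select is not None:
            verdict = select(rec)
            if verdict is None:
                # NULL from select stops the whole run.
                raise RuntimeError("select expression produced NULL")
            if verdict == 0:
                continue  # false (0): discard the record and read the next one
        # Ports are visited in order, starting with the lowest-numbered port.
        for port, transform in enumerate(transforms):
            try:
                result = transform(rec)
                if result is None:
                    raise ValueError("transform returned NULL")
            except Exception as exc:
                errors[port].append(str(exc))   # message to errorN
                rejects[port].append(rec)       # input record to rejectN
                continue
            outs[port].append(result)
    return outs, rejects, errors


outs, rejects, errors = reformat(
    [1, 2, 0, 3],
    transforms=[lambda r: r * 10, lambda r: 100 // r],
    select=lambda r: r != 2,
)
```

In this run the record 2 is filtered out by select, and the record 0 is rejected on the second port only, because the division fails there while the first transform succeeds.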

 

 

 

 

 

 

Ab Initio Component | REFORMAT: Part 2

 ..Continued from Part 1..

 

output-index (filename or string, optional)

  • Specifies either the name of a file containing a transform function, or a transform string. The component calls the specified transform function for each input record. The transform function should return an index value between 0 and the highest-numbered output port. REFORMAT uses this value to direct the input record to the output port that has the same number as the value, and executes the transform function, if any, associated with that port.
  • When you specify a value for this parameter, each input record goes to exactly one transform-output port pair. For example, suppose there are 100 input records and two output ports. Each output port receives between 0 and 100 records, and the transform function determines the split: 50/50, 60/40, 0/100, 99/1, or any other combination that adds up to 100.
  • If an index is out of range (less than zero or greater than the highest-numbered port), the component discards the input record.
    
    NOTE: If you specify a value for the output-index parameter, you cannot also specify the output-indexes parameter, and vice versa.

  • If you do not specify a value for either output-index or output-indexes, the component sends every input record to every transform-output port pair.  
 

output-indexes (filename or string, optional)

  • It specifies either the name of a file containing a transform function, or a transform string. The component calls the specified transform function for each input record. The transform function uses the value of the input record to direct that input record to particular transform-output ports. 
  •  The expected output of the transform function is a vector of numeric values. The component considers each element of this vector as an index into the output transforms and ports. The component directs the input record to the identified output ports and executes the transform functions, if any, associated with those ports. 
  • If an index is out of range (less than zero or greater than the highest-numbered port), the component discards the input record.
  • If you do not specify a value for either output-indexes or output-index, the component sends every input record to every transform-output port pair.
  • As an example of how the component uses the transform function specified in the output-indexes parameter, consider the following:

    out :: output_indexes(in) =
    begin
      out :1: if (in.kind == "x") [vector 0, 1, 2];
      out :2: if (in.kind == "y") [vector 2, 3, 4];
      out : : [vector 5];
    end; 
 
  •  When you specify this transform function for the output-indexes parameter, it directs the input record to the transform functions for:

        Ports 0, 1, and 2 if the field in.kind is “x”

        Ports 2, 3, and 4 if the field in.kind is “y”

        Port 5 otherwise
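
For readers more familiar with general-purpose languages, the same routing can be mirrored in Python. This is a rough analogue rather than DML: the dictionary records and the route helper are assumptions made for the example, and the sketch reads "discards the input record" as dropping the whole record when any returned index is out of range.

```python
def output_indexes(rec):
    """Python analogue of the prioritized rules in the transform above."""
    if rec["kind"] == "x":
        return [0, 1, 2]
    if rec["kind"] == "y":
        return [2, 3, 4]
    return [5]


def route(records, index_fn, port_count):
    """Deliver each record to the ports named by index_fn."""
    ports = [[] for _ in range(port_count)]
    for rec in records:
        indexes = index_fn(rec)
        if any(i < 0 or i >= port_count for i in indexes):
            continue  # out-of-range index: the record is discarded (assumed reading)
        for i in sorted(indexes):  # lowest-numbered port first
            ports[i].append(rec["name"])
    return ports


sample = [
    {"name": "r1", "kind": "x"},
    {"name": "r2", "kind": "y"},
    {"name": "r3", "kind": "z"},
]
six_ports = route(sample, output_indexes, port_count=6)
five_ports = route(sample, output_indexes, port_count=5)
```

With six ports, port 2 receives both r1 and r2; with only five ports, index 5 is out of range and r3 is dropped.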

logging (boolean, optional)

  • Specifies whether the component logs certain events.
 

 log_input (choice, optional)

  • Specifies how often the component sends an input record to its log port. The logging parameter must be set to True for this parameter to be available.
  •  For example, if you select 100, the component sends every 100th input record to its log port. 
 

log_output (choice, optional) 

  • Specifies how often the component sends an output record to its log port. The logging parameter must be set to True for this parameter to be available.
  • For example, if you select 100, the component sends every 100th output record to its log port. 
 

log_reject (choice, optional)

  • Specifies how often the component sends a reject record to its log port. The logging parameter must be set to True for this parameter to be available.
  • For example, if you select 100, the component sends every 100th reject record to its log port. 

 

 

 

Ab Initio Component | REFORMAT: Part 1

 Purpose of Reformat:

  • The REFORMAT component changes the format of records by dropping fields, or by using DML expressions to add fields, combine fields, or transform the data in the records.
  • REFORMAT performs an implicit reformat when you do not define a reformat function or transformation for the fields.

 

Parameters for REFORMAT

 

 count (integer, required)

     Sets the number of:
  •         out ports
  •         reject ports
  •         error ports
  •         transform parameters
    Default is 1. 
 

select (expression, optional)

  • Filters the records before reformatting.
 

error_group (string, optional)

  • Name of the error group to which this component belongs. The component sends its error output to the HANDLE ERRORS component with a matching error_group value.

 

log_group (string, optional)

  • Name of the log group to which this component belongs. The component sends its log output to the HANDLE LOGS component with a matching log_group value.

reject-threshold (choice, required)

  • Specifies the component’s tolerance for reject events. Choose one of the following:
  1. Abort on first reject — The component stops execution of the graph at the first reject event it generates.
  2. Never abort — The component does not stop execution of the graph no matter how many reject events it generates.
  3. Use limit/ramp — The component uses the settings in the limit and ramp parameters to determine how many reject events to allow before it stops execution of the graph.

 limit (integer, required)

  • A number representing reject events.
  • When the reject-threshold parameter is set to Use limit/ramp, the component uses the values of the ramp and limit parameters in a formula to determine the component’s tolerance for reject events. 
       Default is 0.  

ramp (real, required)

  • Rate of tolerance for reject events in the number of records processed.
  • When the reject-threshold parameter is set to Use limit/ramp, the component uses the values of the ramp and limit parameters in a formula to determine the component’s tolerance for reject events.
    Default is 0.0.

Note:

When you set the reject-threshold parameter of a component to Use limit/ramp, the limit and ramp parameters become available. The component then uses the limit and ramp parameters together in a formula to control the component’s tolerance for reject events:

The component stops execution of the graph if the number of reject events exceeds the result of the following formula:

limit + (ramp *  number_of_records_processed_so_far)


Ab Initio Component | OUTPUT FILE

 Purpose of Output File : 

  •  OUTPUT FILE represents records written as output from a graph into one or more serial files or a multifile.
  • OUTPUT FILE can also be used to write files directly to Amazon S3 and Google Cloud Storage.
  • OUTPUT FILE cannot be used in continuous graphs; it can be used only in batch graphs.

Parameters of OUTPUT FILE


1. Data Tab:

Use the Data tab to specify the following:

  • The path to a file reusable dataset
  • The physical location for a data file
  • If appropriate, an alternative means to associate a specified data file with an EME dataset in the EME Technical Repository

Reusable dataset

  • Specifies the use and location of a file reusable dataset that is preconfigured to access a particular set of data. Using this option configures the component as a dataset-linked component. For more information, see “Reusable datasets” in the Co>Operating System Graph Developer’s Guide.
    Reuse an existing dataset .

Data location 

    Specifies the data location as:

  • The URL of a serial file or of a multifile in a multifile system
  • The URLs of the individual partitions of an ad hoc multifile

File details

Opens a window, where you can see the following information about the file that corresponds to the specified data location:

  •     Permissions on the file
  •     Owner of the file
  •     Size of the file in bytes
  •     Date and time the file was last modified
  •     Full pathname of the file
  •     Any resolution errors 
 
 

2. Access Tab:

The following file-handling options are available on the Access tab:

  1. If the file does not exist, Create file: Creates the output or intermediate file before writing to it. By default, this option is selected.
  2. If the file does not exist, Fail: Forces the graph to fail if the file does not exist.
  3. If the file exists, Delete and recreate file: Deletes the output or intermediate file and creates a new one before writing to it. By default, this option is selected.
  4. If the file exists, Append to file: Writes output to the end of the intermediate or output file.
  5. If the file exists, Fail: Forces the graph to fail if the file exists.
  6. Upon job failure, roll the file back to the last checkpoint: Rolls the file back and discards output if the job fails in the phase writing the file, or fails in a subsequent phase before the next checkpoint. By default, this option is selected.
  7. Delete file after the last phase that reads it completes: Removes the input or intermediate file after the last phase that reads it has finished running. By default, this option is not selected.
  8. Write file only when phase completes: Instead of writing the data file incrementally, writes the file when the phase has run to completion. This ensures that a separate process that is looking for the file while the graph is running does not pick up a partially written file. By default, this option is not selected.
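
The idea behind the "Write file only when phase completes" option, that a watching process never sees a half-written file, is a common pattern outside Ab Initio as well. The Python sketch below shows one general way to get that effect (write to a temporary file, then rename it into place). It is an analogy only and makes no claim about how the Co>Operating System implements the option; the function name is hypothetical.

```python
import os
import tempfile


def write_when_complete(path, lines):
    """Write lines to path so that readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the target directory so the final
    # rename stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            for line in lines:
                tmp.write(line + "\n")
        # The target name appears only after all data has been written.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A process polling for the target file name either finds nothing or finds the complete file, which is the guarantee the option describes.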

3. File protection

  • Sets permissions on the input, output, and intermediate files. (Default settings are those assigned at file creation.) The checkboxes match the Unix file protection standards: Read (R), Write (W), and Execute (X) for User, Group, and Other.

4. Ports

  • Specifies the DML record format of the file, which is used to map the data in the file.
 

Ab Initio Component | INPUT FILE

 Purpose of Input File :

  • INPUT FILE represents records read as input to a graph from one or more serial files or from a multifile.
  • INPUT FILE can also be used to read files from the Hadoop file system, Amazon S3, and Google Cloud Storage.
  • INPUT FILE is not a phased component.
  • INPUT FILE can be used only in batch graphs; it cannot be used in continuous graphs.

 

Parameters of INPUT FILE

 

1. Data Tab:

Use the Data tab to specify the following:

  • The path to a file reusable dataset
  • The physical location for a data file
  • If appropriate, an alternative means to associate a specified data file with an EME dataset in the EME Technical Repository

Reusable dataset

  • Specifies the use and location of a file reusable dataset that is preconfigured to access a particular set of data. Using this option configures the component as a dataset-linked component. For more information, see “Reusable datasets” in the Co>Operating System Graph Developer’s Guide.
    Reuse an existing dataset .

Data location 

    Specifies the data location as:

  • The URL of a serial file or of a multifile in a multifile system
  • The URLs of the individual partitions of an ad hoc multifile

File details

Opens a window, where you can see the following information about the file that corresponds to the specified data location:

  •     Permissions on the file
  •     Owner of the file
  •     Size of the file in bytes
  •     Date and time the file was last modified
  •     Full pathname of the file
  •     Any resolution errors 
 
 

2. Access Tab:

The following file-handling options are available on the Access tab:

  1. If the file does not exist, Create file: Creates the output or intermediate file before writing to it. By default, this option is selected.
  2. If the file does not exist, Fail: Forces the graph to fail if the file does not exist.
  3. If the file exists, Delete and recreate file: Deletes the output or intermediate file and creates a new one before writing to it. By default, this option is selected.
  4. If the file exists, Append to file: Writes output to the end of the intermediate or output file.
  5. If the file exists, Fail: Forces the graph to fail if the file exists.
  6. Upon job failure, roll the file back to the last checkpoint: Rolls the file back and discards output if the job fails in the phase writing the file, or fails in a subsequent phase before the next checkpoint. By default, this option is selected.
  7. Delete file after the last phase that reads it completes: Removes the input or intermediate file after the last phase that reads it has finished running. By default, this option is not selected.
  8. Write file only when phase completes: Instead of writing the data file incrementally, writes the file when the phase has run to completion. This ensures that a separate process that is looking for the file while the graph is running does not pick up a partially written file. By default, this option is not selected.

3. File protection

  • Sets permissions on the input, output, and intermediate files. (Default settings are those assigned at file creation.) The checkboxes match the Unix file protection standards: Read (R), Write (W), and Execute (X) for User, Group, and Other.

4. Ports

  • Specifies the DML record format of the file, which is used to map the data in the file.







What is Ab Initio?

 According to Wikipedia :

  • Ab Initio Software is an American multinational enterprise software corporation based in Lexington, Massachusetts. The company specializes in high-volume data processing applications and enterprise application integration. It was founded in 1995 by the former CEO of Thinking Machines Corporation, Sheryl Handler, and several other former employees after the bankruptcy of that company.
  • Ab initio (/ˌæb ɪˈnɪʃioʊ/ AB in-ISH-ee-oh) is a Latin term meaning "from the beginning" and is derived from the Latin ab ("from") + initio, ablative singular of initium ("beginning"). 
  • Ab Initio is mainly used in data warehousing and ETL, for extracting, transforming, and loading data from and to various sources.
  • It provides a very user-friendly homogeneous and heterogeneous platform for parallel data processing applications.