Writing Aggregate Functions

This chapter gives a brief overview of writing aggregate functions for user-defined data types.

Aggregate Function

Aggregate functions take a collection of values as input (contrasted with scalar functions, which take a single input). Aggregate functions are used to perform a summary operation on the set of input values.

The GROUP BY clause in SQL provides the basis for identifying the sets of parameter values. The set of rows with the same values for the GROUP BY columns produces the input parameters for each execution of the aggregate function, or in the absence of a GROUP BY clause, all rows from the query produce a single set of input parameters to the aggregate.

SQL supports the following aggregate functions for its intrinsic data types:

count
max
min
sum
avg
variance
standard deviation
correlation and regression analysis

Function Definitions for Aggregates

Functions for user-defined types require the following:

A function definition (using the IIADD_FO_DFN structure) to define the function name
A function instance definition (using the IIADD_FI_DFN structure) to define the functions that will perform the operations

Function definitions for count(), max(), min() and sum() are already included in Ingres, in the same sense that arithmetic and comparison operations are pre-defined for user-defined data types; however, avg() is NOT pre-defined. Users can also code their own aggregate functions with distinct function names. The IIADD_FO_DFN structure instance must define "fod_type" to be II_AGGREGATE in this case.

Like all functions on user-defined data types, aggregate functions must also include definitions of each function instance for the function on a specific user-defined type. These are defined by IIADD_FI_DFN structure instances, and for aggregate functions they must include an "fid_optype" of II_AGGREGATE.

The instances of the function instance structure must be in a particular sorted sequence. Function instances with fid_optype of II_AGGREGATE must be placed between those for II_OPERATOR and II_NORMAL. If the function instance is for one of the standard Ingres aggregates, the "fid_opid" field must be set to the appropriate code (II_COUNT_OP for COUNT, II_MAX_OP for MAX, II_MIN_OP for MIN, II_SUM_OP for SUM). If the function instance is for a user-defined aggregate function, the fid_opid field must contain the value from the "fod_id" field of the corresponding IIADD_FO_DFN structure instance.

Note: The function instance definitions must not use the II_RES_EXTERN value for the "fid_rltype" field.

Code for an Aggregate Function

The code generated into a query plan by Ingres for the evaluation of an aggregate function consists of three parts.

The first part is executed for each new set of GROUP BY column values. In the current implementation, Ingres builds a work field that contains either the "empty" value (as generated by the "getempty" method of the type definition) or if the function is max or min, the minimum or maximum value for the type as generated by the "minmaxdv" method. Each successive set of GROUP BY column values calls this code again to reset the work field to the initialization value.

The second part of code is executed for each row of a particular set of GROUP BY column values and invokes the function variable defined in the aggregate function instance definition. This function variable is passed the parameters defined for the function instance. The first parameter defines the result of the function execution and the second-through-nth-parameters describe the parameters of the aggregate function invocation syntax. Each parameter to the function variable is a pointer to a II_DATA_VALUE structure instance that describes and addresses the corresponding value. Since the function variable is called with each row to be aggregated, it is assumed that it will perform the aggregation into the result parameter. The result parameter is the same work field whose initialization is described in the preceding paragraph.

So, for example, an implementation of the function variable for the "max" aggregate might simply compare the current value of the aggregate parameter with the value in the work field, replacing it if the new value is "larger" (remember, that the work field will be initialized to the minimum value for the data type for each new group of rows). Likewise, an implementation of the function variable for the "sum" aggregate might add the current parameter value to the value in the work field, accumulating the sum in the work field.

The last part of code is executed after each group of rows (defined by a distinct set of GROUP BY column values) is processed. Ingres simply copies the current contents of the work field to the result location (based on the assumption that the aggregate is accumulated in the work field).

Example—Function to perform the "sum" operation on the ORD_PAIR type

** Name: usop_sum() - sum a set of ord_pair's (just sums each element).

** Description:

** Inputs:

** scb Pointer to a session control block.

** rdv Pointer to II_DATA_VALUE to hold

** resulting summed result.

** dv1 Pointer to II_DATA_VALUE of the first

** operand, which is a ORD_PAIR datatype.

** Outputs:

** rdv

** .db_data Pointer to resulting currency value.

** Returns:

** II_STATUS

## History:

## 19-oct-05 (inkdo01)

## Written as proof of concept for UDT aggregation.

II_STATUS

usop_sum(

II_SCB *scb,

II_DATA_VALUE *rdv,

II_DATA_VALUE *dv1)

{

ORD_PAIR *ival, *rval;

ival = (ORD_PAIR *)dv1->db_data;

rval = (ORD_PAIR *)rdv->db_data;

/* Simply accumulate the sums of the x & y values

in the result work field. */

rval->op_x += ival->op_x;

rval->op_y += ival->op_y;

return( II_OK );

}

Problems with AVG

Note: Ingres currently does not support the AVG operator for user-defined types because Ingres assumes a division operator is not generally available for user-defined types. (AVG is compiled as a SUM divided by a COUNT.) Users, however, can implement a SUM operator and explicitly code "sum(abc) / count(abc) as "avg(abc)" in a query. Also, users can code type-specific functions to perform AVG (for example, avg_op, to compute the average of a set of ordered pairs) using an algorithm appropriate to the type.