Skip to main content

Aggregation Functions

Functions that operate on groups of rows to produce summary values when used with AGG or WINDOW commands.

count([expression])

Returns the number of rows in each group during aggregation.

Parameters

  • expression (optional) - When provided, only counts rows where this expression evaluates to a non-null value

Description

The count() function returns the number of rows in each group when you use it with AGG or WINDOW commands. Unlike SQL, Hamelin uses count() rather than count(*). When no grouping is specified, it counts all rows in the result.

When you provide an expression parameter, count() only counts rows where that expression evaluates to a non-null value. This lets you count specific conditions without using separate filtering.

When you omit an explicit field name, Hamelin automatically generates the field name count() or count(expression) that you can reference using backticks in subsequent commands.

count_distinct(x)

Counts the number of distinct values in each group.

Parameters

  • x - Expression of any type to count distinct values for

Description

The count_distinct() function counts only unique values within each group, ignoring duplicates. Null values are not counted. This provides the same functionality as SQL's COUNT(DISTINCT x) but with cleaner syntax. For example, counting distinct user IDs shows how many unique users performed actions, regardless of how many actions each user performed.

approx_distinct(x)

Returns an approximate count of distinct values in each group.

Parameters

  • x - Expression of any type to count distinct values for

Description

The approx_distinct() function provides an approximate count of distinct values using probabilistic algorithms that are much faster and use less memory than exact counting, especially for high-cardinality data. The result is typically accurate within a few percent. This is particularly useful when you need fast estimates for large datasets where exact precision isn't critical.

count_if(condition)

Counts the number of rows where a condition is true in each group.

Parameters

  • condition - Boolean expression that must evaluate to true for the row to be counted

Description

The count_if() function counts only rows where the specified condition evaluates to true. This provides a more concise alternative to using WHERE clauses or case() expressions for conditional counting.

When you omit an explicit field name, Hamelin automatically generates the field name count_if(condition) that you can reference using backticks in subsequent commands.

sum(x)

Returns the sum of all values in each group.

Parameters

  • x - Numeric expression to sum

Description

The sum() function adds up all non-null values of the specified expression within each group. If all values are null, it returns null. The result type matches the input type for exact numeric types.

avg(x)

Returns the average (arithmetic mean) of all values in each group.

Parameters

  • x - Numeric expression to average

Description

The avg() function calculates the arithmetic mean of all non-null values within each group. If all values are null, it returns null. The result is always a floating-point type regardless of input type.

stddev(x)

Returns the standard deviation of all values in each group.

Parameters

  • x - Numeric expression to calculate standard deviation for

Description

The stddev() function calculates the sample standard deviation of all non-null values within each group. If there are fewer than two non-null values, it returns null. The result is always a floating-point type.

approx_percentile(x, percentile)

Returns an approximate percentile value for the specified expression.

Parameters

  • x - Numeric expression to calculate percentile for
  • percentile - Numeric value between 0.0 and 1.0 representing the desired percentile

Description

The approx_percentile() function calculates an approximate percentile using efficient algorithms suitable for large datasets. The percentile parameter should be between 0.0 (minimum) and 1.0 (maximum). For example, 0.5 returns the median, 0.95 returns the 95th percentile.

min(x)

Returns the minimum value in each group.

Parameters

  • x - Expression of numeric, string, or timestamp type

Description

The min() function finds the smallest value within each group. It works with numeric types (returning the numerically smallest), strings (lexicographic ordering), and timestamps (chronologically earliest). If all values are null, it returns null.

max(x)

Returns the maximum value in each group.

Parameters

  • x - Expression of numeric, string, or timestamp type

Description

The max() function finds the largest value within each group. It works with numeric types (returning the numerically largest), strings (lexicographic ordering), and timestamps (chronologically latest). If all values are null, it returns null.

any_value(x)

Returns an arbitrary value from each group.

Parameters

  • x - Expression of any type to select a value from

Description

The any_value() function returns an arbitrary non-null value from each group. This is useful when you're grouping data and need to include a field that's not part of the grouping criteria, where you know all values in the group are the same or where you don't care which specific value is selected. Unlike min() or max(), this function doesn't impose any ordering overhead, making it more efficient when you just need any representative value from the group.

array_agg(x)

Collects all values in each group into an array.

Parameters

  • x - Expression of any type to collect into an array

Description

The array_agg() function creates an array containing all non-null values from the specified expression within each group. The order of elements in the resulting array follows the order specified by any SORT clause in the aggregation command. If there are no non-null values, it returns an empty array.

map_agg(key, value)

Collects key-value pairs in each group into a map.

Parameters

  • key - Expression of any type to use as map keys
  • value - Expression of any type to use as map values

Description

The map_agg() function creates a map from key-value pairs within each group. If the same key appears multiple times, only the last value is retained. The order of processing follows any SORT clause in the aggregation command.

multimap_agg(key, value)

Collects key-value pairs in each group into a map where each key maps to an array of values.

Parameters

  • key - Expression of any type to use as map keys
  • value - Expression of any type to collect into arrays

Description

The multimap_agg() function creates a map where each unique key maps to an array of all values associated with that key within each group. This preserves all values for duplicate keys, unlike map_agg() which keeps only the last value.

any(x)

Returns true if any value in the group is true (logical OR aggregation).

Parameters

  • x - Boolean expression to test

Description

The any() function performs logical OR aggregation on boolean values within each group. It returns true if at least one value is true, false if all values are false, and null if all values are null.

all(x)

Returns true if all values in the group are true (logical AND aggregation).

Parameters

  • x - Boolean expression to test

Description

The all() function performs logical AND aggregation on boolean values within each group. It returns true if all values are true, false if at least one value is false, and null if all values are null.