Skip to main content

TRIMSTRINGS

Byte-cap every string-typed column in the current row to a maximum byte length, recursing through structs and arrays.

Syntax

TRIMSTRINGS expression

Parameters

  • expression - Non-negative integer specifying the maximum length in bytes for each string value

Description

The TRIMSTRINGS command rewrites every string-typed column so values longer than the given byte length are truncated. Truncation always lands on a valid UTF-8 boundary — partial codepoints at the cut point are dropped, never substituted with a replacement character. Values shorter than the limit pass through unchanged. NULL values stay NULL.

The truncation walks the schema recursively:

  • Top-level string columns are truncated.
  • Struct fields that are strings are truncated; non-string fields pass through.
  • Array elements are truncated, including arrays of strings and arrays of structs (recursing into each struct's string fields).
  • Nested combinations (Struct<Array<Struct<{name: String, ...}>>>) all work the way you'd expect — only the string leaves get capped.

Unlike substr, which counts characters (Unicode codepoints), TRIMSTRINGS is byte-based — it measures raw UTF-8 encoded size. This is the only command in Hamelin that operates on bytes instead of characters. It exists for ingestion-time capping where downstream storage has hard byte limits per column (e.g. transport limits, column-storage caps).

Examples

A common use case is capping every string column right before writing rows to a table. In an ingest pipeline you typically read from a source, apply any transforms, then cap strings to stay within storage limits. Here TRIMSTRINGS 65536 ensures no string column exceeds 64 KiB before the rows land in the events table:

SET raw_message = 'a very long event payload...'
| TRIMSTRINGS 65536
| APPEND events

Every string column in the row — including nested struct fields and array elements — is capped to 65 536 bytes. Values already under the limit pass through unchanged.

For plain ASCII text, each character is exactly one byte, so the byte limit and the character limit are the same. The following example sets a string and truncates it to 5 bytes. Because every character in 'hello world' is a single-byte ASCII character, the result is the first five characters:

SET s = 'hello world'
| TRIMSTRINGS 5

Result: s = 'hello'.

Things get more interesting with multibyte characters. In UTF-8, non-ASCII characters occupy two or more bytes. When the byte limit falls in the middle of a multibyte character, TRIMSTRINGS drops the partial character entirely so the output is always valid UTF-8. The string 'héllo' starts with 'h' (1 byte) followed by 'é' (2 bytes, 0xC3 0xA9). A limit of 2 bytes cannot fit both, so TRIMSTRINGS keeps only the 'h':

SET s = 'héllo'   -- 'é' is two bytes (0xC3 0xA9)
| TRIMSTRINGS 2

Result: s = 'h' — the partial 'é' is dropped, output ends on a valid UTF-8 boundary (1 byte, shorter than the limit).

Observability

When executed against the DataFusion backend, the executor tracks cumulative counters across the lifetime of the executor: rows/bytes read by scan nodes, rows/bytes written by sink nodes, and trim_clamped (the number of column-values clamped by TRIMSTRINGS — summed across all string columns and nested locations, not unique rows). The Hamelin CLI prints a single trailing stderr line summarizing those counters after each query:

stats: rows_read=N bytes_read=N rows_written=N bytes_written=N trim_clamped=N

Hamelin suppresses the line when every counter is zero. Trino does not track these metrics.

  • SET - Add or modify columns (often used to construct test inputs for TRIMSTRINGS)
  • APPEND - Write rows to a table (TRIMSTRINGS is typically the last command before APPEND in ingest pipelines)