TRIMSTRINGS
Byte-cap every string-typed column in the current row to a maximum byte length, recursing through structs and arrays.
Syntax
TRIMSTRINGS expression
Parameters
- expression - Non-negative integer specifying the maximum length in bytes for each string value
Description
The TRIMSTRINGS command rewrites every string-typed column so values longer
than the given byte length are truncated. Truncation always lands on a valid
UTF-8 boundary — partial codepoints at the cut point are dropped, never
substituted with a replacement character. Values shorter than the limit pass
through unchanged. NULL values stay NULL.
The truncation walks the schema recursively:
- Top-level string columns are truncated.
- Struct fields that are strings are truncated; non-string fields pass through.
- Array elements are truncated, including arrays of strings and arrays of structs (recursing into each struct's string fields).
- Nested combinations (
Struct<Array<Struct<{name: String, ...}>>>) all work the way you'd expect — only the string leaves get capped.
Unlike substr, which counts characters (Unicode codepoints),
TRIMSTRINGS is byte-based — it measures raw UTF-8 encoded size.
This is the only command in Hamelin that operates on bytes instead of
characters. It exists for ingestion-time capping where downstream
storage has hard byte limits per column (e.g. transport limits,
column-storage caps).
Examples
A common use case is capping every string column right before
writing rows to a table. In an ingest pipeline you typically read
from a source, apply any transforms, then cap strings to stay within
storage limits. Here TRIMSTRINGS 65536 ensures no string column
exceeds 64 KiB before the rows land in the events table:
SET raw_message = 'a very long event payload...'
| TRIMSTRINGS 65536
| APPEND events
Every string column in the row — including nested struct fields and array elements — is capped to 65 536 bytes. Values already under the limit pass through unchanged.
For plain ASCII text, each character is exactly one byte, so the byte
limit and the character limit are the same. The following example
sets a string and truncates it to 5 bytes. Because every character in
'hello world' is a single-byte ASCII character, the result is the
first five characters:
SET s = 'hello world'
| TRIMSTRINGS 5
Result: s = 'hello'.
Things get more interesting with multibyte characters. In UTF-8,
non-ASCII characters occupy two or more bytes. When the byte limit
falls in the middle of a multibyte character, TRIMSTRINGS drops
the partial character entirely so the output is always valid UTF-8.
The string 'héllo' starts with 'h' (1 byte) followed by 'é'
(2 bytes, 0xC3 0xA9). A limit of 2 bytes cannot fit both, so
TRIMSTRINGS keeps only the 'h':
SET s = 'héllo' -- 'é' is two bytes (0xC3 0xA9)
| TRIMSTRINGS 2
Result: s = 'h' — the partial 'é' is dropped, output ends on a valid UTF-8
boundary (1 byte, shorter than the limit).
Observability
When executed against the DataFusion backend, the executor tracks cumulative
counters across the lifetime of the executor: rows/bytes read by scan nodes,
rows/bytes written by sink nodes, and trim_clamped (the number of
column-values clamped by TRIMSTRINGS — summed across all string columns and
nested locations, not unique rows). The Hamelin CLI prints a single trailing
stderr line summarizing those counters after each query:
stats: rows_read=N bytes_read=N rows_written=N bytes_written=N trim_clamped=N
Hamelin suppresses the line when every counter is zero. Trino does not track these metrics.