JOIN and LOOKUP: Combining datasets
In event analytics, event records are often narrow and require enrichment with
additional context. You use the JOIN and LOOKUP commands to do this
enrichment. Hamelin gives you most of the power of SQL joins - you get inner
join behavior (with JOIN) and left outer join behavior (with LOOKUP).
Basic syntax
Join your main dataset with additional data by specifying a matching condition. The ON clause defines how records from both datasets should be linked together:
| JOIN other_dataset ON field_name == other_dataset.field_name
Use LOOKUP when you want to keep all your original records, even if some don't have matches in the second dataset:
| LOOKUP other_dataset ON field_name == other_dataset.field_name
Nested results
When you join datasets, Hamelin nests the joined data as a struct to prevent field name collisions. This structure keeps your original fields separate from the joined fields, making it clear which data came from which dataset.
Combine user records with order data:
FROM users
| WHERE timestamp > "2024-01-01"
| JOIN orders ON user_id == orders.user_id
This creates records where each user has an orders struct containing all the matched order information. Your original user fields remain at the top level, while order details are nested inside the orders structure.
Custom naming
You can control the name of the nested struct using assignment syntax. This makes your queries more readable when the default dataset name isn't descriptive:
FROM users
| WHERE timestamp > "2024-01-01"
| JOIN purchase_history = orders ON user_id == orders.user_id
Now the joined data appears under the more descriptive name purchase_history instead of the generic orders name.
Accessing joined fields
You access fields from the joined struct using dot notation. This lets you filter, select, or manipulate the joined data just like any other nested structure:
FROM users
| WHERE timestamp > "2024-01-01"
| JOIN orders ON user_id == orders.user_id
| WHERE orders.total > 100
This query finds users who have at least one order with a total over 100, demonstrating how you can filter on joined data.
Multiple joins
You can chain multiple JOIN operations to combine data from several datasets. Each join creates its own nested struct, letting you pull related information from multiple sources:
FROM transactions
| WHERE amount > 1000
| JOIN user_details = users ON user_id == users.id
| JOIN account_info = accounts ON account_id == accounts.id
| WHERE user_details.risk_score > 0.8
This creates records where each transaction has both user_details and account_info structs, giving you access to related data from multiple datasets in a single query.
JOIN vs LOOKUP: Required vs optional matches
The key difference between JOIN and LOOKUP is how they handle missing matches. This choice determines whether you get only complete records or keep all your original data with optional enrichment.
JOIN requires matches
When you use JOIN, only rows that have a match in both datasets appear in your results. Rows without matches get filtered out completely, giving you a dataset that only contains records with complete information.
Get users who have placed orders:
FROM users
| JOIN orders ON user_id == orders.user_id
This returns only users who have actually placed orders. Users without any orders are excluded from the results entirely.
LOOKUP keeps all rows
When you use LOOKUP, all rows from your main dataset stay in the results, regardless of whether they have matches. For rows without matches, the nested struct gets set to null, preserving your complete dataset while adding optional enrichment.
Get all users and include their order information when available:
FROM users
| LOOKUP orders ON user_id == orders.user_id
This returns every user from your dataset. Users with orders get an orders struct containing their order data, while users without orders still appear with their orders field set to null.
When to use each
Use JOIN when you only want records that have complete information from both datasets. Use LOOKUP when you want to preserve your entire main dataset and optionally enrich it with additional data that might not exist for every record.