Schema overview
TrackLayer's warehouse export is intentionally opinionated. Instead of asking every analytics team to reverse-engineer how an anonymous cookie, a checkout email hash, and a backend account ID should connect, the export surfaces both the resolved cluster and the link evidence behind it. In practice, that means most product and growth analysis can begin with `canonical_user_id`, while deeper graph audits can still inspect edges and per-identifier confidence.
The core pattern is stable even as merchants add more sources. `identities` stores the nodes, `identity_links` stores the edges, and `events` carries the resolved user key. Many teams also scope every table by `merchant_id`, even when it is not listed in the shortened column examples below, because partitioning and access control almost always happen at the merchant boundary.
| Table | Exported columns | Purpose |
|---|---|---|
| identities | identity_id, canonical_id, type, value, first_seen, last_seen, confidence | One row per identifier observed by TrackLayer. Think email hash, cookie ID, account ID, or another normalized identifier mapped into a canonical cluster. |
| identity_links | from_id, to_id, linked_at, link_source, confidence | The edge table that records why two identifiers were connected. This is the audit trail for graph growth, backfills, and confidence reviews. |
| events | ..., canonical_user_id, merchant_id, event_timestamp, revenue, ... | Your event stream with `canonical_user_id` already resolved through the identity graph, so anonymous and authenticated events can be queried together. |
5 useful SQL queries
The fastest way to make the warehouse export valuable is to codify a few reusable questions. Each example below is written twice: once for BigQuery and once for Snowflake. The syntax differs, but the modeling principle does not. Start from canonical users, only inspect raw identifiers when you need evidence, and keep merchant-level pruning in place so the query planner does not scan the entire graph for every exploratory question.
Find all events for a canonical user across anonymous and signed-up states
This is the basic sanity check for any identity warehouse rollout. Analysts should be able to start from one canonical user and see the full stitched timeline without manually unioning cookies, email hashes, and account IDs.
DECLARE target_merchant STRING DEFAULT 'merchant_42';
DECLARE target_canonical STRING DEFAULT 'can_123';
SELECT
e.event_timestamp,
e.event_name,
e.event_id,
e.session_id,
e.order_id,
e.revenue,
i.type AS identifier_type,
i.value AS identifier_value
FROM `warehouse.events` e
LEFT JOIN `warehouse.identities` i
ON e.merchant_id = target_merchant
AND i.merchant_id = target_merchant
AND i.canonical_id = e.canonical_user_id
WHERE e.merchant_id = target_merchant
AND e.canonical_user_id = target_canonical
ORDER BY e.event_timestamp;SET target_merchant = 'merchant_42';
SET target_canonical = 'can_123';
SELECT
e.event_timestamp,
e.event_name,
e.event_id,
e.session_id,
e.order_id,
e.revenue,
i.type AS identifier_type,
i.value AS identifier_value
FROM warehouse.events e
LEFT JOIN warehouse.identities i
ON i.merchant_id = $target_merchant
AND i.canonical_id = e.canonical_user_id
WHERE e.merchant_id = $target_merchant
AND e.canonical_user_id = $target_canonical
ORDER BY e.event_timestamp;Compute cross-device coverage percentage per merchant
Warehouse teams usually want a compact KPI that shows whether the identity graph is actually bridging devices, not just creating bigger cookie clusters. This version treats more than one `device_type` per canonical user as cross-device coverage.
WITH user_devices AS (
SELECT
merchant_id,
canonical_user_id,
COUNT(DISTINCT COALESCE(device_type, 'unknown')) AS device_count
FROM `warehouse.events`
WHERE canonical_user_id IS NOT NULL
GROUP BY 1, 2
)
SELECT
merchant_id,
COUNT(*) AS resolved_users,
COUNTIF(device_count > 1) AS cross_device_users,
ROUND(
100 * SAFE_DIVIDE(COUNTIF(device_count > 1), COUNT(*)),
2
) AS cross_device_coverage_pct
FROM user_devices
GROUP BY 1
ORDER BY cross_device_coverage_pct DESC;WITH user_devices AS (
SELECT
merchant_id,
canonical_user_id,
COUNT(DISTINCT COALESCE(device_type, 'unknown')) AS device_count
FROM warehouse.events
WHERE canonical_user_id IS NOT NULL
GROUP BY 1, 2
)
SELECT
merchant_id,
COUNT(*) AS resolved_users,
COUNT_IF(device_count > 1) AS cross_device_users,
ROUND(
100 * COUNT_IF(device_count > 1) / NULLIF(COUNT(*), 0),
2
) AS cross_device_coverage_pct
FROM user_devices
GROUP BY 1
ORDER BY cross_device_coverage_pct DESC;Identify suspicious identity clusters with more than 10 merged identifiers
Large clusters are not automatically wrong, but they are where over-merging, bot activity, shared-device abuse, or spammy lead flows tend to hide. Treat this as a review queue, not an auto-blocking rule.
SELECT
merchant_id,
canonical_id,
COUNT(DISTINCT identity_id) AS identity_count,
MAX(last_seen) AS cluster_last_seen,
AVG(confidence) AS avg_identity_confidence
FROM `warehouse.identities`
GROUP BY 1, 2
HAVING COUNT(DISTINCT identity_id) > 10
ORDER BY identity_count DESC, cluster_last_seen DESC;SELECT
merchant_id,
canonical_id,
COUNT(DISTINCT identity_id) AS identity_count,
MAX(last_seen) AS cluster_last_seen,
AVG(confidence) AS avg_identity_confidence
FROM warehouse.identities
GROUP BY 1, 2
HAVING COUNT(DISTINCT identity_id) > 10
ORDER BY identity_count DESC, cluster_last_seen DESC;Historical revenue attribution stitched across cookie to email
This pattern isolates revenue from users whose canonical cluster contains both a browser-side cookie and an email identifier. It lets you quantify how much revenue would have looked unattributed before the graph stitched the journey.
WITH stitched_users AS (
SELECT
merchant_id,
canonical_id
FROM `warehouse.identities`
GROUP BY 1, 2
HAVING COUNTIF(type = 'cookie') > 0
AND COUNTIF(type = 'email_hash') > 0
),
first_touch AS (
SELECT
e.merchant_id,
e.canonical_user_id,
ARRAY_AGG(
IF(e.traffic_source IS NOT NULL, e.traffic_source, NULL)
IGNORE NULLS
ORDER BY e.event_timestamp
LIMIT 1
)[OFFSET(0)] AS first_known_source
FROM `warehouse.events` e
JOIN stitched_users s
ON s.merchant_id = e.merchant_id
AND s.canonical_id = e.canonical_user_id
GROUP BY 1, 2
)
SELECT
DATE(e.event_timestamp) AS purchase_date,
COALESCE(f.first_known_source, 'direct_or_unknown') AS stitched_source,
SUM(e.revenue) AS attributed_revenue
FROM `warehouse.events` e
JOIN first_touch f
ON f.merchant_id = e.merchant_id
AND f.canonical_user_id = e.canonical_user_id
WHERE e.event_name = 'purchase'
GROUP BY 1, 2
ORDER BY purchase_date DESC, attributed_revenue DESC;WITH stitched_users AS (
SELECT
merchant_id,
canonical_id
FROM warehouse.identities
GROUP BY 1, 2
HAVING COUNT_IF(type = 'cookie') > 0
AND COUNT_IF(type = 'email_hash') > 0
),
first_touch AS (
SELECT
e.merchant_id,
e.canonical_user_id,
e.traffic_source AS first_known_source
FROM warehouse.events e
JOIN stitched_users s
ON s.merchant_id = e.merchant_id
AND s.canonical_id = e.canonical_user_id
WHERE e.traffic_source IS NOT NULL
QUALIFY ROW_NUMBER() OVER (
PARTITION BY e.merchant_id, e.canonical_user_id
ORDER BY e.event_timestamp
) = 1
)
SELECT
DATE(e.event_timestamp) AS purchase_date,
COALESCE(f.first_known_source, 'direct_or_unknown') AS stitched_source,
SUM(e.revenue) AS attributed_revenue
FROM warehouse.events e
JOIN first_touch f
ON f.merchant_id = e.merchant_id
AND f.canonical_user_id = e.canonical_user_id
WHERE e.event_name = 'purchase'
GROUP BY 1, 2
ORDER BY purchase_date DESC, attributed_revenue DESC;Recalculate match quality if a hypothetical new identifier were added
Before you add a new identifier type such as MAID, you can estimate its practical lift. The example assumes a staging table with candidate MAIDs for unresolved events and measures how many events would gain a canonical user if those identifiers matched existing graph rows.
WITH current_unresolved AS (
SELECT
merchant_id,
event_id,
maid_hash
FROM `staging.hypothetical_identifiers`
WHERE maid_hash IS NOT NULL
),
candidate_matches AS (
SELECT
u.merchant_id,
u.event_id,
i.canonical_id,
i.confidence
FROM current_unresolved u
LEFT JOIN `warehouse.identities` i
ON i.merchant_id = u.merchant_id
AND i.type = 'maid'
AND i.value = u.maid_hash
)
SELECT
merchant_id,
COUNT(*) AS unresolved_events_tested,
COUNTIF(canonical_id IS NOT NULL) AS events_that_would_resolve,
ROUND(
100 * SAFE_DIVIDE(COUNTIF(canonical_id IS NOT NULL), COUNT(*)),
2
) AS hypothetical_resolution_lift_pct,
AVG(confidence) AS avg_candidate_confidence
FROM candidate_matches
GROUP BY 1
ORDER BY hypothetical_resolution_lift_pct DESC;WITH current_unresolved AS (
SELECT
merchant_id,
event_id,
maid_hash
FROM staging.hypothetical_identifiers
WHERE maid_hash IS NOT NULL
),
candidate_matches AS (
SELECT
u.merchant_id,
u.event_id,
i.canonical_id,
i.confidence
FROM current_unresolved u
LEFT JOIN warehouse.identities i
ON i.merchant_id = u.merchant_id
AND i.type = 'maid'
AND i.value = u.maid_hash
)
SELECT
merchant_id,
COUNT(*) AS unresolved_events_tested,
COUNT_IF(canonical_id IS NOT NULL) AS events_that_would_resolve,
ROUND(
100 * COUNT_IF(canonical_id IS NOT NULL) / NULLIF(COUNT(*), 0),
2
) AS hypothetical_resolution_lift_pct,
AVG(confidence) AS avg_candidate_confidence
FROM candidate_matches
GROUP BY 1
ORDER BY hypothetical_resolution_lift_pct DESC;Schema evolution
Identity graphs fail in warehouses when analysts hardcode today's identifier mix as if it were permanent. A merchant might start with cookie, email hash, and account ID, then later add app user IDs, loyalty numbers, or mobile advertising IDs such as MAID. If your dashboards join on a specific identifier column rather than `canonical_id`, every new source becomes a migration project.
The safer contract is additive. New identifier types arrive as new `type` values in `identities`, new edges in `identity_links`, and better coverage in `events.canonical_user_id`. Existing queries keep working because the canonical cluster shape does not change. Analysts who care about the new signal can opt in with a targeted `WHERE type = 'maid'` filter. Everyone else continues to query the same durable user key they used before.
Performance tips
Identity data becomes expensive when warehouse teams forget that graph queries are join-heavy by default. The simplest optimization is merchant pruning. Partition large exports by `merchant_id` so scans stay local, especially in shared multi-tenant projects where thousands of merchants coexist in one dataset.
Clustering should follow the question shape. For `identities` and `events`, clustering by `canonical_id` or `canonical_user_id` makes user-journey lookups and cluster-size analysis much cheaper. For recurring dashboards, materialized views are worth the maintenance cost. A pre-joined view that exposes `merchant_id`, `canonical_user_id`, primary identity types, and last activity often replaces multiple repetitive joins that would otherwise run in every BI tool refresh.
Privacy + right-to-delete propagation
A deletion request should not stop at the profile row that first triggered it. In an identity graph, the real deletion scope is the resolved cluster: the account ID, linked email hashes, cookie IDs, stitched event rows, materialized views, and any downstream sync tables derived from that cluster. If one layer is missed, the user can be accidentally rehydrated from stale exports.
The warehouse pattern is therefore cascade-first. Mark the canonical cluster as deletion-pending, enumerate every linked identifier, remove or tombstone matching base rows, refresh any graph-derived materializations, and re-run outbound audience or destination sync jobs against the post-delete state. The important outcome is consistency: a merchant should never see the base graph say deleted while a cached audience table still treats the same canonical user as active.
Integration with Segment Personas / Customer.io
When TrackLayer becomes the identity resolution layer, downstream systems should consume `canonical_id` as the stable source-of-truth identifier instead of choosing whichever email or cookie happened to be present last. In Segment Personas, that usually means forwarding `canonical_id` as the durable external ID used for profile stitching and audience eligibility. In Customer.io, the same rule keeps lifecycle messaging attached to one resolved user even when email changes or multiple identifiers coexist inside the cluster.
This does not replace destination-specific identifiers. Email, phone, or app IDs still matter for message delivery and ad match quality. The difference is control: TrackLayer decides cluster membership once, then activation tools consume that decision rather than re-inventing identity logic inside each downstream platform.
Common questions
Do I need to rebuild the identity graph in SQL?
Usually no. TrackLayer already exports `canonical_id` relationships and an event-side `canonical_user_id`. Most warehouse work should start from those fields and only drop down to `identity_links` when you need auditability, debugging, or confidence analysis.
Why keep both `identities` and `identity_links` if `canonical_id` already exists?
Because canonical membership answers who belongs together, while links answer why they were merged. Analysts use canonical IDs for reporting and links for lineage, merge review, and operational troubleshooting.
Will new identifier types break my dashboards?
They should not if you query by `canonical_id`, `canonical_user_id`, and high-level `type` filters. New types such as MAID can arrive as additive values in `type` while the canonical join contract stays stable.
Should downstream tools use email as the warehouse join key?
Prefer `canonical_id` as the source-of-truth join key and treat email as one attribute inside the cluster. Email changes, multiple emails can exist in one customer history, and some users never provide email at all.
How quickly should deletions propagate to derived tables?
As fast as your refresh schedule allows, but the rule should be deterministic. Base exports, materialized views, destination sync tables, and activation audiences all need the same deletion watermark so the graph cannot be rehydrated from stale warehouse outputs.
Related implementation guides
Identity resolution deep dive
How TrackLayer builds and governs the graph before those rows ever reach the warehouse.
Read guide →BigQuery export guide
Operational patterns for modeling TrackLayer data in BigQuery once the export is enabled.
Read guide →Snowflake export guide
Warehouse structure, loading patterns, and performance considerations for Snowflake teams.
Read guide →