Skip to main content
GUIDE · IDENTITY WAREHOUSE11 min read

Querying TrackLayer's identity graph in your warehouse

A practical guide for analysts and data engineers who want to use TrackLayer's resolved identity graph inside BigQuery or Snowflake without flattening it into brittle one-off joins. The goal is simple: keep `canonical_id` as the durable unit of truth, ask better questions across anonymous and known behavior, and keep privacy workflows intact as the graph evolves.

Schema

Schema overview

TrackLayer's warehouse export is intentionally opinionated. Instead of asking every analytics team to reverse-engineer how an anonymous cookie, a checkout email hash, and a backend account ID should connect, the export surfaces both the resolved cluster and the link evidence behind it. In practice, that means most product and growth analysis can begin with `canonical_user_id`, while deeper graph audits can still inspect edges and per-identifier confidence.

The core pattern is stable even as merchants add more sources. `identities` stores the nodes, `identity_links` stores the edges, and `events` carries the resolved user key. Many teams also scope every table by `merchant_id`, even when it is not listed in the shortened column examples below, because partitioning and access control almost always happen at the merchant boundary.

TableExported columnsPurpose
identitiesidentity_id, canonical_id, type, value, first_seen, last_seen, confidenceOne row per identifier observed by TrackLayer. Think email hash, cookie ID, account ID, or another normalized identifier mapped into a canonical cluster.
identity_linksfrom_id, to_id, linked_at, link_source, confidenceThe edge table that records why two identifiers were connected. This is the audit trail for graph growth, backfills, and confidence reviews.
events..., canonical_user_id, merchant_id, event_timestamp, revenue, ...Your event stream with `canonical_user_id` already resolved through the identity graph, so anonymous and authenticated events can be queried together.
SQL

5 useful SQL queries

The fastest way to make the warehouse export valuable is to codify a few reusable questions. Each example below is written twice: once for BigQuery and once for Snowflake. The syntax differs, but the modeling principle does not. Start from canonical users, only inspect raw identifiers when you need evidence, and keep merchant-level pruning in place so the query planner does not scan the entire graph for every exploratory question.

Query 01

Find all events for a canonical user across anonymous and signed-up states

This is the basic sanity check for any identity warehouse rollout. Analysts should be able to start from one canonical user and see the full stitched timeline without manually unioning cookies, email hashes, and account IDs.

BigQuery
DECLARE target_merchant STRING DEFAULT 'merchant_42';
DECLARE target_canonical STRING DEFAULT 'can_123';

SELECT
  e.event_timestamp,
  e.event_name,
  e.event_id,
  e.session_id,
  e.order_id,
  e.revenue,
  i.type AS identifier_type,
  i.value AS identifier_value
FROM `warehouse.events` e
LEFT JOIN `warehouse.identities` i
  ON e.merchant_id = target_merchant
 AND i.merchant_id = target_merchant
 AND i.canonical_id = e.canonical_user_id
WHERE e.merchant_id = target_merchant
  AND e.canonical_user_id = target_canonical
ORDER BY e.event_timestamp;
Snowflake
SET target_merchant = 'merchant_42';
SET target_canonical = 'can_123';

SELECT
  e.event_timestamp,
  e.event_name,
  e.event_id,
  e.session_id,
  e.order_id,
  e.revenue,
  i.type AS identifier_type,
  i.value AS identifier_value
FROM warehouse.events e
LEFT JOIN warehouse.identities i
  ON i.merchant_id = $target_merchant
 AND i.canonical_id = e.canonical_user_id
WHERE e.merchant_id = $target_merchant
  AND e.canonical_user_id = $target_canonical
ORDER BY e.event_timestamp;
Query 02

Compute cross-device coverage percentage per merchant

Warehouse teams usually want a compact KPI that shows whether the identity graph is actually bridging devices, not just creating bigger cookie clusters. This version treats more than one `device_type` per canonical user as cross-device coverage.

BigQuery
WITH user_devices AS (
  SELECT
    merchant_id,
    canonical_user_id,
    COUNT(DISTINCT COALESCE(device_type, 'unknown')) AS device_count
  FROM `warehouse.events`
  WHERE canonical_user_id IS NOT NULL
  GROUP BY 1, 2
)
SELECT
  merchant_id,
  COUNT(*) AS resolved_users,
  COUNTIF(device_count > 1) AS cross_device_users,
  ROUND(
    100 * SAFE_DIVIDE(COUNTIF(device_count > 1), COUNT(*)),
    2
  ) AS cross_device_coverage_pct
FROM user_devices
GROUP BY 1
ORDER BY cross_device_coverage_pct DESC;
Snowflake
WITH user_devices AS (
  SELECT
    merchant_id,
    canonical_user_id,
    COUNT(DISTINCT COALESCE(device_type, 'unknown')) AS device_count
  FROM warehouse.events
  WHERE canonical_user_id IS NOT NULL
  GROUP BY 1, 2
)
SELECT
  merchant_id,
  COUNT(*) AS resolved_users,
  COUNT_IF(device_count > 1) AS cross_device_users,
  ROUND(
    100 * COUNT_IF(device_count > 1) / NULLIF(COUNT(*), 0),
    2
  ) AS cross_device_coverage_pct
FROM user_devices
GROUP BY 1
ORDER BY cross_device_coverage_pct DESC;
Query 03

Identify suspicious identity clusters with more than 10 merged identifiers

Large clusters are not automatically wrong, but they are where over-merging, bot activity, shared-device abuse, or spammy lead flows tend to hide. Treat this as a review queue, not an auto-blocking rule.

BigQuery
SELECT
  merchant_id,
  canonical_id,
  COUNT(DISTINCT identity_id) AS identity_count,
  MAX(last_seen) AS cluster_last_seen,
  AVG(confidence) AS avg_identity_confidence
FROM `warehouse.identities`
GROUP BY 1, 2
HAVING COUNT(DISTINCT identity_id) > 10
ORDER BY identity_count DESC, cluster_last_seen DESC;
Snowflake
SELECT
  merchant_id,
  canonical_id,
  COUNT(DISTINCT identity_id) AS identity_count,
  MAX(last_seen) AS cluster_last_seen,
  AVG(confidence) AS avg_identity_confidence
FROM warehouse.identities
GROUP BY 1, 2
HAVING COUNT(DISTINCT identity_id) > 10
ORDER BY identity_count DESC, cluster_last_seen DESC;
Query 04

Historical revenue attribution stitched across cookie to email

This pattern isolates revenue from users whose canonical cluster contains both a browser-side cookie and an email identifier. It lets you quantify how much revenue would have looked unattributed before the graph stitched the journey.

BigQuery
WITH stitched_users AS (
  SELECT
    merchant_id,
    canonical_id
  FROM `warehouse.identities`
  GROUP BY 1, 2
  HAVING COUNTIF(type = 'cookie') > 0
     AND COUNTIF(type = 'email_hash') > 0
),
first_touch AS (
  SELECT
    e.merchant_id,
    e.canonical_user_id,
    ARRAY_AGG(
      IF(e.traffic_source IS NOT NULL, e.traffic_source, NULL)
      IGNORE NULLS
      ORDER BY e.event_timestamp
      LIMIT 1
    )[OFFSET(0)] AS first_known_source
  FROM `warehouse.events` e
  JOIN stitched_users s
    ON s.merchant_id = e.merchant_id
   AND s.canonical_id = e.canonical_user_id
  GROUP BY 1, 2
)
SELECT
  DATE(e.event_timestamp) AS purchase_date,
  COALESCE(f.first_known_source, 'direct_or_unknown') AS stitched_source,
  SUM(e.revenue) AS attributed_revenue
FROM `warehouse.events` e
JOIN first_touch f
  ON f.merchant_id = e.merchant_id
 AND f.canonical_user_id = e.canonical_user_id
WHERE e.event_name = 'purchase'
GROUP BY 1, 2
ORDER BY purchase_date DESC, attributed_revenue DESC;
Snowflake
WITH stitched_users AS (
  SELECT
    merchant_id,
    canonical_id
  FROM warehouse.identities
  GROUP BY 1, 2
  HAVING COUNT_IF(type = 'cookie') > 0
     AND COUNT_IF(type = 'email_hash') > 0
),
first_touch AS (
  SELECT
    e.merchant_id,
    e.canonical_user_id,
    e.traffic_source AS first_known_source
  FROM warehouse.events e
  JOIN stitched_users s
    ON s.merchant_id = e.merchant_id
   AND s.canonical_id = e.canonical_user_id
  WHERE e.traffic_source IS NOT NULL
  QUALIFY ROW_NUMBER() OVER (
    PARTITION BY e.merchant_id, e.canonical_user_id
    ORDER BY e.event_timestamp
  ) = 1
)
SELECT
  DATE(e.event_timestamp) AS purchase_date,
  COALESCE(f.first_known_source, 'direct_or_unknown') AS stitched_source,
  SUM(e.revenue) AS attributed_revenue
FROM warehouse.events e
JOIN first_touch f
  ON f.merchant_id = e.merchant_id
 AND f.canonical_user_id = e.canonical_user_id
WHERE e.event_name = 'purchase'
GROUP BY 1, 2
ORDER BY purchase_date DESC, attributed_revenue DESC;
Query 05

Recalculate match quality if a hypothetical new identifier were added

Before you add a new identifier type such as MAID, you can estimate its practical lift. The example assumes a staging table with candidate MAIDs for unresolved events and measures how many events would gain a canonical user if those identifiers matched existing graph rows.

BigQuery
WITH current_unresolved AS (
  SELECT
    merchant_id,
    event_id,
    maid_hash
  FROM `staging.hypothetical_identifiers`
  WHERE maid_hash IS NOT NULL
),
candidate_matches AS (
  SELECT
    u.merchant_id,
    u.event_id,
    i.canonical_id,
    i.confidence
  FROM current_unresolved u
  LEFT JOIN `warehouse.identities` i
    ON i.merchant_id = u.merchant_id
   AND i.type = 'maid'
   AND i.value = u.maid_hash
)
SELECT
  merchant_id,
  COUNT(*) AS unresolved_events_tested,
  COUNTIF(canonical_id IS NOT NULL) AS events_that_would_resolve,
  ROUND(
    100 * SAFE_DIVIDE(COUNTIF(canonical_id IS NOT NULL), COUNT(*)),
    2
  ) AS hypothetical_resolution_lift_pct,
  AVG(confidence) AS avg_candidate_confidence
FROM candidate_matches
GROUP BY 1
ORDER BY hypothetical_resolution_lift_pct DESC;
Snowflake
WITH current_unresolved AS (
  SELECT
    merchant_id,
    event_id,
    maid_hash
  FROM staging.hypothetical_identifiers
  WHERE maid_hash IS NOT NULL
),
candidate_matches AS (
  SELECT
    u.merchant_id,
    u.event_id,
    i.canonical_id,
    i.confidence
  FROM current_unresolved u
  LEFT JOIN warehouse.identities i
    ON i.merchant_id = u.merchant_id
   AND i.type = 'maid'
   AND i.value = u.maid_hash
)
SELECT
  merchant_id,
  COUNT(*) AS unresolved_events_tested,
  COUNT_IF(canonical_id IS NOT NULL) AS events_that_would_resolve,
  ROUND(
    100 * COUNT_IF(canonical_id IS NOT NULL) / NULLIF(COUNT(*), 0),
    2
  ) AS hypothetical_resolution_lift_pct,
  AVG(confidence) AS avg_candidate_confidence
FROM candidate_matches
GROUP BY 1
ORDER BY hypothetical_resolution_lift_pct DESC;
Evolution

Schema evolution

Identity graphs fail in warehouses when analysts hardcode today's identifier mix as if it were permanent. A merchant might start with cookie, email hash, and account ID, then later add app user IDs, loyalty numbers, or mobile advertising IDs such as MAID. If your dashboards join on a specific identifier column rather than `canonical_id`, every new source becomes a migration project.

The safer contract is additive. New identifier types arrive as new `type` values in `identities`, new edges in `identity_links`, and better coverage in `events.canonical_user_id`. Existing queries keep working because the canonical cluster shape does not change. Analysts who care about the new signal can opt in with a targeted `WHERE type = 'maid'` filter. Everyone else continues to query the same durable user key they used before.

Performance

Performance tips

Identity data becomes expensive when warehouse teams forget that graph queries are join-heavy by default. The simplest optimization is merchant pruning. Partition large exports by `merchant_id` so scans stay local, especially in shared multi-tenant projects where thousands of merchants coexist in one dataset.

Clustering should follow the question shape. For `identities` and `events`, clustering by `canonical_id` or `canonical_user_id` makes user-journey lookups and cluster-size analysis much cheaper. For recurring dashboards, materialized views are worth the maintenance cost. A pre-joined view that exposes `merchant_id`, `canonical_user_id`, primary identity types, and last activity often replaces multiple repetitive joins that would otherwise run in every BI tool refresh.

Privacy

Privacy + right-to-delete propagation

A deletion request should not stop at the profile row that first triggered it. In an identity graph, the real deletion scope is the resolved cluster: the account ID, linked email hashes, cookie IDs, stitched event rows, materialized views, and any downstream sync tables derived from that cluster. If one layer is missed, the user can be accidentally rehydrated from stale exports.

The warehouse pattern is therefore cascade-first. Mark the canonical cluster as deletion-pending, enumerate every linked identifier, remove or tombstone matching base rows, refresh any graph-derived materializations, and re-run outbound audience or destination sync jobs against the post-delete state. The important outcome is consistency: a merchant should never see the base graph say deleted while a cached audience table still treats the same canonical user as active.

Activation

Integration with Segment Personas / Customer.io

When TrackLayer becomes the identity resolution layer, downstream systems should consume `canonical_id` as the stable source-of-truth identifier instead of choosing whichever email or cookie happened to be present last. In Segment Personas, that usually means forwarding `canonical_id` as the durable external ID used for profile stitching and audience eligibility. In Customer.io, the same rule keeps lifecycle messaging attached to one resolved user even when email changes or multiple identifiers coexist inside the cluster.

This does not replace destination-specific identifiers. Email, phone, or app IDs still matter for message delivery and ad match quality. The difference is control: TrackLayer decides cluster membership once, then activation tools consume that decision rather than re-inventing identity logic inside each downstream platform.

FAQ

Common questions

Do I need to rebuild the identity graph in SQL?

Usually no. TrackLayer already exports `canonical_id` relationships and an event-side `canonical_user_id`. Most warehouse work should start from those fields and only drop down to `identity_links` when you need auditability, debugging, or confidence analysis.

Why keep both `identities` and `identity_links` if `canonical_id` already exists?

Because canonical membership answers who belongs together, while links answer why they were merged. Analysts use canonical IDs for reporting and links for lineage, merge review, and operational troubleshooting.

Will new identifier types break my dashboards?

They should not if you query by `canonical_id`, `canonical_user_id`, and high-level `type` filters. New types such as MAID can arrive as additive values in `type` while the canonical join contract stays stable.

Should downstream tools use email as the warehouse join key?

Prefer `canonical_id` as the source-of-truth join key and treat email as one attribute inside the cluster. Email changes, multiple emails can exist in one customer history, and some users never provide email at all.

How quickly should deletions propagate to derived tables?

As fast as your refresh schedule allows, but the rule should be deterministic. Base exports, materialized views, destination sync tables, and activation audiences all need the same deletion watermark so the graph cannot be rehydrated from stale warehouse outputs.

Next reads

Related implementation guides

We use essential cookies to keep the site secure and functional. Analytics and third-party tags run only with your consent. See our Cookie Policy.

We use essential cookies to keep the site secure and functional. Analytics and third-party tags run only with your consent. See our Cookie Policy.