GUIDE · IDENTITY DEEP DIVE · 13 min read

Identity resolution for tracking: from cookie graphs to union-find

A technical guide to the identity layer behind server-side tracking: how cookies become graph nodes, how hashed identifiers raise confidence, how union-find keeps merges fast, and how privacy constraints define what should never be stitched.

Foundation

Why identity is the foundation

Tracking systems do not fail only because an event is missing. They also fail when a valid event is attached to the wrong identity, stranded in an anonymous session, or counted twice under two identifiers that really describe one customer journey. A paid click in Safari, a browse session in Chrome, an add-to-cart event on mobile, and a backend purchase webhook can all be true records. Without an identity layer, the reporting system still sees four fragments. Attribution then rewards the event that happened to carry the strongest identifier, not necessarily the path that influenced the order.

Identity resolution is the discipline of deciding when those fragments belong together. In tracking, the decision must be fast, conservative, auditable, and reversible. Fast because events arrive continuously. Conservative because over-merging corrupts analytics more severely than under-merging. Auditable because privacy and finance teams need to understand why records were joined. Reversible because consent withdrawal and right-to-delete requests must propagate through every joined record, not just the visible customer profile.

Model

The 4 identity layers

A robust graph does not treat every identifier equally. TrackLayer assigns each layer a persistence expectation, a confidence level, and a privacy classification so merge decisions can be policy-led instead of opportunistic.

| Layer | Persistence window | Who uses it | Confidence level | Privacy classification |
| --- | --- | --- | --- | --- |
| Cookie/Pseudonymous | Hours to 13 months depending on browser, consent, and storage limits | Web analytics, ad pixels, session replay, onsite personalization | Medium on one browser, low across browsers or devices | Pseudonymous personal data when linkable to behavior or a user |
| Email hash | Long-lived until email changes, deletion, or consent withdrawal | Ads CAPI, lifecycle tools, CRM sync, server-side attribution | High when normalized, consented, and collected directly | Personal data in most regimes because a hash can be re-identified by matching it against known values |
| Phone hash | Long-lived, but more prone to reuse, family plans, and formatting drift | SMS platforms, ads enhanced matching, checkout recovery | High with E.164 normalization, medium when captured from forms | Personal data and often sensitive operational contact data |
| Account ID | Durable for the life of the customer account | Backend systems, warehouses, subscription platforms, support tools | Very high inside the first-party system of record | Direct personal data once mapped to an identifiable account |
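
One way to make the table above usable by merge logic is to encode it as data. The sketch below is illustrative only: the names `Confidence`, `LayerPolicy`, `LAYERS`, and `can_bridge_contexts` are hypothetical, each layer is reduced to its best-case confidence from the table, and the rule "a cross-browser or cross-device link needs at least one high-confidence identifier" is an assumption drawn from the scenarios later in this guide, not a documented TrackLayer API.

```python
from dataclasses import dataclass
from enum import IntEnum


class Confidence(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    VERY_HIGH = 4


@dataclass(frozen=True)
class LayerPolicy:
    best_case_confidence: Confidence
    classification: str


# Best-case confidence per layer, transcribed from the table (names are hypothetical).
LAYERS = {
    "cookie": LayerPolicy(Confidence.MEDIUM, "pseudonymous"),
    "email_hash": LayerPolicy(Confidence.HIGH, "personal"),
    "phone_hash": LayerPolicy(Confidence.HIGH, "personal"),
    "account_id": LayerPolicy(Confidence.VERY_HIGH, "direct personal"),
}


def can_bridge_contexts(layer_a: str, layer_b: str) -> bool:
    """Assumed rule: linking across browsers or devices needs HIGH or better evidence."""
    strongest = max(
        LAYERS[layer_a].best_case_confidence,
        LAYERS[layer_b].best_case_confidence,
    )
    return strongest >= Confidence.HIGH
```

Under this rule two cookies alone never bridge browsers, while a cookie paired with an email hash can.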

Algorithm

Union-find algorithm explained

01

The disjoint-set data structure

Union-find starts with a simple promise: every identifier is a node, and every node belongs to exactly one set. At first, a cookie, an email hash, a phone hash, and an account ID may all be separate sets. Each set has a representative node called a root. The root is not morally better than the other nodes. It is just the pointer that lets the system answer one question quickly: which identity cluster does this identifier currently belong to?

02

How TrackLayer finds roots and merges nodes

When TrackLayer receives an event, it normalizes the identifiers on that event and looks up each node. The find operation follows parent pointers until it reaches the root for that cluster. If a checkout event contains cookie tl_abc and email hash em_123, TrackLayer finds both roots. If they differ and the link passes consent and confidence rules, union attaches one root to the other. From that moment, either identifier resolves to the same cluster for reporting and destination matching.
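
The find-then-union flow can be sketched in a few lines. This is a deliberately naive version without any balancing, and the function names, the `cookie:`/`email:` key prefixes, and the two boolean flags are assumptions made for illustration rather than TrackLayer's actual interface. It also assumes every event carries at least one identifier.

```python
parent: dict[str, str] = {}


def find(node: str) -> str:
    """Follow parent pointers to the root; unseen nodes start as their own set."""
    parent.setdefault(node, node)
    while parent[node] != node:
        node = parent[node]
    return node


def union(a: str, b: str) -> str:
    """Attach the root of b's cluster to the root of a's cluster."""
    root_a, root_b = find(a), find(b)
    if root_a != root_b:
        parent[root_b] = root_a
    return root_a


def resolve_event(identifiers: list[str], consent_ok: bool, confidence_ok: bool) -> str:
    """Return the cluster root for an event, merging only when both rules pass."""
    for identifier in identifiers:
        find(identifier)  # registers unseen identifiers as singleton sets
    if consent_ok and confidence_ok:
        for other in identifiers[1:]:
            union(identifiers[0], other)
    return find(identifiers[0])


# A checkout event carrying both identifiers, with consent and confidence satisfied.
resolve_event(["cookie:tl_abc", "email:em_123"], consent_ok=True, confidence_ok=True)
```

After the call, `find("cookie:tl_abc")` and `find("email:em_123")` return the same root. If either flag is false, the identifiers are still registered but stay in separate sets.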

03

Retroactive backfill after cookie → email

The graph becomes especially useful when a user browses anonymously before identifying later. Suppose three product views arrive under cookie tl_abc on Monday, then the same browser submits an email during checkout on Friday. Once TrackLayer links tl_abc to em_123, earlier events in the cookie cluster can be backfilled into the email-bearing identity cluster. The events keep their original timestamps, but downstream attribution can now include them in the same journey.
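
The Monday-to-Friday example can be sketched by storing events per cluster root and moving them when two clusters join. The storage layout, the function names `record` and `link`, and the sample dates are assumptions for illustration. The point is only that backfill moves events between clusters without touching their timestamps.

```python
from collections import defaultdict

parent: dict[str, str] = {}
# Events are stored per cluster root as (ISO timestamp, event name) tuples.
events_by_root: dict[str, list[tuple[str, str]]] = defaultdict(list)


def find(node: str) -> str:
    parent.setdefault(node, node)
    while parent[node] != node:
        node = parent[node]
    return node


def record(identifier: str, timestamp: str, name: str) -> None:
    events_by_root[find(identifier)].append((timestamp, name))


def link(anonymous_id: str, known_id: str) -> str:
    """Merge the anonymous cluster into the known one and backfill its events."""
    src, dst = find(anonymous_id), find(known_id)
    if src == dst:
        return dst
    parent[src] = dst
    events_by_root[dst].extend(events_by_root.pop(src, []))
    # Same-format UTC ISO timestamps sort chronologically as plain strings.
    events_by_root[dst].sort()
    return dst


# Monday: three anonymous product views. Friday: checkout with an email hash.
for minute in ("10:00", "10:05", "10:12"):
    record("cookie:tl_abc", f"2024-03-04T{minute}:00Z", "product_view")
record("email:em_123", "2024-03-08T09:30:00Z", "checkout")
link("cookie:tl_abc", "email:em_123")
```

The email-bearing cluster now holds all four events in chronological order, and the Monday views keep their original timestamps.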

04

Why O(alpha(n)) matters

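Union-find is fast because two optimizations keep its trees shallow. Path compression repoints every node visited during a find directly at the root, and union by rank attaches the lower-rank root under the higher-rank one. With both in place, the amortized cost per operation is O(alpha(n)), where alpha is the inverse Ackermann function. Alpha grows so slowly that it stays below 5 for any realistic number of identifiers, so each operation behaves like constant time for tracking workloads. That matters when a high-volume store is resolving identifiers on every page view, cart mutation, checkout step, webhook, and order event.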
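
A minimal sketch of path compression and union by rank follows. The class name `UnionFind` and the string identifiers are illustrative, and a production graph would persist this state rather than hold it in memory.

```python
class UnionFind:
    def __init__(self) -> None:
        self.parent: dict[str, str] = {}
        self.rank: dict[str, int] = {}

    def find(self, x: str) -> str:
        self.parent.setdefault(x, x)
        self.rank.setdefault(x, 0)
        # First pass: walk up to the root.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass (path compression): point each visited node at the root.
        while self.parent[x] != root:
            next_node = self.parent[x]
            self.parent[x] = root
            x = next_node
        return root

    def union(self, a: str, b: str) -> str:
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return root_a
        # Union by rank: the lower-rank root goes under the higher-rank root.
        if self.rank[root_a] < self.rank[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        if self.rank[root_a] == self.rank[root_b]:
            self.rank[root_a] += 1
        return root_a


uf = UnionFind()
uf.union("cookie:tl_abc", "email:em_123")
uf.union("phone:ph_456", "account:acct_77")
uf.union("email:em_123", "account:acct_77")
```

All four identifiers now resolve to one root, and any later `find` on a deep node shortens the path for every node it passes through.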

root_cookie_cluster = find(cookie_tl_abc)
root_email_cluster = find(email_em_123)

# Merge only when the roots differ and both policy checks pass.
if root_cookie_cluster != root_email_cluster and consent_allows_merge and confidence_is_high:
    union(root_cookie_cluster, root_email_cluster)
    backfill(cookie_tl_abc.events, email_em_123.cluster)

Scenarios

Common identity stitching scenarios

Anonymous → signup on same device

A visitor lands from paid search, views products, adds one item to cart, then signs up with email in the same browser. TrackLayer keeps the anonymous cookie events available for immediate session reporting. After signup, the cookie and email hash are joined, so the acquisition click, product interest, and signup become one identity path.

Cross-device phone → desktop

A user clicks an SMS link on a phone, then later buys on desktop with the same email or account login. The phone hash may form the first cluster, while desktop checkout contributes cookie and email hash nodes. A confirmed email or account ID lets those device-specific paths converge without relying on probabilistic device fingerprinting.

Cross-browser Safari → Chrome

Safari and Chrome maintain separate browser storage, so a cookie-only model treats them as two users. If the same shopper identifies in both browsers, TrackLayer can merge the two cookie nodes through the shared email hash or account ID. If neither browser identifies, the graph stays split because the evidence is not strong enough.

Shared devices

A family computer can produce conflicting signals: one cookie, multiple emails, multiple accounts, and overlapping checkout behavior. TrackLayer treats this as a collision risk. It can keep the cookie as a weak shared node, prefer account-scoped events for reporting, and avoid retroactively assigning every anonymous action to the newest login.
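
A shared-device guard can be sketched as a union that refuses to put two different account IDs in one cluster. The `account:` key prefix, the `try_union` name, and the "first login keeps the cookie" outcome are simplifying assumptions. A fuller policy might instead detach the cookie into its own weak shared node once a second account appears.

```python
parent: dict[str, str] = {}
accounts: dict[str, set[str]] = {}  # cluster root -> account IDs in that cluster


def find(node: str) -> str:
    if node not in parent:
        parent[node] = node
        accounts[node] = {node} if node.startswith("account:") else set()
    while parent[node] != node:
        node = parent[node]
    return node


def try_union(a: str, b: str) -> bool:
    """Merge two clusters unless the result would contain two account IDs."""
    root_a, root_b = find(a), find(b)
    if root_a == root_b:
        return True
    combined = accounts[root_a] | accounts[root_b]
    if len(combined) > 1:
        return False  # collision risk: keep the clusters split
    parent[root_b] = root_a
    accounts[root_a] = combined
    del accounts[root_b]
    return True


first = try_union("cookie:family_pc", "account:acct_1")   # merges
second = try_union("cookie:family_pc", "account:acct_2")  # refused
```

The refused merge leaves `account:acct_2` in its own cluster, so that shopper's account-scoped events stay reportable without inheriting the other household member's anonymous history.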

Privacy

Privacy guardrails

Identity systems should be judged by what they refuse to merge. TrackLayer keeps privacy signals in the same decision path as identifiers, so a technically possible link can still be blocked when the user, browser, or policy says no.

  • Denied-consent users are not merged into marketing or attribution identity clusters, even if identifiers appear in backend events.
  • Incognito and private browsing sessions are treated as intentionally short-lived. TrackLayer does not use them as durable graph evidence.
  • Strict-tracking headers and comparable privacy signals prevent opportunistic joining, enrichment, or backfill beyond the permitted operational purpose.
  • Conflicting high-confidence identifiers can quarantine a graph segment until the collision is reviewed or resolved by account-system truth.
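
The four guardrails above can be read as a single gate that runs before any union. In this sketch the `PrivacyContext` fields and reason strings are hypothetical, and it assumes upstream code supplies the flags. Reliably detecting a private browsing session in particular is not something this example attempts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrivacyContext:
    marketing_consent: bool
    private_browsing: bool = False      # assumed to be supplied by the collector
    privacy_signal: bool = False        # a strict-tracking header or similar signal
    cluster_quarantined: bool = False   # unresolved high-confidence collision


def merge_decision(ctx: PrivacyContext) -> tuple[bool, str]:
    """Return (allowed, reason) so every refusal leaves an auditable reason."""
    if not ctx.marketing_consent:
        return False, "consent_denied"
    if ctx.private_browsing:
        return False, "private_session"
    if ctx.privacy_signal:
        return False, "privacy_signal"
    if ctx.cluster_quarantined:
        return False, "quarantined"
    return True, "allowed"
```

Returning a reason alongside the boolean is a design choice that supports the audit requirement: a blocked link can be explained later without replaying the event.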

Operations

Identity resolution metrics to monitor

resolution_rate

Share of events that resolve beyond a single anonymous session into a known or durable pseudonymous cluster.

cross_device_match_rate

Share of resolved identities with evidence from more than one device class or browser context.

anonymous_signup_backfill_rate

Share of signup or purchase identities where earlier anonymous events were attached after identification.

identity_collision_rate

Frequency of clusters containing conflicting emails, phones, account IDs, consent states, or deletion markers.

deletion_propagation_lag

Time between a deletion request and removal of every joined record, edge, event reference, and destination export.
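
As a rough sketch, three of these metrics can be computed from simple records. The record shapes (`cluster_kind`, `account_ids`, `email_hashes`) are assumptions made for this example, and the collision check covers only account IDs and email hashes rather than every conflict type listed above.

```python
from datetime import datetime


def _rate(part: int, whole: int) -> float:
    return part / whole if whole else 0.0


def resolution_rate(events: list[dict]) -> float:
    """Share of events whose cluster is more than a single anonymous session."""
    resolved = sum(1 for e in events if e["cluster_kind"] != "anonymous_session")
    return _rate(resolved, len(events))


def identity_collision_rate(clusters: list[dict]) -> float:
    """Share of clusters holding more than one account ID or email hash."""
    collided = sum(
        1 for c in clusters
        if len(c["account_ids"]) > 1 or len(c["email_hashes"]) > 1
    )
    return _rate(collided, len(clusters))


def deletion_propagation_lag(requested_at: str, last_removed_at: str) -> float:
    """Seconds between the deletion request and the final removal."""
    start = datetime.fromisoformat(requested_at)
    end = datetime.fromisoformat(last_removed_at)
    return (end - start).total_seconds()


events = [
    {"cluster_kind": "known"},
    {"cluster_kind": "anonymous_session"},
    {"cluster_kind": "pseudonymous"},
    {"cluster_kind": "anonymous_session"},
]
clusters = [
    {"account_ids": {"acct_77"}, "email_hashes": {"em_123"}},
    {"account_ids": {"acct_1", "acct_2"}, "email_hashes": set()},
]
```

With this sample data both rates come out to 0.5. The lag function expects ISO 8601 strings with an explicit offset such as `+00:00`.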

Attribution

Attribution implications

Attribution quality is downstream of identity quality. When a platform receives a conversion with only a weak browser identifier, it may match the event poorly or not at all. When the same event carries normalized, consented, hashed email or phone in addition to a first-party event ID, the destination has more deterministic evidence to match on. That typically improves match quality scores, reduces unattributed conversions, and gives optimization systems cleaner feedback.

The biggest reporting shift comes from pre-identification behavior. If anonymous product views and cart events are later backfilled into a known cluster, conversion reporting can connect upper-funnel activity to the purchase without pretending the user was known at the time. The timestamps remain historical, the merge is auditable, and destinations can receive richer conversion payloads where policy allows.

Deletion

GDPR + right-to-delete

A deletion request is graph-wide, not profile-wide. If account ID acct_77 is joined to email hash em_123, phone hash ph_456, and cookie tl_abc, deleting only the account record leaves linked behavioral data behind. TrackLayer treats the resolved cluster as the deletion scope. It marks the root as deletion-pending, freezes new marketing merges, enumerates every joined identifier, removes or tombstones events according to retention policy, and sends deletion signals to destinations that received exported records.
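
Enumerating the deletion scope amounts to collecting every identifier reachable from the requested one. The sketch below assumes the graph keeps its merge edges as plain pairs, which plain union-find parent pointers do not provide on their own, and the `deletion_scope` name and key prefixes are illustrative.

```python
from collections import defaultdict


def deletion_scope(edges: list[tuple[str, str]], start: str) -> set[str]:
    """Return every identifier reachable from `start` through joined edges."""
    neighbours: dict[str, set[str]] = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in neighbours[node] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return seen


edges = [
    ("account:acct_77", "email:em_123"),
    ("email:em_123", "phone:ph_456"),
    ("email:em_123", "cookie:tl_abc"),
    ("cookie:tl_other", "email:em_999"),  # unrelated cluster
]
scope = deletion_scope(edges, "account:acct_77")
```

Here `scope` contains the account, email hash, phone hash, and cookie from the example, and nothing from the unrelated cluster. Keeping the edge list also gives the audit trail a record of why each identifier was in scope.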

The deletion job also has to prevent rehydration. If a later webhook arrives with the same account ID, it should not recreate the old graph from stale exports or queue retries. The graph needs a deletion tombstone with enough metadata to suppress accidental rebuilds while avoiding retention of the personal data the user asked to remove.
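
One possible shape for such a tombstone is a keyed fingerprint of each deleted identifier, checked before any new node is created. Everything here is an assumption for illustration: the function names, the in-memory set, and the placeholder key, which would come from a managed secret store. Whether a keyed hash is acceptable to retain after deletion is a policy question this sketch does not settle.

```python
import hashlib
import hmac

TOMBSTONE_KEY = b"replace-with-a-managed-secret"  # hypothetical placeholder
tombstones: set[str] = set()


def _fingerprint(identifier: str) -> str:
    """Keyed hash so the tombstone does not store the identifier itself."""
    return hmac.new(TOMBSTONE_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def record_deletion(identifiers: set[str]) -> None:
    for identifier in identifiers:
        tombstones.add(_fingerprint(identifier))


def may_rebuild(identifier: str) -> bool:
    """False when a stale export or retry references a deleted identifier."""
    return _fingerprint(identifier) not in tombstones


record_deletion({"account:acct_77", "email:em_123"})
```

A late webhook for `account:acct_77` would now fail the `may_rebuild` check, while unrelated identifiers pass. A genuine re-signup after deletion would need its own explicit path to clear the tombstone.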

FAQ

Common questions

Is this fingerprinting?

No. The model described here links identifiers that a site directly collects or sets, such as first-party cookies, hashed email, hashed phone, and account IDs. TrackLayer does not need canvas, font, battery, hardware, or passive device fingerprints to build the graph.

Can we hash before TrackLayer sees the value?

Yes. Many teams normalize and hash email or phone before sending it to TrackLayer. The important part is consistency: trim whitespace, lowercase emails, normalize phones to E.164, use the agreed hash algorithm, and apply consent rules before dispatch.
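
A minimal sketch of that normalization follows. SHA-256 stands in for "the agreed hash algorithm" as an assumption, and the phone handling is intentionally naive: it only strips formatting and prepends an assumed default country code, so real traffic would need a proper phone-number library to handle trunk prefixes and national formats.

```python
import hashlib
import re


def normalize_email(raw: str) -> str:
    return raw.strip().lower()


def to_e164(raw: str, default_country_code: str = "1") -> str:
    """Naive E.164 formatting: keep digits, add a country code if none is given."""
    digits = re.sub(r"\D", "", raw)
    if raw.strip().startswith("+"):
        return "+" + digits
    return "+" + default_country_code + digits


def sha256_hex(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def hash_email(raw: str) -> str:
    return sha256_hex(normalize_email(raw))


def hash_phone(raw: str, default_country_code: str = "1") -> str:
    return sha256_hex(to_e164(raw, default_country_code))
```

Because normalization runs before hashing, `" Jane@Example.com "` and `"jane@example.com"` produce the same digest, which is what lets two systems match on the hash without exchanging the raw address.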

Does TrackLayer support cross-domain stitching?

Cross-domain stitching is possible when the domains are under a permitted first-party relationship and the consent model allows it. The safest implementation passes a short-lived linker token or server-issued identifier during navigation, then exchanges it server-side rather than exposing a durable universal ID in URLs.
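
The linker-token pattern can be sketched as an opaque, single-use token that only the server can map back to a visitor. The function names, the 120-second lifetime, and the in-memory dictionary are assumptions for illustration. A real deployment would use a shared store that both domains' servers can reach.

```python
import secrets
import time
from typing import Optional

TTL_SECONDS = 120  # assumed lifetime, long enough for one navigation
_pending: dict[str, tuple[str, float]] = {}  # token -> (visitor ID, expiry)


def issue_linker_token(visitor_id: str, now: Optional[float] = None) -> str:
    """Create an opaque token to append to the outbound link."""
    now = time.time() if now is None else now
    token = secrets.token_urlsafe(16)
    _pending[token] = (visitor_id, now + TTL_SECONDS)
    return token


def redeem_linker_token(token: str, now: Optional[float] = None) -> Optional[str]:
    """Exchange the token server-side; it works once and only before expiry."""
    now = time.time() if now is None else now
    entry = _pending.pop(token, None)  # pop makes the token single-use
    if entry is None:
        return None
    visitor_id, expires_at = entry
    return visitor_id if now <= expires_at else None
```

The URL only ever carries the random token, so a copied or shared link reveals nothing durable and stops working after the first exchange or the expiry, whichever comes first.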

What is the difference between device graphs and identity graphs?

A device graph tries to connect browsers, phones, tablets, and apps that may belong to one person or household. An identity graph is anchored on identifiers with business meaning, such as account ID, email hash, phone hash, and consented customer records. TrackLayer favors identity evidence over probabilistic device assumptions.

Can identity data seed lookalike audiences?

Only when consent, destination policy, and your lawful basis allow that use. Identity resolution can improve match quality for approved conversion and audience workflows, but TrackLayer should not turn operational identifiers into ad audience seeds by default.
