How we built predictive audiences
When we say predictive audiences, we mean segments that update from live behaviour—not a one-off export and not a black box you cannot audit. Operators need to know why someone entered a cohort, how fresh the signal is, and what happens when consent or jurisdiction changes mid-flight. That requirement shaped every layer of the design.
Start with the contract, not the model
We began with a typed event contract: purchase, view, lead, subscription change, and server-side custom events all map into a single canonical schema before any scoring runs. If a field is optional in the warehouse but required for a model feature, the pipeline rejects events missing that field early and tags the rejection with its lineage. That sounds strict, but it saved us weeks of silent drift between training data and production traffic.
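To make the contract idea concrete, here is a minimal sketch of an early-rejecting canonicalizer. The field names, the REQUIRED_FIELDS table, the CanonicalEvent type, and the ContractViolation error are all hypothetical; the post names the event kinds but not their schemas.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical required fields per event type (illustrative only).
REQUIRED_FIELDS = {
    "purchase": ("user_id", "ts", "order_value"),
    "view": ("user_id", "ts", "item_id"),
    "lead": ("user_id", "ts"),
    "subscription_change": ("user_id", "ts", "plan"),
    "custom": ("user_id", "ts", "name"),
}


@dataclass(frozen=True)
class CanonicalEvent:
    event_type: str
    user_id: str
    ts: datetime
    properties: dict
    lineage: str  # where the raw record came from


class ContractViolation(ValueError):
    """Raised before scoring; carries the lineage tag for debugging."""

    def __init__(self, lineage, missing):
        super().__init__(f"{lineage}: missing required fields {missing}")
        self.lineage = lineage
        self.missing = missing


def to_canonical(raw: dict, lineage: str) -> CanonicalEvent:
    event_type = raw.get("event_type")
    if event_type not in REQUIRED_FIELDS:
        raise ContractViolation(lineage, ["event_type"])
    # A field may be nullable upstream, but the contract treats it as required.
    missing = [f for f in REQUIRED_FIELDS[event_type] if raw.get(f) is None]
    if missing:
        raise ContractViolation(lineage, missing)
    ts = datetime.fromtimestamp(raw["ts"], tz=timezone.utc)
    core = ("event_type", "user_id", "ts")
    props = {k: v for k, v in raw.items() if k not in core}
    return CanonicalEvent(event_type, str(raw["user_id"]), ts, props, lineage)
```

The point of raising rather than filling in a default is that a missing feature input surfaces at ingestion, with its source attached, instead of showing up later as a quiet shift in score distributions.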
Scores are versioned. Each audience definition references a score_version so you can replay history, compare uplift, and freeze a campaign on a known model slice while the rest of the workspace moves forward. Versioning is boring infrastructure work; it is also the difference between a demo and something finance will sign off on.
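As an illustration of pinning, the sketch below shows an audience definition that names its score_version explicitly and a membership lookup that only reads scores for that version. Apart from score_version, every name here (AudienceDefinition, SCORES, members) is an assumption made for the example, and the in-memory dictionary stands in for whatever score store is actually used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudienceDefinition:
    name: str
    score_name: str
    score_version: str  # pinned explicitly; never an implicit "latest"
    min_score: float


# Hypothetical score store keyed by (score_name, score_version, user_id).
SCORES = {
    ("churn", "v1", "u1"): 0.82,
    ("churn", "v1", "u2"): 0.40,
    ("churn", "v2", "u1"): 0.55,
    ("churn", "v2", "u2"): 0.71,
}


def members(audience: AudienceDefinition, scores: dict) -> list:
    """Return user ids whose pinned-version score clears the threshold."""
    return sorted(
        user
        for (name, version, user), score in scores.items()
        if name == audience.score_name
        and version == audience.score_version
        and score >= audience.min_score
    )


frozen_campaign = AudienceDefinition("likely_churners", "churn", "v1", 0.6)
current = AudienceDefinition("likely_churners", "churn", "v2", 0.6)
print(members(frozen_campaign, SCORES))  # membership under the frozen slice
print(members(current, SCORES))          # membership under the newer version
```

Because both versions stay queryable, a frozen campaign and the rest of the workspace can disagree about membership without either being wrong, and the difference can be replayed later.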
Features at the edge, training off the hot path
Inference runs close to ingestion: a compact feature vector is hydrated from recent events and profile traits, then fed into a quantized model small enough to execute without cold-starting a GPU farm. Training still happens in batch jobs on trusted infrastructure, with exports signed and checksummed before they are promoted. We never wanted an online training loop tied to ad click traffic; the failure modes are too exciting for a compliance-minded CDP.
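The promotion gate can be sketched with the standard library. The post does not say which signing scheme is used, so HMAC-SHA256 with a shared key is an assumption here, as are the function names; an asymmetric signature would fit the same shape.

```python
import hashlib
import hmac


def checksum(artifact: bytes) -> str:
    """SHA-256 digest of the exported model artifact."""
    return hashlib.sha256(artifact).hexdigest()


def sign(artifact: bytes, key: bytes) -> str:
    """HMAC-SHA256 signature (assumed scheme, for illustration)."""
    return hmac.new(key, artifact, hashlib.sha256).hexdigest()


def verify_before_promotion(
    artifact: bytes, expected_checksum: str, signature: str, key: bytes
) -> bool:
    """Promote only if both the checksum and the signature match."""
    ok_sum = hmac.compare_digest(checksum(artifact), expected_checksum)
    ok_sig = hmac.compare_digest(sign(artifact, key), signature)
    return ok_sum and ok_sig


# Training side: export, then record checksum and signature next to the artifact.
key = b"example-key-not-for-production"
artifact = b"quantized-model-bytes"
manifest = {"sha256": checksum(artifact), "sig": sign(artifact, key)}

# Serving side: refuse anything that was altered or signed with another key.
assert verify_before_promotion(artifact, manifest["sha256"], manifest["sig"], key)
assert not verify_before_promotion(artifact + b"x", manifest["sha256"], manifest["sig"], key)
```

The checksum catches corruption in transit; the signature is what ties the artifact to the trusted training environment.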
Cold start is handled explicitly. New SKUs and new traffic sources get backoff rules and sane defaults instead of pretending the model is confident on day one. Dashboards show coverage—how much of the segment has fresh features—as prominently as click-through.
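One plausible shape for the backoff rule and the coverage metric is sketched below. The shrink-toward-a-prior formula, the thresholds, and the names (score_with_backoff, coverage, Scored) are assumptions for illustration, not the rules described in the post.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Scored:
    score: float
    used_default: bool


def score_with_backoff(
    model_score: Optional[float],
    n_events: int,
    prior: float = 0.1,
    min_events: int = 20,
) -> Scored:
    """Blend the model score with a prior until enough events have accrued."""
    if model_score is None or n_events == 0:
        return Scored(prior, True)  # explicit default, flagged as such
    weight = min(n_events / min_events, 1.0)
    return Scored(weight * model_score + (1.0 - weight) * prior, False)


def coverage(feature_ages_hours: Sequence[Optional[float]], max_age_hours: float = 24.0) -> float:
    """Share of segment members whose features are fresh; None means never hydrated."""
    if not feature_ages_hours:
        return 0.0
    fresh = sum(1 for age in feature_ages_hours if age is not None and age <= max_age_hours)
    return fresh / len(feature_ages_hours)


print(score_with_backoff(None, 0))         # new SKU: falls back to the prior
print(score_with_backoff(0.9, 10))         # halfway to full trust in the model
print(coverage([1.0, 30.0, None, 5.0]))    # two of four members are fresh
```

Carrying the used_default flag alongside the score lets a dashboard separate "the model says low" from "the model has not seen enough yet".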
Evaluation that marketing and engineering share
We evaluate with holdout groups and with counterfactual sanity checks simple enough to explain in a stand-up: if we shuffle purchase timestamps within a window, does the score collapse as expected? If we strip a key feature, does rank order degrade smoothly? When those tests fail, we block the release rather than shipping a clever hack.
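Both checks can be written as a small release gate. The sketch below uses a toy scoring function as a stand-in for the real model and reads "collapse" as "scores after shuffling no longer track the originals"; that reading, the thresholds, and every name in the block are assumptions.

```python
import random


def ranks(values):
    """Rank positions (0 = smallest), assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order):
        result[i] = rank
    return result


def spearman(a, b):
    """Spearman rank correlation for tie-free inputs."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def toy_score(recency_days, spend):
    # Stand-in model: more recent purchases and higher spend score higher.
    return -recency_days + 3.0 * spend


def shuffle_check(recency, spend, seed=0, max_rho=0.5):
    """Shuffling purchase recency across users should break the ranking."""
    shuffled = recency[:]
    random.Random(seed).shuffle(shuffled)
    base = [toy_score(r, s) for r, s in zip(recency, spend)]
    broken = [toy_score(r, s) for r, s in zip(shuffled, spend)]
    return abs(spearman(base, broken)) <= max_rho


def ablation_check(recency, spend, min_rho=0.95):
    """Dropping a secondary feature should only nudge rank order."""
    base = [toy_score(r, s) for r, s in zip(recency, spend)]
    ablated = [toy_score(r, 0.0) for r in recency]
    return spearman(base, ablated) >= min_rho


rng = random.Random(42)
recency = [float(i) for i in range(200)]   # distinct values, so no ties
spend = [rng.random() for _ in range(200)]

release_ok = shuffle_check(recency, spend) and ablation_check(recency, spend)
print("release allowed" if release_ok else "release blocked")
```

A check like this is cheap enough to run on every candidate model, and the pass or fail result needs no statistics background to read.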
Shipping predictive audiences was less about a novel architecture and more about discipline: typed events, versioned scores, edge inference with an honest cold-start story, and evaluations that both teams can read. If you are building something similar, obsess over the contracts first—the model is the easy part.
What we deliberately did not build
We avoided real-time model retraining on click streams because the attack surface and the statistical validity story are both nightmares. We also avoided shipping a generic “propensity bundle” whose features cannot be inspected on the destination side; auditors, and your future teammates, deserve explicit feature lists.
We also said no to “just trust the uplift study from the vendor deck.” Everything we promote internally has reproducible notebooks and frozen seeds so the growth team cannot accidentally compare apples shipped on Tuesday with oranges counted on Wednesday.
Operating the thing
On-call owns latency and error budgets for scoring, not the accuracy of a specific campaign narrative. Accuracy discussions happen in experimentation forums with predefined minimum detectable effects. That separation keeps incidents actionable: paging fires when ingestion drops or scores fail to hydrate, while model quality debates settle on dashboards with owners and calendar time.
Glue work mattered enormously: documentation that explains which events power which features, runbooks for when a downstream destination rejects hashed identifiers, and playbooks for when consent is revoked midway through an active flight. Predictive audiences are a product surface, an operations surface, and a legal surface simultaneously; pretending otherwise invites silent failure modes.
Finally, partner feedback tightened our stance on explanations: marketers do not want SHAP diagrams for breakfast, but they do want a concise “top three drivers” tooltip per decile during an active campaign. We render those from pre-aggregated feature attributions recomputed nightly so interactive reads stay cheap. Detail exists for analysts who drill in; sane defaults exist for everybody else.
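A nightly pre-aggregation of this kind might look like the following sketch. It ranks features by summed absolute attribution within each decile; that ranking rule, the input shape, and the function name are assumptions, since the post does not say how attributions are computed or combined.

```python
from collections import defaultdict


def top_drivers_by_decile(rows, top_n=3):
    """rows: iterable of (decile, {feature: attribution}) pairs, one per user.

    Returns {decile: [feature, ...]} ordered by summed absolute attribution,
    with ties broken alphabetically so the output is stable between runs.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for decile, attributions in rows:
        for feature, value in attributions.items():
            totals[decile][feature] += abs(value)
    return {
        decile: [
            name
            for name, _ in sorted(feats.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
        ]
        for decile, feats in totals.items()
    }


# Nightly job: compute once and store; the tooltip then does a dictionary lookup.
rows = [
    (10, {"recency": 0.4, "spend": -0.3, "visits": 0.1, "tenure": 0.05}),
    (10, {"recency": 0.2, "spend": -0.5, "visits": 0.1, "tenure": 0.0}),
    (1, {"recency": -0.6, "spend": 0.1, "visits": 0.2, "tenure": 0.3}),
]
tooltips = top_drivers_by_decile(rows)
print(tooltips[10])
print(tooltips[1])
```

Using absolute values means a feature that strongly lowers scores still counts as a driver; whether the tooltip should also show direction is a product decision the sketch leaves open.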

