Why Databricks
Databricks is a strong destination for TrackLayer events when the same ecommerce stream needs to support unified analytics, production reporting, experimentation, and machine learning. TrackLayer can land canonical events as Delta tables with stable fields for event_time, merchant_id, event_name, user_id, order_id, value, currency, platform, attribution metadata, and the raw payload. Analysts can query the stream in Databricks SQL, data engineers can refine it with notebooks or workflows, and ML teams can train models without copying the data into a separate feature environment.
Unity Catalog adds the governance layer that usually decides whether a warehouse export is usable in production. Instead of scattering permissions across workspace-local objects, your team can grant access at the catalog, schema, table, view, and function level, then trace lineage as raw TrackLayer events move into dbt models, dashboards, feature tables, and MLflow experiments. That separation lets TrackLayer own clean event delivery while Databricks owns durable storage, access policy, transformation, and model training.
Prerequisites
A Databricks workspace with permission to manage SQL warehouses, service principals, and Unity Catalog objects.
Unity Catalog enabled, with a metastore attached to the workspace that will receive TrackLayer event tables.
A SQL warehouse for direct sync writes or a Delta Live Tables pipeline if your team routes landing data through managed ingestion flows.
A Databricks service principal with a token or OAuth credential that can be scoped to the target catalog and schema.
Storage credentials for S3, ADLS, or GCS configured according to your cloud, metastore, and external location policies.
Step-by-step setup
Create the catalog and schema
Start by creating a dedicated Unity Catalog namespace. Keeping TrackLayer raw data in its own catalog and schema makes permissions, lineage, retention, and downstream source declarations easier to audit.
CREATE CATALOG IF NOT EXISTS tracklayer
COMMENT 'Raw and modeled TrackLayer ecommerce event data';
CREATE SCHEMA IF NOT EXISTS tracklayer.events
COMMENT 'Canonical TrackLayer event stream and ingestion metadata';Grant the service principal
Create or select a Databricks service principal, generate a credential in your normal secret workflow, and grant only the privileges TrackLayer needs to create and maintain Delta tables in the events schema.
GRANT USE CATALOG ON CATALOG tracklayer
TO `tracklayer-sync-sp`;
GRANT USE SCHEMA ON SCHEMA tracklayer.events
TO `tracklayer-sync-sp`;
GRANT CREATE TABLE ON SCHEMA tracklayer.events
TO `tracklayer-sync-sp`;Add the Databricks destination in TrackLayer
In TrackLayer, open Destinations, choose Databricks, and enter the connection details for the workspace and SQL warehouse. Use a staging workspace first if your governance team requires destination validation before production activation.
Host: https://adb-1234567890123456.7.azuredatabricks.net
Token: stored service principal secret
Warehouse HTTP path: /sql/1.0/warehouses/abc123def456
Catalog: tracklayer
Schema: eventsRun the first sync
Start a manual sync from TrackLayer. With CREATE TABLE privileges, TrackLayer can auto-create Delta tables for the canonical event stream, ingestion timestamps, payload metadata, and stable ecommerce dimensions.
tracklayer.events.events
tracklayer.events.sync_runs
tracklayer.events.event_errorsVerify recent rows
After the first sync finishes, validate row counts in Databricks SQL before enabling production schedules. Confirm event_time filters work because nearly every downstream model will use them.
SELECT COUNT(*)
FROM tracklayer.events.events
WHERE event_time > current_date() - INTERVAL 1 DAY;Delta Lake benefits
Delta Lake is the right storage format for a TrackLayer export because event tables are rarely static. Teams replay historical events, fix attribution rules, backfill merchant properties, and build new marts from the same source contract. Delta gives those changes transactional behavior without asking analysts to reason about partially written files.
- ACID transactions keep exports consistent even when scheduled syncs, analyst queries, and downstream jobs touch the same tables.
- Time travel gives data teams a practical way to inspect previous table versions during audits, backfills, or attribution model regressions.
- MERGE support makes replay handling and late-arriving updates straightforward when you deduplicate by event_id or update order-level facts.
- Z-ordering still helps for large tables with repeated filters on merchant_id, event_name, order_id, or platform. For newer workloads, evaluate liquid clustering before adding manual optimization jobs.
Sample queries
Revenue attribution by campaign
SELECT
attribution.campaign_id,
attribution.channel,
date_trunc('day', event_time) AS event_day,
sum(value) AS attributed_revenue,
count(DISTINCT order_id) AS orders
FROM tracklayer.events.events
WHERE event_name = 'purchase'
AND event_time >= current_date() - INTERVAL 30 DAYS
GROUP BY 1, 2, 3
ORDER BY event_day DESC, attributed_revenue DESC;Cohort LTV in PySpark
from pyspark.sql import functions as F
events = spark.table("tracklayer.events.events")
purchases = events.filter(F.col("event_name") == "purchase")
first_purchase = purchases.groupBy("merchant_id", "user_id").agg(
F.date_trunc("month", F.min("event_time")).alias("cohort_month")
)
ltv = (
purchases.join(first_purchase, ["merchant_id", "user_id"])
.withColumn("purchase_month", F.date_trunc("month", "event_time"))
.groupBy("merchant_id", "cohort_month", "purchase_month")
.agg(
F.sum("value").alias("cohort_revenue"),
F.countDistinct("user_id").alias("customers"),
)
.withColumn("ltv", F.col("cohort_revenue") / F.col("customers"))
)
display(ltv.orderBy("merchant_id", "cohort_month", "purchase_month"))Platform match quality distribution
SELECT
platform,
match_quality,
count(*) AS events,
count(DISTINCT user_id) AS matched_users,
round(100.0 * count(*) / sum(count(*)) OVER (PARTITION BY platform), 2)
AS platform_share
FROM tracklayer.events.events
WHERE event_time >= current_date() - INTERVAL 7 DAYS
GROUP BY 1, 2
ORDER BY platform, events DESC;MLflow integration
Once TrackLayer events are available in Databricks, MLflow can use the same governed Delta tables that power reporting. A typical attribution workflow starts by joining purchases, ad touches, consent state, product categories, discount usage, and customer history into a feature table. From there, data scientists can train incrementality, propensity, or channel contribution models while MLflow tracks parameters, metrics, artifacts, and model versions against the exact table snapshot used during training.
The practical benefit is repeatability. When a model changes the way it weights paid social, lifecycle email, marketplace traffic, or affiliate sessions, the team can inspect the underlying TrackLayer event version, compare experiment runs, and promote the model only after finance and analytics agree with the validation set. Databricks keeps the training path close to the raw event history instead of creating a disconnected modeling silo.
dbt on Databricks
Use dbt-databricks to treat tracklayer.events.events as a raw source and publish clean staging models, marts, and metrics-ready views. The first dbt layer should cast timestamps, unpack common structs, normalize event names, deduplicate event_id replays, and preserve raw payload fields needed for audits.
sources:
- name: tracklayer
catalog: tracklayer
schema: events
tables:
- name: events
models:
staging:
stg_tracklayer_events → cleaned canonical events
marts:
fct_orders → one row per order
fct_attribution_touches → channel and campaign paths
dim_customers → identity, cohort, and consent stateCost optimization
- Use serverless SQL warehouses when your team wants fast cold starts and less operational tuning. Use provisioned warehouses when workload isolation, predictable schedules, or existing reserved capacity matter more.
- Keep Photon enabled for Databricks SQL workloads. TrackLayer event queries usually scan timestamps, strings, structs, and numeric revenue fields that benefit from vectorized execution.
- Write queries and dbt models with event_time, merchant_id, platform, and event_name filters so Delta can prune files instead of scanning the full event history.
- For high-volume tables, test liquid clustering on event_time plus merchant_id or event_name before maintaining separate Z-order jobs. Revisit clustering keys when access patterns change.
Troubleshooting
PERMISSION_DENIED on catalog or schema
Confirm the service principal has USE CATALOG on tracklayer, USE SCHEMA on tracklayer.events, and CREATE TABLE on the events schema. Check that grants were applied to the principal used by the token.
Warehouse HTTP path is invalid
Copy the HTTP path from the Databricks SQL warehouse connection details, not the browser URL. Restart the TrackLayer connection test after updating the destination.
Delta table was created in the wrong namespace
Check the catalog and schema fields in TrackLayer. If a default catalog was used accidentally, pause syncs, update the destination, and move or replay the initial data into tracklayer.events.
Queries are slow after several weeks of data
Filter by event_time and merchant_id, compact small files, enable Photon, and evaluate liquid clustering or OPTIMIZE ZORDER BY for the fields used by dashboards and dbt models.
Common questions
Does TrackLayer require Unity Catalog?
Use Unity Catalog for production exports. It gives your team centralized grants, lineage, storage governance, and a cleaner path to share event data with analytics, ML, and finance teams.
Can TrackLayer write through Delta Live Tables?
TrackLayer usually writes to Databricks through a SQL warehouse. Teams that require DLT can land data in a raw Delta table first, then let a DLT pipeline validate, deduplicate, and publish curated tables.
Which table name should analysts use?
Start with tracklayer.events.events for raw canonical events. Most teams then expose dbt models or curated Databricks views for orders, attribution touches, cohorts, and campaign reporting.
How should historical replays be handled?
Use event_id as the stable deduplication key and preserve loaded_at for ingestion audits. Delta MERGE works well when you need replayed events to update a curated table instead of appending duplicates.
Can ML teams use the same data?
Yes. Feature engineering notebooks, Databricks workflows, and MLflow experiments can train directly from TrackLayer Delta tables or from curated feature tables built on top of the raw export.
Related implementation guides
Snowflake export guide
Compare the same TrackLayer warehouse pattern for Snowflake service users, raw events tables, dbt staging, and cost controls.
Read guide →Attribution models guide
Turn exported event streams into channel, campaign, and lifecycle models that explain what actually drove revenue.
Read guide →Identity resolution guide
Design stable customer identifiers before building Databricks cohorts, LTV models, and cross-platform match-quality reports.
Read guide →