Skip to main content
GUIDE · DATABRICKS9 min read

Export TrackLayer events to Databricks (Unity Catalog + Delta Lake)

A practical setup guide for sending TrackLayer's canonical ecommerce events into Databricks, governing them with Unity Catalog, storing them as Delta Lake tables, and preparing the data for analytics, attribution modeling, dbt, and MLflow workflows.

Context

Why Databricks

Databricks is a strong destination for TrackLayer events when the same ecommerce stream needs to support unified analytics, production reporting, experimentation, and machine learning. TrackLayer can land canonical events as Delta tables with stable fields for event_time, merchant_id, event_name, user_id, order_id, value, currency, platform, attribution metadata, and the raw payload. Analysts can query the stream in Databricks SQL, data engineers can refine it with notebooks or workflows, and ML teams can train models without copying the data into a separate feature environment.

Unity Catalog adds the governance layer that usually decides whether a warehouse export is usable in production. Instead of scattering permissions across workspace-local objects, your team can grant access at the catalog, schema, table, view, and function level, then trace lineage as raw TrackLayer events move into dbt models, dashboards, feature tables, and MLflow experiments. That separation lets TrackLayer own clean event delivery while Databricks owns durable storage, access policy, transformation, and model training.

Requirements

Prerequisites

01

A Databricks workspace with permission to manage SQL warehouses, service principals, and Unity Catalog objects.

02

Unity Catalog enabled, with a metastore attached to the workspace that will receive TrackLayer event tables.

03

A SQL warehouse for direct sync writes or a Delta Live Tables pipeline if your team routes landing data through managed ingestion flows.

04

A Databricks service principal with a token or OAuth credential that can be scoped to the target catalog and schema.

05

Storage credentials for S3, ADLS, or GCS configured according to your cloud, metastore, and external location policies.

Build

Step-by-step setup

Step 1

Create the catalog and schema

Start by creating a dedicated Unity Catalog namespace. Keeping TrackLayer raw data in its own catalog and schema makes permissions, lineage, retention, and downstream source declarations easier to audit.

CREATE CATALOG IF NOT EXISTS tracklayer
  COMMENT 'Raw and modeled TrackLayer ecommerce event data';

CREATE SCHEMA IF NOT EXISTS tracklayer.events
  COMMENT 'Canonical TrackLayer event stream and ingestion metadata';
Step 2

Grant the service principal

Create or select a Databricks service principal, generate a credential in your normal secret workflow, and grant only the privileges TrackLayer needs to create and maintain Delta tables in the events schema.

GRANT USE CATALOG ON CATALOG tracklayer
  TO `tracklayer-sync-sp`;

GRANT USE SCHEMA ON SCHEMA tracklayer.events
  TO `tracklayer-sync-sp`;

GRANT CREATE TABLE ON SCHEMA tracklayer.events
  TO `tracklayer-sync-sp`;
Step 3

Add the Databricks destination in TrackLayer

In TrackLayer, open Destinations, choose Databricks, and enter the connection details for the workspace and SQL warehouse. Use a staging workspace first if your governance team requires destination validation before production activation.

Host: https://adb-1234567890123456.7.azuredatabricks.net
Token: stored service principal secret
Warehouse HTTP path: /sql/1.0/warehouses/abc123def456
Catalog: tracklayer
Schema: events
Step 4

Run the first sync

Start a manual sync from TrackLayer. With CREATE TABLE privileges, TrackLayer can auto-create Delta tables for the canonical event stream, ingestion timestamps, payload metadata, and stable ecommerce dimensions.

tracklayer.events.events
tracklayer.events.sync_runs
tracklayer.events.event_errors
Step 5

Verify recent rows

After the first sync finishes, validate row counts in Databricks SQL before enabling production schedules. Confirm event_time filters work because nearly every downstream model will use them.

SELECT COUNT(*)
FROM tracklayer.events.events
WHERE event_time > current_date() - INTERVAL 1 DAY;
Lakehouse

Delta Lake benefits

Delta Lake is the right storage format for a TrackLayer export because event tables are rarely static. Teams replay historical events, fix attribution rules, backfill merchant properties, and build new marts from the same source contract. Delta gives those changes transactional behavior without asking analysts to reason about partially written files.

  • ACID transactions keep exports consistent even when scheduled syncs, analyst queries, and downstream jobs touch the same tables.
  • Time travel gives data teams a practical way to inspect previous table versions during audits, backfills, or attribution model regressions.
  • MERGE support makes replay handling and late-arriving updates straightforward when you deduplicate by event_id or update order-level facts.
  • Z-ordering still helps for large tables with repeated filters on merchant_id, event_name, order_id, or platform. For newer workloads, evaluate liquid clustering before adding manual optimization jobs.
Analytics

Sample queries

Revenue attribution by campaign

SELECT
  attribution.campaign_id,
  attribution.channel,
  date_trunc('day', event_time) AS event_day,
  sum(value) AS attributed_revenue,
  count(DISTINCT order_id) AS orders
FROM tracklayer.events.events
WHERE event_name = 'purchase'
  AND event_time >= current_date() - INTERVAL 30 DAYS
GROUP BY 1, 2, 3
ORDER BY event_day DESC, attributed_revenue DESC;

Cohort LTV in PySpark

from pyspark.sql import functions as F

events = spark.table("tracklayer.events.events")
purchases = events.filter(F.col("event_name") == "purchase")

first_purchase = purchases.groupBy("merchant_id", "user_id").agg(
    F.date_trunc("month", F.min("event_time")).alias("cohort_month")
)

ltv = (
    purchases.join(first_purchase, ["merchant_id", "user_id"])
    .withColumn("purchase_month", F.date_trunc("month", "event_time"))
    .groupBy("merchant_id", "cohort_month", "purchase_month")
    .agg(
        F.sum("value").alias("cohort_revenue"),
        F.countDistinct("user_id").alias("customers"),
    )
    .withColumn("ltv", F.col("cohort_revenue") / F.col("customers"))
)

display(ltv.orderBy("merchant_id", "cohort_month", "purchase_month"))

Platform match quality distribution

SELECT
  platform,
  match_quality,
  count(*) AS events,
  count(DISTINCT user_id) AS matched_users,
  round(100.0 * count(*) / sum(count(*)) OVER (PARTITION BY platform), 2)
    AS platform_share
FROM tracklayer.events.events
WHERE event_time >= current_date() - INTERVAL 7 DAYS
GROUP BY 1, 2
ORDER BY platform, events DESC;
Machine learning

MLflow integration

Once TrackLayer events are available in Databricks, MLflow can use the same governed Delta tables that power reporting. A typical attribution workflow starts by joining purchases, ad touches, consent state, product categories, discount usage, and customer history into a feature table. From there, data scientists can train incrementality, propensity, or channel contribution models while MLflow tracks parameters, metrics, artifacts, and model versions against the exact table snapshot used during training.

The practical benefit is repeatability. When a model changes the way it weights paid social, lifecycle email, marketplace traffic, or affiliate sessions, the team can inspect the underlying TrackLayer event version, compare experiment runs, and promote the model only after finance and analytics agree with the validation set. Databricks keeps the training path close to the raw event history instead of creating a disconnected modeling silo.

Modeling

dbt on Databricks

Use dbt-databricks to treat tracklayer.events.events as a raw source and publish clean staging models, marts, and metrics-ready views. The first dbt layer should cast timestamps, unpack common structs, normalize event names, deduplicate event_id replays, and preserve raw payload fields needed for audits.

sources:
  - name: tracklayer
    catalog: tracklayer
    schema: events
    tables:
      - name: events

models:
  staging:
    stg_tracklayer_events → cleaned canonical events
  marts:
    fct_orders → one row per order
    fct_attribution_touches → channel and campaign paths
    dim_customers → identity, cohort, and consent state
Efficiency

Cost optimization

  • Use serverless SQL warehouses when your team wants fast cold starts and less operational tuning. Use provisioned warehouses when workload isolation, predictable schedules, or existing reserved capacity matter more.
  • Keep Photon enabled for Databricks SQL workloads. TrackLayer event queries usually scan timestamps, strings, structs, and numeric revenue fields that benefit from vectorized execution.
  • Write queries and dbt models with event_time, merchant_id, platform, and event_name filters so Delta can prune files instead of scanning the full event history.
  • For high-volume tables, test liquid clustering on event_time plus merchant_id or event_name before maintaining separate Z-order jobs. Revisit clustering keys when access patterns change.
Diagnostics

Troubleshooting

PERMISSION_DENIED on catalog or schema

Confirm the service principal has USE CATALOG on tracklayer, USE SCHEMA on tracklayer.events, and CREATE TABLE on the events schema. Check that grants were applied to the principal used by the token.

Warehouse HTTP path is invalid

Copy the HTTP path from the Databricks SQL warehouse connection details, not the browser URL. Restart the TrackLayer connection test after updating the destination.

Delta table was created in the wrong namespace

Check the catalog and schema fields in TrackLayer. If a default catalog was used accidentally, pause syncs, update the destination, and move or replay the initial data into tracklayer.events.

Queries are slow after several weeks of data

Filter by event_time and merchant_id, compact small files, enable Photon, and evaluate liquid clustering or OPTIMIZE ZORDER BY for the fields used by dashboards and dbt models.

FAQ

Common questions

Does TrackLayer require Unity Catalog?

Use Unity Catalog for production exports. It gives your team centralized grants, lineage, storage governance, and a cleaner path to share event data with analytics, ML, and finance teams.

Can TrackLayer write through Delta Live Tables?

TrackLayer usually writes to Databricks through a SQL warehouse. Teams that require DLT can land data in a raw Delta table first, then let a DLT pipeline validate, deduplicate, and publish curated tables.

Which table name should analysts use?

Start with tracklayer.events.events for raw canonical events. Most teams then expose dbt models or curated Databricks views for orders, attribution touches, cohorts, and campaign reporting.

How should historical replays be handled?

Use event_id as the stable deduplication key and preserve loaded_at for ingestion audits. Delta MERGE works well when you need replayed events to update a curated table instead of appending duplicates.

Can ML teams use the same data?

Yes. Feature engineering notebooks, Databricks workflows, and MLflow experiments can train directly from TrackLayer Delta tables or from curated feature tables built on top of the raw export.

Next reads

Related implementation guides

We use essential cookies to keep the site secure and functional. Analytics and third-party tags run only with your consent. See our Cookie Policy.

We use essential cookies to keep the site secure and functional. Analytics and third-party tags run only with your consent. See our Cookie Policy.