Warehouse-Native CDPs: Customer Data That Stays in Snowflake

A warehouse-native CDP runs without copying data into a vendor database. Here is how that changes your stack, what you gain, and where the model still bites.

Open the architecture diagram of a traditional customer data platform and follow the arrows. Web events, app events, purchase history, support tickets, loyalty records: every one of them flows out of a source system, through a pipeline, and into a database the CDP vendor owns and operates. Your customer data lives there now, in a second home separate from your warehouse, governed by someone else's rules and billed on someone else's terms.

The warehouse-native CDP looks at that diagram and asks an awkward question. If the same customer data already sits in Snowflake, BigQuery, or Databricks, fully governed and queryable, why copy it into a vendor's database at all? Why not run the CDP's work (identity resolution, segmentation, activation) on the data where it already lives?

That question has reshaped the CDP market. The CDP Institute reports that more than one in four CDPs now support a warehouse-centric architecture, and that composable vendors grew employment at roughly six times the industry rate. In the 2026 Gartner Magic Quadrant for Customer Data Platforms, Hightouch, a composable vendor that started as a reverse ETL tool, landed in the Leader quadrant on its first appearance. This piece explains the architecture underneath the trend: the old copy-everything model and what was wrong with it, what zero-copy and reverse ETL actually do, who builds these tools, and the limits the marketing decks tend to skip.

Where the copy-everything model came from

The first CDPs were not being lazy when they built their own databases. They were solving a real problem with the tools of their time.

Around 2013, when the category got its name, the cloud data warehouse barely existed as a serious option. Amazon Redshift had launched the same year. Snowflake would not arrive until 2014. Most companies kept customer data scattered across an email tool, an analytics tool, a CRM, and a pile of spreadsheets, with no central, fast, queryable store. A marketer who wanted a unified profile had nowhere to build one.

So the CDP became that store. It shipped with SDKs to capture web and app events, connectors to pull data from other systems, an identity graph to stitch records into profiles, and a database to hold all of it. The pitch was simple: send us everything, we will unify it, and you get one profile per customer that any downstream tool can use. For a marketing team with no data engineers, a CDP that owned its own storage was the only thing that worked.

The copy was a feature. It became a problem only when the ground shifted underneath it.

What went wrong with the copy

By the early 2020s the cloud data warehouse had won. Snowflake, BigQuery, Databricks, and Redshift became the place where companies put everything they knew, customer data included. Once the warehouse held a complete, governed copy, the CDP's separate copy stopped looking like a feature and started looking like a liability. Four problems stand out.

Duplication. The same customer record now existed in at least two places, the warehouse and the CDP, and usually more. Someone has to keep them aligned, and someone has to pay to store both.

Staleness. Pipelines run on a schedule. The moment customer data is copied into the CDP, it begins drifting from the warehouse version. A churn score recalculated overnight in the warehouse does not reach the CDP until the next sync. Marketing ends up acting on a customer who no longer matches the one the business sees.

Governance. This is the sharp one. Data was extracted into the CDP vendor's database, then copied again into activation tools: the email platform, the ad networks, the CRM. Every copy is a new place to secure, a new surface for a breach, and a new system to chase when a customer files a GDPR deletion request. Delete the record in the warehouse and it can still be sitting in the CDP and three downstream tools. Privacy law assumes you know where personal data is. Scatter enough copies and you no longer do.

Lock-in. When the vendor's platform holds the canonical profile, the identity graph, and the segment definitions, leaving is genuinely hard. Your customer intelligence is entangled with their product. That is good for the vendor's retention numbers and bad for your negotiating position.

None of this made traditional CDPs useless. Plenty of teams run them well. But it created an opening, and a clear question: what if the data never left the warehouse in the first place?

Zero-copy: what it actually means

"Zero-copy" is the phrase the market reached for, and it is doing a lot of work. It is best understood as an umbrella term for techniques that minimize persistent copies of customer data and keep the canonical version in storage your own data team controls. Underneath it sit two genuinely different mechanisms, and a careful buyer keeps them apart.

The first is federated query, sometimes called live query. The CDP does not hold customer data. When it needs data for an operation, such as building an audience or checking a profile attribute, it issues a SQL query straight to the warehouse and reads the answer back. Google's BigQuery documentation describes federated queries as a way to read from an external source while the data stays in its original location. Salesforce uses the same approach for external data in Data 360: a live query with query pushdown, so the warehouse does the computation and no copy is persisted in Salesforce.
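A minimal sketch of the live-query pattern, assuming an in-memory SQLite database as a stand-in for the warehouse. The `customers` table, its columns, and the `build_audience` function are hypothetical, made up for illustration. The point is the shape: the filter travels to the data as SQL, and only the matching ids travel back.

```python
import sqlite3

# Stand-in for the warehouse. In a real deployment this would be a
# Snowflake, BigQuery, or Databricks connection, not SQLite.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE customers ("
    "customer_id INTEGER PRIMARY KEY, email TEXT, "
    "lifetime_value REAL, churn_score REAL)"
)
warehouse.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        (1, "ana@example.com", 1200.0, 0.10),
        (2, "ben@example.com", 90.0, 0.85),
        (3, "cy@example.com", 640.0, 0.72),
    ],
)


def build_audience(conn, min_ltv, min_churn):
    """Push the filter down as SQL; the warehouse does the computation."""
    sql = (
        "SELECT customer_id FROM customers "
        "WHERE lifetime_value >= ? AND churn_score >= ? "
        "ORDER BY customer_id"
    )
    # Only the ids of matching customers are read back; nothing is persisted.
    return [row[0] for row in conn.execute(sql, (min_ltv, min_churn))]


# High-value customers at risk of churning, computed where the data lives.
at_risk = build_audience(warehouse, min_ltv=500.0, min_churn=0.5)
print(at_risk)  # [3]
```

The CDP layer here holds an audience definition and a connection, not a customer table. Change a churn score in the warehouse and the next call sees it, with no sync in between.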

The second is data sharing, built on open table formats. Snowflake, BigQuery, and Databricks can expose tables to another account or tool without copying the underlying files. The mechanics involve formats like Apache Iceberg and protocols like Databricks Delta Sharing: the data sits once in cloud storage, and access is granted by reference. Amperity's Bridge uses exactly this, sharing open table formats across Snowflake, BigQuery, and Databricks, so a customer running more than one warehouse gets a unified view without duplicating anything. Two parties read the same physical files. Neither makes a copy.

Both deliver the outcome the warehouse-native pitch promises: the raw customer data stays in the governed environment, and the CDP becomes a logic layer on top rather than a second database. Governance rules set in the warehouse (GDPR deletions, access controls, lineage) apply to everything built above it, because there is nothing built beside it to escape them.
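To make the governance point concrete, here is a small sketch, again using SQLite as a stand-in warehouse. The table, the `premium_segment` view, and the helper function are hypothetical. The segment is defined as a view over the source table, so a deletion handled once at the source is reflected in the segment without a second system to chase.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, email TEXT, plan TEXT)"
)
warehouse.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "ana@example.com", "premium"),
        (2, "ben@example.com", "premium"),
        (3, "cy@example.com", "free"),
    ],
)

# The segment is logic layered on the table, not a separate copy of the rows.
warehouse.execute(
    "CREATE VIEW premium_segment AS "
    "SELECT customer_id, email FROM customers WHERE plan = 'premium'"
)


def segment_ids(conn):
    """Return the current members of the segment, read live from the view."""
    rows = conn.execute("SELECT customer_id FROM premium_segment ORDER BY customer_id")
    return [row[0] for row in rows]


before = segment_ids(warehouse)

# An erasure request is handled once, at the source table.
warehouse.execute("DELETE FROM customers WHERE customer_id = ?", (2,))

after = segment_ids(warehouse)
print(before, after)  # [1, 2] [1]
```

This covers only what is computed inside the warehouse. Audiences that have already been delivered to other tools are a separate matter.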

Worth flagging, because vendors blur it: zero-copy is not the same as zero-ETL. Zero-copy restricts data movement. Zero-ETL still moves data, but the vendor manages the pipeline so you do not build it. Pin down which one a demo is actually showing you.

Reverse ETL: getting data out without owning it

Federated query and data sharing explain how a warehouse-native CDP reads and computes. They do not explain how a segment built in the warehouse reaches the email tool or the ad platform, because Klaviyo, Meta, and Google do not run live queries against your Snowflake instance. Something has to deliver the audience. That something is reverse ETL.

Traditional ETL extracts data from operational systems and loads it into the warehouse for analysis. Reverse ETL runs the loop backwards. It takes modeled data already in the warehouse (a segment, a set of traits, a propensity score) and pushes it out to the operational tools that act: the CRM, the email service provider, the ad networks, the support desk. The term is recent, popularized around 2020 to 2021 by Census and Hightouch as the cloud warehouse became standard. Both companies were built on the idea that the warehouse, not a CDP, should be the source of truth.

The crucial detail for this architecture: a reverse ETL tool does not store customer data. It is a pipeline, not a database. It queries the warehouse on a schedule or trigger, transforms the result to fit the destination, and syncs it onward. Federated query and data sharing keep the data home for computation; reverse ETL delivers the results outward without the tool ever becoming a place customer data lives.
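The query, transform, sync loop can be sketched in a few lines. This is an illustration under stated assumptions, not any vendor's implementation: SQLite stands in for the warehouse, `FakeEmailTool` is a hypothetical destination that a real pipeline would reach over an HTTP API, and the table, payload shape, and 0.8 banding threshold are invented for the example.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE customer_traits (email TEXT PRIMARY KEY, first_name TEXT, propensity REAL)"
)
warehouse.executemany(
    "INSERT INTO customer_traits VALUES (?, ?, ?)",
    [
        ("ana@example.com", "Ana", 0.91),
        ("ben@example.com", "Ben", 0.62),
        ("cy@example.com", "Cy", 0.20),
    ],
)


def extract(conn, min_propensity):
    """Query the modeled data in the warehouse."""
    sql = (
        "SELECT email, first_name, propensity FROM customer_traits "
        "WHERE propensity >= ? ORDER BY email"
    )
    return conn.execute(sql, (min_propensity,)).fetchall()


def transform(rows):
    """Reshape warehouse rows into the (hypothetical) destination schema."""
    return [
        {
            "email": email,
            "properties": {
                "first_name": name,
                "propensity_band": "high" if score >= 0.8 else "medium",
            },
        }
        for email, name, score in rows
    ]


class FakeEmailTool:
    """Hypothetical destination; stands in for an email platform's API."""

    def __init__(self):
        self.contacts = {}

    def upsert(self, payload):
        self.contacts[payload["email"]] = payload["properties"]


def run_sync(conn, destination, min_propensity=0.5):
    """One sync run: query, transform, deliver. Nothing is kept afterwards."""
    payloads = transform(extract(conn, min_propensity))
    for payload in payloads:
        destination.upsert(payload)
    return len(payloads)


email_tool = FakeEmailTool()
synced = run_sync(warehouse, email_tool)
print(synced, sorted(email_tool.contacts))
```

`run_sync` keeps no state between runs; a scheduler or trigger would simply call it again. The destination, on the other hand, now holds two email addresses.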

Hightouch is the clearest example. It connects to a warehouse, runs SQL or visual audience definitions against it, and syncs the output to a couple of hundred destinations. The company says it moves trillions of records from warehouses to destinations without copying them into its own store. RudderStack takes a similar warehouse-first line, explicitly holding no customer data of its own.

Here is the honest tension in the model. Federated query and data sharing genuinely avoid copies. Reverse ETL, by its nature, copies data into every downstream tool on every sync. The moment a segment lands in your email platform, that platform holds personally identifiable information. "Your data never leaves the warehouse" is accurate for storage and computation. It is not accurate for activation, and it never could be, because activation means handing data to a system that acts on it. The model shrinks the number of copies and centralizes the canonical one. It does not get the count to zero.

The vendors, and the packaged response

The warehouse-native side splits into a few groups. Pure composable players build on reverse ETL: Hightouch, the category's most visible name, and Census, which Fivetran acquired in May 2025 to pair ingestion with activation. RudderStack and Snowplow come at it from a developer-first, event-pipeline angle, warehouse-first by design. GrowthLoop, Simon Data, and others occupy the activation and orchestration space around the same idea.

The signal worth weighing is money and analyst recognition. Hightouch raised a 150 million dollar Series D in April 2026 at a 2.75 billion dollar valuation, led by Goldman Sachs Growth and Bain Capital Ventures. Gartner placed it in the Leader quadrant the same year, a notable jump for a vendor that began life as a reverse ETL utility. Warehouse-native has moved from fringe argument to a position the analysts now treat as central.

The packaged vendors did not stand still. Salesforce launched its Zero Copy Partner Network in April 2024 with AWS, Databricks, Google Cloud, and Snowflake, later adding Microsoft. Its CDP, renamed Data 360, can query and activate data sitting in Snowflake or BigQuery without persisting a copy, and share enriched profiles back the same way. Adobe and others have moved in the same direction. The line between "packaged CDP" and "warehouse-native CDP" is genuinely blurring. Both camps now agree the warehouse should be the source of truth. They disagree mainly on how much of the CDP's logic and interface should sit inside the vendor's platform versus the customer's warehouse.

The honest limits

Warehouse-native solves real problems. It also moves problems around rather than deleting them, and three are worth naming plainly.

It still needs engineering. A packaged CDP ships opinionated, pre-built identity resolution, data models, and a marketer-friendly interface. Warehouse-native hands you flexibility and the bill that comes with it: someone has to model the data, write the SQL or dbt logic, and maintain the pipelines. Industry pricing analyses estimate composable stacks often need three to five dedicated engineers, an annual cost that rarely shows up in a side-by-side licensing comparison. The architecture suits teams that already have data engineers. It punishes teams that do not.

Real-time is harder. Warehouses are built for analytical queries over large tables, not for sub-second lookups on a single profile. Reverse ETL syncs run on schedules, and tightening the schedule gets expensive fast (more on that below). For use cases where minutes of latency are fine, this is a non-issue: in a cart abandonment flow, for example, you cannot tell a cart is truly abandoned until the session ends anyway. For genuine millisecond decisioning at the point of interaction, warehouse-native architectures struggle, and critics point this out fairly.

The warehouse meters compute. This is the cost nobody puts on the slide. Querying large warehouse tables on every audience build or sync consumes compute, and the warehouse bills for it. The numbers escalate sharply with frequency: industry analysis suggests moving audience refreshes from daily to hourly can raise compute cost on the order of 25 times, and daily to five-minute syncs by 50 times or more. The CDP vendor's invoice may look smaller. The warehouse invoice quietly absorbs the difference, and compute-heavy AI decisioning will push it higher. Warehouse-native does not remove the cost of the CDP. It relocates it from a license line to a compute line.
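Part of that escalation is simple arithmetic. Here is a rough sketch, assuming cost scales with the number of runs and that each run costs a fixed fraction of a full refresh. The function name and the `cost_fraction_per_run` parameter are illustrative and do not come from any warehouse's pricing model.

```python
# Number of sync runs per day for each schedule.
RUNS_PER_DAY = {
    "daily": 1,
    "hourly": 24,
    "every_5_min": 24 * 60 // 5,  # 288 runs
}


def relative_compute_cost(schedule, cost_fraction_per_run=1.0):
    """Compute cost relative to one full daily refresh.

    cost_fraction_per_run is an assumption: 1.0 means every run rescans
    everything; smaller values model incremental runs that touch less data.
    """
    return RUNS_PER_DAY[schedule] * cost_fraction_per_run


hourly_full = relative_compute_cost("hourly")
five_min_full = relative_compute_cost("every_5_min")
five_min_incremental = relative_compute_cost("every_5_min", cost_fraction_per_run=0.2)

print(f"hourly, full refresh: {hourly_full:.0f}x")
print(f"5-minute, full refresh: {five_min_full:.0f}x")
print(f"5-minute, incremental (assumed 20% per run): {five_min_incremental:.1f}x")
```

Under full refreshes, hourly runs cost 24 times the daily baseline, close to the 25 times cited, and five-minute runs cost 288 times. Incremental runs pull that down, which fits a figure of 50 times or more; the 20 percent used here is purely illustrative.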

Where this is heading

The direction is set. Zero-copy is becoming the default enterprise expectation, not a differentiator, and the packaged-versus-composable distinction is dissolving into a spectrum as both sides converge on the warehouse as the source of truth.

The next force pushing the same way is AI agents. Gartner frames the CDP's near future as a split between platformization and agentification, with the agentic path describing the stack as "warehouse plus CDP plus agents" rather than the warehouse plus dozens of separate applications. An AI agent that reads customer profiles and acts on them needs governed, current, single-source data to reason over. Agents are also less forgiving than people: they act fast and on exactly what the data says. A stale copy a human marketer might have caught becomes an agent's wrong decision, executed at speed. That raises the value of an architecture where the data the agent reads is the data the business governs, with no copy drifting in between.

The realistic near-term picture is not warehouse-native replacing packaged CDPs outright. It is the copy quietly disappearing from the architecture diagram across the whole category, while the genuine work (identity resolution, data modeling, governance, the compute bill) stays as hard as it always was. It just moves into a building you already own.

Council summary

This post argues that the warehouse-native CDP removes the vendor-owned copy of customer data, turning the CDP into a logic layer over Snowflake, BigQuery, or Databricks rather than a second database. It explains the three mechanisms underneath (federated query, data sharing on open table formats, and reverse ETL) and stays honest that activation still copies data into every downstream tool. The council checked every figure against primary sources, including the CDP Institute numbers, Hightouch's Series D and 2026 Gartner Leader placement, the Fivetran acquisition of Census, and the 25 times and 50 times compute multipliers; all held, and one phrase was aligned to the CDP Institute's exact wording. The buyer takeaway: warehouse-native cuts copies and lock-in, but it moves cost from a license line to a compute line and still demands real data engineering.
