Growth · November 12, 2024

Part Three: Zero-Copy as a Data Strategy—The Role of Data Quality and Governance

This post is part of the CDP 2.0: Why Zero-Waste Is Now series. For the best experience, we recommend starting with the Introduction and reading each chapter in sequence.

"Zero-copy" arguments

In the last couple of years, CDP evaluations led by data and IT teams have introduced a new feature requirement: "zero-copy" deployments. "Zero-copy" has existed for many years as a software concept, referring to programming techniques that improve system performance by eliminating in-memory data duplication. The "zero-copy" CDP requirement grew out of CDW vendors promoting their data sharing capabilities, which (in theory) eliminate the need for data duplication between systems.
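For readers unfamiliar with the original software sense of the term, here is a minimal illustrative sketch in Python: a `memoryview` exposes a slice of a buffer without duplicating any bytes, whereas `bytes()` materializes a copy. (The payload value is purely illustrative.)

```python
# Zero-copy in the original software sense: share the same underlying
# bytes rather than duplicating them.
data = bytearray(b"customer-event-payload")

copied = bytes(data)           # materializes a second copy of the bytes
view = memoryview(data)[9:14]  # zero-copy slice: no bytes are duplicated

# The view reflects later changes to the underlying buffer; the copy does not.
data[9:14] = b"EVENT"
print(view.tobytes())  # b'EVENT'
print(copied[9:14])    # b'event'
```

The same trade-off appears at system scale: a shared view is cheap and always current, while a copy is independent but can drift out of date.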

Warehouse-native CDPs were quick to advocate for "zero-copy" deployments in the most literal sense: CDPs should not copy data at all. Their CDW overlay architecture made this their default deployment. They also positioned "packaged CDPs" as "non-zero-copy," since those CDPs copy a subset of customer data to their own data stores to perform and optimize CDP compute tasks.

This literal interpretation of "zero-copy" led to considerable confusion, as it quickly became apparent that the "zero-copy" composable CDPs were themselves making offsite copies of data for performance and latency reasons. This produced some humorous dueling blog posts over which zero-copy CDP vendor was more "truly" zero-copy. Unfortunately, it did not lead to public discussion of what "zero-copy" actually represents as an enterprise data strategy, nor of the legitimate, and often desirable, strategy of running CDP compute in optimized data stores rather than in customers' CDW environments.

We are not going to repeat this mistake here. In order to understand CDPs, we have to consider the larger enterprise data picture.

"Zero-copy" as a data strategy

The literal "zero-data-copy" interpretation, in which all data duplication must be eliminated, is unhelpful and impractical when considering enterprise data strategies. However, "zero-copy" has become a term of art for the general approach of increasing the quality and governance of data within the enterprise. The term is used because technologists universally agree that eliminating unneeded data duplication improves both the quality and governance of data. Again, this should not be confused with "never make data copies."

As data has become a core strategic asset, technically savvy companies have launched strategic initiatives to improve the quality and governance of their data. To clarify these terms: "governance" usually refers to the security, privacy, and access aspects of data, while "quality" refers to its consistency, timeliness, and trustworthiness. While there are various approaches to improving quality and governance, most current efforts focus on making the CDW the proverbial "single source of truth." By placing the CDW at the center of the data architecture, IT organizations can constrain the problem space of quality and governance to a central system. The "zero-copy" features of CDWs then enable data sharing within this highly managed architecture, as the CDW controls both governance and quality at all times (in theory).

To any experienced technologist, this strategy is highly rational. It is vastly preferable to managing a network of data platforms where data discrepancies and governance irregularities arise with regularity. Consequently, the positioning of warehouse-native CDPs as "zero-copy" platforms is very attractive to IT teams responsible for governance and quality efforts. In effect, warehouse-native CDPs are viewed as supporting CDW-centered governance and quality investments.

Unfortunately, warehouse-native CDPs' apparent fitness for "zero-copy" data strategies was embellished, via product marketing, into a narrative that alternative solutions are inherently antagonistic to data quality and governance efforts. This perception was unsurprisingly invoked and reinforced by the CDW vendors themselves, as CDPs have historically moved compute tasks away from the CDW, directly reducing potential CDW revenue.

While there are valid architectural questions about all CDPs and zero-copy efforts, positioning CDPs with optimized data copy capabilities as antithetical to governance and quality efforts is driven more by marketplace incentives than by serious technical evaluation. While these dynamics have fostered some skepticism toward zero-copy strategies overall, an honest and rational evaluation reveals several valid approaches to supporting data governance and quality efforts, each with its own virtues.

The role of data quality and governance

Rather than an all-or-nothing discussion that divides CDPs into those that support data quality and governance and those that do not, a more useful framing is how CDPs support the spirit of "zero-copy" today and going forward.

The biggest data quality concern for most IT teams evaluating a CDP is data consistency: ensuring data in the CDP remains synchronized with the CDW. All of the major CDP players understand this is a major need and pain point for CDP customers. As CDW overlays, warehouse-native CDPs offer convenient data storage optimization and require no synchronization mechanisms, assuming the duplication they do perform is unproblematic. For all other types of CDPs, this is an area of active development; most support or are developing bi-directional CDW synchronization capabilities specifically to ensure data consistency. While it is an open question how much benefit there is in mimicking "zero-copy," it is certain the feature gap between CDP types will close in the near future.
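One common way teams verify this kind of consistency, independent of any vendor's sync mechanism, is to compare a row count and an order-insensitive content fingerprint of a table in both systems. The sketch below is a hedged illustration; all names are hypothetical and not any vendor's actual API.

```python
# Hypothetical consistency check between a CDW table and a CDP's local copy:
# compare row counts plus an order-insensitive fingerprint (XOR of per-row
# hashes), so the check passes even if the two systems return rows in
# different orders.
import hashlib

def table_fingerprint(rows):
    """Return (row_count, digest) for a list of row dicts."""
    digest = 0
    for row in rows:
        row_hash = hashlib.sha256(repr(sorted(row.items())).encode()).digest()
        digest ^= int.from_bytes(row_hash[:8], "big")  # XOR is order-insensitive
    return len(rows), digest

# Illustrative data: same rows, different ordering, as two systems might return.
cdw_rows = [{"id": 1, "email": "a@x.com"}, {"id": 2, "email": "b@x.com"}]
cdp_rows = [{"id": 2, "email": "b@x.com"}, {"id": 1, "email": "a@x.com"}]

print(table_fingerprint(cdw_rows) == table_fingerprint(cdp_rows))  # True
```

In practice the same comparison would be pushed down as SQL aggregates on each side rather than pulling rows out, but the principle is identical.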

It should also be noted that data quality efforts are not limited to CDW centralization. While centralization is the dominant focus currently, improving quality at the point of data capture is another important tactic. In this area, other CDPs often outperform the offerings of warehouse-native CDP vendors. A universal statement about which class is better is not practical, as some CDPs handle particular data sources and capture types better than others. Data quality is also addressed outside the CDW through how well a CDP can perform transformations at the edge to support downstream data quality. In most cases, CDPs with a strong vertical focus will do a much better job for their target verticals, with vertical-specific SDKs, higher-quality and deeper partner integrations, and process experience.

Governance considerations are a bit more complicated. While CDW-centric CDPs inherit the underlying virtues of their CDW environment, events of the last year have shown this does not magically confer better security, access, and privacy outcomes. Rigorous policy and operational discipline remain the main drivers of governance success. Composable CDPs neither improve nor degrade their governance context. For packaged CDPs, the main governance challenge is localization of data processing, which may be required for regulatory reasons. Assuming they meet those requirements and maintain a rigorous governance profile, they are no more or less effective at supporting the underlying CDW governance than composable CDPs.

While much ink gets spilled about "copying data," for all practical purposes in the CDP space this is a minor issue. All of the major CDP vendors, including the warehouse-native CDPs, duplicate data from CDWs for performance and latency reasons. Moving data into fit-for-purpose architectures as a primary optimization strategy is a well-known and well-understood technical principle. While duplication may introduce complexity, a rational trade-off does not run counter to quality and governance efforts. Furthermore, the amount of data duplication that marketer-centric CDPs perform is often overstated; many of their capabilities could run in zero-copy fashion but do not, for performance reasons.

Longer term, the different approaches have very different paths to lakehouse or data mesh deployments. Some marketer-centric CDPs have highly optimized compute engines built specifically for CDP workloads, especially low-latency applications. Once these CDPs build Delta Lake and/or Iceberg facilities, they will be well positioned to deploy into lakehouse architectures.

Overall, these observations lead us to conclude that CDPs are converging on a similar data quality and governance end state. Warehouse-native CDPs are certainly well positioned to inherit the data quality and governance context they are deployed into, but they do not fundamentally improve or degrade either. Marketer-centric CDPs are not as well positioned on the quality front, but they are rapidly improving and have their own set of advantages. From a governance perspective, established CDPs have no net architectural effect.
