The Canonical Data Model - can one size fit all?

22.10.15 Steve Miller

Access to real-time data is rapidly becoming one of the most discussed – and most challenging – items on the agenda for CIOs and enterprise architecture teams. The potential rewards are substantial, and the companies that win will be those that find a way to unlock the pieces of data isolated from each other in independent silos and turn them into integrated information that can be accessed in real time to improve information delivery, data mining, business intelligence and decision making.

Evolving architectures

One fundamental barrier is that most banks and large enterprises rely on architectures that have evolved over time. In fact, ‘architecture’ may not even be the right term, as it implies a level of intent and design that is often wholly absent. In a sense these landscapes are more like application ecosystems, consisting of a patchwork of interdependent systems that provide the underlying services the company relies upon to conduct its day-to-day business.

Complex webs of dependency between components typically preclude wholesale architectural overhaul, and the picture is further complicated by a mixture of in-house and vendor-supplied software running on multiple technology platforms.

When these systems need to share data with each other, point-to-point integration can seem like the easiest and least expensive way of achieving the desired outcome – and when integrating just a few applications, it often is. But, as we all know very well by now, it doesn’t scale: with hundreds of applications, each exposing multiple interfaces and developed on incompatible data models, the number of mappings to build and maintain grows roughly with the square of the number of applications (n applications can require up to n × (n − 1) one-way mappings). The point is, no one would deliberately design an architecture like this, and yet the picture we are painting, or some variation on that theme, exists at just about every major institution I have ever worked with.

The pain of anti-patterns

The particular issue for us to focus on is that across the multiple data sources in an enterprise, there are significant differences in how data about essentially the “same things” is structured and represented. For example, when one application calls a “customer” a “client”, another models it as an “account” and a third refers to it as a “consumer”, it becomes difficult to share information efficiently across the business, and communication between these diverse applications can begin to feel like unravelling a ball of tangled wires. The traditional response has been to stick an integration broker or some other such monolith in the middle of everything and create a multitude of point-to-point mappings.
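
To make the divergence concrete, the sketch below (in Java, with entirely invented type and field names) shows how three applications might each model the same underlying party. None of these shapes is wrong for its own application; the trouble only starts when the applications have to exchange data.

```java
// Hypothetical illustration of how three applications might model the same party.
// None of these types come from a real system; names and fields are invented.

// CRM system: thinks in terms of "clients" with free-text contact details.
record CrmClient(String clientRef, String fullName, String email) {}

// Core banking ledger: thinks in terms of "accounts" keyed by account number.
record LedgerAccount(long accountNumber, String holderSurname, String holderForename) {}

// Marketing platform: thinks in terms of "consumers" segmented by region.
record MarketingConsumer(String consumerId, String displayName, String regionCode) {}

public class SamePartyThreeWays {
    public static void main(String[] args) {
        // The same person, represented three incompatible ways:
        var asClient   = new CrmClient("C-10042", "Ada Lovelace", "ada@example.com");
        var asAccount  = new LedgerAccount(88123456L, "Lovelace", "Ada");
        var asConsumer = new MarketingConsumer("MKT-7731", "Ada Lovelace", "UK");

        System.out.println(asClient + "\n" + asAccount + "\n" + asConsumer);
    }
}
```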

As complexity increases with new components and version changes, the costs of maintaining these systems, and the tangled wires which connect them, begin to soar. Achieving real-time data (or even just low-latency data) becomes impossible, and robustness, stability and scalability are sacrificed.

Towards the canonical data model

And so the conundrum: all large organizations have many applications that were developed on different data models and data definitions, yet those applications all need to share data.

This gives rise to the argument for a canonical data model. With a canonical approach, each application translates its data into a common model understandable to all applications - a loosely coupled pattern that in theory goes some way towards minimizing the impact of change. If system A wants to send data to system B, it translates the data into an agreed-upon standard syntax (canonical or common format) without regard to the data structures, syntax or protocol of system B. When system B receives the data, it translates the canonical format into its own internal structures and format. So far so good, in theory.
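
As a purely illustrative sketch of the pattern – no particular product or standard is implied, and all type and field names are hypothetical – system A maps its internal shape to a shared canonical type and system B maps from that canonical type to its own, with neither adapter knowing anything about the other system:

```java
// Minimal sketch of the canonical pattern. All type and field names are
// hypothetical; each system only maps between its own internal shape and an
// agreed-upon canonical shape.

record CanonicalParty(String partyId, String familyName, String givenName) {}

// System A's internal representation and its outbound adapter.
record CrmClient(String clientRef, String fullName) {}

class CrmAdapter {
    // A -> canonical: needs no knowledge of system B at all.
    static CanonicalParty toCanonical(CrmClient c) {
        String[] parts = c.fullName().split(" ", 2);
        String givenName  = parts[0];
        String familyName = parts.length > 1 ? parts[1] : "";
        return new CanonicalParty(c.clientRef(), familyName, givenName);
    }
}

// System B's internal representation and its inbound adapter.
record LedgerAccount(String accountId, String holderSurname, String holderForename) {}

class LedgerAdapter {
    // canonical -> B: needs no knowledge of system A at all.
    static LedgerAccount fromCanonical(CanonicalParty p) {
        return new LedgerAccount(p.partyId(), p.familyName(), p.givenName());
    }
}

public class CanonicalHandOff {
    public static void main(String[] args) {
        CrmClient source = new CrmClient("C-10042", "Ada Lovelace");
        LedgerAccount target = LedgerAdapter.fromCanonical(CrmAdapter.toCanonical(source));
        System.out.println(target);
    }
}
```

Adding a new system then means writing one pair of adapters against the canonical type, rather than a mapping to every existing peer.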

Holy grail or poisoned chalice?

But designing and implementing an enterprise-wide canonical data model can require herculean effort. Creating a data model that represents not only the interactions between systems but also the internals of those systems, and then implementing such a model, can only be done as part of an enterprise-wide project involving conversations with stakeholders from across the business to understand how all of the systems work. Given that such projects often have little visible impact for end users in terms of additional features and functionality, the cost can also be difficult to justify.

Even then, the resulting model can be too complex, too difficult to understand and implement, and can lead to information exchanges which are inefficient and memory-hungry.

There can also be problems of enforcement and adoption. Unless the canonical model is properly curated, with a coherent change and extension policy, it quickly fragments and becomes polluted with uncontrolled variants as individual lines of business, unconcerned with the strategic architectural vision, code around it to get their tactical jobs done.

Given the cost implications and the high level of stakeholder buy-in required to implement a truly canonical model, the resultant architecture is often a fondue, melting together several different standards while truly conforming to none of them. This in turn creates new proprietary models which may go some of the way towards easing the pain caused by the complexity of multi-point integrations, but which also create new problems going forward.

Simple tools for complex problems

In a perfect world, every business unit would create, transmit and consume data in exactly the same language, using the same semantics and taxonomy as part of a globally understood and shared model. However, this is essentially academic, because we all know it will never happen in reality. Standards practitioners have long realized that how you cope with change and variation is at least as important as getting the base standard right in the first place, if not more so.

As a side note – in the financial services world, this has led to the creation and, increasingly, the adoption of ISO 20022 as a common base standard for the exchange of information relating to all manner of payments and securities business flows. ISO 20022 is first and foremost a business model, from which concrete messaging standards and syntaxes are derived, rather than being a blunt and naïve ‘standard format’ as previous attempts have been. It also has a community governance model, and it acknowledges that one size can never fit all by providing robust mechanisms for controlling extension and variation.

The canonical model is ultimately only one part of the solution, though, and it doesn’t really matter whether that is based on ISO 20022, FpML, FIX or something else. The crunch point is the hand-off between System A, the canonical model, and System B, where A and B have (potentially) totally orthogonal views of the data involved in the exchange. Taking a basic financial transaction flow as an example, the trading view of the transaction is always going to be different to how it is viewed by the middle office, settlements and accounting. Risk and regulatory reporting views are highly likely to be different yet again.
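
As an illustration of just how little those views may overlap, here is a minimal sketch (with invented field names and values) of the same trade as the front office and the settlements function might each hold it:

```java
// Purely illustrative: the same trade as seen by two different functions.
// All field names and values are invented.

import java.math.BigDecimal;
import java.time.LocalDate;

// Front-office (trading) view: instrument, price, quantity and counterparty.
record TradingViewTrade(String tradeId, String instrumentId, BigDecimal price,
                        long quantity, String counterparty) {}

// Settlements view: value date, settlement amount and safekeeping account.
record SettlementViewTrade(String tradeId, LocalDate valueDate,
                           BigDecimal settlementAmount, String safekeepingAccount) {}

public class OrthogonalViews {
    public static void main(String[] args) {
        var front = new TradingViewTrade("T-9001", "XS0000000001",
                new BigDecimal("27.435"), 10_000, "BANKGB2L");

        // The settlements view shares little more than the trade identifier;
        // its amount is derived and its other fields come from elsewhere.
        var back = new SettlementViewTrade("T-9001", LocalDate.now().plusDays(2),
                front.price().multiply(BigDecimal.valueOf(front.quantity())),
                "SAFE-4471");

        System.out.println(front + "\n" + back);
    }
}
```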

C24 Studio provides enterprises with the tools to solve complex problems. With C24, you can (of course) build models to represent the structure, syntax and semantics of each application’s view of the world. You can (of course) also define the rules for transforming between them. Crucially, though, you can also choose instead to define a single model that is capable of presenting different interfaces to each application as required, meaning that each part of the business can access the data it needs (and provide only the data for which it is responsible) without the time and maintenance expense of full-blown transformation of data at the point of transmission and the point of consumption.
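
To be clear, the sketch below is not the C24 Studio API; it is a plain-Java illustration, with hypothetical names, of the underlying idea of a single model presenting different interfaces to different consumers:

```java
// This is NOT the C24 Studio API - it is a plain-Java illustration, with
// hypothetical names, of one model exposing different interfaces to
// different consumers.

import java.math.BigDecimal;
import java.time.LocalDate;

// What the front office is entitled to see.
interface TradingView {
    String tradeId();
    BigDecimal price();
    long quantity();
}

// What settlements is entitled to see.
interface SettlementView {
    String tradeId();
    LocalDate valueDate();
    BigDecimal settlementAmount();
}

// A single underlying model satisfies both projections, so no transformation
// is needed at the point of transmission or the point of consumption.
record Trade(String tradeId, BigDecimal price, long quantity, LocalDate valueDate)
        implements TradingView, SettlementView {

    @Override
    public BigDecimal settlementAmount() {
        // Derived field exposed only through the settlements interface.
        return price.multiply(BigDecimal.valueOf(quantity));
    }
}

public class SingleModelManyFaces {
    public static void main(String[] args) {
        Trade trade = new Trade("T-9001", new BigDecimal("27.435"),
                10_000, LocalDate.now().plusDays(2));

        TradingView forTheDesk = trade;          // sees price and quantity
        SettlementView forSettlements = trade;   // sees value date and amount

        System.out.println(forTheDesk.price() + " / " + forSettlements.settlementAmount());
    }
}
```

Each consumer compiles against only the narrow interface it is entitled to see, while the shared underlying model removes the need for a transformation on every hand-off.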