
Operational and Analytical Data

What is the difference and how should we treat data in the enterprise?

Unfortunately, there is still considerable confusion about what exactly operational and analytical data are. As a result, we are still struggling to find the right approach to handling data from an overarching enterprise perspective.

What has been identified as the ‘great divide of data’ is the source of many challenges in our data architecture today. The distinction between operational and analytical data, in its current definition, is not helpful.

Image by Author, inspired by the Great Divide of Data by Zhamak Dehghani

I have written about that particular problem in previous articles and made a key statement in the first part of my series on "Challenges and Solutions in Data Mesh":

To solve the challenge of brittle ETL pipelines, let’s refrain from drawing a strict line between operational and analytical data altogether. Instead, we should only distinguish source data from derived data – both can be used for operational and analytical purposes.

This point is so fundamental that I want to expand on it to make it clear why I am so committed to universal data supply that effectively bridges the gap between the two planes.

The misconception

I’ve said it before and I repeat it emphatically:

We should not distinguish between operational and analytical data.

Let’s analyze the distinction made by Zhamak Dehghani in her article on data mesh – it’s unfortunately repeated by other renowned architecture veterans in their very insightful book "Software Architecture: The Hard Parts", jointly written by Neal Ford, Mark Richards, Pramod Sadalage, and Zhamak Dehghani.

Operational Data is explained as data used to directly run the business and serve the end users. It is collected and then transformed to analytical data.

A quote from their book:

This type of data is defined as Online Transactional Processing (OLTP), which typically involves inserting, updating, and deleting data in a database.

Analytical Data is explained as a non-volatile, integrated, time-variant collection of data transformed from operational data, that is today stored in a data warehouse or lake.

A quote from their book:

This data isn’t critical for the day-to-day operation but rather for the long-term strategic direction and decisions.

Now, what is wrong with this distinction?

I posed the following questions to challenge it:

  1. What type of data does an analytical process produce when deriving an intelligent KPI that is used in subsequent operational processes? Is this analytical data because it was derived in an analytical process based on input from a data warehouse? Or is it operational data because it’s used for operational purposes to directly run the business?

  2. Data lakes and data warehouses (with an attached raw data vault/data lake) explicitly store operational data for subsequent transformations and analytical purposes. Is this analytical data because it’s non-volatile, integrated, and stored in a time-variant way in a data warehouse or lake? And is derived analytical data actually only stored in data warehouses or lakes?

I didn’t explicitly give answers in the mentioned article because I thought they were obvious. But I keep being confronted with this distinction, and I observe people struggling to properly manage data based on it.

So let me try to convince you that this distinction is not helpful and that we should stop using it.

What type of data does an analytical process produce when deriving an intelligent KPI that is used in subsequent operational processes?

Let’s take an example based on a real-world banking scenario. We extract data from a lot of different operational systems and save it in our data warehouse. We derive basic KPIs from it and store them in the data warehouse to be used in an operational online loan application to calculate individual interest rates.

I think we can safely say that the KPIs are derived in an analytical process, and following the definition, the result qualifies as analytical data.

But this data is also used for operational purposes as input for the interest rate calculation in an online loan application – and this online process definitely directly runs our business. Hence, it also qualifies as operational data.
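To make the ambiguity concrete, here is a minimal Python sketch of the scenario. All function names, the risk formula, and the pricing rule are hypothetical illustrations, not the bank’s actual logic:

```python
# Hypothetical sketch: the same derived KPI feeds both "planes".
# The scoring and pricing formulas below are illustrative assumptions.

def derive_default_risk_kpi(payment_history: list[bool]) -> float:
    """Analytical step: derive a risk KPI from warehouse data."""
    if not payment_history:
        return 0.5  # no history: assume average risk
    missed = sum(1 for paid in payment_history if not paid)
    return missed / len(payment_history)

def calculate_interest_rate(base_rate: float, risk_kpi: float) -> float:
    """Operational step: price a loan in the online application."""
    return base_rate + 5.0 * risk_kpi  # riskier customers pay more

risk = derive_default_risk_kpi([True, True, False, True])  # "analytical data"?
rate = calculate_interest_rate(2.0, risk)                  # ...used operationally
```

The KPI is produced analytically yet consumed in a business-critical operational transaction, so the operational/analytical label tells us nothing about how to manage it.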

Is derived analytical data actually only stored in data warehouses or lakes?

The KPIs and especially the interest rate for the loan application would definitely not only be stored in a data warehouse/lake. Most certainly it will also be stored in the operational loan system because it’s a key input for the loan acceptance.

It is even stated that analytical data isn’t critical for the day-to-day operation but rather for the long-term strategic direction and decisions.

But the KPIs used for the interest rate calculation together with the acceptance of the loan application are highly critical for the day-to-day business of a commercial private bank.

And this is not only true for this example. It’s the rule rather than the exception that data created by analytical processes is also used in subsequent operational processes.

Business does not distinguish between analytical and operational data

It’s simply not a helpful distinction for real-life scenarios in the business world. Only the business processes and therefore also the IT applications can be distinguished as having an operational or dispositive (planning or analytical) purpose.

But even this distinction is blurred by the fact that analytical results are typically the foundation for decisions to change the way the business operates.

However, an analytical process can typically tolerate longer downtime. Hence, the service level agreement for analytical processes can be more relaxed than for operational processes that run the business.

But we need to recognize that all data, regardless of whether it was generated in an operational or analytical business process, is important for the enterprise and always has operational significance.

Data is not a process, so we cannot say that operational data "is" OLTP. That just doesn’t make sense.

Helpful distinctions

Let’s therefore stop categorizing data as operational or analytical. The distinction lacks helpful criteria and is at best relevant from a purely technical perspective, to decide which service level is appropriate for the applications using that data.

Source data and derived data

Instead, we should distinguish source data from derived data – both can be used for operational and analytical purposes.

Why is that distinction more helpful?

Because it matches the company’s business view. Source data is newly digitized information that was not previously available in the organization. It cannot be derived from other data available in the enterprise.

New data must be captured by human input or generated automatically via sensors, optical/acoustic systems, or IoT devices. It can also be imported (or acquired) from data providers outside the organization. For this edge case, we can decide whether to treat it as source data or as derived data, even though the providing application works outside our organization.

This distinction is very important, as derived data can always be reconstructed by applying the same business logic to the corresponding source data. Source data, on the other hand, can never be reconstructed with logic and must therefore be backed up to prevent data loss.

It’s the data itself that is different, not the processes that use that data. Therefore, we need to manage source data and derived data in different ways from both a technical and a business perspective.
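The reconstruction property can be sketched in a few lines of Python. The transaction records and the balance logic are hypothetical examples, assuming the derivation logic is deterministic:

```python
# Sketch: derived data is a pure function of source data, so a lost
# derived store can be rebuilt; lost source data is gone for good.

source_transactions = [  # source data: captured once, must be backed up
    {"account": "A", "amount": 120.0},
    {"account": "A", "amount": -30.0},
    {"account": "B", "amount": 50.0},
]

def derive_balances(transactions: list[dict]) -> dict[str, float]:
    """Deterministic business logic: account balances are derived data."""
    balances: dict[str, float] = {}
    for tx in transactions:
        balances[tx["account"]] = balances.get(tx["account"], 0.0) + tx["amount"]
    return balances

# Rebuilding the derived store is just reapplying the logic to the source.
balances = derive_balances(source_transactions)
```

This is why the backup and retention requirements differ: the source records are irreplaceable, while the balances can be recomputed at any time.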

Data on the outside and data on the inside

The excellent article "Data on the Outside vs. Data on the Inside" by Pat Helland explores the distinction between data managed within a service (inside) and data shared between services (outside) in a service-oriented architecture (SOA) context.

Helland’s conclusion was that SOA requires data representations to play to their respective strengths: SQL for inside data, XML for inter-service communication, and objects for business logic within services. This blend allows each system to balance encapsulation, flexibility, and independence.

With the exception of the restrictive use of SQL, XML, and objects for data representations, the core idea is still very valid from my point of view.

Applications or services should be self-contained, with encapsulated data and logic interacting solely through messages. This approach prevents direct access to the data of another service, strengthening cohesion within the service and decoupling between the services.

Inside data often operates within an ACID (Atomic, Consistent, Isolated, Durable) transaction context, while outside data lacks this immediacy. Outside data should be represented as a stream of events (or data atoms) with eventual consistency. Any data sent outside of a service should be treated as immutable once shared. I have described this in more detail in the second part of my series on "Challenges and Solutions in Data Mesh".
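One lightweight way to express the immutability of outside data is shown in the sketch below. The event type and its fields are hypothetical; Python’s frozen dataclasses are just one possible enforcement mechanism:

```python
from dataclasses import dataclass

# Sketch: outside data as an immutable event. Field names are assumptions.
@dataclass(frozen=True)
class LoanAccepted:
    loan_id: str
    interest_rate: float
    accepted_at: str  # ISO timestamp; fixed once the event is shared

event = LoanAccepted("L-42", 3.25, "2024-05-01T10:00:00Z")

try:
    event.interest_rate = 0.0  # consumers cannot rewrite shared state
except AttributeError:
    pass  # frozen dataclasses raise on any mutation attempt
```

Once such an event has left the service boundary, correcting it means publishing a new, compensating event rather than editing the old one.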

The conclusion is that we should use data representations according to their individual strengths. A data representation is effectively a physical data model and any database type supporting that model has its specific strengths and weaknesses for particular purposes.

Whether you use a relational database, a document store, a graph database or maybe a very basic model like a key/value store for your data inside the service is highly use case dependent.

For inter-service communication, formats such as XML, JSON, Protocol Buffers, Avro, or Parquet, which offer schema independence and flexibility, are better suited for sharing immutable state.
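The inside/outside split can be sketched as follows, using JSON as the outside format. The event type and field names are illustrative assumptions:

```python
import json
from dataclasses import asdict, dataclass

# Sketch: a typed inside representation versus a schema-flexible
# outside representation (JSON). All names are hypothetical.

@dataclass(frozen=True)
class CustomerRated:          # inside: encapsulated domain object
    customer_id: str
    risk_kpi: float

def to_outside(event: CustomerRated) -> str:
    """Publish an immutable JSON snapshot for other services."""
    return json.dumps(asdict(event), sort_keys=True)

wire = to_outside(CustomerRated("C-7", 0.25))
# Consumers parse the JSON without depending on our internal model.
```

Consumers only couple to the published JSON shape, so the service remains free to change its internal model, storage engine, or language.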


From my point of view, the approach of using ‘data as products’ implemented as immutable data structures is best suited to enable universal sharing of information. The selection of the physical data model used inside your service is a technical optimization decision dependent on your use case.

Data on the Outside vs Data on the Inside – Image by author

However, the logical view on your overall business information needs to be consistent across all applications or services and should be independent from any physical data representation used inside your application or service.

Inter-service communication can be achieved via APIs or the exchange of data as products. See my articles on universal data supply to understand the whole concept.



What do you think about our challenge to develop a data architecture that is reliable, easy to change, and scalable?

I’d love to hear about your opinions.

