Align Your Data Architecture for Universal Data Supply
Follow me through the steps on how to evolve your architecture to align with your business needs

Photo by Simone Hutsch on Unsplash

Now that we understand the business requirements, we need to check if the current Data Architecture supports them.

If you’re wondering what to assess in our data architecture and what the current setup looks like, check the business case description.

· Assessing against short-term requirements · Initial alignment approach · Medium-term requirements and long-term vision · Step-by-step conversion · Agility requires some foresight · Build your business process and information model · Holistically challenge your architecture · Decouple and evolve

Assessing against short-term requirements

Let’s recap the short-term requirements:

  1. Immediate feedback with automated compliance monitors: Providing timely feedback to staff on compliance to reinforce hand hygiene practices effectively. Calculate the compliance rates in near real time and show them on ward monitors using a simple traffic light visualization.
  2. Device availability and maintenance: Ensuring dispensers are always functional, with near real-time tracking for refills to avoid compliance failures due to empty dispensers.

The current weekly batch ETL process is obviously not able to deliver immediate feedback.

However, we could try to reduce the batch runtime as much as possible and loop it continuously. For near real-time feedback, we would also need to run the query continuously to get the latest compliance rate report.

Both of these technical requirements are challenging. The weekly batch process from the HIS handles large data volumes and can’t be adjusted to run in seconds. Continuous monitoring would also put a heavy load on the data warehouse if we keep the current model, which is optimized for tracking history.

Before we dig deeper to solve this, let’s also examine the second requirement.

The smart dispenser can be loaded with bottles of various sizes, tracked in the Dispenser Master Data. To calculate the current fill level, we subtract the total amount dispensed from the initial volume. Each time the bottle is replaced, the fill level should reset to the initial volume. To support this, the dispenser manufacturer has announced two new events to be implemented in a future release:

  • The dispenser will automatically track its fill level and send a refill warning when it reaches a configurable low point. This threshold is based on the estimated time until the bottle is empty (remaining time to failure).
  • When the dispenser’s bottle is replaced, it will send a bottle exchange event.

However, these improved devices won’t be available for about 12 months. As a workaround, the current ETL process needs to be updated to perform the required calculations and generate the events.
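
To make the workaround concrete, here is a minimal Python sketch of the fill-level bookkeeping the updated ETL process could perform. All names, the warning threshold, and the event format are assumptions made for illustration, not the vendor's specification.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class DispenserState:
    dispenser_id: str
    bottle_volume_ml: float                  # initial volume from Dispenser Master Data
    dispensed_since_exchange_ml: float = 0.0

def apply_usage(state: DispenserState, dispensed_ml: float,
                warn_below_ml: float = 100.0) -> Optional[dict]:
    """Update the fill level and emit a refill warning once it drops below the threshold."""
    state.dispensed_since_exchange_ml += dispensed_ml
    fill_level = state.bottle_volume_ml - state.dispensed_since_exchange_ml
    if fill_level <= warn_below_ml:
        return {"event": "refill_warning",
                "dispenser_id": state.dispenser_id,
                "fill_level_ml": max(fill_level, 0.0),
                "created_at": datetime.now(timezone.utc).isoformat()}
    return None

def apply_bottle_exchange(state: DispenserState) -> dict:
    """Reset the fill level to the initial volume and emit a bottle-exchange event."""
    state.dispensed_since_exchange_ml = 0.0
    return {"event": "bottle_exchange",
            "dispenser_id": state.dispenser_id,
            "created_at": datetime.now(timezone.utc).isoformat()}
```

Once per-dispenser usage rates are known, a time-based "remaining time to failure" estimate could replace the simple volume threshold, matching the behavior the manufacturer has announced.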

A new report is needed based on these events to inform support staff about dispensers requiring timely bottle replacement. In medium-sized hospitals with 200–500 dispensers, intensive care units use about two 1-liter bottles of disinfectant per month. This means around 19 dispensers need refilling in the support staff’s weekly exchange plan.

Since dispenser usage varies widely across wards, the locations needing bottle replacements are spread throughout the hospital. Support staff would like to receive the bottle exchange list organized in an optimal route through the building.
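
As a rough illustration of how the exchange list could be ordered into a route, the sketch below uses a simple greedy nearest-neighbor heuristic and assumes each dispenser location is known as a (floor, x, y) coordinate; both the location model and the heuristic are assumptions, not part of the hospital's actual systems.

```python
from math import dist

def order_exchange_route(dispensers, start=(0, 0.0, 0.0), floor_penalty=50.0):
    """Greedy nearest-neighbor ordering of dispensers that need a bottle exchange."""
    def cost(a, b):
        # walking distance on a floor plus a penalty for changing floors
        return dist(a[1:], b[1:]) + floor_penalty * abs(a[0] - b[0])

    remaining = list(dispensers)      # items: (dispenser_id, (floor, x, y))
    route, current = [], start
    while remaining:
        nxt = min(remaining, key=lambda d: cost(current, d[1]))
        remaining.remove(nxt)
        route.append(nxt[0])
        current = nxt[1]
    return route

# Example: order_exchange_route([("D-101", (2, 10.0, 4.0)), ("D-017", (0, 3.0, 8.0))])
```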

Initial alignment approach

Following the principle "Never change a running system," we could try to reuse as many components as possible to minimize changes.

Initial idea to implement short-term requirements – Image by author

We would have to build NEW components (in green) and CHANGE existing components (in dark blue) to support the new requirements.

We know the batch needs to be replaced with stream processing for near real-time feedback. We consider using Change Data Capture (CDC), a technology to get updates on dispenser usage from the internal relational database. However, tests on the Dispenser Monitoring System showed that the Dispenser Usage Data Collector only updates the database every 5 minutes. To keep things simple, we decide to reschedule the weekly batch extraction process to sync with the monitoring system’s 5-minute update cycle.

By reducing the batch runtime and continuously looping over it, we effectively create a microbatch that supports stream processing. For more details, see my article on how to unify batch and stream processing.
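
Conceptually, the microbatch is nothing more than the shortened batch extraction wrapped in a loop synchronized with the 5-minute update cycle. The sketch below illustrates the idea; the extraction and load functions are placeholders, and the watermark handling assumes each extracted row carries a collection timestamp.

```python
import time
from datetime import datetime, timedelta, timezone

CYCLE = timedelta(minutes=5)

def run_microbatch(extract_usage_since, load_into_warehouse):
    """Loop the shortened batch extraction in sync with the 5-minute update cycle."""
    watermark = datetime.now(timezone.utc) - CYCLE
    while True:
        started = datetime.now(timezone.utc)
        rows = extract_usage_since(watermark)        # only rows newer than the watermark
        if rows:
            load_into_warehouse(rows)
            watermark = max(r["collected_at"] for r in rows)
        elapsed = datetime.now(timezone.utc) - started
        time.sleep(max((CYCLE - elapsed).total_seconds(), 0))  # wait out the rest of the cycle
```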

Reducing the runtime of the HIS Data ETL batch process is a major challenge due to the large amount of data involved. We could decouple patient and occupancy data from the rest of the HIS data, but the HIS database extraction process is a complex, long-neglected COBOL program that no one dares to modify. The extraction logic is buried deep within the COBOL monolith, and there is limited knowledge of the source systems. Therefore, we consider implementing near real-time extraction of patient and occupancy data from HIS as "not feasible."

Instead, we plan to adjust the Compliance Rate Calculation to allow near real-time Dispenser Usage Data to be combined with the still-weekly updated HIS data. After discussing this with the hygiene specialists, we agree that the low rates of change in patient treatment and occupancy suggest the situation will remain fairly stable throughout the week.

The Continuous Compliance Rate On Ward Level will be stored in a real-time partition associated with the ward entity of the data warehouse. It will support short runtimes of the new Traffic Light Monitor Query that is scheduled as a successor to the respective ETL batch process.

Consequently, the monitor will be updated every 5 minutes, which seems close enough to near real-time. The new Exchange List Query will be scheduled weekly to create the Weekly Bottle-Exchange Plan to be sent by email to the support staff.
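
To illustrate how the two data sets come together, here is a hedged sketch of the ward-level calculation behind the traffic light. It assumes compliance is the ratio of recorded dispenser hygiene actions to the hygiene opportunities estimated from the (still weekly) occupancy data; the thresholds are purely illustrative.

```python
def ward_compliance(dispenser_actions: int, estimated_opportunities: int) -> float:
    """Ratio of observed hygiene actions to estimated opportunities, capped at 1.0."""
    if estimated_opportunities == 0:
        return 1.0
    return min(dispenser_actions / estimated_opportunities, 1.0)

def traffic_light(compliance: float, green_from: float = 0.8, yellow_from: float = 0.5) -> str:
    """Map the compliance rate to the simple traffic-light visualization on the ward monitor."""
    if compliance >= green_from:
        return "green"
    if compliance >= yellow_from:
        return "yellow"
    return "red"
```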

We feel confident that this will adequately address the short-term requirements.

Medium-term requirements and long-term vision

However, before we start sprinting ahead with the short-term solution, we should also examine the medium and long-term vision. Let’s recap the identified requirements:

  1. Granular data insights: Moving beyond aggregate reports to gain insight into compliance at more specific levels (e.g., by shift or even person).
  2. Actionable alerts for non-compliance: Leveraging historical data with near real-time extended monitoring data to enable systems to notify staff immediately of missed hygiene actions, ideally personalized by healthcare worker.
  3. Personalized compliance dashboards: Creating personalized dashboards that show each worker’s compliance history, improvement opportunities, and benchmarks.
  4. Integration with smart wearables: Utilizing wearable technology to give real-time and discrete feedback directly to healthcare workers, supporting compliance at the point of care.

These long-term visions highlight the need to significantly improve real-time processing capabilities. They also emphasize the importance of processing data at a more granular level and using intelligent processing to derive individualized insights. Processing personalized information raises security concerns that must be properly addressed as well. Finally, we need to seamlessly integrate advanced monitoring devices and smart wearables to receive personalized information in a secure, discreet, and timely manner.

That leads to a whole chain of additional challenges for our current architecture.

But it’s not only the requirements of the hygiene monitoring that are challenging; the hospital is also about to be taken over by a large private hospital operator.

This means the current HIS must be integrated into a larger system that will cover 30 hospitals. The goal is to extend the advanced monitoring functionality for hygiene dispensers so that other hospitals in the new operator’s network can also benefit. As a long-term vision, they want the monitoring functionality to be seamlessly integrated into their global HIS.

Another challenge is planning for the announced innovations from the dispenser manufacturer. Through ongoing discussions about remaining time to failure, refill warnings, and bottle exchange events, we know the manufacturer is open to enabling real-time streaming for Dispenser Usage Data. This would allow data to be sent directly to consumers, bypassing the current 5-minute batch process through the relational database.

Step-by-step conversion

We want to counter the enormous challenges facing our architecture with a gradual transformation.

Since we’ve learned that working agile is beneficial, we want to start with the initial idea and then refine the system in subsequent steps.

But is this really agile working?

Agility requires some foresight

What I often encounter is that people equate "acting in small incremental steps" with working agile. While it’s true that we want to evolve our architecture progressively, each step should aim at the long-term target.

If we constrain our evolution to what the current IT architecture can deliver, we might not be moving toward what is truly needed.

When we developed our initial alignment, we only reasoned about how to implement the first step within the existing architecture’s constraints. However, this approach narrows our view to what’s ‘feasible’ within the current setup.

So, let’s try the opposite and clearly address what’s needed, including the long-term requirements. Only then can we target the next steps to move the architecture in the right direction.

For architecture decisions, we don’t need to detail every aspect of the business processes using standards like Business Process Model and Notation (BPMN). We just need a high-level understanding of the process and information flow.

But what’s the right level of detail that allows us to make evolutionary architecture decisions?

Build your business process and information model

Let’s start at a very high level to find out what the right level is.

In part 3 of my series on Challenges and Solutions in Data Mesh I have outlined an approach based on modeling patterns to model an ontology or enterprise data model. Let’s apply this approach to our example.

Note: We can’t create a complete ontology for the healthcare industry in this article. However, we can apply this approach to the small sub-topic relevant to our example.


Let’s identify the obvious modeling patterns relevant for our example:

Party & Role: The parties acting in our healthcare example include patients, medical device suppliers, healthcare professionals (doctors, nurses, hygiene specialists, etc.), the hospital operator, support staff and the hospital as an organizational unit.

Location: The hospital building address, patient rooms, floors, laboratories, operating rooms, etc.

Resource / Asset: The hospital as a building, medical devices like our intelligent dispensers, etc.

Document: All kinds of files representing patient information like diagnosis, written agreements, treatment plans, etc.

Event: We have identified dispenser-related events, such as bottle exchange and refill warnings, as well as healthcare practitioner-related events, like an identified hand hygiene opportunity or moment.

Task: From the doctor’s patient treatment plan, we can directly derive procedures or activities that healthcare workers need to perform. Monitoring these procedures is one of the many information requirements for delivering healthcare services.


The following high-level modeling patterns may not be as obvious for the healthcare setup in our example at first sight:

Product: Although we might not think of hospitals as being product-oriented, they certainly provide services like diagnoses or patient treatments. If pharmaceuticals, supplies, and medical equipment are offered, we can even talk about typical products. A better overall term would probably be a "healthcare offering".

Agreement: Not only agreements between provider networks and supplier agreements for the purchase of medical products and medicines but also agreements between patients and doctors.

Account: Our use case is mainly concerned with upholding best hygiene practices by closely monitoring and educating staff. We just don’t focus on accounting aspects here. However, accounting in general as well as claims management and payment settlement are very important healthcare business processes. A large part of the Hospital Information System (HIS) therefore deals with accounting.

Let’s visualize our use case with these high-level modeling patterns and their relationships.

Our example from the healthcare sector, illustrated with high-level modeling patterns – Image by author

What does this buy us?

With this high-level model we can identify ‘hygiene monitoring’ as an overall business process to observe patient care and take appropriate action so that infections associated with care are prevented in the best possible way.

We recognize ‘patient management’ as an overall process to manage and track all the patient care activities related to the healthcare plan prepared by the doctors.

We recognize ‘hospital management’ that organizes assets like hospital buildings with patient bedrooms as well as all medical devices and instrumentation inside. Patients and staff occupy and use these assets over time and this usage needs to be managed.

Let’s describe some of the processes:

  • A Doctor documents the Diagnosis derived from the examination of the Patient.
  • A Doctor discusses the derived Diagnosis with the Patient and documents everything that has been agreed with the Patient about the recommended treatment in a Patient Treatment Plan.
  • The Agreement on the treatment triggers the Treatment Procedure and reflects the responsibility of the Doctor and Nurses for the patient’s treatment.
  • A Nurse responsible for Patient Bed Occupancy will assign a patient bed at the ward, which triggers a Patient Bed Allocation.
  • A Nurse responsible for the patient’s treatment takes a blood sample from the patient and triggers several Hand Hygiene Opportunities and Dispenser Hygiene Actions detected by Hygiene Monitoring.
  • The Hygiene Monitoring calculates compliance from Dispenser Hygiene Action, Hand Hygiene Opportunity, and Patient Bed Allocation information and documents it for the Continuous Compliance Monitor.
  • During the week ongoing Dispenser Hygiene Actions cause the Hygiene Monitoring to trigger Dispenser Refill Warnings.
  • A Hygiene Specialist responsible for the Hygiene Monitoring compiles a weekly Bottle Exchange Plan from accumulated Dispenser Refill Warnings.
  • Support Staff responsible for the weekly Exchange Bottle Tour receives the Bottle Exchange Plan and triggers Dispenser Bottle Exchange events when replacing empty bottles for the affected dispensers.
  • and so on …

This way we get an overall functional view of our business. The view is completely independent of the architectural style we’ll choose to actually implement the business requirements.

A high-level business process and information model is therefore a perfect artifact to discuss any use case with healthcare practitioners.

Holistically challenge your architecture

With such a thorough understanding of our business, we can challenge our architecture more holistically. Everything we already understand and know today can and should be used to drive our next step toward the target architecture.

Let’s examine why our initial architecture approach falls short of properly supporting all identified requirements:

  • Near real-time processing is only partly addressed

A traditional data warehouse architecture is not the ideal architectural approach for near real-time processing. In our example, the long-running HIS data extraction process is a batch-oriented monolith that cannot be tuned to support low-latency requirements.

We can split the monolith into independent extraction processes, but to really enable all involved applications for near real-time processing, we need to rethink the way we share data across applications.

As data engineers, we should create abstractions that relieve the application developer from low-level data processing decisions. They should neither have to reason about whether a batch or stream processing style needs to be chosen, nor need to know how to actually implement it technically.

If we allow the application developers to implement the required business logic independent of these technical data details, it would greatly simplify their job.

You can get more details on how to practically implement this in my article on unifying batch and stream processing.

  • The initial alignment is driven by technology, not by business

Business requirements should drive the IT architecture decisions. If we turn a blind eye and soften the requirements to such an extent that they become ‘feasible’, we allow technology to drive the process.

The discussion with the hygiene specialists about the low rates of change in patient treatment and occupancy is such a softening of requirements. We know that there will be situations where the state will change during the week, but we accept the assumption of stability to keep the current IT architecture.

Even if we won’t be able to immediately change the complete architecture, we should take steps in the right direction. Even if we cannot enable all applications at once to support near real-time processing, we should take action to create support for it.

  • Smart devices, standard operational systems (HIS) and advanced monitoring need to be seamlessly integrated

The long-term vision is to seamlessly integrate the monitoring functionality with available HIS features. This includes the integration of various new (sub-)systems and new technical devices that are essential for operating the hospital.

With an architecture that focuses one-sidedly on analytical processing, we cannot adequately address these cross-cutting needs. We need to find ways to enable flexible data flow between all future participants in the system. Every application or system component needs to be connected to our mesh of data without having to change the component itself.

Overall, we can state that the initial architecture change plan won’t be a targeted step towards such a flexible integration approach.

Decouple and evolve

To ensure that each and every step is effective in moving towards our target architecture, we need a balanced decoupling of our current architecture components.

Universal data supply therefore defines the abstraction ‘data as a product’ for the exchange of data between applications of any kind. To enable current applications to create data as a product without having to completely redesign them, we use data agents to (re-)direct data flow from the application to the mesh.

Modern Data And Application Engineering Breaks the Loss of Business Context

By using these abstractions, any application can also become near real-time capable. Because it doesn’t matter if the application is part of the operational or the analytical plane, the intended integration of the operational HIS with hygiene monitoring components is significantly simplified.

Operational and Analytical Data

Let’s examine how the decoupling helps, for instance, to integrate the current data warehouse into the mesh.

The data warehouse can be redefined to act like one among many applications in the mesh. We can, for instance, re-design the ETL component Continuous Compliance Rate on Ward Level as an independent application producing the data as a product abstraction. If we don’t want or can’t touch the ETL logic itself, we can instead use the data agent abstraction to transform data to the target structure.

We can do the same for Dispenser Exchange Events or any other ETL or query / reporting component identified. The COBOL monolith HIS Data can be decoupled by implementing a data agent that separates the data products HIS occupancy data and HIS patient data. This allows the data-delivering components to evolve completely independently of the consumers.
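
What such a data agent could look like in its simplest form is sketched below: it routes the monolith's extract records into the two data products without touching the COBOL logic itself. The record layout and the publish mechanism are assumptions for illustration.

```python
from typing import Callable, Iterable

def his_data_agent(his_records: Iterable[dict],
                   publish: Callable[[str, dict], None]) -> None:
    """Route each HIS extract record to the matching data product, leaving the monolith untouched."""
    for record in his_records:
        if record.get("record_type") == "occupancy":
            publish("his_occupancy_data", record)
        elif record.get("record_type") == "patient":
            publish("his_patient_data", record)
        # other record types stay inside the monolith's original flow
```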

Whenever the dispenser vendor is ready to deliver advanced functionalities to directly create the required exchange events, we would just have to change the Dispenser Exchange Events component. Either the vendor can deliver the data as a product abstraction directly, or we can convert the dispenser’s proprietary data output by adapting the Dispenser Exchange Event data agent and logic.

Aligned Architecture as an Adapted Data Mesh enabling universal data supply – Image by author

Whenever we are able to directly create HIS patient data or HIS occupancy data from the HIS, we can partly or completely decommission the HIS Data component without affecting the rest of the system.


We need to assess our architecture holistically, considering all known business requirements. A technology-constrained approach can lead to intermediate steps that are not geared towards what’s needed but just towards what seems feasible.

Dive deep into your business and derive technology-agnostic processes and information models. These models will foster your business understanding and at the same time allow your business to drive your architecture.


In subsequent steps, we will look at more technical details on how to design data as a product and data agents based on these ideas. Stay tuned for more insights!

Modern Data And Application Engineering Breaks the Loss of Business Context
Here’s how your data retains its business relevance as it travels through your enterprise

A giant task for data and application engineering – Image created by DALL-E

I have previously written about the need to redefine the current data engineering discipline. I looked at it primarily from an organizational perspective and described what a data engineer should and should not take responsibility for.

The main argument was that business logic should be the concern of application engineers (developers) while all about data should be the data engineers’ concern. I advocated a redefinition of Data Engineering as "all about the movement, manipulation, and management of data".

Who cares for the intersection of data and logic? – Image by author

Now, as a matter of fact, the logic created by application engineers is ultimately also represented as data. Depending on which angle we look at this from, it means that we either have a technical gap or too much overlap at the intersection of data and logic.

So let’s roll up our sleeves and jointly take on the responsibility for maintaining the dependency between logic and data.

What exactly is data, information and the logic in between?

Let’s go through some basic definitions to better understand that dependency and how we can preserve it.

  • Data is the digitalized representation of information.
  • Information is data that has been processed and contextualized to provide meaning.
  • Logic is inherently conceptual, representing reasoning processes of various kinds, such as decision-making, answering, and problem-solving.
  • Applications are machine-executable, digital representations of human-defined logic using programming languages.
  • Programming languages are formal representation systems designed to express human logic in a way that computers can understand and execute as applications.
  • Machine Learning (ML) is the process of deriving information and logic from data through logic (sophisticated algorithms). The resulting logic can be saved in models.
  • Models are generated representations of logic derived from ML. Models can be used in applications to make intelligent predictions or decisions based on previously unseen data input. In this sense, models are software modules for logic that can’t be easily expressed by humans using programming languages.

Finally, we can conclude that logic applied to source data leads to information or other (machine-generated) logic. The logic itself can also be encoded or represented as data – quite similar to how information is digitalized.

The representation can be in the form of programming languages, compiled applications or executable images (like Docker), models generated from ML (like ONNX), and other intermediate representations such as Java bytecode for the JVM, LLVM Intermediate Representation (IR), or .NET Common Intermediate Language (CIL).

If we really work hard to maintain the relation between source data and the applied logic, we can re-create derived information at any time by re-executing that logic.

Now what does this buy us?

Business context is key to derive insight from data

Data without any business context (or metadata) is by and large worthless. The less you know about the schema and the logic that produced the data, the more difficult it is to derive information from it.

Data Empowers Business

Regrettably, we often regard metadata as secondary. Although the required information is usually available in the source applications, it’s rarely stored together with the related data. And this despite the fact that we know, even with the help of AI, how extremely challenging and expensive it is to reconstruct the business context from data alone.

Why is context lost?

So why do we throw away context, when we later have to reconstruct it at much higher costs?

Remember, I’m not only talking about the data schema, which is generally considered important. It’s about the complete business context in which the information was created. This includes everything needed to re-create the information from the available sources (source data or the source application itself, the schema, and the logic in digitalized form) and information that helps to understand the meaning and background (descriptions, relations, time of creation, data owner, etc.).

The strategy to keep the ability to reconstruct derived data from logic is similar to the core principles of functional programming (FP) or data-oriented programming (DOP). These principles advise us to separate logic from data and allow us to transparently decide whether we keep only the logic and the source data or also cache the result of that logic for optimization purposes. Both ways are conceptually the same, as long as the logic (function) is idempotent and the source data is immutable.

Retaining or losing business context – Image by author

Now, I don’t want to add arguments to the discussion for or against the use of functional programming languages. While functional programming is increasingly used today, it was noted in 2023 that functional programming languages still collectively hold less than 5% of mind share.

Perhaps this is the reason why, at the enterprise level, we are still by and large only caching the resulting data and thus losing the business context of the source applications.

This really is a lamentable practice that data and application engineering urgently need to fix.

If we base our data and application architecture on the following principles, we stand a good chance of retaining business relevance as data flows through the enterprise.

Save and version all logic applied to source data

Logic and data versioned and persisted as an object – Image by author

Idempotency

Referencing functional programming principles, applications in our enterprise architecture can and should act like idempotent functions.

These functions, when called multiple times with the same input, produce exactly the same output as if they were called only once. In other words, executing such an application multiple times doesn’t change the result beyond the initial application.

However, within the application (at the micro level), the internal processing of data can vary extensively, as long as these variations do not affect the application’s external output or observable side effects (at the macro level).

Such internal processing might include local data manipulations, temporary calculations, intermediate state changes, or caching.

The internal workings of the application can even process data differently each time, as long as the final output remains consistent for the same input.

What we really need to avoid are global state changes that could make repeated calls with the same input produce different output.
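
A minimal sketch of this macro-level idempotency, assuming a trivial transformation: the application may cache or otherwise vary its internal processing, but repeated calls with the same input always return the same output.

```python
# Internal cache: an optimization that is invisible at the macro level.
_cache: dict = {}

def compute_compliance_report(usage_events: tuple) -> dict:
    """Deterministic output for a given input; no observable global state changes."""
    if usage_events not in _cache:               # internal optimization only
        _cache[usage_events] = {
            "records": len(usage_events),
            "total_usage_ml": sum(usage_events),
        }
    return dict(_cache[usage_events])            # same result on every repeated call
```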

Treat logic as data

The representations of our applications are currently managed by the application engineers. They store the source code and – since the emergence of DevOps – everything else needed to derive executable code in code repositories (such as Git). This is certainly a good practice, but the relationship between the application logic actually applied and the specific version of data it was applied to is not managed by application engineering.

We don’t currently have a good system to manage and store the dynamic relationship between application logic and data with the same rigor as we do this for data on its own.

Digital representation starts with data, not logic. The logic is encoded in the source code of a programming language, which is compiled into machine-executable applications (files with machine-executable byte codes). For the operating system, it’s only data until a special, executable file is started as a program.

An operating system can easily start any application version to process data of a specific version. However, it also has no built-in functionality to track which application version has processed which data version.

We urgently need such a system at the enterprise level. It’s needed as urgently as databases were once needed to manage data.
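
To make the missing capability tangible, here is a deliberately minimal sketch of what such a registry could record; the field names are assumptions, not an existing product or standard.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class LineageEntry:
    application: str
    logic_version: str            # e.g. a git commit, image digest, or model hash
    input_data_version: str
    output_data_version: str
    executed_at: datetime

@dataclass
class LineageRegistry:
    entries: List[LineageEntry] = field(default_factory=list)

    def record(self, entry: LineageEntry) -> None:
        self.entries.append(entry)

    def how_was_it_built(self, output_data_version: str) -> List[LineageEntry]:
        """Everything needed to re-execute the logic and re-create this output."""
        return [e for e in self.entries
                if e.output_data_version == output_data_version]
```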

Since the representation of the application logic is also data, I believe both engineering disciplines are called upon to take responsibility.

Actively maintain relationships between logic and data

There are two main approaches to how logic and its associated data are managed in systems today: either application-centric, as practiced in application engineering, or data-centric, as practiced in data engineering.


The type of logic management mainly practiced in the enterprise today is application-centric.

Applications are installed on operating systems primarily using application packaging systems. These systems pull application versions from central repositories, handling all necessary dependencies.

By default, the well-known APT (Advanced Package Tool) does not support installing multiple versions of one application at the same time. It’s designed to manage and install a single version only.

Since container technology emerged on Linux, application engineering has enhanced this approach to better manage applications in isolated environments.

This allows us to install and manage several versions of the same application side by side.

In a Kubernetes cluster, for instance, the executable Docker images are managed in an image store called a registry. The cluster dynamically installs and runs any application (a microservice, if you like) of a specific version requested in an isolated pod. Data is then read and written from and to a database or data system using persistent volume claims (PVCs).

The same application running concurrently as different versions in isolated pods – Image by author

While we do see advancements in managing the concurrent execution of several application versions, the dynamic relation of data and applied logic is still neglected. There is no standard way of managing this relationship over time.


Apache Spark, as a typical data-centric system, treats logic as functions that are tightly coupled to its source data. The core abstraction of a Resilient Distributed Dataset (RDD) defines such a data object as an abstract class with pre-defined low-level functions (map, reduce/aggregate, filter, join, etc.) that can sequentially be applied to the source data.

The chain of functions applied to the source data is tracked as a directed acyclic graph (DAG). An application in Spark is therefore an instantiated chain of functions applied to source data. Hence, the relationship of data and logic is properly managed by Spark in the RDD.

However, directly passing RDDs between applications is not possible due to the nature of RDDs and Spark’s architecture. An RDD tracks the lineage of logic applied to the source data, but it’s ephemeral and local to the application and can’t be transferred to another Spark application. Whenever you persist the data from an RDD to exchange it with other applications, the context of applied logic is again stripped away.
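
The following hedged PySpark sketch illustrates the point: the lineage is visible inside one application, but once the result is persisted for exchange, a consumer reading the files starts with a fresh RDD that no longer carries the applied logic. Paths and data are illustrative.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

usage = sc.parallelize([("disp-1", 3), ("disp-2", 0), ("disp-3", 7)])
actions = usage.filter(lambda kv: kv[1] > 0).mapValues(lambda n: n * 1.5)

print(actions.toDebugString())        # shows the chain of applied logic (the DAG)

actions.saveAsTextFile("/tmp/dispenser_actions")   # persisted for other applications
reloaded = sc.textFile("/tmp/dispenser_actions")   # fresh RDD: lineage starts over,
                                                   # the applied logic is no longer attached
```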


Unfortunately both engineering disciplines cook their own soups. On one side we have applications managed in file systems, code repositories, and image registries maintained by application engineers. And on the other side we have data managed in databases or data platforms allowing application logic to be applied but maintained by data engineers.

Unfortunately no single discipline invented a good common system to manage the combination of data and applied logic. This relation is largely lost as soon as the logic has been applied and the resulting data needs to be persisted.

I can already hear you screaming that we have a principle to handle this. And yes, we have object-oriented programming (OOP), which has taught us to bundle logic and data into objects. This is true, but unfortunately it’s also true that OOP failed to deliver completely.

A good solution for the persistence and exchange of objects between applications running in completely different environments was not provided here either. Object-oriented database management systems (OODBMS) have never gained acceptance due to this restriction.

I think data and application engineering have to agree on a way to maintain the unit of data and applied logic as an object, but allow both parts to evolve independently.

Just imagine RDDs as a persistable abstraction that tracks the lineage of arbitrarily complex logic and can be exchanged between applications across system boundaries.

I described such an object as the abstraction ‘data as a product using a pure data structure’ in my article "Deliver Your Data as a Product, But Not as an Application".

Note that this concept is different from completely event-based data processing. Event-based processing systems are architectures where all participating applications only communicate and process data in the form of events. An event is a record of a significant change or action that has occurred within a system and is comparable to the data atoms described in the next chapter.

These systems are typically designed to consistently handle real-time data flows by reacting to events as they happen. However, processing at the enterprise level typically requires many more ways to transform and manage data. Legacy applications, especially, may use completely different processing styles and can’t directly participate in event-based processing.

But as we’ve seen, applications can locally act in very different styles as long as they stay idempotent at the macro level.

If we adhere to the principles described, we can integrate applications of any kind at the macro (enterprise) level and prevent the loss of business context. We do not have to force applications to only process data atoms (events) in near real-time. Applications can manage their internal data (data on the inside) as needed and completely independent of data to be exchanged (data on the outside) with other applications.

Applications can stay idempotent at the macro-level by managing internal data (data on the inside) completely different to shared data – Image by author

Create source data in atomic form and keep it immutable

Now, if we are able to seamlessly track and manage the lineage of data through all applications in the enterprise, we need to have a special look at source data.

Source data is special because it’s original information that, apart from the initial encoding into data, has not yet been further transformed by application logic.

This is new information that cannot be obtained by applying logic to existing data. Rather, it must first be created, measured, observed or otherwise recorded by the company and encoded to data.

If we save the original information created by source applications in immutable and atomic form, we store the data in the most compact and lossless way while keeping it usable in the most flexible way later on.

Immutability

Immutability forces the versioning of any source data updates instead of directly overwriting the data. This enables us to fully preserve all data that has ever been used for application transformation logic.

Data immutability refers to the concept that data, once created, cannot be altered later.

Does that mean that we can’t change anything, once created?

No, this would be completely impractical. Instead of modifying existing data structures, new ones are created that can easily be versioned.

But isn’t most of the information in the enterprise derived from original information, instead of being created as new?

Yes, and as discussed, this derivation can best be tracked and managed by the chain of application versions applied to immutable source data.
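
As a minimal sketch of what this looks like in practice, the following append-only store versions every update instead of overwriting it; the storage layout is an assumption for illustration only.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional

class ImmutableSourceStore:
    """Append-only store: every update becomes a new version, nothing is overwritten."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[dict]] = {}

    def put(self, key: str, record: dict) -> int:
        history = self._versions.setdefault(key, [])
        history.append({**record,
                        "_version": len(history) + 1,
                        "_recorded_at": datetime.now(timezone.utc).isoformat()})
        return len(history)                       # the new version number

    def get(self, key: str, version: Optional[int] = None) -> dict:
        history = self._versions[key]
        return history[-1] if version is None else history[version - 1]
```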

Besides this core benefit of immutable data, it offers other benefits as well:

Predictability

Since data doesn’t change, applications that operate on immutable data are easier to understand and their effects can be better predicted.

Concurrency

Immutable data structures are inherently thread-safe. This enables concurrent processing without the complexities of managing shared mutable state.

Debugging

With immutable data, the state of the system at any point in time is fixed. This greatly simplifies the debugging process.

But let me assure you once again: We do not have to convert all our databases and data storage to immutability. It’s perfectly fine for an application to use a conventional relational database system to manage its local state, for example.

It’s the publicly shared original information at the macro (enterprise) level which needs to stay immutable.

Atomicity

Storing data in atomic form is an optimal model for source data because it captures every detail of what has happened or become known to the organization over time.

As described in my article on taking a fresh view on data modeling, any other data model can be derived from atomic data by applying appropriate transformation logic. In concept descriptions of the Data Mesh, data as a product is often classified as source-aligned and consumer-aligned. This is an overly coarse classification of the many possible intermediate data models that can be derived from source data in atomic form.

Because source data can’t be re-created with saved logic, it’s really important to durably save that data. So it’s best to set up a proper backup process for it.

If we decide to persist (or cache) specific derived data, we are able to use this as a specialized physical data model to optimize further logic based on that data. Any derived data model can in this setup be treated as a long-term cache for the logic applied.

Check my article on Modern Enterprise Data Modeling for more details on how to encode complex information as time-ordered data atoms and organize data governance at the enterprise level. The minimal schema applied to encode the information enables extremely flexible use, comparable to completely unstructured data. However, it allows the data to be processed much more efficiently than its unstructured variant.

This maximum flexibility is especially important for source data, where we do not yet know how it will be further transformed and used in the enterprise.
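
As a hedged illustration of such time-ordered data atoms, the sketch below assumes each atom records one fact about one entity at one point in time as an (entity, attribute, value, recorded_at) record; this shape is my simplification for the example, not the exact model from the referenced article. Any other model, such as the current state per entity, can then be derived by applying transformation logic.

```python
from collections import defaultdict

atoms = [
    ("dispenser-17", "bottle_volume_ml", 1000, "2025-01-07T08:00:00Z"),
    ("dispenser-17", "ward", "ICU-2", "2025-01-07T08:00:00Z"),
    ("dispenser-17", "fill_level_ml", 350, "2025-01-20T14:05:00Z"),
]

def current_state(atoms):
    """Derive the latest value per (entity, attribute) - just one of many possible models."""
    state = defaultdict(dict)
    for entity, attribute, value, recorded_at in sorted(atoms, key=lambda a: a[3]):
        state[entity][attribute] = value
    return dict(state)
```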

By adhering to the principles described, we can integrate applications of any kind at the macro or enterprise level and break the loss of business context.


If both engineering disciplines agree upon a common system that acts at the intersection of data and logic, we could better maintain business meaning throughout the enterprise.

  • Application engineering provides source data in atomic form when consumers and their individual requirements are not yet known.
  • Data and application engineering agree on the management of data and logic relationship by a common system.
  • Data engineering doesn’t implement any business logic, but leaves this to application engineers.
  • Data engineering abstracts away the low-level differences between data streaming and batch processing as well as eventual and immediate consistency for data.

This modern way of managing data in the enterprise is the backbone of what I call universal data supply.

Towards Universal Data Supply

The Case Against Centralized Medallion Architecture
Why tailored, decentralized data quality trumps the medallion architecture

DALL-E generated

I’ve seen too many articles praising the medallion architecture as the go-to solution for enterprise data quality. At first sight, the structured, three-layered approach sounds like a no-brainer – organize your data into neat bronze, silver, and gold layers, and you seem to have established perfect data quality enhancement.

But on closer inspection, my aversion to this architectural approach grows ever greater. Sure, it promises consistent, scalable, and centralized information quality improvement. In practice, however, quality problems are constantly rectified too late and rigidly with the same tool, regardless of the context.

Enterprises are complex adaptive systems with wildly different data sources, each with unique challenges regarding its information quality. Why impose the same rigid process on all of them? Forcing them all into the same centralized quality framework will lead to inefficiencies and unnecessary overhead.

I want to challenge the Medallion Architecture as the supposed best answer to enterprise data quality problems. I’ll make the case for a more tailored, decentralized approach – one inspired by Total Quality Management (TQM) and aligned with the decentralized approach of universal data supply.

Medallion architecture in a nutshell

The medallion architecture seeks to improve data quality through a tiered approach that incrementally enhances data downstream of its production. By dividing the data into three medals or layers (commonly referred to as Bronze, Silver, and Gold), the architecture systematically applies data transformation and validation steps to ensure quality and usability.

The bronze layer is defined as containing raw, unprocessed data from the sources including any inconsistencies, duplicates or even errors. It serves as the single source of truth and can also be used to trace back the original information.

The silver layer processes and refines the raw data to resolve issues and improve consistency. It produces cleansed and validated data in a more consistent format.

The gold layer is defined to finally deliver highly refined, domain-specific datasets ready for business use. It offers the data aggregated, enriched and optimized for analytics or reporting.
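
To make the layering tangible, here is an illustrative sketch of the three layers as successive transformations; the column names and cleansing rules are made up for the example.

```python
def to_bronze(raw_rows):
    """Bronze: keep the raw records as delivered, duplicates and errors included."""
    return list(raw_rows)

def to_silver(bronze_rows):
    """Silver: deduplicate and normalize into a consistent format."""
    seen, silver = set(), []
    for row in bronze_rows:
        key = (row.get("dispenser_id"), row.get("timestamp"))
        if None not in key and key not in seen:
            seen.add(key)
            silver.append({**row, "dispenser_id": str(row["dispenser_id"]).strip()})
    return silver

def to_gold(silver_rows):
    """Gold: aggregate into a domain-specific, analytics-ready dataset."""
    per_dispenser = {}
    for row in silver_rows:
        per_dispenser[row["dispenser_id"]] = per_dispenser.get(row["dispenser_id"], 0) + 1
    return per_dispenser
```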

The medallion architecture is actually based on technical advancements from the vendor Databricks that allowed the data warehouse to be redefined as a data lakehouse.

As the name suggests, the lakehouse offers classic data warehouse functionality, like ACID updates on structured datasets, on top of a data lake. The data lake is known for supporting the processing of unstructured big data better than data warehouses based on relational databases.

Medallion Architecture is basically identical to the classic Data Engineering Lifecycle of data collection approaches like Data Warehouse and Data Lakehouse – Image by author

The medallion architecture addresses the business need for good information quality with these technical improvements. But does one technical improvement applied to a business requirement already make a better architecture?

Centralized, rigidly layered data collections won’t scale

By investigating my articles on universal data supply you’ll see that I’m a strong advocate of decentralized data processing on the enterprise level.

The fundamental lesson is that no single, centralized platform can solve all the varied information requirements in a sufficiently large enterprise.

Centralized data collection approaches like data warehouse and data lakehouse therefore cannot deliver a universal data supply.

At its core, the medallion architecture just defines three standardized layers within the data lakehouse setup and is therefore not suitable as an enterprise-wide data quality solution.

Let’s dig deeper and recognize the deficits.

Rigid layering

Applying a rigid three-layer data structure for all sources leads to inefficiencies when certain datasets do not require extensive cleansing or transformation.

Highly reliable internal source systems may not need extensive quality enhancements. Small-scale projects, exploratory data analysis, or non-critical data may not need gold-standard cleansing or structuring. While some data need extensive pre-processing through many transformation applications, other data may be directly fit for purpose without any transformation at all.

Three fixed layers do not fit such varied business requirements well. Applying the same standard data quality processing can waste resources and slow down innovation in such scenarios.

Operational complexity

Maintaining and enforcing such a centralized layered system requires significant operational overhead, especially in environments with rapidly changing requirements or datasets.

Each data layer involves additional processes like ETL/ELT pipelines and validations. Monitoring and debugging these pipelines become harder as the architecture scales.

The medallion architecture suffers from the same problems as the centralized data lakehouse. In an extremely distributed application landscape, a single centralized data quality platform cannot efficiently implement all necessary data quality improvements. This is similar to the centralized data lakehouse, which cannot efficiently apply all the necessary business rules to derive value from data.

Increased latency

Each data layer adds latency since data must move sequentially from one layer to the next.

Real-time or near-real-time analytics may require bypassing or optimizing the bronze/silver stages, which contradicts the layered nature of the architecture.

Overall, the forced data layers delay the delivery of insights for time-sensitive use cases like fraud detection.

One-sided focus on reactive downstream correction

The medallion architecture only improves quality after data has already been created with defects. That’s like trying to repair or optimize a car after it has been fully assembled.

In manufacturing, Total Quality Management (TQM) therefore stipulates that quality is designed into the product, starting from raw materials, processes, and components at the source. Defects are prevented rather than corrected.

Medallion is only reactive and always assumes error-prone raw data that has to be cleaned up and standardized layer by layer.

Total data quality management

TQM in manufacturing is proactive and focuses on preventing defects through continuous improvement, rigorous standards, and embedding quality checks at every stage of production. TQM is a holistic approach that is strongly customer-oriented regarding the product requirements and design. It has been successfully applied to many industrial production processes.

The fundamental principles of business excellence for quality are:

  • Customer focus
  • Applying measurements to the outputs of processes
  • Continuous process improvement based on the measurements

Because in universal data supply business processes create ‘data as a product’, we can directly apply these manufacturing quality principles.

We need to apply the TQM thinking to the creation of ‘data as a product’. We need Total Quality Data Management (TQDM).

A downstream approach like medallion inherently has higher costs and risks of missed errors compared to an upstream approach like TQDM, where issues are resolved closer to the source. Quality cannot be efficiently guaranteed by making corrections solely in the downstream systems.

I am repeatedly confronted with the following arguments, which suggest that downstream data corrections can be more efficient than process improvements to eliminate the root cause of the quality problem:

  • Legacy systems might be difficult or expensive to modify to meet required quality standards. Always ask yourself if the continuous correction of errors really can be cheaper than correcting the root cause.

  • Human errors by manual data entry or inconsistent formats are hard to eliminate entirely. Just because it’s difficult to avoid all possible errors should not stop us from making every effort to prevent them.

  • Downstream corrections reduce the load on highly stressed process-owning teams and prevent over-engineering at the source. This contradicts my own experience – decoupled data teams are actually completely overloaded with correcting all conceivable errors from the business domains.

  • Setting up downstream cleansing and refinement of externally provided data is often the only practicable option and avoids unrealistically high costs due to the demand for perfect quality from external sources. My work for car manufacturers has shown me how far you can go to commit suppliers to minimum quality standards. The same should be possible for external data providers – however, we may have to provide temporary workarounds through quality-enhancing agents.

  • Even in systems with robust quality controls, unexpected issues (e.g., system outages) can still occur. Downstream layers can act as a safety net to relieve source systems from handling every possible edge case. Just because it’s difficult to avoid all possible errors should not stop us from making every effort to prevent them. On the other hand, we cannot afford to build costly blanket-coverage safety nets in the value chain.

While it can be beneficial to refine data for specific business purposes and to have a safety net for specific system outages, a generic and rigid three-tiered approach does not meet the varied requirements at the enterprise level. Not every source needs the same ‘enhancement’ and often the arguments listed simply do not apply.

If we are in doubt, we should start measuring the real costs caused by low-quality data in the enterprise. From my experience, I can say that an internal process improvement has, in the long term, always been cheaper than ongoing downstream data corrections.

If downstream correction is really the only viable option, for instance because an external source cannot be fixed directly, it’s much more efficient to install purpose-built quality-enhancing agents for that specific source only. This tailored approach fits well with the decentralized universal data supply, where data producers share their data on the outside with all consumers. Quality-enhancing agents can act as a decoupled, selective corrective participating in the shared data infrastructure. Consumers can choose which enhancing procedures are beneficial for their individual information needs, and the process can easily be disconnected when it’s no longer needed.
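
A minimal sketch of such a source-specific, quality-enhancing agent is shown below: it subscribes to one external data product, applies only the corrections this particular source is known to need, and republishes an enhanced variant that consumers can opt into. The record fields and rules are illustrative assumptions.

```python
from typing import Callable, Iterable

def quality_enhancing_agent(source_records: Iterable[dict],
                            publish_enhanced: Callable[[dict], None]) -> None:
    """Apply corrections known to be needed for this one external source and republish."""
    for record in source_records:
        fixed = dict(record)
        if isinstance(fixed.get("ward"), str):
            fixed["ward"] = fixed["ward"].strip().upper()   # inconsistent ward labels
        if fixed.get("dispensed_ml", 0) < 0:
            fixed["dispensed_ml"] = 0                        # sensor glitch seen for this vendor
        publish_enhanced(fixed)
```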

TQDM needs to address the problem holistically and universal data supply is a perfect fit for this. ‘Data as a product’ needs to be a quality product – Image by author

We should combine centralized oversight with decentralized execution:

  • Centralized governance and tools: Define organizational quality standards and provide shared tools (e.g., data validation frameworks as part of the data fabric = self-service data platform tools) to be used directly in the domain teams.
  • Decentralized implementation: Allow domain teams to customize quality processes based on their specific data sources and use cases.
  • Selective layering: Customize medallion’s layered approach as needed and implement it only where it’s truly beneficial, avoiding over-engineering for simple and clean datasets.

Instead of setting up centralized downstream layers that all data has to pass through across the board, we should primarily invest in improving the data generation processes in order to prevent errors as far as possible. We need high quality ‘data as products’ across the entire value chain.

TQDM holistically addresses this problem and aligns well with the domain-specific ownership of data in universal data supply. It can adapt quickly to changing business needs without impacting unrelated processes. It emphasizes prevention over correction. Unavoidable corrections can be implemented in a selective and cost-effective manner early in the value chain.

TQDM combined with universal data supply outperforms the centralized medallion architecture to ensure data quality on the enterprise level.


If you want to learn more about TQM and TQDM, you can read the excellent book by information quality expert Larry English.

Universal data supply is an approach based on the adapted data mesh that effectively addresses challenges of the original data mesh as defined by Zhamak Dehghani:

Towards Universal Data Supply

Challenges and Solutions in Data Mesh

Engineering the Future: Common Threads in Data, Software, and Artificial Intelligence
How recognizing cross-discipline commonalities not only enhances recruitment strategies but also supports adaptable IT architectures.

The post Engineering the Future: Common Threads in Data, Software, and Artificial Intelligence appeared first on Towards Data Science.

I’ve noticed an ongoing trend toward over-specialization in IT departments. However, one of my key lessons learned over the years is the negative impact of this siloed specialization.

While it’s primarily an organizational issue, the trend towards the mindless embrace of specialized platform offerings from vendors has also led to significant overlap of functions in our enterprise architectures.

If your business is the provision of specialized IT solution platforms, you can of course benefit from razor-sharp specialization.

For all other businesses, I think this needs to be corrected.

The shift from silos to better collaboration

Traditional software application engineering, data engineering and Artificial Intelligence / machine learning (AI/ML) form large silos today.

While these IT tasks were assumed to be largely distinct, with different objectives, the business actually demands seamless data exchange and integration between applications and AI/ML models.

We need to shift from isolated tasks to integrated systems.

Engineers in each domain are actually dependent on many shared practices, requiring a common language and methodology. Data pipelines must now support real-time model inference; application software must handle data streams dynamically; and AI/ML models must fit seamlessly into live applications.

These cross-domain interactions should redefine the siloed role of engineers in each area, making it clear that we must think beyond the boundaries of traditional disciplines.

While I worked in the healthcare industry, I observed the same problem of over-specialization. Doctors also have a one-sided focus on specific organs or systems (e.g., cardiologists, neurologists). This over-specialization, while advancing treatments for certain conditions, often leads to a fragmented approach that can overlook the holistic health of patients. This can make it really difficult to get good, comprehensive advice.

However, there has indeed been a major shift in healthcare in recent years: away from silo thinking towards a more integrated, holistic approach. This trend emphasizes interdisciplinary collaboration, combining knowledge from different specialties to improve patient outcomes.

We urgently need the same rethinking in IT engineering.

Common threads: Principles bridging the disciplines

As I look back, there are a few key principles that stand out as essential, whether you’re a data engineer, a software developer, or an AI/ML practitioner.

Obvious commonalities are programming proficiency, algorithmic thinking and problem solving as well as proper handling of data structures. These principles create a common foundation that all engineers should have.

Let’s look at some more common threads.

Modularity and Reusability

Modularity has been a cornerstone of software architecture for years.

In Data Engineering, this principle is equally critical. A well-designed data pipeline must be modular to support reusable data transformations and easily adjustable components. While in application development we learned to think in (micro-)services that contribute to a coherent overall system, we still lack the same proficiency in building data pipelines. Instead, I often hear the ill-advised claim that data engineering is not software engineering.

A look at the Google paper “Hidden Technical Debt in Machine Learning Systems” clearly shows that the model itself is only a small part of the overall AI/ML service that needs to be developed. The majority of the service requires software and data engineering know-how to properly integrate it into the enterprise architecture. Feature engineering, for instance, is actually data engineering for AI/ML models and shares many commonalities with traditional ETL processing for data warehouses.

When all three disciplines strive for a modular architecture, it becomes easier to integrate the disparate systems and reuse components across the silos.
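To make this concrete, here is a minimal sketch of a data pipeline built from small, reusable transformation steps, composed the same way we compose services in application code. The step names and the composition helper are illustrative, not a specific framework.

```python
from functools import reduce
from typing import Callable, Iterable

Record = dict
Step = Callable[[Iterable[Record]], Iterable[Record]]

def deduplicate(key: str) -> Step:
    # Reusable step: drop records whose key has already been seen.
    def step(records: Iterable[Record]) -> Iterable[Record]:
        seen = set()
        for r in records:
            if r[key] not in seen:
                seen.add(r[key])
                yield r
    return step

def rename(mapping: dict) -> Step:
    # Reusable step: rename columns without touching the rest of the record.
    def step(records: Iterable[Record]) -> Iterable[Record]:
        for r in records:
            yield {mapping.get(k, k): v for k, v in r.items()}
    return step

def pipeline(*steps: Step) -> Step:
    # Compose independent steps into one pipeline; each step stays reusable.
    return lambda records: reduce(lambda data, step: step(data), steps, records)

clean = pipeline(deduplicate(key="id"), rename({"cust_nm": "customer_name"}))
```

Each step can be tested in isolation and reused in other pipelines, exactly as a well-cut service can be reused in other applications.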

Version Control and lifecycle management

In software development, version control is essential for managing changes, and this principle applies equally to data and AI/ML models. Data versioning ensures teams can track changes, maintain lineage, and guarantee reproducibility. Experiment tracking and lifecycle management for AI/ML models prevent updates from disrupting processes or introducing unexpected behavior in production.

A disciplined approach to version control in all areas ensures clean synchronization of systems, especially in our dynamic environments where data, code and models are constantly evolving. This need is reflected in the rise of "*Ops" disciplines like DevOps, MLOps, and DataOps, which all aim to promote the rapid delivery of high-quality software products.

However, these overlapping disciplines lead to unnecessary project management and workflow overhead. We maintain three separate, overspecialized versions of fundamentally similar processes. A unified approach that bridges these silos would significantly reduce complexity and improve efficiency.
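A lightweight way to treat data like code is to derive a content-based version for every dataset snapshot, so that code, data, and model artifacts can be pinned together in one release record. The sketch below is illustrative and not tied to any particular versioning tool; the version labels are hypothetical.

```python
import hashlib
import json

def dataset_version(records: list[dict]) -> str:
    """Deterministic content hash of a dataset snapshot:
    the same records always yield the same version string."""
    canonical = json.dumps(records, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

snapshot = [{"id": 1, "amount": 42.0}, {"id": 2, "amount": 17.5}]

release = {
    "code_version": "a1b2c3d",             # git commit of the pipeline (hypothetical)
    "data_version": dataset_version(snapshot),
    "model_version": "churn-model-0.4.1",  # hypothetical model artifact tag
}
```

Whether this record lives in a DevOps, DataOps, or MLOps tool matters less than the fact that all three artifacts are versioned and reproducible together.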

Real-Time processing and responsiveness

With the increasing need for low latency processing, traditional batch systems are no longer sufficient. Today’s users expect instant information supply. This shift toward near real-time responsiveness demands a new level of integration.

For data engineers, real-time processing means rethinking traditional ETL pipelines, moving to more event-driven architectures that push data as it’s created. Software engineers must design systems that can handle real-time data streams, often integrating AI/ML inference to provide personalized or context-aware responses. For AI/ML engineers, it’s about building models that operate with minimal latency.

Unfortunately, we are still too far away from unifying batch and stream processing.
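One practical step toward that unification is to write the processing logic once, as a function over individual events, and drive it either from a replayed batch or from a live stream. A minimal, framework-free sketch under that assumption:

```python
from typing import Callable, Iterable

Event = dict

def handle(event: Event) -> Event:
    # Business logic written once, independent of batch or stream execution.
    event["priority"] = "high" if event.get("amount", 0) > 1000 else "normal"
    return event

def run_batch(events: Iterable[Event], handler: Callable[[Event], Event]) -> list[Event]:
    # Bounded input: a replayed history processed with the same handler.
    return [handler(e) for e in events]

def run_stream(subscribe: Callable[[], Iterable[Event]], handler: Callable[[Event], Event]):
    # Unbounded input: events are handled as they are pushed by the producer.
    for event in subscribe():
        yield handler(event)
```

The handler stays the shared artifact; only the driver differs, which is the essence of treating batch as a special case of streaming.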

The Power of abstraction enables cross-functional systems

One of the most powerful tools to avoid overlapping functionality is abstraction.

Each domain has developed its own abstractions – e.g. UX principles like Model-View-Controller (MVC) or Backend for Frontend (BFF) in application development, ETL pipeline orchestration in data engineering, and layers in neural networks for ML.

By building systems on common abstractions, we create a language that can be understood across disciplines.

Consider how an abstraction like data as a product can serve as a shared language. For a data engineer, data as a product is a well-defined dataset created by applications to be disclosed and transported to consumers. For an AI/ML practitioner, it’s a feature set prepared for model training. For a software engineer, it’s like an API endpoint delivering reliable data input for application functionality. By creating and consuming data as a product, each team speaks the same language and this promotes better understanding.
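A shared abstraction like this can also be made explicit in code. The sketch below shows a minimal ‘data product’ envelope that carries the data together with its schema and ownership information; the field names are illustrative, not a standard.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class DataProduct:
    name: str        # e.g. "customer-orders" (hypothetical)
    owner: str       # producing domain team, accountable for quality
    schema: dict     # column names and types consumers can rely on
    records: tuple   # immutable payload shared on the outside
    produced_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

orders = DataProduct(
    name="customer-orders",
    owner="sales-domain",
    schema={"order_id": "string", "amount": "decimal"},
    records=({"order_id": "A-1", "amount": 42.0},),
)
```

The same envelope serves the data engineer as a dataset, the ML practitioner as a feature source, and the software engineer as a reliable input contract.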

Operating systems (OS) are traditionally the basic infrastructure that provides such fundamental abstractions to work equally well for all specific applications. Before we create new, fundamental abstractions as specialized tools in a single discipline, we should think twice about whether it would not be better covered by an infrastructure component – for example as an OS extension.

Embracing the feedback loop

As the boundaries between disciplines blur, the need for feedback loops becomes essential.

Data, software, and AI/ML systems are no longer static; they are continuously evolving, driven by feedback from users and insights from analytics. This further closes the gap between development and production, enabling systems to learn and adapt over time. The discipline that targets such feedback loops is commonly referred to as observability.

In data engineering, observability may mean monitoring data flow allowing ongoing collaboration to improve accuracy and reliability. For software engineers, it can be gathering real-time application usage and user feedback to refine functionality and user experience. In ML, feedback loops are critical for retraining models based on new data distribution, ensuring predictions stay relevant and accurate.

A well-designed feedback loop ensures that all systems are continuously optimized. These loops also enable cross-functional learning, where insights from one domain feed directly into improvements in another, creating a virtuous cycle of enhancement and adaptation.
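As a simple illustration of such a loop, the sketch below counts outcomes for every processed record and raises a signal when the observed error rate drifts above a threshold – the kind of feedback that can trigger a pipeline fix, a UX change, or a model retraining. The threshold is a hypothetical value.

```python
class FeedbackLoop:
    """Tiny observability sketch: count outcomes and flag drift."""

    def __init__(self, error_threshold: float = 0.05):
        self.error_threshold = error_threshold
        self.processed = 0
        self.failed = 0

    def record(self, success: bool) -> None:
        self.processed += 1
        if not success:
            self.failed += 1

    def needs_attention(self) -> bool:
        # Signal back to the producing team (or a retraining job) when the
        # observed error rate exceeds the agreed threshold.
        if self.processed == 0:
            return False
        return self.failed / self.processed > self.error_threshold
```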

Streamline your recruitment

The increasing specialization reflects a necessary evolution to address the growing complexity of modern systems.

While specialized disciplines can bring significant benefits, their highly overlapping parts lead to coordination and integration challenges. Organizations that succeed in harmonizing these crosscutting fields – through reliance on sound architecture principles, collaborative cultures, and unified strategies – will gain a competitive advantage.

You don’t need over-specialized engineers for every single aspect of your Enterprise Architecture. We won’t succeed with only a few enterprise architects having enough experience to oversee cross-discipline aspects. Powerful abstractions don’t emerge by living and thinking in silos. Engineers must be encouraged to think outside the box and understand the benefits of evolutionary architectures at the enterprise level.

All engineers need to follow sound enterprise architecture principles, not only the architects. Therefore, make sure you have a broad base of architecture know-how among your IT engineers.

Don’t look for a highly specialized DevOps engineer who knows all the latest tools; look for an IT engineer who knows a lot about Software Engineering and understands how to get software to production quickly while maintaining the highest quality.

Toward a unified engineering mindset

As we engineer the future, it’s clear that our success depends on bridging the separated disciplines where needed. Data engineers, software developers, and AI/ML practitioners must adopt a unified engineering mindset, embracing shared principles and practices to create systems that address the crosscutting requirements of business.

I strongly believe the future of engineering is a collaborative journey. By working within a shared framework – modularity, version control, near real-time responsiveness, and abstraction – we lay the groundwork for integrated systems. The goal is not to erase the distinctions between fields, but to leverage their unique strengths to go beyond the limitations of any one discipline.

Success will belong to those who can cross boundaries, adopt cross-functional principles, and think holistically about the systems they build. By engineering with these common threads, we not only improve the efficiency of each domain but also enable greater cross-cutting innovation and agility. The future is interconnected, and the path to building it starts with embracing common principles in IT engineering.

The post Engineering the Future: Common Threads in Data, Software, and Artificial Intelligence appeared first on Towards Data Science.

Operational and Analytical Data https://towardsdatascience.com/operational-and-analytical-data-54fc9de05330/ Thu, 07 Nov 2024 19:48:51 +0000 https://towardsdatascience.com/operational-and-analytical-data-54fc9de05330/ What is the difference and how should we treat data in the enterprise?

The post Operational and Analytical Data appeared first on Towards Data Science.

Unfortunately, we still have a big confusion about what exactly operational and analytical data is. As a result, we are still struggling to find the right approach to handling data from an overarching enterprise perspective.

What has been identified as the ‘great divide of data’ is the source for many challenges in our Data Architecture today. The distinction between operational and analytical data is not helpful in its current definition.

Image by Author, inspired by the Great Divide of Data by Zhamak Dehghani

I have written about that particular problem in previous articles and made a key statement in the first part of my series on "Challenges and Solutions in Data Mesh":

To solve the challenge of brittle ETL pipelines, let’s refrain from drawing a strict line between operational and analytical data altogether. Instead, we should only distinguish source data from derived data – both can be used for operational and analytical purposes.

This point is so fundamental that I want to expand on it to make it clear why I am so committed to universal data supply that effectively bridges the gap between the two planes.

The misconception

I’ve said it before and I repeat it emphatically:

We should not distinguish between operational and analytical data.

Let’s analyze the distinction made by Zhamak Dehghani in her article on data mesh – it’s unfortunately repeated by other renowned architecture veterans in their very insightful book "Software Architecture: The Hard Parts"; jointly written by Neil Ford, Mark Richards, Pramod Sadalage and Zhamak Dehghani.

Operational Data is explained as data used to directly run the business and serve the end users. It is collected and then transformed to analytical data.

A quote from their book:

This type of data is defined as Online Transactional Processing (OLTP), which typically involves inserting, updating, and deleting data in a database.

Analytical Data is explained as a non-volatile, integrated, time-variant collection of data transformed from operational data, that is today stored in a data warehouse or lake.

A quote from their book:

This data isn’t critical for the day-to-day operation but rather for the long-term strategic direction and decisions.

Now, what is wrong with this distinction?

I posed the following questions to challenge it:

  1. What type of data does an analytical process produce when deriving an intelligent KPI that is used in subsequent operational processes? Is this analytical data because it was derived in an analytical process based on input from a data warehouse? Or is it operational data because it’s used for operational purposes to directly run the business?

  2. Data lakes and data warehouses (with an attached raw data vault/data lake) explicitly store operational data for subsequent transformations and analytical purposes. Is this analytical data because it’s non-volatile, integrated and stored in a time-variant way in a data warehouse or lake? And is derived analytical data actually only stored in data warehouses or lakes?

I didn’t explicitly give answers in the mentioned article because I thought they were pretty obvious. But I keep being confronted with this distinction, and I observe people struggling to properly manage data based on that definition.

So let me try to convince you that this distinction is not helpful and that we should stop using it.

What type of data does an analytical process produce when deriving an intelligent KPI that is used in subsequent operational processes?

Let’s take an example based on a real-world banking scenario. We extract data from a lot of different operational systems and save it in our data warehouse. We derive basic KPIs from it and store them in the data warehouse to be used in an operational online loan application to calculate individual interest rates.

I think we can safely say, that the KPIs are derived in an analytical process and following the definition the result is qualified as analytical data.

But this data is also used for operational purposes as input for interest rate calculation in an online loan application – this online process definitely directly runs our business. Hence, it also qualifies for being defined as operational data.

Is derived analytical data actually only stored in data warehouses or lakes?

The KPIs and especially the interest rate for the loan application would definitely not only be stored in a data warehouse/lake. Most certainly it will also be stored in the operational loan system because it’s a key input for the loan acceptance.

It is even stated that analytical data isn’t critical for the day-to-day operation but rather for the long-term strategic direction and decisions.

But the KPIs used for the interest rate calculation together with the acceptance of the loan application are highly critical for the day-to-day business of a commercial private bank.

And this is not only true for this example. It’s the rule rather than the exception that data created by analytical processes is also used in subsequent operational processes.

Business does not distinguish between analytical and operational data

It’s simply not a helpful distinction for real-life scenarios in the business world. Only the business processes and therefore also the IT applications can be distinguished as having an operational or dispositive (planning or analytical) purpose.

But even this distinction is blurred by the fact that analytical results are typically the foundation for decisions to change the way the business operates.

However, an analytical process can tolerate longer down-time. Hence, the service level agreement on analytical processes can be more relaxed compared to operational processes that run the business.

But we need to recognize that all data, regardless of whether it was generated in an operational or analytical business process, is important for the enterprise and always has operational significance.

Data is not a process, so we cannot say that operational data is OLTP. This just doesn’t make sense.

Helpful distinctions

Let’s therefore stop categorizing data into operational and analytical data. The distinction lacks helpful criteria and is at best relevant from a purely technical perspective, to decide which service level is appropriate for the applications using that data.

Source data and derived data

Instead, we should distinguish source data from derived data – both can be used for operational and analytical purposes.

Why is that distinction more helpful?

Because it matches the company’s business view. The source data is new digitalized information that was not previously available in the organization. It cannot be derived from other data available in the enterprise.

New data must be captured by human input or generated automatically via sensors, optical/acoustic systems or IoT devices. It can also be imported (or acquired) from data providers outside the organization. For this edge case, we can decide whether we want to treat it as source data or as derived data, although the providing application then works outside our organization.

This distinction is very important, as derived data can always be reconstructed by applying the same business logic to the corresponding source data. Source data, on the other hand, can never be reconstructed with logic and must therefore be backed up to prevent data loss.

It’s the data itself that is different, not the processes that use that data. Therefore, we need to manage source data and derived data in different ways from both a technical and a business perspective.
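The practical consequence can be expressed in a few lines: source data must be durably backed up, while derived data can be dropped and rebuilt at any time by re-applying the business logic to the source. The records and the derivation below are a minimal, hypothetical illustration of that difference.

```python
def derive_weekly_totals(source_records: list[dict]) -> dict:
    """Derived data: reproducible at any time from the same source records."""
    totals: dict = {}
    for record in source_records:
        totals[record["week"]] = totals.get(record["week"], 0) + record["amount"]
    return totals

# Source data: captured once (sensor, human input, external feed) and backed up,
# because no logic in the enterprise can reconstruct it if it is lost.
source_records = [
    {"week": "2024-W40", "amount": 120},
    {"week": "2024-W40", "amount": 80},
    {"week": "2024-W41", "amount": 95},
]

weekly_totals = derive_weekly_totals(source_records)  # safe to delete and rebuild
```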

Data on the outside and data on the inside

The excellent article "Data on the Outside vs. Data on the Inside" by Pat Helland, explores the distinction between data managed within a service (inside) and data shared between services (outside) in a Service-Oriented Architecture (SOA) context.

Helland’s conclusion was that SOA requires data representations to play to their respective strengths: SQL for inside data, XML for inter-service communication, and objects for business logic within services. This blend allows each system to balance encapsulation, flexibility, and independence.

With the exception of the restrictive use of SQL, XML, and objects for data representations, the core idea is still very valid from my point of view.

Applications or services should be self-contained, with encapsulated data and logic interacting solely through messages. This approach prevents direct access to the data of another service, strengthening cohesion within the service and decoupling between the services.

Inside data often operates within an ACID (Atomic, Consistent, Isolated, Durable) transaction context, while outside data lacks this immediacy. Outside data should be represented as a stream of events (or data atoms) with eventual consistency. Any data sent outside of a service should be treated as immutable once shared. I have described this in more detail in the second part of my series on "Challenges and Solutions in Data Mesh".

The conclusion is that we should use data representations according to their individual strengths. A data representation is effectively a physical data model and any database type supporting that model has its specific strengths and weaknesses for particular purposes.

Whether you use a relational database, a document store, a graph database or maybe a very basic model like a key/value store for your data inside the service is highly use case dependent.

To facilitate inter-service communication, formats like XML, JSON, Protocol Buffers, Avro or Parquet, which offer schema independence and flexibility, are better suited for the outside sharing of immutable state.
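A minimal sketch of this outside representation: the service keeps whatever physical model suits it internally, but publishes state changes as immutable, self-describing events serialized to a schema-flexible format such as JSON. The event and field names are illustrative only.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)   # immutable once shared on the outside
class LoanAccepted:
    event_id: str
    occurred_at: str      # ISO timestamp; outside data is eventually consistent
    loan_id: str
    interest_rate: float

def to_outside_representation(event: LoanAccepted) -> str:
    # JSON (or Avro/Protobuf/Parquet) keeps consumers independent of the
    # producer's internal relational, document, or graph model.
    return json.dumps(asdict(event), sort_keys=True)

message = to_outside_representation(
    LoanAccepted("e-123", "2024-11-07T19:48:51Z", "loan-42", 3.75)
)
```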


From my point of view, the approach of using ‘data as products’ implemented as immutable data structures is best suited to enable universal sharing of information. The selection of the physical data model used inside your service is a technical optimization decision that depends on your use case.

Data on the Outside vs Data on the Inside – Image by author

However, the logical view on your overall business information needs to be consistent across all applications or services and should be independent from any physical data representation used inside your application or service.

Inter-service communication can be achieved via APIs or the exchange of data as products. See my articles on universal data supply to understand the whole concept.


If you liked this article, then please consider clapping.

What do you think about the challenge of developing a data architecture that is reliable, easy to change, and scalable?

I’d love to hear about your opinions.

The post Operational and Analytical Data appeared first on Towards Data Science.

No, You Don’t Need a New Microservices Architecture https://towardsdatascience.com/no-you-dont-need-a-new-microservices-architecture-f0dbda673bae/ Tue, 29 Oct 2024 04:36:32 +0000 https://towardsdatascience.com/no-you-dont-need-a-new-microservices-architecture-f0dbda673bae/ Because you almost certainly already have one without explicitly realizing it

The post No, You Don’t Need a New Microservices Architecture appeared first on Towards Data Science.

If you feel like the AI-generated article image actually captures your company’s system architecture quite nicely, then this article is for you.

There is no doubt that breaking down complex tasks into smaller, manageable subtasks is helpful for any kind of problem solving. This is also true for IT systems that digitalize our business processes. Architects therefore followed the tried and tested path of “divide and conquer” in IT and divided our systems into smaller applications/services that fulfil specific tasks for different business areas.

With the growing complexity of our enterprise, the system of interconnected applications/services representing the digitalized business processes also grew overly complex. So we’re constantly trying to keep order and structure so that the whole mess doesn’t implode and stop working altogether – that’s actually Enterprise Architecture, if you ever wondered what those guys in the ivory tower are trying to achieve with their architecture component diagrams.

Architecture for the enterprise

Much has been written about the right architecture to tame or prevent chaos when the overall task of digitalizing the company is spread across the entire enterprise.

You can find comprehensive frameworks like The Open Group Architecture Framework (TOGAF), that explain everything that could be relevant to effectively model your enterprise architecture. There are also extensive descriptions available for the right design patterns to be followed when architecting your applications. Enterprise Application Integration (EAI) has defined Enterprise Integration Patterns that help to re-integrate spread information caused by isolated applications. And Domain-Driven Design (DDD) even taught us how to prevent the isolation to occur in the first place.

The specialized discipline of enterprise Data Architecture also emerged, offering an ever-increasing number of styles and approaches to tame the data chaos in the company. I won’t go into detail here, but James Serra has made an incredible effort to decipher data architectures found in the enterprise today.

Despite this multitude of frameworks, development patterns, best practices and lots of advice from both application and data experts, we still struggle to create and maintain a well-organized enterprise architecture. I now have the impression that the sheer volume of good advice on doing the right thing tends to contribute to further confusion.

So we are always on the lookout for the next improved, even more comprehensive and hopefully definitive piece of advice on how we can best organize our IT systems.

Microservices architecture for the enterprise?

The latest (or should I say modern) approach is the microservices architecture, which can also seamlessly extend to the cloud. Allowing a hybrid use of cloud services is definitely also on the long list of challenges that enterprise architects have to solve.

So let’s just tackle this modern style as quickly as possible, shall we?

Well, even for such a modern application architecture, we have already found anti-patterns and pitfalls to avoid – Mark Richards for example, a renowned architecture veteran, has collected good advice on this.

This is not surprising, as all the rich advice we already have for our enterprise architecture is equally valid for a microservices architecture. I would even state that most larger companies already have a kind of microservices architecture in place, although they might not explicitly be aware of it or call it that.

What do I mean with this statement?

All the different applications, services and platforms interconnected by the corporate network can actually be seen as a microservices architecture. However, with all its fragile and inefficient component interplay, scaling problems and maintenance issues it’s a rather weak and somewhat chaotic implementation of it.

Let’s distill the essence of microservices architecture by focusing on its core principles to help us recognize the validity of my statement.

  • Independent deployability: To ensure independent deployability, we need applications/services to be as loosely coupled as possible, so that changes in one application/service don’t affect others. This is certainly also a main target in our enterprise in general. After all, that was the key driver for the decision to spread the development of applications across the entire enterprise.

  • Everything modeled around a business domain: Frameworks like DDD teach us to structure code and components to better align with the needs of business domains. This is general advice also applicable outside microservices architecture. It has been accepted as the best way to achieve the decoupling of applications/services needed to enable independent deployments. Traditional layered architectures are no longer the advised way of structuring an enterprise architecture.

  • No shared state: The realization that a single shared database cannot be scaled to the enterprise is also not new or specific to microservices architectures. It is equally important for applications at the enterprise level to decide which data is shared and which is hidden or private. This also reduces unwanted coupling and fosters independent deployability.

  • Small size of services: “How small should a microservice be?” is certainly one of the most common questions asked around microservices architecture. However, even evangelists of microservices architecture agree that the actual size of the service is the least significant aspect. Some even argue that only the size of the interface, not the size of the service itself, is relevant. But apart from this question, everyone agrees that size is relative and the ‘right’ size of a service needs to be developed over time. It cannot be defined in advance in lines of code or other complexity measures. There may still be big monolithic applications in the company, but from the global perspective of the enterprise, even such a monolithic application can have the right size to fulfil the other core concepts described for microservices architecture.

  • Flexibility for change and evolution: This is the most essential ingredient for maintaining a successful enterprise architecture. Evolving software architectures allow us to constantly integrate each new innovation into the overall software ecosystem. They are flexible enough to maintain the right balance in the system with every change implemented. Again, you can find rich advice from renowned software architecture veterans in Building Evolutionary Architectures. It’s valid advice for the enterprise and in no way specific to microservices architecture.

Consequently, the key question is not if a microservices architecture is the right architecture for our organization. There is no right or wrong architecture on the enterprise level, but instead just many trade-offs to be made.

The core concepts of microservices architecture are practically in line with everything we have already learned about maintaining a clean architecture at the enterprise level. There’s nothing truly new in the microservices approach. Even the emergence of platforms trying to standardize the implementation – Kubernetes is the most prominent one – will not revolutionize our enterprise architecture. It’s just another platform promoted by one of the influential vendors on the market. One more new product that vendors intend to sell us to solve all our architecture challenges.

The typical approach to ‘integrate’ such allegedly revolutionary platforms into the enterprise architecture is lighthouse projects. These projects are big and expensive, but very seldom comprehensive enough to replace substantial parts of the current ecosystem. Instead, they just complicate matters further by creating new islands of technology.

I’ve written about the unfortunate practice for larger companies trying to buy data platforms and assemble them like Lego bricks, hoping for the next big leap forward.

This doesn’t work for either data platforms or application platforms!

The evolution of architectural principles over time should make us realize that there is no simple, one-size-fits-all approach. While today’s view suggests microservices as a promising architectural approach at the enterprise level, it will expose further weaknesses and trigger just another set of pitfalls and anti-patterns to avoid.

The key takeaway is that no single product, platform or new architecture style can ever replace our enterprise ecosystem of interconnected applications that are fit for our purpose. We can change and evolve the enterprise system following sound architecture principles, but we cannot just buy a new enterprise architecture, nor can we completely replace it with a single, new approach. So you don’t need a new microservices architecture; instead, evolve your current enterprise architecture to better support the core concepts listed above.


See also the articles on Universal Data Supply for my advice on how to seamlessly (re)integrate application architecture with data architecture.

The post No, You Don’t Need a New Microservices Architecture appeared first on Towards Data Science.

Universal Data Supply: Know Your Business https://towardsdatascience.com/universal-data-supply-know-your-business-9ed8a80a0224/ Tue, 22 Oct 2024 17:50:52 +0000 https://towardsdatascience.com/universal-data-supply-know-your-business-9ed8a80a0224/ An industry example to emphasize the importance of understanding your business case

The post Universal Data Supply: Know Your Business appeared first on Towards Data Science.

As announced in my lessons learned article, I’m starting the series on implementing universal data supply with a concrete business example from the industry.

You may feel like reading about business matters is lengthy and exhausting. Instead you just want to get straight into the detailed steps to migrate your traditional Data Architecture towards the new approach. But let me try to convince you that thoroughly understanding your business is paramount for any architectural changes to truly create business value.

It’s not that you need to fall in love with your business straight away. But I can promise you that if you really engage with what you want to develop an IT solution for, your solutions will be much better.

Of all the projects I’ve worked on over the years – including exciting but sometimes quite abstract solutions for banks and insurance companies – the following was the closest to my heart.

So let’s take a look at an illustrative, simplified example derived from a real healthcare project I had the pleasure to work on. It will help us to recognize the potential of universal data supply to create an architecture that solves real business challenges.

It is an example where data technology can actually help save lives. What could be more motivating for us as data engineers?

CleanHands Medical Center

Our fictional hospital ‘CleanHands Medical Center’ has a strong commitment to infection prevention and attaches particular importance to thorough hand hygiene.

Image created by DALL-E

The business case

Hospital-acquired infections (HAIs) harm patients every day across healthcare systems worldwide. These infections, many of which are caused by multidrug-resistant organisms, place a significant burden on hospitals and can ultimately cost patients their lives.

Because of the accelerating rise of antimicrobial resistance (AMR), these infections are hard to treat, causing enormous additional costs for hospitals. These costs include extended hospital stays, increased use of medications, specialized treatments in intensive care units (ICU) and, not least of all, reputational damage and legal costs. Above all, it further increases the risk of disease spread, severe illness and, consequently, the death of patients.

By investing in effective infection prevention and control (IPC), hospitals can significantly improve patient safety while also reducing costs. According to the WHO "Global Strategy on Infection Prevention and Control", the most effective measure against HAIs is good hand hygiene.

For me personally at least, it was surprising that a procedure as simple as washing your hands or rubbing them with disinfectant is still the most effective way to prevent these extremely harmful infections.

According to current studies, even in modern hospitals in industrialized countries hand hygiene compliance rates are still very low. Average compliance is around 40% across general hospital units. In intensive care units, it tends to be slightly better at around 60%. These figures highlight the ongoing challenges in ensuring consistent hand hygiene – it’s a real threat to us all.

Studies have also demonstrated that hand hygiene interventions can achieve a 25–70% reduction in HAI rates. Hand hygiene is not only effective but also significantly reduces costs for hospitals. For every $1 invested, it saves approximately $16.50 in healthcare expenses.

Hand hygiene compliance

But why is there still a problem with something as obviously important and simple as maintaining good hand hygiene? After all, there are highly qualified healthcare professionals working in our hospitals who are fully aware of this danger.

Various studies on hand hygiene compliance have shown that it’s very often the lack of immediate feedback as an incentive that discourages health-workers from complying. In the daily hustle and bustle and especially in stressful moments, e.g. when there is a shortage of staff due to many sick days, they simply forget it or just skip good hand hygiene due to carelessness.

Repetitive tasks like handwashing, though simple, can also lead to oversight due to fatigue or desensitization over time. Surprisingly, even doctors, who could be assumed to be the most sensitive to this issue, are among the groups with the worst compliance.

Consequently, to promote and maintain proper hand hygiene, hygienists should regularly monitor quality, provide timely feedback to healthcare workers and conduct regular training and assessments.

However, the monitoring process for hand hygiene practices is typically still based on trained observers who directly observe healthcare staff during routine patient care. Yes indeed, there are hygiene specialists on the ward with a clipboard, watching the staff at work.

Observers try to be as discreet as possible to avoid the Hawthorne effect, where staff improve behavior only because they are being observed. But honestly, this effect just can’t be avoided completely in this setup.

The observation task itself requires a high level of concentration and accuracy. I tried it myself and failed miserably. For each observed treatment, hygienists must count the number of actual hygiene actions and at the same time keep track of the 5 most important opportunities or moments when such an action would have been necessary.

Five moments for hand hygiene – World Health Organization (WHO)

With this information they can calculate a compliance rate by dividing the number of hand hygiene actions observed by the total number of hand hygiene moments recorded.
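In code, this observed compliance rate is a simple ratio. A minimal sketch (the observation numbers are purely illustrative):

```python
def compliance_rate(actions_observed: int, moments_recorded: int) -> float:
    """Hand hygiene compliance = observed hygiene actions / recorded opportunities."""
    if moments_recorded == 0:
        return 0.0
    return actions_observed / moments_recorded

# Example observation session: 23 hygiene actions during 40 recorded moments.
rate = compliance_rate(actions_observed=23, moments_recorded=40)  # 0.575 -> 57.5 %
```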

As this manual process is error-prone and obviously costly, it can only be carried out on a random basis. The hygienists therefore seek to significantly improve and automate this process to enable ongoing discreet monitoring of hand hygiene compliance.

Smart technology helps to save lives

Especially timely feedback to the staff on their compliance and an enabling environment with operable disinfectant dispensers have been identified as important interventions.

The hospital therefore invested in smart disinfectant dispensers coupled with a near real-time monitoring system. These smart dispensers can transmit usage data on when and how the device was used, together with the amount of liquid dispensed.

The dispenser device establishes the connection to the monitoring system via a standard WLAN protocol that is normally already available in the IT infrastructure of hospitals. The monitoring system receives usage data sent by the devices and stores it in a relational database system. The system offers all kinds of maintenance functions to register dispensers at their specific installation locations on the wards and to manage them with additional information, such as the assignment to the organizational units in the hospital. Various reports are available that show current and historical dispenser usage metrics at different aggregation levels.

Traditional data architecture for initial improvement

Isolated usage information related to general target values determined in the random observation samples already provides great value for manually deriving a hospital-wide key figure for compliance. However, the hygiene specialists quickly identified further potential for improvement. They wanted to automate the determination of compliance rates on a more granular level.

The hospital IT data engineers have therefore developed a batch-oriented ETL process that extracts all usage data from the dispenser system’s relational database into the hospital’s Data Warehouse on a weekly basis. This was implemented to combine the dispenser usage data with patient occupancy data and treatment data from the hospital’s information system (HIS). The HIS data is also loaded into the data warehouse on a weekly basis via a different batch process.

Traditional Data Architecture for the initial solution – Image by author

With the help of some elaborate business rules, the data engineers were able to implement a transformation logic to derive the "expected frequency of use" for all dispensers during a week at ward level. To this end, the hygienists defined a flat-rate number of hygiene actions that are assumed necessary for a specific treatment of a patient. Combined with weekly occupancy it’s possible to calculate a continuous weekly compliance rate at the ward level.
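The shape of this transformation logic can be sketched as follows: the expected number of dispenser uses per ward is derived from treatments and the flat-rate number of required hygiene actions, and compared with the actual dispenser usage. The field names, the flat rates, and the default value are hypothetical; the real rules are naturally more elaborate.

```python
def weekly_ward_compliance(
    treatments: list[dict],        # from HIS, e.g. {"ward": "A1", "treatment_type": "wound_care"}
    dispenser_usages: list[dict],  # from dispenser system, e.g. {"ward": "A1"}
    actions_per_treatment: dict,   # flat-rate hygiene actions defined by the hygienists
) -> dict:
    expected: dict = {}
    for t in treatments:
        expected[t["ward"]] = expected.get(t["ward"], 0) + actions_per_treatment.get(
            t["treatment_type"], 2  # hypothetical default of 2 actions per treatment
        )

    actual: dict = {}
    for u in dispenser_usages:
        actual[u["ward"]] = actual.get(u["ward"], 0) + 1

    # Compliance per ward, capped at 100 % as a simplification.
    return {
        ward: min(actual.get(ward, 0) / exp, 1.0) if exp else None
        for ward, exp in expected.items()
    }
```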

They compiled weekly reports on hand hygiene compliance for each individual ward. With this information presented by the respective ward manager, hospital staff were further encouraged to improve their compliance.

This intervention has led to a significant reduction in infections in the hospital. Due to this great success, the hygiene specialists now want to achieve further improvement.

Further improvement

To further incentivize the staff to stay compliant at all times, the hygiene specialists want to significantly reduce the time for feedback to health-workers on their compliance.

Instead of weekly paper reports published to the ward manager, they intend to give more timely feedback to the staff via big monitors. These monitors shall visualize the current state of compliance on the ward with easy-to-understand traffic light indicators.

On the other hand, it was reported that the nearest dispensers were often just empty and the necessary hand hygiene was skipped due to the extra time it took to leave the room and find the next functioning dispenser.

To solve this problem, technical maintenance staff need to receive near real-time information on the fill level of each dispenser installed. This would improve refill management and thus keep the devices operational at all times.

The long-term vision

To further strengthen the prevention control, the hygienists intend to implement near real-time and individual feedback to the healthcare workers.

Photo by Mehmet Keskin on Unsplash

Individualization of feedback

Instead of general reports on ward-level the reports and dashboards should be further personalized. Although this example is fictional, it’s important to understand necessary security aspects you need to pay attention to when working with tracking data or potential private personal information (PPI). Any information that could be used to establish a link to an individual’s identity or violate other privacy concerns must be protected from unauthorized access.

  • Compliance reports and dashboards showing their hygiene performance, improvement opportunities and individual compliance per healthcare worker compared to hospital benchmarks.
  • Gamified tracking where healthcare workers earn points for correct hand hygiene. This could include rewards or milestones to motivate ongoing compliance.

Smart and discreet technology to reduce feedback time

The idea here is to equip each healthcare worker with smart wearables (like watches) that can receive real-time, individual and discreet feedback from the monitoring system.

The intention is to incentivize correct hand hygiene right at the point of care whenever the healthcare worker starts a patient treatment. This would include visual and/or haptic feedback from the smart wearable as soon as the system detects one of the 5 hand hygiene moments.

Photo by AB on Unsplash

Here are some of the advanced ideas using smart sensors and future technologies for further improving hand hygiene compliance.

  • Advanced sensors in dispensers to detect non-compliant hand movements (e.g., not using enough liquid) and send instant feedback to the smart wearables.
  • Use patient-specific hand hygiene recommendations based on treatment information from the HIS, when entering the patient room.
  • Use ML models based on historical compliance data to predict potential non-compliant events. These models could send pre-emptive reminders to staff at high-risk moments based on real-time data from different monitoring systems.
  • Create sanitization zones around hospital beds where hand hygiene compliance is tracked using spatial sensors and granular treatment information of the individual patient. Smart wearables could interact with these zones to prompt compliance as soon as a healthcare worker enters the area, all based on real-time occupancy and hygiene data.

The next step after this initial business case analysis is an assessment of the current architecture in terms of its ability to support the short and medium-term goals and its potential to realize the long-term vision.

We haven’t discussed every detail yet, but we have a good understanding of the overall business case. With this overview, we can start to create an initial plan to gradually adapt the current architecture. When we further discuss our solution proposal with the hygienists, we can work out the steps together with more details we’ll learn about the business case. Stay tuned for more insights!

The post Universal Data Supply: Know Your Business appeared first on Towards Data Science.

Data Architecture: Lessons Learned https://towardsdatascience.com/data-architecture-lessons-learned-3589b152a8a6/ Fri, 04 Oct 2024 00:39:27 +0000 https://towardsdatascience.com/data-architecture-lessons-learned-3589b152a8a6/ Three important lessons I have learned on my journey as data engineer and architect

The post Data Architecture: Lessons Learned appeared first on Towards Data Science.

It’s only been a few months since I once again had to experience what I sometimes refer to as the "self-satisfaction of IT."

That may sound a bit harsh, but unfortunately I experience this time and again. It can be frustrating to see IT departments actually working against their business.

I remember one specific case where a running business solution had to be migrated to another execution platform solely because of ‘technical’ reasons. Sure, business was told that this target platform would be much cheaper in maintenance, but IT didn’t offer tangible evidence on that assertion. Ultimately, the decision to migrate was driven by ‘expert knowledge’ and so-called ‘best practices’, but solely from an IT-centric perspective. It cost a fortune to migrate what worked, only to find out that the promised cost reductions didn’t materialize and, even worse, business functionality deteriorated in some cases.

IT professionals, not only in specific technology-oriented companies, tend to believe that technology, IT tools, and nowadays also data are an end in themselves.

Nothing could be further from the truth.

Although organizational changes are often recommended to improve cooperation between business and IT, the structure itself is not the critical factor. I’ve observed that companies with entirely different organizational setups can still achieve strong collaboration.

So what is the recipe for success of these companies?

What all these organizations had in common is their rigorous focus on the business. Not only in the sales related departments but in every other supporting unit, including and especially in IT. It’s the mindset and attitude of their people – the enterprise culture if you like. The willingness to scrutinize everything against this core requirement: Does it generate a business benefit?

The following practices have proven remarkably effective when it comes to focusing on business value and prevent silo thinking. Follow them to move away from a one-sided belief in technology towards a modern company that optimally interconnects its digitalized business processes through universal data supply.

Beware of silo specialisation

The definition of the ‘data engineering lifecycle’, as helpful and organizing as it might be, is actually a direct consequence of silo specialization.

It made us believe that ingestion is the unavoidable first step of working with data, followed by transformation before the final step of data serving concludes the process. It almost seems like everyone accepted this pattern to represent what data engineering is all about.

While this is a helpful general pattern for the current definition of data engineering, it is not at all the target we should aim for.

The fact that we have to extract data from a source application and feed it into a data processing tool, a data or machine learning platform or business intelligence (BI) tools to do something meaningful with it, is actually only a workaround for inappropriate data management. A workaround necessary because of the completely inadequate way of dealing with data in the enterprise today.

We should take a completely different approach that creates data as products to be exchanged in the enterprise. This comprises machine learning (ML) models, business intelligence (BI) processes including any manual processes producing valuable business results, and any operational applications that together create the digitalized business information in your enterprise. I described this approach as universal data supply.

Towards Universal Data Supply

You may think it’s all well and good to come up with a completely new idea. But we have many systems in use that follow precisely these old, well-trodden paths. Most importantly, people have come to accept this as the standard way of doing things. It feels comfortable, and the status quo remains largely unquestioned.

What really needs to happen is a rethinking of how software engineers collaborate with data engineers, and how both groups work together with the business teams. Doing this also offers a practical way to bring software engineering and data engineering closer together again in overlapping areas. It doesn’t mean fully merging these disciplines, but rather acknowledging how much they share in common.

I have previously written about the need to redefine data engineering. This goes hand in hand with the realization that all advances in the software development of applications can and should be transferred in full to the discipline of data engineering.

After we have built all too many brittle data pipelines, it’s time for data engineers to acknowledge that fundamental software engineering principles are just as crucial for data engineering. Since data engineering is essentially a form of software engineering, it makes sense that foundational practices such as CI/CD, agile development practices, clean coding using version control, Test Driven Design (TDD), modularized architectures, and considering security aspects early in the development cycle should also be applied in data engineering.

But the narrow focus within an engineering discipline often leads to a kind of intellectual and organizational isolation, where the greater commonalities and interdisciplinary synergies are no longer recognized. This has led to the formation of the ‘data engineering silo’ in which not only knowledge and resources, but also concepts and ways of thinking were isolated from the software engineering discipline. Collaboration and understanding between these disciplines became more difficult. I think this undesirable situation needs to be corrected as quickly as possible.

Unfortunately, the very same silo thinking seems to start with the hype around artificial intelligence (AI) and its sub-discipline machine learning (ML). ML engineering is about to create the next big silo.

Although there is no doubt that much in the development of ML models is different from traditional software development, we must not overlook the large overlaps that still exist. We should recognize that most of the processes involved in the development and deployment of an ML model are still based on traditional software development practices. The production ready ML model is essentially just another application in the overall IT portfolio that also needs to be integrated in universal data supply as any other software application.

Consequently, we should only specialize in areas that are truly different, collaborate in overlapping areas and take great care to avoid silo thinking.

Model your business

Yes, it’s true what many consultants often emphasize. Without a clear understanding of your business processes, it’s impossible to structure your IT or Data Architecture in a way that effectively aligns with your business needs.

To effectively model your business, you need engineers who have a deep understanding of your business operations – and that goes for software, data and machine learning engineers alike. Achieving this understanding requires close cooperation between your business teams and IT professionals.

Since this is non-negotiable, it’s essential to promote a culture of collaboration and encourage end-to-end thinking across teams.

IT serves the business

We need IT professionals who view technology solely as a tool to support the business case – nothing more, nothing less. While this might sound obvious, I’ve often noticed, especially in larger companies, that IT thinking tends to become disconnected from business objectives.

This situation may have been encouraged by CIOs attempting to reinvent IT departments where IT should no longer merely act as a cost-cutting or supporting unit, but should itself contribute to revenue generation.

While IT can contribute to innovation and new digital products, its primary role remains enabling and enhancing business operations, not acting as an independent entity with separate goals.

Business needs IT to be efficient and smart

On the other hand, it’s essential for business people to recognize that modern IT technology enables them to achieve things for customers that were previously impossible. It’s not just about the growing ability of IT to automate business processes; it’s also about creating products and services that could not even exist without this technology. Moreover, it involves empowering the company through the intelligent use of available data for optimal operations and for continuous improvement of business processes.

This requires end-to-end thinking from both business and IT professionals. Close collaboration and intensive exchange of ideas and perspectives help to stay aligned. The silo specialization mentioned above is just as harmful between business units and IT as it is within IT departments.

Business models shape your IT models

Business models are almost always process models. As data is just seen as the means to exchange information between business processes, data teams need to focus on organizing this exchange in the most efficient and seamless way possible.

Data models derived from the business models are absolutely crucial to organize the exchange. While it’s the main responsibility of Data Engineering to offer governance and guidance to moderate the overall modeling process in the enterprise, it’s essentially a joint exercise of all involved parties.

Unlock data from your applications

In information theory everything starts with data. Even logic is derived from data when it is compiled or interpreted from source code. Hence, we could argue that all data is to be kept in applications that implement the logic.

However, I have argued that data, which does not represent logic, has fundamentally different characteristics compared to applications. As a result, it seems much more efficient to manage data separately from the applications.

This fact, together with the reality that data in RAM is volatile and therefore needs to be persisted in durable storage, is the main reason why data engineering is justified as a discipline of its own. The sole purpose of data engineering is therefore to manage and organize data independently from applications. Data engineering must provide the infrastructure to unlock the data from applications and enable its seamless sharing between these applications.

This is essentially the same challenge that relational databases faced when they became so popular that they were expected to serve as the enterprise-wide shared data storage for any application. Today, we recognize that one type of database isn’t enough to meet all the diverse requirements. However, the concept of a data infrastructure that allows for data sharing across all applications remains compelling.

To achieve this, we need to reconceptualize the shared database as a flexible Data Mesh that is highly distributed, can support both batch and stream processing, and enables the integration of business data with business context (also known as metadata and schema) into data as products. The mesh facilitates the seamless sharing of these products across all applications. This explicitly includes ML models that are derived from data and finally deployed as intelligent applications to generate valuable predictions at inference time, which can also be treated as new data to be shared across the enterprise.

Moreover, any results that business analysts derive with the business intelligence tools given to them as ‘end-user’ utilities should also be acknowledged as valuable new business data. Although these ‘end-user maintained applications’ are often not considered part of the official IT application portfolio, they generate important business information that is crucial for the organization. As a result, any business data generated by these end-user applications also needs to be embraced by the engineering teams.

Digitalize business context as data

The data infrastructure enables you to combine your data with rich business context, allowing every consumer to accurately interpret the provided business data independently from the source applications. This capability is often referred to as the semantic layer within a data architecture.

However, it is the responsibility of the data producers, i.e. the owners of the applications, to provide the necessary business context. It is not the task of the data team to provide this information detached from the source application. Data engineers cannot reconstruct what the responsible business departments have failed to deliver. This is the main reason why I advocate not implementing business logic in data teams.

Instead, the data engineering team should focus on delivering the technical infrastructure and governance processes to support all business units in making this business context readily available to everyone.

From an organizational perspective, business teams must supply the content and rules that enable software engineers to deliver data as products that can be shared throughout the enterprise, leveraging the data mesh established by data engineers.
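As a rough illustration of this split of responsibilities, the producing domain team might author field-level business context like the following, while data engineering only supplies the generic machinery that validates and distributes it. The structure and all names are assumptions made for this sketch, not a standard or an existing tool.

```python
# Business context authored by the producing domain team (illustrative content only).
order_context = {
    "dataset": "sales.orders",
    "owner": "order management department",
    "fields": {
        "order_id":   {"meaning": "unique order number as printed on the invoice"},
        "net_amount": {"meaning": "order value excluding VAT", "unit": "EUR"},
        "status":     {"meaning": "order lifecycle state", "allowed": ["OPEN", "SHIPPED", "CANCELLED"]},
    },
}


def has_complete_context(context: dict) -> bool:
    """Generic check provided by data engineering: it enforces that business context
    exists, but never invents or interprets the business meaning itself."""
    return bool(context["fields"]) and all("meaning" in f for f in context["fields"].values())


assert has_complete_context(order_context)
```

The check deliberately knows nothing about orders or EUR amounts; interpreting the content remains with the business side.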

Decentralize

If an organization is large enough to have recognized that its business processes are too complex to be implemented in a single application, we should embrace decentralized architectures and principles to manage the loosely coupled applications and their data as a coherent whole.

While IT is frequently discussed when considering centralization, the only areas that truly require central management are foundational services like IT infrastructure, security, governance, and compliance.

I believe that the idea of collecting data from sources organized in a highly decentralized IT application infrastructure into a central data repository or platform (be it a data warehouse or a data lake(house)) is doomed to failure. The decentralized model of universal data supply seems to be the better way to truly empower business with the data that is created and transformed across all the different applications.


If we prevent silo thinking in the company by actively promoting collaboration and end-to-end thinking, we will not only have more efficient IT departments, but also a better alignment with business goals.

The rigorous focus on business objectives in IT departments ultimately leads to better applications and application-independent information models. This enables the efficient use of all available data for optimal operation and continuous improvement of business processes and their applications.

Centralizing data in a single repository (like a data warehouse or data lake) is increasingly unsustainable for large, complex organizations. We need to manage the exchange of data in a decentralized data architecture as described in universal data supply.


Implementing universal data supply as a new data architecture for enterprises holds great promise. However, the challenge lies in how to transition from our current systems without necessitating a complete redesign. How do we evolve our architecture toward this innovative concept without discarding what already works?

Well, the good news is that we can make incremental improvements. We won’t need a brand-new data platform, machine learning platform, or a complete overhaul of our existing IT architecture. We can also retain our data warehouse and data lake(house), although redefined in scope and role.

I will be launching a new series of articles that will outline practical strategies for implementing universal data supply, drawing on concrete industry examples. You can expect a step-by-step guide to adopting this decentralized approach. Stay tuned for more insights!

The post Data Architecture: Lessons Learned appeared first on Towards Data Science.

Data Empowers Business https://towardsdatascience.com/data-empowers-business-3120a6632081/ Sun, 22 Sep 2024 17:01:43 +0000 https://towardsdatascience.com/data-empowers-business-3120a6632081/ Exploiting the full potential of universal data supply

The post Data Empowers Business appeared first on Towards Data Science.

There is a lot of talk about the value of data being the new gold. Many companies are therefore pouring large sums of money into becoming data-driven. It is sold as a completely new way of doing business, almost as if business has never been driven by data before.

Sure, many companies struggle to become fully data-driven, but I think the bigger issue is that being data-driven is not sufficient at all.

We all seek to extract something magical from the data. That’s why we collect data in big data lakes, data warehouses and all the other big data collections that have not yet received a catchy marketing label. I believe we all fall into the trap of wanting to unearth a treasure trove of data. We somehow can’t escape the thinking that these massive data collections are the foundation of some analytical superpower that will ultimately propel our businesses to new heights.

Data technology and Data Engineering teams are then often seen as the decisive factor in converting data into business benefits. While technology and engineering are undeniably important, it takes a more holistic approach to truly transform a business into a digital powerhouse.

In this article I will approach the matter from the business side and explain why simply being data-driven isn’t enough. Believe me, you’re not going to take your company to the next level by creating a ‘data team’ to do some magic with your data.

I will explain how business should actually drive IT technology and why current data strategies often fall short in unlocking the full potential of data-empowered companies.

Business View

The functioning of a company can be understood as the interaction of many individual business processes to form a coherent whole. Each business process makes a small contribution to the realization of the company’s value proposition, which is usually the provision of products and services. Business people execute the processes on behalf of the organization based on their knowledge and the available process descriptions. The customer ultimately receives the result as the requested product or service.

The enterprise is a complex adaptive system – Image by author

While this looks simple at first sight, the enterprise is rather a complex adaptive system that intensively interacts with the outside world. The internal processes are numerous and never static. They constantly evolve based on decisions and actions from the employees. Employees are supported by applications that partially or completely digitalize the business processes. Data as intermediate output of digitalized business processes is extensively shared in the company and with the outside world via channels.

In fact, data itself is distributed across the entire organization. It is kept private inside the business processes, managed solely by the process itself or the process owner. Data to be shared via channels comprises all the information that other processes need to fulfill their business goals.

Hence, we have private business data on the inside of processes and public data that needs to be shared with other applications.
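A small sketch of this separation, with purely hypothetical names: the process keeps its working data private and exposes only the public business data that other processes need, via a shared channel.

```python
from dataclasses import dataclass


@dataclass
class OrderShipped:
    """Public business data: what other processes need to know."""
    order_id: str
    shipped_on: str


class ShippingProcess:
    """A digitalized business process with private working data and a public channel."""

    def __init__(self, channel: list):
        self._pick_lists: dict[str, list[str]] = {}  # private data, owned by this process
        self._channel = channel                      # shared data channel

    def ship(self, order_id: str, items: list[str], shipped_on: str) -> None:
        self._pick_lists[order_id] = items                         # stays inside the process
        self._channel.append(OrderShipped(order_id, shipped_on))   # shared with other processes


channel: list[OrderShipped] = []
ShippingProcess(channel).ship("ORD-1", ["P-42", "P-43"], "2024-09-22")
print(channel)  # only the public event is visible outside the process
```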

Data is a second thought in business

The business is typically completely process-driven, with data used solely as a means to transfer information. A data channel has one and only one task: to enable the exchange of information.

The business world mainly thinks in terms of processes, not data. You will find tons of information about business process reengineering, but a discipline of ‘business data reengineering’ doesn’t even exist.

Sure, we have analytical processes that require a lot of input data. All of that data needs to be modeled and structured, but business thinking is always driven by processes. We have to fulfill a customer order, comply with official regulations, prepare a balance sheet and income statement, or train a model to automatically determine the price of a product to be sold. There is a process behind each and every data requirement.

The fact that we have business processes all the way through the enterprise should, from my point of view, drive IT architecture and, in particular, the discipline of data engineering. Let’s face the truth and recognize that data has no value if it is not required for a specific business process. I think it is the wrong idea to collect data in ever larger data repositories without a clear business process requirement.

And to be clear, the wish to be given ‘all data available in the enterprise’ in order to do ‘something intelligent’ with it is not such a clear business requirement.

What does data-empowered mean?

Data in itself has no value. It’s merely digitalized information.

It can drive the business forward if we are able to provide each business process with exactly the data it needs. And it can even empower the business if we are able to derive insight and ultimately knowledge from it to improve the way we are doing business.

Data therefore seems to serve two purposes: to enable business operations and to form the basis for analytical evaluations that inform decisions for improving and further developing the business.

What does it then mean for a company to be data-empowered?

A data-empowered company uses its data to optimally operate the business processes, continually refining them through informed decisions backed by advanced data analytics.

A company that is only concerned with maintaining the status quo would soon bleed to death. This means that analytical processes are at least as important as the operational ones. We are currently witnessing a surge in artificial intelligence deployment to develop smarter applications with ever more potential for accelerating the digital transformation. The importance of analytical processes, including leveraging machine learning to train intelligent models from data, will inevitably continue to grow.

However, analytical processes can tolerate longer periods of downtime. A temporary inability to make intelligent decisions and enhance business processes won’t immediately jeopardize the company’s survival. IT has therefore traditionally drawn a technical distinction between operational processes and the less critical analytical processes.

Due to this technical distinction, our systems and the data have been categorically divided into operational and analytical realms. However, from a business perspective, this rigid technical separation makes no sense as both processes and their results are equally essential for the company to thrive.

In practice, there’s a continuous and dynamic interaction between operational and analytical business processes. For example, a data warehouse may supply data to train a machine learning model, which then generates key input (e.g. the price for a product) for an operational sales process. This process, in turn, can produce valuable customer behavior data for further analysis.
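Here is a toy sketch of that loop for a simplified pricing case. The data, model and function names are invented for illustration, and scikit-learn merely stands in for whatever training infrastructure is actually used: analytical data trains a model, the model feeds an operational step at inference time, and that step emits fresh behaviour data for the next round of analysis.

```python
# Assumes scikit-learn is available; any regression library would do.
from sklearn.linear_model import LinearRegression

# 1. Analytical side: train a pricing model on a (heavily simplified) warehouse extract.
historic_units_sold = [[10], [20], [30]]   # feature: units sold last month
historic_prices = [12.0, 11.0, 10.0]       # target: prices that worked well
model = LinearRegression().fit(historic_units_sold, historic_prices)

behaviour_log: list[dict] = []             # new raw material for the next analysis


# 2. Operational side: the sales process uses the prediction at inference time.
def quote_price(units_sold_last_month: int) -> float:
    price = round(float(model.predict([[units_sold_last_month]])[0]), 2)
    # 3. The operational step itself produces fresh customer-behaviour data.
    behaviour_log.append({"units": units_sold_last_month, "quoted_price": price})
    return price


print(quote_price(25))
print(behaviour_log)
```

Swap the toy lists for warehouse tables and the in-memory log for a shared data product, and the same loop describes the continuous interaction sketched above.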

Business people don’t differentiate between operational and analytical data. While we can classify processes as having operational or analytical purposes, data must seamlessly flow between all of them.

Data empowers business when it is accessible and usable across all business processes, whether analytical or operational.

From processes to data

A data model can always be distilled from a process model. To better understand this, it helps to consider the analogy between a business process and the application that actually digitalizes this process.

We said that the functioning of a company can be understood as the interaction of all business processes. Consequently, the sum of all applications (or microservices including orchestration) can be understood as the digitalized subset of all business processes in a company.

Business processes interconnect by exchanging information modeled as business data. For business data to be usable, consuming applications require rich information about the context and the data lineage. However, if we digitalize only the basic information without the business context and provenance, it is extremely difficult to interpret the data.

In an analogue world we would then call the process owner and ask for the missing business context. In the digital world, the consuming application must be able to extract directly from the business data everything it needs to fulfill its goal. The fundamental issue to be solved by data engineering is that business data should be able to exist independently of its generating business processes without any loss of business context information.

The current practice of preserving business context for data is to create separate metadata and enterprise data models. However, metadata is often incomplete and kept separate from the data it is meant to explain. And data models are often static, at best high-level representations buried in technical modeling tools and rarely up to date with what is really going on in a living enterprise.

The other practice is to keep data encapsulated in applications, so that consumers can query the source application for any missing context information. However, data is fundamentally different from applications, and we should avoid delivering data as an application.

From the business perspective, it’s all about the correct digitalization of the entire business context, which is necessary to seamlessly link all business processes to a functioning whole for the enterprise.

Business drives technology

If we are serious about business driving IT technology and data engineering, we need to support the process view as comprehensively as possible. This means that data is only the enabler for the applications, which are the first-class citizens in enterprise IT technology.

Data engineering is crucial to enable the lossless exchange of the entire business context between applications in the company and beyond. Modern data technology can of course also be used to implement business logic. But this logic must be owned by the business people and their application developers. Data engineers must implement the data channels with the sole task of enabling lossless information exchange between all these applications.

Universal data supply addresses exactly this requirement with a Data Mesh. It organizes the company-wide distributed data flow between all applications without loss of context, with minimal delay, and without the need for centralized data repositories separated from the operational world.

This requires a completely different approach compared to traditional data architectures such as the data warehouse or data lake(house), which are mainly about extracting and gathering data from applications while losing the business context and data provenance.

I have described an approach that fulfills universal data supply based on the original data mesh principles, but with important adjustments. Don’t implement a data mesh without these adjustments to ensure that data truly empowers business.

Challenges and Solutions in Data Mesh


Universal data supply empowers business with data and is characterized by these principles:

  • Business drives technology. Technology supports and empowers business.
  • Applications implement business logic defined by the business processes. They serve both operational and analytical purposes in a well integrated way – analytical processing is in any case a matter for the application developers and by no means just for the data team.
  • Public data, modeled following modern principles, is the means to exchange information between all applications.
  • All public data must flow seamlessly between all applications and be self-sufficient, i.e. it must encapsulate the entire business context including its provenance at all times.

If you find this information useful, please consider clapping. I would be more than happy to receive your feedback, opinions, and questions.

The post Data Empowers Business appeared first on Towards Data Science.
