In this article, I aim to delve into the various types of data platform architectures, taking a closer look at their evolution, strengths, weaknesses, and practical applications. A key focus will be the Data Mesh architecture and its role in the Modern Data Stack (MDS) and today’s data-driven landscape.
It’s a well-known fact that the architecture of a data platform profoundly affects its performance and scalability. The challenge often lies in selecting an architecture that best aligns with your specific business needs.
Given the overwhelming multitude of data tools available in the market today, it’s easy to get lost. The Internet articles I see now and then on this topic are often highly speculative. Questions about which tools are best, who leads the industry, and how to make the right choice can be very frustrating. This story is for data practitioners who would like to learn more about data platform design and which architecture to choose in each scenario.
Modern data stack
I keep hearing this term on almost every data-related website on the Internet. Every single LinkedIn data group offers a dozen posts on this topic. However, the majority of them cover just the data tools and don’t emphasize the importance of strategic considerations in the data platform design process.
So what is a "Modern Data Stack" and how modern is it?
In essence, it’s a set of tools designed to help you work with data. Depending on your business goals, these tools might include managed ETL/ELT data connectors, Business Intelligence (BI) solutions, data modelling tools (Dataform, DBT, etc.) and more. At times, the focus isn’t necessarily on how modern these tools are, but on how effectively they meet your needs.
As data flows through the pipeline, it is processed, ingested into a data warehouse or a data lake, transformed and visualised. All these steps help decision-makers get data insights and activate on them promptly – whether it’s a user retention story [1] or a fraud detection algorithm in a banking system.

I previously described how to activate on real-time user engagement data in one of my earlier stories.
Scalability
The Modern Data Stack (MDS) is all about simplicity, convenience for end users and scalability. Indeed, it doesn’t matter how great your BI tool’s features are if your major stakeholders don’t like it or don’t know how to use it. Among other great features, it may have Git integration and robust CI/CD [2] capabilities for data pipelines, but ultimately that doesn’t matter if it can’t render your dashboard or report into a PDF document. Data modelling is important, and we create data models using modern tools like DBT and Dataform (GCP). Don’t repeat yourself – use the DRY [2] approach everywhere you can.
Data modelling is useless if it’s done the wrong way.
These data models must be reusable and unit-tested. According to DBT:
DRY is a software development principle that stands for "Don’t Repeat Yourself." Living by this principle means that your aim is to reduce repetitive patterns and duplicate code and logic in favor of modular and referenceable code.
The DRY approach not only saves you time and effort but also enhances the maintainability and scalability of your code. Perhaps most importantly, mastering this principle can be the key factor that elevates you from being a good analytics engineer to a truly exceptional one. It is something I consistently aim for, as it makes my code reusable and reliable.
Data Lake vs Data Warehouse vs Data Mesh
In my experience, it has never been just a single type of data platform architecture. Data lakes are a better fit for users with coding skills and are usually an ideal choice for unstructured and semi-structured data processing. Images, text documents, voice messages, etc. – everything can be stored in a data lake and therefore can be processed. This makes this architecture extremely useful.
Dealing with Big Data is usually easier within data lakes, which makes them the top choice for extra-large data volumes.
The downside is that our data users must be capable of working with one of the popular data lake tools – Dataproc, EMR, Databricks or Galaxy. Usually, companies struggle to find good data developers with great Python, SQL and data modelling skills. This is what makes the Data Engineer role so popular [2].
These distributed computing tools adapt well to growing data workloads and scale to ensure the data is processed by the time it is needed.
For companies with fairly simple data pipeline designs and moderate data volumes, a hybrid architecture that blends elements of both data warehouses and data lakes might be the optimal choice.
It combines the advantages of each data platform architecture, and in my projects I use it most often. Data analysts are happy as it offers the ability to run interactive SQL queries while maintaining a high level of flexibility for customization in case my data pipelines need to scale their processing resources. Modern data warehouse solutions often support interactive queries on data stored in a data lake, such as through external tables. For example, a data pipeline might be designed as follows:
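To make this concrete, here is a minimal sketch of such a hybrid pipeline, assuming Google Cloud Storage as the data lake and BigQuery as the warehouse; the bucket, dataset and table names are hypothetical placeholders:

# A minimal hybrid lake/warehouse sketch: raw Parquet files live in the
# data lake (GCS) while analysts query them through a BigQuery external table.
from google.cloud import bigquery, storage

# 1. Ingest a raw file into the data lake (hypothetical bucket and path).
storage.Client().bucket("my-data-lake").blob(
    "events/2024-01-01.parquet"
).upload_from_filename("events.parquet")

# 2. Expose the lake data to SQL users via an external table in the warehouse.
client = bigquery.Client()
external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://my-data-lake/events/*.parquet"]

table = bigquery.Table("my-project.analytics.raw_events")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)

# 3. Analysts can now run interactive SQL directly on data stored in the lake.
rows = client.query("SELECT COUNT(*) AS events FROM analytics.raw_events").result()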

It helps a lot with data migration as well. Indeed, any data model can be built on top of the data lake instead. Tools like DBT and Dataform still use an ANSI-SQL dialect in the end, and the migration task would be a simple change in the data adapter.
Data Mesh platforms are designed to support the ability to share data across different departments seamlessly. This is a typical use-case scenario for companies going through a merger or acquisition process. Indeed, consider a company with a data warehouse where data is ingested from the data lake and then processed using DBT to visualise the reports in Looker. Now consider it being acquired by a bigger company whose data stack is mainly a data lake with streaming components, where data is processed in Airflow pipelines (batch) and in standalone real-time data services (streaming).
Data Mesh helps to integrate different platforms and data domains.
A data mesh architecture represents a decentralized approach to data management that empowers your company to handle data autonomously and conduct cross-team and cross-domain analyses.
In a data mesh setup, each business unit may possess a diverse set of programming skills, such as SQL or Python, and have varying data processing needs, from flexible data handling to interactive SQL queries. As a result, each unit has the freedom to select its preferred data warehouse or data lake solution tailored to its specific requirements. Despite these individual choices, the architecture facilitates seamless data sharing across units without the need for data movement, ensuring that data remains accessible and integrated across the organization.
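For example, when two business units happen to both run on BigQuery, in-place sharing can be as simple as the sketch below (project, dataset and table names are hypothetical placeholders):

# Query a dataset owned by another business unit (a different GCP project)
# in place: no data is copied or moved, only read access is required.
from google.cloud import bigquery

client = bigquery.Client(project="marketing-analytics")  # our own project

sql = """
    SELECT o.order_id, o.amount, c.segment
    FROM `finance-data-platform.sales.orders` AS o   -- the other unit's data
    JOIN `marketing-analytics.crm.customers` AS c    -- our own data
    USING (customer_id)
"""
rows = client.query(sql).result()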
Easier said than done.
In one of my previous stories, I looked into the evolution of Data Engineering [3].
We can witness that the data engineering landscape has significantly transformed since 2014 (when Airflow was introduced to a wider community), now addressing more sophisticated use cases and requirements, including support for multiple programming languages, integrations, and enhanced scalability.
This collaborative "Data Mesh" setup is crucial as it allows each team to operate within the same environment while avoiding disruptions to each other’s workflows, thus fostering a more integrated and efficient approach to data handling and analysis.
My research underscores the importance of several key trends in data pipeline orchestration. These include a focus on heterogeneity, support for a range of programming languages, effective utilization of metadata, and the adoption of data mesh architecture. Embracing these trends is crucial for developing data platforms that can adapt to diverse needs and scale efficiently.
Heterogeneity is crucial for creating modern, robust, and scalable data platforms.
Implementing a Data Mesh architecture
The easiest way to implement a data mesh is to follow these principles:
- Use microservices: Data Mesh is distributed by definition, it’s not monolithic. Creating a dedicated API service/orchestrator makes perfect sense: it can act as a data hub for all other data processing services, which can invoke the data hub whenever needed and vice versa (see the sketch after this list).
- Split data environments: Creating separate environments for development, production and automated testing is important. It simplifies new data model and pipeline deployments and keeps our code tested and reusable.
- Consider the "DRY" approach always: Using other data projects and their repositories as packages makes life easier.
- Use infrastructure as code: This is a must and helps to maintain our pipelines [4]. Infrastructure as code is becoming an increasingly popular approach for managing data platform resources. Among software engineers, it is pretty much a standard these days.
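To make the first principle more concrete, here is a minimal sketch of what such a data hub API could look like; the project registry and downstream trigger URLs are hypothetical placeholders, and in practice an orchestrator like Mage can expose its own API triggers for this:

# A thin "data hub" API that other data services can call to trigger
# pipelines in the project that owns them. Project names and downstream
# trigger URLs are hypothetical placeholders.
import httpx
from fastapi import FastAPI, HTTPException

app = FastAPI()

# Hypothetical registry: each data domain owns its pipelines and exposes
# a trigger endpoint (e.g. an orchestrator API trigger or another service).
PROJECT_REGISTRY = {
    "marketing": "http://marketing-pipelines.internal/api/trigger",
    "finance": "http://finance-pipelines.internal/api/trigger",
}

@app.post("/trigger/{project}/{pipeline}")
async def trigger_pipeline(project: str, pipeline: str) -> dict:
    """Forward a pipeline-run request to the data project that owns it."""
    url = PROJECT_REGISTRY.get(project)
    if url is None:
        raise HTTPException(status_code=404, detail=f"Unknown project: {project}")
    async with httpx.AsyncClient() as client:
        response = await client.post(url, json={"pipeline": pipeline})
        response.raise_for_status()
    return {"project": project, "pipeline": pipeline, "status": "triggered"}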
Consider the example below, where I implemented Data Mesh using the Mage orchestrator tool. I’ve created a service responsible for orchestrating all my data pipelines from the other data projects I have.
For instance, one of my data pipelines pulls another GitHub repository with a DBT project in it. This project then becomes a package that I use in my orchestrator:
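As a simplified sketch of that step, the orchestrator’s dbt project can declare the external repository as a package and re-install it before each run; the repository URL and project path below are hypothetical placeholders:

# A sketch of the pipeline step that pulls the packaged dbt project.
# The orchestrator's dbt project declares the dependency in packages.yml,
# roughly like this (the repository URL is a hypothetical placeholder):
#
#   packages:
#     - git: "https://github.com/my-org/erd.git"
#       revision: "main"
#
import subprocess

def refresh_dbt_packages(project_dir: str) -> None:
    """Re-install packaged dbt projects so every run uses the latest code."""
    subprocess.run(["dbt", "deps"], cwd=project_dir, check=True)

# Hypothetical path to the orchestrator's dbt project inside the container.
refresh_dbt_packages("/home/src/datahub/dbt/datahub")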

Then if I run my pipeline, it will install the erd project as a dependency:

This approach helps me to use other DBT projects in my Data Hub project. Every time the data pipeline runs, it pulls the code from the package repositories to ensure everything is up to date.
How do we deploy it?
Okay, we ran it locally but how do we deploy it?
While this task can be complex, employing infrastructure as code can significantly enhance the scalability and maintainability of the code. I previously covered this topic in a tutorial on orchestrating Machine Learning pipelines with AWS Step Functions [5].
This represents a typical data flow or data platform setup for many companies. The real challenge lies in deploying data pipelines and managing the associated resources effectively. For example, deploying a machine learning pipeline across production and staging environments involves careful resource management.
I deployed my Data Hub project using Terraform [6]. A few simple commands like the ones below will do the job:
terraform init
terraform plan -out=PLAN_dev
terraform apply -input=false PLAN_dev
The output will be something like this:
Apply complete! Resources: 16 added, 0 changed, 0 destroyed.
Outputs:
load_balancer_dns_name = "http://datahub-dev-alb-123.eu-west-1.elb.amazonaws.com"
As you can see, infrastructure as code makes it really easy to deploy the required resources, including VPCs, subnets, load balancers and others.
For instance, the Terraform folder structure where I describe the resources I need looks like this:
.
├── PLAN_dev
├── alb.tf
├── backend.tf
├── efs.tf
├── env_vars.json
├── iam.tf
├── main.tf
├── provider.tf
├── variables.tf
└── versions.tf
My main Terraform definition file main.tf looks like this. It creates the ECS cluster, the task definition and all other required resources:
resource "aws_ecs_cluster" "aws-ecs-cluster" {
name = "${var.app_name}-${var.env}-cluster"
setting {
name = "containerInsights"
value = "enabled"
}
tags = {
Name = "${var.app_name}-ecs"
Environment = var.env
}
}
// To delete an existing log group, run the cli command:
// aws logs delete-log-group --log-group-name app-name-production-logs
resource "aws_cloudwatch_log_group" "log-group" {
name = "${var.app_name}-${var.env}-logs"
tags = {
Application = var.app_name
Environment = var.env
}
}
resource "aws_ecs_task_definition" "aws-ecs-task" {
family = "${var.app_name}-task"
container_definitions = <<DEFINITION
[
{
"name": "${var.app_name}-${var.env}-container",
"image": "${var.docker_image}",
"environment": [
{"name": "ENV", "value": "dev"},
{"name": "MAGE_EC2_SUBNET_ID", "value": "${var.mage_ec2_subnet_id}"}
],
"command": [
"mage", "start", "datahub"
],
"essential": true,
"mountPoints": [
{
"readOnly": false,
"containerPath": "/home/src",
"sourceVolume": "${var.app_name}-fs"
}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "${aws_cloudwatch_log_group.log-group.id}",
"awslogs-region": "${var.aws_region}",
"awslogs-stream-prefix": "${var.app_name}-${var.env}"
}
},
"portMappings": [
{
"containerPort": 6789,
"hostPort": 6789
}
],
"cpu": ${var.ecs_task_cpu},
"memory": ${var.ecs_task_memory},
"networkMode": "awsvpc",
"ulimits": [
{
"name": "nofile",
"softLimit": 16384,
"hardLimit": 32768
}
],
"healthCheck": {
"command": ["CMD-SHELL", "curl -f http://localhost:6789/api/status || exit 1"],
"interval": 30,
"timeout": 5,
"retries": 3,
"startPeriod": 10
}
}
]
DEFINITION
requires_compatibilities = ["FARGATE"]
network_mode = "awsvpc"
memory = var.ecs_task_memory
cpu = var.ecs_task_cpu
execution_role_arn = aws_iam_role.ecsTaskExecutionRole.arn
task_role_arn = aws_iam_role.ecsTaskExecutionRole.arn
volume {
name = "${var.app_name}-fs"
efs_volume_configuration {
file_system_id = aws_efs_file_system.file_system.id
transit_encryption = "ENABLED"
}
}
tags = {
Name = "${var.app_name}-ecs-td"
Environment = var.env
}
# depends_on = [aws_lambda_function.terraform_lambda_func]
}
data "aws_ecs_task_definition" "main" {
task_definition = aws_ecs_task_definition.aws-ecs-task.family
}
resource "aws_ecs_service" "aws-ecs-service" {
name = "${var.app_name}-${var.env}-ecs-service"
cluster = aws_ecs_cluster.aws-ecs-cluster.id
task_definition = "${aws_ecs_task_definition.aws-ecs-task.family}:${max(aws_ecs_task_definition.aws-ecs-task.revision, data.aws_ecs_task_definition.main.revision)}"
launch_type = "FARGATE"
scheduling_strategy = "REPLICA"
desired_count = 1
force_new_deployment = true
network_configuration {
# subnets = aws_subnet.public.*.id
subnets = var.public_subnets
assign_public_ip = true
security_groups = [
aws_security_group.service_security_group.id,
aws_security_group.load_balancer_security_group.id
]
}
load_balancer {
target_group_arn = aws_lb_target_group.target_group.arn
container_name = "${var.app_name}-${var.env}-container"
container_port = 6789
}
depends_on = [aws_lb_listener.listener]
}
resource "aws_security_group" "service_security_group" {
# vpc_id = aws_vpc.aws-vpc.id
vpc_id = var.vpc_id
ingress {
from_port = 6789
to_port = 6789
protocol = "tcp"
cidr_blocks = var.allowed_ips
security_groups = [aws_security_group.load_balancer_security_group.id]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
ipv6_cidr_blocks = ["::/0"]
}
tags = {
Name = "${var.app_name}-service-sg"
Environment = var.env
}
}
Read more about it here [7].
The main project has a data environment split based on the ENV environment variable used in all my projects. Running the DBT model from a package repository will look like this:
--select my_second_dbt_model --target {{ env_var('ENV') }}
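Outside of any specific orchestrator, the same environment-driven target selection boils down to something like the sketch below; the dbt project path is a hypothetical placeholder, and it assumes matching targets are defined in profiles.yml:

# Run a model from the packaged dbt project against the environment-specific
# target (dev, staging or production), driven by the ENV variable.
import os
import subprocess

env = os.environ.get("ENV", "dev")

subprocess.run(
    ["dbt", "run", "--select", "my_second_dbt_model", "--target", env],
    cwd="/home/src/datahub/dbt/datahub",  # hypothetical dbt project path
    check=True,
)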
This is what truly makes it powerful. Now, data analysts and data scientists across different departments can collaborate seamlessly, running SQL, Python, and R in a unified environment. This integration allows them to focus their development efforts together, streamlining their workflows and boosting productivity.
This example highlights the integration of different data domains.
The next steps
The next step would be to expand this approach further and integrate data connectors and pipelines in one place. Modern data-driven applications require a robust transactional database to manage current application data. When implementing such applications, consider utilizing OLTP databases, for example managed with Amazon RDS. Data Mesh helps to integrate these resources as well. In our example, we would want to create a Python data connector that extracts data from RDS.
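A minimal sketch of such a connector could look like the following, assuming a PostgreSQL RDS instance; the connection string, table and bucket names are hypothetical placeholders:

# Extract yesterday's transactions from an OLTP database (Postgres on RDS)
# and land them in the data lake as Parquet.
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical RDS endpoint and credentials (use a secrets manager in practice).
engine = create_engine(
    "postgresql+psycopg2://app_user:app_password"
    "@my-app-db.abc123xyz.eu-west-1.rds.amazonaws.com:5432/appdb"
)

df = pd.read_sql(
    "SELECT * FROM transactions WHERE created_at >= current_date - 1",
    con=engine,
)

# Requires s3fs and pyarrow; the bucket name is a hypothetical placeholder.
df.to_parquet("s3://my-data-lake/raw/transactions/latest.parquet", index=False)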
Each component in the data platform ecosystem – data lakes, data warehouses, lakehouses, and databases – offers unique advantages and serves distinct purposes, but it is rarely just one of them.
Generally, the optimal choice will depend on cost efficiency and compatibility with your development stack.
Testing various options can reveal how well a data source or a project integrates into your data platform, whether it’s a data lake or a data warehouse. Numerous data connectors can be managed with ease in a Data Mesh tool to facilitate seamless data extraction, regardless of the underlying architecture.
However, several considerations are crucial:
- Alignment with Business Needs: Evaluate how well data tools align with your specific business requirements. For instance, some business intelligence (BI) tools may have a pay-per-user pricing model, which might not be ideal for sharing dashboards with external users.
- Functionality Overlap: Assess whether there is an overlap in functionality between tools. For example, determine if you need a BI solution that performs data modelling within its own OLAP cube when this is already handled by your data warehouse.
- Cost Efficiency: If cost savings are a priority, it may be advantageous to choose data tools that are provided by the same cloud vendor as your development stack.
- Data Modeling Importance: Efficient data modelling is crucial, as it affects the frequency of data processing and, consequently, processing costs.
The choice between a data lake and a data warehouse often hinges on the skill set of your users. A data warehouse solution typically offers greater interactivity and is suited for SQL-centric products such as Snowflake or BigQuery. Conversely, data lakes are ideal for users with programming expertise, making Python-focused platforms like Databricks, Galaxy, Dataproc, or EMR more suitable.
Conclusion
In the era of Data Mesh, successful collaboration is crucial for thriving in the data domain. The Modern Data Stack stems from our data platform architecture, which plays the foundational role here. Simply put, a modern data stack is often described as a set of tools that assist with managing and working with data. This is the typical explanation you’ll find in many articles online. However, this statement overlooks the crucial role of strategy: a well-articulated strategy for data platform design is far more critical than the individual features of any given data tool. I always begin a new data warehouse project with comprehensive planning and design sessions.
Proper organization of data environments also supports automated testing and continuous integration (CI) pipelines, ensuring that your data transformation scripts execute correctly according to the logic outlined in your business requirements. There are various methods for deploying data platform resources that will feed data into your data platform, and it’s beneficial to document these methods using metadata. This approach helps maintain clarity and efficiency throughout the project.
My experience suggests that it’s never just one data platform architecture type but a combination of all three that works best for your business goals.
Recommended read:
[1] https://towardsdatascience.com/user-churn-prediction-d43c53e6f6df
[2] https://towardsdatascience.com/how-to-become-a-data-engineer-c0319cb226c2
[3] https://medium.com/towards-data-science/how-data-engineering-evolved-since-2014-9cc85f37fea6
[4] https://medium.com/gitconnected/infrastructure-as-code-for-beginners-a4e36c805316
[6] https://github.com/mage-ai/mage-ai-terraform-templates/blob/master/README.md
[7] https://docs.mage.ai/production/deploying-to-cloud/aws/setup