OSI layers for the data ecosystem

Towards a flexible and layered data architecture

Whilst listening to an episode of the Data Engineering Podcast (An Exploration Of Tobias' Experience In Building A Data Lakehouse From Scratch [1]), the question "Why don't we have an OSI model for data?" resonated with me.

An OSI model is missing for data.

The lack of such a reference data architecture model is something I have observed myself in a multitude of other projects, in research as well as in industry, in particular in discussions around (re-)building a data platform or creating a data mesh. Furthermore, a lot of platforms and (SaaS) tools have spread through the data ecosystem recently, to the point where keeping them all in sync and neatly managed can become a huge challenge.

Who manages the managed services?

In particular, these various managed services are often only semi-compatible with each other, and central governance can become a problem. Furthermore, they are outside the control of the data team and its version control tooling (merge requests, approvals), and changes to these blackbox tools can pose challenges in themselves.

Commit to the pillars and not the term.

Recently, there has been a lot of buzz around terms like modern data stack, data mesh, or data fabric, and several vendors are trying to sell you one product to fulfill all your data needs. Instead, everyone would be better off focusing on the foundational pillars and sharing reference blueprints [2]. This is already happening for DevOps: microservices, service discovery, failover, blue-green deployments, or rolling releases. However, it is not taking place in the data ecosystem. As a result, manual and oftentimes imperfect solutions are built, because this situation makes it hard to build on top of prior art.

Let's organize some reusable foundational components (mesh-based data platform archetypes). In the following, I want to explore this idea further and find suitable analogies between the OSI network model and the data ecosystem as foundational pillars for a robust and flexible data platform.

OSI model

The OSI (Open Systems Interconnection) model is a framework for understanding how different networking protocols and technologies work together to enable communication between computers. It consists of seven layers, each of which performs a specific function to support the transfer of data.

Figure: the traditional OSI model (visual from https://sparkbox.com/foundry/what_are_the_7_layers_of_the_OSI_model_network_of_communication)

These layers are (nicely explained by ChatGPT):

  1. Physical layer: This layer deals with the physical connection between devices, such as the cables and connectors used to connect them. It also defines the electrical, mechanical, and functional characteristics of the interfaces between devices.

  2. Data link layer: This layer is responsible for establishing, maintaining, and terminating the connection between two devices on a single link. It ensures that the data being transmitted is delivered correctly and without errors.

  3. Network layer: This layer is responsible for routing data between devices on different links. It determines the best path for data to travel and ensures that it is delivered to the correct destination.

  4. Transport layer: This layer is responsible for providing reliable end-to-end communication between devices. It ensures that data is delivered correctly and without errors, even if some of the data packets are lost or corrupted during transmission.

  5. Session layer: This layer is responsible for establishing, maintaining, and terminating communication sessions between devices. It allows devices to exchange data in a coordinated manner.

  6. Presentation layer: This layer is responsible for translating data into a format that can be understood by the application layer. It may also provide data encryption and compression to ensure the security and efficiency of data transmission.

  7. Application layer: This is the top layer of the OSI model, and it is responsible for enabling communication between applications on different devices. It defines the interface between the application and the network, and it is where high-level protocols such as HTTP and FTP are implemented.

How OSI maps to data

A composable data architecture needs to contain capabilities. Some might build on top of each other to create higher-level, more useful services for the business.

Composable elements: be able to replace a vendor.

How do I keep running when a managed service ceases to exist or turns out to be too expensive? With open-source software I am able to self-run and operationalize it myself (and this goes both ways, i.e. it is easy to hand the load off again).

Filling the gaps of (open) vendors while keeping control: if 90% is already provided, I can contribute the missing pieces (not duct-tape them on) to build an excellent solution.

Build with inputs and outputs in mind.

1. Physical layer (storage and compute)

The Physical Layer (Layer 1) corresponds to the physical infrastructure of a network and might map to the hardware and infrastructure used to store and process data in a data architecture. A minimal code sketch follows the storage and compute lists below.

Reference architectures: simple, small, and streaming setups.

Storage

Mutable storage
  • RDBMS
  • table formats (Iceberg, Hudi, Delta) plus a metastore/catalogue on top of blob storage systems

Blob storage

Not really mutable storage: scalable, but updates are inefficient.

  • S3, GCS, ADLS
  • HDFS
  • Ozone

Specialized storage
  • graph databases
  • search engines
  • NoSQL key-value stores (HBase, Cassandra, …)
  • caches (Redis)

Streaming ledger
  • Kafka
  • Redpanda
  • Pulsar

Compute

  • Spark
  • Flink
  • Materialize, Decodable
  • traditional cloud DWHs (BigQuery, Snowflake, Redshift)
  • Trino/Presto
  • containers managed from the orchestration layer
  • data virtualization
    • modern engines
    • traditionally Oracle DB Link
  • StarRocks
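To make the storage/compute split concrete, here is a minimal PySpark sketch, assuming the Delta Lake package is on the classpath and using a hypothetical s3a://example-bucket path; any table format from the list above would work similarly:

```python
# Minimal sketch: separating compute (Spark) from storage (a table format on blob storage).
# Assumes pyspark plus the Delta Lake package are installed and the bucket is reachable;
# all bucket/path names are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("physical-layer-sketch")
    # enable the Delta table format (mutable tables on top of immutable blobs)
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# write: the table format adds ACID semantics on top of otherwise append-only blob storage
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.format("delta").mode("overwrite").save("s3a://example-bucket/bronze/events")

# read: any other engine that speaks the same table format can reuse the same bytes
spark.read.format("delta").load("s3a://example-bucket/bronze/events").show()
```

The compute engine is interchangeable here: the same table could be queried from Trino or Flink, which is exactly the decoupling this layer should provide.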

2. Data link layer (deployment & operations)

The Data Link Layer (Layer 2) is responsible for establishing and maintaining connections between devices and might map to the mechanisms used to connect different data sources and systems in a data architecture, such as APIs or ETL processes.

Kubernetes & Helm plus various operators dedicated to stateful services.

Helm for data? Keeping APIs running and handling blue-green rollovers (a small sketch follows this list).

  • all manual right now
  • nobody is sharing their data mesh blueprints [2]
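As a rough illustration of what keeping operator-managed stateful services connected can look like, the following sketch uses the official kubernetes Python client to check StatefulSet health before flipping traffic in a blue-green rollover; the namespace and the surrounding release process are assumptions:

```python
# Sketch: verifying that operator-managed stateful services are healthy before a
# blue/green rollover. Assumes the official `kubernetes` Python client is installed;
# the namespace "data-platform" is hypothetical.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster
apps = client.AppsV1Api()

all_ready = True
for sts in apps.list_namespaced_stateful_set("data-platform").items:
    ready = sts.status.ready_replicas or 0
    desired = sts.spec.replicas or 0
    print(f"{sts.metadata.name}: {ready}/{desired} replicas ready")
    all_ready = all_ready and ready == desired

# only switch traffic to the "green" release once every stateful service is fully ready
if not all_ready:
    raise SystemExit("stateful services not ready; aborting rollover")
```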

3. Network layer (metadata & governance)

The Network Layer (Layer 3) is responsible for routing data between devices and might correspond to the networking infrastructure and protocols used to transmit data between different systems in a data architecture.

  • expose an end-user-facing catalogue
  • collect metadata from business processes/transformations
  • collect metadata around data usage (what is joined, what is used)
  • PII tagging
  • OpenMetadata

Structure data products into (an illustrative sketch follows this list):

  • raw
  • internal
  • exposed API
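The following is an illustrative sketch (not a real catalog API) of how data products could be structured into these layers with PII tagging, so that a catalogue such as OpenMetadata could ingest the metadata; all names are hypothetical:

```python
# Illustrative only: a hypothetical in-code model for data product metadata.
# A real setup would push this information into a catalogue (e.g. OpenMetadata).
from dataclasses import dataclass, field
from enum import Enum


class Layer(Enum):
    RAW = "raw"
    INTERNAL = "internal"
    EXPOSED_API = "exposed_api"


@dataclass
class Column:
    name: str
    dtype: str
    pii: bool = False  # governance: flag columns that need masking/anonymization


@dataclass
class DataProduct:
    name: str
    owner_domain: str
    layer: Layer
    columns: list[Column] = field(default_factory=list)


orders = DataProduct(
    name="orders",
    owner_domain="sales",
    layer=Layer.EXPOSED_API,
    columns=[Column("order_id", "string"), Column("customer_email", "string", pii=True)],
)
```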

4. Transport layer (ingestion & integration)

The Transport Layer (Layer 4) is responsible for ensuring the reliable delivery of data and might map to the mechanisms used to ensure data integrity and consistency in a data architecture, such as transactions or version control.

Transactions: Mechanisms used to ensure that a series of database operations are executed atomically (as a single unit). Version control: Systems used to track and manage changes to data over time.

  • Airbyte
  • custom Python code for specifics (neatly coupled and integrated with the orchestrator), as sketched below
  • branch deployments?
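A hedged sketch of what such a custom Python ingestion step could look like, with transport-layer style guarantees in mind (a checkpoint for incremental pulls plus an atomic publish, so a retry never leaves half-written data behind); the endpoint and file paths are hypothetical and the directories are assumed to exist:

```python
# Sketch of incremental, retry-safe ingestion. All endpoints/paths are hypothetical.
import json
import os
import tempfile
import urllib.request

CHECKPOINT = "state/orders.checkpoint.json"
TARGET = "landing/orders.json"


def ingest_orders(now: str) -> None:
    # incremental extraction: resume from the last successful checkpoint
    since = "1970-01-01T00:00:00Z"
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            since = json.load(f)["updated_since"]

    with urllib.request.urlopen(f"https://api.example.com/orders?updated_since={since}") as resp:
        payload = resp.read()

    # stage to a temp file, then rename: os.replace is atomic, so downstream
    # consumers never observe a partially written file
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(TARGET) or ".")
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
    os.replace(tmp, TARGET)

    # only advance the checkpoint after the data has been published successfully
    with open(CHECKPOINT, "w") as f:
        json.dump({"updated_since": now}, f)
```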

5. Session layer (orchestration)

The Session Layer (Layer 5) is responsible for establishing, maintaining, and terminating connections between devices and might correspond to the protocols and processes used to manage access to data in a data architecture, such as authentication and authorization.

  • orchestration layer: Dagster
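A minimal Dagster sketch of this layer, wiring ingestion and transformation together as dependent assets; the asset names are illustrative, not taken from a real project:

```python
# Minimal Dagster sketch: the orchestrator coordinates the other layers by
# expressing them as dependent assets. Asset names and logic are illustrative.
from dagster import Definitions, asset


@asset
def raw_orders():
    # in practice this would call into the ingestion layer (e.g. Airbyte or custom code)
    return [{"order_id": 1, "amount": 42.0}]


@asset
def orders_cleaned(raw_orders):
    # in practice this would delegate to the transformation layer (e.g. dbt)
    return [order for order in raw_orders if order["amount"] > 0]


defs = Definitions(assets=[raw_orders, orders_cleaned])
```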

6. Presentation layer (power transformation)

The Presentation Layer (Layer 6) is responsible for converting data into a format that can be understood by the receiving device and might map to the processes and tools used to transform and manipulate data in a data architecture, such as data lakes or data warehouses.

  • dbt
    • SQL first
    • but no hard constraints, i.e. other things can be interwoven flexibly (see the sketch below)
    • no lock-in into tools like Snowpark; a shaky foundation of (upstream) tools leads to breakage of data products
  • data science
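One way to interweave other work with SQL-first dbt is to trigger it from a plain Python step, for example from the orchestrator; a small sketch, assuming the dbt CLI is installed and using a hypothetical model selector:

```python
# Sketch: invoking a SQL-first dbt transformation from Python, so data-science or
# other steps can be interwoven without locking into a single tool.
# The selector "staging_orders" is hypothetical.
import subprocess

result = subprocess.run(
    ["dbt", "run", "--select", "staging_orders"],
    capture_output=True,
    text=True,
    check=False,
)
print(result.stdout)
if result.returncode != 0:
    raise RuntimeError("dbt run failed; see logs above")
```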

7. Application layer (self-service BI)

The Application Layer (Layer 7) is responsible for providing services and applications that use the network to communicate with each other and might map to the applications and services that use data to provide value to users in a data architecture, such as business intelligence tools or data visualization dashboards.

End-user availability:

  • semantic layer
  • data virtualization
  • dashboards
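As a sketch of self-service access through such a query or virtualization layer, the following assumes the trino Python client; host, catalog, schema, and table names are hypothetical:

```python
# Sketch: an analyst querying the platform through a query/virtualization layer.
# Assumes the `trino` client package; connection details are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="trino.example.internal",
    port=8080,
    user="analyst",
    catalog="lakehouse",
    schema="marts",
)
cur = conn.cursor()
cur.execute("SELECT region, sum(revenue) FROM orders GROUP BY region")
for region, revenue in cur.fetchall():
    print(region, revenue)
```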

missing items 🤔

where to fit these points?

open data architecture

  • hierarchy
  • primitives
  • blueprint

standardization?

  • file
  • table

resource cloning necessary

normal IT architecture - not specialized

organs in a body: integration efforts (Oracle DB Link)

ephemeral messaging vs. state

shift data mastering left (to the source), even for historic state

Grammar of data - similar to dplyr

  • anonymization layer
  • pseudonymization/tokenization ID service or layer

data mesh (domain ownership, data as product)

Figure: the data quantum (image source: https://medium.com/paypal-tech/the-next-generation-of-data-platforms-is-the-data-mesh-b7df4b825522)

Summary

(+) dos

  • define a clear end goal
  • is it only a feature of a broader ecosystem or a full standalone solution?
  • move fast? spend time on operations? vs. control and capability

(-) don'ts

  • tooleritis (unnecessary tool sprawl)
  • no escape hatch from a managed service

References

The feature image was created with Stable Diffusion and the prompt: The spirit of data flying over layers.

ChatGPT has assisted in writing some of the text about the OSI model 😄.

Lastly, I want to thank the co-authors for their contributions to the article!

Further reading:


  1. https://www.dataengineeringpodcast.com/designing-a-lakehouse-from-scratch-episode-354

  2. https://daappod.com/data-mesh-radio/devops-for-data-mesh-chris-riccomini/
