A DATA-CENTRIC ARCHITECTURE FOR YOUR DATA
PERFORMANCE, SCALABILITY, BUDGET, AND SECURITY
Data architectures have significantly evolved in the following context:
- Big Data emerged to address the growing requirements of data volume, variety, and velocity.
- Hyperscalers introduced the concept of infrastructure as a service, offering scalability that brings flexibility and adaptability to changing needs.
- Vendors have renewed their offerings in SaaS mode.
- Platformization introduced the idea of gathering different needs — analytics, data science, and operational — into a shared solution.
- The empowerment of business stakeholders (Data Mesh) requires renewed data governance: data quality management and access control.
- The financial strategy has shifted toward reducing infrastructure capital expenditures, enabling scalable, pay-as-you-go models.
Optimize storage and compute to improve performance and reduce TCO.
Streamline the IT ecosystem according to business needs: access type, latency, autonomy, monetization…
Observability, security, personalization, FinOps: cross-cutting challenges that are reshaping the work of architects.
This results in a Data-centric approach that places data at the heart of an organization’s processes, decisions, and strategies. When data is collected, managed, and analyzed, it provides valuable insights that help improve performance, foster innovation, and create market differentiation.
A Data-centric architecture must facilitate the flow of data across ingestion, storage, transformation, and delivery. While the concept and objectives are clear, the technical methods and strategies to achieve them vary according to requirements such as volume, timing, confidentiality, and, most importantly, data types.
ARCHITECTURE AND DATA TYPES
A Data-centric architecture must integrate various models and technologies to effectively manage the specificities of the different types of data it handles.
A "reference" strategy around MDM?
The architecture considers the consolidation of data within a catalog of business objects that reflects the organization's activity. The same business objects (the same customer or product) exist in multiple places, each with its own life cycle. Consequently, reconciling the different versions of the same business object to create the most accurate, consolidated version is a dedicated process involving quality rules, matching, and merging.
Master Data Management (MDM) offers a technical foundation to address these requirements and is often present in Data-centric architectures. The MDM component builds a 360° view of the various business objects, their attributes, and their relationships. It links the different keys of the contributing source systems to the unique identifier of the Golden Record, creating a cross-reference base that traces the origin of the data. This link between identifiers is also important for associating the Golden Record with the rest of the transactional data in a broader platform. The MDM component must also historize changes to meet lineage and compliance requirements.
Notably, MDM is evolving to integrate more into the transactional data cycle. Real-time ingestion pipelines followed by event-driven processing and propagation are integral parts of the most performant solutions on the market.
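To make the match-and-merge process concrete, below is a minimal Python sketch of the consolidation step described above, not a production implementation: the source systems, attribute names, and the email-based matching rule are hypothetical, and real MDM solutions apply far richer quality, matching, and survivorship rules.

```python
from dataclasses import dataclass

@dataclass
class SourceRecord:
    source_system: str   # contributing system, e.g. "CRM" or "ERP" (hypothetical)
    source_key: str      # record key inside that system
    name: str
    email: str
    updated_at: str      # ISO-8601 timestamp; lexical order matches chronological order

def match_key(rec: SourceRecord) -> str:
    """Simplistic matching rule: records sharing a normalized email
    are considered versions of the same business object."""
    return rec.email.strip().lower()

def merge(versions: list[SourceRecord]) -> dict:
    """Build a Golden Record plus its cross-reference base: the most
    recently updated version wins (survivorship rule), and every
    contributing (system, key) pair is kept to trace data origin."""
    survivor = max(versions, key=lambda r: r.updated_at)
    return {
        "golden_id": f"GR-{match_key(survivor)}",  # unique Golden Record identifier
        "name": survivor.name,
        "email": match_key(survivor),
        "xref": [(v.source_system, v.source_key) for v in versions],
    }

def consolidate(records: list[SourceRecord]) -> list[dict]:
    """Group incoming records by match key, then merge each group."""
    groups: dict[str, list[SourceRecord]] = {}
    for rec in records:
        groups.setdefault(match_key(rec), []).append(rec)
    return [merge(group) for group in groups.values()]
```

The cross-reference list (`xref`) is what later allows transactional data carrying a source-system key to be joined back to the Golden Record in a broader platform.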
Engineering of transactional and behavioral data
Transactional data is voluminous, more complex than reference data, and rich in attributes. Its volume raises challenges during ingestion, as various processes route it to storage zones. This data is often highly detailed: a sales record, for example, will contain information about the product, customer, date, and amount. It is therefore essential to carefully define the functional scope to be integrated, so that the appropriate pipelines can be established for both analytical and operational ambitions.
Behavioral data, on the other hand, captures the interactions and behaviors of users, customers, or systems. This is often "hot" data that feeds "customer journeys." In addition to being voluminous, it comes in varied forms, such as clicks, session durations, or customer reviews. It is generally stored in Big Data systems, data lakes, or suitable databases, notably time-series databases.
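To make the distinction tangible, here is a small, purely illustrative sketch contrasting the two shapes of data; all attribute names are hypothetical:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class SaleTransaction:
    """Rich, detailed transactional record."""
    sale_id: str
    product_id: str
    customer_id: str
    amount: float
    sold_at: datetime

@dataclass
class BehavioralEvent:
    """Lightweight, high-volume interaction record ("hot" data)."""
    user_id: str
    event_type: str        # e.g. "click", "page_view", "review"
    occurred_at: datetime  # the field a time-series database indexes on
    payload: dict          # loosely structured, format varies by event type
```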
Data Ingestion
The extraction of source data must respect its lifecycle, with several possible approaches: data export by the sources (in full or incremental mode), the use of APIs exposed by the sources, or change data capture (CDC) at the persistence layer.
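As an illustration of the incremental approach, here is a hedged sketch of watermark-based extraction in Python; the table, column, and driver are hypothetical stand-ins (full mode would drop the WHERE clause, and CDC would read the database's change log instead of querying the table):

```python
import sqlite3  # stand-in for any source database driver

def extract_incremental(conn: sqlite3.Connection,
                        last_watermark: str) -> tuple[list[tuple], str]:
    """Pull only the rows changed since the previous run, using a
    hypothetical `updated_at` change-tracking column as the watermark."""
    rows = conn.execute(
        "SELECT id, payload, updated_at FROM sales "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # Keep the new watermark so the next run starts where this one stopped
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark
```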
Data ingestion is no longer limited to the traditional ETL or ELT process; it increasingly incorporates complex, real-time, and streaming data flows.
Filtering, cleaning, and standardization can be applied during ingestion itself, with support from several market solutions.
After extraction, the data must be converted into a more universal, standardized format and pass through steps of quality enhancement, transcoding, and enrichment.
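The sketch below illustrates those post-extraction steps on a single record: conversion to a standardized form, a quality rule, a transcoding step, and a simple enrichment. The field names and code mappings are invented for the example:

```python
# Hypothetical transcoding table: source-system country codes -> ISO codes
COUNTRY_TRANSCODING = {"FRA": "FR", "USA": "US", "DEU": "DE"}

def standardize(raw: dict) -> dict:
    """Convert an extracted record into a standardized, enriched form."""
    record = {
        "customer_id": str(raw["cust_id"]).strip(),
        "email": raw.get("email", "").strip().lower(),
        # Transcoding: map the source code to the standard ISO code
        "country": COUNTRY_TRANSCODING.get(raw.get("country", ""), "UNKNOWN"),
    }
    # Quality rule: flag records without a usable email address
    record["quality_status"] = "valid" if "@" in record["email"] else "rejected"
    # Enrichment: derive an extra attribute from existing data
    record["email_domain"] = record["email"].split("@")[-1] or None
    return record

print(standardize({"cust_id": 42, "email": "Jane@Example.com", "country": "FRA"}))
```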
A Data-centric architecture ultimately defines pipelines for collecting, moving, transforming, and directing data towards specific destinations and uses.
DATA PROCESSING
The analysis and interpretation of data represent a large part of the added value of a Data-centric platform. Preparation processes serve use cases such as applying machine learning algorithms and statistical models to guide action plans, personalizing the customer experience, and exploiting data to identify market opportunities or emerging needs. This requires a synergy of tools, technologies, analytical skills, and well-defined processes to maximize the data's utility.
At the heart of this challenge, computing resources must be optimized to prioritize scalability and the ability to process data according to different priorities. Many combinations of tools are available to accelerate performance and/or minimize processing times, notably by allocating compute and storage appropriately across the different stages of the data pipeline.
DATA STORAGE
Data storage in a Data-centric architecture must take into account the intended purpose:
- Publication for consumers or feedback to source systems.
- Analytical processing for decision-making and dashboard creation.
- Data preparation for business purposes, especially marketing (e.g., audience targeting).
- Machine learning model training for various use cases (e.g., clustering).
Data lakes like Amazon S3, Azure Data Lake Storage, and Google Cloud Storage are specially designed to store vast amounts of data in various formats, whether structured, semi-structured, or unstructured.
These data lakes are often the first destination for data from ingestion processes, which then feed a curated layer stored in interoperable table formats such as Iceberg or Delta Lake. These formats are compatible with several distributed computing engines, such as those integrated into the Apache Spark, Databricks, and Microsoft Fabric platforms.
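As a hedged illustration, the PySpark snippet below lands raw files from a lake bucket into a Delta Lake table; the bucket paths are hypothetical, the delta-spark package is assumed to be available, and the same pattern applies to Iceberg with its own catalog configuration:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lake-ingestion")
    # Enable Delta Lake (assumes the delta-spark package is on the classpath)
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Read semi-structured files landed in the data lake (hypothetical path)
raw = spark.read.json("s3://my-bucket/landing/sales/")

# Write them as an interoperable, partitioned Delta table
(raw.write
    .format("delta")
    .mode("append")
    .partitionBy("sale_date")
    .save("s3://my-bucket/bronze/sales"))
```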
Data engineering itself is carried out in platforms such as Snowflake, Databricks, Microsoft Fabric, Amazon Redshift, and Google BigQuery, which are specifically designed for this purpose and combine data science, analytics, and BI functions.
NoSQL databases and JSON storage benefit from native horizontal scalability, making them attractive when a large number of users need to query the data.
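For instance, a document store such as MongoDB accepts and queries schema-flexible JSON directly, and a sharded cluster spreads documents across nodes, which is what gives this family of databases its horizontal scalability. A minimal sketch, with a hypothetical connection string and collection:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical cluster address
reviews = client["shop"]["customer_reviews"]

# Store a schema-flexible JSON document as-is
reviews.insert_one({"user_id": "u42", "product_id": "p7",
                    "rating": 5, "comment": "Great product"})

# Query it back with a JSON filter; a sharded deployment serves such
# reads from many nodes in parallel
top_rated = list(reviews.find({"rating": {"$gte": 4}}).limit(10))
```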
The choice of storage type depends on the requirements in terms of modeling, scalability, and the service level needed to effectively support business processes. A thorough assessment of these elements allows for selecting the solution that best manages the data while optimizing performance and costs.
SERVICE LEVELS AND FINOPS
Maintaining adequate service levels while managing fluctuating data volumes and processing requires that key architecture components are both scalable and elastic.
Elasticity is crucial in storage and computing services, representing a key point in hyperscalers’ offerings. It is therefore logical to turn to the service ecosystems of these providers to build a Data-centric architecture capable of adjusting to workload fluctuations.
However, it is crucial to accurately manage provisioned resources and associated costs to optimize benefits while avoiding budget inflation. The Total Cost of Ownership (TCO) is primarily linked to the dynamic allocation of resources, especially the variable costs of compute (storage usage is generally more linear and less expensive). Governance complexity must also be considered. This governance can be based on existing platform functions or built through a platform aimed at unifying different access management, security, and other settings.
It is essential that the adoption of a Data-centric architecture be accompanied by a FinOps strategy, implementing cost management tools and practices that provide visibility into spending. Increasingly, decomposing the platform into dedicated workspaces or warehouses allows costs to be distributed and owned specifically (notably by capping consumption commitments), as the sketch below illustrates.
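As a minimal sketch of that cost-distribution practice, the Python snippet below rolls billing records up per workspace and flags budget overruns; the usage records, workspace names, and caps are all hypothetical, standing in for the exports a cloud billing API would provide:

```python
from collections import defaultdict

# Hypothetical billing export: each entry tags spend with the workspace
# that incurred it, which is what makes costs attributable.
usage = [
    {"workspace": "marketing", "service": "compute", "cost": 120.0},
    {"workspace": "data_science", "service": "compute", "cost": 340.0},
    {"workspace": "marketing", "service": "storage", "cost": 15.0},
]

BUDGET_CAPS = {"marketing": 200.0, "data_science": 300.0}  # hypothetical caps

def allocate_costs(records: list[dict]) -> dict:
    """Roll spending up per workspace and flag any cap overrun."""
    totals: dict[str, float] = defaultdict(float)
    for rec in records:
        totals[rec["workspace"]] += rec["cost"]
    return {
        ws: {"spent": spent, "over_cap": spent > BUDGET_CAPS.get(ws, float("inf"))}
        for ws, spent in totals.items()
    }

print(allocate_costs(usage))
# -> data_science exceeds its cap, signalling where commitments should be revisited
```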