A DATA-CENTRIC ARCHITECTURE FOR YOUR DATA

PERFORMANCE, SCALABILITY, BUDGET, AND SECURITY

Data architectures have significantly evolved in the following context:

Optimize storage and compute to improve performance and reduce TCO.

Streamline the IT ecosystem according to business needs: access type, latency, autonomy, monetization, etc.

Observability, security, personalization, FinOps: cross-cutting challenges that are reshaping the work of architects.

This results in a Data-centric approach that places data at the heart of an organization’s processes, decisions, and strategies. When data is collected, managed, and analyzed, it provides valuable insights that help improve performance, foster innovation, and create market differentiation.

A Data-centric architecture must facilitate the flow of data between ingestion, storage, transformation, and restitution. While the concept and objectives are clear, the technical methods and strategies to achieve them vary according to requirements like volume, timing, confidentiality, and, most importantly, data types.

ARCHITECTURE AND DATA TYPES

A Data-centric architecture must integrate various models and technologies to effectively manage the specificities of the different types of data it handles.

A "reference" strategy around MDM?

The architecture considers the consolidation of data within a catalog of business objects that reflects the organization's activity. The same business objects (the same customer or product) are present in multiple places and with different life cycles. Consequently, reconciling the different versions of the same business object to produce the most accurate, consolidated version is a dedicated process involving quality rules, matching, and merging.

Master Data Management (MDM) offers a technical foundation to address these requirements and is often present in Data-centric architectures. This MDM component builds a 360° view of various business objects, their attributes, and their relationships. It links the different keys of contributing source systems to the unique identifier of the Golden Record, creating a cross-reference base that traces the origin of the data. This link between identifiers is also important for associating the Golden Record with the rest of the transactional data in a broader platform. The MDM component must also ensure the historization of changes in the name of lineage and compliance requirements.
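
To make the matching, merging, and cross-referencing steps more concrete, here is a minimal sketch in Python. The record layouts, the matching rule (a normalized email), and the survivorship priorities are invented for illustration; a real MDM engine applies far richer rules.

```python
import uuid

# Illustrative source records; layouts and values are assumptions for the sketch.
source_records = [
    {"source": "crm", "key": "C-001", "name": "Acme Corp.", "email": "CONTACT@ACME.COM", "phone": None},
    {"source": "erp", "key": "980Z", "name": "ACME CORPORATION", "email": "contact@acme.com", "phone": "+33 1 02 03 04 05"},
]

# Hypothetical survivorship rule: lower number = higher priority when merging.
priority = {"crm": 1, "erp": 2}

def match_key(record):
    """Quality/matching rule: normalize the email before comparison."""
    return (record["email"] or "").strip().lower()

golden_records = {}   # match key -> consolidated Golden Record
cross_reference = []  # links each source key to the Golden Record identifier

for rec in sorted(source_records, key=lambda r: priority[r["source"]]):
    key = match_key(rec)
    golden = golden_records.setdefault(key, {"golden_id": str(uuid.uuid4())})
    # Merge: the highest-priority non-empty value wins for each attribute.
    for attr in ("name", "email", "phone"):
        golden.setdefault(attr, None)
        if golden[attr] is None and rec[attr]:
            golden[attr] = rec[attr]
    # Trace the origin of the data: cross-reference source key -> golden id.
    cross_reference.append({"source": rec["source"], "source_key": rec["key"],
                            "golden_id": golden["golden_id"]})

print(golden_records)
print(cross_reference)
```

The cross-reference list is what allows the Golden Record to be joined back to transactional data held elsewhere in the platform.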

Notably, MDM is evolving to integrate more deeply into the transactional data cycle. Real-time ingestion pipelines followed by event-driven processing and propagation are integral parts of the best-performing solutions on the market.

Engineering of transactional and behavioral data

Transactional data is voluminous, complex, and rich in attributes. Its volume raises challenges during ingestion by the various processes that route it to storage zones. This data is often highly detailed: a sales record, for example, will contain information about the product, the customer, the date, and the amount. It is therefore essential to carefully define the functional scope to be integrated in order to build the appropriate pipelines, depending on both analytical and operational ambitions.

Behavioral data, on the other hand, captures the interactions and behaviors of users, customers, or systems. This is often "hot" data that feeds "customer journeys." In addition to being voluminous, it comes in varied formats, such as clicks, session durations, or customer reviews. It is generally stored in Big Data systems, data lakes, or suitable databases, notably time-series databases.
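
As a simple illustration of why time-series storage suits behavioral data, the sketch below buckets clickstream events into fixed time windows before aggregation. The event schema and window size are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical clickstream events; in practice they arrive from a tracker or
# message broker in far larger volumes and varied formats.
events = [
    {"user": "u1", "type": "click", "ts": "2024-05-01T10:00:12+00:00"},
    {"user": "u1", "type": "page_view", "ts": "2024-05-01T10:04:55+00:00"},
    {"user": "u2", "type": "add_to_cart", "ts": "2024-05-01T10:07:30+00:00"},
]

def window_start(ts, minutes=5):
    """Truncate a timestamp to the start of its fixed time window."""
    dt = datetime.fromisoformat(ts).astimezone(timezone.utc)
    return dt.replace(minute=dt.minute - dt.minute % minutes, second=0, microsecond=0)

# Count events per (window, event type): the kind of aggregate a
# time-series database computes and stores natively.
counts = defaultdict(int)
for e in events:
    counts[(window_start(e["ts"]), e["type"])] += 1

for (window, event_type), n in sorted(counts.items()):
    print(window.isoformat(), event_type, n)
```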

Data Ingestion

The extraction of source data must respect its lifecycle, with several possible approaches: data export by the sources (in full or incremental mode), the use of exposure APIs, or the capture of changes at the persistence layer level.
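
The incremental approach is often implemented with a watermark on a modification timestamp. Below is a minimal sketch using SQLite as a stand-in source; the table and column names ("orders", "updated_at") are illustrative assumptions.

```python
import sqlite3

# Stand-in source system populated with sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 120.0, "2024-05-01T09:00:00"),
    (2, 75.5, "2024-05-01T11:30:00"),
])

def extract_incremental(connection, watermark):
    """Pull only the rows modified after the last stored watermark."""
    rows = connection.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark

# The watermark would normally be persisted between pipeline runs.
rows, watermark = extract_incremental(conn, "2024-05-01T10:00:00")
print(rows)       # only the row updated after 10:00
print(watermark)  # new high-water mark to store for the next run
```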

Data ingestion is no longer limited to the traditional ETL or ELT process; it increasingly involves complex, real-time, or streaming data flows.
Filtering, cleaning, and standardization can be implemented during the ingestion process itself, relying on a number of market solutions.

After extraction, the data must be converted into a more universal and standardized format and go through steps of quality enhancement, transcoding, or enrichment.

A Data-centric architecture ultimately defines pipelines for collecting, moving, transforming, and directing data towards specific destinations and uses.
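
As a sketch of the filtering, cleaning, standardization, transcoding, and enrichment steps described above, a single transformation stage might look like the following. The field names and the reference table are invented for illustration.

```python
# Illustrative transformation stage: filter, clean, standardize, and enrich
# raw records before routing them to their storage destination.
COUNTRY_CODES = {"france": "FR", "germany": "DE"}  # hypothetical reference data

def transform(record):
    # Filtering: drop records that fail a basic quality rule.
    if not record.get("customer_id"):
        return None
    # Cleaning / standardization: trim and normalize text fields.
    country = record.get("country", "").strip().lower()
    # Transcoding: map the free-text value to a standard code.
    country_code = COUNTRY_CODES.get(country, "UNKNOWN")
    # Enrichment: add metadata useful downstream (lineage, routing).
    return {
        "customer_id": record["customer_id"].strip(),
        "amount": round(float(record.get("amount", 0)), 2),
        "country_code": country_code,
        "source_system": record.get("source", "unspecified"),
    }

raw = [
    {"customer_id": " C-001 ", "amount": "19.994", "country": " France ", "source": "web"},
    {"customer_id": "", "amount": "10", "country": "Germany"},
]
standardized = [r for r in (transform(rec) for rec in raw) if r is not None]
print(standardized)
```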

Funnel-based data processing

DATA PROCESSING

The analysis and interpretation of data represent a large part of the added value of a Data-centric platform. Preparation processes serve the following use cases: utilizing machine learning algorithms and statistical models to guide action plans, personalizing the customer experience, or exploiting data to identify market opportunities or emerging needs. This requires a synergy of tools, technologies, analytical skills, and well-defined processes to maximize their utility.

At the heart of this challenge, computing resources need to be optimized to prioritize scalability and the ability to process data according to different priorities. Many combinations of tools are available to improve performance and/or reduce processing times, especially by assigning compute and storage to the different stages of the data pipeline.

DATA STORAGE

Data storage in a Data-centric architecture must take into account the intended purpose:

Data lakes like Amazon S3, Azure Data Lake Storage, and Google Cloud Storage are specially designed to store vast amounts of data in various formats, whether structured, semi-structured, or unstructured.

These data lakes are often the first destination for data from ingestion processes, which themselves feed a layer of data into interoperable table formats such as Iceberg or Delta Lake. These formats are compatible with several distributed computing engines such as those integrated into the Apache Spark, Databricks, and Microsoft Fabric platforms.

Data engineering itself is carried out in platforms like Snowflake, Databricks, Microsoft Fabric, Amazon Redshift, and Google BigQuery, specifically designed for this purpose and combining data science, analytics, and BI functions.

NoSQL databases and JSON storage benefit from native horizontal scalability, making them attractive when a large number of users need to query the data.

The choice of storage type depends on the requirements in terms of modeling, scalability, and the service level needed to effectively support business processes. A thorough assessment of these elements allows for selecting the solution that best manages the data while optimizing performance and costs.
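
As an example of feeding an interoperable table format from a data lake, here is a minimal PySpark sketch. The paths, column names, and the availability of the Delta Lake package in the Spark runtime are assumptions; an Iceberg catalog could be targeted in much the same way.

```python
from pyspark.sql import SparkSession

# Minimal sketch: read raw files landed in a data lake and publish them
# as a Delta table for downstream engines.
spark = (
    SparkSession.builder
    .appName("bronze-to-silver")
    .getOrCreate()
)

raw = spark.read.json("s3a://my-datalake/landing/orders/")  # hypothetical landing zone

cleaned = raw.dropDuplicates(["order_id"]).filter("amount > 0")  # illustrative columns

(
    cleaned.write
    .format("delta")            # or "iceberg" with the matching catalog setup
    .mode("append")
    .partitionBy("order_date")  # assumed partition column
    .save("s3a://my-datalake/silver/orders")
)
```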

SERVICE LEVELS AND FINOPS

Maintaining adequate service levels while managing fluctuating data volumes and processing requires that key architecture components are both scalable and elastic.

Elasticity is crucial in storage and computing services, representing a key point in hyperscalers’ offerings. It is therefore logical to turn to the service ecosystems of these providers to build a Data-centric architecture capable of adjusting to workload fluctuations.

However, it is crucial to accurately manage provisioned resources and associated costs to optimize benefits while avoiding budget inflation. The Total Cost of Ownership (TCO) is primarily linked to the dynamic allocation of resources, especially the variable costs of compute (storage usage is generally more linear and less expensive). Governance complexity must also be considered. This governance can be based on existing platform functions or built through a platform aimed at unifying different access management, security, and other settings.

It is essential for the adoption of a Data-centric architecture to be accompanied by a FinOps strategy, implementing cost management tools and practices that provide visibility into spending. Increasingly, platform decomposition into dedicated workspaces or warehouses allows for specific cost distribution and accountability (notably through capping client commitments).
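
To illustrate the kind of visibility a FinOps practice relies on, the sketch below breaks down daily spend by workspace and flags budget overruns. The billing-export layout, tags, and budgets are invented for the example.

```python
from collections import defaultdict

# Hypothetical daily billing export: (workspace tag, service, cost in euros).
billing_lines = [
    ("marketing-analytics", "compute", 420.0),
    ("marketing-analytics", "storage", 35.0),
    ("finance-reporting", "compute", 180.0),
    ("finance-reporting", "storage", 12.5),
]

# Illustrative daily budgets per workspace.
daily_budget = {"marketing-analytics": 400.0, "finance-reporting": 250.0}

spend_by_workspace = defaultdict(float)
for workspace, _service, cost in billing_lines:
    spend_by_workspace[workspace] += cost

for workspace, spend in sorted(spend_by_workspace.items()):
    status = "OVER BUDGET" if spend > daily_budget.get(workspace, float("inf")) else "ok"
    print(f"{workspace}: {spend:.2f} EUR ({status})")
```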

Cloud Computing

Data exploitation has evolved with the rise of Cloud Computing, which impacts both the technical and the software architecture.

All these changes rely on the offerings of hyperscalers, which provide a complete, elastic, ready-to-use ecosystem. The benefits are technological: containerization and orchestration (Docker, Kubernetes), resource virtualization, provisioning platforms, etc.

This offering is global, ensuring better security compliance and response times, and covers all needs (storage, engineering, middleware, data science) across all service levels: IaaS, PaaS, or SaaS.

The three major players dominate the market: Amazon Web Services, Google Cloud, and Microsoft Azure, not to mention IBM Cloud, Oracle Cloud, and even Alibaba Cloud.

The reduction in storage and computing costs also supports this approach, particularly during the launch phase.

However, it is crucial to assess the technological agnosticism of solutions hosted on each of these platforms and to implement cost monitoring (FinOps).

Data Storage

The evolution of storage and computing costs has profoundly changed the landscape of data processing technologies. The drop in storage costs has enabled the widespread adoption of massive and flexible storage solutions on various technologies (HDFS, object storage, etc.), pursuing either analytical or operational objectives. At the same time, the reduction in computing costs has driven the emergence of fast and real-time processing technologies, thus influencing the choice of data management solutions.

These changes have allowed companies to manage vast amounts of data with increased efficiency.

The choice of a data repository depends on the specific requirements of the application and/or organization. Criteria such as the data model, scalability, performance, consistency (ACID vs. BASE), and cost play a crucial role in determining the most suitable repository for data storage and processing needs. However, an architecture is designed within a comprehensive IT-landscape (urbanization) policy, in order to pool both material and human resources. It must take into account existing systems (and their technical debt) and ongoing technological innovations to maximize advantages while controlling expenses.

High Availability

Data platforms concentrate on analytical and operational objectives.

In this context, high availability and data availability have become two essential concepts for maintaining seamless operations, delivering uninterrupted services, and protecting against data loss.

High Availability (HA)

The goal of high availability is to minimize downtime and ensure continuous service availability, even in the event of hardware or software failures.

High availability is measured against the service level required by the business and implemented by architects.
It can be ensured by deploying hardware and software redundancies, as well as load balancing and automatic failover mechanisms.
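
As a simple illustration of the failover principle at the client level, the sketch below tries replicated endpoints in priority order. The endpoints and timeout are invented; in a real deployment this logic usually lives in a load balancer or database driver rather than in application code.

```python
import urllib.error
import urllib.request

# Hypothetical replicated endpoints, listed in failover priority order.
REPLICAS = [
    "https://primary.example.internal/health",
    "https://secondary.example.internal/health",
]

def call_with_failover(timeout=2.0):
    """Try each replica in order and fail over on connection errors or timeouts."""
    last_error = None
    for endpoint in REPLICAS:
        try:
            with urllib.request.urlopen(endpoint, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError) as exc:
            last_error = exc  # in practice, log the failure and try the next replica
    raise RuntimeError("all replicas unavailable") from last_error
```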

Given the complexity of such a system, cloud computing and SaaS offerings are an attractive option, as they relieve the company of skills and tasks unrelated to its core business.

Data Availability

This requirement is less well-known, even though it refers to user access to data under favorable technical conditions: avoiding loss, corruption, or inaccessibility.

Several actions help achieve this goal:

Security and encryption

In extended data architectures, the distribution, dispersal, and replication of data across multiple locations make data protection challenges particularly complex. Several solutions exist:

These security and encryption issues are crucial for meeting compliance and confidentiality requirements for cloud-deployed solutions (which can extend to SecNumCloud qualification). Ownership of the encryption keys is one of the strongest responses to these requirements, although the multiplicity of keys is not always compatible with a multi-tenant infrastructure.
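
To illustrate the principle of keeping ownership of encryption keys, here is a minimal envelope-encryption sketch using the cryptography package: a data key encrypts the payload and is itself wrapped with a key the customer retains. Where the keys are stored (HSM, key vault) is an assumption outside the sketch.

```python
from cryptography.fernet import Fernet

# Key-encryption key (KEK): generated and held by the customer, for example in
# an on-premises HSM or a customer-managed key vault (illustrative assumption).
customer_kek = Fernet.generate_key()

# Data-encryption key (DEK): generated per dataset or per tenant.
dek = Fernet.generate_key()

# Envelope encryption: the DEK encrypts the data, the KEK wraps the DEK.
ciphertext = Fernet(dek).encrypt(b"customer record to protect")
wrapped_dek = Fernet(customer_kek).encrypt(dek)

# Only the wrapped DEK and the ciphertext are stored on the platform;
# decryption requires the customer-held KEK to unwrap the DEK first.
recovered_dek = Fernet(customer_kek).decrypt(wrapped_dek)
plaintext = Fernet(recovered_dek).decrypt(ciphertext)
assert plaintext == b"customer record to protect"
```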

FinOps

FinOps is a collaborative approach to cloud cost management, aimed at optimizing spending by aligning budgets with business priorities and expected service levels.
A FinOps strategy requires complementary technologies to maximize its effectiveness, notably observability and elasticity. Observability makes it possible to anticipate and optimize performance while measuring cloud resource usage; combined with elasticity, it allows resources to be adjusted dynamically to demand fluctuations.
Implementing a FinOps strategy involves challenges that must be carefully addressed:

Additionally, cost evaluations are expected before implementation and for the selection of technologies. Cost modeling/simulation during the project phase is often difficult and must be given particular attention.
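
As a sketch of how observability metrics can feed an elasticity decision, a simple scaling rule could be expressed as follows. The thresholds, metric names, and node counts are assumptions; managed autoscalers implement far more sophisticated policies.

```python
from dataclasses import dataclass

@dataclass
class ClusterState:
    nodes: int
    cpu_utilization: float  # average over the observation window, 0.0-1.0
    queued_jobs: int

def desired_nodes(state, min_nodes=2, max_nodes=16):
    """Scale out when the cluster is saturated, scale in when it is idle."""
    target = state.nodes
    if state.cpu_utilization > 0.80 or state.queued_jobs > 10:
        target = state.nodes + 2   # scale out to absorb the backlog
    elif state.cpu_utilization < 0.30 and state.queued_jobs == 0:
        target = state.nodes - 1   # scale in to cut idle compute spend
    return max(min_nodes, min(max_nodes, target))

# Example: a saturated cluster triggers a scale-out recommendation.
print(desired_nodes(ClusterState(nodes=4, cpu_utilization=0.92, queued_jobs=14)))  # -> 6
```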