Differentiate: Database, Data Warehouse, Data Mart, Data Lake, Data Lakehouse, Data Fabric, Data Mesh

Wednesday, 07/12/2022

Tram Ho

Hi guys,

Today, continuing the Business Data Analytics series, I will share the most common concepts in data system design, because anyone doing data analysis also needs to know them. I am not a specialist in the database topics below; my main strength is analysis. We have partnered with Joseph Tan, an expert in database management and architecture, and are completing the curriculum for the “Enterprise Data Warehouse” program. We hope to launch this new program soon.

*Data repositories: databases, data warehouses, data marts, data lakes, data lakehouses and similar systems are collectively known as data repositories (source). I don’t know how to translate this term into Vietnamese in a way that makes sense.

1. Database

A database is a place to store related data that is used to capture a particular situation. An example of a database is a point of sale (POS) database. The POS database will collect and store all relevant data surrounding retail store transactions.

There are many types of databases:

New data entering the database is processed, sorted, managed, updated and then stored in tables. A database is typically a single-purpose store of raw transactional data. Because it is tightly tied to transactions, it performs online transaction processing (OLTP).
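To make the OLTP idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `sales` table, its columns, and the `record_sale` helper are assumptions made for illustration, not the schema of any real POS system. It shows the key property of transactional processing: a receipt is written completely or not at all.

```python
import sqlite3

# Hypothetical POS schema (an assumption for this sketch): one row per item sold.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    " sale_id INTEGER PRIMARY KEY,"
    " store TEXT NOT NULL,"
    " product TEXT NOT NULL,"
    " quantity INTEGER NOT NULL,"
    " unit_price REAL NOT NULL)"
)


def record_sale(conn, store, items):
    """Insert every item of one receipt in a single transaction."""
    with conn:  # commits on success, rolls back if any insert fails
        conn.executemany(
            "INSERT INTO sales (store, product, quantity, unit_price)"
            " VALUES (?, ?, ?, ?)",
            [(store, product, qty, price) for product, qty, price in items],
        )


# A valid receipt with two items.
record_sale(conn, "Store A", [("coffee", 2, 3.5), ("bagel", 1, 2.0)])

# A broken receipt: the missing quantity violates NOT NULL,
# so the whole receipt is rolled back, including the valid "tea" row.
try:
    record_sale(conn, "Store A", [("tea", 1, 2.5), ("muffin", None, 3.0)])
except sqlite3.IntegrityError:
    pass

row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(row_count)  # 2
```

Only the two rows of the valid receipt remain, which is the behavior a transactional store is designed to guarantee.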

2. Data Warehouse

As the image above shows, the data warehouse comes after the databases: data is pulled from the databases by ETL (extract, transform, load) tools and loaded into the data warehouse. Data warehouses typically store only modeled, structured data.
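The flow from database to data warehouse can be sketched in a few lines. This is only an illustration under assumptions: both systems are in-memory SQLite databases, and the `sales` and `daily_revenue` tables are hypothetical names. Real ETL tools add scheduling, incremental loads, and error handling.

```python
import sqlite3

# Source: a hypothetical operational (OLTP) database with raw sales rows.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE sales (sold_on TEXT, store TEXT, quantity INTEGER, unit_price REAL)"
)
source.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("2022-12-01", "Store A", 2, 3.5),
        ("2022-12-01", "Store A", 1, 2.0),
        ("2022-12-01", "Store B", 4, 1.5),
        ("2022-12-02", "Store A", 3, 2.0),
    ],
)

# Target: a hypothetical warehouse table, modeled for analysis.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE daily_revenue ("
    " sold_on TEXT, store TEXT, revenue REAL,"
    " PRIMARY KEY (sold_on, store))"
)


def extract(conn):
    """Extract: read the raw rows from the operational database."""
    return conn.execute(
        "SELECT sold_on, store, quantity, unit_price FROM sales"
    ).fetchall()


def transform(rows):
    """Transform: aggregate raw rows into revenue per day and store."""
    totals = {}
    for sold_on, store, quantity, unit_price in rows:
        key = (sold_on, store)
        totals[key] = totals.get(key, 0.0) + quantity * unit_price
    return [(day, store, revenue) for (day, store), revenue in sorted(totals.items())]


def load(conn, rows):
    """Load: write the modeled rows into the warehouse table."""
    with conn:
        conn.executemany("INSERT OR REPLACE INTO daily_revenue VALUES (?, ?, ?)", rows)


load(warehouse, transform(extract(source)))
result = warehouse.execute(
    "SELECT sold_on, store, revenue FROM daily_revenue ORDER BY sold_on, store"
).fetchall()
print(result)
```

Analysts then query `daily_revenue` instead of the raw transaction table, so reporting does not compete with the live system.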

Compare Database vs Data Warehouse:

3. Data Mart (simplified version of Data Warehouse)

While a data warehouse is a multi-purpose store for many different use cases, a data mart is a subset of a data warehouse, designed and built specifically for a particular business division or function.

Some benefits of using data-mart:

Segregated security: Since the data mart contains only data specific to one department, you can be sure that unrelated data (financial data, revenue data) cannot be accessed through it.
Isolated performance: Likewise, since each data mart is used by only one department, the performance load is managed and communicated within that department and does not affect other analytical workloads.

3 types of data mart:

Dependent data marts – A dependent data mart is built from an existing data warehouse. It takes a top-down approach: all business data is first stored in one centralized location, and a defined portion of it is then pulled out as needed for analysis.
Independent data marts – An independent data mart is a standalone system, created without a data warehouse and focused on one business function. Data is extracted from internal or external sources, refined, and then loaded into the data mart, where it is stored until it is needed for business analysis.
Hybrid data marts – A hybrid data mart integrates data from an existing data warehouse and from additional operational source systems. It combines the speed and end-user focus of the top-down approach with the enterprise-level integration of the bottom-up approach.
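As a rough illustration of the dependent type, the sketch below carves a marketing data mart out of a warehouse table. The table names `fact_transactions` and `mart_marketing` and the sample rows are assumptions for the example. In practice the mart would usually live in its own schema or database with its own access rights.

```python
import sqlite3

# Hypothetical enterprise warehouse holding data for every department.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE fact_transactions (department TEXT, category TEXT, amount REAL)"
)
warehouse.executemany(
    "INSERT INTO fact_transactions VALUES (?, ?, ?)",
    [
        ("marketing", "ads", 1200.0),
        ("marketing", "events", 800.0),
        ("finance", "payroll", 50000.0),
        ("sales", "travel", 300.0),
    ],
)


def build_marketing_mart(conn):
    """Top-down: copy only the marketing slice of the warehouse into the mart."""
    with conn:
        conn.execute("CREATE TABLE mart_marketing (category TEXT, amount REAL)")
        conn.execute(
            "INSERT INTO mart_marketing"
            " SELECT category, amount FROM fact_transactions"
            " WHERE department = ?",
            ("marketing",),
        )


build_marketing_mart(warehouse)
mart_rows = warehouse.execute(
    "SELECT category, amount FROM mart_marketing ORDER BY category"
).fetchall()
print(mart_rows)  # [('ads', 1200.0), ('events', 800.0)]
```

The marketing team sees only its own rows, which matches the segregated security benefit described earlier: the payroll figures never reach the mart.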

4. Data Lake

Choose Data Lake for 2 main reasons:

You need a cheap way to store many different types of data in bulk.
You don’t yet have a plan for the data, but you intend to use it at some point, so you collect the data first and analyze it later.

Compare Data Warehouse vs Data Lake:

Data Lake: suits businesses with advanced analysis needs (using unstructured data). Because the amount of data to store is very large, queries and analyses can take weeks or months, the cost is high, and only a small group of users with advanced analysis skills can work with it.
Data Warehouse: the master store that gathers the structured data systems of all departments, and it is very popular in Vietnam. Most businesses already have data systems in many departments; once these are gathered in one place, most “business users” can use the data. This is the enterprise-wide data warehouse.
Data Mart: an individual data warehouse designed specifically for one department.
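The “collect first, analyze later” idea can be sketched with plain files. This is a toy example under assumptions: the lake is a temporary local folder, and the `clickstream` and `support_emails` sources are invented for illustration. A real data lake would typically sit on object storage.

```python
import json
import tempfile
from pathlib import Path


def land_raw(lake_dir, source, name, payload):
    """Store the payload untouched; no schema is enforced at write time."""
    target = Path(lake_dir) / source
    target.mkdir(parents=True, exist_ok=True)
    path = target / name
    path.write_text(payload, encoding="utf-8")
    return path


def read_clicks(lake_dir):
    """Schema-on-read: decide which fields matter only when analyzing."""
    pages = []
    for path in sorted((Path(lake_dir) / "clickstream").glob("*.json")):
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                event = json.loads(line)
                # Tolerate records that lack the field we happen to need today.
                pages.append(event.get("page", "unknown"))
    return pages


with tempfile.TemporaryDirectory() as lake:
    # Different kinds of data land side by side in their original form.
    land_raw(lake, "clickstream", "2022-12-07.json",
             '{"user": 1, "page": "/home"}\n{"user": 2}\n')
    land_raw(lake, "support_emails", "ticket-1.txt", "My order arrived late.")
    pages = read_clicks(lake)

print(pages)  # ['/home', 'unknown']
```

Nothing is rejected when it is stored, which keeps ingestion cheap; the cost is paid later, when someone has to interpret the raw files.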

5. Data Lakehouse

A Data Lakehouse combines the advantages of a Data Lake and a Data Warehouse.

6. Data Fabric

Data Fabric is designed to help organizations solve complex data problems and use cases by managing their data regardless of application type, platform, or where the data is stored. It enables seamless access to and sharing of data in a distributed data environment. It is similar to a Data Lakehouse in combining the Data Warehouse and the Data Lake, but it goes a step further by also integrating data from applications, and it provides supporting services such as control and monitoring.
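A very small sketch can show the idea of one access layer over data that lives in different places. The `FabricLayer` class, the dataset names, and the access log below are hypothetical and written only for this example. Commercial data fabric products offer far more (metadata management, governance, security policies).

```python
import sqlite3


class FabricLayer:
    """One logical entry point; callers do not need to know where data lives."""

    def __init__(self):
        self._readers = {}
        self.access_log = []  # stands in for central monitoring

    def register(self, dataset, reader):
        """Attach a dataset name to a function that knows how to fetch it."""
        self._readers[dataset] = reader

    def read(self, dataset, user):
        """Record who asked for what, then fetch from the underlying source."""
        self.access_log.append((user, dataset))
        return self._readers[dataset]()


# Source 1: a relational database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
db.execute("INSERT INTO customers VALUES (1, 'An')")

# Source 2: data held inside an application (here just a dict).
app_data = {"open_tickets": [{"ticket": 7, "customer_id": 1}]}

fabric = FabricLayer()
fabric.register(
    "customers", lambda: db.execute("SELECT id, name FROM customers").fetchall()
)
fabric.register("open_tickets", lambda: app_data["open_tickets"])

customers = fabric.read("customers", user="analyst")
tickets = fabric.read("open_tickets", user="analyst")
print(customers, tickets, fabric.access_log)
```

The analyst uses the same `read` call for both datasets, and every access passes through one place where control and monitoring can be applied.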

7. Data Mesh

While Data Mesh aims to solve many of the same problems as Data Fabrics – namely: the difficulty of managing data in heterogeneous data environments – it solves the problem in a fundamentally different way. In short, while the data fabric seeks to build a single virtual management layer on top of distributed data, the data mesh encourages distributed teams to manage data as they see fit, albeit with some general governance rules.

Is Data Mesh the next data platform?

Data Mesh is a “rising star” among today’s approaches to data storage. Before learning what Data Mesh is, let’s go over two important concepts: monolithic vs. microservices architecture.

Monolithic architecture

In software engineering, monolithic architecture is considered the traditional model: an application is built as a single self-contained block, independent of other applications.

This architecture is very convenient in the early stages of a project’s lifecycle because code is easy to develop and deploy. In other words, the monolithic approach allows everything to be released at once. As with everything in life, this approach has its flaws, including:

Slower development speed – A large, monolithic application makes development more complex and slower.
New technology adoption – Any change or upgrade in the technology used affects the entire application, making this decision costly and difficult.
Development and deployment scalability – As the system grows, upgrading a single component becomes very difficult, and any small change requires redeploying the entire monolith.

Microservices architecture

On the other hand, the microservices approach is an architectural approach based on a series of independently deployable services. Each service has its own database and business logic with a specific goal. Updates, testing, deployment, and scaling take place within each service. This eliminates the drawbacks of the monolithic approach but, as always, introduces new ones. For example:

Added development complexity – Microservices add more complexity than a monolithic architecture because there are more services in more places, created by multiple teams. If the development process is not managed properly, the result is slower development and poor performance.
High infrastructure costs – Each new microservice can have its own costs for test suites, deployment playbooks, storage infrastructure, monitoring tools, etc.
Debugging challenges – Each microservice has its own set of logs, which makes debugging more complex.
Lack of clear ownership – As more services are introduced, the number of teams running them grows too. Over time, it becomes difficult to know which services are available for a team to use and whom to contact for support.

What is Data Mesh?

Data mesh is the data platform version of microservices.

According to Zhamak Dehghani, the ThoughtWorks consultant who first coined the term, a data mesh is a type of data platform architecture that embraces the ubiquity of data in the enterprise by leveraging a domain-oriented, self-serve design. The idea borrows from Eric Evans’ theory of domain-driven design, a flexible, scalable software development paradigm that matches the structure and language of your code to its corresponding business domain.

In simpler terms, unlike traditional monolithic data infrastructure that handles data ingestion, storage, transformation, and output in one central data lake, a data mesh supports distributed, domain-specific data consumers and treats “data as a product”, with each domain handling its own data pipelines. The tissue connecting these domains and their associated data assets is a universal interoperability layer that applies the same syntax and data standards.
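The paragraph above can be turned into a small sketch: each domain owns its pipeline and publishes a data product, while a shared catalog enforces one common standard. The `DataProduct` and `Catalog` classes, the domain names, and the schema check are all assumptions made for illustration; data mesh is an organizational paradigm, not a specific library.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class DataProduct:
    """A dataset that one domain team owns and publishes for others."""
    domain: str
    name: str
    owner: str
    schema: Dict[str, type]          # the contract consumers can rely on
    read: Callable[[], List[dict]]   # the domain's own pipeline


class Catalog:
    """Shared interoperability layer: same rules for every domain."""

    def __init__(self):
        self._products = {}

    def publish(self, product):
        # Common standard: every row must match the declared schema.
        for row in product.read():
            for column, expected in product.schema.items():
                if not isinstance(row.get(column), expected):
                    raise ValueError(
                        f"{product.domain}.{product.name}: bad column {column!r}"
                    )
        self._products[(product.domain, product.name)] = product

    def get(self, domain, name):
        return self._products[(domain, name)]


# Each domain implements and maintains its own pipeline.
def sales_orders():
    return [{"order_id": 1, "total": 42.0}]


def marketing_campaigns():
    return [{"campaign": "winter", "clicks": 310}]


catalog = Catalog()
catalog.publish(DataProduct("sales", "orders", "sales-team",
                            {"order_id": int, "total": float}, sales_orders))
catalog.publish(DataProduct("marketing", "campaigns", "marketing-team",
                            {"campaign": str, "clicks": int}, marketing_campaigns))

# A consumer in another domain discovers and reads the product via the catalog.
orders = catalog.get("sales", "orders").read()
print(orders)  # [{'order_id': 1, 'total': 42.0}]
```

There is no central team writing every pipeline here; the only central piece is the thin set of rules that every published product must satisfy.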

Many companies to date have leveraged a single data warehouse connected to multiple business intelligence platforms. Such solutions often incur significant technical debt in maintaining the central pipeline by a small group of data engineers, which creates bottlenecks in the organization’s data platform.

*Technical debt: simply put, technical debt is the amount of accumulated work that still needs to be handled in an IT project.

For many organizations, a monolithic data architecture has many flaws, including:

The central ETL pipeline gives data engineering teams less control over increasing data volumes.
Different data use cases require different types of transformations, placing a heavy load on the central platform.

Such centralized data lakes lead to disconnected data producers, impatient data consumers, and, worse, a backlogged data engineering team struggling to keep up with the needs of the business.

Instead, domain-oriented data architectures, like data meshes, give teams the best of both worlds: a centralized database (or a distributed data lake) with domains (or business areas) responsible for handling their own pipelines. As Zhamak argues, data architectures can be most easily scaled by being broken down into smaller, domain-oriented components.

Simply put, the more complex and demanding your company’s data infrastructure requirements are, the more likely your organization is to benefit from a data mesh.

A high-level example of a data mesh:

Source: Internet

Distinguishing Data Lakehouse vs Data Mesh: “Data Mesh is a paradigm – Lakehouse is a platform”. For a more detailed distinction, please watch this video: https://www.linkedin.com/embeds/publishingEmbed.html?articleId=7073142464567856814&li_theme=light

Since these are theoretical concepts, I relied on their academic definitions and left a reference link at each concept. Here is a summary of the links for those who want to read further:

https://www.zuar.com/blog/data-mart-vs-data-warehouse-vs-database-vs-data-lake/

https://www.holistics.io/blog/data-lake-vs-data-warehouse-vs-data-mart/

https://medium.com/@mhatout/data-mesh-as-i-know-it-d30d9fc1ea69

https://www.javatpoint.com/types-of-databases

https://www.atlassian.com/microservices/microservices-architecture/microservices-vs-monolith

https://www.montecarlodata.com/blog-what-is-a-data-mesh-and-how-not-to-mesh-it-up/

https://martinfowler.com/articles/data-monolith-to-mesh.html

https://www.databricks.com/session_na20/data-mesh-in-practice-how-europes-leading-online-platform-for-fashion-goes-beyond-the-data-lake

Thank you all for taking the time to read the article. See you all in the next posts!

Refer to the course “Business Data Analysis” (online/offline) at https://indaacademy.vn/

INDA Training Academy is the leader in Business Data Analytics skills training in Vietnam. Business Data Analytics courses at INDA open monthly, and each class attracts 100+ students. It is the only Business Data Analytics training center in Vietnam that attracts such a large number of students in each class. It has opened 34 public courses and is a data analysis training partner for large enterprises in Vietnam.

Source : Viblo
