Data Warehouse and the most basic knowledge

Tram Ho

What is data warehouse?

A data warehouse (DW) is a relational database designed for querying and analysis rather than transaction processing. It includes historical data obtained from transaction data from single and multiple sources.

DW provides enterprise-wide integrated historical data and is focused on providing support to decision makers in data modeling and analysis.

A DW is a specific set of data for the entire organization, not just for a specific set of users.

It is not used for day-to-day operations and transaction processing but is used for decision making.

DW can be viewed as a data system with the following properties:

It is a database designed for investigative tasks, using data from different applications. It supports a relatively small number of customers with relatively long interactions. It includes current and historical data to provide a historical perspective of the information. Its use is in-depth reading. It contains a few large tables. The DW is a subject-driven, integrated, and time-varying repository of information to support management decisions.”

Features of Data warehouse

Subject-Oriented DW goal of data modeling and analysis for decision makers. As a result, it often provides a concise and easy-to-understand view of a particular topic, such as customers, products, or sales, rather than the ongoing operations of the global organization. This is done by excluding data that is not useful related to the topic and including all the data the user needs to understand the topic. image.png

The Integrated Data warehouse integrates a variety of heterogeneous data sources such as RDBMS, flat files, and online transaction records. It requires performing data cleaning and integration before data storage to synchronize naming, attribute types, etc. between aggregate data sources. image.png

Time-Variant Historical information is kept in the Data warehouse. For example, one can retrieve files from 3 months, 6 months, 12 months, or even previous data from a Data warehouse. These variations with a transactional system, where only the latest file is usually stored. image.png

A Non-Volatile Data warehouse is a physically separate store of data, converted from a source operational RDBMS. Active updates of data do not occur in the Data warehouse, i.e. update, insert and delete operations are not performed. It usually requires only two procedures in data access: Initial data loading and data access. As a result, DW does not require transaction processing, rollback, and concurrency, allowing for significant acceleration of data retrieval. Non-Volatile defines that once entered the inventory and the data cannot be changed. image.png

The goal of Data warehouse

  1. To help with reporting as well as analysis
  2. Maintain the organization’s historical information
  3. As a basis for decision making.

Is data warehouse necessary?

A data warehouse is necessary for the following reasons: image.png

1) Business users: Business users require a Data warehouse to view historical summary data. Since these people are not technical, data can be presented to them in basic form.

2) Welcome Store

Phased data: Data warehouse is required to store past time variable data. This input is created to be used for different purposes.

3) Make strategic decisions: Some strategies may depend on the data in the Data warehouse. Therefore, Data warehouse contributes to strategic decisions.

4) For data consistency and quality: Bringing data from disparate sources to a common place, can be done efficiently by users to bring about uniformity and consistency in data .

5) High response times: The data warehouse must be ready for somewhat unexpected loads and query types that require a significant degree of flexibility and quick response times.

Data warehouse components or building blocks

Architecture is the logical arrangement of elements. We build a Data warehouse with software and hardware components. To match the requirements of our organizations, we arrange this building that we may want to enhance another section with additional tools and services. All of this depends on our circumstances. image.png

The figure shows the essentials of a typical warehouse. We see the Source Data component visible on the left. The staging element serves as the next building block. In the middle, we see the Data Storage component that handles the data of the Data warehouse. This element not only stores and manages data; it also tracks data using a metadata store. The Provide Information component shown on the right covers all the different ways to make information available from the Data warehouse to users.

Source data element

The source data entering the Data warehouse can be grouped into four broad categories:

Production Data: This type of data comes from different operating systems of the business. Based on the data requirements in the Data warehouse, we select data segments from different modes of operation.

Internal Data: Within each organization, customers keep their own “private” spreadsheets, reports, customer records, and sometimes even departmental databases. This is internal data, part of which can be useful in a Data warehouse.

Archived Data: Systems that operate primarily for the purpose of running the current business. In every working system we periodically take stale data and store it in gained files.

External Data: Most executives depend on information from outside sources for a large percentage of the information they use. They use statistics relevant to their industry provided by an external department.

Components of data structures

After we have extracted data from various operational systems and external sources, we have to prepare the files for storage in the Data warehouse. Data extracted from a number of different sources needs to be altered, transformed and made available in a suitable format to be saved for querying and analysis. image.png

We will now discuss the three main functions that take place in the staging area.

1) Data extraction: This method has to deal with multiple data sources. We must use the appropriate techniques for each data source.

2) Data transformation: As we all know, data for a Data warehouse comes from many different sources. If extracting data for a Data warehouse is a big challenge, transforming data presents even more significant challenges. We perform a number of individual tasks as part of the data transformation.

First, we clean the data extracted from each source. Cleaning can be about correcting typos or it can deal with providing default values ​​for missing data elements or removing duplicates when we bring the same data from different source systems enter.

Standardization of data elements forms a large part of the data transformation. Data transformations contain many forms of combining pieces of data from different sources. We combine data from a single source record or related pieces of data from multiple source records.

On the other hand, data transformation also contains unhelpful source data and separates the external records into new combinations. Sorting and merging of data takes place on a large scale in the data region. When the data transformation is finished, we have an integrated data set that is cleaned, normalized, and summarized.

3) Loading data: Two distinct types of tasks form the data loading functions. Once we have completed the structure and construction of the Data warehouse and are up and running for the first time, we perform the initial loading of the information into the data store. Initial loads move large volumes of data using a significant amount of time.

Data storage component

Data storage for Data warehouse is a subdivided repository. Data warehouses for operational systems typically include only current data. In addition, these data warehouses include highly normalized structured data for fast and efficient processing.

Ingredient providing information

The informant element is used to enable the process of registering a Data warehouse file and delivering it to one or more destinations according to some customer-specified scheduling algorithm.

Component metadata

Metadata in a Data warehouse is like a data dictionary or data catalog in a database management system. In data dictionary, we hold data about logical data structures, data about records and addresses, information about indexes, etc. image.png

It includes a subset of company-wide data that is valuable to a specific set of users. Range is limited to specific selected sub

jects. The data in the Data warehouse should be fairly current, but not primarily up to the minute, although developments in the Data warehouse industry have made standard and incremental data storage more feasible. Data warehouses are lower than Data warehouses and usually contain organization. The current trend in Data warehouse is to develop a Data warehouse with a number of smaller related Data warehouses for specific types of queries and reports.

Management and control components

Management and control elements coordinate services and functions in a Data warehouse. These components control the data transformation and data transfer to the data warehouse for storage. On the other hand, it censors the provision of data to customers. It works with the database management system and allows data to be stored correctly in the repository. It monitors the movement of information into the staging method and from there into the data warehouses themselves.

Why do we need a separate Data warehouse?

Data Warehouse queries are complex because they involve computing large groups of data at summary levels.

It may require the use of separate ways of organizing, accessing, and deploying data based on multiple perspectives.

Executing OLAP queries in the operational database degrades the performance of functional tasks.

Data Warehouse is used for analysis and decision making, where an extensive database, including historical data, is required, which an operational database would not normally maintain.

Separating the operational database from the Data warehouse is based on the different structures and uses of data in these systems.

Since the two systems provide different functions and require different types of data, it is necessary to maintain separate databases.

Benefits of Data warehouse

  • Understand business trends and make better forecast decisions.
  • Data warehouses are designed to handle huge amounts of data well.
  • The structure of a Data warehouse is more accessible for the end user to navigate, understand, and query.
  • Complex queries in many normalized databases can be easier to build and maintain in a Data warehouse.
  • Data warehouse is an effective method to manage the demand for a lot of information from many users.
  • A data warehouse provides the ability to analyze large amounts of historical data.

Before learning about Data Warehouse, you must have a background knowledge of basic database concepts like schema, ER model, structured query language, etc.

This tutorial will help computer science students understand basic to advanced concepts related to data storage.

INDAACADEMY Academy specializes in training courses on DWH / ETL – Aggregation, Normalization and Building Data Warehouse (Basic).

Share the news now

Source : Viblo