Data Analytics for beginners like me

Tram Ho

Continuing from the first part on Data Analytics basics, in this part I'll focus on data types, levels of measurement, and an overview of data visualization, each of which suits different data types and purposes. I invite you to read and discuss.

Some terms and parameters in statistics and Data Analytics

First of all, I will go over a few basic terms that you will encounter a lot in Data Science in general.

  • Observation : A single row or record of data in the database, related to a single event or to data recorded at one point in time. For example, a database table has many rows, each row holds values for many variables, and each row is an observation. It can also be called a case, record, pattern or row.
  • Data sampling : A statistical analysis technique that selects and analyzes a representative subset of data points. The goal is to identify patterns and trends in a larger body of surveyed data. A sample is K randomly selected observations, and K is called the sample size.
  • Mean : The average value of the data, used to describe the central tendency of the data in question.
  • Variance : The average of the squared differences between each data point and the mean. It measures how widely the data is spread around the mean.
  • Standard deviation : The square root of the variance. It measures how much the data varies from the mean, that is, how closely the observations are grouped around the average value of the data set.
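
To make the terms above concrete, here is a minimal sketch using Python's standard `statistics` and `random` modules. The numbers are made up for illustration, and the "population" form of the variance (dividing by the number of observations) is an assumption on my part; `statistics.variance` would divide by K - 1 instead, which is the usual choice for a sample.

```python
import random
import statistics

# Hypothetical data set: each number is one observation
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = statistics.mean(data)            # central tendency
variance = statistics.pvariance(data)   # average squared distance from the mean
std_dev = statistics.pstdev(data)       # square root of the variance

# Data sampling: pick K observations at random (K is the sample size)
random.seed(0)                          # fixed seed so the run is repeatable
K = 3
sample = random.sample(data, K)

print(mean, variance, std_dev)
print(sample)
```

With this data the mean is 5, the variance is 4 and the standard deviation is 2, so most observations sit within about 2 units of the average.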

Data types

There are 3 main data types that you will often encounter:

  • Structured data : Data that is processed, organized, and stored in a fixed format. For example: job position, salary, employee information, …
  • Unstructured data : Data that lacks a specific structure. For example: email, …
  • Semi-structured data : Data that mixes structured and unstructured elements. For example: CSV, JSON documents, …
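
A small sketch of how the three types tend to look in code, using only the Python standard library. The table, the JSON document and the email text are all hypothetical examples, not data from a real system.

```python
import json
import sqlite3

# Structured: a table with a fixed schema (hypothetical employee data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, position TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [("An", "Engineer", 1500), ("Binh", "Analyst", 1200)],
)
total_salary = conn.execute("SELECT SUM(salary) FROM employee").fetchone()[0]

# Semi-structured: JSON has keys and nesting, but fields can vary per record
doc = json.loads(
    '{"name": "An", "skills": ["SQL", "Python"], "contact": {"email": "an@example.com"}}'
)
email = doc["contact"]["email"]
phone = doc["contact"].get("phone")  # this record has no phone field -> None

# Unstructured: free text with no predefined fields
email_body = "Hi team, the Q3 report is attached. Let's discuss on Friday."
word_count = len(email_body.split())

print(total_salary, email, phone, word_count)
```

The structured table can be queried directly, the JSON needs some care around missing fields, and the free text has to be processed before anything can be measured.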

It is often estimated that about 80% of business data is unstructured: documents, videos, emails, images, etc. Unstructured data is often text-heavy, but it can also include data such as dates and numbers.

At a higher level, data can also be divided into 2 large groups:

  • Quantitative data :
    • Data that is countable, measurable, and expressed in numbers
    • Structured data
    • Example: a person who is 1.7 m tall; 2 guavas that cost 20k
  • Qualitative data :
    • Classification of objects based on attributes, descriptive and conceptual
    • Can be observed but not measured
    • Provide insight into an issue
    • Unstructured data
    • For example: a person who has fair skin, lives in the US, and may need medication because they are too thin

In business, to put it simply, quantitative data gives you the numbers to back up general research findings, while qualitative data gives you the detail and depth to understand the full meaning of the data and gain insights into the market and customers.

Hierarchy of data levels of measurement

Data levels of measurement : A classification that describes the nature of information based on the values assigned to variables.

Why the level of measurement matters :

  • It determines the type of statistical analysis you can perform.
  • As a result, it affects both the nature and depth of insights you can glean from your data.
  • Some statistical tests can only be performed on data measured at a higher, more precise level

Therefore, it is essential to plan ahead for how you will collect and measure your data.

There are 4 levels of measurement:

  • Nominal
  • Ordinal
  • Interval
  • Ratio

The table below distinguishes these 4 levels. It includes a rather confusing concept, “true zero”, which I will explain after the table ^^

Nominal | Ordinal | Interval | Ratio
--- | --- | --- | ---
Data can only be classified | Data can be classified and ranked | Data can be classified, ranked, and evenly spaced | Data can be classified, ranked, evenly spaced, and has a “true zero”
You can categorize your data by labeling it, but there is no ordering between the labels | You can sort and rank data in order, but can't say anything about the gap between ranks | You can classify, rank, and infer equal intervals between neighboring data points, but there is no “true zero” | You can classify, rank, and infer equal intervals between neighboring data points, and there is a “true zero”
For example, if you need to classify people into 2 genders, you can use the letter F for women and M for men | Top 5 Olympic medalists, but this scale doesn't tell you how close or far apart they are in terms of wins | Temperature (in degrees Celsius or Fahrenheit) | Weight in grams (continuous); number of employees at a company (discrete); speed in miles per hour (continuous); income per person
  • “True zero” (meaningful zero) : The data has a meaningful zero point. That is, a value of 0 on a ratio scale means a total absence of the quantity you are measuring (if that sounds confusing, keep reading the examples below, my friend).
    • The weakness of the interval scale is that its 0 is just a convention, not an absolute value.
    • For example, 0°C is not the absence of temperature, but the temperature at which water changes from solid to liquid. Likewise 50°C, although numerically five times 10°C, is not five times as hot.
    • With a ratio scale, if you have a population of zero, this means no people! If you have $0 it means you have no purchasing power, but if you have $4 you have twice as much purchasing power as someone with $2.
    • Because a true zero exists, ratio-scale variables generally have no negative values.
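
The four levels can be sketched in a few lines of Python. This is only an illustration with made-up values; the point is which operations make sense at each level, and why a ratio of Celsius temperatures is misleading while a ratio of Kelvin temperatures (a scale that does have a true zero) is not.

```python
import statistics

# Nominal: labels only, so the most frequent label (mode) is meaningful
genders = ["F", "M", "F", "F", "M"]
most_common_gender = statistics.mode(genders)

# Ordinal: ranked categories, so sorting and the middle rank are meaningful
medal_rank = {"bronze": 1, "silver": 2, "gold": 3}
medals = ["gold", "bronze", "silver", "gold", "bronze"]
median_medal = sorted(medals, key=medal_rank.get)[len(medals) // 2]

# Interval: differences are meaningful, ratios are not
temp_a_c, temp_b_c = 10.0, 50.0
difference_c = temp_b_c - temp_a_c   # 40 degrees warmer: meaningful
naive_ratio = temp_b_c / temp_a_c    # 5.0, but NOT "five times as hot"

# The same two temperatures in Kelvin, which has a true zero
kelvin_ratio = (temp_b_c + 273.15) / (temp_a_c + 273.15)

# Ratio: true zero, so ratios are meaningful
money_ratio = 4 / 2                  # $4 buys twice as much as $2

print(most_common_gender, median_medal, difference_c)
print(naive_ratio, round(kelvin_ratio, 3), money_ratio)
```

The Kelvin ratio comes out at roughly 1.14, so 50°C is only about 14% "more temperature" than 10°C, not five times as much.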

Visualize data for decision making

Data Visualization

Data visualization refers to representing data with charts and graphs. It helps users make decisions that lead to the best results, and it also helps answer business questions fully.

To help decision makers, data visualizations must meet the following requirements:

  • Simple
  • Clear
  • Intuitive
  • Consistent (follows a standard template)
  • Follow the trend

The importance of data visualization:

  • See and understand data trends or outliers
  • User-friendly charts and graphs make it easy for businesses to make the right decisions by being able to see patterns, trends, and correlations in the analyzed data
  • Allows users to visually organize and present big data

Exploratory data analysis (EDA) is the first step toward modeling: gain insight into the data set -> understand the important factors that affect the data set -> detect any outliers in the data set -> check the underlying assumptions of the data set

EDA may or may not involve a statistical model, but its main purpose is to see what the data can tell us beyond the formal model.
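
As a rough sketch of the first EDA steps (summarize, then look for outliers), the example below uses only Python's `statistics` module. The order counts are invented, and the 1.5 × IQR rule is just one common convention for flagging outliers that I am assuming here, not the only possible choice.

```python
import statistics

# Hypothetical data set: daily order counts, with one suspicious value
orders = [12, 15, 14, 10, 13, 16, 11, 14, 95, 13]

# Step 1: get a first feel for the data
summary = {
    "count": len(orders),
    "min": min(orders),
    "max": max(orders),
    "mean": statistics.mean(orders),
    "median": statistics.median(orders),
}

# Step 2: flag outliers with the 1.5 * IQR rule (an assumed convention)
q1, _, q3 = statistics.quantiles(orders, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in orders if x < lower or x > upper]

print(summary)
print(outliers)
```

Here the mean (21.3) sits far above the median (13.5), which is already a hint that something unusual is in the data; the IQR rule then points at the value 95.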

Some commonly used visualization charts

Heatmap

  • A chart that visualizes data through variations in color
  • Use warm to cool color spectrum to visualize data
  • Measure relationships between multiple variables and the strength of those relationships through color
  • Intuitive look at correlation
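
Behind a correlation heatmap there is usually a correlation matrix. The sketch below computes one by hand for three hypothetical variables and prints a text stand-in for the colors ('+' for strongly positive, '-' for strongly negative, '.' for weak). The variable names, the data and the 0.5 cutoff are assumptions for illustration; a real chart would pass the same matrix to a plotting library.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical variables measured on the same five observations
data = {
    "ads_spend": [1, 2, 3, 4, 5],
    "sales": [2, 4, 6, 8, 10],    # rises together with ads_spend
    "returns": [5, 4, 3, 2, 1],   # falls as ads_spend rises
}
names = list(data)
corr = {a: {b: pearson(data[a], data[b]) for b in names} for a in names}

def shade(r):
    """Text stand-in for color: '+' warm, '-' cool, '.' weak relationship."""
    if r >= 0.5:
        return "+"
    if r <= -0.5:
        return "-"
    return "."

# One row per variable, one cell per pair of variables
for a in names:
    print(f"{a:>10} " + " ".join(shade(corr[a][b]) for b in names))
```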

Frequency distribution plot

Also known as a histogram

  • Measures the frequency of an event, i.e. the number of times the event occurs across the observations
  • Displays the number of observations that fall within each interval (for example, a time period)
  • Can be presented as a table, line chart, point chart, or pie chart
  • Used to examine or illustrate data collected in a sample
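
Counting how many observations fall into each interval is all a frequency distribution needs. Below is a minimal sketch with `collections.Counter`; the ages and the bin width of 10 are hypothetical choices.

```python
from collections import Counter

# Hypothetical sample: ages of survey respondents
ages = [21, 22, 25, 31, 33, 35, 38, 41, 45, 52]
bin_width = 10

# Label each observation with the lower edge of its bin (20 means 20-29)
frequency = Counter((age // bin_width) * bin_width for age in ages)

# Print the frequency table as a simple text histogram
for lower in sorted(frequency):
    print(f"{lower}-{lower + bin_width - 1}: {'#' * frequency[lower]}")
```

Changing `bin_width` changes the shape of the histogram, so it is worth trying a few values before drawing conclusions.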

Swarm plot

  • Can better represent the distribution of observations
  • Only works well with relatively small data sets
  • A type of scatter plot for categorical variables that shifts points sideways so they do not overlap
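
The idea of avoiding overlap can be sketched without any plotting library. The function below is a simplified illustration that I wrote for this article, not the algorithm any particular library uses: within one category, each point gets a horizontal slot (0, +1, -1, +2, ...) so that points with nearly the same value never share a slot. The name `swarm_slots`, the `min_gap` parameter and the scores are all hypothetical.

```python
def swarm_slots(values, min_gap):
    """Give each value a horizontal slot so that points closer than
    min_gap on the value axis never land in the same slot."""
    slots = [0] * len(values)
    placed = []  # (value, slot) pairs for points already positioned
    for i in sorted(range(len(values)), key=lambda k: values[k]):
        attempt = 0
        while True:
            # Try slots in the order 0, +1, -1, +2, -2, ...
            magnitude = (attempt + 1) // 2
            slot = magnitude if attempt % 2 else -magnitude
            clash = any(s == slot and abs(values[i] - v) < min_gap
                        for v, s in placed)
            if not clash:
                break
            attempt += 1
        slots[i] = slot
        placed.append((values[i], slot))
    return slots

# Hypothetical scores for two categories
scores = {"A": [5.0, 5.1, 5.2, 9.0], "B": [7.0, 7.0]}
layout = {cat: swarm_slots(vals, min_gap=0.5) for cat, vals in scores.items()}
print(layout)
```

Because every new point is compared with all points already placed, the cost grows quickly with the number of observations, which fits the note above that swarm plots only work well with relatively small data sets.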

Dashboard-Based Visualization

  • Provides real-time data visualization
  • Track, analyze and display data points
  • Customized to suit specific requirements
  • Technically (on the backend), the dashboard connects to the data sources, while the interface displays all this data in the form of tables, charts, …
  • Allows businesses to monitor and make informed, quick and timely decisions
  • Steps for dashboard-based visualization:
    • Target audience analysis: who will use the data to make decisions
    • Identify key business metrics (i.e. performance metrics)
    • Understand the ultimate goal of the Dashboard
    • Build Dashboard
    • Enhance and improve Dashboard based on customer experience

Source : Viblo