How to use Pandas with Python

Friday, 16/06/2023

Tram Ho

Data Processing with Pandas in Python

Pandas is a Python library that provides fast, powerful, flexible, and expressive data structures. The library name is derived from "panel data". Pandas is designed to make working with structured (tabular, multidimensional, potentially heterogeneous) data and time series data easy and intuitive.

Pandas aims to be the fundamental high-level building block for practical, real-world data analysis in Python. More broadly, it aims to be the most powerful and flexible open source data analysis/manipulation tool available in any programming language.

Why should you choose Pandas?

Pandas is well-suited to many different types of data:

Tabular data with heterogeneously-typed columns, as in an SQL table or an Excel spreadsheet.
Ordered and unordered (not necessarily fixed-frequency) time series data.
Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels.
Any other form of observational/statistical data sets. The data need not be labeled at all to be placed into a Pandas data structure.

Pandas is built on top of NumPy. Its two main data structures, Series (1-dimensional) and DataFrame (2-dimensional), handle most of the typical use cases in finance, statistics, social sciences, and many technical fields.
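
As a minimal sketch of the two structures named above, the following builds one Series and one DataFrame. The column names and values are made up purely for illustration.

```python
import pandas as pd

# A Series: one-dimensional, labeled values (illustrative data).
prices = pd.Series([10.5, 11.0, 9.8], name="price")

# A DataFrame: a two-dimensional table whose columns may have different types.
df = pd.DataFrame(
    {
        "ticker": ["AAA", "BBB", "CCC"],  # strings
        "price": [10.5, 11.0, 9.8],       # floats
        "volume": [100, 250, 75],         # integers
    }
)

print(prices.ndim)  # 1
print(df.ndim)      # 2
print(df.shape)     # (3, 3)

# Each column of a DataFrame is itself a Series.
print(isinstance(df["price"], pd.Series))  # True
```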

Advantages of Pandas

Easy handling of missing data, represented as NaN, in floating point as well as non-floating point data, as the user prefers: skip it or fill it with 0.
Size mutability: columns can be inserted into and deleted from DataFrame and higher dimensional objects.
Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data in computations.
Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data.
Easily convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects.
Smart label-based slicing, fancy indexing, and subsetting of large data sets.
Intuitive merging and joining of data sets.
Flexible reshaping and pivoting of data sets.
Hierarchical labeling of axes (it is possible to have multiple labels per tick).
Robust IO tools for loading data from flat files (CSV and delimited), Excel files, and databases, and for saving/loading data in the very fast HDF5 format.
Time series specific functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging.
Integrates well with other Python libraries like SciPy, Matplotlib, Plotly, etc.
Good performance.
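
A few of the features listed above can be sketched in a handful of lines. This is only an illustration under made-up data; the variable names, labels, and values are assumptions, not part of the original article.

```python
import numpy as np
import pandas as pd

# Missing data: NaN is skipped by default in reductions, or can be filled with 0.
s = pd.Series([1.0, np.nan, 3.0])
print(s.sum())               # 4.0
print(s.fillna(0).tolist())  # [1.0, 0.0, 3.0]

# Automatic alignment: values are matched by label, not by position.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "x"])
total = a + b
print(total["x"])  # 1 + 30 = 31

# Group by: split rows by "team", then sum each group's scores.
df = pd.DataFrame({"team": ["red", "blue", "red", "blue"], "score": [1, 2, 3, 4]})
totals = df.groupby("team")["score"].sum()
print(totals["red"])   # 1 + 3 = 4
print(totals["blue"])  # 2 + 4 = 6

# Merging: an inner join keeps only keys present in both tables.
left = pd.DataFrame({"key": ["a", "b"], "left_val": [1, 2]})
right = pd.DataFrame({"key": ["b", "c"], "right_val": [20, 30]})
merged = left.merge(right, on="key", how="inner")
print(merged)
```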

Installing the Pandas library

Use pip and type the command: pip install pandas
Or with Anaconda, use the command: conda install pandas

For other ways to install Pandas, see the installation guide in the Pandas documentation.

Note: Pandas depends on the NumPy library. pip installs NumPy automatically as a dependency, and if you install using Anaconda then NumPy is already available.

Importing the Pandas library

import pandas as pd

You shouldn't change the alias pd to something else, because the documentation and most examples follow this convention.
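
A quick way to confirm that the installation and import work is to print the installed version. This is a small sanity check, not a required step.

```python
import pandas as pd

# If this prints a version string, Pandas is installed and importable.
print(pd.__version__)
```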

Working with basic data structures

Pandas has two basic data structures:

Series (1 dimension)
DataFrame (2 dimensions)

Panel (3 dimensions) used to be a data structure in Pandas, but it was removed in version 0.25.

1. Series

Series([data, index, dtype, name, copy,...])

A Series is a one-dimensional array, like a NumPy array or a single column of a table, that also carries an array of labels called the index. A Series can be initialized from a NumPy array, a dict, or a scalar value. It has many attributes such as index, array, values, and dtype. You can convert a Series to a specified dtype, create a copy, return the bool of a single element, convert a Series with a DatetimeIndex to one with a PeriodIndex, etc.
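
The initialization routes and a few of the attributes and conversions mentioned above can be sketched as follows. The variable names and data are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Three ways to initialize a Series.
from_array = pd.Series(np.array([1.5, 2.5, 3.5]))
from_dict = pd.Series({"a": 1, "b": 2})
from_scalar = pd.Series(7, index=["x", "y", "z"])  # the scalar is repeated per label
print(from_scalar.tolist())  # [7, 7, 7]

# Some attributes.
print(from_dict.index.tolist())  # ['a', 'b']
print(from_array.dtype)          # float64
print(from_array.values)         # the underlying NumPy array

# Conversion to another dtype (float -> int truncates toward zero).
as_int = from_array.astype("int64")
print(as_int.tolist())  # [1, 2, 3]

# A copy is independent of the original.
duplicate = from_dict.copy()
duplicate["a"] = 100
print(from_dict["a"])  # 1

# DatetimeIndex -> PeriodIndex.
dated = pd.Series([1, 2], index=pd.date_range("2023-06-16", periods=2, freq="D"))
print(type(dated.to_period().index).__name__)  # PeriodIndex
```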

Some examples of manipulating Series:

Create Series

Example 1: Without passing an index

import pandas as pd

s = pd.Series([0, 1, 2, 3])
print(s)

Output:

0    0
1    1
2    2
3    3
dtype: int64

If no index is passed, Pandas creates a default one running from 0 to len(data)-1.
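
The default index can be inspected directly. A short sketch, reusing the same data as Example 1:

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3])

print(s.index)           # RangeIndex(start=0, stop=4, step=1)
print(s.index.tolist())  # [0, 1, 2, 3]

# The default labels are exactly 0 .. len(data)-1.
print(s.index.tolist() == list(range(len(s))))  # True
```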

Example 2: Passing an index

import pandas as pd

s = pd.Series([0, 1, 2, 3], index=["a", "b", "c", "d"])
print(s)

Output:

a    0
b    1
c    2
d    3
dtype: int64
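
Once a Series has labels, values can be looked up by label as well as by position. A minimal sketch using the Series from Example 2:

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3], index=["a", "b", "c", "d"])

print(s["b"])      # 1  (lookup by label)
print(s.loc["c"])  # 2  (explicit label-based lookup)
print(s.iloc[0])   # 0  (explicit position-based lookup)

# Label slices include both endpoints.
print(s.loc["b":"c"].tolist())  # [1, 2]

# Membership tests check the labels, not the values.
print("d" in s)  # True
```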

Example 3: Creating a Series from a dict

import pandas as pd

data = {'a': -1.3, 'b': 11.7, 'd': 2.0, 'f': 10, 'g': 5}
ser = pd.Series(data, index=['a', 'c', 'b', 'd', 'e', 'f'])
print(ser)

Output:

a    -1.3
c     NaN
b    11.7
d     2.0
e     NaN
f    10.0
dtype: float64

We create a dict with the keys a, b, d, f, g, then create a Series from it with an explicit index. The labels c and e are not in the dict, so the data at these labels is missing, and Pandas displays NaN to indicate that. The key g is in the dict but not in the index, so it is left out of the Series.
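
The missing entries can then be detected, dropped, or filled. The following sketch reuses the data from Example 3; the choice of 0 as the fill value is only an example.

```python
import pandas as pd

data = {'a': -1.3, 'b': 11.7, 'd': 2.0, 'f': 10, 'g': 5}
ser = pd.Series(data, index=['a', 'c', 'b', 'd', 'e', 'f'])

# Detect missing values.
print(ser.isna().tolist())  # [False, True, False, False, True, False]

# Drop them: only labels that had data remain.
print(ser.dropna().index.tolist())  # ['a', 'b', 'd', 'f']

# Or fill them with a chosen value.
print(ser.fillna(0).tolist())  # [-1.3, 0.0, 11.7, 2.0, 0.0, 10.0]

# 'g' was in the dict but not in the requested index.
print('g' in ser)  # False
```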


Source : Viblo
