Data Analysis with Pandas (Part 2)

Tram Ho

Welcome to part 2 of the Pandas DataFrame series (see Part 1).

Accessing Labels and Data

You already know how to initialize your DataFrame, and you can now retrieve the information from there. With Pandas, you can do the following:

  • Get and modify the row and column labels as sequences
  • Represent data as a NumPy array
  • Check and adjust data types
  • Analyze the size of DataFrame objects

Pandas DataFrame Labels as Sequences

You can get the DataFrame’s row labels with .index and its column labels with .columns:
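Here is a minimal sketch of how that might look. The DataFrame is hypothetical sample data (seven rows, four columns, row labels 101 through 107) made up for illustration, not necessarily the one from Part 1:

```python
import pandas as pd

# Hypothetical sample data, assumed for illustration only.
data = {
    "name": ["Ann", "Bao", "Chi", "Dung", "Emi", "Finn", "Gia"],
    "city": ["Hanoi", "Hue", "Da Nang", "Hanoi", "Hue", "Da Nang", "Hanoi"],
    "age": [41, 28, 33, 34, 38, 31, 37],
    "py-score": [88, 79, 81, 80, 68, 61, 84],
}
df = pd.DataFrame(data, index=range(101, 108))

row_labels = df.index        # Index object holding the row labels
column_labels = df.columns   # Index object holding the column names

print(row_labels)
print(column_labels)
```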

The row and column labels are now available as special kinds of sequences. As with any other Python sequence, you can get a single item:
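For instance, positional indexing works on both label sequences. This is only a sketch on a small hypothetical DataFrame whose names and values are assumptions:

```python
import pandas as pd

# Hypothetical sample data, assumed for illustration only.
df = pd.DataFrame(
    {
        "name": ["Ann", "Bao", "Chi"],
        "city": ["Hanoi", "Hue", "Da Nang"],
        "age": [41, 28, 33],
    },
    index=[101, 102, 103],
)

second_column = df.columns[1]   # item at position 1 of the column labels
last_row_label = df.index[-1]   # negative positions work as in a list

print(second_column)
print(last_row_label)
```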

In addition to extracting a specific item, you can apply other sequence operations, including iterating over the row and column labels. However, this is rarely necessary because Pandas provides other ways to loop through DataFrames, which you will see in the next section.

You can also modify the labels by assigning a new sequence to .index or .columns:

In this example, you use numpy.arange() to create a new sequence of integer row labels, one for each row of the DataFrame.
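A relabeling along these lines could be sketched as follows. The seven-row DataFrame and the label range 10 through 16 are assumptions chosen for illustration. The new sequence has to contain exactly one label per row, otherwise Pandas raises a ValueError:

```python
import numpy as np
import pandas as pd

# Hypothetical seven-row DataFrame, assumed for illustration only.
df = pd.DataFrame(
    {"age": [41, 28, 33, 34, 38, 31, 37]},
    index=range(101, 108),
)

# Replace all row labels at once; np.arange(10, 17) yields 10, 11, ..., 16.
df.index = np.arange(10, 17)

print(df.index)
```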

Remember that if you try to modify a specific item of .index or .columns, then you get a TypeError.
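A short sketch of that restriction, again on hypothetical data. Replacing the whole sequence is allowed, but assigning to a single position is not:

```python
import pandas as pd

# Hypothetical sample data, assumed for illustration only.
df = pd.DataFrame({"age": [41, 28, 33]}, index=[101, 102, 103])

raised = False
try:
    df.index[0] = 999          # Index objects are immutable
except TypeError as error:
    raised = True
    print(f"TypeError: {error}")

# The labels are unchanged after the failed assignment.
print(list(df.index))
```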

Data as a NumPy Array

Sometimes you may want to extract data from a Pandas DataFrame without its labels. To get a NumPy array with the unlabeled data, you can use either .to_numpy() or .values:
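As a rough sketch, both calls can be compared side by side. The DataFrame below is hypothetical sample data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, assumed for illustration only.
df = pd.DataFrame(
    {
        "name": ["Ann", "Bao", "Chi"],
        "age": [41, 28, 33],
        "py-score": [88, 79, 81],
    },
    index=[101, 102, 103],
)

from_method = df.to_numpy()   # recommended spelling
from_attribute = df.values    # older spelling, same kind of result

print(type(from_method))
print(from_method.shape)      # rows x columns, labels are gone
```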

Both .to_numpy() and .values behave similarly, and both return a NumPy array with the data from the Pandas DataFrame.

The Pandas documentation recommends .to_numpy() because of the flexibility offered by its two optional parameters:

  1. dtype: Use this parameter to specify the data type of the resulting array. It is set to None by default.
  2. copy: Set this parameter to False if you want to use the original data from the DataFrame. Set it to True if you want to make a copy of the data.
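The two parameters above can be tried out with a sketch like this one. The column names and values are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data, assumed for illustration only.
df = pd.DataFrame(
    {"age": [41, 28, 33], "py-score": [88, 79, 81]},
    index=[101, 102, 103],
)

# dtype: request a specific element type for the resulting array.
as_float = df.to_numpy(dtype="float64")

# copy=True: the array is independent of the DataFrame.
independent = df.to_numpy(copy=True)
independent[0, 0] = -1        # changes the array only

print(as_float.dtype)
print(df.loc[101, "age"])     # the DataFrame still holds the original value
```

Note that copy=False only allows Pandas to reuse the existing data. It does not guarantee that no copy is made, for example when the columns have different data types.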

However, .values has been around much longer than .to_numpy(), which was introduced in Pandas version 0.24.0. That means you will likely see .values more often, especially in older code.

Data Types

The types of the data values, also called data types or dtypes, are important because they determine how much memory your DataFrame uses, as well as its computation speed and precision. Pandas relies heavily on NumPy data types. However, Pandas 1.0 introduced several additional types:

  • BooleanDtype and BooleanArray support missing Boolean values and Kleene three-valued logic.
  • StringDtype and StringArray represent a dedicated string type.
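To make the two types above more concrete, here is a small sketch with made-up values. Under Kleene logic, a missing value combined with True by "or" is True, while combined with True by "and" it stays missing:

```python
import pandas as pd

# Nullable Boolean array; None is stored as the missing value pd.NA.
flags = pd.array([True, False, None], dtype="boolean")

either = flags | True    # every element is True, even the missing one
both = flags & True      # the missing element stays missing

# Dedicated string type instead of the generic object type.
names = pd.Series(["Ann", None, "Chi"], dtype="string")

print(either)
print(both)
print(names.dtype)
```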

You can get the data type of each column of a Pandas DataFrame with .dtypes:
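A minimal sketch, assuming a hypothetical DataFrame with two text columns and two integer columns:

```python
import pandas as pd

# Hypothetical sample data, assumed for illustration only.
df = pd.DataFrame(
    {
        "name": ["Ann", "Bao", "Chi"],
        "city": ["Hanoi", "Hue", "Da Nang"],
        "age": [41, 28, 33],
        "py-score": [88, 79, 81],
    },
    index=[101, 102, 103],
)

column_types = df.dtypes   # one entry per column

print(column_types)
print(type(column_types))
```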

As you can see, .dtypes returns a Series object with the column names as labels and the corresponding data types as values.

If you want to modify the data type of one or more columns, then you can use .astype():

The most important and only required parameter of .astype() is dtype. It expects a data type or a dictionary. If you pass a dictionary, then the keys are the column names and the values are your desired data types.
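A conversion with a dictionary might be sketched like this. The sample values are assumptions, and both numeric columns are taken to hold integers so that they can be narrowed to int32:

```python
import pandas as pd

# Hypothetical sample data, assumed for illustration only.
df = pd.DataFrame(
    {
        "name": ["Ann", "Bao", "Chi"],
        "age": [41, 28, 33],
        "py-score": [88, 79, 81],
    },
    index=[101, 102, 103],
)

# Keys are column names, values are the target data types.
# .astype() returns a new DataFrame and leaves df unchanged.
df_ = df.astype(dtype={"age": "int32", "py-score": "int32"})

print(df.dtypes)
print(df_.dtypes)
```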

As you can see, the data types of the age and py-score columns in the DataFrame df are both int64, which represents 64-bit (or 8-byte) integers. However, the converted DataFrame df_ stores them with a smaller 32-bit (4-byte) integer data type called int32.

Pandas DataFrame Size

The .ndim, .size, and .shape attributes return the number of dimensions, the total number of data values, and the number of data values along each dimension, respectively:
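The three attributes can be inspected with a sketch like the following, which assumes a hypothetical DataFrame with seven rows and four columns:

```python
import pandas as pd

# Hypothetical sample data, assumed for illustration only.
data = {
    "name": ["Ann", "Bao", "Chi", "Dung", "Emi", "Finn", "Gia"],
    "city": ["Hanoi", "Hue", "Da Nang", "Hanoi", "Hue", "Da Nang", "Hanoi"],
    "age": [41, 28, 33, 34, 38, 31, 37],
    "py-score": [88, 79, 81, 80, 68, 61, 84],
}
df = pd.DataFrame(data, index=range(101, 108))

print(df.ndim)          # number of dimensions: 2
print(df.shape)         # (rows, columns): (7, 4)
print(df.size)          # rows * columns: 28
print(df["age"].ndim)   # a single column is a Series: 1
```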

DataFrame instances have two dimensions (rows and columns), so .ndim returns 2. On the other hand, a Series object has only one dimension, so in that case, .ndim will return 1.

The .shape attribute returns a tuple with the number of rows (in this case, 7) and the number of columns (4). Finally, .size returns an integer equal to the number of values in the DataFrame (28).

You can even check how much memory each column uses with .memory_usage():
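Here is a sketch of such a check. It assumes a hypothetical seven-row DataFrame whose age and py-score columns were converted to int32. The figures for the text columns and the index depend on your platform and Pandas version, so they are not shown:

```python
import pandas as pd

# Hypothetical sample data, assumed for illustration only.
data = {
    "name": ["Ann", "Bao", "Chi", "Dung", "Emi", "Finn", "Gia"],
    "city": ["Hanoi", "Hue", "Da Nang", "Hanoi", "Hue", "Da Nang", "Hanoi"],
    "age": [41, 28, 33, 34, 38, 31, 37],
    "py-score": [88, 79, 81, 80, 68, 61, 84],
}
df = pd.DataFrame(data, index=range(101, 108))
df_ = df.astype({"age": "int32", "py-score": "int32"})

usage = df_.memory_usage()                      # includes an "Index" entry
usage_without_index = df_.memory_usage(index=False)

print(usage)
print(usage_without_index)
```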

As you can see, .memory_usage() returns a Series with the column names as labels and the memory usage in bytes as data values. If you want to exclude the memory usage of the index, which holds the row labels, pass the optional argument index=False.

In the example above, the last two columns, age and py-score, use 28 bytes of memory each. That’s because these columns have seven values, each of which is an integer taking up 32 bits or 4 bytes. Seven integers at 4 bytes each add up to 28 bytes of memory usage.
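The arithmetic can be double-checked with a short sketch. The seven age values are made up for illustration:

```python
import pandas as pd

# Hypothetical column of seven 32-bit integers.
ages = pd.Series([41, 28, 33, 34, 38, 31, 37], dtype="int32", name="age")

bytes_per_value = ages.dtype.itemsize            # 4 bytes for int32
expected_bytes = len(ages) * bytes_per_value     # 7 * 4 = 28
reported_bytes = ages.memory_usage(index=False)  # data only, no index

print(bytes_per_value, expected_bytes, reported_bytes)
```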

At this point, you know how to access the labels, data, data types, and size of a DataFrame. That wraps up part 2. See you in part 3.

Source : Viblo