Foreword
While learning to apply the Pandas library to time series problems, I realized that a handful of basic functions come up again and again, so I would like to share some conclusions I drew from a course on Udemy. These are only basic functions that I personally find popular, so contributions and additions from the Community are very welcome. The content is organized case by case, that is, by type of question, with the Pandas function that answers each one.
Content
Some basic Pandas functions
Which command imports the Pandas library?
import pandas as pd
Which Pandas function reads a CSV file?
pd.read_csv(<file_csv_path>)
For example:
df = pd.read_csv('test.csv')
Similarly, Pandas also supports reading other file formats such as Excel, HTML and so on. You can learn more by searching for the keyword Pandas IO.
Note: the pwd command can be used to check the current working directory, which is where a relative path such as 'test.csv' is looked up.
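As a minimal sketch of the reading step, the snippet below parses a small made-up table. The column names `city` and `test` and the values are assumptions for illustration only, and `io.StringIO` stands in for a real file so that the example runs without `test.csv` on disk.

```python
import io
import os

import pandas as pd

# Hypothetical CSV content; in practice this would live in a file such as 'test.csv'.
csv_text = "city,test\nHanoi,1500\nHue,2500\nDa Nang,1800\n"

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# os.getcwd() is the Python counterpart of the pwd command.
print(os.getcwd())
print(df.shape)  # (number of rows, number of columns)
```

With a real file, the call would simply be `pd.read_csv('test.csv')`, with the path resolved relative to the directory printed by `os.getcwd()`.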
How to display the first rows of a Pandas dataframe?
df.head()
By default, head() returns the first 5 rows of the dataframe, but you can change the number of rows returned by passing a positive integer to head().
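A short sketch of the default and the explicit row count, assuming a made-up dataframe with ten rows in a hypothetical column `test`:

```python
import pandas as pd

# Hypothetical data: one column with ten rows.
df = pd.DataFrame({"test": range(10)})

first_five = df.head()    # no argument: the first 5 rows
first_three = df.head(3)  # explicit argument: the first 3 rows

print(len(first_five), len(first_three))
```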
How to display summary information about a Pandas dataframe?
- A function that gives general information about the dataframe (column names, non-null counts, data types):
df.info()
- A function that describes the dataframe with the count, mean, standard deviation, min, max and percentiles of each numeric column (each field):
df.describe()
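The two calls can be tried on a tiny made-up dataframe, as sketched below. The column `test` and its values are assumptions. Note that describe() reports the standard deviation but not the variance; if the variance is needed, `Series.var()` is one way to get it.

```python
import pandas as pd

# Hypothetical numeric column.
df = pd.DataFrame({"test": [1, 2, 3, 4, 5]})

df.info()                # prints column names, non-null counts and dtypes

summary = df.describe()  # one row per statistic, one column per numeric field
print(summary.index.tolist())

variance = df["test"].var()  # sample variance, not part of describe()
print(variance)
```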
How to get the column names (field names) of a dataframe?
df.columns
How to get the list of distinct (non-duplicate) values in a column?
df['<field name>'].unique()
For example:
df['test'].unique()
How to count the distinct values in a particular column?
df['<field name>'].nunique()
For example:
df['test'].nunique()
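To see how unique() and nunique() relate, here is a small sketch on made-up values in a hypothetical column `test`:

```python
import pandas as pd

# Hypothetical column with repeated values.
df = pd.DataFrame({"test": ["a", "b", "a", "c", "b"]})

distinct = df["test"].unique()     # distinct values, in order of first appearance
n_distinct = df["test"].nunique()  # how many distinct values there are

print(list(distinct), n_distinct)
```

One difference worth keeping in mind: by default nunique() ignores missing values, while unique() includes them in its result.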
How to count the occurrences of each value in a field of a dataframe?
df['<field name>'].value_counts().head(5) # head() only limits how many results are displayed.
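A minimal sketch of value_counts() on made-up data (the column `test` and its values are assumptions). The result is a Series indexed by value and sorted from the most to the least frequent:

```python
import pandas as pd

# Hypothetical column with repeated values.
df = pd.DataFrame({"test": ["a", "b", "a", "c", "a", "b"]})

counts = df["test"].value_counts()  # most frequent value first

print(counts.head(2))  # show only the two most frequent values
```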
How to sort a dataframe and keep the top x rows?
df.sort_values(by='<field to sort by>', ascending=False).head(10) # ascending=False sorts in descending order (the default is ascending), so head(10) returns the top 10
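The sort-then-head pattern can be sketched as follows, assuming a made-up dataframe with hypothetical columns `name` and `test`. `DataFrame.nlargest` is shown as an alternative for numeric columns; it is not part of the original walkthrough.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"name": ["a", "b", "c", "d"], "test": [30, 10, 40, 20]})

# Sort from largest to smallest, then keep the first two rows.
top2 = df.sort_values(by="test", ascending=False).head(2)

# Alternative for numeric columns: ask directly for the two largest values.
top2_alt = df.nlargest(2, "test")

print(top2)
```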
How to get the top x in a dataframe when the rows first have to be grouped by a field?
The solution is to combine groupby and sort_values:
df.groupby(by='<field to group by>').sum().sort_values(by='<field to sort by>', ascending=False).head()
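As a sketch of the groupby and sort_values combination, the example below totals a numeric column per group and ranks the groups. The columns `city` and `sales` and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: several rows per city.
df = pd.DataFrame({
    "city": ["Hanoi", "Hue", "Hanoi", "Hue", "Da Nang"],
    "sales": [10, 5, 20, 1, 8],
})

# Sum the numeric columns per city, then rank the cities by total sales.
totals = (
    df.groupby(by="city")
      .sum()
      .sort_values(by="sales", ascending=False)
      .head()
)

print(totals)
```

After groupby, the grouping field becomes the index of the result, so the city names are read from `totals.index` rather than from a column.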
How to count the rows of a dataframe that meet a condition?
Method 1: The >= operator is only used as an example here; other comparison operators such as <, >, <=, == and != work in the same way.
len(df[df['<field name>'].apply(lambda <alias>: <alias> >= <conditional thresh>)])
# Example: count the rows where the value of test is less than 2000
len(df[df['test'].apply(lambda field: field < 2000)])
Method 2, which I find very cool: comparing a column with a threshold gives a column of True/False values, and summing them counts the True ones.
sum(df['<field name>'] >= <conditional thresh>)
# Example: count the rows where the value of test is less than 2000
sum(df['test'] < 2000)
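To check that both methods agree, here is a small sketch on made-up numbers in a hypothetical column `test`. It uses the `.sum()` method of the boolean Series, which gives the same count as the built-in `sum()`.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"test": [500, 1500, 2500, 1999, 2000]})

# Method 1: apply a Python function row by row, filter, then count the rows.
count_apply = len(df[df["test"].apply(lambda value: value < 2000)])

# Method 2: compare the whole column at once and sum the True values.
mask = df["test"] < 2000      # boolean Series
count_sum = int(mask.sum())

print(count_apply, count_sum)
```

Method 2 avoids calling a Python function for every row, which generally makes it the faster choice on large dataframes.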
Another example: count the rows whose value in the County column does not contain the text 'County':
sum(data['County'].apply(lambda string: 'County' not in string))
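A runnable sketch of this check on made-up values is shown below, together with a vectorized variant based on `Series.str.contains`. The variant is an alternative added here for comparison, not something from the original walkthrough.

```python
import pandas as pd

# Hypothetical values for the County column.
data = pd.DataFrame({"County": ["Orange County", "Boston", "Kings County", "Seattle"]})

# Row-by-row check, as in the example above.
count_apply = sum(data["County"].apply(lambda string: "County" not in string))

# Vectorized alternative: ~ negates the boolean result of str.contains.
count_vec = int((~data["County"].str.contains("County", regex=False)).sum())

print(count_apply, count_vec)
```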
Thank you
Thank you everyone for your support.