When it comes to processing table data, most of us will choose Pandas to read and manipulate data, and I am no exception. However, I just read a pretty good article to see if Pandas is the best option? In this article, we will compare the performance of packages: Pandas, Dask, Datatable.
Pandas
Pandas is a Python library that is widely used by everyone, especially when it comes to manipulating tabular data types, which is both fast, simple and easy to use. However, if using a large csv file, it will take a lot of time. In this article, I will not mention Pandas much because Pandas is too popular and useful.
Dask
Dask was born to read large csv files – a problem that pandas has. Dask is an open source library that provides advanced, flexible parallelism for analytical computing. It natively scales these analytics packages into multi-core machines and distributed clusters whenever needed. dask’s data framework uses the Pandas API, making things super easy for those who use and love Pandas.
Datatable
Datatable is another Python library that focuses on improving performance, hoping to process large data (100GB) with maximum speed on 1 machine,. Meanwhile, the interoperability of Datatable and Pandas/Numpy provides easy transition to another data processing framework.
Compare
Ok. Now let’s try to use the code to compare the processing speed of these 3 libraries more intuitively.
Read csv . file
1 2 3 4 5 | <span class="token keyword">import</span> pandas <span class="token keyword">as</span> pd <span class="token keyword">import</span> dask <span class="token punctuation">.</span> dataframe <span class="token keyword">as</span> dd <span class="token keyword">import</span> datatable <span class="token keyword">as</span> dt <span class="token keyword">import</span> matplotlib <span class="token punctuation">.</span> pyplot <span class="token keyword">as</span> plt |
Check reading times of pandas, dask and datatable
Pandas
1 2 3 4 | <span class="token operator">%</span> <span class="token operator">%</span> time df <span class="token operator">=</span> pd <span class="token punctuation">.</span> read_csv <span class="token punctuation">(</span> <span class="token string">"na_filled (1).csv"</span> <span class="token punctuation">)</span> <span class="token builtin">type</span> <span class="token punctuation">(</span> df <span class="token punctuation">)</span> |
dask
1 2 3 4 | <span class="token operator">%</span> <span class="token operator">%</span> time df <span class="token operator">=</span> dd <span class="token punctuation">.</span> read_csv <span class="token punctuation">(</span> <span class="token string">"na_filled (1).csv"</span> <span class="token punctuation">)</span> <span class="token builtin">type</span> <span class="token punctuation">(</span> df <span class="token punctuation">)</span> |
datatable
1 2 3 4 | <span class="token operator">%</span> <span class="token operator">%</span> time df <span class="token operator">=</span> dt <span class="token punctuation">.</span> fread <span class="token punctuation">(</span> <span class="token string">"na_filled (1).csv"</span> <span class="token punctuation">)</span> <span class="token builtin">type</span> <span class="token punctuation">(</span> df <span class="token punctuation">)</span> |
Result
Image: running time
Show the chart to see clearly.
1 2 3 4 5 6 7 8 9 | fig <span class="token operator">=</span> plt <span class="token punctuation">.</span> figure <span class="token punctuation">(</span> <span class="token punctuation">)</span> ax <span class="token operator">=</span> fig <span class="token punctuation">.</span> add_axes <span class="token punctuation">(</span> <span class="token punctuation">[</span> <span class="token number">0</span> <span class="token punctuation">,</span> <span class="token number">0</span> <span class="token punctuation">,</span> <span class="token number">1</span> <span class="token punctuation">,</span> <span class="token number">1</span> <span class="token punctuation">]</span> <span class="token punctuation">)</span> ax <span class="token punctuation">.</span> set_ylabel <span class="token punctuation">(</span> <span class="token string">'ms'</span> <span class="token punctuation">)</span> ax <span class="token punctuation">.</span> set_title <span class="token punctuation">(</span> <span class="token string">'Speed of reading single csv file (4.7MB)'</span> <span class="token punctuation">)</span> lib <span class="token operator">=</span> <span class="token punctuation">[</span> <span class="token string">'pandas'</span> <span class="token punctuation">,</span> <span class="token string">'dask'</span> <span class="token punctuation">,</span> <span class="token string">'datatable'</span> <span class="token punctuation">]</span> perf <span class="token operator">=</span> <span class="token punctuation">[</span> <span class="token number">67.1</span> <span class="token punctuation">,</span> <span class="token number">11.3</span> <span class="token punctuation">,</span> <span class="token number">39.7</span> <span class="token punctuation">]</span> ax <span class="token punctuation">.</span> bar <span class="token punctuation">(</span> lib <span class="token punctuation">,</span> perf <span class="token punctuation">)</span> plt <span class="token punctuation">.</span> show <span class="token punctuation">(</span> <span class="token punctuation">)</span> |
Image: chart
As in the two pictures above, we can easily see that Dask’s file reading time is much faster than Pandas and Datatable. Datatable is faster than Pandas 1 for a little while.
Read multiple files at the same time
Because I’m too lazy to add many files, I always read the above 4 files at the same time =)))
pandas
1 2 3 4 5 6 7 | <span class="token operator">%</span> <span class="token operator">%</span> time files <span class="token operator">=</span> <span class="token punctuation">[</span> <span class="token string">"na_filled (1).csv"</span> <span class="token punctuation">,</span> <span class="token string">"na_filled (1).csv"</span> <span class="token punctuation">,</span> <span class="token string">"na_filled (1).csv"</span> <span class="token punctuation">,</span> <span class="token string">"na_filled (1).csv"</span> <span class="token punctuation">]</span> combined <span class="token operator">=</span> <span class="token punctuation">[</span> <span class="token punctuation">]</span> <span class="token keyword">for</span> f <span class="token keyword">in</span> files <span class="token punctuation">:</span> combined <span class="token punctuation">.</span> append <span class="token punctuation">(</span> pd <span class="token punctuation">.</span> read_csv <span class="token punctuation">(</span> f <span class="token punctuation">)</span> <span class="token punctuation">)</span> combined_df <span class="token operator">=</span> pd <span class="token punctuation">.</span> concat <span class="token punctuation">(</span> combined <span class="token punctuation">,</span> ignore_index <span class="token operator">=</span> <span class="token boolean">True</span> <span class="token punctuation">)</span> |
dask
1 2 3 | <span class="token operator">%</span> <span class="token operator">%</span> time df <span class="token operator">=</span> dd <span class="token punctuation">.</span> read_csv <span class="token punctuation">(</span> files <span class="token punctuation">)</span> |
datatable
1 2 3 4 5 | <span class="token operator">%</span> <span class="token operator">%</span> time df <span class="token operator">=</span> dt <span class="token punctuation">.</span> iread <span class="token punctuation">(</span> files <span class="token punctuation">)</span> df <span class="token operator">=</span> dt <span class="token punctuation">.</span> rbind <span class="token punctuation">(</span> df <span class="token punctuation">)</span> df <span class="token operator">=</span> df <span class="token punctuation">.</span> to_pandas <span class="token punctuation">(</span> <span class="token punctuation">)</span> |
Result
Image: multi-file runtime
Plot up to be more intuitive.
1 2 3 4 5 6 7 8 9 | fig <span class="token operator">=</span> plt <span class="token punctuation">.</span> figure <span class="token punctuation">(</span> <span class="token punctuation">)</span> ax <span class="token operator">=</span> fig <span class="token punctuation">.</span> add_axes <span class="token punctuation">(</span> <span class="token punctuation">[</span> <span class="token number">0</span> <span class="token punctuation">,</span> <span class="token number">0</span> <span class="token punctuation">,</span> <span class="token number">1</span> <span class="token punctuation">,</span> <span class="token number">1</span> <span class="token punctuation">]</span> <span class="token punctuation">)</span> ax <span class="token punctuation">.</span> set_ylabel <span class="token punctuation">(</span> <span class="token string">'s'</span> <span class="token punctuation">)</span> ax <span class="token punctuation">.</span> set_title <span class="token punctuation">(</span> <span class="token string">'Speed of reading multiple csv files'</span> <span class="token punctuation">)</span> lib <span class="token operator">=</span> <span class="token punctuation">[</span> <span class="token string">'pandas'</span> <span class="token punctuation">,</span> <span class="token string">'dask'</span> <span class="token punctuation">,</span> <span class="token string">'datatable'</span> <span class="token punctuation">]</span> perf <span class="token operator">=</span> <span class="token punctuation">[</span> <span class="token number">87</span> <span class="token punctuation">,</span> <span class="token number">21</span> <span class="token punctuation">,</span> <span class="token number">140</span> <span class="token punctuation">]</span> ax <span class="token punctuation">.</span> bar <span class="token punctuation">(</span> lib <span class="token punctuation">,</span> perf <span class="token punctuation">)</span> plt <span class="token punctuation">.</span> show <span class="token punctuation">(</span> <span class="token punctuation">)</span> |
Looking at the two pictures above, Dask is still the fastest (as I said above for large csv data, Dask is very efficient), but when running many files, Pandas reads faster than datatable.
Conclusion
As I mentioned above, Dask library is very convenient with large csv files that Pandas library is difficult to handle, but I do not have such large data to read here. If your csv or excel data is too big and you can’t read it, think about Dask. For Datatable, when you have about 100GB of data, it will be more efficient to use.
If working with normal data or for research or learning, Pandas is still the best, most useful and ideal choice and most importantly, extremely easy to use and has many sources to refer to. Because Pandas has a lot of users.
Finally, I would like to thank you all for reading my post. Wish you have a HAPPY holiday. Upvote for me.
Reference
[2] https://mungingdata.com/pandas/read-multiple-csv-pandas-dataframe/