Pandas, Dask and Datatable – Which Package is more efficient and useful?

Tram Ho

When it comes to processing tabular data, most of us reach for Pandas to read and manipulate it, and I am no exception. However, I recently read a pretty good article that asks whether Pandas is really the best option. In this post, we will compare the performance of three packages: Pandas, Dask and Datatable.

Pandas

Pandas is a widely used Python library for manipulating tabular data. It is fast, simple and easy to use. However, reading a large csv file with it can take a lot of time. I will not say much more about Pandas in this post because it is already so popular and well known.

Dask

Dask helps with reading large csv files, which is a weak point of Pandas. Dask is an open source library that provides advanced, flexible parallelism for analytical computing. It scales analytics packages such as Pandas and NumPy to multi-core machines and distributed clusters whenever needed. Dask's DataFrame mirrors the Pandas API, which makes things very easy for those who already use and love Pandas.

Datatable

Datatable is another Python library focused on performance. It aims to process large datasets (up to 100GB) at maximum speed on a single machine. Meanwhile, its interoperability with Pandas and NumPy makes it easy to hand data over to another data processing framework.
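
The interoperability mentioned above can be sketched roughly as follows. The helper name frame_to_pandas_and_numpy is made up for this example; dt.Frame, to_pandas and to_numpy are the Datatable calls doing the work, and Pandas and NumPy need to be installed for the conversions.

```python
def frame_to_pandas_and_numpy(columns):
    """Build a datatable Frame from a dict of columns and convert it.

    `frame_to_pandas_and_numpy` is a hypothetical helper for illustration.
    Returns a (pandas DataFrame, NumPy array) pair holding the same data.
    """
    # Imported here so the snippet still loads where datatable is not installed.
    import datatable as dt

    frame = dt.Frame(columns)  # e.g. {"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]}
    return frame.to_pandas(), frame.to_numpy()
```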

Compare

OK. Now let's use some code to compare the processing speed of these 3 libraries more concretely.

Read a csv file

Check reading times of pandas, dask and datatable

Pandas
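
The code for this step did not survive in this copy of the post, so what follows is only a sketch of how such a measurement might look, not the original benchmark. The function name time_pandas_read is an assumption, and the path is whatever csv file you want to test with.

```python
import time


def time_pandas_read(path):
    """Read one csv file with pandas and return (DataFrame, elapsed seconds).

    `time_pandas_read` is a hypothetical helper, not the post's original code.
    """
    # Import before starting the clock so import time is not measured.
    import pandas as pd

    start = time.perf_counter()
    df = pd.read_csv(path)
    elapsed = time.perf_counter() - start
    return df, elapsed
```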

dask

datatable
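
For Datatable, a comparable sketch could use dt.fread, its csv reader. As before, the helper name time_datatable_read is made up and this is not the original code of the post.

```python
import time


def time_datatable_read(path):
    """Read one csv file with datatable and return (Frame, elapsed seconds).

    `time_datatable_read` is a hypothetical helper for illustration.
    """
    # Import before starting the clock so import time is not measured.
    import datatable as dt

    start = time.perf_counter()
    frame = dt.fread(path)
    elapsed = time.perf_counter() - start
    return frame, elapsed
```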

Result

Image: running time

Plotting the results as a chart makes the comparison clearer.
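
I do not know which plotting library the chart was originally made with. One simple way to draw this kind of chart is a matplotlib bar plot, sketched below. The helper name plot_read_times and the dict layout (library name mapped to seconds) are assumptions for this example.

```python
def plot_read_times(timings, output_path):
    """Save a bar chart of read times.

    `timings` maps a library name to the measured seconds, for example
    {"pandas": t1, "dask": t2, "datatable": t3} with your own measurements.
    `plot_read_times` is a hypothetical helper for illustration.
    """
    import matplotlib

    # Non-interactive backend so the chart can be saved without a display.
    matplotlib.use("Agg")
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(list(timings.keys()), list(timings.values()))
    ax.set_xlabel("Library")
    ax.set_ylabel("Read time (seconds)")
    fig.savefig(output_path)
    plt.close(fig)
    return output_path
```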

Image: chart

As the two pictures above show, Dask's file reading time is much shorter than that of Pandas and Datatable. Datatable is only a little faster than Pandas.

Read multiple files at the same time

Because I'm too lazy to prepare more files, I simply read the same 4 files as above, all at once =)))

pandas
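
With Pandas, a common way to read several csv files into one DataFrame is to read them one by one and concatenate the results. The sketch below assumes all files share the same columns. The helper name time_pandas_read_many is made up, and this is not the original code of the post.

```python
import time


def time_pandas_read_many(paths):
    """Read several csv files with pandas into one DataFrame.

    Returns (DataFrame, elapsed seconds). `time_pandas_read_many` is a
    hypothetical helper for illustration.
    """
    # Import before starting the clock so import time is not measured.
    import pandas as pd

    start = time.perf_counter()
    # Read each file in turn and stack the rows; ignore_index renumbers them.
    df = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
    elapsed = time.perf_counter() - start
    return df, elapsed
```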

dask

datatable
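
For Datatable, one possible approach (an assumption on my side, not necessarily what was measured here) is to read each file with dt.fread and stack the resulting frames with dt.rbind. The helper name time_datatable_read_many is made up for this example.

```python
import time


def time_datatable_read_many(paths):
    """Read several csv files with datatable and stack them into one Frame.

    Returns (Frame, elapsed seconds). `time_datatable_read_many` is a
    hypothetical helper for illustration.
    """
    # Import before starting the clock so import time is not measured.
    import datatable as dt

    start = time.perf_counter()
    frames = [dt.fread(p) for p in paths]  # one Frame per file
    combined = dt.rbind(*frames)  # append the rows of all frames
    elapsed = time.perf_counter() - start
    return combined, elapsed
```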

Result

Image: multi-file runtime

Plotting it makes the result easier to see.

Figure: Chart

Looking at the two pictures above, Dask is still the fastest (as I said above, Dask is very efficient for large csv data). When reading many files, however, Pandas is faster than Datatable.

Conclusion

As I mentioned above, Dask is very convenient for large csv files that Pandas struggles to handle, although I did not have data that large to test with here. If your csv or excel data is too big for you to read, think about Dask. As for Datatable, it becomes the more efficient choice when you have around 100GB of data.

For everyday data, research or learning, Pandas is still the best, most useful and ideal choice. Most importantly, it is extremely easy to use and, because it has so many users, there are plenty of resources to refer to.

Finally, I would like to thank you all for reading my post. I wish you a HAPPY holiday. Please upvote if you found it useful.

Reference

[1] https://towardsdatascience.com/pandas-vs-dask-vs-datatable-a-performance-comparison-for-processing-csv-files-3b0e0e98215e

[2] https://mungingdata.com/pandas/read-multiple-csv-pandas-dataframe/

Source : Viblo