Stop using loops – use Vectorization!

Tram Ho

Using Vectorization – A Super Fast Alternative to Loops in Python

Introduction

Loops come naturally to us because we learn about them in most programming languages, so by default we reach for a loop whenever an operation has to be repeated. But with a large number of iterations (millions or billions of rows), loops become a serious bottleneck: you can wait for hours only to realize that the approach won’t finish. This is where vectorization in Python becomes important.

What is vectorization?

Vectorization is a technique for performing array operations (with NumPy) on a data set. Behind the scenes, it applies an operation to all elements of an array or Series in one go, unlike a ‘for’ loop, which processes one element or row at a time.
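
As a minimal sketch of the idea, the snippet below doubles every element of a small array, first with a loop and then with a single vectorized expression. The array contents and variable names are made up for illustration.

```python
import numpy as np

# Hypothetical sample data, chosen only for illustration.
values = np.array([1, 2, 3, 4, 5])

# Loop: Python handles one element per iteration.
doubled_loop = []
for v in values:
    doubled_loop.append(v * 2)

# Vectorized: one expression is applied to the whole array.
doubled_vec = values * 2

print(doubled_vec)
```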

In this article we will look at some use cases where Python loops can easily be replaced with vectorization. This will save you time and make your code more professional.

Problem 1: Finding the sum of numbers

First, we will look at a basic example: finding the sum of numbers using a loop and using vectorization in Python.

Using a loop
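
The original code for this step is not available, so what follows is only a sketch of how a loop-based sum might look. The upper bound `n` is an assumed value, and the timing depends on the machine.

```python
import time

n = 1_500_000  # assumed upper bound, not taken from the article

start = time.perf_counter()
total = 0
for number in range(n):
    # Add one number per iteration.
    total += number
elapsed = time.perf_counter() - start

print(f"sum is {total}")
print(f"loop took {elapsed:.4f} s")
```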

Using Vectorization
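
A vectorized counterpart could be sketched with `np.arange` and `np.sum`. The bound `n` is again an assumed value, and the explicit `int64` dtype is a choice made here to avoid overflow on platforms where the default integer is 32 bits.

```python
import time

import numpy as np

n = 1_500_000  # assumed upper bound, not taken from the article

start = time.perf_counter()
# Build the numbers 0..n-1 as one array and sum them in a single call.
total = np.sum(np.arange(n, dtype=np.int64))
elapsed = time.perf_counter() - start

print(f"sum is {total}")
print(f"vectorized sum took {elapsed:.4f} s")
```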

In this test, vectorization took about 18 times less time than iterating with the range function. The difference becomes more significant when working with a pandas DataFrame.

Problem 2: Mathematical operations (on DataFrame)

When working with a pandas DataFrame, developers often use loops to create new derived columns using mathematical operations. The following example shows how easily loops can be replaced with vectorization for such use cases.

Create DataFrame

A DataFrame is tabular data in the form of rows and columns. We will create a pandas DataFrame with 5 million rows and 4 columns filled with random values from 0 to 50.
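
One possible way to build such a DataFrame is sketched below. The column names 'a' to 'd' are inferred from the columns mentioned later in the article, and the seed and the reduced row count are assumptions made so the sketch runs quickly.

```python
import numpy as np
import pandas as pd

# The article uses 5 million rows; a smaller value is assumed here
# so the sketch runs quickly. Raise it to 5_000_000 to match.
n_rows = 100_000

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

# Integers from 0 to 50 inclusive, in four columns.
df = pd.DataFrame(
    rng.integers(0, 51, size=(n_rows, 4)),
    columns=list("abcd"),
)

print(df.shape)
print(df.head())
```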


We will create a new column ‘scale’ that holds the ratio of columns ‘d’ and ‘c’.

Using a loop
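
A row-by-row version might look like the sketch below, which uses `DataFrame.iterrows`. The data setup is an assumption: the row count is kept small, and values are drawn from 1 to 50 rather than 0 to 50 so the division never hits zero.

```python
import time

import numpy as np
import pandas as pd

n_rows = 10_000  # assumed small size; the article uses 5 million rows
rng = np.random.default_rng(0)
# Values from 1 to 50 (assumption) to avoid dividing by zero.
df = pd.DataFrame(rng.integers(1, 51, size=(n_rows, 4)), columns=list("abcd"))

start = time.perf_counter()
ratios = []
for _, row in df.iterrows():
    # Compute the ratio for one row at a time.
    ratios.append(row["d"] / row["c"])
df["scale"] = ratios
elapsed = time.perf_counter() - start

print(f"loop took {elapsed:.4f} s")
```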

Using Vectorization
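
The same column can be computed with one column-wise expression, as in this sketch. The data setup is assumed: a reduced row count, and values from 1 to 50 so the division never hits zero.

```python
import time

import numpy as np
import pandas as pd

n_rows = 100_000  # assumed size; the article uses 5 million rows
rng = np.random.default_rng(0)
# Values from 1 to 50 (assumption) to avoid dividing by zero.
df = pd.DataFrame(rng.integers(1, 51, size=(n_rows, 4)), columns=list("abcd"))

start = time.perf_counter()
# Divide the whole 'd' column by the whole 'c' column at once.
df["scale"] = df["d"] / df["c"]
elapsed = time.perf_counter() - start

print(f"vectorized division took {elapsed:.4f} s")
```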

The improvement with the DataFrame is significant: in this test, the vectorized operation was almost 1000 times faster than the Python for loop.

Problem 3: If-else statements (on DataFrame)

Many operations require ‘if-else’ logic, and this logic can easily be replaced with vectorized operations in Python. Let’s look at the following example to understand it better (we’ll use the DataFrame we created in Problem 2). Imagine we want to create a new column ‘e’ based on some condition on the existing column ‘a’.

Using a loop
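
The article does not spell out the condition here, so the rule in this sketch is hypothetical: 'e' equals 'd' when 'a' is 0, 'b' minus 'c' when 'a' is at most 25, and 'b' plus 'c' otherwise. The row count is kept small so the loop finishes quickly.

```python
import time

import numpy as np
import pandas as pd

n_rows = 10_000  # assumed small size; the article uses 5 million rows
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 51, size=(n_rows, 4)), columns=list("abcd"))

start = time.perf_counter()
e_values = []
for _, row in df.iterrows():
    # Hypothetical rule, applied to one row at a time.
    if row["a"] == 0:
        e_values.append(row["d"])
    elif row["a"] <= 25:
        e_values.append(row["b"] - row["c"])
    else:
        e_values.append(row["b"] + row["c"])
df["e"] = e_values
elapsed = time.perf_counter() - start

print(f"loop took {elapsed:.4f} s")
```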

Using Vectorization
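
One way to express if-else logic without a loop is nested `np.where`, sketched below. The rule is hypothetical (the article's exact condition is not available): 'e' equals 'd' when 'a' is 0, 'b' minus 'c' when 'a' is at most 25, and 'b' plus 'c' otherwise.

```python
import time

import numpy as np
import pandas as pd

n_rows = 100_000  # assumed size; the article uses 5 million rows
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 51, size=(n_rows, 4)), columns=list("abcd"))

start = time.perf_counter()
# np.where(condition, value_if_true, value_if_false) works on whole
# columns; nesting it gives the if / elif / else chain.
df["e"] = np.where(
    df["a"] == 0,
    df["d"],
    np.where(df["a"] <= 25, df["b"] - df["c"], df["b"] + df["c"]),
)
elapsed = time.perf_counter() - start

print(f"vectorized if-else took {elapsed:.4f} s")
```

Boolean masks with `df.loc` are another common way to write the same logic; `np.where` is used here only because it keeps the rule in a single expression.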

In this test, vectorization was about 600 times faster than Python loops with if-else statements.

Problem 4 (advanced): Use in machine learning and deep learning

Deep learning requires solving many complex equations, often over millions or billions of rows. Running for loops in Python to solve these equations is very slow, and vectorization is the better solution.

For example, consider calculating the value of y for millions of rows in a multilinear regression equation of the general form y = m1*x1 + m2*x2 + m3*x3 + ... + c. Here we can replace loops with vectorization. The values of m1, m2, m3, ... are determined by solving the equation using millions of values corresponding to x1, x2, x3, ... (for the sake of simplicity, we will consider only a simple multiplication step).

Create data
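
The original data setup was shown only as an image, so the shapes below are assumptions: one row of five coefficients `m` and a matrix `x` with five feature columns. The row count is reduced so the sketch runs quickly.

```python
import numpy as np

n = 100_000  # assumed row count; the article talks about millions of rows
rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

# Hypothetical coefficients m1..m5 as a 1 x 5 row.
m = rng.random((1, 5))
# Hypothetical inputs x1..x5, one row per observation.
x = rng.random((n, 5))

print(m.shape, x.shape)
```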


Using a loop
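
A loop-based version of the multiplication step might look like this sketch. The shapes of `m` and `x` are assumptions (five coefficients, five feature columns), and the row count is kept small so the nested loop finishes quickly.

```python
import time

import numpy as np

n = 20_000  # assumed small row count so the loop finishes quickly
rng = np.random.default_rng(0)
m = rng.random((1, 5))  # hypothetical coefficients m1..m5
x = rng.random((n, 5))  # hypothetical inputs x1..x5

start = time.perf_counter()
y_loop = np.zeros(n)
for i in range(n):
    total = 0.0
    for j in range(5):
        # Accumulate m_j * x_j for row i, one product at a time.
        total += m[0, j] * x[i, j]
    y_loop[i] = total
elapsed = time.perf_counter() - start

print(f"loop took {elapsed:.4f} s")
```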

Using Vectorization
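
With `np.dot`, the same multiplication step becomes a single matrix product. As before, the shapes of `m` and `x` are assumptions made for this sketch.

```python
import time

import numpy as np

n = 100_000  # assumed row count; the article talks about millions of rows
rng = np.random.default_rng(0)
m = rng.random((1, 5))  # hypothetical coefficients m1..m5
x = rng.random((n, 5))  # hypothetical inputs x1..x5

start = time.perf_counter()
# (n x 5) times (5 x 1) gives one y value per row, shape (n, 1).
y = np.dot(x, m.T)
elapsed = time.perf_counter() - start

print(y.shape)
print(f"np.dot took {elapsed:.4f} s")
```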


np.dot performs vectorized matrix multiplication behind the scenes. In this test it was 165 times faster than loops in Python.

Conclusion

Vectorization in Python is very fast and should be preferred over loops whenever we work with very large datasets. Start applying it gradually and you will become comfortable thinking about your code in vectorized terms.

References

https://medium.com/codex/say-goodbye-to-loops-in-python-and-welcome-vectorization-e4df66615a52

https://towardsdatascience.com/how-to-speedup-data-processing-with-numpy-vectorization-12acac71cfca


Source : Viblo