Using Vectorization – A Super Fast Alternative to Loops in Python
Introduction
Loops come naturally to us; they are among the first constructs we learn in most programming languages, so by default we reach for a loop whenever an operation has to be repeated. But when we work with a large number of iterations (millions or billions of rows), using loops is painfully slow: you can wait for hours, only to realize the job will never finish in a reasonable time. This is where vectorization in Python becomes super important.
What is vectorization?
Vectorization is the technique of performing array operations (with NumPy) on a data set. Under the hood, it applies an operation to all elements of an array or Series in one go, unlike a ‘for’ loop that manipulates one row at a time.
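To make the definition concrete, here is a minimal sketch; the `prices` array and the 10% discount are made up purely for illustration:

```python
import numpy as np

# a toy array of prices (hypothetical data for illustration)
prices = np.array([10.0, 20.0, 30.0])

# loop version: one element at a time
discounted_loop = np.empty_like(prices)
for i in range(len(prices)):
    discounted_loop[i] = prices[i] * 0.9

# vectorized version: the whole array in one expression
discounted_vec = prices * 0.9

print(discounted_vec)  # [ 9. 18. 27.]
```

Both produce the same result, but the vectorized expression dispatches the multiplication to optimized C code inside NumPy instead of the Python interpreter.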
In this article we will look at some use cases where Python loops can easily be replaced with vectorization. This will save you time and help you code like a pro.
Problem 1: Finding the Sum of Numbers
First, we will look at a basic example of finding the sum of numbers using loops and Vectorization in python.
Using a Loop
```python
import time

start = time.time()

# iterative sum
total = 0
# iterating through 1.5 million numbers
for item in range(0, 1500000):
    total = total + item

print('sum is:' + str(total))
end = time.time()
print(end - start)
# 1124999250000
# 0.14 seconds
```
Using Vectorization
```python
import time
import numpy as np

start = time.time()

# vectorized sum - using NumPy for vectorization
# np.arange creates the sequence of numbers from 0 to 1499999
print(np.sum(np.arange(1500000)))

end = time.time()
print(end - start)
# 1124999250000
# 0.008 seconds
```
Vectorization takes roughly 18x less time to execute than iterating with the range function. The difference becomes even more significant when working with a Pandas DataFrame.
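Wall-clock timings from `time.time()` are noisy and vary by machine. If you want to reproduce the comparison more reliably, the standard-library `timeit` module runs a snippet many times and reports the total; this sketch uses the built-in `sum` over `range` as a stand-in for the explicit for loop above (the exact times you see will differ):

```python
import timeit

# loop-style sum over 1.5 million numbers, repeated 10 times
loop_time = timeit.timeit("sum(range(1500000))", number=10)

# vectorized sum with NumPy, repeated 10 times
vec_time = timeit.timeit(
    "np.sum(np.arange(1500000))", setup="import numpy as np", number=10
)

print(f"loop: {loop_time:.3f}s, vectorized: {vec_time:.3f}s")
```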
Problem 2: Mathematical Operations (on a DataFrame)
While working with a Pandas DataFrame, developers often use loops to create new derived columns through mathematical operations. In the following example we can see how easily such loops can be replaced with vectorization.
Create DataFrame
A DataFrame is tabular data in the form of rows and columns. We will create a Pandas DataFrame with 5 million rows and 4 columns filled with random integers from 0 to 49.
```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 50, size=(5000000, 4)),
                  columns=('a', 'b', 'c', 'd'))
df.shape
# (5000000, 4)
df.head()
```
We will create a new column ‘ratio’ from the ratio of columns ‘d’ and ‘c’.
Using a Loop
```python
import time

start = time.time()

# iterating through the DataFrame using iterrows
for idx, row in df.iterrows():
    # creating a new column
    df.at[idx, 'ratio'] = 100 * (row["d"] / row["c"])

end = time.time()
print(end - start)
### 109 seconds
```
Using Vectorization
```python
start = time.time()

df["ratio"] = 100 * (df["d"] / df["c"])

end = time.time()
print(end - start)
### 0.12 seconds
```
We can see a significant improvement with the DataFrame: the vectorized operation is almost 1,000 times faster than the Python for loop.
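As a further tweak (not part of the original comparison), you can hand the division directly to NumPy with `.to_numpy()`, which skips pandas' index-alignment overhead on large frames. A small sketch with toy values in the same column names:

```python
import pandas as pd

# toy frame with the same column names as above
df = pd.DataFrame({'c': [2.0, 4.0], 'd': [1.0, 2.0]})

# .to_numpy() hands the computation to NumPy directly,
# bypassing pandas' index alignment
df['ratio'] = 100 * (df['d'].to_numpy() / df['c'].to_numpy())

print(df['ratio'].tolist())  # [50.0, 50.0]
```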
Problem 3: If-else Statements (on a DataFrame)
We perform a lot of operations that require ‘if-else’ logic. These can easily be replaced with vectorized operations in Python. Let’s look at the following example to understand it better (we’ll reuse the DataFrame created in Problem 2): imagine we want to create a new column ‘e’ based on conditions on the existing column ‘a’.
Using a Loop
```python
import time

start = time.time()

# iterating through the DataFrame using iterrows
for idx, row in df.iterrows():
    if row.a == 0:
        df.at[idx, 'e'] = row.d
    elif (row.a <= 25) and (row.a > 0):
        df.at[idx, 'e'] = row.b - row.c
    else:
        df.at[idx, 'e'] = row.b + row.c

end = time.time()
print(end - start)
### Time taken: 177 seconds
```
Using Vectorization
```python
start = time.time()

df['e'] = df['b'] + df['c']
df.loc[df['a'] <= 25, 'e'] = df['b'] - df['c']
df.loc[df['a'] == 0, 'e'] = df['d']

end = time.time()
print(end - start)
## 0.28 seconds
```
Vectorization is about 600 times faster than Python loops with if-else statements.
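An alternative way to express the same three-branch logic in a single call is `np.select`, which evaluates a list of conditions per row and picks the value from the first one that matches. A sketch on a tiny toy DataFrame with the same column names:

```python
import numpy as np
import pandas as pd

# toy rows covering each branch: a == 0, 0 < a <= 25, a > 25
df = pd.DataFrame({'a': [0, 10, 40],
                   'b': [5, 5, 5],
                   'c': [2, 2, 2],
                   'd': [9, 9, 9]})

conditions = [df['a'] == 0, df['a'] <= 25]
choices = [df['d'], df['b'] - df['c']]

# the first matching condition wins; default covers the 'else' branch
df['e'] = np.select(conditions, choices, default=df['b'] + df['c'])

print(df['e'].tolist())  # [9, 3, 7]
```

Because conditions are checked in order, the `a == 0` case takes priority over `a <= 25`, mirroring the if/elif/else chain above.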
Problem 4 (Advanced): Machine Learning/Deep Learning Networks
Deep learning requires solving many complex equations, over millions or even billions of rows. Running Python for loops to solve these equations is very slow, and vectorization is the optimal solution.
For example, to calculate the value of y for millions of rows with the following multilinear regression equation:

y = m1*x1 + m2*x2 + m3*x3 + m4*x4 + m5*x5
we can replace the loop with vectorization. The values of m1, m2, m3, … are determined by solving the above equation using millions of values for x1, x2, x3, … (for the sake of simplicity, we will just consider a simple multiplication step).
Create data
```python
import numpy as np

# setting initial values of m
m = np.random.rand(1, 5)

# input values for 5 million rows
x = np.random.rand(5000000, 5)
```
Using a Loop
```python
import time
import numpy as np

m = np.random.rand(1, 5)
x = np.random.rand(5000000, 5)
zer = np.zeros(5000000)  # output array for the results

tic = time.process_time()

for i in range(0, 5000000):
    total = 0
    for j in range(0, 5):
        total = total + x[i][j] * m[0][j]
    zer[i] = total

toc = time.process_time()
print("Computation time = " + str(toc - tic) + " seconds")
#### Computation time = 28.228 seconds
```
Using Vectorization
```python
tic = time.process_time()

# dot product
np.dot(x, m.T)

toc = time.process_time()
print("Computation time = " + str(toc - tic) + " seconds")
#### Computation time = 0.107 seconds
```
np.dot performs vectorized matrix multiplication in the backend. It is about 165 times faster than the Python loop.
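For completeness, the `@` matrix-multiplication operator is equivalent to `np.dot` for 2-D arrays. A small sketch with a seeded generator and a 10-row stand-in for the real 5-million-row data:

```python
import numpy as np

rng = np.random.default_rng(0)
m = rng.random((1, 5))
x = rng.random((10, 5))   # small stand-in for the 5-million-row array

y_dot = np.dot(x, m.T)    # shape (10, 1)
y_at = x @ m.T            # the @ operator invokes the same matrix multiply

print(np.allclose(y_dot, y_at))  # True
```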
Conclusion
Vectorization in Python is super fast and should be preferred over loops whenever we work with very large datasets. Start applying it gradually, and you will soon become comfortable thinking in terms of vectorized operations.