Threads and Processes in Python

Tram Ho

What is parallel processing?

Basically doing two things at the same time, either run the code concurrently on different CPUs, or run the code on the same CPU and get the speed boost by taking advantage of the “wasted” CPU cycles while the program is running. Your is waiting for external resource –upload file, call API.

For example, here is a “normal” program. It downloads a list of URLs at a time using a single thread.

Here is the same program using 2 threads. It splits urls between threads giving us almost double the speed.

  1. Add your inside function and return its start and stop time

  1. To visualize single thread, run your function multiple times and store start and stop times

  1. Convert the resulting array of times [start, stop] and plot the bar chart

The diagram shows that multiple threads can be created in the same way. The methods in the Python library return an array of results.

Process vs Thread

Process is an instance of a program (e.g. Jupyter notebook, Python interpreter). Process generates threads (sub-processes) to handle sub-tasks such as reading keystrokes, loading HTML pages, saving files. Threads live inside processes and share the same memory space.

Example: Microsoft Word When you open Word, you create a process. When you start typing, the process creates threads: one thread to read keystrokes, another thread to display text, one thread to automatically save your file, and another thread to flag typos. By creating multiple threads, Microsoft takes advantage of idle CPU time (waiting for keystrokes or file loading) and helps you be more productive.

Process

  • Made by OS to run programs
  • Processes can have multiple threads
  • The cost of processes is higher than threads because it takes longer to open and close the program
  • Two processes can execute code concurrently in the same python program
  • Sharing information between processes is slower than sharing between threads because processes do not share memory space. In python they share information by choosing data structures like arrays that require IO time.

Thread

  • Threads are like small processes that live inside a process
  • They share memory space and efficiently read and write to the same variables
  • Two threads cannot execute code concurrently in the same python program (although there are workarounds*)

CPU vs Core

The CPU, or processor, manages the basic computing work of the computer. CPUs have one or more cores, allowing the CPU to execute code concurrently.

With a single core, there is no speed increase for CPU-intensive tasks (e.g. loops, arithmetic). The operating system switches back and forth between tasks performing one task at a time. This is why for small operations (e.g. downloading a few images), multitasking can sometimes affect your performance. There are costs associated with launching and maintaining multiple tasks.

Python’s GIL problem

CPython (standard python implementation) has something called GIL (Global Interpreter Lock), which prevents two threads from executing concurrently in the same program. Some people are annoyed by this, while others fiercely defend it. However, there are still workarounds, and libraries like Numpy bypass this limitation by running external code in C.

When to use threads vs processes?

  • Processes accelerate CPU-intensive Python operations because they benefit from multiple cores and avoid the GIL.
  • Threads are best for IO tasks or external system related tasks because threads can combine their work more efficiently. Processes need to filter their results to combine them, which takes time
  • Threads offer no benefit in python for CPU intensive tasks because of the GIL. For certain operations like Dot Product, Numpy works around Python’s GIL and executes code in parallel.

Parallel processing examples

Library concurrent.futures of Python is fun to work with. Just pass in your function, the list of items to work on and the number of workers

API calls

I’ve found threads to perform better for API calls and have observed a speed increase over serialization and multiprocessing.

2 threads

4 threads

2 processes

4 processes

IO Heavy Task

I went through a bunch of huge text strings to see how the write performance varies. Threads seem to win here, but multiprocessing also improves runtime.

Serial

4 threads

4 processes

CPU Intensive

Multiprocessing won’t date here as expected. Processes avoid GIL and execute code concurrently on multiple cores.

Serial: 4.2 seconds 4 threads: 6.5 seconds 4 processes: 1.9 seconds

Numpy Dot Product As expected, I don’t see any benefit in adding threads or processes to this code. Numpy executes external C code behind and thus avoids the GIL.

Serial: 2.8 seconds 2 threads: 3.4 seconds 2 processes: 3.3 seconds

References

https://medium.com/@bfortuner/python-multithreading-vs-multiprocessing-73072ce5600b

Share the news now