Wednesday, January 12, 2022

Concurrency, Parallelism, and Barrier Synchronization - Multiprocess and Multithreaded Programming

On preemptive, time-sliced UNIX and Linux operating systems such as Solaris, AIX, Linux, BSD, and OS X, program code from one process executes on the processor for a time slice, or quantum. After this time has elapsed, program code from another process executes for a time quantum. Older Linux schedulers divided CPU time into epochs, with each process given a specified time quantum within an epoch. The execution quantum is so small that the interleaved execution of independent, schedulable entities, often performing unrelated tasks, gives the appearance of multiple software applications running in parallel.

When the currently executing process relinquishes the processor, either voluntarily or involuntarily, another process can execute its program code. This event is known as a context switch, and it is what makes interleaved execution possible. The time-sliced, interleaved execution of schedulable entities on a processor is known as concurrency.

The Linux kernel is fully preemptive, which means that it can force a context switch in favor of a higher-priority process. When a context switch occurs, the state of the running process is saved to its process control block, and another process resumes execution on the processor.

A UNIX process is considered heavyweight because it has its own address space, file descriptors, register state, and program counter. In Linux, this information is kept in the process's task_struct. When a process context switch occurs, all of this state must be saved and later restored, which is a computationally expensive operation.
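
The following minimal C sketch, assuming a POSIX system with the standard fork() and waitpid() interfaces, illustrates the separate address spaces: the child's write to a global variable is never seen by the parent, because each process operates on its own copy.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int counter = 0;   /* each process gets its own copy of this variable */

    int main(void)
    {
        pid_t pid = fork();                /* create a new heavyweight process */
        if (pid < 0) {
            perror("fork");
            exit(EXIT_FAILURE);
        }
        if (pid == 0) {                    /* child: modify the private copy */
            counter = 42;
            printf("child:  counter = %d\n", counter);
            _exit(EXIT_SUCCESS);
        }
        waitpid(pid, NULL, 0);             /* parent: wait for the child */
        printf("parent: counter = %d\n", counter);   /* still prints 0 */
        return 0;
    }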

Concurrency applies to both threads and processes. A thread is an independent sequence of execution within a UNIX process, and it is also considered a schedulable entity. Both threads and processes are scheduled for execution on a processor core, but thread context switching is lighter in weight than process context switching.
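
By contrast, a minimal sketch using POSIX threads (compile with -pthread) shows that threads within one process share its address space: a write made by a worker thread is visible to the main thread.

    #include <pthread.h>
    #include <stdio.h>

    int shared_counter = 0;        /* visible to every thread in the process */

    static void *worker(void *arg)
    {
        (void)arg;
        shared_counter = 42;       /* write to the shared address space */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;

        if (pthread_create(&tid, NULL, worker, NULL) != 0) {
            fprintf(stderr, "pthread_create failed\n");
            return 1;
        }
        pthread_join(tid, NULL);   /* wait for the worker to finish */
        printf("main thread sees shared_counter = %d\n", shared_counter);   /* prints 42 */
        return 0;
    }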

In UNIX, a process often has multiple threads of execution that share the process's address space. When multiple threads of execution are running inside a process, they typically perform related tasks. The user-space APIs for process and thread management abstract away many of these details. The threads library does, however, expose a concurrency level that can be adjusted to hint how many kernel-schedulable entities should back a process's user-space threads, which in turn affects throughput by changing how much execution time those threads receive.
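
A minimal sketch of that interface, assuming the POSIX pthread_setconcurrency() and pthread_getconcurrency() calls; on a 1:1 implementation such as Linux's NPTL the value is recorded only as a hint and has no scheduling effect.

    #define _XOPEN_SOURCE 700      /* expose pthread_setconcurrency() */
    #include <pthread.h>
    #include <stdio.h>

    int main(void)
    {
        /* Hint that roughly four kernel entities should back this
           process's user-space threads.  On NPTL (1:1) the hint is
           stored but does not change scheduling behavior. */
        if (pthread_setconcurrency(4) != 0)
            fprintf(stderr, "pthread_setconcurrency failed\n");

        printf("concurrency level hint: %d\n", pthread_getconcurrency());
        return 0;
    }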

While threads are typically lighter weight than processes, thread implementations have varied across UNIX and Linux operating systems over the years. Three models describe the implementations found on preemptive, time-sliced, multi-user UNIX and Linux systems: 1:1, N:1, and M:N. In the 1:1 model, one user-space thread maps to one kernel thread; in the N:1 model, many user-space threads map to a single kernel thread; and in the M:N model, M user-space threads are multiplexed onto N kernel threads.

In the 1:1 model, one user-space thread is mapped to one kernel thread. This allows for true parallelism, as each thread can run on a separate processor core. However, creating and managing a large number of kernel threads can be expensive.

In the N:1 model, multiple user-space threads are mapped to a single kernel thread. This is more lightweight, as there are fewer kernel threads to create and manage. However, it does not allow for true parallelism, as only one of the process's threads can execute on a processor core at a time.

In the M:N model, M user-space threads are multiplexed onto N kernel threads. This provides a balance between the 1:1 and N:1 models, as it allows for both true parallelism and lightweight thread creation and management. However, it is complex to implement and can lead to issues with load balancing and resource allocation.
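
On glibc-based Linux systems, one way to see which of these models the C library provides is the glibc-specific _CS_GNU_LIBPTHREAD_VERSION configuration string; modern glibc reports NPTL, a 1:1 implementation. A short sketch:

    #define _GNU_SOURCE            /* expose _CS_GNU_LIBPTHREAD_VERSION */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[128];

        /* glibc-specific: report the threading implementation, e.g. "NPTL 2.35". */
        if (confstr(_CS_GNU_LIBPTHREAD_VERSION, buf, sizeof(buf)) > 0)
            printf("threading implementation: %s\n", buf);
        else
            fprintf(stderr, "_CS_GNU_LIBPTHREAD_VERSION not available\n");
        return 0;
    }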

Parallelism on a time-sliced, preemptive operating system means the simultaneous execution of multiple schedulable entities over a time quantum. Both processes and threads can execute in parallel across multiple cores or processors. On a multi-user system with preemptive time-slicing and multiple processor cores, concurrency and parallelism are both at play.

Affinity scheduling refers to the practice of assigning processes or threads to specific processors or cores so that their concurrent and parallel execution is close to optimal. Keeping a schedulable entity on the same core avoids unnecessary migrations and improves overall performance by reducing cache misses and increasing cache hits, among other benefits. In contrast, non-affinity scheduling allows processes and threads to run on any available processor or core, which can result in more frequent migrations, cold caches, and lower performance.
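
A minimal, Linux-specific sketch of explicit affinity using sched_setaffinity() to pin the calling process to core 0; the choice of core 0, and the decision to pin at all, are illustrative assumptions rather than recommendations.

    #define _GNU_SOURCE            /* expose CPU_SET() and sched_setaffinity() */
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(0, &set);          /* restrict this process to core 0 */

        /* Linux-specific: bind the calling process (pid 0 = self) to 'set'. */
        if (sched_setaffinity(0, sizeof(set), &set) == -1) {
            perror("sched_setaffinity");
            return 1;
        }

        printf("pid %ld is now pinned to CPU 0\n", (long)getpid());
        /* CPU-bound work placed here benefits from a warm cache on core 0. */
        return 0;
    }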

Software applications are often designed to solve computationally complex problems. If the algorithm for solving such a problem can be parallelized, then multiple threads or processes can work on parts of the problem at the same time across multiple cores. Each process or thread executes independently and does not contend for resources with the other threads or processes working on the other parts of the problem. When a thread or process reaches the point where it can contribute no more work to the solution, it waits at the barrier, if a barrier has been implemented in software. When all threads or processes have reached the barrier, their work output is synchronized and often aggregated by the primary process. Complex test frameworks often implement barrier synchronization when certain types of tests can be run in parallel. Most individual software applications running on preemptive, time-sliced, multi-user Linux and UNIX operating systems are not, however, designed with heavily parallel multithreaded or multiprocess execution in mind.
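
A minimal barrier sketch using the POSIX pthread_barrier interface, assuming four worker threads plus a primary thread that aggregates the partial results once every participant has arrived at the barrier; the per-worker arithmetic stands in for real work.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_WORKERS 4

    static pthread_barrier_t barrier;
    static double partial_results[NUM_WORKERS];

    static void *worker(void *arg)
    {
        long id = (long)arg;

        /* Each worker computes its share of the problem independently,
           without contending with the other workers. */
        partial_results[id] = (double)(id + 1) * 10.0;   /* stand-in for real work */

        /* Wait at the barrier until every participant has arrived. */
        pthread_barrier_wait(&barrier);
        return NULL;
    }

    int main(void)
    {
        pthread_t tids[NUM_WORKERS];
        double total = 0.0;

        /* The workers plus the primary thread all participate. */
        pthread_barrier_init(&barrier, NULL, NUM_WORKERS + 1);

        for (long i = 0; i < NUM_WORKERS; i++)
            pthread_create(&tids[i], NULL, worker, (void *)i);

        /* The primary thread waits too; when the barrier releases, all
           partial results are complete and can be aggregated. */
        pthread_barrier_wait(&barrier);
        for (int i = 0; i < NUM_WORKERS; i++)
            total += partial_results[i];
        printf("aggregated result: %.1f\n", total);

        for (int i = 0; i < NUM_WORKERS; i++)
            pthread_join(tids[i], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }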

Minimizing lock granularity, that is, holding locks over the smallest critical sections possible, increases concurrency, throughput, and execution efficiency when designing multithreaded and multiprocess software. Multithreaded and multiprocess programs that do not use synchronization primitives correctly often require countless hours of debugging. Semaphores, mutex locks, and other synchronization primitives should be used as sparingly as correctness allows in programs that share resources between multiple threads or processes. Proper program design allows schedulable entities to run in parallel or concurrently with high throughput and minimal resource contention. This is optimal for solving computationally complex problems on preemptive, time-sliced, multi-user operating systems without requiring hard real-time scheduling.
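
A minimal sketch of keeping lock granularity small, assuming POSIX threads: each worker does the bulk of its work on private data with no lock held and acquires the mutex only for the brief update of the shared total.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4
    #define ITERATIONS  100000

    static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
    static long shared_count = 0;

    static void *worker(void *arg)
    {
        (void)arg;
        long local = 0;

        /* The bulk of the work happens on private data, lock-free. */
        for (int i = 0; i < ITERATIONS; i++)
            local++;

        /* Hold the lock only for the short update of the shared resource,
           keeping the critical section, and therefore contention, small. */
        pthread_mutex_lock(&count_lock);
        shared_count += local;
        pthread_mutex_unlock(&count_lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t tids[NUM_THREADS];

        for (int i = 0; i < NUM_THREADS; i++)
            pthread_create(&tids[i], NULL, worker, NULL);
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_join(tids[i], NULL);

        printf("shared_count = %ld\n", shared_count);   /* 4 * 100000 = 400000 */
        return 0;
    }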