## Synchronization



- Threads
- Processes (or tasks)
  - Independent
  - Considerable state info
  - Separate address space
- Resources
  - Memory, file handles, sockets, device handles
  - Owned by processes





- Thread of a same process share the same resources
- Processes can share resources only through explicit methods
- The only resources exclusively owned by a thread are the thread stack, register status and thread-local storage (if any)



- A scheduler is responsible to allocate tasks and threads to the available computing resources
- Various kind of scheduling algorithms
  - Best-effort → minimize makespan
  - Real-Time  $\rightarrow$  meet deadlines
  - ... other metrics to minimize
- Preemption and context changes
- Access to shared resources



- Classic multithreading on single processor systems required Time Division Multiplexing (TDM)
  - Time driven
  - Event driven
- Multiprocessors → different threads and processes can run on different CPUs
- Multithreading is easier ("native") on multicore platforms
- But scheduling requires more attention

Multithreading issues

- Race conditions
- Starvation, priority inversion, deadlock, livelock
  - Mamihlapinatapai
- Synchronization (mutex, lock)
- Atomic execution (semaphores)
- Communication:
  - shared-memory (requires locking)
  - message-passing (slower but easier)

Different kinds of Parallelisms

- Instruction Level Parallelism (ILP)
- Data Level Parallelism (DLP)
- Thread Level Parallelism (TLP)

Instruction Level Parallelism (ILP)

- Execute multiple instruction per clock cycle
- Each functional unit on a core is an execution resource within a single CPU:
  - Arithmetic Logic Unit (ALU)
  - Floating Point Unit (FPU)
  - bit shifter, multiplier, etc.
- Need to solve data dependencies



- Consider the sequential code:
  - 1. e = a + b
  - 2. f = c + d
  - 3. g = e \* f
- Operation 3. depends on the results of operations 1. and 2.
- Cannot execute 3. before 1. and 2. are completed

How to "parallelize" software?

- Parallelism can be extracted from ordinary programs
  - At run-time (by complex specific HW)
  - At compile-time (simplifying CPU design and improving run-time performances)
- Degree of ILP is application dependent



- Data dependency check in HW
- Complex mechanisms
  - power, die-space and time consuming
- Problematic when
  - code difficult to predict
  - Intructions have many interdependencies





(2x) Superscalar execution



- Instruction pipelining
- Superscalar execution
- Out-of-order execution
  - deferred memory accesses
  - combined load and store
- Speculative execution
  - branch prediction,
  - speculative load
  - • •



- Intel describes its processors having
  - "in-order front end"
  - "out-of-order execution engine"



ILP: compile-time techniques

- Compiler decides which operations can run in parallel
- Removes the complexity of instruction scheduling from HW to SW
- New instruction sets that explicitly encode multiple independent operations per instruction
  - Very Long Instruction Word (VLIW): one instruction encodes multiple operations (one for each execution unit)
  - Explicitly Parallel Instruction Computing (EPIC): adds features to VLIW (cache prefetching instructions, ...)

## Data Level Parallelism (DLP)

- Higher parallelism than superscalar architecture
- SIMD instructions (Single Instruction, Multiple Data)
  - Intel's MMX, SSE, SSE2, SSE3, SSE3, SSSE3, SSE4, AVX
  - AMD's 3DNow!, SSE5
  - ARM's NEON, IBM's AltiVec and SPE, etc.
  - Graphic cards (GPU)
  - Cell Processor's SPU
- Useful when the same operation has to be applied to a large set of data (i.e., multimedia, graphic operations on pixels, etc.)
- Multiple data are read and/or modified at the same same

Thread Level Parallelism (TLP)

- Higher level parallelism than ILP
- Different kinds of TLP
  - Superthreading
  - Hyperthreading or Symultaneous MultiThreading (SMT)
    - Needs superscalar processor
  - Chip-level MultiProcessing (CMP)
    - Needs multicore architecture
  - Combinations of the above solutions



- Temporal Multithreading (fine- or coarsegrained) → when processor idle, execute instruction of another thread
- Makes better use of the computing resources when a thread is blocked
- Requires adequate hardware support to minimize context change overhead



- Simultaneous MultiThreading (SMT)
- Introduced in late 90s: Intel's Pentium 4
- Execute instructions from multiple threads simultaneously → needs superscalar support
- Energy inefficient
  - Increases cache thrashing by 42%, whereas dual core results in a 37% decrease









Multithreaded processor: able to handle more than one thread at a time

Interleaved execution of different threads (helps masking memory latency)

All instructions in a pipeline stage must come from the **same** thread







- From OS perspective: many "logical" processors
- Average ILP for a thread = 2.5 instr/cycle
  - Pentium 4 issues at most 3 instr/cycle to the execution core
- Hyperthreaded processor can exploit parallelism beyond a single thread ILP



## SMT can be worse than non-SMT approaches

- A thread monopolizing an execution unit for many consecutive pipeline stages can stall the available functional units (that could have been used by other threads)
- Cache thrashing problem
  - Different logical processor can execute two threads accessing completely different memory areas



 A smart SMT-aware OS running on a multicore would schedule two different tasks on different processors → resource contention is minimized



- Out-of-order execution problems on multicore platforms
- Data dependencies
- Race conditions
- Memory barriers
- Locking mechanisms
- Spinlocks, semaphores, mutexes, RCU

Introduction: execution ordering

- A processor can execute instructions in any order (or in parallel), provided it maintains program causality with respect to itself
- The compiler may reorder instructions in many ways, provided causality maintainance
- Some CPUs are more constrained than others
  - E.g., i386, x86\_64, UltraSPARC are more constrained than PowerPC, Alpha, etc.
  - Linux assumes the DEC Alpha execution ordering
     → the most relaxed one



- Loads are more likely to need to be completed immediately
- Stores are often deferred
- Loads may be done speculatively, leading to discarded results
- Order of memory accesses may be rearranged to better use buses and caches
- Multiple loads and stores can be parallelized



- CPU or Compiler optimizations: overlapping instructions may be replaced
  - E.g., two consecutive loads to the same value/register

1) 
$$A = V;$$
  
2)  $A = W;$   $A = W;$ 

 A load operation may be performed entirely inside the CPU

1) 
$$*P = A;$$
  
2)  $B = *P;$   
1)  $*P = A;$   
2)  $B = A;$ 



- While the caches are expected to be coherent, there's no guarantee that that coherency will be ordered
- Whilst changes made on one CPU will eventually become visible on all CPUs, there's no guarantee that they will become apparent in the same order on those other CPUs



- If an instruction in the stream depends on an earlier instruction, then that earlier instruction must be "sufficiently complete" before the later instruction may proceed
- Need to analyse the *dependencies* between operations



- A "dependency" is a situation in which an instruction refers to the data of a preceding instruction
- Data dependencies
  - Read after Write
  - Write after Read
  - Write after Write
- Control dependencies



- True (or flow) dependence:
  - an instruction depends on the result of a previous instruction
- Example:

A is modified by the first instruction and used in the second one

It is not possible to use instruction level parallelism



- Antidependence:
  - an instruction requires a value that is updated by a later instruction
- Example:

A is modified by the second instruction and used by the first one

It is not possible to use instruction level parallelism



- Output dependence:
  - two instructions modify the same resource
- Example:

A is modified by both instructions

It is not possible to use instruction level parallelism



- Read after Read a.k.a Input dependency
  - two instructions read the same resource

A is read by both instructions

ILP is possible!

- Control dependency
  - instruction execution depends on a previous instruction  $\rightarrow$  e.g., conditional statements

"A = 5" executed only if x is true in the previous instruction Considerations on dependencies

- No need to consider dependencies when a strict sequential execution model is used
- Otherwise, dependence analysis is needed for
  - CPU exploiting instruction level parallelism
  - out-of-order execution
  - parallelizing compiler
- Need to enforce execution-order constraints to limit the ILP of a system
- Usually automatically solved by compiler/HW
- Concerns the behavior of a single thread

More than dependencies...

- When more threads are concurrently executed → additional "dependency" problems
- Instructions from different tasks/threads may need to access the same global resources
- Problems similar to dependency problems, but more difficult to detect/solve
- Explicit programmer intervention is needed
- Need to synchronize the accesses to shared resources

Synchronization mechanisms

## Process synchronization

- Barrier
- Lock/semaphore
- Critical sections
- Thread join
- Mutex
- Non-blocking synchronization
- Data synchronization
  - Keep multiple copies of a set of data coherent with one another
  - E.g., cache coherency, cluster file systems, RAID, etc.



- Race conditions
- Deadlock
- Livelock
- Starvation
- Lack of fairness



- More tasks read/modify the same global variable
- "Race" for which task modifies the global value first
- Output depends on the particular sequence of task operations
- Non-deterministic behavior



global int 
$$A = 0$$

```
task T1 ()

A = A + 1

print A

end task
```

task T2 () A = A + 1print A end task

- Two concurrent tasks T1 and T2 are activated
- Both tasks modify the same global variable A
- Expected behavior:
  - T1 reads 0
  - T1 increments A to 1
  - T1 prints 1
  - T2 reads 1
  - T2 increments A to 2
  - T2 prints 2
  - Expected final output = 2



global int 
$$A = 0$$

task T1 () A = A + 1print A end task

task T2 () A = A + 1print A end task

- Problem if operation
   "A=A+1" is not performed atomically
- Possible behavior:
  - T1 reads 0
  - T2 reads 0
  - T1 increments A to 1
  - T2 increments A to 1
  - T1 prints 1
  - T2 prints 1
- Final output = 1 !!!



Two global variables<br/> $\{ A = 1; B = 2 \}$ <br/>CPU1A = 3;<br/>B = 4;with out-of-<br/>order executionx = A;<br/>y = B;

- n! = 24 different combinations for memory accesses
- 4 different outputs:
   (x,y) = (1,2); (1,4); (3,2); (3,4)







Five global variables  

$${A = 1, B = 2, C = 3, P = &A, Q = &C }$$
  
CPU1  
 $B = 4;$   
P = &B  
with out-of-  
order execution  
 $Q = P;$   
 $D = *Q;$ 

 Partial order enforcement by CPU2 → only 12 different combinations for memory accesses and three different outputs:

(Q=P,D) = (&A,1); (&B,2); (&B,4)

D will never receive the value in C



- Some devices present their control interfaces as collections of memory locations
- The order in which control registers are accessed may be crucial
- E.g.: Ethernet card with internal register set accessed through address (A) and data (D) port registers

To read internal register #5:



Read register #5:

- Out-of-order execution may cause second instruction be executed before the first one
   → malfunction!
- Since there is no explicit data dependency, these problems are difficult to detect



 Dependent memory access on a given CPU are always executed in order

 Overlapping load and stores within a CPU maintain functional dependencies



# ...and that cannot be assumed

 Order of execution of independent load and stores

 Order of execution of overlapping load and stores (preserving functional dependencies)

1) 
$$A = *P;$$
  
2)  $B = *(P + 4);$  or   
1)  $*P = A;$   
2)  $B = *P;$   
may be:  $1 \rightarrow 2$   
 $2 \rightarrow 1$   
 $1 \& 2$   
may be:  $1 \rightarrow 2$   
 $1 \& 2 (*P=B=A)$ 



- Class of instructions to enforce an order on memory operations
- Low-level machine code operating on shared memory
- Ensures that all operations preceding a membar are executed before all operations following the barrier



|       | CPU 1:                      |
|-------|-----------------------------|
| load  | eax, 0                      |
| while | (eax == 0)<br>load eax, [a] |
| print | [b]                         |
|       | CPU 2:                      |

- Two CPUs, each one running a task using global variables a and b
- Expected behavior:
  - CPU1 spins until a becomes ≠ 0
  - Then prints the *b* value stored by CPU2
- Expected output: 33



|       | CPU 1:                      |
|-------|-----------------------------|
| load  | eax, 0                      |
| while | (eax == 0)<br>load eax, [a] |
| print | [b]                         |
|       | CPU 2:                      |

- When CPU2's operations may be executed out-of-order:
  - *a* may be updated before *b*
  - CPU1 prints a meaningless value
- Not acceptable!



#### CPU 1: load eax, 0

print [b]

CPU 2: store 33, [b] **membar** store 1, [a]

- Use a memory barrier before CPU2's second instruction
- "store 33, [b]" will be always executed before "store 1, [a]"
- Final output: 33

Why/when using membars?

- Needed when out-of-order execution or concurrent programming is used
- Many tricks to improve performances
  - reordering
  - deferral and combination of memory operations;
  - speculative loads and branch prediction
  - various types of caching
- Memory barriers allow overriding such tricks
- Instruct the compiler and the CPU to restrict the order

Varieties of memory barriers

- 1. Write (or store) memory barriers
- 2. Data dependency barriers
- 3. Read (or load) memory barriers
- 4. General memory barriers
- 5. Implicit varieties:
  - LOCK operations
  - UNLOCK operations



- All STORE operations before the barrier will appear (to other system components) to happen before all STORE operations after the barrier
- No effect on load operations



No data dependency: later load operations referencing \*P may obtain old values ≠ 4

■ To enforce partial order between both operations → insert a Wr\_membar:

Later loads of \*P (by the same processor) will obtain the updated value = 4



Other processors can see the update of A after the update of P, due to caching:
 { A = 1, B = 2, C = 3, P = &B, Q = &C }





 Other processors can see the update of A after the update of P, due to caching:

 $\{ A = 1, B = 2, C = 3, P = &B, Q = &C \}$ 





- Delay the read of \*Q until its value has been updated in memory
- The process running on CPU2 needs to synchronize with the write barrier on CPU1
- Make sure that the target (\*Q) of the second load (D=load \*Q) is updated, before the address (Q) obtained by the first load (Q=load P) is accessed



- Enforce a partial ordering on interdependent load operations
- Does not have any effect on
  - stores
  - independent or overlapping loads
- Weaker form of read memory barrier



 CPU2 cache is forced to commit its coherency queue before processing further requests

 $\{ A = 1, B = 2, C = 3, P = &B, Q = &C \}$ 





- Two or more consecutive loads with later loads depending on the result of previous ones
- Typical situations:
  - the first load retrieves the address to which the second load will be directed
  - the first load retrieves a number which is then used to calculate the index for an array
- A data dependency barrier would be required to make sure that the target of the second load is updated before the address obtained by the first load is accessed



### Definition from Linux kernel manual

• "A data dependency barrier issued by the CPU under consideration guarantees that for any load preceding it, if that load touches one of a sequence of stores from another CPU, then by the time the barrier completes, the effects of all the stores prior to that touched by the load will be perceptible to any loads issued after the data dependency barrier".



 CPUs in the system can be viewed as committing sequences of stores to memory, that other CPU can then perceive













Initially: B = 7; X = 9; Y = 8; C = &Y









Initially: { B = 7; X = 9; Y = 8; C = &Y }





#### Initially: { B = 7; X = 9; Y = 8; C = &Y }





Initially: { M[1] = 2, M[3] = 3, J = 0, K = 3 }





- All LOAD operations before the barrier will appear (to other system components) to happen before all LOAD operations after the barrier
- No effect on store operations
- Stronger than data dependency barrier
  - DD\_bar applies only to dependent loads
  - Rd\_bar applies to **all** load operations
  - Therefore, a Rd\_bar implies a DD\_bar



Initially: 
$$\{ A = 0; B = 9 \}$$





#### Initially: $\{ A = 0; B = 9 \}$









#### Initially: $\{ A = 0; B = 9 \}$





- A read barrier has to be used instead of a data dependency barrier when
  - there is no data dependency between the operations involved. E.g., LOAD B;

or

LOAD A; there is a control dependency between the

operations involved. E.g.,

Here we need a Rd bar! A DD\_bar is not sufficient since there is a control dependency between "Q=&B" and "X=\*Q"



- A write barrier needs a data dependency barrier or a read barrier, to work properly
- The partial ordering enforced by a Wr\_bar on a CPU can be perceived by other CPUs only if they use a paired DD\_bar (or Rd\_bar)
- Typically the stores before the Wr\_bar match the loads after the Rd\_bar (or DD\_bar) and

viceversa



- All LOAD and STORE operations before the barrier will appear (to other system components) to happen before all LOAD and STORE operations after the barrier
- Combine the functionalities of Wr\_bar and Rd\_bar
  - Therefore, a Full\_bar implies both a Wr\_bar and a Rd\_bar



- Compiler barriers
  - barrier();
- CPU memory barriers
  - Mandatory barriers: mb(); wmb(); rmb(); read\_barrier\_depends();
  - SMP conditional barriers: smp\_mb(); smp\_wmb(); smp\_rmb(); smp\_read\_barrier\_depends();
- MMIO write barriers
  - mmiowb();



- LOCK and UNLOCK operations
- They are unidirectional barriers that are permeable to read and write accesses only in one way
- Used to delimit Critical Sections of code to which a process need exclusive access



- A shared resource (data structure or device) that must be exclusively accessed by one thread
- Synchronization mechanism required at Critical Section boundaries
- On uni-processors can be implemented avoiding context switches (e.g., disabling interrupts and preemptions)
- On multi-processors this is no more valid



- All LOAD and STORE operations after the *lock* will appear (to other system components) to happen after the *lock*
- Memory operations occurring before the *lock* may appear to happen after it completes





- All LOAD and STORE operations before the unlock will appear (to other system components) to happen after the unlock
- Memory operations occurring before the *unlock* may appear to happen after it completes



Critical section implementation

- Lock and Unlock operations are almost always paired
- They delimit
   Critical Sections
- When lock/unlock are used, no need for explicit memory barriers





### Is a LOCK followed by an UNLOCK equivalent to a full memory barrier?









### Is an UNLOCK followed by a LOCK equivalent to a full memory barrier?









## CPU1



CPU2

Different spin locks M and Q

Other CPUs might see, for example:

\*E, LOCK M, LOCK Q, \*G,
 \*C, \*F, \*A, \*B, UNLOCK Q,
 \*D, \*H, UNLOCK M

But they will never see:

- \*B, \*C or \*D before LOCK M
- \*A, \*B or \*C after UNLOCK M
- \*F, \*G or \*H before LOCK Q
- \*E, \*F or \*G after UNLOCK Q



| *A = a; *E = e; *E, LOCK M[1], *C, *B, *<br>UNLOCK M[1], LOCK M[2<br>*H, *F, *G, UNLOCK M[2                                                                                                                                                                                                                                                                                                                     |    |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----|
| <ul> <li>[1] LOCK M;</li> <li>*B = b;</li> <li>*C = c;</li> <li>[1] UNLOCK M;</li> <li>*D = d;</li> <li>LOCK M;</li> <li>*F = f;</li> <li>*G = g;</li> <li>UNLOCK M;</li> <li>*H = h;</li> <li>[2] But, <u>assuming CPU1 gets th</u><br/><u>lock first</u>, they will never</li> <li>*B, *C, *D, *F, *G or *H<br/>before LOCK M[1]</li> <li>*A, *B or *C<br/>after UNLOCK M[1]</li> <li>*F, *G or *H</li> </ul> | ne |
| Same spin lock M<br>• *A, *B, *C, *E, *F or *G<br>after UNLOCK M[2]                                                                                                                                                                                                                                                                                                                                             |    |

Advisory vs mandatory locks

- Advisory lock
  - Each thread cooperates by acquiring the lock before accessing the protected resource
- Mandatory lock
  - Attempting unauthorized access to a locked resource forces an exception



- For single processors the exclusive access to a shared resource can be implemented disabling interrupts
- For multiprocessors this is not enough!
- Hardware support required for efficient implementation
- Atomic instruction(s) needed: test if lock is free and acquire the lock in a single atomic operation



- RMW instructions atomically do both the following operations:
  - read a memory location and
  - write a new value into it simultaneously (either a completely new value or some function of the previous value)
- Example: Test-and-set(); Compare-andswap(); Fetch-and-add(); Dec\_and\_test(); etc.



 Consider two processes executing the following piece of code to acquire the same lock:

If both tasks test the "lock" value at the same time, they will both detect that the lock is free → they will both acquire the lock!

The access to the CS is not exclusive









Atomic instructions in Linux

- Read-Modify-Write instructions:
  - xchg(); cmpxchg(); atomic\_cmpxchg();
  - atomic\_[inc|dec|add|sub]\_return();
  - atomic\_[inc|dec|sub]\_and\_test();
  - atomic\_add\_[negative|unless]();
  - test\_and\_[set|clear|change]\_bit();
- Other atomic instructions:
  - atomic\_set(); [set|clear|change]\_bit();
  - atomic\_[inc|dec|add|sub]();

Atomic operations in Linux

- Are executed without being interrupted by other operations
- Atomic operations that modify some state in memory and return information about the state (old or new) imply an SMP-conditional general memory barrier (smp\_mb()) on each side of the actual operation
  - E.g., cmpxchg(); atomic\_dec\_and\_test(); test\_and\_set\_bit(); ...



- Round-robin
  - Strict alternation: each thread can lock the resource at its turn
- Dekker's algorithm
  - Limited to two processes and busy waiting
- Peterson's algorithm
  - Originally formulated for two processes, can be generalized to more than two



```
Uses two flags (f0 and f1) to indicate
"intention to enter", and a turn variable
```

```
Thread1 Initially, f0 = false; f1 = false; turn = 0 (or 1) Thread2
```

```
f0 = true

while f1 {

    if turn \neq 0 {

        f0 = false

        while turn \neq 0 { }

        f0 = true

        }

}

// ... Critical Section ...

turn = 1

f0 = false
```

```
f1 = true
while f0 {
         if turn \neq 0 {
                  f1 = false
                  while turn \neq 1 \{ \}
                   f1 = true
         }
   ... Critical Section ...
turn = 0
f1 = false
```



 Uses two flags (f0 and f1) to indicate "intention to enter", and a turn variable

Initially, f0 = false; f1 = false; turn = 0 (or 1)

Thread1

f0 = true turn = 1 while (f1 && turn==1) {}

// ... Critical Section ...

f0 = false

Thread2

f1 = true turn = 0 while (f0 && turn==0) {}

// ... Critical Section ...

f1 = false



- Dekker's and Peterson's algorithm would need memory barriers when used on processors with instruction reordering
- More efficient solutions are provided using atomic operations

# Key design aspects for locks

## Overhead

- extra resources used just for the implementation of the lock → memory space, time for lock initialization/destruction and acquire/release
- Contention
  - number of threads that can concurrently request the lock → a locked shared by many processes implies a larger blocking probability
- Granularity
  - size of the protected region → course vs fine granularity



- Coarse granularity
  - less overhead, but more lock contention
- Fine granularity
  - Iarger number of "faster" locks
  - larger system overhead
  - higher risk of deadlocks
- Example: locking a whole table, a row, or a single entry



- Blocking time  $\rightarrow$  starvation, lack of fairness
- Locking tasks stalls/blocks/dies/loops
- Error-prone due to crossed dependencies difficult to detect for larger program sizes
  - Deadlocks, livelocks
  - Priority inversion
- Bugs difficult to reproduce

Linux kernel locking constructs

- Spin locks
- R/W spin locks
- Mutexes
- Semaphores
- R/W semaphores
- RCU

# All require atomic instruction support



- A thread waits ("spins") until the lock is available
- Wasteful if locks are held for a long time
  - On a single processor, many time quanta may be wasted due to spinning on a resource locked by a preempted thread
- Efficient for short locks → avoid process rescheduling

Spinlock: example

lock: dd 0

spin\_lock: mov eax, 1

loop:

xchg eax, [lock] test eax, eax jnz loop ret

- The lock variable is 1 when locked, 0 when unlocked
- "xchg eax, [lock]" atomically swap the EAX register with the lock variable
- "test eax, eax" sets the zero flag if EAX = 0
- If zero flag is set, the lock has been acquired
- Otherwise spin



lock: dd 0 spin\_unlock: mov eax, 0 xchg eax, [lock] To release the lock, reset the lock variable to zero



 When an interrupt handler can access the lock as well, the following situation might occurr

| Process on CPU1                  |                              |
|----------------------------------|------------------------------|
| <pre>spin_lock(&amp;lock);</pre> |                              |
| • • •                            | Interrupt comes in           |
|                                  | <br>spin_lock(&lock);        |
| DEADLOCK!!!                      | Waits for the release of the |

lock which will never happen



Disable interrupts

Process on CPU1

spin\_lock(&lock); cli();

sti(); spin\_unlock(&lock)



spin\_unlock(&lock);





 Only local interrupts need to be disabled → not necessary to disable interrupts on other CPUs



#### In Linux

- xxx\_lock\_irqsave(...)
- xxx\_unlock\_irqrestore(...)



- Allow multiple readers to be in the same critical section at once
- Do not allow more than one single writer at a time
- Usually, there is no need to read a shared resource with an exclusive access to it...
- as long as no other process modifies it!

R/W spin locks in Linux: read

- Readers acquire/release the lock with
  - read\_lock(&xxx\_lock, flags);
  - read\_unlock(&xxx\_lock, flags);
- If no process is holding a write lock on the same resource, the read lock can be acquired
- The only result will be an increment/decrement in the counter of the processes currently holding the lock

R/W spin locks in Linux: write

- Writer acquires/releases the lock with
  - write\_lock(&xxx\_lock, flags);
  - write\_unlock(&xxx\_lock, flags);
- It can acquire the lock only when there is no process holding a (read or write) lock → check the lock\_counter
- When a process is holding a write lock, no other process can acquire the lock



- R/W spin locks are faster than normal spin locks, allowing more readers to contemporarily access the resource
- If we know that interrupts will only need read locks, possible to use
  - read\_lock(&lock) for read accesses
     and
  - write\_lock\_irqsave(&lock, flags) for write accesses



- Busy-wait is an anti-pattern associated to "spinning" before entering a critical section
- Waste of CPU cycles
- May delay subsequent requests, reducing the spinning frequency
- Better to **block** the process on events like lock acquisitions, timers, I/O availability, or signals → the blocked thread is put in sleeping state and other threads are executed



- Instead of "spinning", a more efficient method is using semaphores with blocking
- Thread must acquire a semaphore before entering a CS
- Block the execution of the thread requesting the lock until it is allowed to acquire the semaphore
- Other threads can execute other code/critical sections



- A protected variable to restrict the access to shared resources
- Implemented by a counter for a set of available resource → counting semaphore
- Binary semaphores are called mutexes
- Prevents race conditions but not deadlocks

Non-blocking synchronization

Overcomes the disadvantages of using locks

#### Lock-free

- a thread cannot lock up: every step it takes brings progress to the system
- Wait-free
  - a thread can complete any operation in a finite number of steps, regardless of other threads
- All wait-free algorithms are lock-free (but not viceversa)
- No semaphores nor mutexes



- Still need atomic instructions like Test-andset, Compare-and-swap, etc.
- Example:

```
CAS(addr, old, new)
{
 atomic
    if (*addr == old)
    then {*addr = new ;
        return true}
    else return false
endatomic
}
```



 Each thread represents a teller trying to make a deposit onto the same account



 Need to synchronize simultaneous deposits to the account



## Thread 1



}

A = read\_account(); A = A + money; write\_account(A);

## Thread 2



If both threads simultaneously read the account → a transaction is lost!



 Each teller has to acquire a lock before doing a deposit:



Implies locking overhead



Use Compare\_And\_Swap(address,old,new):



 Lock-free but not wait-free: other tellers may keep writing new values → failing teller tries again indefinitely



- To reduce traffic on the bus due to repeated reads, failing threads may wait some time before trying another update
  - constant delay
  - incremental delay
  - random delay



- Arises when a thread reads a location a second time to detect if anything has changed from the first read
- But another thread could have modified and restored the value at that location, e.g.
  - Thread 1 reads A from a shared memory location
  - Thread 2 modifies the same location from A to B and back to A
  - Thread 1 reads again the value A and assumes nothing changes → ERROR!!!



## Push and Pop function of a stack list

```
Obj* Pop()
{
    while(1) {
        Obj* ret = top;
        if (!ret) return NULL;
        Obj* next = ret->next;
        if (CAS(&top, ret, next))
            return ret;
        }
    }
}
```

```
void Push(Obj* obj)
{
    while(1) {
        Obj* next = top;
        obj->next = next;
        if (CAS(&top, next, obj))
            return;
        }
}
```



Suppose initially the stack contains

$$top \rightarrow A \rightarrow B \rightarrow C$$

 Suppose Thread1 is preempted during the Pop() by Thread2





## Thread1's Pop operation:





- Lock-free approach that is able to deal with the ABA problem
- Allows a group of load and store instructions to execute in an atomic way
  - Load-Link/Store-Conditional
  - Software Transactional Memory (STM)

Load-Link/Store-Conditional

- LL works like a normal load from a memory location
- A subsequent SC to the same memory location will store a new value only if no updates have occurred to that location since the load-link, otherwise it fails
- SC fails even if the value read by LL has since been updated and then restored (ABA problem) → LL/SC is stronger than read/CAS



- Optimistic behavior: threads complete modifications to shared memory regardless of other threads
- They record every read and write in a log
- Each reader verifies that other threads have not concurrently made changes to memory that it accessed in the past
  - *commit* permanent changes if validation is successful
  - otherwise *abort*, undoing all its prior changes



- The conflict control is placed on the reader instead of the writer
- Increased concurrency: no need to wait for access to a resource
- Increased overhead in case of failing
- Good performance in practice → conflicts arise rarely in practice



- Split updates into "*removal*" and "*reclamation*" phases
- The removal phase
  - removes references to data items within a data structure (possibly by replacing them with references to new versions of these data items)
  - can run concurrently with readers which will see either the old or the new version of the data structure rather than a partially updated reference



#### The reclamation phase

- frees the data items removed from the data structure during the removal phase
- must not start until readers no longer hold references to those data items
- reclaiming can be done either by blocking or by registering a callback
- No need to consider readers starting after the removal phase → they are unable to gain a reference to the removed data items



- 1. Remove pointers to a data structure  $\rightarrow$  later readers cannot gain a reference to it
- 2. Wait for all previous readers to complete their RCU read-side critical sections
- 3. Reclaim the data structure (e.g., using *kfree* in the Linux kernel)





- Wait-free reads
- RCU readers use much lighter-weight synchronization → low overhead
- Reclamation phase may by done by entirely different thread → e.g., Linux directory entry cache

Where does the name come from?

- Read-Copy Update
  - A thread wishing to update a linked structure in place does the following
    - creates a new structure, copying the data from the old structure into the new one
    - modifies the new, copied, structure
    - **updates** the pointer to refer to the new structure
    - sleeps until there are no readers left
- Therefore, an RCU protected structure is *Read* concurrently with a thread *Copying* in order to do an *Update*

Linux RCU implementation

## RCU API

- rcu\_read\_lock()
- rcu\_read\_unlock()
- synchronize\_rcu() / call\_rcu()
- rcu\_assign\_pointer()
- rcu\_dereference()



- Sharing data on multicore platforms requires attention
- Possible to choose among different constructs (spinlocks, semaphores, R/W locks, RCU, ...)
- Using proper mechanism it is possible to exploit the power of multicore
- Start thinking parallel!