This paper studies the performance implications of architectural
synchronization support for automatically parallelized numerical
programs. As the basis for this work, we analyze the synchronization
needs of such programs, which arise from task scheduling, iteration
scheduling, barriers, and data dependence handling. We present synchronization
algorithms for efficient execution of programs with nested parallel
loops. Next, we identify how various forms of hardware synchronization
support can be used to satisfy these software synchronization needs. The
synchronization primitives studied are test & set, fetch & add, and
exchange-byte operations. In addition, synchronization bus
implementations of lock/unlock and fetch & add operations are also
considered. Lastly, we ran experiments to quantify the impact of these
forms of architectural support on the performance of a bus-based
shared memory multiprocessor running automatically parallelized
numerical programs. We found that supporting an atomic fetch &
add primitive in shared memory is as effective as supporting lock/
unlock operations with a synchronization bus. Both achieve
substantial performance improvements over the cases in which atomic
test & set and exchange-byte operations are supported in shared
memory.
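
To make the contrast concrete, the following is a minimal, hypothetical C
sketch (using C11 atomics as a modern stand-in for the memory-based
primitives discussed). It illustrates why a single fetch & add can claim a
loop iteration directly, while a test & set primitive must first acquire a
lock around a shared counter. The function names, trip count, and loop body
are illustrative only and are not taken from the paper.

    /* Illustrative only: iteration self-scheduling with fetch & add versus
     * test & set.  Names, sizes, and the loop body are hypothetical. */
    #include <stdatomic.h>

    #define N 1000                               /* hypothetical trip count */

    static atomic_int next_iter = 0;             /* shared iteration counter */

    /* fetch & add: one atomic operation claims the next iteration. */
    void worker_fetch_add(double *a)
    {
        for (;;) {
            int i = atomic_fetch_add(&next_iter, 1);
            if (i >= N)
                break;
            a[i] = 2.0 * a[i];                   /* hypothetical loop body */
        }
    }

    /* test & set: the shared counter must be guarded by a spin lock, so each
     * claim costs a lock acquire, an update, and a release, plus spinning. */
    static atomic_flag counter_lock = ATOMIC_FLAG_INIT;
    static int next_iter_ts = 0;

    void worker_test_and_set(double *a)
    {
        for (;;) {
            while (atomic_flag_test_and_set(&counter_lock))
                ;                                /* spin until the lock is free */
            int i = next_iter_ts++;
            atomic_flag_clear(&counter_lock);
            if (i >= N)
                break;
            a[i] = 2.0 * a[i];
        }
    }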