This paper studies the performance implications of
architectural synchronization support for automatically
parallelized numerical programs. As the basis for this
work, we first analyze the synchronization needs of such
programs, which arise from task scheduling, iteration
scheduling, barriers, and data dependence handling. We
present synchronization
algorithms for efficient execution of programs with nested
parallel loops. Next, we identify how various forms of
hardware synchronization support can be used to satisfy
these software synchronization needs. The synchronization
primitives studied are test&set, fetch&add, and exchange-byte
operations. In addition, synchronization bus implementations
of lock/unlock and fetch&add operations are considered.
Lastly, we ran experiments to quantify the impact of
various forms of architectural support on the
performance of a bus-based shared memory multiprocessor
running automatically parallelized numerical programs. We
found that supporting an atomic fetch&add primitive in
shared memory is as effective as supporting lock/unlock
operations with a synchronization bus. Both achieve
substantial performance improvement over the cases where
atomic test&set and exchange-byte operations are supported
in shared memory.
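
To illustrate why the choice of primitive matters for iteration
scheduling, the following is a minimal sketch, not the algorithms
evaluated in this paper. It assumes C11 atomics as a stand-in for the
hardware primitives; the types fa_loop_t and ts_loop_t and the
functions fa_init, fa_claim, ts_init, and ts_claim are hypothetical
names introduced only for this example. With fetch&add, a processor
claims the next loop iteration in one atomic operation; with test&set,
it must first acquire a spin lock, update the shared counter, and then
release the lock.

```c
#include <stdatomic.h>

/* Hypothetical shared state for a self-scheduled parallel loop whose
 * iterations are numbered 0 .. limit-1. */
typedef struct {
    atomic_int next;   /* next unclaimed iteration, advanced by fetch&add */
    int        limit;
} fa_loop_t;

typedef struct {
    atomic_flag lock;  /* spin lock built on test&set */
    int         next;  /* next unclaimed iteration, protected by lock */
    int         limit;
} ts_loop_t;

static void fa_init(fa_loop_t *l, int limit) {
    atomic_init(&l->next, 0);
    l->limit = limit;
}

/* Claim one iteration with a single atomic fetch&add.
 * Returns the iteration index, or -1 when the loop is exhausted.
 * Note: the counter keeps growing after exhaustion; a production
 * version would need to guard against overflow. */
static int fa_claim(fa_loop_t *l) {
    int i = atomic_fetch_add(&l->next, 1);
    return i < l->limit ? i : -1;
}

static void ts_init(ts_loop_t *l, int limit) {
    atomic_flag_clear(&l->lock);
    l->next = 0;
    l->limit = limit;
}

/* Claim one iteration under a test&set spin lock: acquire the lock,
 * read and advance the counter, then release the lock. */
static int ts_claim(ts_loop_t *l) {
    while (atomic_flag_test_and_set(&l->lock)) {
        /* spin until the current holder clears the flag */
    }
    int i = (l->next < l->limit) ? l->next++ : -1;
    atomic_flag_clear(&l->lock);
    return i;
}
```

In this sketch each processor would call fa_claim (or ts_claim)
repeatedly and execute the returned iteration until it receives -1.
The test&set version needs at least two synchronization operations per
claim and may spin under contention, whereas the fetch&add version
needs one.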