Cache-coherent, bus-based shared-memory multiprocessors
are a cost-effective platform for parallel processing. In
scientific parallel applications, most of the computation
involves processing large multidimensional data structures,
which results in a high degree of data parallelism. This
parallelism can be exploited in the form of nested parallel
loops. Most existing shared-memory multiprocessors exploit
this multi-level parallelism at only one level. In this paper,
we explore efficient algorithms and models for executing
nested parallel loops and present a simulation-based performance
comparison of different techniques using real application traces.
We show that it is possible to exploit the parallelism in
nested parallel loops by using good scheduling and
synchronization algorithms.