This paper describes and evaluates three architectural approaches to
data-parallel computation in a programmable embedded system.
Comparisons are made between the well-studied Very Long Instruction Word (VLIW) and Single Instruction Multiple Packed Data (SIMpD) paradigms, and the less common Single Instruction Multiple Disjoint Data (SIMdD)
architecture is described and evaluated.
A taxonomy is defined for
data-level parallel architectures, and patterns of data access for
parallel computation are studied, with measurements presented for over
40 essential telecommunication and media kernels. While some algorithms
exhibit data-level parallelism well suited to packed vector computation,
other kernels are shown to be scheduled most efficiently under more
flexible vector models. This motivates the exploration of non-traditional
processor architectures for the embedded domain.