International IC – Taipei • Conference Proceedings

Configuring a VLIW-DSP core for application specific requirements

Oz Levia
Chief Technology Officer
Improv Systems Inc.

Abstract

In this paper we describe an architectural approach to a configurable and scalable Very Long Instruction Word (VLIW) DSP core for embedded systems. We focus on actual experience with a specific configurable VLIW DSP core, the Jazz DSP processor. Following an introduction to the processor and the VLIW structure, we discuss the methodology and tools required for application-specific configuration of the processor.

Introduction

Configurable VLIW Processor

In recent years, very long instruction word (VLIW) approaches have become increasingly prominent in the high-end DSP space. The reasons for this are straightforward: VLIW provides parallel execution of operations to significantly increase performance. Unlike superscalar approaches, the overhead of determining this parallelism is paid in the compiler rather than each time the application is run. The price to be paid for this performance comes in the width of the instruction and the resulting potential increase in the memory image for a given application.

The Jazz DSP core

Unlike many VLIW processors, the Jazz processor does not aggregate computation units (ALU, multiplier, shifter) into a single data path but provides a flat collection of computation units. This allows the compiler to use the computation units to their best advantage in each instruction. The Jazz processor is also part of Improv’s general programmable system architecture (PSA), which provides a unique approach to combining multiple processors into a single structure. One aspect of this approach is the ability to attach multiple memory ports to each Jazz processor.

The Jazz DSP core is a flexible VLIW that delivers high performance with low power consumption. As can be seen in Figure 1,
Jazz has an array of Computational Units (CUs) and an array of Memory Interface Units (MIUs). A task control unit controls task queuing and execution. Instructions are fetched from a program address space. Jazz is designed to deliver high performance at moderate clock speeds through the use of parallelism. As a result, the hardware design is relatively simple and power consumption is very low.

Figure 1: The Jazz DSP Core

A flexible DSP Core

Like configurable RISC processors, configurable VLIW processors let designers achieve significant gains by adding custom logic into the data path. However, the VLIW approach offers significant opportunities for designers above and beyond those afforded by configurable RISC processors. The opportunities for configuring a VLIW processor include:

* Defining the collection of Computational Units (CUs) in the processor (ALUs, MACs, etc.) that can operate in parallel each cycle.
* Adding custom CUs to the processor to accelerate common or critical parts of a program.
* Configuring the VLIW instruction to trade off parallelism for instruction word width.
* Changing the number of Memory Interface Units (MIUs) to vary the number of memory accesses in and out of the processor data path each cycle.
* Modifying other aspects of the processor to trade off power and performance against area, for example the number and location of registers, task queue depth, processor data connectivity and more.

Mix and Match Computation Units

The general belief is that, to increase performance with a configurable processor, the designer must add custom logic and instructions. However, with Improv’s Jazz processor, the designer can increase performance without any hardware design. This is achieved by combining computation units in the processor into a mix that is specifically tuned to an application domain.
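Why the mix matters can be seen with a simple resource bound: if one loop iteration needs n operations of a given kind and the processor has u units of that kind, the loop cannot issue in fewer than ceil(n/u) instructions. The sketch below applies that bound to a hypothetical operation profile. The profile, the function name `resource_bound` and the dictionary layout are illustrative assumptions and not part of the Jazz tools; the bound ignores data dependences, memory ports and instruction-slot limits.

```python
from math import ceil


def resource_bound(op_counts, cu_mix):
    """Lower bound on instructions per loop iteration from unit counts alone.

    op_counts: operations of each kind in one iteration (hypothetical profile).
    cu_mix:    number of computation units of each kind in the processor.
    """
    bound = 0
    for kind, count in op_counts.items():
        if count == 0:
            continue
        units = cu_mix.get(kind, 0)
        if units == 0:
            raise ValueError(f"no computation unit supports {kind!r}")
        bound = max(bound, ceil(count / units))
    return bound


# Hypothetical ALU-heavy inner loop: 6 ALU operations, 1 shift, 2 MACs.
alu_heavy_loop = {"alu": 6, "shift": 1, "mac": 2}

# Two candidate mixes of computation units.
alu_rich_mix = {"alu": 3, "shift": 1, "mac": 1}
balanced_mix = {"alu": 2, "shift": 2, "mac": 2}

print(resource_bound(alu_heavy_loop, alu_rich_mix))   # 2
print(resource_bound(alu_heavy_loop, balanced_mix))   # 3
```

For this profile the ALU-rich mix is bounded at 2 instructions per iteration and the balanced mix at 3; a MAC-heavy profile would favour the balanced mix instead.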
The Jazz processor can contain multiple computation units including ALUs, MACs, and shifters. Improv provides a robust collection of these computation units in its base offering. Designers can define the collection of computation units in the processor to change the number and type of operations that can be executed in each instruction. For instance, a designer might create a processor with 3 ALUs, 1 shifter and 1 MAC for ALU-intensive application domains, or a processor with 2 ALUs, 2 shifters and 2 MACs for more MAC-intensive and balanced application domains.

Designer-Defined Computation Units

For most applications, combinations of general-purpose computation units can provide enough performance. However, for very high-performance applications such as network processing, multi-channel speech processing and image/video processing, it can be important to find every opportunity to increase performance while maintaining programmability. Designers can analyze applications and identify critical, high-impact operations that can be implemented in custom logic and added to the processor.

In the Jazz processor, designers can define and insert their own custom computation units, called designer-defined computation units (DDCUs). A DDCU is described to the compiler as a set of operations and resources (controlled, template-based Verilog code is also supported for hardware implementations). The compiler binds specific operations to available resources, allowing the designer to continue to use high-level programming without any machine-specific code.

For example, consider an application that can be accelerated by adding an operation to perform 5-bit addition. The designer could create a custom unit to perform this operation and add it to the processor. However, it is much easier to add the same operation and additional logic to the pre-defined ALU computation unit.
The ALU unit already supports a number of operations, and the designer simply maps those operations plus the new 5-bit addition operation to the new unit. The designer can then include the new unit in the processor, where it also continues to support the standard ALU operations. Using this feature, users of Jazz can create CUs that accelerate critical parts of an application without giving up the ability to use the compiler and other analysis tools.

Select Bandwidth to Memory

VLIW offers performance through parallelism, but multiple operations per cycle require bandwidth to and from memory that matches the computational bandwidth. Designers can add or remove MIUs and can select from a set of MIUs with different capabilities, for example read/write access, byte access, wait-state support, and others. Designers can also create their own MIUs.

Instruction Word Configuration

VLIW offers significant performance opportunities. However, for some applications the tradeoff between the size of the instruction word and potential performance needs to be considered. Improv’s Jazz Composer allows the designer to define the number of slots available in the instruction for computation units and then assign one or more computation units to each slot. This allows the designer to populate the processor with a generous mix of computation units without paying a high price in instruction width. It also means that the designer can configure a RISC-like processor by overlaying multiple computation units into a single slot in the instruction.

Application Specific Configuration: Design methodology

The unique strength of the Jazz processor is the close cooperation between the configurable core and the programming tools that support the processor. The design methodology is iterative and is shown in Figure 2.
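One pass of such an iteration can be pictured in miniature by evaluating several candidate instruction-word layouts against the same application profile. The sketch below does this for the slot-overlay choice only. It is a toy model under stated assumptions: operations are treated as independent, each slot issues at most one operation per instruction, and the packing is greedy and therefore not guaranteed minimal. The function `pack`, the profile and the layout names are hypothetical and do not represent the Solo compiler or the Jazz Composer.

```python
from collections import Counter


def pack(op_counts, slots):
    """Greedily pack independent operations into instruction words.

    op_counts: Counter of operation kinds still to be issued.
    slots:     list of slots; each slot is the set of unit kinds overlaid in it.
    Returns the number of instruction words used (greedy, not guaranteed minimal).
    """
    remaining = Counter(op_counts)
    unsupported = [k for k, n in remaining.items()
                   if n > 0 and not any(k in s for s in slots)]
    if unsupported:
        raise ValueError(f"no slot can issue: {unsupported}")
    order = sorted(slots, key=len)  # fill the least flexible slots first
    words = 0
    while sum(remaining.values()) > 0:
        for slot in order:
            # Issue the most plentiful remaining kind this slot can handle.
            candidates = [k for k in sorted(slot) if remaining[k] > 0]
            if candidates:
                remaining[max(candidates, key=lambda k: remaining[k])] -= 1
        words += 1
    return words


# Hypothetical profile of independent operations in a hot loop.
profile = Counter({"alu": 6, "shift": 1, "mac": 2})

# Three candidate layouts for the same five units (3 ALUs, 1 shifter, 1 MAC).
layouts = {
    "one unit per slot": [{"alu"}, {"alu"}, {"alu"}, {"shift"}, {"mac"}],
    "overlaid, 3 slots": [{"alu"}, {"alu", "shift"}, {"alu", "mac"}],
    "RISC-like, 1 slot": [{"alu", "shift", "mac"}],
}

for name, slots in layouts.items():
    print(f"{name}: {len(slots)} slots wide, {pack(profile, slots)} instruction words")
```

In this toy model the five-slot layout needs 2 instruction words, the three-slot overlay needs 3, and the single-slot layout needs 9, which illustrates the kind of width-versus-parallelism feedback a designer would weigh before changing the configuration and repeating the analysis.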
Figure 2: Design Methodology

As can be seen, application code is compiled using Solo, the Jazz compiler, and the results are analyzed using a mix of static and dynamic measurements. Feedback is then used to modify the application (optimization) or the processor (configuration). The tools adjust to the processor configuration by reading and processing a platform configuration file.

Jazz Composer

The Jazz processor is configured using a graphical tool called the Jazz Composer (see Figure 3), which provides an intuitive drag-and-drop facility. The designer can configure specific characteristics of the base processor structure, including the data width of the processor, the number of constant registers and the depth of the hardware task queue. Similar features are available in most configurable processors. Jazz Composer goes further by allowing the designer to address all of the opportunities discussed earlier.

Figure 3: The Jazz Composer GUI

The Solo Compiler

The most critical tool in the methodology is the compiler. To maintain a time-to-market advantage, designers must be able to stay with high-level language programmability. For VLIW, the compiler is even more critical because of the complexity of managing parallel data path elements, multiple memory accesses and distributed register systems. Improv’s compiler maps the operations used in an application onto a target processor by matching each operation to a computation unit that supports that operation. Improv’s advanced VLIW code generation manages data movement through the concurrent data path, parallelization of operations and resource management.

Author’s contact details

Oz Levia
Improv Systems Inc.
1485 Saratoga Ave., #100
San Jose, CA 95129
Phone: 1-408 517 4790
Fax: 1-408 517 4799
Email: ozl@improvsys.com