# Digital Signal Processing Applications with the TMS320 Family

## Theory, Algorithms, and Implementations

## Volume 3

Edited by Panos Papamichalis, Ph.D. Digital Signal Processing<br>Semiconductor Group<br>Texas Instruments

## IMPORTANT NOTICE

Texas Instruments (TI) reserves the right to make changes to or to discontinue any semiconductor product or service identified in this publication without notice. TI advises its customers to obtain the latest version of the relevant information to verify, before placing orders, that the information being relied upon is current.

TI warrants performance of its semiconductor products to current specifications in accordance with TI's standard warranty. Testing and other quality control techniques are utilized to the extent TI deems necessary to support this warranty. Unless mandated by government requirements, specific testing of all parameters of each device is not necessarily performed.

TI assumes no liability for TI applications assistance, customer product design, software performance, or infringement of patents or services described herein. Nor does TI warrant or represent that license, either express or implied, is granted under any patent right, copyright, mask work right, or other intellectual property right of TI covering or relating to any combination, machine, or process in which such semiconductor products or services might be or are used.

## TRADEMARKS

ADI and AutoCAD are trademarks of Autodesk, Inc.
Apollo and Domain are trademarks of Apollo Computer, Inc.
ATVista is a trademark of Truevision, Inc.
CodeView, MS-Windows, MS, and MS-DOS are trademarks of Microsoft Corp.
DEC, Digital DX, VAX, VMS, and Ultrix are trademarks of Digital Equipment Corp.
DGIS is a trademark of Graphic Software Systems, Inc.
EPIC, XDS, TIGA, and TIGA-340 are trademarks of Texas Instruments, Inc.
GEM is a trademark of Digital Research, Inc.
GSS*CGI is a trademark of Graphic Software Systems, Inc.
HPGL is a registered trademark of Hewlett-Packard Co.
Macintosh and MPW are trademarks of Apple Computer Corp.
NEC is a trademark of NEC Corp.
PC-DOS, PGA, and Micro Channel are trademarks of IBM Corp.
PEPPER is a registered trademark of Number Nine Computer Corp.
PM is a trademark of Microsoft Corp.
PostScript is a trademark of Adobe Systems, Inc.
RTF is a trademark of Microsoft Corp.
Sony is a trademark of Sony Corp.
Sun 3, Sun Workstation, SunView, SunWindows, and SPARC are trademarks of Sun Microsystems, Inc.
UNIX is a registered trademark of AT\&T Bell Laboratories.

Copyright © 1990, Texas Instruments Incorporated

## CONTENTS

FOREWORD
PREFACE

PART I. INTRODUCTION

1. The TMS320 Family and Book Overview
2. The TMS320 Family of Digital Signal Processors (Kun-Shan Lin, Gene A. Frantz, and Ray Simar, Jr., reprinted from PROCEEDINGS OF THE IEEE, Vol. 75, No. 9, September 1987)
3. The TMS320C30 Floating-Point Digital Signal Processor (Panos Papamichalis and Ray Simar, Jr., reprinted from IEEE Micro Magazine, Vol. 8, No. 6, December 1988)

PART II. DIGITAL SIGNAL PROCESSING ROUTINES

4. An Implementation of FFT, DCT, and Other Transforms on the TMS320C30 (Panos Papamichalis)
5. Doublelength Floating-Point Arithmetic on the TMS320C30 (Al Lovrich)
6. An $8 \times 8$ Discrete Cosine Transform Implementation on the TMS320C25 or the TMS320C30 (William Hohl)
7. Implementation of Adaptive Filters with the TMS320C25 or the TMS320C30 (Sen Kuo, Chein Chen)
8. A Collection of Functions for the TMS320C30 (Gary Sitton)

PART III. DIGITAL SIGNAL PROCESSING INTERFACE TECHNIQUES

9. TMS320C30 Hardware Applications (Jon Bradley)
10. TMS320C30-IEEE Floating-Point Format Converter (Randy Restle and Adam Cron)

PART IV. TELECOMMUNICATIONS

11. Implementation of a CELP Speech Coder for the TMS320C30 Using SPOX (Mark D. Grosen)

PART V. COMPUTERS

12. A DSP-Based Three-Dimensional Graphics System (Nat Seshan)

PART VI. TOOLS

13. The TMS320C30 Applications Board Functional Description (Tony Coomes and Nat Seshan)

TMS320 BIBLIOGRAPHY

## Foreword

Much has happened in the TMS320 Family since Volume 1 of Digital Signal Processing Applications with the TMS320 Family was published, and Volumes 2 and 3 are a timely update to the family history.

The DSP microcomputers keep changing the perspective of the systems designers by offering more computational power and better interfacing capabilities. The steps of change are coming more quickly, and the potential impact is greater and greater. Because things change so rapidly in this area, there is a pressing need for ways to quickly learn how to utilize the new technology. These new volumes respond to that need.

As with Volume 1, the purpose of these books is to teach us about the issues and techniques that are important in implementing digital signal processing systems using microprocessors in the TMS320 Family. Volume 2 highlights the TMS320C25; and Volume 3, the TMS320C30 chip. A large part of the books is devoted to such matters as characteristics of the TMS320C25 and TMS320C30 chips, useful program code for implementing special DSP functions, and details on interfacing the new chips to external devices. The remainder of the books illustrates how these chips can be used in communications, control, and computer graphics applications.

What these two volumes make clear is how remarkably fast the field of DSP microcomputing is evolving. IC technologists and designers are simply packing more and more of the right kind of computing power into affordable microprocessor chips. The high-speed floating-point computing power and huge address spaces of chips like the TMS320C30 open the door to a whole new class of applications that were difficult or impractical with earlier generations of fixed-point DSP chips. The signal processing theorists and system designers are clearly being challenged to match the creativity of the chip designers.

The present books differ from Volume 1 in the inclusion of a small section on tools. This is a hopeful sign, because it is progress in this area that is likely to have the greatest impact on speeding the widespread application of DSP microprocessors. While useful design tools are beginning to emerge, much more can be done to help system designers manage the complexity of sophisticated DSP systems, which often involve a unique combination of theory, numerical and symbolic processing algorithms, real-time programming, and multiprocessing. No doubt future volumes of Digital Signal Processing Applications with the TMS320 Family will have more to say about this important topic. Until then, Volumes 2 and 3 have much useful information to help system designers keep up with the TMS320 Family.

Ronald W. Schafer
Atlanta, Georgia
November 14, 1989

## Preface

The newer, floating-point DSP devices, such as the TMS320C30, have brought an added dimension to DSP applications. With the TMS320C30, programming is much easier because the designer does not have to worry about dynamic range and accuracy issues. An algorithm implemented in floating point in a high-level language can be easily ported to such a device. The new architecture contains other features, besides the floating-point capability, that simplify programming. Some of these features (such as the software stack, the large register file, etc.) were added to facilitate the development of high-level language compilers. Currently, C and Ada compilers have been introduced. In addition, Spectron Microsystems introduced an operating system for DSPs (called SPOX) that further facilitates the development of algorithms on the DSP devices.

Volume 3 of Digital Signal Processing Applications with the TMS320 Family contains application reports primarily on the third generation of the TMS320 Family (floating-point devices). This book is a continuation of Volumes 1 and 2 in the sense that it addresses the same needs of the designer. The designer still has the task of selecting the DSP device with the appropriate cost, performance, and support, developing the DSP algorithm that will solve the problem, and implementing the algorithm on the processor. This volume tries to help by bringing the designer up to date on the applications of newer processors or in different applications of earlier processors.

The objectives remain the same as in earlier volumes. First, the application reports supply examples of device use and serve as tutorials in programming the devices. Of course, the same purpose is served on a more elementary basis by the software and hardware applications sections of the corresponding user's guides. Second, since the source code of each application is provided with the report, the designer can take it intact (or extract a portion of it) and place it in the application.

It is assumed that the reader has exposure to the TMS320 devices or, at least, has the necessary manuals (such as the appropriate TMS320 user's guides) that will help the reader understand the explanations in the reports. The reports themselves include as references the necessary background material. Additionally, the Introduction gives a brief overview of the available devices at the time of the writing and points to the source of more information.

The reports are grouped by application area. The term report is used here in a broad sense, since some articles from technical publications are also included. The authors of the reports are either the digital signal processing engineering staff of the Texas Instruments Semiconductor Group (including both field and factory personnel, and summer students) or third parties.

The source code associated with the reports is also available in electronic form, and the reader can download it from the TI DSP Electronic Bulletin Board (telephone (713) 274-2323). If more information is needed, the DSP Hotline can be called at (713) 274-2320.

The editor thanks all the authors and the reviewers for their contribution to this volume of application reports.

Panos E. Papamichalis, Ph.D. Senior Member of Technical Staff

## Part I. Introduction

1. The TMS320 Family and Book Overview
2. The TMS320 Family of Digital Signal Processors
(Kun-Shan Lin, Gene A. Frantz, and Ray Simar, Jr., reprinted from PROCEEDINGS OF THE IEEE, Vol. 75, No. 9, September 1987)
3. The TMS320C30 Floating-Point Digital Signal Processor (Panos Papamichalis and Ray Simar, Jr., reprinted from IEEE Micro Magazine, Vol. 8, No. 6, December 1988)

## TMS320 Family and Book Overview

Digital signal processors have found applications in areas where they were not even considered a few years ago. The two major reasons for such proliferation are an increase in processor performance and a reduction in cost. Volume 3 of Digital Signal Processing Applications with the TMS320 Family presents a set of application reports primarily on the TMS320C30, the third-generation TMS320 device.

## Organization of the Book

The material in this book is grouped by subject area:

- Introduction
- Digital Signal Processing Routines
- DSP Interface Techniques
- Telecommunications
- Computers
- Tools
- Bibliography

The Introduction contains this overview and two review articles. The first article gives a general description of the TMS320 family and is reprinted from a special issue of the Proceedings of the IEEE, while the second article discusses the TMS320C30 device and is reprinted from IEEE Micro Magazine. The overview points out how the TMS320 family has grown since the two articles were published and also introduces newer devices.

The five articles in the Digital Signal Processing Routines section present useful algorithms, such as the FFT, the Discrete Cosine Transform, etc., that are implemented on the TMS320C30. Two of the reports also consider implementations on the TMS320C25.

The section on DSP Interface Techniques contains an article on interfacing the TMS320C30 with external hardware, such as memories and A/D and D/A converters, and an article on a hardware implementation of a floating-point converter between the IEEE and the TMS320C30 formats.

The following three sections contain one article each. In the Telecommunications section, an implementation of the government-standard CELP speech-coding algorithm is presented. The Computers section contains an article on 3-D graphics systems, which shows examples of using the TMS320C30 device for graphics problems. In the Tools section, the article gives a functional description of the TMS320C30 Application Board that is part of the hardware emulator for that device.

The Bibliography section contains a list of articles mentioning DSP implementations using TMS320 devices. The different titles are listed chronologically and are grouped by subject. The list is not exhaustive, but it gives pointers for pursuing practical implementations in representative application areas.

## The TMS320 Family of Processors

The TMS320 Family of digital signal processors started with the TMS32010 in 1982, but it has been expanded to encompass five generations (at the time of this writing) with devices in each generation. Figure 1 shows this progression through the generations. The TMS320 devices can be grouped in two broad categories: fixed-point and floating-point devices. As implied by Figure 1, the first, second, and fifth generations are the fixed-point devices, while the third and the fourth generations (the latest one under development) support floating-point arithmetic.

Figure 1. TMS320 Family Roadmap


The following article, "The TMS320 Family of Digital Signal Processors," by Lin et al., is reprinted from the Proceedings of the IEEE and gives an overview of the TMS320 family. Since additional devices have been developed since the article was written, this section highlights these newer devices. Table 1 shows a comprehensive list of the currently available TMS320 devices and their salient characteristics.

Table 1. TMS320 Family Overview

| Gen | Device | Data Type | Cycle Time (ns) | On-Chip RAM | On-Chip ROM | On-Chip EPROM | Off-Chip Memory | Parallel I/O | Serial I/O | DMA | On-Chip Timers | Package |
| :--- | :--- | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :--- |
| 1st | TMS320C10 ¶ | Integer | 200 | 144 | 1.5K | | 4K | 8 × 16 | | | | DIP/PLCC |
| 1st | TMS320C10-25 | Integer | 160 | 144 | 1.5K | | 4K | 8 × 16 | | | | DIP/PLCC |
| 1st | TMS320C10-14 | Integer | 280 | 144 | 1.5K | | 4K | 8 × 16 | | | | DIP/PLCC |
| 1st | TMS320E14 | Integer | 160 | 256 | | 4K | 4K | 7 × 16 | 1 | | 4 | CERQUAD |
| 1st | TMS320C15 ¶ | Integer | 200 | 256 | 4K | | 4K | 8 × 16 | | | | DIP/PLCC |
| 1st | TMS320C15-25 ¶ | Integer | 160 | 256 | 4K | | 4K | 8 × 16 | | | | DIP/PLCC |
| 1st | TMS320E15 | Integer | 200 | 256 | | 4K | 4K | 8 × 16 | | | | DIP/CERQUAD |
| 1st | TMS320E15-25 | Integer | 160 | 256 | | 4K | 4K | 8 × 16 | | | | DIP/CERQUAD |
| 1st | TMS320C17 | Integer | 200 | 256 | 4K | | 4K | 6 × 16 | 2 | | 1 | DIP/PLCC |
| 1st | TMS320E17 | Integer | 200 | 256 | | 4K | 4K | 6 × 16 | 2 | | 1 | DIP/CERQUAD |
| 2nd | TMS32020 ¶ | Integer | 200 | 544 | | | 128K | 16 × 16 | 1 | † | 1 | PGA |
| 2nd | TMS320C25 | Integer | 100 | 544 | 4K | | 128K | 16 × 16 | 1 | † | 1 | PGA/PLCC |
| 2nd | TMS320C25-50 ¶ | Integer | 80 | 544 | 4K | | 128K | 16 × 16 | 1 | † | 1 | PGA/PLCC |
| 2nd | TMS320E25 ¶ | Integer | 100 | 544 | | 4K | 128K | 16 × 16 | 1 | † | 1 | CERQUAD |
| 2nd | TMS320C26 | Integer | 100 | 1.5K | 256 | | 128K | 16 × 16 | 1 | † | 1 | PLCC |
| 3rd | TMS320C30 ¶ | Float Pt | 60 | 2K | 4K | | 16M | 16M × 32 | 2 | ‡ | 2 | PGA |
| 5th | TMS320C50 ¶ | Integer | 50 | 8.5K | 2K | | 128K | 16 × 16 | 1 | † | 1 | CLCC |

† External DMA

‡ External/Internal DMA

¶ For information on military versions of these devices, contact your local TI sales office.

The additions to the first generation are the TMS320C14 and the TMS320E14; the latter is identical with the former, except that the latter's on-chip program memory is EPROM. The TMS320C14/E14 devices have features that make them suitable for control applications. Figure 2 shows the components of these devices. The memory and the CPU are identical to TMS320C15/E15, while the peripherals reflect the orientation of the devices toward control.

Figure 2. TMS320C14/E14 Key Features


Some of the key features of the TMS320C14/E14 are:

- 160-ns instruction cycle time
- Object-code-compatible with the TMS320C15
- Four 16-bit timers
  - Two general-purpose timers
  - One watchdog timer
  - One baud-rate generator
- 16 individual bit-selectable I/O pins
- Serial port/USART with codec-compatible mode
- Event manager with 6-channel PWM D/A
- CMOS technology, 68-pin CERQUAD

The additions to the second generation are the TMS320E25, the TMS320C25-50, and the TMS320C26. The TMS320E25 is identical to the TMS320C25, except that the 4K-word on-chip program memory is EPROM. Since increased speed is very important for the real-time implementation of certain applications, the TMS320C25-50 was designed as a faster version of the TMS320C25 and has a clock frequency of 50 MHz instead of 40 MHz.

The TMS320C26 is a modification of the TMS320C25 in which the program ROM has been exchanged for RAM. The memory space of the TMS320C26 has 1.5K words of on-chip RAM and 256 words of on-chip ROM, making it ideal for applications requiring larger RAM but minimal external memory.

A new generation of higher-performance fixed-point processors has been introduced in the TMS320 Family: the TMS320C5x devices. This generation shares many features with the first and the second generations, but it also encompasses significant new features. Figure 3 shows the basic components of the first device in that generation, the TMS320C50.

Figure 3. TMS320C50 Key Features


Some of the important features of the TMS320C50 are listed below:

- Source code is upward compatible with the TMS320C1x/C2x devices
- 50/35-ns instruction cycle time
- 8K words of on-chip program/data RAM
- 2K words boot ROM
- 544 words of data/program RAM
- 128K words addressable total memory
- Enhanced general-purpose and DSP-specific instructions
- Static CMOS, 84-pin CERQUAD
- JTAG serial scan path

The software and hardware development tools for the TMS320 family make the development of applications easy. Such tools include assemblers, linkers, simulators, and C compilers for the software. They include evaluation modules, software development boards, and extended development systems for hardware. These tools are mentioned in the following paper by Lin et al. The interested reader can find much more information in the additional literature that is published by Texas Instruments and mentioned in the next section. In particular, the TMS320 Family Development Support Reference Guide is an excellent source.

One important addition to the list of tools is the SPOX operating system, developed by Spectron Microsystems. SPOX permits you to write an application in a high-level language (C) and run it on actual DSP hardware. The operating system of SPOX hides the details of the interface from you and lets you concentrate on your algorithm while running it at supercomputer speeds on the TMS320C30.

## References

Texas Instruments publishes an extensive bibliography to help designers use the TMS320 devices effectively. Besides the user's guides for corresponding generations, there are manuals for the software and the hardware tools. The TMS320 Family Development Support Reference Guide is particularly useful because it provides information not only on development tools offered by TI, but also on those produced by third parties. Here is a partial list of the literature available (the literature number is in parentheses):

- TMS320 Family Development Support Reference Guide (SPRU011A)
- TMS320C1x User's Guide (SPRU013A)
- TMS320C2x User's Guide (SPRU014)
- TMS320C3x User's Guide (SPRU031)
- TMS320C1x/TMS320C2x Assembly Language Tools User's Guide (SPRU018)
- TMS320C30 Assembly Language Tools User's Guide (SPRU035)
- TMS320C25 C Compiler Reference Guide (SPRU024)
- TMS320C30 C Compiler Reference Guide (SPRU034)
- Digital Signal Processing Applications with the TMS320 Family, Volume 1 (SPRA012)
- Digital Signal Processing Applications with the TMS320 Family, Volume 2 (SPRA016)

You can request this literature by calling the Customer Response Center at 1-800-232-3200, or the DSP Hotline at 1-713-274-2320.

## Contents of Other Volumes of the Application Book

## Volume 1

Part I. Digital Signal Processing and the TMS320 Family

- Introduction
- The TMS320 Family

Part II. Fundamental Digital Signal Processing Operations

- Digital Signal Processing Routines
- Implementation of FIR/IIR Filters with the TMS32010/TMS32020
- Implementation of Fast Fourier Transform Algorithms with the TMS32020
- Companding Routines for the TMS32010/TMS32020
- Floating-Point Arithmetic with the TMS32010
- Floating-Point Arithmetic with the TMS32020
- Precision Digital Sine-Wave Generation with the TMS32010
- Matrix Multiplication with the TMS32010 and TMS32020
- DSP Interface Techniques
- Interfacing to Asynchronous Inputs with the TMS32010
- Interfacing External Memory to the TMS32010
- Hardware Interfacing to the TMS32020
- TMS32020 and MC68000 Interface

Part III. Digital Signal Processing Applications

- Telecommunications
- Telecommunications Interfacing to the TMS32010
- Digital Voice Echo Canceller with a TMS32020
- Implementation of the Data Encryption Standard Using the TMS32010
- 32K-bit/s ADPCM with the TMS32010
- A Real-Time Speech Subband Coder Using the TMS32010
- Add DTMF Generation and Decoding to DSP-μP Designs
- Computers and Peripherals
- Speech Coding/Recognition
- A Single-Processor LPC Vocoder
- The Design of an Adaptive Predictive Coder Using a Single-Chip Digital Signal Processor
- Firmware-Programmable μC Aids Speech Recognition
- Image/Graphics
- A Graphics Implementation Using the TMS32020 and TMS34061
- Digital Control
- Control System Compensation and Implementation with the TMS32010


## Volume 2

Part I. Introduction

- Book Overview
- The TMS320 Family of DSP
- The Texas Instruments TMS320C25 Digital Signal Microcomputer

Part II. Digital Signal Interface Techniques

- Hardware Interfacing to the TMS320C2x
- Interfacing the TMS320 Family to the TLC32040 Family
- ICC Requirements of the TMS320C25
- An Implementation of a Software UART Using the TMS320C25
- TMS320C17/E17 and TMS370 Serial Interface


Part III. Data Communications

- Theory and Implementation of a Split-Band Modem Using the TMS320C17
- Implementation of an FSK Modem Using the TMS320C17
- An All-Digital Automatic Gain Control

Part IV. Telecommunications

- General Purpose Tone Decoding and DTMF Detection

Part V. Control

- Digital Control

Part VI. Tools

- TMS320 Algorithm Debugging Techniques


# The TMS320 Family of Digital Signal Processors

Kun-Shan Lin<br>Gene A. Frantz<br>Ray Simar, Jr.<br>Digital Signal Processor Products-Semiconductor Group<br>Texas Instruments

KUN-SHAN LIN, Member, IEEE, GENE A. FRANTZ, Senior Member, IEEE, and RAY SIMAR, Jr.

This paper begins with a discussion of the characteristics of digital signal processing, which are the driving force behind the design of digital signal processors. The remainder of the paper describes the three generations of the TMS320 family of digital signal processors available from Texas Instruments. The evolution in architectural design of these processors and key features of each generation of processors are discussed. More detailed information is provided for the TMS320C25 and TMS320C30, the newest members in the family. The benefits and cost-performance tradeoffs of these processors become obvious when applied to digital signal processing applications, such as telecommunications, data communications, graphics/image processing, etc.

## Digital Signal Processing Characteristics

Digital signal processing (DSP) encompasses a broad spectrum of applications. Some application examples include digital filtering, speech vocoding, image processing, fast Fourier transforms, and digital audio [1]-[10]. These applications and those considered digital signal processing have several characteristics in common:

- mathematically intensive algorithms,
- real-time operation,
- sampled data implementation,
- system flexibility.

To illustrate these characteristics in this section, we will use the digital filter as an example. Specifically, we will use the Finite Impulse Response (FIR) filter which in the time domain takes the general form of

$$
\begin{equation*}
y(n)=\sum_{i=1}^{N} a(i) * x(n-i) \tag{1}
\end{equation*}
$$

where $y(n)$ is the output sample at time $n$, $a(i)$ is the $i$th coefficient or weighting factor, and $x(n-i)$ is the $(n-i)$th input sample.

With this example in mind, we can discuss the various characteristics of digital signal processing: mathematically intensive algorithms, real-time processing, sampled data implementation, and system flexibility. First, let us look at the concept of mathematically intensive algorithms.

Manuscript received October 6, 1986; revised March 27, 1987.
The authors are with the Semiconductor Group, Texas Instruments Inc., Houston, TX 77521-1445, USA.
IEEE Log Number 8716214.

## Mathematically Intensive Algorithms

From (1), we can see that to generate every $y(n)$, we have to compute $N$ multiplications and additions or sums of products. This computation makes it mathematically intensive, especially when $N$ is large.

At this point it is worthwhile to give the FIR filter some physical significance. An FIR filter is a common technique used to eliminate the erratic nature of stock market prices. When the day-to-day closing prices are plotted, it is sometimes difficult to obtain the desired information, such as the trend of the stock, because of the large variations. A simple way of smoothing the data is to calculate the average closing values of the previous five days. For the new average value each day, the oldest value is dropped and the newest value added. Each daily average value, average$(n)$, would be the sum of the weighted values of the latest five days, where the weighting factors, the $a(i)$'s, are $1/5$. In equation form, the average is determined by

$$
\begin{align*}
\text { average }(n)= & \frac{1}{5} * d(n-1)+\frac{1}{5} * d(n-2) \\
& +\frac{1}{5} * d(n-3)+\frac{1}{5} * d(n-4) \\
& +\frac{1}{5} * d(n-5) \tag{2}
\end{align*}
$$

where $d(n-i)$ is the daily stock closing price for the $(n-i)$th day. Equation (2) assumes the same form as (1). This is also the general form of the convolution of two sequences of numbers, $a(i)$ and $x(i)$ [5], [6]. Both FIR filtering and convolution are fundamental to digital signal processing.
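The correspondence between (1) and (2) can be sketched in a few lines of Python. This is only an illustration, not code from the original paper: the function name `fir_filter`, the list-based indexing, and the closing prices are all assumptions made for the example.

```python
def fir_filter(a, x, n):
    """Evaluate y(n) = sum_{i=1}^{N} a(i) * x(n - i), following Eq. (1).

    a: list of N coefficients, where a[i - 1] holds a(i)
    x: list of input samples, where x[k] holds x(k)
    n: output index; N <= n <= len(x) so that every x(n - i) exists
    """
    N = len(a)
    if n < N or n > len(x):
        raise ValueError("need N <= n <= len(x)")
    # N multiplications and N additions per output sample
    return sum(a[i - 1] * x[n - i] for i in range(1, N + 1))


# Hypothetical closing prices d(0)..d(6); these values are not from the text.
closing = [10.0, 12.0, 11.0, 13.0, 14.0, 12.0, 15.0]

# Eq. (2): five equal weighting factors of 1/5.
weights = [1 / 5] * 5

# The average for day 5 uses d(4)..d(0); the average for day 6 drops d(0)
# and picks up d(5), as described in the text.
average_5 = fir_filter(weights, closing, 5)
average_6 = fir_filter(weights, closing, 6)
print(average_5, average_6)
```

Note that, as in (1) and (2), the output for day $n$ depends only on the previous $N$ inputs and does not include the sample for day $n$ itself.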

## Real-Time Processing

In addition to being mathematically intensive, DSP algorithms must be performed in real time. Real time can be defined as a process that is accomplished by the DSP without creating a delay noticeable to the user. In the stock market example, as long as the new average value can be computed prior to the next day when it is needed, it is considered to be completed in real time. In digital signal processing applications, processes happen faster than on a daily basis. In the FIR filter example in (1), the sum of products must usually be computed within hundreds of microseconds, before the next sample comes into the system. A second example is a speech recognition system, where a noticeable delay between a word being spoken and being recognized would be unacceptable and not considered real time. Another example is image processing, where the processing is considered real time if the processor finishes within the frame update period. If the pixel information cannot be updated within the frame update period, problems such as flicker, smearing, or missing information will occur.

## Sampled Data Implementation

The application must be capable of being handled as a sampled data system in order to be processed by digital processors, such as digital signal processors. The stock market is an example of a sampled data system. That is, a specific value (closing value) is assigned to each sample period or day. Other periods may be chosen, such as hourly prices or weekly prices. In an FIR filter as shown in (1), the output $y(n)$ is calculated to be the weighted sum of the previous $N$ inputs. In other words, the input signal is sampled at periodic intervals (1 over the sample rate), each sample is multiplied by its weighting factor $a(i)$, and the products are added together to give the output result $y(n)$. Examples of sample rates for some typical sampled data applications [2], [4] are shown in Table 1.

Table 1 Sample Rates versus Applications

| Application | Nominal Sample Rate |
| :--- | :---: |
| Control | 1 kHz |
| Telecommunications | 8 kHz |
| Speech processing | 8-10 kHz |
| Audio processing | 40-48 kHz |
| Video frame rate | 30 Hz |
| Video pixel rate | 14 MHz |

In a typical DSP application, the processor must be able to effectively handle sampled data in large quantity and also perform arithmetic computations in real time.
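To make the real-time constraint concrete, the following sketch estimates how many instruction cycles fit between two consecutive samples at some of the rates in Table 1. It is a rough budget under stated assumptions, not a figure from the paper: the 100-ns instruction cycle time is an assumed value, the upper end of each range in Table 1 is used, and the function name `cycles_per_sample` is hypothetical. If each FIR tap costs one cycle (an idealization that ignores I/O and loop overhead), the result is also an upper bound on the filter length $N$ in (1).

```python
def cycles_per_sample(sample_rate_hz, cycle_time_ns):
    """Whole instruction cycles available between two consecutive samples.

    The sample period is 1e9 / sample_rate_hz nanoseconds; integer
    arithmetic is used so the result is rounded down exactly.
    """
    return 1_000_000_000 // (sample_rate_hz * cycle_time_ns)


CYCLE_TIME_NS = 100  # assumed instruction cycle time, for illustration only

# Nominal sample rates from Table 1 (upper end of each range), in Hz.
nominal_rates_hz = {
    "Control": 1_000,
    "Telecommunications": 8_000,
    "Speech processing": 10_000,
    "Audio processing": 48_000,
    "Video pixel rate": 14_000_000,
}

budget = {
    app: cycles_per_sample(rate, CYCLE_TIME_NS)
    for app, rate in nominal_rates_hz.items()
}
for app, cycles in budget.items():
    print(f"{app}: {cycles} cycles per sample")
```

At the video pixel rate the budget is zero, because the sample period (about 71 ns) is shorter than one assumed instruction cycle; a processor with this cycle time could not perform even one operation per pixel at that rate.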

## System Flexibility

The design of the digital signal processing system must be flexible enough to allow improvements in the state of the art. We may find out after several weeks of using the average stock price as a means of measuring a particular stock's value that a different method of obtaining the daily information is more suited to our needs, e.g., using different daily weightings, a different number of periods over which to average, or a different procedure for calculating the result. Enough flexibility in the system must be available to allow for these variations. In many of the DSP applications, techniques are still in the developmental phase, and therefore the algorithms tend to change over time. As an example, speech recognition is presently an inexact technique requiring continual algorithmic modification. From this example we can see the need for system flexibility so that the DSP algorithm can be updated. A programmable DSP system can provide this flexibility to the user.
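In a programmable system, the variations mentioned above (different weightings, a different number of periods) amount to changing the coefficient list while the processing routine stays the same. The sketch below illustrates this under assumptions of our own: the function name `weighted_average`, the prices, and both weight sets are hypothetical, and the window here ends at the most recent price in the list.

```python
def weighted_average(prices, weights):
    """Weighted sum of the most recent len(weights) prices.

    weights[0] applies to the newest price, weights[1] to the one
    before it, and so on.
    """
    if len(prices) < len(weights):
        raise ValueError("not enough prices for the chosen window")
    newest_first = prices[::-1][:len(weights)]
    return sum(w * p for w, p in zip(weights, newest_first))


# Hypothetical closing prices, oldest first; not taken from the text.
prices = [10.0, 12.0, 11.0, 13.0, 14.0]

# Original method: five equal weights of 1/5.
equal_five_day = [0.2] * 5

# A revised method: three periods, with more weight on recent days.
recent_heavy_three_day = [0.5, 0.3, 0.2]

smooth = weighted_average(prices, equal_five_day)
responsive = weighted_average(prices, recent_heavy_three_day)
print(smooth, responsive)
```

Only the data passed to the routine changes between the two methods, which is the kind of update a fixed-function hardware filter would not allow.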

## Historical DSP Solutions

Over the past several decades, digital signal processing machines have gone through several evolutions in order to incorporate these characteristics. Large mainframe computers were initially used to process signals in the digital domain. Typically, because of state-of-the-art limitations, this was done in nonreal time. As the state of the art advanced, array processors were added to the processing task. Because of their flexibility and speed, array processors have become the accepted solution for the research laboratory, and have been extended to end-applications in many instances. However, integrated circuit technology has matured, thus allowing for the design of faster microprocessors and microcomputers. As a result, many digital signal processing applications have migrated from the array processor to microprocessor subsystems (i.e., bit-slice machines) to single-chip integrated circuit solutions. This migration has brought the cost of the DSP solution down to a point that allows pervasive use of the technology. The increased performance of these highly integrated circuits has also expanded DSP applications from traditional telecommunications to graphics/image processing, then to consumer audio processing.
A recent development in DSP technology is the single-chip digital signal processor, such as the TMS320 family of processors. These processors give the designer a DSP solution with a level of performance attainable only by array processors a few years ago. Fig. 1 shows the TMS320 family in graphical form with the $y$-axis indicating the hypothetical performance and the $x$-axis being the evolution of the semiconductor processing technology. The first member of the family, the TMS32010, was disclosed to the market in 1982 [11], [12]. It gave the system designer the first microcomputer capable of performing five million DSP operations per second (5 MIPS), including the add and multiply functions [13] required in (1). Today there are a dozen spinoffs from the TMS32010 in the first generation of the TMS320 family. Some of these devices are the TMS320C10, TMS320C15, and TMS320C17 [14]. The second generation of devices includes the TMS32020 [15] and TMS320C25 [16]. The TMS320C25 can perform 10 MIPS [16]. In addition, expanded memory space, combined single-cycle multiply/accumulate operation, multiprocessing capabilities, and expanded I/O functions have given the TMS320C25 a 2 to 4 times performance improvement over its predecessors. The third generation of the TMS320 family of processors, the TMS320C30 [26], [27], has a computational rate of 33 million DSP floating-point operations per second (33 MFLOPS). Its performance (speed, throughput, and precision) far exceeds that of the digital signal processors available today and has reached the level of a supercomputer.
If we look closely at the TMS320 family as shown in Fig. 1, we can see that devices in the same generation, such as the TMS320C10, TMS320C15, and TMS320C17, are assembly object-code compatible. Devices across generations, such as the TMS320C10 and TMS320C25, are assembly source-code compatible. Software investment in DSP algorithms can therefore be maintained during a system upgrade. Another point is that since the introduction of the TMS32010, semiconductor processing technology has evolved from $3-\mu \mathrm{m}$ NMOS to $2-\mu \mathrm{m}$ CMOS to $1-\mu \mathrm{m}$ CMOS.


Fig. 1. The TMS320 family of digital signal processors.

The TMS320 generations of processors have also taken the same evolution in processing technology. Low power consumption, high performance, and high-density circuit integration are some of the direct benefits of this semiconductor processing evolution.

From Fig. 1, it can be observed that various DSP building blocks, such as the CPU, RAM, ROM, I/O configurations, and processor speeds, have been designed as individual modules and can be rearranged or combined with other standard cells to meet the needs of specific applications. Each of the three generations (and future generations) will evolve in the same manner. As applications become more sophisticated, semicustom solutions based on the core CPU will become the solution of choice. An example of this approach is the TMS320C17/E17, which consists of the TMS320C10 core CPU, expanded 4K-word program ROM (TMS320C17) or EPROM (TMS320E17), enlarged data RAM of 256 words, dual serial ports, companding hardware, and a coprocessor interface. Furthermore, as integrated circuit layout rules move into smaller geometry (now at $2 \mu \mathrm{~m}$, rapidly going to $1 \mu \mathrm{~m}$ ), not only will the TMS320 devices become smaller in size, but also multiple CPUs will be incorporated on the same device along with application-specific I/O to achieve low-cost integrated system solutions.

## Basic TMS320 Architecture

As noted previously, the underlying assumption regarding a digital signal processor is fast arithmetic operations and high throughput to handle mathematically intensive algorithms in real time. In the TMS320 family [11]-[17], [26], [27], this is accomplished by using the following basic concepts:

- Harvard architecture,
- extensive pipelining,
- dedicated hardware multiplier,
- special DSP instructions,
- fast instruction cycle.

These concepts were designed into the TMS320 digital signal processors to handle the vast amount of data characteristic of DSP operations, and to allow most DSP operations to be executed in a single-cycle instruction. Furthermore, the TMS320 processors are programmable devices, providing the flexibility and ease of use of general-purpose microprocessors. The following paragraphs discuss how each of the above concepts is used in the TMS320 family of devices to make them useful in digital signal processing applications.

## Harvard Architecture

The TMS320 utilizes a modified Harvard architecture for speed and flexibility. In a strict Harvard architecture [18], [19], the program and data memories lie in two separate spaces, permitting a full overlap of instruction fetch and execution. The TMS320 family's modification of the Harvard architecture further allows transfer between program and data spaces, thereby increasing the flexibility of the device. This architectural modification eliminates the need for a separate coefficient ROM and also maximizes the processing power by maintaining two separate bus structures (program and data) for full-speed execution.

## Extensive Pipelining

In conjunction with the Harvard architecture, pipelining is used extensively to reduce the instruction cycle time to its absolute minimum, and to increase the throughput of the processor. The pipeline can be anywhere from two to four levels deep, depending on which processor in the family is used. The TMS320 family architecture uses a two-level pipeline for its first generation, a three-level pipeline for its second generation, and a four-level pipeline for its third generation of processors. This means that the device is processing from two to four instructions in parallel, and each instruction is at a different stage in its execution. Fig. 2 shows an example of a three-level pipeline operation.


Fig. 2. Three-level pipeline operation.

In pipeline operation, the prefetch, decode, and execute operations can be handled independently, thus allowing the execution of instructions to overlap. During any instruction cycle, three different instructions are active, each at a different stage of completion. For example, as the $N$ th instruction is being prefetched, the previous $(N-1)$ th instruction is being decoded, and the previous $(N-2)$ th instruction is being executed. In general, the pipeline is transparent to the user.
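The overlap described above can be made concrete with a small Python model of an idealized three-level pipeline. It is a sketch under simplifying assumptions: every instruction takes one cycle per stage and there are no branches, so the pipeline is never emptied. The names `STAGES` and `pipeline_schedule` are hypothetical.

```python
STAGES = ("prefetch", "decode", "execute")


def pipeline_schedule(num_instructions):
    """Return, for each cycle, a dict mapping stage name to instruction index.

    Instruction k is prefetched in cycle k, decoded in cycle k + 1, and
    executed in cycle k + 2. Stages holding no instruction are omitted.
    """
    depth = len(STAGES)
    total_cycles = num_instructions + depth - 1
    schedule = []
    for cycle in range(total_cycles):
        active = {}
        for offset, stage in enumerate(STAGES):
            k = cycle - offset
            if 0 <= k < num_instructions:
                active[stage] = k
        schedule.append(active)
    return schedule


schedule = pipeline_schedule(5)
for cycle, active in enumerate(schedule):
    print(cycle, active)
```

Once the pipeline is full (cycle 2 in this model), three instructions are active in every cycle and one instruction completes per cycle.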

## Dedicated Hardware Multiplier

As we saw in the general form of an FIR filter, multiplication is an important part of digital signal processing. For each filter tap (denoted by $i$), a multiplication and an addition must take place. The faster a multiplication can be performed, the higher the performance of the digital signal processor. In general-purpose microprocessors, the multiplication instruction is constructed by a series of additions, therefore taking many instruction cycles. In comparison, a characteristic of every DSP device is a dedicated multiplier. In the TMS320 family, multiplication is a single-cycle instruction as a result of the dedicated hardware multiplier. If we look at the arithmetic for each tap of the FIR filter to be performed by the TMS32010, we see that each tap of the filter requires a multiplication (MPY) instruction.

    LT    ;LOAD MULTIPLICAND INTO T REGISTER
    DMOV  ;MOVE DATA IN MEMORY TO DO DELAY
    MPY   ;MULTIPLY
    APAC  ;ADD MULTIPLICATION RESULT TO ACC

The other three instructions are used to load the multiplier circuit with the multiplicand (LT), move the data through the filter tap (DMOV), and add the result of the multiplication (stored in the product register) to the accumulator (APAC). Specifically, the multiply instruction (MPY) loads the multiplier into the dedicated multiplier and performs the multiplication, placing the result in a product register. Therefore, if a 256-tap FIR filter is used, these four instructions are repeated 256 times. At each sample period, 256 multiplications must be performed. In a typical general-purpose microprocessor, this requires each tap to be 30 to 40 instruction cycles long, whereas in the TMS320C10, it is only four instruction cycles. We will see in the next section how special DSP instructions reduce the time required for each FIR tap even further.
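The roles of the four instructions can be illustrated with a behavioral model in Python. This is a sketch, not TMS32010 code: the instruction operands are not given in the listing, so the model uses list indices in their place, and the names `fir_sample`, `t_reg`, and `p_reg` as well as the processing order (oldest tap first, so that each data move lands in a slot that has already been read) are assumptions of the example.

```python
def to_int32(value):
    """Wrap a Python integer to a signed 32-bit two's-complement value."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def fir_sample(delay_line, coeffs):
    """Compute one FIR output with the LT / DMOV / MPY / APAC pattern.

    delay_line[i] holds x(n - i) and has one spare slot at the end.
    Returns the accumulator value and the number of instructions issued.
    """
    acc = 0
    cycles = 0
    for i in reversed(range(len(coeffs))):
        t_reg = delay_line[i]              # LT: load multiplicand into T
        delay_line[i + 1] = delay_line[i]  # DMOV: delay sample by one period
        p_reg = t_reg * coeffs[i]          # MPY: product into P register
        acc = to_int32(acc + p_reg)        # APAC: add P to the accumulator
        cycles += 4                        # four single-cycle instructions
    return acc, cycles


coeffs = [1, 2, 3]            # a(0), a(1), a(2): illustrative integers
delay_line = [10, 20, 30, 0]  # x(n), x(n-1), x(n-2), spare slot
acc, cycles = fir_sample(delay_line, coeffs)
print(acc, cycles, delay_line)
```

After the call, every sample has moved one position down the delay line, so the next input sample can be written to `delay_line[0]`.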

## Special DSP Instructions

Another characteristic of DSP devices is the use of special instructions. We were introduced to one of them in the previous example, the DMOV (data move) instruction. In digital signal processing, the delay operator $\left(z^{-1}\right)$ is very important. Recalling the stock market example, during each new sample period (i.e., each new day), the oldest piece of data (the closing price five days ago) was dropped and a new one (today's closing price) was added. In other words, each piece of the old data is delayed or moved one sample period to make room for the incoming most current sample. This delay is the function of the DMOV instruction. Another special instruction in the TMS32010 is the LTD instruction. It executes the LT, DMOV, and APAC instructions in a single cycle. The LTD and MPY instructions then reduce the number of instruction cycles per FIR filter tap from four to two. In the second-generation TMS320, such as the TMS320C25, two more special instructions have been included (the RPT and MACD instructions) to reduce the number of cycles per tap to one, as shown in the following:

```
RPTK 255 ;REPEAT THE NEXT INSTRUCTION 256 TIMES (N+1)
MACD     ;LT, DMOV, MPY, AND APAC
```
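The reduction from four cycles per tap to two and then to one can be put into numbers for the 256-tap filter used as an example above. The following Python sketch assumes a 200-ns instruction cycle for the TMS32010 (5 MIPS) and a 100-ns cycle for the TMS320C25 (10 MIPS), and it counts only the tap loop; setup and I/O instructions are ignored. The names `CASES` and `tap_loop_time_us` are hypothetical.

```python
TAPS = 256  # filter length used in the text

# (label, instruction cycles per tap, instruction cycle time in ns)
CASES = [
    ("TMS32010, LT/DMOV/MPY/APAC", 4, 200),
    ("TMS32010, LTD/MPY", 2, 200),
    ("TMS320C25, RPTK/MACD", 1, 100),
]


def tap_loop_time_us(taps, cycles_per_tap, cycle_ns):
    """Time spent in the tap loop for one output sample, in microseconds."""
    return taps * cycles_per_tap * cycle_ns / 1000.0


times_us = {label: tap_loop_time_us(TAPS, cpt, ns) for label, cpt, ns in CASES}
for label, t in times_us.items():
    print(f"{label}: {t:.1f} us per output sample")
```

Under these assumptions the tap loop shrinks by a factor of eight between the first and the last case: a factor of four from the instruction set and a factor of two from the shorter cycle time.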


## Fast Instruction Cycle

The real-time processing capability is further enhanced by the raw speed of the processor in executing instructions. The characteristics which we have discussed, combined with optimization of the integrated circuit design for speed, give the DSP devices instruction cycle times less than 200 ns. The specific instruction cycle times for the TMS320 family are given in Table 2. These fast cycle times have made the TMS320 family of processors highly suited for many real-time DSP applications. Table 1 showed the sample rates for some typical DSP applications. This table can be combined with the cycle times indicated in Table 2 to show how many instruction cycles per sample can be achieved by the various generations of the TMS320 for real-time applications (see Fig. 3).

Table 2 TMS320 Cycle Times

| Device | Cycle Time <br> $(\mathrm{ns})$ |
| :--- | :---: |
| TMS320C10* | $160-200$ |
| TMS32020 | $160-200$ |
| TMS320C25 | $100-125$ |
| TMS320C30 | $60-75$ |
| *The same cycle time applies to all of the first-generation processors. |  |

As we can see from Fig. 3, many instruction cycles are available to process the signal or to generate commands for real-time control applications. Therefore, for simple control applications, the general-purpose microprocessors or controllers would be adequate. However, for more mathematically intensive control applications, such as robotics and adaptive control, digital signal processors are much better suited [24]. The number of available instruction cycles is reduced as we increase the sample rate from 8 kHz for typical telecommunication applications to $40-48 \mathrm{kHz}$ for audio processing. Since most of these real-time applications require only a few hundred instructions per sample (such as ADPCM [4] and echo cancelation [4]), this is within the reach of the TMS320. For higher sample rate applications, such as video/image processing, digital signal processors available today are not capable of handling the processing of the real-time video data. Therefore, for these types of applications, multiple digital signal processors and frame buffers are usually required. From Fig. 3, it can also be seen that for slower speed applications, such as control, the first-generation TMS320 provides better cost-performance tradeoffs than the other processors. For high sample rate applications, such as video/image processing, the second and third generations of the TMS320 with their multiprocessing capabilities and high throughput are better suited.

Fig. 3. Number of instruction cycles/sample versus sample rate for the TMS320 family.
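The relationship between sample rate, cycle time, and the number of instruction cycles per sample is a simple division, sketched below in Python. The sample rates are taken from Table 1 and the cycle times are single values picked from the ranges in Table 2; which value to pick for each device, and the names `cycles_per_sample` and `budget`, are assumptions of the example.

```python
def cycles_per_sample(sample_rate_hz, cycle_time_ns):
    """Whole instruction cycles that fit into one sample period."""
    # One sample period is 1e9 / sample_rate_hz nanoseconds; integer
    # arithmetic avoids floating-point rounding.
    return 10**9 // (sample_rate_hz * cycle_time_ns)


SAMPLE_RATES_HZ = {
    "Control": 1_000,
    "Telecommunications": 8_000,
    "Audio processing": 48_000,
    "Video pixel rate": 14_000_000,
}
CYCLE_TIMES_NS = {"TMS320C10": 200, "TMS320C25": 100, "TMS320C30": 60}

budget = {
    (app, device): cycles_per_sample(rate, cycle)
    for app, rate in SAMPLE_RATES_HZ.items()
    for device, cycle in CYCLE_TIMES_NS.items()
}
for (app, device), cycles in budget.items():
    print(f"{app:20s} {device:10s} {cycles:6d} cycles/sample")
```

At the video pixel rate the budget drops to at most one cycle per sample, which is consistent with the statement that a single processor cannot handle real-time video data.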

Now that we have discussed the basic characteristics of digital signal processors, we can concentrate on specific details of each of the three generations of the TMS320 family devices.

## The First Generation of the TMS320 Family

The first generation of the TMS320 family includes the TMS32010 [13] and TMS32011 [17], which are processed in $2.4-\mu \mathrm{m}$ NMOS technology, and the TMS320C10 [13], TMS320C15/E15 [14], and TMS320C17/E17 [14], processed in $1.8-\mu \mathrm{m}$ CMOS technology. Some of the key features of these devices are [14] as follows:

- Instruction cycle timing:
  - 160 ns
  - 200 ns
  - 280 ns.
- On-chip data RAM:
  - 144 words
  - 256 words (TMS320C15/E15, TMS320C17/E17).
- On-chip program ROM:
  - 1.5K words
  - 4K words (TMS320C15, TMS320C17).
- 4K words of on-chip program EPROM (TMS320E15, TMS320E17).
- External memory expansion up to 4K words at full speed.
- $16 \times 16$-bit parallel multiplier with 32-bit result.
- Barrel shifter for shifting data memory words into the ALU.
- Parallel shifter.
- $4 \times 12$-bit stack that allows context switching.
- Two auxiliary registers for indirect addressing.
- Dual-channel serial port (TMS32011, TMS320C17, TMS320E17).
- On-chip companding hardware (TMS32011, TMS320C17, TMS320E17).
- Coprocessor interface (TMS320C17, TMS320E17).
- Device packaging:
  - 40-pin DIP
  - 44-pin PLCC.


## TMS320C10

The first generation of the TMS320 processors is based on the architecture of the TMS32010 and its CMOS replica, the TMS320C10. The TMS32010 was introduced in 1982 and was the first microcomputer capable of performing 5 MIPS. Since the TMS32010 has been covered extensively in the literature [4], [11]-[14], we will only provide a cursory review here. A functional block diagram of the TMS320C10 is shown in Fig. 4.
As shown in Fig. 4, the TMS320C10 utilizes the modified Harvard architecture in which program memory and data memory lie in two separate spaces. Program memory can reside on-chip (1.5K words) or off-chip (4K words). Data memory is the $144 \times 16$-bit on-chip data RAM. There are four basic arithmetic elements: the ALU, the accumulator, the multiplier, and the shifters. All arithmetic operations are performed using two's-complement arithmetic.

ALU: The ALU is a general-purpose arithmetic logic unit that operates with a 32 -bit data word. The unit can add, subtract, and perform logical operations.

Accumulator: The accumulator stores the output from the ALU and is also often an input to the ALU. It operates with a 32-bit word length. The accumulator is divided into a high-order word (bits 31 through 16) and a low-order word (bits 15 through 0). Instructions are provided for storing the high- and low-order accumulator words in data memory (SACH for store accumulator high and SACL for store accumulator low).

Multiplier: The $16 \times 16$-bit parallel multiplier consists of three units: the T register, the P register, and the multiplier array. The T register is a 16-bit register that stores the multiplicand, while the P register is a 32-bit register that stores the product. In order to use the multiplier, the multiplicand must first be loaded into the T register from the data RAM by using one of the following instructions: LT, LTA, or LTD. Then the MPY (multiply) or the MPYK (multiply immediate) instruction is executed. The multiply and accumulate operations can be accomplished in two instruction cycles with the LTA/LTD and MPY/MPYK instructions.

Fig. 4. TMS320C10 functional block diagram.

Shifters: Two shifters are available for manipulating data: a barrel shifter and a parallel shifter. The barrel shifter performs a left-shift of 0 to 16 bits on all data memory words that are to be loaded into, subtracted from, or added to the accumulator. The parallel shifter, activated by the SACH instruction, can execute a shift of 0, 1, or 4 bits to take care of the sign bits in two's-complement arithmetic calculations.
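The effect of the two shifters and of the SACH/SACL stores can be sketched with integer arithmetic in Python. The model below is an illustration under stated assumptions: the function names are hypothetical, and the example interprets the operands as Q15 fractions (16-bit values scaled by $2^{15}$), a common convention that the text does not spell out. The product of two Q15 numbers carries two sign bits, and a left shift of 1 before storing the high word removes the redundant one.

```python
def to_int16(value):
    """Wrap an integer to a signed 16-bit two's-complement value."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def to_int32(value):
    """Wrap an integer to a signed 32-bit two's-complement value."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def load_with_shift(word16, shift):
    """Barrel shifter: sign-extend a 16-bit word, left-shift by 0..16 bits."""
    if not 0 <= shift <= 16:
        raise ValueError("shift must be between 0 and 16")
    return to_int32(to_int16(word16) << shift)


def sach(acc, shift=0):
    """Store the high accumulator word after a left shift of 0, 1, or 4."""
    if shift not in (0, 1, 4):
        raise ValueError("shift must be 0, 1, or 4")
    return to_int16((acc << shift) >> 16)


def sacl(acc):
    """Store the low accumulator word (bits 15 through 0)."""
    return to_int16(acc)


half_q15 = 16384                 # 0.5 in Q15 (assumed number format)
product = half_q15 * half_q15    # 0.25 in Q30, held in a 32-bit register
print(sach(product, 1))          # Q15 result with the extra sign bit removed
print(sach(0x12345678), sacl(0x12345678))
```

With a shift of 0 the stored high word would be 4096, which is 0.25 only in a Q14 interpretation; the shift of 1 restores the Q15 scaling.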

Based on the architecture of the TMS32010/C10, several spinoffs have been generated offering different processor speeds, expanded memory, and various I/O integration. Currently, the newest members in this generation are the TMS320C15/E15 and the TMS320C17/E17 [14].

## TMS320C15/E15

The TMS320C15 and TMS320E15 are fully object-code and pin-for-pin compatible with the TMS32010 and offer expanded on-chip RAM of 256 words and on-chip program ROM (TMS320C15) or EPROM (TMS320E15) of 4K words. The TMS320C15 is available in either a 200-ns version or a 160-ns version (TMS320C15-25).

## TMS320C17/E17

The TMS320C17/E17 is a dedicated microcomputer with 4K words of on-chip program ROM (TMS320C17) or EPROM (TMS320E17), a dual-channel serial port for full-duplex serial communication, on-chip companding hardware (μ-law/A-law), a serial port timer for stand-alone serial communication, and a coprocessor interface for zero-glue interface between the processor and any 4/8/16-bit microprocessor. The TMS320C17/E17 is also object-code compatible with the TMS32010 and can use the same development tools. The device is based on the TMS320C10 core CPU with peripheral memory and I/O modules added on-chip. The TMS320C17/E17 can be regarded as a semicustom DSP solution suited for high-volume telecommunication and consumer applications.

Table 3 TMS320 First-Generation Processors

| TMS320 Devices | Instruction Cycle Time (ns) | Process | On-Chip Prog ROM (words) | On-Chip Prog EPROM (words) | On-Chip Data RAM (words) | Off-Chip Prog (words) | Ref |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| TMS32010 | 200 | NMOS | 1.5K |  | 144 | 4K | [13] |
| TMS32010-25 | 160 | NMOS | 1.5K |  | 144 | 4K | [13] |
| TMS32010-14 | 280 | NMOS | 1.5K |  | 144 | 4K | [13] |
| TMS32011 | 200 | NMOS | 1.5K |  | 144 |  | [17] |
| TMS320C10 | 200 | CMOS | 1.5K |  | 144 | 4K | [13] |
| TMS320C10-25 | 160 | CMOS | 1.5K |  | 144 | 4K | [13] |
| TMS320C15 | 200 | CMOS | 4.0K |  | 256 | 4K | [13] |
| TMS320C15-25 | 160 | CMOS | 4.0K |  | 256 | 4K | [14] |
| TMS320E15 | 200 | CMOS |  | 4.0K | 256 | 4K | [14] |
| TMS320C17 | 200 | CMOS | 4.0K |  | 256 |  | [14] |
| TMS320C17-25 | 160 | CMOS | 4.0K |  | 256 |  | [14] |
| TMS320E17 | 200 | CMOS |  | 4.0K | 256 |  | [14] |

Table 3 provides a feature comparison of all members of the first-generation TMS320 processors. References to more detailed information on these processors are also provided.

## The Second Generation of the TMS320 Family

The second-generation TMS320 digital signal processors include two members, the TMS32020 [15] and the TMS320C25 [16]. The architecture of these devices has evolved from the TMS32010, the first member of the TMS320 family. Key features of the second-generation TMS320 are as follows:

- Instruction cycle timing:
  - 100 ns (TMS320C25)
  - 200 ns (TMS32020).
- 4K words of on-chip masked ROM (TMS320C25).
- 544 words of on-chip data RAM.
- 128K words of total program/data memory space.
- Eight auxiliary registers with a dedicated arithmetic unit.
- Eight-level hardware stack.
- Fully static double-buffered serial port.
- Wait states for communication to slower off-chip memories.
- Serial port for multiprocessing or interfacing to codecs.
- Concurrent DMA using an extended hold operation (TMS320C25).
- Bit-reversed addressing modes for fast Fourier transforms (TMS320C25).
- Extended-precision arithmetic and adaptive filtering support (TMS320C25).
- Full-speed operation of MAC/MACD instructions from external memory (TMS320C25).
- Accumulator carry bit and related instructions (TMS320C25).
- $1.8-\mu \mathrm{m}$ CMOS technology (TMS320C25):
  - 68-pin grid array (PGA) package
  - 68-pin lead chip carrier (PLCC) package.
- $2.4-\mu \mathrm{m}$ NMOS technology (TMS32020):
  - 68-pin PGA package.


## TMS320C25 Architecture

The TMS320C25 is the latest member in the second generation of TMS320 digital signal processors. It is a pin-compatible CMOS version of the TMS32020 microprocessor, but with an instruction cycle time twice as fast and the inclusion of additional hardware and software features. The instruction set is a superset of both the TMS32010 and TMS32020, maintaining source-code compatibility. In addition, it is completely object-code compatible with the TMS32020 so that TMS32020 programs run unmodified on the TMS320C25.

The 100-ns instruction cycle time provides a significant throughput advantage for many existing applications. Since most instructions are capable of executing in a single cycle, the processor is capable of executing ten million instructions per second (10 MIPS). Increased throughput on the TMS320C25 for many DSP applications is attained by means of single-cycle multiply/accumulate instructions with a data move option (MAC/MACD), eight auxiliary registers with a dedicated arithmetic unit, instruction set support for adaptive filtering and extended-precision arithmetic, bit-reversal addressing, and faster I/O necessary for data-intensive signal processing.

Instructions are included to provide data transfers between the two memory spaces. Externally, the program and data memory spaces are multiplexed over the same bus so as to maximize the address range for both spaces while minimizing the pin count of the device. Internally, the TMS320C25 architecture maximizes processing power by maintaining two separate bus structures, program and data, for full-speed execution.

Program execution in the device takes the form of a three-level instruction fetch-decode-execute pipeline (see Fig. 2). The pipeline is essentially invisible to the user, except in some cases where it must be broken (such as for branch instructions). In this case, the instruction timing takes into account the fact that the pipeline must be emptied and refilled. Two large on-chip data RAM blocks (a total of 544 words), one of which is configurable either as program or data memory, provide increased flexibility in system design. An off-chip 64K-word directly addressable data memory address space is included to facilitate implementations of DSP algorithms. The large on-chip 4K-word masked ROM can be used for cost-reduced systems, thus providing for a true single-chip DSP solution. The remainder of the 64K-word program memory space is located externally. Large programs can execute at full speed from this memory space. Programs may also be downloaded from slow external memory to on-chip RAM for full-speed operation. The VLSI implementation of the TMS320C25 incorporates all of these features as well as many others such as a hardware timer, serial port, and block data transfer capabilities.

A functional block diagram of the TMS320C25, shown in Fig. 5, outlines the principal blocks and data paths within the processor. The diagram also shows all of the TMS320C25 interface pins.

Fig. 5. TMS320C25 functional block diagram.

In the following architectural discussions on the memory, central arithmetic logic unit, hardware multiplier, control operations, serial port, and I/O interface, please refer to the block diagram shown in Fig. 5.

Memory Allocation: The TMS320C25 provides a total of 4K 16-bit words of on-chip program ROM and 544 16-bit words of on-chip data RAM. The RAM is divided into three separate blocks (B0, B1, and B2). Of the 544 words, 256 words (block B0) are configurable as either data or program memory by CNFD (configure data memory) or CNFP (configure program memory) instructions provided for that purpose; 288 words (blocks B1 and B2) are always data memory. A data memory size of 544 words allows the TMS320C25 to handle a data array of 512 words while still leaving 32 locations for intermediate storage. The TMS320C25 provides 64K words of off-chip directly addressable data memory space as well as a 64K-word off-chip program memory space.

A register file containing eight Auxiliary Registers (AR0-AR7), which are used for indirect addressing of data memory and for temporary storage, increases the flexibility and efficiency of the device. These registers may be either directly addressed by an instruction or indirectly addressed by a 3-bit Auxiliary Register Pointer (ARP). The auxiliary registers and the ARP may be loaded either from data memory or by an immediate operand defined in the instruction. The contents of these registers may also be stored into data memory. The auxiliary register file is connected to the Auxiliary Register Arithmetic Unit (ARAU). With the ARAU, accessing tables of information does not require the CALU for address manipulation, thus freeing it for other operations.

Central Arithmetic Logic Unit (CALU): The CALU contains a 16-bit scaling shifter, a $16 \times 16$-bit parallel multiplier, a 32-bit Arithmetic Logic Unit (ALU), and a 32-bit accumulator. The scaling shifter has a 16-bit input connected to the data bus and a 32-bit output connected to the ALU. This shifter produces a left-shift of 0 to 16 bits on the input data, as programmed in the instruction. Additional shifters at the outputs of both the accumulator and the multiplier are suitable for numerical scaling, bit extraction, extended-precision arithmetic, and overflow prevention.

The following steps occur in the implementation of a typical ALU instruction:

1) Data are fetched from the RAM on the data bus.
2) Data are passed through the scaling shifter and the ALU where the arithmetic is performed.
3) The result is moved into the accumulator.

The 32-bit accumulator is split into two 16-bit segments for storage in data memory: ACCH (accumulator high) and ACCL (accumulator low). The accumulator has a carry bit to facilitate multiple-precision arithmetic for both add and subtract instructions.
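
The role of a carry bit in multiple-precision arithmetic can be illustrated with a 64-bit addition built from two 32-bit additions. The Python sketch below shows the general technique only; it does not reproduce the TMS320C25 instruction sequence, and the names `add_with_carry` and `add64` are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def add_with_carry(a, b, carry_in=0):
    """Add two unsigned 32-bit words plus a carry; return (sum, carry_out)."""
    total = (a & MASK32) + (b & MASK32) + carry_in
    return total & MASK32, total >> 32


def add64(a_hi, a_lo, b_hi, b_lo):
    """Add two 64-bit values given as (high word, low word) pairs."""
    lo, carry = add_with_carry(a_lo, b_lo)     # low words set the carry
    hi, _ = add_with_carry(a_hi, b_hi, carry)  # high words consume it
    return hi, lo


# 0x00000001_FFFFFFFF + 0x00000000_00000001 = 0x00000002_00000000
print(add64(0x00000001, 0xFFFFFFFF, 0x00000000, 0x00000001))
```

Without the carry from the low-word addition, the high word of the result would be 1 instead of 2.
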
Hardware Multiplier: The TMS320C25 utilizes a $16 \times 16$-bit hardware multiplier, which is capable of computing a 32-bit product during every machine cycle. Two registers are associated with the multiplier:

- a 16-bit Temporary Register (TR) that holds one of the operands for the multiplier, and
- a 32-bit Product Register (PR) that holds the product.

The output of the product register can be left-shifted 1 or 4 bits. This is useful for implementing fractional arithmetic or justifying fractional products. The output of the PR can also be right-shifted 6 bits to enable the execution of up to 128 consecutive multiply/accumulates without overflow. An unsigned multiply (MPYU) instruction facilitates extended-precision multiplication.
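
The reason a right shift of 6 bits leaves room for 128 accumulations can be checked numerically. A product of two 16-bit operands needs up to 31 bits including the sign; shifting it right by 6 bits leaves 7 bits of headroom in a 32-bit accumulator, and $2^{7} = 128$. The Python sketch below uses the largest positive operand, 32767, for all 128 products; the operand values and the name `accumulate` are assumptions of the example.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)


def accumulate(products, shift):
    """Sum products after an arithmetic right shift.

    Returns the final sum and a flag telling whether any intermediate sum
    left the signed 32-bit range.
    """
    acc = 0
    overflowed = False
    for p in products:
        acc += p >> shift
        if not INT32_MIN <= acc <= INT32_MAX:
            overflowed = True
    return acc, overflowed


products = [32767 * 32767] * 128  # worst-case positive products
_, overflow_unshifted = accumulate(products, 0)
acc_shifted, overflow_shifted = accumulate(products, 6)
print(overflow_unshifted, overflow_shifted, acc_shifted)
```

Without the shift the third product already exceeds the 32-bit range; with the shift all 128 products fit, at the cost of the six least significant bits of each product.
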
I/O Interface: The TMS320C25 I/O space consists of 16 input and 16 output ports. These ports provide the full 16-bit parallel I/O interface via the data bus on the device. A single input (IN) or output (OUT) operation typically takes two cycles; however, when used with the repeat counter, the operation becomes single-cycle. I/O devices are mapped into the I/O address space using the processor's external address and data buses in the same manner as memory-mapped devices. Interfacing to memory and I/O devices of varying speeds is accomplished by using the READY line.
A Direct Memory Access (DMA) to external program/data memory is also supported. Another processor can take complete control of the TMS320C25's external memory by asserting HOLD low, causing the TMS320C25 to place its address, data, and control lines in the high-impedance state. Signaling between the external processor and the TMS320C25 can be performed using interrupts. Two modes of DMA are available on the device. In the first, execution is suspended during assertion of HOLD. In the second "concurrent DMA" mode, the TMS320C25 continues to execute its program while operating from internal RAM or ROM, thus greatly increasing throughput in data-intensive applications.

## TMS320C25 Software

The majority of the TMS320C25 instructions (97 out of 133) are executed in a single instruction cycle. Of the 36 instructions that require additional cycles of execution, 21 involve branches, calls, and returns that result in a reload of the program counter and a break in the execution pipeline. Another seven of the instructions are two-word, long-immediate instructions. The remaining eight instructions support I/O, transfers of data between memory spaces, or provide for additional parallel operation in the processor. Furthermore, these eight instructions (IN, OUT, BLKD, BLKP, TBLR, TBLW, MAC, and MACD) become single-cycle when used in conjunction with the repeat counter. The functional performance of the instructions exploits the parallelism of the processor, allowing complex and/or numerically intensive computations to be implemented in relatively few instructions.
Addressing Modes: Since most of the instructions are coded in a single 16-bit word, most instructions can be executed in a single cycle. Three memory addressing modes are available with the instruction set: direct, indirect, and immediate addressing. Both direct and indirect addressing are used to access data memory. Immediate addressing uses the contents of the memory addressed by the program counter.
When using direct addressing, 7 bits of the instruction word are concatenated with the 9 bits of the data memory page pointer (DP) to form the 16-bit data memory address. With a 128-word page length, the DP register points to one of 512 possible data memory pages to obtain a 64K total data memory space. Indirect addressing is provided by the auxiliary registers (AR0-AR7). The seven types of indirect addressing are shown in Table 4. Bit-reversed indexed addressing modes allow efficient I/O to be performed for the resequencing of data points in a radix-2 FFT program.
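
The formation of a direct address from the page pointer and the instruction word can be written as a single shift-and-OR operation. The Python sketch below follows the bit widths given above (9-bit DP, 7-bit offset); the function name `direct_address` is hypothetical.

```python
def direct_address(dp, offset):
    """Concatenate a 9-bit page pointer with a 7-bit in-page offset."""
    if not 0 <= dp < 512:
        raise ValueError("DP must fit in 9 bits")
    if not 0 <= offset < 128:
        raise ValueError("offset must fit in 7 bits")
    return (dp << 7) | offset


print(direct_address(4, 5))      # page 4, word 5 within the page
print(direct_address(511, 127))  # last word of the last page
```

The highest address, 65535, confirms that 512 pages of 128 words cover the full 64K data memory space.
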

Table 4 Addressing Modes of the TMS320C25

| Addressing Mode | Operation |
| :---: | :---: |
| OP A | direct addressing |
| OP * (,NARP) | indirect; no change to AR. |
| OP *+ (,NARP) | indirect; current AR is incremented. |
| OP *- (,NARP) | indirect; current AR is decremented. |
| OP *0+ (,NARP) | indirect; AR0 is added to current AR. |
| OP *0- (,NARP) | indirect; AR0 is subtracted from current AR. |
| OP *BR0+ (,NARP) | indirect; AR0 is added to current AR (with reverse carry propagation). |
| OP *BR0- (,NARP) | indirect; AR0 is subtracted from current AR (with reverse carry propagation). |

Note: The optional NARP field specifies a new value of the ARP.
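The addition with reverse carry propagation listed in Table 4 can be modeled in Python to show why it produces the bit-reversed order needed by a radix-2 FFT. The sketch below makes two assumptions that are not stated in the text: AR0 is loaded with half the FFT size, and only the index bits of the auxiliary register are modeled (no base address). All function names are hypothetical.

```python
def reverse_bits(value, width):
    """Reverse the lowest `width` bits of value."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def reverse_carry_add(ar, ar0, width):
    """Add ar0 to ar with the carry propagating from MSB toward LSB.

    Implemented by reversing both operands, adding normally, and
    reversing the result.
    """
    mask = (1 << width) - 1
    total = (reverse_bits(ar, width) + reverse_bits(ar0, width)) & mask
    return reverse_bits(total, width)


def bit_reversed_sequence(fft_size):
    """Index sequence produced by repeated reverse-carry addition of N/2."""
    width = fft_size.bit_length() - 1  # number of index bits for N = 2**width
    ar0 = fft_size // 2                # assumed AR0 value
    ar = 0
    order = []
    for _ in range(fft_size):
        order.append(ar)
        ar = reverse_carry_add(ar, ar0, width)
    return order


print(bit_reversed_sequence(8))
```

For an 8-point transform the sequence visits the data points in the order 0, 4, 2, 6, 1, 5, 3, 7, which is the bit-reversed counterpart of 0 through 7.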

## TMS320C25 System Configurations

The flexibility of the TMS320C25 allows system configurations to satisfy a wide range of application requirements [16]. The TMS320C25 can be used in the following configurations:

- a stand-alone system (a single processor using 4 K words of on-chip ROM and 544 words of on-chip RAM),
- parallel multiprocessing systems with shared global data memory, or
- host/peripheral coprocessing using interface control signals.

A minimal processing system is shown in Fig. 6 using external data RAM and PROM/EPROM. Parallel multiprocessing and host/peripheral coprocessing systems can be designed by taking advantage of the TMS320C25's direct memory access and global memory configuration capabilities.

In some digital processing tasks, the algorithm being implemented can be divided into sections with a distinct processor dedicated to each section. In this case, the first and second processors may share global data memory, as well as the second and third, the third and fourth, etc. Arbitration logic may be required to determine which section of the algorithm is executing and which processor has access to the global memory. With multiple processors dedicated to distinct sections of the algorithm, throughput can be increased via pipelined execution. The TMS320C25 is capable of allocating up to 32 K words of data memory as global memory for multiprocessing applications.


## The Third Generation of the TMS320 Family

The TMS320C30 [26]-[27] is Texas Instruments' third-generation member of the TMS320 family of compatible digital signal processors. With a computational rate of 33 MFLOPS (million floating-point operations per second), the TMS320C30 far exceeds the performance of any programmable DSP available today. Total system performance has been maximized through internal parallelism, more than twenty-four thousand bytes of on-chip memory, single-cycle floating-point operations, and concurrent I/O. The total system cost is minimized with on-chip memory and on-chip peripherals such as timers and serial ports. Finally, the user's system design time is dramatically reduced with the availability of the floating-point operations, general-purpose instructions and features, and quality development tools.

The TMS320C30 provides the user with a level of performance that, at one time, was the exclusive domain of supercomputers. The strong architectural emphasis of providing a low-cost system solution to demanding arithmetic algorithms has resulted in the architecture shown in Fig. 7.

The key features of the TMS320C30 [26], [27] are as follows:

- 60-ns single-cycle execution time, $1-\mu \mathrm{m}$ CMOS.
- Two $1 \mathrm{~K} \times 32$-bit single-cycle dual-access RAM blocks.
- One $4 \mathrm{~K} \times 32$-bit single-cycle dual-access ROM block.
- $64 \times 32$-bit instruction cache.
- 32-bit instruction and data words, 24-bit addresses.
- 32/40-bit floating-point and integer multiplier.
- 32/40-bit floating-point, integer, and logical ALU.
- 32-bit barrel shifter.
- Eight extended-precision registers.
- Two address-generators with eight auxiliary registers.
- On-chip Direct Memory Access (DMA) controller for concurrent I/O and CPU operation.
- Peripheral bus and modules for easy customization.
- High-level language support.
- Interlocked instructions for multiprocessing support.
- Zero overhead loops and single-cycle branches.

The architecture of the TMS320C30 is targeted at 60-ns and faster cycle times. To achieve such high-performance goals while still providing low-cost system solutions, the TMS320C30 is designed using Texas Instruments' state-of-the-art $1-\mu \mathrm{m}$ CMOS process. The TMS320C30's high system performance is achieved through a high degree of parallelism, the accuracy and precision of its floating-point units, its on-chip DMA controller that supports concurrent I/O, and its general-purpose features. At the heart of the architecture is the Central Processing Unit (CPU).

Fig. 6. Minimal processing system with external data RAM and PROM/EPROM.

Fig. 7. TMS320C30 functional block diagram.

## The CPU

The CPU consists of the following elements: floating-point/integer multiplier; ALU for performing floating-point, integer, and logical operations; auxiliary register arithmetic units; supporting register file; and associated buses. The multiplier of the CPU performs floating-point and integer multiplication. When performing floating-point multiplication, the inputs are 32-bit floating-point numbers, and the result is a 40-bit floating-point number. When performing integer multiplication, the input data is 24 bits and yields a 32-bit result. The ALU performs 32-bit integer, 32-bit logical, and 40-bit floating-point operations. Results of the multiplier and the ALU are always maintained in 32-bit integer or 40-bit floating-point formats. The TMS320C30 has the ability to perform, in a single cycle, parallel multiplies and adds (subtracts) on integer or floating-point data. It is this ability to perform floating-point multiplies and adds (subtracts) in a single cycle that gives the TMS320C30 its peak computational rate of 33 MFLOPS.
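As a rough check of the peak figure, assuming the 60-ns cycle time listed among the key features and counting one multiply plus one add (two floating-point operations) per cycle:

$$
\frac{1}{60\ \mathrm{ns}} \approx 16.7 \times 10^{6}\ \text{cycles/s}, \qquad 2\ \frac{\text{operations}}{\text{cycle}} \times 16.7 \times 10^{6}\ \frac{\text{cycles}}{\text{s}} \approx 33.3\ \text{MFLOPS}
$$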

Floating-point operations provide the user with a convenient and virtually trouble-free means of performing computations while maintaining accuracy and precision. The TMS320C30 implementation of floating-point arithmetic allows for floating-point operations at integer speeds. The floating-point capability allows the user to ignore, to a large extent, problems with overflow, operand alignment, and other burdensome tasks common to integer operations.

The register file contains 28 registers, which may be operated upon by the multiplier and ALU. The first eight of these registers (R0-R7) are the extended-precision registers, which support operations on 40-bit floating-point numbers and 32-bit integers.

The next eight registers (AR0-AR7) are the auxiliary registers, whose primary function is related to the generation of addresses. However, they also may be used as general-purpose 32-bit registers. Two auxiliary register arithmetic units (ARAU0 and ARAU1) can generate two addresses in a single cycle. The ARAUs operate in parallel with the multiplier and ALU. They support addressing with displacements, index registers (IR0 and IR1), and circular and bit-reversed addressing.
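Circular addressing is easiest to picture as an index that wraps around a fixed-length buffer. The Python sketch below illustrates only that idea; the class and method names are hypothetical, and it does not model the TMS320C30's actual register usage or any alignment rules the hardware may impose.

```python
class CircularBuffer:
    """Fixed-length delay line addressed modulo its length."""

    def __init__(self, length):
        self.data = [0.0] * length
        self.index = 0                      # next position to overwrite

    def push(self, sample):
        """Store a sample over the oldest entry and advance with wraparound."""
        self.data[self.index] = sample
        self.index = (self.index + 1) % len(self.data)

    def newest_first(self):
        """Return the contents ordered from the newest sample to the oldest."""
        n = len(self.data)
        return [self.data[(self.index - 1 - k) % n] for k in range(n)]
```

Because the pointer wraps, a new sample replaces the oldest one without moving any other data, which is why this mode suits filter delay lines.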

The remaining registers support a variety of system functions: addressing, stack management, processor status, block repeat, and interrupts.

## Data Organization

Two integer formats are supported on the TMS320C30: a 16-bit format used for immediate integer operands and a 32-bit single-precision integer format.

Two unsigned-integer formats are available: a 16-bit format for immediate unsigned-integer operands and a 32-bit single-precision unsigned-integer format.

The three floating-point formats are assumed to be normalized, thus providing an extra bit of precision. The first is a 16-bit short floating-point format for immediate floating-point operands, which consists of a 4-bit exponent, 1 sign bit, and an 11-bit fraction. The second is a single-precision format consisting of an 8-bit exponent, 1 sign bit, and a 23-bit fraction. The third is an extended-precision format consisting of an 8-bit exponent, 1 sign bit, and a 31-bit fraction.
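To make the single-precision layout concrete, the following Python sketch decodes a 32-bit word into a value. The text gives only the field widths; the field order (exponent in the top 8 bits, then the sign bit, then the fraction), the two's-complement exponent and mantissa, and the use of exponent -128 for zero are assumptions made here for illustration and should be checked against the device documentation. The function name is hypothetical.

```python
def decode_c30_single(word):
    """Decode a 32-bit word under the assumed layout e[31:24] | s[23] | f[22:0]."""
    exponent = (word >> 24) & 0xFF
    if exponent >= 0x80:              # assumed two's-complement 8-bit exponent
        exponent -= 0x100
    if exponent == -128:              # assumed reserved encoding for zero
        return 0.0
    sign = (word >> 23) & 1
    fraction = (word & 0x7FFFFF) / float(1 << 23)
    # Implied leading bit from normalization: 01.f when positive,
    # 10.f (two's complement, i.e. -2 + f) when negative.
    mantissa = (-2.0 + fraction) if sign else (1.0 + fraction)
    return mantissa * 2.0 ** exponent
```

Under these assumptions the all-zero word decodes to 1.0, because the normalized mantissa carries an implied leading bit; that implied bit is the extra bit of precision mentioned above.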

The total memory space of the TMS320C30 is 16 M (million) $\times 32$ bits. A machine word is 32 bits, and all addressing is performed by word. Program, data, and I/O space are contained within the 16 M-word address space.

RAM blocks 0 and 1 are each $1 \mathrm{~K} \times 32$ bits. The ROM block is $4 \mathrm{~K} \times 32$ bits. Each RAM block and ROM block is capable of supporting two data accesses in a single cycle. For example, the user may, in a single cycle, access a program word and a data word from the ROM block.

The separate program, data, and DMA buses allow for parallel program fetches, data reads and writes, and DMA operations. Management of memory resources and busing is handled by the memory controller. For example, a typical mode of operation could involve a program fetch from the on-chip program cache, two data fetches from RAM block 0, and the DMA moving data from off-chip memory to RAM block 1. All of this can be done in parallel with no impact on the performance of the CPU.

A $64 \times 32$-bit instruction cache allows for maximum system performance with minimal system cost. The instruction cache stores often repeated sections of code. The code may then be fetched from the cache, thus greatly reducing the number of off-chip accesses necessary. This allows for code to be stored off-chip in slower, lower cost memories. Also, the external buses are freed, thus allowing for their use by the DMA or other devices in the system.

## DMA

The TMS320C30 possesses an on-chip Direct Memory Access (DMA) controller. The DMA controller is able to perform reads from and writes to any location in the memory map without interfering with the operation of the CPU. As a consequence, it is possible to interface the TMS320C30 to slow external memories and peripherals (A/Ds, serial ports, etc.) without affecting the computational throughput of the CPU. The result is improved system performance and decreased system cost.

The DMA controller contains its own address generators, source and destination registers, and transfer counter. Dedicated DMA address and data buses allow for operation with no conflicts between the CPU and DMA controller.

The DMA controller responds to interrupts in a similar way to the CPU. This ability allows the DMA to transfer data based upon the interrupts received. Thus I/O transfers that would normally be performed by the CPU may instead be performed by the DMA. Again, the CPU may continue processing data while the DMA receives or transmits data.

## Peripherals

All peripheral modules are manipulated through memory-mapped registers located on a dedicated peripheral bus. This peripheral bus allows for the straightforward addition, removal, and creation of peripheral modules. The initial TMS320C30 peripheral library will include timers and serial ports. The peripheral library concept allows Texas Instruments to create new modules to serve a wide variety of applications. For example, the configuration of the TMS320C30 in Fig. 7 includes two timers and two serial ports.

Timers: The two timer modules are general-purpose timer/event counters, with two signaling modes and internal or external clocking.

Available to each timer is an I/O pin that can be used as an input clock to the timer or as an output signal driven by the timer. The pin may also be configured as a general-purpose I/O pin.

Serial Ports: The two serial ports are modular and totally independent. Each serial port can be configured to transfer 8, 16, 24, or 32 bits of data per frame. The clock for each serial port can originate either internally or externally. An internally generated divide-down clock is provided. The pins of the serial ports are configurable as general-purpose I/O pins. A special handshake mode allows TMS320C30s to communicate over their serial ports with guaranteed synchronization. The serial ports may also be configured to operate as timers.

## External Interfaces

The TMS320C30 provides two external interfaces: the parallel interface and the I/O interface. The parallel interface consists of a 32-bit data bus, a 24-bit address bus, and a set of control signals. The I/O interface consists of a 32-bit data bus, a 13-bit address bus, and a set of control signals. Both ports support an external ready signal for wait-state generation and the use of software-controlled wait states.

The TMS320C30 supports four external interrupts, a number of internal interrupts, and a nonmaskable external reset signal. Two dedicated, general-purpose, external I/O flags, XF0 and XF1, may be configured as input or output pins under software control. These pins are also used by the interlocked instructions to support multiprocessor communication.

## Pipelining In the TMS320C30

The operation of the TMS320C30 is controlled by five major functional units. The five major units and their functions are as follows:

- Fetch Unit (F) which controls the program counter updates and fetches of the instruction words from memory.
- Decode Unit (D) which decodes the instruction word and controls address generation.
- Read Unit (R) which controls the operand reads from memory.
- Execute Unit (E) which reads operands from the register file, performs the necessary operation, and writes results back to the register file and memory.
- DMA Channel (DMA) which reads and writes memory concurrently with CPU operation.

Each instruction is operated upon by four of these stages; namely, fetch, decode, read, and execute. To provide for maximum processor throughput, these units can perform in parallel, with each unit operating on a different instruction. The overlapping of the fetch, decode, read, and execute operations of different instructions is called pipelining. The DMA controller runs concurrently with these units. The pipelining of these operations is key to the high performance of the TMS320C30. The ability of the DMA to move data within the processor's memory space results in an even greater utilization of the CPU with fewer interruptions of the pipeline, which inevitably yields greater performance.

The pipeline control of the TMS320C30 allows for an extremely high-speed execution rate by allowing an effective rate of one execution per cycle. It also manages pipeline conflicts in a way that makes them transparent to the user.
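The overlap of the fetch, decode, read, and execute stages can be illustrated with a small Python model. It is a sketch of an ideal, conflict-free pipeline only: it assumes every instruction spends exactly one cycle in each stage, ignores the DMA channel and any stalls, and uses hypothetical function names.

```python
STAGES = ["F", "D", "R", "E"]   # fetch, decode, read, execute


def pipeline_schedule(n_instructions):
    """Return {cycle: {stage: instruction index}} for a stall-free pipeline."""
    schedule = {}
    for instr in range(n_instructions):
        for depth, stage in enumerate(STAGES):
            cycle = instr + depth           # each stage is one cycle later
            schedule.setdefault(cycle, {})[stage] = instr
    return schedule


def total_cycles(n_instructions):
    """Cycles needed to complete all instructions, including pipeline fill."""
    if n_instructions == 0:
        return 0
    return n_instructions + len(STAGES) - 1


if __name__ == "__main__":
    sched = pipeline_schedule(6)
    for cycle in sorted(sched):
        row = " ".join(f"{s}:{sched[cycle].get(s, '-')}" for s in STAGES)
        print(f"cycle {cycle}: {row}")
```

Once the pipeline is full, one instruction completes its execute stage in every cycle, which is the effective rate of one execution per cycle described above.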

While the pipelining of the different phases of an instruction is key to the performance of the TMS320C30, the designers felt it essential to avoid pipelining the operation of the multiplier or ALU. By ruling out this additional level of pipelining, it was possible to greatly improve the processor's usability.

## Instructions

The TMS320C30 instruction set is exceptionally well suited to digital signal processing and other numerically intensive applications. The TMS320C30 also possesses a full complement of general-purpose instructions. The instruction set is organized into the following groups:

- load and store instructions;
- two-operand arithmetic instructions;
- two-operand logical instructions;
- three-operand arithmetic instructions;
- three-operand logic instructions;
- parallel operation instructions;
- arithmetic/logical instruction with store instructions;
- program control instructions;
- interlocked operations instructions.

The load and store instructions perform the movement of a single word to and from the registers and memory. Included is the ability to load a register conditionally. This operation is particularly useful for locating the maximum and minimum of a set of data.

The two-operand arithmetic and logical instructions consist of a complete set of arithmetic instructions. They have two operands: src and dst for source and destination, respectively. The src operand may come from memory, a register, or be part of the instruction word. The dst operand is always a register. This portion of the instruction set includes floating-point, integer, and logical operations, support of multiprecision arithmetic, and 32-bit arithmetic and logical shifts.

The three-operand arithmetic and logical instructions are a subset of the two-operand arithmetic and logical instructions. They have three operands: two src operands and a dst operand. The src operands may come from memory or a register. The dst operand is always a register. These instructions allow for the reading of two operands from memory and/or the CPU register file in a single cycle.

The parallel operation instructions allow for a high degree of parallelism. They support very flexible, parallel floating-point and integer multiplies and adds. They also include the ability to load two registers in parallel.

The arithmetic/logical and store instructions support a high degree of parallelism, thus complementing the parallel operation instructions. They allow for the performance of an arithmetic or logical instruction between a register and an operand read from memory, in parallel with the storing of a register to memory. They also provide for extremely rapid operations on blocks of memory.

The program control instructions consist of all those operations that affect the program flow. This section of the instruction set includes a set of flexible and powerful constructs that allow for software control of the program flow. These fall into two main types: repeat modes and branching.

For many algorithms, there is an inner kernel of code where most of the execution time is spent. The repeat modes of the TMS320C30 allow for the implementation of zero overhead looping. Using the repeat modes allows these time-critical sections of code to be executed in the shortest possible time. The instructions supporting the repeat modes are RPTB (repeat a block of code) and RPTS (repeat a single instruction). Through the use of the dedicated stack pointer, block repeats (RPTBs) may be nested.

The branching capabilities of the TMS320C30 include two main subsets: standard and delayed branches. Standard branches, as in any pipelined machine that comprehends them, empty the pipeline to guarantee correct management of the program counter. This results in a branch requiring, in the case of the TMS320C30, four cycles to execute. Included in this subset are calls and returns. A standard branch (BR) is illustrated below.

| Label | Instruction | Operand | Comment |
| :--- | :--- | :--- | :--- |
|  | BR | THREE | ; standard branch. |
|  | MPYF |  | ; not executed. |
|  | ADDF |  | ; not executed. |
|  | SUBF |  | ; not executed. |
|  | AND |  | ; not executed. |
|  | $\vdots$ |  |  |
| THREE | MPYF |  | ; fetched 3 cycles after BR is fetched. |

Delayed branches do not empty the pipe, but rather, guarantee that the next three instructions will be fetched before the program counter is modified by the branch. The result is a branch that only requires a single cycle. Every delayed branch has a standard branch counterpart. A delayed branch (BRD) is illustrated below.

| Label | Instruction | Operand | Comment |
| :--- | :--- | :--- | :--- |
|  | BRD | THREE | ; delayed branch. |
|  | MPYF |  | ; executed. |
|  | ADDF |  | ; executed. |
|  | SUBF |  | ; executed. |
|  | AND |  | ; not executed. |
|  | $\vdots$ |  |  |
| THREE | MPYF |  | ; fetched after SUBF is fetched. |

The combination of the repeat modes, standard branches, and delayed branches provides the user with a set of programming constructs which are well suited to a wide range of performance requirements.
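The difference between the two branch types can be put in numbers with a deliberately simple cost model in Python. The four-cycle and one-cycle branch costs come from the text; everything else is an assumption for illustration: each body instruction is taken to cost one cycle, the three instructions after a delayed branch are taken to be useful loop work, and short loops are padded to three instructions. The function name is hypothetical.

```python
STANDARD_BRANCH_CYCLES = 4   # from the text: the pipeline is emptied
DELAYED_BRANCH_CYCLES = 1    # from the text: the next three instructions still execute
DELAY_SLOTS = 3


def loop_cycles(body_instructions, iterations, delayed):
    """Cycle count for a loop closed by a branch, one cycle per body instruction."""
    branch = DELAYED_BRANCH_CYCLES if delayed else STANDARD_BRANCH_CYCLES
    if delayed and body_instructions < DELAY_SLOTS:
        # Model assumption: unfilled delay slots are padded with no-ops.
        body_instructions = DELAY_SLOTS
    return iterations * (body_instructions + branch)
```

In this model a five-instruction loop run 100 times takes 900 cycles with a standard branch and 600 with a delayed branch, while a one-instruction body gains nothing because the delay slots cannot be filled with useful work.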

The program control instructions also include conditional calls and returns. The decrement and branch conditionally instruction allows for efficient loop control by combining the comparison of a loop counter to zero with the check of condition flags, e.g., floating-point overflow. The condition codes available include unsigned and signed comparisons, comparisons to zero, and comparisons based upon the status of individual condition flags. These conditions may be used with any of the conditional instructions.

The interlocked operations instructions support multiprocessor communication. Through the use of external signals, these instructions allow for powerful synchronization mechanisms, such as semaphores, to be implemented. The interlocked operations use the two external flag pins, XF0 and XF1. XF0 signals an interlocked-operation request and XF1 acts as an acknowledge signal for the requested interlocked operation. The interlocked operations include interlocked loads and stores. When an interlocked operation is performed, the external request and acknowledge signals can be used to arbitrate between multiple processors sharing memory, semaphores, or counters.

## Development and Support Tools

Digital signal processors are essentially application-specific microprocessors (or microcomputers). Like any other microprocessor, no matter how impressive the performance of the processor or the ease of interfacing, without good development tools and technical support, it is very difficult to design it into the system. In developing an application, problems are encountered and questions are asked. Oftentimes the tools and vendor support provided to the designer are the difference between the success and failure of the project.
The TMS320 family has a wide range of development tools available [25]. These tools range from very inexpensive evaluation modules for application evaluation and benchmarking purposes, assembler/linkers, and software simulators, to full-capability hardware emulators. A brief summary of these support tools is provided in the succeeding subsections.

## Software Tools

Assembler/linkers and software simulators are available on PC and VAX for users to develop and debug TMS320 DSP algorithms. Their features are described as follows:

Assembler/Linker: The Macro Assembler translates assembly language source code into executable object code. The Linker permits a program to be designed and implemented in separate modules that will later be linked together to form the complete program.

Simulator: The Simulator simulates operations of the device in software to allow program verification and debug. The simulator uses the object code produced by the Macro Assembler/Linker.

C Compiler: The C Compiler is a full implementation of the standard Kernighan and Ritchie C as defined in The C Programming Language [28]. The compiler supports the insertion of assembly language code into the C source code. The user may also write functions in assembly language, and then call these functions from the C source. Similarly, C functions may be called from assembly language. Variables defined in the C source may be accessed in assembly language modules and vice versa. The result is a compiler that allows the user to tailor the amount of high-level programming versus the amount of assembly language according to the application. The C compiler is supported on the TMS320C25 and the TMS320C30.

## Hardware Tools

Evaluation modules and emulation tools are available for in-circuit emulation and hardware program debugging for developing and testing DSP algorithms in a real product environment.

Evaluation Module (EVM): The EVM is a stand-alone single-board module that contains all of the tools necessary to evaluate the device as well as provide basic in-circuit emulation. The EVM contains a debug monitor, editor, assembler, reverse assembler, and software communications to a host computer or a line printer.

SoftWare Development System (SWDS): The SoftWare Development System is a PC plug-in card with functionality similar to that of the EVM.

Emulator (XDS): The eXtended Development System provides full-speed in-circuit emulation with real-time hardware breakpoint/trace and program execution capability from target memory. By setting breakpoints based on internal conditions or external events, execution of the program can be suspended and the XDS placed into the debug mode. In the debug mode, all registers and memory locations can be inspected and modified. Full-trace capabilities at full speed and a reverse assembler that translates machine code back into assembly instructions are included. The XDS system is designed to interface with either a terminal or a host computer. In addition to the above design tools, other development support is available [25].

## Applications

The TMS320 is designed for real-time DSP and other computation-intensive applications [4]. In these applications, the TMS320 provides an excellent means for executing signal processing algorithms such as fast Fourier transforms (FFTs), digital filters, frequency synthesis, correlation, and convolution. The TMS320 also provides for more general-purpose functions via bit-manipulation instructions, block data move capabilities, large program and data memory address spaces, and flexible memory mapping.

To introduce applications performed by the TMS320, digital filters will be used as examples. The remaining portion of this section will briefly cover applications, and conclude by showing some benchmarks.

## Digital Filtering

As discussed several times in this paper, the FIR filter is simply the sum of products in a sampled data system. This was shown in (1). A simple implementation of the FIR filter uses the MACD instruction (multiply/accumulate and data move) for each filter tap, with the RPT/RPTK instruction repeating the MACD for each filter tap. As we saw earlier, a 256-tap FIR filter can be implemented by using the following two instructions:

```
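RPTK 255           ; repeat the next instruction 256 times
MACD COEFFP,*-     ; multiply/accumulate with data move, one tap per repetition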
```
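What the repeated multiply/accumulate with data move computes can be modeled in Python as follows. This is a behavioral sketch in floating point, not a cycle-accurate or bit-accurate model of the TMS320C25; the function name and the list-based delay line are assumptions made for illustration.

```python
def fir_macd(delay_line, coeffs, new_sample):
    """Compute one FIR output and age the delay line by one sample.

    delay_line[k] holds x(n-k) and coeffs[k] is the weight applied to it.
    The delay line is updated in place.
    """
    if len(delay_line) != len(coeffs):
        raise ValueError("delay line and coefficients must have equal length")
    delay_line[0] = new_sample
    acc = 0.0
    # Walk from the oldest sample to the newest, as a decrementing pointer would.
    for k in range(len(coeffs) - 1, -1, -1):
        acc += coeffs[k] * delay_line[k]
        if k + 1 < len(delay_line):
            delay_line[k + 1] = delay_line[k]   # the "data move" part
    return acc
```

Processing the samples from oldest to newest lets each value be copied one position along after it has been used, so the delay line is ready for the next input with no separate shifting pass.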

In this example, the coefficients may be stored anywhere in program memory (reconfigurable on-chip RAM, on-chip ROM, or external memories). When the coefficients are stored in on-chip ROM or externally, the entire on-chip data RAM may be used to store the sample sequence. This allows filters of up to 512 taps to be implemented. Execution of the filter will be at full speed, or 100 ns per tap, as long as the memory supports full-speed execution (either on-chip RAM or high-speed external RAM).

Up to this point, it has been assumed that the filter coefficients are fixed from sample to sample. If the coefficients are adapted or updated with time, such as in adaptive filters for echo cancellation [4], [20], then the DSP algorithm requires a greater computational capacity from the processor. The requirement to adapt each of the coefficients, usually with each sample, is accomplished by three instructions (MPYA or MPYS, ZALR, and SACH) on the TMS320C25 [16]. A means of adapting the coefficients is the least-mean-square (LMS) algorithm given by the following equation:

$$
b_{k}(i+1)=b_{k}(i)+2 B[e(i) * x(i-k)]
$$

where $b_{k}(i+1)$ is the weighting coefficient for the next sample period, $b_{k}(i)$ is the weighting coefficient for the present sample period, $B$ is the gain factor or adaptation step size, $e(i)$ is the error function, and $x(i-k)$ is the input of the filter.
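The update equation translates almost directly into code. The Python sketch below applies it to every tap and, as a usage example, uses it to identify a short unknown filter. The function names are hypothetical, and the sign convention for the error (reference minus filter output) is an assumption chosen so that the update with a positive step size converges.

```python
import random


def lms_update(coeffs, samples, error, step):
    """Return b_k(i+1) = b_k(i) + 2*B*e(i)*x(i-k) for every tap k."""
    factor = 2.0 * step * error            # constant across all taps at time i
    return [b + factor * x for b, x in zip(coeffs, samples)]


def lms_identify(unknown, inputs, step):
    """Adapt a filter until it matches the noise-free response of `unknown`."""
    n = len(unknown)
    coeffs = [0.0] * n
    history = [0.0] * n                    # history[k] holds x(i-k)
    for x in inputs:
        history = [x] + history[:-1]
        reference = sum(h * s for h, s in zip(unknown, history))
        output = sum(b * s for b, s in zip(coeffs, history))
        error = reference - output         # assumed sign convention
        coeffs = lms_update(coeffs, history, error, step)
    return coeffs


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
    print(lms_identify([0.5, -0.3, 0.1], data, step=0.05))
```

Note that `factor` is computed once per sample period and reused for all taps, which is why the per-coefficient work reduces to a single multiply and add.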

In an adaptive filter, it is important to update the coefficients $b_{k}(i)$ in order to minimize the error function $e(i)$, which is the difference between the output of the filter and a reference signal. Quantization errors are critical to the performance of the filter when updating the coefficients and can be minimized if the result is obtained by rounding rather than truncating. For each coefficient in the filter at a given point in time, the factor $2 B e(i)$ is a constant. This factor can then be computed once and stored in the T register for each of the updates. Thus the computational requirement has become one multiply/accumulate plus rounding. Without the new instructions, the adaptation of each coefficient is five instructions corresponding to five clock cycles. This is shown in the following instruction sequence:

| Instruction | Operand | Comment |
| :--- | :--- | :--- |
| LRLK | AR2,COEFFD | ; LOAD ADDRESS OF COEFFICIENTS. |
| LRLK | AR3,LASTAP | ; LOAD ADDRESS OF DATA SAMPLES. |
| LARP | AR2 |  |
| LT | ERRF | ; errf $= 2 B e(i)$ |
| ZALH | *,AR3 | ; ACC $= b_k(i) \cdot 2^{16}$ |
| ADD | ONE,15 | ; ACC $= b_k(i) \cdot 2^{16} + 2^{15}$ |
| MPY | *-,AR2 |  |
| APAC |  | ; ACC $= b_k(i) \cdot 2^{16} + \mathrm{errf} \cdot x(i-k) + 2^{15}$ |
| SACH | *+ | ; SAVE $b_k(i+1)$. |
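The role of the added $2^{15}$ term can be seen in a small integer model of the accumulator arithmetic written out in the comments above. This Python sketch follows those comments literally, treating all quantities as plain integers; it ignores 16-bit overflow, saturation, and any product-shift settings of the device, and the function names are hypothetical.

```python
def update_truncate(b, errf, x):
    """Keep the high word of b*2**16 + errf*x with no rounding term."""
    acc = (b << 16) + errf * x
    return acc >> 16                 # arithmetic shift, like storing the high word


def update_round(b, errf, x):
    """Same update with 2**15 added first, so the high word is rounded."""
    acc = (b << 16) + (1 << 15) + errf * x
    return acc >> 16
```

With `b = 100`, `errf = 3`, and `x = 20000`, the exact result is about 100.92: truncation stores 100, while the rounded version stores 101. Over many updates, truncation biases every coefficient downward, which is the quantization error the rounding step is meant to reduce.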

When the MPYA and ZALR instructions are used, the adaptation reduces to three instructions corresponding to three clock cycles, as shown in the following instruction sequence. Note that the processing order has been slightly changed to incorporate the use of the MPYA instruction. This is due to the fact that the accumulation performed by the MPYA is the accumulation of the previous product.


The adaptive filter coefficient update can be simplified further using the TMS320C30 [27], as shown below. The first instruction defines the number of times to repeat the kernel. The second instruction is the repeat-block instruction (RPTB). The RPTB instruction allows the iterations of the kernel to be performed with zero overhead looping. The kernel assumes that the error term is stored in register R0. It is important to note that all of the calculations are performed in floating-point arithmetic. The MPYF3 is a three-operand floating-point multiply of the input sample $x(i-k)$, which is stored in memory, by the error term errf. The next step is a three-operand floating-point add (ADDF3) of the change in the filter tap to the filter tap, in parallel with the store (STF) of the previously updated filter tap. That is, the store (STF) is performed in parallel with the ADDF3. Thus the number of cycles for a floating-point adaptation is only two.

| Label | Instruction | Operands | Comment |
| :--- | :--- | :--- | :--- |
|  | LDI | N,RC | ; load length N into block repeat counter |
|  | RPTB | adapt | ; repeat the adaptation loop N+1 times |
| adapt: | MPYF3 | *++AR0(1),R0,R1 | ; errf $\cdot\, x(i-k) \rightarrow$ R1 |
|  | ADDF3 | *+AR1(1),R1,R2 | ; $b(k, i) + \mathrm{errf} \cdot x(i-k) \rightarrow$ R2 |
| \|\| | STF | R2,*AR1++(1) | ; R2 $\rightarrow b(k-1, i)$ |

Since we have discussed the application of digital filtering, we can now describe several applications in the areas of telecommunications, graphics/image processing, high-speed control, instrumentation, and numeric processing, and then conclude this section with several benchmarks. If more detail is needed on any of these applications, the reader is referred to [4].

## Telecommunications Applications

Many aspects of the telecommunications network can take advantage of the TMS320. As telecommunications evolves toward an all-digital network, DSP will be used even more widely [23]. Several typical uses of the TMS320 are discussed.

Echo Canceler: In echo cancellation [4], [20], an adaptive FIR filter performs the modeling routine and signal modifications to adaptively cancel the echo caused by the impedance mismatches in the telephone transmission lines.

For this application, the large on-chip RAM of 544 words and on-chip ROM of 4 K words on the TMS320C25 allow a 256-tap adaptive filter (32-ms echo cancellation) to be executed in a single chip without external data or program memory.
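The 32-ms figure and the feasibility of the single-chip solution can be checked with simple arithmetic. The 8-kHz telephone sampling rate is an assumption not stated in the text; the 100-ns cycle, the one cycle per tap for filtering, and the three cycles per tap for adaptation are taken from the digital filtering discussion, and loop overhead is ignored.

$$
\frac{256\ \text{taps}}{8000\ \text{samples/s}} = 32\ \text{ms}, \qquad \frac{125\ \mu\text{s}}{100\ \text{ns}} = 1250\ \text{cycles per sample}, \qquad 256 \times (1 + 3) = 1024\ \text{cycles}
$$

Under these assumptions, filtering and adapting all 256 taps fits within one sample period with roughly 200 cycles to spare.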

High-Speed Modems: The TMS320 can perform numerous functions such as modulation/demodulation, adaptive equalization, and echo cancellation [21], [22]. For lower speed modems, such as Bell 212A and V.22 bis modems, the TMS320C17 provides the most cost-effective single-chip solution to these applications. For higher speed modems, such as the V.32, requiring more processing power and multiprocessing capabilities, the TMS320C25 and TMS320C30 are the designer's choice.

Voice Coding: Voice-coding techniques [3], [4], such as full-duplex 32-kbit/s ADPCM (CCITT G.721), CVSD, 16-kbit/s subband coders, and LPC, are frequently used in voice transmission and storage. Arithmetic speed, normalization, and the bit-manipulation capability of the TMS320 provide for implementation of these functions, usually in a single chip. For example, the TMS320C17 can be used as a single-chip ADPCM [4], subband [4], or LPC [4] coder. An application of voice coding is an ADPCM transcoder implemented in half-duplex on a single TMS320C17 or full-duplex on a TMS320C25 for telecommunication multiplexing applications. Another example is a secure-voice communication system, requiring voice coding, as well as data encryption and transmission over a public-switched network via a modem; the TMS320C25 offers an ideal solution.

## Graphics/Image Processing Applications

In graphics and image processing applications [4], the ability to interface with a host processor is important. The multiprocessor interfaces of both the TMS320C30 and the TMS320C25 enable them to be used in a variety of host/coprocessor configurations [4]. Graphics and image processing applications can use the large directly addressable external data space and global memory capability to allow graphical images in memory to be shared with a host processor, thus minimizing unnecessary data transfers. The indexed indirect addressing modes allow matrices to be processed row-by-row when performing matrix multiplication for three-dimensional image rotations, translations, and scaling.

The TMS320C30 has a number of features that support graphics and image processing extremely well. The floating-point capabilities allow for extremely precise computation of perspective transformations. They also support more sophisticated algorithms such as shading and hidden-line removal, operations that are computationally intensive.

The large address space allows for straightforward addressing of large images or displays. The flexible addressing registers, coupled with the integer multiply, support powerful addressing of multiple-dimensional arrays. Vector-oriented instructions allow the user to efficiently manipulate large blocks of memory. Finally, the on-chip DMA controller allows the user to easily overlap the processing of data with its I/O.

## High-Speed Control

High-speed control applications [4], [24] use the TMS320C17 and TMS320C25 general-purpose features for bit-test and logical operations, timing synchronization, and high data-transfer rate (ten million 16-bit words per second). Both devices can be used in closed-loop systems for control signal conditioning, filtering, high-speed computing, and multichannel multiplexing. The following are two typical control applications:

Disk Control: Digital filtering in a closed-loop actuation mechanism positions the read/write heads over the disk surface. Supplemented with many general-purpose features, the TMS320 can replace costly bit-slice/custom/analog solutions to perform such tasks as compensation, filtering, fine/coarse tuning, and other signal conditioning algorithms.

Robotics: Digital signal processing and bit-manipulation power, coupled with host interface, allow the TMS320C25 to be useful in robotics control [24]. The TMS320C25 can replace both the digital controllers and analog signal processing hardware for communication to a central host processor and for the performance of numerically intensive control functions.

## Instrumentation

Instrumentation, such as spectrum analyzers and various high-speed/high-precision instruments, often requires a large data memory space and the high performance of a digital signal processor. The TMS320C25 and TMS320C30 are capable of performing very long-length FFTs and generating precision functions with minimal external hardware.

## Numeric Processing

Numeric and array processing applications benefit from TMS320 performance. High throughput resulting from features, such as a fast cycle time and an on-chip hardware multiplier, combined with multiprocessing capabilities and data memory expansion, provide for a low-cost, easy-to-use replacement for a typical bit-slice solution. The TMS320C30's floating-point precision, high throughput, and interface flexibility are excellent for this application.

## TMS320 Benchmarks

To complete the discussion on the applications that the TMS320 can perform, we will provide some benchmarks. The TMS320 has demonstrated impressive benchmarks in performing some of the common DSP routines and system applications. Table 5 shows typical TMS320 benchmarks [4].

Table 5 TMS320 Family Benchmarks

| DSP Routines/Applications | First Generation | Second Generation | Third Generation |
| :--- | :--- | :--- | :--- |
| FIR filter tap | 400 ns | 100 ns | 60 ns |
| 256-tap FIR sample rate | 9.25 kHz | 37 kHz | >60 kHz |
| LMS adaptive FIR filter tap | 700 ns | 400 ns | 180 ns |
| 256-tap adaptive FIR filter sample rate | 5.4 kHz | 9.5 kHz | >20 kHz |
| Bi-quad filter element (five multiplies) | 2 µs | 1 µs | 360 ns |
| Echo canceler (single chip) | 8 ms | 32 ms | >64 ms |
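
As a rough consistency check on Table 5, the per-tap times bound the achievable sample rate of an N-tap filter: if each tap costs t seconds, one output sample needs at least N·t seconds. The sketch below uses hypothetical names and only the FIR tap times quoted in the table; the gap between this bound and the tabulated rates is presumably per-sample overhead such as I/O and loop setup, which is an assumption here and not a figure from the source.

```python
def max_sample_rate_hz(taps, tap_time_ns):
    """Upper bound on the sample rate when every tap costs tap_time_ns."""
    return 1.0 / (taps * tap_time_ns * 1e-9)

# FIR tap times per generation, taken from Table 5 (nanoseconds).
fir_tap_ns = {"first": 400, "second": 100, "third": 60}

for generation, tap_ns in fir_tap_ns.items():
    bound_khz = max_sample_rate_hz(256, tap_ns) / 1e3
    print(f"{generation:>6} generation: {bound_khz:.2f} kHz upper bound")
```

For the first generation this gives a bound of about 9.77 kHz, slightly above the 9.25 kHz listed, which is the expected direction for a bound that ignores overhead.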

## Summary

This paper has discussed characteristics of digital signal processing and how these characteristics have influenced the architectural design of the Texas Instruments TMS320 family of digital signal processors. Three generations of the TMS320 family were covered, and their support tools necessary to develop end-applications were briefly reviewed. The paper concluded with an overview of digital signal processing applications using these devices.

## References

[1] L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1975.
[2] A. V. Oppenheim, Ed., Applications of Digital Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1978.
[3] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals. Englewood Cliffs, NJ: Prentice-Hall, 1978.
[4] K. Lin, Ed., Digital Signal Processing Applications with the TMS320 Family. Englewood Cliffs, NJ: Prentice-Hall, 1987.
[5] A. V. Oppenheim and R. W. Schafer, Digital Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1975.
[6] C. Burrus and T. Parks, DFT/FFT and Convolution Algorithms. New York, NY: Wiley, 1985.
[7] T. Parks and C. Burrus, Digital Filter Design. New York, NY: Wiley, 1987.
[8] J. Treichler, C. Johnson, and M. Larimore, A Practical Guide to Adaptive Filter Design. New York, NY: Wiley, 1987.
[9] P. Papamichalis, Practical Approaches to Speech Coding. Englewood Cliffs, NJ: Prentice-Hall, 1987.
[10] R. Morris, Digital Signal Processing Software. Ottawa, Ont., Canada: DSPS Inc., 1983.
[11] K. McDonough, E. Caudel, S. Magar, and A. Leigh, "Microcomputer with 32-bit arithmetic does high-precision number crunching," Electronics, pp. 105-110, Feb. 24, 1982.
[12] S. Magar, E. Caudel, and A. Leigh, "A Microcomputer with digital signal processing capability," in 1982 Int. Solid State Conf. Dig. Tech. Pap., pp. 32-33, 284, 285.
[13] First Generation TMS320 User's Guide. Houston, TX: Texas Instruments Inc., 1987.
[14] TMS320 First-Generation Digital Signal Processors Data Sheet. Houston, TX: Texas Instruments Inc., 1987.
[15] TMS32020 User's Guide. Houston, TX: Texas Instruments Inc., 1985.
[16] TMS320C25 User's Guide. Houston, TX: Texas Instruments Inc., 1986.
[17] TMS32011 User's Guide. Houston, TX: Texas Instruments Inc., 1985.
[18] H. Cragon, "The elements of single-chip microcomputer architecture," Comput. Mag., vol. 13, no. 10, pp. 27-41, Oct. 1980.
[19] S. Rosen, "Electronic computers: A historical survey," Comput. Surv., vol. 1, no. 1, Mar. 1969.
[20] M. Honig and D. Messerschmitt, Adaptive Filters. Dordrecht, The Netherlands: Kluwer, 1984.
[21] R. Lucky et al., Principles of Data Communication. New York, NY: McGraw-Hill, 1965.
[22] P. Van Gerwen et al., "Microprocessor implementation of high speed data modems," IEEE Trans. Commun., vol. COM-25, pp. 238-249, 1977.
[23] M. Bellanger, "New applications of digital signal processing in communications," IEEE ASSP Mag., pp. 6-11, July 1986.
[24] Y. Wang, M. Andrews, S. Butner, and G. Beni, "Robot-controller system," in Proc. Symp. on Incremental Motion Control Systems and Devices, pp. 17-26, June 1986.
[25] TMS320 Family Development Support Reference Guide. Houston, TX: Texas Instruments Inc., 1986.
[26] R. Simar, T. Leigh, P. Koeppen, J. Leach, J. Potts, and D. Blalock, "A 40 MFLOPS digital signal processor: The first supercomputer on a chip," in Proc. IEEE Int'l Conf. on Acoustics, Speech, and Signal Processing, Apr. 1987.
[27] TMS320C30 User's Guide. Houston, TX: Texas Instruments Inc., 1987.
[28] B. Kernighan and D. Ritchie, The C Programming Language. Englewood Cliffs, NJ: Prentice-Hall, 1978.

# The TMS320C30 Floating-Point Digital Signal Processor 

Panos Papamichalis<br>Ray Simar, Jr.<br>Digital Signal Processor Products-Semiconductor Group Texas Instruments

Digital signal processors have significantly impacted the way we bring real-time implementations of sophisticated DSP algorithms to life. What was once only a laboratory curiosity that required large computers or specialized, bulky, and expensive hardware is now incorporated into low-cost consumer products. The rapid advancement of programmable DSPs since their commercial introduction in the early 1980s lets us satisfy the needs of very demanding applications. Implementation of basic DSP functions, such as digital filters and fast Fourier transforms, has been integrated into advanced system solutions involving speech algorithms, image processing, and control applications. The variety of the applications increases every day as researchers, developers, and entrepreneurs discover new areas in which DSP devices can be used. At the same time, the design of new devices incorporates features that make such implementations easier.

The Texas Instruments family of TMS320 DSPs ${ }^{1}$ evolved with the expanding needs of the DSP applications and currently encompasses over 17 devices. The TMS320 family consists of three generations of devices. The first two generations are 16-bit, fixed-point-arithmetic devices while the third one, represented by the TMS320C30 and explained in detail here, is a 32-bit, floating-point device. Architecturally, the TMS320 family, like most DSP devices, relies on multiple Harvard buses. In the first two generations, we expanded the basic Harvard architecture to permit communication between the program and data spaces. In the third generation, we unified the two spaces to form an organization that encompasses the advantages of both the Harvard and the von Neumann architectures.

## Overview of the TMS320C30

The 320C30 is a fast processor (16.7 million instructions per second for an instruction cycle time of 60 nanoseconds) with a large memory space (16 million 32-bit words) and floating-point-arithmetic capabilities. This last feature is a major trend in new DSP devices, which was developed to answer the need for quicker, more accurate solutions to numerical problems. DSP algorithms, being very intensive numerically, cause a designer to worry about overflows and the accuracy of results. The introduction of floating-point capabilities eliminates these difficulties.
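
The headline figures follow directly from the cycle time and the address width. The short check below is only an illustration; it assumes one instruction per 60-ns cycle and the 24-bit address buses described later in this article.

```python
CYCLE_NS = 60        # instruction cycle time quoted in the text
ADDRESS_BITS = 24    # address bus width (described in the architecture section)

mips = 1e3 / CYCLE_NS            # million instructions per second
words = 2 ** ADDRESS_BITS        # addressable 32-bit words

print(f"{mips:.1f} MIPS, {words:,} words")   # 16.7 MIPS, 16,777,216 words
```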

In the 320C30, a chip design with 1-µm geometries produces instruction cycle times lower than those achieved with the fixed-point devices of the first two generations. In addition, the design produces a controlled increase in die size that results more from the extended on-chip memory spaces than from the floating-point capabilities.

The pipelined architecture of the 320C30 permits the higher throughput achieved by the device, as we explain later. Yet, programmers do not have to worry about the pipeline when writing the code. We can describe the design philosophy of the 320C30 (as well as all the other devices in the TMS320 family) as an "interlocked" or "hidden-pipeline" approach. When writing the program, programmers can assume that the result of any instruction will be available for the next instruction. Most of the instructions execute in one machine cycle. If a conflict arises between executing an instruction in one cycle and having the data available for the next instruction, the device automatically inserts the necessary delay to eliminate the conflict. Since this delay could result in loss of performance, we provide development tools that identify where such conflicts occur. With this data, programmers can rearrange and optimize code.

Many applications, such as graphics and image processing, are difficult to implement on the earlier DSP devices because they require a large memory space. To satisfy this need, the 320C30 provides a total memory space of 16 million 32-bit words, a space several orders of magnitude larger than that of the fixed-point devices. Furthermore, it contains significantly increased on-chip memory: six thousand 32-bit words of RAM and ROM. The desire to have a device capable of offering system-level solutions to the implemented algorithms guided the design decision to increase on-chip memory. In other words, the 320C30 attempts to offer the capability of implementing an algorithm with as little peripheral circuitry as possible.

Along the same lines, the 320C30 contains a peripheral bus on which on-chip peripherals can be attached using a memory-mapped approach. Currently available peripherals include two serial ports, two timers, and a DMA controller. The modularity of the design permits easy change, addition, or deletion of peripherals to accommodate different needs. For instance, if a µ-law-to-linear format converter or a gate array is more important than one of the timers for certain applications, a user can make the change without impacting the core of the device.

As the power of the DSP devices increases, so does the sophistication of the algorithms that are implemented. The implication is that constructing and debugging an algorithm at the assembly-language level becomes a more and more tedious task. To address that problem, we provide the 320C30 development tools, which include a high-level-language compiler and a DSP operating system. The extended memory space, the software stack, and the large on-chip register file also facilitate such a development. We've already introduced a C compiler and announced an Ada compiler. We expect compiler availability to change significantly the way DSP algorithms are ported to DSP devices. With these tools, programmers can develop the algorithms on large computers, requiring at the most only selective optimization when they incorporate the algorithm on the 320C30.

Here, we describe the 320C30 architecture in detail, discussing both the internal organization of the device and the external interfaces. We also explain the pipeline structure, addressing software-related issues and constructs, and examine the development tools and support. Finally, we present examples of applications.

## Architecture of the 320C30

Studying the architecture of the device helps in understanding how the different components contribute toward a high-throughput system. The interaction and the efficient use of the parts can contribute to very effective programming. Another very important aspect to consider is the system cost of the application. We designed the device to incorporate on-chip features that minimize the amount and the cost of external logic, thus leading to very compact and cost-effective solutions. These advantages become explicit when looking at the architecture in detail. The internal structure of the 320C30, as shown in Figure 1, consists of the

- on-chip memory and cache,
- CPU with register file,
- peripheral bus and peripherals, and
- interconnecting buses.

See Figure 2 for the die photograph. To interface with the external world, the 320C30 provides pins corresponding to

- two buses (primary and expansion),
- two serial ports and two timers,
- four external interrupt signals,
- two external flags, and
- hold and hold-acknowledge signals.

In addition, other pins exist for address and data strobes, power, and so on.

The overall architecture of the device is a Harvard type in the sense that internally and externally it has multiple buses to access program instructions and data and to perform DMA transfers. However, it also has a von Neumann flavor since the memory space is unified, and there is no separation of program and data spaces. As a result, the user can choose to locate programs and data at any desired location.

Some of the major features of the 320C30 are:

- a 60-ns cycle time that results in execution of over 16 million instructions per second (MIPS) and over 33 million floating-point operations per second (Mflops);
- 32-bit data buses and 24-bit address buses for a 16M-word overall memory space;
- dual-access, 4K × 32-bit on-chip ROM and 2K × 32-bit on-chip RAM;
- a 64 × 32-bit program cache;
- a 32-bit integer/40-bit floating-point multiplier and ALU;
- eight extended-precision registers, eight auxiliary registers, and 12 control and status registers;
- generally single-cycle instructions;
- integer, floating-point, and logical operations;
- two- and three-operand instructions;
- an on-chip DMA controller; and
- fabrication in 1-µm CMOS technology and packaging in a 180-pin package.

Figure 1. Block diagram of the TMS320C30 architecture.

Memory organization. The 320C30 provides 4K 32-bit words of on-chip ROM and 2K 32-bit words of on-chip RAM. The on-chip ROM is mapped into the first 4K of the overall memory map; it is accessed when the processor operates in the microcomputer mode. Location 0 of the memory map holds the reset vector, and adjacent locations hold other interrupt vectors. In microprocessor mode, the reset vector resides in external memory, and on-chip ROM is not accessed. The 2K on-chip RAM consists physically of two segments of 1K words each. These two segments of RAM are mapped into adjacent sections of the memory. Figure 3 on the next page shows the arrangement of the on-chip memory, as well as the cache, buses, and two external interfaces/buses, which we examine later.


Figure 2. Die photograph of the 320C30.


Figure 3. On-chip memory, cache, and buses.

The internal memory (both ROM and RAM) supports two accesses for reads and/or writes in one cycle. This key feature permits high throughput and ease of programming, since it makes possible three-operand instructions with two operands residing in the memory. Notice that, to support this feature, we include two buses dedicated to data addresses (DADDR1, DADDR2) and one bus to carry the data (DDATA). There are also separate program buses, PDATA and PADDR.

The address buses are 24 bits wide, indicating that the overall memory space is 16 million (32-bit) words. We believe this large space will facilitate implementation of algorithms in image processing applications that often require large amounts of memory. The unified memory space offers flexibility in placing program and data. But it also permits optimal use of the memory space as a trade-off between program and data.

An important addition to the architecture is the 64-word instruction cache. To reduce the overall system cost of applications, system designers often use slower (and cheaper) external memories, a tactic that could slow down the processor and degrade the performance. The instruction cache addresses this problem by storing, on chip, instructions that have been fetched previously. Its main advantage becomes obvious when loops must be executed. In this case, the first time the instructions are fetched, they are also stored in the cache. Any subsequent execution of the loop does not access external memory but fetches instructions from the cache, resulting in higher speed and making the external buses available for data transfers.

The cache is segmented into two sections of 32 words each that are transparent to users. A user can, however, control the operation of the cache by manipulating three control bits that are contained in the status register of the CPU. Each control bit is dedicated to a specific operation: cache enable/disable, cache freeze, and cache clear. When a cache miss occurs, that is, when the next instruction is not included in the cache, the instruction is brought in and also stored in the cache. The two cache sections are updated on a least recently used basis.
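
The benefit for loops can be illustrated with a toy software model. The sketch below is not the device's actual cache logic: the segment tag (address divided by 32), the per-word "present" bookkeeping, and all names are assumptions made for illustration, and the enable, freeze, and clear control bits are not modeled. It keeps only what the text states: two 32-word sections, fill on miss, and least-recently-used replacement.

```python
class InstructionCache:
    """Toy model: two 32-word segments with least-recently-used replacement."""

    SEGMENT_WORDS = 32

    def __init__(self):
        # Each segment holds a tag (address // 32) and the word offsets present.
        self.segments = [{"tag": None, "present": set()} for _ in range(2)]
        self.lru = [0, 1]              # front = least recently used segment
        self.hits = 0
        self.misses = 0

    def fetch(self, address):
        tag, offset = divmod(address, self.SEGMENT_WORDS)
        for index, segment in enumerate(self.segments):
            if segment["tag"] == tag:
                self._touch(index)
                if offset in segment["present"]:
                    self.hits += 1
                else:                  # segment matches, word not yet loaded
                    segment["present"].add(offset)
                    self.misses += 1
                return
        # No segment matches: replace the least recently used one.
        victim = self.lru[0]
        self.segments[victim] = {"tag": tag, "present": {offset}}
        self._touch(victim)
        self.misses += 1

    def _touch(self, index):
        self.lru.remove(index)
        self.lru.append(index)


cache = InstructionCache()
loop_body = range(0x100, 0x100 + 20)   # a hypothetical 20-instruction loop
for _ in range(10):                    # ten passes through the loop
    for pc in loop_body:
        cache.fetch(pc)
print(cache.misses, cache.hits)        # 20 180
```

Only the first pass through the loop misses; the remaining nine passes are served entirely from the cache, which is the behavior the text describes.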

CPU organization. The CPU consists of the ALU (arithmetic logic unit), the hardware multiplier, and the register file. These units are shown in Figure 4.

The register file consists of

- eight 40-bit-wide, extended-precision registers R0 through R7,
- eight 32-bit auxiliary registers AR0 through AR7, and
- twelve 32-bit control registers.

The extended-precision registers function as accumulators and can handle both floating-point and integer numbers. When they are used for floating-point numbers, the top eight bits represent the exponent and the bottom 32 bits the mantissa of the number. In their integer format, registers R0 through R7 use only their bottom 32 bits, keeping the top 8 bits unchanged in any integer or logical operation.

The eight auxiliary registers AR0 through AR7 can function as memory pointers in indirect addressing, as loop counters, or as general-purpose registers in integer arithmetic or logical operations. Associated with these registers are two auxiliary register arithmetic units (ARAU) that generate two memory addresses in parallel for the instructions that need them. The flexibility of indirect addressing increases even further when two index registers are used in conjunction with the auxiliary registers, as we discuss later.

The register file contains 12 control registers designated for specific functions. If the control registers are not used for these functions, they can be treated as general-purpose registers in integer arithmetic and logical operations. Examples of such control registers are the

- status register,
- index registers,
- stack pointer,
- interrupt mask and interrupt flag registers, and
- repeat-block registers.

In particular, the stack-pointer register points to the software stack. The user has the flexibility of designating where the stack resides, and even of changing its location during the program execution. This feature also makes the stack of essentially unlimited depth and permits its usage not only for storing the program counter during subroutine calls but also for passing arguments to subroutines. Such an arrangement is particularly convenient in the development of compilers, and we have used it extensively in the 320C30's optimizing C compiler.

The ALU performs floating-point, integer, and logical operations. The ALU always stores the result in the register file, but the input can come either from the register file or from memory, or it can be an immediate value.

In the case of floating-point arithmetic, the input to the ALU can originate from either a 40-bit extended-precision register or a 32-bit memory datum. Registers R0 through R7 store the 40-bit-word result. On the other hand, in integer arithmetic, both input and output are 32-bit numbers, and the output can move to either the lower 32 bits of the R0 through R7 registers or to any other register in the register file.

The single-cycle hardware multiplier has been an integral part of DSPs because any real-time application relies on the fast execution of multiplies. Following the same distinction as in the previous paragraph on the ALU, the multiplier performs both floating-point and integer multiplications. The 32-bit inputs to a floating-point multiplication yield a 40-bit-wide result for storage in one of the extended-precision registers.

In both the ALU and the multiplier the results of the operations are automatically normalized, thus handling any overflows of the mantissa. If there is an exponent overflow, the result is saturated in the direction of overflow and the overflow flag is set. Underflows are handled by setting the result to zero and setting an underflow flag.


Figure 4. The 320C30 central processing unit.

Buses and peripherals. Figure 3 shows that multiple on-chip buses handle program, data, and DMA operations in parallel. The device contains separate address and data buses for these three operations, with the data having two address buses to accommodate the access of multiple operands from the memory in one cycle. Also, separate buses lead to the register file. The rule to remember is that, in one cycle, up to two data memory accesses are permitted for any on-chip memory block. This multiplicity of buses eliminates bottlenecks. The user can maximize the throughput of the device by a judicious combination of the on-chip memory with the two external buses (the primary bus and the expansion bus).

The primary bus contains a 24-bit address bus and a 32-bit data bus. Its true space, though, is 16M words minus the on-chip memory and the expansion bus. The primary bus can be placed in high impedance when the device is put on hold. To facilitate its interfacing with slow memories, the 320C30 offers programmable wait states (up to seven) as well as an external ready signal.

The expansion bus contains a 13-bit address bus and a 32-bit data bus. It has two strobes, one for memory and one for I/O accesses. In other words, the memory space of the expansion bus is two segments of 8K words each, one segment mapped as regular memory and the other one mapped as I/O. Like the primary bus, the expansion bus has up to seven software-programmable wait states.

Figure 5. Peripheral bus and peripherals.

A major innovation in the 320C30, intended to support system-level solutions and to help in adapting the device to changing needs, is the peripheral bus shown in Figures 1 and 5. The peripheral bus supplies a way of expanding or varying the interface with the outside world without changing the core of the device. All of the peripherals attached to this bus are mapped to memory, and they can be replaced by others with a minimal effort if certain applications have different demands.

Currently, we have implemented a DMA controller, two serial ports, and two timers as peripherals. The DMA controller performs reads from and writes to any location in the 320C30 memory map without interfering with the operation of the CPU. The DMA controller contains its own address generators, source and destination address registers, and transfer counter. The two modular and totally independent serial ports are identical with a complementary set of control registers. Each serial port can be configured to transfer 8, 16, 24, or 32 bits of data per word, with each port clock originating either internally or externally. The pins of the serial ports are configurable as general-purpose I/O pins, while the serial ports can also be configured and used as timers.

The two 320C30 timer modules function as general-purpose timer/event counters; each has two signaling modes and internal or external clocking. Available to each timer is an I/O pin for use as an input clock to the timer, as an output signal driven by the timer, or as a general-purpose pin.

## Software

The software features of a programmable DSP are probably the most important features because they determine the effectiveness of the implementation. Typically, the user first develops an application on a large computer using a high-level language and, once it is working satisfactorily, ports it to a DSP device. The software features of the 320C30 that we discuss include the integer and floating-point number representations, addressing modes, pipeline effects, and different types of instructions and constructs.

Integer and floating-point formats. A 32-bit, twos-complement notation represents the integers. In addition to this single-precision format, we have a short format, consisting of 16-bit, twos-complement numbers used only for immediate operands. Every instruction of the 320C30 consists of one 32-bit word.

We use three formats for floating-point numbers: short, single precision, and extended precision. The single-precision, 32-bit-wide format assigns 24 bits to the mantissa and 8 bits to the exponent. The exponent occupies the 8 most significant bits, and it is represented in twos-complement notation, taking values between -128 and 127. The exponent value -128 is reserved to represent zero.

The mantissa, placed at the 24 least significant bits of a 32-bit number, is normalized to a number with an absolute value between 1.0 and 2.0. Since the mantissa is represented in a normalized, twos-complement notation, the leftmost bit, which corresponds to the sign, and its adjacent bit will always be the complement of each other. As a result, only the sign bit is represented, with the most significant bit suppressed. In other words, the mantissa contains 24 significant bits plus the sign bit, with the most significant bit implied.
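
To make the format concrete, the following sketch decodes a 32-bit word according to the description above. The bit positions (exponent in bits 31-24, sign in bit 23, fraction in bits 22-0) are inferred from the text, and the function name is hypothetical. Because the implied bit is the complement of the sign bit, a positive mantissa reads as 01.f (value 1 + f) and a negative one as 10.f in twos complement (value -2 + f).

```python
def decode_c30_single(word):
    """Decode a 32-bit single-precision word laid out as the text describes."""
    exponent = (word >> 24) & 0xFF
    if exponent >= 128:                  # interpret the field as signed 8-bit
        exponent -= 256
    if exponent == -128:                 # reserved encoding for zero
        return 0.0
    sign = (word >> 23) & 1
    fraction = (word & 0x7FFFFF) / 2.0 ** 23
    # Implied bit = complement of the sign bit:
    #   sign 0 -> mantissa 01.f = 1 + f,  sign 1 -> mantissa 10.f = -2 + f
    mantissa = (1.0 + fraction) if sign == 0 else (-2.0 + fraction)
    return mantissa * 2.0 ** exponent


print(decode_c30_single(0x00000000))   # 1.0
print(decode_c30_single(0x00400000))   # 1.5
print(decode_c30_single(0xFF800000))   # -1.0
print(decode_c30_single(0x80000000))   # 0.0
```

Note that under this reading -1.0 is stored with exponent -1 and mantissa -2, since a normalized negative mantissa lies in the range from -2 up to, but not including, -1.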

Addressing modes. The 320C30 supports several addressing modes that allow the user to access data from memory, registers, and the instruction word. The basic addressing modes are

- register,
- direct,
- indirect,
- short immediate,
- long immediate, and
- PC relative.

In register mode the operand is placed into a CPU register that is explicitly specified in an instruction. In direct mode the data memory address is formed by preceding the 16 least significant bits of the instruction word with the 8 least significant bits of the data page pointer. To keep all instructions one word long, we store only the 16 least significant bits from the address in the instruction word; the rest become the data page pointer. This restriction implies that in direct addressing the memory space is segmented into 256 pages of 64 K words each.
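
The address formation in direct mode amounts to a simple concatenation, sketched below. The function names and the example values are hypothetical; only the 8-bit page and 16-bit offset widths come from the text.

```python
def direct_address(data_page_pointer, instruction_word):
    """Form a 24-bit address from the 8 LSBs of the data page pointer
    followed by the 16 LSBs of the instruction word."""
    page = data_page_pointer & 0xFF
    offset = instruction_word & 0xFFFF
    return (page << 16) | offset


def split_address(address):
    """Inverse view: the page and in-page offset of a 24-bit address."""
    return (address >> 16) & 0xFF, address & 0xFFFF


address = direct_address(0x80, 0x9800)
print(hex(address))                     # 0x809800
print(split_address(address))           # (128, 38912)
```

With 8 page bits and 16 offset bits, this gives the 256 pages of 64K words each mentioned above.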

Table 1. Addressing modes of the 320C30.

| Mode | Example | Operation | Description |
| :--- | :--- | :--- | :--- |
| Register | ADDF R0,R1 |  | Operand in R0 |
| Direct | ADDF @MEM,R1 | Addr = MEM | Operand in MEM |
| Short immediate |  |  |  |

di is an integer between 0 and 255 or one of the index registers IR0 and IR1.

Indirect addressing, the most versatile of all the modes, specifies the address of an operand in memory through the contents of an auxiliary register. As an option, the contents of the register can be modified by constant displacements or by the contents of the index registers. Table 1 lists all of the addressing modes, with particular emphasis on indirect addressing modes.

An instruction explicitly specifies the auxiliary register used for indirect addressing. The user can modify it by a constant displacement taking values 0 to 255 or by the contents of one of the two index registers IR0 or IR1. The modification can take place before or after accessing the memory. In the case of premodification, the user has the option to change the contents of the auxiliary register either permanently or temporarily. The notation used for such modifications is reminiscent of the C-language syntax.

Two special forms of indirect addressing that are particularly useful are bit-reversed and circular addressing. Bit-reversed addressing is used with the fast Fourier transform to compensate for the fact that normally ordered data at the input of the transform are scrambled at output (bit-reversed order). To avoid moving the data around to place them in the proper order, bit-reversed addressing accesses the data in scrambled order for any subsequent operation.
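
The access order involved can be illustrated in software. The sketch below only computes the bit-reversed index sequence for a small transform; it does not model how the device generates these addresses in hardware, and the function name is an assumption.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


N = 8                                    # transform length (a power of two)
bits = N.bit_length() - 1                # 3 index bits for N = 8
order = [bit_reverse(i, bits) for i in range(N)]
print(order)                             # [0, 4, 2, 6, 1, 5, 3, 7]
```

Reading the transform output at these indices, in sequence, yields the results in natural order without physically rearranging the data.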

Circular addressing implements circular buffers. Such buffers are very convenient for use in digital-filtering operations. In circular addressing, BK, one of the control registers, specifies the size of the block. Then, when the user modifies the contents of an auxiliary register (pointing within that block) in a circular fashion, the final value is tested to determine if it is still within the block. If it is not, it is wrapped around using modulo arithmetic.
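
A software analogue of this mechanism, applied to a FIR delay line, might look like the following. It is a minimal sketch: the class and method names are hypothetical, `size` stands in for the BK register and `pointer` for an auxiliary register, and any placement constraints the hardware imposes on the buffer are not modeled.

```python
class CircularBuffer:
    """Fixed-size delay line addressed with modulo arithmetic."""

    def __init__(self, size):
        self.size = size                 # plays the role of the BK register
        self.data = [0.0] * size
        self.pointer = 0                 # plays the role of an auxiliary register

    def advance(self, step=1):
        # Modify the pointer, then wrap it back into the block if needed.
        self.pointer = (self.pointer + step) % self.size

    def push(self, sample):
        self.advance()
        self.data[self.pointer] = sample     # overwrites the oldest sample

    def fir(self, coefficients):
        # coefficients[k] multiplies the sample that is k steps old.
        return sum(c * self.data[(self.pointer - k) % self.size]
                   for k, c in enumerate(coefficients))


buffer = CircularBuffer(4)
for x in [4.0, 8.0, 12.0, 16.0, 20.0]:
    buffer.push(x)
print(buffer.fir([0.25, 0.25, 0.25, 0.25]))   # 14.0 (mean of the last four samples)
```

The point of the wraparound is that a new sample replaces the oldest one in place, so no data are shifted through memory on each sample period.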

The short-immediate mode encodes immediate, 16-bit-long operands of arithmetic operations. The long-immediate mode encodes program control instructions (branch instructions) for which it is useful to have a 24-bit absolute address contained in the instruction word. Finally, the PC-relative addressing also applies to program control instructions and uses the offset from the present value of the program counter (PC) rather than an absolute address. The last two modes are transparent to the user. The user specifies the branching label wanted, and the assembler assigns the appropriate addressing mode.

Pipeline. To achieve the high throughput of the device, the 320C30 uses a four-phase pipeline with five major functional units operating in parallel. These five units are

- instruction fetching,
- instruction decoding and address generation,
- operand reads,
- instruction execution, and
- DMA transfer.

Figure 6 shows diagrammatically how the pipeline operates on successive instructions. When the pipeline is full, an instruction completes the execution phase every 60-ns machine cycle.
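
The overlap can be visualized with a small text chart. This sketch assumes an ideal, conflict-free flow through the four phases (abbreviated here as F, D, R, and E for fetch, decode, read, and execute; the abbreviations are ours) and ignores the DMA unit, which operates alongside the pipeline.

```python
STAGES = ["F", "D", "R", "E"]    # fetch, decode, operand read, execute


def pipeline_chart(n_instructions):
    """Return one text row per instruction, one column per machine cycle."""
    total_cycles = n_instructions + len(STAGES) - 1
    rows = []
    for i in range(n_instructions):
        row = ["."] * total_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name        # instruction i is in stage s during cycle i + s
        rows.append(" ".join(row))
    return rows


def total_time_ns(n_instructions, cycle_ns=60):
    """Time for n instructions to pass through an ideal four-phase pipeline."""
    return (n_instructions + len(STAGES) - 1) * cycle_ns


for row in pipeline_chart(4):
    print(row)
print(total_time_ns(4), "ns")        # 420 ns
```

After the initial three-cycle fill, one instruction reaches the execute phase in every cycle, so long instruction sequences approach one instruction per 60 ns.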

Occasionally conflicts may arise, as in the case of a loaded auxiliary register that needs to be used for indirect addressing in the next instruction. To handle such cases, we established a priority between the different units, giving DMA the lowest priority. Among the others, an Execute instruction has the highest and a Fetch instruction the lowest priority.

In programming the device, the user does not have to worry about pipeline conflicts, which occur infrequently. When a conflict does occur, the device automatically inserts the necessary extra cycle(s) to make the instructions behave as expected. In most cases, this arrangement will be sufficient for successful operation. For time-critical operations, though, it may be necessary to remove the extra cycles caused by pipeline conflicts. The user can do so by rearranging the instructions of the program, which first requires identifying the locations where insertions occur. For that purpose, the development tools (simulator, emulators) contain a tracing feature that can display the pipeline. In this trace, any conflicts are immediately identified, and the user can then take steps to correct the problem.

Instruction set features. The instruction set of the 320C30 supports both two- and three-operand instructions. In all arithmetic instructions (except Store), the destination is a register in the register file. The source operands can come from memory or from a register or, in the case of two-operand instructions, can be part of the instruction word.

A unique feature of the 320C30 is the set of instructions in which operations execute in parallel. This construct permits a high degree of concurrency and execution of any arithmetic or logical instruction in parallel with a Store instruction. It also supports parallel multiplies and adds, as well as parallel loading and storing of two registers. Parallel multiplies and adds lead to the peak performance of 33 Mflops. Executing the Store instruction at the same time as another arithmetic operation essentially permits this kind of data movement without a penalty. As an example, the following instruction adds the contents of memory pointed to by AR1 (indicated by *AR1) to register R0 (treating them as floating-point numbers) and places the result in register R1. In parallel with that process, the original contents of R1 are stored in the memory location indicated by AR3.

$$
\begin{array}{lll} 
& \text { ADDF } & \text { *AR1,R0,R1 } \\
\| & \text { STF } & \text { R1,*AR3 }
\end{array}
$$
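As a quick check on the peak figure quoted above, assume the 60-ns machine cycle given earlier and one parallel multiply/add pair, that is, two floating-point operations, completing in every cycle:

$$
\frac{2 \text { operations }}{60 \times 10^{-9} \mathrm{~s}} \approx 33.3 \times 10^{6} \text { operations/s } \approx 33 \text { Mflops }
$$

The same cycle time corresponds to roughly 16.7 million instructions per second when one instruction completes per cycle.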

When executing a branch instruction, the pipeline must be flushed since the path followed after the branch is data dependent. As a result, a regular branch instruction is more costly than other instructions, taking four cycles to complete. This overhead may be unacceptable in some time-critical applications. To alleviate this problem and to offer more flexibility to the programmer, the 320C30 contains a set of delayed branches that complement the set of standard branches. In a delayed branch, the three instructions following the branch instruction execute whether the branch is taken or not. As a result, the delayed branch ends up taking only one cycle to execute. The same approach can be used even when there are fewer than three such instructions, by adding NOPs (no operations). The branch will still take fewer than four cycles.

Figure 6. Pipeline of 320C30 instructions.

The greatest cost of branching occurs during the execution of loops. In looping, a counter is decremented and compared to zero at the end of the loop. If it is not zero, a branch is taken to the beginning of the loop. The 320C30 offers a special arrangement that implements loops with no overhead. The two instructions RPTB (repeat block) and RPTS (repeat single) realize this arrangement. The format of the RPTB instruction is:

    RPTB LABEL
    (put instructions here)
    LABEL (last instruction)

Associated with the repeat-block construct are three of the 12 control registers in the register file. One register indicates the beginning of the block, the second indicates the end of the block, and the third acts as the repeat counter. The assembler automatically assigns values to the first two registers: they contain the address of the instruction immediately below RPTB and the address of LABEL, respectively. Users should initialize the repeat counter before entering the loop. In terms of execution time, this arrangement behaves as if the loop were implemented with straight-line code.

The instruction RPTS has the format

    RPTS count

and it repeats the following instruction "count" times. It differs from RPTB in that it

- applies to only one instruction;
- does not refetch the instruction for every execution, but keeps it in the instruction register, thus freeing the buses for data transfers; and
- is not interruptible.

Table 2 is a sample of the instructions available on the 320C30. Although we included a rich set of instructions for both DSP and general-purpose processing, the perceived size of the instruction set is much smaller. The reason is that a symmetry exists between integer and floating-point instructions, between instructions with two or three operands, and between single and parallel instructions. For instance, addition is represented by ADDI, ADDF, or ADDC in the case of adding integers, floating-point numbers, or adding with a carry. The three-operand instructions have the same form, with a 3 appended at the end (ADDF3). All of the multiplier and ALU operations can be performed in parallel with a Store instruction, and such instructions take the form of the following example:

$$
\begin{array}{lll}
& \text { ADDF3 } & \text { *AR0,R1,R2 } \\
\| & \text { STF } & \text { R0,*AR1 }
\end{array}
$$

Furthermore, two loads or two stores can execute in parallel, as is also the case with a multiply and an add or a multiply and a subtract. The design of the instruction set has been guided by a desire to ease programming efforts. The execution results of an instruction are always available for use in the instruction that follows.

Besides the regular arithmetic and logical instructions, the 320C30 includes instructions to handle the software stack, internal and external interrupts, and branches and subroutine calls. Conditional loads and calls make the programming more compact and efficient, while special instructions (called interlocked instructions) can be used in multiprocessor environments.

## Development tools and support

The newer DSP devices offer increased processing power that permits the implementation of more complicated and demanding algorithms. However, as the complexity of the algorithm increases, the task of debugging the implementation becomes more difficult. The 320C30 addresses this problem by providing user-friendly development tools and offering extra support in the form of an optimizing C compiler and a DSP operating system.

The assembler translates assembly-language source files into machine-language object files. Source files can contain instructions, assembler directives, and macro directives. Assembler directives control various aspects of the assembly process such as the source-listing format, symbol definition, and method of placing the source code into sections. Macro directives permit a concise representation of groups of instructions that occur frequently.

The linker combines object files into one executable object module. As it creates the executable module, the linker performs relocation operations and resolves external references. The linker accepts relocatable COFF (Common Object File Format) object files, created by the assembler, as input. It can also accept archive library members and output modules created by a previous linker run. Linker directives allow the user to combine object-file sections, bind sections or symbols to specific addresses or within specific portions of 320C30 memory, and define or redefine global symbols. An associated archiver can create macro or object-file libraries.

The software simulator is a very important tool for debugging 320C30 programs. Its interface consists of a screen broken into windows that display the internal registers, the reverse-assembled program, and a versatile window where memory, breakpoints, and a wealth of other information can be displayed. The same interface (modified to accommodate some special features) is also used with the hardware emulator. The major features of the simulator include:

- Simulation of the entire 320C30 instruction set and the key peripheral features;
- Command entry from either menu-driven keystrokes (menu mode) or from line commands (line mode);
- Help menus for all screen modes;
- Quick storage and retrieval of simulation parameters from files to facilitate preparation for individual sessions;
- Reverse assembly allowing editing and reassembly of source statements;
- Multiple execution modes;
- Trace expressions that are easy to define;
- Trace execution that can display designated expression values, cache memory, and the instruction pipeline; and
- Breakpoints that can occur on address read, write, or both, on address execute, and on expression valid.

Table 2.
Instructions for the 320C30.

| Instruction | Description | Instruction | Description |
| :---: | :---: | :---: | :---: |
| Load and store instructions |  |  |  |
| LDE | Load floating-point exponent | POP | Pop integer from stack |
| LDF | Load floating-point value | POPF | Pop floating-point value from stack |
| LDFcond | Load floating-point value conditionally | PUSH | Push integer on stack |
| LDI | Load integer | PUSHF | Push floating-point value on stack |
| LDIcond | Load integer conditionally | STF | Store floating-point value |
| LDM | Load floating-point mantissa | STI | Store integer |
| Two-operand instructions |  |  |  |
| ABSF | Absolute value of a floating-point number | NORM | Normalize floating-point value |
| ABSI | Absolute value of an integer | NOT | Bitwise logical-complement |
| ADDC $\dagger$ | Add integers with carry | OR | Bitwise logical-OR |
| ADDF | Add floating-point values | RND | Round floating-point value |
| ADDI | Add integers | ROL | Rotate left |
| AND | Bitwise logical-AND | ROLC | Rotate left through carry |
| ANDN | Bitwise logical-AND with complement | ROR | Rotate right |
| ASH | Arithmetic shift | RORC | Rotate right through carry |
| CMPF | Compare floating-point values | SUBB $\dagger$ | Subtract integers with borrow |
| CMPI $\dagger$ | Compare integers | SUBC | Subtract integers conditionally |
| FIX | Convert floating-point value to integer | SUBF | Subtract floating-point values |
| FLOAT | Convert integer to floating-point value | SUBI | Subtract integer |
| LSH $\dagger$ | Logical shift | SUBRB | Subtract reverse integer with borrow |
| MPYF | Multiply floating-point values | SUBRF | Subtract reverse floating-point value |
| MPYI $\dagger$ | Multiply integers | SUBRI | Subtract reverse integer |
| NEGB | Negate integer with borrow | TSTB $\dagger$ | Test bit fields |
| NEGF | Negate floating-point value | XOR $\dagger$ | Bitwise exclusive-OR |
| NEGI | Negate integer |  |  |
| Program control instructions |  |  |  |
| Bcond | Branch conditionally (standard) | IDLE | Idle until interrupt |
| BcondD | Branch conditionally (delayed) | NOP | No operation |
| BR | Branch unconditionally (standard) | RETIcond | Return from interrupt conditionally |
| BRD | Branch unconditionally (delayed) | RETScond | Return from subroutine conditionally |
| CALL | Call subroutine | RPTB | Repeat block of instructions |
| CALLcond | Call subroutine conditionally | RPTS | Repeat single instruction |
| DBcond | Decrement and branch conditionally (standard) | SWI | Software interrupt |
| DBcondD | Decrement and branch conditionally (delayed) | TRAPcond | Trap conditionally |

$\dagger$ Two- and three-operand versions

Perhaps the most important trend with the newer DSPs is the availability of high-level-language compilers. The presence of C and Ada compilers for the 320C30 is not an accident, since the 320C30 was designed with a compiler in mind. We expect this path to a high-level language to make the porting of application programs from large computers much easier. The algorithm can be developed almost entirely on a large computer and then converted to 320C30 assembly language by compilation.

The C compiler for the 320C30 has exceptional efficiency, ${ }^{2}$ which makes a good C program almost as effective as the assembly-language program. The C compiler will be sufficient for most applications. The exception is time-critical applications. In such cases one can use the fact that most DSP algorithms spend the vast majority of the execution time on a small section of the code. (Researchers often mention the 90/10 rule: 90 percent of the time is spent on 10 percent of the code.) Under these circumstances, the user can optimize execution by creating very fast assembly-language routines that implement the time-critical sections, and call them from C as regular C functions. To achieve this, we define the C function interface very precisely so that users can create their own routines. The C-compiler package comes with a library of general-purpose mathematical, interface, and I/O functions.

Besides this method of optimizing the performance of the C language, two more methods can be used. The first one is based on the fact that the output of the compiler is an assembly-language program. The user can edit this program and optimize it by rearranging the instructions. The second method is to use the "asm" directive supported by the C compiler. The arguments of this directive are passed to the output of the compilation without any alteration so that the user can insert assembly-language instructions into the middle of the C program.

A key part of the 320C30 development environment is Spox, the first real-time operating system for a single-chip DSP. Spox, developed by Spectron Microsystems, extends the core C language with a library of standard I/O routines and, most importantly, a DSP math package. One of Spox's unique features is that it provides users with software objects that are especially suited for DSP. Some of these objects are vectors, matrices, filters, and streams. The math package and these software objects are carefully designed to take full advantage of the capabilities of the 320C30. Spox also supports multitasking, thus allowing the user to easily implement the more complex control structures that are becoming essential for DSP systems.

By providing a complete software development environment that includes compilers and operating systems along with the more-traditional tools such as assemblers and linkers, we allow the user to move from system conception to system implementation in the shortest possible time.

The next level of development tools includes the hardware emulators for debugging target hardware or determining the performance of an algorithm on the 320C30 device itself. The XDS 1000 is a real-time, in-circuit emulator/software development tool based on the 320C30. Besides these tools from Texas Instruments, other companies offer related support, such as the PC-based development board by Atlanta Signal Processors and the development platform of Spectron Microsystems for PCs and Sun workstations.

## Applications

Certain features of the 320C30, such as its high speed, versatile architecture, and rich instruction set, make it easy to implement very demanding algorithms. The large memory space makes the device suitable for application areas such as image processing, in which memory addressing is one of the prime considerations. And the C compiler makes it easy to construct algorithms with complicated logic.

General DSP algorithms. Almost every DSP application needs to perform some kind of filtering, the first application considered for a DSP device. Digital filters are categorized as FIR (finite-length impulse response) and IIR (infinite impulse response) filters, ${ }^{3,4}$ or, equivalently, as filters that have only zeros or both poles and zeros. Each of these categories can have either fixed or adaptive coefficients.

The 320C30 implements FIR filters very efficiently. For instance, let an FIR filter have an impulse response $h[0]$, $h[1], \ldots, h[N-1]$, and let $x[n]$ represent the input of the filter at time $n$. Then the output $y[n]$ is given by the equation:

$$
\begin{aligned}
y[n]= & h[0] \times x[n]+h[1] \times x[n-1]+\ldots+ \\
& h[N-1] \times x[n-N+1]
\end{aligned}
$$
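The equation above can be sketched in C with a circular buffer holding the last $N$ input samples. This is an illustrative sketch, not a translation of the assembly routine in Figure 7: the function name `fir_step`, the `pos` write index, and the oldest-sample-overwrite convention are assumptions.

```c
#include <stddef.h>

/* Compute one FIR output sample.
   h    : impulse response h[0..n-1]
   x    : circular buffer of the n most recent inputs
   pos  : index of the oldest sample, which is overwritten by `sample`
   Returns y[n] = h[0]*x[n] + h[1]*x[n-1] + ... + h[n-1]*x[n-n+1]. */
float fir_step(const float *h, float *x, size_t n, size_t *pos, float sample)
{
    size_t idx = *pos;
    size_t k;
    float y = 0.0f;

    x[idx] = sample;                       /* newest sample replaces oldest */
    for (k = 0; k < n; k++) {
        y += h[k] * x[idx];                /* one multiply/add per tap */
        idx = (idx == 0) ? n - 1 : idx - 1; /* step back in time, wrapping */
    }
    *pos = (*pos + 1) % n;                 /* next call overwrites new oldest */
    return y;
}
```

Feeding a unit impulse followed by zeros returns the coefficients $h[0], h[1], \ldots$ in order, which is a convenient sanity check for the indexing.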

```
; Typical calling sequence:
;     load  AR0
;     load  AR1
;     load  RC
;     load  BK
;     CALL  FIR
;
; Data memory organization:
; The physical address for the start of the input samples must be on
; a boundary with the LSBs set to zero according to the length of the
; buffer. The pointer to the input sequence (x) is incremented and
; assumed to be moving from an older input to a newer input. At the
; end of the subroutine, AR1 will be pointing to the position for the
; next input sample.
;
; Argument assignments:
;     Argument : Function
;     AR0      : Address of h(N-1)
;     AR1      : Address of x(n-(N-1))
;     RC       : Length of filter - 2 (N-2)
;     BK       : Length of filter (N)
;
; Registers used as input: AR0, AR1, RC, BK
; Registers modified: R0, R2, AR0, AR1, RC
; Register containing result: R0
; Program size: 6 words
; Execution cycles: 11 + (N-1)
;
        .global FIR
; initialize R0:
FIR:    MPYF3   *AR0++(1),*AR1++(1)%,R0   ; h(N-1) * x(n-(N-1)) -> R0
        LDF     0.0,R2                    ; initialize R2
; filter (1 <= i < N)
        RPTS    RC                        ; set up the repeat single
        MPYF3   *AR0++(1),*AR1++(1)%,R0   ; h(N-1-i) * x(n-(N-1-i)) -> R0
||      ADDF3   R0,R2,R2                  ; multiply and add operation
;
        ADDF    R0,R2,R0                  ; add last product
; return sequence
        RETS                              ; return
; end
        .end
```

Figure 7. FIR filter implementation on the 320C30.


Figure 8. Implementation of $N$ biquads on the 320C30.

Two features of the 320C30 facilitate the implementation of FIR filters: parallel multiply/add operations and circular addressing. The first feature permits a multiplication and an addition to execute in one machine cycle, while the second makes a finite buffer of length $N$ sufficient for the data $x[n]$. Figure 7 shows the arrangement of the data and the assembly code for an FIR filter. Note that the filter takes one cycle of execution per tap.

The transfer function of the IIR filters contains both poles and zeros, and its output depends on both the input and the past output. As a rule, these filters need less computation than an FIR filter of similar frequency response, but they have the drawback of being sensitive to coefficient quantization. Most often, the IIR filters are implemented as a cascade of second-order sections, called biquads. To implement an IIR filter consisting of $N$ biquads, let $a1[i], a2[i]$ be the denominator (feedback) coefficients of the $i$th biquad and $b0[i], b1[i], b2[i]$ the numerator (feedforward) coefficients of the same biquad. Also, let $x[n]$ be the input and $y[n]$ be the output of the IIR filter. In canonic form, the following C code implements the $N$ biquads:

```
y[-1,n] = x[n];
for (i = 0; i < N; i++) {
    d[i,n] = a2[i]*d[i,n-2] + a1[i]*d[i,n-1] + y[i-1,n];
    y[i,n] = b2[i]*d[i,n-2] + b1[i]*d[i,n-1] + b0[i]*d[i,n];
}
y[n] = y[N-1,n];
```
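The recursion above uses a time index for readability. A runnable C version only needs the two most recent delay-node values per biquad. The sketch below rests on assumptions: the name `biquad_cascade`, the separate coefficient arrays, and the state arrays `d1` (holding d[i,n-1]) and `d2` (holding d[i,n-2]) are illustrative and do not mirror the memory layout of Figure 8.

```c
#include <stddef.h>

/* Process one input sample through a cascade of n biquads in canonic form.
   d1[i] and d2[i] hold the delay-node values d[i,n-1] and d[i,n-2]
   and are updated in place for the next call. */
float biquad_cascade(float x, size_t n,
                     const float *a1, const float *a2,
                     const float *b0, const float *b1, const float *b2,
                     float *d1, float *d2)
{
    float y = x;          /* input to the first biquad */
    size_t i;

    for (i = 0; i < n; i++) {
        float d0 = a2[i] * d2[i] + a1[i] * d1[i] + y;       /* d[i,n] */
        y = b2[i] * d2[i] + b1[i] * d1[i] + b0[i] * d0;     /* y[i,n] */
        d2[i] = d1[i];    /* shift the delay line */
        d1[i] = d0;
    }
    return y;             /* output of the last biquad */
}
```

With a single biquad and coefficients a1 = 0.5, a2 = 0, b0 = 1, b1 = b2 = 0, a unit impulse produces the decaying sequence 1, 0.5, 0.25, and so on.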

Figure 8 shows the memory arrangement and the code for this implementation on the 320C30.

In addition to the fixed-coefficient filters, the 320C30 can also implement adaptive filters very effectively (with three cycles per updated tap).

```
; ... value 3. The result y(n) is placed in R0. At the end of the program,
; AR1 points to the new d(0,n-2) so that it is set when the new sample
; comes in.
;
; Argument assignments:
;     Argument : Function
;     ---------+---------------------
;     AR0      : Address of filter coefficients (a2(0))
;     AR1      : Address of delay node values (d(0,n-2))
;     BK       : BK = 3
;     IR0      : IR0 = 4
;     IR1      : IR1 = 4*N-4
;     RC       : Number of biquads (N) - 2
;
; Registers used as input: R2, AR0, AR1, IR0, IR1, BK, RC
; Registers modified: R0, R1, R2, AR0, AR1, RC
; Register containing result: R0
; Program size: 17 words
; Execution cycles: 23 + 6N
;
        .global IIR2
IIR2    MPYF3   *AR0,*AR1,R0                 ; a2(0) * d(0,n-2) -> R0
        MPYF3   *++AR0(1),*AR1--(1)%,R1      ; b2(0) * d(0,n-2) -> R1
        MPYF3   *++AR0(1),*AR1,R0            ; a1(0) * d(0,n-1) -> R0
||      ADDF3   R0,R2,R2                     ; first sum term of d(0,n)
        MPYF3   *++AR0(1),*AR1--(1)%,R0      ; b1(0) * d(0,n-1) -> R0
||      ADDF3   R0,R2,R2                     ; second sum term of d(0,n)
        MPYF3   *++AR0(1),R2,R2              ; b0(0) * d(0,n) -> R2
||      STF     R2,*AR1--(1)%                ; store d(0,n); point to d(0,n-2)
        RPTB    LOOP                         ; loop for 1 <= i < N
        MPYF3   *++AR0(1),*++AR1(IR0),R0     ; a2(i) * d(i,n-2) -> R0
||      ADDF3   R0,R2,R2                     ; first sum term of y(i-1,n)
        MPYF3   *++AR0(1),*AR1--(1)%,R1      ; b2(i) * d(i,n-2) -> R1
||      ADDF3   R1,R2,R2                     ; second sum term of y(i-1,n)
        MPYF3   *++AR0(1),*AR1,R0            ; a1(i) * d(i,n-1) -> R0
||      ADDF3   R0,R2,R2                     ; first sum term of d(i,n)
        MPYF3   *++AR0(1),*AR1--(1)%,R0      ; b1(i) * d(i,n-1) -> R0
||      ADDF3   R0,R2,R2                     ; second sum term of d(i,n)
LOOP    MPYF3   *++AR0(1),R2,R2              ; b0(i) * d(i,n) -> R2
||      STF     R2,*AR1--(1)%                ; store d(i,n); point to d(i,n-2)
; final summation
        ADDF    R0,R2                        ; first sum term of y(N-1,n)
        ADDF3   R1,R2,R0                     ; second sum term of y(N-1,n)
        NOP     *AR1--(IR1)                  ; return to first biquad
        NOP     *AR1--(1)%                   ; point to d(0,n-1)
; return sequence
        RETS                                 ; return
; end
        .end
```

Fourier transforms are another important tool often used in DSP systems. The purpose of the transform is to convert information from the time domain to the frequency domain. Computationally efficient implementations of the Fourier transform are known as fast Fourier transforms (FFTs). ${ }^{3,5}$ Table 3 shows the timing for different FFTs on the 320C30. The code for these FFTs, as well as the routines listed in Table 4, appears in the TMS320C30 User's Guide. ${ }^{6}$

The 320C30 has many features that make it well suited for FFTs, such as the high speed of the device, the floating-point capability, the block-repeat construct, and the bit-reversed addressing mode. For instance, the FFT shown in Figure 9 can be implemented in code that can be entirely contained in the 64-word cache of the 320C30. ${ }^{7}$
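As a companion to the assembly fragment in Figure 9, the following C sketch shows the structure of an in-place radix-2, decimation-in-frequency FFT. It is not the User's Guide routine: the function name `fft_dif_radix2`, the split real/imaginary arrays, and the caller-supplied `cos_tab` and `sin_tab` tables are assumptions made for illustration. As with the timings in Table 3, table setup and bit reversal are left outside the routine, so the results come out in bit-reversed order.

```c
#include <stddef.h>

/* In-place radix-2 decimation-in-frequency FFT of n complex points
   (n a power of two). The caller supplies
   cos_tab[k] = cos(2*pi*k/n) and sin_tab[k] = sin(2*pi*k/n), k = 0..n/2-1.
   On return, re/im hold the transform in bit-reversed order. */
void fft_dif_radix2(float *re, float *im, size_t n,
                    const float *cos_tab, const float *sin_tab)
{
    size_t span, stride, start, k;

    for (span = n / 2, stride = 1; span >= 1; span /= 2, stride *= 2) {
        for (start = 0; start < n; start += 2 * span) {
            for (k = 0; k < span; k++) {
                size_t i = start + k;       /* upper butterfly leg */
                size_t l = i + span;        /* lower butterfly leg */
                float c = cos_tab[k * stride];
                float s = sin_tab[k * stride];
                float tr = re[i] - re[l];   /* X(I) - X(L) */
                float ti = im[i] - im[l];   /* Y(I) - Y(L) */

                re[i] += re[l];             /* sums stay in the upper leg */
                im[i] += im[l];
                re[l] = tr * c + ti * s;    /* difference times (cos - j sin) */
                im[l] = ti * c - tr * s;
            }
        }
    }
}
```

For the 4-point input {1, 2, 3, 4} with tables cos = {1, 0} and sin = {0, 1}, the arrays hold X(0), X(2), X(1), X(3) on return, that is, 10, -2, -2+2j, and -2-2j.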

Telecommunications and speech. Telecommunications and speech applications have many requirements in common with other DSP applications, but they also have some special needs. For instance, telecommunications applications interfacing to T1 carriers sometimes need to convert between a linear signal and one compressed by $\mu$-law or A-law formats. Such a conversion can be realized with hardware by adding a peripheral to the DSP peripheral bus. This is the approach taken in some members of the TMS320 first generation of devices. An alternative way is to do the same function with software.

In speech applications, digital filters are often implemented in lattice form. Depending on the application, both FIR and IIR filters are realized this way, although sometimes the terminology lattice filter and inverse lattice filter is used respectively.
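The software approach to $\mu$-law compression mentioned above can be sketched in C. The code below uses the widely used segment-based encoding of a 16-bit linear sample into an 8-bit code word; it is an illustration under that assumption, not the routine benchmarked in Table 4, and the name `mulaw_encode` and the bias and clip constants are choices made for this sketch.

```c
#include <stdint.h>

/* Encode one linear PCM sample (16-bit range) as an 8-bit mu-law code word:
   1 sign bit, 3 segment (exponent) bits, 4 mantissa bits, all inverted. */
uint8_t mulaw_encode(int sample)
{
    const int bias = 0x84;    /* 132, added before segment search */
    const int clip = 32635;   /* largest magnitude before bias overflows */
    int sign = 0;
    int exponent = 7;
    int mantissa;
    int mask;

    if (sample < 0) {
        sign = 0x80;
        sample = -sample;
    }
    if (sample > clip)
        sample = clip;
    sample += bias;

    /* Find the segment: position of the highest set bit among bits 14..7. */
    for (mask = 0x4000; exponent > 0 && (sample & mask) == 0; mask >>= 1)
        exponent--;

    mantissa = (sample >> (exponent + 3)) & 0x0F;
    return (uint8_t)~(sign | (exponent << 4) | mantissa);
}
```

A zero input encodes as 0xFF, and the largest positive and negative inputs encode as 0x80 and 0x00 respectively.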

Graphics and image processing. In graphics and image processing applications, DSPs perform operations on two-dimensional signals, and matrix arithmetic takes on particular significance. In the 320C30, matrix arithmetic can be decomposed into a series of dot products, which can be very effectively implemented using constructs similar to the FIR filter implementation discussed earlier. Additionally, the large memory space of the 320C30 allows processing of large segments of data at a time.
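The decomposition into dot products can be sketched in C as follows. The function names `dot_product` and `mat_vec` and the row-major storage of the matrix are assumptions for illustration; each row of the matrix contributes one dot product, whose inner loop has the same multiply/add structure as the FIR filter.

```c
#include <stddef.h>

/* Dot product of two vectors of length n. */
float dot_product(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    size_t k;

    for (k = 0; k < n; k++)
        sum += a[k] * b[k];      /* one multiply/add per element */
    return sum;
}

/* out = m * v, where m is rows x cols in row-major order. */
void mat_vec(const float *m, const float *v, float *out,
             size_t rows, size_t cols)
{
    size_t r;

    for (r = 0; r < rows; r++)
        out[r] = dot_product(m + r * cols, v, cols);  /* row r times v */
}
```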

Benchmarks. We have implemented several general-purpose and applications-oriented routines for the 320C30 and include these in the User's Guide. ${ }^{6}$ Table 4 lists some of these routines with the necessary cycles and the memory requirements for the program.

The last five years have seen a tremendous growth in the utility of digital signal processors. This growth has been fueled, at least in part, by the ever-increasing level of performance and ease of use of general-purpose DSPs. The TMS320C30 represents the newest generation of DSPs, but the end of this trend is not yet in sight. Rather, we expect the trend of higher levels of performance and greater ease of use to continue. For DSPs, the next five years look bright indeed.

Table 3.
Timing of an FFT on the 320C30.

| Number of points | Radix-2 (complex) | Radix-4 (complex) | Radix-2 (real) |
| :---: | :---: | :---: | :---: |
| FFT timing (ms) |  |  |  |
| 64 | 0.167 | 0.123 | 0.075 |
| 128 | 0.367 | - | 0.162 |
| 256 | 0.801 | 0.624 | 0.354 |
| 512 | 1.740 | - | 0.771 |
| 1,024 | 3.750 | 3.040 | 1.670 |
| Code size (words) | 55 | 176 | 86 |

The code size does not include the sine/cosine tables. The timing does not include bit reversal or data I/O.

Table 4.
Program memory and timing requirements for 320C30 routines.

| Application | Words | Cycles (best case/worst case) |
| :---: | :---: | :---: |
| Inverse of a floating-point number | 31 | 31 |
| Integer division | 27 | 27/58 |
| Double-precision integer multiplication | 24 | 20/24 |
| Square root | 32 | 35 |
| Dot product of two vectors | 10 | $8+(N-1)$ |
| Matrix times vector operation |  | $2+R(C+9)$ |
| FIR filter | 5 | $7+(N-1)$ |
| IIR filter (one biquad) | 7 | 7 |
| IIR filter ($N>1$ biquads) | 16 | $19+6 N$ |
| LMS adaptive filter | 9 | $8+3(N-1)$ |
| LPC lattice filter | 11 | $9+5(P-1)$ |
| Inverse LPC lattice filter | 9 | $9+3(P-1)$ |
| $\mu$-law compression | 16 | 16 |
| $\mu$-law expansion | 13 | 11/16 |
| A-law compression | 18 | 18 |
| A-law expansion | 15 | 14/21 |

$N=$ length of appropriate vector; $P=$ length of lattice filter; $R=$ number of rows of a matrix; $C=$ number of columns of a matrix.

```
; GENERIC PROGRAM TO DO A LOOPED-CODE RADIX-2 FFT COMPUTATION IN 320C30.
;
; THE PROGRAM IS ADAPTED FROM THE FORTRAN PROGRAM IN PAGE 111 OF REFERENCE [5]
;
; AUTHOR: PANOS E. PAPAMICHALIS
;         TEXAS INSTRUMENTS
;         JULY 16, 1987
;
; ...
;
; GENERAL BUTTERFLY
;
        RPTB    BLK2
        SUBF    *AR2,*AR0,R2           ; R2=X(I)-X(L)
        SUBF    *+AR2,*+AR0,R1         ; R1=Y(I)-Y(L)
        MPYF    R2,R6,R0               ; R0=R2*SIN AND...
||      ADDF    *+AR2,*+AR0,R3         ; R3=Y(I)+Y(L)
        MPYF    R1,*+AR4(IR1),R3       ; R3=R1*COS AND...
; ...
```

Figure 9. Example of a radix-2, decimation-in-frequency FFT.

## References

1. K.-S. Lin, G.A. Frantz, and R. Simar, "The TMS320 Family of Digital Signal Processors," Proc. IEEE, Vol. 75, No. 9, Sept. 1987, pp. 1143-1159.
2. R. Simar and A. Davis, "The Application of High-Level Languages to Single-Chip Digital Signal Processors," Proc. 1988 Int'l. Conf. Acoustics, Speech, and Signal Processing, Apr. 1988, pp. 1678-1681.
3. A. Oppenheim and R. Schafer, Digital Signal Processing, Prentice Hall, Englewood Cliffs, N.J., 1975, 585 pp.
4. L. Rabiner and B. Gold, Theory and Application of Digital Signal Processing, Prentice Hall, 1975, 762 pp.
5. C.S. Burrus and T.W. Parks, DFT/FFT and Convolution Algorithms, John Wiley \& Sons, New York, 1985, 232 pp.
6. TMS320C30 User's Guide, Texas Instruments, Dallas, Tex., 1988.
7. P. Papamichalis, "FFT Implementation on the TMS320C30," Proc. 1988 Int'l. Conf. on Acoustics, Speech, and Signal Processing, Apr. 1988, pp. 1399-1402.


Panos Papamichalis is a senior member of the technical staff and a section manager in the Texas Instruments DSP Applications Group. He is also an adjunct professor for the Electrical and Computer Engineering Department at Rice University in Houston. Author of Practical Approaches to Speech Coding, his interests include digital signal processing with applications to speech processing and telecommunications.

Papamichalis received his engineering degree from the School of Mechanical and Electrical Engineering, National Technical University of Athens. His MS and PhD degrees in electrical engineering come from the Georgia Institute of Technology in Atlanta. He is a member of the Institute of Electrical and Electronics Engineers and Sigma Xi.


Ray Simar, Jr. is a group member of the TI Semiconductor technical staff and the principal architect and program manager of the TMS320C30. He has supported the TMS320 family of digital signal processors.

Simar holds a BS degree in bioengineering from Texas A\&M University, College Station, and an MSEE from Rice University. He is a member of Tau Beta Pi, Phi Eta Sigma, and Phi Kappa Phi.

Questions concerning this article can be directed to Panos Papamichalis, Texas Instruments, Inc., PO Box 1443, M/S 701, Houston, TX 77251-1443.

## Part II. Digital Signal Processing Routines

4. An Implementation of FFT, DCT, and Other Transforms on the TMS320C30 (Panos Papamichalis)
5. Doublelength Floating-Point Arithmetic on the TMS320C30 (Al Lovrich)
6. An $8 \times 8$ Discrete Cosine Transform Implementation on the TMS320C25 or the TMS320C30 (William Hohl)
7. An Implementation of Adaptive Filters with the TMS320C25 or the TMS320C30 (Sen Kuo and Chein Chen)
8. A Collection of Functions for the TMS320C30 (Gary Sitton)

## An Implementation of FFT, DCT, and Other Transforms on the TMS320C30

Panos Papamichalis
Digital Signal Processor Products-Semiconductor Group
Texas Instruments

This report describes the implementation of several Fast Fourier Transforms (FFTs) and related algorithms on the TMS320C30. The TMS320C30 is the first device in the third generation of 32-bit floating-point Digital Signal Processors (DSPs) in the Texas Instruments TMS320 family. The algorithms considered here are the complex radix-2 FFT, the complex radix-4 FFT, the real-valued radix-2 FFT (both forward and inverse transforms), the Discrete Hartley Transform (DHT), and the Discrete Cosine Transform (DCT). These transforms have many applications, such as in image processing, sonar, and radar.

The introduction briefly describes transforms and their implementation on the TMS320 family of processors. Next, the different kinds of FFTs (including the real FFT), the closely-related Hartley transform, and the Cosine transform are described and compared. This is followed by a description of the TMS320C30 features that permit efficient implementations of these algorithms. Then, specific implementations, transforms, and TMS320C30 C Compiler facts are outlined. Finally, the report discusses some implementation issues, and the appendices list actual TMS320C30 code for performing transforms.

The powerful architecture and instruction set of the TMS320C30 permit flexible and compact coding of the algorithms in assembly language while preserving close correspondence to a high-level language implementation. The efficiency of the architecture and the speed of the device make faster realization of real and complex transforms possible. With the availability of a C compiler, these routines can be put in C-callable form and used as faster versions of FFT C functions.

## Introduction

The Fast Fourier Transform (FFT) is an important tool used in Digital Signal Processing (DSP) applications. Its development by Cooley and Tukey gave impetus to the establishment of DSP as an independent discipline. The well-structured form of the FFT has also made it one of the benchmarks in assessing the performance of number-crunching devices and systems.

In recent years, because of the popularity of this signal-processing tool, there have been efforts to improve its performance by advances both at the algorithmic level and in hardware implementation. Researchers have been developing efficient algorithms to increase the execution speed of FFTs while keeping requirements for memory size low. On the other hand, developers of VLSI systems are including features in their designs that improve system performance for applications requiring FFTs. In particular, single-chip programmable DSP devices, currently available or under development, can realize FFTs with speeds that allow the implementation of very complex systems in real time.

The Texas Instruments TMS320 family consists of five generations of programmable digital signal processors. The TMS32010 introduced the first generation, which today encompasses more than twelve devices with various speeds, interfacing capabilities, and price/performance combinations. FFT implementations on the TMS32010 can be found in the appendix of the book by Burrus and Parks [1].

The second-generation TMS320 devices (the TMS32020, the TMS320C25, and their spinoffs) enhanced the architecture and speed capabilities of the first generation. Examples of FFT programs implemented on the TMS32020 can be found in an application report in the book Digital Signal Processing Applications with the TMS320 Family [2]. Such programs are easily extended to the TMS320C25 because of the code compatibility between devices.

The architectural and speed improvements on the processors from one generation to the next have made the FFT computation faster and the programming easier. These advantages have reached a new high level in the third generation. The TMS320C30 is the first device in the third generation, and this report examines implementation of the FFT algorithms on it. The fourth generation (TMS320C4x) is a new set of floating-point devices, while the fifth generation (TMS320C5x) is a continuation of the fixed-point devices. Since software compatibility is maintained within the fixed-point and the floating-point devices, the existing FFT implementations will also be applicable to these new generations.

The Fourier Transform of an analog signal $x(t)$, given as

$$
\begin{equation*}
X(\omega)=\int_{-\infty}^{\infty} x(t) e^{-j \omega t} d t \tag{1}
\end{equation*}
$$

determines the frequency content of the signal $x(t)$. In other words, for every frequency, the Fourier transform $X(\omega)$ determines the contribution of a sinusoid of that frequency in the composition of the signal $x(t)$. For computations on a digital computer, the signal $x(t)$ is sampled at discrete-time instants. If the input signal is digitized, a sequence of numbers $x(n)$ is available instead of the continuous-time signal $x(t)$. Then, the Fourier transform takes the form

$$
\begin{equation*}
X\left(e^{j \omega}\right)=\sum_{n=-\infty}^{\infty} x(n) e^{-j \omega n} \tag{2}
\end{equation*}
$$

The resulting transform $X\left(e^{j \omega}\right)$ is a periodic function of $\omega$, and it needs to be computed for only one period. The actual computation of the Fourier transform of a stream of data presents difficulties because $X\left(e^{j \omega}\right)$ is a continuous function in $\omega$. Since the transform must be computed at discrete points, the properties of the Fourier transform led to the definition of the Discrete Fourier Transform (DFT), given by

$$
\begin{equation*}
X(k)=\sum_{n=0}^{N-1} x(n) e^{-\frac{j 2 \pi k n}{N}} \tag{3}
\end{equation*}
$$

When $x(n)$ consists of $N$ points $x(0), x(1), \ldots, x(N-1)$, the frequency-domain representation is given by the set of $N$ points $X(k), k=0,1, \ldots, N-1$. Equation (3) is often written in the form

$$
\begin{equation*}
X(k)=\sum_{n=0}^{N-1} \quad x(n) W_{N}^{n k} \tag{4}
\end{equation*}
$$

where $W_{N}^{n k}=e^{-j 2 \pi n k / N}$. The factor $W_{N}$ is sometimes referred to as the twiddle factor. A detailed description of the DFT can be found in references [1,3,4]. The computational requirements of the DFT increase rapidly with increasing block size $N$, having an impact on the real-time system performance. This problem was alleviated with the development of special fast algorithms, collectively known as the Fast Fourier Transform (FFT). With an FFT, the computational burden increases much less rapidly with $N$, and for any given $N$, the FFT computational load, measured in terms of required multiplications and additions, is smaller than a brute-force computation of the DFT.

The definition of the FFT is identical to the DFT: only the method of computation differs. To achieve the efficiency of an FFT, it is important that $N$ be a highly composite number. Typically, the length $N$ of the FFT is a power of 2, $N=2^{M}$, and the whole algorithm breaks down into a repeated application of an elementary transform known as a butterfly. If $N$ is not a power of 2, the sequence $x(n)$ is appended with enough zeroes to make the total length a power of 2. Again, references [1,3,4] contain a detailed development of the FFT. Reference [2] also discusses the same topic.

## Different Forms of the FFT

Over the years, researchers have developed different forms of FFT for more efficient computation. Special cases, such as those in which the input is a sequence of real numbers, have been investigated, and even more sophisticated algorithms have been developed. The general form of the FFT butterfly is given in Figure 1.


Figure 1. Radix-2 Butterfly for Decimation in Time.

If the inputs to the butterfly are the two complex numbers $P$ and $Q$, the outputs will be the complex numbers $P^{\prime}$ and $Q^{\prime}$, such that

$$
\begin{equation*}
P^{\prime}=P+Q W_{N}^{k} \tag{5}
\end{equation*}
$$

and

$$
\begin{equation*}
Q^{\prime}=P-Q W_{N}^{k} \tag{6}
\end{equation*}
$$

The quantities $P, Q$ and $P^{\prime}, Q^{\prime}$ represent different points in the array being transformed, and they may or may not occupy adjacent locations in that array. In an in-place computation, the result $P^{\prime}$ will overwrite $P$, and $Q^{\prime}$ will overwrite $Q$. Again, $W_{N}^{k}$ is the twiddle factor, and its exponent is determined by the location of the corresponding butterfly in the FFT algorithm.

Figure 2 shows an alternate form of the same FFT butterfly.


Figure 2. Alternate Form of Radix-2 Butterfly for Decimation in Time.

Although the notation is now less descriptive, it creates a clearer picture when several butterflies are put together to form an FFT. Using the first notation, Figure 3 is the flowgraph of an 8-point FFT example.


Figure 3. Example of 8-Point FFT with Decimation in Time.

Note that the input sequence $x(n)$ is in the correct order, while the output $X(k)$ is scrambled. This scrambling occurs in a very systematic way, called bit-reversed order: if you write the index of an output point in binary and reverse the order of the bits, the result is the position that this particular point occupies. For instance, $X(3)$ occupies the sixth position in the output (when counting from the zero position). In binary form, $3_{10}=011_{2}$, and reversing the bits gives $110_{2}=6_{10}$, which is the position that $X(3)$ occupies. Conversely, the third position is occupied by $X(6)$, and to restore the correct order at the output, you need only swap these two numbers.

The same procedure can be repeated with all the scrambled numbers not occupying the position that their index suggests. If the input sequence $x(n)$ is rearranged to appear in bit-reversed form, the output $X(k)$ appears in the correct order, as shown in Figure 4.
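The index calculation and the pairwise swap described above can be sketched as follows. The names `bit_reverse` and `bit_reverse_permute` are hypothetical, and the sketch works on an array of real values for brevity.

```c
/* Reverse the lowest `bits` bits of index i (e.g. 3 = 011b -> 110b = 6). */
unsigned bit_reverse(unsigned i, unsigned bits)
{
    unsigned r = 0;
    for (unsigned b = 0; b < bits; b++) {
        r = (r << 1) | (i & 1u);   /* move the lowest bit of i into r */
        i >>= 1;
    }
    return r;
}

/* Reorder x[0..n-1] in place, n = 2^bits, into bit-reversed order. */
void bit_reverse_permute(double *x, unsigned bits)
{
    unsigned n = 1u << bits;
    for (unsigned i = 0; i < n; i++) {
        unsigned j = bit_reverse(i, bits);
        if (i < j) {               /* swap each pair once; skip fixed points */
            double t = x[i];
            x[i] = x[j];
            x[j] = t;
        }
    }
}
```

Because bit reversal is its own inverse, applying the permutation twice restores the original order.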


Figure 4. Alternate Form of 8-Point FFT with Decimation in Time. The Input Is in Bit-Reversed Order and the Output Is in the Correct Order.

Since the only difference between Figures 3 and 4 is a rearrangement of the butterflies, the computational load and the final results are identical. In terms of implementation, this rearrangement means that the nesting of the two innermost loops in the FFT routine is interchanged.

The butterflies and the FFT configurations presented thus far implement the FFT with a decimation in time. This terminology essentially describes a way of grouping the terms of the DFT definition; see Equation (3). An alternative way of grouping the DFT terms together is called decimation in frequency (DIF). Figures 5 and 6 show the same example of an 8-point FFT using decimation in frequency: Figure 5 with the input in correct order and the output in bit-reversed order, and Figure 6 with the input in bit-reversed order and the output in correct order.


Figure 5. Example of an 8-Point FFT with Decimation in Frequency.


Figure 6. Alternate Form of 8-Point FFT with Decimation in Frequency. The Input Is in Bit-Reversed Order and the Output Is in the Correct Order

Pictorially, the difference between decimation in time and decimation in frequency is that the twiddle factor appears at the input of the butterfly in the former and at the output in the latter. Otherwise, the two methods are identical in terms of results. However, depending on which order of fetching the twiddle factors is most convenient and where the longest-span butterfly appears, you may prefer one method over the other.

The butterfly shown in Figure 1 (or Figure 2) is the smallest element in a radix-2 FFT. The radix of the FFT represents the number of inputs that are combined in a butterfly. The Fast Fourier Transform is usually explained around the radix-2 algorithm for conceptual simplicity. If, however, higher-order radices are used, more computational savings can be achieved. These savings increase with the radix, but there is very little improvement above radix 4. That's why the radix-2 and radix-4 FFTs are the most commonly used algorithms.

In a radix-4 FFT, each butterfly has 4 inputs and 4 outputs, essentially combining two stages of a radix-2 algorithm in one. Figure 7 shows this combination graphically.


Figure 7. Butterfly for Radix-4, Decimation-in-Time FFT.

Although four radix-2 butterflies are combined into one radix-4 butterfly, the computational load of the latter is less than four times the load of a radix-2 butterfly. Examples of radix-4, 16-point FFTs are shown in Figures 8 and 9 for decimation in time and decimation in frequency, respectively.


Figure 8. Example of a 16-Point, Radix-4, Decimation-in-Time FFT.


Figure 9. Example of a 16-Point, Radix-4, Decimation-in-Frequency FFT.

These configurations take the incoming sequence in order and produce the frequency-domain result in digit-reversed form. It is a simple matter to rearrange the FFT and have the input in digit-reversed form and the output in order.

Digit reversal is similar to bit reversal, except that the number whose digits are reversed is written in base 4 (equal to the radix) rather than base 2. For example, the output value $X(14)$ in a 16-point, radix-4 FFT occupies position eleven (again starting from zero) because $14_{10}=32_{4}$ and, reversing the digits of the number, $23_{4}=11_{10}$. To restore the output to the correct order, the contents of locations with digit-reversed indices should be swapped. However, since the TMS320C30 has a special bit-reversed addressing mode, it is desirable to have the output of the radix-4 computation in bit-reversed rather than digit-reversed form. This is accomplished quite simply if, in each radix-4 butterfly, the two middle output legs are interchanged. That is, whenever the output of the butterfly is the four numbers $A^{\prime}, B^{\prime}, C^{\prime}$, and $D^{\prime}$, instead of storing them in that order, store them in the order $A^{\prime}, C^{\prime}, B^{\prime}$, and $D^{\prime}$, as shown in Figure 10.

Figure 10. Radix-4 Butterflies. (a) Regularly-Ordered Output, (b) Bit-Reversed Output.

References [5,6] explain why this simple rearrangement puts the result in bit-reversed order.
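One way to see the connection, offered here as an illustration rather than as a summary of those references, is that a bit-reversed index equals the digit-reversed index with the two bits inside each base-4 digit exchanged. Exchanging those bits maps digit 1 to 2 and 2 to 1 while leaving 0 and 3 alone, which corresponds to interchanging the two middle legs. The function names below are hypothetical.

```c
/* Reverse the lowest `digits` base-4 digits of index i
 * (e.g. 14 = 32 in base 4 -> 23 in base 4 = 11). */
unsigned digit_reverse4(unsigned i, unsigned digits)
{
    unsigned r = 0;
    for (unsigned d = 0; d < digits; d++) {
        r = (r << 2) | (i & 3u);   /* append the lowest base-4 digit of i */
        i >>= 2;
    }
    return r;
}

/* Exchange the two bits inside every base-4 digit:
 * digits 1 and 2 trade places, digits 0 and 3 are unchanged. */
unsigned swap_bits_in_digits(unsigned i)
{
    return ((i & 0x55555555u) << 1) | ((i >> 1) & 0x55555555u);
}
```

For the example in the text, `digit_reverse4(14, 2)` gives 11, and `swap_bits_in_digits(11)` gives 7, which is the 4-bit reversal of 14.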

## Features of the TMS320C30

The TMS320C30 is the first device introduced in the third generation of the TMS320 Digital Signal Processors [7,8]. It has many architectural features that permit very efficient implementation of algorithms. Some of those features pertinent to the FFT implementation are discussed in this section.

The two most salient characteristics of the TMS320C30 device are its high speed (60-ns cycle time) and floating-point arithmetic. The higher speed makes the implementation of real-time applications easier than in earlier processors, even when the other architectural advantages are not considered. Each instruction executes in a single cycle under mild pipeline restrictions. The device automatically takes care of any potential conflicts. The pipeline should be observed closely (e.g., using the trace capability of the simulator) only if code optimization for speed is required.

The floating-point capability permits the handling of numbers of high dynamic range without concern for overflows. In FFT programs, in particular, the computed values tend to increase from one stage to the next, as discussed in reference [2]. With fixed-point arithmetic, this growth causes overflows if the incoming numbers are large enough and no provisions are made for scaling. All these considerations are eliminated with the floating-point capability of the TMS320C30. The TMS320C30 performs floating-point arithmetic with the same speed as any fixed-point operation; no performance is sacrificed for this feature.

There are eight extended-precision registers, R0-R7, that can be used as accumulators or general-purpose registers, and eight auxiliary registers, AR0-AR7, for addressing and integer arithmetic. For many applications, these registers are sufficient for temporary storage of values, and there is no need to use memory locations. This is the case with the radix-2 FFT algorithm, where no memory locations are required other than those holding the data being transformed. Also, arithmetic using these registers greatly increases the programming efficiency. The two index registers, IR0 and IR1, are used for indexing the contents of the auxiliary registers AR0-AR7, thus making the access of the butterfly legs and the twiddle factors easy.

A powerful structure in the TMS320C30 is the block-repeat capability that has the form

|  | RPTB LABEL |
| :--- | :--- |
| put instructions here |  |

Whatever occurs after the RPTB instruction and up to the LABEL is repeated one time more than the number included in the repeat counter register, RC. The RC register must be initialized before entering the block-repeat construct. The net effect is that the repeated code behaves as if it were straight-line coded (no penalty for looping), with program size equal to the one in looped code. In this way, the FFT butterfly, being the core of the program, can be implemented in a block-repeat form, thereby saving execution time while preserving the clarity of the program and conserving program space.

A bit-reversed addressing mode is available to eliminate the need for swapping memory locations at the beginning or the end of the FFT (depending on the FFT type). When you use this addressing mode, you access a sequence of data points in bit-reversed order rather than sequentially, and you can recover the points in the correct order during retrieval of the data instead of spending extra cycles to accomplish it in software.
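The effect of this addressing mode can be modeled in software as an addition whose carry propagates toward the least significant bit. The sketch below is such a model in C, with hypothetical function names; it is meant to show the access pattern only and does not describe the register setup on the device.

```c
/* Next index in bit-reversed order for an n-point array (n a power of 2):
 * add n/2 with the carry propagating toward the least significant bit. */
unsigned next_bit_reversed(unsigned i, unsigned n)
{
    unsigned bit = n >> 1;
    while (bit && (i & bit)) {   /* carry: clear this bit, move one to the right */
        i ^= bit;
        bit >>= 1;
    }
    return i | bit;
}

/* Read src in bit-reversed order and store the points sequentially in dst. */
void copy_bit_reversed(const float *src, float *dst, unsigned n)
{
    unsigned r = 0;
    for (unsigned i = 0; i < n; i++) {
        dst[i] = src[r];
        r = next_bit_reversed(r, n);
    }
}
```

For n = 8, the generated index sequence is 0, 4, 2, 6, 1, 5, 3, 7. In software this costs a short loop per access, which is the overhead the hardware addressing mode removes.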

## Implementation of Radix-2 and Radix-4 Complex FFTs

Because of the powerful architecture and the instruction set of the TMS320C30, the assembly language program follows closely the flow of a high-level language program; this makes it easy to read and debug. It also keeps the size of the program small and reduces the requirements for program memory. Appendix A presents an example of code for a radix-2 complex FFT, while Appendix B presents a radix-4 complex FFT. The program memory requirements for these programs (as well as others to be discussed later) are given in Table 1.

## Table 1. Program Memory Requirements for the Core of the FFT and Hartley Transforms

| Routine Type | Program Size |
| :--- | :---: |
| Radix-2, complex FFT | 50 words |
| Radix-4, complex FFT | 170 words |
| Radix-2, real FFT | 68 words |
| Radix-2, real inverse FFT | 76 words |
| Hartley transform | 71 words |

The numbers in the table correspond only to the core program and do not include the sine/cosine tables for the twiddle factors, any input/output, or any bit-reversing operations. Note also that they are independent of the FFT data size.

The data memory requirements are, of course, dependent on the FFT size. The maximum length of a complex, radix-2 FFT that can be implemented entirely on the internal memory of the TMS320C30 is 1024 points. In the present implementation, the 1024-point radix-4 FFT requires a few more locations (about 7) than are available on-chip.

The code (provided in the appendices) has been written to be independent of the FFT length. The length $N$, together with the sine/cosine tables for the twiddle factors, should be provided separately to maintain the generic nature of the core FFT program. An example of a file with the sine/cosine tables for a 64-point FFT is given in Appendix F. Note that the FFT size and the number of stages are declared .global in both files (i.e., the main routine and the file with the table) so that the core program gets the actual values during linking.

To reduce the storage requirements of a sine/cosine table, a full sine and a cosine cycle are overlapped. The table stores $5/4$ of a full sine wave, with the cosine table starting with a phase delay of $1/4$ cycle from the sine table. This table size is larger than actually needed, and it is selected merely for testing convenience of the algorithms. The minimum table size for a radix-2 complex FFT includes $1/2$ of a full sine wave and $1/2$ of a full cosine wave. If these two half waves are combined using the above quarter-cycle phase delay, the minimum table size for this kind of FFT is $3/4$ of a full sine wave. For instance, for a 1024-point FFT, the table can be the first 768 points of a sine wave, where a full cycle would be 1024 points. In the case of a radix-4 complex FFT, the minimum table size should include $3/4$ of a sine and $3/4$ of a cosine wave. Overlapping these requirements, we get the minimum table size of a radix-4 algorithm to be one full sine wave.

An example of a linking file is also included in Appendix F to show how the different segments are assigned. For a complete description of the assembler and linker, consult the corresponding manual [6].

The timing of the FFT routines was done using the cycle-counting capability of the TMS320C30 simulator. For the conversion of the number of cycles into seconds, a cycle time of 60 ns was used. The timing refers only to the core FFT computation, ignoring read-in and write-out requirements, since such requirements are application-dependent. Also, no bit reversal is counted (although it may be included in the program), since it is performed as part of the read-in or read-out. Table 2 gives the timing for the different FFT routines and for the Hartley transform.

Table 2. FFT Timing in Milliseconds

| Transform <br> Size | Radix-2 <br> Complex <br> FFT | Radix-4 <br> Complex <br> FFT | Radix-2 <br> Real <br> FFT | Radix-2 <br> Real <br> Inverse FFT | Hartley <br> Transform |
| :---: | :---: | :---: | :---: | :---: | :---: |
| 64 | 0.165 | 0.123 | 0.077 | 0.085 | 0.081 |
| 128 | 0.370 | - | 0.174 | 0.193 | 0.181 |
| 256 | 0.816 | 0.624 | 0.387 | 0.434 | 0.403 |
| 512 | 1.784 | - | 0.857 | 0.964 | 1.132 |
| 1024 | 3.873 | 3.040 | 1.879 | 2.124 | 2.430 |
| 1024 | 2.366 |  |  |  |  |

For the complex FFTs, the radix-4 algorithm reduces the execution time by 20-25% compared to radix-2, depending on the FFT size. The last entry in this table represents the timing of the radix-2, DIT routine generated at the University of Erlangen [18] and given in Appendix A. These numbers are typically used for benchmarking.

## Implementation of Real FFT

The development of FFT algorithms is centered mostly around the assumption that the input sequence consists of complex numbers (as does the output). This assumption guarantees the generality of the algorithm. However, in a large number of actual applications, the input is a sequence of real numbers. If this condition is taken into consideration, additional computational savings can be achieved because the FFT of a real sequence demonstrates the following symmetries: Assuming that the FFT output $X(k)$ is complex,

$$
\begin{equation*}
X(k)=R(k)+j I(k) \tag{7}
\end{equation*}
$$

and that the sequence has length $N$, $R(k)$ and $I(k)$ should satisfy the following relations:

$$
\begin{align*}
& R(k)=R(N-k), k=1, \ldots, N / 2-1  \tag{8}\\
& I(k)=-I(N-k), k=1, \ldots, N / 2-1  \tag{9}\\
& I(0)=I(N / 2)=0 . \tag{10}
\end{align*}
$$

In other words, the real part of the transform is symmetric around zero frequency, while the imaginary part is antisymmetric. Similar conditions hold if the transform is expressed in terms of magnitude and phase.

The savings are due to the fact that not all points need to be computed. Since the points that are not computed do not need to be saved either, there are also storage savings. An efficient algorithm for real-valued FFTs is described in [10]. This algorithm was implemented in the present study in such a way that, given the sequence of $N$ real numbers $x(0), x(1), \ldots, x(N-1)$, the resulting FFT, consisting of complex numbers, is stored as $R(0), R(1), \ldots, R(N / 2), I(N / 2-1), I(N / 2-2), \ldots, I(1)$. $R(k)$ and $I(k)$ represent the real and imaginary parts of the complex number $X(k)$. Figure 11 shows the memory arrangement for the FFT. Note that the input to the real FFT should be bit-reversed, but the bit reversal can be done as the data is brought in. With this arrangement, an $N$-point FFT uses exactly $N$ memory locations. If the full array $X(k)$ is needed, the following relations should be used:

$$
\begin{align*}
& X(0)=R(0)  \tag{11}\\
& X(k)=R(k)+j I(k), k=1, \ldots, N / 2-1  \tag{12}\\
& X(N / 2)=R(N / 2)  \tag{13}\\
& X(k)=R(N-k)-j I(N-k), k=N / 2+1, \ldots, N-1 \tag{14}
\end{align*}
$$



Figure 11. Memory Arrangement of a Real FFT.

It is expected that, in most signal processing applications, there will be no need to reconstruct the full $X(k)$ array and that the output shown in Figure 11 will be sufficient for any further processing.

Appendix C contains TMS320C30 routines implementing a radix-2 real FFT and its inverse. The implementation of the forward transformation is based on the FORTRAN programs contained in [10]. The inverse transformation assumes that the input data are given in the order presented at the output of the forward transformation and produces a time signal in the proper order (i.e., bit-reversing takes place at the end of the program). Viewed another way, the inverse real FFT operates as shown in Figure 11 but with the arrows reversed (and inverse FFT taking the place of the FFT).

The timing for the real-valued FFT (both forward and inverse) is included in Table 2, and the corresponding program sizes are shown in Table 1. As you can see, the real-valued FFT is considerably faster than the corresponding complex FFT because not all the computations need be performed. Furthermore, there are data storage savings because only half the values must be stored. As a result, the maximum length of real-valued FFT that can be implemented on the TMS320C30 without using any external memory is 2048 points. Of course, if all the values are needed, they can be recovered using the symmetry conditions mentioned earlier. To achieve the efficiencies of real FFT and not use any extra memory locations during the computation, the decimation-in-time method is applied [10]. Decimation in time requires the bit-reversal operation in the forward transform to be performed at the beginning of the program rather than at the end. The reverse is true for bit-reversing in the inverse transform.

## The Discrete Hartley Transform

Another transform that has attracted attention recently is the Discrete Hartley Transform (DHT) [11, 12]. The DHT is applicable to real-valued signals and is closely related to the real-valued FFT. Comparison of references [10] and [12], which describe the implementation of the two algorithms in FORTRAN programs, shows that their implementation on the TMS320C30 should be similar. And indeed, this is the case.

The DHT pair is defined for a real-valued sequence $x(n), n=0, \ldots, N-1$, by the following equations:

$$
\begin{align*}
& H(k)=\sum_{n=0}^{N-1} x(n) \operatorname{cas}(2 \pi k n / N), \quad k=0, \ldots, N-1  \tag{15}\\
& x(n)=\frac{1}{N} \sum_{k=0}^{N-1} H(k) \operatorname{cas}(2 \pi k n / N), \quad n=0, \ldots, N-1 \tag{16}
\end{align*}
$$

where $\operatorname{cas}(x)=\cos (x)+\sin (x)$. The DHT demonstrates a symmetry that is convenient for implementations: The same program can be used for both the forward and the inverse transforms, and the result is correct within a scale factor. Also, the real FFT and the DHT can be derived from each other [12].

A radix-2 Hartley transform was implemented on the TMS320C30, and the corresponding code is included in Appendix D. This code follows the structure of the real FFT in Appendix C. Tables 1 and 2 show the program memory requirements and the timing for the execution of Hartley transforms of different sizes. The sine/cosine table sizes are the same as in the case of a real FFT.

## The Discrete Cosine Transform

The Discrete Cosine Transform (DCT), since its introduction in 1974 [13], has gained popularity in speech and image processing applications because of its near-optimal behavior. This discussion is based on the paper by Lee [14]. The DCT code was developed and implemented by Paul Wilhelm of the University of Washington.

If $x(n), n=0, \ldots, N-1$ is a time-domain signal and $X(k)$ is the corresponding DCT, $x(n)$ and $X(k)$ are related by the following equations:

$$
\begin{align*}
& X(k)=\frac{2}{N} e(k) \sum_{n=0}^{N-1} x(n) \cos \frac{(2 n+1) \pi k}{2 N}  \tag{17}\\
& x(n)=\sum_{k=0}^{N-1} e(k) X(k) \cos \frac{(2 n+1) \pi k}{2 N}  \tag{18}\\
& e(0)=1 / \sqrt{2}  \tag{19}\\
& e(k)=1, \quad \text { for } k \neq 0 \tag{20}
\end{align*}
$$

Appendix E shows an implementation of the DCT based on the paper by Lee [14]. The appendix contains the algorithms for both the forward and the inverse transformations and an example of a table for a 16-point DCT. Note that, because of the structure of the algorithm, the cosine table actually contains the inverses of the cosines (within a scale factor), and it is not stored in the natural order. Instead, it is generated by the following C code:

```c
#include <math.h>

void make_cos_table(double *cos_table, int N)   /* N: transform size */
{
    const double pi = 3.14159265358979323846;
    int i = 0, j, k;
    for (k = 2; k <= N / 2; k *= 2)
        for (j = k / 2; j < N / 2; j += k) {
            cos_table[i++] = 1.0 / (2.0 * cos(j * pi / (2 * N)));
            cos_table[i++] = 1.0 / (2.0 * cos((N - j) * pi / (2 * N)));
        }
    cos_table[N - 2] = cos(pi / 4);   /* the loops fill entries 0..N-3 */
    cos_table[N - 1] = 2.0 / N;       /* constant used by the algorithm */
}
```

The last entry to the table is not part of the cosine itself; it is a constant that is used by the algorithm, and it is placed at the end of the cosine table for convenience.

Table 3 shows the timing of the forward and inverse transforms for different transform lengths. The difference in the timing between the forward and the inverse transforms is due to the fact that more time was expended to optimize the performance of the inverse transform. Since four of the smallest butterflies were done simultaneously in the center program loop, the minimum permissible array size to be transformed is 8 .

Table 3. DCT Timing in Milliseconds

| Transform <br> Size | Forward <br> Transform | Inverse <br> Transform |
| :---: | :---: | :---: |
| 16 | 0.023 | 0.020 |
| 64 | 0.105 | 0.088 |
| 128 | 0.230 | 0.193 |
| 256 | 0.502 | 0.416 |
| 512 | 1.094 | 0.905 |
| 1024 | 2.378 | 1.982 |

## Other Related Transforms

In addition to the FFT types mentioned earlier (complex, real, decimation-in-time, decimation-in-frequency, etc.), newer forms of the FFT have been developed to reduce the computational load. One of the latest in the literature is the Split-Radix FFT. The Split-Radix FFT [16] has the lowest number of multiplies and adds of any known algorithm. It achieves this efficiency by combining certain radix-2 and radix-4 butterflies, but, as a result, the classical concept of FFT stages is lost. The new structure uses a rather complicated indexing scheme, which is the price paid for the reduced multiplies/adds. Since, on the TMS320C30, multiplies/adds are not more expensive computationally than any other operation, the indexing scheme wipes out the gains of the reduced arithmetic. Actually, an implementation of the split-radix FFT showed it to be slower than the radix-2 FFT, one of the main reasons being that the block-repeat structure could no longer be used effectively.

Very often, the question arises of what the different benchmark numbers mean. A useful comparison of execution times for different algorithms on different machines has been made [17]. Table 4 presents a small segment of the resulting information that is relevant to the present discussion: the timing in seconds for the radix-8, mix-radix, and split-radix algorithms that were implemented on various machines. Different operating systems and compilers have been used, as shown. The execution times of Table 4 should be compared with the 0.001879 s that it takes to implement a 1024-point, radix-2, real FFT on a TMS320C30. As can be seen, the TMS320C30 compares favorably to all the other machines investigated.

Table 4. Execution Times in Seconds for a 1024-Point Real FFT. The Numbers Should Be Compared with 0.001879 s of a 1024-Point Real FFT on the TMS320C30

| Machine | Radix-8 | Mix-radix | Split-radix |
| :--- | :--- | :--- | :--- |
| VAX 750 UNIX BSD4.2 f77 | 0.3634 | 0.3902 | 0.3021 |
| VAX 750 UNIX BSD4.2 f77 -O | 0.2376 | 0.2948 | 0.2089 |
| VAX 750 UNIX BSD4.3 f77 | 0.2545 | 0.2600 | 0.2371 |
| VAX 750 UNIX BSD4.3 f77 -O | 0.1825 | 0.2127 | 0.1672 |
| VAX 785 ULTRIX f77 | 0.1046 | 0.1107 | 0.1101 |
| VAX 785 ULTRIX f77 -O | 0.0796 | 0.0943 | 0.0811 |
| VAX 785 VMS FOR/NOOPTM | 0.0767 | 0.0871 | 0.0975 |
| VAX 785 VMS FOR/OPTM | 0.0539 | 0.0641 | 0.0633 |
| VAX 8600 VMS FOR/OPTM | 0.0217 | 0.0243 | 0.0235 |
| MICROVAX VMS FOR/NOOPTM | 0.1671 | 0.1846 | 0.1864 |
| MICROVAX VMS FOR/OPTM | 0.1299 | 0.1527 | 0.1419 |
| DEC-10 TOPS-10 FOR/NOOPTM | 0.0940 | 0.1184 | 0.0991 |
| DEC-10 TOPS-10 FOR/OPTM | 0.0885 | 0.1110 | 0.0845 |
| CDC 855 FTN5,OPT=0 | 0.0277 | 0.0319 | 0.0338 |
| CDC 855 FTN5,OPT=1 | 0.0277 | 0.0316 | 0.0337 |
| CDC 855 FTN5,OPT=2 | 0.0182 | 0.0171 | 0.0151 |
| CDC 855 FTN5,OPT=3 | 0.0180 | 0.0173 | 0.0150 |
| SUN 3/50 UNIX BSD4.2 f77 -O -f68881 | 0.2518 | 0.3365 | 0.2103 |
| SUN 3/50 UNIX BSD4.2 f77 -f68881 | 0.2806 | 0.3897 | 0.2802 |
| SUN 3/50 UNIX BSD4.2 f77 -O | 0.7586 | 1.047 | 0.6955 |
| SUN 3/50 UNIX BSD4.2 f77 | 0.7476 | 1.029 | 0.7033 |
| SUN 3/160 UNIX BSD4.2 f77 | 0.6037 | 0.6895 | 0.5660 |
| SUN 3/160 UNIX BSD4.2 f77 -pfa | 0.0983 | 0.1060 | 0.0946 |
| SUN 3/260 UNIX BSD4.3 f77 | 0.3689 | 0.4126 | 0.3390 |
| SUN 3/260 UNIX BSD4.3 f77 -O | 0.3530 | 0.4142 | 0.3297 |
| Pyramid 90X UNIX BSD4.2 f77 -O | 0.2053 | 0.2244 | 0.1416 |
| Pyramid 90X UNIX BSD4.2 f77 | 0.2206 | 0.2457 | 0.1326 |
| HP-1000 21MX-E FTN7X | 0.9400 | 1.248 | 0.9478 |
| Apple MAC Microsoft FOR | 2.6670 | 3.1600 | 2.8260 |
| AST PC Microsoft FOR | 1.5040 | 2.0800 | 1.4630 |

## The TMS320C30 C Compiler

The C compiler for the TMS320C30 permits easy porting of high-level language programs to the DSP device. If the CPU loading of a particular application is not very high, the C compiler can create programs that run on the TMS320C30 in real time. If, however, the compiled code cannot keep up with real time, it may be necessary to use assembly language for more efficient coding.

In most cases, only a portion of the code needs to be written in assembly language. Typically, there are a few code segments where the device spends most of the time and which, when optimized in assembly language, yield the necessary performance improvement. By following the conventions outlined in the run-time environment of the C compiler [15], you can write these time-critical routines in assembly language and call them in a C program. This is also true for the FFT routines. In Appendices A, B, and C, the radix-2, radix-4, and real FFT routines mentioned earlier are also put in a C-callable form by adding the necessary interface at the beginning and the end of the code. The tables with the sines and cosines are again assumed to be supplied during link time.

## Issues in FFT Implementation

There are many ways of actually implementing the FFT code (and the other transformations), taking into consideration the different possibilities of program locations, the data locations, the ways of input and output, etc. Since it is impractical to cover every possible case, this report has concentrated on a configuration in which the use of external memory is minimized. With the source code and additional explanations provided, you should be able to customize the FFT implementation for a particular application.

## Use of External Memory

In these implementations, only on-chip memory was used, which is why the maximum transform size considered was 1024 points (2048 for a real transform). Often, though, applications call for external memory for program, data, or both. When external memory is used, the structure of the code does not change at all; only the timing may be affected.

Fast external memory can be selected so that no wait states are necessary. But even when there are no wait states, accessing external memory may impose some limitations. For instance, you can make only one external memory access in a full cycle, but you can make two accesses of internal memory in each cycle. Also, because of multiplexing of the buses, pipeline conflicts may arise if both program and data are placed on the same external port. Resolving such conflicts costs extra execution cycles. The section on pipelining in the TMS320C30 User's Guide explains in detail what kind of potential conflicts may occur.

To minimize or avoid such conflicts, there are some simple steps that the designer can take. The TMS320C30 has three separate memory areas (one on-chip, one accessed by the primary bus, and one accessed by the expansion bus) that can be combined. For instance, the program can be placed on the expansion port and the data on the primary port. Or the data can first be brought into internal memory and then operated upon. Alternatively, the program may be relocated to internal memory. A related approach is to use the cache. All the transforms are implemented as loops that are executed many times. If you activate the on-chip cache after the first access of the code, the instructions execute from the cache instead of the external memory.

If there are additional conflicts, they can typically be resolved by some rearrangement of the code. For instance, consecutively writing to external memory takes two cycles per write. If, however, a write is followed by some internal operation, then the second cycle of the write is transparent, and the actual cost is one cycle.

## Bit Reversal

The TMS320C30 has a special form of the indirect addressing mode for the bit-reversing operation that is required at the beginning or the end of an FFT. Through this addressing mode, the scrambled data are accessed in their proper order. This addressing mode works as follows:

Let ARn (n = 0, ..., 7) be the auxiliary register pointing to the array with scrambled data. The index register IR0 contains a number equal to one-half the size of the FFT. Then, after every access of the data, ARn is incremented by IR0 using the construct
*ARn++(IR0)B

This causes the contents of ARn to be incremented by the contents of IR0, but if there is a carry in this incrementing, the carry propagates to the right instead of to the left. The result is the generation of the addresses in a bit-reversed order. The bit-reversed addressing mode works correctly if the array with the data is aligned in memory so that the first memory address is a multiple of the FFT size. This can be achieved if the first memory address has zeros for the last $M$ bits, where $M=\log _{2} N$, with $N$ being the FFT size. For example, in the case of a 1024-point FFT, the last 10 bits of the memory address of the first datum should be zeros.
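The reverse-carry increment can be modelled in a few lines of Python. This is only an illustrative sketch, not TMS320C30 code: the helper names `reverse_carry_add` and `bit_reversed_sequence` are hypothetical, and the model tracks only the low $M$ address bits, assuming the array base is aligned as described above.

```python
def reverse_carry_add(addr, step, nbits):
    """Add step to addr with carries propagating toward the LSB.

    Modelled by bit-reversing both operands, adding normally,
    and bit-reversing the (truncated) sum.
    """
    def rev(x):
        r = 0
        for _ in range(nbits):
            r = (r << 1) | (x & 1)
            x >>= 1
        return r

    mask = (1 << nbits) - 1
    return rev((rev(addr) + rev(step)) & mask)


def bit_reversed_sequence(n):
    """Offsets visited by N accesses with IR0 = N/2 (N a power of two)."""
    nbits = n.bit_length() - 1      # M = log2(N)
    addr, out = 0, []
    for _ in range(n):
        out.append(addr)
        addr = reverse_carry_add(addr, n // 2, nbits)
    return out


print(bit_reversed_sequence(8))     # [0, 4, 2, 6, 1, 5, 3, 7]
```

For an 8-point transform the model steps by 4 (binary 100); each carry moves one bit position to the right, which yields the familiar bit-reversed ordering.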

In the implementation of the complex FFT, the output is complex even when the input is real. So, there is a need to consider both the real and the imaginary parts of the data array. The above description of the bit-reversed addressing mode assumed that the real and the imaginary parts are stored as separate arrays in the memory. In this case, each of the arrays (real or imaginary parts) can be accessed as described. However, in most cases (including this report), the real and imaginary parts alternate in the same array.

In this arrangement, the following simple modification achieves the same goal: set IR0 equal to $N$ instead of $N / 2$, and access the $N$ points of the transform. At every access, the auxiliary register is pointing to the real part of the FFT. The imaginary part is located in the next higher location, and it can be easily accessed.

With the bit-reversed addressing mode, the unscrambling of the data can take place when the FFT result is accessed for further processing or for I/O. It is possible, though, that certain applications demand the reordering of the data in the same array. Such a rearrangement can be done very simply for a complex FFT with the following code.

; DO THE BIT-REVERSING EXPLICITLY

| Label | Instruction | Operands | Comment |
| :--- | :--- | :--- | :--- |
|  | LDI | @FFTSIZ,RC | ; RC = FFT SIZE |
|  | SUBI | 1,RC | ; RC SHOULD BE ONE LESS THAN DESIRED # |
|  | LDI | @FFTSIZ,IR0 | ; IR0 = FFT SIZE |
|  | LDI | @INPUT,AR0 |  |
|  | LDI | @INPUT,AR1 |  |
| * |  |  |  |
|  | RPTB | BITRV |  |
|  | CMPI | AR1,AR0 | ; EXCHANGE LOCATIONS ONLY |
|  | BGE | CONT | ; IF AR0<AR1 |
|  | LDF | *AR0,R0 |  |
| \|\| | LDF | *AR1,R1 | ; EXCHANGE REAL PARTS |
|  | STF | R0,*AR1 |  |
| \|\| | STF | R1,*AR0 |  |
|  | LDF | *+AR0,R0 |  |
| \|\| | LDF | *+AR1,R1 | ; EXCHANGE IMAGINARY PARTS |
|  | STF | R0,*+AR1 |  |
| \|\| | STF | R1,*+AR0 |  |
| CONT | NOP | *AR0++(2) |  |
| BITRV | NOP | *AR1++(IR0)B |  |

Note that AR1 is pointing to the bit-reversed version of the address contained in AR0. For real-valued FFT, or for FFTs that store the real and the imaginary parts in separate arrays, the real-FFT routine in Appendix C contains a modified example of the above code.
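The behavior of this loop can be sketched in Python for an interleaved complex array. The function name `bit_reverse_in_place` and the list-based memory model are assumptions made for illustration; `ar0`, `ar1`, and `ir0` merely mirror the register roles in the assembly code, and the array is assumed to start at an aligned offset of zero.

```python
def _reverse_bits(value, nbits):
    result = 0
    for _ in range(nbits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def bit_reverse_in_place(data):
    """Reorder 2*N interleaved (re, im) words into bit-reversed order."""
    n = len(data) // 2                  # N complex points, N a power of two
    nbits = (2 * n).bit_length() - 1    # address bits spanned by 2*N words
    mask = (1 << nbits) - 1
    ir0 = n                             # IR0 = N for interleaved data
    ar0 = ar1 = 0
    for _ in range(n):
        if ar0 < ar1:                   # exchange each pair only once
            data[ar0], data[ar1] = data[ar1], data[ar0]                  # real parts
            data[ar0 + 1], data[ar1 + 1] = data[ar1 + 1], data[ar0 + 1]  # imaginary parts
        ar0 += 2                        # *AR0++(2)
        # *AR1++(IR0)B: reverse-carry addition of IR0
        ar1 = _reverse_bits(
            (_reverse_bits(ar1, nbits) + _reverse_bits(ir0, nbits)) & mask, nbits)
    return data
```

Because the exchange happens only when `ar0 < ar1`, each pair of points is swapped exactly once and points whose index is its own bit reversal stay in place.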

## Use of DMA

If the signal to be transformed arrives as a continuous stream of data, the DMA could be used to collect the new data while the data already collected are processed. In this case, the data source address of the DMA points to the memory location corresponding to a serial port, or to another port associated with an external device. The destination is a memory space designated for storage.

Two buffers are normally set aside for this purpose, and there are two ways to use them. One possibility is to designate one buffer as the temporary storage and the other buffer as the working area. When the storage buffer receives the necessary amount of data, the data is transferred to the working area, and the DMA starts refilling the storage buffer. Alternatively, the two buffers are considered equivalent: when the processor finishes processing and outputting the data from one and the DMA has filled the other, the two buffers switch functions; i.e., the DMA starts filling the first buffer while the CPU is processing the data in the buffer just filled.
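The second scheme, in which the two buffers swap roles, can be sketched as follows. This is a sequential Python model under stated assumptions, not DMA code: the name `ping_pong`, the `process` callback, and the block-at-a-time hand-over are all hypothetical, and on the device the fill and the processing would overlap in time.

```python
def ping_pong(stream, block_size, process):
    """Fill one buffer while the other is handed to `process`, then swap."""
    buffers = [[0.0] * block_size, [0.0] * block_size]
    fill, count, results = 0, 0, []
    for sample in stream:
        buffers[fill][count] = sample       # the "DMA" writes the next sample
        count += 1
        if count == block_size:
            full, fill, count = fill, fill ^ 1, 0   # buffers switch functions
            results.append(process(buffers[full]))  # the "CPU" works on the full one
    return results


print(ping_pong(range(8), 4, sum))          # [6, 22]
```

A trailing partial block is simply left in the fill buffer, which matches the idea that processing starts only once a buffer holds a complete frame.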

## Test Vector

For testing purposes, a vector with 64 (quasi-random) data points and the corresponding FFT values is given in Appendix F. Once any of the routines has been implemented, the test vectors can be used to verify that it functions correctly. Together with the test vectors, Appendix F gives a sine/cosine table for a 64-point transform, and the linking file for such a transform.
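When the appendix values are not at hand, a direct DFT can serve as an independent reference for the same kind of check. The sketch below assumes the forward-transform convention $e^{-j 2 \pi k n / N}$ with no scaling; the names `reference_dft` and `matches_reference` and the tolerance are illustrative choices, not part of the report.

```python
import cmath


def reference_dft(x):
    """Direct O(N^2) DFT, used only as a slow but simple reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


def matches_reference(x, fft_out, tol=1e-4):
    """True if fft_out agrees with the direct DFT of x within tol."""
    ref = reference_dft(x)
    return all(abs(a - b) <= tol * max(1.0, abs(b))
               for a, b in zip(fft_out, ref))
```

A single-precision result from the device would be read back, converted to complex values, and passed as `fft_out`.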

## Summary

This report examined implementations of fast transforms on the Texas Instruments TMS320C3x floating-point devices. The transforms considered were several forms of the FFT, the Discrete Hartley Transform, and the Discrete Cosine Transform. Because of the powerful architecture of the device, the implementation was done easily and efficiently. It was shown that a TMS320C30 executes the FFTs several times faster than large computers such as VAX and SUN workstations. With the availability of the C compiler, these routines can be put in C-callable form and be used to compute the corresponding transforms efficiently.

## Appendices

Appendices A to F contain the TMS320C30 assembly language programs for the different algorithms considered. The contents of the appendices are as follows:

Appendix A: Radix-2 Complex FFT.
composed of
A1: Generic Program to Do a Looped-Code Radix-2 FFT Computation on the TMS320C30.
A2: fft_2 - Radix-2 Complex FFT to Be Called as a C Function.
A3: Complex, Radix-2 DIT FFT - R2DIT.ASM.
A4: Complex, Radix-2 DIT FFT - R2DITB.ASM.
A5: TWID1KBR.ASM - Table with Twiddle Factors for an FFT up to a Length of 1024 Complex Points.

Appendix B: Radix-4 Complex FFT.
composed of
B1: Generic Program to Do a Looped-Code Radix-4 FFT on the TMS320C30.
B2: fft_4 - Radix-4 Complex FFT to Be Called as a C Function.

Appendix C: Radix-2 Real FFT.
composed of
C1: Generic Program to Do a Radix-2 Real FFT Computation on the TMS320C30.
C2: fft_rl - Radix-2 Real FFT to Be Called as a C Function.
C3: Generic Program to Do a Radix-2 Real Inverse FFT Computation on the TMS320C30.

Appendix D: Discrete Hartley Transform.
composed of
D1: Generic Program to Do a Radix-2 Hartley Transform on the TMS320C30.

Appendix E: Discrete Cosine Transform.
composed of
E1: A Fast Cosine Transform.
E2: A Fast Cosine Transform (Inverse Transform).
E3: FCT Cosine Tables File.
E4: Data File.

Appendix F: Test Vectors, 64-Point Sine Table, Link Command File.
composed of
F1: Example of a 64-Point Vector to Test the FFT Routines.
F2: File to Be Linked with the Source Code for a 64-Point, Radix-4 FFT.
F3: Link Command File.

The first three appendices contain the code for the radix-2, complex radix-4, and real radix-2 FFT transformations. These routines are given in both the regular form and in a C-callable form. Furthermore, the contents of a file with the twiddle factors are given, as well as an example of a link command file for a 64-point FFT. Note that the source code of these routines can be downloaded from the TI DSP bulletin board (BBS) by calling (713) 274-2323. For questions regarding the BBS, call the TI DSP hotline at (713) 274-2320.

## Acknowledgements

Mr. Raimund Meyer and Mr. Karl Schwarz (Lehrstuhl für Nachrichtentechnik, University of Erlangen) provided the fast routines of Appendix A to do 1024-point, radix-2, DIT FFT. Mr. Paul Wilhelm of the University of Washington provided the routines for the Fast Cosine Transform (FCT) together with the related explanations and the test vector in Appendix E. Their contributions are gratefully acknowledged.

## References

[1] Burrus, C.S., and Parks, T.W. DFT/FFT and Convolution Algorithms, John Wiley and Sons, New York, 1985.
[2] Lin, K.-S., Ed. Digital Signal Processing Applications with the TMS320 Family, Prentice-Hall, Englewood Cliffs, New Jersey, 1987.
[3] Oppenheim, A.V., and Schafer, R.W. Digital Signal Processing, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1975.
[4] Rabiner, L.R., and Gold, B. Theory and Application of Digital Signal Processing, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1975.
[5] Burrus, C.S. "Unscrambling for Fast DSP Algorithms," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-36, No. 7, pp. 1086-1087, July 1988.
[6] Papamichalis, Panos E., and Burrus, C.S. "Conversion of Digit-Reversed to Bit-Reversed Order in FFT Algorithms," Proceedings of 1989 IEEE International Conference on Acoustics, Speech, and Signal Processing, May 1989.
[7] Third-Generation TMS320 User's Guide, Texas Instruments, Inc., Dallas, Texas, August 1988.
[8] Papamichalis, Panos E., and Simar, Ray Jr. "The TMS320C30 Floating-Point Digital Signal Processor," IEEE Micro, Vol. 8, No. 6, pp. 13-29, December 1988.
[9] TMS320C30 Assembly Language Tools User's Guide, Texas Instruments, Inc., Dallas, Texas, July 1987.
[10] Sorensen, H.V., Jones, D.L., Heideman, M.T., and Burrus, C.S. "Real-Valued Fast Fourier Transform Algorithms," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-35, No. 6, pp. 849-863, June 1987.
[11] Bracewell, R.N. "The Fast Hartley Transform," Proceedings of IEEE, Vol. 72, No. 8, pp. 1010-1018, August 1984.
[12] Sorensen, H.V., Jones, D.L., Burrus, C.S., and Heideman, M.T. "On Computing the Discrete Hartley Transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-33, No. 4, pp. 1231-1238, October 1985.
[13] Ahmed, N., Natarajan, T., and Rao, K.R. "Discrete Cosine Transform," IEEE Transactions on Computers, Vol. C-23, pp. 90-93, January 1974.
[14] Lee, B.G. "FCT - A Fast Cosine Transform," Proceedings of 1984 IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 28A.3.1-28A.3.4, March 1984.
[15] TMS320C30 C Compiler Reference Guide, Texas Instruments, Inc., Dallas, Texas, December 1988.
[16] Sorensen, H.V., Heideman, M.T., and Burrus, C.S. "On Computing the Split-Radix FFT," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-34, No. 1, pp. 152-156, February 1986.
[17] Sorensen, H.V., and Burrus, C.S. "Computer Dependency of FFT Algorithms," Proceedings of ASILOMAR, 1987.
[18] Schuessler, H.W., Meyer, R., and Schwarz, K. "FFT Implementation on DSP Chips - Theory and Practice," Proposal for the 1990 IEEE International Conference on Acoustics, Speech, and Signal Processing.

## Appendix A. Radix-2 Complex FFT





```
* NAME:
*
* SYNOPSIS:
*     INT fft_2(N, M, DATA)
*     INT    N        FFT SIZE: N=2**M
*     INT    M        NUMBER OF STAGES = LOG2(N)
*     FLOAT  *DATA    ARRAY WITH INPUT AND OUTPUT DATA
*
* DESCRIPTION:
*     GENERIC FUNCTION TO DO A RADIX-2 FFT COMPUTATION ON THE 320C30.
*     THE DATA ARRAY IS 2*N-LONG, WITH REAL AND IMAGINARY VALUES ALTERNATING.
*     THE PROGRAM IS BASED ON THE FORTRAN PROGRAM IN THE BURRUS AND PARKS
*     BOOK, P. 111.
*     THE COMPUTATION IS DONE IN PLACE, AND THE ORIGINAL DATA IS DESTROYED.
*     BIT REVERSAL IS IMPLEMENTED AT THE END OF THE FUNCTION. IF THIS IS NOT
*     NECESSARY, THIS PART CAN BE COMMENTED OUT.
*     THE SINE/COSINE TABLE FOR THE TWIDDLE FACTORS IS EXPECTED TO BE SUPPLIED
*     DURING LINK TIME, AND IT SHOULD HAVE THE FOLLOWING FORMAT:
*
*             .GLOBAL sine
*             .DATA
*     sine    .FLOAT  VALUE1        = SIN(0*2*PI/N)
*             .FLOAT  VALUE2        = SIN(1*2*PI/N)
*              .
*             .FLOAT  VALUE(5N/4)   = SIN((5*N/4-1)*2*PI/N)
*
*     THE VALUES VALUE1, VALUE2, ETC., ARE THE SINE WAVE VALUES. FOR AN
*     N-POINT FFT, THERE ARE N+N/4 VALUES FOR A FULL AND A QUARTER PERIOD OF
*     THE SINE WAVE. IN THIS WAY, A FULL SINE AND COSINE PERIOD ARE AVAILABLE
*     (SUPERIMPOSED).
*
*     STACK STRUCTURE UPON THE CALL:
*
*             +-------------+
*     -FP(4)  | DATA        |
*     -FP(3)  | M           |
*     -FP(2)  | N           |
*     -FP(1)  | RETURN ADDR |
*     -FP(0)  | OLD FP      |
*             +-------------+
```
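The reason for storing $N + N/4$ sine values is that the cosine is the same wave read a quarter period later. A short Python sketch makes this concrete; the function name `sine_table` is hypothetical, and the values are computed in double precision rather than in the device's floating-point format.

```python
import math


def sine_table(n):
    """N + N/4 samples: one full period of the sine plus one quarter period."""
    return [math.sin(2 * math.pi * k / n) for k in range(n + n // 4)]


n = 64
table = sine_table(n)

# cos(2*pi*k/N) is the table entry N/4 positions after sin(2*pi*k/N),
# so one table serves both the sine and the cosine lookups.
quarter = n // 4
max_err = max(abs(table[k + quarter] - math.cos(2 * math.pi * k / n))
              for k in range(n))
```

With this layout a pointer offset of $N/4$ is all that separates the sine and cosine lookups.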

| Label | Instruction | Operands | Comment |
| :--- | :--- | :--- | :--- |
| * INITIALIZE FFT ROUTINE |  |  |  |
|  | LDI | @FFTSIZ,IR1 |  |
|  | LSH | -2,IR1 | ; IR1 = N/4, POINTER FOR SIN/COS TABLE |
|  | LDI | 0,AR6 | ; AR6 HOLDS THE CURRENT STAGE NUMBER |
|  | LDI | @FFTSIZ,IR0 |  |
|  | LSH | 1,IR0 | ; IR0 = 2*N1 (BECAUSE OF REAL/IMAG) |
|  | LDI | @FFTSIZ,R7 | ; R7 = N2 |
|  | LDI | 1,AR7 | ; INITIALIZE REPEAT COUNTER OF FIRST LOOP |
|  | LDI | 1,AR5 | ; INITIALIZE IE INDEX (AR5 = IE) |
| * OUTER LOOP |  |  |  |
| LOOP: | NOP | *++AR6(1) | ; CURRENT FFT STAGE |
|  | LDI | @INPUT,AR0 | ; AR0 POINTS TO X(I) |
|  | ADDI | R7,AR0,AR2 | ; AR2 POINTS TO X(L) |
|  | LDI | AR7,RC |  |
|  | SUBI | 1,RC | ; RC SHOULD BE ONE LESS THAN DESIRED # |



GENERIC PROGRAM FOR A FAST LOOPED-CODE RADIX-2 DIT FFT COMPUTATION ON THE TMS320C30

WRITTEN BY: RAIMUND MEYER, KARL SCHWARZ, LEHRSTUHL FUER NACHRICHTENTECHNIK, UNIVERSITAET ERLANGEN-NUERNBERG, CAUERSTRASSE 7, D-8520 ERLANGEN, FRG

THE (COMPLEX) DATA RESIDE IN INTERNAL MEMORY. THE COMPUTATION IS DONE IN-PLACE, BUT THE RESULT IS MOVED TO ANOTHER MEMORY SECTION TO DEMONSTRATE THE BIT-REVERSED ADDRESSING.

FOR THIS PROGRAM THE MINIMUM FFT LENGTH IS 32 POINTS BECAUSE OF THE SEPARATE STAGES.

FIRST TWO PASSES ARE REALIZED AS A FOUR BUTTERFLY LOOP SINCE THE MULTIPLIES ARE TRIVIAL. THE MULTIPLIER IS ONLY USED FOR A LOAD IN PARALLEL WITH AN ADDF OR SUBF.

EXAMPLE FOR A 1024-POINT FFT (EXCLUDING BIT REVERSAL):

| MEMORY SIZE | WORDS |
| :--- | :--- |
| PROGRAM | 229 |
| DATA (TWIDDLE FACTORS) | 512 |

| STAGES | CYCLES PER BUTTERFLY |
| :--- | :--- |
| STAGES 1 AND 2 | 4 |
| STAGES 3 TO 8 | 8 |
| STAGE 9 | 8.25 |
| STAGE 10 | 8.5 |

| QUANTITY | VALUE |
| :--- | :--- |
| AVERAGE CYCLES/BUTTERFLY | 7.275 |
| TOTAL BUTTERFLY CYCLES | 37248 |
| INITIALIZATION OVERHEAD | 2181 = 5.55 % OF TOTAL TIME |
| TOTAL NUMBER OF INSTRUCTION CYCLES | 39429 |
| TOTAL TIME FOR A 1024-POINT FFT | 2.36 ms (EXCLUDING BIT REVERSAL) |
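The totals quoted above follow from the per-stage figures, since each of the 10 stages of a 1024-point radix-2 FFT contains 512 butterflies. The short Python check below reproduces the arithmetic; the 60-ns instruction cycle is an assumption not stated in this listing, and the variable names are chosen only for illustration.

```python
N = 1024
butterflies_per_stage = N // 2                  # 512 butterflies in every stage

# Cycles per butterfly for each of the log2(N) = 10 stages, from the listing.
cycles_per_butterfly = [4, 4] + [8] * 6 + [8.25, 8.5]

average = sum(cycles_per_butterfly) / len(cycles_per_butterfly)
butterfly_cycles = butterflies_per_stage * sum(cycles_per_butterfly)

overhead = 2181                                 # initialization overhead, from the listing
total_cycles = butterfly_cycles + overhead

CYCLE_NS = 60                                   # assumed instruction cycle time
total_ms = total_cycles * CYCLE_NS * 1e-6
```

Under that assumption the total comes to about 2.366 ms, in line with the quoted 2.36 ms.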

THIS PROGRAM INCLUDES THE FOLLOWING FILES:

THE FILE 'TWID1KBR.ASM' CONSISTS OF TWIDDLE FACTORS. THE TWIDDLE FACTORS ARE STORED IN BIT-REVERSED ORDER AND WITH A TABLE LENGTH OF N/2 (N = FFT LENGTH).

EXAMPLE: SHOWN FOR N = 32, $W_N(n) = \cos(2\pi n/N) - j\sin(2\pi n/N)$

| ADDRESS | COEFFICIENT |
| :--- | :--- |
| 0 | $\mathrm{R}\{W_N(0)\} = \cos(2\pi \cdot 0/32) = 1$ |
| 1 | $-\mathrm{I}\{W_N(0)\} = \sin(2\pi \cdot 0/32) = 0$ |
| 2 | $\mathrm{R}\{W_N(4)\} = \cos(2\pi \cdot 4/32) = 0.707$ |
| 3 | $-\mathrm{I}\{W_N(4)\} = \sin(2\pi \cdot 4/32) = 0.707$ |
| : | : |
| 12 | $\mathrm{R}\{W_N(3)\} = \cos(2\pi \cdot 3/32) = 0.831$ |
| 13 | $-\mathrm{I}\{W_N(3)\} = \sin(2\pi \cdot 3/32) = 0.556$ |
| 14 | $\mathrm{R}\{W_N(7)\} = \cos(2\pi \cdot 7/32) = 0.195$ |
| 15 | $-\mathrm{I}\{W_N(7)\} = \sin(2\pi \cdot 7/32) = 0.98$ |

WHEN GENERATED FOR AN FFT LENGTH OF 1024, THE TABLE IS FOR ALL AVAILABLE FFT OF LESS OR EQUAL LENGTH.

THE MISSING TWIDDLE FACTORS (WN( ), WN( ), ...) ARE GENERATED BY USING THE SYMMETRY $W_N(N/4+n) = -j \cdot W_N(n)$. THIS CAN BE EASILY REALIZED BY CHANGING REAL AND IMAGINARY PART OF THE TWIDDLE FACTORS AND BY NEGATING THE NEW REAL PART.
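The table layout and the symmetry rule can be checked with a small Python model. It is a sketch under assumptions: the names `twiddle_table` and `derived_twiddle` are hypothetical, each twiddle factor is taken to occupy two consecutive words holding the real part and the negated imaginary part, and the values are computed in double precision.

```python
import math


def _reverse_bits(value, nbits):
    result = 0
    for _ in range(nbits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def twiddle_table(n):
    """N/2 words: (cos, sin) pairs for n = 0..N/4-1 in bit-reversed order."""
    pairs = n // 4
    nbits = pairs.bit_length() - 1
    table = []
    for i in range(pairs):
        k = _reverse_bits(i, nbits)
        angle = 2 * math.pi * k / n
        table += [math.cos(angle), math.sin(angle)]   # R{W}, -I{W}
    return table


def derived_twiddle(re, neg_im):
    """Stored pair for W(N/4 + n), from the stored pair (R, -I) of W(n).

    Swap the two stored values, then negate the new real part.
    """
    return -neg_im, re


table = twiddle_table(32)
```

For $N = 32$ the entries at addresses 2, 12, and 14 come out as 0.707, 0.831, and 0.195, matching the example above.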

TO CHANGE THE FFT LENGTH, ONLY THE PARAMETERS IN THE HEADER OF TWID1KBR.ASM AND THE INPUT AND OUTPUT VECTOR LENGTHS NEED TO BE ALTERED.
***************************************************************************

[Figure: butterfly signal-flow diagram with upper input $AR + jAI$]

$$
\begin{aligned}
TR &= BR * COS + BI * SIN \\
TI &= BR * SIN - BI * COS \\
AR^{\prime} &= AR + TR \\
AI^{\prime} &= AI - TI \\
BR^{\prime} &= AR - TR \\
BI^{\prime} &= AI + TI
\end{aligned}
$$



```
        .global FFT
        .global N
        .global NHALB
        .global NVIERT
        .global NACHTEL
        .global M
        .global SINE
*
        .bss    INP,2048        ; INPUT VECTOR LENGTH = 2N (DEPENDS ON N)
*
        .bss    OUTP,2048       ; OUTPUT VECTOR LENGTH = 2N (DEPENDS ON N)
*
        .text
*
FFTSIZ  .word   N
FG4M2   .word   NVIERT-2
FG4M3   .word   NVIERT-3
FG8M2   .word   NACHTEL-2
FG2     .word   NHALB
FG2M3   .word   NHALB-3
LOGFFT  .word   M
SINTAB  .word   SINE
SINTM1  .word   SINE-1
SINTP2  .word   SINE+2
INPUT   .word   INP
INPUTP2 .word   INP+2
OUTPUT  .word   OUTP
```

```
*       FILL PIPELINE
        ADDF    *AR2,*AR0,R4            ; R4 = AR + CR
        SUBF    *AR2,*AR0++,R5          ; R5 = AR - CR
        ADDF    *AR1,*AR3,R6            ; R6 = DR + BR
        SUBF    *AR1++,*AR3++,R7        ; R7 = DR - BR
        ADDF    R6,R4,R0                ; AR' = R0 = R4 + R6
        MPYF    *AR3++,*AR7,R1          ; R1 = DI, BR' = R3 = R4 - R6
||      SUBF    R6,R4,R3
        ADDF    R1,*AR1,R0              ; R0 = BI + DI, AR' = R0
||      STF     R0,*AR4++
        SUBF    R1,*AR1++,R1            ; R1 = BI - DI, BR' = R3
||      STF     R3,*AR5++
        ADDF    R1,R5,R2                ; CR' = R2 = R5 + R1
        MPYF    *+AR2,*AR7,R1           ; R1 = CI, DR' = R3 = R5 - R1
||      SUBF    R1,R5,R3
        ADDF    R1,*AR0,R2              ; R2 = AI + CI, CR' = R2
||      STF     R2,*AR2++(IR1)
        SUBF    R1,*AR0++,R6            ; R6 = AI - CI, DR' = R3
||      STF     R3,*AR6++
        ADDF    R0,R2,R4                ; AI' = R4 = R2 + R0
*       RADIX-4 BUTTERFLY LOOP
        RPTB    BLK1
        MPYF    *AR2--,*AR7,R0          ; R0 = CR, (BI' = R2 = R2 - R0)
||      SUBF    R0,R2,R2
        MPYF    *AR1++,*AR7,R1          ; R1 = BR, (CI' = R3 = R6 + R7)
||      ADDF    R7,R6,R3
        ADDF    R0,*AR0,R4              ; R4 = AR + CR, (AI' = R4)
||      STF     R4,*AR4++
        SUBF    R0,*AR0++,R5            ; R5 = AR - CR, (BI' = R2)
||      STF     R2,*AR5++
        SUBF    R7,R6,R7                ; (DI' = R7 = R6 - R7)
        ADDF    R1,*AR3,R6              ; R6 = DR + BR, (DI' = R7)
||      STF     R7,*AR6++
        SUBF    R1,*AR3++,R7            ; R7 = DR - BR, (CI' = R3)
||      STF     R3,*AR2++
        ADDF    R6,R4,R0                ; AR' = R0 = R4 + R6
        MPYF    *AR3++,*AR7,R1          ; R1 = DI, BR' = R3 = R4 - R6
||      SUBF    R6,R4,R3
        ADDF    R1,*AR1,R0              ; R0 = BI + DI, AR' = R0
||      STF     R0,*AR4++
        SUBF    R1,*AR1++,R1            ; R1 = BI - DI, BR' = R3
||      STF     R3,*AR5++
        ADDF    R1,R5,R2                ; CR' = R2 = R5 + R1
        MPYF    *+AR2,*AR7,R1           ; R1 = CI, DR' = R3 = R5 - R1
||      SUBF    R1,R5,R3
        ADDF    R1,*AR0,R2              ; R2 = AI + CI, CR' = R2
||      STF     R2,*AR2++(IR1)
        SUBF    R1,*AR0++,R6            ; R6 = AI - CI, DR' = R3
||      STF     R3,*AR6++
BLK1    ADDF    R0,R2,R4                ; AI' = R4 = R2 + R0
*       CLEAR PIPELINE
        SUBF    R0,R2,R2                ; BI' = R2 = R2 - R0
        ADDF    R7,R6,R3                ; CI' = R3 = R6 + R7
        STF     R4,*AR4                 ; AI' = R4, BI' = R2
||      STF     R2,*AR5
        SUBF    R7,R6,R7                ; DI' = R7 = R6 - R7
        STF     R7,*AR6                 ; DI' = R7, CI' = R3
||      STF     R3,*-AR2
*       THIRD TO LAST-2 STAGE
        LDI     @FG2,IR1
        LDI     IR0,AR5
        SUBI    1,AR5
        LDI     1,AR6
STUFE   LDI     @SINTAB,AR7             ; POINTER TO TWIDDLE FACTOR
        LDI     0,AR4                   ; GROUP COUNTER
        LDI     @INPUT,AR0              ; UPPER REAL BUTTERFLY INPUT
        LDI     AR0,AR2                 ; UPPER REAL BUTTERFLY OUTPUT
        ADDI    IR0,AR0,AR3             ; LOWER REAL BUTTERFLY OUTPUT
        LDI     AR3,AR1                 ; LOWER REAL BUTTERFLY INPUT
        LSH     1,AR6                   ; DOUBLE GROUP COUNT
        LSH     -2,AR5                  ; HALF BUTTERFLY COUNT
        LSH     1,AR5                   ; CLEAR LSB
        LSH     -1,IR0                  ; HALF STEP FROM UPPER TO LOWER REAL PART
        LSH     -1,IR1
        ADDI    1,IR1                   ; STEP FROM OLD IMAGINARY TO NEW REAL VALUE
        LDF     *AR1++,R6               ; DUMMY LOAD, ONLY FOR ADDRESS UPDATE
||      LDF     *AR7,R7                 ; R7 = COS
GRUPPE
*       FILL PIPELINE
*                                       ; AR0 = UPPER REAL BUTTERFLY INPUT
*                                       ; AR1 = LOWER REAL BUTTERFLY INPUT
*                                       ; AR2 = UPPER REAL BUTTERFLY OUTPUT
*                                       ; AR3 = LOWER REAL BUTTERFLY OUTPUT
*                                       ; THE IMAGINARY PART HAS TO FOLLOW
        LDF     *++AR7,R6               ; R6 = SIN
        MPYF    *AR1--,R6,R1            ; R1 = BI * SIN
||      ADDF    *++AR4,R0,R3            ; DUMMY ADDF FOR COUNTER UPDATE
        MPYF    *AR1,R7,R0              ; R0 = BR * COS
        MPYF    *AR1++,*AR7--,R0        ; R3 = TR = R0 + R1, R0 = BR * SIN
||      ADDF    R0,R1,R3
        MPYF    *AR1++,R7,R1            ; R1 = BI * COS, R2 = AR - TR
||      SUBF    R3,*AR0,R2
        ADDF    *AR0++,R3,R5            ; R5 = AR + TR, BR' = R2
||      STF     R2,*AR3++
        LDI     AR5,RC
```



```
* FIRST BUTTERFLY-TYPE:
*     TR  = BR * COS + BI * SIN
*     TI  = BR * SIN - BI * COS
*     AR' = AR + TR
*     AI' = AI - TI
*     BR' = AR - TR
*     BI' = AI + TI
```




```
* SECOND BUTTERFLY-TYPE:
*     TR  = BI * COS - BR * SIN
*     TI  = BI * SIN + BR * COS
*     AR' = AR + TR
*     AI' = AI - TI
*     BR' = AR - TR
*     BI' = AI + TI
```
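Both butterfly types are ordinary radix-2 butterflies written out in real arithmetic. With $W = \cos - j\sin$, the first type computes $A \pm W B$, and the second computes $A \pm (-jW) B$, which corresponds to the symmetry $W_N(N/4+n) = -j \cdot W_N(n)$ used for the twiddle table. The Python sketch below checks this against complex arithmetic; the function names are hypothetical and the code is a model of the equations, not of the pipelined assembly loop.

```python
import math


def butterfly_first(a, b, cos_v, sin_v):
    """First butterfly type: returns (A + W*B, A - W*B), W = cos - j*sin."""
    tr = b.real * cos_v + b.imag * sin_v
    ti = b.real * sin_v - b.imag * cos_v
    return complex(a.real + tr, a.imag - ti), complex(a.real - tr, a.imag + ti)


def butterfly_second(a, b, cos_v, sin_v):
    """Second butterfly type: same structure with the twiddle factor -j*W."""
    tr = b.imag * cos_v - b.real * sin_v
    ti = b.imag * sin_v + b.real * cos_v
    return complex(a.real + tr, a.imag - ti), complex(a.real - tr, a.imag + ti)


theta = 2 * math.pi * 3 / 32
c, s = math.cos(theta), math.sin(theta)
w = complex(c, -s)
a, b = complex(1.0, 2.0), complex(3.0, -1.0)

first = butterfly_first(a, b, c, s)
second = butterfly_second(a, b, c, s)
```

Because the second type reuses the same stored cosine and sine values, one table entry serves two twiddle factors a quarter period apart.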

| Label | Instruction | Operands | Comment |
| :--- | :--- | :--- | :--- |
|  | RPTB | BFLY2 |  |
| * |  |  |  |
|  | MPYF | *+AR1,R7,R5 | ; R5 = BI * COS, (AR' = R5) |
| \|\| | STF | R5,*AR2++ |  |
|  | ADDF | R1,R0,R2 | ; (R2 = TI = R0 + R1) |
|  | MPYF | *AR1,R6,R0 | ; R0 = BR * SIN, (R3 = AI + TI) |
| \|\| | ADDF | R2,*AR0,R3 |  |
|  | SUBF | R2,*AR0++,R4 | ; (R4 = AI - TI, BI' = R3) |
| \|\| | STF | R3,*AR3++ |  |
|  | SUBF | R0,R5,R3 | ; TR = R3 = R5 - R0 |
|  | MPYF | *AR1++,R7,R0 | ; R0 = BR * COS, R2 = AR - TR |
| \|\| | SUBF | R3,*AR0,R2 |  |
|  | MPYF | *AR1++,R6,R1 | ; R1 = BI * SIN, (AI' = R4) |
| \|\| | STF | R4,*AR2++ |  |
| BFLY2 | ADDF | *AR0++,R3,R5 | ; R5 = AR + TR, BR' = R2 |
| \|\| | STF | R2,*AR3++ |  |
| Label | Instruction | Operands | Comment |
| :--- | :--- | :--- | :--- |
| * CLEAR PIPELINE |  |  |  |
|  | ADDF | R1,R0,R2 | ; R2 = TI = R0 + R1 |
|  | ADDF | R2,*AR0,R3 | ; R3 = AI + TI |
|  | STF | R5,*AR2++ | ; AR' = R5 |
|  | CMPI | AR6,AR4 |  |
|  | BNED | GRUPPE | ; DO FOLLOWING 3 INSTRUCTIONS |
|  | SUBF |  | ; R4 = AI - TI, BI' = R3 |
|  | STF | R3,*AR3++(IR1) |  |
|  | LDF | *++AR7,R7 | ; R7 = COS |
|  | STF | R4,*AR2++(IR1) | ; AI' = R4 |
|  | NOP | *AR1++(IR1) | ; BRANCH HERE |

| Label | Instruction | Operands | Comment |
| :--- | :--- | :--- | :--- |
| * END OF THIS BUTTERFLY GROUP |  |  |  |
|  | CMPI | 4,IR0 |  |
|  | BNZ | STUFE | ; JUMP OUT AFTER LD(N)-3 STAGE |
| * SECOND TO LAST STAGE |  |  |  |
|  | LDI | @INPUT,AR0 | ; UPPER INPUT |
|  | LDI | AR0,AR2 | ; UPPER OUTPUT |
|  | ADDI | IR0,AR0,AR1 | ; LOWER INPUT |
* 

| Instruction | Operands | Comment |
| :--- | :--- | :--- |
| ADDF | *AR0,*AR1,R2 | ; AR' = R2 = AR + BR |
| SUBF | *AR1++,*AR0++,R3 | ; BR' = R3 = AR - BR |
| ADDF | *AR0,*AR1,R0 | ; AI' = R0 = AI + BI |
| SUBF | *AR1++,*AR0++,R1 | ; BI' = R1 = AI - BI |

2. BUTTERFLY: W^0

| Instruction | Operands | Comment |
| :--- | :--- | :--- |
| ADDF | *AR0,*AR1,R6 | ; AR' = R6 = AR + BR |
| SUBF | *AR1++,*AR0++,R7 | ; BR' = R7 = AR - BR |
| ADDF | *AR0,*AR1,R4 | ; AI' = R4 = AI + BI |
| SUBF | *AR1++(IR0),*AR0++(IR0),R5 | ; BI' = R5 = AI - BI |
| STF | R2,*AR2++ | ; (AR' = R2) |
| STF | R3,*AR3++ | ; (BR' = R3) |
| STF | R0,*AR2++ | ; (AI' = R0) |
| STF | R1,*AR3++ | ; (BI' = R1) |
| STF | R6,*AR2++ | ; AR' = R6 |
| STF | R7,*AR3++ | ; BR' = R7 |
| STF | R4,*AR2++(IR0) | ; AI' = R4 |
| STF | R5,*AR3++(IR0) | ; BI' = R5 |

3. BUTTERFLY: W^M/4

| Instruction | Operands | Comment |
| :--- | :--- | :--- |
| ADDF | *AR0++,*+AR1,R5 | ; AR' = R5 = AR + BI |
| SUBF | *AR1,*AR0,R4 | ; AI' = R4 = AI - BR |
| ADDF | *AR1++,*AR0--,R6 | ; BI' = R6 = AI + BR |
| SUBF | *AR1++,*AR0++,R7 | ; BR' = R7 = AR - BI |

4. BUTTERFLY: W^M/4

| Instruction | Operands | Comment |
| :--- | :--- | :--- |
| ADDF | *+AR1,*++AR0,R3 | ; AR' = R3 = AR + BI |
| LDF | *-AR7,R1 | ; R1 = 0 (FOR INNER LOOP) |
| LDF | *AR1++,R0 | ; R0 = BR (FOR INNER LOOP) |
| SUBF | *AR1++(IR0),*AR0++,R2 | ; BR' = R2 = AR - BI |
| STF | R5,*AR2++ | ; (AR' = R5) |
| STF | R7,*AR3++ | ; (BR' = R7) |
| STF | R6,*AR3++ | ; (BI' = R6) |

5. TO M. BUTTERFLY

| Label | Instruction | Operands | Comment |
| :--- | :--- | :--- | :--- |
|  | RPTB | BF2END |  |
|  | LDF | *AR7++,R7 | ; R7 = COS, ((AI' = R4)) |
| \|\| | STF | R4,*AR2++ |  |
|  | LDF | *AR7++,R6 | ; R6 = SIN, (BR' = R2) |


| Label | Instruction | Operands | Comment |
| :--- | :--- | :--- | :--- |
|  | MPYF | *+AR1,R6,R5 | ; R5 = BI * SIN, (AR' = R3) |
| \|\| | STF | R3,*AR2++ |  |
|  | ADDF | R1,R0,R2 | ; (R2 = TI = R0 + R1) |
|  | MPYF | *AR1,R7,R0 | ; R0 = BR * COS, (R3 = AI + TI) |
| \|\| | ADDF | R2,*AR0,R3 |  |
|  | SUBF | R2,*AR0++(IR0),R4 | ; (R4 = AI - TI, BI' = R3) |
| \|\| | STF | R3,*AR3++(IR0) |  |
|  | ADDF | R0,R5,R3 | ; R3 = TR = R0 + R5 |
|  | MPYF | *AR1++,R6,R0 | ; R0 = BR * SIN, R2 = AR - TR |
| \|\| | SUBF | R3,*AR0,R2 |  |
|  | MPYF | *AR1++,R7,R1 | ; R1 = BI * COS, (AI' = R4) |
| \|\| | STF | R4,*AR2++(IR0) |  |
|  | ADDF | *AR0++,R3,R5 | ; R5 = AR + TR, BR' = R2 |
| \|\| | STF | R2,*AR3++ |  |
| * |  |  |  |
|  | MPYF | *+AR1,R6,R5 | ; R5 = BI * SIN, (AR' = R5) |
| \|\| | STF | R5,*AR2++ |  |
|  | SUBF | R1,R0,R2 | ; (R2 = TI = R0 - R1) |
|  | MPYF | *AR1,R7,R0 | ; R0 = BR * COS, (R3 = AI + TI) |
| \|\| | ADDF | R2,*AR0,R3 |  |
|  | SUBF | R2,*AR0++,R4 | ; (R4 = AI - TI, BI' = R3) |
| \|\| | STF | R3,*AR3++ |  |
|  | ADDF | R0,R5,R3 | ; R3 = TR = R0 + R5 |
|  | MPYF | *AR1++,R6,R0 | ; R0 = BR * SIN, R2 = AR - TR |
| \|\| | SUBF | R3,*AR0,R2 |  |
|  | MPYF | *AR1++(IR0),R7,R1 | ; R1 = BI * COS, (AI' = R4) |
| \|\| | STF | R4,*AR2++ |  |
|  | ADDF | *AR0++,R3,R3 | ; R3 = AR + TR, BR' = R2 |
| \|\| | STF | R2,*AR3++ |  |
| * |  |  |  |
|  | MPYF | *+AR1,R7,R5 | ; R5 = BI * COS, (AR' = R3) |
| \|\| | STF | R3,*AR2++ |  |
|  | SUBF | R1,R0,R2 | ; (R2 = TI = R0 - R1) |
|  | MPYF | *AR1,R6,R0 | ; R0 = BR * SIN, (R3 = AI + TI) |
| \|\| | ADDF | R2,*AR0,R3 |  |
|  | SUBF | R2,*AR0++(IR0),R4 | ; (R4 = AI - TI, BI' = R3) |
| \|\| | STF | R3,*AR3++(IR0) |  |
|  | SUBF | R0,R5,R3 | ; R3 = TR = R5 - R0 |
|  | MPYF | *AR1++,R7,R0 | ; R0 = BR * COS, R2 = AR - TR |
| \|\| | SUBF | R3,*AR0,R2 |  |
|  | MPYF | *AR1++,R6,R1 | ; R1 = BI * SIN, (AI' = R4) |
| \|\| | STF | R4,*AR2++(IR0) |  |
|  | ADDF | *AR0++,R3,R5 | ; R5 = AR + TR, BR' = R2 |
| \|\| | STF | R2,*AR3++ |  |
| * |  |  |  |
|  | MPYF | *+AR1,R7,R5 | ; R5 = BI * COS, (AR' = R5) |
| \|\| | STF | R5,*AR2++ |  |
|  | ADDF | R1,R0,R2 | ; (R2 = TI = R0 + R1) |
|  | MPYF | *AR1,R6,R0 | ; R0 = BR * SIN, (R3 = AI + TI) |
|  | ADDF | R2, *AR0, R3 |  |
|  | SUBF | R2, $*$ ARO++, R4 | ; $\left(\mathrm{R4}=\mathrm{AI}-\mathrm{TI}, \mathrm{y}(\mathrm{L})=\mathrm{BI}^{\prime}=\mathrm{R} 3\right)$ |
|  | STF | R3, *AR3++ |  |
|  | SUBF | R0, R5, R3 | ; $\mathrm{R} 3=\mathrm{TR}=\mathrm{RS}-\mathrm{RO}$ |
|  | MPYF | *AR1++, R7, R0 | ; $\mathrm{R} 0=\mathrm{BR} * \operatorname{COS}, \mathrm{R} 2=A R-T R$ |


| $i$ | SUBF | $R 3, * A R O, R 2$ |
| :--- | :--- | :--- | :--- |
| BFZEND | MPYF | $* A R R+(I R O), R 6, R 1$ |$; R 1=B 1 * S I N, R 3=A R+T R$

CLEAR PIPELINE

| LDI | @INPUT, AR0 | ; UPPER INPUT |
| :--- | :--- | :--- |
| LDI | AR0, AR2 | ; UPPER OUTPUT |
| LDI | @INPUTP2, AR1 | ; LOWER INPUT |
| LDI | AR1, AR3 | ; LOWER OUTPUT |
| LDI | @SINTP2, AR7 | ; POINTER TO TWIDDLE FACTORS |
| LDI | 3, IR0 | ; GROUP OFFSET |
| LDI | EFGAR2,RC |  |

FILL PIPELINE

1. BUTTERFLY: W^0

| ADDF | *AR0, *AR1, R6 | ; AR' = R6 = AR + BR |
| :---: | :---: | :---: |
| SUBF | *AR1++, *AR0++, R7 | ; BR' = R7 = AR - BR |
| ADDF | *AR0, *AR1, R4 | ; AI' = R4 = AI + BI |
| SUBF | *AR1++(IR0), *AR0++(IR0), R5 | ; BI' = R5 = AI - BI |

2. BUTTERFLY: W^M/4

| ADDF | *+AR1, *AR0, R3 | ; AR' = R3 = AR + BI |
| :---: | :---: | :---: |
| LDF | *-AR7, R1 | ; R1 = 0 (FOR INNER LOOP) |
| LDF | *AR1++, R0 | ; R0 = BR (FOR INNER LOOP) |
| SUBF | *AR1++(IR0), *AR0++, R2 | ; BR' = R2 = AR - BI |
| STF | R6, *AR2++ | ; (AR' = R6) |
| STF | R7, *AR3++ | ; (BR' = R7) |
| STF | R5, *AR3++(IR0) | ; (BI' = R5) |

3. TO M. BUTTERFLY:

|  | LDF | *AR7++, R7 | ; R7 = COS, (AI' = R4) |
| :---: | :---: | :---: | :---: |
| \|\| | STF | R4, *AR2++(IR0) |  |
|  | LDF | *AR7++, R6 | ; R6 = SIN, (BR' = R2) |
| \|\| | STF | R2, *AR3++ |  |
|  | MPYF | *AR1, R6, R5 | ; R5 = BI * SIN, (AR' = R3) |
| \|\| | STF | R3, *AR2++ |  |
|  | ADDF | R1, R0, R2 | ; (R2 = TI = R0 + R1) |
|  | MPYF | *AR1, R7, R0 | ; R0 = BR * COS, (R3 = A |



|  | STF | R1, *AR1 |
| :---: | :---: | :---: |
| , |  |  |
| END: | NOP |  |
|  | NOP |  |
|  | NOP |  |
|  | NOP |  |
| * |  |  |
| SELF | BR | SELF |
|  | .end |  |

```
APPENDIX A4
```

COMPLEX, RADIX-2 DIT FFT : R2DITB.ASM

GENERIC PROGRAM FOR A FAST LOOPED-CODE RADIX-2 DIT FFT COMPUTATION ON THE TMS320C30
CAUERSTRASSE 7, D-8520 ERLANGEN, FRG

THE (COMPLEX) DATA RESIDE IN INTERNAL MEMORY. THE COMPUTATION IS DONE IN-PLACE, BUT THE RESULT IS MOVED TO ANOTHER MEMORY SECTION TO DEMONSTRATE THE BIT-REVERSED ADDRESSING.

FOR THIS PROGRAM THE MINIMUM FFT LENGTH IS 32 POINTS BECAUSE OF THE SEPARATE STAGES.

THE FIRST TWO PASSES ARE REALIZED AS A FOUR-BUTTERFLY LOOP SINCE THE MULTIPLIES ARE TRIVIAL. THE MULTIPLIER IS ONLY USED FOR A LOAD IN PARALLEL WITH AN ADDF OR SUBF.
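The bit-reversed addressing mentioned above can be emulated in software to see which output slot each result lands in. The following is a minimal sketch, not part of R2DITB.ASM: the helper names `reverse_bits`, `reverse_carry_add`, and `bit_reversed_sequence` are hypothetical, and the choice of half the table size as the address step is an assumption based on the usual reverse-carry convention.

```python
def reverse_bits(value, nbits):
    """Return value with its lowest nbits bits in reversed order."""
    result = 0
    for _ in range(nbits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def reverse_carry_add(address, step, nbits):
    """Add step to address with the carry propagating from MSB to LSB.

    This is emulated by reversing both operands, adding normally,
    discarding overflow, and reversing the sum back.
    """
    mask = (1 << nbits) - 1
    total = reverse_bits(address, nbits) + reverse_bits(step, nbits)
    return reverse_bits(total & mask, nbits)


def bit_reversed_sequence(nbits):
    """Addresses visited when stepping by half the table size (assumed step)."""
    size = 1 << nbits
    step = size >> 1
    address = 0
    sequence = []
    for _ in range(size):
        sequence.append(address)
        address = reverse_carry_add(address, step, nbits)
    return sequence


if __name__ == "__main__":
    print(bit_reversed_sequence(3))
```

For an 8-entry table the addresses come out as 0, 4, 2, 6, 1, 5, 3, 7, which is the bit-reversed ordering of 0 through 7.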

EXAMPLE FOR A 1024-POINT FFT (WITH BIT REVERSAL):

MEMORY SIZE:

| PROG | = 231 WORDS |
| :--- | :--- |

CYCLES PER BUTTERFLY:

| STAGES 1 AND 2 | = 4 |
| :--- | :--- |
| STAGES 3 TO 8 | = 8 |
| STAGE 9 | = 8.25 |
| STAGE 10 | = 10.5 (DUE TO EXT. MEMORY WAITS) |

| AVERAGE CYCLES/BUTTERFLY | = 7.475 |
| :--- | :--- |
| TOTAL BUTTERFLY CYCLES | = 38272 |
| INITIALIZATION OVERHEAD | = 2185 = 5.4 % OF TOTAL TIME |
| TOTAL NUMBER OF INSTRUCTION CYCLES | = 40457 |
| TOTAL TIME FOR A 1024-POINT FFT | = 2.42 ms (INCLUDING BIT REVERSAL) |
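The totals above follow directly from the per-stage cycle counts, and the short sketch below reproduces the arithmetic. The per-stage figures and the overhead are taken from the listing; the 60 ns instruction cycle time is an assumption (it is not stated in this header) chosen because it is consistent with the quoted 2.42 ms.

```python
FFT_LENGTH = 1024
STAGES = 10                            # log2(1024)
BUTTERFLIES_PER_STAGE = FFT_LENGTH // 2
INIT_OVERHEAD_CYCLES = 2185            # from the listing
CYCLE_TIME_NS = 60                     # assumption, not stated in the listing


def cycles_per_butterfly(stage):
    """Cycles per butterfly for a given stage (1-based), as listed."""
    if stage <= 2:
        return 4.0
    if stage <= 8:
        return 8.0
    if stage == 9:
        return 8.25
    return 10.5


butterfly_cycles = sum(
    cycles_per_butterfly(stage) * BUTTERFLIES_PER_STAGE
    for stage in range(1, STAGES + 1)
)
average_cycles = butterfly_cycles / (STAGES * BUTTERFLIES_PER_STAGE)
total_cycles = butterfly_cycles + INIT_OVERHEAD_CYCLES
overhead_percent = 100.0 * INIT_OVERHEAD_CYCLES / total_cycles
total_time_ms = total_cycles * CYCLE_TIME_NS * 1e-6

if __name__ == "__main__":
    print(butterfly_cycles, average_cycles, total_cycles)
    print(round(overhead_percent, 1), round(total_time_ms, 2))
```

Only the last stage pays the external-memory penalty, so its 10.5 cycles per butterfly raise the average only modestly above the 8 cycles of the middle stages.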

THIS PROGRAM INCLUDES THE FOLLOWING FILES:

THE FILE 'TWID1KBR.ASM' CONSISTS OF TWIDDLE FACTORS. THE TWIDDLE FACTORS ARE STORED IN BIT-REVERSED ORDER AND WITH A TABLE LENGTH OF N/2 (N = FFT LENGTH).

EXAMPLE: SHOWN FOR $N = 32$, $W_N(n) = \cos(2\pi n/N) - j\,\sin(2\pi n/N)$

| ADDRESS | COEFFICIENT |
| :--- | :--- |
| 0 | $\mathrm{R}\{W_N(0)\} = \cos(2\pi \cdot 0/32) = 1$ |
| 1 | $-\mathrm{I}\{W_N(0)\} = \sin(2\pi \cdot 0/32) = 0$ |
| 2 | $\mathrm{R}\{W_N(4)\} = \cos(2\pi \cdot 4/32) = 0.707$ |
| 3 | $-\mathrm{I}\{W_N(4)\} = \sin(2\pi \cdot 4/32) = 0.707$ |
| ... | ... |
| 12 | $\mathrm{R}\{W_N(3)\} = \cos(2\pi \cdot 3/32) = 0.831$ |
| 13 | $-\mathrm{I}\{W_N(3)\} = \sin(2\pi \cdot 3/32) = 0.556$ |

WHEN GENERATED FOR AN FFT LENGTH OF 1024, THE TABLE CAN BE USED FOR ALL AVAILABLE FFTS OF LESS OR EQUAL LENGTH.

THE MISSING TWIDDLE FACTORS (WN( ), WN( ), ...) ARE GENERATED BY USING THE SYMMETRY $W_N(N/4 + n) = -j\,W_N(n)$. THIS CAN BE EASILY REALIZED BY EXCHANGING THE REAL AND IMAGINARY PARTS OF THE TWIDDLE FACTORS AND BY NEGATING THE NEW REAL PART.

TO CHANGE THE FFT LENGTH, ONLY THE PARAMETERS IN THE HEADER OF TWID1KBR.ASM AND THE INPUT AND OUTPUT VECTOR LENGTHS NEED TO BE ALTERED.
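The table layout and the symmetry rule described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the generator used for the twiddle file: the function names are hypothetical, and the table is assumed to hold interleaved (cos, sin) pairs for n = 0 to N/4 - 1, which gives the stated length of N/2 words.

```python
import math


def reverse_bits(value, nbits):
    """Return value with its lowest nbits bits in reversed order."""
    result = 0
    for _ in range(nbits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def twiddle_table(n):
    """Interleaved (cos, sin) pairs of W_n(k), k < n/4, in bit-reversed order.

    Even addresses hold R{W_n(k)} = cos, odd addresses hold -I{W_n(k)} = sin.
    n is assumed to be a power of two and at least 8.
    """
    quarter = n // 4
    nbits = quarter.bit_length() - 1
    table = []
    for slot in range(quarter):
        k = reverse_bits(slot, nbits)
        angle = 2.0 * math.pi * k / n
        table.extend((math.cos(angle), math.sin(angle)))
    return table


def twiddle_from_symmetry(cos_k, sin_k):
    """Stored pair for W_n(n/4 + k), derived from the stored pair for W_n(k).

    Exchange the two stored values, then negate the new real part.
    """
    return (-sin_k, cos_k)


if __name__ == "__main__":
    table = twiddle_table(32)
    print(len(table), round(table[12], 3), round(table[13], 3))
```

With N = 32, addresses 12 and 13 hold the pair for n = 3, matching the example table, and applying `twiddle_from_symmetry` to that pair yields the stored values for n = 11.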

* $\mathrm{TR} = \mathrm{BR} \cdot \mathrm{COS} + \mathrm{BI} \cdot \mathrm{SIN}$
* $\mathrm{TI} = \mathrm{BR} \cdot \mathrm{SIN} - \mathrm{BI} \cdot \mathrm{COS}$
* $\mathrm{AR}' = \mathrm{AR} + \mathrm{TR}$
* $\mathrm{AI}' = \mathrm{AI} - \mathrm{TI}$
* $\mathrm{BR}' = \mathrm{AR} - \mathrm{TR}$
* $\mathrm{BI}' = \mathrm{AI} + \mathrm{TI}$
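The butterfly equations above are a real-arithmetic expansion of A' = A + B·W and B' = A - B·W with W = COS - j·SIN. The sketch below evaluates them and checks the result against complex arithmetic; the function name `dit_butterfly` and the test values are hypothetical.

```python
import math


def dit_butterfly(ar, ai, br, bi, cos_w, sin_w):
    """Radix-2 DIT butterfly using the TR/TI form from the listing.

    Returns (AR', AI', BR', BI') for the twiddle W = cos_w - j*sin_w.
    """
    tr = br * cos_w + bi * sin_w
    ti = br * sin_w - bi * cos_w
    return (ar + tr, ai - ti, ar - tr, ai + ti)


if __name__ == "__main__":
    angle = 2.0 * math.pi * 3 / 32
    a = complex(1.0, 2.0)
    b = complex(3.0, -4.0)
    w = complex(math.cos(angle), -math.sin(angle))

    ar2, ai2, br2, bi2 = dit_butterfly(
        a.real, a.imag, b.real, b.imag, math.cos(angle), math.sin(angle)
    )
    # Compare with the complex-valued definition of the butterfly.
    print(abs(complex(ar2, ai2) - (a + b * w)) < 1e-12)
    print(abs(complex(br2, bi2) - (a - b * w)) < 1e-12)
```

Note the sign convention: TI is the negated imaginary part of B·W, which is why it is subtracted in AI' and added in BI'.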


FIRST 2 STAGES AS RADIX-4 BUTTERFLY

FILL PIPELINE

| ADDF | *AR2, *ARO, R4 | ; $\mathrm{R} 4=A \mathrm{R}+\mathrm{CR}$ |
| :---: | :---: | :---: |
| Subs |  | ; $\mathrm{RS}=\mathrm{AR}-\mathrm{CR}$ |
| ADDF | *AR1, *AR3, R6 | ; Rb = $=\mathrm{DR}+\mathrm{BR}$ |
| SUBF | *AR1++, *AR3+t, R7 | ; $\mathrm{R} 7=\mathrm{DR}-\mathrm{BR}$ |
| ADDF | R6, R4, R0 | ; $A \mathrm{R}^{\prime}=\mathrm{RO}=\mathrm{R}^{4}+\mathrm{Rb}$ |
| MPYF | *AR3++, *AR7,R1 | ; $\mathrm{R} 1=\mathrm{DI}, \mathrm{BR}^{\prime}=\mathrm{R} 3=\mathrm{R} 4-\mathrm{Rb}$ |
| SUBF | R6, R4, R3 |  |
| ADDF | R1, *AR1, R0 | ; $\mathrm{RO}=\mathrm{BI}+\mathrm{DI}, \mathrm{AR}^{\prime}=\mathrm{RO}$ |
| STF | RO, *AR4+4 |  |
| SUBF | R1, *AR1++, R1 | ; $\mathrm{Rl}=\mathrm{BI}-\mathrm{DI}, \mathrm{BR}^{\prime}=\mathrm{R} 3$ |
| STF | R3, *AR5++ |  |
| ADDF | R1, R5, R2 | ; $\mathrm{CR}^{\prime}=\mathrm{RQ}=\mathrm{RS}+\mathrm{Rt}$ |
| MPYF | *+AR2, *AR7,R1 | ; $\mathrm{Rl}=\mathrm{Cl}, \mathrm{DR}^{\prime}=\mathrm{R} 3=\mathrm{RS}-\mathrm{Rl}$ |
| SUBF | R1, R5, R3 |  |
| ADDF | R1, *AR0, R2 | ; $\mathrm{R} 2=\mathrm{AI}+\mathrm{CI}, \mathrm{CR}^{\prime}=\mathrm{R} 2$ |
| STF | R2, *AR2++(IR1) |  |
| SUBF | R1, $\pm$ AR $0++$, R6 | ; $\mathrm{Rb}=\mathrm{AI}-\mathrm{CI}, \mathrm{DR}^{\prime}=\mathrm{R3}$ |
| STF | R3, *ARb++ |  |
| ADDF | F0, R2, R4 | ; $\mathrm{AI}^{\prime}=\mathrm{R} 4=\mathrm{R} 2+\mathrm{RO}$ |

RADIX-4 BUTTERFLY LOOP

| RPTB | BLK1 |  |
| :---: | :---: | :---: |
| MPYF | *AR2--, AR7,R0 | ; $\mathrm{RO}=\mathrm{CR},\left(\mathrm{Bl}^{\prime}=\mathrm{R} 2=\mathrm{R} 2-\mathrm{RO}\right)$ |
| SUBF | R0,R2, R2 |  |
| MPYF | *AR1++,*AR7, R1 | ; $\mathrm{RI}=\mathrm{BR},\left(\mathrm{Cl}{ }^{\prime}=\mathrm{R} 3=\mathrm{R} 6+\mathrm{R} 7\right)$ |
| ADDF | R7,R6, R3 |  |
| ADDF | R0, AARO, R4 | ; $\mathrm{R4}=\mathrm{AR}+\mathrm{CR},\left(\mathrm{AI}^{\prime}=\mathrm{R} 4\right)$ |
| STF | R4, *AR4++ |  |
| SUBF | R0, $*$ AROt+, RS | ; $\mathrm{RS}=\mathrm{AR}-\mathrm{CR},\left(\mathrm{BI}^{\prime}=\mathrm{R} 2\right)$ |
| STF | R2, *AR5++ |  |
| SUBF | R7,R6, R7 | ; $\left.{ }^{\text {DI }}{ }^{\prime}=\mathrm{R} 7=\mathrm{R6}-\mathrm{R} 7\right)$ |
| ADDF | R1, *AF3, R6 | ; Kb = $\mathrm{DR}+\mathrm{BR},\left(\mathrm{DI}{ }^{\prime}=\mathrm{R} 7\right)$ |
| STF | R7, *AR6++ |  |
| SUBF | R1, $*$ AR3 + +, R7 | ; $\mathrm{R} 7=\mathrm{LR}-\mathrm{BR},\left(\mathrm{Cl}{ }^{\prime}=\mathrm{R} 3\right)$ |
| STF | R3, *AR2++ |  |
| ADDF | R6, R4, R0 | ; $\mathrm{AR}^{\prime}=\mathrm{RO}=\mathrm{R}^{4}+\mathrm{Rb}$ |
| HPYF | *AR3++, *AR7,RI | ; $\mathrm{Rl}=\mathrm{DI}, \mathrm{BR}{ }^{\prime}=\mathrm{RS}=\mathrm{R} 4-\mathrm{Rb}$ |
| SUEF | R6, R4, R3 |  |
| ADDF | R1, *AR1,R0 | ; $\mathrm{RO}=\mathrm{BI}+\mathrm{DI}, \mathrm{AR}^{\prime}=\mathrm{RO}$ |
| STF | R0, *AR4++ |  |
| SUBF | R1, $\ddagger$ AR1++, R1 | ; $\mathrm{Rl}=\mathrm{BI}-\mathrm{DI}, \mathrm{BR}^{\prime}=\mathrm{R} 3$ |
| STF | R3, *AR5++ |  |
| ADDF | R1, R5, R2 | ; $\mathrm{CR}^{\prime}=\mathrm{R}_{2}=\mathrm{R}^{\prime}+\mathrm{R} 1$ |
| HPYF | *+AR2,*AR7, R1 | ; $\mathrm{Rl}=\mathrm{CI}, \mathrm{DR}^{\prime}=\mathrm{R} 3=\mathrm{RS}-\mathrm{Rl}$ |
| SUEF | R1, $\mathrm{R5}$, R3 |  |
| ADDF | R1, *ARO, R2 | ; $\mathrm{R2}=\mathrm{AI}+\mathrm{Cl}, \mathrm{CR}^{\prime}=\mathrm{R} 2$ |
| STF | R2, *AR2++(IR1) |  |
| SUBF | R1, $\pm$ AR $0++$, R6 | ; $\mathrm{RB}=\mathrm{AI}-\mathrm{CI}, \mathrm{DR}^{\prime}=\mathrm{R} 3$ |




|  | Subf | tAR1++, 4 ARO+t, R1 ; | ; $\mathrm{BI}^{\prime}=\mathrm{RI}=\mathrm{AI}-\mathrm{BI}$ |
| :---: | :---: | :---: | :---: |
|  | 2. BUTTEPFLY: W |  |  |
|  | ADDF | \#ARO, *AR1, R6 ; | $A R^{\prime}=R 6=A R+B R$ |
|  | SUBF | *AR1++, 4 ARO++, R7 ; | ; $\mathrm{R}^{\prime}=\mathrm{R} 7=A R-8 R$ |
|  | ADDF | *ARO, *AR1, R4 ; | AI ${ }^{\prime}=R 4=A I+B I$ |
|  | SUBF | *AR1++(IRO), $\pm A R 0++($ IRO $)$ | ROI, $\mathrm{RS} ; \mathrm{BI}^{\prime}=\mathrm{RS}=\mathrm{AI}-\mathrm{BI}$ |
|  | STF | $R 2, \pm A R 2++$ | $\left(A R^{\prime}=R Q\right)$ |
| $\because$ | STF | R3,*AR3 ${ }^{\text {a }}$ - ; | ; $\left(\mathrm{BR}^{\prime}=\mathrm{R} 3\right.$ ) |
|  | STF | $\mathrm{RO}, \mathrm{AAR2}^{2++}$; | ( $\left.\mathrm{Al}^{\prime}=\mathrm{RO}\right)$ |
| : | STF | R1, *AR3 ${ }^{\text {a }}$ - ; | ( Bl $^{\prime}=$ R1) |
|  | STF | R6, $\mathrm{ARP2} 2++$; | ; $A R^{\prime}=$ R6 |
| 14 | STF | R7, 4 AR3++ ; | ; $\mathrm{PR}^{\prime}=\mathrm{RT}$ |
|  | STF | R4, +AR2+ (1RO) ; | ; $\mathrm{Al}^{\prime}=\mathrm{R} 4$ |
| 14 | STF | R5, *AR3 + ( IRO) ; | ; $\mathrm{BI}^{\prime}=\mathrm{RS}$ |
| * | 3. BUTTEPFLY: UM/4 $^{\text {a }}$ |  |  |
|  | ADDF | HAROt+, + +AR1, RS ; | ; $A R^{\prime}=R 5=A R+B I$ |
|  | SUBF | *ARI, *ARO, R4 ; | ; $\mathrm{AI}^{\prime}=\mathrm{RL}=\mathrm{AI}-\mathrm{BR}$ |
|  | ADDF | *AR1+t, *ARO--,R6 ; | ; $\mathrm{BI}^{\prime}=\mathrm{RG}=\mathrm{AI}+\mathrm{BR}$ |
|  | SUBF | *AR1++,*AROt+, R7 ; | ; $\mathrm{BR}^{\prime}=\mathrm{RT}=\mathrm{AR}-\mathrm{BI}$ |
| 4. BUTTEPFLY: $W^{M} 1 / 4$ | 4. BUTTEPFLY: $\mathrm{W}^{\prime M / 4}$ |  |  |
|  | ADDF | *+AR1,*++ARO, R3 ; | ; $\mathrm{AR}^{\prime}=\mathrm{R} 3=A R+B I$ |
|  | LDF | *-AR7,R1 ; | RIL $=0$ (FOR INNER LOOP) |
| i: | LDF | +AR1++,RO ${ }^{\text {P }}$ | ; $\mathrm{FO}=\mathrm{BR}$ (FOR INEER LOOP) |
|  | SUBF | *AR1++(IRO) , *AFO++, R2 | ; $B R^{\prime}=R 2=A R-B I$ |
|  | STF | R5, *AR2 + + ; | ( $A R^{\prime}=$ RS $)$ |
| : | STF | R7, *AR3 ${ }^{\text {+ }}$ - ; | ( $\mathrm{BR}^{\prime}=\mathrm{R} 7$ ) |
|  | STF | R6, 4 AR3 ${ }^{\text {+ }}$ - ; | ( $\mathrm{BI}^{\prime}=\mathrm{R} 6$ ) |
|  | 5. TO M. Butterfly: |  |  |
|  | RPTB | bF2END |  |
|  | LPF | *AR7++,R7 ; | ; $\mathrm{R7}=\operatorname{COS},\left(\left(\mathrm{Al}^{\prime}=\mathrm{R4} 4\right)\right.$ |
| : 1 | STF | R4, ARR2++ |  |
|  | LDF | tAR7+4,R6 ; | ; $\mathrm{Rb}=\mathrm{SIN},\left(\mathrm{BR}^{\prime}=\mathrm{R} 2\right)$ |
| : | STF | R2, mer3++ |  |
|  | MPYF | + + ARI, R6, R5 ; | R5 = $\mathrm{BI} * \mathrm{SIN},\left(A \mathrm{R}^{\prime}=\mathrm{R} 3\right)$ |
| ! | STF | R3, tAR2++ |  |
|  | ADDF | R1,R0,R2 ; | ; $\mathrm{R} 2=\mathrm{TI}=\mathrm{RO}+\mathrm{RI})$ |
|  | NPYF | *AR1,R7,R0 $\quad$; | ; $\mathrm{RO}=\mathrm{BR}+\operatorname{COS},(\mathrm{RO}=\mathrm{AI}+\mathrm{II})$ |
| 11 | ADOF | R2, ${ }^{\text {A }}$ ARO, R3 |  |
|  | SUEF | $R 2, * A R 0++(I R O), R 4$; | ; $\left(\mathrm{R4}=\mathrm{AI}-\mathrm{TI}, \mathrm{BI}^{\prime}=\mathrm{R} 3\right)$ |
| : | STF | R3, *AR3+(IRO) |  |
|  | ADDF | R0,R5,R3 ; | ; $\mathrm{R} 3=\mathrm{TR}=\mathrm{RO}+\mathrm{RS}$ |
|  | HPYF | *AR1++,R6,RO ; | ; $R 2=B R * S I N, R 2=A R-T R$ |
| i | SUEF | R3, 4 ARP, R2 |  |
|  | IPYF | *AR1++,R7,R1 ; | ; $\mathrm{RI}=\mathrm{BI} * \operatorname{COS},\left(\mathrm{AI}^{\prime}=\mathrm{R4}\right)$ |
| $1:$ | STF | R4, +AR2++( IRO) |  |


|  | AbDF | *ARO++,R3,R5 | ; $\mathrm{RS}=\mathrm{AR}+\mathrm{TR}, \mathrm{BR}^{\prime}=\mathrm{R} 2$ |
| :---: | :---: | :---: | :---: |
| : 1 STF R2, AAR3++ |  |  |  |
| * |  |  |  |
|  | MPYF | * + AR1, R6, R5 | ; $\mathrm{RS}=\mathrm{BI} * \mathrm{SIN},\left(\mathrm{AR}^{\prime}=\mathrm{RS}\right)$ |
| : | STF | R5, *AR2++ |  |
|  | SUBF | R1, R0, R2 | ; ( $\mathrm{R} 2=\mathrm{TI}=\mathrm{R} 0-\mathrm{R} 1)$ |
|  | MPYF | *AR1,R7,R0 | ; $\mathrm{RO}=\mathrm{BR} * \operatorname{COS},(\mathrm{R} 3=\mathrm{AI}+\mathrm{II})$ |
| 14 | ADDF | R2, *ARO, R3 |  |
|  | SUBF | R2, 4 ARO+ + , 4 | ; $\left(\mathrm{R4}=\mathrm{AI}-\mathrm{TI}, \mathrm{BI}^{\prime}=\mathrm{R3}\right)$ |
| : | STF | R3, *AR3+4 |  |
|  | ADDF | R0, R5, R3 | ; $\mathrm{R} 3=\mathrm{TR}=\mathrm{RO}+\mathrm{RS}$ |
|  | MPYF | *AR1++,R6,R0 | ; $\mathrm{RO}=\mathrm{BR} * \mathrm{SIN}, \mathrm{R} 2=A R-T R$ |
| 14 | SUBF | R3, $\pm$ ARO, R 2 |  |
|  | IPYF | -AFR1++(1R0), R7, R1 | ; $\mathrm{Rl}=\mathrm{BI} \pm \cos ,\left(\mathrm{AI}^{\prime}=\mathrm{R4}\right)$ |
| : | SIF | R4, *AR2++ |  |
|  | ADDF | *AF0+t, R3, R3 | ; $\mathrm{R} 3=A R+T R, B R^{\prime}=\mathrm{R} 2$ |
| : | STF | R2, 4 AR3++ |  |
| , |  |  |  |
|  | IPYF | *+AR1, R7, RS | ; $\mathrm{RS}=\mathrm{BI} * \operatorname{Cos},\left(\mathrm{AR}^{\prime}=\mathrm{R} 3\right)$ |
| : | STF | R3, 4 AR2+ |  |
|  | SUBF | R1, RO, R2 | ; ( $\mathrm{R} 2=\mathrm{TI}=\mathrm{R} 0-\mathrm{R} 1)$ |
|  | IPYF | *AR1, R6, RO | ; $\mathrm{RO} 0=\mathrm{BR} * \mathrm{SIN},(\mathrm{R} 3=\mathrm{AI}+\mathrm{TI})$ |
| : | ADDF | R2, *ARO, R3 |  |
|  | SUBF | R2, 4 ARO++( IRO) , R4 | ; $\left(\mathrm{R} 4=\mathrm{AI}-\mathrm{II}, \mathrm{BI}^{\prime}=\mathrm{R} 3\right)$ |
| : | STF | R3, *AR3++(IRO) |  |
|  | SUBF | R0, RS, R3 | ; $\mathrm{R} 3=\mathrm{TR}=\mathrm{RS}-\mathrm{RO}$ |
|  | HPYF | *AR1++,R7,R0 | ; $\mathrm{RO}=\mathrm{BR} * \operatorname{COS}, \mathrm{R} 2=A R-T R$ |
| : | SUBF | R3, 4 ARO, R2 |  |
|  | IPYF | *AR1++, R6, R1 | ; RI $=$ BI $\# \operatorname{SIN},\left(\mathrm{AI}^{\prime}=\mathrm{R} 4\right)$ |
| : | STF | R4, *AR2+ (IRO) |  |
|  | ADDF | WARO+4,R3, R5 | ; $\mathrm{RS}=\mathrm{AR}+\mathrm{TR}, \mathrm{BR}^{\prime}=\mathrm{R} 2$ |
| : | STF | R2, *AP3+ |  |
|  |  |  |  |
|  | MPYF | *+AR1, R7, R5 | ; $\mathrm{RS}=\mathrm{BI} * \operatorname{COS},\left(\mathrm{AR}^{\prime}=\mathrm{RS}\right)$ |
| : 1 | STF | R5, *AR2+ |  |
|  | ADSF | R1,R0,R2 | ; $(\mathrm{R} 2=\mathrm{TI}=\mathrm{RO}+\mathrm{R} 1)$ |
|  | mpy | *AR1,R6, RO | ; $\mathrm{R} 0=\mathrm{BR} * \mathrm{SIN},(\mathrm{R} 3=\mathrm{AI}+\mathrm{TI})$ |
| $1:$ | ADDF | $R 2, \ldots A R 0, R 3$ |  |
|  | SUBF | R2, $\pm$ AR $0++$, R4 | ; $\left(\mathrm{R4}=\mathrm{AI}-\mathrm{TI}, \mathrm{y}(\mathrm{L})=\mathrm{BI}^{\prime}=\mathrm{R} 3\right)$ |
| : | STF | R3, *AR3++ |  |
|  | SUBF | R0, R5, R3 | ; $\mathrm{R} 3=\mathrm{TR}=\mathrm{RS}-\mathrm{RO}$ |
|  | MPYF | *AR1++,R7,RO | ; $\mathrm{RO}=\mathrm{BR} \times \operatorname{COS}, \mathrm{R} 2=A R-T R$ |
| 4 | SUBF | R3, 4 ARO, R 2 |  |
| BF2END | MPYF | *AR1++(IRO), R6, R1 | ; $\mathrm{Rl}=\mathrm{BI} * \mathrm{SIN}, \mathrm{R} 3=A R+\mathrm{TR}$ |
| : | ADDF | -ARO++, R3, R3 |  |
| - clear pipeline |  |  |  |
|  | STF | R2, 4 AR3+ | ; $\mathrm{BR}^{\prime}=\mathrm{R} 2, \mathrm{AI}^{\prime}=\mathrm{R}^{4}$ |
| : | STF | R4, \#AR2++ |  |
|  | ADDF | R1, R0, R2 | ; $\mathrm{R} 2=\mathrm{TI}=\mathrm{RO}+\mathrm{RL}$ |
|  | ADDF | R2, \#ARO, R3 | ; $\mathrm{R} 3=\mathrm{AI}+\mathrm{TI}, \mathrm{AR}^{\prime}=\mathrm{R} 3$ |
| : | STF | R3, *AR2++ |  |
|  | SUBF | R2, *AR0, R4 | ; $\mathrm{R4}=\mathrm{AI}-\mathrm{TI}, \mathrm{BI}^{\prime}=\mathrm{R} 3$ |

LAST STAGE WITH INTEGRATED BIT REVERSAL

| LDI | @INPUT, AR0 | ; UPPER INPUT |
| :---: | :---: | :---: |
| LDI | @OUTPUT, AR2 | ; REAL OUTPUT !!! |
| LDI | @INPUTP2, AR1 | ; LOWER INPUT |
| LDI | @OUTP1, AR3 | ; IMAGINARY OUTPUT !!! |
| LDI | @SINTP2, AR7 | ; POINTER TO TWIDDLE FACTORS |
| LDI | @FFTSIZ, IR0 | ; BIT REVERSAL |
| LDI | 3, IR1 | ; GROUP OFFSET |
| LDI | efgank, RC |  |

FILL PIPELINE

1. BUTTERFLY: W^0

| ADDF | *AR0, *AR1, R6 | ; AR' = R6 = AR + BR |
| :--- | :--- | :--- |
| SUBF | *AR1++, *AR0++, R7 | ; BR' = R7 = AR - BR |
| SUBF | *AR1, *AR0, R4 | ; BI' = R4 = AI - BI |
| ADDF | *AR1++(IR1), *AR0++(IR1), R5 | ; AI' = R5 = AI + BI |

2. BUTTERFLY: W^M/4

| SUBF | *+AR1, *AR0, R3 | ; BR' = R3 = AR - BI |
| :---: | :---: | :---: |
| LDF | *-AR7, R1 | ; R1 = 0 (FOR INNER LOOP) |
| LDF | *AR1++, R0 | ; R0 = BR (FOR INNER LOOP) |
| ADDF | *AR1++(IR1), *AR0++, R2 | ; AR' = R2 = AR + BI |
| STF | R6, *AR2++(IR0)B | ; (AR' = R6) |
| STF | R5, *AR3++(IR0)B | ; (AI' = R5) |
| STF | R7, *AR2++(IR0)B | ; (BR' = R7) |

3. TO M. BUTTERFLY:

RPTB BFLEND

17 CYCLES IF FFT SIZE < 1024 DUE TO THE USE OF INTERNAL MEMORY FOR BIT REVERSAL; 21 CYCLES IF FFT SIZE = 1024 DUE TO THE USE OF EXTERNAL MEMORY FOR BIT REVERSAL.

| LDF | *AR7++, R7 | ; R7 = COS, ((BI' = R4)) |
| :---: | :---: | :---: |
| STF | R4, *AR3++(IR0)B |  |
| LDF | *AR7++, R6 | ; R6 = SIN, (AR' = R2) |
| STF | R2, *AR2++(IR0)B |  |
| MPYF | *+AR1, R6, R5 | ; R5 = BI * SIN, (BR' = R3) |
| STF | R3, *AR2++(IR0)B |  |
| ADDF | R1, R0, R2 | ; (R2 = TI = R0 + R1) |
| MPYF | *AR1, R7, R0 | ; R0 = BR * COS, (AI' = R3 = AI - TI) |
| SUBF | R2, *AR0, R3 |  |
| ADDF | R2, *AR0++(IR1), R4 | ; (BI' = R4 = AI + TI, AI' = R3) |
| STF | R3, *AR3++(IR0)B |  |
| ADDF | R0, R5, R3 | ; R3 = TR = R0 + R5 |
| MPYF | *AR1++, R6, R0 | ; R0 = BR * SIN, AR' = R2 = AR + TR |

```
A ADDF R3, &ARO,R2
    MPFF *AR1++(IR1),R7,R
        R4,**RRO++,R3
R3,*ARO++,R3
        *+AR1,R7,RS ; R5 = BI * COS, (BR' = R3)
MPYF *+AR1,R7,RS
MPYF 
            R1,RO,R2 ; (R2 = T1 = R0 - R1)
            AARI,R6,RO ; ; RO=BR*SIN, (AI' = R3 =AI-TI)
    IPYF TARL,R6,RO
        ADDF lla, R2,ARO++(IR1),R4; ; (BI' = R4 =AI + II, AI' = R3)
        SADF llolm,
        SU|F
            ; RZ =TR = RO-RS
            *ARI++,R7,RO
            R3,*ARO,R2
```



```
- Clear pIpeline
STF R2,*AR2++(IRO)B
            STF 
                                    ; AR' = R2, (BI' = R4)
            R1,RO, R2
            ADDF 
            ADDF 
            R2,#AR2,R3
            R2,*ARO,R4
            R3,*AR3++( IRO)B
            R3,*AR3+
                                    ; R1 = BI * COS, (BI' = R4)
                                    ; BR' = R3 = AR - TR, AR' = R2
                                    4:
SUE
SUBFF R3,*ARO++,R3
    SUBF R1,RO,R2
:
:
:: S
:4
SUBF
R3,**RO++,R3
            ADDF R1,RO,R2
                                    ; R2 = TI = RO + R1 
ADD
            ADDF
                                    ; AI' = R3 = AI - TI, BR' = R3
* ENd OF FFT
* END:
                    NOF
*
SELF BR 
SELF
```


float $7.11432195745216 e-001$
fleat $7.02754744457225 e-00$
float $6.13588464915452 \mathrm{e}-00$
float $9.99981175282601 \mathrm{e}-001$

## Appendix B. Radix-4 Complex FFT

APPENDIX B1

* GENERIC PROGRAM TO DO A LOOPED-CODE RADIX-4 FFT COMPUTATION ON THE
* TMS320C30.
*
* THE PROGRAM IS TAKEN FROM THE BURRUS AND PARKS BOOK, P. 117. THE COMPLEX
* DATA RESIDE IN INTERNAL MEMORY, AND THE COMPUTATION IS DONE IN-PLACE.
*
* DATA IS INCLUDED IN A SEPARATE FILE TO PRESERVE THE GENERIC NATURE OF THE
* PROGRAM. FOR THE SAME PURPOSE, THE SIZE OF THE FFT N AND LOG4(N) ARE
*
* IN ORDER TO HAVE THE FINAL RESULT IN BIT-REVERSED ORDER, THE TWO MIDDLE
* BRANCHES OF THE RADIX-4 BUTTERFLY ARE INTERCHANGED DURING STORAGE. NOTE
* THIS DIFFERENCE WHEN COMPARING WITH THE PROGRAM ON P. 117 OF THE BURRUS
* AND PARKS BOOK.
*
* AUTHOR: PANOS E. PAPAMICHALIS
* TEXAS INSTRUMENTS, AUGUST 23, 1987

|  | .GLOBL | FFT | ; ENTRY POINT FOR EXECUTION |
| :---: | :---: | :---: | :---: |
|  | .GLOBL | N | ; FFT SIZE |
|  | .GLOBL | M | ; LOG4(N) |
|  | .GLOBL | SINE | ; ADDRESS OF SINE TABLE |
| * |  |  |  |
| INP | .USECT | "IN", 1024 | ; MEMORY WITH INPUT DATA |
|  | .TEXT |  |  |
| * |  |  |  |
| * INITIALIZE |  |  |  |
| * |  |  |  |
|  | .WORD | FFT | ; STARTING LOCATION OF THE PROGRAM |
|  | .SPACE | 100 | ; RESERVE 100 WORDS FOR VECTORS, ETC. |
| TEMP | .WORD | \$+2 |  |
| STORE | .WORD | FFTSIZ | ; BEGINNING OF TEMP STORAGE AREA |
|  | .WORD | N |  |
|  | .WORD | M |  |
|  | .WORD | SINE |  |
|  | .WORD | INP |  |
| * |  |  |  |
|  | .BSS | FFTSIZ, 1 | ; FFT SIZE |
|  | .BSS | LOGFFT, 1 | ; LOG4(FFTSIZ) |
|  | .BSS | SINTAB, 1 | ; SINE/COSINE TABLE BASE |
|  | .BSS | INPUT, 1 | ; AREA WITH INPUT DATA TO PROCESS |
|  | .BSS | STAGE, 1 | ; FFT STAGE \# |
|  | .BSS | RPTCNT, 1 | ; REPEAT COUNTER |
|  | .BSS | IEINDX, 1 | ; IE INDEX FOR SINE/COSINE |


|  | .BSS | LPCNT, 1 | ; SECOND-LOOP COUNT |
| :---: | :---: | :---: | :---: |
|  | .BSS | JT, 1 | ; JT COUNTER IN PROGRAM, P. 117 |
|  | .BSS | IA1, 1 | ; IA1 INDEX IN PROGRAM, P. 117 |
| FFT: |  |  |  |
| * |  |  | ; INITIALIZE DATA LOCATIONS |
|  | LDP | TEMP | ; COMMAND TO LOAD DATA PAGE POINTER |
|  | LDI | @TEMP, AR0 |  |
|  | LDI | @STORE, AR1 |  |
|  | LDI | *AR0++, R0 | ; XFER DATA FROM ONE MEMORY TO THE |
| * |  |  | ; OTHER |
|  | STI | R0, *AR1++ |  |
|  | LDI | *AR0++, R0 |  |
|  | STI | R0, *AR1++ |  |
|  | LDI | *AR0++, R0 |  |
|  | STI | R0, *AR1++ |  |
|  | LDI | *AR0, R0 |  |
|  | STI | R0, *AR1 |  |
| * |  |  |  |
|  | LDP | FFTSIZ | ; COMMAND TO LOAD DATA PAGE POINTER |
|  | LDI | @FFTSIZ, R0 |  |
|  | LDI | @FFTSIZ, IR0 |  |
|  | LDI | @FFTSIZ, IR1 |  |
|  | LDI | 0, AR7 |  |
|  | STI | AR7, @STAGE | ; @STAGE HOLDS THE CURRENT STAGE |
| * |  |  | ; NUMBER |
|  | LSH | 1, IR0 | ; IR0 = 2*N1 (BECAUSE OF REAL/IMAG) |
|  | LSH | -2, IR1 | ; IR1 = N/4, POINTER FOR SIN/COS TABLE |
|  | LDI | 1, AR7 |  |
|  | STI | AR7, @RPTCNT | ; INITIALIZE REPEAT COUNTER OF FIRST |
| * |  |  | ; LOOP |
|  | LSH | -2, R0 |  |
|  | STI | AR7, @IEINDX | ; INITIALIZE IE INDEX |
|  | ADDI | 2, R0 |  |
|  | STI | R0, @JT | ; JT = R0/2 + 2 |
|  | SUBI | 2, R0 |  |
|  | LSH | 1, R0 | ; R0 = N2 |
| * |  |  |  |
| * OUTER LOOP |  |  |  |
| LOOP: |  |  |  |
|  | LDI | @INPUT, AR0 | ; AR0 POINTS TO X(I) |
|  | ADDI | R0, AR0, AR1 | ; AR1 POINTS TO X(I1) |
|  | ADDI | R0, AR1, AR2 | ; AR2 POINTS TO X(I2) |
|  | ADDI | R0, AR2, AR3 | ; AR3 POINTS TO X(I3) |
|  | LDI | @RPTCNT, RC |  |
|  | SUBI | 1, RC | ; RC SHOULD BE ONE LESS THAN DESIRED \# |
| * |  |  |  |
| * FIRST LOOP |  |  |  |
|  | RPTB | BLK1 |  |
|  | ADDF | *+AR0, *+AR2, R1 | ; R1 = Y(I) + Y(I2) |
|  | ADDF | *+AR3, *+AR1, R3 | ; R3 = Y(I1) + Y(I3) |
|  | ADDF | R3, R1, R6 | ; R6 = R1 + R3 |


|  | SUBF | * + AR2, * + ARO, R4 | ; $\mathrm{R}^{4}=\mathrm{Y}(\mathrm{I})-\mathrm{Y}(\mathrm{L} 2)$ |  | LDI | EIAL,AR7 EIAI,AR4 |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  | STF | R6, ${ }^{\text {+ }+ \text { AR0 }}$ | ; $Y(1)=R 1+R 3$ |  | ADDI | ESINTAB, AR4 | ; CREATE COSINE INDEX AR4 |
|  | SUBF | R3,R1 | ; R1=R1-R3 |  | ADDI | AR4, AR7, AR5 | ; crente coske maex ma |
|  | LDF | *AR2,R5 | ; $\mathrm{RS}=\mathrm{X}(12)$ |  | SUBI | 1,AR5 | ; $\mathrm{IA} 2=1 \mathrm{I} 1+1 \mathrm{I} \mid-1$ |
| 14 | LDF | **ARI,R7 | ; $R 7=Y(11)$ |  | ADDI | AR7, AR5, AR6 |  |
|  | ADDF | *AR3, *AR1, R3 | ; $\mathrm{R} 3=\mathrm{x}(111)+\mathrm{x}(13)$ |  | SUBI | 1,AR6 | ; $\mathrm{IA} 3=1 \mathrm{~A} 2+1 \mathrm{I} 1-1$ |
|  | ADDF | R5, *ARO, R1 | ; $\mathrm{RI}=\mathrm{X}(1)+\mathrm{X}(12)$ | * |  |  | ; , $_{\text {a }}$ |
| : 1 | STF | R1, *+AR1 | ; $\mathrm{Y}(\mathrm{IL})=\mathrm{R1} 1-\mathrm{R} 3$ |  | SECOND LOOP |  |  |
|  | ADDF | R3,R1,R6 | ; R $6=\mathrm{R} 1+\mathrm{R} 3$ | * | Sean loor |  |  |
|  | SUBF | R5, *ARO, R2 | ; $\mathrm{R} 2=\mathrm{x}(1)-\mathrm{x}(12)$ |  | RPTB | RLK2 |  |
| : 1 | STF | R6, $\times$ ARO++(IRO) | ; $x(1)=R 1+R 3$ |  | ADDF | *+AR2, *+ARD, R3 | ; $\mathrm{R} 3=\mathrm{Y}(\mathrm{I})+\mathrm{Y}(12)$ |
|  | SUBE | R3,R1 | ; R1=R1-R3 |  | ADDF | **AR3, **AR1,R5 | ; $\mathrm{R} 5=\mathrm{Y}(\mathrm{II})+\mathrm{Y}(13)$ |
|  | SUBF | \#AR3, *AR1,R6 | ; R6=x(11)-x(13) |  | ADDF | R5, R3,R6 | ; $\mathrm{R} 6=\mathrm{R} 3+\mathrm{RS}$ |
|  | SUPF | R7, * + AR 3 , R3 | ; -R3=Y(11)-Y(13) !!! |  | SUBF | *+AR2, $*+$ ARO, R4 | ; $\mathrm{R} 4=\mathrm{Y}(\mathrm{I})-\mathrm{Y}(\mathrm{I} 2)$ |
| : ${ }^{\text {a }}$ | STF | R1, *AR1++(IRO) | ; $X(11)=R 1-\mathrm{R} 3$ |  | SUEF | RS,R3 | ; R3-R3-R5 |
|  | SUBF | R6, R4, RS | ; $\mathrm{RS}=\mathrm{R4} 4 \mathrm{R6}$ |  | ADDF | *AR2, *ARO, R1 | ; RI= ${ }^{\text {( }}$ (1) $+\mathrm{X}(\mathrm{I} 2)$ |
|  | ADDF | R6, R4 | ; $\mathrm{R} 4=\mathrm{R4} 4+\mathrm{Rb}$ |  | ADFF | *AR3, *AR1,R5 | ; $\mathrm{R} 5=\mathrm{x}(\mathrm{I} 1)+\mathrm{X}(\mathrm{I} 3)$ |
|  | STF | R5, * AR2 | ; $\mathrm{Y}(12)=R 4-\mathrm{Rb}$ |  | MPYF | R3, *+ARS (IR1), R6 | ; R6=R3*C02 |
| 4 | STF | R4, *+AR3 | ; $\mathrm{Y}(13)=\mathrm{R} 4+\mathrm{R} 6$ | : | STF | R6, $*+$ ARO | ; $\mathrm{Y}(\mathrm{I})=\mathrm{R} 3+\mathrm{R} 5$ |
|  | SUBF | R3,R2,R5 | ; RS $=$ R2-R3 ! ! |  | ADDF | RS, R1, R7 | ; R7 $=$ R1+R5 |
|  | ADDF | R3, R2 | ; R2=R2+R3 ! ! ! |  | SUBF | *AR2, *ARO, R2 | ; $\mathrm{R} 2=\mathrm{x}(1)-\mathrm{x}(12)$ |
| BLK1 | STF | R5, *AR2++(IR0) | ; $\mathrm{X}(12)=R 2-\mathrm{R} 3$ !!! |  | SUBF | RS, R1 | ; R1=R1-RS |
| : | STF | R2, 4 AR3++(IRO) | ; $\mathrm{X}(13)=R 2+\mathrm{R} 3$ !!! |  | MPYF | R1, *AR5, R7 | ; R7 $=$ R $1 * S 12$ |
| * |  |  |  | : | STF | R7, $*$ ARO + ( IRO) | ; $\mathrm{X}(1)=R \mathrm{R}+\mathrm{RS}$ |
|  | HIS IS | LAST STAGE, YOU |  |  | SUBF | R7,R6 | ; R = $=\mathrm{P} 3 * \mathrm{CO} 2-\mathrm{R} 1 * \mathrm{SI} 2$ |
| * |  |  |  |  | SUBF | *+AR3, *+AR1, RS | ; $\mathrm{RS}=\mathrm{Y}(11)-\mathrm{Y}(13)$ |
|  | LDI | EStace, AR7 |  |  | MPYF | R1, + + AR5(IR1), R7 | ; $\mathrm{R} 7=\mathrm{R} 1 \times \mathrm{CO} 2$ |
|  | ADDI | 1,AR7 |  | : | STF | R6, *+AR1 | ; $Y(I L)=R 3 * C 02-\mathrm{R} 1 * \mathrm{SI} 2$ |
|  | CMFI | COGFFT,AR7 |  |  | MPYF | R3, ARRS, R6 | ; R $6=\mathrm{R} 3 \times \mathrm{SI} 2$ |
|  | B7D | END |  |  | ADDF | R7,R6 | ; R6=R1*CO2+R3*SI2 |
|  | STI | AR7, ESTAGE | ; CURPENT FFT STAGE |  | ADLF | R5, R2, R1 | ; $\mathrm{Rl}=\mathrm{R} 2+\mathrm{RS}$ |
| * |  |  |  |  | SUBF | RS, R2 | ; $\mathrm{R} 2=\mathrm{R} 2$ - R 5 |
| * | INNER |  |  |  | SUEF | *AR3, *AR1, R5 | ; $\mathrm{RS}=\mathrm{x}(11)-\mathrm{x}(13)$ |
| * |  |  |  |  | SUBF | R5, R4, R3 | ; $83=R 4-\mathrm{RS}$ |
|  | LOI | 1,AR7 |  |  | ADDF | RS, R4 | ; $\mathrm{R} 4=\mathrm{R4}+\mathrm{RS}$ |
|  | SII | AR7, eIAI | ; INIT IAI INTEX |  | MPYF | R3, *+AR4 (IR1), R6 | ; $\mathrm{R} 6=\mathrm{R} 3 * \mathrm{CO}$ |
|  | LDI | 2,AR7 |  | : ${ }^{\text {a }}$ | STF | R6, $*$ AR1 $1+$ ( IRO) | ; $\mathrm{X}(\mathrm{II})=\mathrm{RL} 1 * \mathrm{CO} 2+\mathrm{R} 3 * \mathrm{SI2}^{2}$ |
|  | STI | AR7, $2 . P C N T$ | ; INIT LOOP COUNTER FOR INEER LOOP |  | MPYF | R1, ARR4, R7 | ; R7=R1*SI1 |
| INLOP: |  |  |  |  | SUBF | R7,R6 | ; R $6=\mathrm{R} 3+\mathrm{COL}-\mathrm{R} 1 * S I 1$ |
|  | LDI | 2,AR6 | ; increment iner loop counter |  | MPYF | R1, ${ }^{++ \text {AR4 ( }}$ (R1), R6 | ; $\mathrm{R} 6=\mathrm{R} 1 * \mathrm{COO}_{1}$ |
|  | ADDI | ELPCNT, ARS |  | : | STF | R6, + +AR2 | ; $\mathrm{Y}(\mathrm{I} 2)=\mathrm{R} 3 * C 01-\mathrm{RL} * \mathrm{SII}$ |
|  | 101 | CLPCNT,ARO |  |  | MPYF | R3, ${ }^{\text {AR } 4, ~ R 7 ~}$ | ; R7=R3*SI1 |
|  | LDI | EIA1,AR7 |  |  | ADIF | R7,R6 |  |
|  | ADDI | CIEINDX, AR7 | ; $\mathrm{IAI}=1 \mathrm{IA}+\mathrm{IE}$ |  | MPYF | R4, *+ARG(IR1),R6 | ; R6=R4*C03 |
|  | ADDI | EINPUT,ARO | ; $\mathrm{X}(\mathrm{I}), \mathrm{Y}(\mathrm{I})$ ) POINTER | i | STF | R6, $\ddagger$ AR2++( 1 R0) | ; $\mathrm{X}(12)=\mathrm{RL} *$ C01 $+\mathrm{R} 3 * \mathrm{SII}$ |
|  | SII | AR7, eIA1 |  |  | MPYF | R2, *AR6, R7 | ; R7=R2*SI3 |
|  | ADDI | RO, ARO,ARI | ; (X(II), Y(II)) POINTER |  | SUBF | R7,R6 | ; $\mathrm{R} 6=\mathrm{R4} 4 \mathrm{CO}^{2}-\mathrm{R} 24 \mathrm{SI} 3$ |
|  | STI | ARb, OLPCNT |  |  | MPYF | R2, ${ }^{\text {+ }}$ ARG( 1 (R1), R6 | ; R6=R2*C03 |
|  | ADDI | R0, AR1, AR2 | ; (X(12), Y(12) ) POINTER | i | STF | R6, ${ }^{+ \text {+ }}$ R 3 | ; $\mathrm{Y}(13)=\mathrm{R} 4 * \mathrm{COS}_{-\mathrm{R} 2 * S 13}$ |
|  | ADDI | R0,AR2,AR3 | ; (XII3),Y(I3)) POINTER |  | MPYF | R4, *AR6, R7 | ; R7=R4*SI3 |
|  | LDI | CRPTCNT, RC |  |  | ADIF | R7, R6 | ; R6 $=$ R2 4 C03 $+\mathrm{R4} 4 \mathrm{SI} 3$ |
|  | SUBI | 1,RC | ; RC Shoud be one less than desired * | BLK2 | STF | R6, AR $3++($ IRO $)$ | ; $\mathrm{X}(13)=\mathrm{R} 2+\operatorname{CO3+R4} 4 \mathrm{SI} 3$ |
|  | CIPI | EJT,ARb | ; IF LPCNT=J, 60 TO | * |  |  |  |
|  | B2D | SPCL | ; SPECIAL BUTTERFLY |  |  |  |  |



```
APPENDIX B2
NAME: fft_4 -- RADIX-4 COMPLEX FFT TO BE CALLED AS A C FUNCTION.
SYNOPSIS:
    int fft_4(N, M, data)
    int N          FFT SIZE: N=4**M
    int M          NUMBER OF STAGES = LOG4(N)
    float *data    ARRAY WITH INPUT AND OUTPUT DATA
```

DESCRIPTION:
GENERIC FUNCTION TO DO A RADIX-4 FFT COMPUTATION ON THE TMS320C30. THE DATA ARRAY IS 2*N-LONG, WITH REAL AND IMAGINARY VALUES ALTERNATING. THE PROGRAM IS BASED ON THE FORTRAN PROGRAM IN THE BURRUS AND PARKS BOOK, P. 117.

IN ORDER TO HAVE THE FINAL RESULT IN BIT-REVERSED ORDER, THE TWO MIDDLE BRANCHES OF THE RADIX-4 BUTTERFLY ARE INTERCHANGED DURING STORAGE. NOTE THIS DIFFERENCE WHEN COMPARING WITH THE PROGRAM ON P. 117. THE COMPUTATION IS DONE IN-PLACE, AND THE ORIGINAL DATA IS DESTROYED. BIT REVERSAL IS IMPLEMENTED AT THE END OF THE FUNCTION. IF THIS IS NOT NECESSARY, THIS PART CAN BE COMMENTED OUT. THE SINE/COSINE TABLE FOR THE TWIDDLE FACTORS IS EXPECTED TO BE SUPPLIED DURING LINK TIME, AND IT SHOULD HAVE THE FOLLOWING FORMAT:

THE VALUES value1, value2, ETC., ARE THE SINE WAVE VALUES. FOR AN N-POINT FFT, THERE ARE N+N/4 VALUES FOR A FULL AND A QUARTER PERIOD OF THE SINE WAVE. IN THIS WAY, A FULL SINE AND COSINE PERIOD ARE AVAILABLE (SUPERIMPOSED).
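The table layout described above can be sketched on a host machine as follows. This is a minimal illustration, not the original table generator: it assumes that entry k of the table holds sin(2*pi*k/N), so that a cosine value is read from the same table a quarter period (N/4 entries) further on. The names `make_sine_table_radix4` and `cosine_from_table` are hypothetical.

```python
import math


def make_sine_table_radix4(n):
    """Return N + N/4 sine samples: one full period plus a quarter period.

    Assumption: entry k holds sin(2*pi*k/N).
    """
    if n < 4 or n % 4 != 0:
        raise ValueError("n must be a positive multiple of 4")
    return [math.sin(2.0 * math.pi * k / n) for k in range(n + n // 4)]


def cosine_from_table(table, n, k):
    """Read cos(2*pi*k/N) from the sine table, offset by a quarter period."""
    return table[k + n // 4]


table = make_sine_table_radix4(16)
```

Because cos(x) = sin(x + pi/2), the extra N/4 entries are what make a full cosine period available from a single table.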

## STACK STRUCTURE UPON THE CALL:

| -FP(4) | DATA |
| :---: | :---: |
| -FP(3) | M |
| -FP(2) | N |
| -FP(1) | RETURN ADDR |
| -FP(0) | OLD FP |

REGISTERS USED: R0, R1, R2, R3, R4, R5, R6, R7, AR0, AR1, AR2, AR3, AR4, AR5, AR6, AR7, IR0, IR1, RS, RE, RC

AUTHOR: PANOS E. PAPAMICHALIS

TEXAS INSTRUMENTS


* MOVE ARGUMENTS TO LOCATIONS MATCHING THE NAMES IN THE PROGRAM


| * | MAIN INNER LOOP |  |  |
| :---: | :---: | :---: | :---: |
|  | LDI | 1,AR7 |  |
|  | STI | AR7,@IA1 | ; INIT IA1 INDEX |
|  | LDI | 2,AR7 |  |
|  | STI | AR7,@LPCNT | ; INIT LOOP COUNTER FOR INNER LOOP |
| INLOP: |  |  |  |
|  | LDI | 2,AR6 | ; INCREMENT INNER LOOP COUNTER |
|  | ADDI | @LPCNT,AR6 |  |
|  | LDI | @LPCNT,AR0 |  |
|  | LDI | @IA1,AR7 |  |
|  | ADDI | @IEINDX,AR7 | ; IA1=IA1+IE |
|  | ADDI | @INPUT,AR0 | ; (X(I),Y(I)) POINTER |
|  | STI | AR7,@IA1 |  |
|  | ADDI | R0,AR0,AR1 | ; (X(I1),Y(I1)) POINTER |
|  | STI | AR6,@LPCNT |  |
|  | ADDI | R0,AR1,AR2 | ; (X(I2),Y(I2)) POINTER |
|  | ADDI | R0,AR2,AR3 | ; (X(I3),Y(I3)) POINTER |
|  | LDI | @RPTCNT,RC |  |
|  | SUBI | 1,RC | ; RC SHOULD BE ONE LESS THAN DESIRED # |
|  | CMPI | @JT,AR6 | ; IF LPCNT=JT, GO TO |
|  | BZD | SPCL | ; SPECIAL BUTTERFLY |
|  | LDI | @IA1,AR7 |  |
|  | LDI | @IA1,AR4 |  |
|  | ADDI | @SINTAB,AR4 | ; CREATE COSINE INDEX AR4 |
|  | ADDI | AR4,AR7,AR5 |  |
|  | SUBI | 1,AR5 | ; IA2=IA1+IA1-1 |
|  | ADDI | AR7,AR5,AR6 |  |
|  | SUBI | 1,AR6 | ; IA3=IA2+IA1-1 |
SECOND LOOP

|  | RPTB | BLK2 |  |
| :---: | :---: | :---: | :---: |
|  | ADDF | *+AR2,*+AR0,R3 | ; R3=Y(I)+Y(I2) |
|  | ADDF | *+AR3,*+AR1,R5 | ; R5=Y(I1)+Y(I3) |
|  | ADDF | R5,R3,R6 | ; R6=R3+R5 |
|  | SUBF | *+AR2,*+AR0,R4 | ; R4=Y(I)-Y(I2) |
|  | SUBF | R5,R3 | ; R3=R3-R5 |
|  | ADDF | *AR2,*AR0,R1 | ; R1=X(I)+X(I2) |
|  | ADDF | *AR3,*AR1,R5 | ; R5=X(I1)+X(I3) |
|  | MPYF | R3,*+AR5(IR1),R6 | ; R6=R3*CO2 |
| \|\| | STF | R6,*+AR0 | ; Y(I)=R3+R5 |
|  | ADDF | R5,R1,R7 | ; R7=R1+R5 |
|  | SUBF | *AR2,*AR0,R2 | ; R2=X(I)-X(I2) |
|  | SUBF | R5,R1 | ; R1=R1-R5 |
|  | MPYF | R1,*AR5,R7 | ; R7=R1*SI2 |
| \|\| | STF | R7,*AR0++(IR0) | ; X(I)=R1+R5 |
|  | SUBF | R7,R6 | ; R6=R3*CO2-R1*SI2 |
|  | SUBF | *+AR3,*+AR1,R5 | ; R5=Y(I1)-Y(I3) |
|  | MPYF | R1,*+AR5(IR1),R7 | ; R7=R1*CO2 |
| \|\| | STF | R6,*+AR1 |  |
|  | MPYF | R3,*AR5,R6 | ; R6=R3*SI2 |
|  | ADDF | R7,R6 | ; R6=R1*CO2+R3*SI2 |


|  | ADDF | R5,R2,R1 | ; R1=R2+R5 |
| :---: | :---: | :---: | :---: |
|  | SUBF | R5,R2 | ; R2=R2-R5 |
|  | SUBF | *AR3,*AR1,R5 | ; R5=X(I1)-X(I3) |
|  | SUBF | R5,R4,R3 | ; R3=R4-R5 |
|  | ADDF | R5,R4 | ; R4=R4+R5 |
|  | MPYF | R3,*+AR4(IR1),R6 | ; R6=R3*CO1 |
| \|\| | STF | R6,*AR1++(IR0) | ; X(I1)=R1*CO2+R3*SI2 |
|  | MPYF | R1,*AR4,R7 | ; R7=R1*SI1 |
|  | SUBF | R7,R6 | ; R6=R3*CO1-R1*SI1 |
|  | MPYF | R1,*+AR4(IR1),R6 | ; R6=R1*CO1 |
| \|\| | STF | R6,*+AR2 | ; Y(I2)=R3*CO1-R1*SI1 |
|  | MPYF | R3,*AR4,R7 | ; R7=R3*SI1 |
|  | ADDF | R7,R6 | ; R6=R1*CO1+R3*SI1 |
|  | MPYF | R4,*+AR6(IR1),R6 | ; R6=R4*CO3 |
| \|\| | STF | R6,*AR2++(IR0) | ; X(I2)=R1*CO1+R3*SI1 |
|  | MPYF | R2,*AR6,R7 | ; R7=R2*SI3 |
|  | SUBF | R7,R6 | ; R6=R4*CO3-R2*SI3 |
|  | MPYF | R2,*+AR6(IR1),R6 | ; R6=R2*CO3 |
| \|\| | STF | R6,*+AR3 | ; Y(I3)=R4*CO3-R2*SI3 |
|  | MPYF | R4,*AR6,R7 | ; R7=R4*SI3 |
|  | ADDF | R7,R6 | ; R6=R2*CO3+R4*SI3 |
| BLK2 | STF | R6,*AR3++(IR0) | ; X(I3)=R2*CO3+R4*SI3 |
| * |  |  |  |
|  | CMPI | @LPCNT,R0 |  |
|  | BP | INLOP | ; LOOP BACK TO THE INNER LOOP |
|  | BR | CONT |  |
| * |  |  |  |
| * | SPECIAL BUTTERFLY FOR W=J |  |  |
| * |  |  |  |
| SPCL | LDI | IR1,AR4 |  |
|  | LSH | -1,AR4 | ; POINT TO SIN(45) |
|  | ADDI | @SINTAB,AR4 | ; CREATE COSINE INDEX AR4=CO21 |
| * |  |  |  |
|  | RPTB | BLK3 |  |
|  | ADDF | *AR2,*AR0,R1 | ; R1=X(I)+X(I2) |
|  | SUBF | *AR2,*AR0,R2 | ; R2=X(I)-X(I2) |
|  | ADDF | *+AR2,*+AR0,R3 | ; R3=Y(I)+Y(I2) |
|  | SUBF | *+AR2,*+AR0,R4 | ; R4=Y(I)-Y(I2) |
|  | ADDF | *AR3,*AR1,R5 | ; R5=X(I1)+X(I3) |
|  | SUBF | R1,R5,R6 | ; R6=R5-R1 |
|  | ADDF | R5,R1 | ; R1=R1+R5 |
|  | ADDF | *+AR3,*+AR1,R5 | ; R5=Y(I1)+Y(I3) |
|  | SUBF | R5,R3,R7 | ; R7=R3-R5 |
|  | ADDF | R5,R3 | ; R3=R3+R5 |
|  | STF | R3,*+AR0 | ; Y(I)=R3+R5 |
| \|\| | STF | R1,*AR0++(IR0) | ; X(I)=R1+R5 |
|  | SUBF | *AR3,*AR1,R1 | ; R1=X(I1)-X(I3) |
|  | SUBF | *+AR3,*+AR1,R3 | ; R3=Y(I1)-Y(I3) |
|  | STF | R6,*+AR1 | ; Y(I1)=R5-R1 |
| \|\| | STF | R7,*AR1++(IR0) | ; X(I1)=R3-R5 |
|  | ADDF | R3,R2,R5 | ; R5=R2+R3 |
|  | SUBF | R2,R3,R2 | ; R2=-R2+R3 !!! |
|  | SUBF | R1,R4,R3 | ; R3=R4-R1 |
|  | ADDF | R1,R4 | ; R4=R4+R1 |


|  | SUBF | R5,R3,R1 | ; R1=R3-R5 |
| :---: | :---: | :---: | :---: |
|  | MPYF | *AR4,R1 | ; R1=R1*CO21 |
|  | ADDF | R5,R3 | ; R3=R3+R5 |
|  | MPYF | *AR4,R3 | ; R3=R3*CO21 |
| \|\| | STF | R1,*+AR2 | ; Y(I2)=(R3-R5)*CO21 |
|  | SUBF | R4,R2,R1 | ; R1=R2-R4 !!! |
|  | MPYF | *AR4,R1 | ; R1=R1*CO21 |
| \|\| | STF | R3,*AR2++(IR0) | ; X(I2)=(R3+R5)*CO21 |
|  | ADDF | R4,R2 | ; R2=R2+R4 !!! |
|  | MPYF | *AR4,R2 | ; R2=R2*CO21 !!! |
| BLK3 | STF | R1,*+AR3 | ; Y(I3)=-(R4-R2)*CO21 !!! |
| \|\| | STF | R2,*AR3++(IR0) | ; X(I3)=(R4+R2)*CO21 !!! |
|  | CMPI | @LPCNT,R0 |  |
|  | BPD | INLOP | ; LOOP BACK TO THE INNER LOOP |
| * |  |  |  |
| CONT | LDI | @RPTCNT,AR7 |  |
|  | LDI | @IEINDX,AR6 |  |
|  | LSH | 2,AR7 | ; INCREMENT REPEAT COUNTER FOR NEXT TIME |
|  | STI | AR7,@RPTCNT |  |
|  | LSH | 2,AR6 | ; IE=4*IE |
|  | STI | AR6,@IEINDX |  |
|  | LDI | R0,IR0 | ; N1=N2 |
|  | LSH | -3,R0 |  |
|  | ADDI | 2,R0 |  |
|  | STI | R0,@JT | ; JT=N2/2+2 |
|  | SUBI | 2,R0 |  |
|  | LSH | 1,R0 | ; N2=N2/4 |
|  | BR | LOOP | ; NEXT FFT STAGE |
| * |  |  |  |
| * | DO THE BIT-REVERSING OF THE OUTPUT |  |  |
| * |  |  |  |
| END: | LDI | @FFTSIZ,RC | ; RC=N |
|  | SUBI | 1,RC | ; RC SHOULD BE ONE LESS THAN DESIRED |
|  | LDI | @FFTSIZ,IR0 | ; IR0=SIZE OF FFT=N |
|  | LDI | @INPUT,AR0 |  |
|  | LDI | @INPUT,AR1 |  |
| * |  |  |  |
|  | RPTB | BITRV |  |
|  | CMPI | AR0,AR1 |  |
|  | BGE | CONT |  |
|  | LDF | *AR0,R0 |  |
| \|\| | LDF | *AR1,R1 |  |
|  | STF | R0,*AR1 |  |
| \|\| | STF | R1,*AR0 |  |
|  | LDF | *+AR0(1),R0 |  |
| \|\| | LDF | *+AR1(1),R1 |  |
|  | STF | R0,*+AR1(1) |  |
| \|\| | STF | R1,*+AR0(1) |  |
| CONT | NOP | *++AR0(2) |  |
| BITRV | NOP | *AR1++(IR0)B |  |
| * |  |  |  |
| * | RESTORE THE REGISTER VALUES AND RETURN |  |  |




## Appendix C. Radix-2 Real FFT

## APPENDIX C1

GENERIC PROGRAM TO DO A RADIX-2 REAL FFT COMPUTATION ON THE TMS320C30.

THE PROGRAM IS TAKEN FROM THE PAPER BY SORENSEN ET AL., JUNE 1987 ISSUE OF THE TRANSACTIONS ON ASSP.

THE (REAL) DATA RESIDE IN INTERNAL MEMORY. THE COMPUTATION IS DONE IN-PLACE. THE BIT REVERSAL IS DONE AT THE BEGINNING OF THE PROGRAM.

THE TWIDDLE FACTORS ARE SUPPLIED IN A TABLE PUT IN A .DATA SECTION. THIS DATA IS INCLUDED IN A SEPARATE FILE TO PRESERVE THE GENERIC NATURE OF THE PROGRAM. FOR THE SAME PURPOSE, THE SIZE OF THE FFT N AND LOG2(N) ARE DEFINED IN A .GLOBL DIRECTIVE AND SPECIFIED DURING LINKING. THE LENGTH OF THE TABLE IS N/4 + N/4 = N/2.

AUTHOR: PANOS E. PAPAMICHALIS

SEPTEMBER 8, 1987
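
The in-place bit reversal mentioned in the description can be modeled on a host machine as below. This is a hedged sketch rather than a translation of the listing: it assumes the usual scheme in which one index counts up normally while a second index is advanced by N/2 with reverse-carry propagation, and two locations are exchanged only when the first index is smaller than the second. The function names are hypothetical.

```python
def bit_reverse_increment(j, half):
    """Add `half` to j with the carry propagating toward the least significant bit."""
    bit = half
    while bit and (j & bit):
        j ^= bit      # clear this bit and carry into the next lower one
        bit >>= 1
    return j | bit


def bit_reverse_in_place(x):
    """Permute a list of power-of-two length into bit-reversed order, in place."""
    n = len(x)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    j = 0
    for i in range(n):
        if i < j:                 # exchange each pair only once
            x[i], x[j] = x[j], x[i]
        j = bit_reverse_increment(j, n // 2)
    return x
```

Applying the permutation twice restores the original order, which is a convenient sanity check when comparing against results from the target.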

|  | .GLOBL | FFT | ; ENTRY POINT FOR EXECUTION |
| :---: | :---: | :---: | :---: |
|  | .GLOBL | N | ; FFT SIZE |
|  | .GLOBL | M | ; LOG2(N) |
|  | .GLOBL | SINE | ; ADDRESS OF SINE TABLE |
| * |  |  |  |
| INP | .USECT | "IN",1024 | ; MEMORY WITH INPUT DATA |
|  | .BSS | OUTP,1024 | ; MEMORY WITH OUTPUT DATA |
|  | .TEXT |  |  |
| * | INITIALIZE |  |  |
|  | .WORD | FFT | ; STARTING LOCATION OF THE PROGRAM |
|  | .SPACE | 100 | ; RESERVE 100 WORDS FOR VECTORS, ETC. |
| * |  |  |  |
| FFTSIZ | .WORD | N |  |
| LOGFFT | .WORD | M |  |
| SINTAB | .WORD | SINE |  |
| INPUT | .WORD | INP |  |
| OUTPUT | .WORD | OUTP |  |
| * |  |  |  |
| FFT: | LDP | FFTSIZ | ; COMMAND TO LOAD DATA PAGE POINTER |
| * | DO THE BIT-REVERSING AT THE BEGINNING |  |  |
|  | LDI | @FFTSIZ,RC | ; RC=N |
|  | SUBI | 1,RC | ; RC SHOULD BE ONE LESS THAN DESIRED |
|  | LDI | @FFTSIZ,IR0 |  |
|  | LSH | -1,IR0 | ; IR0=HALF THE SIZE OF FFT=N/2 |
|  | LDI | @INPUT,AR0 |  |
|  | LDI | @INPUT,AR1 |  |
|  | RPTB | BITRV |  |


|  | CMPI | AR1,AR0 | ; XCHANGE LOCATIONS ONLY |
| :---: | :---: | :---: | :---: |
|  | BGE | CONT | ; IF AR0<AR1 |
|  | LDF | *AR0,R0 |  |
| \|\| | LDF | *AR1,R1 |  |
|  | STF | R0,*AR1 |  |
| \|\| | STF | R1,*AR0 |  |
| CONT | NOP | *AR0++ |  |
| BITRV | NOP | *AR1++(IR0)B |  |
| * |  |  |  |
| * | LENGTH-TWO BUTTERFLIES |  |  |
| * |  |  |  |
|  | LDI | @INPUT,AR0 | ; AR0 POINTS TO X(I) |
|  | LDI | IR0,RC | ; REPEAT N/2 TIMES |
|  | SUBI | 1,RC | ; RC SHOULD BE ONE LESS THAN DESIRED # |
| * |  |  |  |
|  | RPTB | BLK1 |  |
|  | ADDF | *+AR0,*AR0++,R0 | ; R0=X(I)+X(I+1) |
|  | SUBF | *AR0,*-AR0,R1 | ; R1=X(I)-X(I+1) |
| BLK1 | STF | R0,*-AR0 | ; X(I)=X(I)+X(I+1) |
| \|\| | STF | R1,*AR0++ | ; X(I+1)=X(I)-X(I+1) |
| * | FIRST PASS OF THE DO-20 LOOP (STAGE K=2 IN DO-10 LOOP) |  |  |
| * |  |  |  |
|  | LDI | @INPUT,AR0 | ; AR0 POINTS TO X(I) |
|  | LDI | 2,IR0 | ; IR0=2=N2 |
|  | LDI | @FFTSIZ,RC |  |
|  | LSH | -2,RC | ; REPEAT N/4 TIMES |
|  | SUBI | 1,RC | ; RC SHOULD BE ONE LESS THAN DESIRED # |
| * |  |  |  |
|  | RPTB | BLK2 |  |
|  | ADDF | *+AR0(IR0),*AR0++(IR0),R0 | ; R0=X(I)+X(I+2) |
|  | SUBF | *AR0,*-AR0(IR0),R1 | ; R1=X(I)-X(I+2) |
|  | NEGF | *+AR0,R0 | ; R0=-X(I+3) |
| \|\| | STF | R0,*-AR0(IR0) | ; X(I)=X(I)+X(I+2) |
| BLK2 | STF | R1,*AR0++(IR0) | ; X(I+2)=X(I)-X(I+2) |
| \|\| | STF | R0,*+AR0 | ; X(I+3)=-X(I+3) |
| * |  |  |  |
| * | MAIN LOOP (FFT STAGES) |  |  |
| * |  |  |  |
|  | LDI | @FFTSIZ,IR0 |  |
|  | LSH | -2,IR0 | ; IR0=INDEX FOR E |
|  | LDI | 3,R5 | ; R5 HOLDS THE CURRENT STAGE NUMBER |
|  | LDI | 1,R4 | ; R4=N4 |
|  | LDI | 2,R3 | ; R3=N2 |
| LOOP | LSH | -1,IR0 | ; E=E/2 |
|  | LSH | 1,R4 | ; N4=2*N4 |
|  | LSH | 1,R3 | ; N2=2*N2 |
| * | INNER LOOP (DO-20 LOOP IN THE PROGRAM) |  |  |
| * |  |  |  |
|  | LDI | @INPUT,AR5 | ; AR5 POINTS TO X(I) |
| INLOP | LDI | IR0,AR0 |  |
|  | ADDI | @SINTAB,AR0 | ; AR0 POINTS TO SIN/COS TABLE |
|  | LDI | R4,IR1 | ; IR1=N4 |


|  | LDI | AR5,AR1 |  |
| :---: | :---: | :---: | :---: |
|  | ADDI | 1,AR1 | ; AR1 POINTS TO X(I1)=X(I+J) |
|  | LDI | AR1,AR3 |  |
|  | ADDI | R3,AR3 | ; AR3 POINTS TO X(I3)=X(I+J+N2) |
|  | LDI | AR3,AR2 |  |
|  | SUBI | 2,AR2 | ; AR2 POINTS TO X(I2)=X(I-J+N2) |
|  | ADDI | R3,AR2,AR4 | ; AR4 POINTS TO X(I4)=X(I-J+N1) |
|  | LDF | *AR5++(IR1),R0 | ; R0=X(I) |
|  | ADDF | *+AR5(IR1),R0,R1 | ; R1=X(I)+X(I+N2) |
|  | SUBF | R0,*+AR5(IR1),R0 | ; R0=-X(I)+X(I+N2) |
| \|\| | STF | R1,*-AR5(IR1) | ; X(I)=X(I)+X(I+N2) |
|  | NEGF | R0 | ; R0=X(I)-X(I+N2) |
|  | NEGF | *++AR5(IR1),R1 | ; R1=-X(I+N4+N2) |
| \|\| | STF | R0,*AR5 | ; X(I+N2)=X(I)-X(I+N2) |
|  | STF | R1,*AR5 | ; X(I+N4+N2)=-X(I+N4+N2) |
| * | INNERMOST LOOP |  |  |
|  | ADDI | 1,R5 |  |
|  | CMPI | @LOGFFT,R5 |  |
|  | BLE | LOOP |  |

```
APPENDIX C2
NAME:
fft_rl --- RADIX-2 REAL FFT TO BE CALLED AS A C FUNCTION
SYNOPSIS:
    int fft_rl(N, M, data)
    int N          FFT SIZE: N=2**M
    int M          NUMBER OF STAGES = LOG2(N)
    float *data    ARRAY WITH INPUT AND OUTPUT DATA
```

DESCRIPTION:
GENERIC FUNCTION TO DO A RADIX-2 FFT COMPUTATION ON THE TMS320C30.
THE DATA ARRAY IS N-LONG, WITH ONLY REAL DATA. THE OUTPUT IS STORED
IN THE SAME LOCATIONS WITH REAL AND IMAGINARY POINTS R AND I AS
FOLLOWS: $R(0), R(1), \ldots, R(N / 2), I(N / 2-1), \ldots, I(1)$
THE PROGRAM IS BASED ON THE FORTRAN PROGRAM IN THE PAPER BY SORENSEN
ET AL., JUNE 1987 ISSUE OF TRANS. ON ASSP. THE COMPUTATION IS DONE
IN-PLACE, AND THE ORIGINAL DATA IS DESTROYED. BIT REVERSAL IS
IMPLEMENTED AT THE BEGINNING OF THE FUNCTION. IF THIS IS NOT
NECESSARY, THIS PART CAN BE COMMENTED OUT.
THE SINE/COSINE TABLE FOR THE TWIDDLE FACTORS IS EXPECTED TO BE
SUPPLIED DURING LINK TIME, AND IT SHOULD HAVE THE FOLLOWING FORMAT:
.global _sine
.data
.float value1 = sin(0*2*pi/N)
.float value2 = sin(1*2*pi/N)
.float value(N/2) = cos((N/4)*2*pi/N)

THE VALUES value1 TO value(N/4) ARE THE FIRST QUARTER OF THE SINE PERIOD AND value(N/4+1) TO value(N/2) ARE THE FIRST QUARTER OF THE COSINE PERIOD.
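
The packed output order R(0), R(1), ..., R(N/2), I(N/2-1), ..., I(1) can be unpacked on the host side as sketched below. This is an illustration only: it assumes I(0) and I(N/2) are zero (as they are for real input), keeps whatever sign convention the routine stores, and uses a hypothetical function name.

```python
def unpack_real_fft(packed):
    """Convert [R(0..N/2), I(N/2-1..1)] into the N/2+1 complex bins X(0..N/2)."""
    n = len(packed)
    if n < 2 or n % 2:
        raise ValueError("length must be even and at least 2")
    half = n // 2
    bins = [complex(packed[0], 0.0)]            # X(0) is purely real
    for k in range(1, half):
        # R(k) sits at index k, I(k) sits at index N-k
        bins.append(complex(packed[k], packed[n - k]))
    bins.append(complex(packed[half], 0.0))     # X(N/2) is purely real
    return bins
```

The remaining bins X(N/2+1), ..., X(N-1) are not stored because, for real input, they are the complex conjugates of X(N/2-1), ..., X(1).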

STACK STRUCTURE UPON THE CALL:

| -FP(4) | DATA |
| :---: | :---: |
| -FP(3) | M |
| -FP(2) | N |
| -FP(1) | RETURN ADDR |
| -FP(0) | OLD FP |

REGISTERS USED: R0, R1, R2, R3, R4, R5, AR0, AR1, AR2, AR4, AR5, IR0,
IR1, RS, RE, RC
AUTHOR: PANOS E. PAPAMICHALIS, TEXAS INSTRUMENTS

OCTOBER 13, 1987


| FP | .SET | AR3 |  |
| :---: | :---: | :---: | :---: |
|  | .GLOBL | _FFT_RL | ; ENTRY POINT FOR EXECUTION |
|  | .GLOBL | _SINE | ; ADDRESS OF SINE TABLE |
| * |  |  |  |
|  | .BSS | FFTSIZ,1 |  |
|  | .BSS | LOGFFT,1 |  |
|  | .BSS | INPUT,1 |  |
| * |  |  |  |
|  | .TEXT |  |  |
| * |  |  |  |
| SINTAB | .WORD | _SINE |  |
| * |  |  |  |
| * | INITIALIZE C FUNCTION |  |  |
| * |  |  |  |
| _FFT_RL: | PUSH | FP | ; SAVE DEDICATED REGISTERS |
|  | LDI | SP,FP |  |
|  | PUSH | R4 |  |
|  | PUSH | R5 |  |
|  | PUSH | AR4 |  |
|  | PUSH | AR5 |  |
| * |  |  |  |
|  | LDI | *-FP(2),R0 | ; MOVE ARGUMENTS TO LOCATIONS MATCHING |
|  | STI | R0,@FFTSIZ | ; THE NAMES IN THE PROGRAM |
|  | LDI | *-FP(3),R0 |  |
|  | STI | R0,@LOGFFT |  |
|  | LDI | *-FP(4),R0 |  |
|  | STI | R0,@INPUT |  |
| * | DO THE BIT REVERSING AT THE BEGINNING |  |  |
| * |  |  |  |
|  | LDI | @FFTSIZ,RC | ; RC=N |
|  | SUBI | 1,RC | ; RC SHOULD BE ONE LESS THAN DESIRED |
|  | LDI | @FFTSIZ,IR0 |  |
|  | LSH | -1,IR0 | ; IR0=HALF THE SIZE OF FFT=N/2 |
|  | LDI | @INPUT,AR0 |  |
|  | LDI | @INPUT,AR1 |  |
| * |  |  |  |
|  | RPTB | BITRV |  |
|  | CMPI | AR1,AR0 | ; XCHANGE LOCATIONS ONLY |
|  | BGE | CONT | ; IF AR0<AR1 |
|  | LDF | *AR0,R0 |  |
| \|\| | LDF | *AR1,R1 |  |
|  | STF | R0,*AR1 |  |
| \|\| | STF | R1,*AR0 |  |
| CONT | NOP | *AR0++ |  |
| BITRV | NOP | *AR1++(IR0)B |  |
| * |  |  |  |
| * | LENGTH-TWO BUTTERFLIES |  |  |
| * |  |  |  |
|  | LDI | @INPUT,AR0 | ; AR0 POINTS TO X(I) |
|  | LDI | IR0,RC | ; REPEAT N/2 TIMES |
|  | SUBI | 1,RC | ; RC SHOULD BE ONE LESS THAN DESIRED |


|  | RPTB | BLK1 |  |
| :---: | :---: | :---: | :---: |
|  | ADDF | *+AR0,*AR0++,R0 | ; R0=X(I)+X(I+1) |
|  | SUBF | *AR0,*-AR0,R1 | ; R1=X(I)-X(I+1) |
| BLK1 | STF | R0,*-AR0 | ; X(I)=X(I)+X(I+1) |
| \|\| | STF | R1,*AR0++ | ; X(I+1)=X(I)-X(I+1) |
| * |  |  |  |
| * | FIRST PASS OF THE DO-20 LOOP (STAGE K=2 IN DO-10 LOOP) |  |  |
|  | LDI | @INPUT,AR0 | ; AR0 POINTS TO X(I) |
|  | LDI | 2,IR0 | ; IR0=2=N2 |
|  | LDI | @FFTSIZ,RC |  |
|  | LSH | -2,RC | ; REPEAT N/4 TIMES |
|  | SUBI | 1,RC | ; RC SHOULD BE ONE LESS THAN DESIRED |
| * |  |  |  |
|  | RPTB | BLK2 |  |
|  | ADDF | *+AR0(IR0),*AR0++(IR0),R0 | ; R0=X(I)+X(I+2) |
|  | SUBF | *AR0,*-AR0(IR0),R1 | ; R1=X(I)-X(I+2) |
|  | NEGF | *+AR0,R0 | ; R0=-X(I+3) |
| \|\| | STF | R0,*-AR0(IR0) | ; X(I)=X(I)+X(I+2) |
| BLK2 | STF | R1,*AR0++(IR0) | ; X(I+2)=X(I)-X(I+2) |
| \|\| | STF | R0,*+AR0 | ; X(I+3)=-X(I+3) |
| * |  |  |  |
| * | MAIN LOOP (FFT STAGES) |  |  |
|  | LDI | @FFTSIZ,IR0 |  |
|  | LSH | -2,IR0 | ; IR0=INDEX FOR E |
|  | LDI | 3,R5 | ; R5 HOLDS THE CURRENT STAGE NUMBER |
|  | LDI | 1,R4 | ; R4=N4 |
|  | LDI | 2,R3 | ; R3=N2 |
| LOOP | LSH | -1,IR0 | ; E=E/2 |
|  | LSH | 1,R4 | ; N4=2*N4 |
|  | LSH | 1,R3 | ; N2=2*N2 |
| * | INNER LOOP (DO-20 LOOP IN THE PROGRAM) |  |  |
| * |  |  |  |
|  | LDI | @INPUT,AR5 | ; AR5 POINTS TO X(I) |
| INLOP | LDI | IR0,AR0 |  |
|  | ADDI | @SINTAB,AR0 | ; AR0 POINTS TO SIN/COS TABLE |
|  | LDI | R4,IR1 | ; IR1=N4 |
|  | LDI | AR5,AR1 |  |
|  | ADDI | 1,AR1 | ; AR1 POINTS TO X(I1)=X(I+J) |
|  | LDI | AR1,AR3 |  |
|  | ADDI | R3,AR3 | ; AR3 POINTS TO X(I3)=X(I+J+N2) |
|  | LDI | AR3,AR2 |  |
|  | SUBI | 2,AR2 | ; AR2 POINTS TO X(I2)=X(I-J+N2) |
|  | ADDI | R3,AR2,AR4 | ; AR4 POINTS TO X(I4)=X(I-J+N1) |
|  | LDF | *AR5++(IR1),R0 | ; R0=X(I) |
|  | ADDF | *+AR5(IR1),R0,R1 | ; R1=X(I)+X(I+N2) |
|  | SUBF | R0,*+AR5(IR1),R0 | ; R0=-X(I)+X(I+N2) |
| \|\| | STF | R1,*-AR5(IR1) | ; X(I)=X(I)+X(I+N2) |
|  | NEGF | R0 | ; R0=X(I)-X(I+N2) |


|  | NEGF | *++AR5(IR1),R1 | ; R1=-X(I+N4+N2) |
| :---: | :---: | :---: | :---: |
| \|\| | STF | R0,*AR5 | ; X(I+N2)=X(I)-X(I+N2) |
|  | STF | R1,*AR5 | ; X(I+N4+N2)=-X(I+N4+N2) |
| * | INNERMOST LOOP |  |  |
|  | LDI | @FFTSIZ,IR1 |  |
|  | LSH | -2,IR1 | ; IR1=SEPARATION BETWEEN SIN/COS TBLS |
|  | LDI | R4,RC |  |
|  | SUBI | 2,RC | ; REPEAT N4-1 TIMES |
|  | RPTB | BLK3 |  |
|  | MPYF | *AR3,*+AR0(IR1),R0 | ; R0=X(I3)*COS |
|  | MPYF | *AR4,*AR0,R1 | ; R1=X(I4)*SIN |
|  | MPYF | *AR4,*+AR0(IR1),R1 | ; R1=X(I4)*COS |
| \|\| | ADDF | R0,R1,R2 | ; R2=X(I3)*COS+X(I4)*SIN |
|  | MPYF | *AR3,*AR0++(IR0),R0 | ; R0=X(I3)*SIN |
|  | SUBF | R0,R1,R0 | ; R0=-X(I3)*SIN+X(I4)*COS !!! |
|  | SUBF | *AR2,R0,R1 | ; R1=-X(I2)+R0 !!! |
|  | ADDF | *AR2,R0,R1 | ; R1=X(I2)+R0 !!! |
| \|\| | STF | R1,*AR3++ | ; X(I3)=-X(I2)+R0 !!! |
|  | ADDF | *AR1,R2,R1 | ; R1=X(I1)+R2 |
| \|\| | STF | R1,*AR4-- | ; X(I4)=X(I2)+R0 !!! |
|  | SUBF | R2,*AR1,R1 | ; R1=X(I1)-R2 |
| \|\| | STF | R1,*AR1++ | ; X(I1)=X(I1)+R2 |
| BLK3 | STF | R1,*AR2-- | ; X(I2)=X(I1)-R2 |
|  | SUBI | @INPUT,AR5 |  |
|  | ADDI | R3,AR5 | ; AR5=I+N1 |
|  | CMPI | @FFTSIZ,AR5 |  |
|  | BLED | INLOP | ; LOOP BACK TO THE INNER LOOP |
|  | ADDI | @INPUT,AR5 |  |
|  | NOP |  |  |
|  | NOP |  |  |
| * |  |  |  |
|  | ADDI | 1,R5 |  |
|  | CMPI | @LOGFFT,R5 |  |
|  | BLE | LOOP |  |
| * |  |  |  |
| * | RESTORE THE REGISTER VALUES AND RETURN |  |  |
|  | POP | AR5 |  |
|  | POP | AR4 |  |
|  | POP | R5 |  |
|  | POP | R4 |  |
|  | POP | FP |  |
|  | RETS |  |  |

## APPENDIX C3

GENERIC PROGRAM TO DO A RADIX-2 REAL INVERSE FFT COMPUTATION ON THE TMS320C30.

THE (REAL) DATA RESIDE IN INTERNAL MEMORY. THE COMPUTATION IS DONE IN-PLACE. THE BIT REVERSAL IS DONE AT THE BEGINNING OF THE PROGRAM. THE INPUT DATA ARE STORED IN THE FOLLOWING ORDER:

RE(0), RE(1), ..., RE(N/2), IM(N/2-1), ..., IM(1)

THE TWIDDLE FACTORS ARE SUPPLIED IN A TABLE PUT IN A .DATA SECTION. THIS DATA IS INCLUDED IN A SEPARATE FILE TO PRESERVE THE GENERIC NATURE OF THE PROGRAM. FOR THE SAME PURPOSE, THE SIZE OF THE FFT N AND LOG2(N) ARE DEFINED IN A .GLOBL DIRECTIVE AND SPECIFIED DURING LINKING. THE LENGTH OF THE TABLE IS N/4 + N/4 = N/2.
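
The input ordering required by the inverse routine can be prepared from a half spectrum as sketched below. This is a host-side illustration under stated assumptions: the bins X(0), ..., X(N/2) are given as complex numbers, the imaginary parts of X(0) and X(N/2) are ignored because they are zero for a real signal, and the function name is hypothetical.

```python
def pack_half_spectrum(bins):
    """Store X(0..N/2) as [RE(0), ..., RE(N/2), IM(N/2-1), ..., IM(1)]."""
    half = len(bins) - 1
    if half < 1:
        raise ValueError("need at least the bins X(0) and X(N/2)")
    n = 2 * half
    packed = [0.0] * n
    for k in range(half + 1):
        packed[k] = bins[k].real          # RE(k) goes to index k
    for k in range(1, half):
        packed[n - k] = bins[k].imag      # IM(k) goes to index N-k
    return packed
```

The result is an N-long real array, which matches the in-place storage the program expects.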

AUTHOR: PANOS PAPAMICHALIS
DECEMBER 21, 1988
TEXAS INSTRUMENTS

*

| LDI | 1,IR0 | ; IR0=INDEX FOR E |
| :---: | :---: | :---: |
| LDI | 3,R5 | ; R5 HOLDS THE CURRENT STAGE NUMBER |
| LDI | @FFTSIZ,R3 |  |
| LSH | -1,R3 | ; R3=N1/2=N2 |
| LDI | @FFTSIZ,R4 |  |
| LSH | -2,R4 | ; R4=N1/4=N4 |

* INNER LOOP

| LOOP | LDI | @INPUT,AR5 | ; AR5 POINTS TO X(I) |
| :---: | :---: | :---: | :---: |
|  | LDI | IR0,AR0 |  |
|  | ADDI | @SINTAB,AR0 | ; AR0 POINTS TO SIN/COS TABLE |
| INLOP | LDI | R4,IR1 | ; IR1=N4 |
|  |  |  |  |
|  | LDI | AR5,AR1 |  |
|  | ADDI | 1,AR1 | ; AR1 POINTS TO X(I1)=X(I+J) |
|  | LDI | AR1,AR3 |  |
|  | ADDI | R3,AR3 | ; AR3 POINTS TO X(I3)=X(I+J+N2) |
|  | LDI | AR3,AR2 |  |
|  | SUBI | 2,AR2 | ; AR2 POINTS TO X(I2)=X(I-J+N2) |
|  | ADDI | R3,AR2,AR4 | ; AR4 POINTS TO X(I4)=X(I-J+N1) |
|  |  |  |  |
|  | NOP | *++AR5(IR1) | ; POINT TO X(I+N4) |
|  | ADDF | *-AR5(IR1),*+AR5(IR1),R0 |  |
|  | SUBF | *+AR5(IR1),*-AR5(IR1),R1 |  |
|  | STF | R0,*-AR5(IR1) | ; X(I)=X(I)+X(I+N2) |
|  | STF | R1,*+AR5(IR1) | ; X(I+N2)=X(I)-X(I+N2) |
| \|\| | LDF | *AR5,R0 |  |
|  | MPYF | 2.0,R0 |  |
|  | STF | R0,*-AR5(IR1) | ; X(I+N4)=2*X(I+N4) |
| \|\| | LDF | *++AR5(IR1),R1 |  |
|  | MPYF | -2.0,R1 |  |
|  | STF | R1,*AR5++(IR1) | ; X(I+N4+N2)=-X(I+N4+N2)*2 |
| * | INNERMOST LOOP |  |  |
|  | LDI | @FFTSIZ,IR1 |  |
|  | LSH | -2,IR1 | ; IR1=SEPARATION BETWEEN SIN/COS TBLS |
|  | LDI | R4,RC |  |
|  | SUBI | 2,RC | ; REPEAT N4-1 TIMES |
|  | RPTB | BLK3 |  |
|  | SUBF | *AR2,*AR1,R1 | ; R1=T1=X(I1)-X(I2) |
|  | ADDF | *AR2,*AR1,R0 |  |
|  | MPYF | R1,*+AR0(IR1),R0 | ; R0=T1*COS |
| \|\| | STF | R0,*AR1++ | ; X(I1)=X(I1)+X(I2) |
|  | ADDF | *AR3,*AR4,R2 | ; R2=T2=X(I3)+X(I4) |
|  | SUBF | *AR3,*AR4,R6 |  |
|  | MPYF | R2,*AR0,R6 | ; R6=T2*SIN |
| \|\| | STF | R6,*AR2-- | ; X(I2)=X(I4)-X(I3) |
|  | SUBF | R6,R0 |  |
|  | MPYF | R2,*+AR0(IR1),R6 | ; R6=T2*COS |
| \|\| | STF | R0,*AR3++ | ; X(I3)=T1*COS-T2*SIN |
|  | MPYF | R1,*AR0++(IR0),R0 | ; R0=T1*SIN |
|  | ADDF | R6,R0 |  |
| BLK3 | STF | R0,*AR4-- | ; X(I4)=T1*SIN+T2*COS |
| * |  |  |  |
|  | SUBI | @INPUT,AR5 |  |
|  | CMPI | @FFTSIZ,AR5 |  |
|  | BLTD | INLOP | ; LOOP BACK TO THE INNER LOOP |
|  | ADDI | @INPUT,AR5 |  |
|  | LDI | IR0,AR0 |  |
|  | ADDI | @SINTAB,AR0 | ; AR0 POINTS TO SIN/COS TABLE |



## Appendix D. Discrete Hartley Transform

## APPENDIX D1

GENERIC PROGRAM TO DO A RADIX-2 HARTLEY TRANSFORM ON THE TMS320C30.
THE PROGRAM IS TAKEN FROM THE PAPER BY SORENSEN ET AL., OCT 1985 ISSUE OF THE TRANSACTIONS ON ASSP.

THE (REAL) DATA RESIDE IN INTERNAL MEMORY. THE COMPUTATION IS DONE IN-PLACE. THE BIT-REVERSAL IS DONE AT THE BEGINNING OF THE PROGRAM.

THE TWIDDLE FACTORS ARE SUPPLIED IN A TABLE PUT IN A .DATA SECTION. THIS DATA IS INCLUDED IN A SEPARATE FILE TO PRESERVE THE GENERIC NATURE OF THE PROGRAM. FOR THE SAME PURPOSE, THE SIZE OF THE FHT N AND LOG2(N) ARE DEFINED IN A .GLOBL DIRECTIVE AND SPECIFIED DURING LINKING. THE LENGTH OF THE TABLE IS N/4 + N/4 = N/2.
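
For checking results from the target, the discrete Hartley transform can be computed directly from its definition on a host machine. The sketch below is a slow O(N^2) reference, not the radix-2 algorithm of the listing; it assumes the unscaled definition H(k) = sum over n of x(n) * (cos(2*pi*n*k/N) + sin(2*pi*n*k/N)) with output in natural order, and any scaling or ordering applied by the assembly program would have to be accounted for separately. The function name is hypothetical.

```python
import math


def dht(x):
    """Direct discrete Hartley transform of a real sequence (unscaled)."""
    n = len(x)
    out = []
    for k in range(n):
        acc = 0.0
        for i in range(n):
            angle = 2.0 * math.pi * i * k / n
            # cas(angle) = cos(angle) + sin(angle)
            acc += x[i] * (math.cos(angle) + math.sin(angle))
        out.append(acc)
    return out
```

With this definition the transform is its own inverse up to a factor of N, so applying `dht` twice and dividing by N recovers the input.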

AUTHOR: PANOS PAPAMICHALIS
DECEMBER 14, 198
TEXAS INSTRUMENTS
; ENTRY POINT FOR EXECUTION
; FHT SIZE
; LOG2(N)
; ADDRESS OF SINE TABLE
; MEMORY WITH INPUT DATA
; STARTING LOCATION OF THE PROGRAM
; RESERVE 100 WORDS FOR VECTORS, ETC.
; COMMAND TO LOAD DATA PAGE POINTER

* DO THE BIT REVERSING AT THE BEGINNING

| LDI | @FHTSIZ,RC | ; RC=N |
| :---: | :---: | :---: |
| SUBI | 1,RC | ; RC SHOULD BE ONE LESS THAN DESIRED |
| LDI | @FHTSIZ,IR0 |  |
| LSH | -1,IR0 | ; IR0=HALF THE SIZE OF FHT=N/2 |
| LDI | @INPUT,AR0 |  |
| LDI | @INPUT,AR1 |  |
| RPTB | BITRV |  |
| CMPI | AR1,AR0 | ; XCHANGE LOCATIONS ONLY |
| BGE | CONT | ; IF AR0<AR1 |
| LDF | *AR0,R0 |  |


| \|\| | LDF | *AR1,R1 |
| :--- | :--- | :--- |
|  | STF | R0,*AR1 |
| \|\| | STF | R1,*AR0 |
| CONT | NOP | *AR0++ |
| BITRV | NOP | *AR1++(IR0)B |
| * |  |  |
| * | LENGTH-TWO BUTTERFLIES |  |




## Appendix E. Discrete Cosine Transform

| * APPENDIX E1 |  |  |  |
| :---: | :---: | :---: | :---: |
| * |  |  |  |
| * A FAST COSINE TRANSFORM |  |  |  |
| * |  |  |  |
| * BASED ON THE ALGORITHM OUTLINED BY BYEONG GI LEE IN HIS ARTICLE, FCT - A |  |  |  |
| * FAST COSINE TRANSFORM, PUBLISHED IN THE PROCEEDINGS OF THE IEEE INTER- |  |  |  |
| * NATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, SAN |  |  |  |
| * DIEGO, CA, 19-21 MARCH 1984, P 28A.3/1-4 VOL 2, (CH1954-5/84/0000-0299). |  |  |  |
| * LEE'S ALGORITHM HAS BEEN MODIFIED TO ALLOW NATURAL ORDER TIME DOMAIN |  |  |  |
| * COEFFICIENTS RATHER THAN THE LESS ORDERED INPUT SUGGESTED IN HIS ARTICLE. |  |  |  |
| * THE FREQUENCY DOMAIN COEFFICIENTS ARE IN BIT REVERSE ORDER. THIS IS AN IN |  |  |  |
| * PLACE CALCULATION. |  |  |  |
| * AUTHOR: PAUL WILHELM |  |  |  |
| * |  |  |  |
| * |  |  |  |
|  | .global | FCT | ; FAST COSINE TRANSFORM ENTRY POINT. |
|  | .global |  | ; LENGTH OF DATA ENTRY. |
|  | .global | COS_tab | ; TABLE OF COSINE COEFFICIENTS. |
|  | .global | COEFF | ; TABLE OF INPUT DATA. |
|  | .text |  |  |
| * |  |  |  |
| FCTSIZE | .word | 1 |  |
| _cos | .word | COS_tab |  |
| _DATA | .word | COEFF |  |
| * |  |  |  |
| FCT: |  |  |  |
|  | LDI | @FCTSIZE,AR0 | ; LOAD DATA LENGTH. |
|  | LDI | @FCTSIZE,BK | ; SET BLOCK SIZE FOR CIRCULAR |
| * |  |  | ; ADDRESSING. |
|  | LDI | @_DATA,AR6 | ; LOAD DATA POINTER. |
|  | LDI | @_cos,AR7 | ; LOAD COSINE TABLE POINTER. |
|  | LDI | AR0,IR1 | ; INITIALIZE INDEX REGISTERS FOR FIRST |
|  | LDI | -1,IR0 | ; BUTTERFLY SERIES. |
|  | LDI | AR6,AR1 | ; INITIALIZE DATA POINTERS. |
|  | ADDI3 | AR6,AR0,AR2 |  |
|  | SUBI | 1,AR2 |  |
|  | LSH3 | IR0,AR0,AR3 |  |
|  | LDI | 1,AR5 | ; INITIALIZE 2'S POWER COUNTER. |
|  | ADDI | AR6,AR3 | ; FINISH DATA POINTER INITIALIZATION. |
|  | ADDI3 | IR0,AR3,AR4 |  |
|  | ADDI3 | IR0,AR5,RC | ; RC SHOULD BE ONE LESS THAN COUNT |
| * |  |  | ; DESIRED. |
| * FIRST LOOP SERIES |  |  |  |
| * THIS LOOP SERIES DOES ALL THE BUTTERFLY STAGES EXCEPT THE FINAL ONE. |  |  |  |
| * |  |  |  |
|  | RPTB | END_CENTER_LOOP |  |


| * |  |  | ; TWO BUTTERFLIES ARE CALCULATED AT THE SAME TIME. |
| :---: | :---: | :---: | :---: |
| * |  |  |  |
| MIDDLE_LOOP: |  |  |  |
|  | LDF | *AR2,R2 | ; GET LOWER HALF OF EACH BUTTERFLY. |
| \|\| | LDF | *AR3,R3 | ; (THIS ALLOWS FOR MORE PARALLEL |
| * |  |  | ; COMMANDS LATER) |
|  | SUBF3 | *AR3,*AR4,R1 | ; SUBTRACT SECOND BUTTERFLY DATA. |
|  | SUBF3 | *AR2,*AR1,R0 | ; SUBTRACT FIRST BUTTERFLY DATA. |
|  | MPYF3 | R1,*+AR7,R1 | ; MULTIPLY 2ND SUBTRACTION RESULT BY |
| \|\| | ADDF3 | R3,*AR4,R3 | ; COSINE COEFFICIENT. ADD SECOND |
| * |  |  | ; BUTTERFLY DATA. |
|  | MPYF3 | R0,*-AR7,R0 | ; MULTIPLY 1ST SUBTRACTION RESULT BY |
| \|\| | ADDF3 | R2,*AR1,R2 | ; COSINE COEFFICIENT. ADD FIRST |
| * |  |  | ; BUTTERFLY DATA. |
|  | STF | R1,*AR2++(IR1)% | ; SAVE 2ND MULTIPLY RESULT IN LOWER |
| \|\| | STF | R3,*AR4++(IR1)% | ; HALF OF BUTTERFLY. SAVE 2ND |
| * |  |  | ; ADDITION IN UPPER 2ND BUTTERFLY. |
| END_CENTER_LOOP: |  |  |  |
|  | STF | R0,*AR3++(IR1)% | ; SAVE 1ST MULTIPLY IN LOWER HALF OF |
| \|\| | STF | R2,*AR1++(IR1)% | ; 2ND BUTTERFLY. SAVE 1ST ADDITION |
| * |  |  | ; IN UPPER 1ST BUTTERFLY. |
| * |  |  |  |
| * END OF CENTER LOOP OF FIRST LOOP SERIES. |  |  |  |
|  | ADDI3 | IR0,AR5,RC | ; UPDATE REPEAT COUNTER FOR NEXT BLOCK REPEAT. |
|  | ADDF3 | *AR3++,*AR2--,R0 | ; UPDATE DATA POINTERS. |
|  | CMPI | AR3,AR2 | ; HAVE BUTTERFLIES BEEN COMPLETED? |
|  | BGTD | MIDDLE_LOOP | ; DELAYED BRANCH, IF NOT. |
|  | ADDF3 | *AR1++,*AR4--,R0 | ; UPDATE FINAL TWO POINTERS FOR NEXT REPEAT |
|  | ADDI | 2,AR7 | ; UPDATE COSINE COEFFICIENT POINTER. |
|  | OR | 0100H,ST | ; SET REPEAT MODE. (FASTER THAN USING |
| * |  |  | ; RPTB WHEN START AND END ADDRESS |
| * |  |  | ; ARE STILL GOOD) |
| * |  |  |  |
| * DELAY BRANCH FROM HERE TO MIDDLE_LOOP. |  |  |  |
|  | LSH | -1,IR1 | ; UPDATE INDEX REGISTER. (DIVIDE BY 2) |
|  | LDI | AR6,AR1 | ; REINITIALIZE DATA POINTERS. |
|  | ADDI | IR0,AR6,AR2 |  |
|  | ADDI | IR1,AR2 |  |
|  | CMPI | 2,IR1 | ; IS FIRST BUTTERFLY SERIES COMPLETE? |
|  | BGTD | OUTSIDE_LOOP | ; DELAY BRANCH, IF NOT. |
|  | LSH | 1,AR5 | ; MULTIPLY 2'S POWER COUNTER BY 2. |
|  | SUBI3 | IR0,AR4,AR3 | ; CONTINUE REINITIALIZING DATA |
| * |  |  | ; POINTERS. |
|  | ADDI3 | IR0,AR5,RC | ; SET REPEAT COUNTER FOR REPEAT BLOCK. |

* END OF FIRST LOOP SERIES.

* FINAL BUTTERFLY STAGE LOOP.
* INCLUDES LAST BUTTERFLIES AND FIRST STAGE OF BIT REVERSE ADDITIONS.

|  | LDI | 4,IR1 | ; INITIALIZE INDEX REGISTER. |
| :---: | :---: | :---: | :---: |
|  | ADDI | 1,AR3 | ; SET UP DATA POINTERS. |
|  | LSH | -1,AR5 |  |
|  | ADDI | 3,AR4 |  |
|  | ADDI3 | IR0,AR5,RC | ; INITIALIZE REPEAT COUNTER. |
|  | MPYF3 | *AR7,*+AR7,R4 | ; CALCULATE (2/M)*COS(PI/4). |
| * |  |  | ; (I.E., (SQRT(2))/M. THIS VALUE IS |
| * |  |  | ; CALLED S BELOW.) |
|  | RPTB | END_2ND_LOOP | ; TWO BUTTERFLIES ARE CALCULATED PER |
| * |  |  | ; LOOP. |
|  | SUBF3 | *AR2,*AR1,R0 | ; SUBTRACT 1ST BUTTERFLY DATA. |
|  | SUBF3 | *AR4,*AR3,R1 | ; SUBTRACT 2ND BUTTERFLY DATA. |
|  | MPYF3 | R0,R4,R0 | ; MULTIPLY 1ST SUBTRACTION RESULT |
| \|\| | ADDF3 | *AR3++(IR1),*AR4++(IR1),R3 | ; BY S. ADD 2ND BUTTERFLY |
|  |  |  | ; DATA. |
|  | MPYF3 | R1,R4,R1 | ; MULTIPLY 2ND SUBTRACTION RESULT |
| \|\| | ADDF3 | *AR1++(IR1),*AR2++(IR1),R2 | ; BY S. ADD 1ST BUTTERFLY |
|  |  |  | ; DATA. |
|  | MPYF3 | R3,*+AR7,R3 | ; MULTIPLY 2ND ADDITION RESULT BY |
| \|\| | STF | R0,*-AR2(IR1) | ; .7071. SAVE 1ST SUBTRACTION IN |
| * |  |  | ; LOWER 1/2 OF 1ST BUTTERFLY. |
|  | MPYF3 | R2,*+AR7,R2 | ; MULTIPLY 1ST ADDITION RESULT BY |
| \|\| | STF | R1,*-AR4(IR1) | ; .7071. SAVE 2ND SUBTRACTION IN |
|  |  |  | ; LOWER 1/2 OF 2ND BUTTERFLY. |
|  | ADDF3 | R3,R1,R3 | ; ADD 2ND SUBTRACTION MULTIPLY TO |
| * |  |  | ; 2ND ADDITION MULTIPLY. |
|  | STF | R2,*-AR1(IR1) | ; SAVE 1ST ADDITION MULTIPLY IN UPPER <br> ; 1/2 OF BUTTERFLY. |
| * |  |  |  |
|  |  |  |  |
| END_2ND_LOOP: |  |  |  |
| * |  |  |  |
|  | STF | R3,*-AR3(IR1) | ; SAVE 2ND ADDITION MULTIPLY IN UPPER <br> ; 1/2 OF UPPER BUTTERFLY. |
|  |  |  |  |
| * |  |  |  |
| * END OF FINAL BUTTERFLY STAGE |  |  |  |



| * END OF LAST LOOP SERIES. |  |  |  |
| :---: | :---: | :---: | :---: |
| * MULTIPLY COEFFICIENT ZERO BY .5, IF NOT ZERO. |  |  |  |
|  | LDF | *AR6,R0 | ; SET ZERO FLAG IF *AR6 = 0. |
|  | BEQD | DONT_STORE | ; IF COEFFICIENT IS ZERO, DON'T DO |
|  |  | 24,AR5 | ; USE INTEGER MATH FOR FLOAT DIVIDE |
|  | SUBI3 | AR5,*AR6,AR1 | ; BY 2. |
|  | NOP |  |  |
| * DELAYED BRANCH FROM HERE IF VALUE IS NOT TO BE STORED. |  |  |  |
|  | STI | AR1,*AR6 | ; STORE, IF EXPONENT HASN'T -128. |
| DONT_STORE: |  |  |  |
|  | RETS |  |  |


## APPENDIX E2

A FAST COSINE TRANSFORM (INVERSE TRANSFORM)
BASED ON THE ALGORITHM OUTLINED BY BYEONG GI LEE IN HIS ARTICLE, FCT - A FAST COSINE TRANSFORM, PUBLISHED IN THE PROCEEDINGS OF THE IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, SAN DIEGO, CA, 19-21 MARCH 1984, P 28A.3/1-4 VOL 2., (CH1954-5/84/0000-0299).

LEE'S ALGORITHM HAS BEEN MODIFIED TO ALLOW NATURAL ORDER TIME DOMAIN COEFFICIENTS.

THE FREQUENCY DOMAIN COEFFICIENTS ARE IN BIT REVERSE ORDER. THIS IS AN IN PLACE CALCULATION.

AUTHOR: PAUL WILHELM

| .global | IFCT | ; INVERSE FAST COSINE TRANSFORM ENTRY POINT. |
| :--- | :--- | :--- |
| .global | M | ; LENGTH OF ARRAY TO BE TRANSFORMED. |
| .global | COEFF | ; TABLE OF ARRAY DATA TO BE TRANSFORMED. |
| .global | COS_TAB | ; TABLE OF COSINE COEFFICIENTS. |

.text

| FCTSIZE | .word | M |
| :--- | :--- | :--- |
| _DATA | .word | COEFF |
| COS | .word | COS_TAB |

* 

COS_TAB
IFCT:

| LDI | @FCTSIZE,AR0 | ; LOAD ARRAY SIZE. |
| :---: | :---: | :---: |
| LDI | @FCTSIZE,BK | ; LOAD BLOCK SIZE FOR CIRCULAR <br> ; ADDRESSING. |
| LDI | @_DATA,AR6 | ; LOAD POINTER TO DATA TABLE. |
| LDI | @COS,AR7 | ; LOAD POINTER TO COSINE TABLE. |
| ADDI | AR0,AR7 | ; POINT TO LAST COSINE VALUE IN TABLE. |
| SUBI | 2,AR7 |  |
| LDI | AR0,IR0 | ; INITIALIZE INDEX REGISTERS FOR BIT |
| LSH | -2,IR0 | ; REVERSED ADDITION SEQUENCE. |
| LDI | AR0,IR1 |  |
| LDI | AR6,AR1 | ; INITIALIZE DATA POINTERS. |
| ADDI | IR0,AR1 |  |

* START OF BIT REVERSED ADDITION LOOP SERIES.

| OUTSIDE: |  |  | ; TOP OF OUTSIDE LOOP FOR BIT REVERSED <br> ; ADDITIONS. |
| :---: | :---: | :---: | :---: |
|  | ADDI | IR0,AR1 | ; UPDATE DATA POINTERS AND REPEAT <br> ; COUNTER. |
|  | LDI | AR1,AR2 |  |
|  | LDI | IR0,RC |  |
|  | SUBI | 2,RC |  |



* THIS LOOP INCLUDES THE LAST BIT REVERSED ADDITION STAGE, THE FIRST BUTTERFLY, AND THE COSINE MULTIPLICATIONS FOR THE SECOND BUTTERFLY SERIES.

; FOUR BUTTERFLIES ARE DONE EACH CYCLE THROUGH THIS LOOP.

; BIT REVERSED ADDITION FOR 2ND BUTTERFLY.
; COSINE PI/4 TIMES LOWER HALF OF 1ST BUTTERFLY.
; COSINE PI/4 TIMES LOWER HALF OF 2ND BUTTERFLY.
; BIT REVERSED ADDITION FOR 4TH ; BIT REVERSED
; ADD UPPER HALF OF 1ST BUTTERFLY.
; COSINE PI/4 TIMES LOWER HALF OF 4TH BUTTERFLY.
; ADD UPPER HALF OF 2ND BUTTERFLY.
; SUBTRACT LOWER HALF OF 1ST BUTTERFLY.
; MULTIPLY UPPER HALF OF 2ND BUTTERFLY
; BY COSINE COEFFICIENT.
; SUBTRACT LOWER HALF OF 2ND BUTTERFLY.
; STORE UPPER HALF OF 1ST BUTTERFLY.
; STORE LOWER HALF OF 1ST BUTTERFLY.
; STORE LOWER HALF OF 2ND BUTTERFLY.
; COSINE PI/4 TIMES LOWER HALF OF 3RD BUTTERFLY.
; MULTIPLY LOWER HALF OF 2ND BUTTERFLY
; BY COSINE COEFFICIENT.
; SUBTRACT LOWER HALF OF 4TH
; BUTTERFLY.
; ADD UPPER HALF OF 3RD BUTTERFLY.
; MULTIPLY LOWER HALF OF 4TH BUTTERFLY BY COSINE COEFFICIENT.
; ADD UPPER HALF OF 4TH BUTTERFLY.
; SUBTRACT LOWER HALF OF 3RD BUTTERFLY.
; MULTIPLY UPPER HALF OF 4TH BUTTERFLY BY COSINE COEFFICIENT.
; STORE UPPER HALF OF 4TH BUTTERFLY.
; STORE UPPER HALF OF 2ND BUTTERFLY.
* END
*

STF R1,*AR4++(IR1)%
R4,*AR3++(IR1)% ; STORE LOWER HALF OF 4TH BUTTERFLY.

* END OF CENTER BUTTERFLY LOOP.
* START NEXT TO LAST LOOP SERIES.
* THIS SERIES OF LOOPS DOES ALL BUT THE LAST BUTTERFLY STAGE. ALL THE COSINE COEFFICIENT MULTIPLICATIONS ARE DONE, INCLUDING THE MULTIPLICATIONS FOR THE LAST BUTTERFLY STAGE. (THIS PROGRAM FLOW ALLOWS FOR FAST EXECUTION.)

ER_LOOP:
STF


* END OF CENTER LOOP OF NEXT TO LAST SERIES.

| LDI | AR5,RC | ; RELOAD REPEAT COUNTER. |
| :---: | :---: | :---: |
| LDF | *AR7--,R5 | ; GET NEW COSINE COEFFICIENTS. (FYI - |
| LDF | *AR7--,R4 | ; THE LAST TIME, THIS WILL FETCH <br> ; FROM MEMORY BELOW THE COSINE <br> ; TABLE.) |
| CMPI | AR1,AR6 | ; HAS MIDDLE LOOP BEEN COMPLETED? |
| bxed | NLL_LOOP | ; IF NOT, BRANCH DELAYED. |
| ADDF3 | *AR4++,*AR3--,R0 | ; DUMMY ADDS TO UPDATE DATA POINTERS. |


## Appendix E3. FCT Cosine Tables File

* 
* APPENDIX E3
* 
* FCT COSINE TABLES FILE
* TO BE LINKED WITH FCT SOURCE CODE FOR 32 POINT FCT.
* COEFFICIENTS ARE 1/(2*COS(N*PI/2M)), WHERE N IS A NUMBER FROM 1 TO
* M-1. M IS THE ORDER OF THE TRANSFORM.
* FOR A 32 POINT FCT, N IS IN THE FOLLOWING ORDER:
* 1,15,3,13,5,11,7,9,
* 2,14,6,10,
* 4,12,
* 
* 
* THE LAST VALUE IN THE TABLE IS 2/M.

M .set 16
|  | .data |  |
| :--- | :--- | :--- |
| COS_TAB |  |  |
|  | .float | 0.5024193 |
|  | .float | 5.1011487 |
|  | .float | 0.5224986 |
|  | .float | 1.7224471 |
|  | .float | 0.5669440 |
|  | .float | 1.0606777 |
|  | .float | 0.6468218 |
|  | .float | 0.7881546 |
|  | .float | 0.5097956 |
|  | .float | 2.5629154 |
|  | .float | 0.6013449 |
|  | .float | 0.8999762 |
|  | .float | 0.5411961 |
|  | .float | 1.3065630 |
|  | .float | 0.7071068 |
|  | .float | 0.1250000 |
|  | .end |  |
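The header comments describe each table entry as 1/(2·cos(N·π/2M)) followed by 2/M. As a rough cross-check, the Python sketch below regenerates the table for M = 16 (the value set in the file) and compares it with the listed constants. The function name `cos_table` is hypothetical, and the final N = 8 in the ordering is an assumption inferred from the 0.7071068 entry; the header lists only the first 14 values of N.

```python
import math

M = 16
# Order of N from the file header. The trailing 8 is an assumption inferred
# from the 0.7071068 entry; the header itself stops at 12.
ORDER = [1, 15, 3, 13, 5, 11, 7, 9, 2, 14, 6, 10, 4, 12, 8]

# Constants as they appear in the listing.
LISTED = [0.5024193, 5.1011487, 0.5224986, 1.7224471,
          0.5669440, 1.0606777, 0.6468218, 0.7881546,
          0.5097956, 2.5629154, 0.6013449, 0.8999762,
          0.5411961, 1.3065630, 0.7071068, 0.1250000]

def cos_table(m=M, order=ORDER):
    """Return 1/(2*cos(n*pi/(2*m))) for each n in order, followed by 2/m."""
    table = [1.0 / (2.0 * math.cos(n * math.pi / (2 * m))) for n in order]
    table.append(2.0 / m)          # the last value in the table is 2/M
    return table

for computed, listed in zip(cos_table(), LISTED):
    print(f"{computed:.7f}  {listed:.7f}")
```

Under these assumptions the regenerated values agree with the listing to the seven decimal places shown.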

## Appendix E4. Data File

```
*
* APPENDIX E4
*
* DATA FILE
*
    .global COEFF
    .data
*
COEFF
    .float 137.0
    .float 249.0
    .float 105.0
    .float 217.0
    .float 73.0
    .float 185.0
    .float 41.0
    .float 153.0
    .float 9.0
    .float 121.0
    .float 233.0
    .float 89.0
    .float 201.0
    .float 57.0
    .float 169.0
    .float 25.0
    .end
```


## Appendix F. Test Vectors, 64-Point Sine Table, Link Command File

APPENDIX F1
EXAMPLE OF A 64-POINT VECTOR TO TEST THE FFT ROUTINES


0.9166
0.1402
0.1402
0.7054
0.7054
0.0178
0.0178
0.2611
0.1358
0.0503
0.5782
0.2432
0.9448
0.5876
0.7256
0.2849
0.6767
0.8642
0.1943

* 64-POINT FFT CORRESPONDING TO VECTOR $X$
* $Y=$
30.3774
$1.7780-2.5584 \mathrm{i}$
$1.7780-2.5584 i$
$1.0376-2.3999 i$
$0.6594+2.3639 \mathrm{i}$
$0.659+2.3639 i$
$-1.5228-0.7527 i$
$-3.8171-0.2050 i$
$-3.8171-0.2050 \mathrm{i}$
$-2.7096+1.2841 \mathrm{i}$
$2.1622-1.6863 \mathrm{i}$
$0.2879+1.8671 i$
$-1.5479+1.6298 \mathrm{i}$
$-0.6366-0.1176 i$
$2.2902+1.5549 \mathrm{i}$
$-2.4837-0.5842 \mathrm{i}$
$-1.7338+0.0738 \mathrm{i}$
$-0.2180-0.4726 i$
$-0.2104+0.4897 \mathrm{i}$
$-1.7473-1.0213 i$
0.1233-2.3915i
$-0.6415-1.1144 \mathrm{i}$
$-2.7719-0.4802 i$
$-2.7719-0.4802 \mathrm{i}$
$-0.0063-0.3885 \mathrm{i}$
$0.7163+1.5682 i$
$0.3218-1.3316 \mathrm{i}$
$-0.7823+1.0607 \mathrm{i}$
$-0.2533+2.8270 i$
$-1.0813-2.7861 i$
$-1.0813-2.7861 i$
$3.4869+1.9485 i$
$3.4869+1.9485 \mathrm{i}$
$3.0352+1.3855 \mathrm{i}$
$3.2099+2.3564 \mathrm{i}$
$3.2099+2.3564 i$
$1.9511-0.7714 i$
$1.8755+0.2867 i$



## Appendix F2. File to Be Linked with the Source Code for a 64-Point, Radix-4 FFT

## Appendix F3. Link Command File

```
*
* APPENDIX F3
*
*
* LINK COMMAND FILE
*
* DO NOT TYPE IN THESE FIRST SEVEN LINES
-o 12opt64.out
12fopt.obj
sin64.obj
SECTIONS
{
    .text : {}
    .data : {}
    IN 808800h : { 12fopt.obj(IN) }
    .bss 809C00h : {}
}
```


# Doublelength Floating-Point Arithmetic on the TMS320C30 

Al Lovrich<br>Digital Signal Processor Products-Semiconductor Group<br>Texas Instruments

In the past, extended-precision arithmetic has been implemented only on fixed-point processors. The introduction of the TMS320C30 Digital Signal Processor (DSP), a floating-point 33-MFLOP device, enables us to represent multilength floating-point math in terms of singlelength floating-point math. Extended-precision arithmetic gives designers more accuracy in applications such as digital filtering, FFTs, image processing, and control.

This application report describes how to extend the available precision of floating-point arithmetic on the TMS320C30. Our emphasis is on implementing an efficient extension of the available precision while minimizing both the execution time and the memory usage.

The structure of this report is as follows: The first section describes the TMS320C30 DSP floating-point number representation. The second section discusses doublelength arithmetic and some basic definitions. The third section discusses the algorithms used along with the TMS320C30 implementation. An analysis of the error introduced by the algorithm is presented in the fourth section. The last section provides an insight into generating C callable functions from assembly language routines. Finally, the appendix provides the source listings for the extended-precision arithmetic.

## Floating Point Format

The TMS320C30 supports three floating-point formats [1].

- Short floating-point format, used to represent immediate operands, consisting of a 4-bit exponent and a 12-bit mantissa.
- Single-precision format, used for regular floating-point value representation, consisting of an 8-bit exponent and a 24-bit mantissa.
- Extended-precision format, used with the extended-precision registers, consisting of an 8-bit exponent and a 32-bit mantissa.

For the extended-precision algorithms to work properly on the DSP, it is important to start from the highest-precision floating-point format available in the system that is used for basic floating-point operations. The single-precision format is of particular interest in developing the TMS320C30 code for extended-precision floating-point operations. Therefore, a working knowledge of the properties of this format is essential for the concepts presented in this application report.

In the single-precision format, the floating-point number is represented by an 8-bit exponent field ($e$) in two's complement notation, and a two's complement 24-bit mantissa field ($f$) with an implied most-significant nonsign bit. Bit 23 of the mantissa indicates the sign ($s$), as shown in Figure 1.


Figure 1. Single-Precision Floating-Point Format of the TMS320C30

Operations are performed with an implied binary point between bits 23 and 22. When the implied most-significant nonsign bit is made explicit, it is located to the immediate left of the binary point after the sign bit. We show the implied bit explicitly throughout this application report for clarity. The floating-point number x is expressed as follows:

$$
x= \begin{cases}01.f \times 2^{e} & \text{if } s=0 \\ 10.f \times 2^{e} & \text{if } s=1 \\ 0 & \text{if } e=-128,\ s=0, \text{ and } f=0\end{cases}
$$

The range and precision available with the TMS320C30 single-precision floating-point format are illustrated by the following values:

Most Positive: $\quad x=+3.4028234 \times 10^{+38}$
Least Positive: $\quad x=+5.8774717 \times 10^{-39}$
Least Negative: $\quad x=-5.8774724 \times 10^{-39}$
Most Negative: $\quad x=-3.4028236 \times 10^{+38}$
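The field layout and the expression for x can be exercised with a short decoder. The Python sketch below is illustrative only: the function name `c30_to_float` is hypothetical, the bit positions follow the description above (exponent in bits 31-24, sign in bit 23, fraction in bits 22-0), and treating every word with exponent -128 as zero is an assumption. The four hexadecimal words in the loop are derived here from the format definition; they are not taken from the report.

```python
import math

def c30_to_float(word: int) -> float:
    """Decode a 32-bit TMS320C30 single-precision word into a Python float."""
    e = (word >> 24) & 0xFF            # bits 31-24: exponent field
    if e >= 0x80:                      # exponent is two's complement
        e -= 0x100
    if e == -128:                      # assumption: exponent -128 encodes zero
        return 0.0
    s = (word >> 23) & 1               # bit 23: sign bit
    f = (word & 0x7FFFFF) / 2.0 ** 23  # bits 22-0: fraction
    mantissa = (1.0 + f) if s == 0 else (-2.0 + f)   # 01.f or 10.f
    return math.ldexp(mantissa, e)

# Words derived from the format definition (not quoted from the report).
for name, word in [("most positive", 0x7F7FFFFF),
                   ("least positive", 0x81000000),
                   ("least negative", 0x81FFFFFF),
                   ("most negative", 0x7F800000)]:
    print(f"{name:15s} {word:08X}h  {c30_to_float(word):+.7e}")
```

Because a Python float carries 53 mantissa bits, every 24-bit TMS320C30 value decodes without rounding, which makes the decoder convenient for checking hexadecimal examples by hand.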

## Doublelength Floating-Point - The Basics

The techniques used to develop doublelength results in this application report require a singlelength floating-point system and arithmetic that satisfy certain conditions. The TMS320C30 implementation takes the singlelength system as the highest floating-point precision system available. The algorithms presented do not require a doublelength accumulator with respect to the singlelength system used. The extended-precision formats available are used to control the truncation or rounding of the single-precision results.

The doublelength arithmetic presented here increases precision of a given floating-point operation without the need for a doublelength accumulator. Using this method, the result of the floating-point operations on two single-precision numbers can be determined exactly. If x and y are two such numbers and the desired operation is addition, the result can be represented as a pair of floating-point numbers z and zz. The z value represents the most significant portion of the floating-point operation, while zz represents the least significant portion of the floating-point operation.

As an example, consider the result of the exact addition of two floating-point numbers $x$ and $y$ that are expressed in the single-precision format of the TMS320C30:

$$
\begin{array}{ll}
x=\text{217FFFFFh} & \left(\text{decimal: } 1.71798682 \times 10^{10}\right) \\
y=\text{0C7FFFFFh} & \left(\text{decimal: } 8.19199951 \times 10^{3}\right)
\end{array}
$$

The values are represented in the TMS320C30 binary equivalent as follows:

$$
\begin{aligned}
& x=2^{33} \times 01.11111111111111111111111\text{b} \\
& y=2^{12} \times 01.11111111111111111111111\text{b}
\end{aligned}
$$

Addition of two floating-point numbers requires aligning the two variables x and y [1]:

$$
\begin{aligned}
& x=2^{33} \times 01.11111111111111111111111\text{b} \\
& y=2^{33} \times 00.00000000000000000000111111111111111111111111000\text{b}
\end{aligned}
$$

As can be seen in this example, most of the precision available for y will not be available to carry out the addition. Maintaining full precision for floating-point addition requires extra mantissa bits beyond the 24 bits available on the DSP. Since the need for such precision is rare, software methods are used to represent the result of the operation as a floating-point number pair (z, zz). In our example, the exact result is represented as follows:

$$
\begin{aligned}
& z=2^{34} \times 01.00000000000000000000011\text{b} \\
& zz=2^{9} \times 01.11111111111111111111000\text{b}
\end{aligned}
$$

The corresponding hexadecimal representation of (z, zz) is shown below:

$$
\begin{array}{ll}
z=\text{22000003h} & \left(\text{decimal: } 1.71798753 \times 10^{10}\right) \\
zz=\text{097FFFF8h} & \left(\text{decimal: } 1.0239995 \times 10^{3}\right)
\end{array}
$$
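The example can be replayed numerically. The Python sketch below is a hedged illustration rather than TMS320C30 code: it assumes IEEE single precision with round-to-nearest as a stand-in for the DSP's rounded 24-bit mantissa, and the helper `f32` is a hypothetical name. The operands are the decimal values of 217FFFFFh and 0C7FFFFFh.

```python
import struct
from fractions import Fraction

def f32(v: float) -> float:
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]

x = 17179868160.0       # 217FFFFFh = (2**24 - 1) * 2**10
y = 8191.99951171875    # 0C7FFFFFh = (2**24 - 1) * 2**-11

exact = Fraction(x) + Fraction(y)   # rational arithmetic, no rounding
z = f32(x + y)                      # what a 24-bit mantissa can hold
zz = exact - Fraction(z)            # what the rounding discarded

print(z, float(zz))
```

Under these assumptions z comes out as the decimal value of 22000003h and the discarded remainder as that of 097FFFF8h, and the remainder is itself a single-precision number, which is what makes the pair representation possible.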

Some definitions are basic to the development of concepts in this report. First is the definition of the floating-point operations over a system $R$. The system contains all the possible floating-point numbers that the single-precision format of the TMS320C30 can represent. All the floating-point arithmetic is carried out in base 2 . Therefore, $R$ can be represented as follows on the TMS320C30:

$$
R=\left\{x \mid x=m(x)\, 2^{e(x)},\ |m(x)|<2^{24},\ -128<e(x)<127\right\}
$$

A floating-point operation is faithful if the result of the operation $\mathrm{fl}(x * y)$ equals either:

- the largest element of $R$ that is smaller than or equal to $(x * y)$, or
- the smallest element of $R$ that is larger than or equal to $(x * y)$,

where $*$ represents one of the floating-point operations $+$, $-$, $\times$, or $\div$. In other words, faithful refers to truncating the floating-point operation result. The floating-point multiplier on the TMS320C30 saves the upper 40 bits of the mantissa in one of the extended-precision registers [1] and drops the least significant byte of the result. By this definition, the floating-point multiplication on the TMS320C30 is faithful. Since the algorithms require the floating-point result to be in single-precision format, the floating-point multiplication on the DSP must therefore be followed by a second truncation step. Saving the contents of the extended-precision register to a memory location or masking off the low 8 bits results in truncation.

A floating-point operation is optimal if, for all $x$ and $y$, the result $\mathrm{fl}(x * y)$ is an element of $R$ nearest to $(x * y)$. In other words, the round-off error should not exceed one-half of the last remaining bit position. This is commonly referred to as rounding.

The results of floating-point operations on the TMS320C30 are stored in the extended-precision registers [1]. The extended-precision register adds 8 bits of precision to the floating-point arithmetic result. Execution of the RND (round) instruction forces the result of the floating-point arithmetic to be optimal. When you round the result of the addition or subtraction operations on the TMS320C30, these floating-point operations become optimal.
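The difference between a faithful (truncated) and an optimal (rounded) result can be made concrete with exact rational arithmetic. The sketch below is an illustration under stated assumptions: `to_t_bits` is a hypothetical helper, truncation is modeled as rounding toward minus infinity, ties in rounding go to the even neighbor, and exponent range limits are ignored.

```python
from fractions import Fraction
import math

def to_t_bits(v, t=24, optimal=False):
    """Reduce an exact rational v to a t-bit significand.

    optimal=False truncates (toward minus infinity), which is faithful;
    optimal=True rounds to the nearest representable value.
    """
    v = Fraction(v)
    if v == 0:
        return v
    # Find e such that 2**(t-1) <= |v| / 2**e < 2**t.
    e = abs(v.numerator).bit_length() - v.denominator.bit_length() - t
    while abs(v) / Fraction(2) ** e >= 2 ** t:
        e += 1
    while abs(v) / Fraction(2) ** e < 2 ** (t - 1):
        e -= 1
    scaled = v / Fraction(2) ** e
    m = round(scaled) if optimal else math.floor(scaled)
    return m * Fraction(2) ** e

v = 1 + Fraction(3, 2 ** 25)        # 1 plus three quarters of the last bit
print(to_t_bits(v), to_t_bits(v, optimal=True))
```

For this value the truncated result is off by three quarters of the last bit position, while the rounded result is off by one quarter, within the one-half limit that defines an optimal operation.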

## Implementing Doublelength Floating-Point Arithmetic

This section presents the algorithms used in implementing doublelength arithmetic in pseudo-code for a number of fundamental floating-point operations. The basic idea of doublelength arithmetic can be extended to multiplelength precision, given that the start of the implementation is based on the highest precision available on the system. Therefore, to achieve quadruplelength results, the same algorithm can be applied to doublelength values, and so on. The implementation is based on the theoretical results presented in Reference [2].

## Exact Singlelength Addition

In this discussion of the algorithm used to carry out exact addition and its implementation on the TMS320C30 DSP, the term exact refers to performing an operation on two floating-point numbers, $x$ and $y$, and obtaining a doublelength floating-point number pair (z, zz) to represent the result. In this implementation, we have not accounted for floating-point exponent overflow or underflow. For this algorithm to produce a correct result, the floating-point addition and subtraction must be optimal.

The purpose of exact addition is to find a term, zz, that satisfies Equation (2).

$$
\begin{equation*}
z+zz=x+y \tag{2}
\end{equation*}
$$

Equation (2) can be rewritten as

$$
\begin{equation*}
zz=y-(z-x) \tag{3}
\end{equation*}
$$

Equation (3) can be expanded into Equation (4).

$$
\begin{align*}
& w=z-x  \tag{4}\\
& zz=y-w
\end{align*}
$$

In particular, $|\mathrm{x}|>|\mathrm{y}|$ must be valid for Equation (4) to be valid. Implementation of Equation (4) on the TMS320C30 always generates the exact correction term zz if the result of floating-point addition operation is made optimal. This requirement guarantees that the result of single-precision floating-point add and subtract belongs to system R. By swapping the x and y values when $|\mathrm{x}|<|\mathrm{y}|$, the condition for obtaining an exact result is met.

The algorithm requires that x and y be normalized. Normalization guarantees that the floating-point number has only one sign bit, and that sign bit is followed by nonsign bits [1]. Floating-point addition on the TMS320C30 assumes that the operands are normalized.

The TMS320C30 assembly code for obtaining the doublelength sum of two singlelength floating-point numbers x and y is shown in Appendix A. First, the values for x and y are interchanged when $|\mathrm{x}|<|\mathrm{y}|$. When you add x and y values, the number with the smaller exponent, $y$, is shifted repeatedly until the exponents of $x$ and $y$ are equal and their mantissas are aligned. We have now calculated the singlelength number, $z$, that satisfies Equation (2). Since the floating-point addition on the TMS320C30 is made optimal by rounding, the extra precision is, in effect, dropped. The extra precision value, zz , is obtained by implementing Equation (4). Figure 2 is a graphical representation of the implemented algorithm. The figure also shows the relationship between doublelength number pair ( $\mathrm{z}, \mathrm{zz}$ ) and singlelength floating-point numbers and their representation on the TMS320C30.


Figure 2. Exact Singlelength Addition

The same algorithm can be used to implement exact floating-point subtraction on the DSP. This is accomplished by negating the second operand and performing an exact addition.
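Equations (2) through (4), together with the operand swap, fit in a few lines. The Python sketch below is not the Appendix A routine: it assumes IEEE single precision with round-to-nearest (emulated by the hypothetical helper `f32`) in place of the TMS320C30 format with RND, ignores overflow and underflow, and uses the hypothetical name `add12`.

```python
import struct
from fractions import Fraction

def f32(v: float) -> float:
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]

def add12(x: float, y: float):
    """Return (z, zz) with z + zz equal to x + y for single-precision x, y."""
    if abs(x) < abs(y):      # Equation (4) needs the larger magnitude in x
        x, y = y, x
    z = f32(x + y)           # rounded (optimal) singlelength sum
    w = f32(z - x)           # Equation (4): the part of y that reached z
    zz = f32(y - w)          # Equation (4): the part of y rounded away
    return z, zz

x = 17179868160.0            # 217FFFFFh
y = 8191.99951171875         # 0C7FFFFFh
z, zz = add12(x, y)
print(z, zz)
# Exact rational check of Equation (2): z + zz == x + y
print(Fraction(z) + Fraction(zz) == Fraction(x) + Fraction(y))
```

Exact subtraction follows by calling `add12(x, -y)`, mirroring the negate-and-add approach described above.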

## Doublelength Addition

A natural extension of exact singlelength addition and subtraction is its application to doublelength arithmetic. Figure 3 shows an algorithm for implementing doublelength addition on the DSP. Using this algorithm, you can add two doublelength numbers (x, xx) and (y, yy) and represent the result as a doublelength number (z, zz).

The algorithm requires forming a doublelength number (r, rr) that represents an exact addition of $x$ and $y$. Generating a second number, $s=((rr+yy)+xx)$, results in a number pair (r, s) that approximates the addition of (x, xx) and (y, yy). Finally, an exact addition of $r$ and $s$ generates a doublelength number (z, zz) that has the same value as (x, xx) + (y, yy).

To obtain exact results for addition and subtraction, subtraction and addition must be optimal; this is guaranteed by following each subtraction or addition instruction on the DSP with a round instruction.

; Calculate the doublelength sum of (x, xx) and (y, yy),
; the result being (z, zz)

$$
\begin{aligned}
& r=x+y \\
& \text{if } (\operatorname{abs}(x)>\operatorname{abs}(y)) \\
& \quad s=x-r+y+yy+xx \\
& \text{else} \\
& \quad s=y-r+x+xx+yy \\
& z=r+s \\
& zz=r-z+s
\end{aligned}
$$

Figure 3. Doublelength Addition
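Figure 3 translates almost line for line into code when every operation is rounded to single precision. The sketch below is a hedged Python illustration: `f32` and `add2` are hypothetical names, IEEE single precision with round-to-nearest stands in for the TMS320C30 format, and the operands are hand-decoded values of (22000003h, 097FFFF8h) and (0A29ABD8h, EFA46000h). The bound printed at the end is (|x+xx| + |y+yy|) × 2^(2-2t) with t = 24.

```python
import struct
from fractions import Fraction

def f32(v: float) -> float:
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]

def add2(x, xx, y, yy):
    """Doublelength sum of (x, xx) and (y, yy), following Figure 3."""
    r = f32(x + y)
    if abs(x) > abs(y):
        s = f32(f32(f32(f32(x - r) + y) + yy) + xx)
    else:
        s = f32(f32(f32(f32(y - r) + x) + xx) + yy)
    z = f32(r + s)
    zz = f32(f32(r - z) + s)
    return z, zz

x, xx = 17179875328.0, 1023.99951171875            # 22000003h, 097FFFF8h
y, yy = 1357.3701171875, -1.7158203125 * 2.0**-17  # 0A29ABD8h, EFA46000h
z, zz = add2(x, xx, y, yy)

exact = Fraction(x) + Fraction(xx) + Fraction(y) + Fraction(yy)
error = abs(Fraction(z) + Fraction(zz) - exact)
bound = (abs(Fraction(x) + Fraction(xx))
         + abs(Fraction(y) + Fraction(yy))) * Fraction(1, 2**46)
print(z, zz, float(error), float(bound))
```

In this run the small yy term is lost when it is added to a much larger intermediate value, so the result is not exact, but the error stays well inside the bound.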

## Exact Singlelength Multiplication

The exact singlelength multiplication is shown in Figure 4. The algorithm requires breaking the x and y mantissas into half-length numbers, referred to as head (hx, hy) and tail (tx, ty) sections [2]. This algorithm requires addition and subtraction to be optimal and multiplication faithful. The TMS320C30 DSP multiplication result is faithful if the contents of the extended-precision register are truncated.

To split $x$ and $y$ into two half-length numbers, a constant value is needed that is dependent on the number of available digits. The TMS320C30 device has $t=24$ bits of mantissa in the single-precision format. Equation (5) shows that head section hx is chosen to be as near to the value of $x$ as possible.

$$
\begin{equation*}
hx=\operatorname{round}\left(m(x)\, 2^{-t_1}\right) 2^{e(x)+t_1} \tag{5}
\end{equation*}
$$

Also, $t_1$ is chosen to be approximately one-half of the available precision, or 12, on the processor. This effectively breaks the mantissa into half-length values. Equation (5) shows that $hx$ is obtained by rounding and is defined to be an element of $R\{t_1\}$. The tail section $tx$ is easily obtained by subtracting $hx$ from $x$. Since floating-point subtraction can be made optimal on the TMS320C30, it follows that $tx$ is an element of $R\{t_1-1\}$. Setting the constant equal to $2^{12}$ does not always satisfy Equation (5) when $t$ is even. When the constant is set to $2^{12}+1$, the definition of Equation (5) is satisfied. The proof for the above is given in Reference [2].

; Calculate the exact product of $x$ and $y$, the result being
; a doublelength number (z, zz). This algorithm uses the
; following syntax when called from a user program as shown
; mult12 (x,y,z,zz);

$$
\begin{aligned}
& p=x \times \text{constant}; \\
& hx=x-p+p; \\
& tx=x-hx; \\
& p=y \times \text{constant}; \\
& hy=y-p+p; \\
& ty=y-hy; \\
& p=hx \times hy; \\
& q=hx \times ty+tx \times hy; \\
& z=p+q; \\
& zz=p-z+q+tx \times ty;
\end{aligned}
$$

Figure 4. Exact Singlelength Product
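The split and the exact product of Figure 4 can be tried out as follows. This is a sketch under assumptions, not the report's assembly: IEEE single precision with round-to-nearest replaces the TMS320C30 arithmetic (every operation is rounded, including the multiplications), `f32`, `split`, and `mult12` are hypothetical Python names, and exponent overflow and underflow are ignored. The operands are the decimal values of 0F7FFFFFh and 21FFFFFFh.

```python
import struct
from fractions import Fraction

CONSTANT = 4097.0   # 2**12 + 1, for t = 24 mantissa bits

def f32(v: float) -> float:
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]

def split(x: float):
    """Split x into a half-length head hx and tail tx with hx + tx == x."""
    p = f32(x * CONSTANT)
    hx = f32(f32(x - p) + p)
    tx = f32(x - hx)
    return hx, tx

def mult12(x: float, y: float):
    """Return (z, zz) with z + zz equal to x * y, following Figure 4."""
    hx, tx = split(x)
    hy, ty = split(y)
    p = f32(hx * hy)
    q = f32(f32(hx * ty) + f32(tx * hy))
    z = f32(p + q)
    zz = f32(f32(f32(p - z) + q) + f32(tx * ty))
    return z, zz

x = 65535.99609375        # 0F7FFFFFh
y = -8589935616.0         # 21FFFFFFh
print(split(x))
z, zz = mult12(x, y)
print(z, zz, Fraction(z) + Fraction(zz) == Fraction(x) * Fraction(y))
```

The head of this x is 65536 and the tail is -2^-8, so each partial product fits in 24 bits and the four-term sum can be accumulated without losing information.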

## Doublelength Multiplication

The doublelength multiplication algorithm, shown in Figure 5, relies on the singlelength algorithm discussed earlier. The algorithm generates a nearly doublelength approximation of the output result (c,cc). Note that the exact singlelength multiplication routine is used for this approximation. Exact addition is used to generate a doublelength floating-point number that is the closest approximation to the actual result.

The doublelength product program implementation uses the TMS320C30 stack capabilities to save some intermediate variables. These programs are written to be used as callable functions or macros in your program. In either case, the stack pointer must be set to a valid memory segment for proper code execution.

; Calculate the doublelength product of (x, xx) and (y, yy),
; the result being a nearly doublelength number (z, zz).
; Program uses exact singlelength multiplication, mult12 (.).

$$
\begin{aligned}
& \text{mult12}(x, y, c, cc) \\
& cc=x \times yy+xx \times y+cc \\
& z=c+cc \\
& zz=c-z+cc
\end{aligned}
$$

Figure 5. Exact Doublelength Product
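A hedged Python rendering of Figure 5 follows. It reuses the same assumptions as the singlelength product sketch: IEEE single precision with round-to-nearest emulated by a hypothetical `f32` helper, hypothetical function names, and no handling of overflow or underflow. The operands are the decimal values of (22000000h, 097FFFFEh) and (21000001h, 097FFFFEh).

```python
import struct

CONSTANT = 4097.0   # 2**12 + 1

def f32(v: float) -> float:
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]

def split(x):
    p = f32(x * CONSTANT)
    hx = f32(f32(x - p) + p)
    return hx, f32(x - hx)

def mult12(x, y):
    """Exact singlelength product (Figure 4)."""
    hx, tx = split(x)
    hy, ty = split(y)
    p = f32(hx * hy)
    q = f32(f32(hx * ty) + f32(tx * hy))
    z = f32(p + q)
    return z, f32(f32(f32(p - z) + q) + f32(tx * ty))

def mult2(x, xx, y, yy):
    """Nearly doublelength product of (x, xx) and (y, yy) (Figure 5)."""
    c, cc = mult12(x, y)
    cc = f32(f32(f32(x * yy) + f32(xx * y)) + cc)   # cross terms; xx*yy is dropped
    z = f32(c + cc)
    zz = f32(f32(c - z) + cc)
    return z, zz

x, xx = 2.0**34, 1024 - 2.0**-13          # 22000000h, 097FFFFEh
y, yy = 2.0**33 + 1024, 1024 - 2.0**-13   # 21000001h, 097FFFFEh
z, zz = mult2(x, xx, y, yy)
print(z, zz)
```

For these operands the sketch yields z = 2^67 + 2^45 and zz = 2^43 - 2^21, the values encoded by 43000002h and 2A7FFFFCh. The xx × yy term is never formed, which is why the result is only nearly doublelength.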

## Doublelength Quotient and Square Root

Figures 6 and 7 show the algorithms used in the doublelength quotient and doublelength square root routines. Singlelength multiplication is used to generate a doublelength approximation of the quotient or square root values. As with doublelength multiplication, exact addition is used to generate a doublelength floating-point result.

; Calculate the doublelength quotient of (x, xx) and (y, yy),
; the result being (z, zz)

$$
\begin{aligned}
& c=x / y \\
& \text{mult12}(c, y, u, uu) \\
& cc=(x-u-uu+xx-c \times yy) / y \\
& z=c+cc \\
& zz=c-z+cc
\end{aligned}
$$

Figure 6. Doublelength Quotient
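The quotient algorithm of Figure 6 can be sketched in the same style. The assumptions are the same as before and are worth restating: IEEE single precision with round-to-nearest stands in for the TMS320C30 arithmetic, `f32`, `split`, `mult12`, and `div2` are hypothetical names, and the singlelength quotient c comes from an IEEE division rather than from a DSP division routine, so the low-order bits of zz need not match values produced on the device.

```python
import struct
from fractions import Fraction

CONSTANT = 4097.0   # 2**12 + 1

def f32(v: float) -> float:
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]

def split(x):
    p = f32(x * CONSTANT)
    hx = f32(f32(x - p) + p)
    return hx, f32(x - hx)

def mult12(x, y):
    """Exact singlelength product (Figure 4)."""
    hx, tx = split(x)
    hy, ty = split(y)
    p = f32(hx * hy)
    q = f32(f32(hx * ty) + f32(tx * hy))
    z = f32(p + q)
    return z, f32(f32(f32(p - z) + q) + f32(tx * ty))

def div2(x, xx, y, yy):
    """Doublelength quotient of (x, xx) and (y, yy) (Figure 6)."""
    c = f32(x / y)                       # singlelength first guess
    u, uu = mult12(c, y)                 # exact c * y
    cc = f32(f32(f32(f32(f32(x - u) - uu) + xx) - f32(c * yy)) / y)
    z = f32(c + cc)
    zz = f32(f32(c - z) + cc)
    return z, zz

x, xx = 2.0**34, 1024 - 2.0**-13          # 22000000h, 097FFFFEh
y, yy = 2.0**33 + 1024, 1024 - 2.0**-13   # 21000001h, 097FFFFEh
z, zz = div2(x, xx, y, yy)
exact = (Fraction(x) + Fraction(xx)) / (Fraction(y) + Fraction(yy))
print(z, zz, float(abs(Fraction(z) + Fraction(zz) - exact)))
```

The correction term cc is one refinement step: it divides the residual of the first guess, computed with the exact product, by y.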

; Calculate the doublelength square root of (x, xx), the
; result being (z, zz)

```
if (x > 0) {
    c = sqrt(x);
    mult12(c, c, u, uu);
    cc = (x - u - uu + xx) * 0.5 / c;
    z = c + cc;
    zz = c - z + cc;
} else {
    z = zz = 0;
}
```

Figure 7. Doublelength Square Root
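Figure 7 can be exercised the same way. In this Python sketch, IEEE single precision with round-to-nearest is again assumed in place of the TMS320C30 arithmetic, the singlelength square root comes from `math.sqrt` rounded to single precision, and `f32`, `split`, `mult12`, and `sqrt2` are hypothetical names. The operands are the decimal values of (21000001h, 097FFFFEh).

```python
import math
import struct
from fractions import Fraction

CONSTANT = 4097.0   # 2**12 + 1

def f32(v: float) -> float:
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]

def split(x):
    p = f32(x * CONSTANT)
    hx = f32(f32(x - p) + p)
    return hx, f32(x - hx)

def mult12(x, y):
    """Exact singlelength product (Figure 4)."""
    hx, tx = split(x)
    hy, ty = split(y)
    p = f32(hx * hy)
    q = f32(f32(hx * ty) + f32(tx * hy))
    z = f32(p + q)
    return z, f32(f32(f32(p - z) + q) + f32(tx * ty))

def sqrt2(x, xx):
    """Doublelength square root of (x, xx) (Figure 7)."""
    if x > 0:
        c = f32(math.sqrt(x))            # singlelength first guess
        u, uu = mult12(c, c)             # exact c * c
        cc = f32(f32(f32(f32(f32(x - u) - uu) + xx) * 0.5) / c)
        z = f32(c + cc)
        zz = f32(f32(c - z) + cc)
        return z, zz
    return 0.0, 0.0

x, xx = 2.0**33 + 1024, 1024 - 2.0**-13   # 21000001h, 097FFFFEh
z, zz = sqrt2(x, xx)
root = Fraction(z) + Fraction(zz)
print(z, zz, float(root * root - (Fraction(x) + Fraction(xx))))
```

The last printed value is the residual of the squared result against x + xx, computed in exact rational arithmetic, which gives a quick sanity check without needing a higher-precision square root.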

## Error Analysis

This section discusses and determines an upper bound for the error generated in forming a doublelength result. The value of the doublelength number (z, zz) is equal to z + zz. Singlelength addition, subtraction, and multiplication results are always exact. In doublelength addition, any error introduced in the end result is generated by calculating the zz term. An upper bound error magnitude has been calculated in Reference [2] and is shown in Equation (6) as follows:

$$
\begin{equation*}
\left|E^{+}\right| \leq\{|x+xx|+|y+yy|\} \times 2^{2-2t}=|Z| \times 2^{2-2t} \tag{6}
\end{equation*}
$$

where $t=24$ for this system. This gives an upper bound of $|Z| \times 2^{-46}$, or approximately $|Z| \times 1.42 \times 10^{-14}$. This translates to a theoretical accuracy greater than 13 decimal places. Table 1 shows an example of doublelength addition using the exact addition algorithm previously described. The numbers in the left column represent TMS320C30 hexadecimal notation for the floating-point results, and (z, zz) is the decimal equivalent of the doublelength output result. Appendix B shows a listing of C programs (exact) that convert from TMS320C30 hexadecimal notation to decimal notation.

Table 1. Exact Singlelength Arithmetic Examples

| Operation | x | y | z | zz | (z, zz) Exact | (z, zz) DSP |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| Singlelength addition | 217FFFFFh | 0C7FFFFFh | 22000003h | 097FFFF8h | 17179876351.9995117 | 17179876351.9995117 |
| Singlelength addition | FC7CB923h | 0A29A7E5h | 0A29ABD8h | EFA46000h | 1357.37010409682989 | 1357.37010409682989 |
| Singlelength multiplication | 0F7FFFFFh | 21FFFFFFh | 30800000h | 18800002h | -562949986975740 | -562949986975740 |
| Singlelength multiplication | FC7CB923h | 0A29A7E5h | 07277BF7h | EBA714F0h | 167.484236862815123 | 167.484236862815123 |

The doublelength product, quotient, and square-root algorithms all have a small relative error. The upper bound error magnitude for each is given in Equations (7) through (9).

$$
\begin{align*}
& \left|E^{\times}\right|=(|x+xx| \times|y+yy|) \times 11 \times 2^{-48}  \tag{7}\\
& \left|E^{\div}\right|=(|x+xx| \div|y+yy|) \times 21.1 \times 2^{-48}  \tag{8}\\
& \left|E^{\sqrt{ }}\right|=\operatorname{sqrt}(|x+xx|) \times 12.7 \times 2^{-48} \tag{9}
\end{align*}
$$

Equation (7) establishes an upper bound of $|Z| \times 3.9 \times 10^{-14}$, or approximately 13 decimal digits of accuracy, for doublelength multiplication. Similarly, Equation (8) establishes an upper bound of $|Z| \times 7.5 \times 10^{-14}$ for the doublelength quotient, and Equation (9) an upper bound of $|Z| \times 4.5 \times 10^{-14}$ for the doublelength square root, each better than 13 decimal digits. Table 2 shows examples for each algorithm discussed, along with the algorithm output and the expected theoretical output.
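
As a quick check of these figures, the constants in Equations (6) through (9) can be evaluated directly, using $2^{-48} \approx 3.553 \times 10^{-15}$:

$$
\begin{aligned}
2^{2-2 t}=2^{-46} & \approx 1.42 \times 10^{-14} \\
11 \times 2^{-48} & \approx 3.9 \times 10^{-14} \\
21.1 \times 2^{-48} & \approx 7.5 \times 10^{-14} \\
12.7 \times 2^{-48} & \approx 4.5 \times 10^{-14}
\end{aligned}
$$

Every one of these relative bounds is below $10^{-13}$, which is the basis for the 13-decimal-digit accuracy figure.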

Table 2. Exact Doublelength Arithmetic Examples

| Doublelength Multiplication |  |
| :---: | :---: |
| $x$ = 22000000h |  |
| $xx$ = 097FFFFEh |  |
| $y$ = 21000001h |  |
| $yy$ = 097FFFFEh |  |
| $z$ = 43000002h | $(z, zz)$ = $1.47573996570139475 \times 10^{20}$ (Exact) |
| $zz$ = 2A7FFFFCh | $1.47573996570139427 \times 10^{20}$ (DSP) |
| $x$ = 22000003h |  |
| $xx$ = 097FFFF8h |  |
| $y$ = 0A29ABD8h |  |
| $yy$ = EFA46000h |  |
| $z$ = 2C29ABDDh | $(z, zz)$ = 23319450552284.2434 (Exact) |
| $zz$ = 13907DC2h | 23319450552284.1250 (DSP) |
| Doublelength Quotient |  |
| $x$ = 43000002h |  |
| $xx$ = 2A7FFFFCh |  |
| $y$ = 2C29ABDDh |  |
| $yy$ = 13907DC2h |  |
| $z$ = 1641205Ah | $(z, zz)$ = 6328365.08044074177 (Exact) |
| $zz$ = FC24BE20h | 6328365.08044075966 (DSP) |
| $x$ = 22000000h |  |
| $xx$ = 097FFFFEh |  |
| $y$ = 21000001h |  |
| $yy$ = 097FFFFEh |  |
| $z$ = 007FFFFDh | $(z, zz)$ = 1.99999964237223082 (Exact) |
| $zz$ = D34000000h | 1.99999964237217398 (DSP) |
| Doublelength Square Root |  |
| $x$ = 2C2BDD00h |  |
| $xx$ = 3907DC2h |  |
| $z$ = 61451A4h | $(z, zz)$ = 4860114.04539400958 (Exact) |
| $zz$ = FB39EF11h | 4860114.04539400712 (DSP) |
| $x$ = 21000001h |  |
| $xx$ = 097FFFFEh |  |
| $z$ = 103504F5h | $(z, zz)$ = 92681.9110722252960 (Exact) |
| $zz$ = F7BC0784h | 92681.9110722253099 (DSP) |

Note that the results were obtained using the programs shown in Appendix B. The C programs were created and compiled on an 80386-based microcomputer running MS-DOS 3.3.

## How to Generate C-Callable Functions

The source listings for the extended-precision arithmetic routines presented in Appendix A are optimized for execution speed and code size. These routines are designed to be used as macros in a user program or, with a few adjustments, as C functions.

This section provides an overview of the TMS320C30 C compiler calling conventions needed to create functions that can be added to the C compiler library. You need a working knowledge of the C language to understand the terminology in this section [4, 5, 6].

The C compiler uses the processor stack to pass arguments to functions, store local variables, and save temporary values. The compiler dedicates two TMS320C30 registers to managing the stack: the stack pointer (SP) and the frame pointer (AR3).

When a C program calls a function, it must perform the following tasks, in this order:

1. Push the arguments onto the stack.
2. Call the function.
3. Pop the arguments off the stack.

The called C function, in turn, must perform the following tasks:

1. Set up a local frame by saving the old frame pointer on the stack.
2. Assign the new frame pointer to the current value of the stack pointer.
3. Allocate the frame.
4. Save any dedicated registers that the function modifies.
5. Execute the function code.
6. Store a scalar return value in R0.
7. Deallocate the frame.
8. Restore the old frame pointer [4].

The following code segment shows the singlelength addition routine modified to be in C-callable form. Note that registers R4 through R7 and AR4 through AR7 are dedicated registers used by the compiler. These registers must be saved as floating-point values.

| Label | Opcode | Operands | Comment |
| :--- | :--- | :--- | :--- |
| single | .set | 0FFh |  |
| fp | .set | ar3 |  |
| x | .set | r0 |  |
| y | .set | r1 |  |
| z | .set | r? |  |
| zz | .set | r3 |  |
| w | .set | r4 |  |
| x1 | .set | r2 |  |
| y1 | .set | r3 |  |
|  | .global | _add12 |  |
|  | .width | 96 |  |
|  | .text |  |  |
| _add12: |  |  |  |
|  | push | fp | ; Save old fp |
|  | pushf | r4 |  |
|  | push | r4 |  |
|  | ldi | sp,fp | ; Point to top of stack |
|  | ldi | *-fp[2],r0 | ; Load x into r0 |
|  | ldi | *-fp[3],r1 | ; Load y into r1 |
|  | absf | x,x1 |  |
|  | absf | y,y1 |  |
|  | cmpf | y1,x1 | ; Is \|x\| > \|y\|? |
|  | ldflt | x,x1 | ; If not, swap x and y |
|  | ldflt | y,x |  |
|  | ldflt | x1,y |  |
|  | addf3 | x,y,z | ; z = x + y |
|  | rnd | z |  |
|  | subf3 | x,z,w | ; Form w = z - x |
|  | rnd | w |  |
|  | subf3 | w,y,zz | ; zz = y - w |
|  | rnd | zz |  |
|  | pop | r4 |  |
|  | popf | r4 |  |
|  | pop | fp | ; Restore fp |
|  | retsu |  |  |
|  | .end |  |  |
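
The arithmetic carried out by the routine above can be modeled in portable C. The following is a minimal sketch, assuming IEEE single-precision `float` with round-to-nearest in place of the TMS320C30 format; the type name `dl_float` and the C-level interface are hypothetical, and the `volatile` qualifiers are only there to force each intermediate result to be rounded to single precision.

```c
/* Doublelength result: value = z + zz, with zz the correction term. */
typedef struct { float z; float zz; } dl_float;   /* hypothetical type name */

/* Exact singlelength addition: returns (z, zz) with z + zz = x + y. */
dl_float add12(float x, float y)
{
    float ax = (x < 0.0f) ? -x : x;
    float ay = (y < 0.0f) ? -y : y;

    /* Make sure |x| >= |y|, as the assembly does with cmpf/ldflt. */
    if (ax < ay) {
        float t = x;
        x = y;
        y = t;
    }

    volatile float z  = x + y;    /* z  = x + y, rounded      */
    volatile float w  = z - x;    /* w  = z - x               */
    volatile float zz = y - w;    /* zz = y - w, the residual */

    dl_float out = { z, zz };
    return out;
}
```

For example, `add12(16777216.0f, 1.0f)` cannot represent 16777217 in a single `float`, so the sum is returned as the pair (16777216, 1).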

## Conclusion

This report presented an implementation of extended-precision arithmetic routines for the TMS320C30 DSP. The programs presented include singlelength floating-point addition, subtraction, and multiplication, which produce exact doublelength results. Doublelength floating-point addition, subtraction, multiplication, division, and square root were also presented. The doublelength floating-point routines all have a small relative error that appears in the correction term zz. However, it has been shown that the accuracy of the doublelength floating-point result is at least 13 decimal digits. Table 3 is a summary of information about the routines contained in Appendices A and B. Execution times shown in the table are given only for the routines in Appendix A. These times do not include the call and return if the routine is implemented as a called function. They also do not include any context saves and restores that may be required.

Table 3. Summary Information

| Routine | Mnemonic | Appendix | Code Size <br> (Words) | Execution <br> (Cycles) |
| :--- | :--- | :--- | :---: | :---: |
| Singlelength Add | _add12 | A1 | 12 | 12 |
| Doublelength Add | _dbladd | A2 | 25 | 25 |
| Singlelength Multiply | _mult12 | A3 | 35 | 35 |
| Doublelength Multiply |  |  |  |  |
| Doublelength Divide | _div2 |  |  | 115 |
| Doublelength Square Root | _sqrt2 |  |  | 163 |
| Change Two Single-Precision TMS320C30 Numbers to One Double-Precision Result |  |  |  |  |
| Change Two Double-Precision TMS320C30 Numbers to a Double-Precision Result |  |  |  |  |

## References

[1] Third-Generation TMS320 User's Guide (literature number SPRU031), Texas Instruments, Inc., 1988.
[2] Dekker, T.J., "A Floating-Point Technique for Extending the Available Precision," Numer. Math. 18, 1971, pp. 224-242.
[3] Linnainmaa, S., "Software for Doubled-Precision Floating-Point Computations," ACM Transactions on Mathematical Software, Vol. 7, No. 3, Sept. 1981, pp. 272-283.
[4] TMS320C30 C Compiler (literature number SPRU034), Texas Instruments, Inc., 1988.
[5] Kernighan, B.W. and Ritchie, D.M., The C Programming Language, 2nd Revision, Prentice-Hall, Englewood Cliffs, New Jersey, 1978.
[6] Kochan, S.G., Programming in C, Second Edition, Howard W. Sams, Indianapolis, Indiana, 1988.

## Appendix A



## Appendix A2. Double Length Add



    *******************************************
    * FUNCTION DEF : _mult12
    *
    * AUTHOR: Al Lovrich 2/21/89
    *         Texas Instruments, Inc.
    * Entry Conditions:
    *   Upon entry (r0,r1) contains (x,y).
    * Exit Conditions:
    *   Upon exit (r0,r1) contains (z,zz).
    * Registers Affected:
    *   r0, r1, r2, r3, r4, r5, r6, r7
    *
    * Revision: Original
    * Execution Time: 35 Cycles
    *******************************************
            .global _mult12
    single  .set    0ffh
    x       .set    r0
    y       .set    r1
    p       .set    r2
    hx      .set    r3
    tx      .set    r4
    q       .set    r5
    hy      .set    r5
    ty      .set    r6
    z       .set    r0
    zz      .set    r1
    temp    .set    r7
            .text
    _mult12:
            ldf     @constant,temp
            mpyf3   temp,x,p        ; p = x * constant
            andn    single,p        ; fl(*) is faithful
    *
            subf3   p,x,hx          ; hx = x - p
            rnd     hx
            addf3   hx,p,hx         ; hx = x - p + p
            rnd     hx
    *
            subf3   hx,x,tx         ; tx = x - hx
            rnd     tx
    *
            mpyf3   temp,y,p        ; p = y * constant
            andn    single,p        ; fl(*) is faithful
    *
            subf3   p,y,hy          ; hy = y - p
            rnd     hy
            addf3   hy,p,hy         ; hy = y - p + p
            rnd     hy
    *
            subf3   hy,y,ty         ; ty = y - hy
            rnd     ty
    *
            mpyf3   hx,hy,p         ; p = hx * hy
            andn    single,p        ; fl(*) is faithful
    *
            mpyf3   hx,ty,temp      ; temp = hx * ty
            andn    single,temp     ; fl(*) is faithful
            mpyf3   tx,hy,q         ; q = tx * hy
            andn    single,q        ; fl(*) is faithful
            addf3   q,temp,q        ; q = hx * ty + tx * hy
            rnd     q
    *
            addf3   p,q,z           ; z = p + q
            rnd     z
    *
            subf3   z,p,zz          ; zz = p - z
            rnd     zz
            addf    q,zz            ; zz = p - z + q
            rnd     zz
            mpyf3   tx,ty,temp      ; temp = tx * ty
            andn    single,temp     ; fl(*) is faithful
            addf3   zz,temp,zz      ; zz = p - z + q + tx * ty
            rnd     zz
    *
            retsu
            .data
    constant:
            .float  4097            ; constant = 2^(24-24/2)+1
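
The exact multiplication steps in the listing above (split each operand with the constant 4097, then combine the partial products) can be sketched in C as follows. This is a model under stated assumptions rather than a port of the assembly: it uses IEEE single precision with round-to-nearest instead of the TMS320C30 format, the type name `dl_float` and the helper `split` are hypothetical, and `volatile` is used only to force single-precision rounding of each intermediate.

```c
/* Doublelength result: value = z + zz, with zz the correction term. */
typedef struct { float z; float zz; } dl_float;   /* hypothetical type name */

/* Split a into a high part with about 12 significant bits and a low
 * remainder, using the constant 2^12 + 1 = 4097 as in the listing. */
static void split(float a, float *hi, float *lo)
{
    volatile float p = a * 4097.0f;   /* p  = a * constant  */
    volatile float h = a - p;         /* h  = a - p         */
    h = h + p;                        /* hi = (a - p) + p   */
    volatile float t = a - h;         /* lo = a - hi        */
    *hi = h;
    *lo = t;
}

/* Exact singlelength multiplication: z + zz = x * y
 * (barring overflow and underflow). */
dl_float mul12(float x, float y)
{
    float hx, tx, hy, ty;
    split(x, &hx, &tx);
    split(y, &hy, &ty);

    volatile float p = hx * hy;       /* p = hx * hy              */
    volatile float q = hx * ty;
    volatile float r = tx * hy;
    q = q + r;                        /* q = hx * ty + tx * hy    */
    volatile float z = p + q;         /* z = p + q                */
    volatile float zz = p - z;        /* zz = p - z               */
    zz = zz + q;                      /* zz = p - z + q           */
    volatile float t = tx * ty;
    zz = zz + t;                      /* zz = p - z + q + tx * ty */

    dl_float out = { z, zz };
    return out;
}
```

Because a product of two 24-bit significands fits in a `double`, the result can be checked by comparing `(double)z + (double)zz` against `(double)x * (double)y`.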



```
* FUNCTION DEF : _div2
*
* AUTHOR: Al Lovrich 2/21/89
*         Texas Instruments, Inc.
* Entry Conditions:
*   Upon entry (r0,r1) contains (x,y),
*   and (r2,r3) contains (xx,yy).
* Exit Conditions:
*   Upon exit (r0,r1) contains (z,zz).
* Registers Affected:
*   r0, r1, r2, r3, r4, r5, r6, r7
* Algorithm used:
*   c = x / y;
*   mult12(c, y, u, uu);
*   cc = (x - u - uu + xx - c * yy) / y;
*   z = c + cc;
*   zz = c - z + cc;
* Revision: Original
* Execution Time: 115 Cycles
        .global _div2
* ...
        mpyf    r0,r1,r2        ; R2 = v * x[2]
        andn    single,r2
        subrf   2.0,r2
        rnd     r2
        mpyf    r2,r0           ; R1 = x[3] = x[2] * (2.0 - v * x[2])
        andn    single,r0
        mpyf    r0,r1,r2        ; R2 = v * x[3]
        andn    single,r2
        subrf   2.0,r2
        rnd     r2
        mpyf    r2,r0           ; R1 = x[4] = x[3] * (2.0 - v * x[3])
        andn    single,r0       ; This minimizes error in the LSBs.
*
* For the last iteration we use the formulation:
*   x[5] = (x[4] * (1.0 - (v * x[4]))) + x[4]
*
        mpyf    r0,r1,r2        ; R2 = v * x[4] = 1.0..01.. => 1
        andn    single,r2
        subrf   1.0,r2          ; R2 = 1.0 - v * x[4] = 0.0..01... => 0
        rnd     r2
        mpyf    r0,r2           ; R2 = x[4] * (1.0 - v * x[4])
        andn    single,r2
        addf    r2,r0           ; R2 = x[5] = (x[4] * (1.0 - (v * x[4]))) + x[4]
        rnd     r0,r1           ; Round since this is followed by a MPYF.
*
* Now the case of v < 0 is handled.
*
        negf    r1,r2
        ldf     r3,r3           ; This sets condition flags.
        ldfn    r2,r1           ; If v < 0, then R1 = -R1
*
        ldf     r1,r4           ; save 1/y
*
* restore variables
        popf    y               ; restore y
* ...
        mpyf    y1,x
        andn    single,x
* save variables
        pushf   x               ; save c
        pushf   y1              ; save 1/y
*
* mult12(c, y, u, uu)
* ...
```
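
The reciprocal used for the quotient is refined by the Newton-Raphson iteration named in the listing comments, $x[n+1] = x[n](2 - v\,x[n])$, with the last step written as $x[5] = x[4](1 - v\,x[4]) + x[4]$. The sketch below illustrates that iteration in C with IEEE `float`; the function name `recip_newton` and the way the starting value is derived (a power of two found by a scaling loop) are assumptions made for this example, not the method of the assembly listing.

```c
#include <assert.h>

/* Approximate 1/v by Newton-Raphson. Assumes v is a nonzero float in the
 * normal range; the initial guess x0 is a power of two chosen so that
 * 0.5 <= |v| * x0 < 1. */
float recip_newton(float v)
{
    assert(v != 0.0f);

    float a = (v < 0.0f) ? -v : v;   /* work on |v|, fix the sign at the end */
    float s = a;
    float x = 0.5f;

    while (s >= 2.0f) { s *= 0.5f; x *= 0.5f; }   /* |v| >= 2: shrink guess */
    while (s < 1.0f)  { s *= 2.0f; x *= 2.0f; }   /* |v| <  1: grow guess   */

    /* Four iterations of x[n+1] = x[n] * (2 - v * x[n]). */
    for (int i = 0; i < 4; ++i)
        x = x * (2.0f - a * x);

    /* Last iteration in the form x[5] = x[4] * (1 - v * x[4]) + x[4]. */
    x = x * (1.0f - a * x) + x;

    return (v < 0.0f) ? -x : x;      /* handle the case v < 0 */
}
```

With the starting error at most 0.5, the error is squared at every step, so five steps bring it below single-precision resolution.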

```
* FUNCTION DEF : _sqrt2
*
* AUTHOR: Al Lovrich 2/21/89
*         Texas Instruments, Inc.
* Entry Conditions:
*   Upon entry (r0,r1) contains (x,xx).
* Exit Conditions:
*   Upon exit (r0,r1) contains (z,zz).
* Registers Affected:
*   r0, r1, r2, r3, r4, r5, r6, r7
* Algorithm used:
*   c = sqrt(x);
*   mult12(c, c, u, uu);
*   cc = (x - u - uu + xx) * 0.5 / c;
*   z = c + cc;
*   zz = c - z + cc;
* Revision: Original
* Execution Time: 163 Cycles
        .global _sqrt2
single  .set    0ffh
* ...
_sqrt2:
*
* c = sqrt(x)
* Extract the exponent of v.
*
        ldf     r0,r3           ; save v
        retsle                  ; return if number non-positive
        pushf   xx              ; save xx
        pushf   x               ; save x
        mpyf    2.0,r0          ; add a rounding bit in the exponent
        andn    single,r0
        pushf   r0
        pop     r1
        ash     -25,r1          ; The 8 LSBs of R1 contain 1/2 the exponent
*
* x[0] formation given the exponent of v.
*
* ...
                                ; Now R1 = x[0] = 1.0 * 2**(-e/2).
* Generate v/2.
        mpyf    0.25,r0         ; v/2 and take rounding bit out.
        andn    single,r0
*
* Now the iterations begin.
*
* ...
        mpyf    r1,r2,r0        ; R1 = v * x[4] = 1.0..01.. => 1
        andn    single,r0
        subrf   1.0,r0          ; R1 = 1.0 - v * x[4] = 0.0..01... => 0
        rnd     r0
        mpyf    r1,r0           ; R1 = x[4] * (1.0 - v * x[4])
        andn    single,r0
        addf    r0,r1           ; R0 = x[5] = (x[4] * (1.0 - (v * x[4]))) + x[4]
        rnd     r1,r2           ; Round since this is followed by a MPYF
*
* Now the case of v < 0 is handled.
*
        negf    r2,r0
        ldf     r3,r3           ; This sets condition flags.
        ldfn    r0,r2           ; If v < 0, then R2 = -R2
*
* restore variables
* ...
```

## Appendix B

```c
/* C30ng. - Program to operate on two single-precision numbers in C30
   format and produce a double-precision result */
#include <math.h>
#include <stdio.h>

long int c30toe(long int);

int main(void)
{
    long double x, y, z;
    long int x1, y1;
    int i, operation;

    i = 1;
    do {
        printf("Type two c30 hex numbers: \n");
        printf("x = ");
        scanf("%lx", &x1);
        printf("y = ");
        scanf("%lx", &y1);
        x1 = c30toe(x1);
        x = (long double)(*(float *)(&x1));
        y1 = c30toe(y1);
        y = (long double)(*(float *)(&y1));
        do {
            printf("Add(1), Sub(2), Mpy(3), Div(4), Sqrt(5): ");
            scanf("%d", &operation);
        } while (operation < 1 || operation > 5);
        if (operation == 1) z = x + y;
        if (operation == 2) z = x - y;
        if (operation == 3) z = x * y;
        if (operation == 4) z = x / y;
        if (operation == 5) z = sqrt(x);
        printf("\nz = %.18Lg", z);
        printf("\n\nType in C30 hex result: \n");
        printf("z = ");
        scanf("%lx", &x1);
        printf("zz = ");
        scanf("%lx", &y1);
        x1 = c30toe(x1);
        x = (long double)(*(float *)(&x1));
        y1 = c30toe(y1);
        y = (long double)(*(float *)(&y1));
        z = x + y;
        printf("\nz = %.18Lg", z);
        printf("\n\nType 0 to exit, else continue : ");
        scanf("%d", &i);
    } while (i != 0);
    return 0;
}

/* C30TOE - routine to convert from a c30 floating point number to a
   number in ieee format. Both input and output in hex.
   Assumes a 32-bit long int, as on the original MS-DOS compiler. */
long int c30toe(long int x)
{
    long int mantissa, sign;
    long int exp;

    sign = x & 0x00800000;
    exp = x >> 24;
    /* exp = -128 corresponds to 0. exp = -127 is denormalized in ieee:
       represent it as 0. */
    if (exp <= -127) return(0);
    /* add implied bit and sign-extend mantissa */
    mantissa = x & 0x007fffff;
    if (sign)
        mantissa |= 0xff000000;
    else
        mantissa |= 0x00800000;
    /* convert mantissa to sign-magnitude */
    if (sign) mantissa = -mantissa;
    /* adjust mantissa if it was -2.0 */
    if (mantissa == 0x01000000) {
        exp++;
        mantissa = 0x00800000;
    }
    if (exp > 127) return(0);   /* too large number; return error */
    /* make exponent 127-excess and return ieee number */
    exp += 127;
    mantissa = (mantissa & 0x007fffff) | (sign << 8) | (exp << 23);
    return(mantissa);
}
```

```c
        printf("\nz = %.18Lg", z);
        printf("\n\nType 0 to exit, else continue : ");
        scanf("%d", &i);
    } while (i != 0);
}

/* C30TOE - routine to convert from a c30 floating point number to a
   number in ieee format. Both input and output in hex. */
long int c30toe(long int x)
{
    long int mantissa, sign;
    long int exp;

    sign = x & 0x00800000;
    exp = x >> 24;
    /* exp = -128 corresponds to 0. exp = -127 is denormalized in ieee:
       represent it as 0. */
    if (exp <= -127) return(0);
    /* add implied bit and sign-extend mantissa */
    mantissa = x & 0x007fffff;
    if (sign)
        mantissa |= 0xff000000;
    else
        mantissa |= 0x00800000;
    /* convert mantissa to sign-magnitude */
    if (sign) mantissa = -mantissa;
    /* adjust mantissa if it was -2.0 */
    if (mantissa == 0x01000000) {
        exp++;
        mantissa = 0x00800000;
    }
    if (exp > 127) return(0);   /* too large number; return error */
    /* make exponent 127-excess and return ieee number */
    exp += 127;
    mantissa = (mantissa & 0x007fffff) | (sign << 8) | (exp << 23);
    return(mantissa);
}
```

# $8 \times 8$ Discrete Cosine Transform Implementation on the TMS320C25 or the TMS320C30 

William Hohl

Digital Signal Processor Products-Semiconductor Group
Texas Instruments

## Introduction

In the general class of orthogonal transforms, one in particular, the discrete cosine transform (DCT), has recently gained wide popularity in signal processing. The DCT has found applications in such areas as data compression, pattern recognition, and Wiener filtering, primarily because it compares closely with the Karhunen-Loeve Transform (KLT) with respect to rate-distortion criteria [1]. Although the KLT is considered optimal, no fast algorithm exists to compute it, which makes the DCT an attractive alternative.

For image coding, the DCT works well because of the high correlation among adjacent data samples (pixel values). Because of this correlation, the DCT provides near-optimal data reduction while retaining high image quality. In a comparative study [2], the DCT was shown to outperform the Fourier, Hartley, and cas-cas transforms for image compression, providing even more motivation for finding fast implementations.

A number of algorithms have been developed, most notably those of Hou [3] and Lee [4], which generate higher-order DCTs from lower-order ones. This paper presents two $8 \times 8$ DCT routines, one for the TMS320C25 and another for the TMS320C30, based upon the routine in [3].

## The DCT Algorithm

For a given real data sequence $x_{0}, x_{1}, \ldots, x_{N-1}$, the discrete cosine transform is given in [1] as

$$
\begin{equation*}
z_{k}=\sqrt{\frac{2}{N}} \alpha(k) \sum_{n=0}^{N-1} x_{n} \cos \left(\frac{\pi(2 n+1) k}{2 N}\right), \quad k=0,1, \ldots, N-1 \tag{1a}
\end{equation*}
$$

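and its inverse is

$$
\begin{equation*}
x_{n}=\sqrt{\frac{2}{N}} \sum_{k=0}^{N-1} \alpha(k) z_{k} \cos \left(\frac{\pi(2 n+1) k}{2 N}\right), \quad n=0,1, \ldots, N-1 \tag{1b}
\end{equation*}
$$

where $\alpha(k)=\frac{1}{\sqrt{2}}$ for $k=0$ and unity otherwise, so that the transform is unitary. If $z_{0}$ is scaled up by $\sqrt{2}$, the DCT can also be written in matrix form as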

$$
\begin{equation*}
\mathbf{z}=\sqrt{\frac{2}{N}} T(N) \mathbf{x} \tag{2}
\end{equation*}
$$

where $\mathbf{x}$ and $\mathbf{z}$ are column vectors denoting the input and output data sequences, and $T(N)$ is the DCT matrix of order $N$. Actually, expanding the matrix (neglecting the factor of $\sqrt{\frac{2}{N}}$ for the moment), a 4-point DCT appears as

$$
\left[\begin{array}{l}
z_{0}  \tag{3}\\
z_{2} \\
z_{1} \\
z_{3}
\end{array}\right]=\left[\begin{array}{rrrr}
1 & 1 & 1 & 1 \\
\alpha & -\alpha & \alpha & -\alpha \\
\beta & -\delta & -\beta & \delta \\
\delta & \beta & -\delta & -\beta
\end{array}\right]\left[\begin{array}{l}
x_{0} \\
x_{2} \\
x_{3} \\
x_{1}
\end{array}\right]
$$

where $\alpha=\frac{1}{\sqrt{2}}, \beta=\cos \left(\frac{\pi}{8}\right)$, and $\delta=\sin \left(\frac{\pi}{8}\right)$. Similarly, the 8-point DCT can be expressed as

$$
\left[\begin{array}{l}
z_{0}  \tag{4}\\
z_{4} \\
z_{2} \\
z_{6} \\
z_{1} \\
z_{5} \\
z_{3} \\
z_{7}
\end{array}\right]=\left[\begin{array}{rrrrrrrr}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
\alpha & -\alpha & \alpha & -\alpha & \alpha & -\alpha & \alpha & -\alpha \\
\beta & -\delta & -\beta & \delta & \beta & -\delta & -\beta & \delta \\
\delta & \beta & -\delta & -\beta & \delta & \beta & -\delta & -\beta \\
\lambda & \mu & -\nu & -\gamma & -\lambda & -\mu & \nu & \gamma \\
\mu & \nu & -\gamma & \lambda & -\mu & -\nu & \gamma & -\lambda \\
\gamma & -\lambda & \mu & \nu & -\gamma & \lambda & -\mu & -\nu \\
\nu & \gamma & \lambda & \mu & -\nu & -\gamma & -\lambda & -\mu
\end{array}\right]\left[\begin{array}{l}
x_{0} \\
x_{2} \\
x_{4} \\
x_{6} \\
x_{7} \\
x_{5} \\
x_{3} \\
x_{1}
\end{array}\right]
$$

where $\lambda=\cos \left(\frac{\pi}{16}\right), \gamma=\cos \left(\frac{3 \pi}{16}\right), \mu=\sin \left(\frac{3 \pi}{16}\right)$, and $\nu=\sin \left(\frac{\pi}{16}\right)$. Note that the input is no longer in natural order but has been rearranged according to the permutation matrix $P$ and the relation

$$
\begin{equation*}
\tilde{x}=P x \tag{5}
\end{equation*}
$$

where

$$
P=\left[\begin{array}{llllllll}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0
\end{array}\right]
$$
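
Read row by row, $P$ simply selects the input samples in the order 0, 2, 4, 6, 7, 5, 3, 1. A minimal C sketch of this reordering follows; the names `dct8_perm` and `dct8_permute` are hypothetical and are not taken from the appendix routines.

```c
/* Row i of P has its single 1 in column dct8_perm[i]. */
static const int dct8_perm[8] = { 0, 2, 4, 6, 7, 5, 3, 1 };

/* Compute xt = P x for an 8-point input vector. */
void dct8_permute(const float x[8], float xt[8])
{
    for (int i = 0; i < 8; ++i)
        xt[i] = x[dct8_perm[i]];
}
```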

Upon examination, the matrix $\hat{T}(N)$ in (4), which is the matrix $T(N)$ with the rows and columns rearranged, can be described more compactly as

$$
\hat{T}(N)=\left[\begin{array}{rr}
\hat{T}\left(\frac{N}{2}\right) & \hat{T}\left(\frac{N}{2}\right)  \tag{6}\\
\hat{D}\left(\frac{N}{2}\right) & -\hat{D}\left(\frac{N}{2}\right)
\end{array}\right]
$$

since each half of the upper four rows of the 8-point DCT matrix is exactly the 4-point DCT matrix previously generated. Using the results obtained in [3], the relationship between $\hat{D}\left(\frac{N}{2}\right)$ and $\hat{T}\left(\frac{N}{2}\right)$ is given as

$$
\begin{equation*}
\hat{D}\left(\frac{N}{2}\right)=K \hat{T}\left(\frac{N}{2}\right) Q \tag{7}
\end{equation*}
$$

where

$$
K=R L R^{t}
$$

$R$ being the matrix that performs a bit reversal on the input data; $L$ is the lower triangular matrix

$$
L=\left[\begin{array}{rrrrrrrr}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
-1 & 2 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & -2 & 2 & 0 & 0 & 0 & 0 & 0 \\
-1 & 2 & -2 & 2 & 0 & 0 & 0 & 0 \\
1 & -2 & 2 & -2 & 2 & 0 & 0 & 0 \\
-1 & 2 & -2 & 2 & -2 & 2 & 0 & 0 \\
1 & -2 & 2 & -2 & 2 & -2 & 2 & 0 \\
-1 & 2 & -2 & 2 & -2 & 2 & -2 & 2
\end{array}\right]
$$

and $Q=\operatorname{diag}\left[\cos \left(\left(n+\frac{1}{4}\right) \frac{2 \pi}{N}\right)\right]$ for $n=0,1, \ldots, 7$. The output vector $\mathbf{z}$ is now in bit-reversed order. Signal flow graphs for 2-point, 4-point, and 8-point DCTs are shown in Figure 1, with the multipliers defined as in (4).

(a) 2-Point

(b) 4-Point

(c) 8-Point

Figure 1. Signal Flow Graphs for 2-Point, 4-Point, and 8-Point DCTs

The structure of the algorithm looks very much like that of a Fast Fourier Transform (FFT), since the most fundamental computation is a 2-point butterfly. This routine is actually a generalized case of the Cooley-Tukey FFT algorithm with the addition of the recursion at the end. If the equations for the signal flow graph are written explicitly, the recursive nature of the DCT becomes clear; for a 4-point DCT, we have

$$
\begin{aligned}
\hat{z}_{0} & =z_{0}, \\
\hat{z}_{2} & =z_{2}, \\
\hat{z}_{1} & =z_{1}, \\
\hat{z}_{3} & =2 z_{3}-\hat{z}_{1},
\end{aligned}
$$

and for the 8-point DCT,

$$
\begin{aligned}
& \hat{z}_{0}=z_{0}, \\
& \hat{z}_{4}=z_{4}, \\
& \hat{z}_{2}=z_{2}, \\
& \hat{z}_{6}=z_{6}, \\
& \hat{z}_{1}=z_{1}, \\
& \hat{z}_{3}=2 z_{3}-\hat{z}_{1}, \\
& \hat{z}_{5}=2 z_{5}-\hat{z}_{3}, \\
& \hat{z}_{7}=2 z_{7}-\hat{z}_{5} .
\end{aligned}
$$
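
The 8-point recursion above touches only the odd-indexed outputs, and each step reuses the value corrected in the previous step. A minimal C sketch, assuming the vector is indexed by the natural output index $k$ and using the hypothetical name `dct8_output_recursion`:

```c
/* Apply the output recursion in place: z[0], z[2], z[4], z[6], and z[1]
 * pass through unchanged; z[3], z[5], z[7] are corrected in sequence. */
void dct8_output_recursion(float z[8])
{
    z[3] = 2.0f * z[3] - z[1];   /* uses z[1] unchanged          */
    z[5] = 2.0f * z[5] - z[3];   /* uses the already updated z[3] */
    z[7] = 2.0f * z[7] - z[5];   /* uses the already updated z[5] */
}
```

The order of the three statements matters, since each line depends on the result of the one before it.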

To create a unitary transform, each element in the vector should be multiplied by the scaling factor $\sqrt{\frac{2}{N}}$ for both the forward and inverse transforms. The inverse transform is obtained by completely reversing the direction of the signal flow graph; i.e., performing the bit-reversal first, then the recursions and the butterflies, and finally, the data permutation.

For the two-dimensional case of interest, the DCT can be described in the form

$$
z(k, l)=\frac{2}{N} \alpha(k) \alpha(l) \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} x(m, n) \cos \left(\frac{\pi(2 m+1) k}{2 N}\right) \cos \left(\frac{\pi(2 n+1) l}{2 N}\right)
$$

and its inverse as

$$
x(m, n)=\frac{2}{N} \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} \alpha(k) \alpha(l) z(k, l) \cos \left(\frac{\pi(2 m+1) k}{2 N}\right) \cos \left(\frac{\pi(2 n+1) l}{2 N}\right)
$$

where $\alpha(k)=\frac{1}{\sqrt{2}}$ for $k=0$, unity otherwise. Like the FFT, the DCT kernel is separable, allowing the transform to be performed in two steps, first along the rows and then along the columns.

## Implementation on the TMS320C25

The DCT algorithm may be carried out in one of two ways, either using

1. A matrix formulation, where the DCT coefficients are simply multiplied by the data, or
2. The signal flow graph.

This routine uses a matrix formulation, which requires the sixty-four cosine coefficients to be stored in an array in memory. The matrix formulation is based on the following equation:
$$
\left[\begin{array}{l}
z_{0} \\
z_{1} \\
z_{2} \\
z_{3} \\
z_{4} \\
z_{5} \\
z_{6} \\
z_{7}
\end{array}\right]=\left[\begin{array}{rrrrrrrr}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
\lambda & \gamma & \mu & \nu & -\nu & -\mu & -\gamma & -\lambda \\
\beta & \delta & -\delta & -\beta & -\beta & -\delta & \delta & \beta \\
\gamma & -\nu & -\lambda & -\mu & \mu & \lambda & \nu & -\gamma \\
\alpha & -\alpha & -\alpha & \alpha & \alpha & -\alpha & -\alpha & \alpha \\
\mu & -\lambda & \nu & \gamma & -\gamma & -\nu & \lambda & -\mu \\
\delta & -\beta & \beta & -\delta & -\delta & \beta & -\beta & \delta \\
\nu & -\mu & \gamma & -\lambda & \lambda & -\gamma & \mu & -\nu
\end{array}\right]\left[\begin{array}{l}
x_{0} \\
x_{1} \\
x_{2} \\
x_{3} \\
x_{4} \\
x_{5} \\
x_{6} \\
x_{7}
\end{array}\right]
$$

where $\lambda=\cos \left(\frac{\pi}{16}\right), \gamma=\cos \left(\frac{3 \pi}{16}\right), \mu=\sin \left(\frac{3 \pi}{16}\right)$, and $\nu=\sin \left(\frac{\pi}{16}\right)$.
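
The matrix formulation amounts to eight dot products against a fixed coefficient table. The following floating-point sketch illustrates it in C; it is not the fixed-point TMS320C25 routine, it leaves out the $\sqrt{2/N}$ scale factor, and the names `K_ALPHA` through `K_NU`, `T8`, and `dct8_matrix` are hypothetical.

```c
/* Coefficient values for the symbols used in the matrix. */
#define K_ALPHA  0.70710678118654752   /* 1/sqrt(2)    */
#define K_BETA   0.92387953251128676   /* cos(pi/8)    */
#define K_DELTA  0.38268343236508977   /* sin(pi/8)    */
#define K_LAMBDA 0.98078528040323045   /* cos(pi/16)   */
#define K_GAMMA  0.83146961230254524   /* cos(3*pi/16) */
#define K_MU     0.55557023301960222   /* sin(3*pi/16) */
#define K_NU     0.19509032201612827   /* sin(pi/16)   */

/* The 8 x 8 matrix, natural input and output order. */
static const double T8[8][8] = {
    { 1, 1, 1, 1, 1, 1, 1, 1 },
    { K_LAMBDA,  K_GAMMA,   K_MU,     K_NU,    -K_NU,    -K_MU,   -K_GAMMA, -K_LAMBDA },
    { K_BETA,    K_DELTA,  -K_DELTA, -K_BETA,  -K_BETA,  -K_DELTA, K_DELTA,  K_BETA   },
    { K_GAMMA,  -K_NU,     -K_LAMBDA,-K_MU,     K_MU,     K_LAMBDA,K_NU,    -K_GAMMA  },
    { K_ALPHA,  -K_ALPHA,  -K_ALPHA,  K_ALPHA,  K_ALPHA, -K_ALPHA,-K_ALPHA,  K_ALPHA  },
    { K_MU,     -K_LAMBDA,  K_NU,     K_GAMMA, -K_GAMMA, -K_NU,    K_LAMBDA,-K_MU     },
    { K_DELTA,  -K_BETA,    K_BETA,  -K_DELTA, -K_DELTA,  K_BETA, -K_BETA,   K_DELTA  },
    { K_NU,     -K_MU,      K_GAMMA, -K_LAMBDA, K_LAMBDA,-K_GAMMA, K_MU,    -K_NU     }
};

/* z = T8 * x: one multiply-accumulate loop per output coefficient. */
void dct8_matrix(const double x[8], double z[8])
{
    for (int k = 0; k < 8; ++k) {
        double acc = 0.0;
        for (int n = 0; n < 8; ++n)
            acc += T8[k][n] * x[n];
        z[k] = acc;
    }
}
```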
The algorithm described above has been shown to be numerically stable for fixed-point processors; however, to prevent serious data errors, truncation and roundoff must be accounted for. A roundoff technique similar to the one in [6] is used: the matrix coefficients are prescaled by $(2^{15}-1)$, and each product is loaded into the accumulator with a one-bit left shift, effectively dividing it by $2^{15}$. After a multiplication is performed, the 32-bit value in the accumulator must be rounded to sixteen bits, where bits 13, 14, and 15 are used to determine the value of the sixteenth bit. The TMS320C25 performs this operation in a single instruction by adding 3000h to the accumulator product with a one-bit left shift, as outlined in the code shown in Figure 2.

```
* INITIALIZE MATRIX COEFFICIENTS AND ROUNDOFF VALUES INTO INTERNAL BLOCK 0
DCTINI  LDPK    RNDOFF
        RSXM                    ; SIGN-EXTENSION MODE
        SPM     1
        LRLK    AR1,COEFF       ; COEFFICIENTS
        RPTK    EDATA-IDATA
        BLKP    IDATA,*+
        LRLK    AR1,RNDOFF      ; VARIABLES
        RPTK    10
        BLKP    EDATA,*+
* SECOND SET OF COEFFICIENTS
        LAR     AR1,DST         ; AR1 IS NOW DESTINATION POINTER
        MAR     *+,AR2          ; WORK ON SECOND COLUMN
        LAR     AR2,SRC
        LARK    AR3,7
        LT      *+,AR2
        MPY     C10
T2      ZAC
        RPTK    6
        MAC     C11,*+
        LTA     *+,AR1
        MPY     C10
        ADD     RNDOFF
        SACH    *0+,AR3
        BANZ    T2,*-,AR2
```

Figure 2. TMS320C25 Code for Roundoff Routine

After the multiplications are computed, the results are stored in another array area in transposed order; thus, a separate routine for transposing the matrix is not needed. Once the rows are transformed, the pointers for the input and output matrices are exchanged. When the procedure is repeated, the output is stored as rows, completing the transform. Appendix A contains a complete program listing for the forward transform on the TMS320C25. To perform an inverse DCT, the table of cosine coefficients should be replaced with those used for an inverse transform.
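
The two-pass scheme described above (transform each row, store the result transposed, then repeat with the roles of the arrays exchanged) can be sketched in C as follows. This is a floating-point illustration under stated assumptions, not the TMS320C25 program of Appendix A: the scale factor is omitted, the cosine values are looked up from a quarter-wave table instead of a stored 64-entry matrix, and the names `cos16`, `dct8_rows_transposed`, and `dct8x8` are hypothetical.

```c
/* cos(m * pi / 16) for m = 0..8. */
static const double quarter[9] = {
    1.0,
    0.98078528040323045, 0.92387953251128676, 0.83146961230254524,
    0.70710678118654752, 0.55557023301960222, 0.38268343236508977,
    0.19509032201612827, 0.0
};

/* cos(m * pi / 16) for any m >= 0, using the symmetries of the cosine. */
static double cos16(int m)
{
    m %= 32;                       /* period 2*pi                      */
    if (m > 16) m = 32 - m;        /* cos(2*pi - t) = cos(t)           */
    return (m <= 8) ? quarter[m] : -quarter[16 - m];   /* cos(pi - t) */
}

/* Transform every row of in[][] and store the result transposed. */
static void dct8_rows_transposed(double in[8][8], double out[8][8])
{
    for (int r = 0; r < 8; ++r)
        for (int k = 0; k < 8; ++k) {
            double acc = 0.0;
            for (int n = 0; n < 8; ++n)
                acc += in[r][n] * cos16((2 * n + 1) * k);
            out[k][r] = acc;       /* transposed store */
        }
}

/* Unscaled 8 x 8 DCT: two identical passes with the buffers exchanged.
 * The second transposed store restores the original orientation. */
void dct8x8(double x[8][8], double z[8][8])
{
    double tmp[8][8];
    dct8_rows_transposed(x, tmp);
    dct8_rows_transposed(tmp, z);
}
```

Because each pass writes its output transposed, the second pass operates on what were originally the columns, and no separate transpose step is needed.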

## Implementation on the TMS320C30

The TMS320C30's increased speed and flexible addressing modes can reduce execution time substantially. In using the FFT-like structure, extraneous multiplications are removed, and because of the TMS320C30's ability to perform parallel multiplication/additions, two butterflies can be computed at once. After an initial subtraction is done, the coefficient multiplication can be executed in parallel with the addition of the data. The TMS320C30's floating-point capability eliminates not only the problems of roundoff error associated with fixed point processors but also the need for any truncation routines.

Because the DCT size is fixed to eight points, there are only four locations that need exchanging; this allows for a fast bit-reversal of the data. When using the TMS320C30's extended-precision registers for temporary storage, the transfers can be done in-place. These data transfers are also done in parallel, since two load or store operations can be performed simultaneously. The code for performing the bit reversal is shown in Figure 3 below.

* CORRECT ORDER FROM BIT REVERSED TO NATURAL

| Label | Opcode | Operands |
| :--- | :--- | :--- |
| BITREV | LDF | *AR0,R0 |
| $\\|$ | LDF | *-AR2,R1 |
|  | STF | R1,*AR0 |
| $\\|$ | STF | R0,*-AR2 |
|  | LDF | *AR1,R0 |
| $\\|$ | LDF | *-AR3,R1 |
|  | STF | R1,*AR1 |
| $\\|$ | STF | R0,*-AR3 |

Figure 3. TMS320C30 Code for Bit Reversal
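
For eight points, 3-bit index reversal leaves indices 0, 2, 5, and 7 in place and exchanges 1 with 4 and 3 with 6, which is why only four locations move. A minimal C sketch of the same two exchanges, using the hypothetical name `bitrev8`:

```c
/* In-place 3-bit bit reversal of an 8-element vector:
 * 1 (001) <-> 4 (100) and 3 (011) <-> 6 (110); all other indices
 * are their own reversal. */
void bitrev8(float z[8])
{
    float t;

    t = z[1]; z[1] = z[4]; z[4] = t;
    t = z[3]; z[3] = z[6]; z[6] = t;
}
```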

Because of the amount of data shuffling that occurs, an eight-word scratch-pad vector has been created with four permanent pointers set up at every other memory location. This allows access to each element in the vector (by predecrement or preincrement addressing) without requiring constant alteration of one or two pointer locations. Although there is no overhead for looping on the TMS320C30, straight-line coding is used as much as possible to increase performance.

You can transpose the DCT matrix in the same way as in the TMS320C25 implementation: namely, store the transformed row vector as a column vector in another matrix and interchange the input and output pointers.

The complete routines for the forward and inverse transforms are given in Appendix B.

## Results

The execution times and memory requirements for the two routines are given in Table 1. For the TMS320C30 implementation, the forward transform contains the scale factor of $\frac{2}{N}$, so the transform is not unitary. When the signal flow is reversed, instructions accumulate and the time required to perform the inverse transform actually increases (see Table 1). This increase occurs because certain multiplications cannot be performed in parallel with another instruction. The two times are identical on a TMS320C25 because it uses a matrix routine to compute the transform.

Table 1. Execution Times and Memory Requirements

| Device | Memory Required |  | Time Required |
| :---: | :--- | :---: | :---: |
|  |  |  |  |

## Summary

Two routines for a two-dimensional Discrete Cosine Transform are presented: one for the TMS320C25 and one for the TMS320C30, with a development of the algorithm given for clarification. This report also discusses the similarities of the DCT to the Cooley-Tukey FFT algorithm and the arithmetic shortcuts that can reduce the DCT's execution time. Although these implementations use the most recent formulation, there is still room for investigation into more efficient methods. Another approach that might prove fruitful is to deal with the entire $8 \times 8$ array all at once, as suggested by Haque [7], rather than transforming the array by rows and columns. However, both routines given in the appendices provide fast, numerically stable solutions for applications requiring the DCT.

## Acknowledgements

The author thanks Steve Ford for supplying the original code for the TMS320C25 implementation. Francois Charlot helped in modifying the code for the TMS320C25, as well as in preparing this manuscript. Daniel Chen improved the performance of the code for both the TMS320C25 and the TMS320C30.

## References

[1] Ahmed, N., Natarajan, T., and Rao, K.R. "Discrete Cosine Transform," IEEE Transactions on Computing, vol. C-23, pp. 90-93, January 1974.
[2] Perkins, M. "A Comparison of the Hartley, Cas-Cas, Fourier, and Discrete Cosine Transforms for Image Coding," IEEE Transactions on Computing, vol. 36, pp. 758-760, June 1988.
[3] Hou, H.S. "A Fast Recursive Algorithm for Computing the Discrete Cosine Transform," IEEE Transactions on ASSP, vol. ASSP-35, No. 10, pp. 1455-1461, October 1987.
[4] Lee, B.G. "FCT - A Fast Cosine Transform," Proceedings of 1984 Conference on ASSP, pp. 28.A.3.1-28.A.3.4, March 1984.
[5] Jayant, N.S., and Noll, P. Digital Coding of Waveforms, New York, Prentice-Hall, 1984.
[6] Srinivasan, S., Jain, A.K., and Chin, T.M. "Cosine Transform Block Codec for Images Using the TMS32010," Proceedings of IEEE ISCAS '86, Cat. No. 86CH2255-8, vol. 1, pp. 299-302.
[7] Haque, M.A. "A Two-Dimensional Fast Cosine Transform," IEEE Transactions on ASSP, vol. ASSP-33, pp. 1532-1539, December 1985.



EIGHTH SET OF COEFFICIENTS

| RPTK | 6 |
| :--- | :--- |
| MAC | C61,*+ |
| LTA | *+,AR2 |
| MPY | C_60 |
| ADD | RNDOFF |
| SACH | *0+,AR3 |
| BANZ | T7,*-,AR1 |


| RPTK | 6 |
| :--- | :--- |
| MAC | C71,*+ |
| LTA | *+,AR1 |
| MPY | C_70 |
| ADD | RNDOFF |
| SACH | *0+,AR3 |
| BANZ | T8,*-,AR2 |


|  | LAC | LIST | ; CHANGE SOURCE AND DESTINATION POINTERS, |
| :--- | :--- | :--- | :--- |
|  | DMOV | SRC | ; SO RESULT OF FIRST PASS BECOMES OPERAND |

- datas - tables and declarations

| Label | Directive | Operand | Comment |
| :--- | :--- | :--- | :--- |
|  | .asect | "RCOEF",0FF00h | ; THIS IS TO SET UP THE LABELS FOR A CNFP |
|  | .label | IDATA | ; DCT COEFFICIENTS |
| C00 | .word | 5792 | ; FIRST ROW OF COEFFICIENTS |
| C01 | .word | 5792 | ; 5792 = (1/4)*2**(-1/2) IN Q15 FORMAT |
| C02 | .word | 5792 |  |
| C03 | .word | 5792 |  |
| C04 | .word | 5792 |  |
| C05 | .word | 5792 |  |
| C06 | .word | 5792 |  |
| C07 | .word | 5792 |  |
| C10 | .word | 8034 | ; SECOND ROW OF COEFFICIENTS |
| C11 | .word | 6811 |  |
| C12 | .word | 4551 |  |
| C13 | .word | 1598 |  |
| C14 | .word | -1598 | ; 1598 = (1/4)*SIN(PI/16) IN Q15 FORMAT |
| C15 | .word | -4551 | ; 4551 = (1/4)*SIN(3PI/16) IN Q15 FORMAT |
| C16 | .word | -6811 | ; 6811 = (1/4)*COS(3PI/16) IN Q15 FORMAT |
| C17 | .word | -8034 | ; 8034 = (1/4)*COS(PI/16) IN Q15 FORMAT |
| C20 | .word | 7568 | ; THIRD ROW OF COEFFICIENTS |
| C21 | .word | 3134 | ; 3134 = (1/4)*SIN(PI/8) IN Q15 FORMAT |
| C22 | .word | -3134 | ; 7568 = (1/4)*COS(PI/8) IN Q15 FORMAT |
| C23 | .word | -7568 |  |
| C24 | .word | -7568 |  |
| C25 | .word | -3134 |  |
| C26 | .word | 3134 |  |
| C27 | .word | 7568 |  |
| C30 | .word | 6811 | ; FOURTH ROW OF COEFFICIENTS |
| C31 | .word | -1598 |  |
| C32 | .word | -8034 |  |
| C33 | .word | -4551 |  |
| C34 | .word | 4551 |  |
| C35 | .word | 8034 |  |
| C36 | .word | 1598 |  |
| C37 | .word | -6811 |  |
| C40 | .word | 5792 | ; FIFTH ROW OF COEFFICIENTS |
| C41 | .word | -5792 |  |
| C42 | .word | -5792 |  |
| C43 | .word | 5792 |  |
| C44 | .word | 5792 |  |
| C45 | .word | -5792 |  |
| C46 | .word | -5792 |  |
| C47 | .word | 5792 |  |
| C50 | .word | 4551 | ; SIXTH ROW OF COEFFICIENTS |
| C51 | .word | -8034 |  |
| C52 | .word | 1598 |  |
| C53 | .word | 6811 |  |
| C54 | .word | -6811 |  |
| C55 | .word | -1598 |  |
| C56 | .word | 8034 |  |
| C57 | .word | -4551 |  |
| C60 | .word | 3134 | ; SEVENTH ROW OF COEFFICIENTS |
| C61 | .word | -7568 |  |
| C62 | .word | 7568 |  |
| C63 | .word | -3134 |  |
| C64 | .word | -3134 |  |
| C65 | .word | 7568 |  |
| C66 | .word | -7568 |  |
| C67 | .word | 3134 |  |
| C70 | .word | 1598 | ; EIGHTH ROW OF COEFFICIENTS |
| C71 | .word | -4551 |  |
| C72 | .word | 6811 |  |
| C73 | .word | -8034 |  |
| C74 | .word | 8034 |  |
| C75 | .word | -6811 |  |
| C76 | .word | 4551 |  |
| C77 | .word | -1598 |  |
|  | .label | EDATA | ; END OF COEFFICIENTS TABLE |

| Label | Directive | Operand | Comment |
| :--- | :--- | :--- | :--- |
|  | .word | 12288 | ; ROUNDOFF FACTOR |
|  | .word | PICT | ; ADDRESS OF PICTURE |
|  | .word | RESULT | ; ADDRESS OF RESULT |
|  | .word | 5792 | ; C00 COEFFICIENT |
|  | .word | 8034 | ; C10 COEFFICIENT |
|  | .word | 7568 | ; C20 COEFFICIENT |
|  | .word | 6811 | ; C30 COEFFICIENT |
|  | .word | 5792 | ; C40 COEFFICIENT |
|  | .word | 4551 | ; C50 COEFFICIENT |
|  | .word | 3134 | ; C60 COEFFICIENT |
|  | .word | 1598 | ; C70 COEFFICIENT |
|  | DEFINIT |  |  |
| COEFF | .usect | "COEFFS",64 | ; DCT COEFFICIENTS (GOES INTO B0) |
|  | .BSS | PICT,64 | ; PICTURE |
|  | .BSS | RESULT,64 | ; RESULT, AFTER DCT |
|  | .BSS | RNDOFF,1 | ; ROUNDOFF FACTOR |
|  | .BSS | SRC,1 | ; SOURCE ADDRESS FOR CURRENT DCT LOOP |
|  | .BSS | DST,1 | ; DESTINATION ADDRESS |
|  | .BSS | C_00,1 | ; C00 COEFFICIENT |
|  | .BSS | C_10,1 | ; C10 COEFFICIENT |
|  | .BSS | C_20,1 | ; C20 COEFFICIENT |
|  | .BSS | C_30,1 | ; C30 COEFFICIENT |
|  | .BSS | C_40,1 | ; C40 COEFFICIENT |
|  | .BSS | C_50,1 | ; C50 COEFFICIENT |
|  | .BSS | C_60,1 | ; C60 COEFFICIENT |
|  | .BSS | C_70,1 | ; C70 COEFFICIENT |
| * |  |  |  |
|  | .end |  |  |
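The Q15 values in the coefficient table can be checked with a short C sketch. It assumes, based only on the listed values, that each coefficient was scaled by $2^{15}$ and truncated toward zero rather than rounded: rounding $(1/4)\sin(\pi/8)$ would give 3135, whereas the table lists 3134. The function name `q15_trunc` is hypothetical, and the trigonometric constants are the well-known values of $\cos(\pi/16)$, $\sin(\pi/8)$, and so on.

```c
/* Convert a fraction in [-1, 1) to Q15 by scaling with 2^15 and
 * truncating toward zero (assumed convention; see the lead-in). */
static int q15_trunc(double value)
{
    return (int)(value * 32768.0);
}
```

For example, `q15_trunc(0.25 * 0.707106781188)` reproduces the first-row entry, and `q15_trunc(0.25 * 0.980785280403)` reproduces C10.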


| Label | Instruction | Operands | Comment |
| :--- | :--- | :--- | :--- |
| * 2-D DISCRETE COSINE TRANSFORM, (8x8) VERSION 1.0 |  |  |  |
| * AUTHOR: WILLIAM HOHL |  |  |  |
| * |  |  |  |
| * TRANSACTIONS ON ASSP, VOL. ASSP-35, NO. 10, OCTOBER 1987, PP. 1455-1461). |  |  |  |
| * INPUT MATRIX IS STORED IN RAM, AND THE RESULTS ARE STORED IN THE SAME |  |  |  |
| * LOCATION. |  |  |  |
| * |  |  |  |
| ************************************************************************** |  |  |  |
|  | .BSS | OUT,64 |  |
|  | .BSS | INP,64 |  |
|  | .BSS | SCR,8 | ; SCRATCHPAD MEMORY |
|  | .global | COSTAB |  |
|  | .global | START |  |
|  | .data |  |  |
| _cos | .word | costab |  |
| INPUT | .word | INP |  |
| OUTPUT | .word | OUT |  |
| SCRATCH | .word | SCR |  |
| SCRLAST | .word | SCR+7 |  |
| RTN1 | .word | TRANS1 |  |
| RTN2 | .word | TRANS2 |  |
|  | .text |  |  |
| * |  |  |  |
| START | LDI | 7,RC |  |
|  | LDI | 2,IR0 |  |
|  | LDI | 8,IR1 |  |
|  | LDI | 8,BK | ; SET BUFFER LENGTH=8 |
|  | LDP | @SCRATCH |  |
|  | LDI | @SCRATCH,AR4 |  |
|  | LDI | @OUTPUT,AR6 | ; VARIABLE LOCATIONS |
|  | LDI | @INPUT,AR5 | ; HOLDS INPUT MATRIX |
|  | LDF | 0.25,R6 | ; CONSTANT 0.25 |
|  | LDF | 2.0,R7 | ; CONSTANT 2.0 |
| * |  |  |  |
|  | LDI | @RTN1,R4 | ; RETURN ADDRESS OF SUBROUTINE |
|  | RPTB | BLK1 |  |
|  | BRD | DCT |  |
|  | LDI | AR5,AR0 | ; POINTS TO INPUT |
|  | LDI | AR5,AR1 |  |
|  | ADDI | 1,AR1 |  |


| Label | Instruction | Operands | Comment |
| :--- | :--- | :--- | :--- |
| TRANS1: | LDF | *AR4++(1)%,R1 | ; TRANSPOSE THE ROWS |
|  | STF | R1,*AR6++(IR1) | ; INTO COLUMNS |
| \|\| | LDF | *AR4++(1)%,R1 |  |
|  | STF | R1,*AR6++(IR1) |  |
| \|\| | LDF | *AR4++(1)%,R1 |  |
|  | STF | R1,*AR6++(IR1) |  |
| \|\| | LDF | *AR4++(1)%,R1 |  |
|  | STF | R1,*AR6++(IR1) |  |
| \|\| | LDF | *AR4++(1)%,R1 |  |
|  | STF | R1,*AR6++(IR1) |  |
| \|\| | LDF | *AR4++(1)%,R1 |  |
|  | STF | R1,*AR6++(IR1) |  |
| \|\| | LDF | *AR4++(1)%,R1 |  |
|  | STF | R1,*AR6++(IR1) |  |
| \|\| | LDF | *AR4++(1)%,R1 |  |
|  | STF | R1,*AR6++(IR1) |  |
| \|\| | LDF | *AR5++(IR1),R5 |  |
| * |  |  |  |
| BLK1 | SUBI | 63,AR6 |  |
| * |  |  |  |
|  | LDI | @SCRATCH,AR4 |  |
|  | LDI | @OUTPUT,AR5 | ; DO DCT ON COLUMN |
|  | LDI | @INPUT,AR6 | ; VECTORS |
|  | LDI | 7,RC |  |
| * |  |  |  |
|  | LDI | @RTN2,R4 | ; RETURN ADDRESS OF SUBROUTINE |
|  | RPTB | BLK3 |  |
|  | BRD | DCT |  |
|  | LDI | AR5,AR0 | ; POINTS TO INPUT |
|  | LDI | AR5,AR1 |  |
|  | ADDI | 1,AR1 |  |
| * |  |  |  |
| TRANS2: | LDF | *AR4++(1)%,R1 |  |
|  | STF | R1,*AR6++(IR1) |  |
| \|\| | LDF | *AR4++(1)%,R1 |  |
|  | STF | R1,*AR6++(IR1) |  |
| \|\| | LDF | *AR4++(1)%,R1 |  |
|  | STF | R1,*AR6++(IR1) |  |
| \|\| | LDF | *AR4++(1)%,R1 |  |
|  | STF | R1,*AR6++(IR1) |  |
| \|\| | LDF | *AR4++(1)%,R1 |  |
|  | STF | R1,*AR6++(IR1) |  |
| \|\| | LDF | *AR4++(1)%,R1 |  |
|  | STF | R1,*AR6++(IR1) |  |
| \|\| | LDF | *AR4++(1)%,R1 |  |
|  | STF | R1,*AR6++(IR1) |  |
| \|\| | LDF | *AR4++(1)%,R1 |  |
|  | STF | R1,*AR6++(IR1) |  |
| \|\| | LDF | *AR5++(IR1),R5 |  |
| * |  |  |  |
| BLK3 | SUBI | 63,AR6 | ; INCREMENT POINTERS |
| END | BR | END | ; END |
| * |  |  |  |
| * |  |  |  |



| Label | Instruction | Operands | Comment |
| :--- | :--- | :--- | :--- |
| \|\| | STF | R1,*AR2 |  |
|  | STF | R2,*-AR2 |  |
|  | STF | R3,*-AR3 |  |
| \|\| | STF | R0,*AR3 |  |
| * CORRECT ORDER FROM BIT-REVERSED TO NATURAL |  |  |  |
| * |  |  |  |
| BITREV | LDF | *AR0,R0 | ; ONLY TWO LOCATIONS ARE ACTUALLY SWITCHED |
| \|\| | LDF | *-AR2,R1 |  |
|  | STF | R1,*AR0 |  |
| \|\| | STF | R0,*-AR2 |  |
|  | LDF | *AR1,R0 |  |
| \|\| | LDF | *-AR3,R1 |  |
|  | STF | R1,*AR1 |  |
| \|\| | STF | R0,*-AR3 |  |
| * CONTINUE WITH RECURSIVE ALGORITHM |  |  |  |
| * |  |  |  |
| RECURSE | MPYF3 | R7,*-AR3,R2 |  |
|  | MPYF3 | R7,*AR3,R1 |  |
| \|\| | SUBF3 | *-AR1,R2,R2 | ; 2X(7)-X(3) |
|  | SUBF3 | *AR1,R1,R1 | ; 2X(8)-X(4) |
|  | STF | R1,*AR3 |  |
| \|\| | STF | R2,*-AR3 |  |
| * |  |  |  |
| LASTLOOP | MPYF3 | R7,*AR1,R0 | ; X(4)=2*X(4) |
|  | MPYF3 | R7,*AR2,R1 | ; X(6)=2*X(6) |
| \|\| | SUBF3 | *AR0,R0,R2 | ; R2=2X(4)-X(2) |
|  | MPYF3 | R7,*AR3,R3 | ; R3=2*X(8) |
| \|\| | STF | R2,*AR1 |  |
|  | SUBF3 | *AR1,R1,R1 | ; R1=2X(6)-X(4) |
|  | SUBF3 | R1,R3,R3 | ; R3=2X(8)-X(6) |
|  | STF | R1,*AR2 |  |
| \|\| | STF | R3,*AR3 |  |
| * |  |  |  |
| * SCALE FACTOR OF (2/N) = 0.25 |  |  |  |
| * |  |  |  |
|  | STF | R, AR |  |
|  | STF | R0,*AR3--(1) |  |
| \|\| | MPYF3 | R6,*-AR3,R1 |  |
|  | STF | R1,*AR3--(1) |  |
| \|\| | MPYF3 | R6,*-AR3,R0 |  |
|  | STF | R0,*AR3--(1) |  |
| \|\| | MPYF3 | R6,*-AR3,R1 |  |
|  | STF | R1,*AR3--(1) |  |
| \|\| | MPYF3 | R6,*-AR3,R0 |  |
|  | STF | R0,*AR3--(1) | ; OK TO MOVE AR3 |
| \|\| | MPYF3 | R6,*-AR3,R1 |  |
|  | STF | R1,*AR3--(1) |  |
| \|\| | MPYF3 | R6,*-AR3,R0 |  |
|  | STF | R0,*AR3--(1) |  |
| \|\| | MPYF3 | R6,*-AR3,R1 |  |
| * |  |  |  |


| Label | Instruction | Operands | Comment |
| :--- | :--- | :--- | :--- |
| EXIT | BUD | R4 | ; RETURN |
|  | LDF | *-AR0,R0 |  |
|  | MPYF3 | *AR7,R0,R0 | ; MULT BY 1/SQRT(2) |
|  | STF | R0,*-AR0 | ; STORE THE RESULT |
|  | .end |  |  |
| * |  |  |  |
| COSTAB | .float | 0.980785280403 | ; LAMBDA |
|  | .float | 0.555570233019 | ; |
|  | .float | -0.195090322016 | ; |
|  | .float | -0.831469612303 | ; -GAMMA |
|  | .float | 0.923879532511 | ; BETA |
|  | .float | -0.382683432365 | ; -DELTA |
|  | .float | 0.707106781188 | ; ALPHA |



| Label | Instruction | Operands | Comment |
| :--- | :--- | :--- | :--- |
| IDCT | LDI | AR0,AR1 |  |
|  | ADDI | 2,AR1 |  |
|  | LDI | AR1,AR2 |  |
|  | ADDI | 2,AR2 |  |
|  | LDI | AR2,AR3 |  |
|  | ADDI | 2,AR3 |  |
| * |  |  |  |
|  | LDF | *-AR0,R0 |  |
|  | MPYF3 | *AR7,R0,R0 | ; MULT BY 1/SQRT(2) |
|  | STF | R0,*-AR0 | ; STORE THE RESULT |
| * |  |  |  |
| * BEGIN WITH RECURSION |  |  |  |
| * |  |  |  |
|  | SUBF3 | *AR3,*AR2,R2 | ; X(6)-X(8) |
|  | SUBF3 | R2,*AR1,R3 | ; X(4)-X(6) |
|  | MPYF3 | *AR3,R7,R0 | ; 2X(8)->R0 |
| \|\| | STF | R2,*AR2 |  |
|  | SUBF3 | R3,*AR0,R2 | ; X(2)-X(4) |
|  | MPYF3 | *AR2,R7,R1 | ; 2*X(6)->R1 |
| \|\| | STF | R3,*AR1 |  |
|  | STF | R0,*AR3 |  |
| \|\| | STF | R1,*AR2 |  |
|  | MPYF3 | *AR1,R7,R0 |  |
|  | STF | R0,*AR1 |  |
|  | STF | R2,*AR0 |  |
| SECLOOP | SUBF3 | *-AR3,*-AR1,R2 | ; X(3)-X(7) |
|  | SUBF3 | *AR3,*AR1,R3 | ; X(4)-X(8) |
|  | MPYF3 | R7,*-AR3,R0 | ; 2*X(7) |
| \|\| | STF | R2,*-AR1 |  |
|  | MPYF3 | R7,*AR3,R1 | ; 2*X(8) |
| \|\| | STF | R3,*AR1 |  |
|  | STF | R0,*-AR3 |  |
| \|\| | STF | R1,*AR3 |  |
| * |  |  |  |
| * CORRECT ORDER FROM NATURAL TO BIT-REVERSED |  |  |  |
| * |  |  |  |
| BITREV | LDF | *AR0,R0 | ; ONLY TWO LOCATIONS ARE ACTUALLY SWITCHED |
| \|\| | LDF | *-AR2,R1 |  |
|  | STF | R1,*AR0 |  |
| \|\| | STF | R0,*-AR2 |  |
|  | LDF | *AR1,R0 |  |
| \|\| | LDF | *-AR3,R1 |  |
|  | STF | R1,*AR1 |  |
| \|\| | STF | R0,*-AR3 |  |
| * |  |  |  |
| * FIRST SET OF BUTTERFLIES |  |  |  |
| * |  |  |  |
|  | LDF | *AR0,R0 |  |
| \|\| | LDF | *AR1,R1 |  |
|  | LDF | *AR2,R2 |  |
| \|\| | LDF | *AR3,R3 |  |
|  | MPYF3 | *AR7,R1,R1 | ; PERFORM THE ALPHA MULT'S |
|  | MPYF3 | *AR7,R0,R0 |  |
|  | MPYF3 | *AR7,R2,R2 |  |

| Label | Instruction | Operands | Comment |
| :--- | :--- | :--- | :--- |
|  | MPYF3 | *AR7++(1),R3,R3 | ; SKIP TO NEXT COEFF |
|  | STF | R1,*AR1 |  |
| \|\| | STF | R0,*AR0 |  |
|  | STF | R2,*AR2 |  |
| \|\| | STF | R3,*AR3 |  |
|  | LDF | *AR0,R2 | ; THESE SECTIONS PERFORM |
| \|\| | LDF | *AR1,R3 | ; TWO BUTTERFLIES AT ONCE |
|  | SUBF3 | *AR0,*-AR0,R0 |  |
|  | SUBF3 | *AR1,*-AR1,R1 |  |
|  | STF | R0,*AR0 |  |
|  | MPYF3 | R1,*+AR7,R1 | ; -DELTA |
|  | ADDF3 | R3,*-AR1,R0 |  |
|  | MPYF3 | R0,*AR7,R0 | ; BETA |
| \|\| | ADDF3 | R2,*-AR0,R2 |  |
|  | STF | R2,*-AR0 |  |
| \|\| | STF | R0,*-AR1 |  |
|  | STF | R1,*AR1 |  |
| * |  |  |  |
|  | LDF | *AR2,R2 |  |
| \|\| | LDF | *AR3,R3 |  |
|  | SUBF3 | *AR2,*-AR2,R0 |  |
|  | SUBF3 | *AR3,*-AR3,R1 |  |
|  | STF | R0,*AR2 |  |
|  | MPYF3 | R1,*+AR7,R1 | ; -DELTA ON NEXT GROUP |
|  | ADDF3 | R3,*-AR3,R0 |  |
|  | MPYF3 | R0,*AR7++(IR0),R0 | ; BETA ON NEXT GROUP |
| \|\| | ADDF3 | R2,*-AR2,R2 |  |
|  | STF | R2,*-AR2 |  |
| \|\| | STF | R0,*-AR3 |  |
|  | STF | R1,*AR3 |  |
| * |  |  |  |
| * No group f Butterflies |  |  |  |
| * |  |  |  |
|  | LDF | *-AR1,R2 | ; THIS IS THE SAME AS ABOVE, EXCEPT THE |
| \|\| | LDF | *AR1,R3 | ; POINTERS CHANGE |
|  | SUBF3 | *-AR1,*-AR0,R1 |  |
|  | SUBF3 | *AR1,*AR0,R0 |  |
|  | ADDF3 | R3,*AR0,R3 |  |
|  | ADDF3 | R2,*-AR0,R2 |  |
|  | STF | R1,*-AR1 |  |
| \|\| | STF | R2,*-AR0 |  |
|  | STF | R0,*AR1 |  |
| \|\| | STF | R3,*AR0 |  |
| * |  |  |  |
|  | LDF | *-AR3,R2 |  |
| \|\| | LDF | *AR3,R3 |  |
|  | SUBF3 | *-AR3,*-AR2,R1 |  |
|  | SUBF3 | *AR3,*AR2,R0 |  |
|  | MPYF3 | *AR7++(1),R1,R1 | ; -NU |
| \|\| | ADDF3 | R3,*AR2,R3 |  |
|  | MPYF3 | *AR7++(1),R0,R0 | ; -GAMMA |
| \|\| | ADDF3 | R2,*-AR2,R2 | ; |
|  | MPYF3 | R2,*AR7++(1),R2 | ; LAMBDA |
|  | MPYF3 | R3,*AR7,R3 | ; MU |



# An Implementation of Adaptive Filters with the TMS320C25 or the TMS320C30 

Sen Kuo<br>Northern Illinois University<br>Chein Chen<br>Digital Signal Processor Products-Semiconductor Group<br>Texas Instruments

## Introduction

A filter selects or controls the characteristics of the signal it produces by conditioning the incoming signal. The coefficients of the filter determine its characteristics and output a priori in many cases. Often, a specific output is desired, but the coefficients of the filter cannot be determined at the outset. An example is an echo canceller; the desired output cancels the echo signal (an output result of zero when there is no other input signal). In this case, the coefficients cannot be determined initially since they depend on changing line or transmission conditions. For applications such as this, it is necessary to rely on adaptive filtering techniques.

An adaptive filter is a filter containing coefficients that are updated by an adaptive algorithm to optimize the filter's response to a desired performance criterion. In general, adaptive filters consist of two distinct parts: a filter, whose structure is designed to perform a desired processing function; and an adaptive algorithm, for adjusting the coefficients of that filter to improve its performance, as illustrated in Figure 1. The incoming signal, $x(n)$, is weighted in a digital filter to produce an output, $y(n)$. The adaptive algorithm adjusts the weights in the filter to minimize the error, $e(n)$, between the filter output, $y(n)$, and the desired response of the filter, $d(n)$. Because of their robust performance in unknown and time-variant environments, adaptive filters have been widely used in fields ranging from telecommunications to control.


Figure 1. General Form of an Adaptive Filter

Adaptive filters can be used in various applications with different input and output configurations. In many applications requiring real-time operation, such as adaptive prediction, channel equalization, echo cancellation, and noise cancellation, an adaptive filter implementation based on a programmable digital signal processor (DSP) has many advantages over other approaches such as a hard-wired adaptive filter. Not only are power, space, and manufacturing requirements greatly reduced, but programmability also provides flexibility for system upgrades and software improvements.

The early research on adaptive filters was concerned with adaptive antennas [1] and adaptive equalization of digital transmission systems [2]. Much of the reported research on the adaptive filter has been based on Widrow's well-known Least Mean Square (LMS) algorithm, because the LMS algorithm is relatively simple to design and implement, and it is well-understood and well-suited for many applications. All the filter structures and update algorithms discussed in this application report are Finite Impulse Response (FIR) filter structures and LMS-type algorithms. However, for a particular application, adaptive filters can be implemented in a variety of structures and adaptation algorithms [1, 3 through 9]. These structures and algorithms generally trade increased complexity for improved performance. An interactive software package to evaluate the performance of adaptive filters has also been developed [10].

The complexity of an adaptive filter implementation is usually measured in terms of its multiplication rate and storage requirement. However, the data flow and data manipulation capabilities of a DSP are also major factors in implementing adaptive filter systems. A parallel hardware multiplier, a pipelined architecture, and fast on-chip memory are major features of most DSPs [11, 12] and can make filter implementation more efficient.

Two such devices, the TMS320C25 and TMS320C30 from Texas Instruments [13, 14], have been chosen as the processors for fixed-point and floating-point arithmetic, respectively. They combine power, high speed, flexibility, and an architecture optimized for adaptive signal processing. The instruction execution time is 80 ns for the TMS320C25 and only 60 ns for the TMS320C30. Most instructions execute in a single cycle, and the architectures of both processors make it possible to execute more than one operation per instruction. For example, in one instruction, the TMS320C25 processor can generate an instruction address and fetch that instruction, decode the instruction, perform one or two data moves (if the second data is from program memory), update one address pointer, and perform one or two computations (multiplication and accumulation). These processors are designed for real-time tasks in telecommunications, speech processing, image processing, and high-speed control.

To direct the present research toward realistic real-time applications, three adaptive structures were implemented:

1. Transversal
2. Symmetric transversal
3. Lattice

Each structure utilizes five different update algorithms:

1. LMS
2. Normalized LMS
3. Leaky LMS
4. Sign-error LMS
5. Sign-sign LMS

Each structure with its adaptation algorithms is implemented using the TMS320C25 with fixed-point arithmetic and the TMS320C30 with floating-point arithmetic. The processor assembly code is included in the Appendix for each implementation. The assembly code for each structure and adaptation strategy can be readily modified by the reader to fit his/her applications and could be incorporated into a C function library as callable routines.

In this application report, the applications of adaptive filters, such as adaptive prediction, adaptive equalization, adaptive echo cancellation, and adaptive noise cancellation are presented first. Next, the implementation of the three filter structures and five adaptive algorithms with the TMS320C25 and TMS320C30 is described. This is followed by the practical considerations on the implementation of these adaptive filters. The remainder of the application report covers coding options, such as the routine libraries that support both assembly and C languages.

## Applications of Adaptive Filters

The most important feature of an adaptive filter is the ability to operate effectively in an unknown environment and track time-varying characteristics of the input signal. The adaptive filter has been successfully applied to communications, radar, sonar, control, and image processing. Figure 1 illustrates a general form of an adaptive filter with input signals, $x(n)$ and $d(n)$, output signal, $y(n)$, and error signal, $e(n)$, which is the difference between the desired signal, $\mathrm{d}(\mathrm{n})$, and output signal, $\mathrm{y}(\mathrm{n})$. The adaptive filter can be used in different applications with different input/output configurations. In this section we briefly discuss several potential applications for the adaptive filters [15].

## Adaptive Prediction

Adaptive prediction [16 through 18] is illustrated in Figure 2. In the general application of adaptive prediction, the signals are $\mathrm{x}(\mathrm{n})$ - delayed version of original signal, $\mathrm{d}(\mathrm{n})$ - original input signal, $\mathrm{y}(\mathrm{n})$ - predicted signal, and $\mathrm{e}(\mathrm{n})$ - prediction error or residual.


Figure 2. Block Diagram of an Adaptive Predictor

A major application of adaptive prediction is the waveform coding of a speech signal. The adaptive filter is designed to exploit the correlation between adjacent samples of the speech signal so that the prediction error is, on average, much smaller than the input signal. This prediction error signal is quantized and sent to the receiver in order to reduce the number of bits required for the transmission. This type of waveform coding is called Adaptive Differential Pulse-Code Modulation (ADPCM) [17] and provides data rate compression of the speech at $32 \mathrm{~kb} / \mathrm{s}$ with toll quality. More recently, in certain online applications, time recursive modeling algorithms have been proposed to facilitate speech modeling and analysis.

The coefficients of the adaptive predictor can be used as the autoregressive (AR) parameters of the nonstationary model. The equation of the AR process is

$$
u(n)=a_{1} u(n-1)+a_{2} u(n-2)+\cdots+a_{m} u(n-m)+v(n)
$$

where $a_{1}, a_{2}, \ldots, a_{m}$ are the AR parameters. Thus, the present value of the process $u(n)$ equals a finite linear combination of past values of the process plus an error term $v(n)$. This adaptive AR model provides a practical means to measure the instantaneous frequency of the input signal. The adaptive predictor can also be used to detect and enhance a narrow-band signal embedded in broad-band noise. This Adaptive Line Enhancer (ALE) provides at its output $y(n)$ a sinusoid with an enhanced signal-to-noise ratio, while the sinusoidal components are reduced at the error output $e(n)$.
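The AR relation can be illustrated with a short C sketch. It assumes fixed, known AR parameters (in the adaptive predictor they would be the adapted weights), and the names `ar_predict` and `ar_next` are hypothetical. When the predictor uses the true parameters, the prediction error equals the error term $v(n)$.

```c
/* One-step linear prediction:
 * y(n) = a[0]*u(n-1) + a[1]*u(n-2) + ... + a[m-1]*u(n-m),
 * where past[k] holds u(n-1-k). */
static double ar_predict(const double *a, const double *past, int m)
{
    double y = 0.0;
    for (int k = 0; k < m; k++)
        y += a[k] * past[k];
    return y;
}

/* Generate the next AR sample u(n) = prediction + v(n) and shift the
 * history so that past[0] holds the newest sample. */
static double ar_next(const double *a, double *past, int m, double v)
{
    double u = ar_predict(a, past, m) + v;
    for (int k = m - 1; k > 0; k--)
        past[k] = past[k - 1];
    past[0] = u;
    return u;
}
```

With `a = {0.5, -0.25}` and history `{1.0, 2.0}`, the prediction is 0, so the generated sample equals the error term itself.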

## Adaptive Equalization

Figure 3 shows another model known as adaptive equalization [2, 9, 15]. The signals in the adaptive equalization model are defined as $\mathrm{x}(\mathrm{n})$ - received signal (filtered version of transmitted signal) plus channel noise, $\mathrm{d}(\mathrm{n})$ - detected data signal (data mode) or pseudo random number (training mode), $\mathrm{y}(\mathrm{n})$ - equalized signal used to detect received data, and $\mathrm{e}(\mathrm{n})$ - residual intersymbol interference plus noise.


Figure 3. Block Diagram of an Adaptive Equalizer

The use of adaptive equalization to eliminate the amplitude and phase distortion introduced by the communication channel was one of the first applications of adaptive filtering in telecommunications [19]. The effect of each symbol transmitted over a time-dispersive channel extends beyond the time interval used to represent that symbol, resulting in an overlap of received symbols. Since most channels are time-varying and unknown in advance, the adaptive channel equalizer is designed to deal with this intersymbol interference and is widely used for bandwidth-efficient transmission over telephone and radio channels.

## Adaptive Echo Cancellation

Another application, known as adaptive echo cancellation [20, 21] is shown in Figure 4. In this application, the signals are identified as $x(n)$ - far-end signal, $d(n)$ - echo of far-end signal plus near-end signal, $y(n)$ - estimated echo of far-end signal, and $e(n)$ - near-end signal plus residual echo.


Figure 4. Block Diagram of an Echo Canceller

Adaptive echo cancellers are used to cancel echoes in long-distance telephone voice communication, full-duplex voiceband data modems, and high-performance audio-conferencing systems. To overcome the echo problem, echo cancellers are installed at both ends of the network. The cancellation is achieved by estimating the echo and subtracting it from the return signal.

## Adaptive Noise Cancellation

One of the simplest and most effective adaptive signal processing techniques is adaptive noise cancelling [1, 22]. As shown in Figure 5, the primary input $d(n)$ contains both signal and noise, and $x(n)$ is the noise reference input. An adaptive filter is used to estimate the noise in $d(n)$, and the noise estimate $y(n)$ is then subtracted from the primary channel. The noise cancellation output is then the error signal $e(n)$.

The applications of noise cancellation include the cancellation of various forms of interference in electrocardiography, noise in speech signals, noise in fighter cockpit environments, antenna sidelobe interference, and the elimination of $60-\mathrm{Hz}$ hum. In the majority of these noise cancellation applications, the LMS algorithm has been utilized.


Figure 5. General Form of a Noise Canceller
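The noise-cancelling configuration can be sketched in C with the standard LMS weight update $w_i(n+1) = w_i(n) + \mu\, e(n)\, x(n-i)$. This is a minimal sketch under stated assumptions, not the report's code: the names `nc_state` and `nc_step`, the two-tap length, and the step-size convention (some texts write $2\mu$ in place of $\mu$) are all assumptions made for illustration.

```c
#define NC_TAPS 2   /* hypothetical filter length, kept tiny for illustration */

typedef struct {
    double w[NC_TAPS];   /* adaptive weights w_i(n) */
    double x[NC_TAPS];   /* reference history x(n-i) */
} nc_state;

/* Process one sample pair: x_new is the noise reference x(n) and d is the
 * primary input d(n). Returns the canceller output e(n) = d(n) - y(n). */
static double nc_step(nc_state *s, double x_new, double d, double mu)
{
    /* Shift the reference history and insert the newest sample. */
    for (int i = NC_TAPS - 1; i > 0; i--)
        s->x[i] = s->x[i - 1];
    s->x[0] = x_new;

    /* Noise estimate y(n): inner product of weights and history. */
    double y = 0.0;
    for (int i = 0; i < NC_TAPS; i++)
        y += s->w[i] * s->x[i];

    /* Error signal, which is also the noise-cancelled output. */
    double e = d - y;

    /* LMS update of each weight. */
    for (int i = 0; i < NC_TAPS; i++)
        s->w[i] += mu * e * s->x[i];

    return e;
}
```

If the primary input contains only noise that is a scaled copy of the reference, for example $d(n) = 0.5\,x(n)$, the weights move toward $(0.5, 0)$ and $e(n)$ decays toward zero. When a signal is also present, $e(n)$ approximates that signal once the weights have settled.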

## Application Summary

The above list of applications is not exhaustive and is limited primarily to applications within the field of telecommunications. Adaptive filtering has been used extensively in the context of many other fields including, but not limited to, instantaneous frequency tracking, intrusion detection, acoustic Doppler extraction, on-line system identification, geophysical signal processing, biomedical signal processing, the elimination of radar clutter, beamforming, sonar processing, active sound cancellation, and adaptive control.

## Implementation of Adaptive Structures and Algorithms

Several types of filter structures can be implemented in the design of adaptive filters, such as Infinite Impulse Response (IIR) or Finite Impulse Response (FIR). An adaptive IIR filter [1, 5], with poles as well as zeros, makes it possible to offer the same filter characteristics as the FIR filter with lower filter complexity. However, the major problem with an adaptive IIR filter is the possible instability of the filter if the poles move outside the unit circle during the adaptive process. In this application report, only FIR structures are implemented, to guarantee filter stability.

An adaptive FIR filter can be realized using transversal, symmetric transversal, and lattice structures. In this section, the adaptive transversal filter with the LMS algorithm is introduced and implemented first to provide a working knowledge of adaptive filters.

## Transversal Structure with LMS Algorithm

## Transversal Structure Filter

The most common implementation of the adaptive filter is the transversal structure (tapped delay line) illustrated in Figure 6. The filter output signal $y(n)$ is

$$
y(n)=\underline{w}^{T}(n)\, \underline{x}(n)=\sum_{i=0}^{N-1} w_{i}(n)\, x(n-i) \qquad (1)
$$

where $\underline{x}(n)=[x(n)\; x(n-1) \ldots x(n-N+1)]^{T}$ is the input vector, $\underline{w}(n)=\left[w_{0}(n)\; w_{1}(n) \ldots w_{N-1}(n)\right]^{T}$ is the weight vector, T denotes transpose, n is the time index, and N is the order of the filter. Equation (1) has the form of a finite impulse response filter and is also the convolution (inner product) of the two vectors $\underline{x}(n)$ and $\underline{w}(n)$. The implementation of Equation (1) is illustrated using the following C program:

$$
\begin{aligned}
& \mathrm{y}[\mathrm{n}]=0 . ; \\
& \text { for }(\mathrm{i}=0 ; \mathrm{i}<\mathrm{N} ; \mathrm{i}++) \\
& \quad \mathrm{y}[\mathrm{n}]+=\mathrm{wn}[\mathrm{i}] * \mathrm{xn}[\mathrm{i}] ;
\end{aligned}
$$

where wn[i] denotes $w_{i}(n)$ and xn[i] represents $x(n-i)$.
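The fragment above can be expanded into a small self-contained sketch that also maintains the tapped delay line by shifting data, in the spirit of a data-move implementation. The function names `delay_line_push` and `fir_output` and the use of `double` are assumptions made for illustration; they do not come from the appendix programs.

```c
#include <stddef.h>

/* Shift the delay line by one sample and insert the newest input at xn[0].
   After the call, xn[i] holds x(n-i) for i = 0..n_taps-1. */
void delay_line_push(double *xn, size_t n_taps, double x_new)
{
    if (n_taps == 0)
        return;
    for (size_t i = n_taps - 1; i > 0; i--)
        xn[i] = xn[i - 1];          /* x(n-i) <- x(n-i+1) */
    xn[0] = x_new;
}

/* Equation (1): y(n) = sum over i of w_i(n) * x(n-i). */
double fir_output(const double *wn, const double *xn, size_t n_taps)
{
    double y = 0.0;
    for (size_t i = 0; i < n_taps; i++)
        y += wn[i] * xn[i];
    return y;
}
```

Calling `delay_line_push` once per input sample and then `fir_output` yields one output sample per iteration.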


Figure 6. Transversal Filter Structure

## TMS320C25 Implementation

The architecture of the TMS320C25 [13] is optimized for implementing FIR filters. After execution of the CNFP (Configure Block B0 as Program Memory) instruction, the filter coefficients $w_{i}(n)$ from RAM block B0 (via the program bus) and the data $x(n-i)$ from RAM block B1 (via the data bus) are available simultaneously to the parallel multiplier (see Figure 7).


Figure 7. TMS320C25 Arithmetic Unit (after executing the CNFP instruction)

The MACD instruction enables multiply/accumulate, data move, and pointer update operations to be completed in a single instruction cycle (80 ns) if the filter coefficients are stored in on-chip RAM or ROM or in off-chip program memory with zero wait states. Since the adaptive weights $w_{i}(n)$ need to be updated in every iteration, the filter coefficients must be stored in RAM. The implementation of the inner product in Equation (1) can be made even more efficient with a repeat instruction, RPTK. An N-weight transversal filter can be implemented as follows [23]; this listing is referred to later as Program (A):

| Instruction | Operands |
| :--- | :--- |
| `LARP` | `ARn` |
| `LRLK` | `ARn,LASTAP` |
| `RPTK` | `N-1` |
| `MACD` | `COEFFP,*-` |

where ARn is an auxiliary register that points to $x(n-N+1)$, and the prefetch counter (PFC) points to the last weight $w_{N-1}(n)$, indicated by COEFFP. When the MACD instruction is repeated, the coefficient address is transferred to the PFC and is incremented by one on each repetition. Therefore, the components of the weight vector $\underline{w}(n)$ are stored in B0 as

| Address | Contents |
| :--- | :--- |
| Low address | $w_{N-1}(n)$ |
|  | $w_{N-2}(n)$ |
|  | $\vdots$ |
| High address | $w_{0}(n)$ |

The MACD instruction in repeat mode also copies the data pointed to by ARn to the next higher on-chip RAM location. The buffer memory of the transversal filter is therefore stored as

| Address | Contents |
| :--- | :--- |
| Low address | $x(n)$ |
|  | $x(n-1)$ |
|  | $\vdots$ |
| High address | $x(n-N+1)$ |

In general, roundoff noise occurs after each multiplication. However, the TMS320C25 has a $16 \times 16$-bit multiplier and a 32 -bit accumulator, so there is no roundoff during the summing of a set of product terms in Program (A). All multiplication products are represented in full precision, and rounding is performed after they are summed. Thus $y(n)$ is obtained from the accumulator with only one roundoff, which minimizes the roundoff noise in the output $\mathrm{y}(\mathrm{n})$. Since both the tapped delay line and the adaptive weights are stored in data RAM to achieve the fastest throughput, the highest transversal filter order for efficient implementation on the TMS320C25 is 256 . However, if necessary, higher order filters can be implemented by using external data RAM.
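The single-rounding behavior described above can be imitated in portable C. The sketch below is an illustration only: the function name `fir_q15`, the Q15 data format, and the assumption that the caller scales the data so that the sum fits in 32 bits are not taken from the source programs, and the code does not model overflow handling of the actual device. It also assumes that right-shifting a negative `int32_t` is an arithmetic shift, which holds on common compilers but is implementation-defined in C.

```c
#include <stdint.h>
#include <stddef.h>

/* Q15 inner product with one rounding step at the end.
   Each 16x16 product is kept as an exact Q30 value in a 32-bit
   accumulator; rounding happens only when the sum is reduced to Q15. */
int16_t fir_q15(const int16_t *wn, const int16_t *xn, size_t n_taps)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n_taps; i++)
        acc += (int32_t)wn[i] * (int32_t)xn[i];  /* exact Q30 product */
    acc += (int32_t)1 << 14;                     /* add half an LSB of Q15 */
    return (int16_t)(acc >> 15);                 /* Q30 -> Q15 */
}
```

With three products that are each just under half a Q15 LSB, rounding every product separately would give zero, whereas rounding the accumulated sum once preserves their combined contribution.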

## TMS320C30 Implementation

The architecture of the TMS320C30 [14] is quite different from that of TI's second-generation processors. Instead of using program/data memory, it provides two data address buses for data memory accesses. This feature allows two data memory addresses to be generated at the same time. Hence, two data stores, two data loads, or one data store and one data load can be performed simultaneously. Such capabilities make programming much easier and more flexible. Since the hardware multiplier and the arithmetic logic unit (ALU) of the TMS320C30 are separate, with proper operand arrangement the processor can perform one multiplication and one addition or subtraction at the same time. With these two features combined, the TMS320C30 can execute several other parallel instructions. These parallel instructions can be found in Section 11 of the Third-Generation TMS320 User's Guide [14]. Combined with the single-instruction repeat RPTS, the inner product in Equation (1) can be implemented as follows:

| Instruction | Operands | Comment |
| :--- | :--- | :--- |
| `MPYF3` | `*AR0++(1)%,*AR1++(1)%,R1` | ; w[0].x[0] |
| `RPTS` | `N-2` | ; Repeat N-1 times |
| `MPYF3` | `*AR0++(1)%,*AR1++(1)%,R1` | ; y[] = w[].x[] |
| `\|\| ADDF3` | `R1,R2,R2` |  |
| `ADDF3` | `R1,R2,R2` | ; Include last product |

where auxiliary registers AR0 and AR1 point to x and w arrays. The addition in the parallel instruction sums the previous values of R1 and R2. Therefore, R1 is initialized with the first product prior to the repeat instruction RPTS.

Note that the implementation above does not move the data in the x array as MACD does on the TMS320C25. For the filter delay taps, the TMS320C30 uses a circular buffer to implement the delay line. This method reserves a block of memory for the buffer and uses a pointer to indicate the beginning of the buffer. Instead of moving data to the next memory location, the pointer is updated to point to the previous memory location. Seen from the new beginning of the buffer, this has the same effect as a tapped delay line. When the pointer moves past the end of the buffer, it wraps around to the other end, as if the two ends of the buffer were joined like a necklace. Thus, new data is written into the circular queue at the location pointed to by AR0, replacing the oldest value. However, from an adaptive filter point of view, the data does not have to be moved at this point yet.
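The circular buffer idea can be sketched in C as follows. This is a minimal illustration rather than a translation of the TMS320C30 code: the type `circ_delay`, the capacity `CB_MAX_TAPS`, and the function names are hypothetical, and an explicit index wrap stands in for the hardware circular addressing mode.

```c
#include <stddef.h>

#define CB_MAX_TAPS 64          /* hypothetical capacity for this sketch */

typedef struct {
    double buf[CB_MAX_TAPS];
    size_t n_taps;              /* filter order N, 1..CB_MAX_TAPS */
    size_t head;                /* index of the newest sample x(n) */
} circ_delay;

void circ_init(circ_delay *d, size_t n_taps)
{
    d->n_taps = n_taps;
    d->head = 0;
    for (size_t i = 0; i < CB_MAX_TAPS; i++)
        d->buf[i] = 0.0;
}

/* Move the pointer back one location (with wrap-around) and overwrite
   the oldest sample; no other data is moved. */
void circ_push(circ_delay *d, double x_new)
{
    d->head = (d->head == 0) ? d->n_taps - 1 : d->head - 1;
    d->buf[d->head] = x_new;
}

/* y(n) = sum of w_i * x(n-i), reading x(n-i) at (head + i) modulo N. */
double circ_fir(const circ_delay *d, const double *wn)
{
    double y = 0.0;
    size_t k = d->head;
    for (size_t i = 0; i < d->n_taps; i++) {
        y += wn[i] * d->buf[k];
        if (++k == d->n_taps)
            k = 0;              /* wrap to the other end of the buffer */
    }
    return y;
}
```

Each new sample costs one store and one pointer update regardless of N, which is the reason the technique pays off for long filters.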

The TMS320C30 has a 32-bit floating-point multiplier, and the result from the multiplier is accumulated in a 40-bit extended-precision register. If the input from the A/D converter is 16 bits or fewer, there is no roundoff noise after multiplication. Theoretically, the TMS320C30 can implement a very high order adaptive filter. However, for the most efficient implementation, the filter order is limited to 2K because a TMS320C30 external data write requires at least two cycles. If the filter coefficients are placed anywhere other than internal data RAM, the number of instruction cycles increases.

## LMS Adaptation Algorithm

The adaptation algorithm uses the error signal

$$
\begin{equation*}
\mathrm{e}(\mathrm{n})=\mathrm{d}(\mathrm{n})-\mathrm{y}(\mathrm{n}), \tag{2}
\end{equation*}
$$

where $d(n)$ is the desired signal and $y(n)$ is the filter output. The input vector $\underline{x}(n)$ and $\mathrm{e}(\mathrm{n})$ are used to update the adaptive filter coefficients according to a criterion that is to be minimized. The criterion employed in this section is the mean-square error (MSE) $\epsilon$ :

$$
\begin{equation*}
\epsilon=\mathrm{E}\left[\mathrm{e}^{2}(\mathrm{n})\right] \tag{3}
\end{equation*}
$$

where E [.] denotes the expectation operator. If $\mathrm{y}(\mathrm{n})$ from Equation (1) is substituted into Equation (2), then Equation (3) can be expressed as

$$
\begin{equation*}
\epsilon=\mathrm{E}\left[\mathrm{d}^{2}(\mathrm{n})\right]+\underline{\mathrm{w}}^{\mathrm{T}}(\mathrm{n}) \mathrm{R} \underline{\mathrm{w}}(\mathrm{n})-2 \underline{\mathrm{w}}^{\mathrm{T}}(\mathrm{n}) \underline{\mathrm{p}} \tag{4}
\end{equation*}
$$

where $R=E\left[\underline{x}(n) \underline{x}^{T}(n)\right]$ is the $N \times N$ autocorrelation matrix, which indicates the sample-to-sample correlation within a signal, and $\underline{p}=E[d(n) \underline{x}(n)]$ is the $N \times 1$ cross-correlation vector, which indicates the correlation between the desired signal $d(n)$ and the input signal vector $\underline{x}(n)$.

The optimum solution $\underline{w}^{*}=\left[w_{0}^{*}\ w_{1}^{*} \ldots w_{N-1}^{*}\right]^{T}$, which minimizes the MSE, is derived by solving the equation

$$
\begin{equation*}
\frac{\partial \epsilon}{\partial \underline{\mathrm{w}}(\mathrm{n})}=0 \tag{5}
\end{equation*}
$$

This leads to the normal equation

$$
\begin{equation*}
\mathrm{R} \underline{\mathrm{w}}^{*}=\underline{\mathrm{p}} \tag{6}
\end{equation*}
$$

If the R matrix has full rank (i.e., $\mathrm{R}^{-1}$ exists), the optimum weights are obtained by

$$
\begin{equation*}
\underline{\mathrm{w}}^{*}=\mathrm{R}^{-1} \underline{\mathrm{p}} \tag{7}
\end{equation*}
$$

In Linear Predictive Coding (LPC) of a speech signal, the input speech is divided into short segments, the quantities $R$ and $\underline{p}$ are estimated, and the optimal weights corresponding to each segment are computed. This procedure is called a block-by-block data-adaptive algorithm [24].
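For a two-weight filter with a stationary input, the normal equation can be solved in closed form, which makes Equations (6) and (7) easy to check by hand. The sketch below assumes $R$ has the symmetric Toeplitz form with entries $r(0)$ on the diagonal and $r(1)=E[x(n)x(n-1)]$ off the diagonal; the symbol $r(1)$ and the function name `wiener_2tap` are introduced here for illustration only.

```c
/* Solve the 2x2 normal equation R w* = p by Cramer's rule, with
   R = [[r0, r1], [r1, r0]].  Returns 0 on success and -1 if R is
   singular (not full rank), in which case w_opt is left untouched. */
int wiener_2tap(double r0, double r1, const double p[2], double w_opt[2])
{
    double det = r0 * r0 - r1 * r1;
    if (det == 0.0)
        return -1;
    w_opt[0] = (r0 * p[0] - r1 * p[1]) / det;
    w_opt[1] = (r0 * p[1] - r1 * p[0]) / det;
    return 0;
}
```

For example, $r(0)=1$, $r(1)=0.5$, and $\underline{p}=[0.5\ 0.25]^{T}$ give $\underline{w}^{*}=[0.5\ 0]^{T}$, which can be verified by substituting back into Equation (6). For larger N the direct inversion becomes expensive, which motivates the sample-by-sample approach that follows.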

The widely used LMS algorithm is an alternative that adapts the weights on a sample-by-sample basis. Since this method avoids the complicated computation of $R^{-1}$ and $\underline{p}$, it is a practical method for finding close approximate solutions to Equation (7) in real time. The LMS algorithm is a steepest descent method in which the next weight vector $\underline{w}(n+1)$ is obtained by changing the current weight vector in proportion to the negative gradient of the mean-square-error performance surface in Equation (4):

$$
\begin{equation*}
\underline{w}(n+1)=\underline{w}(n)-u \underline{\nabla}(n) \tag{8}
\end{equation*}
$$

where u is the adaptation step size that controls the stability and the convergence rate. For the LMS algorithm, the gradient at the nth iteration, $\underline{\nabla}(n)$, is estimated by taking the squared error $e^{2}(n)$ as an estimate of the MSE in Equation (3). Thus, the expression for the gradient estimate can be simplified to

$$
\begin{equation*}
\underline{\nabla}(\mathrm{n})=\frac{\partial\left[\mathrm{e}^{2}(\mathrm{n})\right]}{\partial \underline{\mathrm{w}}(\mathrm{n})}=-2 \mathrm{e}(\mathrm{n}) \underline{\mathrm{x}}(\mathrm{n}) \tag{9}
\end{equation*}
$$

Substitution of this instantaneous gradient estimate into Equation (8) yields the Widrow-Hoff LMS algorithm

$$
\begin{equation*}
\underline{\mathrm{w}}(\mathrm{n}+1)=\underline{\mathrm{w}}(\mathrm{n})+2 \mathrm{u}\, \mathrm{e}(\mathrm{n}) \underline{\mathrm{x}}(\mathrm{n}) \tag{10}
\end{equation*}
$$

where $2u$ in Equation (10) is usually replaced by $u$ in practical implementations.

Starting with an arbitrary initial weight vector $\underline{w}(0)$, the weight vector $\underline{w}(n)$ will converge to its optimal solution $\underline{w}^{*}$, provided $u$ is selected such that [1]

$$
\begin{equation*}
0<u<\frac{1}{\lambda_{\max }} \tag{11}
\end{equation*}
$$

where $\lambda_{\max }$ is the largest eigenvalue of the matrix $R$. $\lambda_{\max }$ can be bounded by

$$
\begin{equation*}
\lambda_{\max }<\operatorname{Tr}[\mathrm{R}]=\sum_{\mathrm{i}=0}^{\mathrm{N}-1} \mathrm{r}(0)=\mathrm{N} \mathrm{r}(0) \tag{12}
\end{equation*}
$$

where $\operatorname{Tr}[\cdot]$ denotes the trace of a matrix and $r(0)=E\left[x^{2}(n)\right]$ is the average input power.
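Equations (11) and (12) together give a conservative step-size limit that needs only the filter order and the input power, not the eigenvalues of $R$. The following sketch estimates $r(0)$ by a time average and returns $1/(N r(0))$; the function names are hypothetical, and replacing the expectation by a time average is an assumption that is reasonable only for stationary input.

```c
#include <stddef.h>

/* Estimate r(0) = E[x^2(n)] by averaging x^2 over len samples. */
double input_power(const double *x, size_t len)
{
    double sum = 0.0;
    for (size_t i = 0; i < len; i++)
        sum += x[i] * x[i];
    return (len > 0) ? sum / (double)len : 0.0;
}

/* Conservative upper limit for u: since lambda_max < N r(0),
   any u below 1/(N r(0)) also satisfies u < 1/lambda_max.
   Returns 0 when no meaningful limit can be computed. */
double step_size_limit(size_t n_taps, double r0)
{
    if (n_taps == 0 || r0 <= 0.0)
        return 0.0;
    return 1.0 / ((double)n_taps * r0);
}
```

In practice a step size well below this limit is normally chosen, since the limit guarantees convergence but says nothing about the excess error.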
For adaptive signal processing applications, the most important practical consideration is the speed of convergence, which determines the ability of the filter to track nonstationary signals. Generally speaking, weight vector convergence is attained only when the slowest weight has converged. The time constant of the slowest mode is [1]

$$
\begin{equation*}
\mathrm{t}=\frac{1}{\mathrm{u} \lambda_{\min }} \tag{13}
\end{equation*}
$$

This indicates that the time constant for weight convergence is inversely proportional to $u$ and also depends on the eigenvalues of the autocorrelation matrix of the input. With disparate eigenvalues, i.e., $\lambda_{\max } \gg \lambda_{\min }$, the settling time is limited by the slowest mode, $\lambda_{\min }$. Figure 8 shows the relaxation of the mean square error from its initial value $\epsilon_{0}$ toward the optimal value $\epsilon_{\min }$.

Adaptation based on a gradient estimate results in noise in the weight vector and therefore in a loss of performance. This noise in the adaptive process causes the steady-state weight vector to vary randomly about the optimum weight vector. The accuracy of the weight vector in steady state is measured by the excess mean square error (excess $\operatorname{MSE}=\mathrm{E}\left[\epsilon-\epsilon_{\min }\right]$). The excess MSE in the LMS algorithm [1] is

$$
\begin{equation*}
\text { excess MSE }=u \operatorname{Tr}[R] \epsilon_{\min } \tag{14}
\end{equation*}
$$

where $\epsilon_{\min }$ is minimum MSE in the steady state.
Equations (13) and (14) yield the basic trade-off of the LMS algorithm: to obtain high accuracy (low excess MSE) in the steady state, a small value of $u$ is required, but this will slow down the convergence rate. Further discussions of the characteristics and properties of the LMS algorithm are presented in [1, 3 through 9]. The implementations of LMS algorithm with the TMS320C25 and TMS320C30 are presented next.


Figure 8. Learning Curve of an Adaptive Transversal Filter and an LMS Algorithm with Different Step Sizes
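The trade-off expressed by Equations (13) and (14) can be made concrete with a short calculation. The values below are arbitrary and chosen only for illustration: $N=8$ and $r(0)=0.5$, so that $\operatorname{Tr}[R]=N r(0)=4$, together with $\lambda_{\min}=0.1$. Both step sizes satisfy the bound $u<1/(N r(0))=0.25$.

$$
\begin{aligned}
u=0.01: \quad & t=\frac{1}{(0.01)(0.1)}=1000 \text { iterations}, & \text { excess MSE } & =(0.01)(4) \epsilon_{\min }=0.04\, \epsilon_{\min } \\
u=0.005: \quad & t=\frac{1}{(0.005)(0.1)}=2000 \text { iterations}, & \text { excess MSE } & =(0.005)(4) \epsilon_{\min }=0.02\, \epsilon_{\min }
\end{aligned}
$$

Halving the step size halves the excess MSE but doubles the time constant of the slowest mode.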

Since $u\,e(n)$ is the same for all N weight updates, the error signal $e(n)$ is first multiplied by $u$ to get $u\,e(n)$. This constant is computed once and then multiplied by each $x(n-i)$ to update $w_{i}(n)$. An implementation of the LMS algorithm in Equation (10) is illustrated as

```
uen = u*e[n];            /* uen = u*e(n), shared by all N weight updates */
for (i=0; i<N; i++) {
    wn[i] += uen * xn[i];
}
```
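Combining Equations (1), (2), and (10) gives one complete LMS iteration. The sketch below is a floating-point illustration, with $2u$ written as $u$ as noted earlier; the function name `lms_step` and its argument list are assumptions and do not correspond to the appendix programs.

```c
#include <stddef.h>

/* One LMS iteration for an N-tap transversal filter.
   xn[i] must already hold x(n-i); wn[i] holds w_i(n) on entry and
   w_i(n+1) on return.  Returns the error e(n). */
double lms_step(double *wn, const double *xn, size_t n_taps,
                double d, double u)
{
    double y = 0.0;
    for (size_t i = 0; i < n_taps; i++)   /* Equation (1): filter output */
        y += wn[i] * xn[i];

    double e = d - y;                     /* Equation (2): error signal */
    double uen = u * e;                   /* common factor u*e(n) */

    for (size_t i = 0; i < n_taps; i++)   /* Equation (10): weight update */
        wn[i] += uen * xn[i];
    return e;
}
```

When the desired signal is the noise-free output of an unknown FIR system of the same order driven by the same input, repeated calls drive the weights toward the coefficients of that system, which is the on-line system identification application mentioned earlier.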


## TMS320C25 Implementation

The TMS320C25 provides two powerful instructions (ZALR and MPYA) to perform the update example in Equation (10).

- ZALR loads a data memory value into the high-order half of the accumulator while rounding the value by setting bit 15 of the accumulator to one and setting bits $0-14$ of the accumulator to zero. The rounding is necessary because it can reduce the roundoff noise from multiplication.
- MPYA accumulates the previous product in the P register and multiplies the operand with the data in T register.

Assuming that $u\,e(n)$ is stored in the T register and the auxiliary register pointer (ARP) is pointing to AR3, the adaptation of each weight is shown in the following instruction sequence:

| Label | Instruction | Operands | Comment |
| :--- | :--- | :--- | :--- |
|  | `LRLK` | `AR1,N-1` | ; Initialize loop counter |
|  | `LRLK` | `AR2,COEFFD` | ; Point to $w_{N-1}(n)$ |
|  | `LRLK` | `AR3,LASTAP+1` | ; Point to $x(n-N+1)$, since MACD in (A) already moved elements of current $x(n)$ to the next higher location |
|  | `MPY` | `*-,AR2` | ; P = ue(n) * $x(n-N+1)$ |
| ADAP | `ZALR` | `*,AR3` | ; Load $w_{i}(n)$ and round |
|  | `MPYA` | `*-,AR2` | ; ACC = P + $w_{i}(n)$ and P = ue(n) * $x(n-i)$ |
|  | `SACH` | `*+,0,AR1` | ; Store $w_{i}(n+1)$ |
|  | `BANZ` | `ADAP,*-,AR2` | ; Test loop counter; if counter not equal to 0, decrement counter, branch to ADAP, and select AR2 as next pointer |

For each iteration, N instruction cycles are needed to perform Equation (1), 6N instruction cycles are needed to perform the weight updates in Equation (10), and the total number of instruction cycles needed is $7N+28$. An example of a TMS320C25 program implementing an LMS transversal filter is presented in Appendix A1. Note that BANZ needs three instruction cycles to execute. This overhead can be avoided by using straight-line code, which requires $4N+33$ instruction cycles [25].

## TMS320C30 Implementation

Although the TMS320C30 does not provide any specific instruction for updating adaptive filter coefficients, it can still perform each weight update in two instructions because of its powerful architecture. The TMS320C30 has a repeat block instruction, RPTB, which allows a block of instructions to be repeated a number of times without any penalty for looping. A single repeat-mode bit, RM, in the status register, ST, and three registers, repeat start address (RS), repeat end address (RE), and repeat counter (RC), control the block repeat. When RM is set, the PC repeats the instructions between RS and RE a number of times determined by the value of RC. The repeat modes repeat a block of code at least once in a typical operation. The repeat counter should be loaded with one less than the desired number of repetitions. Assuming the product $u\,e(n)$ in Equation (10) is stored in R7, as the listing comments indicate, the adaptation of the filter coefficients is shown as follows:

```
        MPYF3   *AR0++(1)%,R7,R1    ; R1 = u*e(n)*x(n)
        LDI     order-3,RC          ; Initialize repeat counter
        RPTB    LMS                 ; Do i = 0, N-3
        MPYF3   *AR0++(1)%,R7,R1    ; Compute u*e(n)*x(n-i-1)
||      ADDF3   *AR1,R1,R2          ; Compute wi(n) + u*e(n)*x(n-i)
LMS     STF     R2,*AR1++(1)%       ; Store wi(n+1)
        MPYF3   *AR0,R7,R1          ; For i = N-2
||      ADDF3   *AR1,R1,R2
        STF     R2,*AR1++(1)%       ; Store wN-2(n+1)
        ADDF3   *AR1,R1,R2          ; Include last w
        STF     R2,*AR1++(1)%       ; Store wN-1(n+1)
```

where auxiliary registers AR0 and AR1 point to the x and w arrays. R1 is computed before the loop because the addition in the parallel instruction uses the previous value of R1. To update the x array pointer to the new beginning of the data buffer for the next iteration (i.e., to perform the data move), one set of loop instructions has been taken out of the loop and modified by eliminating the increment of AR0.

An N-weight adaptive LMS transversal filter on the TMS320C30 requires $3N+15$ instruction cycles, of which N cycles perform Equation (1) and 2N cycles perform Equation (10). The TMS320C30 example program is given in Appendix A2.

The LMS algorithm considerably reduces the computational requirements by using a simplified mean square error estimator (an estimate of the gradient). This algorithm has proved useful and effective in many applications. However, it has several performance limitations, such as slow initial convergence, an undesirable dependence of its convergence rate on the input signal statistics, and an excess mean square error that persists after convergence.

## Symmetric Transversal Structure [5]

A transversal filter with symmetric impulse response (weight values) about the center weight has a linear phase response. In applications such as speech processing, linear phase filters are preferred since they avoid phase distortion by causing all the components in the filter input to be delayed by the same amount. The adaptive symmetric transversal structure is shown in Figure 9.


Figure 9. Symmetric Transversal Structure (even order)
This filter is actually an FIR filter with an impulse response that is symmetric about the center tap. The output of the filter is obtained as

$$
\begin{equation*}
y(n)=\sum_{i=0}^{N / 2-1} w_{i}(n)[x(n-i)+x(n-N+i+1)] \tag{15a}
\end{equation*}
$$

where N is an even number. Note that, for fixed-point processors, the addition in the brackets may overflow because the input signals $x(n-i)$ and $x(n-N+i+1)$ each lie in the range from $-1$ to $1-2^{-15}$. This problem can be solved by shifting $x(n)$ to the right by one bit. The update of the weight vector is

$$
\begin{equation*}
w_{i}(n+1)=w_{i}(n)+u e(n)[x(n-i)+x(n-N+i+1)] \tag{15b}
\end{equation*}
$$

for $i=0,1, \ldots, N/2-1$, which requires $N/2$ multiplications and N additions. In theory, the symmetric structure reduces computational complexity, since such filters require only half the multiplications of the general transversal filter. In practice, this holds only for the TMS320C30. On the TMS320C25, the transversal structure is more efficient than the symmetric transversal structure because of the pipelined multiply-accumulate instruction MACD, which is optimized to implement the convolution in Equation (1).
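The symmetric structure can be sketched in C by folding the delay line so that each weight multiplies the sum of the two taps equidistant from the center, with the weight update using that same sum. The function name `sym_lms_step` is hypothetical, floating point is used so that the overflow issue mentioned above does not arise, and N is assumed to be even.

```c
#include <stddef.h>

/* One iteration of the symmetric transversal LMS filter.
   xn[i] holds x(n-i) for i = 0..N-1 (N even); wn has N/2 entries.
   Returns the error e(n). */
double sym_lms_step(double *wn, const double *xn, size_t n_taps,
                    double d, double u)
{
    size_t half = n_taps / 2;
    double y = 0.0;

    /* Output: y(n) = sum of w_i * [x(n-i) + x(n-N+i+1)] */
    for (size_t i = 0; i < half; i++)
        y += wn[i] * (xn[i] + xn[n_taps - 1 - i]);

    double e = d - y;
    double uen = u * e;

    /* Update: w_i(n+1) = w_i(n) + u e(n) [x(n-i) + x(n-N+i+1)] */
    for (size_t i = 0; i < half; i++)
        wn[i] += uen * (xn[i] + xn[n_taps - 1 - i]);
    return e;
}
```

Only N/2 weights are stored and multiplied, which is where the saving in multiplications comes from on a processor that can overlap the extra additions with other work.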

## TMS320C25 Implementation

For TMS320C25, in order to implement the instructions MAC, ZALR, and MPYA, we can trade memory requirements for computation saving by defining

$$
\begin{equation*}
\mathrm{z}(\mathrm{n}-\mathrm{i})=\mathrm{x}(\mathrm{n}-\mathrm{i})+\mathrm{x}(\mathrm{n}-\mathrm{N}+\mathrm{i}+1), \quad \mathrm{i}=0,1, \ldots, \mathrm{N} / 2-1 \tag{16a}
\end{equation*}
$$

Now, Equation (15) can be expressed as

$$
\begin{align*}
& y(n)=\sum_{i=0}^{N / 2-1} w_{i}(n) z(n-i)  \tag{16b}\\
& w_{i}(n+1)=w_{i}(n)+u e(n) z(n-i), \quad i=0,1, \ldots, N / 2-1 \tag{16c}
\end{align*}
$$

Equation (16a) can be implemented using the TMS320C25 as

| Label | Instruction | Operands | Comment |
| :--- | :--- | :--- | :--- |
|  | `LARK` | `AR1,N/2-1` | ; Counter = N/2-1 |
|  | `LRLK` | `AR2,LAST_X` | ; Point to $x(n-N+1)$ |
|  | `LRLK` | `AR3,FIRST_X` | ; Point to $x(n)$ |
|  | `LRLK` | `AR4,FIRST_Z` | ; Point to $z(n)$ |
|  | `LARP` | `AR3` |  |
| SYM | `LAC` | `*+,0,AR2` |  |
|  | `ADD` | `*-,0,AR4` |  |
|  | `SACL` | `*+,0,AR1` |  |
|  | `BANZ` | `SYM,*-,AR3` |  |

The instruction sequence used to implement the LMS algorithm in Equations (1) and (10) can also be used to implement Equations (16b) and (16c), except that MAC is used instead of MACD in Program (A). Therefore, N instruction cycles are needed to shift the data in $x(n)$, 3N instruction cycles are needed to implement Equation (16a), N/2 for Equation (16b), and 3N for Equation (16c). The total number of instruction cycles required to implement the symmetric transversal filter with the LMS algorithm is $7.5N+38$, where 7.5N is an integer because N is chosen as an even number. The 0.5N instruction cycles come from Equation (15a), since the symmetric transversal structure folds the filter taps into half of the order N (see Figure 9). The maximum filter length for the most efficient code, 256, is the same as for the FIR filter. The additional data memory needed can be obtained from the reduced data memory requirement for the weights of the symmetric transversal filter. The complete TMS320C25 program is given in Appendix B1.

Note that instead of storing buffer locations $\mathrm{x}(\mathrm{n})$ contiguously, then using DMOV to shift data in the buffer memory (requiring N cycles) at the end of each iteration, we can use a circular buffer with pointers pointing to $x(n)$ and $x(n-N+1)$. Since pointer updating requires several instruction cycles, compared with N cycles using DMOV to update the buffer memory contents, the circular buffer technique is more efficient if N is large.

## TMS320C30 Implementation

As mentioned above, the TMS320C30 uses a circular buffer instead of the data move technique. Therefore, it does not have to implement the tapped delay line separately as the TMS320C25 does. Equations (1) and (16a) can be combined and implemented in the same loop. The advantage is that a parallel instruction reduces the number of instruction cycles. The implementation is shown as follows:

| Label | Instruction | Operands | Comment |
| :--- | :--- | :--- | :--- |
|  | `LDF` | `0.0,R2` | ; Clear R2 |
|  | `LDI` | `order/2-2,RC` | ; Set up loop counter |
|  | `RPTB` | `INNER` | ; Do i = 0, N/2-2 |
|  | `ADDF3` | `*AR4++(1)%,*AR5` | ; z(i) = x(n-i)+x(n+N-i) |
|  | `MPYF3` | `R1,*AR1++(1),R3` | ; R3 = w[] * z[] |
|  | `STF` | `R1,*AR2++(1)` | ; Store z(i) |
| INNER | `ADDF3` | `R3,R2,R2` | ; Accumulate the result for y |
|  | `ADDF3` | `*AR4++(1)%,*AR5` | ; For i = N/2-1 |
|  | `MPYF3` | `R1,*AR1--(IR0),R3` |  |
|  | `STF` | `R1,*AR2--(IR0)` |  |
|  | `ADDF3` | `R3,R2,R2` | ; Include last product |

where AR4 and AR5 point to x[0] and x[N-1], and AR1 and AR2 point to the w and z arrays, respectively. IR0 contains the value $N/2-1$. The weight-update code of the transversal filter can be reused for the symmetric transversal structure by changing the x array pointer to the z array pointer. Appendix B2 presents an example program. The total number of instruction cycles needed is $2.5N+15$, which is less than that of the transversal structure.

## Lattice Structure [6]

An alternative FIR filter realization is the lattice structure [26]. A discussion of the transversal filter with the LMS algorithm shows that the convergence rate of the transversal structure is restricted by the correlation of signal components; i.e., the eigenvalue spread, $\lambda_{\max } / \lambda_{\min }$. The lattice structure is a decorrelating transform based on a family of prediction error filters as illustrated in Figure 10. The recursive equations that describe the lattice predictor are

$$
\begin{align*}
& \mathrm{f}_{0}(\mathrm{n})=\mathrm{b}_{0}(\mathrm{n})=\mathrm{x}(\mathrm{n})  \tag{17a}\\
& \mathrm{f}_{\mathrm{m}}(\mathrm{n})=\mathrm{f}_{\mathrm{m}-1}(\mathrm{n})-\mathrm{k}_{\mathrm{m}}(\mathrm{n}) \mathrm{b}_{\mathrm{m}-1}(\mathrm{n}-1), \quad 0<\mathrm{m} \leq \mathrm{M}  \tag{17b}\\
& \mathrm{b}_{\mathrm{m}}(\mathrm{n})=\mathrm{b}_{\mathrm{m}-1}(\mathrm{n}-1)-\mathrm{k}_{\mathrm{m}}(\mathrm{n}) \mathrm{f}_{\mathrm{m}-1}(\mathrm{n}), \quad 0<\mathrm{m} \leq \mathrm{M} \tag{17c}
\end{align*}
$$

where $f_{m}(n)$ represents the forward prediction error, $b_{m}(n)$ represents the backward prediction error, $k_{m}(n)$ is the reflection coefficient, $m$ is the stage index, and $M$ is the number of cascaded stages. The lattice structure has the advantage of being order-recursive. This property allows stages to be added to or deleted from the lattice without affecting the existing stages.




Figure 10. Lattice Structure
To implement the lattice filter for processing actual data, the reflection coefficients $k_{m}(n)$ are required. These coefficients can be computed from estimates of the autocorrelation coefficients using Durbin's algorithm. However, it is more efficient to estimate the reflection coefficients directly from the data and update them on a sample-by-sample basis, as in the LMS algorithm [6]. The reflection coefficient $k_{m}(n+1)$ can be recursively computed [7]:

$$
\begin{equation*}
\mathrm{k}_{\mathrm{m}}(\mathrm{n}+1)=\mathrm{k}_{\mathrm{m}}(\mathrm{n})+\mathrm{u}\left[\mathrm{f}_{\mathrm{m}}(\mathrm{n}) \mathrm{b}_{\mathrm{m}-1}(\mathrm{n}-1)+\mathrm{b}_{\mathrm{m}}(\mathrm{n}) \mathrm{f}_{\mathrm{m}-1}(\mathrm{n})\right], \quad 0<\mathrm{m} \leq \mathrm{M} \tag{18}
\end{equation*}
$$
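The lattice recursions and the reflection coefficient update can be sketched together in C as one function that processes a single input sample. The structure `lattice`, the capacity `LAT_MAX_STAGES`, and the function names are hypothetical; the caller supplies work arrays `f` and `b` with M+1 entries each.

```c
#include <stddef.h>

#define LAT_MAX_STAGES 32                 /* hypothetical capacity */

typedef struct {
    size_t m_stages;                      /* number of stages M */
    double k[LAT_MAX_STAGES + 1];         /* k_m(n), entries 1..M used */
    double b_prev[LAT_MAX_STAGES + 1];    /* b_m(n-1), entries 0..M */
} lattice;

void lattice_init(lattice *s, size_t m_stages)
{
    s->m_stages = m_stages;
    for (size_t m = 0; m <= LAT_MAX_STAGES; m++) {
        s->k[m] = 0.0;
        s->b_prev[m] = 0.0;
    }
}

/* Process one sample x(n): fill f[0..M] and b[0..M], then adapt k_m. */
void lattice_step(lattice *s, double x, double u, double *f, double *b)
{
    f[0] = b[0] = x;                                    /* f_0 = b_0 = x(n) */
    for (size_t m = 1; m <= s->m_stages; m++) {
        double km = s->k[m];
        double bd = s->b_prev[m - 1];                   /* b_{m-1}(n-1) */
        f[m] = f[m - 1] - km * bd;                      /* forward error */
        b[m] = bd - km * f[m - 1];                      /* backward error */
        s->k[m] = km + u * (f[m] * bd + b[m] * f[m - 1]);  /* k_m(n+1) */
    }
    for (size_t m = 0; m <= s->m_stages; m++)
        s->b_prev[m] = b[m];          /* becomes b_m(n-1) for the next call */
}
```

Because each stage reads only quantities from the stage before it, changing `m_stages` adds or removes stages without touching the coefficients already adapted, which reflects the order-recursive property noted above.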

For applications such as noise cancellation, channel equalization, line enhancement, etc., the joint-process estimation [3] illustrated in Figure 11 is required. This device performs two optimum estimations: the lattice predictor and the multiple regression filter. The following equations define the implementation of the regression filter

$$
\begin{align*}
& e_{0}(n)=d(n)-b_{0}(n) g_{0}(n)  \tag{19a}\\
& e_{m}(n)=e_{m-1}(n)-b_{m}(n) g_{m}(n), \quad 0<m \leq M  \tag{19b}\\
& g_{m}(n+1)=g_{m}(n)+u\, e_{m}(n) b_{m}(n), \quad 0 \leq m \leq M \tag{20}
\end{align*}
$$

where the LMS algorithm is used to update the coefficients of the regression filter. For noise cancellation application, $e_{m}(n)$ corresponds to the output $e(n)$ in Figure 5. For applications such as adaptive line enhancer and channel equalizer, filter output $\mathrm{y}(\mathrm{n})$ is obtained as

$$
\begin{equation*}
y(n)=\sum_{m=0}^{M} g_{m}(n) b_{m}(n) \tag{21}
\end{equation*}
$$



Figure 11. Lattice Structure with Joint Process Estimation
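The regression part of the joint-process estimator can be sketched in C as shown below. The sketch takes the stage error as $e_{m}(n)=e_{m-1}(n)-g_{m}(n) b_{m}(n)$ with $e_{-1}(n)=d(n)$, which is the form that makes the final error equal to $d(n)-y(n)$ with $y(n)$ from Equation (21); this reading, the function name `joint_process_step`, and the argument list are assumptions. The backward errors `b[0..M]` are expected to come from a lattice predictor for the current sample.

```c
#include <stddef.h>

/* One iteration of the multiple regression filter.
   b[0..M]: backward prediction errors b_m(n) for the current sample.
   g[0..M]: regression weights, updated in place to g_m(n+1).
   Returns y(n); the final-stage error e_M(n) is written to *e_out. */
double joint_process_step(double *g, const double *b, size_t m_stages,
                          double d, double u, double *e_out)
{
    double y = 0.0;
    double e = d;                          /* error before stage 0 */
    for (size_t m = 0; m <= m_stages; m++) {
        double contrib = g[m] * b[m];
        y += contrib;                      /* accumulate output, Equation (21) */
        e -= contrib;                      /* e_m(n) = e_{m-1}(n) - g_m(n) b_m(n) */
        g[m] += u * e * b[m];              /* LMS update, Equation (20) */
    }
    *e_out = e;
    return y;
}
```

For noise cancellation the value written to `*e_out` is the system output, while for line enhancement or equalization the returned `y(n)` is used.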

## TMS320C25/TMS320C30 Implementation

Five memory locations, $f_{m}(n)$, $b_{m}(n)$, $b_{m}(n-1)$, $k_{m}(n)$, and $g_{m}(n)$, are required for each stage. The on-chip data RAM is limited to 544 words on the TMS320C25 and 2K words on the TMS320C30. A maximum of 102 stages can therefore be implemented on a single TMS320C25 for the highest throughput. The TMS320C30 architecture offers a further advantage here. Since the operands of arithmetic operations can be either memory or registers on the TMS320C30, and there is no need to preserve the values of the $f_{m}$ array for the next iteration (refer to Equations (17) and (18)), the $f_{m}$ array can be replaced by an extended-precision register. Thus, for the most efficient code, the stage limit of the lattice structure on the TMS320C30 is 512, or one-fourth of the 2K on-chip RAM.

Lattice structures have superior convergence properties relative to transversal structures and good stability properties; e.g., low sensitivity to coefficient quantization, low roundoff noise, and the ability to check stability by inspection. The disadvantages of lattice filter algorithms are that they are numerically complex and require mathematical sophistication to thoroughly understand their derivations. Furthermore, as shown in Appendixes C1 and C2, lattice structures cannot take advantage of the TMS320C25 and TMS320C30's pipeline architecture to achieve high throughput. The total number of instruction cycles needed is $33 \mathrm{M}+32$ for TMS320C25 and $14 \mathrm{M}+4$ for TMS320C30.

## Modified LMS Algorithms [5]

The LMS algorithm described in previous sections is the most widely used algorithm in practical applications today. In this section, a set of LMS-type algorithms (all direct variants of the LMS algorithm) are presented and implemented. The motivation for each is some practical consideration, such as faster convergence, simplicity in implementation, or robustness in operation. The description of these algorithms is based on the transversal structure. However, these algorithms can be applied to the symmetric transversal structure and the lattice structure as well.

## Normalized LMS Algorithm

The stability, convergence time, and fluctuation of the adaptation process are governed by the step size $u$ and the input power to the adaptive filter. In some practical applications, you may need an automatic gain control (AGC) on the input to the adaptive filter. The normalized LMS algorithm is one important technique used to improve the speed of convergence. This is accomplished while maintaining the steady-state performance independent of the input signal power. This algorithm uses a variable convergence factor $u(n)$, that is, a step size $u$ that is a function of the time index,

$$
\begin{equation*}
\mathrm{u}(\mathrm{n})=a / \operatorname{var}(\mathrm{n}) \tag{22}
\end{equation*}
$$

and

$$
\begin{equation*}
\underline{\mathrm{w}}(\mathrm{n}+1)=\underline{\mathrm{w}}(\mathrm{n})+\mathrm{u}(\mathrm{n}) \mathrm{e}(\mathrm{n}) \underline{\mathrm{x}}(\mathrm{n}) \tag{23}
\end{equation*}
$$

where $a$ is a convergence parameter, and $\operatorname{var}(\mathrm{n})$ is an estimate of the input average power at time n using the recursive equation

$$
\begin{equation*}
\operatorname{var}(\mathrm{n})=(1-b) \operatorname{var}(\mathrm{n}-1)+b \mathrm{x}^{2}(\mathrm{n}) \tag{24}
\end{equation*}
$$

where $0<b \ll 1$ is a smoothing parameter. In practice, $a$ is chosen equal to $b$.
For fixed-point processors, there is a way to reduce the computation of power estimation. Since $b$ in Equation (24) doesn't have to be an exact number, it is computationally convenient to make $b$ a power of 2 . If $b=2^{-m}$, the multiplication of $b$ can be implemented by shifting right m bits. Therefore, the $\operatorname{var}(\mathrm{n})$ in Equation (24) is computed by

$$
\begin{aligned}
\operatorname{var}(\mathrm{n}) & =\operatorname{var}(\mathrm{n}-1)-b \operatorname{var}(\mathrm{n}-1)+b \mathrm{x}^{2}(\mathrm{n}) \\
& =\operatorname{var}(\mathrm{n}-1)-\operatorname{var}(\mathrm{n}-1) * 2^{-\mathrm{m}}+\mathrm{x}^{2}(\mathrm{n}) * 2^{-\mathrm{m}}
\end{aligned}
$$
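The shift-based power estimate can be sketched in portable C with Q15 data. The sketch keeps the running sum in a 32-bit accumulator and truncates only when the result is stored; the function name `power_update_q15` is hypothetical, `var_prev` is assumed to be non-negative, and `m` is assumed to lie between 1 and 15. Multiplication by $2^{16-m}$ is used in place of a left shift only to keep the C code free of shifts on signed values that might be negative.

```c
#include <stdint.h>

/* var(n) = var(n-1) - var(n-1)*2^-m + x^2(n)*2^-m, all values in Q15.
   The accumulator holds var in its upper 16 bits, so scaling by 2^-m
   becomes scaling by 2^(16-m) in accumulator units. */
int16_t power_update_q15(int16_t var_prev, int16_t x, unsigned m)
{
    int32_t x2    = ((int32_t)x * (int32_t)x) >> 15;  /* x^2(n) in Q15 */
    int32_t scale = (int32_t)65536 >> m;              /* 2^(16-m) */
    int32_t acc   = (int32_t)var_prev * 65536;        /* var(n-1) in high half */

    acc -= (int32_t)var_prev * scale;                 /* - b * var(n-1) */
    acc += x2 * scale;                                /* + b * x^2(n) */
    return (int16_t)(acc >> 16);                      /* keep the high half */
}
```

For example, with $\operatorname{var}(n-1)=0.5$, $x(n)=0.5$, and $m=2$ (so $b=0.25$), the update gives $0.5-0.125+0.0625=0.4375$, or 14336 in Q15.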

Then, assuming the variance $\operatorname{var}(n)$ of the input signal is stored in the data memory location VAR and its initial value is $0.99997\left(=1-2^{-15}\right)$, the implementation of this equation using TMS320C25 assembly code is

| Instruction | Operands | Comment |
| :--- | :--- | :--- |
| `LARP` | `AR3` |  |
| `LRLK` | `AR3,FRSTAP` | ; Point to input signal x |
| `SQRA` | `*` | ; Square input signal |
| `SPH` | `ERRF` |  |
| `ZALH` | `VAR` | ; ACC = var(n-1) |
| `SUB` | `VAR,SHIFT` | ; ACC = (1-b) var(n-1) |
| `ADD` | `ERRF,SHIFT` | ; ACC = (1-b) var(n-1) + b x^2(n) |
| `SACH` | `VAR` | ; Store var(n) |

The normalized LMS algorithm can be implemented as

$$
\begin{aligned}
& \text { var }=b_{1} * \operatorname{var}+b * \operatorname{xn}[0] * \operatorname{xn}[0] \\
& \text { unen }=\mathrm{e}[\mathrm{n}] * a / \operatorname{var} \\
& \text { for }(\mathrm{i}=0 ; \mathrm{i}<\mathrm{N} ; \mathrm{i}++) \\
& \text { wn[i] }+=\text { unen } * \mathrm{xn}[\mathrm{i}]
\end{aligned}
$$

where $b_{1}=(1-b), \mathrm{xn}[0]=\mathrm{x}(\mathrm{n})$, and unen $=\mathrm{u}(\mathrm{n}) * \mathrm{e}(\mathrm{n})$. This normalized technique reduces the dependency of convergence speed on input signal power at the cost of increased computational complexity, chiefly the division in Equation (22). Algorithms for implementing fixed-point and floating-point division on the TMS320C25 and TMS320C30 can be found in the user's guide for each device [13, 14]. Since the power of the input signal is always positive, those routines can be simplified to save computation time.
Since the power estimation in Equation (24) and step size normalization in Equation (22) are performed once for each sample $\mathrm{x}(\mathrm{n})$, the computation increase can be ignored when N is large. As shown in Appendixes D1 and D2, the total number of instruction cycles needed for the normalized LMS algorithm ( $7 \mathrm{~N}+57$ for the TMS320C25 and $3 \mathrm{~N}+47$ for the TMS320C30) is slightly higher than for the LMS algorithm ( $7 \mathrm{~N}+34$ and $3 \mathrm{~N}+15$ ) when N is large.
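The pseudocode above can be fleshed out into a small floating-point C routine. This is only a sketch under stated assumptions: the function name `nlms_update`, its argument list, and the `1e-6f` floor on the divisor are illustrative choices that do not appear in the appendix programs.

```c
#include <assert.h>
#include <stddef.h>

/* One normalized-LMS iteration (floating point).
 * xn[0] holds the newest sample x(n); xn[i] holds x(n-i).
 * var_est holds var(n-1) on entry and var(n) on return.
 * a is the convergence parameter, b the smoothing parameter. */
void nlms_update(float *wn, const float *xn, size_t n_taps,
                 float err, float a, float b, float *var_est)
{
    assert(b > 0.0f && b < 1.0f);

    /* Equation (24): var(n) = (1 - b) var(n-1) + b x^2(n) */
    *var_est = (1.0f - b) * (*var_est) + b * xn[0] * xn[0];

    /* Assumed guard (not in the original): keep the divisor above zero. */
    float var = (*var_est > 1e-6f) ? *var_est : 1e-6f;

    /* unen = u(n) e(n) = a e(n) / var(n), computed once per sample */
    float unen = err * a / var;

    for (size_t i = 0; i < n_taps; i++) {
        wn[i] += unen * xn[i];          /* Equation (23) */
    }
}
```

The division and the power estimate sit outside the loop, which is why their cost becomes negligible as N grows.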

## Sign LMS Algorithms

The LMS algorithm requires $2N$ multiplications and additions for each iteration; this amount is much lower than the requirements for many other complicated adaptive algorithms, such as Kalman and Recursive Least Square (RLS) [3]. However, there are three simplified versions of the LMS algorithm (sign-error LMS, sign-data LMS, and sign-sign LMS) that reduce the number of multiplications required and extend the real-time bandwidth for some applications [5, 27].

First, the sign-error LMS algorithm can be expressed as

$$
\begin{equation*}
\underline{w}(n+1)=\underline{w}(n)+u \operatorname{sign}[e(n)] \underline{x}(n) \tag{25}
\end{equation*}
$$

where

$$
\operatorname{sign}[\mathrm{e}(\mathrm{n})]= \begin{cases}+1, & \text { if } \mathrm{e}(\mathrm{n}) \geq 0 \\ -1, & \text { if } \mathrm{e}(\mathrm{n})<0\end{cases}
$$

The C program implementation of sign-error LMS algorithm is

```
tu = u;
if (e[n] < 0.) {
    tu = -u;}
for (i=0; i<N; i++){
    wn[i] += tu * xn[i];
}
```

As shown in Appendixes E1 and E2, the instruction sequence that implements the weight update for the sign-error LMS algorithm is identical to that for the LMS algorithm. The difference is that the sign-error LMS algorithm uses $\operatorname{sign}[\mathrm{e}(\mathrm{n})] * u$ instead of $\mathrm{e}(\mathrm{n}) * u$ before the update loop. Note that, for fixed-point processors, if $u$ is chosen to be a power of two, the product $u \underline{\mathrm{x}}(\mathrm{n})$ can be computed by shifting the elements of $\underline{\mathrm{x}}(\mathrm{n})$ right. This algorithm keeps the same convergence direction as the LMS algorithm. Thus, the sign-error LMS algorithm should remain efficient, provided the variable gain $u(n)$ is matched to this change. However, using a constant step size $u$ to reduce computation comes at the expense of a slow convergence rate, since a smaller $u$ is normally used for stability reasons.

The programs in Appendixes E1 and E2 implement a transversal filter with the sign-error LMS algorithm in looped code. The total number of instruction cycles needed for this algorithm on the TMS320C25 is $7 \mathrm{~N}+26$, which is slightly less than the $7 \mathrm{~N}+28$ of the LMS algorithm: computing $\mathrm{u}^{*} \mathrm{e}(\mathrm{n})$ takes 5 instruction cycles, whereas the sign-error LMS algorithm determines the sign of $u$ by checking the sign of $e(n)$, which takes only 3 instruction cycles. The total number of instruction cycles needed for the sign-error LMS algorithm on the TMS320C30 is $3 N+16$, which is slightly higher than for the LMS algorithm. This occurs because the TMS320C30 takes only one instruction cycle to compute $u^{*} e(n)$ but two instruction cycles to determine the sign of $u$.
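The remark about choosing $u$ as a power of two can be illustrated with a Q15 sketch in C. The function name `sign_error_lms_q15`, the shift count `m` (so that $u=2^{-m}$), and the explicit saturation are assumptions made for this example; the appendix programs are written in assembly and are organized differently.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Sign-error LMS weight update in Q15 with u = 2^-m.
 * u * x(n-i) reduces to x(n-i) / 2^m, and sign[e(n)] selects
 * whether that term is added or subtracted. */
void sign_error_lms_q15(int16_t *wn, const int16_t *xn, size_t n_taps,
                        int16_t err, unsigned m)
{
    assert(m < 15);
    for (size_t i = 0; i < n_taps; i++) {
        /* Division by 2^m stands in for the right shift. It truncates
         * toward zero, whereas a hardware arithmetic shift rounds
         * toward minus infinity for negative values. */
        int32_t step = xn[i] / (1 << m);
        int32_t w = wn[i] + ((err >= 0) ? step : -step);

        /* Assumed saturation, mirroring overflow mode on the DSP. */
        if (w > INT16_MAX) w = INT16_MAX;
        if (w < INT16_MIN) w = INT16_MIN;
        wn[i] = (int16_t)w;
    }
}
```

No multiplication appears in the loop; the only per-sample decision is the sign test on `err`, which could equally be hoisted out of the loop as in the C listing above.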

Secondly, the sign-data LMS algorithm is

$$
\begin{equation*}
\underline{\mathrm{w}}(\mathrm{n}+1)=\underline{\mathrm{w}}(\mathrm{n})+\mathrm{u} \mathrm{e}(\mathrm{n}) \operatorname{sign}[\underline{\mathrm{x}}(\mathrm{n})] \tag{26}
\end{equation*}
$$

This equation can be implemented as

$$
\begin{aligned}
w_{i}(n+1) & =w_{i}(n)+u e(n), \text { if } x(n-i)>=0 \\
& =w_{i}(n)-u e(n), \text { if } x(n-i)<0
\end{aligned}
$$

for $i=0,1, \ldots, N-1$. Since the sign determination is required inside the adaptation loop to determine the sign of $x(n-i)$, slower throughput is expected. The total number of instruction cycles needed is $11 \mathrm{~N}+26$ for the TMS320C25 and $5 \mathrm{~N}+16$ for the TMS320C30.
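A floating-point C sketch of the sign-data update makes the extra per-tap decision visible. The function name `sign_data_lms` and its parameters are illustrative assumptions, not taken from the appendix programs.

```c
#include <assert.h>
#include <stddef.h>

/* Sign-data LMS: w_i(n+1) = w_i(n) +/- u e(n), depending on sign[x(n-i)]. */
void sign_data_lms(float *wn, const float *xn, size_t n_taps,
                   float err, float u)
{
    assert(u > 0.0f);
    const float ue = u * err;      /* one multiplication per sample */
    for (size_t i = 0; i < n_taps; i++) {
        /* The sign test runs once per tap, inside the loop. */
        if (xn[i] >= 0.0f) {
            wn[i] += ue;
        } else {
            wn[i] -= ue;
        }
    }
}
```

The multiplication has moved out of the loop, but the conditional has moved in, which is consistent with the higher per-tap cycle counts quoted above.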

Finally, the sign-sign LMS algorithm is

$$
\begin{equation*}
\underline{w}(n+1)=\underline{w}(n)+u \operatorname{sign}[e(n)] \operatorname{sign}[\underline{x}(n)] \tag{27}
\end{equation*}
$$

which requires no multiplications at all and is used in the CCITT standard for ADPCM transmission. As we can see from the above equations, the number of multiplications is reduced. This simplified LMS algorithm looks promising and is designed for VLSI or discrete IC implementation to save multiplications.

The sign-sign LMS algorithm can be implemented as

$$
\begin{aligned}
& \text { for }(\mathrm{i}=0 ; \mathrm{i}<\mathrm{N} ; \mathrm{i}++)\{ \\
& \quad \text { if }(\mathrm{e}[\mathrm{n}]>=0 .)\{ \\
& \quad \quad \text { if }(\mathrm{xn}[\mathrm{i}]>=0 .) \\
& \quad \quad \quad \mathrm{wn}[\mathrm{i}]+=\mathrm{u} ; \\
& \quad \quad \text { else } \\
& \quad \quad \quad \mathrm{wn}[\mathrm{i}]-=\mathrm{u} ;\} \\
& \quad \text { else }\{ \\
& \quad \quad \text { if }(\mathrm{xn}[\mathrm{i}]>=0 .) \\
& \quad \quad \quad \mathrm{wn}[\mathrm{i}]-=\mathrm{u} ; \\
& \quad \quad \text { else } \\
& \quad \quad \quad \mathrm{wn}[\mathrm{i}]+=\mathrm{u} ;\}\}
\end{aligned}
$$

When this algorithm is implemented on TMS320C25 and TMS320C30 with pipeline architecture and a parallel multiplier, the performance of sign-sign LMS algorithm is poor compared to standard LMS algorithm due to the determination of sign of data, which can break the instruction pipeline and can severely reduce the execution speed of the processors.

In order to avoid double branches inside the loop, the XOR instruction is utilized to check the sign bit of $e(n)$ and $x(n-i)$. The sign-sign LMS algorithm can be implemented as

$$
\begin{aligned}
w_{i}(n+1) & =w_{i}(n)+u, \text { if } \operatorname{sign}[e(n)]=\operatorname{sign}[x(n-i)] \\
& =w_{i}(n)-u, \text { otherwise }
\end{aligned}
$$

The following TMS320C25 instruction sequence implements this algorithm without branching (assuming that the current address register used is AR3):

|  | LRLK | AR1,N-1 | ; Set up counter |
| :---: | :---: | :---: | :---: |
|  | LRLK | AR2,COEFFD | ; Point to $\mathrm{w}_{\mathrm{i}}(\mathrm{n})$ |
|  | LRLK | AR3,LASTAP+1 | ; Point to $\mathrm{x}(\mathrm{n}-\mathrm{i})$ |
| ADAP | LAC | *-,0,AR2 | ; Load x(n-i) |
|  | XOR | ERR | ; XOR with e(n) |
|  | SACL | ERRF | ; Save sign bit: sign = 0 if same signs, sign = 1 if different signs |
|  | LAC | ERRF | ; Sign extension to ACCH: ACCH = 0 if ERRF >= 0, ACCH = 0FFFFh if ERRF < 0 |
|  | XORK | MU,15 | ; Take one's complement of $u$ if sign = 1 |
|  | ADD | *,15 | ; Weight update |
|  | SACH | *+,1,AR1 | ; Save new weight |
|  | BANZ | ADAP,*-,AR3 |  |

The one's complement of $u$ is used instead of $-u$, because they are only slightly different and the step size does not require the exact number. The weight update with this technique requires 10 N instruction cycles and FIR filtering requires N instruction cycles so that the total number of instruction cycles needed is $11 \mathrm{~N}+21$. The complete TMS320C25 assembly program is given in Appendix F1.
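The XOR technique can be mirrored in C to show why the one's complement is an acceptable stand-in for $-u$. This is a sketch only: the helper names `signed_step` and `sign_sign_lms_q15` and the explicit saturation are assumptions for illustration, and the narrowing casts rely on the usual two's-complement behavior of mainstream compilers.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Returns u when e and x have equal sign bits, and ~u (equal to -u - 1)
 * when the sign bits differ. No branch is used. */
int16_t signed_step(int16_t e, int16_t x, int16_t u)
{
    /* Sign bit of (e XOR x): 0 for equal signs, 1 for different signs. */
    uint16_t diff = (uint16_t)(((uint16_t)e ^ (uint16_t)x) >> 15);
    /* Expand the bit into a mask of 0x0000 or 0xFFFF, as sign
     * extension into ACCH does in the assembly sequence. */
    uint16_t mask = (uint16_t)(0u - diff);
    return (int16_t)((uint16_t)u ^ mask);
}

/* Sign-sign LMS weight update built on the branch-free step. */
void sign_sign_lms_q15(int16_t *wn, const int16_t *xn, size_t n_taps,
                       int16_t err, int16_t u)
{
    assert(u > 0);
    for (size_t i = 0; i < n_taps; i++) {
        int32_t w = (int32_t)wn[i] + signed_step(err, xn[i], u);
        if (w > INT16_MAX) w = INT16_MAX;   /* assumed saturation */
        if (w < INT16_MIN) w = INT16_MIN;
        wn[i] = (int16_t)w;
    }
}
```

With `u = 100`, the two possible steps are 100 and -101, so the asymmetry is a single least-significant bit.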

Determining without branching whether a positive or negative $u$ should be used is trickier on the TMS320C30. Fortunately, the extended-precision registers of the TMS320C30 interpret the 32 most-significant bits of the 40-bit data as a floating-point number and the 32 least-significant bits of the 40-bit data as an integer. When a floating-point number changes its sign, its exponent remains the same. Therefore, the sign of step size $u$ can be determined by using XOR logic on its mantissa. The following code shows how the sign-sign LMS algorithm is implemented on the TMS320C30.

|  | ASH | -31,R7 | ; R7 $=\operatorname{Sign}[\mathrm{e}(\mathrm{n})]$ |
| :---: | :---: | :---: | :---: |
|  | XOR3 | R0,R7,R5 | ; R5 $=\operatorname{Sign}[\mathrm{e}(\mathrm{n})] * \mathrm{u}$ |
|  | LDF | *AR0++(1)%,R6 | ; R6 $=\mathrm{x}(\mathrm{n})$ |
|  | ASH | -31,R6 | ; R6 $=\operatorname{Sign}[\mathrm{x}(\mathrm{n}-\mathrm{i})]$ |
|  | XOR3 | R5,R6,R4 | ; R4 $=\operatorname{Sign}[\mathrm{x}(\mathrm{n}-\mathrm{i})] * \operatorname{Sign}[\mathrm{e}(\mathrm{n})] * \mathrm{u}$ |
|  | ADDF3 | *AR1,R4,R3 | ; R3 $=\mathrm{w}_{\mathrm{i}}(\mathrm{n})+\mathrm{R} 4$ |
|  | LDI | order-3,RC | ; Initialize repeat counter |
|  | RPTB | SSLMS | ; Do i = 0, N-3 |
|  | LDF | *AR0++(1)%,R6 | ; Get next data |
| \|\| | STF | R3,*AR1++(1)% | ; Update $\mathrm{w}_{\mathrm{i}}(\mathrm{n}+1)$ |
|  | ASH | -31,R6 | ; Get the sign of data |
|  | XOR3 | R5,R6,R4 | ; Decide the sign of $u$ |
| SSLMS | ADDF3 | *AR1,R4,R3 | ; R3 $=\mathrm{w}_{\mathrm{i}}(\mathrm{n})+\mathrm{R} 4$ |
|  | LDF | *AR0,R6 | ; Get last data |
| \|\| | STF | R3,*AR1++(1)% | ; Update $\mathrm{w}_{\mathrm{N}-2}(\mathrm{n}+1)$ |
|  | ASH | -31,R6 | ; Get the sign of data |
|  | XOR3 | R5,R6,R4 | ; Decide the sign of $u$ |
|  | ADDF3 | *AR1,R4,R3 | ; Compute $\mathrm{w}_{\mathrm{N}-1}(\mathrm{n}+1)$ |
|  | STF | R3,*AR1++(1)% | ; Store last $\mathrm{w}(\mathrm{n}+1)$ |

Here, R0, R4, and R5 contain the value of $u$ before updating. AR0 and AR1 point to the $x$ array and the $w$ array, respectively. R7 contains the value of the error signal e(n). The complete program is given in Appendix F2. The total number of instruction cycles is $5 \mathrm{~N}+16$, which is much higher than for the LMS algorithm.

The sign-sign LMS algorithm is developed to reduce the multiplication requirement of the LMS algorithm. Since DSPs provide the hardware multiplier as a standard feature, this modification does not provide any advantage when implementing this algorithm on the DSPs. On the contrary, it causes some disadvantages since decision instructions will destroy the instruction pipeline. If you use the XOR logic operation in order to avoid using the decision instructions, the complexity of the program will be increased and the total number of instruction cycles will be greater than the regular LMS algorithm.

## Leaky LMS Algorithm

When adaptive filters are implemented on signal processors with fixed word lengths, roundoff noise is fed back to the adaptive weights and accumulates in time without bound. This leads to an overflow that is unacceptable for real-time applications. One solution is based upon adding a small forcing function, which tends to bias each filter weight toward zero. The leaky LMS algorithm has the form

$$
\begin{equation*}
\underline{\mathrm{w}}(\mathrm{n}+1)=\mathrm{r} \underline{\mathrm{w}}(\mathrm{n})+\mathrm{u} \mathrm{e}(\mathrm{n}) \underline{\mathrm{x}}(\mathrm{n}) \tag{28a}
\end{equation*}
$$

where $r$ is slightly less than 1 .
Since r can be expressed as $1-\mathrm{c}$ and $\mathrm{c} \ll 1$, the TMS320C25 can take advantage of the built-in shifters to implement this algorithm. Therefore, Equation (28a) can be changed to

$$
\begin{equation*}
\underline{\mathrm{w}}(\mathrm{n}+1)=\underline{\mathrm{w}}(\mathrm{n})-\mathrm{c} \underline{\mathrm{w}}(\mathrm{n})+\mathrm{u} \mathrm{e}(\mathrm{n}) \underline{\mathrm{x}}(\mathrm{n}) \tag{28b}
\end{equation*}
$$

In order to achieve the highest throughput by using ZALR and MPYA, $\mathrm{cw}(\mathrm{n})$ can be implemented by shifting $\mathrm{w}_{\mathrm{i}}(\mathrm{n})$ right by m bits where $2^{-\mathrm{m}}$ is close to c . Since the length of the accumulator is 32 bits and the high word (bits 16 to 31 ) is used for updating $w(n)$, shifting right $m$ bits of $w_{i}(n)$ can be implemented by loading $w_{i}(n)$ and shifting left 16 - m bits. The sequence of TMS320C25 instructions to implement Equation (28b) is shown as

|  | LRLK | AR1,N-1 | ; Set up counter |
| :---: | :---: | :---: | :---: |
|  | LRLK | AR2,COEFFD | ; Point to $\mathrm{w}_{\mathrm{i}}(\mathrm{n})$ |
|  | LRLK | AR3,LASTAP+1 | ; Point to $\mathrm{x}(\mathrm{n}-\mathrm{i})$ |
|  | LT | ERRF | ; $\mathrm{T}=\mathrm{ERRF}=\mathrm{u}^{*} \mathrm{e}(\mathrm{n})$ |
|  | MPY | *-,AR2 |  |
| ADAPT | ZALR | *,AR3 |  |
|  | MPYA | *-,AR2 |  |
|  | SUB | *,LEAKY | ; LEAKY = 16-m |
|  | SACH | *+,0,AR1 |  |
|  | BANZ | ADAPT,*-,AR2 |  |

For each iteration, 7 N instruction cycles are needed to perform the adaptation process ( 6 N for the LMS algorithm). The total number of instruction cycles needed is $8 \mathrm{~N}+28$ (see Appendix G1 for the complete program). The leaky factor $r$ has the same effect as adding a white noise to the input. This technique not only can solve adaptive weights overflow problem, but also can be beneficial in an insufficient spectral excitation and stalling situation [5].
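Equation (28b) with $c=2^{-m}$ can also be written as a short floating-point C routine, which may help when checking a fixed-point implementation against a reference. The name `leaky_lms_update` and its parameters are assumptions for this sketch; `ue` stands for the product $u \cdot e(n)$ that the assembly code keeps in ERRF.

```c
#include <assert.h>
#include <stddef.h>

/* Leaky LMS update, Equation (28b), with leakage c = 2^-m (r = 1 - c). */
void leaky_lms_update(float *wn, const float *xn, size_t n_taps,
                      float ue, unsigned m)
{
    assert(m > 0 && m < 31);
    const float c = 1.0f / (float)(1u << m);    /* c = 2^-m */
    for (size_t i = 0; i < n_taps; i++) {
        /* w(n+1) = w(n) - c w(n) + u e(n) x(n) */
        wn[i] = wn[i] - c * wn[i] + ue * xn[i];
    }
}
```

When the error is zero the weights simply decay by the factor $r$ on each call, which is the bias toward zero that keeps accumulated roundoff from growing without bound.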

The method used above is specific to the TMS320C25, which has a free shift feature. Since the TMS320C30 is a floating-point processor, the filter coefficients can simply be multiplied by $r$. However, in order to reduce the instruction cycles, this multiplication can be combined with another instruction to form a parallel instruction inside the loop. The following code shows how to rearrange the instructions from the LMS algorithm to include this multiplication without an extra instruction cycle.


Auxiliary registers AR0 and AR1 point to the $x$ and $w$ arrays. AR2 points to the memory location that contains the value $r$. R7 contains the value of the error signal e(n). R1 and R2 are updated before the loop because the parallel instructions inside the loop use the previous values in R1 and R2. Note that R1 is updated twice before the loop because the updating of R2 requires the previous value of R1. In order to update the $x$ array pointer to the new beginning of the data buffer for the next iteration, two of the loop instruction sets have been taken out of the loop and modified by eliminating the incrementation of AR0. The TMS320C30 assembly program of an adaptive transversal filter with the leaky LMS algorithm is listed in Appendix G2 as an example. The total number of instruction cycles for this algorithm is $3 N+15$, which is the same as for the LMS algorithm. This example shows the power and flexibility of the TMS320C30.

## Implementation Considerations

The adaptive filter structures and algorithms discussed previously were derived on the basis of infinite precision arithmetic. When implementing these structures and algorithms on a fixed integer machine, there is a limitation on the accuracy of these filters due to the fact that the DSP operates with a finite number of bits. Thus, designers must pay attention to the effects of finite word length. In general, these effects are input quantization, roundoff in the arithmetic operation, dynamic range constraints, and quantization of filter coefficients. These effects can either cause deviations from the original design criteria or create an effective noise at the filter output. These problems have been investigated extensively, and techniques to solve these problems have been developed [28, 29].

The effect of finite precision in adaptive filters is an active research area, and some significant results have been reported [30 through 32]. There are three categories of finite word length effects in adaptive filters:

- Dynamic Range Constraint (scaling to avoid overflow). Since this is not applicable for a floating-point processor, the TMS320C30 is not mentioned in this portion.
- Finite Precision Errors (errors introduced by roundoff in the arithmetic).
- Design Issues (design of the optimum step size $u$ that minimizes system noise).


## Dynamic Range Constraint

As shown in Figure 1, the most widely used LMS transversal filter is specified by the difference equations

$$
\begin{equation*}
y(n)=\sum_{i=0}^{N-1} w_{i}(n) x(n-i) \tag{29}
\end{equation*}
$$

and

$$
\begin{equation*}
\mathrm{w}_{\mathrm{i}}(\mathrm{n}+1)=\mathrm{w}_{\mathrm{i}}(\mathrm{n})+\mathrm{u}^{*} \mathrm{e}(\mathrm{n}) * \mathrm{x}(\mathrm{n}-\mathrm{i}), \text { for } \mathrm{i}=0,1, \ldots, \mathrm{~N}-1 \tag{30}
\end{equation*}
$$

where $x(n-i)$ is the input sequence and $w_{i}(n)$ are the filter coefficients.
If the input sequence and filter coefficients are properly normalized so that their values lie between -1 and 1 using Q15 format, no error is introduced into the addition. However, the sum of two numbers may become larger than one. This is known as overflow. The TMS320C25 provides four features that can be applied to handle overflow management [13]:
A. Branch on overflow conditions.
B. Overflow mode (saturation arithmetic).
C. Product register right shift.
D. Accumulator right shift.

One technique to reduce the probability of overflow is scaling, i.e., constraining each node within an adaptive filter to maintain a magnitude less than unity. In Equation (29), the condition for $|\mathrm{y}(\mathrm{n})|<1$ is

$$
\begin{equation*}
\mathrm{x}_{\max }<1 / \sum_{\mathrm{i}=0}^{\mathrm{N}-1}\left|\mathrm{w}_{\mathrm{i}}(\mathrm{n})\right| \tag{31}
\end{equation*}
$$

where $\mathrm{x}_{\max }$ denotes the maximum of the absolute value of the input. The right shifter of the TMS320C25, which operates with no cycle overhead, can be applied to implement scaling to prevent overflow of the multiply-accumulate operations in Equation (29). By setting the PM bits of status register ST1 to 11 using the SPM or LST1 instructions, the P register output is right-shifted 6 places. This allows up to 128 accumulations without the possibility of an overflow. The SFR instruction can also be used to shift the accumulator right by one bit when it is near overflow.
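The bound in Equation (31) is easy to evaluate off line for a given weight set. The following C sketch does so in floating point; the function name `max_safe_input` is an assumption for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Equation (31): the input magnitude must stay below 1 / sum |w_i(n)|
 * for |y(n)| < 1 to be guaranteed. Returns that upper bound. */
float max_safe_input(const float *wn, size_t n_taps)
{
    float sum_abs = 0.0f;
    for (size_t i = 0; i < n_taps; i++) {
        sum_abs += (wn[i] < 0.0f) ? -wn[i] : wn[i];
    }
    assert(sum_abs > 0.0f);     /* an all-zero filter imposes no bound */
    return 1.0f / sum_abs;
}
```

For weights {0.5, -1.0, 0.5} the sum of magnitudes is 2, so the input must be kept below 0.5 in magnitude. Because the weights change during adaptation, the bound would have to be chosen for the largest weight set expected.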

Another effective technique to prevent overflow in the computation of Equation (29) is using saturation arithmetic. As illustrated in Figure 12, if the result of an addition overflows, the output is clamped at the maximum value. If saturation arithmetic is used, it is common practice [28] to permit the amplitude of $\mathrm{x}(\mathrm{n}-\mathrm{i})$ to be larger than the upper bound given in Equation (31). Saturation of the filter represents a distortion, and the choice of scaling on the input depends on how often such distortion is permissible. The saturation arithmetic on the TMS320C25 is controlled by the OVM bit of status register ST0 and can be changed by the SOVM (set overflow mode), ROVM (reset overflow mode), or LST (load status register).
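The clamping behavior of saturation arithmetic can be sketched in C for 16-bit Q15 values. The function name `sat_add_q15` is an assumption; on the TMS320C25 the same effect is obtained in hardware by setting the OVM bit rather than by explicit comparisons.

```c
#include <assert.h>
#include <stdint.h>

/* Saturating addition of two Q15 values: on overflow the result is
 * clamped to the most positive or most negative representable value
 * instead of wrapping around. */
int16_t sat_add_q15(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;   /* cannot overflow 32 bits */
    if (sum > INT16_MAX) sum = INT16_MAX;
    if (sum < INT16_MIN) sum = INT16_MIN;
    assert(sum >= INT16_MIN && sum <= INT16_MAX);
    return (int16_t)sum;
}
```

Without saturation, 30000 + 10000 would wrap to a large negative number; with it, the result is held at 32767, which distorts the output far less.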


Figure 12. Saturation Arithmetic

Filter coefficients are updated using Equation (30). As illustrated in Figure 13, a new technique presented in reference [31] uses the scaling factor $a$ to prevent overflow of the filter coefficients during the weight updating operation. Suppose you use $a=2^{-m}$. A right shift by $m$ bits implements multiplication by $a$, while a left shift by $m$ bits implements the scaling factor $1/a$. Usually, the required value of $a$ is not expected to be very small and depends on the application. Since $a$ scales the desired signal, it does not affect the rate of convergence.


Figure 13. Fixed-Point Arithmetic Model of the Adaptive Filter

## Finite Precision Errors

The TMS320C25 is a $16 / 32$-bit fixed point processor. Each data sample is represented by a fractional number that uses 15 magnitude bits and one sign bit. The quantization interval

$$
\begin{equation*}
\delta=2^{-\mathrm{b}}, \tag{32}
\end{equation*}
$$

$(b=15)$, is called the width of quantization since the numbers are quantized in steps of $\delta$.
The products of the multiplications of data by coefficients within the filter must be rounded or truncated to store in memory or a CPU register. As shown in Figure 14, the roundoff error can be modeled as the white noise injected into the filter by each rounding operation. This white noise has a uniform distribution over a quantization interval and for rounding

$$
\begin{equation*}
-1 / 2 \delta<\mathrm{e} \leq 1 / 2 \delta \tag{33a}
\end{equation*}
$$

and

$$
\begin{equation*}
\sigma_{\mathrm{e}}^{2}=(1 / 12) \delta^{2} \tag{33b}
\end{equation*}
$$

where $\sigma_{\mathrm{e}}^{2}$ is the variance of the white noise.

In general, roundoff noise occurs after each multiplication. However, the TMS320C25 has a full precision accumulator, i.e., a $16 \times 16$-bit multiplier with a 32-bit accumulator, so there is no roundoff when you implement a set of summations and multiplications as in Equation (29). Rounding is performed when the result is stored back to memory location $y(n)$, so that only one noise source is present at a given summation node.
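The single rounding step described above can be sketched in C. This is an illustration under assumptions, not the appendix code: the name `fir_q15` is hypothetical, and a 64-bit accumulator is used in place of the 32-bit ACC so that the sketch cannot overflow for any filter length.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Q15 FIR sum of Equation (29): products are accumulated at full
 * precision (Q30) and rounded to Q15 only once, when y(n) is stored. */
int16_t fir_q15(const int16_t *wn, const int16_t *xn, size_t n_taps)
{
    assert(n_taps > 0);
    int64_t acc = 0;                      /* assumed wide accumulator */
    for (size_t i = 0; i < n_taps; i++) {
        acc += (int32_t)wn[i] * (int32_t)xn[i];   /* exact Q30 product */
    }
    acc += 1 << 14;                       /* add half a Q15 LSB */
    int64_t y = acc / 32768;              /* drop 15 fractional bits */
    if (acc % 32768 < 0) y -= 1;          /* make the division floor */
    if (y > INT16_MAX) y = INT16_MAX;     /* assumed saturation */
    if (y < INT16_MIN) y = INT16_MIN;
    return (int16_t)y;
}
```

Only the final store introduces an error, bounded by half of $\delta=2^{-15}$, regardless of how many taps are summed.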


$$
y=\text { Rounding }[x \bullet a]=x \bullet a+e
$$

Figure 14. Fixed-Point Roundoff Noise Model

For floating-point arithmetic, the variance of the roundoff noise [31] is slightly different from Equation (33b),

$$
\begin{equation*}
\sigma_{e}^{2}=0.18 \delta^{2} \tag{33c}
\end{equation*}
$$

Since the TMS320C30 has a 40/32-bit floating-point multiplier and ALU, the result of an arithmetic operation has a mantissa of 31 bits plus one sign bit. Therefore, the $\delta$ in Equation (33c) is equal to $2^{-31}$. Another roundoff noise is introduced when you store the result back to memory. This noise has the power of $2^{-23}$ because the mantissa of TMS320C30 floating-point data is 23 bits plus one sign bit. Therefore, unless the filter order is high, the roundoff noise from arithmetic operations is relatively small.

The steady-state output error of the LMS algorithm due to the finite precision arithmetic of a digital processor was analyzed in reference [31]. It was found that the power of arithmetic errors is inversely proportional to the adaptation step size $u$. The significance of this result in the adaptive filter design is discussed next. Furthermore, roundoff noise is found to accumulate in time without bound, leading to an eventual overflow [32]. The leaky LMS algorithm presented in the previous section can be used to prevent the algorithm overflow.

## Design Issues

The performance of digital adaptive algorithms differs from infinite precision adaptive algorithms. The finite precision LMS algorithm is given as

$$
\begin{equation*}
\underline{\mathrm{w}}(\mathrm{n}+1)=\underline{\mathrm{w}}(\mathrm{n})+\mathrm{Q}\left[\mathrm{u}^{*} \mathrm{e}(\mathrm{n}) * \underline{\mathrm{x}}(\mathrm{n})\right] \tag{34}
\end{equation*}
$$

where Q [.] denotes the operation of fixed point quantization. Whenever any correction term $\mathrm{u}^{*} \mathrm{e}(\mathrm{n}) * \mathrm{x}(\mathrm{n}-\mathrm{i})$ in the update of the weight vector in Equation (34) is too small, the quantized value of that term is zero, and the corresponding weight $w_{i}(n)$ remains unchanged. The condition for the ith component of the vector $\mathrm{w}(\mathrm{n})$ not to be updated when the algorithm is implemented with the TMS320C25 is

$$
\begin{equation*}
|\mathrm{ue}(\mathrm{n}) \mathrm{x}(\mathrm{n}-\mathrm{i})|<\delta / 2 \tag{35a}
\end{equation*}
$$

where $\delta=2^{-15}$. The condition for TMS320C30 is

$$
\begin{equation*}
|\mathrm{ue}(\mathrm{n}) \mathrm{x}(\mathrm{n}-\mathrm{i})|<2^{\exp } * \delta / 2 \tag{35b}
\end{equation*}
$$

where $\exp$ is the exponent of $\mathrm{w}_{\mathrm{i}}(\mathrm{n})$ and $\delta=2^{-23}$.
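Conditions (35a) and (35b) can be checked numerically with a few lines of C. The function names below are assumptions for illustration, and double precision is used only to evaluate the inequalities, not to model either processor.

```c
#include <assert.h>
#include <stdbool.h>

/* Helper: 2^exp for a small integer exponent, without <math.h>. */
double two_to_the(int exp)
{
    double p = 1.0;
    while (exp > 0) { p *= 2.0; exp--; }
    while (exp < 0) { p *= 0.5; exp++; }
    return p;
}

/* Equation (35a): true if the Q15 correction term rounds to zero. */
bool update_stalls_q15(double u, double e, double x)
{
    assert(u > 0.0);
    double term = u * e * x;
    if (term < 0.0) term = -term;
    return term < two_to_the(-15) / 2.0;
}

/* Equation (35b): the threshold scales with the exponent of w_i(n). */
bool update_stalls_float(double u, double e, double x, int w_exp)
{
    assert(u > 0.0);
    double term = u * e * x;
    if (term < 0.0) term = -term;
    return term < two_to_the(w_exp) * two_to_the(-23) / 2.0;
}
```

For example, with $u=0.01$, $e(n)=0.001$, and $x(n-i)=0.5$ the correction is $5 \times 10^{-6}$, which stalls in Q15 but still updates a floating-point weight whose exponent is 0. The same correction is lost once the weight exponent reaches 7.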
Since the adaptive algorithms are designed to minimize the mean squared value of the error signal, $e(n)$ decreases with time. If $u$ is small enough, most of the time the weights are not updated. This early termination of the adaptation may not allow the weight values to converge to the optimum set, resulting in a mean square error larger than its minimum value. The condition for the adaptation to converge completely [30] is $u>u_{\min }$ where

$$
\begin{equation*}
u_{\min }^{2}=\frac{\delta^{2}}{4 \sigma_{x}^{2} \epsilon_{\min }} \tag{36a}
\end{equation*}
$$

for the TMS320C25 and, for the TMS320C30,

$$
\begin{equation*}
u_{\min }^{2}=\frac{\delta^{2} * 2^{\exp }}{4 \sigma_{x}^{2} \epsilon_{\min }} \tag{36b}
\end{equation*}
$$

where $\sigma_{x}^{2}$ is the power of input signal $x(n)$ and $\epsilon_{\min }$ is the minimum mean squared error at steady state.

In the Leaky LMS Algorithm section, it was mentioned that the excess MSE given in Equation (14) is minimized by using small $u$. However, this may result in a large quantization error since the most significant term in the total output quantization error is [31]

$$
\begin{equation*}
\frac{\mathrm{N} \sigma_{\mathrm{e}}^{2}}{2 \mathrm{a}^{2} \mathrm{u}} \tag{37}
\end{equation*}
$$

The optimum step size $u_{0}$ reflects a compromise between these conflicting goals. The value of $u_{0}$ is shown to be too small to allow the adaptive algorithm to converge completely and also to give a slow convergence. In practice, $u>u_{0}$ is used for faster convergence. Hence, the excess MSE becomes larger, and the roundoff noise can typically be neglected when compared with the excess mean square error.

Finally, recall Equations (11) and (12). The step size u has an upper limit to guarantee the stability and convergence. Therefore, the adaptive algorithm requires

$$
\begin{equation*}
0<u<\frac{1}{N \sigma_{x}^{2}} \tag{38}
\end{equation*}
$$

On the other hand, the step size $u$ also has a lower limit. The optimum $u_{0}$, which minimizes the sum of the excess MSE and roundoff noise, is smaller than $u_{\text {min }}$, i.e., too small to allow the adaptive weight to converge. For an algorithm implemented on the TMS320C25, the word-length of 16 bits is fixed, and the minimum step-size that can be used is given in Equation (36). The most important design issue is to find the best $u$ to satisfy

$$
\begin{equation*}
\mathrm{u}_{\min }<\mathrm{u}<\frac{1}{\mathrm{~N} \sigma_{\mathrm{x}}^{2}} \tag{39}
\end{equation*}
$$

Therefore, in order to satisfy the condition in Equation (39) on the floating-point processor, it is better to initialize the filter coefficients close to zero if the situation is unknown, because $u_{\min }$ in Equation (36b) grows with the exponent of $\mathrm{w}_{\mathrm{i}}(\mathrm{n})$.
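The window in Equation (39) can be evaluated for a candidate step size with a short C sketch. The function names are assumptions, and the numerical values in the usage note are hypothetical design figures rather than measurements from this report. The lower bound is compared in squared form so that no square root is needed.

```c
#include <assert.h>
#include <stdbool.h>

/* Equation (36a): u_min^2 = delta^2 / (4 sigma_x^2 eps_min). */
double u_min_squared(double delta, double sigma_x2, double eps_min)
{
    assert(sigma_x2 > 0.0 && eps_min > 0.0);
    return (delta * delta) / (4.0 * sigma_x2 * eps_min);
}

/* Upper bound of Equation (38): 1 / (N sigma_x^2). */
double u_max(unsigned n_taps, double sigma_x2)
{
    assert(n_taps > 0 && sigma_x2 > 0.0);
    return 1.0 / ((double)n_taps * sigma_x2);
}

/* Equation (39): u_min < u < 1 / (N sigma_x^2). */
bool step_size_ok(double u, double delta, unsigned n_taps,
                  double sigma_x2, double eps_min)
{
    return u > 0.0
        && u * u > u_min_squared(delta, sigma_x2, eps_min)
        && u < u_max(n_taps, sigma_x2);
}
```

As a hypothetical case, take $\delta=2^{-15}$, $N=64$, $\sigma_{x}^{2}=0.25$, and $\epsilon_{\min }=10^{-4}$. The window is then roughly $0.003<u<0.0625$, so $u=0.01$ is acceptable, $u=0.001$ is too small to converge completely, and $u=0.1$ violates the stability limit.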

## Software Development

The TMS320C25 and TMS320C30 combine the high performance and the special features needed in adaptive signal processing applications. The processors are supported by a full set of software and hardware development tools. The software development tools include an assembler, a linker, a simulator, and a C compiler. The most universal software development tool available is a macro assembler. However, assembly language programming for a DSP can be tedious and costly. For adaptive filter applications, an assembly language programmer must have knowledge of adaptive signal processing. The challenge lies in fitting a great deal of complex code into the small memory space and tight execution time dictated by the real-time applications typical of adaptive signal processing.

Recently, C compilers for the processors were developed to make DSP programming easier, quicker, and less costly compared with the work associated with programming in assembly language. Due to the general characteristics of a compiler, the code it generates is not the most efficient. Since the program efficiency consideration is important for adaptive filter implementation, the code generated from the C compiler has to be modified before implementing. Thus, two alternative ways, besides writing an assembly program, to implement adaptive signal processing on DSP are presented. First is the automatic adaptive filter code generator [12], which can be found on Texas Instruments TMS320 Bulletin Board Service (BBS), and second are the adaptive filter function libraries that support assembly and C programming languages.

In this report, two adaptive filter libraries have been developed: one can be called from an assembly main program; the other can be called from the C main program. Note that, for the TMS320C25 only, certain data memory locations have been reserved for storing the necessary filter coefficients, previous delayed signal, etc. In other words, these data memories are used as global variables.

## Assembly Function Libraries

The basic concept of creating an assembly subroutine for an adaptive filter is to repackage the assembly programs discussed above as modules. The user can then implement the adaptive filter by writing an assembly main program that calls the subroutine.

## TMS320C25 Assembly Subroutine

The TMS320C25 has an eight-level deep hardware stack. The CALL and CALA subroutine calls store the current contents of the program counter (PC) on the top of the stack. The RET (return from subroutine) instruction pops the top of the stack back to the PC. For computational convenience, the processor needs to be set as follows before calling the assembly callable subroutine.

1. PM status bits equal to 01 .
2. SXM status bit set to 1 .
3. The current DP (data memory page pointer) is 0 .

The following example is the TMS320C25 assembly main routine, which performs an adaptive line enhancement by calling the LMS algorithm subroutine. The filter order is 64 , delay is equal to one, and the convergence factor $u$ is 0.01 .

| * DEFINE AND REFER SYMBOLS |  |  |  |
| :--- | :--- | :--- | :--- |
|  | .global | ORDER,U,ONE,D,Y,ERR,XN,WN,LMS |  |
| * |  |  |  |
| * DEFINE SAMPLING RATE, ORDER, AND MU |  |  |  |
| * |  |  |  |
| ORDER: | .equ | 20 |  |
| MU: | .equ | 327 |  |
| PAGE0: | .equ | 0 |  |
| * |  |  |  |
| * DEFINE ADDRESSES OF BUFFER AND COEFFICIENTS |  |  |  |
| * |  |  |  |
| X0: | .usect | "buffer",ORDER-1 |  |
| XN: | .usect | "buffer",1 |  |
| WN: | .usect | "coeffs",ORDER |  |
| * |  |  |  |
| * RESERVE ADDRESSES FOR PARAMETERS |  |  |  |
| * |  |  |  |
| ONE: | .usect | "parameters",1 |  |
| U: | .usect | "parameters",1 |  |
| ERR: | .usect | "parameters",1 |  |
| Y: | .usect | "parameters",1 |  |
| D: | .usect | "parameters",1 |  |
| ERRF: | .usect | "parameters",1 |  |
| * |  |  |  |
| * INITIALIZATION |  |  |  |
| * |  |  |  |
| START | LDPK | PAGE0 | ; Set DP = 0 |
|  | SPM | 1 | ; Set PM equal to 1 |
|  | SSXM |  | ; Set sign extension mode |
|  | LRLK | AR7,X0 | ; AR7 point to >300 |
|  | LACK | 1 | ; Initialize ONE = 1 |
|  | SACL | ONE |  |
|  | LALK | MU | ; Initialize U = MU = 0.01 |
|  | SACL | U |  |
| * |  |  |  |
| * PERFORM THE PREDICTOR |  |  |  |
| * |  |  |  |
| INPUT: | IN | D,PA2 | ; Get the input |
| * |  |  |  |
|  | CALL | LMS | ; Call subroutine |
| OUTPUT: | OUT | Y,PA2 | ; Output the signal |
| * |  |  |  |
|  | LAC | D | ; Insert the newest sample |
|  | LARP | AR7 |  |
|  | SACL | * |  |
|  | B | INPUT |  |
|  | .end |  |  |

The symbols, such as ORDER, U, ONE, D, LMS, Y, and ERR, are defined and referred to for the purpose of modular programming. The uninitialized sections specified by the directive .usect can be placed in any location of memory according to the linker command file. Note that MACD instruction requires the sources of the operands on program memory and data memory separately, and CNFP instruction configures RAM block 0 as program memory. Therefore, the coeffs section has to be in data RAM block 0 , and the buffer has to be in RAM block 1. Appendix H1 contains the adaptive transversal filter with LMS algorithm subroutine using the TMS320C25, and Appendix H2 contains an example of a linker command file.

## TMS320C30 Assembly Subroutine

Instead of a hardware stack, TMS320C30 uses a software stack, which is more flexible and convenient for a high-level language compiler. The stack memory location is pointed to by the stack pointer SP. In order to maintain the proper program sequence, the programmer must make certain that no data is lost and that the stack pointer always points to proper location. The PUSH, PUSHF, POP, POPF, CALL, CALLcond, RETIcond, and RETScond instructions will change the value of the stack pointer; in addition, writing data into it and using the interrupt will also change that value. It is the programmer's responsibility to initialize the stack pointer in the beginning of the program. The same adaptive line enhancer example above using TMS320C30 is listed below. The adapfltr.int program that initializes the stack pointer and the data RAM is given in Appendix H3.

```
*
* DEFINE GLOBAL VARIABLES AND CONSTANTS
*
         .copy   "adapfltr.int"
         .global LMS30,order,u,d,y,e
N        .set    20
mu       .set    0.01
         .text
begin    .set    $
         LDI     N,BK            ; Set up circular buffer
         LDP     @xn_addr        ; Set data page
         LDI     @xn_addr,AR0    ; Set pointer for x[]
         LDI     @wn_addr,AR1    ; Set pointer for w[]
         LDF     0.0,R0          ; R0 = 0.0
         RPTS    N-1
         STF     R0,*AR0++(1)%   ; x[] = 0
||       STF     R0,*AR1++(1)%   ; w[] = 0
         LDI     @in_addr,AR6    ; Set pointer for input ports
         LDI     @out_addr,AR7   ; Set pointer for output ports
*
* PERFORM ADAPTIVE LINE ENHANCER
*
input:
         LDF     *AR6,R7         ; Input d(n)
||       LDF     *+AR6(1),R6     ; Input x(n)
         STF     R7,@d           ; Insert d(n)
         STF     R6,*AR0         ; Insert x(n) to buffer
*
* CALL ASSEMBLY SUBROUTINE
*
         CALL    LMS30
*
* OUTPUT y(n) AND e(n) SIGNALS
*
         LDF     @y,R6           ; Get y(n)
         BD      input           ; Delay branch
         LDF     @e,R7           ; Get e(n)
         STF     R6,*AR7         ; Send out y(n)
         STF     R7,*+AR7(1)     ; Send out e(n)
*
* DEFINE CONSTANTS
*
xn       .usect  "buffer",N
wn       .usect  "coeffs",N
in_addr  .usect  "vars",1
out_addr .usect  "vars",1
xn_addr  .usect  "vars",1
wn_addr  .usect  "vars",1
u        .usect  "vars",1
order    .usect  "vars",1
d        .usect  "vars",1
y        .usect  "vars",1
e        .usect  "vars",1
cinit    .sect   ".cinit"
         .word   6,in_addr
         .word   0804000h
         .word   0804002h
         .word   xn
         .word   wn
         .float  mu
         .word   N-2
         .end
```

In the above example, the data memory location order is initialized to $\mathrm{N}-2$ for computational convenience. The subroutine that implements the LMS transversal filter and the linker command file can be found in Appendixes H4 and H5, respectively.
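The listing relies on circular addressing: with the block size loaded into BK, an access such as `*AR0++(1)%` steps through the buffer and wraps back to its start. A minimal C sketch of that behavior follows. The type `circ_ptr` and the function `circ_read_inc` are hypothetical names introduced here for illustration, and the modulo operation only models the wrap-around; it does not model the address alignment that the hardware requires.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a circularly addressed pointer. */
typedef struct {
    double *base;   /* start of the buffer */
    size_t  len;    /* block size, the value loaded into BK */
    size_t  idx;    /* current offset, the role played by AR0 */
} circ_ptr;

/* Read the current element, then advance by one with wrap-around,
 * in the manner of the *AR0++(1)% addressing mode. */
double circ_read_inc(circ_ptr *p)
{
    assert(p->len > 0 && p->idx < p->len);
    double v = p->base[p->idx];
    p->idx = (p->idx + 1) % p->len;
    return v;
}
```

Reading a buffer of length N exactly N times leaves the index where it started, which is why a loop over all taps needs no explicit pointer reset.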

## C Function Libraries

The TMS320C25 and TMS320C30 C language compilers provide high-level language support for these processors. The compilers allow application developers without an extensive knowledge of the device's architecture and instruction set to generate assembly code for the device. Also, since C programs are not device-specific, it is a relatively straightforward task to port existing C programs from other systems.

To allow fast development of efficient programs for adaptive signal processing applications, C function libraries have been developed. These libraries include functions for adaptive transversal, symmetric transversal, and lattice structures.

## TMS320C25 C-Callable Subroutines

In a C program, the memory assignments are chosen by the compiler. There are two ways to make sure that the efficient MACD instruction can still be used:
A. Use inline assembly code to assign memory locations for filter coefficients and buffers.
B. Reserve the desired memory locations for them and do the assignment in the linker command file.

The latter method is used in this report.

For a C main program, the parameters passed to and returned from the subroutines are all listed within the parentheses following the subroutine name, as shown below:

$$
\begin{array}{ll}
\operatorname{lms}(\mathrm{n}, \mathrm{mu}, \mathrm{~d}, \mathrm{x}, \& \mathrm{y}, \& \mathrm{e}) & \mathrm{n}-\text { Filter order } \\
& \mathrm{mu}-\text { Convergence factor } \\
& \mathrm{d}-\text { Desired signal } \\
& \mathrm{x}-\text { Input signal } \\
& \mathrm{y} \text { - Address of output signal } \\
& \mathrm{e}-\text { Address of error signal }
\end{array}
$$

Since the TMS320C25 C compiler pushes the parameters from right to left onto the software stack pointed to by AR1, the subroutine retrieves the parameters in reverse order, as shown below:

| MAR | *- | ; Set pointer for getting parameters |
| :--- | :--- | :--- |
| LAC | *- | ; ACC = N |
| SUBK | 1 |  |
| SACL | ORDER | ; ORDER = N - 1 |
| LAC | *- | ; Get and store mu |
| SACL | U |  |
| LAC | *- | ; Get and store D |
| SACL | D |  |
| LAC | *-,0,AR3 | ; Insert the newest sample |
| LRLK | AR3,FRSTAP |  |
| SACL | * |  |

The assembly subroutine returns the parameters y and e as follows:

| LARP | AR1 |  |
| :--- | :--- | :--- |
| LAR | AR2,*-,AR2 | ; Get the address of y in main |
| LAC | Y |  |
| SACL | *,0,AR1 | ; Store y |
| LAR | AR2,*,AR2 | ; Get the address of e in main |
| LAC | ERR |  |
| SACL | *,0,AR1 | ; Store e |

Therefore, the parameters should be entered in the order given above. If there are other parameters, they should be inserted right after the convergence factor mu. The leaky LMS algorithm subroutine is given as an example.

$$
\operatorname{llms}(\mathrm{n}, \mathrm{mu}, \mathrm{r}, \mathrm{~d}, \mathrm{x}, \& \mathrm{y}, \& \mathrm{e})
$$

where $r$ is defined in Equation (28a). Note that the values of the AR registers that are used in the subroutine, as well as the status registers, must be saved at the beginning of the subroutine and restored right before returning to the calling routine. An example of a C-callable program is given in Appendix I1. Memory locations 0200h to 0200h $+\mathrm{N}-1$ and 0300h to 0300h $+\mathrm{N}-1$ are reserved for filter coefficients and buffers, respectively. N denotes the filter order.
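To make the parameter order concrete, the following C sketch mimics the call `llms(n, mu, r, d, x, &y, &e)` in floating point. It is a reference model under assumptions, not the C-callable assembly routine of Appendix I1: it assumes that Equation (28a) has the usual leaky form w(k) = r w(k) + mu e(n) x(n-k), it keeps the coefficients and the delay line in static arrays to echo the reserved memory blocks, and the names `llms_model`, `w_buf`, `x_buf`, and `MAX_ORDER` are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_ORDER 64                 /* assumed upper bound on the order */

static double w_buf[MAX_ORDER];      /* plays the role of "coeffs" */
static double x_buf[MAX_ORDER];      /* plays the role of "buffer" */

/* One leaky LMS iteration; r = 1.0 reduces it to the plain LMS update. */
void llms_model(int n, double mu, double r, double d, double x,
                double *y, double *e)
{
    assert(n > 0 && n <= MAX_ORDER);
    /* Age the delay line and insert the newest sample at x_buf[0]. */
    memmove(&x_buf[1], &x_buf[0], (size_t)(n - 1) * sizeof x_buf[0]);
    x_buf[0] = x;

    double acc = 0.0;
    for (int k = 0; k < n; k++)
        acc += w_buf[k] * x_buf[k];  /* y(n) = sum of w(k) x(n-k) */
    *y = acc;
    *e = d - acc;                    /* e(n) = d(n) - y(n) */

    for (int k = 0; k < n; k++)      /* assumed form of Equation (28a) */
        w_buf[k] = r * w_buf[k] + mu * (*e) * x_buf[k];
}
```

Because the state lives in static arrays, this model supports only one filter instance at a time, which is the same limitation that fixed memory blocks impose.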

## TMS320C30 C Subroutine

As previously mentioned, the TMS320C30 architecture has features designed for a high-level language compiler. Note that the word "callable" is dropped from this section title because the TMS320C30 is flexible enough that the restrictions of the TMS320C25 no longer apply. Since the memory locations of the filter buffers and coefficients are determined by the parameters passed from the calling routine, the same subroutine can be used in different places. The only restriction is that the memory locations of the filter buffers must align to the circular addressing boundary [14]. The features of the TMS320C30 architecture that contribute most to these improvements are the dual data address buses, the software stack, and the flexible addressing modes. The parameters passed to the subroutine are pushed onto the stack. Therefore, after returning from the subroutine, the stack pointer, SP, must be restored to the location it pointed to before the parameters were pushed. However, this is done by the C compiler. A usage example of the C function subroutine is given as follows:

$$
\begin{array}{ll}
\operatorname{tlms}(\mathrm{n}, \mathrm{u}, \mathrm{~d}, \& \mathrm{w}, \& \mathrm{x}, \& \mathrm{y}, \& \mathrm{e}) \text { where } & \mathrm{n} \text { - Filter order } \\
& \mathrm{u}-\text { Step size } \\
& \mathrm{d} \text { - Desired signal } \\
& \text { \&w - Filter coefficients } \\
& \text { \&x - Input signal buffers } \\
& \text { \&y - Addr of output signal } \\
& \text { \&e - Addr of error signal }
\end{array}
$$

The example below shows how the C subroutine receives and manipulates the parameters passed from the caller program and how the result is returned to the caller routine.

```
*
* SET FRAME POINTER FP
*
FP .set AR3
    PUSH FP
    LDI SP,FP
*
* GET FILTER PARAMETERS
*
    LDI *-FP(2),R4 ; Get filter order
    LDI *-FP(6),AR0 ; Get pointer for x[]
    LDI *-FP(5),AR1 ; Get pointer for w[]
*
* COMPUTE ERROR SIGNAL e(n) AND STORE y(n) AND e(n)
*
    LDI *-FP(2),AR2 ; Get y(n) address
    SUBF3 R2,*+FP(1),R7 ; e(n) = d(n) - y(n)
 || STF R2,*AR2 ; Send out y(n)
    LDI *-FP(3),AR2 ; Get e(n) address
    STF R7,*AR2 ; Send out e(n)
    MPYF *+FP(2),R7 ; R7 = e(n) * u
    POP FP
```

Note that AR3 is used as the frame pointer by the TMS320C30 C compiler. Appendix I2 contains the complete LMS transversal filter example subroutine.

## Development Process and Environment

Adaptive structures and algorithms are implemented on the TMS320C25 following a four-stage procedure [33] that minimizes the amount of finite-word-length effect analysis and real-time debugging. Figure 15 illustrates the flowchart of this procedure. Since the implementation on the TMS320C30 is done only on the simulator, the last stage, real-time testing, is not carried out for that device.


Figure 15. Adaptive Filter Implementation Procedure

In the first stage, algorithm design and study is performed on a personal computer. Once the algorithm is understood, the filter is implemented using a high-level C program with double precision coefficients and arithmetic. This filter is considered an ideal filter.

In the second stage, the C program is rewritten so that it emulates the same sequence of operations, with the same parameters and state variables, that will be implemented on the processors. This program then serves as a detailed outline for the DSP assembly language program, or it can be compiled using the TMS320C25 or TMS320C30 C compiler. The effects of numerical errors can be measured directly by means of the technique shown in Figure 16, where $\mathrm{H}(\mathrm{z})$ is the ideal filter implemented in the first stage and $\mathrm{H}^{\prime}(\mathrm{z})$ is a real filter. Optimization is performed to minimize the quantization error and to produce a stable implementation.
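The comparison between the ideal filter and the real filter can be sketched as follows. Both filters receive the same input, and the mean squared difference of their outputs serves as the error measure. This is an illustration under assumptions rather than the procedure used by the authors: single-precision `float` arithmetic stands in for the finite word length of the target, fixed FIR coefficients are used instead of an adapting filter, and the function name `wordlength_error_power` is hypothetical. A real study would emulate the exact number format of the processor.

```c
#include <assert.h>
#include <stddef.h>

/* Mean squared difference between an "ideal" double-precision FIR
 * filter and a "real" copy that keeps coefficients, samples, and the
 * accumulator in single precision. Both see the same input x[0..len-1]. */
double wordlength_error_power(const double *h, size_t taps,
                              const double *x, size_t len)
{
    assert(taps > 0 && len > 0);
    double sum_sq = 0.0;
    for (size_t n = 0; n < len; n++) {
        double ideal = 0.0;    /* output of H(z)  */
        float  real  = 0.0f;   /* output of H'(z) */
        for (size_t k = 0; k < taps && k <= n; k++) {
            ideal += h[k] * x[n - k];
            real  += (float)h[k] * (float)x[n - k];
        }
        double diff = ideal - (double)real;
        sum_sq += diff * diff;
    }
    return sum_sq / (double)len;
}
```

Coefficients and samples that are exactly representable in both formats give an error of zero, so a nonzero result isolates the effect of the reduced word length.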


Figure 16. A Computational Technique for Evaluating Quantization Effects

In the third stage, the TMS320C25 and TMS320C30 assembly programs are developed; then they are tested using the simulators with test data from a disk file. Note that the simulation of the TMS320C25 can also be run on the SWDS with the data logging option. The test data is a short version of the data used in stage 2, which can be generated internally by a program or digitized from a real application environment. Output from the simulation is compared against the equivalent output of the C program from the second stage. Since the simulation requires data files to be in Q15 format, some precision is lost during data conversion. When a one-to-one agreement within a tolerable range is obtained between these two outputs, the processor software is assured to be essentially correct.
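The precision lost in the Q15 conversion can be quantified with a short sketch. Q15 represents values in the range [-1, 1) as 16-bit integers scaled by 32768. The helper names `to_q15` and `from_q15` are hypothetical, and truncation toward zero is an assumption; it is chosen because it maps a step size of 0.01 to 327, the value that the appendix listings require in data memory U.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a value in [-1, 1) to Q15, truncating toward zero (assumed). */
int16_t to_q15(double v)
{
    assert(v >= -1.0 && v < 1.0);
    return (int16_t)(v * 32768.0);
}

/* Convert a Q15 integer back to a real value. */
double from_q15(int16_t q)
{
    return (double)q / 32768.0;
}
```

For 0.01 the scaled value is 327.68, so the stored integer is 327 and the round-trip error is about 0.00002, which is below one Q15 step of 1/32768.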

The final stage is applied only to the TMS320C25. First, the assembled program is downloaded into the target TMS320C25 system (SWDS) to initiate real-time operation. Thus, the real-time debugging process is constrained primarily to debugging the I/O timing structure of the algorithm and testing the long-term stability of the algorithm. Figure 17 shows an experimental setup for verification, in which the adaptive filter is configured as the one-step adaptive predictor illustrated in Figure 18. The data used for real-time testing is a sinusoid generated by a Tektronix FG504 Function Generator embedded in white noise generated by an HP Precision Noise Generator. The DSP gets a quantized signal from the Analog Interface Board (AIB), performs adaptive prediction routines, and outputs an enhanced sinusoid to the analog interface board. The corrupted input and predicted (enhanced) output waveforms are compared on the oscilloscope or on the HP 4361 Dynamic Signal Analyzer. The corresponding spectra of input and output can be compared on the signal analyzer. The signal-to-noise ratio (SNR) improvement can be measured from the analyzer, which is connected to an HP plotter.
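The one-step predictor configuration can be sketched in C as follows. The filter sees only past samples, its output y(n) is the prediction of the current sample, and the prediction error drives the LMS update. Everything here is an assumption made for illustration: the two-tap length, the noise-free test signal generated by the recurrence s(n) = s(n-1) - s(n-2) (a sinusoid with a period of six samples), and the names `predictor_step` and `run_predictor_demo`. The real-time experiment uses a noisy sinusoid and a longer filter.

```c
#include <assert.h>

#define TAPS 2   /* assumed filter length for this illustration */

/* One predictor iteration. hist[0] = x(n-1), hist[1] = x(n-2). */
double predictor_step(double w[TAPS], double hist[TAPS], double mu,
                      double x_now, double *e)
{
    double y = 0.0;
    for (int k = 0; k < TAPS; k++)
        y += w[k] * hist[k];            /* prediction of x(n) */
    *e = x_now - y;                     /* prediction error */
    for (int k = 0; k < TAPS; k++)
        w[k] += mu * (*e) * hist[k];    /* LMS update */
    for (int k = TAPS - 1; k > 0; k--)
        hist[k] = hist[k - 1];          /* age the history */
    hist[0] = x_now;
    return y;
}

/* Run the predictor on the period-6 sequence 1, 0, -1, -1, 0, 1, ...
 * and return the final prediction error. */
double run_predictor_demo(int iterations, double mu, double w[TAPS])
{
    assert(iterations > 0);
    double hist[TAPS] = {0.0, 0.0};
    double s1 = 1.0, s2 = 0.0;          /* s(n-1) and s(n-2) */
    double e = 0.0;
    w[0] = w[1] = 0.0;
    for (int n = 0; n < iterations; n++) {
        double s = s1 - s2;             /* next sample of the sinusoid */
        predictor_step(w, hist, mu, s, &e);
        s2 = s1;
        s1 = s;
    }
    return e;
}
```

Because this test signal obeys a two-term recurrence exactly, the weights approach 1 and -1 and the error decays toward zero; with added noise, the error would settle at a level set by the noise power and the step size.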


Figure 17. Real-Time Experiment Setup


Figure 18. Block Diagram of a One-Step Adaptive Predictor

To illustrate the operation in a nonstationary environment, the adaptive predictor is implemented using a TMS320C25, and the following experiment is performed. The input signal is swept from 1287 Hz to 4025 Hz, then jumps back to 1287 Hz. The time for each sweep is one second. The input spectra at every second are shown in Figure 19a; the corresponding output spectra are shown in Figure 19b. Observations on the oscilloscope and signal analyzer confirm the significant SNR improvement, convergence speed, ability to track nonstationary signals, and long-term stability of the adaptive predictor.


Figure 19(a). Spectrum of Input Signal


Figure 19(b). Spectrum of Enhanced Output Signal

## Summary

Three adaptive structures and six update algorithms are implemented with the TMS320C25 and TMS320C30. Applications of adaptive filters and implementation considerations have been discussed. Two subroutine libraries that support both C language and assembly language for two processors were developed. These routines can be readily incorporated into TMS320C25 or TMS320C30 users' application programs.

The advancements in the TMS320C25 and TMS320C30 devices have made the implementation of sophisticated adaptive algorithms oriented toward performing real-time processing tasks feasible. Many adaptive signal processing algorithms are readily available and capable of solving real-time problems when implemented on the DSP. These programs provide an efficient way to implement the widely used structures and algorithms on the TMS320C25 and TMS320C30, based on assembly-language programming. They are also extremely useful for choosing an algorithm for a given application. The performances of adaptive structures and algorithms that have been implemented using the TMS320C25 and TMS320C30 have been summarized in Tables 1 and 2.

Table 1. The Performance of Adaptive Structures and Algorithms of TMS320C25

| Structure (TMS320C25) | Algorithm | Instruction Cycles | Program Memory (Words) |
| :---: | :---: | :---: | :---: |
| Transversal | LMS | 7N+28 | 33 |
| Transversal | Leaky LMS | 8N+28 | 34 |
| Transversal | Sign-Data LMS | 11N+26 | 41 |
| Transversal | Sign-Error LMS | 7N+26 | 30 |
| Transversal | Sign-Sign LMS | 11N+21 | 30 |
| Transversal | Normalized LMS | 7N+57 | 47 |
| Symmetric Transversal | LMS | 7.5N+38 | 50 |
| Symmetric Transversal | Leaky LMS | 8N+38 | 51 |
| Symmetric Transversal | Sign-Data LMS | 9.5N+36 | 58 |
| Symmetric Transversal | Sign-Error LMS | 7.5N+36 | 47 |
| Symmetric Transversal | Sign-Sign LMS | 9.5N+31 | 47 |
| Symmetric Transversal | Normalized LMS | 7.5N+69 | 66 |
| Lattice | LMS | 33N+32 | 63 |
| Lattice | Leaky LMS | 35N+32 | 65 |
| Lattice | Sign-Error LMS | 36N+32 | 65 |
| Lattice | Normalized LMS | 90N+34 | 92 |

Note: N represents filter order.

Table 2. The Performance of Adaptive Structures and Algorithms of TMS320C30

| Structure (TMS320C30) | Algorithm | Instruction Cycles | Program Memory (Words) |
| :---: | :---: | :---: | :---: |
| Transversal | LMS | 3N+15 | 17 |
| Transversal | Leaky LMS | 3N+15 | 19 |
| Transversal | Sign-Data LMS | 5N+16 | 24 |
| Transversal | Sign-Error LMS | 3N+16 | 18 |
| Transversal | Sign-Sign LMS | 5N+16 | 24 |
| Transversal | Normalized LMS | 3N+47 | 49 |
| Symmetric Transversal | LMS | 2.5N+15 | 23 |
| Symmetric Transversal | Leaky LMS | 2.5N+19 | 26 |
| Symmetric Transversal | Sign-Data LMS | 3.5N+18 | 30 |
| Symmetric Transversal | Sign-Error LMS | 2.5N+18 | 24 |
| Symmetric Transversal | Sign-Sign LMS | 3.5N+17 | 30 |
| Symmetric Transversal | Normalized LMS | 2.5N+50 | 56 |
| Lattice | LMS | 14N+9 | 20 |
| Lattice | Leaky LMS | 16N+9 | 22 |
| Lattice | Sign-Error LMS | 16N+9 | 22 |
| Lattice | Normalized LMS | 67N+9 | 73 |

Note: N represents filter order.
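The cycle counts in Tables 1 and 2 translate directly into an upper bound on the sample rate. The sketch below evaluates a count of the form aN + b and divides one second by the resulting execution time. The instruction cycle times used in the usage note (100 ns for the TMS320C25 and 60 ns for the TMS320C30) are assumptions that are not stated in this section, the names `cycle_model`, `cycles_per_sample`, and `max_sample_rate_hz` are hypothetical, and the bound ignores I/O and main-program overhead.

```c
#include <assert.h>

/* Cycle count modeled as per_tap * N + overhead, as in Tables 1 and 2. */
typedef struct {
    double per_tap;
    double overhead;
} cycle_model;

/* Instruction cycles needed for one output sample. */
double cycles_per_sample(cycle_model m, int order)
{
    assert(order > 0);
    return m.per_tap * order + m.overhead;
}

/* Highest sustainable sample rate in Hz for a cycle time given in ns. */
double max_sample_rate_hz(cycle_model m, int order, double cycle_ns)
{
    assert(cycle_ns > 0.0);
    return 1.0e9 / (cycles_per_sample(m, order) * cycle_ns);
}
```

For the transversal LMS filter with N = 64, the tables give 7(64) + 28 = 476 cycles on the TMS320C25 and 3(64) + 15 = 207 cycles on the TMS320C30. Under the assumed cycle times this corresponds to roughly 21 kHz and 80 kHz, respectively.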

## References

[1] B. Widrow and S. Stearns, Adaptive Signal Processing, Prentice-Hall, 1985.
[2] R. Lucky, J. Salz, and E. Weldon, Principles of Data Communications, McGraw-Hill, 1968.
[3] S. Haykin, Adaptive Filter Theory, Prentice-Hall, 1986.
[4] M. Honig and D. Messerschmitt, Adaptive Filters: Structures, Algorithms, and Applications, Kluwer Academic, 1984.
[5] J.R. Treichler, C.R. Johnson, and M.G. Larimore, Theory and Design of Adaptive Filters, Wiley, 1987.
[6] T. Alexander, Adaptive Signal Processing, Springer-Verlag, 1986.
[7] G. Goodwin and K. Sin, Adaptive Filtering Prediction and Control, Prentice-Hall, 1984.
[8] M. Bellanger, Adaptive Digital Filters and Signal Analysis, Marcel Dekker, 1987.
[9] J. Proakis, Digital Communications, McGraw-Hill, 1983.
[10] C. Chen and S. Kuo, "An Interactive Software Package for Adaptive Signal Processing on an IBM Personal Computer," 19th Pittsburgh Conference on Modeling and Simulation, May 1988.
[11] S. Kuo, G. Ranganathan, P. Gupta, and C. Chen, "Design and Implementation of Adaptive Filters," IEEE 1988 International Conference on Circuits and Systems, June 1988.
[12] S. Kuo, G. Ma, and C. Chen, "An Advanced DSP Code Generator for Adaptive Filters," 1988 ASSP DSP workshop, Sept. 1988.
[13] Texas Instruments, Second-Generation TMS320 User's Guide, 1987.
[14] Texas Instruments, Third-Generation TMS320 User's Guide, 1988.
[15] S. Qureshi, "Adaptive Equalization," Invited Paper, Proceedings of the IEEE, Sept. 1985.
[16] L. Rabiner and R. Schafer, Digital Processing of Speech Signals, Prentice-Hall, 1978.
[17] N. Jayant and P. Noll, Digital Coding of Waveforms: Principles and Applications to Speech and Video, Prentice-Hall, 1984.
[18] J. Makhoul, "Linear Prediction: A Tutorial Review," Proceedings of the IEEE, April 1975.
[19] C. Cowan and P. Grant, Adaptive Filters, Prentice-Hall, 1985.
[20] C. Gritton and D. Lin, "Echo Cancellation Algorithms," IEEE ASSP Magazine, April 1984.
[21] D. Messerschmitt, et al, "Digital Voice Echo Canceller with a TMS32020," in Digital Signal Processing Applications with the TMS320 Family, Prentice-Hall, 1986.
[22] B. Widrow, et al, "Adaptive Noise Cancelling: Principles and Applications," Proceedings of the IEEE, December 1975.
[23] A. Lovrich and R. Simar, "Implementation of FIR/IIR Filter with the TMS32010/TMS32020," in Digital Signal Processing Applications with the TMS320 Family, Texas Instruments, 1986.
[24] S. Orfanidis, Optimum Signal Processing, MacMillan, 1985.
[25] G. Frantz, K. Lin, J. Reimer, and J. Bradley, "The Texas Instruments TMS320C25 Digital Signal Microcomputer," IEEE Micro, December 1986.
[26] B. Friedlander, "Lattice Filters for Adaptive Processing," Proceedings of the IEEE, August 1982.
[27] A. Gersho, "Adaptive Filtering with Binary Reinforcement," IEEE Transactions on Information Theory, March 1984.
[28] A. Oppenheim and R. Schafer, Digital Signal Processing, Chap. 9, Prentice-Hall, 1975.
[29] L. Rabiner and B. Gold, Theory and Application of Digital Signal Processing, Chap. 5, Prentice-Hall, 1975.
[30] J. R. Gitlin et al, "On the Design of Gradient Algorithms for Digitally Implemented Adaptive Filters," IEEE Transactions on Circuit Theory, March 1973.
[31] C. Caraiscos and B. Liu, "A Roundoff Error Analysis of the LMS Adaptive Algorithm," IEEE Transactions on ASSP, February, 1984.
[32] J. Cioffi, "Limited-Precision Effects in Adaptive Filtering," IEEE Transactions on Circuits and Systems, July 1987.
[33] R. Crochiere, R. Cox, and J. Johnston, "Real-Time Speech Coding," IEEE Transactions on Communications, April 1982.

## List of Appendices for Implementation of Adaptive Filters with the TMS320C25 and TMS320C30

| Appendix | Title |
| :---: | :---: |
| A1 | Transversal Structure with LMS Algorithm Using the TMS320C25 |
| A2 | Transversal Structure with LMS Algorithm Using the TMS320C30 |
| B1 | Symmetric Transversal Structure with LMS Algorithm Using the TMS320C25 |
| B2 | Symmetric Transversal Structure with LMS Algorithm Using the TMS320C30 |
| C1 | Lattice Structure with LMS Algorithm Using the TMS320C25 |
| C2 | Lattice Structure with LMS Algorithm Using the TMS320C30 |
| D1 | Transversal Structure with Normalized LMS Algorithm Using the TMS320C25 |
| D2 | Transversal Structure with Normalized LMS Algorithm Using the TMS320C30 |
| E1 | Transversal Structure with Sign-Error LMS Algorithm Using the TMS320C25 |
| E2 | Transversal Structure with Sign-Error LMS Algorithm Using the TMS320C30 |
| F1 | Transversal Structure with Sign-Sign LMS Algorithm Using the TMS320C25 |
| F2 | Transversal Structure with Sign-Sign LMS Algorithm Using the TMS320C30 |
| G1 | Transversal Structure with Leaky LMS Algorithm Using the TMS320C25 |
| G2 | Transversal Structure with Leaky LMS Algorithm Using the TMS320C30 |
| H1 | Assembly Subroutine of Transversal Structure with LMS Algorithm Using the TMS320C25 |
| H2 | Linker Command File for Assembly Main Program Calling a TMS320C25 Adaptive LMS Transversal Filter Subroutine |
| H3 | TMS320C30 Adaptive Filter Initialization Program |
| H4 | Assembly Subroutine of Transversal Structure with LMS Algorithm Using the TMS320C30 |
| H5 | Linker Command File for Assembly Main Program Calling the TMS320C30 Adaptive LMS Transversal Filter Subroutine |
| I1 | C Subroutine of Transversal Structure with LMS Algorithm Using the TMS320C25 |
| I2 | C Subroutine of Transversal Structure with LMS Algorithm Using the TMS320C30 |



Algorithm:

$$
\begin{aligned}
y(n) &= \sum_{k=0}^{63} w(k)\, x(n-k) \\
e(n) &= d(n)-y(n) \\
w(k) &= w(k)+u\, e(n)\, x(n-k), \quad k=0,1,2, \ldots, 63
\end{aligned}
$$

where the filter order is 64 and mu $=0.01$.

Note: This source program is the generic version; the I/O configuration has not been set up. The user has to modify the main routine for a specific application.

Initial condition:

1) PM status bit should be equal to 01.
2) SXM status bit should be set to 1.
3) The current DP (data memory page pointer) should be page 0.
4) Data memory ONE should be 1.
5) Data memory U should be 327.

Chen, Chein-Chung February, 1989

* DEFINE PARAMETERS

| ORDER: | .equ | 64 |
| :--- | :--- | :--- |
| PAGE0: | .equ | 0 |

* DEFINE ADDRESSES OF BUFFER AND COEFFICIENTS

|  | .usect | "buffer",ORDER-1 |
| :--- | :--- | :--- |
|  | .usect | "buffer",1 |
|  | .usect | "coeffs",ORDER |

* RESERVE ADDRESSES FOR PARAMETERS

| D: | .usect | "parameters",1 |
| :--- | :--- | :--- |
| Y: | .usect | "parameters",1 |
| ERR: | .usect | "parameters",1 |
| ONE: | .usect | "parameters",1 |
| U: | .usect | "parameters",1 |
| ERRF: | .usect | "parameters",1 |

* PERFORM THE ADAPTIVE FILTER

.text

* ESTIMATE THE SIGNAL Y

| LARP | AR3 |  |
| :--- | :--- | :--- |
| CNFP |  | ; Configure B0 as program memory |
| MPYK | 0 | ; Clear the P register |
| LAC | ONE,15 | ; Using rounding |
| LRLK | AR3,XN | ; Point to the oldest sample |
| RPTK | ORDER-1 | ; Repeat N times |
| MACD | WN+0fd00h,*- | ; Estimate Y(n) |
| CNFD |  | ; Configure B0 as data memory |
| APAC |  |  |
| SACH | Y | ; Store the filter output |

* COMPUTE THE ERROR

| NEG |  | ; ACC = -Y(n) |
| :--- | :--- | :--- |
| ADDH | D | ; ERR(n) = D(n) - Y(n) |
| SACH | ERR |  |

* UPDATE THE WEIGHTS

* T30 - Adaptive transversal filter with LMS algorithm using the TMS320C30

I/O configuration:

Algorithm:

$$
\begin{aligned}
y(n) &= \sum_{k=0}^{63} w(k)\, x(n-k) \\
e(n) &= d(n)-y(n) \\
w(k) &= w(k)+u\, e(n)\, x(n-k), \quad k=0,1,2, \ldots, 63
\end{aligned}
$$

where the filter order is 64 and mu $=0.01$.
Chen, Chein-Chung March, 1989
.copy "adapfltr.int"

* PERFORM ADAPTIVE FILTER

| order | .set | 64 |
| :--- | :--- | :--- |
| mu | .set | 0.01 |

* INITIALIZE POINTERS AND ARRAYS

|  | .text |  |  |
| :---: | :---: | :---: | :---: |
| begin | .set | \$ |  |
|  | LDI | order,BK | ; Set up circular buffer |
|  | LDP | @xn_addr | ; Set data page |
|  | LDI | @xn_addr,AR0 | ; Set pointer for x[] |
|  | LDI | @wn_addr,AR1 | ; Set pointer for w[] |
|  | LDF | 0.0,R0 | ; R0 = 0.0 |
|  | RPTS | order-1 |  |
|  | STF | R0,*AR0++(1)% | ; x[] = 0 |
| \|\| | STF | R0,*AR1++(1)% | ; w[] = 0 |
|  | LDI | @in_addr,AR6 | ; Set pointer for input ports |
|  | LDI | @out_addr,AR7 | ; Set pointer for output ports |


| input: | LDF | *AR6,R7 | ; Input d(n) |
| :---: | :---: | :---: | :---: |
| \|\| | LDF | *+AR6(1),R6 | ; Input x(n) |
|  | STF | R6,*AR0 | ; Insert x(n) to buffer |
| * COMPUTE FILTER OUTPUT y(n) |  |  |  |
|  | LDF | 0.0,R2 | ; R2 = 0.0 |
|  | MPYF3 | *AR0++(1)%,*AR1++(1)%,R1 |  |
|  | RPTS | order-2 |  |
|  | MPYF3 | *AR0++(1)%,*AR1++(1)%,R1 |  |
| \|\| | ADDF3 | R1,R2,R2 | ; y(n) = w[] * x[] |
|  | ADDF | R1,R2 | ; Include last result |
| * COMPUTE ERROR SIGNAL e(n) AND OUTPUT y(n) AND e(n) SIGNALS |  |  |  |
|  | SUBF | R2,R7 | ; e(n) = d(n) - y(n) |
|  | STF | R2,*AR7 | ; Send out y(n) |
| \|\| | STF | R7,*+AR7(1) | ; Send out e(n) |
| * UPDATE WEIGHTS w(n) |  |  |  |
|  | MPYF | @u,R7 | ; R7 = e(n) * u |
|  | MPYF3 | *AR0++(1)%,R7,R1 | ; R1 = e(n) * u * x(n) |
|  | LDI | order-3,RC | ; Initialize repeat counter |
|  | RPTB | LMS | ; Do i = 0, N-3 |
|  | MPYF3 | *AR0++(1)%,R7,R1 | ; R1 = e(n) * u * x(n-i-1) |
| \|\| | ADDF3 | *AR1,R1,R2 | ; R2 = wi(n) + e(n) * u * x(n-i) |
| LMS | STF | R2,*AR1++(1)% | ; wi(n+1) = wi(n) + e(n) * u * x(n-i) |
|  | MPYF3 | *AR0,R7,R1 | ; For i = N-2 |
|  | ADDF3 | *AR1,R1,R2 |  |
|  | BD | input | ; Delay branch |
|  | STF | R2,*AR1++(1)% | ; wi(n+1) = wi(n) + e(n) * u * x(n-i) |
|  | ADDF3 | *AR1,R1,R2 |  |
|  | STF | R2,*AR1++(1)% | ; Update last w |
| * DEFINE CONSTANTS |  |  |  |
| xn | .usect | "buffer",order |  |
| wn | .usect | "coeffs",order |  |
| in_addr | .usect | "vars",1 |  |
| out_addr | .usect | "vars",1 |  |
| xn_addr | .usect | "vars",1 |  |
| wn_addr | .usect | "vars",1 |  |
| u | .usect | "vars",1 |  |
| cinit | .sect | ".cinit" |  |
|  | .word | 5,in_addr |  |
|  | .word | 0804000h |  |
|  | .word | 0804002h |  |
|  | .word | xn |  |
|  | .word | wn |  |
|  | .float | mu |  |
|  | .end |  |  |

*****************************************************************
Y25: Adaptive Filter Using Symmetric Transversal Structure and LMS Algorithm, Looped Code

Algorithm:

$$
\begin{aligned}
z(n-k) &= x(n-k)+x(n-63+k), \quad k=0,1, \ldots, 31 \\
y(n) &= \sum_{k=0}^{31} w(k)\, z(n-k) \\
e(n) &= d(n)-y(n) \\
w(k) &= w(k)+u\, e(n)\, z(n-k), \quad k=0,1,2, \ldots, 31
\end{aligned}
$$

where the filter order is 64 and mu $=0.01$.

Note: This source program is the generic version; the I/O configuration has not been set up. The user has to modify the main routine for a specific application.
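The symmetric transversal algorithm can be written as a floating-point reference in C. This is a sketch under assumptions rather than a translation of the assembly listing: the caller is assumed to supply the delay line with `x[k]` holding x(n-k), the filter order `n` is assumed to be even, and the function name `sym_lms_step` is hypothetical. Each coefficient multiplies the sum of two samples placed symmetrically about the center of the delay line, so only n/2 coefficients are stored and updated.

```c
#include <assert.h>

/* One iteration of the symmetric transversal LMS filter.
 * x[0..n-1] is the delay line (x[k] = x(n-k)); w[0..n/2-1] are the
 * coefficients shared by the two halves of the delay line. */
void sym_lms_step(int n, double mu, double d, const double *x,
                  double *w, double *y, double *e)
{
    assert(n > 0 && n % 2 == 0);
    int half = n / 2;

    double acc = 0.0;
    for (int k = 0; k < half; k++)
        acc += w[k] * (x[k] + x[n - 1 - k]);      /* w(k) z(n-k) */
    *y = acc;
    *e = d - acc;                                 /* e(n) = d(n) - y(n) */

    for (int k = 0; k < half; k++)
        w[k] += mu * (*e) * (x[k] + x[n - 1 - k]); /* w(k) += u e(n) z(n-k) */
}
```

With n = 64 the index pair (k, n-1-k) corresponds to the samples x(n-k) and x(n-63+k) of the algorithm description.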

Initial condition:

1) PM status bit should be equal to 01.
2) SXM status bit should be set to logic 1.
3) The current DP (data memory page pointer) should be page 0.
4) Data memory ONE should be 1.
5) Data memory U should be 327.

Chen, Chein-Chung February, 1989

* DEFINE PARAMETERS

| ORDER: | .equ | 64 |
| :--- | :--- | :--- |
| ORDER2: | .equ | 32 |

* DEFINE ADDRESSES OF BUFFER AND COEFFICIENTS

| FRSBUF: | .usect | "buffer",ORDER2-1 |
| :--- | :--- | :--- |
| LASBUF: | .usect | "buffer",1 |
| WN: | .usect | "coeffs",ORDER |
| FRSDAT: | .usect | "coeffs",ORDER-1 |
| LASDAT: | .usect | "coeffs",1 |

* RESERVE ADDRESSES FOR PARAMETERS

| D: | .usect | "parameters",1 |
| :--- | :--- | :--- |
|  | .usect | "parameters",1 |
|  | .usect | "parameters",1 |
|  | .usect | "parameters",1 |
|  | .usect | "parameters",1 |
| ERRF: | .usect | "parameters",1 |

* PERFORM THE ADAPTIVE FILTER

.text
* SYMMETRIC BUFFER ADDITION

| LARP | AR3 |  |
| :--- | :--- | :--- |
| LARK | AR1,ORDER2-1 | ; Set up the counter |
| LRLK | AR2,LASDAT | ; Point to oldest data |
| LRLK | AR3,FRSDAT | ; Point to newest data |
| LRLK | AR4,FRSBUF | ; Point to first buffer |
| LAC | *+,0,AR2 | ; Buffer(k) = DAT(n+k) + DAT(n-N+k) |
| ADD | *-,0,AR4 |  |
| SACL | *+,0,AR1 |  |
| BANZ | SW1,*-,AR3 |  |

* ESTIMATE THE SIGNAL Y

; Configure B0 as program memory
; Clear the P register
; Using rounding
; Point to the oldest buffer
; Repeat N/2 times
; Estimate Y(n)
; Store the filter output

Appendix B1. Symmetric Transversal Structure with LMS Algorithm Using the TMS320C25

* COMPUTE THE ERROR



Appendix B2. Symmetric Transversal Structure with LMS
Algorithm Using the TMS320C30

Algorithm:

$$
\begin{aligned}
f_{i}(n) &= f_{i-1}(n)-K_{i}(n)\, b_{i-1}(n-1), \quad i=1,2, \ldots, 64 \\
b_{i}(n) &= b_{i-1}(n-1)-K_{i}(n)\, f_{i-1}(n), \quad i=1,2, \ldots, 64 \\
e_{i}(n) &= d(n)-\sum_{k=0}^{i-1} y_{k}(n)=e_{i-1}(n)-b_{i-1}(n)\, G_{i-1}(n), \quad i=1,2, \ldots, 64 \\
y(n) &= \sum_{i=0}^{64} y_{i}(n)=\sum_{i=0}^{64} b_{i}(n)\, G_{i}(n) \\
K_{i}(n+1) &= K_{i}(n)+\mathrm{mu}\left[f_{i}(n)\, b_{i-1}(n-1)+b_{i}(n)\, f_{i-1}(n)\right] \\
G_{i}(n+1) &= G_{i}(n)+\mathrm{mu}\, e_{i}(n)\, b_{i}(n), \quad i=1,2, \ldots, 64
\end{aligned}
$$

where the filter order is 64 and mu $=0.01$.

Note: This source program is the generic version; the I/O configuration has not been set up. The user has to modify the main routine for a specific application.

Initial condition:

1) PM status bit should be equal to 01 .
2) SXM status bit should be set to logic 1.
3) The current DP (data memory page pointer) should be page 0 .
4) Data memory U should be 327.
5) The B1 and BD1 pointers (AR3 and AR4) should be exchanged every iteration. For example:
For odd iterations: AR3 $\rightarrow$ B1, AR4 $\rightarrow$ BD1
For even iterations: AR3 $\rightarrow$ BD1, AR4 $\rightarrow$ B1

L25: Adaptive Filter Using Lattice Structure and LMS Algorithm, Looped Code


| .usect | "parameters",1 |
| :---: | :---: |
| .usect | "parameters",1 |
| .usect | "parameters",1 |
| .usect | "parameters",1 |
| .usect | "parameters",1 |



* perform the adaptive filter

.text
* initialize the pointers

| LARP | AR3 |
| :--- | :--- |
| LARK | AR1,ORDER-1 |
| LRLK | AR2,F1 |
| LRLK | AR3,B1 |
| LRLK | AR4,BD1 |
| LRLK | AR5,G1 |
| LRLK | AR6,K1 |

* INITIALIZE THE B1 AND F1

| LAC | X |
| :--- | :--- |
| SACL | *,0,AR2 |
| SACL | *,0,AR3 |

* INITIALIZATION

| LT | *,AR5 | ; T = B1 |
| :---: | :---: | :---: |
| MPY | *,AR2 | ; P = B1 * G1 |
| PAC |  | ; ACC = B1 * G1 |
| SACH | Y | ; Initialize Y(0) = B1 * G1 |
| NEG |  | ; ACC = -(B1 * G1) |
| ADDH | D | ; ACC = D(n) - B1 * G1 |
| SACH | E | ; Initialize E(0) = D(n) - B1 * G1 |



## using the TMS320C30

## Algorithm:

$$
K_{i}(n+1)=K_{i}(n)+\mathrm{mu} *\left[f_{i}(n) * b_{i-1}(n-1)+b_{i}(n) * f_{i-1}(n)\right]
$$

$$
G_{i}(n+1)=G_{i}(n)+\mathrm{mu} * e_{i}(n) * b_{i}(n) \quad i=1,2, \ldots, 64
$$

Where filter order $=64$ and $\mathrm{mu}=0.04$.
Chen, Chein-Chung March, 1989

.copy "adapfltr.int"
***********************************************

* PERFORM ADAPTIVE FILTER

| order | . set | 64 | ; Filter order |
| :---: | :---: | :---: | :---: |
| mu | . set | 0.04 | ; Step size |

## INITIALIZE POINTERS AND ARRAYS



$$
\begin{aligned}
& f_{i}(n)=f_{i-1}(n)-K_{i}(n) * b_{i-1}(n-1) \quad i=1,2, \ldots, 64 \\
& b_{i}(n)=b_{i-1}(n-1)-K_{i}(n) * f_{i-1}(n) \quad i=1,2, \ldots, 64 \\
& e_{i}(n)=d(n)-\sum_{k=0}^{i-1} y_{k}(n)=e_{i-1}(n)-b_{i-1}(n) * G_{i-1}(n) \quad i=1,2, \ldots, 64 \\
& y(n)=\sum_{i=0}^{64} y_{i}(n)=\sum_{i=0}^{64} b_{i}(n) * G_{i}(n)
\end{aligned}
$$

|  | MPYF3 | R5, *AR2, R6 | ; B1 * G1 |
| :---: | :---: | :---: | :---: |
| \|\| | STF | R5, *AR1 | ; Insert B1 |
|  | SUBF | R6, R7 | ; E = D - B1 * G1 |
| * |  |  |  |
|  | LDI | order-1, RC |  |
|  | RPTB | lattice |  |
|  | MPYF3 | *AR0, R5, R3 | ; R3 = kFi-1 |
|  | MPYF3 | R7, *AR1++(1)%, R0 | ; R0 = Ei-1 * Bi-1 |
| \|\| | SUBF3 | R3, *AR4, R3 | ; R3 = Bi = BDi-1 - kFi-1 |
|  | MPYF | @u, R0 | ; R0 = u * Ei-1 * Bi-1 |
|  | ADDF3 | R0, *AR2, R0 | ; R0 = Gi-1 + u * Ei-1 * Bi-1 |
| \|\| | STF | R3, *AR1 | ; Store Bi |
|  | MPYF3 | R5, *AR1, R1 | ; R1 = Fi-1 * Bi |
| \|\| | STF | R0, *AR2++(1) | ; Store Gi |
|  | MPYF3 | *AR0, *AR4, R0 | ; R0 = kBDi-1 |
|  | SUBF | R0, R5 | ; R5 = Fi |
|  | MPYF3 | R5, *AR4++(1)%, R0 | ; R0 = Fi * BDi-1 |
|  | ADDF | R1, R0 | ; R0 = Fi * BDi-1 + Fi-1 * Bi |
|  | MPYF | @u, R0 | ; R0 = u * (Fi * BDi-1 + Fi-1 * Bi) |
|  | ADDF3 | R0, *AR0, R0 | ; ki = ki-1 + R0 |
|  | MPYF3 | R3, *AR2, R4 | ; R4 = Yi |
| \|\| | STF | R0, *AR0++(1) | ; Store ki |
|  | ADDF | R4, R6 | ; Compute y(n) |
| lattice | SUBF | R4, R7 | ; Compute e(n) |
| * OUTPUT y(n) AND e(n) SIGNALS |  |  |  |
| * |  |  |  |
|  | BD | input | ; Delay branch |
|  | SUBF | R4, R6 | ; Take out last term |
|  | STF | R6, *AR7 | ; Send out y(n) |
| \|\| | STF | R7, *+AR7(1) | ; Send out e(n) |
|  | LDI | *AR0--(IR0), R5 | ; Update k[] pointer |
| \|\| | LDI | *AR2--(IR0), R7 | ; Update g[] pointer |
| * DEFINE CONSTANTS |  |  |  |
| kn | .usect | "coeffs", order |  |
| gn | .usect | "coeffs", order |  |
| bn | .usect | "buffer", 2*order |  |
| in_addr | .usect | "vars", 1 |  |
| out_addr | .usect | "vars", 1 |  |
| kn_addr | .usect | "vars", 1 |  |
| bn_addr | .usect | "vars", 1 |  |
| gn_addr | .usect | "vars", 1 |  |
| u | .usect | "vars", 1 |  |
| cinit | .sect | ".cinit" |  |
|  | .word | 6, in_addr |  |
|  | .word | 0804000h |  |
|  | .word | 0804002h |  |
|  | .word | kn |  |
|  | .word | bn |  |
|  | .word | gn |  |
|  | .float | mu |  |
|  | .end |  |  |





```
************************************************************
```

TN30 - Adaptive transversal filter with Normalized LMS algorithm
using the TMS320C30
Algorithm:

$$
y(n)=\sum_{k=0}^{63} w(k) * x(n-k)
$$

$$
\operatorname{var}(n)=r * \operatorname{var}(n-1)+(1-r) * x(n) * x(n)
$$

$$
e(n)=d(n)-y(n)
$$

$$
w(k)=w(k)+u * e(n) * x(n-k) / \operatorname{var}(n) \quad k=0,1,2, \ldots, 63
$$

$$
\text { Where we use filter order }=64 \text { and } \mathrm{mu}=0.01 \text {. }
$$
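As a cross-check on the four equations above, here is a small floating-point sketch in Python. It is not the TMS320C30 code; the names `nlms_step` and `identify`, the test signal, and the four-tap example system are assumptions chosen for illustration. The defaults `mu=0.01` and `r=0.996` and the starting power estimate of 1.0 mirror the constants declared in the listing.

```python
import random


def nlms_step(x_buf, w, d, var, mu=0.01, r=0.996):
    """One normalized-LMS update; x_buf[k] holds x(n-k), w is updated in place."""
    y = sum(wk * xk for wk, xk in zip(w, x_buf))        # y(n)
    var = r * var + (1.0 - r) * x_buf[0] * x_buf[0]     # running power estimate
    e = d - y                                           # e(n)
    g = mu * e / var
    for k in range(len(w)):
        w[k] += g * x_buf[k]                            # w(k) update
    return y, e, var


def identify(true_w, n_samples=5000, seed=0):
    """Adapt w toward a known FIR system driven by uniform white noise."""
    rng = random.Random(seed)
    n = len(true_w)
    x_buf = [0.0] * n
    w = [0.0] * n
    var = 1.0                                           # initial power estimate
    for _ in range(n_samples):
        x_buf = [rng.uniform(-1.0, 1.0)] + x_buf[:-1]   # shift in x(n)
        d = sum(t * xk for t, xk in zip(true_w, x_buf))
        _, _, var = nlms_step(x_buf, w, d, var)
    return w
```

Dividing by the running power estimate makes the effective step size roughly independent of the input level, which is the point of the normalization.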

Chen, Chein-Chung March, 1989

.copy "adapfltr.int"

* PERFORM ADAPTIVE FILTER

| order | .set | 64 | ; Filter order |
| :---: | :---: | :---: | :---: |
| mu | .set | 0.01 | ; Step size |
| power | .set | 1.0 | ; Input signal power |
| alpha | .set | 0.996 |  |
| alpha1 | .set | 0.004 | ; 1.0 - alpha |

* INITIALIZE POINTERS AND ARRAYS

|  | .text |  |  |
| :---: | :---: | :---: | :---: |
| begin | .set |  |  |
|  | LDI | order, BK | ; Set up circular buffer |
|  | LDP | @xn_addr | ; Set data page |
|  | LDI | @xn_addr, AR0 | ; Set pointer for x[] |
|  | LDI | @wn_addr, AR1 | ; Set pointer for w[] |
|  | LDF | 0.0, R0 | ; R0 = 0.0 |
|  | RPTS | order-1 |  |
|  | STF | R0, *AR0++(1)% | ; x[] = 0 |
| \|\| | STF | R0, *AR1++(1)% | ; w[] = 0 |
|  | LDI | @in_addr, AR6 | ; Set pointer for input ports |
|  | LDI | @out_addr, AR7 | ; Set pointer for output ports |
|  |  |  |  |
|  |  | *AR6, R7 | ; Input d(n) |
|  |  | *+AR6(1), R6 | ; Input x(n) |
|  |  | R6, *AR0 | ; Insert x(n) to buffer |

* ESTIMATE THE POWER OF THE INPUT SIGNAL


Appendix D2. Transversal Structure with Normalized LMS

```
****************************************************************
    TSE25: Adaptive Filter Using Transversal Structure
            and Sign-Error LMS Algorithm, Looped Code
    Algorithm:
               63
        y(n) = SUM w(k)*x(n-k)   k=0,1,2,...,63
               k=0
        e(n) = d(n) - y(n)
        For k=0,1,2,...,63
            w(k) = w(k) + u*x(n-k)  if e(n) >= 0
            w(k) = w(k) - u*x(n-k)  if e(n) < 0
        Where we use filter order = 64 and mu = 0.01.
    Note: This source program is the generic version; I/O configuration has
        not been set up. User has to modify the main routine for specific
        application.
    Initial condition:
        1) PM status bit should be equal to 01.
        2) SXM status bit should be set to 1.
        3) The current DP (data memory page pointer) should be page 0.
        4) Data memory ONE should be 1.
        5) Data memory U should be 327.
        6) Data memory NEGMU should be -327.
            Chen, Chein-Chung February, 1989
****************************************************************
*   DEFINE PARAMETERS
ORDER: .equ 64
*   DEFINE ADDRESSES OF BUFFER AND COEFFICIENTS
        .usect "buffer",ORDER-1
        .usect "buffer",
        .usect "coeffs",ORDER
*   RESERVE ADDRESSES FOR PARAMETERS
        .usect "parameters",1
        .usect "parameters",1
```

| ERR: | .usect | "parameters",1 |  |
| :---: | :---: | :---: | :---: |
| ONE: | .usect | "parameters",1 |  |
| U: | .usect | "parameters",1 |  |
| ERRF: | .usect | "parameters",1 |  |
| NEGMU: | .usect | "parameters",1 |  |
| ********************************** |  |  |  |
| * PERFORM THE ADAPTIVE FILTER |  |  |  |
| ********************************** |  |  |  |
|  | .text |  |  |
| * ESTIMATE THE SIGNAL Y |  |  |  |
|  | LARP | AR3 |  |
|  | CNFP |  | ; Configure B0 as program memory |
|  | MPYK | 0 | ; Clear the P register |
|  | LAC | ONE,15 | ; Using rounding |
|  | LRLK | AR3,XN | ; Point to the oldest sample |
| FIR | RPTK | ORDER-1 | ; Repeat N times |
|  | MACD | WN+0FD00h,*- | ; Estimate Y(n) |
|  | CNFD |  | ; Configure B0 as data memory |
|  | APAC |  |  |
|  | SACH | Y | ; Store the filter output |
| * CHECK THE SIGN OF ERROR |  |  |  |
|  | NEG |  | ; ACC = -Y(n) |
|  | ADDH | D | ; ACC = D(n) - Y(n) |
|  | BGEZ | NEXT |  |
|  | LT | NEGMU | ; T register = -U |
| * UPDATE THE WEIGHTS |  |  |  |
| * |  |  |  |
| NEXT | LARK | AR1,ORDER-1 | ; Set up counter |
|  | LRLK | AR2,WN | ; Point to the coefficients |
|  | LRLK | AR3,XN+1 | ; Point to the data sample |
|  | MPY | *-,AR2 | ; P = U * X(n-k) |
| ADAPT | ZALR | *,AR3 | ; Load ACCH with W(k,n) & round |
|  | MPYA | *-,AR2 | ; W(k,n+1) = W(k,n) + P |
| * |  |  | ; P = U * X(n-k) |
|  | SACH | *+,0,AR1 | ; Store W(k,n+1) |
|  | BANZ | ADAPT,*-,AR2 |  |
| * |  |  |  |
| FINISH | .end |  |  |
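The sign-error update in the listing above reduces to choosing `+u` or `-u` once per sample and then accumulating that constant times each data sample. A minimal floating-point sketch in Python, with the hypothetical name `sign_error_lms_step` (not a routine from this report):

```python
def sign_error_lms_step(x_buf, w, d, mu):
    """One sign-error LMS update; x_buf[k] holds x(n-k), w is updated in place."""
    y = sum(wk * xk for wk, xk in zip(w, x_buf))   # y(n)
    e = d - y                                      # e(n)
    step = mu if e >= 0 else -mu                   # +u or -u, chosen once per sample
    for k in range(len(w)):
        w[k] += step * x_buf[k]
    return y, e
```

Because only the sign of the error is used, the per-tap work is one multiply-accumulate with a constant, which is why the fixed-point version only needs to preload either `U` or its negative before entering the weight-update loop.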

TSE30 - Adaptive transversal filter with Sign-Error LMS
algorithm using the TMS320C30

## Algorithm:

$$
y(n)=\sum_{k=0}^{63} w(k) * x(n-k)
$$

$$
e(n)=d(n)-y(n)
$$

For $k=0,1,2, \ldots, 63$:

$$
\begin{aligned}
& w(k)=w(k)+u * x(n-k) \quad \text { if } e(n) \geq 0.0 \\
& w(k)=w(k)-u * x(n-k) \quad \text { if } e(n)<0.0
\end{aligned}
$$

Where we use filter order $=64$ and $\mathrm{mu}=0.01$.

Chen, Chein-Chung March, 1989

.copy "adapfltr.int"

* PERFORM ADAPTIVE FILTER

| order | .set | 64 |
| :---: | :---: | :---: |
| mu | .set | 0.01 |

* INITIALIZE POINTERS AND ARRAYS

* 


; Input $x(n)$
; Insert $x(n)$ to buffer

* COMPUTE FILTER OUTPUT y(n)

|  | LDF | 0.0, R2 | ; R2 = 0.0 |
| :---: | :---: | :---: | :---: |
|  | MPYF3 | *AR0++(1)%, *AR1++(1)%, R1 |  |
|  | RPTS | order-2 |  |
|  | MPYF3 | *AR0++(1)%, *AR1++(1)%, R1 |  |
| \|\| | ADDF3 | R1, R2, R2 | ; y(n) = w[].x[] |
|  | ADDF | R1, R2 | ; Include last result |
| * COMPUTE ERROR SIGNAL e(n) |  |  |  |
|  | SUBF | R2, R7 | ; e(n) = d(n) - y(n) |
| * OUTPUT y(n) AND e(n) SIGNALS |  |  |  |
|  | STF | R2, *AR7 | ; Send out y(n) |
| \|\| | STF | R7, *+AR7(1) | ; Send out e(n) |
| * UPDATE WEIGHTS w(n) |  |  |  |
|  | ASH | -31, R7 | ; Get Sign[e(n)] |
|  | XOR3 | R4, R7, R5 | ; R5 = S[e(n)] * u |
|  | MPYF3 | *AR0++(1)%, R5, R1 |  |
|  | LDI | order-3, RC | ; Initialize repeat counter |
|  | RPTB | SELMS | ; Do i = 0, N-3 |
|  | MPYF3 | *AR0++(1)%, R5, R1 | ; R1 = S[e(n)] * u * x(n-i-1) |
| \|\| | ADDF3 | *AR1, R1, R2 | ; R2 = wi(n) + S[e(n)] * u * x(n-i) |
| SELMS | STF | R2, *AR1++(1)% | ; wi(n+1) = wi(n) + S[e(n)] * u * x(n-i) |
|  | MPYF3 | *AR0, R5, R1 | ; For i = N-2 |
|  | ADDF3 | *AR1, R1, R2 |  |
|  | BD | input | ; Delay branch |
|  | STF | R2, *AR1++(1)% | ; wi(n+1) = wi(n) + S[e(n)] * u * x(n-i) |
|  | ADDF3 | *AR1, R1, R2 |  |
|  | STF | R2, *AR1++(1)% | ; Update last w |
| * DEFINE CONSTANTS |  |  |  |
| xn | .usect | "buffer", order |  |
| wn | .usect | "coeffs", order |  |
| in_addr | .usect | "vars", 1 |  |
| out_addr | .usect | "vars", 1 |  |
| xn_addr | .usect | "vars", 1 |  |
| wn_addr | .usect | "vars", 1 |  |
| u | .usect | "vars", 1 |  |
| cinit | .sect | ".cinit" |  |
|  | .word | 5, in_addr |  |
|  | .word | 0804000h |  |
|  | .word | 0804002h |  |
|  | .word | xn |  |
|  | .word | wn |  |
|  | .float | mu |  |
|  | .end |  |  |

.title 'TSS25'
260 Implementation of Adaptive Filters with the TMS320C25 or the TMS320C30

> TSS25: Adaptive Filter Using Transversal Structure
> and Sign-Sign LMS Algorithm, Looped Code

## Algorithm:

$$
\begin{aligned}
& y(n)=\sum_{k=0}^{63} w(k) * x(n-k) \\
& e(n)=d(n)-y(n)
\end{aligned}
$$

For $k=0,1,2, \ldots, 63$
$w(k)=w(k)+u$ if $e(n) * x(n-k) \geq 0$
$w(k)=w(k)-u$ if $e(n) * x(n-k)<0$
Where we use filter order $=64$ and $\mathrm{mu}=0.01$.
Note: This source program is the generic version; the I/O configuration has not been set up. The user has to modify the main routine for the specific application.

Initial condition:

1) PM status bit should be equal to 01
2) SXM status bit should be set to 1 .
3) The current DP (data memory page pointer) should be page 0 .
4) Data memory ONE should be 1 .
5) Data memory U should be 327.
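For comparison with the sign-sign rule stated above, the following Python sketch applies it in floating point. The name `sign_sign_lms_step` is an assumption for this illustration, and the sketch makes no attempt to reproduce the fixed-point sign-bit manipulation of the TMS320C25 code.

```python
def sign_sign_lms_step(x_buf, w, d, mu):
    """One sign-sign LMS update; x_buf[k] holds x(n-k), w is updated in place."""
    y = sum(wk * xk for wk, xk in zip(w, x_buf))   # y(n)
    e = d - y                                      # e(n)
    for k, xk in enumerate(x_buf):
        # Each weight moves by exactly +mu or -mu; only the sign of
        # e(n) * x(n-k) matters, not its magnitude.
        w[k] += mu if e * xk >= 0 else -mu
    return y, e
```

Since every weight changes by a fixed amount each sample, the weights keep moving by `mu` per step even after convergence, so the choice of `mu` bounds the steady-state accuracy.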

Chen, Chein-Chung February, 1989

## DEFINE PARAMETERS

| ORDER: | .equ | 64 |
| :--- | :--- | :--- |
| PAGE0: | .equ |  |
|  | .usect | "buffer", ORDER-1 |
|  | .usect | "coeffs", ORDER |

## RESERVE ADDRESSES FOR PARAMETERS

| .usect | "parameters", 1 |
| :--- | :--- |
| . usect | "parameters", 1 |
| .usect | "parameters", 1 |


| ONE: | .usect | "parameters", 1 |
| :--- | :--- | :--- |
| U: | .usect | "parameters", 1 |
| ERRF: | .usect | "parameters", 1 |

****************************

* PERFORM THE ADAPTIVE FILTER
.text
* estimate the signal $y$

| LARP | AR3 |  |
| :--- | :--- | :--- |
| CNFP |  | ; Configure B0 as program memory |
| MPYK | 0 | ; Clear the P register |
| LAC | ONE,15 | ; Using rounding |
| LRLK | AR3,XN | ; Point to the oldest sample |
| RPTK | ORDER-1 | ; Repeat N times |
| MACD | WN+0FD00h,*- | ; Estimate Y(n) |
| CNFD |  | ; Configure B0 as data memory |
| APAC |  | ; Store the filter output |

* CHECK THE SIGN OF ERROR

| NEG |  |  |
| :---: | :---: | :---: |
| ADDH | D | ; ACC = D(n) - Y(n) |

## UPDATE THE WEIGHTS

; Set up counter
; Point to the coefficients
; Point to the data sample

*

| LAC | *-,0,AR2 | ; ACC = X(n-k) |
| :---: | :---: | :---: |
| XOR | ERR | ; Get the sign of ERR(n) * X(n-k) |
| SACL | ERRF | ; Store the sign |
| LAC | ERRF | ; Get the sign with its sign extension |
| XORK | MU,15 | ; Get the convergent factor MU or -MU |
| ADD | *,15 | ; Update W(k) |
| SACH | *+,1,AR1 |  |
| BANZ | ADAPT,*-,AR3 |  |







Where we use filter order $=\mathrm{N}$
Note: This subroutine performs adaptive filtering using the LMS algorithm. There are some initial conditions to meet before calling it.

Initial conditions:

1) Data memory ONE should be equal to 1.
2) Data memory $U$ should be equal to MU (Q15 format).
3) PM status bit should be equal to 01.
4) SXM status bit should be set to logic 1.
5) OVM status bit should be set to 1.
6) The current DP (data memory page pointer) should be page 0.

P.S. 1) On return, the current auxiliary register will be AR2. 2) AR1-AR3 have been used in this subroutine.
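The value 327 quoted for data memory U in the other listings is what a step size of 0.01 becomes as a 16-bit fraction with 15 fractional bits. The conversion can be sketched in Python as follows; the helper names `to_q15` and `from_q15` are hypothetical, and truncation (rather than rounding) is assumed because it reproduces the 327 and -327 given in the listings.

```python
def to_q15(value):
    """Convert a fraction in [-1, 1) to a 16-bit integer with 15 fractional bits."""
    if not -1.0 <= value < 1.0:
        raise ValueError("value must lie in [-1, 1)")
    return int(value * 32768)      # int() truncates toward zero


def from_q15(q):
    """Convert a 16-bit integer with 15 fractional bits back to a float."""
    return q / 32768.0
```

For example, `to_q15(0.01)` gives 327, and `from_q15(327)` is about 0.00998, so the step size actually used is slightly smaller than the nominal 0.01.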

Chen, Chein-Chung February, 1989

*
* DEFINE AND REFER SYMBOLS
.global LMS,ORDER,U,D,ONE,Y,ERR,XN,
* RESERVE ADDRESS FOR PARAMETERS
SAVE1: .usect "parameters",1
SAVE2: .usect "parameters",1
SAVE3: .usect "parameters",1
ERRF: .usect "parameters",1
****************************
* PERFORM THE ADAPTIVE FILTER
****************************
*
* ESTIMATE THE SIGNAL Y

|  | .text |  |  |
| :---: | :---: | :---: | :---: |
| LMS | LARP | AR3 | ; Set current register |
|  | SAR | AR1,SAVE1 | ; Save register AR1 |
|  | SAR | AR2,SAVE2 | ; Save register AR2 |
|  | SAR | AR3,SAVE3 | ; Save register AR3 |
|  | CNFP |  | ; Configure B0 as program memory |
|  | MPYK | 0 | ; Clear the P register |
|  | LAC | ONE,15 | ; Using rounding |
|  | LRLK | AR3,XN | ; Point to the oldest sample |
| FIR | RPTK | ORDER-1 | ; Repeat N times |
|  | MACD | WN+0FD00h,*- | ; Estimate Y(n) |
|  | CNFD |  | ; Configure B0 as data memory |
|  | APAC |  |  |
|  | SACH | Y | ; Store the filter output |

* COMPUTE THE ERROR

| NEG |  | ; ACC = -Y(n) |
| :--- | :--- | :--- |
| ADDH | D | ; ERR(n) = D(n) - Y(n) |
| SACH | ERR |  |

* UPDATE THE WEIGHTS

|  | LT | ERR | ; T = ERR(n) |
| :---: | :---: | :---: | :---: |
|  | MPY | U | ; P = U * ERR(n) |
|  | PAC |  |  |
|  | ADD | ONE,15 | ; Round the result |
|  | SACH | ERRF | ; ERRF = U * ERR(n) |
| * |  |  |  |
|  | LARK | AR1,ORDER-1 | ; Set up counter |
|  | LRLK | AR2,WN | ; Point to the coefficients |
|  | LRLK | AR3,XN+1 | ; Point to the data sample |
|  | LT | ERRF | ; T register = U * ERR(n) |
|  | MPY | *-,AR2 | ; P = U * ERR(n) * X(n-k) |
| ADAPT | ZALR | *,AR3 | ; Load ACCH with W(k,n) & round |
|  | MPYA | *-,AR2 | ; W(k,n+1) = W(k,n) + P |
| * |  |  | ; P = U * ERR(n) * X(n-k) |
|  | SACH | *+,0,AR1 | ; Store W(k,n+1) |
|  | BANZ | ADAPT,*-,AR2 |  |
| * |  |  |  |
|  | LAR | AR1,SAVE1 | ; Restore register AR1 |
|  | LAR | AR2,SAVE2 | ; Restore register AR2 |
|  | LAR | AR3,SAVE3 | ; Restore register AR3 |
|  |  |  |  |
| FINISH | RET |  |  |
| * |  |  |  |

## Appendix H2. Linker Command File for Assembly Main Program Calling a TMS320C25 Adaptive LMS Transversal Filter Subroutine




|  | .width | 132 |  |
| :---: | :---: | :---: | :---: |
| ***************************************************** |  |  |  |
| * This is the initial boot routine for TMS320C30 adaptive filter programs. |  |  |  |
| * This module performs the following actions: |  |  |  |
| * 1) Allocates and initializes the system stack. |  |  |  |
| * 2) Performs auto-initialization, which copies section ".const" data from ROM to data RAM. |  |  |  |
| * 3) Prepares to start the user's assembly program. |  |  |  |
| ***************************************************** |  |  |  |
| STACK_SIZE | .set | 40h | ; Size of system stack |
| FP | .set | AR3 | ; Frame pointer |
|  | .sect | "vectors" |  |
| RESET | .word | adap_init |  |
| * ALLOCATE SPACE FOR THE SYSTEM STACK. INITIALIZE THE FIRST WORDS IN |  |  |  |
| * .text TO POINT TO THE STACK AND INITIALIZATION TABLES. |  |  |  |
| stack | .usect | ".stack", STACK_SIZE |  |
|  | .text |  |  |
| stack_addr | .word | stack | ; Address of stack |
| init_addr | .word | cinit | ; Address of init tables |
| ***************************************************** |  |  |  |
| * ADAPTIVE FILTER INITIALIZATION ENTRY POINT FUNCTION |  |  |  |
| adap_init: |  |  |  |
| * |  |  |  |
|  | LDP | stack_addr | ; Get page of stored address |
|  | LDI | @stack_addr, SP | ; Load the address into SP |
|  | LDI | SP, FP | ; And into FP too |
| * DO AUTOINITIALIZATION |  |  |  |
|  | LDP | init_addr | ; Get page of stored address |
|  | LDI | @init_addr, AR0 | ; Get address of init tables |
|  | CMPI | -1, AR0 | ; If RAM model, skip init |
|  | BEQ | done |  |
|  | LDI | *AR0++, R1 | ; Get first count |
|  | BZD | done | ; If 0, nothing to do |
|  | LDI | *AR0++, AR1 | ; Get dest address |
|  | LDI | *AR0++, R0 | ; Get first word |
|  | SUBI | 1, R1 | ; Count - 1 |
| do_init: |  |  |  |
|  | RPTS | R1 | ; Block copy |



| BT30 - TMS320C30 adaptive transversal filter with LMS algorithm assembly subroutine. |  |
| :---: | :---: |
| * |  |
| * | Algorithm: |
| * | N-1 |
| * | y(n) = SUM w(k)*x(n-k)  k=0,1,2,...,N-1 |
| * | k=0 |
| * | e(n) = d(n) - y(n) |
| * |  |
| * | w(k) = w(k) + u*e(n)*x(n-k)  k=0,1,2,...,N-1 |

Where we use filter order $=\mathrm{N}$ and $\mathrm{mu}=0.01$.
Initial condition:

1) AR0 and AR1 should point to x[0] and w[0].
2) Data memory u should contain step size.
3) Data memory order should contain N-2, where N is filter order.
4) Data memories d, y, and e should be defined in caller routine.
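The listings for the TMS320C30 rely on circular addressing (`%` with the BK register) so that the data buffer never has to be shifted. The same idea can be sketched in Python with modulo indexing. This is an illustrative floating-point model only; the function name `lms_step_circular`, the `head` index, and the choice to move `head` backward through the buffer are assumptions of this sketch, not details taken from the assembly code.

```python
def lms_step_circular(xbuf, head, w, x_new, d, mu):
    """One LMS update using a circular data buffer.

    xbuf[(head + k) % N] holds x(n-k) once the new sample is stored.
    w is updated in place.  Returns (new_head, y, e).
    """
    n = len(w)
    head = (head - 1) % n          # step back onto the oldest sample
    xbuf[head] = x_new             # overwrite it with x(n)
    y = 0.0
    for k in range(n):
        y += w[k] * xbuf[(head + k) % n]      # y(n) = sum of w(k) * x(n-k)
    e = d - y                                 # e(n) = d(n) - y(n)
    g = mu * e
    for k in range(n):
        w[k] += g * xbuf[(head + k) % n]      # w(k) = w(k) + u * e(n) * x(n-k)
    return head, y, e
```

Only one buffer entry is written per sample, whatever the filter order; the modulo arithmetic corresponds to what the hardware circular addressing mode does at no extra cost.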
Chen, Chein-Chung March, 1989

.global LMS30,u,d,y,e,order
***********************************************

* PERFORM ADAPTIVE FILTER
**********************************************

| LMS30 | .text |
| :---: | :---: |
|  | .set |
|  | PUSH |
|  | PUSHF |
|  | PUSHF |
|  | PUSH |
|  | PUSHF |

* COMPUTE FILTER OUTPUT y(n)

| LDF | 0.0, R3 | ; R3 = 0.0 |
| :---: | :---: | :---: |
| MPYF3 | *AR0++(1)%, *AR1++(1)%, R1 |  |
| RPTS | @order |  |
| MPYF3 | *AR0++(1)%, *AR1++(1)%, R1 |  |
| ADDF3 | R1, R3, R3 | ; y(n) = w[].x[] |
| ADDF | R1, R3 | ; Include last result |

| STF | R3, @y | ; Store y(n) |
| :--- | :--- | :--- |
| SUBRF | @d, R3 | ; e(n) = d(n) - y(n) |
| STF | R3, @e | ; Store e(n) |

* UPDATE WEIGHTS w[] AND SHIFT x[]

| MPYF | @u, R3 | ; R3 = e(n) * u |
| :---: | :---: | :---: |
| MPYF3 | *AR0++(1)%, R3, R1 | ; R1 = e(n) * u * x(n) |
| LDI | @order, RC | ; Initialize repeat counter |
| SUBI | 1, RC |  |
| RPTB | LMS | ; Do i = 0, N-3 |
| MPYF3 | *AR0++(1)%, R3, R1 | ; R1 = e(n) * u * x(n-i-1) |
| ADDF3 | *AR1, R1, R2 | ; R2 = wi(n) + e(n) * u * x(n-i) |
| STF | R2, *AR1++(1)% | ; wi(n+1) = wi(n) + e(n) * u * x(n-i) |
| MPYF3 | *AR0, R3, R1 | ; For i = N-2 |
| ADDF3 | *AR1, R1, R2 |  |
| STF | R2, *AR1++(1)% | ; wi(n+1) = wi(n) + e(n) * u * x(n-i) |
| ADDF3 | *AR1, R1, R2 |  |
| STF | R2, *AR1++(1)% | ; Update last w |

Appendix H4. Assembly Subroutine of Transversal Structure with
LMS Algorithm Using the TMS320C30

## Appendix H5. Linker Command File for Assembly Main Program Calling the TMS320C30 Adaptive LMS Transversal Filter Subroutine



Algorithm Using the TMS320C25


## CT30 - TMS320C30 C subroutine adaptive transversal filter with LMS algorithm.

## Algorithm:

$$
y(n)=\sum_{k=0}^{N-1} w(k) * x(n-k)
$$

$$
e(n)=d(n)-y(n)
$$

$$
w(k)=w(k)+u * e(n) * x(n-k) \quad k=0,1,2, \ldots, N-1
$$

Where we use filter order $=\mathrm{N}$ and $\mathrm{mu}=0.01$.
Usage: tlms(n, mu, d, \&w, \&x, \&y, \&e)
n - order of filter
mu - convergence factor
d - desired signal
\&w - filter coefficients
\&x - input signal buffer
\&y - addr of output signal
\&e - addr of error signal
Chen, Chein-Chung March, 1989


|  | .global | _tlms |
| :---: | :---: | :---: |
| FP | .set | AR3 |



* PERFORM ADAPTIVE FILTER


| _tlms | .text |  |
| :---: | :---: | :---: |
|  | .set | \$ |
|  | PUSH | FP |
|  | LDI | SP,FP |
|  | PUSH | AR0 |
|  | PUSH | AR1 |
|  | PUSH | AR2 |
|  | PUSH | R1 |
|  | PUSHF | R1 |
|  | PUSH | R2 |
|  | PUSHF | R2 |
|  | PUSH | R4 |
|  | PUSHF | R6 |
|  | PUSHF | R7 |

* 

* GET FILTER PARAMETERS

| LDI | *-FP(2),R4 | ; Get filter order |
| :--- | :--- | :--- |
| LDI | *-FP(6),AR0 | ; Get pointer for x[] |
| LDI | *-FP(5),AR1 | ; Get pointer for w[] |
| SUBI | 2,R4 | ; Set loop counter |

* COMPUTE FILTER OUTPUT y(n)

|  | LDF | 0.0, R2 | ; R2 = 0.0 |
| :---: | :---: | :---: | :---: |
|  | MPYF3 | *AR0++(1), *AR1++(1), R1 |  |
|  | RPTS | R4 |  |
|  | MPYF3 | *AR0++(1), *AR1++(1), R1 |  |
| \|\| | ADDF3 | R1, R2, R2 | ; y(n) = w[].x[] |
|  | ADDF | R1, R2 | ; Include last result |

* COMPUTE ERROR SIGNAL e(n) AND STORE y(n) AND e(n)

* UPDATE WEIGHTS w[] AND SHIFT x[]

| MPYF | *-FP(3), R7 | ; R7 = e(n) * u |
| :---: | :---: | :---: |
| MPYF3 | *--AR0(1), R7, R1 | ; R1 = e(n) * u * x(n-N+1) |
| LDI | R4, RC | ; Initialize repeat counter |
| RPTB | LMS | ; Do i = 1, N-1 |
| MPYF3 | *-AR0(1), R7, R1 | ; R1 = e(n) * u * x(n-i+1) |
| ADDF3 | *--AR1(1), R1, R2 | ; R2 = wi(n) + e(n) * u * x(n-i) |
| LDF | *AR0, R6 | ; Get x(n+1-N+1) |
| STF | R2, *AR1 | ; wi(n+1) = wi(n) + e(n) * u * x(n-i) |
| STF | R6, *+AR0(1) | ; Shift x[] |
| ADDF3 | *--AR1(1), R1, R2 | ; R2 = wi(n) + e(n) * u * x(n) |
| STF | R2, *AR1 | ; Update last w |

# A Collection of Functions for the TMS320C30 

Gary Sitton

Gaslight Software

## Introduction

This report presents a collection of efficient machine language programs for advanced applications with the TMS320C30. These programs provide basic math and transcendental functions. Other routines include vector functions, FFTs and linear algebra.

## Library Overview

The programs fall into six categories:
I. Normal precision floating point math functions,
II. Extended precision floating point math functions,
III. Integer arithmetic routines,
IV. Vector utility routines,
V. Radix 2 FFT routines, and
VI. Linear algebra routines.

Categories I and II are programs that implement a minimal set of elementary mathematical functions for advanced applications. In these categories, the functions FPINV and SQRT are improved versions of the programs in the TMS320C3x User's Guide [1]. In category III, IMULT and IDIV are improved versions of the programs EXTMPY and DIVI in [1]. In category IV, *FMIEEE and *TOIEEE are array versions of the FMIEEE and TOIEEE scalar programs from the User's Guide.

The names and short descriptions of these routines use some special notation:
Categories I and II: $\mathbf{xd}$ indicates that the relative accuracy of the implemented function is x decimal digits.
Categories IV and VI: $*$ as a program name prefix stands for M or R. $\mathbf{M}$ selects the memory-based parameter entry point. $\mathbf{R}$ selects the register-based parameter entry point.
Categories II and VI: $\mathbf{X}$ indicates the extended-precision program version.

Consult the program source listings for more details.
The following are brief descriptions of the programs by category:
I. Normal floating-point (32-bit) math functions (\$MATH.ASM):
A. SIN -computes a 7d sine(x) for all x in radians.
B. COS -computes a 7d cosine(x) for all x in radians.
C. EXP -computes a 7d exp(x) for all $|x| \leq 88$.
D. LN -computes a 7d ln(x) for all $x>0$.
E. ATAN -computes a 7d atan(x) in radians for all x.
F. SQRT -computes an 8d sqrt(x) for all $x \geq 0$.
G. FPINV -computes an 8d 1/x for all $x \neq 0$.
H. FDIV -computes an 8d x/y for all x and all $y \neq 0$.
II. Extended-precision, floating-point (40-bit) math functions (\$MATHX.ASM):
A. SINX -computes a 9d sine(x) for all x in radians.
B. COSX -computes a 9d cosine(x) for all x in radians.
C. EXPX -computes a 9d exp(x) for all $|x| \leq 88$.
D. LNX -computes an 8d ln(x) for all $x>0$.
E. ATANX -computes an 8d atan(x) in radians for all x.
F. SQRTX -computes a 10d sqrt(x) for all $x \geq 0$.
G. FPINVX -computes a 10d 1/x for all $x \neq 0$.
H. FDIVX -computes a 10d x/y for all x and all $y \neq 0$.
I. FMULTX -computes a 10d x*y for all x and y.
III. Integer (32-bit) math routines (\$MATHI.ASM):
A. ILOG2 -computes $m=\log_2(n)$, $n \leq 2^{m}$, for use with radix 2 FFT programs.
B. IMULT -computes 64-bit product of two 32-bit numbers.
C. IDIV -computes quotient and remainder of two 32-bit numbers.
IV. Vector utilities (\$VECTOR.ASM):
A. *CORMULT -in-place computation of the complex vector product of two complex arrays using the complex conjugate of the second array.
B. *CONMULT -in-place computation of the complex vector product of two complex arrays.
C. *CBITREV -in-place bit reverse permutation on a complex array with separate real and imaginary arrays.
D. *FMIEEE -in-place fast conversion of an IEEE array to a TMS320C30 array.
E. *TOIEEE -in-place fast conversion of a TMS320C30 array to an IEEE array.
F. *VECMULT -in-place multiplies a constant times an array.
G. *CONMOV -moves (fills) a constant into an array.
H. *VECMOV -moves (copies) an array into another array.
V. Radix 2 FFT routines (\$FFT2.ASM):
A. CFFFT2 -Complex DIF forward radix 2 FFT using separate real and imaginary arrays and $3 / 4$ cycle sine table.
B. CIFFT2 -Complex DIT inverse radix 2 FFT using separate real and imaginary arrays and $3 / 4$ cycle sine table (does not include the $1 / \mathrm{N}$ scale factor).
VI. Linear algebra routines (\$LINALG.ASM):
A. *SOLUTN -Solves a well conditioned system of linear equations with any number of dependent variable sets. Uses no (diagonal) pivoting with normal-precision floating-point math.
B. *SOLUTNX -Solves a well conditioned system of linear equations with any number of dependent variable sets. Uses no (diagonal) pivoting with extended-precision floating-point math.
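Two of the helpers listed above have simple reference definitions that can be useful when checking results: ILOG2 returns the smallest m with $n \leq 2^{m}$, and *CBITREV applies a bit-reversal permutation. The Python sketch below shows one way to express both; the function names are hypothetical, the code is not derived from the assembly sources, and the permutation is shown for a single array (for *CBITREV it would be applied to the real and the imaginary array alike).

```python
def ilog2(n):
    """Return the smallest m such that n <= 2**m, for n >= 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1).bit_length()


def reverse_bits(i, m):
    """Reverse the lowest m bits of the non-negative integer i."""
    r = 0
    for _ in range(m):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def bit_reverse_permute(a):
    """Apply the bit-reversal permutation in place; len(a) must be a power of two."""
    n = len(a)
    m = ilog2(n)
    if (1 << m) != n:
        raise ValueError("length must be a power of two")
    for i in range(n):
        j = reverse_bits(i, m)
        if j > i:                  # swap each pair only once
            a[i], a[j] = a[j], a[i]
    return a
```

For a length-8 array the permutation maps index order 0..7 to 0, 4, 2, 6, 1, 5, 3, 7.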

## Extended vs. Normal Precision

Categories I, II, and VI represent a dual collection of programs implemented with 32-bit single- or normal-precision TMS320C30 floating-point arithmetic, and with 40-bit extended-precision TMS320C30 floating-point arithmetic. Some of the normal-precision programs (category I, for example) have been written using the TMS320C30 RND instruction for rounding to obtain the optimal precision from the standard floating point TMS320C30 instruction set. This has been done with a slight loss of speed. Such rounding can be carefully eliminated by the user if the additional speed is necessary at the expense of some accuracy.

Extended precision was implemented on the TMS320C30 by means of a simple 40-by-40 floating-point multiply routine, FMULTX. This was necessary because the TMS320C30 has 40-bit addition and subtraction instructions, but the multiply operates only on 32-bit inputs. By using the native add and subtract instructions, FMULTX, and the extended-precision registers R0 to R7, 40-bit floating-point math was effected. All 40-bit constants are stored in two consecutive words in memory. The first word is the normal truncated 32-bit floating-point number. The least significant byte of the second word contains the remaining bottom 8 bits of the extended mantissa. The programs are coded to properly load extended-precision registers with these double-word constants.
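The two-word storage scheme for 40-bit constants can be pictured with plain integer bit patterns. The Python sketch below is a minimal illustration that assumes the 40-bit register image is treated as one unsigned integer whose upper 32 bits form the truncated single-precision word; the function names `split_ext40` and `join_ext40` are hypothetical.

```python
def split_ext40(bits40):
    """Split a 40-bit pattern into (first_word, second_word).

    first_word is the upper 32 bits (the truncated 32-bit number);
    the low byte of second_word carries the bottom 8 mantissa bits.
    """
    if not 0 <= bits40 < (1 << 40):
        raise ValueError("expected a 40-bit unsigned pattern")
    return bits40 >> 8, bits40 & 0xFF


def join_ext40(first_word, second_word):
    """Rebuild the 40-bit pattern from the two stored words."""
    return ((first_word & 0xFFFFFFFF) << 8) | (second_word & 0xFF)
```

Only the low byte of the second word is significant in this scheme, which is why `join_ext40` masks it before merging.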

The extended-precision versions of the programs in this report may be slower than their normal-precision counterparts. When using extended-precision results in R0 from category II programs, note that the results may be stored in memory with or without rounding. A more accurate normal-precision result will generally be obtained by rounding. You should never round before using an extended-precision result as input to another extended-precision program unless special circumstances exist. Note that truncation, not rounding, will occur if an extended-precision register is moved to any 32-bit register or any memory location. This will generally cause loss of accuracy in the amount of the value of the least significant bit of the mantissa.
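The difference between truncating and rounding when the bottom 8 mantissa bits are dropped can be seen on integer mantissas. The following Python sketch is a simplified model for non-negative mantissas only; it ignores the sign handling and the renormalization on mantissa overflow that a real rounding instruction must perform, and the function names are assumptions for this illustration.

```python
def truncate_mantissa(m32):
    """Drop the low 8 bits of a non-negative 32-bit mantissa."""
    return m32 >> 8


def round_mantissa(m32):
    """Round a non-negative 32-bit mantissa to 24 bits (half rounds up)."""
    return (m32 + 0x80) >> 8
```

With `m32 = 0x1FF` truncation yields 1 while rounding yields 2: the truncation error approaches one full least significant bit of the shorter mantissa, whereas the rounding error is at most half of that.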

## Program Utilization

Since all programs in this collection are intended to be invoked by a CALL instruction, you must have the stack pointer (SP register) appropriately set to an available memory area, preferably in internal RAM. Programs in categories I and II save and restore the data page register DP by using the stack area pointed to by SP. Programs in category III do not alter or use the DP register at all. The programs in categories IV through VI alter but do not restore the DP register.

All of the programs in categories I through III, except for ILOG2, are implemented as straight-line code. You may wish to disable the instruction cache while these programs are executing. This will cause no loss of execution speed and will avoid flushing out potentially reusable instructions in the cache. It is beneficial to have the cache enabled when using most of the remaining programs (categories IV through VI) as they generally contain multi-instruction loops.

Programs in categories IV through VI allow input through externally defined variable addresses. The .global references indicate these addresses, where the input variable values and/or addresses are located. The starting address of these memory locations is given by the external variable \$PARAMS. All of the addresses are assumed to be in the same TMS320C30 memory page as \$PARAMS. If this is not the case, the addresses or the programs should be changed to assure that the DP register gets set properly.

Programs in categories IV and VI also allow the use of registers to hold input parameters. The exact registers to be used are found in the program source listings. When using the register input entry point, refer to the program using the $R$ prefix on the program name, e.g. RSOLUTN. The memory based parameter input entry uses the $M$ prefix, e.g. MSOLUTN. The .global references to the $R$ prefix entry points may be deleted if they are not needed.

## Function Approximation Techniques

Categories I and II are made up of a collection of elementary mathematical functions numerically approximated using two basic methods. The functions SIN, COS, EXP, LN, and ATAN are approximated by using polynomials fitted to the various functions over a limited range of the independent variable. The functions SQRT and FPINV are approximated by iteratively solving a particular non-linear equation. The extended precision versions of these programs (category II) use the same approach with extended-precision arithmetic and resort to more accurate polynomials or more iterations to achieve the desired precision.

## Polynomial Approximations

The polynomial approximation method is fundamentally very simple. A limited part of a function is approximated by a polynomial of some order sufficient to obtain the desired accuracy. The polynomial is generally a series of the form:

$$
\begin{equation*}
P(n, x)=\sum_{i=0}^{n}\left\{a[i] x^{i}\right\} \tag{1}
\end{equation*}
$$

where x is the independent variable, n the polynomial order (a fixed integer), and $\mathrm{a}[\mathrm{i}]$ is a set of $n+1$ fixed coefficients.

The desired function, say $f(x)$, is then approximated by a particular $P(n, x)$ such that:

$$
\begin{equation*}
f(x)=P(n, x)+e(x), \quad x_{l}<x<x_{u} \tag{2}
\end{equation*}
$$

where $x_{l}$ and $x_{u}$ are the lower and upper limits of the domain of $x$, and $e(x)$ or $e(x) / f(x)$ is the error function, which is usually minimized in the min-max (equi-ripple) sense. This is done by selecting an appropriate means of calculating the coefficients $\mathrm{a}[\mathrm{i}]$.

Various techniques and schemes are used in the selection of:

- the approximation interval,
- transformations on the function,
- selection of the polynomial form,
- error minimization criteria, and
- calculation of the coefficients.

See Hastings [2] for an excellent tutorial on this numerical methodology. All of the polynomial approximations used here were obtained from the National Bureau of Standards reference edited by Abramowitz and Stegun [3].
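The polynomial of equation (1) is usually evaluated in nested (Horner) form, which is also how the SIN listing later in this report accumulates its series (for example `R1 <= X**2*(C9+R1)`). The following Python sketch illustrates the idea only; the function name `horner` and the coefficient values are hypothetical and are not taken from the routines.

```python
def horner(coeffs, x):
    """Evaluate P(n, x) = sum(a[i] * x**i) using n multiplies and n adds.

    coeffs is [a[0], a[1], ..., a[n]], lowest order first.
    """
    acc = 0.0
    for a in reversed(coeffs):
        acc = acc * x + a
    return acc


# Hypothetical coefficients, chosen only for illustration.
a = [1.0, -0.5, 0.25, -0.125]
x = 0.3
direct = sum(a_i * x**i for i, a_i in enumerate(a))
print(horner(a, x), direct)
```

The nested form needs no explicit powers of x, which suits a processor with a single-cycle multiply and add.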

## Non-Linear Equation Approximation

The second method of approximation, using the solution of non-linear equations, is easier to understand. This method requires that a solution for the equation $\mathrm{g}(\mathrm{x})=0$ be found. One means for solving this equation is by Newton-Raphson iteration. This can be understood by considering the Taylor series expansion for $g(x)$ :

$$
\begin{equation*}
g(x+h)=g(x)+h g^{\prime}(x)+r(x, h) \tag{3}
\end{equation*}
$$

where $r(x, h)$ is the remainder of the series (which can be assumed to be small), and $g^{\prime}(x)$ is the derivative of the function $g(x)$. Leaving off the remainder in (3) we get, in terms of incremental values of $x$, the approximation:

$$
\begin{equation*}
\mathrm{g}(\mathrm{x}[\mathrm{i}+1])=\mathrm{g}(\mathrm{x}[\mathrm{i}])+\{\mathrm{x}[\mathrm{i}+1]-\mathrm{x}[\mathrm{i}]\} \mathrm{g}^{\prime}(\mathrm{x}[\mathrm{i}]) \tag{4}
\end{equation*}
$$

Solving for $\mathrm{x}[\mathrm{i}+1]$ in (4) with $\mathrm{g}(\mathrm{x}[\mathrm{i}+1])=0$ yields the approximation:

$$
\begin{equation*}
x[i+1]=x[i]-g(x[i]) / g^{\prime}(x[i]) \tag{5}
\end{equation*}
$$

Thus, $\mathrm{x}[\mathrm{i}+1]$ will converge to a solution of $\mathrm{g}(\mathrm{x})=0$. Convergence can be shown to be quadratic, i.e. the error in the approximation at each iteration is proportional to the square of the error in the previous iteration. Minimally, this requires a sufficiently close starting value for $\mathrm{x}[0]$ and the condition that $\left|\mathrm{g}^{\prime}(\mathrm{x})\right|>0$ for all iterated values of x .
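As a minimal sketch of iteration (5), the Python fragment below applies it to the hypothetical test equation $g(x)=x^{2}-2$ and records the error at each step. The helper name `newton` and the starting value 1.5 are assumptions made for illustration.

```python
def newton(g, dg, x0, iterations):
    """Apply x[i+1] = x[i] - g(x[i]) / g'(x[i]) and return all iterates."""
    x = x0
    history = [x]
    for _ in range(iterations):
        x = x - g(x) / dg(x)
        history.append(x)
    return history


# Hypothetical example: solve g(x) = x**2 - 2 = 0 from x[0] = 1.5.
hist = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.5, 4)
errors = [abs(x - 2.0 ** 0.5) for x in hist]
print(errors)
```

Each error is roughly proportional to the square of the previous one, which is the quadratic convergence described above.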

## Math Functions Details

The approximation techniques can be applied to each of the classes of functions. The following sections describe the approximations as they are applied to each function.

## Inverse and Square Root Functions

For the problem of computing good approximations to sqrt(c) (SQRT and SQRTX routines) and 1/c (FPINV and FPINVX routines), both $g(x)$ and $g^{\prime}(x)$ must be derived and then used in the iteration of equation (5). This is complicated by the restriction that division should be avoided, since the TMS320C30 has no divide instruction. For the iteration to find the inverse of c, you can write:

$$
\begin{equation*}
\mathrm{g}(\mathrm{x}[\mathrm{i}])=1 / \mathrm{x}[\mathrm{i}]-\mathrm{c}=0, \tag{6}
\end{equation*}
$$

which is solved when $1 / \mathrm{x}=\mathrm{c}$ or $\mathrm{x}=1 / \mathrm{c}$. Taking the derivative of (6) and substituting into (5) and simplifying gives us:

$$
\begin{equation*}
\mathrm{x}[\mathrm{i}+1]=\mathrm{x}[\mathrm{i}]\{2-\mathrm{cx}[\mathrm{i}]\} \tag{7}
\end{equation*}
$$

which needs no division.

Iteration (7) converges to $1 / \mathrm{c}$ with the accuracy (in digits) at each iteration equal to twice that of the preceding one. Thus, if $\mathrm{x}[0]$ approximates $1 / \mathrm{c}$ to 3 bits of precision, only three iterations of (7) will yield about $24=3\left(2^{3}\right)$ bits of accuracy.
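A short Python sketch of iteration (7) follows. It is not the FPINV code; the function name `reciprocal` and the seed value are assumptions. If the seed has relative error $e$, one step leaves a relative error of $e^{2}$, so a seed that is 10% off reaches about $10^{-8}$ after three steps.

```python
def reciprocal(c, x0, iterations=3):
    """Approximate 1/c by x[i+1] = x[i] * (2 - c * x[i]); no division is used."""
    x = x0
    for _ in range(iterations):
        x = x * (2.0 - c * x)
    return x


# Hypothetical example: c = 3 with a seed of 0.3 (true value 0.3333...).
print(reciprocal(3.0, 0.3))
```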

A similar iteration from $f(x)=x^{2}$ for $\operatorname{sqrt}(c)$ can be derived from the formulation:

$$
\begin{equation*}
g(x[i])=x[i]^{2}-c=0 \tag{8}
\end{equation*}
$$

which is solved when $x^{2}=c$ or $x=\operatorname{sqrt}(c)$. The solution for (8) leads to the classic square root formula:

$$
\begin{equation*}
x[i+1]=0.5\{c / x[i]+x[i]\} \tag{9}
\end{equation*}
$$

but this equation uses division. However, the iteration from $f(x)=1 / x^{2}$ for $1 /$ sqrt(c) can be shown to be:

$$
\begin{equation*}
x[i+1]=x[i]\left\{1.5-c^{\prime} x[i]^{2}\right\} \tag{10}
\end{equation*}
$$

where $c^{\prime}=c / 2=0.5 c$. Though (10) needs no division, the final desired result must be transformed by an extra multiplication by the input c because:

$$
\begin{equation*}
\operatorname{sqrt}(\mathrm{c})=\mathrm{c}\{1 / \text { sqrt }(\mathrm{c})\} \tag{11}
\end{equation*}
$$

Formula (10) will also converge, in the precision-doubling fashion of the Newton-Raphson iteration, given a suitably close starting value for $\mathrm{x}[0]$ and the use of sufficiently accurate arithmetic. Note that the extended-precision routines FPINVX and SQRTX both use an extra iteration (for a total of 4) to achieve the needed 32-bit accuracy for the 40-bit format.
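Equations (10) and (11) can be sketched in Python as follows. This is an illustration under stated assumptions, not the SQRT routine: the function name `sqrt_via_rsqrt` and the seed are hypothetical, and the arithmetic is IEEE double precision rather than the TMS320C30 format.

```python
def sqrt_via_rsqrt(c, x0, iterations=3):
    """Iterate x <- x * (1.5 - (c/2) * x**2) toward 1/sqrt(c), then return c * x."""
    half_c = 0.5 * c                     # c' = c / 2 in equation (10)
    x = x0
    for _ in range(iterations):
        x = x * (1.5 - half_c * x * x)
    return c * x                         # equation (11): sqrt(c) = c * (1/sqrt(c))


# Hypothetical example: c = 2 with a seed of 0.7 for 1/sqrt(2) = 0.7071...
print(sqrt_via_rsqrt(2.0, 0.7))
```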

The initial guess $\mathrm{x}[0]$, for the iterations of $1 / \mathrm{sqrt}(\mathrm{c})$ and $1 / \mathrm{c}$, may be obtained using an interesting approximation. A TMS320C30 floating-point number can be written as $\mathrm{c}=(1+\mathrm{m}) 2^{\mathrm{e}}$, where $0 \leq \mathrm{m}<1$ and $-127 \leq \mathrm{e} \leq 127$. The extra 1, added to the fractional mantissa m, is the implied bit. Then we can write the inverse of c as:

$$
\begin{equation*}
1 / c=1 /(1+\mathrm{m}) 2^{-\mathrm{e}} \tag{12}
\end{equation*}
$$

An excellent approximation for the inverse of the mantissa is:

$$
\begin{equation*}
1 /(1+m)=1-m / 2 \tag{13}
\end{equation*}
$$

which is exact at the end points: $\mathrm{m}=0$ and $\mathrm{m}=1$. Then the approximation for the reciprocal would be:

$$
\begin{equation*}
1 / \mathrm{c}=(1-\mathrm{m} / 2) 2^{-e} . \tag{14}
\end{equation*}
$$

It turns out that this approximation can be achieved in a single logical operation. If you compute the unlikely value of $c^{\prime}=c$ XOR 0FF7FFFFFFFFh, you would complement all bits in c except the sign bit. Including the implied bit and taking the effect of one's complement arithmetic into account results in a final value of:

$$
\begin{equation*}
c^{\prime}=\{1+(1-m)\} 2^{-(e+1)} \tag{15}
\end{equation*}
$$

or the desired approximation:

$$
\begin{equation*}
c^{\prime}=(1-m / 2) 2^{-e} \approx 1 / c \tag{16}
\end{equation*}
$$

c' gives about 3 bits of precision, which is an excellent seed $\mathrm{x}[0]$ for the $1 / \mathrm{c}$ iteration. Using e/2, you have a start for the $1 / \mathrm{sqrt}(\mathrm{c})$ iteration as well.
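The quality of the seed in (14) can be checked numerically. The Python sketch below does not reproduce the XOR trick, which depends on the TMS320C30 number format; instead it extracts m and e from an IEEE double with `math.frexp` and forms $(1-m/2)2^{-e}$ directly. The function names are hypothetical. The relative error of the seed is $m(1-m)/2$, which never exceeds 0.125, i.e. about 3 bits.

```python
import math


def reciprocal_seed(c):
    """Seed from (14) for c > 0: with c = (1 + m) * 2**e, return (1 - m/2) * 2**-e."""
    frac, exp = math.frexp(c)        # c = frac * 2**exp with 0.5 <= frac < 1
    m = 2.0 * frac - 1.0             # fractional mantissa, 0 <= m < 1
    e = exp - 1                      # exponent of the (1 + m) form
    return math.ldexp(1.0 - 0.5 * m, -e)


def reciprocal(c, iterations=3):
    """Three steps of iteration (7) starting from the seed above."""
    x = reciprocal_seed(c)
    for _ in range(iterations):
        x = x * (2.0 - c * x)
    return x


for c in (1.0, 1.5, 3.0, 1000.0, 0.007):
    print(c, reciprocal_seed(c) * c - 1.0, reciprocal(c) * c - 1.0)
```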

## Sine and Cosine Functions

The SIN, COS, SINX, and COSX (sine and cosine) routines all use the same basic approximation (section 4.3.98, p. 76 in [3]). The series is for $\sin (\mathrm{x}) / \mathrm{x}$ but is obviously transformed by multiplying by x . The polynomial of even terms then is of the form:

$$
\begin{equation*}
\sin (x)=x \sum_{i=0}^{5}\left\{a[2 i] x^{2 i}\right\}+x e(x) \tag{16}
\end{equation*}
$$

where $|\mathrm{x}| \leq \mathrm{Pi} / 2$ and $|\mathrm{xe}(\mathrm{x})| \leq 2\left(10^{-9}\right)$. Instead of using another power series for $\cos (\mathrm{x})$, you can use the fact that:

$$
\begin{equation*}
\cos (\mathrm{x})=\sin (\mathrm{x}+\mathrm{Pi} / 2) \tag{17}
\end{equation*}
$$

The series given by (16) is only accurate in the 1st and 4th quadrants, i.e. $|\mathrm{x}| \leq$ $\mathrm{Pi} / 2$. $\operatorname{Sin}(\mathrm{x})$ in the other two quadrants is found from:

$$
\begin{equation*}
\sin (x)=\sin (P i-x) \tag{18}
\end{equation*}
$$

The case for $\mathrm{x}<0$ is expediently handled by using $|\mathrm{x}|$ for all calculations except for the final multiply by x in (16).
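The quadrant mapping of (17) and (18) can be sketched in Python. This is an illustration, not the SIN routine: the coefficients are plain Taylor values standing in for the minimax coefficients of [3] (an assumption), and the names `sin_core`, `sin_approx`, and `cos_approx` are hypothetical.

```python
import math

# Taylor coefficients of sin(x)/x in powers of x**2 (assumption: the routines
# use the minimax coefficients of [3], which are not reproduced here).
COEFFS = [1.0, -1.0 / 6, 1.0 / 120, -1.0 / 5040, 1.0 / 362880, -1.0 / 39916800]


def sin_core(x):
    """Series of the form x * sum(a[2i] * x**(2i)), i = 0..5, for |x| <= pi/2."""
    x2 = x * x
    acc = 0.0
    for a in reversed(COEFFS):
        acc = acc * x2 + a
    return x * acc


def sin_approx(x):
    a = math.fmod(abs(x), 2.0 * math.pi)   # work with |x|, 0 <= a < 2*pi
    if a > 1.5 * math.pi:
        a -= 2.0 * math.pi                 # 4th quadrant: -pi/2 < a < 0
    elif a > 0.5 * math.pi:
        a = math.pi - a                    # identity (18): sin(x) = sin(pi - x)
    s = sin_core(a)
    return -s if x < 0 else s              # restore the sign of x


def cos_approx(x):
    return sin_approx(x + 0.5 * math.pi)   # identity (17)


print(sin_approx(2.0), math.sin(2.0))
```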

## Exponential Functions

The EXP and EXPX (exponential) routines use an approximation (see Section 4.2.45, p. 71, in [3]). The expansion is of the form

$$
\begin{equation*}
\exp (x)=\sum_{i=0}^{7}\left\{a[i] x^{i}\right\}+e(x) \tag{19}
\end{equation*}
$$

where $0 \leq \mathrm{x} \leq \ln (2)$ and $|\mathrm{e}(\mathrm{x})| \leq 2\left(10^{-10}\right)$. The series for $2^{y}$ is found by substituting $\mathrm{y}=\mathrm{x} / \ln (2)$ since:

$$
\begin{equation*}
\exp (x)=\exp (\ln (2) y)=2^{y} \tag{20}
\end{equation*}
$$

The new expansion then becomes:

$$
\begin{equation*}
2^{\mathrm{y}}=\sum_{\mathrm{i}=0}^{7}\left\{\mathrm{b}[\mathrm{i}] \mathrm{y}^{\mathrm{i}}\right\}+\mathrm{e}(\mathrm{x}), \tag{21}
\end{equation*}
$$

where $\mathrm{b}[\mathrm{i}]=\mathrm{a}[\mathrm{i}] \ln (2)^{\mathrm{i}}$. See the coefficients in the EXP routine.

Values of $\exp (\mathrm{x})$ for x outside the convergent range are found by two means. First for $\mathrm{x}<0$, note the relationship:

$$
\begin{equation*}
\exp (-x)=1 / \exp (x) \tag{22}
\end{equation*}
$$

which does require an inverse (see the FPINV and FPINVX routines). For y $>1$, let $\mathrm{y}=\mathrm{n}+\mathrm{f}$ where $\mathrm{n}=1,2, \ldots$ and $0 \leq \mathrm{f}<1$. By substituting y in (20), you get

$$
\begin{equation*}
\exp (x)=2^{n+f}=\left(2^{\mathrm{f}}\right)\left(2^{\mathrm{n}}\right) \tag{23}
\end{equation*}
$$
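The range reduction of (20) through (23) can be sketched in Python. The coefficients below are Taylor values $a[i]=1/i!$ standing in for those of [3] (an assumption), so the accuracy is lower than that of the EXP routine; the names `exp2_frac` and `exp_approx` are hypothetical.

```python
import math

LN2 = math.log(2.0)
# a[i] = 1/i! (Taylor; an assumption standing in for the coefficients of [3]),
# rescaled as b[i] = a[i] * ln(2)**i so that the series is in y = x / ln(2).
B = [LN2 ** i / math.factorial(i) for i in range(8)]


def exp2_frac(f):
    """Series of the form sum(b[i] * f**i) for 2**f, 0 <= f < 1, in nested form."""
    acc = 0.0
    for b in reversed(B):
        acc = acc * f + b
    return acc


def exp_approx(x):
    if x < 0.0:
        return 1.0 / exp_approx(-x)        # identity (22), needs one inverse
    y = x / LN2
    n = math.floor(y)                      # integer part of y
    f = y - n                              # fractional part, 0 <= f < 1
    return math.ldexp(exp2_frac(f), n)     # identity (23): (2**f) * (2**n)


print(exp_approx(1.0), math.exp(1.0))
```

Multiplying by $2^{n}$ only changes the exponent field of the result, which is why the split into n and f is attractive on a floating-point processor.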

## Natural Log Functions

The LN and LNX (natural or base e logarithm) routines use the approximation from [3] (section 4.1.44, p. 69). The expansion comes in the form:

$$
\begin{equation*}
\ln (1+x)=\sum_{i=1}^{8}\left\{a[i] x^{i}\right\}+e(x) \tag{24}
\end{equation*}
$$

where $0 \leq x \leq 1$ and $|e(x)| \leq 3\left(10^{-8}\right)$. The expansion can be used for $\ln (y)$ if the substitution $\mathrm{x}=\mathrm{y}-1$ is applied.

Values of $\ln (x)$ for $x$ outside the convergent range are found in the following way. First, make the substitution $\mathrm{x}=\mathrm{f}\left(2^{\mathrm{n}}\right)$ for $1 \leq \mathrm{f}<2$ and $\mathrm{n}=0,1, \ldots$, and then write:

$$
\begin{equation*}
\log _{2}(x)=\log _{2}\left(f 2^{n}\right)=n+\log _{2}(f), \tag{25}
\end{equation*}
$$

where $\log _{2}(x)$ is the log base 2 of $x$. Using the relationship that $\log _{2}(x)=\ln (x) / \ln (2)$, you get the equation

$$
\begin{equation*}
\ln (x)=\ln (f)+n \ln (2) . \tag{26}
\end{equation*}
$$
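Equation (26) can be illustrated with a few lines of Python. In this sketch `math.frexp` supplies f and n, and `math.log1p` stands in for the eight-term polynomial of (24); both choices are assumptions made for illustration, as is the name `ln_approx`. The decomposition also holds for negative n, i.e. for $x<1$.

```python
import math


def ln_approx(x):
    """ln(x) for x > 0 via (26): ln(x) = ln(f) + n * ln(2), with 1 <= f < 2."""
    frac, exp = math.frexp(x)      # x = frac * 2**exp with 0.5 <= frac < 1
    f = 2.0 * frac                 # 1 <= f < 2
    n = exp - 1                    # x = f * 2**n
    # log1p(f - 1) plays the role of the series for ln(1 + x), 0 <= x < 1.
    return math.log1p(f - 1.0) + n * math.log(2.0)


print(ln_approx(3.0), math.log(3.0))
```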

## Arctangent Functions

The ATAN and ATANX (arc or inverse tangent) routines use the approximation from section 4.4.49, p. 81 in [3]. The series with only even terms for $\operatorname{atan}(\mathrm{x}) / \mathrm{x}$ is transformed to

$$
\begin{equation*}
\operatorname{atan}(x)=x \sum_{i=0}^{8}\left\{a[2 i] x^{2 i}\right\}+x e(x) \tag{27}
\end{equation*}
$$

where $-1 \leq \mathrm{x} \leq 1$ and $|\mathrm{xe}(\mathrm{x})| \leq 2\left(10^{-8}\right)$. Values for $\operatorname{atan}(\mathrm{x})$ for x outside the convergent range are obtained by noting the following identity:

$$
\begin{equation*}
\operatorname{atan}(x)=\operatorname{atan}((x-1) /(x+1))+\mathrm{Pi} / 4 \tag{28}
\end{equation*}
$$

Using the bilinear transformation $\mathrm{y}=(\mathrm{x}-1) /(\mathrm{x}+1)$ assures, at the expense of a divide operation, that $\mathrm{y} \leq 1$ for $\mathrm{x} \geq 1$. The case for $\mathrm{x}<0$ is expediently handled by using $|\mathrm{x}|$ for all calculations except for the final multiply by x in (27).

## Divide and Multiply Functions

The last group of routines in categories I and II are those for the additional arithmetic functions FDIV and FDIVX (floating-point divides), and FMULTX (extended-precision floating-point multiply). The divide operation for the TMS320C30, $\mathrm{a}=\mathrm{b} / \mathrm{c}$, is done by calculating the reciprocal or inverse of the divisor c. Then you compute

$$
\begin{equation*}
\mathrm{a}=\mathrm{b}(1 / \mathrm{c}) . \tag{29}
\end{equation*}
$$

For a normal-precision divide, FDIV finds $1 / \mathrm{c}$ by a call to FPINV. A subsequent normal TMS320C30 floating-point multiply of the rounded inverse provides a suitable quotient. For an extended-precision divide, FDIVX finds 1/c by a call to FPINVX. The inverse is then extended-precision multiplied by the dividend using FMULTX.

The extended-precision floating-point multiply simulated by FMULTX is the key to the implementation of virtually all of the extended-precision functions. The extended multiply is achieved using the normal floating-point multiply of the TMS320C30. For two extended-precision numbers $\mathbf{x a}$ and $\mathbf{x b}$, you can represent each as the sum of two floating-point numbers: $\mathrm{xa}=\mathrm{a}+\mathrm{ea}\left(2^{-24}\right)$ and $\mathrm{xb}=\mathrm{b}+\mathrm{eb}\left(2^{-24}\right)$. The quantities ea and eb are the one-byte extensions of $\mathbf{x a}$ and $\mathbf{x b}$ respectively.

Thus the complete product $\mathrm{xc}=(\mathrm{xa})(\mathrm{xb})$ can be expanded and written as

$$
\begin{equation*}
\mathrm{xc}=(\mathrm{a})(\mathrm{b})+[(\mathrm{a})(\mathrm{eb})+(\mathrm{b})(\mathrm{ea})] 2^{-24}+(\mathrm{ea})(\mathrm{eb}) 2^{-48} \tag{30}
\end{equation*}
$$

The last term in (30) is always less than the 32-bit precision in the mantissa of the final result. Therefore, you need only to compute the first two terms in the product xc. Also, note that all the indicated products in (30) may be computed using a normal-precision native TMS320C30 multiply as long as the terms are collected in extended-precision registers. The additions are also done using the native TMS320C30 add as it is implemented in extended-precision.
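The claim that the last term of (30) can be dropped is easy to check with exact rational arithmetic. The Python sketch below is only an analogy to FMULTX: it splits an IEEE double into its top 24 significant bits and the remainder, rather than into the 24-bit and 8-bit pieces of the TMS320C30 format, and the names `split` and `extended_multiply` are hypothetical.

```python
import math
from fractions import Fraction


def split(x, bits=24):
    """Split x > 0 into hi (top `bits` significant bits) and lo (the rest)."""
    frac, exp = math.frexp(x)                          # x = frac * 2**exp
    hi = math.ldexp(math.floor(math.ldexp(frac, bits)), exp - bits)
    return hi, x - hi                                  # both parts are exact


def extended_multiply(xa, xb):
    a, ea = map(Fraction, split(xa))
    b, eb = map(Fraction, split(xb))
    kept = a * b + (a * eb + b * ea)     # first two terms of the expansion
    dropped = ea * eb                    # last term, neglected by the method
    return kept, dropped


kept, dropped = extended_multiply(math.pi, math.e)
print(float(dropped / kept))
```

Because each low part is smaller than its high part by a factor of at least $2^{23}$, the neglected product is smaller than the kept terms by a factor of at least $2^{46}$.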

## Integer Arithmetic Program Details

Integer routines differ from the floating-point versions because they produce only integer results. If the computation can produce fractional values, then the fraction must be truncated to leave only the integer result.

## Integer Result Log Base 2

The routine ILOG2 is a useful utility for computing the integer value $m$ of the log base 2 of the integer $n$. The result is computed by successive multiplies by 2 (implemented as shifts by 1). The resulting relationship is $n \leq 2^{m}$, such that if $\log _{2}(n)$ is not an exact integer, $m$ is rounded up to the next larger integer. This is useful as it allows the determination of $m$ from any value $n>0$ (e.g. not a power of two), which might require the padding of additional values (zeros) for a radix 2 FFT. This program is very fast because of a delayed branch loop and internally requires only $4(m+1)$ cycles (cached) to do the calculation.
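A minimal Python sketch of the same calculation, assuming the shift-and-compare loop described above (the name `ilog2` is hypothetical and the code is not a transcription of ILOG2):

```python
def ilog2(n):
    """Return the smallest integer m with n <= 2**m, for n > 0."""
    m = 0
    power = 1
    while power < n:
        power <<= 1        # multiply by 2 with a shift
        m += 1
    return m


# A 1000-point record would be zero-padded to 2**ilog2(1000) points.
print(ilog2(1000), 2 ** ilog2(1000))
```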

## Extended Precision Integer Multiply

The IMULT routine is a modified version of the program EXTMPY in the TMS320C3x User's Guide [1]. It has been modified and slightly sped up. The negation of the final 64-bit product is done in two instructions by direct two's complement negation rather than by using one's complement to simulate the same result. The product is computed by breaking the multiplier and multiplicand up into two 16-bit integers each. Thus the full product $c$ of the numbers $a=a u\left(2^{16}\right)+a l$ and $b=b u\left(2^{16}\right)+b l$ is

$$
\begin{equation*}
\mathrm{c}=(\mathrm{au})(\mathrm{bu}) 2^{32}+[(\mathrm{au})(\mathrm{bl})+(\mathrm{bu})(\mathrm{al})] 2^{16}+(\mathrm{al})(\mathrm{bl}) \tag{31}
\end{equation*}
$$

where the powers of two indicated are accomplished by shifts. Note that each product in (31) must be represented as a 32-bit integer. The adds in the sum must be done with care to facilitate the carry between the two final 32-bit components of the product.
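The partial-product scheme of (31), including the carry between the two 32-bit result words, can be sketched in Python. This sketch handles unsigned operands only (the sign handling of IMULT is not reproduced), and the name `mul32x32` is hypothetical.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def mul32x32(a, b):
    """Unsigned 32 x 32 -> 64-bit product, returned as (high word, low word)."""
    au, al = a >> 16, a & MASK16
    bu, bl = b >> 16, b & MASK16
    low = al * bl                        # (al)(bl)
    mid = au * bl + bu * al              # cross terms, weighted by 2**16
    high = au * bu                       # (au)(bu), weighted by 2**32
    total_low = low + ((mid & MASK16) << 16)
    carry = total_low >> 32              # carry out of the low 32-bit word
    return (high + (mid >> 16) + carry) & MASK32, total_low & MASK32


hi, lo = mul32x32(0x12345678, 0x9ABCDEF0)
print(hex(hi), hex(lo))
```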

## Integer Divide

The IDIV routine is a modified version of the program DIVI in the TMS320C3x User's Guide [1]. It has been modified to return the absolute value of the remainder of the integer division. The remainder was originally computed, but was discarded during the extraction process for the quotient. A few more instructions allow the extraction of both the quotient and remainder from the result of the SUBC process. The program IDIV may be used for the computation of the modulo function. The output of IDIV is the pair $\{\mathrm{q},|\mathrm{r}|\}=\mathrm{a} / \mathrm{b}$, with the property:

$$
\begin{equation*}
0 \leq \mathrm{r}=(\mathrm{a} \text { modulo } \mathrm{b})<\mathrm{b} \tag{32}
\end{equation*}
$$

for $\mathrm{a}>0$ and $\mathrm{b}>0$. The complete relationship is, by definition, $\mathrm{a}=\mathrm{bq}+\mathrm{r}$, for positive a and b .
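The output pair $\{q,|r|\}$ can be illustrated as follows. The sketch assumes that the quotient is truncated toward zero for negative operands, which is not stated in the text; the name `idiv` is hypothetical.

```python
def idiv(a, b):
    """Return (q, |r|) for b != 0, with q truncated toward zero (assumption)."""
    q, r = divmod(abs(a), abs(b))     # divide the magnitudes
    if (a < 0) != (b < 0):
        q = -q                        # apply the sign of the quotient
    return q, r


print(idiv(17, 5), idiv(-17, 5))
```

For positive a and b the result satisfies $a=bq+r$ with $0 \leq r<b$, so the second element is the modulo function.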

## Vector Utility Routines

Vector utilities are functions which operate on arrays of numbers. Some utilities, like dot products and convolutions, are simple. Other utilities, like those presented here, are more involved.

## Complex and Complex Conjugate Array Multiplies

The array routine *CORMULT computes the point-by-point complex conjugate multiply of two complex arrays. If the arrays are c1 and c2, and are of length $n$, then:

$$
\begin{equation*}
\mathrm{c} 1[\mathrm{k}] \leftarrow \mathrm{c} 1[\mathrm{k}] \operatorname{conj}(\mathrm{c} 2[\mathrm{k}]), \mathrm{k}=1, \ldots, \mathrm{n} \tag{33}
\end{equation*}
$$

where $\leftarrow$ means replaces. Each complex array is assumed to be stored as two separate arrays, i.e. $\{c 1\}=\{x 1, y 1\}$ and $\{c 2\}=\{x 2, y 2\}$. In cartesian complex representation, (33) becomes

$$
\begin{equation*}
(x 1+i y 1) \leftarrow(x 1+i y 1)(x 2-i y 2) \tag{34}
\end{equation*}
$$

where i represents the imaginary constant sqrt(-1). Separating the real and imaginary parts, we have:

$$
\begin{equation*}
\mathrm{x} 1 \leftarrow \mathrm{x} 1 \mathrm{x} 2+\mathrm{y} 1 \mathrm{y} 2, \mathrm{y} 1 \leftarrow \mathrm{y} 1 \mathrm{x} 2-\mathrm{y} 2 \mathrm{x} 1 \tag{35}
\end{equation*}
$$

This operation can be used for the frequency domain correlation of two FFTs to implement time domain correlation.

On the other hand, the array routine *CONMULT computes the point-by-point complex multiply of two complex arrays. If the arrays are c1 and c2, and are each of length n, then

$$
\begin{equation*}
\mathrm{c} 1[\mathrm{k}] \leftarrow \mathrm{c} 1[\mathrm{k}](\mathrm{c} 2[\mathrm{k}]), \mathrm{k}=1, \ldots, \mathrm{n} \tag{36}
\end{equation*}
$$

In cartesian complex representation, (36) becomes

$$
\begin{equation*}
(\mathrm{x} 1+\mathrm{iy} 1) \leftarrow(\mathrm{x} 1+\mathrm{iy} 1)(\mathrm{x} 2+\mathrm{iy} 2) \tag{37}
\end{equation*}
$$

Separating the real and imaginary parts results in

$$
\begin{equation*}
\mathrm{x} 1 \leftarrow \mathrm{x} 1 \mathrm{x} 2-\mathrm{y} 1 \mathrm{y} 2, \mathrm{y} 1 \leftarrow \mathrm{y} 1 \mathrm{x} 2+\mathrm{y} 2 \mathrm{x} 1 \tag{38}
\end{equation*}
$$

This operation can be used for the frequency domain convolution of two FFTs to implement digital filtering.
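Both in-place updates, (35) and (38), are sketched below in Python for complex arrays stored as separate real and imaginary lists. The function names simply mirror the routine names and the code is not a transcription of the assembly. Note that both new parts must be computed before either old value is overwritten.

```python
def cormult(x1, y1, x2, y2):
    """c1[k] <- c1[k] * conj(c2[k]), in place, following (35)."""
    for k in range(len(x1)):
        re = x1[k] * x2[k] + y1[k] * y2[k]
        im = y1[k] * x2[k] - y2[k] * x1[k]
        x1[k], y1[k] = re, im


def conmult(x1, y1, x2, y2):
    """c1[k] <- c1[k] * c2[k], in place, following (38)."""
    for k in range(len(x1)):
        re = x1[k] * x2[k] - y1[k] * y2[k]
        im = y1[k] * x2[k] + y2[k] * x1[k]
        x1[k], y1[k] = re, im


# c1 = [1 + 3i, 2 - i], c2 = [0.5 + 2i, 4 + i]
x1, y1 = [1.0, 2.0], [3.0, -1.0]
cormult(x1, y1, [0.5, 4.0], [2.0, 1.0])
print(x1, y1)
```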

## Complex Array Bit Reversal

The array routine *CBITREV executes an in-place bit reverse permutation on two arrays simultaneously. This operation is generally used for index scrambling before a DIT FFT (decimation in time, see CIFFT2), or after a DIF FFT (decimation in frequency, see CFFFT2) for index unscrambling. Therefore, *CBITREV is useful in permuting complex arrays stored as two separate arrays which are associated with radix 2 FFTs. The program uses the bit reverse indexing feature of the TMS320C30 to achieve this function. The loop in *CBITREV is nearly as efficient in permuting two arrays together as permuting one array alone. This is due to the use of parallel load and store instructions and a delayed (single cycle) conditional branch.
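The permutation itself can be sketched in Python. The TMS320C30 produces the reversed index in hardware through its bit-reversed addressing mode; the helper `reverse_bits` below is a software stand-in, and both function names are hypothetical.

```python
def reverse_bits(i, bits):
    """Return i with its lowest `bits` bits in reversed order."""
    j = 0
    for _ in range(bits):
        j = (j << 1) | (i & 1)
        i >>= 1
    return j


def bit_reverse_permute(x, y):
    """Permute two equal-length arrays in place; the length must be a power of two."""
    n = len(x)
    bits = n.bit_length() - 1
    for i in range(n):
        j = reverse_bits(i, bits)
        if i < j:                      # swap each pair only once
            x[i], x[j] = x[j], x[i]
            y[i], y[j] = y[j], y[i]


re = list(range(8))
im = [10 * v for v in re]
bit_reverse_permute(re, im)
print(re, im)
```

Applying the permutation twice restores the original order, so the same routine serves for scrambling and unscrambling.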

## Floating Point Conversions

The array routines *FMIEEE and *TOIEEE are vectorized versions of their original scalar counterparts FMIEEE and TOIEEE. Both routines do fast conversions from or to IEEE format by avoiding dealing with special rare cases. Also, both programs convert the numbers in the arrays in-place which destroys the original data. These array versions of the format conversion routines are much faster than calling the scalar version routines in a special loop. These routines also have their own internal, shared constant table for conversions.

## Vector Primitives

The array routines *VECMULT, *CONMOV, and *VECMOV are a useful suite of efficient programs for simple array operations. The first routine, *VECMULT, performs the simple operation $\mathrm{x}[\mathrm{k}] \leftarrow \mathrm{x}[\mathrm{k}] \mathrm{c}$, which is a scalar-vector multiply useful in uniformly scaling an array by a constant c. You can use this for scaling arrays after an inverse FFT by choosing $c=1 / n$. The next routine, *CONMOV, performs the operation $\mathrm{x}[\mathrm{k}] \leftarrow \mathrm{c}$, which is useful in filling or initializing any portion of an array to a single constant $c$. The last routine, *VECMOV, performs the simple operation $x[k] \leftarrow y[k]$, an array move, and is, therefore, generally useful.

## FFT Routines

This category contains the two complementary radix 2 complex FFT programs CFFFT2 and CIFFT2. These programs differ from previously available TMS320C30 FFT programs in that they operate on complex arrays which are stored as two separate and independent real arrays. Both routines do the FFTs in-place and do no index permutations or constant scaling (multiplication). Also these programs require only a $3 / 4$ cycle external, pre-computed sine table. As with previous FFT programs, these, too, have a special multiply-less butterfly loop for the occurrence of unity twiddle or complex rotation factors.

The routine CFFFT2 is a DIF radix 2 complex forward FFT program and thus assumes a normally indexed pair of input arrays. The output array is bit-reverse permuted and normally must be unscrambled to be of any use (see *CBITREV). The routine CIFFT2 is a DIT radix 2 inverse FFT program and thus assumes a bit-reverse indexed pair of input arrays. A normally indexed complex frequency spectrum must be bit-reverse scrambled before using CIFFT2 (again, see *CBITREV). On the other hand, the output from this inverse FFT is in normal indexed order, but lacks the traditional scaling by the factor of $1 / \mathrm{n}$. Therefore, back-to-back calls of CFFFT2 and CIFFT2 will return the original complex array (in proper order) but multiplied by a factor of n . Consult the handbook by Burrus and Parks [4] for additional FFT algorithm details.
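The index ordering and the missing $1/n$ scaling can be demonstrated with a small Python model. It is a sketch under stated assumptions, not the CFFFT2 or CIFFT2 code: it uses Python complex numbers instead of two real arrays, computes twiddle factors with `cmath.exp` instead of a sine table, and the names `fft_dif` and `ifft_dit` are hypothetical.

```python
import cmath


def fft_dif(x):
    """In-place radix-2 DIF forward FFT; the output is left in bit-reversed order."""
    n = len(x)
    span = n // 2
    while span >= 1:
        for start in range(0, n, 2 * span):
            for k in range(span):
                w = cmath.exp(-2j * cmath.pi * k / (2 * span))
                a, b = x[start + k], x[start + k + span]
                x[start + k] = a + b
                x[start + k + span] = (a - b) * w
        span //= 2


def ifft_dit(x):
    """In-place radix-2 DIT inverse FFT; bit-reversed input, normal output, unscaled."""
    n = len(x)
    span = 1
    while span < n:
        for start in range(0, n, 2 * span):
            for k in range(span):
                w = cmath.exp(2j * cmath.pi * k / (2 * span))
                a, b = x[start + k], x[start + k + span] * w
                x[start + k] = a + b
                x[start + k + span] = a - b
        span *= 2


data = [complex(k, -0.5 * k) for k in range(8)]
work = list(data)
fft_dif(work)      # spectrum, bit-reversed order
ifft_dit(work)     # back to normal order, multiplied by n = 8
print(work[3], 8 * data[3])
```

In this round trip no bit-reverse permutation is needed, because the inverse transform consumes exactly the ordering that the forward transform produces.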

## Linear Algebra Routines

The routines *SOLUTN and *SOLUTNX are the normal- and extended-precision implementations of the algorithm for solving simultaneous linear equations. This algorithm is the modified Gauss-Jordan elimination without (off diagonal) pivoting. This is a simple algorithm which is intended for use with well-conditioned systems of dense linear equations of moderate size. Well conditioned means that the system of linear equations is linearly independent or non-singular. This subject and further algorithm details are to be found in chapter 2 of [5] by Press et al, or any other book on the numerical techniques of linear algebra. This algorithm is suitable for a wide range of problems requiring the solution of a system of linear equations, e.g. exact or least squares polynomial fitting.

A simple system of linear equations has the form:

$$
\begin{align*}
& \mathrm{A}[1,1] \mathrm{x}[1]+\mathrm{A}[1,2] \mathrm{x}[2]+\ldots+\mathrm{A}[1, \mathrm{n}] \mathrm{x}[\mathrm{n}]=\mathrm{y}[1],  \tag{39}\\
& \mathrm{A}[2,1] \mathrm{x}[1]+\mathrm{A}[2,2] \mathrm{x}[2]+\ldots+\mathrm{A}[2, \mathrm{n}] \mathrm{x}[\mathrm{n}]=\mathrm{y}[2], \\
& \qquad \vdots \\
& \mathrm{A}[\mathrm{n}, 1] \mathrm{x}[1]+\mathrm{A}[\mathrm{n}, 2] \mathrm{x}[2]+\ldots+\mathrm{A}[\mathrm{n}, \mathrm{n}] \mathrm{x}[\mathrm{n}]=\mathrm{y}[\mathrm{n}] .
\end{align*}
$$

Symbolically, you may write $\mathrm{A}=\mathrm{A}[\mathrm{i}, \mathrm{j}]$ as the $n \times n$ matrix of coefficients, $x=x[i]$ as the unknown independent variable (column) vector, and $y=y[j]$ as the dependent variable (row) vector. Thus (39) can be written in shorthand form as $\mathrm{Ax}=\mathrm{y}$ or $\mathrm{Ax}-\mathrm{y}=0$, where the multiplication indicated is a matrix-vector multiply. The fundamental problem in linear algebra, then, is to find the solution vector $x$. In fact, you may desire to find the m different solutions to m sets of linear equations which share the same coefficient matrix $A$, i.e. $A x[k]=y[k]$, for $k=1, \ldots, m$.

You can solve the general problem just stated by using *SOLUTN, or with more accuracy with *SOLUTNX. This is done by constructing a tableau B (table of coefficients), which is simply the coefficient matrix A (in row major storage format) with the negative of the y vector(s) appended (:) as m extra columns to A. Thus you would have $\mathrm{B}=\mathrm{A}:-\mathrm{y}$ as your problem, where $B$ is an $n$ by $n+m$ matrix and typically $m=1$. Thus, for the common case of $m=1$, the input array B can be written as:

$$
\begin{align*}
& \mathrm{A}[1,1], \mathrm{A}[1,2], \ldots, \mathrm{A}[1, \mathrm{n}],-\mathrm{y}[1],  \tag{40}\\
& \mathrm{A}[2,1], \mathrm{A}[2,2], \ldots, \mathrm{A}[2, \mathrm{n}],-\mathrm{y}[2], \\
& \qquad \vdots \\
& \mathrm{A}[\mathrm{n}, 1], \mathrm{A}[\mathrm{n}, 2], \ldots, \mathrm{A}[\mathrm{n}, \mathrm{n}],-\mathrm{y}[\mathrm{n}] .
\end{align*}
$$

After the *SOLUTN routine is executed, the matrix $\mathrm{C}=\mathrm{A}^{\prime}: \mathrm{x}$ appears, where the column(s) beyond the original coefficients $A$ (the $y[k]$ vectors) have been replaced by the solution vector(s) $x[k]$. The new matrix $A^{\prime}$ is a partially computed version of the inverse of the matrix A. The complete inverse of A, which is normally computed by the standard Gauss-Jordan scheme, is rarely needed. Therefore, a faster modified algorithm has been used which does about half the work.

This simple method used for solving systems of linear equations has two restrictions.

1. As the pivoting operation (exchange of $x$ and $y$ variables) always starts with $\mathrm{A}[1,1]$ and proceeds down the diagonal, $\mathrm{A}[1,1]$ must be non-zero. This is because, in the exchange process, you must divide by the pivot element. A zero coefficient at $\mathrm{A}[1,1]$ may be moved by reordering the variable indices by appropriately swapping rows and columns in A and in y.
2. The maximum absolute value of the elements in A must be approximately unity. This is necessary to assure that no pivot element is encountered which is smaller in magnitude than $10^{-8}$ for *SOLUTN, and $10^{-10}$ for *SOLUTNX. This restriction monitors the system condition and assures an adequately accurate solution, but the final solution should always be verified by substitution. This is done by inspecting the elements of the error vector $\mathrm{e}=\mathrm{Ax}-\mathrm{y}$ computed by using the solution x, and the original A and y.
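The tableau layout and the verification by substitution can be illustrated with a short Python sketch. It uses plain Gauss-Jordan elimination without pivoting as a stand-in for the modified algorithm of *SOLUTN (an assumption), so it does more work than the routine described above; the name `solve_tableau` and the example system are hypothetical.

```python
def solve_tableau(b_rows):
    """Gauss-Jordan elimination without pivoting on the n x (n+m) tableau A : -y.

    Returns the m solution vectors as the columns of an n x m list of lists.
    """
    n = len(b_rows)
    t = [row[:] for row in b_rows]               # work on a copy
    for p in range(n):
        pivot = t[p][p]                          # must be non-zero
        t[p] = [v / pivot for v in t[p]]
        for r in range(n):
            if r != p:
                factor = t[r][p]
                t[r] = [v - factor * pv for v, pv in zip(t[r], t[p])]
    # Each row now reads x[i] + c[i] = 0, so x = -c.
    return [[-v for v in row[n:]] for row in t]


# Hypothetical system: 2*x1 + x2 = 3 and x1 + 3*x2 = 5.
A = [[2.0, 1.0], [1.0, 3.0]]
y = [3.0, 5.0]
B = [A[i] + [-y[i]] for i in range(2)]           # tableau A : -y
x = [row[0] for row in solve_tableau(B)]
e = [sum(A[i][j] * x[j] for j in range(2)) - y[i] for i in range(2)]
print(x, e)                                      # e is the error vector Ax - y
```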

## Summary

This report presented a set of routines that can be used in digital signal processing applications. The appendix contains the source code of these routines. This source code can also be obtained from the Texas Instruments Electronic Bulletin Board (713) 274-2323. If there are comments or corrections, please contact the author of this report:

Mr. Gary Sitton
Gas Light Software
5211 Yarwell
Houston, TX 77096
Tel (713) 729-1257

## References

[1] TMS320C3x User's Guide (literature number SPRU031), Texas Instruments, Dallas, TX, August 1988.
[2] Hastings, C. Jr., "Approximations for Digital Computers", Princeton University Press, Princeton N.J., 1955.
[3] Abramowitz, M. and Stegun, I.A. (Editors), Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, National Bureau of Standards (Applied Mathematics Series 55), Washington D.C., 1964.
[4] Burrus, C.S. and Parks, T.W., "DFT/FFT and Convolution Algorithms", John Wiley and Sons, New York N.Y., 1985.
[5] Press, W.H., Flannery, B.P., Teukolsky, S.A., and Vetterling, W.T., Numerical Recipes in C: The Art of Scientific Computing, Cambridge University Press, Cambridge England, 1988.

```
******************************************************************************
* PROGRAM: sMATH.ASM
    NORMAL FLOATING-POINT (32-BIT) MATH FUNCTIONS
    sMATH.ASM CONSISTS OF THE FOLLOWING ROUTINES:
    SIN - COMPUTES A 7D SINE(X) FOR ALL X IN RADIANS.
    COS - COMPUTES A 7D COSINE(X) FOR ALL X IN RADIANS.
    EXP - COMPUTES A 7D EXP(X) FOR ALL |X| <= 88.
    LN - COMPUTES A 7D LN(X) FOR ALL X > 0.
    ATAN - COMPUTES A 7D ATAN(X) FOR ALL X IN RADIANS.
    SQRT - COMPUTES AN 8D SQRT(X) FOR ALL X >= 0.
    FPINV - COMPUTES AN 8D 1/X FOR ALL X /= 0.
    FDIV - COMPUTES AN 8D X/Y FOR ALL X AND ALL Y /= 0.
******************************************************************************
```

RND R0
R0, R4
; ROUND X
;
COSINE ENTRY POINT
ECOS:
SCALE AND MAP VARIABLE

| ABSF | R0 | ; R0 <= \|X\| |
| :---: | :---: | :---: |
| LDF | R0,R1 | ; R1 <= RND \|X\| |
| MPYF | @NORM,R1 | ; `R1 <= X*2/PI` |
| FIX | R1,IR0 | ; IR0 <= INTEGER QUADRANT Q |
| FLOAT | IR0,R2 | ; R2 <= FLOATING QUADRANT Q |
| SUBF | R2,R1,R0 | ; R0 <= X, -1<X<1 |
| NEGF | R0,R3 | ; R3 <= -X |
| ADDI | 1,IR0 | ; IR0 <= Q+1 |
| AND | 3,IR0 | ; IR0 <= TABLE INDEX |
| TSTB | 2,IR0 | ; LOOK AT 2ND LSB |
| LDFNZ | R3,R0 | ; IF 1 THEN R0 <= -X |
| LDI | @ACON,AR0 | ; AR0 -> CONST. TABLE |
| ADDF | `*+AR0(IR0),R0` | ; FINAL MAPPING, R0 <= X+C |
| NEGF | R0,R3 | ; R3 <= -X |
| LDI | @ACOF,AR0 | ; AR0 -> COEFF. TABLE |
| POP | DP | ; UNSAVE DP |

Evaluate truncated (ODD) series

| MPYF | R0,R0,R2 | ; `R2 <= X**2` |
| :---: | :---: | :---: |
| RND | R2 | ; `ROUND X**2` |
| MPYF | `*AR0--,R2,R1` | ; `R1 <= X**2*C11` |
| ADDF | `*AR0--,R1` | ; R1 <= C9+R1 |
| MPYF | R2,R1 | ; `R1 <= X**2*(C9+R1)` |
| ADDF | `*AR0--,R1` | ; R1 <= C7+R1 |
| MPYF | R2,R1 | ; `R1 <= X**2*(C7+R1)` |
| ADDF | `*AR0--,R1` | ; R1 <= C5+R1 |
| RND | R1 | ; `ROUND BEFORE *` |
| MPYF | R2,R1 | ; `R1 <= X**2*(C5+R1)` |
| ADDF | `*AR0--,R1` | ; R1 <= C3+R1 |
| RND | R1 | ; `ROUND BEFORE *` |
| MPYF | R2,R1 | ; `R1 <= X**2*(C3+R1)` |
| ADDF | `*AR0,R1` | ; R1 <= C1+R1 |

FINISH UP SERIES AND RETURN

| LDF | R4,R4 | ; TEST ORIGINAL X |
| :---: | :---: | :---: |
| LDFN | R3,R0 | ; IF X < 0 THEN R0 <= -X |
| POP | R2 | ; R2 <= RETURN ADDRESS |
| BUD | R2 | ; RETURN (DELAYED) |
| RND | R0 | ; ROUND BEFORE * |
| RND | R1 | ; ROUND BEFORE * |
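
The quadrant mapping and odd-series evaluation in the listing above can be mirrored in a few lines of Python. This is only a sketch under stated assumptions: the coefficient table is not legible in this excerpt, so plain Taylor coefficients of sin(pi/2 * t) stand in for the listing's C1 through C11, and the name `sin_mapped` is hypothetical. The mapping constants follow the `CON` table (-1.0, 0.0, 1.0, 0.0) that appears in the extended-precision SINX data section.

```python
import math

# Assumption: Taylor coefficients of sin(pi/2 * t) stand in for C1..C11.
COEFFS = [(-1) ** k * (math.pi / 2) ** (2 * k + 1) / math.factorial(2 * k + 1)
          for k in range(6)]
CON = [-1.0, 0.0, 1.0, 0.0]           # mapping constants, indexed by (Q + 1) & 3

def sin_mapped(x):
    scaled = abs(x) * 2.0 / math.pi   # |X| * 2/PI
    q = int(scaled)                   # integer quadrant Q
    t = scaled - q                    # fractional part, 0 <= t < 1
    index = (q + 1) & 3               # table index
    if index & 2:                     # 2nd LSB set: reflect the argument
        t = -t
    t += CON[index]                   # final mapping, t <= t + C
    t2 = t * t
    acc = COEFFS[5]                   # C11
    for c in reversed(COEFFS[:5]):    # C9, C7, C5, C3, C1 (Horner form)
        acc = c + t2 * acc
    result = t * acc
    return -result if x < 0 else result

def cos_mapped(x):
    # Same idea as the COS entry: shift the argument by PI/2 and reuse the sine path.
    return sin_mapped(x + math.pi / 2)
```

With Taylor coefficients the truncation error stays below about 1e-7 on the mapped interval; a tuned coefficient set such as the one the listing loads would be expected to do better.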


* PROGRAM: COS
* WRITTEN BY: GARY A. SITTON
* GAS LIGHT SOFTWARE
* HOUSTON, TEXAS
* MARCH 1989.
* COSINE FUNCTION: R0 <= COS(R0).
* APPROXIMATE ACCURACY: 7 DECIMAL DIGITS.
* INPUT RESTRICTIONS: NONE.
* REGISTERS FOR INPUT: R0 (ARGUMENT IN RADIANS).
* REGISTERS USED AND RESTORED: DP AND SP.
* REGISTERS ALTERED: AR0, IR0, AND R0-4.
* REGISTERS FOR OUTPUT: R0.
* ROUTINES NEEDED: ECOS (SIN).
* EXECUTION CYCLES (MIN, MAX): 46, 46.
*
* NOTE: USES SHFT CONSTANT FROM SIN PROGRAM!

; EXTERNAL PROGRAM NAMES
.GLOBL COS
.GLOBL ECOS
.TEXT
; START OF COS PROGRAM
COS:

| PUSH | DP | ; SAVE DP |
| :--- | :--- | :--- |
| LDP | @ACOF | ; LOAD DATA PAGE POINTER |
| BRD | ECOS | ; R0 <= COS(X) = SIN(X'), (DELAYED) |
| RND | R0 | ; ROUND X |
| ADDF | @SHFT,R0 | ; R0 <= X' = X + PI/2 |
| LDF | R0,R4 | ; R4 <= X' |

RETURN OCCURS FROM SIN

```
*****************************************************
* PROGRAM: EXP
* WRITTEN BY: GARY A. SITTON
*             GAS LIGHT SOFTWARE
*             HOUSTON, TEXAS
*             MARCH 1989.
* EXPONENTIAL FUNCTION: R0 <= EXP(R0).
* APPROXIMATE ACCURACY: 7 DECIMAL DIGITS.
* INPUT RESTRICTIONS: |R0| <= 88.0.
* REGISTERS FOR INPUT: R0.
* REGISTERS USED AND RESTORED: DP AND SP.
* REGISTERS ALTERED: R0-4.
* REGISTERS FOR OUTPUT: R0.
* ROUTINES NEEDED: FPINV.
* EXECUTION CYCLES (MIN, MAX): 44 (R0 <= 0), 70.
********************************************************
```


ENRM .FLOAT 1.442695041 ; 1/LN(2)
; POLYNOMIAL COEFFS. FOR 2**-X, 0 <= X < 1.

|  | .FLOAT | 1.0000000000 | ; C0 |
| :---: | :---: | :---: | :---: |
|  | .FLOAT | -0.693147180 | ; C1 |
|  | .FLOAT | 0.240226469 | ; C2 |
|  | .FLOAT | -0.055503654 | ; C3 |
|  | .FLOAT | 0.009615978 | ; C4 |
|  | .FLOAT | -0.001328240 | ; C5 |
|  | .FLOAT | 0.000147491 | ; C6 |
| C7 | .FLOAT | -0.000010863 | ; C7 |

AC7 .WORD C7
.TEXT
; START OF EXP PROGRAM
EXP:

; SCALE VARIABLE X
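
The body of the EXP routine is not legible in this excerpt, but the constants above suggest its shape: scale X by 1/LN(2), split off the integer part, and evaluate the degree-7 polynomial for 2**-x on the fraction. The Python sketch below follows that reading. The coefficients and `ENRM` are taken from the table above; the control flow (computing EXP(-|X|) first and inverting for positive X, as the cycle counts and the FPINV dependency hint) is an assumption, and the function names are hypothetical.

```python
import math

ENRM = 1.442695041                    # 1/LN(2), from the listing
COEFFS = [1.0, -0.693147180, 0.240226469, -0.055503654,
          0.009615978, -0.001328240, 0.000147491, -0.000010863]  # C0..C7

def pow2_neg_fraction(f):
    """Evaluate the polynomial for 2**-f, 0 <= f < 1, in Horner form."""
    acc = COEFFS[-1]
    for c in reversed(COEFFS[:-1]):
        acc = c + f * acc
    return acc

def exp_sketch(x):
    scaled = abs(x) * ENRM            # |X| / LN(2)
    n = int(scaled)                   # integer part becomes an exponent shift
    f = scaled - n                    # fraction goes through the polynomial
    small = math.ldexp(pow2_neg_fraction(f), -n)   # EXP(-|X|)
    # Assumed: positive arguments are handled by one reciprocal at the end.
    return small if x <= 0 else 1.0 / small
```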



```
RND R0 ; ROUND BEFORE *
MPYF R1,R0 ; R0 <= X**2*(C7 + R0)
ADDF *AR0--,R0 ; R0 <= C5 + R0
RND R0 ; ROUND BEFORE *
; R0 <= X**2*(C5 + R0)
; R0 <= C3 + R0
RND R0
RND R0 ; ROUND BEFORE *
ADDF *AR0--,R0,R1 ; R1 <= C1 + R0
```

FINISH UP, POST SCALE BY C AND RETURN

| POP | R2 | ; R2 <= RETURN ADDRESS |
| :--- | :--- | :--- |
| BUD | R2 | ; RETURN (DELAYED) |
| RND | R1 | ; ROUND BEFORE * |


| LDI | R2,R4 | ; R4 <= R1 |
| :---: | :---: | :---: |
| LSH | 8,R1 | ; R1 <= R1 EXP. REMOVED |
| ASH | -1,R2 | ; R2 <= R2 WITH -E/2 EXP. |
| PUSH | R2 | ; SAVE R2 AS INTEGER |
| POPF | R2 | ; R2 <= FLT. PT. |
| LDE | R2,R1 | ; R1 <= (1-M/2)*2**-E/2 |
| LDF | @CNST3,R2 | ; R2 <= 1.1... FOR ODD E |
| LSH | 7,R4 | ; TEST LSB OF E (AS SIGN) |
| LDFNN | @CNST4,R2 | ; IF E EVEN R2 <= 0.78... |
| MPYF | R2,R1 | ; R1 <= CORRECTED ESTIMATE |
| POP | DP | ; UNSAVE DP |

; GENERATE V/2 (USES MPYF).

| MPYF | CNST1,R0 | ; R0 <= V/2 TRUNC. |
| :--- | :--- | :--- |
| RND | R0 | ; R0 <= RND V/2 |

; NEWTON ITERATION FOR Y(X) = X - V**-2 = 0 ...

| MPYF | R1,R1,R2 | ; R2 <= X[0]**2 |
| :---: | :---: | :---: |
| MPYF | R0,R2 | ; R2 <= (V/2)*X[0]**2 |
| SUBRF | CNST2,R2 | ; R2 <= 1.5 - (V/2)*X[0]**2 |
| MPYF | R2,R1 | ; R1 <= X[1] = X[0]*(1.5 - (V/2)*X[0]**2) |
| MPYF | R1,R1,R2 | ; R2 <= X[1]**2 |
| MPYF | R0,R2 | ; R2 <= (V/2)*X[1]**2 |
| SUBRF | CNST2,R2 | ; R2 <= 1.5 - (V/2)*X[1]**2 |
| MPYF | R2,R1 | ; R1 <= X[2] = X[1]*(1.5 - (V/2)*X[1]**2) |
| MPYF | R1,R1,R2 | ; R2 <= X[2]**2 |
| MPYF | R0,R2 | ; R2 <= (V/2)*X[2]**2 |
| SUBRF | CNST2,R2 | ; R2 <= 1.5 - (V/2)*X[2]**2 |
| MPYF | R2,R1 | ; R1 <= X[3] = X[2]*(1.5 - (V/2)*X[2]**2) |
| RND | R1 | ; ROUND BEFORE * |
| MPYF | R1,R1,R2 | ; R2 <= X[3]**2 |
| RND | R2 | ; ROUND BEFORE * |
| MPYF | R0,R2 | ; R2 <= (V/2)*X[3]**2 |
| SUBRF | CNST2,R2 | ; R2 <= 1.5 - (V/2)*X[3]**2 |
| RND | R2 | ; ROUND BEFORE * |
| MPYF | R2,R1 | ; R1 <= X[4] = X[3]*(1.5 - (V/2)*X[3]**2) |

; INVERT FINAL RESULT AND RETURN

| POP | R2 | ; R2 <= RETURN ADDRESS |
| :--- | :--- | :--- |
| BUD | R2 | ; RETURN (DELAYED) |
| RND | R3 | ; ROUND BEFORE * |
| RND | R1 | ; ROUND BEFORE * |
| MPYF | R1,R3,R0 | ; R0 <= SQRT(V) = V*SQRT(1/V) |
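
The square-root listing above never divides: it refines an estimate of 1/SQRT(V) with the update X <= X*(1.5 - (V/2)*X**2) and multiplies by V at the end. A rough Python model of that structure is given below. The seed uses the same (1 - M/2) form as the listing, but the odd-exponent correction of 1/sqrt(2) is an assumption standing in for the listing's `CNST3`/`CNST4` values, which are only partly legible; the function names are hypothetical.

```python
import math

def inv_sqrt_seed(v):
    """Rough 1/sqrt(v) built from the exponent and mantissa of v (v > 0)."""
    m, e = math.frexp(v)              # v = m * 2**e with 0.5 <= m < 1
    big_m = 2.0 * m - 1.0             # v = (1 + M) * 2**E with 0 <= M < 1
    big_e = e - 1
    half = big_e // 2                 # floor(E / 2)
    x0 = math.ldexp(1.0 - big_m / 2.0, -half)
    if big_e % 2 == 1:                # odd exponent leaves a factor of sqrt(2)
        x0 *= 0.7071067811865476      # assumed correction, not CNST3/CNST4
    return x0

def newton_sqrt(v, iterations=5):
    if v == 0.0:
        return 0.0
    half_v = 0.5 * v                  # V/2, formed once as in the listing
    x = inv_sqrt_seed(v)
    for _ in range(iterations):
        x = x * (1.5 - half_v * x * x)    # Newton step toward 1/sqrt(V)
    return v * x                      # SQRT(V) = V * SQRT(1/V)
```

Because each step roughly squares the relative error, a seed within about 30 percent of the true value reaches double-precision rounding noise in five or six steps.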



```
* PROGRAM: FPINV
* WRITTEN BY: GARY A. SITTON
*             GAS LIGHT SOFTWARE
*             HOUSTON, TEXAS
*             MARCH 1989.
* FLOATING POINT INVERSE: R0 <= 1/R0.
* APPROXIMATE ACCURACY: 8 DECIMAL DIGITS.
* INPUT RESTRICTIONS: R0 != 0.0.
* REGISTERS FOR INPUT: R0.
* REGISTERS USED AND RESTORED: DP AND SP.
* REGISTERS ALTERED: R0-2 AND R4.
* REGISTERS FOR OUTPUT: R0.
* ROUTINES NEEDED: NONE.
* EXECUTION CYCLES (MIN, MAX): 33, 33.
```

; EXTERNAL PROGRAM NAMES
.GLOBL FPINV
; INTERNAL CONSTANTS
.DATA
ONE .SET 1.0
TWO .SET 2.0
MSK .WORD 0FF7FFFFFH
.TEXT
; START OF FPINV PROGRAM

FPINV:

| LDF | R0,R0 | ; TEST F |
| :--- | :--- | :--- |
| RETSZ |  | ; RETURN NOW IF F = 0 |

; GET APPROXIMATION TO 1/F. FOR F = (1+M)*2**E
; AND 0 <= M < 1, USE: X[0] = (1-M/2)*2**-E

| PUSH | DP | ; SAVE DATA PAGE POINTER |
| :---: | :---: | :---: |
| LDP | @MSK | ; LOAD DATA PAGE POINTER |
| PUSHF | R0 | ; SAVE AS FLT. PT. F = (1+M)*2**E |
| POP | R1 | ; FETCH BACK AS INTEGER |
| XOR | @MSK,R1 | ; COMPLEMENT E & M BUT NOT SIGN BIT |
| PUSH | R1 | ; SAVE AS INTEGER, AND BY MAGIC... |
| POPF | R1 | ; R1 <= X[0] = (1-M/2)*2**-E. |
| POP | DP | ; UNSAVE DP |

; NEWTON ITERATION FOR: Y(X) = X - 1/F = 0 ...

| MPYF | R1,R0,R4 | ; R4 <= F*X[0] |
| :---: | :---: | :---: |
| SUBRF | TWO,R4 | ; R4 <= 2 - F*X[0] |
| MPYF | R4,R1 | ; R1 <= X[1] = X[0]*(2 - F*X[0]) |
| MPYF | R1,R0,R4 | ; R4 <= F*X[1] |
| SUBRF | TWO,R4 | ; R4 <= 2 - F*X[1] |
| MPYF | R4,R1 | ; R1 <= X[2] = X[1]*(2 - F*X[1]) |
| MPYF | R1,R0,R4 | ; R4 <= F*X[2] |
| SUBRF | TWO,R4 | ; R4 <= 2 - F*X[2] |
| MPYF | R4,R1 | ; R1 <= X[3] = X[2]*(2 - F*X[2]) |


| RND | R0,R4 | ; ROUND F BEFORE LAST MULTIPLY |
| :--- | :--- | :--- |
| RND | R1,R0 | ; ROUND X[3] BEFORE MULTIPLIES |
| MPYF | R0,R4 | ; R4 <= F*X[3] = 1 + EPS |

FINISH ITERATION AND RETURN

| POP | R2 | ; R2 <= RETURN ADDRESS |
| :--- | :--- | :--- |
| BUD | R2 | ; RETURN (DELAYED) |
| SUBRF | ONE,R4 | ; R4 <= 1 - F*X[3] = EPS |
| MPYF | R0,R4 | ; R4 <= X[3]*EPS |
| ADDF | R4,R1,R0 | ; R0 <= X[4] = (X[3]*(1 - (F*X[3]))) + X[3] |
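
The reciprocal routine above combines a cheap seed, X[0] = (1-M/2)*2**-E, with the Newton update X <= X*(2 - F*X), and writes the last step in the residual form X + X*(1 - F*X). The Python sketch below models the same arithmetic on ordinary doubles. It is an illustration only: the listing obtains the seed by complementing exponent and mantissa bits of the TMS320C30 format, whereas this sketch reaches the same seed value through `math.frexp`, and the function names are hypothetical.

```python
import math

def reciprocal_seed(f):
    """X[0] = (1 - M/2) * 2**-E for f = +/-(1 + M) * 2**E, 0 <= M < 1."""
    m, e = math.frexp(abs(f))         # abs(f) = m * 2**e with 0.5 <= m < 1
    big_m = 2.0 * m - 1.0
    big_e = e - 1
    return math.copysign(math.ldexp(1.0 - big_m / 2.0, -big_e), f)

def newton_reciprocal(f):
    if f == 0.0:
        return f                      # the listing returns at once when F = 0
    x = reciprocal_seed(f)
    for _ in range(3):
        x = x * (2.0 - f * x)         # X[n+1] = X[n] * (2 - F*X[n])
    eps = 1.0 - f * x                 # residual of the last estimate
    return x * eps + x                # X[4] = X[3]*(1 - F*X[3]) + X[3]

def divide(x, y):
    # Same idea as FDIV: form 1/Y, then multiply by X.
    return x * newton_reciprocal(y)
```

The seed satisfies F*X[0] = (1+M)*(1-M/2), which stays within 12.5 percent of 1, and each step squares that relative error, so four steps are ample here.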

```
* PROGRAM: FDIV
* WRITTEN BY: GARY A. SITTON
*             GAS LIGHT SOFTWARE
*             HOUSTON, TEXAS
*             APRIL 1989.
* FLOATING POINT DIVIDE FUNCTION: R0 <= R0/R1.
* APPROXIMATE ACCURACY: 8 DECIMAL DIGITS.
* INPUT RESTRICTIONS: R1 != 0.0.
* REGISTERS FOR INPUT: R0 (DIVIDEND) AND R1 (DIVISOR).
* REGISTERS USED AND RESTORED: DP AND SP.
* REGISTERS ALTERED: R0-4.
* REGISTERS FOR OUTPUT: R0 (QUOTIENT).
* ROUTINES NEEDED: FPINV.
* EXECUTION CYCLES (MIN, MAX): 43, 43.
*********************************************
```

; EXTERNAL PROGRAM NAMES
.GLOBL FDIV
.GLOBL FPINV
.TEXT
; START OF FDIV PROGRAM
FDIV:

| RND | R0,R3 | ; R3 <= RND X |
| :---: | :---: | :---: |
| LDF | R1,R0 | ; R0 <= Y |
| CALL | FPINV | ; R0 <= 1/Y |
| RND | R0 | ; ROUND BEFORE * |
| MPYF | R3,R0 | ; R0 <= X/Y |
| RETS |  | ; RETURN |

***************************************************************************

PROGRAM: sMATHX.ASM
EXTENDED-PRECISION, FLOATING-POINT (40-BIT) MATH FUNCTIONS
sMATHX.ASM CONSISTS OF THE FOLLOWING ROUTINES:
SINX - COMPUTES A 9D SIN(X) FOR ALL X IN RADIANS.
COSX - COMPUTES A 9D COSINE(X) FOR ALL X IN RADIANS.
EXPX - COMPUTES A 9D EXP(X) FOR ALL \|X\| =< 88.
LNX - COMPUTES AN 8D LN(X) FOR ALL X > 0.
ATANX - COMPUTES AN 8D ATAN(X) FOR ALL X IN RADIANS.
SQRTX - COMPUTES A 10D SQRT(X) FOR ALL X >= 0.
FPINVX - COMPUTES A 10D 1/X FOR ALL X /= 0.
FDIVX - COMPUTES A 10D X/Y FOR ALL X AND ALL Y /= 0.
FMULTX - COMPUTES A 10D X*Y FOR ALL X AND ALL Y.
**********************************************************************

* PROGRAM: SINX
* WRITTEN BY: GARY A. SITTON
* GAS LIGHT SOFTWARE
* HOUSTON, TEXAS
* MARCH 1989.
* EXTENDED PRECISION SINE FUNCTION: R0 <= SIN(R0).
* APPROXIMATE ACCURACY: 9 DECIMAL DIGITS.
* INPUT RESTRICTIONS: NONE.
* REGISTERS FOR INPUT: R0 (ARGUMENT IN RADIANS).
* REGISTERS USED AND RESTORED: DP AND SP.
* REGISTERS ALTERED: AR0, IR0, AND R0-7.
* REGISTERS FOR OUTPUT: R0.
* ROUTINES NEEDED: FMULTX.
* EXECUTION CYCLES (MIN, MAX): 160, 160.

; EXTERNAL PROGRAM NAMES
.GLOBL SINX
.GLOBL ECOSX
.GLOBL FMULTX
; INTERNAL CONSTANTS
.DATA
; SCALING COEFFS. FOR SIN(X)

| NRM2 | .WORD | 00000006FH | ; BOTTOM OF 2/PI |
| :--- | :--- | :--- | :--- |
| NRM1 | .WORD | 0FF22F983H | ; TOP OF 2/PI |

; POLYNOMIAL COEFFS. FOR SIN(X)
.WORD 000490FDAH ; TOP OF C1 (PI/2)
; BOTTOM OF C3
.WORD OFOOAAOOEZH ; TOP OF C3
.WORD 0FC2335E0H ; TOP OF C5
.WORD 0FEE69754H ; TOP OF C7
.WORD OF 2280828 H ; TOP OF C9
.WORD 0ED999784H ; TOP OF C11
ACOF .WORD COF ; ADDRESS OF COEFFS.
CON .FLOAT -1.0, 0.0, 1.0, 0.0 ; MAPPING CONSTS.
ACON .WORD CON ; ADDRESS OF CONSTS.
.TEXT

SINX:


```
CALL FMULTX
OR *AR0,R2 ; OR IN BOTTOM OF C1
ADDF R2,R0,R1 ; R1 <= C1 + R0
```

; TEST FOR X < 0 AND RETURN

| NEGF | R3,R0 | ; R0 <= X |
| :--- | :--- | :--- |
| BRD | FMULTX | ; R0 <= X*R1 = SIN(X), (DELAYED) |
| POPF | R5 | ; TEST ORIGINAL X |
| LDFN | R3,R0 | ; IF X < 0 THEN R0 <= -X |
| POP | DP | ; UNSAVE DP |

; RETURN OCCURS FROM FMULTX

| LDF | R0,R1 | ; R1 <= X |
| :---: | :---: | :---: |
| CALL | FMULTX | ; R0 <= X**2 |
| LDF | R0,R1 | ; R1 <= X**2 |
| MPYF | *AR0--,R1,R0 | ; R0 <= X**2*C11 |
| ADDF | *AR0--,R0 | ; R0 <= C9 + R0 |
| MPYF | R1,R0 | ; R0 <= X**2*(C9 + R0) |
| ADDF | *AR0--,R0 | ; R0 <= C7 + R0 |
| MPYF | R1,R0 | ; R0 <= X**2*(C7 + R0) |
| LDF | *AR0--,R2 | ; R2 <= TOP OF C5 |
| OR | *AR0--,R2 | ; OR IN BOTTOM OF C5 |
| ADDF | R2,R0 | ; R0 <= C5 + R0 |
| CALL | FMULTX | ; R0 <= X**2*(C5 + R0) |
| LDF | *AR0--,R2 | ; R2 <= TOP OF C3 |
| OR | *AR0--,R2 | ; OR IN BOTTOM OF C3 |
| ADDF | R2,R0 | ; R0 <= C3 + R0 |


NOTE: USES SHF1 AND SHF2 FROM SINX PROGRAM

; EXTERNAL PROGRAM NAMES
.GLOBL COSX
.GLOBL ECOSX
.TEXT
; START OF COSX PROGRAM
COSX:

| PUSH | DP | ; SAVE DP |
| :---: | :---: | :---: |
| LDP | @NRM1 | ; LOAD DATA PAGE POINTER |
| BRD | ECOSX | ; R0 <= COS(X) = SIN(X'), (DELAYED) |
| LDF | @SHF1,R1 | ; R1 <= TOP OF PI/2 |
| OR | @SHF2,R1 | ; OR IN BOTTOM OF PI/2 |
| ADDF | R1,R0 | ; R0 <= X' = X + PI/2 |

```
***********
* PROGRAM: EXPX
* WRITTEN BY: GARY A. SITTON
*             GAS LIGHT SOFTWARE
*             HOUSTON, TEXAS
*             MARCH 1989.
* EXTENDED PREC. EXPONENTIAL: R0 <= EXP(R0).
* APPROXIMATE ACCURACY: 9 DECIMAL DIGITS.
* INPUT RESTRICTIONS: |R0| <= 88.0.
* REGISTERS FOR INPUT: R0.
* REGISTERS USED AND RESTORED: DP AND SP.
* REGISTERS ALTERED: AR0 AND R0-7.
* REGISTERS FOR OUTPUT: R0.
* ROUTINES NEEDED: FMULTX AND FPINVX.
* EXECUTION CYCLES (MIN, MAX): 115 (R0 <= 0), 160.
```

******************************************************
; EXTERNAL PROGRAM NAMES
.GLOBL EXPX
.GLOBL FMULTX
.GLOBL FPINVX
; INTERNAL CONSTANTS
.DATA

; SCALING COEFFS. FOR 2**-X
ENRM2 .WORD 000000029H ; BOTTOM OF 1/LN(2)
ENRM1 .WORD 00038AA3BH ; TOP OF 1/LN(2)
; POLYNOMIAL COEFFS. FOR 2**-X, 0 < X < 1.

| .WORD | 000000000H | ; C0 (1.0) |
| :---: | :---: | :---: |
| .WORD | 00000000AH | ; BOTTOM OF C1 |
| .WORD | 0FFCE8DE8H | ; TOP OF C1 |
| .WORD | 00000006EH | ; BOTTOM OF C2 |
| .WORD | 0FD75FDEDH | ; TOP OF C2 |
| .WORD | 000000046H | ; BOTTOM OF C3 |
| .WORD | 0FB9CA833H | ; TOP OF C3 |
| .WORD | 0F91D8C56H | ; TOP OF C4 |
| .WORD | 0F6D1E7A9H | ; TOP OF C5 |
| .WORD | 0F31AA7D7H | ; TOP OF C6 |
| .WORD | 0EFC9BD9CH | ; TOP OF C7 |

AC7 .WORD C7
.TEXT
; START OF EXPX PROGRAM



| OR | *AR0--,R2 | ; OR IN BOTTOM OF C3 |
| :---: | :---: | :---: |
| ADDF | R2,R0 | ; R0 <= C3 + R0 |
| CALL | FMULTX | ; R0 <= X*(C3 + R0) |
| LDF | *AR0--,R2 | ; R2 <= TOP OF C2 |
| OR | *AR0--,R2 | ; OR IN BOTTOM OF C2 |
| ADDF | R2,R0 | ; R0 <= C2 + R0 |
| CALL | FMULTX | ; R0 <= X*(C2 + R0) |
| LDF | *AR0--,R2 | ; R2 <= TOP OF C1 |
| OR | *AR0--,R2 | ; OR IN BOTTOM OF C1 |
| ADDF | R2,R0 | ; R0 <= C1 + R0 |
| CALL | FMULTX | ; R0 <= X*(C1 + R0) |
| ; ADD IN SCALED EXPONENT. |  |  |
| ADDF | R3,R0 | ; R0 <= LN(X) + E*LN(2) |
| RETS |  | ; RETURN |

```
*********************************************************
* PROGRAM: ATANX
* WRITTEN BY: GARY A. SITTON
*             GAS LIGHT SOFTWARE
*             HOUSTON, TEXAS
*             MARCH 1989.
* EXTENDED PRECISION ARC TANGENT: R0 <= ATAN(R0).
* APPROXIMATE ACCURACY: 8 DECIMAL DIGITS.
* INPUT RESTRICTIONS: NONE.
* REGISTERS FOR INPUT: R0.
* REGISTERS USED AND RESTORED: DP AND SP.
* REGISTERS ALTERED: AR0, IR0, AND R0-7.
* REGISTERS FOR OUTPUT: R0 (IN RADIANS).
* ROUTINES NEEDED: FMULTX AND FDIVX.
* EXECUTION CYCLES (MIN, MAX): 210 (|ATANX| <= 1), 332.
*******************************************************
```

; EXTERNAL PROGRAM NAMES
.GLOBL ATANX
.GLOBL FMULTX
.GLOBL FDIVX

; INTERNAL CONSTANTS
.DATA
; SCALING COEFFS. FOR ATAN(X)

| .WORD | 00000005DH | ; BOTTOM OF -PI/4 |
| :--- | :--- | :--- |
| .WORD | 0FFB6F025H | ; TOP OF -PI/4 |
| .WORD | 0000000A2H | ; BOTTOM OF PI/4 |
| .WORD | 0FF490FDAH | ; TOP OF PI/4 |
| .WORD | 000000000H | ; BOTTOM OF ZERO |
| .WORD | 080000000H | ; TOP OF ZERO |

; POLYNOMIAL COEFFS. FOR ATAN(X), -1 <= X <= 1.

| .WORD | 000000000H | ; TOP OF C1 (1.0) |
| :---: | :---: | :---: |
| .WORD | 00000006EH | ; BOTTOM OF C3 |
| .WORD | 0FED55594H | ; TOP OF C3 |
| .WORD | 000000009H | ; BOTTOM OF C5 |
| .WORD | 0FDACBBE4H | ; TOP OF C5 |
| .WORD | 0000000FFH | ; BOTTOM OF C7 |
| .WORD | 0FDEE8033H | ; TOP OF C7 |
| .WORD | 000000056H | ; BOTTOM OF C9 |
| .WORD | 0FC5A3D83H | ; TOP OF C9 |
| .WORD | 0000000 3 3 | ; BOTTOM OF C11 |
| .WORD | OFCESCE8B | ; TOP OF C11 |
| .WORD | 0000000BFH | ; BOTTOM OF C13 |
| .WORD | 0FB2FC1FDH | ; TOP OF C13 |



| ADDF | R2,R0 | ; R0 <= C13 + R0 |
| :---: | :---: | :---: |
| MPYF | R1,R0 | ; R0 <= X**2*(C13 + R0) |
| LDF | *AR0--,R2 | ; R2 <= TOP OF C11 |
| OR | *AR0--,R2 | ; OR IN BOTTOM OF C11 |
| ADDF | R2,R0 | ; R0 <= C11 + R0 |
| CALL | FMULTX | ; R0 <= X**2*(C11 + R0) |
| LDF | *AR0--,R2 | ; R2 <= TOP OF C9 |
| OR | *AR0--,R2 | ; OR IN BOTTOM OF C9 |
| ADDF | R2,R0 | ; R0 <= C9 + R0 |
| CALL | FMULTX | ; R0 <= X**2*(C9 + R0) |
| LDF | *AR0--,R2 | ; R2 <= TOP OF C7 |
| OR | *AR0--,R2 | ; OR IN BOTTOM OF C7 |
| ADDF | R2,R0 | ; R0 <= C7 + R0 |
| CALL | FMULTX | ; R0 <= X**2*(C7 + R0) |
| LDF | *AR0--,R2 | ; R2 <= TOP OF C5 |
| OR | *AR0--,R2 | ; OR IN BOTTOM OF C5 |
| ADDF | R2,R0 | ; R0 <= C5 + R0 |
| CALL | FMULTX | ; R0 <= X**2*(C5 + R0) |
| LDF | *AR0--,R2 | ; R2 <= TOP OF C3 |
| OR | *AR0--,R2 | ; OR IN BOTTOM OF C3 |
| ADDF | R2,R0 | ; R0 <= C3 + R0 |
| CALL | FMULTX | ; R0 <= X**2*(C3 + R0) |

; FINISH UP

| ADDF | *AR0--,R0,R1 | ; R1 <= C1 + R0 |
| :---: | :---: | :---: |
| LDF | R3,R0 | ; R0 <= X (SIGNED) |
| CALL | FMULTX | ; R0 <= ATANX(X) = X*(1 + R0) |
| NOP | *AR0++(IR0) | ; AR0 -> C (0.0, PI/4 OR -PI/4) |

; ADD IN POST SCALE VALUE C AND RETURN

| POP | R4 | ; R4 <= RETURN ADDRESS |
| :--- | :--- | :--- |
| BUD | R4 | ; RETURN (DELAYED) |
| LDF | *AR0--,R1 | ; R1 <= TOP OF C |
| OR | *AR0,R1 | ; OR IN BOTTOM OF C |
| ADDF | R1,R0 | ; R0 <= ATAN(X) + C |


* WRITTEN BY: GARY A. SITTON
* GAS LIGHT SOFTWARE
* HOUSTON, TEXAS
* MARCH 1989.
* APPROXIMATE ACCURACY: 10 DECIMAL DIGITS.
* INPUT RESTRICTIONS: R0 >= 0.0.
* REGISTERS FOR INPUT: R0.
* REGISTERS USED AND RESTORED: DP AND SP.
* REGISTERS ALTERED: R0-7.
* REGISTERS FOR OUTPUT: R0.
* ROUTINES NEEDED: FMULTX.
* EXECUTION CYCLES (MIN, MAX): 138, 138.

;
.GLOBL SQRTX
.GLOBL FMULTX
;
.DATA

| LSH | 8,R1 | ; R1 <= R1 EXP. REMOVED |
| :---: | :---: | :---: |
| ASH | -1,R4 | ; R4 <= R4 WITH -E/2 EXP. |
| PUSH | R4 | ; SAVE R4 AS INTEGER |
| POPF | R4 | ; R4 <= FLT. PT. |
| LDE | R4,R1 | ; R1 <= (1-M/2)*2**-E/2 |
| LDF | @CNST3,R2 | ; R2 <= 1.1... FOR ODD E |
| LSH | 7,R5 | ; TEST LSB OF E (AS SIGN) |
| LDFNN | @CNST4,R2 | ; IF E EVEN R2 <= 0.78... |
| MPYF | R2,R1 | ; R1 <= CORRECTED ESTIMATE |

; GENERATE V/2 (USES MPYF).

| MPYF | CNST1,R0 | ; R0 <= V/2 TRUNC. |
| :--- | :--- | :--- |
| LDM | R3,R0 | ; R0 <= V/2 FULL PREC. |


| MPYF | R1,R1,R2 | ; R2 <= X[0]**2 |
| :--- | :--- | :--- |
| MPYF | R0,R2 | ; R2 <= (V/2)*X[0]**2 |
| SUBRF | CNST2,R2 | ; R2 <= 1.5 - (V/2)*X[0]**2 |
| MPYF | R2,R1 | ; R1 <= X[1] = X[0]*(1.5 - (V/2)*X[0]**2) |


| MPYF | R1,R1,R2 | ; R2 <= X[1]**2 |
| :--- | :--- | :--- |
| MPYF | R0,R2 | ; R2 <= (V/2)*X[1]**2 |
| SUBRF | CNST2,R2 | ; R2 <= 1.5 - (V/2)*X[1]**2 |
| MPYF | R2,R1 | ; R1 <= X[2] = X[1]*(1.5 - (V/2)*X[1]**2) |


| MPYF | R1,R1,R2 | ; R2 <= X[2]**2 |
| :--- | :--- | :--- |
| MPYF | R0,R2 | ; R2 <= (V/2)*X[2]**2 |
| SUBRF | CNST2,R2 | ; R2 <= 1.5 - (V/2)*X[2]**2 |
| MPYF | R2,R1 | ; R1 <= X[3] = X[2]*(1.5 - (V/2)*X[2]**2) |

| LDF | R0,R2 | ; R2 <= V/2 |
| :---: | :---: | :---: |
| LDF | R1,R0 | ; R0 <= X[3] |
| CALL | FMULTX | ; R0 <= X[3]**2 |
| LDF | R1,R4 | ; R4 <= X[3] |
| LDF | R2,R1 | ; R1 <= V/2 |
| LDF | R4,R2 | ; R2 <= X[3] |
| CALL | FMULTX | ; R0 <= (V/2)*X[3]**2 |
| SUBRF | CNST2,R0 | ; R0 <= 1.5 - (V/2)*X[3]**2 |
| LDF | R2,R1 | ; R1 <= X[3] |
| CALL | FMULTX | ; R0 <= X[4] = X[3]*(1.5 - (V/2)*X[3]**2) |

; INVERT FINAL RESULT AND RETURN

| BRD | FMULTX | ; R0 <= SQRT(V) = V*SQRT(1/V) (DELAYED) |
| :--- | :--- | :--- |
| LDF | R3,R1 | ; R1 <= V |
| POP | DP | ; UNSAVE DP |
| NOP |  | ; DEAD CYCLE |

; RETURN OCCURS FROM FMULTX
********************************************************

* PROGRAM: FPINVX
* WRITTEN BY: GARY A. SITTON
* GAS LIGHT SOFTWARE
* HOUSTON, TEXAS
* MARCH 1989.
* EXTENDED PREC. FLT. PT. INVERSE: R0 <= 1/R0.
* APPROXIMATE ACCURACY: 10 DECIMAL DIGITS.
* INPUT RESTRICTIONS: R0 != 0.0.
* REGISTERS FOR INPUT: R0.
* REGISTERS USED AND RESTORED: DP AND SP.
* REGISTERS ALTERED: R0-1 AND R4-7.
* REGISTERS FOR OUTPUT: R0.
* ROUTINES NEEDED: FMULTX.
* EXECUTION CYCLES (MIN, MAX): 76, 76.

; EXTERNAL PROGRAM NAMES
.GLOBL FPINVX
.GLOBL FMULTX
; INTERNAL CONSTANTS
.DATA
ONE .SET 1.0
TWO .SET 2.0
MSK .WORD 0FF7FFFFFH
.TEXT
; START OF FPINVX PROGRAM
FPINVX:
LDF R0,R0 ; TEST F
RETSZ ; RETURN NOW IF F = 0
; GET APPROXIMATION TO 1/F. FOR F = (1+M)*2**E
; AND 0 <= M < 1, USE: X[0] = (1-M/2)*2**-E

| PUSH | DP | ; SAVE DP |
| :---: | :---: | :---: |
| LDP | @MSK | ; LOAD DATA PAGE POINTER |
| PUSHF | R0 | ; SAVE AS FLT. PT. F = (1+M)*2**E |
| POP | R1 | ; FETCH BACK AS INTEGER |
| XOR | @MSK,R1 | ; COMPLEMENT E & M BUT NOT SIGN BIT |
| PUSH | R1 | ; SAVE AS INTEGER, AND BY MAGIC... |
| POPF | R1 | ; R1 <= X[0] = (1-M/2)*2**-E. |
| POP | DP | ; UNSAVE DP |

; NEWTON ITERATION FOR: Y(X) = X - 1/F = 0 ...

| MPYF | R1,R0,R4 | ; R4 <= F*X[0] |
| :---: | :---: | :---: |
| SUBRF | TWO,R4 | ; R4 <= 2 - F*X[0] |
| MPYF | R4,R1 | ; R1 <= X[1] = X[0]*(2 - F*X[0]) |
| MPYF | R1,R0,R4 | ; R4 <= F*X[1] |
| SUBRF | TWO,R4 | ; R4 <= 2 - F*X[1] |
| MPYF | R4,R1 | ; R1 <= X[2] = X[1]*(2 - F*X[1]) |
| MPYF | R1,R0,R4 | ; R4 <= F*X[2] |
| SUBRF | TWO,R4 | ; R4 <= 2 - F*X[2] |
| MPYF | R4,R1 | ; R1 <= X[3] = X[2]*(2 - F*X[2]) |

; FOR THE LAST ITERATION: X[4] = (X[3]*(1 - (F*X[3]))) + X[3]

| CALL | FMULTX | ; R0 <= F*X[3] = 1 + EPS |
| :---: | :---: | :---: |
| SUBRF | ONE,R0 | ; R0 <= 1 - F*X[3] = EPS |
| CALL | FMULTX | ; R0 <= X[3]*EPS |
| ADDF | R1,R0 | ; R0 <= X[4] = (X[3]*(1 - (F*X[3]))) + X[3] |
| RETS |  | ; RETURN |

.END

```
* PROGRAM: FDIVX
* WRITTEN BY: GARY A. SITTON
*             GAS LIGHT SOFTWARE
*             HOUSTON, TEXAS
* EXTENDED PRECISION DIVIDE: R0 <= R0/R1.
* APPROXIMATE ACCURACY: 10 DECIMAL DIGITS.
* INPUT RESTRICTIONS: R1 != 0.0.
* REGISTERS FOR INPUT: R0 (DIVIDEND) AND R1 (DIVISOR).
* REGISTERS USED AND RESTORED: DP AND SP.
* REGISTERS ALTERED: R0-7.
* REGISTERS FOR OUTPUT: R0 (QUOTIENT).
* ROUTINES NEEDED: FMULTX AND FPINVX.
* EXECUTION CYCLES (MIN, MAX): 107, 107.
```

; EXTERNAL PROGRAM NAMES
.GLOBL FDIVX
.GLOBL FPINVX
.GLOBL FMULTX
.TEXT
; START OF FDIVX PROGRAM
FDIVX:

| LDF | R0,R3 | ; R3 <= X |
| :--- | :--- | :--- |
| LDF | R1,R0 | ; R0 <= Y |
| CALL | FPINVX | ; R0 <= 1/Y |
| LDF | R3,R1 | ; R1 <= X |
| BR | FMULTX | ; R0 <= X/Y |

; RETURN OCCURS FROM FMULTX

******************************************************
* PROGRAM: FMULTX
* WRITTEN BY: GARY A. SITTON
* GAS LIGHT SOFTWARE
* HOUSTON, TEXAS
* MARCH 1989.
* EXTENDED PRECISION MULTIPLY: R0 <= R0*R1.
* APPROXIMATE ACCURACY: 10 DECIMAL DIGITS.
* INPUT RESTRICTIONS: NONE.
* REGISTERS FOR INPUT: R0.
* REGISTERS USED AND RESTORED: DP AND SP.
* REGISTERS ALTERED: R0 AND R4-7.
* REGISTERS FOR OUTPUT: R0.
* ROUTINES NEEDED: NONE.
* EXECUTION CYCLES (MIN, MAX): 20, 20.
******************************************************

; EXTERNAL PROGRAM NAMES
.GLOBL FMULTX
.TEXT
; START OF FMULTX PROGRAM
FMULTX:


; TEST FOR XA*XB < 0 AND RETURN

| POP | R4 | ; R4 <= RETURN ADDRESS |
| :--- | :--- | :--- |
| BUD | R4 | ; RETURN (DELAYED) |
| LDF | R0,R0 | ; TEST ORIGINAL (XA ^ XB) |
| LDFN | R6,R5 | ; IF XA*XB < 0 THEN R5 <= -\|XA*XB\| |
| LDF | R5,R0 | ; R0 <= XA*XB |


```
* PROGRAM: ILOG2
* WRITTEN BY: GARY A. SITTON
*             GAS LIGHT SOFTWARE
*             HOUSTON, TEXAS
*             MARCH 1989.
* INTEGER LOG BASE 2: R0 <= (INTEGER) LOG2(R0).
* INPUT RESTRICTIONS: R0 > 0.
* REGISTERS FOR INPUT: R0.
* REGISTERS USED AND RESTORED: SP.
* REGISTERS ALTERED: IR0-1 AND R0.
* REGISTERS FOR OUTPUT: R0.
* ROUTINES NEEDED: NONE.
```

; EXTERNAL PROGRAM NAMES
.GLOBL ILOG2
.TEXT
; START OF ILOG2 PROGRAM
ILOG2:
CMPI IR0,R0 ; COMPARE I TO N
LOOP: BGTD LOOP ; LOOP IF N > I (DELAYED)
LSH 1,IR0 ; I <= 2*I
ADDI 1,IR1 ; M <= M + 1
CMPI IR0,R0 ; COMPARE I TO N
LDI IR1,R0 ; R0 <= LOG2(N)
; RETURN
IMULT:
; SEPARATE MULTIPLIER AND MULTIPLICAND IN TWO PARTS
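
The ILOG2 loop above doubles a probe value I and counts the doublings M while N > I. A minimal Python sketch of that idea follows. The initial values of I and M and the exact rounding convention are not visible in this excerpt, so the sketch assumes I starts at 1 and M at 0, which returns the smallest M with 2**M >= N; the function name is hypothetical.

```python
def ilog2_ceil(n):
    """Smallest m with 2**m >= n, found by repeated doubling (n > 0)."""
    if n <= 0:
        raise ValueError("n must be positive")
    i = 1                 # probe value I (assumed starting point)
    m = 0                 # doubling count M (assumed starting point)
    while n > i:          # LOOP IF N > I
        i <<= 1           # I <= 2*I
        m += 1            # M <= M + 1
    return m
```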

```
* PROGRAM: IMULT
* WRITTEN BY: GARY A. SITTON
*             GAS LIGHT SOFTWARE
*             HOUSTON, TEXAS
*             MARCH 1989.
* INTEGER 32 X 32 MULTIPLY: R1,R0 <= R0*R1.
* RESULT IS THE 64 BIT PRODUCT OF TWO 32 BIT INPUTS.
* INPUT RESTRICTIONS: NONE.
* REGISTERS FOR INPUT: R0 AND R1.
* REGISTERS USED AND RESTORED: SP.
* REGISTERS ALTERED: AR0-1 AND R0-4.
* REGISTERS FOR OUTPUT: R1 (UPPER) AND R0 (LOWER).
* ROUTINES NEEDED: NONE.
```

; EXTERNAL PROGRAM NAMES
.GLOBL IMULT
.TEXT
; START OF IMULT PROGRAM

| XOR | R0,R1,AR0 | ; AR0 <= SIGNUM(R0*R1) |
| :---: | :---: | :---: |
| ABSI | R0 | ; R0 <= \|X\| |
| ABSI | R1 | ; R1 <= \|Y\| |


| LDI | -16,AR1 | ; AR1 <= -16 (FOR SHIFTS) |
| :---: | :---: | :---: |
| LSH | AR1,R0,R2 | ; R2 <= X1 = UPPER 16 BITS OF \|X\| |
| AND | 0FFFFH,R0 | ; R0 <= X0 = LOWER 16 BITS OF \|X\| |
| LSH | AR1,R1,R3 | ; R3 <= Y1 = UPPER 16 BITS OF \|Y\| |
| AND | 0FFFFH,R1 | ; R1 <= Y0 = LOWER 16 BITS OF \|Y\| |

; CARRY OUT THE MULTIPLICATION

| MPYI | R0,R1,R4 | ; R4 <= X0*Y0 = P1 |
| :---: | :---: | :---: |
| MPYI | R3,R0 | ; R0 <= X0*Y1 = P2 |
| MPYI | R2,R1 | ; R1 <= X1*Y0 = P3 |
| ADDI | R0,R1 | ; R1 <= P2+P3 |
| MPYI | R2,R3 | ; R3 <= X1*Y1 = P4 |

; PUT THE PRODUCTS TOGETHER

BGED DONE ; IF >= 0 THEN DONE (DELAYED)
LSH AR1,R1 ; R1 <= UPPER 16 BITS OF P2+P3
ADDI R4,R2,R0 ; R0 <= W0 = LOWER WORD OF THE PRODUCT
ADDC R3,R1 ; R1 <= W1 = UPPER WORD OF THE PRODUCT
; NEGATE THE PRODUCT IF NUMBERS WERE OF OPPOSITE SIGN

| SUBRI | 0,R0 | ; R0 <= -W0 |
| :--- | :--- | :--- |
| SUBRB | 0,R1 | ; R1 <= -W1 (WITH BORROW) |
| RETS |  | ; RETURN |
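
The IMULT steps above (split each magnitude into 16-bit halves, form the four partial products P1 through P4, recombine with carry, then negate if the signs differed) can be traced with a short Python model. This is a sketch rather than a transcription: some recombination instructions are missing from the excerpt, so the carry handling below is written out explicitly, and the function name and mask constants are hypothetical.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def imult_64(x, y):
    """Return (upper, lower) 32-bit words of the signed 64-bit product x*y."""
    negative = (x < 0) != (y < 0)          # sign of the result
    ax, ay = abs(x), abs(y)
    x1, x0 = ax >> 16, ax & MASK16         # upper and lower 16 bits of |X|
    y1, y0 = ay >> 16, ay & MASK16         # upper and lower 16 bits of |Y|
    p1 = x0 * y0
    p2 = x0 * y1
    p3 = x1 * y0
    p4 = x1 * y1
    mid = p2 + p3                          # contributes at bit 16
    low_sum = p1 + ((mid & MASK16) << 16)
    lower = low_sum & MASK32               # W0
    carry = low_sum >> 32
    upper = (p4 + (mid >> 16) + carry) & MASK32   # W1
    if negative:                           # two's complement negate, 64 bits
        lower = (-lower) & MASK32
        borrow = 1 if lower != 0 else 0
        upper = (-upper - borrow) & MASK32
    return upper, lower
```

Recombining the two words as `(upper << 32) | lower` gives the product modulo 2**64, which is how the check below compares against Python's arbitrary-precision multiply.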

; RETURN

```
* PROGRAM: IDIV
* WRITTEN BY: GARY A. SITTON
*             GAS LIGHT SOFTWARE
*             HOUSTON, TEXAS
*             MARCH 1989.
* INTEGER 32 / 32 DIVIDE: R0, R1 <= R0/R1.
* RESULT IS A 32 BIT QUOTIENT AND |REMAINDER|.
* INPUT RESTRICTIONS: R1 != 0.
* REGISTERS FOR INPUT: R0 (DIVIDEND) AND R1 (DIVISOR).
* REGISTERS USED AND RESTORED: SP.
* REGISTERS ALTERED: IR0-1 AND R0-3.
* REGISTERS FOR OUTPUT: R0 (QUOTIENT) AND R1 (|REMAINDER|).
* ROUTINES NEEDED: NONE.
**********************************************
```

; EXTERNAL PROGRAM NAMES
.GLOBL IDIV
; START OF IDIV PROGRAM

.TEXT

IDIV:
; DETERMINE SIGN OF RESULT. GET ABSOLUTE VALUE OF OPERANDS.

| XOR | R0,R1,R2 | ; R2 <= SIGNUM(R0/R1) |
| :--- | :--- | :--- |
| ABSI | R0 | ; R0 <= \|X\| |
| ABSI | R1 | ; R1 <= \|Y\| |

; test input values
CMPI R0,R1 ; COMPARE DIVISOR TO DIVIDEND
BHID ZERO ; IF R1 > R0 THEN RETURN 0 (DELAYED)

; NORMALIZE OPERANDS. USE DIFFERENCE IN EXPONENTS AS SHIFT COUNT FOR DIVISOR, AND AS REPEAT COUNT FOR SUBC.

| FLOAT | R0,R3 | ; R3 <= NORMALIZED DIVIDEND |
| :---: | :---: | :---: |
| PUSHF | R3 | ; PUSH AS FLOAT |
| POP | IR1 | ; IR1 <= INTEGER |
| LSH | -24,IR1 | ; IR1 <= DIVIDEND EXPONENT |
| FLOAT | R1,R3 | ; R3 <= NORMALIZED DIVISOR |
| PUSHF | R3 | ; PUSH AS FLOAT |
| POP | IR0 | ; IR0 <= INTEGER |
| LSH | -24,IR0 | ; IR0 <= DIVISOR EXPONENT |

SUBI IR0,IR1 ; IR1 <= DIFFERENCE IN EXPONENTS
LSH IR1,R1 ; R1 <= ALIGNED DIVISOR WITH DIVIDEND

; DO IR1+1 SUBTRACT & SHIFTS

| RPTS | IR1 | ; REPEAT IR1+1 TIMES |
| :--- | :--- | :--- |
| SUBC | R1,R0 | ; R0 <= 2*(R0 - R1) |

; MASK OFF THE LOWER IR1+1 BITS OF R0

| LDI | R0,R1 | ; R1 <= \|REMAINDER\|, QUOTIENT |
| :---: | :---: | :---: |
| SUBRI | 31,IR1 | ; IR1 <= 32 - (IR1+1) |
| LSH | IR1,R0 | ; R0 <= R0 SHIFT LEFT IR1 |
| NEGI | IR1 | ; IR1 <= -IR1 |
| LSH | IR1,R0 | ; R0 <= \|X\|/\|Y\| |
| SUBRI | -32,IR1 | ; IR1 <= -(IR1+1) |
| LSH | IR1,R1 | ; R1 <= \|REMAINDER\| |
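
The divide above aligns the divisor with the dividend using the difference in exponents, runs that many plus one conditional-subtract steps, and then separates quotient bits (low end) from remainder bits (high end) of the same register. The Python sketch below models that scheme. It assumes the conditional subtract behaves as "if the difference is non-negative, keep it, shift left and set the low bit; otherwise just shift left", and it uses `int.bit_length` in place of the float-exponent trick; the names `subc` and `idiv` are hypothetical.

```python
def subc(reg, divisor):
    """One conditional-subtract step (assumed SUBC behaviour)."""
    diff = reg - divisor
    if diff >= 0:
        return (diff << 1) | 1        # subtract succeeded: shift in a 1
    return reg << 1                   # subtract failed: shift in a 0

def idiv(x, y):
    """Return (quotient, |remainder|) with the quotient truncated toward zero."""
    if y == 0:
        raise ZeroDivisionError("divisor must be nonzero")
    negative = (x < 0) != (y < 0)     # sign of the quotient
    ax, ay = abs(x), abs(y)
    if ay > ax:
        return 0, ax                  # zero quotient, remainder is |X|
    k = ax.bit_length() - ay.bit_length()   # difference in exponents
    reg = ax
    aligned = ay << k                 # divisor aligned with dividend
    for _ in range(k + 1):            # repeat k+1 times
        reg = subc(reg, aligned)
    quotient = reg & ((1 << (k + 1)) - 1)   # lower k+1 bits
    remainder = reg >> (k + 1)              # remaining upper bits
    return (-quotient if negative else quotient), remainder
```

For example, 100 / 7 aligns the divisor to 112, runs five steps, and leaves 78 in the register: the low five bits give 14 and the bits above them give 2.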


| NEGI | R0,R3 | ; R3 <= -R0 |
| :---: | :---: | :---: |
| ASH | -31,R2 | ; TEST SIGN BIT |
| LDINZ | R3,R0 | ; IF SET R0 <= -R0 |
| CMPI | 0,R0 | ; SET STATUS FROM RESULT |
| RETS |  | ; RETURN |

; RETURN ZERO QUOTIENT

| ZERO: | LDI | R0,R1 | ; R1 <= \|REMAINDER\| |
| :--- | :--- | :--- | :--- |
|  | LDI | 0,R0 | ; R0 <= 0 QUOTIENT |
|  | RETS |  | ; RETURN |


## PROGRAM: SVECTOR.ASM

VECTOR UTILITIES
SVECTOR.ASM CONSISTS OF THE FOLLOWING ROUTINES:
*CORMULT - IN-PLACE COMPUTATION OF THE COMPLEX VECTOR PRODUCT OF TWO COMPLEX ARRAYS USING THE COMPLEX CONJUGATE OF THE SECOND ARRAY.
*CONMULT - IN-PLACE COMPUTATION OF THE COMPLEX VECTOR PRODUCT OF TWO COMPLEX ARRAYS.
*CBITREV - IN-PLACE BIT REVERSE PERMUTATION ON A COMPLEX ARRAY WITH SEPARATE REAL AND IMAGINARY ARRAYS.
*FMIEEE - IN-PLACE FAST CONVERSION OF AN IEEE ARRAY TO A TMS320C30 ARRAY.
*TOIEEE - IN-PLACE FAST CONVERSION OF A TMS320C30 ARRAY TO AN IEEE ARRAY.
*VECMULT - IN-PLACE MULTIPLIES A CONSTANT TIMES AN ARRAY.
*CONMOV - MOVES (FILLS) A CONSTANT INTO AN ARRAY.
*VECMOV - MOVES (COPIES) AN ARRAY INTO ANOTHER ARRAY.

******************************************************

* PROGRAM: *CORMULT
* WRITTEN BY: GARY A. SITTON
* GAS LIGHT SOFTWARE
* HOUSTON, TEXAS
* FEBRUARY 1989.
* COMPLEX IN-PLACE FREQUENCY DOMAIN CORRELATION:
* C1 <= C1 * CONJ(C2), C1 AND C2 ARE BOTH OF LENGTH
* N, AND C1 = (X1 + I*Y1) AND C2 = (X2 + I*Y2)
* MCORMULT ENTRY PROTOCOL:
* VARIABLES FOR INPUT:
* \$IAD1 -> X1[0], \$IAD2 -> Y1[0],
* \$SAD1 -> X2[0], \$SAD2 -> Y2[0],
* \$N = N (LENGTH), \$PARMS = DATA PAGE.
* INPUT RESTRICTIONS: \$N > 0.
* REGISTERS ALTERED: RC, DP, AR0-3 AND R0-3.
* RCORMULT ENTRY PROTOCOL:
* REGISTERS FOR INPUT:
* AR0 -> X1[0], AR1 -> Y1[0], AR2 -> X2[0], AR3 -> Y2[0], RC = N (LENGTH).
* INPUT RESTRICTIONS: RC > 0.
* REGISTERS ALTERED: RC, AR0-3 AND R0-3.
* REGISTERS USED AND RESTORED: SP.
* REGISTERS FOR OUTPUT: NONE.
* ROUTINES NEEDED: NONE.

; EXTERNAL MEMORY ADDRESSES
.GLOBL \$PARMS ; PARAMETER PAGE ADDRESS
; EXTERNAL VARIABLE ADDRESSES

| .GLOBL | \$N | ; ARRAY LENGTH N |
| :---: | :---: | :---: |
| .GLOBL | \$IAD1 | ; ADDRESS OF INPUT X1 |
| .GLOBL | \$IAD2 | ; ADDRESS OF INPUT Y1 |
| .GLOBL | \$SAD1 | ; ADDRESS OF INPUT X2 |
| .GLOBL | \$SAD2 | ; ADDRESS OF INPUT Y2 |

.GLOBL MCORMULT ; MEMORY ENTRY FOR COMPLEX (CORR.) MULTIPLY
.GLOBL RCORMULT ; REGISTER ENTRY FOR COMPLEX (CORR.) MULTIPLY

START OF PROGRAM AREA

MCORMULT:

| LDP | @\$PARMS | ; LOAD DATA PAGE POINTER |
| :---: | :---: | :---: |
| LDI | @\$N,RC | ; RC <= N |
| LDI | @\$IAD1,AR0 | ; AR0 -> X1[0] |
| LDI | @\$IAD2,AR1 | ; AR1 -> Y1[0] |
| LDI | @\$SAD1,AR2 | ; AR2 -> X2[0] |
| LDI | @\$SAD2,AR3 | ; AR3 -> Y2[0] |

; REGISTER BASED PARAMETER ENTRY

RCORMULT:

; COMPLEX MULTIPLY (CORRELATION) LOOP

|  | SUBI | 1,RC | ; RC <= N-1 |
| :---: | :---: | :---: | :---: |
|  | RPTB | LOOP1 | ; REPEAT BLOCK N TIMES |
|  | MPYF | *AR0,*AR2,R1 | ; R1 <= X1[I]*X2[I] |
|  | MPYF | *AR1,*AR3,R3 | ; R3 <= Y1[I]*Y2[I] |
|  | MPYF | *AR2++,*AR1,R0 | ; R0 <= Y1[I]*X2[I], INCR. AR2 AND... |
| \|\| | ADDF | R1,R3,R2 | ; R2 <= X1[I]*X2[I] + Y1[I]*Y2[I] |
|  | MPYF | *AR0,*AR3++,R1 | ; R1 <= X1[I]*Y2[I], INCR. AR3 |
|  | SUBF | R1,R0,R3 | ; R3 <= Y1[I]*X2[I] - X1[I]*Y2[I] |
| LOOP1: | STF | R2,*AR0++ | ; X1[I] <= R2, INCR. AR0 AND... |
| \|\| | STF | R3,*AR1++ | ; Y1[I] <= R3, INCR. AR1 |
|  | RETS |  | ; RETURN |

```
*******************************************************
* PROGRAM: *CONMULT
* WRITTEN BY: GARY A. SITTON
*             GAS LIGHT SOFTWARE
*             HOUSTON, TEXAS
*             APRIL 1989.
* COMPLEX IN-PLACE FREQUENCY DOMAIN CONVOLUTION:
* C1 <= C1 * C2, C1 AND C2 ARE BOTH OF LENGTH
* N, AND C1 = (X1 + I*Y1) AND C2 = (X2 + I*Y2).
* MCONMULT ENTRY PROTOCOL:
*     VARIABLES FOR INPUT:
*         $IAD1 -> X1[0], $IAD2 -> Y1[0],
*         $SAD1 -> X2[0], $SAD2 -> Y2[0],
*         $N = N (LENGTH), $PARMS = DATA PAGE.
*     INPUT RESTRICTIONS: $N > 0.
*     REGISTERS ALTERED: RC, DP, AR0-3 AND R0-3.
* RCONMULT ENTRY PROTOCOL:
*     REGISTERS FOR INPUT:
*         AR0 -> X1[0], AR1 -> Y1[0], AR2 -> X2[0],
*         AR3 -> Y2[0], RC = N (LENGTH).
*     INPUT RESTRICTIONS: RC > 0.
*     REGISTERS ALTERED: RC, AR0-3 AND R0-3.
* REGISTERS USED AND RESTORED: SP.
* REGISTERS FOR OUTPUT: NONE.
* ROUTINES NEEDED: NONE.
```


; EXTERNAL MEMORY ADDRESSES
.GLOBL \$PARMS ; PARAMETER PAGE ADDRESS
; EXTERNAL VARIABLE ADDRESSES

| .GLOBL | \$N | ; ARRAY LENGTH N |
| :---: | :---: | :---: |
| .GLOBL | \$IAD1 | ; ADDRESS OF INPUT X1 |
| .GLOBL | \$IAD2 | ; ADDRESS OF INPUT Y1 |
| .GLOBL | \$SAD1 | ; ADDRESS OF INPUT X2 |
| .GLOBL | \$SAD2 | ; ADDRESS OF INPUT Y2 |

; EXTERNAL PROGRAM NAMES
.GLOBL MCONMULT ; MEMORY ENTRY FOR COMPLEX (CONV.) MULTIPLY
.GLOBL RCONMULT ; REGISTER ENTRY FOR COMPLEX (CONV.) MULTIPLY

; START OF PROGRAM AREA
.TEXT

| MCONMULT: |  |  |  |
| :---: | :---: | :---: | :---: |
|  | LDP | @\$PARMS | ; LOAD DATA PAGE POINTER |
|  | LDI | @\$N,RC | ; RC <= N |
|  | LDI | @\$IAD1,AR0 | ; AR0 -> X1[0] |
|  | LDI | @\$IAD2,AR1 | ; AR1 -> Y1[0] |
|  | LDI | @\$SAD1,AR2 | ; AR2 -> X2[0] |
|  | LDI | @\$SAD2,AR3 | ; AR3 -> Y2[0] |
| ; | REGISTER BASED PARAMETER ENTRY |  |  |
| RCONMULT: |  |  |  |
| ; | COMPLEX MULTIPLY (CONVOLUTION) LOOP |  |  |
|  | SUBI | 1,RC | ; RC <= N-1 |
|  | RPTB | LOOP2 | ; REPEAT BLOCK N TIMES |
|  | MPYF | *AR0,*AR2,R1 | ; R1 <= X1[I]*X2[I] |
|  | MPYF | *AR1,*AR3,R3 | ; R3 <= Y1[I]*Y2[I] |
|  | MPYF | *AR2++,*AR1,R0 | ; R0 <= Y1[I]*X2[I], INCR. AR2 AND... |
| \|\| | SUBF | R3,R1,R2 | ; R2 <= X1[I]*X2[I] - Y1[I]*Y2[I] |
|  | MPYF | *AR0,*AR3++,R1 | ; R1 <= X1[I]*Y2[I], INCR. AR3 |
|  | ADDF | R1,R0,R3 | ; R3 <= Y1[I]*X2[I] + X1[I]*Y2[I] |
| LOOP2: | STF | R2,*AR0++ | ; X1[I] <= R2, INCR. AR0 AND... |
| \|\| | STF | R3,*AR1++ | ; Y1[I] <= R3, INCR. AR1 |
|  | RETS |  | ; RETURN |
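The arithmetic performed by the two complex multiply routines can be restated in a few lines of a high-level language. The following is a minimal Python sketch, not a transcription of the assembly: the function name `complex_multiply_in_place` and the `conjugate` flag are illustrative assumptions. With `conjugate=False` it forms C1 <= C1 * C2 (the convolution case); with `conjugate=True` it forms C1 <= C1 * CONJ(C2) (the correlation case), using separate real and imaginary arrays as the listings do.

```python
def complex_multiply_in_place(x1, y1, x2, y2, conjugate=False):
    """Overwrite (x1, y1) with C1*C2, or with C1*conj(C2) if conjugate is True.

    x1/y1 hold the real/imaginary parts of C1, x2/y2 those of C2.
    The name and the conjugate flag are hypothetical, chosen for this sketch.
    """
    if not (len(x1) == len(y1) == len(x2) == len(y2)):
        raise ValueError("all four arrays must have the same length")
    sign = -1.0 if conjugate else 1.0      # conj(C2) flips the sign of Y2
    for i in range(len(x1)):
        a, b = x1[i], y1[i]                # C1 = a + i*b
        c, d = x2[i], sign * y2[i]         # C2 (or its conjugate) = c + i*d
        x1[i] = a * c - b * d              # real part of the product
        y1[i] = b * c + a * d              # imaginary part of the product
```

Both inputs are read into temporaries before either output is stored, which is the same precaution the assembly takes by holding all four products in registers before the two parallel stores.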



```
* PROGRAM: *CBITREV
* WRITTEN BY: GARY A. SITTON
*             GAS LIGHT SOFTWARE
*             HOUSTON, TEXAS
*             MARCH 1989.
* BIT REVERSE INDEX MAP TWO REAL ARRAYS AS A SINGLE
* COMPLEX ARRAY WITH THE SWAPPING DONE IN-PLACE.
* X[I], Y[I] <-> X[J], Y[J], WHERE J = BR(I).
* LENGTH OF ARRAYS N >= 4 IS ABSOLUTELY REQUIRED.
* MCBITREV ENTRY PROTOCOL:
*     VARIABLES FOR INPUT:
*         $IAD1 -> X[0], $IAD2 -> Y[0],
*         $N = N (LENGTH), $PARMS = DATA PAGE.
*     INPUT RESTRICTIONS: $N >= 4.
*     REGISTERS ALTERED: RC, DP, IR0, AR0-3 AND R0-3.
* RCBITREV ENTRY PROTOCOL:
*     REGISTERS FOR INPUT:
*         AR0 -> X[0], AR1 -> Y[0], RC = N (LENGTH).
*     INPUT RESTRICTIONS: RC >= 4.
*     REGISTERS ALTERED: RC, IR0, AR0-3 AND R0-3.
* REGISTERS USED AND RESTORED: SP.
* REGISTERS FOR OUTPUT: NONE.
* ROUTINES NEEDED: NONE.
```

*******************************************************
; EXTERNAL MEMORY ADDRESSES
.GLOBL \$PARMS ; PARAMETER PAGE ADDRESS
; EXTERNAL VARIABLE ADDRESSES

| .GLOBL | \$N | ; ARRAY LENGTH N |
| :--- | :--- | :--- |
| .GLOBL | \$IAD1 | ; ADDRESS OF INPUT X |
| .GLOBL | \$IAD2 | ; ADDRESS OF INPUT Y |

; EXTERNAL PROGRAM NAMES
.GLOBL MCBITREV ; MEMORY ENTRY FOR COMPLEX BIT REVERSE
.GLOBL RCBITREV ; REGISTER ENTRY FOR COMPLEX BIT REVERSE
; START OF PROGRAM AREA
.TEXT
;
MCBITREV:

| LDP | @\$PARMS | ; LOAD DATA PAGE POINTER |
| :---: | :---: | :---: |
| LDI | @\$N,RC | ; RC <= N |
| LDI | @\$IAD1,AR0 | ; AR0 -> ARRAY X |
| LDI | @\$IAD2,AR1 | ; AR1 -> ARRAY Y |

; REGISTER BASED PARAMETER ENTRY
RCBITREV:

| LDI | RC,IR0 | ; IR0 <= N |
| :---: | :---: | :---: |
| SUBI | 3,RC | ; RC <= N-3 |
| LSH | -1,IR0 | ; IR0 <= N/2 FOR BIT REVERSE |
| LDI | AR0,AR2 | ; AR2 -> ARRAY X (BIT REV.) |
| NOP | *AR2++(IR0)B | ; INCR. BR(AR2) (OUTSIDE LOOP) |
| NOP | *AR0++ | ; INCR. AR0 (OUTSIDE LOOP) |
| LDI | AR1,AR3 | ; AR3 -> ARRAY Y (BIT REV.) |

; DO BIT REVERSE SWAP ON BOTH ARRAYS
; SKIPPING THE 0TH AND N-1ST ELEMENTS

|  | RPTB | LOOP3 | ; REPEAT LOOP N-2 TIMES |
| :---: | :---: | :---: | :---: |
|  | CMPI | AR2,AR0 | ; COMPARE AR2 TO AR0 |
|  | BGED | LOOP3 | ; IF AR0 >= AR2, LOOP (DELAYED) |
|  | NOP | *AR1++ | ; INCR. AR1 |
|  | LDF | *AR2,R2 | ; R2 <= X[J] |
|  | LDF | *AR1,R1 | ; R1 <= Y[I] |
| \|\| | LDF | *AR3,R3 | ; R3 <= Y[J] |
|  | STF | R0,*AR2 | ; X[J] <= R0 |
| \|\| | STF | R2,*-AR0 | ; X[I] <= R2 |
|  | STF | R1,*AR3 | ; Y[J] <= R1 |
| \|\| | STF | R3,*AR1 | ; Y[I] <= R3 |
| LOOP3: | NOP | *AR2++(IR0)B | ; INCR. BR(AR2) |
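The permutation that the bit-reverse routine applies can be sketched in Python as follows. This is an illustrative restatement under stated assumptions, not the routine itself: the function name `bit_reverse_in_place` is hypothetical, and the reversed index is computed from a binary string rather than with the 'C30 bit-reversed addressing mode used in the listing. As in the listing, elements 0 and N-1 are skipped because they map to themselves, and a pair is swapped only once (when I < J).

```python
def bit_reverse_in_place(x, y):
    """Swap X[I], Y[I] with X[J], Y[J], where J is I with its bits reversed.

    The length must be a power of two and at least 4, matching the
    restriction stated in the routine's header.
    """
    n = len(x)
    if n != len(y) or n < 4 or n & (n - 1):
        raise ValueError("arrays must share a power-of-two length >= 4")
    bits = n.bit_length() - 1              # log2(n)
    for i in range(1, n - 1):              # indices 0 and n-1 never move
        j = int(format(i, "0{}b".format(bits))[::-1], 2)
        if i < j:                          # swap each pair exactly once
            x[i], x[j] = x[j], x[i]
            y[i], y[j] = y[j], y[i]
```

For an 8-point array holding its own indices, the call leaves the order 0, 4, 2, 6, 1, 5, 3, 7.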


| TABA | .WORD | CTAB |  |
| :---: | :---: | :---: | :---: |
| ; | START OF PROGRAM AREA |  |  |
|  | .TEXT |  |  |
| ; | MEMORY BASED PARAMETER ENTRY |  |  |
| MFMIEEE: |  |  |  |
|  | LDP | @\$PARMS | ; LOAD DATA PAGE POINTER |
|  | LDI | @\$N,RC | ; RC <= N |
|  | LDI | @\$IAD1,AR0 | ; AR0 -> IEEE ARRAY |
| ; | REGISTER BASED PARAMETER ENTRY |  |  |
| RFMIEEE: |  |  |  |
|  | SUBI | 1,RC | ; RC <= N-1 |
|  | LDP | @CTAB | ; LOAD DATA PAGE POINTER |
|  | LDI | @TABA,AR1 | ; AR1 -> CONSTANT TABLE |
| ; | IEEE -> 'C30 CONVERSION LOOP |  |  |
|  | RPTB | LOOP4 | ; REPEAT LOOP N TIMES |
|  | AND | *AR0,*AR1,R0 | ; REPLACE FRACTION WITH 0 |
|  | ADDI | *AR0,R0 | ; SHIFT SIGN AND EXPONENT INSERTING 0 |
|  | LDIZ | *+AR1(1),R0 | ; IF ALL ZERO, LOAD 'C30 0.0 |
|  | LDI | *AR0,R1 | ; TEST ORIGINAL NUMBER |
|  | BGED | LOOP4 | ; IF >= 0, STORE NUMBER (DELAYED) |
|  | SUBI | *+AR1(2),R0 | ; REMOVE EXPONENT BIAS (127) |
|  | PUSH | R0 | ; SAVE AS AN INTEGER |
|  | POPF | R0 | ; UNSAVE AS A FLT. PT. NUMBER |
|  | NEGF | R0 | ; NEGATE 'C30 NUMBER |
| LOOP4: | STF | R0,*AR0++ | ; STORE 'C30 NUMBER, INCR. AR0 |
|  | RETS |  | ; RETURN |
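The mapping from IEEE single precision to the 'C30 single-precision format can be described bit by bit. The Python sketch below is an arithmetic restatement, not the fast algorithm of the listing, and it rests on these assumptions about the 'C30 word layout: an 8-bit two's-complement exponent in bits 31-24, a sign bit in bit 23, a 23-bit fraction in bits 22-0, a mantissa of 01.f (positive) or 10.f (negative, two's complement), and exponent -128 reserved for zero. The function name `ieee_to_c30` is hypothetical. As in the routine, zero is the only special case handled; denormals are flushed to zero, and infinities and NaNs are rejected.

```python
import struct

def ieee_to_c30(value):
    """Return the assumed 32-bit 'C30 word for a finite, normalized float."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31
    biased = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    if biased == 0:                        # zero (denormals flushed to zero)
        return 0x80000000                  # exponent -128 encodes 0.0
    if biased == 0xFF:
        raise ValueError("infinities and NaNs are not handled in this sketch")
    exp = biased - 127                     # remove the IEEE exponent bias
    if sign:
        if frac == 0:
            exp -= 1                       # -1.0 * 2**e == -2.0 * 2**(e-1)
        else:
            frac = (1 << 23) - frac        # two's complement of the fraction
    return ((exp & 0xFF) << 24) | (sign << 23) | frac
```

For example, under these assumptions 1.0 maps to the all-zero word (exponent 0, sign 0, fraction 0), and -1.0 maps to exponent -1 with the sign bit set and a zero fraction.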

*******************************************************
PROGRAM: *TOIEEE
WRITTEN BY: GARY A. SITTON
GAS LIGHT SOFTWARE
HOUSTON, TEXAS
APRIL 1989.
CONVERT AN ARRAY OF TMS320C30 FLOATING-POINT NUMBERS TO IEEE FLOATING-POINT FORMAT. ZERO
IS THE ONLY SPECIAL CASE.
MTOIEEE ENTRY PROTOCOL:
VARIABLES FOR INPUT:
\$IAD1 -> X[0], \$N = N (LENGTH),
\$PARMS = DATA PAGE.
INPUT RESTRICTIONS: \$N > 0.
REGISTERS ALTERED: RC, DP, AR0-1 AND R0-1.
RTOIEEE ENTRY PROTOCOL:
REGISTERS FOR INPUT:
AR0 -> X[0], RC = N (LENGTH).
INPUT RESTRICTIONS: RC > 0.
REGISTERS ALTERED: RC, AR0-1 AND R0-1.
REGISTERS USED AND RESTORED: SP.
REGISTERS FOR OUTPUT: NONE.
ROUTINES NEEDED: NONE.
NOTE: *TOIEEE SHARES THE CTAB TABLE FROM *FMIEEE.
*******************************************************
; EXTERNAL MEMORY ADDRESSES
.GLOBL \$PARMS ; PARAMETER PAGE ADDRESS
; EXTERNAL VARIABLE ADDRESSES
.GLOBL \$N ; ARRAY LENGTH N
.GLOBL \$IAD1 ; ADDRESS OF INPUT X
; EXTERNAL PROGRAM NAMES
.GLOBL MTOIEEE ; MEMORY ENTRY FOR 'C30 -> IEEE CONVERSION
.GLOBL RTOIEEE ; REGISTER ENTRY FOR 'C30 -> IEEE CONVERSION
; START OF PROGRAM AREA
.TEXT
; MEMORY BASED PARAMETER ENTRY
MTOIEEE:

| LDP | @\$PARMS | ; LOAD DATA PAGE POINTER |
| :--- | :--- | :--- |
| LDI | @\$N,RC | ; RC <= N |
| LDI | @\$IAD1,AR0 | ; AR0 -> 'C30 ARRAY |

; REGISTER BASED PARAMETER ENTRY
RTOIEEE:

| SUBI | 1,RC | ; RC <= N-1 |
| :--- | :--- | :--- |
| LDP | @CTAB | ; LOAD DATA PAGE POINTER |
| LDI | @TABA,AR1 | ; AR1 -> CONSTANT TABLE |

; 'C30 -> IEEE CONVERSION LOOP

| RPTB | LOOP5 | ; REPEAT LOOP N TIMES |
| :---: | :---: | :---: |
| ABSF | *AR0,R0 | ; TEST \|NUMBER\| |
| LDFZ | *+AR1(4),R0 | ; IF = 0, LOAD FAKE 0.0 |
| LSH | 1,R0 | ; SHIFT OFF SIGN BIT |
| PUSHF | R0 | ; SAVE AS A FLT. PT. |
| LDF | *AR0,R1 | ; TEST ORIGINAL NUMBER |
| BGED | LOOP5 | ; IF >= 0, STORE NUMBER (DELAYED) |
| POP | R0 | ; UNSAVE AS AN INTEGER |
| ADDI | *+AR1(2),R0 | ; ADD EXPONENT BIAS (127) |
| LSH | -1,R0 | ; ADJUST FOR SIGN BIT |
| OR | *+AR1(3),R0 | ; NEGATE IEEE NUMBER |
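The reverse mapping, from a 'C30 word to IEEE single-precision bits, can be sketched the same way. As before, this is a bit-level description under assumptions, not the listing's algorithm: the 'C30 word is taken to hold an 8-bit two's-complement exponent (bits 31-24), a sign bit (bit 23), and a 23-bit fraction (bits 22-0), with exponent -128 meaning zero. The function name `c30_to_ieee` is hypothetical, and exponents that have no normalized IEEE counterpart raise an error rather than being rounded.

```python
def c30_to_ieee(word):
    """Return IEEE single-precision bits for an assumed 32-bit 'C30 word."""
    exp = (word >> 24) & 0xFF
    if exp >= 0x80:
        exp -= 0x100                       # sign-extend the 8-bit exponent
    sign = (word >> 23) & 1
    frac = word & 0x7FFFFF
    if exp == -128:                        # 'C30 representation of 0.0
        return 0x00000000
    if sign:
        if frac == 0:
            exp += 1                       # -2.0 * 2**e == -1.0 * 2**(e+1)
        else:
            frac = (1 << 23) - frac        # magnitude of the fraction
    biased = exp + 127                     # add the IEEE exponent bias
    if not 1 <= biased <= 254:
        raise OverflowError("no normalized IEEE equivalent in this sketch")
    return (sign << 31) | (biased << 23) | frac
```

The result can be turned into a Python float with `struct.unpack(">f", struct.pack(">I", bits))[0]` when a numeric check is wanted.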


PROGRAM: *VECMULT
WRITTEN BY: GARY A. SITTON
GAS LIGHT SOFTWARE
HOUSTON, TEXAS
FEBRUARY 1989.
SCALAR - VECTOR MULTIPLY: X[I] <= X[I] * C, C IS A
CONSTANT AND THE ARRAY X IS OF LENGTH N >= 1.
MVECMULT ENTRY PROTOCOL:
VARIABLES FOR INPUT:
\$IAD1 -> X[0], \$N = N (LENGTH),
\$CNST = C, \$PARMS = DATA PAGE.
INPUT RESTRICTIONS: \$N > 0.
REGISTERS ALTERED: RC, DP, AR0 AND R0-1.
RVECMULT ENTRY PROTOCOL:
REGISTERS FOR INPUT:
AR0 -> X[0], R0 = C, RC = N (LENGTH).
INPUT RESTRICTIONS: RC > 0.
REGISTERS ALTERED: RC, AR0 AND R1.
REGISTERS USED AND RESTORED: SP.
REGISTERS FOR OUTPUT: NONE.
ROUTINES NEEDED: NONE.


; EXTERNAL MEMORY ADDRESSES
.GLOBL \$PARMS ; PARAMETER PAGE ADDRESS
; EXTERNAL VARIABLE ADDRESSES

| .GLOBL | \$N | ; ARRAY LENGTH N |
| :---: | :---: | :---: |
| .GLOBL | \$CNST | ; ADDRESS OF CONSTANT C |
| .GLOBL | \$IAD1 | ; ADDRESS OF INPUT X |

; EXTERNAL PROGRAM NAMES
.GLOBL MVECMULT ; MEMORY ENTRY FOR SCALAR - VECTOR MULTIPLY
.GLOBL RVECMULT ; REGISTER ENTRY FOR SCALAR - VECTOR MULTIPLY
; START OF PROGRAM AREA
.TEXT
; MEMORY BASED PARAMETER ENTRY
MVECMULT:

| LDP | @\$PARMS | ; LOAD DATA PAGE POINTER |
| :--- | :--- | :--- |
| LDI | @\$N,RC | ; RC <= N |
| LDI | @\$IAD1,AR0 | ; AR0 -> X[0] |
| LDF | @\$CNST,R0 | ; R0 <= C |

```
; REGISTER BASED PARAMETER ENTRY
RVECMULT:
SUBI
BLT SKIP1 ; IF RC < 0 THEN SKIP LOOP
; SCALAR - VECTOR MULTIPLY LOOP
RPTS RC ; REPEAT INST. N-1 TIMES
MPYF R0,*++AR0,R1 ; R1 <= C*X[I+1]
|| STF R1,*AR0 ; X[I] <= C*X[I]
SKIP1: STF R1,*AR0 ; X[N-1] <= C*X[N-1]
RETS ; RETURN
```

PROGRAM: *CONMOV
WRITTEN BY: GARY A. SITTON
GAS LIGHT SOFTWARE
HOUSTON, TEXAS
FEBRUARY 1989.
SCALAR -> VECTOR MOVE: X[I] <= C, C IS A
CONSTANT AND THE ARRAY X IS OF LENGTH N.
MCONMOV ENTRY PROTOCOL:
VARIABLES FOR INPUT:
\$IAD1 -> X[0], \$N = N (LENGTH),
\$CNST = C, \$PARMS = DATA PAGE.
INPUT RESTRICTIONS: \$N > 0.
REGISTERS ALTERED: RC, DP, AR0, AND R0.
RCONMOV ENTRY PROTOCOL:
REGISTERS FOR INPUT:
AR0 -> X[0], R0 = C, RC = N (LENGTH).
INPUT RESTRICTIONS: RC > 0.
REGISTERS ALTERED: RC, AR0.
REGISTERS FOR OUTPUT: NONE.
ROUTINES NEEDED: NONE.


; EXTERNAL MEMORY ADDRESSES
.GLOBL \$PARMS ; PARAMETER PAGE ADDRESS
; EXTERNAL VARIABLE ADDRESSES
.GLOBL \$N ; ARRAY LENGTH N
.GLOBL \$IAD1 ; ADDRESS OF INPUT X
; EXTERNAL PROGRAM NAMES
.GLOBL MCONMOV ; MEMORY ENTRY FOR CONSTANT TO VECTOR MOVE
; START OF PROGRAM AREA
.TEXT
; MEMORY BASED PARAMETER ENTRY
MCONMOV:

| LDP | @\$PARMS | ; LOAD DATA PAGE POINTER |
| :---: | :---: | :---: |
| LDI | @\$N,RC | ; RC <= N |
| LDI | @\$IAD1,AR0 | ; AR0 -> X[0] |
| LDF | @\$CNST,R0 | ; R0 <= C |

; REGISTER BASED PARAMETER ENTRY
RCONMOV:
SUBI 1,RC ; RC <= N-1
; SCALAR TO VECTOR MOVE LOOP

| RPTS | RC | ; REPEAT INST. N TIMES |
| :--- | :--- | :--- |
| STF | R0,*AR0++ | ; X[I] <= C |
| RETS |  | ; RETURN |



; EXTERNAL MEMORY ADDRESSES
.GLOBL \$PARMS ; PARAMETER PAGE ADDRESS
; EXTERNAL VARIABLE ADDRESSES
.GLOBL \$N ; ARRAY LENGTH N
.GLOBL \$IAD1 ; ADDRESS OF INPUT X
.GLOBL \$IAD2 ; ADDRESS OF INPUT Y

; EXTERNAL PROGRAM NAMES
.GLOBL MVECMOV ; MEMORY ENTRY FOR VECTOR TO VECTOR MOVE
.GLOBL RVECMOV ; REGISTER ENTRY FOR VECTOR TO VECTOR MOVE
; START OF PROGRAM AREA
.TEXT
; MEMORY BASED PARAMETER ENTRY
MVECMOV:

|  | LDP | @\$PARMS | ; LOAD DATA PAGE POINTER |
| :---: | :---: | :---: | :---: |
|  | LDI | @\$N,RC | ; RC <= N |
|  | LDI | @\$IAD1,AR0 | ; AR0 -> X[0] |
|  | LDI | @\$IAD2,AR1 | ; AR1 -> Y[0] |
| ; | REGISTER BASED PARAMETER ENTRY |  |  |
| RVECMOV: |  |  |  |
|  | SUBI | 2,RC | ; RC <= N-2 |
|  | LDF | *AR0++,R0 | ; R0 <= X[0] |
|  | CMPI | 0,RC | ; COMPARE RC TO 0 |
|  | BLT | SKIP2 | ; IF RC < 0 THEN SKIP LOOP |
| ; | VECTOR MOVE LOOP |  |  |
|  | RPTS | RC | ; REPEAT INST. N-1 TIMES |
|  | LDF | *AR0++,R0 | ; R0 <= X[I+1] |
| \|\| | STF | R0,*AR1++ | ; MOVE X[I] TO Y[I] |
| SKIP2: | STF | R0,*AR1 | ; MOVE X[N-1] TO Y[N-1] |
|  | RETS |  | ; RETURN |
|  | .END |  |  |

## PROGRAM: \$FFT2.ASM

RADIX 2 FFT ROUTINES

\$FFT2.ASM CONSISTS OF THE FOLLOWING ROUTINES:
CFFFT2 - COMPLEX DIF FORWARD RADIX 2 FFT USING SEPARATE REAL AND IMAGINARY ARRAYS AND 3/4 CYCLE SINE TABLE.
CIFFT2 - COMPLEX DIT INVERSE RADIX 2 FFT USING SEPARATE REAL AND ... THE 1/N SCALE FACTOR.

```
*******************************************************
*
* PROGRAM: CFFFT2
* WRITTEN BY: GARY A. SITTON
*             GAS LIGHT SOFTWARE
*             HOUSTON, TEXAS
*             MARCH 1989.
* SPECIAL VERSION USES 3/4 SINE TABLE LOOKUP WITH
* THE PARAMETERS PASSED IN PREDEFINED MEMORY LOCATIONS.
* COMPLEX RADIX-2 DIF FORWARD FFT FOR THE TMS320C30.
* THIS PROGRAM ASSUMES NORMAL ORDERED DATA AS INPUT,
* BUT LEAVES THE OUTPUT INDEXED IN BIT REVERSED ORDER.
* TWO POINTERS ARE USED FOR SEPARATE REAL AND IMAGINARY
* ARRAYS.
* VARIABLES FOR INPUT:
*     $IAD1 -> REAL[0], $IAD2 -> IMAG[0],
*     $N = N (LENGTH), $M = M (LOG2(N)),
*     $SINE -> SINE TABLE, $PARMS = DATA PAGE.
*     INPUT RESTRICTIONS: $N >
* REGISTERS ALTERED: RC, DP, IR0-1, AR0-7, AND R0-7.
* REGISTERS USED AND RESTORED: SP.
* REGISTERS FOR OUTPUT: NONE.
* ROUTINES NEEDED: NONE.
*******************************************************
```

; EXTERNAL PROGRAM NAMES
.GLOBL CFFFT2 ; ENTRY POINT FOR EXECUTION
; EXTERNAL MEMORY ADDRESSES
.GLOBL \$SINE ; SINE TABLE ADDRESS
.GLOBL \$PARMS ; PARAMETER PAGE ADDRESS

; EXTERNAL VARIABLE ADDRESSES

| .GLOBL | \$N | ; FFT LENGTH, N = 2**M |
| :---: | :---: | :---: |
| .GLOBL | \$M | ; M = LOG2(N) >= 2 |
| .GLOBL | \$IAD1 | ; REAL INPUT ARRAY ADDRESS |
| .GLOBL | \$IAD2 | ; IMAGINARY INPUT ARRAY ADDRESS |

.TEXT
; START OF DIF FFFT PROGRAM
CFFFT2:
; INITIALIZE LOOP VARIABLES
LDP @\$PARMS ; LOAD DATA PAGE POINTER

; OUTER LOOP

```
FLOOP: ADDI 1,AR6 ; K <= K + 1
LDI @$IAD1,AR0 ; AR0 -> X(0)
ADDI R7,AR0,AR1 ; AR1 -> X(L)
LDI @$IAD2,AR2 ; AR2 -> Y(0)
ADDI R7,AR2,AR3 ; AR3 -> Y(L)
LDI R5,RC ; SETUP 1ST INNER LOOP REPEAT COUNTER
SUBI 1,RC ; RC (ONE LESS THAN THE DESIRED #)
```

; FIRST INNER LOOP (UNITY TWIDDLE FACTOR)

|  | RPTB | FBLK1 | ; REPEAT BLOCK IE TIMES |
| :---: | :---: | :---: | :---: |
|  | ADDF | *AR0,*AR1,R0 | ; R0 <= X(I) + X(L) |
|  | SUBF | *AR1,*AR0,R1 | ; R1 <= X(I) - X(L) |
|  | ADDF | *AR2,*AR3,R2 | ; R2 <= Y(I) + Y(L) AND... |
|  | SUBF | *AR3,*AR2,R3 | ; R3 <= Y(I) - Y(L) |
|  | STF | R0,*AR0++(IR0) | ; X(I) <= R0, INCR. AR0 AND... |
| \|\| | STF | R1,*AR1++(IR0) | ; X(L) <= R1, INCR. AR1 |
| FBLK1: | STF | R2,*AR2++(IR0) | ; Y(I) <= R2, INCR. AR2 AND... |
| \|\| | STF | R3,*AR3++(IR0) | ; Y(L) <= R3, INCR. AR3 |

; PROGRAM EXIT TEST

| CMPI @\$M,AR6 | ; COMPARE M TO K |
| :--- | :--- |
| RETSGE | ; IF K >= M THEN RETURN |

; MAIN INNER LOOP

|  | LDI | 2,AR7 | ; J <= 2 (PRE-INCREMENTED) |
| :---: | :---: | :---: | :---: |
|  | LDI | 1,AR0 | ; AR0 <= I (INIT. 1) |
|  | LDI | 1,AR2 | ; AR2 <= I (INIT. 1) |
|  | LDI | @\$SINE,AR5 | ; AR5 -> SINTAB[IA] (INIT. IA = 0) |
| FINLOP: | ADDI | R5,AR5 | ; AR5 -> SINTAB[IA <= IA + IE] |
|  | LDF | *AR5,R6 | ; R6 <= SIN(X), (X = (2*PI/N)*IA) |
|  | ADDI | AR5,IR1,AR4 | ; AR4 -> COS(X) |
|  | ADDI | @\$IAD1,AR0 | ; AR0 -> X(I) |
|  | ADDI | @\$IAD2,AR2 | ; AR2 -> Y(I) |
|  | ADDI | R7,AR0,AR1 | ; AR1 -> X(L) |
|  | ADDI | R7,AR2,AR3 | ; AR3 -> Y(L) |
|  | LDI | R5,RC | ; SETUP 2ND INNER LOOP REPEAT COUNTER |
|  | SUBI | 1,RC | ; RC (ONE LESS THAN THE DESIRED #) |

; SECOND INNER LOOP (DOES TWIDDLE ROTATION)
RPTB FBLK2 ; REPEAT BLOCK IE TIMES
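The structure of a radix-2 decimation-in-frequency FFT with a 3/4-cycle sine table can be sketched in Python as follows. This is a generic illustration under stated assumptions, not a line-by-line rendering of CFFFT2: the function name `dif_fft_radix2` and all variable names are hypothetical, and the table is built inside the function rather than passed in. The sketch does follow the conventions described in the routine's header: separate real and imaginary arrays, normally ordered input, bit-reversed output, and cos(x) read from the sine table at an offset of N/4 entries, which is why three quarters of a cycle suffice.

```python
import math

def dif_fft_radix2(re, im):
    """In-place forward DIF FFT; results are left in bit-reversed order."""
    n = len(re)
    if n != len(im) or n < 4 or n & (n - 1):
        raise ValueError("length must be a power of two, at least 4")
    quarter = n // 4                       # offset from sin(x) to cos(x)
    sine = [math.sin(2.0 * math.pi * k / n) for k in range(3 * quarter)]
    span = n // 2                          # distance between butterfly legs
    step = 1                               # twiddle index increment per leg
    while span >= 1:
        for start in range(0, n, 2 * span):
            for j in range(span):
                ia = j * step              # table index; always below n/2
                s, c = sine[ia], sine[ia + quarter]
                i = start + j
                l = i + span
                dr, di = re[i] - re[l], im[i] - im[l]
                re[i] += re[l]             # upper leg: plain sum
                im[i] += im[l]
                re[l] = dr * c + di * s    # lower leg: difference times
                im[l] = di * c - dr * s    # exp(-j*x) = cos(x) - j*sin(x)
        span //= 2
        step *= 2
```

When the twiddle index is zero the rotation reduces to a plain add and subtract, which is the case the listing handles separately in its first inner loop.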


*******************************************************
; EXTERNAL PROGRAM NAMES
.GLOBL CIFFT2 ; ENTRY POINT FOR EXECUTION
; EXTERNAL MEMORY ADDRESSES

| .GLOBL | \$SINE |
| :--- | :--- |
| .GLOBL | \$PARMS |

; EXTERNAL VARIABLE ADDRESSES

| .GLOBL | \$N | ; FFT LENGTH, N = 2**M |
| :---: | :---: | :---: |
| .GLOBL | \$M | ; M = LOG2(N) >= 2 |
| .GLOBL | \$IAD1 | ; REAL INPUT ARRAY ADDRESS |
| .GLOBL | \$IAD2 | ; IMAGINARY INPUT ARRAY ADDRESS |

; START OF DIT IFFT PROGRAM
.TEXT
CIFFT2:
; INITIALIZE LOOP VARIABLES
LDP @\$PARMS ; LOAD DATA PAGE POINTER
LDI @\$N,IR0 ; IR0 <= N

| LDI | IR0,IR1 | ; IR1 <= N |
| :---: | :---: | :---: |
| LSH | -2,IR1 | ; IR1 <= N/4, OFFSET FOR COSINE |
| LDI | @\$M,AR6 | ; AR6 <= K (INIT. M) |
| LDI | 1,R7 | ; R7 <= N2 (INIT. 1) |
| LDI | IR0,R5 | ; R5 <= N |
| LSH | -1,R5 | ; R5 <= IE (INIT. N/2) |
| LDI | 2,IR0 | ; IR0 <= N1 (INIT. 2) |

; OUTER LOOP
ILOOP: LDI @\$IAD1,AR0 ; AR0 -> X(0)
ADDI R7,AR0,AR1 ; AR1 -> X(L)
LDI @\$IAD2,AR2 ; AR2 -> Y(0)
ADDI R7,AR2,AR3 ; AR3 -> Y(L)
LDI R5,RC ; SETUP 1ST INNER LOOP REPEAT COUNTER
SUBI 1,RC ; RC (ONE LESS THAN THE DESIRED #)
; FIRST INNER LOOP (UNITY TWIDDLE FACTOR)

|  | RPTB | IBLK1 | ; REPEAT BLOCK IE TIMES |
| :---: | :---: | :---: | :---: |
|  | ADDF | *AR0,*AR1,R0 | ; R0 <= X(I) + X(L) |
|  | SUBF | *AR1,*AR0,R1 | ; R1 <= X(I) - X(L) |
|  | ADDF | *AR2,*AR3,R2 | ; R2 <= Y(I) + Y(L) AND... |
|  | SUBF | *AR3,*AR2,R3 | ; R3 <= Y(I) - Y(L) |
|  | STF | R0,*AR0++(IR0) | ; X(I) <= R0, INCR. AR0 AND... |
| \|\| | STF | R1,*AR1++(IR0) | ; X(L) <= R1, INCR. AR1 |
| IBLK1: | STF | R2,*AR2++(IR0) | ; Y(I) <= R2, INCR. AR2 AND... |
| \|\| | STF | R3,*AR3++(IR0) | ; Y(L) <= R3, INCR. AR3 |
|  | CMPI | @\$M,AR6 | ; COMPARE M TO K |
|  | BEQD | SKIP | ; IF K = M THEN SKIP TWIDDLED LOOP |
| ; | MAIN INNER LOOP |  |  |
|  | LDI | 2,AR7 | ; J <= 2 (PRE-INCREMENTED) |
|  | LDI | 1,AR0 | ; AR0 <= I (INIT. 1) |
|  | LDI | 1,AR2 | ; AR2 <= I (INIT. 1) |
|  | LDI | @\$SINE,AR5 | ; AR5 <= IA (INIT. 0) |
| IINLOP: | ADDI | R5,AR5 | ; AR5 -> SINTAB[IA <= IA + IE] |
|  | LDF | *AR5,R6 | ; R6 <= SIN(X), (X = (2*PI/N)*IA) |
|  | ADDI | AR5,IR1,AR4 | ; AR4 -> COS(X) |
|  | ADDI | @\$IAD1,AR0 | ; AR0 -> X(I) |
|  | ADDI | @\$IAD2,AR2 | ; AR2 -> Y(I) |
|  | ADDI | R7,AR0,AR1 | ; AR1 -> X(L) |
|  | ADDI | R7,AR2,AR3 | ; AR3 -> Y(L) |
|  | LDI | R5,RC | ; SETUP 2ND INNER LOOP REPEAT COUNTER |
|  | SUBI | 1,RC | ; RC (ONE LESS THAN THE DESIRED #) |

; SECOND INNER LOOP (DOES TWIDDLE ROTATION)

| RPTB | IBLK2 | ; REPEAT BLOCK IE TIMES |
| :---: | :---: | :---: |
| MPYF | *AR4,*AR1,R4 | ; R4 <= COS*X(L) |
| MPYF | R6,*AR3,R3 | ; R3 <= SIN*Y(L) |



## PROGRAM: \$LINALG.ASM

LINEAR ALGEBRA ROUTINES

\$LINALG.ASM CONSISTS OF THE FOLLOWING ROUTINES:

*SOLUTN - SOLVES A WELL CONDITIONED SYSTEM OF LINEAR EQUATIONS WITH ANY NUMBER OF DEPENDENT VARIABLE SETS. USES NO (DIAGONAL) PIVOTING WITH NORMAL PRECISION FLOATING-POINT MATH.
*SOLUTNX - SOLVES A WELL CONDITIONED SYSTEM OF LINEAR EQUATIONS WITH ANY NUMBER OF DEPENDENT VARIABLE SETS. USES NO (DIAGONAL) PIVOTING WITH EXTENDED-PRECISION FLOATING-POINT MATH.



| .GLOBL MSOLUTN | ; MEMORY BASED ENTRY |
| :--- | :--- |
| .GLOBL RSOLUTN | ; REGISTER BASED ENTRY |
| .GLOBL FPINV | ; RECIPROCAL ROUTINE |

; EXTERNAL PARAMETER NAMES

| .GLOBL | \$PARMS | ; PARAMETER SPACE ADDRESS |
| :---: | :---: | :---: |
| .GLOBL | \$IAD1 | ; POINTER TO MATRIX B, ADDRESS OF B[0, 0] |
| .GLOBL | \$NROW | ; NUMBER OF ROWS IN B, VALUE OF M |
| .GLOBL | \$NCOL | ; NUMBER OF COLUMNS IN B, VALUE OF N |

; START SOLUTN PROGRAM
.TEXT
; MEMORY BASED PARAMETER ENTRY

MSOLUTN:

| LDP | @\$PARMS | ; LOAD DATA PAGE POINTER |
| :---: | :---: | :---: |
| LDI | @\$IAD1,AR0 | ; AR0 -> B[0, 0] |
| LDI | @\$NROW,AR1 | ; AR1 <= M |
| LDI | @\$NCOL,AR2 | ; AR2 <= N |

; REGISTER BASED PARAMETER ENTRY

RSOLUTN:

; SETUP LOOP REGISTERS

| LDP | @EPSN | ; LOAD DATA PAGE POINTER |
| :---: | :---: | :---: |
| LDI | 0,IR0 | ; IR0 <= K (INIT. 0) |
| LDI | AR0,AR3 | ; AR3 -> B[0, 0] |
| SUBI | 1,AR1 | ; AR1 <= M-1 |
| LDI | AR2,AR6 | ; AR6 <= N |
| SUBI | 2,AR6 | ; AR6 <= N-2 |

; MAIN LOOP (K INDEX)
KLOOP: LDF *+AR3(IR0),R3 ; R3 <= B[K, K], NEXT PIVOT
ABSF R3,R0 ; R0 <= |R3|
CMPF @EPSN,R0 ; COMPARE |B[K, K]| TO EPS
BLT SING ; IF |B[K, K]| < EPS THEN STOP
; COMPUTE RECIPROCAL OF -PIVOT ELEMENT
NEGF R3,R0
CALL FPINV ; R0 <= -1/B[K, K]
RND R0

; DIVIDE RIGHT PART OF PIVOT ROW BY -PIVOT ELEMENT

|  | ADDI | AR3,IR0,AR7 | ; AR7 -> B[K, K] |
| :---: | :---: | :---: | :---: |
|  | LDI | AR6,RC | ; RC <= N-K-2 |
|  | RPTB | DLOOP | ; REPEAT DIVIDE LOOP N-K-1 TIMES |
|  | MPYF | R0,*++AR7,R2 | ; R2 <= B[K, J]*(-1/B[K, K]) |
| * | RND | R2 | ; REMOVE "*" TO ROUND |
| DLOOP: | STF | R2,*AR7 | ; B[K, J] <= R2 |
| ; | START INNER LOOP (I INDEX) |  |  |
|  | LDI | 0,IR1 | ; IR1 <= I (INIT. 0) |
|  | LDI | AR0,AR4 | ; AR4 -> B[0, 0] |
|  | CMPI | IR0,IR1 | ; COMPARE I TO K |
| ILOOP: | BEQ | SKIP | ; IF I = K THEN SKIP PIVOT ROW |

; COMPLETE PIVOTING OPERATION

| ADDI | AR4,IR0,AR5 | ; AR5 -> B[I, K] |
| :---: | :---: | :---: |
| LDF | *AR5,R0 | ; R0 <= B[I, K] |
| LDI | AR6,RC | ; RC <= N-K-2 |
| CMPI | 1,RC | ; COMPARE RC TO 1 |
| BLTD | JUMP | ; IF RC < 1 THEN NO RPTB (DELAYED) |
| SUBI | 1,RC | ; RC <= N-K-3 |
| ADDI | AR3,IR0,AR7 | ; AR7 -> B[K, J] |
| MPYF | R0,*++AR7,R1 | ; R1 <= B[K, K+1]*B[I, K] |

; START INNER-INNER LOOP (J INDEX)

|  | RPTB | JLOOP | ; REPEAT PIVOT LOOP N-K-2 TIMES |
| :---: | :---: | :---: | :---: |
|  | MPYF | R0,*++AR7,R1 | ; R1 <= B[K, J]*B[I, K] |
| \|\| | ADDF | R1,*++AR5,R2 | ; R2 <= B[I, J] + R1 |
| * | RND | R2 | ; REMOVE "*" TO ROUND |
| JLOOP: | STF | R2,*AR5 | ; B[I, J] <= R2 |

; END OF INNER-INNER LOOP (J INDEX)

| JUMP: | ADDF | R1,*++AR5,R2 | ; R2 <= B[I, N-1] + R1 |
| :---: | :---: | :---: | :---: |
| * | RND | R2 | ; REMOVE "*" TO ROUND |
|  | STF | R2,*AR5 | ; B[I, N-1] <= R2 |
| SKIP: | CMPI | AR1,IR1 | ; COMPARE I TO M-1 |
|  | BLTD | ILOOP | ; IF I < M-1 THEN LOOP (DELAYED) |
|  | ADDI | AR2,AR4 | ; AR4 -> B[I+1, 0] |
|  | ADDI | 1,IR1 | ; I <= I+1 |
|  | CMPI | IR0,IR1 | ; COMPARE I TO K |

RETS



| .GLOBL | MSOLUTNX | ; MEMORY BASED ENTRY |
| :---: | :---: | :---: |
| .GLOBL | RSOLUTNX | ; REGISTER BASED ENTRY |
| .GLOBL | FPINVX | ; RECIPROCAL ROUTINE |
| .GLOBL | FMULTX | ; MULTIPLY ROUTINE |

; EXTERNAL PARAMETER NAMES

|  | .GLOBL | \$PARMS | ; PARAMETER SPACE ADDRESS |
| :---: | :---: | :---: | :---: |
|  | .GLOBL | \$IAD1 | ; POINTER TO MATRIX B, ADDRESS OF B[0, 0] |
|  | .GLOBL | \$NROW | ; NUMBER OF ROWS IN B, VALUE OF M |
|  | .GLOBL | \$NCOL | ; NUMBER OF COLUMNS IN B, VALUE OF N |
| ; | INTERNAL CONSTANTS |  |  |
|  | .DATA |  |  |
| EPSNX | .FLOAT | 1.0E-10 | ; SINGULARITY CRITERION |
| ZEROX | .SET | 0.0 | ; SINGULARITY FLAG |
| ; | START SOLUTNX PROGRAM |  |  |
|  | .TEXT |  |  |
| ; | MEMORY BASED PARAMETER ENTRY |  |  |
| MSOLUTNX: |  |  |  |
|  | LDP | @\$PARMS | ; LOAD DATA PAGE POINTER |
|  | LDI | @\$IAD1,AR0 | ; AR0 -> B[0, 0] |
|  | LDI | @\$NROW,AR1 | ; AR1 <= M |
|  | LDI | @\$NCOL,AR2 | ; AR2 <= N |
| ; | REGISTER BASED PARAMETER ENTRY |  |  |
| RSOLUTNX: |  |  |  |
| ; | SETUP LOOP REGISTERS |  |  |
|  | LDP | @EPSNX | ; LOAD DATA PAGE POINTER |
|  | LDI | 0,IR0 | ; IR0 <= K (INIT. 0) |
|  | LDI | AR0,AR3 | ; AR3 -> B[0, 0] |
|  | SUBI | 1,AR1 | ; AR1 <= M-1 |
|  | LDI | AR2,AR6 | ; AR6 <= N |
|  | SUBI | 2,AR6 | ; AR6 <= N-2 |
| ; | MAIN LOOP (K INDEX) |  |  |
| KLOOPX: | LDF | *+AR3(IR0),R3 | ; R3 <= B[K, K], NEXT PIVOT |
|  | ABSF | R3,R0 | ; R0 <= \|R3\| |
|  | CMPF | @EPSNX,R0 | ; COMPARE \|B[K, K]\| TO EPS |
|  | BLT | SINGX | ; IF \|B[K, K]\| < EPS THEN STOP |
| ; | COMPUTE RECIPROCAL OF -PIVOT ELEMENT |  |  |
| ; | coupute reciproch of -PIVOT ELEMENT |  |  |


| NEGF | R3,R0 | ; R0 <= -B[K, K] |
| :--- | :--- | :--- |
| CALL | FPINVX | ; R0 <= -1/B[K, K] |
| LDF | R0,R1 | ; R1 <= -1/B[K, K] |

; DIVIDE RIGHT PART OF PIVOT ROW BY -PIVOT ELEMENT

|  | ADDI | AR3,IR0,AR7 | ; AR7 -> B[K, K] |
| :---: | :---: | :---: | :---: |
|  | LDI | AR6,RC | ; RC <= N-K-2 |
|  | RPTB | DLOOPX | ; REPEAT DIVIDE LOOP N-K-1 TIMES |
|  | LDF | *++AR7,R0 | ; R0 <= B[K, J] |
|  | CALL | FMULTX | ; R0 <= B[K, J]*(-1/B[K, K]) |
|  | RND | R0 | ; ROUND |
| DLOOPX: | STF | R0,*AR7 | ; B[K, J] <= R0 |

; START INNER LOOP (I INDEX)

|  | LDI | 0,IR1 | ; IR1 <= I (INIT. 0) |
| :---: | :---: | :---: | :---: |
|  | LDI | AR0,AR4 | ; AR4 -> B[0, 0] |
|  | CMPI | IR0,IR1 | ; COMPARE I TO K |
| ILOOPX: | BEQ | SKIPX | ; IF I = K THEN SKIP PIVOT ROW |

; COMPLETE PIVOTING OPERATION

| ADDI | AR4,IR0,AR5 | ; AR5 -> B[I, K] |
| :---: | :---: | :---: |
| LDF | *AR5,R0 |  |
| LDI | AR6,RC | ; RC <= N-K-2 |
| CMPI | 1,RC | ; COMPARE RC TO 1 |
| BLTD | JUMPX | ; IF RC < 1 THEN NO RPTB (DELAYED) |
| SUBI | 1,RC | ; RC <= N-K-3 |
| ADDI | AR3,IR0,AR7 | ; AR7 -> B[K, J] |
| MPYF | R0,*++AR7,R1 | ; R1 <= B[K, K+1]*B[I, K] |

; START INNER-INNER LOOP (J INDEX)

|  | RPTB | JLOOPX | ; REPEAT PIVOT LOOP N-K-2 TIMES |
| :---: | :---: | :---: | :---: |
|  | MPYF | R0,*++AR7,R1 | ; R1 <= B[K, J]*B[I, K] |
| \|\| | ADDF | R1,*++AR5,R2 | ; R2 <= B[I, J] + R1 |
|  | RND | R2 | ; ROUND |
| JLOOPX: | STF | R2,*AR5 | ; B[I, J] <= R2 |

; END OF INNER-INNER LOOP (J INDEX)

| JUMPX: | ADDF | R1,*++AR5,R2 | ; R2 <= B[I, N-1] + R1 |
| :---: | :---: | :---: | :---: |
|  | RND | R2 | ; ROUND |
|  | STF | R2,*AR5 | ; B[I, N-1] <= R2 |
| SKIPX: | CMPI | AR1,IR1 | ; COMPARE I TO M-1 |
|  | BLTD | ILOOPX | ; IF I < M-1 THEN LOOP (DELAYED) |
|  | ADDI | AR2,AR4 | ; AR4 -> B[I+1, 0] |
|  | ADDI | 1,IR1 | ; I <= I+1 |
|  | CMPI | IR0,IR1 | ; COMPARE I TO K |

; END OF INNER LOOP (I INDEX)

| CMPI | AR1,IR0 | ; COMPARE K TO M-1 |
| :---: | :---: | :---: |
| BLTD | KLOOPX | ; IF K < M-1 THEN LOOP (DELAYED) |
| ADDI | AR2,AR3 | ; AR3 -> B[K+1, 0] |
| ADDI | 1,IR0 | ; K <= K+1 |

; END OF OUTER LOOP (K INDEX)

RETS ; RETURN
;
; SINGULAR SYSTEM EXIT
SINGX: LDF ZEROX,R3 ; SET "SINGULAR" FLAG
RETS ; RETURN
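The elimination scheme used by the solver routines (diagonal pivots only, a singularity threshold, and updates restricted to the columns to the right of the pivot) can be sketched in Python. This is a conventional Gauss-Jordan formulation offered as an illustration, not a transcription: the listings scale the pivot row by -1/B[K, K] and add, whereas this sketch scales by +1/B[K, K] and subtracts. The function name `solve_gauss_jordan` is hypothetical; the default threshold of 1.0E-10 is taken from the EPSNX constant in the listing.

```python
def solve_gauss_jordan(b, eps=1.0e-10):
    """Reduce the M x N augmented matrix b = [A | right-hand sides] in place.

    Returns True on success, with the solutions in columns M..N-1, or
    False as soon as a diagonal pivot is smaller in magnitude than eps.
    No row interchanges are attempted, so the system must be well conditioned.
    """
    m = len(b)
    n = len(b[0])
    for k in range(m):
        pivot = b[k][k]
        if abs(pivot) < eps:
            return False                   # singular (or needs pivoting)
        r = 1.0 / pivot
        for j in range(k + 1, n):          # scale the right part of row k
            b[k][j] *= r
        for i in range(m):
            if i == k:
                continue                   # skip the pivot row itself
            factor = b[i][k]
            for j in range(k + 1, n):      # eliminate column k from row i
                b[i][j] -= factor * b[k][j]
    return True
```

Because only columns to the right of the pivot are updated, the left part of the matrix is left with stale values; only the right-hand-side columns are meaningful on return. A system whose first diagonal element is zero is reported as singular even when a row exchange would make it solvable, which is the price of omitting pivoting.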

# Part III. Digital Signal Processing Interface Techniques

9. TMS320C30 Hardware Applications
(Jon Bradley)
10. TMS320C30-IEEE Floating-Point Format Converter (Randy Restle and Adam Cron)

# TMS320C30 Hardware Applications 

Jon Bradley<br>Digital Signal Processor Products-Semiconductor Group<br>Texas Instruments

## Introduction

The TMS320C30 is a high-speed, floating-point, digital signal processor. The TMS320C30's advanced interface design allows it to be used to implement a wide variety of system configurations. Its two external buses and DMA capability provide a parallel 32-bit interface to byte- or word-wide devices, while the interrupt interface, dual serial ports, and general-purpose digital I/O provide communication with a multitude of peripherals.

This application report describes how to use the TMS320C30's interfaces to connect to various external devices. Specific discussions include implementation of parallel interfaces to devices with and without wait states, use of general-purpose I/O, and system control functions. All interfaces shown in this report have been built and tested to verify proper operation.

Major topics discussed in this report are as follows:

- System Configuration Options Overview
- Primary Bus Interface
  - Zero Wait Interface to RAMs
  - Ready Generation
  - Bank Switching Techniques
- Expansion Bus Interface
  - A/D Converter Interface
  - D/A Converter Interface
- System Control Functions
  - Clock Oscillator Circuitry
  - Reset Signal Generator
- Serial Port Interface
- XDS1000 Target Design Considerations


## System Configuration Options Overview

The various TMS320C30 interfaces allow connections to a wide variety of different device types. Each of these interfaces is tailored to a particular family of devices.

## Categories of Interfaces on the TMS320C30

The interface types on the TMS320C30 fall into several different categories depending on the devices to which they were intended to be connected. Each interface comprises one or more signal lines that transfer information and control its operation. Shown in Figure 1 are the signal line groupings for each of these various interfaces.

Figure 1. External Interfaces on the TMS320C30


All of the interfaces are independent of one another and different operations may be performed simultaneously on each interface.

The Primary and Expansion buses implement the memory mapped interface to the device. The external DMA interface allows external devices to cause the processor to relinquish the Primary bus and allow direct memory access.

## Typical System Block Diagram

The devices that can be interfaced to the TMS320C30 include memory, DMA devices, and numerous parallel and serial peripherals and I/O devices. Figure 2 illustrates a typical configuration of a TMS320C30 system showing different types of external devices and the interfaces to which they are connected.

Figure 2. Possible System Configurations


This block diagram constitutes essentially a fully expanded system. In an actual design, any subset of the illustrated configuration may be used.

## Primary Bus Interface

The primary bus is used by the TMS320C30 to access the majority of its memory mapped locations. Therefore, typically when a large amount of external memory is required in a system, it is interfaced to the primary bus. The expansion bus (discussed in the next section) actually comprises two mutually exclusive interfaces, controlled by the MSTRB and IOSTRB signals respectively. Cycles on the expansion bus controlled by the MSTRB signal are essentially equivalent to cycles on the primary bus, with the exception that bank switching is not implemented on the expansion bus. Accordingly, the discussion of primary bus cycles in this section applies equally to MSTRB cycles on the expansion bus.

Although both the primary bus and the expansion bus may be used to interface to a wide variety of devices, the devices most commonly interfaced to these buses are memories. Therefore, detailed examples of memory interface will be presented in this section.

## Zero Wait State Interface To Static RAMs

For full speed, zero-wait state interface to any device, the TMS320C30 requires a read access time of 30 ns from address stable to data valid. Because, for most memories, access time from chip select is the same as access time from address, it is theoretically possible to use 30 ns memories at full speed with the TMS320C30. This, however, dictates that there be no delays present between the processor and the memories. This is usually not the case in practice, due to interconnection delays and the fact that typically some gating is required for chip select generation. Therefore, slightly faster memories are generally required in most systems. If one level of reasonably high-speed (below 10 ns in propagation delay) gating is used to generate chip select for the memories, 20 ns devices may be used.

Among currently available RAMs, there are two distinct categories of devices with different interface characteristics. These two categories are RAMs without output enable control lines ($\overline{\mathrm{OE}}$), which include the 1-bit wide organized RAMs and most of the 4-bit wide RAMs, and those with $\overline{\mathrm{OE}}$ controls, which include the byte wide and a few of the 4-bit wide RAMs. Many of the fastest RAMs do not provide $\overline{\mathrm{OE}}$ control, and use chip select ($\overline{\mathrm{CS}}$) controlled write cycles to ensure that data outputs do not turn on for write operations. In $\overline{\mathrm{CS}}$ controlled write cycles, the write control line ($\overline{\mathrm{WE}}$) goes low prior to $\overline{\mathrm{CS}}$ going low, and internal logic holds the outputs disabled until the cycle is completed. Using $\overline{\mathrm{CS}}$ controlled write cycles is an efficient way to interface fast RAMs without $\overline{\mathrm{OE}}$ controls to the TMS320C30 at full speed.

In the case of RAMs with $\overline{\mathrm{OE}}$ controls, the use of this signal can provide added flexibility in many systems. Additionally, many of these devices can be interfaced using $\overline{\mathrm{CS}}$ controlled write cycles with $\overline{\mathrm{OE}}$ tied low, in the same manner as with RAMs without $\overline{\mathrm{OE}}$ controls. There are, however, two requirements for interfacing to $\overline{\mathrm{OE}}$ RAMs in this fashion. First, the RAM's $\overline{\mathrm{OE}}$ input must be gated with chip select and $\overline{\mathrm{WE}}$ internally so that the device's outputs do not turn on unless a read is being performed. Second, the RAM must allow its address inputs to change while $\overline{\mathrm{WE}}$ is low, which some RAMs specifically prohibit.

The circuit shown in Figure 3 shows an interface to Cypress Semiconductor's CY7C186 $25 \mathrm{~ns} 8 \mathrm{~K} \times 8$-bit CMOS static RAMs with the $\overline{\mathrm{OE}}$ control input tied low and using a $\overline{\mathrm{CS}}$ controlled write cycle.

Figure 3. TMS320C30 Interface to Cypress Semiconductor CY7C186 CMOS SRAM


In this circuit, the two chip selects on the RAM are driven by $\overline{\mathrm{STRB}}$ and $\overline{\mathrm{A23}}$, which are ANDed together internally. The use of $\overline{\mathrm{A23}}$ locates the RAM at addresses 00000h through 03FFFh in external memory, and $\overline{\mathrm{STRB}}$ establishes the $\overline{\mathrm{CS}}$ controlled write cycle. The $\overline{\mathrm{WE}}$ control input is driven by the TMS320C30 R/$\overline{\mathrm{W}}$ signal; the $\overline{\mathrm{OE}}$ input is not used and is therefore connected to ground.

The timing of read operations, shown in Figure 4, is very straightforward since the two chip select inputs are driven directly. The read access time of the circuit is the inverter propagation delay added to the RAM's chip select access time, or $\mathrm{t}_{1}+\mathrm{t}_{2}=5+25=30 \mathrm{~ns}$. This access time meets the TMS320C30's specified 30 ns requirement.

Figure 4. Read Operations Timing


During write operations, as shown in Figure 5, the RAM's outputs do not turn on at all, due to the use of chip select controlled write cycles. The chip select controlled write cycles result from the fact that R/$\overline{\mathrm{W}}$ goes active (low) before the $\overline{\mathrm{STRB}}$ term of the chip select input. Because the RAM's output drivers are disabled whenever the $\overline{\mathrm{WE}}$ input is low (regardless of the state of the $\overline{\mathrm{OE}}$ input), bus conflicts with the TMS320C30 are automatically avoided with this interface. The circuit's data setup and hold times ($\mathrm{t}_{1}$ and $\mathrm{t}_{2}$ in the timing diagram) of approximately 50 and 20 ns, respectively, also easily meet the RAM's timing requirements of 10 and 0 ns.
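The read and write figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming only the numbers given in the text (5 ns inverter delay, 25 ns RAM access, 30 ns read requirement, 50/20 ns setup and hold provided against 10/0 ns required); the function and constant names are illustrative and do not come from the report.

```python
# Timing figures quoted in the text for the CY7C186 interface (all in ns).
READ_ACCESS_REQUIRED_NS = 30   # TMS320C30 zero-wait read access requirement
INVERTER_DELAY_NS = 5          # t1: inverter in the A23 chip select path
RAM_CS_ACCESS_NS = 25          # t2: RAM access time from chip select


def read_access_margin(gate_delay_ns, ram_access_ns,
                       required_ns=READ_ACCESS_REQUIRED_NS):
    """Return the slack (ns) left after gating delay plus RAM access time."""
    return required_ns - (gate_delay_ns + ram_access_ns)


def write_margins(setup_ns, hold_ns, ram_setup_ns, ram_hold_ns):
    """Return (setup slack, hold slack) in ns for a write cycle."""
    return setup_ns - ram_setup_ns, hold_ns - ram_hold_ns


if __name__ == "__main__":
    print("read slack (ns):", read_access_margin(INVERTER_DELAY_NS, RAM_CS_ACCESS_NS))
    print("write slack (ns):", write_margins(50, 20, 10, 0))
```

A negative read slack would indicate that a faster RAM, faster gating, or a wait state is needed; the quoted design sits exactly at the limit for reads and has comfortable slack for writes.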

Figure 5. Write Operations Timing


If more complex chip select decode is required than can be accomplished in time to meet zero-wait state timing, wait states or bank switching techniques (discussed in a later section) should be used.

It should be noted that the CY7C186's $\overline{\mathrm{OE}}$ control is gated internally with $\overline{\mathrm{CS}}$; therefore, the RAM's outputs are not enabled unless the device is selected. This is critical if there are any other devices connected to the same bus; if there are no other devices connected to the bus, then $\overline{\mathrm{OE}}$ need not be gated internally with chip select.

RAMs without $\overline{\mathrm{OE}}$ controls can also be easily interfaced to the TMS320C30 using a similar approach to that used with RAMs with $\overline{\mathrm{OE}}$ controls. If there is only one bank of memory implemented, and no other devices are present on the bus, the memories' $\overline{\mathrm{CS}}$ input may often be connected to $\overline{\mathrm{STRB}}$ directly. If several devices must be selected, however, a gate is generally required to AND the device select and $\overline{\mathrm{STRB}}$ to drive the $\overline{\mathrm{CS}}$ input to generate the chip select controlled write cycles. In either case, the $\overline{\mathrm{WE}}$ input is driven by the TMS320C30 R/$\overline{\mathrm{W}}$ signal. Provided sufficiently fast gating is used, 25 ns RAMs may still be used.

As with the case of RAMs with $\overline{\mathrm{OE}}$ control lines, this approach works well if only a few banks of memory are implemented where the chip select decode can be accomplished with only one level of gating. If many banks are required to implement very large memory spaces, bank switching can be used to provide for multiple bank select generation while still maintaining full speed accesses within each bank. Bank switching is discussed in detail in a later section.

## Ready Generation

The use of wait states can greatly increase system flexibility and reduce hardware requirements over systems without wait state capability. The TMS320C30 has the capability of generating wait states on either the primary bus or the expansion bus, and both buses have independent sets of ready control logic. Ready generation is discussed in this subsection from the perspective of the primary bus interface; however, wait state operation on the expansion bus is similar to that of the primary bus, so these discussions pertain equally well to expansion bus operation. Thus, ready generation will not be included in the specific discussions of the expansion bus interface.

Wait states are generated on the basis of the internal wait state generator, the external ready input ($\overline{\mathrm{RDY}}$), or the logical AND or OR of the two. When enabled, internally generated wait states affect all external cycles, regardless of the address accessed. If different numbers of wait states are required for various external devices, the external $\overline{\mathrm{RDY}}$ input may be used to tailor wait state generation to specific system requirements.

If the logical OR (or electrical AND since the signals are true low) of the external and wait count ready signals is selected, the earlier of either of the two signals will generate a ready condition and allow the cycle to be completed. It is not required that both signals be present.

The OR of the two ready signals can be used to implement wait states for devices that require a greater number of wait states than are implemented with external logic (up to seven). This feature is useful, for example, if a system contains some fast and some slow devices. In this case, fast devices can generate a ready signal externally with a minimum of logic, and slow devices can use the internal wait counter for larger numbers of wait states. Thus, when fast devices are accessed, the external hardware responds promptly with a ready signal that terminates the cycle. When slow devices are accessed, the external hardware does not respond, and the cycle is appropriately terminated after the internal wait count.

The OR of the two ready signals may also be used if conditions occur that require termination of bus cycles prior to the number of wait states implemented with external logic. In this case, a shorter wait count is specified internally than the number of wait states implemented with the external ready logic, and the bus cycle is terminated after the wait count. This feature may also be used as a safeguard against inadvertent accesses to nonexistent memory that would never respond with ready and therefore lock up the TMS320C30.

If the OR of the two ready signals is used, however, and the internal wait state count is less than the number of wait states implemented externally, the external ready generation logic must have the ability to reset its sequencing to allow a new cycle to begin immediately following the end of the internal wait count. This requires that, under these conditions, consecutive cycles must be from independently decoded areas of memory and that the external ready generation logic be capable of restarting its sequence as soon as a new cycle begins. Otherwise, the external ready generation logic may lose synchronization with bus cycles and therefore generate improperly timed wait states.

If the logical AND (electrical OR) of the wait count and external ready signals is selected, the later of the two signals will control the internal ready signal, and both signals must occur. Accordingly, external ready control must be implemented for each wait state device in addition to the wait count ready signal being enabled.
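The OR and AND combinations described above can be summarized behaviorally: with OR, the earlier of the two ready sources ends the cycle; with AND, the later one does, and both must occur. The following is a minimal sketch of that rule, not a model of the actual control register; the mode names and the use of `None` for "external logic never responds" are assumptions made for illustration.

```python
def wait_states(mode, internal_count, external_count):
    """Return the number of wait states inserted in one bus cycle.

    mode           -- "internal", "external", "or", or "and" (illustrative names)
    internal_count -- wait states programmed in the internal wait counter
    external_count -- wait states before the external logic asserts ready,
                      or None if the external logic never responds
    Returns None when the cycle would never terminate.
    """
    if mode == "internal":
        return internal_count
    if mode == "external":
        return external_count
    if mode == "or":
        # Earlier of the two ready indications terminates the cycle.
        if external_count is None:
            return internal_count
        return min(internal_count, external_count)
    if mode == "and":
        # Later of the two terminates the cycle; both must occur.
        if external_count is None:
            return None
        return max(internal_count, external_count)
    raise ValueError(f"unknown mode: {mode!r}")
```

For example, `wait_states("or", 7, None)` returns 7, which corresponds to the safeguard against accesses to nonexistent memory, while `wait_states("and", 2, 5)` returns 5, the case where a slow device extends the cycle beyond the internal count.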

This feature is useful if there are devices in a system that are equipped to provide a ready signal but cannot respond quickly enough to meet the TMS320C30's timing requirements. In particular, if these devices normally indicate a ready condition and, when accessed, respond with a wait until they become ready, the logical AND of the two ready signals can be used to save hardware in the system. In this case, the internal wait counter can be used to provide wait states initially, and become ready after the external device has had time to send a not ready indication. The internal wait counter then remains ready until the external device also becomes ready, which terminates the cycle.

Additionally, the AND of the two ready signals may be used for extending the number of wait states for devices that already have external ready logic implemented but require additional wait states under certain unique circumstances.

In the implementation of external ready generation hardware, the particular technique employed depends heavily on the specific characteristics of the system. The optimum approach to ready generation varies depending on the relative number of wait state and non-wait state devices in the system and the maximum number of wait states required for any one device. The approaches discussed here are intended to be general enough for most applications, and are easily modifiable to comprehend many different system configurations.

In general, ready generation involves the following three functions:

1) Segmentation of the address space in some fashion to distinguish fast and slow devices.
2) Generating properly timed ready indications.
3) Logically ORing all of the separate ready timing signals together to connect to the physical ready input.

Segmentation of the address space is required so that a unique indication of each of the particular areas within the address space that require wait states can be obtained. This segmentation is commonly implemented in a system in the form of chip select generation. Chip select signals may be used to initiate wait states in many cases; however, chip select decoding considerations may occasionally provide signals that will not allow ready input timing requirements to be met. In this case, coarse address space segmentation may be made on the basis of a small number of address lines, where simpler gating allows signals to be generated more quickly. In either case, the signal indicating that a particular area of memory is being addressed is normally used to initiate a ready or wait state indication.

Once the region of address space being accessed has been established, a timing circuit of some sort is normally used to provide a ready indication to the processor at the appropriate point in the cycle to satisfy each device's unique requirements.

Finally, since indications of ready status from multiple devices are typically present, the signals are logically ORed using a single gate to drive the $\overline{\mathrm{RDY}}$ input.

One of two basic approaches may be taken in the implementation of ready control logic, depending on the state the ready input is to hold between accesses. If $\overline{\mathrm{RDY}}$ is low between accesses, the processor is always ready unless a wait state is required; if $\overline{\mathrm{RDY}}$ is high between accesses, the processor will always enter a wait state unless a ready indication is generated.

If $\overline{\text { RDY }}$ is low between accesses, control of full speed devices is straightforward; no action is necessary since ready is always active unless otherwise required. Devices requiring wait states, however, must drive ready high fast enough to meet the input timing requirements. Then, after an appropriate delay, a ready indication must be generated. This can be quite difficult in many circumstances since wait state devices are inherently slow and often require complex select decoding.

If $\overline{\mathrm{RDY}}$ is high between accesses, zero wait state devices, which tend to be inherently fast, can usually respond immediately with a ready indication. Wait state devices may simply delay their select signals appropriately to generate a ready. Typically, this approach results in the most efficient implementation of ready control logic. Figure 6 shows a circuit of this type which can be used to generate 0,1 , or 2 wait states for multiple devices in a system.

Figure 6. Circuit For Generation of 0, 1, or 2 Wait States for Multiple Devices


In this circuit, full speed devices drive ready directly through the '74AS21, and the two flip-flops delay wait state devices' select signals one or two H1 cycles to provide 1 or 2 wait states.
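The behavior of this kind of circuit can be illustrated with a cycle-by-cycle model. The sketch below is a behavioral approximation only: it assumes the select stays active for the whole access, that the flip-flops are cleared between accesses, and that ready is evaluated once per H1 cycle. It does not reproduce the gate-level wiring of Figure 6, and the function names are illustrative.

```python
def ready_sequence(device_class, cycles=4):
    """Return per-H1-cycle ready levels (True = ready) for one access.

    device_class -- 0, 1, or 2: the number of wait states the selected
                    device needs.
    """
    if device_class not in (0, 1, 2):
        raise ValueError("this circuit provides only 0, 1, or 2 wait states")
    stage1 = stage2 = False  # flip-flop outputs, assumed cleared between accesses
    levels = []
    for _ in range(cycles):
        zero_ws_ready = device_class == 0              # direct path to the ready gate
        one_ws_ready = device_class == 1 and stage1    # select delayed one H1 cycle
        two_ws_ready = device_class == 2 and stage2    # select delayed two H1 cycles
        levels.append(zero_ws_ready or one_ws_ready or two_ws_ready)
        # H1 clock edge: shift the active select through the two flip-flops.
        stage1, stage2 = True, stage1
    return levels


def wait_states_inserted(device_class):
    """Count the not-ready H1 cycles before the first ready indication."""
    return ready_sequence(device_class).index(True)
```

In this model a zero wait state device is ready in the first cycle, while one and two wait state devices become ready one and two H1 cycles later, which matches the delays attributed to the two flip-flops.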

Considering the TMS320C30's ready delay time of 8 ns following address, zero wait state devices must use ungated address lines directly to drive the input of the '74AS21, since this gate contributes a maximum propagation delay of 6 ns to the $\overline{\mathrm{RDY}}$ signal. Thus, zero wait state devices should be grouped together within a coarse segmentation of address space if other devices in the system require wait states.

With this circuit, devices requiring wait states may take up to 36 ns from a valid address on the TMS320C30 to provide signals at the '74AS20s' inputs. Typically, this allows sufficient time for any decoding required in generating select signals for slower devices in the system. For example, the 74ALS138, driven by address and $\overline{\mathrm{STRB}}$, can generate select decodes in 22 ns, which easily meets the TMS320C30's timing requirements.

With this circuit, unused inputs to either the 74AS20s or the 74AS21 should be tied to a logic high level to prevent noise from generating spurious wait states.

If more than 2 wait states are required by devices within a system, other approaches may be employed for ready generation. If between three and seven wait states are required, additional flip-flops may be included, in the same manner as shown in Figure 6, or internally generated wait states may be used in conjunction with external hardware. If greater than seven wait states are required, an external circuit using a counter may be used to supplement the internal wait-state generator's capabilities.

## Bank Switching Techniques

The TMS320C30's programmable bank switching feature can greatly ease system design when large amounts of memory are required. This feature is used to provide a period of time during which all device selects are disabled that would not normally be present otherwise. During this interval, slow devices are allowed time to turn off before other devices have the opportunity to drive the data bus, thus avoiding bus contention.

When bank switching is enabled, any time a portion of the high order address lines change, as defined by the contents of the BNKCMPR register, $\overline{\mathrm{STRB}}$ goes high for one full H1 cycle. Provided $\overline{\mathrm{STRB}}$ is included in chip select decodes, this causes all devices to be disabled during this period. The next bank of devices is not enabled until $\overline{\mathrm{STRB}}$ goes low again.

Bank switching is not required during writes since these cycles always exhibit an inherent one-half H1 cycle setup of address information before $\overline{\mathrm{STRB}}$ goes low. Thus, when using bank switching for read/write devices, a minimum of half of one H1 cycle of address setup is provided for all accesses. Therefore, large amounts of memory can be implemented without wait states or extra hardware required for isolation between banks. Also, note that access time for cycles during bank switching is the same as that of cycles without bank switching, and accordingly, full speed accesses may still be accomplished within each bank.

When using bank switching to implement large multiple-bank memory systems, an important consideration is address line fanout. Besides the parametric specifications that must be taken into account, AC characteristics are also crucial in memory system design. With large memory arrays, which commonly require large numbers of address line inputs to be driven in parallel, capacitive loading of address outputs is often quite large. Because all TMS320C30 timing specifications are guaranteed up to a capacitive load of 80 pF, driving greater loads will invalidate guaranteed AC characteristics. Therefore, it is often necessary to provide buffering for address lines when driving large memory arrays. AC timings for buffer performance may then be derated according to manufacturer specifications to accommodate a wide variety of memory array sizes.

The circuit shown in Figure 7 illustrates the use of bank switching with Cypress Semiconductor's CY7C185 25 ns $8\mathrm{K} \times 8$ CMOS static RAM. This circuit implements 32K 32-bit words of memory with one wait-state accesses within each bank.


A wait state is required with this implementation of bank memory because of the added propagation delay presented by the address bus buffers used in the circuit. The wait state is not a function of the fact that the memory is organized as multiple banks or the use of bank switching. When bank switching is used, memory access speeds are the same as without bank switching once bank boundaries are crossed. Therefore, no speed penalty is paid when using bank switching except for the occasional extra cycle inserted when bank boundaries are crossed. It should be noted, however, that if the extra cycle inserted when crossing bank boundaries does impact software performance significantly, code can often be restructured to minimize bank boundary crossings, thereby reducing the effect of these boundary crossings on software performance.

The wait state for this bank memory is generated using the wait state generator circuit presented in the previous section. Because A23 is the signal which enables the entire bank memory system, the inverted version of this signal is ANDed with $\overline{\text { STRB }}$ to derive a one wait state device select. This signal is then connected in the circuit along with the other one wait state device selects. Thus, any time a bank memory access is made, one wait state is generated.

Each of the four banks in this circuit is selected using a decode of A15-A13 generated by the 74AS138 (see Figure 8). With the BNKCMPR register set to 0Bh, the banks will be selected on even 8K-word boundaries starting at location 080A000h in external memory space.

Figure 8. Bank Memory Control Logic


The 74ALS2541 buffers used on the address lines are necessary in this design since the total capacitive load presented to each address line is a maximum of $20 \times 5 \mathrm{~pF}$ or 100 pF (bank memory plus zero wait-state static RAM), which exceeds the TMS320C30 rated capacitive loading of 80 pF. Using the manufacturer's derating curves for these devices at a load of 80 pF (the load presented by the bank memory) predicts propagation delays at the output of the buffers of a maximum of 16 ns. The access time of a read cycle within a bank of the memory is therefore the sum of the memory access time and the maximum buffer propagation delay, or $25+16=41 \mathrm{~ns}$, which, since it falls between 30 and 90 ns, requires one wait state on the TMS320C30.
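The loading and wait-state reasoning above reduces to two small calculations. The sketch below uses only the figures quoted in the text (5 pF per input, 80 pF rated load, 30 ns zero-wait limit, 90 ns one-wait limit). The assumption that every additional wait state extends the allowable access time by the same 60 ns step is an extrapolation from those two limits, and the names are illustrative.

```python
import math

MAX_RATED_LOAD_PF = 80      # TMS320C30 timing is guaranteed up to this load
ZERO_WAIT_ACCESS_NS = 30    # longest access time that needs no wait state
ONE_WAIT_ACCESS_NS = 90     # longest access time covered by one wait state
# Assumption: each further wait state adds the same step as the first one.
WAIT_STATE_STEP_NS = ONE_WAIT_ACCESS_NS - ZERO_WAIT_ACCESS_NS


def address_line_load_pf(inputs_driven, pf_per_input=5):
    """Total capacitive load on one address line."""
    return inputs_driven * pf_per_input


def needs_buffering(load_pf):
    """True when the load exceeds the rated capacitive loading."""
    return load_pf > MAX_RATED_LOAD_PF


def primary_bus_wait_states(access_ns):
    """Wait states needed on the primary bus for a given read access time."""
    if access_ns <= ZERO_WAIT_ACCESS_NS:
        return 0
    return math.ceil((access_ns - ZERO_WAIT_ACCESS_NS) / WAIT_STATE_STEP_NS)
```

With 20 inputs per address line the load is 100 pF, so buffering is needed, and the resulting 25 + 16 = 41 ns access time maps to one wait state, in agreement with the text.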

The 74ALS2541 buffers offer one additional system performance enhancement in that they include 25-ohm resistors in series with each individual buffer output. These resistors greatly improve the transient response characteristics of the buffers, especially when driving CMOS loads such as the memories used here. The effect of these resistors is to reduce the overshoot and ringing that are common when driving predominantly capacitive loads such as CMOS. The result is reduced noise and increased immunity to latchup in the circuit, which in turn results in a more reliable memory system. Having these resistors included in the buffers eliminates the need to put discrete resistors in the system, which is often required in high speed memory systems.

This circuit could not have been implemented without bank switching, since data output turn-on and turn-off delays would have caused bus conflicts. Here, the propagation delay of the 74AS138 is only involved during bank switches, where there is sufficient time between cycles to allow new chip selects to be decoded.

The timing of this circuit for read operations using bank switching is shown in Figure 9. With the BNKCMPR register set to 0Bh, when a bank switch occurs, the bank address on address lines A23-A13 is updated during the extra H1 cycle while $\overline{\mathrm{STRB}}$ is high. Then, after chip select decodes have stabilized, and the previously selected bank has disabled its outputs, $\overline{\mathrm{STRB}}$ goes low for the next read cycle. Further accesses occur at normal bus timings with one wait state as long as another bank switch is not necessary. Write cycles do not require bank switching due to the inherent address setup provided in their timings.
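The relationship between BNKCMPR and bank boundaries can be sketched as a comparison of the upper address bits. The model below assumes that BNKCMPR gives the number of most significant bits of the 24-bit address that are compared, which is consistent with a value of 0Bh selecting A23-A13 and 8K-word banks. It counts one extra H1 cycle for each read whose bank differs from that of the previous read, and it ignores writes and any other details of the real bus controller. All names are illustrative.

```python
ADDRESS_BITS = 24  # A23-A0 on the primary bus


def bank_of(address, bnkcmpr):
    """Return the bank number: the top `bnkcmpr` bits of a 24-bit address."""
    if not 0 <= bnkcmpr <= ADDRESS_BITS:
        raise ValueError("BNKCMPR out of range for this model")
    return (address & 0xFFFFFF) >> (ADDRESS_BITS - bnkcmpr)


def bank_size_words(bnkcmpr):
    """Number of words in each bank for a given BNKCMPR value."""
    return 1 << (ADDRESS_BITS - bnkcmpr)


def bank_switch_cycles(read_addresses, bnkcmpr):
    """Count the extra H1 cycles inserted for a sequence of read addresses."""
    extra = 0
    previous_bank = None
    for address in read_addresses:
        bank = bank_of(address, bnkcmpr)
        if previous_bank is not None and bank != previous_bank:
            extra += 1  # STRB held high for one H1 cycle at the boundary
        previous_bank = bank
    return extra
```

A routine such as `bank_switch_cycles` can be used to estimate how often a given access pattern crosses bank boundaries, which is the quantity to minimize when restructuring code to reduce the bank switching overhead.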

Figure 9. Timing For Read Operations Using Bank Switching


The timing for this interface is summarized in Table 1.

Table 1. Bank Switching Interface Timing

| Time Interval | Event | Time Period |
| :---: | :--- | :---: |
| $\mathrm{t}_{1}$ | H1 falling to address/STRB valid | 14 ns |
| $\mathrm{t}_{2}$ | Add to select delay | 10 ns |
| $\mathrm{t}_{3}$ | Memory disable from STRB | 10 ns |
| $\mathrm{t}_{4}$ | H1 falling to STRB | 10 ns |
| $\mathrm{t}_{6}$ | Memory output enable delay | 3 ns |

## Expansion Bus Interface

The TMS320C30's expansion bus interface provides a second complete parallel bus which can be used to implement data transfers concurrently with and independent of operations on the primary bus. The expansion bus comprises two mutually exclusive interfaces controlled by the $\overline{\mathrm{MSTRB}}$ and $\overline{\mathrm{IOSTRB}}$ signals, respectively. This section discusses interface to the expansion bus using $\overline{\mathrm{IOSTRB}}$ cycles; $\overline{\mathrm{MSTRB}}$ cycles are essentially equivalent in timing to primary bus cycles, and are discussed in the previous section.

Unlike the primary bus, both read and write cycles on the I/O portion of the expansion bus are two H1 cycles in duration and exhibit the same timing. The $\mathrm{XR} / \overline{\mathrm{W}}$ signal is high for reads and low for writes. Since I/O accesses take two cycles, many peripherals that require wait states if interfaced either to the primary bus or using $\overline{\mathrm{MSTRB}}$ may be used in a system without the need for wait states. Specifically, in cases where there is only one device on the expansion bus, devices with access times greater than the 30 ns required by the primary bus, but not more than 59 ns, can be interfaced to the I/O bus without wait states.

## A/D Converter Interface

$\mathrm{A} / \mathrm{D}$ and $\mathrm{D} / \mathrm{A}$ converters are components that are commonly required in DSP systems and interface efficiently to the I/O expansion bus. These devices are available in many speed ranges and with a variety of features, and while some may be used at full speed on the I/O bus, others may require one or more wait states.

Figure 10 shows an interface to an Analog Devices AD1678 analog to digital converter. The AD1678 is a 12-bit, $5 \mu$s converter allowing sample rates up to 200 kHz and with an input voltage range of 10 volts bipolar or unipolar. The converter is connected according to the manufacturer's specifications to provide 0 to +10 volt operation. This interface illustrates a common approach to connecting devices such as this to the TMS320C30. Note that the interface requires only a minimum amount of control logic.

Figure 10. Interface to AD1678 A/D Converter


The AD1678 is a very flexible converter and is configurable in a number of different operating modes. These operating modes include byte or word data format, continuous or non-continuous conversions, enabled or disabled chip select function, and programmable end of conversion indication. This interface utilizes 12-bit word data format, rather than byte format, to be compatible with the TMS320C30. Non-continuous conversions are selected, so that variable sample rates may be used, since continuous conversions occur only at a rate of 200 kHz. With non-continuous conversions, the host processor determines the conversion rate by initiating conversions through write operations to the converter.

The chip select function is enabled, so the chip select input is required to be active when accessing the device. Enabling the chip select function is necessary to allow a mechanism for the AD1678 to be isolated from other peripheral devices connected to the expansion bus. To establish the desired operating modes, the SYNC and $12 / \overline{8}$ inputs to the converter are pulled high and $\overline{\mathrm{EOCEN}}$ is grounded, as specified in the AD1678 data sheet.

In this application, the converter's chip select is driven by XA12, which maps this device at 804000h in I/O address space. Conversions are initiated by writing any data value to the device, and the conversion results are obtained by reading from the device after the conversion is completed. To generate the device's Start Conversion ($\overline{\mathrm{SC}}$) and Output Enable ($\overline{\mathrm{OE}}$) inputs, $\overline{\mathrm{IOSTRB}}$ is ANDed with $\mathrm{XR} / \overline{\mathrm{W}}$. Therefore, the converter is selected whenever XA12 is low, and $\overline{\mathrm{OE}}$ is driven when reads are performed, while $\overline{\mathrm{SC}}$ is driven when writes are performed.
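The control decode described above can be written as a small truth function. This is a minimal sketch of the described behavior, assuming signal levels are represented as 0 and 1 and that active-low signals carry an `_n` suffix. The function name and the returned dictionary are illustrative; the actual gates used are those shown in Figure 10.

```python
def ad1678_controls(xa12, iostrb_n, xr_w):
    """Decode the converter control levels for one bus state.

    xa12     -- level of expansion address line XA12 (0 or 1)
    iostrb_n -- level of the active-low IOSTRB strobe (0 = asserted)
    xr_w     -- level of XR/W (1 = read, 0 = write)
    Returns a dict of active-low levels for chip select (cs_n),
    output enable (oe_n), and start conversion (sc_n).
    """
    strobe_active = iostrb_n == 0
    return {
        "cs_n": xa12,                                        # selected when XA12 is low
        "oe_n": 0 if (strobe_active and xr_w == 1) else 1,   # asserted on reads
        "sc_n": 0 if (strobe_active and xr_w == 0) else 1,   # asserted on writes
    }
```

A read of the converter therefore asserts only output enable, and a write asserts only start conversion, so the data value written is irrelevant.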

As with many A/D converters, at the end of a read cycle the AD1678 data output lines enter a high impedance state. This occurs after the Output Enable ($\overline{\mathrm{OE}}$) or read control line goes inactive. Also common with these types of devices is that the data output buffers often require a substantial amount of time to actually attain a full high-impedance state. When used with the TMS320C30, devices must have their outputs fully disabled no later than 65 ns following the rising edge of $\overline{\mathrm{IOSTRB}}$, since the TMS320C30 will begin driving the data bus at this point if the next cycle is a write. If this timing is not met, bus conflicts between the TMS320C30 and the AD1678 may occur, potentially causing degraded system performance and even failure due to damaged data bus drivers. The actual disable time for the AD1678 can be as long as 80 ns; therefore, buffers are required to isolate the converter outputs from the TMS320C30. The buffers used here are 74LS244s that are enabled when the AD1678 is read, and turned off 30.8 ns following $\overline{\mathrm{IOSTRB}}$ going high. Therefore, the TMS320C30 requirement of 65 ns is met.
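The bus release requirement above is a simple comparison against the 65 ns limit. The sketch below applies it to the two disable times quoted in the text (80 ns for the converter alone, 30.8 ns with the buffers); the names are illustrative.

```python
BUS_RELEASE_LIMIT_NS = 65  # outputs must be off this soon after IOSTRB rises


def bus_release_margin(disable_time_ns, limit_ns=BUS_RELEASE_LIMIT_NS):
    """Return the slack (ns) between the limit and the device disable time."""
    return limit_ns - disable_time_ns


def needs_isolation_buffer(disable_time_ns, limit_ns=BUS_RELEASE_LIMIT_NS):
    """True when the device cannot release the bus within the limit."""
    return bus_release_margin(disable_time_ns, limit_ns) < 0
```

The converter's worst-case 80 ns disable time fails this check, while the 30.8 ns turn-off of the buffered path passes it with a margin of roughly 34 ns.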

When data is read following a conversion, the AD1678 takes 100 ns after its $\overline{\mathrm{OE}}$ control line is asserted to provide valid data at its outputs. Thus, including the propagation delay of the 74LS244 buffers, the total access time for reading the converter is 118 ns . This requires two wait states on the TMS320C30 expansion I/O bus.

The two wait states required in this case are implemented using software wait states; however, depending on the overall system configuration, it may be necessary to implement a separate wait state generator for the expansion bus (refer to the section on ready generation). This would be the case if there were multiple devices that required different numbers of wait states connected to the expansion bus.

Figure 11 shows the timing for read operations between the TMS320C30 and the AD1678. At the beginning of the cycle, the address and $\mathrm{XR} / \overline{\mathrm{W}}$ lines become valid $\mathrm{t}_{1}=10 \mathrm{~ns}$ following the falling edge of $\mathrm{H}_{1}$. Then, $\mathrm{t}_{2}=10 \mathrm{~ns}$ after the next rising edge of $\mathrm{H}_{1}$, $\overline{\text { IOSTRB }}$ goes low, beginning the active portion of the read cycle. After $\mathrm{t}_{3}=5.8 \mathrm{~ns}$, the control logic propagation delay, the $\overline{\mathrm{IOR}}$ signal goes low, asserting the $\overline{\mathrm{OE}}$ input to the AD1678. The '74LS244 buffers take $\mathrm{t}_{4}=30 \mathrm{~ns}$ to enable their outputs, and then, following the converter's access delay and the buffer propagation delay ( $\mathrm{t}_{5}=100+18=118 \mathrm{~ns}$ ), data is provided to the TMS320C30. This provides approximately 46 ns of data setup before the rising edge of $\overline{\text { IOSTRB }}$. Therefore, this design easily satisfies the TMS320C30's requirement of 15 ns of data setup time for reads.

Figure 11. Read Operations Timing Between the TMS320C30 and AD1678


Unlike the primary bus, read and write cycles on the I/O expansion bus are timed the same with the exception that $\mathrm{XR} / \overline{\mathrm{W}}$ is high for reads and low for writes and that the data bus is driven by the TMS320C30 during writes. When writing to the AD1678, the '74LS244 buffers do not turn on and no data is transferred. The purpose of writing to the converter is only to generate a pulse on the converter's $\overline{\mathrm{SC}}$ input, which initiates a conversion cycle. When a conversion cycle is completed, the AD1678's EOC output is used to generate an interrupt on the TMS320C30 to indicate that the converted data may be read.

It should be noted that for different applications, use of TLC1225 or TLC1550 A/D converters from Texas Instruments may be beneficial. The TLC1225 is a self-calibrating 12-bit-plus-sign bipolar or unipolar converter which features $10 \mu \mathrm{s}$ conversion times. The TLC1550 is a 10-bit, $6 \mu \mathrm{s}$ converter with a high-speed DSP interface. Both converters are parallel-interface devices.

## D/A Converter Interface

In many DSP systems, the requirement for generating an analog output signal is a natural consequence of sampling an analog waveform with an $\mathrm{A} / \mathrm{D}$ converter and then processing the signal digitally. Interfacing D/A converters to the TMS320C30 on the expansion I/O bus is also quite straightforward.

As with $\mathrm{A} / \mathrm{D}$ converters, $\mathrm{D} / \mathrm{A}$ converters are available in a number of varieties. One of the major distinctions between various types of $\mathrm{D} / \mathrm{A}$ converters is whether or not the converter includes latches to store the digital value to be converted to an analog quantity, along with the interface to control those latches. When latches and control logic are included with the converter, interface design is often simplified. However, internal latches are often included only in slower D/A converters.

Because slower converters limit signal bandwidths, the converter chosen for this design was selected to allow a reasonably wide range of signal frequencies to be processed, in addition to illustrating the technique of interfacing to a converter using external data latches.

Figure 12 shows an interface to an Analog Devices AD565A digital-to-analog converter. This device is a 12-bit, 250 ns current-output DAC with an on-board 10-volt reference. Using an offboard current-to-voltage conversion circuit connected according to the manufacturer's specifications, the converter exhibits an output signal range of 0 to +10 volts, which is compatible with the conversion range of the $\mathrm{A} / \mathrm{D}$ converter discussed in the previous section.

Figure 12. Interface Between the TMS320C30 and the AD565A


Because this DAC essentially performs continuous conversions based on the digital value provided at its inputs, periodic sampling is maintained by periodically updating the value stored in the external latches. Therefore, between sample updates, the digital value is stored and maintained at the latch outputs that provide the input to the DAC. This results in the analog output remaining stable until the next sample update is performed.

The external data latches used in this interface are '74LS377 devices that have both clock and enable inputs. These latches serve as a convenient interface with the TMS320C30; the enable inputs provide a device select function, and the clock inputs latch the data. Therefore, with the enable input driven by inverted XA12 and the clock input driven by IOW, which is the AND of $\overline{\text { IOSTRB }}$ and $\mathrm{XR} / \overline{\mathrm{W}}$, data will be stored in the latches when a write is performed to I/O address 805000h. Reading this address has no effect on the circuit.

Figure 13 shows a timing diagram of a write operation to the $\mathrm{D} / \mathrm{A}$ converter latches.

Figure 13. Write Operation to the D/A Converter Timing Diagram


Because the write is actually being performed to the latches, the key timings for this operation are the timing requirements for these devices. For proper operation, these latches simply require minimal setup and hold times of data and control signals with respect to the rising edge of the clock input. Specifically, the latches require a data setup time of 20 ns, enable setup of 25 ns, disable setup of 10 ns, and data and enable hold times of 5 ns. This design provides approximately 60 ns of enable setup, 30 ns of data setup, and 7.2 ns of data hold time. Therefore, the setup and hold times provided by this design are well in excess of those required by the latches. The key timing parameters for this interface are summarized in Table 2.

Table 2. Key Timing Parameters for D/A Converter Write Operation

| Time Interval | Event | Time Period |
| :---: | :--- | :---: |
| $\mathrm{t}_{1}$ | H1 falling to address valid | 10 ns |
| $\mathrm{t}_{2}$ | XA12 to $\overline{\mathrm{XA12}}$ (inverted XA12) delay | 5 ns |
| $\mathrm{t}_{3}$ | H1 rising to IOSTRB falling | 10 ns |
| $\mathrm{t}_{4}$ | IOSTRB to IOW delay | 5.8 ns |
| $\mathrm{t}_{5}$ | Data setup to IOW | 30 ns |
| $\mathrm{t}_{6}$ | Data hold from IOW | 7.2 ns |
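The margins implied by these figures can be tabulated with a few lines of arithmetic. The following is only a sketch that uses the setup and hold values quoted in the text; the dictionary and function names are hypothetical and chosen for illustration.

```python
# Timing figures quoted in the text for the '74LS377 latches, in nanoseconds.
REQUIRED_NS = {"data setup": 20.0, "enable setup": 25.0, "data hold": 5.0}
PROVIDED_NS = {"data setup": 30.0, "enable setup": 60.0, "data hold": 7.2}


def timing_margins(required, provided):
    """Return the provided-minus-required margin for each parameter, in ns."""
    return {name: round(provided[name] - required[name], 3) for name in required}


margins = timing_margins(REQUIRED_NS, PROVIDED_NS)
for name, margin in margins.items():
    status = "ok" if margin >= 0 else "VIOLATION"
    print(f"{name:13s} margin = {margin:5.1f} ns  {status}")
```

A check of this kind makes it easy to see which parameter has the least slack (here, data hold) if a different latch family or clock rate is substituted.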

## System Control Functions

There are several aspects of TMS320C30 system hardware design that are critical to overall system operation. These include such functions as clock and reset signal generation and interrupt control.

## Clock Oscillator Circuitry

An input clock may be provided to the TMS320C30 either from an external clock input or by using the on-board oscillator. Unless special clock requirements exist, using the on-board oscillator is generally a convenient method of clock generation. This method requires few external components and can provide stable, reliable clock generation for the device.

Figure 14 shows a clock generator circuit using the internal oscillator. This circuit is designed to operate at 33.33 MHz. Because crystals with fundamental oscillation frequencies of 30 MHz and above are not readily available, a parallel-resonant third-overtone circuit is used.

Figure 14. Crystal Oscillator Circuit


In a third-overtone oscillator, the crystal fundamental frequency must be attenuated so that oscillation is at the third harmonic. This is achieved with an LC circuit that filters out the fundamental, thus allowing oscillation at the third harmonic. The impedance of the LC circuit must be inductive at the crystal fundamental and capacitive at the third harmonic. The impedance of the LC circuit is given by:

$$
\begin{equation*}
z(\omega)=\frac{L / C}{j\left[\omega L-\dfrac{1}{\omega C}\right]} \tag{1}
\end{equation*}
$$

Therefore, the LC circuit has a pole at:

$$
\begin{equation*}
\omega_{\mathrm{p}}=\frac{1}{\sqrt{\mathrm{LC}}} \tag{2}
\end{equation*}
$$

At frequencies significantly lower than $\omega_{\mathrm{p}}$, the $1 /(\omega \mathrm{C})$ term in (1) becomes the dominating term, while $\omega \mathrm{L}$ can be neglected. This gives:

$$
\begin{equation*}
\mathrm{z}(\omega)=\mathrm{j} \omega \mathrm{~L} \text { for } \omega<\omega_{\mathrm{p}} \tag{3}
\end{equation*}
$$

In (3), the LC circuit appears inductive at frequencies lower than $\omega_{\mathrm{p}}$. On the other hand, at frequencies much higher than $\omega_{\mathrm{p}}$, the $\omega \mathrm{L}$ term is the dominant term in (1), and $1 /(\omega \mathrm{C})$ can be neglected. This gives:

$$
\begin{equation*}
z(\omega)=\frac{1}{j \omega C} \text { for } \omega>\omega_{p} \tag{4}
\end{equation*}
$$

The LC circuit in (4) appears increasingly capacitive as frequency increases above $\omega_{\mathrm{p}}$. This is shown in Figure 15, which is a plot of the magnitude of the impedance of the LC circuit of Figure 14 versus frequency.

Figure 15. Magnitude of the Impedance of the Oscillator LC Network


Based on the discussion above, the design of the LC circuit proceeds as follows:

1) Choose the pole frequency $\omega_{p}$ approximately halfway between the crystal fundamental and the third harmonic.
2) The circuit now appears inductive at the fundamental frequency and capacitive at the third harmonic.

In the oscillator of Figure 14, choose $\omega_{\mathrm{p}}=22.2 \mathrm{MHz}$, which is approximately halfway between the fundamental and the third harmonic. Choose $\mathrm{C}=20 \mathrm{pF}$. Then, using (2), $\mathrm{L}=2.6 \mu \mathrm{H}$.
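The component values can be checked numerically. The sketch below solves (2) for L and evaluates (1) at the crystal fundamental and at the third harmonic. It assumes that the quoted 22.2 MHz is a frequency in hertz (so that $\omega_{\mathrm{p}}=2 \pi \times 22.2 \mathrm{MHz}$) and that the fundamental is one third of 33.33 MHz; the function names are hypothetical.

```python
import math


def tank_inductance(f_pole_hz, c_farads):
    """Solve omega_p = 1/sqrt(L*C) for L, taking omega_p = 2*pi*f_pole."""
    omega_p = 2.0 * math.pi * f_pole_hz
    return 1.0 / (omega_p ** 2 * c_farads)


def tank_impedance(f_hz, l_henries, c_farads):
    """Impedance of the LC network as written in equation (1)."""
    omega = 2.0 * math.pi * f_hz
    reactance_term = omega * l_henries - 1.0 / (omega * c_farads)
    return (l_henries / c_farads) / (1j * reactance_term)


F_THIRD = 33.33e6          # desired oscillation frequency (third harmonic)
F_FUND = F_THIRD / 3.0     # assumed crystal fundamental
F_POLE = 22.2e6            # pole frequency chosen in the text
C = 20e-12                 # capacitor value chosen in the text

L = tank_inductance(F_POLE, C)
z_fund = tank_impedance(F_FUND, L, C)
z_third = tank_impedance(F_THIRD, L, C)

print(f"L = {L * 1e6:.2f} uH")
print(f"fundamental:    reactance {z_fund.imag:+.0f} ohms (positive = inductive)")
print(f"third harmonic: reactance {z_third.imag:+.0f} ohms (negative = capacitive)")
```

The sign of the imaginary part confirms the design intent: positive (inductive) at the fundamental and negative (capacitive) at the third harmonic.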

## Reset Signal Generation

The reset input controls initialization of internal TMS320C30 logic and also causes execution of the system initialization software. For proper system initialization, the reset signal must be applied for at least ten H1 cycles, i.e., 600 ns for a TMS320C30 operating at 33.33 MHz. Upon powerup, however, it can take 20 ms or more before the system oscillator reaches a stable operating state. Therefore, the powerup reset circuit should generate a low pulse on the reset line for 100 to 200 ms. Once a proper reset pulse has been applied, the processor fetches the reset vector from location zero, which contains the address of the system initialization routine. Figure 16 shows a circuit that will generate an appropriate powerup reset signal.

Figure 16. Reset Circuit


The voltage on the reset pin ( $\overline{\mathrm{RESET}}$ ) is controlled by the $\mathrm{R}_{1} \mathrm{C}_{1}$ network. After a reset, this voltage rises exponentially according to the time constant $\mathrm{R}_{1} \mathrm{C}_{1}$, as shown in Figure 17.

Figure 17. Voltage on the TMS320C30 Reset Pin.


The duration of the low pulse on the reset pin is approximately $\mathrm{t}_{1}$, which is the time it takes for the capacitor $\mathrm{C}_{1}$ to be charged to 1.5 V . This is approximately the voltage at which the reset input switches from a logic 0 to a logic 1 . The capacitor voltage is given by:

$$
\begin{equation*}
\mathrm{V}=\mathrm{V}_{\mathrm{CC}}\left[1-\mathrm{e}^{-\mathrm{t} / \tau}\right] \tag{5}
\end{equation*}
$$

where $\tau=\mathrm{R}_{1} \mathrm{C}_{1}$ is the reset circuit time constant. Solving (5) for t gives:

$$
\begin{equation*}
\mathrm{t}=-\mathrm{R}_{1} \mathrm{C}_{1} \ln \left[1-\frac{\mathrm{V}}{\mathrm{V}_{\mathrm{CC}}}\right] \tag{6}
\end{equation*}
$$

Setting the following:

$$
\begin{aligned}
& \mathrm{R}_{1}=100 \mathrm{k} \Omega \\
& \mathrm{C}_{1}=4.7 \mu \mathrm{~F} \\
& \mathrm{~V}_{\mathrm{CC}}=5 \mathrm{~V} \\
& \mathrm{~V}=\mathrm{V}_{1}=1.5 \mathrm{~V}
\end{aligned}
$$

gives $\mathrm{t}=167 \mathrm{~ms}$. Therefore, the reset circuit of Figure 16 provides a low pulse of long enough duration to ensure the stabilization of the system oscillator.
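Equation (6) is easy to evaluate for other component choices. The following sketch reproduces the result above and compares it with the two constraints stated earlier (ten H1 cycles and the 20 ms oscillator start-up time). The function and constant names are hypothetical, and the 60 ns H1 period is inferred from the 600 ns figure quoted in the text.

```python
import math


def reset_low_time(r_ohms, c_farads, v_cc, v_threshold):
    """Time for the RC node to charge from 0 V to v_threshold, equation (6)."""
    return -r_ohms * c_farads * math.log(1.0 - v_threshold / v_cc)


H1_PERIOD_S = 60e-9                 # assumed: 600 ns / 10 H1 cycles
MIN_RESET_S = 10 * H1_PERIOD_S      # minimum reset width from the text
OSC_STARTUP_S = 20e-3               # oscillator stabilization time from the text

t_reset = reset_low_time(r_ohms=100e3, c_farads=4.7e-6, v_cc=5.0, v_threshold=1.5)

print(f"reset low time   = {t_reset * 1e3:.1f} ms")
print(f"minimum required = {MIN_RESET_S * 1e9:.0f} ns")
print(f"covers oscillator start-up: {t_reset > OSC_STARTUP_S}")
```

The evaluated value agrees with the 167 ms quoted above to within rounding. Because the result scales linearly with both $\mathrm{R}_{1}$ and $\mathrm{C}_{1}$, component tolerances map directly onto the pulse width.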

Note that if synchronization of multiple TMS320C30s is required, all processors should be provided with the same input clock and the same reset signal. After powerup, when the clock has stabilized, all processors may then be synchronized by generating a falling edge on the common reset signal. Because it is the falling edge of reset that establishes synchronization, reset must initially be high for a period of time (at least ten H1 cycles). Following the falling edge, reset should remain low for at least ten H1 cycles and then be driven high. This sequencing of reset may be accomplished using additional circuitry based on either RC time delays or counters.

## Serial Port Interface to AIC

For applications such as modems, speech, control, instrumentation, and analog interface for DSPs, a complete analog-to-digital (A/D) and digital-to-analog (D/A) input/output system on a single chip may be desired. The TLC32044 analog interface circuit (AIC) integrates on a single monolithic CMOS chip a bandpass switched-capacitor antialiasing input filter, 14-bit-resolution A/D and D/A converters, and a lowpass switched-capacitor output-reconstruction filter. The TLC32044 offers numerous combinations of master clock input frequencies and conversion/sampling rates, which can be changed via digital processor control.

Four serial port modes on the TLC32044 allow direct interface to TMS320C30 processors. When the transmit and receive sections of the AIC are operating synchronously, it can interface to two SN54299 or SN74299 serial-to-parallel shift registers. These shift registers can then interface in parallel to the TMS320C30, other TMS320 digital processors, or to external FIFO circuitry. Output data pulses are emitted to inform the processor that data transmission is complete or to allow the DSP to differentiate between two transmitted bytes. A flexible control scheme is provided so that the functions of the AIC can be selected and adjusted coincidentally with signal processing via software control. Refer to the TLC32044 data sheet for detailed information.

When interfacing the AIC to the TMS320C30 via one of the serial ports, no additional logic is required. This interface is shown in Figure 18. The serial data, control, and clock signals connect directly between the two devices, and the AIC's master clock input is driven from TCLK0, one of the TMS320C30's internal timer outputs. The AIC's WORD/BYTE input is pulled high, selecting 16-bit serial port transfers to optimize the serial port data transfer rate. The TMS320C30's XF0 pin, configured as an output, is connected to the AIC's reset ( $\overline{\mathrm{RST}}$ ) input to allow the AIC to be reset by the TMS320C30 under program control. This allows the TMS320C30 timer and serial port to be initialized before beginning conversions on the AIC.

Figure 18. AIC to TMS320C30 Interface


To provide the master clock input for the AIC, the TCLK0 timer is configured to generate a clock signal with a $50 \%$ duty cycle at a frequency of H1/4, or 4.167 MHz. To accomplish this, the timer 0 global control register is set to the value 3C1h, which establishes the desired operating modes. The timer 0 period register is set to 1, which sets the required division ratio for the H1 clock.

To communicate properly with the AIC, the TMS320C30 serial port must be configured appropriately. To configure the serial port, several TMS320C30 registers and memory locations must be initialized. First, the serial port should be reset by setting the serial port global control register to 2170300h. (The AIC should also be reset at this time; see the description below of resetting the AIC using XF0.) This resets the serial port logic, configures the serial port operating modes (including data transfer lengths), and enables the serial port interrupts. It also configures another important aspect of serial port operation: the polarity of the serial port signals. Because the active polarity of all serial port signals is programmable, it is critical that the bits in the serial port global control register that control polarity be set appropriately. In this application, all polarities are set to positive except FSX and FSR, which are driven by the AIC and are true low.

The serial port transmit and receive control registers must also be initialized for proper serial port operation. In this application, both of these registers are set to 111 h , which configures all of the serial port pins in the serial port mode, rather than the general purpose digital I/O mode.

With the operations described above completed, interrupts are enabled, and provided the serial port interrupt vector(s) are properly loaded, serial port transfers may begin after the serial port is taken out of reset. This is accomplished by loading E170300h into the global control register.
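Comparing the two global control register values quoted above shows exactly which bits release the serial port from reset. The sketch below only performs that bitwise comparison; the interpretation of the changed bits as the transmit and receive reset controls is an assumption that should be confirmed against the Third-Generation TMS320 User's Guide. All names in the code are hypothetical.

```python
# Values quoted in the text for the serial port global control register.
SPORT_GLOBAL_IN_RESET = 0x2170300   # written first: port configured, held in reset
SPORT_GLOBAL_RUNNING = 0xE170300    # written last: same configuration, reset released


def changed_bits(before, after):
    """Return the bit positions (0 = LSB) that differ between two register values."""
    diff = before ^ after
    return [bit for bit in range(32) if diff & (1 << bit)]


bits = changed_bits(SPORT_GLOBAL_IN_RESET, SPORT_GLOBAL_RUNNING)
print("bits that change:", bits)

# Every other field (transfer length, polarity, interrupt enables) is untouched.
mask = sum(1 << bit for bit in bits)
print("configuration preserved:",
      (SPORT_GLOBAL_IN_RESET & ~mask) == (SPORT_GLOBAL_RUNNING & ~mask))
```

Only two high-order bits differ between the values, so the second write leaves the previously programmed operating modes intact.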

To begin conversion operations on the AIC and subsequent transfers of data on the serial port, the AIC is first reset by setting XF0 to zero at the beginning of the TMS320C30 initialization routine. Setting XF0 to zero is accomplished by setting the TMS320C30 IOF register to 2. This sets the AIC to a default configuration and halts serial port transfers and conversion operations until reset is set high. Once the TMS320C30 serial port and timer have been initialized as described above, XF0 is set high by setting the IOF register to 6. This allows the AIC to begin operating in its default configuration, which in this application is the desired mode. In this mode, all internal filtering is enabled, the sample rate is set at approximately 6.4 kHz, and the transmit and receive sections of the device are configured to operate synchronously. Conveniently, this mode of operation is appropriate for a variety of applications, and if a 5.184 MHz master clock input is used, the default configuration results in an 8 kHz sample rate, which makes this device ideal for speech and telecommunications applications.
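The two sample rates quoted above are consistent with a fixed division ratio between the master clock and the conversion rate in the default configuration. As a rough check, the sketch below derives that ratio from the 5.184 MHz / 8 kHz pair given in the text and applies it to the H1/4 master clock used in this design. It assumes H1 is half the 33.33 MHz input clock (consistent with the 4.167 MHz figure); the names are hypothetical, and the TLC32044 data sheet remains the authority on the actual counter settings.

```python
CLKIN_HZ = 33.33e6
H1_HZ = CLKIN_HZ / 2.0          # assumed: H1 runs at half the input clock
MCLK_HZ = H1_HZ / 4.0           # TCLK0 output described in the text (about 4.167 MHz)

# Division ratio implied by the text: 5.184 MHz master clock -> 8 kHz sampling.
DEFAULT_DIVIDE_RATIO = 5.184e6 / 8000.0


def aic_default_sample_rate(mclk_hz):
    """Sample rate in the AIC default configuration for a given master clock."""
    return mclk_hz / DEFAULT_DIVIDE_RATIO


fs = aic_default_sample_rate(MCLK_HZ)
print(f"master clock = {MCLK_HZ / 1e6:.3f} MHz")
print(f"divide ratio = {DEFAULT_DIVIDE_RATIO:.0f}")
print(f"sample rate  = {fs / 1e3:.2f} kHz")
```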

In addition to the benefit of a convenient default operating configuration, the AIC can also be programmed for a wide variety of other operating configurations. Sample rates and filter characteristics may be varied, in addition to which, numerous connections in the device may be configured to establish different internal architectures, by enabling or disabling various functional blocks.

To configure the AIC in a fashion different from the default state, the device must first be sent a serial data word with the two LSBs set to one. The two LSBs of a transmitted data word are not part of the transferred data information and are not set to one during normal operation. This condition indicates that the next serial transmission will contain secondary control information, not data. This information is then used to load various internal registers and specify internal configuration options. There are four different types of secondary control words distinguished by the state of the two LSBs of the control information transferred. Note that each secondary control word transferred must be preceded by a data word with the two LSBs set to one.
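The primary/secondary sequence can be illustrated with a small word-packing sketch. It assumes that the 14-bit sample occupies the upper 14 bits of the 16-bit word in two's-complement form and that the two LSBs are zero in normal operation; only the "both LSBs set to one" request code comes from the text. The function names are hypothetical, and the contents of the secondary control word itself are left to the TLC32044 data sheet.

```python
SECONDARY_REQUEST = 0b11   # two LSBs set to one: next word is control information


def primary_word(sample, request_secondary=False):
    """Pack a signed 14-bit sample into bits 15..2 of a 16-bit serial word."""
    if not -(1 << 13) <= sample < (1 << 13):
        raise ValueError("sample does not fit in 14 bits")
    word = (sample & 0x3FFF) << 2
    if request_secondary:
        word |= SECONDARY_REQUEST
    return word


def transmit_sequence(sample, control_word=None):
    """Return the 16-bit words to send: data only, or data followed by control."""
    if control_word is None:
        return [primary_word(sample)]
    return [primary_word(sample, request_secondary=True), control_word & 0xFFFF]


print([hex(w) for w in transmit_sequence(100)])
print([hex(w) for w in transmit_sequence(100, control_word=0x1234)])
```

Because every secondary control word must be preceded by a flagged data word, pairing the two in one helper avoids sending control information without the request.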

The TMS320C30 can communicate with the AIC either synchronously or asynchronously, depending on the information in the control register. The operating sequence for synchronous communication with the TMS320C30, shown in Figure 19, is as follows:

1) The $\overline{\mathrm{FSX}}$ or $\overline{\mathrm{FSR}}$ pin is brought low.
2) One 16-bit word is transmitted or one 16-bit word is received.
3) The $\overline{\mathrm{FSX}}$ or $\overline{\mathrm{FSR}}$ pin is brought high.
4) The EODX or EODR pin emits a low-going pulse.

Figure 19. Synchronous Timing of TLC32044 to TMS320C30


For asynchronous communication, the operating sequence is similar, but $\overline{\mathrm{FSX}}$ and $\overline{\mathrm{FSR}}$ do not occur at the same time (see Figure 20). After each receive and transmit operation, the TMS320C30 asserts an internal receive (RINT) and transmit (XINT) interrupt, which may be used to control program execution.

Figure 20. Asynchronous Timing of TLC32044 to TMS320C30


## XDS1000 Target Design Considerations

The TMS320C30 Emulator is an eXtended Development System (XDS1000) which has all the features necessary for full-speed emulation. The TMS320C30 uses a revolutionary technology to allow complete emulation via a serial scan path. If users provide a 12-pin header on their target system, real-time emulation can be performed using the TMS320C30 in their target system.

To use the emulation connector of the XDS1000, the signals shown in Figure 21 should be provided to a 12-pin header (two rows of six pins) with pin 8 cut out to provide keying. Table 3 describes the pins and signals present on the header.

Figure 21. 12 Pin Header Signals

| Header Dimensions: |
| :--- |
| Pin-to-pin spacing: $\quad 0.100$ inches $(X, Y)$ |
| Pin width: 0.025 inches square post |
| Pin length: 0.235 inches nominal |


| Signal | Pin | Pin | Signal |
| :---: | :---: | :---: | :---: |
| EMU1 ${ }^{\dagger}$ | 1 | 2 | GND |
| EMU0 ${ }^{\dagger}$ | 3 | 4 | GND |
| EMU2 ${ }^{\dagger}$ | 5 | 6 | GND |
| PD (+5 V) | 7 | 8 | NO PIN (KEY) |
| EMU3 | 9 | 10 | GND |
| H3 | 11 | 12 | GND |

Table 3. Signal Description

| Signal Name | Description |
| :--- | :--- |
| EMU0 | Emulation pin 0. |
| EMU1 | Emulation pin 1. |
| EMU2 | Emulation pin 2. |
| EMU3 | Emulation pin 3. |
| H3 | TMS320C30 H3. |
| GND | Ground. |
| PD | Presence detect. It indicates that the cable is connected and the target system is powered up. It should be tied to +5 volts in the target system. |

In addition to the signals required at the emulation connector, the EMU4 through EMU6 signals on the TMS320C30 must also be appropriately connected to ensure proper emulation operation. The EMU4 signal must be tied to +5 volts, and EMU5 and EMU6 must be left unconnected. Also, the RSV0 through RSV10 signals must be tied to +5 volts as described in the Third-Generation TMS320 User's Guide (literature number SPRU031).

## Summary

The TMS320C30 is a high-performance 32-bit floating-point digital signal processor. Its dual parallel-interface busses and serial ports, along with a wide variety of additional support interfaces make the device an extremely flexible system-level DSP microprocessor. Using the techniques described in this report, the TMS320C30 can be used to implement sophisticated signal processing applications with the high precision and dynamic range provided by 32 -bit floating-point arithmetic.

This application report has described the use of external interfaces on the TMS320C30 to connect it to memories, $\mathrm{A} / \mathrm{D}$ and $\mathrm{D} / \mathrm{A}$ converters, and numerous other peripheral devices, as well as the generation of wait states and other system functions.

The interfaces described in this report have all been built and tested to verify proper operation, and the techniques described can be extended to encompass design of more complex systems.

## TMS320C30-IEEE Floating-Point Format Converter

Randy Restle, Regional Technology Center, Waltham, MA<br>Adam Cron, Digital Signal Processor Products-Semiconductor Group<br>Texas Instruments

## Introduction

Certain applications require the exceptionally high arithmetic throughput inherent in the TMS320C30 Digital Signal Processor but must use the IEEE floating-point number format, which differs from the TMS320C30's number format. The TMS320C30 uses a 2's complement format for the mantissa and exponent. Besides making the device more compatible with analog to digital converters, it is computationally more efficient in both speed and die size than the IEEE format. Applications requiring the IEEE format can benefit from the use of a custom chip for this conversion. For this reason, a chip has been designed, built, and tested. This report describes that chip.

The TMS320C30-IEEE Floating-Point Number Format Converter is a peripheral that performs floating-point number conversions between the native format of the TMS320C30 and the Single-Precision IEEE Standard 754-1985. This conversion is performed in hardware and can convert an incoming (IEEE-formatted) or outgoing (TMS320C30-formatted) floating-point number in less than one TMS320C30 instruction cycle. Normally, the part is placed between memory and the TMS320C30.

This peripheral has two operating modes.

- Mode 1 does not pipeline any data through the chip. Instead, one wait state is automatically generated to compensate for the converter's propagation delays. This mode is equivalent in performance to equipping the TMS320C30 with a single-cycle convert instruction. In those applications where speed is of utmost importance, the pipeline mode is provided.
- Mode 2 enables the converter's built-in pipeline.

Because propagation delays through the chip reduce the access time required for TMS320C30 external memory, the pipeline mode allows conversions to take place on one data value while a previously converted value is being read, or written, by the TMS320C30. Depending on the TMS320C30 instruction cycle time and the access time of memories being used, the pipeline mode can eliminate degradation in TMS320C30 throughput entirely. However, it should be noted that values fed through the pipeline appear at the output in the next cycle. Therefore, an extra read or write (i.e., the same operation that was being performed) must be performed to flush the pipeline. Consequently, when pipeline mode is used, data values and their addresses are skewed from one another. This mode is intended for high-speed block transfer/conversion, and the address skew should be acceptable.
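The one-cycle skew and the extra flushing access can be modeled in a few lines. The sketch below is a behavioral illustration only, not a description of the chip's internal logic: the class and function names are hypothetical, and the conversion itself is replaced by an arbitrary placeholder function.

```python
class PipelinedConverter:
    """Behavioral model of a one-stage pipeline: each access returns the
    result of the value presented on the previous access."""

    def __init__(self, convert):
        self._convert = convert
        self._stage = None            # holds the previously converted value

    def access(self, value):
        previous, self._stage = self._stage, self._convert(value)
        return previous


def convert_block(converter, values):
    """Push a block through the pipeline, adding one extra access to flush it."""
    if not values:
        return []
    outputs = [converter.access(v) for v in values]
    outputs.append(converter.access(values[-1]))   # extra access flushes the pipe
    return outputs[1:]                              # first output predates the block


def placeholder(x):
    """Stand-in for the real format conversion."""
    return x * 2


block = [1, 2, 3]
print(convert_block(PipelinedConverter(placeholder), block))
```

For a block of N values, N + 1 accesses are issued and the result for element k arrives on access k + 1, which is the address skew described above.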

All control signals to and from the converter are compatible with TMS320C30 signals so that no extra circuitry is required to use this chip. In fact, it has been designed to appear as much as possible like a simple bus transceiver (e.g., SN74LS245). Consequently, it has two data buses. Data bus A (pins DA31 through DA0) should be connected directly to one of the TMS320C30's data buses. The converter's output enable pin ( $\overline{\mathrm{OE}}$ ) can be tied to either $\overline{\text { STRB }}$ or $\overline{\mathrm{MSTRB}}$ of the TMS320C30, depending on where in the TMS320C30 memory map IEEE numbers are stored.

## Key Features

This device is designed to fit into systems equipped with TMS320C30 external memory into which IEEE formatted numbers are stored. Below is a list of some specific features of the TMS320C30-IEEE Floating-Point Converter:

- Automatic wait-state generation during conversions
- Automatic interrupt generation when IEEE NaNs are encountered
- Automatic pipeline mode for single-cycle conversions
- Built-in SCOPE (i.e., JTAG) testability logic


## Report Overview

- External Interfaces - Describes the external interfaces of this chip, the pinout, and pins.
- Architectural Overview - Describes the functions of the converter. Gives an overview of the TMS320C30 and IEEE Standard 754-1985 number formats and the scope of numbers that can be converted.
- Converter Operating Modes - Describes the converter's operating modes.
- Interrupts - Describes the Not a Number interrupt generated by the converter.
- Software Application Examples - Contains software application examples.
- Hardware Application Examples - Contains hardware application examples.
- JTAG/IEEE-1149.1 Scan Interface-Contains the JTAG/IEEE scan interface description.


## Typographical Conventions

In this report, buses are signified with the bus name in capital letters, followed by the range of signals (bits) enclosed in parentheses and separated by a colon. For example, $\mathrm{TI}(31: 0)$ is bus "TI", bits 31 through 0 ( 31 is the most significant bit, 0 , the least). Table 1 shows the symbols and their corresponding meaning that are used in sections of the report concerning control logic, algorithm overview, and bit-specific conversion algorithms.

Table 1. Symbols and Meanings

| Symbol | Name | Meaning |
| :---: | :--- | :--- |
| + | plus | arithmetic summation |
| \| | pipe | logical OR |
| \& | ampersand | logical AND |
| ! | exclamation point | one's complement |
| - | minus | two's complement |
| ^ | caret | EXCLUSIVE OR |

## External Interfaces

## Packaging

The converter is housed in an 84-pin package. This pinout was chosen for efficient flow-through connection to the buses. The TMS320C30-IEEE Converter's pin assignments are shown in Table 2, and the pin locations are shown in Figure 1.

Table 2. Pin Assignments

| Pin | Name | Pin | Name | Pin | Name |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 1 | GND | 29 | DA3 | 57 | DA29 |
| 2 | DB15 | 30 | DA4 | 58 | DA30 |
| 3 | DB14 | 31 | DA5 | 59 | DA31 |
| 4 | DB13 | 32 | DA6 | 60 | TDI |
| 5 | DB12 | 33 | DA7 | 61 | TMS |
| 6 | DB11 | 34 | DA8 | 62 | TCK |
| 7 | DB10 | 35 | DA9 | 63 | VCC |
| 8 | DB9 | 36 | DA10 | 64 | GND |
| 9 | DB8 | 37 | DA11 | 65 | TDO |
| 10 | DB7 | 38 | DA12 | 66 | TIP |
| 11 | DB6 | 39 | DA13 | 67 | RST |
| 12 | DB5 | 40 | DA14 | 68 | DB31 |
| 13 | DB4 | 41 | DA15 | 69 | DB30 |
| 14 | DB3 | 42 | VCC | 70 | DB29 |
| 15 | DB2 | 43 | GND | 71 | DB28 |
| 16 | DB1 | 44 | DA16 | 72 | DB27 |
| 17 | DB0 | 45 | DA17 | 73 | DB26 |
| 18 | WAIT | 46 | DA18 | 74 | DB25 |
| 19 | PIPE | 47 | DA19 | 75 | DB24 |
| 20 | CLK | 48 | DA20 | 76 | DB23 |
| 21 | VCC | 49 | DA21 | 77 | DB22 |
| 22 | GND | 50 | DA22 | 78 | DB21 |
| 23 | NAN | 51 | DA23 | 79 | DB20 |
| 24 | DIR | 52 | DA24 | 80 | DB19 |
| 25 | OE | 53 | DA25 | 81 | DB18 |
| 26 | DA0 | 54 | DA26 | 82 | DB17 |
| 27 | DA1 | 55 | DA27 | 83 | DB16 |
| 28 | DA2 | 56 | DA28 |  |  |

Figure 1. Pin Locations


## Pinout Description

Table 3 describes the pin functions.

Table 3. Converter Signals

| Signal | Pins | Type | Description |
| :--- | :---: | :--- | :--- |
| DIR | 1 | Input | Direction - This pin determines what type of conversion should take place. When it is high, data on bus B is converted from IEEE to TMS320C30 format and output on bus A. When it is low, data on bus A is converted from TMS320C30 to IEEE format and output on bus B. This pin is normally tied directly to the TMS320C30 read/write pin. |
| $\overline{\mathrm{OE}}$ | 1 | Input | Output Enable (active low) - In combination with the DIR pin, this pin disables the currently driven bus (i.e., bus A or B). |
| WAIT | 1 | Output | This pin is driven high in nonpipelined operations to signal the TMS320C30 to extend its external memory access to allow the conversion to complete. It can be tied directly to the TMS320C30 ready line. It is appropriately driven for both read and write operations, but is always low in pipelined mode of operation. |
| PIPE | 1 | Input | Pipeline Enable - When this is high, the converter is configured in pipeline mode. It must be tied low for nonpipeline mode. |
| CLK | 1 | Input | Clock - This clock is the wait-state generator and the pipeline clock. It should be connected directly to the TMS320C30 H1 clock pin. |
| NAN | 1 | Output | Not-a-Number Interrupt - This pin is driven low for 1.5 CLK cycles and signals an attempted conversion of the IEEE format: Not-a-Number. This pin can be tied directly to one of the TMS320C30 interrupt pins and can signal command or message passing in multi-processor, shared-memory-type designs. |
| DA(31:0) | 32 | Input/Output | Data Bus A - This 32-bit bus should be tied to either one of the two TMS320C30 data buses (i.e., the primary or expansion buses). |
| DB(31:0) | 32 | Input/Output | Data Bus B - This 32-bit bus is normally connected to a memory array containing IEEE-formatted data. |
| TCK | 1 | Input | Test Clock. |
| TMS | 1 | Input | Test Mode Select. |
| TDI | 1 | Input | Test Data In. |
| $\overline{\mathrm{RST}}$ | 1 | Input | Reset (active low) - This pin resets all logic on the device. |
| TDO | 1 | Output | Test Data Out. |

## Architectural Overview

Figure 2 shows the block diagram of the converter.

Figure 2. Converter Block Diagram


## Introduction

The TMS320C30 attains a peak performance of 33 MFLOPS, largely due to the floating-point format that it uses. In this format, both exponent and mantissa are represented in 2's-complement form.

In the IEEE format, the mantissa is represented in signed-magnitude form, and the exponent includes a bias (i.e., an offset). Additionally, values of numbers are not determined by the same formula. Instead, the exponent is used to flag numbers that are encoded differently. For example, if the exponent is 255, the value is considered not a number (NaN). Another exception is signaled when the exponent is zero. In this case, the mantissa is defined to be denormalized.

The TMS320C30's floating-point format is considerably simpler; most numbers can be converted to it without any loss of precision. However, some denormalized IEEE numbers are smaller than can be represented in TMS320C30 format. When these numbers are converted, they are translated to the closest TMS320C30 values. The error is less than $\pm 2^{-127}$.

## IEEE Floating-Point Format Overview

IEEE Standard 754-1985 defines formats for single-, single-extended-, double-, and double-extended-precision floating-point numbers. The single-precision format fits entirely within 32 bits, which is the bus width of the TMS320C30, and is the only format supported by the converter.

The format of the single-precision IEEE Standard 754-1985 is shown below:
Figure 3. Single-Precision IEEE Standard 754-1985 Format


In this format,

- S is the sign bit of the mantissa ($0=$ positive, $1=$ negative).
- EXPONENT is an unsigned 8-bit field that determines the location of the binary point of the number being encoded.
- FRACTION is a 23-bit field containing the fractional part of the mantissa.
- LSB is the least significant bit of a field.
- MSB is the most significant bit of a field.

The decimal value (v) of some number X is defined by one of the five cases shown below:

Case 1: If EXPONENT $=255$ and FRACTION $\neq 0$, then $v$ is NaN.

Case 2: If EXPONENT $=255$ and FRACTION $=0$, then $v= \pm$ infinity.

Case 3: If $0<$ EXPONENT $<255$, then

$$
v=(-1)^{S} \times 2^{\mathrm{EXP}-127} \times(1.\mathrm{FRAC})
$$

where

- S is either 0 or 1
- FRAC is the decimal equivalent of FRACTION
- EXP is the decimal equivalent of EXPONENT

Note that an implied 1 exists to the left of the binary point as shown above. This means the mantissa of an IEEE-encoded value has 24 bits of precision.

Case 4: If EXPONENT $=0$ and FRACTION $\neq 0$, then $v$ is a denormalized number and

$$
v=(-1)^{S} \times 2^{-126} \times(0.\mathrm{FRAC})
$$

where

- S is either 0 or 1
- FRAC is the decimal equivalent of FRACTION

Note that an implied 0 exists to the left of the binary point as shown above. This means the mantissa of a denormalized value has at most 23 significant bits.

Case 5: If EXPONENT $=0$ and FRACTION $=0$, then $v= \pm$ zero.
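
The five cases above can be sketched in a few lines of Python. This is an illustrative decoder, not part of the converter design: the function names `decode_ieee_single` and `reference_value` are hypothetical, and the standard `struct` module is used only as an independent reference for comparison.

```python
import math
import struct


def decode_ieee_single(word):
    """Decode a 32-bit IEEE single-precision word using Cases 1 to 5."""
    s = (word >> 31) & 1           # sign bit S
    exp = (word >> 23) & 0xFF      # unsigned 8-bit EXPONENT
    frac = word & 0x7FFFFF         # 23-bit FRACTION
    sign = -1.0 if s else 1.0

    if exp == 255 and frac != 0:   # Case 1: NaN
        return math.nan
    if exp == 255:                 # Case 2: +/- infinity
        return sign * math.inf
    if exp > 0:                    # Case 3: normalized, implied 1
        return sign * 2.0 ** (exp - 127) * (1 + frac / 2 ** 23)
    if frac != 0:                  # Case 4: denormalized, implied 0
        return sign * 2.0 ** -126 * (frac / 2 ** 23)
    return sign * 0.0              # Case 5: +/- zero


def reference_value(word):
    """Same word interpreted by the host's own IEEE single decoder."""
    return struct.unpack(">f", word.to_bytes(4, "big"))[0]
```

For any finite word, `decode_ieee_single(w) == reference_value(w)` should hold, because every single-precision value is exactly representable in a Python float.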

## TMS320C30 Floating-Point Format Overview

TMS320C30 single-precision floating-point format uses a 2's-complement exponent and mantissa and is shown in Figure 4.

Figure 4. TMS320C30 Single-Precision Floating-Point Format


The decimal value (v) of some number X is determined as follows:

$$
v=\left\{(-2)^{S}+(.\mathrm{FRAC})\right\} \times 2^{\mathrm{EXP}}
$$

where

- S is either 0 or 1
- FRAC is the decimal equivalent of FRACTION
- EXP is the decimal equivalent of EXPONENT

An alternate way of describing the TMS320C30 mantissa is as follows:

$$
s \bar{s}.\text{fraction}
$$

Note that the bit to the left of the binary point is implied and is the complement of the sign bit. This gives the TMS320C30's mantissa 24 bits of precision and not 23 bits as might be expected. For example:

- The most positive TMS320C30 mantissa is $01.11111111111111111111111=2-2^{-23}$.
- The least positive TMS320C30 mantissa is $01.00000000000000000000000=1$.
- The most negative TMS320C30 mantissa is $10.00000000000000000000000=-2$.
- The least negative TMS320C30 mantissa is $10.11111111111111111111111=-1-2^{-23}$.

Note that zero is uniquely identified when the TMS320C30 exponent is $-128$.
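
As a hedged illustration of the formula above, the following Python sketch decodes a 32-bit TMS320C30 word. The field positions (exponent in bits 31-24, sign in bit 23, fraction in bits 22-0) are an assumption inferred from the titles of the bit-mapping figures later in this report; the function name `decode_c30_single` is hypothetical.

```python
def decode_c30_single(word):
    """Decode a 32-bit TMS320C30 single-precision word (assumed layout)."""
    exp = (word >> 24) & 0xFF      # 8-bit 2's-complement EXPONENT
    if exp >= 0x80:
        exp -= 0x100               # sign-extend: 80h..FFh -> -128..-1
    s = (word >> 23) & 1           # sign bit S
    frac = word & 0x7FFFFF         # 23-bit FRACTION

    if exp == -128:                # exponent -128 uniquely identifies zero
        return 0.0
    # v = {(-2)^S + .FRAC} * 2^EXP
    return ((-2.0) ** s + frac / 2 ** 23) * 2.0 ** exp
```

With this layout, the all-zero word decodes to 1.0 (exponent 0, mantissa 01.000...), which is a useful reminder that the TMS320C30 zero is the word with exponent 80h rather than the all-zero bit pattern.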

## IEEE Number Conversion

This section describes the classifications of IEEE numbers, how they are decoded, and the algorithms necessary to translate them to TMS320C30 format.

Table 4 shows the dynamic range of IEEE numbers. This chart can be used to quickly determine the case classification of an IEEE number.

Table 4. IEEE Range of Numbers


Table 4. IEEE Range of Numbers (Concluded)

| Sign | Exponent | Mantissa | Value | Type | Case |
| :---: | :---: | :---: | :---: | :---: | :---: |
| 1 | 00 | 0.011... 111 | $-\left(1-2^{-22}\right) \times 2^{-127}$ | - Denormalized Number | 4D |
| 1 | 00 | 0.100... 000 | $-\left(2^{-127}\right)$ | - Denormalized Number | 4D |
| 1 | 00 | 0.100... 001 | $-\left(1+2^{-22}\right) \times 2^{-127}$ | - Denormalized Number | 4C |
| 1 | 00 | 0.100... 010 | $-\left(1+2^{-21}\right) \times 2^{-127}$ | - Denormalized Number | 4C |
| 1 | 00 | 0.100... 011 | $-\left(1+2^{-21}+2^{-22}\right) \times 2^{-127}$ | - Denormalized Number | 4C |
|  | . |  |  |  |  |
| 1 | 00 | 0.111... 111 | $-\left(1-2^{-23}\right) \times 2^{-126}$ | - Denormalized Number | 4C |
| 1 | 01 | 1.000... 000 | $-\left(2^{-126}\right)$ | - Normalized Number | 3C |
| 1 | 01 | 1.000... 001 | $-\left(1+2^{-23}\right) \times 2^{-126}$ | - Normalized Number | 3B |
| 1 | 01 | 1.000... 010 | $-\left(1+2^{-22}\right) \times 2^{-126}$ | - Normalized Number | 3B |
| 1 | 01 | 1.000... 011 | $-\left(1+2^{-22}+2^{-23}\right) \times 2^{-126}$ | - Normalized Number | 3B |
|  | . |  |  |  |  |
| 1 | 01 | 1.111... 111 | $-\left(2-2^{-23}\right) \times 2^{-126}$ | - Normalized Number | 3B |
| 1 | 02 | 1.000... 000 | $-\left(2^{-125}\right)$ | - Normalized Number | 3C |
| 1 | 02 | 1.000... 001 | $-\left(1+2^{-23}\right) \times 2^{-125}$ | - Normalized Number | 3B |
| 1 | 02 | 1.000... 010 | $-\left(1+2^{-22}\right) \times 2^{-125}$ | - Normalized Number | 3B |
| 1 | 02 | 1.000... 011 | $-\left(1+2^{-22}+2^{-23}\right) \times 2^{-125}$ | - Normalized Number | 3B |
|  | . |  |  |  |  |
| 1 | FE | 1.111... 100 | $-\left(2-2^{-21}\right) \times 2^{127}$ | - Normalized Number | 3B |
| 1 | FE | 1.111... 101 | $-\left(2-2^{-21}+2^{-23}\right) \times 2^{127}$ | - Normalized Number | 3B |
| 1 | FE | 1.111... 110 | $-\left(2-2^{-22}\right) \times 2^{127}$ | - Normalized Number | 3B |
| 1 | FE | 1.111... 111 | $-\left(2-2^{-23}\right) \times 2^{127}$ | - Normalized Number | 3B |
| 1 | FF | $=0$ | - infinity | - Infinity | 2B |

## IEEE-to-TMS320C30 Control Logic

The control logic that classifies incoming IEEE data in order to perform correct translation to TMS320C30 format is shown below. The form of the expressions was chosen to minimize propagation delay through the device.

The logic is simplified if the following three factors are used (refer to typographical definitions for symbols used):

$$
\begin{aligned}
\text{EXPFF} &= \text{IEEE(30) \& IEEE(29) \& IEEE(28) \& IEEE(27) \&} \\
&\phantom{=}\ \text{IEEE(26) \& IEEE(25) \& IEEE(24) \& IEEE(23)} \\
\text{EXP00} &= \text{!( IEEE(30) | IEEE(29) | IEEE(28) | IEEE(27) |} \\
&\phantom{=}\ \text{IEEE(26) | IEEE(25) | IEEE(24) | IEEE(23) )} \\
\text{MANT0} &= \text{!( IEEE(21) | IEEE(20) | IEEE(19) | IEEE(18) |} \\
&\phantom{=}\ \text{IEEE(17) | IEEE(16) | IEEE(15) | IEEE(14) |} \\
&\phantom{=}\ \text{IEEE(13) | IEEE(12) | IEEE(11) | IEEE(10) |} \\
&\phantom{=}\ \text{IEEE(9) | IEEE(8) | IEEE(7) | IEEE(6) |} \\
&\phantom{=}\ \text{IEEE(5) | IEEE(4) | IEEE(3) | IEEE(2) |} \\
&\phantom{=}\ \text{IEEE(1) | IEEE(0) )}
\end{aligned}
$$

Then

$$
\begin{aligned}
&\text{Case 1: NaN} \\
&\qquad = \text{EXPFF \& ( IEEE(22) | !MANT0 )} \\
&\text{Case 2A: positive infinity} \\
&\qquad = \text{!IEEE(31) \& EXPFF \& !( IEEE(22) | !MANT0 )} \\
&\text{Case 2B: negative infinity} \\
&\qquad = \text{IEEE(31) \& EXPFF \& !( IEEE(22) | !MANT0 )} \\
&\text{Case 3A: positive normalized numbers} \\
&\qquad = \text{!IEEE(31) \& !EXP00 \& !EXPFF} \\
&\text{Case 3B: negative normalized numbers with fraction} \neq 0 \\
&\qquad = \text{IEEE(31) \& !EXP00 \& !EXPFF \& ( !MANT0 | IEEE(22) )} \\
&\text{Case 3C: negative normalized numbers with fraction} = 0 \\
&\qquad = \text{IEEE(31) \& !EXP00 \& !EXPFF \& !( !MANT0 | IEEE(22) )} \\
&\text{Case 4A: positive denormalized numbers} \geq 2^{-127} \\
&\qquad = \text{!IEEE(31) \& EXP00 \& IEEE(22)} \\
&\text{Case 4B: positive denormalized numbers} < 2^{-127} \\
&\qquad = \text{!IEEE(31) \& EXP00 \& !IEEE(22) \& !MANT0} \\
&\text{Case 4C: negative denormalized numbers} \leq\left(-1-2^{-23}\right) \times 2^{-127} \\
&\qquad = \text{IEEE(31) \& EXP00 \& IEEE(22) \& !MANT0} \\
&\text{Case 4D: negative denormalized numbers} >\left(-1-2^{-23}\right) \times 2^{-127} \\
&\qquad = \text{IEEE(31) \& EXP00 \& ( IEEE(22) \^{} !MANT0 )} \\
&\text{Case 5: positive and negative zero} \\
&\qquad = \text{EXP00 \& !IEEE(22) \& MANT0}
\end{aligned}
$$
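
The case qualifiers can be checked in software with a short Python sketch that mirrors the three factors and the eleven qualifier expressions. The function name `ieee_cases` is hypothetical. Two readings are assumptions: MANT0 is taken to cover fraction bits 21-0 only (bit 22 is tested separately), and Case 3C is taken as the logical complement of the Case 3B fraction test, so that it fires exactly when the fraction is zero.

```python
def ieee_cases(word):
    """Return the names of all case qualifiers that are active for an IEEE word."""
    def bit(n):
        return ((word >> n) & 1) == 1

    expff = all(bit(n) for n in range(23, 31))      # exponent == FFh
    exp00 = not any(bit(n) for n in range(23, 31))  # exponent == 00h
    mant0 = not any(bit(n) for n in range(0, 22))   # fraction bits 21..0 all zero
    s, b22 = bit(31), bit(22)
    frac_nonzero = b22 or not mant0                 # IEEE(22) | !MANT0

    qualifiers = {
        "1": expff and frac_nonzero,
        "2A": not s and expff and not frac_nonzero,
        "2B": s and expff and not frac_nonzero,
        "3A": not s and not exp00 and not expff,
        "3B": s and not exp00 and not expff and frac_nonzero,
        "3C": s and not exp00 and not expff and not frac_nonzero,
        "4A": not s and exp00 and b22,
        "4B": not s and exp00 and not b22 and not mant0,
        "4C": s and exp00 and b22 and not mant0,
        "4D": s and exp00 and (b22 != (not mant0)),  # XOR
        "5": exp00 and not b22 and mant0,
    }
    return [name for name, active in qualifiers.items() if active]
```

Returning a list rather than a single name makes it easy to confirm that the IEEE qualifiers are mutually exclusive: every input word should yield a list of length one.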

## IEEE-to-TMS320C30 Conversion Algorithm Overview

Table 5 shows the conversion algorithms used on the sign, exponent, and mantissa fields of IEEE numbers to produce the corresponding TMS320C30 fields. These fields are broken down into bit-specific algorithms in the following section.

Table 5. Conversion Algorithms from IEEE to TMS320C30 Format

| Case | TMS320C30 Exponent | TMS320C30 Sign | TMS320C30 Fraction |
| :---: | :--- | :--- | :--- |
| 1 | $e_{\mathrm{IEEE}}$ | $s_{\mathrm{IEEE}}$ | $f_{\mathrm{IEEE}}$ |
| 2A | 7Fh | $s_{\mathrm{IEEE}}$ | 7FFFFFh |
| 2B | 7Fh | $s_{\mathrm{IEEE}}$ | 000000h |
| 3A | $e_{\mathrm{IEEE}}+81\mathrm{h}$ | $s_{\mathrm{IEEE}}$ | $f_{\mathrm{IEEE}}$ |
| 3B | $e_{\mathrm{IEEE}}+81\mathrm{h}$ | $s_{\mathrm{IEEE}}$ | $-f_{\mathrm{IEEE}}$ |
| 3C | $e_{\mathrm{IEEE}}$ ^ 80h | $s_{\mathrm{IEEE}}$ | $-f_{\mathrm{IEEE}}$ |
| 4A | 81h | $s_{\mathrm{IEEE}}$ | $2 \times f_{\mathrm{IEEE}}$ |
| 4B | 80h | $s_{\mathrm{IEEE}}$ | 000000h |
| 4C | 81h | $s_{\mathrm{IEEE}}$ | $2 \times\left(-f_{\mathrm{IEEE}}\right)$ |
| 4D | 80h | 0 | 000000h |
| 5 | 80h | 0 | 000000h |

Note: The fraction above has only 23 bits; $-f_{\mathrm{IEEE}}$ denotes its 2's-complement negation within those 23 bits, and exponent sums are taken modulo 256.
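
The field algorithms of Table 5 can be exercised with the following Python sketch. It is a software model under stated assumptions, not the gate-level design: the function name `ieee_to_c30` is hypothetical, cases are selected by comparing fields rather than by the NAND-based qualifier logic, the TMS320C30 word is assumed to hold the exponent in bits 31-24, the sign in bit 23, and the fraction in bits 22-0, and $-f_{\mathrm{IEEE}}$ is read as the 23-bit 2's complement of the fraction.

```python
def ieee_to_c30(word):
    """Convert an IEEE single-precision word to a TMS320C30 word per Table 5."""
    s = (word >> 31) & 1
    e = (word >> 23) & 0xFF
    f = word & 0x7FFFFF
    neg_f = (-f) & 0x7FFFFF                    # 23-bit 2's complement of f

    if e == 0xFF and f != 0:                   # Case 1: NaN, fields passed through
        exp, sign, frac = e, s, f
    elif e == 0xFF:                            # Cases 2A/2B: +/- infinity
        exp, sign, frac = 0x7F, s, (0 if s else 0x7FFFFF)
    elif e != 0 and not s:                     # Case 3A
        exp, sign, frac = (e + 0x81) & 0xFF, s, f
    elif e != 0 and f != 0:                    # Case 3B
        exp, sign, frac = (e + 0x81) & 0xFF, s, neg_f
    elif e != 0:                               # Case 3C
        exp, sign, frac = e ^ 0x80, s, 0
    elif not s and f >= 0x400000:              # Case 4A
        exp, sign, frac = 0x81, s, (2 * f) & 0x7FFFFF
    elif s and f > 0x400000:                   # Case 4C
        exp, sign, frac = 0x81, s, (2 * neg_f) & 0x7FFFFF
    else:                                      # Cases 4B, 4D, 5: forced to zero
        exp, sign, frac = 0x80, 0, 0
    return (exp << 24) | (sign << 23) | frac
```

For example, IEEE 1.0 (3F800000h) maps to the all-zero TMS320C30 word (exponent 0, mantissa 01.000...), and IEEE -1.0 (BF800000h) maps to FF800000h, that is, mantissa $-2$ with exponent $-1$.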

## IEEE-to-TMS320C30 Bit-Specific Conversion Algorithms

These circuits were designed by examining Table 5 and finding all possible choices for each bit. The different choices were fed into data selectors, whose addresses were derived from the case-identifying logic described in the preceding section on control logic.

For maximum performance, all data selectors were designed from NAND gates. This also permitted minimization by eliminating all NAND gates that had an input of 0 and by reducing the number of NAND inputs where a bit was always 1. However, for clarity, no minimization is shown here. Instead, that detail can be seen in the following figures.

The following bit algorithms are shown in bit descending order, starting with IEEE bit 31.
Figure 5. IEEE Bit 31 to TMS320C30 Bit 23


Figure 6. IEEE Bit 30 to TMS320C30 Bit 31

$$
\begin{aligned}
& \mathrm{b}=\mathrm{CASE1} \mid \mathrm{CASE2A} \mid \mathrm{CASE2B} \mid \mathrm{CASE3C} \\
& \mathrm{B}=!\mathrm{b} \\
& \mathrm{A}=\mathrm{CASE2A} \mid \mathrm{CASE2B} \mid \mathrm{CASE3A} \mid \mathrm{CASE3B} \\
& \mathrm{a}=!\mathrm{A}
\end{aligned}
$$

Figure 7. IEEE Bit $n$ to TMS320C30 Bit $n+1$, Where $29 \geq n \geq 24$


$$
\begin{aligned}
& \mathrm{b}=\mathrm{CASE} 2 \mathrm{~A} \mid \text { CASE2B } \mid \text { CASE3A } \mid \text { CASE3B } \\
& \mathrm{B}=!\mathrm{b} \\
& \mathrm{a}=\mathrm{CASE} 2 \mathrm{~A} \mid \text { CASE2B } \mid \text { CASE1 } \mid \text { CASE3C } \\
& \mathrm{A}=!\mathrm{a}
\end{aligned}
$$

Figure 8. IEEE Bit 23 to TMS320C30 Bit 24


$$
\begin{aligned}
& \mathrm{b}=\mathrm{CASE} \mid \text { CASE3C } \mid \text { CASE4B } \mid \text { CASE4D } \mid \text { CASE } 5 \\
& \mathrm{~B}=!\mathrm{b} \\
& \mathrm{~A}=\mathrm{CASE4B} \mid \text { CASE4D } \mid \text { CASE } \mid \text { CASE3A } \mid \text { CASE3B } \\
& \mathrm{a}=!\mathrm{A}
\end{aligned}
$$

Figure 9. IEEE Bit n to TMS320C30 Bit n , Where $22 \geq \mathrm{n} \geq 1$


```
C = CASE2A | CASE3B | CASE3C | CASE4C
c = !C
b = CASE1 | CASE2A | CASE3A | CASE4A | CASE4C
B = !b
A = CASE4A | CASE4C
a = !A
```

Figure 10. IEEE Bit 0 to TMS320C30 Bit 0


$$
\begin{aligned}
& \mathrm{B}=\mathrm{CASE} 2 \mathrm{~A} \\
& \mathrm{~b}=!\mathrm{B} \\
& \mathrm{~A}=\mathrm{CASE} 1|\mathrm{CASE} 2 \mathrm{~A}| \text { CASE } 3 \mathrm{~A} \mid \text { CASE3B } \mid \text { CASE3C } \\
& \mathrm{a}=!\mathrm{A}
\end{aligned}
$$

## TMS320C30 Number Conversion

This section describes the classifications of TMS320C30 numbers, how they are decoded, and the algorithms necessary to translate them to IEEE format.

## TMS320C30 Dynamic Range

Shown in Table 6 is the dynamic range of TMS320C30 numbers. As with Table 4, this table can be used to quickly determine case classification of a TMS320C30 number.

Table 6. TMS320C30 Range of Numbers

| Exponent | Sign | Mantissa | Value | Type | Case |
| :---: | :---: | :---: | :---: | :---: | :---: |
| 7 F | 0 | 1.111... 111 | $\left(2-2^{-23}\right) \times 2^{127}$ | Positive Number | 6 |
| 7F | 0 | 1.111... 110 | $\left(2-2^{-22}\right) \times 2^{127}$ | Positive Number | 6 |
| 7F | 0 | 1.111... 101 | $\left(2-2^{-21}+2^{-23}\right) \times 2^{127}$ | Positive Number | 6 |
| 7F | 0 | 1.111... 100 | $\left(2-2^{-21}\right) \times 2^{127}$ | Positive Number | 6 |
| . |  |  |  |  |  |
|  |  |  |  |  |  |
| 7F | 0 | 1.000... 000 | $2^{127}$ | Positive Number | 6 |
| 7 E | 0 | 1.111... 111 | $\left(2-2^{-23}\right) \times 2^{126}$ | Positive Number | 6 |
| 7E | 0 | 1.111... 110 | $\left(2-2^{-22}\right) \times 2^{126}$ | Positive Number | 6 |
| 7E | 0 | 1.111... 101 | $\left(2-2^{-21}+2^{-23}\right) \times 2^{126}$ | Positive Number | 6 |
| . |  |  |  |  |  |
| . |  | . |  |  |  |
| 00 | 0 | 1.000... 000 | $1$ | Positive Number | 6 |
| FF | 0 | 1.111... 111 | $1-2^{-24}$ | Positive Number | 6 |
| FF | 0 | 1.111... 110 | $1-2^{-23}$ | Positive Number | 6 |
| FF | 0 | 1.111... 101 | $1-2^{-22}+2^{-24}$ | Positive Number | 6 |
| . |  |  |  |  |  |
| FF | 0 | 1.000... 000 | $2^{-1}$ | Positive Number | 6 |
| FE | 0 | 1.111... 111 | $\left(2-2^{-23}\right) \times 2^{-2}$ | Positive Number | 6 |
| FE | 0 | 1.111... 110 | $\left(2-2^{-22}\right) \times 2^{-2}$ | Positive Number | 6 |
| FE | 0 | 1.111... 101 | $\left(2-2^{-21}+2^{-23}\right) \times 2^{-2}$ | Positive Number | 6 |
| . |  |  |  |  |  |
| 82 | 0 | 1.000... 000 | $2^{-126}$ | Positive Number | 6 |
| 81 | 0 | 1.111... 111 | $\left(2-2^{-23}\right) \times 2^{-127}$ | Positive Number | 7 (note 1) |
| 81 | 0 | 1.111... 110 | $\left(2-2^{-22}\right) \times 2^{-127}$ | Positive Number | 7 (note 1) |
| 81 | 0 | 1.111... 101 | $\left(2-2^{-21}+2^{-23}\right) \times 2^{-127}$ | Positive Number | 7 (note 1) |
| 81 | 0 | 1.111... 100 | $\left(2-2^{-21}\right) \times 2^{-127}$ | Positive Number | 7 (note 1) |
| . |  |  |  |  |  |
| 81 | 0 | 1.000... 010 | $\left(1+2^{-22}\right) \times 2^{-127}$ | Positive Number | 7 (note 1) |
| 81 | 0 | 1.000... 001 | $\left(1+2^{-23}\right) \times 2^{-127}$ | Positive Number | 7 (note 1) |
| 81 | 0 | 1.000... 000 | $2^{-127}$ | Positive Number | 7 (note 1) |
| 80 | 0 | 0.111... 111 | (note 2) | Implied Zero | 8 |
| 80 | 0 | 0.111... 110 | (note 2) | Implied Zero | 8 |
| 80 | 0 | 0.111... 101 | (note 2) | Implied Zero | 8 |
| . |  |  |  |  |  |
| 80 | 0 | 0.000... 001 | (note 2) | Implied Zero | 8 |
| 80 | 0 | 0.000... 000 | 0.0 | Zero | 8 |
| 80 | 1 | 10.111... 111 | (note 2) | Implied Zero | (note 3) |
| 80 | 1 | 10.111... 110 | (note 2) | Implied Zero | (note 3) |
| 80 | 1 | 10.111... 101 | (note 2) | Implied Zero | (note 3) |
| . |  |  |  |  |  |
| 80 | 1 | 10.000... 011 | (note 2) | Implied Zero | (note 3) |
| 80 | 1 | 10.000... 010 | (note 2) | Implied Zero | (note 3) |
| 80 | 1 | 10.000... 001 | (note 2) | Implied Zero | (note 3) |
| 80 | 1 | 10.000... 000 | (note 2) | Implied Zero | 8 |
| 81 | 1 | 10.111... 111 | $\left(-1-2^{-23}\right) \times 2^{-127}$ | Negative Number | 9 (note 1) |
| 81 | 1 | 10.111... 110 | $\left(-1-2^{-22}\right) \times 2^{-127}$ | Negative Number | 9 (note 1) |
| 81 | 1 | 10.111... 101 | $\left(-1-2^{-21}+2^{-23}\right) \times 2^{-127}$ | Negative Number | 9 (note 1) |
| . |  |  |  |  |  |
| 81 | 1 | 10.000... 010 | $\left(-2+2^{-22}\right) \times 2^{-127}$ | Negative Number | 9 (note 1) |
| 81 | 1 | 10.000... 001 | $\left(-2+2^{-23}\right) \times 2^{-127}$ | Negative Number | 9 (note 1) |
| 81 | 1 | 10.000... 000 | $-\left(2^{-126}\right)$ | Negative Number | 10 |
| 82 | 1 | 10.111... 111 | $\left(-1-2^{-23}\right) \times 2^{-126}$ | Negative Number | 11 |
| 82 | 1 | 10.111... 110 | $\left(-1-2^{-22}\right) \times 2^{-126}$ | Negative Number | 11 |
| 82 | 1 | 10.111... 101 | $\left(-1-2^{-21}+2^{-23}\right) \times 2^{-126}$ | Negative Number | 11 |
| . |  |  |  |  |  |
| FF | 1 | 10.000... 001 | $-1+2^{-24}$ | Negative Number | 11 |
| FF | 1 | 10.000... 000 | -1 | Negative Number | 10 |
| 00 | 1 | 10.111... 111 | $-1-2^{-23}$ | Negative Number | 11 |
| 00 | 1 | 10.111... 110 | $-1-2^{-22}$ | Negative Number | 11 |
| 00 | 1 | 10.111... 101 | $-1-2^{-21}+2^{-23}$ | Negative Number | 11 |
| . |  |  |  |  |  |
| 00 | 1 | 10.000... 001 | $-2+2^{-23}$ | Negative Number | 11 |
| 00 | 1 | 10.000... 000 | $-2$ | Negative Number | 10 |
| 01 | 1 | 10.111... 111 | $-2-2^{-22}$ | Negative Number | 11 |
| 01 | 1 | 10.111... 110 | $-2-2^{-21}$ | Negative Number | 11 |
| 01 | 1 | 10.111... 101 | $-2-2^{-20}+2^{-22}$ | Negative Number | 11 |
| . |  |  |  |  |  |
| 7F | 1 | 10.000... 001 | $\left(-2+2^{-23}\right) \times 2^{127}$ | Negative Number | 11 |
| 7F | 1 | 10.000... 000 | $-\left(2^{128}\right)$ | Negative Number | 12 |

Notes:

1. Numbers converted to IEEE denormalized values lose one least significant bit of accuracy.
2. The TMS320C30 does not produce these numbers under normal arithmetic operations. Because the exponent of these numbers is $-128$, the TMS320C30 considers them zero. TMS320C30 Boolean operations are capable of producing numbers of these forms. Because of this, proper conversion to IEEE format is unclear and should be avoided. See note 3.
3. Cases 8 and 9 are activated simultaneously. This is the only instance where the cases are not mutually exclusive. The TMS320C30 does not produce these numbers under normal arithmetic operations. Because the exponent of these numbers is $-128$, the TMS320C30 considers them zero. TMS320C30 Boolean operations are capable of producing numbers of these forms. Because of this, proper conversion to IEEE format is unclear. This dilemma can be resolved with a minor modification to the case qualifier logic. See note 2.

## TMS320C30-to-IEEE Control Logic

Conversion from TMS320C30 format to IEEE format is qualified with a different set of Boolean equations. To eliminate confusion between IEEE and TMS320C30 cases, different case numbers are used.

The logic is simplified if the following three factors are used:

$$
\begin{aligned}
\text{EXP80\_81} &= \text{!( !C30(31) | C30(30) | C30(29) | C30(28) |} \\
&\phantom{=}\ \text{C30(27) | C30(26) | C30(25) )} \\
\text{EXP7F} &= \text{!C30(31) \& C30(30) \& C30(29) \& C30(28) \&} \\
&\phantom{=}\ \text{C30(27) \& C30(26) \& C30(25) \& C30(24)} \\
\text{MANT0} &= \text{!( C30(22) | C30(21) | C30(20) | C30(19) |} \\
&\phantom{=}\ \text{C30(18) | C30(17) | C30(16) | C30(15) |} \\
&\phantom{=}\ \text{C30(14) | C30(13) | C30(12) | C30(11) |} \\
&\phantom{=}\ \text{C30(10) | C30(9) | C30(8) | C30(7) |} \\
&\phantom{=}\ \text{C30(6) | C30(5) | C30(4) | C30(3) |} \\
&\phantom{=}\ \text{C30(2) | C30(1) | C30(0) )}
\end{aligned}
$$

Then,

$$
\begin{aligned}
&\text{Case 6: positive numbers} \geq 2^{-126} \\
&\qquad = \text{!EXP80\_81 \& !C30(23)} \\
&\text{Case 7: positive numbers N such that } \left(2-2^{-23}\right) \times 2^{-127} \geq \mathrm{N} \geq 2^{-127} \\
&\qquad = \text{EXP80\_81 \& C30(24) \& !C30(23)} \\
&\text{Case 8: zero} \\
&\qquad = \text{EXP80\_81 \& !C30(24)} \\
&\text{Case 9: negative numbers N such that } \left(-1-2^{-23}\right) \times 2^{-127} \geq \mathrm{N} \geq\left(-2+2^{-23}\right) \times 2^{-127} \\
&\qquad = \text{EXP80\_81 \& C30(23) \& !MANT0} \\
&\text{Case 10: negative numbers N such that } -\left(2^{-126}\right) \geq \mathrm{N} \geq-\left(2^{127}\right) \text{ and whose fraction is 0} \\
&\qquad = \text{!( EXP80\_81 \& !C30(24) ) \& !EXP7F \& C30(23) \& MANT0} \\
&\text{Case 11: negative numbers N such that } -\left(2^{-126}\right)>\mathrm{N}>-\left(2^{128}\right) \text{ and whose fraction} \neq 0 \\
&\qquad = \text{!EXP80\_81 \& C30(23) \& !MANT0} \\
&\text{Case 12: negative } 2^{128} \\
&\qquad = \text{EXP7F \& C30(23) \& MANT0}
\end{aligned}
$$
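
A Python sketch of the TMS320C30 case qualifiers follows. The function name `c30_cases` is hypothetical, and three readings are assumptions chosen to agree with Table 6 and with the Case 10 qualifier: EXP80_81 is true when the exponent is 80h or 81h, MANT0 is true when all 23 fraction bits are zero, and Case 8 requires C30(24) to be 0 (exponent 80h).

```python
def c30_cases(word):
    """Return the numbers of all case qualifiers active for a TMS320C30 word."""
    def bit(n):
        return ((word >> n) & 1) == 1

    # Exponent is 80h or 81h: bit 31 set, bits 30..25 clear, bit 24 free.
    exp80_81 = bit(31) and not any(bit(n) for n in range(25, 31))
    # Exponent is 7Fh: bit 31 clear, bits 30..24 set.
    exp7f = (not bit(31)) and all(bit(n) for n in range(24, 31))
    # Fraction bits 22..0 all zero.
    mant0 = not any(bit(n) for n in range(0, 23))
    s, b24 = bit(23), bit(24)

    qualifiers = {
        6: not exp80_81 and not s,
        7: exp80_81 and b24 and not s,
        8: exp80_81 and not b24,
        9: exp80_81 and s and not mant0,
        10: not (exp80_81 and not b24) and not exp7f and s and mant0,
        11: not exp80_81 and s and not mant0,
        12: exp7f and s and mant0,
    }
    return [number for number, active in qualifiers.items() if active]
```

Under these assumptions, a word such as 80800001h (exponent 80h, sign 1, nonzero fraction) returns both 8 and 9, which reproduces the overlap described in note 3 of Table 6.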

## TMS320C30-to-IEEE Conversion Algorithm Overview

Table 7 shows the conversion algorithms used on the sign, exponent, and mantissa fields of TMS320C30 numbers to produce the corresponding IEEE fields. These fields are broken down into bit-specific algorithms in the next section.

Table 7. Conversion Algorithms from TMS320C30 to IEEE Format

| Case | IEEE Sign | IEEE Exponent | IEEE Fraction |
| :---: | :---: | :---: | :---: |
| 6 | $s_{\mathrm{C30}}$ | $e_{\mathrm{C30}}+7\mathrm{Fh}$ | $f_{\mathrm{C30}}$ |
| 7 | $s_{\mathrm{C30}}$ | 00h | $\left(f_{\mathrm{C30}} / 2\right)+400000\mathrm{h}$ |
| 8 | 0 | 00h | 000000h |
| 9 | $s_{\mathrm{C30}}$ | 00h | $\left(\bar{f}_{\mathrm{C30}}+1\right) / 2+400000\mathrm{h}$ |
| 10 | $s_{\mathrm{C30}}$ | $e_{\mathrm{C30}}+80\mathrm{h}$ | 000000h |
| 11 | $s_{\mathrm{C30}}$ | $e_{\mathrm{C30}}+7\mathrm{Fh}$ | $\bar{f}_{\mathrm{C30}}+1$ |
| 12 | $s_{\mathrm{C30}}$ | FFh | 000000h |
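
The following Python sketch applies the field algorithms of Table 7. As before, it is a software model under stated assumptions rather than the converter's circuit: the function name `c30_to_ieee` is hypothetical, the TMS320C30 word is assumed to hold the exponent in bits 31-24, the sign in bit 23, and the fraction in bits 22-0, the fraction used for Cases 9 and 11 is taken to be the 23-bit 2's complement of $f_{\mathrm{C30}}$ (complement plus one), and words with exponent 80h are all treated as Case 8.

```python
def c30_to_ieee(word):
    """Convert a TMS320C30 word to an IEEE single-precision word per Table 7."""
    e = (word >> 24) & 0xFF
    s = (word >> 23) & 1
    f = word & 0x7FFFFF
    neg_f = (-f) & 0x7FFFFF                       # complement of f, plus 1, in 23 bits

    if e == 0x80:                                 # Case 8: zero
        sign, exp, frac = 0, 0x00, 0
    elif e == 0x81 and not s:                     # Case 7: positive, denormalized result
        sign, exp, frac = s, 0x00, (f >> 1) + 0x400000
    elif e == 0x81 and f != 0:                    # Case 9: negative, denormalized result
        sign, exp, frac = s, 0x00, (neg_f >> 1) + 0x400000
    elif not s:                                   # Case 6: positive numbers
        sign, exp, frac = s, (e + 0x7F) & 0xFF, f
    elif f != 0:                                  # Case 11: negative, fraction != 0
        sign, exp, frac = s, (e + 0x7F) & 0xFF, neg_f
    elif e == 0x7F:                               # Case 12: -(2^128) becomes -infinity
        sign, exp, frac = s, 0xFF, 0
    else:                                         # Case 10: negative, fraction == 0
        sign, exp, frac = s, (e + 0x80) & 0xFF, 0
    return (sign << 31) | (exp << 23) | frac
```

The right shift in Cases 7 and 9 is where the least significant bit is dropped: 81000000h and 81000001h both convert to the same IEEE denormalized word, 00400000h.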

## TMS320C30-to-IEEE Bit-Specific Conversion Algorithms

These circuits were designed by examining Table 7 and finding all possible choices for each bit. The different choices were fed into data selectors whose addresses were derived from the case-identifying logic described in the preceding section on TMS320C30 to IEEE control logic.

Just as in the IEEE case-identifying logic, all data selectors were designed from NAND gates for maximum performance. This also permitted minimization by eliminating all NAND gates having an input of 0 and by reducing the number of NAND inputs where a bit was always 1. However, for clarity, no minimization is shown here. Instead, that detail can be seen in the following figures.

The following bit algorithms are shown in bit-descending order, starting with TMS320C30 bit 31.

Figure 11. TMS320C30 Bit 31 to IEEE Bit 30


Figure 12. TMS320C30 Bit $n$ to IEEE Bit $n-1$, Where $31 \geq n \geq 24$


$$
\begin{aligned}
& \mathrm{B}=\mathrm{CASE} 10 \mid \text { CASE } 12 \\
& \mathrm{~b}=!\mathrm{B} \\
& \mathrm{a}=\mathrm{CASE} \mid \text { CASE11 } \mid \text { CASE12 } \\
& \mathrm{A}=!\mathrm{a}
\end{aligned}
$$

Figure 13. TMS320C30 Bit 23 to IEEE Bit 31


Figure 14. TMS320C30 Bit 22 to IEEE Bit 22


Figure 15. TMS320C30 Bit $n$ to IEEE Bit $n$, Where $21 \geq n \geq 1$

$$
\begin{aligned}
& \mathrm{C}=\mathrm{CASE6} \mid \mathrm{CASE9} \\
& \mathrm{c}=!\mathrm{C} \\
& \mathrm{b}=\mathrm{CASE6} \mid \mathrm{CASE7} \mid \mathrm{CASE11} \\
& \mathrm{B}=!\mathrm{b} \\
& \mathrm{A}=\mathrm{CASE11} \\
& \mathrm{a}=!\mathrm{A}
\end{aligned}
$$

Figure 16. TMS320C30 Bit 0 to IEEE Bit 0

$$
\begin{aligned}
& \mathrm{B}=\mathrm{CASE7} \mid \mathrm{CASE9} \\
& \mathrm{b}=!\mathrm{B} \\
& \mathrm{a}=\mathrm{CASE6} \mid \mathrm{CASE7} \mid \mathrm{CASE11} \\
& \mathrm{A}=!\mathrm{a}
\end{aligned}
$$

## Scope of Conversion

This section describes the actions taken by the converter when it converts to and from the IEEE format. When there is not a match between formats, the converter forces the translated number to the closest approximation.

## IEEE-to-TMS320C30 Exceptions

The match is not exact in translating from four sets of IEEE numbers to TMS320C30 numbers. They are: $\mathrm{NaN}, \pm$ infinity, $\pm$ zero and denormalized numbers too small to represent.

## NaN (Not a Number)

The NaN format is especially useful in passing commands to another process. So that commands can be passed through the converter, NaNs are not converted. However, the bit positions of the sign and exponent fields are altered: the sign bit of the IEEE number is transferred to the sign bit of the TMS320C30 format, and the exponent field is likewise transferred to the TMS320C30 exponent field. In this way, the sign of the NaN is preserved, which may aid in quick detection of the code; that is, the TMS320C30 Branch on Positive (BP) and Branch on Negative (BN) instructions remain effective. So that the command can be acted on quickly, a NaN interrupt is generated.

## $\pm$ Infinity

When positive or negative infinity is passed through the converter, the most positive or most negative TMS320C30 number is produced.

## Denormalized Numbers Whose Magnitude $<2^{-127}$

Half of the denormalized IEEE numbers are out of range of TMS320C30 numbers. These denormalized numbers have very small magnitudes and are therefore forced to zero when converted.

## $\pm$ Zero

The IEEE format includes representations for positive and negative zero, but the TMS320C30 format does not. The converter forces each of these numbers to the singular TMS320C30 zero format.

## TMS320C30-to-IEEE Exceptions

There are two sets of TMS320C30 numbers that do not perfectly match IEEE numbers. One set consists of a single value ($-2^{128}$). The other consists of numbers converted to IEEE denormalized numbers.

## $-2^{128}$

The single value, $-2^{128}$ (Case 12: exponent 7Fh with a mantissa of $-2$), is a very large negative number that lies just outside the range of IEEE normalized numbers. When this number is translated, negative infinity is produced.

## Numbers Translated to Denormalized Values

When the exponent is $-127$, denormalized IEEE numbers are produced, and one least significant bit of accuracy is lost. This occurs because the TMS320C30 mantissa must be right-shifted one bit so that the exponent can be increased to $-126$, which is the most negative exponent the IEEE format can use.

## Converter Operating Modes

The converter is controlled by the TMS320C30. Conversions occur when the converter's output enable pin $(\overline{\mathrm{OE}})$ is active (i.e., low) and the TMS320C30 performs a read or write over its primary ( $\overline{\mathrm{STRB}}$ active) or expansion ( $\overline{\mathrm{MSTRB}}$ active) buses. This requires the converter to be placed directly between the TMS320C30 and external memory. That memory is where IEEE data will be stored. If direct (i.e., no conversion wanted) access to that memory is desired, transceivers like the SN74LS245 should be added in parallel with the converter. However, doing so requires that only one data path be enabled at a time. If unused, one of the XF pins of the TMS320C30 can be dedicated to perform this selection.

During a read, data is converted from IEEE format to TMS320C30 format. During a write, data is converted from TMS320C30 format to IEEE format. This will happen if the TMS320C30 $\mathrm{R} / \overline{\mathrm{W}}$ or XR/ $\overline{\mathrm{W}}$ pin is tied to the converter's direction (DIR) pin. Table 8 shows how to put the converter into its two operating modes and briefly describes each mode.

Table 8. Converter Operating Modes

| Mode | Pin | Description |
| :---: | :---: | :--- |
| Memory | PIPE=0 | Flow-Through Conversion Enabled - In this mode, the converter essentially behaves like a simple bus transceiver, such as an SN74LS245, except with an integrated floating-point format converter. When this mode is used, conversions take two cycles. Because of this, the converter automatically generates a wait state, which halts the TMS320C30 for one cycle until the conversion is complete. |
| Pipeline | PIPE=1 | Converter's Pipeline Registers Enabled Internally - This mode permits single-cycle conversion. As one data value is being converted, a previously converted value is output. |

## Memory Mode Operation

In this mode, one wait cycle is automatically generated during conversions from

- IEEE format to TMS320C30 format (reads)
- TMS320C30 format to IEEE format (writes)

The converter will not generate wait cycles of any other length and requires that the TMS320C30 H1 clock pin be tied to the converter's CLK pin. Figure 17 shows the timing diagram for this mode of operation.

Figure 17. Memory Mode Timing Diagram


## Pipelined Operation

Pipeline mode permits consecutive conversions every instruction cycle without wait cycles. However, because the pipeline has two internal stages, it takes two consecutive occurrences of the same operation (i.e., two reads or two writes) before it is filled. Therefore, the first read after a transition from a write will not provide properly converted data, and vice versa.

There is an address skew of one address when consecutive data values are converted. This should not be a major problem when blocks of memory are converted. The only added task will be to perform one extra transfer (read or write) to convert the last value remaining in the pipeline. With this exception, operation is identical to the Memory mode. Figure 18 shows a timing diagram for this mode of operation.
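
The one-address skew can be pictured with a small behavioral model in Python. This is a sketch only: it assumes a latency of exactly one read (each read returns the value converted during the previous read), which is how the block-conversion examples later in this report use the device, and it does not model bus timing, writes, or read/write transitions. The names `PipelinedConverter` and `convert_block` are hypothetical.

```python
class PipelinedConverter:
    """Toy model: a read returns the value converted on the previous read."""

    def __init__(self, memory, convert):
        self.memory = memory      # list standing in for external IEEE memory
        self.convert = convert    # conversion function applied to each word
        self.held = None          # pipeline register; undefined until primed

    def read(self, address):
        out = self.held
        self.held = self.convert(self.memory[address])
        return out


def convert_block(memory, convert):
    """Convert a whole block: one priming read plus one read per value."""
    converter = PipelinedConverter(memory, convert)
    converter.read(0)                                  # prime; result discarded
    results = [converter.read(a) for a in range(1, len(memory))]
    results.append(converter.read(len(memory) - 1))    # extra read flushes last value
    return results
```

A block of n values therefore costs n + 1 reads in this model, which matches the "one extra transfer" described above; the extra read here simply repeats the last address.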

Figure 18. Pipeline Mode Timing Diagram


## Interrupts

The converter automatically generates an interrupt whenever the conversion of an IEEE number classified as Not a Number (NaN) is attempted. The interrupt pulse is 1.5 H1 cycles wide. This is compatible with the TMS320C30 edge-triggered interrupt types. Table 9 shows this interrupt and its trigger. Note that the converter does not change the value of the NaN, but it does alter its bit positions. This assures that the sign bit of the IEEE number remains a sign bit in the TMS320C30 format. The same is true of the exponent field. The fractional field is left unchanged. If NaN is used to pass a code or command to the TMS320C30, interpretation of the code requires only the alteration of the comparison mask in software. For more information, refer to the previous subsection NaN (Not a Number).

Table 9. NaN Interrupt

| Name | Function | Sources |
| :---: | :---: | :---: |
| NAN | Not a Number | IEEE CASE1: NaN |

## Simple Nonpipelined Conversion

If an external device (i.e., RAM, ROM, dual bus RAM, latch, etc.) contains a single-precision IEEE floating-point number and the corresponding TMS320C30 number is needed, the following TMS320C30 code will perform the required conversion:

| EXTD | .word | 0800000h | ; put address of external device here |
| :--- | :--- | :--- | :--- |
|  | LDI | @EXTD,AR0 | ; load AR0 w/address of external device |
|  | LDF | *AR0,R0 | ; R0=C30 formatted number |

The following example performs TMS320C30-to-IEEE format conversion:

```
EXTD    .word   0800000h    ; put address of external device here

        LDI     @EXTD,AR0   ; load AR0 w/address of external device
        STF     R0,*AR0     ; location pointed to by AR0=IEEE formatted
```


## Simple Pipelined Conversion

This example illustrates the overhead when the converter's pipeline mode is used. Since a single value will be converted, it is necessary to read the converter one extra time to flush the pipeline. Once again, assume that an external device (i.e., RAM, ROM, dual bus RAM, latch, etc.) contains a single-precision IEEE floating-point number, and the corresponding TMS320C30 number is needed.


The following example performs TMS320C30 to IEEE format conversion:

| EXTD | .word | 0800000h | ; put address of external device here |
| :--- | :--- | :--- | :--- |
| * |  |  |  |
|  | LDI | @EXTD,AR0 | ; load AR0 w/address of external device |
|  | STF | R0,*AR0 | ; value stored not correct until 2nd store |
|  | STF | R0,*AR0 | ; location pointed to by AR0=IEEE formatted |

## Pipelined Block Conversions

In the previous subsection, the pipeline was used, but not efficiently. This example shows a more typical application of pipeline mode. Again, external memory contains IEEE formatted data.

| N | .set | 03FFh | ; N = # of values to convert - 1 |
| :--- | :--- | :--- | :--- |
| EXTD | .word | 0800000h | ; put external address here |
| DADR | .word | 0809800h | ; put destination address here |
| * |  |  |  |
|  | LDI | @EXTD,AR0 | ; load AR0 w/address of external device |
|  | LDI | @DADR,AR1 | ; load AR1 w/destination address |
|  | LDF | *AR0++,R0 | ; prime (preload) the converter's pipeline |
|  | LDI | N,RC | ; block will be repeated N+1 (0400h) times |
|  | RPTB | RCR | ; specify end address of block repeat |
|  | LDF | *AR0++,R0 | ; read converted values into R0 |
| RCR: | STF | R0,*AR1++ | ; store converted values into on-chip memory |

This is more efficient:

| N | .set | 03FEh | ; N = # of values to convert - 2 |
| :--- | :--- | :--- | :--- |
| EXTD | .word | 0800000h | ; put external address here |
| DADR | .word | 0809800h | ; put destination address here |
|  | LDI | @EXTD,AR0 | ; load AR0 w/address of external device |
|  | LDI | @DADR,AR1 | ; load AR1 w/destination address |
|  | LDF | *AR0++,R0 | ; prime (preload) the converter's pipeline |
|  | LDF | *AR0++,R0 | ; read 1st converted value for 1st STF |
|  | RPTS | N | ; repeat next instruction N+1 (03FFh) times, |
|  |  |  | ; extra loop is to store last value converted |
|  | LDF | *AR0++,R0 | ; read converted values into R0 |
| \|\| | STF | R0,*AR1++ | ; store converted values into on-chip |
|  |  |  | ; memory, 1st store will save junk |
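
The priming read and the extra pass in the listings above can be modeled on a host in C. The sketch below assumes a pipeline that is one stage deep, which is inferred from the comments "prime (preload) the converter's pipeline" and "value stored not correct until 2nd store"; the names `Pipeline`, `pipeline_read`, and `block_convert` are hypothetical, and `convert` is only a placeholder for the real field rearrangement.

```c
#include <stddef.h>

/* Placeholder for the format conversion itself (hypothetical: the real
 * converter rearranges the sign, exponent, and fraction fields). */
long convert(long raw)
{
    return raw + 1000;
}

/* Assumed one-stage pipeline: each read latches the addressed word and
 * returns the conversion of the word latched by the previous read. */
typedef struct {
    long latched;
} Pipeline;

long pipeline_read(Pipeline *p, long raw)
{
    long out = convert(p->latched);
    p->latched = raw;
    return out;
}

/* Converts n words and returns the number of reads issued (n + 1):
 * one priming read, then n reads that each yield the previous word. */
size_t block_convert(const long *src, long *dst, size_t n)
{
    Pipeline p = { 0 };
    size_t reads = 0;
    size_t i;

    (void) pipeline_read(&p, src[0]);               /* prime the pipeline */
    reads++;
    for (i = 0; i < n; i++) {
        /* the final read only flushes; its input word is a dummy */
        long next = (i + 1 < n) ? src[i + 1] : 0;
        dst[i] = pipeline_read(&p, next);
        reads++;
    }
    return reads;
}
```

In this model the last read uses a dummy input, whereas the assembly loops simply read one location past the end of the source block.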

The following example performs TMS320C30 to IEEE format conversion:

| N | .set | 0400h | ; N equals number of values to convert |
| :--- | :--- | :--- | :--- |
| EXTD | .word | 0800000h | ; put external address here |
| SADR | .word | 0809800h | ; put source data address here |
|  | LDI | @EXTD,AR0 | ; load AR0 w/address of external device |
|  | LDI | @SADR,AR1 | ; load AR1 w/source data address |
|  | LDI | N,RC | ; block will be repeated N+1 (0401h) times, |
|  |  |  | ; extra loop is to store last value converted |
|  | RPTB | AC | ; specify end address of block repeat |
|  | LDF | *AR1++,R0 | ; read TMS320C30 format numbers into R0 |
| AC: | STF | R0,*AR0++ | ; store converted values into external device |

This is more efficient:

| N | .set | 03FFh | ; N equals number of values to convert - 1 |
| :--- | :--- | :--- | :--- |
| EXTD | .word | 0800000h | ; put external address here |
| SADR | .word | 0809800h | ; put source data address here |
|  | LDI | @EXTD,AR0 | ; load AR0 w/address of external device |
|  | LDI | @SADR,AR1 | ; load AR1 w/source data address |
|  | LDF | *AR0++,R0 | ; read 1st converted value for 1st STF |
|  | RPTS | N | ; repeat next instruction N+1 (0400h) times, |
|  |  |  | ; extra loop is to store last value converted |
|  | LDF | *AR1++,R0 | ; read converted values into R0 |
| \|\| | STF | R0,*AR0++ | ; store converted values into external |
|  | STF | R0,*AR0++ | ; store last value |

## Using TMS320C30 External Flag 0 (XF0)

As mentioned in the section on converter operating modes, one of the TMS320C30's XF pins can be tied to the converter's output enable (OE) pin to enable the data path through the converter or to bypass it, as the case may be. The following TMS320C30 code uses the TMS320C30 XF0 pin to do this (see the Hardware Application Examples section later in this report for the hardware configuration). Nonpipelined mode is assumed.

| N | .set | 03FFh | ; N equals number of values to convert - 1 |
| :--- | :--- | :--- | :--- |
| EXTD | .word | 0800000h | ; put external address here |
| SADR | .word | 0809800h | ; put source data address here |
|  | LDI | @EXTD,AR0 | ; load AR0 w/address of external device |
|  | LDI | @SADR,AR1 | ; load AR1 w/source data address |
|  | LDI | 2,IOF | ; XF0=output=0, select the converter |
|  | LDF | *AR0++,R0 | ; read 1st converted value for 1st STF |
|  | RPTS | N | ; repeat next instruction N+1 (0400h) times |
|  | LDF | *AR0++,R0 | ; read converted values into R0 |
| \|\| | STF | R0,*AR1++ | ; store converted values into on-chip |
|  |  |  | ; memory, 1st store will save junk |
|  | LDI | 6,IOF | ; XF0=output=1, deselect the converter |

## Using the TMS320C30 DMA Capability

The built-in TMS320C30 DMA controller can be used to read converted IEEE values. The TMS320C30 assembly code to set up the DMA is shown below. Non-pipelined mode is assumed.

| DMA | .word | 0808000h | ; base address of DMA registers |
| :--- | :--- | :--- | :--- |
| GLBL | .word | 0C53h | ; DMA global register init value |
| N | .set | 0400h | ; N equals number of values to convert |
| EXTD | .word | 0800000h | ; put external address here |
| DADR | .word | 0809800h | ; put destination data address here |
| * |  |  |  |
| * | DMA controller setup |  |  |
|  | LDI | @DMA,AR0 | ; AR0 -> DMA control registers |
|  | LDI | @EXTD,R0 | ; R0 = address of IEEE data |
|  | LDI | @DADR,R1 | ; R1 = converted data destination address |
|  | LDI | N,R2 | ; R2 = DMA transfer count |
|  | LDI | @GLBL,R3 | ; R3 = DMA Global register initial value |
|  | STI | R0,*+AR0(4) | ; DMA will transfer from external device |
|  | STI | R1,*+AR0(6) | ; DMA will transfer to RAM block 0 |
|  | STI | R2,*+AR0(8) | ; DMA will transfer N values |
|  | STI | R3,*AR0 | ; start the DMA |

## Hardware Application Examples

## IEEE Data Stored in TMS320C30 External MSTRB Memory

The following example interfaces the converter to TMS320C30 external memory that contains only IEEE-formatted data. In this configuration, it is likely that the memory would be dual bus RAM to enable a second processor to share data with the TMS320C30 through this memory. Figure 19 shows an interface to a static RAM (SRAM) bank.

Figure 19. Interface to Static RAM


## Bypassing the Converter

A previous subsection (Using TMS320C30 External Flag 0) showed TMS320C30 assembly code that used the TMS320C30 XF0 pin either to steer data through the converter or to bypass the converter for direct, or unconverted, access to that memory. Figure 20 shows a circuit that can be used with that code.

Figure 20. Steered Access to the Memory


## JTAG/IEEE-1149.1 Scan Interface

Integrated circuit and board-level testing is increasingly important. JTAG or IEEE-1149.1 is a standard test methodology. It is based on a 4 -wire connection to a device and provides access to all I/O buffers (boundary scan) of a device. This permits stimulation and observation of internal logic. By allowing stimulation of output pins and observation of input pins, external circuitry can also be tested. If implemented completely, this can eliminate "bed of nails" test rigs.

The TMS320C30-IEEE Floating-Point Format Converter is equipped with a JTAG/IEEE-1149.1 compatible scan interface. The internal architecture is based on Texas Instruments' SCOPE™ design specifications. This provides for boundary-scanning of the device and inclusion of an eight-bit instruction register.

Figure 21 shows the internal scan architecture and gives the naming conventions used to describe the device blocks:

Figure 21. Scan Architecture


## I/O Pin Description

## TCK

The TCK input clock signal is the scan clock. It typically will be generated off-board by a test controller. All tests of the device are controlled by an external controller and proceed at the scan clock (TCK) speed.

## TMS

The TMS input signal is clocked in by TCK. TMS controls the test mode of the device. Using TMS and TCK, a test controller can scan registers through the device, perform tests, or place the device in a normal functional mode.

## TDI

The TDI input signal is used to input serial data through the registers in the device. All data is clocked in by TCK and shifts according to the state of the test logic set up by an external test controller using TMS and TCK.

## TDO

The TDO output signal is used to scan serial test data out of the device under the control of the test host. While shifting data, TDO is active and shifts data out on the falling edge of TCK. When shifting is complete, TDO is tri-stated.

## TIP

TIP is an output indicating good or bad parity in the instruction register. The indication defaults to good if the external controller does not check for parity. To check parity, the test controller places the device in the instruction register pause state. While in this state, the device will output the actual (i.e., hardware-determined) parity of the device's instruction register. A high logic level indicates good parity, while a low logic level indicates bad parity.

## Architectural Elements

## TITAP

The Texas Instruments' Test Access Port (TITAP) is a 16-state state-machine designed according to the JTAG and IEEE-1149.1 specifications. The TITAP controls the test logic and is controlled by the TMS and TCK inputs to the device from an external test host controller.
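
As an illustration of how TMS and TCK steer a 16-state test access port, the following C sketch encodes the state diagram defined by IEEE 1149.1. The state names are those of the standard, not names taken from this report, and `tap_step` and `tap_run` are hypothetical helpers; the sketch assumes the TITAP follows the standard diagram exactly.

```c
typedef enum {
    TEST_LOGIC_RESET, RUN_TEST_IDLE,
    SELECT_DR_SCAN, CAPTURE_DR, SHIFT_DR, EXIT1_DR,
    PAUSE_DR, EXIT2_DR, UPDATE_DR,
    SELECT_IR_SCAN, CAPTURE_IR, SHIFT_IR, EXIT1_IR,
    PAUSE_IR, EXIT2_IR, UPDATE_IR,
    TAP_STATE_COUNT
} TapState;

/* Next state for each current state, indexed by the TMS level (0 or 1)
 * sampled on the rising edge of TCK. */
static const TapState tap_next[TAP_STATE_COUNT][2] = {
    [TEST_LOGIC_RESET] = { RUN_TEST_IDLE, TEST_LOGIC_RESET },
    [RUN_TEST_IDLE]    = { RUN_TEST_IDLE, SELECT_DR_SCAN   },
    [SELECT_DR_SCAN]   = { CAPTURE_DR,    SELECT_IR_SCAN   },
    [CAPTURE_DR]       = { SHIFT_DR,      EXIT1_DR         },
    [SHIFT_DR]         = { SHIFT_DR,      EXIT1_DR         },
    [EXIT1_DR]         = { PAUSE_DR,      UPDATE_DR        },
    [PAUSE_DR]         = { PAUSE_DR,      EXIT2_DR         },
    [EXIT2_DR]         = { SHIFT_DR,      UPDATE_DR        },
    [UPDATE_DR]        = { RUN_TEST_IDLE, SELECT_DR_SCAN   },
    [SELECT_IR_SCAN]   = { CAPTURE_IR,    TEST_LOGIC_RESET },
    [CAPTURE_IR]       = { SHIFT_IR,      EXIT1_IR         },
    [SHIFT_IR]         = { SHIFT_IR,      EXIT1_IR         },
    [EXIT1_IR]         = { PAUSE_IR,      UPDATE_IR        },
    [PAUSE_IR]         = { PAUSE_IR,      EXIT2_IR         },
    [EXIT2_IR]         = { SHIFT_IR,      UPDATE_IR        },
    [UPDATE_IR]        = { RUN_TEST_IDLE, SELECT_DR_SCAN   }
};

/* Advance the state machine by one TCK cycle. */
TapState tap_step(TapState state, int tms)
{
    return tap_next[state][tms ? 1 : 0];
}

/* Apply a string of TMS levels such as "01100", one character per TCK. */
TapState tap_run(TapState state, const char *tms_bits)
{
    while (*tms_bits != '\0') {
        state = tap_step(state, *tms_bits == '1');
        tms_bits++;
    }
    return state;
}
```

For example, `tap_run(TEST_LOGIC_RESET, "01100")` ends in `SHIFT_IR`, and five consecutive TCK cycles with TMS high return the port to `TEST_LOGIC_RESET` from any state. The instruction register pause state used for the TIP parity check corresponds to `PAUSE_IR` in this naming.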

## Instruction Register

The Instruction Register is eight bits in length. Table 10 lists the instructions available for this device.

Table 10. Test Instructions

| msb → lsb | Instruction |
| :--- | :--- |
| 00000000 | Boundary Scan |
| 10000001 | ID Register Scan |
| 10000010 | Sample Boundary Scan |
| 00000011 | Boundary Scan |
| 00000110 | Control Boundary HI-Z |
| 10000111 | Control Boundary 1/0 |
| 00001010 | Read Boundary-Normal |
| 10001011 | Read Boundary-Test |
| 00001100 | Boundary Selftest |
| 11111111 | Bypass Scan |
| All Others | Bypass Scan |

The Instruction Register is preloaded with 00000001 (msb-lsb) in the instruction register capture state of the TITAP. This does not conform to the JTAG/IEEE-1149.1 standards.
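
A test controller could decode Table 10 with a small lookup, sketched below in C. The names `ScanInstruction`, `decode_scan_instruction`, and `has_even_parity` are hypothetical, and the Bypass Scan opcode is taken to be all ones. The parity helper reflects an observation rather than a statement from this report: every opcode listed in Table 10 contains an even number of 1 bits, which suggests that the msb serves as an even-parity bit for the check reported on TIP.

```c
#include <stddef.h>

typedef struct {
    unsigned char opcode;   /* 8-bit instruction, msb first as in Table 10 */
    const char   *name;
} ScanInstruction;

static const ScanInstruction scan_instructions[] = {
    { 0x00, "Boundary Scan"         },
    { 0x81, "ID Register Scan"      },
    { 0x82, "Sample Boundary Scan"  },
    { 0x03, "Boundary Scan"         },
    { 0x06, "Control Boundary HI-Z" },
    { 0x87, "Control Boundary 1/0"  },
    { 0x0A, "Read Boundary-Normal"  },
    { 0x8B, "Read Boundary-Test"    },
    { 0x0C, "Boundary Selftest"     },
    { 0xFF, "Bypass Scan"           }
};

#define SCAN_INSTRUCTION_COUNT \
    (sizeof scan_instructions / sizeof scan_instructions[0])

/* Returns the instruction name; all unlisted codes select Bypass Scan. */
const char *decode_scan_instruction(unsigned char opcode)
{
    size_t i;

    for (i = 0; i < SCAN_INSTRUCTION_COUNT; i++) {
        if (scan_instructions[i].opcode == opcode)
            return scan_instructions[i].name;
    }
    return "Bypass Scan";
}

/* Returns 1 when the opcode contains an even number of 1 bits. */
int has_even_parity(unsigned char opcode)
{
    int ones = 0;

    while (opcode != 0) {
        ones += opcode & 1;
        opcode >>= 1;
    }
    return (ones % 2) == 0;
}

/* Returns 1 when every opcode listed in the table has even parity. */
int table_has_even_parity(void)
{
    size_t i;

    for (i = 0; i < SCAN_INSTRUCTION_COUNT; i++) {
        if (!has_even_parity(scan_instructions[i].opcode))
            return 0;
    }
    return 1;
}
```

Under this reading, the captured value 00000001 has odd parity; the report does not say whether that is intentional.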

## Boundary Scan Instruction

This instruction places the device in test mode: all function inputs and outputs are controlled by the test logic. Function inputs and outputs are sampled in the data register capture state of the TITAP, and the boundary data register is selected in the data register scan path during data register scans.

## ID Register Scan Instruction

This instruction places the device in normal mode: all function inputs and outputs operate in their normal modes. The bypass data register is selected in the data register scan path during data register scans.

## Sample Boundary Scan Instruction

This instruction places the device in normal mode: all function inputs and outputs operate in their normal modes. Function inputs and outputs are sampled in the data register capture state of the TITAP, and the boundary data register is selected in the data register scan path during data register scans.

## Control Boundary HI-Z Instruction

This instruction places the device in test mode: all function outputs are tri-stated (if possible), while all function inputs operate in their normal mode. The bypass data register is selected in the data register scan path during data register scans.

## Control Boundary 1/0 Instruction

This instruction places the device in test mode: all function inputs and outputs are controlled by the test logic. The bypass data register is selected in the data register scan path during data register scans.

## Read Boundary - Normal Instruction

This instruction places the device in normal mode: all function inputs and outputs operate in their normal modes. The boundary data register retains its current state in the data register capture state of the TITAP, and the boundary data register is selected in the data register scan path during data register scans.

## Read Boundary - Test Instruction

This instruction places the device in test mode: all function inputs and outputs are controlled by the test logic. The boundary data register retains its current state in the data register capture state of the TITAP, and the boundary data register is selected in the data register scan path during data register scans.

## Boundary Self-Test Instruction

This instruction places the device in normal mode: all function inputs and outputs operate in their normal modes. The boundary data register contents are toggled in the data register capture state of the TITAP, and the boundary data register is selected in the data register scan path during data register scans.

## Bypass Scan Instruction

This instruction places the device in normal mode: all function inputs and outputs operate in their normal modes. The bypass data register is selected in the data register scan path during data register scans.

## Boundary Data Register

The boundary data register contains 70 bits and is ordered according to Figure 22.

## Figure 22. Scan Path Bit Order

TDI → DIR → PIPE → CLK → OEZ → NAN → WAIT → DA31 → DA30 → ... → DA1 → DA0 → DB31 → DB30 → ... → DB1 → DB0 → TDO
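
The 70-bit length follows from the order in Figure 22: six control cells followed by two 32-bit buses. The C sketch below maps a bus bit to its cell position, counting cell 0 as the cell nearest TDI; that numbering and the names `da_cell`, `db_cell`, and `boundary_length` are assumptions made for illustration.

```c
enum {
    CONTROL_CELLS = 6,    /* DIR, PIPE, CLK, OEZ, NAN, WAIT */
    BUS_WIDTH     = 32    /* DA31..DA0 and DB31..DB0 */
};

/* Total number of cells in the boundary data register. */
int boundary_length(void)
{
    return CONTROL_CELLS + 2 * BUS_WIDTH;
}

/* Cell position of bus A bit n (0..31), counted from the TDI end. */
int da_cell(int bit)
{
    return CONTROL_CELLS + (BUS_WIDTH - 1 - bit);
}

/* Cell position of bus B bit n (0..31), counted from the TDI end. */
int db_cell(int bit)
{
    return CONTROL_CELLS + BUS_WIDTH + (BUS_WIDTH - 1 - bit);
}
```

Because DB0 sits next to TDO, it is the first bit to appear on TDO during a boundary scan, and DIR is the last.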

## Bypass Data Register

The Bypass Data Register is one bit in length and is operated in accordance with the JTAG/ IEEE-1149.1 specifications.

## Scan References

Refer to the following documents for further descriptions of the test logic of this device:

1) A Test Access Port and Boundary Scan Architecture; Technical Sub-Committee of the Joint Test Action Group (JTAG).
2) IEEE Standard 1149.1 - IEEE Standard Test Access Port and Boundary-Scan Architecture.

## Part IV. Telecommunications

11. Implementation of a CELP Speech Coder for the TMS320C30 Using SPOX (Mark D. Grosen)

# Implementation of a CELP Speech Coder for the TMS320C30 Using SPOX

Mark D. Grosen

Spectron Microsystems, Inc.

## Introduction

Speech coders are critical to many speech transmission and store-and-forward systems. With the emergence of universal standards, it is possible to develop systems that are interoperable. Quality and bit rate for speech coders vary from toll quality at 32 kilobits/second (kbps) (CCITT ADPCM) to intelligible quality at 2.4 kbps (DOD LPC-10). Recently, a new standard for 4.8 kbps with near toll-quality has been proposed and is based on code-excited linear prediction (CELP) techniques [1,2]. Unfortunately, products based on new coding algorithms are often slow to appear because of the considerable time and effort required to develop real-time implementations.

The purpose of this article is to demonstrate how a CELP coder based on this new standard can be quickly developed using SPOX. Utilizing the power of the TMS320C30 DSP plus the ease of use provided by C and the SPOX DSP library, an efficient and portable coder can be written in a much shorter period of time than that required by conventional assembly language methods. Because of the portability of SPOX and C, the coder can also be compiled and executed on a variety of hardware platforms.

## A 4.8-kbps CELP Coder

CELP coders were first introduced by Atal and Schroeder in 1984 [3]. These coders offer high quality at low bit rates, but at a high computational cost. Implementing the original systems directly required several hundred million instructions per second (MIPS). Much of the research on CELP techniques has concentrated on reducing this computational load to facilitate real-time implementations.

The proposed U.S. Federal Standard 4.8-kbps CELP coder (USFS CELP), Version 2.3, uses several techniques to reduce the complexity to a level where a one- or two-processor implementation is possible. These are the main characteristics of the coder:

- 240-sample frame size at 8-kHz sampling rate
- Tenth-order short-term predictor
  - Calculated once per frame, open loop
  - Autocorrelation with Hamming window
  - LSP quantization
- Four subframes (60 samples)
- One-tap pitch predictor
  - Closed-loop analysis
  - Even/odd subframe delta search method
- 1024-element codebook
  - Overlapped by 2 (see Pitch and Codebook Search)
  - 75% of elements are zero
Block diagrams of the decoder and encoder are shown in Figure 1.

Figure 1. USFS CELP Decoder and Encoder Structures (panels: Decoder, Encoder)

Bit allocations are given in Table 1 [2, 4].

Table 1. 4.8-kbps CELP Parameters

|  | Spectrum | Pitch | Codebook |
| :--- | :--- | :--- | :--- |
| Update | 30 ms (240 samples) | 7.5 ms (60 samples) | 7.5 ms (60 samples) |
| Parameters | 10 LSP | 1 delay, 1 gain | 1 of 1024 index, 1 gain |
| Bit rate (bps) | 1133.3 | 1466.7 | 2000 |

The remaining 200 bps are reserved for expansion, error protection, and synchronization.
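
The rates in Table 1 can be cross-checked against the 30-ms frame. The C sketch below converts bits per frame to bits per second; the per-frame bit counts (34, 44, 60, and 6) are not given in the table but are obtained by multiplying each listed rate by 0.030 s and rounding, and `bits_to_bps` is a hypothetical helper.

```c
#define FRAME_SAMPLES 240       /* samples per frame */
#define SAMPLE_RATE   8000.0    /* samples per second */

/* Bits per 30-ms frame, derived from the Table 1 rates (assumption). */
enum {
    SPECTRUM_BITS = 34,     /* 1133.3 bps */
    PITCH_BITS    = 44,     /* 1466.7 bps */
    CODEBOOK_BITS = 60,     /* 2000 bps   */
    RESERVED_BITS = 6       /* 200 bps    */
};

/* Converts a per-frame bit count to a bit rate in bits per second. */
double bits_to_bps(int bits_per_frame)
{
    return bits_per_frame * SAMPLE_RATE / FRAME_SAMPLES;
}
```

The four fields add up to 144 bits per frame, and `bits_to_bps(144)` gives the full 4800 bps.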

The standard also specifies an error protection scheme utilizing forward error-correcting Hamming code and parameter smoothing.

The major computational parts of the algorithm are the pitch search and the codebook search, both of which are performed four times per frame. An important technique to reduce the computations is the end-correction convolution technique (see Pitch and Codebook Search). This is a recursive convolution method that reduces the number of multiply-adds by an order of magnitude.

In addition, the codebook is designed to have approximately $75 \%$ of the samples equal to zero. This allows many of the convolution updates in the codebook search to be reduced to a simple shift of a vector of samples. On DSP processors with circular addressing, this shift can be replaced by using circular buffers.

To further reduce complexity, the pitch search is limited in range for every other subframe. During even-numbered subframes, the search for the optimal pitch value is performed over the range 20 to 147 (128 values). On the odd subframes, the search covers only a range of 16 from the previous pitch value. This also decreases the bit rate with a negligible effect on speech quality.

If adequate processing power is not available, you can implement an interoperable coder by using a subset of the full codebook. For example, if only the first 128 vectors from the codebook could be used, the sub-optimal coder would work with an optimal coder if the same frame structure and bit rate were used.

These techniques produce complexity estimates for the USFS CELP coder ranging from 5.3 MIPS to 16.0 MIPS for a 128-vector and 1024-vector codebook, respectively [4].

## Using SPOX in Development

The computational complexity of CELP coders, even with use of the various techniques to reduce it, has made real-time implementations impractical on first- and second-generation DSPs. The recent introduction of the third-generation TMS320C30 [5], however, makes it feasible to implement the USFS CELP coder with one or two processors. Furthermore, because of the general-purpose capabilities of the TMS320C30 and the availability of a C compiler and SPOX, development of a real-time coder can be significantly expedited.

In particular, SPOX provides the following functions to facilitate software development.

- C standard I/O functions
  - printf(), scanf()
  - fopen(), fread(), fwrite()
- Stream I/O to move data efficiently
- Standard set of DSP math functions
  - Filters
  - Vector operations
  - Windows
  - Levinson-Durbin algorithm
- Processor independence

Both FORTRAN and C versions of the Version 2.3 USFS CELP coder were available as starting points for the real-time implementation. The initial development was done on a Sun workstation equipped with SPOX/SUN [6] and the usual UNIX programming tools, such as the symbolic debugger dbx. SPOX/SUN is a library of SPOX DSP math functions that can be used for developing SPOX applications on Sun workstations. The new version of the coder utilizing SPOX was checked against the existing implementation for correctness. After the new version was debugged on the workstation, the source code was recompiled employing the Texas Instruments TMS320C30 C compiler and linked with the SPOX/XDS library for the XDS1000 development system.

The same facilities for testing the code on the workstation were available on the XDS1000. A SPOX stream function (see Input/Output section) read digitized speech from a disk file. Status information was printed to the console screen. Command line arguments were used to vary the encoder's parameters such as the codebook size.

The software development process for the USFS CELP coder followed three evolutionary steps:

- C program using standard I/O
- C program using SPOX functions for faster math and I/O
- C program using SPOX and assembly language optimizations

The first step was taken because an existing C implementation was available. The C standard I/O provided by SPOX made it possible to run the application code written in C directly on the XDS1000. For example, functions such as fscanf() that read control information from a disk file on the Sun also worked on the XDS1000 using the PC's hard disk.

In general, it would have been easier to start with the SPOX library functions to implement some of the common operations contained in the coder. Many of the functions needed (filtering, correlation, dot-product) are in the SPOX DSP library. In this case, the C implementations of these standard vector and filter functions in the existing program were replaced with the corresponding SPOX functions. The SPOX functions, written in optimized assembly language, execute several times faster than the corresponding C functions.

The last step was needed to meet real-time constraints. XDS1000 timing capabilities allowed the identification of two time-critical sections of the code which were then rewritten in TMS320C30 assembly code. Since the interface to the SPOX math functions is open, new math functions can be written that work with SPOX data structures such as vectors and filters.

## Implementation

Several major parts of the USFS CELP encoder are implemented with a mixture of C, SPOX, and TMS320C30 assembly language functions. The decoder can be easily constructed from the material presented here. An adaptive postfilter for the decoder is not described here.

The framework of the resulting encoder is shown in Figure 2. A description of the major functions performed can be found in the following sections. Appendix A provides a short summary of the SPOX functions employed in the next four sections (Input/Output, Spectrum Analysis, Filters, and Pitch and Codebook Search).

```
encoder(instream, outstream)
    SS_Stream instream;
    SS_Stream outstream;
{
    Int i;      /* subframe index */

    while ( SS_get(instream, SV_array(speech)) ) {
        /* Apply a high pass filter to the input speech */
        SF_apply(hpfilter, speech, speech);

        /* Find the coefficients of the short-term prediction filter */
        calculateLP(speech, invcoeffs);

        /*
         * Convert the direct form coefficients to line spectrum pairs.
         * Then quantize the LSP's and convert back to direct form.
         */
        SV_a2lsp(invcoeffs, lsps);
        quantizeLSP(lsps, qntzlsps);
        SV_lsp2a(qntzlsps, invcoeffs);

        /*
         * For each of the 4 subframes, determine the pitch prediction
         * parameters and codebook (excitation) parameters
         */
        for (i = 0; i < 4; i++) {
            genShortResidual(s[i], res[i]); /* generate short term residual */
            pitchSearch(s[i], res[i]);      /* find optimum pitch predictor */
            genFullResidual(s[i], res[i]);  /* generate residual */
            codeSearch(res[i], reshat);     /* find best codebook vector */
            updateFilters(reshat);          /* update filter states */
        }
        packParams();                       /* pack parameters into output array */
        SS_put(outstream, params);
    }
}
```


## Input/Output

Input speech samples are obtained by employing a function (SS_get()), which reads data from a named stream (instream). The creation of instream during program initialization determines the source of the data. During development, the easiest source is a disk file with digitized speech. When real-time testing is needed, a codec connected to a TMS320C30 serial port could be utilized. For example, instream could be created to read from standard input with the following code segment.

```
#define FRAMESIZE (240 * sizeof(Float))
instream = SS_create(DF_FILE, DF_STDIN, FRAMESIZE, NULL);
```

The output stream (outstream) consists of the packed frame parameters. It could also go to a disk file or a serial port by using SS_put().

## Spectrum Analysis

After preconditioning the signal with a highpass filter (see the Filters section), the coefficients of the short term prediction filter can be found by using the function calculateLP() shown below.

```
SV_Vector window, rc, error, cor, gammavec;

calculateLP(s, coeffs)
    SV_Vector s, coeffs;
{
    SV_window(s, window, s);            /* window the speech in-place */
    SV_corr(s, s, cor);                 /* autocorrelation */
    SV_autorc(cor, coeffs, rc, error);  /* Levinson-Durbin */
    SV_mul2(gammavec, coeffs);          /* bandwidth expansion */
}
```

The vector window is initialized to contain the desired window; in this case, a Hamming window is used. The autocorrelation terms are stored in the vector cor that has the same length as the order of the short term filter. SV_autorc() uses a Levinson-Durbin type algorithm to compute the inverse filter coefficients. As a side effect, the reflection coefficients are also stored in rc. Finally, a 15-Hz bandwidth expansion is produced by the multiplication of the inverse filter coefficient vector by a vector (gammavec) consisting of the terms

$$
g[i] = 0.994^{i} \quad \text{for } i = 0, 1, \ldots, m-1
$$
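
The terms above can be generated with a running product, as in the following C sketch. It fills a plain array rather than a SPOX vector, and `fill_gamma` is a hypothetical helper; how the original program initializes gammavec is not shown in this article.

```c
/* Fills g[0..m-1] with successive powers of gamma: g[i] = gamma^i. */
void fill_gamma(float *g, int m, float gamma)
{
    float value = 1.0f;     /* gamma^0 */
    int i;

    for (i = 0; i < m; i++) {
        g[i] = value;
        value *= gamma;
    }
}
```

Scaling coefficient i by the i-th power of gamma moves the filter roots radially toward the origin. Under the usual relation between root radius and bandwidth, an expansion of $-(f_s/\pi)\ln\gamma$, a factor of 0.994 at an 8-kHz sampling rate corresponds to roughly 15 Hz, which agrees with the figure quoted above.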

Efficient quantization is obtained by:

- Transforming the prediction coefficients into line spectrum pairs (LSPs)
- Then quantizing the LSPs

The conversions between prediction coefficients and LSPs are not currently in the SPOX library. The existing C implementation evaluates cosine values directly, which is too expensive computationally. A more efficient routine, SV_a2lsp(), that employs table lookup of cosine values has been written utilizing the algorithm outlined in [7]. The quantized LSPs are transformed back to direct-form coefficients for use in the short-term predictor.

## Filters

Three filters in the encoder can be realized by use of SPOX filter objects. The inverse filter $A(z)$ and the short term predictor $1 / A(z)$ share the same filter coefficients. The former is an FIR filter and the latter an all-pole filter. The final filter is the all-pole weighting filter $W(z)$ with coefficients given by $1 / A(\lambda z)$, with $\lambda=0.8$.

During the initialization of the encoder, the filters are created with the code fragment shown below.

```
#define FILTERSIZE (11 * sizeof(Float))

SF_Filter invfilter, predfilter, wgtfilter;
SV_Vector invcoeffs, wgtcoeffs;
SA_Array array;

/* inverse filter A(z), FIR */
array = SA_create(SG_CHIP, FILTERSIZE, NULL);
invfilter = SF_create(array, NULL, NULL);
SF_bind(invfilter, invcoeffs, NULL);

/* short term predictor 1/A(z), all-pole, shares invcoeffs */
array = SA_create(SG_CHIP, FILTERSIZE, NULL);
predfilter = SF_create(NULL, array, NULL);
SF_bind(predfilter, NULL, invcoeffs);

/* weighting filter W(z), all-pole */
array = SA_create(SG_CHIP, FILTERSIZE, NULL);
wgtfilter = SF_create(NULL, array, NULL);
SF_bind(wgtfilter, NULL, wgtcoeffs);
```

Note that the inverse and prediction filters are both bound to the same coefficient vector. For each new frame of speech, this vector is updated when it is passed to calculateLP().

An important consideration is that the filters are used more than once during a frame. A different signal is filtered each time, but the state (history) of the filter must be the same. This is accomplished before each filter operation by using the

- SF_getstate() function to recover a vector with the state of the filter at the end of the previous frame
- SF_setstate() function to restore the filter's state

The following code segment shows how the short term prediction residual is generated for the pitch search.

```
SF_setstate(predfilter, NULL, predstate);
SV_fill(residual, 0.0);
SF_apply(predfilter, residual, residual); /* zero input of filter */
SV_sub3(residual, speech, residual); /* speech - history */
SF_setstate(invfilter, invstate, NULL);
SF_apply(invfilter, residual, residual); /* filter with inverse */
SF_setstate(wgtfilter, NULL, wgtstate);
SF_apply(wgtfilter, residual, residual); /* filter with weighting */
```


## Pitch and Codebook Search

After the program finds the short-term predictor and generates the corresponding residual, the pitch predictor and code book parameters are found for each of the four subframes. The pitch and codebook search functions are similar: both search over a set of values to minimize an error term. In this section, only the codebook search is illustrated (see Figure 3). Many of the functions, however, can be applied to the pitch predictor calculations.

Figure 3. Codebook Search Block Diagram


The search in Figure 3 minimizes the distance between the input vector and one of many generated vectors. The quantity being minimized is the squared Euclidean norm of the difference:

$$
\begin{align*}
e & =\|r-\hat{r}\|^{2}  \tag{1}\\
& =r^{t} r-2 r^{t} \hat{r}+\hat{r}^{t} \hat{r}
\end{align*}
$$

where $r$ is the original residual and $\hat{r}$ is the synthesized residual.

It can be seen from the vector definition that only two terms need to be computed, the correlation of $r$ and $\hat{r}$ and the energy of $\hat{r}$; this is because the energy of the original residual is invariant over all the generated residuals. It appears that there would be $N$ convolutions and $2N$ dot products to perform for each subframe. Implemented directly, the codebook search would thus require 66 MIPS if $N = 256$ and a subframe length of 60 are specified.
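
The 66-MIPS figure can be reproduced with a simple operation count. The C sketch below is one way to arrive at it and is not taken from the article: it assumes that each multiply-add costs one instruction, that each convolution is truncated to the subframe length (so it needs L(L+1)/2 multiply-adds for L samples), and that loop overhead is ignored. `direct_search_mips` is a hypothetical helper.

```c
/* Estimated load, in millions of multiply-adds per second, of a direct
 * codebook search: one truncated convolution and two dot products per
 * code vector, repeated for every subframe. */
double direct_search_mips(int n_vectors, int subframe_len, double sample_rate)
{
    double conv_macs = subframe_len * (subframe_len + 1) / 2.0;
    double dot_macs  = 2.0 * subframe_len;
    double subframes_per_second = sample_rate / subframe_len;

    return n_vectors * (conv_macs + dot_macs) * subframes_per_second / 1.0e6;
}
```

With 256 vectors, 60-sample subframes, and an 8-kHz sampling rate, the count is 256 × 1950 multiply-adds per subframe at about 133 subframes per second, or roughly 66.6 million per second.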

Instead, the USFS CELP coder uses a specially structured codebook that greatly reduces the computational load. The biggest savings comes from the elimination of all but one of the convolutions for each subframe. The codebook is overlapped, as shown in Figure 4.

Figure 4. Structure of Overlapped Codebook



This structure permits a recursive convolution computation. The first codebook vector is convolved normally with the weighting filter. Subsequent convolutions, however, make use of the following relationships.

$$
\begin{align*}
V_{i+1}(z) & =z^{-1} \hat{R}_{i}(z)+x_{i+1}[1] H(z)  \tag{2}\\
\hat{R}_{i+1}(z) & =z^{-1} V_{i+1}(z)+x_{i+1}[0] H(z)
\end{align*}
$$

where $\hat{R}_{i}(z)$ is the $Z$-transform of the generated residual. Given the convolution of the previous codebook vector with the weighting filter, the convolution employing the next vector can be found with only 120 ($2 \times 60$) multiplies and adds.

This number can be further reduced by another property of the codebook. The vectors are generated by center-clipping a Gaussian noise source, which causes approximately 75% of the elements to be zero. Thus, 75% of the updates to the convolutions require no multiplications or additions; however, the convolution elements must still be shifted. The following function update() implements the recursive update operation. Note that it must be called twice per codebook vector, once for each new term.

```
update(x, res, wgtimpulse)
    Float x;
    SV_Vector res, wgtimpulse;
{
    Float *rptr, *rptrm1, *wptr;
    Int len;

    len = SV_getlength(res);
    rptr = (Float *) SV_loc(res, len - 1);  /* last element of res */
    rptrm1 = rptr - 1;                      /* element before it */
    if ( x == 0.0 ) {       /* no input, so just shift */
        for (; len > 1; len--) {
            *rptr-- = *rptrm1--;
        }
        *rptr = 0.0;
    }
    else {                  /* update using new input */
        wptr = (Float *) SV_loc(wgtimpulse, len - 1);
        for (; len > 1; len--) {
            *rptr-- = *rptrm1-- + x * *wptr--;
        }
        *rptr = x * *wptr;
    }
}
```
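
The recursion can be checked on a host without SPOX. The C sketch below applies the same shift-and-add step to plain arrays and compares the result with a direct convolution of the next overlapped vector. The test data, the decaying impulse response, and the names `convolve_direct`, `update_plain`, and `max_update_error` are all made up for illustration.

```c
#define SUBFRAME 60

/* Direct truncated convolution: y[n] = sum over k = 0..n of x[k] * h[n-k]. */
void convolve_direct(const float *x, const float *h, float *y, int len)
{
    int n, k;

    for (n = 0; n < len; n++) {
        float acc = 0.0f;
        for (k = 0; k <= n; k++)
            acc += x[k] * h[n - k];
        y[n] = acc;
    }
}

/* Recursive step: shift y by one sample and add the response to the new
 * leading excitation sample a, so that y'[n] = y[n-1] + a * h[n]. */
void update_plain(float a, float *y, const float *h, int len)
{
    int n;

    for (n = len - 1; n > 0; n--)
        y[n] = y[n - 1] + a * h[n];
    y[0] = a * h[0];
}

/* Largest absolute difference between the recursive and direct results
 * for two code vectors that overlap by all but two samples. */
float max_update_error(void)
{
    float code[SUBFRAME + 2], h[SUBFRAME];
    float y_rec[SUBFRAME], y_dir[SUBFRAME];
    float worst = 0.0f;
    int n;

    for (n = 0; n < SUBFRAME + 2; n++)      /* sparse, mostly zero samples */
        code[n] = (n % 4 == 0) ? (float) (n % 7 - 3) : 0.0f;
    h[0] = 1.0f;                            /* decaying impulse response */
    for (n = 1; n < SUBFRAME; n++)
        h[n] = 0.8f * h[n - 1];

    convolve_direct(code + 2, h, y_rec, SUBFRAME);  /* vector i, direct */
    update_plain(code[1], y_rec, h, SUBFRAME);      /* new element [1] */
    update_plain(code[0], y_rec, h, SUBFRAME);      /* new element [0] */
    convolve_direct(code, h, y_dir, SUBFRAME);      /* vector i+1, direct */

    for (n = 0; n < SUBFRAME; n++) {
        float d = y_rec[n] - y_dir[n];
        if (d < 0.0f)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

The two results should agree to within floating-point rounding, since both compute the same sums in a different order.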

Once the convolution has been determined, the corresponding error and gain can be found.
The following function calculates the error and gain terms.

```
Float error(res, reshat, gain)
    SV_Vector res, reshat;
    Float *gain;
{
    Float cor, energy;
    SV_dotp(reshat, reshat, &energy);
    SV_dotp(reshat, res, &cor);
    *gain = cor / energy;
    return( *gain * cor );
}
```
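The value returned by error( ) can be interpreted with a short derivation. The symbols $r$ (target residual), $\hat{r}$ (filtered codebook vector), and $g$ (gain) are notation assumed here for illustration; cor and energy are the two dot products computed in the function.

$$
\begin{align*}
E(g) & =\|r-g \hat{r}\|^{2}=\langle r, r\rangle-2 g\langle\hat{r}, r\rangle+g^{2}\langle\hat{r}, \hat{r}\rangle \\
g^{*} & =\frac{\langle\hat{r}, r\rangle}{\langle\hat{r}, \hat{r}\rangle}=\frac{\mathrm{cor}}{\mathrm{energy}} \\
E\left(g^{*}\right) & =\langle r, r\rangle-g^{*} \, \mathrm{cor}
\end{align*}
$$

Because $\langle r, r\rangle$ is the same for every codebook entry, the entry with the largest returned value $g^{*} \, \mathrm{cor}$ is the one with the smallest squared error.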

The codebook search function with update( ) and error( ) functions is shown below. The first convolution must be calculated directly, so it is done outside of the main for loop. The error for each entry is compared against the current maximum; if it is greater than the maximum, this entry becomes the new best vector. The process is repeated for each of the $N$ vectors.

```
SV_Vector codebook, wgtimpulse;

codeSearch(res, reshat)
    SV_Vector res, reshat;
{
    Float errmax, gain, err;
    Float *cbptr;
    Int i, best;

    findImpulse(wgtimpulse);
    SV_setbase(codebook, FIRSTVEC);
    convolve(codebook, wgtimpulse, reshat);    /* first vector: direct convolution */
    errmax = error(res, reshat, &gain);
    best = 0;
    cbptr = (Float *) SV_loc(codebook, 0) - 1;
    for (i = 1; i < N; i++) {
        update(*cbptr--, reshat, wgtimpulse);  /* two new terms per vector */
        update(*cbptr--, reshat, wgtimpulse);
        if ( (err = error(res, reshat, &gain)) > errmax ) {
            errmax = err;
            best = i;
        }
    }
}
```

After the search is completed, the gain of the best vector is recomputed and quantized. The corresponding gain index and index of the codebook element can then be readied for transmission.

## Assembly Language Enhancements

The codebook and pitch searches require the largest share of the computation cycles in the encoder. One way to increase performance is to recode critical parts of these functions in assembly language. One such function is the update( ) function described above for the recursive convolution computation.

An assembly language version of update() was written to take advantage of the parallel instructions and repeat block capabilities of the TMS320C30. The assembly language function utilizes the same calling structure as the C version. The function was written using the assembly language macros provided with SPOX to work with the vector, matrix, and filter objects in the DSP library[8]. The new version of update( ) is listed in Figure 5.

Figure 5. Update Function Written in TMS320C30 Assembly Language

```
*
* Synopsis:
*
* Void update(x, res, wgtimpulse)
*
*
*
#include <sv30.h>
FP .set ar3
    .global _update
    .text
_update:
    push FP
    ldi sp, FP
* Set the following registers by using vector object macros
* ar0 - SV_loc(wgtimpulse, 0)
* ar1 - SV_loc(res, 0)
* rc - the length of the vectors
* r1 - x
*
    ldi *-FP(2), ar2
    SV_get1 ar2, SV_LOC0, ar0
    SV_get2 ar2, SV_LEN|SV_LOC0, rc, ar1
    ldf *-FP(4), r1 ; x; if it is 0, just shift
    subi 1, rc
    addi rc, ar1 ; ar1 -> res[l - 1]
    ldi ar1, ar2 ; ar2 -> res[i - 1]
* General case when x != 0.0
    addi rc, ar0 ; ar0 -> wgt[l - 1]
    subi 2,rc ; set loop count
    mpyf r1, *ar0--, r2 ; x * wgt[i]
    addf r2, *--ar2, r0
    rptb lp20
    mpyf r1, *ar0--, r2 ; x * wgt[i]
lp20: addf r2, *--ar2, r0
    stf r0, *ar1--
        bud end
        stf r0, *ar1--
        mpyf r1, *ar0, r0 ; res[0] = x*wgt[0]
        stf r0, *ar1
*
* Case for x == 0.0
*
shift: subi 2, rc ; loop l - 1 times
    ldf *--ar2, r0 ; prime the pipe
    rptb slp
slp: stf *--ar2, r0
    stf r0, *ar1-- ; final store
    ldf 0.0,r0 ; first term = 0.0
    stf r0, *ar1
*
end: pop FP
    rets
```


## Performance

A complete CELP encoder was implemented as described above. Two versions were tested:

- One encompassing C and standard SPOX functions
- One having C, SPOX, and two custom TMS320C30 assembly language functions

Table 2 shows the execution times for different combinations of codebook size, processor, and implementation. To achieve near real-time performance for a codebook with 128 vectors, the codebook and pitch search functions were completely rewritten in assembly language. Each function required approximately 130 lines of assembly code.

Table 2. Timing of Various Implementations of the CELP Encoder for One Frame of Speech

| Codebook Size | Sun (C/SPOX) | C30 (C/SPOX) | C30 (C/SPOX/ASM) |
| :---: | :---: | :---: | :---: |
| 128 | 16,000 ms | 88.2 ms | 39.0 ms |
| 256 | 24,000 ms | 114.6 ms | 54.3 ms |

Memory requirements for the program on the TMS320C30 were approximately 14,000 words for instructions and approximately 6,000 words for data. The application code required approximately 4500 words of instructions. The SPOX operating system and DSP math functions consumed the remaining 9500 words of memory. This figure reflects many functions that are essential for easing development but unnecessary for a real-time implementation.

Once a real-time implementation has been achieved, the SPOX memory requirements can be greatly reduced by porting (or customizing) SPOX to a custom hardware implementation. In this case, the SPOX memory requirements can be reduced to approximately 4000 words, making a 12K-word implementation feasible (both data and instruction memory requirements).

These timings show that a real-time CELP coder can be implemented on a single TMS320C30. They also illustrate the power of the TMS320C30 compared to a standard microprocessor. Note that a TMS320C30 implementation has approximately 500,000 instruction cycles available in a 30-ms frame.
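The cycle budget follows from the 60-ns single-cycle instruction time of the TMS320C30:

$$
N_{\text {cycles }}=\frac{30 \mathrm{~ms}}{60 \mathrm{~ns}}=\frac{30 \times 10^{-3}}{60 \times 10^{-9}}=500{,}000
$$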

Version 3.0 of the USFS CELP coder has significant improvements in computational complexity, including:

- Ternary codebook to eliminate multiplications
- Shorter codebook
- Faster LSP conversion and quantization

Work to bring the SPOX implementation up to Version 3.0 is continuing. An investigation of a two-processor implementation is also being performed.

## Summary

A 4.8-kbps CELP coder based on a Department of Defense-proposed standard has been implemented on a TMS320C30. Several of the functions used in the encoder were illustrated. A suboptimal implementation of the encoder using a 128-vector codebook is possible on only one TMS320C30. Work is continuing on both the algorithm and the software implementation to improve the coder's real-time performance.

With SPOX, the encoder was developed in less than one month. The resulting source (with the exception of two TMS320C30 assembly language functions) can be compiled and run on a Sun workstation, a PC, or a TMS320C30 system such as the Texas Instruments XDS1000. This represents a considerable improvement in development time and effort over previous implementation methods.

## References

1) Kemp, D. P., Sueda, R. A., and Tremain, T. E., "An Evaluation of 4800 bps Voice Coders," Proceedings of ICASSP '89, IEEE, May 1989.
2) Campbell, J. P., Welch, V. C., and Tremain, T. E., "An Expandable Error-Protected 4800 bps CELP Coder," Proceedings of ICASSP '89, IEEE, May 1989.
3) Atal, B. S., and Schroeder, M. R., "Stochastic Coding of Speech at Very Low Bit Rates," Proceedings of ICC '84, pages 1610-1613, 1984.
4) Tremain, T. E., Campbell, J. P., and Welch, V. C., "A 4.8 kbps Code Excited Linear Predictive Coder," Proceedings of Mobile Satellite Conference, pages 491-496, May 1988.
5) Texas Instruments, Inc., Third-Generation TMS320 User's Guide, 1988.
6) Spectron MicroSystems, Inc., SPOX/SUN User's Guide, April 1989.
7) Soong, F. K., and Juang, B. H., "Line Spectrum Pair (LSP) and Speech Data Compression," Proceedings of ICASSP '84, pages 1.10.1-1.10.4, IEEE, 1984.
8) Spectron MicroSystems, Inc., Adding Math Functions to SPOX, March 1989.

## Appendix A

The SPOX functions used in the code examples are briefly described below. Complete descriptions can be found in Getting Started With SPOX and the SPOX Programming Reference Manual. These manuals are supplied with the XDS1000. They are also available from Spectron MicroSystems, Inc.

Stream Functions

```
SS_get - get data from a stream into an array
    Int SS_get(stream, array)
        SS_Stream stream;
        SA_Array array;
SS_put - put data from an array to a stream
    Int SS_put(stream, array)
        SS_Stream stream;
        SA_Array array;
```

Vector Functions

```
SV_autorc - perform inverse filter calculations
    Void SV_autorc(cor, inv, rc, alpha)
        SV_Vector cor;
        SV_Vector inv;
        SV_Vector rc;
        SV_Vector alpha;
SV_corr - calculate correlation of two vectors
    SV_Vector SV_corr(src1, src2, dst)
        SV_Vector src1;
        SV_Vector src2;
        SV_Vector dst;
SV_dotp - calculate the dot product of two vectors
    SV_Vector SV_dotp(src1, src2, result)
        SV_Vector src1;
        SV_Vector src2;
        Float *result;
SV_fill - fill a vector with a value
    SV_Vector SV_fill(vector, value)
        SV_Vector vector;
        Float value;
SV_getlength - return the length of a vector
    Int SV_getlength(vector)
        SV_Vector vector;
SV_loc - return the address of a vector element
    Ptr SV_loc(vector, num)
        SV_Vector vector;
        Int num;
SV_mul2 - multiply elements of two vectors
    SV_Vector SV_mul2(src, dst)
        SV_Vector src;
        SV_Vector dst;
SV_setbase - set the base of a vector
    Void SV_setbase(vector, base)
        SV_Vector vector;
        Int base;
SV_sub3 - subtract elements of two vectors and store results in a third
    vector
    SV_Vector SV_sub3(src1, src2, dst)
        SV_Vector src1;
        SV_Vector src2;
        SV_Vector dst;
SV_window - apply a symmetric window to a vector
    SV_Vector SV_window(src, wnd, dst)
        SV_Vector src;
        SV_Vector wnd;
        SV_Vector dst;
```

Filter Functions

```
SF_apply - apply a filter to a vector
    SV_Vector SF_apply(filter, input, output)
        SF_Filter filter;
        SV_Vector input;
        SV_Vector output;
SF_bind - bind coefficient vectors to a filter
    Void SF_bind(filter, num, den)
        SF_Filter filter;
        SV_Vector num;
        SV_Vector den;
SF_getstate - copy filter state arrays into vectors
    Void SF_getstate(filter, hisinv, hisoutv)
        SF_Filter filter;
        SV_Vector hisinv;
        SV_Vector hisoutv;
SF_setstate - copy vectors into filter state arrays
    Void SF_setstate(filter, hisinv, hisoutv)
        SF_Filter filter;
        SV_Vector hisinv;
        SV_Vector hisoutv;
```


## Part V. Computers

12. A DSP-Based Three-Dimensional Graphics System
(Nat Seshan)

# A DSP-Based Three-Dimensional Graphics System 

Nat Seshan

Digital Signal Processor Products-Semiconductor Group
Texas Instruments

This application report is based on the author's bachelor's thesis at the Massachusetts Institute of Technology.

The placement of a high-performance computational engine, such as an advanced digital signal processor, between the host processor and the video controller in a graphics system can improve performance tremendously. Several factors make the Texas Instruments TMS320C30 Digital Signal Processor well-suited to this task:

- 32-bit floating-point arithmetic provides both high resolution and large dynamic range in calculations.
- Single-cycle, 60-ns instruction execution and parallel bus access greatly improve system throughput.
- A hardware single-cycle multiplier facilitates the matrix arithmetic, which is frequently required in 3D graphics.
- The ease of programmability allows the design of flexible and expandable systems.
- Software tools, such as simulators[1], assembler/linkers[2], and high-level language debuggers/compilers[3], decrease product development time.
- In-circuit scan-path emulators[4] decrease hardware prototyping and debugging time.
- The use of a standard device lowers the overall system cost.

With the use of the TMS320C30, the host processor can request higher-level commands of the rest of the system. Instead of issuing requests for line-draws or screen clears, it can, for example, request that a 3D object be rotated 90 degrees and then be redrawn. In addition, a rendering element (usually a video controller or graphics system processor) can devote its resources solely to screen management rather than doing some portion of the computationally intensive processing. The following pages provide a description of how a 3D graphics system used the TMS320C30 to compute object transformations.

The digital signal processor resides on the TMS320C30 Application Board (C30AB) designed for the IBM PC/AT or compatible. The PC's 80X86 acts as the host processor and communicates with the C30AB through an 8-bit bus slot. Also resident on the bus is a Texas Instruments TMS34010 Software Development Board (SDB)[5,6]. The SDB contains a TMS34010 Graphics System Processor (GSP)[7], which manages the screen memory and drives the video display. Overall, this system is meant to serve as an instructional model of how a graphics system can be designed using an advanced digital signal processor.

## The Potential for Graphics Pipelines

A mechanical engineer for an automobile manufacturer wants to design a robot arm for plant automation. Before building a prototype machine, he wishes to compare the ways in which various designs can pick up and assemble components. To do this, the engineer needs a CAD system capable of creating, storing, and adjusting representations of 3D objects and then rendering the images on a video display. The CAD system has four basic aspects:

1) A user interface for command entry.
2) A data management system to store objects and their screen representations.
3) One or more computational engines to perform high-speed calculations for applications such as transformations, clipping, lighting/shading, and fractal graphics.
4) A rendering engine to control the video memory and to drive the video display.

These four tasks are common to many graphics systems, whether they are intended for CAD/CAM, fractal graphics, heads-up displays in fighter aircraft, or PostScript printer control. If one or more processors are assigned to each function, the resulting pipeline will achieve greatly improved system throughput.

In a single-processor system, the CPU is directly responsible for all computations. It must write to video memory, perform all necessary computations, interface to the user, and manage all data storage and recovery. Although additions to the system, such as a video-memory controller or a floating-point coprocessor, may speed up the system, the CPU remains overly burdened as the only intelligent component of the system.

## Independent Screen Management

A two-processor system can use a GSP to drive the CRT and to control the video memory. To control the display, the GSP either must interface to an analog monitor through a color palette or must directly drive a digital monitor. If the video memory is volatile, the processor needs a refresh controller that runs in parallel with other processor actions. Special hardware can be developed for screen clears and polygon fills. For flexibility of data representation, the processor should be able to access pixels of varying bit-widths. At the instruction level, specialized operations could be created to speed pixel processing. Libraries of subroutines for windowing, drawing, and text management enable the rendering engine to execute higher-level commands. Overall, these features allow the CPU to send more powerful directives to the GSP.

## A Multiprocessor Pipeline

Adding more links in the graphics pipeline can further relieve the CPU of burdensome tasks. Performance improvements result from each stage being optimized for a particular function. In addition, throughput increases with the number of stages. The pipeline may also contain multiple processors running in parallel at a particular stage to further improve the latency of that stage. Figure 1 shows a full-scale implementation of a graphics pipeline for 3D graphics.

Figure 1. A Full Scale Graphics Pipeline


In a large-scale graphics pipeline, the host processor runs the applications program. The user may be trying to use a CAD program, model the formation of galaxies, animate 3D objects, etc. The host runs these programs at the top level, provides the user interface, and communicates to all I/O devices, including mass storage systems. For numerically intensive applications it may be appropriate to have a digital signal processor as this host. For example, modeling the formation of galaxies requires numerical solutions to systems of differential equations. But even in such a case, it would be reasonable to have a more general-purpose CPU act as a user front end to the digital signal processor.

The purpose of the object manager is to communicate with the host by receiving data and transferring it to other processors in the system. It manages the global representation of all screen parameters and objects. A Reduced Instruction Set Computer (RISC) processor would be well-suited as either the host or the object manager because of its high-performance general-purpose architecture.

Because a DSP has a highly parallel architecture, a fast execution cycle time, an instruction set optimized for numerical processing, and several development tools, it would perform well as any of the computational stages in a graphics pipeline. For example, a DSP could act as a transform manager that calculates the new universal coordinates of globally stored objects according to rotation, translation, and scaling commands from the object manager. Also, the DSP could act as a lighting manager that accepts parameters of environmental lighting settings from the object manager and applies them to the transformed objects. For example, the user may set ambient intensities as well as other sources of varying geometries, intensities, and colors. The lighting manager then applies these light sources to the surfaces of the objects, which may have varying degrees of specular or diffuse reflection, to compute the necessary shading.

Although the perspective and clipping stage of the system is represented in Figure 1 by a single processing unit, the task may be further partitioned to several DSPs working in series. The perspective calculation takes viewing parameters from the object manager, such as direction of view, location of viewer, and zoom, and produces a two-dimensional projection for the screen. Objects that are too high, too low, or too far right or left can be clipped automatically because the resulting two-dimensional coordinates are off screen. However, clipping objects fully or partially obscured by other objects may require additional stages. Also, objects behind the viewer and those too far away for the user to recognize should be clipped appropriately.

Although digital signal processors are well-suited to be the computational stages of a graphics pipeline, a processor optimized to be a rendering engine might serve better to drive the video display and manage the video memory. Such a processor could also help with the clipping tasks described above. A z-buffer could hold the transformed z-coordinate of each pixel that is projected onto the $x$-$y$ plane of the screen to facilitate hidden surface removal. A device such as the Texas Instruments TMS34010 or the recently introduced TMS34020 could serve as the rendering engine in a full-scale system. Both these processors have 32-bit general-purpose architectures with instruction sets and external memory interfaces optimized for graphics.
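The z-buffer test can be sketched in a few lines of C. This is only an illustration under stated assumptions: the buffer dimensions, the names `ZB_WIDTH`, `ZB_HEIGHT`, `zb_clear`, and `zb_plot`, and the convention that a smaller z is closer to the viewer are hypothetical and are not taken from the TMS34010 library.

```c
#include <float.h>

#define ZB_WIDTH  4   /* tiny illustrative screen (assumed size) */
#define ZB_HEIGHT 4

static float zbuf[ZB_HEIGHT][ZB_WIDTH];           /* depth of nearest pixel so far */
static unsigned char frame[ZB_HEIGHT][ZB_WIDTH];  /* color of nearest pixel so far */

/* Reset every pixel to "infinitely far" and to background color 0. */
static void zb_clear(void)
{
    int x, y;
    for (y = 0; y < ZB_HEIGHT; y++)
        for (x = 0; x < ZB_WIDTH; x++) {
            zbuf[y][x] = FLT_MAX;
            frame[y][x] = 0;
        }
}

/* Draw a pixel only if it is on screen and closer than the stored depth.
 * Returns 1 if the pixel was drawn, 0 if it was clipped or hidden. */
static int zb_plot(int x, int y, float z, unsigned char color)
{
    if (x < 0 || x >= ZB_WIDTH || y < 0 || y >= ZB_HEIGHT)
        return 0;                 /* off screen: clipped */
    if (z >= zbuf[y][x])
        return 0;                 /* behind a pixel drawn earlier */
    zbuf[y][x] = z;
    frame[y][x] = color;
    return 1;
}
```

With this arrangement the drawing order of the primitives does not matter: a surface drawn later replaces an earlier one only where it is nearer to the viewer.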

## An Overview of This Implementation

The system shown in Figure 2 is not intended to be a marketable product. Rather, it is targeted toward those who have the intention of designing products in the graphics market. Firms having experience in graphics will be able to resolve the tougher issues of graphics system design without presentation of the described system. The system shown in this report illustrates an attractive option for designing a fast, reliable, portable graphics system with quick turn-around time.

Figure 2. A Simple Three-Processor Graphics Pipeline


One strength of this system is its complete use of standard, commercially available parts. In general, use of standard parts allows for faster design and manufacturing, as well as a more reliable, easier-to-support product. Even the three hardware subsystems can be found on the market:

1) The IBM PC compatible host
2) The TMS320C30 Application Board object manager and transform engine subsystem
3) The TMS34010 Software Development Board rendering subsystem

Another strength of this system is the complete use of portable software. Use of portable software often speeds design times because system software can be mostly debugged before the actual target hardware is available. All software for this system was written in Kernighan and Ritchie C. The command and rendering routine was first debugged on the PC and GSP with the intermediary stage removed. Once debugged, the computationally intensive portion of the software was ported to the DSP, which then assumed control of the GSP. The software on the TMS34010 SDB used many of the graphics routines in the TMS34010 Graphics/Math Library. These routines have been used in many other graphics systems using the TMS34010.

## System Hardware

The IBM PC was chosen as the host because of its extensive support by TI development tools. In addition, a large amount of documentation is available concerning interfacing to the PC bus. The system described in this report is designed to run best on an 80386-based IBM PC compatible with an AT power supply and an 80387 floating-point coprocessor. However, either Intel 8086 or 80286 general-purpose microprocessors can also act as the host to the computational engine. The host computer sends commands to

- Load and delete objects
- Target an object for adjustment
- Adjust a particular object
- Recalculate the perspective or
- Redraw the screen.

The 80X87 floating-point coprocessor is not absolutely necessary but greatly improves the time to generate floating-point parameters for the next stage.

This graphics demonstration was the first application developed using the TMS320C30 Application Board (C30AB). Since that time, the C30AB has been included as a part of the XDS1000 emulation system for the TMS320C30 Digital Signal Processor. The TMS320C30's features include

- 60-ns single-cycle execution time (more than 33 MFLOPS)
- $2 \mathrm{~K} \times 32$-bit dual-access RAM
- $4 \mathrm{~K} \times 32$-bit dual-access ROM
- $64 \times 32$-bit instruction cache
- Two 32-bit external memory expansion buses
- Single-cycle floating-point multiply/accumulate
- Two external 32-bit memory ports
- On-chip DMA controller
- Zero-overhead loops and single-cycle branches
- Two on-chip timers and two serial ports
- Floating-point/integer and logical 32/40-bit ALU
- 16M-word memory space
- Register-based CPU
- Development tools, including a simulator, assembler/linker, optimizing C compiler, C source debugger, and an in-circuit emulator/debugger
- On-chip scan-path emulation logic
- Low-power CMOS technology

The TMS320C30 executes commands from the 80X86 to transform objects, load objects into or delete objects from the system, and compute the projection of 3D objects on the 2D screen. When given a directive to draw the screen, it sends a command to the rendering engine to clear the current screen. Then, the TMS320C30 transfers lists of lines, points, and polygons for the next stage to render.

The TMS34010 Software Development Board (SDB) has been used in TMS34010 development support since 1987. It is configurable for a variety of monitors. The board supports the TMS34010 Graphics/Math Function Library[8] (a library of high-level routines callable from any C program). This board was slightly modified to receive commands from the C30AB as well as from the PC host. Program loaders, C compilers[9], assemblers, and C language standard I/O library support have been developed for this board, as well as for the C30AB. Both cards interface to an IBM PC through an 8-bit slot on the AT bus. The TMS34010 GSP on the SDB is an advanced high-performance CMOS 32-bit microprocessor optimized for graphics display systems. Its key features include:

- 160-ns instruction cycle time
- Fully programmable 32-bit general-purpose processor with a 128M-byte address range
- Pixel processing, X-Y addressing, and window clip/pick built into the instruction set
- Programmable pixel size with 16 boolean and 6 arithmetic pixel processing options (Raster-Ops)
- 31 general-purpose 32-bit registers
- 256-byte LRU on-chip instruction cache
- Direct interfacing to both conventional DRAM and multiport video RAM
- Dedicated 8/16-bit host processor interface and HOLD/HLDA interface
- Programmable CRT control (HSYNC, VSYNC, BLANK)
- Full line of hardware and software development tools, including a C compiler

The TMS34010 GSP receives commands from the TMS320C30, along with arrays of points, lines, and filled polygons to be drawn. It then uses library routines to render these images on the video display.

## System Limitations

The system described here is an instructional system built in a limited development time. Aspects of the system could be optimized for speed and for memory usage. A high-speed 3D graphics system has many features that were not implemented.

This design is non-optimal in several ways. The C routines could be hand-coded to execute faster. A 32-bit host bus interface would allow word-at-a-time data transfers to the TMS320C30. The GSP could be interfaced to faster video memory. At the time of this writing, the TMS34020 second-generation graphics system processor is available. The entire TMS320C30 program could be configured to run from internal memory. Many of these optimizations were not realized because of the limited time available for developing the system.

Many operations that an advanced digital signal processor could easily perform were not designed into this system. These tasks include curved and textured surface generation, lighting, shading, and front and back clipping. For demonstrative purposes, only the endpoint transformation and perspective calculations were implemented.

Similarly, the capabilities of the GSP are clearly underutilized in this pipeline. The GSP is adept at managing multiple windows for display. It can also display text in various fonts. The presented system simply requires that the GSP manage a single graphics-only (no text) window.

## Representation of Graphics Elements

Any graphics system must have a method of representing the image to be portrayed on the screen. This method requires a system that is able to store and display primitive elements. These elements could range in complexity from three coordinates describing a point to a set of parametric equations representing an irregular three-dimensional surface. However, simply defining a set of primitive drawing structures does not result in an adequate graphics data representation. The engineer designing the robot does not think of the system as several sheet-metal polygons welded together. He more likely conceives of the arm as a clamp attached to a hand, which, in turn, is attached to an arm, etc. A powerful graphics system must not only describe the primitives to be rendered on the CRT, but also how the primitives are organized or related.

Frames of reference play the central role in the organization of graphics primitives. Any set of graphics primitives rigid with respect to each other can be said to exist in the same, constant frame. When the primitives move, they move as a single unit and remain in the same orientation with respect to each other. In this system, any such set of primitives is called an object. The transformational state of any object is determined by three sets of three parameters each. These sets correspond to the object's

- Translation
- Scale
- Rotation

Translation of an object within its frame simply amounts to moving all locations in that frame a specified distance along the $x$-, $y$-, and $z$-axes. Thus, each object must hold a set of translation factors, denoted in this system's software by $\mathbf{dx}$, $\mathbf{dy}$, and $\mathbf{dz}$ (see Listing 1 in the Appendix). Similarly, $\mathbf{sx}$, $\mathbf{sy}$, and $\mathbf{sz}$ determine the scale of an object. These factors determine how many units of the untransformed object's coordinates are represented by one unit of the transformed object's coordinates. The three parameters shown in Appendix Listing 1 that represent all possible orientations of an object (theta, phi, and omega) are described in Table 1.

Table 1. Angles of Rotation

| Angle | Axis of Rotation | Direction of Positive Rotation | Zero Value |
| :---: | :---: | :---: | :---: |
| $\theta$ | z | x to y | Positive x-axis |
| $\omega$ | x | y to z | Positive y-axis |
| $\phi$ | y | z to x | Positive z-axis |

## The Object Data Structure

Every object contains one or more sets of locations, which are referenced by the drawing primitives within the object. The locnum field of the object structure (see Listing 1) represents the number of locations available to be referenced by primitives within the object. This and other array sizes are kept for end points in For/Next-type loops and to allocate the appropriate space for the array contained within an object. Every location (see Appendix Listing 2) contains three floating-point numbers representing a coordinate in 3D space: $\mathbf{x}$, $\mathbf{y}$, and $\mathbf{z}$. Their integer x-y locations on screen are also saved: $\mathbf{a}$, $\mathbf{b}$. To reference a location, a primitive needs only to know the index in the locs array. This allows many primitives to reference the same location.

Three different primitives were implemented to be rendered on the screen:

- Points
- Line segments
- Filled polygons

Points are rendered as single pixels on the screen. The point structure shown in Listing 3 of the Appendix contains the color to draw the point and the index to the location (locn) that is referenced by that point. The line structure in Listing 4 of the Appendix contains a color and two indices (startlocn and endlocn) to the two end points of the segment. Finally, the filled polygon shown in Listing 5 of the Appendix contains, in addition to the color, the number of vertices (vertnum) for the polygon and a pointer (*vertlocn) to an array of vertex location indices, listed in the order in which they are connected. The last location in the vertex array is connected back to the first, closing the polygon.

## Hierarchy

The final array contained within an object (the parent object) is a list of pointers to child objects defined with respect to the transformed frame of the parent. The number of potential internal objects, MAXOB, sets the static size of the array of pointers to child objects. (In this implementation, $\mathbf{MAXOB}=10$.) In addition, the parameter obnum keeps track of how many of these potential child objects are utilized. The final bookkeeping parameter is subnum. If subnum equals $n$, then the object was the $n$th object pointed to in its parent object's child-object array.
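The fields described so far can be collected into a C sketch. The actual declarations are given in the Appendix listings; the type names (`Location`, `Point`, `Line`, `Polygon`, `Object`), the array fields `points`, `lines`, `polys`, and `child` with their counters, the zero-based reading of subnum, and the helpers `add_child` and `count_objects` are assumptions made here for illustration.

```c
#include <stddef.h>

#define MAXOB 10                 /* maximum child objects per object */

typedef struct { float x, y, z; int a, b; } Location;  /* 3D point, screen x-y */
typedef struct { int color; int locn; } Point;
typedef struct { int color; int startlocn, endlocn; } Line;
typedef struct { int color; int vertnum; int *vertlocn; } Polygon;

typedef struct Object {
    float dx, dy, dz;            /* translation */
    float sx, sy, sz;            /* scale */
    float theta, phi, omega;     /* rotation angles */
    int locnum;   Location *locs;
    int pointnum; Point *points;     /* hypothetical field names */
    int linenum;  Line *lines;       /* hypothetical field names */
    int polynum;  Polygon *polys;    /* hypothetical field names */
    int obnum;                   /* child slots in use */
    int subnum;                  /* index within the parent's child array */
    struct Object *child[MAXOB];
} Object;

/* Attach child in the next free slot; returns the slot, or -1 if full. */
static int add_child(Object *parent, Object *child)
{
    if (parent->obnum >= MAXOB)
        return -1;
    child->subnum = parent->obnum;
    parent->child[parent->obnum] = child;
    return parent->obnum++;
}

/* Count an object and all of its descendants. */
static int count_objects(const Object *ob)
{
    int i, total = 1;
    for (i = 0; i < ob->obnum; i++)
        total += count_objects(ob->child[i]);
    return total;
}
```

Because primitives store indices into `locs` rather than coordinates, transforming an object only requires updating its location array; every point, line, and polygon that references a location picks up the new coordinates automatically.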

Figure 3. Hierarchical Representation of the Solar System


The solar system (Figure 3) represents a classical example of a hierarchical structure. The sun slowly revolves around the galaxy. Wherever the sun travels, the planets follow in the same frame. In turn, each planet may have satellites that revolve around it. The planet is defined with a certain offset (radius of orbit) from the sun, and the satellite is defined similarly with an offset from the planet. To describe the movement of the earth over a period of time, you need only to adjust for its revolution around the sun and the revolution of the moon around the earth. You do not need to describe the rotation of the moon around the sun because when a planet is moved, its satellites automatically move with it.

Transformation parameters are referenced to the frame of the object's parent. Thus, to fully describe a planet orbiting the sun, one must define an empty frame revolving about the sun at some offset, and then define a planet within that frame rotating about some axis. The levels of abstraction within this hierarchy give this data representation its power.

The flexibility of the object structure permits the system to model the viewer. The viewer is considered to be at the absolute origin of the system. At system initialization, the first object loaded is the universal object *universe. An appropriate choice for such an object would be a set of axes. The view is then adjusted by modifications to the parameters of the *universe:

| Parameters | Meaning |
| :--- | :--- |
| $\mathbf{dx}, \mathbf{dy}, \mathbf{dz}$ | Object translation (viewing position) |
| $\mathbf{sx}, \mathbf{sy}, \mathbf{sz}$ | Object scale (zoom) |
| theta, phi, omega | Object orientation (pan) |

These three sets of parameters respectively represent the position of the origin of the universe with respect to the viewer (viewing position), how much the view is magnified to the user (zoom), and how the universe is oriented with respect to the user (pan).

## Transformations

Transformations of locations in 3D space can be reduced to four-dimensional matrix arithmetic[10]. A location in space can be represented by a four-dimensional row vector ( $x y z 1$ ). When this vector left-multiplies any 4-by-4 transformation matrix, the resulting row vector represents the transformed point. Tables 2, 3, and 4 illustrate the 4-by-4 transformation matrices for rotation around each axis.

Table 2. Z-Axis Rotation Matrix
$\left[\begin{array}{cccc}\cos & \sin & 0 & 0 \\ -\sin & \cos & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1\end{array}\right]$

Table 4. X-Axis Rotation Matrix
$\left[\begin{array}{lccl}1 & 0 & 0 & 0 \\ 0 & \cos & \sin & 0 \\ 0 & -\sin & \cos & 0 \\ 0 & 0 & 0 & 1\end{array}\right]$

It can be shown that these matrices can be used to account for a rotation about any arbitrary axis passing through the origin. The transformation matrix shown in Table 5 corresponds to scaling a location by ( $\mathbf{s x}, \mathbf{s y}$, and $\mathbf{s z}$ ) and then moving it by ( $\mathbf{d x}, \mathbf{d y}$, and $\mathbf{d z}$ ).

Table 5. Translation and Scaling Matrix
$\left[\begin{array}{cccc}\text { sx } & 0 & 0 & 0 \\ 0 & \text { sy } & 0 & 0 \\ 0 & 0 & s z & 0 \\ d x & d y & d z & 1\end{array}\right]$

The arbitrary transformation of a frame can be defined by a matrix resulting from a multiplication of a subset of the above transformation matrices. However, this multiplication is, in general, not commutative. That is, rotating around the x-axis and then translating is not the same as translating and then rotating about the x-axis. By sending values for the nine parameters, the host can request the adjustment of an object. However, this system defines these operations as always taking place in the order below:

1) Scale object by ( $\mathbf{s x}, \mathbf{s y}$, and $\mathbf{s z}$ )
2) Translate object by (dx, dy and dz)
3) Rotate object around $z$-axis by theta.
4) Rotate object around $x$-axis by omega.
5) Rotate object around $y$-axis by phi.
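The effect of ordering can be checked numerically with a small sketch that builds the matrices of Tables 4 and 5 and applies them to a row vector in both orders. The function names are hypothetical, and the cosine and sine are passed in directly so that the example needs no math library.

```c
/* out = v * m for a row vector v = (x y z 1). */
void vec_mul(const double v[4], double m[4][4], double out[4])
{
    int r, c;
    for (c = 0; c < 4; c++) {
        out[c] = 0.0;
        for (r = 0; r < 4; r++)
            out[c] += v[r] * m[r][c];
    }
}

static void identity(double m[4][4])
{
    int r, c;
    for (r = 0; r < 4; r++)
        for (c = 0; c < 4; c++)
            m[r][c] = (r == c) ? 1.0 : 0.0;
}

/* X-axis rotation matrix of Table 4, given cos and sin of the angle. */
void make_rot_x(double c, double s, double m[4][4])
{
    identity(m);
    m[1][1] = c;   m[1][2] = s;
    m[2][1] = -s;  m[2][2] = c;
}

/* Translation and scaling matrix of Table 5. */
void make_scale_translate(double sx, double sy, double sz,
                          double dx, double dy, double dz, double m[4][4])
{
    identity(m);
    m[0][0] = sx;  m[1][1] = sy;  m[2][2] = sz;
    m[3][0] = dx;  m[3][1] = dy;  m[3][2] = dz;
}
```

For the point (0 1 0 1), a 90-degree rotation about the x-axis followed by a translation of 5 along z gives (0 0 6 1), whereas translating first and then rotating gives (0 -5 1 1).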

When the matrices shown in Tables 2 through 5 are multiplied, the resulting matrix always contains $(0\ 0\ 0\ 1)^{T}$ as its final column. Thus, to denote an arbitrary transformation, you need only remember the first three columns of the composite matrix. If you were to apply the transformations in the order stated previously, the resulting equations in Table 6 would determine the elements of the transformation matrix $R$.

Table 6. Transformation Equations

| Equation | Number |
| :--- | :--- |
| $\mathrm{r}_{11}=\mathrm{s}_{\mathrm{x}} \cos \theta \cos \Omega$ | $(2.1)$ |
| $\mathrm{r}_{12}=-\mathrm{s}_{\mathrm{y}} \sin \theta \cos \Omega$ | $(2.2)$ |
| $\mathrm{r}_{13}=\mathrm{s}_{\mathrm{z}} \sin \Omega$ | $(2.3)$ |
| $\mathrm{r}_{14}=\cos \Omega\left(\mathrm{d}_{\mathrm{x}} \cos \theta-\mathrm{d}_{\mathrm{y}} \sin \theta\right)+\mathrm{d}_{\mathrm{z}} \sin \Omega$ | $(2.4)$ |
| $\mathrm{r}_{21}=\mathrm{s}_{\mathrm{x}}(\sin \theta \cos \phi+\cos \theta \sin \Omega \sin \phi)$ | $(2.5)$ |
| $\mathrm{r}_{22}=\mathrm{s}_{\mathrm{y}}(\cos \theta \cos \phi-\sin \theta \sin \Omega \sin \phi)$ | $(2.6)$ |
| $\mathrm{r}_{23}=-\mathrm{s}_{\mathrm{z}} \cos \Omega \sin \phi$ | $(2.7)$ |
| $\mathrm{r}_{24}=\sin \phi\left(\sin \Omega\left(\mathrm{d}_{\mathrm{x}} \cos \theta-\mathrm{d}_{\mathrm{y}} \sin \theta\right)-\mathrm{d}_{\mathrm{z}} \cos \Omega\right)+\cos \phi\left(\mathrm{d}_{\mathrm{x}} \sin \theta+\mathrm{d}_{\mathrm{y}} \cos \theta\right)$ | $(2.8)$ |
| $\mathrm{r}_{31}=\mathrm{s}_{\mathrm{x}}(\sin \theta \sin \phi-\cos \theta \sin \Omega \cos \phi)$ | $(2.9)$ |
| $\mathrm{r}_{32}=\mathrm{s}_{\mathrm{y}}(\cos \theta \sin \phi+\sin \theta \sin \Omega \cos \phi)$ | $(2.10)$ |
| $\mathrm{r}_{33}=\mathrm{s}_{\mathrm{z}} \cos \Omega \cos \phi$ | $(2.11)$ |
| $\mathrm{r}_{34}=\cos \phi\left(\sin \Omega\left(-\mathrm{d}_{\mathrm{x}} \cos \theta+\mathrm{d}_{\mathrm{y}} \sin \theta\right)+\mathrm{d}_{\mathrm{z}} \cos \Omega\right)+\sin \phi\left(\mathrm{d}_{\mathrm{x}} \sin \theta+\mathrm{d}_{\mathrm{y}} \cos \theta\right)$ | $(2.12)$ |

Note that there also exists a matrix p[3][4] (see Listing 1 in the Appendix) that represents the product of all the ancestral transform matrices of an object and that object's R matrix. This matrix represents the object's transformation from the absolute origin of the system.
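A sketch of how two such 3-by-4 matrices could be combined is given below. It assumes a layout in which element [i][3] holds the translation term (as the d terms in the fourth-column elements of Table 6 suggest) and in which the omitted fourth row is (0 0 0 1). The function names are hypothetical and this is not the code of Listing 1.

```c
/* out = a composed with b: applying out to a location is the same as
 * applying b first and then a. out must not alias a or b. */
void compose34(double a[3][4], double b[3][4], double out[3][4])
{
    int i, j, k;
    for (i = 0; i < 3; i++) {
        for (j = 0; j < 4; j++) {
            /* the implicit row (0 0 0 1) of b contributes a[i][3] to column 3 */
            double s = (j == 3) ? a[i][3] : 0.0;
            for (k = 0; k < 3; k++)
                s += a[i][k] * b[k][j];
            out[i][j] = s;
        }
    }
}

/* Applies a 3-by-4 matrix to the location (x, y, z). */
void apply34(double m[3][4], const double in[3], double out[3])
{
    int i;
    for (i = 0; i < 3; i++)
        out[i] = m[i][0] * in[0] + m[i][1] * in[1] + m[i][2] * in[2] + m[i][3];
}
```

With a parent matrix that translates by 10 along x and a child matrix that scales by 2, the composite maps (1, 1, 1) to (12, 2, 2): the child's own transformation is applied first, then that of its ancestors.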

## The Host Processor's Access to Objects

The 80X86 host can exert its control over objects in the following ways:

1) Target Objects - The host can set the target object for adjustment, deletion, or insertion of a child object by either targeting the parent object or a particular child object of the currently targeted object.
2) Load and Delete Objects - The host has the ability to add objects to the system with initial transform parameters. In addition, it can remove objects from the system (including all objects within the deleted objects). When the targeted object is deleted, the new target object defaults to being the object's parent.
3) Adjust Objects - By specifying the nine transform parameters, the host can adjust an object in its parent's frame.
4) Change Perspective - To change the viewing perspective, the host must request that the *universe be adjusted.
5) Update Screen Representation - The host can request that the targeted object and its child objects have their location array's screen representations updated.
6) Redraw View - Once all adjustments and updates of screen coordinates are re-specified, the host can request that the view be updated.

Overall, the object structure serves well as a data representation for 3D graphics. A single set of locations is available to be referenced by the points, line segments, and filled polygons to be rendered on the screen. Each object contains parameters and matrices that specify the transformed state of the object. Thus, at any time these matrices could be applied to the original coordinates loaded into the system to calculate the transformed location of the point. Therefore, as the transformation and the projection onto two-dimensional coordinates are done in one step, the original 3D coordinates can be retained and only the final modified two-dimensional screen representation need be updated. The point of view can simply be modified by adjusting the *universe as one would adjust any other object. Overall, the hierarchical object structure provides a powerful and flexible way to manage graphical data.

## DSP Command Execution

The digital signal processor assumes the role of the object manager and keeps track of the representations. Before examining the precise manner in which the TMS320C30 processes the commands from the host, one needs to understand the underlying hardware of this subsystem. A description of the TMS320C30 Application Board can be found in the application report TMS320C30 Application Board Functional Description, located in this book. The report describes the avenues of communication between the C30AB and the PC over the PC's bus. An examination of how the TMS320C30 receives and processes data and commands from the 80X86/7 follows.

## Initialization

As its first initialization task, the PC maps the dual-port SRAM of the C30AB into its address space by writing the 8 MSBs of address to the mapping register. It then brings the C30AB out of reset by writing a 1 to the SWRESET in the C30AB's control register. The PC then loads the TMS320C30 application program into the dual-port SRAM. Loader support software on the C30AB EEPROM moves the code to the proper location in the TMS320C30's address space. Finally, the PC switches the TMS320C30's memory map into run mode to start program execution. The first part of the main routine initializes the system (see Listing 8 in the Appendix).

For the system software to run properly, the DSP software must initialize several different items.

1) It enables the on-chip instruction cache.
2) It sets the external flag bit on the C30AB target connector to transfer control of the rendering system from the PC to the C30AB. (This assumes that the PC loaded the rendering software before it started up the C30AB.)
3) It configures both the primary and the expansion bus with zero software wait-states. Thus, all wait states are generated by the address-decoding PALs on the C30AB.

In addition, the linker configures

1) Primary bus SRAM as program storage
2) Expansion bus SRAM as heap memory allocation
3) Zeroth page of internal RAM as space for system constants
4) First page of internal RAM as the system stack

This configuration maximizes the potential for parallel data and instruction accesses.

The initialization procedure then appropriates several local variables for system use, including

1) Two registered looping variables, $\mathbf{i}$ and $\mathbf{j}$
2) The constant 2 PI
3) Registered pointers to the communication registers of the rendering subsystem, *hstdata and *hstcntl

The TMS320C30 initially sets the contents of these GSP registers to indicate that the computational stage does not have any requests of the rendering stage.

The TMS320C30 system software contains the global variables shown in Listing 7 of the Appendix. The dual-port SRAM pointer dual_port is initialized to point to the lowest location on the I/O expansion bus. This pointer points to an integer array that contains all data and commands from the PC. Another pointer to the currently targeted object (*to) is set to reference the universe. The *universe is set as its own parent with an obnum of 0, indicating that no internal objects are loaded.

During the final part of initialization, the C30AB software waits for the PC to load the static *universe object. To understand how the PC loads objects into the system, you must comprehend the general communications protocol between the TMS320C30 and the 80X86.

## Host to DSP Communication

A two-way polling scheme arbitrates access of the dual-port SRAM. The software allocates the first two words of the SRAM as COMMAND and ACKNOWLEDGE signals, respectively (see Listing 6 in the Appendix). Remember that the TMS320C30 must mask off the 24 MSBs of dual-port data to receive the proper 8-bit value. The processors poll and write to these two words in order to send requests and acknowledgments. During initialization, the TMS320C30 clears both the COMMAND and ACKNOWLEDGE locations of the dual-port SRAM. The PC graphics application software must run after this point to ensure that this phase of the initialization does not clear a command from the PC. Once the system software starts executing on both the PC and the TMS320C30, the following sequence enables the PC to send a command to the C30AB:

1) The PC waits for the dual-port SRAM to become free by polling the ACKNOWLEDGE word for a zero.
2) The PC loads all command parameters into the dual-port SRAM.
3) The PC then loads the appropriate command byte into COMMAND.
4) Once the TMS320C30 returns to its command detection loop, it acknowledges a received command by writing the same byte into the ACKNOWLEDGE word.
5) The PC sees that the TMS320C30 has acknowledged the command and writes 00 h into COMMAND to withdraw its command. The PC thereby relinquishes control of the dual-port SRAM.
6) The TMS320C30 reads all necessary parameters into its main memory.
7) The TMS320C30, by writing a zero to the ACKNOWLEDGE word, indicates that the PC can request another command. This returns the sequence to step (1).
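The seven steps can be traced with a single-threaded simulation in which an ordinary array stands in for the dual-port SRAM. The word positions of COMMAND and ACKNOWLEDGE follow the text; the parameter offset PARAMS, the function names, and the one-byte-per-word parameter layout are assumptions made for this sketch.

```c
#include <stdint.h>

#define COMMAND      0   /* first word of the dual-port SRAM */
#define ACKNOWLEDGE  1   /* second word */
#define PARAMS       2   /* hypothetical offset of the first parameter */
#define DP_BYTE(dp, i) ((unsigned)((dp)[i] & 0xFFu))  /* mask off the 24 MSBs */

/* PC, steps 1-3: returns -1 while the SRAM is still owned by the DSP. */
int pc_send(uint32_t *dp, unsigned cmd, const unsigned char *params, int n)
{
    int i;
    if (DP_BYTE(dp, ACKNOWLEDGE) != 0)
        return -1;
    for (i = 0; i < n; i++)
        dp[PARAMS + i] = params[i];
    dp[COMMAND] = cmd & 0xFFu;
    return 0;
}

/* DSP, step 4: echo a pending command byte into ACKNOWLEDGE. */
unsigned dsp_poll(uint32_t *dp)
{
    unsigned cmd = DP_BYTE(dp, COMMAND);
    if (cmd != 0)
        dp[ACKNOWLEDGE] = cmd;
    return cmd;
}

/* PC, step 5: withdraw the command once it has been acknowledged. */
int pc_withdraw(uint32_t *dp, unsigned cmd)
{
    if (DP_BYTE(dp, ACKNOWLEDGE) != cmd)
        return -1;
    dp[COMMAND] = 0;
    return 0;
}

/* DSP, steps 6-7: copy the parameters out, then release the SRAM. */
void dsp_finish(uint32_t *dp, unsigned char *out, int n)
{
    int i;
    for (i = 0; i < n; i++)
        out[i] = (unsigned char)DP_BYTE(dp, PARAMS + i);
    dp[ACKNOWLEDGE] = 0;
}
```

While ACKNOWLEDGE is nonzero, `pc_send` refuses to write, which mirrors the rule that the PC may load a new command only after the TMS320C30 has cleared the word.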

The TMS320C30 treats all of its data types as 32-bit values, but it can read only one byte of valid data at a time from the dual-port SRAM. Thus, the TMS320C30 must mask and concatenate the bytes that the PC maps into contiguous locations to form multibyte words. In addition, since Intel and the TMS320C30 use different floating-point formats, floating-point values from the PC must be converted before the TMS320C30 can use them.

The TMS320C30 can receive either unsigned 8-bit chars or unsigned 16-bit short integers from the PC. The macros shown in Listing 6 of the Appendix are used to access these data types from the dual-port SRAM. The DPLONG macro takes a certain location in the dual-port, finds the short integer located there, and concatenates it into a 32-bit value for the TMS320C30. The word LONG in the macro indicates all integers whether chars, shorts, or longs are represented as 32-bit values by the TMS320C30.
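The masking and concatenation might be expressed as in the sketch below. The macro names DP_CHAR and DP_SHORT are hypothetical stand-ins (the actual macros, including DPLONG, are in Listing 6), and the low-byte-first ordering is an assumption based on the Intel host's byte order.

```c
#include <stdint.h>

/* One valid byte per dual-port word: discard the 24 MSBs. */
#define DP_CHAR(dp, i)   ((uint32_t)((dp)[i] & 0xFFu))

/* Two consecutive dual-port words form one 16-bit value, low byte first
 * (assumed), returned as a 32-bit quantity. */
#define DP_SHORT(dp, i)  (DP_CHAR(dp, i) | (DP_CHAR(dp, (i) + 1) << 8))
```

For example, two words whose low bytes are 34h and 12h combine into 1234h regardless of what the upper 24 bits of each word contain.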

Table 7. Comparison of Intel and TMS320C30 32-Bit Floating-Point Formats

| Standard | Exponent <br> Field Bits | Exponent <br> Format | Sign <br> Bit | Mantissa <br> Field | Mantissa <br> Format |
| :--- | :---: | :---: | :---: | :---: | :---: |
| TMS320C30 | $31-24$ | Two's Complement | 23 | $22-0$ | Two's Complement |
| Intel | $30-23$ | Offset Binary | 31 | $22-0$ | Magnitude |

Table 7 illustrates the differences between the TMS320C30 and the Intel single-precision floating-point formats. For every floating-point value that the TMS320C30 receives, it must extract the appropriate fields, convert the fields to the appropriate numerical representation, and then reassemble the fields in TMS320C30 floating-point format. The dpfloat routine shown in Listing 9 of the Appendix uses the union structure fllong shown in Listing 6 of the Appendix to allow manipulations normally available only for integers on the floating-point value. The program first concatenates the four-byte value in the dual-port SRAM into a single 32-bit integer and then converts this word to TMS320C30 format.
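The field manipulation can be sketched in portable C operating on the raw 32-bit patterns. This is an illustration of the conversion implied by Table 7, not the dpfloat routine of Listing 9; the function name is hypothetical, denormalized inputs are flushed to zero, and infinities and NaNs are not handled.

```c
#include <stdint.h>

/* Converts an Intel (IEEE 754) single-precision bit pattern to the
 * TMS320C30 single-precision bit pattern. */
uint32_t ieee_to_c30(uint32_t ieee)
{
    uint32_t sign  = ieee >> 31;
    uint32_t field = (ieee >> 23) & 0xFFu;      /* offset-binary exponent */
    int32_t  exp   = (int32_t)field - 127;      /* remove the offset */
    uint32_t frac  = ieee & 0x7FFFFFu;          /* magnitude fraction */

    if (field == 0)
        return 0x80000000u;                     /* exponent -128 denotes zero */

    if (sign) {
        if (frac == 0)
            exp -= 1;                           /* -1.0 * 2^e == -2.0 * 2^(e-1) */
        else
            frac = 0x800000u - frac;            /* two's complement of the fraction */
    }
    return (((uint32_t)exp & 0xFFu) << 24) | (sign << 23) | frac;
}
```

As a check, 1.0 (3F800000h) becomes 00000000h, -1.0 (BF800000h) becomes FF800000h, and -3.0 (C0400000h) becomes 01C00000h.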

## Computational Subsystem Software

Using the communication techniques described in the last section, the TMS320C30 processes the graphics commands from the PC. After performing C30AB initialization, the program main enters a command detection/execution loop. For each valid value of the COMMAND byte, a C case statement executes the appropriate code. Since these routines are, in general, too long to be discussed in exhaustive detail, the rest of this section merely summarizes how they work.

When the PC wants to load an object, it first loads the initial nine floating-point transformation parameters into the dual-port SRAM. It then loads the number of

1) Locations
2) Drawn points
3) Lines
4) Filled polygons

These values are limited to 16 bits, thereby allowing for only 65,535 primitives of each type. The size of the dual-port SRAM further limits the array sizes in this implementation. Then the PC loads three floating-point parameters ($x$, $y$, and $z$) for each location. The size of the dual port limits the number of locations to 377. Once these parameters are loaded into the memory, the host places the command byte for an object load into COMMAND. Upon reception of these parameters, the TMS320C30 allocates space for the object as a child of the current target object and also allocates space for the location, point, and line arrays. Because the size of each polygon varies, space is allocated as each polygon is read.

After allocating global space for the new object and loading the locations, the TMS320C30 requests more data from the PC. It first requests the points, then the lines, then each polygon. The dual-port SRAM limits the primitive arrays to 2047 points and 1364 lines. In addition, each polygon is limited to 4092 vertices. The TMS320C30 makes a data request by replacing the current COMMAND byte that it wrote in ACKNOWLEDGE with 127, the flag for the PC to load more data. Although the roles of ACKNOWLEDGE and COMMAND are reversed in this case, the TMS320C30 requests data in much the same way the PC requests commands. Once the TMS320C30 completes loading the object, it selects the object as the new target object. Finally, using the equations in Table 6, the TMS320C30 calculates the initial value of the object's transformation matrix.

The target object is the object in the hierarchy selected for adjustment, deletion, or calculation of screen coordinates. The PC can either target an object's parent or one of the object's child objects. The command to target a child requires the PC to specify the child object's sibling number, subnum. Thus, when selecting objects for adjustment, the PC must remember where it loaded objects into the hierarchy.

To adjust the transformation parameters of a given object, the PC simply loads the new parameters into the dual-port SRAM. The TMS320C30 adds the values of the new angles of rotation and translation factors to the previous ones. In addition, the TMS320C30 multiplies the old scaling factors by the new ones. Then, the TMS320C30 calculates the transformation matrix of the object by using the equations in Table 6. It does not recalculate screen locations, however, until this is specifically requested by the PC. The TMS320C30 can thus avoid calculating screen coordinates until all adjustments have been made.
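The accumulation rule (add translations and angles, multiply scale factors) is small enough to show directly. The struct and function names below are hypothetical; only the nine parameter names come from the text.

```c
/* The nine transformation parameters of an object. */
typedef struct {
    double dx, dy, dz;          /* translation */
    double sx, sy, sz;          /* scale */
    double theta, phi, omega;   /* rotation angles */
} xform_t;

/* Folds an adjustment request into the stored parameters:
 * translations and angles accumulate by addition, scales by multiplication. */
void adjust(xform_t *cur, const xform_t *req)
{
    cur->dx += req->dx;  cur->dy += req->dy;  cur->dz += req->dz;
    cur->sx *= req->sx;  cur->sy *= req->sy;  cur->sz *= req->sz;
    cur->theta += req->theta;
    cur->phi   += req->phi;
    cur->omega += req->omega;
}
```

Under this rule, a request that leaves an object unchanged carries zeros for the translations and angles and ones for the scale factors.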

Once the PC requests all the changes for a frame on the display, it requests recalculation of screen coordinates at each node it changed. The PC can request recalculation for a particular object and thus update its internal objects as well. This allows the TMS320C30 to avoid recalculating screen coordinates of unchanged locations. For maximum efficiency, the PC must request recalculation in the highest node that it adjusted along any particular path. Thus, in the planetary example given earlier, if, in a period of time, only Pluto and its moon Charon were moved (the other bodies miraculously standing still), only Pluto would need to be targeted for recalculation.

To calculate transformations, the TMS320C30 multiplies the object's transformation matrix by the parent transformation matrix of the object's parent to obtain the object's own parent transformation matrix, p[3][4]. The TMS320C30 right-multiplies all locations within that object by this matrix to achieve the transformation from the absolute origin of the system. The computational engine calculates perspective by dividing the transformed x- and y-coordinates by the transformed z-coordinate so that locations farther away appear closer together. The plane $z=0$ is defined to be the plane of the screen. This also has the feature that objects behind the viewer appear upside-down in front of the viewer because the objects' $z$-coordinates are negative. Thus, the program running on the PC must maintain all objects in front of the viewer. Then, the TMS320C30 recursively executes this procedure for each object within the targeted object.
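The perspective division can be sketched as follows. The function name is hypothetical, and rejecting locations with a non-positive z is a choice made for this sketch; the text instead relies on the PC program to keep all objects in front of the viewer.

```c
/* Projects a transformed location p = (x, y, z) onto the screen by
 * dividing x and y by z. Returns 0 on success, or -1 when the location
 * is at or behind the viewer (z <= 0), where the division is not usable. */
int project(const double p[3], double *sx, double *sy)
{
    if (p[2] <= 0.0)
        return -1;
    *sx = p[0] / p[2];
    *sy = p[1] / p[2];
    return 0;
}
```

Doubling the z-coordinate of a location halves both of its screen coordinates, which is why distant locations crowd toward the center of the view.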

Unlike the recalculation of screen coordinates, the redrawing of objects is done for all objects within the system. Thus, the draw_object routine is called with the *universe as the argument. The precise manner in which the TMS320C30 uses this program to redraw the screen is described in the TMS320C30 Drawing Routine section found later in this report.

## Summary of DSP Command Execution

The dual-port SRAM on the C30AB provides all means of communication between the PC and the TMS320C30. A two-way polling scheme arbitrates the TMS320C30's and the PC's access to this SRAM. Using this protocol, the PC can request object loading, deletion, or adjustment, but can request only modification of the object currently targeted for these changes. Also, at the host's request, the computational engine may recalculate the screen representation of all locations within the targeted object. Once all updates for a particular view are made, the PC may request a redrawing of the display. The description of the rendering subsystem, presented next, facilitates a better understanding of how the TMS320C30 requests rendering commands of the GSP.

## The Rendering Subsystem

A modified version of the TMS34010 Software Development Board serves as the rendering stage of this graphics pipeline. A complete overview of this PC-based card can be found in the TMS34010 Software Development Board User's Guide [2]. Because only minor modifications were made to the commercially available SDB, the hardware aspects of the rendering subsystem are discussed in less detail than the computational stage. The same holds true for many software routines taken from the TMS34010 Math/Graphics Function Library [8]. After presenting overviews of the TMS34010 and the SDB, this section focuses on the C30AB/SDB interface and the communications protocol used for command and data transfer between the TMS320C30 and the GSP.

## The TMS34010 Graphics System Processor

The TMS34010 combines the best features of general-purpose processors and graphics controllers in one powerful and flexible Graphics System Processor. Key features of the TMS34010 are its speed, high degree of programmability, and efficient manipulation of hardware-supported data types, such as pixels and two-dimensional pixel arrays.

The TMS34010's unique memory interface reduces the time needed to perform tasks such as bit alignment and masking. The 32-bit architecture supplies the large blocks of continuously addressable memory that are necessary in graphics applications. TMS34010 system designs can take advantage of video RAM technology to facilitate applications such as high-bandwidth frame buffers; this circumvents the bottleneck often encountered when conventional DRAMs are used in graphics systems.

The TMS34010's instruction set includes a full complement of general-purpose instructions, as well as graphics functions from which you can construct efficient high-level functions. The instructions support arithmetic and Boolean operations, data moves, conditional jumps, plus subroutine calls and returns.

The TMS34010 architecture supports a variety of pixel sizes, frame buffer sizes, and screen sizes. On-chip functions have been carefully selected so that no functions tie the TMS34010 to a particular display resolution. This enhances the portability of graphics software and allows the TMS34010 to adapt to graphics standards such as MIT's X, CGI/CGM, GKS, NAPLPS, PHIGS, and other evolving industry and display management standards.

## TMS34010 Software Development Board

Figure 4 shows the block diagram of the modified TMS34010 SDB. The graphics SDB is a single card designed around the IBM PC/XT Expansion Bus and serves as a software development tool for programmers writing application software for the TMS34010 Graphics System Processor. The development of a high-performance bit-mapped graphics display in this application report demonstrates the simplicity of hardware design using the TMS34010 SDB.

Figure 4. Modified TMS34010 Software Development Board Block Diagram


This board comes with interactive debug software. Its features include software breakpoints, software single-step and run with count. At the same time, current machine status is displayed on the top half of the host monitor.

The SDB contains 512 K bytes of program RAM for the TMS34010 to execute drawing functions, application programs, and displays. Both the program RAM and the frame buffer are accessible to the host through the TMS34010's memory-mapped host port.

The frame buffer consists of eight SIP memory modules organized into four color planes. This allows 16 colors per frame from the digital monitor. The TMS34070 color palette incorporates a 12-bit color lookup table to give you a choice of 16 colors in a frame from a 4096-color palette. Furthermore, the palette incorporates a variety of unique line load features to allow the color lookup table to be reloaded on every line; this means that 16 of 4096 colors can be displayed per line.

## The TMS34010 Host Interface

The GSP has two 16-bit buses: one interfaces with the video and program memory, and a second interfaces to a host processor. The host can access the GSP by writing and reading four internal memory-mapped GSP 16-bit registers:

- HSTADRL and HSTADRH together form a 32-bit pointer to a location in the GSP's address space.
- HSTCNTL contains several programmable fields that control host interface functions.
- HSTDATA buffers data that is transferred through the host interface between the GSP's local memory and the host processor.

Several signals are available for communications between the host and the GSP.

- HD15 through HD0 are the actual data lines.
- HCS is the interface select signal strobe from the host.
- HSF1 and HSF0 select which host register is being addressed.
- HREAD and HWRITE are, respectively, the read and write strobes from the host.

- HLDS and HUDS, respectively, select the low byte or the high byte of the host interface registers.
- HRDY informs the host when the GSP is ready to complete a transaction.
- HINT is the interrupt signal from the GSP to the host.

Table 8 shows how the HCS, HSF1, HSF0, HREAD, and HWRITE signals address the four host registers.

Table 8. TMS34010 Signals Controlling Host Port Interface

| HCS | HSF1, HSF0 | HREAD | HWRITE | Operation |
| :---: | :---: | :---: | :---: | :--- |
| 1 | XX | X | X | No Operation |
| 0 | 00 | 0 | 1 | HSTADRL read |
| 0 | 00 | 1 | 0 | HSTADRL write |
| 0 | 01 | 0 | 1 | HSTADRH read |
| 0 | 01 | 1 | 0 | HSTADRH write |
| 0 | 10 | 0 | 1 | HSTDATA read |
| 0 | 10 | 1 | 0 | HSTDATA write |
| 0 | 11 | 0 | 1 | HSTCNTL read |
| 0 | 11 | 1 | 0 | HSTCNTL write |
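Table 8 amounts to a small decoder, sketched here in C. The type and function names are hypothetical, and treating the combinations that Table 8 does not list (HREAD equal to HWRITE while HCS is low) as no operation is an assumption.

```c
typedef enum { OP_NONE, OP_READ, OP_WRITE } host_dir_t;
typedef enum {
    REG_HSTADRL = 0, REG_HSTADRH = 1, REG_HSTDATA = 2, REG_HSTCNTL = 3
} host_reg_t;

/* Decodes one host access. hsf is (HSF1 << 1) | HSF0; hcs, hread, and
 * hwrite are the signal levels (0 = asserted, as in Table 8).
 * *reg is written only when a read or a write is decoded. */
host_dir_t decode_host_access(int hcs, int hsf, int hread, int hwrite,
                              host_reg_t *reg)
{
    if (hcs != 0)
        return OP_NONE;                 /* interface not selected */
    if (hread == hwrite)
        return OP_NONE;                 /* not listed in Table 8 (assumed) */
    *reg = (host_reg_t)(hsf & 3);
    return (hread == 0) ? OP_READ : OP_WRITE;
}
```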

The fields in HSTCNTL control host interrupt processing, auto-incrementing of the host address register, and protocol in byte-at-a-time accesses to the 16-bit host port (whether the lower or the higher byte comes first). HSTCNTL also contains the status of interrupts from the host to the GSP and from the GSP to the host and a three-bit message word in either direction. These control bits are shown in Table 9.

Table 9. TMS34010 Host Control Register Fields

| Bits | Name | Purpose | Write Access |
| :--- | :--- | :--- | :--- |
| 0-2 | MSGIN | Input Message Buffer | Host Only |
| 3 | INTIN | Input Interrupt Bit | Host Only |
| 4-6 | MSGOUT | Output Message Buffer | GSP Only |
| 7 | INTOUT | Output Interrupt Bit | GSP Only |
| 8 | NMI | Nonmaskable Interrupt | Host Only |
| 9 | NMIM | Nonmaskable Interrupt Mode | GSP and Host |
| 10 | Unused | Unused | Neither |
| 11 | INCW | Increment Pointer Address on Write | GSP and Host |
| 12 | INCR | Increment Pointer Address on Read | GSP and Host |
| 13 | LBL | Lower Byte Last | GSP and Host |
| 14 | CF | Cache Flush | GSP and Host |
| 15 | HLT | Halt TMS34010 Processing | GSP and Host |

## TMS320C30 Application Board Interface

In its unmodified form, the SDB communicates to the PC host through a single transceiver. A PAL decodes the PC address into the appropriate register selection signals. The registers are mapped redundantly into blocks of PC memory address space, as shown in Table 10. The board was modified by the addition of a connector to a cable from the C30AB's target connector. The TMS320C30 sends to the modified SDB the following:

- The TMS320C30's expansion bus address
- The TMS320C30's data signals
- I/O address space access strobe
- Expansion bus read and write strobes

These signals map the GSP's host interface registers in the TMS320C30's address space (also shown in Table 10). The TMS320C30 mapping is actually replicated in four-word blocks until location 8057FFh.

Table 10. Mapping of TMS34010 Host Control Registers

| Register | PC Mapping | TMS320C30 Mapping |
| :---: | :---: | :---: |
| HSTDATA | C7000h - C7CFFh | 805002h |
| HSTCNTL | C7D00h - C7DFFh | 805003h |
| HSTADRL | C7E00h - C7EFFh | 805000h |
| HSTADRH | C7F00h - C7FFFh | 805001h |
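Because the TMS320C30 mapping repeats every four words up to 8057FFh, the register selected by an address depends only on its two least significant bits. The sketch below illustrates this; the names are hypothetical.

```c
typedef enum {
    GSP_HSTADRL = 0, GSP_HSTADRH = 1, GSP_HSTDATA = 2, GSP_HSTCNTL = 3
} gsp_reg_t;

/* Returns the host register selected by a TMS320C30 address in the
 * replicated block 805000h-8057FFh, or -1 outside that block. */
int gsp_reg_from_c30_addr(unsigned long addr)
{
    if (addr < 0x805000UL || addr > 0x8057FFUL)
        return -1;
    return (int)(addr & 3UL);
}
```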

The modified SDB board must be able to select either the PC or the C30AB as its host. The C30AB target connector makes the two external flag bits XF0 and XF1 available to the SDB. The TMS320C30 can configure these flags as either input or output pins. Upon leaving reset, these pins default to inputs and remain in the high-impedance state. XF0 is pulled low on the SDB to appear off when the TMS320C30 is in reset. After the PC loads the rendering software into the GSP, it activates the C30AB and loads the TMS320C30's software. As discussed earlier, the TMS320C30, during initialization, configures XF0 as an output and loads it with a one. The address-decoding PALs on the SDB use this signal to select the C30AB as the SDB's host. When the TMS320C30 controls the SDB, it communicates through a full 16-bit interface to the GSP. Thus, before the integer screen coordinates are sent in two's-complement form to the GSP, they must be clipped to a range of $-32,768$ to 32,767. Fortunately, this range is still two orders of magnitude greater than the resolution of most monitors.
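The clipping step can be written as a simple saturation, sketched below with a hypothetical function name.

```c
#include <stdint.h>

/* Saturates an integer screen coordinate to the 16-bit two's-complement
 * range accepted by the GSP's host interface. */
int16_t clip_coord(long v)
{
    if (v > 32767L)
        return 32767;
    if (v < -32768L)
        return -32768;
    return (int16_t)v;
}
```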

In general, the above interface is fairly straightforward. The only complication is that the designers of the GSP expected a relatively slow microcoded general-purpose processor as a host. This allows the GSP to actually assert its HRDY line 80 ns before it is actually ready to process a transaction. When interfacing to the TMS320C30, PALs become necessary as state machines to create the appropriate number of wait-states on host reads and writes and thus ensure proper interprocessor communication.

## DSP to GSP Communication

The TMS320C30 loads all commands and data into a command buffer contained within a space not usually mapped by the SDB's C compiler configuration. This portion of GSP address space, the Shadow RAM, is normally reserved for optional PROMs. However, by writing a 1 to an RS latch in the GSP's memory space, this area becomes occupied by the topmost portion of program/data DRAM. Before the TMS320C30 starts writing to HSTDATA to access this memory, it configures the host address to autoincrement. Once the GSP finishes processing data in the shadow RAM, it resets the value of the address registers to point to the beginning of the shadow RAM in order to allow the TMS320C30 to properly load its next command and data.

The communication protocol between the TMS320C30 and the GSP closely resembles the protocol between the PC and the TMS320C30. The MSGIN and MSGOUT fields, respectively, replace the COMMAND and ACKNOWLEDGE words. However, rather than these fields containing a particular value for a command, the value of 3 (binary 011) in either of these fields indicates that a command or an acknowledge exists. Upon reception of a command request, the GSP refers to the first location of the shadow RAM for a command word from the TMS320C30. Thus, the overall command scheme proceeds as follows:

1) The TMS320C30 waits until it sees that the MSGOUT field contains a 0.
2) The TMS320C30 stores all command and data into the shadow RAM.
3) The TMS320C30 writes a 3 to the MSGIN field and waits for acknowledgment.
4) The GSP acknowledges the reception of a command by writing a 3 to the MSGOUT field.
5) The TMS320C30 withdraws its request by writing a 0 to MSGIN.
6) The GSP reads the first word of the shadow RAM for the command and jumps to the appropriate case to process it.
7) Once the GSP is finished with all data in the shadow RAM, it resets the values of the host address registers and then writes a 0 to the MSGOUT field, indicating that the TMS320C30 is free to request another command.
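The handshake on the MSGIN and MSGOUT fields can be simulated with an ordinary 16-bit variable standing in for HSTCNTL. The bit positions follow Table 9; the function names are hypothetical, and the hardware write restrictions (MSGIN writable by the host only, MSGOUT by the GSP only) are not modeled.

```c
#include <stdint.h>

#define MSGIN_MASK    0x0007u   /* bits 0-2 */
#define MSGOUT_MASK   0x0070u   /* bits 4-6 */
#define MSGOUT_SHIFT  4
#define MSG_PENDING   3u        /* binary 011: command or acknowledge present */

static unsigned msgin(uint16_t r)  { return r & MSGIN_MASK; }
static unsigned msgout(uint16_t r) { return (r & MSGOUT_MASK) >> MSGOUT_SHIFT; }

static void set_msgin(uint16_t *r, unsigned v)
{
    *r = (uint16_t)((*r & ~MSGIN_MASK) | (v & 7u));
}

static void set_msgout(uint16_t *r, unsigned v)
{
    *r = (uint16_t)((*r & ~MSGOUT_MASK) | ((v & 7u) << MSGOUT_SHIFT));
}

/* Step 1: the TMS320C30 may proceed only when MSGOUT reads 0. */
int gsp_idle(uint16_t r) { return msgout(r) == 0; }

/* Step 3: after filling the shadow RAM, post the request. */
void c30_post(uint16_t *r) { set_msgin(r, MSG_PENDING); }

/* Step 4: the GSP acknowledges a pending request. Returns 1 if it did. */
int gsp_accept(uint16_t *r)
{
    if (msgin(*r) != MSG_PENDING)
        return 0;
    set_msgout(r, MSG_PENDING);
    return 1;
}

/* Step 5: the TMS320C30 withdraws the request once acknowledged. */
int c30_withdraw(uint16_t *r)
{
    if (msgout(*r) != MSG_PENDING)
        return -1;
    set_msgin(r, 0);
    return 0;
}

/* Step 7: the GSP signals that the shadow RAM is free again. */
void gsp_done(uint16_t *r) { set_msgout(r, 0); }
```

The setters change only their own field, so control bits such as INCW keep their values throughout the exchange.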

## The TMS320C30 Drawing Routine

When the TMS320C30 receives a redraw-screen request from the PC, it sends a command to the GSP to clear the screen after the monitor has drawn the bottom line; this ensures that the last view was drawn in its entirety. The TMS320C30 then calls its draw_object routine with *universe as an argument. For each array of primitives within the object, the TMS320C30 sends the size of the array and the array of screen representations of the primitives themselves to the TMS34010. Thus, the TMS320C30 can request the GSP to draw arrays of points, lines, or filled polygons. Once all arrays are drawn, draw_object recursively executes for all child objects within the universe. In this manner, all objects defined within the system are drawn.
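The recursion can be outlined as below. The object layout, the callback type `send_fn`, and the example sink `tally` are assumptions for this sketch; the real routine transmits the arrays themselves to the TMS34010 rather than just their sizes.

```c
#include <stddef.h>

#define MAXOB 10

enum { PRIM_POINTS = 0, PRIM_LINES = 1, PRIM_POLYGONS = 2 };

typedef struct object {
    int npoints, nlines, npolys;     /* sizes of the primitive arrays */
    int obnum;                       /* number of child objects in use */
    struct object *child[MAXOB];
} object_t;

/* Stand-in for "send one array of primitives to the GSP". */
typedef void (*send_fn)(int kind, int count, void *ctx);

/* Sends the object's own arrays, then recurses into every child. */
void draw_object(const object_t *o, send_fn send, void *ctx)
{
    int i;
    send(PRIM_POINTS,   o->npoints, ctx);
    send(PRIM_LINES,    o->nlines,  ctx);
    send(PRIM_POLYGONS, o->npolys,  ctx);
    for (i = 0; i < o->obnum; i++)
        draw_object(o->child[i], send, ctx);
}

/* Example sink: ctx points to int[3]; adds up primitives per kind. */
void tally(int kind, int count, void *ctx)
{
    ((int *)ctx)[kind] += count;
}
```

Calling `draw_object` on the root of the hierarchy visits every object exactly once, which is why a redraw always starts from the *universe.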

## GSP System Initialization

Several initialization routines are provided in the TMS34010 Math/Graphics Function Library User's Guide [8]. The GSP executes these programs to properly configure the system before it begins its command detection loop:

- The call to init_video configures the graphics buffer for an NEC Multisync Monitor displaying $640 \times 480$ resolution.
- The init_graphics function initializes the graphics environment by setting up the data structures for the graphics functions and assigning default values to system parameters.
- The init_screen command initializes the screen. The entire frame buffer is cleared, and a color lookup table is loaded with the default color palette.
- The init_vuport function initializes the viewport data structures and opens viewport 0, the system, or root, window.
- The set_origin command sets the origin of the system to the center of the screen.


## Drawing Routines

Several drawing routines are also provided in the TMS34010 Math/Graphics Function Library User's Guide [8]:

- For each primitive in an array sent from the TMS320C30, the GSP sets the proper drawing color with the set_color command.
- The TMS320C30 commands the GSP to execute clear_screen before it starts to request drawing of primitives for the next view.
- The TMS320C30 requests a wait_scan execution from the GSP to ensure that the GSP has fully displayed the last view before drawing the current view.
- The GSP uses the draw_point(x, y) function to render a point on the display.
- Similarly, it uses the draw_line(x1, y1, x2, y2) command to draw a line. The arguments are the screen coordinates of the two endpoints of the segment.
- The fill_polygon(n, linelist, ptlist) function takes as arguments the number of vertices, an array of the line segments forming the sides of the polygon, and a list of screen coordinates referenced by the linelist.


## Summary

The TMS34010 Software Development board provides a good rendering module for this graphics system. The support hardware has been debugged and used in industry since 1987 and thus makes a reliable rendering subsystem. The target connector to the C30AB provides access to the TMS320C30 as an alternate host. Three PALs and two transceivers allow the TMS320C30 to assume control of the GSP, once both have started running their software. The draw_object program on the TMS320C30 can command the GSP to draw graphics primitives. Functions in the TMS34010 Math/Graphics Function Library User's Guide [8] allow the GSP to initialize the monitor interface, clear the screen, ensure that an entire screen has been drawn, and draw the graphics primitives. Overall, the TMS34010 development tools provide an easy means to develop a rendering subsystem for this graphics pipeline.

## Possible Improvements

Several changes may be incorporated into the system to improve performance. Some simple enhancements involve modifications of the computational subsystem's software to allow faster and more transparent command execution. Restructuring the method in which the data and command pass through the pipeline, a more complex modification, can greatly increase throughput. Additional features such as more complex primitives, lighting, windowing, and text display would require major software modifications to the system. However, any such modifications would not need to change the communication protocols or the command detection loops significantly. Finally, although the TMS320C30 represents the state-of-the-art in digital signal processing, the host processor and the rendering engine may be improved.

## Computational Subsystem Software

The drawing routine currently sends the primitive arrays of an object one at a time to the GSP. Instead, it should send all primitive arrays for all objects to be redrawn in a single pass. The GSP should then process the contents of this stack of commands and data.

Currently, as soon as the PC finishes requesting object adjustments, it must request recalculation of the screen coordinates of the location arrays. The screen_object routine must operate on all objects that have been adjusted, either directly or indirectly through the adjustment of an ancestor. Instead, this routine should be called once with *universe as the argument. The object structure should contain a flag that is set when an object is adjusted and reset when it is drawn. The new screen_object procedure would then recursively search down the hierarchy of objects until it encounters an object that has been adjusted, recalculate all the screen coordinates for that object and its internal objects, and then search the rest of the hierarchy for other adjusted objects. Thus, the host would have to request only adjustment, targeting, and draw commands. Screen representations would be recalculated automatically whenever a draw command is executed.
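
The proposed flag-driven search might look like the sketch below. It is an illustration only: the obj type, the adjusted and recalcs fields, and both function names are hypothetical, the recalculation itself is reduced to a counter, and the flag is cleared at recalculation time on the assumption that this happens as part of each draw.

```c
#include <stddef.h>

#define MAX_CHILDREN 4   /* hypothetical limit for this sketch */

typedef struct obj {
    int adjusted;    /* hypothetical flag: set when the object is adjusted */
    int recalcs;     /* counts recalculations, for illustration only */
    long obnum;      /* number of child objects */
    struct obj *children[MAX_CHILDREN];
} obj;

/* Recompute an adjusted object and everything beneath it. */
static void recalc_subtree(obj *o)
{
    long i;
    ++o->recalcs;            /* stands in for the screen coordinate math */
    o->adjusted = 0;
    for (i = 0; i < o->obnum; ++i)
        recalc_subtree(o->children[i]);
}

/* Search down the hierarchy and recompute only the adjusted subtrees. */
static void screen_adjusted(obj *o)
{
    long i;
    if (o->adjusted) {
        recalc_subtree(o);   /* descendants inherit the change */
        return;
    }
    for (i = 0; i < o->obnum; ++i)
        screen_adjusted(o->children[i]);
}
```

Subtrees that contain no adjusted object are visited but never recomputed, which is where the saving comes from.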

## Rendering Subsystem Software

Rendering subsystem drawing routines could be improved by designing functions coded to handle the primitive arrays rather than individual programming elements. These functions may be able to fit in the GSP's instruction cache and improve execution time.

## Improved Data Flow

One problem consistent at all stages of the system is the method of buffering. A single buffer usually contains all data and commands to be transferred from one stage to the next. Thus, during command execution one processor may wait for the other to relinquish control of the command buffer.

Two changes could improve the dual-port SRAM that connects the PC and the DSP. The first is to divide the SRAM into two buffers. The PC writes the current command to one buffer while the TMS320C30 processes the commands and data stored in the other, which prevents contention for the dual-port SRAM. The buffer that each processor controls is swapped on each command request. The second is to add three more $4 \mathrm{~K} \times 8$ dual-port SRAMs in parallel, which would allow the PC to communicate with the TMS320C30 in full 32-bit words. The masking and concatenation now needed to receive larger data types would then become unnecessary. In the original design, these additional RAMs would have consumed a prohibitive amount of board space, so full word size is possible only if space constraints are eased.
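
The masking and concatenation mentioned above can be illustrated with a short sketch. It assumes, as the masks in the conversion listing suggest, that each byte-wide dual-port location is read as a full word of which only the low 8 bits are meaningful. The function name dp_read32 and the use of stdint.h types are choices made for this sketch, not part of the original software.

```c
#include <stdint.h>

/* Assemble one 32-bit value from four consecutive byte-wide locations,
   least significant byte first. The upper bits of each word read from
   the dual-port SRAM are masked off because they are assumed undefined. */
static uint32_t dp_read32(const uint32_t *dual_port, unsigned offset)
{
    return ((dual_port[offset + 3] & 0xFFu) << 24)
         | ((dual_port[offset + 2] & 0xFFu) << 16)
         | ((dual_port[offset + 1] & 0xFFu) << 8)
         |  (dual_port[offset]     & 0xFFu);
}
```

With a 32-bit-wide dual-port memory, the same value would arrive in a single read and this function would disappear.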

The splitting of the command buffer between the TMS320C30 and the GSP allows the GSP to draw the current screen while the TMS320C30 sends the primitive arrays for the next. Similarly, two display buffers allow one buffer to be displayed on the monitor while the GSP draws the next view to the other.
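
A split (ping-pong) buffer of the kind described in this section can be sketched as below. The pingpong type, the buffer size, and the three helper functions are hypothetical, and the sketch leaves out the handshake that would decide when a swap is safe.

```c
#define BUF_WORDS 8   /* hypothetical buffer size for this sketch */

typedef struct {
    long buf[2][BUF_WORDS];
    int  fill;            /* index of the buffer the producer writes */
} pingpong;

/* Buffer currently owned by the producer (for example, the PC). */
static long *producer_buffer(pingpong *p)
{
    return p->buf[p->fill];
}

/* Buffer currently owned by the consumer (for example, the TMS320C30). */
static const long *consumer_buffer(const pingpong *p)
{
    return p->buf[1 - p->fill];
}

/* Exchange ownership of the two buffers on each command request. */
static void swap_buffers(pingpong *p)
{
    p->fill = 1 - p->fill;
}
```

Because the two sides never hold the same buffer between swaps, neither waits for the other to release it. The same arrangement applies to a pair of display buffers.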

## Computational Features

The DSP is well suited to many other computational features. Because these functions are more complex, they were not implemented in the limited design time available. This system truncates objects that are too high, too low, too far right, or too far left by using the GSP's drawing routines, which automatically clip coordinates outside the screen boundaries. However, the system cannot determine whether one object is in front of another and draw the objects appropriately. Functions that perform this hidden-surface removal require complex algorithms to determine whether one 3D surface obscures another. Simpler routines could clip objects that are too far away to see or that are behind the viewer.
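
One way to picture the simpler depth clipping is the sketch below, which classifies a location by its depth along the viewing axis before any perspective division. The names classify_depth and view_depth, the enumeration, and the near and far limits are assumptions for illustration. The depth is taken from row 2 of a 3 x 4 transform laid out like the p matrix in the listings.

```c
typedef enum { DEPTH_VISIBLE, DEPTH_BEHIND, DEPTH_TOO_FAR } depth_class;

/* Depth of a world location along the viewing axis, from row 2 of a
   3 x 4 transform matrix. */
static float view_depth(float p[3][4], float x, float y, float z)
{
    return x * p[2][0] + y * p[2][1] + z * p[2][2] + p[2][3];
}

/* Reject locations behind the viewer or beyond a chosen far limit. */
static depth_class classify_depth(float depth, float near_limit, float far_limit)
{
    if (depth <= near_limit) return DEPTH_BEHIND;
    if (depth > far_limit)   return DEPTH_TOO_FAR;
    return DEPTH_VISIBLE;
}
```

A drawing routine could skip any primitive whose locations are not all DEPTH_VISIBLE, at the cost of dropping primitives that are only partly out of range.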

A lighting feature would allow appropriate factors of light intensity and reflection to determine the shading of surfaces. Lighting may be ambient (equal everywhere) or come from several possible source geometries. Reflections could either be diffuse and scatter light equally in all directions, or be specular like those off any shiny surface. With these parameters, the TMS320C30 can compute the appropriate shading of a given pixel. In this scenario, the GSP is reduced to drawing single points with a given color. Thus, any lighting function would slow rendering time.
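
As a rough illustration of the shading computation, the sketch below combines an ambient term with a diffuse term that follows Lambert's cosine law. It is not taken from the system: the vec3 type, the function names, and the coefficient parameters are hypothetical, both vectors are assumed to be unit length, and specular reflection is left out.

```c
typedef struct { float x, y, z; } vec3;

static float dot3(vec3 a, vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Intensity in [0, 1] from an ambient term plus a diffuse term.
   normal and to_light are assumed to be unit vectors. */
static float shade(vec3 normal, vec3 to_light, float ambient, float diffuse)
{
    float lambert = dot3(normal, to_light);   /* cosine of incidence angle */
    float intensity;
    if (lambert < 0.0f) lambert = 0.0f;       /* surface faces away from light */
    intensity = ambient + diffuse * lambert;
    if (intensity > 1.0f) intensity = 1.0f;   /* saturate */
    return intensity;
}
```

The resulting intensity would still have to be mapped to an entry in the color palette before the GSP could draw the pixel.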

More complex primitives can be produced by using the TMS320C30 to generate arrays of pixels representing solutions to equations. The PC could dispatch a command to draw a primitive based on a particular type of equation (such as the parametric equations representing a sphere) and then load the appropriate parameters for that equation. The DSP would generate the appropriate set of pixels for that object and send it to the GSP as arrays of points.

## Rendering Features

The TMS34010 Math/Graphics Function Library [8] permits the user to create and select various windows for display. Once a window is selected the DSP can run the existing system software within that window. Thus, the host would also need to be able to direct the DSP to tell the GSP how to manipulate its windows. The Library also enables the GSP to print text on the screen. This feature also would not be very difficult to implement.

## A More Advanced Host

A more advanced host could be a high-speed RISC processor such as SPARC. This unit could communicate with the DSP at faster rates, so command transfers would consume less time. In addition, SPARC is a 32-bit machine, which could allow word transfers between host and DSP in a single instruction.

## A More Advanced Rendering Engine

The TMS34010's performance as a rendering engine could be improved. If the GSP could be ready to complete a transaction when the HRDY line is asserted and not some period of time later, the C30AB to SDB interface would be more straightforward and not require as many wait states. This problem is corrected in the second-generation GSP TMS34020, which was not available at the time of the design of this system. In addition, the TMS34020 also allows the host to transparently access the GSP's bus while the GSP continues processor functions.

## Conclusion

Despite its shortcomings, this system still demonstrates the dataflow in a graphics pipeline using a digital signal processor as a computational element. One main benefit of the digital signal processor is the availability of development tools such as C compilers, assembler/linkers, software development boards, and in-circuit emulators that shorten design time. The TMS320C30 also provides speeds comparable to many bit-slice processors that require programmers to develop extensive microcode routines. The hardware multiplier, floating-point capability, RISC architecture, and parallel bus access facilitate fast, precise graphics calculations. Overall, a digital signal processor provides an attractive option to the graphics system designer interested in making high-performance systems with quick turnaround time.

## References

1) TMS320C30 Simulator User's Guide (literature number SPRU017), Texas Instruments, 1989.
2) Third-Generation TMS320 User's Guide (literature number SPRU031), Texas Instruments, 1988.
3) TMS320C30 C Compiler User's Guide (literature number SPRU034), Texas Instruments, 1988.
4) TMS320C30 Assembly Language Tools User's Guide (literature number SPRU035), Texas Instruments, 1989.
5) TMS34010 Software Development Board Schematics (literature number SPVU003), Texas Instruments, 1986.
6) TMS34010 Software Development Board User's Guide (literature number SPVU002A), Texas Instruments, 1987.
7) TMS34010 User's Guide (literature number SPVU001A), Texas Instruments, 1988.
8) TMS34010 Math/Graphics Function Library User's Guide (literature number SPVU006), Texas Instruments, 1987.
9) TMS34010 C Compiler Reference Guide (literature number SPVU005A), Texas Instruments, 1986.
10) Foley, J.D. and Van Dam, A., Fundamentals of Interactive Computer Graphics, Addison-Wesley, 1984.

## Appendix A

## Graphics Programs

Listing Name

TMS320C30 C Structure Representing an Object
TMS320C30 C Structure Representing a Location
TMS320C30 C Structure Representing a Point
TMS320C30 C Structure Representing a Line
TMS320C30 C Structure Representing a Filled Polygon
TMS320C30 Communications Macros
TMS320C30 Global Variables
TMS320C30 Main Command Execution Loop
TMS320C30 Floating-Point Conversion Routine
TMS320C30 Object Loading Routine
TMS320C30 Screen Coordinate Calculation Routine
TMS320C30 Transformation Matrix Evaluation Routine
TMS320C30 Object Deletion Routine
TMS320C30 Request for Additional Data in Object Load
TMS320C30 Object Drawing Routine
TMS34010 Point Structure
TMS34010 Line Structure
TMS34010 Color Array
TMS34010 Color Palette
TMS34010 Main Command Execution Routine
PC Object Loading Data Structure
PC Communications Macros
PC Global Variables
PC Targeted Object Adjustment Routine
PC Routine to Set Parameters for an Object Load
PC Routine to Target Parent of Current Target Object
PC Routine to Target a Child of Current Target Object
PC Routine to Redraw Screen
PC Routine to Load the Primitives of a Wireframe Cube
PC Main Routine to Draw a "Planetary System" of Cubes

Listing 1: TMS320C30 C Structure Representing an Object

    struct object
    {
      struct object *parent;         /* object within whose frame the object is defined */
      long subnum;                   /* sibling number of object */
      long locnum;                   /* number of locations */
      long ptnum;                    /* number of points */
      long lnnum;                    /* number of lines */
      long pgnum;                    /* number of polygons */
      long obnum;                    /* number of daughter objects */
      float sx; float sy; float sz;
      float dx; float dy; float dz;
      float theta;                   /* angle of rotation around z-axis (x to y) */
      float phi;                     /* angle of rotation around x-axis (y to z) */
      float omega;                   /* angle of rotation around y-axis (z to x) */
      float r[3][4];                 /* matrix formed by scale, then offset, then rotate */
      float p[3][4];                 /* ascending product of all ancestral r matrices */
      loc *locs;                     /* pointer to location array */
      point *points;                 /* pointer to point array */
      line *lines;                   /* pointer to line array */
      polygon *polygons;             /* pointer to polygon array */
      struct object *objects[MAXOB]; /* pointer to array of pointers to child objects */
    };

Listing 2: TMS320C30 C Structure Representing a Location

    typedef struct
    {
      float x; float y; float z;     /* world coordinates */
      long a; long b;                /* screen coordinates */
    } loc;
Listing 3: TMS320C30 C Structure Representing a Point

    typedef struct
    {
      long color;
      long locn;                     /* number of location in location array */
    } point;

Listing 4: TMS320C30 C Structure Representing a Line

    typedef struct
    {
      long color;
      long startlocn;                /* start loc number */
      long endlocn;                  /* end loc number */
    } line;

Listing 5: TMS320C30 C Structure Representing a Filled Polygon

    typedef struct
    {
      long color;
      long vertnum;
      long *vertlocn;
    } polygon;

Listing 6: TMS320C30 Communications Macros

    #define DP0(a)  dual_port[a]
    #define DP1(a)  dual_port[a + 1]
    #define DP2(a)  dual_port[a + 2]
    #define DP3(a)  dual_port[a + 3]

Listing 7: TMS320C30 Global Variables

    long k, l;                           /* temporary and looping variables */
    struct object *universe, *to, *no;   /* universe, target object, next object */
    unsigned long *dual_port;            /* dual port SRAM */
    union                                /* variable to construct a c30 format */
    {                                    /* float from intel format allowing */
      float f;                           /* bit manipulation on a float */
      unsigned long i;
    } fllong;

Listing 8: TMS320C30 Main Command Execution Loop


    main()
    {
      register float twopi = 6.283185308;
      register long i, j;
      register long *hstdata = (long *) 0x805002;   /* 340 host data register */
      register long *hstcntl = (long *) 0x805003;   /* 340 host control register */
      dual_port = (unsigned long *) 0x804000;

      asm("    OR    0800h,ST");                     /* enable cache */
      /* set XF0 and assume control of 340SDB */
      /* set for zero internal, 10 wait states on both buses */
      *((unsigned long *) 0x808060) = 0;
      *((unsigned long *) 0x808064) = 0x1000;
      *hstcntl = CTLFREE;          /* turn off any request to TMS34010 */
      *dual_port = 0;              /* turn off any request from the PC */
      ACKNOWLEDGE = 0;             /* turn off any acknowledgement to the PC */
      /* allocate space for the internal object */
      to = (struct object *) malloc(sizeof(struct object));
      universe = to;               /* target universe */
      to->subnum = 0;              /* set universe sibling number to 0 */
      to->parent = to;             /* universal object is its own parent */
      while (COMMAND != 1);        /* first command must be a load object */
      while (COMMAND != 0);
      load_object();
      ACKNOWLEDGE = 0;
      matrix();
      for (;;)
      {

          j = DPLONG(2);               /* get daughter object number to target */
          ACKNOWLEDGE = 0;             /* show that dual port is free */
          if (j > to->obnum) break;    /* can only target existing object */
          to = to->objects[j];         /* target daughter object */
          break;

        case 3:                        /* TARGET PARENT OBJECT */
          ACKNOWLEDGE = 0;             /* show that dual port is free */
          to = to->parent;             /* set targeted object to parent */
          break;

        case 4:                        /* DELETE TARGETED OBJECT */
          ACKNOWLEDGE = 0;             /* show that dual port is free */
          if (to == universe) break;   /* don't allow deletion of universe */
          j = to->subnum + 1;          /* get number of next sibling */
          no = to->parent;             /* set next object to parent */
          delete_object(to);           /* delete current object */
          to = no;                     /* target parent object */
          l = to->obnum;               /* find total number of siblings */
          /* decrement sibling number on all younger siblings */
          for (i = j; i <= l; ++i) --to->objects[i]->subnum;
          --to->obnum;                 /* decrement total number of daughter objects */
          break;

        case 5:                        /* ADJUST TARGETED OBJECT */
          to->sx *= dpfloat(2);        /* adjust scales */
          to->sy *= dpfloat(6);
          to->sz *= dpfloat(10);
          to->dx += dpfloat(14);       /* adjust offsets */
          to->dy += dpfloat(18);
          to->dz += dpfloat(22);
          to->theta += dpfloat(26);    /* adjust angles */
          to->phi += dpfloat(30);
          to->omega += dpfloat(34);
          ACKNOWLEDGE = 0;             /* show that dual port is free */
          /* keep angles in the (0, 2pi) range */
          to->theta = fmod(to->theta, twopi);
          to->phi = fmod(to->phi, twopi);
          to->omega = fmod(to->omega, twopi);
          matrix(to);                  /* recalculate transform matrix */
          break;

        case 6:                        /* DRAW UNIVERSE */
          ACKNOWLEDGE = 0;             /* show that dual port is free */
          while (HOSTCNTL != CTLFREE); /* wait for 340 to be free */
          *hstdata = 4;                /* enter command for a screen clear */
          *hstcntl = CTLREQ;           /* request service from 340 */
          while (HOSTCNTL != CTLACK);  /* wait for acknowledgement */
          *hstcntl = CTLWITH;          /* withdraw request */
          while (HOSTCNTL != CTLFREE); /* wait for 340 to be free */
          *hstdata = 6;                /* enter command for a scanline */
          *hstcntl = CTLREQ;           /* request service from 340 */
          while (HOSTCNTL != CTLACK);  /* wait for acknowledgement */
          *hstcntl = CTLWITH;          /* withdraw request */
          draw_object(universe);
          break;

        case 7:                        /* CALCULATE SCREEN COORDINATES */
          /* +++WARNING+++ the PC user must execute a screen command to */
          /* screen all objects that have been adjusted since the last  */
          /* draw before the next draw. However, if an object is        */
          /* screened all daughter objects are as well.                 */
          ACKNOWLEDGE = 0;             /* show that dual port is free */
          screen_object(to);           /* calculate screen coordinates */
          break;

        default:
          ACKNOWLEDGE = 0;             /* show that dual port is free */
        }
      }
    }

Listing 9: TMS320C30 Floating-Point Conversion Routine

    float dpfloat(a)
    register unsigned long a;            /* offset from start of dual port SRAM */
    {
      register unsigned long sign;
      unsigned long mant, ex;

      a = (DP3(a) << 24)                 /* concatenate 4-byte value */
        | ((DP2(a) & 0x00FF) << 16)
        | ((DP1(a) & 0x00FF) << 8)
        | (DP0(a) & 0x00FF);
      sign = (a & 0x80000000) >> 8;      /* extract and reposition sign bit */
      ex = ((a & 0x7F800000)             /* extract exponent */
            - 0x3f800000) << 1;          /* converts to 2's complement */
      if (sign)
      {
        mant = (-a) & 0x007FFFFF;        /* takes 2's complement of mantissa */
        if (mant == 0) ex -= 0x01000000; /* checks for input mantissa of -2 */
      }
      else mant = a & 0x007FFFFF;        /* otherwise leave mantissa alone */
      a = sign + mant + ex;              /* reconstruct floating-point fields */
      fllong.i = a;
      return fllong.f;                   /* return reconstructed float */
    }

Listing 10: TMS320C30 Object Loading Routine

```
void load_object()
{
  register long i, j;
  register struct object *o;     /* pointer to target object */
  register loc *temploc;         /* temporary location pointer */
  register line *templn;         /* temporary line pointer */
  polygon *temppg;               /* temporary polygon pointer */
  point *temppt;                 /* temporary point pointer */
  long lc = DPLONG(2);           /* number of coordinate locations */
  long pt = DPLONG(4);           /* number of points */
  long ln = DPLONG(6);           /* number of lines */
  long pg = DPLONG(8);           /* number of polygons */

  o = to;                        /* set target object as object for loading */
  /* initialize primitive numbers and transform parameters */
  o->locnum = lc;
  o->ptnum = pt;
  o->lnnum = ln;
  o->pgnum = pg;
  o->sx = dpfloat(10);    o->sy = dpfloat(14);  o->sz = dpfloat(18);
  o->dx = dpfloat(22);    o->dy = dpfloat(26);  o->dz = dpfloat(30);
  o->theta = dpfloat(34); o->phi = dpfloat(38); o->omega = dpfloat(42);

  /* ALLOCATE SPACE FOR OBJECT PRIMITIVES */
  o->locs = (loc *) malloc(sizeof(loc) * lc);
  o->points = (point *) malloc(sizeof(point) * pt);
  o->lines = (line *) malloc(sizeof(line) * ln);
  o->polygons = (polygon *) malloc(sizeof(polygon) * pg);

  /* LOAD UPTO 377 LOCATIONS PER OBJECT */
  for (i = 0, j = 46; i < lc; ++i, j += 12)
  {
    temploc = &(o->locs[i]);             /* save temporary location */
    /* load world coordinates */
    temploc->x = dpfloat(j);
    temploc->y = dpfloat(j + 4);
    temploc->z = dpfloat(j + 8);
  }

  /* LOAD UPTO 2047 POINTS PER OBJECT */
  if (pt)
  {
    more_data();
    for (i = 0, j = 2; i < pt; ++i, j += 4)
    {
      temppt = &(o->points[i]);          /* save point pointer */
      temppt->color = DPLONG(j);         /* get color */
      temppt->locn = DPLONG(j + 2);      /* get point location */
    }
  }

  /* LOAD UPTO 1364 LINES */
  if (ln)
  {
    more_data();
    for (i = 0, j = 2; i < ln; ++i, j += 6)
    {
      templn = &(o->lines[i]);           /* set temporary line */
      templn->color = DPLONG(j);         /* get color */
      templn->startlocn = DPLONG(j + 2); /* get starting location */
      templn->endlocn = DPLONG(j + 4);   /* get ending location */
    }
  }

  /* LOAD ONE POLYGON AT A TIME */
  if (pg)
    for (i = 0; i < pg; ++i)
    {
      more_data();
      temppg = &(o->polygons[i]);
      temppg->color = DPLONG(2);
      l = DPLONG(4);
      temppg->vertnum = l;
      /* allocate space for vertex location list */
      temppg->vertlocn = (long *) malloc(sizeof(long) * l);
      for (k = 0, j = 6; k < l; ++k, j += 2)   /* load vertices */
        temppg->vertlocn[k] = DPLONG(j);       /* set vertex location */
    }
}
```

Listing 11: TMS320C30 Screen Coordinate Calculation Routine

```
void screen_object(o)
register struct object *o;
{
  register long i, j;              /* temporary and looping variables */
  register loc *temploc;           /* temporary location pointer */
  register struct object *tempob;
  register float x, y;             /* co-ordinate floating point values */
  float z, d;                      /* and perspective constant */

  tempob = o->parent;              /* set temporary object to parent object */

  /* COMPUTE PARENT MATRIX */
  /* if object is universe set parent matrix to transform matrix r */
  if (o == universe)
  {
    for (i = 0; i < 3; ++i) for (j = 0; j < 4; ++j) o->p[i][j] = o->r[i][j];
  }
  /* otherwise p matrix is product of r matrix and parent's p matrix */
  else for (i = 0; i < 3; ++i)
  {
    o->p[i][0] = o->r[0][0] * tempob->p[i][0] + o->r[1][0] * tempob->p[i][1]
               + o->r[2][0] * tempob->p[i][2];
    o->p[i][1] = o->r[0][1] * tempob->p[i][0] + o->r[1][1] * tempob->p[i][1]
               + o->r[2][1] * tempob->p[i][2];
    o->p[i][2] = o->r[0][2] * tempob->p[i][0] + o->r[1][2] * tempob->p[i][1]
               + o->r[2][2] * tempob->p[i][2];
    o->p[i][3] = o->r[0][3] * tempob->p[i][0] + o->r[1][3] * tempob->p[i][1]
               + o->r[2][3] * tempob->p[i][2] + tempob->p[i][3];
  }

  /* COMPUTE SCREEN COORDINATES */
  j = o->locnum;                   /* get number of locations */
  for (i = 0; i < j; ++i)
  {
    temploc = &(o->locs[i]);       /* set temporary location */
    /* save global coordinates */
    x = temploc->x; y = temploc->y; z = temploc->z;
    /* calculate z value, add offset of 5, and invert for perspective */
    d = 1 / (x * o->p[2][0] + y * o->p[2][1] + z * o->p[2][2] + o->p[2][3] + 10);
    /* calculate transformed x and y, add perspective, and scale to screen */
    k = (long)((x * o->p[0][0] + y * o->p[0][1]
              + z * o->p[0][2] + o->p[0][3]) * d * 200);
    l = (long)((x * o->p[1][0] + y * o->p[1][1]
              + z * o->p[1][2] + o->p[1][3]) * d * 200);
    /* clip to a 16 bit integer */
    if (k > 32000) k = 32000; else if (k < -32000) k = -32000;
    /* set screen coordinates */
    temploc->a = k;
    temploc->b = l;
  }
  /* screen all internal objects */
  j = o->obnum;
  for (i = 0; i <= j; ++i) screen_object(o->objects[i]);
}

/* Listing 12: TMS320C30 Transformation Matrix Evaluation Routine */
matrix()
{
  register struct object *o;
  float cost, sint, coso, sino, cosp, sinp;   /* variables */

  o = to;
  cost = cos(o->theta);
  sint = sin(o->theta);
  coso = cos(o->omega);
  sino = sin(o->omega);
  cosp = cos(o->phi);
  sinp = sin(o->phi);
  o->r[0][0] = o->sx * cost * coso;
  o->r[0][1] = -o->sy * sint * coso;
  o->r[0][2] = o->sz * sino;
  o->r[0][3] = (o->dx * cost - o->dy * sint) * coso + o->dz * sino;
  o->r[1][0] = o->sx * (sint * cosp + cost * sino * sinp);
  o->r[1][1] = o->sy * (cost * cosp - sint * sino * sinp);
  o->r[1][2] = -o->sz * coso * sinp;
  o->r[1][3] = ((o->dx * cost - o->dy * sint) * sino - o->dz * coso) * sinp
             + (o->dx * sint + o->dy * cost) * cosp;
  o->r[2][0] = o->sx * (sint * sinp - cost * sino * cosp);
  o->r[2][1] = o->sy * (cost * sinp + sint * sino * cosp);
  o->r[2][2] = o->sz * coso * cosp;
  o->r[2][3] = ((-o->dx * cost + o->dy * sint) * sino + o->dz * coso) * cosp
             + (o->dx * sint + o->dy * cost) * sinp;
}


/* Listing 13: TMS320C30 Object Deletion Routine */
void delete_object(o)
register struct object *o;
{
  register long i, j;              /* temporary, looping variables */

  free(o->locs);                   /* delete location array */
  free(o->points);                 /* delete point array */
  free(o->lines);                  /* delete line array */
  j = o->pgnum;                    /* get number of polygons */
  for (i = 0; i < j; ++i) free(o->polygons[i].vertlocn);   /* delete */
  free(o->polygons);                                       /* polygons */
  j = o->obnum;                    /* get number of daughter objects */
  for (i = 0; i <= j; ++i) delete_object(o->objects[i]);   /* delete objects */
  free(o);                         /* delete object */
}

/* Listing 14: TMS320C30 Request for Additional Data in Object Load */
void more_data()
{
  ACKNOWLEDGE = 127;               /* request more data */
  while (COMMAND != 127);          /* wait for more data */
  ACKNOWLEDGE = 1;                 /* restore old acknowledge */
  while (COMMAND != 0);            /* wait for PC to resume old command */
}
/* Listing 15: TMS320C30 Object Drawing Routine */
void draw_object(o)
register struct object *o;
{
  register long i;                 /* temporary, looping variable */
  register loc *temploc;           /* temporary location pointer */
  point *temppt;                   /* temporary point pointer */
  register line *templn;           /* temporary line pointer */
  polygon *temppg;                 /* temporary polygon pointer */
  register long *hstdata = (long *) 0x805002;   /* 340 host data register */
  register long *hstcntl = (long *) 0x805003;   /* 340 host control register */
  register long j = o->lnnum;      /* get number of lines */

  /* DRAW ANY LINES */
  if (j)
  {
    while (HOSTCNTL != CTLFREE);   /* wait till 340 is free */
    *hstdata = 123;                /* send command to draw object */
    *hstcntl = CTLREQ;             /* request service from 340 */
    *hstdata = j;                  /* send number of lines */
    for (i = 0; i < j; ++i)        /* send lines */
    {
      templn = &(o->lines[i]);                  /* save line pointer */
      *hstdata = templn->color;                 /* send color */
      *hstdata = o->locs[templn->startlocn].a;  /* coordinates */
      *hstdata = o->locs[templn->startlocn].b;
      *hstdata = o->locs[templn->endlocn].a;
      *hstdata = o->locs[templn->endlocn].b;
    }
    while (HOSTCNTL != CTLACK);    /* wait for 340 to acknowledge request */
    *hstcntl = CTLWITH;            /* withdraw request */
  }

  /* DRAW ANY POINTS */
  j = o->ptnum;                    /* get number of points */
  if (j)
  {
    while (HOSTCNTL != CTLFREE);   /* wait till 340 is free */
    *hstdata = 1;                  /* send command to draw object */
    *hstcntl = CTLREQ;             /* request service from 340 */
    *hstdata = j;                  /* send number of points */
    for (i = 0; i < j; ++i)        /* send points */
    {
      temppt = &(o->points[i]);    /* save point pointer */
      *hstdata = temppt->color;    /* send color */
      *hstdata = o->locs[temppt->locn].a;
      *hstdata = o->locs[temppt->locn].b;
    }
    while (HOSTCNTL != CTLACK);    /* wait for 340 to acknowledge request */
    *hstcntl = CTLWITH;            /* withdraw request */
  }

  /* DRAW ANY POLYGONS */
  l = o->pgnum;
  if (l)
    for (i = 0; i < l; ++i)        /* draw polygons */
    {
      temppg = &(o->polygons[i]);
      while (HOSTCNTL != CTLFREE); /* wait till 340 is free */
      j = temppg->vertnum;
      *hstdata = 5;                /* send command to draw object */
      *hstcntl = CTLREQ;           /* request service from 340 */
      *hstdata = temppg->color;    /* send color */
      *hstdata = j;                /* send number of vertices */
      /* send point connect list (0,1, 1,2, 2,3 .... j-2,j-1, j-1,0) */
      *hstdata = 0;
      for (k = 1; k < j; ++k)
      {
        *hstdata = k; *hstdata = k;
      }
      *hstdata = 0;
      /* send vertex location list */
      for (k = 0; k < j; ++k)
      {
                temploc = &(0->locs[temppg->vertlocn[k]]); /* save point
                            */
            *hstdata = temploc->a; *hstdata = temploc->b;
        3
        while(HOSTCNTL != CTLACK); /* wait for 340 to acknolwedge request*/
        *hstentl = CTLWITH; I* withdraw request
O
/* DRAW ANY DAUGHTER ORJECTS
```

j = 0-3obnum;

```
j = 0-3obnum;
                    /* get daughter objects */
                    /* get daughter objects */
for (i = 0; i <= j; ++i) draw_object(0->objects[i]);
******************************************************************************
******************************************************************************
--\Listing 16: TMS34010 Point Structure
lymeref struct n
} Doint;
***************************************************#**************************
```


****************************************************************************

-->Listing 17: TMS34010 Line Structure

```
typedef struct
{
    short color;    /* line color                       */
    short x1;       /* x co-ordinate of starting point  */
    short y1;       /* y co-ordinate of starting point  */
    short x2;       /* x co-ordinate of end point       */
    short y2;       /* y co-ordinate of end point       */
} line;
```

****************************************************************************

-->Listing 18: TMS34010 Color Array

long color $[16]=($
 CO13, CC14, © C15;
****************************************************************************

****************************************************************************

-->Listing 19: TMS34010 Color Palette
short mypalet[16] = {
    0x0000, 0xF000, 0x00F0, 0xF0F0, 0x0F00, 0xFF00, 0x0FF0, 0xFFF0,
    0x0AF0, 0x0900, 0xFA70, 0x5440, 0x17B0, 0x6660, 0x9990, 0xBBB0 };


## 

-->Listing 20: TMS34010 Main Command Execution Routine


## main() <br> 亿

register line *templn;
register point *temppt;
register short tempint;
register short i;
line *lines;
$\begin{array}{lll}\text { line *lines; } & / * \text { pointer to line array } & * / \\ \text { Doint *peints; } & / * \text { pointer to point array } & * /\end{array}$
short *istadrh, *hstadri, *histctll, *number, *pgnum, *pointer, adrl, adrh;
*( (short *) $0 \times 04000000$ ) $=0 \times 0001$; /* turn on shadou ram $\$ /$

lines $=($ line *) $(0 \times F F F 00020)$;
points $=($ point *) $(0 \times F F F 00020)$;
adrl $=$ (short) (( (long) pointer) \& $0 \times 0000 \mathrm{FFFF}$ );
adrh $=$ (short) ((()long) pointer) ) 16) \& 0x0000FFFF)
init_grafix();
init-grafix();
init_vuport ();
init_vuport();
set_origin ( 320,240
*hstadrh = adrh;
*hstadrl $=\operatorname{adrl}$;
*hstctll $=0$;
for (; ; )
i
while (*hstctl1 $:=0 \times 0003$ ); *hstctll $=0 \times 0030$;
while (*histct11 ! $=0 \times 0030$ ); switch (*pointer) i
/* temporary line pointer
/* temporary point pointe
*)
*/
/* temporary integer
** looping variable
/* pointer to line array
$\square$*/
*/
+1
instctll $=$ (short *) $0 \times C 00000 F 0 ; 1 *$ host control register low byte
histadrh $=($ short $*) 0 \times C 00000 E 0$; $/ *$ hast address register high word. */
histadrl $=($ short *) $0 \times C 0000000 ; \quad / *$ host address register low word $\quad \$ /$
pointer $=($ shart $*) 0 \times F F F 00000 ; \quad 1 *$ pointer to beginning of shadow ram $* / /$
** starting point of line array */
Dgnum $=($ short *) (0xFFF00020); / *location of number of polygon verteces $* /$
number $=$ (short *) (0xFFF00010); /* number of primitives to draw */
init-video(1); /* conifigure for a NEC MLLTISYNC, non-interlaced, 60 Hz */

. /* initialize graphics environment */ * initialize graphics environment */ * initialize screen

* initialize viewing window
/* place origin at center of screen */
* reset start data address
/* turn off any command to the 340 */
/* wait for request from the C30 */
/* acknouledge request $\quad$ // /* decode comand foad data withdraw*/ /* decode command

        case 123:                                  /* DRAW LINES       */
            tempint = *number;                     /* get number       */
            for (i = 0; i < tempint; ++i)          /* of lines         */
            {
                templn = &(lines[i]);              /* set line pointer */
                set_color1(color[templn->color]);  /* set color        */
                draw_line(templn->x1,              /* draw line        */
                          templn->y1,
                          templn->x2,
                          templn->y2);
            }
            *hstadrh = adrh; *hstadrl = adrl; *hstctll = 0;
            break;

§
tempt $=\alpha(p o i n t s[i]) ;$
set-colorl(colorttewppt->color]);
temppt $->x$,
temppt $-3 y$ )
)
*hstadrl = adrl; $\quad$ : turn off any comand to the 340 *
se 3:
1) reset start dat
hstadrl = adrl;
histctll $=0$;
break;
*hstadrh = adrh;
*hstadrl = adrl;
hstctll = 0 ;
** turn off any command to the 340 */
case 5:
set_color (color[tnumber])
tempint $=$ *pgnun;
* get number af yerteces */
(tempint,
/* get number of verteces */
(short *) (pointer + 3),
hstadrh $=$ adrhi $\quad / *$ reset start data address $\quad * /$
*hstadrl $=$ adrl;
thistctll $=0 ; \quad / *$ turn off any command to the $340 \quad * /$
break
thstadrh = adrh;
thstadrl = adrl;
*hstetll $=0$;
wait_scan(0);
wait_scan(479);
break
*hstadrh $=$ adrh;
hstadrl =adr
break;
3
,
*****************************************************************************
--ไtisting 21: PC Object Loading Data Structure

trans;
******************

滕

-->Listing 22: PC Communications Macros

#define DATASHORT(a)  *((unsigned short *) (dual_port + a))
#define DATAFLOAT(a)  *((float *) (dual_port + a))
#define COMMAND       *dual_port
#define ACKNOWLEDGE   *((unsigned char *) 0xE0008001)
*************************************************************************
-->Listing 23: PC Global Variables
char *dual_port;      /* dual port sram connecting to c30 SWDS */
trans *data;
***************************************************
***************************************************************************
-->Listing 24: PC Targeted Object Adjustment Routine
void adjust_object(sx, sy, sz, dx, dy, dz, theta, phi, omega)
double sx, sy, sz, dx, dy, dz, theta, phi, omega;
{
    while (ACKNOWLEDGE != 0);
    DATAFLOAT(2)  = sx;     DATAFLOAT(6)  = sy;   DATAFLOAT(10) = sz;
    DATAFLOAT(14) = dx;     DATAFLOAT(18) = dy;   DATAFLOAT(22) = dz;
    DATAFLOAT(26) = theta;  DATAFLOAT(30) = phi;  DATAFLOAT(34) = omega;
    COMMAND = 5;
    while (ACKNOWLEDGE != 5);
    COMMAND = 0;
}


## 

-->Listing 25: PC Routine to Set Parameters for an Object Load
void set_parameters(sx, sy, sz, dx, dy, dz, theta, phi, omega)
double sx, sy, sz, dx, dy, dz, theta, phi, omega;
{
    while (ACKNOWLEDGE != 0);              /* wait for C30 to be free */
    data->sx = sx;        data->sy = sy;    data->sz = sz;
    data->dx = dx;        data->dy = dy;    data->dz = dz;
    data->theta = theta;  data->phi = phi;  data->omega = omega;
}
****************************************************************************

-->Listing 26: PC Routine to Target Parent of Current Target Object
void target_parent()
{
    while (ACKNOWLEDGE != 0);    /* wait for C30 to be free             */
    COMMAND = 3;                 /* command to target parent object     */
    while (ACKNOWLEDGE != 3);    /* wait for C30 to acknowledge request */
    COMMAND = 0;                 /* withdraw request                    */
}

## 

-->Listing 27: PC Routine to Target a Child of Current Target Object
void target_child(x)
int $x$;


COMMAND $=2$;
/* wait for C30 to acknowlege request*/
)
******************************************************************

-->Listing 28: PC Routine to Redraw Screen

void draw_object()
{
    while (ACKNOWLEDGE != 0);    /* wait for C30 to be free             */
    COMMAND = 7;                 /* command to compute screen co-ords   */
    while (ACKNOWLEDGE != 7);    /* wait for C30 to acknowledge request */
    COMMAND = 0;                 /* withdraw request                    */
    while (ACKNOWLEDGE != 0);    /* wait for C30 to be free             */
    COMMAND = 6;                 /* command to draw screen              */
    while (ACKNOWLEDGE != 6);    /* wait for C30 to acknowledge request */
    COMMAND = 0;                 /* withdraw request                    */
}



## H2

-->Listing 29: PC Routine to Load the Primitives of a Wireframe Cube

)

## 

## 人Listing 30: FC Main Routine to Draw a "Planetary System of" Cubes

## eaial

register int $x_{i}$
dual_pert $=$ (char a) $0 \times 50008000 ;$ /" location of dual port sras
data $=$ (trans t) 0xE0008002; /4 lecation of object data
crand $=0 ;$; $1.0001, .0001, .0001,0 ., 0 ., 0 ., 0 ., 0 ., 0.1$;
et_paraas
sube (3);
,
ube (2);
et_paraanters (.2, 2, 2, 0. ,5. ,0., 0. , 0. ,0.)
ube (6);
arget_parent (1;
set_paraseters $(.2,2,2,2,0 .,-5 ., 0 ., 0 ., 0 ., 0.1$; cube (4);
arget_parent ();
target_parent ( )
set_parameters $(.3, .3, .3,0 ., 0 ., 6 ., 0 ., 0 ., 0.1$
cube (5);
arget_parent (1;
set_parameters 1.3, .3, $3,0 ., 6 ., 0 ., 0 ., 0 ., 0.1$;
cube (1);
terget_parent I):
set_paraneters $(3,3,3,3,0 ., 0 .,-6 ., 0 ., 0 ., 0.1$;
cube (5);
target_parent()
set_parametersi.3, 3,.3,0.,-6. ,0., 0., 0., 0.);
ube (1);
target_parent()
for $(x=0 ; x<1000 ;++x)$
${ }_{i}$ forl
adjust_object(1.00s2b, 1.0092b, 1.0092b, 0., 0. , 0., 0. , 0. , . 2 )
target_child(1);
adjust_ob ject(1., 1., 1., 0., 0. ,0., 0. , 2, 0. );
target_parent ();
target_edild(2);
adjust_object(1.,1., 1., 0., 0. ,0., 0., .2,0.);
target_parent ();
target_child(3);
adjust_ob ject(1., 1., 1., 0., 0., 0. ,0., . 2, 0.);
target_parent ();
target_caild (4);
adjust_object(1. ,1., 1., 0., 0. , 0, , 0. , 2, 0. 1 ;
target-parent (1)
target_child(0)
adjust_object(1., 1., 1. , 0. , 0. , 0. , 0. , 0. ,-, 4);
target_child $(0)$;
adjust_object(1.,1., 1., 0. , 0. , 0. , . 4, 0. , 0. );
target_parent(1);
target_child(1);

target_parent ();
target-partat();
screen_object();
dransereenl);
$\}$
for $(x=0 ; x<1100 ;+4 x)$
adjust_soject(1.,1.,1., 0. , 0., 0. ,0.,..005, .21; target_child(1),
adjust_object(1.,1.,1.,0.,0.,0., 0.,.25,0.); target_parent ();
target_child(2);
adjust_object (1., 1. ,1., 0. , 0., 0. ,0., 25,0.);
target_parent ();
target_child(3);
adjust_os ject(1., 1. ,1., 0. , 0. , 0. ,0.,.25,0.1;
target_parent ();
target_child(4);
adjust_ob ject (1., 1., 1. , 0. , 0. , 0. , 0., . 25,0.1;
terget_parent();
target_child(0);
target_child(0);
target_childi(0);
adjust_object(1., 1., 1., 0., 0., 0.,., 3, 0., 0. $)_{\text {; }}$
adjust_object (1.,
target_parent();
target_parent ();
target_child(1);
adjust_object(1., 1., 1., 0., 0. ,0.,.3,0., 0. );
target_parent ();
target_parent();
screen_objectl);
draw_sereeall;

## Part VI. Tools

13. The TMS320C30 Applications Board Functional Description (Tony Coomes and Nat Seshan)

# The TMS320C30 Applications Board Functional Description 

Tony Coomes, Software Development Systems<br>Nat Seshan, Digital Signal Processor Products<br>Semiconductor Group<br>Texas Instruments

## Introduction

This report describes the architecture of the TMS320C30 Applications Board (APPB), which is part of the TMS320C30 XDS1000 Development System. The XDS1000 is an in-circuit emulation tool for TMS320C30 hardware/software system development. The APPB was designed with two goals: to provide a basic platform for software development and to provide a variety of interfaces to the TMS320C30. There are four key interfaces used on the APPB:

1) SRAM
2) EPROM
3) Dual-port SRAM
4) DRAM

The SRAM and EPROM interfaces on the APPB are quite simple; thus, this report focuses on the dual-port SRAM and the DRAM interfaces. Figure 1 shows a basic block diagram of the APPB.

Figure 1. TMS320C30 Applications Board (APPB) Block Diagram


The APPB features include the following:

- TMS320C30/host communications via a designated, relocatable 4K-byte dual-bus SRAM memory block.
- 16K-words (64K-bytes) zero wait-state SRAM on the TMS320C30 primary bus (STRB).
- 2K-words of one wait-state EPROM for interrupt and reset vectors on the TMS320C30 primary bus.
- 16K-words (64K-bytes) zero wait-state SRAM on the TMS320C30 expansion bus (MSTRB). The SRAM can be selected in either one of two 8K-word banks.
- I/O expansion bus.
- 512K-words of DRAM on the TMS320C30 primary bus.
- Emulation port.
- IBM PC, PC/XT, PC/AT support.

The remainder of this document describes each interface in more detail.

## Host/TMS320C30 Interface

The host/TMS320C30 interface is composed of two basic blocks, the dual-port SRAM and the control logic. The control logic consists of address decoding, a read/write control register, and a write-only mapping register. The control registers are mapped into the host I/O space as shown in Table 1. Figure 2 is a block diagram of the host interface.

Table 1. Host I/O Memory Locations for Control Registers

| Host I/O Memory Locations | Contents |
| :---: | :--- |
| 0330-0337 | Semaphores (LSB is the only valid bit) |
| 0338 | Dual-port SRAM mapping register Q |
| 0339 | Control register R |

Figure 2. Host Interface Block Diagram


One of the major problems in developing an application for a PC is finding a block of memory that does not conflict with other memory-mapped cards. To ease this problem, the dual-port SRAM interface has been designed to be relocatable on 4K-byte boundaries throughout the lower 1M-bytes of host memory space. A software example of how to map the dual-port SRAM into this space is given later in this report.

Writing a value to a hardware mapping register on the APPB relocates the dual-port SRAM. When a host memory access is generated, the value in the mapping register is compared to host address bits A12-A19. If they match, a dual-port SRAM access is allowed. To ensure PC and PC/XT compatibility, the dual-port SRAM can be located only in the lower 1M-bytes of host memory.
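The address comparison described above can be sketched in C. This is a minimal illustration, not the report's own mapping routine: the function name `appb_map_value` and the macro names are hypothetical, and only the port numbers (Table 1) and the use of address bits A12-A19 come from the text.

```c
#include <stdint.h>

/* Host I/O ports from Table 1; the macro names are illustrative only. */
#define APPB_PORT_MAP   0x0338u   /* dual-port SRAM mapping register Q */
#define APPB_PORT_CTRL  0x0339u   /* control register R                */

/*
 * Return the 8-bit value to load into the mapping register for a given
 * host base address, i.e. host address bits A12-A19.
 * Returns -1 if the address is not on a 4K-byte boundary or lies
 * outside the lower 1M-bytes of host memory.
 */
int appb_map_value(uint32_t host_addr)
{
    if (host_addr >= 0x100000UL)      /* above the lower 1M-bytes   */
        return -1;
    if (host_addr & 0xFFFUL)          /* not 4K-byte aligned        */
        return -1;
    return (int)((host_addr >> 12) & 0xFFu);
}
```

For a base address of D0000h, for example, the function returns D0h. On the host, that value would then be written to I/O port 0338h and the DPSEL bit set through port 0339h, using whatever port-output routine the host compiler provides.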

The APPB contains one general-purpose control register. This register is broken into two four-bit nibbles. The lower nibble can be read from and written to by the host and read by the TMS320C30. The upper nibble can be read from and written to by the TMS320C30 and read by the host. The lower nibble of the control register is cleared by any reset to or from the host PC. The upper nibble of the control register is cleared by any reset to the TMS320C30. The names of the APPB control register bits and host/TMS320C30 access capabilities are given in Table 2. Table 3 gives the control register bit definitions.

Table 2. APPB General-Purpose Control Register Bits

| Bit | Name | Host Access | C30 Access |
| :---: | :--- | :--- | :--- |
| 0 | CINT | Write/Read | Read only |
| 1 | XINTCLR | Write/Read | Read only |
| 2 | DPSEL | Write/Read | Read only |
| 3 | SWRESET | Write/Read | Read only |
| 4 | XINT | Read only | Write/Read |
| 5 | CINTCLR | Read only | Write/Read |
| 6 | MBANK | Read only | Write/Read |
| 7 | MSWAP | Read only | Write/Read |

Table 3. APPB General-Purpose Control Register Bit Definitions

| Bit | Name | Function |
| :---: | :---: | :---: |
| 0 | CINT | Interrupt (INT0) to the TMS320C30. The host may interrupt the TMS320C30 by setting this bit to 1. The TMS320C30 clears and re-enables CINT by writing 0, then 1 to CINTCLR. The host cannot generate an interrupt to the TMS320C30 while CINTCLR = 0. On reset, CINT is read as a 0. |
| 1 | XINTCLR | Clears and disables interrupts from the TMS320C30 to the host (XINT). XINTCLR must be set to 1 before the TMS320C30 can generate an interrupt to the host. The host clears and re-enables XINT by writing 0, then 1 to XINTCLR. On reset, XINTCLR is read as a 0. |
| 2 | DPSEL | Dual-port SRAM select. When this bit is set to 1, the dual-port SRAM is memory-mapped in the 4K-byte space of the host PC specified by the 8-bit value in register Q. When DPSEL = 0, the dual-port SRAM will not be mapped in the host PC's address space. On reset, DPSEL is read as a 0. |
| 3 | SWRESET | TMS320C30 SWDS soft reset. SWRESET = 0 resets the TMS320C30 SWDS. SWRESET must be set to 1 to take the SWDS out of the reset state. On reset (power on), SWRESET is read as a 0. |
| 4 | XINT | Interrupt to the host PC. The TMS320C30 may interrupt the host by setting this bit to 1. The host clears and re-enables XINT by writing 0, then 1 to XINTCLR. The TMS320C30 cannot generate an interrupt to the host while XINTCLR = 0. On reset, XINT is read as a 0. |
| 5 | CINTCLR | Clears and disables interrupts from the host to the TMS320C30 (CINT). CINTCLR must be set to 1 before the host can generate an interrupt to the TMS320C30. The TMS320C30 clears and re-enables CINT by writing 0, then 1 to CINTCLR. On reset, CINTCLR is read as a 0. |
| 6 | MBANK | Memory bank select. The 16K-word bank of memory on the TMS320C30 parallel I/O Bus (SRAM space 1) is mapped as two overlapping banks of 8K-words each. MBANK = 0 selects the lower 8K-words, MBANK = 1 selects the upper 8K-words. On reset, MBANK is read as a 0. |
| 7 | MSWAP | Memory Swap. The MSWAP bit is used to swap the address map for EPROM and SRAM space 0. MSWAP = 0 maps the EPROM at 000000h-003FFFh and SRAM space 0 at F00000h-F03FFFh. MSWAP = 1 maps the EPROM at F00000h-F03FFFh and SRAM space 0 at 000000h-003FFFh. On reset, MSWAP is read as a 0. |
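
The bit assignments of Table 2 and the split write access (host writes the lower nibble, TMS320C30 writes the upper nibble) can be modeled in a few lines of C. This is a sketch for illustration only: the `APPB_` prefix and the two helper functions are hypothetical and simply model the access rules on a copy of the register value; they do not touch hardware.

```c
/* Bit masks for the APPB control register (bit positions from Table 2). */
enum {
    APPB_CINT    = 0x01,   /* host -> C30 interrupt             */
    APPB_XINTCLR = 0x02,   /* clear/enable C30 -> host interrupt */
    APPB_DPSEL   = 0x04,   /* dual-port SRAM select              */
    APPB_SWRESET = 0x08,   /* SWDS soft reset (0 = reset)        */
    APPB_XINT    = 0x10,   /* C30 -> host interrupt              */
    APPB_CINTCLR = 0x20,   /* clear/enable host -> C30 interrupt */
    APPB_MBANK   = 0x40,   /* expansion-bus SRAM bank select     */
    APPB_MSWAP   = 0x80    /* swap EPROM and SRAM space 0        */
};

/* Model of a host write: only the lower nibble can change. */
unsigned char appb_host_write(unsigned char reg, unsigned char value)
{
    return (unsigned char)((reg & 0xF0) | (value & 0x0F));
}

/* Model of a TMS320C30 write: only the upper nibble can change. */
unsigned char appb_c30_write(unsigned char reg, unsigned char value)
{
    return (unsigned char)((reg & 0x0F) | (value & 0xF0));
}
```

Under this model, a host write of FFh to a register holding A0h leaves the upper nibble untouched and yields AFh.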

The last portion of the control section contains the dual-port SRAM semaphore registers. Semaphore registers are used to coordinate communications between the host and the TMS320C30. Note that these semaphores do not provide hardware protection of the memory array. Instead, they provide a basic means (via software control) to ensure that data can be accessed from both sides of the dual-port SRAM without being corrupted. A software example that uses the semaphores is presented later in this report.

## SRAM and EPROM Interfaces

There are two SRAM interfaces on the APPB: one on the primary bus and one on the expansion bus. Both are implemented with eight 16K-bit × 4, 25-ns SRAMs that provide zero wait-state TMS320C30 operation at 32 MHz. The interfaces are quite simple and consist of a set of address buffers, termination resistors, and a PAL for address decode on the primary bus. Note that the TMS320C30 address lines are routed to various components scattered around the board and then to the primary bus expansion. To prevent line reflections on the SRAM addresses, buffers have been used to isolate the SRAM.

There are two special features on the APPB that apply to the SRAM:

1) You can swap the memory address ranges of the EPROM and the SRAM on the primary bus by setting or clearing the MSWAP bit previously described in Table 3.
2) There are two 8 K -word pages of memory on the expansion bus.

By swapping the EPROM and SRAM, you can load in your own interrupt and reset vectors. Otherwise, you would have to remove the EPROMs and reprogram them with your own defined interrupt/reset vectors. The following code segment sets/clears the MSWAP bit.

```
#define EPROM 0 /* select EPROM */
#define SRAM 1 /* select SRAM */
sel_mswap(mem_type)
int mem_type;
{
    char *cntlreg = (char *)0x00805FF7; /* pointer to control reg */
    if (mem_type) *cntlreg |= 0x80; /* set MSWAP to 1 select SRAM */
    else *cntlreg &= 0x7F; /* set MSWAP to 0 select EPROM */
}
```

There are 16 K -words of SRAM on the expansion bus; however, the TMS320C30 can directly access only 8 K -words. Instead of wasting the unaddressable 8 K -words, you can use a bank addressing bit (MBANK) in the APPB control register to select between the lower and upper 8 K -word segments.

The following code segment selects the current bank of memory.

```
#define BANK0 0 /* select lower 8k */
#define BANK1 1 /* select upper 8k */
sel_mbank(bank)
int bank;
{
    char *cntlreg = (char *)0x00805FF7; /* pointer to control reg */
    if (bank) *cntlreg |= 0x40; /* select bank 1 */
    else *cntlreg &= 0xBF; /* select bank 0 */
}
```

The APPB supports 2 K -words of one wait-state EPROM on the primary bus for a boot loader and operating system support. As stated earlier, this EPROM is remappable.

## DRAM Interface

The APPB provides a DRAM expansion module that is connected to the TMS320C30 primary bus. Historically, DRAM interfaces to DSP devices have not been popular because of interface difficulty and limited processor address space. The TMS320C30 supplies solutions to both of those issues with its memory interface and 16M-words address space. Two areas of the TMS320C30 memory interface are most useful for DRAM design:

- Use of bank mode
- The ability to do continuous reads while in a bank without deasserting the $\overline{\text { STRB }}$ signal

When you use these two features, it is quite simple to design a medium-speed interface to page-mode DRAMs.

The TMS320C30 DRAM module consists of four banks of memory, each bank $256 \mathrm{~K} \times 32$ bits, that provide 1 M -word ( 4 M -bytes) of medium speed storage for the TMS320C30 (see Figure 3). The bank-switch function on the TMS320C30 provides fast page-mode access on back-to-back read cycles within a DRAM page. All address and control lines to the memory array are buffered and series-terminated for good signal quality. The memory array uses CAS-before-RAS refresh to reduce component count. There is no onboard refresh timer; instead, SDACK0 from the host PC provides a refresh request every $12-16 \mu \mathrm{~s}$. The DRAM access/cycle times are summarized in Table 4.

Figure 3. TMS320C30 Bank Addressing


In Table 4, these definitions are assumed:
Access Time - Number of clocks from $\overline{\text { STRB }}$ active to data clocked into the TMS320C30.
Cycle time - Number of clocks between two back-to-back cycles (includes DRAM $\overline{\text { RAS }}$ precharge on non-page-mode cycles).

Table 4. TMS320C30 DRAM Access and Cycle Times

| Mode | Access Time (clks) | Cycle Time (clks) |
| :--- | :---: | :---: |
| Read | 3 | 5 |
| Read (page mode) | $3 / 2^{\dagger}$ | 2 |
| Write | 3 | 4 |

$\dagger$ First page-mode access takes 3 clocks; the following accesses take 2 clocks each.

The four banks of DRAM are mapped into the TMS320C30 memory space at the address locations shown in Table 5.

Table 5. DRAM Bank Memory Locations in the TMS320C30 Memory Space

| DRAM Memory Bank No. | TMS320C30 Memory Location |
| :---: | :---: |
| 0 ($\overline{\text{RAS}}0$, $\overline{\text{CAS}}0$) | 400000H-43FFFFH |
| 1 ($\overline{\text{RAS}}1$, $\overline{\text{CAS}}1$) | 440000H-47FFFFH |
| 2 ($\overline{\text{RAS}}2$, $\overline{\text{CAS}}2$) | 480000H-4BFFFFH |
| 3 ($\overline{\text{RAS}}3$, $\overline{\text{CAS}}3$) | 4C0000H-4FFFFFH |

Memory decode for the DRAM module is performed in two steps:

1) The APPB main card provides a memory select to decode the board range of 400000H-4FFFFFH.
2) Bank decode is then provided on the DRAM module through TMS320C30 address bits A18 and A19.
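
The two-step decode above amounts to a range check followed by extracting address bits A18 and A19. A minimal sketch, assuming word addresses as seen by the TMS320C30 (the function name `dram_bank` is illustrative and not part of the APPB software):

```c
/*
 * Return the DRAM bank number (0-3) selected by a TMS320C30 word
 * address, or -1 if the address is outside the DRAM module range
 * 400000H-4FFFFFH given in Table 5.
 */
int dram_bank(unsigned long addr)
{
    /* Step 1: board-level select for the whole DRAM range. */
    if (addr < 0x400000UL || addr > 0x4FFFFFUL)
        return -1;

    /* Step 2: bank decode from address bits A18 and A19. */
    return (int)((addr >> 18) & 0x3UL);
}
```

Each bank spans 40000H (256K) words, so address 440000H falls in bank 1 and 4C0000H in bank 3, in agreement with Table 5.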

The DRAM controller consists of a pair of registered PALs, several SSI gates, and a delay line (used to time DRAM row/column address multiplexing). DRAM timing is generated from PAL UE5 (see schematics in Appendix C), while address decoding and special refresh control are provided by PAL UD5. Both PALs are clocked off of a delayed H1 clock. The DRAM controller looks for every opportunity to generate page-mode cycles to the DRAM. The TMS320C30 leaves $\overline{\text { STRB }}$ low for back-to-back reads; the DRAM controller looks for this condition and cycles $\overline{\text{CAS}}$ while holding $\overline{\text{RAS}}$ low (i.e., DRAM page-mode access). When $\overline{\text { STRB }}$ goes high, the DRAM controller takes both $\overline{\text{RAS}}$ and $\overline{\text{CAS}}$ high to prepare for a new access. For proper operation, the TMS320C30 primary bus control register (refer to the Primary Bus Control Register subsection in the Third-Generation TMS320 User's Guide) must be set to operate off of the external ready signal and use a maximum bank size of 512 words (refer to the Programmable Bank Switching subsection of the Third-Generation TMS320 User's Guide).
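
With a bank size of 512 words, two back-to-back reads can stay in page mode only if their addresses fall in the same 512-word bank. The check can be sketched as follows; the names `DRAM_PAGE_WORDS` and `dram_same_page` are hypothetical, and the sketch only illustrates the address comparison, not the PAL logic itself.

```c
/* Bank (page) size required by the DRAM controller, in 32-bit words. */
#define DRAM_PAGE_WORDS 512UL

/*
 * Return nonzero if two TMS320C30 word addresses lie in the same
 * 512-word bank, i.e. they differ only in the low nine address bits.
 */
int dram_same_page(unsigned long a, unsigned long b)
{
    return (a / DRAM_PAGE_WORDS) == (b / DRAM_PAGE_WORDS);
}
```

Addresses 400000H and 4001FFH share a bank, while 4001FFH and 400200H do not, so a sequential read crossing that boundary starts a new non-page-mode cycle.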

Figures 4 through 6 show the timing for the various DRAM cycles.
Figure 4. Page-Mode Read-Cycle Timing Diagram


Figure 5. Single Write-Cycle Timing Diagram


Figure 6. Single Read-Cycle Timing Diagram


## Expansion Interface

The APPB's two expansion connectors contain the signals from the TMS320C30 expansion port, serial ports, flag pins, etc. Each 50-pin connector (P3 and P4 of Figure 7) is composed of a dual row of 25 pins located on 0.1-inch centers. These expansion connectors provide easy connection to other hardware via standard 50-wire flat ribbon cable. Figure 7 shows the orientation of the connectors. See schematic sheet 7 of Appendix C for pinout details.

Figure 7. TMS320C30 Applications Board


## Dual-Port SRAM Interface

All communications between the TMS320C30 and the host occur through the dual-port SRAM, which is 4K-bytes deep, with 8 dedicated semaphore registers. On the host side, the dual-port memory array is memory-mapped, while the semaphores are I/O-mapped. On the TMS320C30 side, the dual-port SRAM is located on the expansion bus with the memory array mapped from 0x00804000-0x00804FFF and the semaphores mapped from 0x00805FF8-0x00805FFF. The host can directly access the dual-port SRAM without having to compensate for byte-wide access limitations. However, because the TMS320C30 can do only 32-bit accesses, the upper 24 bits of each data word it reads are undefined. The TMS320C30 must therefore format data written to and read from the dual-port SRAM. A software example is given later in this report.
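
The formatting step mentioned above can be illustrated with a short sketch. It assumes that each TMS320C30 word location carries one valid byte in its low 8 bits and that multi-byte values are stored low byte first, matching the host PC's byte order; both the byte order and the function names `dp_put32` and `dp_get32` are assumptions made for this example and may differ from the Appendix A routines.

```c
/*
 * Store a 32-bit value as four consecutive dual-port locations,
 * one byte per location, low byte first (assumed order).
 */
void dp_put32(volatile unsigned long *dp, unsigned long value)
{
    int i;
    for (i = 0; i < 4; ++i)
        dp[i] = (value >> (8 * i)) & 0xFFUL;
}

/*
 * Rebuild a 32-bit value from four consecutive dual-port locations.
 * The upper 24 bits of each location are undefined, so they are
 * masked off before the bytes are combined.
 */
unsigned long dp_get32(const volatile unsigned long *dp)
{
    unsigned long value = 0;
    int i;
    for (i = 0; i < 4; ++i)
        value |= (dp[i] & 0xFFUL) << (8 * i);
    return value;
}
```

Masking on every read is the important part: without it, whatever happens to be on the upper 24 data lines would corrupt the reassembled value.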

While dual-port SRAMs provide an excellent means for multiprocessor communications, a certain amount of software overhead is required to coordinate data flow, and there are numerous methods for doing so. This application report presents a set of primitives that have been developed to form a basic communications protocol. The primitives are written entirely in C and have been tested on the XDS1000 with the simple test routine provided. The method shown in this report is not the best for all applications; it is simply a method that makes good use of the capability of the dual-port SRAM.

The following are basic ideas of the communications protocol developed for this applications report.

1) The dual-port memory is broken into eight equal segments. The first segment is used only for control structures and command passing. The remaining seven segments are used entirely for data passing. Segment size is set to 512 bytes. The number and size of segments can be changed at compile time if desired.
2) Each of the seven data segments is totally independent from any other data segment. However, only one processor can own a particular segment at any given time. The TMS320C30 and host can simultaneously access the dual-port SRAM as long as both are not trying to access the same segment.
3) The host is the master; the TMS320C30 is the slave. The TMS320C30 polls the dual-port control segment to determine if the host has deposited a command. If a command is present, the TMS320C30 executes the command and then returns to polling.
4) Only the first semaphore register is used in the dual-port. Each processor uses this semaphore to gain access to the control segment. Access to the seven data memory segments is coordinated via the control structures, not the semaphores.
5) There are seven control structures in the control segment, one for each data segment. Each control structure consists of 22 bytes and is defined as follows:

| Byte | Name | Definition |
| :--- | :--- | :--- |
| 0 | pflag | Buffer present (i.e., being used) |
| 1 | command | Command to execute |
| 2 | buf_stat | Status of the data buffer |
| 3 | nc | Reserved |
| 4-7 | count | Number of 32-bit words to transfer |
| 8-11 | addr | TMS320C30 address to read/write data |
| 12-21 | message | Ten bytes reserved for message passing |
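
The segment layout described in items 1 through 5 can be summarized as byte offsets into the 4K-byte dual-port SRAM. The sketch below takes the segment size, segment count, and control-structure size from the text; the placement of the seven control structures back to back from the start of the control segment is an assumption made for illustration, as are the names `dp_data_offset` and `dp_ctl_offset`.

```c
#define DP_SEG_SIZE   512L   /* bytes per segment (item 1)            */
#define DP_DATA_SEGS  7      /* number of data segments (item 1)      */
#define DP_CTL_SIZE   22L    /* bytes per control structure (item 5)  */

/*
 * Byte offset of data segment blk (0..6) from the start of the
 * dual-port SRAM. Segment 0 of the memory is the control segment,
 * so data segment blk starts at (blk + 1) * 512. Returns -1 if blk
 * is out of range.
 */
long dp_data_offset(int blk)
{
    if (blk < 0 || blk >= DP_DATA_SEGS)
        return -1;
    return (long)(blk + 1) * DP_SEG_SIZE;
}

/*
 * Byte offset of the control structure for data segment blk.
 * ASSUMPTION: the structures are packed consecutively starting at
 * offset 0 of the control segment.
 */
long dp_ctl_offset(int blk)
{
    if (blk < 0 || blk >= DP_DATA_SEGS)
        return -1;
    return (long)blk * DP_CTL_SIZE;
}
```

Seven 22-byte structures occupy 154 bytes, so under this assumed packing they fit comfortably within the 512-byte control segment.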

Appendix A contains routines for the communication primitives used by the host and the TMS320C30. Appendix A1 contains routines for the PC side, Appendix A2 routines for the TMS320C30 side. Note that the routines on both sides have the same names and perform essentially the same function. Appendix A3 contains a memory map and description (TMS320C30 view). After the code has been compiled, use the following sequence to execute the test program:

1) Reset the XDS1000:

| xreset | [RETURN] |
| :---: | :---: |
| c30reset | [RETURN] |

2) Get into the emulator and load the TMS320C30 dual-port code.

| emu30 | [RETURN] | ; load emulator |
| :---: | :---: | :---: |
| xr |  | ; reset the c30 |
| 10 | 'file name' | ; load the object file |
| xd |  | ; execute disconnect |
| [esc] |  | ; escape to main menu |
| q 'yes' |  | ; quit emulator |

At this point, your dual bus code should be executing and waiting for a host input.

3) Execute host dual-port code.

```
'file name'
```

The host code will then print the numbers 0 through 25 to the screen.

## Conclusion

This report has provided basic functional details of the TMS320C30 APPB. Because of their complexity, the DRAM and dual-port SRAM interfaces have been discussed in detail. The features of the TMS320C30 allow it to encompass a wide range of interfaces. The TMS320C30 bank-switch mode and continuous strobe signal on back-to-back read cycles overcome traditional DSP/DRAM problems of interface difficulty and limited processor address space. A set of communications primitive routines to use with dual-port SRAM has been provided in Appendix A. These routines are written in C for ease of understanding and modification to meet individual needs.

## Appendix A

TMS320C30 Application Board Routines, Memory Map and Description

- Appendix A1: TMS320C30 Application Board Routines - PC Side
- Appendix A2: TMS320C30 Application Board Routines - TMS320C30 Side
- Appendix A3: Memory Map and Description (TMS320C30 View)

    /***********************************************************************/
    /* APPENDIX A1                                                         */
    /*                                                                     */
    /* TMS320C30 APPLICATION BOARD ROUTINES - PC SIDE                      */
    /*                                                                     */
    /* Texas Instruments Inc.                                              */
    /* 10/25/89                                                            */
    /*                                                                     */
    /* Functions:                                                          */
    /*                                                                     */
    /* int APPB_reset()      Reset APPB                                    */
    /* int APPB_dpinit()     Initialize APPB.                              */
    /* int APPB_getsem()     Get access to semaphore bit N                 */
    /* int APPB_relsem()     Release access to semaphore bit N             */
    /* int APPB_getctlblk()  Get a control block in DPRAM                  */
    /* int APPB_relctlblk()  Release control block in DPRAM                */
    /* int APPB_getmemblk()  Get a block of memory from DPRAM              */
    /* int APPB_putmemblk()  Put a block of memory to DPRAM                */
    /*                                                                     */
    /* All code was compiled with Microsoft C compiler version 5.1 using   */
    /* the large model. If small model is used, then pointers used to      */
    /* access the dual port SRAM would have to be declared and used as     */
    /* 'far' pointers (i.e. 32-bit pointer). Under the large model, all    */
    /* pointers are defaulted to 32 bits.                                  */
    /***********************************************************************/


    #define DPRAM_SIZE      0x1000
    #define DPRAM_BLKS      7
    #define DPRAM_BLK_SIZE  512
    #define NUM_SEMS        8
    #define MAX_SEM_TIME    10000
    #define BUF_EMPTY       0
    #define BUF_FULL        1
    #define NOP             0x00
    #define HOST_MEM_WR     0x80
    #define HOST_MEM_RD     0x81

    typedef unsigned char   UCHAR;
    typedef unsigned short  UINT;
    typedef unsigned long   ULONG;

    typedef struct
    {
        UCHAR pflag;
        UCHAR command;
        UCHAR buf_stat;
        UCHAR nc;
        ULONG count;
        ULONG addr;
        UCHAR message[10];
    } DPCNTL;

[^1]

    main()
    {
        UINT semnum[DPRAM_BLKS];
        int i;
        ULONG memarray[25], mem2array[25];

        APPB_dpint();
        if (APPB_putmemblk(25UL, memarray, 0x00809900)) printf("failed memory write\n");
        if (APPB_getmemblk(25UL, 0x00809900, mem2array)) printf("failed memory read\n");
        for (i = 0; i < 25; i++) printf("value read %lu\n", mem2array[i]);
        exit(0);
    }

    /***********************************************************************/
    /* APPB_reset(), PC side                                               */
    /*                                                                     */
    /* Reset APPB.                                                         */
    /*                                                                     */
    /* Sequence:                                                           */
    /*                                                                     */
    /* 1) Clear control register.                                          */
    /* 2) Set SWRESET_ to 1.                                               */
    /***********************************************************************/
    APPB_reset()
    {
        outport(CTL_REG, 0);
        outport(CTL_REG, SWRESET_);
        return(0);
    }

    /***********************************************************************/
    /* Sequence:                                                           */
    /*                                                                     */
    /* 1) Set DPRAM semaphores to 1 (free).                                */
    /* 2) Set DPRAM mapping register.                                      */
    /* 3) Set DPRAM global enable bit to 1.                                */
    /***********************************************************************/
    int APPB_dpint()
    {
        int i;
        UINT semaddr = SEM_BASE;
        UCHAR *dpram = (UCHAR *)DPRAM_CTL;

        for (i = 0; i < 8; i++) outport(semaddr++, 1);
        outport(MAP_REG, DPRAM_SEG);
        outport(CTL_REG, DPSEL | SWRESET_);
        return(0);
    }

```
/**************************************************************************/
/* APPB_getsem(), PC side                                                 */
/*                                                                        */
/* Attempts to gain access of semaphore 'semnum'.                         */
/* Return a 0 if successful, a -1 if failed.                              */
/*                                                                        */
/* Sequence:                                                              */
/* 1) Write 0 to semaphore.                                               */
/* 2) Decrement timeout, check for timeout = 0, or semaphore = 0.         */
/* 3) Return pass/fail.                                                   */
/**************************************************************************/
int APPB_getsem(semnum)
    UINT semnum;
{
    UINT semaddr = SEM_BASE + semnum;
    UINT timeout = MAX_SEM_TIME;

    outport(semaddr, 0);
    while (--timeout && (inport(semaddr) & 1));
    if (timeout) return(0);
    else return(-1);
}
```
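
The acquire sequence above (write 0, then poll bit 0 until it reads 0 or the timeout expires) can be tried off-target against a simulated semaphore register. The sketch below is only an illustration: `sim_owner`, `sim_outport`, `sim_inport`, and `sim_getsem` are hypothetical names introduced here, not APPB routines, and the model assumes that a side reads back 0 only while it owns the semaphore.

```c
#define NUM_SEMS      8
#define MAX_SEM_TIME  10000

/* Hypothetical model: who currently owns each semaphore. */
enum { SEM_FREE = 0, SEM_HOST = 1, SEM_DSP = 2 };
static int sim_owner[NUM_SEMS];

/* Host-side write: 0 requests the semaphore, 1 releases it. */
static void sim_outport(unsigned sem, unsigned value)
{
    if ((value & 1u) == 0u) {
        if (sim_owner[sem] == SEM_FREE)
            sim_owner[sem] = SEM_HOST;
    } else if (sim_owner[sem] == SEM_HOST) {
        sim_owner[sem] = SEM_FREE;
    }
}

/* Host-side read: bit 0 is 0 only while the host owns the semaphore. */
static unsigned sim_inport(unsigned sem)
{
    return (sim_owner[sem] == SEM_HOST) ? 0u : 1u;
}

/* Same control flow as the PC-side APPB_getsem():
   returns 0 on success, -1 if the timeout count runs out. */
static int sim_getsem(unsigned sem)
{
    unsigned timeout = MAX_SEM_TIME;

    sim_outport(sem, 0u);
    while (--timeout && (sim_inport(sem) & 1u))
        ;
    return timeout ? 0 : -1;
}
```

With a free semaphore the first poll succeeds; if the other side holds the semaphore and never releases it, the loop counts down and the call reports failure.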

    /***********************************************************************/
    /* APPB_relsem(), PC side                                              */
    /*                                                                     */
    /* Release semaphore at 'semnum'.                                      */
    /* Return a 0 if successful, a -1 if failed.                           */
    /*                                                                     */
    /* Sequence:                                                           */
    /* 1) Write 1 to semaphore.                                            */
    /* 2) Decrement timeout, check for timeout = 0, or semaphore = 1.      */
    /* 3) Return pass/fail.                                                */
    /***********************************************************************/
    int APPB_relsem(semnum)
        UINT semnum;
    {
        UINT semaddr = SEM_BASE + semnum;
        UINT timeout = MAX_SEM_TIME;

        outport(semaddr, 1);
        while (--timeout && !(inport(semaddr) & 1));
        if (timeout) return(0);
        else return(-1);
    }

    /***********************************************************************/
    /* APPB_getctlblk(), PC side                                           */
    /*                                                                     */
    /* Find unused block of memory in the dual port.                       */
    /* Return a 0 if successful, a -1 if failed.                           */
    /*                                                                     */
    /* Sequence:                                                           */
    /* 1) Search control structures for free block of memory.              */
    /* 2) If block free, set semnum to block index, return 0.              */
    /* 3) Else, return -1 (failed to find block).                          */
    /***********************************************************************/
    int APPB_getctlblk(semnum)
        UINT *semnum;
    {
        int i;
        DPCNTL *dpctl = (DPCNTL *)DPRAM_CTL;

        if (APPB_getsem(0)) return(-1);
        for (i = 0; i < DPRAM_BLKS; i++)
            if (!dpctl[i].pflag)
            {
                dpctl[i].pflag = 1;
                dpctl[i].command = NOP;
                dpctl[i].buf_stat = BUF_EMPTY;
                *semnum = i;
                if (APPB_relsem(0)) return(-1);
                else return(0);
            }
        APPB_relsem(0); return(-1);
    }

    /***********************************************************************/
    /* APPB_relctlblk(), PC side                                           */
    /*                                                                     */
    /* Release block of memory in the dual port.                           */
    /* Return a 0 if successful, a -1 if failed.                           */
    /*                                                                     */
    /* Sequence:                                                           */
    /* 1) Null out the control structure.                                  */
    /* 2) Return.                                                          */
    /***********************************************************************/
    int APPB_relctlblk(semnum)
        UINT semnum;
    {
        int i;
        DPCNTL *dpctl = (DPCNTL *)DPRAM_CTL;

        if (APPB_getsem(0)) return(-1);
        dpctl[semnum].pflag = 0;
        dpctl[semnum].command = NOP;
        dpctl[semnum].buf_stat = BUF_EMPTY;
        if (APPB_relsem(0)) return(-1);
        else return(0);
    }
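
The control-block routines amount to a first-fit allocator over a small table guarded by semaphore 0. A minimal host-testable sketch of that allocation logic follows. It leaves out the semaphore handshake, keeps the table in ordinary memory rather than in the dual-port SRAM, and uses a reduced, hypothetical `CtlBlk` structure in place of `DPCNTL`; the `sim_` names and the bounds check in the release routine are additions made for this illustration.

```c
#define DPRAM_BLKS  7
#define NOP         0x00
#define BUF_EMPTY   0

/* Hypothetical reduced form of DPCNTL: only the fields the allocator touches. */
typedef struct {
    unsigned char pflag;     /* 1 = block in use, 0 = block free */
    unsigned char command;
    unsigned char buf_stat;
} CtlBlk;

/* Stand-in for the control area of the dual-port SRAM. */
static CtlBlk sim_ctl[DPRAM_BLKS];

/* First-fit search: claim the first free block and report its index. */
static int sim_getctlblk(unsigned *blk)
{
    unsigned i;

    for (i = 0; i < DPRAM_BLKS; i++) {
        if (!sim_ctl[i].pflag) {
            sim_ctl[i].pflag = 1;
            sim_ctl[i].command = NOP;
            sim_ctl[i].buf_stat = BUF_EMPTY;
            *blk = i;
            return 0;
        }
    }
    return -1;               /* no free block */
}

/* Null out the control structure so the block can be claimed again. */
static int sim_relctlblk(unsigned blk)
{
    if (blk >= DPRAM_BLKS)   /* bounds check added in this sketch */
        return -1;
    sim_ctl[blk].pflag = 0;
    sim_ctl[blk].command = NOP;
    sim_ctl[blk].buf_stat = BUF_EMPTY;
    return 0;
}
```

Claiming all seven blocks in a row yields indices 0 through 6, an eighth request fails, and a released block is the next one handed out.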


[^2]
## Appendix A2. TMS320C30 Application Board Routines - TMS320C30 Side


    /********************************************************************/
    /* APPB_getsem(), TMS320C30 side                                    */
    /*                                                                  */
    /* Attempts to gain access of semaphore 'semnum'.                   */
    /*                                                                  */
    /* Sequence:                                                        */
    /*                                                                  */
    /* 1) Write 0 to semaphore.                                         */
    /* 2) Wait till read a 0.                                           */
    /********************************************************************/
    int APPB_getsem(semnum)
        UINT semnum;
    {
        UCHAR *semaddr = (UCHAR *)(SEM_BASE + semnum);

        *semaddr = 0;
        while (*semaddr & 1UL);
        return(0);
    }


```
/****************************************************************************/
/* APPB_getctlblk(), TMS320C30 side.                                        */
/*                                                                          */
/* Find unused block of memory in the dual port.                            */
/* Return a 0 if successful, a -1 if failed.                                */
/*                                                                          */
/* Sequence:                                                                */
/* 1) Search control structures for free block of memory.                   */
/* 2) If block free, set semnum to block index, return 0.                   */
/* 3) Else, return -1 (failed to find block).                               */
/****************************************************************************/
int APPB_getctlblk(semnum)
    UINT *semnum;
{
    int i;
    DPCNTL *dpctl = (DPCNTL *)DPRAM_CTL;

    APPB_getsem(0);
    for (i = 0; i < DPRAM_BLKS; i++)
        if (!(dpctl[i].pflag & 1UL))
        {
            dpctl[i].pflag = 1;
            dpctl[i].command = NOP;
            dpctl[i].buf_stat = BUF_EMPTY;
            *semnum = i;
            APPB_relsem(0); return(0);
        }
    APPB_relsem(0); return(-1);
}
```

    /***********************************************************************/
    /* APPB_relctlblk(), TMS320C30 side                                    */
    /*                                                                     */
    /* Release block of memory in the dual port.                           */
    /* Return a 0 if successful, a -1 if failed.                           */
    /*                                                                     */
    /* Sequence:                                                           */
    /* 1) Null out the control structure.                                  */
    /***********************************************************************/
    int APPB_relctlblk(semnum)
        UINT semnum;
    {
        int i;
        DPCNTL *dpctl = (DPCNTL *)DPRAM_CTL;

        APPB_getsem(0);
        dpctl[semnum].pflag = 0;
        dpctl[semnum].command = NOP;
        dpctl[semnum].buf_stat = BUF_EMPTY;
        APPB_relsem(0); return(0);
    }



```
/**************************************************************************/
/* APPB_getlong(), TMS320C30 side.                                        */
/*                                                                        */
/* Get a long word of data from the dual port.                            */
/**************************************************************************/
int APPB_getlong(src, dst)
    ULONG *src;
    ULONG *dst;
{
    int j;

    *dst = 0UL;
    for (j = 0; j < 32; j += 8) *dst |= ((*src++) & 0x000000ff) << j;
    return(0);
}
```
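
APPB_getlong() rebuilds a 32-bit value from four successive dual-port locations, each contributing only its low 8 bits, least significant byte first. The sketch below restates that loop with fixed-width types so it can be checked on any host. `get_long` mirrors the routine above; `put_long` is a hypothetical inverse written for this illustration and is not one of the routines shown in this appendix.

```c
#include <stdint.h>

/* Assemble a 32-bit value from four words whose low bytes carry the data,
   least significant byte first (same loop as APPB_getlong()). */
static uint32_t get_long(const uint32_t *src)
{
    uint32_t value = 0;
    int shift;

    for (shift = 0; shift < 32; shift += 8)
        value |= (*src++ & 0xFFu) << shift;
    return value;
}

/* Hypothetical inverse: spread a 32-bit value over four byte-wide locations. */
static void put_long(uint32_t value, uint32_t *dst)
{
    int shift;

    for (shift = 0; shift < 32; shift += 8)
        *dst++ = (value >> shift) & 0xFFu;
}
```

The mask matters because the upper 24 bits of each word read from a byte-wide memory are not driven by that memory, so they are discarded before the byte is shifted into place.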




## APPENDIX A3. Memory Map and Description (TMS320C30 View)

Listed below is a summary of the APPB memory map.

| Start | End | Description |
| :---: | :---: | :--- |
| 000000 | 003FFF | EPROM (Boot EPROM/remappable) |
| 004000 | 3FFFFF | Unused |
| 400000 | 4FFFFF | DRAM space |
| 400000 | 43FFFF | 256K-word DRAM minimum configuration |
| 440000 | 47FFFF | 256K-word DRAM minimum configuration |
| 480000 | 4BFFFF | 256K-word DRAM option bank 2 |
| 4C0000 | 4FFFFF | 256K-word DRAM option bank 3 |
| 500000 | 7FFFFF | Unused |
| 800000 | 801FFF | SRAM space 1 (16K-byte zero wait-state SRAM) |
| 802000 | 803FFF | Reserved by TI |
| 804000 | 805FFF | I/O Devices |
| 804000 | 804FFF | 4K-byte dual-port SRAM |
| 805000 | 805FF6 | I/O Expansion Bus |
| 805FF7 |  | Control Register R |
| 805FF8 | 805FFF | Dual-port RAM Semaphores (D0 only) |
| 806000 | 807FFF | Reserved by TI |
| 808000 | 8097FF | Memory-mapped Peripherals |
| 809800 | 809BFF | RAM Block 0 |
| 809C00 | 809FFF | RAM Block 1 |
| 80A000 | EFFFFF | Unused |
| F00000 | F03FFF | SRAM space 0 (16K-byte zero wait-state SRAM, remappable) |
| F04000 | FFFFFF | Unused |
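
A table like this is easy to turn into a small lookup for use in debug output or address checks. The following is a minimal sketch under stated assumptions: the type `AppbRegion`, the table `appb_map`, and the function `appb_region` are hypothetical names introduced here, and only the populated regions of the map are listed, so every other address is reported as unused or reserved.

```c
#include <stddef.h>

/* One populated range of the TMS320C30 address space (inclusive bounds). */
typedef struct {
    unsigned long first;
    unsigned long last;
    const char   *name;
} AppbRegion;

/* Populated regions taken from the memory map (TMS320C30 view). */
static const AppbRegion appb_map[] = {
    { 0x000000UL, 0x003FFFUL, "EPROM" },
    { 0x400000UL, 0x4FFFFFUL, "DRAM space" },
    { 0x800000UL, 0x801FFFUL, "SRAM space 1" },
    { 0x804000UL, 0x804FFFUL, "dual-port SRAM" },
    { 0x805000UL, 0x805FF6UL, "I/O expansion bus" },
    { 0x805FF7UL, 0x805FF7UL, "control register" },
    { 0x805FF8UL, 0x805FFFUL, "dual-port RAM semaphores" },
    { 0x808000UL, 0x8097FFUL, "memory-mapped peripherals" },
    { 0x809800UL, 0x809BFFUL, "RAM block 0" },
    { 0x809C00UL, 0x809FFFUL, "RAM block 1" },
    { 0xF00000UL, 0xF03FFFUL, "SRAM space 0" }
};

/* Return the name of the region containing addr. */
static const char *appb_region(unsigned long addr)
{
    size_t i;

    for (i = 0; i < sizeof appb_map / sizeof appb_map[0]; i++)
        if (addr >= appb_map[i].first && addr <= appb_map[i].last)
            return appb_map[i].name;
    return "unused or reserved";
}
```

For instance, `appb_region(0x809900UL)` returns "RAM block 0", which is where the address passed to APPB_putmemblk() by the host test program in Appendix A1 falls.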

## Appendix B

## Modules

| Appendix | Name |
| :---: | :--- |
| B1 | Module U5 - TMS320C30 Software Development Board |
| B2 | Module U6 - TMS320C30 Software Development Board |
| B3 | Module RAMDEC - TMS320C30 Software Development Board |
| B4 | Module RDYEN - TMS320C30 Software Development Board |
| B5 | Module RAMCONTROL - TMS320C30 SWDS DRAM Module |
| B6 | Module RAMDEC - TMS320C30 SWDS DRAM Module |

## Appendix B1. TMS320C30 Software Development Board

Module U5
title'
DWG NAME TMS320C30 SOFTWARE DEVELOPMENT BOARD
DWG \# 2554377

equations

!NQG = !XAEN \& (SA == ^h338);
!NRG = !XAEN \& (SA == ^h339);
!NDPSEML = !XAEN \& SA9 \& SA8 \& !SA7 \& !SA6 \& SA5 \& SA4 \& !SA3 \& !NSIOW
\# !XAEN \& SA9 \& SA8 \& !SA7 \& !SA6 \& SA5 \& SA4 \& !SA3 \& !NSIOR;
!NDPCEL = !XAEN \& !NPQ;
SGAB = !NSIOW \& !XAEN \# !NSMEMW \& !XAEN;
!NSGBA = !XAEN \& !NSIOR \& (SA == ^h339)
\# !XAEN \& !NSIOR \& SA9 \& SA8 \& !SA7 \& !SA6 \& SA5 \& SA4 \& !SA3
\# !XAEN \& !NSMEMR \& !NPQ;
end U5

## Appendix B2. Module U6

Module U6
title'
DWG NAME TMS320C30 SOFTWARE DEVELOPMENT BOARD
DWG \# 2554377
COMPANY TEXAS INSTRUMENTS INCORPORATED
ENGR NAT SESHAN
DATE 10/01/88'
XSUF10 Device 'P20L8';
CIOA0 Pin 1;
CIOA1 Pin 2;
CIOA2 Pin 3;
CIOA3 Pin 4;
CIOA4 Pin 5;
CIOA5 Pin 6;
CIOA6 Pin 7;
CIOA7 Pin 8;
CIOA8 Pin 9;
CIOA9 Pin 10;
CIOA10 Pin 11;
GND Pin 12;
CIOA11 Pin 13;
CIOA12 Pin 14;
TIOW Pin 15;
NSRANGE Pin 16;
CIORNW Pin 17;
NFR Pin 18;
NFG Pin 19;
NDPMEMGR Pin 20;
NDPSEMGR Pin 21;
TIOR Pin 22;
NCIOSTRB Pin 23;
VCC Pin 24;
X = .X.;
C = .C.;
CIOA = [CIOA12, CIOA11, CIOA10, CIOA9, CIOA8, CIOA7, CIOA6, CIOA5, CIOA4, CIOA3, CIOA2, CIOA1, CIOA0];
equations
!NSRANGE = !NCIOSTRB \& !CIOA12 \# !NCIOSTRB \& (CIOA >= ^h1FF7);
!NDPMEMGR = !NCIOSTRB \& !CIOA12;
!NDPSEMGR = !NCIOSTRB \& (CIOA >= ^h1FF8);
!NFG = !NCIOSTRB \& !CIORNW \& (CIOA == ^h1FF7);
!NFR = !NCIOSTRB \& CIORNW \& (CIOA == ^h1FF7);
!TIOR = NCIOSTRB \# (CIOA >= ^h1FF7) \# !CIOA12 \# !CIORNW;
!TIOW = NCIOSTRB \# (CIOA >= ^h1FF7) \# !CIOA12 \# CIORNW;
test_vectors
([CIOA, NCIOSTRB, CIORNW] -> [TIOR, TIOW, NSRANGE, NFG, NFR, NDPMEMGR, NDPSEMGR])
"READ OR WRITE TO A SEMAPHORE
[^h1FF8, 0, X] -> [0, 0, 0, 1, 1, 1, 0];
[^h1FF9, 0, X] -> [0, 0, 0, 1, 1, 1, 0];
[^h1FFA, 0, X] -> [0, 0, 0, 1, 1, 1, 0];
[^h1FFB, 0, X] -> [0, 0, 0, 1, 1, 1, 0];
[^h1FFC, 0, X] -> [0, 0, 0, 1, 1, 1, 0];
[^h1FFD, 0, X] -> [0, 0, 0, 1, 1, 1, 0];
[^h1FFE, 0, X] -> [0, 0, 0, 1, 1, 1, 0];
[^h1FFF, 0, X] -> [0, 0, 0, 1, 1, 1, 0];
"WRITE TO F REGISTER
[^h1FF7, 0, 0] -> [0, 0, 0, 0, 1, 1, 1];
"READ FROM F REGISTER
[^h1FF7, 0, 1] -> [0, 0, 0, 1, 0, 1, 1];
"NCIOSTRB DISABLED
[X, 1, X] -> [0, 0, 1, 1, 1, 1, 1];
"EXTERNAL READS
[^b1000000001010, 0, 1] -> [1, 0, 1, 1, 1, 1, 1];
[^b1000000001011, 0, 1] -> [1, 0, 1, 1, 1, 1, 1];
[^b1000000001100, 0, 1] -> [1, 0, 1, 1, 1, 1, 1];
[^b1000000001101, 0, 1] -> [1, 0, 1, 1, 1, 1, 1];
[^b1000000001110, 0, 1] -> [1, 0, 1, 1, 1, 1, 1];
[^b1000000001111, 0, 1] -> [1, 0, 1, 1, 1, 1, 1];
[^h1FF0, 0, 1] -> [1, 0, 1, 1, 1, 1, 1];
[^h1FF1, 0, 1] -> [1, 0, 1, 1, 1, 1, 1];
[^h1FF2, 0, 1] -> [1, 0, 1, 1, 1, 1, 1];
[^h1FF3, 0, 1] -> [1, 0, 1, 1, 1, 1, 1];
[^h1FF4, 0, 1] -> [1, 0, 1, 1, 1, 1, 1];
[^h1FF5, 0, 1] -> [1, 0, 1, 1, 1, 1, 1];
[^h1FF6, 0, 1] -> [1, 0, 1, 1, 1, 1, 1];
"EXTERNAL IO WRITES
[^b1000000000000, 0, 0] -> [0, 1, 1, 1, 1, 1, 1];
[^b1000000000001, 0, 0] -> [0, 1, 1, 1, 1, 1, 1];
[^b1000000000010, 0, 0] -> [0, 1, 1, 1, 1, 1, 1];
[^b1000000000011, 0, 0] -> [0, 1, 1, 1, 1, 1, 1];
[^b1000000000100, 0, 0] -> [0, 1, 1, 1, 1, 1, 1];
[^b1000000000101, 0, 0] -> [0, 1, 1, 1, 1, 1, 1];
[^b1000000000110, 0, 0] -> [0, 1, 1, 1, 1, 1, 1];
[^b1000000000111, 0, 0] -> [0, 1, 1, 1, 1, 1, 1];
[^b1000000001000, 0, 0] -> [0, 1, 1, 1, 1, 1, 1];
[^b1000000001001, 0, 0] -> [0, 1, 1, 1, 1, 1, 1];
[^b1000000001010, 0, 0] -> [0, 1, 1, 1, 1, 1, 1];
[^b1000000001011, 0, 0] -> [0, 1, 1, 1, 1, 1, 1];
[^b1000000001100, 0, 0] -> [0, 1, 1, 1, 1, 1, 1];
[^b1000000001101, 0, 0] -> [0, 1, 1, 1, 1, 1, 1];
[^b1000000001110, 0, 0] -> [0, 1, 1, 1, 1, 1, 1];
[^b1000000001111, 0, 0] -> [0, 1, 1, 1, 1, 1, 1];
[^h1FF0, 0, 0] -> [0, 1, 1, 1, 1, 1, 1];
[^h1FF1, 0, 0] -> [0, 1, 1, 1, 1, 1, 1];
[^h1FF2, 0, 0] -> [0, 1, 1, 1, 1, 1, 1];
[^h1FF3, 0, 0] -> [0, 1, 1, 1, 1, 1, 1];
[^h1FF4, 0, 0] -> [0, 1, 1, 1, 1, 1, 1];
[^h1FF5, 0, 0] -> [0, 1, 1, 1, 1, 1, 1];
[^h1FF6, 0, 0] -> [0, 1, 1, 1, 1, 1, 1];

test_vectors
([CIOA12, NCIOSTRB, CIORNW] -> [TIOR, TIOW, NSRANGE, NFG, NFR, NDPSEMGR, NDPMEMGR])
"DUAL-PORT SRAM READ OR WRITE
[0, 0, X] -> [0, 0, 0, 1, 1, 1, 0];
end U6
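
The U6 decode can be cross-checked in software by restating the PAL equations as boolean expressions and comparing the result against the test vectors. The sketch below is an illustration only: the type `U6Outputs` and the function `u6_decode` are hypothetical names, all outputs are modeled at their pin level (so an active-low signal is 0 when asserted), and both TIOR and TIOW are assumed to use the comparison `CIOA >= ^h1FF7`, which is what the READ FROM F REGISTER vector requires.

```c
/* Pin-level outputs of U6, in the order used by the first test_vectors header. */
typedef struct {
    int tior, tiow, nsrange, nfg, nfr, ndpmemgr, ndpsemgr;
} U6Outputs;

/* Software model of the U6 equations.
   cioa     : 13-bit expansion-bus address CIOA12..CIOA0
   nciostrb : I/O strobe, active low
   ciornw   : 1 = read, 0 = write */
static U6Outputs u6_decode(unsigned cioa, int nciostrb, int ciornw)
{
    U6Outputs o;
    int strobe, a12, ge_f, is_f;

    cioa  &= 0x1FFFu;
    strobe = !nciostrb;                    /* strobe asserted           */
    a12    = (int)((cioa >> 12) & 1u);     /* CIOA12                    */
    ge_f   = cioa >= 0x1FF7u;              /* F register or semaphores  */
    is_f   = cioa == 0x1FF7u;              /* F register only           */

    o.nsrange  = !(strobe && (!a12 || ge_f));
    o.ndpmemgr = !(strobe && !a12);
    o.ndpsemgr = !(strobe && cioa >= 0x1FF8u);
    o.nfg      = !(strobe && !ciornw && is_f);
    o.nfr      = !(strobe &&  ciornw && is_f);
    o.tior     = !(nciostrb || ge_f || !a12 || !ciornw);
    o.tiow     = !(nciostrb || ge_f || !a12 ||  ciornw);
    return o;
}
```

Given the I/O space base of 804000 in Appendix A3, CIOA values 1FF7 and 1FF8 through 1FFF correspond to the control register at 805FF7 and the semaphores at 805FF8 through 805FFF.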

## Appendix B3. Module RAMDEC

module RAMDEC
title'
DWG NAME TMS320C30 SOFTWARE DEVELOPMENT BOARD
DWG \# 2554377
COMPANY TEXAS INSTRUMENTS INCORPORATED
ENGR TONY COOMES
DATE 10/01/88'
XSUB4 device 'P16L8';
a12 Pin 1; "c30 address inputs
a13 Pin 2;
a14 Pin 3;
a15 Pin 4;
a16 Pin 5;
a17 Pin 6;
a18 Pin 7;
a19 Pin 8;
a20 Pin 9;
a21 Pin 11;
a22 Pin 13;
a23 Pin 14;
m_swap Pin 15; "sram/eprom swap bit
vss Pin 10;
memen Pin 18; "dram expansion select
sram Pin 17; "sram select
eprom Pin 16; "eprom select
busen Pin 12; "eprom/dram data buffer select
vcc Pin 20;
madd = [a23, a22, a21, a20, a19, a18, a17, a16, a15, a14, a13, a12];
equations

```
"On reset the eprom and sram maps are swapped
"          m_swap = 0       m_swap = 1
"sram      F00000-F03FFF    000000-003FFF
"eprom     000000-003FFF    F00000-F03FFF
sram  = !(((madd >= ^h000) & (madd <= ^h003) & m_swap)
        # ((madd >= ^hF00) & (madd <= ^hF03) & !m_swap));
eprom = !(((madd >= ^h000) & (madd <= ^h003) & !m_swap)
        # ((madd >= ^hF00) & (madd <= ^hF03) & m_swap));
memen = !((madd >= ^h400) & (madd <= ^h4FF));
busen = !(!eprom # !memen);
```

test_vectors
([madd, m_swap] -> [sram, eprom, memen, busen])
[^h000, 1] -> [0, 1, 1, 1];
[^h000, 0] -> [1, 0, 1, 0];
[^h004, 1] -> [1, 1, 1, 1];
[^hF00, 1] -> [1, 0, 1, 0];
[^hF00, 0] -> [0, 1, 1, 1];
[^hFF0, 1] -> [1, 1, 1, 1];
[^hF00, 1] -> [1, 0, 1, 0];
[^h400, 0] -> [1, 1, 0, 0];
[^h4CF, 1] -> [1, 1, 0, 0];
[^h800, 1] -> [1, 1, 1, 1];
end RAMDEC
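
The swap behavior of this decoder is easiest to see when the equations are evaluated for a full 24-bit address. The following sketch restates them in C; `RamdecOut` and `ramdec` are hypothetical names introduced here, and every output is modeled at its pin level, where 0 means selected.

```c
/* Pin-level outputs of the board-level RAMDEC PAL (0 = selected). */
typedef struct {
    int sram, eprom, memen, busen;
} RamdecOut;

/* Software model of the RAMDEC equations.
   addr   : 24-bit TMS320C30 address; madd is its upper 12 bits (a23..a12)
   m_swap : sram/eprom swap bit */
static RamdecOut ramdec(unsigned long addr, int m_swap)
{
    RamdecOut o;
    unsigned madd = (unsigned)((addr >> 12) & 0xFFFu);
    int low  = madd <= 0x003u;                      /* 000000-003FFF */
    int high = madd >= 0xF00u && madd <= 0xF03u;    /* F00000-F03FFF */
    int dram = madd >= 0x400u && madd <= 0x4FFu;    /* 400000-4FFFFF */

    o.sram  = !((low &&  m_swap) || (high && !m_swap));
    o.eprom = !((low && !m_swap) || (high &&  m_swap));
    o.memen = !dram;
    o.busen = !(!o.eprom || !o.memen);  /* buffer enabled for EPROM or DRAM */
    return o;
}
```

With `m_swap` at 0 the EPROM answers at 000000 and the SRAM at F00000; setting `m_swap` to 1 exchanges the two, and `busen` goes low only for EPROM and DRAM accesses.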

## Appendix B4. Module RDYEN


test_vectors
([clk, strb, busen, rd_wr, eprom, oe, bhiz] -> prdy)
[c, 1, 1, 1, 1, 0, 1] -> 1;
[c, c, 0, 1, 0, 0, 0] -> 1;
[c, 0, 0, 1, 0, 0, 1] -> 0;
[c, 0, 0, 1, 0, 0, 1] -> 1;
[c, 0, 0, 1, 0, 0, 1] -> 0;
[c, 1, 0, 1, 0, 0, 1] -> 1;
[c, 1, 0, 1, 0, 0, 1] -> 1;
test_vectors
([strb, busen, rd_wr, eprom, bhiz] -> [dat_rd, dat_wr, epromcs])
[1, 1, 1, 1, 1] -> [1, 0, 1];
[0, 0, 1, 1, 1] -> [0, 0, 1];
[0, 0, 0, 1, 1] -> [1, 1, 1];
[0, 1, 1, 1, 1] -> [1, 0, 1];
[1, 0, 1, 1, 1] -> [1, 0, 1];
"check eprom
[1, 0, 1, 0, 1] -> [1, 0, 1];
[0, 0, 1, 0, 1] -> [0, 0, 0];
[0, 0, 1, 0, 0] -> [1, 0, 1];
[0, 0, 0, 0, 1] -> [1, 1, 1];
[0, 1, 1, 0, 1] -> [1, 0, 1];
[1, 0, 1, 1, 1] -> [1, 0, 1];
end RDYEN

## Appendix B5. Module RAMCONTROL

module RAMCONTROL
title'
DWG NAME 320C30 SWDS DRAM MODULE
DWG \# 2554397
COMPANY TEXAS INSTRUMENTS INCORPORATED
ENGR TONY COOMES
DATE 10/01/88'
XDUE5 device 'P16R8';
clk Pin 1;
refreq_ Pin 2; "refresh request
strb_ Pin 3; "c30 strobe
rd Pin 4; "c30 read/write
memen_ Pin 5; "memory board chip select
oe_ Pin 11; "pal output enable
vss Pin 10;
s0 Pin 19; "state variable
refclr Pin 18; "refresh clear
casen Pin 17; "column address strobe
ren Pin 16; "write strobe
rasen Pin 15; "row address strobe
mrdy Pin 14; "dram ready strobe
busact Pin 13; "dram bus active
s1 Pin 12; "state variable
vcc Pin 20;
"define machine states
"[refclr,rasen,casen,mrdy,busact,s0,s1];
idle = ^b1111111;
ras0 = ^b1011111;
cas0 = ^b1000111;
cas1 = ^b1011101;
whld = ^b1111110;
trp = ^b1111001;
ref1 = ^b0101111;
ref2 = ^b0001111;
ref3 = ^b0011111;
ref4 = ^b1111101;
refreq = !refreq_; "convert to positive logic
strb = !strb_;
memen = !memen_;
oe = !oe_;

```
c = .C.;
output = [refclr,rasen,casen,mrdy,busact,s0,s1];
equations
ren := !(!rd & !strb_); "high on read, low on writes
state_diagram output
state idle:
    case (refreq & strb & memen) :ref1; "ref has 1st priority
        (refreq & strb & !memen) :ref1;
        (refreq & !strb & memen) :ref1;
        (refreq & !strb & !memen) :ref1;
        (!refreq & strb & memen) :ras0;
        (!refreq & strb & !memen) :idle;
        (!refreq & !strb & memen) :idle;
        (!refreq & !strb & !memen) :idle;
    endcase;
    state ras0:
    goto cas0;
    state cas0:
    case rd
        !rd
    endcase;
    state cas1:
    case strb & !refreq
        strb & refreq
        !strb & !refreq
        !strb & refreq
    endcase;
state whld:
    case strb & !refreq
        strb & refreq
        !strb & !refreq
        !strb & refreq
        endcase;
state trp:                      "cas,ras high
    case refreq   :ref1;
         !refreq  :idle;
    endcase;
state ref1:                     "cas,refclr low
    goto ref2;
state ref2:                     "ras low
    goto ref3;
```

state ref3:
goto ref4;
state ref4:
goto idle;
test_vectors "page mode read, ref, page mode read
([clk, refreq, strb, rd, memen, oe] -> [output, ren])
[c, 0, 0, 1, 0, 1] -> [idle, 1];
[c, 0, 1, 1, 1, 1] -> [ras0, 1];
[c, 0, 1, 1, 1, 1] -> [cas0, 1];
[c, 0, 1, 1, 1, 1] -> [cas1, 1];
[c, 0, 1, 1, 1, 1] -> [cas0, 1];
[c, 1, 1, 1, 1, 1] -> [cas1, 1];
[c, 1, 1, 1, 1, 1] -> [trp, 1];
[c, 1, 1, 1, 1, 1] -> [ref1, 1];
[c, 1, 1, 1, 1, 1] -> [ref2, 1];
[c, 1, 1, 1, 1, 1] -> [ref3, 1];
[c, 0, 1, 1, 1, 1] -> [ref4, 1];
[c, 0, 1, 1, 1, 1] -> [idle, 1];
[c, 0, 1, 1, 1, 1] -> [ras0, 1];
[c, 0, 1, 1, 1, 1] -> [cas0, 1];
[c, 0, 1, 1, 1, 1] -> [cas1, 1];
[c, 0, 1, 1, 1, 1] -> [cas0, 1];
[c, 0, 1, 1, 1, 1] -> [cas1, 1];
[c, 0, 0, 1, 1, 1] -> [trp, 1];
[c, 0, 0, 1, 0, 1] -> [idle, 1];
test_vectors "write cycle
([clk, refreq, strb, rd, memen, oe] -> [output, ren])
[c, 0, 0, 0, 0, 1] -> [idle, 1];
[c, 0, 1, 0, 1, 1] -> [ras0, 0];
[c, 0, 1, 0, 1, 1] -> [cas0, 0];
[c, 0, 1, 0, 1, 1] -> [whld, 0];
[c, 0, 1, 0, 1, 1] -> [whld, 0];
[c, 0, 1, 0, 1, 1] -> [whld, 0];
[c, 0, 0, 0, 1, 1] -> [idle, 1];
[c, 0, 0, 1, 0, 1] -> [idle, 1];
"write cycle /ref
[c, 0, 0, 0, 0, 1] -> [idle, 1];
[c, 0, 1, 0, 1, 1] -> [ras0, 0];
[c, 1, 1, 0, 1, 1] -> [cas0, 0];
[c, 1, 1, 0, 1, 1] -> [whld, 0];
[c, 1, 1, 0, 1, 1] -> [ref1, 0];
[c, 1, 1, 0, 1, 1] -> [ref2, 0];
[c, 1, 0, 0, 0, 1] -> [ref3, 1];
[c, 0, 0, 1, 0, 1] -> [ref4, 1];
[c, 0, 0, 1, 0, 1] -> [idle, 1];
end RAMCONTROL

## Appendix B6. Module RAMDEC

module RAMDEC
title'
DWG NAME 320C30 SWDS DRAM MODULE
DWG \# 2554397
COMPANY TEXAS INSTRUMENTS INCORPORATED
ENGR TONY COOMES
DATE 10/01/88'
XDUD5 device 'P16R4';
clk Pin 1 ;
refclr Pin 2; "clear refresh stat
a18 Pin 3; "c30 address 18
a19 Pin 4; "c30 address 19
memen Pin 5; "dram board memory enable
strb Pin 6; "c30 strobe
mux "address mux
oe "pal output enable
vss
ras0 Pin 17; "ras select 0
ras1 Pin 16; "ras select 1
ras2 Pin 15; "ras select 2
ras3 Pin 14; "ras select 3
rowsel Pin 13; "row address select
vcc Pin 20;
c = .C.;
equations
ras0 := !(!refclr \# (!a19 \& !a18 \& !memen \& !strb));
ras1 := !(!refclr \# (!a19 \& a18 \& !memen \& !strb));
ras2 := !(!refclr \# ( a19 \& !a18 \& !memen \& !strb));
ras3 := !(!refclr \# ( a19 \& a18 \& !memen \& !strb));
rowsel = mux;
test_vectors "page mode read, ref, page mode read
([clk, refclr, memen, strb, a19, a18, oe] -> [ras0, ras1, ras2, ras3])
[c, 1, 1, 1, 0, 0, 0] -> [1, 1, 1, 1];
[c, 1, 0, 0, 0, 0, 0] -> [0, 1, 1, 1];
[c, 1, 0, 0, 0, 1, 0] -> [1, 0, 1, 1];
[c, 1, 0, 0, 1, 0, 0] -> [1, 1, 0, 1];
[c, 1, 0, 0, 1, 1, 0] -> [1, 1, 1, 0];
[c, 1, 1, 1, 1, 1, 0] -> [1, 1, 1, 1];
[c, 1, 0, 1, 1, 1, 0] -> [1, 1, 1, 1];
[c, 0, 0, 1, 1, 1, 0] -> [0, 0, 0, 0];
[c, 1, 0, 1, 1, 1, 0] -> [1, 1, 1, 1];
[c, 0, 0, 0, 1, 1, 0] -> [0, 0, 0, 0];
[c, 1, 0, 0, 1, 1, 0] -> [1, 1, 1, 0];
test_vectors "rowsel
(mux -> rowsel)
1 -> 1;
0 -> 0;
end RAMDEC
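
The four RAS equations share one pattern: a bank's RAS goes low either when that bank is addressed during a DRAM access or, for all banks at once, when `refclr` is low during a refresh. A minimal C restatement follows. `ras_select` is a hypothetical name introduced here, `memen` and `strb` are taken at their pin levels (0 = asserted) as in the equations, and the function returns the value each registered output would take on the next clock edge.

```c
/* Software model of the DRAM-module RAMDEC RAS equations.
   refclr, memen, strb : pin levels, 0 = asserted
   a19, a18            : bank select address bits
   ras[i]              : next registered value of RASi, 0 = bank i selected */
static void ras_select(int refclr, int memen, int strb,
                       int a19, int a18, int ras[4])
{
    int bank   = ((a19 & 1) << 1) | (a18 & 1);
    int access = !memen && !strb;       /* DRAM board selected and strobed */
    int i;

    for (i = 0; i < 4; i++)
        ras[i] = !(!refclr || (access && bank == i));
}
```

A normal access drives exactly one RAS line low, while a refresh (`refclr` low) drives all four low regardless of the address, which matches the all-zero rows in the test vectors.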

## Appendix C

## TMS320C30 Application Board Schematics

| Appendix | Title |
| :---: | :--- |
| C1 | TMS320C30 Software Development Schematics |
| C2 | TMS320C30 SWDS DRAM Module Schematics |

## Appendix C1. TMS320C30 Software Development Schematics












[Schematic sheets: TMS320C30 Software Development Board, including the expansion headers and address mux sheets. Drawing notes: computer-generated drawing, do not revise manually; resistance values are in ohms; resistors are 1/4 watt, 5%; capacitance values are in microfarads.]

## TMS320 Bibliography

Since the TMS32010 was disclosed in 1982, the TMS320 family has received an ever-increasing amount of recognition. The number of outside parties contributing to the extensive development support offered by Texas Instruments is rapidly growing. Many technical articles are being written about TMS320 applications in the field of digital signal processing.

The following articles and papers have been published since 1982 regarding the Texas Instruments TMS320 Digital Signal Processors. Readers who are interested in gaining further information about these processors and their applications may obtain copies of these articles/papers from their local or university library.

The articles are broken down into 12 different application categories. Articles in each category are in reverse chronological order (most recent first). Articles having the same publication date are shown in alphabetical order by author's name.

The application categories are:

1) General Purpose DSP
2) Graphics/Imaging
3) Instrumentation
4) Voice/Speech
5) Control
6) Military
7) Telecommunications
8) Automotive
9) Consumer
10) Industrial
11) Medical
12) Development Support

## General Purpose DSP

1) R. Chassaing, "A Senior Project Course in Digital Signal Processing with the TMS320," IEEE Transactions on Education, USA, Volume 32, Number 2, pages 139-145, May 1989.
2) P.E. Papamichalis, C.S. Burrus, "Conversion of Digit-Reversed to Bit-Reversed Order in FFT Algorithms," Proceedings of ICASSP 89, USA, pages 984-987, May 1989.
3) P.E. Papamichalis, "Application, Progress and Trends in Digital Signal Processing," Proceedings of Mikroelektronik Conference, Baden-Baden, March 1989.
4) R. Chassaing, "Adaptive Filtering with the TMS320C25 Digital Signal Processor," Proceedings of 1989 ASEE Conference, USA, pages 215-217, 1989.
5) P.E. Papamichalis, R. Simar, Jr., "The TMS320C30 Floating-Point Digital Signal Processor," IEEE Micro Magazine, USA, pages 13-29, December 1988.
6) K. Rogers, "The Real-Time Thing (Digital Signal Controller)," Electronic Engineering Times, USA, Number 506, page 85, October 1988.
7) P.E. Papamichalis, "Impact of DSP Devices on Fast Algorithms," Proceedings of the 1988 IEEE DSP Workshop, USA, September 1988.
8) G. Umamaheswari, C. Eswaran, A. Jhunjhunwala, "Signal Processing with a Dual-Bank Memory," Microprocessor Microsystems, Great Britain, Volume 12, Number 4, pages 206-210, May 1988.
9) G. Castellini, P. Luigi, E. Liani, L. Pierucci, F. Pirri, S. Rocchi, "A Multiprocessor Structure Based on Commercial DSP," Proceedings of ICASSP 88, USA, Volume V, page 2096, April 1988.
10) M.R. Civanlar, R.A. Nobakht, "Optimal Pulse Shape Design Using Projections onto Convex Sets," Proceedings of ICASSP 88, USA, Volume D, page 1874, April 1988.
11) L.J. Eriksson, M.C. Allie, C.D. Bremigan, R.A. Greiner, "Active Noise Control Using Adaptive Digital Signal Processing," Proceedings of ICASSP 88, USA, Volume A, page 2594, April 1988.
12) G. Mirchandani, D.D. Ogden, "Experiments in Partitioning and Scheduling Signal Processing Algorithms for Parallel Processing," Proceedings of ICASSP 88, USA, Volume D, page 1690, April 1988.
13) P. Papamichalis, "FFT Implementation on the TMS320C30," Proceedings of ICASSP 88, USA, Volume D, page 1399, April 1988.
14) A.C. Rotger-Mora, "An N-Dimensional SIMD Ring Architecture for Implementing Very Large Order Adaptive Digital Filters," Proceedings of ICASSP 88, USA, Volume V, page 2140, April 1988.
15) J. Santos, J. Parera, M. Veiga, "A Hypercube Multiprocessor for Digital Signal Processing Algorithm Research," Proceedings of ICASSP 88, USA, Volume D, page 1698, April 1988.
16) R. Simar, A. Davis, "The Application of High-Level Languages to Single-Chip Digital Signal Processors," Proceedings of ICASSP 88, USA, Volume D, page 1678, April 1988.
17) K. Bala, "Running on Embedded Power. (Dedicated 32-Bit Microprocessors Used in New Microcontrollers) (Technology Trends: Microprocessors and Peripherals)," Electronic Engineering Times, USA, Number 478, page 34, March 1988.
18) J. Cooper, "DSP Chip Speeds VME Transfer," ESD: Electronic Systems Design, USA, Volume 18, Number 3, pages 47,48,50,51, March 1988.
19) L. Vieira de Sa, F. Perdigao, "A Microprocessing System for the TMS32020," Microprocessing Microprogramming, Netherlands, Volume 23, Number 1-5, pages 221-225, March 1988.
20) G. Wade, "Offset FFT and Its Implementation on the TMS320C25 Processor," Microprocessing Microsystems, Great Britain, Volume 12, Number 2, pages 76-82, March 1988.
21) R. Chassaing, "Digital Broadband Noise Synthesis by Multirate Filtering Using the TMS320C25," Proceedings of 1988 ASEE Conference, USA, pages 394-397, 1988.
22) R. Chassaing, "A Senior Project Course on Applications in Digital Signal Processing with the TMS320," Proceedings of 1988 ASEE Conference, USA, pages 354-359, 1988.
23) L.N. Bohs, R.C. Barr, "Real-Time Adaptive Sampling with the Fan Method," Proceedings of the Ninth Annual Conference of the IEEE Engineering in Medicine and Biology Society, USA, Volume 4, pages 1850-1851, November 1987.
24) T. Kimura, Y. Inabe, T. Hayashi, K. Uchimura, K. Hamazato, "Dual-Chip SLIC Using VLSI Technology," Conference Record of GLOBECOM Tokyo '87, Volume 3, pages 1766-1770, November 1987.
25) W.S. Gass, R.T. Tarrant, T. Richard, B.I. Pawate, M. Gammel, P.K. Rajasekaran, R.H. Wiggins, C.D. Covington, "Multiple Digital Signal Processor Environment for Intelligent Signal Processing," Proceedings of the IEEE, USA, Volume 75, Number 9, pages 1246-1259, September 1987.
26) L. Johnson, R. Simar, Jr., "A High Speed Floating Point DSP," Conference Record of MIDCON/87, USA, pages 396-399, September 1987.
27) K.S. Lin, G.A. Frantz, R. Simar, Jr., "The TMS320 Family of Digital Signal Processors," Proceedings of the IEEE, USA, Volume 75, Number 9, pages 1143-1159, September 1987.
28) S.L. Martin, "Wave of Advances Carry DSPs To New Horizons. (Digital Signal Processing)," Computer Design, USA, Volume 26, Number 17, pages 69-82, September 1987.
29) C. Murphy, A. Coats, J. Conway, P. Colditz, P. Rolfe, "Doppler Ultrasound Signal Analysis Based on the TMS320 Signal Processor," 27th Annual Scientific Meeting of the Biological Engineering Society, Great Britain, Volume 10, Number 2, pages 127-129, September 1987.
30) G.S. Kang, L.J. Fransen, "Experimentation With An Adaptive Noise-Cancellation Filter," IEEE Transactions on Circuits and Systems, USA, Volume CAS-34, Number 7, pages 753-758, July 1987.
31) R. Chassaing, "Applications in Digital Signal Processing with the TMS320 Digital Signal Processor in an Undergraduate Laboratory," Proceedings of the 1987 ASEE Annual Conference, USA, Volume 3, pages 1320-1324, June 1987.
32) D.W. Horning, "An Undergraduate Digital Signal Processing Laboratory," Proceedings of the 1987 ASEE Annual Conference, USA, Volume 3, pages 1015-1020, June 1987.
33) D. Locke, "Digitising In The Gigahertz Range," IEE Colloquium on Advanced A/D Conversion Techniques, Great Britain, Digest Number 48, 10/1-4, April 1987.
34) S. Orui, M. Ara, Y. Orino, E. Suzuki, H. Makino, "Realization of IIR Filter using the TMS320," Resident Reports of Kogakuin University, Japan, Number 62, pages 195-204, April 1987.
35) R. Simar, T. Leigh, P. Koeppen, J. Leach, J. Potts, D. Blalock, "A 40 MFLOPS Digital Signal Processor: The First Supercomputer on a Chip," Proceedings of ICASSP 87, USA, Catalog Number 87CH2396-0, Volume 1, pages 535-538, April 1987.
36) R. Simar, "TMS320: Texas Instruments Family of Digital Signal Processors," Proceedings of SPEECH TECH 87, USA, pages 42-47, April 1987.
37) G.Y. Tang, B.K. Lien, "A Multiple Microprocessor System For General DSP Operation," Proceedings of ICASSP 87, USA, Catalog Number 87CH2396-0, Volume 2, pages 1047-1050, April 1987.
38) L. Vieira de Sa, "Second MicroProcessor Enhances TMS32020 System," EDN: Electronic Design News, USA, Volume 32, Number 9, pages 230-232, April 1987.
39) T.J. Moir, T.G. Vishwanath, D.R. Campbell, "Real-Time Self-Tuning Deconvolution Filter and Smoother," International Journal of Control, Great Britain, Volume 45, Number 3, pages 969-985, March 1987.
40) R. Simar, M. Hames, "CMOS DSP Packs Punch of a Supercomputer," EDN: Electronic Design News, USA, Volume 35, Number 7, pages 103-106, March 1987.
41) S. Sridharan, "On Improving the Performance of Digital Filters Designed Using the TMS32010 Signal Processor," Journal of Electrical and Electronic Engineers of Australia, Australia, Volume 7, Number 1, pages 80-82, March 1987.
42) R. McCammon, "Software Routine Probes TMS32010 Code," EDN: Electronic Design News, USA, Volume 32, Number 4, pages 200,202, February 1987.
43) J. Prado, R. Alcantara, "A Fast Square-Rooting Algorithm Using A Digital Signal Processor," Proceedings of IEEE, USA, Volume 75, Number 2, pages 262-264, February 1987.
44) T.G. Vishwanath, D.R. Campbell, T.J. Moir, "Real-Time Implementation Using a TMS32010 Microprocessor," IEEE Transactions on Industrial Electronics, USA, Volume IE-34, Number 1, pages 115-118, February 1987.
45) R. Chassaing, "Applications in Digital Signal Processing with the TMS320 Digital Signal Processor in an Undergraduate Laboratory," Proceedings of 1987 ASEE Conference, USA, pages 1320-1324, 1987.
46) R.M. Sovacool, "EPROM Enhances TMS32020 µC's Memory," EDN: Electronic Design News, USA, Volume 32, Number 1, page 231, 1987.
47) F. Kocsis, F. Marx, "Fast DFT Modules For The TMS32010 Digital Signal Processor," Meres and Automation, Hungary, Volume 35, Number 1, pages 6-11, 1987.
48) Y.V.V.S. Murty, W.J. Smolinski, "Digital Filters for Power System Relaying," International Journal of Energy Systems, USA, Volume 7, Number 3, pages 125-129, 1987.
49) S. Wang, "The TMS32010 High Speed Processor and Its Applications," Mini-Micro Systems, China, Volume 8, Number 3, pages 24-32, 1987.
50) G.A. Frantz, K.S. Lin, J.B. Reimer, J. Bradley, "The Texas Instruments TMS320C25 Digital Signal Microcomputer," IEEE Microelectronics, USA, Volume 6, Number 6, pages 10-28, December 1986.
51) P. Renard, "A/D Converters: The Advantage of a Mixture of Techniques," Mesures, France, Volume 51, Number 16, pages 80-81, December 1986.
52) M. Ara, E. Suzuki, "Design of Real Time Filter Using DSP," Resident Reports of Kogakuin University, Japan, Number 61, pages 115-127, October 1986.
53) J. Reidy, "Connection of a 12-Bit A/D Converter to Fast DSPs," Elektronik, Germany, Volume 35, Number 22, pages 132-134, October 1986.
54) G.R. Steber, "Implementation of Adaptive Filters on the TMS32010 DSP Microcomputer," Proceedings of IECON 86, Catalog Number 86CH2334-1, Volume 2, pages 653-656, September/October 1986.
55) D. Collins, M.A. Rahman, "Digital Filter Design Using The TMS320 Digital Signal Processor," Proceedings of EUSIPCO-86, Volume 1, pages 163-166, September 1986.
56) R. Simar, Jr., J.B. Reimer, "The TMS320C25: A 100 ns CMOS VLSI Digital Signal Processor," 1986 Workshop on Applications of Signal Processing to Audio and Acoustics, September 1986.
57) J. Dudas, A. Stipkovits, E. Simonyi, "On The Recursive Momentary Discrete Fourier Transform," Proceedings of EUSIPCO-86, Volume 1, pages 303-306, September 1986.
58) E. Feder, "Digital Signal Processor - General Purpose or Dedicated?," Electronics Industry, France, Number 111, pages 74-82, September 1986.
59) K. Herberger, "The Use of Signal Processors For Simulating Data Circuits," Proceedings of EUSIPCO-86, Volume 2, pages 1109-1112, September 1986.
60) K. Kassapoglou, P. Hulliger, "Implementation of Recursive Least Squares Identification Algorithm on The TMS320," Proceedings of EUSIPCO-86, Volume 2, pages 1263-1266, September 1986.
61) G. Lucioni, "General Processor Application; CAD Tool For Filter Design," Proceedings of EUSIPCO-86, Volume 2, pages 1335-1338, September 1986.
62) R. Schapery, "A 10-MIP Digital Signal Processor From Texas Instruments," Conference Record of Midcon 86, USA, 1/2/1-11, September 1986.
63) "DSP Microprocessors," Inf. Elettronica, Italy, Volume 14, Number 7-8, pages 21-28,
64) R.L. Barnes, S.H. Ardalan, "Multiprocessor Architecture For Implementing Adaptive Digital Filters," Conference Record of ICC-86, Catalog Number 86CH2314-3, Volume 1, pages 180-185, June 1986.
65) A.D.E. Brown, "EPROMS Simplify TMS32010 Memory System," EDN: Electronic Design News, USA, Volume 31, Number 13, page 230, June 1986.
66) T. Kolehamainen, T. Saramaki, M. Renfors, Y. Neuvo, "Signal Processor Implementation of Computationally Efficient FIR Filter Structures-Theory and Practice," 2nd Nordic Symposium on VLSI in Computers and Communications, 10 pages, June 1986.
67) T.G. Marshall Jr., "Transform Methods For Developing Parallel Algorithms For Cyclic-Block Signal Processing," Conference Record of ICC-86, Catalog Number 86CH2314-3, Volume 1, pages 288-294, June 1986.
68) S. Abiko, M. Hashizume, Y. Matsushita, K. Shinozaki, T. Takamizawa, C. Erskine, S. Magar, "Architecture and Applications of a 100-ns CMOS VLSI Digital Signal Processor," Proceedings of ICASSP 86, USA, Catalog Number 86CH2243-4, Volume 1, pages 393-396, April 1986.
69) T.P. Barnwell, "Algorithm Development and Multiprocessing Issues for DSP Chips," Proceedings of Speech Technology 86, April 1986.
70) W. Gass, "TMS32020 - The Quick and Easy Solution to DSP Problems," Proceedings of Speech Technology 86, April 1986.
71) M. Hashizume, S. Abiko, Y. Matsushita, K. Shinozaki, T. Takamizawa, S. Magar, J. Reimer, "A 100-ns CMOS VLSI Digital Signal Processor Using Double Level Metal Structure," Semiconductor Group 1986 Technical Meeting, April 1986.
72) R.E. Morley, A.M. Engebretson, and J.G. Trotta, "A Multiprocessor Digital Signal Processing System for Real-Time Audio Applications," IEEE Transactions on Acoustics, Speech and Signal Processing, USA, Volume ASSP-34, Number 2, April 1986.
73) S.G. Smith, A. Fitzgerald, P.B. Denyer, D. Renshaw, N.P. Wooten, R. Creasey, "A Comparison of Micro-DSP And Silicon Compiler Implementations of a Polyphase-Network Filter Bank," Proceedings of ICASSP 86, USA, Catalog Number 86CH2243-4, Volume 3, pages 2207-2210, April 1986.
74) J. Reimer, M. Hames, "Next Generation CMOS Chip Stakes High-Performance Claim on 10-MIPS DSP Operations," Electronic Design, USA, Volume 34, Number 8, pages 141-146, April 1986.
75) W.W. Smith, "Playing to Win: Product Development with the TMS320 Chip," Speech Technology Magazine, March/April 1986.
76) D. Essig, C. Erskine, E. Caudel, and S. Magar, "A Second-Generation Digital Signal Processor," IEEE Journal of Solid-State Circuits, USA, Volume SC-21, Number 1, pages 86-91, February 1986.
77) W.K. Anakwa, T.L. Stewart, "TMS320 Microprocessor-Based System For Signal Processing," Proceedings of the ISMM International Symposium, pages 64-65, February 1986.
78) M. Omenzetter, "Universal Signal Processors Offers High Data Throughput," Elektronik, Germany, Volume 35, Number 4, pages 71-77, February 1986.
79) P.F. Regamey, "Matched Filtering Using a Signal Microprocessor TMS320," Mitt. AGEN, Switzerland, Number 42, pages 31-35, February 1986.
80) "TI Set To Show 2nd-Generation DSP," Electronics, USA, pages 23-24, February 3, 1986.
81) "TI Preps CMOS Versions of Signal-Processor Chips," Electronics Engineering Times, USA, page 6, February 3, 1986.
82) D. Wilson, "Digital Signal Processing Moves on Chip," Digital Design, USA, Volume 16, Number 2, pages 33-34, February 1986.
83) "TI Chip Heads for Fast Lane of Digital Signal Processing," Electronics, USA, page 9, January 27, 1986.
84) R.D. Campbell and S.R. McGeoch, "The TMS32010 Digital Signal Processor-An Educational Viewpoint," International Journal for Electrical Engineering Education, Great Britain, Volume 23, Number 1, pages 21-31, January 1986.
85) P. Eckelman, "The Cascadable Signal Processor For Digital Signal Processing," Electronics Industry, Germany, Volume 17, Number 10, pages 26-27, 1986.
86) R. Cook, "Digital Signal Processors," High Technology, USA, Volume 5, Number 10, pages 25-30, October 1985.
87) C.F. Howard, "A High-Level Approach to Digital Processing Design," Proceedings of MILCOMP/85, USA, October 1985.
88) H.E. Lee, "Versatile Data-Acquisition System Based on the Commodore C-64/C-128 Microcomputer," Proceedings of the Symposium of Northeastern Accelerator Personnel, USA, Volume 57, Number 5, pages 983-985, October 1985.
89) N.K. Riedel, D.A. McAninch, C. Fisher, and N.B. Goldstein, "A Signal Processing Implementation for an IBM PC-Based Workstation," IEEE Micro, USA, Volume 5, Number 5, pages 52-67, October 1985.
90) K.E. Marrin, "VLSI and Software Move DSP Into Mainstream," Computer Design, USA, Volume 24, Number 9, pages 69-72, September 1985.
91) "Signal Processor ICs: Highly Integrated ICs Making DSP More Attractive," Electronics Engineering Times, USA, pages 37-38, September 2, 1985.
92) K.E. Marrin, "VLSI and Software Move DSP Techniques into Mainstream," Computer Design, USA, September 1985.
93) "High-Speed Four-Channel Input Board," Electronics Weekly, USA, Number 1277, page 31, July 24, 1985.
94) "4-Channel Analog-Input Board Puts Signal-Processing on VME Bus," EDN: Electronic Design News, USA, Volume 30, Number 17, page 74, July 1985.
95) R.H. Cushman, "Third-Generation DSPs Put Advanced Functions On-Chip," EDN: Electronic Design News, USA, July 1985.
96) W.W. Smith, Jr., "Agile Development System, Running on PCs, Builds TMS320-Based FIR Filter," Electronic Design, USA, Volume 33, Number 13, pages 129-138, June 6, 1985.
S. Magar, S.J. Robertson, and W. Gass, "Interface Arrangement Suits Digital Processor to Multiprocessing," Electronic Design, USA, Volume 33, Number 5, pages 189-198, March 7, 1985.
97) G. Kropp, "Signal Processor Offers Multiprocessor Capability," Elektronik, Germany, Volume 34, Number 6, pages 53-58, March 1985.
98) S. Magar, D. Essig, E. Caudel, S. Marshall and R. Peters, "An NMOS Digital Signal Processor with Multiprocessing Capability," Digest of IEEE International Solid-State Circuits Conference, USA, February 1985.
99) "TI 'Shiva' Chip Outlined," Electronics Engineering Times, USA, page 15, February 18, 1985.
100) S. Magar, E. Caudel, D. Essig, and C. Erskine, "Digital Signal Processor Borrows from µP to Step up Performance," Electronic Design, USA, Volume 33, Number 4, pages 175-184, February 21, 1985.
101) C. Erskine, S. Magar, E. Caudel, D. Essig, and A. Levinspuhl, "A Second-Generation Digital Signal Processor TMS32020: Architecture and Applications," Traitement de Signal, France, Volume 2, Number 1, pages 79-83, January-March 1985.
102) S. Baker, "TI 'Shiva' Chip Outlined," Electronic Engineering Times, USA, Number 317, page 15, February 1985.
103) S. Baker, "Silicon Bits," Electronic Engineering Times, USA, Number 316, page 42, February 1985.
104) H. Bryce, "Board Arrives For Digital Signal Processing on the VMEbus," Electronic Design, USA, Volume 33, Number 2, page 266, 1985.
105) K. Marrin, "VME-Compatible DSP System Incorporates TMS320 Chip," EDN: Electronic Design News, USA, Volume 30, Number 2, page 122, January 1985.
106) C. Erskine and S. Magar, "Architecture and Applications of A Second-Generation Digital Signal Processor," Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, USA, 1985.
107) D.P. Morgan and H.F. Silverman, "An Investigation into the Efficiency of a Parallel TMS320 Architecture: DFT and Speech Filterbank Applications," Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, USA, Volume 4, pages 1601-1604, 1985.
108) P. Harold, "VME Bus Meeting Sparks Change in Standard, New Products," EDN: Electronic Design News, USA, Volume 29, Number 26, page 18, December 1984.
109) W. Loges, "A Code Generator Sets up the Automatic Controller Program for the TMS320," Elektronik, Germany, Volume 33, Number 22, pages 154-158, November 1984.
110) H. Volkers, "Fast Fourier Transforms with the TMS320 as Coprocessor," Elektronik, Germany, Volume 33, Number 23, pages 109-112, November 1984.
111) Keun-Ho Ryoo, "On the Recent Digital Signal Processors," Journal of South Korean Institute of Electrical Engineering, South Korea, Volume 33, Number 9, pages 540-549, September 1984.
112) D. Wilson, "Editor's Comment," Digital Design, USA, Volume 14, Number 9, page 14, September 1984.
113) "Signal Processors Will Squeeze Into One Chip, Says TI's French," Electronics, USA, Volume 57, Number 9, pages 14,20, May 1984.
114) S. Mehrgardt, "32-Bit Processor Produces Analog Signals," Elektronik, Germany, Volume 33, Number 7, pages 77-82, April 1984.
115) S. Magar, "Signal Processing Chips Invite Design Comparisons," Computer Design, USA, Volume 23, Number 4, pages 179-186, April 1984.
116) S. Mehrgardt, "General-Purpose Processor System for Digital Signal Processing," Elektronik, Germany, Volume 33, Number 3, pages 49-53, February 1984.
117) T. Durham, "Chips: Familiarity Breeds Approval," Computing, Great Britain, page 26, January 1984.
118) J. Bradley and P. Ehlig, "Applications of the TMS32010 Digital Signal Processor and Their Tradeoffs," Midcon/84 Electronic Show and Convention, USA, 1984.
119) J. Bradley and P. Ehlig, "Tradeoffs in the Use of the TMS32010 as a Digital Signal Processing Element," Wescon/84 Conference Record, USA, 1984.
120) E. Fernandez, "Comparison and Evaluation of 32-Bit Microprocessors," Mini/Micro Southeast Computer Conference and Exhibition, USA, 1984.
121) D. Garcia, "Multiprocessing with the TMS32010," Wescon/84 Conference Record, USA, 1984.
122) S. Magar, "Architecture and Applications of a Programmable Monolithic Digital Signal Processor - A Tutorial Review," Proceedings of IEEE International Symposium on Circuits and Systems, USA, 1984.
123) D. Quarmby (Editor), "Signal Processor Chips," Granada, England 1984.
124) R. Steves, "A Signal Processor with Distributed Control and Multidimensional Scalability," Proceedings of IEEE National Aerospace and Electronics Conference, USA, 1984.
125) V. Vagarshakyan and L. Gustin, "On A Single Class of Continuous Systems - A Solution to the Problem on the Diagnosis of Output Signal Characteristics Recognition Procedures," IZV. AKAD. NAUK ARM. SSR, SER. TEKH. NAUK, USSR, Volume 37, Number 3, pages 22-27, 1984.
126) J. So, "TMS320-A Step Forward in Digital Signal Processing," Microprocessors and Microsystems, Great Britain, Volume 7, Number 10, pages 451-460, December 1983.
127) J. Elder and S. Magar, "Single-Chip Approach to Digital Signal Processing," Wescon/83 Electronic Show and Convention, USA, November 1983.
128) M. Malcangi, "VLSI Technology for Signal Processing. III," Elettronica Oggi, Italy, Number 11, pages 129-138, November 1983.
129) P. Strzelcki, "Digital Filtering," Systems International, Great Britain, Volume 11, Number 11, pages 116-117, November 1983.
130) W. Loges, "Digital Controls Using Signal Processors," Elektronik, Germany, Volume 32, Number 19, pages 51-54, September 1983.
131) "TI's Voice Chip Makes Debut," Computerworld, USA, Volume 17, Number 15, page 91, April 1983.
132) L. Adams, "TMS320 Family 16/32-Bit Digital Signal Processor, An Architecture for Breaking Performance Barriers," Mini/Micro West 1983 Computer Conference and Exhibition, USA, 1983.
133) R. Blasco, "Floating-Point Digital Signal Processing Using a Fixed-Point Processor," Southcon/83 Electronics Show and Convention, USA, 1983.
134) R. Dratch, "A Practical Approach to Digital Signal Processing Using an Innovative Digital Microcomputer in Advanced Applications," Electro '83 Electronics Show and Convention, USA, 1983.
135) C. Erskine, "New VLSI Co-Processors Increase System Throughput," Mini/Micro Midwest Conference Record, USA, 1983.
136) L. Kaplan, "Flexible Single Chip Solution Paves Way for Low Cost DSP," Northcon/83 Electronics Show and Convention, USA, 1983.
137) L. Kaplan, "The TMS32010: A New Approach to Digital Signal Processing," Electro '83 Electronics Show and Convention, USA, 1983.
138) S. Mehrgardt, "Signal Processing with a Fast Microcomputer System," Proceedings of EUSIPCO-83 Second European Signal Processing Conference, Netherlands, 1983.
139) L. Morris, "A Tale of Two Architectures: TI TMS 320 SPC VS. DEC Micro/J-11," Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, USA, 1983.
140) L. Pagnucco and D. Garcia, "A 16/32 Bit Architecture for Signal Processing," Mini/Micro West 1983 Computer Conference and Exhibition, USA, 1983.
141) J. Potts, "A Versatile High Performance Digital Signal Processor," Ohmcon/83 Conference Record, USA, 1983.
142) J. Potts, "New 16/32-Bit Microcomputer Offers 200-ns Performance," Northcon/83 Electronics Show and Convention, USA, 1983.
143) R. Simar, "Performance of Harvard Architecture in TMS320," Mini/Micro West 1983 Computer Conference and Exhibition, USA, 1983.
144) K. McDonough, E. Caudel, S. Magar, and A. Leigh, "Microcomputer with 32-Bit Arithmetic Does High-Precision Number Crunching," Electronics, USA, Volume 55, Number 4, pages 105-110, February 1982.
145) K. McDonough and S. Magar, "A Single Chip Microcomputer Architecture Optimized for Signal Processing," Electro/82 Conference Record, USA, 1982.
146) L. Kaplan, "Signal Processing with the TMS320 Family," Midcon/82 Conference Record, USA, 1982.
147) S. Magar, "Trends in Digital Signal Processing Architectures," Wescon/82 Conference Record, USA, 1982.

## Graphics/Imaging

1) J.A. Lindberg, "Color Cell Compression Shrinks NTSC Images," ESD: Electronic Systems Design Magazine, USA, Volume 17, Number 10, pages 91-96, October 1987.
2) S. Ganesan, "A Digital Signal Processing Microprocessor Based Workstation For Myoelectric Signals," Fifth International Conference on System Engineering, USA, Catalog Number 87CH2480-2, pages 427-438, September 1987.
3) JU. Pokovny, O. Skoloud, "Digitisation of a Video Signal From a Television For a Microcomputer," Sdelovaci Tech., Czechoslovakia, Volume 35, Number 6, pages 207-211, June 1987.
4) M.E. Bukaty, "A Vehicle Identification System For Surveillance Applications," Topical Meeting on Machine Vision. Technical Digest Series, USA, Volume 12, pages 106-109, March 1987.
5) K.N. Ngan, A.A. Kassim, H.S. Singh, "Parallel Image-Processing System Based on The TMS32010 Digital Signal Processor," IEE Proceedings in Electronics, Great Britain, Volume 134, Number 2, pages 119-124, March 1987.
6) K.N. Ngan, A.A. Kassim, H. Singh, "A TMS32010-Based Fast Parallel Vision Processor," Proceedings of the International Workshop on Industrial Applications of Machine Vision and Machine Intelligence, Catalog Number 87TH0166-9, pages 156-161, February 1987.
7) P. Bellamah, "Hardware-Software Increases Video Storage Capacity," PC Week, USA, Volume 4, Number 4, page 15, January 27, 1987.
8) J.M. Younse, "Motion Detection Using the Statistical Properties of a Video Image," Proceedings of SPIE International Society of Optical Engineering, USA, Volume 697, pages 233-243, August 1986.
9) T. Gehrels, B.G. Marsden, R.S. McMillan, J.V. Scotti, "Astrometry With a Scanning CCD," Astronomy Journal, USA, Volume 91, Number 5, pages 1242-1248, May 1986.
10) S. Srinivasan, A.K. Jain, T.M. Chin, "Cosine Transform Block Codec For Images Using TMS32010," IEEE International Symposium on Circuits and Systems, USA, Catalog Number 86CH2255-8, Volume 1, pages 299-302, May 1986.
11) D.M. Holburn and I.D. Sommerville, "A High-Speed Image Processing System Using the TMS32010," Software and Microsystems, Great Britain, Volume 4, Number 5-6, pages 102-108, October-December 1985.
12) C.D. Crowell and R. Simar, "Digital Signal Processor Boosts Speed of Graphics Display Systems," Electronic Design, USA, Volume 33, Number 7, pages 205-209, March 1985.
13) J. Reimer and A. Lovrich, "Graphics with the TMS32020," WESCON/85 Conference Record, USA, 1985.
14) H. Megal and A. Heiman, "Image Coding System - A Single Processor Implementation," MILCOM/85 IEEE Military Communications Conference Record, USA, 1985.
15) G. Gaillat, "The CAPITAN Parallel Processor: 600 MIPS for Use in Real Time Imagery," Traitement de Signal, France, Volume 1, Number 1, pages 19-30, October-December 1984.

## Instrumentation

1) G.R. Halsall, D.R. Burton, M.J. Lalor, C.A. Hobson, "A Novel Real-Time Opto-Electronic Profilometer Using FFT Processing," Proceedings of ICASSP 89, USA, pages 1634-1637, May 1989.
2) A.J. Pratt, R.E. Gander, B.R. Brandell, "Real-Time Median Frequency Estimator," Proceedings of the Ninth Annual Conference of the IEEE Engineering in Medicine and Biology Society, USA, Volume 4, pages 1840-1841, November 1987.
3) D.Y. Cheng, A. Gersho, "A Fast Codebook Search Algorithm For Nearest-Neighbor Pattern Matching," Proceedings of ICASSP 86, USA, Catalog Number 86CH2243-4, Volume 1, pages 265-268, April 1986.
4) Y. Chikada, M. Ishiguro, H. Hirabayashi, M. Morimoto, K. Morita, T. Kanazawa, H. Iwashita, K. Nakazima, S. Ishikawa, T. Takahashi, K. Handa, T. Kazuga, S. Okumura, T. Miyazawa, K. Miura, S. Nagasawa, "A Very Fast FFT Spectrum Analyzer For Radio Astronomy," Proceedings of ICASSP 86, USA, Catalog Number 86CH2243-4, Volume 4, pages 2907-2910, April 1986.
5) R.C. Wittenberg, "Four Microprocessors Power Multifunction Analyzer," Electronic Engineering Times, USA, Number 306, page 30, November 1984.
6) D. Lee, T. Moran, and R. Crane, "Practical Considerations for Estimating Flaw Sizes from Ultrasonic Data," Materials Evaluation, Volume 42, Number 9, pages 1150-1158, August 1984.
7) S. Magar, R. Hester, and R. Simpson, "Signal-Processing µC Builds FFT-Based Spectrum Analyzer," Electronic Design, USA, Volume 30, Number 17, pages 149-154, August 1982.

## Voice/Speech

1) A. Aktas, H. Hoge, "Multi-DSP and VQ-ASIC Based Acoustic Front-End for Real-Time Speech Processing Tasks," Proceedings of EUROSPEECH 89, pages 586-589, September 1989.
2) D. Bergmann, D. Boillon, F. Bonifacio, R. Breitschadel, "Experimental Speech Input/Output System," Proceedings of ICASSP 89, USA, pages 1138-1141, May 1989.
3) J. DellaMorte, P.E. Papamichalis, "Full-Duplex Real-Time Implementation of the FED-STD-1015 LPC-10e Standard V. 52 on the TMS320C25," Proceedings of SPEECH TECH 89, pages 218-221, May 1989.
4) B.I. Pawate, G.R. Doddington, "Implementation of a Hidden Markov Model-Based Layered Grammar Recognizer," Proceedings of ICASSP 89, USA, pages 801-804, May 1989.
5) P.E. Papamichalis, "High Quality Speech Coding: Some Recent Algorithms," Proceedings of SPEECH TECH 89, pages 329-333, May 1989.
6) J.C. Ventura, "Digital Audio Gain Control for Hearing Aids," Proceedings of ICASSP 89, USA, pages 2049-2052, May 1989.
7) N. Matsui, H. Ohasi, "DSP-Based Adaptive Control of a Brushless Motor," IEEE Industry Application Society Annual Meeting, USA, October 1988.
8) A. Albarello, R. Breitschaedel, A. Ciaramella, E. Lenormand, "Implementation of an Acoustic Front-End For Speech Recognition," CSELT Technical Report, Italy, Volume 16, Number 5, pages 455-459, August 1988.
9) D. Curl, "Voice Over Data Means More For Your Money," Communications, Great Britain, Volume 5, Number 8, pages 27-29, August 1988.
10) H. Hanselman, H. Henrichfreise, H. Hostmann, A. Schwarte, "Hardware/Software Environment for DSP-Based Multivariable Control," 12th. IMACS World Congress, July 1988.
11) J.B. Attili, M. Savic, J.P. Campbell, Jr., "A TMS32020-Based Real Time Text-Independent, Automatic Speaker Verification System," Proceedings of ICASSP 88, USA, Volume S, page 599, April 1988.
12) D. Chase, A. Gersho, "Real-Time VQ Codebook Generation Hardware For Speech Processing," Proceedings of ICASSP 88, USA, April 1988.
13) T. Kohonen, K. Torkkola, M. Shozaki, J. Kangas, O. Venta, "Phonetic Typewriter for Finnish and Japanese," Proceedings of ICASSP 88, USA, Volume S, page 607, April 1988.
14) I. Lecomte, M. Lever, L. Lelievre, M. Delprat, A. Tassy, "Medium Band Radio Communications," Proceedings of ICASSP 88, USA, April 1988.
15) J.B. Reimer, K.S. Lin, "TMS320 Digital Signal Processors in Speech Applications," Proceedings of SPEECH TECH '88, April 1988.
16) M. Smmendorfer, D. Kopp, H. Hackbarth, "A High-Performance Multiprocessor System for Speech Processing Applications," Proceedings of ICASSP 88, USA, Volume V, page 2108, April 1988.
17) P. Vary, K. Hellwig, R. Hoffmann, R.J. Sluyter, C. Garland, M. Russo, "Speech Codec for the European Mobile Radio System," Proceedings of ICASSP 88, USA, Volume S, page 227, April 1988.
18) A. Hunt, "A Speaker-Independent Telephone Speech Recognition System: The VCS TeleRec," Speech Technology, USA, Volume 4, Number 2, pages 80-82, March-April 1988.
19) R.A. Sukkar, J.L. LoCicero, J.W. Picone, "Design and Implementation of a Robust Pitch Detector Based on a Parallel Processing Technique," IEEE Journal of Selected Areas of Communications, USA, Volume 6, Number 2, pages 441-451, February 1988.
20) A.Z. Baraniecki, "Digital Coding of Speech Algorithms and Architecture," Proceedings of IECON '87, November 1987.
21) G.R. Steber, "Audio Frequency DSP Laboratory on a Chip-TMS32010," Proceedings of IECON '87, Volume 2, pages 1047-1051, November 1987.
22) S.H. Kim, K.R. Hong, H.B. Han, W.H. Hong, "Implementation of Real Time Adaptive Lattice Predictor on Digital Signal Processor," Proceedings of TENCON 87, South Korea, Volume 3, pages 1131-1135, August 1987.
23) J.B. Reimer, M.L. McMahan, W.W. Anderson, "Speech Recognition For a Low Cost System Using a DSP," Digest of Technical Papers for 1987 International Conference on Consumer Electronics, June 1987.
24) A. Ciaramella, G. Venuti, "Vector Quantization Firmware For an Acoustical Front-End Using the TMS32020," Proceedings of ICASSP 87, USA, Catalog Number 87CH2396-0, Volume 4, pages 1895-1898, April 1987.
25) G.A. Frantz, K.S. Lin, "A Low Cost Speech System Using the TMS320C17," Proceedings of SPEECH TECH '87, pages 25-29, April 1987.
26) Z. Gorzynski, "Realtime Multitasking Speech Application on the TMS320," Microprocessors and Microsystems, Great Britain, Volume 11, Number 3, pages 149-156, April 1987.
27) P. Papamichalis, D. Lively, "Implementation of the DOD Standard LPC-10/52E on the TMS320C25," Proceedings of SPEECH TECH '87, pages 201-204, April 1987.
28) B.I. Pawate, M.L. McMahan, R.H. Wiggins, G.R. Doddington, P.K. Rajasekaran, "Connected Word Processor on a Multiprocessor System," Proceedings of ICASSP 87, USA, Catalog Number 87CH2396-0, Volume 2, pages 1151-1154, April 1987.
29) S. Roucos, A. Wilgus, W. Russell, "A Segment Vocoder Algorithm For Real-Time Implementation," Proceedings of ICASSP 87, USA, Catalog Number 87CH2396-0, Volume 4, pages 1949-1952, April 1987.
30) H. Yeh, "Adaptive Noise Cancellation For Speech With a TMS32020," Proceedings of ICASSP 87, USA, Catalog Number 87CH2396-0, Volume 2, pages 1171-1174, April 1987.
31) R. Conover, D. Gustafson, "VLSI Architecture For Cepstrum Calculations," 1987 IEEE Region 5 Conference, USA, Catalog Number 87CH2383-8, pages 63-64, March 1987.
32) K. Field, A. Derr, L. Cosell, C. Henry, M. Kasner, J. Tiao, "A Single Board Multirate APC Speech Coding Terminal," Proceedings of ICASSP 87, USA, Catalog Number 87CH2396-0, Volume 2, pages 960-963, April 1987.
33) H. Brehm, W. Stammler, "Description and Generation of Spherically Invariant Speech-Model Signals," Signal Processing, Netherlands, Volume 12, Number 2, pages 119-141, March 1987.
34) A.Z. Baraniecki, "Digital Coding of Speech Algorithms and Architectures," Proceedings of IECON '87, Volume 2, pages 977-984, 1987.
35) B. Flocon, P. Lockwood, J. Sap, L. Sauter, "MARIPA: Speaker Independent Recognition of Speech on IBM-PC," Eighth International Conference on Pattern Recognition, Catalog Number 86CH2342-4, pages 893-895, October 1986.
36) M.T. Reilly, "A Hybridized Linear Prediction Code Speech Synthesizer," Conference Records for MILCOM 86, USA, Catalog Number 86CH2323-4, Volume 2, 32.5/1-5, October 1986.
37) K. Torkkola, H. Riittinen, T. Kohonen, "Microprocessor-Based Word Recognizer For a Large Vocabulary," Eighth International Conference on Speech Recognition Proceedings, Catalog Number 86CH2342-4, pages 814-816, October 1986.
38) C.H. Lee, D.Y. Cheng, D.A. Russo et al, "An Integrated Voice-Controlled Voice Messaging System," Proceedings of Speech Technology 86, April 1986.
39) Kun-Shan Lin and G.A. Frantz, "A Survey of Available Speech Hardware for Computer Systems," Proceedings of Speech Technology 86, April 1986.
40) L.R. Morris, "Software Engineering for an IBM PC/TI-SPEECH Realtime Digital Speech Spectrogram Production System," Proceedings of Speech Technology 86, April 1986.
41) K. Torkkola, H. Riittinen, "A Microprocessor-Based Recognition System For Large Vocabularies," Proceedings of ICASSP 86, USA, Catalog Number 86CH2243-4, Volume 1, pages 333-337, April 1986.
42) Z. Gorzynski, "Real Time Software Engineering on the TMS320: Application in a Pitch Detector Implementation," International Conference on Speech Input/Output; Techniques and Applications, Conference Publication Number 258, pages 270-275, March 1986.
43) S. Ganesan, M.O. Ahmad, "A Real Time Speech Signal Processor," Proceedings of the ISMM Internal Symposium, pages 46-49, February 1986.
44) L. Gutcho, "DECtalk-a Year Later," Speech Technology, Volume 3, Number 1, pages 98-102, August-September 1985.
45) B. Bryden, H.R. Hassanein, "Implementation of a Hybrid Pitch-Excited/Multipulse Vocoder for Cost-Effective Mobile Communications," Proceedings of Speech Technology 85, April 1985.
46) M. McMahan, "A Complete Speech Application Development Environment," Proceedings of SPEECH TECH 85, pages 293-295, April 1985.
47) H. Hassanein and B. Bryden, "Implementation of the Gold-Rabiner Pitch Detector in a Real Time Environment Using an Improved Voicing Detector," Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, USA, 1985.
48) K. Lin and G. Frantz, "Speech Applications with a General Purpose Digital Signal Processor," IEEE Region 5 Conference Record, USA, March 1985.
49) K. Lin and G. Frantz, "Speech Applications Created by a Microcomputer," IEEE Potentials, USA, December 1985.
50) M. Malcangi, "Programmable VLSI's for Vocal Signals," Elettronica Oggi, Italy, Number 10, pages 103-113, October 1984.
51) V. Kroneck, "Conversing with the Computer," Elektrotechnik, Germany, Volume 66, Number 20, pages 16-18, October 1984.
52) P.K. Rajasekaran and G.R. Doddington, "Real-Time Factoring of the Linear Prediction Polynomial of Speech Signals," Digital Signal Processing - 1984: Proceedings of the International Conference, pages 405-410, September 1984.
53) M. Hutchins and L. Dusek, "Advanced ICs Spawn Practical Speech Recognition," Computer Design, USA, Volume 23, Number 5, pages 133-139, May 1984.
54) E. Catier, "Listening Cards or Speech Recognition," Electronique Industrielle, France, Number 67, pages 72-76, March 1984.
55) O. Ericsson, "Special Processor Did Not Meet Requirements - Built Own Synthesizer," Elteknik Aktuell Elektronik, Sweden, Number 3, pages 32-36, February 1984.
56) H. Strube, "Synthesis Part of a 'Log Area Ratio' Vocoder Implemented on a Signal-Processing Microcomputer," IEEE Transactions on Acoustics, Speech and Signal Processing, USA, Volume ASSP-32, Number 1, pages 183-185, February 1984.
57) B. Bryden and H. Hassanein, "Implementation of Full Duplex 2.4 Kbps LPC Vocoder on a Single TMS320 Microprocessor Chip," Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, USA, 1984.
58) M. Dankberg, R. Iltis, D. Saxton, and P. Wilson, "Implementation of the RELP Vocoder Using the TMS320," Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, USA, 1984.
59) A. Holck and W. Anderson, "A Single-Processor LPC Vocoder," Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, USA, 1984.
60) N. Morgan, "Talking Chips," McGraw-Hill, 1984.
61) A. Kumarkanchan, "Microprocessors Provide Speech to Instruments," Journal of Institute of Electronic and Telecommunication Engineers, India, Volume 29, Number 12, December 1983.
62) L. Dusek, T. Schalk, and M. McMahan, "Voice Recognition Joins Speech on Programmable Board," Electronics, USA, Volume 56, Number 8, pages 128-132, April 1983.
63) J.R. Lineback, "Voice Recognition Listens For Its Cue," Electronics, USA, Volume 56, Number 1, page 110, January 1983.
64) D. Daly and L. Bergeron, "A Programmable Voice Digitizer Using the TI TMS320 Microcomputer," Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, USA, 1983.
65) W. Gass, "The TMS32010 Provides Speech I/O for the Personal Computer," Mini/Micro Northeast Electronics Show and Convention, USA, 1983.
66) A. Holck, "Low-Cost Speech Processing with TMS32010," Midcon/83 Conference Record, USA, 1983.
67) H. Strube, R. Wilhelms, and P. Meyer, "Towards Quasiarticulatory Speech Synthesis in Real Time," Proceedings of EUSIPCO-83 Second European Signal Processing Conference, Netherlands, 1983.
68) T. Schalk and M. McMahan, "Firmware-Programmable $\mu \mathrm{c}$ Aid Speech Recognition," Electronic Design, Volume 30, Number 15, pages 143-147, July 1982.

## Control

1) I. Ahmed, "16-Bit DSP Microcontroller Fits Motion Control System Application," PCIM, October 1988.
2) D. Bursky, "Merged Resources Solve Control Headaches," Electronic Design, USA, pages 157-159, October 1988.
3) I. Ahmed, "Implementation of Self Tuning Regulators with TMS320 Family of Digital Signal Processors," MOTORCON '88, pages 248-262, September 1988.
4) D. Dunnion, M. Stropoli, "Design a Hard-Disk Controller with DSP Techniques," Electronic Design, USA, pages 117-121, September 1988.
5) R. van der Kruk, J. Scannell, "Motion Controller Employs DSP Technology," PCIM, September 1988.
6) S.W. Yates, R.D. Williams, "A Fault Tolerant Multiprocessor Controller For Magnetic Bearings," IEEE Micro, USA, Volume 8, Number 4, page 6, August 1988.
7) I. Garate, R.A. Carrasco, A.L. Bowden, "An Integrated Digital Controller For Brushless AC Motors Using a DSP Microprocessor," Third International Conference on Power Electronics and Variable-Speed Drive, Conference Publication Number 291, pages 249-252, July 1988.
8) J.M. Corliss, R. Neubert, "DSP Keeps Disk Drive on Track," Computer Design, USA, pages 60-65, June 1988.
9) Y.V.V.S. Murty, W.J. Smolinski, S. Sivakumar, "Design of a Digital Protection Scheme For Power Transformers Using Optimal State Observers," IEE Proc. C, Generation Transmission, Distribution, Great Britain, Volume 135, Number 3, pages 224-230, May 1988.
10) R.D. Jackson, D.S. Wijesundera, "Direct Digital Control of Induction Motor Currents," IEE Colloquium on 'Microcomputer Instrumentation and Control Systems in Power Electronics', Great Britain, Digest Number 61, 1/1-3, April 1988.
11) A. Lovrich, G. Troullinos, R. Chirayil, "An All Digital Automatic Gain Control," Proceedings of ICASSP 88, USA, Volume D, page 1734, April 1988.
12) K. Bala, "Running on Imbedded Power," Electronics Engineering Times, USA, March 1988.
13) I. Ahmed, S. Meshkat, "Using DSPs in Control," Control Engineering, February 1988.
14) M. Babb (Editor), "Solving Control Problems With Specialized Processors," Control Engineering, February 1988.
14) S. Meshkat, "High-Level Motion Control Programming Using DSPs," Control Engineering, February 1988.
15) S. Meshkat, I. Ahmed, "Using DSPs in AC Induction Motor Drives," Control Engineering, February 1988.
16) J. Tan, N. Kyriakopoulos, "Implementation of a Tracking Kalman Filter on a Digital Signal Processor," IEEE Transactions on Industrial Electronics, USA, Volume 35, Number 1, pages 126-134, February 1988.
17) H. Hanselmann, "LQG-Control of a Highly Resonant Disc Drive Head Positioning Actuator," IEEE Transactions on Industrial Electronics, USA, Volume 35, Number 1, pages 100-104, February 1988.
18) I. Ahmed, "DSP Architectures for Digital Control Systems," SATECH 1988, 1988.
19) S. Meshkat, "Advanced Motion Control Systems," Intertec Communications - Ventura, CA., 1988.
20) I. Ahmed, S. Lundquist, "DSPs Tame Adaptive Control," Machine Design, USA, Volume 59, Number 28, pages 125-129, November 1987.
21) B.K. Bose, P.M. Szczesny, "A Microcomputer-Based Control and Simulation of an Advanced IPM Synchronous Machine Drive System For Electric Vehicle Propulsion," Proceedings of IECON '87, Volume 1, pages 454-463, November 1987.
22) Y. Dote, M. Shinojima, R.G. Hoft, "Digital Signal Processor (DSP)-Based Novel Variable Structure Control For Robotic Manipulator," Proceedings of IECON '87, Volume 1, pages 175-179, November 1987.
23) J.P. Pratt, S. Gruber, "A Real-Time Digital Simulation of Synchronous Machines: Stability Considerations and Implementation," IEEE Transactions on Industrial Electronics, USA, Volume IE-34, Number 4, pages 483-493, November 1987.
24) I. Ahmed, "Deadbeat Controllers and Observers with the TMS320," MOTORCON '87, pages 22-33, September 1987.
25) I. Ahmed, S. Lindquist, "Digital Signal Processors: Simplifying High-Performance Control," Machine Design, September 1987.
26) R.D. Ciskowski, C.H. Liu, H.H. Ottesen, S.U. Rahman, "System Identification: An Experimental Verification," IBM Journal of Research and Development, Volume 31, Number 5, pages 571-584, September 1987.
27) J.A. Taufiq, R.J. Chance, C.J. Goodman, "On-Line Implementation of Optimised PWM Schemes For Traction Inverter Drives," International Conference of 'Electric Railway Systems For a New Century', Conference Publication Number 279, September 1987.
28) Y. Dote, M. Shinojima, H. Yoshimura, "Microprocessor-Based Novel Variable Structure Control For Robot Manipulator," Proceedings of the 10th. IFAC World Congress, July 1987.
29) H. Hanselmann, A. Schwarte, "Generation of Fast Target Processor Code From High Level Controller Descriptions," Presented at 10th. IFAC World Congress, July 1987.
30) E. Debourse, "Emergence of DSPs in Machine-Tool Axes Control Systems: Application of Distributed Interpolation Concepts," Proceedings of the International Workshop on Industrial Automation, February 1987.
31) C. Chen, "The Mathematical Model and Computer Simulation of an LCI Drive," Electrical Machinery Power Systems, USA, Volume 13, Number 3, pages 195-206, 1987.
32) R.D. Ciskowski, C.H. Liu, H.H. Ottesen, S.U. Rahman, "System Identification: An Experimental Verification," IBM Journal of Research and Development, USA, September 1987.
33) H. Hanselmann, "Implementation of Digital Controllers - A Survey," Automatica, Volume 23, Number 1, pages 7-32, 1987.
34) H. Henrichfreise, W. Moritz, H. Siemensmeyer, "Control of a Light, Elastic Manipulation Device," Conference on Applied Motion Control, 1987.
35) M.C. Stich, "Digital Servo Algorithm For Disk Actuator Control," Conference on Applied Motion Control, pages 35-41, 1987.
36) T. Takeshita, K. Kameda, H. Ohashi, N. Matsui, "Digital Signal Processor Based High Speed Current Control of Brushless Motor," Electronic Engineering, Japan, USA, Volume 106, Number 6, pages 42-49, November-December 1986.
37) R. Lessmeier, W. Schumacher, W. Leonard, "Microprocessor-Controlled AC-Servo Drives With Synchronous or Induction Motors: Which is Preferable?," IEEE Transactions On Industry Applications, USA, September/October 1986.
38) R. Alcantara, J. Prado, C. Guegen, "Fixed-Point Implementation of the Fast Kalman Algorithm: Using the TMS32010 Microprocessor," Proceedings of EUSIPCO-86, Volume 2, pages 1335-1338, September 1986.
39) B. Nowrouzian, M.H. Hamza, "DC Motor Control Using a Switched-Capacitor Circuit," Proceedings of the IASTED International Symposium on High Technology in the Power Industry, pages 352-356, August 1986.
40) N. Matsui, T. Takeshita, "Digital Signal Processor-Based Controllers For Motors," SICE, July 1986.
41) H. Hanselmann, "Using Digital Signal Processors For Control," Proceedings of EICON, 1986.
42) H. Hanselmann, W. Moritz, "High Bandwidth Control of the Head Positioning Mechanism in a Winchester Disc Drive," Proceedings of IECON, pages 864-869, 1986.
43) R. Cushman, "Easy-to-Use DSP Converter ICs Simplify Industrial-Control Tasks," Electronic Design, USA, Volume 29, Number 17, pages 218-228, August 1984.
44) W. Loges, "Signal Processor as High-Speed Digital Controller," Elektronik Industrie, Germany, Volume 15, Number 5, pages 30-32, 1984.
45) W. Loges, "Higher-Order Control Systems with Signal Processor TMS320," Elektronik, Germany, Volume 32, Number 25, pages 53-55, December 1983.

## Military

1) V. Lazzari, Quacchia, M. Sereno, E. Turco, "Implementation of a $16 \mathrm{Kbit} / \mathrm{s}$ Split Band-Adaptive Predictive Codec For Digital Mobile Radio Systems," CSELT Technical Reports, Italy, Volume 16, Number 5, pages 443-447, August 1988.
2) P. Papamichalis, J. Reimer, "Implementation of the Data Encryption Standard Using the TMS32010," Digital Signal Processing Applications, 1986.

## Telecommunications

1) S. Casale, R. Russo, G.C. Bellina, "Optimal Architectural Solution Using DSP Processors for the Implementation of an ADPCM Transcoder," Proceedings of GLOBECOM '89, pages 1267-1273, November 1989.
2) A. Lovrich and J.B. Reimer, "A Multi-Rate Transcoder," Transactions on Consumer Electronics, USA, November 1989.
3) J.L. Dixon, V.K. Varma, N.R. Sollenberger, D.W. Lin, "Single DSP Implementation of a 16 Kbps Sub-Band Speech Coder for Portable Communications," Proceedings of ICASSP 89, USA, pages 184-187, May 1989.
4) J.L. So, "Implementation of an NIC (Nearly Instantaneous Companding) 32 Kbps Transcoder Using the TMS320C25 Digital Signal Processor," Proceedings of GLOBECOM 88, Section 43.4, November 28 - December 1, 1988.
5) V. Lazzari, Quacchia, M. Sereno, E. Turco, "Implementation of a $16 \mathrm{Kbit} / \mathrm{s}$ Split Band-Adaptive Predictive Codec For Digital Mobile Radio Systems," CSELT Technical Reports, Italy, August 1988.
6) V. Del Bello, "Signal Processor For Telephone Functions," Elettronica Oggi, Italy, Number 63, pages 155-157, June 1988.
7) N. Tamaki, "Studies on Subscriber Line Equalizer Using Decision Feedback Equalizing Circuit," Transactions of the Institute of Information Communication English B., Japan, Volume J71B, Number 5, pages 616-625, May 1988.
8) A. Charbonnier, J-P. Petit, "Sub-Band ADPCM Coding for High Quality Audio Signals," Proceedings of ICASSP 88, USA, Volume A, page 2540, April 1988.
9) D. Chase, A. Gersho, "Real-Time VQ Codebook Generation Hardware for Speech Processing," Proceedings of ICASSP 88, USA, Volume 3, pages 1730-1733, April 1988.
10) V.K. Jain, S.S. Skrzypkowiak, R.B. Heathcock, "TMS320C25 Based Enhanced ADPCM Transcoder," Proceedings of ICASSP 88, USA, Volume S, page 635, April 1988.
11) T.C. Jedrey, N.E. Lay, W. Rafferty, "An All Digital 8-DPSK TCM Modem for Land Mobile Satellite Communications," Proceedings of ICASSP 88, USA, Volume D, page 1722, April 1988.
12) A. Lovrich, G. Troullinos, R. Chirayil, "An All Digital Automatic Gain Control," Proceedings of ICASSP 88, USA, Volume D, page 1734, April 1988.
13) P. Voros, "High-Quality Sound Coding Within $2 \times 64 \mathrm{Kbit} / \mathrm{s}$ Using Instantaneous Dynamic Bit-Allocation," Proceedings of ICASSP 88, USA, Volume A, page 2536, April 1988.
14) H.P. Widmer, R. Keung, "HF Data Communication For Extremely Low SNR and High Interference Level," Fourth International Conference on HF Radio Systems and Techniques, Conference Publication Number 284, pages 33-37, April 1988.
15) W.B. Mikhael, P.D. Hill, "Performance Evaluation of a Real-Time TMS32010-Based Adaptive Noise Canceller (ANC)," IEEE Transactions on Acoustics, Speech and Signal Processing, USA, Catalog Number 86CH2255-8, Volume 3, pages 892-895, March 1988.
16) H. Ando, M. Nakaya, H. Hona, I. Iizuka, Y. Horiba, "A DSP Line Equalizer VLSI for TCM Digital Subscriber-Line Transmission," IEEE Journal of Solid-State Electronics, USA, Volume 23, Number 1, pages 118-123, February 1988.
17) N. Tamaki, "Studies on an Adaptive Line Equalizer For Subscriber Loops," Transactions of the Institute of Electronic Information Communication English B., Japan, Volume J71B, Number 2, pages 172-180, February 1988.
18) A. Ayerbe Garcia, J.M. Guell Rabasso, A.L. Villen, J.A. Martinez Ayuso, "ADPCM-32 Kbit/S Coder/Decoder For Telephone Channels," Mundo Electron., Spain, Number 178, pages 103-109, November 1987.
19) M. Ishikawa, Y. Tanaka, T. Kimura, "An Adaptive Line Equalizer VLSI Using Digital Signal Processing," IEEE Journal of Solid-State Circuits, USA, Volume 23, Number 3, pages 830-835, November 1987.
20) O. Matsubara, K. Yabuta, E. Sato, H. Takatori, "A Switched Capacitor Line Equalizer For Digital Subscriber Loop Transmission," Conference Record of GLOBECOM Tokyo '87, Japan, Volume 3, pages 1746-1751, November 1987.
21) J.D. Mills, V.P. Telang, C.E. Rohrs, "A Data and Voice System For The General Service Telephone Network," Proceedings of IECON '87, Volume 2, pages 1143-1148, November 1987.
22) C. Nuthalapati, "A FET Processor Based Phase Noise Measurement System For Radar ATE," Proceedings of AUTOTESTCON '87, Catalog Number 87CH2510-6, pages 47-51, November 1987.
23) N. Tamaki, S. Sugimoto, F. Mano, "A Line Terminating Circuit Using the DSP Technique," Conference Records of GLOBECOM Tokyo '87, Japan, Volume 3, pages 1731-1735, November 1987.
24) G.J. Saulnier, W.A. Haskins, P. Das, "Tone Jammer Suppression in a Direct Sequence Spread Spectrum Receiver Using Adaptive Lattice and Transversal Filters," Conference Records of MILCOM 87, USA, Volume 1, pages 123-127, October 1987.
25) R.N. Bera, K.S. Rattan, "Real-Time Simulation of the Unmanned Research Vehicle Using Multi-Rate Sampling," Fifth International Conference on System Engineering, USA, Catalog Number 87CH2480-2, pages 573-578, September 1987.
26) D. Shear, "Design and Build a Transponder Using DSP Tools. (A Related Article on the Functional Capability of the Acoustic Transponder)," EDN: Electronic Design News, USA, Volume 32, Number 18, pages 137-148, September 1987.
27) S.H. Kim, K.R. Hong, H.B. Han, W.H. Hong, "Implementation of Real Time Adaptive Lattice Predictor on Digital Signal Processor," Proceedings of TENCON 87, South Korea, Volume 3, pages 1131-1135, August 1987.
28) W. Mattern, "Multifrequency Communications Channel Decoder With The Type TMS32010 Signal Processor Circuit," Electronics Industry, France, Number 127, Supplement Number 13, pages 29-33, June 1987.
29) A. Ciaramella, G. Venuti, "Vector Quantization Firmware For an Acoustical Front-End Using the TMS32020," Proceedings of ICASSP 87, USA, Catalog Number 87CH2396-0, Volume 4, pages 1895-1898, April 1987.
30) M.J. Pettitt, D. Remedios, A.W. Davis, A. Hadjifotiou, S. Wright, "A Coherent Transmission System Using DFB Lasers and Phase Diversity Reception," IEE Colloquium on 'High Capacity Fibre Optic Systems', Great Britain, Digest Number 23, 9/1-5, February 1987.
31) G.J. Saulnier, K. Yum, P. Das, "The Suppression of Tone-Jammers Using Adaptive Lattice Filtering," IEEE International Conference on Communications '87, USA, Volume 2, pages 869-873, June 1987.
32) H.H. Lu, D. Hedberg, B. Fraenkel, "Implementation of High-Speed Voiceband Data Modems Using The TMS320C25," Proceedings of ICASSP 87, USA, Catalog Number 87CH2396-0, Volume 4, pages 1915-1918, April 1987.
33) S. Ono, N. Kondoh, M. Kobayashi, M. Hata, "A New Automatic Equalizer For Digital Subscriber Loops," Electronics and Communications in Japan, Part 1, Japan, Volume 70, Number 4, pages 93-102, April 1987.
34) J.M. Perl, M. Aharoni, "TMS32010 Implementation of an Improved Kineplex Type HF Modem," Proceedings of MELECON, Catalog Number 87CH2425-7, pages 135-140, March 1987.
35) R. Schwarze, W. Tobergte, "Digital Signal Processors in Data Transmission," Elektronik, Germany, Volume 36, Number 3, pages 73-78, February 1987.
36) J.I. Statman, E.R. Rodenmich, "Parameter Estimation Based on Doppler Frequency Shifts," IEEE Transactions on Aerospace and Electronic Systems, USA, Volume AES-23, Number 1, pages 31-39, January 1987.
37) R. Komiya, K. Yoshida, N. Tamaki, "The Loop Coverage Between TCM and Echo Canceller Under Various Noise Considerations," IEEE Transactions on Communications, USA, Volume COM-34, Number 11, pages 1058-1067, November 1986.
38) I. Ahmed, A. Lovrich, "Adaptive Line Enhancer Using the TMS320C25," Conference Records of Northcon/86, USA, 14/3/1-10, September/October 1986.
39) H. Brehm, W. Stammler, M. Warner, "Design of a Highly Flexible Digital Simulator For Narrowband Fading Channels," Proceedings of EUSIPCO-86, Volume 2, pages 1113-1116, September 1986.
40) J.M. Perl, A. Bar, J. Cohen, "TMS-320 Implementation of a x 2400 BPS V. 26 Modem," Proceedings of EUSIPCO-86, Volume 2, pages 1121-1124, September 1986.
41) C.R. Spitzer, "All-Digital Jets Are Taking Off," IEEE Spectrum, USA, Volume 23, Number 9, pages 51-56, September 1986.
42) D. Boudreau, "2400 BPS TMS32010 Modem Implementation For Mobile Satellite Applications," Proceedings of the Thirteenth Biennial Symposium on Communications, Canada, Volume B-3, pages 1-4, June 1986.
43) R. Chirayil, A. Lovrich, G. Troullinos, "2400 BPS Modem Implementation Using a General Purpose DSP," Digest of Technical Papers for 1986 International Conference on Consumer Electronics, pages 110-111, June 1986.
44) D. Hanke, K. Wilhelm, H. Meyer, "Development and Application of In-Flight Simulator For Flying Qualities Research at DFVLR," Proceedings of NAECON 1986, USA, Catalog Number 86CH2307-7, Volume 2, pages 490-498, May 1986.
45) P.D. Hill, W.B. Mikhael, "Performance Evaluation of a Real-Time TMS32010-Based Adaptive Noise Filter," Proceedings of 1986 IEEE International Symposium on Circuits and Systems, USA, Volume 36, Number 3, pages 411-412, May 1986.
46) S.M. Kuo, M.A. Rodriquez, "Implementation of an Adaptive Frequency Sampling Line Enhancer," Proceedings of 1986 IEEE International Symposium on Circuits and Systems, USA, Catalog Number 86CH2255-8, Volume 3, pages 896-899, May 1986.
47) G. Troullinos, J. Bradley, "Split-Band Modem Implementation Using The TMS32010 Digital Signal Processor," Conference Records of Electro/86 and Mini/Micro Northeast, USA, 14/1/1-21, May 1986.
48) R. Vemula, E. Lee, "A Microprocessor-Based Noise Cancellor For The Cockpit," Proceedings of NAECON 1986, USA, Catalog Number 86CH2307-7, Volume 4, pages 1323-1327, May 1986.
49) C.R. Cole, A. Haoui, P.L. Winship, "A High-Performance Digital Voice Echo Canceller on a Single TMS32020," Proceedings of ICASSP 86, USA, Catalog Number 86CH2243-4, Volume 1, pages 429-432, April 1986.
50) F.L. Kitson, K.A. Zeger, "A Real-Time ADPCM Encoder Using Variable Order Prediction (Speech)," Proceedings of ICASSP 86, USA, Catalog Number 86CH2243-4, Volume 2, pages 825-828, April 1986.
51) G. Mirchandani, R.C. Gaus, Jr., L.K. Bechtel, "Performance Characteristics of a Hardware Implementation of The Cross-Talk Resistant Adaptive Noise Canceller," Proceedings of ICASSP 86, USA, Catalog Number 86CH2243-4, Volume 1, pages 93-96, April 1986.
52) G.S. Muller, C.K. Pauw, "Acoustic Noise Cancellation," Proceedings of ICASSP 86, USA, Catalog Number 86CH2243-4, Volume 2, pages 913-916, April 1986.
53) J. Rothweiler, "Performance of a Real Time Low rate Voice Codec," Proceedings of ICASSP 86, USA, Catalog Number 86CH2243-4, Volume 4, pages 3039-3042, April 1986.
54) P.J. Wilson, J.M. Puetz, A.V. McCree, D.T. Wang, "An Integrated Voice Codec and Echo Canceller Implemented in a Single DSP Processor," Proceedings of ICASSP 86, USA, Catalog Number 86CH2243-4, Volume 2, pages 1333-1336, April 1986.
55) G. Albertengo, S. Benedetto, E. Biglieri, "A DSP Application: An Adaptive Echo Canceller," Proceedings of the IASTED International Symposium: Modelling, Identification, and Control, MIC '86, pages 69-72, February 1986.
56) A.W. Davis, S. Wright, M.J. Pettitt, J.P. King, K. Richards, "Coherent Optical Receiver For $680 \mathrm{Mbit} / \mathrm{S}$ Using Phase Diversity," Electron. Lett., Great Britain, Volume 22, Number 1, pages 9-11, January 1986.
57) C.R. Cole, A. Haoui, P.L. Winship, "A High-Performance Digital Voice Echo Canceller on a Single TMS32020," Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, USA, 1986.
58) H. Hanselmann, W. Moritz, "High Bandwidth Control of the Head Positioning Mechanism in a Winchester Disc Drive," Proceedings of IECON, 1986.
59) Sleeper Product - "A Combo Voice/Data I/O Card - Awakens Interest," Electronic Engineering Times, USA, pages 80-81, November 11, 1985.
60) R. Chirayil, P. Ehlig, J. Bradley, G. Troullinos, "Modem Implementation Using The TMS32010," Proceedings of the National Communications Forum, 1985, Volume 39, pages 711-715, September 1985.
61) P. Ehlig, "DSP Chip Adds Multitasking Telecomm Capability to Engineering Workstation," Electronic Design, USA, Volume 33, Number 10, pages 173-184, May 2, 1985.
62) W.J. Christmas, "A Microprocessor-Based Digital Audio Coder and Decoder," International Conference on Digital Processing of Signals in Communications, Number 62, pages 22-26, April 1985.
63) J. Reimer, M. McMahan and M. Arjmand, "ADPCM on a TMS320 DSP Chip," Proceedings of SPEECH TECH 85, pages 246-249, April 1985.
64) P. Mock, "Add DTMF Generation and Decoding to DSP-$\mu \mathrm{P}$ Designs," Electronic Design, USA, Volume 30, Number 6, pages 205-213, March 1985.
65) V. Milutinovic, "4800 Bit/s Microprocessor-Based CCITT Compatible Data Modem," Microprocessing and Microprogramming, Volume 15, Number 2, pages 57-74, February 1985.
66) G. Corsini and P. Terreni, "A Radar Echo Simulator Based on $\mu \mathrm{P}$ TMS320," Proceedings of MELECON/85 IEEE Mediterranean Electrotechnical Conference (Sponsors: Mayor of Madrid, Ministers of Industry and Energy, Spanish National Telephone Co.), USA, Volume 2, pages 327-330, 1985.
67) T. Fjallbrant, "A TMS320 Implementation of a Short Primary Block ATC-System with Pitch Analysis," International Conference on Digital Processing of Signals in Communications, Number 62, pages 93-96, 1985.
68) D.P. Kelly and J.L. Melsa, "Syllabic Companding and $32 \mathrm{~Kb} / \mathrm{s}$ ADPCM Performance," IEEE International Conference on Communications, USA, Volume 1, pages 414-417, 1985.
69) A. Vaghar and V. Milutinovic, "An Analysis of Algorithms for Microprocessor Implementation of High-Speed Data Modems," Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, USA, Volume 4, pages 1656-1659, 1985.
70) J. Perl, "Channel Coding in a Self-Optimizing HF Modem," International Zurich Seminar on Digital Communications; Applications of Source Coding, Channel Coding and Secrecy Coding: Proceedings, pages 101-106, 1984.
71) R. Chirayil, P. Ehlig, "Integrating Low to Medium and High Speed Modems," Mini-Micro Southwest-84 Computer Conference and Exhibition, September 1984.

## Automotive

1) Kun-Shan Lin, "Trends of Digital Signal Processing in Automotive," International Congress on Transportation Electronics (CONVERGENCE '88), October 1988.
2) K.E. Beck, M.M. Hahn, "A Real-Time Combustion Analysis Instrument," SAE Technical Paper Series, USA, February-March 1988.
3) M. Payne, "Do Not Disturb: Lotus in Action. (Lotus Racing Cars Use of an Active Suspension System)," Electronics Weekly, USA, i20 1394, page 12, January 1988.
4) D.A. Williams, S. Oxley, "Application of the Digital Signal Processor to an Automotive Control System," 6th. International Conference on Automotive Electronics, October 1987.
5) C.M. Anastasia, G.W. Pestana, "A Cylinder Pressure Sensor for Closed Loop Engine Control," SAE Technical Paper Series, February 1987.

## Consumer

1) A. Lovrich and J.B. Reimer, "A Multi-Rate Transcoder," Digest of Technical Papers for 1989 International Conference on Consumer Electronics, June 7-9 1989.
2) G.A. Frantz, J.B. Reimer, and R.A. Wotiz, "Julie, The Application of DSP to a Product," Speech Tech Magazine, USA, September 1988.
3) J.B. Reimer and G.A. Frantz, "Customization of a DSP Integrated Circuit for a Customer Product," Transactions on Consumer Electronics, USA, August 1988.
4) J.B. Reimer, P.E. Nixon, E.B. Boles, and G.A. Frantz, "Audio Customization of a DSP IC," Digest of Technical Papers for 1988 International Conference on Consumer Electronics, June 8-10 1988.
5) H. Mitschke, "Video Recorder: Picture From A Store," Funkschau, West Germany, Number 9, pages 56-58, April 1988.
6) J.B. Reimer, P.E. Nixon, E.B. Boles, G.A. Frantz, "Audio Customization of a DSP IC," Digest of Technical Papers for 1987 International Conference on Consumer Electronics, June 1987.
7) G.R. Steber, "Audio Frequency DSP Laboratory on a Chip-TMS32010," Proceedings of IECON '87, Volume 2, pages 1047-1051, November 1987.

## Industrial

1) R.C. Chance, T.A. Taufiq, "A TMS32010 Based Near Optimized Pulse Width Modulated Waveform Generator," Third International Conference on Power Electronics \& Variable Speed Drives, Conference Publication Number 291, July 1988.
2) G. Anwar, R. Horowitz, M. Tomizuka, "Implementation of a MRAC for a Two Axis Direct Drive Robot Manipulator Using a Digital Signal Processor," American Control Conference, pages 658-660, June 1988.
3) D.E. Luttrell, T.A. Dow, "Control of Precise Positioning System with Cascaded Colinear Actuators," American Control Conference, pages 121-126, June 1988.
4) Y.V.V.S. Murty, W.J. Smolinski, S. Sivakumar, "Design of a Digital Protection Scheme For Power Transformers Using Optimal State Observers," IEE Proc. C, Generation Transmission, Distribution, Great Britain, Volume 135, Number 3, pages 224-230, May 1988.
5) M. Ruscio, M. Santoro, M. Adorni, A. Chiaravalloti, "Digital Control System For The Coordinated Boiler/Turbine Control in the ENEL Piombino Power Station," Elettrotecnica, Italy, Volume 75, Number 3, pages 253-258, March 1988.
6) J.A. Taufiq, R.J. Chance, C.J. Goodman, "On-Line Implementation of Optimised PWM Schemes For Traction Inverter Drives," International Conference of 'Electric Railway Systems For a New Century', Great Britain, Conference Publication Number 279, pages 63-67, September 1987.
7) Y. Dote, M. Shinojima, H. Hoshimura, "Microprocessor-Based Novel Variable Structure Control for Robot Manipulator," Proceedings of the 10th. IFAC World Congress, July 1987.
8) H. Henrichfreise, W. Moritz, H. Siemensmeyer, "Control of a Light, Elastic Manipulation Device," Conference on Applied Motion Control, pages 57-66, 1987.
9) Y. Wang, M. Andrews, S. Butner, G. Beni, "Robot-Controller System," 15th Annual Symposium on Incremental Motion Control Systems \& Devices, pages 17-26, June 1986.
10) R. Cushman, "Easy-to-Use DSP Converter ICs Simplify Industrial-Control Tasks," Electronic Design, USA, Volume 29, Number 17, pages 218-228, August 1984.
11) P. Rojek and W. Wetzel, "Multiprocessor Concept for Industrial Robots: Multivariable Control with Signal Processors," Elektronik, Germany, Volume 33, Number 16, pages 109-113, August 1984.
12) G. Farber, "Microelectronics-Developmental Trends and Effects on Automation Techniques," Regelungstechnik Praxis, Germany, Volume 24, Number 10, pages 326-336, October 1982.

## Medical

1) F.S. Schlindwein, D.H. Evans, "A Real-Time Autoregressive Spectrum Analyzer for Doppler Ultrasound Signals," Ultrasound in Medicine and Biology, Volume 15, Number 3, pages 263-272, 1989.
2) N. Dillier, "Programmable Master Hearing Aid With Adaptive Noise Reduction Using A TMS32020," Proceedings of ICASSP 88, USA, Volume A, page 2508, April 1988.
3) P.B. Knapp, H.S. Lusted, "A Real-Time Digital Signal Processing System for Bioelectric Control of Music," Proceedings of ICASSP 88, USA, Volume A, page 2556, April 1988.
4) R.B. Knapp, B. Townshend, "A Real-Time Digital Signal Processing System for an Auditory Prosthesis," Proceedings of ICASSP 88, USA, Volume A, page 2493, April 1988.
5) L.R. Morris, P.B. Barszczewski, "Design and Evolution of a Pocket-Sized DSP Speech Processing System for a Cochlear Implant and Other Hearing Prosthesis Applications," Proceedings of ICASSP 88, USA, Volume A, page 2516, April 1988.
6) T.J. Sullivan, S.M. Natarajan, "VLSI Based Design of a Battery-Operated Digital Hearing Aid," Proceedings of ICASSP 88, USA, Volume A, page 2512, April 1988.
7) A.J. Pratt, R.E. Gander, B.R. Brandell, "Real-Time Median Frequency Estimator," Proceedings of the Ninth Annual Conference of the IEEE Engineering in Medicine and Biology Society, USA, November 1987.
8) H. Xu, Y.H. Liang, L.G. Zhou, "The Real-Time Realization of Fetal ECG Heart Rate Monitor by Adaptive System," Proceedings of the Ninth Annual Conference of the IEEE Engineering in Medicine and Biology Society, USA, (Catalog Number 87CH2513-0), Volume 3, pages 1662-1663, November 1987.
9) L.R. Morris, P. Braszczewski, J. Bosloy, "Algorithm Selection and Software Time/Space Optimisation for a DSP Micro-Based Speech Processor For a Multi-Electrode Cochlear Implant," Proceedings of ICASSP 87, USA, Catalog Number 87CH2396-0, Volume 2, pages 972-975, April 1987.
10) K.C. McGill, K.L. McMillan, "A Smart Trigger For Real-Time Spike Classification," Proceedings of the Eighth Annual Conference of the IEEE Engineering in Medicine and Biology Society, USA, Catalog Number 86CH2368-9, Volume 1, pages 275-278, November 1986.
11) C. Murphy, P. Rolfe, "Application of the TMS320 Signal Processor For The Real-Time Processing of The Doppler Ultrasound Spectra," Proceedings of the Eighth Annual Conference of the IEEE Engineering in Medicine and Biology Society, USA, Catalog Number 86CH2368-9, Volume 2, pages 1175-1178, November 1986.
12) "Innovations: Digital Hearing Aid," IEEE Spectrum, USA, 22 December 1985.
13) A. Casini, G. Castellini, P.L. Emiliani, and S. Rocchi, "An Auxiliary Processor for Biomedical Signals Based on a Signal Processing Chip," Digital Signal Processing-1984: Proceedings of the International Conference, pages 228-232, September 1984.
14) T.R. Myers, "A Portable Digital Speech Processor for an Auditory Prosthesis," Wescon/84 Conference Record, USA, 1984.

## Development Support

1) M. Karjalainen, "A LISP-Based High-Level Programming Environment for the TMS320C30," Proceedings of ICASSP 89, USA, pages 1150-1153, May 1989.
2) B.C. Mather, "Digital Filter Design Package (DFP2), Version 2.12," IEEE Spectrum, USA, Volume 25, Number 7, page 16, July 1988.
3) R. Weiss, "PC Package Ends DSP Drudgery. (Monarch Software Package, Digital Signal Processing)," Electronic Engineering Times, USA, Number 494, page 57, July 1988.
4) A. Kohl, "PC Development Environment For Signal Processors," Elektron. Prax., West Germany, Volume 23, Number 4, April 1988.
5) R. Simar, Jr., A. Davis, "The Application of High-Level Languages to Single-Chip Digital Signal Processors," Proceedings of ICASSP 88, USA, Volume 3, pages 1678-1681, April 1988.
6) A. Bindra, "New Chips, Tools For Signal Processing on Tap: DSP Seminar Attracts Third-Party Developers," Electronic Engineering Times, USA, Number 470, page 6, January 1988.
7) R.J. Chance, "Simulation of Multiple Digital Signal Processor Systems," Journal of Microcomputer Applications, Great Britain, Volume 11, Number 1, pages 1-19, January 1988.
8) A. Kohl, "Pascal and C-Compilers for the Type TMS320C25 Signal Processor," Elektron. Ind., West Germany, Volume 19, Number 4, pages 58, 60, 62, 1988.
9) R.J. Chance, B.S. Jones, "A Combined Software/Hardware Development Tool For The TMS32020 Digital Signal Processor," Journal of Microcomputer Applications, Great Britain, Volume 10, Number 3, pages 179-197, July 1987.
10) M.A. Zissman, G.C. O’Leary, D.H. Johnson, "A Block Diagram For a Digital Signal Processing MIMD Computer," Proceedings of ICASSP 87, USA, Catalog Number 87CH2396-0, Volume 4, pages 1867-1870, April 1987.
11) M.J. Tracy, "Forth as a Language For Digital Signal Processing," 1987 Rochester Forth Conference on Comparative Computer Architectures, USA, Volume 5, Number 1, pages 221-224, 1987.
12) A. Gharahgozlou, M. Banaouas, E. Babani, "Software Development For a Microprocessor on a 'Host' Computer," Electronic Industry, France, Number 115, pages 61-64, November 1986.
13) D.R. Campbell, C. Canning, K. Miller, "Crossassembler For The TMS32010 Digital Signal Processor," Microprocessors and Microsystems, Great Britain, Volume 10, Number 8, pages 434-441, October 1986.
14) J. Chance, "Simulation Experiences in the Development of Software For Digital Signal Processors," Microprocessors and Microsystems, Great Britain, Volume 10, Number 8, pages 419-426, October 1986.
15) A.C.P. van Meer, "TMS32010 Evaluation Module Controller," Eindhoven University of Technology, Report Number EUT-86-E-162, 42 pages, October 1986.
16) S.E. Reyer, "A Demonstration Unit For Digital Signal Processing Development and Experimentation," Proceedings of IECON '86, Catalog Number 86CH2334-1, Volume 2, pages 641-646, September/October 1986.
17) J.R. Parker, "A Subset FORTRAN Compiler For a Modified Harvard Architecture," SIGPLAN Not, USA, Volume 21, Number 9, pages 57-62, September 1986.
18) H. Harrison, "A High-Level Language Programming Environment for Speech and Signal Processing," Proceedings of Speech Technology 86, April 1986.
19) S. Suehiro, K. Sugimoto, "Forth Machine With Hardware Interpreter Designed to Increase Execution Speed," Nikkei Electron., Japan, Number 396, pages 213-245, 1986.
20) G. Frantz and K. Lin, "The TMS320 Family Design Tools," Proceedings of SPEECH TECH 85, pages 238-240, April 1985.
21) R. Schafer, R. Mersereau, and T. Barnwell, "Software Package Brings Filter Design to PCs," Computer Design, USA, Volume 23, Number 13, pages 119-125, November 1984.
22) G. Pawle and T. Faherty, "DSP/Development Board Offers Host Independence," Computer Design, USA, Volume 23, Number 12, pages 109-116, October 15, 1984.
23) R. Mersereau, R. Schafer, T. Barnwell, and D. Smith, "A Digital Filter Design Package for PCs and TMS320," MIDCON/84 Electronic Show and Convention, USA, 1984.
24) R. Cushman, "Sophisticated Development Tool Simplifies DSP-Chip Programming," Electronic Design, USA, Volume 28, Number 20, pages 165-178, September 1983.
25) W. Gass and M. McMahan, "Software Development Techniques for the TMS320," SOUTHCON/83 Electronics Show and Convention, USA, 1983.
26) R. Wyckoff, "A Forth Simulator for the TMS320 IC," Rochester Forth Applications Conference, USA, pages 141-150, June 1983.

## Index

## A

a/d converter 351
adaptive filter implementation 191
adaptive predictor 196
addressing modes 39
applications 26, 43
  benchmarks 29, 47
  biquad implementation 45
  DCT transforms 53
  digital filtering 26
  FFT transforms 53
  graphics 423
  graphics/image processing 29, 47
  hardware 333
  instrumentation 29
  numeric processing 29
  telecommunications 27, 47, 401
applications board ('C30) 467
architecture 15
  buses 37
  CPU 37
  dedicated hardware multiplier 16
  external interfaces 24
  Harvard architecture 15
  instruction cycle 16
  peripherals 37
  pipelining 15, 24
  TMS320C30 34
arctangent functions 283
auxiliary registers 37

## B

bank switching 345
bibliography 533
bit reversal 287
buses 37
  expansion 350
  primary 337

## C

C callable functions 151
C compiler, libraries 232
C compiler 74
CELP speech coder 403
clock oscillator 356
complex array bit reversal 287
complex conjugate array multiples 286
contents (of this applications book) 8
CPU
  auxiliary registers 37
  organization 36

## D

d/a converter 353
DCT transforms 53, 169
development systems 8
divide functions 284
DMA 24
documents 8
doublelength math 144, 146, 147
DSP
  architecture 15, 34
  characteristics 13

## E

echo canceller 197
error analysis 148
expansion bus 350
exponential functions 282
external interfaces ('C30) 24

## F

family of processors (320) 5, 11
features
  first generation TMS320 17
  second generation TMS320 19
  third generation TMS320 22, 33
  TMS320C14/E14 6
  TMS320C2x 6
  TMS320C50 7
FFT transforms 53, 287
finite impulse response filter (see FIR)
FIR filter 13, 26, 44
floating point
  conversions 287
  doublelength arithmetic 137
  format converter (IEEE) 365
  formats 38, 139
floating point converter (IEEE) 365
function approximation 279

## G

Gas Light Software 290
graphic application 423

## H

hardware applications 333
hardware development systems 8, 26
hardware multiplier 16
Hartley transforms 67, 70
Harvard architecture 15

## I

integer arithmetic 285
integer formats 38
interface categories ('C30) 335
inverse functions 280

## L

linear algebra routines 288
LMS algorithm 199

## M

memory, organization 35
multiply functions 284

## N

natural log functions 283
noise canceller 198
non-linear equation approximation 280

## O

overview of book 3

## P

peripherals 37
peripherals ('C30) 24
pipelining 15, 24, 40
polynomial approximation 279
primary bus 337

## R

read cycle timing 476, 478
ready generation 341
real-time processing 13
references 8
reset signal 357
roundoff noise model 225

## S

serial port 359
sine/cosine functions 282
singlelength math 142, 145
software
  floating-point formats 38
  integer formats 38
  TMS320C25 21
software development systems 8, 26, 41
speech coder 403
SPOX 403, 407
square root functions 280
SRAM, dual port 470
stock market example 14

## T

telecommunications 401
three-D graphics system 423
TMS320C30 applications board 467
TMS34010 graphics processor 441

## V

vector primitive 287
vector utilities 286

## W

wait states 337
write cycle timing 476

## X

XDS 1000 system 362


[^0]:    * TMS320C25 wordlengths are 16 bits
    ** TMS320C30 wordlengths are 32 bits

[^1]:    Appendix A1.

[^2]:    /*************************************************************************/
    /* APPB_getmemblk(), PC side                                             */
    /* Read block of memory to the dual port.                                */
    /* Returns 0 if successful, a -1 if failed.                              */
    /*                                                                       */
    /* Sequence                                                              */
    /*                                                                       */
    /* 1) Find free block of dual port for memory.                           */
    /* 2) Write memory parameters to control block.                          */
    /* 3) Wait for TMS320C30 to put requested memory into the dual port.     */
    /* 4) Read data from the dual port.                                      */
    /* 5) Release block of dual port memory.                                 */
    /*************************************************************************/
    int APPB_getmemblk(cnt, src, dst)
    ULONG cnt;
    ULONG src;
    ULONG *dst;
    {
        DPCNTL *dpctl = (DPCNTL *)DPRAM_CTL;
        ULONG *dpram;
        INT dpblk;
        int i;
        UINT timeout = MAX_SEM_TIME;

        /* 1) Find a free block of dual-port memory. */
        if (APPB_getctlblk(&dpblk)) return(-1);
        dpram = (ULONG *)(DPRAM_MEMBASE + (dpblk * DPRAM_BLK_SIZE));

        /* 2) Write the memory parameters to the control block. */
        if (APPB_getsem(0)) return(-1);
        dpctl[dpblk].command  = HOST_MEM_RD;
        dpctl[dpblk].buf_stat = BUF_EMPTY;
        dpctl[dpblk].count    = cnt;
        dpctl[dpblk].addr     = src;

        /* 3) Wait until the buffer is marked full or the timeout expires. */
        while (--timeout)
        {
            if (!APPB_getsem(0) && (dpctl[dpblk].buf_stat == BUF_FULL)) break;
            if (APPB_relsem(0)) return(-1);
        }
        if (APPB_relsem(0) || !timeout) return(-1);

        /* 4) Read the data from the dual port. */
        for (i = 0; i < cnt; i++)
            *dst++ = *dpram++;

        /* 5) Release the block of dual-port memory. */
        if (APPB_relctlblk(dpblk)) return(-1);

