
334 MIMO System Technology for Wireless Communications
natural number. In the example shown in Figure 11.9, each signal branch is
distributed to four multipliers for which dedicated hardware building blocks
in the FPGA have been used. All multiplications are performed simulta-
neously (i.e., in one 10 ns cycle), and the results are added pairwise in a
subsequent pipeline until a stream is reconstructed. Due to the pipelining,
the final result is obtained after a short delay depending on the number of
antennas. The entire data reconstruction with 3 Tx and 5 Rx antennas requires 60
simultaneous real-valued multiplications (in one cycle), since the 10 received
I and Q signals are combined into 6 transmitted I and Q signals (6 × 10 = 60).
These are followed by four cycles of pairwise additions (⌈log2 10⌉ = 4 adder
levels). The result is a delay of only 50 ns (five 10 ns cycles) after the
corresponding subcarrier signals leave the pipelined FFT units, which is
negligible compared to the OFDM symbol duration of 0.8 µs.* The data
reconstruction unit thus performs the required 100 million matrix-vector
multiplications per second (one per 10 ns clock cycle) continuously in real time.
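The operation counts above can be checked with a minimal Python sketch. It assumes that each complex weight acts on separate I and Q signals through the real-valued expansion shown below, that all real products are formed in one clock cycle, and that each level of the pairwise adder tree takes one further cycle. The function names are hypothetical and the weights and samples are arbitrary test values, not data from the experimental system.

```python
CLOCK_NS = 10  # 100 MHz sample clock, as stated in the text


def real_weight_matrix(W):
    """Expand a complex N_tx x N_rx weight matrix into the equivalent real
    (2*N_tx) x (2*N_rx) matrix acting on stacked [I..., Q...] signals."""
    n_tx, n_rx = len(W), len(W[0])
    R = [[0.0] * (2 * n_rx) for _ in range(2 * n_tx)]
    for t in range(n_tx):
        for r in range(n_rx):
            a, b = W[t][r].real, W[t][r].imag
            # (a + jb)(xI + j xQ) = (a xI - b xQ) + j (b xI + a xQ)
            R[t][r], R[t][n_rx + r] = a, -b
            R[n_tx + t][r], R[n_tx + t][n_rx + r] = b, a
    return R


def adder_tree(values):
    """Add values pairwise, one tree level per clock cycle; return (sum, levels)."""
    levels = 0
    while len(values) > 1:
        values = [sum(values[i:i + 2]) for i in range(0, len(values), 2)]
        levels += 1
    return values[0], levels


def reconstruct(W, y):
    """Form all real products at once, then reduce each output row with an adder tree."""
    R = real_weight_matrix(W)
    x = [c.real for c in y] + [c.imag for c in y]
    products = [[R[i][k] * x[k] for k in range(len(x))] for i in range(len(R))]
    n_mult = sum(len(row) for row in products)  # multipliers busy in one cycle
    sums, levels = zip(*(adder_tree(row) for row in products))
    n_tx = len(W)
    x_hat = [complex(sums[t], sums[n_tx + t]) for t in range(n_tx)]
    return x_hat, n_mult, levels[0]


W = [[complex(t + 1, r - 2) for r in range(5)] for t in range(3)]  # arbitrary 3 x 5 weights
y = [complex(0.5 * r, 1 - 0.25 * r) for r in range(5)]             # arbitrary received samples
x_hat, n_mult, levels = reconstruct(W, y)
latency_ns = (1 + levels) * CLOCK_NS  # one multiply cycle plus the adder-tree cycles
print(n_mult, levels, latency_ns)     # 60 4 50
```

The real-valued expansion is chosen because it maps directly onto the count of 60 real multiplications; a hardware design could also use other complex-multiplication structures.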
The major challenge is that the weight matrices differ from carrier to carrier
and must be exchanged rapidly. Figure 11.9 illustrates this for the weight
$W_{21}^{n}$. Approximately once per frame, the frequency response
of each weight is written as a vector by the DSP into a dual-port RAM
assembled from the dedicated hardware memory blocks in the FPGA. The
second port of the dual-port RAM is connected to the dedicated multiplier
that applies this weight. The address of the weight within the vector is
then incremented at the 100 MHz clock, synchronously with the increasing
subcarrier index n of the incoming signals (which leave the FFT unit
subcarrier by subcarrier). In this way, the matrix-vector multiplication
pipeline is reused for all subcarriers. The number of multipliers needed
depends only on the number of antennas, whereas the memory requirement
additionally scales with the number of subcarriers.
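The weight exchange can be illustrated with a small behavioral model in Python. This is a sketch under stated assumptions, not the FPGA design itself. The class `DualPortWeightRAM` and the functions `process_symbol` and `resources` are hypothetical names. Each RAM holds one real weight of the real-valued expansion for every subcarrier, which gives 60 RAMs for 3 Tx and 5 Rx. The subcarrier count of 8 and the random weights are arbitrary. The DSP side writes a whole frequency response through one port, while the pipeline side reads one entry per clock through the other, using the subcarrier index n as the read address.

```python
import random


class DualPortWeightRAM:
    """Hypothetical model of one weight RAM: the DSP writes a whole frequency
    response through the outer port (about once per frame), and the pipeline
    reads one entry per clock cycle through the inner port."""

    def __init__(self, n_subcarriers):
        self.mem = [0.0] * n_subcarriers

    def write_vector(self, weights):  # outer port (DSP side)
        if len(weights) != len(self.mem):
            raise ValueError("one weight per subcarrier expected")
        self.mem = list(weights)

    def read(self, address):  # inner port (pipeline side)
        return self.mem[address]


def resources(n_tx, n_rx, n_sc):
    """Real multipliers depend only on the antenna counts; memory also scales with n_sc."""
    multipliers = (2 * n_tx) * (2 * n_rx)
    memory_words = multipliers * n_sc
    return multipliers, memory_words


def process_symbol(rams, fft_output):
    """Reuse one multiplier grid for all subcarriers of an OFDM symbol.

    rams[i][k] stores real weight (i, k) for every subcarrier; fft_output[n] is
    the real input vector of subcarrier n, arriving one subcarrier per clock.
    """
    out = []
    for n, x in enumerate(fft_output):  # read address counts with subcarrier index n
        out.append([sum(row[k].read(n) * x[k] for k in range(len(x))) for row in rams])
    return out


# Usage with hypothetical sizes: 3 Tx, 5 Rx, 8 subcarriers, random weights
N_TX, N_RX, N_SC = 3, 5, 8
rng = random.Random(0)
weights = [[[rng.uniform(-1, 1) for _ in range(N_SC)] for _ in range(2 * N_RX)]
           for _ in range(2 * N_TX)]  # weights[i][k][n]
rams = [[DualPortWeightRAM(N_SC) for _ in range(2 * N_RX)] for _ in range(2 * N_TX)]
for i in range(2 * N_TX):
    for k in range(2 * N_RX):
        rams[i][k].write_vector(weights[i][k])  # DSP update, e.g. once per frame
fft_output = [[rng.uniform(-1, 1) for _ in range(2 * N_RX)] for _ in range(N_SC)]
y = process_symbol(rams, fft_output)
print(resources(N_TX, N_RX, N_SC))  # (60, 480)
```

Changing `N_SC` leaves the multiplier count at 60 and only scales the memory, which mirrors the scaling stated in the text.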
The rapid exchange of weights is a key idea that enables MIMO-OFDM to be
implemented on current FPGAs at 100 MHz bandwidth, as demonstrated by the
experimental system. Even higher bandwidths may be achievable by applying the
same technique in an ASIC clocked at a correspondingly higher speed. After the
matrix-vector multiplication unit, the spatially multiplexed data streams
appear separated from each other. The subsequent signal processing can
therefore be organized in parallel using conventional pipelined OFDM receiver
processing chains. System integration is straightforward (see Figure 11.11).
To implement this rapid exchange, we have organized the weight matrices for
all carriers in register pages assembled from 60 dual-port RAM blocks. Each
RAM block contains the weights of all subcarriers for one input-output pair
and is located next to the corresponding multiplier. The DSP writes the
weights into the RAM blocks once per 2 ms frame using the outer port. Via the
inner port, the matrix-vector pipeline reads out the weight for the current
subcarrier and input-output pair once per sample clock cycle. In the next
cycle, it switches to the weight for the next subcarrier by incrementing the
weight address. This switching is performed simultaneously for all 60
weights. Hence, once per OFDM symbol, all weight matrices
* The FFT and IFFT units are implemented in parallel for all antennas using an FPGA core
module from XILINX. The pipelined units need about 1.5 OFDM symbol durations (1.2 µs) to
provide their output.
4190_book.fm Page 334 Tuesday, February 21, 2006 9:14 AM