Figure 2.1. State-of-the-art commercial system-on-chip baseband architecture
2.2. Role of microelectronics
The tremendous improvement in mobile communication has to be considered alongside the progress in the microelectronic industry, which started with the invention of the transistor in the late 1940s (Shockley 1949), coincidentally at about the same time as Shannon published his famous article (Shannon 1948). In the following decades, the semiconductor industry achieved an exponential increase in the number of transistors on a single chip, known as Moore’s law (Moore 1965), which is a further key driver of our information society. In today’s semiconductor technologies, tens of millions of transistors can be integrated on 1 mm² of silicon. For many decades, improvements in silicon process technology provided better performance, lower cost per gate, higher integration density and lower power consumption. However, we have reached a point where Moore’s law is slowing down. The reasons for this slowdown include the immense cost of new technologies and of designing in these technologies, diminishing performance gains, increasing interconnect delay and power/power density challenges, to name just a few.
The question is what contribution microelectronics has made in the past to improving throughput and implementation efficiency in channel decoding. As a case study, we consider two Turbo code decoders. Both decoders were designed with the same design methodology and have a very similar state-of-the-art architecture that exploits spatial parallelism and processes several sub-blocks on corresponding Maximum a Posteriori (MAP) decoders in parallel:
– the first decoder is a fully UMTS-compliant Turbo decoder implemented in a 180 nm technology. Under worst-case Process, Voltage and Temperature (PVT) conditions, a maximum frequency of 166 MHz is achieved, which results in a throughput of 71 Mbit/s at 6 decoding iterations. The total area is 30 mm² (Thul et al. 2005);
– the second decoder is a fully LTE-compliant Turbo decoder implemented in a 65 nm technology, achieving a maximum frequency of 450 MHz under worst-case PVT conditions. It yields a throughput of 2.15 Gbit/s at 6 decoding iterations and occupies an area of 7.7 mm² (Ilnseher et al. 2012).
The 180 nm and 65 nm technologies are three technology nodes apart. We observe a throughput increase of 30×, although the improvement in frequency, which is limited by the critical path inside the MAP decoder, is only 3×. The improvement in area efficiency (throughput/area) is 118×. Hence, progress in microelectronics contributed to a huge improvement in area efficiency but much less to the frequency increase and, thus, to the throughput increase. The throughput increase mainly originates from code design, i.e. conflict-free Turbo code interleavers that enable efficient implementation with a high degree of parallelism, and from advanced algorithmic and architectural features, such as next-iteration initialization, an optimized radix-4 kernel, re-computation and advanced normalization to reduce internal bit widths. We see that microelectronics could not keep up with the increased requirements coming from communication systems. Thus, the design of communication systems is no longer just a matter of spectral efficiency or BER/FER. When it comes to implementation, channel coding requires a cross-layer approach covering information theory, algorithms, parallel hardware architectures and semiconductor technology to achieve excellent communications performance, high throughput, low latency, low power consumption and high energy and area efficiency (Scholl et al. 2016; Kestel et al. 2018a).
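The ratios quoted in this case study can be rechecked with a few lines of arithmetic. The following Python sketch uses only the frequency, throughput and area figures given above for the two decoders; the dictionary and function names are illustrative and do not come from the cited papers.

```python
# Figures quoted in the text for the two Turbo decoders (6 iterations,
# worst-case PVT conditions).
umts_180nm = {"freq_mhz": 166.0, "throughput_mbps": 71.0, "area_mm2": 30.0}
lte_65nm = {"freq_mhz": 450.0, "throughput_mbps": 2150.0, "area_mm2": 7.7}


def area_efficiency(decoder):
    """Throughput per unit area in Mbit/s per mm^2."""
    return decoder["throughput_mbps"] / decoder["area_mm2"]


freq_gain = lte_65nm["freq_mhz"] / umts_180nm["freq_mhz"]
throughput_gain = lte_65nm["throughput_mbps"] / umts_180nm["throughput_mbps"]
area_eff_gain = area_efficiency(lte_65nm) / area_efficiency(umts_180nm)

print(f"frequency gain:       {freq_gain:.1f}x")
print(f"throughput gain:      {throughput_gain:.1f}x")
print(f"area efficiency gain: {area_eff_gain:.0f}x")
```

The frequency ratio comes out slightly below the rounded 3× quoted in the text, while the throughput and area-efficiency ratios match the rounded 30× and 118×.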
2.3. Towards 1 Tbit/s throughput decoders
A high degree of parallelism is a must for high-throughput decoders approaching 1 Tbit/s. The achievable parallelism strongly depends on the properties of the decoding algorithms. For example, sub-functions of a decoding algorithm that have no mutual data dependencies can easily be parallelized by spatial parallelism. This is the case for the Belief Propagation (BP) algorithm to decode LDPC codes. In the BP algorithm, all check nodes can be processed independently of each other. The same applies to the variable nodes. The situation is different for the MAP algorithm used in Turbo code decoding, where the calculation of a specific trellis step depends on the result of the previous trellis step. This results in a sequential behavior, and the different trellis steps cannot be calculated in parallel. Such data dependencies also exist in iterative decoding algorithms between the various iterations. Hence, the bottleneck for high throughput is the part of an algorithm that cannot be parallelized. This part strongly limits the overall throughput, a relationship known as Amdahl’s law (Amdahl 2013). Other important features for efficient high-throughput implementations are locality and regularity to minimize interconnect. Interconnect can contribute substantially to area, delay and energy consumption. Let us assume that 10 mm² is a feasible size for an FEC IP block. A signal has to travel at least 7 mm if it has to be transmitted from one corner to the other in this IP block. It will take three clock cycles in a 14 nm technology, assuming a frequency of 1 GHz. This interconnect delay largely decreases the throughput and increases power. Thus, data transfers can be as important as calculations.
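The limit imposed by the non-parallelizable part of an algorithm, Amdahl’s law as mentioned above, can be illustrated with a short Python sketch. The parallel fraction of 0.95 is an assumed value chosen for illustration only; it is not a measured property of any decoder discussed here.

```python
def amdahl_speedup(parallel_fraction, num_units):
    """Speedup predicted by Amdahl's law.

    parallel_fraction: share of the work that can be parallelized (0..1).
    num_units: number of parallel processing units.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / num_units)


# Assumed, illustrative parallel fraction (not taken from the text).
p = 0.95
for n in (1, 8, 64, 1024):
    print(f"{n:5d} units -> speedup {amdahl_speedup(p, n):.2f}x")

# With unlimited units, the speedup is bounded by 1 / (1 - p).
print(f"upper bound: {1.0 / (1.0 - p):.0f}x")
```

Even with an arbitrarily large number of processing units, the speedup under this assumption stays below 1/(1 − p), which is why the sequential part of a decoding algorithm dominates at very high throughput.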
Although channel decoding algorithms for advanced codes such as Turbo and LDPC codes are mainly data-flow dominated, they imply irregularity and restricted locality, since efficient channel coding for these coding schemes is grounded on some randomness: namely, the interleaver for Turbo codes and the Tanner graph for LDPC codes. Any regularity and locality in these structures improves the implementation efficiency but has a negative impact on the communications performance. Hence, there is a fundamental discrepancy between information theory, which relies on this randomness, and efficient high-throughput implementation, which demands regularity and locality. In Table 2.1, important implementation properties for the different code classes are summarized.
Table 2.1. Implementation properties of various coding schemes
| Code | Dec. Algorithms | Parallel vs. Serial | Locality | Compute Kernels | Transfers vs. Compute |
|---|---|---|---|---|---|
| Turbo | MAP | Serial/iterative | Low (interleaver) | Add-Compare-Select | Compute dominated |
| LDPC | Belief propagation | Parallel/iterative | Low (Tanner graph) | Min-Sum/add | Transfer dominated |
| Polar | Successive cancelation/list | Serial | High | Min-Sum/add/sorting | Balanced |
Functional parallelism/pipelining is an efficient technique to speed up algorithms with data dependencies. In iterative decoding algorithms, we can “unroll” the iterations and insert buffers between the different pipeline stages. In this way, the various iterations can be calculated in parallel, but on different data sets. Spatial and functional parallelization are implementation techniques only, i.e. they do not change the algorithmic behavior. We can also modify the decoding algorithm itself to enable parallelism. Let us again consider the MAP algorithm. The data dependencies in the forward/backward recursion can be broken up by exploiting the fact that the trellis has a finite memory due to the underlying constraint length of the code. This property enables the splitting of the trellis into sub-trellises that can be processed independently of each other. However, some acquisition steps are mandatory at the borders of the sub-trellises to obtain the correct probabilities. The length of the sub-trellises and the corresponding acquisition length impact the communications performance. In this case, increased parallelism can have a negative impact on the communications performance.
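The splitting of the trellis into sub-trellises with acquisition steps can be sketched as an index computation. The following Python example is a minimal illustration, not the scheme of any specific decoder: the function name `split_trellis`, the dictionary keys and the chosen block length, number of sub-trellises and acquisition length are all assumptions made for this sketch.

```python
def split_trellis(num_steps, num_windows, acq_len):
    """Split trellis steps [0, num_steps) into contiguous sub-trellises.

    For each sub-trellis, the forward recursion starts acq_len steps before
    its first step and the backward recursion starts acq_len steps after its
    last step, clipped to the trellis boundaries.
    """
    base, extra = divmod(num_steps, num_windows)
    windows = []
    start = 0
    for w in range(num_windows):
        # Spread the remainder over the first sub-trellises.
        length = base + (1 if w < extra else 0)
        end = start + length
        windows.append({
            "start": start,                                # first payload step
            "end": end,                                    # one past last payload step
            "fwd_acq_start": max(0, start - acq_len),      # forward warm-up begin
            "bwd_acq_start": min(num_steps, end + acq_len) # backward warm-up begin
        })
        start = end
    return windows


# Assumed example values: 6144 trellis steps, 8 sub-trellises, 32 acquisition steps.
windows = split_trellis(6144, 8, 32)
for w in windows[:2]:
    print(w)

# Recursion steps actually processed (payload plus acquisition) vs. payload only.
processed = sum(w["bwd_acq_start"] - w["fwd_acq_start"] for w in windows)
print(f"overhead: {processed / 6144 - 1:.1%}")
```

Each sub-trellis can then be assigned to its own MAP decoder. Shorter sub-trellises increase parallelism, but they raise the relative acquisition overhead, and a shorter acquisition length saves work at the cost of less reliable state metrics at the borders, which is the trade-off against communications performance described above.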