
Liliana Andrade - Multi-Processor System-on-Chip 1



Multi-Processor System-on-Chip 1: summary and description


A Multi-Processor System-on-Chip (MPSoC) is the key component for complex applications. These applications put huge pressure on memory, communication devices and computing units. This book, presented in two volumes – Architectures and Applications – therefore celebrates the 20th anniversary of MPSoC, an interdisciplinary forum that focuses on multi-core and multi-processor hardware and software systems. It is this interdisciplinarity which has led to MPSoC bringing together experts in these fields from around the world, over the last two decades.

Multi-Processor System-on-Chip 1 covers the key components of MPSoC: processors, memory, interconnect and interfaces. It describes advanced features of these components and technologies to build efficient MPSoC architectures. All the main components are detailed: the use of memory and its technology, communication support and consistency, and specific processor architectures for general purposes or for dedicated applications.

Multi-Processor System-on-Chip 1: excerpt


Activation functions are used in neural networks to transform data by performing some nonlinear mapping. Examples are rectified linear units (ReLU), sigmoid and hyperbolic tangent (TanH). The activation functions operate on a single data value and produce a single result. Hence, for an activation layer, the size of the output map is equal to the size of the input map.
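The element-wise nature of activation layers can be illustrated with a small sketch. The helper names (`relu`, `sigmoid`, `apply_activation`) and the sample values are illustrative assumptions, not taken from the book; the point is only that each output element depends on exactly one input element, so the output map has the same shape as the input map.

```python
import math

def relu(x):
    # Rectified linear unit: negative values are clamped to zero.
    return max(0.0, x)

def sigmoid(x):
    # Logistic sigmoid, mapping any real value into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def apply_activation(fn, feature_map):
    """Apply a scalar activation to every element of a 2D map (list of rows)."""
    return [[fn(v) for v in row] for row in feature_map]

# Hypothetical 2x3 input map.
input_map = [[-1.5, 0.0, 2.0],
             [0.5, -0.25, 3.0]]

relu_map = apply_activation(relu, input_map)
tanh_map = apply_activation(math.tanh, input_map)
sigmoid_map = apply_activation(sigmoid, input_map)
```

All three result maps have two rows of three elements, like `input_map`.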

Figure 1.4. Example pooling operations: max pooling and average pooling

Neural networks may also have pooling layers that transform an input map into a smaller output map by calculating single output values for (small) regions of the input data. Figure 1.4 shows two examples: max pooling and average pooling. Effectively, the pooling layers downsample the data in the width and height dimensions. The depth of the output map is the same as the depth of the input map.
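As a rough sketch of the two pooling variants, the code below pools one channel of a map over non-overlapping windows. The 2x2 window, the stride equal to the window size and the sample values are assumptions chosen for illustration; they are not read from Figure 1.4. For a map with several channels, the same function would be applied to each channel separately, which is why the depth is unchanged.

```python
def pool2d(channel, size, reduce_fn):
    """Pool one 2D channel (list of rows) over non-overlapping size x size windows."""
    height, width = len(channel), len(channel[0])
    out = []
    for r in range(0, height - size + 1, size):
        out_row = []
        for c in range(0, width - size + 1, size):
            # Collect the values of the current window, then reduce them to one value.
            window = [channel[r + i][c + j] for i in range(size) for j in range(size)]
            out_row.append(reduce_fn(window))
        out.append(out_row)
    return out

def average(values):
    return sum(values) / len(values)

# Hypothetical 4x4 channel.
channel = [[1, 3, 2, 4],
           [5, 7, 6, 8],
           [9, 2, 1, 0],
           [3, 4, 5, 6]]

max_pooled = pool2d(channel, 2, max)      # [[7, 8], [9, 6]]
avg_pooled = pool2d(channel, 2, average)  # [[4.0, 5.0], [4.5, 3.0]]
```

The 4x4 input is reduced to a 2x2 output in both cases, halving the width and the height.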

1.3.1.2. Implementation requirements

To obtain sufficiently accurate results when implementing machine learning inference, appropriate data types must be used. During the training phase, data and weights are typically represented as 32-bit floating-point values. The common practice for deploying models for inference in embedded systems is to work with integer or fixed-point representations of quantized data and weights (Jacob et al. 2017). The potential impact of quantization errors can be taken into account in the training phase to avoid a notable effect on model performance. Elements of input maps, output maps and weight kernels can typically be represented using 16-bit or smaller data types. For example, in a voice application, data samples are typically represented by 16-bit data types. In image or video applications, a 12-bit or smaller data type is often sufficient for representing the input samples. Precision requirements can differ per layer of the neural network. Special attention should be paid to the data types of the accumulators that are used to sum up the partial results when performing dot-product operations. These accumulators should be wide enough to avoid overflow of the (intermediate) results of the dot-product operations performed on the (quantized) weights and input samples.
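The two concerns in this paragraph, quantization and accumulator width, can be made concrete with a short sketch. The `quantize` helper uses a simple symmetric linear scheme, which is an assumption made here for brevity and is not necessarily the scheme of Jacob et al.; `accumulator_bits` computes a worst-case bound for signed two's-complement operands. Both names and all sample values are hypothetical.

```python
def quantize(values, bits):
    """Symmetric linear quantization of floats to signed integers (illustrative scheme)."""
    qmax = (1 << (bits - 1)) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs else 1.0
    # Round to the nearest integer step and clip to the representable range.
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale

def accumulator_bits(num_terms, data_bits, weight_bits):
    """Signed accumulator width that cannot overflow for num_terms worst-case products."""
    worst = num_terms * (1 << (data_bits - 1)) * (1 << (weight_bits - 1))
    return worst.bit_length() + 1  # +1 for the sign bit

weights = [0.5, -1.0, 0.25]
weights_q, scale = quantize(weights, 8)

single_product = accumulator_bits(1, 16, 16)     # one 16-bit x 16-bit product
long_dot_product = accumulator_bits(256, 16, 16) # 256 such products summed
```

Under these worst-case assumptions, a single 16-bit by 16-bit product already needs 32 bits, and summing 256 of them needs 40 bits. Real data rarely reaches the worst case, so the bound is conservative.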

Memory requirements for low/mid-end machine learning inference are typically modest, thanks to limited input data rates and the use of neural networks with limited complexity. Input and output maps are often of limited size, i.e. a few tens of kB or less, and the number and size of the weight kernels are also relatively small. Using the smallest possible data types for input maps, output maps and weight kernels helps to reduce memory requirements.
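The effect of the data type on memory footprint is simple arithmetic, sketched below. The layer dimensions (a 32x32 map with 16 channels, and 32 kernels of size 3x3x16) are hypothetical and only serve to show the scaling; `tensor_bytes` is an illustrative helper.

```python
def tensor_bytes(num_elements, bits_per_element):
    """Storage in bytes for densely packed elements, rounded up to whole bytes."""
    return (num_elements * bits_per_element + 7) // 8

# Hypothetical layer: 32x32 input map with 16 channels, 32 kernels of 3x3x16 weights.
map_elements = 32 * 32 * 16
weight_elements = 32 * 3 * 3 * 16

map_bytes = {bits: tensor_bytes(map_elements, bits) for bits in (8, 16, 32)}
weight_bytes = {bits: tensor_bytes(weight_elements, bits) for bits in (8, 16, 32)}
```

For this example, the input map takes 16384 bytes with 8-bit elements, against 65536 bytes with 32-bit elements, and the weights shrink from 18432 bytes to 4608 bytes.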

In summary, low/mid-end machine learning inference applications require the following types of processing:

– various types of pre-processing and feature extraction, often with DSP-intensive computations;

– neural network processing, with the dot-product operation as a dominant computation and regular access patterns on multidimensional data. Additional requirements come from the use of scalar activation functions and pooling operations working on 2D data;

– decision-making, which is performed after the neural network processing and is more control-oriented.

The different types of processing may be implemented using a heterogeneous multi-processor architecture, with different types of processors to satisfy the different processing requirements. However, for low/mid-end machine learning inference, the total compute requirements are often limited and can be handled by a single processor running at a reasonable frequency, provided it has the right capabilities. As we discussed above, the use of a single processor eliminates the area and communication overhead associated with multi-processor architectures. It also simplifies software development, as a single tool chain can be used for the complete application. However, it requires that the processor performs DSP, neural network processing and control processing with excellent cycle efficiency. In the next section, we will discuss in more detail the capabilities of a programmable processor that enable such cycle efficiency.

For many IoT edge devices, low cost is a key requirement. Therefore, making IoT edge devices smarter by adding machine learning inference must be cost-effective. Silicon area is the main contributor to cost, particularly for high-volume products, so it is important that the processor implementing the machine learning inference minimizes the logic area and uses small memories. In addition, small code size is key to limiting the area of the instruction memory.

Many IoT edge devices are battery-operated and have a tight power budget. This demands a power-efficient processor, measured in µW/MHz, as well as excellent cycle efficiency so that the processor can be run at a low frequency. Low power consumption is particularly important for IoT edge devices that perform always-on functions such as:

– smart speakers, smartphones, etc. with always-on voice command functions that are “always listening”;

– camera-based devices that are “always watching”, performing, for example, face detection or gesture recognition;

– health and fitness monitoring devices that are “always sensing”.

Such devices typically apply smart techniques to reduce power consumption. For example, an “always listening” device may sample the microphone signal and use simple voice detection techniques to check whether anyone is speaking at all. It then applies the more compute-intensive machine learning inference for recognizing voice commands only when voice activity is detected. A processor must limit power consumption in each of these different states, i.e. voice detection and voice command recognition. For this purpose, it must offer various power management features, including effective sleep modes and power-down modes.
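The two-state behavior described above can be sketched as a gating loop. The use of mean frame energy against a fixed threshold is an assumption standing in for "simple voice detection techniques", and `recognize` is a hypothetical placeholder for the compute-intensive inference step; neither is taken from the book.

```python
def frame_energy(samples):
    """Mean squared amplitude of one frame of microphone samples."""
    return sum(s * s for s in samples) / len(samples)

def process_frames(frames, energy_threshold, recognize):
    """Run the expensive recognizer only on frames that pass the cheap voice check."""
    results = []
    for frame in frames:
        if frame_energy(frame) < energy_threshold:
            continue  # remain in the low-power voice detection state
        results.append(recognize(frame))  # voice command recognition state
    return results

# Hypothetical frames: one near-silent, one with voice-like amplitude.
frames = [[0, 1, -1, 0],
          [100, -120, 90, -80]]

calls = []
def fake_recognizer(frame):
    # Stand-in for machine learning inference; records that it was invoked.
    calls.append(frame)
    return "command"

commands = process_frames(frames, energy_threshold=100, recognize=fake_recognizer)
```

Only the second frame reaches the recognizer, so the expensive path runs once for two frames of input.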

1.3.2. Processor capabilities for low-power machine learning inference

Selecting the right processor is key to achieving high efficiency for the implementation of low/mid-end machine learning inference. In this section, we will describe a number of key capabilities of the DSP-enhanced ARC EM9D processor and illustrate how they can be used to implement neural network processing efficiently.

As described earlier, the dot-product operation on input samples and weights is a dominant computation. The key primitive for implementing the dot product is the multiply-accumulate (MAC) operation, which can be used to incrementally sum up the products of input samples and weights. Vectorization of the MAC operations is an important way to increase the efficiency of neural network processing. Figure 1.5 illustrates two types of vector MAC instructions of the ARC EM9D processor.

Figure 1.5. Two types of vector MAC instructions of the ARC EM9D processor

Both of these vector MAC instructions operate on 2×16-bit vector operands. The DMAC instruction on the left is a dual-MAC that can be used to implement a dot product, with A1 and A2 being two neighboring samples from the input map and B1 and B2 being two neighboring weights from the weight kernel. The ARC EM9D processor supports 32-bit accumulators for which an additional eight guard bits can be enabled to avoid overflow. The DMAC operation can effectively be used for weight kernels with an even width, reducing the number of MAC instructions by a factor of two compared to a scalar implementation. However, for weight kernels with an odd width, this instruction is less effective. In such cases, the VMAC instruction, shown on the right in Figure 1.5, can be used to perform two dot-product operations in parallel, accumulating intermediate results into two accumulators. When the weight kernel “moves” over the input map with a stride of one, A1 and A2 are two neighboring samples from the input map, and B1 and B2 hold the same weight, which is applied to both A1 and A2.
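The difference between the two instructions can be illustrated with a behavioral model in Python. This is a sketch based only on the description above: the functions `dmac` and `vmac` are hypothetical stand-ins, not the actual ARC EM9D instruction semantics, and accumulator width, guard bits and saturation are not modeled because Python integers are unbounded.

```python
def dmac(acc, a, b):
    """Dual MAC: a single accumulator receives both products of the 2-element vectors."""
    return acc + a[0] * b[0] + a[1] * b[1]

def vmac(accs, a, b):
    """Vector MAC: each of the two accumulators receives the product of its own lane."""
    return (accs[0] + a[0] * b[0], accs[1] + a[1] * b[1])

def dot_even_kernel(samples, weights):
    """Dot product for an even-width kernel, using one dmac per pair of taps."""
    assert len(weights) % 2 == 0 and len(samples) == len(weights)
    acc = 0
    for k in range(0, len(weights), 2):
        acc = dmac(acc, samples[k:k + 2], weights[k:k + 2])
    return acc

def two_outputs_stride_one(samples, weights):
    """Two neighboring outputs of a stride-1 kernel, using one vmac per tap."""
    assert len(samples) >= len(weights) + 1
    accs = (0, 0)
    for k, w in enumerate(weights):
        # Lanes hold neighboring samples; the same weight is applied to both.
        accs = vmac(accs, (samples[k], samples[k + 1]), (w, w))
    return accs

row = [1, 2, 3, 4, 5]  # hypothetical input samples
even_result = dot_even_kernel(row[:4], [1, -1, 2, 0])  # 2 dmac calls for 4 taps
odd_results = two_outputs_stride_one(row, [1, 2, -1])  # 3 vmac calls for 2 outputs
```

In this model, the width-4 kernel needs two `dmac` calls instead of four scalar MACs, and the width-3 kernel produces two output values with three `vmac` calls instead of six scalar MACs, so both cases halve the instruction count.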