The performance data for the RISC-V processor published in (Croome 2018) reports a total of 1.5 Mcycles for executing the CIFAR-10 graph on a highly parallel 8-core RISC-V architecture. To estimate the total number of cycles on a single RISC-V core, we consider that the performance is dominated by the cycles spent on the 5x5 convolutions, which constitute more than 98% of the compute operations in this graph. For these 5x5 convolutions, (Croome 2018) reports a speed-up from a 1-core system to an 8-core system of 18.5/2.2 = 8.2. Hence, a reasonable estimate for the total number of cycles on a single RISC-V core is 1.5 x 8.2 = 12.3 Mcycles.
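As a quick sanity check, this estimate can be reproduced from the figures quoted above. The sketch below is illustrative only; the variable names are ours, and the speed-up factor is used as reported rather than rederived.

```python
# Reproduce the single-core cycle estimate for the RISC-V core (processor B),
# using only the figures quoted above from (Croome 2018).

cycles_8core_mcycles = 1.5    # CIFAR-10 graph total on the 8-core RISC-V system
reported_speedup_8core = 8.2  # reported 1-core to 8-core speed-up for the 5x5 convolutions

# The 5x5 convolutions account for more than 98% of the compute operations,
# so scaling the 8-core total by their speed-up approximates the 1-core total.
cycles_1core_mcycles = cycles_8core_mcycles * reported_speedup_8core
print(f"Estimated single-core total: {cycles_1core_mcycles:.1f} Mcycles")  # ~12.3
```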
Table 1.4. Performance data for the CIFAR-10 CNN graph

# | Layer type      | ARC EM9D [Mcycles] | Processor A [Mcycles] | Processor B (RISC-V ISA) [Mcycles]
0 | Permute         | 0.01               | –                     | –
1 | Convolution     | 1.63               | 6.78                  | –
2 | Max Pooling     | 0.14               | 0.34                  | –
3 | Convolution     | 3.46               | 9.25                  | –
4 | Avg Pooling     | 0.09               | 0.09                  | –
5 | Convolution     | 1.76               | 4.88                  | –
6 | Avg Pooling     | 0.07               | 0.04                  | –
7 | Fully-connected | 0.03               | 0.02                  | –
8 | Fully-connected | 0.001              |                       | –
  | Total           | 7.2                | 21.4                  | 12.3
From Table 1.4, we conclude that the ARC EM9D processor spends about 3x fewer cycles than processor A and about 1.7x fewer cycles than the RISC-V core (processor B) for the same machine learning inference task, without using any dedicated accelerators. Thanks to this cycle efficiency, the ARC EM9D processor can be clocked at a low frequency, which helps to save power in a smart IoT edge device.
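The ratios above follow directly from the table totals. The short sketch below recomputes them and adds a purely illustrative clock-frequency calculation; the 10 inferences/s target is our assumption, not a figure from this chapter.

```python
# Recompute the cycle-count ratios from the totals in Table 1.4.
arc_em9d_mcycles = 7.2
processor_a_mcycles = 21.4
processor_b_mcycles = 12.3  # single-core RISC-V estimate

print(f"Processor A vs. ARC EM9D: {processor_a_mcycles / arc_em9d_mcycles:.1f}x cycles")  # ~3.0x
print(f"Processor B vs. ARC EM9D: {processor_b_mcycles / arc_em9d_mcycles:.1f}x cycles")  # ~1.7x

# Illustration of the low-clock-frequency argument: fewer cycles per inference
# translate directly into a lower required clock rate for a given frame rate.
# The 10 inferences/s target below is an assumed example value.
target_inferences_per_s = 10
required_clock_mhz = arc_em9d_mcycles * target_inferences_per_s
print(f"Clock needed for {target_inferences_per_s} inferences/s: ~{required_clock_mhz:.0f} MHz")
```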
Smart IoT edge devices that interact intelligently with their users are appearing in many application areas. These devices have diverse compute requirements, including a mixture of control processing, DSP and machine learning. Versatile processors are required to efficiently execute these different types of workloads. Furthermore, these processors must allow for easy customization to improve their efficiency for a specific application. Configurability and extensibility are two key mechanisms that provide such customization.
Increasingly, IoT edge devices apply machine learning technology for processing captured sensor data, so that smart actions can be taken based on recognized patterns. We presented key processor features and a software library for the efficient implementation of low/mid-end machine learning inference. More specifically, we highlighted several processor capabilities, such as vector MAC instructions and XY memory with advanced AGUs, that are key to the efficient implementation of machine learning inference. The ARC EM9D processor is a universal processor for low-power IoT applications which is both configurable and extensible. The complete and highly optimized embARC MLI library makes effective use of the ARC EM9D processor to efficiently support a wide range of low/mid-end machine learning applications. We demonstrated this efficiency with excellent results for the CIFAR-10 benchmark.
Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G., Chen, J., Chen, J., Chen, Z., Chrzanowski, M., Coates, A., Diamos, G., Ding, K., Du, N., Elsen, E., Engel, J., Fang, W., Fan, L., Fougner, C., Gao, L., Gong, C., Hannun, A., Han, T., Johannes, L.V., Jiang, B., Ju, C., Jun, B., LeGresley, P., Lin, L., Liu, J., Liu, Y., Li, W., Li, X., Ma, D., Narang, S., Ng, A., Ozair, S., Peng, Y., Prenger, R., Qian, S., Quan, Z., Raiman, J., Rao, V., Satheesh, S., Seetapun, D., Sengupta, S., Srinet, K., Sriram, A., Tang, H., Tang, L., Wang, C., Wang, J., Wang, K., Wang, Y., Wang, Z., Wang, Z., Wu, S., Wei, L., Xiao, B., Xie, W., Xie, Y., Yogatama, D., Yuan, B., Zhan, J., Zhu, Z. (2016). Deep speech 2: End-to-end speech recognition in English and Mandarin. Proceedings of the 33rd International Conference on Machine Learning – Volume 48, ICML-16, 173–182.
Croome, M. (2018). Using RISC-V in high computing, ultra-low power, programmable circuits for inference on battery operated edge devices [Online]. Available at: https://content.riscv.org/wp-content/uploads/2018/07/Shanghai-1325_GreenWaves_Shanghai-2018-MC-V2.pdf.
Dutt, N. and Choi, K. (2003). Configurable processors for embedded computing. IEEE Computer, 36(1), 120–123.
embARC Open Software Platform (2019). Available at: https://embarc.org/.
Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A.G., Adam, H., and Kalenichenko, D. (2017). Quantization and training of neural networks for efficient integer-arithmetic-only inference. Computing Research Repository. Available at: http://arxiv.org/abs/1712.05877.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R.B., Guadarrama, S., and Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. Computing Research Repository. Available at: https://arxiv.org/abs/1408.5093.
Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto.
Lai, L., Suda, N., and Chandra, V. (2018). CMSIS-NN: Efficient neural network kernels for Arm Cortex-M CPUs. Computing Research Repository. Available at: http://arxiv.org/abs/1801.06601.
Petrov-Savchenko, A. and van der Wolf, P. (2018). Get smart with NB-IoT: Efficient low-cost implementation of NB-IoT for smart applications. Technical paper, Synopsys [Online]. Available at: https://www.synopsys.com/dw/doc.php/wp/NB_IoT_for_Smart_Applications.pdf.
Q Number Format (2019). Available at: https://en.wikipedia.org/wiki/Q_(number_format).
Samuel, A.L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3), 210–229.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015). Going deeper with convolutions. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1–9.
For a color version of all figures in this book, see www.iste.co.uk/andrade/multi1.zip.
2
A Qualitative Approach to Many-core Architecture
Benoît DUPONT DE DINECHIN
Kalray S.A., Grenoble, France
We present the design of the Kalray third-generation MPPA many-core processor, whose objectives are to combine the performance scalability of GPGPUs, the energy efficiency of DSP cores and the I/O capabilities of FPGA devices. These objectives are motivated by the consolidation of high-performance and high-integrity functions on a single computing platform for autonomous vehicles. High-performance computing functions, represented by deep learning inference and computer vision, need to execute under soft real-time constraints. High-integrity functions are developed under model-based design, and must meet hard real-time constraints. Finally, the third-generation MPPA processor integrates a hardware root of trust and implements a security architecture, in order to support trusted execution environments.