2016: DianNao Family: Energy-Efficient Hardware Accelerators for Machine Learning

Tags: pr

  • This was published in Communications of the ACM
  • https://cacm.acm.org/

  • The reference is
Chen Y., Chen T., Xu Z., et al.
DianNao Family: Energy-Efficient Hardware Accelerators for Machine Learning[J].
Communications of the ACM, 2016, 59(11): 105-112
  • https://cacm.acm.org/magazines/2016/11/209123-diannao-family/fulltext
  • Sure enough, found it

  • Damn, I downloaded it. How cool is that?
The original version of this paper is entitled “DianNao: A Small-Footprint, High-Throughput Accelerator for Ubiquitous Machine Learning” and was published in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) 49, 4 (March 2014), ACM, New York, NY, 269-284.

Abstract

  • ML is pervasive
    • a broad range of applications
      • a broad range of systems (from embedded systems to data centers)

  • computing is moving
    • toward heterogeneous multi-cores
    • a mix of cores and hardware accelerators
  • designing hardware accelerators for ML
    • can achieve high efficiency and broad application scope

Second paragraph

  • efficient computational primitives
    • are important for a hardware accelerator,
  • but inefficient memory transfers can
    • potentially void the throughput, energy, or cost advantages of accelerators,
  • an Amdahl’s law effect
  • so memory transfers must become a first-order concern,

  • just like in processors,
    • rather than an element factored into accelerator design as a second step

  • a series of hardware accelerators
    • designed for ML (neural networks),
    • studying the impact of memory on accelerator design, performance, and energy.

  • on representative neural network layers,
  • a 450.65x speedup over a GPU,
  • and energy reduced by 150.31x on average
    • for the 64-chip DaDianNao (a member of the DianNao family)

1 INTRODUCTION

  • designing hardware accelerators which realize the best possible tradeoff between flexibility and efficiency is becoming a prominent
    issue.

  • The first question is for which category of applications one should primarily design accelerators?
  • Together with the architecture trend towards accelerators, a second simultaneous and significant trend in high-performance and embedded applications is developing: many of the emerging high-performance and embedded applications, from image/video/audio recognition to automatic translation, business analytics, and robotics rely on machine learning
    techniques.
  • This trend in applications comes together with a third trend in machine learning (ML), where a small number
    of techniques, based on neural networks (especially deep learning techniques [16, 26]), have proved in the past few
    years to be state-of-the-art across a broad range of applications.
  • As a result, there is a unique opportunity to design accelerators having significant application scope as well as
    high performance and efficiency [4].

Second paragraph

  • Currently, ML workloads are
  • mostly executed on
    • multicores using SIMD [44],
    • on GPUs [7],
    • or on FPGAs [2]

  • the aforementioned trends
    • have already been identified
    • by researchers who have proposed accelerators implementing
  • CNNs [2]
  • and Multi-Layer Perceptrons [43];

  • accelerators focusing on other domains,
    • such as image processing,
    • also propose efficient implementations of some of the computational primitives used
    • by machine-learning techniques, such as convolutions [37]

  • There are also ASIC implementations of ML techniques
    • such as Support Vector Machines and CNNs.

  • these works focused on
    • efficiently implementing the computational primitives, and either
      • ignore memory transfers for the sake of simplicity [37, 43]
      • or plug their computational accelerator into memory via a more or less sophisticated DMA [2, 12, 19]

Third paragraph

  • While efficient implementation of computational primitives is a first and important step with promising results,
    inefficient memory transfers can potentially void the throughput, energy, or cost advantages of accelerators, that is, an
    Amdahl’s law effect, and thus, they should become a first-order concern, just like in processors, rather than an element
    factored into accelerator design as a second step.

  • Unlike in processors, though, one can factor in the specific nature of
    memory transfers in target algorithms, just as is done for accelerating computations.

  • This is especially important in the domain of ML, where there is a clear trend towards scaling up the size of learning models in order to achieve better accuracy and more functionality [16, 24].

Fourth paragraph

  • In this article, we introduce a series of hardware accelerators designed for ML (especially neural networks), including
    DianNao, DaDianNao, ShiDianNao, and PuDianNao as listed in Table 1.
  • We focus our study on memory usage, and we investigate the accelerator architecture to minimize memory
    transfers and to perform them as efficiently as possible.

2 DIANNAO: A NN ACCELERATOR

  • DianNao
    • is the first of the DianNao accelerator family,
  • accommodates state-of-the-art neural network techniques (deep learning),
  • and inherits the broad application scope of neural networks.

2.1 Architecture

  • DianNao has
    • an input buffer for input neurons (NBin)
    • an output buffer for output neurons (NBout)
    • a buffer for synaptic weights (SB)
    • connected to a computational block (performing both synapses’ and neurons’ computations)
    • the NFU, and a control processor (CP); see Figure 1

NBin holds the input neurons,
SB holds the synaptic weights,
and NBout holds the output neurons.

My reading of the figure: 2 input neurons and 2 synapses, multiplied pairwise and summed, give 1 output neuron. But this NFU is a beast: it can compute two output neurons in one go.

NFU

  • a functional block of $T_i$ inputs/synapses
    • and $T_n$ output neurons,
  • time-shared by different algorithmic blocks of neurons.

So the NFU takes $T_i$ inputs/synapses and produces $T_n$ output neurons; shouldn't there be $T_i \times T_n$ synapses?? (There are, as I read the paper: SB is wide enough to deliver $T_n \times T_i$ synapse values per cycle, one row of $T_i$ per output neuron, while the same $T_i$ inputs are broadcast to all $T_n$ neurons.)

  • Depending on the layer type,
    • computations at the NFU can be decomposed into either two or three stages

  • For classifier and convolutional layers:
    • multiplication of synapses $\times$ inputs: NFU-1
    • additions of all multiplications: NFU-2
    • sigmoid: NFU-3

If it is a classifier or convolutional layer, then it is simply synapses $\times$ inputs, summed up, then a sigmoid. That much I can understand; this case is just a convolution. (A C sketch of the three stages appears a few lines below.)

If it is a classifier layer, then the inputs are...

  • last stage (sigmoid or another nonlinear function) can vary.
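
To make the three-stage decomposition concrete, below is a minimal C sketch of one NFU invocation for a classifier/convolutional layer. It is my own illustration: the names nfu_step, TN, TI and the float type are assumptions, and the real datapath uses 16-bit fixed-point arithmetic, computes the TN neurons in parallel, and applies a piecewise-linear sigmoid (see below).

#include <math.h>

#define TN 16 /* output neurons per invocation (DianNao's Tn) */
#define TI 16 /* inputs/synapses per output neuron (DianNao's Ti) */

/* One NFU invocation: NFU-1 multiplies, NFU-2 reduces via an adder
   tree, NFU-3 applies the nonlinearity. */
void nfu_step(const float in[TI], const float syn[TN][TI], float out[TN])
{
    for (int n = 0; n < TN; n++) {
        /* NFU-1: TI parallel multiplications for output neuron n */
        float prod[TI];
        for (int i = 0; i < TI; i++)
            prod[i] = syn[n][i] * in[i];

        /* NFU-2: the adder tree reduces TI products to one sum */
        float sum = 0.0f;
        for (int i = 0; i < TI; i++)
            sum += prod[i];

        /* NFU-3: sigmoid (exact here; the hardware interpolates) */
        out[n] = 1.0f / (1.0f + expf(-sum));
    }
}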

  • For pooling, there is no multiplication (no synapses),
    • and pooling can be average or max.

  • the adders have multiple inputs,
    • so they are in fact adder trees,

  • and the second stage also contains
    • shifters and max operators for pooling.

Why would shifting be needed?? (Presumably the shifters implement the division when averaging over a power-of-two pooling window; see the sketch below.)
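
Here is my guess, as a small C sketch, at what the second stage does for pooling: a tree of max operators for max pooling, and an accumulate-then-shift for average pooling over a power-of-two window. The window size of 4 and the int16_t data type are assumptions for illustration.

#include <stdint.h>

/* Pooling needs no synapses and no multiplications. */
static int16_t pool_max4(const int16_t w[4])
{
    int16_t m = w[0]; /* tree of max operators, flattened into a loop */
    for (int i = 1; i < 4; i++)
        if (w[i] > m)
            m = w[i];
    return m;
}

static int16_t pool_avg4(const int16_t w[4])
{
    int32_t sum = 0; /* adder tree accumulates the window */
    for (int i = 0; i < 4; i++)
        sum += w[i];
    return (int16_t)(sum >> 2); /* shift replaces division by 4 */
}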

  • the sigmoid function (for classifier and convolutional layers) can be efficiently implemented with piecewise linear interpolation, $f(x) = a_i x + b_i$ for $x \in [x_i, x_{i+1}]$ (16 segments are sufficient), as sketched below
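
A sketch of the 16-segment interpolation as a C table lookup: find the segment $[x_i, x_{i+1}]$ containing x, then evaluate $a_i x + b_i$. The clamped input range [-8, 8) and the float coefficients are my assumptions; the paper only states that 16 segments suffice.

#include <math.h>

#define NSEG 16

static float a[NSEG], b[NSEG]; /* slope and intercept per segment */

/* Fill the table once from the exact sigmoid. */
static void sigmoid_init(void)
{
    const float lo = -8.0f, step = 16.0f / NSEG;
    for (int i = 0; i < NSEG; i++) {
        float x0 = lo + i * step, x1 = x0 + step;
        float y0 = 1.0f / (1.0f + expf(-x0));
        float y1 = 1.0f / (1.0f + expf(-x1));
        a[i] = (y1 - y0) / (x1 - x0); /* a_i */
        b[i] = y0 - a[i] * x0;        /* b_i */
    }
}

static float sigmoid_pla(float x)
{
    if (x < -8.0f) return 0.0f; /* sigmoid saturates outside the range */
    if (x >= 8.0f) return 1.0f;
    int i = (int)((x + 8.0f) / (16.0f / NSEG)); /* segment index */
    return a[i] * x + b[i]; /* f(x) = a_i x + b_i on [x_i, x_{i+1}] */
}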

On-chip Storage

  • the on-chip storage structures of DianNao
    • can be construed as modified buffers or scratchpads.

  • While a cache is an excellent storage structure for a general-purpose processor, it is a sub-optimal way to exploit reuse because of the cache access overhead (tag check, associativity, line size, speculative read, etc.) and cache conflicts.
  • The efficient alternative, the scratchpad, is used in VLIW processors, but it is known to be very difficult to compile for.
  • However, a scratchpad in a dedicated accelerator realizes the best of both worlds: efficient
    storage, and both efficient and easy exploitation of locality, because only a few algorithms have to be manually adapted.
Second paragraph
  • we split on-chip storage into three structures (NBin, NBout, and SB), because there are three types of data (input neurons, output neurons, and synapses) with different characteristics (read width and reuse distance).

  • The first benefit of splitting structures is that each SRAM can be tailored to the appropriate
    read/write width,
  • and the second benefit of splitting storage structures is to avoid conflicts, as would occur in a cache.
  • Moreover, we implement three DMAs to exploit spatial locality of data, one for each buffer (two load DMAs for inputs, one store DMA for outputs); a sketch of the resulting layout follows.
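
One way to picture the split storage, as a C sketch. The sizes follow my reading of the DianNao design point ($T_n = T_i = 16$, 16-bit fixed-point data, 64 entries per buffer), so treat them as assumptions rather than the paper's exact figures.

#include <stdint.h>

#define TN 16      /* output neurons per cycle */
#define TI 16      /* inputs per output neuron per cycle */
#define ENTRIES 64 /* rows per buffer */

/* Each SRAM is tailored to its own read/write width: NBin delivers
   TI inputs per cycle, SB delivers TN*TI synapses, NBout accepts TN
   (partial) outputs.  A separate DMA streams each buffer (two load
   DMAs for inputs, one store DMA for outputs). */
struct nbin  { int16_t row[ENTRIES][TI];     }; /* 64 x 16  x 2 B =  2 KB */
struct nbout { int16_t row[ENTRIES][TN];     }; /* 64 x 16  x 2 B =  2 KB */
struct sb    { int16_t row[ENTRIES][TN][TI]; }; /* 64 x 256 x 2 B = 32 KB */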

2.2 Loop tiling

  • DianNao uses loop tiling to reduce memory accesses
    • so it can accommodate large neural networks
  • Example
    • a classifier layer
      • $N_n$ output neurons
      • fully connected to $N_i$ inputs
      • see the code below

With $N_n$ outputs and $N_i$ inputs, the synapse matrix has size $N_n \times N_i$; multiplying this matrix by the $N_i$-element input vector gives the result.

  • First fetch a block
    • I was a bit puzzled here
    • what if the first element on the right depends on every element on the left
    • how would you compute that?
    • well, actually, to compute the first output on the right
    • I only need one row of the synapse matrix!
    • so what do you do with that huge synapse matrix?
  • Below are the original code and the
    • tiled code,
    • which maps the classifier layer onto DianNao


for(int n=0;n<Nn;n++)
	sum[n]=0;
for(int n=0;n<Nn;n++) // output neurons
	for(int i=0;i<Ni;i++) // input neurons
		sum[n]+=synapse[n][i]*neuron[i];
for(int n=0;n<Nn;n++)
	neuron[n]=Sigmoid(sum[n]);
  • My take:
    • fetch Tnn outputs at a time
    • and Tii inputs
    • but that is still too big for the hardware
    • so split again
    • into chunks of Tn and Ti
    • and that's it
for(int nnn=0;nnn<Nn;nnn+=Tnn){
    // tiling for output neurons:
    // the first loop stages Tnn outputs at a time
    for(int iii=0;iii<Ni;iii+=Tii){
    // tiling for input neurons:
    // the second loop stages Tii inputs at a time;
    // everything below works on these two blocks

        for(int nn=nnn;nn<nnn+Tnn;nn+=Tn){

// the third loop splits Tnn, which is still too large,
// into chunks of size Tn;
// for each Tn chunk (starting at nn!)
// we proceed as follows

///
            for(int n=nn;n<nn+Tn;n++)
// step 1: zero all the partial sums
// (bug: this runs once per iii block, wiping out the partial
// sums accumulated over earlier input blocks)
                sum[n]=0;
// sum[n] = row n of synapse times the whole neuron vector
            for(int ii=iii;ii<iii+Tii;ii+=Ti)

// the for above splits Tii into chunks of Ti

                for(int n=nn;n<nn+Tn;n++)
                    for(int i=ii;i<ii+Ti;i++)
                        sum[n]+=synapse[n][i]*neuron[i];

            for(int nn=nnn;nn<nnn+Tnn;nn+=Tn)
                neuron[n]=sigmoid(sum[n]); // bug: n is out of scope here
///

 }   }  }
  • In the tiled code, loops ii and nn
    • reflect the fact that the NFU has $T_i$ inputs and synapses
      • and $T_n$ output neurons
  • input neurons are reused for each output neuron,
    • but the input vector is far too large
    • to fit into NBin,
    • so loop ii is also tiled, with factor $T_{ii}$

The code above definitely has problems; the correct version follows:

	for (int nnn = 0; nnn < Nn; nnn += Tnn) {
		for (int nn = nnn; nn < nnn + Tnn; nn += Tn) {
			for (int n = nn; n < nn + Tn; n++)
				sum[n] = 0;
			for (int iii = 0; iii < Ni; iii += Tii) {
				for (int ii = iii; ii < iii + Tii; ii += Ti)
					for (int n = nn; n < nn + Tn; n++)
						for (int i = ii; i < ii + Ti; i++)
							sum[n] += synapse[n][i] * neuron[i];
			}
			for (int n = nn; n < nn + Tn; n++)
				printf("s%ds ", sum[n]); // debug print; the real layer would apply the sigmoid here
		}
	}
	for (int index = 0; index < Nn; index++)
		printf("%d ", sum[index]); // print all pre-sigmoid sums
Copyright notice: this is an original post by the blogger, licensed under CC 4.0 BY-SA; please attach the original source link and this notice when reposting.
Original link: https://blog.csdn.net/zhoutianzi12/article/details/110244427
