| Lecture | Topic | Reading | Spatial Assignment |
|---|---|---|---|
| 1 | Introduction, role of hardware accelerators in the post-Dennard and post-Moore era | Is Dark Silicon Useful?; Hennessy & Patterson, Chapter 7.1-7.2 | |
| 2 | Classical ML algorithms: Regression, SVMs (What is the building block?) | TABLA | |
| 3 | Linear algebra fundamentals and accelerating linear algebra BLAS operations. 20th-century techniques: systolic arrays, MIMDs, CGRAs (a blocked GEMM sketch follows the table) | Why Systolic Architectures?; Anatomy of High-Performance GEMM | Linear Algebra Accelerators |
| 4 | Evaluating performance, energy efficiency, parallelism, locality, memory hierarchy, roofline model (a worked roofline example follows the table) | Dark Memory | |
| 5 | Real-world architectures, putting it into practice. Accelerating GEMM: custom, GPU, and TPU1 architectures and their GEMM performance | Google TPU; Codesign Tradeoffs; NVIDIA Tesla V100 | |
| 6 | Neural networks: MLP and CNN inference | Vivienne Sze IEEE Proceedings survey; Brooks's book (selected chapters) | CNN Inference Accelerators |
| 7 | Accelerating inference for CNNs: blocking and parallelism in practice. DianNao, Eyeriss, TPU1 | Systematic Approach to Blocking; Eyeriss; Google TPU (see lecture 5) | |
| 8 | Modeling neural networks with Spatial; analyzing performance and energy with Spatial | Spatial; one related work | |
| 9 | Training: SGD, backpropagation, statistical efficiency, batch size | NIPS workshop last year; Graphcore | Training Accelerators |
| 10 | Resilience of DNNs: sparsity and low-precision networks | Some theory paper; EIE; Flexpoint of Nervana; Boris Ginsburg: paper, presentation; LSTM Block Compression by Baidu? | |
| 11 | Low-precision training (an int8 quantization sketch follows the table) | HALP; ternary or binary networks; see Boris Ginsburg's work (lecture 10) | |
| 12 | Training in distributed and parallel systems: Hogwild!, asynchrony and hardware efficiency | Deep Gradient Compression; Hogwild!; Large Scale Distributed Deep Networks; Obstinate cache? | |
| 13 | FPGAs and CGRAs: Catapult, Brainwave, Plasticine | Catapult; Brainwave; Plasticine | |
| 14 | ML benchmarks: DAWNBench, MLPerf | DAWNBench; some other benchmark paper | |
| 15 | Project presentations | | |
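
The blocking theme of lectures 3, 5, and 7 can be summarized in a few lines of NumPy. This is only an illustrative sketch of loop tiling: the block size and matrix shapes are arbitrary assumptions, and it models the locality idea rather than any particular accelerator's dataflow.

```python
import numpy as np

def blocked_gemm(A, B, block=64):
    """C = A @ B computed tile by tile.

    Blocking keeps one (block x block) tile of A, B, and C in fast
    memory at a time, which is the locality idea behind systolic-array
    and TPU-style GEMM engines. block=64 is an illustrative choice,
    not a tuned value.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, block):
        for j in range(0, N, block):
            for k in range(0, K, block):
                # Accumulate one output tile from a pair of input tiles.
                C[i:i+block, j:j+block] += (
                    A[i:i+block, k:k+block] @ B[k:k+block, j:j+block]
                )
    return C

# Quick check against NumPy's own GEMM.
A = np.random.rand(256, 128)
B = np.random.rand(128, 192)
assert np.allclose(blocked_gemm(A, B), A @ B)
```

The three-level loop nest over tiles is the software analogue of the blocking analysis in the "Systematic Approach to Blocking" reading; only the traversal order and tile size change across the architectures compared in lecture 5.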
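For the lecture 4 roofline model, a small worked example helps connect arithmetic intensity to the attainable compute bound. The peak FLOP/s and bandwidth figures below are made-up illustrative values, not measurements of any machine in the readings.

```python
def roofline_bound(peak_flops, mem_bw, arithmetic_intensity):
    """Attainable FLOP/s = min(peak compute, bandwidth * intensity)."""
    return min(peak_flops, mem_bw * arithmetic_intensity)

def gemm_intensity(n, bytes_per_elem=4):
    """Square n x n GEMM: 2*n^3 FLOPs over 3*n^2 matrices of traffic.

    Intensity grows linearly with n, which is why large GEMMs can be
    compute-bound while small ones stay memory-bound.
    """
    flops = 2 * n**3
    bytes_moved = 3 * n**2 * bytes_per_elem
    return flops / bytes_moved

# Illustrative machine: 10 TFLOP/s peak, 300 GB/s memory bandwidth.
PEAK, BW = 10e12, 300e9
for n in (64, 512, 4096):
    ai = gemm_intensity(n)
    print(f"n={n:5d}  intensity={ai:8.1f} FLOP/byte  "
          f"bound={roofline_bound(PEAK, BW, ai)/1e12:.2f} TFLOP/s")
```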
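For the low-precision material in lectures 10 and 11, the sketch below shows the most basic symmetric int8 round-to-nearest quantization of a weight tensor. It is not the EIE, Flexpoint, or HALP scheme from the readings, only the underlying idea that DNN weights tolerate reduced precision.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: x is approximated by scale * q."""
    scale = max(np.max(np.abs(x)) / 127.0, 1e-8)  # guard against all-zero tensors
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Illustrative check on a random weight matrix.
w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
err = np.max(np.abs(dequantize(q, s) - w))
print(f"max abs quantization error: {err:.4f}")
```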