Hot Topics on Data Center (HotDC) 2018
Posted by tinoryj
Keynote Session
Accelerate Machine Intelligence: An Edge to Cloud Continuum
Hadi Esmaeilzadeh - UCSD
Background
open source: http://act-lab.org/artifacts
Data grows at an unprecedented rate
new landscape of computing: personalized and targeted experiences for users
growing gap between data and compute
power/energy efficiency is a primary concern
approximate computing
machines learn to extract insights from data - two disjoint solutions for ML
distributed computing + FPGA / ASIC chips
ordinary users should not have to write VHDL / Verilog anywhere in the full stack
CoSMIC stack
how to distribute
- understanding machine learning - solving an optimization problem
- abstraction between algorithm and acceleration system - parallelized stochastic gradient descent solver (targets FPGA, GPU, ASIC, CGRA, Xeon Phi)
- leverage linearity of differentiation for distributed learning
- programming and compilation
- build a new language for math
- dataflow graph generation
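A minimal sketch of the "linearity of differentiation" point above, assuming a one-parameter least-squares model. Because the gradient of a sum of per-sample losses equals the sum of the per-sample gradients, each node can compute a partial gradient on its own data shard and an aggregator only has to add the partial results. The function names and the toy data are illustrative assumptions, not part of CoSMIC.

```python
def partial_gradient(w, shard):
    """Sum of gradients of 0.5 * (w * x - y) ** 2 over one node's shard."""
    return sum((w * x - y) * x for x, y in shard)


def distributed_sgd_step(w, shards, lr):
    """One synchronous step: nodes compute partial gradients, then they are summed."""
    partials = [partial_gradient(w, shard) for shard in shards]  # one per node
    sample_count = sum(len(shard) for shard in shards)
    return w - lr * sum(partials) / sample_count


# Toy data generated by y = 2 * x, split across two hypothetical nodes.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [data[:2], data[2:]]

w = 0.0
for _ in range(50):
    w = distributed_sgd_step(w, shards, lr=0.05)
print(w)  # approaches the true slope 2.0
```

The aggregator never sees raw samples, only one number per node per step, which is what makes the partial-gradient formulation attractive for distributed accelerators.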
how to design customizable accelerator
- multi-threading acceleration
- connectivity and bussing
- PE architecture - make hardware simple
how to reduce overhead of distributed coordination
specialized system software in CoSMIC
benchmarks
- 16-node CoSMIC with UltraScale+ FPGAs offers an 18.8x speedup over 16-node Spark with E3 Skylake CPUs
- speedup breakdown: FPGA (66%) and software (34%)
RoboX Accelerator Architecture
DNNs tolerate low-bitwidth operations - bit-level
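The low-bitwidth observation can be illustrated with a small sketch, assuming symmetric uniform quantization (a common scheme, chosen here for illustration; the talk notes do not say which scheme is used). A dot product computed on 8-bit integers and rescaled afterwards stays close to the floating-point result. All names and values are hypothetical.

```python
def quantize(values, bits):
    """Map floats to signed integers of the given bit width, plus a scale factor."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 127 for 8 bits
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak else 1.0  # avoid dividing by zero for all-zero input
    return [round(v / scale) for v in values], scale


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


weights = [0.5, -0.25, 0.75, -1.0]
inputs = [1.0, 2.0, -1.0, 0.5]

qw, w_scale = quantize(weights, bits=8)
qi, i_scale = quantize(inputs, bits=8)

exact = dot(weights, inputs)                # floating-point reference
approx = dot(qw, qi) * w_scale * i_scale    # integer multiply-accumulate, rescaled once
print(exact, approx)
```

The multiply-accumulate loop touches only small integers, which is the part a bit-level accelerator can make cheap; the two scale factors are applied once per output.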
Making Cloud Systems Reliable and Dependable: Challenges and Opportunities
Lidong Zhou - MSRA
Background
system reliability:
- Fault Tolerance
- Redundancies
- State Machine Replication
- Paxos
- Erasure Coding
Real-World Gray Failures in Cloud
- redundancies in data center networking
- active device and link failure localization in data center
- NetBouncer: large-Scale path probing and diagnosis
- NetBouncer: leverage the power of scale
- root cause of the gray failure - requests are stuck due to a network issue while the heartbeat still looks normal
- Insight: detect the errors that requesters see
- critical gray failures are observable
- from error handling to error reporting
Solution - Panorama
- Analysis - automatically convert a software component into an in-situ observer
- Runtime - observers send observations to the local observation store (LOS)
- locate ob-boundary
- observations not always direct
- observations split to ob-origin & ob-sink
- match ob-origin & ob-sink
- Detect what "requesters" see
- failures that matter are observable to requesters
- turn error handlers into error reporters
- enables construction of in-situ observers
- https://github.com/ryanphuang/panorama
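A hand-written sketch of the "error handlers become error reporters" idea, assuming a simple in-memory store. Panorama itself is described above as instrumenting components automatically through analysis; the wrapper, class, and component names below are hypothetical and only show the shape of the idea: the requester's existing error path also records what it observed about the component it called.

```python
from collections import defaultdict


class LocalObservationStore:
    """Hypothetical in-memory stand-in for the local observation store (LOS)."""

    def __init__(self):
        self.observations = defaultdict(list)  # subject -> [(observer, healthy)]

    def report(self, observer, subject, healthy):
        self.observations[subject].append((observer, healthy))

    def verdict(self, subject):
        """Judge a subject from the latest observation of each requester."""
        latest = {}
        for observer, healthy in self.observations[subject]:
            latest[observer] = healthy
        if not latest:
            return "unknown"
        return "healthy" if all(latest.values()) else "unhealthy"


def call_with_reporting(los, observer, subject, request):
    """Run a request; the error handler doubles as an error reporter."""
    try:
        result = request()
    except Exception:
        los.report(observer, subject, healthy=False)  # report what the requester saw
        raise                                         # normal error handling continues
    los.report(observer, subject, healthy=True)
    return result


def stuck_write():
    # Simulates a gray failure: heartbeats may be fine, but this request times out.
    raise TimeoutError("write to storage-1 timed out")


los = LocalObservationStore()
call_with_reporting(los, "frontend-A", "storage-1", lambda: "ok")
try:
    call_with_reporting(los, "frontend-B", "storage-1", stuck_write)
except TimeoutError:
    pass
print(los.verdict("storage-1"))
```

A heartbeat-based detector would still see storage-1 as alive here; the verdict changes only because a requester reported its own failed call.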
Reliability of Large-Scale Distributed Systems
- foundation reliability
- rethink cloud reliability: new theory & new method
- understand gray failure
- systematic and comprehensive observations
paper: Gray Failure: The Achilles' Heel of Cloud-Scale Systems
Deconstructing RDMA-enabled Distributed Transactions: Hybrid is Better!
Haibo Chen - SJTU
Background
- (Distributed) Transactions were slow
- High cost for distributed TX - usually 10s~100s of thousands of TPS - (SIGMOD'12)
- only 4% of wall-clock time spent in useful data processing
new features:
- RDMA: remote direct memory access
- ultra low latency(5us)
- ultra high throughput
- NVM: Non-volatile memory
An Active Line of Research of RDMA-enabled TX
- DrTM - DrTM (SOSP 2015), DrTM-R (EuroSys 2016), DrTM-B (USENIX ATC 2017)
- FaRM - FaRM-KV (NSDI 2014), FaRM-TX (SOSP 2015)
- FaSST (OSDI 2016)
- LITE (SOSP 2017)
Transaction(TX)s
- protocols - OCC, 2PL, SI...
- implementation on hardware devices - CX3, CX4, CX5, RoCE, one-sided, two-sided...
- OLTP workloads - TPC-C, TPC-E, TATP, Smallbank
Main: Use RDMA in TXs
outline:
- RDMA primitive-level analysis
- Phase-by-phase analysis for TX
- DrTM+H: Putting it all together
content:
- phase: Exe/Val/Log/Commit
- offloading with one-sided primitives improves performance
- one-sided primitive has good scalability on modern RNIC
- Execution framework & DrTM+H: https://github.com/SJTU-IPADS/drtmh
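The four phases listed above (Exe/Val/Log/Commit) can be sketched for OCC in a single process, assuming a local dictionary stands in for remote memory. This is not DrTM+H and uses no RDMA; every name is hypothetical. In a real distributed system each phase would map onto one-sided or two-sided primitives, and validation plus commit would need locking to be atomic.

```python
class Store:
    def __init__(self):
        self.data = {}   # key -> (version, value)
        self.log = []    # committed write sets, oldest first


def run_tx(store, logic):
    """Run logic(read, write) under OCC; return True on commit, False on abort."""
    read_set = {}    # key -> version seen during execution
    write_set = {}   # key -> buffered new value

    def read(key):
        if key in write_set:                 # read your own buffered write
            return write_set[key]
        version, value = store.data.get(key, (0, None))
        read_set.setdefault(key, version)
        return value

    def write(key, value):
        write_set[key] = value

    logic(read, write)                       # Execution phase

    for key, version in read_set.items():    # Validation phase
        if store.data.get(key, (0, None))[0] != version:
            return False                     # someone else committed in between

    store.log.append(dict(write_set))        # Logging phase

    for key, value in write_set.items():     # Commit phase
        version = store.data.get(key, (0, None))[0]
        store.data[key] = (version + 1, value)
    return True


store = Store()
store.data = {"a": (1, 100), "b": (1, 0)}


def transfer(read, write):
    write("a", read("a") - 30)
    write("b", read("b") + 30)


def conflicted(read, write):
    balance = read("a")
    # Simulate a concurrent transaction committing to "a" after our read.
    store.data["a"] = (store.data["a"][0] + 1, 0)
    write("b", balance)


committed = run_tx(store, transfer)
aborted = not run_tx(store, conflicted)
print(committed, aborted, store.data)
```

The second transaction aborts in validation because the version of "a" changed after it was read, so its buffered write to "b" never reaches the store or the log.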
RDMA in Data Centers: from Cloud Computing to Machine Learning
Chuanxiong Guo - ByteDance
Background
- Data Center Networks (DCNs) offer many services
- single ownership
- large scale
- bisection bandwidth
- TCP/IP does not work well
- latency
- bandwidth
- processing overhead(40G) - 12% CPU at receiver & 6% CPU at sender
RDMA over Commodity Ethernet (RoCEv2)
- no CPU overhead
- single QP, 88 Gb/s at 1.7% CPU usage (TCP with 8 connections: 30-50 Gb/s, client 2.6% & server 4.3% CPU)
- RoCEv2 needs a lossless Ethernet network
- PFC(priority-based flow control) hop-by-hop flow control
- DCQCN - sender-switch-receiver (RP-CP-NP)
- the slow-receiver symptom - ToR to NIC is 40 Gb/s & NIC to server is 64 Gb/s. The NIC may generate a large number of PFC pause frames
RDMA for DNN Training Acceleration
- understanding using DNN
- DNN Training: BP
- Distributed ML training, GPUs, with mini-batch
- RDMA acceleration: ResNet, RNNs, DNNs (RDMA performs better than TCP)
Highlighted Research Session
Congestion Control Mechanisms in Data Center Networks
Wei Bai - MSRA
Achieving low latency in DCNs
- queueing delay - PIAS (NSDI 2015)
- packet-loss retransmission delay - TLT
PIAS
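- Flow Completion Time (FCT) is the key issue
- flow information cannot be assumed to be known in advance, and the scheme should be quickly deployable on existing devices
- PIAS uses a multi-level feedback queue (MLFQ) to emulate shortest job first (SJF)
- three functions in PIAS:
- packet tagging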
- switch
- rate control
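A minimal sketch of the tagging step, assuming the sender tracks how many bytes each flow has already sent and compares that count against demotion thresholds. The threshold values and function names are hypothetical. Short flows finish while still in the highest priority, so they are served ahead of long flows without anyone knowing flow sizes in advance, which is how an MLFQ approximates SJF.

```python
import bisect

# Hypothetical demotion thresholds in bytes; priority 0 is the highest.
THRESHOLDS = [3000, 10000]


def pias_priority(bytes_sent, thresholds):
    """Priority for the next packet, demoted as the flow's sent bytes grow."""
    return bisect.bisect_right(thresholds, bytes_sent)


def tag_flow(flow_size, mtu, thresholds):
    """Return the priority tag carried by each packet of one flow."""
    tags = []
    sent = 0
    while sent < flow_size:
        tags.append(pias_priority(sent, thresholds))
        sent += min(mtu, flow_size - sent)
    return tags


print(tag_flow(3000, 1500, THRESHOLDS))    # short flow stays at priority 0
print(tag_flow(12000, 1500, THRESHOLDS))   # long flow is demoted step by step
```

The switch side then only needs strict priority queueing on the tag, which commodity switches already support.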
TLT
- get the benefits of both lossy and lossless networks at the same time
- using PFC to eliminate congestion packet losses
- packet loss :
- middle - fast retransmissions
- tail - Timeout retransmissions
- identify important packets; when the switch queue exceeds a threshold, drop the unimportant packets
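A toy sketch of the selective-dropping rule in the last bullet, assuming each packet already carries an "important" flag. How TLT decides which packets are important is not covered in these notes; the flag, class, and packet names are hypothetical. Unimportant packets are dropped once the queue has reached the threshold, while important ones are always admitted.

```python
from collections import deque


class SwitchQueue:
    def __init__(self, threshold):
        self.threshold = threshold   # queue length at which dropping starts
        self.packets = deque()
        self.dropped = []

    def enqueue(self, packet_id, important):
        """Admit the packet unless it is unimportant and the queue is at the threshold."""
        if not important and len(self.packets) >= self.threshold:
            self.dropped.append(packet_id)
            return False
        self.packets.append(packet_id)
        return True


q = SwitchQueue(threshold=2)
q.enqueue("data-0", important=False)
q.enqueue("data-1", important=False)
q.enqueue("data-2", important=False)   # queue is at the threshold: dropped
q.enqueue("tail-3", important=True)    # important packet is still admitted
print(list(q.packets), q.dropped)
```

Dropped middle-of-flow packets can be recovered by fast retransmission, whereas a lost tail packet would otherwise wait for a timeout, which is why the two kinds are treated differently.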
Understanding the challenges of Scaling Distributed DNN Training
Cheng Li - USTC
- Deep learning is growing fast
- DNN - Deep Neural Networks
- benefit: more data / bigger models / more computation
- Jeff Dean - Google
Distributed DNN
- Model or data parallelism
- data parallelism is a primary choice
- BSP / ASP - BSP is the usual choice (ASP may not converge)
- Bulk Synchronous Parallel - synchronizes at fixed points in time
- Asynchronous Parallel
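- network, server, and other bottlenecks limit parallelism
- use measurements to identify the constraints that limit compute capacity
- compression overhead introduced by compressed data transfer
- system design
- elastic system design
- weakest-link effect - the slowest part limits the final compute speed
- how to quickly resize the system, etc. - message-bus stream processing - use a producer-consumer model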
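The BSP versus ASP trade-off noted above can be illustrated with a toy model, assuming a one-dimensional quadratic loss. Staleness 0 models BSP, where every gradient is computed on the current weights; staleness 1 models an asynchronous worker whose gradient is one update old. The step gain is deliberately exaggerated so the effect shows up in a few steps; all names and numbers are hypothetical.

```python
def simulate(step_gain, staleness, steps):
    """Iterate w <- w - step_gain * (w_stale - target); return |w - target| per step.

    step_gain stands for learning rate times curvature of the quadratic loss.
    staleness is how many updates old the weights used for the gradient are.
    """
    target = 2.0
    history = [0.0]
    for _ in range(steps):
        stale_w = history[max(0, len(history) - 1 - staleness)]
        history.append(history[-1] - step_gain * (stale_w - target))
    return [abs(w - target) for w in history]


bsp_errors = simulate(step_gain=1.5, staleness=0, steps=40)       # converges
asp_errors = simulate(step_gain=1.5, staleness=1, steps=40)       # oscillates and grows
asp_small_step = simulate(step_gain=0.2, staleness=1, steps=40)   # converges again

print(bsp_errors[-1], max(asp_errors), asp_small_step[-1])
```

With the same step gain, the stale-gradient run diverges while the synchronous run converges; shrinking the step restores convergence, at the cost of slower progress per update.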
Octopus: an RDMA-enabled Distributed Persistent Memory File System
Youyou Lu - Tsinghua
- distributed file system design
- non-volatile memory - in-memory storage
- DRAM Limitations
- Cell Density
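- Refresh - performance / power consumption
- NVDIMM memory - retains data after power loss
- Intel 3D XPoint - near-DRAM latency, high capacity, non-volatile across power loss
- RDMA - used in high-performance environments
- DiskGluster - latency comes from the HDD | MemGluster - latency comes from software
- RDMA-enabled Distributed File System
- shared data management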
- New data flow strategies
- Efficient RPC design
- Concurrency control
Design
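- I/O handling
- organize all NVMM into a single shared space
- reduce data copies in the DFS (from 7 down to 4)
- the server looks up the data's storage address; the client obtains the address and then fetches the data itself (shifting the work to the client)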
- Metadata RPC
- Collect-Dispatch Distributed Transaction
- performance evaluation
- tests between servers on a LAN - bandwidth reaches 88% of the network bandwidth
- also tested on the Hadoop platform
Short Talk
Computer Organization and Design Course with FPGA Cloud, Ke Zhang (ICT, CAS)
new technologies: AI, IoT
build up new hardware/software co-design skills - CPU, GPU, FPGA, ASIC
ZyForce platform - virtual FPGA labs
ActionFlow:A Framework for Fast Multi-Robots Application Development, Jimin Han (UCAS)
UCAS fourth-year undergraduate - started in August 2018
rapid development of robot applications
Labeled Network Stack, Yifan Shen (ICT, CAS)
Caching or Not: Rethinking Virtual File System for Non-Volatile Main Memory, Ying Wang (ICT, CAS)
Data Motif-based Proxy Benchmarks for Big Data and AI Workloads, Chen Zheng (ICT, CAS)