# Silicon Heterogeneity in the Cloud

Babak Falsafi ecocloud.ch





# Data Economics





#### Modern Datacenters are Warehouse-Scale Computers



- Millions of interconnected home-brewed servers
- Centralization helps exploit economies of scale
- Network fabric provides microsecond connectivity
- At physical limits
- Need sources for
  - Electricity
  - Network
  - Cooling



#### 20 MW, 20x football fields, \$3 billion

Warning! Datacenters are not Supercomputers



- Run heterogeneous data services at massive scale
- Driven for commercial use
- Fundamentally different design, operation, reliability, TCO
  - Density 10-25 kW/rack as compared to 25-90 kW/rack
  - Tier 3 (~2 hrs downtime) vs. Tier 1 (up to 1 day downtime)
  - ...and lots more

Datacenters are the IT utility plants of the future






Supercomputing

Cloud Computing

# Cloud Taking Over Enterprise





# But, Silicon out of steam!



Silicon efficiency is dead (long live efficient silicon)

#### Moore's Law is Dead too! [Mark Bohr's Keynote, ISSCC'15]




# Global Foundries cancelled their 7nm on August 28, 2018!



# **Optimization Opportunities:** The ISA Triangle



Specialization

Holistic Optimization

#### Approximation

(Match quality to analog input/output)

# Scale-Out Datacenters



Vast data sharded across servers

#### Memory-resident workloads

- Necessary for performance
- Major TCO burden
- Put memory at the center
  - Design system around memory
  - Optimize for data services



[Figure: server chip with an array of cores, built around RAM]

Servers driven by the DRAM market!

# In-Memory Scale-Out Services



- Many independent requests/tasks
- Huge dataset split into shards
- Use aggregate memory over network
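As a minimal sketch of the sharding idea above, the following Python maps each key to one of several in-memory shards with a stable hash. The `ShardedStore` class, the `shard_for` function, and the choice of SHA-256 with modulo placement are illustrative assumptions, not something the slides specify; each dictionary stands in for one server's DRAM.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash (hypothetical scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy in-memory store: each dict stands in for one server's memory."""

    def __init__(self, num_shards: int) -> None:
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        # A request touches exactly one shard, so requests stay independent.
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(num_shards=4)
for i in range(1000):
    store.put(f"user:{i}", i)

# The dataset is spread over the aggregate memory of all shards.
print([len(shard) for shard in store.shards])
```

In a real deployment the shard lookup would be followed by a network request to the server that owns the shard; here everything lives in one process to keep the example runnable.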

# Server Benchmarking with CloudSuite 3.0 (cloudsuite.ch)





#### Building block for Google PerfKit, EEMBC Big Data!

# Scaling CPU's: Manycores



- Parallelism has emerged as the only silver bullet
- Use simpler cores
  - Prius instead of Audi R8
- Restructure software



- Each core → fewer joules/op

Modern Manycore CPU (e.g., Tilera)



# But, Services Stuck in Memory (x86 servers) [ASPLOS'12]





- On-chip memory overprovisioned
- Instruction supply is bottlenecked

## Scale-Out Processors (SOP)





x86 server CPU

- ✗ Logic 60% of silicon
- ✗ 6x bigger cores

3-way ARM manycore

- ✓ Logic 85% of silicon
- ✓ 7x more parallelism

Innovation in SOP's [ISCA, MICRO, ASPLOS, MemSys, IEEE Micro '12-'18]

- Instruction supply: core front-end (BP/BTB, Boomerang [Grot])
- On-chip networks: core-to-\$ rather than core-to-core
- Off-chip connectivity: HBM, DRAM hierarchy, network

# Example SOP



#### **CAVIUM**

**Case for Workload Optimized Processors For Next Generation Data Center & Cloud** 

Gopal Hegde VP/GM, Data Center Processing Group

#### **Cavium ThunderX**

- Based on SOP @ EPFL
- Designed to serve data
- Optimized code supply
- Trade off SRAM for cores
- Runs stock software
- 10x faster than Xeon for CloudSuite

# Massively parallel cores



- Data parallelism
- Higher memory b/w
- Super simple cores
- Shared front end
- 10x slower clocks

Great for dense parallel computation





- Can populate chips, but cannot operate all of them
- Today's chips are already "dark" (memory)
- All future platforms will be heterogeneous
- Selectively activate parts



[source: Hardavellas et al., "Toward Dark Silicon in Servers", IEEE Micro, 2011]

#### Custom Computing [FPGA's vs. GPU's in Data centers, IEEE Micro'17]



Reconfigurable

- Best for spatial computing
- Not caching/reuse
- Parallel, spatial computing
- 10x slower clocks
- Better for sparse arithmetic

Microsoft, Amazon & Intel



# Microsoft's Catapult





FPGA economies of scale:

- Local/remote compute accelerator
- Network/storage accelerator
- Configurable cloud

# Google's TPU



Custom array of arithmetic units:

- Linear algebra for ML/NN
- Currently memory bound
- 10x over GPU
- ML as a service



# Oracle's RAPID



- Accelerator for analytics in SQL
- Data movement engine in hardware
- Custom message passing cores
- Up to 15x better perf/Watt over Xeon



# Walkers: CPU-Side Database Accelerators

Pointer-based data structures (e.g., hash table, B-tree)

- Parallel lookups require traversing chains
- Decouple chains in co-designed hw/sw

#### 15x better perf/Watt over Xeon

# Walkers in Software [VLDB'16]



Use insights to help Xeon

- Decouple hash & walk in software
- Schedule off-chip pointer access with co-routines

## 2.3x speedup on Xeon

- Unclogs dependences in microarchitecture
- Maximizes memory level parallelism
- DSL w/ co-routines
- To be integrated in SAP HANA [VLDB'18]
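The two ideas above (hash first, then interleave the chain walks with co-routines) can be sketched with Python generators. This is only an illustration of the scheduling structure under stated assumptions: the `Node`, `build_table`, `walk`, and `lookup_batch` names are hypothetical, and Python itself gains no memory-level parallelism. In a native implementation, each `yield` would mark the point where a prefetch for the next node is issued before switching to another lookup.

```python
from collections import deque


class Node:
    """One entry in a chained hash bucket."""

    def __init__(self, key, value, next_node=None):
        self.key, self.value, self.next = key, value, next_node


def build_table(items, num_buckets):
    buckets = [None] * num_buckets
    for key, value in items:
        b = hash(key) % num_buckets
        buckets[b] = Node(key, value, buckets[b])  # push onto the chain
    return buckets


def walk(buckets, bucket_index, key):
    """Co-routine: suspend before every dependent pointer access."""
    node = buckets[bucket_index]
    while node is not None:
        yield  # native code would prefetch `node` here, then switch
        if node.key == key:
            return node.value
        node = node.next
    return None


def lookup_batch(buckets, keys):
    # Phase 1: hashing is decoupled from walking and done up front.
    hashed = [(k, hash(k) % len(buckets)) for k in keys]
    # Phase 2: round-robin over the in-flight walks.
    ready = deque((k, walk(buckets, b, k)) for k, b in hashed)
    results = {}
    while ready:
        key, coroutine = ready.popleft()
        try:
            next(coroutine)
            ready.append((key, coroutine))  # still walking its chain
        except StopIteration as done:
            results[key] = done.value  # value returned by the generator
    return results


table = build_table([(i, i * i) for i in range(100)], num_buckets=8)
print(lookup_batch(table, [3, 50, 999]))
```

Keeping several walks in flight is what lets independent chain traversals overlap their memory stalls instead of serializing on one dependent pointer chase at a time.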

# Moving Forward: The Specialization Funnel



Specialized

- GPU/ThunderX
- DBToaster
- IX Kernel
- TensorFlow

ASIC

- Crypto/Bitcoin
- Network logic

General Purpose

- Intel CPU
- Oracle Database
- Linux
- Java/C

Specialize as algorithms mature

Domain-specific languages to platforms



### Modern apps/services are statistical

Analog input, analog output

#### Key:

- Much redundancy in data/arithmetic
- Output quality, not accuracy or error
- Exploit in processing, communication, storage

# Arithmetic in Deep Learning (Microsoft Brainwave)



FPGA Performance vs. Data Type



# HBFP (Block FP) vs. FP32



ResNet-50 on ImageNet



#### FP32 performance with 8-bit logic [NeurIPS'18]
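To make the block floating-point idea concrete, here is a minimal sketch of generic block FP arithmetic: every value in a block shares one exponent, mantissas are 8-bit signed integers, and a dot product reduces to integer multiply-accumulate plus one exponent addition. The function names and the rounding and clamping choices are assumptions for illustration; this is not the exact HBFP scheme from the cited paper.

```python
import math


def bfp_quantize(block, mantissa_bits=8):
    """Return (integer mantissas, shared exponent) with x ~= m * 2**exp."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return [0] * len(block), 0
    # frexp gives max_abs = f * 2**e with 0.5 <= f < 1.
    _, e = math.frexp(max_abs)
    exp = e - (mantissa_bits - 1)  # the largest value fills the mantissa
    limit = 2 ** (mantissa_bits - 1) - 1
    # Round to nearest and clamp into the signed mantissa range.
    mants = [max(-limit, min(limit, round(x / 2.0 ** exp))) for x in block]
    return mants, exp


def bfp_dot(a, b, mantissa_bits=8):
    """Dot product using integer MACs on the mantissas only."""
    ma, ea = bfp_quantize(a, mantissa_bits)
    mb, eb = bfp_quantize(b, mantissa_bits)
    acc = sum(x * y for x, y in zip(ma, mb))  # fixed-point accumulate
    return acc * 2.0 ** (ea + eb)  # apply both shared exponents once


a = [0.5, -1.25, 3.0, 0.125]
b = [2.0, 0.75, -0.5, 1.0]
exact = sum(x * y for x, y in zip(a, b))
print(bfp_dot(a, b), exact)
```

Small values in a block lose precision because they share the exponent of the largest value, which is why block size and mantissa width matter for output quality.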































# Near-Memory Processing

A stack of DRAM w/ nearby logic

- Minimize data movement
- Massive internal bandwidth
- Limitations:
  - A few layers of DRAM
  - Logic power/thermals (3D)
  - Thermals ok for HBM (2.5D)

#### Opportunities for algorithm/hardware co-design

[source: AMD]



#### Memory & Storage Hierarchy



©2016 Western Digital Corporation or its affiliates. All rights reserved.

# Storage-Class Memory



#### Persistence

- 100's of nanoseconds vs. microseconds
- Implications for logging & networks

#### Disparity between reads/writes

- Can read at memory speed
- Writes must be batched/are slow
- Writes consume more power
- DRAM cache can help [MemSys'18]





SSD is treated as storage

- Online data in DRAM
- But, DRAM costs dominate, slow scalability

Online services:

- Roundtrip tail latency 100's ms
- SSD access takes 50 µs (1000x faster)
- SSD is 50x cheaper
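The latency argument can be checked with back-of-the-envelope arithmetic. The sketch below takes the 50 µs SSD access time from the slide and assumes a 100 ms budget as the low end of "100's ms"; the function names are hypothetical. Under that assumption the gap is about three orders of magnitude, in line with the slide's 1000x figure.

```python
# All times in microseconds. The SSD access time is from the slide;
# the tail-latency budget is an assumed value for "100's ms".
TAIL_LATENCY_BUDGET_US = 100_000  # assumption: 100 ms
SSD_ACCESS_US = 50


def accesses_per_budget(budget_us: int, access_us: int) -> int:
    """How many back-to-back accesses fit in the latency budget."""
    return budget_us // access_us


def budget_share(num_accesses: int, access_us: int, budget_us: int) -> float:
    """Fraction of the budget consumed by serial accesses."""
    return num_accesses * access_us / budget_us


ratio = accesses_per_budget(TAIL_LATENCY_BUDGET_US, SSD_ACCESS_US)
share = budget_share(10, SSD_ACCESS_US, TAIL_LATENCY_BUDGET_US)
print(f"budget / access = {ratio}x; 10 serial accesses use {share:.1%}")
```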

Technology to bring SSD online!



Network stacks/interfaces are a bottleneck:

- Logic growing at 17%/year, network at 20%/year!
- µServices emerging
- RPC stacks, scheduling/dispatch, data transformation, ....
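The growth rates above compound, which is why the network stack becomes the bottleneck over time. A small sketch, assuming both rates stay constant (the slide gives only the annual figures; the function name is hypothetical):

```python
LOGIC_GROWTH = 0.17    # per year, from the slide
NETWORK_GROWTH = 0.20  # per year, from the slide


def relative_gap(years: int,
                 network: float = NETWORK_GROWTH,
                 logic: float = LOGIC_GROWTH) -> float:
    """Network capability relative to logic after `years` of compounding."""
    return ((1 + network) / (1 + logic)) ** years


for years in (1, 5, 10):
    print(years, round(relative_gap(years), 3))
```

A ratio above 1 means the per-byte processing budget in logic shrinks year over year, so per-packet software overheads such as RPC stacks and dispatch weigh more heavily.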

Key challenges:

- Abstractions for control/data planes
- Co-design of network stacks

# Near-Network Processing: Nvidia BlueField



Network-interface integrated manycore

- Up to 32 cores w/ 0.5 TB of DRAM
- Can host an in-memory object store
- RPC over in-memory data



#### Scale-Out NUMA [ASPLOS'14, ISCA'15, MICRO'16]





#### soNUMA:





- Socket-integrated network interface
- Protected global memory read/write + synch
- Fine-grain (~64B) & bulk objects (~1MB)
- Remote memory ~ 2x local memory latency
- Extensions for messaging & RPC [Daglis' thesis]



- Server design centered around data
- In-memory services offered over the network
- Witnessing end of Moore's Law
- Emerging heterogeneous logic + memory

Future server nodes:

- Logic & memory with multiple network access points
- Tool chains to go from DSL's → accelerators

# Integrate + Specialize + Approximate





# For more information please visit us at ecocloud.ch

