

#### Preparing for Extreme Heterogeneity in High Performance Computing

Jeffrey S. Vetter With many contributions from FTG Group and Colleagues

DATE 2019 Special Session on Embedded meets Hyperscale and HPC 27 Mar 2019

ORNL is managed by UT-Battelle, LLC for the US Department of Energy



### Highlights

- Recent trends in extreme-scale HPC paint an uncertain future
  - Contemporary systems provide evidence that power constraints are driving architectures to change rapidly
  - Multiple architectural dimensions are being (dramatically) redesigned: Processors, node design, memory systems, I/O
  - Complexity is our main challenge
- Applications and software systems are all reaching a state of crisis
  - Applications will not be functionally or performance portable across architectures
  - Programming and operating systems need major redesign to address these architectural changes
  - Procurements, acceptance testing, and operations of today's new platforms depend on performance prediction and benchmarking.
- We need portable programming models and performance prediction now more than ever!
- Programming systems must provide performance portability (beyond functional portability)!!
  - Emerging memory hierarchies
    - DRAGON: transparent NVM access from GPUs
    - NVL-C: user management of nonvolatile memory in C
    - Papyrus: parallel aggregate persistent storage
  - Heterogeneous processors (not covered today)
    - OpenACC->FPGAs
    - Clacc: OpenACC support in LLVM
- Performance prediction is critical for design and optimization (not covered today)



# The three technical areas in ECP have the necessary components to meet national goals



- 25 applications ranging from national security, to energy, earth systems, economic security, materials, and data
- 80+ unique software products spanning programming models and run times, math libraries, data and visualization
- 6 vendors supported by PathForward focused on memory, node, connectivity advancements; deployment to facilities



#### ORNL 75<sup>th</sup> Lab Day and Summit Unveiling – 8 June 2018 #1 on Top 500

| Specification           | Value                                                   |
|-------------------------|---------------------------------------------------------|
| Application Performance | 200 PF                                                  |
| Number of Nodes         | 4,608                                                   |
| Node performance        | 42 TF                                                   |
| Memory per Node         | 512 GB DDR4 + 96 GB HBM2                                |
| NV memory per Node      | 1600 GB                                                 |
| Total System Memory     | >10 PB DDR4 + HBM2 + Non-volatile                       |
| Processors              | 2 IBM POWER9™ 9,216 CPUs<br>6 NVIDIA Volta™ 27,648 GPUs |
| File System             | 250 PB, 2.5 TB/s, GPFS™                                 |
| Power Consumption       | 13 MW                                                   |
| Interconnect            | Mellanox EDR 100G InfiniBand                            |
| Operating System        | Red Hat Enterprise Linux (RHEL) version 7.4             |
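
The per-node and system-wide figures in this table can be cross-checked with simple arithmetic. The following is only a rough consistency check on the listed numbers, assuming decimal units (1 PB = 10^6 GB, 1 PF = 10^3 TF) and assuming the processor counts are 2 CPUs and 6 GPUs per node:

```latex
\begin{aligned}
\text{CPUs} &= 4{,}608 \times 2 = 9{,}216 \\
\text{GPUs} &= 4{,}608 \times 6 = 27{,}648 \\
\text{Aggregate node performance} &= 4{,}608 \times 42\ \text{TF} = 193{,}536\ \text{TF} \approx 194\ \text{PF} \quad (\text{consistent with the roughly } 200\ \text{PF headline}) \\
\text{Memory per node} &= 512 + 96 + 1{,}600 = 2{,}208\ \text{GB} \\
\text{Total system memory} &= 4{,}608 \times 2{,}208\ \text{GB} = 10{,}174{,}464\ \text{GB} \approx 10.2\ \text{PB} \quad (> 10\ \text{PB})
\end{aligned}
```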



#### ECP applications target national problems in 6 strategic areas

| National security | Energy security | Economic security | Scientific discovery | Earth system | Health care |
|-------------------|-----------------|-------------------|----------------------|--------------|-------------|
| Stockpile stewardship<br>Next-generation electromagnetics simulation of hostile environment and virtual flight testing for hypersonic re-entry vehicles | Turbine wind plant efficiency<br>High-efficiency, low-emission combustion engine and gas turbine design<br>Materials design for extreme environments of nuclear fission and fusion reactors<br>Design and commercialization of Small Modular Reactors<br>Subsurface use for carbon capture, petroleum extraction, waste disposal<br>Scale-up of clean fossil fuel combustion<br>Biofuel catalyst design | Additive manufacturing of qualifiable metal parts<br>Reliable and efficient planning of the power grid<br>Seismic hazard risk assessment<br>Urban planning | Find, predict, and control materials and properties<br>Cosmological probe of the standard model of particle physics<br>Validate fundamental laws of nature<br>Demystify origin of chemical elements<br>Light source-enabled analysis of protein and molecular structure and design<br>Whole-device model of magnetically confined fusion plasmas | Accurate regional impact assessments in Earth system models<br>Stress-resistant crop analysis and catalytic conversion of biomass-derived alcohols<br>Metagenomics for analysis of biogeochemical cycles, climate change, environmental remediation | Accelerate and translate cancer research |

# **Major Trends in Computing**



#### **Contemporary devices are approaching fundamental limits**



Dennard scaling has already ended. Dennard observed that voltage and current should be proportional to the linear dimensions of a transistor: 2x transistor count implies 40% faster and 50% more efficient.

R.H. Dennard, F.H. Gaensslen, V.L. Rideout, E. Bassous, and A.R. LeBlanc, "Design of ion-implanted MOSFET's with very small physical dimensions," *IEEE Journal of Solid-State Circuits*, 9(5):256-68, 1974.
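
The 40% and 50% figures follow from the classic constant-field scaling rules. As a rough worked example (the symbol $\kappa$ for the scaling factor is notation introduced here, not taken from the slide), doubling the transistor count in a fixed area corresponds to shrinking each linear dimension by $\kappa = \sqrt{2}$:

```latex
\begin{aligned}
\text{linear dimensions, voltage } V,\ \text{current } I &\;\propto\; 1/\kappa, \qquad \kappa = \sqrt{2} \approx 1.41 \\
\text{transistor density} &\;\propto\; \kappa^{2} = 2 \\
\text{gate delay} \;\propto\; 1/\kappa \;&\Rightarrow\; \text{frequency} \;\propto\; \kappa \approx 1.4 \quad (\text{about } 40\%\ \text{faster}) \\
\text{power per transistor} &\;\propto\; V \cdot I \;\propto\; 1/\kappa^{2} = 0.5 \quad (\text{about } 50\%\ \text{less power per transistor}) \\
\text{power density} &\;\propto\; \kappa^{2} \cdot \frac{1}{\kappa^{2}} = 1 \quad (\text{unchanged})
\end{aligned}
```

Once voltage can no longer be lowered along with feature size, the last line stops holding and power density rises with each shrink, which is the sense in which Dennard scaling has ended.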




Figure 1 | As a metal oxide-semiconductor field effect transistor (MOSFET) shrinks, the gate dielectric (yellow) thickness approaches several atoms (0.5 nm at the 22-nm technology node). Atomic spacing limits the



Figure 2 | As a MOSFET transistor shrinks, the shape of its electric field departs from basic rectilinear models, and the level curves become disconnected. Atomic-level manufacturing variations, especially for dopant

I.L. Markov, "Limits on fundamental limits to computation," Nature, 512(7513):147-54, 2014, doi:10.1038/nature13570.



#### Business climate reflects this uncertainty, cost, complexity, consolidation



#### Sixth Wave of Computing



http://www.kurzweilai.net/exponential-growth-of-computing



### **Transition Period Predictions**

Optimize Software and Expose New Hierarchical Parallelism

- Redesign software to boost performance on upcoming architectures
- Exploit new levels of parallelism and efficient data movement

Architectural Specialization and Integration

- Use CMOS more efficiently for our workloads
- Integrate components to boost performance and eliminate inefficiencies

#### Emerging Technologies

- Investigate new computational paradigms
  - Quantum
  - Neuromorphic
  - Advanced Digital
  - Emerging Memory Devices



#### **Architectural specialization is quickening**

- Vendors, lacking Moore's Law, will need to continue to differentiate products (to stay in business)
- Grant that the advantage of a better CMOS process stalls
- Use the same transistors differently to enhance performance
- Architectural design will become extremely important, critical
  - Dark Silicon
  - Address new parameters for benefits/curse of Moore's Law

#### Intel's Nervana AI platform takes aim at Nvidia's GPU techology

Firm claims Xeon-based chips will deliver a '100-fold increase' in deep learning performance



CHIPMAKER Intel has set out its plans for artificial intelligence (AI) and claimed that it will reduce the time to train a deep learning model by up to 100 times within the next three years.

At the forefront of the firm's AI ambitions is the Intel Nervana platform, which was announced on Thursday **following Intel's acquisition of deep learning startup Nervana Systems earlier this year.** 

http://www.theinquirer.net/inquirer/news/2477796/intels-nervana-ai-platform-takes-aim-at-nvidias-gpu-techology




CADE METZ BUSINESS 05.18.16 3:57 PM




**GOOGLE HAS DESIGNED** its own computer chip for driving deep neural networks, an <u>AI</u> technology that is reinventing the way Internet services operate.

This morning at Google I/O, the centerpiece of the company's year, CEO Sundar Pichai said that Google has designed an ASIC, or application-specific integrated circuit, that's specific to deep neural nets. These are networks of

http://www.wired.com/2016/05/google-tpu-custom-chips/









Xilinx ACAP

D.E. Shaw, M.M. Deneroff, R.O. Dror et al., "Anton, a special-purpose machine for molecular dynamics simulation," Communications of the ACM, 51(7):91-7, 2008.

#### Turing Award Lecture on June 4: A New Golden Age for Computer Architecture



- Domain-specific HW/SW Co-Design
- Enhanced Security

- Open Instruction Sets
- Agile Chip Development

#### A New Golden Age for Computer Architecture: Domain-Specific Hardware/Software Co-Design, Enhanced Security, Open Instruction Sets, and Agile Chip Development

#### John L. Hennessy and David A. Patterson

In the 1980s, Mead and Conway<sup>1</sup> democratized chip design and high-level language programming surpassed assembly language programming, which made instruction set advances viable. Innovations like RISC, superscalar, multilevel caches, and speculation plus compiler advances (especially in register allocation) ushered in a Golden Age of computer architecture, when performance increased annually by 60%. In the later 1990s and 2000s, architectural innovation decreased, so performance came primarily from higher clock rates and larger caches. The ending of Dennard Scaling and Moore's Law also slowed this path; single core performance improved only 3% last year! In addition to poor performance gains of modern microprocessors, Spectre recently demonstrated timing attacks that leak information at high rates<sup>2</sup>.

We're on the cusp of another Golden Age that will significantly improve cost, performance, energy, and security. These architecture challenges are even harder given that we've lost the exponentially increasing resources provided by Dennard scaling and Moore's law. We've identified areas that are critical to this new age:

#### 1. Hardware/Software Co-Design for High-Level and Domain-Specific Languages

Advanced programming languages like Python and domain-specific languages like TensorFlow have dramatically improved programmer productivity by increasing software reuse and by raising the level of abstraction. Whereas compiler-architecture co-design delivered gains of about three in the 1980s for C compilers and RISC architectures, new advances could create compilers and domain-specific architectures<sup>3</sup> (DSAs) that deliver tenfold or more jumps<sup>4</sup> in this new Golden Age.

#### 2. Enhancing Security

We've made tremendous gains in information technology (IT) in the past 40 years, but if security is a war, we're losing it. Thus far, architects have been asked for little beyond page-level protection and supporting virtual machines. The very definition of computer architecture ignores timing, yet Spectre shows that attacks that can determine timing of operations can leak supposedly protected data. It's time for architects to redefine computer architecture and treat security as a first class citizen to protect data from timing attacks, or at worst reduce information leaks to a trickle.

#### 3. Free and Open Architectures and Open-Source Implementations

Progress on these issues likely will require changes to the instruction set architecture (ISA), which is problematic for proprietary ISAs. For tall challenges like these, we want all the best minds to work on them, not only the engineers who work for the ISA owners. Thus, a free and open ISA such as RISC-V can be a boon to researchers<sup>5</sup> because:

- Many people in many organizations can innovate simultaneously using RISC-V.
- The ISA is designed for modularity and extensions.
- It comes with a complete software stack, including compilers, operating systems, and debuggers, which are open source and thus also modifiable.
- This modern ISA is designed to work for any application, from cloud-level servers down to mobile and IoT devices.
- RISC-V is driven by a 100-member foundation<sup>6</sup> that ensures its long-term stability and evolution.

Unlike the past, open ISAs are viable because many engineers for a wide range of products are designing SOCs by incorporating IP and because ARM has demonstrated that IP works for ISAs.

An open architecture also enables open-source processor designs for both FPGAs and real chips, so architects can innovate by modifying an existing RISC-V design and its software stack. While FPGAs run at perhaps only 100 MHz, that is fast enough to run trillions of instructions or to be deployed on the internet to test a security feature against real attacks. Given the plasticity of FPGAs, the RISC-V ecosystem enables experimental investigations of novel features that can be deployed, evaluated, and iterated in days rather than in years. That vision requires more IP than CPUs, such as GPUs, neural network accelerators, DRAM controllers, and PCIe controllers<sup>7</sup>. The stability of process nodes due to the ending of Moore's Law make this goal easier than in the past. This necessity opens a path for architects to have impact by contributing open-source components much as their software colleagues do for databases and operating systems.

#### 4. Agile Chip Development

As the focus of innovation in architecture shifts from the general-purpose CPU to domain-specific and heterogeneous processors, we will need to achieve major breakthroughs in design time and cost (as happened for VLSI in the 1980s). Small teams should be able to design chips, tailored for a specific domain or application. This will require that hardware design become much more efficient, and more like modern software design.

Unlike the "waterfall" development process of giant chips by large companies, Agile development process<sup>8</sup> allows small groups to iterate designs of working but incomplete prototypes for small chips. Fortuitously, the same programming language advances that improved reuse of software have been incorporated in recent hardware design languages, which makes hardware design and reuse easier. While one can stop at layout for a research paper, building real chips is inspiring for everyone in a project, and is the only way to verify important characteristics like timing and energy consumption. The good news is that today TSMC will deliver 100 small test chips in the latest technology for only \$30,000<sup>9</sup>. Thus, virtually all projects can afford real chips as final validation of innovation as well as to enjoy the satisfaction of seeing your ideas work in silicon.

We believe the deceleration of performance gains for standard microprocessors, the opportunities in high-level, domain-specific languages and security, the freeing of architects from the chains of proprietary ISAs, and (ironically) the ending of Dennard scaling and Moore's law will lead to another Golden Age for architecture. Aided by an open-source ecosystem, agily developed prototypes will demonstrate advances and thereby accelerate commercial adoption. We envision the same rapid improvement as in the last Golden Age, but this time in cost, energy, and security as well as in performance.

#### **Transition Period will be Disruptive**

- New devices and architectures may not be hidden in traditional levels of abstraction
  - A new type of CNT transistor may be completely hidden from higher levels
  - A new paradigm like quantum may require new architectures, programming models, and algorithmic approaches
- Solutions need a co-design framework to evaluate and mature specific technologies

| Layer       | Switch, 3D | NVM | Approximate | Neuro | Quantum |
|-------------|------------|-----|-------------|-------|---------|
| Application | 1          | 1   | 2           | 2     | 3       |
| Algorithm   | 1          | 1   | 2           | 3     | 3       |
| Language    | 1          | 2   | 2           | 3     | 3       |
| API         | 1          | 2   | 2           | 3     | 3       |
| Arch        | 1          | 2   | 2           | 3     | 3       |
| ISA         | 1          | 2   | 2           | 3     | 3       |
| Microarch   | 2          | 3   | 2           | 3     | 3       |
| FU          | 2          | 3   | 2           | 3     | 3       |
| Logic       | 3          | 3   | 2           | 3     | 3       |
| Device      | 3          | 3   | 2           | 3     | 3       |

Adapted from IEEE Rebooting Computing Chart



## **HPC Architectures Reflect these Trends**



### **Department of Energy (DOE) Roadmap to Exascale Systems**

An impressive, productive lineup of *accelerated node* systems supporting DOE's mission



### Summit Node Overview

| Specification           | Value                                                   |
|-------------------------|---------------------------------------------------------|
| Application Performance | 200 PF                                                  |
| Number of Nodes         | 4,608                                                   |
| Node performance        | 42 TF                                                   |
| Memory per Node         | 512 GB DDR4 + 96 GB HBM2                                |
| NV memory per Node      | 1600 GB                                                 |
| Total System Memory     | >10 PB DDR4 + HBM2 + Non-volatile                       |
| Processors              | 2 IBM POWER9™ 9,216 CPUs<br>6 NVIDIA Volta™ 27,648 GPUs |
| File System             | 250 PB, 2.5 TB/s, GPFS™                                 |
| Power Consumption       | 13 MW                                                   |
| Interconnect            | Mellanox EDR 100G InfiniBand                            |
| Operating System        | Red Hat Enterprise Linux (RHEL) version 7.4             |





Summit node diagram (labels recovered from the figure):

- 2 POWER9 (P9) CPUs, each attached to 256 GB DRAM
- 6 GPUs at 7 TF each, each with 16 GB HBM at 900 GB/s
- Other bandwidth labels in the figure: 50 GB/s, 64 GB/s, 35 GB/s, 16 GB/s
- Power 9 Processor (2x): 18, 22C water cooled; 16, 20C air cooled
- Gen4 PCIe slots (4x): 1 shared slot, 2 x16 HHHL adapters, 1 x8 HHHL adapter
- Board I/O: 1 Gb Ethernet, VGA, 1 USB 3.0; BMC card (IPMI)

During this Sixth Wave transition, Complexity is our major challenge!

Design: How do we design future systems so that they are better than current systems on mission applications?

- Entirely possible that the new system will be slower than the old system!
- Expect 'disaster' procurements

Programmability: How do we design applications with some level of performance portability?

- Software lasts much longer than transient hardware platforms
- Adapt or die



#### **Final Report on Workshop on Extreme Heterogeneity**

- Maintaining and improving programmer productivity
  - Flexible, expressive, programming models and languages
  - Intelligent, domain-aware compilers and tools
  - Composition of disparate software components
- Managing resources intelligently
  - Automated methods using introspection and machine learning
  - Optimize for performance, energy efficiency, and availability
- Modeling & predicting performance
  - Evaluate impact of potential system designs and application mappings
  - Model-automated optimization of applications
- Enabling reproducible science despite non-determinism & asynchrony
  - Methods for validation on non-deterministic architectures
  - Detection and mitigation of pervasive faults and errors
- Facilitating Data Management, Analytics, and Workflows
  - Mapping of science workflows to heterogeneous hardware and software services
  - Adapting workflows and services to meet facility-level objectives through learning approaches



https://doi.org/10.2172/1473756



https://orau.gov/exheterogeneity2018/



#### **Emerging Memory Systems**



#### Memory Systems Started Diversifying Several Years Ago

- Architectures
  - HMC, HBM/2/3, LPDDR4, GDDR5X, WIDEIO2 etc
  - 2.5D, 3D Stacking
- Configurations
  - Unified memory
  - Scratchpads
  - Write through, write back, etc
  - Consistency and coherence protocols
  - Virtual v. Physical, paging strategies
- New devices
  - ReRAM, PCRAM, STT-MRAM, 3D-Xpoint



Copyright (c) 2014 Hiroshige Goto All rights reserved.

|                             | SRAM             | DRAM             | eDRAM            | 2D NAND<br>Flash | 3D NAND<br>Flash | PCRAM                             | STTRAM           | 2D ReRAM                          | 3D ReRAM                          |
|-----------------------------|------------------|------------------|------------------|------------------|------------------|-----------------------------------|------------------|-----------------------------------|-----------------------------------|
| Data Retention              | N                | N                | N                | Y                | Y                | Y                                 | Y                | Y                                 | Y                                 |
| Cell Size (F<sup>2</sup>)   | 50-200           | 4-6              | 19-26            | 2-5              | <1               | 4-10                              | 8-40             | 4                                 | <1                                |
| Minimum F demonstrated (nm) | 14               | 25               | 22               | 16               | 64               | 20                                | 28               | 27                                | 24                                |
| Read Time (ns)              | <1               | 30               | 5                | 10<sup>4</sup>   | 10<sup>4</sup>   | 10-50                             | 3-10             | 10-50                             | 10-50                             |
| Write Time (ns)             | <1               | 50               | 5                | 10<sup>5</sup>   | 10<sup>5</sup>   | 100-300                           | 3-10             | 10-50                             | 10-50                             |
| Number of Rewrites          | 10<sup>16</sup>  | 10<sup>16</sup>  | 10<sup>16</sup>  |                  |                  | 10<sup>8</sup>-10<sup>10</sup>    | 10<sup>15</sup>  | 10<sup>4</sup>-10<sup>12</sup>    | 10<sup>8</sup>-10<sup>12</sup>    |
| Read Power                  | Low              | Low              | Low              | High             |                  | Low                               | Medium           | Medium                            | Medium                            |
| Write Power                 | Low              | Low              | Low              | High             | High             | High                              | Medium           | Medium                            | Medium                            |
| Power (other than R/W)      | Leakage          | Refresh          | Refresh          | None             | None             | None                              | None             | Sneak                             | Sneak                             |
| Maturity                    |                  |                  |                  |                  |                  |                                   |                  |                                   |                                   |

J.S. Vetter and S. Mittal, "Opportunities for Nonvolatile Memory Systems in Extreme-Scale High Performance Computing," CiSE, 17(2):73-82, 2015.



**Fig. 4.** (a) A typical 1T1R structure of RRAM with HfO<sub>x</sub>; (b) HR-TEM image of the TiN/Ti/HfO<sub>x</sub>/TiN stacked layer; the thickness of the HfO<sub>2</sub> is 20 nm.

H.S.P. Wong, H.Y. Lee, S. Yu et al., "Metal-oxide RRAM," Proceedings of the IEEE.



#### **Complexity in the Expanding and Diversifying Memory Hierarchy**







#### Many Memory Architecture Options under Consideration...

I.B. Peng et al, "Siena: Exploring the Design Space of Heterogeneous Memory Systems," in SC18, 2018




DDR3 and HPM of different costs under a fixed budget.

#### **NVRAM Technology Continues to Improve – Driven by Broad Market Forces**




# **Programming NVM Systems Portably**



### **NVM Design Choices**

- Dimensions
  - Integration point
  - Exploit persistence
    - ACID?
  - Scalability
  - Programming model
- Our Approaches
  - Transparent access to NVM from GPU
  - NVL-C: expose NVM to user/applications
  - Papyrus: parallel aggregate persistent memory
  - Many others (See S. Mittal and J. S. Vetter, "A Survey of Software Techniques for Using Non-Volatile Memories for Storage and Main Memory Systems," in IEEE TPDS 27:5, pp. 1537-1550, 2016)



#### http://j.mp/nvm-sw-survey

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS

#### A Survey of Software Techniques for Using Non-Volatile Memories for Storage and Main Memory Systems

Sparsh Mittal, Member, IEEE, and Jeffrey S. Vetter, Senior Member, IEEE

Abstract—Non-volatile memory (NVM) devices, such as Flash, phase change RAM, spin transfer torque RAM, and resistive RAM, offer several advantages and challenges when compared to conventional memory technologies, such as DRAM and magnetic hard disk drives (HDDs). In this paper, we present a survey of software techniques that have been proposed to exploit the advantages and mitigate the disadvantages of NVMs when used for designing memory systems, and, in particular, secondary storage (e.g., solid state drive) and main memory. We classify these software techniques along several dimensions to highlight their similarities and differences. Given that NVMs are growing in popularity, we believe that this survey will motivate further research in the field of software technology for NVMs.

Index Terms—Review, classification, non-volatile memory (NVM) (NVRAM), flash memory, phase change RAM (PCM) (PCRAM), spin transfer torque RAM (STT-RAM) (STT-MRAM), resistive RAM (ReRAM) (RRAM), storage class memory (SCM), Solid State Drive (SSD).



### **NVM Opportunities in Applications**

BG/P Tree Ethernet InfiniBand Serial ATA Software Buffer Buffer Compute nodes IO nodes File servers Enterprise storage

[Liu, et al., MSST 2012]

Persistent data structures like materials tables

Burst Buffers, C/R




Figure 3: Read/write ratios, memory reference rates and memory object sizes for memory objects in Nek5000

- In situ visualization and analytics



#### Empirical results show many reasons...

- Lookup, index, and permutation tables
- Inverted and 'element-lagged' mass matrices
- Geometry arrays for grids
- Thermal conductivity for soils
- Strain and conductivity rates
- Boundary condition data
- Constants for transforms, interpolation
- MC Tally tables, cross-section materials tables...



# Transparent Runtime Support for NVM from GPUs



### **DRAGON: API and Integration**

#### Out-of-Core using CUDA

```
// Allocate host & device memory
h_buf = malloc(size);
cudaMalloc(&g_buf, size);
while (more_chunks) { // go over all chunks (more_chunks is a placeholder; the slide elides the condition)
   // Read in data
   f = fopen(filepath, "r+");
   fread(h_buf, size, 1, f);
   // H2D transfer
   cudaMemcpy(g_buf, h_buf, size, cudaMemcpyHostToDevice);
   // GPU compute
   compute_on_gpu(g_buf);
   // Transfer back to host
   cudaMemcpy(h_buf, g_buf, size, cudaMemcpyDeviceToHost);
   compute_on_host(h_buf);
   // Write out result (rewind first: same stream was just read)
   rewind(f);
   fwrite(h_buf, size, 1, f);
   fclose(f);
}
```

#### DRAGON

```
// mmap data to host and GPU
dragon_map(filepath, size,
           D_READ | D_WRITE, &g_buf);

// Accessible on both host and GPU
compute_on_gpu(g_buf);
compute_on_host(g_buf);

// Implicitly called when program exits
dragon_sync(g_buf);
dragon_unmap(g_buf);
```

#### Notes

- Similar to NVIDIA's Unified Memory (UM)
- Enables access to large memory on NVM
  - UM is limited by host memory

#### **DRAGON Operations: Key Components**



#### **Results with Caffe**



Figure 6: Comparison of ResNet execution times on Caffe.

- Improves capability and productivity
  - Larger problem sizes transparently
  - Handles irregularity easily
  - Surprising performance on applications



Figure 7: Comparison of C3D execution times on Caffe.



# Language support for NVM: NVL-C - extending C to support NVM



### **NVL-C: Portable Programming for NVMM**



Figure (NVL-C compiler toolchain; components from top to bottom): NVL-C and other NVL languages; OpenARC, ARES HL, and other compiler front ends; ARES LLVM passes; LLVM IR plus metadata and intrinsics; LLVM; target objects; system linker, together with the NVL runtime (libnvlrt-pmemobj); target executable.
J. Denny, S. Lee, and J.S. Vetter, "NVL-C: Static Analysis Techniques for Efficient, Correct Programming of Non-Volatile Main Memory Systems," in ACM High Performance Distributed Computing (HPDC). Kyoto: ACM, 2016

#### Programming Model: Pointer types (like Coburn et al.)

Table 1: Pointer Classes

| Pointer class       | Permitted |
|---------------------|-----------|
| intra-heap NV-to-NV | yes       |
| inter-heap NV-to-NV | no        |



#### **Programming Model: Transactions: MATMUL Example**

- Store i in NVM
- Caller initializes \*i to 0 when allocated
- To recover after failure, matmul resumes at old \*i
- Problem: failure might have occurred before all of a[\*i-1] became durable in NVM due to buffering and caching



#### **Programming Model: Transactions: MATMUL Example**

```
#include <nvl.h>
void matmul(nvl float a[I][J],
            nvl float b[I][K],
            nvl float c[K][J],
            nvl int *i)
{
  while (*i<I) {
    // One transaction per row: the row of a and the index commit together
    #pragma nvl atomic heap(heap)
    {
      for (int j=0; j<J; ++j) {
        float sum = 0.0;
        for (int k=0; k<K; ++k)
          sum += b[*i][k] * c[k][j];
        a[*i][j] = sum;
      }
      ++*i;
    }
  }
}
```

- **nvl atomic** pragma specifies an explicit transaction that computes one row of a
- Transaction guarantees atomicity: either \*i is incremented and one row of a is written durably, or neither happens
- An incomplete transaction is rolled back after failure
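
The rollback behavior described above can be imitated in plain C with an undo log. This is a sketch, not NVL-C's implementation: ordinary memory stands in for NVM, the cache flushes and ordering fences that real durability needs are omitted, and `struct state`, `recover`, and the `fail_at_row` parameter are hypothetical names introduced only to simulate a crash in the middle of a row.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { I = 3, J = 2, K = 2 };

/* State that would live in NVM; plain memory stands in for it here. */
struct state {
    float a[I][J];
    int   i;           /* next row to compute */
    bool  log_valid;   /* true while a transaction is in flight */
    int   log_i;       /* index value saved at transaction begin */
    float log_row[J];  /* old contents of the row being overwritten */
};

/* Undo any transaction that was in flight when the failure happened. */
static void recover(struct state *s) {
    if (s->log_valid) {
        memcpy(s->a[s->log_i], s->log_row, sizeof s->log_row);
        s->i = s->log_i;
        s->log_valid = false;
    }
}

/* Compute a = b * c, one transaction per row. A non-negative
   `fail_at_row` simulates a crash partway through that row;
   the function then returns false with the log still valid. */
static bool matmul(struct state *s, float b[I][K], float c[K][J],
                   int fail_at_row) {
    recover(s);
    while (s->i < I) {
        int row = s->i;
        /* Begin: save the old row and index before modifying them. */
        memcpy(s->log_row, s->a[row], sizeof s->log_row);
        s->log_i = row;
        s->log_valid = true;
        for (int j = 0; j < J; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < K; ++k) sum += b[row][k] * c[k][j];
            s->a[row][j] = sum;
            if (row == fail_at_row && j == 0) return false; /* crash */
        }
        ++s->i;
        /* Commit: discarding the log makes the row and index final. */
        s->log_valid = false;
    }
    return true;
}
```

After a simulated crash, the next call to `matmul` first restores the half-written row and the old index, then recomputes that row from scratch, so a partially written row is never treated as complete.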



# **Programming Scalable NVM with Papyrus**



### Papyrus – Goals and Design

\*Wikipedia: Papyrus can refer to a document written on sheets of papyrus, an early form of a book.



- Massive amounts of NVM in future systems will enable distributed persistent data structures – just say 'no' to I/O
- **Papyrus** is a novel programming system for aggregate NVM in next-generation HPC systems
  - Parallel Aggregate Persistent YRU Storage
  - Portable and scalable programming interface
    - Private NVM & Shared NVM architectures
    - No centralized control
  - Papyrus Virtual File System
    - Interfaces to standard POSIX API
    - Allows for optimization on NVMe, Optane memory, etc.
  - Papyrus Template Container Library
    - C++ template container implementations





- [1] J. Kim, S. Lee, and J.S. Vetter, "PapyrusKV: a high-performance parallel key-value store for distributed NVM architectures," in SC17.
- [2] J. Kim, K. Sajjapongse, S. Lee, and J.S. Vetter, "Design and Implementation of Papyrus: Parallel Aggregate Persistent Storage," in IPDPS 2017.

#### PapyrusKV: A High-Performance Parallel Key-Value Store for Distributed NVM Architectures

- Leverage emerging NVM technologies
  - High performance
  - High capacity
  - Persistence property
- Designed for the next-generation DOE systems
  - Portable across local NVM and dedicated NVM architectures
  - An embedded key-value store (no system-level daemons and servers)
  - Scalability and performance
- Designed for HPC applications
  - MPI/UPC-interoperable
  - Application customizability
    - Memory consistency models (sequential and relaxed)
    - Protection attributes (read-only, write-only, read-write)
    - Load balancing
  - Zero-copy workflow, asynchronous checkpoint/restart



J. Kim, S. Lee, and J. S. Vetter, "PapyrusKV: A High-Performance Parallel Key-Value Store for Distributed NVM Architectures," In Proc. of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2017

#### **PapyrusKV Example Get operations**





The present design allows remote caching only for read-only (RO) data.

#### ECP Application Case Study 1: Meraculous (UPC)

- A parallel De Bruijn graph construction and traversal for genome assembly
  - ExaBiome, Exascale Solutions for Microbiome



Graphic from ExaBiome: Exascale Solutions to Microbiome Analysis (LBNL, LANL, JGI), 2017

Table 1: Source lines of code.

| Source file          | UPC  | UPC+PapyrusKV |
|----------------------|------|---------------|
| meraculous.c         | 469  | 475 (+6)      |
| buildUFXhashBinary.h | 315  | 173 (-142)    |
| kmer_hash.h          | 457  | 129 (-328)    |
| UU_traversal_final.h | 1754 | 1724 (-30)    |
| Modified Total       | 2995 | 2501 (-494)   |
| Grand Total          | 5971 | 5477 (-494)   |
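
As a quick reading of the totals in the table, the relative reduction in source lines works out as follows (simple arithmetic on the reported numbers, not a figure stated in the source):

$$
\frac{2995 - 2501}{2995} = \frac{494}{2995} \approx 16.5\% \ \text{(modified files)}, \qquad \frac{5971 - 5477}{5971} = \frac{494}{5971} \approx 8.3\% \ \text{(whole application)}
$$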



Figure 5: Distributed hash table implementations in UPC and PapyrusKV. \*The same user *hash* function in the UPC application can be used in PapyrusKV.
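
One common way a distributed hash table decides which rank owns a key is to hash the key and take the result modulo the number of ranks. The sketch below assumes that scheme; the source does not spell out how PapyrusKV maps keys to owners, and `owner_rank`, `hash_fn`, and `first_byte_hash` are hypothetical names. FNV-1a is used only as a well-known default hash, and the function-pointer parameter shows how an application's existing hash could be plugged in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Signature assumed for a pluggable hash function. */
typedef uint64_t (*hash_fn)(const char *key, size_t len);

/* FNV-1a (64-bit), used here as a default hash. */
static uint64_t fnv1a(const char *key, size_t len) {
    uint64_t h = 14695981039346656037ULL;   /* FNV offset basis */
    for (size_t i = 0; i < len; ++i) {
        h ^= (unsigned char)key[i];
        h *= 1099511628211ULL;               /* FNV prime */
    }
    return h;
}

/* Toy "user" hash: the first byte of the key, or 0 for an empty key. */
static uint64_t first_byte_hash(const char *key, size_t len) {
    return len ? (unsigned char)key[0] : 0;
}

/* Map a key to the rank that owns it: hash modulo the rank count. */
static int owner_rank(const char *key, size_t len, int nranks, hash_fn hash) {
    assert(nranks > 0);
    return (int)(hash(key, len) % (uint64_t)nranks);
}
```

Because the mapping depends only on the key and the rank count, every rank computes the same owner without any centralized lookup.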



Figure 13: Meraculous performance comparison between PapyrusKV (PKV) and UPC on Cori.

# **NVM Implications**



### Implications

1. Device and architecture trends will have major impacts on HPC in the coming decade
   1. NVM in HPC systems is real!
   2. Entirely possible to have an Exabyte of NVM in upcoming systems!
2. Performance trends of system components will create new opportunities and challenges
   1. Winners and losers
3. Sea of NVM allows/requires applications to operate differently
   1. Sea of NVM will permit applications to run for weeks without doing I/O to external storage system
   2. Applications will simply access local/remote NVM
   3. Longer term productive I/O will be 'occasionally' written to Lustre, GPFS
   4. Checkpointing (as we know it) will disappear
4. Requirements for system design will change
   1. Increase in byte-addressable memory-like message sizes and frequencies
   2. Reduced traditional I/O demands
   3. KV traffic could have considerable impact; more application evidence is needed
   4. Need changes to the operational mode of the system



### Recap

- Recent trends in extreme-scale HPC paint an ambiguous future
- Complexity is the next major hurdle
  - Heterogeneous compute
  - Deep memory with NVM
- New software solutions
  - Programming
    - Memory
      - DRAGON
      - NVL-C
      - Papyrus
    - Heterogeneity
      - OpenACC->FPGAs
      - Clacc for LLVM
- These changes will have a substantial impact on both software and application design

#### Visit us

- We host interns and other visitors year round
- Jobs in FTG
  - Postdoctoral Research Associate in Computer Science
  - Software Engineer
  - Computer Scientist
  - Visit http://jobs.ornl.gov
- Contact me: vetter@ornl.gov

