

# What's new @ Intel

P. Thierry

**Principal Engineer, Intel Corp** 

philippe.thierry@intel.com

**CPU trend** 

Memory update

Software

Characterization

... in 30 min

### 10 000 feet view





CPU: a few TF/s and <200 GB/s per node. Large memory footprint. Standard programming model (PM).

MIC: bootable. Hundreds of threads. 3 TF/s DP, 500 GB/s HBM. Standard PM.

FPGA: discrete and with Xeon CPU. 1.5 TF/s on A10 (now).

GenGraphics: single-socket CPU. 45 W. 1.5 GF/s SP. DDR4 and eDRAM. OpenMP and OpenCL.

ASIC: >50 Tops/s, HBM, low-level PM for now.

\+ SSD/NVM + parallel file system + HPC software stack, compilers, math libraries

### 10 000 feet view






RTC Workshop, Paris 2016


Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. \* Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without any notice. Copyright © 2015, Intel Corporation.

### **Storage Evolution**

Storage class memory



On package memory









### **3D Xpoint : DIMM form factor**







Compiler, MKL, IPP, MPI, OpenMP, OpenCL, VTune, Advisor, ITAC, Inspector, ...

CPU trend

Memory update

**Software** 

**Characterization** 

... in 5 min

### **MPI Performance Snapshot**

Your application is OpenMP bound. High OpenMP imbalance has been identified. Use Intel VTune Amplifier for further analysis.

Application: `/nfs/inn/home/yshchyok/p/svn/testing/ts/results/2015.09.23 12.31.09/itac_testspec/vt_key_default_test_c_icc15_n2_itac_it_mps/test`

Number of ranks: 4

Used statistics: app\_stat\_4p28t.txt, stats\_4p28t.txt

Creation date: 2015-09-28 14:58:48

#### Wallclock time

1.78 sec



#### TOP 5 MPI functions

| Func    | %     |
|---------|-------|
| Wait    | 71.98    |
| Barrier | 20.92    |
| Init    | 3.98     |
| Send    | 2.04     |
| Recv    | 0.93     |

#### Memory usage



Per-process memory usage affects the application scalability.

#### Cycles Per Instruction Rate



High values are usually bad. The CPI value may be too high. This could be caused by issues such as memory stalls, instruction starvation, branch misprediction, or long-latency instructions.

Please use Intel® VTune™ Amplifier XE to identify the cause of this bottleneck.

#### GFLOPS

20.67

0.00%

#### I/O operations

I/O wait: 0.00 sec



This is the time the application spends waiting for an I/O operation to complete. High percentage of I/O wait time indicates that your application actively reads data from the storage device. This application does not spend much time on I/O operations.

#### Memory Bound Coefficient



High values are usually bad. The application is not memory bound: it does not spend much time waiting for data.

Free download: <http://www.intel.com/performance-snapshot>. Also included with Intel® Parallel Studio Cluster Edition.

### **Storage performance snapshot**





### Intel® Advisor - Vectorization Advisor



#### The data and guidance you need:

- Compiler diagnostics + Performance Data + SIMD efficiency
- Detect problems & recommend fixes
- Loop-Carried Dependency Analysis
- Memory Access Patterns Analysis



### **Roofline Model. High level characterization**



#### The roofline model (RM) gives the maximum achievable performance on a given platform

$$\mathrm{GFlop/s}(AI) = \min \begin{cases} p_f \\ AI \times p_b \end{cases} \quad \text{or} \quad \min \begin{cases} \text{xGEMM} \\ AI \times \text{StreamBW} \end{cases}$$

See where your application stands and what you can expect.

AI: arithmetic intensity

**p**<sub>f</sub>: peak FP

**p**<sub>b</sub>: peak bandwidth
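A minimal Python sketch of the roofline formula. The function name `roofline_gflops` and the sample arithmetic intensities are illustrative only. The peak values (3 TF/s DP, 500 GB/s) are the MIC figures quoted in the 10 000 feet view, taken here as nominal peaks rather than measured xGEMM or Stream numbers.

```python
def roofline_gflops(ai, peak_gflops, peak_gbs):
    """Attainable GFlop/s for an arithmetic intensity `ai` in flop/byte."""
    return min(peak_gflops, ai * peak_gbs)


# Nominal peaks (assumption: MIC figures from this deck, not measurements).
PEAK_GFLOPS = 3000.0  # p_f: 3 TF/s DP
PEAK_GBS = 500.0      # p_b: 500 GB/s

# Ridge point: the AI at which the bandwidth and compute limits meet.
ridge = PEAK_GFLOPS / PEAK_GBS

for ai in (0.125, 1.0, ridge, 16.0):
    bound = roofline_gflops(ai, PEAK_GFLOPS, PEAK_GBS)
    regime = "bandwidth bound" if ai < ridge else "compute bound"
    print(f"AI = {ai:6.3f} flop/byte -> {bound:7.1f} GFlop/s ({regime})")
```

Replacing the nominal peaks with measured xGEMM and Stream results would give the second form of the formula.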



E. Lazowska, J. Zahorjan, G. Graham, K. Sevcik, "Quantitative System Performance" (1984)

### **Performance Characterization**





### **Temporal roofline: Application phases identification**



An application with two phases at opposite extremes, run on an E5-2697 v2:

- ✓ Bandwidth bound: Stream
- ✓ CPU bound: HPL

The scalar AI and Flop/s values are averages over the whole run and do not represent how the application evolves over time.

The temporal roofline identifies these phases distinctly.
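The difference between the averaged and the temporal view can be sketched with synthetic numbers. The samples below are invented for illustration and are not measurements from the E5-2697 v2 run: three Stream-like windows followed by three HPL-like windows, each recorded as (seconds, GFlop executed, GByte moved). All names are hypothetical.

```python
# Synthetic per-window samples: (seconds, GFlop executed, GByte moved).
# Invented values: a low-AI phase followed by a high-AI phase.
samples = [
    (1.0, 4.0, 48.0),    # Stream-like window
    (1.0, 4.0, 48.0),
    (1.0, 4.0, 48.0),
    (1.0, 400.0, 40.0),  # HPL-like window
    (1.0, 400.0, 40.0),
    (1.0, 400.0, 40.0),
]

# Scalar view: one (AI, GFlop/s) point for the whole run.
total_t = sum(t for t, _, _ in samples)
total_f = sum(f for _, f, _ in samples)
total_b = sum(b for _, _, b in samples)
avg_ai = total_f / total_b
avg_gflops = total_f / total_t

# Temporal view: one (AI, GFlop/s) point per time window.
windows = [(f / b, f / t) for t, f, b in samples]

print(f"whole run: AI = {avg_ai:.2f} flop/byte, {avg_gflops:.1f} GFlop/s")
for i, (ai, gflops) in enumerate(windows):
    print(f"window {i}: AI = {ai:.2f} flop/byte, {gflops:.1f} GFlop/s")
```

With these made-up samples the whole-run point falls between the two phases and matches neither of them, which is the effect the temporal roofline is meant to expose.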



Courtesy of A. Mrabet et al. 2015






### **Cache aware roofline**



- Know where you are versus the peaks
- Know what to expect
- Per-block view for the whole application
- Linked to source and assembly




### **How to collect Flops and Bytes (AI definition)**





#### SDE

- ✓ Possible for future architectures
- ✓ Average over execution time

#### Hardware counters

- ✓ Not always available
- ✓ VTune, PCM, LIKWID, PAPI

#### By hand

- ✓ Not always possible


How can memory demand be defined without the impact of caches or latency?

**DRAM demand**: how many DRAM transactions a workload 'wants' to do.

**DRAM bandwidth**: the number of DRAM transactions completed per unit time.
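For a simple kernel, the count "by hand" can be written down directly. The sketch below assumes a Stream-triad-like loop `a[i] = b[i] + s * c[i]` on 8-byte doubles and counts only the traffic the algorithm asks for, ignoring caches. The optional write-allocate read of `a` is an assumption about the memory system, not something stated in this deck, and the function and parameter names are illustrative.

```python
def triad_counts(n, bytes_per_elem=8, write_allocate=False):
    """Flops and bytes demanded by a[i] = b[i] + s * c[i] over n elements."""
    flops = 2 * n          # one multiply and one add per element
    streams = 3            # read b, read c, write a
    if write_allocate:
        streams += 1       # assumption: a is read before being written
    bytes_moved = streams * bytes_per_elem * n
    return flops, bytes_moved


n = 10_000_000
for wa in (False, True):
    flops, nbytes = triad_counts(n, write_allocate=wa)
    print(f"write_allocate={wa}: AI = {flops / nbytes:.4f} flop/byte")
```

Counting this way gives the AI of the algorithm itself; values obtained from SDE or hardware counters can differ because they reflect what the machine actually executed and transferred.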

### **Programming model**





- **Compiler:** auto-vectorization (no change of code)
- **Compiler:** auto-vectorization hints (`#pragma vector`, ...)
- **Compiler:** Intel® Cilk™ Plus Array Notation Extensions
- **SIMD intrinsic class** (e.g. `F32vec`, `F64vec`, ...)
- **Vector intrinsic** (e.g. `_mm_fmadd_pd(...)`, `_mm_add_ps(...)`, ...)
- **Assembly code** (e.g. `[v]addps`, `[v]addss`, ...)





## **Questions?**




### **Legal Notices and Disclaimers**



INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life-saving, life-sustaining, critical control or safety systems, or in nuclear facility applications.

Intel products may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel may make changes to dates, specifications, product descriptions, and plans referenced in this document at any time, without notice.

This document may contain information on products in the design phase of development. The information herein is subject to change without notice. Do not finalize a design with this information.

Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.

Intel Corporation or its subsidiaries in the United States and other countries may have patents or pending patent applications, trademarks, copyrights, or other intellectual property rights that relate to the presented subject matter. The furnishing of documents and other materials and information does not provide any license, express or implied, by estoppel or otherwise, to any such patents, trademarks, copyrights, or other intellectual property rights.

Wireless connectivity and some features may require you to purchase additional software, services or external hardware.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations

Intel, the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Other names and brands may be claimed as the property of others.

Copyright © 2015 Intel Corporation. All rights reserved.

### **Legal Disclaimer & Optimization Notice**



INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright© 2015, Intel Corporation. All rights reserved. Intel, the Intel logo, Atom, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804