



## Big Data Analytics & Machine Learning on Intel architecture

Igor Freitas, Developer Relations Division, Software & Services Group

## **Legal Notices and Disclaimers**

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.

Intel may make changes to specifications and product descriptions at any time, without notice.

All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.

Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Any code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user.

Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel's current plan of record product roadmaps.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to <a href="https://www.intel.com/performance">http://www.intel.com/performance</a>

Intel, the Intel logo, Xeon and Xeon logo, Xeon Phi and Xeon Phi logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries

Other names and brands may be claimed as the property of others.

All products, dates, and figures are preliminary and are subject to change without any notice. Copyright © 2016, Intel Corporation.

This document contains information on products in the design phase of development.

## **Optimization Notice**

Intel<sup>®</sup> compilers, associated libraries and associated development tools may include or utilize options that optimize for instruction sets that are available in both Intel<sup>®</sup> and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors. For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they implicate, please refer to the "Intel<sup>®</sup> Compiler User and Reference Guides" under "Compiler Options." Many library routines that are part of Intel<sup>®</sup> compiler products are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel<sup>®</sup> compiler products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors.

Intel<sup>®</sup> compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel<sup>®</sup> Streaming SIMD Extensions 2 (Intel<sup>®</sup> SSE2), Intel<sup>®</sup> Streaming SIMD Extensions 3 (Intel<sup>®</sup> SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel<sup>®</sup> SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.

While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel<sup>®</sup> and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet your requirements. We hope to win your business by striving to offer the best performance of any compiler or library; please let us know if you find we do not.

Notice revision #20101101





- Intel Machine Learning Strategy
- Software and Hardware Architecture
- Get started today
- Backup: More Architecture Details



# Artificial intelligence on Intel architecture

# Artificial intelligence @ Intel



[Figure: Intel AI overview diagram. Recoverable labels: machine/deep learning, reasoning systems, programmable, standards, memory/storage, networking/communications.]

### Unleash Your Potential with Intel's Complete AI Portfolio



# A common language for AI Today







# What is Machine Learning?

### **Classic ML**

Using optimized functions or algorithms to extract insights from data

### Algorithms

- Random Forest
- Support Vector Machines
- Regression
- Naïve Bayes
- Hidden Markov
- K-Means Clustering
- Ensemble Methods
- More...
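
As one concrete instance of the classic algorithms listed above, K-Means clustering can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the function name `kmeans`, the toy points, the explicit starting centroids, and the fixed iteration count are all made up for the example and are not taken from the slides or from any Intel library.

```python
def kmeans(points, centroids, iterations=10):
    """Cluster 2-D points with Lloyd's algorithm, starting from given centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for x, y in points:
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[distances.index(min(distances))].append((x, y))
        # Update step: move each centroid to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, clusters


# Hypothetical toy data: two well-separated groups of three points each.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, clusters = kmeans(data, centroids=[data[0], data[-1]])
print([len(c) for c in clusters])
print(centroids)
```

No training labels are involved here, which matches the footnote that not all classic machine learning functions require training.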

New Data\*

Inference, Clustering, or Classification

### **Deep learning**

Using massive labeled data sets to train deep (neural) graphs that can make inferences about new data



\*Note: not all classic machine learning functions require training

Training

Data\*




# What is a Reasoning system?

### **Memory based**

Using associations between concepts from multiple data types to make sense of complex situations





- ✓ Flexibility to handle ALL data types at once
- ✓ Incorporate new data in real-time
- ✓ Transparent and explainable

### Logic based

Using a rule-based reasoning engine, usually hand-created or maintained, to perform logical inferencing steps

**e.g.** should I maintain or alter my equity portfolio given my risk profile?





- ✓ Explicit encoding of knowledge
- ✓ Repeatable, reversible, deterministic
- ✓ Transparent and explainable



# **Customer examples**



### **Deep Learning**

| Use case | Client | Challenge | Data | Solution | Result |
|---|---|---|---|---|---|
| Early Tumor Detection | Leading medical imaging company | Early detection of malignant tumors in mammograms | Millions of "diagnosed" mammograms | Deep Learning (CNN) tumor image recognition | Higher accuracy and earlier breast cancer detection |
| Data Synthesis (Finance) | Financial services institution with >\$750B assets | Parse info to reduce portfolio manager time to insight | Vast stores of documents (news, emails, research, social) | Deep Learning (RNN w/ encoder/decoder) | Faster and more informed investment decisions |
| Smart Agriculture | World leader in agricultural biotech | Accelerate hybrid plant development | Large dataset of hybrid plant performance based on genotype markers | Deep Learning (CNN) to detect favorable interactions between genotype markers | More accurate selection leading to cost reduction |

### Memory-based Reasoning

| Use case | Client | Challenge | Data | Solution | Result |
|---|---|---|---|---|---|
| Personalized Care | Renowned US hospital system | Accurately diagnose fatal heart conditions | 10,000 health attributes used | Saffron memory-based reasoning | Increased accuracy to 94% compared with 54% for average cardiologist |
| Customer Personalization | Leading insurance group | Increase product recommendation accuracy | 5 product levels, 1,353 products, 12M members | Saffron memory-based reasoning | 50% increase in product recommendation accuracy |
| Supply Chain Logistics | Multinational aerospace corp | Reduce time to select aircraft replacement part | 15,000 aircraft, \$1M/day idle | Saffron memory-based reasoning | Reduced part selection from 4 hours to 5 mins |



# Intel AI portfolio



# Machine Learning: SW & HW Architecture on Intel®

# End-to-end use case

### **Artificial Intelligence For Automated Driving**



# Intel<sup>®</sup> Nervana<sup>™</sup> portfolio (Detail)



[Figure: portfolio diagram with Intel<sup>®</sup> Xeon<sup>®</sup> and Intel<sup>®</sup> Xeon Phi<sup>™</sup> logos]

**Training**

- Train machine learning models across a diverse set of dense and sparse data
- Train large deep neural networks
- Train large models as fast as possible

\*Future\*

**Inference**

- Infer billions of data samples at a time and feed applications within ~1 day
- Infer deep data streams with low latency in order to take action within milliseconds
- Option for higher throughput/watt
- Required for low latency
- Power-constrained environments: Movidius or other Intel® edge processor







# Intel<sup>®</sup> Nervana<sup>™</sup> Platform

### For Deep Learning



Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance





# Lake Crest



Deep Learning by Design



#### Add-in card for unprecedented compute density in deep learning centric environments

#### Hardware for DL Workloads

- Custom-designed for deep learning
- Unprecedented compute density
- More raw computing power than today's state-of-the-art GPUs

#### **Blazingly Fast Data Access**

- 32 GB of in package memory via HBM2 technology
- 8 Tera-bits/s of memory access speed
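
To put the two figures above in perspective, the short calculation below converts the quoted 8 Tera-bits/s into bytes per second and estimates how long one full pass over the 32 GB of in-package memory would take. It is a back-of-the-envelope sketch, assuming decimal units (1 TB = 1000 GB) and that the quoted peak rate is sustained, neither of which is stated on the slide.

```python
BITS_PER_BYTE = 8

memory_gb = 32        # in-package HBM2 capacity quoted on the slide
bandwidth_tbit_s = 8  # memory access speed quoted on the slide

# Assumption: decimal units and a sustained peak rate.
bandwidth_gb_s = bandwidth_tbit_s * 1000 / BITS_PER_BYTE
full_sweep_ms = memory_gb / bandwidth_gb_s * 1000

print(f"{bandwidth_gb_s:.0f} GB/s, one full sweep in about {full_sweep_ms:.0f} ms")
```

Under these assumptions the quoted rate corresponds to roughly 1 TB/s, so the whole memory could be read once in a few tens of milliseconds.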

#### **High Speed Scalability**

- 12 bi-directional high-bandwidth links
- Seamless data transfer via interconnects



#### Everything needed for deep learning and nothing more!





# Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor Family

### **Enables Shorter Time to Train Using General Purpose Infrastructure**



Processor for HPC & enterprise customers running scale-out, highly-parallel, memory intensive apps

#### **Removing IO and Memory Barriers**

- Integrated Intel<sup>®</sup> Omni-Path fabric increases price-performance and reduces communication latency
- Direct access of up to 400 GB of memory with no PCIe performance lag (vs. GPU:16GB)


#### **Breakthrough Highly Parallel Performance**

- Up to 400X deep learning performance on existing hardware via Intel software optimizations
- Up to **4X** deep learning performance increase estimated (Knights Mill, 2017)

#### **Easier Programmability**

- Binary-compatible with Intel<sup>®</sup> Xeon<sup>®</sup> processors
- Open standards, libraries and frameworks

Configuration details on slide: 30




# Intel<sup>®</sup> Arria<sup>®</sup> 10 FPGA

### **Superior Inference Capabilities**



#### Add-in card for higher performance/watt inference with low latency and flexible precision

**Energy Efficient Inference with Infrastructure Flexibility** 

- Excellent energy efficiency up to 25 images/sec/watt inference on Caffe/Alexnet
- Reconfigurable accelerator can be used for variety of data center workloads
- Integrated FPGA with Intel<sup>®</sup> Xeon<sup>®</sup> processor fits in standard server infrastructure -OR- Discrete FPGA fits in PCIe card and embedded applications\*

\*Xeon with Integrated FPGA refers to Broadwell Proof of Concept

Configuration details on slide: 44

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: <a href="http://www.intel.com/performance">http://www.intel.com/performance</a>. Source: Intel measured as of November 2016.





# **Canyon Vista**

### **Turnkey Deep Learning Inference Solution**



Pre-configured add-in card for higher performance/watt inference for image recognition

#### **Energy Efficient Inference**

- Accelerates image recognition using convolutional neural networks (CNN)
- Excellent energy efficient inference up to 25 images/s/w on Caffe/Alexnet
- Fits in standard server infrastructure\*

#### Accelerate Time to Market

- Simplify deployment with preloaded optimized CNN algorithms
- Integrated software ecosystem: optimized libraries, frameworks and APIs

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #20110804





<sup>\*</sup>Standard server infrastructure – verified with Broadwell platform

Configuration details on slide: 44

Source: Intel measured as of November 2016.

# Intel<sup>®</sup> Xeon<sup>®</sup> Processor Family

Most Widely Deployed Machine Learning Platform (97% share)\*



Processor optimized for a wide variety of datacenter workloads enabling flexible infrastructure

#### **Lowest TCO With Superior Infrastructure Flexibility**

- Standard server infrastructure
- Open standards, libraries & frameworks
- Optimized to run wide variety of data center workloads

#### Configuration details on slide: 30

\*Intel® Xeon® processors are used in 97% of servers that are running machine learning workloads today (Source: Intel).

Source: Intel measured as of November 2016.



#### **Server Class Reliability**

Industry standard server features: high reliability, hardware enhanced security

#### Leadership Throughput

- Industry leading inference performance
- Up to **18X** performance on existing hardware via Intel software optimizations



# Intel<sup>®</sup> Omni-Path Architecture

### World-Class Interconnect Solution for Shorter Time to Train



- Single port x8 and x16
- QSFP-based, 192 and 768 port
- Software: Open Source Host Software and Fabric Manager
- Third Party Vendors: Passive Copper, Active Optical



### Fabric interconnect for breakthrough performance on scale-out apps like deep learning training

#### **Building on Some of the Industry's Best Technologies**

- Highly leverage existing Aries & Intel True Scale fabrics
- Excellent price/performance  $\leftarrow \rightarrow$  price/port, 48 radix
- Re-use of existing OpenFabrics Alliance Software
- 80+ Fabric Builder Members

#### **Breakthrough Performance**

- Increases price performance, reduces communication latency compared to InfiniBand FDR<sup>1</sup>:
  - Up to 21% Higher Performance, lower latency at scale
  - Up to **17%** higher messaging rate
  - Up to **9%** higher application performance

#### **Innovative Features**

- Improve performance, reliability and QoS through:
  - Traffic Flow Optimization to maximize QoS in mixed traffic
  - Packet Integrity Protection for rapid and transparent recovery of transmission errors
  - Dynamic lane scaling to maintain link continuity

<sup>1</sup>Intel® Xeon® Processor E5-2697A v4 dual-socket servers with 2133 MHz DDR4 memory. Intel® Turbo Boost Technology and Intel® Hyper Threading Technology enabled. BIOS: Early snoop disabled, Cluster on Die disabled, IOU non-posted prefetch disabled, Snoop hold-off timer=9. Red Hat Enterprise Linux Server release 7.2 (Maipo). Intel® OPA testing performed with Intel Corporation Device 24f0 – Series 100 HFI ASIC (B0 silicon). OPA Switch: Series 100 Edge Switch – 48 port (B0 silicon). Mellanox testing performed with EDR ConnectX-4 Single Port Rev 3 MCX455A HCA and MLNX\_OFED\_Linux-3.2.x. OpenMPI 1.10.x contained within MLNX HPC-X. Message rate claim: Ohio State Micro Benchmarks v. 5.0, osu\_mbw\_mr, 8 B message (uni-directional), 32 MPI rank pairs. Latency claim: HPCC 1.4.3 Random order ring latency using 16 nodes, 32 MPI ranks per node, 512 total MPI ranks. Application claim: GROMACS version 5.0.4 ion\_channel benchmark. 16 nodes, 32 MPI ranks per node, 512







# Intel AI portfolio



| | Intel® Math Kernel Library: Intel® MKL | Intel® Math Kernel Library: MKL-DNN | Intel® Data Analytics Acceleration Library (DAAL) | Intel® Distribution for Python | Open Source Frameworks (Theano, Spark, Caffe, more) | Intel Deep Learning Tools |
|---|---|---|---|---|---|---|
| High Level Overview | High performance math primitives granting low level of control | Free open source DNN functions for high-velocity integration with deep learning frameworks | Broad data analytics acceleration object oriented library supporting distributed ML at the algorithm level | Most popular and fastest growing language for machine learning | Toolkits driven by academia and industry for training machine learning algorithms | Accelerate deep learning model design, training and deployment |
| Primary Audience | Consumed by developers of higher level libraries and applications | Consumed by developers of the next generation of deep learning frameworks | Wider data analytics and ML audience, algorithm level development for all stages of data analytics | Application developers and data scientists | Machine learning app developers, researchers and data scientists | Application developers and data scientists |
| Example Usage | Framework developers call matrix multiplication, convolution functions | New framework with functions developers call for max CPU performance | Call distributed alternating least squares algorithm for a recommendation system | Call scikit-learn k-means function for credit card fraud detection | Script and train a convolution neural network for image recognition | Deep learning training and model creation, with optimization for deployment on constrained end device |

# Intel<sup>®</sup> Deep Learning SDK

### Accelerate Deep Learning Development

For developers looking to accelerate deep learning model design, training & deployment

- FREE for data scientists and software developers to develop, train & deploy deep learning
- Simplify installation of Intel optimized frameworks and libraries
- Increase productivity through simple and highly-visual interface
- Enhance deployment through model compression and normalization
- Facilitate integration with full software stack via inference engine



### software.intel.com/deep-learning-sdk





# BigDL

### Bringing Deep Learning to Big Data

For developers looking to run deep learning on Hadoop/Spark due to familiarity or analytics use

- Open-source deep learning library for Apache Spark\*
- Makes deep learning more accessible to big data users and data scientists
- Feature parity with popular DL frameworks such as Caffe, Torch, and TensorFlow
- Easy customer and developer experience
  - Run deep learning applications as standard Spark programs
  - Run on top of existing Spark/Hadoop clusters (no cluster change)
- High performance powered by Intel MKL and multithreaded programming
- Efficient scale-out leveraging the Spark architecture





### github.com/intel-analytics/BigDL





# Intel Distribution for Python

Advancing Python Performance Closer to Native Speeds



For developers using the most popular and fastest growing programming language for AI

### Easy, Out-of-the-box Access to High Performance Python

- Prebuilt, optimized for numerical computing, data analytics, HPC
- Drop in replacement for your existing Python (no code changes required)

### Drive Performance with Multiple Optimization Techniques

- Accelerated NumPy/SciPy/Scikit-Learn with Intel<sup>®</sup> MKL
- Data analytics with pyDAAL, enhanced thread scheduling with TBB, Jupyter\* Notebook interface, Numba, Cython
- Scale easily with optimized MPI4Py and Jupyter notebooks

### Faster Access to Latest Optimizations for Intel Architecture

- Distribution and individual optimized packages available through conda and Anaconda Cloud
- Optimizations upstreamed back to main Python trunk

### Easy to install with Anaconda\* https://anaconda.org/intel/






### **Scikit-Learn\* Optimizations With Intel® MKL**



System info: 32x Intel® Xeon® CPU E5-2698 v3 @ 2.30GHz, disabled HT, 64GB RAM; Intel® Distribution for Python\* 2017 Gold; Intel® MKL 2017.0.0; Ubuntu 14.04.4 LTS; Numpy 1.11.1; scikit-learn 0.17.1.


# Coming soon: Intel<sup>®</sup> Nervana<sup>™</sup> Graph Compiler



Intel<sup>®</sup> Nervana<sup>™</sup> Graph Compiler enables optimizations that are applicable across multiple HW targets.

- Efficient buffer allocation
- Training vs inference optimizations
- Efficient scaling across multiple nodes
- Efficient partitioning of subgraphs
- Compounding of ops
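
One of the optimizations listed above, compounding of ops (often called operator fusion), can be illustrated with a toy example. The sketch below is not the Intel Nervana Graph Compiler API; the helper names `run_unfused`, `compound`, and `run_fused` and the three-op chain are hypothetical. It only shows the general idea: several element-wise ops are merged into one, so the data is traversed once and fewer intermediate buffers are allocated.

```python
def run_unfused(ops, values):
    """Run each op as its own pass, allocating one new buffer per op."""
    buffers = 0
    for fn in ops:
        values = [fn(v) for v in values]
        buffers += 1
    return values, buffers


def compound(ops):
    """Combine several element-wise ops into a single compound op."""
    def fused(v):
        for fn in ops:
            v = fn(v)
        return v
    return fused


def run_fused(ops, values):
    """Run the compound op in a single pass with a single output buffer."""
    return run_unfused([compound(ops)], values)


# Hypothetical element-wise chain: scale, shift, then ReLU.
chain = [lambda v: v * 2.0, lambda v: v - 3.0, lambda v: max(v, 0.0)]
inputs = [0.5, 1.5, 4.0]

print(run_unfused(chain, inputs))
print(run_fused(chain, inputs))
```

Both calls return the same values; only the number of passes and buffers differs, which is also where the buffer-allocation item in the list comes in.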

## Intel<sup>®</sup> Nervana<sup>™</sup> AI Academy

Hone your skills and build the future of AI

in partnership with



- Frameworks, Tools and Libraries
- Software innovators and Black Belts
- Workshops, webinars, meetups

software.intel.com/ai

## **Optimized Mathematical Building Blocks** Intel<sup>®</sup> MKL

| Linear Algebra                                                                                                                                         | Fast Fourier Transforms                                                                                                   | Vector Math                                                                                                          |
|--------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------|
| <ul> <li>BLAS</li> <li>LAPACK</li> <li>ScaLAPACK</li> <li>Sparse BLAS</li> <li>PARDISO* SMP &amp; Cluster</li> <li>Iterative sparse solvers</li> </ul> | <ul><li>Multidimensional</li><li>FFTW interfaces</li><li>Cluster FFT</li></ul>                                            | <ul> <li>Trigonometric</li> <li>Hyperbolic</li> <li>Exponential</li> <li>Log</li> <li>Power</li> <li>Root</li> </ul> |
| Vector RNGs <ul> <li>Congruential</li> <li>Wichmann-Hill</li> <li>Mersenne Twister</li> <li>Sobol</li> </ul>                                           | Summary Statistics <ul> <li>Kurtosis</li> <li>Variation coefficient</li> <li>Order statistics</li> <li>Min/max</li> </ul> | And More <ul> <li>Splines</li> <li>Interpolation</li> <li>Trust Region</li> <li>Fast Poisson Solver</li> </ul>       |

\*Other names and brands may be claimed as property of others.
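
To make the Summary Statistics entries in the table concrete, the sketch below computes min/max, the variation coefficient, and kurtosis with the Python standard library. It does not call Intel MKL. The definitions are assumptions made for the example (sample standard deviation over the mean, and non-excess kurtosis from population moments), and a given library may use different conventions.

```python
import statistics


def variation_coefficient(xs):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(xs) / statistics.mean(xs)


def kurtosis(xs):
    """Fourth central moment divided by the squared second central moment."""
    mean = statistics.mean(xs)
    m2 = sum((x - mean) ** 2 for x in xs) / len(xs)
    m4 = sum((x - mean) ** 4 for x in xs) / len(xs)
    return m4 / (m2 ** 2)


# Hypothetical sample data.
samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(min(samples), max(samples))
print(round(variation_coefficient(samples), 4))
print(round(kurtosis(samples), 4))
```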

#### Data Center Group





# Intel<sup>®</sup> MKL-DNN

### Math Kernel Library for Deep Neural Networks

For developers of deep learning frameworks featuring optimized performance on Intel hardware

#### **Distribution Details**

Open Source


- Apache 2.0 License
- Common DNN APIs across all Intel hardware.
- Rapid release cycles, iterated with the DL community, to best support industry framework integration.
- Highly vectorized & threaded for maximal performance, based on the popular Intel<sup>®</sup> MKL library.









# Intel<sup>®</sup> Machine Learning Scaling Library (MLSL)

### Scaling Deep Learning to 32 Nodes and Beyond

For maximum deep learning scale-out performance on Intel® architecture

Deep learning abstraction of message-passing implementation

- Built on top of MPI; allows other communication libraries to be used as well
- Optimized to drive scalability of communication patterns
- Works across various interconnects: Intel<sup>®</sup> Omni-Path Architecture, InfiniBand, and Ethernet
- Common API to support Deep Learning frameworks (Caffe, Theano, Torch etc.)
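The communication pattern that dominates data-parallel deep learning is an allreduce over gradients: each node computes gradients on its own shard of the mini-batch, and every node must end up with the same averaged gradient before updating its weights. The sketch below simulates that pattern in a single Python process. It does not use MPI or the MLSL API; the function names, worker count, and values are hypothetical.

```python
from typing import List


def allreduce_mean(worker_grads: List[List[float]]) -> List[List[float]]:
    """Simulate an allreduce: every worker receives the elementwise mean."""
    num_workers = len(worker_grads)
    length = len(worker_grads[0])
    if any(len(g) != length for g in worker_grads):
        raise ValueError("all workers must hold gradients of the same length")
    # Reduce step: elementwise sum across workers.
    total = [sum(g[i] for g in worker_grads) for i in range(length)]
    mean = [t / num_workers for t in total]
    # Broadcast step: each worker gets its own copy of the result.
    return [list(mean) for _ in range(num_workers)]


def sgd_step(weights: List[float], grad: List[float], lr: float) -> List[float]:
    """Plain SGD update applied locally on each worker."""
    return [w - lr * g for w, g in zip(weights, grad)]


# Four hypothetical workers, each holding a gradient from its own data shard.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
averaged = allreduce_mean(grads)

# Every replica starts from the same weights and applies the same update,
# so the replicas stay in sync without exchanging the weights themselves.
replicas = [[0.5, 0.5] for _ in grads]
replicas = [sgd_step(w, g, lr=0.1) for w, g in zip(replicas, averaged)]
print(averaged[0], replicas[0])
```

A real library replaces the naive sum-then-copy with network-aware algorithms and overlaps communication with computation; the result each worker sees is the same.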



#### github.com/01org/MLSL/releases





BETA Now Available

## Better performance in Deep Neural Network workloads with Intel<sup>®</sup> Math Kernel Library (Intel<sup>®</sup> MKL)

Caffe/AlexNet single node training performance



\*Other names and brands may be property of others.

Configurations:

- 2-socket system with Intel® Xeon® Processor E5-2699 v4 (22 Cores, 2.2 GHz), 128 GB memory, Red Hat\* Enterprise Linux 6.7, BVLC Caffe, Intel® Optimized Caffe framework, Intel® MKL 11.3.3, Intel® MKL 2017
- Intel® Xeon Phi™ Processor 7250 (68 Cores, 1.4 GHz, 16GB MCDRAM), 128 GB memory, Red Hat\* Enterprise Linux 6.7, Intel® Optimized Caffe framework, Intel® MKL 2017

All numbers measured without taking data manipulation into account.

#### **Optimization Notice**






### Case Study: LeTV Cloud Illegal Video Detection

- LeTV Cloud (www.lecloud.com) is a leading video cloud provider in China
- LeTV Cloud provides illegal video detection service to 3rd party video cloud customers to help them detect illegal videos
- Originally, LeTV adopted open-source BVLC Caffe plus OpenBLAS as its CNN framework, but performance was poor
- By switching to Caffe + Intel MKL, they gained up to a 30x performance improvement for training in their production environment



\* The test data is based on Intel Xeon E5 2680 V3 processor

#### LeTV Cloud Caffe Optimization - higher is better

Figure: BVLC Caffe + OpenBLAS (baseline) compared with Intel Caffe + MKL.






# Intel<sup>®</sup> DAAL: high level view

- Library of optimized building blocks covering all stages of data analysis, from data extraction to data-driven decisions
- Targets both data centers (Intel<sup>®</sup> Xeon<sup>®</sup> and Intel<sup>®</sup> Xeon Phi<sup>™</sup>) and edge-devices (Intel<sup>®</sup> Atom)
  - Perform analysis close to data source (sensor/client/server) to optimize response latency, decrease network bandwidth utilization, and maximize security.
  - Offload data to server/cluster for complex and large-scale analytics only.




# Get started today: Frameworks optimized for Intel

# Caffe

#### Build Faster Deep Learning Applications with Caffe\*

The popular open-source development framework for image recognition is now optimized for Intel® Architecture. <u>Get the framework</u> Learn how to install Caffe\*

# Spark

#### Speed Up Your Spark Analytics

Apache Spark\* MLlib, the open-source data processing framework's machine learning library, now includes Intel® Architecture support.

<u>Get the library</u> Building faster applications on Spark clusters

# Theano

#### **Delving Into Deep Learning**

The Python library designed to help write deep learning models is now optimized for Intel<sup>®</sup> Architecture.

<u>Visit the library</u> <u>Getting started with Theano\*</u>



**BigDL: Distributed Deep learning on Apache Spark** 

BigDL on GitHub BigDL on Spark video Running on EC2 Page



# Get started today

# Get Intel<sup>®</sup> Libraries (Community License) for Free



### Intel<sup>®</sup> Data Analytics Acceleration Library Intel<sup>®</sup> DAAL

Highly optimized library that helps speed big data analytics by providing algorithmic building blocks for all data analysis stages and for offline, streaming, and distributed analytics usages.
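The three usage modes named above differ in when the data is available: all at once (offline), one block at a time (streaming), or split across nodes whose partial results are merged (distributed). The sketch below illustrates the idea for mean and variance in plain Python, using Welford's online update and the standard pairwise merge of partial results. It is not the Intel® DAAL API; the class and method names are hypothetical.

```python
class RunningStats:
    """Mean/variance accumulator supporting streaming updates and merging."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        """Streaming mode: fold in one observation (Welford's update)."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "RunningStats") -> "RunningStats":
        """Distributed mode: combine two partial results into one."""
        merged = RunningStats()
        merged.count = self.count + other.count
        if merged.count == 0:
            return merged
        delta = other.mean - self.mean
        merged.mean = self.mean + delta * other.count / merged.count
        merged.m2 = (self.m2 + other.m2
                     + delta * delta * self.count * other.count / merged.count)
        return merged

    @property
    def variance(self) -> float:
        """Population variance of everything seen so far."""
        return self.m2 / self.count if self.count else float("nan")


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Streaming: a single accumulator sees the data one value at a time.
streaming = RunningStats()
for value in data:
    streaming.update(value)

# Distributed: two hypothetical nodes each process half, then results merge.
node_a, node_b = RunningStats(), RunningStats()
for value in data[:4]:
    node_a.update(value)
for value in data[4:]:
    node_b.update(value)
distributed = node_a.merge(node_b)

print(streaming.mean, streaming.variance)
print(distributed.mean, distributed.variance)
```

Both paths arrive at the same statistics as a single offline pass, which is what lets the same algorithm serve all three usage modes.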

Open-source options for Intel<sup>®</sup> DAAL Learn more about Intel<sup>®</sup> DAAL



#### Intel<sup>®</sup> Math Kernel Library Intel<sup>®</sup> MKL

A high-performance library with assets to help accelerate math processing routines and increase application performance. <u>Deep neural network technical preview for Intel® MKL</u> <u>Get the library</u>


# Training

#### Accelerating Deep Learning and Machine Learning

This talk focuses on two Intel performance libraries, MKL and DAAL, which offer optimized building blocks for data analytics and machine learning.

#### <u>Remove Python Performance Barriers for Machine Learning</u>

This webinar highlights significant performance speed-ups achieved by implementing multiple Intel tools and techniques for high performance Python.

#### Analyze Python\* App Performance with Intel® VTune™ Amplifier

Efficient profiling techniques can help dramatically improve the performance of your Python\* code. Learn how Intel® VTune™ Amplifier can help.

#### **Boost Python\* Performance with Intel® Math Kernel Library**

Meet Intel<sup>®</sup> Distribution for Python\*, an easy-to-install, optimized Python distribution that can help you optimize your app's performance.

#### **Building Faster Data Applications on Spark\* Clusters**

Apache Spark\* is big for big data processing apps. Intel<sup>®</sup> Data Analytics Acceleration Library (Intel<sup>®</sup> DAAL) can help optimize performance. Learn how.

#### Faster Big Data Analytics Using New Intel® Data Analytics Acceleration Library

Big data is BIG. And you need information faster. New Intel<sup>®</sup> Data Analytics Acceleration Library (Intel<sup>®</sup> DAAL) speeds data processing for data mining, statistical analysis, and machine learning.



# Resources

## Intel<sup>®</sup> Machine Learning

- <u>http://www.intel.com/content/www/us/en/analytics/machine-learning/overview.html</u>
- Intel Caffe\* fork, <u>https://github.com/intelcaffe/caffe</u>
- Intel Theano\* fork, <u>https://github.com/intel/theano</u>
- Intel<sup>®</sup> Deep Learning SDK: <u>https://software.intel.com/deep-learning-SDK</u>

### Intel<sup>®</sup> DAAL

<u>https://software.intel.com/en-us/intel-daal</u>

### Intel<sup>®</sup> Omni-Path Architecture

• <u>http://www.intel.com/content/www/us/en/high-performance-computing-fabrics/omni-path-architecture-fabric-overview.html</u>



# Intel<sup>®</sup> MKL Resources

## Intel<sup>®</sup> MKL website

<u>https://software.intel.com/en-us/intel-mkl</u>

## Intel<sup>®</sup> MKL forum

<u>https://software.intel.com/en-us/forums/intel-math-kernel-library</u>

## Intel<sup>®</sup> MKL benchmarks

<u>https://software.intel.com/en-us/intel-mkl/benchmarks#</u>

## Intel<sup>®</sup> MKL link line advisor

<u>http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/</u>




# performance

# **Artificial Intelligence Plan**

# Bringing the HPC Strategy to AI



#### Top 500 % FLOPs

## Intel<sup>®</sup> Nervana<sup>™</sup> Portfolio





Copyright © 2016, Intel Corporation. All rights reserved. \*Other names and brands may be claimed as the property of others. Intel Confidential



# Intel<sup>®</sup> Xeon<sup>®</sup> Processor: Most Widely Deployed for Machine Learning



### Lowest TCO With Superior Infrastructure Flexibility

- Standard server infrastructure
- Open standards, libraries & frameworks
- Optimized to run wide variety of data center workloads

#### Server Class Reliability

• Industry standard server features: high reliability, hardware enhanced security

#### Leadership Throughput

• Industry leading inference performance



### Higher is better

Configuration details on slide: 30


Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #20110804

# Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor: General Purpose



### Removing IO and Memory Barriers

- Integrated Intel<sup>®</sup> Omni-Path fabric increases price-performance and reduces communication latency
- Direct access to up to **400 GB** of memory with no PCIe performance lag (vs. GPU: 16 GB)

### Breakthrough Highly Parallel Performance

- Near linear scaling with **31X** reduction in time to train when scaling to 32 nodes
- Up to **400X** performance on existing hardware via Intel software optimizations
- Up to **4X** deep learning performance increase estimated on Knights Mill (2017)

### Easier Programmability

- Binary-compatible with Intel<sup>®</sup> Xeon<sup>®</sup> processors
- Open standards, libraries and frameworks

Configuration details on slide: 3


# Better Scaling Efficiency: Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor

Deep Learning Image Classification Training Performance - MULTI-NODE Scaling



#### Dataset: Large image database

\*Other names and brands may be property of others.

Configurations:

Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor 7250 (68 Cores, 1.4 GHz, 16GB MCDRAM), 128 GB memory, Red Hat\* Enterprise Linux 6.7, Intel<sup>®</sup> Optimized Frameworks

\*\*Source: http://arxiv.org/abs/1511.00175 showing FireCaffe\* with 32x NVIDIA\* K20s (Titan Supercomputer\*) running GoogLeNet\* at 20x speedup over Caffe\* with 1x K20




Configuration details on slide: 30

Knights Mill: Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.


# Train Up to 50x faster with Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor

Deep Learning Image Classification Training Performance - MULTI-NODE Scaling



#### Topology: AlexNet\*

#### Dataset: Large image database

For more information go to <u>http://www.intel.com/performance/datacenter</u>.

Configurations: Up to 50x faster training on 128 nodes compared to a single node, based on AlexNet\* topology workload (batch size = 1024) training time using a large image database. Single node: Intel® Xeon Phi™ processor 7250 (16 GB MCDRAM, 1.4 GHz, 68 Cores) in Intel® Server System LADMP2312KXXX41, 96GB DDR4-2400 MHz, quad cluster mode, MCDRAM flat memory mode, Red Hat Enterprise Linux\* 6.7 (Santiago), 1.0 TB SATA drive WD1003FZEX-00MK2A0 System Disk, running Intel® Optimized DNN Framework, training in 39.17 hours. 128 nodes: identically configured, with Intel® Omni-Path Host Fabric Interface Adapter 100 Series 1 Port PCIe x16 connectors, training in 0.75 hours. Contact your Intel representative for more information on how to obtain the binary. For information on workload, see https://papers.nips.cc/paper/4824-Large image database-classification-with-deep-convolutional-networks.pdf.
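The headline figure can be checked directly from the two training times quoted in the configuration text (39.17 hours on one node, 0.75 hours on 128 nodes). The short sketch below computes the speedup and the scaling efficiency; defining efficiency as speedup divided by node count (ideal linear scaling as the reference) is an assumption, and the function names are illustrative.

```python
def speedup(baseline_hours: float, scaled_hours: float) -> float:
    """How many times faster the scaled run finishes than the baseline."""
    return baseline_hours / scaled_hours


def scaling_efficiency(speedup_factor: float, nodes: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfect)."""
    return speedup_factor / nodes


# Training times taken from the configuration text above.
single_node_hours = 39.17
cluster_hours = 0.75
cluster_nodes = 128

alexnet_speedup = speedup(single_node_hours, cluster_hours)
alexnet_efficiency = scaling_efficiency(alexnet_speedup, cluster_nodes)

print(f"speedup: {alexnet_speedup:.1f}x on {cluster_nodes} nodes")
print(f"scaling efficiency: {alexnet_efficiency:.1%}")
```

With these inputs the speedup comes out at roughly 52x, in line with the "up to 50x" headline, and the efficiency at about 41% of ideal linear scaling.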


## Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor Image Classification Training Throughput

Single node: one Intel® Xeon Phi™ processor (1S) delivers up to 1.8x the throughput of two Intel® Xeon® processors E5-2697 v4



#### Configuration details on slide: 12

Source: Intel measured as of November 2016

# Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor Increases Customer Value through More Cores, Wider Vectors, and Memory BW



Source: Intel measured or estimated as of May 2016.





- Ongoing white paper on molecular dynamics software with LNCC
- Partial results\*
- Xeon only:
  - Original code vs Modernized code: up to 11x speedup
- Xeon + 1 Xeon Phi (same optimized code)
  - 1.14x speedup

Authors: <sup>1</sup>CENPES team and Gilvan Vieira - <u>glvandsv@gmai</u> <sup>2</sup>LNCC - Frederico Cabral - <u>fredluiscabral@gmai.cc</u> <sup>3</sup>NCC/UNESP - Silvio Stanzani silvio stanzani@gma

- 11 HPC hands-on workshops so far
- 576+ developers trained so far

# Backup – architecture details

# Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor

(Knights Landing)

### Self-Boot Processor

Binary-compatibility with Xeon, 3+ TFLOPS<sup>1</sup> (DP)

**On-package memory** 

16GB, Up to 490 GB/s STREAM TRIAD

### **Platform Memory**

Up to 384GB (6ch DDR4-2400 MHz)




- x4 DMI2 to PCH
- 36 Lanes PCIe\* Gen3 (x16, x16, x4)



# **Knights Mill & Groveport Platform Overview**



### 1 Trains Machines Faster

- Up to 2.5X\* Single Precision performance over Knights Landing for deep learning workloads
- Industry-leading variable precision: QVNNI for up to 4X\* faster performance
- Highly distributed multi-node scaling

### 2 Memory Flexibility & Bootable Host-CPU

- High memory bandwidth with integrated 16GB MCDRAM and bootable host-CPU reduces offloading & latency challenges
- 384GB 6-channel DDR4 memory capacity for massive AI use cases

### **3** Consistent Programming Models

- Common Intel<sup>®</sup> Xeon<sup>®</sup> & Intel<sup>®</sup> Xeon Phi<sup>™</sup> programming
- Optimized for industry standard Open Source ML frameworks
- Flexibility to run vast workloads across x86 infrastructure

\*NOTE: Theoretical performance relative to the KNL 7250 SKU, based on KNM architectural changes.

#### Groveport Platform



For more complete information visit <u>www.intel.com/benchmarks</u>.

Performance estimate relative to KNL 7250 SKU SGEMM. Performance calculation = AVX frequency × cores × FLOPs per core × efficiency
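The performance calculation quoted above is a plain product of four factors, sketched below for the Knights Landing 7250 baseline. The core count (68) and 1.4 GHz clock come from the configuration lines elsewhere in this deck; using that nominal clock as the AVX frequency, the per-core figures of 64 single-precision and 32 double-precision FLOPs per cycle (two 512-bit vector units with FMA), and the efficiency of 1.0 are assumptions made for illustration.

```python
def peak_gflops(avx_freq_ghz: float, cores: int,
                flops_per_cycle_per_core: int,
                efficiency: float = 1.0) -> float:
    """AVX freq x cores x FLOPs per core (per cycle) x efficiency, in GFLOPS."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return avx_freq_ghz * cores * flops_per_cycle_per_core * efficiency


# Assumed inputs: nominal clock used in place of the (lower) AVX frequency.
KNL_FREQ_GHZ = 1.4
KNL_CORES = 68
SP_FLOPS_PER_CYCLE = 64  # assumed: 2 vector units x 16 lanes x 2 (FMA)
DP_FLOPS_PER_CYCLE = 32  # assumed: 2 vector units x 8 lanes x 2 (FMA)

knl_sp = peak_gflops(KNL_FREQ_GHZ, KNL_CORES, SP_FLOPS_PER_CYCLE)
knl_dp = peak_gflops(KNL_FREQ_GHZ, KNL_CORES, DP_FLOPS_PER_CYCLE)

print(f"single precision peak: {knl_sp:.1f} GFLOPS")
print(f"double precision peak: {knl_dp:.1f} GFLOPS")
```

Under these assumptions the double-precision result is a little over 3,000 GFLOPS, consistent with the "3+ TFLOPS (DP)" figure on the Knights Landing backup slide. A measured SGEMM number would correspond to an efficiency below 1.0.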


# **Knights Mill QFMA for Faster Performance**

### Enhanced ISA QFMA instructions in Knights Mill deliver:

- ✓ Higher Peak Flops for CNN, RNN, DNN, LSTM
- ✓ **Higher Efficiency** (One Quad FMA executed in two cycles)
- ✓ 2X FP operations per cycle





QMADD packs 4 IEEE FMA ops in a single instruction

\*2X faster than KNL SP
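As a rough model of what "four FMA ops in a single instruction" means arithmetically, the sketch below chains four multiply-accumulate steps onto one accumulator vector, each step scaling a different source vector by its own scalar. It is a plain-Python illustration, not the instruction's exact semantics: the operand layout, the 4-lane width (hardware vectors are wider), and the function name are assumptions, and Python rounds after each multiply and add, whereas a true FMA rounds once.

```python
from typing import List, Sequence


def quad_fma(acc: Sequence[float],
             vectors: Sequence[Sequence[float]],
             scalars: Sequence[float]) -> List[float]:
    """Apply four chained multiply-accumulates: acc += vectors[i] * scalars[i]."""
    if len(vectors) != 4 or len(scalars) != 4:
        raise ValueError("expected exactly four vector/scalar pairs")
    result = list(acc)
    for vec, s in zip(vectors, scalars):
        # One multiply-accumulate per lane; a separate FMA instruction would
        # be needed for each of these four steps without the packed form.
        result = [r + v * s for r, v in zip(result, vec)]
    return result


accumulator = [0.0, 0.0, 0.0, 0.0]
sources = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]
weights = [0.5, 1.0, 2.0, -1.0]

print(quad_fma(accumulator, sources, weights))
```

This accumulate-over-several-inputs shape is what the inner loops of convolutions and matrix multiplies look like, which is why packing it into one instruction raises peak FLOPs for the network types listed above.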




# **Knights Mill Variable Precision Performance**

Enabling Faster Throughput for Machine Learning Training



\*4x faster than KNL SP


# **Knights Mill QVNNI Advantages over FP16**



### **QVNNI for Higher Accuracy and Faster Operations**
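The slide's diagram is not reproduced here, so the following is only a sketch of the general idea behind integer variable-precision arithmetic: multiply 16-bit integer inputs and accumulate the products in a 32-bit integer, so that no precision is lost during accumulation. The int16-in/int32-out layout, the shared power-of-two scale, and all names below are assumptions for illustration, not a description of the Knights Mill instructions themselves.

```python
from typing import List, Sequence

INT16_MIN, INT16_MAX = -2**15, 2**15 - 1
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def quantize_int16(values: Sequence[float], scale: int) -> List[int]:
    """Map floats to 16-bit integers using a shared scale (hypothetical scheme)."""
    quantized = []
    for v in values:
        q = round(v * scale)
        quantized.append(max(INT16_MIN, min(INT16_MAX, q)))  # saturate
    return quantized


def dot_int16_acc32(a_q: Sequence[int], b_q: Sequence[int]) -> int:
    """Dot product of 16-bit integers with a 32-bit accumulator."""
    acc = 0
    for x, y in zip(a_q, b_q):
        acc += x * y  # integer products are exact, unlike low-precision floats
        if not INT32_MIN <= acc <= INT32_MAX:
            raise OverflowError("accumulator exceeded the 32-bit range")
    return acc


scale = 256  # assumed shared scale, 2**8
a = [0.5, -0.25, 0.125, 1.0]
b = [1.0, 2.0, 4.0, 0.5]

a_q = quantize_int16(a, scale)
b_q = quantize_int16(b, scale)
acc = dot_int16_acc32(a_q, b_q)

# Undo both scale factors to return to the floating-point domain.
approx = acc / (scale * scale)
exact = sum(x * y for x, y in zip(a, b))
print(a_q, b_q, acc, approx, exact)
```

In this example the inputs are exactly representable at the chosen scale, so the integer result matches the floating-point dot product; in general the only error comes from the initial quantization, not from the accumulation.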




# **Knights Hill Processor Developments**



## CPU – fabric integration



### **Enhanced performance**



### **Reduced costs**



# Memory





- Direct access to KNH CPU resources
- Improved fabric latency
- Lower cost and improved density opportunity
- Huge leap in Double Precision vector performance
- Dramatic leap in Single & 16-bit performance
- High density system options
- Improved Intel<sup>®</sup> OPA fabric bandwidth
- TCO via Performance/Watt
- Faster time to solution
- Higher radix Omni-Path switches
- Higher capacity and bandwidth in-package memory
- Innovations in 3D XPoint<sup>™</sup> technology support
- Emphasis on reliability and resiliency
- Storm Lake 2 scaling support for 100K nodes


# Intel Nervana Engine (Lake Crest)

### Nervana Engine

- ~55 Tera Ops
- Patented *FlexPoint* precision for maximum throughput and high accuracy
- Over Tera b/s of inter/intra connectivity for optimal scaling
- 32GB HBM
- Standard PCIe Gen3 x16 AiC ("like GPU")
- Platform design for 4-8 AiCs per 3U-4U chassis





# Skylake + FPGA on Purley



- Power for FPGA is drawn from socket & requires modified Purley platform specs
- Platform Modifications include Stackup, Clock, Power Delivery, Debug, Power up/down sequence, Misc IO pins (see BOM cost section)

| Cores                                                                                 | Up to 28C with Intel <sup>®</sup> HT Technology                                                                                                         |                                                                       |  |
|---------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------|--|
| FPGA                                                                                  | Altera® Arria 10 GX 1150                                                                                                                                |                                                                       |  |
| Socket TDP                                                                            | Shared socket TDP of 165W combined, or<br>Up to 165W SKL & Up to 90W FPGA                                                                               |                                                                       |  |
| Socket                                                                                | Socket P                                                                                                                                                |                                                                       |  |
| Scalability                                                                           | Up to 2S – with SKL-SP or SKL + FPGA SKUs                                                                                                               |                                                                       |  |
| РСН                                                                                   | Lewisburg: DMI3 – 4 lanes; 14xUSB2 ports<br>Up to: 10xUSB3; 14xSATA3, 20xPCle*3 New: Innovation<br>Engine, 4x10GbE ports, Intel® QuickAssist Technology |                                                                       |  |
|                                                                                       | For CPU                                                                                                                                                 | For FPGA                                                              |  |
| Memory                                                                                | 6 channels DDR4<br>RDIMM, LRDIMM,<br>Apache Pass DIMMs                                                                                                  | Low latency access to system<br>memory via UPI & PCIe<br>interconnect |  |
|                                                                                       | 2666 1DPC,<br>2133, 2400 2DPC                                                                                                                           |                                                                       |  |
| Intel <sup>®</sup> UPI                                                                | 2 channels<br>(10.4, 9.6 GT/s)                                                                                                                          | 1 channel<br>(9.6 GT/s)                                               |  |
| PCIe* | PCIe* 3.0<br>(8.0, 5.0, 2.5 GT/s) | PCIe* 3.0<br>(8.0, 5.0, 2.5 GT/s) | |
|                                                                                       | 32 lanes per CPU<br>Bifurcation support:<br>x16, x8, x4                                                                                                 | 16 lanes per FPGA<br>Bifurcation support:<br>x8                       |  |
| High Speed<br>Serial Interface<br>(Different board<br>design based on HSSI<br>config) | N/A                                                                                                                                                     | 2xPCIe 3.0 x8                                                         |  |
|                                                                                       |                                                                                                                                                         | Direct Ethernet<br>(4x10 GbE, 2x40 GbE, 10x10<br>GbE, 2x25 GbE)       |  |
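
The memory and PCIe rows above imply theoretical peak bandwidths that can be worked out directly. The sketch below does that arithmetic under two stated assumptions that are not in the table itself: a standard 64-bit (8-byte) DDR4 data bus per channel, and the 128b/130b line encoding that PCIe 3.0 uses at 8.0 GT/s. The function names are illustrative only, and the results are upper bounds, not measured throughput.

```python
# Theoretical peak bandwidth from the spec table (upper bounds, not measurements).

DDR4_BUS_BYTES = 8  # assumption: 64-bit data bus per DDR4 channel, ECC bits excluded


def ddr4_peak_gbs(mega_transfers_per_s: float, channels: int) -> float:
    """Peak DDR4 bandwidth in GB/s for a given transfer rate (MT/s) and channel count."""
    return mega_transfers_per_s * 1e6 * DDR4_BUS_BYTES * channels / 1e9


def pcie3_peak_gbs(lanes: int) -> float:
    """Peak PCIe 3.0 bandwidth per direction in GB/s at 8.0 GT/s.

    Assumes 128b/130b encoding, which applies only to the 8.0 GT/s rate.
    """
    bits_per_s_per_lane = 8.0e9 * 128 / 130
    return bits_per_s_per_lane * lanes / 8 / 1e9


if __name__ == "__main__":
    # 6 channels of DDR4-2666 (1 DIMM per channel), as listed for the CPU
    print(f"CPU memory: {ddr4_peak_gbs(2666, 6):.2f} GB/s")   # 127.97 GB/s
    # 32 PCIe 3.0 lanes per CPU, 16 per FPGA
    print(f"CPU PCIe:   {pcie3_peak_gbs(32):.2f} GB/s")       # 31.51 GB/s
    print(f"FPGA PCIe:  {pcie3_peak_gbs(16):.2f} GB/s")       # 15.75 GB/s
```

At 2 DIMMs per channel the table lists lower speeds (2133 or 2400), so the same function gives a correspondingly lower peak for those configurations.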


#### Data Center Group

# **SKL+FPGA Customer Profile**



#### **Application Development Method**


**Customer Profile** 

# Intel<sup>®</sup> Xeon<sup>®</sup> Processor E5 Family with Altera Arria<sup>®</sup> 10 FPGA



### **Energy Efficient Scoring**

- Best-in-class energy efficiency at 18.1 images/s/W
- Up to 40 percent lower power than previous generation FPGAs and SoCs







### **Infrastructure Flexibility**

- Fits in standard server infrastructure
- Reconfigurable accelerator can be used for a variety of data center workloads





# Intel<sup>®</sup> Arria 10-1150 FPGA energy efficiency up to 25 images/second/watt on Caffe/AlexNet



Arria 10 1150 FP16 @ 297 MHz

Energy efficiency on Caffe/AlexNet up to 25 images/s/W
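
Because a watt is a joule per second, an efficiency in images/s/W is the reciprocal of the energy spent per image. The sketch below converts the two figures quoted in these slides (18.1 and 25 images/s/W) into millijoules per image. The function name is illustrative, and the numbers cover FPGA power only, as stated in the configuration details.

```python
def joules_per_image(images_per_sec_per_watt: float) -> float:
    """Energy per classified image in joules, given efficiency in images/s/W."""
    if images_per_sec_per_watt <= 0:
        raise ValueError("efficiency must be positive")
    # images/s/W == images/J, so energy per image is the reciprocal
    return 1.0 / images_per_sec_per_watt


if __name__ == "__main__":
    for efficiency in (18.1, 25.0):
        millijoules = joules_per_image(efficiency) * 1e3
        print(f"{efficiency} images/s/W -> {millijoules:.1f} mJ per image")
    # 18.1 images/s/W -> 55.2 mJ per image
    # 25.0 images/s/W -> 40.0 mJ per image
```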



Configuration Details:

- Vanilla AlexNet classification implementation as specified by <u>http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf</u>
- Training parameters taken from the Caffe open-source framework: 224x224x3 input, 1000x1 output, FP16 with shared block exponents
- All compute layers (including fully connected) done on the FPGA, except for Softmax
- Arria 10-1150 FPGA, -1 speed grade, on Altera PCIe DevKit with x72 DDR4 @ 1333 MHz
- Power measured through the on-board power monitor (FPGA power only)
- ACDS 16.1 internal builds + OpenCL SDK 16.1 internal build
- Compute machine is an HP Z620 workstation, Xeon E5-1660 at 3.3 GHz with 32 GB RAM; the Xeon is not used for compute
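
The configuration mentions FP16 with shared block exponents. The slides do not describe how that format is implemented on the FPGA, so the following is only a generic illustration of the block floating-point idea: a group of values stores one common exponent plus a small integer mantissa per value. The function names, the block contents, and the 10-bit mantissa width are all assumptions made for this sketch, not details of the benchmarked design.

```python
import math


def block_quantize(values, mantissa_bits=10):
    """Quantize a block of floats to integer mantissas sharing one exponent.

    mantissa_bits is a hypothetical magnitude width (sign handled separately).
    Returns (mantissas, shared_exp) with value ~= mantissa * 2**shared_exp.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return [0] * len(values), 0
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1
    _, e = math.frexp(max_abs)
    # Choose the shared exponent so the largest value uses the full mantissa range
    shared_exp = e - mantissa_bits
    limit = 2 ** mantissa_bits - 1
    scale = 2.0 ** shared_exp
    # Round to nearest and clamp, since rounding can overshoot the limit by one
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in values]
    return mantissas, shared_exp


def block_dequantize(mantissas, shared_exp):
    """Reconstruct approximate floats from mantissas and the shared exponent."""
    return [m * 2.0 ** shared_exp for m in mantissas]


if __name__ == "__main__":
    block = [1.0, 0.5, -0.25, 0.001]
    mantissas, shared_exp = block_quantize(block)
    print(mantissas, shared_exp)                     # [512, 256, -128, 1] -9
    print(block_dequantize(mantissas, shared_exp))   # [1.0, 0.5, -0.25, 0.001953125]
```

The example shows the trade-off of the scheme: values close to the block maximum are represented exactly or nearly so, while much smaller values in the same block (here 0.001) lose precision because they share the exponent chosen for the largest one.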

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

Source: Intel measured as of November 2016


# Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2016, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804



