Efficient Neural Network Training for an AI Radiologist on Intel® Xeon® based Supercomputers

Vikram A. Saletore, Ph.D., Principal Engineer, AI Products Group, Intel
Valeriu Codreanu, Ph.D. and Damian Podareanu, MSc, Research & Data Scientist, SURFsara B.V.
Lucas A. Wilson, Ph.D. and Alex Filby, HPC and AI Engineering, Dell EMC

The AI Conference, Sept. 4-7, San Francisco

# Agenda

- AI Usages & Performance Drivers
- Efficient Scaling of Neural Network Training on Supercomputers
- AI Radiologist Trained on Intel® Xeon® Scalable Processors
- Call To Action

## AI Usage Growth

#### Consumer

Smart Assistants, Chatbots, Search, Personalization, Augmented Reality, Robots

#### Health

Enhanced Diagnostics, Drug Discovery, Patient Care, Research, Sensory Aids

#### **Finance**

Algorithmic Trading, Fraud Detection, Research, Personal Finance, Risk Mitigation

#### Retail

Support, Experience, Marketing, Merchandising, Loyalty, Supply Chain, Security

#### Gov't

Defense, Data Insights, Safety & Security, Resident Engagement, Smarter Cities

#### Energy

Oil & Gas Exploration, Smart Grid, Operational Improvement, Conservation

#### **Transport**

Autonomous Cars, Automated Trucking, Aerospace, Shipping, Search & Rescue

#### Industrial

Factory Automation, Predictive Maintenance, Precision Agriculture, Field Automation

#### Other

Advertising, Education, Gaming, Professional & IT Services, Telco/Media, Space Exploration

\*Other names and brands may be claimed as the property of others.
#### Performance Drivers for AI Workloads

#### Compute

#### SW Optimizations

#### Fabric

Intel® Omni-Path™ Architecture Fabric

# AI Portfolio

**END-TO-END COMPUTE**

<sup>a</sup>Alpha available <sup>†</sup>Beta available <sup>‡</sup>Future

All products, computer systems, dates, and figures are preliminary based on current expectations, and are subject to change without notice.

SYSTEMS & COMPONENTS

# Intel® Xeon® Scalable Processors

The foundation of data center innovation

- BSC: https://www.bsc.es/
- \*TACC (Texas Advanced Computing Center): https://www.tacc.utexas.edu/
- \*DellEMC HPC and AI Innovation Lab

Architected for efficient, secure, and agile HPC

Supercomputing center

# Efficient scaling of Neural Network training on supercomputers

Valeriu Codreanu, Ph.D. and Damian Podareanu
Compute Services, SURFsara B.V.

#### Intel & SURFsara IPCC\* Team

- Valeriu Codreanu, Ph.D. (PI), Senior HPC Consultant, SURFsara\* B.V., The Netherlands, Intel Parallel Computing Center
- Vikram Saletore, Ph.D. (Co-PI), Principal Engineer & Performance Architect, AI Products Group, Intel Corp.
# IPCC@SURFsara: Scaling up Deep Learning

#### Research goals:

- Speed up time-to-train for deep neural network models on large datasets
- Improve convergence accuracy
- Generalize the methodology across Intel® CPU architectures

#### **Main Results**

- Efficient scaling
  - 512 2S Intel® Xeon® 8160 nodes, with a time-to-train (TTT) of 44 minutes on ImageNet-1K
- Improved SOTA using a reduced number of epochs on ImageNet-1K

## Accuracy vs Large Batch Size

#### **Datasets**

- ImageNet-1K | 1.2 million | 1000 categories => ~1200 examples / class
- Chest-Xray14 | 0.07 million | 14 categories => ~200-20000 examples / class

#### Training from scratch (< 2% accuracy degradation)

ImageNet-1K | Batch size up to 32K | ~40 updates / epoch | 70-90 epochs

#### Fine tuning (< 2% accuracy degradation)

Chest-Xray14 | Batch size up to 8K | ~10 updates / epoch | 70-90 epochs

# Accuracy, Training Epochs, HW Scaling

- Achieving reasonably good to significantly better accuracy requires:
  - Increased training time with a fixed level of HW scaling
  - Increased HW scaling for a desired training time
- We show results that trade off accuracy against the number of training epochs:
  - >74.0% Top-1 Accuracy
  - >75.5% Top-1 Accuracy
  - >76.5% Top-1 Accuracy
- Using several hardware architectures
  - Intel® Xeon® Platinum Processor Family with Intel® Omni-Path Architecture (Intel® OPA) Fabric

BSC: https://www.bsc.es/

ResNet-50 Scaling on 2S Intel® Xeon® Platinum 8160 Processor Cluster

MareNostrum4, Barcelona Supercomputing Center

- 90% Scaling Efficiency
- Top-1/Top-5 > 74%/92%
- Global BS=8192
- Up to:
  - Throughput: 15170 Img/sec
  - Time-To-Train: 70 minutes

Best Practices From SURFsara B.V: https://surfdrive.surf.nl/files/index.php/s/xrEFLPvo7IDRARs

# ResNet-50 Training Time to 74% Top-1 Accuracy

Intel® Xeon® Platinum 8160 Processor Cluster MareNostrum4\*

#### Intel® Distribution of Caffe\* with ImageNet-1K dataset

<sup>\*</sup>MareNostrum4 (Barcelona Supercomputing Center):
https://www.bsc.es/marenostrum

# Extremely Large Batch Size convergence

# Weight decay scaling throughout training eases the optimisation problem further

64K batch size: convergence in 2100 iterations to ~74% top-1 accuracy!

| Batch size     | 8K    | 16K   | 32K   | 48K | 64K |
|----------------|-------|-------|-------|-----|-----|
| IBM [1]        | 75%   | -     | -     | -   | -   |
| Facebook [2]   | 76.2% | 75.2% | 72.4% | -   | 66% |
| You et al. [3] | 75.3% | 75.3% | 74.7% | -   | 72% |
| This work [4]  | 76.6% | 76.3% | 75.3% |     | 74% |

[1] Cho, M., Finkler, U., Kumar, S., Kung, D., Saxena, V., Sreedhar, D.: PowerAI DDL. arXiv

[2] Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., He, K.: Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv

[3] You, Y., Zhang, Z., Demmel, J., Keutzer, K., Hsieh, C.J.: ImageNet training in minutes. arXiv

[4] Codreanu, V., Podareanu, D., Saletore, V.: Scale out for large minibatch SGD: Residual network training on ImageNet-1K with improved accuracy and reduced time to train. arXiv

# Increasing Accuracy Using Collapsed Ensembles

| No. on plot | Top-1 % acc. | Top-5 % acc. |
|-------------|--------------|--------------|
| 1           | 68.33        | 88.71        |
| 1c          | 75.50        | 92.83        |
| 2           | 71.54        | 90.78        |
| 2c          | 76.15        | 93.17        |
| 3           | 73.28        | 91.58        |
| 3c          | 76.50        | 93.24        |
| 4           | 73.31        | 91.53        |
| 4c          | 76.57        | 93.24        |
| 5           | 73.89        | 91.97        |
| 5c          | 76.83        | 93.32        |
| 6           | 74.49        | 92.13        |
| 6c          | 76.81        | 93.32        |
| 7c          | 76.70        | 93.32        |

Fig. 3.
Plot of learning rate behaviour when obtaining the ensemble snapshots

#### **Collapsed ensembles**

Similar in fashion to the learning-rate collapses:

- However, after performing a partial collapse, the LR is increased again
- Cycling the LR:
  - Improves single-model accuracy faster
  - Ensemble of the collapsed points leads to 77.5% accuracy using a ResNet-50 regular training budget

https://github.com/sara-nl/caffe/tree/master/models/intel\_optimized\_models/multinode/resnet50\_custom\_lr

# Improving Hardware Efficiency

- Using 2 training processes per node increases HW efficiency significantly!
- Each process has a local batch size of 16. At 448 nodes, the global batch size is 14336, so there are no convergence issues.
- Each process is pinned to a separate NUMA domain
- Scaling efficiency is not negatively impacted (until 512 nodes)
- Caffe achieves good HW efficiency now!

#### Comparing efficiency of CPU to GPU-based training of ResNet50. GPU peak performance **does not include** the CPU hosts

| Work            | HW type     | # nodes (devices) | Peak [FP32] | TTT    | HW eff. |
|-----------------|-------------|-------------------|-------------|--------|---------|
| This work       | SKX 8160    | 448 (896)         | 2682 TF     | 58 min | 12.36   |
| Facebook [5]    | NVIDIA P100 | 32 (256)          | 2658 TF     | 60 min | 12.03   |
| You et al. [26] | SKX 8160    | 1024 (2048)       | 6144 TF     | 48 min | 6.51    |

https://github.com/sara-nl/caffe/tree/master/models/intel\_optimized\_models/multinode/resnet50\_448 nodes

# Best Practices To Improve Accuracy of ResNet-50

| Technique            | Approximate top-1 accuracy |
|----------------------|----------------------------|
| Default augmentation | 74.0%                      |
| Warm-up of LR        | 75.4%                      |
| Polynomial decay     | 75.7%                      |
| Weight decay scaling | 76.2%                      |
| Single collapse      | 76.6%                      |
| Collapsed ensembles  | 77.5%                      |

# Summary of Caffe Work

- Extensively evaluated Intel® Xeon® Platinum Processors with Intel® OPA Fabric on training ResNet-50 with ImageNet-1K:
  - >90% scaling efficiency up to 256 nodes to achieve Top-1 >74% accuracy
  - >85% scaling efficiency from 256 to 512 SKX nodes, achieving 76.5% Top-1
  - NUMA-awareness improves throughput significantly
- Introduced several techniques to improve accuracy:
  - Collapses
  - Weight decay scaling
- Achieve SOTA at batch sizes of up to 64K
- Models achieve a SOTA of >76.5% Top-1 accuracy for the ResNet-50 benchmark
- Collapsed ensemble techniques lead to 77.5% accuracy using ResNet-50

# Extending to TensorFlow and to scientific disciplines

# TensorFlow Scalability on Intel® Xeon® Processors

#### ResNet-50 Scaling Efficiency With TensorFlow

#### Intel® Xeon® Platinum 8160 processor Cluster

Stampede2 at TACC

81% Efficiency with TensorFlow+Horovod

# ResNet-50 with ImageNet-1K on 256 Nodes on Stampede2/TACC:

- Improved single-node performance with multiple workers per node
- 81% scaling efficiency
- Batch size of 64 per worker: Global BS=64K
- 16400 images/sec on 256 nodes
- 26700 images/sec on 512 nodes
- Time-To-Train: ~2 hrs on 256 nodes

First to achieve convergence with state-of-the-art accuracy with TensorFlow on a 256-node Intel® Xeon® cluster

## Scaling up Training On ImageNet-1K

Intel® Xeon® Gold 6148F processor, Zenith\* cluster at DellEMC

#### **DenseNet-121 Training at Scale**

| Global batch size | # nodes | # epochs | Time/epoch (secs/epoch) | Time-To-Train | % Top-1 Accuracy |
|-------------------|---------|----------|-------------------------|---------------|------------------|
| 8192              | 64      | 90       | 346 s                   | 8h40m         | 74.9             |
| 16384             | 128     | 64       | 187.5 s                 | 3h20m         | 74.5             |

These models are to be further fine-tuned on the real-world dataset: Chest-Xray14

# Transfer Learning Using Highly Accurate Benchmark for Real Use Case

Fine-tuned ResNet-50 that was pre-trained on ImageNet using the Zenith cluster.

#### To increase accuracy:

- When picking a pre-trained checkpoint, do not pick the last one.
- Start with the learning rate at which the model was training when it was checkpointed.
- Perform gradual warmup of the learning rate, proportionally to the global batch size.

#### **Comparative timings for 128-node fine-tuning run**

| Global batch size | Framework  | # nodes | Time/epoch |
|-------------------|------------|---------|------------|
| 4096              | Keras      | 128     | 85 s       |
| 4096              | TensorFlow | 128     | 18 s       |

# An AI Radiologist Trained on Intel® Xeon® Scalable Processors

Automatically Identifying Thoracic Pathologies in Chest X-rays

Lucas A. Wilson, Ph.D. and Alex Filby
HPC and AI Engineering, DellEMC

# DellEMC AI Engineering Team – Intel Projects

- Onur Celebioglu, Director, HPC and AI Engineering
- Quy Ta, Manager, AI Engineering
- Lucas A. Wilson, Artificial Intelligence Research
- Vineet Gundecha, AI Software Principal Engineer
- Srinivas Varadharajan, AI Software Principal Engineer
- Pei Yang, AI Software Principal Engineer
- Alex Filby, Sr. Systems Development Engineer

## The Importance of Early Detection

Emphysema is estimated to affect more than 3 million people in the U.S.<sup>1</sup> and 65 million people worldwide<sup>2</sup>.

- Severe emphysema (types 3 / 4) is life-threatening
- Early detection is important to try to halt progression

Pneumonia affects more than 1 million people each year in the U.S.<sup>3</sup>, and more than 450 million<sup>4</sup> each year worldwide.
- 1.4 million deaths per year worldwide
- Treatable with early detection

<sup>1.</sup> https://www.ctsnet.org/article/airway-bypass-stenting-severe-emphysema
<sup>2.</sup> http://www.who.int/respiratory/copd/burden/en/
<sup>3.</sup> https://www.cdc.gov/features/pneumonia/index.html
<sup>4.</sup> https://doi.org/10.1016%2FS0140-6736%2810%2961459-6

#### CheXNet

Developed at Stanford University, CheXNet is a model for identifying thoracic pathologies from the NIH ChestXray14 dataset

- DenseNet121 topology
- Pretrained on ImageNet
- Dataset contains 112K images
- Multicategory / Multilabel
- Unbalanced

http://academictorrents.com/details/557481faacd824c83fbf57dcf7b6da9383b3235a

https://stanfordmlgroup.github.io/projects/chexnet/

# **Building CheXNet**

## Training CheXNet

#### High-accuracy model

- 84% accuracy identifying pneumonia
- 89% accuracy identifying emphysema

#### Baseline performance on CPUs

- 4 images per second
- 1 epoch takes 5 hours!

# Parallelizing CheXNet

#### Faster Model Development with Distributed Deep Learning

# CheXNet – Parallel Speedup

Dell EMC PowerEdge C6420 with dual Intel® Xeon® Scalable Gold 6148 on Intel® Omni-Path fabric.

# Parallelizing CheXNet – Accuracy

#### Parallelizing CheXNet – Accuracy Relative to single-process

#### Can We Do Better?

DenseNet121 is a very deep topology with lots of batch normalization

- Batch normalization with large batches (thousands) can hinder convergence

VGG16 and ResNet50 are shallower topologies with less batch normalization

- ResNet50 contains fewer than half the batch normalization layers of DenseNet121
- VGG16 has no batch normalization

#### Why not try another topology?
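As a rough sanity check on the CheXNet baseline figures quoted above (4 images per second, about 5 hours per epoch), epoch time is simply the number of training images divided by throughput. The sketch below assumes roughly 70,000 training images, matching the "0.07 million" figure given for Chest-Xray14 earlier in the deck; the function name is illustrative and does not come from the original slides.

```python
def epoch_hours(num_images: int, images_per_second: float) -> float:
    """Wall-clock hours for one pass over the data at a fixed throughput."""
    return num_images / images_per_second / 3600.0


# Assumption: ~70,000 training images ("0.07 million" in the datasets slide).
baseline_hours = epoch_hours(70_000, 4.0)
print(f"baseline epoch at 4 img/s: {baseline_hours:.2f} h")

# For comparison, a full pass over all 112K images at the same rate.
full_pass_hours = epoch_hours(112_000, 4.0)
print(f"full 112K-image pass at 4 img/s: {full_pass_hours:.2f} h")
```

Under that assumption the single-process epoch comes out a little under 5 hours, which is in line with the slide; the same helper can be reused with any measured throughput to estimate epoch time for a parallel run.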
#### Accuracy of VGG16 relative to DenseNet-121

#### Accuracy of ResNet50 relative to DenseNet-121

#### Categorical Accuracy of ResNet-50 based AI Radiologist

## Training Throughput with VGG and ResNet

**2063x faster** than sequential DenseNet on Dell EMC PowerEdge C6420 with dual Intel® Xeon® Scalable Gold 6148 on Intel® Omni-Path fabric. ResNet50 tests performed with TensorFlow+Horovod

#### Time to Solution DenseNet vs VGG vs ResNet

Dell EMC PowerEdge C6420 with dual Intel® Xeon® Scalable Gold 6148 on Intel® Omni-Path fabric. ResNet50 tests performed with TensorFlow+Horovod

# Call To Action

- TensorFlow: https://ai.intel.com/tensorflow/
- Blog: http://www.techenablement.com/surfsara-achieves-accuracy-performance-breakthroughs-deep-learning-wide-network-training/
- SURFsara\* Caffe\* Blog: https://blog.surf.nl/en/imagenet-1k-training-on-intel-xeon-phi-in-less-than-40-minutes/
- SURFsara-Intel Paper: https://arxiv.org/pdf/1711.04291.pdf
- Intel Blog: https://ai.intel.com/accelerating-deep-learning-training-inference-system-level-optimizations/
- SURFsara\* Best Practices for Caffe\*: https://github.com/sara-nl/caffe
- SURFsara Best Practices for TensorFlow: https://surfdrive.surf.nl/files/index.php/s/xrEFLPvo7IDRARs
- Dell EMC Ready Solutions for AI Blog: https://community.dellemc.com/community/products/rs for ai engage

Use Intel's performance-optimized libraries & frameworks

Contact us/Intel for help and collaboration opportunities

Vikram Saletore, Ph.D., vikram.a.saletore@intel.com
Valeriu Codreanu, Ph.D., valeriu.codreanu@surfsara.nl
Lucas A. Wilson, Ph.D.
luke\_wilson@dell.com, @lucasawilson
Damian Podareanu, MSc, damian.podareanu@surfsara.nl
Alex Filby, alex\_filby@dell.com

# **Legal Notices & Disclaimers**

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

Statements in this document that refer to Intel's plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel's results and plans is included in Intel's SEC filings, including the annual report on Form 10-K.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Performance estimates were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown." Implementation of these updates may make these results inapplicable to your device or system.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.

Intel, the Intel logo, Pentium, Celeron, Atom, Core, Xeon, Movidius, Saffron and others are trademarks of Intel Corporation in the U.S. and/or other countries.

\*Other names and brands may be claimed as the property of others.

© 2018 Intel Corporation.

# Stampede2\*/TACC\* Configuration Details

\*Stampede2/TACC: https://portal.tacc.utexas.edu/user-guides/stampede2

Compute Nodes: 2 sockets Intel® Xeon® Platinum 8160 CPU with 24 cores each @ 2.10GHz for a total of 48 cores per node, 2 Threads per core, L1d 32K; L1i cache 32K; L2 cache 1024K; L3 cache 33792K, 96 GB of DDR4, Intel® Omni-Path Host Fabric Interface, dual-rail.

Software: Intel® MPI Library 2017 Update 4, Intel® MPI Library 2019 Technical Preview, OFI 1.5.0, PSM2 w/ Multi-EP, 10 Gbit Ethernet, 200 GB local SSD, Red Hat\* Enterprise Linux 6.7.
TensorFlow 1.6: Built & Installed from source: https://www.tensorflow.org/install/install\_sources

Model: Topology specs from https://github.com/tensorflow/tpu/tree/master/models/official/resnet (ResNet-50); Batch size as stated in the performance chart

Convergence & Performance Model: https://surfdrive.surf.nl/files/index.php/s/xrEFLPvo7IDRARs

Dataset: ImageNet2012-1K: http://www.image-net.org/challenges/LSVRC/2012/

#### Performance measured on 256 Nodes with:

OMP\_NUM\_THREADS=24 HOROVOD\_FUSION\_THRESHOLD=134217728 export I\_MPI\_FABRICS=tmi, export I\_MPI\_TMI\_PROVIDER=psm2 \
mpirun -np 512 -ppn 2 python resnet\_main.py --train\_batch\_size 8192 --train\_steps 14075 --num\_intra\_threads 24 --num\_inter\_threads 2 --mkl=True --data\_dir=/scratch/04611/valeriuc/tf-1.6/tpu\_rec/train --model\_dir model\_batch\_8k\_90ep --use\_tpu=False --kmp\_blocktime 1

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.

# \*DellEMC Zenith Cluster Configuration Details

#### \*DellEMC Internal Cluster:

Compute Nodes: 2 sockets Intel® Xeon® Gold 6148F CPU with 20 cores each @ 2.40GHz for a total of 40 cores per node, 2 Threads per core, L1d 32K; L1i cache 32K; L2 cache 1024K; L3 cache 33792K, 96 GB of DDR4, Intel® Omni-Path Host Fabric Interface, dual-rail.

Software: Intel® MPI Library 2017 Update 4, Intel® MPI Library 2019 Technical Preview, OFI 1.5.0, PSM2 w/ Multi-EP, 10 Gbit Ethernet, 200 GB local SSD, Red Hat\* Enterprise Linux 6.7.

TensorFlow 1.6: Built & Installed from source: https://www.tensorflow.org/install/install\_sources

ResNet-50 Model: Topology specs from https://github.com/tensorflow/tpu/tree/master/models/official/resnet

DenseNet-121 Model: Topology specs from https://github.com/liuzhuang13/DenseNet

Convergence & Performance Model: https://surfdrive.surf.nl/files/index.php/s/xrEFLPvo7IDRARs

Dataset: ImageNet2012-1K: http://www.image-net.org/challenges/LSVRC/2012/

CheXNet: https://stanfordmlgroup.github.io/projects/chexnet/

#### Performance measured with:

OMP\_NUM\_THREADS=24 HOROVOD\_FUSION\_THRESHOLD=134217728 export I\_MPI\_FABRICS=tmi, export I\_MPI\_TMI\_PROVIDER=psm2 \
mpirun -np 512 -ppn 2 python resnet\_main.py --train\_batch\_size 8192 --train\_steps 14075 --num\_intra\_threads 24 --num\_inter\_threads 2 --mkl=True --data\_dir=/scratch/04611/valeriuc/tf-1.6/tpu\_rec/train --model\_dir model\_batch\_8k\_90ep --use\_tpu=False --kmp\_blocktime
1
# MareNostrum4/BSC\* Configuration Details

\*MareNostrum4/Barcelona Supercomputing Center: https://www.bsc.es/

Compute Nodes: 2 sockets Intel® Xeon® Platinum 8160 CPU with 24 cores each @ 2.10GHz for a total of 48 cores per node, 2 Threads per core, L1d 32K; L1i cache 32K; L2 cache 1024K; L3 cache 33792K, 96 GB of DDR4, Intel® Omni-Path Host Fabric Interface, dual-rail.

Software: Intel® MPI Library 2017 Update 4, Intel® MPI Library 2019 Technical Preview, OFI 1.5.0, PSM2 w/ Multi-EP, 10 Gbit Ethernet, 200 GB local SSD, Red Hat\* Enterprise Linux 6.7.

Intel® Distribution of Caffe\*: http://github.com/intel/caffe/, revision 8012927bf2bf70231cbc7ff55de0b1bc11de4a69.

Intel® MKL version: mklml\_lnx\_2018.0.20170425; Intel® MLSL version: l\_mlsl\_2017.1.016

**Model**: Topology specs from https://github.com/intel/caffe/tree/master/models/intel\_optimized\_models (ResNet-50) and modified for wide-ResNet-50. Batch size as stated in the performance chart

**Time-To-Train**: measured using the "train" command. Data copied to memory on all nodes in the cluster before training. No input image data transferred over the fabric while training. Performance measured for node counts 128, 192, 256, 400, and 512; performance projected for node counts 1-64.
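Scaling efficiency figures like those quoted throughout this deck are commonly computed as measured throughput divided by the ideal, linearly scaled throughput from a reference node count. The sketch below applies that definition to the Stampede2 TensorFlow throughputs reported earlier (16400 images/sec on 256 nodes, 26700 images/sec on 512 nodes). The function name and the choice of 256 nodes as the reference point are assumptions; the slides do not state which baseline their percentages use.

```python
def scaling_efficiency(ref_nodes: int, ref_throughput: float,
                       nodes: int, throughput: float) -> float:
    """Measured throughput as a fraction of ideal linear scaling
    extrapolated from the reference measurement."""
    ideal_throughput = ref_throughput * (nodes / ref_nodes)
    return throughput / ideal_throughput


# Throughputs from the Stampede2 ResNet-50 slide (images/sec).
eff_256_to_512 = scaling_efficiency(256, 16400.0, 512, 26700.0)
print(f"256 -> 512 nodes: {eff_256_to_512:.1%}")
```

With these inputs the relative efficiency from 256 to 512 nodes works out to roughly 81%, which is close to the figure on the slide, although the slide itself may be measuring against a different baseline.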
#### Performance measured with:

export OMP\_NUM\_THREADS=44 (the remaining 4 cores are used for driving communication), export I\_MPI\_FABRICS=tmi, export I\_MPI\_TMI\_PROVIDER=psm2

OMP\_NUM\_THREADS=44 KMP\_AFFINITY="proclist=[0-87],granularity=thread,explicit" KMP\_HW\_SUBSET=1t MLSL\_NUM\_SERVERS=4 mpiexec.hydra -PSM2 -l -n \$SLURM\_JOB\_NUM\_NODES -ppn 1 -f hosts2 -genv OMP\_NUM\_THREADS 44 -env KMP\_AFFINITY "proclist=[0-87],granularity=thread,explicit" -env KMP\_HW\_SUBSET 1t -genv I\_MPI\_FABRICS tmi -genv I\_MPI\_HYDRA\_BRANCH\_COUNT \$SLURM\_JOB\_NUM\_NODES -genv I\_MPI\_HYDRA\_PMI\_CONNECT alltoall sh -c 'cat /ilsvrc12\_train\_lmdb\_striped\_64/data.mdb > /dev/null ; cat /ilsvrc12\_val\_lmdb\_striped\_64/data.mdb > /dev/null ; ulimit -u 8192 ; ulimit -a ; numactl -H ; /caffe/build/tools/caffe train --solver=/caffe/models/intel\_optimized\_models/multinode/resnet\_50\_256\_nodes\_8k\_batch/solver\_poly\_quick\_large.prototxt -engine "MKL2017"'
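The `--train_steps 14075` value in the TensorFlow launch commands listed in the Stampede2 and Zenith configuration details appears consistent with the `90ep` suffix of the model directory name. A quick check, assuming the standard ImageNet-1K (ILSVRC-2012) training set size of 1,281,167 images, which the deck does not state explicitly; the helper names are illustrative only.

```python
# Assumption: standard ILSVRC-2012 training set size (not stated in the deck).
IMAGENET_TRAIN_IMAGES = 1_281_167


def epochs_from_steps(steps: int, global_batch: int,
                      dataset_size: int = IMAGENET_TRAIN_IMAGES) -> float:
    """Number of passes over the dataset covered by `steps` updates."""
    return steps * global_batch / dataset_size


def steps_for_epochs(epochs: int, global_batch: int,
                     dataset_size: int = IMAGENET_TRAIN_IMAGES) -> int:
    """Whole number of updates that fit in `epochs` passes (rounded down)."""
    return epochs * dataset_size // global_batch


print(f"{epochs_from_steps(14075, 8192):.2f} epochs")
print(f"{steps_for_epochs(90, 8192)} steps for 90 epochs")
```

The same arithmetic gives the "updates per epoch" figures quoted in the accuracy slides: dividing the dataset size by the global batch size yields roughly 40 updates per epoch at a 32K batch on ImageNet-1K.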