Exascale computing: Opportunities and Challenges
The Department of Energy has embarked on a 10-year plan to develop supercomputers with a sustained performance in excess of 1 Exaflop/s. This program faces many challenges, in particular energy, resilience and scale. On the other hand, research done in support of exascale computing may be key to the continued increase in computer performance in the next decade. In my talk I shall describe the existing challenges and possible solutions, and discuss how exascale research will lead into the "post-Moore" era.
Marc Snir is Director of the Mathematics and Computer Science Division at Argonne National Laboratory and the Michael Faiman and Saburo Muroga Professor in the Department of Computer Science at the University of Illinois at Urbana-Champaign. He currently pursues research in parallel computing.
He was head of the Computer Science Department from 2001 to 2007. Until 2001, he was a senior manager at the IBM T. J. Watson Research Center, where he led the Scalable Parallel Systems research group, which was responsible for major contributions to the IBM SP scalable parallel system and to the IBM Blue Gene system.
Marc Snir received a Ph.D. in Mathematics from the Hebrew University of Jerusalem in 1979, worked at NYU on the NYU Ultracomputer project in 1980-1982, and was at the Hebrew University of Jerusalem in 1982-1986, before joining IBM. Marc Snir was a major contributor to the design of the Message Passing Interface. He has published numerous papers and given many presentations on computational complexity, parallel algorithms, parallel architectures, interconnection networks, parallel languages and libraries and parallel programming environments.
Marc is an Argonne Distinguished Fellow, AAAS Fellow, ACM Fellow and IEEE Fellow. He has Erdős number 2 and is a mathematical descendant of Jacques Salomon Hadamard.
Failure prediction in HPC: current situation and open questions
Failure prediction is central to failure avoidance and can potentially improve fault tolerance in HPC systems significantly: by predicting future failure occurrences, one could minimize the effects of failures by taking proactive actions. In the past few years, several key results have demonstrated that new anomaly/symptom detection and correlation analysis algorithms can provide precise on-line failure prediction. In this talk, we will present the current situation in this domain for component-level and system-level failure prediction. We will describe proactive actions that could be triggered and show that the failure prediction time of the best prediction algorithms is consistent with the latency of proactive actions. We will also show that, despite impressive progress, there are still many open questions, and we will discuss several of them.
Franck Cappello holds a research director position at INRIA and is a visiting research professor in Computer Science at the University of Illinois at Urbana-Champaign. Since 2009, he has been co-director with Marc Snir of the INRIA-Illinois Joint-Laboratory on PetaScale Computing, where he also leads the Resilience/Fault Tolerance effort. He is leading the roadmapping effort on Resilience/Fault Tolerance for EESI2 (European Exascale Software Initiative) and led similar efforts for IESP (International Exascale Software Project) and EESI1. He is the main PI of the G8 ECS (Enabling Climate Simulation at Exascale) project, which gathers researchers from the USA, France, Germany, Japan, Canada and Spain with the objective of identifying scalability, performance and resilience solutions for running the CESM climate model at extreme scale.
The IBM Blue Gene/Q Supercomputer
The IBM Blue Gene/Q supercomputer achieves 20 PetaFlops of peak performance. It is the third generation of supercomputers built upon massive numbers of low-power cores with an embedded network, eclipsing the Blue Gene/P machine twentyfold. In this talk I will trace the evolution of key concepts in hardware and software, describing what worked well, what is still being studied, and which ideas were abandoned. Blue Gene/Q is a grand experiment that has already succeeded on many fronts. Finally, I will comment on the future challenges as we look towards an Exascale computer at the end of the decade.
IBM Fellow Dr. Paul Coteus has led the system packaging of Blue Gene for more than a decade, and serves as Chief Engineer. Following his PhD in Physics at Columbia University, Paul became Assistant Professor of Physics at the University of Colorado, Boulder, and then joined the IBM T.J. Watson Research Center as a Research Staff Member. He has managed the Systems Packaging Group since 1994, where he directs and designs advanced packaging for high-speed electronics, including I/O circuits, memory system design and standardization of high-speed DRAM, and high-performance system packaging. He is a senior member of the IEEE, a member of IBM's Academy of Technology, and an IBM Master Inventor. He has authored more than 90 papers in the field of electronic packaging, and holds over 100 US patents. He is currently working full time on IBM's next-generation Exascale computer system.
Conference Proceedings will be published by