J. Dongarra, The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.309-322, 2009.
DOI : 10.1177/1094342009347714

K. M. Chandy and L. Lamport, Distributed snapshots: determining global states of distributed systems, ACM Transactions on Computer Systems, vol.3, issue.1, pp.63-75, 1985.
DOI : 10.1145/214451.214456

E. N. Elnozahy, L. Alvisi, Y. Wang, and D. B. Johnson, A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34, issue.3, pp.375-408, 2002.
DOI : 10.1145/568522.568525

T. O. Gorman, The effect of cosmic rays on the soft error rate of a DRAM at ground level, IEEE Transactions on Electron Devices, vol.41, issue.4, pp.553-557, 1994.
DOI : 10.1109/16.278509

J. Ziegler, H. Muhlfeld, C. Montrose, H. Curtis, T. O-'gorman et al., Accelerated testing for cosmic soft-error rate, IBM Journal of Research and Development, vol.40, issue.1, pp.51-72, 1996.
DOI : 10.1147/rd.401.0051

J. Ziegler, M. Nelson, J. Shell, R. Peterson, C. Gelderloos et al., Cosmic ray soft error rates of 16-Mb DRAM memory chips, IEEE Journal of Solid-State Circuits, vol.33, issue.2, pp.246-252, 1998.
DOI : 10.1109/4.658626

A. Moody, G. Bronevetsky, K. Mohror, and B. R. Supinski, Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, Proc. of the ACM, pp.1-11, 2010.
DOI : 10.2172/984082

G. Lu, Z. Zheng, and A. A. Chien, When is multi-version checkpointing needed?, Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale, FTXS '13, 2013.
DOI : 10.1145/2465813.2465821

G. Aupy, A. Benoit, T. Hérault, Y. Robert, F. Vivien et al., On the Combination of Silent Error Detection and Checkpointing, 2013 IEEE 19th Pacific Rim International Symposium on Dependable Computing, 2013.
DOI : 10.1109/PRDC.2013.10

URL : https://hal.archives-ouvertes.fr/hal-00836871

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2004.
DOI : 10.1016/j.future.2004.11.016

Z. Chen, Online-ABFT: An Online Algorithm Based Fault Tolerance Scheme for Soft Error Detection in Iterative Methods, Proc. 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '13, pp.167-176, 2013.

R. E. Lyons and W. Vanderkulk, The Use of Triple-Modular Redundancy to Improve Computer Reliability, IBM Journal of Research and Development, vol.6, issue.2, pp.200-209, 1962.
DOI : 10.1147/rd.62.0200

D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira et al., Detection and correction of silent data corruption for large-scale high-performance computing, Proc. of the ACM/IEEE SC Int. Conf., SC '12, 2012.

X. Ni, E. Meneses, N. Jain, and L. V. Kalé, ACR, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '13, 2013.
DOI : 10.1145/2503210.2503266

A. A. Hwang, I. A. Stefanovici, and B. Schroeder, Cosmic rays don't strike twice, ACM SIGARCH Computer Architecture News, vol.40, issue.1, pp.111-122, 2012.
DOI : 10.1145/2189750.2150989

K. Huang and J. A. Abraham, Algorithm-based fault tolerance for matrix operations, IEEE Trans. Comput, vol.33, issue.6, pp.518-528, 1984.

G. Bosilca, R. Delmas, J. Dongarra, and J. Langou, Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, vol.69, issue.4, pp.410-416, 2009.
DOI : 10.1016/j.jpdc.2008.12.002

M. Shantharam, S. Srinivasmurthy, and P. Raghavan, Fault tolerant preconditioned conjugate gradient for sparse linear system solution, Proceedings of the 26th ACM international conference on Supercomputing, ICS '12, 2012.
DOI : 10.1145/2304576.2304588

A. R. Benson, S. Schmit, and R. Schreiber, Silent error detection in numerical time-stepping schemes, International Journal of High Performance Computing Applications, vol.29, issue.4
DOI : 10.1177/1094342014532297

P. Sao and R. Vuduc, Self-stabilizing iterative solvers, Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA '13, 2013.
DOI : 10.1145/2530268.2530272

M. Heroux and M. Hoemmen, Fault-tolerant iterative methods via selective reliability, Research report SAND2011-3915 C, Sandia National Laboratories, 2011.

G. Bronevetsky and B. De-supinski, Soft error vulnerability of iterative linear algebra methods, Proceedings of the 22nd annual international conference on Supercomputing , ICS '08, pp.155-164, 2008.
DOI : 10.1145/1375527.1375552

J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. Ferreira et al., Combining Partial Redundancy and Checkpointing for HPC, 2012 IEEE 32nd International Conference on Distributed Computing Systems, 2012.
DOI : 10.1109/ICDCS.2012.56