R. C. Agarwal, F. G. Gustavson, and M. Zubair, A high-performance matrix-multiplication algorithm on a distributed-memory parallel computer, using overlapped communication, IBM Journal of Research and Development, vol.38, pp.673-682, 1994.

C. Augonnet, S. Thibault, R. Namyst, and P. Wacrenier, StarPU: a unified platform for task scheduling on heterogeneous multicore architectures, Concurrency and Computation: Practice and Experience, vol.23, pp.187-198, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00384363

G. Bosilca, A. Bouteiller, A. Danalis, M. Faverge, T. Herault, and J. Dongarra, PaRSEC: Exploiting Heterogeneity to Enhance Scalability, IEEE Computing in Science & Engineering, vol.15, issue.6, pp.36-45, 2013.

G. Bosilca, A. Bouteiller, and T. Herault,

J. Choi, J. Dongarra, S. Ostrouchov, A. Petitet, D. Walker et al., The design and implementation of the ScaLAPACK LU, QR, and Cholesky factorization routines, Scientific Programming, vol.5, pp.173-184, 1996.

J. Choi, J. Dongarra, S. Ostrouchov, A. Petitet, D. Walker et al., A proposal for a set of parallel basic linear algebra subprograms, Applied Parallel Computing: Computations in Physics, Chemistry and Engineering Science, pp.107-114, 1996.

Distributed Parallel Linear Algebra Software for Multicore Architectures

Elemental: C++ library for distributed-memory linear algebra and optimization

M. Gates, J. Kurzak, A. Charara, A. Yarkhan, and J. Dongarra, SLATE: Design of a Modern Distributed and Accelerated Linear Algebra Library, SC'2019, the IEEE/ACM Conference on High Performance Computing, Networking, Storage and Analysis, 2019.

K. Goto and R. A. van de Geijn, Anatomy of High-performance Matrix Multiplication, ACM Trans. Math. Software, vol.34, issue.3, 2008.

J. Hong and H. Kung, I/O complexity: the red-blue pebble game, STOC '81: Proceedings of the 13th ACM Symposium on Theory of Computing, pp.326-333, 1981.

D. Irony, S. Toledo, and A. Tiskin, Communication lower bounds for distributed-memory matrix multiplication, J. Parallel Distributed Computing, vol.64, issue.9, pp.1017-1026, 2004.

J. Kurzak, M. Gates, A. Charara, A. Yarkhan, I. Yamazaki et al., Linear systems solvers for distributed-memory machines with GPU accelerators, Parallel Processing, pp.495-506, 2019.

G. Kwasniewski, M. Kabić, M. Besta, J. VandeVondele, R. Solcà et al., Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication, 2019.

Parallel Linear Algebra PACKage

J. Pineau, Y. Robert, F. Vivien, and J. Dongarra, Matrix product on heterogeneous master-worker platforms, PPoPP'2008, the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp.53-62, 2008.
URL : https://hal.archives-ouvertes.fr/hal-00803487

Scalable Linear Algebra PACKage

M. D. Schatz, R. A. van de Geijn, and J. Poulson, Parallel matrix multiplication: A systematic journey, SIAM J. Scientific Computing, vol.38, issue.6, pp.C748-C781, 2016.

Task-Based Environment for Scientific Simulation at Extreme Scale

S. Toledo, A survey of out-of-core algorithms in numerical linear algebra, External Memory Algorithms and Visualization, pp.161-180, 1999.

Top500, Top 500 Supercomputer Sites, 2019.

R. A. van de Geijn and J. Watts, SUMMA: Scalable Universal Matrix Multiplication Algorithm, 1995.

S. Williams, A. Waterman, and D. Patterson, Roofline: an insightful visual performance model for multicore architectures, Comm. ACM, vol.52, pp.65-76, 2009.