Conference paper, Year: 2020

A comparison of several fault-tolerance methods for the detection and correction of floating-point errors in matrix-matrix multiplication

Abstract

This paper compares several fault-tolerance methods for the detection and correction of floating-point errors in matrix-matrix multiplication. These methods include replication, triplication, Algorithm-Based Fault Tolerance (ABFT) and residual checking (RC). Error correction for ABFT can be achieved either by solving a small-size linear system of equations, or by recomputing corrupted coefficients. We show that both approaches can be used for RC. We provide a synthetic presentation of all methods before discussing their pros and cons. We have implemented all these methods with calls to optimized BLAS routines, and we provide performance data for a wide range of failure rates and matrix sizes.
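As a rough illustration of the ABFT idea mentioned in the abstract, the sketch below checks the row and column checksums of a computed product C against checksums derived directly from A and B, then recomputes any coefficient lying at the intersection of a flagged row and a flagged column. It is not the authors' implementation, which relies on optimized BLAS routines: the function names (`matmul`, `abft_locate`, `abft_correct`), the pure-Python loops and the fixed tolerance `tol` are all assumptions made for readability. In practice the tolerance would have to be scaled to the matrix norms and to the expected rounding error.

```python
def matmul(A, B):
    """Plain triple-loop product, standing in for an optimized BLAS call."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def abft_locate(A, B, C, tol=1e-9):
    """Return (bad_rows, bad_cols): indices whose checksums disagree with C.

    tol is a hypothetical absolute threshold chosen for this sketch.
    """
    n, k, m = len(A), len(B), len(B[0])
    # Expected row sums of C, computed as A (B e) without touching C.
    row_sum_B = [sum(B[p]) for p in range(k)]
    expected_row = [sum(A[i][p] * row_sum_B[p] for p in range(k))
                    for i in range(n)]
    # Expected column sums of C, computed as (e^T A) B without touching C.
    col_sum_A = [sum(A[i][p] for i in range(n)) for p in range(k)]
    expected_col = [sum(col_sum_A[p] * B[p][j] for p in range(k))
                    for j in range(m)]
    bad_rows = [i for i in range(n)
                if abs(sum(C[i]) - expected_row[i]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(C[i][j] for i in range(n)) - expected_col[j]) > tol]
    return bad_rows, bad_cols


def abft_correct(A, B, C, tol=1e-9):
    """Recompute in place every coefficient at a flagged (row, col) crossing."""
    bad_rows, bad_cols = abft_locate(A, B, C, tol)
    for i in bad_rows:
        for j in bad_cols:
            C[i][j] = sum(A[i][p] * B[p][j] for p in range(len(B)))
    return bad_rows, bad_cols
```

Calling `abft_locate(A, B, C)` on an uncorrupted product should return two empty lists. After a single coefficient of C is perturbed, it should flag exactly that row and column, and `abft_correct` then restores the entry by recomputation. The checksums cost two matrix-vector products each way, which is what makes this family of methods cheaper than replication.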
Main file
resilience-europar-hal.pdf (461.42 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-03029309, version 1 (30-11-2020)

Identifiers

  • HAL Id: hal-03029309, version 1

Cite

Valentin Le Fèvre, Thomas Herault, Julien Langou, Yves Robert. A comparison of several fault-tolerance methods for the detection and correction of floating-point errors in matrix-matrix multiplication. Resilience 2020 - 12th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids (colocated with Euro-Par), Aug 2020, Warsaw, Poland. pp.1-14. ⟨hal-03029309⟩
