A. R. Alameldeen and D. A. Wood, Adaptive Cache Compression for High-Performance Processors, ACM SIGARCH Computer Architecture News, vol.32, issue.2, pp.212-223, 2004.
DOI : 10.1145/1028176.1006719

S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer et al., Rodinia: A benchmark suite for heterogeneous computing, 2009 IEEE International Symposium on Workload Characterization (IISWC), pp.44-54, 2009.
DOI : 10.1109/IISWC.2009.5306797

S. Collange, M. Daumas, D. Defour, and D. Parello, Barra: A Parallel Functional Simulator for GPGPU, 2010 IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, pp.351-360, 2010.
DOI : 10.1109/MASCOTS.2010.43

S. Collange, D. Defour, and A. Tisserand, Power Consumption of GPUs from a Software Perspective, ICCS 2009, pp.922-931, 2009.
DOI : 10.1007/978-3-642-01970-8_92
URL : https://hal.archives-ouvertes.fr/hal-00348672

S. Collange, D. Defour, and Y. Zhang, Dynamic Detection of Uniform and Affine Vectors in GPGPU Computations, Europar 3rd Workshop on Highly Parallel Processing on a Chip (HPPC), volume LNCS 6043, pp.46-55, 2009.
DOI : 10.1007/978-3-642-14122-5_8
URL : https://hal.archives-ouvertes.fr/hal-00396719

B. Coutinho, D. Sampaio, F. M. Pereira, and W. Meira, Divergence Analysis and Optimizations, 2011 International Conference on Parallel Architectures and Compilation Techniques, 2011.
DOI : 10.1109/PACT.2011.63

M. Dechene, E. Forbes, and E. Rotenberg, Multithreaded instruction sharing, 2010.

E. Demers, Evolution of AMD's graphics core, and preview of Graphics Core Next. AMD Fusion Developer Summit keynote, 2011.

J. Dusser, T. Piquet, and A. Seznec, Zero-content augmented caches, Proceedings of the 23rd International Conference on Supercomputing, ICS '09, pp.46-55, 2009.
DOI : 10.1145/1542275.1542288
URL : https://hal.archives-ouvertes.fr/inria-00337742

C. W. Everitt, Bandwidth compression for shader engine store operations. US Patent 7886116, assignee NVIDIA, 2011.

E. S. Fetzer, M. Gibson, A. Klein, N. Calick, C. Zhu et al., A fully bypassed six-issue integer datapath and register file on the Itanium-2 microprocessor, IEEE Journal of Solid-State Circuits, vol.37, issue.11, pp.1433-1440, 2002.
DOI : 10.1109/JSSC.2002.803948

W. W. Fung, I. Sham, G. Yuan, and T. M. Aamodt, Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007), pp.407-420, 2007.
DOI : 10.1109/MICRO.2007.30

M. Garland and D. B. Kirk, Understanding throughput-oriented architectures, Communications of the ACM, vol.53, issue.11, pp.58-66, 2010.
DOI : 10.1145/1839676.1839694

M. Gebhart, D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally et al., Energy-efficient mechanisms for managing thread context in throughput processors, Proceedings of the 38th annual international symposium on Computer architecture, pp.235-246, 2011.

S. Hong and H. Kim, An integrated GPU power and performance model, ACM SIGARCH Computer Architecture News, vol.38, issue.3, pp.280-289, 2010.
DOI : 10.1145/1816038.1815998
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.332.3923

K. M. Lepak, G. B. Bell, and M. H. Lipasti, Silent stores and store value locality, IEEE Transactions on Computers, vol.50, issue.11, pp.1174-1190, 2001.
DOI : 10.1109/12.966493

J. E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, NVIDIA Tesla: A Unified Graphics and Computing Architecture, IEEE Micro, vol.28, issue.2, pp.39-55, 2008.
DOI : 10.1109/MM.2008.31

J. Meng, J. Sheaffer, and K. Skadron, Exploiting inter-thread temporal locality for chip multithreading, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2010.

P. Micikevicius, 3D finite difference computation on GPUs using CUDA, Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-2, pp.79-84, 2009.
DOI : 10.1145/1513895.1513905
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.472.447

J. Nickolls and W. J. Dally, The GPU Computing Era, IEEE Micro, vol.30, issue.2, pp.56-69, 2010.
DOI : 10.1109/MM.2010.41

D. Patil, O. Azizi, M. Horowitz, R. Ho, and R. Ananthraman, Robust Energy-Efficient Adder Topologies, 18th IEEE Symposium on Computer Arithmetic (ARITH '07), pp.16-28, 2007.
DOI : 10.1109/ARITH.2007.31
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.102.6716

S. Przybylski, The performance impact of block sizes and fetch strategies, ACM SIGARCH Computer Architecture News, vol.18, issue.3, pp.160-169, 1990.
DOI : 10.1145/325096.325135

L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash et al., Larrabee: a many-core x86 architecture for visual computing, ACM Transactions on Graphics, vol.27, issue.3, pp.1-15, 2008.
DOI : 10.1145/1360612.1360617

A. Seznec, Concurrent support of multiple page sizes on a skewed associative TLB, IEEE Transactions on Computers, vol.53, issue.7, pp.924-927, 2004.
DOI : 10.1109/TC.2004.21

H. Shim, N. Chang, and M. Pedram, A compressed frame buffer to reduce display power consumption in mobile systems, Proceedings of the 2004 Asia and South Pacific Design Automation Conference, ASP-DAC '04, pp.818-823, 2004.

V. Volkov and J. W. Demmel, Benchmarking GPUs to tune dense linear algebra, 2008 SC, International Conference for High Performance Computing, Networking, Storage and Analysis, pp.1-11, 2008.
DOI : 10.1109/SC.2008.5214359
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.218.3436

H. Wong, M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos, Demystifying GPU microarchitecture through microbenchmarking, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS), 2010.
DOI : 10.1109/ISPASS.2010.5452013
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.189.5309

Y. Zhang, J. Yang, and R. Gupta, Frequent value locality and value-centric data cache design, pp.150-159, 2000.
DOI : 10.1145/378993.379235
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.33.5641