This material is presented to
ensure timely dissemination of scholarly and technical work. Copyright and all
rights therein are retained by authors or by other copyright holders. All persons
copying this information are expected to adhere to the terms and constraints
invoked by each author's copyright. In most cases, these works may not be
reposted without the explicit permission of the copyright holder.
S. Claggett, S. Azimi, and M. Burtscher.
SPDP: An Automatically Synthesized Lossless Compression Algorithm for Floating-Point Data.
Data Compression Conference. March 2018.
Scientific computing produces, transfers, and stores massive amounts of single-
and double-precision floating-point data, making this a domain that can greatly
benefit from data compression. To gain insight into what makes an effective
lossless compression algorithm for such data, we generated over nine million
algorithms and selected the one that yields the highest compression ratio on 26
datasets. The resulting algorithm, called SPDP, comprises four data
transformations that operate exclusively at word or byte granularity.
Nevertheless, SPDP delivers the highest compression ratio on eleven datasets
and, on average, outperforms all but one of the seven compared compressors. An
analysis of SPDP's internals reveals how to build effective compression
algorithms for scientific data.
S. Maleki and M. Burtscher.
Automatic Hierarchical Parallelization of Linear Recurrences.
23rd ACM International Conference on Architectural Support for Programming Languages and Operating Systems. March 2018.
Linear recurrences encompass many fundamental computations including prefix sums
and digital filters. Later result values depend on earlier result values in
recurrences, making it a challenge to compute them in parallel. We present a new
work- and space-efficient algorithm to compute linear recurrences that is amenable
to automatic parallelization and suitable for hierarchical massively-parallel
architectures such as GPUs. We implemented our approach in a domain-specific code
generator that emits optimized CUDA code. Our evaluation shows that, for standard
prefix sums and single-stage IIR filters, the generated code reaches the throughput
of memory copy for large inputs, which cannot be surpassed. On higher-order prefix
sums, it performs nearly as well as the fastest handwritten code from the literature.
On tuple-based prefix sums and 1D digital filters, our automatically parallelized
code outperforms the fastest prior implementations.
J. Coplin and M. Burtscher.
Energy and Power Considerations of GPUs.
Chapter 19 in Advances in GPU Research and Practice. Elsevier, 2017.
A. Yang, J. Coplin, H. Mukka, F. Hesaaraki, and M. Burtscher.
MPC: An Effective Floating-Point Compression Algorithm for GPUs.
Chapter 13 in Advances in GPU Research and Practice. Elsevier, 2017.
M. Burtscher, F. Hesaaraki, H. Mukka, and A. Yang.
Real-Time Synthesis of Compression Algorithms for Scientific Data.
ACM/IEEE International Conference for High-Performance Computing, Networking, Storage and Analysis. November 2016.
Many scientific programs produce large amounts of floating-point data that are saved
for later use. To minimize the storage requirement, it is worthwhile to compress such
data as much as possible. However, existing algorithms tend to compress floating-point
data relatively poorly. As a remedy, we have developed FPcrush, a tool that automatically
synthesizes an optimized compressor for each given input. The synthesized algorithms are
lossless and parallelized using OpenMP. This paper describes how FPcrush is able to
perform this synthesis in real-time, i.e., even when accounting for the synthesis
overhead, it compresses the 16 tested real-world single- and double-precision data files
more quickly than parallel bzip2. Decompression is an order of magnitude faster and
exceeds the throughput of multicore implementations of bzip2, gzip, and FPC. On all but
two of the tested files, as well as on average, the customized algorithms deliver higher
compression ratios than the other three tools.
J. Coplin, A. Yang, A. Poppe, and M. Burtscher.
Increasing Telemetry Throughput Using Customized and Adaptive Data Compression.
AIAA SPACE and Astronautics Forum and Exposition. September 2016.
Due to the increasing generation of massive amounts of data by space-based instruments,
it has become a challenge to transmit even a fraction of a typical spacecraft data volume
back to Earth in a feasible amount of time. Thus, improvements in the ability to losslessly
compress data on-board before transmission represent an important method of increasing
overall data return rates. We describe a custom methodology for compressing spacecraft data
on-board that provides significant improvements in both compression ratio and speed. We have
used data returned by the five-probe THEMIS/ARTEMIS constellation to quantify the compression
ratio and compression speed improvements over a variety of data types (e.g., time-series
and particle data). Our approach results in a 38% improvement in compression ratio and up to
a three-fold improvement in compression throughput and energy efficiency. We argue that such
methods should be adopted by future space missions to maximize the data return to Earth, thus
enabling greater insight and scientific discovery.
S. Maleki, A. Yang, and M. Burtscher.
Higher-Order and Tuple-Based Massively-Parallel Prefix Sums.
ACM SIGPLAN Conference on Programming Language Design and Implementation. June 2016.
Prefix sums are an important parallel primitive, especially in massively-parallel
programs. This paper discusses two orthogonal generalizations thereof, which we call
higher-order and tuple-based prefix sums. Moreover, it describes and evaluates SAM,
a GPU-friendly algorithm for computing prefix sums and other scans that directly
supports higher orders and tuple values. Its templated CUDA implementation unifies
all of these computations in a single 100-statement kernel. SAM is communication-efficient
in the sense that it minimizes main-memory accesses. When computing prefix sums of a
million or more values, it outperforms Thrust and CUDPP on both a Titan X and a K40 GPU.
On the Titan X, SAM reaches memory-copy speeds for large input sizes, which cannot be
surpassed. SAM outperforms CUB, the currently fastest conventional prefix sum
implementation, by up to a factor of 2.9 on eighth-order prefix sums and by up to a factor
of 2.6 on eight-tuple prefix sums.
J. Coplin and M. Burtscher.
Energy, Power, and Performance Characterization of GPGPU Benchmark Programs.
Twelfth Workshop on High-Performance, Power-Aware Computing. May 2016.
This paper studies the effects on energy consumption, power draw, and runtime of a
modern compute GPU when changing the core and memory clock frequencies, enabling or
disabling ECC, using alternate implementations, and varying the program inputs. We
evaluate 34 applications from 5 benchmark suites and measure their power draw over
time on a K20c GPU. Our results show that changing the frequency or the program
implementation can alter the energy, power, and performance by a factor of two or
more. Interestingly, some changes affect these three aspects very unevenly. ECC can
greatly increase the runtime and energy consumption, but only on memory-bound codes.
Compute-bound codes tend to behave quite differently from memory-bound codes, in
particular regarding their power draw. On irregular programs, a small change in
frequency can result in a large change in runtime and energy consumption.
I. Szczyrba, R. Szczyrba, and M. Burtscher.
Geometric Representations of the n-anacci Constants and Generalizations Thereof.
Journal of Integer Sequences, Vol. 19, Article 16.3.8. April 2016.
We introduce geometric representations of the sequence of the n-anacci constants
and generalizations thereof that consist of the ratio limits generated by linear
recurrences of an arbitrary order n with equal real weights p > 0. We represent the
n-anacci constants and their generalizations geometrically by means of the dilation
factors of dilations transforming collections of compact convex sets with increasing
B. Goodarzi, M. Burtscher, and D. Goswami.
Parallel Graph Partitioning on a CPU-GPU Architecture.
Twenty Fifth International Heterogeneity in Computing Workshop. May 2016.
Graph partitioning has important applications in
multiple areas of computing, including scheduling, social
networks, and parallel processing. In recent years, GPUs
have proven successful at accelerating several graph algorithms.
However, the irregular nature of the real-world
graphs poses a problem for GPUs, which favor regularity. In
this paper, we discuss the design and implementation of a
parallel multilevel graph partitioner for a CPU-GPU system.
The partitioner aims to overcome some of the challenges
arising due to memory constraints on GPUs and maximizes
the utilization of GPU threads through suitable load-balancing
schemes. We present a lock-free shared-memory scheme
since fine-grained synchronization among thousands of
threads imposes too high a performance overhead. The
partitioner, implemented in CUDA, outperforms serial Metis
and parallel MPI-based ParMetis. It performs similar to the
shared-memory CPU-based parallel graph partitioner mt-metis.
A. Yang, H. Mukka, F. Hesaaraki, and M. Burtscher.
MPC: A Massively Parallel Compression Algorithm for Scientific Data.
IEEE Cluster Conference. September 2015.
Due to their high peak performance and energy efficiency, massively parallel
accelerators such as GPUs are quickly spreading in high-performance computing,
where large amounts of floating-point data are processed, transferred, and
stored. Such environments can greatly benefit from data compression if done
sufficiently quickly. Unfortunately, most conventional compression algorithms
are unsuitable for highly parallel execution. In fact, it is generally unknown
how to design good compression algorithms for massively parallel systems. To
remedy this situation, we study 138,240 lossless compression algorithms for
single- and double-precision floating-point values that are built exclusively
from easily parallelizable components. We analyze the best of these algorithms,
explain why they compress well, and derive the Massively Parallel Compression
(MPC) algorithm from them. This novel algorithm requires almost no internal
state, achieves heretofore unreached compression ratios on several data sets,
and roughly matches the best CPU-based algorithms in compression ratio while
outperforming them by one to two orders of magnitude in throughput.
S. Rahman, M. Burtscher, Z. Zong, and A. Qasem.
Maximizing Hardware Prefetch Effectiveness with Machine Learning.
17th IEEE International Conference on High Performance Computing and Communications. August 2015.
Modern processors are equipped with multiple
hardware prefetchers, each of which targets a distinct level
in the memory hierarchy and employs a separate prefetching
algorithm. However, different programs require different subsets
of these prefetchers to maximize their performance. Turning on
all available prefetchers rarely yields the best performance and,
in some cases, prefetching even hurts performance. This paper
studies the effect of hardware prefetching on multithreaded code
and presents a machine-learning technique to predict the optimal
combination of prefetchers for a given application. This technique
is based on program characterization and utilizes hardware
performance events in conjunction with a pruning algorithm to
obtain a concise and expressive feature set. The resulting feature
set is used in three different learning models. All necessary steps
are implemented in a framework that reaches, on average, 96%
of the best possible prefetcher speedup. The framework is built
from open-source tools, making it easy to extend and port to
A. Dzhagaryan, A. Milenkovic, and M. Burtscher.
Quantifying Benefits of Lossless Compression Utilities on Modern Smartphones.
24th International Conference on Computer Communications and Networks. August 2015.
[pdf] (best paper candidate)
The data traffic originating on mobile computing devices has been growing
exponentially over the last several years. Lossless data compression and
decompression can be essential in increasing communication throughput,
reducing communication latency, achieving energy-efficient communication,
and making effective use of available storage. This paper experimentally
evaluates several compression utilities and configurations on a modern
smartphone. We characterize each utility in terms of its compression ratio,
compression and decompression throughput, and energy efficiency for
representative use cases. We find a wide variety of energy costs associated
with data compression and decompression and provide practical guidelines
for selecting the most energy efficient configurations for each use case.
For data transfers over WLAN, the best configurations provide a 2.1-fold
and 2.7-fold improvement in energy efficiency for compressed uploads and
downloads, respectively, when compared to uncompressed data transfers. For
data transfers over a mobile broad-band network, the best configurations
provide a 2.7-fold and 3-fold improvement in energy efficiency for
compressed uploads and downloads, respectively.
S. Taheri, A. Qasem, and M. Burtscher.
A Tool for Automatically Suggesting Source-Code Optimizations for Complex GPU Kernels.
2015 International Conference on Parallel and Distributed Processing Techniques and Applications. July 2015.
Future computing systems, from handhelds to supercomputers,
will undoubtedly be more parallel and heterogeneous
than today's systems to provide more performance
and energy efficiency. Thus, GPUs are increasingly being
used to accelerate general-purpose applications, including
applications with data-dependent, irregular control flow
and memory access patterns. However, the growing complexity,
exposed memory hierarchy, incoherence, heterogeneity,
and parallelism will make accelerator-based systems
progressively more difficult to program. In the foreseeable
future, the vast majority of programmers will no longer be
able to extract additional performance or energy-savings
from next-generation systems because the programming will
be too difficult. Automatic performance analysis and optimization
recommendation tools have the potential to avert this
situation. They embody expert knowledge and make it available
to software developers when needed. In this paper, we
describe and evaluate such a tool. It quantifies performance
characteristics of GPU code through profiling, employs machine
learning models to estimate the suitability and benefit
of several known source-code optimizations, ranks the optimizations,
and suggests the most promising ones to the user
if the expected speedup is sufficiently high.
I. Szczyrba, R. Szczyrba, and M. Burtscher.
Analytic Representations of the n-anacci Constants and Generalizations Thereof.
Journal of Integer Sequences, Vol. 18, Article 15.4.5. May 2015.
We study generalizations of the sequence of the n-anacci constants that are
constructed from the ratio limits generated by linear recurrences of an
arbitrary order n with equal integer weights m. We derive the analytic
representation of the class C∞ of these ratio limits and prove that,
for a fixed m, the ratio limits form a strictly increasing sequence
converging to m+1. We also show that the generalized n-anacci constants
form a totally ordered set.
M. Burtscher, W. Peng, A. Qasem, H. Shi, D. Tamir, and H. Thiry.
A Module-based Approach to Adopting the 2013 ACM Curricular Recommendations on Parallel Computing.
2015 ACM SIGCSE Symposium. March 2015.
The widespread deployment of multicore systems over the last decade has
brought about major changes in the software and hardware landscape. The
resulting importance of parallel computing is reflected in the 2013
Curriculum Guidelines developed by the joint ACM/IEEE taskforce. The
document recommends increased coverage of parallel computing and describes
a new Knowledge Area on this topic. These recommendations have already
been adopted by several universities in the form of new parallel-programming
courses. Implementing the recommendations in a complete curriculum, however,
poses many challenges, including deciding on existing material to be removed,
complying with administrative and ABET requirements, and maintaining caps on
graduation credit hours. This paper describes an alternative approach for
adopting the 2013 curricular recommendations on parallel computing.
Specifically, we use a module-based approach that introduces parallel
computing concepts and re-iterates them through a series of short,
self-contained modules taught across several lower-division courses. Most
of these concepts are then combined into a new senior-level capstone course
on parallel programming. Each module covers parallelism aspects in the context
of a conventional computer science topic, thus enabling us to include parallel
computing without a major overhaul of the curriculum. Evaluations conducted
during the first year show encouraging results for this early-and-often
approach in terms of learning outcomes, student interest, and confidence gains.
J. Coplin and M. Burtscher.
Effects of Source-Code Optimizations on GPU Performance and Energy Consumption.
Eighth Workshop on General Purpose Processing Using GPUs. February 2015.
This paper studies the effects of source-code optimizations on the performance,
power draw, and energy consumption of a modern compute GPU. We evaluate 128
versions of two n-body codes: a compute-bound regular implementation and a
memory-bound irregular implementation. Both programs include six optimizations
that can be individually enabled or disabled. We measured the active runtime
and the power consumption of each code version on three inputs, various GPU
clock frequencies, two arithmetic precisions, and with and without ECC. This
paper investigates which optimizations primarily improve energy efficiency,
which ones mainly boost performance, and which ones help both aspects. Some
optimizations also have the added benefit of reducing the power draw. Our
analysis shows that individual and combinations of optimizations can alter
the performance and energy consumption of a GPU kernel by up to a factor of
M. A. O'Neil and M. Burtscher.
Rethinking the Parallelization of Random-Restart Hill Climbing.
Eighth Workshop on General Purpose Processing Using GPUs. February 2015.
Random-restart hill climbing is a common approach to combinatorial optimization
problems such as the traveling salesman problem (TSP). We present and evaluate
an implementation of random-restart hill climbing with 2-opt local search
applied to TSP. Our implementation is capable of addressing large problem sizes
at high throughput. It is based on the key insight that the GPU's hierarchical
hardware parallelism is best exploited with a hierarchical implementation
strategy, where independent climbs are parallelized between blocks and the 2-opt
evaluations are parallelized across the threads within a block. We analyze the
performance impact of this and other optimizations on our heuristic TSP solver
and compare its performance to existing GPU-based 2-opt TSP solvers as well as
a parallel CPU implementation. Our code outperforms the existing implementations
by up to 3X, evaluating up to 60 billion 2-opt moves per second on a single K40
GPU. It also outperforms an OpenMP implementation run on 20 CPU cores by up to 8X.
B. Li, Y. Lu, C. Li, A. Godil, T. Schreck, M. Aono, M. Burtscher, Q. Chen, N.K. Chowdhury, B. Fang, H. Fu, T. Furuya, H. Li, J. Liu, H. Johan, R. Kosaka, H. Koyanagi, R. Ohbuchi, A. Tatsuma, Y. Wan, C. Zhang, and C. Zou.
A Comparison of 3D Shape Retrieval Methods based on a Large-Scale Benchmark Supporting Multimodal Queries.
Computer Vision and Image Understanding, Vol. 131, pp. 1-27, Elsevier. February 2015.
Large-scale 3D shape retrieval has become an important research direction in
content-based 3D shape retrieval. To promote this research area, two Shape
Retrieval Contest (SHREC) tracks on large scale comprehensive and sketch-based
3D model retrieval have been organized by us in 2014. Both tracks were based
on a unified large-scale benchmark that supports multimodal queries (3D
models and sketches). This benchmark contains 13680 sketches and 8987 3D
models, divided into 171 distinct classes. It was compiled to be a superset of
existing benchmarks and presents a new challenge to retrieval methods as it
comprises generic models as well as domain-specific model types. Twelve and
six distinct 3D shape retrieval methods have competed with each other in
these two contests, respectively. To measure and compare the performance of
the participating and other promising Query-by-Model or Query-by-Sketch 3D
shape retrieval methods and to solicit state-of-the-art approaches, we perform
a more comprehensive comparison of twenty-six (eighteen originally participating
algorithms and eight additional state-of-the-art or new) retrieval methods by
evaluating them on the common benchmark. The benchmark, results, and evaluation
tools are publicly available at our websites
J. Coplin and M. Burtscher.
Power Characteristics of Irregular GPGPU Programs.
2014 International Workshop on Green Programming, Computing, and Data Processing. November 2014.
This paper investigates the power profiles of irregular programs running on a
K20 compute GPU and contrasts them with the profiles of regular programs. The
paper further studies the effects on the power profile when changing the GPU's
core and memory frequencies, using alternate implementations of the same
algorithm, and varying the program input. Our results show that the power
behavior of irregular applications often cannot be accurately captured by a
single average. Rather, the entire profile, i.e., the power as a function of
time, needs to be considered. In addition, lowering the frequency, employing
alternate implementations, or using different inputs can drastically alter the
power profile of irregular codes, meaning that measurements using one setting
may not be representative of that program's power characteristics under a
M. A. O'Neil and M. Burtscher.
Microarchitectural Performance Characterization of Irregular GPU Kernels.
2014 IEEE International Symposium on Workload Characterization. October 2014.
GPUs are increasingly being used to accelerate general-purpose applications,
including applications with data-dependent, irregular memory access patterns
and control flow. However, relatively little is known about the behavior of
irregular GPU codes, and there has been minimal effort to quantify the ways
in which they differ from regular GPGPU applications. We examine the behavior
of a suite of optimized irregular CUDA applications on a cycle-accurate GPU
simulator. We characterize the performance bottlenecks in each program and
connect source code with microarchitectural characteristics. We also assess
the impact of improvements in cache and DRAM bandwidth and latency and discuss
the implications for GPU architecture design. We find that, while irregular
graph codes exhibit significantly more underutilized execution cycles due to
thread divergence, load imbalance, and synchronization overhead than regular
programs, these factors contribute less to performance degradation than we
expected. It appears that code optimizations are often able to effectively
address these performance hurdles. Insufficient bandwidth and long memory
latency are the biggest limiters of performance. Surprisingly, we find that
applications with irregular memory access patterns are more sensitive to
changes in L2 latency and bandwidth than DRAM latency and bandwidth.
R. Ge, X. Feng, M. Burtscher, and Z. Zong.
Performance and Energy Modeling for Cooperative Hybrid Computing.
9th IEEE International Conference on Networking, Architecture, and Storage. August 2014.
Accelerator-based heterogeneous systems can provide
high performance and energy efficiency, both of which
are key design goals in high performance computing. To fully
realize the potential of heterogeneous architectures, software
must optimally exploit the hosts' and accelerators' processing
and power-saving capabilities. Yet, previous studies mainly
focus on using hosts and accelerators to boost application
performance. Power-saving features to improve the energy
efficiency of parallel programs, such as Dynamic Voltage and
Frequency Scaling (DVFS), remain largely unexplored.
Recognizing that energy efficiency is a different objective
than performance and should therefore be independently
pursued, we study how to judiciously distribute computation
between hosts and accelerators for energy optimization. We
further explore energy-saving scheduling in combination with
computation distribution for even larger gains. Moreover, we
present PEACH, an analytical model for Performance and
Energy Aware Cooperative Hybrid computing. With just a few
system- and application-dependent parameters, PEACH accurately
captures the performance and energy impact of computation
distribution and energy-saving scheduling to quickly
identify the optimal coupled strategy for achieving the best
performance or the lowest energy consumption. PEACH thus
eliminates the need for extensive profiling and measurement.
Experimental results from two GPU-accelerated heterogeneous
systems show that PEACH predicts the performance and energy
of the studied codes with less than 3% error and successfully
identifies the optimal strategy for a given objective.
H. Rabeti and M. Burtscher.
Feature Selection by Tree Search of Correlation-Adjusted Class Distances.
2014 International Conference on Data Mining. July 2014.
The rapidly growing dimensionality of datasets has
made feature selection indispensable. We introduce the TS-CACD
feature-selection algorithm, which uses a generalization of the
Stern-Brocot tree to traverse the search space. This family of
trees supports different divergence ratios, i.e., enables the search
to focus on and reach certain areas of interest more quickly.
TS-CACD uses a continuous filter method, which combines an
inter/intra-class distance measure with a pair-wise ranked feature
correlation measure. It requires almost no parameters, explicitly
selects the most important features, and performs well.
B. Li, Y. Lu, C. Li, A. Godil, T. Schreck, M. Aono, M. Burtscher, H. Fu, T. Furuya, H. Johan, J. Liu, R. Ohbuchi, A. Tatsuma, and C. Zou.
SHREC'14 Track: Extended Large Scale Sketch-Based 3D Shape Retrieval
Eurographics Workshop on 3D Object Retrieval. April 2014.
Large scale sketch-based 3D shape retrieval has received more and more attentions in the community of content-based
3D object retrieval. The objective of this track is to evaluate the performance of different sketch-based 3D
model retrieval algorithms using a large scale hand-drawn sketch query dataset on a comprehensive 3D model
dataset. The benchmark contains 12,680 sketches and 8,987 3D models, divided into 171 distinct classes. In this
track, 12 runs were submitted by 4 groups and their retrieval performance was evaluated using 7 commonly
used retrieval performance metrics. We hope that this benchmark, the comparative evaluation results, and the
corresponding evaluation code will further promote the progress of this research direction for the 3D model
V. Uzelac, A. Milenkovic, M. Milenkovic, and M. Burtscher.
Using Branch Predictors and Variable Encoding for On-the-Fly Program Tracing.
IEEE Transactions on Computers, Vol. 63, No. 4, pp. 1008-1020, IEEE Computer Society. April 2014.
Unobtrusive capturing of program execution traces in real-time is crucial for
debugging many embedded systems. However, tracing even limited program segments
is often cost-prohibitive, requiring wide trace ports and large on-chip trace
buffers. This paper introduces a new cost-effective technique for capturing and
compressing program execution traces on-the-fly. It relies on branch predictor-like
structures in the trace module and corresponding software modules in the debugger
to significantly reduce the number of events that need to be streamed out of the
target system. Coupled with an effective variable encoding scheme that adapts to
changing program patterns, our technique requires merely 0.029 bits per instruction
of trace port bandwidth, providing a 34-fold improvement over the commercial
state-of-the-art and a five-fold improvement over academic proposals, at the low
cost of under 5,000 logic gates.
K. Rocki, M. Burtscher, and R. Suda.
The Future of Accelerator Programming: Abstraction, Performance or Can We Have Both?
29th ACM Symposium on Applied Computing. March 2014.
In a perfect world, code would only be written once and would run on different devices with high efficiency.
To a degree, that used to be the case in the era of frequency scaling on a single core. However, due to power
limitations, parallel programming has become necessary to obtain performance gains. But parallel architectures
differ substantially from each other, often require specialized knowledge to exploit them, and typically
necessitate reimplementation and fine tuning of programs. These slow tasks frequently result in situations
where most of the time is spent reimplementing old rather than writing new code.
The goal of our research is to find programming techniques that increase productivity, maintain high
performance, and provide abstraction to free the programmer from these unnecessary and time-consuming tasks.
However, such techniques usually come at the cost of substantial performance degradation. This paper
investigates current approaches to portable accelerator programming, seeking to answer whether they make it
possible to combine high efficiency with sufficient algorithm abstraction. It discusses OpenCL as a potential
solution and presents three approaches of writing portable code: GPU-centric, CPU-centric, and combined. By
applying the three approaches to a real-world program, we show that it is at least sometimes possible to run
exactly the same code on many different devices with minimal performance degradation using parameterization.
The main contributions of this paper are an extensive review of the current state-of-the-art and our original
approach of addressing the stated problem with the copious-parallelism technique.
M. Burtscher, I. Zecena, and Z. Zong.
Measuring GPU Power with the K20 Built-in Sensor.
Seventh Workshop on General Purpose Processing on Graphics Processing Units, pp. 28-36. March 2014.
GPU-accelerated programs are becoming increasingly common in HPC, personal computers, and even handheld
devices, making it important to optimize their energy efficiency. However, accurately profiling the power
consumption of GPU code is not straightforward. In fact, we have identified multiple anomalies when using
the on-board power sensor of K20 GPUs. For example, we have found that doubling a kernel's runtime more
than doubles its energy usage, that kernels consume energy after they have stopped executing, and that
running two kernels in close temporal proximity inflates the energy consumption of the later kernel.
Moreover, we have observed that the power sampling frequency varies greatly and that the GPU sensor only
performs power readings once in a while. We present a methodology to accurately compute the instant power
and the energy consumption despite these issues.
K. Rocki and M. Burtscher.
The Future of Accelerator Programming.
HPC wire. January 2014.
I. Zecena, M. Burtscher, T. Jin, and Z. Zong.
Evaluating the Performance and Energy Efficiency of N-Body Codes on Multi-Core CPUs and GPUs.
32nd IEEE International Performance Computing and Communications Conference, pp. 1 - 8. December 2013.
N-body simulations are computation-intensive applications
that calculate the motion of a large number of bodies under
pair-wise forces. Although different versions of n-body codes
have been widely used in many scientific fields, the
performance and energy efficiency of various n-body codes
have not been comprehensively studied, especially when they
are running on newly released multi-core CPUs and GPUs (e.g.,
Tesla K20). In this paper, we evaluate the performance and
energy efficiency of five parallel n-body implementations on
two different multi-core CPU systems and on two different
types of GPUs. Our experimental results show that up to 71%
of the energy can be saved by using all cores of a Xeon E5620
CPU instead of only one. We find hyper-threading to be able
to further reduce the energy usage and runtime, but not by as
much as adding more cores does. Finally, our experiments
illustrate that GPU-based acceleration using a Tesla K20c can
boost the performance and energy efficiency by orders of
M. Burtscher, H. Shi, W. Peng, D. Tamir, A. Qasem, and H. Thiry.
Integrating Parallel Computing into the Undergraduate Curriculum at Texas State University: Experiences from the First Year.
Workshop on Parallel, Distributed, and High-Performance Computing in Undergraduate Curricula. November 2013.
The widespread deployment of multicore-based
computer systems over the last decade has brought about
drastic changes in the software and hardware landscape. Yet,
many undergraduate computer science (CS) curricula have not
embraced the pervasiveness of parallel computing. In their first
years, CS undergraduates are typically exclusively trained to
think and program sequentially. However, too firm a root in
sequential thinking can be a nontrivial barrier for parallel
thinking and computing. Thus, there is an urgent need to teach
multicore and parallel computing concepts earlier and often in
This paper describes our efforts at addressing the rapidly
widening gap between highly parallel computer architectures
and the sequential programming approach taught in
traditional CS courses. At Texas State University, we have
adopted the early-and-often mode of integrating parallelism
into the undergraduate curriculum. In this approach, parallel
computing concepts are introduced and reiterated through a
series of short, self-contained modules across several lower-division
courses. Most of these concepts are then combined
into a newly designed senior-level capstone course in multicore
programming. Evaluations conducted during the first year
show encouraging results for the early-and-often approach in
terms of learning outcomes, student interest and confidence
gains in computer science.
R. Ge, R. Vogt, J. Majumder, A. Alam, M. Burtscher, and Z. Zong.
Effects of Dynamic Voltage and Frequency Scaling on a K20 GPU.
2nd International Workshop on Power-aware Algorithms, Systems, and Architectures, pp. 826 - 833. October 2013.
Improving energy efficiency is an ongoing challenge
in HPC because of the ever-increasing need for
performance coupled with power and economic constraints.
Though GPU-accelerated heterogeneous computing systems
are capable of delivering impressive performance, it is
necessary to explore all available power-aware technologies
to meet the inevitable energy efficiency challenge.
In this paper, we experimentally study the impacts of
DVFS on application performance and energy efficiency for
GPU computing and compare them with those of DVFS
for CPU computing. Based on a power-aware heterogeneous
system that includes dual Intel Sandy Bridge CPUs and
the latest Nvidia K20c Kepler GPU, the study provides
numerous new insights, general trends and exceptions of
DVFS for GPU computing. In general, the effects of DVFS on
a GPU differ from those of DVFS on a CPU. For example,
on a GPU running compute-bound high-performance and
high-throughput workloads, the system performance and
the power consumption are approximately proportional to
the GPU frequency. Hence, with a permissible power limit,
increasing the GPU frequency leads to better performance
without incurring a noticeable increase in energy. This
paper further provides detailed analytical explanations of
the causes of the observed trends and exceptions.
The findings presented in this paper have the potential
to impact future CPU and GPU architectures to achieve
better energy efficiency and point out directions for designing
effective DVFS schedulers for heterogeneous systems.
A. Milenkovic, A. Dzhagaryan, and M. Burtscher.
Performance and Energy Consumption of Lossless Compression/Decompression Utilities on Mobile Computing Platforms.
IEEE 21st International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, pp. 254 - 263. August 2013.
Data compression and decompression utilities can be
critical in increasing communication throughput, reducing
communication latencies, achieving energy-efficient communication,
and making effective use of available storage. This
paper experimentally evaluates several such utilities for multiple
compression levels on systems that represent current
mobile platforms. We characterize each utility in terms of its
compression ratio, compression and decompression throughput,
and energy efficiency. We consider different use cases that
are typical for modern mobile environments. We find a wide
variety of energy costs associated with data compression and
decompression and provide practical guidelines for selecting
the most energy efficient configurations for each use case. The
best performing configurations provide 6-fold and 4-fold
improvements in energy efficiency for compressed uploads and
downloads over WLAN, respectively, when compared to uncompressed
M. Burtscher and H. Rabeti.
GPU Acceleration of a Genetic Algorithm for the Synthesis of FSM-based Bimodal Predictors.
2013 International Conference on Parallel and Distributed Processing Techniques and Applications. July 2013.
This paper presents a fast GPU implementation of a genetic algorithm for synthesizing
bimodal predictor FSMs of a given size. Bimodal predictors, i.e., predictors that make
binary yes/no predictions, are ubiquitous in microprocessors. Many of these predictors
are based on finite-state machines (FSMs). However, there are countless possible FSMs
and even heuristic searches for finding good FSMs can be slow when billions of
predictions need to be assessed. We designed such a search heuristic that maps well
onto GPU hardware. It is based on a multi-start genetic algorithm. On our six traces,
the resulting FSMs are 1% to 29% more accurate than saturating up/down counters. On a
Kepler-based GTX 680, the CUDA implementation evaluates 18 to 73 billion predictions
per second, which is 14 to 18 times faster than a multicore version running on a
hex-core Xeon X5690 with hyper-threading.
M. Burtscher and H. Rabeti.
A Scalable Heterogeneous Parallelization Framework for Iterative Local Searches.
27th IEEE International Parallel & Distributed Processing Symposium, pp. 1289-1298. May 2013.
This paper describes and evaluates a highly-scalable framework for running iterative
local searches on heterogeneous HPC platforms. The user only needs to provide serial
CPU or single-GPU code that implements a simple interface. The framework then executes
this code in parallel using MPI between compute nodes and OpenMP and multi-GPU support
within nodes. It handles all parallelization aspects, seed distribution and program
termination, and it regularly records the currently best solution. We evaluate our
framework on three supercomputers using a heuristic iterative hill-climbing TSP solver
as well as a search for good finite-state machines. The framework scales to 2048 nodes
(32,768 cores) on Ranger with less than a 5% drop in efficiency, searches over 12.2
trillion TSP tours per second on Stampede using 1024 nodes, and evaluates over 21.5
trillion FSM transitions per second using 256 CPUs and 384 GPUs on Keeneland.
R. Nasre, M. Burtscher, and K. Pingali.
Data-driven versus Topology-driven Irregular Computations on GPUs.
27th IEEE International Parallel & Distributed Processing Symposium, pp. 463-474. May 2013.
Irregular algorithms are algorithms with complex main data structures such as directed
and undirected graphs, trees, etc. A useful abstraction for many irregular algorithms is
its operator formulation in which the algorithm is viewed as the iterated
application of an operator to certain nodes, called active nodes, in the graph.
Each operator application, called an activity, usually touches only a small part
of the overall graph, so non-overlapping activities can be performed in parallel. In
topology-driven implementations, all nodes are assumed to be active so the
operator is applied everywhere in the graph even if there is no work to do at some nodes.
In contrast, in data-driven implementations the operator is applied only to nodes
at which there might be work to do. Multicore implementations of irregular algorithms are
usually data-driven because current multicores only support small numbers of threads and
work-efficiency is important. Conversely, many irregular GPU implementations use a
topology-driven approach because work inefficiency can be counterbalanced by the large
number of GPU threads.
In this paper, we study data-driven and topology-driven implementations of six important
graph algorithms on GPUs. Our goal is to understand the tradeoffs between these
implementations and how to optimize them. We find that data-driven versions are generally
faster and scale better despite the cost of maintaining a worklist. However,
topology-driven versions can be superior when certain algorithmic properties are
exploited to optimize the implementation. These results led us to devise hybrid
approaches that combine the two techniques and outperform both of them.
A. Dzhagaryan, A. Milenkovic, and M. Burtscher.
Energy Efficiency of Lossless Data Compression on a Mobile Device: An Experimental Evaluation.
2013 IEEE International Symposium on Performance Analysis of Systems and Software. April 2013.
Lossless compression and decompression are routinely
used in mobile computing devices to reduce the costs of
communicating and storing data. This paper presents the results
of an experimental evaluation of common compression utilities
on Pandaboard, a development platform similar to current
commercial mobile devices. We study the compression ratio,
compression and decompression throughput, and energy efficiency
of different usage scenarios typical for mobile computing.
We observe a wide variety of energy costs associated with data
compression and provide practical guidelines for selecting the
most energy-efficient configurations.
R. Nasre, M. Burtscher, and K. Pingali.
Atomic-free Irregular Computations on GPUs.
Sixth Workshop on General Purpose Processing on Graphics Processing Units, pp. 96-107. March 2013.
Atomic instructions are a key ingredient of codes that operate on
irregular data structures like trees and graphs. It is well known
that atomics can be expensive, especially on massively parallel GPUs,
and are often on the critical path of a program. In this paper, we
present two high-level methods to eliminate atomics in irregular
programs. The first method advocates synchronous processing using
barriers. We illustrate how to exploit synchronous processing to
avoid atomics even when the threads' memory accesses conflict with
each other. The second method is based on exploiting algebraic
properties of algorithms to elide atomics.
Specifically, we focus on three key properties: monotonicity,
idempotency and associativity, and show how each of them enables
an atomic-free implementation. We illustrate the generality of the
two methods by applying them to five irregular graph applications:
breadth-first search, single-source shortest paths computation,
Delaunay mesh refinement, pointer analysis and survey propagation,
and show that both methods provide substantial speedup in each case
on different GPUs.
R. Nasre, M. Burtscher, and K. Pingali.
Morph Algorithms on GPUs.
18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 147-156. February 2013.
There is growing interest in using GPUs to accelerate graph algorithms
such as breadth-first search, computing page-ranks, and
finding shortest paths. However, these algorithms do not modify
the graph structure, so their implementation is relatively easy compared
to general graph algorithms like mesh generation and refinement,
which morph the underlying graph in non-trivial ways by
adding and removing nodes and edges. We know relatively little
about how to implement morph algorithms efficiently on GPUs.
In this paper, we present and study four morph algorithms: (i)
a computational geometry algorithm called Delaunay Mesh Refinement
(DMR), (ii) an approximate SAT solver called Survey
Propagation (SP), (iii) a compiler analysis called Points-to Analysis
(PTA), and (iv) Boruvka's Minimum Spanning Tree algorithm
(MST). Each of these algorithms modifies the graph data structure
in different ways and thus poses interesting challenges.
We overcome these challenges using algorithmic and GPU-specific
optimizations. We propose efficient techniques to perform
concurrent subgraph addition, subgraph deletion, conflict detection
and several optimizations to improve the scalability of morph algorithms.
For an input mesh with 10 million triangles, our DMR code
achieves an 80x speedup over the highly optimized serial Triangle
program and a 2.3x speedup over a multicore implementation
running with 48 threads. Our SP code is 3x faster than a multicore
implementation with 48 threads on an input with 1 million literals.
The PTA implementation is able to analyze six SPEC 2000 benchmark
programs in just 74 milliseconds, achieving a geometric mean
speedup of 9.3x over a 48-thread multicore version. Our MST code
is slower than a multicore version with 48 threads for sparse graphs
but significantly faster for denser graphs.
This work provides several insights into how other morph algorithms
can be efficiently implemented on GPUs.
M. Burtscher, R. Nasre, and K. Pingali.
A Quantitative Study of Irregular Programs on GPUs.
2012 IEEE International Symposium on Workload Characterization, pp. 141-151. November 2012.
GPUs have been used to accelerate many regular applications and, more recently,
irregular applications in which the control flow and memory access patterns are
data-dependent and statically unpredictable. This paper defines two measures of
irregularity called control-flow irregularity and memory-access
irregularity, and investigates, using performance-counter measurements, how
irregular GPU kernels differ from regular kernels with respect to these measures.
For a suite of 13 benchmarks, we find that (i) irregularity at the warp level
varies widely, (ii) control-flow irregularity and memory-access irregularity are
largely independent of each other, and (iii) most kernels, including regular ones,
exhibit some irregularity. A program's irregularity can change between different
inputs, systems, and arithmetic precision but generally stays in a specific region
of the irregularity space. Whereas some highly tuned implementations of irregular
algorithms exhibit little irregularity, trading off extra irregularity for better
locality or less work can improve overall performance.
I. Szczyrba, M. Burtscher, and R. Szczyrba.
Validating Critical Limits of the Universal Brain Injury Criterion.
2012 International Conference on Bioinformatics and Computational Biology, pp. 199-205. July 2012.
We present results of numerical simulations that
further validate the critical limits we previously proposed
for our universal Brain Injury Criterion (BIC). The BIC
extends the applicability of the translational Head Injury
Criterion (HIC) to arbitrary head motions by reformulating
the acceleration-based HIC formula in terms of the energy
transferred locally from the skull to the brain. Our simulations
are based on a generalization of the Kelvin-Voigt (K-V)
Close Head Injury model that includes a nonlinear strain-stress
relation. We validate the proposed BIC limits against
(i) the critical limit HIC15 = 700, (ii) the Diffuse Axonal
Injury Tolerance Criterion (DAITC) for head rotations that
has been derived from the K-V model and from experiments
with animal brains, and (iii) recent experimental data on
strain levels leading to permanent neuronal damage. Our
results imply that for head rotations about various fixed axes,
the critical BIC15 limits coincide with the HIC15 critical
limit and are in agreement with the DAITC thresholds.
P. Ratanaworabhan, M. Burtscher, D. Kirovski, B. Zorn.
Hardware Support for Enforcing Isolation in Lock-Based Parallel Programs.
26th International Conference on Supercomputing, pp. 301-310. June 2012.
When lock-based parallel programs execute on conventional multicore
hardware, faulty software can cause hard-to-debug race conditions
in critical sections that violate the contract between locks and
their protected shared variables. This paper proposes new hardware
support for enforcing isolation of critical section execution. It can
detect and tolerate races, allowing programs to execute race-free.
Our hardware scheme targets the existing large code base of lock-based
parallel programs written in type unsafe languages such as C
and C++. Our approach works directly on unmodified executables.
An evaluation of 13 programs from the SPLASH2 and PARSEC
suites shows that the cost of the additional hardware and the impact
on the overall execution time is minimal for these applications. Our
mechanism is complementary to hardware transactional memory in
that it uses similar structures but focuses on enhancing the reliability
of existing lock-based programs.
P. Ratanaworabhan, M. Burtscher, D. Kirovski, B. Zorn, R. Nagpal, and K. Pattabiraman.
Efficient Runtime Detection and Toleration of Asymmetric Races.
IEEE Transactions on Computers, Vol. 61, No. 4, pp. 548-562, IEEE Computer Society. April 2012.
We introduce ToleRace, a runtime system that allows programs to detect
and even tolerate asymmetric data races. Asymmetric races are race
conditions where one thread correctly acquires and releases a lock for
a shared variable while another thread improperly accesses the same
variable. ToleRace provides approximate isolation in the critical
sections of lock-based parallel programs by creating a local copy of
each shared variable when entering a critical section, operating on the
local copies, and propagating the appropriate copies upon leaving the
critical section. We start by characterizing all possible interleavings
that can cause races and precisely describe the effect of ToleRace in
each case. Then, we study the theoretical aspects of an oracle that
knows exactly what type of interleaving has occurred. Finally, we
present software implementations of ToleRace and evaluate them on
multithreaded applications from the SPLASH2 and PARSEC suites.
M. Mendez-Lojo, M. Burtscher, and K. Pingali.
A GPU Implementation of Inclusion-based Points-to Analysis.
17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 107-116. February 2012.
Graphics Processing Units (GPUs) have emerged as powerful accelerators
for many regular algorithms that operate on dense arrays
and matrices. In contrast, we know relatively little about using
GPUs to accelerate highly irregular algorithms that operate on
pointer-based data structures such as graphs. For the most part, research
has focused on GPU implementations of graph analysis algorithms
that do not modify the structure of the graph, such as algorithms
for breadth-first search and strongly-connected components.
In this paper, we describe a high-performance GPU implementation
of an important graph algorithm used in compilers such as
gcc and LLVM: Andersen-style inclusion-based points-to analysis.
This algorithm is challenging to parallelize effectively on GPUs because
it makes extensive modifications to the structure of the underlying
graph and performs relatively little computation. In spite of
this, our program, when executed on a 14 Streaming Multiprocessor
GPU, achieves an average speedup of 7x compared to a sequential
CPU implementation and outperforms a parallel implementation
of the same algorithm running on 16 CPU cores.
Our implementation provides general insights into how to produce
high-performance GPU implementations of graph algorithms,
and it highlights key differences between optimizing parallel programs
for multicore CPUs and for GPUs.
O. A. Sopeju, M. Burtscher, A. Rane, and J. Browne.
AutoSCOPE: Automatic Suggestions for Code Optimizations Using PerfExpert.
2011 International Conference on Parallel and Distributed Processing Techniques and Applications, pp. 19-25. July 2011.
Automated source-code performance optimization has four stages: measurement,
diagnosis of bottlenecks, determination of optimizations, and rewriting of
the source code. Each stage must be successfully implemented to enable the
next stage. The PerfExpert tool supports automatic performance measurement
and bottleneck diagnosis for multicore and multichip compute nodes, i.e.,
it implements the first two stages. This paper presents AutoSCOPE, a new
system that extends PerfExpert by implementing the third stage. Based on
PerfExpert's output, AutoSCOPE automatically determines appropriate
source-code optimizations and compiler flags. We describe the process for
selecting optimizations and evaluate the effectiveness of AutoSCOPE by
applying it to three HPC production codes. Each of these codes is available
in unoptimized and manually optimized versions. AutoSCOPE succeeds in
selecting the same source-code transformations as were chosen by human
experts in most cases. AutoSCOPE is an extensible framework to which
additional optimizations and further rules for selecting optimizations can
M. A. O'Neil, D. Tamir, and M. Burtscher.
A Parallel GPU Version of the Traveling Salesman Problem.
2011 International Conference on Parallel and Distributed Processing Techniques and Applications, pp. 348-353. July 2011.
This paper describes and evaluates an implementation of iterative hill
climbing with random restart for determining high-quality solutions to
the traveling salesman problem. With 100,000 restarts, this algorithm
finds the optimal solution for four out of five 100-city TSPLIB inputs
and yields a tour that is only 0.07% longer than the optimum on the
fifth input. The presented implementation is highly parallel and
optimized for GPU-based execution. Running on a single GPU, it
evaluates over 20 billion tour modifications per second. It takes 32
CPUs with 8 cores each (256 cores total) to match this performance.
I. Szczyrba, M. Burtscher, and R. Szczyrba.
Computer Modeling of Diffuse Axonal Injury Mechanisms.
2011 International Conference on Bioinformatics and Computational Biology. July 2011.
We investigate numerically which properties of
the human brain cause Diffuse Axonal Injuries (DAI) to
appear in a scattered and pointwise manner near the
gray/white matter boundary, mostly in the white matter.
These simulations are based on our dually-nonlinear, viscoelastic,
fluid Traumatic Brain Injury model, which includes
a nonlinear stress/strain relation. We simulate rotational
accelerations and decelerations of a human head that replicate
realistic traumatic scenarios. The rotational loads are
quantified by our Brain Injury Criterion, which extends
the translational Head Injury Criterion to arbitrary head
motions. Our simulations show that: (i) DAI occurrences
near the gray/white matter boundary can be explained by the
difference in the gray and the white matter's shear modulus
values, (ii) the scattered/pointwise DAI character can be
attributed to the nonlinear fluid aspect of the brain tissue,
and (iii) the scattering of DAI deeper in the white matter
appears to be caused by the complicated shape of the brain.
Our results also show that the nonlinear stress/strain relation
plays a secondary role in shaping basic DAI features.
A. Milenkovic, V. Uzelac, M. Milenkovic, and M. Burtscher.
Caches and Predictors for Real-time, Unobtrusive, and Cost-Effective Program Tracing in Embedded Systems.
IEEE Transactions on Computers, Vol. 60, No. 7, pp. 992-1005, IEEE Computer Society. July 2011.
The increasing complexity of modern embedded computer systems makes
software development and system verification the most critical steps
in the system development. To expedite verification and program debugging,
chip manufacturers increasingly consider hardware infrastructure for
program debugging and tracing, including logic to capture and filter
traces, buffers to store traces, and a trace port through which the
trace is read by the debug tools. In this paper, we introduce a new
approach to capture and compress program execution traces in hardware.
The proposed trace compressor encompasses two cost-effective structures,
a stream descriptor cache and a last stream predictor. Information
about the program flow is translated into a sequence of hit and miss
events in these structures, thus dramatically reducing the number of
bits that need to be sent out of the chip. We evaluate the efficiency
of the proposed mechanism by measuring the trace port bandwidth on a
set of benchmark programs. Our mechanism requires only 0.15
bits/instruction/CPU on average on the trace port, which is a six-fold
improvement over state-of-the-art commercial solutions. The trace
compressor requires an on-chip area that is equivalent to one third of
a 1 kilobyte cache and it allows for continual and unobtrusive program
tracing in real-time.
K. Pingali, D. Nguyen, M. Kulkarni, M. Burtscher, M. A. Hassaan, R. Kaleem, T. H. Lee, A. Lenharth, R. Manevich, M. Mendez-Lojo, D. Prountzos, and X. Sui.
The Tao of Parallelism in Algorithms.
ACM SIGPLAN 2011 Conference on Programming Language Design and Implementation, pp. 12-25. June 2011.
For more than thirty years, the parallel programming community
has used the dependence graph as the main abstraction for reasoning
about and exploiting parallelism in "regular" algorithms that
use dense arrays, such as finite-differences and FFTs. In this paper,
we argue that the dependence graph is not a suitable abstraction for
algorithms in new application areas like machine learning and network
analysis in which the key data structures are "irregular" data
structures like graphs, trees, and sets.
To address the need for better abstractions, we introduce a data-centric
formulation of algorithms called the operator formulation
in which an algorithm is expressed in terms of its action on data
structures. This formulation is the basis for a structural analysis of
algorithms that we call Tao-analysis. Tao-analysis can be viewed
as an abstraction of algorithms that distills out algorithmic properties
important for parallelization. It reveals that a generalized form
of data-parallelism called amorphous data-parallelism is ubiquitous
in algorithms, and that, depending on the Tao-structure of
the algorithm, this parallelism may be exploited by compile-time,
inspector-executor or optimistic parallelization, thereby unifying
these seemingly unrelated parallelization techniques. Regular algorithms
emerge as a special case of irregular algorithms, and many
application-specific optimization techniques can be generalized to
a broader context.
These results suggest that the operator formulation and Tao analysis
of algorithms can be the foundation of a systematic approach
to parallel programming.
J. Diamond, M. Burtscher, J. McCalpin, B. Kim, S. Kecker, and J. Browne.
Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications.
2011 IEEE International Symposium on Performance Analysis of Systems and Software, pp. 32-43. April 2011.
[pdf] (best paper award)
The computation nodes of modern supercomputers commonly
consist of multiple multicore processors. To maximize the performance
of such systems requires measurement, analysis, and
optimization techniques that specifically target multicore environments.
This paper first examines traditional unicore metrics
and demonstrates how they can be misleading in a multicore
system. Second, it examines and characterizes performance
bottlenecks specific to multicore-based systems. Third,
it describes performance measurement challenges that arise in
multicore systems and outlines methods for extracting sound
measurements that lead to performance optimization opportunities.
The measurement and analysis process is based on a
case study of the HOMME atmospheric modeling benchmark
code from NCAR running on supercomputers built upon AMD
Barcelona and Intel Nehalem quad-core processors. Applying
the multicore bottleneck analysis to HOMME led to multicore
aware source-code optimizations that increased performance by
up to 35%. While the case studies were carried out on multichip
nodes of supercomputers using an HPC application as the target
for optimization, the pitfalls identified and the insights obtained
should apply to any system that is composed of multicore processors.
M. A. O'Neil and M. Burtscher.
Floating-Point Data Compression at 75 Gb/s on a GPU.
Fourth Workshop on General Purpose Processing on Graphics Processing Units. March 2011.
Numeric simulations often generate large amounts of data that
need to be stored or sent to other compute nodes. This paper
investigates whether GPUs are powerful enough to make real-time
data compression and decompression possible in such environments,
that is, whether they can operate at the 32- or 40-Gb/s throughput
of emerging network cards. The fastest parallel CPU-based
floating-point data compression algorithm operates below 20 Gb/s
on eight Xeon cores, which is significantly slower than the network
speed and thus insufficient for compression to be practical in
high-end networks. As a remedy, we have created the highly parallel
GFC compression algorithm for double-precision floating-point data.
This algorithm is specifically designed for GPUs. It compresses at a
minimum of 75 Gb/s, decompresses at 90 Gb/s and above, and can
therefore improve internode communication throughput on current and
upcoming networks by fully saturating the interconnection links with
M. A. Hassaan, M. Burtscher, and K. Pingali.
Ordered vs. Unordered: a Comparison of Parallelism and Work-efficiency in Irregular Algorithms.
16th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 3-12. February 2011.
Outside of computational science, most problems are formulated
in terms of irregular data structures such as graphs,
trees and sets. Unfortunately, we understand relatively little
about the structure of parallelism and locality in irregular
algorithms. In this paper, we study several algorithms for four
such problems: discrete-event simulation, single-source
shortest path, breadth-first search, and minimal spanning
trees. We show that these algorithms can be classified into two
categories that we call unordered and ordered,
and demonstrate experimentally that there is a trade-off
between parallelism and work efficiency: unordered algorithms
usually have more parallelism than their ordered counterparts
for the same problem, but they may also perform more work.
Nevertheless, our experimental results show that unordered
algorithms typically lead to more scalable implementations,
demonstrating that less work-efficient irregular algorithms may
be better for parallel execution.
M. Burtscher and K. Pingali.
An Efficient CUDA Implementation of the Tree-based Barnes Hut n-Body Algorithm.
Chapter 6 in GPU Computing Gems Emerald Edition, pp. 75-92. January 2011.
This chapter describes the first CUDA implementation of the classical Barnes Hut n-body algorithm
that runs entirely on the GPU. Unlike most other CUDA programs, our code builds an irregular tree-based
data structure and performs complex traversals on it. It consists of six GPU kernels. The kernels
are optimized to minimize memory accesses and thread divergence and are fully parallelized within
and across blocks. Our CUDA code takes 5.2 seconds to simulate one time step with 5,000,000 bodies
on a 1.3 GHz Quadro FX 5800 GPU with 240 cores, which is 74 times faster than an optimized serial
implementation running on a 2.53 GHz Xeon E5540 CPU.
M. Burtscher, B.D. Kim, J. Diamond, J. McCalpin, L. Koesterke, and J. Browne.
PerfExpert: An Easy-to-Use Performance Diagnosis Tool for HPC Applications.
SC 2010 International Conference for High-Performance Computing, Networking, Storage and Analysis, pp. 1-11. November 2010.
HPC systems are notorious for operating at a small fraction of their peak performance, and the ongoing migration to multi-core and multi-socket compute nodes further complicates performance optimization. The readily available performance evaluation tools require considerable effort to learn and utilize. Hence, most HPC application writers do not use them. As remedy, we have developed PerfExpert, a tool that combines a simple user interface with a sophisticated analysis engine to detect probable core, socket, and node-level performance bottlenecks in each important procedure and loop of an application. For each bottleneck, PerfExpert provides a concise performance assessment and suggests steps that can be taken by the programmer to improve performance. These steps include compiler switches and optimization strategies with code examples. We have applied PerfExpert to several HPC production codes on the Ranger supercomputer. In all cases, it correctly identified the critical code sections and provided accurate assessments of their performance.
V. Uzelac, A. Milenkovic, M. Burtscher, M. Milenkovic.
Real-time Unobtrusive Program Execution Trace Compression Using Branch Predictor Events.
International Conference on Compilers, Architectures and Synthesis for Embedded Systems, pp. 97-106. October 2010.
Unobtrusive capturing of program execution traces in real-time is crucial in debugging cyber-physical systems. However, tracing even limited program segments is often cost-prohibitive, requiring wide trace ports and large on-chip trace buffers. This paper introduces a new cost-effective technique for capturing and compressing program execution traces in real time. It uses branch predictor-like structures in the trace module to losslessly compress the traces. This approach results in high compression ratios because it only has to transmit misprediction events to the software debugger. Coupled with an effective variable encoding scheme, our technique requires merely 0.036 bits/instruction of trace port bandwidth (a 28-fold improvement over the commercial state-of-the-art) at a cost of roughly 5,200 logic gates.
X. Sui, D. Nguyen, M. Burtscher, and K. Pingali.
Parallel Graph Partitioning on Multicore Architectures.
Languages and Compilers for Parallel Computing 23rd Annual Workshop. October 2010.
Graph partitioning is a common and frequent preprocessing step in many high-performance parallel applications on distributed-
memory and shared-memory architectures. It is used to distribute graphs
across memory and to improve spatial locality. There are several parallel
implementations of graph partitioning for distributed-memory architectures.
In this paper, we present a parallel graph partitioner that implements a
variation of the Metis partitioner on a shared-memory, multicore architecture. We show that (1) the parallelism in this algorithm is an instance
of the general amorphous data-parallelism pattern, and (2) a parallel implementation can be derived systematically from a sequential specification of the algorithm. The resulting program can be executed in parallel
using the Galois system for optimistic parallelization. We show that the
scalability of this parallel implementation compares favorably with that
of a publicly available, hand-parallelized C implementation of the algorithm, ParMetis, but absolute performance is lower because of missing
sequential optimizations in our system. On a set of 15 large, publicly
available graphs, we achieve an average scalability of 2.98x on 8 cores
with our implementation, compared with 1.77x for ParMetis, and we
achieve an average speedup of 2.80x over Metis, compared with 3.60x
for ParMetis. These results show that our systematic approach for parallelizing irregular algorithms on multicore architectures is promising.
M. Burtscher, B. Livshits, G. Sinha, and B. Zorn.
USENIX Conference on Web Application Development. June 2010.
to reduce network delays and server bandwidth requirements
rely on syntactic changes to the source code and
content encoding using gzip. In this paper, we consider
syntax tree (AST) and transmitting the code in
this format. An AST-based representation has a number
of benefits including reducing parsing time on the
client, fast checking for well-formedness, and, as we
three streams: AST production rules, identifiers, and
literals, each of which is compressed independently.
While previous work has compressed Java programs using
ASTs for network transmission, no prior work has
source code, despite the fact that it is by far the most
commonly transmitted program representation. We show
majority of the total file size and we describe techniques
that compress each stream effectively. On average, compared
to gzip we reduce the production, identifier, and
literal streams by 30%, 12%, and 4%, respectively. Overall,
we reduce total file size by 10% compared to gzip
while, at the same time, benefiting the client by moving
some of the necessary processing to the server.
M. Burtscher and P. Ratanaworabhan.
gFPC: A Self-Tuning Compression Algorithm.
2010 Data Compression Conference, pp. 396-405. March 2010.
This paper presents and evaluates gFPC, a self-tuning implementation of the FPC compression algorithm for double-precision floating-point data. gFPC uses a genetic algorithm to repeatedly reconfigure four hash-function parameters, which enables it to adapt to changes in the data during compression. Self tuning increases the harmonic-mean compression ratio on thirteen scientific datasets from 22% to 28% with sixteen kilobyte hash tables and from 36% to 43% with one megabyte hash tables. Individual datasets compress up to 1.72 times better. The self-tuning overhead reduces the compression speed by a factor of four but makes decompression faster because of the higher compression ratio. On a 2.93 GHz Xeon processor, gFPC compresses at a throughput of almost one gigabit per second and decompresses at over seven gigabits per second.
M. Mendez-Lojo, D. Nguyen, D. Prountzos, X. Sui, M. A. Hassaan, M. Kulkarni, M. Burtscher, and K. Pingali.
Structure-driven Optimizations for Amorphous Data-parallel Programs
15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 3-14. January 2010.
[pdf] (best paper candidate)
Irregular algorithms are organized around pointer-based data structures such as graphs and trees, and they are ubiquitous in applications. Recent work by the Galois project has provided a systematic approach for parallelizing irregular applications based on the idea of optimistic or speculative execution of programs. However, the overhead of optimistic parallel execution can be substantial. In this paper, we show that many irregular algorithms have structure that can be exploited and present three key optimizations that take advantage of algorithmic structure to reduce speculative overheads. We describe the implementation of these optimizations in the Galois system and present experimental results to demonstrate their benefits. To the best of our knowledge, this is the first system to exploit algorithmic structure to optimize the execution of irregular programs.
C. Burstedde, M. Burtscher, O. Ghattas, G. Stadler, T. Tu, and L. C. Wilcox.
ALPS: A Framework for Parallel Adaptive PDE Solution.
Journal of Physics: Conference Series, Vol. 180, 012009, 8pp. August 2009.
Adaptive mesh refinement and coarsening (AMR) is essential for the numerical
solution of partial differential equations (PDEs) that exhibit behavior over a wide range of length
and time scales. Because of the complex dynamic data structures and communication patterns
and frequent data exchange and redistribution, scaling dynamic AMR to tens of thousands of
processors has long been considered a challenge. We are developing ALPS, a library for dynamic
mesh adaptation of PDEs that is designed to scale to hundreds of thousands of compute cores.
Our approach uses parallel forest-of-octree-based hexahedral finite element meshes and dynamic
load balancing based on space-filling curves. ALPS supports arbitrary-order accurate continuous
and discontinuous finite element/spectral element discretizations on general geometries. We
present scalability and performance results for two applications from geophysics: seismic wave
propagation and mantle convection.
V. Uzelac, A. Milenkovic, M. Milenkovic, and M. Burtscher.
Real-time, Unobtrusive, and Efficient Program Execution Tracing with Stream Caches and Last Stream Predictors.
2009 International Conference of Computer Design. October 2009.
This paper introduces a new hardware mechanism
for capturing and compressing program execution traces unobtrusively
in real-time. The proposed mechanism is based on two
structures called stream cache and last stream predictor. We
explore the effectiveness of a trace module based on these structures
and analyze the design space. We show that our trace
module, with less than 600 bytes of state, achieves a trace-port
bandwidth of 0.15 bits/instruction/processor, which is over six
times better than state-of-the-art commercial designs.
J. Diamond, B. D. Kim, M. Burtscher, S. Keckler, K. Pingali, and J. Browne.
Multicore Optimization for Ranger.
2009 TeraGrid Conference. June 2009.
As we enter the multicore era, it is of growing concern that an important class of high-performance parallel applications with near perfect weak scaling across nodes cannot take advantage of more than two cores per chip before saturating the on-chip memory hierarchy. This paper presents a first step towards solving this on-chip scalability issue at the source-code level. We begin with a detailed case study of two important simulation benchmarks on the Ranger supercomputing cluster. The two codes are Homme, a spectral element code, and Mangll, a finite element code. Whereas both applications had previously been classified as a computationally intense and memory light, we instead found both of them to be memory bound, achieving less than 0.5 IPC. Each application has lead to a different strategy for attaining a better memory access/computation balance. Both codes demonstrate substantially worse intra-chip (multicore) scaling than inter-node scaling.
This study is primarily empirical and presents advanced techniques to locate and ameliorate intra-node bottlenecks in applications. As classical optimization for computation is replaced with the more difficult optimization for memory bandwidth, one goal of this study is to make the use of detailed architectural knowledge and low-level program counters more accessible to help a wider programming audience optimize their code. Based on our analysis of Homme, we apply several source transformations to reduce memory bandwidth, including utilizing idle FPUs to replace the use of non fundamental memory arrays with computation. The performance characteristics of Mangll suggest an optimization strategy for better exploiting the available memory bandwidth by rewriting loops so that the compiler can use vector instructions that load and store multiple data values in a single memory transaction.
M. Kulkarni, M. Burtscher, C. Cascaval, and K. Pingali.
Lonestar: A Suite of Parallel Irregular Programs.
2009 IEEE International Symposium on Performance Analysis of Systems and Software, pp. 65-76. April 2009.
Until recently, parallel programming has largely focused on the
exploitation of data-parallelism in dense matrix programs. However,
many important application domains, including meshing, clustering,
simulation, and machine learning, have very different algorithmic
foundations: they require building, computing with, and modifying
large sparse graphs. In the parallel programming literature, these
types of applications are usually classified as irregular
applications, and relatively little attention has been paid to
them. To study and understand the patterns of parallelism and
locality in sparse graph computations better, we are in the process
of building the Lonestar benchmark suite. In this paper, we
characterize the first five programs from this suite, which target
domains like data mining, survey propagation, and design automation.
We show that even such irregular
applications often expose large amounts of parallelism in the form of
amorphous data-parallelism. Our speedup numbers demonstrate
that this new type of parallelism can successfully be exploited on
modern multi-core machines.
M. Burtscher and P. Ratanaworabhan.
pFPC: A Parallel Compressor for Floating-Point Data.
2009 Data Compression Conference, pp. 43-52. March 2009.
This paper describes and evaluates pFPC, a parallel implementation of the lossless FPC compression algorithm for 64-bit floating-point data. pFPC can trade off compression ratio for throughput. For example, on a 4-core 3 GHz Xeon system, it compresses our nine datasets by 18% at a throughput of 1.36 gigabytes per second and by 41% at a throughput of 570 megabytes per second. Decompression is even faster. Our experiments show that the thread count should match or be a small multiple of the data's dimensionality to maximize the compression ratio and the chunk size should be at least equal to the system's page size to maximize the throughput.
P. Ratanaworabhan, M. Burtscher, D. Kirovski, R. Nagpal, K. Pattabiraman, and B. Zorn.
Detecting and Tolerating Asymmetric Races.
14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 173-184. February 2009.
This paper introduces ToleRace, a runtime system that allows programs to detect and even tolerate asymmetric data races. Asymmetric races are race conditions where one thread correctly acquires and releases a lock for a shared variable while another thread improperly accesses the same variable. ToleRace provides approximate isolation in the critical sections of lock-based parallel programs by creating a local copy of each shared variable when entering a critical section, operating on the local copies, and propagating the appropriate copies upon leaving the critical section. We start by characterizing all possible interleavings that can cause races and precisely describe the effect of ToleRace in each case. Then, we study the theoretical aspects of an oracle that knows exactly what type of interleaving has occurred. Finally, we present two software implementations of ToleRace and evaluate them on multithreaded applications from the SPLASH2 and PARSEC suites. Our implementation on top of a dynamic instrumentation tool, which works directly on executables and requires no source code modifications, incurs an overhead of a factor of two on average. Manually adding ToleRace to the source code of these applications results in an average overhead of 6.4 percent.
M. Kulkarni, M. Burtscher, R. Inkulu, C. Cascaval, and K. Pingali.
How Much Parallelism is There in Irregular Applications?
14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 3-14. February 2009.
Irregular programs are programs organized around pointer-based data structures such as trees
and graphs. Recent investigations by the Galois project have shown that many irregular programs
have a generalized form of data-parallelism called amorphous data-parallelism. However,
in many programs, amorphous data-parallelism cannot be uncovered using static techniques, and
its exploitation requires runtime strategies such as optimistic parallel execution. This raises
a natural question: how much amorphous data-parallelism actually exists in irregular programs?
In this paper, we describe the design and implementation of a tool called ParaMeter that produces
parallelism profiles for irregular programs. Parallelism profiles are an abstract measure
of the amount of amorphous data-parallelism at different points in the execution of an algorithm,
independent of implementation-dependent details such as the number of cores, cache sizes,
load-balancing, etc. ParaMeter can also generate constrained parallelism profiles for a fixed
number of cores. We show parallelism profiles for seven irregular applications, and explain how
these profiles provide insight into the behavior of these applications.
M. Burtscher and P. Ratanaworabhan.
FPC: A High-Speed Compressor for Double-Precision Floating-Point Data.
IEEE Transactions on Computers, Vol. 58, No. 1, pp. 18-31. January 2009.
Many scientific programs exchange large quantities of double-precision data between processing nodes and with mass storage devices. Data compression can reduce the number of bytes that need to be transferred and stored. However, data compression is only likely to be employed in high-end computing environments if it does not impede the throughput. This paper describes and evaluates FPC, a fast lossless compression algorithm for linear streams of 64-bit floating-point data. FPC works well on hard-to-compress scientific datasets and meets the throughput demands of high-performance systems. A comparison with five lossless compression schemes, BZIP2, DFCM, FSD, GZIP, and PLMI, on four architectures and thirteen datasets shows that FPC compresses and decompresses one to two orders of magnitude faster than the other algorithms at the same geometric-mean compression ratio. Moreover, FPC provides a guaranteed throughput as long as the prediction tables fit into the L1 data cache. For example, on a 1.6 GHz Itanium 2 server, the throughput is 670 megabytes per second regardless of what data are being compressed.
M. Burtscher, M. Kulkarni, D. Prountzos, and K. Pingali.
On the Scalability of an Automatically Parallelized Irregular Application.
Languages and Compilers for Parallel Computing 21st Annual Workshop, pp. 109-123. July 2008.
Irregular applications, i.e., programs that manipulate pointer-based data structures such as graphs and trees, constitute a challenging target for parallelization because the amount of parallelism is input dependent and changes dynamically. Traditional dependence analysis techniques are too conservative to expose this parallelism. Even manual parallelization is difficult, time consuming, and error prone. The Galois system parallelizes such applications using an optimistic approach that exploits higher-level semantics of abstract data types.
In this paper, we study the performance and scalability of a Galoised, i.e., automatically parallelized, version of Delaunay mesh refinement (DR) on a shared-memory system with 128 CPUs. DR is an important irregular application that is used, e.g., in graphics and finite-element codes. The parallelized program scales to 64 threads, where it reaches a speedup of 25.8. For large numbers of threads, the performance is hampered by the load imbalance and the nonuniform memory latency, both of which grow as the number of threads increases. While these two issues will have to be addressed in future work, we believe our results already show the Galois approach to be very promising.
I. Szczyrba, M. Burtscher, and R. Szczyrba.
On the Role of a Nonlinear Stress-Strain Relation in Brain Trauma.
2008 International Conference on Bioinformatics & Computational Biology, pp. 265-271. July 2008.
We investigate how a nonlinear stress-strain relation (that leads to a stiffening of the brain matter under strain) influences the brain dynamics in traumatic situations. We numerically simulate rapid rotational accelerations and decelerations of a human head using our generalization of the viscoelastic Kelvin-Voigt brain injury model that includes an experimentally established dependency of stress on strain. The rotational loads are expressed in terms of the Brain Injury Criterion, which we developed to extend the translational Head Injury Criterion to arbitrary head motions. Under traumatic loads corresponding to HIC15 = 700, our model predicts that the brain stiffening reduces the maximal strain near the skull by up to 70%, but leads to a distribution of relatively high strain values throughout the brain. We show how the brain's complex geometry enhances the random spatial distribution of high strain values.
P. Ratanaworabhan and M. Burtscher.
Program Phase Detection based on Critical Basic Block Transitions.
2008 IEEE International Symposium on Performance Analysis of Systems and Software, pp. 11-21. April 2008.
Many programs go through phases as they execute. Knowing where these phases begin and end can be beneficial. For example, adaptive architectures can exploit such information to lower their power consumption without much loss in performance. Architectural simulations can benefit from phase information by simulating only a small interval of each program phase, which significantly reduces the simulation time while still yielding results that are representative of complete simulations. This paper presents a lightweight profile-based phase detection technique that marks each phase change boundary in the programs binary at the basic block level with a critical basic block transition (CBBT). It is independent of execution windows and does not explicitly employ the notion of threshold to make a phase change decision. We evaluate the effectiveness of CBBTs for reconfiguring the L1 data cache size and for guiding architectural simulations. Our CBBT method is as effective at dynamically reducing the L1 data cache size as idealized cache reconfiguration schemes are. Using CBBTs to statically determine simulation intervals yields as low a CPI error as the well-known SimPoint method does. In addition, experimental results indicate the CBBTs effectiveness in both the self-trained and cross-trained inputs, demonstrating the CBBTs stability across different program inputs.
I. Szczyrba, M. Burtscher, and R. Szczyrba.
Computational Modeling of Brain Dynamics during Repetitive Head Motions.
2007 International Conference on Modeling, Simulation and Visualization Methods. June 2007.
We numerically model the effects of repetitive human head motions in traumatic scenarios that are associated with severe brain injuries. Our results are based on the linear Kelvin-Voigt brain injury model, which treats the brain matter as a viscoelastic solid, and on our nonlinear generalization of that model, which emulates the gel-like character of the brain tissue. To properly compare the various traumatic scenarios, we use the BIC scale, which we developed to generalize the HIC scale to arbitrary head motions. Our simulations of the brain dynamics in sagittal and horizontal 2D cross-sections of the skull interior indicate that a repetitive reversal of traumatic head rotations can increase the severity/likelihood of brain injuries due to resonance effects.
I. Szczyrba, M. Burtscher, and R. Szczyrba.
A Proposed New Brain Injury Tolerance Criterion Based on the Exchange of Energy between the Skull and the Brain.
2007 Summer Bioengineering Conference. June 2007.
M. Burtscher and P. Ratanaworabhan.
High Throughput Compression of Double-Precision Floating-Point Data.
2007 Data Compression Conference, pp. 293-302. March 2007.
This paper describes FPC, a lossless compression algorithm for linear streams of 64-bit floating-point data. FPC is designed to compress well while at the same time meeting the high throughput demands of scientific computing environments. On our thirteen datasets, it achieves a substantially higher average compression ratio than BZIP2, DFCM, FSD, GZIP, and PLMI. At comparable compression ratios, it compresses and decompresses 8 to 300 times faster than the other five algorithms.
M. Milenkovic, A. Milenkovic, and M. Burtscher.
Algorithms and Hardware Structures for Unobtrusive Real-Time Compression of Instruction and Data Address Traces.
2007 Data Compression Conference, pp. 283-292. March 2007.
Instruction and data address traces are widely used by computer designers for quantitative evaluations of new architectures and workload characterization, as well as by software developers for program optimization, performance tuning, and debugging. Such traces are typically very large and need to be compressed to reduce the storage, processing, and communication bandwidth requirements. However, preexisting general-purpose and trace-specific compression algorithms are designed for software implementation and are not suitable for runtime compression.
Compressing program execution traces at runtime in hardware can deliver insights into the behavior of the system under test without any negative interference with normal program execution. Traditional debugging tools, on the other hand, have to stop the program frequently to examine the state of the processor. Moreover, software developers often do not have access to the entire history of computation that led to an erroneous state. In addition, stepping through a program is a tedious task and may interact with other system components in such a way that the original errors disappear, thus preventing any useful insight. The need for unobtrusive tracing is further underscored by the development of computer systems that feature multiple processing cores on a single chip.
In this paper, we introduce a set of algorithms for compressing instruction and data address traces that can easily be implemented in an on-chip trace compression module and describe the corresponding hardware structures. The proposed algorithms are analytically and experimentally evaluated. Our results show that very small hardware structures suffice to achieve a compression ratio similar to that of a software implementation of gzip while being orders of magnitude faster. A hardware structure with slightly over 2 KB of state achieves a compression ratio of 125.9 for instruction address traces, whereas gzip achieves a compression ratio of 87.4. For data address traces, a hardware structure with 5 KB of state achieves a compression ratio of 6.1, compared to 6.8 achieved by gzip.
I. Ganusov and M. Burtscher.
Future Execution: A Prefetching Mechanism that Uses Multiple Cores to Speed up Single Threads.
ACM Transactions on Architecture and Code Optimization, Vol. 3, No. 4, pp. 424-449. December 2006.
This paper describes future execution (FE), a simple hardware-only
technique to accelerate individual program threads running on
multi-core microprocessors. Our approach uses available idle cores to
prefetch important data for the threads executing on the active
cores. FE is based on the observation that many cache misses are
caused by loads that execute repeatedly and whose address-generating
program slices do not change (much) between consecutive executions. To
exploit this property, FE dynamically creates a prefetching thread for
each active core by simply sending a copy of all committed,
register-writing instructions to an otherwise idle core. The key
innovation is that on the way to the second core, a value predictor
replaces each predictable instruction in the prefetching thread with a
load immediate instruction, where the immediate is the
predicted result that the instruction is likely to produce during its
nth next dynamic execution. Executing this modified instruction
stream (i.e., the prefetching thread) on another core allows to
compute the future results of the instructions that are not directly
predictable, issue prefetches into the shared memory hierarchy, and
thus reduce the primary threads' memory access time.
We demonstrate the viability and effectiveness of future execution by
performing cycle-accurate simulations of a two-way CMP running the
single-threaded SPECcpu2000 benchmark suite. Our mechanism improves
program performance by 12% on average over a baseline that already
includes an optimized hardware stream prefetcher. We further show that
FE is complementary to runahead execution and that the combination of
these two techniques raises the average speedup to 20% above the
performance of the baseline processor with the aggressive stream
P. Ratanaworabhan and M. Burtscher.
Load Instruction Characterization and Acceleration of the BioPerf Programs.
2006 IEEE International Symposium on Workload Characterization, pp. 71-79. October 2006.
The load instructions of some of the bioinformatics applications in the BioPerf
suite possess interesting characteristics: only a few static loads cover almost
the entire dynamic load execution and they almost always hit in the data cache.
Nevertheless, these load instructions represent a major performance bottleneck.
They often precede or follow branches that are hard to predict, which makes
their L1 hit latency difficult to hide even in dynamically scheduled execution
cores. This paper investigates this behavior and suggests simple source-code
transformations to improve the performance of these benchmark programs by up to
I. Ganusov and M. Burtscher.
Efficient Emulation of Hardware Prefetchers via Event-Driven Helper Threading.
2006 International Conference on Parallel Architectures and Compilation Techniques, pp. 144-153. September 2006.
The advance of multi-core architectures provides significant benefits
for parallel and throughput-oriented computing, but the performance of
individual computation threads does not improve and may even suffer a
penalty because of the increased contention for shared resources. This
paper explores the idea of using available general-purpose cores in a
CMP as helper engines for individual threads running on the active
cores. We propose a lightweight architectural framework for efficient
event-driven software emulation of complex hardware accelerators and
describe how this framework can be applied to implement a variety of
prefetching techniques. We demonstrate the viability and effectiveness
of our framework on a wide range of applications from the SPEC CPU2000
and Olden benchmark suites. On average, our mechanism provides
performance benefits within 5% of pure hardware
implementations. Furthermore, we demonstrate that running event-driven
prefetching threads on top of a baseline with a hardware stride
prefetcher yields significant speedups for many programs. Finally, we
show that our approach provides competitive performance improvements
over other hardware approaches for multi-core execution while
executing fewer instructions and requiring considerably less hardware
TCgen 2.0: A Tool to Automatically Generate Lossless Trace Compressors.
Computer Architecture News, Vol. 34, No. 3, pp. 1-8. June 2006.
This tutorial explains the usage of TCgen, a tool for generating portable
lossless trace compressors based on user-provided trace format descriptions.
TCgen automatically translates these descriptions into optimized C source code.
In many cases, the synthesized code is faster and compresses better than BZIP2,
GZIP, MACHE, PDATS II, SBC, SEQUITUR, and VPC3, making it ideal for trace-based
research and teaching environments as well as for trace archives. Version 2
includes several improvements and simplifies the integration of the generated
code with other code through a streamlined API.
M. Burtscher and I. Szczyrba.
Computational Simulation and Visualization of Traumatic Brain Injuries.
2006 International Conference on Modeling, Simulation and Visualization Methods, pp. 101-107. June 2006.
We numerically model the human brain dynamics in realistic 2D cross-sections during
traumatic scenarios corresponding to the Head Injury Criterion's critical value HIC36=1000.
Our simulations are based on the Kelvin-Voigt Partial Differential
Equations that treat the brain tissue as a viscoelastic solid and on our
nonlinear generalization of these linear PDEs that treats the tissue as
an incompressible, viscoelastic fluid. To better visualize and study the
brain motion, we have developed curved-vector-field plots and movies, which allow the simultaneous analysis of
velocity fields with a wide range of magnitudes (as appear in turbulent flows) as well as the corresponding
trajectories of brain-matter parcels. The discovered complex brain tissue
oscillations shed new light on the mechanisms of Closed Head Injuries. Our results may
help establish a universal brain injury tolerance criterion.
P. Ratanaworabhan, J. Ke, and M. Burtscher.
Fast Lossless Compression of Scientific Floating-Point Data.
2006 Data Compression Conference, pp. 133-142. March 2006.
In scientific computing environments, large amounts of floating-point data
often need to be transferred between computers as well as to and from storage
devices. Compression can reduce the number of bits that need to be transferred
and stored. However, the runtime overhead due to compression may be undesirable
in high-performance settings where short communication latencies and high
bandwidths are essential. This paper describes and evaluates a new compression
algorithm that is tailored to such environments. It typically compresses
numeric floating-point values better and faster than other algorithms do. On
our data sets, it achieves compression ratios between 1.2 and 4.2 as well as
compression and decompression throughputs between 2.8 and 5.9 million 64-bit
double-precision numbers per second on a 3GHz Pentium 4 machine.
S. J. Jackson and M. Burtscher.
Self Optimizing Finite State Machines for Confidence Estimators.
2006 Workshop on Introspective Architecture. February 2006.
Modern microprocessors are increasingly relying on speculation as a key
technique for improving performance and reducing power, temperature and energy.
Confidence estimators are necessary to prevent likely misspeculations. These
confidence estimators are commonly realized using a Finite State Machine (FSM)
called a saturating counter.
This paper presents a hardware method that allows FSMs to dynamically optimize
their state transitions and confidence thresholds to improve CPU performance
by automatically adapting to the current workload. The technique further
allows the FSMs to continuously adjust to changing program conditions. These
adaptable, self-optimizing confidence estimators are evaluated as a component
in a value predictor on the C programs from the SPECcpu2000 benchmark suite.
On average, the self-optimizing method achieves a miss rate reduction of 11%
with a maximum of 47%. The self-optimizing confidence estimator can provide a
speedup of 4%.
C. C. Liu, I. Ganusov, M.
Burtscher, and S. Tiwari. Bridging the Processor-Memory Performance Gap with
3D IC Technology. IEEE Design & Test of Computers, Vol. 22, No. 6, pp. 556-564. November/December 2005.
The growing gap between processor
speeds and memory latency is becoming a serious issue. Three-dimensional
integrated circuit (3D IC) technology can bridge the processor-memory gap by
bringing DRAM main memory on chip, thereby reducing main memory latency, and by
allowing for more levels and larger sizes of on-chip caches. Using stream
prefetching and on-chip main memory, only a small L2/L3 cache combination is
needed to obtain excellent performance. With proper system design, a two-layer,
3D microprocessor-memory structure can achieve performance that is within 5-7%
of the performance of a processor with a perfect L2 cache.
M. Burtscher, I. Ganusov, S. J.
Jackson, J. Ke, P. Ratanaworabhan, and N. B. Sam. The VPC Trace-Compression
Algorithms. IEEE Transactions on Computers, Vol. 54, No. 11, pp.
1329-1344. November 2005.
Execution traces, which are used
to study and analyze program behavior, are often so large that they need to be
stored in compressed form. This paper describes the design and implementation
of four value prediction based compression (VPC) algorithms for traces that
record the PC as well as other information about executed instructions. VPC1
directly compresses traces using value predictors, VPC2 adds a second
compression stage, and VPC3 utilizes value predictors to convert traces into
streams that can be compressed better and more quickly than the original
traces. VPC4 introduces further algorithmic enhancements and is automatically
synthesized. Of the 55 SPECcpu2000 traces we evaluate, VPC4 compresses 36
better, decompresses 26 faster and compresses 53 faster than BZIP2, MACHE,
PDATS II, SBC and SEQUITUR. It delivers the highest geometric-mean compression rate,
decompression speed and compression speed because of the predictors' simplicity
and their ability to exploit local value locality. Most other compression
algorithms can only exploit global value locality.
C. C. Liu, I. Ganusov, M. Burtscher,
and S. Tiwari. Improving Microprocessor Performance through 3D IC Technology.
Semiconductor Research Corporation's TECHCON 2005 Conference.
October 2005. [pdf]
The growing gap between logic and
memory performance has forced microprocessors to employ increasingly complex
designs, including out-of-order execution and speculation, as well as memory
hierarchies with more levels and larger caches to hide the main memory latency.
In this paper, we examine how 3D IC technology can be effective in improving
the interactions between processors and their memory. We simulate SPEC2000
integer and floating-point programs on an extended version of the SimpleScalar
v4.0 simulator. The results reveal that stacking an L2 cache on top of the CPU
provides little benefit, whereas stacking main memory shows significant
performance improvements. We also investigate the benefits of very large L2/L3
caches, which can be stacked on top of the CPU through 3D technology. In
addition, we show that stream prefetching is an effective technique to prefetch
data into these large caches.
I. Ganusov and M. Burtscher. Future
Execution: A Hardware Prefetching Technique for Chip Multiprocessors. 2005
International Conference on Parallel Architectures and Compilation Techniques,
pp. 350-360. September 2005. [pdf]
This paper proposes a new hardware
technique for using one core of a CMP to prefetch data for a thread running on
another core. Our approach simply executes a copy of all non-control
instructions in the prefetching core after they have executed in the primary
core. On the way to the second core, each instruction's output is replaced by a
prediction of the likely output that the nth future instance
of this instruction will produce. Speculatively executing the resulting
instruction stream on the second core issues load requests that the main program
will probably reference in the future. Unlike previously proposed thread-based
prefetching approaches, our technique does not need any thread spawning points,
features an adjustable lookahead distance, does not require complicated
analyzers to extract prefetching threads, is recovery-free, and necessitates no
storage for the prefetching threads. We demonstrate that for the SPECcpu2000
benchmark suite, our mechanism significantly increases the prefetching coverage
and improves the primary core's performance by 10% on average over baseline
that already includes an aggressive hardware stream prefetcher. We further show
that our approach works well in combination with runahead execution.
N. B. Sam and M. Burtscher. Improving
Memory System Performance with Energy-Efficient Value Speculation.
Computer Architecture News, Vol. 33, No. 4, pp. 121-127. September 2005.
Microprocessor speeds have been
improving much faster than memory speeds, resulting in the CPU spending a
larger and larger amount of time waiting for data. Processor designers have
employed several ways to improve memory performance, including hierarchical
caching, prefetching, and faster memory chips. Yet, memory accesses still
represent a major performance bottleneck and much remains to be done to
tolerate the increasing memory latencies. Load-value prediction has been shown
to effectively hide some of this latency. However, the hardware required to
achieve good performance is substantial, making load-value prediction
unappealing in light of increasing power constraints. In this paper, we present
a novel predictor that significantly increases CPU performance while at the
same time decreasing the energy consumption of the entire processor relative to
a baseline with a well-performing hybrid load-value predictor.
J. Ke, M. Burtscher, and E.
Speight. Tolerating Message Latency through the Early Release of Blocked
Receives. Euro-Par 2005, pp. 19-29. August 2005. [pdf]
Large message latencies often lead
to poor performance of parallel applications. In this paper, we investigate a
latency-tolerating technique that immediately releases all blocking receives,
even when the message has not yet (completely) arrived, and enforces execution
correctness through page protection. This approach eliminates false message
data dependencies on incoming messages and allows the computation to proceed as
early as possible. We implement and evaluate our early-release technique in the
context of an MPI runtime library. The results show that the execution speed of
MPI applications improves by up to 60% when early release is enabled. Our
approach also enables faster and easier parallel programming as it frees
programmers from adopting more complex nonblocking receives and from tuning
message sizes to explicitly reduce false message data dependencies.
J. Ke, M. Burtscher, and E.
Speight. Reducing Communication Time through Message Prefetching. The
2005 International Conference on Parallel and Distributed Processing Techniques
and Applications, pp. 557-563. June 2005. [pdf]
The latency of large messages
often leads to poor performance of parallel applications. In this paper, we
investigate a novel latency reduction technique where message receivers
prefetch messages from senders before the matching sends are called. When the
send is finally called, only the parts of the message that have changed since
the prefetch need to be transmitted, resulting in a smaller message. Our
message prefetching technique initiates communication while the sender is still
in the computation phase and thus overlaps computation with communication to
hide part of the message latency. We implement and evaluate our technique in
the context of an MPI run-time library. The results show that the execution
speed of five MPI applications improves by up to 24% when message prefetching
M. Burtscher and I. Szczyrba. On
the Role of the Brain's Geometry in Closed Head Injuries. 2005 Summer
Bioengineering Conference. June 2005. [pdf]
M. Burtscher and I. Szczyrba. Numerical
Modeling of Brain Dynamics in Traumatic Situations - Impulsive Translations.
The 2005 International Conference on Mathematics and Engineering Techniques
in Medicine and Biological Sciences, pp. 205-211. June 2005. [pdf]
We numerically model the brain
dynamics during and after impulsive head translations using linear Partial
Differential Equations (PDEs) describing viscoelastic solids and a nonlinear
generalization of these PDEs describing incompressible, viscoelastic fluids.
The brain matter motion and the sensitivity of the solutions with respect to
the skull's geometry differ substantially in the two cases. In particular, the
oscillatory rotational flows, which we discovered to appear in the matter after
the translations stop, are quite distinct. The significance of the results for
understanding the mechanisms of Closed Head Injuries (CHI) is discussed.
I. Ganusov and M. Burtscher. On
the Importance of Optimizing the Configuration of Stream Prefetchers. 3rd
Annual ACM SIGPLAN Workshop on Memory Systems Performance, pp. 54-61. June
This paper provides a detailed
analysis of how the parameters of hardware prefetchers affect the memory system
performance. In particular, we found the configuration of the frequently used
stream prefetcher to have a major impact on the runtime, making parameter
optimizations imperative when comparing a stream prefetcher with other
prefetching techniques. For example, we show that adjusting the prefetch
distance to the optimal value can increase the average speedup over the
SPECcpu2000 benchmark suite from 40% to 70%. Moreover, our investigation of the
performance of runahead prefetching relative to stream prefetching shows that
choosing a non-optimal stream prefetcher as a baseline can distort the results
by as much as a factor of two.
N. B. Sam and M. Burtscher. Complex
Load-Value Predictors: Why We Need Not Bother. Fourth Annual Workshop on
Duplicating, Deconstructing, and Debunking, pp. 16-24. June 2005. [pdf]
Memory accesses continue to
represent a major performance bottleneck and much remains to be done to
tolerate their latencies. A large body of work exists that presents load-value
prediction as an effective means to hide some of the memory latency. To
increase the prediction accuracy and hence the performance, researchers have
proposed more complex and larger predictor design.
This paper re-evaluates load-value predictors and examines the
tradeoffs between predictor size, latency, energy consumption and performance.
We demonstrate that even though complex predictors deliver the highest
accuracy, they are unlikely to be implemented in hardware because they provide
a much worse energy-performance tradeoff than simpler predictors with moderate
N. B. Sam and M. Burtscher. On
the Energy-Efficiency of Speculative Hardware. 2005 ACM International
Conference on Computing Frontiers, pp. 361-370. May 2005.
Microprocessor trends are moving
towards wider architectures and more aggressive speculation. With the
increasing transistor budgets, energy consumption has become a critical design
constraint. To address this problem, several researchers have proposed and
evaluated energy-efficient variants of speculation mechanisms. However, such
hardware is typically evaluated in isolation and its impact on the energy
consumption of the rest of the processor, for example, due to wrong-path
executions, is ignored. Moreover, the available metrics that would provide a
thorough evaluation of an architectural optimization employ somewhat
complicated formulas with hard-to-measure parameters.
In this paper, we introduce a simple method to accurately
compare the energy-efficiency of speculative architectures. Our metric is based
on runtime analysis of the entire processor chip and thus captures the energy
consumption due to the positive as well as the negative activities that arise from
the speculation activities. We demonstrate the usefulness of our metric on the
example of value speculation, where we found some proposed value predictors,
including low-power designs, not to be energy-efficient.
M. Burtscher and N. B. Sam. Automatic
Generation of High-Performance Trace Compressors. 2005 International
Symposium on Code Generation and Optimization, pp. 229-240. March 2005. [pdf]
Program execution traces are
frequently used in industry and academia. Yet, most trace-compression
algorithms have to be re-implemented every time the trace format is changed,
which takes time, is error prone, and often results in inefficient solutions.
This paper describes and evaluates TCgen, a tool that automatically generates
portable, customized, high-performance trace compressors. All the user has to
do is provide a description of the trace format and select one or more
predictors to compress the fields in the trace records. TCgen translates this
specification into C source code and optimizes it for the specified trace
format and predictor algorithms. On average, the generated code is faster and
compresses better than the six other compression algorithms we have tested. For
example, a comparison with SBC, one of the best trace-compression algorithms in
the current literature, shows that TCgen's synthesized code compresses
SPECcpu2000 address traces 23% more, decompresses them 24% faster, and
compresses them 1029% faster.
M. Burtscher and I. Ganusov. Automatic
Synthesis of High-Speed Processor Simulators. 37th Annual
IEEE/ACM International Symposium on Microarchitecture, pp. 55-66. December
Microprocessor simulators are very
popular in research and teaching environments. For example, functional
simulators are often used to perform architectural studies, to fast-forward over
uninteresting code, to generate program traces, and to warm up tables before
switching to a more detailed but slower simulator. Unfortunately, most portable
functional simulators are on the order of 100 times slower than native
execution. This paper describes a set of novel techniques and optimizations to
synthesize portable functional simulators that are only 6.6 times slower on
average (16 times in the worst case) than native execution and 19 times faster
than SimpleScalar's sim-fast on the SPECcpu2000 programs. When simulating a
memory hierarchy, the synthesized code is 2.6 times faster than the equivalent
ATOM code. Our fully automated synthesis approach works without access to
source/assembly code or debug information. It generates C code, integrates
optional user-provided code, performs unwanted-code removal, preserves basic
blocks, generates low-overhead profiles, employs a simple heuristic to
determine potential jump targets, only compiles important instructions, and
utilizes mixed-mode execution, i.e., it interleaves compiled and interpreted
simulation to maximize performance.
J. Ke, M. Burtscher, and E.
Speight. Runtime Compression of MPI Messages to Improve the Performance and
Scalability of Parallel Applications. High-Performance Computing,
Networking and Storage Conference, pp. 59-65. November 2004. [pdf]
applications spend a significant amount of their total execution time exchanging
data between processes, which leads to poor performance in many cases. In this
paper, we investigate message compression in the context of large-scale
parallel message-passing systems to reduce the communication time of individual
messages and to improve the bandwidth of the overall system. We implement and
evaluate the cMPI message-passing library, which quickly compresses messages
on-the-fly with a low enough overhead that a net execution time reduction is
obtained. Our results on six large-scale benchmark applications show that their
execution speed improves by up to 98% when message compression is enabled.
N. B. Sam and M. Burtscher. Exploiting
Type Information in Load-Value Predictors. Second Value-Prediction and
Value-Based Optimization Workshop, pp. 32-39. October 2004. [pdf]
To alleviate the high cost of
main-memory accesses, computer architects have proposed various speculation
mechanisms, including load-value prediction. A load-value predictor forecasts
the result of load instructions, thus allowing dependent instructions to
execute without having to wait for the memory access to complete.
Unfortunately, costly mispredictions hinder the true potential of load-value prediction.
This paper describes several approaches to build more accurate
and faster load-value predictors with little extra hardware by exploiting the
hardware type of the load instructions. Our techniques are easily implementable
in any CPU that supports multiple load types, e.g., byte, word, single, etc.
Compared to traditional load-value predictors, our schemes substantially reduce
the number of mispredictions with a concomitant speedup of the CPU. Moreover,
we show that the type of a load can effectively be used as the selector in a
M. Burtscher. VPC3: A Fast and
Effective Trace-Compression Algorithm. Joint International Conference on
Measurement and Modeling of Computer Systems, pp. 167-176. June 2004.
Trace files are widely used in
research and academia to study the behavior of programs. They are simple to
process and guarantee repeatability. Unfortunately, they tend to be very large.
This paper describes vpc3, a fundamentally new approach to compressing program
traces. Vpc3 employs value predictors to bring out and amplify patterns in the
traces so that conventional compressors can compress them more effectively. In
fact, our approach not only results in much higher compression rates but also
provides faster compression and decompression. For example, compared to bzip2,
vpc3's geometric mean compression rate on SPECcpu2000 store address traces is
18.4 times higher, compression is ten times faster, and decompression is three
M. Burtscher and M. Jeeradit. Compressing
Extended Program Traces Using Value Predictors. 12th
International Conference on Parallel Architectures and Compilation Techniques,
pp. 159-169. September 2003. [pdf]
Trace files record the execution
behavior of programs for future analysis. Unfortunately, nontrivial program
traces tend to be very large and have to be compressed. While good compression
schemes exist for traces that capture only the PCs of the executed
instructions, these schemes can be ineffective on extended traces that include
important additional information such as register values or effective
addresses. Our novel, value-prediction-based approach compresses extended
traces up to 22.8 times better and about two and a half times as well on
average. In addition to the higher compression rate, our lossless single-pass
algorithm has a fixed memory requirement and compresses traces faster than
other algorithms. It achieves compression rates of up to 6170. This paper
describes the design of our compression method and illustrates how value
predictors can be used to effectively compress extended program traces.
I. Szczyrba and M. Burtscher. On
the Role of Ventricles in Diffuse Axonal Injuries. 2003 Summer
Bioengineering Conference, pp. 147-148. June 2003. [pdf]
M. Burtscher and B. G. Zorn. Hybrid
Load-Value Predictors. IEEE Transactions on Computers, Vol. 51, No.
7, pp. 759-774. July 2002. [pdf]
Load instructions diminish
processor performance in two ways. First, due to the continuously widening gap
between CPU and memory speed, the relative latency of load instructions grows
constantly and already slows program execution. Second, memory reads limit the
available instruction-level parallelism because instructions that use the
result of a load must wait for the memory access to complete before they can
start executing. Load-value predictors alleviate both problems by allowing the
CPU to speculatively continue processing without having to wait for load
instructions, which can significantly improve the execution speed.
While several hybrid load-value predictors have been proposed
and found to work well, no systematic study of such predictors exists. In this
paper, we investigate the performance of all hybrids that can be built out of a
register value, a last value, a stride 2-delta, a last four value, and a finite
context method predictor. Our analysis shows that hybrids can deliver 25% more
speedup than the best single-component predictors. An examination of the
individual components of hybrids revealed that predictors with a poor
standalone performance sometimes make excellent components in a hybrid while
combining well-performing individual predictors often does not result in an
Our hybridization study identified the register value + stride
2-delta predictor as one of the best two-component hybrids. It matches or
exceeds the speedup of two-component hybrids from the literature in spite of
its substantially smaller and simpler design. Of all the predictors we studied,
the register value + stride 2-delta + last four value hybrid performs best. It
yields a harmonic-mean speedup over the eight SPECint95 programs of 17.2%.
E. Speight and M. Burtscher. Delphi:
Prediction-Based Page Prefetching to Improve the Performance of Shared Virtual
Memory Systems. The International Conference on Parallel and Distributed
Processing Techniques and Applications, pp. 49-55. June 2002. [pdf]
Software distributed shared memory
(SDSM) systems traditionally exhibit poor performance on applications with
significant fine-grain or false sharing. Relaxed-consistency models and
multiple-writer protocols improve the performance of SDSM systems
significantly, but their performance still lags that of hardware shared memory
implementations. This paper describes Delphi, a system that borrows techniques
from microarchitectural research on value prediction and applies them to software
distributed shared memory. We run a small software predictor on each node in
the Delphi system to predict which virtual pages will be needed in the future.
We use these predictions to prefetch pages, which reduces the number of
accesses to invalid data and thereby reduces expensive network accesses.
Experimental results show that Delphi is able to reduce the number of read
misses to virtual pages by up to 62% on a set of well-known, scientific
benchmarks with minimal runtime overhead in extra processing and memory
requirements. This translates into a 13% reduction in execution time over a
comparable base system that does not employ prediction techniques.
M. Burtscher, A. Diwan, and M.
Hauswirth. Static Load Classification for Improving the Value Predictability
of Data-Cache Misses. ACM SIGPLAN 2002 Conference on Programming
Language Design and Implementation, pp. 222-233. June 2002.
While caches are effective at avoiding
most main-memory accesses, the few remaining memory references are still
expensive. Even one cache miss per one hundred accesses can double a program's
execution time. To better tolerate the data-cache miss latency, architects have
proposed various speculation mechanisms, including load-value prediction. A
load-value predictor guesses the result of a load so that the dependent
instructions can immediately proceed without having to wait for the memory
access to complete.
To use the prediction resources most effectively, speculation
should be restricted to loads that are likely to miss in the cache and that are
likely to be predicted correctly. Prior work has considered hardware- and
profile-based methods to make these decisions. Our work focuses on making these
decisions at compile time. We show that a simple compiler classification is
effective at separating the loads that should be speculated from the loads that
should not. We present results for a number of C and Java programs and
demonstrate that our results are consistent across programming languages and
across program inputs.
M. Burtscher. An Improved Index
Function for (D)FCM Predictors. Computer Architecture News, Vol. 30,
No. 3, pp. 19-24. June 2002.
The most promising value
predictors to date are the finite context method predictor and a recent
improvement thereof, the differential finite context method predictor. Both
predictors comprise two levels and the index into the second level is a
function of the content of the first level. This index function is crucial for
good performance. However, our research shows that the currently used
select-fold-shift-xor function performs poorly on range-limited sequences of
values. For example, it does not predict the results of byte loads well. The
problem with the current function is that it often cannot reach the predictor's
entire second-level table. We propose an improved index function that does not
suffer from this shortcoming. On the 15 SPECcpu2000 C programs, our new index
function improves the average load-value predictability by about 1% to 5%
without increase in predictor size. On byte loads, the improvement is over 6%
for 4096-entry predictors.
M. Burtscher and B. G. Zorn. Hybridizing
and Coalescing Load Value Predictors. 2000 IEEE International Conference
on Computer Design: VLSI in Computers & Processors, pp. 81-92.
September 2000. [pdf]
Most well-performing load value
predictors are hybrids that combine multiple predictors into one. Such hybrids
are often large. To reduce their size and to improve their performance, this
paper presents two storage reduction techniques as well as a detailed analysis
of the interaction between a hybrid's components. We found that state sharing
and simple value compression can shrink the size of a predictor by a factor of
two without compromising the performance. Our component analysis revealed that
combining well-performing predictors does not always yield a good hybrid,
whereas sometimes a poor predictor can make an excellent complement to another
predictor in a hybrid.
Performance evaluations using a cycle-accurate simulator
running SPECint95 show that hybridizing can improve non-hybrids by thirty to
fifty percent over a wide range of sizes. With fifteen kilobytes of state, our
coalesced-hybrid yields a harmonic mean speedup of twelve and fifteen percent
with a re-fetch and a re-execute misprediction recovery mechanism,
respectively, which is higher than the speedup of other predictors we evaluate,
some of which are six times larger.
M. Burtscher. Improving
Context-Based Load Value Prediction. Ph.D. Dissertation, Department of
Computer Science, University of Colorado at Boulder. April 2000. [pdf]
Microprocessors are becoming
faster at such a rapid pace that other components like random access memory
cannot keep up. As a result, the latency of load instructions grows constantly
and already often impedes processor performance.
Fortunately, load instructions frequently fetch predictable
sequences of values. Load value predictors exploit this behavior to predict the
results of load instructions. Because the predicted values are available before
the memory can deliver the true load values, the CPU is able to speculatively
continue processing without having to wait for memory accesses to complete,
which improves the execution speed.
The contributions of this dissertation to the area of load
value prediction include a novel technique to decrease the number of
mispredictions, a predictor design that increases the hardware utilization and
thus the number of correctly predicted load values, a detailed analysis of
hybrid predictor combinations to determine components that complement each
other well, and several approaches to substantially reduce the size of hybrid
load value predictors without affecting their performance.
One result of this research is a very small yet high-performing
load value predictor. Cycle-accurate simulations of a four-way superscalar
microprocessor running SPECint95 show that this predictor outperforms other
predictors from the literature by twenty or more percent over a wide range of
sizes. With about fifteen kilobytes of state, the smallest examined
configuration, it surpasses the speedups delivered by other, five-times larger
predictors both with a re-fetch and a re-execute misprediction recovery
M. Burtscher and B. G. Zorn. Exploring
Last n Value Prediction. The 1999 International Conference on
Parallel Architectures and Compilation Techniques, pp. 66-76. October 1999.
Most load value predictors retain
a large number of previously loaded values for making future predictions. In
this paper we evaluate the trade-off between tall and slim versus short and
wide predictors of the same total size, i.e., between retaining a few values
for a large number of load instructions and many values for a proportionately
smaller number of loads. Our results show, for example, that even modest
predictors holding sixteen kilobytes of values benefit from retaining four
values per load instruction when running SPECint95.
A detailed comparison of eight load value predictors on a
cycle-accurate simulator of a superscalar out-of-order microprocessor shows
that our implementation of a last four value predictor outperforms other
predictors from the literature, often significantly. With 21kB of state, it
yields a harmonic mean speedup of 12.5% with existing re-fetch misprediction
recovery hardware and 13.7% with a not yet realized re-execution recovery
M. Burtscher and B. G. Zorn. Prediction
Outcome History-based Confidence Estimation for Load Value Prediction. Journal
of Instruction-Level Parallelism, Vol. 1. May 1999. [pdf]
Load instructions occasionally
incur very long latencies that can significantly affect system performance.
Load value prediction alleviates this problem by allowing the CPU to speculatively
continue processing without having to wait for the slow memory access to
Current load value predictors can only correctly predict about
forty to seventy percent of the fetched load values. To avoid the cycle-penalty
for mispredictions in the remaining cases, confidence estimators are employed.
They inhibit all predictions that are not likely to be correct.
In this paper we present a novel confidence estimator that is
based on prediction outcome histories. Profiles are used to identify the
high-confidence history patterns. Our confidence estimator is able to trade off
coverage for accuracy and vice-versa with great flexibility and reaches an
average prediction accuracy over SPECint95 of as high as 99.3%. Cycle-accurate
pipeline-level simulations show that a simple last value predictor combined
with our confidence estimator outperforms other predictors, sometimes by over
100%. Furthermore, this predictor is one of two predictors that yield a genuine
speedup for all eight SPECint95 programs.
M. Burtscher and B. G. Zorn. Load
Value Prediction Using Prediction Outcome Histories. University of
Colorado at Boulder, Computer Science, Technical Report CU-CS-873-98.
November 1998. [pdf]
Due to their occasional very long
latency, load instructions are among the slowest instructions of current
high-performance microprocessors. Unfortunately, their long latency also delays
the execution of all the dependent instructions, which can significantly affect
system performance. Load value prediction alleviates this problem by allowing
the CPU to speculatively continue processing without having to wait for the
memory access to complete.
Today's load value predictors can only correctly predict about
40 to 70 percent of the load instructions. Confidence estimators are employed
to estimate how likely a prediction is to be correct and to keep the predictor
from making a (probably incorrect) prediction if the confidence is below a
Despite its simplicity, the adaptive prediction outcome
history-based confidence estimator we present in this paper outperforms other
proposed mechanisms and reaches average prediction accuracies over SPECint95 in
excess of 99%, even with small predictor sizes.
A detailed pipeline-level simulation shows that a load value
predictor equipped with our confidence estimator not only outperforms other
predictors by more than 65% when a re-fetch misprediction recovery policy is
used, but is also the only predictor that yields a genuine speedup for all
eight SPECint95 programs.
M. Burtscher and B. G. Zorn. Profile-Supported
Confidence Estimation for Load-Value-Prediction. Workshop on Profile and
Feedback-Directed Compilation. October 1998. [pdf]
Due to their occasional very long
latency, load instructions are among the slowest instructions of current high-performance
microprocessors. Unfortunately, the long latency of one instruction also delays
the execution of its dependent instructions, which can significantly affect
system performance. Load value prediction alleviates this problem by allowing
the CPU to speculatively continue processing without having to wait for the
slow memory access to complete.
The potential of today's load value predictors for making
correct predictions is only about 40 to 70 percent. Confidence estimators are
employed to estimate how likely a prediction is to be correct and to keep the
predictor from making a (probably incorrect) prediction if the confidence is
below a preset threshold. Setting this threshold to a higher level increases
the probability that the attempted predictions will be correct (higher
accuracy) but results also in more missed opportunities for making correct
predictions (lower coverage).
The profile-supported confidence estimator we present in this
paper is able to trade off coverage for accuracy and vice-versa with previously
unseen flexibility and reaches an average prediction accuracy over SPECint95 of
as high as 99.4%.
A detailed pipeline-level simulation shows that our
profile-supported load value predictor not only outperforms other predictors, but
is also the only predictor that yields a speedup at all when a re-fetch
misprediction recovery policy is used.