Garbage-First Garbage Collection
Original link:
https://www.researchgate.net/publication/221032945_Garbage-First_garbage_collection
PDF version (free download): https://download.csdn.net/download/wabiaozia/19642705
Conference: Proceedings of the 4th International Symposium on Memory Management, ISMM 2004, Vancouver, BC, Canada, October 24-25, 2004
Reading the original is recommended; it includes the figures and comments.
Garbage-First Garbage Collection
David Detlefs, Christine Flood, Steve Heller, Tony Printezis
Sun Microsystems, Inc.
1 Network Drive, Burlington, MA 01803, USA
{david.detlefs, christine.flood, steve.heller, tony.printezis}@sun.com
ABSTRACT
Garbage-First is a server-style garbage collector, targeted
for multi-processors with large memories, that meets a soft
real-time goal with high probability, while achieving high
throughput. Whole-heap operations, such as global mark-
ing, are performed concurrently with mutation, to prevent
interruptions proportional to heap or live-data size. Concurrent
marking both provides collection “completeness” and
identifies regions ripe for reclamation via compacting evacuation.
This evacuation is performed in parallel on multiprocessors,
to increase throughput.
Categories and Subject Descriptors:
D.3.4 [Programming Languages]: Processors—Memory
management (garbage collection)
General Terms: Languages, Management, Measurement,
Performance
Keywords: concurrent garbage collection, garbage collection,
garbage-first garbage collection, parallel garbage collection,
soft real-time garbage collection
1. INTRODUCTION
The Java programming language is widely used in large
server applications. These applications are characterized by
large amounts of live heap data and considerable thread-
level parallelism, and are often run on high-end multipro-
cessors. Throughput is clearly important for such applica-
tions, but they often also have moderately stringent (though
soft) real-time constraints, e.g. in telecommunications, call-
processing applications (several of which are now imple-
mented in the Java language), delays of more than a fraction
of a second in setting up calls are likely to annoy customers.
The Java language specification mandates some form of
garbage collection to reclaim unused storage. Traditional
“stop-world” collector implementations will affect an appli-
cation’s responsiveness, so some form of concurrent and/or
incremental collector is necessary. In such collectors, lower
pause times generally come at a cost in throughput. Therefore,
we allow users to specify a soft real-time goal, stating
their desire that collection consume no more than x ms of
any y ms time slice. By making this goal explicit, the collector
can try to keep collection pauses as small and infrequent
as necessary for the application, but not so low as to decrease
throughput or increase footprint unnecessarily. This paper
describes the Garbage-First collection algorithm, which attempts
to satisfy such a soft real-time goal while maintaining
high throughput for programs with large heaps and high
allocation rates, running on large multi-processor machines.
Copyright 2004 Sun Microsystems, Inc. All rights reserved.
ISMM’04, October 24–25, 2004, Vancouver, British Columbia, Canada.
ACM 1-58113-945-4/04/0010.
The Garbage-First collector achieves these goals via sev-
eral techniques. The heap is partitioned into a set of equal-
sized heap regions, much like the train cars of the Mature-
Object Space collector of Hudson and Moss [22]. However,
whereas the remembered sets of the Mature-Object Space
collector are unidirectional, recording pointers from older
regions to younger but not vice versa, Garbage-First remem-
bered sets record pointers from all regions (with some excep-
tions, described in sections 2.4 and 4.6). Recording all ref-
erences allows an arbitrary set of heap regions to be chosen
for collection. A concurrent thread processes log records cre-
ated by special mutator write barriers to keep remembered
sets up-to-date, allowing shorter collections.
Garbage-First uses a snapshot-at-the-beginning (hence-
forth SATB) concurrent marking algorithm [36]. This pro-
vides periodic analysis of global reachability, providing com-
pleteness, the property that all garbage is eventually iden-
tified. The concurrent marker also counts the amount of
live data in each heap region. This information informs the
choice of which regions are collected: regions that have little
live data and more garbage yield more efficient collection,
hence the name “Garbage-First”. The SATB marking algo-
rithm also has very small pause times.
Garbage-First employs a novel mechanism to attempt to
achieve the real-time goal. Recent hard real-time collectors
[4, 20] have satisfied real-time constraints by making col-
lection interruptible at the granularity of copying individ-
ual objects, at some time and space overhead. In contrast,
Garbage-First copies objects at the coarser granularity of
heap regions. The collector has a reasonably accurate model
of the cost of collecting a particular heap region, as a func-
tion of quickly-measured properties of the region. Thus,
the collector can choose a set of regions that can be col-
lected within a given pause time limit (with high probabil-
ity). Further, collection is delayed if necessary (and possi-
ble) to avoid violating the real-time goal. Our belief is that
abandoning hard real-time guarantees for this softer best-
effort style may yield better throughput and space usage,
an appropriate tradeoff for many applications.
2. DATA STRUCTURES/MECHANISMS
In this section, we describe the data structures and mech-
anisms used by the Garbage-First collector.
2.1 Heap Layout/Heap Regions/Allocation
The Garbage-First heap is divided into equal-sized heap
regions, each a contiguous range of virtual memory. Alloca-
tion in a heap region consists of incrementing a boundary,
top, between allocated and unallocated space. One region
is the current allocation region from which storage is being
allocated. Since we are mainly concerned with multiprocessors,
mutator threads allocate only thread-local allocation
buffers, or TLABs, directly in this heap region, using
a compare-and-swap, or CAS, operation. They then allocate
objects privately within those buffers, to minimize allocation
contention. When the current allocation region is filled, a
new allocation region is chosen. Empty regions are orga-
nized into a linked list to make region allocation a constant
time operation.
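The allocation scheme above can be sketched as follows. This is a minimal, single-threaded illustration, not the collector's implementation: the names Region, Heap, and allocate_tlab and the region size are assumptions made for the example, and the CAS that a real multi-threaded allocator would use to advance top is replaced by a plain update.

```python
# Illustrative sketch only: Region, Heap and REGION_SIZE are hypothetical
# names and values, not taken from the paper's implementation.
REGION_SIZE = 1024  # bytes per heap region (assumed value)


class Region:
    def __init__(self, base):
        self.bottom = base
        self.top = base                  # boundary between allocated and free space
        self.end = base + REGION_SIZE

    def allocate(self, size):
        """Bump-allocate size bytes; return the start address, or None if full."""
        if self.top + size > self.end:
            return None
        addr = self.top
        self.top += size                 # a real allocator would do this with a CAS
        return addr


class Heap:
    def __init__(self, num_regions):
        # Empty regions are kept in a list; taking one is a constant-time pop.
        self.free = [Region(i * REGION_SIZE) for i in reversed(range(num_regions))]
        self.current = self.free.pop()   # the current allocation region

    def allocate_tlab(self, size):
        """Carve a thread-local buffer out of the current allocation region."""
        addr = self.current.allocate(size)
        if addr is None:                 # region filled: choose a new one
            self.current = self.free.pop()
            addr = self.current.allocate(size)
        return addr
```

A mutator thread would then allocate objects privately inside the buffer it received, so only TLAB requests contend on the shared region.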
Larger objects may be allocated directly in the current
allocation region, outside of TLABs. Objects whose size
exceeds 3/4 of the heap region size, however, are termed
humongous. Humongous objects are allocated in dedicated
(contiguous sequences of) heap regions; these regions contain
only the humongous object.¹
2.2 Remembered Set Maintenance
Each region has an associated remembered set, which indicates
all locations that might contain pointers to (live) objects
within the region. Maintaining these remembered sets
requires that mutator threads inform the collector when they
make pointer modifications that might create inter-region
pointers. This notification uses a card table [21]: every 512-byte
card in the heap maps to a one-byte entry in the card
table. Each thread has an associated remembered set log, a
current buffer or sequence of modified cards. In addition,
there is a global set of filled RS buffers.
The remembered sets themselves are sets (represented by
hash tables) of cards. Actually, because of parallelism, each
region has an associated array of several such hash tables,
one per parallel GC thread, to allow these threads to update
remembered sets without interference. The logical contents
of the remembered set is the union of the sets represented
by each of the component hash tables.
The remembered set write barrier is performed after the
pointer write. If the code performs the pointer write x.f = y,
and registers rX and rY contain the object pointer values
x and y respectively, then the pseudo-code for the barrier is:
1| rTmp := rX XOR rY
2| rTmp := rTmp >> LogOfHeapRegionSize
3| // Below is a conditional move instr
4| rTmp := (rY == NULL) then 0 else rTmp
5| if (rTmp == 0) goto filtered
6| call rs_enqueue(rX)
7| filtered:
This barrier uses a filtering technique mentioned briefly by
Stefanović et al. in [32]. If the write creates a pointer from
an object to another object in the same heap region, a case
we expect to be common, then it need not be recorded in a
remembered set. The exclusive-or and shift of lines 1 and 2
mean that rTmp is zero after the second line if x and y are
in the same heap region. Line 4 adds filtering of stores of
null pointers. If the store passes these filtering checks, then
it creates an out-of-region pointer. The rs_enqueue routine
reads the card table entry for the object head rX. If that
entry is already dirty, nothing is done. This reduces work
for multiple stores to the same card, a common case because
of initializing writes. If the card table entry is not dirty,
then it is dirtied, and a pointer to the card is enqueued
on the thread’s remembered set log. If this enqueue fills
the thread’s current log buffer (which holds 256 elements by
default), then that buffer is put in the global set of filled
buffers, and a new empty buffer is allocated.
¹ Humongous objects complicate the system in various ways.
We will not cover these complications in this paper.
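The filtering and logging steps of the barrier can be sketched in Python as follows. This is a sketch under stated assumptions: the class RememberedSetLogger, the 1 MB region size, and the use of a set in place of the byte-per-card table are choices made for the example, and the null check is done first rather than via a conditional move, which gives the same result.

```python
# Illustrative sketch only; names and the region size are assumptions.
LOG_REGION_SIZE = 20        # assumed: 1 MB heap regions
LOG_CARD_SIZE = 9           # 512-byte cards, as stated in the text
LOG_BUFFER_CAPACITY = 256   # default log buffer size from the text
NULL = 0


class RememberedSetLogger:
    def __init__(self):
        self.dirty_cards = set()    # stands in for the one-byte-per-card table
        self.current_log = []       # this thread's remembered set log
        self.filled_buffers = []    # global set of filled RS buffers

    def rs_enqueue(self, x_addr):
        card = x_addr >> LOG_CARD_SIZE
        if card in self.dirty_cards:
            return                  # already dirty: nothing to do
        self.dirty_cards.add(card)
        self.current_log.append(card)
        if len(self.current_log) == LOG_BUFFER_CAPACITY:
            self.filled_buffers.append(self.current_log)
            self.current_log = []   # start a new empty buffer

    def write_barrier(self, x_addr, y_addr):
        """Run after the store x.f = y; True if the store passed both filters."""
        if y_addr == NULL:
            return False            # stores of null are filtered
        if (x_addr ^ y_addr) >> LOG_REGION_SIZE == 0:
            return False            # x and y are in the same heap region
        self.rs_enqueue(x_addr)
        return True
```

Two stores through the same card enqueue that card only once, which is the saving the text attributes to initializing writes.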
The concurrent remembered set thread waits (on a con-
dition variable) for the size of the filled RS buffer set to
reach a configurable initiating threshold (the default is 5
buffers). The remembered set thread processes the filled
buffers as a queue, until the length of the queue decreases
to 1/4 of the initiating threshold. For each buffer, it pro-
cesses each card table pointer entry. Some cards are hot:
they contain locations that are written to frequently. To
avoid processing hot cards repeatedly, we try to identify the
hottest cards, and defer their processing until the next evac-
uation pause (see section 2.3 for a description of evacuation
pauses). We accomplish this with a second card table that
records the number of times the card has been dirtied since
the last evacuation pause (during which this table, like the
card table proper, is cleared). When we process a card we
increment its count in this table. If the count exceeds a hotness
threshold (default 4), then the card is added to a circular
buffer called the hot queue (of default size 1 K). This queue
is processed like a log buffer at the start of each evacuation
pause, so it is empty at the end. If the circular buffer is full,
then a card is evicted from the other end and processed.
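A minimal sketch of this hot-card deferral is shown below. The class name HotCardFilter and its methods are hypothetical, a dict stands in for the second card table, and the sketch does not try to prevent the same card from being queued twice, which the text does not specify.

```python
from collections import deque

# Illustrative sketch only; HotCardFilter is a hypothetical name.
HOTNESS_THRESHOLD = 4    # default from the text
HOT_QUEUE_SIZE = 1024    # "1 K" default from the text


class HotCardFilter:
    def __init__(self, threshold=HOTNESS_THRESHOLD, capacity=HOT_QUEUE_SIZE):
        self.threshold = threshold
        self.capacity = capacity
        self.counts = {}           # dirtying counts since the last evacuation pause
        self.hot_queue = deque()   # circular buffer of deferred cards

    def filter(self, card):
        """Return a card to process now, or None if processing is deferred."""
        self.counts[card] = self.counts.get(card, 0) + 1
        if self.counts[card] <= self.threshold:
            return card                          # not hot: process immediately
        evicted = None
        if len(self.hot_queue) == self.capacity:
            evicted = self.hot_queue.popleft()   # full: evict from the other end
        self.hot_queue.append(card)
        return evicted

    def drain_at_pause(self):
        """At an evacuation pause: hand back deferred cards and clear the counts."""
        cards = list(self.hot_queue)
        self.hot_queue.clear()
        self.counts.clear()
        return cards
```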
Thus, the concurrent remembered set thread processes a
card if it has not yet reached the hotness threshold, or if it
is evicted from the hot queue. To process a card, the thread
first resets the corresponding card table entry to the clean
value, so that any concurrent modifications to objects on the
card will redirty and re-enqueue the card.² It then examines
the pointer fields of all the objects whose modification
may have dirtied the card, looking for pointers outside the
containing heap region. If such a pointer is found, the card
is inserted into the remembered set of the referenced region.
We use only a single concurrent remembered set thread, to
introduce parallelism when idle processors exist. However, if
this thread is not sufficient to service the rate of mutation,
the filled RS buffer set will grow too large. We limit the
size of this set; mutator threads attempting to add further
buffers perform the remembered set processing themselves.
2.3 Evacuation Pauses
At appropriate points (described in section 3.4), we stop
the mutator threads and perform an evacuation pause. Here
we choose a collection set of regions, and evacuate the regions
by copying all their live objects to other locations in
the heap, thus freeing the collection set regions. Evacu-
ation pauses exist to allow compaction: object movement
must appear atomic to mutators. This atomicity is costly
to achieve in truly concurrent systems, so we move objects
during incremental stop-world pauses instead.
² On non-sequentially consistent architectures memory barriers
may be necessary to prevent reorderings.
[Figure 1: Remembered Set Operation. Panels (a) and (b) show regions R0 to R3, the remembered sets RemSet1 and RemSet3, and objects a, b, c around a GC.]
If a multithreaded program is running on a multiprocessor
machine, using a sequential garbage collector can create a
performance bottleneck. We therefore strive to parallelize
the operations of an evacuation pause as much as possible.
The first step of an evacuation pause is to sequentially
choose the collection set (section 3 details the mechanisms
and heuristics of this choice). Next, the main parallel phase
of the evacuation pause starts. GC threads compete to claim
tasks such as scanning pending log buffers to update remem-
bered sets, scanning remembered sets and other root groups
for live objects, and evacuating the live objects. There is no
explicit synchronization among tasks other than ensuring
that each task is performed by only one thread.
The evacuation algorithm is similar to the one reported by
Flood et al. [18]. To achieve fast parallel allocation we use
GCLABs, i.e. thread-local GC allocation buffers (similar
to mutator TLABs). Threads allocate an object copy in
their GCLAB and compete to install a forwarding pointer
in the old image. The winner is responsible for copying
the object and scanning its contents. A technique based on
work-stealing [1] provides load balancing.
Figure 1 illustrates the operation of an evacuation pause.
Step A shows the remembered set of a collection set region
R1 being used to find pointers into the collection set. As will
be discussed in section 2.6, pointers from objects identified
as garbage via concurrent marking (object a in the figure)
are not followed.
2.4 Generational Garbage-First
Generational garbage collection [34, 26] has several ad-
vantages, which a collection strategy ignores at its peril.
Newly allocated objects are usually more likely to become
garbage than older objects, and newly allocated objects are
also more likely to be the target of pointer modifications,
if only because of initialization. We can take advantage of
both of these properties in Garbage-First in a flexible way.
We can heuristically designate a region as young when it is
chosen as a mutator allocation region. This commits the re-
gion to be a member of the next collection set. In return for
this loss of heuristic flexibility, we gain an important benefit:
remembered set processing is not required to consider mod-
ifications in young regions. Reachable young objects will be
scanned after they are evacuated as a normal part of the
next evacuation pause.
Note that a collection set can contain a mix of young and
non-young regions. Other than the special treatment for
remembered sets described above, both kinds of regions are
treated uniformly.
Garbage-First runs in two modes: generational and pure
garbage-first. Generational mode is the default, and is used
for all performance results in this paper.
There are two further “submodes” of generational mode:
evacuation pauses can be fully or partially young. A fully-
young pause adds all (and only) the allocated young regions
to the collection set. A partially-young pause chooses all
the allocated young regions, and may add further non-young
regions, as pause times allow (see section 3.2.1).
2.5 Concurrent Marking
Concurrent marking is an important component of the
system. It provides collector completeness without impos-
ing any order on region choice for collection sets (as, for ex-
ample, the Train algorithm of Hudson and Moss [22] does).
Further, it provides the live data information that allows
regions to be collected in “garbage-first” order. This section
describes our concurrent marking algorithm.
We use a form of snapshot-at-the-beginning concurrent
marking [36]. In this style, marking is guaranteed to iden-
tify garbage objects that exist at the start of marking, by
marking a logical “snapshot” of the object graph existing at
that point. Objects allocated during marking are necessar-
ily considered live. But while such objects must be consid-
ered marked, they need not be traced: they are not part of
the object graph that exists at the start of marking. This
greatly decreases concurrent marking costs, especially in a
system like Garbage-First that has no physically separate
young generation treated specially by marking.
2.5.1 Marking Data Structures
We maintain two marking bitmaps, labeled previous and
next. The previous marking bitmap is the last bitmap in
which marking has been completed. The next marking bitmap
may be under construction. The two physical bitmaps swap
logical roles as marking is completed. Each bitmap contains
one bit for each address that can be the start of an ob-
ject. With the default 8-byte object alignment, this means
1 bitmap bit for every 64 heap bits. We use a mark stack
to hold (some of) the gray (marked but not yet recursively
scanned) objects.
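The bitmap sizing follows from the alignment: one bit per 8-byte-aligned address means one bitmap bit per 64 heap bits. A small sketch of the index arithmetic, using a hypothetical MarkBitmap class backed by a bytearray:

```python
# Illustrative sketch only; MarkBitmap is a hypothetical name.
OBJECT_ALIGNMENT = 8  # bytes; the default alignment stated in the text


class MarkBitmap:
    def __init__(self, heap_base, heap_bytes):
        self.base = heap_base
        nbits = heap_bytes // OBJECT_ALIGNMENT   # one bit per possible object start
        self.bits = bytearray((nbits + 7) // 8)

    def _index(self, addr):
        offset = addr - self.base
        if offset % OBJECT_ALIGNMENT != 0:
            raise ValueError("address cannot be the start of an object")
        return offset // OBJECT_ALIGNMENT

    def mark(self, addr):
        i = self._index(addr)
        self.bits[i >> 3] |= 1 << (i & 7)

    def is_marked(self, addr):
        i = self._index(addr)
        return bool(self.bits[i >> 3] & (1 << (i & 7)))
```

For a 4096-byte heap this allocates 64 bytes of bitmap, a space overhead of 1/64 per bitmap.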
2.5.2 Initial Marking Pause/Concurrent Marking
The first phase of a marking cycle clears the next marking
bitmap. This is performed concurrently. Next, the initial
marking pause stops all mutator threads, and marks all ob-
jects directly reachable from the roots (in the generational
mode, initial marking is in fact piggy-backed on a fully-
young evacuation pause). Each heap region contains two
top at mark start (TAMS) variables, one for the previous
marking and one for the next. We will refer to these as
the previous and next TAMS variables. These variables are
used to identify objects allocated during a marking phase.
Objects above a TAMS value are considered implicitly
marked with respect to the marking to which the TAMS
variable corresponds, but allocation is not slowed down by
marking bitmap updates. The initial marking pause iterates
over all the regions in the heap, copying the current value of
top in each region to the next TAMS of that region. Steps
A and D of figure 2 illustrate this. Steps B and E of this
figure show that objects allocated during concurrent marking
are above the next TAMS value, and are thus considered
live. (The bitmaps physically cover the entire heap, but are
shown only for the portions of regions for which they are
relevant.)
[Figure 2: Implicit marking via TAMS variables. Steps A to F show, for one region, Bottom, Top, PrevTAMS, NextTAMS, PrevBitmap and NextBitmap across the Initial Marking, Remark, and Cleanup/GC Pauses phases.]
Now mutator threads are restarted, and the concurrent
phase of marking begins. This phase is very similar to the
concurrent marking phase of [29]: a “finger” pointer iterates
over the marked bits. Objects higher than the finger are
implicitly gray; gray objects below the finger are represented
with a mark stack.
2.5.3 Concurrent Marking Write Barrier
The mutator may be updating the pointer graph as the
collector is tracing it. This mutation may remove a pointer
in the “snapshot” object graph, violating the guarantee on
which SATB marking is based. Therefore, SATB mark-
ing requires mutator threads to record the values of pointer
fields before they are overwritten. Below we show pseudo-
code for the marking write barrier for a write of the value
in rY to offset FieldOffset in an object whose address is in
rX. Its operation is explained below.
1| rTmp := load(rThread + MarkingInProgressOffset)
2| if (!rTmp) goto filtered
3| rTmp := load(rX + FieldOffset)
4| if (rTmp == null) goto filtered
5| call satb_enqueue(rTmp)
6| filtered:
The actual pointer store [rX, FieldOffset] := rY would
follow. The first two lines of the barrier skip the remainder
if marking is not in progress; for many programs, this filters
out a large majority of the dynamically executed barriers.
Lines 3 and 4 load the value in the object field, and check
whether it is null. It is only necessary to log non-null values.
In many programs the majority of pointer writes are initial-
izing writes to previously-null fields, so this further filtering
is quite effective.
The satb_enqueue operation adds the pointer value to
the thread’s current marking buffer. As with remembered
set buffers, if the enqueue fills the buffer, it then adds it to
the global set of completed marking buffers. The concurrent
marking thread checks the size of this set at regular intervals,
interrupting its heap traversal to process filled buffers.
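The marking barrier and satb_enqueue can be sketched as follows. Objects are modeled as Python dicts and None plays the role of null; the class name SatbMutator and the buffer capacity are assumptions, since the text gives no size for marking buffers.

```python
# Illustrative sketch only; SatbMutator and BUFFER_CAPACITY are assumptions.
class SatbMutator:
    BUFFER_CAPACITY = 256  # assumed value

    def __init__(self, completed_buffers):
        self.marking_in_progress = False
        self.buffer = []                      # this thread's current marking buffer
        self.completed = completed_buffers    # shared set of completed buffers

    def satb_enqueue(self, old_value):
        self.buffer.append(old_value)
        if len(self.buffer) == self.BUFFER_CAPACITY:
            self.completed.append(self.buffer)
            self.buffer = []

    def write_field(self, obj, field, new_value):
        """Pre-write barrier followed by the actual pointer store."""
        if self.marking_in_progress:          # first filter: is marking running?
            old = obj.get(field)
            if old is not None:               # second filter: skip null old values
                self.satb_enqueue(old)
        obj[field] = new_value                # the store itself
```

Logging the overwritten value, not the new one, is what preserves the snapshot: any object reachable at the start of marking is either still reachable or has been recorded in a buffer.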
2.5.4 Final Marking Pause
A marking phase is complete when concurrent marking
has traversed all the marked objects and completely drained
the mark stack, and when all logged updates have been pro-
cessed. The former condition is easy to detect; the latter is
harder, since mutator threads “own” log buffers until they
fill them. The purpose of the stop-world final marking pause
is to reach this termination condition reliably, while all mu-
tator threads are stopped. It is very simple: any unpro-
cessed completed log buffers are processed as above, and
the partially completed per-thread buffers are processed in
the same way. This process is done in parallel, to guard
against programs with many mutator threads with partially
filled marking log buffers causing long pause times or paral-
lel scaling issues.3
2.5.5 Live Data Counting and Cleanup
Concurrent marking also counts the amount of marked
data in each heap region. Originally, this was done as part of
the marking process. However, evacuation pauses that move
objects that are live must also update the per-region live
data count. When evacuation pauses are performed in par-
allel, and several threads are evacuating objects to the same
region, updating this count consistently can be a source of
parallel contention. While a variety of techniques could have
ameliorated this scaling problem, updating the count repre-
sented a significant portion of evacuation pause cost even
with a single thread. Therefore, we opted to perform all live
data counting concurrently. When final marking is com-
plete, the GC thread re-examines each region, counting the
bytes of marked data below the TAMS value associated with
the marking. This is something like a sweeping phase, but
note that we find live objects by examining the marking
bitmap, rather than by traversing dead objects.
³ Note that there is no need to trace again from the roots:
we examined the roots in the initial marking pause, marking
all objects directly reachable in the original object graph.
All objects reachable from the roots after the final marking
pause has completed marking must be live with respect to
the marking.
As will be discussed in section 2.6, evacuation pauses occurring
during marking may increase the next TAMS value
of some heap regions. So a final stop-world cleanup pause
is necessary to reliably finish this counting process. This
cleanup phase also completes marking in several other ways.
It is here that the next and previous bitmaps swap roles: the
newly completed bitmap becomes the previous bitmap, and
the old one is available for use in the next marking. In
addition, since the marking is complete, the value in the
next TAMS field of each region is copied into the previous
TAMS field, as shown in steps C and F of figure 2. Liveness
queries rely on the previous marking bitmap and the pre-
vious TAMS, so the newly-completed marking information
will now be used to determine object liveness. In figure 2,
light gray indicates objects known to be dead. Steps D and
E show how the results of a completed marking may be used
while a new marking is in progress.
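The TAMS and bitmap bookkeeping described in sections 2.5.2 and 2.5.5 can be sketched for a single region as follows. The class RegionMarkState is hypothetical, and Python sets of addresses stand in for the two marking bitmaps.

```python
# Illustrative sketch only; RegionMarkState is a hypothetical name.
class RegionMarkState:
    def __init__(self, bottom):
        self.bottom = bottom
        self.top = bottom
        self.prev_tams = bottom
        self.next_tams = bottom
        self.prev_marked = set()   # last completed marking
        self.next_marked = set()   # marking under construction

    def initial_mark(self):
        # Initial marking pause: copy top into the next TAMS.
        self.next_tams = self.top

    def is_marked_next(self, addr):
        # Objects at or above next TAMS were allocated during marking
        # and are implicitly marked.
        return addr >= self.next_tams or addr in self.next_marked

    def is_live(self, addr):
        # Liveness queries use the previous bitmap and previous TAMS.
        return addr >= self.prev_tams or addr in self.prev_marked

    def cleanup(self):
        # Cleanup pause: the bitmaps swap roles, and next TAMS becomes previous.
        self.prev_marked, self.next_marked = self.next_marked, set()
        self.prev_tams = self.next_tams
```

Before the first cleanup, prev_tams equals bottom, so every object in the region is treated as live; only a completed marking can prove an object dead.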
Finally, the cleanup phase sorts the heap regions by ex-
pected GC efficiency. This metric divides the marking’s es-
timate of garbage reclaimable by collecting a region by the
cost of collecting it. This cost is estimated based on a num-
ber of factors, including the estimated cost of evacuating the
live data and the cost of traversing the region’s remembered
set. (Section 3.2.1 discusses our techniques for estimating
heap region GC cost.) The result of this sorting is an initial
ranking of regions by desirability for inclusion into collec-
tion sets. As discussed in section 3.3, the cost estimate can
change over time, so this estimate is only initial.
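The ranking step can be illustrated with a short sketch. The tuple layout and the function names are assumptions; the cost figure is taken as given here, not derived.

```python
# Illustrative sketch only; the (name, garbage_bytes, cost_ms) layout is assumed.
def gc_efficiency(garbage_bytes, est_cost_ms):
    """Estimated reclaimable garbage divided by estimated collection cost."""
    return garbage_bytes / est_cost_ms


def rank_regions(regions):
    """Sort (name, garbage_bytes, est_cost_ms) tuples, most efficient first."""
    return sorted(regions, key=lambda r: gc_efficiency(r[1], r[2]), reverse=True)
```

For example, a region with 500 bytes of garbage costing 1 ms ranks ahead of one with 900 bytes costing 3 ms, because it reclaims more per unit of pause time.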
Regions containing no live data whatsoever are imme-
diately reclaimed in this phase. For some programs, this
method can reclaim a significant fraction of total garbage.
2.6 Evacuation Pauses and Marking
In this section we discuss the two major interactions be-
tween evacuation pauses and concurrent marking.
First, an evacuation pause never evacuates an object that
was proven dead in the last completed marking pass. Since
the object is dead, it obviously is not referenced from the
roots, but it might be referenced from other dead objects.
References within the collection set are followed only if the
referring object is found to be live. References from outside
the collection set are identified by the remembered sets; ob-
jects identified by the remembered sets are ignored if they
have been shown to be dead.
Second, when we evacuate an object during an evacuation
pause, we need to ensure that it is marked correctly, if nec-
essary, with respect to both the previous and next markings.
It turns out that this is quite subtle and tricky. Unfortu-
nately, due to space restrictions, we cannot give here all the
details of this interaction.
We allow evacuation pauses to occur when the marking
thread’s marking stack is non-empty: if we did not, then
marking could delay a desired evacuation pause by an arbi-
trary amount. The marking stack entries may refer to ob-
jects in the collection set. Since these objects are marked in
the current marking, they are clearly live with respect to the
previous marking, and may be evacuated by the evacuation
pause. To ensure that marking stack entries are updated
properly, we treat the marking stack as a source of roots.
2.7 Popular Object Handling
A popular object is one that is referenced from many locations.
This section describes special handling for popular
objects that achieves two goals: smaller remembered sets
and a more efficient remembered set barrier.
We reserve a small prefix of the heap regions to contain
popular objects. We attempt to identify popular objects
quickly, and isolate them in this prefix, whose regions are
never chosen for collection sets.
When we update region remembered sets concurrently,
regions whose remembered set sizes have reached a given
threshold are scheduled for processing in a popularity pause;
such growth is often caused by popular objects. The popularity
pause first constructs an approximate reference count
for each object, then evacuates objects whose count reaches
an individual object popularity threshold to the regions in
the popular prefix; non-popular objects are evacuated to the
normal portion of the heap. If no individual popular objects
are found, no evacuation is performed, but the per-region
threshold is doubled, to prevent a loop of such pauses.
There are two benefits to this treatment of popular ob-
jects. Since we do not relocate popular objects once they
have been segregated, we do not have to maintain remem-
bered sets for popular object regions. We show in section
4.6 that popular object handling eliminates a majority of
remembered set entries for one of our benchmarks. We also
save remembered set processing overhead by filtering out
pointers to popular objects early on. We modify the step
of the remembered set write barrier described in 2.2 that
filtered out null pointers to instead do:
if (rY < PopObjBoundary) goto filtered
This test filters both popular objects and also null pointers
(using zero to represent null). Section 4.6 also measures the
effectiveness of this filtering.
While popular object handling can be very beneficial, it
is optional, and disabled in the performance measurements
described in section 4, except for the portion of section 4.6
that explicitly investigates popularity. As discussed in that
section, popular objects effectively decrease remembered set
sizes for some applications, but not for all; this mechanism
may be superseded in the future.
3. HEURISTICS
In the previous section we defined the mechanisms used
in the Garbage-First collector. In this section, we describe
the heuristics that control their application.
3.1 User Inputs
A basic premise of the Garbage-First collector is that the
user specifies two things:
• an upper bound on space usage.
• a soft real-time goal, in which the user gives: a time
slice, and a max GC time within a time slice that
should be devoted to stop-world garbage collection.
In addition, there is currently a flag indicating whether the
collector should use generational mode (see section 2.4). In
the future, we hope to make this choice dynamically.
When attempting to meet the soft real-time goal, we only
take into account the stop-world pauses and ignore any concurrent
GC processes. On the relatively large multiprocessors
we target, concurrent GC can be considered a fairly
evenly distributed “tax” on mutator operation. The soft
real-time applications we target determine their utilization
requirements by benchmarking, not by program analysis, so
the concurrent GC load will be factored into this testing.
3.2 Satisfying a Soft Real-Time Goal
The soft real-time goal is treated as a primary constraint.
(We should be clear that Garbage-First is not a hard real-
time collector. We meet the soft real-time goal with high
probability, but not with absolute certainty.) Meeting such
a goal requires two things: ensuring that individual pauses
do not exceed the pause time bound, and scheduling pauses
so that only the allowed amount of GC activity occurs in
any time slice. Below we discuss techniques for meeting
these requirements.
3.2.1 Predicting Evacuation Pause Times
To meet a given pause time bound, we carefully choose
a collection set that can be collected in the available time.
We have a model for the cost of an evacuation pause that
can predict the incremental cost of adding a region to the
collection set. In generational mode, some number of young
regions are “mandatory” members of the collection set. In
the fully-young submode, the entire collection set is young.
Since the young regions are mandatory, we must predict in
advance the number of such regions that will yield a collection
of the desired duration. We track the fixed and
per-region costs of fully-young collections via historical averaging,
and use these estimates to determine the number
of young regions allocated between fully-young evacuation
pauses.
In the partially-young mode, we may add further non-
young regions if pause times permit. In this and pure garbage-
first modes, we stop choosing regions when the “best” re-
maining one would exceed the pause time bound.
In the latter cases, we model the cost of an evacuation
pause with collection set cs as follows:
$$V(cs) = V_{fixed} + U \cdot d + \sum_{r \in cs} \left( S \cdot rsSize(r) + C \cdot liveBytes(r) \right)$$
The variables in this expression are as follows:
• V(cs) is the cost of collecting collection set cs;
• Vfixed represents fixed costs, common to all pauses;
• U is the average cost of scanning a card, and d is the
number of dirty cards that must be scanned to bring
remembered sets up-to-date;
• S is the cost of scanning a card from a remembered set for
pointers into the collection set, and rsSize(r) is the
number of card entries in r’s remembered set; and
• C is the cost per byte of evacuating (and scanning)
a live object, and liveBytes(r) is an estimate of the
number of live bytes in region r.
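The cost model and the stopping rule for region choice can be sketched together. This is a sketch under stated assumptions: regions are modeled as (rsSize, liveBytes) pairs, the candidate list is assumed to be already sorted best-first, and the function names are made up for the example.

```python
# Illustrative sketch only; function names and the region encoding are assumed.
def pause_cost(cs, v_fixed, u, d, s, c):
    """V(cs) for a collection set given as (rs_size, live_bytes) pairs."""
    return v_fixed + u * d + sum(s * rs_size + c * live for rs_size, live in cs)


def choose_collection_set(candidates, pause_bound, v_fixed, u, d, s, c):
    """Add regions best-first; stop when the best remaining one would
    push the predicted pause over the bound."""
    chosen = []
    for region in candidates:
        if pause_cost(chosen + [region], v_fixed, u, d, s, c) > pause_bound:
            break
        chosen.append(region)
    return chosen
```

With v_fixed = 1.0, u = 0.5, d = 2, s = 0.25 and c = 0.125, an empty collection set already costs 2.0, and each region (4, 8) adds a further 2.0; the fixed and dirty-card terms therefore bound how many regions fit under a given pause limit.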
The parameters Vfixed, U, S, and C depend on the algorithm
implementation and the host platform, and somewhat
on the characteristics of the particular application, but we
hope they should be fairly constant within a run. We start
with initial estimates of these quantities that lead to con-
servative pause time estimates, and refine these estimates
with direct measurement over the course of execution. To
account for variation in application behavior, we measure
the standard deviation of the sequences of measurements
for each parameter, and allow the user to input a further
confidence parameter, which adjusts the assumed value for
the constant parameter by a given number of standard devi-
ations (in the conservative direction). Obviously, increasing
the confidence parameter may decrease the throughput.
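The following is a minimal sketch of a confidence-adjusted parameter estimate. It assumes a plain arithmetic mean and population standard deviation over all samples; the paper does not specify the averaging scheme beyond "historical averaging", and the class name `ParamEstimate` is hypothetical.

```python
import statistics


class ParamEstimate:
    """Running estimate of one cost parameter (for example U, S, or C)."""

    def __init__(self, initial):
        # Conservative starting value, used until measurements arrive.
        self.initial = initial
        self.samples = []

    def add(self, measurement):
        self.samples.append(measurement)

    def value(self, confidence=0.0):
        """Mean plus `confidence` standard deviations (the conservative direction)."""
        if not self.samples:
            return self.initial
        mean = statistics.mean(self.samples)
        sigma = statistics.pstdev(self.samples)
        return mean + confidence * sigma


est = ParamEstimate(initial=10.0)
est.add(2.0)
est.add(4.0)
```

With these two samples the mean is 3.0 and the standard deviation is 1.0, so a confidence parameter of 1 yields an assumed cost of 4.0. A larger assumed cost makes the collector choose smaller collection sets, which is why raising the confidence parameter can lower throughput.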
The remaining parameters d, rsSize(r), and liveBytes(r)
are all quantities that can be calculated (or at least esti-
mated) efficiently at the start of an evacuation pause. For
liveBytes(r), if a region contained allocated objects when
the last concurrent marking cycle started, this marking pro-
vides an upper bound on the number of live bytes in the
region. We (conservatively) use this upper bound as an es-
timate. For regions allocated since the last marking, we
keep a dynamic estimate of the survival rates of recently al-
located regions, and use this rate to compute the expected
number of live bytes. As above, we track the variance of sur-
vival rates, and adjust the estimate according to the input
confidence parameter.
We have less control over the duration of the stop-world
pauses associated with concurrent marking. We strive there-
fore to make these as short as possible, to minimize the ex-
tent to which they limit the real-time specifications achiev-
able. Section 4.3 shows that we are largely successful and
marking-related pauses are quite short.
3.2.2 Scheduling Pauses to Meet a Real-Time Goal
Above we have shown how we meet a desired pause time.
The second half of meeting a real-time constraint is keep-
ing the GC time in a time slice from exceeding the allowed
limit. An important property of our algorithm is that as
long as there is sufficient space, we can always delay col-
lection activity: we can delay any of the stop-world phases
of marking, and we can delay evacuation pauses, expand-
ing the heap as necessary, at least until the maximum heap
size is reached. When a desired young-generation evacua-
tion pause in generational mode must be postponed, we can
allow further mutator allocation in regions not designated
young, to avoid exceeding the pause time bound in the sub-
sequent evacuation pause.
This scheduling is achieved by maintaining a queue of
start/stop time pairs for pauses that have occurred in the
most recent time slice, along with the total stop-world time
in that time slice. It is easy to insert pauses at one end of
this queue (which updates the start of the most recent time
slice, and may cause pauses at the other end to be deleted as
now-irrelevant). With this data structure, we can efficiently
answer two forms of query:
•What is the longest pause that can be started now
without violating the real-time constraint?
•What is the earliest point in the future at which a
pause of a given duration may be started?
We use these primitives to decide how long to delay activities
that other heuristics described below would schedule.
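One way to realize such a queue and its two queries is sketched below. It assumes integer millisecond timestamps, a goal of at most `max_gc` ms of GC in any `time_slice` ms, and a pause history that already complies with the goal. The class and method names are hypothetical, and the linear scans favor clarity over the efficiency the text refers to.

```python
from collections import deque


class PauseQueue:
    def __init__(self, max_gc, time_slice):
        self.max_gc = max_gc
        self.time_slice = time_slice
        self.pauses = deque()  # (start, end) pairs in ms, oldest first

    def record(self, start, end):
        """Insert a completed pause and drop pauses that are now irrelevant."""
        self.pauses.append((start, end))
        while self.pauses and self.pauses[0][1] <= end - self.time_slice:
            self.pauses.popleft()

    def gc_time_in(self, lo, hi):
        """Total recorded stop-world time that overlaps the interval [lo, hi]."""
        return sum(max(0, min(e, hi) - max(s, lo)) for s, e in self.pauses)

    def _fits(self, start, duration):
        # The tightest window is the time slice that ends when the new pause ends.
        end = start + duration
        past = self.gc_time_in(end - self.time_slice, start)
        return duration + past <= self.max_gc

    def longest_pause_at(self, now):
        """Longest pause (ms) that can start at `now` without violating the goal."""
        for duration in range(self.max_gc, -1, -1):
            if self._fits(now, duration):
                return duration
        return 0

    def earliest_start(self, now, duration):
        """Earliest time >= now at which a pause of `duration` ms may start."""
        if duration > self.max_gc:
            raise ValueError("pause is longer than the allowed GC time per slice")
        start = now
        while not self._fits(start, duration):
            start += 1
        return start


q = PauseQueue(max_gc=50, time_slice=200)
q.record(0, 30)
```

With a 30 ms pause recorded at time 0, only 20 ms of GC remains available at time 100, and a 40 ms pause has to wait until time 180, when all but 10 ms of the earlier pause has left the 200 ms window.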
3.3 Collection Set Choice
This section describes the order in which we consider (non-
young, “discretionary”) regions for addition to the collection
set in partially-young pauses. As in concurrent marking
(section 2.5.5), let the expected GC efficiency of a region
be the estimated amount of garbage in it divided by the
estimated cost of collecting it. We estimate the garbage us-
ing the same liveBytes(r) calculation we used in estimating
the cost. Note that a region may have a small amount of
live data, yet still have low estimated efficiency because of
a large remembered set. Our plan is to consider the regions
in order of decreasing estimated efficiency.
At the end of marking we sort all regions containing marked
objects according to efficiency. However, this plan is com-
plicated somewhat by the fact that the cost of collecting
a region may change over time: in particular, the region’s
remembered set may grow. So this initial sorting is consid-
ered approximate. At the start of an evacuation pause, a
fixed-size prefix of the remaining available regions, ordered
according to this initial efficiency estimate, is resorted ac-
cording to the current efficiency estimate, and then the re-
gions in this new sorting are considered in efficiency order.
When picking regions, we stop when the pause time limit
would be exceeded, or when the surviving data is likely to
exceed the space available. However, we do have a mecha-
nism for handling evacuation failure when necessary.
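The selection loop can be sketched as follows. This is a simplified illustration: each region carries a precomputed cost, the re-sorting of a fixed-size prefix is omitted, and all names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    garbage_bytes: int   # estimated garbage in the region
    live_bytes: int      # estimated live data that must be evacuated
    cost_ms: float       # estimated cost of collecting the region

    @property
    def efficiency(self):
        return self.garbage_bytes / self.cost_ms


def choose_collection_set(young, candidates, fixed_cost_ms, pause_bound_ms, free_bytes):
    """Young regions are mandatory; non-young regions are added by efficiency."""
    cset = list(young)
    cost = fixed_cost_ms + sum(r.cost_ms for r in young)
    live = sum(r.live_bytes for r in young)
    for r in sorted(candidates, key=lambda r: r.efficiency, reverse=True):
        over_time = cost + r.cost_ms > pause_bound_ms
        over_space = live + r.live_bytes > free_bytes
        if over_time or over_space:
            break  # stop at the first region that does not fit
        cset.append(r)
        cost += r.cost_ms
        live += r.live_bytes
    return cset


young = [Region("y", 900, 100, 10.0)]
old = [Region("a", 800, 200, 10.0), Region("b", 500, 500, 25.0), Region("c", 300, 100, 5.0)]
chosen = choose_collection_set(young, old, fixed_cost_ms=10.0,
                               pause_bound_ms=50.0, free_bytes=1000)
```

Here the candidates are considered in the order a, c, b. Region b would push the predicted pause to 60 ms, so selection stops before it.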
3.4 Evacuation Pause Initiation
The evacuation pause initiation heuristics differ signifi-
cantly between the generational and pure garbage-first modes.
First, in all modes we choose a fraction h of the total
heap size M: we call h the hard margin, and H = (1 − h)M
the hard limit. Since we use evacuation to reclaim
space, we must ensure that there is sufficient “to-space” to
evacuate into; the hard margin ensures that this space exists.
Therefore, when allocated space reaches the hard limit an
evacuation pause is always initiated, even if doing so would
violate the soft real-time goal. Currently h is a constant
but, in the future, it should be dynamically adjusted. E.g.,
if we know the maximum pause duration P, and have an
estimate of the per-byte copying cost C, we can calculate
the maximum “to-space” that could be copied into in the
available time.
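As a worked illustration of this calculation, the sketch below computes the hard limit and the to-space that could be filled within a pause of duration P at per-byte copying cost C. Deriving a hard margin by dividing that to-space by the heap size is an assumption made here; the text only says that h should be adjusted dynamically. All names and values are hypothetical.

```python
def hard_limit(heap_bytes, hard_margin):
    """H = (1 - h) * M."""
    return (1.0 - hard_margin) * heap_bytes


def to_space_budget(max_pause_ms, copy_cost_ms_per_byte):
    """Maximum number of bytes that could be copied in one pause: P / C."""
    return max_pause_ms / copy_cost_ms_per_byte


def dynamic_hard_margin(heap_bytes, max_pause_ms, copy_cost_ms_per_byte):
    """Assumed rule: reserve just enough to-space for one maximal pause."""
    budget = to_space_budget(max_pause_ms, copy_cost_ms_per_byte)
    return min(1.0, budget / heap_bytes)


heap = 1024 * 2**20                                 # 1024 MB heap (hypothetical)
budget = to_space_budget(100.0, 1e-6)               # about 1e8 bytes in 100 ms
margin = dynamic_hard_margin(heap, 100.0, 1e-6)     # roughly 9% of the heap
```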
In fully-young generational mode, we maintain a dynamic
estimate of the number of young-generation regions that
leads to an evacuation pause that meets the pause time
bound, and initiate a pause whenever this number of young
regions is allocated. For steady-state applications, this leads
to a natural period between evacuation pauses. Note that we
can meet the soft real-time goal only if this period exceeds
its time slice. In partially-young mode, on the other hand,
we do evacuation pauses as often as the soft real-time goal
allows. Doing pauses at the maximum allowed frequency
minimizes the number of young regions collected in those
pauses, and therefore maximizes the number of non-young
regions that may be added to the collection set.
A generational execution starts in fully-young mode. Af-
ter a concurrent marking pass is complete, we switch to
partially-young mode, to harvest any attractive non-young
regions identified by the marking. We monitor the efficiency
of collection; when the efficiency of the partial collections de-
clines to the efficiency of fully-young collections, we switch
back to fully-young mode. This rule is modified somewhat
by a factor that reflects the heap occupancy: if the heap is
nearly full, we continue partially-young collections even af-
ter their efficiency declines. The extra GC work performed
decreases the heap occupancy.
3.5 Concurrent Marking Initiation
In generational mode, our heuristic for triggering concurrent
marking is simple. We define a second margin, the soft margin u,
and call H − uM the soft limit. If the heap occupancy exceeds
the soft limit before an evacuation pause, then marking
is initiated as soon after the pause as the soft real-time
goal allows. As with the hard margin, the soft margin is
presently a constant, but will be calculated dynamically in
the future. The goal in sizing the soft margin is to allow con-
current marking to complete without reaching the hard mar-
gin (where collections can violate the soft real-time goal).
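The two thresholds can be sketched together as below. The function names and the sample margins are hypothetical; the text states only that both margins are currently constants.

```python
def limits(heap_size, hard_margin, soft_margin):
    """Return (hard limit H, soft limit H - u*M) for heap size M."""
    hard = (1.0 - hard_margin) * heap_size
    soft = hard - soft_margin * heap_size
    return hard, soft


def triggers(occupancy, heap_size, hard_margin, soft_margin):
    """Which actions the current heap occupancy calls for."""
    hard, soft = limits(heap_size, hard_margin, soft_margin)
    return {
        # At the hard limit an evacuation pause starts even if the goal is violated.
        "force_evacuation_pause": occupancy >= hard,
        # Above the soft limit, marking starts once the real-time goal allows it.
        "request_marking": occupancy > soft,
    }


# Hypothetical 1000 MB heap with h = 0.1 and u = 0.2: H = 900 MB, soft limit = 700 MB.
state = triggers(occupancy=750, heap_size=1000, hard_margin=0.1, soft_margin=0.2)
```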
In pure garbage-first mode, the interaction between evac-
uation pause and marking initiation is more interesting. We
attempt to maximize the cumulative efficiency of sequences
consisting of a marking cycle and subsequent evacuation
pauses that benefit from the information the marking pro-
vides. The marking itself may collect some completely empty
regions, but at considerable cost. The first collections after
marking collect very desirable regions, making the cumu-
lative efficiency of the marking and collections rise. Later
pauses, however, collect less desirable regions, and the cumu-
lative efficiency reaches a maximum and begins to decline.
At this point we initiate a new marking cycle: in a steady-
state program, all marking sequences will be similar, and
the overall collection efficiency of the execution will be the
same as that of the individual sequences, which have been
maximized.
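A minimal sketch of this rule follows, assuming cumulative efficiency is the total garbage reclaimed divided by the total cost of the marking cycle plus the pauses so far. The function name and the sample figures are hypothetical.

```python
def pause_that_triggers_marking(marking_garbage, marking_cost, pauses):
    """Index of the first pause that lowers cumulative efficiency, or None.

    `pauses` is a sequence of (garbage_reclaimed, cost) pairs for the
    evacuation pauses that follow one marking cycle.
    """
    garbage, cost = marking_garbage, marking_cost
    best = garbage / cost
    for i, (g, c) in enumerate(pauses):
        garbage += g
        cost += c
        efficiency = garbage / cost
        if efficiency < best:
            return i  # the maximum has passed: start a new marking cycle
        best = efficiency
    return None


# Marking reclaims little at high cost; early pauses are very productive.
history = [(90, 50), (60, 50), (30, 50)]
trigger = pause_that_triggers_marking(10, 100, history)
```

In this example the cumulative efficiency is 0.10 after marking, then about 0.67, 0.80, and 0.76 after the three pauses, so the third pause (index 2) is where a new marking cycle would be initiated.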
4. PERFORMANCE EVALUATION
We have implemented Garbage-First as part of a pre-1.5-
release version of the Java HotSpot Virtual Machine. We
ran on a Sun V880, which has eight 750 MHz UltraSPARC III
processors. We used the Solaris 10 operating environment.
We use the “client” Java system; it was easier to modify the
simpler client compiler to emit our modified write barriers.
We sometimes compare with the ParNew + CMS config-
uration of the production Java HotSpot VM, which couples
a parallel copying young-generation collector with a con-
current mark-sweep old generation [29]. This is the Java
HotSpot VM’s best configuration for applications that re-
quire low pause times, and is widely used by our customers.
4.1 Benchmark Applications
We use the following two benchmark applications:
• SPECjbb. This is intended to model a business-oriented
object database. We run with 8 warehouses,
and report throughput and maximum transaction times.
In this configuration its maximum live data size is
approximately 165 M.
• telco. A benchmark, based on a commercial product,
provided to exercise a telephone call-processing
application. It requires a maximum 500 ms latency in
call setup, and determines the maximum throughput
(measured in calls/sec) that a system can support. Its
maximum live data size is approximately 100 M.
For all runs we used 8 parallel GC threads and a confi-
dence parameter of 1σ. We deliberately chose not to use the
SPECjvm98 benchmark suite. The Garbage-First collector
has been designed for heavily multi-threaded applications
with large heap sizes and these benchmarks have neither at-
tribute. Additionally, we did not run the benchmarks with
a non-incremental collector; long GC pauses cause telco to
time out and halt.
Benchmark/       Soft real-time goal compliance statistics by heap size
configuration    V%       avgV%    wV%      V%       avgV%    wV%      V%       avgV%    wV%
SPECjbb          512 M                      640 M                      768 M
G-F (100/200)    4.29%    36.40%   100.00%  1.73%    12.83%   63.31%   1.68%    10.94%   69.67%
G-F (150/300)    1.20%    5.95%    15.29%   1.51%    4.01%    20.80%   1.78%    3.38%    8.96%
G-F (150/450)    1.63%    4.40%    14.32%   3.14%    2.34%    6.53%    1.23%    1.53%    3.28%
G-F (150/600)    2.63%    2.90%    5.38%    3.66%    2.45%    8.39%    2.09%    2.54%    8.65%
G-F (200/800)    0.00%    0.00%    0.00%    0.34%    0.72%    0.72%    0.00%    0.00%    0.00%
CMS (150/450)    23.93%   82.14%   100.00%  13.44%   67.72%   100.00%  5.72%    28.19%   100.00%
telco            384 M                      512 M                      640 M
G-F (50/100)     0.34%    8.92%    35.48%   0.16%    9.09%    48.08%   0.11%    12.10%   38.57%
G-F (75/150)     0.08%    11.90%   19.99%   0.08%    5.60%    7.47%    0.19%    3.81%    9.15%
G-F (75/225)     0.44%    2.90%    10.45%   0.15%    3.31%    3.74%    0.50%    1.04%    2.07%
G-F (75/300)     0.65%    2.55%    8.76%    0.42%    0.57%    1.07%    0.63%    1.07%    2.91%
G-F (100/400)    0.57%    1.79%    6.04%    0.29%    0.37%    0.54%    0.44%    1.52%    2.73%
CMS (75/225)     0.78%    35.05%   100.00%  0.54%    32.83%   100.00%  0.60%    26.39%   100.00%
Table 1: Compliance with soft real-time goals for SPECjbb and telco
4.2 Compliance with the Soft Real-time Goal
In this section we show how successfully we meet the user-defined
soft real-time goal (see section 3.1 for an explanation of
why we only report on stop-world activity and exclude
concurrent GC overhead from our measurements). Present-
ing this data is itself an interesting problem. One option
is to calculate the Minimum Mutator Utilization (or MMU)
curve of each benchmark execution, as defined by Cheng and
Blelloch [10]. This is certainly an interesting measurement,
especially for applications and collectors with hard real-time
constraints. However, its inherently worst-case nature fails
to give any insight into how often the application failed to
meet its soft real-time goal and by how much.
Table 1 reports three statistics for each soft real-time
goal/heap size pair, based on consideration of all the possi-
ble time slices of the duration specified in the goal:
• V%: the percentage of time slices that are violating,
i.e. whose GC time exceeds the max GC time of the
soft real-time goal;
• avgV%: the average amount by which violating time
slices exceed the max GC time, expressed as a percentage
of the desired minimum mutator time in a time
slice (i.e. time slice minus max GC time); and
• wV%: the excess GC time in the worst time slice, i.e.
the GC time in the time slice(s) with the most GC time
minus max GC time, expressed again as a percentage
of the desired minimum mutator time in a time slice.
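The three statistics can be approximated with a sliding window over a 1 ms quantized timeline, as in the sketch below. The function name, the input representation (integer millisecond pause intervals), and the tiny example are assumptions for illustration, not the measurement code used for Table 1.

```python
def compliance_stats(pauses, run_ms, max_gc, time_slice):
    """Return (V%, avgV%, wV%) over every time slice of length `time_slice` ms."""
    in_gc = [0] * run_ms
    for start, end in pauses:            # pause covers ms start .. end-1
        for t in range(start, end):
            in_gc[t] = 1
    prefix = [0]                          # prefix sums give GC time per window in O(1)
    for flag in in_gc:
        prefix.append(prefix[-1] + flag)

    min_mutator = time_slice - max_gc     # desired minimum mutator time per slice
    windows = run_ms - time_slice + 1
    excesses = []
    worst = 0
    for s in range(windows):
        gc = prefix[s + time_slice] - prefix[s]
        worst = max(worst, gc)
        if gc > max_gc:
            excesses.append(gc - max_gc)

    v = 100.0 * len(excesses) / windows
    avg_v = 100.0 * (sum(excesses) / len(excesses)) / min_mutator if excesses else 0.0
    w_v = 100.0 * max(0, worst - max_gc) / min_mutator
    return v, avg_v, w_v


# Toy run: 10 ms long, one 3 ms pause, goal of at most 1 ms GC per 4 ms slice.
v, avg_v, w_v = compliance_stats([(2, 5)], run_ms=10, max_gc=1, time_slice=4)
```

In the toy run, four of the seven possible slices violate the goal, with excesses of 1, 2, 2, and 1 ms against a desired mutator time of 3 ms.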
There are ways to accurately calculate these measurements.
However, we chose, instead, to approximate them by quan-
tizing in 1 ms increments. We test five soft real-time goals,
for each of three heap sizes, for each benchmark. The statis-
tics shown in the table were gathered from a single bench-
mark run, as it was not obvious how to combine results from
several runs in a useful way. The combination of soft real-
time goals that we chose to use allows us to observe how
the behavior of Garbage-First changes for different max GC
times, but also for a fixed max GC time and different time
slices. Rows are labeled with the collector used and the
real-time goal “(X/Y),” where Y is the time slice, and X
the max GC time within the time slice, both in ms. Each
group of three columns is devoted to a heap size. We also
show the behavior of the Java HotSpot VM’s ParNew +
CMS collector. This collector does not accept an input real-
time goal, so we chose a young generation size that gave
average young-generation pauses comfortably less than the
pause time allowed by one of the Garbage-First real-time
goals, then evaluated the ParNew + CMS collector against
the chosen goal, which is indicated in the row label.
Considering the Garbage-First configurations, we see that
we succeed quite well at meeting the given soft real-time
goal. For telco, which is a real commercial application, the
goal is violated in less than 0.5% of time slices, in most cases,
less than 0.7% in all cases, and by small average amounts
relative to the allowed pause durations. SPECjbb is some-
what less successful, but still restricts violations to under 5%
in all cases, around 2% or less in most. Generally speaking,
the violating percentage increases as time slice durations in-
crease. On the other hand, the average violation amount and
excess GC time in the worst time slices tend to decrease as
the time slice durations increase (as a small excess is divided
by a larger desired mutator time). Garbage-First struggled
with the 100/200 configuration in SPECjbb, having a max-
imal wV% for the 512 M heap size and a wV% of over 60%
for the other two heap sizes. It behaved considerably better,
however, in the rest of the configurations. In fact, SPECjbb
with the 200/800 configuration gave the best overall results,
causing no violations at all for two of three heap sizes shown
and very few in the third one.
The CMS configurations generally have considerably worse
behavior in all metrics. For telco, the worst-case violations
are caused by the final phase of concurrent marking. For
SPECjbb, this final marking phase also causes maximal
violations, but in addition, the smaller heap size induces
full (non-parallel) stop-world GCs whose durations are con-
siderably greater than the time slices.
4.3 Throughput/GC Overhead
Figures 3 and 4 show application throughput for various
heap sizes and soft real-time goals. We also show ParNew
+ CMS (labeled “CMS”) performance for young-generation
sizes selected to yield average collection times (in the appli-
cation steady state) similar to the pause times of one of the
Garbage-First configurations. The SPECjbb results shown
are averages over three separate runs; the telco results are
averages over fifteen separate runs (in an attempt to smooth
out the high variation that the telco results exhibit).
[Figure 3: Throughput measurements for SPECjbb. Throughput (1000 ops/sec) versus heap size (MB) for G-F 100/200, 150/300, 150/450, 150/600, 200/800, and CMS.]
[Figure 4: Throughput measurements for telco. Throughput (calls/sec) versus heap size (MB) for G-F 50/100, 75/150, 75/225, 75/300, 100/400, and CMS.]
In the case of the SPECjbb results, CMS actually forced
full heap collections in heap sizes of 640 M and below (hence,
the decrease in throughput for those heap sizes). Garbage-
First only reverted to full heap collections in the 384 M
case. For the larger heap sizes, Garbage-First seems to have
very consistent behaviour, with the throughput being only
slightly affected by the heap size. In the Garbage-First con-
figurations, it is clear that increasing the max GC time im-
proves throughput, but varying the time slice for a fixed
max GC time has little effect on throughput. Comparing
the SPECjbb results of Garbage-First and CMS we see
that CMS is ahead, by between 5% and 10%.
In the telco throughput graph, no configuration of ei-
ther collector forced full heap collections. The Garbage-
First results seem to be affected very little by the max GC
time/time slice pairs, apart from the 50/100 configuration
that seems to perform slightly worse than all the rest. Again,
the comparison between the two collectors shows CMS to be
slightly ahead (by only 3% to 4% this time).
Table 2 compares final marking pauses for the Garbage-
First and CMS collectors, taking the maximum pauses over
all configurations shown in table 1. The results show that
the SATB marking algorithm is quite effective at reducing
pause times due to marking.
We wish to emphasize that many customers are quite will-
ing to trade off some amount of throughput for more pre-
dictable soft real-time goal compliance.
benchmark    G-F        CMS
SPECjbb      25.4 ms    934.7 ms
telco        48.7 ms    381.7 ms
Table 2: Maximum final marking pauses
4.4 Parallel Scaling
[Figure 5: Parallel Scaling of Fully-Young Collections. Parallel speedup versus number of CPUs (1 to 8) for SPECjbb and telco, compared with perfect scalability.]
In this section we quantify how well the collector parallelizes
stop-world collection activities. The major such activities
are evacuation pauses. The adaptive nature of the
collector makes this a difficult property to measure: if we
double the number of processors available for GC, and this
increases the GC rate, then we’ll do more collection at each
pause, changing the execution significantly. Therefore, for
this experiment, we restrict the collector to fully young col-
lections, use a fixed “young generation” size, and use a total
heap size large enough to make “old generation” collection
activity unnecessary. This measures the scaling at least of
fully young collections. We define the parallel speedup for
n processors as the total GC time using one parallel thread
divided by the total GC time using n parallel threads (on
a machine with at least n processors). This is the factor by
which parallelization decreased GC time. Figure 5 graphs
parallel speedup as a function of the number of processors.
While these measurements indicate appreciable scaling,
we are hoping, with some tuning, to achieve nearly linear
scaling in the future (on such moderate numbers of CPUs).
4.5 Space Overhead
The card table has one byte for every 512 heap bytes, or
about 0.2% overhead. The second card table used to track
card hotness adds a similar overhead (though a smaller ta-
ble could be used). The two marking bitmaps each have one
bit for every 64 heap bits; together this is about a 3% over-
head. Parallel task queues are constant-sized; using queues
with 32K 4-byte entries per parallel thread avoids task queue
overflow on all tests described in this paper. The marking
stack is currently very conservatively sized at 8 M, but could
be made significantly smaller with the addition of a “restart”
mechanism to deal with stack overflow.
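The figures in this paragraph can be checked with a few lines of arithmetic. The sketch uses only the ratios stated above; the thread count of 8 matches the runs in section 4.1, and the function name is hypothetical.

```python
from fractions import Fraction

# One card-table byte per 512 heap bytes.
card_table_overhead = Fraction(1, 512)        # about 0.195% of the heap

# Two marking bitmaps, each with one bit per 64 heap bits.
bitmap_overhead = 2 * Fraction(1, 64)         # 1/32, about 3.1% of the heap


def task_queue_bytes(threads, entries=32 * 1024, entry_bytes=4):
    """Total size of the constant-sized parallel task queues."""
    return threads * entries * entry_bytes


queues = task_queue_bytes(threads=8)          # 8 queues of 128 KB each = 1 MB
```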
Next we consider per heap region overhead. The size of
the data structure that represents a region is a negligible
fraction of the region size, which is 1 M by default. Heap
region remembered sets are a more important source of
per-region space overhead; we discuss this in the next section.
4.6 Remembered Set Overhead/Popularity
Here, we document the overheads due to remembered set
maintenance. We also show how the popular object handling
(see section 2.7) can have several positive effects.
             heap      Rem set space                                # pop. pauses /
benchmark    size      pop. obj handling    no pop. obj handling    # pop. objects
SPECjbb      1024 M    205 M (20.5%)        289 M (28.9%)           36/256
telco        500 M     4.8 M (1.0%)         20.4 M (4.0%)           660/1736
Table 3: Remembered set space overhead
             all      same      null       popular
benchmark    tests    region    pointer    object
SPECjbb      72.2     39.9      30.0       2.3
telco        81.7     36.8      37.0       7.9
Table 4: Effectiveness of Remembered Set Filtering
Table 3 shows the remembered set overheads. For each
benchmark, we show the heap size and space devoted to re-
membered sets with and without popular object handling.
We also show the number of popularity pauses, and the num-
ber of popular objects they identify and isolate.
The telco results prove that popular object handling can
significantly decrease the remembered set space overhead.
Clearly, however, the SPECjbb results represent a signifi-
cant problem: the remembered set space overhead is simi-
lar to the total live data size. Many medium-lived objects
point to long-lived objects in other regions, so these refer-
ences are recorded in the remembered sets of the long-lived
regions. Whereas popularity handling for telco identifies
a small number of highly popular objects, SPECjbb has
both a small number of highly popular objects, and a larger
number of objects with moderate reference counts. These
characteristics cause some popularity pauses to be aban-
doned, as no individual objects meet the per-object popu-
larity threshold (see section 2.7); the popularity threshold
is thus increased, which accounts for the small number of
popularity pauses and objects identified for SPECjbb.
We are implementing a strategy for dealing with popular
heap regions, based on constraining the heap regions chosen
for collection. If heap region B’s remembered set contains
many entries from region A, then we can delete these entries,
and constrain collection set choice such that A must be col-
lected before or at the same time as B. If B also contains
many pointers into A, then adding a similar constraint in the
reverse direction joins A and B into an equivalence class that
must be collected as a unit. We keep track of no remembered
set entries between members of an equivalence class. Any
constraint on a member of the equivalence class becomes a
constraint on the entire class. This approach has several
advantages: we can limit the size of our remembered sets,
we can handle popular regions becoming unpopular with-
out any special mechanisms, and we can collect regions that
heavily reference one another together and save remembered
set scanning costs. Other collectors [22, 30] have reduced re-
membered set costs by choosing an order of collection, but
the dynamic scheme we describe attempts to get maximal
remembered set footprint reduction with as few constraints
as possible.
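The constraint bookkeeping described above might look like the following sketch. It is a simplification under stated assumptions: only a directly reversed constraint merges two classes (longer constraint cycles are not detected), remembered-set entries themselves are not modeled, and every name is hypothetical.

```python
class RegionConstraints:
    def __init__(self):
        self.parent = {}    # union-find parent pointers over regions
        self.members = {}   # class representative -> regions in the class
        self.preds = {}     # class representative -> classes collected no later

    def find(self, region):
        if region not in self.parent:
            self.parent[region] = region
            self.members[region] = {region}
            self.preds[region] = set()
        while self.parent[region] != region:
            self.parent[region] = self.parent[self.parent[region]]  # path halving
            region = self.parent[region]
        return region

    def _pred_classes(self, rep):
        return {self.find(p) for p in self.preds[rep]} - {rep}

    def add_constraint(self, a, b):
        """Require that region a is collected before or together with region b."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if rb in self._pred_classes(ra):
            # The reverse constraint already exists: join into one class.
            self.parent[rb] = ra
            self.members[ra] |= self.members.pop(rb)
            self.preds[ra] |= self.preds.pop(rb)
        else:
            self.preds[rb].add(ra)

    def required_with(self, region):
        """All regions that must be in the collection set if `region` is."""
        seen, out, stack = set(), set(), [self.find(region)]
        while stack:
            rep = stack.pop()
            if rep in seen:
                continue
            seen.add(rep)
            out |= self.members[rep]
            stack.extend(self._pred_classes(rep))
        return out


c = RegionConstraints()
c.add_constraint("A", "B")          # A must be collected no later than B
first = c.required_with("B")
c.add_constraint("B", "A")          # reverse constraint: A and B form one class
c.add_constraint("C", "A")          # a constraint on A now applies to the class
second = c.required_with("B")
```

A constraint recorded against any member is stored on the class representative, which is how it becomes a constraint on the entire class.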
Next we consider the effectiveness of filtering in remembered
set write barriers. Table 4 shows the effectiveness
of the various filtering techniques. Each column is the
percent of pointer writes filtered by the given test.
The filtering techniques are fairly effective at avoiding the
more expensive out-of-line portion of the remembered set
write barrier. Again, SPECjbb is something of a worst
case. The most effective technique is detection o