[Figure 4.2: 4-Level Address Translation]
To determine the physical address corresponding to a virtual address, the processor first determines the address of the highest level directory. This address is usually stored in a register. Then the CPU takes the index part of the virtual address corresponding to this directory and uses that index to pick the appropriate entry. This entry is the address of the next directory, which is indexed using the next part of the virtual address. This process continues until it reaches the level 1 directory, at which point the value of the directory entry is the high part of the physical address. The physical address is completed by adding the page offset bits from the virtual address. This process is called page tree walking. Some processors (like x86 and x86-64) perform this operation in hardware, others need assistance from the OS.
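As an illustration of this walk, here is a minimal C sketch, assuming x86-64-style 4kB pages and 512-entry (9-bit index) directories. The directory type, translate() function, and the toy tree built in main() are invented for the example; on x86 and x86-64 the hardware performs this walk itself.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12            /* 4kB page: low 12 bits are the offset */
#define INDEX_BITS  9            /* 512 entries per directory            */
#define INDEX_MASK  0x1ffu

/* Toy model: a directory is an array of 512 entries, each holding the
 * address of the next directory (or, at level 1, of the physical page). */
typedef struct directory { uint64_t entry[1 << INDEX_BITS]; } directory;

uint64_t translate(const directory *level4, uint64_t vaddr)
{
    /* the level 4 directory address normally lives in a register */
    uint64_t next = (uint64_t)(uintptr_t)level4;

    for (int level = 4; level >= 1; level--) {
        unsigned shift = PAGE_SHIFT + (level - 1) * INDEX_BITS;
        unsigned index = (vaddr >> shift) & INDEX_MASK;
        next = ((const directory *)(uintptr_t)next)->entry[index];
    }
    /* level 1 supplies the high bits; the page offset passes through */
    return next | (vaddr & ((1u << PAGE_SHIFT) - 1));
}

int main(void)
{
    static directory l4, l3, l2, l1;
    uint64_t vaddr = (1ULL << 39) | (2ULL << 30) | (3ULL << 21)
                   | (4ULL << 12) | 0x123;

    l4.entry[1] = (uint64_t)(uintptr_t)&l3;
    l3.entry[2] = (uint64_t)(uintptr_t)&l2;
    l2.entry[3] = (uint64_t)(uintptr_t)&l1;
    l1.entry[4] = 0xabcd000;     /* pretend physical page address */

    /* prints "physical: 0xabcd123" */
    printf("physical: 0x%llx\n", (unsigned long long)translate(&l4, vaddr));
    return 0;
}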
Each process running on the system might need its own page table tree. It is possible to partially share trees, but this is rather the exception. It is therefore good for performance and scalability if the memory needed by the page table trees is as small as possible. The ideal case for this is to place the used memory close together in the virtual address space; the actual physical addresses used do not matter. A small program might get by with using just one directory at each of levels 2, 3, and 4 and a few level 1 directories. On x86-64 with 4kB pages and 512 entries per directory this allows the addressing of 2MB with a total of 4 directories (one for each level). 1GB of contiguous memory can be addressed with one directory each for levels 2 to 4 and 512 directories for level 1.
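As a quick sanity check of these figures, the arithmetic can be written out as compile-time assertions; this hypothetical snippet assumes nothing beyond the stated 4kB page size and 512-entry fan-out.

/* Back-of-the-envelope check of the figures above. */
#define PAGE_SIZE 4096ULL        /* 4kB pages                  */
#define FANOUT     512ULL        /* entries per directory      */

_Static_assert(FANOUT * PAGE_SIZE == 2ULL << 20,
               "one full level 1 directory maps 2MB");
_Static_assert(FANOUT * FANOUT * PAGE_SIZE == 1ULL << 30,
               "a full level 2 directory (512 level 1 directories) maps 1GB");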
Assuming all memory can be allocated contiguously is too simplistic, though. For flexibility reasons the stack and the heap area of a process are, in most cases, allocated at pretty much opposite ends of the address space. This allows either area to grow as much as possible if needed. This means that there are most likely two level 2 directories needed and correspondingly more lower level directories.

But even this does not always match current practice. For security reasons the various parts of an executable (code, data, heap, stack, Dynamic Shared Objects (DSOs), aka shared libraries) are mapped at randomized addresses [9]. The randomization extends to the relative position of the various parts; that implies that the various memory regions in use in a process are spread widely throughout the virtual address space. By applying some limits to the number of bits of the address which are randomized the range can be restricted, but it certainly, in most cases, will not allow a process to run with just one or two directories for levels 2 and 3.

If performance is really much more important than security, randomization can be turned off. The OS will then usually at least load all DSOs contiguously in virtual memory.
4.3 Optimizing Page Table Access
All the data structures for the page tables are kept in main memory; this is where the OS constructs and updates the tables. Upon creation of a process or a change of a page table the CPU is notified. The page tables are used to resolve every virtual address into a physical address using the page table walk described above. More to the point: at least one directory for each level is used in the process of resolving a virtual address. This requires up to four memory accesses (for a single access by the running process), which is slow. It is possible to treat these directory table entries as normal data and cache them in L1d, L2, etc., but this would still be far too slow.

From the earliest days of virtual memory, CPU designers have used a different optimization. A simple computation can show that keeping only the directory table entries in the L1d and higher caches would lead to horrible performance. Each absolute address computation would require a number of L1d accesses corresponding to the page table depth. These accesses cannot be parallelized since each depends on the previous lookup's result. On a machine with four page table levels this alone would require at the very least 12 cycles (four dependent lookups at a minimum of about three cycles each for an L1d hit). Add to that the probability of an L1d miss and the result is nothing the instruction pipeline can hide. The additional L1d accesses also steal precious cache bandwidth.
So, instead of just caching the directory table entries, the complete computation of the address of the physical page is cached. For the same reason that code and data caches work, such a cached address computation is effective. Since the page offset part of the virtual address does not play any part in the computation of the physical page address, only the rest of the virtual address is used as the tag for the cache. Depending on the page size this means hundreds or thousands of instructions or data objects share the same tag and therefore the same physical address prefix.
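A minimal sketch of this tagging scheme, assuming 4kB pages; the tlb_tag() helper is invented for illustration, and a real TLB further splits this value into set index and tag bits, but the principle is the same.

#include <stdint.h>

#define PAGE_SHIFT 12            /* 4kB pages: offset bits 11..0 */

/* The offset bits never participate in translation, so the lookup key
 * is simply the virtual page number. */
static inline uint64_t tlb_tag(uint64_t vaddr)
{
    return vaddr >> PAGE_SHIFT;  /* virtual page number = cache tag */
}

/* Every address within one page shares the tag, so one cached
 * translation covers all of them:
 *   tlb_tag(0x7f0000401000) == tlb_tag(0x7f0000401fff)            */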
The cache into which the computed values are stored is called the Translation Look-Aside Buffer (TLB). It is usually a small cache since it has to be extremely fast. Modern CPUs provide multi-level TLB caches, just as for the other caches; the higher-level caches are larger and slower. The small size of the L1TLB is often made up for by making the cache fully associative, with an LRU eviction policy. Recently, this cache has been growing in size and, in the process, was changed to be set associative. As a result, it might not be the oldest entry which gets evicted and replaced whenever a new entry has to be added.

It is also possible that the address space layout of a process changes. There are two ways to deal with this problem:

• The TLB is flushed whenever the page table tree is changed.

• The tags for the TLB entries are extended to additionally and uniquely identify the page table tree they refer to.

In the first case the TLB is flushed whenever a context switch is performed. Since, in most OSes, a switch from one thread/process to another requires executing some kernel code, TLB flushes are restricted to leaving (and sometimes entering) the kernel address space. On virtualized systems it also happens when the kernel has to call the VMM and on the way back. If the kernel and/or VMM does not have to use virtual addresses, or can reuse the same virtual addresses as the process or kernel which made the system/VMM call (i.e., the address spaces are overlaid), the TLB only has to be flushed if, upon leaving the kernel or VMM, the processor resumes execution of a different process or kernel.
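For concreteness, here is a sketch of what the first approach looks like on x86-64, where loading a new page table root into the CR3 register implicitly flushes all non-global TLB entries in one step. install_page_table() is a hand-written illustration, not code from any particular kernel, and it can only execute in privileged mode.

#include <stdint.h>

/* Writing CR3 both installs a new page table tree and flushes the
 * non-global TLB entries. A kernel's context switch does this (among
 * much else) when the next task uses a different address space. */
static inline void install_page_table(uint64_t pgdir_phys)
{
    __asm__ volatile("mov %0, %%cr3" : : "r"(pgdir_phys) : "memory");
}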