
CPU cache


For example, a 2-way set associative cache contributes 1 bit to the tag and a 4-way set associative cache contributes 2 bits. The basic idea of the multicolumn cache is to use the set index to map to a cache set as a conventional set associative cache does, and to use the added tag bits to index a way in the set. For example, in a 4-way set associative cache, the two bits are used to index way 00, way 01, way 10, and way 11, respectively. This double cache indexing is called a "major location mapping", and its latency is equivalent to a direct-mapped access. Extensive experiments in multicolumn cache design show that the hit ratio to major locations is as high as 90%. If a new mapping conflicts with a cache block in the major location, the existing cache block is moved to another way in the same set, called a "selected location". Because the newly indexed cache block is the most recently used (MRU) block, it is placed in the major location, in keeping with temporal locality. Since the multicolumn cache is designed for high associativity, the number of ways in each set is high; thus, it is easy to find a selected location in the set. Additional hardware maintains a selected-location index for the major location of a cache block.
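As an illustration of major-location mapping, here is a minimal C sketch; the geometry (128 sets, 4 ways, 64-byte blocks) and all names are hypothetical, not taken from any cited design:

    #include <stdint.h>

    /* Hypothetical 4-way set-associative cache. The two tag bits that a
       direct-mapped cache of equal size would have used as index bits
       select the "major location" (way) within the set. */
    #define WAYS       4
    #define SETS       128
    #define BLOCK_BITS 6   /* 64-byte blocks */
    #define SET_BITS   7   /* log2(SETS)     */

    static unsigned set_index(uint32_t addr)
    {
        return (addr >> BLOCK_BITS) & (SETS - 1);
    }

    /* Low two bits of the tag index way 00, 01, 10 or 11. */
    static unsigned major_way(uint32_t addr)
    {
        return (addr >> (BLOCK_BITS + SET_BITS)) & (WAYS - 1);
    }

A lookup probes the major location first, at direct-mapped latency, and searches the remaining (selected) ways of the set only on a major-location miss.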
Virtually indexed, physically tagged (VIPT) caches use the virtual address for the index and the physical address in the tag. The advantage over PIPT is lower latency, as the cache line can be looked up in parallel with the TLB translation; however, the tag cannot be compared until the physical address is available. The advantage over VIVT is that since the tag has the physical address, the cache can detect homonyms. Theoretically, VIPT requires more tag bits because some of the index bits could differ between the virtual and physical addresses (for example bit 12 and above for 4 KiB pages) and would have to be included both in the virtual index and in the physical tag. In practice this is not an issue because, in order to avoid coherency problems, VIPT caches are designed to have no such index bits (e.g., by limiting the total number of bits for the index and the block offset to 12 for 4 KiB pages); this limits the size of VIPT caches to the page size times the associativity of the cache.

Virtually indexed, virtually tagged (VIVT) caches use the virtual address for both the index and the tag. This caching scheme can result in much faster lookups, since the MMU does not need to be consulted first to determine the physical address for a given virtual address. However, VIVT suffers from aliasing problems, where several different virtual addresses may refer to the same physical address. The result is that such addresses would be cached separately despite referring to the same memory, causing coherency problems. Although solutions to this problem exist, they do not work for standard coherence protocols. Another problem is homonyms, where the same virtual address maps to several different physical addresses. It is not possible to distinguish these mappings merely by looking at the virtual index itself, though potential solutions include flushing the cache after a context switch, forcing address spaces to be non-overlapping, or tagging the virtual address with an address space ID (ASID).

A branch target cache is a specialized cache which holds the first few instructions at the destination of a taken branch. This is used by low-powered processors which do not need a normal instruction cache because the memory system is capable of delivering instructions fast enough to satisfy the CPU without one. However, this only applies to consecutive instructions in sequence; it still takes several cycles of latency to restart instruction fetch at a new address, causing a few cycles of pipeline bubble after a control transfer. A branch target cache provides instructions for those few cycles, avoiding a delay after most taken branches.

A good hash function has the property that addresses which conflict with the direct mapping tend not to conflict when mapped with the hash function, so it is less likely that a program will suffer from an unexpectedly large number of conflict misses due to a pathological access pattern. The downside is extra latency from computing the hash function. Additionally, when it comes time to load a new line and evict an old line, it may be difficult to determine which existing line was least recently used, because the new line conflicts with data at different indexes in each way; LRU tracking for non-skewed caches is usually done on a per-set basis.
addresses into a cache index, generally by storing physical tags as well as virtual tags. For comparison, a physically tagged cache does not need to keep virtual tags, which is simpler. When a virtual to physical mapping is deleted from the TLB, cache entries with those virtual addresses will have to be flushed somehow. Alternatively, if cache entries are allowed on pages not mapped by the TLB, then those entries will have to be flushed when the access rights on those pages are changed in the page table.
The data in these locations is written back to the main memory only when that data is evicted from the cache. For this reason, a read miss in a write-back cache may sometimes require two memory accesses to service: one to first write the dirty location to main memory, and then another to read the new location from memory. Also, a write to a main memory location that is not yet mapped in a write-back cache may evict an already dirty location, thereby freeing that cache space for the new memory location.
checked as well. As a drawback, there is a correlation between the associativities of L1 and L2 caches: if the L2 cache does not have at least as many ways as all L1 caches together, the effective associativity of the L1 caches is restricted. Another disadvantage of inclusive cache is that whenever there is an eviction in L2 cache, the (possibly) corresponding lines in L1 also have to get evicted in order to maintain inclusiveness. This is quite a bit of work, and would result in a higher L1 miss rate.
The memory technologies would span semiconductor, magnetic core, drum and disc. Virtual memory seen and used by programs would be flat, and caching would be used to fetch data and instructions into the fastest memory ahead of processor access. Extensive studies were done to optimize the cache sizes. Optimal values were found to depend greatly on the programming language used, with Algol needing the smallest and Fortran and Cobol needing the largest cache sizes.
The R6000 solves the issue by putting the TLB memory into a reserved part of the second-level cache and having a tiny, high-speed TLB "slice" on chip. The cache is indexed by the physical address obtained from the TLB slice. However, since the TLB slice only translates those virtual address bits that are necessary to index the cache and does not use any tags, false cache hits may occur, which is solved by tagging with the virtual address.

Lines in the K8's secondary cache are protected by either ECC or parity, depending on whether those lines were evicted from the data or instruction primary caches. Since the parity code takes fewer bits than the ECC code, lines from the instruction cache have a few spare bits. These bits are used to cache branch prediction information associated with those instructions. The net result is that the branch predictor has a larger effective history table, and so has better accuracy.
These include allowing a single core to use the whole cache, reducing data redundancy by making it possible for different processes or threads to share cached data, and reducing the complexity of the cache coherency protocols used. For example, an eight-core chip with three levels may include an L1 cache for each core, one intermediate L2 cache for each pair of cores, and one L3 cache shared between all cores.
There is no need for any tag checking in the inner loop – in fact, the tags need not even be read. Later in the pipeline, but before the load instruction is retired, the tag for the loaded data must be read and checked against the virtual address to make sure there was a cache hit. On a miss, the cache is updated with the requested cache line and the pipeline is restarted.
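In C-like terms, the read path just described might look as follows; the layout and names are illustrative, not a description of any particular processor:

    #include <stdbool.h>
    #include <stdint.h>

    #define LINES     64
    #define LINE_SIZE 64

    struct line { uint32_t tag; bool valid; uint8_t data[LINE_SIZE]; };
    static struct line cache[LINES];

    /* The data is returned (and used) speculatively; the tag compare
       that validates the access happens later in the pipeline. */
    static uint8_t read_speculative(uint32_t vaddr, bool *hit)
    {
        struct line *l = &cache[(vaddr / LINE_SIZE) % LINES];
        *hit = l->valid && l->tag == vaddr / (LINE_SIZE * LINES);
        return l->data[vaddr % LINE_SIZE];
    }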
The goal is that the CPU wastes less time reading from the slow main memory. The general guideline is that doubling the associativity, from direct mapped to two-way, or from two-way to four-way, has about the same effect on raising the hit rate as doubling the cache size. However, increasing associativity beyond four does not improve hit rate as much, and is generally done for other reasons. Some CPUs can dynamically reduce the associativity of their caches in low-power states, which acts as a power-saving measure.

The Cray-1 (circa 1976) had eight address "A" and eight scalar data "S" registers that were generally usable. There was also a set of 64 address "B" and 64 scalar data "T" registers that took longer to access, but were faster than main memory. The "B" and "T" registers were provided because the Cray-1 did not have a data cache. (The Cray-1 did, however, have an instruction cache.)

While it was technically possible to have all the main memory as fast as the CPU, a more economically viable path has been taken: use plenty of low-speed memory, but also introduce a small high-speed cache memory to alleviate the performance gap. This provided an order of magnitude more capacity—for the same price—with only a slightly reduced combined performance.
The cache hit rate and the cache miss rate play an important role in determining this performance. Reducing the miss rate is one of the main ways to improve cache performance; decreasing the cache access time also boosts performance.
One way to think about this problem is to divide up the virtual pages the program uses and assign them virtual colors in the same way as physical colors were assigned to physical pages before. Programmers can then arrange the access patterns of their code so that no two pages with the same virtual color are in use at the same time. There is a wide literature on such optimizations.
are read, and matched against a subset of the virtual address. Later on in the pipeline, the virtual address is translated into a physical address by the TLB, and the physical tag is read (just one, as the vhint supplies which way of the cache to read). Finally the physical address is compared to the physical tag to determine if a hit has occurred.
The AMD K6-III used Socket 7, which was previously used by Intel with on-motherboard caches. The K6-III included 256 KiB on-die L2 cache and took advantage of the on-board cache as a third level cache, named L3 (motherboards with up to 2 MiB of on-board cache were produced). After Socket 7 became obsolete, on-motherboard cache disappeared from x86 systems.
in the slave store in the location given by the fetch address modulo 32; the remaining bits of the fetch address were also stored. If the wanted word was in the slave store, it was read from there instead of from main memory. This gave a major speedup for loops of up to 32 instructions, with a reduced effect for loops of up to 64 words.
Additionally, there is a problem that virtual-to-physical mappings can change, which would require flushing cache lines, as the VAs would no longer be valid. All these issues are absent if tags use physical addresses (VIPT).

If there are ten places to which the placement policy could have mapped a memory location, then to check if that location is in the cache, ten cache entries must be searched. Checking more places takes more power and chip area, and potentially more time. On the other hand, caches with more associativity suffer fewer misses.
Many commonly used programs do not require an associative mapping for all their accesses. In fact, only a small fraction of the memory accesses of a program require high associativity. The victim cache exploits this property by providing high associativity to only these accesses. It was introduced by Norman Jouppi in 1990.
cache entry with the matching hint must be evicted so that cache accesses after the cache fill at this address will have just one hint match. Since virtual hints have fewer bits than virtual tags distinguishing them from one another, a virtually hinted cache suffers more conflict misses than a virtually tagged cache.
Each program may run in its own simplified virtual address space, which contains code and data for that program only, or all programs may run in a common virtual address space. A program executes by calculating, comparing, reading and writing to addresses of its virtual address space, rather than addresses of the physical address space, making programs simpler and thus easier to write.
The early history of cache technology is closely tied to the invention and use of virtual memory. Because of the scarcity and cost of semiconductor memories, early mainframe computers in the 1960s used a complex hierarchy of physical memory, mapped onto a flat virtual memory space used by programs.
As the latency difference between main memory and the fastest cache has become larger, some processors have begun to utilize as many as three levels of on-chip cache. Price-sensitive designs used this to pull the entire cache hierarchy on-chip, but by the 2010s some of the highest-performance designs returned to having large off-chip caches.
A victim cache is a cache used to hold blocks evicted from a CPU cache upon replacement. The victim cache lies between the main cache and its refill path, and holds only those blocks of data that were evicted from the main cache. The victim cache is usually fully associative, and is intended to reduce the number of conflict misses.
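A minimal sketch of the lookup order, assuming a four-entry victim cache and tag-only bookkeeping (names are illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    struct vline { uint32_t tag; bool valid; };
    static struct vline victim[4];   /* small and fully associative */

    /* Probed only after a main-cache miss; on a hit the victim line
       is swapped back into the main cache. */
    static bool victim_hit(uint32_t tag)
    {
        for (int i = 0; i < 4; i++)
            if (victim[i].valid && victim[i].tag == tag)
                return true;
        return false;                /* miss: go to the next level */
    }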
Programmers attempting to make maximum use of the cache may arrange their programs' access patterns so that only 1 MiB of data need be cached at any given time, thus avoiding capacity misses. But they should also ensure that the access patterns do not have conflict misses.
It is also possible for the operating system to ensure that no virtual aliases are simultaneously resident in the cache. The operating system makes this guarantee by enforcing page coloring, which is described below. Some early RISC processors (SPARC, RS/6000) took this approach. It has not been used recently, as the hardware cost of detecting and evicting virtual aliases has fallen while the software complexity and performance penalty of perfect page coloring has risen.
The tag contains the most significant bits of the address, which are checked against all rows in the current set (the set has been retrieved by index) to see if this set contains the requested address. If it does, a cache hit occurs. The tag length in bits is: tag length = address length - index length - block offset length.
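For example, the address split can be written directly in C. This sketch assumes the 32-bit address with 32 sets of 64-byte blocks used elsewhere in this article (5 index bits, 6 offset bits, 21 tag bits):

    #include <stdint.h>

    #define OFFSET_BITS 6
    #define INDEX_BITS  5

    static uint32_t block_offset(uint32_t a) { return a & ((1u << OFFSET_BITS) - 1); }
    static uint32_t set_index(uint32_t a)    { return (a >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
    static uint32_t tag_bits(uint32_t a)     { return a >> (OFFSET_BITS + INDEX_BITS); }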
The simplest and most commonly used scheme, shown in the right-hand diagram above, is to use the least significant bits of the memory location's index as the index for the cache memory, and to have two entries for each index. One benefit of this scheme is that the tags stored in the cache do not have to include that part of the main memory address which is implied by the cache memory's index.
To make room for the new entry on a cache miss, the cache may have to evict one of the existing entries. The heuristic it uses to choose the entry to evict is called the replacement policy. The fundamental problem with any replacement policy is that it must predict which existing cache entry is least likely to be used in the future.
A multi-ported cache is a cache which can serve more than one request at a time. When accessing a traditional cache we normally use a single memory address, whereas in a multi-ported cache we may request N addresses at a time, where N is the number of ports that connect the processor and the cache.
Three-level caches were used again first with the introduction of multiple processor cores, where the L3 cache was added to the CPU die. It became common for total cache sizes to be increasingly larger in newer processor generations, and recently (as of 2011) it is not uncommon to find Level 3 cache sizes of tens of megabytes.
Because the cache is 4 KiB and has 64 B lines, there are just 64 lines in the cache, and we read two at a time from a Tag SRAM which has 32 rows, each with a pair of 21-bit tags. Although any function of virtual address bits 31 through 6 could be used to index the tag and data SRAMs, it is simplest to use the least significant bits.
An associative cache is more complicated, because some form of tag must be read to determine which entry of the cache to select. An N-way set-associative level-1 cache usually reads all N possible tags and N data in parallel, and then chooses the data associated with the matching tag. Level-2 caches sometimes save power by reading the tags first, so that only one data element is read from the data SRAM.
in hardware, which means that a store to an instruction closely following the store instruction will change that following instruction. Other processors, like those in the Alpha and MIPS family, have relied on software to keep the instruction cache coherent. Stores are not guaranteed to show up in the instruction stream until a program calls an operating system facility to ensure coherency.
The data cache keeps copies of 64-byte lines of memory. It is split into 8 banks (each storing 8 KiB of data), and can fetch two 8-byte data each cycle so long as those data are in different banks. There are two copies of the tags, because each 64-byte line is spread among all eight banks. Each tag copy handles one of the two accesses per cycle.
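The bank-conflict rule can be expressed compactly; this is a sketch under the stated geometry (eight 8-byte banks per 64-byte line), not AMD's documented logic:

    #include <stdbool.h>
    #include <stdint.h>

    /* Address bits 5..3 select one of the eight 8-byte banks. */
    static unsigned bank(uint64_t addr) { return (addr >> 3) & 7; }

    /* Two loads can complete in the same cycle only if they fall
       in different banks. */
    static bool same_cycle_ok(uint64_t a, uint64_t b)
    {
        return bank(a) != bank(b);
    }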
One section keeps PTEs that map 4 KiB pages, and the other keeps PTEs that map 4 MiB or 2 MiB pages. The split allows the fully associative match circuitry in each section to be simpler. The operating system maps different sections of the virtual address space with PTEs of different sizes.
The advantage of exclusive caches is that they store more data. This advantage is larger when the exclusive L1 cache is comparable to the L2 cache, and diminishes if the L2 cache is many times larger than the L1 cache. When the L1 misses and the L2 hits on an access, the hitting cache line in the L2 is exchanged with a line in the L1.
Another issue is the fundamental tradeoff between cache latency and hit rate. Larger caches have better hit rates but longer latency. To address this tradeoff, many computers use multiple levels of cache, with small fast caches backed up by larger, slower caches. Multi-level caches generally operate
with each cache entry instead of virtual tags. These hints are a subset or hash of the virtual tag, and are used for selecting the way of the cache from which to get data and a physical tag. Like a virtually tagged cache, there may be a virtual hint match but physical tag mismatch, in which case the
Some processors (e.g. early SPARCs) have caches with both virtual and physical tags. The virtual tags are used for way selection, and the physical tags are used for determining hit or miss. This kind of cache enjoys the latency advantage of a virtually tagged cache, and the simple software interface
Multiple virtual addresses can map to a single physical address. Most processors guarantee that all updates to that single physical address will happen in program order. To deliver on that guarantee, the processor must ensure that only one copy of a physical address resides in the cache at any given
Compared with a direct-mapped cache, a set associative cache has a reduced number of bits for its cache set index that maps to a cache set, where multiple ways or blocks stay, such as 2 blocks for a 2-way set associative cache and 4 blocks for a 4-way set associative cache. Compared with a direct-mapped cache, the unused cache index bits become a part of the tag bits.
Cache performance measurement has become important in recent times, where the speed gap between memory performance and processor performance is increasing exponentially. The cache was introduced to reduce this speed gap. Thus, knowing how well the cache is able to bridge the gap between processor and memory speed becomes important, especially in high-performance systems.
Two tunnel diode stores were developed at Cambridge; one, which worked very well, speeded up the fetching of operands, the other was intended to speed up the fetching of instructions. The idea was that most instructions are obeyed in sequence, so when an instruction was fetched that word was placed
CPU has 128 or 192 KiB of L1 instruction cache for each core (important for latency/single-thread performance), depending on core type. This is an unusually large L1 cache for any CPU type (not just for a laptop); the total cache memory size is not unusually large for a laptop (the total matters more for throughput), and much larger total (e.g. L3 or L4) sizes are available in IBM's mainframes.
A more modern cache might be 16 KiB, 4-way set-associative, virtually indexed, virtually hinted, and physically tagged, with 32 B lines, 32-bit read width and 36-bit physical addresses. The read path recurrence for such a cache looks very similar to the path above. Instead of tags, vhints
L2 cache. This cache is exclusive to both the L1 instruction and data caches, which means that any 8-byte line can only be in one of the L1 instruction cache, the L1 data cache, or the L2 cache. It is, however, possible for a line in the data cache to have a PTE which is also in one of the TLBs—the
Another advantage of inclusive caches is that the larger cache can use larger cache lines, which reduces the size of the secondary cache tags. (Exclusive caches require both caches to have the same size cache lines, so that cache lines can be swapped on a L1 miss, L2 hit.) If the secondary cache is
A ÎĽop cache has many similarities with a trace cache, although a ÎĽop cache is much simpler thus providing better power efficiency; this makes it better suited for implementations on battery-powered devices. The main disadvantage of the trace cache, leading to its power inefficiency, is the hardware
If the operating system can guarantee that each physical page maps to only one virtual color, then there are no virtual aliases, and the processor can use virtually indexed caches with no need for extra virtual alias probes during miss handling. Alternatively, the OS can flush a page from the cache
Perhaps the ultimate reduction of virtual hints can be found in the Pentium 4 (Willamette and Northwood cores). In these processors the virtual hint is effectively two bits, and the cache is four-way set associative. Effectively, the hardware maintains a simple permutation from virtual address to cache index, so that no content-addressable memory is needed to select the right one of the four ways fetched.
due to a cache miss) matters because the CPU will run out of work while waiting for the cache line. When a CPU reaches this state, it is called a stall. As CPUs become faster compared to main memory, stalls due to cache misses displace more potential computation; modern CPUs can execute hundreds of instructions in the time taken to fetch a single cache line from main memory.
The simplest cache is a virtually indexed direct-mapped cache. The virtual address is calculated with an adder, the relevant portion of the address extracted and used to index an SRAM, which returns the loaded data. The data is byte aligned in a byte shifter, and from there is bypassed to the next operation.
The instruction TLB keeps copies of page table entries (PTEs). Each cycle's instruction fetch has its virtual address translated through this TLB into a physical address. Each entry is either four or eight bytes in memory. Because the K8 has a variable page size, each of the TLBs is split into two sections.
It can be useful to distinguish the two functions of tags in an associative cache: they are used to determine which way of the entry set to select, and they are used to determine if the cache hit or missed. The second function must always be correct, but it is permissible for the first function to
problem, in which several cache lines end up storing data for the same physical address. Writing to such locations may update only one location in the cache, leaving the others with inconsistent data. This issue may be solved by using non-overlapping memory layouts for different address spaces, or
In this cache organization, each location in the main memory can go in only one entry in the cache. Therefore, a direct-mapped cache can also be called a "one-way set associative" cache. It does not have a placement policy as such, since there is no choice of which cache entry's contents to evict.
With the Intel 486 processor, an 8 KiB cache was integrated directly into the CPU die. This cache was termed Level 1 or L1 cache to differentiate it from the slower on-motherboard, or Level 2 (L2), cache. These on-motherboard caches were much larger, with the most common size being 256 KiB.
Typically, sharing the L1 cache is undesirable because the resulting increase in latency would make each core run considerably slower than a single-core chip. However, for the highest-level cache, the last one called before accessing memory, having a global cache is desirable for several reasons.
Fetching complete pre-decoded instructions eliminates the need to repeatedly decode variable-length complex instructions into simpler fixed-length micro-operations, and simplifies the process of predicting, fetching, rotating and aligning fetched instructions. A μop cache effectively offloads the fetch and decode hardware, decreasing power consumption.
To understand the problem, consider a CPU with a 1 MiB physically indexed direct-mapped level-2 cache and 4 KiB virtual memory pages. Sequential physical pages map to sequential locations in the cache until after 256 pages the pattern wraps around. We can label each physical page with a color of 0–255 to denote where in the cache it can go.
Caches have historically used both virtual and physical addresses for the cache tags, although virtual tagging is now uncommon. If the TLB lookup can finish before the cache RAM lookup, then the physical address is available in time for tag compare, and there is no need for virtual tagging. Large
While all of the cache blocks in a particular cache are the same size and have the same associativity, typically the "lower-level" caches (called Level 1 cache) have a smaller number of blocks, smaller block size, and fewer blocks in a set, but have very short access times. "Higher-level" caches
The great advantage of virtual tags is that, for associative caches, they allow the tag match to proceed before the virtual to physical translation is done. However, coherence probes and evictions present a physical address for action. The hardware must have some means of converting the physical
When the processor needs to read or write a location in memory, it first checks for a corresponding entry in the cache. The cache checks for the contents of the requested memory location in any cache lines that might contain that address. If the processor finds that the memory location is in the cache, a cache hit has occurred.
One advantage of strictly inclusive caches is that when external devices or other processors in a multiprocessor system wish to remove a cache line from the processor, they need only have the processor check the L2 cache. In cache hierarchies which do not enforce inclusion, the L1 cache must be
Large physically indexed caches (usually secondary caches) run into a problem: the operating system rather than the application controls which pages collide with one another in the cache. Differences in page allocation from one program run to the next lead to differences in the cache collision
Physically indexed, physically tagged (PIPT) caches use the physical address for both the index and the tag. While this is simple and avoids problems with aliasing, it is also slow, as the physical address must be looked up (which could involve a TLB miss and access to main memory) before that address can be looked up in the cache.
in size, with 64-byte cache blocks. Hence, there are 8 KiB / 64 = 128 cache blocks. The number of sets equals the number of cache blocks divided by the number of ways of associativity, which leads to 128 / 4 = 32 sets, and hence 2^5 = 32 different indices.
Multicolumn cache retains a high hit ratio due to its high associativity, and has comparably low latency to a direct-mapped cache due to the high percentage of hits in major locations. The concepts of major locations and selected locations in multicolumn cache have been used in several cache designs, including the ARM Cortex-R chips, Intel's way-predicting cache memory, IBM's reconfigurable multi-way associative cache memory, and Oracle's dynamic cache replacement way selection based on address tag bits.
is used for all levels of cache, down to L1. Historically L1 was also on a separate die, however bigger die sizes have allowed integration of it as well as other cache levels, with the possible exception of the last level. Each extra level of cache tends to be bigger and optimized differently.
The snag is that while all the pages in use at any given moment may have different virtual colors, some may have the same physical colors. In fact, if the operating system assigns physical pages to virtual pages randomly and uniformly, it is extremely likely that some pages will have the same physical color.
Marking some memory ranges as non-cacheable can improve performance, by avoiding caching of memory regions that are rarely re-accessed. This avoids the overhead of loading something into the cache without having any reuse. Cache entries may also be disabled or locked depending on the context.
Modern processors have multiple interacting on-chip caches. The operation of a particular cache can be completely specified by the cache size, the cache block size, the number of blocks in a set, the cache set replacement policy, and the cache write policy (write-through or write-back).
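Those parameters map naturally onto a configuration record; a hypothetical C rendering:

    #include <stddef.h>

    enum write_policy { WRITE_THROUGH, WRITE_BACK };

    struct cache_config {
        size_t size_bytes;         /* total data capacity     */
        size_t block_bytes;        /* cache block size        */
        unsigned blocks_per_set;   /* associativity           */
        const char *replacement;   /* e.g. "LRU" or "random"  */
        enum write_policy writes;
    };

    static size_t num_sets(const struct cache_config *c)
    {
        return c->size_bytes / (c->block_bytes * c->blocks_per_set);
    }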
Once the address has been computed, the one cache index which might have a copy of that location in memory is known. That cache entry can be read, and the processor can continue to work with that data before it finishes checking that the tag actually matches the requested address.
Since the cache tags have fewer bits, they require fewer transistors, take less space on the processor circuit board or on the microprocessor chip, and can be read and compared faster.
A subset of the tag, called a hint, can be used to pick just one of the possible cache entries mapping to the requested address. The entry selected by the hint can then be used in parallel with checking the full tag. The hint technique works best when used in the context of address translation, as explained below.
Cache reads are the most common CPU operation that takes more than a single cycle. Program execution time tends to be very sensitive to the latency of a level-1 data cache hit. A great deal of design effort, and often power and silicon area, are expended making the caches as fast as possible.
Used to speed virtual-to-physical address translation for both executable instructions and data. A single TLB can be provided for access to both instructions and data, or a separate Instruction TLB (ITLB) and data TLB (DTLB) can be provided. However, the TLB cache is part of the memory management unit (MMU) and not directly related to the CPU caches.
The instruction cache keeps copies of 64-byte lines of memory, and fetches 16 bytes each cycle. Each byte in this cache is stored in ten bits rather than eight, with the extra bits marking the boundaries of instructions (this is an example of predecoding). The cache has only parity protection rather than ECC.
The miss rate decreases when cores do not require equal parts of the cache space. Consequently, a single core can use the full level 2 or level 3 cache while the other cores are inactive. Furthermore, the shared cache makes it faster to share memory among different execution cores.
The Motorola 68060, released in 1994, has the following: 8 KiB data cache (four-way associative), 8 KiB instruction cache (four-way associative), 96-byte FIFO instruction buffer, 256-entry branch cache, and 64-entry address translation cache MMU buffer (four-way associative).

Although the actual mapping from virtual to physical color is irrelevant to system performance, odd mappings are difficult to keep track of and have little benefit, so most approaches to page coloring simply try to keep physical and virtual page colors the same.

The natural design is to use different physical caches for each of these access points, so that no one physical resource has to be scheduled to service two points in the pipeline. Thus the pipeline naturally ends up with at least three separate caches (instruction, TLB, and data), each specialized to its particular role.
In the case of a cache hit, the processor immediately reads or writes the data in the cache line. For a cache miss, the cache allocates a new entry and copies data from main memory; then the request is fulfilled from the contents of the cache.
Data write misses to a write-back cache generally cause the shortest delay, because the write can be queued and there are few limitations on the execution of subsequent instructions; the processor can continue until the queue is full.
Additional techniques are used for increasing the level of parallelism when the LLC is shared between multiple cores, including slicing it into multiple pieces which address certain ranges of memory addresses and can be accessed independently.

The K8's branch prediction uses tables that help predict whether branches are taken and other tables which predict the targets of branches and jumps. Some of this information is associated with instructions, in both the level 1 instruction cache and the unified secondary cache.
This means that if two locations map to the same entry, they may continually knock each other out. Although simpler, a direct-mapped cache needs to be much larger than an associative one to give comparable performance, and it is more unpredictable. Let x be the block number in the cache, y the block number of memory, and n the number of blocks in the cache; then the mapping is given by x = y mod n. For example, with n = 32 blocks, memory block y = 37 maps to cache block 37 mod 32 = 5.
There are intermediate policies as well. The cache may be write-through, but the writes may be held in a store data queue temporarily, usually so multiple stores can be processed together (which can reduce bus turnarounds and improve bus utilization).
Having a dirty bit set indicates that the associated cache line has been changed since it was read from main memory ("dirty"), meaning that the processor has written data to that line and the new value has not propagated all the way to main memory.
In the common case of finding a hit in the first way tested, a pseudo-associative cache is as fast as a direct-mapped cache, but it has a much lower conflict miss rate than a direct-mapped cache, closer to the miss rate of a fully associative cache.
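A sketch of the probe order, assuming a column-associative organization in which the alternate location is the index with its top bit flipped (all names illustrative; valid bits omitted for brevity):

    #include <stdbool.h>
    #include <stdint.h>

    #define SETS 64
    static uint32_t tags[SETS];

    static bool pseudo_assoc_hit(uint32_t addr)
    {
        uint32_t idx = addr % SETS, tag = addr / SETS;
        if (tags[idx] == tag)                 /* fast, direct-mapped probe */
            return true;
        return tags[idx ^ (SETS / 2)] == tag; /* slower second probe       */
    }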
and improving the frontend supply of decoded micro-operations. The ÎĽop cache also increases performance by more consistently delivering decoded micro-operations to the backend and eliminating various bottlenecks in the CPU's fetch and decode logic.
When trying to read from or write to a location in the main memory, the processor checks whether the data from that location is already in the cache. If so, the processor will read from or write to the cache instead of the much slower main memory.
The register file itself can be considered the smallest, fastest cache in the system, with the special characteristic that it is scheduled in software—typically by a compiler, as it allocates registers to hold values retrieved from main memory.
A cache miss is a failed attempt to read or write a piece of data in the cache, which results in a main memory access with much longer latency. There are three kinds of cache misses: instruction read miss, data read miss, and data write miss.
The benefit of this is that a pipelined processor may access memory from different phases in its pipeline. Another benefit is that it allows the concept of superscalar processors through different cache levels.
In a separate cache structure, instructions and data are cached separately, meaning that a cache line is used to cache either instructions or data, but not both; various benefits have been demonstrated with separate data and instruction
There are several tools available to computer architects to help explore tradeoffs between the cache cycle time, energy, and area; the CACTI cache simulator and the SimpleScalar instruction set simulator are two open-source options.
Still other processors (like the Intel Pentium II, III, and 4) do not require that data in the L1 cache also reside in the L2 cache, although it may often do so. There is no universally accepted name for this intermediate policy; two common names are "non-exclusive" and "partially-inclusive".
The software page coloring technique has been used to effectively partition the shared Last level Cache (LLC) in multicore processors. This operating system-based LLC management in multicore processors has been adopted by Intel.
The original Pentium 4 processor also had an eight-way set associative L2 integrated cache 256 KiB in size, with 128-byte cache blocks. This implies 32 - 8 - 7 = 17 bits for the tag field.
an order of magnitude larger than the primary, and the cache data is an order of magnitude larger than the cache tags, this tag area saved can be comparable to the incremental area needed to store the L1 cache data in the L2.
before every programmed access to main memory. With no caches, and with the mapping table memory running at the same speed as main memory this effectively cut the speed of memory access in half. Two early machines that used a
cache usually cause a smaller delay, because instructions not dependent on the cache read can be issued and continue execution until the data is returned from main memory, and the dependent instructions can resume execution.
The data TLB has two copies which keep identical entries. The two copies allow two data accesses per cycle to translate virtual addresses to physical addresses. Like the instruction TLB, this TLB is split into two kinds of
of a physically tagged cache. It bears the added cost of duplicated tags, however. Also, during miss processing, the alternate ways of the cache line indexed have to be probed for virtual aliases and any matches evicted.
The adjacent diagram is intended to clarify the manner in which the various fields of the address are used. Address bit 31 is most significant, bit 0 is least significant. The diagram shows the SRAMs, indexing, and
The "size" of the cache is the amount of main memory data it can hold. This size can be calculated as the number of bytes stored in each data block times the number of blocks stored in the cache. (The tag, flag and
or from the instruction cache. When an instruction needs to be decoded, the ÎĽop cache is checked for its decoded form which is re-used if cached; if it is not available, the instruction is decoded and then cached.
The K8 also has multiple-level caches. There are second-level instruction and data TLBs, which store only PTEs mapping 4 KiB. Both instruction and data caches, and the various TLBs, can fill from the large
resulting from decoding x86 instructions, providing also the functionality of a micro-operation cache. Having this, the next time an instruction is needed, it does not have to be decoded into micro-ops again.
in the 1960s. The first CPUs that used a cache had only one level of cache; unlike later level 1 cache, it was not split into L1d (for data) and L1i (for instructions). Split L1 cache started in 1976 with the
70:(SRAM), in modern CPUs by far the largest part of them by chip area, but SRAM is not always used for all levels (of I- or D-cache), or even any level, sometimes some latter or all levels are implemented with 2684: 221:
split the L1 cache. They also have L2 caches and, for larger processors, L3 caches as well. The L2 cache is usually not split, and acts as a common repository for the already split L1 cache. Every core of a
Virtual memory requires the processor to translate virtual addresses generated by the program into physical addresses in main memory. The portion of the processor that does this translation is known as the
451:. Many caches implement a compromise in which each entry in the main memory can go to any one of N places in the cache, and are described as N-way set associative. For example, the level-1 data cache in an 1049:
caches, then, tend to be physically tagged, and only small, very low latency caches are virtually tagged. In recent general-purpose CPUs, virtual tagging has been superseded by vhints, as described below.
616:
4282:– Hill and Cantin (2003) – This reference paper has been updated several times. It has thorough and lucidly presented simulation results for a reasonably wide set of benchmarks and cache organizations. 1045:
But virtual indexing is not the best choice for all cache levels. The cost of dealing with virtual aliases grows with cache size, and as a result most level-2 and larger caches are physically indexed.
1042:) is crucial to CPU performance, and so most modern level-1 caches are virtually indexed, which at least allows the MMU's TLB lookup to proceed in parallel with fetching the data from the cache RAM. 1714:(SPM), also known as scratchpad, scratchpad RAM or local store in computer terminology, is a high-speed internal memory used for temporary storage of calculations, data, and other work in progress. 1779:
The K8 also caches information that is never stored in memory—prediction information. These caches are not shown in the above diagram. As is usual for this class of CPU, the K8 has fairly complex branch prediction.
1786:
The K8 uses an interesting trick to store prediction information with instructions in the secondary cache. Lines in the secondary cache are protected from accidental data corruption (e.g. by an
843:
hardware in the cache of one processor hears an address broadcast from some other processor, and realizes that certain data blocks in the local cache are now stale and should be marked invalid.
377:
system updates data in the cache, copies of data in caches associated with other CPUs become stale. Communication protocols between the cache managers that keep the data consistent are known as
324:
Predicting the future is difficult, so there is no perfect method to choose among the variety of replacement policies available. One popular replacement policy, least recently used (LRU), replaces the least recently accessed entry.
1602:
most compiler register assignments are reallocated dynamically by hardware at runtime into a register bank, allowing the CPU to break false data dependencies and thus easing pipeline hazards.
1621:, there is a question of whether the caches should be shared or local to each core. Implementing shared cache inevitably introduces more wiring and complexity. But then, having one cache per 1178:(i.e. Level 2 and above) have progressively larger numbers of blocks, larger block size, more blocks in a set, and relatively longer access times, but are still much faster than main memory. 226:
has a dedicated L1 cache and is usually not shared between the cores. The L2 cache, and higher-level caches, may be shared between the cores. L4 cache is currently uncommon, and is generally
1816:
These predictors are caches in that they store information that is costly to compute. Some of the terminology used when discussing predictors is the same as that for caches (one speaks of a
1931:
access. But since the 1980s the performance gap between processor and memory has been growing. Microprocessors have advanced much faster than memory, especially in terms of their operating
2891:"US Patent Application for DYNAMIC CACHE REPLACEMENT WAY SELECTION BASED ON ADDRESS TAG BITS Patent Application (Application #20160350229 issued December 1, 2016) – Justia Patents Search" 1289:
A trace cache stores instructions either after they have been decoded, or as they are retired. Generally, instructions are added to trace caches in groups representing either individual
There are 2^6 = 64 possible offsets. Since the CPU address is 32 bits wide, this implies 32 - 5 - 6 = 21 bits for the tag field.
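The arithmetic in this example can be checked mechanically:

    #include <assert.h>

    int main(void)
    {
        unsigned size = 8 * 1024, block = 64, ways = 4;
        unsigned blocks = size / block;   /* 128 */
        unsigned sets   = blocks / ways;  /* 32: 5 index bits */
        assert(blocks == 128 && sets == 32);
        /* 64-byte blocks need 6 offset bits; 32 - 5 - 6 = 21 tag bits. */
        return 0;
    }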
291:. When a cache line is copied from memory into the cache, a cache entry is created. The cache entry will include the copied data as well as the requested memory location (called a tag). 786: 734: 1282:
microprocessors. A trace cache is a mechanism for increasing the instruction fetch bandwidth and decreasing power consumption (in the case of the Pentium 4) by storing traces of
1118:
patterns, which can lead to very large differences in program performance. These differences can make it very difficult to get a consistent and repeatable timing for a benchmark run.
443:
decides where in the cache a copy of a particular entry of main memory will go. If the placement policy is free to choose any entry in the cache to hold the copy, the cache is called
Reilly, James, Kheradpir, Shervin, "An Overview of High-performance Hardware Design Using the 486 CPU", Intel Corporation, Microcomputer Solutions, November/December 1990, page 20
1061:), which can be solved by using physical address for tagging, or by storing the address space identifier in the cache line. However, the latter approach does not help against the 2070:, which at the time had latencies around 10–25 ns. The early caches were external to the processor and typically located on the motherboard in the form of eight or nine 1661:
Multi-level caches introduce new design decisions. For instance, in some processors, all data in the L1 cache must also be somewhere in the L2 cache. These caches are called
413:
in which the CPU attempts to execute independent instructions after the instruction that is waiting for the cache miss data. Another technology, used by many processors, is
1079:
On power-up, the hardware sets all the valid bits in all the caches to "invalid". Some systems also set a valid bit to "invalid" at other times, such as when multi-master
1994:, released in 1982, has a "loop mode" which can be considered a tiny and special-case instruction cache that accelerates loops that consist of only two instructions. The 954:
The virtual address space might be cut up into 1,048,576 pages of 4 KiB size, each of which can be independently mapped. There may be multiple page sizes supported.
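A short C sketch of this split, assuming 32-bit virtual addresses and 4 KiB pages:

    #include <stdint.h>

    #define PAGE_BITS 12   /* 4 KiB pages */

    /* 32 - 12 = 20 bits of page number: 2^20 = 1,048,576 pages. */
    static uint32_t page_number(uint32_t va) { return va >> PAGE_BITS; }
    static uint32_t page_offset(uint32_t va) { return va & ((1u << PAGE_BITS) - 1); }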
743:
The block offset specifies the desired data within the stored data block within the cache row. Typically the effective address is in bytes, so the block offset length is log2(b) bits, rounded up, where b is the number of bytes per data block.
587:
Nevertheless, skewed-associative caches have major advantages over conventional set-associative ones.
836:
An instruction cache requires only one flag bit per cache row entry: a valid bit. The valid bit indicates whether or not a cache block has been loaded with valid data.
4358: 3738: 983:, both had a small associative memory as a cache for accesses to the in-memory page table. Both machines predated the first machine with a cache for main memory, the 2063:
used for main memory had significant latency, up to 120 ns, as well as refresh cycles. The cache was constructed from more expensive, but significantly faster,
2695: 1314:. Stores from both L1D caches in the module go through the WCC, where they are buffered and coalesced. The WCC's task is reducing number of writes to the L2 cache. 250:
sizes (i.e. for larger non-L1), very early on the pattern broke down, to allow for larger caches without being forced into the doubling-in-size paradigm, with e.g.
1897:
for a 4 KiB, 2-way set-associative, virtually indexed and virtually tagged cache with 64 byte (B) lines, a 32-bit read width and 32-bit virtual address.
1507:
has a 192 KiB L1 cache for each of the four high-performance cores, an unusually large amount; however the four high-efficiency cores only have 128 KiB.
599:. A pseudo-associative cache tests each possible way one at a time. A hash-rehash cache and a column-associative cache are examples of a pseudo-associative cache. 5497: 4680: 2553: 354:
In a write-back or copy-back cache, writes are not immediately mirrored to the main memory, and the cache instead tracks which locations have been written over, marking them as dirty.
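In outline, and with illustrative types only, a write-back store and eviction look like this:

    #include <stdbool.h>
    #include <stdint.h>

    struct wb_line { uint32_t tag; bool valid, dirty; uint8_t data[64]; };

    static void store_byte(struct wb_line *l, unsigned off, uint8_t v)
    {
        l->data[off] = v;
        l->dirty = true;       /* main memory is now stale */
    }

    static void evict(struct wb_line *l, void (*writeback)(struct wb_line *))
    {
        if (l->valid && l->dirty)
            writeback(l);      /* the extra memory access on eviction */
        l->valid = false;
    }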
5199: 563:
The idea of having the processor use the cached data before the tag match completes can be applied to associative caches as well. A subset of the tag, called a
66:(L1, L2, often L3, and rarely even L4), with different instruction-specific and data-specific caches at level 1. The cache memory is typically implemented with 3203: 1998:, released in 1984, replaced that with a typical instruction cache of 256 bytes, being the first 68k series processor to feature true on-chip cache memory. 5356: 4922: 4739: 1689:
This exchange is quite a bit more work than just copying a line from L2 to L1, which is what an inclusive cache does.
3557: 1754:, because parity is smaller and any damaged data can be replaced by fresh data fetched from memory (which always has an up-to-date copy of instructions). 1510:
The benefits of L3 and L4 caches depend on the application's access patterns. Examples of products incorporating L3 and L4 caches include the following:
1904:
Similarly, because the cache is 4 KiB and has a 4 B read path, and reads two ways for each access, the Data SRAM is 512 rows by 8 bytes wide.
455:
For example, the level-1 data cache in an AMD Athlon is two-way set associative, which means that any particular location in main memory can be cached in either of two locations in the level-1 data cache.
1911:
Some SPARC designs have improved the speed of their L1 caches by a few gate delays by collapsing the virtual address adder into the SRAM decoders. See
4702: 2097:
and the growing disparity between bus clock rates and CPU clock rates, which caused on-motherboard cache to be only slightly faster than main memory.
1057:
A cache that relies on virtual indexing and tagging becomes inconsistent after the same virtual address is mapped into different physical addresses (
5351: 892: 463: 389: 3278:
Jouppi, Norman P. (May 1990). "Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers".
346:
If data is written to the cache, at some point it must also be written to main memory; the timing of this write is known as the write policy.
5423: 3603: 937:
The physical address is available from the MMU some time, perhaps a few cycles, after the virtual address is available from the address generator.
4304: 3795: 5176: 1145:
The solution is to have the operating system attempt to assign different physical color pages to different virtual colors, a technique called
3498: 3026: 2964: 2937: 2653: 1153:
whenever it changes from one virtual color to another. As mentioned above, this approach was used for some early SPARC and RS/6000 designs.
254:
with 3 MiB L2 cache in April 2008. This happened much later for L1 caches, as their size is generally still a small number of KiB. The
6120: 5244: 4507: 4351: 2759: 1122:
Locations within physical pages with different colors cannot conflict in the cache.
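For the 1 MiB direct-mapped cache with 4 KiB pages described above, the color is simply bits 12 through 19 of the physical address; a one-line C illustration:

    #include <stdint.h>

    /* 256 possible colors, one per 4 KiB-aligned slice of the cache. */
    static unsigned page_color(uint32_t paddr)
    {
        return (paddr >> 12) & 0xFF;
    }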
4224: 3088:. Conferences Proceedings. Vol. 20 Proceedings of the Eastern Joint Computer Conference Washington, D.C. Macmillan. pp. 279–294. 1185:
selected to be implemented by the processor designers. In some cases, multiple algorithms are provided for different kinds of work loads.
200:
with two separate level 1 caches of 4 KiB each on the chip, one for the instructions and one for data. The board has no external L2 cache.
4130:
Chen, Allan, "The 486 CPU: ON A High-Performance Flight Vector", Intel Corporation, Microcomputer Solutions, November/December 1990, p. 2
417:(SMT), which allows an alternate thread to use the CPU core while the first thread waits for required CPU resources to become available. 6130: 5271: 3482: 4112: 2582: 2417: 1776:
operating system is responsible for keeping the TLBs coherent by flushing portions of them when the page tables in memory are updated.
4398: 3830: 2104:, which brought the secondary cache onto the same package as the microprocessor, clocked at the same frequency as the microprocessor. 1842:
is used to specify which of the possible memory locations is currently stored in a CPU cache. For a simple, direct-mapped design fast
77:
Other types of caches exist (that are not counted towards the "cache size" of the most important caches mentioned above), such as the
5438: 5266: 5239: 4618: 3851: 3702: 3126: 166:
Used to speed data fetch and store; the data cache is usually organized as a hierarchy of more cache levels (L1, L2, etc.; see also
4589: 4066: 6289: 6253: 5816: 4709: 4675: 4670: 4554: 3356: 2954: 2927: 1571: 470:). Some CPUs can dynamically reduce the associativity of their caches in low-power states, which acts as a power-saving measure. 217:
CPU, became mainstream in the late 1980s, and in 1997 entered the embedded CPU market with the ARMv5TE. In 2015, even sub-dollar
3680: 3331: 3175:
Taylor, George; Davies, Peter; Farmwald, Michael (1990). "The TLB Slice - A Low-Cost High-Speed Address Translation Mechanism".
2039:
microprocessor (33 MHz), 64 KiB cache (25 ns; 8 chips in the bottom left corner), 2 MiB DRAM (70 ns; 8
6228: 6125: 5526: 5433: 5234: 4477: 4455: 4344: 1673:
caches: data is guaranteed to be in at most one of the L1 and L2 caches, never in both. Still other processors
4973: 4408: 4288: 3481:. 2001 International Symposium on Low Power Electronics and Design (ISLPED'01), August 6-7, 2001. Huntington Beach, CA, USA: 2249: 1363: 1283: 3624: 3579: 3016: 1653:. In a unified structure, this constraint is not present, and cache lines can be used to cache both instructions and data. 5428: 5276: 5110: 4724: 4685: 4542: 1807: 1311: 2473: 5865: 5710: 5705: 5627: 5103: 5064: 4719: 4714: 4648: 4460: 2322:
had no page tables in main memory; there was an associative memory with one entry for every 512 word page frame of core.
1738: 1650: 1206: 1031: 920: 227: 109: 78: 4584: 4330: 3242:. IEEE 14th International Symposium on High Performance Computer Architecture. Salt Lake City, Utah. pp. 367–378. 5492: 5189: 4887: 3403:"The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers" 3001: 2319: 2130: 2064: 1843: 1578: 1497: 1367: 1250: 1242: 990:
Caches can be divided into four types, based on whether the index or tag correspond to physical or virtual addresses:
870: 414: 231: 105:, and industrial CPUs have at least three independent levels of caches (L1, L2 and L3) and different types of caches: 67: 3959: 3938: 3471: 3282:. 17th Annual International Symposium on Computer Architecture, May 28-31, 1990. Seattle, WA, USA. pp. 364–373. 2803:
Zhang, Chenxi; Zhang, Xiaodong; Yan, Yong (September–October 1997). "Two fast and high-associativity cache schemes".
3838:. IEEE International Symposium on Performance Analysis of Systems and Software. Austin, Texas, USA. pp. 89–96. 2606: 539:
If each location in the main memory can be cached in either of two locations in the cache, one logical question is:
6299: 6142: 5789: 5206: 4697: 4665: 4435: 4423: 4403: 4149: 3648: 1851: 1795: 1131: 1100: 596: 318: 255: 2059:, small amounts of fast cache memory began to be featured in systems to improve performance. This was because the 1472:), first; if it hits, the processor proceeds at high speed. If that smaller cache misses, the next fastest cache, 266:-based processors from 2018, having 48 KiB L1 data cache and 48 KiB L1 instruction cache. In 2020, some 6233: 6196: 6186: 4574: 4098:
Proceedings of the October 27–29, 1964, fall joint computer conference, part II: very high speed computer systems
3909: 2225: 1722:
To illustrate both specialization and multi-level caching, here is the cache hierarchy of the K8 core in the AMD
1544:(2003) MX 2 module incorporated two Itanium 2 processors along with a shared 64 MiB L4 cache on a 263: 2788:
Skewed-associative caches have been shown to have two major advantages over conventional set-associative caches.
746: 694: 258:
from 2012 is an exception however, to gain unusually large 96 KiB L1 data cache for its time, and e.g. the
6294: 6248: 5655: 5591: 5568: 5418: 5380: 5216: 5166: 5161: 4638: 4532: 4440: 3214: 2245: 2235: 2089:
which had either 64 or 128 Kbyte of cache memory. The popularity of on-motherboard cache continued through the
1936: 1555:
MP product codenamed "Tulsa" (2006) features 16 MiB of on-die L3 cache shared between two processor cores.
1493: 987:, so the first hardware cache used in a computer system was not a data or instruction cache, but rather a TLB. 440: 434: 259: 4445: 4036: 495:
Fully associative cache – the best miss rates, but practical only for a small number of entries
447:. At the other extreme, if each entry in the main memory can go in just one place in the cache, the cache is 6201: 5984: 5878: 5842: 5759: 5743: 5585: 5374: 5333: 5321: 5184: 5098: 5019: 4784: 4388: 3261: 2067: 1955: 1582: 1501: 1371: 984: 976: 209: 47: 39: 2640:. 2009 2nd IEEE International Conference on Computer Science and Information Technology. pp. 551–556. 2359:"Survey of CPU Cache-Based Side-Channel Attacks: Systematic Analysis, Security Models, and Countermeasures" 2150:) on the same package. This L4 cache is shared dynamically between the on-die GPU and CPU, and serves as a 1414:
This allows full-speed operation with a much smaller cache than a traditional full-time instruction cache.
6007: 5979: 5889: 5854: 5603: 5597: 5579: 5313: 5307: 5211: 5115: 5006: 4945: 4807: 4450: 3104: 2516: 2439: 2006: 1881:
Some set-associative designs save power by reading the tags first, so that only one data element is read from the data SRAM.
Large shared caches (important for throughput) are found even in laptop-class chips, and much larger total (e.g. L3 or L4) sizes are available in IBM's mainframes.
For the purposes of the present discussion, there are three important features of address translation: latency, aliasing, and granularity.
An illustration of different ways in which memory locations can be cached by particular cache locations
Haswell CPUs equipped with the GT3e variant of Intel's integrated Iris Pro graphics effectively feature 128 MiB of embedded DRAM (eDRAM) on the same package.
Such prediction information is stored in cache-like structures (e.g., the history information in a branch predictor), but predictors are not generally thought of as part of the cache hierarchy.
A shared highest-level cache, which is called before accessing memory, is usually referred to as a last level cache (LLC).
The tag contains (part of) the address of the actual data fetched from the main memory. The flag bits are discussed below.
Other processors have other kinds of predictors (e.g., the store-to-load bypass predictor in the DEC Alpha 21264), and various specialized predictors are likely to flourish in future processors.
A data cache typically requires two flag bits per cache line – a valid bit and a dirty bit.
A true set-associative cache tests all the possible ways simultaneously, using something like a content-addressable memory.
On such processors, stores are not guaranteed to become visible in the instruction stream until a program calls an operating system facility to ensure coherency.
Some versions of the Intel 386 processor could support 16 to 256 KiB of external cache.
Direct mapped cache – good best-case time, but unpredictable in the worst case
The Intel Atom P5900, for example, has an L1 cache of 32 KB per core, an L2 cache of 4.5 MB per 4-core cluster, and a shared LLC cache of up to 15 MB.
Caches (like RAM historically) have generally been sized in powers of two: 2, 4, 8, 16, etc. KiB; when sizes reach MiB (i.e. for larger non-L1 caches), the pattern broke down early, to allow for larger caches without being forced into a doubling-in-size paradigm, as with e.g. the Intel Core 2 Duo and its 3 MiB L2 cache in April 2008.
In the early days of microcomputer technology, memory access was only slightly slower than register access.
If the operating system assigns physical pages without regard to color, some pages may end up with the same physical color, and then locations from those pages will collide in the cache (this is the birthday paradox).
Cached data from the main memory may be changed by other entities (e.g., peripherals using direct memory access (DMA) or another core in a multi-core processor), in which case the copy in the cache may become out-of-date or stale.
An effective memory address which goes along with the cache line (memory block) is split (
662:
bits are not included in the size, although they do affect the physical area of a cache.)
579:, where the index for way 0 is direct, as above, but the index for way 1 is formed with a 378: 299:
If the processor finds that the memory location is in the cache, a cache hit has occurred. However, if the processor does not find the memory location in the cache, a cache miss has occurred.
In 2020, some Intel Atom CPUs (with up to 24 cores) have (multiples of) 4.5 MiB and 15 MiB cache sizes.
The next development in cache implementation in the x86 microprocessors began with the Pentium Pro, which brought the secondary cache onto the same package as the microprocessor, clocked at the same frequency as the processor.
External cache was commonly implemented with SRAM devices placed in sockets to enable the cache as an optional extra or upgrade feature.
The index describes which cache set the data has been put in. The index length is ⌈log2(s)⌉ bits for s cache sets.
Various techniques have been employed to keep the CPU busy during this time, including out-of-order execution and simultaneous multithreading.
Motherboard of a NeXTcube computer (1990). At the lower edge of the image, left from the middle, there is the CPU Motorola 68040 operated at 25 MHz.
Intel's Crystalwell variant of its Haswell processors introduced an on-package 128 MiB eDRAM Level 4 cache which serves as a victim cache to the processors' Level 3 cache.
The IBM z13 has a 96 KiB L1 instruction cache (and a 128 KiB L1 data cache), and Intel Ice Lake-based processors from 2018 have a 48 KiB L1 data cache and a 48 KiB L1 instruction cache.
One of the early works describing the µop cache as an alternative frontend for the Intel P6 processor family is the 2001 paper "Micro-Operation Cache: A Power Aware Frontend for Variable Instruction Length ISA".
Some authors refer to the block offset as simply the "offset" or the "displacement".
For a two-way set-associative cache, the LRU algorithm is especially simple, since only one bit needs to be stored for each pair of ways.
The MIPS R6000 uses this cache type as the sole known implementation; the R6000 is implemented in emitter-coupled logic, which is an extremely fast technology not suitable for large memories such as a TLB.
In a write-through cache, every write to the cache causes a write to main memory. Alternatively, in a write-back (or copy-back) cache, writes are not immediately mirrored to the main memory; instead, the cache tracks which locations have been written over, marking them as dirty.
One of the advantages of a direct-mapped cache is that it allows simple and fast speculation.
Let x be the block number in the cache, y the block number of memory, and n the number of blocks in the cache; then mapping is done with the help of the equation x = y mod n.
Unless the hardware accounts for changes in the virtual-to-physical mapping, the cache (or a part of it) must be flushed when the mapping changes.
Cache read misses from an instruction cache generally cause the largest delay, because the processor, or at least the thread of execution, has to wait (stall) until the instruction is fetched from main memory.
A modern processor can execute hundreds of instructions in the time taken to fetch a single cache line from main memory.
Data is transferred between memory and cache in blocks of fixed size, called cache lines or cache blocks.
The virtual address space is broken up into pages. For instance, a 4 GiB virtual address space might be cut up into 1,048,576 pages of 4 KiB size, each of which can be independently mapped.
Eight-way set-associative cache, a common choice for later implementations
The K8 has four specialized caches: an instruction cache, an instruction TLB, a data TLB, and a data cache.
Some designs have returned to having large off-chip caches, which are often implemented in eDRAM and mounted on a multi-chip module, as a fourth cache level.
A trace cache increases instruction fetch bandwidth by deciding on caching and reusing dynamically created instruction traces.
Instructions are added to trace caches in groups representing either individual basic blocks or dynamic instruction traces. The Pentium 4's trace cache stores micro-operations resulting from decoding x86 instructions, providing also the functionality of a micro-operation cache.
In the Skylake microarchitecture the Level 4 cache no longer works as a victim cache.
The Write Coalescing Cache is a special cache that is part of the L2 cache in AMD's Bulldozer microarchitecture.
Early cache designs focused entirely on the direct cost of cache and RAM and average execution speed. More recent cache designs also consider energy efficiency, fault tolerance, and other goals.
As the x86 microprocessors reached clock rates of 20 MHz and above in the 386, small amounts of fast cache memory began to be featured in systems to improve performance.
Smart Cache shares the actual cache memory between the cores of a multi-core processor.
On-motherboard caches enjoyed prolonged popularity thanks to the AMD K6-2 and AMD K6-III processors that still used Socket 7.
Some system boards contained sockets for the Intel 485Turbocache daughtercard, which had either 64 or 128 Kbyte of cache memory.
The original Pentium 4 processor had a four-way set associative L1 data cache of 8 KiB in size, with 64-byte cache blocks, hence 8 KiB / (64 B × 4) = 32 sets.
One of the more extreme examples of cache specialization is the trace cache (also known as execution trace cache) found in the Intel Pentium 4 microprocessors.
tag_length = address_length - index_length - block_offset_length
The extra area (and some latency) can be mitigated by keeping virtual hints with each cache entry instead of virtual tags.
CPUs with integrated Intel Iris Pro Graphics have 128 MiB of eDRAM acting essentially as an L4 cache.
The first documented use of an instruction cache was on the CDC 6600.
In rare cases, such as the mainframe IBM z15 (2019), all cache levels down to L1 are implemented with eDRAM, replacing SRAM entirely (for cache; SRAM is still used for registers).
Finally, at the other end of the memory hierarchy, the CPU register file itself can be considered the smallest, fastest cache in the system, with the special characteristic that it is scheduled in software, typically by a compiler.
Smart cache is a level 2 or level 3 caching method for multiple execution cores, developed by Intel.
Pipelined CPUs access memory from multiple points in the pipeline: instruction fetch, virtual-to-physical address translation, and data fetch (see classic RISC pipeline).
The time taken to fetch one cache line from memory (read latency due to a cache miss) matters because the CPU will run out of things to do while waiting for the cache line.
Cache row entries usually have the following structure: a tag, a data block, and flag bits.
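Expressed as a C structure (sizes are illustrative; real hardware stores these fields in SRAM arrays rather than as a struct):

    #include <stdint.h>
    #include <stdbool.h>

    struct cache_row {
        uint64_t tag;        /* part of the data's memory address          */
        uint8_t  data[64];   /* the cached data block (cache line)         */
        bool     valid;      /* entry holds live data                      */
        bool     dirty;      /* modified since fetched (write-back caches) */
    };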
Cache hierarchy of the K8 core in the AMD Athlon 64 CPU
A micro-operation (µop) cache is a specialized cache that stores micro-operations of decoded instructions, as received directly from the instruction decoders or from the instruction cache.
A CPU cache is used to reduce the average cost (time or energy) to access data from the main memory.
