
Vector processor


, which held sixty-four 64-bit words each. The vector instructions were applied between registers, which is much faster than talking to main memory. Whereas the STAR-100 would apply a single operation across a long vector in memory and then move on to the next operation, the Cray design would load a smaller section of the vector into registers and then apply as many operations as it could to that data, thereby avoiding many of the much slower memory access operations.
This way, significantly more work can be done in each batch; the instruction encoding is much more elegant and compact as well. The only drawback is that in order to take full advantage of this extra batch processing capacity, the memory load and store speed correspondingly had to increase as well. This is sometimes claimed to be a disadvantage of Cray-style vector processors: in reality it is part of achieving high performance throughput, as seen in
, the consequences are that the operations now take longer to complete. If multi-issue is not possible, then the operations take even longer because the LD may not be issued (started) at the same time as the first ADDs, and so on. If there are only 4-wide 64-bit SIMD ALUs, the completion time is even worse: only when all four LOADs have completed may the SIMD operations start, and only when all ALU operations have completed may the STOREs begin.
– either by way of algorithmically loading data from memory, or reordering (remapping) the normally linear access to vector elements, or providing "Accumulators", arbitrary-sized matrices may be efficiently processed. IBM POWER10 provides MMA instructions, although for arbitrary matrix widths that do not fit the exact SIMD size, data repetition techniques are needed, which is wasteful of register file resources. Nvidia provides a high-level Matrix
– Vector architectures with a register-to-register design (analogous to load–store architectures for scalar processors) have instructions for transferring multiple elements between the memory and the vector registers. Typically, multiple addressing modes are supported. The unit-stride addressing mode is essential; modern vector architectures typically also support arbitrary constant strides, as well as the scatter/gather (also called
implementation things are rarely that simple. The data is rarely sent in raw form, and is instead "pointed to" by passing in an address to a memory location that holds the data. Decoding this address and getting the data out of the memory takes some time, during which the CPU traditionally would sit idle waiting for the requested data to show up. As CPU speeds have increased, this
) combine both, by issuing multiple data to multiple internal pipelined SIMD ALUs, the number issued being dynamically chosen by the vector program at runtime. Masks can be used to selectively load and store data in memory locations, and to use those same masks to selectively disable processing elements of SIMD ALUs. Some processors with SIMD (
, SIMD by definition avoids inter-lane operations entirely (element 0 can only be added to another element 0), vector processors tackle this head-on. What programmers are forced to do in software (using shuffle and other tricks to swap data into the right "lane"), vector processors must do in hardware, automatically.
hard-coded constant 16, n is decremented by a hard-coded 4, so initially it is hard to appreciate the significance. The difference comes in the realisation that the vector hardware could be capable of doing 4 simultaneous operations, or 64, or 10,000, it would be the exact same vector assembler for all of them
, which is strictly limited to execution of parallel pipelined arithmetic operations only. Although the exact internal details of today's commercial GPUs are proprietary secrets, the MIAOW team was able to piece together anecdotal information sufficient to implement a subset of the AMDGPU architecture.
The above SIMD example could potentially fault and fail at the end of memory, due to attempts to read too many values: it could also cause significant numbers of page or misaligned faults by similarly crossing over boundaries. In contrast, by allowing the vector architecture the freedom to decide how
Contrast this situation with SIMD, which has a fixed (inflexible) load width and fixed data processing width, is unable to cope with loads that cross page boundaries, and even if it could, is unable to adapt to what actually succeeded; yet, paradoxically, if the SIMD program were to even attempt to
For Cray-style vector ISAs such as RVV, an instruction called "setvl" (set vector length) is used. The hardware first defines how many data values it can process in one "vector": this could be either actual registers or it could be an internal loop (the hybrid approach, mentioned above). This maximum
to hold vector data in batches. The batch lengths (vector length, VL) could be dynamically set with a special instruction, the significance compared to Videocore IV (and, crucially as will be shown below, SIMD as well) being that the repeat length does not have to be part of the instruction encoding.
Having to perform 4-wide simultaneous 64-bit LOADs and 64-bit STOREs is very costly in hardware (256 bit data paths to memory). Having 4x 64-bit ALUs, especially MULTIPLY, likewise. To avoid these high costs, a SIMD processor would have to have 1-wide 64-bit LOAD, 1-wide 64-bit STORE, and only 2-wide
As of 2016 most commodity CPUs implement architectures that feature fixed-length SIMD instructions. On first inspection these can be considered a form of vector processing because they operate on multiple (vectorized, explicit length) data sets, and borrow features from vector processors. However, by
3395:– elements may typically contain two, three or four sub-elements (vec2, vec3, vec4) where any given bit of a predicate mask applies to the whole vec2/3/4, not the elements in the sub-vector. Sub-vectors are also introduced in RISC-V RVV (termed "LMUL"). Subvectors are a critical integral part of the 457:
the key distinguishing factor of SIMT-based GPUs is that they have a single instruction decoder-broadcaster, but the cores receiving and executing that same instruction are otherwise reasonably normal: they have their own ALUs, their own register files, their own Load/Store units and their own independent L1
Moreira, José E.; Barton, Kit; Battle, Steven; Bergner, Peter; Bertran, Ramon; Bhat, Puneeth; Caldeira, Pedro; Edelsohn, David; Fossum, Gordon; Frey, Brad; Ivanovic, Nemanja; Kerchner, Chip; Lim, Vincent; Kapoor, Shakti; Tulio Machado Filho; Silvia Melitta Mueller; Olsson, Brett; Sadasivam, Satish;
for example, things go rapidly downhill just as they did with the general case of using SIMD for general-purpose IAXPY loops. To sum the four partial results, two-wide SIMD can be used, followed by a single scalar add, to finally produce the answer, but, frequently, the data must be transferred out
in which the instructions pass through several sub-units in turn. The first sub-unit reads the address and decodes it, the next "fetches" the values at those addresses, and the next does the math itself. With pipelining the "trick" is to start decoding the next instruction even before the first has
In general terms, CPUs are able to manipulate one or two pieces of data at a time. For instance, most CPUs have an instruction that essentially says "add A to B and put the result in C". The data for A, B and C could be—in theory at least—encoded directly into the instruction. However, in efficient
Consider both a SIMD processor and a vector processor working on 4 64-bit elements, doing a LOAD, ADD, MULTIPLY and STORE sequence. If the SIMD width is 4, then the SIMD processor must LOAD four elements entirely before it can move on to the ADDs, must complete all the ADDs before it can move on to
Whereas pure (fixed-width, no predication) SIMD is often mistakenly claimed to be "vector" (because SIMD processes data which happens to be vectors), through close analysis and comparison of historic and modern ISAs, actual vector ISAs may be observed to have the following features that no SIMD ISA
be the vectorization ratio. If the vector unit adds an array of 64 numbers 10 times faster than its equivalent scalar counterpart, r = 10. Also, if the total number of operations in a program is 100, out of which only 10 are scalar (after vectorization), then f = 0.9, i.e.,
This begins to hint at the reason why ffirst is so innovative, and is best illustrated by memcpy or strcpy when implemented with standard 128-bit non-predicated non-ffirst SIMD. For IBM POWER9 the number of hand-optimised instructions to implement strncpy is in excess of 240. By contrast, the same
From the IAXPY example, it can be seen that unlike SIMD processors, which can simplify their internal hardware by avoiding dealing with misaligned memory access, a vector processor cannot get away with such simplification: algorithms are written which inherently rely on Vector Load and Store being
Implementations in hardware may, if they are certain that the right answer will be produced, perform the reduction in parallel. Some vector ISAs offer a parallel reduction mode as an explicit option, for when the programmer knows that any potential rounding errors do not matter, and low latency is
Also note, that just like the predicated SIMD variant, the pointers to x and y are advanced by t0 times four because they both point to 32 bit data, but that n is decremented by straight t0. Compared to the fixed-size SIMD assembler there is very little apparent difference: x and y are advanced by
Realistically, for general-purpose loops such as in portable libraries, where n cannot be limited in this way, the overhead of setup and cleanup for SIMD in order to cope with non-multiples of the SIMD width, can far exceed the instruction count inside the loop itself. Assuming worst-case that the
here (only start on a multiple of 16) and that n is a multiple of 4, as otherwise some setup code would be needed to calculate a mask or to run a scalar version. It can also be assumed, for simplicity, that the SIMD instructions have an option to automatically repeat scalar operands, like ARM NEON
This example starts with an algorithm ("IAXPY"), first show it in scalar instructions, then SIMD, then predicated SIMD, and finally vector instructions. This incrementally helps illustrate the difference between a traditional vector processor and a modern SIMD one. The example starts with a 32-bit
The vector pseudocode example above comes with a big assumption that the vector computer can process more than ten numbers in one batch. For a greater quantity of numbers in the vector register, it becomes infeasible for the computer to have a register that large. As a result, the vector processor
to implement vector instructions rather than multiple ALUs. In addition, the design had completely separate pipelines for different instructions, for example, addition/subtraction was implemented in different hardware than multiplication. This allowed a batch of vector instructions to be pipelined
workloads. Of interest, however, is that speed is far more important than accuracy in 3D for GPUs, where computation of pixel coordinates simply does not require high precision. The Vulkan specification recognises this and sets surprisingly low accuracy requirements, so that GPU hardware can reduce
first have to have a preparatory section which works on the beginning unaligned data, up to the first point where SIMD memory-aligned operations can take over. This will either involve (slower) scalar-only operations or smaller-sized packed SIMD operations. Each copy implements the full algorithm
To illustrate what a difference this can make, consider the simple task of adding two groups of 10 numbers together. In a normal programming language one would write a "loop" that picked up each of the pairs of numbers in turn, and then added them. To the CPU, this would look something like this:
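As a rough sketch only (the listing the article builds towards at this point is written in a hypothetical RISC assembly), the same per-pair loop can be expressed in C; the function name and array sizes here are purely illustrative assumptions:

#include <stddef.h>

/* Add two groups of 10 numbers, one pair at a time:
   each iteration loads a[i] and b[i], adds them, and stores the result in c[i]. */
void add_ten(const int a[10], const int b[10], int c[10])
{
    for (int i = 0; i < 10; i++)
        c[i] = a[i] + b[i];   /* load, load, add, store, then loop */
}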
313:, but at data-related tasks they could keep up while being much smaller and less expensive. However the machine also took considerable time decoding the vector instructions and getting ready to run the process, so it required very specific data sets to work on before it actually sped anything up. 2307:
Here it can be seen that the code is much cleaner but a little complex: at least, however, there is no setup or cleanup: on the last iteration of the loop, the predicate mask will be set to either 0b0000, 0b0001, 0b0011, 0b0111 or 0b1111, resulting in between 0 and 4 SIMD element operations being
787:
Vector processors take this concept one step further. Instead of pipelining just the instructions, they also pipeline the data itself. The processor is fed instructions that say not just to add A to B, but to add all of the numbers "from here to here" to all of the numbers "from there to there".
305:
The basic ASC (i.e., "one pipe") ALU used a pipeline architecture that supported both scalar and vector computations, with peak performance reaching approximately 20 MFLOPS, readily achieved when processing long vectors. Expanded ALU configurations supported "two pipes" or "four pipes" with a
1207:
adding those numbers in parallel. The checking of dependencies between those numbers is not required as a vector instruction specifies multiple independent operations. This simplifies the control logic required, and can further improve performance by avoiding stalls. The math operations thus
3291:– a less restrictive more generic variation of the compress/expand theme which instead takes one vector to specify the indices to use to "reorder" another vector. Gather/scatter is more complex to implement than compress/expand, and, being inherently non-sequential, can interfere with 2510:
This is essentially not very different from the SIMD version (processes 4 data elements per loop), or from the initial Scalar version (processes just the one). n still contains the number of data elements remaining to be processed, but t0 contains the copy of VL – the number that is
3281:– usually using a bit-mask, data is linearly compressed or expanded (redistributed) based on whether bits in the mask are set or clear, whilst always preserving the sequential order and never duplicating values (unlike Gather-Scatter aka permute). These instructions feature in 1680:
The STAR-like code remains concise, but because the STAR-100's vectorisation was by design based around memory accesses, an extra slot of memory is now required to process the information. Two times the latency is also needed due to the extra requirement of memory access.
1343:(SIMT). SIMT units run from a shared single broadcast synchronised Instruction Unit. The "vector registers" are very wide and the pipelines tend to be long. The "threading" part of SIMT involves the way data is handled independently on each of the compute units. 697:, almost qualifies as a vector processor. Predicated SIMD uses fixed-width SIMD ALUs but allows locally controlled (predicated) activation of units to provide the appearance of variable length vectors. Examples below help explain these categorical distinctions. 1983:
Eight-wide SIMD requires repeating the inner loop algorithm first with four-wide SIMD elements, then two-wide SIMD, then one (scalar), with a test and branch in between each one, in order to cover the first and last remaining SIMD elements (0 <= n <= 7).
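A structural C sketch of that cascade may help; the addw helper below is an invented stand-in for an 8-, 4- or 2-wide SIMD operation (it is not a real intrinsic), and only the shape of the control flow is the point:

#include <stddef.h>

/* Stand-in for a w-wide SIMD add of w consecutive elements. */
static void addw(int *c, const int *a, const int *b, int w)
{
    for (int j = 0; j < w; j++)
        c[j] = a[j] + b[j];
}

/* Main loop at full width, then a test-and-branch cascade through
   progressively narrower widths to cover the 0 <= remaining <= 7 tail. */
void add_cascade(size_t n, const int *a, const int *b, int *c)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) addw(&c[i], &a[i], &b[i], 8);  /* 8-wide main loop */
    if (i + 4 <= n) { addw(&c[i], &a[i], &b[i], 4); i += 4; } /* 4-wide step */
    if (i + 2 <= n) { addw(&c[i], &a[i], &b[i], 2); i += 2; } /* 2-wide step */
    if (i < n)        c[i] = a[i] + b[i];                     /* final scalar element */
}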
2986:
Even with a general loop (n not fixed), the only way to use 4-wide SIMD is to assume four separate "streams", each offset by four elements. Finally, the four partial results have to be summed. Other techniques involve shuffle: examples online can be found for
2994:
Aside from the size of the program and the complexity, an additional potential problem arises if floating-point computation is involved: the fact that the values are not being summed in strict order (four partial results) could result in rounding errors.
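A tiny, self-contained C demonstration of that effect; the values are arbitrary and chosen only so that partial sums taken in a different order visibly round differently:

#include <stdio.h>

int main(void)
{
    float x[8] = {1e8f, 1.0f, 1.0f, 1.0f, -1e8f, 1.0f, 1.0f, 1.0f};
    float strict = 0.0f, partial[4] = {0, 0, 0, 0};

    for (int i = 0; i < 8; i++)       /* strict left-to-right summation */
        strict += x[i];
    for (int i = 0; i < 8; i++)       /* four interleaved partial sums, */
        partial[i % 4] += x[i];       /* as a 4-wide SIMD sum would produce */
    float simd_like = (partial[0] + partial[1]) + (partial[2] + partial[3]);

    printf("%f vs %f\n", strict, simd_like);  /* the two results differ */
    return 0;
}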
3488:
iterations of the loop the batches of vectorised memory reads are optimally aligned with the underlying caches and virtual memory arrangements. Additionally, the hardware may choose to use the opportunity to end any given loop iteration's memory reads
3408:– aka "Lane Shuffling" which allows sub-vector inter-element computations without needing extra (costly, wasteful) instructions to move the sub-elements into the correct SIMD "lanes" and also saves predicate mask bits. Effectively this is an in-flight 3493:
on a page boundary (avoiding a costly second TLB lookup), with speculative execution preparing the next virtual memory page whilst data is still being processed in the current loop. All of this is determined by the hardware, not the program itself.
2311:
It is clear how predicated SIMD at least merits the term "vector capable", because it can cope with variable-length vectors by using predicate masks. The final evolving step to a "true" vector ISA, however, is to not have any evidence in the ISA
1272:
prefix. However, only very simple calculations can be done effectively in hardware this way without a very large cost increase. Since all operands have to be in memory for the STAR-100 architecture, the latency caused by access became huge too.
2343:
number that can be processed by the hardware in subsequent vector instructions, and sets the internal special register, "VL", to that same amount. ARM refers to this technique as "vector length agnostic" programming in its tutorials on SVE2.
179:
machine with 256 ALUs, but, when it was finally delivered in 1972, it had only 64 ALUs and could reach only 100 to 150 MFLOPS. Nevertheless, it showed that the basic concept was sound, and, when used on data-intensive applications, such as
2563:
Not only is it a much more compact program (saving on L1 Cache size), but as previously mentioned, the vector version can issue far more data processing to the ALUs, again saving power because Instruction Decode and Issue can sit idle.
3206:
Compared to any SIMD processor claiming to be a vector processor, the order of magnitude reduction in program size is almost shocking. However, this level of elegance at the ISA level has quite a high price tag at the hardware level:
657:
SIMD instruction sets lack crucial features when compared to vector instruction sets. The most important of these is that vector processors, inherently by definition and design, have always been variable-length since their inception.
2547:
Where with predicated SIMD the mask bitlength is limited to that which may be held in a scalar (or special mask) register, vector ISAs' mask registers have no such limitation. Cray-1 vectors could be just over 1,000 elements (in
2020:
Vector processors on the other hand are designed to issue computations of variable length for an arbitrary count, n, and thus require very little setup, and no cleanup. Even compared to those SIMD ISAs which have masks (but no
1234:, as the supercomputers themselves were, in general, found in places such as weather prediction centers and physics labs, where huge amounts of data are "crunched". However, as shown above and demonstrated by RISC-V RVV the 426:
2 module within a card that physically resembles a graphics coprocessor, but instead of serving as a co-processor, it is the main computer with the PC-compatible computer into which it is plugged serving support functions.
3189:
The simplicity of the algorithm is stark in comparison to SIMD. Again, just as with the IAXPY example, the algorithm is length-agnostic (even on Embedded implementations where maximum vector length could be only one).
788:
Instead of constantly having to decode instructions and then fetch the data needed to complete them, the processor reads a single instruction from memory, and it is simply implied in the definition of the instruction
741:, the ability to run MULTIPLY simultaneously with ADD), may complete the four operations faster than a SIMD processor with 1-wide LOAD, 1-wide STORE, and 2-wide SIMD. This more efficient resource utilization, due to 202:(DAP) design, categorising the ILLIAC and DAP as cellular array processors that potentially offered substantial performance benefits over conventional vector processor designs such as the CDC STAR-100 and Cray 1. 1359:
IV is also capable of this hybrid approach: nominally stating that its SIMD QPU Engine supports 16-long FP array operations in its instructions, it actually does them 4 at a time, as (another) form of "threads".
3197:
This example again highlights a fundamental difference between true vector processors and those SIMD processors, including most commercial GPUs, which are inspired by features of vector processors.
3460:
Introduced in ARM SVE2 and RISC-V RVV is the concept of speculative sequential Vector Loads. ARM SVE2 has a special register named "First Fault Register", where RVV modifies (truncates) the Vector Length (VL).
1307:
Modern SIMD computers claim to improve on early Cray by directly using multiple ALUs, for a higher degree of parallelism compared to only using the normal scalar pipeline. Modern vector processors (such as the
458:
data caches. Thus although all cores simultaneously execute the exact same instruction in lock-step with each other they do so with completely different data from completely different memory locations. This is
1324:) processing, and it is these which somewhat deserve the nomenclature "vector processor" or at least deserve the claim of being capable of "vector processing". SIMD processors without per-element predication ( 2857:
This is where the problems start. SIMD by design is incapable of doing arithmetic operations "inter-element". Element 0 of one SIMD register may be added to Element 0 of another register, but Element 0 may
1354:
may use fewer vector units than the width implies: instead of having 64 units for a 64-number-wide register, the hardware might instead do a pipelined loop over 16 units for a hybrid approach. The Broadcom
3476:
find out in advance (in each inner loop, every time) what might optimally succeed, those instructions only serve to hinder performance because they would, by necessity, be part of the critical inner loop.
2579:
This example starts with an algorithm which involves reduction. Just as with the previous example, it will be first shown in scalar instructions, then SIMD, and finally vector instructions, starting in
184:, the ILLIAC was the fastest machine in the world. The ILLIAC approach of using separate ALUs for each data element is not common to later designs, and is often referred to under a separate category, 2017:
Without predication, the wider the SIMD width the worse the problems get, leading to massive opcode proliferation, degraded performance, extra power consumption and unnecessary software complexity.
1219:
adding up many numbers in a row. The more complex instructions also add to the complexity of the decoders, which might slow down the decoding of the more common instructions such as normal adding. (
4206:
Saleil, Baptiste; Schmidt, Bill; Srinivasaraghavan, Rajalakshmi; Srivatsan, Shricharan; Thompto, Brian; Wagner, Andreas; Wu, Nelson (2021). "A matrix math facility for Power ISA(TM) processors".
1284:
field, but unlike the STAR-100 which uses memory for its repeats, the Videocore IV repeats are on all operations including arithmetic vector operations. The repeat length can be a small range of
338:. The Cray-1 normally had a performance of about 80 MFLOPS, but with up to three chains running it could peak at 240 MFLOPS and averaged around 150 – far faster than any machine of the era. 3388:
operations as well as short vectors for common operations (RGB, ARGB, XYZ, XYZW) support for the following is typically present in modern GPUs, in addition to those found in vector processors:
2339:
On calling setvl with the number of outstanding data elements to be processed, "setvl" is permitted (essentially required) to limit that to the Maximum Vector Length (MVL) and thus returns the
2335:
not make the mistake of assuming a fixed vector width: consequently MVL is not a quantity that the programmer needs to know. This can be a little disconcerting after years of SIMD mindset).
2849:
This is very straightforward. "y" starts at zero, 32-bit integers are loaded one at a time into r1, added to y, and the address of the array "x" advanced to the next element in the array.
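The same scalar reduction written in C (the function name is an illustrative assumption):

#include <stddef.h>

/* Sum all elements of x into an accumulator y, one element per iteration. */
int sum_scalar(size_t n, const int *x)
{
    int y = 0;
    for (size_t i = 0; i < n; i++)
        y += x[i];          /* load one element, add it to y, advance */
    return y;
}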
2567:
Additionally, the number of elements going into the function can start at zero. This sets the vector length to zero, which effectively disables all vector instructions, turning them into
2347:
Below is the Cray-style vector assembler for the same SIMD style loop, above. Note that t0 (which, containing a convenient copy of VL, can vary) is used instead of hard-coded constants:
3468:
amount loaded to either the amount that would succeed without raising a memory fault or simply to an amount (greater than zero) that is most convenient. The important factor is that
2010:
Over time, as the ISA evolves to keep increasing performance, ISA architects add 2-wide SIMD, then 4-wide SIMD, then 8-wide and upwards. It can therefore be seen why
6184: 4445: 2033:
Assuming a hypothetical predicated (mask capable) SIMD ISA, and again assuming that the SIMD instructions can cope with misaligned data, the instruction loop would look like this:
714:
Additionally, vector processors can be more resource-efficient by using slower hardware and saving power, but still achieving throughput and having less latency than SIMD, through
1230:
Vector processors were traditionally designed to work best only when there are large amounts of data to be worked on. For this reason, these sorts of CPUs were found primarily in
2025:
instruction), Vector processors produce much more compact code because they do not need to perform explicit mask calculation to cover the last few elements (illustrated below).
3818: 1211:
Not all problems can be attacked with this sort of solution. Including these types of instructions necessarily adds complexity to the core CPU. That complexity typically makes
4310: 3910: 2325:
amount (the number of hardware "lanes") is termed "MVL" (Maximum Vector Length). Note that, as seen in SX-Aurora and Videocore IV, MVL may be an actual hardware lane quantity
2308:
performed, respectively. One additional potential complication: some RISC ISAs do not have a "min" instruction, needing instead to use a branch or scalar predicated compare.
3472:
instructions are notified or may determine exactly how many Loads actually succeeded, using that quantity to only carry out work on the data that has actually been loaded.
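A C sketch of that contract, under stated assumptions: load_ff_sim, PAGE and the stop-at-page-boundary rule are invented here purely to model the behaviour, whereas a real fault-first load is a single hardware instruction:

#include <stddef.h>
#include <stdint.h>

#define PAGE 4096  /* assumed page size, for the simulation only */

/* Simulated fault-first load: try to read up to vl bytes from src, but stop
   early (without faulting) at the next page boundary. Returns how many bytes
   were actually "loaded"; the caller only processes that many. */
static size_t load_ff_sim(char *dst, const char *src, size_t vl)
{
    size_t to_boundary = PAGE - ((uintptr_t)src % PAGE);
    size_t done = vl < to_boundary ? vl : to_boundary;
    for (size_t i = 0; i < done; i++)
        dst[i] = src[i];
    return done;
}

/* memcpy-style loop: each pass asks for a large load, then advances by
   whatever the "hardware" reports actually succeeded. */
void copy_ff(char *dst, const char *src, size_t n, size_t mvl)
{
    while (n > 0) {
        size_t want = n < mvl ? n : mvl;
        size_t got  = load_ff_sim(dst, src, want);
        dst += got; src += got; n -= got;
    }
}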
1964:
Unfortunately for SIMD, the clue was in the assumption above, "that n is a multiple of 4" as well as "aligned access", which, clearly, is a limited specialist use-case.
792:
that the instruction will operate again on another item of data, at an address one increment larger than the last. This allows for significant savings in decoding time.
3601: 1260:
The self-repeating instructions are found in early vector computers like the STAR-100, where the above action would be described in a single instruction (somewhat like
5156: 4053: 3641: 1192:
With the length (equivalent to SIMD width) not being hard-coded into the instruction, not only is the encoding more compact, it's also "future-proof" and allows even
414:
Although vector supercomputers resembling the Cray-1 are less popular these days, NEC has continued to make this type of computer up to the present day with their
168: 6295: 5478: 3573: 5997: 4535: 2537:
in the SIMD width (load32x4 etc.) the vector ISA equivalents have no such limit. This makes vector programs both portable, Vendor Independent, and future-proof.
1257:
either gains the ability to perform loops itself, or exposes some sort of vector control (status) register to the programmer, usually known as a vector Length.
4103: 1957:
Note that both x and y pointers are incremented by 16, because that is how long (in bytes) four 32-bit integers are. The decision was made that the algorithm
4387: 3345:
API although the internal details are not available. The most resource-efficient technique is in-place reordering of access to otherwise linear vector data.
360:
machine, but it sold poorly and they took that as an opportunity to leave the supercomputing field entirely. In the early and mid-1980s Japanese companies (
41:) that were specifically designed from the ground up to handle large Vectors (Arrays). For SIMD instructions present in some general-purpose computers, see 6154: 5720: 5537: 3265:
containing multiple members. The members are extracted from the data structure (element), and each extracted member is placed into a different vector register.
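To make the idea concrete, here is a plain-C sketch of what a three-member segment load achieves; the struct and names are assumptions for illustration, and a vector segment load would perform the whole de-interleave as a single instruction:

#include <stddef.h>

struct rgb { unsigned char r, g, b; };   /* one "segment" of three members */

/* Scalar equivalent of a 3-element segment load: each member of every
   structure element lands in its own separate, vector-register-like array. */
void segment_load3(size_t n, const struct rgb *src,
                   unsigned char *vr, unsigned char *vg, unsigned char *vb)
{
    for (size_t i = 0; i < n; i++) {
        vr[i] = src[i].r;
        vg[i] = src[i].g;
        vb[i] = src[i].b;
    }
}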
1500:
In each iteration, every element of y has an element of x multiplied by a and added to it. The program is expressed in scalar linear form for readability.
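A minimal C rendering of that scalar loop; the function name iaxpy and its signature are illustrative assumptions rather than part of any particular ISA manual:

#include <stddef.h>

/* IAXPY: y[i] = a * x[i] + y[i], the 32-bit integer variant of DAXPY.
   Scalar, linear form: one multiply and one add per loop iteration. */
void iaxpy(size_t n, int a, const int *x, int *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}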
3315:– a very simple and strategically useful instruction which drops sequentially-incrementing immediates into successive elements. Usually starts from zero. 515:. The difference is illustrated below with examples, showing and comparing the three categories: Pure SIMD, Predicated SIMD, and Pure Vector Processing. 1200:
Additionally, in more modern vector processor ISAs, "Fail on First" or "Fault First" has been introduced (see below) which brings even more advantages.
3484:
many elements to load, the first part of a strncpy, if beginning initially on a sub-optimal memory boundary, may return just enough loads such that on
5500: 784:, but the CPU can process an entire batch of operations, in an overlapping fashion, much faster and more efficiently than if it did so one at a time. 642: 3948: 1186:
The code itself is also smaller, which can lead to more efficient memory use, reduction in L1 instruction cache size, reduction in power consumption.
745:, is a key advantage and difference compared to SIMD. SIMD, by design and definition, cannot perform chaining except to the entire group of results. 6149: 3309:– useful for interaction between scalar and vector, these broadcast a single value across a vector, or extract one item from a vector, respectively. 6221: 3059:
The code when n is larger than the maximum vector length is not that much more complex, and is a similar pattern to the first example ("IAXPY").
2866:
than another Element 0. This places some severe limitations on potential implementations. For simplicity it can be assumed that n is exactly 8:
5974: 4516: 3959: 3694: 3412:
of the sub-vector, heavily features in 3D Shader binaries, and is sufficiently important as to be part of the Vulkan SPIR-V spec. The Broadcom
3367:– including vectorised versions of bit-level permutation operations, bitfield insert and extract, centrifuge operations, population count, and 211: 4248: 3359:
or decimal fixed-point, and support for much larger (arbitrary precision) arithmetic operations by supporting parallel carry-in and carry-out
1976:
perform the aligned SIMD loop at the maximum SIMD width up until the last few elements (those remaining that do not fit the fixed SIMD width)
6918: 6042: 5305: 5149: 4783: 3971: 436: 189: 407:
processing rather than better implementations of vector processors. However, recognising the benefits of vector processing, IBM developed
2515:
to be processed in each iteration. t0 is subtracted from n after each iteration, and if n is zero then all elements have been processed.
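A C sketch of that control flow; setvl_sim and MVL_SIM are invented stand-ins for the real setvl instruction and the hardware's Maximum Vector Length, and the inner loop stands in for one vector instruction's worth of work:

#include <stddef.h>

#define MVL_SIM 8   /* assumed maximum vector length of the "hardware" */

/* Stand-in for setvl: VL = min(remaining, MVL). */
static size_t setvl_sim(size_t n) { return n < MVL_SIM ? n : MVL_SIM; }

void iaxpy_vl(size_t n, int a, const int *x, int *y)
{
    while (n > 0) {
        size_t vl = setvl_sim(n);        /* t0 := min(n, MVL), sets VL */
        for (size_t i = 0; i < vl; i++)  /* one batch of VL elements */
            y[i] = a * x[i] + y[i];
        x += vl; y += vl;                /* advance both pointers by t0 elements */
        n -= vl;                         /* n := n - t0; finished when n reaches 0 */
    }
}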
6928: 6069: 4806: 3272: 1340: 1321: 1251: 391:
Throughout, Cray continued to be the performance leader, continually beating the competition with a series of machines that led to the
5196: 4695: 6236: 6064: 6037: 5416: 4801: 4778: 4169: 3368: 3275:
allow parallel if/then/else constructs without resorting to branches. This allows code with conditional statements to be vectorized.
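A small C illustration of the transformation; the clamp condition and array names are arbitrary assumptions. Each element's if/then/else becomes a per-element select, which is exactly what a predicate mask expresses, so no branch is needed inside the loop:

#include <stddef.h>

/* Branching form: awkward for fixed SIMD, natural for masked/vector hardware. */
void clamp_branchy(size_t n, const int *x, int *y)
{
    for (size_t i = 0; i < n; i++) {
        if (x[i] > 255) y[i] = 255;     /* "then" path */
        else            y[i] = x[i];    /* "else" path */
    }
}

/* Branch-free form: a per-element predicate and select, directly vectorizable. */
void clamp_masked(size_t n, const int *x, int *y)
{
    for (size_t i = 0; i < n; i++) {
        int take = (x[i] > 255);        /* the predicate bit for element i */
        y[i] = take ? 255 : x[i];       /* select under the mask */
    }
}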
722:
the MULTIPLYs, and likewise must complete all of the MULTIPLYs before it can start the STOREs. This is by definition and by design.
275: 249: 137: 85: 5387: 4321: 693:) which is comprehensive individual element-level predicate masks on every vector instruction as is now available in ARM SVE2. And 156:(ALUs), one per cycle, but with a different data point for each one to work on. This allowed the Solomon machine to apply a single 3464:
The basic principle of ffirst is to attempt a large sequential Vector Load, but to allow the hardware to arbitrarily truncate the
2571:, at runtime. Thus, unlike non-predicated SIMD, even when there are no elements to process there is still no wasted cleanup code. 1757:), can do most of the operation in batches. The code is mostly similar to the scalar version. It is assumed that both x and y are 633:
Other CPU designs include some multiple instructions for vector processing on multiple (vectorized) data sets, typically known as
84:, whose instructions operate on single data items only, and in contrast to some of those same scalar processors having additional 7087: 7051: 6614: 5507: 5473: 5468: 5352: 4380: 4332: 3649:
is crucial to the performance. This ratio depends on the efficiency of the compilation, such as the adjacency of the elements in memory.
7026: 6923: 6324: 6231: 6032: 5275: 5253: 5142: 4773: 4588: 195: 1180:
only three address translations are needed. Depending on the architecture, this can represent a significant savings by itself.
7102: 5771: 5206: 4880: 4743: 4089: 3995: 3780: 253: 69: 4117: 6226: 6074: 5908: 5522: 5483: 5340: 5104: 4938: 4556: 4476: 3893: 3417: 3328:
on a vector (for example, find the one maximum value of an entire vector, or sum all elements). Iteration is of the form
3007:
to the ISA. If it is assumed that n is less than or equal to the maximum vector length, only three instructions are required:
372:(NEC) introduced register-based vector machines similar to the Cray-1, typically being slightly faster and much smaller. 6663: 6508: 6503: 6425: 5901: 5862: 5517: 5512: 5446: 5258: 3689: 3292: 2560:. Even compared to the predicate-capable SIMD, it is still more compact, clearer, more elegant and uses less resources. 742: 715: 333: 299: 181: 5382: 3242:
Where many SIMD ISAs borrow or are inspired by the list below, typical features that a vector processor will have are:
478:
follows similar principles to the early vector processors, and is being implemented in commercial products such as the
7097: 6290: 5987: 5685: 5123: 5069: 4529: 4373: 408: 369: 199: 4277: 238: 6940: 6587: 6004: 5495: 5463: 5233: 5221: 5201: 5048: 4843: 4728: 4690: 4540: 4430: 3724: 3443: 1508:
The scalar version of this would load one of each of x and y, process one calculation, store one result, and loop:
539: 116: 3873: 3261:
variants of the standard vector load and stores. Segment loads read a vector from memory, where each element is a
1183:
Another saving is fetching and decoding the instruction itself, which has to be done only one time instead of ten.
1085:
Cray-style vector ISAs take this a step further and provide a global "count" register, called vector length (VL):
257: 242: 7031: 6994: 6984: 5372: 5064: 5043: 4988: 4875: 4865: 4838: 4700: 4030: 3719: 1329: 1317: 593: 535: 530:
in Flynn's Taxonomy. Common examples using SIMD with features inspired by vector processors include: Intel x86's
778:
is constantly in use. Any particular instruction takes the same amount of time to complete, a time known as the
7046: 6453: 6389: 6366: 6216: 6178: 6014: 5964: 5959: 5436: 5330: 5238: 5018: 4644: 4583: 4496: 2581: 1758: 1370: 1196:
designs to consider using vectors purely to gain all the other advantages, rather than go for high performance.
353: 289: 5243: 3299:
Memory Load/Store modes, Gather/scatter vector operations act on the vector registers, and are often termed a
453:, and can be considered vector processors (using a similar strategy for hiding memory latencies). As shown in 4933: 1762:
can. If it does not, a "splat" (broadcast) must be used, to copy the scalar argument across a SIMD register:
6999: 6782: 6676: 6640: 6557: 6541: 6383: 6172: 6131: 6119: 5982: 5896: 5817: 5582: 5186: 5079: 5074: 4524: 3684: 2004: 149: 61: 6805: 6777: 6687: 6652: 6401: 6395: 6377: 6111: 6105: 6009: 5913: 5804: 5743: 5605: 5248: 4818: 4750: 4654: 4546: 4501: 3714: 766: 551: 523: 377: 306:
corresponding 2X or 4X performance gain. Memory bandwidth was sufficient to support these expanded modes.
89: 4608: 3937: 1208:
completed far faster overall, the limiting factor being the time required to fetch the data from memory.
765:
In order to reduce the amount of time consumed by these steps, most modern CPUs use a technique known as
92:(SWAR) Arithmetic Units. Vector processors can greatly improve performance on certain workloads, notably 7092: 6979: 6888: 6634: 6346: 6164: 5923: 5891: 5849: 5761: 5562: 5377: 5367: 5357: 5347: 5317: 5300: 5165: 4910: 4870: 4823: 4813: 4551: 4471: 4410: 4264: 3842:
An algorithm of hardware unit generation for processor core synthesis with packed SIMD type instructions
1351: 1325: 780: 727: 531: 423: 153: 144:
project. Solomon's goal was to dramatically increase math performance by using a large number of simple
3806: 4343: 7009: 6945: 6531: 6253: 6143: 6090: 5622: 5335: 5191: 5173: 4850: 4738: 4733: 4723: 4710: 4506: 3663: 3356: 759: 606: 328: 101: 93: 73: 7056: 6658: 4041: 3745: 7041: 6861: 6712: 6694: 6646: 6300: 6247: 6052: 6047: 6024: 5940: 5822: 5677: 5572: 5431: 5013: 4968: 4794: 4789: 4768: 4634: 4136: 3409: 3300: 3216: 1754: 690: 589: 527: 463: 454: 415: 42: 1995:
increase in instruction count! This can easily be demonstrated by compiling the iaxpy example for
6913: 6905: 6757: 6732: 6536: 6411: 5935: 5876: 5756: 5488: 5216: 5038: 4887: 4860: 4685: 4649: 4639: 4440: 4420: 4415: 4396: 4207: 4191: 1296: 1193: 404: 185: 97: 4598: 4354: 4228: 4147: 4066: 4011: 1339:
Modern GPUs, which have many small compute units each with their own independent SIMD ALUs, use
3580: 3447:
power usage. The concept of reducing accuracy where it is simply not needed is explored in the
1979:
have a cleanup phase which, like the preparatory section, is just as large and just as complex.
6866: 6833: 6749: 6681: 6582: 6572: 6562: 6493: 6488: 6483: 6406: 6335: 6241: 6201: 5834: 5784: 5734: 5710: 5592: 5532: 5527: 5409: 5325: 5084: 4760: 4718: 4613: 4244: 3991: 3987: 3776: 3679: 1309: 563: 419: 385: 296: 3606: 320:. Instead of leaving the data in memory like the STAR-100 and ASC, the Cray design had eight 7036: 6969: 6810: 6717: 6671: 6478: 6473: 6468: 6463: 6458: 6448: 6318: 6285: 6196: 6100: 5947: 5930: 5918: 5857: 5421: 5399: 5285: 5263: 5181: 5094: 4893: 4675: 4491: 4486: 4481: 4450: 4236: 3975: 3853: 3845: 3709: 3363: 479: 321: 81: 31: 1078:
which has performed 10 sequential operations: effectively the loop count is on an explicit
737:
and uses no SIMD ALUs, only having 1-wide 64-bit LOAD, 1-wide 64-bit STORE (and, as in the
411:
for use in supercomputers coupling several scalar processors to act as a vector processor.
6950: 6935: 6883: 6787: 6762: 6599: 6592: 6443: 6438: 6433: 6372: 6280: 6270: 5992: 5827: 5779: 5542: 5426: 5394: 5295: 5290: 5211: 4958: 4898: 4833: 4680: 4670: 4603: 4435: 4425: 3658: 3234:
These stark differences are what distinguish a vector processor from one that has SIMD.
1204: 775: 614: 188:
computing. Around this time Flynn categorized this type of processing as an early form of
65: 4593: 4077: 2518:
A number of things to note, when comparing against the Predicated SIMD assembly variant:
1225:
principles: RVV only adds around 190 vector instructions even with the advanced features.
3520: 704:
to cope with iteration and reduction. This is illustrated further with examples, below.
7061: 6895: 6878: 6871: 6767: 6624: 6361: 6275: 6206: 5789: 5751: 5700: 5695: 5690: 5404: 5228: 5089: 4905: 4562: 4455: 4181: 3674: 3296: 3262: 1238:
of vector ISAs brings other benefits which are compelling even for Embedded use-cases.
755: 579: 450: 120: 2552:
Thus it can be seen, very clearly, how vector ISAs reduce the number of instructions.
7081: 6856: 6772: 5812: 5794: 5587: 5280: 4978: 4855: 4299: 3980: 3427: 3385: 1231: 771: 503:
definition, the addition of SIMD cannot, by itself, qualify a processor as an actual
108: 4158: 7066: 7004: 6820: 6797: 6609: 6330: 5268: 4578: 3515:
90% of the work is done by the vector unit. It follows the achievable speed up of:
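Writing f for the vectorized fraction of the work and r for the vector/scalar speed ratio, as defined above, the achievable speedup follows the standard Amdahl-style expression:

\[
\text{speedup} \;=\; \frac{1}{(1-f) + f/r} \;=\; \frac{r}{(1-f)\,r + f}
\]

With the figures used in the example, f = 0.9 and r = 10, this gives 10 / 1.9, roughly 5.3. Even with an arbitrarily fast vector unit (r tending to infinity), the speedup is bounded by 1/(1 - f), here a factor of 10.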
3416:
IV uses the terminology "Lane rotate" where the rest of the industry uses the term
3352: 2533:
Where the SIMD variant hard-coded both the width (4) into the creation of the mask
1961:
only cope with 4-wide SIMD, therefore the constant is hard-coded into the program.
1285: 675: 646: 626: 381: 341: 4288: 3844:. Asia-Pacific Conference on Circuits and Systems. Vol. 1. pp. 171–176. 2983:
of dedicated SIMD registers before the last scalar computation can be performed.
1968:
hardware cannot do misaligned SIMD memory accesses, a real-world algorithm will:
167:
In 1962, Westinghouse cancelled the project, but the effort was restarted by the
17: 6851: 6815: 6526: 6498: 6356: 6211: 5099: 4240: 4229:"A Modular Massively Parallel Processor for Volumetric Visualisation Processing" 3849: 483: 227: 145: 3924: 1276:
Interestingly, though, Broadcom included space in all vector operations of the
43:
Flynn's taxonomy § Single instruction stream, multiple data streams (SIMD)
6737: 6727: 6722: 6704: 6604: 6577: 5839: 5672: 5642: 5362: 2970:- Fourth SIMD ADD: element 3 of first group added to element 2 of second group 2958:- Second SIMD ADD: element 1 of first group added to element 1 of second group 474:
Several modern CPU architectures are being designed as vector processors. The
198:
sought to avoid many of the difficulties with the ILLIAC concept with its own
4104:"Assembly - Fastest way to do horizontal SSE vector sum (Or other reduction)" 2964:- Third SIMD ADD: element 2 of first group added to element 2 of second group 2952:- First SIMD ADD: element 0 of first group added to element 0 of second group 6828: 6825: 6567: 5637: 5615: 4973: 4948: 3982:
Computer Organization and Design: the Hardware/Software Interface page 751-2
3439: 3413: 3325: 3320: 1356: 1347: 1277: 707: 491: 487: 400: 396: 172: 157: 49: 3480:
strncpy routine in hand-optimised RVV assembler is a mere 22 instructions.
2711:
Here, an accumulator (y) is used to sum up all the values in the array, x.
592:. Two notable examples which have per-element (lane-based) predication are 1203:
But more than that, a high performance vector processor may have multiple
6843: 5715: 5662: 5134: 5023: 5003: 4928: 1784:
The time taken would be basically the same as a vector implementation of
1074:
Note the complete lack of looping in the instructions, because it is the
547: 345: 310: 309:
The STAR-100 was otherwise slower than CDC's own supercomputers like the
292: 161: 2719:
The scalar version of this would load each of x, add it to y, and loop:
1189:
With the program size being reduced, branch prediction has an easier job.
5652: 5610: 5028: 5008: 4983: 4618: 4186: 3840:
Miyaoka, Y.; Choi, J.; Togawa, N.; Yanagisawa, M.; Ohtsuki, T. (2002).
3448: 3442:
obviously feature much more predominantly in 3D than in many demanding
3282: 3253:) addressing mode. Advanced architectures may also include support for 2988: 2011: 1996: 1333: 1313: 804:; assume a, b, and c are memory locations in their respective registers 694: 597: 571: 559: 555: 446: 365: 361: 4182:"IBM's POWER10 Processor - William Starke & Brian W. Thompto, IBM" 3858: 6955: 5667: 5632: 5597: 4998: 4993: 4365: 4122: 4016: 3878: 3819:"Andes Announces RISC-V Multicore 1024-bit Vector Processor: AX45MPV" 3703: 3699: 3435: 3399: 3396: 3381: 1292: 738: 622: 618: 610: 543: 475: 392: 373: 357: 317: 316:
The vector technique was first fully exploited in 1976 by the famous
176: 123:
designs led to a decline in vector supercomputers during the 1990s.
4212: 971:
But to a vector processor, this task looks considerably different:
422:
places the processor and either 24 or 48 gigabytes of memory on an
27:
Computer processor which works on arrays of several numbers at once
6125: 5657: 5627: 4233:
High Performance Computing for Computer Graphics and Visualisation
3669: 3215:
Whilst from the reduction example it can be seen that, aside from
2568: 1753:
A modern packed SIMD architecture, known by many names (listed in
340: 6989: 6137: 6057: 5647: 5033: 4963: 4953: 3431: 3342: 1991:
the size of the code, in fact in extreme cases it results in an
1247: 1222: 638: 634: 575: 442: 403:. Since then, the supercomputer market has focused much more on 112: 96:
and similar tasks. Vector processing techniques also operate in
5138: 4369: 3212:
successful, regardless of alignment of the start of the vector.
758:
has historically become a large impediment to performance; see
5577: 5567: 4943: 4920: 3773:
The history of computer technology in their faces (in Russian)
1301: 1265: 567: 221: 136:
Vector processing development began in the early 1960s at the
38: 3577:
So, even if the performance of the vector unit is very high (
3003:
Vector instruction sets have arithmetic reduction operations
302:(ASC), which were introduced in 1974 and 1972, respectively. 152:(CPU). The CPU fed a single common instruction to all of the 111:
design through the 1970s into the 1990s, notably the various
72:
are designed to operate efficiently and effectively on large
3230:
simplified software and complex hardware (vector processors)
1221:
This can be somewhat mitigated by keeping the entire ISA to
807:; add 10 numbers in a to 10 numbers in b, store results in c 2316:
of a SIMD width, leaving that entirely up to the hardware.
674:
instruction in NEC SX, without restricting the length to a
2331:(Note: As mentioned in the ARM SVE2 Tutorial, programmers 486:
vector processor architectures being developed, including
107:
Vector machines appeared in the early 1970s and dominated
30:"Array processor" redirects here. Not to be confused with 4118:"Riscv-v-spec/V-spec.adoc at master · riscv/Riscv-v-spec" 4012:"Riscv-v-spec/V-spec.adoc at master · riscv/Riscv-v-spec" 3874:"Riscv-v-spec/V-spec.adoc at master · riscv/Riscv-v-spec" 3702:, an open ISA standard with an associated variable width 637:(Multiple Instruction, Multiple Data) and realized with 175:. Their version of the design originally called for a 1 3795: 700:
SIMD, because it uses fixed-width batch processing, is
332:
into each of the ALU subunits, a technique they called
1320:) are capable of this kind of selective, per-element ( 726:
64-bit ALUs. As shown in the diagram, which assumes a
4090:"Sse - 1-to-4 broadcast and 4-to-1 reduce in AVX-512" 3609: 3583: 3523: 1176:
There are several savings inherent in this approach.
356:
tried to re-enter the high-end market again with its
7019: 6968: 6904: 6842: 6796: 6748: 6703: 6623: 6550: 6519: 6424: 6345: 6309: 6263: 6163: 6089: 6023: 5973: 5884: 5875: 5848: 5803: 5770: 5742: 5733: 5553: 5456: 5445: 5316: 5172: 5057: 4919: 4759: 4709: 4663: 4627: 4571: 4515: 4464: 4403: 2979: 2967: 2961: 2955: 2949: 2527: 2523: 2022: 2000: 1785: 1281: 1269: 1261: 671: 667: 348:
processor module with four scalar/vector processors
3979: 3894:"Vector Engine Assembly Language Reference Manual" 3635: 3595: 3567: 649:VLIW/vector processor combines both technologies. 1350:IV and other external vector processors like the 645:(Explicitly Parallel Instruction Computing). The 4054:"Coding for Neon - Part 3 Matrix Multiplication" 214:was presented and developed by Kartsev in 1967. 3821:(Press release). GlobeNewswire. 7 December 2022 3227:complex software and simplified hardware (SIMD) 733:A vector processor, by contrast, even if it is 4333:PATCH to libc6 to add optimised POWER9 strncpy 3223:Overall then there is a choice to either have 5150: 4381: 2945:At this point four adds have been performed: 2558:and there would still be no SIMD cleanup code 1288:or sourced from one of the scalar registers. 1215:instructions run slower—i.e., whenever it is 1090:; again assume we have vector registers v1-v3 653:Difference between SIMD and vector processors 8: 2544:that is automatically applied to the vectors 1369:integer variant of the "DAXPY" function, in 666:a way to set the vector length, such as the 37:This article is about Processors (including 6155:Computer performance by orders of magnitude 256:. Unsourced material may be challenged and 6620: 6260: 5881: 5739: 5453: 5157: 5143: 5135: 4388: 4374: 4366: 169:University of Illinois at Urbana–Champaign 4211: 3986:(2nd ed.). Morgan Kaufmann. p.  3857: 3613: 3608: 3582: 3527: 3522: 276:Learn how and when to remove this message 4170:RVV register gather-scatter instructions 3938:Vector and SIMD processors, slides 12-13 2107:# now do the operation, masked by m bits 2041:# prepare mask. few ISAs have min though 706: 380:(FPS) built add-on array processors for 288:The first vector supercomputers are the 3736: 1346:In addition, GPUs such as the Broadcom 976:; assume we have vector registers v1-v3 760:Random-access memory § Memory wall 678:or to a multiple of a fixed data width. 4289:Abandoned US patent US20110227920-0096 4042:Videocore IV QPU analysis by Jeff Bush 3949:Array vs Vector Processing, slides 5-7 3695:Computer for operations with functions 3289:Register Gather, Scatter (aka permute) 1093:; with size larger than or equal to 10 681:Iteration and reduction over elements 212:computer for operations with functions 206:Computer for operations with functions 2974:but with 4-wide SIMD being incapable 2526:instruction has embedded within it a 1304:, which face exactly the same issue. 148:under the control of a single master 7: 6126:Floating-point operations per second 498:Comparison with modern architectures 437:Single instruction, multiple threads 254:adding citations to reliable sources 190:single instruction, multiple threads 1341:Single Instruction Multiple Threads 1252:Single Instruction Multiple Threads 979:; with size equal or larger than 10 770:left the CPU, in the fashion of an 3744:Parkinson, Dennis (17 June 1976). 3590: 1749:Pure (non-predicated, packed) SIMD 441:Modern graphics processing units ( 25: 2218:# update x, y and n for next loop 963:; loop back if count is not yet 0 670:instruction in RISCV RVV, or the 641:(Very Long Instruction Word) and 418:of computers. Most recently, the 138:Westinghouse Electric Corporation 115:platforms. The rapid fall in the 86:single instruction, multiple data 7052:Semiconductor device fabrication 5118: 5117: 4194:from the original on 2021-12-11. 4031:Videocore IV Programmer's Manual 3960:SIMD vs Vector GPU, slides 22-24 3643:, which suggests that the ratio 482:AX45MPV. 