, which held sixty-four 64-bit words each. The vector instructions were applied between registers, which is much faster than talking to main memory. Whereas the STAR-100 would apply a single operation across a long vector in memory and then move on to the next operation, the Cray design would load a smaller section of the vector into registers and then apply as many operations as it could to that data, thereby avoiding many of the much slower memory access operations.
This way, significantly more work can be done in each batch; the instruction encoding is much more elegant and compact as well. The only drawback is that in order to take full advantage of this extra batch processing capacity, the memory load and store speed correspondingly had to increase as well. This is sometimes claimed to be a disadvantage of Cray-style vector processors: in reality it is part of achieving high performance throughput, as seen in
, the consequences are that the operations now take longer to complete. If multi-issue is not possible, then the operations take even longer because the LD may not be issued (started) at the same time as the first ADDs, and so on. If there are only 4-wide 64-bit SIMD ALUs, the completion time is even worse: only when all four LOADs have completed may the SIMD operations start, and only when all ALU operations have completed may the STOREs begin.
– either by way of algorithmically loading data from memory, or reordering (remapping) the normally linear access to vector elements, or providing "Accumulators", arbitrary-sized matrices may be efficiently processed. IBM POWER10 provides MMA instructions, although for arbitrary matrix widths that do not fit the exact SIMD size, data-repetition techniques are needed, which is wasteful of register-file resources. Nvidia provides a high-level Matrix
– Vector architectures with a register-to-register design (analogous to load–store architectures for scalar processors) have instructions for transferring multiple elements between the memory and the vector registers. Typically, multiple addressing modes are supported. The unit-stride addressing mode is essential; modern vector architectures typically also support arbitrary constant strides, as well as the scatter/gather (also called
implementation, things are rarely that simple. The data is rarely sent in raw form, and is instead "pointed to" by passing in an address to a memory location that holds the data. Decoding this address and getting the data out of the memory takes some time, during which the CPU traditionally would sit idle waiting for the requested data to show up. As CPU speeds have increased, this
) combine both, by issuing multiple data to multiple internal pipelined SIMD ALUs, the number issued being dynamically chosen by the vector program at runtime. Masks can be used to selectively load and store data in memory locations, and those same masks can be used to selectively disable processing elements of SIMD ALUs. Some processors with SIMD (
, SIMD by definition avoids inter-lane operations entirely (element 0 can only be added to another element 0); vector processors tackle this head-on. What programmers are forced to do in software (using shuffle and other tricks to swap data into the right "lane"), vector processors must do in hardware, automatically.
hard-coded constant 16, n is decremented by a hard-coded 4, so initially it is hard to appreciate the significance. The difference comes in the realisation that the vector hardware could be capable of doing 4 simultaneous operations, or 64, or 10,000; it would be the exact same vector assembler for all of them.
, which is strictly limited to execution of parallel pipelined arithmetic operations only. Although the exact internal details of today's commercial GPUs are proprietary secrets, the MIAOW team was able to piece together anecdotal information sufficient to implement a subset of the AMDGPU architecture.
The above SIMD example could potentially fault and fail at the end of memory, due to attempts to read too many values: it could also cause significant numbers of page or misaligned faults by similarly crossing over boundaries. In contrast, by allowing the vector architecture the freedom to decide how
Contrast this situation with SIMD, which has a fixed (inflexible) load width and a fixed data-processing width: it is unable to cope with loads that cross page boundaries, and even if it could, it would be unable to adapt to what actually succeeded; yet, paradoxically, if the SIMD program were to even attempt to
For Cray-style vector ISAs such as RVV, an instruction called "setvl" (set vector length) is used. The hardware first defines how many data values it can process in one "vector": this could be either actual registers or it could be an internal loop (the hybrid approach, mentioned above). This maximum
to hold vector data in batches. The batch lengths (vector length, VL) could be dynamically set with a special instruction, the significance compared to
Videocore IV (and, crucially, as will be shown below, SIMD as well) being that the repeat length does not have to be part of the instruction encoding.
Having to perform 4-wide simultaneous 64-bit LOADs and 64-bit STOREs is very costly in hardware (256 bit data paths to memory). Having 4x 64-bit ALUs, especially MULTIPLY, likewise. To avoid these high costs, a SIMD processor would have to have 1-wide 64-bit LOAD, 1-wide 64-bit STORE, and only 2-wide
As of 2016 most commodity CPUs implement architectures that feature fixed-length SIMD instructions. On first inspection these can be considered a form of vector processing because they operate on multiple (vectorized, explicit length) data sets, and borrow features from vector processors. However, by
– elements may typically contain two, three or four sub-elements (vec2, vec3, vec4) where any given bit of a predicate mask applies to the whole vec2/3/4, not the elements in the sub-vector. Sub-vectors are also introduced in RISC-V RVV (termed "LMUL"). Subvectors are a critical integral part of the
the key distinguishing factor of SIMT-based GPUs is that they have a single instruction decoder-broadcaster, but the cores receiving and executing that same instruction are otherwise reasonably normal: their own ALUs, their own register files, their own Load/Store units and their own independent L1
Moreira, José E.; Barton, Kit; Battle, Steven; Bergner, Peter; Bertran, Ramon; Bhat, Puneeth; Caldeira, Pedro; Edelsohn, David; Fossum, Gordon; Frey, Brad; Ivanovic, Nemanja; Kerchner, Chip; Lim, Vincent; Kapoor, Shakti; Tulio
Machado Filho; Silvia Melitta Mueller; Olsson, Brett; Sadasivam, Satish;
for example, things go rapidly downhill just as they did with the general case of using SIMD for general-purpose IAXPY loops. To sum the four partial results, two-wide SIMD can be used, followed by a single scalar add, to finally produce the answer, but, frequently, the data must be transferred out
in which the instructions pass through several sub-units in turn. The first sub-unit reads the address and decodes it, the next "fetches" the values at those addresses, and the next does the math itself. With pipelining the "trick" is to start decoding the next instruction even before the first has
In general terms, CPUs are able to manipulate one or two pieces of data at a time. For instance, most CPUs have an instruction that essentially says "add A to B and put the result in C". The data for A, B and C could be—in theory at least—encoded directly into the instruction. However, in efficient
Consider both a SIMD processor and a vector processor working on 4 64-bit elements, doing a LOAD, ADD, MULTIPLY and STORE sequence. If the SIMD width is 4, then the SIMD processor must LOAD four elements entirely before it can move on to the ADDs, must complete all the ADDs before it can move on to
Whereas pure (fixed-width, no predication) SIMD is often mistakenly claimed to be "vector" (because SIMD processes data which happens to be vectors), through close analysis and comparison of historic and modern ISAs, actual vector ISAs may be observed to have the following features that no SIMD ISA
be the vectorization ratio. If the time taken for the vector unit to add an array of 64 numbers is 10 times faster than its equivalent scalar counterpart, r = 10. Also, if the total number of operations in a program is 100, out of which only 10 are scalar (after vectorization), then f = 0.9, i.e.,
This begins to hint at the reason why ffirst is so innovative, and is best illustrated by memcpy or strcpy when implemented with standard 128-bit non-predicated non-ffirst SIMD. For IBM POWER9 the number of hand-optimised instructions to implement strncpy is in excess of 240. By contrast, the same
From the IAXPY example, it can be seen that unlike SIMD processors, which can simplify their internal hardware by avoiding dealing with misaligned memory access, a vector processor cannot get away with such simplification: algorithms are written which inherently rely on Vector Load and Store being
Implementations in hardware may, if they are certain that the right answer will be produced, perform the reduction in parallel. Some vector ISAs offer a parallel reduction mode as an explicit option, for when the programmer knows that any potential rounding errors do not matter, and low latency is
Also note that, just like the predicated SIMD variant, the pointers to x and y are advanced by t0 times four because they both point to 32-bit data, but that n is decremented by straight t0. Compared to the fixed-size SIMD assembler there is very little apparent difference: x and y are advanced by
Realistically, for general-purpose loops such as in portable libraries, where n cannot be limited in this way, the overhead of setup and cleanup for SIMD in order to cope with non-multiples of the SIMD width, can far exceed the instruction count inside the loop itself. Assuming worst-case that the
here (only start on a multiple of 16) and that n is a multiple of 4, as otherwise some setup code would be needed to calculate a mask or to run a scalar version. It can also be assumed, for simplicity, that the SIMD instructions have an option to automatically repeat scalar operands, like ARM NEON can.
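Under those assumptions, the 4-wide SIMD inner loop might be sketched as follows (the mnemonics are illustrative, not taken from any real ISA; note the hard-coded 16-byte pointer advance and the hard-coded decrement of n by 4):

```
vloop:
  load32x4  v1, x          ; load four 32-bit elements of x
  load32x4  v2, y          ; load four 32-bit elements of y
  mul32x4   v1, a, v1      ; v1 := a * x (scalar a repeated across all lanes)
  add32x4   v3, v1, v2     ; v3 := a*x + y
  store32x4 v3, y          ; write four results back to y
  addl      x, x, $16      ; advance x by 16 bytes (four 32-bit elements)
  addl      y, y, $16      ; advance y likewise
  subl      n, n, $4       ; four elements processed per iteration
  jgz       n, vloop       ; loop back while n > 0
  ret       y
```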
This example starts with an algorithm ("IAXPY"), showing it first in scalar instructions, then SIMD, then predicated SIMD, and finally vector instructions. This incrementally helps illustrate the difference between a traditional vector processor and a modern SIMD one. The example starts with a 32-bit
The vector pseudocode example above comes with a big assumption that the vector computer can process more than ten numbers in one batch. For a greater quantity of numbers in the vector register, it becomes unfeasible for the computer to have a register that large. As a result, the vector processor
to implement vector instructions rather than multiple ALUs. In addition, the design had completely separate pipelines for different instructions, for example, addition/subtraction was implemented in different hardware than multiplication. This allowed a batch of vector instructions to be pipelined
workloads. Of interest, however, is that speed is far more important than accuracy in 3D for GPUs, where computation of pixel coordinates simply does not require high precision. The Vulkan specification recognises this and sets surprisingly low accuracy requirements, so that GPU hardware can reduce
first have to have a preparatory section which works on the beginning unaligned data, up to the first point where SIMD memory-aligned operations can take over. This will either involve (slower) scalar-only operations or smaller-sized packed SIMD operations. Each copy implements the full algorithm
To illustrate what a difference this can make, consider the simple task of adding two groups of 10 numbers together. In a normal programming language one would write a "loop" that picked up each of the pairs of numbers in turn, and then added them. To the CPU, this would look something like this:
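A sketch in the pseudo-assembly of a hypothetical RISC machine (mnemonics illustrative):

```
  move  $10, count        ; count := 10
loop:
  load  r1, a             ; fetch the next element of the first group
  load  r2, b             ; fetch the next element of the second group
  add   r3, r1, r2        ; r3 := r1 + r2
  store r3, c             ; write the sum
  add   a, a, $4          ; advance all three pointers by one element
  add   b, b, $4
  add   c, c, $4
  dec   count             ; one pair done
  jnez  count, loop       ; loop back until all ten are added
  ret   c
```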
, but at data-related tasks they could keep up while being much smaller and less expensive. However, the machine also took considerable time decoding the vector instructions and getting ready to run the process, so it required very specific data sets to work on before it actually sped anything up.
Here it can be seen that the code is much cleaner but a little more complex: at least, however, there is no setup or cleanup: on the last iteration of the loop, the predicate mask will be set to either 0b0000, 0b0001, 0b0011, 0b0111 or 0b1111, resulting in between 0 and 4 SIMD element operations being
Vector processors take this concept one step further. Instead of pipelining just the instructions, they also pipeline the data itself. The processor is fed instructions that say not just to add A to B, but to add all of the numbers "from here to here" to all of the numbers "from there to there".
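In the same illustrative pseudo-assembly, and assuming a machine whose vector registers hold at least ten elements, the whole task collapses to a handful of instructions:

```
  move   $10, count       ; ten elements to process
  vload  v1, a, count     ; load the whole of group a
  vload  v2, b, count     ; load the whole of group b
  vadd   v3, v1, v2       ; one instruction performs all ten additions
  vstore v3, c, count     ; store all ten results
  ret    c
```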
The basic ASC (i.e., "one pipe") ALU used a pipeline architecture that supported both scalar and vector computations, with peak performance reaching approximately 20 MFLOPS, readily achieved when processing long vectors. Expanded ALU configurations supported "two pipes" or "four pipes" with a
adding those numbers in parallel. The checking of dependencies between those numbers is not required as a vector instruction specifies multiple independent operations. This simplifies the control logic required, and can further improve performance by avoiding stalls. The math operations thus
– a less restrictive, more generic variation of the compress/expand theme which instead takes one vector to specify the indices to use to "reorder" another vector. Gather/scatter is more complex to implement than compress/expand, and, being inherently non-sequential, can interfere with
This is essentially not very different from the SIMD version (processes 4 data elements per loop), or from the initial Scalar version (processes just the one). n still contains the number of data elements remaining to be processed, but t0 contains the copy of VL – the number that is
– usually using a bit-mask, data is linearly compressed or expanded (redistributed) based on whether bits in the mask are set or clear, whilst always preserving the sequential order and never duplicating values (unlike gather/scatter, aka permute). These instructions feature in
The STAR-like code remains concise, but because the STAR-100's vectorisation was by design based around memory accesses, an extra slot of memory is now required to process the information. Twice the latency is also incurred, due to the extra memory access requirement.
(SIMT). SIMT units run from a shared single broadcast synchronised Instruction Unit. The "vector registers" are very wide and the pipelines tend to be long. The "threading" part of SIMT involves the way data is handled independently on each of the compute units.
, almost qualifies as a vector processor. Predicated SIMD uses fixed-width SIMD ALUs but allows locally controlled (predicated) activation of units to provide the appearance of variable-length vectors. Examples below help explain these categorical distinctions.
Eight-wide SIMD requires repeating the inner loop algorithm first with four-wide SIMD elements, then two-wide SIMD, then one (scalar), with a test and branch in between each one, in order to cover the first and last remaining SIMD elements (0 <= n <= 7).
Even with a general loop (n not fixed), the only way to use 4-wide SIMD is to assume four separate "streams", each offset by four elements. Finally, the four partial results have to be summed. Other techniques involve shuffle: examples online can be found for
Aside from the size of the program and the complexity, an additional potential problem arises if floating-point computation is involved: the fact that the values are not being summed in strict order (four partial results) could result in rounding errors.
iterations of the loop the batches of vectorised memory reads are optimally aligned with the underlying caches and virtual memory arrangements. Additionally, the hardware may choose to use the opportunity to end any given loop iteration's memory reads
– aka "lane shuffling", which allows sub-vector inter-element computations without needing extra (costly, wasteful) instructions to move the sub-elements into the correct SIMD "lanes", and also saves predicate mask bits. Effectively this is an in-flight
on a page boundary (avoiding a costly second TLB lookup), with speculative execution preparing the next virtual memory page whilst data is still being processed in the current loop. All of this is determined by the hardware, not the program itself.
It is clear how predicated SIMD at least merits the term "vector capable", because it can cope with variable-length vectors by using predicate masks. The final evolving step to a "true" vector ISA, however, is to not have any evidence in the ISA
prefix. However, only very simple calculations can be done effectively in hardware this way without a very large cost increase. Since all operands have to be in memory for the STAR-100 architecture, the latency caused by access became huge too.
number that can be processed by the hardware in subsequent vector instructions, and sets the internal special register, "VL", to that same amount. ARM refers to this technique as "vector length agnostic" programming in its tutorials on SVE2.
machine with 256 ALUs, but, when it was finally delivered in 1972, it had only 64 ALUs and could reach only 100 to 150 MFLOPS. Nevertheless, it showed that the basic concept was sound, and, when used on data-intensive applications, such as
Not only is it a much more compact program (saving on L1 Cache size), but as previously mentioned, the vector version can issue far more data processing to the ALUs, again saving power because
Instruction Decode and Issue can sit idle.
Compared to any SIMD processor claiming to be a vector processor, the order of magnitude reduction in program size is almost shocking. However, this level of elegance at the ISA level has quite a high price tag at the hardware level:
SIMD instruction sets lack crucial features when compared to vector instruction sets. The most important of these is that vector processors, inherently by definition and design, have always been variable-length since their inception.
Where with predicated SIMD the mask bitlength is limited to that which may be held in a scalar (or special mask) register, vector ISAs' mask registers have no such limitation. Cray-1 vectors could be just over 1,000 elements (in
Vector processors on the other hand are designed to issue computations of variable length for an arbitrary count, n, and thus require very little setup, and no cleanup. Even compared to those SIMD ISAs which have masks (but no
, as the supercomputers themselves were, in general, found in places such as weather prediction centers and physics labs, where huge amounts of data are "crunched". However, as shown above and demonstrated by RISC-V RVV the
2 module within a card that physically resembles a graphics coprocessor, but instead of serving as a co-processor, it is the main computer with the PC-compatible computer into which it is plugged serving support functions.
The simplicity of the algorithm is stark in comparison to SIMD. Again, just as with the IAXPY example, the algorithm is length-agnostic (even on
Embedded implementations where maximum vector length could be only one).
Instead of constantly having to decode instructions and then fetch the data needed to complete them, the processor reads a single instruction from memory, and it is simply implied in the definition of the instruction
, the ability to run MULTIPLY simultaneously with ADD), may complete the four operations faster than a SIMD processor with 1-wide LOAD, 1-wide STORE, and 2-wide SIMD. This more efficient resource utilization, due to
(DAP) design, categorising the ILLIAC and DAP as cellular array processors that potentially offered substantial performance benefits over conventional vector processor designs such as the CDC STAR-100 and Cray-1.
IV is also capable of this hybrid approach: nominally stating that its SIMD QPU Engine supports 16-long FP array operations in its instructions, it actually does them 4 at a time, as (another) form of "threads".
This example again highlights a key critical fundamental difference between true vector processors and those SIMD processors, including most commercial GPUs, which are inspired by features of vector processors.
Introduced in ARM SVE2 and RISC-V RVV is the concept of speculative sequential Vector Loads. ARM SVE2 has a special register named "First Fault
Register", whereas RVV modifies (truncates) the Vector Length (VL).
Modern SIMD computers claim to improve on early Cray by directly using multiple ALUs, for a higher degree of parallelism compared to only using the normal scalar pipeline. Modern vector processors (such as the
data caches. Thus although all cores simultaneously execute the exact same instruction in lock-step with each other they do so with completely different data from completely different memory locations. This is
) processing, and it is these which somewhat deserve the nomenclature "vector processor" or at least deserve the claim of being capable of "vector processing". SIMD processors without per-element predication (
This is where the problems start. SIMD by design is incapable of doing arithmetic operations "inter-element". Element 0 of one SIMD register may be added to
Element 0 of another register, but Element 0 may
may use fewer vector units than the width implies: instead of having 64 units for a 64-number-wide register, the hardware might instead do a pipelined loop over 16 units for a hybrid approach. The
Broadcom
find out in advance (in each inner loop, every time) what might optimally succeed, those instructions only serve to hinder performance because they would, by necessity, be part of the critical inner loop.
This example starts with an algorithm which involves reduction. Just as with the previous example, it will be first shown in scalar instructions, then SIMD, and finally vector instructions, starting in
, the ILLIAC was the fastest machine in the world. The ILLIAC approach of using separate ALUs for each data element is not common to later designs, and is often referred to under a separate category,
Without predication, the wider the SIMD width the worse the problems get, leading to massive opcode proliferation, degraded performance, extra power consumption and unnecessary software complexity.
adding up many numbers in a row. The more complex instructions also add to the complexity of the decoders, which might slow down the decoding of the more common instructions such as normal adding. (
Saleil, Baptiste; Schmidt, Bill; Srinivasaraghavan, Rajalakshmi; Srivatsan, Shricharan; Thompto, Brian; Wagner, Andreas; Wu, Nelson (2021). "A matrix math facility for Power ISA(TM) processors".
field, but unlike the STAR-100 which uses memory for its repeats, the
Videocore IV repeats are on all operations including arithmetic vector operations. The repeat length can be a small range of
. The Cray-1 normally had a performance of about 80 MFLOPS, but with up to three chains running it could peak at 240 MFLOPS and averaged around 150 – far faster than any machine of the era.
operations as well as short vectors for common operations (RGB, ARGB, XYZ, XYZW) support for the following is typically present in modern GPUs, in addition to those found in vector processors:
On calling setvl with the number of outstanding data elements to be processed, "setvl" is permitted (essentially required) to limit that to the
Maximum Vector Length (MVL) and thus returns the
not make the mistake of assuming a fixed vector width: consequently MVL is not a quantity that the programmer needs to know. This can be a little disconcerting after years of SIMD mindset).
This is very straightforward. "y" starts at zero, 32-bit integers are loaded one at a time into r1, added to y, and the address of the array "x" is moved on to the next element in the array.
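The loop just described might be sketched like this (hypothetical scalar mnemonics):

```
  set    y, 0             ; initialise the running total
loop:
  load32 r1, x            ; load one 32-bit element of x
  add    y, y, r1         ; y := y + r1
  addl   x, x, $4         ; move to the next element
  subl   n, n, $1         ; one fewer element remaining
  jgz    n, loop          ; repeat while n > 0
  ret    y
```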
Additionally, the number of elements going into the function can start at zero. This sets the vector length to zero, which effectively disables all vector instructions, turning them into
Below is the Cray-style vector assembler for the same SIMD style loop, above. Note that t0 (which, containing a convenient copy of VL, can vary) is used instead of hard-coded constants:
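A sketch of that loop, loosely modelled on RVV-style assembler (mnemonics illustrative):

```
vloop:
  setvl   t0, n           ; VL = t0 = min(MVL, n)
  vld32   v0, x           ; load VL elements of x
  vld32   v1, y           ; load VL elements of y
  vmadd32 v1, v0, a       ; v1 := a * v0 + v1
  vst32   v1, y           ; store VL results back into y
  addl    x, x, t0*4      ; advance x by t0 * 4 bytes (32-bit data)
  addl    y, y, t0*4      ; advance y likewise
  subl    n, n, t0        ; t0 elements processed this time round
  bnez    n, vloop        ; repeat while any elements remain
  ret     y
```

Note that the same loop runs unchanged whether the hardware's MVL is 4, 64 or 10,000.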
amount loaded to either the amount that would succeed without raising a memory fault or simply to an amount (greater than zero) that is most convenient. The important factor is that
Over time as the ISA evolves to keep increasing performance, it results in ISA Architects adding 2-wide SIMD, then 4-wide SIMD, then 8-wide and upwards. It can therefore be seen why
Assuming a hypothetical predicated (mask capable) SIMD ISA, and again assuming that the SIMD instructions can cope with misaligned data, the instruction loop would look like this:
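One possible shape for that loop is sketched below; the "minl" and "genmask" instructions are hypothetical stand-ins for whatever the real ISA provides:

```
vloop:
  minl      t0, n, $4        ; t0 = min(n, 4)
  genmask   m0, t0           ; set the first t0 predicate bits (hypothetical)
  load32x4  v1, x, m0        ; load only the unmasked elements of x
  load32x4  v2, y, m0        ; likewise for y
  mul32x4   v1, a, v1, m0    ; v1 := a * x where the mask bit is set
  add32x4   v3, v1, v2, m0   ; v3 := a*x + y where the mask bit is set
  store32x4 v3, y, m0        ; store only the unmasked results
  addl      x, x, t0*4       ; advance by the elements actually processed
  addl      y, y, t0*4
  subl      n, n, t0
  jgz       n, vloop         ; repeat while n > 0
  ret       y
```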
Additionally, vector processors can be more resource-efficient by using slower hardware and saving power, but still achieving throughput and having less latency than SIMD, through
Vector processors were traditionally designed to work best only when there are large amounts of data to be worked on. For this reason, these sorts of CPUs were found primarily in
instruction), vector processors produce much more compact code because they do not need to perform explicit mask calculation to cover the last few elements (illustrated below).
Not all problems can be attacked with this sort of solution. Including these types of instructions necessarily adds complexity to the core CPU. That complexity typically makes
amount (the number of hardware "lanes") is termed "MVL" (Maximum Vector Length). Note that, as seen in SX-Aurora and
Videocore IV, MVL may be an actual hardware lane quantity
performed, respectively. One additional potential complication: some RISC ISAs do not have a "min" instruction, needing instead to use a branch or scalar predicated compare.
instructions are notified or may determine exactly how many Loads actually succeeded, using that quantity to only carry out work on the data that has actually been loaded.
Unfortunately for SIMD, the clue was in the assumption above, "that n is a multiple of 4" as well as "aligned access", which, clearly, is a limited specialist use-case.
that the instruction will operate again on another item of data, at an address one increment larger than the last. This allows for significant savings in decoding time.
The self-repeating instructions are found in early vector computers like the STAR-100, where the above action would be described in a single instruction (somewhat like
With the length (equivalent to SIMD width) not being hard-coded into the instruction, not only is the encoding more compact, it's also "future-proof" and allows even
Although vector supercomputers resembling the Cray-1 are less popular these days, NEC has continued to make this type of computer up to the present day with their
in the SIMD width (load32x4 etc.) the vector ISA equivalents have no such limit. This makes vector programs portable, vendor-independent, and future-proof.
either gains the ability to perform loops itself, or exposes some sort of vector control (status) register to the programmer, usually known as a vector length.
Note that both x and y pointers are incremented by 16, because that is how long (in bytes) four 32-bit integers are. The decision was made that the algorithm
API, although the internal details are not available. The most resource-efficient technique is in-place reordering of access to otherwise linear vector data.
machine, but it sold poorly and they took that as an opportunity to leave the supercomputing field entirely. In the early and mid-1980s Japanese companies (
) that were specifically designed from the ground up to handle large vectors (arrays). For SIMD instructions present in some general-purpose computers, see
containing multiple members. The members are extracted from the data structure (element), and each extracted member is placed into a different vector register.
In each iteration, every element of y has an element of x multiplied by a and added to it. The program is expressed in scalar linear form for readability.
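That is, for each i, y[i] := a * x[i] + y[i]. A scalar pseudo-assembly rendering, one element per iteration (mnemonics illustrative):

```
loop:
  load32  r1, x           ; load one 32-bit element of x
  load32  r2, y           ; load one 32-bit element of y
  mul32   r1, a, r1       ; r1 := a * x[i]
  add32   r3, r1, r2      ; r3 := a * x[i] + y[i]
  store32 r3, y           ; write the result back into y
  addl    x, x, $4        ; advance both pointers by one 32-bit element
  addl    y, y, $4
  subl    n, n, $1        ; one element processed
  jgz     n, loop         ; loop back while n > 0
```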
– a very simple and strategically useful instruction which drops sequentially-incrementing immediates into successive elements. Usually starts from zero.
. The difference is illustrated below with examples, showing and comparing the three categories: Pure SIMD, Predicated SIMD, and Pure Vector Processing.
Additionally, in more modern vector processor ISAs, "Fail on First" or "Fault First" has been introduced (see below) which brings even more advantages.
many elements to load, the first part of a strncpy, if beginning initially on a sub-optimal memory boundary, may return just enough loads such that on
, but the CPU can process an entire batch of operations, in an overlapping fashion, much faster and more efficiently than if it did so one at a time.
The code itself is also smaller, which can lead to more efficient memory use, a reduction in L1 instruction cache size, and a reduction in power consumption.
, is a key advantage and difference compared to SIMD. SIMD, by design and definition, cannot perform chaining except to the entire group of results.
– useful for interaction between scalar and vector, these broadcast a single value across a vector, or extract one item from a vector, respectively.
The code when n is larger than the maximum vector length is not that much more complex, and is a similar pattern to the first example ("IAXPY").
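A sketch of that strip-mined loop (illustrative mnemonics; "vredadd32" is a hypothetical reduce-add that accumulates into the scalar y):

```
  set       y, 0          ; initialise the running total
vloop:
  setvl     t0, n         ; VL = t0 = min(MVL, n)
  vld32     v0, x         ; load VL elements of x
  vredadd32 y, v0, y      ; y := y + (sum of the VL elements just loaded)
  addl      x, x, t0*4    ; advance x by the elements consumed
  subl      n, n, t0
  bnez      n, vloop      ; repeat until every element has been summed
  ret       y
```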
than another Element 0. This places some severe limitations on potential implementations. For simplicity it can be assumed that n is exactly 8:
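One workable arrangement (illustrative mnemonics; "vextract" is a hypothetical lane-extract instruction) is two 4-wide loads, a vertical add to form four partial sums, and then a horizontal clean-up that has to fall back to scalar code:

```
  addl     r3, x, $16     ; r3 points at the second group of four
  load32x4 v1, x          ; first four elements of x
  load32x4 v2, r3         ; second four elements of x
  add32x4  v1, v2, v1     ; four partial sums, computed lane by lane
  ; SIMD cannot add v1's lanes to each other, so the four partial
  ; results must be reduced by shuffles or, as here, scalar extracts:
  vextract r1, v1, $0
  vextract r2, v1, $1
  add      r1, r1, r2
  vextract r2, v1, $2
  add      r1, r1, r2
  vextract r2, v1, $3
  add      r1, r1, r2     ; r1 now holds the sum of all eight elements
```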
of the sub-vector, heavily features in 3D Shader binaries, and is sufficiently important as to be part of the Vulkan SPIR-V spec. The Broadcom
– including vectorised versions of bit-level permutation operations, bitfield insert and extract, centrifuge operations, population count, and
or decimal fixed-point, and support for much larger (arbitrary precision) arithmetic operations by supporting parallel carry-in and carry-out
perform the aligned SIMD loop at the maximum SIMD width up until the last few elements (those remaining that do not fit the fixed SIMD width)
processing rather than better implementations of vector processors. However, recognising the benefits of vector processing, IBM developed
to be processed in each iteration. t0 is subtracted from n after each iteration, and if n is zero then all elements have been processed.
Throughout, Cray continued to be the performance leader, continually beating the competition with a series of machines that led to the
allow parallel if/then/else constructs without resorting to branches. This allows code with conditional statements to be vectorized.
the MULTIPLYs, and likewise must complete all of the MULTIPLYs before it can start the STOREs. This is by definition and by design.
) which provides comprehensive individual element-level predicate masks on every vector instruction, as is now available in ARM SVE2. And
(ALUs), one per cycle, but with a different data point for each one to work on. This allowed the Solomon machine to apply a single
The basic principle of ffirst is to attempt a large sequential Vector Load, but to allow the hardware to arbitrarily truncate the
, at runtime. Thus, unlike non-predicated SIMD, even when there are no elements to process there is still no wasted cleanup code.
), can do most of the operation in batches. The code is mostly similar to the scalar version. It is assumed that both x and y are
Other CPU designs include multiple instructions for vector processing on multiple (vectorized) data sets, typically known as
, whose instructions operate on single data items only, and in contrast to some of those same scalar processors having additional
is crucial to performance. This ratio depends on factors such as the efficiency of compilation and the adjacency of the elements in memory.
only three address translations are needed. Depending on the architecture, this can represent a significant saving by itself.
on a vector (for example, find the one maximum value of an entire vector, or sum all elements). Iteration is of the form
to the ISA. If it is assumed that n is less than or equal to the maximum vector length, only three instructions are required:
(NEC) introduced register-based vector machines similar to the Cray-1, typically being slightly faster and much smaller.
. Even compared to the predicate-capable SIMD, it is still more compact, clearer, more elegant and uses less resources.
While many SIMD ISAs borrow from or are inspired by the list below, the typical features of a vector processor are:
follows principles similar to those of the early vector processors, and is being implemented in commercial products such as the
The scalar version of this would load one of each of x and y, process one calculation, store one result, and loop:
variants of the standard vector load and stores. Segment loads read a vector from memory, where each element is a
Another saving is fetching and decoding the instruction itself, which has to be done only once instead of ten times.
Cray-style vector ISAs take this a step further and provide a global "count" register, called vector length (VL):
in Flynn's Taxonomy. Common examples using SIMD with features inspired by vector processors include: Intel x86's
is constantly in use. Any particular instruction takes the same amount of time to complete, a time known as the
designs to consider using vectors purely to gain all the other advantages, rather than go for high performance.
Memory Load/Store modes, Gather/scatter vector operations act on the vector registers, and are often termed a
, and can be considered vector processors (using a similar strategy for hiding memory latencies). As shown in
can. If it does not, a "splat" (broadcast) must be used, to copy the scalar argument across a SIMD register:
corresponding 2X or 4X performance gain. Memory bandwidth was sufficient to support these expanded modes.
completed far faster overall, the limiting factor being the time required to fetch the data from memory.
In order to reduce the amount of time consumed by these steps, most modern CPUs use a technique known as
(SWAR) Arithmetic Units. Vector processors can greatly improve performance on certain workloads, notably
An algorithm of hardware unit generation for processor core synthesis with packed SIMD type instructions
project. Solomon's goal was to dramatically increase math performance by using a large number of simple
increase in instruction count! This can easily be demonstrated by compiling the iaxpy example for
Modern GPUs, which have many small compute units each with their own independent SIMD ALUs, use
power usage. The concept of reducing accuracy where it is simply not needed is explored in the
have a cleanup phase which, like the preparatory section, is just as large and just as complex.
. Instead of leaving the data in memory like the STAR-100 and ASC, the Cray design had eight
which has performed 10 sequential operations: effectively the loop count is on an explicit
and uses no SIMD ALUs, only having 1-wide 64-bit LOAD, 1-wide 64-bit STORE (and, as in the
for use in supercomputers coupling several scalar processors to act as a vector processor.
These stark differences are what distinguish a vector processor from one that has SIMD.
computing. Around this time Flynn categorized this type of processing as an early form of
A number of things to note, when comparing against the Predicated SIMD assembly variant:
principles: RVV only adds around 190 vector instructions even with the advanced features.
to cope with iteration and reduction. This is illustrated further with examples, below.
of vector ISAs brings other benefits which are compelling even for Embedded use-cases.
Thus it can be seen, very clearly, how vector ISAs reduce the number of instructions.
definition, the addition of SIMD cannot, by itself, qualify a processor as an actual
90% of the work is done by the vector unit. The achievable speedup is then:
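With r the vector speed ratio and f the vectorised fraction, the overall speedup is r / ((1 − f)·r + f), a form of Amdahl's law. A small sketch makes the limiting behaviour visible:

```python
# Achievable speedup of a partly-vectorised workload (a form of Amdahl's law):
# r = vector speed ratio, f = fraction of the work done by the vector unit.

def speedup(r, f):
    return r / ((1 - f) * r + f)

print(round(speedup(10, 0.9), 2))     # 5.26: even r=10 yields barely 5x
print(round(speedup(10**9, 0.9), 2))  # 10.0: as r grows, the limit is 1/(1-f)
```

Even with f = 0.9 the speedup can never exceed 1/(1 − f) = 10, no matter how fast the vector unit is.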
IV uses the terminology "Lane rotate" where the rest of the industry uses the term
Where the SIMD variant hard-coded both the width (4) into the creation of the mask
only cope with 4-wide SIMD, therefore the constant is hard-coded into the program.
. Asia-Pacific Conference on Circuits and Systems. Vol. 1. pp. 171–176.
of dedicated SIMD registers before the last scalar computation can be performed.
hardware cannot do misaligned SIMD memory accesses, a real-world algorithm will:
In 1962, Westinghouse cancelled the project, but the effort was restarted by the
"A Modular Massively Parallel Processor for Volumetric Visualisation Processing"
Interestingly, though, Broadcom included space in all vector operations of the
Flynn's taxonomy § Single instruction stream, multiple data streams (SIMD)
- Fourth SIMD ADD: element 3 of first group added to element 3 of second group
- Second SIMD ADD: element 1 of first group added to element 1 of second group
Several modern CPU architectures are being designed as vector processors. The
sought to avoid many of the difficulties with the ILLIAC concept with its own
"Assembly - Fastest way to do horizontal SSE vector sum (Or other reduction)"
- Third SIMD ADD: element 2 of first group added to element 2 of second group
- First SIMD ADD: element 0 of first group added to element 0 of second group
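One common way to finish such a horizontal sum is a log-step fold, sketched here in Python (illustrative only, not any specific ISA's instruction):

```python
# Sketch of a SIMD "horizontal sum": the vector is repeatedly folded in half,
# adding the upper half onto the lower half, so 8 elements need only
# log2(8) = 3 SIMD ADDs instead of 7 sequential scalar additions.

def horizontal_sum(v):
    assert len(v) % 2 == 0
    while len(v) > 1:
        half = len(v) // 2
        v = [v[i] + v[half + i] for i in range(half)]  # one SIMD ADD per fold
    return v[0]

print(horizontal_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # 36
```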
Computer Organization and Design: the Hardware/Software Interface page 751-2
strncpy routine in hand-optimised RVV assembler is a mere 22 instructions.
Here, an accumulator (y) is used to sum up all the values in the array, x.
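The accumulator pattern can be sketched directly in Python, mirroring the comments in the scalar assembly listing:

```python
# Scalar reduction: y sums all values of x, one element per loop iteration.

def reduce_sum(x):
    y = 0          # y initialised to zero
    for v in x:    # load one element of x
        y += v     # y := y + r1
    return y       # returns result, y

print(reduce_sum([3, 1, 4, 1, 5]))  # 14
```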
. Two notable examples which have per-element (lane-based) predication are
But more than that, a high performance vector processor may have multiple
The time taken would be basically the same as a vector implementation of
Note the complete lack of looping in the instructions, because it is the
The STAR-100 was otherwise slower than CDC's own supercomputers like the
The scalar version of this would load each of x, add it to y, and loop:
With the program size reduced, branch prediction has an easier job.
Miyaoka, Y.; Choi, J.; Togawa, N.; Yanagisawa, M.; Ohtsuki, T. (2002).
feature much more prominently in 3D than in many demanding
) addressing mode. Advanced architectures may also include support for
; assume a, b, and c are memory locations in their respective registers
"IBM's POWER10 Processor - William Starke &amp; Brian W. Thompto, IBM"
"Andes Announces RISC-V Multicore 1024-bit Vector Processor: AX45MPV"
The vector technique was first fully exploited in 1976 by the famous
designs led to a decline in vector supercomputers during the 1990s.
But to a vector processor, this task looks considerably different:
places the processor and either 24 or 48 gigabytes of memory on an
Computer processor which works on arrays of several numbers at once
High Performance Computing for Computer Graphics and Visualisation
Whilst from the reduction example it can be seen that, aside from
A modern packed SIMD architecture, known by many names (listed in
the size of the code, in fact in extreme cases it results in an
. Since then, the supercomputer market has focused much more on
and similar tasks. Vector processing techniques also operate in
successful, regardless of alignment of the start of the vector.
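A toy model of the fault-first behaviour (illustrative only; the dict-as-memory stand-in and the function name are assumptions, not any real ISA's semantics):

```python
# Illustrative model of a fault-first ("ffirst") vector load: the hardware
# attempts a full-width load but may truncate the vector length at the first
# element that would fault; software then processes the elements that did load.
# A dict stands in for mapped memory here; missing addresses are "faults".

def ffirst_load(memory, base, vl):
    loaded = []
    for i in range(vl):
        if base + i not in memory:   # first faulting element truncates VL
            break
        loaded.append(memory[base + i])
    return loaded, len(loaded)       # data plus the new (possibly shorter) VL

mem = {0: 10, 1: 11, 2: 12}          # address 3 onwards would fault
data, vl = ffirst_load(mem, 0, 8)
print(data, vl)  # [10, 11, 12] 3
```

The key point is that the load never traps: software sees a shortened VL and simply loops again from the new position.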
has historically become a large impediment to performance; see
The history of computer technology in their faces (in Russian)
Vector processing development began in the early 1960s at the
So, even if the performance of the vector unit is very high (
Vector instruction sets have arithmetic reduction operations
302:(ASC), which were introduced in 1974 and 1972, respectively.
152:(CPU). The CPU fed a single common instruction to all of the
design through the 1970s into the 1990s, notably the various
are designed to operate efficiently and effectively on large
simplified software and complex hardware (vector processors)
This can be somewhat mitigated by keeping the entire ISA to
807:; add 10 numbers in a to 10 numbers in b, store results in c
of a SIMD width, leaving that entirely up to the hardware.
instruction in NEC SX, without restricting the length to a
2331:(Note: As mentioned in the ARM SVE2 Tutorial, programmers
vector processor architectures being developed, including
Vector machines appeared in the early 1970s and dominated
30:"Array processor" redirects here. Not to be confused with
"Riscv-v-spec/V-spec.adoc at master · riscv/Riscv-v-spec"
"Riscv-v-spec/V-spec.adoc at master · riscv/Riscv-v-spec"
"Riscv-v-spec/V-spec.adoc at master · riscv/Riscv-v-spec"
, an open ISA standard with an associated variable width
(Multiple Instruction, Multiple Data) and realized with
. Their version of the design originally called for a 1
SIMD, because it uses fixed-width batch processing, is
into each of the ALU subunits, a technique they called
) are capable of this kind of selective, per-element (
64-bit ALUs. As shown in the diagram, which assumes a
"Sse - 1-to-4 broadcast and 4-to-1 reduce in AVX-512"
There are several savings inherent in this approach.
tried to re-enter the high-end market again with its
processor module with four scalar/vector processors
"Vector Engine Assembly Language Reference Manual"
VLIW/vector processor combines both technologies.
IV and other external vector processors like the
(Explicitly Parallel Instruction Computing). The
"Coding for Neon - Part 3 Matrix Multiplication"
was presented and developed by Kartsev in 1967.
(Press release). GlobeNewswire. 7 December 2022
complex software and simplified hardware (SIMD)
A vector processor, by contrast, even if it is
PATCH to libc6 to add optimised POWER9 strncpy
Overall then there is a choice to either have
At this point four adds have been performed:
and there would still be no SIMD cleanup code
or sourced from one of the scalar registers.
instructions run slower—i.e., whenever it is
; again assume we have vector registers v1-v3
Difference between SIMD and vector processors
that is automatically applied to the vectors
integer variant of the "DAXPY" function, in
a way to set the vector length, such as the
This article is about Processors (including
Computer performance by orders of magnitude
169:University of Illinois at Urbana–Champaign
3986:(2nd ed.). Morgan Kaufmann. p.
RVV register gather-scatter instructions
Vector and SIMD processors, slides 12-13
# now do the operation, masked by m bits
# prepare mask. few ISAs have min though
(FPS) built add-on array processors for
The first vector supercomputers are the
In addition, GPUs such as the Broadcom
; assume we have vector registers v1-v3
Random-access memory § Memory wall
or to a multiple of a fixed data width.
Abandoned US patent US20110227920-0096
Videocore IV QPU analysis by Jeff Bush
Array vs Vector Processing, slides 5-7
Computer for operations with functions
Register Gather, Scatter (aka permute)
; with size larger than or equal to 10
Iteration and reduction over elements
computer for operations with functions
Computer for operations with functions
but with 4-wide SIMD being incapable
instruction has embedded within it a
, which face exactly the same issue.
under the control of a single master
Floating-point operations per second
Comparison with modern architectures
Single instruction, multiple threads
single instruction, multiple threads
Single Instruction Multiple Threads
Single Instruction Multiple Threads
; with size equal or larger than 10
left the CPU, in the fashion of an
Parkinson, Dennis (17 June 1976).
Pure (non-predicated, packed) SIMD
Modern graphics processing units (
# update x, y and n for next loop
; loop back if count is not yet 0
instruction in RISCV RVV, or the
(Very Long Instruction Word) and
of computers. Most recently, the
Westinghouse Electric Corporation
platforms. The rapid fall in the
single instruction, multiple data
Semiconductor device fabrication
from the original on 2021-12-11.
Videocore IV Programmer's Manual
SIMD vs Vector GPU, slides 22-24
, which suggests that the ratio
AX45MPV. There are also several
History of general-purpose CPUs
Nondeterministic Turing machine
Analysis of parallel algorithms
"Documentation – Arm Developer"
) there is a speedup less than
where Reduction is of the form
creates a hidden predicate mask
- also known as "Packed SIMD",
more complex and involved than
International Computers Limited
, fed in the form of an array.
Deterministic finite automaton
on pipelined vector processors
be the vector speed ratio and
GPU vector processing features
of how to do "Horizontal Sum"
). They are also found in the
Early research and development
Simultaneous and heterogenous
Simultaneous and heterogenous
MIAOW Vertical Research Group
; Assume tmp is pre-allocated
introduced the idea of using
. Although memory-based, the
- these include the original
Integrated memory controller
Translation lookaside buffer
Memory dependence prediction
Random-access stored program
Probabilistic Turing machine
Category: Parallel computing
ARM SVE2 paper by N. Stevens
"CUDA C++ Programming Guide"
Krikelis, Anargyros (1996).
Chaining (vector processing)
arithmetic, but can include
was also a vector processor.
Advanced Scientific Computer
computational fluid dynamics
Synaptic updates per second
10.1007/978-1-4471-1011-8_8
10.1109/APCCAS.2002.1114930
"Computers by the thousand"
; Hypothetical RISC machine
multi-issue execution model
collaborated to create the
Virtual Vector Architecture
, later building their own
Nippon Electric Corporation
Distributed Array Processor
Heterogeneous architecture
Orthogonal instruction set
Alternating Turing machine
Quantum cellular automaton
High-performance computing
Supercomputer architecture
– operations that perform
. Not to be confused with
"-O3 -march=knl"
Vector instruction example
price-to-performance ratio
Microprocessor chronology
Dynamic frequency scaling
Cache performance metrics
Automatic parallelization
Application checkpointing
History of supercomputing
{\displaystyle r=\infty }
Vector processor features
# Set vector length VL=10
Predicated SIMD (part of
Other examples followed.
. This is in contrast to
(CPU) that implements an
Hardware security module
Digital signal processor
Graphics processing unit
Graphics processing unit
Introduction to ARM SVE2
B.N. Malinovsky (1995).
Performance and speed up
Vector reduction example
) categorically do not.
Control Data Corporation
Control Data Corporation
Central processing unit
Dynamic voltage scaling
Memory address register
Branch target predictor
Address generation unit
Physics processing unit
Central processing unit
Transactions per second
Instructions per second
Array processing (SIMT)
Stored-program computer
Embarrassingly parallel
Deterministic algorithm
SIMD considered harmful
Automatic vectorization
{\displaystyle 1/(1-f)}
Matrix Multiply support
# repeat if n != 0
; loop back if n > 0
; y initialised to zero
Setting VL effectively
# repeat if n != 0
; loop back if n > 0
RISC-V vector extension
which may be driven by
Central processing unit
central processing unit
Hardwired control unit
Memory management unit
Memory management unit
Secure cryptoprocessor
Tensor Processing Unit
Vision processing unit
Cycles per instruction
Instructions per cycle
Associative processing
Instruction pipelining
Processor technologies
Associative processing
Non-blocking algorithm
Clustered multi-thread
Tensor Processing Unit
Insights from examples
Pure (true) vector ISA
instruction pipelining
associative processing
SIMD within a register
) include an array of
Floating Point Systems
arithmetic logic units
SIMD within a register
one-dimensional arrays
Vector supercomputers
Sum-addressed decoder
Arithmetic logic unit
Classic RISC pipeline
Epiphany architecture
Motorola 68000 series
Hardware acceleration
Superscalar processor
Dataflow architecture
Distributed computing
RVV fault-first loads
"Vector Architecture"
Fault (or Fail) First
applications needing
Advanced Math formats
load and stores, and
Vector Load and Store
be added to anything
; load one 32bit data
; go back if n > 0
; m = (1<<t0)-1
; go back if n > 0
; load one 32bit data
NEC SX-Aurora TSUBASA
, which is also SIMD.
The Cray design used
graphics accelerators
Performance per watt
replacement policies
Package on a package
Performance per watt
Pipelined processing
Tomasulo's algorithm
Clipper architecture
Application-specific
Finite-state machine
Pipelined processing
Explicit parallelism
Implicit parallelism
Dataflow programming
. pp. 101–124.
. 11 September 2013.
binary-coded decimal
permute instructions
Vector ISA reduction
; x := x + t0*4
; v3 := v1 + v2
, using the options
; v3 := v1 + v2
; r3 := r1 + r2
architecture as the
; r3 := r1 + r2
- as categorised in
instructions, AMD's
pipeline parallelism
numerical simulation
Digital electronics
Instruction decoder
Floating-point unit
Soft microprocessor
System in a package
Reservation station
Transport-triggered
Parallel Extensions
Pipelined processor
RVV strncpy example
. 19 November 2022.
. 19 November 2022.
Patterson, David A.
operations such as
permute instruction
Compress and Expand
# advance x by VL*4
# VL=t0=min(MVL, n)
# reduce-add into y
# VL=t0=min(MVL, n)
; returns result, y
# advance x by VL*4
# advance y by VL*4
# VL=t0=min(MVL, n)
; v1 := v1 * a
; v1 := v1 * a
; r1 := r1 * a
processor registers
Vector instructions
Pipelined Processor
Parallel computing
Integrated circuit
Processor register
Baseband processor
Operand forwarding
Cellular automaton
Massively parallel
distributed shared
Cache invalidation
Instruction window
Manycore processor
Massively parallel
Parallel computing
Parallel computing
LMUL > 1 in RVV
. pp. 626–627
Sub-vector Swizzle
# add all x into y
; y := y + r1
; n := n - t0
order of magnitude
; x := x + 16
vadd c, a, b, $ 10
embedded processor
# 10 stores into c
; count := 10
, and vectors are
, because SIMD is
Recent development
Flynn's 1972 paper
massively parallel
minisupercomputers
massively parallel
video-game console
Instruction cache
Scratchpad memory
Network processor
Network on a chip
Ultra-low-voltage
Multi-chip module
Branch prediction
Register renaming
VISC architecture
Quantum computing
VISC architecture
Secondary storage
Microarchitecture
Register machines
Parallel slowdown
Stream processing
Karp–Flatt metric
978-3-540-76016-0
SX-Arora Overview
ARM SVE2 tutorial
Hennessy, John L.
Stream processing
Duncan's taxonomy
– often includes
Splat and Extract
Masked Operations
; n := n - 1
; x := x + 4
; m = 1<<t0
; n := n - 4
described above.
; n := n - 1
; x := x + 4
SX-Aurora TSUBASA
# 10 loads from b
# 10 loads from a
Duncan's taxonomy
Pure (fixed) SIMD
SX-Aurora TSUBASA
Texas Instruments
scalar processors
Processor design
Power management
Instruction unit
Branch predictor
System on a chip
Transistor count
Flynn's taxonomy
Addressing modes
Memory hierarchy
Hypercomputation
Abstract machine
Software lockout
Computer cluster
Vector processor
Array processing
Flynn's taxonomy
Memory coherence
Computer network
Videocore IV QPU
. 27 April 2020.
Barrel processor
vector extension
Bit manipulation
; for 2nd 4 of x
Scalar assembler
or a virtual one
; t0 = min(n, 4)
properly aligned
Flynn's taxonomy
Scalar assembler
functional units
702:unable by design
691:Flynn's taxonomy
673:
669:
588:- also known as
505:vector processor
480:Andes Technology
447:shader pipelines
322:vector registers
281:
274:
270:
267:
261:
230:
222:
119:of conventional
100:hardware and in
54:vector processor
32:array processing
21:
7118:
7117:
7113:
7112:
7111:
7109:
7108:
7107:
7078:
7077:
7076:
7071:
7057:Tick–tock model
7015:
6971:
6960:
6900:
6884:Address decoder
6838:
6792:
6788:Program counter
6763:Status register
6744:
6699:
6659:Load–store unit
6626:
6619:
6546:
6515:
6416:
6373:Image processor
6348:
6341:
6311:
6305:
6281:Microcontroller
6271:Embedded system
6259:
6159:
6092:
6081:
6019:
5969:
5867:
5844:
5828:Re-order buffer
5799:
5780:Data dependency
5766:
5725:
5555:
5549:
5448:
5447:Instruction set
5441:
5427:Multiprocessing
5395:Cache hierarchy
5388:Register/memory
5312:
5212:Queue automaton
5168:
5163:
5133:
5128:
5109:
5053:
4959:Coarray Fortran
4915:
4899:Beowulf cluster
4755:
4705:
4696:Synchronization
4681:Cache coherence
4671:Multiprocessing
4659:
4623:
4604:Cost efficiency
4599:Gustafson's law
4567:
4511:
4460:
4436:Multiprocessing
4426:Cloud computing
4399:
4394:
4363:
4361:
4360:
4353:
4349:
4342:
4338:
4331:
4327:
4320:
4316:
4309:
4305:
4298:
4294:
4287:
4283:
4276:
4272:
4263:
4262:
4258:
4251:
4226:
4225:
4221:
4204:
4203:
4199:
4180:
4179:
4175:
4168:
4164:
4157:
4153:
4146:
4142:
4135:
4131:
4116:
4115:
4111:
4102:
4101:
4097:
4088:
4087:
4083:
4076:
4072:
4065:
4061:
4052:
4051:
4047:
4040:
4036:
4029:
4025:
4010:
4009:
4005:
3998:
3970:
3969:
3965:
3958:
3954:
3947:
3943:
3936:
3932:
3923:
3922:
3918:
3909:
3908:
3904:
3899:. 16 June 2023.
3896:
3892:
3891:
3887:
3882:. 16 June 2023.
3872:
3871:
3867:
3839:
3838:
3834:
3824:
3822:
3817:
3816:
3812:
3805:
3801:
3794:
3790:
3783:
3770:
3769:
3765:
3755:
3753:
3743:
3742:
3738:
3733:
3659:SX architecture
3655:
3605:
3604:
3579:
3578:
3519:
3518:
3500:
3458:
3424:Transcendentals
3378:
3333:
3329:
3293:vector chaining
3273:predicate masks
3240:
3204:
3187:
3186:
3183:
3180:
3177:
3174:
3171:
3168:
3165:
3162:
3159:
3156:
3153:
3150:
3147:
3144:
3141:
3138:
3135:
3132:
3129:
3126:
3123:
3120:
3117:
3114:
3111:
3108:
3106:# load vector x
3105:
3102:
3099:
3096:
3093:
3090:
3087:
3084:
3081:
3078:
3075:
3072:
3069:
3066:
3063:
3057:
3056:
3053:
3050:
3047:
3044:
3041:
3039:# load vector x
3038:
3035:
3032:
3029:
3026:
3023:
3020:
3017:
3014:
3011:
3001:
2943:
2942:
2939:
2936:
2933:
2930:
2927:
2924:
2921:
2918:
2915:
2912:
2909:
2906:
2903:
2900:
2897:
2894:
2891:
2888:
2885:
2882:
2879:
2876:
2873:
2870:
2855:
2847:
2846:
2843:
2840:
2837:
2834:
2831:
2828:
2825:
2822:
2819:
2816:
2813:
2810:
2807:
2804:
2801:
2798:
2795:
2792:
2789:
2786:
2783:
2780:
2777:
2774:
2771:
2768:
2765:
2762:
2759:
2756:
2753:
2750:
2747:
2744:
2741:
2738:
2735:
2732:
2729:
2726:
2723:
2717:
2709:
2708:
2705:
2702:
2699:
2696:
2693:
2690:
2687:
2684:
2681:
2678:
2675:
2672:
2669:
2666:
2663:
2660:
2657:
2654:
2651:
2648:
2645:
2642:
2639:
2636:
2633:
2630:
2627:
2624:
2621:
2618:
2615:
2612:
2609:
2606:
2603:
2600:
2597:
2594:
2591:
2588:
2577:
2508:
2507:
2504:
2501:
2498:
2495:
2492:
2489:
2486:
2483:
2480:
2477:
2474:
2471:
2468:
2465:
2462:
2459:
2456:
2453:
2450:
2447:
2444:
2441:
2438:
2435:
2432:
2429:
2426:
2423:
2420:
2417:
2414:
2411:
2408:
2405:
2402:
2399:
2397:# load vector y
2396:
2393:
2390:
2387:
2384:
2382:# load vector x
2381:
2378:
2375:
2372:
2369:
2366:
2363:
2360:
2357:
2354:
2351:
2322:
2305:
2304:
2301:
2298:
2295:
2292:
2289:
2286:
2283:
2280:
2277:
2274:
2271:
2268:
2265:
2262:
2259:
2256:
2253:
2250:
2247:
2244:
2241:
2238:
2235:
2232:
2229:
2226:
2223:
2220:
2217:
2214:
2211:
2208:
2205:
2202:
2199:
2196:
2193:
2190:
2187:
2184:
2181:
2178:
2175:
2172:
2169:
2166:
2163:
2160:
2157:
2154:
2151:
2148:
2145:
2142:
2139:
2136:
2133:
2130:
2127:
2124:
2121:
2118:
2115:
2112:
2109:
2106:
2103:
2100:
2097:
2094:
2091:
2088:
2085:
2082:
2079:
2076:
2073:
2070:
2067:
2064:
2061:
2058:
2055:
2052:
2049:
2046:
2043:
2040:
2037:
2031:
2029:Predicated SIMD
2014:exists in x86.
1987:This more than
1955:
1954:
1951:
1948:
1945:
1942:
1939:
1936:
1933:
1930:
1927:
1924:
1921:
1918:
1915:
1912:
1909:
1906:
1903:
1900:
1897:
1894:
1891:
1888:
1885:
1882:
1879:
1876:
1873:
1870:
1867:
1864:
1861:
1858:
1855:
1852:
1849:
1846:
1843:
1840:
1837:
1834:
1831:
1828:
1825:
1822:
1819:
1816:
1813:
1810:
1807:
1804:
1801:
1798:
1795:
1792:
1782:
1781:
1778:
1775:
1772:
1769:
1766:
1751:
1746:
1745:
1742:
1739:
1736:
1733:
1730:
1727:
1724:
1721:
1718:
1715:
1712:
1709:
1706:
1703:
1700:
1697:
1694:
1691:
1688:
1685:
1678:
1677:
1674:
1671:
1668:
1665:
1662:
1659:
1656:
1653:
1650:
1647:
1644:
1641:
1638:
1635:
1632:
1629:
1626:
1623:
1620:
1617:
1614:
1611:
1608:
1605:
1602:
1599:
1596:
1593:
1590:
1587:
1584:
1581:
1578:
1575:
1572:
1569:
1566:
1563:
1560:
1557:
1554:
1551:
1548:
1545:
1542:
1539:
1536:
1533:
1530:
1527:
1524:
1521:
1518:
1515:
1512:
1506:
1498:
1497:
1494:
1491:
1488:
1485:
1482:
1479:
1476:
1473:
1470:
1467:
1464:
1461:
1458:
1455:
1452:
1449:
1446:
1443:
1440:
1437:
1434:
1431:
1428:
1425:
1422:
1419:
1416:
1413:
1410:
1407:
1404:
1401:
1398:
1395:
1392:
1389:
1386:
1383:
1380:
1377:
1366:
1254:
1244:
1174:
1173:
1170:
1167:
1164:
1161:
1158:
1155:
1152:
1149:
1146:
1143:
1140:
1137:
1134:
1131:
1128:
1125:
1122:
1119:
1116:
1113:
1110:
1107:
1104:
1101:
1098:
1095:
1092:
1089:
1080:per-instruction
1072:
1071:
1068:
1065:
1062:
1059:
1056:
1053:
1050:
1047:
1044:
1041:
1038:
1035:
1032:
1029:
1026:
1023:
1020:
1017:
1014:
1011:
1008:
1005:
1002:
999:
996:
993:
990:
987:
984:
981:
978:
975:
969:
968:
965:
962:
959:
956:
953:
950:
947:
944:
941:
938:
935:
932:
929:
926:
923:
920:
917:
914:
911:
908:
905:
902:
899:
896:
893:
890:
887:
884:
881:
878:
875:
872:
869:
866:
863:
860:
857:
854:
851:
848:
845:
842:
839:
836:
833:
830:
827:
824:
821:
818:
815:
812:
809:
806:
803:
800:
776:address decoder
751:
743:vector chaining
716:vector chaining
655:
615:Convex C-Series
586:Predicated SIMD
513:variable-length
500:
472:
451:compute kernels
439:
433:
335:vector chaining
282:
271:
265:
262:
247:
231:
220:
208:
134:
129:
76:of data called
66:instruction set
58:array processor
46:
35:
28:
23:
22:
15:
12:
11:
5:
7116:
7114:
7106:
7105:
7100:
7095:
7090:
7080:
7079:
7073:
7072:
7070:
7069:
7064:
7062:Pin grid array
7059:
7054:
7049:
7044:
7039:
7034:
7029:
7023:
7021:
7017:
7016:
7014:
7013:
7007:
7002:
6997:
6992:
6987:
6982:
6976:
6974:
6966:
6965:
6962:
6961:
6959:
6958:
6953:
6948:
6943:
6938:
6933:
6932:
6931:
6926:
6921:
6910:
6908:
6902:
6901:
6899:
6898:
6896:Barrel shifter
6893:
6892:
6891:
6886:
6879:Binary decoder
6876:
6875:
6874:
6864:
6859:
6854:
6848:
6846:
6840:
6839:
6837:
6836:
6831:
6823:
6818:
6813:
6808:
6802:
6800:
6794:
6793:
6791:
6790:
6785:
6780:
6775:
6770:
6768:Stack register
6765:
6760:
6754:
6752:
6746:
6745:
6743:
6742:
6741:
6740:
6735:
6725:
6720:
6715:
6709:
6707:
6701:
6700:
6698:
6697:
6692:
6691:
6690:
6679:
6674:
6669:
6668:
6667:
6661:
6650:
6644:
6638:
6631:
6629:
6618:
6617:
6612:
6607:
6602:
6597:
6596:
6595:
6590:
6585:
6580:
6575:
6570:
6560:
6554:
6552:
6548:
6547:
6545:
6544:
6539:
6534:
6529:
6523:
6521:
6517:
6516:
6514:
6513:
6512:
6511:
6501:
6496:
6491:
6486:
6481:
6476:
6471:
6466:
6461:
6456:
6451:
6446:
6441:
6436:
6430:
6428:
6422:
6421:
6418:
6417:
6415:
6414:
6409:
6404:
6399:
6393:
6387:
6381:
6375:
6370:
6364:
6362:AI accelerator
6359:
6353:
6351:
6343:
6342:
6340:
6339:
6333:
6328:
6325:Multiprocessor
6322:
6315:
6313:
6307:
6306:
6304:
6303:
6298:
6293:
6288:
6283:
6278:
6276:Microprocessor
6273:
6267:
6265:
6264:By application
6258:
6257:
6251:
6245:
6239:
6234:
6229:
6224:
6219:
6214:
6209:
6207:Tile processor
6204:
6199:
6194:
6189:
6188:
6187:
6176:
6169:
6167:
6161:
6160:
6158:
6157:
6152:
6147:
6141:
6135:
6129:
6123:
6117:
6116:
6115:
6103:
6097:
6095:
6087:
6086:
6083:
6082:
6080:
6079:
6078:
6077:
6067:
6062:
6061:
6060:
6055:
6050:
6045:
6035:
6029:
6027:
6021:
6020:
6018:
6017:
6012:
6007:
6002:
6001:
6000:
5995:
5993:Hyperthreading
5985:
5979:
5977:
5975:Multithreading
5971:
5970:
5968:
5967:
5962:
5957:
5956:
5955:
5945:
5944:
5943:
5938:
5928:
5927:
5926:
5921:
5911:
5906:
5905:
5904:
5899:
5888:
5886:
5879:
5873:
5872:
5869:
5868:
5866:
5865:
5860:
5854:
5852:
5846:
5845:
5843:
5842:
5837:
5832:
5831:
5830:
5825:
5815:
5809:
5807:
5801:
5800:
5798:
5797:
5792:
5787:
5782:
5776:
5774:
5768:
5767:
5765:
5764:
5759:
5754:
5752:Pipeline stall
5748:
5746:
5737:
5731:
5730:
5727:
5726:
5724:
5723:
5718:
5713:
5708:
5705:
5704:
5703:
5701:z/Architecture
5698:
5693:
5688:
5680:
5675:
5670:
5665:
5660:
5655:
5650:
5645:
5640:
5635:
5630:
5625:
5620:
5619:
5618:
5613:
5608:
5600:
5595:
5590:
5585:
5580:
5575:
5570:
5565:
5559:
5557:
5551:
5550:
5548:
5547:
5546:
5545:
5535:
5530:
5525:
5520:
5515:
5510:
5505:
5504:
5503:
5493:
5492:
5491:
5481:
5476:
5471:
5466:
5460:
5458:
5451:
5443:
5442:
5440:
5439:
5434:
5429:
5424:
5419:
5414:
5413:
5412:
5407:
5405:Virtual memory
5397:
5392:
5391:
5390:
5385:
5380:
5375:
5365:
5360:
5355:
5350:
5345:
5344:
5343:
5333:
5328:
5322:
5320:
5314:
5313:
5311:
5310:
5309:
5308:
5303:
5298:
5293:
5283:
5278:
5273:
5272:
5271:
5266:
5261:
5256:
5251:
5246:
5241:
5236:
5229:Turing machine
5226:
5225:
5224:
5219:
5214:
5209:
5204:
5199:
5189:
5184:
5178:
5176:
5170:
5169:
5164:
5162:
5161:
5154:
5147:
5139:
5130:
5129:
5127:
5126:
5114:
5111:
5110:
5108:
5107:
5102:
5097:
5092:
5090:Race condition
5087:
5082:
5077:
5072:
5067:
5061:
5059:
5055:
5054:
5052:
5051:
5046:
5041:
5036:
5031:
5026:
5021:
5016:
5011:
5006:
5001:
4996:
4991:
4986:
4981:
4976:
4971:
4966:
4961:
4956:
4951:
4946:
4941:
4936:
4931:
4925:
4923:
4917:
4916:
4914:
4913:
4908:
4903:
4902:
4901:
4891:
4885:
4884:
4883:
4878:
4873:
4868:
4863:
4858:
4848:
4847:
4846:
4841:
4834:Multiprocessor
4831:
4826:
4821:
4816:
4811:
4810:
4809:
4804:
4799:
4798:
4797:
4792:
4787:
4776:
4765:
4763:
4757:
4756:
4754:
4753:
4748:
4747:
4746:
4741:
4736:
4726:
4721:
4715:
4713:
4707:
4706:
4704:
4703:
4698:
4693:
4688:
4683:
4678:
4673:
4667:
4665:
4661:
4660:
4658:
4657:
4652:
4647:
4642:
4637:
4631:
4629:
4625:
4624:
4622:
4621:
4616:
4611:
4606:
4601:
4596:
4591:
4586:
4581:
4575:
4573:
4569:
4568:
4566:
4565:
4563:Hardware scout
4560:
4554:
4549:
4544:
4538:
4533:
4527:
4521:
4519:
4517:Multithreading
4513:
4512:
4510:
4509:
4504:
4499:
4494:
4489:
4484:
4479:
4474:
4468:
4466:
4462:
4461:
4459:
4458:
4456:Systolic array
4453:
4448:
4443:
4438:
4433:
4428:
4423:
4418:
4413:
4407:
4405:
4401:
4400:
4395:
4393:
4392:
4385:
4378:
4370:
4359:
4358:
4347:
4336:
4325:
4314:
4303:
4292:
4281:
4270:
4256:
4249:
4219:
4197:
4173:
4162:
4151:
4148:RISC-V RVV ISA
4140:
4129:
4109:
4095:
4081:
4070:
4059:
4045:
4034:
4023:
4003:
3996:
3963:
3952:
3941:
3930:
3916:
3902:
3885:
3865:
3832:
3810:
3799:
3788:
3781:
3763:
3735:
3734:
3732:
3729:
3728:
3727:
3722:
3717:
3712:
3707:
3697:
3692:
3687:
3682:
3677:
3675:Compute kernel
3672:
3667:
3661:
3654:
3651:
3632:
3629:
3626:
3623:
3620:
3616:
3612:
3592:
3589:
3586:
3564:
3561:
3558:
3555:
3552:
3549:
3546:
3543:
3540:
3537:
3534:
3530:
3526:
3499:
3496:
3457:
3454:
3453:
3452:
3421:
3403:
3377:
3374:
3373:
3372:
3360:
3346:
3336:
3334:x = y + y… + y
3319:Reduction and
3316:
3310:
3304:
3297:Gather-scatter
3286:
3276:
3266:
3263:data structure
3239:
3236:
3232:
3231:
3228:
3221:
3220:
3213:
3203:
3200:
3163:# n -= VL (t0)
3062:
3010:
3000:
2997:
2972:
2971:
2965:
2959:
2953:
2940:; add 2 groups
2904:; first 4 of x
2869:
2854:
2853:SIMD reduction
2851:
2722:
2716:
2713:
2587:
2576:
2573:
2550:
2549:
2545:
2538:
2531:
2490:# n -= VL (t0)
2418:# v1 += v0 * a
2350:
2321:
2318:
2036:
2030:
2027:
1981:
1980:
1977:
1974:
1791:
1779:; v4 = a,a,a,a
1765:
1750:
1747:
1684:
1511:
1505:
1502:
1376:
1365:
1362:
1243:
1240:
1232:supercomputers
1198:
1197:
1190:
1187:
1184:
1181:
1088:
974:
799:
756:memory latency
750:
747:
703:
687:
686:
684:
679:
654:
651:
631:
630:
600:
583:
580:Cell processor
514:
510:
499:
496:
471:
468:
435:Main article:
432:
429:
284:
283:
234:
232:
225:
219:
218:Supercomputers
216:
207:
204:
133:
130:
128:
125:
121:microprocessor
26:
24:
14:
13:
10:
9:
6:
4:
3:
2:
7115:
7104:
7101:
7099:
7096:
7094:
7091:
7089:
7086:
7085:
7083:
7068:
7065:
7063:
7060:
7058:
7055:
7053:
7050:
7048:
7045:
7043:
7040:
7038:
7035:
7033:
7030:
7028:
7025:
7024:
7022:
7018:
7011:
7008:
7006:
7003:
7001:
6998:
6996:
6993:
6991:
6988:
6986:
6983:
6981:
6978:
6977:
6975:
6973:
6967:
6957:
6954:
6952:
6949:
6947:
6944:
6942:
6939:
6937:
6934:
6930:
6927:
6925:
6922:
6920:
6917:
6916:
6915:
6912:
6911:
6909:
6907:
6903:
6897:
6894:
6890:
6887:
6885:
6882:
6881:
6880:
6877:
6873:
6870:
6869:
6868:
6865:
6863:
6860:
6858:
6857:Demultiplexer
6855:
6853:
6850:
6849:
6847:
6845:
6841:
6835:
6832:
6830:
6827:
6824:
6822:
6819:
6817:
6814:
6812:
6809:
6807:
6804:
6803:
6801:
6799:
6795:
6789:
6786:
6784:
6781:
6779:
6778:Memory buffer
6776:
6774:
6773:Register file
6771:
6769:
6766:
6764:
6761:
6759:
6756:
6755:
6753:
6751:
6747:
6739:
6736:
6734:
6731:
6730:
6729:
6726:
6724:
6721:
6719:
6716:
6714:
6713:Combinational
6711:
6710:
6708:
6706:
6702:
6696:
6693:
6689:
6686:
6685:
6683:
6680:
6678:
6675:
6673:
6670:
6665:
6662:
6660:
6657:
6656:
6654:
6651:
6648:
6645:
6642:
6639:
6636:
6633:
6632:
6630:
6628:
6622:
6616:
6613:
6611:
6608:
6606:
6603:
6601:
6598:
6594:
6591:
6589:
6586:
6584:
6581:
6579:
6576:
6574:
6571:
6569:
6566:
6565:
6564:
6561:
6559:
6556:
6555:
6553:
6549:
6543:
6540:
6538:
6535:
6533:
6530:
6528:
6525:
6524:
6522:
6518:
6510:
6507:
6506:
6505:
6502:
6500:
6497:
6495:
6492:
6490:
6487:
6485:
6482:
6480:
6477:
6475:
6472:
6470:
6467:
6465:
6462:
6460:
6457:
6455:
6452:
6450:
6447:
6445:
6442:
6440:
6437:
6435:
6432:
6431:
6429:
6427:
6423:
6413:
6410:
6408:
6405:
6403:
6400:
6397:
6394:
6391:
6388:
6385:
6382:
6379:
6376:
6374:
6371:
6368:
6365:
6363:
6360:
6358:
6355:
6354:
6352:
6350:
6344:
6337:
6334:
6332:
6329:
6326:
6323:
6320:
6317:
6316:
6314:
6308:
6302:
6299:
6297:
6294:
6292:
6289:
6287:
6284:
6282:
6279:
6277:
6274:
6272:
6269:
6268:
6266:
6262:
6255:
6252:
6249:
6246:
6243:
6240:
6238:
6235:
6233:
6230:
6228:
6225:
6223:
6220:
6218:
6215:
6213:
6210:
6208:
6205:
6203:
6200:
6198:
6195:
6193:
6190:
6186:
6183:
6182:
6180:
6177:
6174:
6171:
6170:
6168:
6166:
6162:
6156:
6153:
6151:
6148:
6145:
6142:
6139:
6136:
6133:
6130:
6127:
6124:
6121:
6118:
6113:
6110:
6109:
6107:
6104:
6102:
6099:
6098:
6096:
6094:
6088:
6076:
6073:
6072:
6071:
6068:
6066:
6063:
6059:
6056:
6054:
6051:
6049:
6046:
6044:
6041:
6040:
6039:
6036:
6034:
6031:
6030:
6028:
6026:
6022:
6016:
6013:
6011:
6008:
6006:
6003:
5999:
5996:
5994:
5991:
5990:
5989:
5986:
5984:
5981:
5980:
5978:
5976:
5972:
5966:
5963:
5961:
5958:
5954:
5951:
5950:
5949:
5946:
5942:
5939:
5937:
5934:
5933:
5932:
5929:
5925:
5922:
5920:
5917:
5916:
5915:
5912:
5910:
5907:
5903:
5900:
5898:
5895:
5894:
5893:
5890:
5889:
5887:
5883:
5880:
5878:
5874:
5864:
5861:
5859:
5856:
5855:
5853:
5851:
5847:
5841:
5838:
5836:
5833:
5829:
5826:
5824:
5821:
5820:
5819:
5816:
5814:
5813:Scoreboarding
5811:
5810:
5808:
5806:
5802:
5796:
5795:False sharing
5793:
5791:
5788:
5786:
5783:
5781:
5778:
5777:
5775:
5773:
5769:
5763:
5760:
5758:
5755:
5753:
5750:
5749:
5747:
5745:
5741:
5738:
5736:
5732:
5722:
5719:
5717:
5714:
5712:
5709:
5706:
5702:
5699:
5697:
5694:
5692:
5689:
5687:
5684:
5683:
5681:
5679:
5676:
5674:
5671:
5669:
5666:
5664:
5661:
5659:
5656:
5654:
5651:
5649:
5646:
5644:
5641:
5639:
5636:
5634:
5631:
5629:
5626:
5624:
5621:
5617:
5614:
5612:
5609:
5607:
5604:
5603:
5601:
5599:
5596:
5594:
5591:
5589:
5588:Stanford MIPS
5586:
5584:
5581:
5579:
5576:
5574:
5571:
5569:
5566:
5564:
5561:
5560:
5558:
5552:
5544:
5541:
5540:
5539:
5536:
5534:
5531:
5529:
5526:
5524:
5521:
5519:
5516:
5514:
5511:
5509:
5506:
5502:
5499:
5498:
5497:
5494:
5490:
5487:
5486:
5485:
5482:
5480:
5477:
5475:
5472:
5470:
5467:
5465:
5462:
5461:
5459:
5455:
5452:
5450:
5449:architectures
5444:
5438:
5435:
5433:
5430:
5428:
5425:
5423:
5420:
5418:
5417:Heterogeneous
5415:
5411:
5408:
5406:
5403:
5402:
5401:
5398:
5396:
5393:
5389:
5386:
5384:
5381:
5379:
5376:
5374:
5371:
5370:
5369:
5368:Memory access
5366:
5364:
5361:
5359:
5356:
5354:
5351:
5349:
5346:
5342:
5339:
5338:
5337:
5334:
5332:
5329:
5327:
5324:
5323:
5321:
5319:
5315:
5307:
5304:
5302:
5301:Random-access
5299:
5297:
5294:
5292:
5289:
5288:
5287:
5284:
5282:
5281:Stack machine
5279:
5277:
5274:
5270:
5267:
5265:
5262:
5260:
5257:
5255:
5252:
5250:
5247:
5245:
5242:
5240:
5237:
5235:
5232:
5231:
5230:
5227:
5223:
5220:
5218:
5215:
5213:
5210:
5208:
5205:
5203:
5200:
5198:
5197:with datapath
5195:
5194:
5193:
5190:
5188:
5185:
5183:
5180:
5179:
5177:
5175:
5171:
5167:
5160:
5155:
5153:
5148:
5146:
5141:
5140:
5137:
5125:
5116:
5115:
5112:
5106:
5103:
5101:
5098:
5096:
5093:
5091:
5088:
5086:
5083:
5081:
5078:
5076:
5073:
5071:
5068:
5066:
5063:
5062:
5060:
5056:
5050:
5047:
5045:
5042:
5040:
5037:
5035:
5032:
5030:
5027:
5025:
5022:
5020:
5017:
5015:
5012:
5010:
5007:
5005:
5002:
5000:
4997:
4995:
4992:
4990:
4987:
4985:
4982:
4980:
4979:Global Arrays
4977:
4975:
4972:
4970:
4967:
4965:
4962:
4960:
4957:
4955:
4952:
4950:
4947:
4945:
4942:
4940:
4937:
4935:
4932:
4930:
4927:
4926:
4924:
4922:
4918:
4912:
4909:
4907:
4906:Grid computer
4904:
4900:
4897:
4896:
4895:
4892:
4889:
4886:
4882:
4879:
4877:
4874:
4872:
4869:
4867:
4864:
4862:
4859:
4857:
4854:
4853:
4852:
4849:
4845:
4842:
4840:
4837:
4836:
4835:
4832:
4830:
4827:
4825:
4822:
4820:
4817:
4815:
4812:
4808:
4805:
4803:
4800:
4796:
4793:
4791:
4788:
4785:
4782:
4781:
4780:
4777:
4775:
4772:
4771:
4770:
4767:
4766:
4764:
4762:
4758:
4752:
4749:
4745:
4742:
4740:
4737:
4735:
4732:
4731:
4730:
4727:
4725:
4722:
4720:
4717:
4716:
4714:
4712:
4708:
4702:
4699:
4697:
4694:
4692:
4689:
4687:
4684:
4682:
4679:
4677:
4674:
4672:
4669:
4668:
4666:
4662:
4656:
4653:
4651:
4648:
4646:
4643:
4641:
4638:
4636:
4633:
4632:
4630:
4626:
4620:
4617:
4615:
4612:
4610:
4607:
4605:
4602:
4600:
4597:
4595:
4592:
4590:
4587:
4585:
4582:
4580:
4577:
4576:
4574:
4570:
4564:
4561:
4558:
4555:
4553:
4550:
4548:
4545:
4542:
4539:
4537:
4534:
4531:
4528:
4526:
4523:
4522:
4520:
4518:
4514:
4508:
4505:
4503:
4500:
4498:
4495:
4493:
4490:
4488:
4485:
4483:
4480:
4478:
4475:
4473:
4470:
4469:
4467:
4463:
4457:
4454:
4452:
4449:
4447:
4444:
4442:
4439:
4437:
4434:
4432:
4429:
4427:
4424:
4422:
4419:
4417:
4414:
4412:
4409:
4408:
4406:
4402:
4398:
4391:
4386:
4384:
4379:
4377:
4372:
4371:
4368:
4364:
4356:
4351:
4348:
4345:
4340:
4337:
4334:
4329:
4326:
4323:
4318:
4315:
4312:
4307:
4304:
4301:
4296:
4293:
4290:
4285:
4282:
4279:
4274:
4271:
4266:
4260:
4257:
4252:
4246:
4242:
4238:
4234:
4230:
4223:
4220:
4214:
4209:
4201:
4198:
4193:
4189:
4188:
4183:
4177:
4174:
4171:
4166:
4163:
4160:
4155:
4152:
4149:
4144:
4141:
4138:
4137:Cray Overview
4133:
4130:
4125:
4124:
4119:
4113:
4110:
4105:
4099:
4096:
4091:
4085:
4082:
4079:
4074:
4071:
4068:
4063:
4060:
4055:
4049:
4046:
4043:
4038:
4035:
4032:
4027:
4024:
4019:
4018:
4013:
4007:
4004:
3999:
3993:
3989:
3984:
3983:
3977:
3973:
3967:
3964:
3961:
3956:
3953:
3950:
3945:
3942:
3939:
3934:
3931:
3926:
3920:
3917:
3912:
3906:
3903:
3895:
3889:
3886:
3881:
3880:
3875:
3869:
3866:
3860:
3855:
3851:
3847:
3843:
3836:
3833:
3820:
3814:
3811:
3808:
3803:
3800:
3797:
3792:
3789:
3784:
3778:
3774:
3767:
3764:
3751:
3750:New Scientist
3747:
3740:
3737:
3730:
3726:
3723:
3721:
3718:
3716:
3713:
3711:
3708:
3705:
3701:
3698:
3696:
3693:
3691:
3688:
3686:
3683:
3681:
3678:
3676:
3673:
3671:
3668:
3665:
3662:
3660:
3657:
3656:
3652:
3650:
3648:
3647:
3627:
3624:
3621:
3614:
3610:
3587:
3584:
3575:
3559:
3556:
3553:
3550:
3544:
3541:
3538:
3528:
3524:
3516:
3513:
3512:
3507:
3506:
3497:
3495:
3492:
3487:
3481:
3477:
3473:
3471:
3467:
3462:
3455:
3450:
3445:
3441:
3437:
3433:
3429:
3428:trigonometric
3425:
3422:
3419:
3415:
3411:
3407:
3404:
3401:
3398:
3394:
3391:
3390:
3389:
3387:
3386:trigonometric
3383:
3380:With many 3D
3375:
3370:
3366:
3365:
3361:
3358:
3354:
3350:
3347:
3344:
3340:
3337:
3327:
3323:
3322:
3317:
3314:
3311:
3308:
3305:
3302:
3298:
3294:
3290:
3287:
3284:
3280:
3277:
3274:
3270:
3267:
3264:
3260:
3256:
3252:
3248:
3245:
3244:
3243:
3237:
3235:
3229:
3226:
3225:
3224:
3218:
3214:
3210:
3209:
3208:
3201:
3199:
3195:
3191:
3060:
3008:
3006:
2998:
2996:
2992:
2990:
2984:
2977:
2966:
2960:
2954:
2948:
2947:
2946:
2867:
2865:
2861:
2852:
2850:
2720:
2714:
2712:
2585:
2583:
2574:
2572:
2570:
2565:
2561:
2559:
2553:
2546:
2543:
2539:
2536:
2532:
2521:
2520:
2519:
2516:
2514:
2348:
2345:
2342:
2337:
2336:
2334:
2328:
2319:
2317:
2315:
2309:
2034:
2028:
2026:
2018:
2015:
2013:
2008:
2006:
1998:
1994:
1990:
1985:
1978:
1975:
1971:
1970:
1969:
1965:
1962:
1960:
1789:
1763:
1760:
1756:
1748:
1740:; y = y + tmp
1713:; tmp = a * x
1682:
1509:
1503:
1501:
1374:
1372:
1363:
1361:
1358:
1353:
1349:
1344:
1342:
1337:
1335:
1331:
1327:
1323:
1319:
1315:
1311:
1305:
1303:
1298:
1294:
1289:
1287:
1280:IV ISA for a
1279:
1274:
1267:
1258:
1253:
1249:
1241:
1239:
1237:
1233:
1228:
1226:
1224:
1218:
1214:
1209:
1206:
1201:
1195:
1191:
1188:
1185:
1182:
1179:
1178:
1177:
1086:
1083:
1081:
1077:
972:
797:
793:
791:
785:
783:
782:
777:
773:
772:assembly line
768:
763:
761:
757:
748:
746:
744:
740:
736:
731:
729:
723:
719:
717:
709:
705:
701:
698:
696:
692:
682:
680:
677:
665:
664:
663:
659:
652:
650:
648:
644:
640:
636:
628:
624:
620:
616:
612:
608:
604:
601:
599:
595:
591:
587:
584:
581:
577:
573:
569:
565:
561:
557:
553:
549:
545:
541:
537:
533:
529:
525:
521:
518:
517:
516:
512:
508:
506:
497:
495:
493:
489:
485:
481:
477:
469:
467:
465:
464:"Packed SIMD"
461:
460:significantly
456:
452:
448:
444:
438:
430:
428:
425:
421:
417:
412:
410:
406:
402:
398:
394:
389:
387:
383:
382:minicomputers
379:
375:
371:
367:
363:
359:
355:
347:
343:
339:
337:
336:
330:
325:
323:
319:
314:
312:
307:
303:
301:
298:
294:
291:
280:
277:
269:
259:
255:
251:
245:
244:
240:
235:This section
233:
229:
224:
223:
217:
215:
213:
205:
203:
201:
197:
193:
191:
187:
183:
178:
174:
170:
165:
163:
159:
155:
151:
147:
143:
139:
131:
126:
124:
122:
118:
114:
110:
109:supercomputer
105:
103:
99:
95:
91:
87:
83:
79:
75:
71:
67:
63:
59:
55:
51:
44:
40:
33:
19:
7093:Coprocessors
7067:Chip carrier
7005:Clock gating
6924:Mixed-signal
6821:Write buffer
6798:Control unit
6610:Clock signal
6349:accelerators
6331:Cypress PSoC
6191:
5988:Simultaneous
5952:
5805:Out-of-order
5437:Neuromorphic
5318:Architecture
5276:Belt machine
5269:Zeno machine
5202:Hierarchical
4828:
4664:Coordination
4594:Amdahl's law
4530:Simultaneous
4362:
4350:
4339:
4328:
4317:
4306:
4295:
4284:
4273:
4259:
4232:
4222:
4200:
4185:
4176:
4165:
4154:
4143:
4132:
4121:
4112:
4098:
4084:
4073:
4062:
4048:
4037:
4026:
4015:
4006:
3981:
3966:
3955:
3944:
3933:
3919:
3905:
3888:
3877:
3868:
3841:
3835:
3823:. Retrieved
3813:
3802:
3791:
3772:
3766:
3754:. Retrieved
3749:
3739:
3645:
3644:
3576:
3517:
3510:
3509:
3504:
3503:
3501:
3490:
3485:
3482:
3478:
3474:
3469:
3465:
3463:
3459:
3423:
3410:mini-permute
3405:
3392:
3379:
3362:
3353:Galois field
3348:
3338:
3318:
3312:
3306:
3288:
3278:
3268:
3258:
3254:
3250:
3246:
3241:
3233:
3222:
3205:
3196:
3192:
3188:
3058:
3004:
3002:
2993:
2985:
2975:
2973:
2944:
2919:; 2nd 4 of x
2863:
2859:
2856:
2848:
2718:
2710:
2578:
2566:
2562:
2557:
2554:
2551:
2541:
2534:
2517:
2512:
2509:
2346:
2340:
2338:
2332:
2330:
2326:
2323:
2313:
2310:
2306:
2032:
2019:
2016:
2009:
1992:
1988:
1986:
1982:
1966:
1963:
1958:
1956:
1783:
1752:
1679:
1507:
1499:
1367:
1345:
1338:
1322:"predicated"
1306:
1290:
1286:power of two
1275:
1259:
1255:
1235:
1229:
1220:
1216:
1212:
1210:
1202:
1199:
1175:
1084:
1079:
1075:
1073:
994:; count = 10
970:
794:
789:
786:
779:
764:
752:
735:single-issue
734:
732:
724:
720:
713:
699:
688:
676:power of two
660:
656:
647:Fujitsu FR-V
632:
627:CDC STAR-100
603:Pure Vectors
602:
585:
546:extensions,
526:(SWAR), and
519:
509:fixed-length
504:
501:
473:
459:
440:
413:
390:
351:
334:
326:
315:
308:
304:
287:
272:
263:
248:Please help
236:
209:
194:
166:
146:coprocessors
141:
135:
106:
77:
70:instructions
57:
53:
47:
6852:Multiplexer
6816:Data buffer
6527:Single-core
6499:bit slicing
6357:Coprocessor
6212:Coprocessor
6093:performance
6015:Cooperative
6005:Speculative
5965:Distributed
5924:Superscalar
5909:Instruction
5877:Parallelism
5850:Speculative
5682:System/3x0
5554:Instruction
5331:Von Neumann
5244:Post–Turing
5100:Scalability
4861:distributed
4744:Concurrency
4711:Programming
4552:Cooperative
4541:Speculative
4477:Instruction
3825:23 December
3393:Sub-vectors
3369:many others
2530:instruction
1973:inner loop.
948:; decrement
749:Description
566:. In 2000,
554:extension,
484:open source
160:to a large
7082:Categories
6972:management
6867:Multiplier
6728:Logic gate
6718:Sequential
6625:Functional
6605:Clock rate
6578:Data cache
6551:Components
6532:Multi-core
6520:Core count
6010:Preemptive
5914:Pipelining
5897:Bit-serial
5840:Wide-issue
5785:Structural
5707:Tilera ISA
5673:MicroBlaze
5643:ETRAX CRIS
5538:Comparison
5383:Load–store
5363:Endianness
5105:Starvation
4844:asymmetric
4579:PRAM model
4547:Preemptive
4213:2104.03142
3997:155860491X
3859:2065/10689
3782:5770761318
3731:References
3486:subsequent
3470:subsequent
3451:extension.
3259:fail-first
3194:critical.
2978:of adding
1786:y = mx + c
1246:See also:
1236:efficiency
623:RISC-V RVV
562:and MIPS'
550:, Sparc's
488:ForwardCom
88:(SIMD) or
68:where its
6906:Circuitry
6826:Microcode
6750:Registers
6593:coherence
6568:CPU cache
6426:Word size
6091:Processor
5735:Execution
5638:DEC Alpha
5616:Power ISA
5432:Cognitive
5239:Universal
4839:symmetric
4584:PEM model
3807:MIAOW GPU
3625:−
3591:∞
3551:∗
3542:−
3440:logarithm
3418:"swizzle"
3414:Videocore
3330:x = y + x
3326:mapreduce
3321:Iteration
3109:vredadd32
3042:vredadd32
2976:by design
2433:# store Y
2200:store32x4
1862:store32x4
1357:Videocore
1348:Videocore
1278:Videocore
1153:# 10 adds
903:; move on
774:, so the
492:Libre-SOC
416:SX series
401:Cray Y-MP
397:Cray X-MP
266:July 2023
237:does not
173:ILLIAC IV
158:algorithm
140:in their
50:computing
6844:Datapath
6537:Manycore
6509:variable
6347:Hardware
5983:Temporal
5663:OpenRISC
5358:Cellular
5348:Dataflow
5341:modified
5070:Deadlock
5058:Problems
5024:pthreads
5004:OpenHMPP
4929:Ateji PX
4890:computer
4761:Hardware
4628:Elements
4614:Slowdown
4525:Temporal
4507:Pipeline
4192:Archived
3978:(1998).
3653:See also
3303:instead.
3005:built-in
2907:load32x4
2892:load32x4
2128:load32x4
2110:load32x4
1808:load32x4
1796:load32x4
1076:hardware
685:vectors.
594:ARM SVE2
548:ARM NEON
346:Cray J90
311:CDC 7600
293:STAR-100
192:(SIMT).
162:data set
7020:Related
6951:Quantum
6941:Digital
6936:Boolean
6834:Counter
6733:Quantum
6494:512-bit
6489:256-bit
6484:128-bit
6327:(MPSoC)
6312:on chip
6310:Systems
6128:(FLOPS)
5941:Process
5790:Control
5772:Hazards
5658:Itanium
5653:Unicore
5611:PowerPC
5336:Harvard
5296:Pointer
5291:Counter
5249:Quantum
5029:RaftLib
5009:OpenACC
4984:GPUOpen
4974:C++ AMP
4949:Charm++
4691:Barrier
4635:Process
4619:Speedup
4404:General
4187:YouTube
3775:. KIT.
3491:exactly
3449:MIPS-3D
3283:AVX-512
3255:segment
3251:indexed
2989:AVX-512
2922:add32x4
2400:vmadd32
2281:# loop?
2173:add32x4
2146:mul32x4
2012:AVX-512
1997:AVX-512
1989:triples
1841:add32x4
1820:mul32x4
1767:splatx4
1585:store32
1334:AltiVec
1314:AVX-512
1082:basis.
781:latency
695:AVX-512
598:AVX-512
572:Toshiba
560:AltiVec
556:PowerPC
376:-based
366:Hitachi
362:Fujitsu
258:removed
243:sources
171:as the
142:Solomon
127:History
78:vectors
6956:Switch
6946:Analog
6684:(IMC)
6655:(MMU)
6504:others
6479:64-bit
6474:48-bit
6469:32-bit
6464:24-bit
6459:16-bit
6454:15-bit
6449:12-bit
6286:Mobile
6202:Stream
6197:Barrel
6192:Vector
6181:(GPU)
6140:(SUPS)
6108:(IPC)
5960:Memory
5953:Vector
5936:Thread
5919:Scalar
5721:Others
5668:RISC-V
5633:SuperH
5602:Power
5598:MIPS-X
5573:PDP-11
5422:Fabric
5174:Models
5122:
4999:OpenCL
4994:OpenMP
4939:Chapel
4856:shared
4851:Memory
4786:(SIMT)
4729:Models
4640:Thread
4572:Theory
4543:(SpMT)
4497:Memory
4482:Thread
4465:Levels
4247:
4123:GitHub
4017:GitHub
3994:
3879:GitHub
3779:
3756:7 July
3700:RISC-V
3466:actual
3436:cosine
3400:SPIR-V
3397:Vulkan
3382:shader
3076:vloop:
2742:load32
2697:return
2649:size_t
2595:size_t
2569:no-ops
2548:1977).
2352:vloop:
2341:actual
2314:at all
2038:vloop:
1793:vloop:
1531:load32
1516:load32
1435:size_t
1387:size_t
1316:, ARM
1293:Cray-1
1156:vstore
1096:setvli
1051:vstore
790:itself
739:Cray-1
683:within
668:vsetvl
621:, and
619:NEC SX
611:Cray-1
544:3DNow!
393:Cray-2
374:Oregon
358:ETA-10
318:Cray-1
177:GFLOPS
7012:(PPW)
6970:Power
6862:Adder
6738:Array
6705:Logic
6666:(TLB)
6649:(FPU)
6643:(AGU)
6637:(ALU)
6627:units
6563:Cache
6444:8-bit
6439:4-bit
6434:1-bit
6398:(TPU)
6392:(DSP)
6386:(PPU)
6380:(VPU)
6369:(GPU)
6338:(NoC)
6321:(SoC)
6256:(PoP)
6250:(SiP)
6244:(MCM)
6185:GPGPU
6175:(CPU)
6165:Types
6146:(PPW)
6134:(TPS)
6122:(IPS)
6114:(CPI)
5885:Level
5696:S/390
5691:S/370
5686:S/360
5628:SPARC
5606:POWER
5489:TRIPS
5457:Types
4969:Dryad
4934:Boost
4655:Array
4645:Fiber
4559:(CMT)
4532:(SMT)
4446:GPGPU
4208:arXiv
3988:751-2
3897:(PDF)
3670:GPGPU
3402:spec.
3175:vloop
3094:vld32
3079:setvl
3027:vld32
3012:setvl
2864:other
2757:add32
2739:loop:
2613:const
2524:setvl
2513:going
2502:vloop
2421:vst32
2385:vld32
2370:vld32
2355:setvl
2293:vloop
2065:shift
2023:setvl
1959:shall
1943:vloop
1564:add32
1543:mul32
1513:loop:
1405:const
1381:iaxpy
1213:other
1120:vload
1105:vload
1066:count
1030:count
1015:vload
1012:count
997:vload
991:count
954:count
945:count
873:store
825:loop:
819:count
662:has:
60:is a
6990:ACPI
6723:Glue
6615:FIFO
6558:Core
6296:ASIP
6237:CPLD