Arithmetic Instructions

Embedded Processor Architecture

Peter Barry, Patrick Crowley, in Modern Embedded Computing, 2012

Arithmetic Instructions

The arithmetic instructions define the set of operations performed by the processor's Arithmetic Logic Unit (ALU). The arithmetic instructions are further classified into binary, decimal, logical, shift/rotate, and bit/byte manipulation instructions.

Binary Operations

The binary arithmetic instructions perform basic binary integer computations on byte, word, and double word integers located in memory and/or the general-purpose registers, as described in Table 5.4.

Table 5.4. Binary Arithmetic Operation Instructions

Mnemonic   Example   Description
ADD   ADD EAX, EAX   Add the contents of EAX to EAX
ADC   ADC EAX, EAX   Add with carry
SUB   SUB EAX, 0002h   Subtract 2 from the register
SBB   SBB EBX, 0002h   Subtract with borrow
MUL   MUL EBX   Unsigned multiply of EAX by EBX; result in EDX:EAX
DIV   DIV EBX   Unsigned divide
INC   INC [EAX]   Increment the value at memory address EAX by 1
DEC   DEC EAX   Decrement EAX by 1
NEG   NEG EAX   Two's complement negation

Decimal Operations

The decimal arithmetic instructions perform decimal arithmetic on binary coded decimal (BCD) data, as described in Table 5.5. BCD is not used as much as it has been in the past, but it still remains relevant for some financial and industrial applications.

Table 5.5. Decimal Operation Instructions (Subset)

Mnemonic   Example   Description
DAA   DAA   Decimal adjust AL after addition
DAS   DAS   Decimal adjust AL after subtraction. Adjusts the result of the subtraction of two packed BCD values to create a packed BCD result
AAA   AAA   ASCII adjust after addition. Adjusts the sum of two unpacked BCD values to create an unpacked BCD result
AAS   AAS   ASCII adjust after subtraction. Adjusts the result of the subtraction of two unpacked BCD values to create an unpacked BCD result
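The packed BCD adjustment that DAA performs after an ADD can be modeled in C. This is a hypothetical sketch of the adjust-by-6 rule only, not the processor's exact behavior (AF/CF flag handling is omitted):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of ADD followed by DAA on two packed BCD bytes: each
   nibble holds one decimal digit, and a digit that overflows past
   9 is corrected by adding 6. The carry out is modeled only by the
   truncation to 8 bits. */
uint8_t bcd_add(uint8_t a, uint8_t b)
{
    unsigned sum = a + b;
    if (((a & 0x0Fu) + (b & 0x0Fu)) > 9u)
        sum += 0x06;           /* low-digit adjust */
    if (sum > 0x99u)
        sum += 0x60;           /* high-digit adjust */
    return (uint8_t)sum;
}
```

For example, bcd_add(0x45, 0x38) yields 0x83, matching decimal 45 + 38 = 83.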

Logical Operations

The logical instructions perform basic AND, OR, XOR, and NOT logical operations on byte, word, and double word values, as described in Table 5.6.

Table 5.6. Logical Operation Instructions

Mnemonic   Example   Description
AND   AND EAX, 0ffffh   Performs bitwise logical AND
OR   OR EAX, 0fffffff0h   Performs bitwise logical OR
XOR   XOR EBX, 0fffffff0h   Performs bitwise logical XOR
NOT   NOT [EAX]   Performs bitwise logical NOT

Shift and Rotate Operations

The shift and rotate instructions shift and rotate the bits in word and double word operands. Table 5.7 shows some examples.

Table 5.7. Shift and Rotate Instructions

Mnemonic   Example   Description
SAR   SAR EAX, 4h   Shift arithmetic right
SHR   SHR EAX, 1   Shift logical right
SAL/SHL   SAL EAX, 1   Shift arithmetic left/shift logical left
SHRD   SHRD EAX, EBX, 4   Shift right double
SHLD   SHLD EAX, EBX, 4   Shift left double
ROR   ROR EAX, 4h   Rotate right
ROL   ROL EAX, 4h   Rotate left
RCR   RCR EAX, 4h   Rotate through carry right
RCL   RCL EAX, 4h   Rotate through carry left

The arithmetic shift operations are often used in power-of-two arithmetic operations (such as a multiply by 2), as the instructions are much faster than the equivalent multiply or divide operation.
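The power-of-two trick can be made concrete in C, where a compiler typically emits exactly these shift instructions. A minimal sketch (the arithmetic right shift on signed values assumes a typical two's complement compiler):

```c
#include <assert.h>
#include <stdint.h>

/* SAL/SHL by n multiplies by 2^n; SAR by n divides a signed value
   by 2^n, rounding toward negative infinity on the usual two's
   complement implementations. */
int32_t mul_by_8(int32_t x) { return x << 3; }  /* x * 8 */
int32_t div_by_4(int32_t x) { return x >> 2; }  /* x / 4 via arithmetic shift */
```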

Bit/Byte Operations

Bit instructions test and modify individual bits in word and double word operands, as described in Table 5.8. Byte instructions set the value of a byte operand to indicate the status of flags in the EFLAGS register.

Table 5.8. Bit/Byte Operation Instructions

Mnemonic   Example   Description
BT   BT EAX, 4h   Bit test. Stores selected bit in the carry flag
BTS   BTS EAX, 4h   Bit test and set. Stores selected bit in the carry flag and sets the bit
BTR   BTR EAX, 4h   Bit test and reset. Stores selected bit in the carry flag and clears the bit
BTC   BTC EAX, 4h   Bit test and complement. Stores selected bit in the carry flag and complements the bit
BSF   BSF EBX, [EAX]   Bit scan forward. Searches the source operand (second operand) for the least significant set bit (1 bit)
BSR   BSR EBX, [EAX]   Bit scan reverse. Searches the source operand (second operand) for the most significant set bit (1 bit)
SETE/SETZ   SETE AL   Conditional set byte if equal/set byte if zero
TEST   TEST EAX, 0ffffffffh   Logical compare. Computes the bitwise logical AND of the first operand (source 1 operand) and the second operand (source 2 operand) and sets the SF, ZF, and PF status flags according to the result
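As an illustration, the test-and-set pattern from Table 5.8 can be modeled in C, with the returned value standing in for the carry flag:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the BT/BTS pair: report the old value of the selected
   bit (as BT copies it to the carry flag), then set it (as BTS
   does). */
int bit_test_and_set(uint32_t *word, unsigned bit)
{
    int old = (int)((*word >> bit) & 1u);  /* BT: selected bit out */
    *word |= 1u << bit;                    /* BTS: set the bit     */
    return old;
}
```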

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780123914903000059

Instruction Sets

Joseph Yiu, in The Definitive Guide to the ARM Cortex-M3 (Second Edition), 2009

4.3.3 Assembler Language: Processing Data

The Cortex-M3 provides many different instructions for data processing. A few basic ones are introduced here. Many data operation instructions can take multiple instruction formats. For example, an ADD instruction can operate between two registers or between one register and an immediate data value:

ADD   R0, R0, R1   ; R0 = R0 + R1

ADDS   R0, R0, #0x12   ; R0 = R0 + 0x12

ADD.W R0, R1, R2   ; R0 = R1 + R2

These are all ADD instructions, but they have different syntaxes and binary coding.

With the traditional Thumb instruction syntax, when 16-bit Thumb code is used, an ADD instruction can change the flags in the PSR. However, 32-bit Thumb-2 code can either change a flag or keep it unchanged. To separate the two different operations, the S suffix should be used if the following operation depends on the flags:

ADD.W   R0, R1, R2 ; Flags unchanged

ADDS.W R0, R1, R2 ; Flags change

Aside from ADD instructions, the arithmetic functions that the Cortex-M3 supports include subtract (SUB), multiply (MUL), and unsigned and signed divide (UDIV/SDIV). Table 4.18 shows some of the most commonly used arithmetic instructions.

Table 4.18. Examples of Arithmetic Instructions

Instruction   Operation
ADD Rd, Rn, Rm   ; Rd = Rn + Rm   Add operation
ADD Rd, Rd, Rm   ; Rd = Rd + Rm
ADD Rd, #immed   ; Rd = Rd + #immed
ADD Rd, Rn, #immed   ; Rd = Rn + #immed
ADC Rd, Rn, Rm   ; Rd = Rn + Rm + carry   Add with carry
ADC Rd, Rd, Rm   ; Rd = Rd + Rm + carry
ADC Rd, #immed   ; Rd = Rd + #immed + carry
ADDW Rd, Rn, #immed   ; Rd = Rn + #immed   Add register with 12-bit immediate value
SUB Rd, Rn, Rm   ; Rd = Rn − Rm   Subtract
SUB Rd, #immed   ; Rd = Rd − #immed
SUB Rd, Rn, #immed   ; Rd = Rn − #immed
SBC Rd, Rm   ; Rd = Rd − Rm − borrow   Subtract with borrow (not carry)
SBC.W Rd, Rn, #immed   ; Rd = Rn − #immed − borrow
SBC.W Rd, Rn, Rm   ; Rd = Rn − Rm − borrow
RSB.W Rd, Rn, #immed   ; Rd = #immed − Rn   Reverse subtract
RSB.W Rd, Rn, Rm   ; Rd = Rm − Rn
MUL Rd, Rm   ; Rd = Rd * Rm   Multiply
MUL.W Rd, Rn, Rm   ; Rd = Rn * Rm
UDIV Rd, Rn, Rm   ; Rd = Rn/Rm   Unsigned and signed divide
SDIV Rd, Rn, Rm   ; Rd = Rn/Rm

These instructions can be used with or without the "S" suffix to determine whether the APSR should be updated. In most cases, if UAL syntax is selected and the "S" suffix is not used, the 32-bit version of the instruction is selected, as most of the 16-bit Thumb instructions update the APSR.

The Cortex-M3 also supports 32-bit multiply instructions and multiply accumulate instructions that give 64-bit results. These instructions support signed or unsigned values (see Table 4.19).

Table 4.19. 32-Bit Multiply Instructions

Instruction   Operation
SMULL RdLo, RdHi, Rn, Rm   ; {RdHi, RdLo} = Rn * Rm   32-bit multiply instructions for signed values
SMLAL RdLo, RdHi, Rn, Rm   ; {RdHi, RdLo} += Rn * Rm
UMULL RdLo, RdHi, Rn, Rm   ; {RdHi, RdLo} = Rn * Rm   32-bit multiply instructions for unsigned values
UMLAL RdLo, RdHi, Rn, Rm   ; {RdHi, RdLo} += Rn * Rm

Another group of data processing instructions is the logical operation instructions, such as AND, ORR (OR), and the shift and rotate functions. Table 4.20 shows some of the most commonly used logical instructions. These instructions can be used with or without the "S" suffix to determine whether the APSR should be updated. If UAL syntax is used and the "S" suffix is not used, the 32-bit version of the instruction is selected, as all of the 16-bit logic operation instructions update the APSR.

Table 4.twenty. Logic Operation Instructions

Instruction   Operation
AND Rd, Rn   ; Rd = Rd & Rn   Bitwise AND
AND.W Rd, Rn, #immed   ; Rd = Rn & #immed
AND.W Rd, Rn, Rm   ; Rd = Rn & Rm
ORR Rd, Rn   ; Rd = Rd | Rn   Bitwise OR
ORR.W Rd, Rn, #immed   ; Rd = Rn | #immed
ORR.W Rd, Rn, Rm   ; Rd = Rn | Rm
BIC Rd, Rn   ; Rd = Rd & (~Rn)   Bit clear
BIC.W Rd, Rn, #immed   ; Rd = Rn & (~#immed)
BIC.W Rd, Rn, Rm   ; Rd = Rn & (~Rm)
ORN.W Rd, Rn, #immed   ; Rd = Rn | (~#immed)   Bitwise OR NOT
ORN.W Rd, Rn, Rm   ; Rd = Rn | (~Rm)
EOR Rd, Rn   ; Rd = Rd ^ Rn   Bitwise exclusive OR
EOR.W Rd, Rn, #immed   ; Rd = Rn ^ #immed
EOR.W Rd, Rn, Rm   ; Rd = Rn ^ Rm

The Cortex-M3 provides rotate and shift instructions. In some cases, the rotate operation can be combined with other operations (for example, in memory address offset calculation for load/store instructions). For standalone rotate/shift operations, the instructions shown in Table 4.21 are provided. Again, the 32-bit version of the instruction is used if the "S" suffix is not used and UAL syntax is used.

Table 4.21. Shift and Rotate Instructions

Instruction Operation
ASR Rd, Rn, #immed   ; Rd = Rn >> immed   Arithmetic shift right
ASR Rd, Rn   ; Rd = Rd >> Rn
ASR.W Rd, Rn, Rm   ; Rd = Rn >> Rm
LSL Rd, Rn, #immed   ; Rd = Rn << immed   Logical shift left
LSL Rd, Rn   ; Rd = Rd << Rn
LSL.W Rd, Rn, Rm   ; Rd = Rn << Rm
LSR Rd, Rn, #immed   ; Rd = Rn >> immed   Logical shift right
LSR Rd, Rn   ; Rd = Rd >> Rn
LSR.W Rd, Rn, Rm   ; Rd = Rn >> Rm
ROR Rd, Rn   ; Rd rotated by Rn   Rotate right
ROR.W Rd, Rn, #immed   ; Rd = Rn rotated by immed
ROR.W Rd, Rn, Rm   ; Rd = Rn rotated by Rm
RRX.W Rd, Rn   ; {C, Rd} = {Rn, C}   Rotate right extended

In UAL syntax, the rotate and shift operations can also update the carry flag if the S suffix is used (and always update the carry flag if 16-bit Thumb code is used). See Figure 4.1.

FIGURE 4.1. Shift and Rotate Instructions.

If the shift or rotate operation shifts the register contents by multiple bit positions, the value of the carry flag C will be the last bit shifted out of the register.

Why Is There Rotate Right But No Rotate Left?

The rotate left operation can be replaced by a rotate right operation with a different rotate offset. For example, a rotate left by 4 bits can be written as a rotate right by 28 bits, which gives the same result and takes the same amount of time to execute.
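This equivalence is easy to check in C. A minimal sketch with hypothetical helper names ror32 and rol32:

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit rotate right. */
uint32_t ror32(uint32_t x, unsigned n)
{
    n &= 31u;
    return (x >> n) | (x << ((32u - n) & 31u));
}

/* Rotate left expressed as a rotate right by 32 - n, exactly the
   substitution described in the text. */
uint32_t rol32(uint32_t x, unsigned n)
{
    return ror32(x, 32u - (n & 31u));
}
```

For example, ror32(0x12345678, 28) and rol32(0x12345678, 4) both produce 0x23456781.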

For conversion of signed data from byte or half word to word, the Cortex-M3 provides the two instructions shown in Table 4.22. Both 16-bit and 32-bit versions are available. The 16-bit version can only access low registers.

Table 4.22. Sign Extend Instructions

Instruction   Operation
SXTB Rd, Rm   ; Rd = signext(Rm[7:0])   Sign extend byte data into word
SXTH Rd, Rm   ; Rd = signext(Rm[15:0])   Sign extend half word data into word
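In C, the same sign extension falls out of a cast to the narrower signed type and back; a sketch of the two operations:

```c
#include <assert.h>
#include <stdint.h>

/* C equivalents of SXTB and SXTH: the cast through the narrower
   signed type performs the sign extension. */
int32_t sxtb(uint32_t x) { return (int32_t)(int8_t)(x & 0xFFu); }
int32_t sxth(uint32_t x) { return (int32_t)(int16_t)(x & 0xFFFFu); }
```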

Another group of data processing instructions is used for reversing data bytes in a register (see Table 4.23). These instructions are commonly used for conversion between little endian and big endian data. See Figure 4.2. Both 16-bit and 32-bit versions are available. The 16-bit version can only access low registers.

Table 4.23. Data Reverse Ordering Instructions

Instruction   Operation
REV Rd, Rn   ; Rd = rev(Rn)   Reverse bytes in word
REV16 Rd, Rn   ; Rd = rev16(Rn)   Reverse bytes in each half word
REVSH Rd, Rn   ; Rd = revsh(Rn)   Reverse bytes in bottom half word and sign extend the result

FIGURE 4.2. Operation of Reverse instructions.
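The three reverse operations can be modeled in portable C as follows; this is a sketch of the semantics in Table 4.23, not the single-cycle hardware path:

```c
#include <assert.h>
#include <stdint.h>

/* REV: reverse all four bytes of a word. */
uint32_t rev(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* REV16: reverse the bytes within each half word independently. */
uint32_t rev16(uint32_t x)
{
    return ((x >> 8) & 0x00FF00FFu) | ((x << 8) & 0xFF00FF00u);
}

/* REVSH: reverse the bytes of the bottom half word, then sign
   extend the 16-bit result into the full word. */
int32_t revsh(uint32_t x)
{
    uint16_t h = (uint16_t)(((x & 0xFFu) << 8) | ((x >> 8) & 0xFFu));
    return (int32_t)(int16_t)h;
}
```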

The final group of data processing instructions is for bit field processing. They include the instructions shown in Table 4.24. Examples of these instructions are provided in a later part of this chapter.

Table 4.24. Bit Field Processing and Manipulation Instructions

Instruction   Operation
BFC.W Rd, #<lsb>, #<width>   Clear bit field within a register
BFI.W Rd, Rn, #<lsb>, #<width>   Insert bit field into a register
CLZ.W Rd, Rn   Count leading zeros
RBIT.W Rd, Rn   Reverse bit order in register
SBFX.W Rd, Rn, #<lsb>, #<width>   Copy bit field from source and sign extend it
UBFX.W Rd, Rn, #<lsb>, #<width>   Copy bit field from source register
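The extract and insert semantics of UBFX and BFI can be sketched in C (assuming 0 < width < 32 and lsb + width <= 32; the helper names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* UBFX model: extract <width> bits starting at <lsb>, zero extend. */
uint32_t ubfx(uint32_t x, unsigned lsb, unsigned width)
{
    return (x >> lsb) & ((1u << width) - 1u);
}

/* BFI model: insert the low <width> bits of src into dst at <lsb>. */
uint32_t bfi(uint32_t dst, uint32_t src, unsigned lsb, unsigned width)
{
    uint32_t mask = ((1u << width) - 1u) << lsb;
    return (dst & ~mask) | ((src << lsb) & mask);
}
```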

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9781856179638000077

The Linux/ARM embedded platform

Jason D. Bakos, in Embedded Systems, 2016

1.13 Basic ARM Instruction Set

This section provides a concise summary of a basic subset of the ARM instruction set. The information provided here is only enough to get you started writing basic ARM assembly programs, and does not include any specialized instructions, such as system instructions and those related to coprocessors. Note that in the following tables, the instruction mnemonics are shown in uppercase, but can be written in uppercase or lowercase.

1.13.1 Integer arithmetic instructions

Table 1.4 shows a list of integer arithmetic instructions. All of these support conditional execution, and all will update the status register when the S suffix is specified. Some of these instructions, those with "operand2", support the flexible second operand as described earlier in this chapter. This allows these instructions to take either a register, a shifted register, or an immediate as the second operand.

Table i.4. Integer Arithmetic Instructions

Instruction   Description   Functionality
ADC{S}{<cond>} Rd, Rn, operand2   Add with carry   R[Rd] = R[Rn] + operand2 + C flag
ADD{S}{<cond>} Rd, Rn, operand2   Add   R[Rd] = R[Rn] + operand2
MLA{S}{<cond>} Rd, Rn, Rm, Ra   Multiply-accumulate   R[Rd] = R[Rn] * R[Rm] + R[Ra]
MUL{S}{<cond>} Rd, Rn, Rm   Multiply   R[Rd] = R[Rn] * R[Rm]
RSB{S}{<cond>} Rd, Rn, operand2   Reverse subtract   R[Rd] = operand2 - R[Rn]
RSC{S}{<cond>} Rd, Rn, operand2   Reverse subtract with carry   R[Rd] = operand2 - R[Rn] - not(C flag)
SBC{S}{<cond>} Rd, Rn, operand2   Subtract with carry   R[Rd] = R[Rn] - operand2 - not(C flag)
SMLAL{S}{<cond>} RdLo, RdHi, Rn, Rm   Signed multiply accumulate long   R[RdHi] = upper32bits(R[Rn] * R[Rm]) + R[RdHi]
R[RdLo] = lower32bits(R[Rn] * R[Rm]) + R[RdLo]
SMULL{S}{<cond>} RdLo, RdHi, Rn, Rm   Signed multiply long   R[RdHi] = upper32bits(R[Rn] * R[Rm])
R[RdLo] = lower32bits(R[Rn] * R[Rm])
SUB{S}{<cond>} Rd, Rn, operand2   Subtract   R[Rd] = R[Rn] - operand2
UMLAL{S}{<cond>} RdLo, RdHi, Rn, Rm   Unsigned multiply accumulate long   R[RdHi] = upper32bits(R[Rn] * R[Rm]) + R[RdHi]
R[RdLo] = lower32bits(R[Rn] * R[Rm]) + R[RdLo]
UMULL{S}{<cond>} RdLo, RdHi, Rn, Rm   Unsigned multiply long   R[RdHi] = upper32bits(R[Rn] * R[Rm])
R[RdLo] = lower32bits(R[Rn] * R[Rm])

1.13.2 Bitwise logical instructions

Table 1.5 shows a list of bitwise logical instructions. All of these support conditional execution, all can update the flags when the S suffix is specified, and all support a flexible second operand.

Table 1.5. Integer Bitwise Logical Instructions

Instruction   Description   Functionality
AND{S}{<cond>} Rd, Rn, operand2   Bitwise AND   R[Rd] = R[Rn] & operand2
BIC{S}{<cond>} Rd, Rn, operand2   Bit clear   R[Rd] = R[Rn] & not operand2
EOR{S}{<cond>} Rd, Rn, operand2   Bitwise XOR   R[Rd] = R[Rn] ^ operand2
ORR{S}{<cond>} Rd, Rn, operand2   Bitwise OR   R[Rd] = R[Rn] | operand2

1.13.3 Shift instructions

Table 1.6 shows a list of shift instructions. All of these support conditional execution, and all can update the flags when the S suffix is specified, but note that these instructions do not support the flexible second operand.

Table 1.6. Shift and Rotate Instructions

Instruction   Description   Functionality
ASR{S}{<cond>} Rd, Rn, Rs/#sh   Arithmetic shift right   R[Rd] = (int)R[Rn] >> (R[Rs] or #sh)
allowed shift amount 1-32
LSR{S}{<cond>} Rd, Rn, Rs/#sh   Logical shift right   R[Rd] = (unsigned int)R[Rn] >> (R[Rs] or #sh)
allowed shift amount 1-32
LSL{S}{<cond>} Rd, Rn, Rs/#sh   Logical shift left   R[Rd] = R[Rn] << (R[Rs] or #sh)
allowed shift amount 0-31
ROR{S}{<cond>} Rd, Rn, Rs/#sh   Rotate right   R[Rd] = R[Rn] rotated by (R[Rs] or #sh) bits
allowed shift amount 1-31
RRX{S}{<cond>} Rd, Rm   Shift right by 1 bit
The old carry flag is shifted into R[Rd] bit 31
If used with the S suffix, the old bit 0 is placed in the carry flag
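RRX is the one entry whose semantics are easy to misread: it is a 33-bit rotate through the carry flag. A C model with the carry as an explicit in/out parameter (the function name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* RRX model: shift right by one, with the old carry entering at
   bit 31 and the old bit 0 becoming the new carry. */
uint32_t rrx(uint32_t x, unsigned *carry)
{
    uint32_t result = (x >> 1) | ((uint32_t)*carry << 31);
    *carry = x & 1u;
    return result;
}
```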

1.13.4 Move instructions

Table 1.7 shows a list of data movement instructions. Most useful of these is the MOV instruction, since its flexible second operand allows for loading immediates and register shifting.

Table 1.7. Data Move Instructions

Instruction   Description   Functionality
MOV{S}{<cond>} Rd, operand2   Move   R[Rd] = operand2
MRS{<cond>} Rd, CPSR   Move status register or saved status register to GPR   R[Rd] = CPSR
MRS{<cond>} Rd, SPSR   R[Rd] = SPSR
MSR{<cond>} CPSR_<fields>, #imm   Move to status register from ARM register   <fields> is one of:
_c, _x, _s, _f
MSR{<cond>} SPSR_<fields>, #imm
MSR{<cond>} CPSR_<fields>, Rm
MSR{<cond>} SPSR_<fields>, Rm
MVN{S}{<cond>} Rd, operand2   Move one's complement   R[Rd] = not operand2

1.13.5 Load and store instructions

Table 1.8 shows a list of load and store instructions. The LDR/STR instructions are ARM's bread-and-butter load and store instructions. The memory address can be specified using any of the addressing modes described earlier in this chapter.

Table 1.8. ARM Load and Store Instructions

Instruction   Description   Functionality
LDM{cond} <address mode> Rn{!}, <reg list in braces>   Load multiple   Loads multiple registers from consecutive words starting at R[Rn]
Bang (!) will autoincrement the base register
Address mode:
IA = increment after
IB = increment before
DA = decrement after
DB = decrement before
Example:
LDMIA r2!, {r3, r5-r7}
LDR{cond}{B|H|SB|SH} Rd, <address>   Load register   Loads from memory into Rd.
Optional size specifiers:
B = byte
H = halfword
SB = signed byte
SH = signed halfword
STM{cond} <address mode> Rn, <registers>   Store multiple   Stores multiple registers
Bang (!) will autoincrement the base register
Address mode:
IA = increment after
IB = increment before
DA = decrement after
DB = decrement before
Example:
STMIA r2!, {r3, r5-r7}
STR{cond}{B|H} Rd, <address>   Store register   Stores from Rd into memory.
Optional size specifiers:
B = byte
H = halfword
SWP{cond}{B} Rd, Rm, [Rn]   Swap   Swaps a word (or byte) between registers and memory

The LDR instruction can also be used to load symbols into base registers, e.g., "ldr r1,=data".

The LDM and STM instructions can load and store multiple registers and are often used for accessing the stack.

1.13.6 Comparison instructions

Table 1.9 lists comparison instructions. These instructions are used to set the status flags, which are used by conditional instructions, most often conditional branches.

Table 1.9. Comparison Instructions

Instruction   Description   Functionality
CMN{<cond>} Rn, Rm   Compare negative   Sets flags based on comparison between R[Rn] and −R[Rm]
CMP{<cond>} Rn, Rm   Compare   Sets flags based on comparison between R[Rn] and R[Rm]
TEQ{cond} Rn, Rm   Test equivalence   Tests for equivalence without affecting the V flag
TST{cond} Rn, Rm   Test   Performs a bitwise AND of two registers and updates the flags

1.13.7 Branch instructions

Table 1.10 lists two branch instructions. The BX (branch exchange) instruction is used when branching to register values, which is most often used for branching to the link register to return from functions. When using this instruction, the LSB of the target register specifies whether the processor will be in ARM mode or Thumb mode after the branch is taken.

Table 1.10. Branch Instructions

Instruction   Description   Functionality
B{L}{cond} <target>   Branch   Branches (and optionally links in register r14) to label
B{L}X{cond} Rm   Branch and exchange   Branches (and optionally links in register r14) to register. Bit 0 of the register specifies whether the instruction set mode will be standard or Thumb upon branching

1.13.8 Floating-point instructions

There are two types of floating-point instructions: the Vector Floating Point (VFP) instructions and the NEON instructions.

ARMv6 processors such as the Raspberry Pi (gen 1)'s ARM11 support only VFP instructions. Newer architectures such as ARMv7 support only NEON instructions. The most common floating-point operations map to both a VFP instruction and a NEON instruction. For example, the VFP instruction FADDS and the NEON instruction VADD.F32 (when used with s-registers) both perform a single precision floating point add.

The NEON instruction set is more extensive than the VFP instruction set, so while most VFP instructions have an equivalent NEON instruction, there are many NEON instructions that perform operations not possible with VFP instructions.

In order to describe floating point and single instruction, multiple data (SIMD) programming techniques that are applicable to both the ARM11 and ARM Cortex processors, this section and Chapter 2 will cover both VFP and NEON instructions.

Table 1.11 lists the VFP and NEON versions of commonly used floating-point instructions. Like the integer arithmetic instructions, most floating-point instructions support conditional execution, but there is a separate set of flags for floating-point instructions located in the 32-bit floating-point status and control register (FPSCR). NEON instructions use only bits 31 down to 27 of this register, while VFP instructions use additional bit fields.

Table 1.11. Floating-Point Instructions

VFP Instruction   Equivalent NEON Instruction   Description
FADD[S|D]{cond} Fd, Fn, Fm   VADD.[F32|F64] Fd, Fn, Fm   Single and double precision add
FSUB[S|D]{cond} Fd, Fn, Fm   VSUB.[F32|F64] Fd, Fn, Fm   Single and double precision subtract
FMUL[S|D]{cond} Fd, Fn, Fm   VMUL.[F32|F64] Fd, Fn, Fm   Single and double precision multiply and multiply-and-negate
FNMUL[S|D]{cond} Fd, Fn, Fm   VNMUL.[F32|F64] Fd, Fn, Fm
FDIV[S|D]{cond} Fd, Fn, Fm   VDIV.[F32|F64] Fd, Fn, Fm   Single and double precision divide
FABS[S|D]{cond} Fd, Fm   VABS.[F32|F64] Fd, Fm   Single and double precision absolute value
FNEG[S|D]{cond} Fd, Fm   VNEG.[F32|F64] Fd, Fm   Single and double precision negate
FSQRT[S|D]{cond} Fd, Fm   VSQRT.[F32|F64] Fd, Fm   Single and double precision square root
FCVTSD{cond} Fd, Fm   VCVT.F32.F64 Fd, Fm   Convert double precision to single precision
FCVTDS{cond} Fd, Fm   VCVT.F64.F32 Fd, Fm   Convert single precision to double precision
VCVT.[S|U][32|16].[F32|F64] Fd, Fm, #fbits   Convert floating point to fixed point
VCVT.[F32|F64].[S|U][32|16] Fd, Fm, #fbits   Convert fixed point to floating point
FMAC[S|D]{cond} Fd, Fn, Fm   VMLA.[F32|F64] Fd, Fn, Fm   Single and double precision floating point multiply-accumulate, calculates Fd = Fn * Fm + Fd
There are similar instructions that negate the contents of Fd, Fn, or both prior to use, for example, FNMSC[S|D], VNMLS[.F32|.F64]
FLD[S|D]{cond} Fd, <address>   VLDR{cond} Fd, <address>   Single and double precision floating point load/store
FST[S|D]{cond} Fd, <address>   VSTR{cond} Fd, <address>
FLDMI[S|D]{cond} <address>, <FPRegs>   VLDM{cond} Rn{!}, <FPRegs>   Single and double precision floating point load/store multiple
FSTMI[S|D]{cond} <address>, <FPRegs>   VSTM{cond} Rn{!}, <FPRegs>
FMRX{cond} Rd   FMRX Rd   Move from/to floating point status and control
FMXR{cond} Rm   FMXR Rm   register
FCPY[S|D]{cond} Fd, Fm   VMOV{cond} Fd, Fm   Copy floating point register

Floating-point instructions use a separate set of registers from integer instructions. ARMv6/VFP provides 32 floating-point registers, used as 32 individual single-precision registers named s0-s31 or as 16 double-precision registers named d0-d15.

ARMv7/NEON provides 64 floating-point registers, which can be used in many more ways, such as:

64 single-precision registers named s0-s63,

32 two-element single-precision registers named d0-d31,

16 four-element single-precision registers named q0-q15,

32 double-precision registers named d0-d31, and

16 two-element double-precision registers named q0-q15.

In both VFP and NEON, register d0 consumes the same physical space as registers s0 and s1, and register d1 consumes the same space as registers s2 and s3.
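This aliasing can be pictured as a C union: the double-precision member shares its storage with two single-precision members, just as d0 overlays s0 and s1. A sketch of the layout, not of how the hardware is implemented:

```c
#include <assert.h>

/* A double-precision slot (d) occupying the same storage as two
   single-precision slots (s[0], s[1]), mirroring the d0/s0/s1
   overlap in the VFP/NEON register bank. */
union fp_pair {
    float  s[2];   /* models s0, s1 */
    double d;      /* models d0     */
};
```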

Values in floating-point registers can be exchanged with general-purpose registers, and there is hardware support for type conversion between single precision, double precision, and integer.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128003428000018

Pixel Shader Reference

Ron Fosner, in Real-Time Shader Programming, 2003

Note:

If you used the D3DTOP_ADDSIGNED2X texture operation in one of your DirectX texture stages, the signed scaling modifier performs the same operation.

Rules for using signed source scaling:

For use only with arithmetic instructions.

Cannot be combined with the invert modifier.

Initial data outside the [0, 1] range may produce undefined results.

source scale 2X

PS 1.4 The scale by two modifier is used for shifting the range of the input register from the [0, 1] range to the [−1, +1] range, typically when you want to use the full signed range of which registers are capable. The scale by two modifier is indicated by adding a _x2 suffix to a register. Essentially, the modifier multiplies the register values by two before they are used. The source register values are unchanged.

Rules for using scale by 2:

For use only with arithmetic instructions.

Cannot be combined with the invert modifier.

Available for PS 1.4 shaders only.

source replication/selection

Just as vertex shaders allow you to select the particular elements of a source register to use, so do pixel shaders, with some differences. You can select only a single element, and that element will be replicated to all channels. You specify a channel to replicate by adding a .n suffix to the register, where n is r, g, b, or a (or x, y, z, or w).

SOURCE REGISTER SELECTORS
Register swizzle
PS version   .rrrr   .gggg   .bbbb   .aaaa   .gbra   .brga   .abgr
1.0   x
1.1   x   x
1.2   x   x
1.3   x   x
1.4 phase 1   x   x   x   x
1.4 phase 2   x   x   x   x
2.0   x   x   x   x   x   x   x

texture register modifiers ps 1.4 only

PS 1.4 has its own set of modifiers for texture instructions. Since only the texcrd and texld instructions are used to load or sample textures with PS 1.4, these modifiers are unique to those instructions. Note that you can interchange .rgba syntax with .xyzw syntax; thus -dz is the same as -db.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B978155860853550010X

Arithmetic optimization and the Linux Framebuffer

Jason D. Bakos, in Embedded Systems, 2016

3.7 Fixed-Point Performance

As compared to floating point, using fixed point reduces the latency after each arithmetic instruction at the cost of additional instructions required for rounding and radix point management, although if the overhead code contains sufficient instruction level parallelism, the impact of these additional instructions on throughput may not be substantial.

On the other hand, for graphics applications like the image transformation that require frequent conversions between floating point and integer, using fixed point may result in a reduction of executed instructions.

In fact, when compared to the floating-point implementation on the Raspberry Pi, the fixed-point implementation achieves approximately the same CPI and cache miss rate, but decreases the number of instructions per pixel from 225 to 160. This resulted in a throughput speedup of approximately 40%.
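A minimal Q16.16 sketch in C shows both halves of the trade-off: the extra widening multiply and shift a fixed-point product needs for radix point management, and the single-shift conversion to integer. The format choice and helper names are illustrative, not the book's implementation:

```c
#include <assert.h>
#include <stdint.h>

/* Q16.16 fixed point: 16 integer bits, 16 fraction bits. */
typedef int32_t q16_16;

q16_16 q_from_int(int32_t x) { return x << 16; }   /* single shift  */
int32_t q_to_int(q16_16 x)   { return x >> 16; }   /* single shift  */

/* Multiply: widen to 64 bits, then shift to restore the radix
   point; this shift is the extra "radix point management" work. */
q16_16 q_mul(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * b) >> 16);
}
```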

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128003428000031

Overview of Digital Signal Processing Algorithms

Robert Oshana, in DSP Software Development Techniques for Embedded and Real-Time Systems, 2006

Basic Software Implementation

The implementation of a FIR filter is straightforward; it's just a weighted moving average. Any processor with decent arithmetic instructions or a math library can perform the necessary computations. The real constraint is speed. Many general-purpose processors can't perform the calculations fast enough to generate real-time output from real-time input. This is why a DSP is used.

A dedicated hardware solution like a DSP has two major speed advantages over a general-purpose processor. A DSP has multiple arithmetic units, which can all be working in parallel on individual terms of the weighted average. A DSP architecture also has data paths that closely mirror the data movements used by the FIR filter. The delay line in a DSP automatically aligns the current window of samples with the appropriate coefficients, which increases throughput considerably. The results of the multiplications automatically flow to the accumulating adders, further increasing efficiency.

DSP architectures provide these optimizations and concurrency opportunities in a programmable processor. DSP processors have multiple arithmetic units that can be used in parallel, which closely mimics the parallelism in the filtering algorithm. These DSPs also tend to have special data movement operations. These operations can "shift" data among special purpose registers in the DSP. DSP processors almost always have special compound instructions (like a multiply and accumulate, or MAC, operation) that allow data to flow directly from a multiplier into an accumulator without explicit control intervention (Figure 4.17). This is why a DSP can perform one of these MAC operations in one clock cycle. A significant part of learning to use a particular DSP processor efficiently is learning how to exploit these special features.

Figure 4.17. DSPs have optimized MAC instructions to perform multiply and accumulate operations very quickly

In a DSP context, a "MAC" is the operation of multiplying a coefficient by the corresponding delayed data sample and accumulating the result. FIR filters usually require one MAC per tap.
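The per-tap MAC structure can be written out in C; a DSP would fold the multiply and the accumulate in each loop iteration into a single MAC instruction:

```c
#include <assert.h>

/* FIR inner loop: one multiply-accumulate per tap across the delay
   line of recent samples. */
float fir(const float *coeff, const float *delay, int taps)
{
    float acc = 0.0f;
    for (int i = 0; i < taps; i++)
        acc += coeff[i] * delay[i];   /* the MAC */
    return acc;
}
```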

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780750677592500065

Scalable parallel execution

Mark Ebersole, in Programming Massively Parallel Processors (Third Edition), 2017

3.7 Thread Scheduling and Latency Tolerance

Thread scheduling is strictly an implementation concept. Thus, it must be discussed in the context of specific hardware implementations. In the majority of implementations to date, a block assigned to an SM is further divided into 32-thread units called warps. The size of warps is implementation-specific. Warps are not part of the CUDA specification; however, knowledge of warps can be helpful in understanding and optimizing the performance of CUDA applications on particular generations of CUDA devices. The size of warps is a property of a CUDA device, which is in the warpSize field of the device query variable (dev_prop in this case).

The warp is the unit of thread scheduling in SMs. Fig. 3.13 shows the partition of blocks into warps in an implementation. Each warp consists of 32 threads of consecutive threadIdx values: threads 0 through 31 form the first warp, 32 through 63 the second warp, and so on. In this example, three blocks—Block 1, Block 2, and Block 3—are assigned to an SM. Each of the three blocks is further divided into warps for scheduling purposes.

Figure 3.13. Blocks are partitioned into warps for thread scheduling.
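The partitioning described above is just integer division by the warp size. A minimal C sketch of the mapping (the function names are ours; it assumes the 32-thread warp size discussed in the text):

```c
#define WARP_SIZE 32

/* Warp number of a thread within its block, given the thread's
 * linearized index: threads 0-31 form warp 0, 32-63 warp 1, etc. */
int warp_of(int thread_idx) { return thread_idx / WARP_SIZE; }

/* Position ("lane") of the thread within its warp. */
int lane_of(int thread_idx) { return thread_idx % WARP_SIZE; }
```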

We can calculate the number of warps that reside in an SM for a given block size and a given number of blocks assigned to each SM. In Fig. 3.13, if each block has 256 threads, we can determine that each block has 256/32 or 8 warps. With three blocks in each SM, we have 8 × 3 = 24 warps in each SM.
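This calculation generalizes directly. A small helper sketch (the function name is ours, and it assumes the block size is a multiple of the 32-thread warp size):

```c
#define WARP_SIZE 32

/* Number of warps resident in an SM, given the threads per block
 * and the number of blocks assigned to that SM. Assumes
 * threads_per_block is a multiple of WARP_SIZE. */
int warps_per_sm(int threads_per_block, int blocks_per_sm)
{
    int warps_per_block = threads_per_block / WARP_SIZE; /* e.g., 256/32 = 8 */
    return warps_per_block * blocks_per_sm;              /* e.g., 8 * 3 = 24 */
}
```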

An SM is designed to execute all threads in a warp following the Single Instruction, Multiple Data (SIMD) model—i.e., at any instant in time, one instruction is fetched and executed for all threads in the warp. This situation is illustrated in Fig. 3.13 with a single instruction fetch/dispatch shared among execution units (SPs) in the SM. These threads will apply the same instruction to different portions of the data. Consequently, all threads in a warp will always have the same execution timing.

Fig. 3.13 also shows a number of hardware Streaming Processors (SPs) that actually execute instructions. In general, there are fewer SPs than the threads assigned to each SM; i.e., each SM has only enough hardware to execute instructions from a small subset of all threads assigned to it at any point in time. In early GPU designs, each SM could execute only one instruction for a single warp at any given instant. In more recent designs, each SM can execute instructions for a small number of warps at any point in time. In either case, the hardware can execute instructions for only a small subset of all warps in the SM. A legitimate question is why we need so many warps in an SM if it can only execute a small subset of them at any instant. The answer is that this is how CUDA processors efficiently execute long-latency operations, such as global memory accesses.

When an instruction to be executed by a warp needs to wait for the result of a previously initiated long-latency operation, the warp is not selected for execution. Instead, another resident warp that is no longer waiting for results will be selected for execution. If more than one warp is ready for execution, a priority mechanism is used to select one for execution. This mechanism of filling the latency time of operations with work from other threads is often called "latency tolerance" or "latency hiding" (see "Latency Tolerance" sidebar).

Warp scheduling is also used for tolerating other types of operation latencies, such as pipelined floating-point arithmetic and branch instructions. Given a sufficient number of warps, the hardware will likely find a warp to execute at any point in time, thus making full use of the execution hardware in spite of these long-latency operations. The selection of ready warps for execution avoids introducing idle or wasted time into the execution timeline, which is referred to as zero-overhead thread scheduling. With warp scheduling, the long waiting time of warp instructions is "hidden" by executing instructions from other warps. This ability to tolerate long-latency operations is the main reason GPUs do not dedicate nearly as much chip area to cache memories and branch prediction mechanisms as CPUs do. Thus, GPUs can dedicate more of their chip area to floating-point execution resources.

Latency Tolerance

Latency tolerance is also needed in various everyday situations. For instance, in post offices, each person trying to ship a package should ideally have filled out all necessary forms and labels before going to the service counter. Instead, some people wait for the service desk clerk to tell them which form to fill out and how to fill it out.

When there is a long line in front of the service desk, the productivity of the service clerks has to be maximized. Letting a person fill out the form in front of the clerk while everyone else waits is not an efficient approach. The clerk should be serving the other customers who are waiting in line while that person fills out the form. These other customers are "ready to go" and should not be blocked by the customer who needs more time to fill out a form.

Thus, a good clerk would politely ask the first customer to step aside to fill out the form while the clerk serves other customers. In the majority of cases, the first customer will be served as soon as that customer completes the form and the clerk finishes serving the current customer, instead of that customer going to the end of the line.

We can think of these post office customers as warps and the clerk as a hardware execution unit. The customer who needs to fill out the form corresponds to a warp whose continued execution is dependent on a long-latency operation.

We are now ready for a simple exercise.3 Assume that a CUDA device allows up to 8 blocks and 1024 threads per SM, whichever becomes a limitation first. Furthermore, it allows up to 512 threads in each block. For image blur, should we use 8 × 8, 16 × 16, or 32 × 32 thread blocks? To answer the question, we can analyze the pros and cons of each choice. If we use 8 × 8 blocks, each block would have only 64 threads. We will need 1024/64 = 16 blocks to fully occupy an SM. However, each SM can only allow up to 8 blocks; thus, we will end up with only 64 × 8 = 512 threads in each SM. This limited number implies that the SM execution resources will likely be underutilized because fewer warps will be available to schedule around long-latency operations.

The 16 × 16 blocks result in 256 threads per block, implying that each SM can take 1024/256 = 4 blocks. This number is within the 8-block limitation and is a good configuration, as it will give us a full thread capacity in each SM and a maximal number of warps for scheduling around the long-latency operations. The 32 × 32 blocks would give 1024 threads in each block, which exceeds the 512-threads-per-block limitation of this device. Only 16 × 16 blocks allow a maximal number of threads assigned to each SM.
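The reasoning in this exercise can be checked mechanically. A sketch with a hypothetical function name, using exactly the limits stated above (8 blocks per SM, 1024 threads per SM, 512 threads per block):

```c
/* Resident threads per SM for a square thread block of the given
 * edge length, under the stated device limits. Returns 0 if the
 * block itself is illegal (too many threads per block). */
int resident_threads(int block_edge)
{
    const int max_blocks_per_sm   = 8;
    const int max_threads_per_sm  = 1024;
    const int max_threads_per_blk = 512;

    int threads_per_block = block_edge * block_edge;
    if (threads_per_block > max_threads_per_blk)
        return 0;                          /* configuration rejected */

    int blocks = max_threads_per_sm / threads_per_block;
    if (blocks > max_blocks_per_sm)
        blocks = max_blocks_per_sm;        /* block-count limit kicks in */
    return blocks * threads_per_block;
}
```

Running this for edges 8, 16, and 32 reproduces the analysis: 512 resident threads, 1024 resident threads, and an illegal configuration, respectively.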

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128119860000030

INTRODUCTION TO THE ARM INSTRUCTION SET

ANDREW N. SLOSS, ... CHRIS WRIGHT, in ARM System Developer's Guide, 2004

3.9 SUMMARY

In this chapter we covered the ARM instruction set. All ARM instructions are 32 bits in length. The arithmetic, logical, comparison, and move instructions can all employ the inline barrel shifter, which preprocesses the second register Rm before it enters the ALU.
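The effect of the barrel shifter on the second operand can be illustrated in plain C. For example, the ARM instruction `ADD r0, r1, r2, LSL #2` computes the equivalent of the following (a C sketch, not ARM code; the function name is ours):

```c
#include <stdint.h>

/* Equivalent of ADD r0, r1, r2, LSL #2: the second operand (r2)
 * is shifted left by 2 by the barrel shifter before the ALU adds,
 * all within a single instruction. */
uint32_t add_lsl2(uint32_t r1, uint32_t r2)
{
    return r1 + (r2 << 2);   /* r2 is pre-scaled by 4 */
}
```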

The ARM instruction set has three types of load-store instructions: single-register load-store, multiple-register load-store, and swap. The multiple load-store instructions provide the push-pop operations on the stack. The ARM-Thumb Procedure Call Standard (ATPCS) defines the stack as being a full descending stack.
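A full descending stack means the stack pointer addresses the last occupied slot and the stack grows toward lower addresses: a push first decrements the pointer, then stores. A minimal C sketch of that convention (the array-backed stack and function names are ours, for illustration only):

```c
#include <stdint.h>

#define STACK_WORDS 64

static uint32_t stack_mem[STACK_WORDS];

/* Full descending: sp points at the last occupied slot, so an
 * empty stack starts one past the top of the region. */
static uint32_t *sp = stack_mem + STACK_WORDS;

void push(uint32_t value) { *--sp = value; }  /* decrement, then store */
uint32_t pop(void)        { return *sp++; }   /* load, then increment */
```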

The software interrupt instruction causes a software interrupt that forces the processor into SVC mode; this instruction invokes privileged operating system routines. The program status register instructions write and read the cpsr and spsr. There are also special pseudoinstructions that optimize the loading of 32-bit constants.

The ARMv5E extensions include count leading zeros, saturation, and improved multiply instructions. The count leading zeros instruction counts the number of binary zeros before the first binary one. Saturation handles arithmetic calculations that overflow a 32-bit integer value. The improved multiply instructions provide better flexibility in multiplying 16-bit values.
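The behavior of the count leading zeros instruction can be modeled in portable C (a sketch; a real ARMv5E core performs this in a single CLZ instruction):

```c
#include <stdint.h>

/* Count the number of zero bits above the most significant one bit
 * of x. By ARM's convention, an input of 0 yields 32. */
int clz32(uint32_t x)
{
    if (x == 0)
        return 32;
    int n = 0;
    while ((x & 0x80000000u) == 0) {  /* top bit still clear? */
        x <<= 1;
        ++n;
    }
    return n;
}
```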

Most ARM instructions can be conditionally executed, which can dramatically reduce the number of instructions required to implement a specific algorithm.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9781558608740500046

Smarter systems and the PIC 18F2420

Tim Wilmshurst, in Designing Embedded Systems with PIC Microcontrollers (Second Edition), 2010

New instructions

Finally, there are many instructions that are simply new. These derive in many cases from enhanced hardware or memory addressing techniques. Significant among the arithmetic instructions is the multiply, available as mulwf (multiply W and f) and mullw (multiply W and literal). These invoke the hardware multiplier, seen already in Figure 13.2. Multiplier and multiplicand are viewed as unsigned, and the result is placed in the registers PRODH and PRODL. It is worth noting that the multiply instructions cause no change to the Status flags, even though a zero result is possible.
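The effect of mulwf can be modeled in C: an 8 × 8 unsigned multiply whose 16-bit product lands in the PRODH:PRODL register pair (a sketch; everything beyond the PRODH/PRODL names is illustrative, and no status flags are modeled because the instruction does not affect them):

```c
#include <stdint.h>

/* Product register pair written by the hardware multiplier. */
static uint8_t PRODH, PRODL;

/* Model of mulwf: multiply the unsigned 8-bit W register by an
 * 8-bit file-register operand; the 16-bit result goes to
 * PRODH:PRODL. The Status flags are NOT affected. */
void mulwf_model(uint8_t w, uint8_t f)
{
    uint16_t product = (uint16_t)w * f;
    PRODH = (uint8_t)(product >> 8);   /* high byte of the product */
    PRODL = (uint8_t)(product & 0xFF); /* low byte of the product  */
}
```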

Other important additions to the instruction set are a whole block of Table Read and Write instructions, data transfer to and from the Stack, and a good choice of conditional branch instructions, which build upon the increased number of status flags in the Status register. There are also instructions that contribute to conditional branching. These include the group of compares, for instance cpfseq, and the test instruction, tstfsz.

A useful new move instruction is movff, which gives a direct move from one memory location to another. It codes in two words and takes two cycles to execute. Therefore, its advantage over the two 16 Series instructions that it replaces may seem slight. It does, however, save the value of the W register from being overwritten.

Some of these new instructions will be explored in the program example and exercises of Section 13.10.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9781856177504100174

Using CUDA in Practice

Shane Cook, in CUDA Programming, 2013

Memory versus operations tradeoff

With most algorithms it is possible to trade an increased memory footprint for a decreased execution time. Whether the trade pays off depends significantly on the speed of memory versus the cost and number of arithmetic instructions being traded away.

There are implementations of AES that simply expand the substitution, shift rows, and mix columns operations into a series of table lookups. With a 32-bit processor, this requires a 4 K constant table and a small number of lookup and bitwise operations. Provided the 4 K lookup table remains in the cache, the execution time is greatly reduced using such a method on most processors. We will, however, at least initially implement the full algorithm before we look at this type of optimization.
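The flavor of this tradeoff can be shown on a much smaller scale with AES's xtime operation (multiplication by 2 in GF(2^8)): it can either be computed with a shift and conditional XOR, or read from a 256-byte precomputed table. This is the same idea as the 4 K lookup tables, just miniaturized (a sketch; the function and table names are ours):

```c
#include <stdint.h>

/* Computed form: multiply by 2 in GF(2^8), reducing by the AES
 * polynomial (0x1B) when the high bit shifts out. */
uint8_t xtime_compute(uint8_t b)
{
    return (uint8_t)((b << 1) ^ ((b & 0x80) ? 0x1B : 0x00));
}

/* Table form: spend 256 bytes of memory to replace the
 * shift/branch/XOR with a single lookup. */
static uint8_t xtime_tab[256];

void xtime_init(void)
{
    for (int i = 0; i < 256; ++i)
        xtime_tab[i] = xtime_compute((uint8_t)i);
}

uint8_t xtime_lookup(uint8_t b) { return xtime_tab[b]; }
```

Whether the lookup version is actually faster depends on the table staying cache-resident, which is exactly the caveat made above for the full 4 K tables.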

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780124159334000077