LINUX KERNEL INTERNALS: ARM Architecture

ARM is a spin-off from the UK-based company Acorn Computers.
Its name has changed from Acorn RISC Machine to Advanced RISC Machine and is now simply ARM.
ARM doesn't produce silicon; it only provides RISC processor core designs (physical IP).
There are two categories of ARM Cortex processors: Embedded Cortex processors and Application Cortex processors.
Embedded Cortex processors consist of microcontroller processors (Cortex-M) and real-time processors (Cortex-R).
Application Cortex processors consist of the 32-bit ARM Cortex-A5, ARM Cortex-A7, ARM Cortex-A8, ARM Cortex-A9, ARM Cortex-A12, ARM Cortex-A15 and ARM Cortex-A17 MPCore. They are meant to run a platform OS (such as Linux), and they have an extended instruction set and proper memory management. Newer Cortex-A series processors support multicore.
The ARM Cortex-M is a group of 32-bit RISC ARM processor cores licensed by ARM Holdings. The cores are intended for microcontroller use, and consist of the Cortex-M0, Cortex-M0+, Cortex-M1, Cortex-M3, and Cortex-M4.
The ARM Cortex-R is a group of 32-bit RISC ARM processor cores licensed by ARM Holdings. The cores are intended for robust real-time use, and consist of the Cortex-R4, Cortex-R5, and Cortex-R7.

Von Neumann and Harvard Architecture

Before we move ahead we must look at these two architectures, as some ARM cores use one and some use the other.

In the Von Neumann architecture, instructions (the program) and data both reside in the same memory, whereas in the Harvard architecture they reside separately. In the Von Neumann architecture we can bring either an instruction or data into the cache (from where it ultimately goes to the ALU) in one clock cycle, whereas in the Harvard architecture we can bring both an instruction and data into their respective caches in the same clock cycle. This is possible because, during idle time, the address and data buses keep filling the caches, so that when the processing unit requests something it can be supplied immediately. ARM7 uses the Von Neumann architecture, whereas ARM9 uses the Harvard architecture.
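
The difference can be illustrated with a deliberately simplified cycle count. This is only a sketch: it assumes every instruction fetch and every data access costs exactly one cycle and ignores caches and wait states, and the function name `fetch_cycles` is made up for this example.

```python
def fetch_cycles(num_instructions, data_accesses, harvard):
    """Count memory cycles under a toy model (one cycle per access)."""
    if harvard:
        # Separate instruction and data buses: a data access can overlap
        # with an instruction fetch, so the longer stream sets the total.
        return max(num_instructions, data_accesses)
    # Single shared bus: instruction fetches and data accesses take turns.
    return num_instructions + data_accesses


print(fetch_cycles(100, 40, harvard=False))  # 140
print(fetch_cycles(100, 40, harvard=True))   # 100
```

Real cores differ in many details, but the sketch shows why a shared bus becomes a bottleneck as soon as a program touches data often.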

Snooping

Snooping is a cache coherency technique used in ARM systems with multiple cores and thus multiple caches. I will first explain the problem that can occur because of this setup.

Suppose the same data, say X, from main memory is present in two caches (of different cores, say core 1 and core 4). Suppose core 1 performs some operation on X and it gets modified to Y. The cache of core 4 will still hold X, i.e. it doesn't have the updated data. So, if core 4 needs to perform some operation, it will do it on the old data, and this is undesirable. With snooping, each cache watches the data being requested from memory by the other caches. If the request is for a memory location that another cache already holds, a cache hit happens in that cache and it transfers the data to the cache that is requesting it from memory.
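
Here is a small Python sketch of the stale-data problem and one way snooping fixes it. Note that it uses a write-invalidate scheme with write-through to memory, which is a simpler variant than the cache-to-cache transfer described above; the `Bus` and `Cache` classes are hypothetical and only model the idea, not any real ARM hardware.

```python
class Bus:
    """Shared bus that every cache watches (snoops)."""

    def __init__(self, memory):
        self.memory = memory
        self.caches = []

    def broadcast_write(self, writer, addr):
        # Every other cache sees the write and drops its stale copy.
        for cache in self.caches:
            if cache is not writer:
                cache.lines.pop(addr, None)


class Cache:
    def __init__(self, bus):
        self.lines = {}
        self.bus = bus
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:      # miss: fetch from main memory
            self.lines[addr] = self.bus.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.bus.memory[addr] = value   # write-through, for simplicity
        self.bus.broadcast_write(self, addr)


memory = {0x100: "X"}
bus = Bus(memory)
core1, core4 = Cache(bus), Cache(bus)
core1.read(0x100)
core4.read(0x100)          # both caches now hold X
core1.write(0x100, "Y")    # core 4's copy is invalidated by the snoop
print(core4.read(0x100))   # Y
```

Without the `broadcast_write` call, the last line would still print the old value X, which is exactly the problem described above.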

Fast Burst Access

The memory is organised as a two-dimensional array of cells, so each cell is selected by a row address and a column address. If consecutive addresses lie in the same row, then out of the 32 bits of the address only 16 bits have to change and the other 16 bits stay constant.

This makes access faster, and it is called fast burst mode.
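
As a rough illustration, the 16/16 split can be written in Python as follows. Which half of the address selects the row is an assumption made here (upper 16 bits for the row, lower 16 bits for the column); real memory controllers use different mappings, and the helper names are made up.

```python
ROW_SHIFT = 16        # assumption: upper 16 bits select the row
COLUMN_MASK = 0xFFFF  # assumption: lower 16 bits select the column


def split_address(addr):
    """Return (row, column) for a 32-bit address."""
    return (addr >> ROW_SHIFT) & 0xFFFF, addr & COLUMN_MASK


def same_row(a, b):
    """True if both addresses can be served without changing the row."""
    return split_address(a)[0] == split_address(b)[0]


print(same_row(0x12340000, 0x1234FFFF))  # True: only the column changes
print(same_row(0x12340000, 0x12350000))  # False: a new row must be opened
```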

Static Memory and Dynamic Memory

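Static memories do not need to be refreshed, whereas dynamic memories store each bit as charge on a capacitor and must be refreshed periodically (typically every few tens of milliseconds), otherwise they lose their data. To store the same amount of data, static memory uses several times more transistors than dynamic memory: a typical SRAM cell uses six transistors per bit, while a DRAM cell uses one transistor and one capacitor.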

ARM 7 Architecture

It has 37 registers (31 general-purpose registers and 6 status registers). The barrel shifter can multiply the second operand by a power of two by shifting its bits before it reaches the ALU. It also has a 32x8-bit multiplier for multiplication operations.
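
To see what the barrel shifter buys us, here is a small Python model of an instruction such as `ADD Rd, Rn, Rm, LSL #2`. It is only a sketch of the arithmetic (32-bit wrap-around included); the function names are invented for this example.

```python
MASK32 = 0xFFFFFFFF


def lsl(value, amount):
    """Logical shift left, truncated to 32 bits like an ARM register."""
    return (value << amount) & MASK32


def add_shifted(rn, rm, shift):
    """Model of ADD Rd, Rn, Rm, LSL #shift."""
    return (rn + lsl(rm, shift)) & MASK32


r1 = 7
print(add_shifted(r1, r1, 2))  # 35, i.e. r1 * 5 without using the multiplier
```

Since the shift is applied to the second operand as part of the same instruction, multiplying by constants such as 5 or 9 needs a single ADD.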

ARM Register Sets

There are 37 registers in total. When the core enters Abort mode from User mode, the banked registers of Abort mode are swapped in and the corresponding User mode registers are swapped out. The contents of the CPSR are copied into the SPSR, and when the mode is exited the contents of the SPSR are copied back into the CPSR.
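
The CPSR/SPSR hand-over can be sketched in Python like this. The `Core` class is hypothetical and heavily simplified: it only models the copy of the CPSR into the SPSR of the new mode, the banked link register, and the change of the mode field, and leaves out everything else that happens on exception entry.

```python
USER = 0b10000
ABORT = 0b10111
MODE_MASK = 0x1F


class Core:
    def __init__(self):
        self.cpsr = USER   # start in User mode
        self.spsr = {}     # one SPSR per exception mode
        self.lr = {}       # one banked link register per exception mode

    def enter_exception(self, mode_bits, return_address):
        self.spsr[mode_bits] = self.cpsr          # CPSR -> SPSR_<mode>
        self.lr[mode_bits] = return_address       # banked R14 of the new mode
        self.cpsr = (self.cpsr & ~MODE_MASK) | mode_bits

    def return_from_exception(self):
        mode_bits = self.cpsr & MODE_MASK
        return_address = self.lr[mode_bits]
        self.cpsr = self.spsr[mode_bits]          # SPSR_<mode> -> CPSR
        return return_address


core = Core()
core.enter_exception(ABORT, return_address=0x8000)
print(bin(core.cpsr & MODE_MASK))         # 0b10111 (Abort mode)
print(hex(core.return_from_exception()))  # 0x8000, and CPSR is User again
```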

Explanation of Modes

FIQ is used to handle very fast (very high priority) interrupts, and we don't want the current registers to be disturbed, so it uses a separate set of registers.

PSR (Program Status Register): CPSR and SPSR

M[4:0]	Mode	Accessible registers	Status registers
10000	User	PC, R14..R0	CPSR
10001	FIQ	PC, R14_fiq..R8_fiq, R7..R0	CPSR, SPSR_fiq
10010	IRQ	PC, R14_irq..R13_irq, R12..R0	CPSR, SPSR_irq
10011	Supervisor	PC, R14_svc..R13_svc, R12..R0	CPSR, SPSR_svc
10111	Abort	PC, R14_abt..R13_abt, R12..R0	CPSR, SPSR_abt
11011	Undefined	PC, R14_und..R13_und, R12..R0	CPSR, SPSR_und
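
The mode field in the table above sits in the lowest five bits of the PSR, so it can be decoded with a simple mask. A minimal Python sketch, covering only the modes listed in the table (the name `mode_name` is made up):

```python
MODES = {
    0b10000: "User",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "Supervisor",
    0b10111: "Abort",
    0b11011: "Undefined",
}


def mode_name(psr):
    """Decode M[4:0] of a CPSR/SPSR value using the table above."""
    return MODES.get(psr & 0x1F, "not in table")


print(mode_name(0x000000D3))  # Supervisor
print(mode_name(0x00000010))  # User
```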

Jazelle

Jazelle DBX (Direct Bytecode eXecution) is a technique that allows Java Bytecode to be executed directly in the ARM architecture as a third execution state (and instruction set) alongside the existing ARM and Thumb-mode. Support for this state is signified by the "J" in the ARMv5TEJ architecture, and in ARM9EJ-S and ARM7EJ-S core names. Support for this state is required starting in ARMv6 (except for the ARMv7-M profile), though newer cores only include a trivial implementation that provides no hardware acceleration.

Thumb and Thumb2 Instructions

To improve compiled code density, processors since the ARM7TDMI (released in 1994) have featured the Thumb instruction set, which has its own state. (The "T" in "TDMI" indicates the Thumb feature.) When in this state, the processor executes the Thumb instruction set, a compact 16-bit encoding for a subset of the ARM instruction set. Most of the Thumb instructions are directly mapped to normal ARM instructions. The space saving comes from making some of the instruction operands implicit and limiting the number of possibilities compared to the ARM instructions executed in the ARM instruction set state.

In Thumb, the 16-bit opcodes have less functionality. For example, only branches can be conditional, and many opcodes are restricted to accessing only half of all of the CPU's general-purpose registers. The shorter opcodes give improved code density overall, even though some operations require extra instructions. In situations where the memory port or bus width is constrained to less than 32 bits, the shorter Thumb opcodes allow increased performance compared with 32-bit ARM code, as less program code may need to be loaded into the processor over the constrained memory bandwidth.

Embedded hardware, such as the Game Boy Advance, typically has a small amount of RAM accessible with a full 32-bit datapath; the majority is accessed via a 16-bit or narrower secondary datapath. In this situation, it usually makes sense to compile Thumb code and hand-optimise a few of the most CPU-intensive sections using full 32-bit ARM instructions, placing these wider instructions into the memory accessible over the 32-bit bus.

Thumb-2 technology was introduced in the ARM1156 core, announced in 2003. Thumb-2 extends the limited 16-bit instruction set of Thumb with additional 32-bit instructions to give the instruction set more breadth, thus producing a variable-length instruction set. A stated aim for Thumb-2 was to achieve code density similar to Thumb with performance similar to the ARM instruction set on 32-bit memory.
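
Because Thumb-2 mixes 16-bit and 32-bit instructions, a decoder has to look at the first halfword to know how long an instruction is. In the Thumb-2 encoding, a halfword whose top five bits are 11101, 11110 or 11111 is the first half of a 32-bit instruction; anything else is a complete 16-bit instruction. A small Python sketch of that check (the function name is made up):

```python
def thumb_instruction_length(first_halfword):
    """Return the instruction length in bytes (2 or 4) for Thumb-2 code."""
    top5 = (first_halfword >> 11) & 0x1F
    return 4 if top5 in (0b11101, 0b11110, 0b11111) else 2


print(thumb_instruction_length(0x4770))  # 2 (BX LR, a 16-bit instruction)
print(thumb_instruction_length(0xF000))  # 4 (first half of a 32-bit BL)
```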

Instruction Format

Saturated Math

Priority of Exceptions

1. Reset
2. Data Abort
3. FIQ
4. IRQ
5. Prefetch Abort
6. Undefined

The I and F bits are part of the CPSR. If the I or F bit is set, it means IRQ or FIQ, respectively, is disabled. For example, on reset both the F and I bits are set, which means both FIQ and IRQ are disabled.
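
In the CPSR, I is bit 7 and F is bit 6, so checking them is a matter of masking. A minimal Python sketch (helper names are made up); the value 0xD3 is the CPSR right after reset, i.e. Supervisor mode with both bits set:

```python
I_BIT = 1 << 7   # set -> IRQ disabled
F_BIT = 1 << 6   # set -> FIQ disabled


def irq_enabled(cpsr):
    return not (cpsr & I_BIT)


def fiq_enabled(cpsr):
    return not (cpsr & F_BIT)


RESET_CPSR = 0xD3  # I = 1, F = 1, mode = 10011 (Supervisor)
print(irq_enabled(RESET_CPSR), fiq_enabled(RESET_CPSR))  # False False
print(irq_enabled(0x10), fiq_enabled(0x10))              # True True
```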

Undefined instruction and SWI have the same priority, as both cannot occur at the same time.
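
The priority order above can be expressed as a simple lookup: when several exceptions are pending at once, the one that appears first in the list is taken. This Python sketch only models the ordering; the names and the `next_exception` helper are made up for illustration.

```python
# Highest priority first, following the list above.
PRIORITY = [
    "Reset",
    "Data Abort",
    "FIQ",
    "IRQ",
    "Prefetch Abort",
    "Undefined/SWI",
]


def next_exception(pending):
    """Return the highest-priority pending exception, or None."""
    for name in PRIORITY:
        if name in pending:
            return name
    return None


print(next_exception({"IRQ", "Data Abort"}))      # Data Abort
print(next_exception({"Prefetch Abort", "FIQ"}))  # FIQ
```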

Some key points to be noted:

A prefetch abort happens during instruction fetch.
The PC tracks the fetch stage of the fetch, decode and execute cycle.
Exceptions can be handled only in ARM state and not in Thumb state.
Thumb only sees R0-R7, though there are some provisions to use registers beyond that.
Since there is only one LR, nested branching is not possible unless the LR is saved first.
The cache is organised in 16-bit units, so we need to take the full 16 bits even for a single change.
The SWI instruction has a 24-bit field for the SWI number.
Thumb-2 is a superset of Thumb that offers most of the functionality of ARM.
Cortex-A has an MMU.
Cortex-R has an MPU (a subset of an MMU).
Cortex-M has an optional MPU.
Power (P) is proportional to clock frequency.
A cache can use the logical or the physical address space.
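
To illustrate the 24-bit SWI field mentioned above: in the 32-bit ARM encoding, bits 27..24 of a SWI instruction are all ones and the low 24 bits carry the number that the handler reads. A small Python sketch of the extraction (the function name is made up):

```python
def swi_number(instruction):
    """Return the 24-bit field of a 32-bit ARM SWI instruction."""
    if (instruction >> 24) & 0xF != 0xF:
        raise ValueError("not a SWI instruction")
    return instruction & 0x00FFFFFF


print(hex(swi_number(0xEF000011)))  # 0x11
```

A SWI handler typically does the same thing in assembly: it loads the instruction that caused the exception and masks off the top 8 bits.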

For any clarifications, or if you have any suggestions, please post comments below.

Wednesday, June 25, 2014