
Embedded System Programming for ARM, Linux and Multi-Processing

ARM SoC Family:

What’s in the ARM Cortex-A Chip Family?

  • ARMv7
    • Cortex-A5, A7, A8, A9, A12, A15, A17
  • ARMv8
    • 32-bit: Cortex-A32
    • 64-bit: Cortex-A35, A53, A57, A72, A73

What is ARM big.LITTLE architecture?

  • It’s a heterogeneous computing architecture
  • Introduced with the Cortex-A7, which is architecturally compatible with the Cortex-A15
  • Saves power by coupling relatively slower, power-efficient processor cores (LITTLE) with relatively more powerful, power-hungry ones (big)
  • Typically, only one “side” or the other will be active at once, but since all the cores have access to the same memory regions, workloads can be swapped between big and LITTLE cores on the fly.
  • The intention is to create a multi-core processor that can adjust better to dynamic computing needs and use less power than clock scaling alone.
  • ARM’s marketing material promises up to a 75% savings in power usage for some activities.

How is software scheduled between big and LITTLE cores?

  • There are three ways for the different cores to be arranged in a big.LITTLE design, depending on the scheduler implemented in the kernel:
    • Clustered switching:
      • The clustered model is the first and simplest implementation, arranging the processor into identically sized clusters of “big” or “LITTLE” cores.
      • The operating system scheduler can only see one cluster at a time; when the load on the whole processor changes between low and high, the system transitions to the other cluster.
      • This model has been implemented in the Samsung Exynos 5 Octa (5410).
    • In-kernel switcher:
      • CPU migration via the in-kernel switcher (IKS) involves pairing up a “big” core with a “LITTLE” core.
      • Each pair operates as one virtual core, and only one real core is (fully) powered up and running at a time.
      • The “big” core is used when the demand is high and the “LITTLE” core is employed when demand is low.
      • A more complex arrangement involves a non-symmetric grouping of “big” and “LITTLE” cores: a single chip could have one or two “big” cores and many more “LITTLE” cores, or vice versa.
    • Heterogeneous multi-processing:
      • The most powerful use model of the big.LITTLE architecture
      • Enables the use of all physical cores at the same time
      • This model has been implemented in Samsung Exynos chips starting with the Exynos 5 Octa series (5420, 5422, 5430)

ARM Cache Architecture

  • Modified Harvard architecture:
    • multiple levels of cache
    • separate I- and D-caches at level 1 for each CPU core
    • controlled via attributes in the page tables
    • the L2 cache serves a CPU cluster
    • the L3 system cache serves all CPU cores
  • 2-way set-associative
    • Set/Way operations are local to a CPU
    • Set/Way operations are impossible to virtualize
  • Mostly invisible to software, except for:
    • executable code loading
    • DMA with non-cache-coherent devices
      • requires the usual Clean, Invalidate, or both (see the sketch after this list)
    • DMA with cache-coherent devices when CPU caches are disabled
      • still requires Clean and Invalidate, since the cache hardware is not really disabled
    • conflicting memory attributes
      • e.g., writing through a non-cacheable mapping and reading through a cacheable one
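
To make the Clean/Invalidate requirement concrete, here is a minimal sketch of non-coherent DMA handling using the Linux streaming DMA mapping API, which performs the required cache maintenance on behalf of the driver; the surrounding driver and the dev, buf, and len values are assumed context, and rx_one_buffer is an illustrative name.

#include <linux/dma-mapping.h>
#include <linux/errno.h>

/* Receive one buffer from a non-cache-coherent device. */
static int rx_one_buffer(struct device *dev, void *buf, size_t len)
{
    dma_addr_t handle;

    /* Map for DMA; for DMA_FROM_DEVICE this invalidates the CPU cache
     * lines covering buf, so stale cached data is not read later. */
    handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
    if (dma_mapping_error(dev, handle))
        return -ENOMEM;

    /* ... program the device to DMA into 'handle', wait for completion ... */

    /* Unmap; ownership of the buffer returns to the CPU. */
    dma_unmap_single(dev, handle, len, DMA_FROM_DEVICE);
    return 0;
}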

How does the cache work under virtualization?

  • Stage-2 translation
  • Virtual machines add complexity with:
    • a second stage of page tables
    • a second set of memory attributes
    • KVM always configures RAM as cacheable at Stage-2
  • Rules:
    • the strongest memory type wins: Device vs. Normal memory
    • the least cacheable wins: Non-cacheable is always enforced
    • the hypervisor doesn’t have much control; there is no fine-grained control

Virtualization

How Does ARM Virtualization Work?

  • In full virtualization, the virtual machine simulates enough hardware to allow an unmodified “guest” OS (one designed for the same instruction set) to be run in isolation.
  • In hardware-assisted virtualization, the hardware provides architectural support that facilitates building a virtual machine monitor and allows guest OSes to be run in isolation
  • KVM and Xen are two popular hypervisors on both ARM and x86.
  • ARM hardware support for virtualization enables much faster transitions between VMs and the hypervisor.
  • Current hypervisor designs, including both Type 1 hypervisors such as Xen and Type 2 hypervisors such as KVM, are not able to fully leverage this performance benefit for real application workloads.

Hypervisors

  • Type 1 hypervisors, like Xen, comprise a separate hypervisor software component, which runs directly on the hardware and provides a virtual machine abstraction to VMs running on top of the hypervisor.
  • Type 2 hypervisors, like KVM, run an existing OS on the hardware and run both VMs and applications on top of the OS.
  • Type 2 hypervisors typically modify the existing OS to facilitate running of VMs, either by integrating the Virtual Machine Monitor (VMM) into the existing OS source code base, or by installing the VMM as a driver into the OS.
  • KVM integrates directly with Linux, whereas other solutions such as VMware Workstation use a loadable driver in the existing OS kernel to monitor virtual machines.
  • The OS integrated with a Type 2 hypervisor is commonly referred to as the host OS, as opposed to the guest OS which runs in a VM.
  • Support for virtualization requires memory protection (in the form of a memory management unit or at least a memory protection unit), which rules out most microcontrollers.
  • The performance advantages of paravirtualization usually make it the virtualization technology of choice.
  • ARM and MIPS have recently added full virtualization support as an IP option and have included it in their latest high-end processors and architecture versions, such as the ARM Cortex-A15 MPCore and ARMv8 EL2.

What is Kernel-based Virtual Machine (KVM)?

  • KVM is a virtualization module in the Linux kernel that lets the kernel itself act as a hypervisor.
  • It relies on hardware virtualization extensions such as Intel VT-x, AMD-V, or the ARM Virtualization Extensions.

Operating-System Level Virtualization

  • A server virtualization method in which the kernel of an operating system allows the existence of multiple isolated user-space instances, instead of just one.
  • Such instances are sometimes called containers, software containers, virtualization engines (VEs) or jails (FreeBSD jail or chroot jail)
  • A container looks and feels like a real server from the point of view of its owners and users.
  • On Unix-like operating systems, this technology can be seen as an advanced implementation of the standard chroot mechanism (see the sketch after this list).
  • The kernel often provides resource-management features to limit the impact of one container’s activities on other containers.
  • Popular implementations include Docker, LXC, etc.
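
As a minimal sketch of the chroot idea mentioned above: confine a child shell to a directory subtree. The jail path /srv/jail is a hypothetical example, the program must run as root, and real container runtimes such as Docker and LXC add namespaces and cgroups on top of this basic mechanism.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    if (chroot("/srv/jail") != 0) {   /* requires root and a populated jail */
        perror("chroot");
        return EXIT_FAILURE;
    }
    if (chdir("/") != 0) {            /* avoid escaping via a stale cwd */
        perror("chdir");
        return EXIT_FAILURE;
    }
    execl("/bin/sh", "sh", (char *)NULL); /* the shell sees the jail as / */
    perror("execl");
    return EXIT_FAILURE;
}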

What is paravirtualization?

In partial virtualization, including address space virtualization, the virtual machine simulates multiple instances of much of an underlying hardware environment, particularly address spaces. Usually, this means that entire operating systems cannot run in the virtual machine—which would be the sign of full virtualization—but that many applications can run. A key form of partial virtualization is address space virtualization, in which each virtual machine consists of an independent address space. This capability requires address relocation hardware, and has been present in most practical examples of partial virtualization.
In paravirtualization, the virtual machine does not necessarily simulate hardware, but instead (or in addition) offers a special API that can only be used by modifying the “guest” OS. For this to be possible, the “guest” OS’s source code must be available.
  • A virtualization technique that presents a software interface to virtual machines that is similar, but not identical, to that of the underlying hardware.
  • Provides modified interface to reduce the portion of the guest’s execution time spent performing operations which are substantially more difficult to run in a virtual environment compared to a non-virtualized environment.
  • Provides specially defined ‘hooks’ to allow the guest(s) and host to request and acknowledge these tasks, which would otherwise be executed in the virtual domain (where execution performance is worse).
  • A successful paravirtualized platform may allow the virtual machine monitor (VMM) to be simpler (by relocating execution of critical tasks from the virtual domain to the host domain), and/or reduce the overall performance degradation of machine-execution inside the virtual-guest.
  • Requires the guest operating system to be explicitly ported for the para-API

Reference:

  1. https://www.linaro.org/blog/core-dump/on-the-performance-of-arm-virtualization
  2. https://en.wikipedia.org/wiki/Kernel-based_Virtual_Machine

Linux Memory for DMA

CMA: Contiguous Memory Allocator

  • Developed to allow allocation of big, physically contiguous memory blocks for hardware DMA (Direct Memory Access)
  • Avoids wasting lots of memory on statically allocated device memory reserved for hardware
  • Helps solve the memory fragmentation problem in system memory
  • Introduced in Linux v3.5-rc1
  • Accessed via the DMA mapping API (dma_alloc_coherent()); many drivers already use it, so no driver changes are needed
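
A minimal sketch of how a driver might allocate a large contiguous coherent buffer through the DMA mapping API; with CMA configured, allocations too big for the normal page allocator can still succeed. The device pointer dev and the helper name alloc_frame_buffer are illustrative, not from the original text.

#include <linux/dma-mapping.h>

/* Returns a kernel virtual address; *dma_handle receives the bus
 * address to program into the hardware. */
static void *alloc_frame_buffer(struct device *dev, size_t size,
                                dma_addr_t *dma_handle)
{
    return dma_alloc_coherent(dev, size, dma_handle, GFP_KERNEL);
}

/* Counterpart when done: dma_free_coherent(dev, size, cpu_addr, dma_handle); */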

IOMMU:

  • A memory management unit (MMU) that connects a direct-memory-access–capable (DMA-capable) I/O bus to the main memory.
  • Like a traditional MMU, which translates CPU-visible virtual addresses to physical addresses, the IOMMU maps device-visible virtual addresses (also called device addresses or I/O addresses in this context) to physical addresses.
  • Some units also provide memory protection from faulty or malicious devices.
  • A central IOMMU can also perform scatter-gather on behalf of other devices.
  • Another way to solve dynamic DMA memory fragmentation
  • Added to Linux along with CMA in v3.5, backported to LTSI v3.4
  • You need to have (or write) an IOMMU driver for your MMU hardware – see drivers/iommu/* for examples
  • IOMMU-allocated memory is available via the DMA mapping API

Dynamic Memory UIO driver

  • CMA and the IOMMU allow dynamically allocating DMA memory
  • Support was added in v3.5 and up
  • Usable by kernel drivers via the DMA mapping API
  • Usable by user-space drivers via the uio_dmem_genirq driver
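
A hedged user-space sketch of the generic UIO pattern that uio_dmem_genirq builds on: open the device node, mmap a region, and block on read() for interrupts. The device node /dev/uio0 and the 4 KiB map size are assumptions; the real values come from /sys/class/uio/uio0/maps/.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/uio0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* Map region 0; map N lives at file offset N * page size. */
    volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) { perror("mmap"); return 1; }

    uint32_t irq_count;
    /* A blocking read returns the interrupt count when an IRQ fires. */
    if (read(fd, &irq_count, 4) == 4)
        printf("got interrupt #%u\n", irq_count);

    close(fd);
    return 0;
}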

Reference:

  1. A Deep Dive into CMA
  2. Linux Contiguous Memory Allocator (and a little IOMMU)

Linux Multi-Processing

Process vs Thread

  • Threads are easier to create than processes since they don’t require a separate address space.

  • Multithreading requires careful programming since threads share data structures that should only be modified by one thread at a time. Unlike threads, processes don’t share the same address space.

  • Threads are considered lightweight because they use far fewer resources than processes.

  • Processes are independent of each other. Threads, since they share the same address space, are interdependent, so caution must be taken so that different threads don’t step on each other. This is really another way of stating the second point above.

  • A process can consist of multiple threads.
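
A small sketch contrasting the two: a thread writes a global variable that its creator can see afterwards, while a forked child writes only its own copy-on-write copy (build with cc -pthread).

#include <pthread.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int shared = 0; /* visible to all threads; a forked child gets a copy */

static void *thread_fn(void *arg)
{
    shared = 42;          /* same address space: the parent sees this */
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, thread_fn, NULL);
    pthread_join(t, NULL);
    printf("after thread: shared = %d\n", shared);   /* prints 42 */

    shared = 0;
    if (fork() == 0) {    /* child: separate copy-on-write address space */
        shared = 42;      /* the parent will NOT see this */
        _exit(0);
    }
    wait(NULL);
    printf("after fork:   shared = %d\n", shared);   /* still 0 */
    return 0;
}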

What is thread-safe implementation?

A piece of code is thread-safe if it manipulates shared data structures only in a manner that guarantees safe execution by multiple threads at the same time. Thread safety is a property that allows code to run in multithreaded environments by re-establishing some of the correspondences between the actual flow of control and the text of the program, by means of synchronization.
Thread safe: Implementation is guaranteed to be free of race conditions when accessed by multiple threads simultaneously.
Below we discuss two approaches for avoiding race conditions to achieve thread safety.
The first class of approaches focuses on avoiding shared state, and includes:
Re-entrancy
Writing code in such a way that it can be partially executed by a thread, reexecuted by the same thread or simultaneously executed by another thread and still correctly complete the original execution. This requires the saving of state information in variables local to each execution, usually on a stack, instead of in static or global variables or other non-local state. All non-local state must be accessed through atomic operations and the data-structures must also be reentrant.
Thread-local storage
Variables are localized so that each thread has its own private copy. These variables retain their values across subroutine and other code boundaries, and are thread-safe since they are local to each thread, even though the code which accesses them might be executed simultaneously by another thread.
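
A minimal sketch of thread-local storage in C11 using the _Thread_local keyword (older GCC/Clang spell it __thread); each thread increments its own private counter with no locking needed.

#include <pthread.h>
#include <stdio.h>

/* Each thread gets its own 'counter', so the same code can run on
 * several threads at once without a race. */
static _Thread_local int counter = 0;

static void *worker(void *arg)
{
    for (int i = 0; i < 1000; i++)
        counter++;                      /* touches only this thread's copy */
    printf("thread %lu: counter = %d\n",
           (unsigned long)pthread_self(), counter);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);              /* both threads print 1000 */
    pthread_join(b, NULL);
    return 0;
}
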
Immutable Objects
The state of an object cannot be changed after construction. This implies both that only read-only data is shared and that inherent thread safety is attained. Mutable (non-const) operations can then be implemented in such a way that they create new objects instead of modifying existing ones. This approach is used by the string implementations in Java, C# and Python.

Inter-Process Communication (IPC):

Named Pipes:
  • A named pipe (also known as a FIFO for its behavior) is an extension to the traditional pipe concept on Unix and Unix-like systems
  • A traditional pipe is “unnamed” and lasts only as long as the process. A named pipe, however, can last as long as the system is up, beyond the life of the process.
  • Usually a named pipe appears as a file, and generally processes attach to it for inter-process communication.
  • In C, use the mkfifo() (or mknod()) system call to create a named pipe; it can then be read and written like a regular file.
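
A minimal writer-side sketch; the path /tmp/myfifo is a placeholder, and any unrelated process (e.g. cat /tmp/myfifo) can act as the reader.

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    mkfifo("/tmp/myfifo", 0666);            /* no-op if it already exists */
    int fd = open("/tmp/myfifo", O_WRONLY); /* blocks until a reader opens */
    if (fd < 0) { perror("open"); return 1; }
    write(fd, "hello\n", 6);
    close(fd);
    return 0;
}
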
Message queue:
  • Implemented as an internal linked list within the kernel’s address space.
  • Messages can be sent to the queue in order and retrieved from the queue in several different ways.
  • Each message queue is uniquely identified by an IPC identifier.
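
A self-contained System V message queue sketch; IPC_PRIVATE keeps the example in one process for brevity, whereas unrelated processes would derive a shared key with ftok().

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>

struct msgbuf_ex { long mtype; char mtext[64]; };

int main(void)
{
    int qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
    if (qid < 0) { perror("msgget"); return 1; }

    struct msgbuf_ex msg = { .mtype = 1 };
    strcpy(msg.mtext, "hello");

    msgsnd(qid, &msg, sizeof(msg.mtext), 0);    /* enqueue the message */
    msgrcv(qid, &msg, sizeof(msg.mtext), 1, 0); /* dequeue first of type 1 */
    printf("received: %s\n", msg.mtext);

    msgctl(qid, IPC_RMID, NULL);                /* remove the queue */
    return 0;
}
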
Shared Memory:
  • POSIX shared memory API for processes to communicate information by sharing a memory region.
  • Shared memory API:
    • shm_open(): Create and open a new or existing shared memory object.
    • mmap(): Map the shared memory object into the virtual address space of the calling process.
    • munmap(): Unmap the shared memory object.
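
A minimal sketch of the three calls above; the object name /demo_shm is a placeholder, a second process would shm_open() the same name and mmap() it, and older glibc needs -lrt at link time.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    ftruncate(fd, 4096);                     /* set the object's size */

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    strcpy(p, "hello from process A");       /* visible to other mappers */

    munmap(p, 4096);
    close(fd);
    shm_unlink("/demo_shm");                 /* remove the object's name */
    return 0;
}
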
Socket:
  • Creates an endpoint for client/server communication
  • Uses a file descriptor for communication
  • Steps to establish a socket on the client side:
    1. Create a socket with the socket() system call
    2. Connect the socket to the address of the server using the connect() system call
    3. Send and receive data. There are a number of ways to do this, but the simplest is to use the read() and write() system calls.
  • Steps to establish a socket on the server side are as follows:
    1. Create a socket with the socket() system call
    2. Bind the socket to an address using the bind() system call. For a server socket on the Internet, an address consists of a port number on the host machine.
    3. Listen for connections with the listen() system call
    4. Accept a connection with the accept() system call. This call typically blocks until a client connects with the server.
    5. Send and receive data using read() and write() system calls.
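
A minimal TCP client sketch following the three client-side steps above; 127.0.0.1:8080 is a placeholder address.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);            /* step 1 */
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port   = htons(8080) };
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) { /* step 2 */
        perror("connect");
        return 1;
    }

    write(fd, "ping\n", 5);                              /* step 3 */
    char buf[64];
    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    if (n > 0) { buf[n] = '\0'; printf("reply: %s", buf); }

    close(fd);
    return 0;
}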

IPC Protection:

Semaphores:
  • Counters used to control access to shared resources by multiple processes
  • Often used as a locking mechanism, e.g., a mutual-exclusion (mutex) lock to prevent processes from accessing a particular resource while another process is performing operations on it.
Binary Semaphore vs Counting Semaphore

Binary semaphores can take only two values: one represents that a process/thread is in the critical section (code that accesses the shared resource) and others should wait; the other indicates that the critical section is free.

Counting semaphores, on the other hand, can take more than two values; they can have any value you want. A maximum value X allows X processes/threads to access the shared resource simultaneously.
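
A short POSIX semaphore sketch: initialized to 3 it acts as a counting semaphore admitting three concurrent holders, while initialized to 1 it degenerates to a binary semaphore. The function name use_resource() is illustrative.

#include <semaphore.h>

sem_t slots; /* counting semaphore: up to 3 concurrent holders */

void use_resource(void)
{
    sem_wait(&slots);     /* blocks when all 3 slots are taken */
    /* ... access the shared resource ... */
    sem_post(&slots);     /* release a slot */
}

int main(void)
{
    sem_init(&slots, 0 /* shared between threads, not processes */, 3);
    use_resource();
    sem_destroy(&slots);
    return 0;
}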

Mutex vs Binary Semaphore

Mutual exclusion (Mutex) semaphores are used to protect shared resources including data or hardware resources.

Mutex always uses the following sequence:

  Semaphore Take
  Critical Section Operations
  Semaphore Give

A binary semaphore addresses a different problem:

  • Task B is pended waiting for something to happen (a sensor being tripped for example).
  • Sensor Trips and an Interrupt Service Routine runs. It needs to notify a task of the trip.
  • Task B should run and take appropriate actions for the sensor trip. Then go back to waiting.

   Task A                      Task B
   ...                         Take BinSemaphore   <== waits for something
   Do Something Noteworthy
   Give BinSemaphore           Do something        <== unblocks
They have different purposes. A mutex is for exclusive access to a resource. A binary semaphore should be used for synchronization (i.e., “Hey, someone! This occurred!”). The binary “giver” simply notifies the “taker” that what it was waiting for has happened.
Binary semaphore implementation with mutex in C:

#include <pthread.h>

struct binary_semaphore {
    pthread_mutex_t mutex;
    pthread_cond_t cvar;
    int v; /* 0 = not posted, 1 = posted */
};

void mysem_post(struct binary_semaphore *p)
{
    pthread_mutex_lock(&p->mutex);
    if (p->v == 1) {
        /* already posted: a binary semaphore saturates at 1 */
        pthread_mutex_unlock(&p->mutex);
        return;
    }
    p->v = 1;
    pthread_cond_signal(&p->cvar);
    pthread_mutex_unlock(&p->mutex);
}

void mysem_wait(struct binary_semaphore *p)
{
    pthread_mutex_lock(&p->mutex);
    while (!p->v)
        pthread_cond_wait(&p->cvar, &p->mutex);
    p->v = 0;
    pthread_mutex_unlock(&p->mutex);
}
These two functions are used to unblock threads blocked on a condition variable.

The pthread_cond_signal() call unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).

The pthread_cond_broadcast() call unblocks all threads currently blocked on the specified condition variable cond.

Linux Kernel Space Implementation


Bottom Halves:

Code in the Linux kernel runs in one of three contexts: Process, Bottom-half and Interrupt. Process context executes directly on behalf of a user process. All syscalls run in process context, for example. Interrupt handlers run in interrupt context. Softirqs, tasklets and timers all run in bottom-half context. After Linux 2.5, the term ‘Bottom Half’ is used to refer to code that is either a softirq or a tasklet.


There are a fixed number of softirqs and they are run in priority order. Linux 2.5.48 defines 6 softirqs. The highest priority softirq runs the high priority tasklets. Then the timers run, then network transmit and receive softirqs are run, then the SCSI softirq is run. Finally, low-priority tasklets are run.


Unlike softirqs, tasklets are dynamically allocated. Also unlike softirqs, a tasklet may run on only one CPU at a time. They are more SMP-friendly than the old-style bottom halves in that other tasklets may run at the same time. Tasklets have a weaker CPU affinity than softirqs. If the tasklet has already been scheduled on a different CPU, it will not be moved to another CPU if it’s still pending.


When the machine is under heavy interrupt load, it is possible for the CPU to spend all its time servicing interrupts and softirqs without making forward progress. To prevent this from saturating the machine, if too much work is happening in softirq context, further softirq processing is handled by ksoftirqd.


Task queues were originally designed to replace the old-style bottom halves. When they were integrated into the kernel, they did not replace bottom halves but were used as an adjunct to them. Like tasklets and the new-style timers, they were dynamically allocated. Also like tasklets and timers, they consist of a function pointer and a data argument to pass to that function. Despite their name, they are not related to tasks (as in ‘threads, tasks and processes’), which is partly why they were renamed to work queues in 2.5.

Difference between tasklet and work queue:

  • Tasklets run in interrupt context. All tasklet code must be atomic, so all the rules for atomic context apply: for example, it cannot sleep or hold a lock for a long period of time.
  • Unlike tasklets, work queues execute in process context, which means they can sleep and hold locks for a long time.
In short, tasklets are used for fast execution, as they cannot sleep, whereas work queues are used for normal bottom-half execution. Both are executed at a later time by the kernel.
Basically, there are four ways to defer work to the bottom half:
  1. softirq
  2. tasklet
  3. workqueue (replacement of task queues)
  4. Kernel Timer
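
A hedged kernel-style sketch of deferring work from an interrupt handler to a workqueue, which runs in process context and may therefore sleep; names such as my_irq_handler and my_work are illustrative.

#include <linux/interrupt.h>
#include <linux/workqueue.h>

/* Deferred function: process context, so it may sleep, take mutexes,
 * or allocate with GFP_KERNEL. */
static void my_deferred_fn(struct work_struct *work)
{
    /* ... heavy lifting goes here ... */
}

static DECLARE_WORK(my_work, my_deferred_fn);

static irqreturn_t my_irq_handler(int irq, void *dev_id)
{
    /* top half: keep it minimal and atomic; defer the rest */
    schedule_work(&my_work);
    return IRQ_HANDLED;
}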

Spin locks

  • A spinlock is a lock that causes a thread trying to acquire it to simply wait in a loop (“spin”) while repeatedly checking whether the lock is available.
  • Since the thread remains active but is not performing a useful task, the use of such a lock is a kind of busy waiting. Once acquired, spinlocks will usually be held until they are explicitly released, although in some implementations they may be automatically released if the thread being waited on (that which holds the lock) blocks, or “goes to sleep”.
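
A minimal user-space sketch of the spin-in-a-loop idea using C11 atomics; kernel code would use spin_lock()/spin_unlock() from <linux/spinlock.h> instead, and the _demo names are illustrative.

#include <stdatomic.h>

typedef struct { atomic_flag flag; } spinlock_t_demo;

/* Initialize with: spinlock_t_demo lock = { ATOMIC_FLAG_INIT }; */

static void spin_lock_demo(spinlock_t_demo *l)
{
    /* test-and-set loop: busy-waits, burning CPU, until acquired */
    while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire))
        ;
}

static void spin_unlock_demo(spinlock_t_demo *l)
{
    atomic_flag_clear_explicit(&l->flag, memory_order_release);
}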

Reference:

  1. http://www.cs.columbia.edu/~nahum/w6998/papers/2003-wilcox-softirq.pdf
  2. https://www.safaribooksonline.com/library/view/understanding-the-linux/0596002130/ch04s07.html
