XOR gates are a fundamental component in cryptographic hardware. Many of the common symmetric stream and block ciphers rely heavily on the XOR operation; a few of these ciphers are AES, ChaCha20, and Twofish. While many compiled and interpreted languages support bitwise operations such as XOR, software implementations of both block and stream ciphers are generally far less efficient than FPGA and ASIC implementations.
Hybrid boards integrate FPGAs with multicore ARM processors over high-speed buses. On these boards, the ARM processor is termed the hard processor system, or HPS. Writing to the FPGA from the ARM processor is typically done in C from an embedded Linux build (Yocto or Buildroot) running on the ARM core. A simple bitstream can also be loaded into the FPGA fabric without using any HPS design blocks or any functionality in the ARM core.
The following is a simple hardware design that I wrote in VHDL and simulated in ModelSim. The HPS is not used; the bitstream is loaded into the FPGA fabric on boot. The design is built from VHDL components, and a testbench is defined to verify its behavior.
On a preemptive, time-sliced UNIX or Linux operating system (Solaris, AIX, Linux, BSD, OS X), sequences of program code from different software applications are executed over time on one or more processor cores. The execution of program code on a processor core constitutes what is called a process. If that process executes program code belonging to a UNIX operating system, packaged in the accompanying Executable and Linkable Format (ELF), then the process is called a UNIX process. A UNIX process is a schedulable entity. On a UNIX system, program code from one process executes on the processor for a time quantum, after which program code from another process executes for a time quantum. The first process relinquishes the processor either voluntarily or involuntarily so that another process can execute its program code. This is known as context switching. When a process context switch occurs, the state of the process is saved to its process control block and another process resumes execution on the processor. A UNIX process is heavyweight because it has its own virtual memory space, file descriptors, register state, scheduling information, memory management information, and so on. When a process context switch occurs, this information has to be saved, and that is a computationally expensive operation.
Concurrency refers to the interleaved execution of schedulable entities over time. Context switching facilitates interleaved execution. On a modern Linux system, the execution time quantum is so small that the interleaved execution of independent schedulable entities, often performing unrelated tasks, gives the appearance that multiple software applications are running in parallel. The on/off scheduling of a process in this manner occurs in non-realtime operating systems such as Windows, macOS, Linux, and FreeBSD.
Concurrency applies to both threads and processes. A thread is also a schedulable entity and is defined as an independent sequence of execution within a UNIX process. UNIX processes often have multiple threads of execution that share the memory space of the process. When multiple threads of execution are running inside a process, they are typically performing related tasks.
While threads are typically lighter weight than processes, both have had different implementations across UNIX and Linux operating systems over the years. Three models describe the implementations found on preemptive, time-sliced, multi-user UNIX and Linux operating systems: 1:1, 1:N, and M:N. 1:1 refers to the mapping of one user space thread to one kernel thread, 1:N refers to the mapping of multiple user space threads to a single kernel thread, and M:N refers to the mapping of M user space threads to N kernel threads.
In summary, both threads and processes are scheduled for execution. Thread context switching is lighter in weight than process context switching. Both threads and processes are schedulable entities and concurrency is defined as the interleaved execution over time of schedulable entities.
The Linux user space APIs for process and thread management are abstracted from the lower level details. However, the APIs typically provide calls, such as pthread_setconcurrency(3), that hint the desired level of concurrency to the implementation, which can influence how schedulable entities are mapped and run and, in turn, overall system throughput.
Conversely, parallelism refers to the simultaneous execution of multiple schedulable entities during the same time quantum. Both processes and threads can execute in parallel across multiple cores or multiple processors. On a multi-user system with preemptive time slicing and multiple processor cores, concurrency and parallelism are often both at play. Affinity scheduling refers to scheduling processes and threads onto the cores where they have recently run, preserving warm caches so that their concurrent, and often parallel, execution is close to optimal.
Software applications are often designed to solve computationally complex problems. If the algorithm to solve a computationally complex problem can be parallelized, then multiple threads or processes can all run at the same time across multiple cores. Each process or thread executes by itself and does not contend for resources with the other threads or processes working on other parts of the problem. When each thread or process reaches the point where it can no longer contribute any more work to the solution, it waits at a barrier. When all threads or processes reach the barrier, the output of their work is synchronized, and often aggregated, by a master process. Complex test frameworks often implement barrier synchronization so that certain types of tests can be run in parallel.
Most individual software applications running on preemptive, time-sliced, multi-user Linux and UNIX operating systems are not designed with heavily parallel multithreaded or multiprocess execution in mind. Expensive parallel algorithms often require multiple dedicated processor cores with hard real-time scheduling constraints. The following paper describes a solution to a popular parallelizable problem: flight scheduling.
Last, when designing multithreaded and multiprocess software programs, keeping locks fine-grained and briefly held greatly increases concurrency, throughput, and execution efficiency. Multithreaded and multiprocess programs that rely on coarse-grained synchronization do not run efficiently and often require countless hours of debugging. The use of semaphores, mutex locks, and other synchronization primitives should be minimized to the maximum extent possible in computer programs that share resources between multiple threads or processes. Proper program design allows schedulable entities to run in parallel or concurrently with high throughput and minimal resource contention, and this is optimal for solving computationally complex problems on preemptive, time-sliced, multi-user operating systems without requiring hard real-time scheduling.
After a considerable amount of research in the above areas, I applied these design techniques to several successful multithreaded and multiprocess software programs.
The DE1-SoC FPGA development board from Terasic is powered by an Altera Cyclone V SoC, which integrates an FPGA fabric with a dual-core ARM Cortex-A9 MPCore processor. The FPGA and ARM core are connected by a high-speed interconnect fabric, so you can boot Linux on the ARM core and then talk to the FPGA.
A low-level hardware abstraction layer was programmed in C to configure the on-board audio codec chip. The Nios II soft processor's program is stored in on-chip memory, and a PLL-driven clock signal is fed into the audio chip. The Verilog code for the hardware design was generated from Qsys. The design supports configurable sample rates, mic in, and line in/out.
The DC934A features an LTC2607 16-bit dual DAC with an I2C interface and an LTC2422 2-channel 20-bit micropower no-latency delta-sigma ADC.
3.5mm audio cables are connected to the mic in and line out ports. The DE1-SoC is connected to an external display over VGA so that a local console can be managed with a connected keyboard and mouse when Linux is booted from uSD.
With GPIO pins accessible via the GPIO 0 and 1 breakouts, external LEDs can be pulsed directly from the Hard Processor System (HPS), from the FPGA, or from the FPGA via the HPS.
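From Linux on the HPS, FPGA-side peripherals are commonly reached by mmap()ing /dev/mem at the lightweight HPS-to-FPGA bridge. The sketch below is a hedged illustration: the bridge base and span match the Cyclone V address map, but the PIO offset and LED bit are placeholders that depend entirely on the specific Qsys design, and the program must run as root on the board itself.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define LW_BRIDGE_BASE 0xFF200000u  /* Cyclone V lightweight HPS-to-FPGA bridge */
#define LW_BRIDGE_SPAN 0x00200000u
#define LED_PIO_OFFSET 0x0u         /* placeholder: LED PIO offset in your Qsys design */

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0)
        return 1;

    void *base = mmap(NULL, LW_BRIDGE_SPAN, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, LW_BRIDGE_BASE);
    if (base == MAP_FAILED)
        return 1;

    volatile uint32_t *led =
        (volatile uint32_t *)((char *)base + LED_PIO_OFFSET);

    /* Pulse the LED: drive the PIO high, wait, then drive it low. */
    *led = 0x1;
    usleep(500000);
    *led = 0x0;

    munmap(base, LW_BRIDGE_SPAN);
    close(fd);
    return 0;
}
```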
Curve selection at the library or implementation level can be time consuming. While the type of application may influence curve selection, the security of the curve should undoubtedly take priority. The following site is a valuable resource for selecting a safe curve.
Daniel J. Bernstein and Tanja Lange. SafeCurves: choosing safe curves for elliptic-curve cryptography. https://safecurves.cr.yp.to, accessed Sat May 8 02:29:40 UTC 2021.