February 24, 2021

A hardware design for variable output frequency using an n-bit counter

The DE1-SoC from Terasic is a great board for hardware design and prototyping. The following VHDL process is from a hardware design that I created for the Terasic DE1-SoC FPGA. The ten switches and four buttons on the board select the value of an n-bit counter and an adjustable multiplier that together set the output frequency of one or more output pins at a 50% duty cycle.

As the switches are moved or the buttons are pressed, the seven-segment display is updated to reflect the numeric output frequency, and the output pin(s) are driven at the desired frequency. The on-board clock runs at 50MHz, and the signal on the output pins is set on the rising edge of the clock input signal (positive edge triggered). A 50MHz clock has 50 million rising edges per second, so a pin toggled on every rising edge completes at most 25 million full output cycles per second. In other words, an LED attached to one of the output pins could blink up to 25 million times per second, far too fast for the human eye to perceive.
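
As a concrete example (the numbers are mine, not values from the original design): to drive a pin at 1Hz with a 50% duty cycle, the counter must toggle the output every 50,000,000 / (2 x 1) = 25,000,000 clock cycles; for a 20Hz output, the toggle interval drops to 1,250,000 cycles. The scaler value computed below follows this relationship.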

-- concurrent assignment: derive the counter's terminal count from the
-- switch value scaled by the multiplier
scaler <= compute_prescaler((to_integer(unsigned( SW )))*scaler_mlt);

gpiopulse_process : process(CLOCK_50, KEY(0))
begin
    if (KEY(0) = '0') then  -- async reset (KEY(0) is active low)
        count <= 0;
        state <= '0';       -- return the output to a known level on reset
    elsif rising_edge(CLOCK_50) then
        if (count = scaler - 1) then
            state <= not state; -- toggle the output at the requested rate
            count <= 0;
        elsif (count = clk50divider) then -- auto reset at the full divider count
            count <= 0;
        else
            count <= count + 1;
        end if;
    end if;
end process gpiopulse_process;
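
The compute_prescaler function itself is not shown in the excerpt. As a hedged sketch (in C, purely for illustration), the arithmetic it presumably performs converts a requested output frequency into a terminal count for the counter; the function name, types, and rounding below are my assumptions, not the original VHDL:

#include <stdio.h>

#define CLK_HZ 50000000UL  /* 50MHz on-board clock */

/* Assumed equivalent of the VHDL compute_prescaler: clock cycles
   between output toggles for a 50% duty cycle square wave. */
static unsigned long compute_prescaler(unsigned long out_hz) {
    if (out_hz == 0)
        return 0;                    /* guard against divide by zero */
    return CLK_HZ / (2UL * out_hz);  /* two toggles per output period */
}

int main(void) {
    printf("1 Hz  -> %lu cycles per toggle\n", compute_prescaler(1));
    printf("20 Hz -> %lu cycles per toggle\n", compute_prescaler(20));
    return 0;
}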

August 25, 2020

Creating stronger keys for OpenSSH and GPG

Create an Ed25519 SSH keypair (Ed25519 is supported in OpenSSH 6.5+). The parameters are as follows:

-o save the private key in the new OpenSSH format (Ed25519 keys always use the new format, so the flag is redundant here but harmless)
-a 128 for 128 KDF (key derivation function) rounds
-t ed25519 for the key type
ssh-keygen -o -a 128 -t ed25519 -f .ssh/ed25519-$(date '+%m-%d-%Y') -C ed25519-$(date '+%m-%d-%Y')
Create an Ed448-Goldilocks GPG master key and subkeys.
gpg --quick-generate-key ed448-master-key-$(date '+%m-%d-%Y') ed448 sign 0
Capture the master key fingerprint so it can be passed to --quick-add-key.
fpr=$(gpg --list-keys --with-colons "ed448-master-key-08-03-2021" | awk -F: '/^fpr/ {print $10; exit}')
gpg --quick-add-key "$fpr" cv448 encr 2y
gpg --quick-add-key "$fpr" ed448 auth 2y
gpg --quick-add-key "$fpr" ed448 sign 2y
Create a strong passphrase for the private key.
pwgen -sy 31 1
yqaC,B\^Qm.SN-_?14#0BZ'+b


February 1, 2018

A Hardware Design for XOR Gates Using Sequential Logic in VHDL

XOR logic gates are a fundamental component in cryptography. Many of the common stream and block ciphers rely on XOR operations; two examples are ChaCha (a stream cipher) and AES (a block cipher). (RSA, by contrast, is an asymmetric public-key algorithm built on modular exponentiation, not an XOR-based block cipher.)

While many compiled and interpreted languages support bitwise operations such as XOR, software implementations of both block and stream ciphers are computationally inefficient compared to FPGA and ASIC implementations.
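
As a small illustration of the software side (my example, not code from the post), applying a keystream with bitwise XOR is the core loop of a stream cipher; XORing twice with the same keystream restores the plaintext. The keystream here is a placeholder, not a real cipher:

#include <stddef.h>
#include <stdint.h>

/* XOR a keystream into a buffer in place; calling this twice with the
   same keystream recovers the original data. */
void xor_apply(uint8_t *data, const uint8_t *keystream, size_t len) {
    for (size_t i = 0; i < len; i++)
        data[i] ^= keystream[i];
}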

Hybrid FPGA boards integrate FPGAs with multicore ARM and Intel application processors over high-speed buses. The ARM and Intel processors are general-purpose processors. On a hybrid board, the ARM or Intel processor is termed the hard processor system, or HPS. Writing to the FPGA from the HPS is typically performed in C from an embedded Linux build (Yocto or Buildroot) running on the ARM or Intel core. For a hybrid ARM configuration, a simple bitstream can also be loaded into the FPGA fabric without using any ARM design blocks or functionality in the ARM core.
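
For the HPS-to-FPGA path described above, a common pattern on the Cyclone V (the SoC family used on boards like the DE1-SoC) is to map the lightweight HPS-to-FPGA bridge into user space through /dev/mem. The sketch below assumes the documented Cyclone V lightweight bridge base of 0xFF200000; REG_OFFSET is a hypothetical register in the FPGA design, so treat this as an outline rather than a drop-in implementation:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define LWFPGASLAVES_BASE 0xFF200000UL  /* Cyclone V lightweight bridge */
#define LWFPGASLAVES_SPAN 0x00200000UL
#define REG_OFFSET        0x0UL         /* hypothetical FPGA-side register */

int main(void) {
    int fd = open("/dev/mem", O_RDWR | O_SYNC);  /* requires root */
    if (fd < 0) { perror("open"); return 1; }

    void *base = mmap(NULL, LWFPGASLAVES_SPAN, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, LWFPGASLAVES_BASE);
    if (base == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* write a value into the FPGA fabric through the bridge */
    volatile uint32_t *reg = (volatile uint32_t *)((char *)base + REG_OFFSET);
    *reg = 0x1;

    munmap(base, LWFPGASLAVES_SPAN);
    close(fd);
    return 0;
}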

The following is a simple hardware design that I wrote in VHDL and simulated in ModelSim. The image contains the waveform output of a simulation in ModelSim. The HPS is not used. The bitstream is loaded into the FPGA fabric on boot. VHDL components are utilized and a testbench is defined for testing the design.  The entity and architecture VHDL design units are below.

ModelSim full window view with waveform output of the XOR simulation. ModelSim-Intel FPGA Starter Edition © Intel

library ieee;
use ieee.std_logic_1164.all;

-- three-input xnor gate entity declaration - external interface to design entity
entity xnorgate is
port (
    a,b,c : in std_logic;
    q : out std_logic);
end xnorgate;

architecture xng of xnorgate is
begin
    q <= a xnor b xnor c;
end xng;

library ieee;
use ieee.std_logic_1164.all;

-- chain of xor / xnor gates using components and sequential logic
entity xorchain is
port (
    A,B,C,D,E,F : in std_logic;
    Av,Bv       : in std_logic_vector(31 downto 0);
    CLOCK_50    : in std_logic;
    Q           : out std_logic;
    Qv          : out std_logic_vector(31 downto 0));
end xorchain;

architecture rtl of xorchain is
component xorgate is
port (
    a,b  : in std_logic;
    q    : out std_logic);
end component;

component xnorgate is
port (
    a,b,c  : in std_logic;
    q      : out std_logic);
end component;

component xorsgate is
port (
    av : in std_logic_vector(31 downto 0);
    bv : in std_logic_vector(31 downto 0);
    qv : out std_logic_vector(31 downto 0));
end component;

signal a_in, b_in, c_in, d_in, e_in, f_in : std_logic;
signal av_in, bv_in : std_logic_vector(31 downto 0); 

signal conn1, conn2, conn3 : std_logic;

begin
    xorgt1  : xorgate port map(a => a_in, b => b_in, q => conn1); 
    xorgt2  : xorgate port map(a => c_in, b => d_in, q => conn2);
    xorgt3  : xorgate port map(a => e_in, b => f_in, q => conn3); 
    xnorgt1 : xnorgate port map(conn1, conn2, conn3, Q);
    xorsgt1 : xorsgate port map(av => av_in, bv => bv_in, qv => Qv);

   process(CLOCK_50)
   begin
       if rising_edge(CLOCK_50) then --assign inputs on rising clock edge
           a_in <= A;
           b_in <= B;
           c_in <= C;
           d_in <= D;
           e_in <= E;
           f_in <= F;
           av_in(31 downto 0) <= Av(31 downto 0);
           bv_in(31 downto 0) <= Bv(31 downto 0);
       end if;
    end process;
end rtl;

library ieee;
use ieee.std_logic_1164.all;

entity xorchain_tb is
end xorchain_tb;

architecture xorchain_tb_arch of xorchain_tb is
    signal A_in,B_in,C_in,D_in,E_in,F_in : std_logic := '0';
    signal Av_in                         : std_logic_vector(31 downto 0);
    signal Bv_in                         : std_logic_vector(31 downto 0);
    signal CLOCK_50_in                   : std_logic;
    signal BRK                           : boolean := FALSE;
    signal Q_out                         : std_logic;
    signal Qv_out                        : std_logic_vector(31 downto 0);

component xorchain
port (
    A,B,C,D,E,F      : in std_logic;
    Av               : in std_logic_vector(31 downto 0);
    Bv               : in std_logic_vector(31 downto 0);
    CLOCK_50         : in std_logic;
    Q                : out std_logic;
    Qv               : out std_logic_vector(31 downto 0));
end component;

begin
    xorchain_instance: xorchain port map (A => A_in,B => B_in, C => C_in, 
                                          D => D_in, E => E_in, F => F_in, Av => Av_in,
                                          Bv => Bv_in, CLOCK_50 => CLOCK_50_in, Q => Q_out,
                                          Qv => Qv_out);
clockprocess : process
begin
    -- generate a 40 ns period clock until the testbench sets BRK
    while not BRK loop
        CLOCK_50_in <= '0';
        wait for 20 ns;
        CLOCK_50_in <= '1';
        wait for 20 ns;
    end loop;
    wait;
end process clockprocess;
  
testprocess : process
    begin
        A_in <= '1';
        B_in <= '0';
        C_in <= '1';
        D_in <= '0';
        E_in <= '1';
        F_in <= '1';
        wait for 40 ns;
        A_in <= '1';
        B_in <= '0';
        C_in <= '1';
        D_in <= '0';
        E_in <= '1';
        F_in <= '0';
        wait for 20 ns;
        A_in <= '0';
        B_in <= '0';
        C_in <= '1';
        D_in <= '0';
        E_in <= '1';
        F_in <= '0';
        wait for 40 ns;
        BRK <= TRUE;
        wait;
    end process testprocess;
end xorchain_tb_arch;

library ieee;
use ieee.std_logic_1164.all;

entity xorgate is
port (
    a,b : in std_logic;
    q   : out std_logic);
end xorgate;

architecture xg of xorgate is
begin
    q <= a xor b;
end xg;

library ieee;
use ieee.std_logic_1164.all;

entity xorsgate is
port (
    av : in std_logic_vector(31 downto 0);
    bv : in std_logic_vector(31 downto 0);
    qv : out std_logic_vector(31 downto 0));
end xorsgate;

architecture xsg of xorsgate is
begin
    qv <= av xor bv;
end xsg;

September 16, 2016

Implementing Software-defined radio and Infrared Time-lapse Imaging with Tensorflow on a custom Linux distribution for the Raspberry Pi 3

GNURadio Companion Qt GUI frequency sync with multiple FIR filter taps, sample running on the Raspberry Pi 3 custom Linux distribution
The Raspberry Pi 3 is powered by the ARM® Cortex®-A53 processor.  This 1.2GHz 64-bit quad-core processor fully supports the ARMv8-A architecture. For this project, a custom Linux distribution was created for the Raspberry Pi 3.  

The custom Linux distribution includes support for GNURadio, several FPGA and ARM Powered® SDR devices, D-STAR (hotspot, repeater, and dongle support), hsuart, libusb, hardware real-time clock support, Sony 14 megapixel NoIR image sensor, HDMI and 3.5mm audio, USB Microphone input, X-windows with xfce, lighttpd and php, bluetooth, WiFi, SSH, TCPDump, Docker, Docker registry, MySQL, Perl, Python, QT, GTK, IPTables, x11vnc, SELinux, and full native-toolchain development support.  

The Sony 14 megapixel image sensor with the infrared filter removed can be connected to the Raspberry Pi 3's MIPI camera serial interface. Image capture and recognition can then be performed over contiguous periods of time, and time-lapse video can be created from the images. With support for Tensorflow and OpenCV, object recognition within images can be performed.
 
D-STAR hotspot with time-lapsed infrared imaging.


For the initial run, an infrared time-lapse video was created from an image capture run of one 3280x2460 infrared JPEG image every 15 seconds for three hours. Forty 5mm 940nm LEDs, driven at 500mA from a 12V DC supply, provided infrared illumination at the 940nm wavelength.

Tensorflow ran in the background (capturing frames via the v4l2 kernel module), providing continuous object recognition and scoring within each image via a sample model. OpenCV was also installed in the root file system.

A time-lapse infrared video of my living room was captured using the above setup and custom Linux distribution. Below are screenshots of Tensorflow running in a terminal in the background on the Raspberry Pi 3, recognizing and scoring objects in my living room.


Tensorflow running on the Raspberry Pi 3 and continuously capturing frames from the image sensor and scoring objects

GNURadio Companion running on xfce on the Raspberry Pi 3


August 16, 2016

Profiling Multiprocess C programs with ARM DS-5 Streamline

The ARM DS-5 Streamline Performance Analyzer is a powerful tool for debugging, profiling, and analyzing multithreaded and multiprocess C programs.  Instructions can easily be traced between load and store operations.  Per process and per thread function call paths can be broken down by system utilization percentage.  Branch mispredictions and multi-level CPU caches can be analyzed. Furthermore, disk I/O usage, stack and heap usage, and a number of other useful metrics can quickly be referenced within the debugger. These are just a few of its capabilities.

In order to capture meaningful information from DS-5 Streamline, a multiprocess Linux C program was modified to insert 1000 packets into a packet processing simulation buffer. A code excerpt from the program is below. The child processes were modified to sleep and then wake 1000 times in order to simulate process activity. There are two screenshots below the code excerpt showing the program loaded into the DS-5 Streamline Performance Analyzer.

void *insertpackets(void *arg) {
   
   struct pktbuf *pkbuf;
   struct packet *pkt;
   int idx;

   if(arg != NULL) {
   
      pkbuf = (struct pktbuf *)arg;

      /* seed random number generator */
      ...

      /* insert 1000 packets into the packet buffer */
      for(idx = 0; idx < 1000; ++idx) {

         pkt = (struct packet *)malloc(sizeof(struct packet));

         if(pkt != NULL) {

            /* set the packet processing simulation multiplier
               to a pseudorandom value in the range 0-2 */
            pkt->mlt=...()%3;

            /* insert packet in the packet buffer */
            if(pkt_queue(pkbuf,pkt) != 0) {
            
               ...
            ... 
         ...
      ...
   ...
...

int fcnb(time_t secs, long nsecs) {
 
   struct timespec rqtp;
   struct timespec rmtp;
   int ret;
   int idx;

   rqtp.tv_sec = secs;
   rqtp.tv_nsec = nsecs; 

   for(idx = 0; idx < 1000; idx++) {

      ret = nanosleep(&rqtp, &rmtp);

      ...
   ...
... 
 
ARM DS-5 Streamline - Profiling the process creation application

ARM DS-5 Streamline - Code View with C code in the top window
and ARM assembly instructions in the bottom window

July 30, 2016

Concurrency, Parallelism, and Barrier Synchronization - Multiprocess and Multithreaded Programming

 

Concurrency, parallelism, threads, and processes are often misunderstood concepts.

On a preemptive, time sliced UNIX or Linux operating system (Solaris, AIX, Linux, BSD, OS X), program code from one process executes on the processor for a time slice or quantum, after which program code from another process executes for a time quantum. The first process relinquishes the processor either voluntarily or involuntarily so that another process can execute its program code. This is known as context switching, and it facilitates interleaved execution. When a process context switch occurs, the state of the process is saved to its process control block and another process resumes execution on the processor. A UNIX process is heavyweight because it has its own virtual memory space, file descriptors, register state, scheduling information, memory management information, etc. When a process context switch occurs, all of this information must be saved, which makes it a computationally expensive operation.

Concurrency refers to the interleaved execution of schedulable entities across one or more processor cores. The execution quantum is so small that the interleaved execution of independent, schedulable entities, often performing unrelated tasks, gives the appearance that multiple software applications are running in parallel.

Concurrency applies to both threads and processes. A thread is also a schedulable entity and is defined as an independent sequence of execution within a UNIX process. UNIX processes often have multiple threads of execution that share the memory space of the process. When multiple threads of execution are running inside of a process, they are typically performing related tasks.

While threads are typically lighter weight than processes, there have been different implementations of both across UNIX and Linux operating systems over the years. The three models that typically define the implementations across preemptive, time sliced, multi-user UNIX and Linux operating systems are defined as follows: 1:1, 1:N, and M:N where 1:1 refers to the mapping of one user space thread to one kernel thread, 1:N refers to the mapping of multiple user space threads to a single kernel thread, and M:N refers to the mapping of N user space threads to M kernel threads.

In summary, both threads and processes are scheduled for execution on a processor core. Thread context switching is lighter in weight than process context switching. Both threads and processes are schedulable entities and concurrency is defined as the interleaved execution over time of schedulable entities across one or more processor cores.

The Linux user space APIs for process and thread management abstract many of these details, but you can set the level of concurrency and directly influence the time quantum, so that shorter or longer durations of schedulable entity execution time affect overall system throughput.
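
As one concrete illustration (my example, not from the original text), Linux exposes the round-robin quantum through sched_rr_get_interval(2); for a task under the SCHED_RR policy this reports the length of its time slice:

#include <sched.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec ts;

    /* pid 0 means the calling process; for SCHED_RR tasks this
       reports the round-robin time quantum */
    if (sched_rr_get_interval(0, &ts) != 0) {
        perror("sched_rr_get_interval");
        return 1;
    }
    printf("time quantum: %ld.%09ld seconds\n", (long)ts.tv_sec, ts.tv_nsec);
    return 0;
}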

Conversely, parallelism on a time sliced, preemptive operating system refers to the simultaneous execution of multiple schedulable entities over a time quantum. Both processes and threads can execute in parallel across multiple cores or multiple processors. On a multi-user system with preemptive time slicing and multiple processor cores, both concurrency and parallelism are often at play. Affinity scheduling refers to the scheduling of both processes and threads across multiple cores so that their concurrent and parallel execution is close to optimal.

Software applications are often designed to solve computationally complex problems. If the algorithm to solve a computationally complex problem can be parallelized, then multiple threads or processes can all run at the same time across multiple cores. Each process or thread executes by itself and does not contend for resources with other threads or processes that are working on the other parts of the problem to be solved. When each thread or process reaches the point where it can no longer contribute any more work to the solution of the problem, it waits at the barrier, if a barrier has been implemented in software. When all threads or processes reach the barrier, the output of their work is synchronized, and often aggregated by the primary process. Complex test frameworks often implement the barrier synchronization problem when certain types of tests can be run in parallel.
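
A minimal sketch of the barrier pattern with POSIX threads (my example; the pthread_barrier API is standard, but the work and aggregation here are placeholders): each worker computes a partial result, all parties wait at the barrier, and the primary thread aggregates once the barrier releases.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static pthread_barrier_t barrier;
static long partial[NTHREADS];       /* per-thread partial results */

static void *worker(void *arg) {
    long id = (long)arg;

    /* each thread computes its share of the problem independently */
    partial[id] = id * 1000;         /* placeholder for real work */

    /* wait until every thread has finished its share */
    pthread_barrier_wait(&barrier);
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    long sum = 0;

    /* workers plus the primary thread all meet at the barrier */
    pthread_barrier_init(&barrier, NULL, NTHREADS + 1);

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);

    /* when this returns, every worker has produced its result */
    pthread_barrier_wait(&barrier);

    for (long i = 0; i < NTHREADS; i++) {
        pthread_join(tid[i], NULL);
        sum += partial[i];           /* aggregate the synchronized output */
    }
    printf("aggregated result: %ld\n", sum);
    return 0;
}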

Most individual software applications running on preemptive, time sliced, multi-user Linux and UNIX operating systems are not designed with heavy, parallel thread or parallel, multiprocess execution in mind.

Lastly, when designing multithreaded and multiprocess software programs, fine-grained locking with short critical sections increases concurrency, throughput, and execution efficiency. Multithreaded and multiprocess programs that do not properly utilize synchronization primitives often require countless hours of debugging. Semaphores, mutex locks, and other synchronization primitives should be used sparingly in programs that share resources between multiple threads or processes. Proper program design allows schedulable entities to run in parallel or concurrently with high throughput and minimal resource contention, which is optimal for solving computationally complex problems on preemptive, time sliced, multi-user operating systems without requiring hard real-time scheduling.
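
A short sketch of the granularity point (my example): the expensive computation runs outside the critical section, so the mutex protects only the shared update and contention stays low.

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_total;

/* hypothetical expensive computation that touches no shared state */
static long compute(long x) { return x * x; }

void add_work(long x) {
    long result = compute(x);      /* outside the lock: runs in parallel */

    pthread_mutex_lock(&lock);     /* inside the lock: only the update */
    shared_total += result;
    pthread_mutex_unlock(&lock);
}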

June 30, 2016

VHDL Processes for Pulsing Multiple GPIO Pins at Different Frequencies on Altera FPGA

 
DE1-SoC GPIO Pins connected to 780nm Infrared Laser Diodes, 660nm Red Laser Diodes, and Oscilloscope

The following VHDL processes pulse the GPIO pins at different frequencies on the Altera DE1-SoC using multiple phase-locked loops. Several diodes were connected to the GPIO banks and pulsed at a 50% duty cycle at 16mA from the 3.3V pins. Each GPIO bank on the DE1-SoC has 36 pins. Pin 1 is pulsed at 20Hz from GPIO bank 0, and pins 0 and 1 are pulsed at 30Hz from GPIO bank 1. A direct mode PLL with locked output was configured using the Altera Quartus Prime MegaWizard. The PLL reference clock frequency is set to 50MHz, the output clock frequency is set to 50MHz, and the duty cycle is set to 50%. The pin mappings for GPIO banks 0 and 1 are documented in the DE1-SoC datasheet.
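
As a concrete check on the divider constants (they are not shown in the excerpt, so these values are inferred): with a 50MHz PLL output, a 20Hz square wave at a 50% duty cycle requires toggling the pin every 50,000,000 / (2 x 20) = 1,250,000 clock cycles, and a 30Hz output requires toggling approximately every 833,333 cycles.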

Pulsed Laser Diodes via GPIO pins on DE1-SoC FPGA

-- -----------------------------------------------------------
-- CLOCK A AND B PROCESSES
-- INPUT: direct mode pll with locked output
-- and reference clock frequency set to 50MHz,
-- output clock frequency set to 50MHz with 50% duty
-- cycle and output frequency scaled by freq divider constant
-- -----------------------------------------------------------
clk_a_process : process (lkd_pll_clk_a)
begin
    if rising_edge(lkd_pll_clk_a) then
        if (cycle_ctr_a < FREQ_A_DIVIDER) then
            cycle_ctr_a <= cycle_ctr_a + 1;
        else
            cycle_ctr_a <= 0;
        end if;
    end if;
end process clk_a_process;
 
clk_b_process : process (lkd_pll_clk_b)
begin
    if rising_edge(lkd_pll_clk_b) then
        if (cycle_ctr_b < FREQ_B_DIVIDER) then
            cycle_ctr_b <= cycle_ctr_b + 1;
        else
            cycle_ctr_b <= 0;
        end if;
    end if;
end process clk_b_process; 
-- -----------------------------------------------------------
-- GPIO A AND B PROCESSES
-- INPUT: direct mode pll with locked output
-- -----------------------------------------------------------
gpio_a_process : process (lkd_pll_clk_a)
begin
    if rising_edge(lkd_pll_clk_a) then
        if (cycle_ctr_a = 0) then
            gpio_sig_0 <= NOT gpio_sig_0;
        end if;
    end if;
end process gpio_a_process;

gpio_b_process : process (lkd_pll_clk_b)
begin
    if rising_edge(lkd_pll_clk_b) then
        if (cycle_ctr_b = 0) then
            gpio_sig_1 <= NOT gpio_sig_1;
        end if;
    end if;
end process gpio_b_process;
-- concurrent assignments drive the GPIO output pins from the toggled signals
GPIO_0 <= gpio_sig_0;
GPIO_1 <= gpio_sig_1;