Hardware-Efficient Gaussian and Sobel Filters for Real-Time Image Processing on FPGA
ABSTRACT. This work presents compact Gaussian and Sobel convolutional filters for efficient image smoothing and edge detection on field-programmable gate arrays (FPGAs). To design the architectures, we mathematically analyze the convolution operations performed by both filters and develop computationally efficient implementations. We also apply several digital circuit optimization techniques, such as resource sharing, operation sequencing, and memory optimization, to further reduce area usage and power consumption. The proposed approaches are also suitable for real-time high-speed applications, since they are designed to work with a continuous data stream at a high clock frequency. The implemented circuits have been analyzed regarding resource usage, speed, and power consumption. A comparison with several previous FPGA implementations from the literature shows that the proposed designs significantly reduce logic usage and latency. Memory usage and power consumption are also reduced compared to generic convolutional architectures.
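One common way to reduce the arithmetic of these filters is kernel separability. The sketch below is only an illustration under assumed 3x3 kernels (the abstract states neither the kernel sizes nor the exact optimizations used); it checks that a row pass followed by a column pass reproduces the direct 2D window sum for both a binomial Gaussian kernel and the horizontal Sobel kernel. The names `SMOOTH`, `DIFF`, `correlate2d`, and `separable` are hypothetical.

```python
# Assumed 3x3 kernels; the abstract does not specify the sizes actually used.
SMOOTH = [1, 2, 1]    # binomial (Gaussian-like) taps, sum = 4
DIFF = [-1, 0, 1]     # central-difference taps


def outer(col, row):
    """Build a 2D kernel as the outer product of a column and a row vector."""
    return [[c * r for r in row] for c in col]


GAUSS_3X3 = outer(SMOOTH, SMOOTH)   # divide the result by 16 to normalize
SOBEL_X = outer(SMOOTH, DIFF)       # horizontal-gradient Sobel kernel


def correlate2d(img, k):
    """Direct 3x3 sliding-window sum over the valid region (no padding)."""
    h, w = len(img), len(img[0])
    return [[sum(k[i][j] * img[y + i][x + j]
                 for i in range(3) for j in range(3))
             for x in range(w - 2)]
            for y in range(h - 2)]


def separable(img, col, row):
    """Row pass, then column pass: 3 + 3 taps per pixel instead of 9."""
    h, w = len(img), len(img[0])
    tmp = [[sum(row[j] * img[y][x + j] for j in range(3))
            for x in range(w - 2)]
           for y in range(h)]
    return [[sum(col[i] * tmp[y + i][x] for i in range(3))
             for x in range(w - 2)]
            for y in range(h - 2)]


# Small synthetic test image (arbitrary integer pattern).
image = [[(3 * y + 5 * x * x) % 17 for x in range(6)] for y in range(5)]
gauss_direct = correlate2d(image, GAUSS_3X3)
gauss_sep = separable(image, SMOOTH, SMOOTH)
sobel_direct = correlate2d(image, SOBEL_X)
sobel_sep = separable(image, SMOOTH, DIFF)
```

Because the taps are 1 and 2, the multiplications reduce to additions and shifts in hardware, which is one reason such kernels map well to FPGA logic.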
Hardware implementation of the Hungarian algorithm for optimum task assignments
ABSTRACT. Task assignment problems, such as subcarrier allocation for OFDM, can be solved optimally in polynomial time with the Hungarian algorithm. The sequential nature of the procedure makes it difficult to accelerate its execution. In this work, several hardware implementations of a state version of the algorithm are presented and compared. Results show that the single-memory architecture delivers energy consumption similar to that of the software version, but multi-block memory systems can significantly reduce both execution time and hardware resource usage.
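For reference, a software version of the assignment problem can be sketched as follows. This is a textbook O(n^3) potentials-based formulation of the Hungarian algorithm, not the hardware architecture described in the abstract; the function name `hungarian` and the example cost matrix are hypothetical. The inner loops that update one row of slack values after another show why the procedure is inherently sequential.

```python
def hungarian(cost):
    """Minimum-cost assignment of n rows to m columns (requires n <= m).

    Returns (assignment, total), where assignment[i] is the column given to row i.
    Textbook potentials-based formulation; indices are 1-based internally.
    """
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0] * (n + 1)      # row potentials
    v = [0] * (m + 1)      # column potentials
    p = [0] * (m + 1)      # p[j] = row matched to column j (0 = unmatched)
    way = [0] * (m + 1)    # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta = INF
            j1 = 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta = minv[j]
                        j1 = j
            # Shift potentials so that at least one new zero-slack edge appears.
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path.
        while j0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j]:
            assignment[p[j] - 1] = j - 1
    total = sum(cost[i][assignment[i]] for i in range(n))
    return assignment, total


# Hypothetical 3x3 cost matrix (e.g. cost of giving subcarrier j to user i).
example_cost = [[4, 1, 3],
                [2, 0, 5],
                [3, 2, 2]]
example_assignment, example_total = hungarian(example_cost)
```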
Configurable Ultra-High-Throughput QRD FPGA Accelerators for small matrices
ABSTRACT. QR decomposition (QRD) is an essential operation in matrix algebra that is applicable in many fields, such as signal processing, automatic control, communications, and physics simulations. The QRD computation is the system's bottleneck in many of these applications. This paper presents a configurable ultra-high-throughput accelerator for FPGAs designed using a High-Level Synthesis language. The accelerator is arranged as a 2D systolic array of Givens rotators based on the CORDIC algorithm. Its dimension can be configured at compilation time to fit the matrix size. Similarly, the degree of parallelism in the rotators can be configured, so that the computation throughput ranges from one n x n matrix every 2n clock cycles up to one matrix every clock cycle.
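The role of a CORDIC-based Givens rotator can be illustrated with a short floating-point sketch. It is not the accelerator's HLS code: the iteration count `N_ITER`, the function names, and the use of floats instead of fixed-point words are all assumptions. Vectoring mode drives the second coordinate of a pair to zero with shift-and-add micro-rotations and records the rotation directions; the same directions are then replayed on the remaining pairs of the two matrix rows.

```python
import math

N_ITER = 16   # assumed iteration count; the abstract does not specify one

# Constant CORDIC gain accumulated over N_ITER micro-rotations.
GAIN = 1.0
for _i in range(N_ITER):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * _i))


def cordic_vectoring(x, y):
    """Rotate (x, y) onto the positive x-axis.

    Returns (r, flip, dirs): the vector length, whether a 180-degree
    pre-rotation was needed, and the micro-rotation directions used.
    """
    flip = x < 0
    if flip:                       # bring the vector into the convergence range
        x, y = -x, -y
    dirs = []
    for i in range(N_ITER):
        d = -1 if y >= 0 else 1    # always rotate toward y = 0
        dirs.append(d)
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
    return x / GAIN, flip, dirs


def cordic_rotate(a, b, flip, dirs):
    """Replay recorded micro-rotations on another pair from the same two rows."""
    if flip:
        a, b = -a, -b
    for i, d in enumerate(dirs):
        a, b = a - d * b * 2.0 ** -i, b + d * a * 2.0 ** -i
    return a / GAIN, b / GAIN


# Zero the entry 4 below the pivot 3, then update the next column (1, 2).
r, flip, dirs = cordic_vectoring(3.0, 4.0)
a_new, b_new = cordic_rotate(1.0, 2.0, flip, dirs)
```

In a fixed-point implementation the factors `2.0 ** -i` become arithmetic shifts, so each micro-rotation costs two additions and no multiplier.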
FPGA-Based Implementation of sEMG Feature Extraction and Movement Classification with MLP
ABSTRACT. The difficulty in determining movement intention in patients with partial spinal cord injury or stroke sequelae has created the need to develop efficient and intelligent frameworks to decode movement intention through muscle signals. However, the analysis of surface electromyography (sEMG) signals from the lower limbs presents challenges due to their susceptibility to noise and the complexity of extracting meaningful features and establishing robust classification models. Despite these challenges, sEMG signal analysis enables non-invasive assessment of muscle activity, which is fundamental in rehabilitation and assistive technologies, optimizing the monitoring and control of motor function. This study proposes a method for sEMG feature extraction and lower limb movement classification using a neural network based on a multilayer perceptron (MLP). The objective is to explore an efficient method for classifying lower limb movements through sEMG signals, implementing the neural network on a System-on-Chip (SoC) device based on FPGAs. The classification is based on the acquisition of sEMG signals from eight muscle groups to differentiate between sitting, standing up, remaining still, and walking movements. To achieve this, features such as Root Mean Square (RMS), Mean Absolute Value (MAV), Integrated sEMG (IEMG), and Simple Squared Integration (SSI) are extracted from the sEMG signal and fed into the network. The results show an average classification accuracy of 92.50%. From a hardware perspective, it is shown that implementation on a Zynq-7000-based SoC device is feasible, suggesting future development directions for real-time applications.
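The four time-domain features named above have standard definitions, sketched below for a single analysis window. The function name `semg_features`, the window contents, and the way per-channel features are concatenated into one MLP input vector are assumptions for illustration; the abstract does not give the window length or the input layout.

```python
import math


def semg_features(window):
    """Compute RMS, MAV, IEMG and SSI for one window of sEMG samples."""
    n = len(window)
    iemg = sum(abs(s) for s in window)    # Integrated sEMG: sum of |x|
    ssi = sum(s * s for s in window)      # Simple Squared Integration: sum of x^2
    return {
        "RMS": math.sqrt(ssi / n),        # Root Mean Square
        "MAV": iemg / n,                  # Mean Absolute Value
        "IEMG": iemg,
        "SSI": ssi,
    }


def feature_vector(channels):
    """Concatenate the four features of every channel (assumed input layout)."""
    vec = []
    for window in channels:
        f = semg_features(window)
        vec.extend([f["RMS"], f["MAV"], f["IEMG"], f["SSI"]])
    return vec


# Hypothetical data: eight channels, each holding the same four-sample window.
demo_window = [1.0, -2.0, 2.0, -1.0]
demo_features = semg_features(demo_window)
demo_vector = feature_vector([demo_window] * 8)
```

IEMG and SSI are plain accumulations, and MAV and RMS follow from them with one division (plus one square root for RMS), so the four features can share most of their arithmetic in hardware.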
Design of a CMOS Transmitter Chain for Satellite on the Move Communications
ABSTRACT. One of the main features expected of 6G networks is their three-dimensional (3D) extension. In contrast with current communications networks, in which signals are transmitted at the surface level, this will allow true global coverage, including oceans and vast unpopulated areas. 3D global coverage has become a strategic goal, and efforts are being put into its implementation with the current deployment of 5G networks and in satellite communications (SATCOM). This requires minimizing the energy losses experienced by electromagnetic waves in the GHz bands. To achieve this goal, the most promising solution is the use of active antenna arrays incorporating smart beamforming techniques to achieve directionality in signal transmission. This work presents the design and simulation of the transmitter path of an active antenna array operating within the European downlink frequency band for SATCOM on the move (SOTM) applications (17.7 to 21.2 GHz). It incorporates a compact phase shifter based on a vector-sum architecture and a power amplifier based on the concatenation of neutralized differential common-source pairs. The transmitter has been designed in a 65 nm CMOS MM/RF process. Using a 5-bit control word, it achieves a phase resolution of 11.25° over the 360° range, with a gain of approximately 22.5 dB, while consuming 30 mW.
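The quoted phase resolution follows directly from the control word width: 5 bits give 32 states, and 360° / 32 = 11.25°. The sketch below works through that arithmetic and shows the ideal vector-sum principle, in which in-phase and quadrature paths are weighted by the cosine and sine of the target phase. It is an idealized model only; the names `iq_weights` and `STEP_DEG` are hypothetical, and the actual weighting circuit is not described in the abstract.

```python
import math

BITS = 5                          # control word width stated in the abstract
STEP_DEG = 360.0 / 2 ** BITS      # 360 / 32 = 11.25 degrees per code


def iq_weights(code):
    """Ideal I/Q weights for a phase code (idealized vector-sum model)."""
    phi = math.radians(code * STEP_DEG)
    return math.cos(phi), math.sin(phi)


# All 32 phase states covered by the 5-bit word.
phase_states = [code * STEP_DEG for code in range(2 ** BITS)]
```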
Design of a 160-210 GHz SiGe HBT Square-Law Detector for Total Power Radiometers
ABSTRACT. This paper presents a square-law detector with performance suitable for radiometric applications, designed in the 0.13 μm SiGe BiCMOS SG13G2 technology offered by IHP. The implemented topology is based on an HBT in common-emitter configuration with a differential output. The proposed detector is centered around 178 GHz and offers a good balance between post-layout responsivity, noise, and power consumption compared with state-of-the-art works, with a simulated maximum responsivity of 160 kV/W, a minimum noise-equivalent power (NEP) of 1.37 pW/√Hz, and a power consumption of 0.32 mW.
CMOS SPDT Switch Topologies in the Frequency Range of 6 to 20 GHz
ABSTRACT. This work builds on a recent contribution from the literature to improve isolation on single-pole single-throw switches through the combination of an additional transistor on the gate of the series transistors and a custom fabrication process. This paper explores how said gate transistor affects the different figures of merit of different topologies of single-pole double-throw switches in the 6 to 20 GHz range for a 130 nm technology process with standard CMOS RF transistors. The insights drawn from post-layout simulations show that the additional gate transistor leads to higher isolation at the cost of a worse overall figure of merit that combines area, isolation, power, and insertion loss.
Acceleration of C/C++ Kernels and ONNX Models on CGRAs with MLIR-Based Compilation
ABSTRACT. Executing AI at the edge is challenging due to tight energy and computational constraints. Heterogeneous platforms, particularly those incorporating coarse-grained reconfigurable architectures (CGRAs), offer a compelling trade-off between hardware specialization and programmability, supporting spatially distributed and energy-efficient computation. Despite their potential, the deployment of applications on CGRA accelerators remains limited by the lack of practical toolchains and methodologies. In this work, we propose a compilation flow based on MLIR to enable the seamless integration of both C/C++ kernels and ONNX-based AI models into a RISC-V system augmented with a CGRA accelerator. Our approach extracts the underlying data-flow graph (DFG) from the high-level representation and maps it onto the CGRA using an integer linear programming (ILP) mapper that accounts for the accelerator's architectural constraints. A custom backend completes the toolchain by generating the necessary binaries for coordinated execution across the RISC-V processor and the CGRA. This framework enables the practical deployment of heterogeneous edge workloads, combining the flexibility of software execution with the efficiency of hardware acceleration.
A Framework for Automated CGRA Design Space Exploration with Genetic Algorithm Optimization
ABSTRACT. The rapid growth of compute-intensive applications has created a pressing need for computing architectures that effectively balance flexibility, efficiency, and performance. While Field-Programmable Gate Arrays (FPGAs) offer a good level of flexibility, they suffer from high configuration overhead and energy consumption. Coarse-Grained Reconfigurable Architectures (CGRAs) provide a more energy-efficient alternative with lower configuration costs. They can be customized for domain-specific applications by modifying their coarse-grained processing elements to execute particular sequences of operations. In fact, their domain-specific nature can be used to further improve their energy efficiency and reduce their area overhead by exploiting computing fabric specialization. This can be achieved by replacing homogeneous processing elements with a subset of heterogeneous, more optimized ones that are specifically suited to the target application domain. However, achieving an optimal CGRA configuration requires extensive design space exploration (DSE), which involves evaluating many architectural possibilities. Existing CGRA frameworks struggle with slow and inefficient exploration due to long runtimes and constrained customization options. These issues make it hard to find the best configurations rapidly. To tackle these challenges, this paper presents Genetic Algorithm-based CGRA Generator (GA-CG), a framework that enhances DSE in the CGRA design process. GA-CG uses a genetic algorithm to discover an efficient structural configuration, thereby improving resource utilization and reducing power consumption.
Machine Learning for Microwave Pixelated Structures Design
ABSTRACT. The growing complexity of wireless communication systems has highlighted the need for innovative methods to optimize the design of passive radiofrequency (RF) networks. This work presents a novel AI-driven approach for the electromagnetic-free design and optimization of square-shaped passive RF filter models. The method relies on 16×16 matrices composed of randomly placed metallic squares and ports. These structures are used to generate a large and diverse dataset, which feeds a deep artificial neural network (ANN) trained to predict scattering parameters (S-parameters) with high accuracy. Due to the vast design space, with more than $2^{256}$ possible configurations, genetic algorithms (GAs) are used to guide the optimization process, employing the ANN for real-time evaluation. This strategy eliminates the reliance on time-consuming electromagnetic simulations while enabling the efficient exploration of complex square-based architectures, ultimately achieving high-performance RF filter designs with minimal computational cost.
A 1.15 mW SiGe BiCMOS Cryogenic LNA for Superconducting Qubit Readout with 4.5 K Noise Temperature from 4 to 9 GHz
ABSTRACT. This work presents the design and post-layout simulation of a cryogenic SiGe BiCMOS low-noise amplifier (LNA) for superconducting transmon-qubit readout in quantum processors scaling beyond hundreds of qubits. The target specifications are first established and justified by surveying the state of the art of cryogenic LNAs and modeling the dispersive qubit readout process. The LNA is implemented with three cascaded stages in common-emitter configuration and employs tuned inductive matching and parallel peaking networks in each stage to optimize noise, gain flatness, and bandwidth while maintaining minimal DC power consumption. The amplifier draws only 1.15 mW from a 0.15 V supply and occupies 0.252 mm². Post-layout simulations confirm input/output S-parameter matching better than –10 dB, 41–44 dB gain with <3 dB ripple, <5 K noise temperature across 4–9 GHz, and a worst-case output 1 dB compression point (OP1dB) of –19.96.6 dBm. A comparative analysis demonstrates that SiGe BiCMOS offers a favorable trade-off between InP HEMT’s low noise and CMOS’s integration potential for large-scale quantum processors.
A Methodology for Cryogenic Modeling of CMOS Technology Based on BSIM-BULK
ABSTRACT. This paper presents a methodology for adjusting the BSIM-BULK model (formerly BSIM6) to simulate the behavior of NMOS and PMOS transistors at cryogenic temperatures, down to 4.2 K. The study analyzes existing characterization data for 28 nm bulk CMOS processes, identifying the threshold voltage (VTH), subthreshold swing (SS), and low-field mobility (μ0) as the transistor parameters most significantly impacted by cryogenic temperatures. Based on this analysis, it is shown that VTH is expected to increase by 100-150 mV from 300 K to 4.2 K, SS approximately 60 mV/decade from 300 K to 4.2 K, and μ0 increases, approximately doubling from 300 K to 4.2 K. In addition, a practical strategy is proposed to modify specific parameters that capture these temperature dependencies on the BSIM-BULK model. The different methods to verify the cryogenic behavior are described and applied to 3 μm/28 nm NMOS and PMOS transistors at simulation level for validation.
Robust DTMOS Schmitt-Trigger Circuits in 130 nm SOI CMOS for Sub-100 mV Supply Voltage
ABSTRACT. This work proposes to utilize a dynamic threshold voltage MOSFET (DTMOS) technique for Schmitt-Trigger-based circuits that significantly enhances the Ion/Ioff ratio while improving robustness against process variations and mismatch. We designed and validated DTMOS Schmitt Trigger (DST) inverter and NAND gates in a commercial 130 nm SOI CMOS technology. Comprehensive post-layout simulations compared noise margins, power dissipation and propagation delay against standard Schmitt-Trigger implementations. Monte Carlo analysis demonstrates that our proposed circuits achieve 99.9% yield at an ultra-low supply voltage of 60 mV. Evaluation of 11-stage inverter- and NAND-based ring oscillators revealed that the DST-based circuits deliver 24-27% improved energy efficiency and 30-37% reduced delay at 90 mV operation, with only a small area overhead. Furthermore, the minimum operating voltage is reduced by 12.5%, with the DST inverter demonstrating functionality at voltages as low as 40 mV.
A Programmable, Negative, and Dynamically Biased Sampler for Ultra-Low Power Body-Bias Generators in 18nm FD-SOI
ABSTRACT. With the rapid growth of the Internet of Things (IoT), the demand for Ultra-Low Power (ULP) circuits has increased, making power management circuits a critical need. In this context, Fully Depleted Silicon on Insulator (FD-SOI) technology is well suited thanks to its enhanced body-biasing capabilities. An Adaptive Body-Biasing (ABB) circuit is therefore needed and must meet the ULP constraints. To address this challenge, a programmable negative sampling solution optimized for ABB circuits in 18 nm FD-SOI technology is proposed. It supports sampling rates from 10 kHz to 100 MHz and introduces specific design techniques to improve power and area efficiency. The circuit is composed of a modified 6-bit binary-weighted Capacitive Digital to Analog Converter (CDAC) coupled with a dynamic comparator. This solution exhibits only dynamic power consumption, making it suitable for frequency regulation of the ABB circuit. This implementation can achieve a power reduction of up to 460× at 10 kHz and a silicon area reduction of 7.5× compared to a previously implemented design that relies on static bias functions in the same technology.