A Novel and Efficient Large Integer Number Theoretic Transform Multiplier Based on Unified Blocks
ABSTRACT. This work presents a novel integer multiplication
architecture based on the Number Theoretic Transform (NTT)
for Post Quantum Cryptography. The proposed NTT introduces
an overlapping execution of NTT stages, enabling subsequent
stages to commence as soon as the necessary operands from the
preceding stage are available. This approach significantly reduces
the number of clock cycles required for the NTT operation, leading
to better latency. Furthermore, a unified block is proposed
and developed, which improves memory management during the
execution of the NTT operations and reduces the resources
required for the implementation. The FPGA implementation
of a 512-bit integer multiplier on a Virtex-7 series device
demonstrates a substantial reduction in both delay and area, by
93.75% and 98%, respectively, defying the typical trade-off
between these two metrics.
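As a rough illustration of the arithmetic behind NTT-based integer multiplication, the sketch below splits each operand into 8-bit digits, transforms both digit vectors, multiplies them pointwise, and transforms back. It is not the architecture described in the abstract: it does not model the overlapped stages or the unified block, and the modulus 998244353, the primitive root 3, the digit width, and the function names are all assumptions chosen for the example. A direct O(n^2) transform is used for readability instead of a butterfly network.

```python
# Assumed parameters for illustration only (not taken from the paper).
P = 998244353  # NTT-friendly prime: 119 * 2**23 + 1
G = 3          # a primitive root modulo P


def ntt(a, invert=False):
    """Direct O(n^2) number theoretic transform of a list whose length is a power of two."""
    n = len(a)
    root = pow(G, (P - 1) // n, P)       # primitive n-th root of unity mod P
    if invert:
        root = pow(root, P - 2, P)       # modular inverse of the root
    out = [sum(a[j] * pow(root, i * j, P) for j in range(n)) % P
           for i in range(n)]
    if invert:
        n_inv = pow(n, P - 2, P)         # scale by 1/n on the inverse transform
        out = [x * n_inv % P for x in out]
    return out


def to_digits(value, digit_bits):
    """Little-endian digits of a non-negative integer in base 2**digit_bits."""
    mask = (1 << digit_bits) - 1
    digits = []
    while value:
        digits.append(value & mask)
        value >>= digit_bits
    return digits or [0]


def ntt_multiply(x, y, digit_bits=8):
    """Multiply two non-negative integers via pointwise products in the NTT domain."""
    dx, dy = to_digits(x, digit_bits), to_digits(y, digit_bits)
    n = 1
    while n < len(dx) + len(dy):         # room for the full linear convolution
        n *= 2
    fx = ntt(dx + [0] * (n - len(dx)))
    fy = ntt(dy + [0] * (n - len(dy)))
    conv = ntt([a * b % P for a, b in zip(fx, fy)], invert=True)
    # Recombine the convolution coefficients; the shifts perform carry propagation.
    return sum(c << (digit_bits * i) for i, c in enumerate(conv))
```

With 8-bit digits and 512-bit operands, each convolution coefficient stays far below the modulus, so the inverse transform recovers the exact coefficients.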
09:18
Carson Sager (Oklahoma State University, United States) James Stine (Oklahoma State University, United States)
Design of a Robust IEEE Compliant Floating-Point Divide and Square Root using Iterative Approximation
ABSTRACT. In this paper, we discuss an IEEE 754 compliant normalized floating-point divide and square root unit that utilizes
iterative approximation. We provide a robust architecture that
allows multiple formats and all IEEE 754 rounding modes while
still exhibiting high performance. Moreover, we adhere to the
IEEE 754 2019 standard and demonstrate methods for rounding
results to all five rounding modes using iterative approximation.
Performance, Power, and Area estimates are determined from
physical synthesis using ARM-based standard cells in a TSMC
28nm process. This paper also presents comparisons versus other
implementations and demonstrates the efficiency of the approach
presented here.
Improving Circuit Area with a 7nm Predictive FinFET PDK Multi-Height Standard Cell Library
ABSTRACT. Moore’s Law predicts a doubling of transistors
every two years, driving semiconductor innovation. To meet this
challenge, FinFET technology offers enhanced current control
and higher transistor density. This work introduces a 7 nm
multi-height standard cell library using FinFETs, which enhances
design flexibility by allowing different cell heights. We designed
13 cells, including D-type flip-flops and a 2:1 multiplexer, with up to
50% area reduction compared to a 6-track library. Preliminary
results show area reductions of up to 36% in benchmarks, with
promising electrical performance despite incomplete parasitic
characterization.
Material Classification using Optical Wireless Communications Data
ABSTRACT. This study proposes integrating Optical Wireless Communication (OWC) with classification of the material of the object whose distance to the laser is being measured, using Machine Learning (ML) techniques such as KNN, RF, and SVM. The application uses the dynamic OWC communication structure between vehicles to estimate relevant information and thereby improve system communication and navigation. The aim of this work is to classify materials such as glass, plastic, and aluminum, expanding the sensory information of the OWC system without adding new hardware by taking advantage of the communication structure already in use. The methodology employs different ML techniques, together with k-fold and SMOTE to deal with limited and imbalanced data, in order to perform a comparative analysis and obtain an efficient classification model. The most significant impact of this study lies in offering an integrated solution that optimizes optical communication while providing material classification for selective information processing. The comparative analysis of the applied ML techniques yields an accuracy of 93% using KNN and SMOTE.
Guilherme Dias (Escola Politécnica, Universidade de São Paulo. INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Portugal) Luís Crespo (INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Portugal) Pedro Tomas (INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Portugal) Nuno Roma (INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Portugal) Nuno Neves (INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Portugal)
Dynamic Reconfigurable FPU for Next-Generation Transprecision Computing
ABSTRACT. Recent limitations in technology scaling have emphasized the need for energy-efficient computing architectures that can dynamically adjust operand precision based on the real-time requirements of the application, without compromising result accuracy. This adaptability also creates opportunities to enhance throughput and optimize hardware utilization. Furthermore, the use of lower-precision formats (e.g., 16-bit) often releases portions of the arithmetic datapath, allowing these resources to be reallocated for increased vector parallelism. In this context, we present a novel transprecision Floating-Point Unit (FPU) that supports all IEEE 754 data types (double, single, half-precision), as well as the bfloat16 and DLFloat formats. The unit features dynamic precision tuning of operands, enabling increased throughput through vectorization and improved energy efficiency. The proposed design was implemented using a 28nm UMC technology process, achieving a peak energy efficiency of 152 GOPS/W as a result of new precision adaptation capabilities.
09:18
Pedro Silva (Universidade Federal de Santa Catarina, Brazil) Rita Louro (Universidade Federal de Santa Catarina, Brazil) Mateus Grellert (Universidade Federal do Rio Grande do Sul, Brazil) Cristina Meinhardt (Universidade Federal de Santa Catarina, Brazil)
Power-Efficient Design of Approximate Parallel-Tree Digital Comparators
ABSTRACT. With the rise of decision-making circuits in smart
edge devices, the demand for energy-efficient Digital Magnitude
Comparators has significantly increased. In this work, we propose
a method for approximating tree-based comparators to improve
energy efficiency and performance. Compared to state-of-the-art
exact full-custom and gate-level counterparts, our proposed
comparator shows superior power metrics, reducing total power
dissipation by up to 21% at the cost of a 0.59% error rate. In
a decision tree learning case study, integrating these approximate
comparators yields up to 25% reduction in the power dissipation
from comparison operations, highlighting the practical relevance
of our approach.
09:36
Rafael dos Santos Ferreira (Federal University of Pelotas (UFPel), Brazil) Luciano Agostini (Federal University of Pelotas (UFPel), Brazil) Cláudio Diniz (Federal University of Rio Grande do Sul (UFRGS), Brazil) Bruno Zatt (Federal University of Pelotas (UFPel), Brazil)
Applying Approximate Subtractors for Power Reduction in TZS
ABSTRACT. This paper investigates the impact of imprecise subtractors in the hardware architecture of the Sum of Absolute Differences (SAD) computation within the Test Zone Search (TZS) algorithm, commonly used in Versatile Video Coding (VVC). Four state-of-the-art imprecise subtractors (AppS, AXSC1, AXSC2, AXSC3) were analyzed across various video resolutions to assess their influence on computational complexity, energy consumption, and coding efficiency. The results show that subtractors like AppS4 and AXCS14 provide significant reductions in energy consumption with minimal impact on video coding quality. These findings are especially relevant for low-power devices and embedded systems, where energy efficiency is critical. The use of imprecise subtractors offers a promising trade-off between computational efficiency and energy savings, making them a viable solution for high-performance video encoders.
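To make the role of the subtractor in the SAD datapath concrete, the sketch below computes the Sum of Absolute Differences between a block and a reference block with a pluggable subtraction function. The `truncating_sub` function is a hypothetical stand-in that simply ignores the k least significant bits of each operand; it is not AppS or any of the AXSC designs named in the abstract, whose internals are not described here.

```python
def exact_sub(a, b):
    """Exact subtraction, used as the reference."""
    return a - b


def truncating_sub(a, b, k=2):
    """Hypothetical approximate subtractor: drops the k least significant bits of both operands."""
    mask = ~((1 << k) - 1)
    return (a & mask) - (b & mask)


def sad(block, ref, sub=exact_sub):
    """Sum of Absolute Differences between two equally sized 2-D blocks of samples."""
    return sum(abs(sub(p, r))
               for row_b, row_r in zip(block, ref)
               for p, r in zip(row_b, row_r))


block = [[10, 20], [30, 40]]
ref = [[12, 18], [33, 40]]
exact = sad(block, ref)                        # 2 + 2 + 3 + 0 = 7
approx = sad(block, ref, sub=truncating_sub)   # 4 + 4 + 4 + 0 = 12
```

Comparing `exact` and `approx` over many candidate blocks is one simple way to see how subtractor error can change which candidate a motion search selects.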
Ultra-compact Approximate 4:2 Compressor on the Design of Power-efficient Multipliers for Image Multiplication
ABSTRACT. This work proposes the adoption of an energy-efficient approximate 4:2 compressor at the transistor level for designing a Dadda tree approximate multiplier. The proposal is also evaluated considering eight other state-of-the-art approximate compressors. Our MAX4:2CV2-based multiplier proposal reduces delay by up to 50.4%, power consumption by up to 59.2%, and Power-Delay Product (PDP) by up to 79.7% compared to an exact multiplier. Furthermore, we evaluate the multiplier quality through pixel-by-pixel image multiplication, where we observe an acceptable result of 31 dB on average for the Peak Signal-to-Noise Ratio (PSNR). These findings highlight that the adoption of the proposed compressor can improve the efficiency of approximate multiplier designs, especially when area and power savings are a critical factor.
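For reference, PSNR is computed from the mean squared error between a reference image and a test image. The sketch below uses flat lists of pixel values and assumes an 8-bit peak value of 255; the function name and the flat-list representation are choices made for this example, not details from the paper.

```python
import math


def psnr(reference, test, max_value=255):
    """Peak Signal-to-Noise Ratio in dB between two equally long pixel sequences."""
    n = len(reference)
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / n
    if mse == 0:
        return math.inf              # identical images
    return 10 * math.log10(max_value ** 2 / mse)
```

In an image-multiplication experiment, `reference` would hold the pixels produced with an exact multiplier and `test` the pixels produced with the approximate one.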
Quality-Reconfigurable Approximate Multiplier Utilizing Select Leading One-Bit Blocks
ABSTRACT. Multipliers are essential for emerging technologies,
as they are vital arithmetic circuits in many energy-efficient
applications, such as digital signal processing and machine
learning. Approximate multipliers (AxM) have become an attractive
option for error-tolerant applications in ASIC systems.
This paper proposes a run-time reconfigurable AxM
with four approximation levels in a single circuit, evaluating
accuracy and circuit metrics (e.g., circuit area, timing, and power
consumption) based on the leading one-bit-based approximate
(LoBA) multiplier. The results show a 52% reduction in circuit
area and up to 27% lower power consumption when compared
with an equivalent state-of-the-art LoBA-based architecture.
Applying the proposed RLoBA to a normalized least mean
square (NLMS) adaptive filter case study, we obtain 22.95% lower
power consumption with dynamic approximation-level selection
than with the precise multiplier, while maintaining the same
accuracy level.
ABSTRACT. In this paper, we present a novel ultra-low voltage (ULV) operational transconductance amplifier (OTA) topology inspired by the DIG-OTA. The proposed amplifier architecture leverages the principles of the conventional DIG-OTA while incorporating an inverter-based common-mode feedback (CMFB) loop and an inverter-based output stage. Designed using TSMC’s 180 nm CMOS process, the proposed architecture achieves a gain of 49 dB, a gain-bandwidth product of about 3.72 kHz, and a phase margin of 55 degrees, with an output load capacitance of only 5 pF that can be integrated on-chip. The CMFB mechanism ensures a common-mode rejection ratio (CMRR) as high as 65 dB, which remains stable across process, supply voltage, and temperature (PVT) variations. Additionally, the power consumption of the proposed OTA is just 0.62 nW. These characteristics place the proposed OTA at the state of the art among ULV OTAs.
09:18
Cristina Adornes (Universidade Federal de Santa Catarina, Brazil) Gabriel Maranhão (Universidade Federal de Santa Catarina, Brazil) Deni Alves (Universidade Federal de Santa Catarina, Brazil) Cesar Rodrigues (Universidade Federal de Santa Catarina, Brazil) Márcio Schneider (Universidade Federal de Santa Catarina, Brazil)
A CMOS instrumentation amplifier designed with open-source tools
ABSTRACT. This paper presents a CMOS instrumentation amplifier based on a fully differential difference amplifier (FDDA) as part of a bioimpedance readout circuit for skin cancer detection. Developed using the open-source SkyWater 0.13 μm CMOS process from design to tape-out, the FDDA achieves a DC gain of 72 dB, a gain-bandwidth product of 47.8 MHz, an input-referred noise of 0.275 μV/√Hz at 1 kHz, and a CMRR of 119.9 dB. This work highlights the potential of open-source design flows in developing high-performance FDDA circuits, paving the way for more accessible development of advanced biomedical applications.
New multiplier and input layer in current mode for analog artificial neural networks
ABSTRACT. This paper presents a new implementation of a multiplier and of the input layer of an analog ANN (Artificial Neural Network), made possible by FDSOI (Fully Depleted Silicon On Insulator) technology. The architecture of the circuit is based on the MLP (Multi-Layer Perceptron) algorithm with back-propagation. The analog implementation of such an algorithm typically uses multipliers, which consume considerable area and power. A second drawback of this topology concerns the storage of the weights. To overcome these problems, we propose new current mirrors that function as multipliers and take advantage of FDSOI technology. Using a similar approach, we have realized the input layer using current mirrors without digital-to-analog converters, reducing both silicon area and power consumption. This dual reduction will eventually enable us to implement a much larger number of neurons, thus increasing the complexity of the final artificial neural network.
A 13.56-MHz CMOS Active-Rectifier WPT With Dynamically Controllable Comparator for IMDs
ABSTRACT. This paper presents the design of a 13.56 MHz active full-wave integrated rectifier for wirelessly powered implantable medical devices. The four diodes of a conventional passive rectifier are replaced by two cross-coupled PMOS transistors and two comparator-controlled NMOS switches to avoid the diode voltage drops, which improves the voltage conversion ratio and the power conversion efficiency. The proposed design also focuses on reducing the reverse current in the switches. Simulated in a standard 65-nm CMOS process with an ideal AC input of 1.2 V, it achieved a maximum power conversion efficiency of 84% and a maximum output power of 590 μW at a nominal output voltage of 1.1 V.
An Inverter-Based Difference Differential Amplifier with Active Frequency Compensation
ABSTRACT. This work presents an improved Difference
Differential Amplifier (DDA) circuit based on Nauta’s
inverter-based fully-differential amplifier. The proposed topology keeps
the original Nauta DDA as the first stage and adds a second
stage with feedforward common-mode cancellation and active
frequency compensation. The circuit achieves 88 dB differential
gain, 88 dB CMRR, 88 dB PSRR, and 17.5 MHz GBW for a 15 pF
load while drawing 1.17 mA from a 1.8 V supply at room
temperature.
Heitor Huarachi (Universidade Federal do Pampa, Brazil) Gabriel Cardoso (Universidade Federal do Pampa, Brazil) Jiovana Gomes (Universidade Federal do Rio Grande do Sul, Brazil) Sergio Bampi (Federal University of Rio Grande do Sul, Brazil) Fabio Ramos (Universidade Federal do Pampa, Brazil)
Architecture for Generating the Residual Syntax Elements of the VVC Transform
ABSTRACT. In recent years, the demand for video has increased significantly due to the intensive use of streaming platforms and remote work. Meeting this demand requires more efficient solutions. Versatile Video Coding (VVC) is an advanced video coding standard designed to deliver high video quality with excellent compression. However, this efficiency comes with greater complexity in the process. A traditional solution for handling this increase in complexity is to use hardware accelerators in the most critical stages. In modern coding formats, residual coding generates most of the data that enters Entropy Coding. Accelerating this processing makes it possible to avoid bottlenecks and idleness in the coding flow. This work explores architectures for generating RSEs (Residual Syntax Elements) in the transform-based modes, collectively named TB-RSE-arch.
Design and Synthesis of Multipliers with a Focus on High Performance
ABSTRACT. Multiplication circuits are used in many important applications, such as computer vision and machine learning. However, these circuits are usually costly in terms of area and energy when compared with simpler arithmetic circuits such as adders. To evaluate multipliers with a focus on high performance and low energy consumption, this work presents an analysis of array multipliers for different bit widths. In addition to a combinational version for each case, two pipelined versions are proposed in order to maximize circuit performance for larger data sizes. Synthesis results for the XFAB 180 nm standard-cell technology indicate that the more deeply pipelined model achieves 5.78 times higher performance than the combinational version, with gains of up to 49% in the energy-delay product metric.
Yasmin Souza Camargo (Federal University of Pelotas (UFPel), Brazil) Matheus Isquierdo (Federal University of Pelotas (UFPel), Brazil) Renira Soares (Federal University of Pelotas (UFPel), Brazil) Daniel Palomino (Federal University of Pelotas (UFPel), Brazil) Bruno Zatt (Federal University of Pelotas (UFPel), Brazil) Felipe Sampaio (Federal Institute of Rio Grande do Sul (IFRS), Brazil)
Approximate Storage Evaluation at Intra-Frame Prediction in VVC Encoders
ABSTRACT. This paper explores approximate storage, which tolerates memory operation errors, to reduce energy consumption in intra-frame prediction for VVC encoders. We analyze the resilience levels of two memory regions: the Original and Neighbor Samples Buffers (OrigSB and NeighSB). Further, an SRAM with multiple operating levels is adopted to evaluate the energy savings. The resilience profiling shows the encoding efficiency drops for a wide range of scenarios, considering different video sequences, error rates, and VVC parameters. The results point to a substantial reduction in SRAM dynamic energy consumption (up to 58%) and promising error tolerance levels for the OrigSB memory. Meanwhile, NeighSB exhibited lower resilience potential, with significant coding efficiency drops and deterioration of subjective visual quality at the highest evaluated error rates.
Evaluation of AV1 Encoder Tools for Pixel Interpolation in Fractional Inter-Frame Prediction
ABSTRACT. Playing back digital video is computationally
costly because it requires a large volume of data. Therefore,
for the transmission and/or reception of such media to be
feasible, video compression is fundamental. Video encoders
incorporate a series of tools to make compression more
efficient. Among modern video encoders, the Alliance for Open
Media Video 1 (AV1) was released in 2018, developed by the
AOMedia consortium. To achieve this performance and
efficiency, complex coding tools were adopted in AV1. This
paper presents a series of evaluations of the tools available
in the AV1 video encoder, focusing on the interpolation
process. Enabling the dual filter interpolation tool yields a
0.81% gain in coding efficiency for UHD videos, but this gain
is considered negligible at other resolutions, which may not
justify its use. The Regular filter is preferred, with 80.46%
usage in the vertical direction and 89.5% in the horizontal
direction. At 4K resolutions, about 21.42% of the encoding
time is spent choosing interpolation filters.
Implementation of RISC-V Cores on FPGA with Current Monitoring Using the INA219
ABSTRACT. This paper presents an analysis of the implementation of several RISC-V cores on the Cyclone IV EP4CE6E22C8N FPGA, with emphasis on investigating the relationship between performance, energy consumption, and area. Twelve open-source RISC-V cores were implemented. The performance of each core was evaluated with the CoreMark benchmark. Current and voltage measurements were taken every 50 µs, allowing the current to be monitored at a good sampling rate. Among the architectures investigated, area differed by up to 6x, while performance and power consumption differed by up to 3.6x and 3x, respectively.
Selecting the Interpolation Filters of AV1 Fractional Motion Estimation Using Machine Learning
ABSTRACT. Digital video is increasingly present in our lives, spanning areas such as entertainment, health, and education, among others. Playing back digital video is costly both computationally and energetically, because it requires a large amount of data, especially for high-quality content. Therefore, for the transmission or reception of such media to be feasible, video compression is fundamental. Video encoders bring together a series of tools that aim to make compression more efficient and reduce processing time. To achieve this performance and efficiency, complex coding tools were adopted in AV1, such as the adaptive filtering scheme applied to the interpolation filters used in the inter-frame prediction stage. This paper presents a machine learning based solution to accelerate the interpolation of fractional samples in the Motion Estimation (ME) stage. The predictive models showed a high hit rate. For Full HD videos, the models provide a 2.14% reduction in encoding time at the cost of a 0.124% compression-efficiency penalty. For HD videos, the time reduction was 1.84%, with a compression-efficiency loss of 0.2195%. Thus, in both cases, the predictive models reduce encoding time with a small impact on compression efficiency.
Federico Fernández (Facultad Politécnica - Universidad Nacional de Asuncion, Paraguay) Diego Pinto (Facultad Politécnica - Universidad Nacional de Asuncion, Paraguay)
Dynamic Partial Reconfiguration of a ROM Using Reconfigurable Hardware
ABSTRACT. Reconfigurable devices such as FPGAs (Field Programmable Gate Arrays) provide unique capabilities for designing embedded devices whose architectural specification allows them to adapt to a specific functionality. These functionalities range from memories, filters, and arithmetic operators to video controllers, among others. A feature that distinguishes FPGAs from other technologies is the ability to be reconfigured, that is, to change configuration according to operating needs. This is very important when part of a design must be corrected without stopping its operation, which makes it indispensable when a malfunction or fault must be corrected in locations beyond direct reach, such as satellites, spacecraft, sites near nuclear plants, aircraft, and missiles. We present a dynamic partial reconfiguration system for a read-only memory (ROM), which can be used as a data storage medium as well as an instruction memory for programs to be executed. The design works correctly and can be extended to other digital modules with different functionality and within more complex circuits.
Simon Hofmann (Technical University of Munich, Germany) Marcel Walter (Technical University of Munich, Germany) Robert Wille (Technical University of Munich, Germany)
Physical Design for Field-coupled Nanocomputing with Discretionary Cost Objectives
ABSTRACT. Field-coupled Nanocomputing (FCN) represents a class of emerging post-CMOS technologies that achieve nanoscale computation without relying on the flow of electrical current. Despite their potential, existing physical design algorithms for FCN predominantly focus on minimizing either layout area or execution runtime, neglecting the complexity of real-world design constraints. In this work, we introduce the first physical design method for FCN that accommodates discretionary cost objectives, marking a significant advancement in the field. This approach integrates insights from both simulation and manufacturing, facilitating more comprehensive and optimized design solutions. We offer an open-source implementation and validate the proposed algorithm experimentally on a set of common benchmark functions, demonstrating its effectiveness across a range of different scenarios and cost objectives.
13:48
Ruan Formigoni (Universidade Federal de Vicosa, Brazil) Ricardo Ferreira (Universidade Federal de Vicosa, Brazil) Omar Neto (Universidade Federal de Minas Gerais, Brazil) José Augusto Nacif (Universidade Federal de Vicosa, Brazil)
Network Collapsing Placement and Routing for Field-Coupled Nanocomputing
ABSTRACT. Complementary metal-oxide semiconductor (CMOS) technology is the industry standard for chip fabrication. In recent decades, its miniaturization processes have become increasingly complex and expensive, with atomic limitations and ever-growing static power dissipation. Field-coupled nanocomputing has emerged to address these issues with technologies that use alternatives to the traditional transistor and require no static power dissipation. In this field, the well-known NP-hard placement and routing problem from CMOS re-emerges, now with novel constraints. Our work provides a scalable solution for this NP-hard problem, improving the area overhead compared to current state-of-the-art techniques, with a minor trade-off in time complexity. We achieve up to 23.15x area reduction with an average of 5.13x and runtime of only 8 milliseconds.
14:06
João V. C. Teixeira (Departamento de Ciência da Computação Universidade Federal de Minas Gerais (UFMG), Brazil) Poliana A. C. Oliveira (Centro Federal de Educação Tecnológica de Minas Gerais (CEFET-MG), Brazil) Renan A. Marks (Faculdade de Computação Universidade Federal de Mato Grosso do Sul (UFMS), Brazil) Omar P. V. Neto (Departamento de Ciência da Computação Universidade Federal de Minas Gerais (UFMG), Brazil)
Enhancing DNA Analog Circuits Design Through Delayed Species Insertion
ABSTRACT. Molecular computing, particularly DNA-based systems, offers immense potential for the creation of programmable biological devices. However, a critical challenge in advancing DNA computing is the lack of precise control over the time and sequence of chemical reactions within these circuits. The stochastic nature of molecular interactions, combined with the inherently parallel nature of reactions, makes synchronizing and ordering them difficult. Variations in reaction rates and the presence of noise, such as unintended leak reactions, further complicate this control, limiting the scalability and complexity of DNA-based circuits. In this paper, we explore the limitations of analog DNA circuits, focusing on the need for better mechanisms to regulate the timing and sequence of reactions. We argue that improving this control could address many existing problems, enabling the development of more complex and reliable molecular circuits. These advancements are essential for moving DNA-based computation closer to practical, real-world applications.
14:24
Gabriel Novy (Universidade Federal de Minas Gerais, Brazil) Julio Teodoro (Universidade Federal de Minas Gerais, Brazil) Jhonattan Ramírez (Universidade Federal de Minas Gerais, Brazil) Omar Neto (Universidade Federal de Minas Gerais, Brazil)
Feasible all-optical OR and NOR logic gates in photonic crystals
ABSTRACT. This work presents the design and simulation of photonic crystal-based OR and NOR logic gates, aimed at eliminating the need for dynamic control signals dependent on input combinations. Utilizing silicon-based photonic crystals operating at a 1550 nm wavelength, the gates exhibit enhanced performance, making them suitable for optical computing. Simulation results confirm the reliability of the proposed designs, with the OR and NOR gates achieving contrast ratios of 5.3 dB and 5.1 dB, respectively. Innovative waveguide junctions play a crucial role in minimizing signal loss and preserving signal integrity. This work advances photonic computing by offering a simplified control mechanism while maintaining high performance, paving the way for more energy-efficient and faster alternatives to traditional electronic logic gates. Future research directions include experimental validation and integration into more complex photonic circuits.
14:42
Emanuel Ruella (Universidade Federal de Viçosa, Brazil) Ricardo Ferreira (Universidade Federal de Viçosa, Brazil) Omar Neto (Universidade Federal de Minas Gerais, Brazil) José Nacif (Universidade Federal de Viçosa, Brazil)
Development of More Robust and Stable Logic Gates Using Novel Parameter Values for SiDB
ABSTRACT. As CMOS technology approaches its physical limits, there is a growing need to explore alternatives beyond CMOS, such as Silicon Dangling Bonds (SiDBs). SiDBs, utilizing Coulombic interactions, offer the potential for ultra-low energy consumption and high integration density. Recent research has introduced new parameter values for SiDB technology, specifically the Thomas-Fermi Screening Length of 1.8 and relative permittivity of 4.10. This study presents a comprehensive library of logic gates based on these novel values, demonstrating significant improvements in stability and interference reduction. Our contributions include: (1) A library of standard Boolean function gates, (2) Additional gates adapted from existing designs, and (3) A detailed comparison of gate performance using new versus old parameter values. The findings indicate that these new parameters enable more robust and compact circuit designs, advancing the potential of SiDB technology.
Bridging the Gap: Accelerating Random Forests on FPGAs with High-Bandwidth Memory
ABSTRACT. In memory-bound problems, Field Programmable
Gate Arrays (FPGAs) have traditionally underperformed
compared to Graphics Processing Units (GPUs) due to their lower
memory bandwidth. However, the advent of High-Bandwidth
Memory (HBM) in FPGAs has significantly enhanced their
performance, achieving bandwidths up to 425 GB/s. Additionally,
FPGAs offer the advantage of customizable accelerators for
domain-specific tasks, potentially outperforming general-purpose
GPU architectures. This work focuses on accelerating random
forest algorithms on FPGAs, leveraging their customization
capabilities to efficiently manage control flow structures such as
decision branches. Despite these advancements, FPGAs remain
challenging to program, requiring a deep understanding of
hardware design. To address this, we propose a new hardware
generator that integrates necessary tools into a cohesive workflow,
simplifying FPGA development. Validated on a Xilinx Alveo
FPGA, the design utilizes 32 HBM channels and reaches a
performance of 8 billion samples per second, offering a practical
solution for memory-bound machine learning tasks in
high-performance computing environments.
13:48
Vilmondes R. Silva (Federal University of Minas Gerais (UFMG), Brazil) Dalton M. Colombo (Federal University of Minas Gerais (UFMG), Brazil) Tomás P. Corrêa (Federal University of Minas Gerais (UFMG), Brazil)
Low-cost FPGA-based Digital-to-Time Converter
ABSTRACT. In this paper, we present a Digital-to-Time Converter (DTC) based on a low-cost FPGA platform, using an oscillator-based Vernier architecture. A DTC is a circuit that converts digital information into a very accurate time output. The proposed Vernier DTC is implemented on an Altera Cyclone III FPGA chip and features tunable resolution through the periodic relationships between two PLL-generated signals. The linearity of the system was measured for a resolution of 990 ps, showing DNL and INL values of -0.41 to +0.54 and -0.82 to +0.17, respectively. It has a range of 99 ns and an estimated power consumption of 89 mW. Furthermore, measurements demonstrate that the system achieves a maximum resolution of 12.5 ps while utilizing only 2% of the FPGA resources. Additionally, the logical and physical synthesis of the proposed design was carried out using a commercial 350 nm CMOS technology, and the estimated power consumption and silicon area are 1.75 mW and 237 µm x 190 µm, respectively.
Exploiting Design Flexibility in Multi-Tenant Multi-FPGA Edge Systems
ABSTRACT. Multi-FPGA architectures are increasingly used in edge environments for their reconfigurable nature, enabling high performance and energy efficiency by tailoring designs to specific workloads. However, in multi-tenant edge environments with diverse and dynamic workloads, navigating design heterogeneity while managing resource constraints is challenging. Efficient provisioning requires selecting the most suitable set of designs to accommodate various task requests with different behaviors, balancing design variety with limited resources. In this paper, we propose a flexible framework that bridges design heterogeneity and efficient resource management in multi-tenant multi-FPGA edge environments. The framework leverages a comprehensive design pool and dynamic strategies for design selection and task distribution. Our results show 4.7x and 3x improvements in makespan and energy efficiency over traditional non-adaptive methods.
14:24
Emanuel Trabes (Service d’electronique et de Microelectronique, University of Mons, Mons-Belgium, Belgium) Aymen Zayed (Service d’electronique et de Microelectronique, University of Mons, Mons-Belgium, Belgium) Carlos Valderrama (Service d’electronique et de Microelectronique, University of Mons, Mons-Belgium, Belgium) Jimmy Tarrillo (Universidad de Ingenieria y Tecnologia, Peru)
Design Exploration of DWT-Based Feature Extraction Using FPGA for High-Performance Signal Processing
ABSTRACT. The discrete wavelet transform (DWT) is commonly used for feature extraction in machine learning applications. Since these applications are frequently deployed in portable systems with limited computational resources, FPGA-based hybrid hardware/software solutions might be a viable choice. This article provides an analysis of various 4-level db4 DWT and feature extraction techniques implemented on the Zynq 7020 device.
Alternative DWT versions include fixed-point and floating-point implementations, cascade and single-core reuse architectures, as well as designs in HDL and VHDL. The feature extraction process considers mean, energy, and entropy. It has also been implemented in an architecture that efficiently reuses these computational cores. These versions are compared in terms of accuracy, resources used, performance, and power consumption.
Evaluating Multiplier-Less CNNs in RISC-V Architecture
ABSTRACT. In recent years, Convolutional Neural Networks (CNNs) have emerged as Machine Learning (ML) became a popular approach to solving problems in distributed computing settings such as mobile devices and the Internet of Things (IoT). It is well known that local computation at edge devices is preferable to transmitting a huge amount of data to run ML algorithms at a central node. In this sense, RISC-V has the research community’s attention as a flexible, royalty-free architecture for embedded processors and IoT devices. Although the latest research on RISC-V and CNNs has focused on instruction set architecture (ISA) customization to speed up the convolution process, this work investigates the impact on inference execution time of replacing multiplication instructions with shifts in multiply and accumulate (MAC) operations. Compared to slow multi-cycle multiplication instructions, our experiments showed inference throughput speedups ranging from 1.45x to 1.95x with negligible impact on memory footprint, while employing only the base integer RISC-V ISA (RV32I).
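The idea of replacing a multiplication by a shift in a MAC operation can be sketched as follows. The abstract does not say how the weights are prepared, so this sketch assumes each weight is rounded to the nearest signed power of two; the helper names `to_shift`, `shift_mul`, and `mac_shift` are hypothetical, and activations are assumed to be non-negative integers.

```python
import math
from typing import Sequence, Tuple


def to_shift(weight: float) -> Tuple[int, int]:
    """Approximate a weight by sign * 2**exp and return (sign, exp).

    Assumption for illustration: round log2(|weight|) to the nearest integer.
    """
    if weight == 0:
        return 0, 0
    sign = 1 if weight > 0 else -1
    exp = round(math.log2(abs(weight)))
    return sign, exp


def shift_mul(x: int, sign: int, exp: int) -> int:
    """Multiply a non-negative integer x by sign * 2**exp using only shifts."""
    shifted = x << exp if exp >= 0 else x >> -exp
    return sign * shifted


def mac_shift(activations: Sequence[int], weights: Sequence[float]) -> int:
    """Multiply-accumulate in which every product is a shift plus a sign."""
    acc = 0
    for x, w in zip(activations, weights):
        sign, exp = to_shift(w)
        acc += shift_mul(x, sign, exp)
    return acc
```

When the weights are exact powers of two the result matches an ordinary MAC; otherwise the rounding in `to_shift` and the truncation of right shifts introduce an approximation error.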
Hardware Design for VVC Angular Intra Prediction Modes with Coding Efficiency Awareness
ABSTRACT. The Versatile Video Coding (VVC) is the current
state-of-the-art video coding standard, and it was developed to
provide very high coding efficiency for different types of visual
information. As a drawback, VVC demands much higher
computational cost when compared with previous standards,
which can affect current trends such as dedicated hardware for
mobile devices and real-time applications. This work is focused
on the angular modes tool of the VVC intra-frame prediction
and presents a heuristic to reduce its computational cost
together with its high throughput hardware design. The
proposed heuristic result shows a decrease of 18.72% in the
computational cost of the entire VVC encoder, with 2.17% of
loss in coding efficiency. The designed hardware used an area of
4,838.7 k NAND2 gates and it is capable of encoding HD
1080p@30fps videos running at 124.3 MHz and with a power
dissipation of 270.5 mW.
High-Performance Binary Arithmetic Encoder with Multiple Bypass Bin Scheme for VVC CABAC
ABSTRACT. Video is an essential part of the human experience these days, with a broad range of applications, from entertainment to remote work applications. With this in mind, new techniques are mandatory to comply with this data type's ever-increasing demand and quality requirements. Versatile Video Coding (VVC) is the newest member of a family of video coding standards, begotten to tackle the challenges of the new video processing landscape. VVC follows the hybrid codec paradigm, which is composed of predictions, transforms, and entropy coding. The Context Adaptive Binary Arithmetic Encoder (CABAC) is the chosen algorithm for the entropy stage, but it has some differences compared with past versions used in predecessor video codecs. Thus, a hardware circuit for VVC CABAC is a desirable solution for coping with real-time processing and energy-efficient scenarios. More significantly, the bottleneck step is the Binary Arithmetic Encoder (BAE), which is the focus of this work, and where a design named VArchBAE is introduced. A Multiple Bypass Bin Scheme (MBBS) is also integrated into the architecture to improve the throughput. To the best of the authors' knowledge, this is the first BAE architectural solution found in the literature for the VVC standard.
14:06
Laiane Souza (Federal University of Pelotas (UFPel), Brazil) Yasmin Souza Camargo (UFPEL, Brazil) Bruno Zatt (Federal University of Pelotas (UFPel), Brazil) Sergio Bampi (Federal University of Rio Grande do Sul (UFRGS), Brazil) Felipe Sampaio (Federal Institute of Rio Grande do Sul (IFRS), Brazil)
Video Decoder Optimization for Speculative Motion Compensation for Near-Data Processing Exploitation
ABSTRACT. This paper presents a video decoder optimization strategy for improving the performance of speculative implementations of Motion Compensation (MC) at near-data processing (NDP) platforms. To fully exploit the 3D-DRAM memory access parallelism provided by NDP-based processing elements, the correlations between motion fields of neighboring frame regions should be exploited in speculative approaches. As our first contribution, a detailed analysis is presented to understand the behavior of fractional motion vectors to be decoded by the MC. The statistical distributions of fractional MV positions are evaluated, providing key insights to be considered by speculative approaches for MC decoding. Then, a video decoder optimization is introduced to adjust the fractional MV coordinates for each predefined interpolation window (2Kx128) according to the most frequent fractional position. As a result, the decoded video quality losses are evaluated, providing negligible PSNR drops for video decoder experiments with higher QP values. Still, dynamic approaches should be addressed to adapt the optimization strengths to minimize quality drops while keeping the performance of NDP-based speculative MC decoding.
14:24
Vitória Fabricio (Video Technology Research Group, Federal University of Pelotas, Brazil) Iago Storch (Video Technology Research Group, Federal University of Pelotas, Brazil) Daniel Palomino (Video Technology Research Group, Federal University of Pelotas, Brazil)
Processing Time Evaluation of the Classification Step in the Adaptive Loop Filter of VVC under Multiple Programming Paradigms
ABSTRACT. Several tools were introduced by the Versatile Video Coding (VVC) standard to enhance compression, with the Adaptive Loop Filter (ALF) being one such tool that significantly enhances visual quality. Although it provides coding efficiency gains, the ALF also poses a substantial computational burden. To address this issue, this paper evaluates the processing time of the classification step in the ALF process of VVC encoders under three programming paradigms: a sequential CPU implementation, a Single Instruction Multiple Data (SIMD) implementation, and a customized parallel implementation using CUDA for execution on GPUs. The results showed that the SIMD-optimized implementation significantly outperforms the fully scalar implementation. Although the GPU implementation is faster than the fully scalar one, it remains slower than the SIMD-optimized one due to CPU-GPU communication overhead. With more tasks, the GPU could potentially surpass the SIMD-optimized processing time.
A Parallel JPEG Pleno Baseline Block-Based Profile Light Field Encoder using OpenMP
ABSTRACT. Light Fields (LFs) are a plenoptic image modality that provides more information on light rays, making them an excellent representation for immersive media. To compress such a modality, the Joint Photographic Experts Group (JPEG) committee created the JPEG Pleno Part 2 standard with two profiles. This work focuses on the reference encoder implementation for the Baseline Block-Based Profile (BBBP), called JPEG Pleno Model (JPLM). Our main contribution lies in the proposal and analysis of a parallel implementation of JPLM using OpenMP. We show that it is possible to accelerate encoding from nearly 2 to 10 times when using 2 to 16 threads, with a memory overhead ranging from 15% up to 78%, depending on the LF size. Moreover, the speedup comes with no cost in terms of coding efficiency, i.e., the LFs encoded with the proposed parallel version are bit-exact matches to ones encoded with the sequential version.
Kush Desai (San Jose State University, United States)
AR Circuits: Augmented Reality for Electrical Education
ABSTRACT. This research introduces "AR Circuits," an innovative educational tool utilizing Augmented Reality (AR) to simulate and visualize complex electrical circuits interactively. Traditional methods of teaching electrical concepts often struggle to effectively convey three-dimensional and dynamic structures. AR Circuits addresses this challenge by leveraging AR technology, allowing users to view the real world augmented with digital content related to electrical components. The system employs fiducial markers representing circuit elements, enabling users to build, modify, and visualize circuits in real time. The integration of OpenSceneGraph for 3D graphics, AR Toolkit for marker tracking, and GnuCap for circuit analysis forms the foundation of AR Circuits. The user’s ability to control the circuit layout through marker placement and receive real-time feedback on voltage and current enhances the learning experience. The research discusses the system’s goals, limitations, and potential improvements, highlighting the need for further work to improve user interaction, scalability, and educational effectiveness. As Augmented Reality continues to gain prominence in education, AR Circuits provides a glimpse into the future of interactive and immersive learning experiences for electrical engineering students.
ABSTRACT. This work proposes a topology for a ±10 mA current driver. The circuit was implemented using a commercial 350 nm CMOS technology and a symmetric supply voltage of ±3.3 V. The output current value is controlled by means of an 11-bit digital word. On the test chip, four identical drivers were designed for simultaneous supply of output current
Implementation and Analysis of a TRNG Based on RO-PUF
ABSTRACT. This article presents a True Random Number Generator (TRNG) based on a Ring Oscillator Physical Unclonable Function (RO-PUF) circuit, which leverages the unique characteristics of signal propagation time variations in digital circuits. With the growing concern surrounding security in electronic devices, the demand for robust and reliable sources of randomness becomes increasingly evident, particularly in critical applications such as cryptography, authentication, and data protection. The proposed TRNG aims to capture entropy from the Ring Oscillators to facilitate the generation of random numbers. The methodology encompasses the implementation of the system on an FPGA using the Quartus Prime tool, with Verilog used for the design. This is followed by a series of statistical tests based on NIST guidelines, implemented in Python, to validate the randomness of the generated numbers.
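As an example of the kind of statistical check involved, the sketch below implements the frequency (monobit) test from the NIST SP 800-22 test suite using only the Python standard library. The abstract does not list which tests the authors ran, so the choice of this test and the function name `monobit_p_value` are assumptions.

```python
import math
from typing import Sequence


def monobit_p_value(bits: Sequence[int]) -> float:
    """Frequency (monobit) test: are ones and zeros roughly balanced?

    Maps each bit to +1 or -1, sums them, normalizes by sqrt(n), and returns
    the p-value erfc(s_obs / sqrt(2)).
    """
    n = len(bits)
    if n == 0:
        raise ValueError("bit sequence must not be empty")
    s_n = sum(1 if b else -1 for b in bits)
    s_obs = abs(s_n) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))
```

Under the usual SP 800-22 convention, a sequence is treated as passing this test when the p-value is at least 0.01; a heavily biased sequence yields a p-value close to zero.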
High Level Design of CIFB and CIFF Incremental Σ∆ ADC Architectures for Biomedical Signals
ABSTRACT. In the last decade, there has been an increasing
interest in biomedical electronic devices. It has driven the
development of smart wearable devices for real-time monitoring
of individuals’ health, as is accomplished by smartwatches.
These accessories are characterized by sensors capable of
capturing the environmental context, expanded memory,
processors for multitasking, and wireless protocols for
autonomous operation. These devices collect precise physiological
data in real time through non-invasive processes such as
electrocardiogram (ECG) and photoplethysmography (PPG). The
capture of physiological signals by wearable devices requires
analog-to-digital converters (ADCs) capable of converting low-
frequency signals into medium- and high-precision data, with
low power consumption. In this article, the operating mode
of incremental sigma-delta ADCs is reviewed, and the
structures of the CIFB and CIFF architectures are compared
through the implementation of a fourth-order incremental
sigma-delta (IΣ∆) ADC for a 250-Hz signal bandwidth. Furthermore,
the two modulator architectures achieved a very similar effective
bit rate and SNR based on the variation of the cycle numbers.
Another factor that corroborates these results is that the
simulations were performed at a high level, in Matlab/Simulink,
where the components are treated in an idealized way. The integrators,
filters and quantizers work without losses or errors related to the
real circuit, such as thermal noise, offset errors, or temperature
variations.
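The operating mode of an incremental converter (reset, run for a fixed number of cycles, then decimate) can be illustrated with a behavioral model. The sketch below is a deliberately simplified first-order modulator with ideal components and a counter as the decimation filter; it is not the fourth-order CIFB or CIFF structure studied in the paper, and the function name and normalized input range are assumptions.

```python
def incremental_sd_first_order(x: float, cycles: int) -> float:
    """Ideal first-order incremental sigma-delta conversion of a DC input.

    x is normalized to the reference, so -1.0 <= x <= 1.0. The integrator is
    reset at the start, the loop runs for `cycles` clock periods, and the
    average of the +1/-1 quantizer outputs is returned as the estimate.
    """
    if not -1.0 <= x <= 1.0:
        raise ValueError("input must lie within the reference range")
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    integrator = 0.0  # reset at the start of every conversion
    total = 0
    for _ in range(cycles):
        feedback = 1 if integrator >= 0 else -1  # 1-bit quantizer and DAC
        integrator += x - feedback               # ideal, lossless integrator
        total += feedback
    return total / cycles
```

Because the integrator stays bounded, the estimate in this idealized first-order model differs from the input by at most 2/cycles, which shows why the achievable resolution depends on the number of cycles per conversion.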
Biasing analysis of an RF CMOS cascode power amplifier
ABSTRACT. This paper presents an analysis of a cascode radiofrequency power amplifier (PA) designed using 130 nm CMOS technology, focusing on balancing key performance metrics such as linearity, efficiency, and gain. By investigating the behavior of the PA across a range of bias voltages, with the common gate bias (Vgbias) varying from 0.6 V to 2.9 V and the common source bias (Vsbias) set to 0.4 V, 0.6 V, and 1.1 V, it is possible to tailor the behavior to better suit mobile transmitter applications. Three simulations were performed using harmonic balance analysis: first, a load-pull simulation to find the load impedance that gives the best OCP1dB, along with its value; then, a compression point analysis to obtain both the gain and the power added efficiency (PAE) at OCP1dB; and finally, an input power sweep to verify the saturated output power and maximum PAE values. Results reveal distinct trade-offs between linearity, gain, and efficiency. The best performance results include a gain of 20.8 dB, a peak PAE of 40.1%, and a 15 dBm 1 dB compression point, produced by the cascode with Vsbias = 0.6 V and Vgbias = 2.7 V.
Developing a Wearable Device Solution for Seizure Prediction of Patients with Epilepsy
ABSTRACT. Epilepsy affects approximately 50 million people worldwide, 2% of whom live in Brazil. It significantly impacts the quality of life of these patients, putting their physical integrity and lives at risk. Epilepsy can also affect the mental health of its sufferers, causing problems such as anxiety and depression due to, for example, the unpredictability of seizures. In this context, this work proposes a wearable device in the form of a bracelet capable of alerting the patient to an imminent seizure. The device will use an embedded system responsible for capturing, filtering, and classifying heart rate variability signals to detect and alert its user. To do this, it will employ time-domain, frequency-domain, and spectral density analysis to obtain data on heart rate changes. It will then use a support vector machine to classify these signals and support decision-making regarding the issuance of the alert.
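The time-domain part of such an analysis can be sketched with two widely used heart rate variability measures, SDNN and RMSSD, computed from a list of RR intervals. The abstract does not say which features the device will use, so the feature choice, the function name `hrv_time_features`, and the input format (RR intervals in milliseconds) are assumptions; the classifier itself is not shown.

```python
import math
import statistics
from typing import Dict, Sequence


def hrv_time_features(rr_ms: Sequence[float]) -> Dict[str, float]:
    """Compute simple time-domain HRV features from RR intervals in ms.

    mean_rr: average RR interval
    sdnn:    sample standard deviation of the RR intervals
    rmssd:   root mean square of successive RR differences
    """
    if len(rr_ms) < 2:
        raise ValueError("need at least two RR intervals")
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    squared = [d * d for d in diffs]
    return {
        "mean_rr": statistics.fmean(rr_ms),
        "sdnn": statistics.stdev(rr_ms),
        "rmssd": math.sqrt(statistics.fmean(squared)),
    }
```

A feature vector of this kind, computed over a sliding window, is the sort of input a support vector machine could then classify.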
Data Acquisition and Sequencing System for Spectrophotometric Techniques Applications
ABSTRACT. Spectrophotometric (SPM) systems involve the acquisition and interpretation of spectra to determine the composition of materials. This work deals with the hardware components of an SPM device, focusing on methods for obtaining data efficiently and accurately on low-cost equipment. For this reason, the first steps mainly cover the selection of photodetectors and the design of amplifiers and filters. The analog part, covering the optical front-end and the photodetectors with the electronic circuits, will be modelled and simulated for applications such as identifying materials on vegetables and in liquids. A microcontroller will read the sampled information from the analog-to-digital converter and process, store, and transfer the data to a host computer and/or present it to the user on a human-machine interface. More complex algorithms and high-performance components may follow in future work, expanding the system's capabilities for scientific and industrial applications. Finally, advanced software is necessary to process, calibrate, and analyze the spectrophotometric data.