DSD2022: EUROMICRO CONFERENCE ON DIGITAL SYSTEMS DESIGN 2022
PROGRAM FOR FRIDAY, SEPTEMBER 2ND

09:00-10:00 Session 13: [Keynote] Prof. Marisa Lopez-Vallejo

Title: Looking for the limits of electronics for autonomous microsystems

Abstract: Autonomous microsystems are microscale systems that do not need external power to operate and communicate for a given period of time. If we can build autonomous microsystems even with dimensions as small as the diameter of a human hair (< 100 μm) new use cases for sensing applications could be addressed. For example, microsensors could be embedded into fibers to produce smart clothing, new approaches to in-vitro and in-body sensing could be performed, etc. This keynote will address the challenges that electronic circuits must meet to be part of and support the design and integration of autonomous microsystems.

Location: Auditorium
10:00-10:30 Coffee Break
10:30-12:00 Session 14A: Applications
Location: Room 1 - DSD
10:30
PositIV: A Configurable Posit Processor Architecture for Image and Video Processing

ABSTRACT. Image processing is essential for applications such as robot vision, remote sensing, computational photography, augmented reality etc. In the design of dedicated hardware for such applications, IEEE Std 754 floating point (float) arithmetic units have been widely used. While float-based architectures have achieved favorable results, their hardware is complicated and requires a large silicon footprint. In this paper we propose a Posit-based Image and Video processor (PositIV), a completely pipelined, configurable, image processor using posit arithmetic that guarantees lower power use and smaller silicon footprint than floats. PositIV is able to effectively overlap computation with memory access and supports multidimensional addressing, virtual border handling, prefetching and buffering. It is successfully able to integrate configurability, flexibility, and ease of development with real-time performance characteristics. The performance of PositIV is validated on several image processing algorithms for different configurations and compared against state-of-the-art implementations. Additionally, we empirically demonstrate the superiority of posits in processing images for several conventional algorithms, achieving at least 35-40% improvement in image quality over standard floats.

10:55
High-Level Synthesis of Geant4 Particle Transport Application for FPGA
PRESENTER: Ramakant Joshi

ABSTRACT. Geant4 is a software toolkit that simulates particle transport in matter and is widely used in high energy, nuclear, and medical physics applications. As target applications become more complex and time-critical, there is a need to explore custom hardware implementations of the code to reduce simulation times. Since the toolkit is written in C++, targeting a hand-coded RTL implementation is not feasible. In such scenarios, High-Level synthesis provides a promising alternative to synthesize untimed C/C++ code into optimized hardware. This paper presents the methodologies used to synthesize and optimize the Geant4 code for FPGA using High-Level synthesis, highlighting the challenges faced in the source-to-source transformation. We implement the Geant4 Tracking Algorithm taking Photon Transport in Water as the use case and comparing it with software implementation for functionality and performance. The scope of extending the methodology to complex use cases and its limitations are also discussed. The final implementation is based on the Xilinx Vitis tool flow targeted to the Xilinx Alveo U250 FPGA card.

11:20
ImageSpec: Efficient High-Level Synthesis of Image Processing Applications

ABSTRACT. The need for efficient hardware accelerators for image processing kernels is well known. Unlike the conventional HDL-based design process, High-Level Synthesis (HLS) can directly convert a behavioral (C/C++) description into RTL code, reducing design complexity and design time while giving the user an opportunity for design space exploration. Due to the vast optimization possibilities in HLS, a proper application-level behavioral characterization is necessary to understand the leverage offered by these workloads, especially for facilitating parallel computation. In this work, we present a set of HLS optimization strategies derived by exploiting the most general HLS-influential characteristic features of image processing algorithms. We also present an HLS benchmark suite, ImageSpec, to demonstrate our framework and its efficiency in optimizing workloads spanning diverse domains within the image processing sector. We show that an average performance-to-hardware gain of 143x can be achieved over the baseline implementation using our optimization framework.

11:37
A YOLO v3-tiny FPGA Architecture using a Reconfigurable Hardware Accelerator for Real-time Region of Interest Detection

ABSTRACT. With the recent advances in the field of machine learning, neural networks and deep-learning algorithms have become a prevalent subject of computer vision. Especially for tasks like object classification and detection, Convolutional Neural Networks (CNNs) have surpassed the previous traditional approaches. Recently, CNNs can also be found in other applications. For example, the parametrization of video encoding algorithms, as used in our example, is quite a new application domain. The high recognition rate of CNNs makes them particularly suitable for finding Regions of Interest (ROIs) in video sequences, which can be used for adapting the data rate of the compressed video stream accordingly. On the downside, these CNNs require an immense amount of processing power and memory bandwidth. Object detection networks such as You Only Look Once (YOLO) try to balance processing speed and accuracy but still rely on power-hungry GPUs to meet real-time requirements. Specialized hardware such as Field Programmable Gate Array (FPGA) implementations has proved to strongly reduce this problem while still providing sufficient computational power. In this paper we propose a flexible architecture for object detection hardware acceleration based on the YOLO v3-tiny model. The reconfigurable accelerator comprises a high-throughput convolution engine, custom blocks for all additional CNN operations and a programmable control unit to manage on-chip execution. The model can be deployed without significant changes based on 32-bit floating point values and without further methods that would reduce the model accuracy. Experimental results show a high capability of the design to accelerate the object detection task with a processing time of 27.5 ms per frame. It is thus real-time-capable for 30 FPS applications at a frequency of 200 MHz.
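
As a quick check of the real-time claim in this abstract, the reported 27.5 ms per frame can be compared with the time budget of a 30 FPS stream. The snippet below is only an arithmetic sketch; the function and variable names are illustrative and do not come from the paper.

```python
def frames_per_second(frame_time_ms: float) -> float:
    """Convert a per-frame processing time in milliseconds to frames per second."""
    return 1000.0 / frame_time_ms


frame_time_ms = 27.5            # processing time per frame reported in the abstract
target_fps = 30.0               # real-time target named in the abstract
budget_ms = 1000.0 / target_fps # time available per frame at the target rate

achieved_fps = frames_per_second(frame_time_ms)
meets_target = frame_time_ms <= budget_ms

print(f"{achieved_fps:.1f} FPS achievable, budget {budget_ms:.1f} ms, real-time: {meets_target}")
```

The frame time sits below the roughly 33.3 ms budget, which is consistent with the 30 FPS statement.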

10:30-12:00 Session 14B: Artificial Intelligence Implementations
Location: Auditorium
10:30
Inference Time Reduction of DNNs on Embedded Devices: A Case Study

ABSTRACT. We first motivated the use of filter pruning instead of weight pruning as it relaxes the need for specialized hardware/software needed for handling matrix sparsity. Then, we reviewed different filter ranking criteria and proposed Mean Weight filter ranking criterion which offers both simplicity and efficiency in pruning the less important filters. Further, we combined the ranking criterion with two different filter pruning strategies so-called iterative global and T-sensitive pruning. T-sensitive pruning is a proposed variation of layer-sensitive pruning which assigned a pruning threshold to all layers. By introducing a structured filter pruning algorithm, we removed all selected filters and their dependencies from a DNN model, thus speeding up inference and facilitating the deployment of neural networks on embedded devices equipped with lower-end CPUs. With speedup results of up to 13x for the SICK-net, we demonstrated that our combined ranking and pruning strategy together with the filter removal algorithm outperforms more complex approaches such as the Taylor ranking and is more stable than the simple L1 norm saliency ranking.

10:55
PosAx-O: Exploring Operator-level Approximations for Posit Arithmetic in Embedded AI/ML

ABSTRACT. The quest for low-cost embedded AI/ML has motivated innovations across multiple abstractions of the computation stack. The innovations for the corresponding arithmetic operations have primarily involved quantization, precision-scaling, approximations and modified data representation. In this context, Posit has emerged as an alternative to the IEEE-754 standard as it offers multiple benefits, primarily due to its dynamic range and tapered precision. However, the implementation of Posit arithmetic operations tends to result in high resource utilization and power dissipation. Consequently, recent works have delved into the idea of exploiting the error resilience of Neural Networks by using low-precision Posit arithmetic. However, limiting the exploration to precision-scaling limits the scope for application-specific optimizations for embedded AI/ML. To this end, we explore operator-level optimizations and approximations for low-precision Posit numbers. Specifically, we identify and eliminate redundant operations in state-of-the-art Posit arithmetic operator designs and provide a modular framework for exploring approximations in various stages of the computation. We also present a novel framework for behaviorally testing the corresponding Posit approximate designs in Neural Networks. The proposed modules exhibit considerable improvement in resources with a small error in many cases. For instance, our Posit Multiplier with 1-bit reduced precision shows a 33% improvement in power and utilization, with only a ~0.2% degradation in overall accuracy.

11:20
Evaluation of Early-exit Strategies in Low-cost FPGA-based Binarized Neural Networks

ABSTRACT. In this paper, we investigate the application of early-exit strategies to quantized neural networks with binarized weights, mapped to low-cost FPGA SoC devices. The increasing complexity of network models means that hardware reuse and heterogeneous execution are needed, and this opens the opportunity to evaluate the prediction confidence level early on. We apply the early-exit strategy to a network model suitable for ImageNet classification that combines weights with floating-point and binary arithmetic precision. The experiments show an improvement in inference speed of around 20% using a single early-exit branch, compared with using a single primary neural network, with a negligible accuracy drop of 1.56%.
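
The idea of evaluating the prediction confidence early can be illustrated with a minimal sketch. It assumes that confidence is the largest softmax probability of the early-exit branch and that a fixed threshold decides whether the primary network runs; both are illustrative assumptions, as are all names and values, and none of them is taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_with_early_exit(branch_logits, run_primary, threshold=0.9):
    """Return (class index, path taken).

    branch_logits: output of the early-exit branch (hypothetical).
    run_primary:   callable producing the primary network's logits; it is
                   only invoked when the branch is not confident enough.
    threshold:     assumed confidence threshold, not a value from the paper.
    """
    probs = softmax(branch_logits)
    confidence = max(probs)
    if confidence >= threshold:
        return probs.index(confidence), "early-exit"
    primary_probs = softmax(run_primary())
    return primary_probs.index(max(primary_probs)), "primary"


# A confident branch output exits early; an ambiguous one falls through.
easy = classify_with_early_exit([8.0, 0.5, 0.1], lambda: [0.0, 1.0, 0.0])
hard = classify_with_early_exit([1.0, 0.9, 0.8], lambda: [0.2, 3.0, 0.1])
print(easy, hard)
```

Inputs that exit early skip the remaining layers entirely, which is where the speed-up in such schemes comes from.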

11:37
A Clustering-Based Scoring Mechanism for Malicious Model Detection in Federated Learning

ABSTRACT. Federated learning is a distributed machine learning technique that aggregates every client model on the server side. This feature of the system generates vulnerabilities. Some clients may harm the system by poisoning their model or data to make the global model irrelevant to its objective. This paper introduces an approach for the server to detect these malicious models by coordinate-based statistical comparison to eliminate malicious models from the system. Consequently, our result can also detect malicious models when their participation rate is at most 40%. In addition to that, this system allows clients to be differentiated (non-iid dataset, different batch size, or computing power) from other clients even if they are not malicious and have a low impact on the aggregation model itself.

10:30-12:00 Session 14C: FTET-1
Location: Room 2 - DSD
10:30
Polynomial Formal Verification of Approximate Adders

ABSTRACT. To ensure the functional correctness of digital circuits, formal verification methods have been established, where the circuits are proven to implement the correct function. Several methods exist for the execution of the verification process. However, the verification process can have an exponential time or space complexity, causing the verification to fail. While exponential in general, recently it has been proven that the verification complexity of several circuits is polynomially bounded.

In this paper, we prove the polynomial verifiability of several state-of-the-art approximate adders using BDDs. These approximate adders include handcrafted approximate adders, which consist of several subadders, as well as automatically generated approximate adders, where regular adders can be arbitrarily altered by removing gates and changing the type of gates. Thus, this paper provides insight into the possible methods for the design of approximate adders, such that the approximate adders remain polynomially verifiable. Here, we give upper bounds for the BDD sizes during the verification process, as well as for the time and space complexity. The upper bounds for the BDD sizes are then experimentally evaluated.

10:55
SAT-based Exact Synthesis of Ternary Reversible Circuits using a Functionally Complete Gate Library

ABSTRACT. The problem of synthesis and optimization of reversible and quantum circuits has drawn the attention of researchers for the last two decades due to increasing interest in quantum computing. Although much work has been done on the synthesis of binary reversible circuits, very few works have been reported on the synthesis of ternary reversible circuits. Ternary circuits have a lower cost of implementation than their binary counterparts. However, the synthesis approaches that exist for ternary reversible circuits use either too many circuit lines (qutrits) or too many gates. Only one prior work has discussed the problem of generating cost-optimal ternary reversible circuits, but for a very restrictive gate library, which limits the approach to a specific subset of ternary reversible functions, and often the solution becomes sub-optimal due to the imposed restrictions. The present paper overcomes that restriction and uses multiple-control ternary Toffoli gates with all possible ternary target operations as the gate library. This gate library is functionally complete and can be used to synthesize any arbitrary function. The proposed SAT-based synthesis approach provides low-cost solutions in terms of the number of gates for any arbitrary ternary reversible function. Experimental results on various randomly generated permutations as well as standard ternary benchmarks establish this claim. The results can be used as a template for other synthesis approaches by observing how far they deviate from the optimal solutions.

11:12
Optimizing Lattice-based Post-Quantum Cryptography Codes for High-Level Synthesis

ABSTRACT. High-level synthesis (HLS) is a mature Electronic Design Automation (EDA) solution for building hardware designs in a short time. It automatically produces HDL code for FPGAs out of C/C++, bridging the gap from algorithm to hardware. Nevertheless, sometimes the QoR (Quality of Results) can be suboptimal due to the difficulties of HLS in handling general-purpose software code. In this paper, we explore the difficulties of HLS while synthesizing Lattice-based Post-Quantum Cryptography (PQC) algorithms. We propose code-level optimizations to overcome the limitations, increasing the QoR of the generated hardware. We analyzed and improved the results for the algorithms competing in the 3rd round of the NIST standardization process. We show how, starting from the original reference code submitted for the competition, original performance and resource utilization can be improved, in some cases with a speedup factor of up to 200x or an area reduction of 80%.

11:29
Designing Approximate Arithmetic Circuits with Combined Error Constraints

ABSTRACT. Approximate circuits trading the power consumption for the quality of results play a key role in the development of energy-aware systems. Designing complex approximate circuits is, however, a very difficult and computationally demanding process. When deploying approximate circuits, various error metrics (e.g., mean average error, worst-case error, error rate), as well as other constraints (e.g., correct multiplication by 0), have to be considered. The state-of-the-art approximation methods typically focus on a single metric which significantly limits the applicability of the resulting circuits. In this paper, we experimentally investigate how various error metrics and their combinations affect the reduction of the power consumption that can be achieved. To this end, we extend evolutionary-driven techniques that allow us to effectively explore the design space of the approximate circuits. We identify principal limitations when complex error constraints are required as well as important correlations among the error metrics enabling the construction of circuits providing the best-known trade-offs between the power reduction and combined error constraints.

11:46
Technology Mapping for PAIG Optimised Polymorphic Circuits

ABSTRACT. The concept of polymorphic electronics allows two or more functions to be implemented efficiently in a single circuit. It is characteristic of that approach that the currently selected function from the set of available ones depends on the state of the circuit operating environment. The key components of such circuits are polymorphic gates. Since the introduction of polymorphic electronics, just a few tens of polymorphic gates have been published. However, a large number of them exhibit parameters that fall behind ubiquitous CMOS technology, which makes their utilization for real applications rather difficult. As it turns out, the synthesis of polymorphic circuits involves a significantly higher degree of complexity than that of ordinary digital circuits. In the past, many of the previously reported polymorphic circuits were designed using evolutionary principles (EA, CGP, etc.). It has been shown that the problem of scalable synthesis techniques suitable for large-scale polymorphic circuits could be addressed by the adoption of multi-level synthesis techniques such as And-Inverter-Graphs. The PAIG (Polymorphic And-Inverter-Graphs) concept and synthesis techniques based on it seem to be a promising approach. This paper shows how modern polymorphic gates could be used in combination with a PAIG-based synthesis tool to obtain an efficient implementation of complex polymorphic circuits.

10:30-12:00 Session 14D: DCPS
Location: Room 3 - DSD
10:30
Real-Time Polling Task: Design and Analysis

ABSTRACT. When designing robotic systems, we regularly face the same problem: the systems that we design have to be reactive systems, must be able to react to their environment and, also, to communicate with other systems. However, reactive behavior, because of its nondeterministic nature, is difficult to analyze and prevents designers from performing real-time analysis, which is needed for critical systems. In this paper, we propose a deterministic task model to handle the reactive behavior as well as the necessary tools to verify that real-time constraints are respected. An implementation of this model, which is used in a ROS2 patch, is also presented.

10:55
Design Space Exploration for Distributed Cyber-Physical Systems: State-of-the-art, Challenges, and Directions

ABSTRACT. Industrial Cyber-Physical Systems are complex heterogeneous and distributed computing systems, typically integrating and interconnecting a large number of subsystems and containing a substantial number of computing hardware and software components. Designers of these distributed Cyber-Physical Systems (dCPS) face serious challenges with respect to designing the next generations of these machines and require proper support in making (early) design decisions. This calls for efficient and scalable system-level Design Space Exploration (DSE) methods for dCPS. In this position paper, we review the current state of the art in DSE, and argue that efficient and scalable DSE technology for dCPS is more or less non-existent and constitutes largely uncharted research ground. Moreover, we identify several research challenges that need to be addressed and discuss possible solution directions when targeting such DSE technology for dCPS.

11:20
Event-Driven Programming of FPGA-accelerated ROS 2 Robotics Applications

ABSTRACT. Many applications from the robotics domain can benefit from FPGA acceleration. A corresponding key question is not only how to integrate hardware accelerators into software-centric robotics programming environments but also how to integrate more advanced approaches like dynamic partial reconfiguration. Recently, several approaches have demonstrated hardware acceleration for the robot operating system (ROS), the dominant programming environment in robotics. ROS is a middleware layer that features the composition of complex robotics applications as a set of nodes that communicate via mechanisms such as publish/subscribe, and distributes them over several compute platforms.

In this paper, we present a novel approach for event-based programming of robotics applications that leverages dynamic partial reconfiguration and ReconROS, a framework for flexibly mapping ROS 2 nodes to either software or reconfigurable hardware. The approach is based on the ReconROS executor, which schedules callbacks of ROS 2 nodes and utilizes a reconfigurable slot model and partial runtime reconfiguration to load hardware-based callbacks on demand. We describe the ReconROS executor approach, give design examples, and experimentally evaluate its functionality with examples.

12:00-13:30 Lunch Break
13:30-15:00 Session 15A: HW Architectures
Location: Room 1 - DSD
13:30
Coherency Traffic Reduction in Manycore Systems

ABSTRACT. With the increasing number of cores in manycore accelerators and chip multiprocessors (CMPs), it gets more challenging to provide cache coherency efficiently. Although the snooping-based protocols are appropriate solutions for small-scale systems, they are inefficient for large systems because of the limited bandwidth. Therefore, large-scale manycores require directory-based solutions where a hardware structure called a directory holds the information. This directory keeps track of all memory blocks and which cache stores a copy of these blocks. The directory sends messages only to caches that store relevant blocks and coordinates simultaneous accesses to a cache block. As directory-based protocols scale to many cores, performance, network-on-chip (NoC) traffic, and bandwidth become significant problems. This paper presents software mechanisms to improve the effectiveness of directory-based cache coherency in manycore and multicore systems with shared memory. In multithreaded applications, some data accesses do not disrupt cache coherency, but they still produce coherency messages among cores, such as read-only (private) data. However, if at least two cores access data and at least one of them is a write operation, it is called shared data and requires cache coherency. In our proposed system, private data and shared data are determined at compile-time, and cache coherency protocol only applies to shared data. We implement our approach in two stages. First, we use Andersen's static pointer analysis to analyze the program and mark its private instructions, i.e., instructions that load or store private data. Then, we use these analyses to decide if cache coherency protocol will be applied or not at runtime. Our simulation results on parallel benchmarks show that our approach reduces cycle count, dynamic random access memory (DRAM) accesses, and coherency traffic by 13%.
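
The private/shared rule stated in this abstract (data is shared when at least two cores access it and at least one access is a write) can be sketched as follows. Note the simplification: the paper determines this at compile time with Andersen's pointer analysis, whereas this illustration applies the same rule to a hypothetical access trace. The trace format and all names are assumptions made for the example.

```python
from collections import defaultdict


def classify_blocks(accesses):
    """Label each address 'shared' or 'private'.

    accesses: iterable of (core_id, address, is_write) tuples (hypothetical format).
    An address is 'shared' only if two or more cores touch it AND at least
    one of the accesses is a write; everything else is 'private'.
    """
    cores = defaultdict(set)      # address -> set of cores that accessed it
    written = defaultdict(bool)   # address -> True once any write is seen
    for core, address, is_write in accesses:
        cores[address].add(core)
        written[address] = written[address] or is_write
    return {
        address: "shared" if len(cores[address]) >= 2 and written[address] else "private"
        for address in cores
    }


trace = [
    (0, 0x100, False), (1, 0x100, False),  # read by two cores, never written
    (0, 0x200, True),  (1, 0x200, False),  # written by core 0, read by core 1
    (2, 0x300, True),                      # touched by a single core only
]
labels = classify_blocks(trace)
print(labels)
```

Only the block at 0x200 would need coherency messages under this rule; read-only data and single-core data are treated as private.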

13:55
Ballast: Implementation of a Large MP-SoC on 22nm ASIC Technology

ABSTRACT. Chips have become the critical asset of technology, and increasing effort is put into designing Systems-on-Chip (SoCs) faster and more affordably. Typically, the focus of research has been on the Power, Performance and Area optimization of a specific component or sub-system. To improve the situation, we report the design effort for a complex SoC, counted from specification to ASIC tape-out, to lay out a solid reference for the community. NAME OMITTED is the first (ORGANIZATION NAME OMITTED) chip taped out on 22nm technology. It includes six sub-systems on a 15 mm² area and reaches a 1.2 GHz top speed. The design team included 24 persons and spent 21 200 person hours to reach tape-out in one calendar year from scratch. This is an outstanding achievement and sets the baseline for SoC design productivity development.

14:20
Investigating Novel 3D Modular Schemes for Large Array Topologies: Power Modeling and Prototype Feasibility.
PRESENTER: Pakon Thuphairo

ABSTRACT. This paper presents the Tiled Computing Array (TCA), a simple, uniform, 3D-mesh packaging at inter-board level, for massively parallel computers. In particular, the power modeling and practical feasibility of the system are examined. TCA eliminates the need for hierarchical rackmount structures and introduces short and immediate data channels in multiple physical orientations, allowing a more direct physical mapping of 3D computational topology to real hardware. A dedicated simulation platform has been developed, and an engineered prototype demonstrator has been built. This paper explores the feasibility of the TCA concept for current hardware technologies and systems, evaluates power modeling and validation, and highlights some of the novel design challenges associated with such a system. Evaluations of physical scalability toward large-scale systems are reported, showing that TCA is a promising approach.

13:30-15:00 Session 15B: SPCPS
Location: Auditorium
13:30
A Framework for Evaluating Connected Vehicle Security against False Data Injection Attacks

ABSTRACT. Recent developments in the smart mobility domain have transformed automobiles into networked transportation agents, helping realize new-age, large-scale intelligent transportation systems (ITS). The motivation behind such networked transportation is to improve road safety as well as traffic efficiency. In this setup, vehicles can share information about their speed and/or acceleration values among themselves, and infrastructure can share traffic signal data with them. This enables the connected vehicles (CVs) to stay informed about their surroundings while moving. However, the inter-vehicle communication channels significantly broaden the attack surface. The inter-vehicle network enables an attacker to remotely launch attacks. An attacker can cause collisions as well as hamper performance by reducing the traffic efficiency. Thus, security vulnerabilities must be taken into consideration in the early phase of the CV development cycle. To the best of our knowledge, there exists no automated simulation tool with which engineers can verify the performance of CV prototypes in the presence of an attacker. In this work, we present an automated tool flow that facilitates false data injection attack synthesis and simulation on customizable platoon structures and vehicle dynamics. This tool can be used to simulate as well as design and verify control-theoretic light-weight attack detection and mitigation algorithms for CVs.

13:55
Hybrid Post-Quantum Enhanced TLS 1.3 on Embedded Devices

ABSTRACT. Most of today's Internet connections are protected through the Transport Layer Security (TLS) protocol. Its client-server handshake mechanism provides authentication, privacy and data integrity between communicating applications. It is also the security base for 5G connectivity. While currently considered secure, the dawn of quantum computing represents a threat for TLS. In order to prepare for such an event, TLS must integrate quantum-secure (post-quantum) cryptography (PQC). Hybrid approaches that combine PQC and traditional cryptography are recommended by security agencies. Efficient PQC integration into TLS requires the exploration of a wide set of design parameters and platforms. To this end, this work presents the following contributions. First, a wide evaluation of PQC-enhanced TLS hybrid protocols, using end-to-end communication latency as the metric. Second, exploration and benchmarking on constrained embedded devices. Third, a wide traffic analysis, including the impact and behavior of PQC-enhanced hybrid TLS in real practical scenarios.

14:20
Blind Data Adversarial Bit-flip Attack against Deep Neural Networks

ABSTRACT. An adversarial bit-flip attack (BFA) on Neural Network weights can result in catastrophic accuracy degradation by flipping a very small number of bits. A major drawback of prior bit-flip attack techniques is their reliance on test data. Access to such data is frequently not possible for applications that contain sensitive or proprietary data. In this paper, we propose Blind Data Adversarial Bit-flip Attack (BDFA), a novel technique to enable BFA without any access to the training or testing data. This is achieved by optimizing for a synthetic dataset, which is engineered to match the statistics of batch normalization across different layers of the network and the targeted label. Experimental results show that BDFA could decrease the accuracy of ResNet50 significantly, from 75.96% to 13.94%, with only 4 bit flips.

13:30-15:00 Session 15C: FTET-2
Location: Room 2 - DSD
13:30
MEDA Biochip based Single-Target Fluidic Mixture Preparation with Minimum Wastage

ABSTRACT. Sample preparation is an inherent procedure of many biochemical applications, and digital microfluidic biochips (DMBs) have proved to be very effective in performing such a procedure. In a single mixing step, conventional DMBs can mix two droplets in a 1:1 ratio only. Due to this limitation, DMBs suffer from heavy fluid wastage and a large number of mixing steps. However, the next generation of DMBs, i.e., micro-electrode-dot-array (MEDA) biochips, can realize multiple mixing ratios and are able to overcome many of those limitations. In this paper, we present a heuristic-based sample preparation algorithm, specifically a mixing algorithm called Division by Factor Method for Mixing that exploits the mixing models of MEDA biochips. We propose another mixing algorithm for MEDA biochips called Single Target Waste Minimization (STWM), which minimizes the wastage of fluids and determines an optimized mixing graph. Simulation results confirm that the proposed STWM method outperforms the state-of-the-art method in terms of minimizing the number of waste fluids, reducing the total reagent usage, and minimizing the number of mixing operations.
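
To illustrate why the 1:1 restriction of conventional DMBs leads to many mixing steps, the sketch below reaches a target concentration of 3/8 using only equal-volume mixes with pure sample (concentration 1) or buffer (concentration 0). This is a generic bit-by-bit dilution illustration, not the paper's Division by Factor Method or STWM algorithm; the function names are hypothetical.

```python
from fractions import Fraction


def mix_1_to_1(a: Fraction, b: Fraction) -> Fraction:
    """Concentration after mixing two equal-volume droplets."""
    return (a + b) / 2


def dilute(bits_msb_first):
    """Build the concentration 0.b1b2...bn (binary) with one 1:1 mix per bit.

    Starting from buffer, each step mixes the current droplet with pure
    sample (bit 1) or buffer (bit 0), processing the least significant bit first.
    Returns (final concentration, number of mixing steps).
    """
    concentration = Fraction(0)
    steps = 0
    for bit in reversed(bits_msb_first):
        concentration = mix_1_to_1(concentration, Fraction(bit))
        steps += 1
    return concentration, steps


target, steps = dilute([0, 1, 1])  # binary fraction 0.011 = 3/8
print(target, steps)
```

With only 1:1 mixes, one step is needed per bit of precision in the target; richer mixing ratios, such as those the abstract attributes to MEDA biochips, can shorten such sequences.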

13:55
Unlocking Sneak Path Analysis in Memristor Based Logic Design Styles

ABSTRACT. Memristors or Resistive Random Access Memory (RRAM) are emerging non-volatile memory devices that can be used for both storage and computing. In this type of memory, the information is stored in memory cells in the form of resistance. One of the most important challenges in memristive crossbars is the existence of sneak paths, which result in erroneous reading of memory cells. Most logic-in-memory techniques have emphasized improving the logic design perspective, but have given minor importance to the sneak path issue. In this paper we show the effect of sneak paths on crossbars of various sizes, and then analyze logic design approaches such as MAGIC and MAJORITY with respect to their immunity to sneak paths. Experimental results show that with some extra overhead we can eliminate the sneak path effect in various logic design methods.

14:20
Generation of Verified Programs for In-Memory Computing

ABSTRACT. In order to overcome the von Neumann bottleneck, recently the paradigm of in-memory computing has emerged. Here, instead of transferring data from the memory to the CPU for computation, the computation is directly performed within the memory. ReRAM, a resistance-based storage device, is a promising technology for this paradigm. Based on ReRAM, the PLiM computer architecture and LiM-HDL, an HDL for specifying PLiM programs have emerged.

In this paper, we first present a novel levelization algorithm for LiM-HDL. Based on this novel algorithm, large circuits can be compiled to PLiM programs. Then, we present a verification scheme for these programs. This scheme is separated into two steps: (1) a proof of purity and (2) a proof of equivalence. Finally, in the experiments, we first apply our levelization algorithm to a well-known benchmark set, where we show that we can generate PLiM programs for large benchmarks for which existing levelization algorithms fail. Then, we apply our proposed verification scheme to these PLiM programs.

14:37
SMART: Investigating the Impact of Threshold Voltage Suppression in an In-SRAM Multiplication/Accumulation Accelerator for Accuracy Improvement in 65 nm CMOS Technology

ABSTRACT. State-of-the-art in-memory computation has recently emerged as the most promising solution to overcome design challenges related to data movement inside current computing systems. One approach to performing in-memory computation is based on the analog behavior of the data stored inside the memory cell, and various system architectures have been proposed for it. In this paper, we investigate the effect of threshold voltage suppression on the access transistors of an In-SRAM multiplication and accumulation (MAC) accelerator, with the aim of improving the bit line (bit line bar) discharge rate and thereby increasing the accuracy of the MAC operation. We provide a comprehensive analysis followed by a circuit implementation, including a Monte Carlo simulation in a 65 nm CMOS technology. We confirm the efficiency of our method (SMART) for a four-by-four-bit MAC operation. The proposed technique improves the accuracy while consuming 0.683 pJ per computation from a power supply of 1 V. Our novel technique presents less than 0.009 standard deviations for the worst-case incorrect output scenario.

13:30-15:00 Session 15D: ITS
Location: Room 3 - DSD
13:30
POLAR: Performance-aware On-device Learning Capable Programmable Processing-in-Memory Architecture for Low-Power ML Applications

ABSTRACT. Improving the performance of real-time Traffic Sign Recognition (TSR) applications using Deep Learning (DL) algorithms such as Convolutional Neural Networks (CNNs) on software platforms is challenging due to the sheer computational complexity of these algorithms. In this work, we adopt a combined hardware-software approach to address this issue. We introduce a data-centric Processing-in-Memory (PIM) architecture that leverages Look-up-Table (LUT)-based processing for minimal data movement and superior performance and efficiency. Despite the superior performance, the limited memory available in PIM makes it complex to deploy deep CNNs. We propose merging CNN layers to meet the limited resource constraints. One specific challenge in TSR is the continuous change in the deployed environment, which causes a CNN trained on static data to degrade in performance over time. To address these challenges, we introduce a lightweight, PIM-compatible, and performance-aware Generative Adversarial Network (GAN)-based on-device learning technique. The resulting compact CNN on the PIM architecture attains data-level parallelism, reduces pipelining delays, and makes on-device training and inference easier. Evaluation is performed on multiple state-of-the-art DL networks such as AlexNet, LeNet, and ResNet using the German Traffic Sign Recognition Benchmark (GTSRB) dataset and the Belgium Traffic Sign Dataset (BTSD). The proposed learning technique achieves maximum accuracies of 92.8% and 89.27% on the GTSRB and BTSD datasets, respectively. The proposed mechanism also maintains an average accuracy above 85% despite changes in the environment for all the CNNs deployed on the PIM accelerator.

13:55
A resolution method in case of air congestion: rerouting and/or ground holding approach

ABSTRACT. This paper deals with the problem of congestion in air transportation systems. Congestion occurs when demand for the infrastructure exceeds its capacity, with delays as one of the main drawbacks. Congestion can be resolved by applying ground holding and/or flight rerouting operations. Such procedures often imply delays and a series of significant reactionary costs for all the operators in air traffic management. In this paper, the authors propose a method to resolve congestion by means of an optimization algorithm combining ground holding and/or rerouting operations. Aircraft rerouting is performed by a tailored shortest path algorithm. A real portion of the air traffic network and a real slice of the air traffic flow have been considered for the experiments.

14:20
Optimizing UAV Location Awareness Telemetry for Low Power Wide Area Network

ABSTRACT. The latest trends show an increasing use of UAVs in urban and rural areas for various tasks such as end-to-end package delivery, search and rescue missions, or crop disease detection. This has led to an exponential growth in traffic in the sky above us, accompanied by an increase in safety-related events such as mid-air collisions due to a lack of location awareness between users. In modern manned aviation, this problem is addressed by using high-powered transponders (e.g., ADS-B) that transmit real-time position information and help air traffic controllers (ATC) raise the level of safety among aircraft. On the other hand, despite using state-of-the-art equipment, most battery-powered drones do not have communication systems that are able to provide long-range coverage. The main focus of this research is to optimize transmitted location awareness payloads so that they can be used in energy efficient/bandwidth IoT networks like LoRaWAN without violating the fair usage policy of this network. In this paper, an optimization of the position information payload is proposed, demonstrating that LoRaWAN can be utilized for a real-time location exchange network, sending four bytes of payload at a rate of 1 Hz, without compromising the level of accuracy needed to guarantee safe operation.

15:00-15:30Coffee Break