DSD2021: EUROMICRO CONFERENCE ON DIGITAL SYSTEMS DESIGN 2021
PROGRAM FOR FRIDAY, SEPTEMBER 3RD

09:00-10:00 Session 15: KEYNOTE 5 DSD

Keynote 5 - Dr. Danilo Pietro Pau - Technical director, IEEE and ST Fellow, System Research and Applications, STMicroelectronics (Italy) - Characterizing deeply quantized neural networks for in-sensor computing

10:00-11:30 Session 16A: APPLICATIONS
10:00
High Speed Implementation of the Deformable Shape Tracking Face Alignment Algorithm

ABSTRACT. The 2D facial landmark alignment method, implemented in C++ in the open-source libraries DLIB and Deformable Shape Tracking (DEST), is used in several applications such as driver drowsiness detection, recognition of facial expressions, etc. The most challenging of these applications require fast processing of video frames. Therefore, the alignment of the facial landmarks in a single video frame has to be performed with the minimum possible latency without precision loss. In this paper, the DEST implementation of the face alignment method that is based on regression trees is heavily restructured to reduce latency. The resulting face alignment predictor is implemented in C. The elimination of multiple nested routine calls, excessive argument copying, type conversions and integrity checks leads to a software implementation that is 240 times faster than the one provided in the DEST library. Moreover, the structure of the new face alignment predictor is appropriate for hardware implementation on a Field Programmable Gate Array (FPGA) for further acceleration.

10:20
Highly Parallel Sample Rate Converter for Space Telemetry Transmitters

ABSTRACT. In recent years, following the rapid innovation guidelines of most space agencies, there have been major advances in satellite transmitter technologies. In particular, standards with very high performance and flexibility have been introduced (e.g. DVB-S2, DVB-S2X and CCSDS 131.2-B-1) to maximize efficiency and throughput. Moreover, the gradual adoption of higher transmission bands, which started with the S-band, moved to the X-band and now widely employs the Ku-/Ka-bands as well, is increasing the usable system bandwidths. For all these reasons, system integrators are pushing to develop architectures that can also dynamically change the payload symbol rate, and thus the bandwidth, to cover different target scenarios. Considering these wide bands and the fact that DAC clocks must be fixed to minimize jitter, the dynamic symbol rate may be achieved through the use of fractional sample rate interpolators. In this paper, the architecture of a massively parallel sample rate converter, with no backpressure availability on the modulator block, is presented. The analysis takes into account the problem of quantization effects in asynchronous sample rate converters, proposing a novel architecture to address this specific case. Implementation considerations and results are reported for the new Xilinx Space-Grade Kintex UltraScale XQRKU-060, considering an SRRC IQ-modulated signal compliant with the DVB-S2 standard.

10:40
Single-Frame Direct Reflectance Estimation With Indirect Time-of-Flight Cameras

ABSTRACT. Computer vision algorithms are influenced by variations in lighting conditions. Images that are independent of lighting conditions have the potential to improve the training of machine learning algorithms, especially in tasks like material and object classification, including face recognition. For cameras with an active illumination source, such as Time-of-Flight (ToF) cameras, the largest variation is caused by the distance between the camera and the object. ToF cameras provide dense 3D point clouds. In addition, they are capable of measuring infrared (IR) images. As ToF pixels are often larger than IR camera pixels, they are rarely considered for 2D-only imaging. However, the additional features of a ToF camera can be used to realize methods to record distance-normalised IR images. In this paper, we explore methods to extract direct reflectance estimates from ToF measurements. We propose two novel methods relying on coded modulation (CM) and compare them to a method that can convert data from the state-of-the-art continuous wave (CW) measurement method. With the CM-based methods we are able to realize the normalization in a single-frame measurement, compared to the four frames recorded by the CW method. All three methods are evaluated based on simulation results and in-laboratory measurements. We are able to demonstrate that our novel methods, relying on CM, can achieve the desired measurement behaviour.

11:00
Evaluation of Time Series Clustering in Embedded Sensor Platform

ABSTRACT. Clustering is one of the major problems in the study of time series data, yet solutions on embedded platforms are almost absent because of the limited computational resources at the edge. In this paper, two typical clustering algorithms, K-means and the Self-Organizing Map (SOM), together with the Euclidean distance measure and dynamic time warping (DTW), are studied to verify their feasibility on an embedded sensor platform. For the given datasets, the models are trained on a computer and moved to an ESP32 microprocessor for inference. It is found that the SOM achieves accuracy similar to K-means, while its inference process takes longer. The experimental results show that a sample with 300 data points can be clustered into 12 clusters within 40 ms by the SOM with DTW model, while the fastest model, K-means with Euclidean distance, runs in around 2 ms. In other words, it can consume the data from 40 sensors in 680 ms, which can be scheduled alongside the real-time data acquisition and transmission tasks. The gathered performance figures support the feasibility of deploying time series clustering models on embedded sensor platforms.
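As a point of reference for the distance measures named above, the sketch below shows a minimal dynamic-programming DTW distance in C. It is a generic textbook formulation, not the authors' ESP32 implementation; the function name and the two-row buffer layout are illustrative choices only.

```c
#include <float.h>
#include <math.h>
#include <stdlib.h>

/* Generic dynamic-programming DTW distance between two 1-D series,
   using a rolling two-row cost buffer (illustrative sketch only). */
double dtw_distance(const double *a, int n, const double *b, int m)
{
    double *prev = malloc((size_t)(m + 1) * sizeof *prev);
    double *curr = malloc((size_t)(m + 1) * sizeof *curr);
    if (!prev || !curr) { free(prev); free(curr); return -1.0; }

    for (int j = 0; j <= m; ++j) prev[j] = DBL_MAX;   /* DP boundary */
    prev[0] = 0.0;

    for (int i = 1; i <= n; ++i) {
        curr[0] = DBL_MAX;
        for (int j = 1; j <= m; ++j) {
            double d = fabs(a[i - 1] - b[j - 1]);         /* local cost        */
            double best = prev[j];                        /* step from (i-1,j) */
            if (curr[j - 1] < best) best = curr[j - 1];   /* step from (i,j-1) */
            if (prev[j - 1] < best) best = prev[j - 1];   /* step from (i-1,j-1) */
            curr[j] = d + best;
        }
        double *tmp = prev; prev = curr; curr = tmp;      /* roll the rows */
    }
    double result = prev[m];
    free(prev);
    free(curr);
    return result;
}
```

K-means or SOM clustering with DTW simply replaces the Euclidean distance call with a function of this kind when comparing a sample against the cluster prototypes, which is where the roughly 40 ms vs. 2 ms gap reported above comes from.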

10:00-11:30 Session 16B: FRAMEWORKS
10:00
A Deployment Framework for Quality-Sensitive Applications in Resource-Constrained Dynamic Environments

ABSTRACT. Traditional embedded systems and recent platforms used in emerging computing paradigms (e.g., fog computing) have resource limits and require their applications and services to be dynamically added (i.e., deployed) and removed at run-time. These applications often have non-functional (quality) requirements (e.g., end-to-end latency) which are only satisfied when sufficient resources are allocated to them. Hence, a run-time decision-maker is needed to optimize the deployments, in terms of resource budgets that are allocated to applications. Additionally, computing platforms have become heterogeneous in terms of their resources and the applications they execute. However, the existing deployment solutions are limited to specific resources and services. In this paper, we propose a run-time deployment framework that is more flexible in defining constraints and optimization goals and works with more heterogeneous resources and resource models than existing solutions. The framework is implemented on an embedded platform as a proof of concept.

10:30
ParalOS: A Scheduling & Memory Management Framework for Heterogeneous VPUs

ABSTRACT. Embedded systems are presented today with the challenge of rapidly evolving application diversity accompanied by increased programming and computational complexity. Customised heterogeneous System-on-Chip (SoC) processors emerge as an attractive HW solution in various application domains; however, they still require sophisticated SW development to provide efficient implementations, at the expense of slower adaptation to algorithmic changes. In this context, the current paper proposes a framework for accelerating the SW development of computationally intensive applications on Vision Processing Units (VPUs), while still enabling the exploitation of their full HW potential via low-level kernel optimisations. Our framework is tailored for heterogeneous architectures and integrates a dynamic task scheduler, a novel scratchpad memory management scheme, I/O & inter-process communication techniques, as well as a visual profiler. We evaluate our work on the Intel Movidius Myriad VPUs using synthetic benchmarks and real-world applications, which vary from Convolutional Neural Networks (CNNs) to computer vision algorithms. In terms of execution time, our results range from a limited ∼8% performance overhead vs optimised CNN programs to a 4.2x performance gain in content-dependent applications. We achieve up to a 33% decrease in scratchpad memory usage vs well-established memory allocators and up to 6x smaller inter-process communication time.

11:00
Scheduling Persistent and Fully Cooperative Instructions

ABSTRACT. A parallel, distributed two-level control system has been adopted in streaming application accelerators that implement atomic vector operations. Each instruction of such an architecture deals with one aspect (arithmetic, interconnect, storage, etc.) of an atomic vector operation. Such instructions are persistent and fully cooperative. Their lifetimes vary because of the vector size and the degree of parallelism. More complex constraints are also required to express the cooperation among these instructions. Conventional instruction behavior models are no longer suitable for such instructions. Therefore, we develop a novel instruction behavior model to address the scheduling aspect of the instruction set required by such an architecture. Based on the behavior model, we formally define the scheduling problem and formulate it as a constraint satisfaction optimization problem (CSOP). However, the naive CSOP formulation quickly becomes unscalable. Thus, a heuristic-enhanced scheduling algorithm is introduced to make the CSOP approach scalable. The enhanced algorithm's scalability is validated by a large set of experiments varying in problem size.

11:30-13:00 Session 17A: HW-SW CODESIGN AND RECONFIGURABILITY
11:30
An Investigation of Dynamic Partial Reconfiguration Offloading in Hard Real-Time Systems

ABSTRACT. Nowadays, complex Cyber-Physical Systems (CPSs) often exploit so-called computing at the edge (i.e., edge computing), where the Dynamic Partial Reconfiguration (DPR, also known as Dynamic Function eXchange or Partial Reconfiguration) feature has proven efficient in facing the adaptivity challenges typical of the CPS domain. In this context, the increase in both platform heterogeneity and required customization is leading to a growing number of DPR requests per task, for which industry is enhancing the reconfiguration controllers with the capability to offload more than one DPR request, in turn allowing pipelining between multiple DPR processes and application execution. Several works in the literature have introduced the DPR process into the hard real-time system domain, however without considering multiple DPR offloading capabilities. In this paper, we provide a theoretical analysis and a practical evaluation of multiple DPR offloading in the context of hard real-time systems. In particular, through a motivational case study, supported by experimental activities conducted on the Zynq-7000 SoC, we show how offloading multiple DPR requests can provide benefits with respect to the traditional approach of one DPR request at a time.

12:00
A Hardware/Software Concept for Partial Logic Updates of Embedded Soft Processors at Runtime

ABSTRACT. Embedded systems are built from various hardware components and execute software on one or more microcontroller units (MCU). These MCUs usually contain a fixed integrated circuit, thus disallowing modifications to their logic at runtime. While this keeps the instruction set architecture (ISA) fixed as well, it leaves the software as the only flexible part in the system. But what if the MCU logic could be easily changed at runtime in order to fix bugs, or if the ISA could be extended on-the-fly in order to introduce application-specific instructions and features on demand? This work demonstrates a concept for introducing more hardware flexibility through application-specific MCU modifications. To this end, the MCU is implemented as a soft core on a field-programmable gate array (FPGA) and we reconfigure its logic with support of the operating system (OS) running on it. The reconfiguration happens on-the-fly, so no interruption of the application code or even a system restart is required. To achieve this, (i) the MCU pipeline is specially designed for extensibility by new instructions, and (ii) the FPGA is selected to support partial self-reconfiguration of its logic cells at runtime. As long as an instruction is not yet part of the ISA, the OS supports its emulation to provide a consistent interface for applications. Apart from that, no special compiler support is required, but the application must provide either the emulation code or a hardware description for adding the required logic. For a proof of concept, we use a RISC-V based MCU on a Xilinx Artix-7 FPGA, and for evaluating the general benefit of our approach we use an algorithm that is costly when executed with the original ISA but fast with application-specific instructions added at runtime. The experimental evaluation also shows that the on-the-fly hardware update does not disrupt or compromise the software execution flow.

12:30
Metrics for the Evaluation of Approximate Sequential Streaming Circuits

ABSTRACT. The design of energy- and area-efficient systems is important for modern technology. One approach to increase these efficiencies is approximate computing. During the last years, efficient approximations for combinational hardware components, e.g., adders or multipliers, have been proposed.

We focus on quality metrics for the evaluation of approximations in sequential circuits with streaming inputs and outputs. We propose the use of sequence distance metrics for analysing the sequential behavior after approximation and compare their performance to other metrics such as mean errors and accumulated errors. We present case studies on several exemplary circuits. The experimental results show that our sequential metrics provide information beyond the common mean errors and, for stochastic applications, yield the best guidance in selecting approximate sequential circuits.
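For orientation, two of the common error metrics referred to above are typically defined over an output stream of length $N$ as follows; these are standard textbook definitions and not necessarily the exact formulations used by the authors:

\[
\mathrm{ME} = \frac{1}{N}\sum_{t=1}^{N}\left|y_t^{\mathrm{approx}} - y_t^{\mathrm{exact}}\right|,
\qquad
E_{\mathrm{acc}} = \sum_{t=1}^{N}\left(y_t^{\mathrm{approx}} - y_t^{\mathrm{exact}}\right).
\]

Sequence distance metrics, by contrast, compare the approximate and exact output sequences as a whole (e.g., via an alignment-based distance) rather than aggregating per-sample errors.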

11:30-13:00 Session 17B: MODELING AND SIMULATION 2
11:30
To Pin or Not to Pin: Asserting the Scalability of QEMU Parallel Implementation

ABSTRACT. Due to its speed in cross-executing sequential code, dynamic binary translation is the unchallenged technology for full system-level simulation. Among the translators, QEMU has become the de facto solution. It introduced parallel host execution of the target cores a few years ago for the ARM instruction set architecture, and this support is now also available, among others, for RISC-V. Given the popularity of these instruction sets in multi- and many-core systems, assessing the scalability of their parallel implementation makes sense. In this paper, we use a subset of the PARSEC benchmarks to measure the execution time of QEMU's parallel implementation, to which we added the ability to pin a target processor to a host core or hardware thread. We report the results of a wealth of experiments we performed on a 16-core/32-thread x86-64 SMP machine. They show that the support of parallelism in QEMU scales well, and that, somewhat counter-intuitively, pinning does not improve performance.
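For readers unfamiliar with pinning, the sketch below shows how a thread (such as a vCPU thread in a parallel simulator) can be bound to a single host CPU on Linux using the standard pthread affinity API. This is a generic illustration, not code taken from the modified QEMU described in the abstract.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Bind the calling thread to one host CPU (generic Linux/pthreads sketch). */
static int pin_current_thread(int host_cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(host_cpu, &set);
    /* Returns 0 on success, an error number otherwise. */
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void)
{
    int err = pin_current_thread(0);   /* pin this thread to host CPU 0 */
    if (err != 0)
        fprintf(stderr, "pinning failed: error %d\n", err);
    /* ... the pinned thread's simulation work would run here ... */
    return 0;
}
```

Compile with -pthread; per-vCPU pinning, as studied in the paper, applies the same call inside each vCPU thread instead of the whole process.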

12:00
Gain and Pain of a Reliable Delay Model

ABSTRACT. State-of-the-art digital circuit design tools almost exclusively rely on pure and inertial delay for timing simulations. While these provide reasonable estimates at very low execution time in the average case, their ability to cover complex signal traces is limited. Research has provided the dynamic Involution Delay Model (IDM) as a promising alternative, which was shown (i) to depict reality more closely and, recently, (ii) to be compatible with modern simulation suites. In this paper we complement these encouraging results by experimentally exploring the behavioral coverage for more advanced circuits.

In detail, we apply the IDM to three simple circuits (a combinatorial loop, an SR latch and an adder), interpret the delivered results and evaluate the overhead in realistic settings. Comparisons to digital (inertial delay) and analog (SPICE) simulations reveal that the IDM delivers very fine-grained results, which match analog simulations very closely. Moreover, severe shortcomings of inertial delay become apparent in our simulations, as it fails to depict a range of malicious behaviors. Overall, the Involution Delay Model hence represents a viable upgrade to the delay models available in modern digital timing simulation tools.

12:20
Heterogeneous Communication Virtualisation for Distributed Embedded Applications

ABSTRACT. Distributed embedded applications (DEAs) are typically implemented on diverse embedded nodes interconnected through communication network(s) to exchange data and control information to achieve the desired functionality. Conventional approaches of utilising a single large-bandwidth link in a distributed system are not efficient in large DEAs owing to diverse requirements and factors like cost, reliability, scalability and criticality, among others. Heterogeneous communication is a promising approach in DEAs, where the diverse nature of the underlying protocols (wired/wireless, synchronous/asynchronous, multiple access modes and others) can be leveraged to meet such requirements, in addition to benefits like aggregated bandwidth and robustness. However, utilising them 'directly' places significant complexity on the application, as it needs to dynamically evaluate the channels and utilise different protocol structures for each case. Virtualising the communication channels would present a unified interface to the application by abstracting away low-level details, similar to virtualisation applied in compute architectures. However, unlike architecture virtualisation, virtualising heterogeneous communication, particularly for resource-constrained device networks, involves unique challenges imposed by the physical (wired/wireless) and logical domains (limited bandwidth, small payloads, protocols, channel access schemes, etc.), which need to be concurrently evaluated to optimise the communication system. This paper presents a model and an optimal transmission strategy as a proof of concept for deploying heterogeneous communication in DEAs. The model is described at an abstracted level while capturing transmission parameters of multiple channels, which are then optimised to meet the application's communication requirements. The model and the optimisation method are validated through simulation and a practical case study.

12:50
NMPO: Near-Memory Computing Profiling and Offloading

ABSTRACT. Real-world applications now process big data sets and are often bottlenecked by the data movement between the compute units and the main memory. Near-memory computing (NMC), a modern data-centric computational paradigm, can alleviate these bottlenecks by performing computation close to where the data resides. The lack of available NMC systems makes simulators the primary evaluation methodology for performance estimation. However, simulators are usually time-consuming, and methodologies that can reduce this overhead would help in the early-stage design of NMC systems. This work proposes Near-Memory computing Profiling and Offloading (NMPO), a high-level framework capable of predicting NMC offloading suitability employing an ensemble machine learning model. NMPO predicts NMC profitability with an accuracy of 85.6% and, compared to prior works, can reduce the prediction time by 2 to 3 orders of magnitude by using hardware-dependent application features.

15:30-16:30 Session 19A: Dependability, Testing and Fault Tolerance in Digital Systems 1
15:30
Automated Debugging-Aware Visualization Technique for SystemC HLS Designs

ABSTRACT. High-Level Synthesis (HLS) using the system-level modeling language SystemC at the Electronic System Level (ESL) is being increasingly adopted by the semiconductor industry to raise design productivity. However, errors in the high-level design can propagate down to the low-level implementation and become very costly to fix. Thus, SystemC HLS verification and debugging are necessary and important. While monitoring simulation behavior is a straightforward solution to debug a given design in the case of an error (reported by verification), it can become a very time-consuming process, as a large amount of data that is not necessarily relevant to the source of the error is analyzed. In this paper, we propose a fast and automated debugging-aware visualization approach, enabling designers to monitor the portion of a given SystemC HLS design's simulation behavior that is related to the erroneous output(s). Experimental results on an extensive set of standard SystemC HLS designs show the effectiveness of our approach in localizing the designs' simulation behavior in terms of the number of visualized variables. In comparison to traditional visualization methods, our proposed approach obtains up to 96% and 91% reduction in the search space for single and multiple faulty outputs, respectively.

16:00
Search Strategy of Large Nonlinear Block Codes

ABSTRACT. Test pattern compression techniques reduce the amount of transferred data while the flexibility of the decompressed patterns remains high. The flexibility can be measured by the guaranteed number of specified test pattern bits, which has to be equal to or greater than the number of care bits in the pattern. This paper addresses the problem of test pattern compression and decompression using nonlinear codes. Test patterns can be encoded using a linear code that leads to an XOR network that decompresses the compressed patterns transferred from a tester. This XOR network can be completed or replaced by a standard (nonlinear) combinational network with substantially improved compression efficiency. The problem of finding the code was formulated as a clique cover problem. The bottleneck of this approach lies in the computational complexity and the memory capacity needed to temporarily retain intermediate results of the clique-coverage search attempts. The paper proposes a method that substantially reduces the computational time needed to obtain a robust nonlinear code with decompression ability. It shows that random search algorithms cannot reach the code parameters obtained by the restricted iterative search proposed in the paper. Additional knowledge concerning the check-bit truth table characteristics was used to avoid unsuccessful investigations. The proposed decompressor schemes have higher efficiency than previously known decompressor constructions. On the other hand, it is challenging to find other heuristics that provide efficient codes for a broader class of code parameters than the proposed solution.

15:30-16:30 Session 19B: Future Trends in Emerging Technologies (FTET)
15:30
Design for Restricted-Area and Fast Dilution using Programmable Microfluidic Device based Lab-on-a-Chip

ABSTRACT. The microfluidic lab-on-a-chip has emerged as a new technology for implementing biochemical protocols on small-sized portable devices targeting low-cost medical diagnostics. Among various efforts in the fabrication of such chips, the programmable microfluidic device (PMD) is a relatively new technology for the implementation of flow-based lab-on-a-chips. A PMD chip is suitable for automation due to its symmetric nature. In order to implement a bioprotocol on such a reconfigurable device, it is crucial to automate sample preparation on the chip as well. Sample preparation, which is a front-end process to produce the desired target concentrations of the input reagent fluid, plays a pivotal role in every bioassay or bioprotocol. In this paper, first, a method referred to as dilution algorithm in two steps (DATS) is discussed, which needs only two diluting operations to achieve any target concentration. Then, we explain another method, called dilution algorithm on a small dilution area (DASDA), which needs less area than DATS. Finally, we propose a heuristic for efficient dilution of biochemical fluids using a PMD chip, referred to as dilution algorithm in a restricted dilution area (DARDA), that produces a more accurate (with less error) target concentration value on a restricted area of the PMD chip in a shorter mixing time. Simulation results reveal that DARDA outperforms a state-of-the-art dilution algorithm applicable to PMD chips in terms of three performance parameters, namely mixing time, mixing area and error in target concentration.

16:00
Combining SWAPs and Remote CNOT Gates for Quantum Circuit Transformation

ABSTRACT. Quantum computers offer enormous speed advantages over their classical counterparts. Still, optimization of quantum circuits is necessary to further increase their potential. Additionally, physical realizations of quantum computers place restrictions on quantum circuits regarding the available quantum gates. In order to satisfy these restrictions, non-native gates need to be expressed as an equivalent cascade of natively available quantum gates, which induces a mapping overhead. Two complementary approaches to this problem are to move the qubits around (using SWAP gates) or to apply so-called remote gates, i.e. pre-computed cascades of native gates which keep the qubit placement. In this paper, we explore how combinations of movements and remote gates can be employed to reduce the required overhead regarding the number of native gates as well as the circuit depth. We also discuss ways to find out which qubits to address with the movements in order to optimize these metrics. Our general evaluation is supplemented by evaluations on two IBM quantum computer architectures to show how quantum circuits can be optimized by the presented patterns.
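To illustrate where the mapping overhead mentioned above comes from, recall the standard decomposition of a SWAP gate into three CNOTs (a well-known identity, given here for orientation only):

\[
\mathrm{SWAP}_{q_1,q_2} \;=\; \mathrm{CNOT}_{q_1 \to q_2}\,\cdot\,\mathrm{CNOT}_{q_2 \to q_1}\,\cdot\,\mathrm{CNOT}_{q_1 \to q_2}.
\]

Each qubit movement therefore adds three native two-qubit gates (plus possible single-qubit gates to reverse the CNOT direction on hardware with directed couplings), a cost that approaches combining movements with remote gates try to reduce.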

16:30-17:30 Session 20A: Dependability, Testing and Fault Tolerance in Digital Systems 2
16:30
Maximizing the Switching Activity of Different Modules Within a Processor Core via Evolutionary Techniques

ABSTRACT. One key aspect to be considered during device testing is the minimization of the switching activity of the circuit under test (CUT), thus avoiding possible problems stemming from overheating it. But there are also scenarios where maximizing the switching activity of certain circuit modules can prove useful (e.g., during burn-in) in order to exercise the circuit under extreme operating conditions in terms of temperature (and temperature gradients). Resorting to a functional approach based on Software-Based Self-Test guarantees that the high induced activity cannot damage the CUT nor produce any yield loss. However, the generation of effective and suitable test programs remains a challenging task. In this paper we consider a scenario where the modules to be stressed are sub-modules of a fully pipelined processor. We present a technique, based on an evolutionary approach, able to automatically generate stress test programs, i.e., sequences of instructions achieving a high toggling activity in the target module. The processor we used for our experiments is the OpenRISC 1200. Results demonstrate that the proposed method is effective in achieving a high value of sustained toggling activity.

17:00
An Automated Setup for Large-Scale Simulation-Based Fault-Injection Experiments on Asynchronous Digital Circuits

ABSTRACT. Experimental fault injection is an essential tool in the assessment and verification of fault-tolerance properties. Often, in these experiments it is impossible to reasonably cover the huge parameter space spanned by target state and fault parameters, and compromises or restrictions must be made. This is even more pronounced for asynchronous circuits where a convenient discretization of time through a synchronous clock is not possible. In this paper we present a fault-injection toolset that allows for a very efficient injection and data processing, thus bringing studies with many billions of meaningful injections into asynchronous targets within reach. The key ingredients of our solution are an auto-setup feature capable of optimizing parameter values, seamless distribution of the simulation load to many host computers, and efficient arrangement of the important settings and readings in a database. We will use the example of a comparative study of different asynchronous pipeline styles to motivate the need for such an approach and illustrate its benefits.

16:30-17:30 Session 20B: Security and Privacy of Cyber-Physical Systems (SPCPS) 1
16:30
Towards Post-Quantum Enhanced Identity-based Encryption

ABSTRACT. Identity-based encryption (IBE) is a type of public-key encryption (PKE) that employs an identifier as the basis for the encryption mechanism. Thus, the communicating parties are able to encrypt messages (or verify signatures) without any prior setup between users or distribution of user certificates. This is especially relevant in many mission-critical applications, usually characterized by constrained end-points. MIKEY-SAKKE uses this concept to build a highly scalable protocol able to secure cross-platform multimedia communications. However, MIKEY-SAKKE is based on cryptographic primitives that will no longer be secure when sufficiently powerful quantum computers are built. To this end, this paper presents three contributions. First, it evaluates the performance of MIKEY-SAKKE on constrained embedded devices. Second, it extracts the requirements that post-quantum cryptographic primitives should meet in order to allow a plug-and-play replacement of the threatened security primitives with quantum-secure primitives. Third, it benchmarks the different post-quantum primitives running in the NIST standardization process and analyses their impact on the quantum-secure MIKEY-SAKKE. The results show that none of the NIST finalists perfectly meet all the specified requirements to achieve a post-quantum plug-and-play approach. However, the different combinations of post-quantum KEMs and signature schemes offer a range of trade-offs compared to SAKKE and ECCSI, either having slower computation or larger keys and ciphertexts/signatures, or both.

17:00
Digital Forensics, Video Forgery Recognition, for Cyber Security Systems

ABSTRACT. The recent advances in mobile video recording systems and the ease of access to them have led to an increased number of videos being streamed and uploaded on the internet daily. This has had a major impact on the expansion rate of the Internet of Things with the addition of affordable security systems that capture and stream video content. These systems regularly come with minimal security standards despite the fact that their goal is to provide court evidence. Criminals are often able to easily access, steal and manipulate videos to the point where it is often impossible to tell the difference between an original and a forged one. Moreover, the advances in machine learning have led to an increase of computer-generated videos known as deep fakes that are used for humiliation, sabotage, threats and propaganda in every aspect of modern life, whether social, political or military. As expected, the integrity of a video cannot be taken for granted and should in many cases be evaluated. The scientific branch of Computer Forensics has proposed various methods, both active and passive, for protecting the confidentiality and integrity of videos either at the source or on acquisition. In spite of that, the need for more robust and effective methods grows as videos increase in size and number while compression algorithms become increasingly efficient. The goal of this paper is to propose a new forgery detection method based on the characteristics of Dense Optical Flow. This method is applied on a Raspberry Pi 4B, which serves to simulate an IoT environment with a low-cost device that is capable of recognising forgery in static CCTV-captured videos. The tool is tested using two datasets of forged videos and proves that this method is effective in detecting copy-move, insertion and deletion forgeries while being robust against medium rates of compression and noise. Ultimately, this work proposes a methodology for automating the identification process as well as recommendations for future research.

17:20
Revealing the secrets of Spiking Neural Networks: the case of Izhikevich neuron

ABSTRACT. Spiking Neural Networks (SNNs) are a strong candidate for the future of machine learning applications. SNNs can obtain the same accuracy as complex deep learning networks while using a fraction of their power. As a result, an increase in the popularity of SNNs is expected in the near future for Cyber-Physical Systems, especially in the Internet of Things (IoT) segment. However, SNNs work very differently from conventional neural network architectures. Consequently, applying SNNs in the field might introduce new, unexpected security vulnerabilities. This paper explores and identifies potential sources of information leakage for the Izhikevich neuron, which is a popular neuron model used in digital implementations of SNNs. Simulations and experiments on actual neuromorphic hardware show that timing and power can be used to infer important information about the internal functionality of the network. Additionally, it is demonstrated that a reverse engineering attack is practical when both power and timing information are used.
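For reference, the Izhikevich neuron mentioned above is commonly stated in its standard two-variable form (given here for orientation; the paper's digital implementation may discretize or parameterize it differently):

\[
\frac{dv}{dt} = 0.04\,v^{2} + 5v + 140 - u + I,
\qquad
\frac{du}{dt} = a\,(b\,v - u),
\]
with the after-spike reset
\[
\text{if } v \geq 30\ \mathrm{mV}: \quad v \leftarrow c, \quad u \leftarrow u + d,
\]
where $v$ is the membrane potential, $u$ a recovery variable, $I$ the input current, and $a, b, c, d$ the parameters that select the firing pattern. Side channels such as timing and power can leak information precisely because these parameters shape when and how often a digital implementation spikes.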

17:30-18:30 Session 21A: Intelligent Transportation Systems (ITS) 3 AND Mixed-Criticality System Design (MCSDIA)
17:30
Runnable Configuration in Mixed Classic/Adaptive AUTOSAR Systems by Leveraging Nondeterminism

ABSTRACT. Classic AUTOSAR is a de facto standard for implementing automotive safety-critical systems. Building on its success, Adaptive AUTOSAR has been recently introduced to target applications that need both the high-performance, low real-time capability of an Adaptive platform and the high real-time, low-performance capability of a Classic one. The Adaptive part of these applications is mostly not required to conform to the highest ASIL C/D safety requirements, where computations that depend on each other are always executed in the same order before a certain time point. Therefore, this paper proposes that deterministic execution also be relaxed for the Classic part of the applications.

18:00
Enabling Unit Testing of Already-Integrated AI-based Software Systems: The Case of Apollo for Autonomous Driving

ABSTRACT. The advanced AI-based software used for autonomous driving comprises multiple highly-coupled modules that are data and control dependent. Deploying those already-integrated software frameworks makes unit testing, a fundamental step in the validation process of critical software, very challenging in safety-critical systems. To tackle this issue, in this paper, we show the steps we followed to develop standalone versions of the modules in an industry-level autonomous driving framework (Apollo) by applying several modifications to its architectural design. We show how the standalone modules have the same functional behavior as their integrated counterpart modules. We exemplify the benefits of standalone modules by performing incremental analysis of the software timing requirements of each module running on a heterogeneous System on Chip (SoC). This is a mandatory step to consolidate and integrate software modules guaranteeing timing constraints (e.g. related to freedom from interference) while maximizing SoC utilization.

17:30-18:30 Session 21B: Dependability, Testing and Fault Tolerance in Digital Systems 3
17:30
Automatic Design of Fault-Tolerant Systems for VHDL and SRAM-based FPGAs

ABSTRACT. This paper presents and evaluates the possibility of automatically designing fault-tolerant systems from unhardened systems. We present an overview of our toolkit with its three main components: 1) fault-tolerant structure insertion (which we call helpers); 2) fault-tolerant structure selection (called guiders); and 3) automatic testbed generation, incorporating advanced techniques to accelerate the test and evaluation. Our approach targets complete independence of the HW description language and its abstraction level; however, for our case study, we focus on VHDL in combination with fine-grained n-modular redundancy. In the case study part of this paper, we show that it is clearly beneficial to select a proper fault-tolerance method for each partition separately. Three experimental systems were produced using our method, two of which achieved a better reliability parameter while even lowering their chip area, compared to static allocation of an equivalent fault-tolerance technique. In the case study, we target the best median time to failure, the so-called t50; however, our method does not depend on this parameter, and an arbitrary optimization target can be selected, as long as it is measurable.

17:50
Reliability Analysis of the FPGA Control System with Reconfiguration Hardening

ABSTRACT. Computing power is important in space applications, where the utilization of FPGAs is very useful. However, FPGAs are susceptible to the effects of radiation, which can cause malfunctions. Particularly dangerous are configuration memory faults known as Single Event Upsets (SEUs), which can lead to failure of the entire system. Therefore, fault-tolerance techniques are used to prevent system failures. The main motivation for the use of these techniques is to maintain the correct behavior of the system despite the occurrence of faults. In addition to fault masking, which only delays system failures due to fault accumulation, fault mitigation by partial dynamic reconfiguration was used. Everything needed is provided by the reconfiguration controller, which is a necessary additional component of the entire system. It is also very convenient to be able to detect the occurrence of a fault in the system, so that the system need not be restored unnecessarily, which saves needless work by the controller. The key part is the evaluation of the fault resilience of the system using reconfiguration of the damaged parts. In all experiments, an experimental platform was used that emulates an electromechanical system, consisting of a robot control unit on an FPGA and a simulation of its behavior on a PC. Artificial faults have been injected into this controller on the FPGA. Furthermore, reliability estimation data, which was collected from our previously published simulations, was verified on a real system in our current experiments.

18:10
Implementation-Independent Test Generation for a Large Class of Faults in RISC Processor Modules

ABSTRACT. In this paper, a novel concept is proposed for generating tests of RISC processors relying on pure functional information without knowledge of the structural implementation details. For the first time, the effect-cause idea is applied for test generation, instead of the traditional cause-effect fault driven approach. The generated tests will be better suitable for fault diagnosis, and the method allows for extending the classes of faults, detected by tests. The motivation of this work is twofold. First, to give the possibility of test generation in the first stages of design and to allow for a systematic way to assess the test quality of a functional design. Second, to extend the fault class beyond the traditional Stuck-at-Fault (SAF) class, such as functional faults similar to that of covered by the March algorithm in memory testing. A special target of the paper is to use the proposed functional fault model for implementation-independent generation of test sequences for delay faults when the list of paths is not given. To reduce the complexity of test generation, two-level partitioning of the system is used, first, into Modules Under Test (MUT), and second, each MUT into control (CP) and data parts (DP). A novel constraints-based high-level functional fault model is proposed for the CP. For the DP, pseudo-exhaustive patterns are used for testing. By experimental research it was demonstrated that the test quality of the proposed implementation-independent test generation method produces test sequences with comparable or better fault coverages for SAFs and Transition Delay Faults (TDF) than known methods which utilize knowledge about implementation details.