Search Results (95)

Search Parameters:
Keywords = multicore processors

19 pages, 2693 KiB  
Article
Adaptive Switching Redundant-Mode Multi-Core System for Photovoltaic Power Generation
by Liang Liu, Xige Zhang, Jiahui Zhou, Kai Niu, Zixuan Guo, Yawen Zhao and Meng Zhang
Sensors 2024, 24(23), 7561; https://doi.org/10.3390/s24237561 - 27 Nov 2024
Viewed by 451
Abstract
As maximum power point tracking (MPPT) algorithms have developed towards multi-task intelligent computing, processors in photovoltaic power generation control systems must be capable of delivering higher performance. However, the challenges posed by the complex environment of photovoltaic fields with regard to processor reliability cannot be overlooked. To address these issues, we propose a novel approach that uses error rate and performance as switching metrics and performs joint statistics over them to achieve efficient adaptive switching. Based on this, we designed a redundancy-mode-switchable three-core processor system to balance performance and reliability. Additionally, by analyzing the relationship between performance and reliability, we propose optimization methods to improve reliability while maintaining high performance. Finally, we designed an error injection method and verified the system's reliability by analyzing the error rate probability model in different scenarios. The analysis shows that, compared with a traditional MPPT controller, the redundancy-mode-switchable multi-core processor system proposed in this paper exhibits a reliability approximately 5.58 times that of a non-fault-tolerant system. Furthermore, leveraging module switching, the system's performance is enhanced by 26% compared to a highly reliable triple modular redundancy system, significantly improving reliability while preserving good performance.
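The abstract describes a policy that switches redundancy modes from joint error-rate and performance statistics. As a minimal C++ sketch of what such a switching rule could look like (the thresholds, mode names, and metrics are illustrative assumptions, not the authors' design):

```cpp
// Hypothetical adaptive redundancy-mode switch driven by joint error-rate
// and load statistics. All thresholds and names are invented for illustration.
#include <cstdio>

enum class Mode { TripleModular, DualPlusSpare, Independent };

struct Stats {
    double errorRate;   // detected errors per million cycles
    double loadFactor;  // fraction of cycles the cores are busy, 0..1
};

Mode selectMode(const Stats& s) {
    // High-error environments favour full triple modular redundancy (TMR);
    // clean, compute-heavy phases favour running the three cores independently.
    if (s.errorRate > 50.0)                      return Mode::TripleModular;
    if (s.errorRate > 5.0 || s.loadFactor < 0.5) return Mode::DualPlusSpare;
    return Mode::Independent;
}

int main() {
    Stats field{12.0, 0.8};   // e.g. a noisy photovoltaic-field phase
    std::printf("mode = %d\n", static_cast<int>(selectMode(field)));
}
```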

23 pages, 4551 KiB  
Article
A Model-Based Optimization Method of ARINC 653 Multicore Partition Scheduling
by Pujie Han, Wentao Hu, Zhengjun Zhai and Min Huang
Aerospace 2024, 11(11), 915; https://doi.org/10.3390/aerospace11110915 - 7 Nov 2024
Viewed by 725
Abstract
ARINC 653 Part 1 Supplement 5 (ARINC 653P1-5) provides temporal partitioning capabilities for real-time applications running on multicore processors in Integrated Modular Avionics (IMA) systems. However, it is difficult to schedule a set of ARINC 653 multicore partitions so as to achieve minimum processor occupancy. This paper proposes a model-based optimization method for ARINC 653 multicore partition scheduling. The IMA multicore processing system is modeled as a network of timed automata in UPPAAL. A parallel genetic algorithm is employed to explore the solution space of the IMA system. Owing to the lack of a priori information about the system model, the configuration of genetic operators is self-adaptively controlled by a Q-learning algorithm. During the evolution, each individual in a population is evaluated independently by compositional model checking, which verifies each partition in the IMA system and combines all the schedulability results into a global fitness evaluation. The experiments show that our model-based method outperforms traditional analytical methods when handling the same task loads in ARINC 653 multicore partitions, while alleviating the state space explosion of model checking through parallelization.
(This article belongs to the Special Issue Aircraft Design and System Optimization)
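The Q-learning control of genetic operators can be pictured as a stateless bandit that learns which operator setting yields fitness gains. A toy C++ sketch under that reading (the action set, reward signal, and learning constants are assumptions only; the paper's fitness comes from model checking, which is stubbed out here):

```cpp
// Toy Q-learning-controlled choice of a genetic operator (mutation rate).
#include <array>
#include <random>

int main() {
    constexpr std::array<double, 3> mutationRates{0.01, 0.05, 0.10}; // actions
    std::array<double, 3> q{};          // Q-value per action
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> uni(0.0, 1.0);

    double bestFitness = 0.0;
    for (int gen = 0; gen < 100; ++gen) {
        // Epsilon-greedy action choice.
        std::size_t a = 0;
        if (uni(rng) < 0.1) {
            a = std::uniform_int_distribution<std::size_t>(0, 2)(rng);
        } else {
            for (std::size_t i = 1; i < q.size(); ++i)
                if (q[i] > q[a]) a = i;
        }
        // Stand-in for one GA generation evaluated by compositional model
        // checking; here a fake fitness-improvement signal is used instead.
        double newFitness = bestFitness + uni(rng) * mutationRates[a];
        double reward = newFitness - bestFitness;
        bestFitness = newFitness;
        q[a] += 0.2 * (reward - q[a]);   // one-step (bandit-form) Q-update
    }
}
```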

25 pages, 3881 KiB  
Article
Logical Execution Time and Time-Division Multiple Access in Multicore Embedded Systems: A Case Study
by Carlos-Antonio Mosqueda-Arvizu, Julio-Alejandro Romero-González, Diana-Margarita Córdova-Esparza, Juan Terven, Ricardo Chaparro-Sánchez and Juvenal Rodríguez-Reséndiz
Algorithms 2024, 17(7), 294; https://doi.org/10.3390/a17070294 - 5 Jul 2024
Viewed by 954
Abstract
The automotive industry has recently adopted multicore processors and microcontrollers to meet the requirements of new features, such as autonomous driving, and to comply with the latest safety standards. However, inter-core communication poses a challenge to real-time requirements such as time determinism and low latency. Concurrent access to shared buffers makes predicting the flow of data difficult, degrading algorithm performance. This study explores the integration of the Logical Execution Time (LET) and Time-Division Multiple Access (TDMA) models in multicore embedded systems to address these challenges. By synchronizing read/write operations across different cores, the combined approach significantly reduces latency variability and improves system predictability and consistency. Experimental results demonstrate that this integrated approach eliminates data loss and maintains fixed operation rates, achieving a consistent latency of 11 ms. The LET-TDMA method reduces latency variability to approximately 1 ms, maintaining a maximum delay of 1.002 ms and a minimum delay of 1.001 ms, compared to the variability of the LET-only method, which ranged from 3.2846 ms to 8.9257 ms across different configurations.
(This article belongs to the Special Issue Scheduling: Algorithms and Real-World Applications)
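The LET pattern reads inputs only at the start of a period and publishes outputs only at its end, while TDMA grants each core exclusive slots on the shared buffer. A minimal single-file C++ sketch of that combination (the 1 ms period, 2 ms frame, and busy-wait slot guard are illustrative assumptions, not the paper's AUTOSAR-level implementation):

```cpp
// LET read-at-start/write-at-end combined with TDMA-style slot arbitration.
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<int> sharedBuffer{0};

// Each core may touch the shared buffer only inside its TDMA slot.
bool inSlot(int core, std::chrono::steady_clock::time_point now,
            std::chrono::milliseconds frame = std::chrono::milliseconds(2)) {
    auto t = now.time_since_epoch() % frame;
    return (core == 0) ? (t < frame / 2) : (t >= frame / 2);
}

void letTask(int core, int iterations) {
    using namespace std::chrono;
    for (int i = 0; i < iterations; ++i) {
        auto periodStart = steady_clock::now();
        while (!inSlot(core, steady_clock::now())) { /* wait for our slot */ }
        int input = sharedBuffer.load();          // LET: read at period start
        int output = input + 1;                   // task computation
        std::this_thread::sleep_until(periodStart + milliseconds(1));
        while (!inSlot(core, steady_clock::now())) { }
        sharedBuffer.store(output);               // LET: write at period end
    }
}

int main() {
    std::thread c0(letTask, 0, 100), c1(letTask, 1, 100);
    c0.join(); c1.join();
}
```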

20 pages, 2100 KiB  
Article
Parallel Algorithm on Multicore Processor and Graphics Processing Unit for the Optimization of Electric Vehicle Recharge Scheduling
by Vincent Roberge, Katerina Brooks and Mohammed Tarbouchi
Electronics 2024, 13(9), 1783; https://doi.org/10.3390/electronics13091783 - 5 May 2024
Viewed by 1776
Abstract
Electric vehicles (EVs) are becoming increasingly popular, as they provide significant environmental benefits compared to fossil-fuel vehicles. However, they represent substantial loads on the power grid, and scheduling EV charging can be a challenge, especially in large parking lots. This paper presents a metaheuristic-based approach, parallelized on multicore processors (CPUs) and graphics processing units (GPUs), to optimize the scheduling of EV charging in a single smart parking lot. The proposed method uses a particle swarm optimization algorithm that takes as input the arrival time, the departure time, and the power demand of the vehicles, and produces an optimized charging schedule for all vehicles in the parking lot, which minimizes the overall charging cost while respecting the chargers' capacity and the parking lot feeder capacity. The algorithm exploits task-level parallelism in the multicore CPU implementation and data-level parallelism in the GPU implementation. The proposed algorithm is tested in simulation on parking lots containing 20 to 500 EVs. The parallel implementation on CPUs provides a speedup of 7.1×, while the implementation on a GPU provides a speedup of up to 247.6×. The parallel implementation on a GPU is able to optimize the charging schedule for a 20-EV parking lot in 0.87 s and for a 500-EV lot in just under 30 s. These runtimes allow for real-time computation when a vehicle arrives at the parking lot or when the electricity cost profile changes.
(This article belongs to the Special Issue Vehicle Technologies for Sustainable Smart Cities and Societies)
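In PSO, fitness evaluations of different particles are independent, which is what makes the task-level CPU parallelism mentioned above natural. A short OpenMP sketch of that idea (the cost model, penalty constant, and dimensions are placeholder assumptions, not the paper's formulation):

```cpp
// Task-level parallel PSO fitness evaluation: each OpenMP thread scores a
// subset of particles. Compile with e.g. g++ -fopenmp.
#include <omp.h>
#include <vector>

struct Particle { std::vector<double> schedule; double fitness; };

// Hypothetical cost: price-weighted energy plus a penalty for exceeding
// the parking-lot feeder capacity in any time slot.
double chargingCost(const std::vector<double>& schedule,
                    const std::vector<double>& price, double feederCap) {
    double cost = 0.0;
    for (std::size_t t = 0; t < schedule.size(); ++t) {
        cost += schedule[t] * price[t % price.size()];
        if (schedule[t] > feederCap) cost += 1e6;   // constraint penalty
    }
    return cost;
}

int main() {
    const int nParticles = 512, nSlots = 96;        // 15 min slots over 24 h
    std::vector<double> price{0.08, 0.12, 0.20, 0.12};
    std::vector<Particle> swarm(nParticles,
        Particle{std::vector<double>(nSlots, 7.0), 0.0});

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < nParticles; ++i)
        swarm[i].fitness = chargingCost(swarm[i].schedule, price, 500.0);
    return 0;
}
```

The GPU version described in the abstract would instead map the inner arithmetic to data-parallel threads, one per particle dimension or per particle.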

21 pages, 6467 KiB  
Article
Architectural and Technological Approaches for Efficient Energy Management in Multicore Processors
by Claudiu Buduleci, Arpad Gellert, Adrian Florea and Remus Brad
Viewed by 1755
Abstract
Benchmarks play an essential role in the performance evaluation of novel research concepts. Their effectiveness diminishes if they fail to exploit the available hardware of the evaluated microprocessor or, more broadly, if they are inconsistent in comparing various systems. An empirical analysis of the long-established Splash-2 benchmark suite versus its latest version, Splash-4, was performed. It showed that on a 64-core configuration, half of the simulated benchmarks reach temperatures well beyond the critical threshold of 105 °C, emphasizing the necessity of a multi-objective evaluation covering at least the following perspectives: energy consumption, performance, chip temperature, and integration area. During the analysis, it was observed that the cores spend a large amount of time in the idle state, around 45% on average in some configurations. We exploit this by implementing a predictive dynamic voltage and frequency scaling (DVFS) technique called the Simple Core State Predictor (SCSP), integrating it into the Intel Nehalem architecture and simulating it using Sniper. The aim is to decrease overall energy consumption by reducing core-level power consumption while maintaining the same performance. Moreover, the SCSP technique, which operates on core-level abstract information, was applied in parallel with a Value Predictor (VP) or a Dynamic Instruction Reuse (DIR) technique, which rely on instruction-level information. Using the SCSP alone, a 9.95% reduction in power consumption and a 10.54% reduction in energy were achieved while maintaining performance. By combining the SCSP with the VP technique, a performance increase of 8.87% was obtained while reducing power and energy consumption by 3.13% and 8.48%, respectively.
(This article belongs to the Special Issue Green Networking and Computing 2022)
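The abstract does not specify how the SCSP predicts core states; one plausible reading of a "simple" predictor is a saturating counter over recent idle observations that selects a power state. A hedged C++ sketch of that guess (the 2-bit counter and P-state mapping are our assumptions, not the paper's mechanism):

```cpp
// Hypothetical idle-state predictor driving a DVFS decision per core.
#include <cstdint>

struct CorePredictor {
    uint8_t counter = 0;                 // 2-bit saturating idle counter

    void observe(bool idle) {
        if (idle  && counter < 3) ++counter;
        if (!idle && counter > 0) --counter;
    }
    // Predicted P-state: 0 = full frequency, 2 = deepest power saving.
    int pState() const { return counter >= 3 ? 2 : (counter >= 2 ? 1 : 0); }
};

int main() {
    CorePredictor core;
    bool trace[] = {true, true, true, false, true};  // idle/busy observations
    for (bool idle : trace) core.observe(idle);
    return core.pState();                // exit code just for demonstration
}
```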

20 pages, 600 KiB  
Article
Optimizing Large Language Models on Multi-Core CPUs: A Case Study of the BERT Model
by Lanxin Zhao, Wanrong Gao and Jianbin Fang
Appl. Sci. 2024, 14(6), 2364; https://doi.org/10.3390/app14062364 - 11 Mar 2024
Viewed by 1658
Abstract
The BERT model is regarded as the cornerstone of various pre-trained large language models that have achieved promising results in recent years. This article investigates how to optimize the BERT model in terms of fine-tuning speed and prediction accuracy, aiming to accelerate the execution of the BERT model on a multi-core processor and to improve its prediction accuracy on typical downstream natural language processing tasks. Our contributions are twofold. First, we port and parallelize the fine-tuning training of the BERT model on a multi-core shared-memory processor, accelerating the fine-tuning process for downstream tasks. Second, we improve the prediction performance of typical downstream natural language processing tasks by tuning the hyperparameters. We select five typical downstream tasks (CoLA, SST-2, MRPC, RTE, and WNLI) and perform optimization on the multi-core platform, taking the batch size, learning rate, and number of training epochs into account. Our experimental results show that, by increasing the number of CPUs and threads, the model training time can be significantly reduced, and that the saved time is primarily concentrated in the self-attention mechanism. Further experiments show that setting reasonable hyperparameters improves the accuracy of the BERT model on downstream tasks, and that appropriately increasing the batch size under conditions of sufficient computing resources can significantly reduce training time.
(This article belongs to the Special Issue Design and Application of High-Performance Computing Systems)
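Since the abstract attributes most of the saved time to self-attention, the bottleneck is essentially large matrix products. A toy C++/OpenMP kernel illustrating the kind of thread-level parallelism involved in the QKᵀ score computation (sizes are small stand-ins, not BERT's real dimensions, and this is not the paper's code):

```cpp
// Thread-parallel computation of attention scores S = Q * K^T for one head.
#include <omp.h>
#include <vector>

int main() {
    const int seq = 128, dim = 64;                      // tokens x head size
    std::vector<float> Q(seq * dim, 0.01f), K(seq * dim, 0.02f);
    std::vector<float> scores(seq * seq, 0.0f);

    #pragma omp parallel for collapse(2)
    for (int i = 0; i < seq; ++i)
        for (int j = 0; j < seq; ++j) {
            float s = 0.0f;
            for (int d = 0; d < dim; ++d)
                s += Q[i * dim + d] * K[j * dim + d];   // dot(Q_i, K_j)
            scores[i * seq + j] = s;
        }
    return 0;
}
```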

16 pages, 8958 KiB  
Article
An Algorithm for Solving the Problem of Phase Unwrapping in Remote Sensing Radars and Its Implementation on Multicore Processors
by Petr S. Martyshko, Elena N. Akimova, Andrey V. Sosnovsky and Victor G. Kobernichenko
Mathematics 2024, 12(5), 727; https://doi.org/10.3390/math12050727 - 29 Feb 2024
Viewed by 954
Abstract
The problem of interferometric phase unwrapping in radar remote sensing of the Earth is considered. Interferograms are widely used for creating and updating relief maps of the Earth's surface in geodesy, cartography, environmental monitoring, geological, hydrological and glaciological studies, and for monitoring transport communications. Modern radar systems have ultra-high spatial resolution and a wide bandwidth, which leads to the need to unwrap large interferograms of several tens of millions of elements; with existing methods, such computations require a processing time of several days. In this paper, an effective method for equalizing the inverse vortex field for phase unwrapping is proposed, which solves the problem with quasi-linear computational complexity in the interferogram size and the number of singular points on it. To implement the method, a parallel algorithm for solving the problem on a multi-core processor using OpenMP technology was developed. Numerical experiments on radar data models were carried out to investigate the effectiveness of the algorithm depending on the size of the source data, the density of singular points, and the number of processor cores.
(This article belongs to the Special Issue Intelligence Computing and Optimization Methods in Natural Sciences)
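The singular points the abstract refers to are the classic phase residues of an interferogram, and detecting them is an embarrassingly parallel scan over 2×2 pixel loops, the kind of work OpenMP distributes well. A sketch of that detection step (the residue rule is standard; the paper's inverse-vortex-field equalization itself is not reproduced):

```cpp
// OpenMP-parallel residue (singular point) detection on a wrapped phase map.
#include <cmath>
#include <omp.h>
#include <vector>

const double kPi = 3.14159265358979323846;

double wrap(double d) {                  // wrap a phase difference to (-pi, pi]
    while (d >  kPi) d -= 2.0 * kPi;
    while (d <= -kPi) d += 2.0 * kPi;
    return d;
}

int main() {
    const int rows = 1024, cols = 1024;
    std::vector<double> phase(rows * cols, 0.0);     // wrapped interferogram
    long residues = 0;

    #pragma omp parallel for reduction(+ : residues)
    for (int r = 0; r < rows - 1; ++r)
        for (int c = 0; c < cols - 1; ++c) {
            auto p = [&](int i, int j) { return phase[i * cols + j]; };
            double loop = wrap(p(r, c + 1) - p(r, c))
                        + wrap(p(r + 1, c + 1) - p(r, c + 1))
                        + wrap(p(r + 1, c) - p(r + 1, c + 1))
                        + wrap(p(r, c) - p(r + 1, c));
            if (std::abs(loop) > kPi) ++residues;    // +-2*pi charge: residue
        }
    return static_cast<int>(residues != 0);
}
```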

18 pages, 4743 KiB  
Article
High-Precision Joint TDOA and FDOA Location System
by Guoyao Xiao, Qianhui Dong, Guisheng Liao, Shuai Li, Kaijie Xu and Yinghui Quan
Remote Sens. 2024, 16(4), 693; https://doi.org/10.3390/rs16040693 - 16 Feb 2024
Cited by 1 | Viewed by 2415
Abstract
Passive location based on TDOA (time difference of arrival) and FDOA (frequency difference of arrival) is the mainstream method for target localization. This paper proposes a fast time–frequency difference positioning method to address issues in the conventional CAF (cross-ambiguity function)-based approach, such as low accuracy, heavy computational resource utilization, and limited suitability for real-time signal processing, aiming to process the target radiation source and obtain the target parameters within a short timeframe. In the mixing product step of the CAF, a frequency-domain approach replaces the time-domain convolution of PW-ZFFT (pre-weighted Zoom-FFT) to reduce the computational load of the CAF. Additionally, a quadratic surface fitting method is used to enhance the accuracy of the TDOA and FDOA estimates. The localization solution is obtained using Newton's method, which provides more accurate results than analytical methods. Next, a signal processing platform is designed with an FPGA (field-programmable gate array) and a multi-core DSP (digital signal processor), dividing and mapping the algorithm's functional modules according to the hardware's characteristics. We analyze the architectural advantages of the multi-core DSP and design methods to improve program performance, such as EDMA transfer optimization, inline function optimization, and cache optimization. Finally, this paper constructs simulation tests in typical positioning scenarios and compares them to hardware measurement results, confirming the correctness and real-time capability of the program.
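The quadratic surface fit refines the discrete CAF peak to sub-bin accuracy. The one-dimensional parabolic (three-point) version below conveys the idea along a single axis; applying it independently on the delay and Doppler axes only approximates the paper's 2D surface fit under a separability assumption:

```cpp
// Three-point parabolic interpolation of a correlation peak.
#include <cstdio>

// Given samples y(-1), y(0), y(+1) around a discrete peak at index 0,
// return the sub-sample offset of the true maximum, in [-0.5, 0.5] bins.
double parabolicPeakOffset(double ym1, double y0, double yp1) {
    double denom = ym1 - 2.0 * y0 + yp1;
    return (denom == 0.0) ? 0.0 : 0.5 * (ym1 - yp1) / denom;
}

int main() {
    // e.g. CAF magnitudes around the coarse TDOA bin
    double offset = parabolicPeakOffset(0.90, 1.00, 0.96);
    std::printf("sub-bin TDOA offset = %.4f bins\n", offset);
}
```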

19 pages, 16073 KiB  
Article
Minimizing Fuel Consumption for Surveillance Unmanned Aerial Vehicles Using Parallel Particle Swarm Optimization
by Vincent Roberge, Gilles Labonté and Mohammed Tarbouchi
Sensors 2024, 24(2), 408; https://doi.org/10.3390/s24020408 - 9 Jan 2024
Cited by 3 | Viewed by 1568
Abstract
This paper presents a method based on particle swarm optimization (PSO) for optimizing the power settings of an unmanned aerial vehicle (UAV) along a given trajectory in order to minimize fuel consumption and maximize autonomy during surveillance missions. UAVs are widely used in surveillance missions, and their autonomy is a key characteristic that contributes to their success. Providing a way to reduce fuel consumption and increase autonomy therefore confers a significant advantage during the mission. The method proposed in this paper includes 3D path smoothing techniques for fixed-wing UAVs based on circular arcs that overfly the waypoints, an essential feature in a surveillance mission. It uses the equations of motion and the decomposition of Newton's equation to compute the fuel consumption for a given power setting. The proposed method uses PSO to compute optimized power settings while respecting the absolute physical constraints, such as the load factor, the lift coefficient, the maximum speed, and the maximum amount of fuel onboard. Finally, the method is parallelized on a multicore processor to accelerate the computation and provide fast optimization of the power settings in case the trajectory is changed in flight by the operator. Our results show that the proposed PSO was able to reduce fuel consumption by up to 25% in the trajectories tested, and the parallel implementation provided a speedup of 21.67× compared to a sequential implementation on the CPU.
(This article belongs to the Topic Vehicle Dynamics and Control)
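The multicore PSO parallelization mirrors the CPU approach of the EV-charging paper above, so shown here instead is the kind of physical feasibility check the optimizer must apply: the load factor of a level coordinated turn of radius r flown at speed v, relevant to arcs that overfly waypoints. The limit value is hypothetical; the formula itself is standard flight mechanics:

```cpp
// Load factor of a level, coordinated turn: tan(phi) = v^2 / (g r),
// n = 1 / cos(phi) = sqrt(1 + (v^2 / (g r))^2).
#include <cmath>
#include <cstdio>

double loadFactor(double v, double r) {
    const double g = 9.81;               // m/s^2
    double t = (v * v) / (g * r);
    return std::sqrt(1.0 + t * t);
}

int main() {
    double v = 60.0;      // m/s, cruise speed (illustrative)
    double r = 400.0;     // m, arc radius overflying a waypoint
    double n = loadFactor(v, r);
    std::printf("load factor = %.2f (hypothetical limit 2.5)\n", n);
    return n > 2.5;       // arc is infeasible if the limit is exceeded
}
```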

22 pages, 4202 KiB  
Article
RLARA: A TSV-Aware Reinforcement Learning Assisted Fault-Tolerant Routing Algorithm for 3D Network-on-Chip
by Jiajia Jiao, Ruirui Shen, Lujian Chen, Jin Liu and Dezhi Han
Electronics 2023, 12(23), 4867; https://doi.org/10.3390/electronics12234867 - 2 Dec 2023
Viewed by 1499
Abstract
A three-dimensional Network-on-Chip (3D NoC) equips modern multicore processors with good scalability, a small area, and high performance using vertical through-silicon vias (TSVs). However, the failure rate of TSVs, which is higher than that of horizontal links, causes unpredictable topology variations and requires adaptive routing algorithms to select the available paths dynamically. Most works have aimed at congestion control in TSV partially connected 3D NoCs to bypass the TSV reliability issue, while others have focused on fault tolerance in TSV fully connected 3D NoCs while ignoring the performance degradation. In order to improve both reliability and performance in TSV fully connected 3D NoC architectures, we propose a TSV-aware Reinforcement Learning Assisted Routing Algorithm (RLARA) for fault-tolerant 3D NoCs. The proposed method can take advantage of both the high throughput of fully connected TSVs and the cost-effective fault tolerance of partially connected TSVs using a periodically updated TSV-aware Q-table of reinforcement learning. RLARA makes distributed routing decisions with the lowest TSV utilization to avoid overheating the TSVs and mitigate the reliability problem. Furthermore, the K-means clustering algorithm is adopted to compress the routing table of RLARA by exploiting the similarity of routing information. To alleviate the inherent deadlock issue of adaptive routing algorithms, the link Q-value from reinforcement learning is combined with the router status based on buffer utilization to predict congestion, enabling RLARA to perform well even under a high traffic load. The experimental results of the ablation study on the Garnet 2.0 simulator verify the effectiveness of the proposed RLARA under different fault models: it outperforms the latest 3D NoC routing algorithms, with up to a 9.04% lower average delay and an 8.58% higher successful delivery rate.
(This article belongs to the Section Computer Science & Engineering)
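A stylized C++ sketch of the Q-table idea behind such a router: one Q-value per output port, updated from per-hop delay and TSV health, with a greedy choice among ports the fault model still allows. The reward shaping and all constants are invented for illustration, not RLARA's actual rules:

```cpp
// Stylized per-router Q-learning for fault-aware port selection in a 3D NoC.
#include <array>
#include <cstddef>

constexpr std::size_t kPorts = 7;   // router ports: N, S, E, W, Up, Down, Local

struct RouterQ {
    std::array<double, kPorts> q{};  // estimated value per output port
    double alpha = 0.1, gamma = 0.9;

    // One Q-learning step after forwarding a flit through `port`.
    void update(std::size_t port, double hopDelay, bool tsvFaulty,
                double neighborBestQ) {
        double reward = -hopDelay - (tsvFaulty ? 100.0 : 0.0); // punish faults
        q[port] += alpha * (reward + gamma * neighborBestQ - q[port]);
    }

    // Greedy port choice among ports the fault model still allows.
    std::size_t bestPort(const std::array<bool, kPorts>& allowed) const {
        std::size_t best = 0;
        double bestVal = -1e300;
        for (std::size_t p = 0; p < kPorts; ++p)
            if (allowed[p] && q[p] > bestVal) { bestVal = q[p]; best = p; }
        return best;
    }
};

int main() {
    RouterQ r;
    r.update(4, 3.0, false, 0.0);    // vertical hop through a healthy TSV
    r.update(5, 3.0, true, 0.0);     // faulty TSV is punished heavily
    std::array<bool, kPorts> allowed{true, true, true, true, true, true, true};
    return static_cast<int>(r.bestPort(allowed));
}
```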

16 pages, 2370 KiB  
Article
Dual-Core PLC for Cooperating Projects with Software Implementation
by Marcin Hubacz and Bartosz Trybus
Electronics 2023, 12(23), 4730; https://doi.org/10.3390/electronics12234730 - 22 Nov 2023
Cited by 1 | Viewed by 1561
Abstract
The development of a general-purpose PLC based on a typical dual-core processor as a hardware platform is presented. The cores run two cooperating projects that exchange data through shared memory. Such a solution is equivalent to a single-core PLC running two tasks by means of a real-time operating system. Upgrading a typical programming tool involves defining which of the global variables are shared, and whether a variable in a particular core is read from or written to the shared memory. Extensions to the core runtimes consist of reads at the beginning of the scan cycle and writes at the end, together with an algorithm protecting the shared memory against access conflicts. As an example, the proposed solution is implemented in an engineering tool whose runtime is based on a virtual machine concept. The PLC prototype is based on a heterogeneous ARM dual-core STM32 microcontroller running two different projects. The innovation of this research lies in showing how to run two projects on a dual-core PLC without using an operating system. Extension to multiple projects on a multi-core processor can be accomplished in a similar manner.
(This article belongs to the Special Issue Advances in Hardware-Software Codesign)
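The scan-cycle exchange described above, read shared inputs once at the start of the cycle and publish outputs once at the end, can be sketched with a double buffer and an atomic index. This stands in for the paper's conflict-protection algorithm, which is not reproduced here, and assumes a single writer per shared area:

```cpp
// Double-buffered shared area for PLC scan-cycle data exchange (one writer).
#include <array>
#include <atomic>

struct SharedArea {
    std::array<std::array<int, 8>, 2> buf{};   // two generations of variables
    std::atomic<int> active{0};                // index of the readable copy

    std::array<int, 8> readAtScanStart() const { return buf[active.load()]; }

    void writeAtScanEnd(const std::array<int, 8>& out) {
        int next = 1 - active.load();
        buf[next] = out;                       // fill the inactive copy
        active.store(next);                    // publish atomically
    }
};

int main() {
    SharedArea shm;
    auto inputs = shm.readAtScanStart();       // begin scan cycle (core 0)
    inputs[0] += 1;                            // PLC program logic
    shm.writeAtScanEnd(inputs);                // end scan cycle
}
```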

15 pages, 14472 KiB  
Article
Speed Up of Volumetric Non-Local Transform-Domain Filter Utilising HPC Architecture
by Petr Strakos, Milan Jaros, Lubomir Riha and Tomas Kozubek
J. Imaging 2023, 9(11), 254; https://doi.org/10.3390/jimaging9110254 - 20 Nov 2023
Viewed by 1637
Abstract
This paper presents a parallel implementation of a non-local transform-domain filter (BM4D). The effectiveness of the parallel implementation is demonstrated by denoising image series from computed tomography (CT) and magnetic resonance imaging (MRI). The basic idea of the filter is grouping and filtering similar data within the image. Due to the high level of similarity and data redundancy, the filter can provide even better denoising quality than currently widespread approaches based on deep learning (DL). In BM4D, cubes of voxels named patches are the essential image elements for filtering. Using voxels instead of pixels means that the search area for similar patches is large. Because of this, and because of the multi-dimensional transformations applied, the computation time of the filter is exceptionally long. The original implementation of BM4D is single-threaded only. We provide a parallel version of the filter that supports multi-core and many-core processors and scales on the versatile hardware resources typical of high-performance computing clusters, even when they are used concurrently for the task. Our algorithm uses hybrid parallelisation combining open multi-processing (OpenMP) and message passing interface (MPI) technologies and provides up to a 283× speedup, which is a 99.65% reduction in processing time compared to the sequential version of the algorithm. In denoising quality, the method performs considerably better than recent DL methods on data types those methods have not yet been trained on.
(This article belongs to the Section Medical Imaging)
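The shape of such a hybrid OpenMP + MPI decomposition can be sketched briefly: MPI ranks take slabs of the volume along one axis, OpenMP threads take slices within a slab, and a stub stands in for BM4D's patch grouping and filtering. The decomposition scheme is our assumption about a typical layout, not the authors' exact one:

```cpp
// Hybrid MPI + OpenMP slab decomposition of a voxel volume.
// Build with an MPI compiler wrapper, e.g. mpicxx -fopenmp.
#include <mpi.h>
#include <omp.h>
#include <vector>

void filterSlice(std::vector<float>& vol, int z, int nx, int ny) {
    for (int i = 0; i < nx * ny; ++i) vol[z * nx * ny + i] *= 0.99f; // stub
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nx = 256, ny = 256, nz = 256;
    const int z0 = rank * nz / size, z1 = (rank + 1) * nz / size;
    // Full volume allocated on every rank only for brevity of the sketch.
    std::vector<float> volume(static_cast<std::size_t>(nx) * ny * nz, 1.0f);

    #pragma omp parallel for schedule(dynamic)   // slices within our slab
    for (int z = z0; z < z1; ++z)
        filterSlice(volume, z, nx, ny);

    MPI_Finalize();
    return 0;
}
```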

12 pages, 2836 KiB  
Article
A Simple Denoising Algorithm for Real-World Noisy Camera Images
by Manfred Hartbauer
J. Imaging 2023, 9(9), 185; https://doi.org/10.3390/jimaging9090185 - 18 Sep 2023
Cited by 2 | Viewed by 4074
Abstract
The noise statistics of real-world camera images are challenging for any denoising algorithm. Here, I describe a modified version of a bionic algorithm that improves the quality of real-world noisy camera images from a publicly available image dataset. In the first step, an adaptive local averaging filter is executed for each pixel to remove moderate sensor noise while preserving fine image details and object contours. In the second step, image sharpness is enhanced by means of an unsharp mask filter to generate output images that are close to the ground-truth images (multiple averages of static camera images). The performance of this denoising algorithm was compared with five popular denoising methods: BM3D, wavelet, non-local means (NL-means), total variation (TV) denoising, and the bilateral filter. Results show that the two-step filter performed similarly to NL-means and TV filtering. BM3D had the best denoising performance but sometimes led to blurry images. This novel two-step filter depends on only a single parameter, which can be obtained from global image statistics. To reduce computation time, denoising was restricted to the Y channel of YUV-transformed images, and four image segments were processed simultaneously in parallel on a multi-core processor.
(This article belongs to the Topic Bio-Inspired Systems and Signal Processing)
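The two steps can be sketched directly: (1) average each pixel only with neighbours that differ by less than a noise-derived threshold, preserving contours, then (2) sharpen the result with an unsharp mask. The threshold and sharpening amount stand in for the single global parameter the abstract mentions; the exact values and 3×3 window are guesses, not the author's settings:

```cpp
// Two-step denoising sketch: edge-aware local average, then unsharp mask.
#include <cmath>
#include <vector>

using Image = std::vector<float>;   // single (Y) channel, row-major

// Step 1: adaptive local average over a 3x3 window.
Image adaptiveAverage(const Image& in, int w, int h, float thresh) {
    Image out(in.size());
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float c = in[y * w + x], sum = 0.0f; int n = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    int yy = y + dy, xx = x + dx;
                    if (yy < 0 || yy >= h || xx < 0 || xx >= w) continue;
                    float v = in[yy * w + xx];
                    if (std::fabs(v - c) < thresh) { sum += v; ++n; }
                }
            out[y * w + x] = sum / n;    // centre always included (n >= 1)
        }
    return out;
}

// Step 2: unsharp mask, sharpening the denoised image against its own blur.
Image unsharp(const Image& in, int w, int h, float amount) {
    Image blur = adaptiveAverage(in, w, h, 1e9f);  // huge threshold: box blur
    Image out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = in[i] + amount * (in[i] - blur[i]);
    return out;
}

int main() {
    Image img(64 * 64, 0.5f);
    Image result = unsharp(adaptiveAverage(img, 64, 64, 0.05f), 64, 64, 0.7f);
    return result.empty();
}
```

The four-segment parallelism mentioned in the abstract would simply run `adaptiveAverage` on disjoint image quarters in separate threads.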

19 pages, 631 KiB  
Article
Research on Cache Coherence Protocol Verification Method Based on Model Checking
by Yiqiang Zhao, Boning Shi, Qizhi Zhang, Yidong Yuan and Jiaji He
Electronics 2023, 12(16), 3420; https://doi.org/10.3390/electronics12163420 - 11 Aug 2023
Viewed by 2287
Abstract
This paper analyzes the underlying logic of a processor's behavior-level code and proposes an automatic model construction and formal verification method for the cache coherence protocol, with the aim of ensuring data consistency in the processor and the correctness of the cache function. The main idea of this method is to analyze the register transfer level (RTL) code directly at the module level and variable level, extracting the key modules and key variables from the code information. Then, based on the key variables, conditional behavior statements are retrieved from the code and unnecessary statements are deleted. The construction and simplification of the model's core states are completed automatically, while the library of properties to be verified is generated simultaneously, using a "white list" as the construction strategy. Finally, complete cache coherence protocol verification is carried out in the model checker UPPAAL. Ultimately, this mechanism reduces the 142 transition-path-guided global states of the cache module under verification to 4 core functional states driven by the coherence protocol implementation, effectively reducing the complexity of the formal model, and condenses 32 verification properties into 6, reducing the verification time cost by 76.19%.
(This article belongs to the Special Issue Computer-Aided Design for Hardware Security and Trust)
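To give a miniature flavour of what model checking a coherence protocol means, the toy below exhaustively explores the reachable states of a two-cache MSI model and checks the safety property "never two caches Modified at once". This toy protocol and state space are ours for illustration; the paper's UPPAAL model is derived from RTL analysis and is far larger:

```cpp
// Explicit-state reachability check of a 2-cache MSI safety property.
#include <array>
#include <set>

enum S { I, Sh, M };                         // Invalid, Shared, Modified
using State = std::array<S, 2>;

// Transition: cache `c` performs a read or write; the other cache is
// invalidated on a write (a simplified MSI rule).
State step(State st, int c, bool write) {
    if (write) { st[c] = M; st[1 - c] = I; }
    else       { if (st[c] == I) st[c] = Sh; if (st[1 - c] == M) st[1 - c] = Sh; }
    return st;
}

int main() {
    std::set<State> seen{{I, I}};
    std::set<State> frontier = seen;
    while (!frontier.empty()) {              // breadth-first state exploration
        std::set<State> next;
        for (const State& st : frontier)
            for (int c = 0; c < 2; ++c)
                for (bool w : {false, true}) {
                    State n = step(st, c, w);
                    if (n[0] == M && n[1] == M) return 1;   // property violated
                    if (seen.insert(n).second) next.insert(n);
                }
        frontier = next;
    }
    return 0;                                // all reachable states are safe
}
```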

15 pages, 1614 KiB  
Article
Improving Structured Grid-Based Sparse Matrix-Vector Multiplication and Gauss–Seidel Iteration on GPDSP
by Yang Wang, Jie Liu, Xiaoxiong Zhu, Qingyang Zhang, Shengguo Li and Qinglin Wang
Appl. Sci. 2023, 13(15), 8952; https://doi.org/10.3390/app13158952 - 3 Aug 2023
Cited by 1 | Viewed by 1163
Abstract
Structured grid-based sparse matrix-vector multiplication (SpMV) and Gauss–Seidel iterations are very important kernel functions in scientific and engineering computations, and both are memory-intensive and bandwidth-limited. GPDSP, a general-purpose digital signal processor, is a significant embedded processor that has been introduced into high-performance computing. In this paper, we designed various optimization methods for structured grid-based SpMV and Gauss–Seidel iterations on the GPDSP, including a blocking method to improve data locality and memory access efficiency, a multicolor reordering method to expose fine-grained parallelism in the Gauss–Seidel iteration, a data partitioning method designed for the GPDSP memory structure, and a double buffering method to overlap computation and memory access. Finally, we combined the above optimization methods into a multicore vectorization algorithm. We tested matrices generated from structured grids of different sizes on the GPDSP platform and obtained speedups of up to 41× and 47× over the unoptimized SpMV and Gauss–Seidel iterations, with maximum bandwidth efficiencies of 72% and 81%, respectively. The experimental results show that our algorithms fully utilize the external memory bandwidth. We also implemented the commonly used mixed-precision algorithm on the GPDSP and obtained speedups of 1.60× and 1.45× for the SpMV and Gauss–Seidel iterations, respectively.
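The multicolor reordering idea generalizes the classic red-black scheme: on a structured grid, points of the same colour have no data dependence, so each colour's Gauss–Seidel sweep parallelizes freely. A compact OpenMP sketch of red-black Gauss–Seidel on a 5-point stencil (GPDSP-specific blocking, DMA double buffering, and vectorization are beyond this illustration):

```cpp
// Red-black Gauss-Seidel sweeps for the 5-point Laplacian on an n x n grid.
#include <omp.h>
#include <vector>

int main() {
    const int n = 512;
    std::vector<double> u(n * n, 0.0), f(n * n, 1.0);
    const double h2 = 1.0 / ((n + 1.0) * (n + 1.0));

    for (int sweep = 0; sweep < 10; ++sweep)
        for (int color = 0; color < 2; ++color) {   // red sweep, then black
            #pragma omp parallel for
            for (int i = 1; i < n - 1; ++i)
                // start j so that (i + j) % 2 == color: same-colour points
                for (int j = 2 - (i + color) % 2; j < n - 1; j += 2)
                    u[i * n + j] = 0.25 * (u[(i - 1) * n + j] + u[(i + 1) * n + j]
                                         + u[i * n + j - 1] + u[i * n + j + 1]
                                         + h2 * f[i * n + j]);
        }
    return 0;
}
```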