poster

An Efficient Approach to Resolving Stack Overflow of SYCL Kernel on Intel® CPUs

Authors:

Wenwan Xing,

Wenju He,

Xinmin Tian

IWOCL '24: Proceedings of the 12th International Workshop on OpenCL and SYCL

Article No.: 13, Page 1

https://doi.org/10.1145/3648115.3648135

Published: 08 April 2024


Abstract

SYCL is a parallel programming language that enables heterogeneous computing on various devices. The SYCL CPU device [1] uses the CPU as a device to run SYCL kernels. Most SYCL concepts, such as the device memory model and sub-group and work-group construction, map directly onto GPU hardware, but the CPU device lacks native support for them. These concepts must therefore be emulated on the CPU device so that SYCL programs fully utilize the hardware and remain performance portable. To provide task parallelism at the work-group level, the SYCL CPU device distributes SYCL work-groups across CPU threads, each of which has a limited stack size.

The SYCL device memory model consists of three distinct memory regions: global, local, and private. Local memory is accessible to all work-items in a single work-group, and private memory is accessible only to a single work-item. The CPU device has no dedicated hardware for local or private memory, so each is emulated by allocating a block of memory on the stack. Because a thread’s stack size cannot be changed after the thread is created, a kernel that uses a large amount of private or local memory can overflow the stack. The application’s master thread has a default stack size of 8MB on Linux but only 1MB on Windows, so the error is much more likely on Windows. The stack size of the other worker threads is set to 8MB on 64-bit systems and 4MB on 32-bit systems.

To address this issue, the SYCL CPU device previously used context swapping, which expands the stack through low-level APIs provided by the operating system. When a work-group requires more stack than its executing thread has, the SYCL CPU device runtime swaps the thread’s context before execution. However, this method causes severe performance degradation on Windows because the swapping involves frequent and inefficient memory movement. Some SYCL workloads on Windows even hang with this approach.

To solve the performance issue, we propose a novel approach that replaces the stack allocation instructions for private and local memory with addresses in heap memory. A block of memory is allocated on the heap before kernel execution, and it can grow if a kernel that needs more stack memory is executed later. The heap buffer pointer is passed to the kernel as an implicit argument; a null pointer is passed if the kernel does not need the heap in place of the stack. During kernel compilation, the original alloca instructions are replaced with instructions that access heap memory for the private and local buffers.

Experimental results on 21 SYCL workloads show that the new approach significantly outperforms context swapping: the geomean speedup is 153.73x on Windows and 1.11x on Linux, and the workloads no longer hang on Windows. The new approach shows no evident performance penalty compared with the baseline that does not use the heap. It could be adopted by other SYCL or SPMD CPU devices, such as the SYCL Native CPU device [2], since they all face the same problem. This feature will be delivered in the Intel oneAPI 2024.2 toolkit.
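The heap-based replacement described in the abstract can be illustrated with a minimal C++ sketch. It is not the Intel runtime's implementation: `KernelScratchHeap`, `reserve`, `work_item`, and `run_work_group` are hypothetical names, the buffer is typed as `int` for simplicity (a real runtime would presumably manage raw bytes for both private and local memory), and the work-items run sequentially on the calling thread rather than across worker threads. The sketch only shows the two ideas stated above: a heap block that grows when a later kernel needs more memory, and a kernel body that addresses its private array through an implicit pointer argument instead of a stack allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// All names here are hypothetical; this is not the Intel CPU runtime's API.

// Host side: a single heap block reused across kernel launches. It only grows,
// so a later kernel with a larger footprint triggers one reallocation.
class KernelScratchHeap {
public:
    // `elems` is the private-memory footprint of one work-group, in ints.
    // Returns nullptr when the caller says the footprint fits on the stack.
    int* reserve(std::size_t elems, bool fits_on_stack) {
        if (fits_on_stack) return nullptr;
        if (elems > block_.size()) block_.resize(elems);
        return block_.data();
    }
    std::size_t capacity() const { return block_.size(); }

private:
    std::vector<int> block_;
};

// Per-work-item private array size used by the toy kernel below.
constexpr std::size_t kPrivElems = 1u << 18;  // 262144 ints = 1 MiB with 4-byte int

// Kernel body after the rewrite. Before the rewrite the first line would have
// been a stack allocation such as `int priv[kPrivElems];`. Now the work-item
// indexes its own slice of the heap block passed as an implicit argument.
long long work_item(int* heap, std::size_t local_id) {
    assert(heap != nullptr && "this kernel was compiled to use the heap");
    int* priv = heap + local_id * kPrivElems;
    for (std::size_t i = 0; i < kPrivElems; ++i) priv[i] = static_cast<int>(i & 7);
    long long sum = 0;
    for (std::size_t i = 0; i < kPrivElems; ++i) sum += priv[i];
    return sum;
}

// Runs one work-group of `wg_size` work-items sequentially on the calling thread.
long long run_work_group(KernelScratchHeap& heap, std::size_t wg_size) {
    int* base = heap.reserve(wg_size * kPrivElems, /*fits_on_stack=*/false);
    long long total = 0;
    for (std::size_t id = 0; id < wg_size; ++id) total += work_item(base, id);
    return total;
}
```

Calling `run_work_group(heap, 4)` on a fresh `KernelScratchHeap` grows the block to hold four private arrays; a later launch with fewer work-items reuses the same block without reallocating, and a launch with more work-items grows it again.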

References

[1] James Reinders, Ben Ashbaugh, James Brodman, Michael Kinsner, John Pennycook, and Xinmin Tian. 2023. Data Parallel C++: Programming Accelerated Systems Using C++ and SYCL. https://doi.org/10.1007/978-1-4842-9691-2

[2] Pietro Ghiglio, Uwe Dolinsky, Mehdi Goli, and Kumudha Narasimhan. 2022. Improving Performance of SYCL Applications on CPU Architectures Using LLVM-Directed Compilation Flow. In Proceedings of the Thirteenth International Workshop on Programming Models. (2022). https://doi.org/10.1145/3528425.3529099

Published In

IWOCL '24: Proceedings of the 12th International Workshop on OpenCL and SYCL

April 2024

124 pages

ISBN:9798400717901

DOI:10.1145/3648115

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 April 2024

Qualifiers

Poster
Research
Refereed limited

Conference

IWOCL '24

IWOCL '24: International Workshop on OpenCL and SYCL

April 8 - 11, 2024

Chicago, IL, USA

Acceptance Rates

Overall Acceptance Rate 84 of 152 submissions, 55%
