DOI: 10.1145/3648115.3648135
poster

An Efficient Approach to Resolving Stack Overflow of SYCL Kernel on Intel® CPUs

Published: 08 April 2024

Abstract

SYCL is a parallel programming model that enables heterogeneous computing on a variety of devices. The SYCL CPU device [1] uses the CPU as a device to run SYCL kernels. While most SYCL concepts, such as the device memory model and sub-group and work-group construction, map directly onto GPU hardware, the CPU lacks native support for them. These concepts must therefore be emulated on the CPU device so that SYCL programs fully utilize the hardware and remain performance portable. To provide task parallelism at the work-group level, the SYCL CPU device distributes the execution of SYCL work-groups across CPU threads, each of which has a limited stack size.

The SYCL device memory model consists of three distinct memory regions: global, local, and private. Local memory is accessible to all work-items in a single work-group. Private memory is accessible to a single work-item. The CPU device has no dedicated hardware for local or private memory, so both are emulated by allocating a block of memory on the stack. Because a thread’s stack size can’t be changed after the thread is created, a stack overflow can occur when a kernel uses a large amount of private or local memory. The application’s master thread has a stack size of 8MB on Linux and 1MB on Windows, and worker threads are given 8MB on a 64-bit system and 4MB on a 32-bit system. The error is therefore much more likely on Windows, where the default stack size of the master thread is only 1MB. To address this issue, the SYCL CPU device previously used context swapping, which expands the stack through low-level APIs provided by the operating system: when a work-group requires more stack than its executing thread has, the SYCL CPU device runtime swaps the thread’s context before execution. However, this method causes severe performance degradation on Windows because the swapping involves frequent and inefficient memory movement. Some SYCL workloads on Windows even hang with this approach.

To solve the performance issue, we propose a novel approach that replaces the stack allocation instructions for private and local memory with addresses in heap memory. A block of memory is allocated on the heap before kernel execution. The block can grow if a kernel with a larger stack memory requirement is executed later. The heap buffer pointer is passed to the kernel as an implicit argument, and a null pointer is passed instead if the kernel does not need the heap for its stack. During kernel compilation, we replace the original alloca instructions with instructions that access heap memory for private and local buffers. Experimental results on 21 SYCL workloads show that the novel approach significantly outperforms the context-swapping approach. The geomean speedup is 153.73x on Windows and 1.11x on Linux, and the workloads no longer hang on Windows. The novel approach has no evident performance penalty compared to the baseline that doesn’t use the heap. It could also be adopted by other SYCL or SPMD CPU devices, such as the SYCL Native CPU device [2], since they all face the same problem. This feature will be delivered in the Intel oneAPI 2024.2 toolkit.
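The heap-backed scheme described in the abstract can be illustrated with a small C++ sketch. It is not the Intel runtime's implementation: `ScratchHeap`, `sum_squares_kernel`, `launch`, and the `kStackInts` threshold are hypothetical names and values chosen for illustration, and private memory is limited to an `int` array to keep the example short. The sketch shows only the three behaviours the abstract states: a heap block sized before launch, growth when a later kernel needs more, and a null implicit argument when the stack is sufficient.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// All names below are hypothetical; this is not the Intel runtime's API.

// Assumed threshold: private arrays of up to this many ints stay on the stack.
constexpr std::size_t kStackInts = 1024;

// Grow-only heap block owned by the (hypothetical) runtime.
class ScratchHeap {
public:
  // Returns heap storage for `n` ints, or nullptr when the stack suffices.
  int* acquire(std::size_t n) {
    if (n <= kStackInts) return nullptr;          // heap not needed
    if (n > storage_.size()) storage_.resize(n);  // grow for a larger kernel
    return storage_.data();
  }
  std::size_t size_in_ints() const { return storage_.size(); }

private:
  std::vector<int> storage_;
};

// Kernel after the rewrite: private memory is addressed through the implicit
// pointer when it is non-null, and through a small stack array otherwise.
long long sum_squares_kernel(std::size_t n, int* implicit_heap) {
  int on_stack[kStackInts];
  int* priv = implicit_heap ? implicit_heap : on_stack;
  assert(implicit_heap != nullptr || n <= kStackInts);
  for (std::size_t i = 0; i < n; ++i) priv[i] = static_cast<int>(i);
  long long sum = 0;
  for (std::size_t i = 0; i < n; ++i) sum += 1LL * priv[i] * priv[i];
  return sum;
}

// Runtime side: size the heap block before launch, then pass the pointer
// (possibly null) to the kernel as its extra argument.
long long launch(ScratchHeap& heap, std::size_t n) {
  return sum_squares_kernel(n, heap.acquire(n));
}
```

In this sketch, calling `launch(heap, 10)` leaves the heap block empty, while `launch(heap, 100000)` grows it to 100000 ints, and a later, smaller launch reuses the block without shrinking it. In a real compiler the pointer selection would be done by rewriting alloca instructions during kernel compilation rather than by a branch in source code.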

References

[1] James Brodman, Michael Kinsner, John Pennycook, James Reinders, Ben Ashbaugh, and Xinmin Tian. 2023. Data Parallel C++: Programming Accelerated Systems Using C++ and SYCL. https://doi.org/10.1007/978-1-4842-9691-2
[2] Mehdi Goli, Pietro Ghiglio, Uwe Dolinsky, and Kumudha Narasimhan. 2022. Improving Performance of SYCL Applications on CPU Architectures Using LLVM-Directed Compilation Flow. In Proceedings of the Thirteenth International Workshop on Programming Models. https://doi.org/10.1145/3528425.3529099

Published In

IWOCL '24: Proceedings of the 12th International Workshop on OpenCL and SYCL
April 2024
124 pages
ISBN:9798400717901
DOI:10.1145/3648115
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Poster
  • Research
  • Refereed limited

Conference

IWOCL '24

Acceptance Rates

Overall Acceptance Rate 84 of 152 submissions, 55%
