-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8342818: Implement CPU Time Profiling for JFR #20752
base: master
Are you sure you want to change the base?
8342818: Implement CPU Time Profiling for JFR #20752
Conversation
I usually do test it on Mac, I probably just forgot it. I'm fixing the bug tomorrow. |
src/hotspot/share/jfr/recorder/stacktrace/jfrAsyncStackTrace.cpp
Outdated
Show resolved
Hide resolved
*/ | ||
|
||
#include "precompiled.hpp" | ||
#include "jfr/recorder/service/jfrOptionSet.hpp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All includes except:
#include "precompiled.hpp"
#include "jfr/periodic/sampling/jfrCPUTimeThreadSampler.hpp"
#include "utilities/debug.hpp"
Can be moved below #if defined(Linux) (to reduce compilation time on non-Linux platforms)
#include "runtime/osThread.hpp" | ||
|
||
#if defined(LINUX) | ||
#include "signals_posix.hpp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You will also need to include <sys/types.h> and change the type from OSThread::thread_id_t to pid_t, else you can get the following error:
"error: 'typedef pid_t OSThread::thread_id_t' is private within this context"
{} | ||
|
||
bool JfrAsyncStackTrace::record_async(JavaThread* jt, const frame& frame) { | ||
NoHandleMark nhm; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stack: [0x00007f02ff2f4000,0x00007f02ff3f4000], sp=0x00007f02ff3ed8e0, free space=998k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0xdf5649] HandleArea::allocate_null_handle()+0x179 (handles.cpp:46)
V [libjvm.so+0xc5259a] RegisterMap::RegisterMap(JavaThread*, RegisterMap::UpdateMap, RegisterMap::ProcessFrames, RegisterMap::WalkContinuation)+0xaa (frame.cpp:75)
V [libjvm.so+0xf9ffb9] JfrVframeStream::JfrVframeStream(JavaThread*, frame const&, bool, bool)+0x59 (jfrStackTrace.cpp:160)
V [libjvm.so+0xf0940c] JfrAsyncStackTrace::record_async(JavaThread*, frame const&)+0xac (jfrAsyncStackTrace.cpp:58)
V [libjvm.so+0xf0be3d] JfrCPUTimeThreadSampler::handle_timer_signal(void*)+0x4fd (jfrCPUTimeThreadSampler.cpp:231)
V [libjvm.so+0xf0bf7f] handle_timer_signal(int, siginfo*, void*)+0x4f (jfrCPUTimeThreadSampler.cpp:723)
NoHandleMark will not work with with continuations. Please see how we do it in JfrStackTrace::record()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps even more interesting is that I got this when running out-of-the-box with:
XX:StartFlightRecording:settings=profile.jfc
// profile.jfc
<event name="jdk.CPUTimeSample">
<setting name="enabled" control="method-sampling-enabled">false</setting>
<setting name="throttle">10ms</setting>
</event>
This indicates that the event is turned on, despite "enabled" being false. Is there a bug in the control / enable setting mechanism?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is both interesting (and worrying). I'm going to write a test for the latter issue.
|
||
private static void testThrottleSettingsPeriod() throws Exception { | ||
float rate = countEvents(1000, "1ms").rate(); | ||
Asserts.assertTrue(rate > 950 && rate < 1050, "Expected around 1000 events per second, got " + rate); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fails on linux-aarch64-debug builds:
----------System.out:(0/0)----------
----------System.err:(14/1061)----------
java.lang.RuntimeException: Expected around 1000 events per second, got 245.15503: expected true, was false
at jdk.test.lib.Asserts.fail(Asserts.java:691)
at jdk.test.lib.Asserts.assertTrue(Asserts.java:543)
at jdk.jfr.event.profiling.TestCPUTimeEventThrottling.testThrottleSettingsPeriod(TestCPUTimeEventThrottling.java:65)
at jdk.jfr.event.profiling.TestCPUTimeEventThrottling.main(TestCPUTimeEventThrottling.java:49)
at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
at java.base/java.lang.reflect.Method.invoke(Method.java:573)
at com.sun.javatest.regtest.agent.MainWrapper$MainTask.run(MainWrapper.java:138)
at java.base/java.lang.Thread.run(Thread.java:1576)
JavaTest Message: Test threw exception: java.lang.RuntimeException: Expected around 1000 events per second, got 245.15503: expected true, was false
JavaTest Message: shutting down test
Test: jdk/jfr/jcmd/TestJcmdStartWithOptions.java seems to be timing out when the CPU Time Profiler is turned on. |
Test: TestActiveSettingEvent also has problems on Linux, not only on non-Linux platforms. Testing configuration default |
Test: jdk/jfr/event/profiling/TestNative.java times out on Linux. Could perhaps be an indication that event control/settings is getting mixed up between the two samplers? jdk.jfr.event.profiling.TestNative |
Linux AArch64-debug: Are you enqueuing entries that have already been unloaded? A fatal error has been detected by the Java Runtime Environment: SIGSEGV (0xb) at pc=0x0000ffff898316e4, pid=458651, tid=458700 JRE version: Java(TM) SE Runtime Environment (24.0) (fastdebug build 24-internal-2024-10-30-2131063.markus.gronlund.jdkcopy) If you would like to submit a bug report, please visit: --------------- T H R E A D --------------- Current thread (0x0000fffef803f480): JavaThread "JFR Recorder Thread" daemon [_thread_in_vm, id=458700, stack(0x0000ffff05772000,0x0000ffff05970000) (2040K)] Stack: [0x0000ffff05772000,0x0000ffff05970000], sp=0x0000ffff0596e100, free space=2032k |
Into the JFR buffers? It is possible that we access the methods of unloaded classes, when they are unloaded between the signal handler enqueue and the capture. I thought that this would crash in the method resolve, which would then be caught via the used ThreadCrashProtection. But apparently not. I'm unsure how to solve this. Maybe I could check whether the methods class is unloaded? |
Is there any form design document? The description in the JEP is extremely sparse, and this PR carries no information at all.
That is all I could find. Is there more? I am mostly curious - and I admit, nervous - about reentrant safety. I see, again, use of setjmp/longjmp semantics here in combination with a mechanism that I believe stops the JVM at arbitrary(?) times, which to me is a warning sign. |
@@ -40,14 +40,14 @@ | |||
*/ | |||
public class TestNative { | |||
|
|||
final static String NATIVE_EVENT = EventNames.NativeMethodSample; | |||
static String nativeEvent = EventNames.NativeMethodSample; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this change needed?
ThreadCrashProtection is not used inside the signal handler but in a separate worker thread. |
SIGILL on AArch64 Current thread (0x0000ffff783aa220): JfrCPUTimeThreadSampler "JFR CPU Time Thread Sampler" [id=45622, stack(0x0000fffb01f6c000,0x0000fffb0216a000) (2040K)] Stack: [0x0000fffb01f6c000,0x0000fffb0216a000], sp=0x0000fffb021683a0, free space=2032k siginfo: si_signo: 4 (SIGILL), si_code: 1 (ILL_ILLOPC), si_addr: 0x0000ffff68300b4c |
Linux x64-debug: jfr/recorder/checkpoint/types/traceid/jfrTraceIdKlassQueue.cpp:190 assert(id > 0) failed: invariant --------------- T H R E A D --------------- Current thread (0x00007efe60f3f280): JavaThread "JFR Recorder Thread" daemon [_thread_in_vm, id=3499556, stack(0x00007efeb13f5000,0x00007efeb14f5000) (1024K)] Stack: [0x00007efeb13f5000,0x00007efeb14f5000], sp=0x00007efeb14f37d0, free space=1017k |
@mgronlun what is your reproducer? I didn't see these issues before in my own testing |
Internal stress testing. |
linux-aarch64-debug --------------- T H R E A D --------------- Current thread (0x0000ffffac4222b0): JavaThread "JFR Recorder Thread" daemon [_thread_in_vm, id=144094, stack(0x0000ffff5e17a000,0x0000ffff5e378000) (2040K)] Stack: [0x0000ffff5e17a000,0x0000ffff5e378000], sp=0x0000ffff5e376110, free space=2032k |
I see a pattern in the last two segfaults, it seems to conform to what I wrote before. So I probably need a test case that unloads a lot of classes. |
frame topframe; | ||
if (!jt->pd_get_top_frame_for_signal_handler(&topframe, ucontext, false)) { | ||
_error = ERROR_NO_TOPFRAME; | ||
return; | ||
} | ||
frame first_java_frame; | ||
Method* method = nullptr; | ||
JfrGetCallTrace gct(false, jt); | ||
if (!gct.find_top_frame(topframe, &method, first_java_frame)) { | ||
_error = ERROR_NO_TOPFRAME; | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't we end up with the same top frame if we started at the anchor frame instead? In fact, the frame we want should always be the immediate caller of the anchor frame. But I guess we need to check that the anchor frame is walkable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not always the case, as far as I know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not always the case, as far as I know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be ok to start from the anchor frame.
Please see JfrNativeSamplerCallback::call().
#if defined(LINUX) | ||
#include "signals_posix.hpp" | ||
|
||
const int64_t AUTOADAPT_INTERVAL_MS = 100; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static
} | ||
} | ||
|
||
const int SIG = SIGPROF; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static
} | ||
|
||
|
||
int64_t compute_sampling_period(double rate) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static
|
||
static bool _showed_warning = false; | ||
|
||
void warn() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static
// none of it is safe | ||
return false; | ||
} | ||
const traceid mid = JfrTraceId::load(frame._method); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is JfrTraceId::load() barrier code:
inline traceid JfrTraceIdLoadBarrier::load(const Klass* klass, const Method* method) {
assert(klass != nullptr, "invariant");
assert(method != nullptr, "invariant");
if (should_tag(method)) {
SET_METHOD_AND_CLASS_USED_THIS_EPOCH(klass);
SET_METHOD_FLAG_USED_THIS_EPOCH(method);
assert(METHOD_AND_CLASS_USED_THIS_EPOCH(klass), "invariant");
assert(METHOD_FLAG_USED_THIS_EPOCH(method), "invariant");
enqueue(klass);
JfrTraceIdEpoch::set_changed_tag_state();
}
return (METHOD_ID(klass, method));
}
What happens if the sampler crashes at one of these routines?
- Sets bits in the Klass, but not in the Method, and does not enqueue?
- Set bits in the Klass and Method, but does not enqueue?
- Sets bits in Klass, Method and enqueues, but does not notify?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a really good point.
FYI: Today is a bank holiday in Germany, so I don't work on the PR today, but it'll be my main focus in the following weeks
Am I reading this right that the callback does upcall to Java where it modifies bytecode via ASM? |
This is the code for the [JEP draft: CPU Time based profiling for JFR].
Progress
Issue
Contributors
<[email protected]>
<[email protected]>
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/20752/head:pull/20752
$ git checkout pull/20752
Update a local copy of the PR:
$ git checkout pull/20752
$ git pull https://git.openjdk.org/jdk.git pull/20752/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 20752
View PR using the GUI difftool:
$ git pr show -t 20752
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/20752.diff
Using Webrev
Link to Webrev Comment