High- (but not too high-) resolution timeouts

By Jonathan Corbet
September 2, 2008

Linux provides a number of system calls that allow an application to wait for file descriptors to become ready for I/O; they include select(), pselect(), poll(), ppoll(), and epoll_wait(). Each of these interfaces allows the specification of a timeout putting an upper bound on how long the application will be blocked. In typical fashion, the form of that timeout varies greatly. poll() and epoll_wait() take an integer number of milliseconds; select() takes a struct timeval with microsecond resolution, and ppoll() and pselect() take a struct timespec with nanosecond resolution.

They are all the same, though, in that they convert this timeout value to jiffies, with a maximum resolution between one and ten milliseconds. A programmer might issue a pselect() call with a 10 nanosecond timeout, but the call may not return until 10 milliseconds later, even in the absence of contention for the CPU. An error of six orders of magnitude seems like a bit much, especially given that contemporary hardware can easily support much more accurate timing.
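To put numbers on that error, the rounding can be modeled in a few lines of C. This is a simplified sketch, not the kernel's conversion code: the function names are made up for illustration, and tick rates of 100 and 1000 per second are assumed because they correspond to the ten- and one-millisecond jiffies mentioned above.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Round a timeout in nanoseconds up to whole ticks at `hz` ticks per
 * second. Illustrative model only; the kernel's helpers differ in detail.
 */
static uint64_t ns_to_ticks_round_up(uint64_t ns, unsigned int hz)
{
    assert(hz > 0);
    uint64_t ns_per_tick = NSEC_PER_SEC / hz;

    return (ns + ns_per_tick - 1) / ns_per_tick;
}

/* Shortest sleep, in nanoseconds, that a tick-based timeout can express. */
static uint64_t effective_timeout_ns(uint64_t ns, unsigned int hz)
{
    return ns_to_ticks_round_up(ns, hz) * (NSEC_PER_SEC / hz);
}
```

With these definitions, effective_timeout_ns(10, 100) yields 10,000,000: a 10ns request becomes a 10ms sleep, which is the six-orders-of-magnitude error described above.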

Arjan van de Ven recently surfaced with a patch set aimed at addressing this problem. The core idea is simple: have the code implementing poll() and select() use high-resolution timers instead of converting the timeout period to low-resolution jiffies. The implementation relied on a new function to provide the timeouts:

    long schedule_hrtimeout(struct timespec *time, int mode);

Here, time is the timeout period, as interpreted by mode (which is either HRTIMER_MODE_ABS or HRTIMER_MODE_REL).

High-resolution timeouts are a nice feature, but one can immediately imagine a problem: higher-resolution timeouts are less likely to coincide with other events which wake up the processor. The result will be more wakeups and greater power consumption. As it happens, there are few developers who are more aware of this fact than Arjan, who has done quite a bit of work aimed at keeping processors asleep as much as possible. His solution to this problem was to only use high-resolution timeouts if the timeout period is less than one second. For longer timeout periods, the old, jiffy-based mechanism was used as before.

Linus didn't like that solution, calling it "ugly." His preference, instead, was to have schedule_hrtimeout() apply an appropriate amount of fuzz to all timeout values; the longer the timeout, the less resolution would be supplied. Alan Cox suggested that a better mechanism would be for the caller to supply the required accuracy with the timeout value. The problem with that idea, as Linus pointed out, is that the current system call interfaces provide no way for an application to supply the accuracy value. One could create more poll()-like system calls - as if there weren't enough of them already - with an accuracy parameter, but that looks like a lot of trouble to create a non-standard interface which few programmers would bother to use.

A different solution came in the form of Arjan's range-capable timer patch set. This patch extends hrtimers to accept two timeout values, called the "soft" and "hard" timeouts. The soft value - the shorter of the two - is the first time at which the timeout can expire; the kernel will make its best effort to ensure that it does not expire after the hard period has elapsed. In between the two, the kernel is free to expire the timer at any convenient time.
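The power benefit of such a window can be illustrated with a small userspace model. The sketch below is not how the kernel schedules timers; it only counts the fewest wakeups needed so that every timer fires somewhere inside its own [soft, hard] window. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model of a range timer: it may fire anywhere in [soft, hard]. */
struct range_timer {
    uint64_t soft_ns;
    uint64_t hard_ns;
};

static int by_hard_deadline(const void *a, const void *b)
{
    const struct range_timer *x = a, *y = b;

    return (x->hard_ns > y->hard_ns) - (x->hard_ns < y->hard_ns);
}

/*
 * Sort the timers by hard deadline, then greedily place each wakeup at the
 * earliest outstanding hard deadline. Any later timer whose window is
 * already open at that moment fires on the same wakeup.
 */
static size_t min_wakeups(struct range_timer *t, size_t n)
{
    size_t wakeups = 0;
    uint64_t wake_at = 0;

    qsort(t, n, sizeof(*t), by_hard_deadline);
    for (size_t i = 0; i < n; i++) {
        assert(t[i].soft_ns <= t[i].hard_ns);
        if (wakeups == 0 || t[i].soft_ns > wake_at) {
            wake_at = t[i].hard_ns;
            wakeups++;
        }
    }
    return wakeups;
}
```

Three timers with exact expiry times of 100, 150, and 200 need three wakeups; give each a window of 60 (for example [100,160], [150,210], [155,215]) and a single wakeup at 160 satisfies all of them.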

It's a useful feature, but it comes at the cost of some significant API changes. To begin with, the expires field of struct hrtimer goes away. Rather than manipulate expires directly, kernel code must now use one of the new accessor functions:

    void hrtimer_set_expires(struct hrtimer *timer, ktime_t time);
    void hrtimer_set_expires_tv64(struct hrtimer *timer, s64 tv64);
    void hrtimer_add_expires(struct hrtimer *timer, ktime_t time);
    void hrtimer_add_expires_ns(struct hrtimer *timer, unsigned long ns);
    ktime_t hrtimer_get_expires(const struct hrtimer *timer);
    s64 hrtimer_get_expires_tv64(const struct hrtimer *timer);
    s64 hrtimer_get_expires_ns(const struct hrtimer *timer);
    ktime_t hrtimer_expires_remaining(const struct hrtimer *timer);

Once that's done, the range capability is added to hrtimers. By default, the soft and hard expiration times are the same; code which wishes to set them independently can use the new functions:

    void hrtimer_set_expires_range(struct hrtimer *timer, ktime_t time, 
                                   ktime_t delta);
    void hrtimer_set_expires_range_ns(struct hrtimer *timer, ktime_t time,
                                      unsigned long delta);
    ktime_t hrtimer_get_softexpires(const struct hrtimer *timer);
    s64 hrtimer_get_softexpires_tv64(const struct hrtimer *timer);

In the new "set" functions, the specified time is the soft timeout, while time+delta provides the hard timeout value. There is also another form of schedule_hrtimeout():

    int schedule_hrtimeout_range(ktime_t *expires, unsigned long delta,
                                 const enum hrtimer_mode mode);
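The relationship between the soft and hard values can be mimicked outside the kernel. The following is a userspace stand-in, assuming plain 64-bit nanosecond counts in place of ktime_t; the model_ names are invented for this sketch and are not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for the two expiry times; not the kernel's struct hrtimer. */
struct model_timer {
    int64_t soft_ns;   /* earliest permitted expiry */
    int64_t hard_ns;   /* latest intended expiry */
};

/* As described above: `time_ns` is the soft timeout, time_ns + delta_ns the hard one. */
static void model_set_expires_range_ns(struct model_timer *t, int64_t time_ns,
                                       uint64_t delta_ns)
{
    assert(time_ns >= 0);
    assert(delta_ns <= (uint64_t)(INT64_MAX - time_ns));
    t->soft_ns = time_ns;
    t->hard_ns = time_ns + (int64_t)delta_ns;
}

/* The default case: soft and hard expiration times coincide. */
static void model_set_expires(struct model_timer *t, int64_t time_ns)
{
    model_set_expires_range_ns(t, time_ns, 0);
}

/* The timer may fire at any convenient moment from the soft time onward... */
static bool model_may_expire(const struct model_timer *t, int64_t now_ns)
{
    return now_ns >= t->soft_ns;
}

/* ...and is late once the hard time has passed. */
static bool model_is_overdue(const struct model_timer *t, int64_t now_ns)
{
    return now_ns > t->hard_ns;
}
```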

With this infrastructure in place, poll() and friends can be given approximate timeouts; the only remaining question is just how wide the range of times should be. In Arjan's patch, that range comes from two different sources. The first is a new field in the task structure called timer_slack_ns; as one might expect, it specifies the maximum expected timer accuracy in nanoseconds. This value can be adjusted via the prctl() system call. The default value is set to 50 microseconds - approximate to a certain degree, but still far more accurate than the timeouts in current kernels.
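From user space, adjusting the slack might look like the sketch below. The article does not name the prctl() options; PR_SET_TIMERSLACK and PR_GET_TIMERSLACK are the names found in <sys/prctl.h> on mainline kernels, and the wrapper function names are made up here.

```c
#include <sys/prctl.h>

/*
 * Set the calling thread's timer slack, in nanoseconds.
 * Returns 0 on success, -1 on error (with errno set).
 */
static int set_timer_slack_ns(unsigned long slack_ns)
{
    return prctl(PR_SET_TIMERSLACK, slack_ns, 0, 0, 0);
}

/* Return the calling thread's current timer slack in nanoseconds, or -1 on error. */
static int get_timer_slack_ns(void)
{
    return prctl(PR_GET_TIMERSLACK, 0, 0, 0, 0);
}
```

A thread that wants tighter timeouts than the default could call set_timer_slack_ns(1000) to ask for one microsecond of slack before entering its poll() loop.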

Beyond that, though, there is a heuristic function which provides an accuracy value depending on the requested timeout period. In the case of especially long timeouts - more than ten seconds - the accuracy is set to 100ms; as the timeouts get shorter, the amount of acceptable error drops, down to a minimum of 10ns for very brief timeouts. Normally, poll() and company will use the value returned by the heuristic, but with the exception that the accuracy will never exceed the value found in timer_slack_ns.
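The shape of that calculation can be sketched as follows. Only the two endpoints (a 10ns floor and a 100ms value beyond ten seconds) and the cap at timer_slack_ns come from the description above; the 1% scaling in between is purely an assumption, picked because it reaches 100ms at exactly ten seconds. The function names are hypothetical.

```c
#include <stdint.h>

#define MIN_SLACK_NS 10ULL          /* floor for very brief timeouts */
#define MAX_SLACK_NS 100000000ULL   /* 100ms, reached at ten seconds and beyond */

/*
 * ASSUMPTION: accuracy scales as 1% of the timeout between the two
 * endpoints. The intermediate steps in the actual patch are not
 * given in the text.
 */
static uint64_t heuristic_slack_ns(uint64_t timeout_ns)
{
    uint64_t slack = timeout_ns / 100;

    if (slack < MIN_SLACK_NS)
        slack = MIN_SLACK_NS;
    if (slack > MAX_SLACK_NS)
        slack = MAX_SLACK_NS;
    return slack;
}

/* The heuristic's answer, never exceeding the task's timer_slack_ns. */
static uint64_t poll_slack_ns(uint64_t timeout_ns, uint64_t timer_slack_ns)
{
    uint64_t slack = heuristic_slack_ns(timeout_ns);

    return slack < timer_slack_ns ? slack : timer_slack_ns;
}
```

Under these assumptions and the default 50-microsecond setting, a 20-second timeout gets 50 microseconds of slack (the heuristic's 100ms is capped), while a one-millisecond timeout gets 10 microseconds.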

The end result is the provision of more accurate timeouts on the polling functions while, simultaneously, preserving the ability to combine timeouts with other system events.


High- (but not too high-) resolution timeouts

Posted Sep 4, 2008 5:08 UTC (Thu) by njs (guest, #40338) (5 responses)

AFAICT there is no way to set a lower bound on accuracy (i.e., "no really, I *meant* that timespec"), which might be useful for realtime work.

Squinting at the code, I think the constants in the accuracy heuristic are actually ok for the soft-realtime case that I happen to care about. It seems safer to use timerfd for RT poll timeouts, though -- it seems to keep full accuracy (though perhaps that's a bug!).

High- (but not too high-) resolution timeouts

Posted Sep 4, 2008 13:15 UTC (Thu) by arjan (subscriber, #36785) (4 responses)

it's a hard complex problem; the only real solution is to have a pppoll syscall that takes explicit slack as argument, for those people who really need that 10 second sleep with 1 nanosecond accuracy.

I can't say that I'm happy about the "estimate how big the slack is" function, but I've not been able to come up with something significantly nicer that still works well (where "works well" includes just doing the right thing for power saving). I was hoping others on lkml would have better ideas but so far it seems it's a hard problem, where the best I've seen so far is "just use 0.1% of the total delay".

High- (but not too high-) resolution timeouts

Posted Sep 4, 2008 18:34 UTC (Thu) by dcoutts (subscriber, #5387) (1 response)

How about having people who need really high accuracy use timerfd, insert that timer into their poll/epoll/select set, and then give no timeout when they wait on that event set? That way, the highly accurate timers can be used while letting all the ordinary stuff use the existing mechanism that saves power by waking everyone up at the same time etc.

High- (but not too high-) resolution timeouts

Posted Sep 4, 2008 21:09 UTC (Thu) by quotemstr (subscriber, #45331)

Good idea! Why not post this to linux-kernel?

High- (but not too high-) resolution timeouts

Posted Sep 5, 2008 5:02 UTC (Fri) by njs (guest, #40338) (1 response)

It might be sufficient to just disable the heuristic when the thread making the syscall has real-time priority. Don't know if that's the Perfect Answer, but I think it would be an incremental improvement at least.

I don't much like the idea of making this an undocumented difference between different timing syscalls (like someone else suggested), so that if you use ppoll you get one thing and timerfd another etc. -- I don't see why timerfd should be useful only to apps who need precision and don't care about power! Really the behavior should be uniform across ppoll/poll/pselect/select, epoll, timerfd, nanosleep, posix timers, interval timers. (Not sure if all of those use hrtimers yet; I know nanosleep and posix timers do in -rt.)

For timerfd in particular, one could add a timerfd_setslack call without breaking compatibility. It might be possible for some of those other APIs as well.

High- (but not too high-) resolution timeouts

Posted Sep 5, 2008 5:24 UTC (Fri) by arjan (subscriber, #36785)

the realtime thing is there already in my current codebase

right now what I do is (in summary)

if realtime => slack is 0

if nice, slack is 0.5% with a max of 100 msec
if not nice, slack is 0.1% with a max of 100 msec
if not rt and slack is less than the per thread setting, use the per thread setting
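Written out as a C function, the rules listed above might look like this. It is a sketch of the summary only, not the code from the patch set; the function name, parameter names, and the use of plain nanosecond counts are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_SLACK_NS 100000000ULL   /* 100 msec */

/*
 * Slack for a timeout of `timeout_ns`, following the summarized rules:
 * realtime tasks get none; others get 0.5% (niced) or 0.1% (not niced),
 * capped at 100 msec and raised to the per-thread setting if smaller.
 */
static uint64_t estimate_slack_ns(uint64_t timeout_ns, bool realtime,
                                  bool niced, uint64_t thread_slack_ns)
{
    uint64_t slack;

    if (realtime)
        return 0;

    slack = niced ? timeout_ns / 200 : timeout_ns / 1000;
    if (slack > MAX_SLACK_NS)
        slack = MAX_SLACK_NS;
    if (slack < thread_slack_ns)
        slack = thread_slack_ns;
    return slack;
}
```

For a one-second timeout and a 50 usec per-thread setting, this gives 1 msec of slack for an ordinary task, 5 msec for a niced one, and none for a realtime one.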

High- (but not too high-) resolution timeouts

Posted Sep 4, 2008 17:44 UTC (Thu) by mezcalero (subscriber, #45103) (4 responses)

Using prctl() for a process-wide timer accuracy doesn't appear to be a good idea to me. In many RT apps you have both RT and non-RT threads. Usually only for the RT threads timer accuracy matters while the non-RT threads only do non-critical housekeeping work where accuracy really doesn't matter. Having a per-thread way to adjust timer accuracy seems more appropriate to me.

Speaking of prctl

Posted Sep 4, 2008 18:05 UTC (Thu) by quotemstr (subscriber, #45331) (2 responses)

Why do we use prctl at all anyway? Isn't adding a new prctl just a backdoor way of adding another system call? Doesn't the same argument against using ioctl for arbitrary functionality apply also to prctl?

Why not just add new system calls for new functionality?

Speaking of prctl

Posted Sep 4, 2008 18:15 UTC (Thu) by arjan (subscriber, #36785)

prctl()'s purpose is to get and set values/properties on a per thread basis.....

it's not nearly as much a blanket thing as ioctl() is.

Speaking of prctl

Posted Sep 5, 2008 15:40 UTC (Fri) by wahern (subscriber, #37304)

One nice thing about such interfaces is that you can usually check for the interface using the C preprocessor (i.e. "#if defined FD_CLOEXEC ... #elif defined HANDLE_FLAG_INHERIT ..."). For totally obscure non-portable interfaces, that's definitely a win, especially for those of us not willing to use autoconf (or unable, such as when porting to Visual Studio).

High- (but not too high-) resolution timeouts

Posted Sep 4, 2008 18:14 UTC (Thu) by arjan (subscriber, #36785)

prctl() works on a per thread basis.....

High- (but not too high-) resolution timeouts

Posted Sep 6, 2008 15:23 UTC (Sat) by kbob (guest, #1770) (1 response)

"This value can be adjusted via the prctl() system call. The default value is set to 50 microseconds - approximate to a certain degree, but still far more accurate than the timeouts in current kernels"

Why not set the default value to one jiffy? That would maintain compatibility with older kernels, and the longer default value would result in fewer wakeups for programs that haven't called prctl().

High- (but not too high-) resolution timeouts

Posted Sep 6, 2008 17:39 UTC (Sat) by arjan (subscriber, #36785)

one of the issues is that the old code had a *maximum* of 1 millisecond, but the average was more like 500 usec.

The other thing is... with the code in the patchkit, the behavior in terms of power isn't all that bad (after all, userland isn't poll()ing like crazy anymore; all that got fixed with powertop).
And using 50 usec means that media apps and desktop apps actually get an improvement in behavior... I'd hate to give away that real value for something that doesn't really save more power.

High- (but not too high-) resolution timeouts

Posted Sep 8, 2008 11:50 UTC (Mon) by cde (guest, #46554)

10ms isn't so bad. back in the days of windows 95 we had to deal with 50ms timers ;)