Large scale watching of paths appears to drop/miss events #412

Open
chipsenkbeil opened this issue Jun 4, 2022 · 8 comments

@chipsenkbeil

System details

  • Rust version (if building from source): rustc --version: rustc 1.60.0 (7737e0b5c 2022-04-04)
  • Notify version (or commit hash if building from git): 5.0.0-pre.15

Mac

M1 Macbook Air from 2020.

Windows

Traditional Windows 11 machine.

Linux

Running Alpine Linux using Mac Parallels on an M1 Mac, which means Alpine is using ARM.

What you did (as detailed as you can)

Example repo where you can run the test: https://github.com/chipsenkbeil/notify-stress-test

  1. Create a watcher using notify::recommended_watcher
  2. Have events get sent from the watcher handler out of the thread using std::sync::mpsc::Sender
  3. Continually receive events on a separate thread using std::sync::mpsc::Receiver
  4. Create a large number of paths to watch (files or directories) and watch each path using the same watcher
  5. Perform something within each path (modify a file, add a file to a directory, etc)
  6. Tick off each path with an event received from the std::sync::mpsc::Receiver
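The bookkeeping in steps 3 and 6 can be sketched with the standard library alone. This is a minimal illustration, not the code from the linked repo: `FakeEvent` is a hypothetical stand-in for the watcher's event type, and `unseen_paths` is an assumed helper name. It drains the receiver until every expected path has been seen or a timeout elapses, then reports which paths never produced an event.

```rust
use std::collections::HashSet;
use std::path::PathBuf;
use std::sync::mpsc::Receiver;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a watcher event (illustration only).
pub struct FakeEvent {
    pub paths: Vec<PathBuf>,
}

/// Receives events until every expected path has been seen or the
/// timeout elapses, then returns the paths that never showed up.
pub fn unseen_paths(
    rx: &Receiver<FakeEvent>,
    expected: &[PathBuf],
    timeout: Duration,
) -> HashSet<PathBuf> {
    let mut remaining: HashSet<PathBuf> = expected.iter().cloned().collect();
    let deadline = Instant::now() + timeout;

    while !remaining.is_empty() {
        let now = Instant::now();
        if now >= deadline {
            break;
        }
        match rx.recv_timeout(deadline - now) {
            // Tick off every path mentioned by this event.
            Ok(event) => {
                for path in &event.paths {
                    remaining.remove(path);
                }
            }
            // Timed out, or every sender has been dropped.
            Err(_) => break,
        }
    }
    remaining
}
```

A non-empty result corresponds to the "N/1500 file paths not modified" style of failure reported below.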

What you expected

Path modifications or other events show up for watched paths at scale. For example, if watching 1500 individual files, modifying each file would result in some event being captured and passed along.

What happened

With enough different paths being watched, events start to go missing or get dropped. For example, when watching 1500 individual files and modifying each one, 246 paths were never reported as modified.

❯ cargo test
   Compiling notify-stress-test v0.1.0 (/Users/senkwich/projects/notify-stress-test)
    Finished test [unoptimized + debuginfo] target(s) in 0.80s
     Running unittests (target/debug/deps/notify_stress_test-af3191235b8b1ce0)

running 1 test
test tests::stress_test ... FAILED

failures:

---- tests::stress_test stdout ----
thread 'tests::stress_test' panicked at 'assertion failed: `(left == right)`
  left: `246`,
 right: `0`: 246/1500 file paths not modified', src/lib.rs:88:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    tests::stress_test

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 18.16s

error: test failed, to rerun pass '--lib'
@chipsenkbeil
Author

I've tried this with files and directories. It seems like the watcher doesn't receive or maybe invoke the handler for a subset of the paths if enough changes happen to a large enough set of files.

On my Mac, I had to crank the file count to 1500 to see this happen consistently. On Windows, it was a bit slower and I could reliably hit this with ~500 paths.

Is there a better way to handle this? Should I not expect notify to handle a large number of paths at once? I know that some recommendations have been to watch parent directories to detect changes within, but my project is a server that can have any number of paths watched at any time; so, I want to make sure that I understand the limits and if there's anything I can do to handle this better! 😄

@0xpr03
Member

0xpr03 commented Jun 5, 2022

Can you try running your example in --release mode? You seem to ignore any send errors; is there a reason this panics? Also, if possible, don't print anything inside the hot code: every println! has to acquire a global lock. That can certainly cause misses and create contention between notify's callback thread and your file changer.

I looked over your code, which seems fine. Depending on the OS, watching a folder may not really differ from watching a specific set of files, though watching specific files may also not emit all events (file deletion, move, etc.), which again depends on the OS. Thus we recommend watching the parent of the thing you're actually interested in.

@chipsenkbeil
Author

I ran a release-mode test on my Mac with no print statements, and it still fails.

I've also tested it so that any event for a path counts, not just a modification. That helps, but if you throw enough paths at the problem (or use Windows), it still fails.

I guess my next step is to watch the actual parent path as you recommended and then filter events if they are for a child path that we're actively watching.
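The "watch the parent, then filter" idea can be sketched as follows. This is only an illustration using the standard library: `watch_root` and `is_relevant` are hypothetical helper names, and the real watcher call is left out. `Path::starts_with` compares whole path components, so `/tmp/dir_1/file2` does not match a target of `/tmp/dir_1/file`.

```rust
use std::collections::HashSet;
use std::path::{Path, PathBuf};

/// Picks the directory to hand to the watcher for a path of interest:
/// its parent when it has one, otherwise the path itself.
pub fn watch_root(target: &Path) -> PathBuf {
    match target.parent() {
        Some(parent) if !parent.as_os_str().is_empty() => parent.to_path_buf(),
        _ => target.to_path_buf(),
    }
}

/// Returns true when an event path is one of the targets we care about,
/// or lives underneath one of them (component-wise prefix match).
pub fn is_relevant(event_path: &Path, targets: &HashSet<PathBuf>) -> bool {
    targets.iter().any(|target| event_path.starts_with(target))
}
```

Events that arrive for siblings in the same parent directory are then dropped by the filter before any further processing.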

@chipsenkbeil
Author

Oh, as for ignoring the send error, when I had it panic with an expect("...."), the thread dies and then my test locks up instead of failing. In my real use case, I use try_send to determine if the event can be sent or if capacity has been reached, reporting a warning if so as that's a bigger issue to deal with.
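For reference, the `try_send` pattern mentioned here might look roughly like this with a bounded `std::sync::mpsc::sync_channel`. It is a sketch under assumptions, not the author's real code: the `Forward` enum and the `forward` function are hypothetical names. The point is that a full channel and a closed channel are reported to the caller instead of being silently ignored or causing a panic.

```rust
use std::sync::mpsc::{SyncSender, TrySendError};

/// Outcome of trying to forward one event (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Forward {
    /// The event was queued.
    Sent,
    /// The bounded channel is at capacity; the event was dropped.
    Full,
    /// The receiver is gone; no further events can be delivered.
    Closed,
}

/// Forwards an event without blocking the watcher's callback thread.
/// The caller decides how to log or react to `Full` and `Closed`.
pub fn forward<T>(tx: &SyncSender<T>, event: T) -> Forward {
    match tx.try_send(event) {
        Ok(()) => Forward::Sent,
        Err(TrySendError::Full(_)) => Forward::Full,
        Err(TrySendError::Disconnected(_)) => Forward::Closed,
    }
}
```

Counting the `Full` outcomes in a test makes it possible to tell events dropped by the application's own channel apart from events that never arrived from the OS.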

@0xpr03
Member

0xpr03 commented Jun 6, 2022

Well it shouldn't reach capacity. And if you're ignoring send errors, you'll certainly not receive all events in your test, as you're apparently not sending all events over the channel.

@chipsenkbeil
Author

@0xpr03 sorry, ignore the capacity thing as I don't think it's relevant for this specific challenge since the sender I'm using in the example is unbounded.

I modified the code example to create a bunch of directories with a single file within each like this:

  • $TMP/dir_1/file
  • $TMP/dir_2/file
  • $TMP/dir_3/file

From there, I watched the parent directories of the files such as $TMP/dir_1.

I also removed any filtering of events by kind, so I'm just checking to see if I get any event for a given file. Lastly, I added a sleep of 1ms per attempt to receive an event from the main test thread, just in case my loop was spiking the CPU or something that would affect the notify thread.

This does appear to help Windows out quite a lot, which was the biggest troublemaker when it came to missing events. On Mac, I'm still having the problem of missing events. Is there anything else I can do in this case? I don't expect to have 1500 separate paths being watched at a time, but want to check if there's anything else that I could be doing better to use the library.

@0xpr03
Member

0xpr03 commented Sep 5, 2022

It's probably not what you want to hear, but I'll let the man page speak for me:

With careful programming, an application can use inotify to
efficiently monitor and cache the state of a set of filesystem
objects. However, robust applications should allow for the fact
that bugs in the monitoring logic or races of the kind described
below may leave the cache inconsistent with the filesystem state.
It is probably wise to do some consistency checking, and rebuild
the cache when inconsistencies are detected.

If successive output inotify events produced on the inotify file
descriptor are identical (same wd, mask, cookie, and name), then
they are coalesced into a single event if the older event has not
yet been read (but see BUGS). This reduces the amount of kernel
memory required for the event queue, but also means that an
application can't use inotify to reliably count file events.

Note that the event queue can overflow. In this case, events are
lost. Robust applications should handle the possibility of lost
events gracefully. For example, it may be necessary to rebuild
part or all of the application cache. (One simple, but possibly
expensive, approach is to close the inotify file descriptor,
empty the cache, create a new inotify file descriptor, and then
re-create watches and cache entries for the objects to be
monitored.)

(another one for the Known Issues)

I think this also applies to macOS. (The event queue above is internal to the OS; it is not your channel from the previous discussion.)

If I'm reading this right you also don't get any errors (EINVAL), so we're not accidentally using a buffer that is too tiny.

TLDR: Use the PollWatcher if you want to be super certain that no events are missed, but you won't get very precise information (it can only tell you diffs based on the files in a folder and last-changed timestamps / content changes).
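To illustrate the kind of coarse diff a polling approach can produce, here is a minimal sketch based only on last-modified timestamps. It is not how notify's PollWatcher is implemented; `Snapshot`, `Diff`, `snapshot_dir`, and `diff` are hypothetical names chosen for this example, and `snapshot_dir` looks at one directory level only.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

/// Maps each path to the modification time seen at the last poll.
pub type Snapshot = HashMap<PathBuf, SystemTime>;

/// What changed between two polls. Note that there is no record of
/// how many times, or in what order, a file changed in between.
#[derive(Debug, Default, PartialEq)]
pub struct Diff {
    pub created: Vec<PathBuf>,
    pub modified: Vec<PathBuf>,
    pub removed: Vec<PathBuf>,
}

/// Records the modification time of every entry directly inside `dir`.
pub fn snapshot_dir(dir: &Path) -> std::io::Result<Snapshot> {
    let mut snap = Snapshot::new();
    for entry in std::fs::read_dir(dir)? {
        let entry = entry?;
        snap.insert(entry.path(), entry.metadata()?.modified()?);
    }
    Ok(snap)
}

/// Compares two snapshots and reports created, modified, and removed paths.
pub fn diff(old: &Snapshot, new: &Snapshot) -> Diff {
    let mut d = Diff::default();
    for (path, mtime) in new {
        match old.get(path) {
            None => d.created.push(path.clone()),
            Some(prev) if prev != mtime => d.modified.push(path.clone()),
            Some(_) => {}
        }
    }
    for path in old.keys() {
        if !new.contains_key(path) {
            d.removed.push(path.clone());
        }
    }
    // Sort so the output does not depend on hash map iteration order.
    d.created.sort();
    d.modified.sort();
    d.removed.sort();
    d
}
```

Because the comparison is state-based, a change cannot be lost to a queue overflow, but several writes between two polls collapse into a single "modified" entry.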

@anton-kapelyushok

anton-kapelyushok commented Sep 26, 2023

https://developer.apple.com/library/archive/documentation/Darwin/Conceptual/FSEvents_ProgGuide/UsingtheFSEventsFramework/UsingtheFSEventsFramework.html#//apple_ref/doc/uid/TP40005289-CH4-SW1

If an event in a directory occurs at about the same time as one or more events in a subdirectory of that directory, the events may be coalesced into a single event. In this case, you will receive an event with the kFSEventStreamEventFlagMustScanSubDirs flag set. When you receive such an event, you must recursively rescan the path listed in the event. The additional changes are not necessarily in an immediate child of the listed path.
If a communication error occurs between the kernel and the user-space daemon, you may receive an event with either the kFSEventStreamEventFlagKernelDropped or kFSEventStreamEventFlagUserDropped flag set. In either case, you must do a full scan of any directories that you are monitoring because there is no way to determine what may have changed.

This is handled here. Basically, every time you receive an Other event with the Rescan flag set, you need to rescan everything you watch.

I believe this behavior should be reflected more explicitly in the documentation.
Also, it looks to me like such an event is more of an error than a regular event.

Edit: Apparently it is addressed in #434
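The rescan rule described in this comment can be sketched as follows. The `WatchEvent` enum here is a deliberately simplified, hypothetical type and not notify's real event API; `paths_to_check` is likewise an assumed name. The sketch only shows the control flow: a rescan signal invalidates everything being watched, while ordinary events invalidate just the paths they name.

```rust
use std::collections::HashSet;
use std::path::PathBuf;

/// Simplified, hypothetical event type for illustration only.
pub enum WatchEvent {
    /// Something happened at this specific path.
    Changed(PathBuf),
    /// The OS coalesced or dropped events; details are unknown.
    Rescan,
}

/// Returns the set of paths whose state must be re-checked after a
/// batch of events. Any `Rescan` in the batch means all watched paths.
pub fn paths_to_check(
    events: &[WatchEvent],
    watched: &HashSet<PathBuf>,
) -> HashSet<PathBuf> {
    let mut dirty = HashSet::new();
    for event in events {
        match event {
            // Nothing can be trusted any more: re-check everything.
            WatchEvent::Rescan => return watched.clone(),
            WatchEvent::Changed(path) => {
                dirty.insert(path.clone());
            }
        }
    }
    dirty
}
```

A stress test like the one in this issue could treat a rescan signal as "all remaining paths may have changed" rather than counting those paths as missed.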
