Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Progress and Cancelation for Indexing #1379

Open
wants to merge 4 commits into
base: develop
Choose a base branch
from

Conversation

BickfordA
Copy link
Contributor

It would be really nice to be able to capture the progress of the indexing functions, especially for large files as this can take some time.

Adding callbacks to the index functions would provide a way to pass progress to the embedding library. Additionally it allow for canceling the indexing function (without killing the program).

Is this something that you are interested in supporting? Let me know if this is of any interest, or if there is anything i can do to help. Thanks!

@daviesrob
Copy link
Member

This is a nice idea, but currently it changes the signature of tbx_index(), which would change both the API and ABI. Instead, you should make a tbx_index2() with the extra parameters, and leave tbx_index() as it was.

The file_progress_func typedef would be better named hts_file_progress_func (or maybe hts_progress_callback?) to make name clashes with other progress callbacks less likely.

@jkbonfield
Copy link
Contributor

jkbonfield commented Jan 17, 2022

I wonder if there is a general way this can be handled. It's rather specific to indexing only. What about any other task iterating over a file? It seems to me that file offset, if known, is a more workable progression marker. We don't necessarily know the start / end of the region in question, but maybe that too can be made accessable by analysing the index query results.

We could almost do this right now by using ftell type functions (on the relevant hts file descriptor) from a separate thread, were it not for potential thread memory access clashes. If there was a thread-safe API function that could be called though then the issue is solved.

It's a different model of operation, requiring a separate UI thread that periodically polls to perform updates, but this feels cleaner than adding callbacks to potentially more and more pieces of code.

Edit: note the above is for progress indicators, but not cancelling. I'd guess you would also need a cancelling method by setting a flag on the file descriptor. The exsiting code should handle errors when reading / writing data anyway, so cancelling is simply a matter of forcing that code to be triggered.

@daviesrob
Copy link
Member

We have htell and bgzf_tell, but currently no combined hts_tell. However, the cases here are different from the general one because the interfaces consume the entire file before returning. You either need a callback (as here), or a completely different indexing interface that gets called in a loop so you can update progress each time it returns.

@jkbonfield
Copy link
Contributor

jkbonfield commented Jan 17, 2022

I'm not making myself clear. I'm talking about separate UI threads from computation threads. This is often how many GUIs are written. So it doesn't matter that the function consumes all input before finishing, provided it's possible to query the progress from another thread and feed back.

That's why I say a thread safe interface to query position in the input file. (With also a thread safe way of determining the start / end positions for purposes of range queries)

Specifically I'm trying to think of general library design choices that can be applied universally, hopefully without too many icky internals, without having to modify every function in htslib and/or command in samtools to have callback functions or some sort of co-routine yield interface to be interruptable.

@jmarshall
Copy link
Member

We have htell and bgzf_tell, but currently no combined hts_tell.

FYI I have had a draft adding hts_seek()/hts_tell() gestating for a long time. It got bogged down because bringing cram_fd's buffering and concept of offsets into the fold seems fairly intractable.

@daviesrob
Copy link
Member

I guess you could do this by probing from another thread. It's the thread safe part of it that would be hard...

@BickfordA
Copy link
Contributor Author

It's rather specific to indexing only. What about any other task iterating over a file?

Using the file position/offset works well when reading the file records. Though a combined hts_tell would be convenient. It seems like usually there is a call to htslib per record.

Would the thread safe progress api allow for canceling the index job? Another nice feature of using a callback is you can pause the computation (by not returning immediately) .

@jkbonfield
Copy link
Contributor

Maybe we need to know what the actual problem is you're trying to solve. For example why is being able to pause computation via a callback a beneficial feature?

My worry about doing this for indexing, is what's next? General reading through a file, needing a new API with a call back there? Then for VCF too? Then in the synced reader VCF interface? You can see there is inevitable room for feature creep. You personally may only need indexing, but once precedence has been set it's inevitable other people will start to ask for a callback interface in every part of the library.

@BickfordA
Copy link
Contributor Author

BickfordA commented Jan 17, 2022

Indexing takes a long time by nature of the fact that you have to read the whole file. As a result there are two problems I'm trying to solve:

  • Since indexing is a single api call, it would be nice to have a way to return progress updates to see how far along the process so end users can see work is being done, and you can estimate how long its going to take etc...
  • There isn't a way to cancel a call to the index function, you have to wait for the whole file to be read.

I wouldn't worry too much about creep, as neither of these cases are an issue with general reading through the files. When reading the files iterating record by record provides very granular steps through the file. As a result it is easy to calculate progress based on the current offset, and it is very easy to stop reading.

I suppose an alternate option would be to make apis to run the index creation in steps, something like

open_hfile
create_index_file 
while read_file_block {
    write_index_block
}
close_index_file
close_hfile

And i could drive the indexing loop outside of htslib.

@jkbonfield
Copy link
Contributor

jkbonfield commented Jan 18, 2022

That's a good point about single long-running function vs many repeated calls to (eg) read, merge, etc. I was thinking more broadly, including those that manipulate samtools. (Although frankly the way tools like pysam pretend samtools is a library rather than a subprocess has never really sat well with me.)

@BickfordA
Copy link
Contributor Author

This is a nice idea, but currently it changes the signature of tbx_index(), which would change both the API and ABI. Instead, you should make a tbx_index2() with the extra parameters, and leave tbx_index() as it was.

The file_progress_func typedef would be better named hts_file_progress_func (or maybe hts_progress_callback?) to make name clashes with other progress callbacks less likely.

@daviesrob I made the changes that you suggested. I'm not sure if this was waylaid by the following discussion. If this is still of interest hopefully it is a little closer to where it needs to be.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants