Progress and Cancelation for Indexing #1379

BickfordA · 2022-01-12T00:59:17Z

It would be really nice to be able to capture the progress of the indexing functions, especially for large files as this can take some time.

Adding callbacks to the index functions would provide a way to pass progress to the embedding library. Additionally it allow for canceling the indexing function (without killing the program).

Is this something that you are interested in supporting? Let me know if this is of any interest, or if there is anything i can do to help. Thanks!

daviesrob · 2022-01-17T15:00:05Z

This is a nice idea, but currently it changes the signature of tbx_index(), which would change both the API and ABI. Instead, you should make a tbx_index2() with the extra parameters, and leave tbx_index() as it was.

The file_progress_func typedef would be better named hts_file_progress_func (or maybe hts_progress_callback?) to make name clashes with other progress callbacks less likely.

jkbonfield · 2022-01-17T15:16:08Z

I wonder if there is a general way this can be handled. It's rather specific to indexing only. What about any other task iterating over a file? It seems to me that file offset, if known, is a more workable progression marker. We don't necessarily know the start / end of the region in question, but maybe that too can be made accessable by analysing the index query results.

We could almost do this right now by using ftell type functions (on the relevant hts file descriptor) from a separate thread, were it not for potential thread memory access clashes. If there was a thread-safe API function that could be called though then the issue is solved.

It's a different model of operation, requiring a separate UI thread that periodically polls to perform updates, but this feels cleaner than adding callbacks to potentially more and more pieces of code.

Edit: note the above is for progress indicators, but not cancelling. I'd guess you would also need a cancelling method by setting a flag on the file descriptor. The exsiting code should handle errors when reading / writing data anyway, so cancelling is simply a matter of forcing that code to be triggered.

daviesrob · 2022-01-17T15:37:44Z

We have htell and bgzf_tell, but currently no combined hts_tell. However, the cases here are different from the general one because the interfaces consume the entire file before returning. You either need a callback (as here), or a completely different indexing interface that gets called in a loop so you can update progress each time it returns.

jkbonfield · 2022-01-17T16:25:52Z

I'm not making myself clear. I'm talking about separate UI threads from computation threads. This is often how many GUIs are written. So it doesn't matter that the function consumes all input before finishing, provided it's possible to query the progress from another thread and feed back.

That's why I say a thread safe interface to query position in the input file. (With also a thread safe way of determining the start / end positions for purposes of range queries)

Specifically I'm trying to think of general library design choices that can be applied universally, hopefully without too many icky internals, without having to modify every function in htslib and/or command in samtools to have callback functions or some sort of co-routine yield interface to be interruptable.

jmarshall · 2022-01-17T16:31:16Z

We have htell and bgzf_tell, but currently no combined hts_tell.

FYI I have had a draft adding hts_seek()/hts_tell() gestating for a long time. It got bogged down because bringing cram_fd's buffering and concept of offsets into the fold seems fairly intractable.

daviesrob · 2022-01-17T17:12:43Z

I guess you could do this by probing from another thread. It's the thread safe part of it that would be hard...

BickfordA · 2022-01-17T17:20:20Z

It's rather specific to indexing only. What about any other task iterating over a file?

Using the file position/offset works well when reading the file records. Though a combined hts_tell would be convenient. It seems like usually there is a call to htslib per record.

Would the thread safe progress api allow for canceling the index job? Another nice feature of using a callback is you can pause the computation (by not returning immediately) .

jkbonfield · 2022-01-17T18:02:27Z

Maybe we need to know what the actual problem is you're trying to solve. For example why is being able to pause computation via a callback a beneficial feature?

My worry about doing this for indexing, is what's next? General reading through a file, needing a new API with a call back there? Then for VCF too? Then in the synced reader VCF interface? You can see there is inevitable room for feature creep. You personally may only need indexing, but once precedence has been set it's inevitable other people will start to ask for a callback interface in every part of the library.

BickfordA · 2022-01-17T20:33:02Z

Indexing takes a long time by nature of the fact that you have to read the whole file. As a result there are two problems I'm trying to solve:

Since indexing is a single api call, it would be nice to have a way to return progress updates to see how far along the process so end users can see work is being done, and you can estimate how long its going to take etc...
There isn't a way to cancel a call to the index function, you have to wait for the whole file to be read.

I wouldn't worry too much about creep, as neither of these cases are an issue with general reading through the files. When reading the files iterating record by record provides very granular steps through the file. As a result it is easy to calculate progress based on the current offset, and it is very easy to stop reading.

I suppose an alternate option would be to make apis to run the index creation in steps, something like

open_hfile
create_index_file 
while read_file_block {
    write_index_block
}
close_index_file
close_hfile

And i could drive the indexing loop outside of htslib.

jkbonfield · 2022-01-18T09:19:47Z

That's a good point about single long-running function vs many repeated calls to (eg) read, merge, etc. I was thinking more broadly, including those that manipulate samtools. (Although frankly the way tools like pysam pretend samtools is a library rather than a subprocess has never really sat well with me.)

…x2 to preserve API and ABI

BickfordA · 2022-01-26T21:54:03Z

This is a nice idea, but currently it changes the signature of tbx_index(), which would change both the API and ABI. Instead, you should make a tbx_index2() with the extra parameters, and leave tbx_index() as it was.

The file_progress_func typedef would be better named hts_file_progress_func (or maybe hts_progress_callback?) to make name clashes with other progress callbacks less likely.

@daviesrob I made the changes that you suggested. I'm not sure if this was waylaid by the following discussion. If this is still of interest hopefully it is a little closer to where it needs to be.

BickfordA added 3 commits January 11, 2022 17:36

Add progress callbacks to index creation

7c013e6

Update indexing function names

efc2c97

Remove trailing spaces

4222d28

Update progress call back typedef name, add new function for tbx_inde…

19cbba4

…x2 to preserve API and ABI

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Progress and Cancelation for Indexing #1379

Progress and Cancelation for Indexing #1379

BickfordA commented Jan 12, 2022

daviesrob commented Jan 17, 2022

jkbonfield commented Jan 17, 2022 •

edited

Loading

daviesrob commented Jan 17, 2022

jkbonfield commented Jan 17, 2022 •

edited

Loading

jmarshall commented Jan 17, 2022

daviesrob commented Jan 17, 2022

BickfordA commented Jan 17, 2022

jkbonfield commented Jan 17, 2022

BickfordA commented Jan 17, 2022 •

edited

Loading

jkbonfield commented Jan 18, 2022 •

edited

Loading

BickfordA commented Jan 26, 2022

Progress and Cancelation for Indexing #1379

Are you sure you want to change the base?

Progress and Cancelation for Indexing #1379

Conversation

BickfordA commented Jan 12, 2022

daviesrob commented Jan 17, 2022

jkbonfield commented Jan 17, 2022 • edited Loading

daviesrob commented Jan 17, 2022

jkbonfield commented Jan 17, 2022 • edited Loading

jmarshall commented Jan 17, 2022

daviesrob commented Jan 17, 2022

BickfordA commented Jan 17, 2022

jkbonfield commented Jan 17, 2022

BickfordA commented Jan 17, 2022 • edited Loading

jkbonfield commented Jan 18, 2022 • edited Loading

BickfordA commented Jan 26, 2022

jkbonfield commented Jan 17, 2022 •

edited

Loading

jkbonfield commented Jan 17, 2022 •

edited

Loading

BickfordA commented Jan 17, 2022 •

edited

Loading

jkbonfield commented Jan 18, 2022 •

edited

Loading