Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add method sam_open_write #1055

Open
wants to merge 2 commits into
base: develop
Choose a base branch
from
Open

Conversation

valeriuo
Copy link
Contributor

@valeriuo valeriuo commented Apr 6, 2020

  1. The new method sam_open_write can open a htsFile file in writing mode only, attach a header struct (sam_hdr_t) to it and write the header data to the disk, before any alignment data.

  2. New getter method hts_get_fn is used to fetch the internal file name of htsFile.

  3. The index file name inside htsFile is now managed by HTSlib, and no longer points to an application variable.

@daviesrob
Copy link
Member

If we change fnidx to be managed by HTSlib (which I think we can get away with, as I haven't found anything that tries to fiddle with it), we should also change the documentation for sam_idx_init() as it's no longer necessary for the pointer to remain valid.

@daviesrob
Copy link
Member

Looks good. Maybe squash the 1st and 3rd commits together?

Decouple htsFile::fnidx from its source and let HTSlib manage its memory.
Add method sam_open_write for opening a htsFile file for writing only and
setting a header to it.
Add getter method hts_get_fn.
@valeriuo
Copy link
Contributor Author

Looks good. Maybe squash the 1st and 3rd commits together?

I squashed all three of them, as the second didn't make much sense by itself.

@daviesrob
Copy link
Member

@jmarshall so... were there any comments....?

@jmarshall
Copy link
Member

To continue the conversation from the previous PR #1052,

It could be argued that the samtools API was better. I've never entirely seen the point of having to pass in the header to sam_read1() and sam_write1()

When there are lots of different sets of headers around, e.g., when implementing samtools merge, I think the overall code is clearer when it is explicit which headers go with which bam1_t records and which files. (e.g., user code will use hdr->target_name[mybam->core.tid] to get their bam1_t's RNAME string and needs to be clear which hdr to use with that mybam — which is clearer if the right hdr also appears in nearby sam_read1/sam_write1 calls.)

Having said that, I can see the point of attaching the headers to the htsFile and having the option of not passing them explicitly to sam_read1() and sam_write1(). So FWIW I would support adding this sam_write1(fp, NULL, b) mode of operation so that user code can choose which style to use.


There are two ways of opening a file for use with sam_read1()/sam_write1(): hts_open()/hts_open_format() and hts_hopen(). In the latter case, this might be via hopen("filename") … hts_hopen() or hdopen(arbitrary_fd) … hts_hopen().

The proposed sam_open_write() limits you to directly providing the filename à la hts_open(). This means it's badly factored w.r.t. the rest of the API, as you can't use it with an hFILE* that you already have. The new function also in the end doesn't buy you much: it just bundles the sam_hdr_write and attaching of the headers into the single opening function.

I would suggest instead of adding this sam_open_write(), instead make sam_hdr_write() itself attach the headers to the file pointer. Then new-style user code could be:

samFile *fout = hts_open_format(filename, "w", &myfmt); // or hts_hopen etc
if (fout == NULL) { "error can't open: permission problem etc" }
if (sam_hdr_write(fout, headers) < 0) { "error headers borked or I/O error" }
…
…sam_write1(fout, NULL, b)…

which keeps the handling for the distinct failure modes separate and is still only two function calls (compared to the proposed sam_open_write's one function call).

Due to the header reference counting (which needs to be verified as still correct with either form of more widespread header attaching), having sam_hdr_write() implicitly attach the headers to the file pointer should have no adverse effect on old-style code that specifies the headers explicitly in sam_write1().

(If OTOH you do go with sam_open_write(), I have additional comments about error handling. hts_open_format reports operating system failures via errno, so the other error paths need to set errno too.)

@return Read-only pointer to the file's fn field.
*/
HTSLIB_EXPORT
const char *hts_get_fn(htsFile *fp);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there other parts of the API that have …fn in API function names? I'd rather spell it out as …filename (and it needs to be clear what if any relationship this has to hts_set_fai_filename).

sam.c Outdated
Comment on lines 3223 to 3224
const sam_hdr_t *h = hdr;
if (!h) h = fp->bam_header;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clearer as

const sam_hdr_t *h = hdr? hdr : fp->bam_header;
if (h == NULL) { errno = EINVAL; return -3; }

as in particular passing NULL when no headers are attached should be an error rather than a crash 😄

vcf.c Outdated
@@ -3295,7 +3295,8 @@ static int vcf_idx_init(htsFile *fp, bcf_hdr_t *h, int min_shift, const char *fn
fp->idx = NULL;
return -1;
}
fp->fnidx = fnidx;
fp->fnidx = strdup(fnidx);
if (!fp->fnidx) return -1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should tear down fp->idx like the other error path does. Ditto in bcf_idx_init below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The top caller (test_view::main), does the clean-up of the htsFILE descriptor and the memory associated with its members, so calling hts_idx_destroy here was redundant. Also, there is no other place where fp->idx is subsequently used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No.

If fp->idx is not NULL, then the index gets built up on the fly in bcf_write/vcf_write. When bcf_idx_init() fails for any reason, it needs to tear down the fp->idx that it has attempted to construct.

Copy link
Member

@daviesrob daviesrob Apr 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well ideally the caller would inspect the return value of bcf_idx_init() and not carry on after it failed. But as we know some callers aren't very good at doing that, it would be best to tidy everything up here. It could be made slightly easier by moving the strdup() to before hts_idx_init(). Both failure paths after that would need to free fp->fnidx and set it to NULL, but otherwise remain as they were originally.

@daviesrob
Copy link
Member

On the whole I think I prefer the simplicity of only having to call one function. It is true that it lacks features compared to hts_open etc., but then the hts_open interfaces also aren't completely consistent - for example, there's no hts_dopen which can lead to quite a dance when you want to write a bam file into one. You also can't pass an htsFormat struct to hts_hopen and you can't pass varargs into hts_open - thinking of which, maybe we should allow varargs in sam_open_write() which could be passed on when creating the hFILE (would require a vhopen())?

We could always add sam_hopen_write() and sam_dopen_write() later to complete the set, and simplify the sam file into file descriptor problem.

vcf.c Outdated
Comment on lines 3294 to 3295
hts_idx_destroy(fp->idx);
fp->idx = NULL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the opposite of what I suggested. What is the reason for not destroying the incompletely-setup index?

if (!h) h = fp->bam_header;
if (!fp || !b) { errno = EINVAL; return -1; }
const sam_hdr_t *h = hdr ? hdr : fp->bam_header;
if (!h) { hts_log_error("No header available for file \"%s\"", fp->fn); return -1; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

h == NULL here is an error in the program, not a user error. So there's not much point reporting it to the user (they can't do anything about it!), but it does need to set errno (as other error paths do).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants