Page MenuHomePhabricator

Enable PagedTiffHandler on all wikis, to allow display of TIFF files
Closed, ResolvedPublic

Assigned To
None
Authored By
daniel
Apr 20 2010, 11:00 AM
Referenced Files
F6902: som-magick-resize-sharpen.png
Nov 21 2014, 11:09 PM
F6903: som-im_shrink-magick-sharpen.png
Nov 21 2014, 11:09 PM
F6900: som-im_shrink.png
Nov 21 2014, 11:09 PM
F6899: som-im_resize_linear.png
Nov 21 2014, 11:09 PM

Description

Please enable the PagedTiffHandler extension on all wikis, to allow display of TIFF files. Documentation of the extension can be found at http://www.mediawiki.org/wiki/Extension:PagedTiffHandler.

There is broad support for this on commons, at least: http://commons.wikimedia.org/w/index.php?title=Commons:Village_pump&oldid=37854914#Support_for_TIFF_files_-_just_ask_for_it.21.

Wikimedia Deutschland contracted Hallo Welt to develop this extension. I have reviewed the code and found it sane enough. A first look by some of the core devs revealed the need for some cleanup, but no fundamental problems so far. The code is available at http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/PagedTiffHandler/.

Please review the extension, let us know what still needs to be fixed, and then disable it, so we can enjoy tiff files directly. Thanks!


Version: unspecified
Severity: enhancement

Details

Reference
bz23258

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 11:09 PM
bzimport set Reference to bz23258.
bzimport added a subscriber: Unknown Object (MLST).

uh... "enabled", not "disable", of course :P

Review follows. I'm not sure who is still involved in this project so I'm just putting my comments here.

I don't know why this is an extension. Can it be moved to the core, replacing the barely-functional TiffHandler there?

Is there any reason we have to use exiv to read EXIF data? We already have functions for reading EXIF data from TIFF files in the MediaWiki core, in Exif.php.

The "cache" in PagedTiffImage::retrieveMetaData() seems to just be a way to completely break image uploads for a given filename for a day, I can't see any practical purpose for it. But it's a moot point if the whole file is going to be deleted in favour of Exif.php.

I've read a bit about VIPS, I see that this extension uses it. VIPS has the ability to scale arbitrarily sized TIFF files with limited RAM, which would be a useful property for us. But it only works if the whole pipeline is PIO rather than WIO. This means that the input TIFF file must be tiled (RowsPerStrip > 16), and only PIO transformation functions may be used. PagedTiffHandler currently uses im_resize_linear which is WIO. If it used the PIO function im_shrink instead, then the RAM usage would be limited.

im_resize_linear will look nicer when a small TIFF image is resized to a slightly smaller image. That's not the expected use case for PagedTiffHandler on Wikimedia. im_shrink will look OK when the thumbnail is several times smaller than the source image, which is the case we expect to be dealing with.

To limit memory usage in VIPS, please have PagedTiffHandler error out if it tries to scale a source image with area greater than $wgMaxImageArea and RowsPerStrip less than 16. And use im_shrink.

Ideally I would like to see two-stage conversion, so that very large (>100 Mpx) images can be scaled down with im_shrink to somewhat larger than web size, say 4 Mpx, once only per upload. Then the 4 Mpx intermediate image would be used to generate the ~0.1 Mpx thumbnail images requested by editors. The final step could be done with ImageMagick, or with VIPS with a transformation similar to the one ImageMagick uses.

It's a common misconception that downscaling requires interpolation. Really it needs blurring followed by downsampling followed by sharpening, that's what ImageMagick does. It's only upscaling (or downscaling by a very small factor) that needs interpolation. im_shrink uses a very simple kind of blurring, specifically a block average. The key to high quality downscaling is to choose an appropriate window function to do the blur. ImageMagick uses a Lanczos filter by default, see:

http://www.imagemagick.org/script/command-line-options.php#filter

In PagedTiffHandler.php:

$wgTiffErrorCacheTTL = 84600;

Is this meant to be 86400, the number of seconds in a day?

In PagedTiffHandler_body.php:

// @todo check height ## MediaWiki doesn't have a MagicWord-ID for height, appears to be a bug. ^SU

It's not a bug. Width and height are specified using two numbers separated by an "x", e.g. 100x100px. There is a special case for this at Parser.php line 4689. If the img_width magic word parameter matches this format, then a height parameter will automatically be passed through to the handler.

  • TODO: The purpose of this function is not yet fully clear.
	 */

function getScriptParams( $params ) { # # wtf?? ^DK

		// FIXME: This function is unused, seems to be useless,
		// and could be replaced with an array_intersect() call

The function is used to specify which parameters to File::transform() should be passed through to thumb.php, in the case where the configuration specifies thumb.php is to be used (e.g. $wgThumbnailScriptPath !== false). You should pass through the same parameters as in makeParamString(). The code you have here would be fine if the fixme comments were removed, except for a detail about the lossy parameter, see below.

PagedTiffHandler::normaliseParams() seems to be duplicating code from ImageHandler::normaliseParams(). I'm not sure why this is necessary. DjVuHandler gets away with not overriding normaliseParams() at all. If there is a bug in ImageHandler::normaliseParams(), you should fix it, rather than forking it. If all you're doing is validating the lossy parameter, then you can do that with code like:

function normaliseParams( $image, &$params ) {
if ( !parent::normaliseParams( $image, $params ) ) {

		return false;

}
// ... handle lossy parameter...
}

If you normalise the lossy parameter properly at this point, you avoid the need to duplicate the code in makeParamString(), getScriptParams() and getThumbExtension(). Also, you resolve a bug due to makeParamString() not matching getThumbExtension() in the way it handles a missing parameter.

function normaliseParams( $image, &$params ) {
if ( !parent::normaliseParams( $image, $params ) ) {

		return false;

}
if ( isset( $params['lossy'] ) ) {

		if ( in_array( $params['lossy'], array( 1, '1', 'true', 'lossy' ) ) ) {
			$params['lossy'] = 'lossy';
		} else {
			$params['lossy'] = 'lossless';
		}

} else {

		$data = $this->getMetaArray( $image );
		if ( ( strtolower( $data['page_data'][$params['page']]['alpha'] ) == 'true' ) ) {
			$params['lossy'] = 'lossless';
		} else {
			$params['lossy'] = 'lossy';
		}

}
return true;
}

In PagedTiffHandler::doTransform: please remove the error suppression (@) operators. normaliseParams() should return false if the width and height are missing. Error suppression operators have lots of implementation issues in PHP, they are evil and should almost never be used.

PagedTiffHandler::getThumbType() is missing and the parent is not returning the correct type. Please fix this by modifying File::thumbName() to pass handler parameters through to MediaHandler::getThumbType() as a third parameter, so that your lossy/lossless feature can work.

I've done some more testing and reading and I have to make a couple of corrections:

Tiled images are designated by TileWidth and TileLength, not RowsPerStrip. See the TIFF 6 spec, section 15. TileWidth and TileLength are both required to be multiples of 16, so 16 is the lowest sensible value.

However, untiled images appear to work with im_shrink without using additional memory. So it's not necessary to check TileWidth anyway.

For my test file which is 6051x8901 (53.8 Mpx), scaling down by a factor of 10 in both directions, memory usage was as follows:

Method | Tile size | Memory usage (MB)

im_shrink | 128x128 | 35
im_shrink | 64x64 | 37
im_shrink | untiled | 35
im_resize_linear | 128x128 | 173
im_resize_linear | 64x64 | 175
im_resize_linear | untiled | 173
ImageMagick | 128x128 | 432

Memory usage was measured by counting mmap2() and munmap() calls with strace.

(In reply to comment #2)

Review follows. I'm not sure who is still involved in this project so I'm just
putting my comments here.

I don't know why this is an extension. Can it be moved to the core, replacing
the barely-functional TiffHandler there?

The reson is purely organisational. Developing in core would have required Hallo Welt to get access toi trunk right away, or we would have had to use patches, wich is quite annoying.

There's no reson not to rename and/or mere it with core. Well actually, there is one: They do like to be visible on Special:Version. Not sure if adding the extension authors to the core credits would be an option.

Is there any reason we have to use exiv to read EXIF data? We already have
functions for reading EXIF data from TIFF files in the MediaWiki core, in
Exif.php.

Use of exiv2 is optional, the reason to use it is to also fetch IPTC and XMP besides Exif. I would very much like to see this for other image formats too - perhaps there could be an abstraction like Exif.php covering all kinds of image meta-data? Otoh, we'd really want to have this per file type, so we could also cover ID3 etc for Ogg files and other media... Needs some more thought.

At the moment, PagedTiffHandler defaults to using ImageMagick's identify command to get the basic data. Not sure if we can get around that, since we need some info that is not included in Exif, most notable, the number of pages in the tiff. Also, identify provides clues about whether imagemagick is able to process the file at all, and helps to reject broken tiffs.

Anyway, we could ineed use Exif.php to fetch exif data per default, in additon to what identify returns.

The "cache" in PagedTiffImage::retrieveMetaData() seems to just be a way to
completely break image uploads for a given filename for a day, I can't see any
practical purpose for it. But it's a moot point if the whole file is going to
be deleted in favour of Exif.php.

The cache thing is an akward hack to provide for broken legacy files. We already have several thousand tiff files uploaded, some of which we'll not be able to render. Caching the fact that extraction failed for a specific file is supposed to prevent mediawiki trying to extract that info over and over again, whenever the file or a thumbnail is requested.

However, if the failure occurs while uploading, it should of course not be cached. If it is, that's a bug, and a nasty one at that, since it would confuse the user who might have removed the cause for the failure, but still gets the error.

I've read a bit about VIPS, I see that this extension uses it.

Optionally, the default is to use ImageMagick. This feature has only seen light testing. Could be intersting, though.

VIPS has the
ability to scale arbitrarily sized TIFF files with limited RAM, which would be
a useful property for us. But it only works if the whole pipeline is PIO rather
than WIO. This means that the input TIFF file must be tiled (RowsPerStrip >
16), and only PIO transformation functions may be used. PagedTiffHandler
currently uses im_resize_linear which is WIO. If it used the PIO function
im_shrink instead, then the RAM usage would be limited.

im_resize_linear will look nicer when a small TIFF image is resized to a
slightly smaller image. That's not the expected use case for PagedTiffHandler
on Wikimedia. im_shrink will look OK when the thumbnail is several times
smaller than the source image, which is the case we expect to be dealing with.

To limit memory usage in VIPS, please have PagedTiffHandler error out if it
tries to scale a source image with area greater than $wgMaxImageArea and
RowsPerStrip less than 16. And use im_shrink.

Considering the comment you added later, it seems that using img_shrink is always ok? did I understand that correctly?

Ideally I would like to see two-stage conversion, so that very large (>100 Mpx)
images can be scaled down with im_shrink to somewhat larger than web size, say
4 Mpx, once only per upload. Then the 4 Mpx intermediate image would be used to
generate the ~0.1 Mpx thumbnail images requested by editors. The final step
could be done with ImageMagick, or with VIPS with a transformation similar to
the one ImageMagick uses.

Yes, we have discussed this option (both with Markus and, I think, with you). This would be very good to have, but it should be implemented in a generic way, not at a per-file-type basuis, I think. PNGs would also benefit from this.

It's a common misconception that downscaling requires interpolation. Really it
needs blurring followed by downsampling followed by sharpening, that's what
ImageMagick does. It's only upscaling (or downscaling by a very small factor)
that needs interpolation. im_shrink uses a very simple kind of blurring,
specifically a block average. The key to high quality downscaling is to choose
an appropriate window function to do the blur. ImageMagick uses a Lanczos
filter by default, see:

http://www.imagemagick.org/script/command-line-options.php#filter

Not sure what you are getting at here - does this mean ImageMagick's scaling looks better?

In PagedTiffHandler.php:

$wgTiffErrorCacheTTL = 84600;

Is this meant to be 86400, the number of seconds in a day?

Probably, and should probably be written as 60*60*24 to make it clear :)

In PagedTiffHandler_body.php:

// @todo check height ## MediaWiki doesn't have a MagicWord-ID for height,
appears to be a bug. ^SU

It's not a bug. Width and height are specified using two numbers separated by
an "x", e.g. 100x100px. There is a special case for this at Parser.php line

  1. If the img_width magic word parameter matches this format, then a height

parameter will automatically be passed through to the handler.

wow... uh... why? i mean, just adding "height" would have been cleaner and simpler, no?

  • TODO: The purpose of this function is not yet fully clear. */ function getScriptParams( $params ) { # # wtf?? ^DK FIXME: This function is unused, seems to be useless, and could be replaced with an array_intersect() call

The function is used to specify which parameters to File::transform() should be
passed through to thumb.php, in the case where the configuration specifies
thumb.php is to be used (e.g. $wgThumbnailScriptPath !== false). You should
pass through the same parameters as in makeParamString(). The code you have
here would be fine if the fixme comments were removed, except for a detail
about the lossy parameter, see below.

PagedTiffHandler::normaliseParams() seems to be duplicating code from
ImageHandler::normaliseParams(). I'm not sure why this is necessary.
DjVuHandler gets away with not overriding normaliseParams() at all. If there is
a bug in ImageHandler::normaliseParams(), you should fix it, rather than
forking it. If all you're doing is validating the lossy parameter, then you can
do that with code like:

[...]

If you normalise the lossy parameter properly at this point, you avoid the need
to duplicate the code in makeParamString(), getScriptParams() and
getThumbExtension(). Also, you resolve a bug due to makeParamString() not
matching getThumbExtension() in the way it handles a missing parameter.

[...]

Thank you very much flor clarifying this. It's quite confusing how these methods interact. What would be a good place to document this?

In PagedTiffHandler::doTransform: please remove the error suppression (@)
operators. normaliseParams() should return false if the width and height are
missing. Error suppression operators have lots of implementation issues in PHP,
they are evil and should almost never be used.

OK.

PagedTiffHandler::getThumbType() is missing and the parent is not returning the
correct type. Please fix this by modifying File::thumbName() to pass handler
parameters through to MediaHandler::getThumbType() as a third parameter, so
that your lossy/lossless feature can work.

You mean, this needs to be fixed in core? I'll have a look.

Thanks agin for your detailed review!

Fixed parameter passing from File::thumbName to MediaHandler::getThumbType in r68409.

Fixed parameter validation/handling by the handler class in r68418.

Pending: exiv2 vs Exif.php, the VIPS stuff, intermedia resize (probably OT), and core integration. Did I miss anything?

fixed $wgTiffErrorCacheTTL = 84600; in r68419.

(In reply to comment #4)

There's no reson not to rename and/or mere it with core. Well actually, there
is one: They do like to be visible on Special:Version. Not sure if adding the
extension authors to the core credits would be an option.

We can add them to the core credits.

The cache thing is an akward hack to provide for broken legacy files. We
already have several thousand tiff files uploaded, some of which we'll not be
able to render. Caching the fact that extraction failed for a specific file is
supposed to prevent mediawiki trying to extract that info over and over again,
whenever the file or a thumbnail is requested.

The correct way to do this is to save an error value into the metadata blob. See how OggHandler does it.

Considering the comment you added later, it seems that using img_shrink is
always ok? did I understand that correctly?

Yes, and further testing showed that it produces far better visual results than img_resize_linear when downscaling by a significant factor. I'll add attachments to show what I mean.

Not sure what you are getting at here - does this mean ImageMagick's scaling
looks better?

Yes, ImageMagick has done a much better job of this in terms of visual results. It just uses far more memory. I'm pretty sure now that it's best to use VIPS im_shrink to scale down to an intermediate size, and ImageMagick to go down to web size.

wow... uh... why? i mean, just adding "height" would have been cleaner and
simpler, no?

I'm not sure how that would be possible. Magic words define syntax, not handler parameters. We don't really want to have [[Image:Foo.png|width=100px|height=100px]]. Maybe it's just the name (img_width) that is offending you? If it had been called img_width_height would it have been better?

Thank you very much for clarifying this. It's quite confusing how these
methods interact. What would be a good place to document this?

MediaHandler (in Generic.php).

Created attachment 7492
im_resize_linear

Downscale from 6051px width to 200px with im_resize_linear. Yes it really looks this bad. With no filtering, it's basically the same as nearest-neighbour downsampling when the shrink factor is significant.

Attached:

som-im_resize_linear.png (294×200 px, 129 KB)

Created attachment 7493
im_shrink

Downscale with im_shrink. It's a vast improvement over im_resize_linear, but it's quite blurry compared to the ImageMagick version, mostly due to the lack of sharpening.

Attached:

som-im_shrink.png (294×200 px, 93 KB)

Created attachment 7494
ImageMagick resize sharpen

ImageMagick resize and sharpen, with the same sharpen parameter as MediaWiki's default. Looks very nice.

Attached:

som-magick-resize-sharpen.png (294×200 px, 114 KB)

Created attachment 7495
im_shrink + magick sharpen

Attachment 7493 (im_shrink) sharpened with ImageMagick. Results are similar to Attachment 7494.

Attached:

som-im_shrink-magick-sharpen.png (294×200 px, 110 KB)

Thanks once more for the detailed anaysis, tim. I think we should view the intermediate size stuff as a separate issue though and open a fresh bug for it. After all, we'd want a solution that also works for PNGs, right? Actually, thought there already was a feature request for this, but now I can't find it...

In any case, I'd like to get the TIFF support up to par with DjVU support and get it live, before looking into the intermediate rescaling stuff. Heuristics for deciding when to use VIPS, and support for intermediate scaling, can be added later. Do you agree?

replaced error caching by storing errors in metadata in r68661, as tim suggested in comment #7. also fixed several more minor issues with meta data handling.

Made PagedTiffHanlder use Exif.php per default in r68664, as tim
suggested in comment #7. also mopped up some more inconsistencies in metadata handling.

Tim, please have another look. I think all of the major issues have been resolved now. AS I said in #13, I think the automatic/smart use of VIPS along with the intermediate scaling issue is for later. Please let me know if there's still anything that keeps this from going live.

(In reply to comment #13)

In any case, I'd like to get the TIFF support up to par with DjVU support and
get it live, before looking into the intermediate rescaling stuff. Heuristics
for deciding when to use VIPS, and support for intermediate scaling, can be
added later. Do you agree?

Fine, but at least do s/im_resize_linear/im_shrink/ and have it respect $wgMaxImageArea in ImageMagick mode. The people who like the TIFF format are the same ones who like to upload 100 Mpx images, and there's no way we can support ImageMagick scaling for such images. The reason DjVu can skip the $wgMaxImageArea check is because it doesn't use ImageMagick, it uses ddjvu.

I've added CR comments for minor issues in the code you committed, but there's nothing there which will slow down deployment significantly. We should be able to deploy it in a matter of days.

It seems that running ImageMagick identify on a large image uses large amounts of memory. I scaled that test image down to 100MB and tried to upload it to a test wiki. Identify failed, it tried to use 264 MB of memory, so it hit the ulimit. This could be a problem in production.

I'm going to deploy it for now and put the identify issue on another bug report.

Identify is not installed on the app servers. It's a custom package, so installing it and maintaining it would be a PITA, and it's probably a bad idea anyway since it would expose the entire app cluster to DoS due to identify memory and speed issues. I'm installing libtiff-tools instead. We can deploy PagedTiffHandler when it supports tiffinfo (bug 24821).

I added support for libtiff/tiffinfo in r71204, please have a look

Tim activated the tiff handler a few days ago. Remaining issues are being tracked in Bug 24984.

woops - I just noticed that it's only enabled on commons. that's not good. please enable it on all wikis, so we can actually *use* tiff files.

(In reply to comment #23)

woops - I just noticed that it's only enabled on commons. that's not good.
please enable it on all wikis, so we can actually *use* tiff files.

is already done - http://wikitech.wikimedia.org/index.php?title=Server_admin_log&diff=28717&oldid=28716