[css-syntax] Urange and its problems #3588

tabatkins · 2019-02-01T22:17:54Z

(migrated from the mailing list, for easier tracking here)

Tab Atkins said:

History: CSS2.1 defined a special grammar token just for unicode
ranges, which was used in exactly one place: the 'unicode-range'
descriptor of @font-face. This special production caused bugs in
pages, where selectors like u+a { ... } were parsed as a
UNICODE-RANGE token, rather than the expected "IDENT(u) DELIM(+)
IDENT(a)", like every other selector of that form was parsed. (This
isn't theoretical - Moz had a bug reported against it for this.)
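To make the failure concrete, here is a toy sketch of the two tokenization behaviors. It is not a real CSS tokenizer: the `tokenize` function and its `unicodeRangeToken` flag are made up for illustration, and it only knows about idents, a CSS 2.1-style unicode-range pattern, and single-character delimiters.

```javascript
// Toy tokenizer, for illustration only. It recognizes simple idents, an
// optional CSS 2.1-style UNICODE-RANGE pattern, and treats everything else
// as a one-character DELIM.
function tokenize(input, { unicodeRangeToken }) {
  const tokens = [];
  let i = 0;
  while (i < input.length) {
    const rest = input.slice(i);
    let m;
    if (
      unicodeRangeToken &&
      (m = /^[uU]\+[0-9a-fA-F?]{1,6}(-[0-9a-fA-F]{1,6})?/.exec(rest))
    ) {
      tokens.push(`UNICODE-RANGE(${m[0]})`);
    } else if ((m = /^[a-zA-Z_][a-zA-Z0-9_-]*/.exec(rest))) {
      tokens.push(`IDENT(${m[0]})`);
    } else {
      m = [rest[0]];
      tokens.push(`DELIM(${rest[0]})`);
    }
    i += m[0].length;
  }
  return tokens;
}

// With the special token, "u+a" is swallowed whole because "a" is a hex digit:
console.log(tokenize("u+a", { unicodeRangeToken: true }));
// → [ 'UNICODE-RANGE(u+a)' ]

// Without it, the selector tokenizes like any other "x+y" selector:
console.log(tokenize("u+a", { unicodeRangeToken: false }));
// → [ 'IDENT(u)', 'DELIM(+)', 'IDENT(a)' ]
```

Note that a selector like u+p is unaffected either way, since "p" is not a hex digit; only element names that happen to look like hex are hit.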

When writing the Syntax spec, I tried to fix this by dropping the
unicode-range concept from the tokenizer, and instead handling it as a
complex construct of the existing tokens, like I did with <an+b>.
This kinda worked initially, but was really nasty. Since then, we
added scinot to numbers (like 1e3 for 1000), and this completely
destroyed my ability to define <urange> cleanly - I can no longer use
the value of numeric tokens, and instead have to rely on the
"representation", which no browser stores or wants to store.
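A small sketch of why the numeric value is not enough once scinot exists. The four spellings below are arbitrary examples; plain `Number` and `parseInt` stand in for "value of a numeric token" and "hex interpretation of its text", which is an approximation of what a real tokenizer does.

```javascript
// Four spellings that a number tokenizer collapses to the same value...
const spellings = ["200", "200e0", "20e1", "2e2"];

const asNumbers = spellings.map((s) => Number(s));
console.log(asNumbers); // → [ 200, 200, 200, 200 ]

// ...but which name four different code points when read as hex digits,
// as they would be after "U+".
const asCodePoints = spellings.map((s) => parseInt(s, 16));
console.log(asCodePoints); // → [ 512, 131296, 8417, 738 ]
```

Once only the value 200 is kept, there is no way to tell which of the four code points the author wrote.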

I want to go ahead and resolve this. I can see three options:

1. Keep what I'm currently doing. This requires browsers to hold onto
the string representation of numeric tokens (numbers and dimensions)
at least through initial parsing (longer if they're used in a custom
property).

2. Abandon this effort, go back to having a special unicode-range
token. Accept that this is weird and there are stupid side-effects,
like some selectors not working.

3. Define a new syntax that's actually simple to obtain from
the existing tokens¹. Deprecate the old syntax; require UAs to accept
the old syntax in the 'unicode-range' descriptor, but don't define how
they should do so. (Current UAs use context-sensitive retokenizing, I
think - once they realize they're in a unicode-range descriptor,
they'll retokenize the original text according to a special set of
rules.)

Thoughts?

¹ Simplest change is just to replace the + with a -, so you write
U-2016 for ‖. This makes unicode ranges always a single IDENT token,
plus possibly some trailing '?' DELIM tokens. You then have to parse
the token's value to make sure it's a valid range, but that's way, way
easier than the garbage fire I have to deal with from today's syntax.
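A rough sketch of how small that parse could be. This syntax was only floated here and never adopted, so everything below is hypothetical: the function name `parseDashURange`, the way ranges are written (U-0025-00FF), and the 6-digit and U+10FFFF limits carried over from the existing syntax are all assumptions. It also does not handle an all-wildcard form.

```javascript
// Hypothetical parser for the footnote's "U-" syntax (never adopted).
// `ident` is the value of the single IDENT token; `wildcards` is the number
// of trailing '?' DELIM tokens that followed it.
function parseDashURange(ident, wildcards = 0) {
  const m = /^u-([0-9a-f]{1,6})(?:-([0-9a-f]{1,6}))?$/i.exec(ident);
  if (!m) return null;
  const [, first, second] = m;

  if (wildcards > 0) {
    // Wildcards cannot be combined with an explicit end point, and the
    // digits plus wildcards may not exceed six characters.
    if (second !== undefined || first.length + wildcards > 6) return null;
    const start = parseInt(first + "0".repeat(wildcards), 16);
    const end = parseInt(first + "F".repeat(wildcards), 16);
    return end <= 0x10ffff ? { start, end } : null;
  }

  const start = parseInt(first, 16);
  const end = second === undefined ? start : parseInt(second, 16);
  return start <= end && end <= 0x10ffff ? { start, end } : null;
}

console.log(parseDashURange("U-2016"));  // → { start: 8214, end: 8214 }
console.log(parseDashURange("U-4", 2));  // → { start: 1024, end: 1279 }
console.log(parseDashURange("U-xyz"));   // → null
```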

fantasai said:

Given unicode-range is already shipping
http://caniuse.com/#feat=font-unicode-range
I think #3 is a non-starter.

I would imagine that reparsing unicode-range tokens in order to make
the selectors work would be easier than doing #1, no? Hanging onto
unicode-range tokens would be a lot less memory than hanging onto
numbers and dimensions, given they're used so rarely.

Tab Atkins said:

On Tue, Apr 12, 2016 at 2:27 PM, fantasai wrote:

> Given unicode-range is already shipping
> http://caniuse.com/#feat=font-unicode-range
> I think #3 is a non-starter.

You might have misread - #3 is explicitly backwards-compatible. It
requires UAs to support the old syntax, it just doesn't describe how
they would do so.

> I would imagine that reparsing unicode-range tokens in order to make
> the selectors work would be easier than doing #1, no? Hanging onto
> unicode-range tokens would be a lot less memory than hanging onto
> numbers and dimensions, given they're used so rarely.

Yeah, it just means we have to reparse them everywhere except unicode-range.

Florian Rivoal said:

On Apr 13, 2016, at 07:09, Tab Atkins Jr. wrote:

> On Tue, Apr 12, 2016 at 2:27 PM, fantasai wrote:
>
>> Given unicode-range is already shipping
>> http://caniuse.com/#feat=font-unicode-range
>> I think #3 is a non-starter.
>
> You might have misread - #3 is explicitly backwards-compatible. It
> requires UAs to support the old syntax, it just doesn't describe how
> they would do so.

As a UA implementor who has this on the roadmap, I don't like having a spec telling us to do something, without telling us how. All UAs would probably do fine at supporting the old syntax when it is correctly used, but I am much less confident that we'd all pick the same logic for error handling, and it is important that we all react the same way in the face of unknown/incorrect syntax.

>> I would imagine that reparsing unicode-range tokens in order to make
>> the selectors work would be easier than doing #1, no? Hanging onto
>> unicode-range tokens would be a lot less memory than hanging onto
>> numbers and dimensions, given they're used so rarely.
>
> Yeah, it just means we have to reparse them everywhere except unicode-range.

Right, this feels ugly and error prone.

Florian Rivoal said:

On Apr 13, 2016, at 05:37, Tab Atkins Jr. wrote:

> 1. Keep what I'm currently doing. This requires browsers to hold onto
> the string representation of numeric tokens (numbers and dimensions)
> at least through initial parsing (longer if they're used in a custom
> property).

Does it really require that? Wouldn't it be good enough to hold onto the string representation of numeric tokens only when scinot is used? Given that scinot is pretty rare (and will stay that way), the memory requirement should be lower than storing the string representation of all numeric tokens.

Simon Sapin said:

How about this?

Same as 2, but tweak the Selector grammar to interpret unicode-range
tokens that don’t have question marks as: a type selector "u", followed
by a next-sibling combinator, followed by another type selector.

It’s weird, but it seems less messy to me than the alternatives.

Tab Atkins said:

Yeah. It really fucks up the grammar something fierce, so I think
I'd have to do it as a preprocessing step before matching the actual
Selectors grammar. And anything else that ever wants to use a + is
similarly affected; we seem to have settled on requiring spaces around
math + and I don't expect us to use + for anything else, but custom
properties would be stuck with this gotcha. :/


tabatkins · 2019-02-01T23:01:17Z

The current state of the spec is that I require UAs to retain the representation of the tokens, and then just concatenate the representations and reparse using a bespoke algorithm to produce a unicode-range.
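As a rough illustration of that "concatenate, then reparse" step, here is a simplified sketch. It is not the spec's algorithm verbatim: `parseURange` is a made-up name, the regex-based approach is a shortcut, and the token split shown for u+0-7F is an assumption about how that text tokenizes.

```javascript
// Simplified sketch: parse the concatenated text of a <urange>.
// Returns { start, end }, or null when the text is not a valid range.
function parseURange(text) {
  // Either "hex[-hex]" or "hex-prefix followed by '?' wildcards".
  const m =
    /^u\+(?:([0-9a-f]{1,6})(?:-([0-9a-f]{1,6}))?|([0-9a-f]{0,5})(\?{1,6}))$/i.exec(
      text
    );
  if (!m) return null;
  const [, first, last, prefix, wildcards] = m;

  let start, end;
  if (wildcards !== undefined) {
    // Digits plus wildcards may not exceed six characters in total.
    if (prefix.length + wildcards.length > 6) return null;
    start = parseInt(prefix + "0".repeat(wildcards.length), 16);
    end = parseInt(prefix + "F".repeat(wildcards.length), 16);
  } else {
    start = parseInt(first, 16);
    end = last === undefined ? start : parseInt(last, 16);
  }
  // Reject ranges past the last code point, or reversed ranges.
  if (end > 0x10ffff || start > end) return null;
  return { start, end };
}

// Assumed token split for "u+0-7F": ident "u", number "+0", dimension "-7F".
const representations = ["u", "+0", "-7F"];
console.log(parseURange(representations.join(""))); // → { start: 0, end: 127 }
console.log(parseURange("u+4??"));                  // → { start: 1024, end: 1279 }
console.log(parseURange("U+0000000"));              // → null (seven digits)
```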

I don't think anyone actually does this, but at least I have tests that will establish this now... web-platform-tests/wpt#15195

tabatkins · 2019-02-01T23:11:53Z

And merged web-platform-tests/wpt@62bfaeb

tabatkins · 2019-02-01T23:13:42Z

@SimonSapin @emilio Firefox fails almost all of these tests, because in every single case (including the invalid ones!) it just returns exactly what the author wrote into unicode-range. That seems like broken behavior in the first place, and it also makes this much harder to test. Any insight into why this is?

tabatkins · 2019-02-02T00:19:04Z

An earlier thread:

Tab Atkins said:

In the telcon today, dbaron expressed concern that the definition of
<urange> requires looking at the "representation" of <number-token>s
and <dimension-token>s. (The "representation" of a numeric token is
the actual text used to write the number, including leading 0s,
leading + sign, original base and exponent when using scientific
notation, etc.)

I pointed out that storing the representation of numeric tokens is
already required, in order to implement the <quirky-color> production
from the Quirks Mode spec
https://quirks.spec.whatwg.org/#the-hashless-hex-color-quirk. IE's
behavior distinguishes between "color: 123;" and "color: 000123;", but
FF/WK/Blink don't; both are treated as #123, so we can maybe change
the Quirks Mode spec to not require the representation.

So, that leaves us with three possible resolutions to the <urange> thing.

1. Leave it as it is. This requires storing the representation on
every numeric token, which is a memory cost, but it lets us parse
<urange> precisely. (The cost might not be as bad as all that. If
you only store the representation when it's "non-obvious" (leading +
sign, leading 0, scinot) then the memory cost is most of the time
just a single null pointer per numeric token. You can regenerate the
representation on the fly from "obvious" forms, so a helper function
can be used to make representation-retrieval easy when it's
necessary.)

2. Drop the representation requirement, and rejigger the <urange>
definition to account for that. This has a few side effects:

   * We can no longer limit the urange syntax to at most 6 hex
   digits per component; arbitrary numbers of leading 0s will be allowed
   and are impossible to detect. This just means that U+0000000 becomes
   valid, for example.

   * Four of the six grammar clauses "eat" the plus sign in the
   following numeric token, and it's not detectable from the value that a
   plus sign was ever used. The fact that whitespace is disallowed makes
   this not a huge deal; in order to still hit the right token patterns,
   you need to do some stupid comment tricks. "U/**/0001" will
   technically become valid, and equivalent to "U+0001".

   * Scinot is still a problem. "200", "200e0", "20e1", and "2e2"
   all produce the same value when parsed as a <number-token>, but
   obviously refer to four different codepoints when interpreted as hex
   values. Numeric tokens would have to record if they were in scinot
   form, and what the exponent was.

3. Revert this whole thing, and restore <unicode-range-token>. This
requires us to fix the original problem some other way. As a
refresher, the original issue was that "u+a { ... }" is a syntax
error, as the selector is a <unicode-range-token>, not <ident-token>,
+, <ident-token> like the author meant. Handling this in Selectors
requires us to essentially "retokenize" selectors, to turn some
<unicode-range-token>s into the expected token patterns; this would
have to be repeated for any other syntax that ends up with allowing
something looking like a unicode-range. It also means that non-CSS
implementations of Selectors have to do some silly back-and-forth
where they tokenize some strings into (meaningless) unicode-range
tokens and then immediately re-tokenize them back into useful stuff.

I prefer solution #1 - doing it well increases the memory footprint of
a numeric token by the size of a pointer (generally doubling the size
of a <number-token>, but increasing the size of a <dimension-token> by
somewhat less), and allows us to handle <urange> exactly, without a
bunch of crazy hacks.
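A minimal sketch of the "store the representation only when it's non-obvious" idea. The token shape (`value`, `repr`) and both function names are assumptions for illustration, and the JS number-to-string conversion stands in for whatever canonical form an engine would regenerate.

```javascript
// Build a number token, keeping the source text only when it cannot be
// regenerated from the value (leading '+', leading zeros, scinot, ...).
function makeNumberToken(sourceText) {
  const value = Number(sourceText);
  const obvious = String(value) === sourceText;
  return { value, repr: obvious ? null : sourceText };
}

// Helper that hides the optimization: callers always get the original text.
function representationOf(token) {
  return token.repr !== null ? token.repr : String(token.value);
}

console.log(makeNumberToken("200"));  // → { value: 200, repr: null }
console.log(makeNumberToken("2e2"));  // → { value: 200, repr: '2e2' }
console.log(representationOf(makeNumberToken("0001"))); // → 0001
```

In the common case the extra field is just a null, and only the rare odd spelling pays for a stored string.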

#2 isn't so great. It means we're expanding the syntax of <urange>,
something dbaron didn't want to do in the first place, and it
increases the cost of numeric tokens anyway, as you have to remember
scinot exponents. I don't think this wins us much.

#3 means that the unicode-range syntax infects Selectors, and any
future syntax we create that might have a + sign in it. (An+B avoids
it, since the only letter allowed is "n", and calc() avoids it by
requiring whitespace around the +, but we almost resolved to remove
the whitespace requirement, which would have put this back into the
realm of possibility once we allowed keywords in calc().)

Zack Weinberg said:

Option 3a: Restore <unicode-range-token> but declare that it is only
considered as a tokenization within @font-face { ... }, or even only
within the unicode-range: descriptor within @font-face.

I can't say that I like this, but that's because I am
philosophically not a fan of special tokenizer productions that only
apply in specific grammar contexts -- can anyone think of a
practical problem? It's not any worse than unquoted url() in terms
of code, it can't change the boundaries of a top-level construct, and
the only other issue that comes to mind is that it'll make it harder
to use <unicode-range-token> somewhere else in the future. But I
don't know that there are other uses, so.

Tab Atkins said:

That requires a vastly more complicated change, switching the Syntax
module from being separate tokenizer/parser steps to being integrated,
with a lot more state being thrown around. And it doesn't help us if
we ever want to use <urange> in another property or context, which I
think is plausible.

L. David Baron said:

> 1. Leave it as it is. This requires storing the representation on
> every numeric token, which is a memory cost, but it lets us parse
> <urange> precisely. (The cost might not be as bad as all that. If
> you only store the representation when it's "non-obvious" (leading +
> sign, leading 0, scinot) then the memory cost is most of the time
> just a single null pointer per numeric token. You can regenerate the
> representation on the fly from "obvious" forms, so a helper function
> can be used to make representation-retrieval easy when it's
> necessary.)

I'm ok with this, and I think I prefer it at this point.

tabatkins · 2019-02-02T00:21:09Z

This contextualizes the current state of the spec. I tried hard to remove the representation requirement anyway, but dbaron at least was okay with it originally.

emilio · 2019-02-02T19:33:16Z

I don't have that much context on unicode-range. @SimonSapin is a better person to ask, though I could investigate if he doesn't have the time :)

SimonSapin · 2019-02-02T23:00:35Z

> Firefox fails almost all of these tests, because in every single case (including the invalid ones!) it just returns exactly what the author wrote into unicode-range.

I can reproduce this in your test case, but I don’t understand it when looking at the code. FontFace::GetUnicodeRange ends up calling UnicodeRange::to_css, which serializes two 32-bit integers.

When profiling, though, I see calls to CSSFontFaceRule::Style but not FontFace::GetUnicodeRange. (But this is a sampling profiler and I haven’t tried an actual debugger.) The former method returns an nsICSSDeclaration* pointer that looks like it might not know about @font-face but rather pretends it is a style rule. @emilio does this ring a bell?

Now the parsing does take a shortcut by taking a single slice of the original input rather than concatenating the representation of each token after removing comment tokens. This would make an observable difference for cases like U+/**/1, but doesn’t explain the results we get.

SimonSapin · 2019-02-02T23:05:02Z

@tabatkins To sum up, I think Firefox has a serialization bug around the getters of the fields of CSSFontFaceRule.style (possibly related to the WebIDL type of that being CSSStyleDeclaration, same as for style rules), but unicode-range parsing and its effect on fonts are not affected by that bug.

emilio · 2019-02-03T03:21:45Z

You just made me curious and I took a look. You're just getting and setting JS properties.

> 'unicodeRange' in rule.style
< false

You can use getPropertyValue to get the value of the descriptor, but setProperty throws an ugly NOT_IMPLEMENTED error. That's clearly a bug we should fix.

That being said, I don't know what should make CSSStyleDeclaration.prototype.unicodeRange work, spec-wise.

https://drafts.csswg.org/css-fonts-4/#dom-cssfontfacerule-style doesn't say anything.

https://drafts.csswg.org/cssom-1/#cssstyledeclaration says:

> For each CSS property property that is a supported CSS property, the following partial interface applies where camel-cased attribute is obtained by running the CSS property to IDL attribute algorithm for property.

https://drafts.csswg.org/cssom-1/#supported-css-property doesn't say anything about descriptors. And indeed Blink at least returns true for 'unicodeRange' in document.documentElement.style, which is really weird.

So that part looks like a spec issue to me; if you agree, I'll file it. @tabatkins maybe you could update the tests to use getPropertyValue / setProperty? I can fix https://bugzilla.mozilla.org/show_bug.cgi?id=443978 (lol, quite an old bug) and file that spec issue about descriptors in CSSStyleDeclaration.
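For reference, a sketch of what testing through the method versions might look like. The helper name `roundTripDescriptor` is made up, and the commented-out browser usage assumes the first rule of the first stylesheet is an @font-face rule; this is not the actual WPT test code.

```javascript
// Set a descriptor via setProperty and read it back via getPropertyValue,
// instead of relying on a camel-cased JS property such as style.unicodeRange.
// A throwing setProperty is reported rather than propagated.
function roundTripDescriptor(style, name, value) {
  try {
    style.setProperty(name, value);
  } catch (e) {
    return { ok: false, error: e.name };
  }
  return { ok: true, value: style.getPropertyValue(name) };
}

// Hypothetical browser usage:
//   const rule = document.styleSheets[0].cssRules[0];
//   console.log(roundTripDescriptor(rule.style, "unicode-range", "U+0-7F"));
```

Because a plain JS property assignment on an object that lacks the attribute silently creates an expando, the property-based version can appear to round-trip anything, which matches the "returns exactly what the author wrote" symptom above.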


tabatkins · 2019-02-08T00:18:40Z

Ahaha, nice.

Changed the test to use the method versions instead.

tabatkins · 2019-02-11T22:15:41Z

Now that Firefox is being properly tested, and thus passing a lot of the tests, I'm happier about this. The WG resolved on the current spec text, and I'm happy enough with that, so I'm closing this bug as "fixed".

emilio · 2019-02-12T11:17:22Z

I filed #3647 about CSSStyleDeclaration woes. I'll try to find some time to fix the FF bug to allow modification of font-face rules.

@emilio

…-supported, a=testonly Automatic update from web-platform-tests Use gPV/sP, as that's slightly more well-supported Requested by @emilio in <w3c/csswg-drafts#3588 (comment)> -- Merge pull request #15286 from web-platform-tests/tabatkins-patch-1 Use gPV/sP, as that's slightly more well-supported -- wpt-commits: f912ea28e7a5451f51d38d62cd0aee489618e4cb, ead8f8b00d0b68237109f3c93d0ccae076a34f98 wpt-pr: 15286


tabatkins added the css-syntax-3 label Feb 1, 2019

tabatkins added the Tested Memory aid - issue has WPT tests label Feb 1, 2019

tabatkins mentioned this issue Feb 1, 2019

[css-syntax] Wrapping up the <unicode-range> thing #3591

Closed

tabatkins added a commit to web-platform-tests/wpt that referenced this issue Feb 8, 2019

Use gPV/sP, as that's slightly more well-supported

f912ea2

Requested by @emilio in <w3c/csswg-drafts#3588 (comment)>

tabatkins mentioned this issue Feb 8, 2019

Use gPV/sP, as that's slightly more well-supported web-platform-tests/wpt#15286

Merged

tabatkins added Closed Accepted by CSSWG Resolution Commenter Satisfied Commenter has indicated satisfaction with the resolution / edits. labels Feb 11, 2019

tabatkins closed this as completed Feb 11, 2019

emilio mentioned this issue Feb 12, 2019

[cssom] Should font descriptors and such appear as properties in CSSStyleDeclaration? #3647

Closed

tabatkins added this to the CSS Syntax 3 June 2019 CR milestone Jun 28, 2019

tabatkins mentioned this issue May 12, 2023

[css-syntax] Give up on <urange> production? #8835

Open
