Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[css-syntax] Urange and its problems #3588

Closed
tabatkins opened this issue Feb 1, 2019 · 12 comments
Closed

[css-syntax] Urange and its problems #3588

tabatkins opened this issue Feb 1, 2019 · 12 comments
Labels
Closed Accepted by CSSWG Resolution Commenter Satisfied Commenter has indicated satisfaction with the resolution / edits. css-syntax-3 Tested Memory aid - issue has WPT tests

Comments

@tabatkins
Copy link
Member

(migrated from the mailing list, for easier tracking here)

Tab Atkins said:

History: CSS2.1 defined a special grammar token just for unicode
ranges, which was used in exactly one place: the 'unicode-range'
descriptor of @font-face. This special production caused bugs in
pages, where selectors like u+a { ... } were parsed as a
UNICODE-RANGE token, rather than the expected "IDENT(u) DELIM(+)
IDENT(a)", like every other selector of that form was parsed. (This
isn't theoretical - Moz had a bug reported against it for this.)

When writing the Syntax spec, I tried to fix this by dropping the
unicode-range concept from the tokenizer, and instead handling it as a
complex construct of the existing tokens, like I did with <an+b>.
This kinda worked initially, but was really nasty. Since then, we
added scinot to numbers (like 1e3 for 1000), and this completely
destroyed
my ability to define cleanly - I can no longer use
the value of numeric tokens, and instead have to rely on the
"representation", which no browser stores or wants to store.

I want to go ahead and resolve this. I can see three options:

  1. Keep what I'm currently doing. This requires browsers to hold onto
    the string representation of numeric tokens (numbers and dimensions)
    at least through initial parsing (longer if they're used in a custom
    property).

  2. Abandon this effort, go back to having a special unicode-range
    token. Accept that this is weird and there are stupid side-effects,
    like some selectors not working.

  3. Define a new syntax that's actually simple to obtain from
    the existing tokens¹. Deprecate the old syntax; require UAs to accept
    the old syntax in the 'unicode-range' descriptor, but don't define how
    they should do so. (Current UAs use context-sensitive retokenizing, I
    think - once they realize they're in a unicode-range descriptor,
    they'll retokenize the original text according to a special set of
    rules.)

Thoughts?

¹ Simplest change is just to replace the + with a -, so you write
U-2016 for ‖. This makes unicode ranges always a single IDENT token,
plus possibly some trailing '?' DELIM tokens. You then have to parse
the token's value to make sure it's a valid range, but that's way, way
easier than the garbage fire I have to deal with from today's syntax.


fantasai said:

Given unicode-range is already shipping
http://caniuse.com/#feat=font-unicode-range
I think #3 is a non-starter.

I would imagine that reparsing unicode-range tokens in order to make
the selectors work would be easier than doing #1, no? Hanging onto
unicode-range tokens would be a lot less memory than hanging onto
numbers and dimensions, given they're used so rarely.


Tab Atkins said:

On Tue, Apr 12, 2016 at 2:27 PM, fantasai [email protected] wrote:

Given unicode-range is already shipping
http://caniuse.com/#feat=font-unicode-range
I think #3 is a non-starter.

You might have misread - #3 is explicitly backwards-compatible. It
requires UAs to support the old syntax, it just doesn't describe how
they would do so.

I would imagine that reparsing unicode-range tokens in order to make
the selectors work would be easier than doing #1, no? Hanging onto
unicode-range tokens would be a lot less memory than hanging onto
numbers and dimensions, given they're used so rarely.

Yeah, it just means we have to reparse them everywhere except unicode-range.


Florian Rivoal said:

On Apr 13, 2016, at 07:09, Tab Atkins Jr. [email protected] wrote:

On Tue, Apr 12, 2016 at 2:27 PM, fantasai [email protected] wrote:

Given unicode-range is already shipping
http://caniuse.com/#feat=font-unicode-range
I think #3 is a non-starter.

You might have misread - #3 is explicitly backwards-compatible. It
requires UAs to support the old syntax, it just doesn't describe how
they would do so.

As a UA implementor who has this on the roadmap, I don't like having a spec telling us to do something, without telling us how. All UAs would probably do fine at supporting the old syntax when it is correctly used, but I am much less confident that we'd all pick the same logic for error handling, and it is important that we all react the same way in the face of unknown/incorrect syntax.

I would imagine that reparsing unicode-range tokens in order to make
the selectors work would be easier than doing #1, no? Hanging onto
unicode-range tokens would be a lot less memory than hanging onto
numbers and dimensions, given they're used so rarely.

Yeah, it just means we have to reparse them everywhere except unicode-range.

Right, this feels ugly and error prone.


Florian Rivoal said:

On Apr 13, 2016, at 05:37, Tab Atkins Jr. [email protected] wrote:

  1. Keep what I'm currently doing. This requires browsers to hold onto
    the string representation of numeric tokens (numbers and dimensions)
    at least through initial parsing (longer if they're used in a custom
    property).

Does it really require that? Wouldn't it be good enough to hold onto the string representation of numeric tokens only when scinot is used? Given that scinot is pretty rare (and will stay that way), the memory requirement should be lower than storing the string representation of all numeric tokens.


Simon Sapin said:

How about this?

  1. Same as 2, but tweak the Selector grammar to interpret unicode-range
    tokens that don’t have question marks as: a type selector "u", followed
    by a next-sibling combinator, followed by another type selector.

It’s weird, but it seems less messy to me than the alternatives.


Tab Atkins said:

Yeah. It really fucks up the grammar something fierce, so I think
I'd have to do it as a preprocessing step before matching the actual
Selectors grammar. And anything else that ever wants to use a + is
similarly affected; we seem to have settled on requiring spaces around
math + and I don't expect us to use + for anything else, but custom
properties would be stuck with this gotcha. :/

@tabatkins
Copy link
Member Author

The current state of the spec is that I require UAs to retain the representation of the tokens, and then just concatenate the representations and reparse using a bespoke algorithm to produce a unicode-range.

I don't think anyone actually does this, but at least I have tests that will establish this now... web-platform-tests/wpt#15195

@tabatkins tabatkins added the Tested Memory aid - issue has WPT tests label Feb 1, 2019
@tabatkins
Copy link
Member Author

And merged web-platform-tests/wpt@62bfaeb

@tabatkins
Copy link
Member Author

@SimonSapin @emilio Firefox fails almost all of these tests, because in every single case (including the invalid ones!) it just returns exactly what the author wrote into unicode-range. That seems like broken behavior in the first place, and it also makes this much harder to test. Any insight into why this is?

@tabatkins
Copy link
Member Author

tabatkins commented Feb 2, 2019

An earlier thread:

Tab Atkins said:

In the telcon today, dbaron expressed concern that the definition of
<urange> requires looking at the "representation" of <number-token>s
and <dimension-token>s. (The "representation" of a numeric token is
the actual text used to write the number, including leading 0s,
leading + sign, original base and exponent when using scientific
notation, etc.)

I pointed out that storing the representation of numeric tokens is
already required, in order to implement the <quirky-color> production
from the Quirks Mode spec
https://quirks.spec.whatwg.org/#the-hashless-hex-color-quirk. IE's
behavior distinguishes between "color: 123;" and "color: 000123;", but
FF/WK/Blink don't; both are treated as #123, so we can maybe change
the Quirks Mode spec to not require the representation.

So, that leaves us with three possible resolutions to the <urange> thing.

  1. Leave it as it is. This requires storing the representation on
    every numeric token, which is a memory cost, but it lets us parse
    <urange> precisely. (The cost might not be as bad as all that. If
    you only store the representation when it's "non-obvious" (leading +
    sign, leading 0, scinot) then the memory cost is most of the time
    just a single null pointer per numeric token. You can regenerate the
    representation on the fly from "obvious" forms, so a helper function
    can be used to make representation-retrieval easy when it's
    necessary.)

  2. Drop the representation requirement, and rejigger the <urange>
    definition to account for that. This has a few side effects:

    1. We can no longer limit the urange syntax to at most 6 hex
      digits per component; arbitrary numbers of leading 0s will be allowed
      and are impossible to detect. This just means that U+0000000 becomes
      valid, for example.
    2. Four of the six grammar clauses "eat" the plus sign in the
      following numeric token, and it's not detectable from the value that a
      plus sign was ever used. The fact that whitespace is disallowed makes
      this not a huge deal; in order to still hit the right token patterns,
      you need to do some stupid comment tricks. "U/**/0001" will
      technically become valid, and equivalent to "U+0001".
    3. Scinot is still a problem. "200", "200e0", "20e1", and "2e2"
      all produce the same value when parsed as a <number-token>, but
      obviously refer to four different codepoints when interpreted as hex
      values. Numeric tokens would have to record if they were in scinot
      form, and what the exponent was.
  3. Revert this whole thing, and restore <unicode-range-token>. This
    requires us to fix the original problem some other way. As a
    refresher, the original issue was that "u+a { ... }" is a syntax
    error, as the selector is a <unicode-range-token>, not <ident-token>,
    +, <ident-token> like the author meant. Handling this in Selectors
    requires us to essentially "retokenize" selectors, to turn some
    <unicode-range-token>s into the expected token patterns; this would
    have to be repeated for any other syntax that ends up with allowing
    something looking like a unicode-range. It also means that non-CSS
    implementations of Selectors have to do some silly back-and-forth
    where they tokenize some strings into (meaningless) unicode-range
    tokens and then immediately re-tokenize them back into useful stuff.

I prefer solution #1 - doing it well increases the memory footprint of
a numeric token by the size of a pointer (generally doubling the size
of a <number-token>, but increasing the size of a <dimension-token> by
somewhat less), and allows us to handle <urange> exactly, without a
bunch of crazy hacks.

#2 isn't so great. It means we're expanding the syntax of <urange>,
something dbaron didn't want to do in the first place, and it
increases the cost of numeric tokens anyway, as you have to remember
scinot exponents. I don't think this wins us much.

#3 means that the unicode-range syntax infects Selectors, and any
future syntax we create that might have a + sign in it. (An+B avoids
it, since the only letter allowed is "n", and calc() avoids it by
requiring whitespace around the +, but we almost resolved to remove
the whitespace requirement, which would have put this back into the
realm of possibility once we allowed keywords in calc().)


Zack Weinburg said:

Option 3a: Restore <unicode-range-token> but declare that it is only
considered as a tokenization within @font-face { ... }, or even only
within the unicode-range: descriptor within @font-face.

I can't say that I like this, but that's because I am
philosophically not a fan of special tokenizer productions that only
apply in specific grammar contexts -- can anyone think of a
practical problem? It's not any worse than unquoted url() in terms
of code, it can't change the boundaries of a top-level construct, and
the only other issue that comes to mind is that it'll make it harder
to use <unicode-range-token> somewhere else in the future. But I
don't know that there are other uses, so.


Tab Atkins said:

That requires a vastly more complicated change, switching the Syntax
module from being separate tokenizer/parser steps to being integrated,
with a lot more state being thrown around. And it doesn't help us if
we ever want to use <urange> in another property or context, which I
think is plausible.


L. David Baron said:

  1. Leave it as it is. This requires storing the representation on
    every numeric token, which is a memory cost, but it lets us parse
    <urange> precisely. (The cost might not be as bad as all that. If
    you only store the representation when it's "non-obvious" (leading +
    sign, leading 0, scinot) then the memory cost is most of the time
    just a single null pointer per numeric token. You can regenerate the
    representation on the fly from "obvious" forms, so a helper function
    can be used to make representation-retrieval easy when it's
    necessary.)

I'm ok with this, and I think I prefer it at this point.

@tabatkins
Copy link
Member Author

This contextualizes the current state of the spec. I tried hard to remove the representation requirement anyway, but dbaron at least was okay with it originally.

@emilio
Copy link
Collaborator

emilio commented Feb 2, 2019

I don't have that much context on unicode-range. @SimonSapin is a better person to ask, though I could investigate if he doesn't have the time :)

@SimonSapin
Copy link
Contributor

Firefox fails almost all of these tests, because in every single case (including the invalid ones!) it just returns exactly what the author wrote into unicode-range.

I can reproduce this in your test case, but I don’t understand it when looking at code. FontFace::GetUnicodeRange ends up calling UnicodeRange::to_css which serializes two 32-bit integers.

When profiling though I see calls CSSFontFaceRule::Style but not FontFace::GetUnicodeRange. (But this is a sampling profiler and I haven’t tried an actual debugger.) The former method returns nsICSSDeclaration* pointer that looks like it might not know about @font-face but rather pretent it is a style rule. @emilio does this ring a bell?

Now the parsing does take a shortcut by taking a single slice of the original input rather than concatenating the representation of each token after removing comment tokens. This would make an observable difference for cases like U+/**/1, but doesn’t explain the results we get.

@SimonSapin
Copy link
Contributor

@tabatkins To sum up I think Firefox has a serialization bug around the getters of the fields of CSSFontFaceRule.style (possibly related to the WebIDL type of that being CSSStyleDeclaration, same as for style rules) but that unicode-range parsing and its impact on fonts are not impacted by that bug.

@emilio
Copy link
Collaborator

emilio commented Feb 3, 2019

You just made me curious and I took a look. You're just getting and setting JS properties.

> 'unicodeRange' in rule.style
< false

You can use getPropertyValue to get the value of the descriptor, but setProperty throws an ugly NOT_IMPLEMENTED error. That's clearly a bug we should fix.

That being said, I don't know what should make CSSStyleDeclaration.prototype.unicodeRange work, spec-wise.

https://drafts.csswg.org/css-fonts-4/#dom-cssfontfacerule-style doesn't say anything.

https://drafts.csswg.org/cssom-1/#cssstyledeclaration says:

For each CSS property property that is a supported CSS property, the following partial interface applies where camel-cased attribute is obtained by running the CSS property to IDL attribute algorithm for property.

https://drafts.csswg.org/cssom-1/#supported-css-property doesn't say anything about descriptors. And indeed Blink at least returns true for 'unicodeRange' in document.documentElement.style, which is really weird.

So that part looks a spec issue to me, if you agree I'll file it. @tabatkins maybe you could update the tests to use getPropertyValue / setProperty? I can fix https://bugzilla.mozilla.org/show_bug.cgi?id=443978 (lol, quite an old bug) and file that spec issue about descriptors in CSSStyleDeclaration.

@tabatkins
Copy link
Member Author

Ahaha, nice.

Changed the test to use the method versions instead.

@tabatkins tabatkins added Closed Accepted by CSSWG Resolution Commenter Satisfied Commenter has indicated satisfaction with the resolution / edits. labels Feb 11, 2019
@tabatkins
Copy link
Member Author

Now that Firefox is being properly tested, and thus passing a lot of the tests, I'm happier about this. The WG resolved on the current spec text, and I'm happy enough with that, so I'm closing this bug as "fixed".

@emilio
Copy link
Collaborator

emilio commented Feb 12, 2019

I filed #3647 about CSSStyleDeclaration woes. I'll try to find some time to fix the FF bug to allow modification of font-face rules.

moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this issue Feb 21, 2019
…-supported, a=testonly

Automatic update from web-platform-tests
Use gPV/sP, as that's slightly more well-supported

Requested by @emilio in <w3c/csswg-drafts#3588 (comment)>
--
Merge pull request #15286 from web-platform-tests/tabatkins-patch-1

Use gPV/sP, as that's slightly more well-supported

--

wpt-commits: f912ea28e7a5451f51d38d62cd0aee489618e4cb, ead8f8b00d0b68237109f3c93d0ccae076a34f98
wpt-pr: 15286
mykmelez pushed a commit to mykmelez/gecko that referenced this issue Feb 22, 2019
…-supported, a=testonly

Automatic update from web-platform-tests
Use gPV/sP, as that's slightly more well-supported

Requested by @emilio in <w3c/csswg-drafts#3588 (comment)>
--
Merge pull request #15286 from web-platform-tests/tabatkins-patch-1

Use gPV/sP, as that's slightly more well-supported

--

wpt-commits: f912ea28e7a5451f51d38d62cd0aee489618e4cb, ead8f8b00d0b68237109f3c93d0ccae076a34f98
wpt-pr: 15286
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this issue Feb 26, 2019
…-supported, a=testonly

Automatic update from web-platform-tests
Use gPV/sP, as that's slightly more well-supported

Requested by @emilio in <w3c/csswg-drafts#3588 (comment)>
--
Merge pull request #15286 from web-platform-tests/tabatkins-patch-1

Use gPV/sP, as that's slightly more well-supported

--

wpt-commits: f912ea28e7a5451f51d38d62cd0aee489618e4cb, ead8f8b00d0b68237109f3c93d0ccae076a34f98
wpt-pr: 15286
mykmelez pushed a commit to mykmelez/gecko that referenced this issue Feb 27, 2019
…-supported, a=testonly

Automatic update from web-platform-tests
Use gPV/sP, as that's slightly more well-supported

Requested by @emilio in <w3c/csswg-drafts#3588 (comment)>
--
Merge pull request #15286 from web-platform-tests/tabatkins-patch-1

Use gPV/sP, as that's slightly more well-supported

--

wpt-commits: f912ea28e7a5451f51d38d62cd0aee489618e4cb, ead8f8b00d0b68237109f3c93d0ccae076a34f98
wpt-pr: 15286
@tabatkins tabatkins added this to the CSS Syntax 3 June 2019 CR milestone Jun 28, 2019
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified that referenced this issue Oct 3, 2019
…-supported, a=testonly

Automatic update from web-platform-tests
Use gPV/sP, as that's slightly more well-supported

Requested by emilio in <w3c/csswg-drafts#3588 (comment)>
--
Merge pull request #15286 from web-platform-tests/tabatkins-patch-1

Use gPV/sP, as that's slightly more well-supported

--

wpt-commits: f912ea28e7a5451f51d38d62cd0aee489618e4cb, ead8f8b00d0b68237109f3c93d0ccae076a34f98
wpt-pr: 15286

UltraBlame original commit: 0cc2cbc16dda0f8e5e6538edb4cea4cf2087927f
gecko-dev-updater pushed a commit to marco-c/gecko-dev-comments-removed that referenced this issue Oct 4, 2019
…-supported, a=testonly

Automatic update from web-platform-tests
Use gPV/sP, as that's slightly more well-supported

Requested by emilio in <w3c/csswg-drafts#3588 (comment)>
--
Merge pull request #15286 from web-platform-tests/tabatkins-patch-1

Use gPV/sP, as that's slightly more well-supported

--

wpt-commits: f912ea28e7a5451f51d38d62cd0aee489618e4cb, ead8f8b00d0b68237109f3c93d0ccae076a34f98
wpt-pr: 15286

UltraBlame original commit: 0cc2cbc16dda0f8e5e6538edb4cea4cf2087927f
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified that referenced this issue Oct 4, 2019
…-supported, a=testonly

Automatic update from web-platform-tests
Use gPV/sP, as that's slightly more well-supported

Requested by emilio in <w3c/csswg-drafts#3588 (comment)>
--
Merge pull request #15286 from web-platform-tests/tabatkins-patch-1

Use gPV/sP, as that's slightly more well-supported

--

wpt-commits: f912ea28e7a5451f51d38d62cd0aee489618e4cb, ead8f8b00d0b68237109f3c93d0ccae076a34f98
wpt-pr: 15286

UltraBlame original commit: b3874f50c9fb17379335c08ce79eaf35d6a4a3ff
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified-and-comments-removed that referenced this issue Oct 4, 2019
…-supported, a=testonly

Automatic update from web-platform-tests
Use gPV/sP, as that's slightly more well-supported

Requested by emilio in <w3c/csswg-drafts#3588 (comment)>
--
Merge pull request #15286 from web-platform-tests/tabatkins-patch-1

Use gPV/sP, as that's slightly more well-supported

--

wpt-commits: f912ea28e7a5451f51d38d62cd0aee489618e4cb, ead8f8b00d0b68237109f3c93d0ccae076a34f98
wpt-pr: 15286

UltraBlame original commit: 0cc2cbc16dda0f8e5e6538edb4cea4cf2087927f
gecko-dev-updater pushed a commit to marco-c/gecko-dev-comments-removed that referenced this issue Oct 4, 2019
…-supported, a=testonly

Automatic update from web-platform-tests
Use gPV/sP, as that's slightly more well-supported

Requested by emilio in <w3c/csswg-drafts#3588 (comment)>
--
Merge pull request #15286 from web-platform-tests/tabatkins-patch-1

Use gPV/sP, as that's slightly more well-supported

--

wpt-commits: f912ea28e7a5451f51d38d62cd0aee489618e4cb, ead8f8b00d0b68237109f3c93d0ccae076a34f98
wpt-pr: 15286

UltraBlame original commit: b3874f50c9fb17379335c08ce79eaf35d6a4a3ff
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified-and-comments-removed that referenced this issue Oct 4, 2019
…-supported, a=testonly

Automatic update from web-platform-tests
Use gPV/sP, as that's slightly more well-supported

Requested by emilio in <w3c/csswg-drafts#3588 (comment)>
--
Merge pull request #15286 from web-platform-tests/tabatkins-patch-1

Use gPV/sP, as that's slightly more well-supported

--

wpt-commits: f912ea28e7a5451f51d38d62cd0aee489618e4cb, ead8f8b00d0b68237109f3c93d0ccae076a34f98
wpt-pr: 15286

UltraBlame original commit: b3874f50c9fb17379335c08ce79eaf35d6a4a3ff
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Closed Accepted by CSSWG Resolution Commenter Satisfied Commenter has indicated satisfaction with the resolution / edits. css-syntax-3 Tested Memory aid - issue has WPT tests
Projects
None yet
Development

No branches or pull requests

3 participants