Re: CSS3 Speech Module : Working Draft 27 July 2004 Comments

On Aug 10, 2004, at 6:33 AM, Dave Raggett wrote:

>> Substantive Comments
>> --------------------
>>
>> 1.
>> Section: Definition of the property 'speak'
>>
>> This draft of the spec -
>> http://www.w3.org/TR/2002/WD-speech-synthesis-20021202/ - defined two
>> additional properties, 'date' and 'words'. The latter is probably only
>> marginally useful (in theory it was supposed to force 'ASCII' to be
>> rendered as "as-key" rather than "a s c i i") but I'm really surprised
>> at the removal of "date" which would seem to be really useful.
>
> This will be addressed in terms of the SSML say-as mechanism, the
> details of which are still being refined in the W3C Voice Browser
> working group. Appendix A notes: "The interpret-as property has been
> temporarily dropped until the Voice Browser working group has
> further progressed work on the SSML <say-as> element."

[...]

> But that would prevent the application of the existing speak
> properties. Note that such a property could only be applied to a
> specific instance of an element. In the longer term, the use of
> pronunciation lexicons would provide a better solution.
>
> Your ideas on this are welcomed.

OK, let's see. SSML has:

Sub
"The sub element is employed to indicate that the text in the alias 
attribute value replaces the contained text for pronunciation. This 
allows a document to contain both a spoken and written form."

Say as
"The say-as element allows the author to indicate information on the 
type of text construct contained within the element and to help specify 
the level of detail for rendering the contained text."

CSS has:

Speak
"This property specifies whether text will be rendered aurally and if 
so, in what manner."

interpret-as (2003 draft)

"This provides a hint to the speech platform as to how to interpret the 
corresponding element's content and is useful when the content is 
ambigous and liable to be misinterpreted. "

Frankly this makes my head spin. Intellectually I can see that each of 
these does a distinct useful thing, but trying to intuitively grasp 
which one to use in a document would take a lot of practice.

The primary thing I'd like to see done is some consolidation. I'd be 
very tempted just to roll the 'interpret-as' values into 'speak' when 
it's re-introduced. I'm really not sure what distinguishing the two buys 
us. How often would something like this be useful (trying to think of 
relatively sensible combinations):

interpret-as: currency;
speak: digits;

or

interpret-as: address;
speak: literal-punctuation;

Is it worth the additional complexity?
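
To make the consolidation concrete, here's a sketch of what I have in 
mind (the class names are made up, and 'currency' folded into 'speak' is 
just illustrative, not something either draft defines):

.price { speak: currency }   /* value rolled in from 'interpret-as' */
.phone { speak: digits }     /* existing 'speak' value              */

One property means one decision point for the stylesheet author.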

Assuming you do keep interpret-as, obviously it's intended to harmonize 
with SSML like so:

SSML: <say-as interpret-as="foo"> ... </say-as>
CSS: interpret-as: foo;

On balance, I'd find it easier to keep things straight if the CSS 
property was "say-as" as well:

SSML: <say-as interpret-as="foo"> ... </say-as>
CSS: say-as: foo;

I think using the element name (say-as) instead of the attribute name 
(interpret-as) establishes the relationship more clearly. When one 
reads the SSML spec, <say-as> stands out in the table of contents 
because it's a full section heading, whereas the attribute name 
'interpret-as' is buried inside section 3.1.8. People would therefore 
realize the two are intended to be related much more easily if the CSS 
property were 'say-as'.

>> 4.3. As per my 2003 comments, although I like the fact there is a
>> facility for selecting variations, using <number> for specifying them
>> is not a satisfactory solution.
>>
>> * firstly using absolute numbers is not very portable. If I write
>>
>> body { voice-family: male 1 }
>> .foo { voice-family: male 2 }
>> .bar { voice-family: male 3 }
>>
>> Then what happens if the synthesizer only has two male voices? When
>> something of class 'bar' is rendered, does the synthesizer round-robin
>> back to "male 1" or does it stay with the current voice because it
>> doesn't have enough male voices? At the very least the specification
>> should specify what "best effort" strategy the synthesizer should
>> apply. This allows document authors to at least predict whether the
>> voice will change or not (assuming the synthesizer has at least 2
>> voices).
>
> This is tricky given that SSML doesn't provide a specific algorithm
> other than with respect to the value for xml:lang. The current
> wording is the best I have been able to come up with.

You mean this paragraph:

"If there is no voice available for the requested value of xml:lang, 
the processor should select a voice that is closest to the requested 
language (e.g. a variant or dialect of the same language). If there are 
multiple such voices available, the processor should use a voice that 
best matches the values provided with the voice-volume property. It is 
an error if there are no such matches."

Aside: did you really mean the "voice-volume" property, or is this a 
typo?

I think it might be worth adding something like this (though perhaps 
this should go in an appendix as an "informative" example):

"Non-existent voice variants should be resolved in a round-robin / 
modulus arithmetic fashion. eg, if a synthesizer has only 2 male 
voices:

voice: male 3;

is resolves as '3 mod 2' and becomes:

voice: male 1;

This is reasonably predictable, should result in the voice changing 
when the author intends it to, and is very very easy to implement. If 
we say nothing on the subject, I think it will leave stylesheet authors 
in a bad place because it'll be very hard to predict what might happen 
on synthesizers without enough voices.

>> Overall I believe something like 'previous', 'next' and 'different'
>> would be more useful, more intuitive and more portable than absolute
>> integer indices.
>
> Unfortunately the need to align with SSML precludes this. The speech
> engine vendors are currently focusing on the VoiceXML market and are
> much less interested in CSS, so for now, we need to align with SSML.

I understand. However, if we can diverge from SSML to do 'young', 'old' 
and 'child', perhaps we can support keywords in _addition_ to numbered 
variations. IIRC I suggested adding "different" to SSML during its 
comment period, and the WG said they'd consider it for the next major 
revision (presumably 1.1) as it wasn't a bad idea.

>> 5.
>> Section: Definition of 'voice-pitch'
[...]
> The primary use case was for when you wanted to get the TTS engine
> to "sing" by tweaking the pitch contours.
>
> This remains an open issue ....

Well, personally I took more physics than music at school, so Hz and % 
work for me.
I guess I'm not in the intended audience for semitones.

> Thanks for your feedback. Would you be willing to help with
> work on test suites and implementations?

Hrm, well. That's a tricky one. I'm actually on the expert group for 
the Java Speech API 2.0, and staying up to date with that, in addition 
to actually building things for Mozilla, keeps me pretty busy when I'm 
not doing my real job :) Having said that, I implemented bits of ACSS 
back in 1998 for my Master's project at school, so I've been interested 
for a long while. Maybe I could do something; it really depends on how 
much of a time commitment it would be.

AndyT (lordpixel - the cat who walks through walls)
A little bigger on the inside

         (see you later space cowboy ...)

Received on Thursday, 12 August 2004 03:40:33 UTC