What about 3 to 6 occurrences?
A proposal to simplify Invisible XML grammars that match a range of occurrences.
Invisible XML has operators for “one, optionally”, “zero or more”, and “one or
more” occurrences of a string. (Technically a “factor”, not a string: a terminal, a nonterminal,
an insertion, or a parenthesized expression. But I’m going to stick with strings here
for simplicity.) Those operators are ? ("a"? matches “” or “a”), * ("a"* matches “” or “a” or “aa”, etc.),
and + ("a"+ matches “a” or “aa”, etc.).
But what if you want at least three but not more than six occurrences?
You can do that too, but you have to work a little harder:
"a", "a", "a", ("a", ("a", ("a")?)?)?
That could be simplified slightly to
"aaa", ("a", ("a", "a"?)?)?
but I think that obscures the pattern a bit. It’s worth noting that you can also simplify it this way:
"a", "a", "a", "a"?, "a"?, "a"?
Unfortunately, that will result in ambiguous parses. In the string “aaaa”, that fourth “a” could match any one of the first, second, or third optional “a”.
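To make the ambiguity concrete, here is a minimal Python sketch that counts the ways a string of “a”s can be matched by the flat form versus the nested form. The function names are hypothetical, made up for this illustration, and the code only models these two specific patterns; it is not a general Invisible XML parser.

```python
from itertools import product


def count_flat_parses(text: str, required: int = 3, optional: int = 3) -> int:
    """Count parses of `text` against required "a"s followed by independent "a"? slots."""
    if set(text) - {"a"} or len(text) < required:
        return 0
    extra = len(text) - required
    # Each optional slot either consumes one "a" (1) or matches the empty string (0).
    # Every distinct assignment that consumes exactly `extra` "a"s is a distinct parse.
    return sum(1 for slots in product((0, 1), repeat=optional) if sum(slots) == extra)


def count_nested_parses(text: str, required: int = 3, optional: int = 3) -> int:
    """Count parses against the nested form, where optional slots fill left to right."""
    if set(text) - {"a"}:
        return 0
    extra = len(text) - required
    # Nesting forces a single order, so there is at most one parse.
    return 1 if 0 <= extra <= optional else 0


print(count_flat_parses("aaaa"))    # three ways to place the fourth "a"
print(count_nested_parses("aaaa"))  # exactly one
```

The flat form is only unambiguous at the extremes (three or six “a”s); the nested form always has a single parse.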
In issue 308, martian-a observes that the inability to specify a specific number of repetitions leads to grammars that are harder to read. The example given is:
code: -" ", [Lu], [Lu], ["0"-"9"], ["0"-"9"], ["0"-"9"], ["0"-"9"], ["0"-"9"], ["0"-"9"] .
Quick! How many digits are allowed? This initial question is about a single, specific number of occurrences, but in the first follow-up comment, vincentml observes that it applies to ranges as well.
At this point, I entered the chat and started noodling on some ideas. I’ve pulled those ideas together into a proposal for specified repetitions with and without a separator.
The basic idea is to allow <m,n> in addition to ?, *, and +. The new
syntax expresses that at least “m” occurrences are required, but no more than “n”
are allowed. If “n” is omitted, it’s assumed to be the same as “m”. If you want
at least “m” occurrences with no upper bound, use * for “n”: <3,*> is at least
three but as many as you like.
The example martian-a introduced can be simplified to:
code: -" ", [Lu], [Lu], ["0"-"9"]<6>.
And if you wanted 3 to 6 occurrences, you’d write:
code: -" ", [Lu], [Lu], ["0"-"9"]<3,6>.
The repetition counts have to be greater than or equal to zero. I don’t think -3
“a”s makes any sense. I also don’t think there’s any obvious interpretation for
<6,3>, so I propose that the second number must be greater than or equal to
the first. Technically, that leaves <0,0>, but that seems really awkward, so I
also propose that the second number must be greater than zero. ("a"<0> matches
no “a”s. But so does {"a"}: Invisible XML uses curly braces for comments, and
commenting out the string is equally effective at removing it and less
potentially confusing.)
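The constraints on the counts can be sketched as a small Python validator. This is only an illustration: the function name `parse_bounds` is hypothetical, and the sketch assumes no whitespace inside the angle brackets, which the proposal itself may well allow.

```python
import re
from typing import Optional, Tuple

# Assumed surface syntax: <m>, <m,n>, or <m,*> with no internal whitespace.
_BOUNDS = re.compile(r"<(\d+)(?:,(\d+|\*))?>")


def parse_bounds(text: str) -> Tuple[int, Optional[int]]:
    """Return (m, n) for a repetition marker; n is None when the upper bound is *."""
    match = _BOUNDS.fullmatch(text)
    if match is None:
        # Negative counts such as <-3> land here because "-" is not a digit.
        raise ValueError(f"not a repetition marker: {text!r}")
    low = int(match.group(1))
    upper = match.group(2)
    if upper is None:
        high: Optional[int] = low  # <6> means exactly six
    elif upper == "*":
        high = None                # <3,*> means three or more
    else:
        high = int(upper)
    if high is not None:
        if high < low:
            raise ValueError("second number must be >= the first")
        if high == 0:
            raise ValueError("second number must be greater than zero")
    return low, high


print(parse_bounds("<3,6>"))
print(parse_bounds("<6>"))
print(parse_bounds("<3,*>"))
```

Under these rules <0,1> is accepted, while <6,3>, <0,0>, and <0> are all rejected.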
Invisible XML has two more repetition forms, ** and ++. They indicate repetition
with a separator: "a"**"," matches zero or more “a”s separated by commas:
“”, “a”, “a,a”, “a,a,a”, etc. And ++ matches one or more such “a”s.
Following the pattern of repeating the character, I propose <<m,n>> for this purpose.
The pattern:
"a"<<3,6>>","
matches between three and six “a”s, separated by commas.
In the proposal, I extended the “hints for implementors” section to show how these new syntactic forms can be simplified to existing forms.
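One plausible way to do that rewriting is sketched below in Python. To be clear, this is a guess at a workable desugaring, not the text of the proposal's hints: the function names are hypothetical, and the output always uses the nested-optional form shown earlier so that the result is not ambiguous.

```python
from typing import Optional


def _nested_optionals(unit: str, count: int) -> str:
    """Build `count` nested optional copies of `unit`, e.g. (u, (u, (u)?)?)?"""
    tail = ""
    for _ in range(count):
        tail = f"({unit}, {tail})?" if tail else f"({unit})?"
    return tail


def expand(factor: str, low: int, high: Optional[int], sep: Optional[str] = None) -> str:
    """Rewrite factor<low,high> (or factor<<low,high>>sep) using only existing forms.

    `high` is None for an unbounded upper limit (the * case).
    """
    if low < 0 or (high is not None and (high < low or high == 0)):
        raise ValueError("invalid bounds")
    if sep is not None and low == 0:
        # Zero occurrences means no separators either, so make the whole
        # "at least one" expansion optional.
        return f"({expand(factor, 1, high, sep)})?"
    # With a separator, every occurrence after the first carries a separator.
    unit = factor if sep is None else f"{sep}, {factor}"
    parts = [factor] * low if sep is None else [factor] + [unit] * (low - 1)
    if high is None:
        parts.append(f"{factor}*" if sep is None else f"({unit})*")
    elif high > low:
        parts.append(_nested_optionals(unit, high - low))
    return ", ".join(parts)


print(expand('"a"', 3, 6))
print(expand('"a"', 3, 6, sep='","'))
```

The first call reproduces the hand-written pattern from the start of this post; the second shows the same shape with a comma threaded in before every occurrence after the first.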
We’re left with the fact that each of the current occurrence indicators can be
expressed using the new syntax (? is <0,1>, etc.). I think we should just
live with that. Yes, those forms could be forbidden, or the existing occurrence
indicators could be removed, but neither of those seems worthwhile.
If, at this point, you want to say that you don’t like the choice of < and
>, well, okay.
What do you propose instead? I think it’s useful if they’re a
matched pair of delimiters. But parentheses, square brackets, and curly braces
are all unavailable. That uses up the “easy to type on a regular keyboard”
choices, I think. You could, instead, use the same character before and after,
for example /3,6/ and //3,6//. I think that could be made to parse. I don’t
think it’s better. (Or much worse, TBH.) Bikeshed all the things!
Repetition on an insertion is allowed, and a bit odd. But we already allow the existing occurrence indicators on insertions, and those are awful too:
"a", +"X", "a"
matches “aa” and serializes as “aXa”.
"a", +"X"*, "a"
is infinitely ambiguous. It matches “aa” and serializes as “a” followed by
0, 1, 2, 3, … ∞ “X”s, followed by “a”.
The syntactic form +"X"<3,6> isn’t more infinitely ambiguous.
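As a rough illustration, the possible serializations for a bounded repetition on an insertion can be enumerated with a few lines of Python. The function name is hypothetical, and the sketch assumes each permitted repetition count produces its own parse of the input “aa”.

```python
from typing import List


def serializations(low: int, high: int, inserted: str = "X",
                   before: str = "a", after: str = "a") -> List[str]:
    """List the outputs of "a", +"X"<low,high>, "a" for the input "aa"."""
    # One output per permitted repetition count, so the list is always finite.
    return [before + inserted * count + after for count in range(low, high + 1)]


for output in serializations(3, 6):
    print(output)
```

With bounds of 3 and 6 there are four candidates, from “aXXXa” to “aXXXXXXa”: still ambiguous, but finitely so, unlike the * form.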