(Invisible) XML and names

Volume 6, Issue 2; 09 Apr 2022

Names and the characters they contain.

In the context of Invisible XML, a name is nice and simple:

        name: namestart, namefollower*.
   namestart: ["_"; L].
namefollower: namestart; ["-.·‿⁀"; Nd; Mn].

A name begins with an underscore or a character in the Unicode “Letters” category. It can then contain more underscores or letters, hyphens, full stops, middle dots, under ties, character ties, or any characters in the “Numbers, decimal” category or the “Marks, nonspacing” category.

In the context of XML (fifth edition), names are also fairly simple:

[5]           Name ::= NameStartChar (NameChar)*
[4]  NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6]
                       | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D]
                       | [#x37F-#x1FFF] | [#x200C-#x200D]
                       | [#x2070-#x218F] | [#x2C00-#x2FEF] 
                       | [#x3001-#xD7FF] | [#xF900-#xFDCF]
                       | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a]      NameChar ::= NameStartChar | "-" | "." | [0-9]
                       | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]

The notation is a little different, and it isn’t as tidy because it can’t refer to Unicode character categories, but it’s pretty much the same thing.

How pretty much, you ask?

Near as I can tell, there are three characters allowed in Invisible XML names that are not allowed in XML names: ª (feminine ordinal indicator), µ (micro sign), and º (masculine ordinal indicator).
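If you want to check that comparison yourself, here is a minimal sketch in Python. It transcribes the fifth edition ranges from the productions above and uses the standard `unicodedata` module for the Unicode categories. The function names are my own inventions for illustration, and the answer depends on the version of the Unicode database bundled with your Python, so treat it as a rough check rather than a definitive one.

```python
import sys
import unicodedata

# XML 1.0 (fifth edition) NameStartChar ranges, production [4].
XML_NAME_START = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]
# The extra ranges that NameChar adds, production [4a].
XML_NAME_EXTRA = [
    (0x2D, 0x2E), (0x30, 0x39), (0xB7, 0xB7),
    (0x300, 0x36F), (0x203F, 0x2040),
]


def in_ranges(cp, ranges):
    """True if the code point falls inside any (low, high) range."""
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_xml_name_char(cp):
    """True if XML fifth edition allows the code point somewhere in a name."""
    return in_ranges(cp, XML_NAME_START) or in_ranges(cp, XML_NAME_EXTRA)


def is_ixml_name_char(cp):
    """True if the ixml namefollower rule allows the code point."""
    ch = chr(cp)
    cat = unicodedata.category(ch)
    return ch in "_-.·‿⁀" or cat.startswith("L") or cat in ("Nd", "Mn")


def ixml_only():
    """Code points allowed in ixml names but not in XML names."""
    return [cp for cp in range(sys.maxunicode + 1)
            if is_ixml_name_char(cp) and not is_xml_name_char(cp)]


if __name__ == "__main__":
    for cp in ixml_only():
        print(f"U+{cp:04X} {chr(cp)} {unicodedata.name(chr(cp), '?')}")
```

The sketch only compares the “anywhere in a name” sets; it doesn’t look at the colon, which XML allows and Invisible XML doesn’t.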

Why does this matter? It matters because the element and attribute names in the “visible XML” that results from a parse are taken from the Invisible XML names.

This is a perfectly valid Invisible XML grammar:

µ: ["0"-"9"]+, (".",  ["0"-"9"]+)? .

If you parse the input “3.14” with that grammar, the result is <µ>3.14</µ>, but that’s not XML. My NineML processor is perfectly happy to output JSON, so you can see that it works:

$ coffeepot -g:micro.ixml --format:json 3.14
{"µ":3.14}
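To make the “that’s not XML” part concrete, here is a small Python sketch of the kind of check a serializer could run before writing an element name. It encodes the fifth edition Name production as a regular expression. The `check_element_name` function is hypothetical; it is not how coffeepot or any other processor actually does this.

```python
import re

# NameStartChar from XML 1.0 (fifth edition), written as the inside
# of a regular expression character class.
NAME_START = (
    ":A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
    "\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF"
    "\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
# NameChar adds hyphen, full stop, ASCII digits, middle dot,
# the combining diacritics block, and the two tie characters.
NAME_CHAR = NAME_START + "\\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"

XML_NAME = re.compile(f"[{NAME_START}][{NAME_CHAR}]*")


def check_element_name(name: str) -> str:
    """Return the name unchanged, or raise if it is not an XML Name.

    Hypothetical helper: a stand-in for whatever error a real
    Invisible XML processor raises in this situation.
    """
    if XML_NAME.fullmatch(name) is None:
        raise ValueError(f"{name!r} is not a well-formed XML name")
    return name


if __name__ == "__main__":
    print(check_element_name("pi"))      # an ordinary name passes
    try:
        check_element_name("\u00B5")     # the micro sign does not
    except ValueError as err:
        print(err)
```

Note that the micro sign (U+00B5) fails, but the Greek small letter mu (U+03BC), which looks much the same, sits inside the #x37F-#x1FFF range and passes.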

A conforming Invisible XML processor won’t produce output that isn’t well-formed; it will raise an error. So, for my money, three random characters that are unlikely to be used in names are an entirely reasonable price to pay for the concise definition of name characters in Invisible XML.

Except.

Except most parsers don’t, apparently, implement the name character rules of fifth edition. (In some post-pandemic world when it’s safe to sit around in bars again, buy me a beer and I’ll recap the political brouhaha that surrounded fifth edition.)

Most parsers, as far as I can tell, still implement the fourth edition rules. (Or, at least, that’s what I assume they’re doing.)

Apache Xerces, for example, excludes great rafts of characters that would be allowed by the fifth edition rules. According to Xerces, there are 21,277 characters allowed in Invisible XML names that aren’t allowed in XML names.

I still think the concise formulation in Invisible XML is worth keeping. And I think parsers should update to the fifth edition rules.

But for the record, and in case someone wants one, here’s an Invisible XML grammar that defines names such that they are completely compatible with XML fourth edition:

    name: (Letter; '_'), NameChar* .

NameChar: Letter; Digit; ['.-_'];
          CombiningChar;
          Extender .

  Letter: BaseChar; Ideographic .

BaseChar: [#41-#5a; #61-#7a; #c0-#d6; #d8-#f6; #f8-#ff;
           #100-#131; #134-#13e; #141-#148; #14a-#17e;
           #180-#1c3; #1cd-#1f0; #1f4-#1f5; #1fa-#217;
           #250-#2a8; #2bb-#2c1; #386; #388-#38a; #38c;
           #38e-#3a1; #3a3-#3ce; #3d0-#3d6; #3da; #3dc;
           #3de; #3e0; #3e2-#3f3; #401-#40c; #40e-#44f;
           #451-#45c; #45e-#481; #490-#4c4; #4c7-#4c8;
           #4cb-#4cc; #4d0-#4eb; #4ee-#4f5; #4f8-#4f9;
           #531-#556; #559; #561-#586; #5d0-#5ea; #5f0-#5f2;
           #621-#63a; #641-#64a; #671-#6b7; #6ba-#6be;
           #6c0-#6ce; #6d0-#6d3; #6d5; #6e5-#6e6; #905-#939;
           #93d; #958-#961; #985-#98c; #98f-#990; #993-#9a8;
           #9aa-#9b0; #9b2; #9b6-#9b9; #9dc-#9dd; #9df-#9e1;
           #9f0-#9f1; #a05-#a0a; #a0f-#a10; #a13-#a28;
           #a2a-#a30; #a32-#a33; #a35-#a36; #a38-#a39;
           #a59-#a5c; #a5e; #a72-#a74; #a85-#a8b; #a8d;
           #a8f-#a91; #a93-#aa8; #aaa-#ab0; #ab2-#ab3;
           #ab5-#ab9; #abd; #ae0; #b05-#b0c; #b0f-#b10;
           #b13-#b28; #b2a-#b30; #b32-#b33; #b36-#b39; #b3d;
           #b5c-#b5d; #b5f-#b61; #b85-#b8a; #b8e-#b90;
           #b92-#b95; #b99-#b9a; #b9c; #b9e-#b9f; #ba3-#ba4;
           #ba8-#baa; #bae-#bb5; #bb7-#bb9; #c05-#c0c;
           #c0e-#c10; #c12-#c28; #c2a-#c33; #c35-#c39;
           #c60-#c61; #c85-#c8c; #c8e-#c90; #c92-#ca8;
           #caa-#cb3; #cb5-#cb9; #cde; #ce0-#ce1; #d05-#d0c;
           #d0e-#d10; #d12-#d28; #d2a-#d39; #d60-#d61;
           #e01-#e2e; #e30; #e32-#e33; #e40-#e45; #e81-#e82;
           #e84; #e87-#e88; #e8a; #e8d; #e94-#e97;
           #e99-#e9f; #ea1-#ea3; #ea5; #ea7; #eaa-#eab;
           #ead-#eae; #eb0; #eb2-#eb3; #ebd; #ec0-#ec4;
           #f40-#f47; #f49-#f69; #10a0-#10c5; #10d0-#10f6;
           #1100; #1102-#1103; #1105-#1107; #1109;
           #110b-#110c; #110e-#1112; #113c; #113e; #1140;
           #114c; #114e; #1150; #1154-#1155; #1159;
           #115f-#1161; #1163; #1165; #1167; #1169;
           #116d-#116e; #1172-#1173; #1175; #119e; #11a8;
           #11ab; #11ae-#11af; #11b7-#11b8; #11ba;
           #11bc-#11c2; #11eb; #11f0; #11f9; #1e00-#1e9b;
           #1ea0-#1ef9; #1f00-#1f15; #1f18-#1f1d;
           #1f20-#1f45; #1f48-#1f4d; #1f50-#1f57; #1f59;
           #1f5b; #1f5d; #1f5f-#1f7d; #1f80-#1fb4;
           #1fb6-#1fbc; #1fbe; #1fc2-#1fc4; #1fc6-#1fcc;
           #1fd0-#1fd3; #1fd6-#1fdb; #1fe0-#1fec;
           #1ff2-#1ff4; #1ff6-#1ffc; #2126; #212a-#212b;
           #212e; #2180-#2182; #3041-#3094; #30a1-#30fa;
           #3105-#312c; #ac00-#d7a3] .

Ideographic: [#4e00-#9fa5; #3007; #3021-#3029] .

CombiningChar: [#300-#345; #360-#361; #483-#486; #591-#5a1;
           #5a3-#5b9; #5bb-#5bd; #5bf; #5c1-#5c2; #5c4;
           #64b-#652; #670; #6d6-#6dc; #6dd-#6df; #6e0-#6e4;
           #6e7-#6e8; #6ea-#6ed; #901-#903; #93c; #93e-#94c;
           #94d; #951-#954; #962-#963; #981-#983; #9bc;
           #9be; #9bf; #9c0-#9c4; #9c7-#9c8; #9cb-#9cd;
           #9d7; #9e2-#9e3; #a02; #a3c; #a3e; #a3f;
           #a40-#a42; #a47-#a48; #a4b-#a4d; #a70-#a71;
           #a81-#a83; #abc; #abe-#ac5; #ac7-#ac9; #acb-#acd;
           #b01-#b03; #b3c; #b3e-#b43; #b47-#b48; #b4b-#b4d;
           #b56-#b57; #b82-#b83; #bbe-#bc2; #bc6-#bc8;
           #bca-#bcd; #bd7; #c01-#c03; #c3e-#c44; #c46-#c48;
           #c4a-#c4d; #c55-#c56; #c82-#c83; #cbe-#cc4;
           #cc6-#cc8; #cca-#ccd; #cd5-#cd6; #d02-#d03;
           #d3e-#d43; #d46-#d48; #d4a-#d4d; #d57; #e31;
           #e34-#e3a; #e47-#e4e; #eb1; #eb4-#eb9; #ebb-#ebc;
           #ec8-#ecd; #f18-#f19; #f35; #f37; #f39; #f3e;
           #f3f; #f71-#f84; #f86-#f8b; #f90-#f95; #f97;
           #f99-#fad; #fb1-#fb7; #fb9; #20d0-#20dc; #20e1;
           #302a-#302f; #3099; #309a ] .

   Digit: [#30-#39; #660-#669; #6f0-#6f9; #966-#96f;
           #9e6-#9ef; #a66-#a6f; #ae6-#aef; #b66-#b6f;
           #be7-#bef; #c66-#c6f; #ce6-#cef; #d66-#d6f;
           #e50-#e59; #ed0-#ed9; #f20-#f29] .

Extender: [#b7; #2d0; #2d1; #387; #640; #e46; #ec6;
           #3005; #3031-#3035; #309d-#309e; #30fc-#30fe] . 

Note that this grammar varies from the XML rules in that it doesn’t allow colons in names, because Invisible XML doesn’t currently allow colons in names.