XML is a nice syntax for some kinds of data. It's also a nice specific set of rules for data exchange between heterogeneous systems. But XML needs to be customized to the type of data that you're using it for. It serves as a wrapper around some other data. And once you get the characters of that data out, you still have some other rules to interpret it.
Let me give a concrete example. In SVG you have path elements. The path has a d attribute. The value of the d attribute is a string of text. The encoding rules for that text are all clearly defined by XML so that disparate computer systems can set and get the value unambiguously. But once you get the value of the d attribute, you have to interpret it according to the rules of the SVG specification. The d attribute describes a path using characters that signify movement of a pen in arcs and lines. Could this have been more XML-y? Sure. The SVG WG could have used (and people have suggested) a syntax where each subpath is an element, so the path commands (m, l, v, h, etc.) become XML elements themselves.
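Here's a rough sketch of those two layers in Python. The sample SVG string, the `TOKEN` regex, and the `parse_path` helper are all made up for illustration, and the tokenizer handles only a small subset of the real path grammar (no arcs or curves, no exponents, and it assumes the data starts with a command letter).

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from any real file.
SVG = '<svg xmlns="http://www.w3.org/2000/svg"><path d="M 10 20 L 30 40 h 5 z"/></svg>'

# Layer 1: the XML parser recovers the attribute value as plain characters.
root = ET.fromstring(SVG)
path = root.find("{http://www.w3.org/2000/svg}path")
d = path.get("d")

# Layer 2: a second, SVG-specific tokenizer interprets those characters.
# Simplified assumption: only a few command letters and plain decimal numbers.
TOKEN = re.compile(r"([MmLlHhVvZz])|(-?\d*\.?\d+)")

def parse_path(d):
    """Group each run of numbers under the command letter that precedes it."""
    commands = []
    for letter, number in TOKEN.findall(d):
        if letter:
            commands.append((letter, []))
        else:
            commands[-1][1].append(float(number))
    return commands

commands = parse_path(d)
```

XML's job ends once `d` is a string. Everything `parse_path` does is governed by a different set of rules.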
Another concrete example is the meaning of the title element in an RSS feed. The content could be pretty wide open, but the rules for what can go in there are up to RSS, not XML.
So what am I getting at? Just nested syntax, I guess. Nesting syntax allows us to do things like mix JavaScript with HTML and still figure out what a browser should do with the resulting compound document. SQL statements have been nested in strings of C++ programs forever, and there are plenty of other examples of parser A handing off some stuff to parser B. When we nest syntax B in syntax A this way, we're using a set of rules to package up some text written in one language using a construct of another language. So strings in C++ can be any text and guess what, all of SQL is text. So that works.
I think I'll stick to talking about SQL in C++ since it's less likely to spark religious wars. But the concepts apply to any other syntax B packed up in a construct from syntax A.
Problems come up really quickly when statement B (written in syntax B) happens across some disallowed construct from syntax A. Suppose you want to search for text that has a quote (") in it using an SQL statement in your C++ code. "Escape it!" you all shout in unison. Well, almost all of you. That one guy smirking in the back is thinking "C++ sucks. If this were JavaScript I could use single quotes around the outer statement and a quote character would be fine." Thing is, there are only a handful of approaches to strings.
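To make that concrete, here's a small sketch with Python standing in for C++ (the escape for a " inside a "-delimited literal is the same backslash in both). The `notes` table and its rows are invented for the example, and I'm assuming SQLite as the database just because it ships with Python.

```python
import sqlite3

# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (body TEXT)")
conn.executemany("INSERT INTO notes VALUES (?)", [('say "hi"',), ("plain",)])

# The outer syntax delimits with ", so the " inside the SQL must be escaped.
# The SQL literal is delimited with ', which the outer syntax ignores.
escaped = "SELECT body FROM notes WHERE body LIKE '%\"%'"

# Switching the outer delimiter to ' frees up the ", but now the two '
# characters that delimit the SQL literal need escaping instead.
swapped = 'SELECT body FROM notes WHERE body LIKE \'%"%\''

rows = conn.execute(escaped).fetchall()
```

Both spellings produce the same SQL text. Swapping the outer delimiter doesn't remove the problem, it just moves it to the other quote character.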
1. Out-of-band character. The " or ' that delimits the text. Using the oob character inside the string requires escaping.
2. Start/end sequence. Like <![CDATA[]]> (bet that one gets munged by your feed reader). In this case, sometimes there's just no allowance for using the end sequence inside a string. Instead you need to split the text into two strings.
3. String plus length. Length, like in bytes. And omfg we can't count bytes! So I don't know of modern languages that let you set that number yourself, though there are languages that use it internally for string representation.
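The three approaches can be sketched side by side. The helper names below (`quote`, `cdata`, `length_prefixed`, `read_length_prefixed`) are made up for this post, and the `length:bytes` layout is just one assumed way to write a length prefix, not a format any particular language uses.

```python
# A payload that is awkward for options 1 and 2: it contains both a quote
# character and the CDATA end sequence.
payload = 'He said "hi" ]]> bye'

# 1. Out-of-band character: escape the delimiter (and the escape character).
def quote(text):
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'

# 2. Start/end sequence: wherever the end sequence appears, close the section
#    and open a new one so "]]>" is split across two CDATA sections.
def cdata(text):
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

# 3. String plus length: prefix the byte count; the payload is left untouched.
def length_prefixed(text):
    data = text.encode("utf-8")
    return str(len(data)).encode("ascii") + b":" + data

def read_length_prefixed(buf):
    """Read the count up to the first ':' and slice out exactly that many bytes."""
    size, _, rest = buf.partition(b":")
    return rest[: int(size)].decode("utf-8")
```

Options 1 and 2 both have to look inside the payload and rewrite it. Option 3 never inspects the payload at all, and the reader never scans it for a terminator.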
I don't think we've tried option 3 enough. Maybe it's because I'm not afraid of binary, hex editors, and things that aren't text. But if I remember right (and I'm not going to look this up in case I'm wrong - that would spoil my point) this is closer to what useful network stacks do. An IP packet has a header field for its total length, which covers the TCP segment it's delivering. This speeds up operations for any points that speak IP between the sender and the final receiver, since the data bytes don't need to be read one by one looking for oob characters, and it also means there's no ambiguity about whether the payload data is escaped properly. Since the IP layer doesn't read the content, that also reduces the potential for bugs in processing escaped data (and the surface area for exploits).
Say we have statement B in syntax B which contains statement A in syntax A. Instead of treating anything written in syntax A as an arbitrary chunk of text, maybe statement A is a wrapped up foreign syntax package - with the network stack analogy syntax A is IP and syntax B is TCP. That foreign syntax package would be treated as binary and stored as some data at an offset with a given length in bytes. The syntax B parser then passes the foreign syntax package as binary, along with its length, to the syntax A parser. In the earlier example, syntax A is SQL and syntax B is C++. In general we could be talking about a JavaScript VM, a database engine, or whatever little parser figures out the content of the d attribute from that SVG file.
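A minimal sketch of that handoff, assuming a made-up host format where the foreign package is stored as a 4-byte big-endian length followed by the raw bytes. The function names and the `run(` ... `);` wrapper are hypothetical, and `inner_parser` is only a stand-in for a real syntax A parser.

```python
import struct

def pack_host(before, foreign, after):
    """Build a host document: bytes, then a 4-byte length, then the blob, then bytes."""
    return before + struct.pack(">I", len(foreign)) + foreign + after

def extract_foreign(doc, offset):
    """Host parser: read the length at `offset`; return the blob and the next offset."""
    (size,) = struct.unpack_from(">I", doc, offset)
    start = offset + 4
    return doc[start : start + size], start + size

def inner_parser(blob):
    """Stand-in for the syntax A parser: it only ever sees its own bytes."""
    return blob.decode("utf-8").split()

before, foreign, after = b"run(", b'SELECT "x" FROM t', b");"
doc = pack_host(before, foreign, after)

# The host parser skips straight over the blob without reading its contents.
blob, next_offset = extract_foreign(doc, len(before))
tokens = inner_parser(blob)
```

The quote characters inside the SQL never matter to the host parser, because it jumps from the length field to `next_offset` without looking at anything in between.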
The way you described it, I think syntax B is IP and syntax A is TCP. But that's just me side-stepping your point.
Speaking of side-stepping your point, I'll also take this opportunity to say: the SVG DOM exposes the path sub-types as elements with properties, so why do you need to parse it? :)
But I do get your point: the length+blob technique over arcane escaping rules. On the other hand, it brings back scary memories of how Windows put a m_dwSize member as the first member of every object in its API... shudder...