
If XML is better at representing trees, and JSON is a tree, does that mean that XML is better at representing JSON than JSON?


Yes.

Consider a comparison of examples using JSON schema:

  {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "description": "Modified JSON Schema draft v4 that includes the optional '$ref' and 'format'",
    "definitions": {
        "schemaArray": {
            "type": "array",
            "minItems": 1,
            "items": { "$ref": "#" }
        }
    }
  ...


  <schema schema="http://json-schema.org/draft-04/schema#">
    <description>Modified JSON Schema draft v4 that includes the optional '$ref' and 'format'</description>
    <definitions>
        <schemaArray>
            <type>array</type>
            <minItems>1</minItems>
            <items ref="#" />
        </schemaArray>
  ...
$schema and $ref are differentiated by convention in JSON, and require repeated consideration in every parsing scenario to treat them as exceptions relative to actual values within the JSON document. In the XML representation, the distinction is implicit and handled automatically for you by every parser.
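
A minimal sketch of that difference, using Python's standard `json` and `xml.etree.ElementTree` modules. The two fragments are cut down from the examples above, and the rule "keys starting with `$` are metadata" is the convention the application has to apply itself; it is an assumption of this sketch, not something the JSON parser knows about.

```python
import json
import xml.etree.ElementTree as ET

# Trimmed-down fragments of the two examples above.
json_doc = json.loads('{"minItems": 1, "items": {"$ref": "#"}}')
xml_doc = ET.fromstring(
    '<schemaArray><minItems>1</minItems><items ref="#" /></schemaArray>'
)

# JSON: "$ref" is an ordinary key, so the application applies the "$" convention.
items = json_doc["items"]
json_meta = {k: v for k, v in items.items() if k.startswith("$")}
json_data = {k: v for k, v in items.items() if not k.startswith("$")}

# XML: the parser already keeps attributes apart from child elements.
items_el = xml_doc.find("items")
xml_meta = dict(items_el.attrib)
xml_children = [child.tag for child in items_el]

print(json_meta, json_data)
print(xml_meta, xml_children)
```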


>$schema and $ref are differentiated by convention in JSON, and require repeated consideration in every parsing scenario to treat them as exceptions relative to actual values within the JSON document.

This is a good thing. The main issue with XML is how ridiculously overcomplicated and overengineered its design was. This isn't just a matter of usability, it has security implications because it increases the attack surface of anything that uses it (e.g. the XML billion laughs attack).


> This is a good thing. The main issue with XML is how ridiculously overcomplicated and overengineered its design was.

Citing element attributes as an example of such complexity is hardly reasonable. The JSON example above contains the same complexity, just moved to the application/implementation space, instead of the parser (so your app is more complex to handle it).

> e.g. the XML billion laughs attack

Also not a great example. This is a simple DoS attack, of which there are many other examples: zip bombs, YAML bombs, etc. None of these invalidate YAML or zip themselves.

There are also far worse attacks on XML than the billion laughs attack (XXE attacks are an entire category in the OWASP top 10).

What these attacks have in common is that they use references. They aren't unique to XML because any format wanting to handle references (like JSON schema!) will have to account for them.

The difference is, since references aren't built into the JSON spec, you have to implement them yourself (and protect against attacks like this in your own code). Since XML handles this at the spec level, common XML parsers can account for and mitigate this for you (which, fyi, modern ones do).
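
To illustrate what "doing it yourself" can involve, here is a rough sketch of a resolver for local `$ref` pointers with two guards: a nesting limit (catches cycles) and a total node budget (catches billion-laughs-style expansion). The names `lookup`, `resolve`, `MAX_DEPTH` and `MAX_NODES` and the limit values are made up for this example, and it only handles simple `#/a/b` pointers into plain objects, not the full JSON Pointer syntax.

```python
MAX_DEPTH = 32      # arbitrary nesting limit chosen for this sketch
MAX_NODES = 10_000  # arbitrary cap on the total number of expanded nodes


def lookup(root, pointer):
    """Follow a local pointer such as '#/definitions/a' from the document root."""
    node = root
    for part in pointer.lstrip("#").split("/"):
        if part:
            node = node[part]
    return node


def resolve(node, root, depth=0, budget=None):
    """Return a copy of node with local {'$ref': ...} objects expanded."""
    if budget is None:
        budget = [MAX_NODES]  # mutable counter shared across the recursion
    budget[0] -= 1
    if depth > MAX_DEPTH:
        raise ValueError("references nested too deeply (possible cycle)")
    if budget[0] < 0:
        raise ValueError("expanded document too large (possible expansion bomb)")
    if isinstance(node, dict):
        if "$ref" in node:
            target = lookup(root, node["$ref"])
            return resolve(target, root, depth + 1, budget)
        return {k: resolve(v, root, depth + 1, budget) for k, v in node.items()}
    if isinstance(node, list):
        return [resolve(v, root, depth + 1, budget) for v in node]
    return node


doc = {"definitions": {"a": {"type": "array"}}, "items": {"$ref": "#/definitions/a"}}
print(resolve(doc, doc)["items"])
```

A self-referencing document such as `{"a": {"$ref": "#/a"}}` makes `resolve` raise `ValueError` instead of recursing forever.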

---

Side note (and a vote in favour of JSON):

By your own metric, JSON is actually "more complex" (in a good way) than XML in one area: value types. XML values are strings. Having value types is one massive advantage of JSON imo.

In that sense, it would be nice to see a language that combines both of these "complexities", to form something better.


I would disagree about having more value types. JSON has more syntax for value types, but it only has four scalar ones: string, number, boolean, and null. XML has less syntax for value types (everything is serialized as a string), but it has a way to define the types of attributes with XML Schema, and there you already get many more primitive types (decimal, date & time, binary). The XML approach is more uniform: it uses the schema to look up all the parsing rules. JSON uses syntax for some types and (ad hoc) schema-like logic for others.


>Citing element attributes as an example of such complexity is hardly reasonable. The JSON example above contains the same complexity, just moved to the application/implementation space

Where it belongs...

>so your app is more complex to handle it

There is a trade off between more simple markup and more complex code and vice versa. I would argue it is almost always better that way around because it helps enforce clear separation of concerns. I'm similarly allergic to putting complex logic in template code because that also violates a clear separation of concerns.

>Also not a great example. This is a simple DOS attack, of which there are many other examples: zip bombs, yaml bombs, etc.

This is precisely why it's a great example. YAML has the same problem with overcomplexity XML does. JSON has no such issues.

>There are also far worse attacks on XML than the billion laughs attack (XXE attacks are an entire category in the OWASP top 10).

Right, billion laughs is part of a class of security vulnerabilities that are enabled by XML's bloated design.

>The difference is, since references aren't built into the JSON spec, you have to do it yourself

Except you don't. I'm not sure if I've ever implemented references in any JSON schema I've ever used. It's an entirely pointless feature as far as I'm concerned. I've seen them used in YAML (where it's equally yucky) and every time it's been used it's been as a band aid over deficient schema design that also inadvertently made the markup harder to understand.

>By your own metric, JSON is actually "more complex" (in a good way) than XML in one area: value types. XML values are strings. Having value types is one massive advantage of JSON imo.

I hardly think having 4/5 scalar types counts as spec overcomplication. If you want a measure of how complex each markup language is simply look at the length of the respective specifications. XML is ridiculous.


This is reminiscent of the (whimsical) RFC 3252: Binary Lexical Octet Ad-hoc Transport

https://www.ietf.org/rfc/rfc3252.txt


> JSONx is an IBM® standard format to represent JSON as XML.

https://www.ibm.com/support/knowledgecenter/SS9H2Y_7.5.0/com...


Representing is the easy part. You also need to parse that representation, and here XML is worse than JSON for typical object graphs in a program. You can't express a number or an array in XML itself. XML Schema helps to add more structure, but JSON does not need that: arrays and numbers are built in. I would also say that XML Schema is too powerful, and that power is usually not needed by developers. JSON is a fine medium, simple enough and powerful enough.

But when you deal with inherently hierarchical data, trees, XML is good.


> You can't express number or array with XML

Numbers can be expressed, they just have to be parsed on the client. For array, child elements are ordered and form an array naturally.


> Numbers can be expressed, they just have to be parsed on the client.

But the client has to know that they're numbers. So you can't just parse XML into a JavaScript object without any knowledge of its structure.

> For array, child elements are ordered and form an array naturally.

But you can't express a zero-length array this way without explicit knowledge of the structure. And you can't distinguish a one-element array from an ordinary value without explicit knowledge of the structure.
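
A small sketch of that ambiguity, using a deliberately naive schema-less converter. `naive_convert` and the `<order>`/`<item>` documents are invented for this example; real XML-to-object libraries make their own choices here.

```python
import xml.etree.ElementTree as ET


def naive_convert(element):
    """Schema-less conversion: leaves become strings, repeated tags become lists."""
    children = list(element)
    if not children:
        return element.text or ""
    result = {}
    for child in children:
        value = naive_convert(child)
        if child.tag in result:
            existing = result[child.tag]
            if isinstance(existing, list):
                existing.append(value)
            else:
                result[child.tag] = [existing, value]
        else:
            result[child.tag] = value
    return result


two = naive_convert(ET.fromstring("<order><item>a</item><item>b</item></order>"))
one = naive_convert(ET.fromstring("<order><item>a</item></order>"))
zero = naive_convert(ET.fromstring("<order></order>"))

print(two)   # a list of items
print(one)   # a single string, not a one-element list
print(zero)  # an empty string, not an empty list
```

Without outside knowledge that `item` is repeatable, the converter has no way to produce `["a"]` or `[]` for the last two documents.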


> without explicit knowledge about structure

That's why XML Schema exists. So, everything you listed is possible in XML if you don't put aside some specifications on purpose.

And JSON is so self-describing that someone came with... JSON Schema.


Yes. And that's my point: you don't need schemas to work with JSON. JSON Schema exists, of course, it would be strange if someone did not invent it, as it's an obvious idea. But I've yet to see anyone using it.


> But client has to know that they're numbers. So you can't just parse XML into JavaScript object without any knowledge about its structure.

<Number value="123" type="i32" />
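
A rough sketch of how a consumer might read such an element with a generic XML parser. The `TYPES` table, the `u8` entry and the `parse_number` helper are hypothetical; only the `value` and `type` attributes come from the element above.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from the 'type' attribute to (converter, min, max).
TYPES = {
    "i32": (int, -2**31, 2**31 - 1),
    "u8": (int, 0, 2**8 - 1),
}


def parse_number(xml_text):
    """Parse a <Number value=... type=... /> element and range-check the value."""
    element = ET.fromstring(xml_text)
    convert, lowest, highest = TYPES[element.get("type")]
    value = convert(element.get("value"))
    if not lowest <= value <= highest:
        raise ValueError(f"{value} does not fit in {element.get('type')}")
    return value


print(parse_number('<Number value="123" type="i32" />'))
```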


This is a tag named Number with two attributes. Are you trying to invent an XML sublanguage? It's possible. But with JSON it's already invented, and there are multiple libraries for every programming language. What should I use to parse your format?


> But with JSON it's already invented

It's invented and optimized for certain language types, specifically those that are loose with their numeric types. This includes most scripting languages.

In languages that require specific intrinsic types to be defined for a number, there are usually a lot to choose from and using the wrong one can be a real problem when converting from XML or JSON to a native format.

On a very simple level, what container do we use to encode the following data structures:

  [100,4,10,156]

  <array>
    <num>100</num>
    <num>4</num>
    <num>10</num>
    <num>156</num>
  </array>
In most dynamic languages, the number type used is just the built-in numeric container, which generally involves some sort of complex decision between floating point and bignums. For something like C, Java or Rust, we would generally want to choose an appropriate type. In this case, it looks like an unsigned 8-bit integer will suffice for most, and a 16-bit signed value for Java (which doesn't support unsigned values).

But what if the next number is much larger? Should we really need to parse all the values to determine the correct data type to use? That seems very inefficient, and we can't even be sure that we'll encounter values that accurately illustrate the range of values in one parsing. What if the next message or file we parse has large values?
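
As a sketch of why inferring the type from the data is fragile: the helper below picks the narrowest fixed-width type that fits every value it has seen. `INT_TYPES` and `smallest_type` are made up for this example, and the type names simply borrow Rust-style spelling.

```python
import json

# Candidate fixed-width integer types, narrowest first (illustrative selection).
INT_TYPES = [
    ("u8", 0, 2**8 - 1),
    ("i16", -2**15, 2**15 - 1),
    ("i32", -2**31, 2**31 - 1),
    ("i64", -2**63, 2**63 - 1),
]


def smallest_type(values):
    """Return the name of the narrowest listed type that holds every value."""
    for name, lowest, highest in INT_TYPES:
        if all(lowest <= v <= highest for v in values):
            return name
    return "bigint"


first_message = json.loads("[100,4,10,156]")
second_message = json.loads("[100,4,10,156,70000]")

print(smallest_type(first_message))
print(smallest_type(second_message))
```

The answer changes as soon as a later message carries a larger value, so the choice has to come from outside the data (a schema or a documented contract) rather than from any single sample.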

For these languages the inherent data type definition of JSON is a poor match, since its looseness does not transfer easily to a language which does not inherently support it. If your target language supports dynamically resizing untyped arrays, untyped key-value maps, and generic number types that handle both very big and floating-point numbers automatically, then JSON is an almost perfect representation format for you. If your target language works best when those items are broken into smaller, more explicit components, there's a lot of extra work in parsing JSON, and I can see how that makes XML not look much worse in comparison (especially since your parsed data structure will likely be leaner because there isn't the overhead inherent in those convenient magical types; references are rarely as efficient as pointers).


Odd that you would pick numbers. Those are not exactly bug free in JSON. And dates are as likely to actually be a problem.


> And dates are as likely to actually be a problem.

I'm currently working against a REST API whose JSON format has three different date formats as sub-elements of the same root object...


As opposed to json: `{"number": 12345678901234567890}`

What does that look like when you parse it?


Mapping from string to number. Probably BigInteger for Java, whatever other standard integer for other languages.
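
For what it's worth, a quick check in Python, whose standard `json` module maps JSON integers to arbitrary-precision `int`. Passing `parse_int=float` is used here only to imitate a parser that stores every number as a 64-bit double; that stand-in is an assumption of this sketch, not a claim about any particular library.

```python
import json

text = '{"number": 12345678901234567890}'

# Default behaviour: JSON integers become arbitrary-precision Python ints.
exact = json.loads(text)["number"]

# Imitate a parser that stores every number as a 64-bit double.
as_double = json.loads(text, parse_int=float)["number"]

print(exact)
print(exact > 2**63 - 1)        # too large for a signed 64-bit integer
print(int(as_double) == exact)  # False: the double cannot hold this value exactly
```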



