Pros and Cons of storing data in Binary vs Text-based formats (XML)

The eXtensible Markup Language (XML) is a markup language that uses textual tags to describe the meaning of its content. It is a text-based format, so it is both human-readable and machine-readable.

XML was developed as a way to store data and pass it between systems. Today, XML is commonly used in web applications to pass data over the network, with the markup that describes the data embedded in the file itself. HTML is a closely related markup language: both HTML and XML descend from SGML, and XHTML is the variant of HTML that is expressed in XML.

<html>
  <head>
    <title>XML vs Binary Data</title>
  </head>
  <body>
    <div class="content">
      <p>
        What are the advantages and disadvantages of storing data in XML and binary format? The eXtensible Markup Language (XML) is a markup language that uses textual tags to describe the meaning of its content. It is a text-based format and as such it is both human and machine-readable.
      </p>
    </div>
  </body>
</html>

## What are the advantages of XML over binary data?

Representing numbers and strings in binary can be ambiguous because different systems encode these data types differently, which makes binary data difficult to interpret unless you know exactly how it was written. For example, try saving an int and then a float to disk in binary format using a C program, then read the file back as a float followed by an int! The read will typically succeed, but the values will be meaningless.
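
The same experiment can be sketched without writing a C program. The snippet below is only an illustration, assuming 4-byte little-endian values and using Python's `struct` module in place of C's `fwrite`/`fread`; the values `1` and `2.5` are arbitrary.

```python
import struct

# Write an int followed by a float (4 bytes each, little-endian).
payload = struct.pack("<if", 1, 2.5)

# Read back in the same order as written: the values survive.
as_written = struct.unpack("<if", payload)

# Read back as a float followed by an int: no error is raised,
# the same bytes are simply reinterpreted.
swapped = struct.unpack("<fi", payload)

print(as_written)  # (1, 2.5)
print(swapped)     # a tiny denormal float, then 1075838976
```

Nothing in the bytes themselves says which interpretation is the right one; that knowledge has to live outside the file.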

Binary suffers from a number of such representation issues: endian byte order for integers^1^, the floating-point format (commonly IEEE 754), and the differently sized booleans and string formats used across platforms and programming languages, to name a few. Since XML is a text-based format, numbers and strings are written out as characters, which removes most of this ambiguity as long as the character encoding is known. What goes in as a printable character must come out as a printable character, so some kinds of corruption are easy to spot: unexpected unprintable characters imply corruption.
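
As a rough illustration of the byte order problem, the sketch below packs the same integer in both byte orders with Python's `struct` module and then deliberately reads one with the wrong order. The value `1000` is just a sample.

```python
import struct

value = 1000  # 0x000003E8

little = struct.pack("<I", value)  # least significant byte first
big = struct.pack(">I", value)     # most significant byte first

# Reading big-endian bytes as if they were little-endian
# silently produces a different number.
misread = struct.unpack("<I", big)[0]

# The text form has no byte order to get wrong.
as_text = str(value)

print(little.hex(), big.hex(), misread, as_text)
# e8030000 000003e8 3892510720 1000
```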

XML is extensible: you can define new tags for purposes nobody had in mind when XML was designed. Adding new tags to a document usually does not break existing code, as long as that code ignores the elements it does not know about.

<object name="XML sample">
  <age>21</age>
</object>
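
To make the extensibility claim concrete, here is a small sketch using Python's standard `xml.etree.ElementTree`. The `read_age` function and the extra `<email>` element are hypothetical, invented for this example; the point is only that a reader that looks up `<age>` by name keeps working when an unrelated tag is added.

```python
import xml.etree.ElementTree as ET

def read_age(xml_text: str) -> int:
    """Reader written against the original format: it only looks for <age>."""
    root = ET.fromstring(xml_text)
    return int(root.findtext("age"))

original = '<object name="XML sample"><age>21</age></object>'

# A later version of the document adds a hypothetical <email> element.
extended = (
    '<object name="XML sample">'
    "<age>21</age>"
    "<email>sample@example.com</email>"
    "</object>"
)

print(read_age(original), read_age(extended))  # 21 21
```

A binary reader that expects a fixed record layout would generally need to be updated when a new field is inserted.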

Similarly, XML is flexible in that slight formatting variations, such as extra whitespace between tags or a different quoting style for attributes, usually don’t matter.
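
A quick way to see this is to parse two differently formatted copies of the same document and compare what comes out. This is a minimal sketch with `xml.etree.ElementTree`; the helper name `summarize` is made up for the example.

```python
import xml.etree.ElementTree as ET

compact = '<object name="XML sample"><age>21</age></object>'

# Same data with line breaks, indentation, single quotes
# and spaces around the equals sign.
spread_out = """
<object
    name = 'XML sample' >
  <age>21</age>
</object>
"""

def summarize(xml_text):
    root = ET.fromstring(xml_text.strip())
    return root.get("name"), int(root.findtext("age"))

print(summarize(compact) == summarize(spread_out))  # True
```

Whitespace inside an element's text (for example `<age> 21 </age>`) is still part of the data, so the reader may need to strip it.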

XML is self-describing and text-based.

## Binary advantages over XML

There are some great reasons to use binary data. For one, binary data tends to be a lot smaller than text-based data. Transferring large amounts of data in binary form over a network is therefore much more efficient, as it avoids the overhead introduced by XML tags and character encoding. XML expresses all data as alphanumeric and symbol characters. Embedding raw binary data, such as an image file, means encoding its bytes as printable characters (for example with Base64 or hexadecimal), which must be converted back into binary data when the file is read. This inflates the data, by about a third with Base64 and by a factor of two with hexadecimal, and makes XML poorly suited to storing and transmitting binary data.
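
The cost of turning raw bytes into printable characters is easy to measure. The sketch below assumes Base64 and hexadecimal as the two text encodings (both are common choices for embedding binary data in XML) and uses 768 arbitrary bytes as a stand-in for an image file.

```python
import base64
import binascii

raw = bytes(range(256)) * 3  # 768 bytes of arbitrary binary data

b64 = base64.b64encode(raw)    # 4 output characters per 3 input bytes
hexed = binascii.hexlify(raw)  # 2 output characters per input byte

print(len(raw), len(b64), len(hexed))  # 768 1024 1536

# The reader has to undo the encoding before it can use the data.
restored = base64.b64decode(b64)
print(restored == raw)  # True
```

Base64 adds roughly 33% and hexadecimal doubles the size, before counting any XML tags around the encoded text.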

Storing the number 1000 as a standard 32-bit integer in binary format takes 32 bits, exactly the same number of bits as storing the string “1000”, which consists of four characters, each taking up 8 bits (assuming a one-byte-per-character encoding such as ASCII or UTF-8). The number 10000 would still only take 32 bits in memory^1^; however, the string representation of the number, “10000”, would take 5 × 8 = 40 bits in an XML file, excluding any tags and metadata.
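
The arithmetic above can be checked directly. This sketch assumes a signed 32-bit integer for the binary form and ASCII for the text form; the third sample value is added only to show the gap widening.

```python
import struct

sizes = []
for n in (1000, 10000, 1_000_000_000):
    binary = struct.pack("<i", n)   # fixed-width 32-bit integer
    text = str(n).encode("ascii")   # one byte per decimal digit
    sizes.append((n, len(binary) * 8, len(text) * 8))

for n, binary_bits, text_bits in sizes:
    print(n, binary_bits, text_bits)
# 1000 32 32
# 10000 32 40
# 1000000000 32 80
```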

Another advantage is the speed of reading binary data. When dealing directly with binary information there is no need to convert between text and numbers, as there is when reading XML files.

Binary data is simpler than XML. Parsing XML is complicated because there are multiple ways to structure the same data.

<person age="20">
  <address>
    <street>Sample Street</street>
    <suburb>Sampleton</suburb>
  </address>
</person>

OR

<person>
  <age>20</age>
  <address street="Sample Street" suburb="Sampleton" />
</person>

XML’s flexibility is a burden when it comes to parsing XML data.
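
To give a feel for that burden, here is a sketch of a reader that accepts both layouts shown above. It uses `xml.etree.ElementTree`; the names `read_person` and `field` are hypothetical, and a real format would more likely fix one layout with a schema than accept both.

```python
import xml.etree.ElementTree as ET

def field(node, name):
    # Accept the value either as an attribute or as a child element.
    value = node.get(name)
    if value is None:
        value = node.findtext(name)
    return value

def read_person(xml_text):
    root = ET.fromstring(xml_text)
    address = root.find("address")
    return {
        "age": int(field(root, "age")),
        "street": field(address, "street"),
        "suburb": field(address, "suburb"),
    }

variant_a = """<person age="20">
  <address>
    <street>Sample Street</street>
    <suburb>Sampleton</suburb>
  </address>
</person>"""

variant_b = """<person>
  <age>20</age>
  <address street="Sample Street" suburb="Sampleton" />
</person>"""

print(read_person(variant_a) == read_person(variant_b))  # True
```

Every field needs the "attribute or element?" check, which is extra work a fixed binary layout does not require.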

  1. In fact, storing any unsigned number less than 2^32^ takes only 32 bits in memory, since 2^32^ − 1 is the largest number we can store in an unsigned 32-bit integer. It is not the number of decimal digits that determines how much memory a number uses, but rather the number of bits required to represent that number in binary. (I learned this in first year of uni, though I probably should have known before that.)