Plain Text Data Formats - CSV, JSON, XML, and HTML

Database Design

Overview
Plain text
Fixed-width text
Comma-Separated Values (CSV)
JavaScript Object Notation (JSON)
eXtensible Markup Language (XML)
HyperText Markup Language (HTML)
Conclusions

Overview

Intro

There are boundless possibilities when it comes to representing structured data as plain text.

In reality, there are just a few common formats that meet most needs.

Comma-Separated Values (CSV)
Javascript Object Notation (JSON)
eXtensible Markup Lanauge (XML)

Each of these formats attempts to make data both human-readable and machine-readable.

We will take a quick look at each.

Plain text

Meaning

What do we mean by “plain text”?

As you’ve probably heard, all computer data, including all text, is ultimately stored as numbers in the computer’s processor, memory, and storage.
Plain text, for our purposes, means that computer data which contains only numeric codes in the ASCII or Unicode encoding schemes and nothing else.
ASCII and Unicode offer standardized widely-supported encoding schemes that indicate which number represents which character.
For example, the ASCII standard specifies that an a character is represented by the number 97, b is 98, c is 99, and so on.
So any computer data that contains only a set of these codes, and no other numbers not in these encoding schemes, is “plain text” data.

Creating

How do we create plain text data?

A plain text editor is an application intended to be used to create plain text files.
Windows’ Notepad and Apple’s TextEdit are not good plain text editors - both can and will store non-plain-text data into the files they make.
Microsoft Word, Apple Pages, and Google Docs are not good plain text editors.
Almost all computer programming editors and integrated development environments are good plain text editors.

Saving

How to save a file with just plain text in it.

When saving any file, always include a file extension in the filename. This helps the operating system understand what kind of data is in the file and what application it should launch to open it.
A file with unstructured plain text is often saved with the .txt extension.
A file with structured plain text is saved with an extension that matches its structure, e.g.:
- .csv for files containing comma-separated-values data
- .json for files containing Javascript Object Notation data
- .xml for files containing eXtensible Markup Language data
- … and so on
A filename such as diabetes_rates_by_state.csv is a good filename, whereas diabetes_rates_by_state is not.

Opening

How do you open a plain text file?

When you double click on a plain text file and it opens in Windows’ Notepad or Mac’s TextEdit, quit those applications immediately and fix your process!
Use a good plain text editor instead.
Open your good plain text editor and from there open the file using the File->Open... menu.
It’s also possible to change which application is opened by default when you double-click on a file with a certain extension… look it up.

One more note about file extensions

Annoyingly perhaps, both Windows and Mac OSX hide file extensions by default.

It is urgent that you change this setting on your own computer so that file extensions are always visible.
Research how to do this for your operating system.

Fixed-width text

Basic concept

Back in the days when paper ruled, it was useful to print data in nicely-aligned tables that were easy to follow on paper.

Thus was born the fixed-width formatting of text, where data in each column consumes a consistent amount of horizontal space. E.g.:

Year   Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec    J-D D-N    DJF  MAM  JJA  SON  Year
 -29  -18  -11  -20  -12  -23  -21   -9  -16  -23  -20  -23    -19 ***   ****  -14  -18  -20  1880
 -16  -17    4    4    2  -20   -7   -3  -14  -21  -22  -11    -10 -11    -18    3  -10  -19  1881
  14   15    3  -19  -16  -26  -21   -6  -10  -25  -16  -25    -11 -10      6  -10  -17  -17  1882

This example comes from NASA’s records of land and ocean surface temperatures, which are [perhaps shockingly] still published in fixed-width text format.
This data is very print-friendly, but machine-unfriendly - it looks good on paper, but is unnecessarily difficult for a computer program to parse.

Comma-Separated Values (CSV)

Basic concept

When text is formatted as comma-separated values, a series of values is separated by commas!

Line breaks separate series of related values.
For example, three lines could be used to represent three different works of nonsense literature:

Carroll,Lewis,Jabberwocky,1871
Lear,Edward,The Jumblies,1910
Bishop,Elizabeth,The Man-Moth,1946

Optionally, the first line of a CSV file can contain headings:

Last name,First name,Title,Year
Carroll,Lewis,Jabberwocky,1871
Lear,Edward,The Jumblies,1910
Bishop,Elizabeth,The Man-Moth,1946

Consistency

CSV files are used with a fixed schema - a consistent set of fields that are present in each line of the text.

In our example, the text before the first comma is always the author’s last name, the following text before the next comma is the author’s first name, and so on.

Carroll,Lewis,Jabberwocky,1871
Lear,Edward,The Jumblies,1910
Bishop,Elizabeth,The Man-Moth,1946

Inconsistently-ordered fields would be unusable by a machine and possibly by a human as well - a CSV would never look like this:

Carroll,Lewis,Jabberwocky,1871
1910,Edward,Lear,The Jumblies
The Man-Moth,1946,Bishop,Elizabeth

Variation

A variation of the CSV format is the Tab-Separated Values (TSV) format.

An example of three lines of tab-separated values used to represent three different works of nonsense literature (the spaces between values are tabs in this case):

Carroll    Lewis   Jabberwocky 1871
Lear   Edward  The Jumblies    1910
Bishop Elizabeth   The Man-Moth  1946

While commas are the most common separators for this kind of format, tabs, colons :, semi-colons ;, and pipes | are not uncommon.

Javascript Object Notation (JSON)

A slightly more flexible format

Javascript Object Notation (JSON) is a competitor format to CSV for representing structured data, with or without a fixed schema, in text.

For example, the following might be a JSON representation of one work of nonsense literature:

{
  "Last name": "Carroll",
  "First name": "Lewis",
  "Title": "Jaberwocky",
  "Year": 1871
}

Unlike in CSV format, line breaks, tabs, and spaces are not meaningful in JSON format and are used solely for human readability.

// prettier-ignore
{ "Last name": "Carroll", "First name": "Lewis", "Title": "Jaberwocky", "Year": 1871 }

Multiple data series

JSON format is valid Javascript programming code. So it may not be a surprise that JSON supports arrays/lists of data series.

For example, three works of nonsense literature might be represented:

[
  {
    "Last name": "Carroll",
    "First name": "Lewis",
    "Title": "Jaberwocky",
    "Year": 1871
  },
  {
    "Last name": "Lear",
    "First name": "Edward",
    "Title": "The Jumblies",
    "Year": 1910
  },
  {
    "Last name": "Bishop",
    "First name": "Elizabeth",
    "Title": "The Man-Moth",
    "Year": 1946
  }
]

The square brackets indicate an array. [ ... ]
Commas separate the elements of the array. [ ... , ... , ... ]
The extra line breaks, tabs, and spaces are for human-readability only.

Nesting

JSON can support a hierarchical order of data, also known as nesting of one object within another.

For example, one could nest the First name and Last name fields within an Author field.

{
  "Author": {
    "Last name": "Carroll",
    "First name": "Lewis"
  },
  "Title": "Jaberwocky",
  "Year": 1871
}

This would not be possible with CSV format.

Flexibility to be schema-less

While JSON, like CSV, is always used for structured data, and all series of data often have the same fields, a fixed schema is not a requirement of the format.

For example, some data about three people, with no consistent set of fields:

[
  { "Name": "Bob", "Age": 10 },
  { "Last name": "Lear", "First name": "Edward", "Occupation": "naturalist" },
  { "Zodiac": "Libra", "First name": "Juliette", "Favorite animal": "Koala" }
]

eXtensible Markup Language (XML)

Structured data

Like CSV and JSON, XML can be used to represent structured data.

<?xml version="1.0" encoding="UTF-8"?>
<nonsense_works>
    <work>
        <last_name>Carroll</last_name>
        <first_name>Lewis</first_name>
        <title>Jabberwocky</title>
        <year>1871</year>
    </work>
    <work>
        <last_name>Lear</last_name>
        <first_name>Edward</first_name>
        <title>The Jumblies</title>
        <year>1910</year>
    </work>
    <work>
        <last_name>Bishop</last_name>
        <first_name>Elizabeth</first_name>
        <title>The Man-Moth</title>
        <year>1946</year>
    </work>
</nonsense_works>

Tags and markup

XML tags (code words betweeen < and > signs) are used to annotate the data and explain its meaning.

For example, the title of each work in the preceeding example is clearly surrounded by title tags:

<work>
        <last_name>Bishop</last_name>
        <first_name>Elizabeth</first_name>
        <title>The Man-Moth</title>
        <year>1946</year>
</work>

This way of annotating a document with tags is called markup.

Nesting

As with JSON, XML supports nesting of data, such as placing the last_name and first_name fields within an author field.

    <work>
        <author>
            <last_name>Carroll</last_name>
            <first_name>Lewis</first_name>
        </author>
        <title>Jabberwocky</title>
        <year>1871</year>
    </work>

Flexibility to be schema-less

Like, JSON, XML also allows for representing schema-less data with inconsistent fields, although this is not common:

<?xml version="1.0" encoding="UTF-8"?>
<people>
    <person>
        <name>Bob</name>
        <age>10</age>
    </person>
    <person>
        <last_name>Lear</last_name>
        <first_name>Edward</first_name>
        <occupation>naturalist</occupation>
    </person>
    <person>
        <zodiac>Libra</zodiac>
        <first_name>Juliette</first_name>
        <favorite_animal>Koala</favorite_animal>
    </person>
</people>

Similarity to other markup languages

There is a wide variety of subsets of XML for specific purposes.

Hypertext Markup Language (HTML) - the language used for marking up almost all web page content (see example)

– - The current version, HTML 5 is not a direct subset of XML, but retains many of the same features as XML. – - the previous version, XHTML (eXtensible HyperText Markup Language) was a direct subset of XML and followed all XML rules.

Real Simple Syndication (RSS) - used for marking up episodic blog and podcast content (see example)
Scalable Vector Graphics (SVG) - a text-based format for representing vector graphics (see examples)

HyperText Markup Language (HTML)

Web publishing

Unlike CSV, JSON, and XML, Hypertext Markup Language (HTML) is not a general-purpose data format.

HTML is used almost exclusively to define and structure content published to web pages.
HTML today is not a sub-set of XML, but shares much in common with XML.
Given that web pages are so important today and hold much of the world’s data, HTML, despite its obvious flaws, has become an important plain text data format.

Example

An example of a simple HTML document.

<doctype html>
  <html lang="en">
    <head>
      <title>Famous Works of Nonsense Literature</title>
    </head>
    <body>
      <section>
        <h1>Famous Works of Nonsense Literature</h1>
        <article>
          <h2>Jabberwocky</h2>
          <p>by Lewis Carroll<br />1871</p>
        </article>
        <article>
          <h2>The Jumblies</h2>
          <p>by Edward Lear<br />1910</p>
        </article>
        <article>
          <h2>The Man-Moth</h2>
          <p>by Elizabeth Bishop<br />1946</p>
        </article>
      </section>
    </body>
  </html></doctype
>

Challenges

Whereas CSV, JSON, and XML are most often used for packaging data with little consideration to aesthetics, the same cannot be said of HTML.

HTML is primarily used for publishing web content in a visually-pleasing manner (to visually-able humans).
HTML does not require a consistent repeated data structure, i.e. can be schema-less, and often does not have one in practice.
There are drastic differences in the way HTML is written from one web page to the next, making it difficult to parse.

Conclusions

Comparisons

A few points of comparisons among the various plain text data formats:

CSV, JSON, and XML all support structured data
JSON (and in theory XML, but not much in practice) supports schema-less data
HTML is its own unique creature that is not going away anytime soon. We must live with its many flaws, and perhaps even learn to love them.

Thank you. Bye.