knowledge-kitchen

Plain Text Data Formats - CSV, JSON, XML, and HTML

Database Design

  1. Overview
  2. Plain text
  3. Fixed-width text
  4. Comma-Separated Values (CSV)
  5. JavaScript Object Notation (JSON)
  6. eXtensible Markup Language (XML)
  7. HyperText Markup Language (HTML)
  8. Conclusions

Overview

Intro

There are boundless possibilities when it comes to representing structured data as plain text.

In reality, there are just a few common formats that meet most needs.

Each of these formats attempts to make data both human-readable and machine-readable.

We will take a quick look at each.

Plain text

Meaning

What do we mean by “plain text”?

Creating

How do we create plain text data?

Saving

How to save a file with just plain text in it.

Opening

How do you open a plain text file?

One more note about file extensions

Perhaps annoyingly, both Windows and macOS hide file extensions by default.

Fixed-width text

Basic concept

Back in the days when paper ruled, it was useful to print data in nicely-aligned tables that were easy to follow on paper.

Year   Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec    J-D D-N    DJF  MAM  JJA  SON  Year
1880   -29  -18  -11  -20  -12  -23  -21   -9  -16  -23  -20  -23    -19 ***   ****  -14  -18  -20  1880
1881   -16  -17    4    4    2  -20   -7   -3  -14  -21  -22  -11    -10 -11    -18    3  -10  -19  1881
1882    14   15    3  -19  -16  -26  -21   -6  -10  -25  -16  -25    -11 -10      6  -10  -17  -17  1882
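Because every column sits at a known character position, fixed-width text can be read by slicing each line. A minimal Python sketch, assuming column boundaries read off the first four columns of the sample above (the `COLUMNS` table and the `parse_fixed_width` helper are hypothetical names used only for illustration):

```python
# Column boundaries (start, end) are assumptions read off the sample above.
COLUMNS = {"Year": (0, 4), "Jan": (4, 10), "Feb": (10, 15), "Mar": (15, 20)}

def parse_fixed_width(line):
    """Slice one fixed-width line into a dict of integer fields."""
    return {name: int(line[start:end]) for name, (start, end) in COLUMNS.items()}

rows = [
    "1880   -29  -18  -11  -20  -12  -23",
    "1881   -16  -17    4    4    2  -20",
]
records = [parse_fixed_width(row) for row in rows]
print(records[0])
```

The parser depends entirely on the column positions, so any line that drifts out of alignment would be misread.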

Comma-Separated Values (CSV)

Basic concept

When text is formatted as comma-separated values, each line holds one record, and the values within that record are separated by commas!

Carroll,Lewis,Jabberwocky,1871
Lear,Edward,The Jumblies,1910
Bishop,Elizabeth,The Man-Moth,1946

Last name,First name,Title,Year
Carroll,Lewis,Jabberwocky,1871
Lear,Edward,The Jumblies,1910
Bishop,Elizabeth,The Man-Moth,1946
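A header line makes CSV easy to load by field name. A minimal sketch using Python's standard `csv` module, with the sample data held in a string rather than a file (the variable names are illustrative):

```python
import csv
import io

text = """Last name,First name,Title,Year
Carroll,Lewis,Jabberwocky,1871
Lear,Edward,The Jumblies,1910
Bishop,Elizabeth,The Man-Moth,1946
"""

# DictReader treats the first line as field names for every following row.
reader = csv.DictReader(io.StringIO(text))
poems = list(reader)

for poem in poems:
    print(poem["Title"], poem["Year"])
```

Note that every value comes back as a string, including the year; CSV itself carries no type information.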

Consistency

CSV files are used with a fixed schema - a consistent set of fields that are present in each line of the text.

Carroll,Lewis,Jabberwocky,1871
Lear,Edward,The Jumblies,1910
Bishop,Elizabeth,The Man-Moth,1946

Carroll,Lewis,Jabberwocky,1871
1910,Edward,Lear,The Jumblies
The Man-Moth,1946,Bishop,Elizabeth
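Nothing in the CSV format itself enforces a consistent field order, so a program has to check for it. One possible check, sketched in Python under the assumption that the fourth field should always be a year (the `misordered_rows` helper is hypothetical):

```python
import csv
import io

text = """Carroll,Lewis,Jabberwocky,1871
1910,Edward,Lear,The Jumblies
The Man-Moth,1946,Bishop,Elizabeth
"""

def misordered_rows(csv_text, year_index=3):
    """Return 1-based line numbers whose year field is not all digits."""
    bad = []
    for line_number, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if len(row) <= year_index or not row[year_index].isdigit():
            bad.append(line_number)
    return bad

bad_lines = misordered_rows(text)
print(bad_lines)
```

A check like this catches only the mistakes it was written for; swapping the first and last name, for example, would go unnoticed.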

Variation

A variation of the CSV format is the Tab-Separated Values (TSV) format.

Carroll    Lewis   Jabberwocky 1871
Lear   Edward  The Jumblies    1910
Bishop Elizabeth   The Man-Moth  1946
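The same tools that read CSV can usually read TSV by changing the delimiter. A short sketch with Python's `csv` module, writing the tab characters explicitly as `\t` so they are visible:

```python
import csv
import io

# Tabs are written as \t here so the separators are visible.
text = "Carroll\tLewis\tJabberwocky\t1871\nLear\tEdward\tThe Jumblies\t1910\n"

rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
print(rows[1])
```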

JavaScript Object Notation (JSON)

A slightly more flexible format

JavaScript Object Notation (JSON) is an alternative to CSV for representing structured data in text, with or without a fixed schema.

{
  "Last name": "Carroll",
  "First name": "Lewis",
  "Title": "Jabberwocky",
  "Year": 1871
}

{ "Last name": "Carroll", "First name": "Lewis", "Title": "Jabberwocky", "Year": 1871 }

Multiple data series

JSON is valid JavaScript syntax, so it may be no surprise that JSON supports arrays, or lists, of data series.

[
  {
    "Last name": "Carroll",
    "First name": "Lewis",
    "Title": "Jabberwocky",
    "Year": 1871
  },
  {
    "Last name": "Lear",
    "First name": "Edward",
    "Title": "The Jumblies",
    "Year": 1910
  },
  {
    "Last name": "Bishop",
    "First name": "Elizabeth",
    "Title": "The Man-Moth",
    "Year": 1946
  }
]
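Most programming languages can turn JSON text into native data structures. A minimal Python sketch using the standard `json` module, with a shortened copy of the array above held in a string:

```python
import json

text = """[
  {"Last name": "Carroll", "First name": "Lewis", "Title": "Jabberwocky", "Year": 1871},
  {"Last name": "Lear", "First name": "Edward", "Title": "The Jumblies", "Year": 1910}
]"""

# A JSON array becomes a Python list; each JSON object becomes a dict.
works = json.loads(text)
titles = [work["Title"] for work in works]
print(titles)
```

Unlike CSV, JSON distinguishes numbers from strings, so `Year` is loaded as an integer.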

Nesting

JSON supports hierarchical data, also known as nesting one object within another.

{
  "Author": {
    "Last name": "Carroll",
    "First name": "Lewis"
  },
  "Title": "Jabberwocky",
  "Year": 1871
}
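Nested objects are reached by following one key after another. A brief Python sketch, again assuming the data is held in a string:

```python
import json

text = """{
  "Author": {"Last name": "Carroll", "First name": "Lewis"},
  "Title": "Jabberwocky",
  "Year": 1871
}"""

work = json.loads(text)

# The nested "Author" object is itself a dict inside the outer dict.
author = work["Author"]
full_name = f'{author["First name"]} {author["Last name"]}'
print(full_name)
```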

Flexibility to be schema-less

Although JSON, like CSV, is used for structured data, and the series in a data set often share the same fields, the format does not require a fixed schema.

[
  { "Name": "Bob", "Age": 10 },
  { "Last name": "Lear", "First name": "Edward", "Occupation": "naturalist" },
  { "Zodiac": "Libra", "First name": "Juliette", "Favorite animal": "Koala" }
]
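Code that reads schema-less data cannot assume a field is present. One way to cope, sketched in Python, is to use `dict.get`, which returns `None` for a missing key instead of raising an error:

```python
import json

text = """[
  { "Name": "Bob", "Age": 10 },
  { "Last name": "Lear", "First name": "Edward", "Occupation": "naturalist" },
  { "Zodiac": "Libra", "First name": "Juliette", "Favorite animal": "Koala" }
]"""

people = json.loads(text)

# Collect every field name that appears in at least one object.
all_fields = sorted({key for person in people for key in person})

# .get() returns None when an object lacks the field.
first_names = [person.get("First name") for person in people]

print(all_fields)
print(first_names)
```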

eXtensible Markup Language (XML)

Structured data

Like CSV and JSON, XML can be used to represent structured data.

<?xml version="1.0" encoding="UTF-8"?>
<nonsense_works>
    <work>
        <last_name>Carroll</last_name>
        <first_name>Lewis</first_name>
        <title>Jabberwocky</title>
        <year>1871</year>
    </work>
    <work>
        <last_name>Lear</last_name>
        <first_name>Edward</first_name>
        <title>The Jumblies</title>
        <year>1910</year>
    </work>
    <work>
        <last_name>Bishop</last_name>
        <first_name>Elizabeth</first_name>
        <title>The Man-Moth</title>
        <year>1946</year>
    </work>
</nonsense_works>
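XML can be loaded into a tree of elements. A minimal Python sketch using the standard `xml.etree.ElementTree` module, with a shortened copy of the document above (the XML declaration is omitted here for brevity):

```python
import xml.etree.ElementTree as ET

text = """<nonsense_works>
    <work>
        <last_name>Carroll</last_name>
        <first_name>Lewis</first_name>
        <title>Jabberwocky</title>
        <year>1871</year>
    </work>
    <work>
        <last_name>Lear</last_name>
        <first_name>Edward</first_name>
        <title>The Jumblies</title>
        <year>1910</year>
    </work>
</nonsense_works>"""

root = ET.fromstring(text)

# Turn each <work> element into a dict keyed by its child tag names.
works = [{child.tag: child.text for child in work} for work in root.findall("work")]
print(works[0]["title"])
```

As with CSV, element text is read as a string, so the year would need converting before any arithmetic.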

Tags and markup

XML tags (code words between < and > signs) are used to annotate the data and explain its meaning.

<work>
        <last_name>Bishop</last_name>
        <first_name>Elizabeth</first_name>
        <title>The Man-Moth</title>
        <year>1946</year>
</work>

Nesting

As with JSON, XML supports nesting of data, such as placing the last_name and first_name fields within an author field.

    <work>
        <author>
            <last_name>Carroll</last_name>
            <first_name>Lewis</first_name>
        </author>
        <title>Jabberwocky</title>
        <year>1871</year>
    </work>
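Nested XML elements can be addressed with a path of tag names. A short sketch using `findtext` from Python's `xml.etree.ElementTree`:

```python
import xml.etree.ElementTree as ET

text = """<work>
    <author>
        <last_name>Carroll</last_name>
        <first_name>Lewis</first_name>
    </author>
    <title>Jabberwocky</title>
    <year>1871</year>
</work>"""

work = ET.fromstring(text)

# "author/last_name" walks from <work> into <author>, then <last_name>.
last_name = work.findtext("author/last_name")
title = work.findtext("title")
print(last_name, title)
```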

Flexibility to be schema-less

Like JSON, XML also allows for representing schema-less data with inconsistent fields, although this is not common:

<?xml version="1.0" encoding="UTF-8"?>
<people>
    <person>
        <name>Bob</name>
        <age>10</age>
    </person>
    <person>
        <last_name>Lear</last_name>
        <first_name>Edward</first_name>
        <occupation>naturalist</occupation>
    </person>
    <person>
        <zodiac>Libra</zodiac>
        <first_name>Juliette</first_name>
        <favorite_animal>Koala</favorite_animal>
    </person>
</people>

Similarity to other markup languages

A wide variety of markup languages for specific purposes are based on XML.

- The current version of HTML, HTML5, is not a direct subset of XML, but retains many of the same features as XML.
- The previous version, XHTML (eXtensible HyperText Markup Language), was a direct subset of XML and followed all XML rules.

HyperText Markup Language (HTML)

Web publishing

Unlike CSV, JSON, and XML, HyperText Markup Language (HTML) is not a general-purpose data format.

Example

An example of a simple HTML document.

<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Famous Works of Nonsense Literature</title>
  </head>
  <body>
    <section>
      <h1>Famous Works of Nonsense Literature</h1>
      <article>
        <h2>Jabberwocky</h2>
        <p>by Lewis Carroll<br />1871</p>
      </article>
      <article>
        <h2>The Jumblies</h2>
        <p>by Edward Lear<br />1910</p>
      </article>
      <article>
        <h2>The Man-Moth</h2>
        <p>by Elizabeth Bishop<br />1946</p>
      </article>
    </section>
  </body>
</html>

Challenges

Whereas CSV, JSON, and XML are most often used for packaging data with little consideration to aesthetics, the same cannot be said of HTML.
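Because HTML tags describe how a page is laid out rather than what the data means, pulling data out of HTML relies on guessing from the presentation. A rough Python sketch using the standard `html.parser` module, assuming that every `<h2>` in the page holds a work's title (the `HeadingCollector` class is a hypothetical helper):

```python
from html.parser import HTMLParser

class HeadingCollector(HTMLParser):
    """Collect the text of every <h2> element."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        # Keep only text that appears between <h2> and </h2>.
        if self.in_h2:
            self.titles.append(data.strip())

page = """<!DOCTYPE html>
<html lang="en">
  <body>
    <article><h2>Jabberwocky</h2><p>by Lewis Carroll<br />1871</p></article>
    <article><h2>The Jumblies</h2><p>by Edward Lear<br />1910</p></article>
  </body>
</html>"""

collector = HeadingCollector()
collector.feed(page)
print(collector.titles)
```

If the page's designer later switched the titles to a different tag, this code would silently find nothing, which is one reason HTML makes a fragile data format.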

Conclusions

Comparisons

A few points of comparison among the various plain text data formats:

Thank you. Bye.