Plain Text Data Formats - CSV, JSON, XML, and HTML
Database Design
- Overview
- Plain text
- Fixed-width text
- Comma-Separated Values (CSV)
- JavaScript Object Notation (JSON)
- eXtensible Markup Language (XML)
- HyperText Markup Language (HTML)
- Conclusions
Overview
Intro
There are boundless possibilities when it comes to representing structured data as plain text.
In reality, there are just a few common formats that meet most needs.
-
Comma-Separated Values (CSV)
-
Javascript Object Notation (JSON)
-
eXtensible Markup Lanauge (XML)
Each of these formats attempts to make data both human-readable and machine-readable.
We will take a quick look at each.
Plain text
Meaning
What do we mean by “plain text”?
-
As you’ve probably heard, all computer data, including all text, is ultimately stored as numbers in the computer’s processor, memory, and storage.
-
Plain text, for our purposes, means that computer data which contains only numeric codes in the ASCII or Unicode encoding schemes and nothing else.
-
ASCII and Unicode offer standardized widely-supported encoding schemes that indicate which number represents which character.
-
For example, the ASCII standard specifies that an
a
character is represented by the number 97,b
is 98,c
is 99, and so on. -
So any computer data that contains only a set of these codes, and no other numbers not in these encoding schemes, is “plain text” data.
Creating
How do we create plain text data?
-
A plain text editor is an application intended to be used to create plain text files.
-
Windows’ Notepad and Apple’s TextEdit are not good plain text editors - both can and will store non-plain-text data into the files they make.
-
Microsoft Word, Apple Pages, and Google Docs are not good plain text editors.
-
Almost all computer programming editors and integrated development environments are good plain text editors.
Saving
How to save a file with just plain text in it.
-
When saving any file, always include a file extension in the filename. This helps the operating system understand what kind of data is in the file and what application it should launch to open it.
-
A file with unstructured plain text is often saved with the
.txt
extension. - A file with structured plain text is saved with an extension that matches its structure, e.g.:
.csv
for files containing comma-separated-values data.json
for files containing Javascript Object Notation data.xml
for files containing eXtensible Markup Language data- … and so on
- A filename such as
diabetes_rates_by_state.csv
is a good filename, whereasdiabetes_rates_by_state
is not.
Opening
How do you open a plain text file?
-
When you double click on a plain text file and it opens in Windows’ Notepad or Mac’s TextEdit, quit those applications immediately and fix your process!
-
Use a good plain text editor instead.
-
Open your good plain text editor and from there open the file using the
File
->Open...
menu. -
It’s also possible to change which application is opened by default when you double-click on a file with a certain extension… look it up.
One more note about file extensions
Annoyingly perhaps, both Windows and Mac OSX hide file extensions by default.
-
It is urgent that you change this setting on your own computer so that file extensions are always visible.
-
Research how to do this for your operating system.
Fixed-width text
Basic concept
Back in the days when paper ruled, it was useful to print data in nicely-aligned tables that were easy to follow on paper.
- Thus was born the fixed-width formatting of text, where data in each column consumes a consistent amount of horizontal space. E.g.:
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec J-D D-N DJF MAM JJA SON Year
1880 -29 -18 -11 -20 -12 -23 -21 -9 -16 -23 -20 -23 -19 *** **** -14 -18 -20 1880
1881 -16 -17 4 4 2 -20 -7 -3 -14 -21 -22 -11 -10 -11 -18 3 -10 -19 1881
1882 14 15 3 -19 -16 -26 -21 -6 -10 -25 -16 -25 -11 -10 6 -10 -17 -17 1882
-
This example comes from NASA’s records of land and ocean surface temperatures, which are [perhaps shockingly] still published in fixed-width text format.
-
This data is very print-friendly, but machine-unfriendly - it looks good on paper, but is unnecessarily difficult for a computer program to parse.
Comma-Separated Values (CSV)
Basic concept
When text is formatted as comma-separated values, a series of values is separated by commas!
-
Line breaks separate series of related values.
-
For example, three lines could be used to represent three different works of nonsense literature:
Carroll,Lewis,Jabberwocky,1871
Lear,Edward,The Jumblies,1910
Bishop,Elizabeth,The Man-Moth,1946
- Optionally, the first line of a CSV file can contain headings:
Last name,First name,Title,Year
Carroll,Lewis,Jabberwocky,1871
Lear,Edward,The Jumblies,1910
Bishop,Elizabeth,The Man-Moth,1946
Consistency
CSV files are used with a fixed schema - a consistent set of fields that are present in each line of the text.
- In our example, the text before the first comma is always the author’s last name, the following text before the next comma is the author’s first name, and so on.
Carroll,Lewis,Jabberwocky,1871
Lear,Edward,The Jumblies,1910
Bishop,Elizabeth,The Man-Moth,1946
- Inconsistently-ordered fields would be unusable by a machine and possibly by a human as well - a CSV would never look like this:
Carroll,Lewis,Jabberwocky,1871
1910,Edward,Lear,The Jumblies
The Man-Moth,1946,Bishop,Elizabeth
Variation
A variation of the CSV format is the Tab-Separated Values (TSV) format.
- An example of three lines of tab-separated values used to represent three different works of nonsense literature (the spaces between values are tabs in this case):
Carroll Lewis Jabberwocky 1871
Lear Edward The Jumblies 1910
Bishop Elizabeth The Man-Moth 1946
- While commas are the most common separators for this kind of format, tabs, colons
:
, semi-colons;
, and pipes|
are not uncommon.
Javascript Object Notation (JSON)
A slightly more flexible format
Javascript Object Notation (JSON) is a competitor format to CSV for representing structured data, with or without a fixed schema, in text.
- For example, the following might be a JSON representation of one work of nonsense literature:
{
"Last name": "Carroll",
"First name": "Lewis",
"Title": "Jaberwocky",
"Year": 1871
}
- Unlike in CSV format, line breaks, tabs, and spaces are not meaningful in JSON format and are used solely for human readability.
// prettier-ignore
{ "Last name": "Carroll", "First name": "Lewis", "Title": "Jaberwocky", "Year": 1871 }
Multiple data series
JSON format is valid Javascript programming code. So it may not be a surprise that JSON supports arrays/lists of data series.
- For example, three works of nonsense literature might be represented:
[
{
"Last name": "Carroll",
"First name": "Lewis",
"Title": "Jaberwocky",
"Year": 1871
},
{
"Last name": "Lear",
"First name": "Edward",
"Title": "The Jumblies",
"Year": 1910
},
{
"Last name": "Bishop",
"First name": "Elizabeth",
"Title": "The Man-Moth",
"Year": 1946
}
]
-
The square brackets indicate an array.
[ ... ]
-
Commas separate the elements of the array.
[ ... , ... , ... ]
-
The extra line breaks, tabs, and spaces are for human-readability only.
Nesting
JSON can support a hierarchical order of data, also known as nesting of one object within another.
- For example, one could nest the First name and Last name fields within an Author field.
{
"Author": {
"Last name": "Carroll",
"First name": "Lewis"
},
"Title": "Jaberwocky",
"Year": 1871
}
- This would not be possible with CSV format.
Flexibility to be schema-less
While JSON, like CSV, is always used for structured data, and all series of data often have the same fields, a fixed schema is not a requirement of the format.
- For example, some data about three people, with no consistent set of fields:
[
{ "Name": "Bob", "Age": 10 },
{ "Last name": "Lear", "First name": "Edward", "Occupation": "naturalist" },
{ "Zodiac": "Libra", "First name": "Juliette", "Favorite animal": "Koala" }
]
eXtensible Markup Language (XML)
Structured data
Like CSV and JSON, XML can be used to represent structured data.
<?xml version="1.0" encoding="UTF-8"?>
<nonsense_works>
<work>
<last_name>Carroll</last_name>
<first_name>Lewis</first_name>
<title>Jabberwocky</title>
<year>1871</year>
</work>
<work>
<last_name>Lear</last_name>
<first_name>Edward</first_name>
<title>The Jumblies</title>
<year>1910</year>
</work>
<work>
<last_name>Bishop</last_name>
<first_name>Elizabeth</first_name>
<title>The Man-Moth</title>
<year>1946</year>
</work>
</nonsense_works>
Tags and markup
XML tags (code words betweeen <
and >
signs) are used to annotate the data and explain its meaning.
- For example, the title of each work in the preceeding example is clearly surrounded by
title
tags:
<work>
<last_name>Bishop</last_name>
<first_name>Elizabeth</first_name>
<title>The Man-Moth</title>
<year>1946</year>
</work>
- This way of annotating a document with tags is called markup.
Nesting
As with JSON, XML supports nesting of data, such as placing the last_name
and first_name
fields within an author
field.
<work>
<author>
<last_name>Carroll</last_name>
<first_name>Lewis</first_name>
</author>
<title>Jabberwocky</title>
<year>1871</year>
</work>
Flexibility to be schema-less
Like, JSON, XML also allows for representing schema-less data with inconsistent fields, although this is not common:
<?xml version="1.0" encoding="UTF-8"?>
<people>
<person>
<name>Bob</name>
<age>10</age>
</person>
<person>
<last_name>Lear</last_name>
<first_name>Edward</first_name>
<occupation>naturalist</occupation>
</person>
<person>
<zodiac>Libra</zodiac>
<first_name>Juliette</first_name>
<favorite_animal>Koala</favorite_animal>
</person>
</people>
Similarity to other markup languages
There is a wide variety of subsets of XML for specific purposes.
- Hypertext Markup Language (HTML) - the language used for marking up almost all web page content (see example)
– - The current version, HTML 5
is not a direct subset of XML, but retains many of the same features as XML.
– - the previous version, XHTML
(eXtensible HyperText Markup Language) was a direct subset of XML and followed all XML rules.
-
Real Simple Syndication (RSS) - used for marking up episodic blog and podcast content (see example)
-
Scalable Vector Graphics (SVG) - a text-based format for representing vector graphics (see examples)
HyperText Markup Language (HTML)
Web publishing
Unlike CSV, JSON, and XML, Hypertext Markup Language (HTML) is not a general-purpose data format.
-
HTML is used almost exclusively to define and structure content published to web pages.
-
HTML today is not a sub-set of XML, but shares much in common with XML.
-
Given that web pages are so important today and hold much of the world’s data, HTML, despite its obvious flaws, has become an important plain text data format.
Example
An example of a simple HTML document.
<doctype html>
<html lang="en">
<head>
<title>Famous Works of Nonsense Literature</title>
</head>
<body>
<section>
<h1>Famous Works of Nonsense Literature</h1>
<article>
<h2>Jabberwocky</h2>
<p>by Lewis Carroll<br />1871</p>
</article>
<article>
<h2>The Jumblies</h2>
<p>by Edward Lear<br />1910</p>
</article>
<article>
<h2>The Man-Moth</h2>
<p>by Elizabeth Bishop<br />1946</p>
</article>
</section>
</body>
</html></doctype
>
Challenges
Whereas CSV, JSON, and XML are most often used for packaging data with little consideration to aesthetics, the same cannot be said of HTML.
-
HTML is primarily used for publishing web content in a visually-pleasing manner (to visually-able humans).
-
HTML does not require a consistent repeated data structure, i.e. can be schema-less, and often does not have one in practice.
-
There are drastic differences in the way HTML is written from one web page to the next, making it difficult to parse.
Conclusions
Comparisons
A few points of comparisons among the various plain text data formats:
-
CSV, JSON, and XML all support structured data
-
JSON (and in theory XML, but not much in practice) supports schema-less data
-
HTML is its own unique creature that is not going away anytime soon. We must live with its many flaws, and perhaps even learn to love them.
Thank you. Bye.