JSON, or JavaScript Object Notation, has become one of the most popular formats for transmitting structured data between web services and applications. Its simplicity and native compatibility with JavaScript have led to widespread adoption for APIs, configuration files, data storage, and much more.

At the heart of working with JSON is the JSON parser – the software component that reads in JSON-formatted text and converts it into a data structure the program can easily use, such as an object, dictionary, array, etc. In this article, we'll take a deep dive into how JSON parsers work under the hood.

JSON Parser Overview

At a high level, a JSON parser takes in a string of JSON-formatted text as input, analyzes its contents, and outputs an equivalent representation using the data structures of the programming language.

For example, consider the following JSON:


{
"name": "John Smith",
"age": 35,
"city": "New York"
}

A JSON parser would read in this text and output something like the following JavaScript object:


let person = {
name: "John Smith",
age: 35,
city: "New York"
};

Or the following Python dictionary:


person = {
"name": "John Smith",
"age": 35,
"city": "New York"
}

The exact representation depends on the programming language, but the key point is the parser converts the JSON text into a native data structure the language can work with.

Lexical Analysis and Tokenization

The first step in parsing JSON is lexical analysis, which breaks down the input text into a series of tokens. Each token is a meaningful unit of the JSON structure, such as an opening brace, string value, colon, comma, closing brace, etc.

The JSON lexical grammar defines the valid tokens. For example:

  • { represents an opening brace token
  • } represents a closing brace token
  • [ represents an opening bracket token
  • ] represents a closing bracket token
  • , represents a comma token
  • : represents a colon token
  • true, false, null represent boolean/null tokens
  • String tokens are surrounded by double quotes
  • Number tokens are sequences of digits, optionally with a leading minus sign, a fractional part, and an exponent

The lexer scans through the JSON input and emits a series of these tokens which represent the low-level building blocks of the JSON data. Whitespace between tokens is ignored.
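To make this concrete, here is a minimal, illustrative lexer sketch in Python for a simplified subset of JSON (no string escapes or exponents – a real lexer handles both). The token names are invented for this example:

```python
import re

# Hypothetical token grammar for a simplified subset of JSON.
TOKEN_PATTERN = re.compile(r"""
    (?P<lbrace>\{)   | (?P<rbrace>\})   |
    (?P<lbracket>\[) | (?P<rbracket>\]) |
    (?P<comma>,)     | (?P<colon>:)     |
    (?P<string>"[^"]*")                 |
    (?P<number>-?\d+(?:\.\d+)?)         |
    (?P<keyword>true|false|null)        |
    (?P<ws>\s+)
""", re.VERBOSE)

def tokenize(text):
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_PATTERN.match(text, pos)
        if not match:
            raise SyntaxError(f"Unexpected character at position {pos}")
        if match.lastgroup != "ws":      # whitespace between tokens is skipped
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

print(tokenize('{"age": 35}'))
# [('lbrace', '{'), ('string', '"age"'), ('colon', ':'), ('number', '35'), ('rbrace', '}')]
```

Note how the whitespace between `:` and `35` produces no token – the lexer consumes it and moves on, exactly as described above.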

Parsing the Token Stream

Once the JSON input has been transformed into a stream of tokens, the parser begins its work. The parser is responsible for analyzing the token stream and constructing a parse tree that represents the structure of the JSON data.

The parsing is guided by the JSON language grammar which defines the valid syntax and structure of JSON data. The grammar is usually specified in a formal notation such as ABNF, or with railroad syntax diagrams like those on json.org.

The parser uses techniques like recursive descent parsing or shift-reduce parsing to process the token stream. It maintains a stack data structure to keep track of nested JSON structures like objects and arrays.

As the parser consumes tokens, it applies the grammatical rules to validate that the JSON is well-formed and matches the expected structure. If the parser encounters a token that doesn't fit the grammar, it will raise a syntax error.
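Here is an illustrative recursive descent sketch that consumes a pre-tokenized stream like the one a lexer would emit (the token kinds are invented for this example, and arrays are omitted for brevity). Like many real-world parsers, it builds Python values directly rather than an explicit intermediate tree:

```python
# Sketch of recursive descent over a (kind, text) token stream.
# Each parse_* function returns (value, next_position).
def parse(tokens):
    value, pos = parse_value(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError("Trailing tokens after JSON value")
    return value

def parse_value(tokens, pos):
    kind, text = tokens[pos]
    if kind == "lbrace":
        return parse_object(tokens, pos)
    if kind == "string":
        return text[1:-1], pos + 1           # strip the surrounding quotes
    if kind == "number":
        return (float(text) if "." in text else int(text)), pos + 1
    if kind == "keyword":
        return {"true": True, "false": False, "null": None}[text], pos + 1
    raise SyntaxError(f"Unexpected token {kind!r} at position {pos}")

def parse_object(tokens, pos):
    obj = {}
    pos += 1                                  # consume '{'
    while tokens[pos][0] != "rbrace":
        if obj:                               # members after the first need a comma
            if tokens[pos][0] != "comma":
                raise SyntaxError("Expected ','")
            pos += 1
        key, pos = parse_value(tokens, pos)
        if tokens[pos][0] != "colon":
            raise SyntaxError("Expected ':'")
        value, pos = parse_value(tokens, pos + 1)
        obj[key] = value
    return obj, pos + 1                       # consume '}'

tokens = [("lbrace", "{"), ("string", '"age"'), ("colon", ":"),
          ("number", "35"), ("rbrace", "}")]
print(parse(tokens))   # {'age': 35}
```

A token that violates the grammar – a missing colon, a stray comma – surfaces as a SyntaxError, which is exactly the validation step described above. The nesting of function calls plays the role of the stack: each nested object is a deeper recursive call.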

Constructing the Parse Tree

As the parser makes its way through the token stream, it constructs a parse tree representing the JSON structure. The parse tree is an intermediate representation that closely matches the nesting and layout of the original JSON text.

For example, consider this JSON snippet:


{
"name": "John Smith",
"age": 35,
"city": "New York",
"employed": true
}

The parse tree would look something like:

  • object
    • string: "name"
    • string: "John Smith"
    • string: "age"
    • number: 35
    • string: "city"
    • string: "New York"
    • string: "employed"
    • true

Each level of nesting in the JSON becomes a level of the tree. Objects are represented as nodes with key/value pairs as children. Arrays are represented as nodes with index/value pairs as children. The leaf nodes are the actual data values of the JSON.
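One simple way to picture this in code is to model tree nodes as tagged pairs – a representation invented purely for illustration, matching the bullet diagram above:

```python
# Hypothetical parse tree for the snippet above: each node is a
# ("type", payload) pair, and leaves hold the raw token text.
tree = ("object", [
    (("string", "name"),     ("string", "John Smith")),
    (("string", "age"),      ("number", "35")),
    (("string", "city"),     ("string", "New York")),
    (("string", "employed"), ("true", None)),
])

# Walking the tree recovers the key/value structure of the original JSON:
for key_node, value_node in tree[1]:
    print(key_node[1], "->", value_node[0])
# name -> string
# age -> number
# city -> string
# employed -> true
```

Note that the leaf for age still holds the raw text "35" – converting it to an actual integer is the job of the next stage.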

Converting to Language Objects

The final step is to convert the parse tree into the data structures of the host programming language. This step is sometimes called semantic analysis or building the abstract syntax tree.

The parser traverses the parse tree and maps each JSON construct to an equivalent language construct according to a defined set of conventions.

For example, in Python the mapping might be:

  • JSON objects become Python dictionaries
  • JSON arrays become Python lists
  • JSON strings become Python strings
  • JSON numbers become Python int/float
  • JSON true/false become Python True/False
  • JSON null becomes Python None

So the earlier example parse tree would become the Python dictionary:


{
"name": "John Smith",
"age": 35,
"city": "New York",
"employed": True
}

The details of the language mapping vary by implementation, but the core concept is the same – systematically converting the JSON parse tree into a native data structure the program can use.
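Python's built-in json module applies exactly the mapping listed above, which is easy to verify by inspecting the types of a parsed value:

```python
import json

data = json.loads(
    '{"name": "John Smith", "age": 35, "employed": true,'
    ' "nickname": null, "scores": [90, 85.5]}'
)

# Each JSON type maps to the corresponding Python type:
print(type(data))                # <class 'dict'>   (JSON object)
print(type(data["name"]))        # <class 'str'>    (JSON string)
print(type(data["age"]))         # <class 'int'>    (integral JSON number)
print(type(data["employed"]))    # <class 'bool'>   (JSON true/false)
print(data["nickname"])          # None             (JSON null)
print(type(data["scores"]))      # <class 'list'>   (JSON array)
print(type(data["scores"][1]))   # <class 'float'>  (JSON number with a fraction)
```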

Handling JSON Data Types

One of the key responsibilities of a JSON parser is to handle the conversion of the JSON-formatted string values into the appropriate data types of the language.

The JSON specification defines six data types:

  • string: a sequence of zero or more Unicode characters, wrapped in double quotes
  • number: an integer or floating point number
  • object: an unordered collection of name/value pairs, wrapped in curly braces {}
  • array: an ordered list of values, wrapped in square brackets []
  • boolean: either of the values true or false
  • null: an empty value, represented by the keyword null

During the parsing process, the lexical analyzer identifies the data type of each JSON value based on its formatting, and the parser/semantic analyzer then maps that to the corresponding language type.

For example, consider this JSON snippet:


{
"name": "John Smith",
"age": "35",
"city": "New York",
"employed": "true"
}

Because age and employed are wrapped in double quotes, they are string values as far as JSON is concerned, and a conforming parser will keep them as strings in the host language – it never guesses a "real" type from a string's contents. So in Python, the parsed output would be:


{
"name": "John Smith",
"age": "35",
"city": "New York",
"employed": "true"
}

If the application needs age as an integer or employed as a boolean, it must perform those conversions itself after parsing. The parser's responsibility is to map each JSON type faithfully to the host language, and to raise an error if the JSON contains a malformed value that can't be parsed, such as an unterminated string or an invalid number literal.
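A quick check with Python's json module confirms this behavior, and shows the explicit conversions an application would perform itself:

```python
import json

# Quoted values stay strings; the JSON formatting alone determines the type.
data = json.loads('{"age": "35", "employed": "true"}')
print(type(data["age"]))        # <class 'str'>
print(type(data["employed"]))   # <class 'str'>

# Any further conversion is application logic, not the parser's job:
age = int(data["age"])
employed = data["employed"] == "true"
print(age, employed)            # 35 True
```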

JSON Parsing Performance

For small amounts of data, the performance of JSON parsing is usually not a major concern. However, when dealing with large JSON payloads or high volumes of data, parse time can become a performance bottleneck for an application.

Most JSON parsers are optimized for fast performance in the average case. Common optimizations include:

  • Using a state machine instead of a full recursive descent parser to reduce function call overhead
  • Tokenizing with a single pass through the input using minimal backtracking
  • Avoiding unnecessary memory allocations and copying of data
  • Using efficient hash tables for storing object properties
  • Lazy parsing of values, only converting what is accessed
  • Parallelizing parsing of large arrays

Some JSON parsers also have special support for "streaming" JSON data, where the JSON is parsed incrementally as it is received over a network connection or read from disk. This can allow processing to start before the full JSON is available.
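Python's standard json module is not a streaming parser, but its JSONDecoder.raw_decode method – which decodes one value and reports where it stopped – is a common building block for incremental processing. As a sketch (the whitespace-separated multi-document input is an assumption for the example):

```python
import json

# Sketch: pull successive JSON documents out of a buffer, as might
# accumulate from a network connection or a file read in chunks.
decoder = json.JSONDecoder()

def iter_values(buffer):
    pos = 0
    while pos < len(buffer):
        # Skip whitespace between documents.
        while pos < len(buffer) and buffer[pos].isspace():
            pos += 1
        if pos == len(buffer):
            break
        # raw_decode returns the value and the index just past it.
        value, pos = decoder.raw_decode(buffer, pos)
        yield value

stream = '{"id": 1} {"id": 2}\n{"id": 3}'
print(list(iter_values(stream)))   # [{'id': 1}, {'id': 2}, {'id': 3}]
```

Dedicated streaming libraries build on the same idea, emitting parse events as bytes arrive rather than waiting for the full payload.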

Choosing a performant JSON parser can make a big difference for speed-critical applications. Some popular high-performance JSON parsers include:

  • simdjson: a C++ JSON parser that uses SIMD instructions to parse gigabytes of JSON per second
  • RapidJSON: another C++ parser focused on speed and low memory footprint
  • Jackson: a multi-format Java parser with good JSON performance
  • orjson: a fast Python JSON library written in Rust
  • System.Text.Json: a high-performance JSON parser built into .NET

However, for most applications the default JSON parser included with the language (like JSON.parse in JavaScript or json.loads in Python) offers more than adequate performance.

JSON Parser Demos

To tie it all together, let's look at a few demos of using JSON parsers in different programming languages.

In Python, parsing JSON is as easy as:


import json

json_string = '{"name": "John Smith", "age": 35, "city": "New York"}'
data = json.loads(json_string)
print(data["name"])
# Output: John Smith

In JavaScript:


const jsonString = '{"name": "John Smith", "age": 35, "city": "New York"}';
const data = JSON.parse(jsonString);
console.log(data.name);
// Output: John Smith

Parsing a JSON file in Go:


package main

import (
	"encoding/json"
	"fmt"
	"os"
)

type Person struct {
	Name string `json:"name"`
	Age  int    `json:"age"`
	City string `json:"city"`
}

func main() {
	jsonFile, err := os.Open("person.json")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer jsonFile.Close()

	var person Person
	if err := json.NewDecoder(jsonFile).Decode(&person); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(person.Name)
}

As you can see, JSON parsing is usually quite straightforward in most languages thanks to the popularity of JSON and extensive built-in/standard library support. The heavy lifting is done by the JSON parser under the hood.

Conclusion

JSON parsers play a critical role in working with JSON data in programs. Understanding how they work can help demystify some of the magic that happens when a simple function call converts a string of JSON text into a full-fledged object.

While the low-level details can get complex, the core concepts are straightforward: break the JSON string into tokens, analyze the token stream to validate the JSON and construct a parse tree, and finally map the parse tree into appropriate language-specific objects and data types.

Luckily, most programming languages provide high-quality, fast, and easy-to-use JSON parsers in their standard libraries or popular third-party modules. So for day-to-day work, JSON parsing is usually as simple as calling a function. But it's nice to appreciate all the work the parser does behind the scenes to make that possible.
