May 6, 2018

Writing a simple JSON parser

Writing a JSON parser is one of the easiest ways to get familiar with parsing techniques. The format is extremely simple. It's defined recursively, so you get a slight challenge compared to, say, parsing Brainfuck; and you probably already use JSON. Aside from that last point, parsing S-expressions for Scheme might be an even simpler task.

If you'd just like to see the code for the library, pj, check it out on GitHub.

What parsing is and (typically) is not

Parsing is often broken up into two stages: lexical analysis and syntactic analysis. Lexical analysis breaks source input into the simplest decomposable elements of a language, called "tokens". Syntactic analysis (often itself called "parsing") receives the list of tokens and tries to match groups of them against the grammar of the language being parsed.

Parsing does not determine the semantic viability of an input source. Semantic viability might include whether a variable is defined before being used, whether a function is called with the correct arguments, or whether a variable can be declared a second time in some scope.

There are, of course, always variations in how people choose to parse and apply semantic rules, but I am assuming a "traditional" approach to explain the core concepts.

The JSON library's interface

Ultimately, there should be a from_string method that accepts a JSON-encoded string and returns the equivalent Python dictionary.

For example:
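Something like this (a sketch of the intended usage; the article's exact test isn't preserved here):

```python
assert from_string('{"foo": 1}') == {"foo": 1}
assert from_string('{"foo": [1, 2, {"bar": false}]}') == \
    {'foo': [1, 2, {'bar': False}]}
```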

Lexical analysis

Lexical analysis breaks down an input string into tokens. Comments and whitespace are often discarded during lexical analysis, so you are left with a simpler input that you can search for grammatical matches during syntactic analysis.

Assuming a simple lexical analyzer, you might iterate over all the characters in an input string (or stream) and break them apart into fundamental, non-recursively defined language constructs such as integers, strings, and boolean literals. In particular, strings must be part of the lexical analysis because you cannot throw away whitespace without knowing whether it is part of a string.

A helpful lexer keeps track of the whitespace and comments it has skipped, along with the current line number and file, so that errors produced at any stage of analysis can refer back to the original source. The V8 JavaScript engine recently became able to reproduce the exact source code of a function. This, at the very least, would need the help of a lexer to make possible.

Implementing a JSON lexer

The gist of the JSON lexer will be to iterate over the input source and try to find patterns of strings, numbers, booleans, nulls, or JSON syntax like left brackets and left braces, ultimately returning each of these elements as a list.

Here is what the lexer should return for an example input:
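For instance (a representative input; the article's exact example isn't preserved):

```python
assert lex('{"foo": [1, 2, {"bar": 2}]}') == \
    ['{', 'foo', ':', '[', 1, ',', 2, ',', '{', 'bar', ':', 2, '}', ']', '}']
```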

Here is what this logic might begin to look like:
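A hedged sketch of that loop (the helper functions and the JSON_WHITESPACE/JSON_SYNTAX constants are filled in below):

```python
def lex(string):
    tokens = []

    while len(string):
        json_string, string = lex_string(string)
        if json_string is not None:
            tokens.append(json_string)
            continue

        json_number, string = lex_number(string)
        if json_number is not None:
            tokens.append(json_number)
            continue

        json_bool, string = lex_bool(string)
        if json_bool is not None:
            tokens.append(json_bool)
            continue

        json_null, string = lex_null(string)
        if json_null is not None:
            tokens.append(None)
            continue

        if string[0] in JSON_WHITESPACE:
            string = string[1:]           # throw whitespace away
        elif string[0] in JSON_SYNTAX:
            tokens.append(string[0])      # store JSON syntax as a token
            string = string[1:]
        else:
            raise Exception('Unexpected character: {}'.format(string[0]))

    return tokens
```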

The goal here is to try to match strings, numbers, booleans, and nulls and add them to the list of tokens. If none of these match, check if the character is whitespace and throw it away if so. Otherwise store it as a token if it is part of JSON syntax (like left brackets). Finally throw an exception if the character/string didn't match any of these patterns.

Let's extend the core logic here a little bit to support all the types and add the function stubs.
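Filled out, that might look like this (a sketch; the constant and helper names are mine, and the real pj source may differ):

```python
JSON_QUOTE = '"'
JSON_WHITESPACE = [' ', '\t', '\b', '\n', '\r']
JSON_SYNTAX = [',', ':', '[', ']', '{', '}']


def lex_string(string):
    # Return (lexed string, rest of input), or (None, input) on no match.
    raise NotImplementedError


def lex_number(string):
    # Return (int or float, rest of input), or (None, input) on no match.
    raise NotImplementedError


def lex_bool(string):
    # Return (True or False, rest of input), or (None, input) on no match.
    raise NotImplementedError


def lex_null(string):
    # Return (True, rest of input) if "null" was found, else (None, input).
    raise NotImplementedError
```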

Lexing strings

For the lex_string function, the gist will be to check if the first character is a quote. If it is, iterate over the input string until you find an ending quote. If you don't find an initial quote, return None and the original list. If you find an initial quote and an ending quote, return the string within the quotes and the rest of the unchecked input string.
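A sketch matching that description (JSON_QUOTE is the double-quote constant from above; escape sequences are ignored for simplicity):

```python
def lex_string(string):
    json_string = ''

    if string[0] == JSON_QUOTE:
        string = string[1:]     # drop the opening quote
    else:
        return None, string     # not a string; return the input untouched

    for c in string:
        if c == JSON_QUOTE:
            # Return the lexed string and everything after the closing quote.
            return json_string, string[len(json_string) + 1:]
        json_string += c

    raise Exception('Expected end-of-string quote')
```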

Lexing numbers

For the lex_number function, the gist will be to iterate over the input until you find a character that cannot be part of a number. (This is, of course, a gross simplification, but being more accurate will be left as an exercise to the reader.) After finding a character that cannot be part of a number, either return a float or int if the characters you've accumulated number more than 0. Otherwise return None and the original string input.
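A sketch of that (gross) simplification:

```python
def lex_number(string):
    json_number = ''

    # Simplified: treat any run of these characters as part of a number.
    number_characters = [str(d) for d in range(0, 10)] + ['-', 'e', '.']

    for c in string:
        if c in number_characters:
            json_number += c
        else:
            break

    rest = string[len(json_number):]

    if not len(json_number):
        return None, string     # accumulated zero characters

    if '.' in json_number:
        return float(json_number), rest

    return int(json_number), rest
```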

Lexing booleans and nulls

Finding boolean and null values is a very simple string match.
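For example (prefix matches only; a stricter lexer would also check what follows):

```python
def lex_bool(string):
    if string.startswith('true'):
        return True, string[len('true'):]
    if string.startswith('false'):
        return False, string[len('false'):]
    return None, string


def lex_null(string):
    if string.startswith('null'):
        return True, string[len('null'):]
    return None, string
```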

And now the lexer code is done! See pj/lexer.py for the code as a whole.

Syntactic analysis

The syntax analyzer's (basic) job is to iterate over a one-dimensional list of tokens and match groups of tokens up to pieces of the language according to the definition of the language. If, at any point during syntactic analysis, the parser cannot match the current set of tokens up to a valid grammar of the language, the parser will fail and possibly give you useful information as to what you gave, where, and what it expected from you.

Implementing a JSON parser

The gist of the JSON parser will be to iterate over the tokens received after a call to lex and try to match the tokens to objects, lists, or plain values.

Here is what the parser should return for an example input:
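Continuing the earlier example (in these sketches, parse returns a pair of the parsed value and any remaining tokens):

```python
tokens = lex('{"foo": [1, 2, {"bar": 2}]}')
assert parse(tokens)[0] == {'foo': [1, 2, {'bar': 2}]}
```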

A key structural difference between this lexer and parser is that the lexer returns a one-dimensional list of tokens, while a parser is often defined recursively and returns a recursive, tree-like object. Since JSON is a data serialization format rather than a language, the parser should produce objects in Python instead of a syntax tree on which you could perform more analysis (or code generation, in the case of a compiler).

And, again, the benefit of having the lexical analysis happen independent from the parser is that both pieces of code are simpler and concerned with only specific elements.
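Here is a hedged sketch of the top-level parse function: it looks at the first token and dispatches to parse_array or parse_object (both sketched below), and otherwise treats the token as a plain value:

```python
def parse(tokens):
    t = tokens[0]

    if t == '[':
        return parse_array(tokens[1:])
    elif t == '{':
        return parse_object(tokens[1:])
    else:
        return t, tokens[1:]    # strings, numbers, booleans, None
```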

Parsing arrays

Parsing arrays is a matter of parsing array members and expecting a comma token between them or a right bracket indicating the end of the array.
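Sketched out:

```python
def parse_array(tokens):
    json_array = []

    if tokens[0] == ']':        # empty array
        return json_array, tokens[1:]

    while True:
        json, tokens = parse(tokens)
        json_array.append(json)

        t = tokens[0]
        if t == ']':
            return json_array, tokens[1:]
        elif t != ',':
            raise Exception('Expected comma after object in array')
        tokens = tokens[1:]     # skip the comma
```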

Parsing objects

Parsing objects is a matter of parsing a key-value pair internally separated by a colon and externally separated by a comma until you reach the end of the object.
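And the object case, sketched the same way:

```python
def parse_object(tokens):
    json_object = {}

    if tokens[0] == '}':        # empty object
        return json_object, tokens[1:]

    while True:
        json_key = tokens[0]
        if not isinstance(json_key, str):
            raise Exception('Expected string key, got: {}'.format(json_key))

        if tokens[1] != ':':
            raise Exception('Expected colon after key in object')

        json_value, tokens = parse(tokens[2:])
        json_object[json_key] = json_value

        t = tokens[0]
        if t == '}':
            return json_object, tokens[1:]
        elif t != ',':
            raise Exception('Expected comma after pair in object')
        tokens = tokens[1:]     # skip the comma
```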

And now the parser code is done! See pj/parser.py for the code as a whole.

Unifying the library

To provide the ideal interface, create the from_string function wrapping the lex and parse functions.
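A minimal sketch:

```python
def from_string(string):
    tokens = lex(string)
    return parse(tokens)[0]
```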

And the library is complete (ish)! Check out the project on GitHub for the full implementation, including a basic testing setup.

Appendix A: Single-step parsing

Some parsers choose to implement lexical and syntactic analysis in one stage. For some languages this can simplify the parsing stage entirely. Or, in more powerful languages like Common Lisp, it can allow you to dynamically extend the lexer and parser in one step with reader macros.

I wrote this library in Python to make it more accessible to a larger audience. However, many of the techniques used are more amenable to languages with pattern matching and support for monadic operations, like Standard ML. If you are curious what this same code would look like in Standard ML, check out the JSON code in Ponyo.


As always, please email or tweet me with questions, corrections, or ideas!

Writing a simple JSON Parser from scratch in C++

JSON is one of the most common data interchange formats there is. Thankfully, it is also a pretty simple one. I had written a JSON parser in C++ some time back, and while making it I came across a few decisions that were more involved than I thought they would be. So I have decided to write an article around that JSON parser. While JSON parsing is simple, writing one is definitely useful for building more complex parsers, since the general principles are still present.

I would also like to clarify that this is not in full compliance with the JSON spec. It doesn't handle stuff like escapes in strings, but the basics are there. It is also probably not very efficient, since I wrote most parts with the intention of them being easier to write rather than being the fastest possible implementation. In addition to this, it expects a correct input JSON and does not give syntax errors.

The JSON spec

Before starting anywhere, let us pull up the JSON spec.

(JSON grammar diagram, see json.org)

As you can see, it is quite simple. The simple types are:

  • String
  • Number
  • Boolean ( true or false )
  • Null

The composite types are:

  • Array - A list of JSON values (Note that array itself is a JSON value so you can have an array of arrays)
  • Object - It is a set of key-value pairs where you can retrieve any value using its corresponding key.

Choosing a structure for the JSONNode type

The first decision to make when writing a JSON parser is the structure of the JSONNode type which will be holding valid JSON values. I’ll try to explain some possible approaches and see which one is more appropriate.

What do we want from the JSONNode type? We want it to be as space efficient as possible.

The very naive approach

A naive approach would be:

This would work, but it is grossly inefficient. Each of these types takes some memory, and for each JSONValue you carry the overhead of all the other values.

The size of an empty JSON value would be: 112 bytes.

The naive approach

An approach that is a bit better would be to use pointers, so instead of having to allocate size for each class type, we’ll just have to allocate a constant space for each pointer.

An empty object has size: 32 bytes, a lot better!

Generics? Not.

One of the incorrect paths that you might go down is to think about using generics. At least I’m not aware of a way to use generics in this situation.

You might think to do generics as:

The issue with this is that the compiler generates a copy of the struct for each type with the name mangled.

Something like struct jsonObject_type_float

This means that for each type there is a separate jsonObject type and a separate jsonArray type.

So say you have a jsonArray that mixes types, such as [1, "two"].

This is a perfectly legal JavaScript array. How would you represent this using our jsonArray struct declared above? You can't, since it is parametrized by type T, which can be only one type.

When instantiating jsonArray, you’ll have to tell the compiler what type you’re parametrizing it with like:

There is no way for T to be both int and string. But we want T to be one of all possible JSON value types. This is one of those gotchas for those who misunderstand generics (like me). Generics are a way to save time when writing code so that you do not have to write similar implementations for each type. They're not a way to get the sort of dynamic behavior we're expecting (storing values of multiple types within the same array).

My choice: The Union

Coming back to our JSON parser, the way I went with is:

Here, this is similar to our second naive implementation with one optimization: We’re using a union. It’ll take up space of the largest member. Unlike before, we won’t be wasting space by needlessly storing the other types.

NOTE: This is not the canonical C++ way to do this. For that, you can check out std::variant, which is a type-safe union.

The size of this is 16 bytes.


You might be wondering what JSONObject and JSONList types are.

As you can see, they’re just type declarations over standard types.

To access, I have a bunch of utility functions:

  • auto returnObject()
  • auto returnList()
  • auto returnString()
  • auto returnFloat();

Let us see the implementation of returnObject(); the rest are quite similar:

It doesn't do much, just some bookkeeping to inform the user when they're requesting the wrong type, throwing an error if so.

I also have the corresponding setter functions; let us look at setObject:

The only issue with this that I perceive is that it puts the burden on the end user to issue the correct function call (i.e., they have to know when to call returnObject() instead of returnList()).

You can remove this burden by having some sort of dynamic dispatch, but that brings with it its own hurdles. Although the function would be called on the correct object, what set of functions would be common across all JSON value types? I'm not sure; this is something I have to explore further, but currently I do not think that this would work. In any case, I do not think having an idea about the structure of the input is unreasonable.

Tokenization

Now let us look at how we're going to parse. First we need to tokenize the input. What is tokenization? It is the step after raw text: instead of having a raw stream of characters, you have a stream of more logical units, like strings, numbers, and other symbols.

The tokenizer is fairly simple; it cares only about very localized regions of the input.

We have an enum for denoting all possible token types:

We have a token struct that encapsulates the token and its value. Tokens like ARRAY_CLOSE do not have a value, but the STRING token contains the value as the actual string.

The Tokenizer class has this structure:

It has a helper function getWithoutWhitespace which just gets us the next character after skipping all the whitespace and newlines.

We also have a helper named rollBackToken. The rollBackToken function goes back to the previous token. We store the position of the previous token in prevPos.

The actual body of the getToken function is also fairly simple.

For getToken, the logic is:

For characters like ‘[’, ‘,’, ‘]’, ‘{’, ‘}’, we just map them to the corresponding enum type.

The special case is when we encounter either ‘"’ or a number. In both cases the logical unit is the whole string or the whole number. So in case of string, we keep consuming characters till ‘"’ is reached and for numbers, till we run out of consecutive number characters. We then store the string / number in the “value” field of the token.

NOTE: Here we're mapping to boolean or null as soon as we encounter “f” (false), “t” (true), or “n” (null). Ideally, we should validate that the following characters also match. If it is “faklse”, for example, then it should give an error.

The parser class is structured as:

The root refers to the root of the JSON tree, and current refers to the node currently being parsed.

Parse Function

Now as for the parser, here’s the basic logic:

  • CURLY_OPEN -> We’re parsing a JSON object, call parseObject to parse object
  • ARRAY_OPEN -> We’re parsing a JSON array, call parseArray to parse array
  • STRING -> We’re already given the string value by the tokenizer, we just have to assign it to a JSONNode.
  • NUMBER, BOOLEAN, NULL_TYPE -> Similar case as STRING

Initially, root will be null and we’ll assign it the initial JSONNode.

This is implemented in the parse function:

Now let us look at parseObject, parseList and parseString. The implementations of parseNumber, parseBoolean and parseNull are quite similar to parseString so they aren’t covered explicitly. However, you can check them out in the Github Repo.

In parseList, what we’re basically doing is this:

  • First token is ARRAY_OPEN
  • Second token should be of the JSON value and should indicate whether it is an object, a list, a number, a boolean, or null. Corresponding to that, we call the appropriate parsing function.
  • The next token should either be COMMA or ARRAY_CLOSE. If it is COMMA, we have another JSON value to parse. If it is ARRAY_CLOSE, then we’re done.

NOTE : There is some code duplication of the parsing logic between parse, parseList and parseObject. I’ll later refactor this out.

Parse Object

For parseObject, the logic is:

  • First token is CURLY_OPEN
  • Next token should be a string (the key). The token after that should be ‘:’
  • The next token should be a JSON value. Like in parseList and parse, we delegate the parsing to the appropriate parsing function and then use the JSON node it returns. We then add this JSON node to the map with the above key.
  • Next token is either COMMA or CURLY_CLOSE. If it is COMMA, we have another entry to parse. If it is CURLY_CLOSE, we’re done.

parseString is very simple: it just needs to map the token type to a JSONNode.

That’s it, we have all the basic elements of a JSON parser!

Back to JSON

Since we have a JSON node that represents our JSON document, let us try going the other way. This will also help us verify whether the JSON was parsed correctly.

The JSONNode::toString function takes in an indentation level. It is the amount by which to offset the child contents.

We first make a spaceString, which is a string with the number of space characters controlled by the indentation level.

Then, we check our current node type:

  • STRING -> Output directly
  • NUMBER -> Output directly
  • BOOLEAN -> Output directly
  • NULL -> Output directly
  • Array -> Output ‘[’. Then call toString again for each JSON value, outputting either ‘,’ or ‘]’ after depending on whether it is the last JSON value
  • Object -> Output ‘{’. Then the key, then ‘:’, then append the output of toString for the corresponding jsonValue. Then output ‘}’

NOTE: This builds the entire string in memory, which is inefficient. For a more efficient implementation, we should take in a stream and write the output there.

I ran the parser on a sample JSON and inspected the output after stringifying it back from our internal structure.


Its indentation is a bit janky. I should probably get around to fixing that sometime :).

I discussed the most important bits and pieces, but if you want to know what it looks like as a whole, you can check out the Github Repo. I would love it if you can help improve it!

Write Your Own JSON Parser with Node and Typescript



JSON parsers are everywhere in today's development landscape. For example, even in VS Code, a JSON parser is built-in. You can test this by copying and pasting a valid or invalid JSON into a new file in VS Code. It will immediately pick up the file, parse it, and highlight any errors. Go ahead, try it out!

Now, let's delve into some key concepts we need to understand before diving into the code.

Lexical Analysis a.k.a. Tokenizing

Tokenizing is always the first step when writing your own interpreter or compiler. Even your favorite prettifiers tokenize the entire file before prettifying it. Tokenizing involves breaking the code or input into smaller, understandable parts. This is essential because it allows the parser to determine where to start and stop during the parsing process.

Tokenizing helps us understand the structure of the code or input being processed. By breaking it down into tokens, such as keywords, symbols, and literals, we gain insight into the underlying components. Additionally, tokenizing plays a crucial role in error handling. By identifying and categorizing tokens, we can detect and handle syntax errors more effectively.

In the upcoming sections of this blog post, we will dive deeper into the process of tokenizing. We will provide a step-by-step guide on how to implement tokenizing in your own JSON parser. So, let's get started!

Let's imagine we have a JSON like this:
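The original inline example isn't preserved here, but judging from the tokens listed below, it was something along these lines (every key except "id" is hypothetical):

```json
{
  "id": "647ceaf3657eade56f8224eb",
  "tags": [],
  "active": true,
  "parent": null
}
```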

For example:

  • Opening brace
  • Then, string key "id"
  • Then, a colon
  • Then, string value "647ceaf3657eade56f8224eb"
  • Open bracket and close bracket
  • Boolean value and null value

If we had to come up with a Type for this in TypeScript, it would be something similar to this:

Does it make sense? Now let's manually convert this into an array of tokens before we write the actual tokenizer, to see how it will look in the end. This way, we can better understand the process step by step.

When the tokenizer finishes tokenizing, we want to end up with the array of tokens we manually created earlier.

Let's take the first step and write our initial code. Assuming you have already set up a Node.js and TypeScript development environment for yourself, create an entry file called main.ts within the same directory. Additionally, let's create a file called types.ts and place the token type definitions inside it.

Now, go ahead and create another file called tokenizer.ts .

Alright, let's think about how we can go through the JSON string to get our tokens.

One way to do it is by using a variable to track our current position in the JSON string. We'll keep increasing its value until we cover the entire string length.

By doing this, we'll be able to process each character one by one and identify the corresponding tokens. This step-by-step approach will lay the foundation for our tokenizer, ensuring we extract the necessary information from the JSON data.

Alright, now that we've addressed that, we need another variable to store the tokens, right? Let's go ahead and add that.

In our JSON example:

The first thing to tokenize here is BraceOpen. Let's add this.

So, if char is BraceOpen, we simply push it to the tokens array and increment current. The same goes for BraceClose, BracketOpen, BracketClose, Colon, and Comma, so let's add those as well.

It's time to implement the trickiest part of all: handling String. Instead of copying the entire function repeatedly, I'll just show you the new ones.

The unique aspect of String values is that they are not single characters like Comma or Colon. When we encounter a Quote, our task is to iterate through the JSON string until we find the closing Quote. This function handles that process.

If we haven't reached the ending Quote yet, we continue building the string gradually by appending new characters to the value variable.

The rest of the process is relatively straightforward, and we can proceed smoothly with handling other token types.

Let's keep up the momentum and move forward!

We apply a similar technique that we used for handling String values to build a character array for null , true , false , or number data types. By doing so, we can systematically examine each character and identify the respective data types.

As we parse the characters and recognize the data type, we proceed accordingly. If none of these data types match, we handle the situation by throwing an Unexpected value error.

Here are the utilities for them:

Let's finish off with whitespace skipping and default condition to handle unknown/unexpected chars.

Finished version of tokenizer.ts

Let's move on to Parser.

The parser is where we make sense out of our tokens. Now we have to build our Abstract Syntax Tree (AST). The AST represents the structure and meaning of the code in a hierarchical tree-like structure. It captures the relationships between different elements of the code, such as statements, expressions, and declarations.

Every language or format you can think of uses some form of AST based on grammar rules of the programming language or data format being parsed. So, we will do that together now.

It's actually pretty similar to tokenizer . We will iterate over our tokens and form a tree depending on that value type.

Let's start by defining our type first in the types.ts file

Now, our parser.ts file.

If the token list is empty, we simply throw an error. And, in order to iterate through our tokens, we need a counter variable and a function to increment it.

Let's start by parsing simple values first.

The provided code snippet is relatively straightforward and handles basic data types like strings, numbers, booleans, and null using a simple switch statement.

However, when encountering a BraceOpen or BracketOpen , the parser needs to handle the nested objects and arrays recursively. This means calling parseValue() within parseObject() or parseArray() until all the inner key-value pairs or elements are evaluated.

For instance, when parsing an object represented by the following JSON data:

The parser needs to iterate through the object and call parseValue() for each key-value pair, handling nested objects and arrays recursively.

Let's add our parseObject()

In this code, we expect the ASTNode to represent an object, and we iterate through the tokens until we encounter a BraceClose , which marks the end of the object.

Within the loop, we check if the current token type is String to ensure that we are processing a key-value pair and not encountering another object. If it's a valid string key, we move forward by consuming the token and then expect to find a colon ":" separating the key from the value.

Once we find the colon, we recursively call parseValue() to parse the value of the key-value pair. This is important because the value might be another object or array, and we need to handle nested structures correctly.

We continue this process iteratively, consuming tokens and parsing key-value pairs until we reach the end of the object (marked by the BraceClose token). Along the way, we might encounter commas (",") between key-value pairs, and we skip them as they separate multiple key-value pairs within the object.

By repeating this process recursively, we can successfully parse all the inner objects and nested structures present in the JSON data.

Let's move onto parseArray()

This works in a similar fashion: as we parse new values, we simply push them to the array stored in the node object.

We've actually finished our parser. Here is the full code:

Now, in your main.ts you can do this to test it:

Conclusion

And that's it! Today, we embarked on the journey of creating our very own JSON parser from scratch. It's been a thrilling ride, and we owe a huge shoutout to John Crickett for inspiring this adventure.

We learned about the crucial concept of tokenizing, breaking down the code into smaller, understandable parts. This sets the foundation for the parser to work its magic.

Our tokenizer expertly handles JSON strings and produces an array of tokens that capture the essence of the data structure.

Moving on to the parser, we built an Abstract Syntax Tree (AST) that neatly organizes the key-value pairs and nested structures of the JSON data.

It's incredible how much power we've packed into this parser, enabling us to handle various JSON data types and their complexities.

So, with our tokenizer and parser working hand in hand, we can confidently say that we've successfully crafted our very own JSON parser!

Now go forth and experiment with your newfound knowledge. Keep tinkering and exploring the fascinating world of programming. Until next time, happy coding! 🚀😄


Python JSON

JSON (JavaScript Object Notation) is a popular data format used for representing structured data. It's common to transmit and receive data between a server and web application in JSON format.

In Python, JSON exists as a string. For example:
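For instance (a representative snippet):

```python
p = '{"name": "Bob", "languages": ["Python", "Java"]}'
```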

It's also common to store a JSON object in a file.

Import json Module

To work with JSON (string, or file containing JSON object), you can use Python's json module. You need to import the module before you can use it.

Parse JSON in Python

The json module makes it easy to parse JSON strings and files containing JSON objects.

Example 1: Python JSON to dict

You can parse a JSON string using the json.loads() method. The method returns a dictionary.
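A sketch consistent with the description that follows:

```python
import json

person = '{"name": "Bob", "languages": ["English", "French"]}'
person_dict = json.loads(person)

# Output: {'name': 'Bob', 'languages': ['English', 'French']}
print(person_dict)

# Output: ['English', 'French']
print(person_dict['languages'])
```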

Here, person is a JSON string, and person_dict is a dictionary.

Example 2: Python read JSON file

You can use the json.load() method to read a file containing a JSON object.

Suppose, you have a file named person.json which contains a JSON object.

Here's how you can parse this file:
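A sketch consistent with the description below:

```python
import json

with open('person.json') as f:
    data = json.load(f)

print(data)
```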

Here, we have used the open() function to read the JSON file. Then, the file is parsed using the json.load() method, which gives us a dictionary named data.

If you do not know how to read and write files in Python, we recommend checking out Python File I/O.

Python Convert to JSON string

You can convert a dictionary to a JSON string using the json.dumps() method.

Example 3: Convert dict to JSON
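A sketch:

```python
import json

person_dict = {'name': 'Bob', 'age': 12, 'children': None}
person_json = json.dumps(person_dict)

# Output: {"name": "Bob", "age": 12, "children": null}
print(person_json)
```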

Here's a table showing Python objects and their equivalent conversion to JSON:

  • dict -> object
  • list, tuple -> array
  • str -> string
  • int, float -> number
  • True -> true
  • False -> false
  • None -> null

Writing JSON to a file

To write JSON to a file in Python, we can use the json.dump() method.

Example 4: Writing JSON to a file
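A sketch consistent with the description below:

```python
import json

person_dict = {"name": "Bob",
               "languages": ["English", "French"],
               "married": True,
               "age": 32}

with open('person.txt', 'w') as json_file:
    json.dump(person_dict, json_file)
```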

In the above program, we have opened a file named person.txt in writing mode using 'w' . If the file doesn't already exist, it will be created. Then, json.dump() transforms person_dict to a JSON string which will be saved in the person.txt file.

When you run the program, the person.txt file will be created, containing the dictionary above serialized as a single line of JSON.

Python pretty print JSON

To analyze and debug JSON data, we may need to print it in a more readable format. This can be done by passing the additional parameters indent and sort_keys to the json.dumps() and json.dump() methods.

Example 5: Python pretty print JSON
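A sketch using the indent and sort_keys parameters:

```python
import json

person_string = '{"name": "Bob", "languages": "English", "numbers": [2, 1.6, null]}'
person_dict = json.loads(person_string)

print(json.dumps(person_dict, indent=4, sort_keys=True))
```

This prints:

```
{
    "languages": "English",
    "name": "Bob",
    "numbers": [
        2,
        1.6,
        null
    ]
}
```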

In the above program, we have used 4 spaces for indentation, and the keys are sorted in ascending order.

By the way, the default value of indent is None, and the default value of sort_keys is False.



Posted on Jul 17, 2019

Building a JSON parser for great good

A good way to learn more about programming languages is by understanding how they work, and the best way to understand something is to have made it yourself. Today we set out to build a simple JSON parser using the spec on json.org. JSON is a fairly straightforward and well-documented format, so it is a good target for learning how to write a parser.

Process of parsing

JSON gets parsed into a data structure, so we don't have to worry about anything but the translation of a JSON string to the data structure. The first step in the process of translating the JSON file into a more abstract JSON object is called tokenization, lexical analysis or lexing. These terms are often used interchangeably. Lexing and lexical analysis are more often used in natural language processing and have their roots in linguistics. The process is as follows:

The tokenizer gets called with a string, let's say: { "key": "value" }

The string gets broken down into units of meaning (e.g. keywords, separators, operators). The way we do this is by simply scanning over every single character and changing our mode of operation to accommodate it (see an opening curly brace? we are dealing with a JSON object; an opening square bracket? an array). These units of meaning are called tokens or lexemes. We can see that our string starts with a { indicating it's a JSON object; moving on, the next relevant tokens we see are "key", :, "value", and finally the end of the JSON object, }.

The tokens get put in the resulting data structure as we identify them. In our case, our string becomes a root JSON object in our abstract representation of the JSON: (JSON object). We create the JSON object as soon as we identify it. If we encounter any malformed token along the way, we will panic and the parser will quit, so it's safe to assume that anything we finish building is a complete JSON object. Once we see the key we append it to our structure (JSON object (key)), and once we see the value we append it to the key (JSON object (key value)).

You might wonder why we would still need the closing brace, or why we don't just represent the input without these symbols if they don't contribute anything up to this point. Our tokenizer works like a state machine: it switches modes depending on the input it gets called with. The } means "stop parsing this JSON object", whereas a , would indicate more key-value pairs to follow.

Once we have the entire string turned into a data structure, we are done. Because JSON is much simpler than a programming language, we do not have to go over the resulting data structure to run any instructions; the program can decide how it will use the resulting JSON data structure.

Implementing the tokenizer

The easiest way is to loop through every character and switch modes depending on the character fed into the parser. You can do this using a switch statement that calls into the right function for parsing a token. For a string, you would see that it starts with ", so any character that is not " will need to be stored, with the closing " indicating that the string has ended and that you should exit the function and return the string to be embedded in the data structure you are creating.
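To make that concrete, here is a rough sketch of the string mode (in Python; the idea is the same in any language):

```python
def lex_string(text, i):
    # Called when text[i] is an opening quote.
    i += 1                  # skip the opening "
    out = ''
    while text[i] != '"':   # store every character that is not "
        out += text[i]
        i += 1
    return out, i + 1       # skip the closing " and exit string mode
```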

Understanding how to read the relevant documents for implementing a JSON parser

This excerpt from "How JavaScript Works" is linked on json.org and explains the McKeeman form. It's good to read it through and use it as a guide when implementing the different modes of the parser, as you can see which types include what more clearly than on a syntax diagram: https://www.crockford.com/mckeeman.html

If you have any suggestions or sources that should be linked, you can fork this article.



JSON.parse()

The JSON.parse() static method parses a JSON string, constructing the JavaScript value or object described by the string. An optional reviver function can be provided to perform a transformation on the resulting object before it is returned.

Parameters

text: The string to parse as JSON. See the JSON object for a description of JSON syntax.

reviver (optional): If a function, this prescribes how each value originally produced by parsing is transformed before being returned. Non-callable values are ignored. The function is called with the following arguments:

  • key: The key associated with the value.
  • value: The value produced by parsing.

Return value

The Object, Array, string, number, boolean, or null value corresponding to the given JSON text.

Exceptions

SyntaxError: Thrown if the string to parse is not valid JSON.

Description

JSON.parse() parses a JSON string according to the JSON grammar , then evaluates the string as if it's a JavaScript expression. The only instance where a piece of JSON text represents a different value from the same JavaScript expression is when dealing with the "__proto__" key — see Object literal syntax vs. JSON .

The reviver parameter

If a reviver is specified, the value computed by parsing is transformed before being returned. Specifically, the computed value and all its properties (in a depth-first fashion, beginning with the most nested properties and proceeding to the original value itself) are individually run through the reviver .

The reviver is called with the object containing the property being processed as this (unless you define the reviver as an arrow function, in which case there's no separate this binding) and two arguments: key and value , representing the property name as a string (even for arrays) and the property value. If the reviver function returns undefined (or returns no value — for example, if execution falls off the end of the function), the property is deleted from the object. Otherwise, the property is redefined to be the return value. If the reviver only transforms some values and not others, be certain to return all untransformed values as-is — otherwise, they will be deleted from the resulting object.

Similar to the replacer parameter of JSON.stringify() , for arrays and objects, reviver will be last called on the root value with an empty string as the key and the root object as the value . For other valid JSON values, reviver works similarly and is called once with an empty string as the key and the value itself as the value .

If you return another value from reviver, that value will completely replace the originally parsed value. This even applies to the root value.

There is no way to work around this generically. You cannot specially handle the case where key is an empty string, because JSON objects can also contain keys that are empty strings. You need to know very precisely what kind of transformation is needed for each key when implementing the reviver.

Note that reviver is run after the value is parsed. So, for example, numbers in JSON text will have already been converted to JavaScript numbers, and may lose precision in the process. To transfer large numbers without loss of precision, serialize them as strings, and revive them to BigInts , or other appropriate arbitrary precision formats.

Using JSON.parse()

The examples cover using the reviver parameter, and using reviver when paired with the replacer of JSON.stringify().

In order for a value to properly round-trip (that is, it gets deserialized to the same original object), the serialization process must preserve the type information. For example, you can use the replacer parameter of JSON.stringify() for this purpose:

Because JSON has no syntax space for annotating type metadata, in order to revive values that are not plain objects, you have to consider one of the following:

  • Serialize the entire object to a string and prefix it with a type tag.
  • "Guess" based on the structure of the data (for example, an array of two-member arrays)
  • If the shape of the payload is fixed, based on the property name (for example, all properties called registry hold Map objects).

JSON.parse() does not allow trailing commas

JSON.parse() does not allow single quotes



JSON in Python: How To Read, Write, and Parse

JSON, short for JavaScript Object Notation, is an open standard. Although its name doesn't imply so, it is a language-independent data format. With Python's JSON library, we can read, write, and parse JSON to both store and exchange data using this versatile format. It's a prevalent data format because it is easy for humans to read and write as well, although not as easy as YAML!

Working with JSON in Python is super easy! Python has two data types that, together, form the perfect tool for working with JSON in Python: dictionaries and lists . In this article, I’ll show you how to use the built-in Python JSON library. In addition, we’ll take a look at JSON5: an extension to JSON that allows things like comments inside your JSON documents.

Table of Contents

  • 1 Importing the built-in JSON library
  • 2 How to parse JSON in Python
  • 3 Encoding JSON with json.dumps
  • 4 Pretty printing JSON on the command line
  • 5 How to read a JSON file in python
  • 6 How to write JSON to a file in Python
  • 7 JSON5 vs. JSON
  • 8 Frequently Asked Questions
  • 9 Keep learning

Importing the built-in JSON library

Python ships with a powerful and elegant JSON library to help you decode and encode JSON. You can import the module with:
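```python
import json
```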

This library is part of Python, so you don’t need to install it with the Pip package manager .

How to parse JSON in Python

Parsing a string of JSON data, also called decoding JSON, is as simple as using json.loads(…). Loads is short for "load string".

It converts:

  • objects to dictionaries
  • arrays to lists
  • booleans, integers, floats, and strings are recognized for what they are and will be converted into the correct types in Python
  • Any null will be converted into Python's None type

Here's an example of json.loads in action (a representative snippet; the site's interactive example isn't preserved here):
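```python
import json

data = json.loads('{ "name": "Erik", "age": 38, "married": true }')

# Output: {'name': 'Erik', 'age': 38, 'married': True}
print(data)
```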

The output might look like a string, but it’s actually a dictionary that you can use in your code as explained on our page about Python dictionaries . You can check for yourself:
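```python
print(type(data))  # <class 'dict'>
```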

Encoding JSON with json.dumps

Encoding JSON data with Python’s json.dumps is just as easy as decoding. Use  json.dumps (short for ‘dump to string’) to convert a Python object consisting of dictionaries , lists , and other native types into a string:

Here's an example, using the same document as before:
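```python
import json

data = {'name': 'Erik', 'age': 38, 'married': True}

# Output: {"name": "Erik", "age": 38, "married": true}
print(json.dumps(data))
```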

This is the same document, converted back to a string! If you want to make your JSON document more readable for humans, use the indent option. It will nicely format the JSON, using space characters:
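```python
print(json.dumps(data, indent=2))
```

which prints:

```
{
  "name": "Erik",
  "age": 38,
  "married": true
}
```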

Pretty printing JSON on the command line

Python’s JSON module can also be used from the command line. It will both validate and pretty-print your JSON:
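For example, to pretty-print a file (assuming it's called myfile.json):

```
python -m json.tool myfile.json
```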

You may also be interested in using the jq-tool for this though!

How to read a JSON file in python

Besides json.loads, there's also a function called json.load (without the s). It will load data from a file, but you have to open the file yourself. If you want to read the contents of a JSON file into Python and parse it, use the following example:
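A sketch, assuming a file named data.json exists:

```python
import json

with open('data.json') as f:
    data = json.load(f)

print(data)
```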

How to write JSON to a file in Python

The json.dump function is used to write data to a JSON file. You’ll need to open the file in write mode first:
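A minimal sketch:

```python
import json

data = {'name': 'Erik', 'age': 38}

with open('data.json', 'w') as f:
    json.dump(data, f)
```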

JSON5 vs. JSON

JSON5 is an extension of JSON. The main advantage of JSON5 over JSON is that it allows for more human-readable and editable JSON files. Notable JSON5 features are:

  • single-line and multi-line comments
  • trailing commas in objects and arrays
  • single-quoted strings

For machine-to-machine communication, I recommend using the built-in JSON library. However, when using JSON as a configuration file, JSON5 is recommended, mainly because it allows for comments.

Python does not support JSON5 natively. To read and write JSON5, we’ll need to pip install one of the following packages:

  • PyJSON5: a library that uses the official JSON5 C library, making it the fastest option to use.
  • json5: a pure Python implementation, confusingly also called pyjson5 in its documentation. According to the author, the library is slow.

I recommend the first (fast) option, but unless you are parsing hundreds or thousands of documents, the speed advantage will be negligible.

Both libraries offer functions that mimic the Python JSON module, making it super easy to convert your code to JSON5. You could, for example, do an import pyjson5 as json, but I recommend making it more explicit that you're using json5, as shown in the following example:
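A sketch using the pure-Python json5 package, which mirrors the json module's loads/dumps API:

```python
import json5

# JSON5 allows comments and trailing commas:
data = json5.loads('{ "name": "Erik", /* a comment */ "age": 38, }')

# Output: {'name': 'Erik', 'age': 38}
print(data)
```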

To make it extra clear that you’re using JSON5, you can also use the extension .json5 . While you’re at it, search the marketplace of your code editor for a JSON5 plugin. E.g., VSCode has one or two.

Frequently Asked Questions

How do I convert a Python list to JSON? Simply use the methods described above: the json.dump and json.dumps functions accept both dictionaries and lists.

How do I convert a dictionary to JSON? Similar to arrays, so use json.dump or json.dumps on the dictionary.

How do I sort the keys of the JSON output? The dump and dumps functions both accept an option called sort_keys, for example: json.dumps(data, sort_keys=True).

Does the json library output Unicode? By default: no. The library outputs ASCII and will convert characters that are not part of ASCII. If you want to output Unicode, set ensure_ascii to False. Example: json.dumps(data, ensure_ascii=False)

Keep learning

  • If you’re looking for a format that is easy to write for humans (e.g.: config files), read our article on reading and writing YAML with Python .
  • JMESPath is a query language for JSON. JMESPath in Python allows you to obtain the data you need from a JSON document or dictionary easily. 
  • If you need to parse JSON on the command-line , try our article on a tool called jq!
  • Get a refresher on opening, writing, and reading files with Python .



Write a JSON parser from scratch (2)

In part 1, we mentioned how to write a JSON parser and implemented the string parsing functionality. Now let's add the remaining functions. (In fact, once you understand the basic principles, implementing the remaining functions is just a matter of following the same pattern.)

(JSON number grammar diagram)

Implementing the number function is not difficult. The tricky parts are handling decimal points, negative numbers, floating-point numbers, and exponential notation (e.g., 1e6 ). (By the way, I just realized that E can also be used.)

In the first part, we check if it is a negative number.

In the second part, we run a while loop to extract the number part.

In the third part, we check if there is a decimal point.

  • If there is, we extract the decimal part.

In the fourth part, we check if there is an exponential expression (using uppercase or lowercase e ).

  • If there is, we extract the exponential part.

In the fifth part, we convert the string to a number (using parseInt or parseFloat ).
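A rough sketch of those five steps (in Python here; the original implementation is JavaScript):

```python
def parse_number(text, i):
    start = i
    if text[i] == '-':                             # 1. optional minus sign
        i += 1
    while i < len(text) and text[i].isdigit():     # 2. the integer part
        i += 1
    if i < len(text) and text[i] == '.':           # 3. optional decimal part
        i += 1
        while i < len(text) and text[i].isdigit():
            i += 1
    if i < len(text) and text[i] in 'eE':          # 4. optional exponent
        i += 1
        if i < len(text) and text[i] in '+-':
            i += 1
        while i < len(text) and text[i].isdigit():
            i += 1
    raw = text[start:i]                            # 5. convert the string
    value = float(raw) if any(c in '.eE' for c in raw) else int(raw)
    return value, i
```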

Keywords (true, false, null)

This part is simple. We check if the value matches any of the keywords.

(JSON array grammar diagram)

In the first part, we check if it starts with [ .

  • If we encounter ] , it means it is an empty array.

In the second part, we run a while loop and call the value function to parse and add values to the array.

If we encounter ] , it means the array ends, so we return the array.

If we encounter a comma, it means there is another element, so we continue parsing.

This is almost it. You can take a look at the implementation in the repository to see the code in action. You will notice that one of the tests, special-character, fails because there might be escape characters in the string. Let's try to implement that.

We create an escape character table and replace the corresponding characters with their actual values. Here, we only implemented \t and \r . This way, we pass the basic JSON tests 🍻. However, besides the mentioned escape characters, we also need to implement \u , which represents Unicode. This functionality is quite important.

Custom Feature: template

Since we are writing our own parser, we can add new syntax! Let's say we want to implement a template feature, where any variable enclosed between $ signs will be replaced with the corresponding value from the passed object. For example:

will become:

Implementation

  • First, we match $ .
  • Then, we start reading the content until we see $ .
  • When we encounter $ , we stop the while loop and replace the template variable with the corresponding variable value, and return the result.

You can check the detailed implementation in the template branch. Take a look at the test results in the test/template folder, too.

By writing our own parser, we can express complex implementations more easily in terms of syntax. We can even extend existing grammars (like JSON in this case) with our desired functionality. Although parsing itself is important and interesting, parsing a language is just the first step. It's like converting JSX into JavaScript code without the help of React, which wouldn't be useful; or transforming SQL into an abstract syntax tree without implementing a database, which would be a bit pointless. The purpose of parsing a language is to facilitate further processing (executing queries, rendering to the DOM).

In fact, there are now many libraries that help you skip the parsing part, such as the famous Bison or PEG.js , which automatically generate a stable parser for you based on a BNF-like syntax, saving you time in parsing and allowing you to focus directly on language implementation.

Our JSON parser in this case doesn't convert to an abstract syntax tree and then generate the final result. So, in the next stage, we will try to parse simple HTML and convert it into a syntax tree, then render it using JavaScript's DOM API.



Writing a Simple JSON Parser in Golang

Learn to build a JSON parser in Go: mastering the lexer and AST for advanced data interpretation.


We'll explore the core components of a JSON parser: tokenization , where JSON data is broken down into identifiable tokens, and the construction of an Abstract Syntax Tree (AST) , which organizes these tokens into a hierarchical structure. By the end of this guide, you'll gain a deeper understanding of these essential building blocks in the context of JSON parsing in Go.

Overview of the JSON Parser Structure 🏗️


In this section, we'll delve into the architecture of our JSON parser, designed in Go. The parser's main job is to read JSON data and convert it into a format that's easier for our program to understand and manipulate. Central to our parser's operation is the token package, which plays a pivotal role in identifying and categorizing different elements of JSON data.

The token Package and Its Role 📦

The token package is essentially the backbone of our parser.

It helps in categorizing JSON elements into recognizable tokens.

These tokens are then used by the parser to construct a meaningful representation of the JSON data.

Building the Token Struct 🛠️

In our JSON parser, the Token struct is a fundamental component. It serves as the basic building block, encapsulating each piece of data we extract from the JSON input. Let's break down its structure and understand its purpose in our parser.

The Purpose of the Token Struct :

The Token struct represents a single unit of data or symbol in the JSON string.

It holds information about the type of token and its value.

This struct is used to construct a sequence of tokens, which the parser then interprets.

Explanation of the Struct :

Type: This field holds the type of the token, as defined in our Type constants. It tells us whether the token is a String, Number, LeftBrace, etc.

Val: This field contains the actual value of the token. For instance, if our token is a String, Val will hold the text of the string.

Token Types 🔠

ILLEGAL : Represents any character or sequence that doesn't conform to valid JSON syntax.

EOF (End of File) : Signifies the end of the JSON input.

Braces and Brackets ( {} , [] ): Used to denote objects and arrays in JSON.

Comma ( , ): Separates elements in arrays or objects.

Colon ( : ): Separates keys from values in JSON objects.

String : Enclosed in double quotes, represents text data.

Number : Represents numerical values.

Boolean : Represents true or false .

Null : Represents a null value.

The Tokenizer Function 🔄

The tokenizer function is a key player in our JSON parser. Its primary role is to convert a JSON string into a sequence of tokens, each representing a meaningful element within the JSON structure. This process is crucial for parsing, as it transforms the raw text into a format that our parser can easily interpret and manipulate.

Key Points About the Tokenizer Function 🖥️

It iterates over the input string and categorizes each character or sequence of characters into tokens. The function is designed as a simple state machine, transitioning states based on the current character in the input string. It handles different scenarios like skipping whitespace, recognizing data types, and dealing with nested structures like arrays and objects.

Tokenizing an Example String 🔍

Let's take a closer look at how the tokenizer function works with an example JSON string, say {"name": "John", "age": 30}:

When passed through our tokenizer, this string will be broken down into a sequence of tokens:

{ ➡️ LeftBrace

"name" ➡️ String

"John" ➡️ String

"age" ➡️ String

30 ➡️ Number

} ➡️ RightBrace

Below is a simplified version of the tokenizer function, with comments explaining its key sections:

Explaining Key Sections :

The function iterates over each character in the input string. A switch statement categorizes each character into a token type. For a string token, it finds the starting and ending quotes and captures the text in between. Each token is then appended to the tokens slice, which is returned at the end.

Handling Errors and Edge Cases 🚨

Error handling and managing edge cases are vital to ensure the robustness of the tokenizer. The tokenizer should gracefully handle unexpected or malformed input without crashing. Here are some scenarios where error handling is crucial:

  • Unexpected Characters: If the tokenizer encounters a character that doesn't fit into any known token type, it should classify it as ILLEGAL and possibly halt or raise an error.
  • Unclosed Tokens: Situations where a string or a structural token like { or [ is not properly closed. The tokenizer should detect these and either attempt recovery or flag an error.
  • Invalid Sequences: Certain sequences of tokens are not valid in JSON (e.g., a comma not followed by another value or key). The tokenizer should identify these invalid sequences and handle them appropriately.
  • Number Formatting Issues: Detecting incorrectly formatted numbers (like a number with two dots) is important to prevent parsing errors.

Invalid Sequences

In building a robust JSON parser, handling errors and edge cases effectively is important. One key method in ensuring this is the isValidSequences method. Let's dive into how this method contributes to error handling in our parser.

The isValidSequences Method Explained 🧐

  • Purpose: Ensures the sequence of tokens in the JSON string is syntactically correct.
  • Functionality: It checks if a token can logically follow the previous token based on JSON syntax rules.

How It Works:

  • Mapping Valid Sequences: The method uses a map to define valid sequences of tokens. For example, a number should not directly follow an opening brace {.
  • Checking Sequences: When a new token is identified, isValidSequences checks if this token is a valid successor to the previous token.
  • Error Flagging: If the sequence is invalid, the method flags an error, preventing incorrect parsing.
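
A sketch of how such a check might be written in Go, assuming the token types above (the article's actual map and signature may differ, and only a few of JSON's rules are encoded here):

```go
// validNext maps a token type to the set of token types allowed to follow it.
var validNext = map[Type]map[Type]bool{
	LeftBrace: {String: true, RightBrace: true}, // '{' must be followed by a key or '}'
	Colon:     {String: true, Number: true, Boolean: true, Null: true, LeftBrace: true, LeftBracket: true},
	Comma:     {String: true, Number: true, Boolean: true, Null: true, LeftBrace: true, LeftBracket: true},
}

// isValidSequences reports whether curr may legally follow prev.
func isValidSequences(prev, curr Token) bool {
	allowed, ok := validNext[prev.Type]
	if !ok {
		return true // no rule recorded for prev in this sketch
	}
	return allowed[curr.Type]
}
```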

Understanding the AstNode Struct 🌳

The AstNode struct is crucial in our JSON parser as it represents nodes in our Abstract Syntax Tree (AST), which is a hierarchical model of the JSON data.

  • Type: Indicates the kind of data (e.g., String, Number) based on token.Type.
  • Value: Stores the actual data. Its type interface{} allows for flexibility, accommodating various data types.
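
In Go this might be declared as follows (a sketch based on the two fields just described; the token package is the one from earlier):

```go
// AstNode is one node of the Abstract Syntax Tree built from the tokens.
type AstNode struct {
	Type  token.Type  // the kind of data, taken from the token's Type
	Value interface{} // the parsed value; interface{} accommodates any type
}
```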

The Parser Function 🔄

The parser function's role is to process the tokens generated by the tokenizer and create corresponding AST nodes.

  • Iterates over tokens.
  • Calls parseValue to transform each token into an AstNode.
  • Collects nodes to form the AST.
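
A sketch of that loop in Go, assuming the AstNode type above and the parseValue function shown further below (nesting of objects and arrays is not modelled in this sketch):

```go
// Parse walks the token stream and collects an AST node for every
// value token; structural tokens ({, }, :, ,) are skipped here.
func Parse(tokens []token.Token) []AstNode {
	var nodes []AstNode
	for _, tok := range tokens {
		switch tok.Type {
		case token.String, token.Number, token.Boolean, token.Null:
			nodes = append(nodes, parseValue(tok))
		}
	}
	return nodes
}
```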

Converting AST to a Map 🗺️

AstToMap converts the AST into a user-friendly map format, aligning with the JSON's key-value structure.

  • Iterates over AST nodes.
  • Transforms them into a map, handling each token type accordingly.
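
One possible shape for this conversion in Go, under the simplifying assumption of a single flat object whose nodes alternate key, value, key, value:

```go
// AstToMap pairs consecutive AST nodes up as key/value map entries.
func AstToMap(nodes []AstNode) map[string]interface{} {
	result := make(map[string]interface{})
	for i := 0; i+1 < len(nodes); i += 2 {
		key, ok := nodes[i].Value.(string) // object keys are String nodes
		if !ok {
			continue // skip malformed pairs in this sketch
		}
		result[key] = nodes[i+1].Value
	}
	return result
}
```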

Parsing Individual Values 🔍

parseValue is pivotal in translating individual tokens into AST nodes.

  • Switches on the token type.
  • Creates an AstNode for each type, handling String, Number, Boolean, and Null.
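
A sketch of that switch in Go (it needs import "strconv"; error handling is omitted):

```go
// parseValue turns a single token into an AST node, converting the
// token's literal text into a typed Go value where appropriate.
func parseValue(tok token.Token) AstNode {
	switch tok.Type {
	case token.Number:
		n, _ := strconv.ParseFloat(tok.Val, 64) // parse errors ignored here
		return AstNode{Type: tok.Type, Value: n}
	case token.Boolean:
		return AstNode{Type: tok.Type, Value: tok.Val == "true"}
	case token.Null:
		return AstNode{Type: tok.Type, Value: nil}
	default: // token.String and anything else: keep the raw text
		return AstNode{Type: tok.Type, Value: tok.Val}
	}
}
```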

Example Input and Output: tokenizing and parsing {"name": "John", "age": 30} yields AST nodes for the two keys and two values, which AstToMap then folds into a map where "name" is bound to "John" and "age" to 30.

Conclusion and Further Exploration 🚀

We've journeyed through the exciting process of building a JSON parser in Go, covering everything from tokenization to parsing and mapping JSON data into a structured format. This hands-on experience has not only enhanced your understanding of JSON parsing but also sharpened your Go programming skills.

Explore the Complete Code on GitHub 🌐

For those eager to dive deeper or review the complete code, all the code for this JSON parser is available on GitHub at onerciller/gojsonp. This repository includes all the components we've discussed, providing a comprehensive view of the project. The library itself is not comprehensive; it's a starting point, showcasing the core aspects of JSON parsing.


JSON.parse()

A common use of JSON is to exchange data to/from a web server.

When receiving data from a web server, the data is always a string.

Parse the data with JSON.parse(), and the data becomes a JavaScript object.

Example - Parsing JSON

Imagine we received this text from a web server:

Use the JavaScript function JSON.parse() to convert text into a JavaScript object:

Make sure the text is in JSON format, or else you will get a syntax error.

Use the JavaScript object in your page:

Array as JSON

When using JSON.parse() on JSON derived from an array, the method will return a JavaScript array instead of a JavaScript object.


Parsing Dates

Date objects are not allowed in JSON.

If you need to include a date, write it as a string.

You can convert it back into a date object later:

Convert a string into a date:

Or, you can use the second parameter of the JSON.parse() function, called reviver.

The reviver parameter is a function that checks each property, before returning the value.

Convert a string into a date, using the reviver function:

Parsing Functions

Functions are not allowed in JSON.

If you need to include a function, write it as a string.

You can convert it back into a function later:

Convert a string into a function:

You should avoid using functions in JSON; the functions will lose their scope, and you would have to use eval() to convert them back into functions.


Writing a Fast JSON Parser

May 25, 2017 - json, performance, sajson, hackernews, reddit

Several holidays ago, I got a bee in my bonnet and wrote a fast JSON parser whose parsed AST fits in a single contiguous block of memory. The code was small and simple and the performance was generally on-par with RapidJSON, so I stopped and moved on with my life.

Well, at the end of 2016, Rich Geldreich shared that he'd also written a high-performance JSON parser.

A while back I wrote a *really* fast JSON parser in C++ for fun, much faster than RapidJSON (in 2012 anyway). Is this useful tech to anyone? — Rich Geldreich (@richgel999) December 18, 2016

I dropped his pjson into my benchmarking harness and discovered it was twice as fast as both RapidJSON and sajson! Coupled with some major JSON parsing performance problems in a project at work, I was nerd-sniped again.

I started reading his code but nothing looked particularly extraordinary. In fact, several things looked like they ought to be slower than sajson... Oh wait, what's this?

Why is unrolling that loop a win? How is the parse flags look-up (even L1 is 3-4 cycle latency) better than some independent comparisons?

Guess it was time to do some instruction dependency analysis!

Fast String Parsing

The problem at hand: given a pointer to the first byte after the opening quote of a string, find the first special byte, where special bytes are ", \, <0x20, or >0x7f.
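
The original code is C++ and isn't reproduced here; a rough Go rendering of the naive loop, just to make its structure concrete:

```go
// findSpecial returns the index of the first special byte at or after
// start: a quote, a backslash, a control byte (<0x20), or a byte >0x7f.
// It returns len(buf) if no special byte is found.
func findSpecial(buf []byte, start int) int {
	p := start
	for p < len(buf) {
		c := buf[p] // one load per iteration
		if c == '"' || c == '\\' || c < 0x20 || c > 0x7f {
			return p // four independent comparisons per byte
		}
		p++ // the increment carries the only cross-iteration dependency
	}
	return p
}
```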

Each iteration of the loop roughly breaks down into a load of the current byte, four comparisons against the special values, and a pointer increment.

How do we think about the performance of this loop? Remember that mainstream CPUs are out-of-order and can execute four (or more!) instructions in parallel. So an approximate mental model for reasoning about CPU performance is that the frontend decodes multiple instructions per cycle and feeds them into an execution engine that can execute multiple instructions per cycle. Instructions will execute simultaneously if they're independent of each other. If an instruction depends on the result of another, it must wait N cycles, where N is the first instruction's latency.

So, assuming all branches are correctly predicted (branch predictors operate in the frontend and don't count towards dependency chains), let's do a rough estimate of the cost of the loop above:

increment p is the only instruction on the critical path - it carries a dependency across iterations of the loop. Is there enough work to satisfy the execution resources during the increment? Well, the comparisons are independent so they can all be done in parallel, and there are four of them, so we can probably keep the execution units busy. But it does mean that, at most, we can only check one byte per cycle. In reality, we need to issue the load and increment too, so we're looking at a loop overhead of about 2-3 cycles per byte.

Now let's look more closely at Rich's code.

Replacing the comparisons with a lookup table increases comparison latency (3-4 cycles from L1) but also increases comparison throughput (multiple loads can be issued per cycle).
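
Again as a rough Go rendering rather than Rich's actual C++, the lookup-table version with four bytes per iteration might look like this:

```go
// isSpecial is a 256-entry lookup table: true for quote, backslash,
// control bytes, and non-ASCII bytes.
var isSpecial [256]bool

func init() {
	for c := 0; c < 0x20; c++ {
		isSpecial[c] = true
	}
	for c := 0x80; c < 0x100; c++ {
		isSpecial[c] = true
	}
	isSpecial['"'] = true
	isSpecial['\\'] = true
}

// findSpecialLUT replaces four comparisons per byte with one table load
// per byte, processing four bytes per unrolled iteration.
func findSpecialLUT(buf []byte, start int) int {
	p := start
	for p+4 <= len(buf) {
		// the four table loads are independent and can issue in parallel
		if isSpecial[buf[p]] {
			return p
		}
		if isSpecial[buf[p+1]] {
			return p + 1
		}
		if isSpecial[buf[p+2]] {
			return p + 2
		}
		if isSpecial[buf[p+3]] {
			return p + 3
		}
		p += 4 // add p, 4: the only instruction on the critical path
	}
	for p < len(buf) && !isSpecial[buf[p]] {
		p++ // handle the trailing 0-3 bytes one at a time
	}
	return p
}
```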

So let's spell out the instructions for Rich's code (reordered for clarity):

  • Load four bytes per iteration [not on critical execution path]
  • Load four LUT entries [not on critical execution path]
  • Issue four comparisons per iteration [not on critical execution path]

Again, the critical path is only add p, 4, but we still need to issue the other instructions. The difference is that now all of the loads happen in parallel and the comparisons for 4 bytes happen in parallel too, rather than doing four comparisons per byte.

It's still hard to say if this is a win on paper -- Haswell can only issue 2 loads per cycle (though the loads can overlap with the comparisons from previous bytes). However, we still have to issue all of these instructions. So maybe we're looking at something closer to 2 cycles per byte?

Empirically, at least on my Broadwell, replacing four comparisons with a LUT was definitely a win. 55cc213

But is unrolling the loop necessary? If I take out the unrolling but leave the LUT, clang gets slower but gcc stays the same. I checked - neither gcc nor clang does any automatic unrolling here. What's going on? Branch predictor effects? Also, the Intel Architecture Code Analyzer (IACA) tool says that the unrolled and non-unrolled loops are both frontend-bound and have the same loop throughput.

I'm not yet convinced that a tight dependency chain is why unrolling is a win. More investigation is required here. But the important thing to keep in mind is to pay attention to the critical path latency through a loop body as well as the number of independent operations that can soak up spare execution bandwidth.

Lead Bullets

After playing around with LUTs and unrolling, I started implementing a bunch of small optimizations that I'd long known were available but didn't consider to be that important.

Well, as it often turns out in performance projects, a sequence of 2% gains adds up to significant wins over time! If you've read Ben Horowitz's lead bullets story or how SQLite achieved a 50% speed up, this will sound familiar.

Here are the optimizations that mattered:

  • Moved the input and output pointers into locals instead of members, which helps VC++ and Clang understand that they can be placed in registers. (gcc was already performing that optimization.) 71078d3 4a07c77
  • Replaced the entire parse loop with a goto state machine. Surprisingly, not only was this a performance win, but it actually made the code clearer. 3828769 05b3ec8 c02cb31
  • Changed an 8-entry enum to a uint8_t instead of the pointer-sized value it was defaulting to. 799c55f
  • Duplicated a bit of code to avoid needing to call a function pointer. 44ec0df
  • Made tiny microoptimizations like avoiding branches and unnecessary data flow. 235d330 002ba31 193b183 c23fa23
  • Stored the tag bits at the bottom of the element index instead of the top, which avoids a shift on 64-bit. e7f2351

Static Branch Prediction

I also spent a bit of time on static branch prediction. It's a questionable optimization; in theory, you should just use profile-guided optimization (PGO) and/or rely on the CPU's branch predictors, but in practice few people actually bother to set up PGO. Plus, even though the CPU will quickly learn which branches are taken, the compiler doesn't know. Thus, by using static branch prediction hints, the compiler can line up the instructions so the hot path flows in a straight line and all the cold paths are off somewhere else, sometimes saving register motion in the hot path.

For some examples, look at these uses of SAJSON_LIKELY and SAJSON_UNLIKELY.

I can't recommend spending a lot of time on annotating your branches, but it does show up as a small but measurable win in benchmarks, especially on smaller and simpler CPUs.

Things I Didn't Do

  • Unlike Rich's parser and RapidJSON, I chose not to optimize whitespace skipping. Why? Not worth the code size increase - the first thing someone who cares about JSON parsing performance does is minify the JSON. b05082b
  • I haven't yet optimized number parsing. Both RapidJSON and Rich's parser are measurably faster there, and it would be straightforward to apply the techniques. But the documents I regularly deal with are strings and objects and rarely contain numbers.

Benchmark Results

I tested on four devices: my desktop (high-clocked Haswell), laptop (low-clocked Broadwell), iPhone SE, and Atom-based home server.

[Benchmark charts omitted: parse throughput on the desktop Haswell, the Dell XPS 13 laptop (Broadwell), the iPhone SE, and the Atom home server.]

The charts aren't very attractive, but if you look closely, you'll notice a few things:

  • Parsing JSON on modern CPUs can be done at a rate of hundreds of megabytes per second.
  • gcc does a much better job with RapidJSON than either clang or MSVC.
  • JSON parsing benefits from x64 - it's not a memory-bound or cache-bound problem, and the extra registers help a lot.
  • The iPhone SE is not much slower than my laptop's Broadwell. :)

The Remainder of the Delta

As you can see in the charts above, sajson is often faster than RapidJSON, but still not as fast as Rich's pjson. Here are the reasons why:

sajson does not require the input buffer to be null-terminated, meaning that every pointer increment requires a comparison with the end of the buffer (to detect EOF) in addition to the byte comparison itself. I've thought about changing this policy (or even adding a compile-time option) but I really like the idea that I can take a buffer straight from a disk mmap or database and pass it straight to sajson without copying. On the other hand, I measured about a 10% performance boost from avoiding length checks.

sajson sorts object keys so that object lookup takes logarithmic rather than linear time. The other high-performance parsers have linear-time object lookup by key. This is an optimization that, while not necessary for most use cases, avoids any accidental worst-case quadratic-time usage.

sajson's contiguous AST design requires, for every array, shifting N words in memory where N is the number of elements in the array. The alternative would be to use growable arrays in the AST (requiring that they get shifted as the array is realloc'd). Hard to say how much this matters.

Aside: The "You Can't Beat the Compiler" Myth

There's this persistent idea that compilers are smart and will magically turn your code into something that maps efficiently to the machine. That's only approximately true. It's really hard for compilers to prove the safety (or profitability) of certain transformations and, glancing through the produced code for sajson, I frequently noticed just plain dumb code generation. Like, instead of writing a constant into a known memory address, it would load a constant into a register, then jump to another location to OR it with another constant, and then jump somewhere else to write it to memory.

Also, just look at the charts - there are sometimes significant differences between the compilers on the same code. Compilers aren't perfect and they appreciate all the help you can give!

Benchmarking

Measuring the effect of microoptimizations on modern computers can be tricky. With dynamic clock frequencies and all of today's background tasks, the easiest way to get stable performance numbers is to take the fastest time from all the runs. Run your function a couple thousand times and record the minimum. Even tiny optimizations will show up this way.
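
As an illustration of the minimum-of-N technique described above (not code from the post), a Go sketch might be:

```go
import (
	"math"
	"time"
)

// measure runs fn repeatedly and returns the fastest observed time;
// taking the minimum filters out clock scaling and background noise,
// so even tiny optimizations show up in the result.
func measure(runs int, fn func()) time.Duration {
	best := time.Duration(math.MaxInt64)
	for i := 0; i < runs; i++ {
		start := time.Now()
		fn()
		if d := time.Since(start); d < best {
			best = d
		}
	}
	return best
}
```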

(If a library that uses statistical techniques to automatically achieve precise results is readily available, consider using that.)

I also had my JSON benchmarks report MB/s, which normalizes the differences between test files of different sizes. It also helps understand parser startup cost (when testing small files) and the differences in parse speed between files with lots of numbers, strings, huge objects, etc.

Swift Bindings

Dropbox (primarily @aeidelson) contributed high-quality Swift bindings for sajson. The challenge in writing these bindings was finding a way to efficiently expose the sajson parse tree to Swift. It turns out that constructing Swift arrays and objects is REALLY expensive; we once benchmarked 10 ms in sajson's parser and 400 ms of Swift data structure construction.

Fortunately, Swift has decent APIs for unsafely manipulating pointers and memory, and with those we implemented the ability to access sajson AST nodes through a close-to-natural ValueReader type.

Converting sajson strings (e.g. ranges of UTF-8 memory) into Swift Strings is still expensive, but Swift 4 might improve the situation:

@jckarter @chadaustin We have new APIs for this on the 4.0 branch that I actually need to benchmark… — Airspeed Velocity (@AirspeedSwift) May 19, 2017

An iOS team at Dropbox replaced JSONSerialization with sajson and cut their initial load times by two thirds!

I used to think JSON parsing was not something you ever wanted in your application's critical path. It's certainly not the kind of algorithm that modern computers love (read byte, branch, read byte, branch). That said, this problem has been beaten to death. We now have multiple parsers that can parse data at hundreds of megabytes per second -- around the same rate as SHA-256! If we relaxed some of the constraints on sajson, it could even go faster.

So how fast was Rich's parser, after all? When measured with Clang and MSVC, quite a lot faster, actually. But when measured with GCC, RapidJSON, sajson, and pjson were (and remain) very similar. Many of the differences come down to naive, manually-inlined code, which we know the compiler can reliably convert to a fast sequence of instructions. It's annoying to eschew abstractions and even duplicate logic across several locations, but, in the real world, people use real compilers. For critical infrastructure, it's worth taking on some manual optimization so your users don't have to care.

See discussion on Hacker News (2017), Hacker News (2019), /r/cpp, /r/programming, and /r/rust.

Have a comment? Send me an email or tweet.

It's possible that a SIMD implementation (assuming that you routinely pass over enough bytes) can blow this away. I'm not sure what the cross-over point is - it depends on whether you are falling out of this code quickly or not. Obviously using a big SIMD sequence to replace a "2 cycles per byte" implementation is pretty stupid if the SIMD sequence is "0.5 cycles a byte" but the fixed cost of using the SIMD means that your costs are >=16 cycles and the typical case is small.

It's also not clear that you actually need a SIMD instruction - if you've squashed your hotspot to the point that adding the complexity of per-platform SIMD specialization is only a small win, then why bother?

All that being said...

I've been really lazy in writing this up, but someone kindly did a decent write-up for us in the Rust reimplementation of our "Teddy" literal matcher. Please see https://github.com/rust-lang/regex/blob/master/src/simd_accel/teddy128.rs if interested or look up the Hyperscan project at https://github.com/01org/hyperscan for a few different versions of this trick.

Fast is good, but have you checked your parser for correctness? You may want to run it against this end-of-2016 challenging JSON parser test suite: http://seriot.ch/parsing_json.php

Hi, good question. I've validated it against Milo Yip's excellent conformance tests, where it passes everything but the extreme numeric precision tests. (The JSON spec does not specify precision requirements, so I've considered that okay for now. Probably worth a look though.)

Your null termination problem is an oldie but a goodie. The Ragel state machine generator has a 'noend' option to purposely omit bounds checking.

If you're parsing mmap'd memory btw, and don't want to map it with write permissions, you can always just map an additional page beyond the end of the file. Most OS's guarantee this will always be zeroed out. Hell, you don't even have to do this if the file size isn't a multiple of the page size, since the padding in the final page is also zeroed.

The trick mentioned in the update is the Elephant in Cairo! https://en.m.wikipedia.org/wiki/Elephant_in_Cairo

How does it compare to jsmn? http://zserge.com/jsmn.html gives 482 MB/s.

| Project | License | API | UTF-8 | Neg. tests | Followers | Benchmark | Finalist? | Note |
|---|---|---|---|---|---|---|---|---|
| yajl | ISC | SAX | Y | Y | 815/120 | 249 MB/s, 5.1 s | Y | Large code base. |
| jsonsl | MIT | SAX | N | Y | 10/2 | 246 MB/s, 5.6 s | Y | Tiny source code base. |
| Jansson | MIT | DOM | Y | Y | 234/59 | 25 MB/s, 52 s | Y | |
| cson | BSD' | DOM | Y | Y | n/a | 30 MB/s, 44 s | Y | |
| json-c | BSD | DOM | Y | N | 103/45 | 38 MB/s, 29.7 s | Y | |
| json-parser | BSD | DOM | N | Y | 276/24 | 49 MB/s, 12.4 s | Y | Tiny! A single source file. |
| jsmn | MIT | custom | N | Y | 65/4 | 16 MB/s, 3.3 s; 482 MB/s, 2.3 s | Y | Tiny! |
| js0n | pub. | custom | N | N | 67/10 | n/a | N | |
| LibU | BSD | | | N | 18/3 | n/a | N | |
| WJElement | LGPL | | | N | 7/2 | n/a | N | |
| M's JSON parser | LGPL | | | N | n/a | n/a | N | |
| cJSON | BSD | | | N | n/a | n/a | N | |

Benchmark https://github.com/vlm/vlm-json-test

I've been wanting to write a JSON stream parser for a while. By the time the last packet arrives, the rest of the JSON will already be parsed. We have some large queries that download a few megabytes of JSON. It looks like CPUs and memory are getting fast enough that it may no longer make sense, though.

Thanks for the very interesting post.

It would be interesting to see performance on this benchmark. The chap here does use SSE2 instructions if I recall right, and he is faster than RapidJSON.

https://github.com/kostya/benchmarks#json

Just in case you weren't aware, there's a statistics-based micro-benchmarking library for C++: nonius. Have a look: https://nonius.io

I've experimented a little with CSV parsing using AVX2, and had some good results, mainly on wider files. Specifically, I take the approach of doing a single preparse to produce a bitmap of places where there are delimiters. Can then parse the file in a normal loop more efficiently. No formal results as it doesn't yet have complete correctness, but initial testing looks promising. I'm not sure how JSON would go. https://github.com/jasonk000/csv-experiments/


I ran sajson with my JSONTestSuite and found it to be pretty good, except that some non-numbers are erroneously accepted as numbers.

I opened the following issue: https://github.com/chadaustin/sajson/issues/31

Thanks! I'll look into that.

I did a benchmark of sajson, RapidJSON and gason using RapidJSON's nativebenchmark. The results were consistent with the Dell XPS 13 graph above, except that in my hands rapidjson-clang is comparable with the others (here it's much slower than rapidjson-gcc). https://github.com/project-gemmi/benchmarking-json

Regarding benchmarking of microoptimizations: SQLite has a writeup about it: https://www.sqlite.org/cpu.html The message is that the cycle count from Cachegrind, while only a proxy for actual performance, is more repeatable than real timings, which makes it better for microoptimizations. In this case repeatability is more important than accuracy.

Very nice, thanks for the heads up!

Build Your Own JSON Parser

This challenge is to build your own JSON parser.

Building a JSON parser is an easy way to learn about parsing techniques which are useful for everything from parsing simple data formats through to building a fully featured compiler for a programming language.

Parsing is often broken up into two stages: lexical analysis and syntactic analysis . Lexical analysis is the process of dividing a sequence of characters into meaningful chunks, called tokens. Syntactic analysis (which is also sometimes referred to as parsing) is the process of analysing the list of tokens to match it to a formal grammar.

You can read far more about building lexers, parsers and compilers in what is regarded as the definitive book on compilers: Compilers: Principles, Techniques, and Tools - widely known as the "Dragon Book" (because there's an illustration of a dragon on the cover).

The Challenge - Building a JSON Parser

JSON (which stands for JavaScript Object Notation) is a lightweight data-interchange format, which is widely used for transmitting data over the Internet. It is formally defined by the IETF here: https://tools.ietf.org/html/std90 or there’s a simpler graphical representation here: https://www.json.org/json-en.html

This is software engineering so we're zero-indexed: in Step 0, you're going to set your environment up ready to begin developing and testing your solution.

I’ll leave you to setup your IDE / editor of choice and programming language of choice. After that you can download some simple test data for the JSON parser from my DropBox .

In Step 1, your goal is to parse a valid simple JSON object, specifically: '{}', and an invalid JSON file, and to correctly report which is which. So you should build a very simple lexer and parser for this step.

Your program should report a suitable message to the standard output stream and exit with code 0 for valid and 1 for invalid. It is conventional for CLI tools to return 0 for success and between 1 and 255 for an error; this allows us to combine CLI tools to create more powerful programs. Check out write your own wc tool for more on combining simple CLI tools.

You can test your code against the files in the folder tests/step1. Consider automating the tests so you can run them repeatedly as you progress through the challenge.

In Step 2, your goal is to extend the parser to parse a simple JSON object containing string keys and string values, i.e.:

You can test against the files in the folder tests/step2.

In Step 3, your goal is to extend the parser to parse a JSON object containing string, numeric, boolean and null values, i.e.:

You can test against the files in the folder tests/step3.

In Step 4, your goal is to extend the parser to parse a JSON object with object and array values, i.e.:

You can test against the files in the folder tests/step4.

In Step 5, your goal is to add some of your own tests to ensure you're confident that your parser can handle valid JSON and will fail with useful error messages on invalid JSON.

Once you’re confident your parser is done and well tested you can try running it against the test suite here: http://www.json.org/JSON_checker/test.zip

  • Help Others by Sharing Your Solutions!

If you think your solution is an example other developers can learn from please share it, put it on GitHub, GitLab or elsewhere. Then let me know - ping me a message via Twitter or LinkedIn or just post about it there and tag me.

  • Get The Challenges By Email

If you would like to receive the coding challenges by email, you can subscribe to the weekly newsletter on Substack here:


Writing a JSON parser from scratch

UPDATE: Slides and video from my talk on this topic

In this series, we are looking at how applicative parsers and parser combinators work.

  • In the first post , we created the foundations of a parsing library.
  • In the second post , we built out the library with many other useful combinators.
  • In the third post , we improved the error messages.
  • In this last post, we’ll use the library we’ve written to build a JSON parser.

First, before we do anything else, we need to load the parser library script that we developed over the last few posts, and then open the ParserLibrary namespace:

You can download ParserLibrary.fsx from the link at the bottom of this post.

1. Building a model to represent the JSON spec

The JSON spec is available at json.org. I'll paraphrase it here:

  • These structures can be nested.
  • A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes.
  • A number is very much like a C or Java number, except that the octal and hexadecimal formats are not used.
  • A boolean is the literal true or false.
  • A null is the literal null.
  • An object begins with { (left brace) and ends with } (right brace).
  • Each name is followed by : (colon) and the name/value pairs are separated by , (comma).
  • An array begins with [ (left bracket) and ends with ] (right bracket).
  • Values are separated by , (comma).
  • Whitespace can be inserted between any pair of tokens.

In F#, this definition can be modelled naturally as:

So the goal of our JSON parser is:

  • Given a string, we want to output a JValue value.

2. Getting started with “Null” and “Bool”

Let’s start with the simplest tasks – parsing the literal values for null and the booleans.

Parsing Null

Parsing the null literal is trivial. The logic will be:

  • Match the string “null”.
  • Map the result to the JNull case.

Here’s the code:

Note that we don’t actually care about the value returned by the parser because we know in advance that it is going to be “null”!

This is a common situation, so let’s write a little utility function, >>% to make this look nicer:

Now we can rewrite jNull as follows:

Let’s test:

That looks good. Let’s try another one!

Parsing Bool

The bool parser will be similar to null:

  • Create a parser to match “true”.
  • Create a parser to match “false”.
  • And then choose between them using <|>.

And here are some tests:

Note that the “Unexpected ’t'” error is misleading due to the backtracking issue discussed in the previous post. Since “true” failed, it is trying to parse “false” now, and “t” is an unexpected character. We won’t attempt to fix that here.

3. Parsing “String”

Now for something more complicated – strings.

The spec for string parsing is available as a “railway diagram” like this:

All diagrams sourced from json.org.

To build a parser from a diagram like this, we work from the bottom up, building small “primitive” parsers which we then combine into larger ones.

Let’s start with “any unicode character other than quote and backslash”. We have a simple condition to test, so we can just use the satisfy function:

We can test it immediately:

Escaped characters

Now what about the next case, the escaped characters?

In this case we have a list of strings to match ("\"", "\n", etc.) and for each of these, a character to use as the result.

The logic will be:

  • First define a list of pairs in the form (stringToMatch, resultChar).
  • For each of these, build a parser using pstring stringToMatch >>% resultChar.
  • Finally, combine all these parsers together using the choice function.

And again, let’s test it immediately:

It works nicely!

Unicode characters

The final case is the parsing of unicode characters with hex digits.

  • First define the primitives for backslash , u and hexdigit .
  • Combine them together, using four hexdigits.
  • The output of the parser will be a nested, ugly tuple, so we need a helper function to convert the digits to an int, and then a char.

And let’s test with a smiley face – \u263A .

The complete “String” parser

Putting it all together now:

  • Define a primitive for quote
  • Define a jchar as a choice between jUnescapedChar, jEscapedChar, and jUnicodeChar.
  • The whole parser is then zero or many jchar between two quotes.

One more thing, which is to wrap the quoted string in a JString case and give it a label:

Let’s test the complete jString function:

4. Parsing Number

The “railway diagram” for Number parsing is:

Again, we’ll work bottom up. Let’s start with the most primitive components, the single chars and digits:

Now let’s build the “integer” part of the number. This is either:

  • The digit zero, or,
  • A nonZeroInt , which is a digitOneNine followed by zero or more normal digits.

Note that, for the nonZeroInt parser, we have to combine the output of digitOneNine (a char) with manyChars digit (a string) so a simple map function is needed.

The optional fractional part is a decimal point followed by one or more digits:

And the exponent part is an e followed by an optional sign, followed by one or more digits:

With these components, we can assemble the whole number:

We haven’t defined convertToJNumber yet though. This function will take the four-tuple output by the parser and convert it into a float.

Now rather than writing custom float logic, we're going to be lazy and let the .NET framework do the conversion for us! That is, each of the components will be turned into a string, concatenated, and the whole string parsed into a float.

The problem is that some of the components (like the sign and exponent) are optional. Let’s write a helper that converts an option to a string using a passed in function, but if the option is None return the empty string.

I’m going to call it |>? but it doesn’t really matter because it is only used locally within the jNumber parser.

Now we can create convertToJNumber :

  • The sign is converted to a string.
  • The fractional part is converted to a string, prefixed with a decimal point.
  • The exponent part is converted to a string, with the sign of the exponent also being converted to a string.

It’s pretty crude, and converting things to strings can be slow, so feel free to write a better version.

With that, we have everything we need for the complete jNumber function:

It’s a bit long-winded, but each component follows the spec, so I think it is still quite readable.

Let’s start testing it:

And what about some failing cases?

Hmm. Something went wrong! These cases should fail, surely?

Well, no. What’s happening in the -123. case is that the parser is consuming everything up the to decimal point and then stopping, leaving the decimal point to be matched by the next parser! So, not an error.

Similarly, in the 00.1 case, the parser is consuming only the first 0 then stopping, leaving the rest of the input (0.1) to be matched by the next parser. Again, not an error.

To fix this properly, we want the parser to be greedy, but that is out of scope, and in practice, should not be a problem. For now, let's just add some whitespace to the parser to force it to consume all the characters in the input before terminating.

Now let’s test again:

and we find the error is being detected properly now.

Let’s test the fractional part:

and the exponent part now:

It’s all looking good so far. Onwards and upwards!

5. Parsing “Array”

Next up is the Array case. Again, we can use the railway diagram to guide the implementation:

We will start with the primitives again. Note that we are adding optional whitespace after each token:

What is this jValue though? We’ll come to that shortly.

Next, following the guidance of the railway diagram above, we create a list of values separated by a comma, with the whole list between the left and right brackets.

Let’s revisit that jValue now.

Well, the spec says that a JSON array can contain a list of JSON values, so we’ll assume that we have a jValue parser that can be used to parse the input into a JValue object.

But to parse a JValue, we need to parse a JArray first, because JArray is one of the choices of JValue!

We have hit a common problem in parsing - mutually recursive definitions. We need a JValue parser to build a JArray, but we need a JArray parser to build a JValue.

How can we deal with this?

Forward references

The trick is to create a forward reference – a dummy jValue parser that we can use right now to define the jArray parser, and then later on, we will fix up the forward reference with the “real” jValue parser.

This is one time where mutable references come in handy!

We will need a helper function to assist us with this, and the logic will be as follows:

  • Define a dummy parser that will be replaced later.
  • Define a “proxy” or “wrapper” parser that forwards the input stream to the inner dummy parser.
  • Return both the real parser and a mutable reference to the dummy parser.

Later on, the client code will fix up the mutable reference to point to the correct parser, and from then on, the proxy parser will forward the input to the new parser that has replaced the dummy parser.
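
The original helper is F# and isn't reproduced here; as a language-neutral illustration, the same trick might be sketched in Go as follows (all names and the simplified Parser type are hypothetical):

```go
// Parser is a simplified parser type: it consumes input and returns a
// result plus the remaining, unconsumed input.
type Parser func(input string) (result interface{}, rest string, err error)

// createParserForwardedToRef returns a proxy parser and a mutable
// reference; the proxy forwards every call to whatever parser the
// reference currently points at.
func createParserForwardedToRef() (Parser, *Parser) {
	// the dummy parser fails loudly if it is ever called before fix-up
	inner := Parser(func(string) (interface{}, string, error) {
		panic("unfixed forwarded parser")
	})
	ref := &inner
	proxy := Parser(func(input string) (interface{}, string, error) {
		return (*ref)(input) // forward to the current target of ref
	})
	return proxy, ref
}
```

The client later assigns the real parser through the returned reference, and from then on the proxy picks it up automatically.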

With this in place, we can create a placeholder for a parser of type JValue :

Finishing up the “Array” parser

Going back to the jArray parser, we can now compile it successfully, using the jValue “proxy” parser:

If we try to test it now, we get an exception because we haven’t fixed up the reference to the default dummy parser:

So for now, let’s fix up the reference to use one of the parsers that we have already created, such as jNumber . Later on, we’ll fix it up to be the entire parser.

Now we can successfully test the jArray function, as long as we are careful to only use numbers in our array!

6. Parsing “Object”

The parser for Object is very similar to the one for Array .

First, the railway diagram:

Using this diagram as a guide, we can create the parser directly, so I’ll present it here without comment:

A bit of testing to make sure it works (but remember, only numbers are supported as JValues for now).

7. Putting it all together

Finally, we can combine all six of the parsers using the choice combinator, and we can assign this to the jValueRef parser reference that we created earlier:

And now we are ready to rock and roll!

Testing the complete parser: example 1

Here’s an example of a JSON string that we can attempt to parse:

And here is the result:

Testing the complete parser: example 2

Here’s one from the example page on json.org :

Complete listing of the JSON parser

Here’s the complete listing for the JSON parser – it’s about 250 lines of useful code.

Source code used in this post is available here .

In this post, we built a JSON parser using the parser library that we have developed over the previous posts.

I hope that, by building both the parser library and a real-world parser from scratch, you have gained a good appreciation for how parser combinators work, and how useful they are.

I’ll repeat what I said in the first post: if you are interesting in using this technique in production, be sure to investigate the FParsec library for F#, which is optimized for real-world usage.

And if you are using languages other than F#, there is almost certainly a parser combinator library available to use.

  • For more information about parser combinators in general, search the internet for “Parsec”, the Haskell library that influenced FParsec.
  • Implementing a phrase search query for FogCreek’s Kiln
  • A LOGO Parser
  • A Small Basic Parser
  • A C# Parser and building a C# compiler in F#
  • Write Yourself a Scheme in 48 Hours in F#
  • Parsing GLSL, the shading language of OpenGL

The "Understanding Parser Combinators" series

  • Understanding Parser Combinators Building a parser combinator library from scratch
  • Building a useful set of parser combinators 15 or so combinators that can be combined to parse almost anything
  • Improving the parser library Adding more informative errors
  • Writing a JSON parser from scratch In 250 lines of code


How to write a streaming parser

Dec 19, 2023 | Parse

A parser can turn stringified data (or code) into a structured object that can be operated on. For example, a JSON parser can parse text into a JSON document, and a CSV parser parses text into a structured CSV document.

In some cases, the input data is so large that it cannot fit in memory. In that case, you cannot use a regular parser since it would run out of memory. Instead, you need a streaming parser. This article explains what a streaming parser is, when you need one, and how it works under the hood.

How does a parser work in the first place?

When writing a parser, the easiest way is to first load the data in memory, and then loop over it character by character to interpret the data. The code will follow the structure of the data. For example, a CSV file consists of rows separated by a newline, with a list of comma-separated values on each row. The structure of a CSV parser can look as follows in pseudo code, aligning with the data structure of CSV:

Here, the function parseCsv has text as input, and returns the parsed rows with values as output. The parser contains a loop that repeats parsing a row one by one. The function to parse a row in turn contains a loop to parse each value in the row one by one. And the function to parse a value will loop over each character until the end of the value is reached.
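
The pseudo code itself isn't reproduced here; as an illustration, a minimal Go version with the same nested structure might look like this (no support for quoted or escaped values; it needs import "strings"):

```go
// parseCsv splits text into rows, and each row into comma-separated
// values, mirroring the row/value structure of the data itself.
func parseCsv(text string) [][]string {
	var rows [][]string
	for _, line := range strings.Split(strings.TrimRight(text, "\n"), "\n") {
		rows = append(rows, strings.Split(line, ","))
	}
	return rows
}
```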

The following interactive CodePen shows a minimal, non-streaming CSV parser. In the next sections we will implement the same parser in different ways, so we can compare pros and cons:

The CSV parser has only two levels: rows and values. This keeps things clear.

Let’s now shortly look at the JSON data format, which has nested structures. JSON contains object, array, string, number, boolean, null. It contains recursion, since every object and array can contain nested values including nested objects and arrays. In pseudo code, a JSON parser can look like:

What is important to realize here is that this parser recurses into nested arrays and objects. This is what can make a streaming parser challenging to write. We will get back to that later.

When do I need a streaming parser?

Suppose that you have a large JSON file of 200 MB and a regular JSON parser. This means that you first need to load 200 MB of bytes into memory, and after that parse it, and at last, throw away the 200 MB of bytes from memory, and then you are left with the parsed data. It costs time to load a large amount of bytes in memory, and it is quite a waste that this is needed just temporarily. And if the data is larger than the total amount of available memory, it is simply not possible to parse the data: you will run out of memory. There may be even more waste, for example when you need to filter the data on a specific condition: then you will throw away a large part of the parsed data directly after filtering too.

When parsing the data in a streaming way instead, there is no need to first load the full document into memory and parse the full document before being able to operate on it. A streaming parser processes chunks of data as soon as they come in, and allows to directly apply a filtering step or other processing step without needing all of the data to be parsed beforehand. Therefore, there is not much memory needed to process a possibly endlessly large amount of data.

A use case can be reading a large log file and filtering the logs that match a search request, or receiving a large query result from a database in a backend, transforming the data and then sending it to the client. As an illustration, the library ijson is a streaming JSON parser for Python; it can parse some nested array "earth.europe" and let you directly process the array items one by one whilst they are being parsed (see docs):

How to write a streaming parser?

There are three different approaches to implementing a streaming parser:

  • Parse a flat collection row by row.
  • Parse a nested data structure (a): generator functions.
  • Parse a nested data structure (b): pause and resume using state.

We’ll discuss how to work with that in the following sections.

Parsing a collection in a streaming way is quite straightforward. A collection consists of a set of items that are separated by a delimiter like a newline character. Each item or row can be processed one by one. Examples of this are NDJSON and CSV , which are popular formats for logging for example. 

To parse NDJSON or CSV data, you can read the data until you encounter a newline character, then parse the row with a regular parser, and repeat that for the next rows until you reach the end of the file. Appending a new row can be done without having to parse the data that is already there. 
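
In a language with blocking reads, like Go, this read-a-line, parse-a-line loop can be written directly against a stream; the io.Reader does the waiting for new data. A sketch (no quoting support, and note that bufio.Scanner enforces a maximum line length by default):

```go
import (
	"bufio"
	"io"
	"strings"
)

// processByRow parses CSV-like input row by row as it is read, so the
// whole input never has to be held in memory at once.
func processByRow(r io.Reader, onRow func(values []string)) error {
	scanner := bufio.NewScanner(r)
	for scanner.Scan() { // reads up to the next newline
		onRow(strings.Split(scanner.Text(), ","))
	}
	return scanner.Err()
}
```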

When writing the streaming parser, we can’t simply write a function that processes some input and returns the output:

Instead, we need an API which can pause processing, wait for new data to come in, process the data that is received so far, and wait again for more data:

We can change the non-streaming CSV parser example such that it has a streaming API where you can pass the data chunk by chunk, and the data is processed line by line:

The most important difference is the API with methods push and flush, and a callback onRow. The parser section itself is reduced to only parsing a single row; parsing multiple rows is handled on top.
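
A sketch of that push/flush shape in Go (the article's JavaScript version differs in detail; the names Push, Flush, and OnRow mirror the API just described):

```go
import "strings"

// StreamingCsvParser accepts chunks via Push and emits complete rows
// through the OnRow callback; Flush handles the final unterminated row.
type StreamingCsvParser struct {
	buffer string // data received but not yet parsed
	OnRow  func(values []string)
}

// Push appends a chunk and emits every complete row it contains.
func (p *StreamingCsvParser) Push(chunk string) {
	p.buffer += chunk
	for {
		i := strings.IndexByte(p.buffer, '\n')
		if i < 0 {
			return // no complete row yet: wait for more data
		}
		p.OnRow(strings.Split(p.buffer[:i], ","))
		p.buffer = p.buffer[i+1:]
	}
}

// Flush processes any leftover data as the last row.
func (p *StreamingCsvParser) Flush() {
	if p.buffer != "" {
		p.OnRow(strings.Split(p.buffer, ","))
		p.buffer = ""
	}
}
```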

When dealing with an arbitrarily nested data structure, it is not possible to process items one by one in an easy way as with a flat collection. You have to keep the full data structure in memory whilst parsing. There are two ways to achieve this: either use generator functions when available, or write a parser that can pause and resume and manually keeps track of its state.

The first way to write a streaming parser is to use generator functions which can “pause” execution of a function using yield. Using a generator function, we can pause the parsing process when we need more input data, and then continue as soon as a new chunk of data comes in. Not every programming language has support for generator functions, but you can use them for example in JavaScript and Python.

The nice thing about this approach is that the code and the flow can be the same as when writing a regular (non-streaming) parser. The only thing needed is to change all functions into generator functions, and to replace the function that reads the next character from the input with a generator that pauses (yields) when there is no new data and continues as soon as new data is received. The CSV parser from before will look like:

We can adjust the CSV parser shared before to use generator functions. The nice thing is that the logic of the parser itself is the same as the original, non-streaming parser; only the public API changed, along with the inner nextCharacter function that can now pause to wait for more data:

The second approach to make a parser streaming is to write the parser in such a way that there are a lot of points at which you can pause the parser, such as after parsing a single value. When the parser pauses, awaiting new data, it needs to remember where it left off. When new data comes in, it must be able to resume processing where it left off purely based on the stored state. 

In pseudo code, this structure looks as follows: there is a function push that you can use to append new data. This function will process next steps, one by one, as long as there is data. At the end, a function flush is called to process any leftover data.
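
Since the pseudo code isn't reproduced here, the following Go sketch shows one way the pause-and-resume structure might look, with the in-progress row kept as explicit state between pushes (names are illustrative; it needs import "strings"):

```go
// StatefulCsvParser processes one token (a value ended by a comma or a
// newline) per step, remembering where it left off between chunks.
type StatefulCsvParser struct {
	buffer string   // unprocessed input received so far
	row    []string // values of the row currently being assembled
	OnRow  func(values []string)
}

// Push appends a chunk and processes as many whole tokens as possible.
func (p *StatefulCsvParser) Push(chunk string) {
	p.buffer += chunk
	for {
		i := strings.IndexAny(p.buffer, ",\n")
		if i < 0 {
			return // incomplete token: pause until more data arrives
		}
		p.row = append(p.row, p.buffer[:i])
		if p.buffer[i] == '\n' {
			p.OnRow(p.row) // a newline completes the row
			p.row = nil
		}
		p.buffer = p.buffer[i+1:]
	}
}

// Flush processes the trailing value and emits the final row, if any.
func (p *StatefulCsvParser) Flush() {
	if p.buffer != "" {
		p.row = append(p.row, p.buffer)
		p.buffer = ""
	}
	if len(p.row) > 0 {
		p.OnRow(p.row)
		p.row = nil
	}
}
```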

The upside of this approach is that there is no need for generator functions (which are not available in all languages). The downside is that you need to split the code of the parser into separate pieces that each process a single “token”, the smallest unit of data that we process: a delimiter, number, boolean, etc. The parser must be able to determine what token is expected next based on some state, and it must return both a new state and optional parsed output. This indirection results in more complex code that is harder to reason about.

The challenge of writing a parser this way becomes clear when looking at the rewritten version of the CSV parser shared before:

You see that the original flow that was still present in the three earlier versions of the CSV parser is turned upside down and split apart. The structure of this parser is prone to issues like infinite loops when the different states do not correctly follow one another. It is harder to see the overall flow of the parser. In this case the parser is quite minimal and it is still doable, but you can imagine that the complexity and room for bugs grow when parsing a more advanced data format.

More complex cases

So far, we used examples of a CSV parser and a JSON parser. These parsers are relatively straightforward. In other cases, you may come across more complex needs, such as the need to not just read the current character, but to look ahead or behind to determine what to do. That is for example the case in the jsonrepair library, which recently got streaming support. The library for example has to look behind to fix trailing commas after white space, or has to revert parsing of a string when it discovers that the string misses an end quote. In that case, it needs to parse the string again with a different strategy, stopping at the first next delimiter instead of an end quote. In the jsonrepair library, this is implemented using an input buffer and an output buffer, which keep a limited "moving window" of input and output available to read from and write to.

It is important to think through possible implications for memory. When using a streaming parser, the assumption is that memory usage will be limited. If parts of the data like a large string require more memory, the parser either has to throw an error (prompting the user to configure a large buffer), or it has to use more memory without informing the user, possibly blowing up memory usage. In general, this last option is not preferable.

Conclusion about writing a streaming parser

A streaming parser can be needed when processing large amounts of data. It is quite common that the data to be processed is a collection, like rows in a log file or items from a database. In that case, processing this data in a streaming way is quite straightforward. Processing an arbitrarily nested data structure in a streaming way is more challenging, but luckily also quite a niche need. There are two main approaches to go about that, and in essence, it must be possible to pause the parser to await more data, and then continue.

To summarize, here are the links to the four CSV parsers discussed throughout the article. You can compare them side by side and play around with the different concepts yourself:

  • CSV Parser (non-streaming)
  • CSV Parser (streaming, line by line)
  • CSV Parser (streaming, generator functions)
  • CSV Parser (streaming, pause and resume keeping state)


JavaScript JSON Parser

JSON (JavaScript Object Notation) is a popular lightweight data exchange format for sending data between a server and a client, or across various systems.

JSON data is parsed and interpreted using a software component or library called a JSON parser. Through the JSON parsing process, a JSON string is converted into a structured format that is easy to modify and access programmatically. Developers may deal with JSON data in a systematic and effective way thanks to the availability of JSON parsers in a variety of programming languages and frameworks.

JSON can be in the following two structures:

  • Arrays i.e. Ordered list of items /values
  • Objects i.e. Collection of key-value pairs

JSON parser reads and writes the formatted JSON data. It is used for mapping the JSON Object entries or attributes and the JavaScript objects, array, string, boolean, Number, etc. It can be performed in two types:

  • Mapping JSON types to Entries or Attributes
  • Mapping Entries or Attributes to JSON types

Mapping JSON types to Entries or Attributes: JSON types are mapped so that the entries are the values and the attributes are the properties holding those values. The structured data remains the same as it is converted to JavaScript objects.

Mapping Entries or Attributes to JSON types: Entries and attributes are converted to JSON objects so that the attributes become the object properties and the entries become the property values, maintaining the structure of the data from one form to the other.

JSON conversion is reciprocal: the data can be reformed back into the original objects from the converted state. The data itself remains the same; only the representation or outer form changes. Hence no data is lost and it can be used efficiently.

Importance of using JSON Parsing

  • Developers can transform JSON data into usable objects or data structures in their preferred programming language by using JSON parsing.
  • For managing APIs, obtaining data from databases, and processing data obtained from online services, JSON parsing is essential.
  • Developers may extract and use the necessary data by accessing particular data pieces inside a JSON structure thanks to JSON parsing.

JSON Parsing Methods

  • Using the JSON.parse() method
  • Fetching data from APIs or local JSON files

Method 1: Using JSON.parse() Method

JSON.parse() is a function included in JavaScript that supports JSON parsing. It transforms JSON text into a JavaScript object so that its attributes may be easily accessed.

Parameters: It takes a JSON string as a parameter to parse.

  • JSON.parse(): This method analyzes a JSON string and outputs an object to make its attributes accessible.

Example: The code example shows how to implement a JSON parser with JavaScript using JSON.parse() :

Method 2: Fetching Data from Local File

In this method, we will import a local JSON file and output the data on the console using the Node.js require method.

Example: In this method, we will use the require method to import the local data.json file and display the output.

There are more methods to read JSON files, which you can find here: https://www.geeksforgeeks.org/read-json-file-using-javascript/



  13. Writing a JSON Parser in TypeScript

    we'll be using TypeScript to implement a parser that can convert a JSON string into a JavaScript object. This parser will be able to handle JSON objects, arrays, strings, numbers, and the...

  14. Writing a Simple JSON Parser in Golang

    We'll explore the core components of a JSON parser: tokenization, where JSON data is broken down into identifiable tokens, and the construction of an Abstract Syntax Tree (AST), which organizes these tokens into a hierarchical structure.By the end of this guide, you'll gain a deeper understanding of these essential building blocks in the context of JSON parsing in Go.

  15. (Java) Writing JSON parser in 300 lines of code

    This and a recent lack of programming challenges inspired me to write this post. A parser is a software component that takes input data (frequently text) and builds a data structure — often some kind of parse tree, abstract syntax tree or other hierarchical structure, giving a structural representation of the input while checking for correct ...

  16. JSON.parse()

    Use the JavaScript function JSON.parse () to convert text into a JavaScript object: const obj = JSON.parse(' {"name":"John", "age":30, "city":"New York"}'); Make sure the text is in JSON format, or else you will get a syntax error. Use the JavaScript object in your page: Example <p id="demo"></p> <script>

  17. Writing a Fast JSON Parser

    Writing a Fast JSON Parser. Several holidays ago, I got a bee in my bonnet and wrote a fast JSON parser whose parsed AST fits in a single contiguous block of memory. The code was small and simple and the performance was generally on-par with RapidJSON, so I stopped and moved on with my life. Well, at the end of 2016, Rich Geldreich shared that ...

  18. Build Your Own JSON Parser

    Building a JSON parser is an easy way to learn about parsing techniques which are useful for everything from parsing simple data formats through to building a fully featured compiler for a programming language. Parsing is often broken up into two stages: lexical analysis and syntactic analysis.

  19. Writing a JSON parser from scratch

    1. Building a model to represent the JSON spec The JSON spec is available at json.org. I'll paraphase it here: A value can be a string or a number or a bool or null or an object or an array . These structures can be nested. A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes.

  20. How to write a streaming parser

    The first way to write a streaming parser is to use generator functions which can "pause" execution of a function using yield. Using a generator function, we can pause the parsing process when we need more input data, and then continue as soon as a new chunk of data comes in.

  21. JavaScript JSON Parser

    JavaScript JSON Parser. JSON (JavaScript Object Notation) is a popular lightweight data exchange format for sending data between a server and a client, or across various systems. JSON data is parsed and interpreted using a software component or library called a JSON parser. Through the JSON parsing process, a JSON string is converted into a ...

  22. Reading, Writing and Parsing JSON Files in Python

    In this course, Reading, Writing and Parsing JSON Files in Python, you'll learn the skills and practices you need to work and manipulate JSON data and files in Python easily. First, you'll explore how to import JSON data from a web API and a URL using Python modules. Next, you'll discover how to import JSON data from a file and describe ...