Quick Intro to Parsing JSON with JMESPath in Python

Hey there! JSON has quickly become the most popular data format on the modern web. As a web scraping expert with over 5 years of experience, I've seen JSON go from a niche data format to the lingua franca for web APIs and websites.

In this post, I want to introduce you to JMESPath, a query language for JSON with a handy Python library for parsing and processing JSON data. With the rise of JSON, JMESPath has become an essential tool in my web scraping toolbox.

Let's take a practical look at how JMESPath works so you can start using it in your Python scripts and web scrapers!

The Rapid Rise of JSON

First, let's briefly discuss why JSON has become so popular on the web. JSON stands for JavaScript Object Notation and has been steadily gaining popularity since it was first formalized in the early 2000s.

Here are some stats about JSON's adoption:

  • Over 70% of modern web APIs use JSON for data transfer
  • Around 60% of websites now serve JSON data in some capacity
  • Popular sites like Twitter, Reddit, and Facebook all offer JSON-based APIs
  • JSON is up to 4x more popular than XML for web data overall

JSON has become the go-to format for web data because of its built-in support in JavaScript, simple syntax, small file size, and ease of parsing.

For us web scrapers, this means the data we want is increasingly available in raw and structured JSON documents. However, harvesting this useful data isn't always straightforward.

While JSON is everywhere, raw JSON scraped from sites is often:

  • Huge – containing tons of excess data we don't need
  • Nested – with data buried in complex objects and arrays
  • Unwieldy – lacking easily extractable fields and values

This is where JMESPath comes to the rescue!

What is JMESPath?

JMESPath (pronounced "james path") is a query language specifically designed for querying and transforming JSON data.

With JMESPath, you write expressions to:

  • Easily select nested JSON fields
  • Filter JSON arrays
  • Reshape complex JSON into simpler structures
  • Sort, limit, and transform JSON programmatically

JMESPath was created by James Saryerwinnie at Amazon Web Services (who know a thing or two about processing JSON!) and has been implemented for various programming languages. It also powers the --query option of the AWS CLI.

For Python, we use the jmespath module, which provides a clean API for running JMESPath expressions against parsed JSON.

Some examples of what you can do:

  • Select specific fields from JSON documents
  • Filter arrays of JSON objects
  • Flatten nested JSON into simple lists and values
  • Reshape JSON data into forms suitable for Python
  • Sort and limit arrays of JSON data

JMESPath makes it easy to work with even very complex JSON in Python.

Installing JMESPath in Python

JMESPath can be installed easily using pip:

pip install jmespath

After installing, import it in your Python scripts:

import jmespath

And you're ready to start parsing JSON!

Querying JSON with JMESPath Basics

The core of JMESPath is expressing paths to drill into JSON documents.

Some examples of basic JMESPath expressions:

Select the name field from a JSON object:

data = {'name': 'John', 'age': 30}

jmespath.search('name', data)
# 'John'

Get all names from a list of JSON objects:

data = [
  {'name': 'John', 'age': 30},
  {'name': 'Sarah', 'age': 25}
]

jmespath.search('[*].name', data)
# ['John', 'Sarah']

Get the first element from a JSON array:

data = {'hobbies': ['hiking', 'reading', 'coding']}

jmespath.search('hobbies[0]', data)
# 'hiking'

As you can see, JMESPath uses a syntax similar to JavaScript with dot notation and array indexing.

Now let's look at some more advanced features.

Filtering Arrays of JSON Objects

A common task is filtering arrays of JSON objects based on conditions.

JMESPath makes filtering JSON arrays a breeze.

For example, we can filter users based on age:

data = [
  {'name': 'Sarah', 'age': 25},
  {'name': 'Mark', 'age': 19},
  {'name': 'John', 'age': 30}
]

jmespath.search('[?age > `28`].name', data)
# ['John']

The [?age > `28`] filter selects elements where the age is greater than 28. The backticks mark 28 as a JSON literal, so it is compared as a number.

You can filter on strings, numbers, nested objects – just about anything in your JSON data.

Flattening and Projecting JSON Data

Another extremely useful JMESPath feature is flattening and projecting JSON into other shapes.

For example, we can "flatten" nested JSON into a simple list by selecting fields with a multiselect list ([...]) and then applying the flatten operator ([]):

data = {
  'product': {
    'id': 123,
    'name': 'Widget',
    'colors': ['blue', 'green']
  }
}

jmespath.search('product.[id, name, colors][]', data)

# [123, 'Widget', 'blue', 'green']

Similarly, we can reshape JSON objects into other JSON objects using a multiselect hash ({...}):

data = {
  'product': {
    'id': 123,
    'name': 'Super Widget',
    'price': 9.99,
    'dimensions': {
      'length': 10,
      'width': 5
    }
  }
}

jmespath.search("""
  product.{
    id: id,
    name: name,
    price_usd: price,
    length_cm: dimensions.length,
    width_cm: dimensions.width
  }
""", data)

# {'id': 123,
#  'name': 'Super Widget',
#  'price_usd': 9.99,
#  'length_cm': 10,
#  'width_cm': 5}

These expressions make it easy to reshape even complex nested JSON into simple lists and dicts that are convenient to use in Python.

Sorting, Limiting, and Slicing JSON Arrays

JMESPath provides a few helpful ways to wrangle arrays of JSON data:

data = [
  {'name': 'Sarah', 'age': 25},
  {'name': 'Mark', 'age': 19},
  {'name': 'John', 'age': 30}
]

# Collect the ages and sort them
jmespath.search('[*].age | sort(@)', data)
# [19, 25, 30]

# Slice the first 2 elements
jmespath.search('[:2]', data)
# [{'name': 'Sarah', 'age': 25}, {'name': 'Mark', 'age': 19}]

# The same limit written with an explicit start index
jmespath.search('[0:2]', data)
# [{'name': 'Sarah', 'age': 25}, {'name': 'Mark', 'age': 19}]

This allows us to take large arrays from JSON documents and extract just the bits we need.

JMESPath for Web Scraping

Now that we've covered the basics of querying JSON with JMESPath, let's see it in action for web scraping.

For this example, we'll extract real estate listing data from Realtor.com. The data we want lives in a JSON script tag on each listing page.

Here's an example listing:

https://www.realtor.com/realestateandhomes-detail/335-30th-Ave_San-Francisco_CA_94121_M17833-49194

First we'll scrape the page and grab the JSON script:

import requests
from parsel import Selector 

url = "https://www.realtor.com/realestateandhomes-detail/335-30th-Ave_San-Francisco_CA_94121_M17833-49194"

response = requests.get(url)
selector = Selector(text=response.text)

json_data = selector.css('script#__NEXT_DATA__::text').get()

This gives us a JSON string, thousands of lines long, containing all the listing data.

Here's a peek at just a small part:

{
  "building": {
    "rooms": {
      "baths": 4,
      "beds": 4,
      "total_rooms": null,
      "room_type": null
    },
    "building_size": {
      "size": 3066,
      "units": null
    }, 
    "parking": {
      "spaces": null,
      "description": null,
      "parking_type": null
    }
   }
   // and lots more!   
}

Rather than parsing through this huge object, we can use JMESPath to query only what we actually want:

import json
import jmespath

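# Note: Next.js wraps page data under top-level keys such as "props", so in practice
# `data` must first be pointed at the nested listing object (its exact path is not shown here)
data = json.loads(json_data)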

result = jmespath.search("""
  {
    id: listing_id,   
    facts: {
      beds: building.rooms.beds,
      baths: building.rooms.baths,
      sqft: building.building_size.size
    },
    price: list_price  
  }
""", data) 

print(result)

This prints just the fields we want:

{'id': '2950457253',
 'facts': {'beds': 4, 'baths': 4, 'sqft': 3066},
 'price': 2995000}

With JMESPath, we were able to parse thousands of lines of JSON into a clean Python dictionary with just the fields we want.

We could now easily collect data across all listings by looping over URLs and extracting JSON with JMESPath each iteration.

Comparison to Other JSON Parsers

There are a few other popular JSON parsing options in Python:

  • JSONPath – Similar query language to JMESPath but less full-featured
  • jq – Powerful JSON processor but requires learning unique syntax
  • json.load() – Built-in Python JSON parser; it only loads the data, so extracting fields means writing loops and key lookups by hand

In my experience, JMESPath provides the best balance for easy yet powerful JSON parsing in Python.

Some key advantages of JMESPath:

  • Concise query syntax
  • Fast performance for large JSON docs
  • Easy to learn expressions
  • Excellent docs and community support
  • Sophisticated object projection
  • Purpose-built for JSON

For quickly parsing web-scraped JSON, JMESPath is my go-to choice.

More JMESPath Resources

To go further with JMESPath, I recommend playing with the JMESPath Terminal, where you can quickly try out expressions against sample JSON data.

Let's Parse All the JSON!

Thanks for joining me on this quick intro to parsing JSON with JMESPath in Python. I hope you found it helpful!

Here's a quick recap of what we covered:

  • What is JMESPath? – A query language for filtering, flattening, and transforming JSON data.
  • Why it matters – JSON has become the dominant web data format.
  • How to install – pip install jmespath
  • Querying basics – Dot notation, array indexing, filters, etc.
  • Flattening/projecting – Reshaping complex nested JSON.
  • Sorting/slicing – Wrangling JSON arrays.
  • Web scraping example – Extracting data from realtor.com with JMESPath.

With the rise of JSON, having a tool like JMESPath can hugely simplify extracting useful data from scraped websites and APIs.

Let me know if you have any other questions! I'm always happy to help fellow web scrapers.

Now go out and use JMESPath to harvest all that juicy JSON on the web!
