Parsing JavaScript objects into Python dictionaries

Project description

Usage

chompjs can be used in web scraping to turn JavaScript objects embedded in pages into valid Python dictionaries.

>>> import chompjs
>>> chompjs.parse_js_object('{"my_data": "test"}')
{u'my_data': u'test'}

Think of it as a more powerful json.loads. For example, it can handle JavaScript objects containing embedded methods by storing their code as a string:

>>> import chompjs
>>> js = """
... var myObj = {
...     myMethod: function(params) {
...         // ...
...     },
...     myValue: 100
... }
... """
>>> chompjs.parse_js_object(js, json_params={'strict': False})
{'myMethod': 'function(params) {\n        // ...\n    }', 'myValue': 100}

An example usage with scrapy:

import chompjs
import scrapy


class MySpider(scrapy.Spider):
    # ...

    def parse(self, response):
        script_css = 'script:contains("__NEXT_DATA__")::text'
        script_pattern = r'__NEXT_DATA__ = (.*);'
        # warning: for some pages you need to pass replace_entities=True
        # into re_first to have JSON escaped properly
        script_text = response.css(script_css).re_first(script_pattern)
        try:
            json_data = chompjs.parse_js_object(script_text)
        except ValueError:
            self.log('Failed to extract data from {}'.format(response.url))
            return

        # work on json_data

If the input string has not been unescaped yet and contains a lot of \\ characters, the unicode_escape=True argument might help to sanitize it:

>>> chompjs.parse_js_object('{\\\"a\\\": 12}', unicode_escape=True)
{u'a': 12}
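
For intuition, comparable un-escaping can be done with the standard library alone. The sketch below is not taken from chompjs: the helper name unescape_backslashes is hypothetical, and it assumes ASCII-only input, because the unicode_escape codec would otherwise mangle non-ASCII characters.

```python
import json


def unescape_backslashes(text):
    """Interpret backslash escape sequences in an ASCII-only string."""
    # encode("ascii") fails loudly on non-ASCII input instead of
    # silently producing mojibake.
    return text.encode("ascii").decode("unicode_escape")


raw = '{\\"a\\": 12}'            # the characters {\"a\": 12}
cleaned = unescape_backslashes(raw)
data = json.loads(cleaned)
print(cleaned)
print(data)
```

After un-escaping, the text is plain JSON, so json.loads is enough for this particular input.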

jsonlines=True can be used to parse JSON Lines:

>>> chompjs.parse_js_object('[1,2]\n[2,3]\n[3,4]', jsonlines=True)
[[1, 2], [2, 3], [3, 4]]
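
When every line is already strictly valid JSON, the same shape of result can be produced with the standard library. This is only a sketch for comparison (parse_json_lines is a hypothetical helper, not part of chompjs), and it will fail on lines that contain JavaScript rather than JSON.

```python
import json


def parse_json_lines(text):
    """Parse each non-empty line as a separate JSON document."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


rows = parse_json_lines('[1,2]\n[2,3]\n[3,4]')
print(rows)
```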

By default, chompjs starts at the first { or [ character it finds, omitting the rest:

>>> chompjs.parse_js_object('<div>...</div><script>foo = [1, 2, 3];</script><div>...</div>')
[1, 2, 3]
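
The idea of skipping ahead to the first bracket can be imitated for strictly valid JSON with json.JSONDecoder.raw_decode, which decodes one document starting at a given index and ignores whatever follows. This is a sketch under that assumption, not a description of how chompjs is implemented; decode_from_first_bracket is a hypothetical name.

```python
import json
import re


def decode_from_first_bracket(text):
    """Decode the JSON document that starts at the first '{' or '['."""
    match = re.search(r"[{\[]", text)
    if match is None:
        raise ValueError("no '{' or '[' found in input")
    # raw_decode returns (object, index just past the document);
    # trailing text after the document is simply not consumed.
    obj, _end = json.JSONDecoder().raw_decode(text, match.start())
    return obj


html = '<div>...</div><script>foo = [1, 2, 3];</script><div>...</div>'
result = decode_from_first_bracket(html)
print(result)
```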

The json_params argument can be used to pass options to the underlying json.loads, such as strict or object_hook:

>>> import decimal
>>> import chompjs
>>> chompjs.parse_js_object('[23.2]', json_params={'parse_float': decimal.Decimal})
[Decimal('23.2')]
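
Because these options are forwarded to json.loads, their effect can be explored with the standard library directly. The snippet below uses only json.loads; the hook function upper_keys is a made-up example.

```python
import decimal
import json

# parse_float: build Decimal objects instead of floats.
prices = json.loads('[23.2]', parse_float=decimal.Decimal)

# strict=False: accept raw control characters (here a real newline)
# inside JSON strings, which strict mode rejects.
note = json.loads('{"note": "line one\nline two"}', strict=False)


# object_hook: called with every decoded object, innermost first.
def upper_keys(obj):
    return {key.upper(): value for key, value in obj.items()}


upper = json.loads('{"a": {"b": 1}}', object_hook=upper_keys)

print(prices, note, upper)
```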

Rationale

In web scraping, data is often not present directly in the HTML, but is instead provided as an embedded JavaScript object that is later used to initialize the page, for example:

<html>
<head>...</head>
<body>
...
<script type="text/javascript">window.__PRELOADED_STATE__={"foo": "bar"}</script>
...
</body>
</html>

The standard library function json.loads is usually sufficient to extract this data:

>>> # scrapy shell file:///tmp/test.html
>>> import json
>>> script_text = response.css('script:contains(__PRELOADED_STATE__)::text').re_first('__PRELOADED_STATE__=(.*)')
>>> json.loads(script_text)
{u'foo': u'bar'}
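
The same extraction can be tried without scrapy, using only re and json. The regular expression here is illustrative and assumes the object is immediately followed by the closing script tag, as in the sample page above.

```python
import json
import re

html = """<html>
<head>...</head>
<body>
<script type="text/javascript">window.__PRELOADED_STATE__={"foo": "bar"}</script>
</body>
</html>"""

# Capture everything between the assignment and the closing tag.
match = re.search(r"__PRELOADED_STATE__=(.*?)</script>", html, re.DOTALL)
state = json.loads(match.group(1)) if match else None
print(state)
```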

The problem is that not every valid JavaScript object is also valid JSON. For example, all of the following strings are valid JavaScript objects but not valid JSON:

"{'a': 'b'}" is not valid JSON because it uses the ' character for quoting
'{a: "b"}' is not valid JSON because the property name is not quoted at all
'{"a": [1, 2, 3,]}' is not valid JSON because there is an extra , character at the end of the array
'{"a": .99}' is not valid JSON because the float value lacks a leading 0

As a result, json.loads fails to parse any of them:

>>> json.loads("{'a': 'b'}")
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 380, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting property name: line 1 column 2 (char 1)
>>> json.loads('{a: "b"}')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 380, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting property name: line 1 column 2 (char 1)
>>> json.loads('{"a": [1, 2, 3,]}')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
>>> json.loads('{"a": .99}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 7 (char 6)
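
On Python 3, the four failures can be checked in one loop. json.JSONDecodeError is a subclass of ValueError, so code that catches ValueError keeps working. The exact error messages differ between Python versions, so the sketch records them rather than assuming their wording.

```python
import json

samples = [
    "{'a': 'b'}",          # single-quoted strings
    '{a: "b"}',            # unquoted property name
    '{"a": [1, 2, 3,]}',   # trailing comma
    '{"a": .99}',          # float without a leading zero
]

errors = {}
for sample in samples:
    try:
        json.loads(sample)
    except json.JSONDecodeError as exc:
        errors[sample] = exc.msg

for sample, message in errors.items():
    print(sample, "->", message)
```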

The chompjs library was designed to work around this limitation, and it parses such JavaScript objects into proper Python dictionaries:

>>> import chompjs
>>> 
>>> chompjs.parse_js_object("{'a': 'b'}")
{u'a': u'b'}
>>> chompjs.parse_js_object('{a: "b"}')
{u'a': u'b'}
>>> chompjs.parse_js_object('{"a": [1, 2, 3,]}')
{u'a': [1, 2, 3]}

Internally, chompjs uses a parser written in C to iterate over the raw string, fixing its issues along the way. The result is then passed to the standard library's json.loads, which keeps parsing fast compared to full-blown JavaScript parsers such as demjson.

>>> import json
>>> import chompjs
>>> import _chompjs
>>> 
>>> _chompjs.parse('{a: 1}')
'{"a":1}'
>>> json.loads(_)
{u'a': 1}
>>> chompjs.parse_js_object('{"a": .99}')
{'a': 0.99}
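
To make the "fix the text, then call json.loads" idea concrete, here is a deliberately naive pure-Python sketch. It is not chompjs's algorithm: it is a hypothetical regex-based helper (naive_fix) that handles only unquoted keys and trailing commas, and it would corrupt string values that happen to contain similar patterns, which is why a real character-by-character parser is needed.

```python
import json
import re


def naive_fix(js_text):
    """Turn a small subset of JavaScript object syntax into JSON text."""
    # Quote bare identifier keys: {a: 1} -> {"a": 1}
    fixed = re.sub(r'([{,]\s*)([A-Za-z_$][\w$]*)(\s*:)', r'\1"\2"\3', js_text)
    # Drop trailing commas before a closing bracket: [1, 2,] -> [1, 2]
    fixed = re.sub(r',\s*([\]}])', r'\1', fixed)
    return fixed


json_text = naive_fix('{a: 1}')
print(json_text)
print(json.loads(json_text))
print(json.loads(naive_fix('{"a": [1, 2, 3,]}')))
```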

Installation

From PyPI, using pip:

$ python3 -m venv venv
$ . venv/bin/activate
$ pip install chompjs

From sources:

$ git clone https://github.com/Nykakin/chompjs
$ cd chompjs
$ python setup.py build
$ python setup.py install

To run the unit tests:

$ python -m unittest

Release history

1.4.0

Aug 4, 2025

1.3.2

May 25, 2025

1.3.1

Mar 10, 2025

1.3.0

Aug 17, 2024

1.2.4

Jun 13, 2024

1.2.3

Jan 31, 2024

1.2.2

Jun 2, 2023

1.2.1

May 14, 2023

1.2.0

May 8, 2023

1.1.9

Nov 17, 2022

This version

1.1.8

Aug 22, 2022

1.1.7

Aug 22, 2022

1.1.6

Dec 7, 2021

1.1.5

Dec 2, 2021

1.1.4

Jul 29, 2021

1.1.3

May 24, 2021

1.1.2

Apr 24, 2021

1.1.1

Apr 3, 2021

1.1.0

Apr 3, 2021

1.0.16

Jul 1, 2020

1.0.15

Jun 15, 2020

1.0.14

Apr 30, 2020

1.0.13

Apr 30, 2020

1.0.12

Apr 30, 2020

1.0.11

Apr 25, 2020

1.0.10

Apr 2, 2020

1.0.9

Apr 1, 2020

1.0.8

Apr 1, 2020

1.0.7

Apr 1, 2020

1.0.6

Mar 29, 2020

1.0.5

Mar 26, 2020

1.0.4

Mar 26, 2020

1.0.3

Mar 20, 2020

1.0.2

Mar 20, 2020

1.0.1

Feb 27, 2020

Download files

Source Distribution

chompjs-1.1.8.tar.gz (13.8 kB)

Uploaded Aug 22, 2022 Source

File details

Details for the file chompjs-1.1.8.tar.gz.

File metadata

Download URL: chompjs-1.1.8.tar.gz
Upload date: Aug 22, 2022
Size: 13.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.3.0 pkginfo/1.4.2 requests/2.25.1 setuptools/65.2.0 requests-toolbelt/0.9.1 tqdm/4.57.0 CPython/3.9.5

File hashes

Hashes for chompjs-1.1.8.tar.gz
Algorithm	Hash digest
SHA256	`e6cf7f4ad689e4aa5fdf7e54956fef62074de283c694f50ee453e480e2b752b2`
MD5	`f2362485f75f0b3e125a76936a35c09c`
BLAKE2b-256	`60967d63ee0da9efd16464880c852dbc23f95f71dc66bf8c310e42e2fa2d7084`
