A Python NextJS data parser from HTML
NJSParser
A powerful parser and explorer for any website built with NextJS.
- Parses flight data (from the self.__next_f.push scripts).
- Parses next data from the __NEXT_DATA__ script.
- Parses build manifests.
- Searches for build id.
- Many other things ...
It uses only lxml, orjson, and pydantic to guarantee fast and efficient data parsing and processing.
Installation:
pip install njsparser
Use
CLI
You can run the CLI with any of these 3 commands:
- njsp
- njsparser
- python3 -m njsparser.cli
It has a single function: displaying information about the website. For more information, use the --help argument with the command.
Parsing __next_f.
The data you find in __next_f is called flight data, and contains data in React's format. You can parse it easily with njsparser, as follows.
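To get a feel for what flight data looks like before it is parsed, here is a minimal standard-library sketch. It is not how njsparser works internally; it assumes each push call fills its own script tag with a JSON-compatible argument of the form [1, "text"], and that the joined text is made of newline-separated rows shaped like id:body. The names PUSH_RE, extract_flight_chunks, split_rows and sample_html are hypothetical, and real pages contain row types (such as length-prefixed text rows) that this sketch does not handle.

```python
import json
import re

# Captures the argument of each self.__next_f.push(...) call that fills a whole <script> tag.
PUSH_RE = re.compile(
    r"<script[^>]*>\s*self\.__next_f\.push\((.*?)\)\s*;?\s*</script>", re.DOTALL
)


def extract_flight_chunks(html: str) -> list[str]:
    """Return the text carried by each push call, in page order."""
    chunks = []
    for match in PUSH_RE.finditer(html):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # Not plain JSON: out of scope for this sketch.
        # Assumption: useful payloads look like [1, "<text>"].
        if isinstance(payload, list) and len(payload) == 2 and isinstance(payload[1], str):
            chunks.append(payload[1])
    return chunks


def split_rows(chunks: list[str]) -> dict[str, str]:
    """Join the chunks and split them into {row id: raw row body}."""
    rows = {}
    for line in "".join(chunks).split("\n"):
        row_id, sep, body = line.partition(":")
        if sep:
            rows[row_id] = body
    return rows


# Hypothetical page fragment, made up for illustration.
sample_html = (
    '<script>self.__next_f.push([1,"0:[\\"$\\",\\"div\\",null,{}]\\n"])</script>'
    '<script>self.__next_f.push([1,"1:{\\"user\\":{\\"tagline\\":\\"hi\\"}}\\n"])</script>'
)
rows = split_rows(extract_flight_chunks(sample_html))
print(rows["1"])
```

The row bodies are left as raw strings on purpose: turning them into typed objects is the part njsparser handles for you.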
We will build a parser for the flight data example
- In the website you want to parse, make sure you see self.__next_f.push at the beginning of the script containing the data you are searching for. Here I am searching for the description "I should really have a better hobby, but this is it..." in my page, and I can also see the self.__next_f.push.
- Then I write this simple script to parse and dump the flight data of my website, and see what objects I am searching for:
import requests
import njsparser
import json

# Here I get my page's html
response = requests.get("https://mediux.pro/user/r3draid3r04").text
# Then I parse it with njsparser
fd = njsparser.BeautifulFD(response)
# Then I will write to json the content of the flight data
with open("fd.json", "w") as write:
    # I use the njsparser.default function to support the dump of the flight data objects.
    json.dump(fd, write, indent=4, default=njsparser.default)
- In my dumped flight data, I will search for the same string:
- Then I go to the closest "value" root of my found string, and look at the value of "cls". Here it is "Data":
- Now that I know the "cls" (class) of the object my data is contained in, I can search for it in my BeautifulFD object:

import requests
import njsparser

# Here I get my page's html
response = requests.get("https://mediux.pro/user/r3draid3r04").text
# Then I parse it with njsparser
fd = njsparser.BeautifulFD(response)
# Then I iterate over the different classes `Data` in my flight data.
for data in fd.find_iter([njsparser.T.Data]):
    # Then I make sure that the content of my data is not None, and
    # check if the key `"user"` is in the data's content. If it is,
    # then I break the loop of searching.
    if data.content is not None and "user" in data.content:
        break
else:
    # If I didn't find it, I raise an error
    raise ValueError
# Now I have the data of my user
user = data.content["user"]
# And I can print the string I was searching for before
print(user["tagline"])
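If you would rather not scroll through the dumped file by hand, the "search for the string, then look at the closest cls" step can be sketched with the standard library alone. This assumes only what is described above: each dumped object is a JSON object with a "value" and a "cls" key. The function name find_enclosing_cls and the sample_dump content are hypothetical, and a real dump may carry more keys.

```python
from typing import Any, Iterator, Optional


def find_enclosing_cls(
    node: Any, needle: str, current_cls: Optional[str] = None
) -> Iterator[Optional[str]]:
    """Yield the "cls" of the closest enclosing object for every string containing needle."""
    if isinstance(node, dict):
        cls = node.get("cls")
        # Assumption: a dumped flight data object has both "cls" and "value".
        if isinstance(cls, str) and "value" in node:
            current_cls = cls
        for child in node.values():
            yield from find_enclosing_cls(child, needle, current_cls)
    elif isinstance(node, list):
        for child in node:
            yield from find_enclosing_cls(child, needle, current_cls)
    elif isinstance(node, str) and needle in node:
        yield current_cls


# Hypothetical excerpt of a dump, made up for illustration.
sample_dump = {
    "0": {"value": ["$", "div", None, {}], "cls": "Element"},
    "1": {
        "value": {"user": {"tagline": "I should really have a better hobby, but this is it..."}},
        "cls": "Data",
    },
}
print(list(find_enclosing_cls(sample_dump, "better hobby")))
```

With a real dump, you would load it first with json.load on the fd.json file written earlier and pass the result in place of sample_dump.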
More information:
- If your object is inside another object (e.g. "Data" in a "DataParent", or in a "DataContainer"), .find_iter will also find it recursively (unless you set recursive=False).
- Make sure you use the correct attributes of the flight data classes when fetching their data. The class "Data" has a .content attribute. If you use .value, you will end up with the raw value and will have to parse it yourself. If you work with a "DataParent" object, instead of using .value (which will give you ["$", "$L16", None, {"children": ["$", "$L17", None, {"profile": {}}]}]), use .children (which will give you a "Data" object with a .content of {"profile": {}}). Check the type file to see which classes you're interested in, and their attributes.
- You can also use .find on BeautifulFD to return only the first occurrence of your query, or None if not found.
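To see why .children is more convenient than .value on a "DataParent", here is a rough sketch of the unwrapping you would otherwise write by hand. It assumes the raw value is a nested 4-item list of the form ["$", type, key, props], as in the example above. The helpers is_element and innermost_props are hypothetical and are not part of njsparser.

```python
from typing import Any


def is_element(value: Any) -> bool:
    """True for a raw 4-item list shaped like ["$", type, key, props]."""
    return isinstance(value, list) and len(value) == 4 and value[0] == "$"


def innermost_props(element: list) -> Any:
    """Follow props["children"] while it is itself an element, then return the last props."""
    props = element[3]
    child = props.get("children") if isinstance(props, dict) else None
    if is_element(child):
        return innermost_props(child)
    return props


# The raw value quoted in the note above.
raw_value = ["$", "$L16", None, {"children": ["$", "$L17", None, {"profile": {}}]}]
print(innermost_props(raw_value))
```

This only follows a single child; lists of children and other shapes are left out, which is one more reason to rely on the typed attributes instead.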
Parsing <script id='__NEXT_DATA__'>
Just do:
import njsparser
html_text = ...
data = njsparser.get_next_data(html_text)
If the page contains a <script id='__NEXT_DATA__'> script, it returns the parsed JSON data; otherwise it returns None.
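For illustration, the behaviour described above (parsed JSON if the script is present, None otherwise) can be approximated with the standard library. This is a sketch, not njsparser's implementation, which relies on lxml and orjson. The names _NextDataFinder and sketch_get_next_data are hypothetical.

```python
import json
from html.parser import HTMLParser
from typing import Any, Optional


class _NextDataFinder(HTMLParser):
    """Collects the text of the first <script id="__NEXT_DATA__"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self._inside = False
        self.raw: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only the first matching script is captured.
        if tag == "script" and self.raw is None and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True
            self.raw = ""

    def handle_data(self, data):
        if self._inside:
            self.raw += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False


def sketch_get_next_data(html: str) -> Optional[Any]:
    """Return the JSON-decoded content of the __NEXT_DATA__ script, or None."""
    finder = _NextDataFinder()
    finder.feed(html)
    finder.close()
    return json.loads(finder.raw) if finder.raw else None


# Hypothetical page, made up for illustration.
sample_page = (
    '<html><body><script id="__NEXT_DATA__" type="application/json">'
    '{"buildId": "abc", "props": {}}</script></body></html>'
)
print(sketch_get_next_data(sample_page))
```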