
Combine XPath, CSS selectors, and JSONPath for web data extraction.


Quickstarts

Installation

Install the stable version from PyPI.

pip install "data-extractor[jsonpath-extractor]"  # for extracting JSON data
pip install "data-extractor[lxml]"  # for extracting HTML data

Or install the latest version from GitHub.

pip install "data-extractor[jsonpath-extractor] @ git+https://github.com/linw1995/data_extractor.git@master"

Extract JSON data

JSON extraction relies on an optional dependency, such as the jsonpath-extractor extra shown under Installation. Install one of the supported optional dependencies before extracting JSON data.

Extract HTML(XML) data

HTML (XML) extraction likewise relies on an optional dependency, such as the lxml extra shown under Installation.
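
Because both kinds of extraction depend on optional packages, it can help to check what is importable before building extractors. The following is a minimal sketch using only the standard library. The helper name available_modules is hypothetical, and the import names "jsonpath" and "lxml" are assumptions about what the extras install, not something this README states.

```python
from importlib.util import find_spec


def available_modules(candidates):
    """Return the top-level module names from candidates that can be imported."""
    return [name for name in candidates if find_spec(name) is not None]


# Assumed import names (not taken from this README): "jsonpath" for the
# jsonpath-extractor extra and "lxml" for the lxml extra.
print("JSON backends found:", available_modules(["jsonpath"]))
print("HTML backends found:", available_modules(["lxml"]))
```

An empty list for either line suggests that the corresponding extra still needs to be installed.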

Usage

from data_extractor import Field, Item, JSONExtractor


class Count(Item):
    followings = Field(JSONExtractor("countFollowings"))
    fans = Field(JSONExtractor("countFans"))


class User(Item):
    name_ = Field(JSONExtractor("name"), name="name")
    age = Field(JSONExtractor("age"), default=17)
    count = Count()


assert User(JSONExtractor("data.users[*]"), is_many=True).extract(
    {
        "data": {
            "users": [
                {
                    "name": "john",
                    "age": 19,
                    "countFollowings": 14,
                    "countFans": 212,
                },
                {
                    "name": "jack",
                    "description": "",
                    "countFollowings": 54,
                    "countFans": 312,
                },
            ]
        }
    }
) == [
    {"name": "john", "age": 19, "count": {"followings": 14, "fans": 212}},
    {"name": "jack", "age": 17, "count": {"followings": 54, "fans": 312}},
]
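
To make the behaviour of the example above easier to follow, here is a rough plain-Python equivalent of what the User and Count items produce for that input. It is a sketch for intuition only: the function name extract_users is hypothetical, and this is not how the library is implemented.

```python
def extract_users(payload, default_age=17):
    """Hand-written equivalent of the User/Count items, for illustration only."""
    results = []
    for user in payload["data"]["users"]:  # mirrors the "data.users[*]" path
        results.append(
            {
                "name": user["name"],  # the name_ field is exposed as "name"
                "age": user.get("age", default_age),  # falls back like default=17
                "count": {  # the nested Count item reads from the same element
                    "followings": user["countFollowings"],
                    "fans": user["countFans"],
                },
            }
        )
    return results


sample = {"data": {"users": [{"name": "jack", "countFollowings": 54, "countFans": 312}]}}
print(extract_users(sample))
```

The second user in the original example has no "age" key, which is why the declared default of 17 appears in the expected result.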

Changelog

v1.0.1

Build

  • Supports Python 3.13

Contributing

Environment Setup

Clone the source code from GitHub.

git clone https://github.com/linw1995/data_extractor.git
cd data_extractor

Set up the development environment. Make sure the pdm, pre-commit, and nox CLIs are installed first.

make init
make PYTHON=3.7 init  # for a specific Python version

Linting

Use pre-commit to install the linters that enforce a consistent code style.

make pre-commit

Run the linters. Some of them run through the nox CLI, so make sure it is installed.

make check-all

Testing

Run quick tests.

make

Run quick tests with verbose output.

make vtest

Run tests with coverage. Testing across multiple Python environments is powered by the nox CLI.

make cov


