Discarding duplicate URLs based on rules.

duplicate-url-discarder contains a Scrapy fingerprinter that uses customizable URL processors to canonicalize URLs before fingerprinting.

Quick Start

Installation

pip install duplicate-url-discarder

Alternatively, you can install the predefined rules from duplicate-url-discarder-rules at the same time via:

pip install duplicate-url-discarder[rules]

If such rules are installed, they are used automatically whenever the DUD_LOAD_RULE_PATHS setting is left empty (see Configuration).

Requires Python 3.9+.

Using

If you use Scrapy >= 2.10 you can enable the fingerprinter by enabling the provided Scrapy add-on:

ADDONS = {
    "duplicate_url_discarder.Addon": 600,
}

If you are using other Scrapy add-ons that modify the request fingerprinter, such as the scrapy-zyte-api add-on, configure this add-on with a higher priority value so that the fallback fingerprinter is set to the correct value.

With older Scrapy versions you need to enable the fingerprinter directly:

REQUEST_FINGERPRINTER_CLASS = "duplicate_url_discarder.Fingerprinter"

If you were already using a non-default request fingerprinter, be it one you implemented yourself or one from a Scrapy plugin such as scrapy-zyte-api, set it as the fallback:

DUD_FALLBACK_REQUEST_FINGERPRINTER_CLASS = "scrapy_zyte_api.ScrapyZyteAPIRequestFingerprinter"

duplicate_url_discarder.Fingerprinter makes canonical forms of request URLs and fingerprints those using the configured fallback fingerprinter (the default Scrapy one, unless another is configured via the DUD_FALLBACK_REQUEST_FINGERPRINTER_CLASS setting). Requests whose "dud" meta value is set to False are fingerprinted directly, without making a canonical form.

URL Processors

duplicate-url-discarder utilizes URL processors to make canonical versions of URLs. The processors are configured with URL rules. Each URL rule specifies a URL pattern for which the processor applies, and the processor arguments to use.

The following URL processors are currently available:

  • queryRemoval: removes the query string parameters (key=value pairs) whose keys are listed in the arguments. If a given key appears multiple times with different values in the URL, all occurrences are removed.

  • queryRemovalExcept: like queryRemoval, but the keys listed in the arguments are kept while all others are removed.

  • subpathRemoval: removes path segments of a URL by their integer positions.

  • normalizer: removes the trailing / and the www. prefix, including numbered variants such as www2..
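To illustrate what queryRemoval does, here is a sketch of the behaviour in plain Python using only the standard library; this is not the library's actual implementation, just an approximation of the documented semantics:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def query_removal(url: str, keys: list[str]) -> str:
    """Drop the given query parameters from a URL, keeping the rest.

    Illustrative reimplementation of the queryRemoval semantics; the
    real processor lives inside duplicate-url-discarder.
    """
    parts = urlsplit(url)
    kept = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in keys
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Every occurrence of a listed key is removed, regardless of its value:
print(query_removal(
    "https://foo.example/p?id=1&utm_source=a&utm_source=b",
    ["utm_source"],
))
# -> https://foo.example/p?id=1
```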

URL Rules

A URL rule is a dictionary specifying the url-matcher URL pattern(s), the URL processor name, the URL processor args and the order that is used to sort the rules. They are loaded from JSON files that contain arrays of serialized rules:

[
  {
    "args": [
      "foo",
      "bar"
    ],
    "order": 100,
    "processor": "queryRemoval",
    "urlPattern": {
      "include": [
        "foo.example"
      ]
    }
  },
  {
    "args": [
      "PHPSESSIONID"
    ],
    "order": 100,
    "processor": "queryRemoval",
    "urlPattern": {
      "include": []
    }
  }
]

All non-universal rules (those with a non-empty include pattern) that match a request URL are applied in the order given by their order field. If no non-universal rule matches the URL, the universal rules are applied instead.
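The selection logic above can be sketched as follows. Note that this is a simplified illustration: matching here is a naive substring check, whereas the library uses url-matcher patterns.

```python
def select_rules(url: str, rules: list[dict]) -> list[dict]:
    """Pick the rules to apply to a URL: matching non-universal rules,
    falling back to the universal ones, sorted by their order field."""

    def matches(rule: dict) -> bool:
        # Simplified stand-in for url-matcher pattern matching.
        return any(pattern in url for pattern in rule["urlPattern"]["include"])

    specific = [r for r in rules if r["urlPattern"]["include"] and matches(r)]
    chosen = specific or [r for r in rules if not r["urlPattern"]["include"]]
    return sorted(chosen, key=lambda r: r["order"])


rules = [
    {"args": ["foo", "bar"], "order": 100, "processor": "queryRemoval",
     "urlPattern": {"include": ["foo.example"]}},
    {"args": ["PHPSESSIONID"], "order": 100, "processor": "queryRemoval",
     "urlPattern": {"include": []}},
]
# foo.example matches the non-universal rule, so only that rule applies;
# any other URL falls back to the universal PHPSESSIONID rule.
```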

Configuration

duplicate-url-discarder uses the following Scrapy settings:

  • DUD_LOAD_RULE_PATHS: it should be a list of file paths (str or pathlib.Path) pointing to JSON files with the URL rules to apply:

    DUD_LOAD_RULE_PATHS = [
        "/home/user/project/custom_rules1.json",
    ]

    The default value of this setting is empty. However, if the package duplicate-url-discarder-rules is installed and DUD_LOAD_RULE_PATHS has been left empty, the rules in said package are automatically used.

    As this setting requires file paths, deploying custom rule files to Scrapy Cloud or similar environments is not straightforward. One approach: put the custom rule files somewhere inside your Scrapy project, list them in the package data files, disable the zip_safe flag, and compute the absolute file path(s) in the setting value. A sample setup.py would include:

    setup(
        ...
        zip_safe=False,
        package_data={
            "my_project": [
                "data/dud_rules.json",
            ]
        },
    )

    and settings.py can have code like this:

    DUD_LOAD_RULE_PATHS = [
        os.path.join(
            os.path.dirname(os.path.realpath(__file__)), "data", "dud_rules.json"
        )
    ]
  • DUD_ATTRIBUTES_PER_ITEM: a mapping from an item type (or its import path) to a list of attributes present in instances of that type.

    For example:

    DUD_ATTRIBUTES_PER_ITEM = {
        "zyte_common_items.Product": [
            "canonicalUrl",
            "brand",
            "name",
            "gtin",
            "mpn",
            "productId",
            "sku",
            "color",
            "size",
            "style",
        ],
        # Other than strings representing import paths, types are supported as well.
        dict: ["name"]
    }

    This allows DUD to select which attributes to use to derive a signature for an item. This signature is then used to compare the identities of different items. For instance, duplicate_url_discarder.DuplicateUrlDiscarderPipeline uses this to find duplicate items that were extracted so it can drop them.
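The idea behind such a signature can be sketched in plain Python. This is a hypothetical illustration of the concept, not the pipeline's actual signature logic:

```python
def item_signature(item: dict, attributes: list[str]) -> tuple:
    """Derive a comparable signature from the configured attributes.

    Items whose configured attributes all match get the same signature
    and would be treated as duplicates.
    """
    return tuple(item.get(attr) for attr in attributes)


a = {"name": "Widget", "color": "red", "url": "https://a.example"}
b = {"name": "Widget", "color": "red", "url": "https://b.example"}

# Same signature despite different URLs, because only the configured
# attributes are compared:
print(item_signature(a, ["name", "color"]) == item_signature(b, ["name", "color"]))
# -> True
```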
