Analyze Scrapy Cloud data

Arche

pip install arche

Arche (pronounced Arkey) helps verify scraped data using a set of defined rules, for example:

Validation with JSON schema
Coverage (items, fields, categorical data, including booleans and enums)
Duplicates
Garbage symbols
Comparison of two jobs

We use it at Scrapinghub, among other tools, to ensure the quality of scraped data.
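To make the "Coverage" rule in the list above more concrete, here is a minimal plain-Python sketch of field coverage. It does not use Arche's API. The function name `field_coverage` is hypothetical, and it assumes that coverage means the share of items in which a field is present and not None.

```python
from collections import Counter


def field_coverage(items):
    """Return {field: share of items where the field is present and not None}."""
    counts = Counter()
    for item in items:
        for field, value in item.items():
            if value is not None:
                counts[field] += 1
    total = len(items)
    if not total:
        return {}
    return {field: count / total for field, count in counts.items()}


# Three sample items: "price" is filled in only one of them
items = [
    {"title": "A", "price": 10},
    {"title": "B", "price": None},
    {"title": "C"},
]
coverage = field_coverage(items)
```

A field with low coverage usually points at a selector that stopped matching on part of the site, which is the kind of regression such a rule is meant to surface.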

Installation

Arche requires a Jupyter environment and supports both the JupyterLab and Notebook UIs.

For JupyterLab, you will need to properly install the plotly extensions.

Then just run pip install arche

Why

Arche exists to check the quality of scraped data continuously. For example, after scraping a website, a typical approach would be to validate the data with Arche. You can also create a schema and then set up Spidermon.

Developer Setup

pipenv install --dev
pipenv shell
tox

Contribution

Any contributions are welcome! See https://github.com/scrapinghub/arche/issues if you want to take on something or suggest an improvement/report a bug.

Changes

Most recent releases are shown at the top. Each release shows:

Added: New classes, methods, functions, etc
Changed: Additional parameters, changes to inputs or outputs, etc
Fixed: Bug fixes that don't change documented behaviour

Note that the top-most release lists changes in the unreleased master branch on GitHub. Parentheses after an item show the name or GitHub ID of the contributor of that change.

See also: Keep a Changelog, Semantic Versioning.

[0.3.6] (2019-07-12)

Added

Categories rule with a plot showing unique values and count per field. By default, report_all() only includes fields that have 10 or fewer unique values. See https://arche.readthedocs.io/en/latest/nbs/Rules.html#Category-fields, #100
Category documentation
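The Categories entry above can be illustrated with a short plain-Python sketch. It is not Arche's implementation. The helper name `category_counts` is hypothetical, the threshold of 10 comes from the entry itself, and field values are assumed to be hashable.

```python
from collections import Counter

MAX_UNIQUE = 10  # threshold quoted in the changelog entry


def category_counts(items, max_unique=MAX_UNIQUE):
    """Return {field: Counter(value -> count)} for fields with few unique values."""
    values_by_field = {}
    for item in items:
        for field, value in item.items():
            values_by_field.setdefault(field, Counter())[value] += 1
    # Keep only fields whose number of distinct values is within the threshold
    return {
        field: counts
        for field, counts in values_by_field.items()
        if len(counts) <= max_unique
    }


# "url" is unique per item (12 values), "available" has only two values
items = [
    {"url": f"https://example.com/{i}", "available": i % 2 == 0}
    for i in range(12)
]
categories = category_counts(items)
```

Here only `available` qualifies as a category field, because `url` has more than 10 unique values.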

Changed

Arche.report_all() no longer shortens the report by default; added a short parameter.
Data is consistent with Dash and Spidermon: _type, _key fields are dropped from dataframe, raw data, basic schema, #104, #106
df.index now stores _key instead
basic_json_schema() works with deleted jobs
start is supported for Collections, #112
enum is counted as a category tag, #18
Garbage Symbols searches in str representation of nested fields instead of expanded df, #130
Show real coverage difference (negative/positive) instead of absolute, #114
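As a rough illustration of the Garbage Symbols entry above, the sketch below searches the str representation of each field value, so nested dicts and lists are covered without expanding them into columns. The regular expression (HTML tags and entities) and the name `find_garbage` are assumptions for this example, not Arche's actual pattern or API.

```python
import re

# Hypothetical pattern: HTML tags or entities left over from scraping
GARBAGE = re.compile(r"<[^>]+>|&[a-zA-Z]+;|&#\d+;")


def find_garbage(items):
    """Return {field: [item indexes]} where str(value) matches GARBAGE."""
    hits = {}
    for index, item in enumerate(items):
        for field, value in item.items():
            # str() flattens nested structures into one searchable string
            if GARBAGE.search(str(value)):
                hits.setdefault(field, []).append(index)
    return hits


items = [
    {"title": "Plain", "specs": {"color": "red"}},
    {"title": "Fish &amp; chips", "specs": {"color": "<b>blue</b>"}},
]
hits = find_garbage(items)
```

Searching the string form keeps the check simple, at the cost of reporting only the top-level field name rather than the exact nested key.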

Fixed

Arche.glance(), #88
Item links in Schema validation errors, #89
Empty NAN bars on category graphs, #93
data_quality_report(), #95
Wrong number of Collection Items if it contains item 0, #112

Removed

Responses Per Item Ratio rule
Deprecated expand parameter and removed flat_df, since the Garbage Symbols rule deals with nested data itself, #133

[0.3.5] (2019-05-14)

Added

Arche() supports any iterables with item dicts, fixing jsonschema consistency, #83
Items.from_array to read raw data from iterables, #83

Changed

When reading directly from a pandas DataFrame, raw data is stored in a NumPy array. See the pandas gotchas: http://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#support-for-integer-na

Fixed

Removed

[0.3.4] (2019-05-06)

Fixed

basic_json_schema() fails with long 1.0 types, #80

[0.3.3] (2019-05-03)

Added

Accept dataframes as source or target, #69

Changed

data_quality_report plots the same "Fields Coverage" instead of green "Scraped Fields Coverage"
Plot theme changed from ggplot2 to seaborn, #62
Passing the same target and source now raises an error; it was a warning before
Passed rules marked with green PASSED.

Fixed

Online documentation now renders graphs https://arche.readthedocs.io/en/latest/, #41
Error colours are back in report_all().

Removed

Deprecated Arche.basic_json_schema(), use basic_json_schema()
Removed Quickstart.md as redundant - documentation lives in notebooks

[0.3.2] (2019-04-18)

Added

Allow reading private raw schemas directly from bitbucket, #58

Changed

Progress widgets are removed before printing graphs
New plotly v4 API

Fixed

Failing Compare Prices For Same Urls when url is nan, #67
Empty graphs in Jupyter Notebook, #63

Removed

Scraped Items History graphs

[0.3.1] (2019-04-12)

Fixed

Empty graphs due to lack of plotlyjs, #61

[0.3.0] (2019-04-12)

Fixed

Big notebook size, replaced cufflinks with plotly and ipython, #39

Changed

Fields Coverage now is printed as a bar plot, #9
Fields Counts renamed to Coverage Difference and results in 2 bar plots, #9, #51:
- Coverage from job stats fields counts which reflects coverage for each field for both jobs
- Coverage difference more than 5% which prints >5% difference between the coverages (was ratio difference before)
Compare Scraped Categories renamed to Category Coverage Difference and results in 2 bar plots for each category, #52:
- Coverage for field which reflects value counts (categories) coverage for the field for both jobs
- Coverage difference more than 10% for field which shows >10% differences between the category coverages
Boolean Fields plots Coverage for boolean fields graph which reflects normalized value counts for boolean fields for both jobs, #53
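The "Coverage difference more than 5%" entry above can be sketched in plain Python as follows. This is a simplified illustration rather than Arche's code. The function name, the argument layout, and the choice to keep the sign of each difference are assumptions.

```python
def coverage_difference(source_counts, source_total,
                        target_counts, target_total, threshold=0.05):
    """Return {field: source coverage - target coverage} where |diff| > threshold."""
    diffs = {}
    for field in set(source_counts) | set(target_counts):
        source_cov = source_counts.get(field, 0) / source_total
        target_cov = target_counts.get(field, 0) / target_total
        diff = source_cov - target_cov
        if abs(diff) > threshold:
            # Round to hide floating point noise in the reported value
            diffs[field] = round(diff, 4)
    return diffs


# Per-field item counts for two jobs with 100 items each
source = {"title": 100, "price": 90, "image": 40}
target = {"title": 100, "price": 88, "image": 60}
diffs = coverage_difference(source, 100, target, 100)
```

With these numbers only `image` is reported: its coverage drops from 60% in the target job to 40% in the source job, while the 2% change in `price` stays under the threshold.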

Removed

cufflinks dependency
Deprecated category_field tag

[2019.03.25]

Added

CHANGES.md
new arche.rules.duplicates.find_by() to find duplicates by chosen columns

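import arche
from arche.readers.items import JobItems
# Load the job's items as a dataframe
df = JobItems(0, "235801/1/15").df
# Find duplicates by the chosen columns and display the rule result
arche.rules.duplicates.find_by(df, ["title", "category"]).show()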

basic_json_schema().json() prints a schema in JSON format
Result.show() to print a rule result, e.g.

from arche.rules.garbage_symbols import garbage_symbols
from arche.readers.items import JobItems
items = JobItems(0, "235801/1/15")
# Run the Garbage Symbols rule and print its result
garbage_symbols(items).show()

notebooks to documentation
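For readers unfamiliar with the idea behind find_by() in the entries above, here is a minimal plain-Python sketch of finding duplicates by chosen columns. It works on a list of dicts rather than a dataframe, the name `find_duplicates_by` is hypothetical, and it is not how Arche implements the rule.

```python
from collections import defaultdict


def find_duplicates_by(items, columns):
    """Group item indexes by their values in `columns`; keep groups with 2+ items."""
    groups = defaultdict(list)
    for index, item in enumerate(items):
        # Items sharing the same tuple of column values are considered duplicates
        key = tuple(item.get(column) for column in columns)
        groups[key].append(index)
    return {key: indexes for key, indexes in groups.items() if len(indexes) > 1}


items = [
    {"title": "Lamp", "category": "home", "price": 10},
    {"title": "Lamp", "category": "home", "price": 12},
    {"title": "Lamp", "category": "garden", "price": 10},
]
duplicates = find_duplicates_by(items, ["title", "category"])
```

The first two items are grouped together because they share both `title` and `category`, even though their prices differ.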

Changed

Tags rule returns unused tags, #2
basic_json_schema() prints a schema as a python dict

Deprecated

Arche().basic_json_schema() deprecated in favor of arche.basic_json_schema()

Removed

Fixed

Arche().basic_json_schema() not using items_numbers argument

2019.03.18

Last release without CHANGES updates

Release history

0.3.6 (Jul 12, 2019), this version
0.3.5 (May 14, 2019)
0.3.4 (May 6, 2019)
0.3.3 (May 3, 2019)
0.3.2 (Apr 18, 2019)
0.3.1 (Apr 13, 2019)
0.3.0 (Apr 12, 2019)

Download files

Source Distribution

arche-0.3.6.tar.gz (2.7 MB)

Uploaded Jul 12, 2019 Source

File details

Details for the file arche-0.3.6.tar.gz.

File metadata

Download URL: arche-0.3.6.tar.gz
Upload date: Jul 12, 2019
Size: 2.7 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.7.1

File hashes

Hashes for arche-0.3.6.tar.gz
Algorithm	Hash digest
SHA256	`f19aca0d572e4cb25da064adbec898a27924b80c36205d5c50717ae83a986e4d`
MD5	`47c8a8f3dcc0879a43bc6a2868c48c86`
BLAKE2b-256	`3c20ba1b60a885cd417d18ec00c4a86246cb6ffbf71745332db0b65be5d8f64f`

