e-models

A suite of tools to assist in building extraction models with Scrapy spiders.

Installation:

$ pip install e-models

scrapyutils module

The scrapyutils module provides two classes: one that extends scrapy.http.TextResponse and another that extends scrapy.loader.ItemLoader. The extensions provide methods that allow extracting item data in the text (markdown) domain instead of the HTML source domain. This approach serves two purposes:

  1. Its main purpose is the generation of datasets suitable for training transformer models for text extraction (also known as extractive question answering, EQA).
  2. As a secondary objective, it provides an alternative to XPath and CSS selectors for extracting data from the HTML source, one that may be more suitable and readable for humans.

Usage:

Instead of subclassing your item loaders from scrapy.loader.ItemLoader, subclass them from emodels.scrapyutils.ExtractItemLoader. This does not change how item loaders work; it simply enables the features described above. In addition, to save the collected extraction data, you must set the environment variable EMODELS_SAVE_EXTRACT_ITEMS to 1. The collected extraction data is stored at <user home folder>/.datasets/items/<item class name>/<sequence number>.jl.gz. The base folder <user home folder>/.datasets is the default; you can customize it via the environment variable EMODELS_DIR.
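
For example, here is a minimal sketch of a loader subclass. The JobItem schema and JobItemLoader are hypothetical, for illustration only; only ExtractItemLoader comes from e-models:

import scrapy
from itemloaders.processors import TakeFirst
from emodels.scrapyutils import ExtractItemLoader

class JobItem(scrapy.Item):
    # Hypothetical item schema, for illustration only.
    title = scrapy.Field()
    phone = scrapy.Field()

class JobItemLoader(ExtractItemLoader):
    # Used exactly like scrapy.loader.ItemLoader; extraction data is
    # collected transparently when EMODELS_SAVE_EXTRACT_ITEMS is set to 1.
    default_item_class = JobItem
    default_output_processor = TakeFirst()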

So, to keep the dataset clean and well organized, only enable the saving of extracted items once you are sure your extraction selectors are correct. Then run locally:

$ EMODELS_SAVE_EXTRACT_ITEMS=1 scrapy crawl myspider
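
Once a crawl has run with saving enabled, the collected samples can be inspected like any gzipped JSON-lines file. A minimal sketch, assuming the default EMODELS_DIR and the hypothetical JobItem class name from above:

import gzip
import json
from pathlib import Path

dataset_dir = Path.home() / ".datasets" / "items" / "JobItem"
for path in sorted(dataset_dir.glob("*.jl.gz")):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            sample = json.loads(line)  # one collected extraction sample per line
            print(sample)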

Also, to keep your dataset well organized, use the same item class name for items with the same schema, even across multiple projects, and avoid reusing a name across items with different schemas.

In general, however, you will use extraction data from all item classes at the same time to train a transformer model, as this is how transformers learn to generalize. The result is a transformer model suited to extract any kind of item: transformers are trained not to extract "data from item x" but to recognize and extract individual fields. So even if you didn't train the transformer on a specific item class, it will do well on it, provided it learned to extract the same fields from other item classes. You only need to ask the right question. For example, given an HTML page as context, you can ask the model: which is the phone number? You don't need to specify what kind of entity (a business? a person? an organization?) you expect to find there.
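
To illustrate that final querying step, the sketch below uses the Hugging Face transformers question-answering pipeline, which is not part of e-models; the model name is a placeholder for whatever EQA model you trained on the collected datasets:

from transformers import pipeline

# "my-org/my-eqa-model" is a placeholder for an extractive QA model
# fine-tuned on the datasets collected as described above.
qa = pipeline("question-answering", model="my-org/my-eqa-model")

page_text = "ACME Corp. Contact us at +1 555 0100 for a quote."
result = qa(question="Which is the phone number?", context=page_text)
print(result["answer"])  # e.g. "+1 555 0100"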

(WIP...)
