
Web crawling application using Scrapy to extract official policies

Project description

Scrapy Tool for Omdena Latam LFR Challenge

A web crawling application built on Scrapy that extracts official policies from the following sources:

Characteristics of the information sources

Chile (LeyChile)

Search type: Exhaustive, through the API, limited by pages_num

Speed: Fast

Number of available documents: 10-100k

Document Type: HTML

Mexico (Diario Oficial de la Federación)

Search type: Exhaustive, through scraping (XPath), limited by a range of years.

Speed: Very slow, and buggy when the pipeline order is changed.

Number of available documents: 10-100k

Document Type: HTML

El Peruano

Search type:

Speed:

Number of available documents:

Document Type:

Setup Steps:

Recommendations:

Use a virtual environment rather than your system Python to install the dependencies and run the tool.

Install dependencies

pip install -r requirements.txt

Scrapy settings.py

https://drive.google.com/file/d/1bjbjYSXQqZQpJdwATRCLZFSRQciiULy-/view?usp=sharing

Warning!

The S3 upload pipeline and the MySQL insert pipeline do not work together. Use either:

ITEM_PIPELINES = {
    # 'scrapy.pipelines.files.FilesPipeline': 100,
    'scrapy_official_newspapers.pipelines.ScrapyOfficialNewspapersMySQLPipeline': 200,
}

or

ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 100,
    # 'scrapy_official_newspapers.pipelines.ScrapyOfficialNewspapersMySQLPipeline': 200,
}

The order does not matter.

Database access

Set up the database access by placing a settings.json file inside scrapy_official_newspapers:

{
  "username": "username",
  "password": "password",
  "db_name": "db_name",
  "aws_endpoint": "your_db_instance_access"
}
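
For illustration only, and assuming the pymysql package, a pipeline could load these credentials roughly as sketched below; the project's actual ScrapyOfficialNewspapersMySQLPipeline may be implemented differently.

import json
import pymysql  # assumption: any MySQL client library would work equally well

# Illustrative sketch, not the project's actual pipeline code.
class MySQLPipelineSketch:
    def open_spider(self, spider):
        # Read the credentials placed in scrapy_official_newspapers/settings.json.
        with open('scrapy_official_newspapers/settings.json') as f:
            cfg = json.load(f)
        self.conn = pymysql.connect(
            host=cfg['aws_endpoint'],
            user=cfg['username'],
            password=cfg['password'],
            database=cfg['db_name'],
        )

    def process_item(self, item, spider):
        # The real pipeline inserts the item's attributes here.
        return item

    def close_spider(self, spider):
        self.conn.close()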

S3 Access

Set up the Scrapy settings.py located in scrapy_official_newspapers:

AWS_ACCESS_KEY_ID = "XXXXXXXXXXXXXXXXXXXX"
AWS_SECRET_ACCESS_KEY = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
FILES_STORE = 's3://wri-latin-test/'

Run

From the repository root (a programmatic alternative to the scrapy crawl commands is sketched after this list):

  • cd scrapy_official_newspapers
  • scrapy crawl leychile
  • scrapy crawl MexicoDOF
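
As an untested sketch only, the same spiders can also be launched programmatically with Scrapy's CrawlerProcess, run from the same directory where you would run scrapy crawl so the project settings can be found:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Run both spiders in one process, using the project's settings.py.
process = CrawlerProcess(get_project_settings())
process.crawl('leychile')    # spider names as registered in the project
process.crawl('MexicoDOF')
process.start()              # blocks until both crawls finish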

Monitoring/Debugging/Testing

By inspecting the MySQL table, you can check how the information is being inserted; a quick check is sketched below.
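
For example, assuming pymysql and a hypothetical table name policies (replace it with the table actually used by the MySQL pipeline), the latest inserted rows can be listed like this; the column names follow the attributes listed under Goal below:

import json
import pymysql

with open('scrapy_official_newspapers/settings.json') as f:
    cfg = json.load(f)

conn = pymysql.connect(host=cfg['aws_endpoint'], user=cfg['username'],
                       password=cfg['password'], database=cfg['db_name'])
with conn.cursor() as cur:
    # 'policies' is a placeholder table name.
    cur.execute("SELECT title, publication_date, doc_url FROM policies "
                "ORDER BY publication_date DESC LIMIT 10")
    for row in cur.fetchall():
        print(row)
conn.close()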

Through https://console.aws.amazon.com/console/home, after authenticating, you can navigate to the S3 service and check the uploaded files and their properties. The same check can be scripted as shown below.
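
A minimal sketch using boto3, assuming AWS credentials are configured locally; the bucket name comes from FILES_STORE above:

import boto3

# List the most recent objects uploaded by the FilesPipeline.
s3 = boto3.client('s3')
response = s3.list_objects_v2(Bucket='wri-latin-test', MaxKeys=20)
for obj in response.get('Contents', []):
    print(obj['Key'], obj['Size'], obj['LastModified'])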

Goal

Provide a structured information system built from multiple, varied source formats containing raw official policy documents, while keeping references to their attributes.

Attributes

country

geo_code

level

source

title

reference

authorship

resume

publication_date

enforcement_date

url

doc_url

doc_name

doc_type

file_urls (needed for Scrapy's FilesPipeline with S3 storage; not in the DB schema)

This is not implemented yet, but the idea is to also keep track of:

  • file_raw_S3_url
  • file_processed_task_1_S3_url

This way we can keep track of the policies with their attributes, and also maintain the relationships and results of the different processing steps. A sketch of the corresponding Scrapy Item follows.
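
As an illustration only (the project's actual items.py may define this differently, and the class name here is hypothetical), the attributes above map naturally onto a Scrapy Item:

import scrapy

# Hypothetical Item sketch mirroring the attribute list above.
class PolicyItem(scrapy.Item):
    country = scrapy.Field()
    geo_code = scrapy.Field()
    level = scrapy.Field()
    source = scrapy.Field()
    title = scrapy.Field()
    reference = scrapy.Field()
    authorship = scrapy.Field()
    resume = scrapy.Field()
    publication_date = scrapy.Field()
    enforcement_date = scrapy.Field()
    url = scrapy.Field()
    doc_url = scrapy.Field()
    doc_name = scrapy.Field()
    doc_type = scrapy.Field()
    file_urls = scrapy.Field()  # consumed by the FilesPipeline, not stored in the DB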

What this information system is:

Policy documents stored and indexed through the integration of a relational MySQL database and the AWS S3 storage service (FTP would work too), together with most of their attributes, such as the title of the act, a summary of the document, the date of publication, and so on.

Documentation:

https://docs.scrapy.org/en/2.2/

https://docs.scrapy.org/en/2.2/topics/media-pipeline.html#enabling-your-media-pipeline

Recommendations:

Do not make extensive use of the tool yet: it is still in development, so the information it produces will probably be thrown away, and heavy crawling can exhaust the target sites' resources and inflate the Amazon bill. Throttling the crawl, as sketched below, helps to limit this.
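
One way to limit the load on the target sites and on the AWS bill is to throttle the crawl in settings.py. The values below are only a conservative suggestion and are not part of the shared settings.py:

# Conservative throttling (suggested values, adjust as needed).
DOWNLOAD_DELAY = 1.0                   # minimum delay between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 2     # limit parallel requests per site
AUTOTHROTTLE_ENABLED = True            # let Scrapy adapt the delay to server load
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0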



Download files

Download the file for your platform.

Source Distribution

scrapy_omdena_latam-0.0.2.tar.gz (9.9 kB)

Uploaded Source

Built Distribution

scrapy_omdena_latam-0.0.2-py3-none-any.whl (24.0 kB)

Uploaded Python 3

File details

Details for the file scrapy_omdena_latam-0.0.2.tar.gz.

File metadata

  • Download URL: scrapy_omdena_latam-0.0.2.tar.gz
  • Upload date:
  • Size: 9.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/50.3.1 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.7.6

File hashes

Hashes for scrapy_omdena_latam-0.0.2.tar.gz

  • SHA256: 547ad77b8e0764a7ef6e3f62ab7b59a69110d1a91fd5b69ce36121dcf31eccb4
  • MD5: 9150ade6a2039691719ddce876d5ad02
  • BLAKE2b-256: 8b6525d4f44e82096334dd09f44812c3becea3009a7c6752b0453fcb2057842c


File details

Details for the file scrapy_omdena_latam-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: scrapy_omdena_latam-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 24.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/50.3.1 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.7.6

File hashes

Hashes for scrapy_omdena_latam-0.0.2-py3-none-any.whl

  • SHA256: 1060dda844bf8d1d3e97b5722f7ed188d678d08afe3414562740842631f9fdf3
  • MD5: 3809abe7859a3c4c4ba0a2f1f6f38784
  • BLAKE2b-256: 2c0711295b948c6a64de249c0313b8cf273afb128fd62ffb1e1d4956ed4d8efb

