Systematized tool for scraping

Scrape_it

Scrape_it is a tool for extracting valuable information from a website of interest. Save the time you would spend reading and crawling through the website and leave it to Scrape_it!

Installation

Scrape_it is available on PyPI; you can install it using pip:

pip install scrape-it

Install the latest version from GitHub:

pip install git+https://github.com/erelin6613/Scrape_it

Scrape_it object

As a baseline, Scrape_it relies on a model (a dictionary), which can be customized, although the corresponding methods must then be defined as well.

Currently the object's baseline model is set up to scrape contact information, the address, social media links, and links to the website's policy pages and, if possible, to condense the text of those pages.

To initialize the object specify the url as a string. For more precision provide some more details if known:

  • country: 'us' for United States of America, 'gb' for Great Britain/United Kingdom, 'au' for Australia
  • geo_key: API key for address verification, test is set up to work with this API
  • method: 'requests' for a regular GET request, or 'webdriver' for a request capable of rendering JavaScript and dynamically changing web pages
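
The options above can be checked before the object is built. The following is a minimal sketch and not part of Scrape_it: `build_scraper_kwargs` is a hypothetical helper, the allowed values are only the ones listed above, and the default of 'requests' is an assumption made for this example.

```python
# Hypothetical helper, not part of Scrape_it: it validates the options
# listed above and returns them as keyword arguments.
ALLOWED_COUNTRIES = {"us", "gb", "au"}
ALLOWED_METHODS = {"requests", "webdriver"}


def build_scraper_kwargs(url, country=None, geo_key=None, method="requests"):
    """Return a dict of keyword arguments for the Scrape_it constructor."""
    if not isinstance(url, str) or not url:
        raise ValueError("url must be a non-empty string")
    if method not in ALLOWED_METHODS:
        raise ValueError(f"method must be one of {sorted(ALLOWED_METHODS)}")

    kwargs = {"url": url, "method": method}
    if country is not None:
        country = country.lower()
        if country not in ALLOWED_COUNTRIES:
            raise ValueError(f"country must be one of {sorted(ALLOWED_COUNTRIES)}")
        kwargs["country"] = country
    if geo_key:
        # Keys read from a file often carry a trailing newline.
        kwargs["geo_key"] = geo_key.strip()
    return kwargs
```

The resulting dictionary can then be unpacked into the constructor, for example `Scrape_it(**build_scraper_kwargs('https://www.all-wall.com', country='us'))`.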

Usage

Initialize the Scrape_it object (see run.py for an example):

from scrape_it import Scrape_it

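# Read the API key from a local file (adjust the path to your own key file)
with open('/home/val/Downloads/geo-key.txt', 'r') as key:
    geo_key = key.read().strip()  # drop the trailing newline, if any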

scrape_it = Scrape_it(url='https://www.all-wall.com', country='us', geo_key=geo_key, method='webdriver')

scrape_it.scrape()
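
The example above reads the API key from a fixed path. As a sketch of one alternative, the key can be taken from a file when a path is given and from an environment variable otherwise. `load_geo_key` and the variable name `GEO_KEY` are hypothetical and are not part of Scrape_it.

```python
import os
from pathlib import Path


def load_geo_key(path=None, env_var="GEO_KEY"):
    """Return the API key from a file if a path is given, else from the environment."""
    if path is not None:
        # expanduser() lets callers pass paths such as '~/geo-key.txt'.
        return Path(path).expanduser().read_text(encoding="utf-8").strip()
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"No API key found: pass a path or set {env_var}")
    return key
```

The returned string can be passed as the `geo_key` argument shown above.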

The output will look like this:

Scraping https://www.all-wall.com ...
url : https://www.all-wall.com
country : us
category : None
company_name : All
contact_link : None
phones : {'+18009290927'}
address : 6561 W Post Rd
state : NV
county : Clark
city : Las Vegas
street : W Post Rd
housenumber : 6561
postalcode : 89118
district : Spring Valley
email : None
facebook : https://www.facebook.com/AllWallEquipment
instagram : https://www.instagram.com/allwall_inc/
linkedin : None
pinterest : None
twitter : https://twitter.com/AllWall_Inc
youtube : https://www.youtube.com/channel/UCsNTFJvx3Wi8D3I92pYVZSg
faq_link : None
privacy_link : https://www.iubenda.com/privacy-policy/569672
return_link : None
shipping_link : None
terms_link : None
warranty_link : None
faq_text : None
privacy_text : None
return_text : None
shipping_text : None
terms_text : None
warranty_text : None
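
If only the printed report is available (for example, captured from a log), it can be turned back into a dictionary. This is a rough sketch that assumes the "key : value" layout shown above; `parse_report` is a hypothetical helper, and all values are kept as strings except the literal None.

```python
def parse_report(text):
    """Turn 'key : value' lines into a dict; the string 'None' becomes None."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if not sep:
            continue  # skip lines without a separator, e.g. the "Scraping ..." banner
        value = value.strip()
        result[key.strip()] = None if value == "None" else value
    return result


sample = """Scraping https://www.all-wall.com ...
url : https://www.all-wall.com
country : us
category : None
phones : {'+18009290927'}
postalcode : 89118"""

report = parse_report(sample)
# Keep only the fields for which something was found.
found = {key: value for key, value in report.items() if value is not None}
print(found)
```

Postal codes and phone numbers are deliberately left as strings so that leading zeros and plus signs survive.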

Contributing

Scrape_it is by no means a perfect package and can certainly be improved. If you have ideas or issues, or would like to improve the code or documentation, please feel free to open an issue or a pull request. I will be glad to help if I can.

FAQ

Q: The object returns an empty dictionary. What do I do?

A: The tools used may not have found anything, though this is the exception rather than the rule. Things you can try: use the 'webdriver' method to ensure JavaScript is rendered, specify the country, or use a proxy/VPN in case the website blocks requests from your location.
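
The suggestions in this answer can be tried in order until one of them yields data. The sketch below is not part of Scrape_it: `first_non_empty` is a hypothetical helper, and it assumes you supply your own function that runs a scrape with the given options and returns the resulting dictionary. `fake_run` is only a stand-in so that the example runs without network access.

```python
def first_non_empty(run, attempts):
    """Call run(**options) for each option set until a non-empty result comes back."""
    for options in attempts:
        result = run(**options)
        if result:
            return options, result
    return None, {}


def fake_run(method="requests", country=None):
    # Stand-in for a real scrape: pretends that only 'webdriver' finds data.
    return {"phones": {"+18009290927"}} if method == "webdriver" else {}


attempts = [
    {"method": "requests"},
    {"method": "webdriver"},
    {"method": "webdriver", "country": "us"},
]
options, result = first_non_empty(fake_run, attempts)
print(options, result)
```

Cheaper options come first in the list, so the slower 'webdriver' method is only used when the plain request finds nothing.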

Q: Should I pass a root link, or will any link work?

A: Pass the root link, for now at least. Scrape_it will still scrape some information from other links, but it relies on finding additional links to gather as much information as possible, and the pipeline is not yet set up to process non-root links (I am working on it).
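
When only a deep link is at hand, it can be reduced to its root before being passed in. A minimal sketch using the standard library; `root_url` is a hypothetical helper and is not part of Scrape_it.

```python
from urllib.parse import urlsplit, urlunsplit


def root_url(url):
    """Strip the path, query and fragment, keeping only the scheme and host."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {url!r}")
    return urlunsplit((parts.scheme, parts.netloc, "", "", ""))


print(root_url("https://www.all-wall.com/pages/contact?ref=1#top"))
```

The result can be used as the `url` argument when initializing the object.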
