
Scraping sites with multithreading, random proxies and user-agents


First_scrap

https://theodor85.github.io/first_scrap/




First_scrap is a library for multithreaded website scraping with random proxies and user-agents.

Installation

To get started with the first_scrap library, activate your virtual environment, creating one first if necessary. For example:

python3 -m venv env
source ./env/bin/activate

To install First_scrap, use the pip package manager:

pip install firstscrap

Alternatively, install from the source code on GitHub by running these commands in your console:

git clone https://github.com/theodor85/first_scrap
cd first_scrap
python setup.py develop

How to use

Example of extracting data from a single web page:

from firstscrap import pagehandler

@pagehandler(parser="BeautifulSoup")
def get_data(url, soup=None):
    # only your BeautifulSoup code goes here: no requests, proxies, etc.
    span = soup.find(name="span", attrs={"class": "p-nickname vcard-username d-block"})
    text = span.get_text().strip()
    return text

if __name__ == '__main__':
    print(get_data('https://github.com/theodor85'))

    # output:
    # theodor85

What's under the hood

When extracting data from a single page:

  1. A random proxy server and a random user-agent are selected from the lists stored in a file.
  2. This proxy and user-agent are used to access the requested page.
  3. The data is extracted from the page with BeautifulSoup.
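
The first two steps above can be sketched roughly as follows. This is only an illustration, not the library's actual code: the names `PROXIES`, `USER_AGENTS`, `pick_identity`, `build_request` and `fetch_html` are hypothetical, the sample values are placeholders, and the standard-library `urllib` is used here because the README does not say which HTTP client First_scrap relies on. The HTML returned by `fetch_html` would then be handed to BeautifulSoup for step 3.

```python
import random
import urllib.request

# Hypothetical sample values; the library keeps its real lists in a file.
PROXIES = ["203.0.113.10:8080", "203.0.113.11:3128"]
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0",
]


def pick_identity(proxies, user_agents, rng=random):
    """Step 1: choose one proxy and one user-agent at random."""
    return rng.choice(proxies), rng.choice(user_agents)


def build_request(url, proxy, user_agent):
    """Step 2: prepare an opener and a request that use the chosen identity."""
    handler = urllib.request.ProxyHandler(
        {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    )
    opener = urllib.request.build_opener(handler)
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    return opener, request


def fetch_html(url, proxies=PROXIES, user_agents=USER_AGENTS, timeout=10):
    """Steps 1 and 2 together: fetch the page and return its HTML as text."""
    proxy, user_agent = pick_identity(proxies, user_agents)
    opener, request = build_request(url, proxy, user_agent)
    with opener.open(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Keeping the random choice separate from the network call makes the selection logic easy to check without a working proxy.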

The most interesting part: processing many similar pages

Here is an example:

from firstscrap import listhandler

TEST_URLLIST_OLX = [
    'https://www.olx.ua/obyavlenie/spetsialist-po-podklyucheniyu-interneta-IDGnCkB.html',
    'https://www.olx.ua/obyavlenie/menedzher-po-robot-s-klentami-IDGkGK6.html',
]

@listhandler(threads_limit=5, parser='BeautifulSoup')
def get_date_time_from_olx(urllist, soup=None):
    ''' BeautifulSoup code for a single page '''
    em = soup.find('em')
    raw_text = em.get_text().strip()
    return raw_text

if __name__ == '__main__':
    data = get_date_time_from_olx(TEST_URLLIST_OLX)
    for item in data:
        print(item)
# output:
# Добавлено: в 16:49, 26 декабря 2019, Номер объявления: 626235005
# Добавлено: в 16:18, 29 декабря 2019, Номер объявления: 625536978

What's under the hood

The program processes each page in a separate thread, and the number of threads running at the same time does not exceed threads_limit.

Each thread makes its request using a random proxy and user-agent.
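
One common way to cap the number of simultaneous threads is a thread pool. The sketch below uses `concurrent.futures.ThreadPoolExecutor` from the standard library. It is an assumption about how such a limit could be implemented, not a description of First_scrap's internals, and `process_all` and `fake_handler` are hypothetical names used only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def process_all(urls, handle_page, threads_limit=5):
    """Run handle_page(url) for every URL, at most threads_limit at a time.

    Results are returned in the same order as the input URLs.
    """
    with ThreadPoolExecutor(max_workers=threads_limit) as pool:
        return list(pool.map(handle_page, urls))


def fake_handler(url):
    # Stand-in for "fetch through a random proxy, then parse"; no network access.
    return url.rsplit("/", 1)[-1]


if __name__ == "__main__":
    pages = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
    print(process_all(pages, fake_handler, threads_limit=2))
```

With `threads_limit=2`, the third URL waits until one of the two worker threads becomes free.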

Running the tests

To run the tests, type in your console:

python -m unittest -v tests/tests.py

Before running the tests, ensure that your internet connection is active.

Contributing

To contribute, please merge your code into the 'develop' branch.

Forks and pull requests are welcome! If you like first_scrap, do not forget to put a star!

Bug reports

To report a bug, please email fedor_coder@mail.ru with the tag "first_scrap bug reporting".

License

This project is licensed under the MIT License - see the LICENSE.txt file for details.
