Scraping sites with multithreading, random proxies and user-agents

First_scrap

https://theodor85.github.io/first_scrap/



First_scrap is a library for multithreaded scraping of websites using random proxies and user agents.

Installation

To get started with the first_scrap library, create (if necessary) and activate a virtual environment, for example:

python3 -m venv env
source ./env/bin/activate

To install First_scrap, use the pip package manager:

pip install firstscrap

Alternatively, install from the source code on GitHub by running these commands in your console:

git clone http://github.com/theodor85/first_scrap
cd first_scrap
python setup.py develop

How to use

Example of extracting data from a single web page:

from firstscrap import pagehandler

@pagehandler(parser="BeautifulSoup")
def get_data(url, soup=None):
    # only your BeautifulSoup code goes here: no requests, proxies, etc.
    span = soup.find(name="span", attrs={"class": "p-nickname vcard-username d-block"})
    text = span.get_text().strip()
    return text

if __name__ == '__main__':
    print(get_data('https://github.com/theodor85'))

    # output:
    # theodor85
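
One way a decorator such as pagehandler could hand a ready-made soup object to your function is sketched below. This is not First_scrap's actual implementation: page_handler_sketch, fake_fetch and fake_parse are hypothetical names, and the fetcher and parser are passed in as plain functions so the example runs without network access or extra packages.

```python
import functools


def page_handler_sketch(fetch, parse):
    """Hypothetical decorator factory: fetch the URL, parse it, inject `soup`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(url):
            html = fetch(url)   # in real use: an HTTP GET through a proxy
            soup = parse(html)  # in real use: e.g. BeautifulSoup(html, "html.parser")
            return func(url, soup=soup)
        return wrapper
    return decorator


# Offline stand-ins (assumptions for illustration only).
def fake_fetch(url):
    return "<span> theodor85 </span>"


def fake_parse(html):
    return html.replace("<span>", "").replace("</span>", "")


@page_handler_sketch(fetch=fake_fetch, parse=fake_parse)
def get_data(url, soup=None):
    # The decorated function contains only the extraction logic.
    return soup.strip()


if __name__ == "__main__":
    print(get_data("https://github.com/theodor85"))
```

The decorated function keeps the (url, soup=None) signature from the example above, so the caller passes only the URL.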

What's under the hood

When extracting data from a single page:

  1. A random proxy server and a random user agent are selected from lists stored in a file.
  2. This proxy and user agent are used to request the target page.
  3. The data is extracted from the page with BeautifulSoup.
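
Steps 1 and 2 can be sketched with the standard library alone. This is a minimal illustration, not First_scrap's code: the PROXIES and USER_AGENTS values, build_request and fetch_html are all hypothetical, and the lists are hard-coded here rather than read from a file.

```python
import random
import urllib.request

# Hypothetical sample values; a real setup would load these from a file.
PROXIES = ["203.0.113.10:8080", "203.0.113.11:3128"]
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0",
]


def build_request(url, proxies=PROXIES, user_agents=USER_AGENTS):
    """Pick a random proxy and user agent and prepare an opener and a request."""
    proxy = random.choice(proxies)
    user_agent = random.choice(user_agents)
    handler = urllib.request.ProxyHandler(
        {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    )
    opener = urllib.request.build_opener(handler)
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    return opener, request, proxy, user_agent


def fetch_html(url, timeout=10):
    """Download the page through the randomly chosen proxy (needs network access)."""
    opener, request, _, _ = build_request(url)
    with opener.open(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

The returned HTML string would then be handed to BeautifulSoup for step 3.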

The most interesting part: processing many similar pages

Here is an example:

from firstscrap import listhandler

TEST_URLLIST_OLX = [
    'https://www.olx.ua/obyavlenie/spetsialist-po-podklyucheniyu-interneta-IDGnCkB.html',
    'https://www.olx.ua/obyavlenie/menedzher-po-robot-s-klentami-IDGkGK6.html',
]

@listhandler(threads_limit=5, parser='BeautifulSoup')
def get_date_time_from_olx(urllist, soup=None):
    '''BeautifulSoup code for a single page.'''
    em = soup.find('em')
    raw_text = em.get_text().strip()
    return raw_text

if __name__ == '__main__':
    data = get_date_time_from_olx(TEST_URLLIST_OLX)
    for item in data:
        print(item)
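# output (Russian text from the page):
# Добавлено: в 16:49, 26 декабря 2019, Номер объявления: 626235005
# Добавлено: в 16:18, 29 декабря 2019, Номер объявления: 625536978
# In English: "Added: at 16:49, 26 December 2019, Ad number: 626235005", and so on.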

What's under the hood

The program processes each page in a separate thread, and the number of threads running at the same time does not exceed threads_limit.

Every thread makes its request using a random proxy and user agent.
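
Capping the number of concurrent threads can be sketched with the standard library's ThreadPoolExecutor. This is an assumption about one possible approach, not a description of how First_scrap implements threads_limit; process_all and fake_handler are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def process_all(urls, handle_page, threads_limit=5):
    """Call handle_page(url) for every URL, with at most threads_limit threads."""
    with ThreadPoolExecutor(max_workers=threads_limit) as pool:
        # map() yields results in the same order as the input URLs.
        return list(pool.map(handle_page, urls))


def fake_handler(url):
    # Stand-in for "fetch through a random proxy, then parse the page".
    return url.rsplit("/", 1)[-1]


if __name__ == "__main__":
    pages = ["https://example.com/a", "https://example.com/b"]
    print(process_all(pages, fake_handler, threads_limit=2))
    # prints ['a', 'b']
```

In this sketch the results come back in input order because of pool.map; the source does not say whether First_scrap guarantees any ordering.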

Running the tests

To run the tests, type in your console:

python -m unittest -v tests/tests.py

Before running the tests, ensure that your internet connection is active.

Contributing

To contribute, please merge your code into the 'develop' branch.

Forks and pull requests are welcome! If you like first_scrap, don't forget to give it a star!

Bug reports

To report a bug, please email fedor_coder@mail.ru with the tag "first_scrap bug reporting".

License

This project is licensed under the MIT License - see the LICENSE.txt file for details.

Download files

Source Distribution

firstscrap-0.2.0.tar.gz (10.1 kB)

Built Distribution

firstscrap-0.2.0-py3-none-any.whl (14.1 kB, Python 3)
