Scraping websites with multithreading, random proxies, and user-agents
First_scrap
https://theodor85.github.io/first_scrap/
First_scrap is a library for multithreaded website scraping with random proxies and user-agents.
Installation
To get started with the first_scrap library, activate (or create, if necessary) a virtual environment. For example:

```shell
python3 -m venv env
source ./env/bin/activate
```
To install First_scrap, use the pip package manager:

```shell
pip install firstscrap
```
Alternatively, you can install from the source code on GitHub. To do so, run the following commands in your console:

```shell
git clone http://github.com/theodor85/first_scrap
cd first_scrap
python setup.py develop
```
How to use
An example of extracting data from a single web page:
```python
from firstscrap import pagehandler

@pagehandler(parser="BeautifulSoup")
def get_data(url, soup=None):
    # only your BeautifulSoup code -- no requests, proxies, etc.
    span = soup.find(name="span", attrs={"class": "p-nickname vcard-username d-block"})
    text = span.get_text().strip()
    return text

if __name__ == '__main__':
    print(get_data('https://github.com/theodor85'))
    # output:
    # theodor85
```
What's under the hood
When extracting data from a single page:
- A random proxy server and a random user-agent are selected from lists stored in a file.
- The page is requested using that proxy and user-agent.
- The data is extracted from the page with BeautifulSoup.
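The library's internal code is not reproduced here, but the idea of randomizing the request identity can be sketched with the standard library. The proxy and user-agent lists below are made-up placeholders, and `random_identity`/`fetch` are illustrative helpers, not part of the firstscrap API:

```python
import random
import urllib.request

# Placeholder lists; the real library reads its proxies and
# user-agents from files on disk.
PROXIES = ['http://203.0.113.10:8080', 'http://203.0.113.11:3128']
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (X11; Linux x86_64)',
]

def random_identity():
    """Pick a random proxy and user-agent for a single request."""
    return random.choice(PROXIES), random.choice(USER_AGENTS)

def fetch(url):
    """Request a page through a random proxy with a random user-agent."""
    proxy, user_agent = random_identity()
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({'http': proxy, 'https': proxy}))
    opener.addheaders = [('User-Agent', user_agent)]
    return opener.open(url, timeout=10).read()
```

Rotating the identity on every request makes it harder for a site to rate-limit or block the scraper based on a single IP or browser signature.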
The most interesting part is processing many similar pages. Here is an example:
```python
from firstscrap import listhandler

TEST_URLLIST_OLX = [
    'https://www.olx.ua/obyavlenie/spetsialist-po-podklyucheniyu-interneta-IDGnCkB.html',
    'https://www.olx.ua/obyavlenie/menedzher-po-robot-s-klentami-IDGkGK6.html',
]

@listhandler(threads_limit=5, parser='BeautifulSoup')
def get_date_time_from_olx(urllist, soup=None):
    '''BeautifulSoup code for a single page.'''
    em = soup.find('em')
    row_text = em.get_text().strip()
    return row_text

if __name__ == '__main__':
    data = get_date_time_from_olx(TEST_URLLIST_OLX)
    for item in data:
        print(item)
    # output:
    # Добавлено: в 16:49, 26 декабря 2019, Номер объявления: 626235005
    # Добавлено: в 16:18, 29 декабря 2019, Номер объявления: 625536978
```
What's under the hood
The program processes each page in a separate thread, and the number of threads running at the same time never exceeds threads_limit. Each thread makes its request using a random proxy and user-agent.
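The library's own thread management isn't shown here, but the same capped-concurrency behaviour can be sketched with Python's `concurrent.futures`. The `process_list` helper and its signature are illustrative, not the actual firstscrap implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def process_list(urllist, handler, threads_limit=5):
    """Apply handler to every URL, running at most threads_limit
    threads at the same time."""
    with ThreadPoolExecutor(max_workers=threads_limit) as pool:
        # pool.map returns results in the order of the input list,
        # even though the threads finish in arbitrary order
        return list(pool.map(handler, urllist))

# toy usage: an upper-casing "handler" instead of a real scraper
print(process_list(['a', 'b', 'c'], str.upper, threads_limit=2))
# output: ['A', 'B', 'C']
```

Capping the worker count keeps memory and socket usage bounded while still letting slow pages download in parallel.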
Running the tests
To run the tests, type in your console:

```shell
python -m unittest -v tests/tests.py
```

Before running the tests, ensure that your internet connection is active.
Contributing
To contribute, please merge your code into the 'develop' branch. Forks and pull requests are welcome! If you like first_scrap, don't forget to give it a star!
Bug reports
To report a bug, please email fedor_coder@mail.ru with the tag "first_scrap bug reporting".
License
This project is licensed under the MIT License - see the LICENSE.txt file for details.