A library that uses BeautifulSoup to provide multithreaded scraping, reducing the total time the scraping process takes.
Project description
ParallelSoup
Aim
The library should implement the following effectively (this list may be extended in the future):
- Parallelization (I mean duh, you didn't see the repo name or what...)
- Should take into account that the user already has serial BeautifulSoup code
- Adding to the above point: it should be easy for the user to just use our library
- All parts of the library should be documented
- All parts of the library should have unit tests written for verification of their functionality
- Showcase written examples for different sorts of scraping
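To make the first two aims concrete, here is a minimal sketch of how an existing serial per-page routine could be run across several threads. It uses the standard library's `ThreadPoolExecutor` and is not ParallelSoup's own implementation. The names `scrape_serial`, `scrape_parallel`, and `fake_scrape_one` are hypothetical; `fake_scrape_one` stands in for the user's real "fetch a page, parse it with BeautifulSoup, extract data" function so that the sketch runs without network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_scrape_one(url):
    # Hypothetical stand-in for the user's existing serial code
    # (fetch the page, parse it with BeautifulSoup, extract data).
    time.sleep(0.01)  # simulate network latency
    return [url.rsplit("/", 1)[-1]]


def scrape_serial(urls, scrape_one):
    # The code the user already has: one page after another.
    return [scrape_one(url) for url in urls]


def scrape_parallel(urls, scrape_one, workers=8):
    # Same per-page function, run on a pool of worker threads.
    # pool.map returns results in the same order as the input URLs.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scrape_one, urls))


demo_urls = [f"https://example.com/page/{i}" for i in range(20)]
print(scrape_parallel(demo_urls, fake_scrape_one, workers=8))
```

Because scraping is dominated by waiting on network I/O, threads can overlap that waiting even though Python runs only one thread's bytecode at a time.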
Example
A small example of scraping IMDb with ParallelSoup, using 8 threads:

    from parallelsoup import ParallelSoup

    # Build the URLs of 10 result pages; the start offset advances by 50 per page.
    urls = []
    for i in range(0, 10):
        urls.append("https://www.imdb.com/search/title/?countries=in&adult=include&start=" + str(50*i + 1) + "&ref_=adv_nxt")

    # Receives the parsed page and returns the titles found on it.
    def extractor(soup):
        data = []
        for item in soup.find_all('div', attrs={'class': 'lister-item mode-advanced'}):
            title = item.find('div', attrs={'class': 'lister-item-content'}).h3.a
            if title is not None:
                data.append(title.text)
        return data

    # Arguments: number of threads, URLs to scrape, per-page extractor.
    ps = ParallelSoup(8, urls, extractor)
    ps.start()
    dataParallel = ps.get()
    print(dataParallel)

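The URL loop in the example above builds each query string by hand. As an optional alternative, the sketch below produces the same ten URLs with `urllib.parse.urlencode`, which keeps the pagination arithmetic (`start` = 1, 51, ..., 451) in one place. `BASE_URL`, `PAGE_SIZE`, and `build_urls` are names introduced here for illustration only; the page size of 50 is an assumption inferred from the `50*i + 1` step in the example.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.imdb.com/search/title/"
PAGE_SIZE = 50  # assumption: inferred from the 50*i + 1 step in the example


def build_urls(n_pages):
    # One URL per result page; "start" is the 1-based index of the
    # first title on that page.
    urls = []
    for page in range(n_pages):
        query = urlencode({
            "countries": "in",
            "adult": "include",
            "start": PAGE_SIZE * page + 1,
            "ref_": "adv_nxt",
        })
        urls.append(f"{BASE_URL}?{query}")
    return urls


urls = build_urls(10)
print(urls[0])
```

The resulting list can be passed to `ParallelSoup` exactly as the hand-built `urls` list is in the example.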
Issues
This library is still in development. To report bugs or make suggestions, please open a new issue on the project's issue tracker.
Download files
File details
Details for the file parallelsoup-2.0.0.tar.gz.
File metadata
- Download URL: parallelsoup-2.0.0.tar.gz
- Upload date:
- Size: 3.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/4.3.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.9.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3c080fa184eea73d6552e6cee1cf8b262a4a65de1bc72e02a75f0e789ed2ddfc |
| MD5 | 914820d850a94c603b10811c4f01216a |
| BLAKE2b-256 | b019746386f3ee49e8c0cf9c14ec9a4cee504488ec5d0f428b59595568ff7069 |
File details
Details for the file parallelsoup-2.0.0-py3-none-any.whl.
File metadata
- Download URL: parallelsoup-2.0.0-py3-none-any.whl
- Upload date:
- Size: 3.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/4.3.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.9.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b8afc786b6ff5447f214903b0f66b17f451edad20ef22b544dbccdf080e6c38c |
| MD5 | 561eb0ac48289612fdd8096bcd32d018 |
| BLAKE2b-256 | 709d1b565e5e72e25f19ecc3f21a4e77760115d96b1c4b6a5b3ef8593358a238 |
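The SHA256 digests listed above can be used to check a downloaded file before installing it. The following is a minimal sketch using only the standard library; `sha256_of_file` and `matches_expected` are hypothetical helper names, and the file path in the usage comment assumes the archive was saved to the current directory.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=65536):
    # Hash the file in chunks so large files need not fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    # Compare against the published digest, ignoring case and whitespace.
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())


# Usage (assumes the sdist was downloaded to the current directory):
# matches_expected(
#     "parallelsoup-2.0.0.tar.gz",
#     "3c080fa184eea73d6552e6cee1cf8b262a4a65de1bc72e02a75f0e789ed2ddfc",
# )
```

pip can perform the same check automatically when a requirements file pins hashes with `--hash=sha256:...`.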