A library that uses BeautifulSoup to provide multithreaded scraping, reducing the total time the scraping process takes
Project description
ParallelSoup
Aim
The library should implement the following effectively (this list can be extended in the future):
- Parallelization (I mean, duh, did you not see the repo name or what...)
- Should consider the fact that user already has serial BeautifulSoup code
- Adding to the above point: it should be easy for the user to just use our library
- All parts of the library should be documented
- All parts of the library should have unit tests written for verification of their functionality
- Showcase written examples for different sorts of scraping
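The first two goals can be sketched with the standard library alone. The snippet below is a minimal sketch, not ParallelSoup's actual implementation: `scrape_parallel`, `fetch`, and `parse` are hypothetical names, and it assumes the pages are independent of each other so they can be downloaded and parsed on a thread pool. An existing serial `extractor(soup)` function is reused unchanged.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_parallel(urls, extractor, fetch, parse, workers=8):
    """Hypothetical helper: fetch, parse, and extract each URL on a thread pool.

    fetch(url) returns the page HTML as a string; parse(html) returns the
    object handed to extractor (for BeautifulSoup code, a soup). Results
    come back in the same order as urls.
    """
    def process(url):
        return extractor(parse(fetch(url)))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map preserves input order even if pages finish out of order.
        return list(pool.map(process, urls))
```

With BeautifulSoup, `parse` could be `lambda html: BeautifulSoup(html, "html.parser")` and `fetch` could wrap an HTTP client call; both are assumptions about how a caller might wire this up.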
Example
A small example of scraping IMDb with ParallelSoup, with the goodness of 8 threads:
```python
from parallelsoup import ParallelSoup

# Build the URLs for 10 result pages, 50 titles per page
urls = []
for i in range(0, 10):
    urls.append("https://www.imdb.com/search/title/?countries=in&adult=include&start=" + str(50*i + 1) + "&ref_=adv_nxt")

# Receives the parsed soup of one page and returns the titles found on it
def extractor(soup):
    data = []
    for item in soup.findAll('div', attrs={'class':'lister-item mode-advanced'}):
        title = item.find('div', attrs={'class':'lister-item-content'}).h3.a
        if title is not None:
            data.append(title.text)
    return data

# 8 threads, the list of URLs, and the extractor to run on each page
ps = ParallelSoup(8, urls, extractor)
ps.start()
dataParallel = ps.get()
print(dataParallel)
```
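To see why threads help here, the following is a rough, self-contained timing sketch. It does not touch the network or ParallelSoup: `fake_fetch` is a hypothetical stand-in that sleeps for 50 ms to imitate download latency, which is the part of scraping that threads can overlap.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url, delay=0.05):
    """Hypothetical stand-in for a page download: wait, then return dummy HTML."""
    time.sleep(delay)
    return "<html>" + url + "</html>"


def timed(run):
    """Return (result, elapsed seconds) for a zero-argument callable."""
    start = time.perf_counter()
    result = run()
    return result, time.perf_counter() - start


urls = ["page-%d" % i for i in range(8)]

# Serial baseline: each simulated download waits for the previous one.
serial, serial_time = timed(lambda: [fake_fetch(u) for u in urls])


# 8 threads: the waits overlap, so the total approaches a single delay.
def run_parallel():
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(fake_fetch, urls))


parallel, parallel_time = timed(run_parallel)

print("serial:   %.2f s" % serial_time)
print("parallel: %.2f s" % parallel_time)
```

Threads suit this workload because most of the time is spent waiting on I/O. CPU-heavy parsing would benefit less under CPython's global interpreter lock.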
Issues
This library is still in development. To report bugs or make suggestions, please open a new issue on the project's repository.
Download files
Source Distribution
parallelsoup-2.0.0.tar.gz (3.6 kB)
Built Distribution
Hashes for parallelsoup-2.0.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | b8afc786b6ff5447f214903b0f66b17f451edad20ef22b544dbccdf080e6c38c
MD5 | 561eb0ac48289612fdd8096bcd32d018
BLAKE2b-256 | 709d1b565e5e72e25f19ecc3f21a4e77760115d96b1c4b6a5b3ef8593358a238