ParallelSoup

A library that uses BeautifulSoup to provide multithreaded scraping, reducing the total time the scraping process takes.


Aim

The library should implement the following effectively (this list may be extended in the future):

  • Parallelization (I mean duh, you didn't see the repo name or what...)
  • Should consider the fact that the user already has serial BeautifulSoup code
  • Adding to the above point: it should be easy for the user to just use our library
  • All parts of the library should be documented
  • All parts of the library should have unit tests written for verification of their functionality
  • Showcase written examples for different sorts of scraping
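
To illustrate the point about existing serial code, here is a minimal sketch of how a one-page-at-a-time loop can be moved onto a thread pool without changing the extractor. It uses the standard library's `concurrent.futures` and is not ParallelSoup's actual implementation. The names `scrape_serial`, `scrape_threaded`, `fake_fetch`, `fake_extractor` and the `example.com` URLs are hypothetical, and the pages are in-memory strings so the sketch runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_serial(urls, fetch, extractor):
    """The kind of loop a user probably already has: one page at a time."""
    return [extractor(fetch(url)) for url in urls]


def scrape_threaded(urls, fetch, extractor, threads=8):
    """Same fetch and extractor, but pages are handled by a thread pool."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map() yields results in the same order as `urls`
        return list(pool.map(lambda url: extractor(fetch(url)), urls))


# Hypothetical offline stand-ins for the network and for BeautifulSoup.
FAKE_PAGES = {
    f"https://example.com/page/{i}": f"<h3>Title {i}</h3>" for i in range(5)
}


def fake_fetch(url):
    return FAKE_PAGES[url]


def fake_extractor(page):
    # A real extractor would receive a BeautifulSoup object, not a string.
    return page.replace("<h3>", "").replace("</h3>", "")


urls = list(FAKE_PAGES)
serial = scrape_serial(urls, fake_fetch, fake_extractor)
threaded = scrape_threaded(urls, fake_fetch, fake_extractor, threads=4)
print(serial == threaded)
```

Because the extractor is an ordinary function of one page, the serial and threaded versions can share it unchanged; only the loop around it differs.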

Example

A small example that scrapes IMDb with ParallelSoup using 8 threads:

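from parallelsoup import ParallelSoup

# Build the URLs of 10 result pages; the `start` offset advances
# by 50 per page, giving 1, 51, ..., 451.
urls = [
    "https://www.imdb.com/search/title/?countries=in&adult=include&start="
    + str(50 * i + 1) + "&ref_=adv_nxt"
    for i in range(10)
]

def extractor(soup):
    """Return the titles found in one parsed result page."""
    data = []
    for item in soup.find_all('div', attrs={'class': 'lister-item mode-advanced'}):
        content = item.find('div', attrs={'class': 'lister-item-content'})
        # Skip entries that lack the expected content block or heading
        if content is None or content.h3 is None:
            continue
        title = content.h3.a
        if title is not None:
            data.append(title.text)
    return data

# 8 threads, the URLs to scrape, and the per-page extractor
ps = ParallelSoup(8, urls, extractor)
ps.start()
dataParallel = ps.get()
print(dataParallel)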
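
The URL construction in the example can also be written with the standard library's `urllib.parse`, which makes the `start` arithmetic (50 results per page, 1-based) easier to check. This is only a sketch; `BASE` and `page_url` are hypothetical names, and the query parameters are copied from the example above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE = "https://www.imdb.com/search/title/"


def page_url(page_index, page_size=50):
    """Build the search URL for one result page (page_index starts at 0)."""
    params = {
        "countries": "in",
        "adult": "include",
        "start": page_size * page_index + 1,  # 1, 51, 101, ...
        "ref_": "adv_nxt",
    }
    return BASE + "?" + urlencode(params)


urls = [page_url(i) for i in range(10)]

# Read the offsets back out of the generated URLs as a sanity check.
starts = [int(parse_qs(urlsplit(u).query)["start"][0]) for u in urls]
print(starts)
```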

Issues

This library is still in development. To report bugs or make suggestions, please open a new issue on the project's issue tracker.

Download files

Source Distribution

parallelsoup-2.0.0.tar.gz (3.6 kB)

Uploaded Source

Built Distribution

parallelsoup-2.0.0-py3-none-any.whl (3.8 kB)

Uploaded Python 3

File details

Details for the file parallelsoup-2.0.0.tar.gz.

File metadata

  • Download URL: parallelsoup-2.0.0.tar.gz
  • Size: 3.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.3.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.9.5

File hashes

Hashes for parallelsoup-2.0.0.tar.gz
Algorithm    Hash digest
SHA256       3c080fa184eea73d6552e6cee1cf8b262a4a65de1bc72e02a75f0e789ed2ddfc
MD5          914820d850a94c603b10811c4f01216a
BLAKE2b-256  b019746386f3ee49e8c0cf9c14ec9a4cee504488ec5d0f428b59595568ff7069

File details

Details for the file parallelsoup-2.0.0-py3-none-any.whl.

File metadata

  • Download URL: parallelsoup-2.0.0-py3-none-any.whl
  • Size: 3.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.3.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.9.5

File hashes

Hashes for parallelsoup-2.0.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       b8afc786b6ff5447f214903b0f66b17f451edad20ef22b544dbccdf080e6c38c
MD5          561eb0ac48289612fdd8096bcd32d018
BLAKE2b-256  709d1b565e5e72e25f19ecc3f21a4e77760115d96b1c4b6a5b3ef8593358a238
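
One way to use the digests listed above is to compare a downloaded file against its published SHA256 before installing it. The following is a minimal sketch with the standard library's `hashlib`; `sha256_of`, `matches_published_hash` and the local path in the usage line are hypothetical, while the digests are the ones listed on this page.

```python
import hashlib
import hmac

# SHA256 digests as published above for version 2.0.0.
EXPECTED_SHA256 = {
    "parallelsoup-2.0.0.tar.gz":
        "3c080fa184eea73d6552e6cee1cf8b262a4a65de1bc72e02a75f0e789ed2ddfc",
    "parallelsoup-2.0.0-py3-none-any.whl":
        "b8afc786b6ff5447f214903b0f66b17f451edad20ef22b544dbccdf080e6c38c",
}


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, filename):
    """Return True if the file at `path` has the digest listed for `filename`."""
    return hmac.compare_digest(sha256_of(path), EXPECTED_SHA256[filename])
```

For example, `matches_published_hash("downloads/parallelsoup-2.0.0.tar.gz", "parallelsoup-2.0.0.tar.gz")` would return True only for an unmodified copy of the source distribution.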
