
This package is designed to scrape Bible data from JW.org for NLP and generative AI tasks.

Project description

JW SOUP

jwsoup is a simple Python package that scrapes Bible data from the JW.org website. The package provides functionality for scraping Bible verses and saving them in a structured format. It supports scraping data from one or multiple pages, handling paginated content, and storing the results in a Parquet file.

Features

  • Scrape Bible verses from individual or multiple pages.
  • Clean the scraped verse text to remove unwanted characters.
  • Store the scraped data in a Parquet file for further analysis.
  • Simple interface with reusable functions.
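As an illustration of the cleaning step listed above, here is a minimal sketch of how scraped verse text might be tidied. It is not the package's own implementation: the function name `clean_verse_text` is hypothetical, and treating `*` and `+` as footnote and cross-reference markers is an assumption about the page text.

```python
import re
import unicodedata

# Assumption: "*" and "+" appear in the scraped text as footnote and
# cross-reference markers. Adjust the pattern to the pages you scrape.
_MARKERS = re.compile(r"[*+]")
_WHITESPACE = re.compile(r"\s+")


def clean_verse_text(raw: str) -> str:
    """Hypothetical helper: normalize a scraped verse string."""
    # NFKC folds compatibility characters such as non-breaking spaces.
    text = unicodedata.normalize("NFKC", raw)
    # Drop the assumed marker characters.
    text = _MARKERS.sub("", text)
    # Collapse runs of whitespace (including newlines) into single spaces.
    text = _WHITESPACE.sub(" ", text)
    return text.strip()
```

Normalizing before stripping markers keeps the regular expressions simple, since only plain spaces remain to be collapsed.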

Installation

To install jwsoup, you can use pip from PyPI:

pip install jwsoup

Alternatively, if you want to install it locally from the source, clone the repository and run the following commands:

git clone https://github.com/sawadogosalif/jwsoup.git
cd jwsoup
pip install .

Usage

Scrape a Single Page

You can scrape a single page of Bible verses using the scrape_single_page function. This function returns a list of verses and the URL for the next page (if available).

from jwsoup.text import scrape_single_page

url = "https://www.jw.org/fr/biblioth%C3%A8que/bible/bible-d-etude/livres/Gen%C3%A8se/1/"
verses, next_url = scrape_single_page(url)

# Print the scraped verses
for verse in verses:
    print(f"{verse[0]}: {verse[1]}")

# Print the next URL
print(f"Next page URL: {next_url}")

Scrape Multiple Pages

To scrape multiple pages starting from a given URL, use the scrape_multi_page function. This function will follow pagination and save the scraped data in a Parquet file.

from jwsoup.text import scrape_multi_page

start_url = "https://www.jw.org/mos/d-s%E1%BA%BDn-yiisi/biible/nwt/books/S%C9%A9ngre/1/"
output_dir = "bible_data_moore.parquet"
res = scrape_multi_page(start_url, output_dir=output_dir, max_pages=5, page_sep="books")
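
The pagination described above can be pictured as a loop over single-page results. The sketch below is not the package's code: `follow_pages` and `fake_scraper` are hypothetical names, and the only assumption about the real API is the one stated earlier, that a single-page scraper returns a list of verses plus the next URL (or nothing when there is no next page).

```python
from typing import Callable, List, Optional, Tuple

Verse = Tuple[str, str]  # (verse reference, verse text)
PageScraper = Callable[[str], Tuple[List[Verse], Optional[str]]]


def follow_pages(start_url: str, scrape_page: PageScraper, max_pages: int = 5) -> List[Verse]:
    """Hypothetical helper: collect verses while following next-page links."""
    collected: List[Verse] = []
    seen = set()  # guards against a page linking back to an earlier one
    url: Optional[str] = start_url
    pages = 0
    while url and url not in seen and pages < max_pages:
        seen.add(url)
        verses, url = scrape_page(url)
        collected.extend(verses)
        pages += 1
    return collected


# Stand-in scraper so the loop can be tried without network access.
FAKE_PAGES = {
    "page-1": ([("1:1", "first verse")], "page-2"),
    "page-2": ([("2:1", "second verse")], "page-3"),
    "page-3": ([("3:1", "third verse")], None),
}


def fake_scraper(url: str) -> Tuple[List[Verse], Optional[str]]:
    return FAKE_PAGES[url]


print(follow_pages("page-1", fake_scraper, max_pages=2))
```

Passing the scraper in as an argument keeps the loop testable offline. Assuming `scrape_single_page` behaves as described in the previous section, it could be supplied in place of `fake_scraper`.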

Save Data to Parquet

The scraped data is stored in a Parquet file for efficient storage and querying. You can specify the output file and partition the data by page.

import pandas as pd
pd.read_parquet(output_dir).head()


License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Acknowledgments

  • Thanks to the requests, beautifulsoup4, pandas, loguru, and pyarrow libraries for making scraping and data handling easier.
  • Thanks to JW for providing an accessible and rich resource of Bible texts in multiple languages.

Changelog

[0.0.1] - 2024-11-23

Added

  • Initial release of jwsoup.
  • Supports scraping of text-based Bible verses from JW.org.
  • Extracts individual verses and saves them to parquet files using pyarrow.
  • Includes basic error handling and logging with loguru.

Known Limitations

  • Only supports scraping textual data.
  • Does not handle multimedia content (audio/video).
  • Limited testing for edge cases (e.g., malformed HTML or network interruptions).

[0.0.2] - 2024-11-23

Added

  • Typo correction in package description

[0.0.5] - 2024-11-24

Added

  • Add project URL in setup
  • Fix image rendering on PyPI
