This package is designed to scrape Bible data from JW.org for NLP and generative AI tasks.

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

JW SOUP

jwsoup is a simple Python package that scrapes Bible data from the JW.org website. The package provides functionality for scraping Bible verses and saving them in a structured format. It supports scraping data from one or multiple pages, handling paginated content, and storing the results in a Parquet file.

Features

Scrape Bible verses from individual or multiple pages.
Clean the scraped verse text to remove unwanted characters.
Store the scraped data in a Parquet file for further analysis.
Simple interface with reusable functions.
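
The cleaning step listed above can be pictured with a small sketch. This is not jwsoup's actual implementation: the helper name clean_verse_text is hypothetical, and the sketch assumes that footnote and cross-reference markers appear in the raw text as * and + characters.

```python
import re
import unicodedata

# Hypothetical helper: jwsoup's own cleaning rules may differ.
# Assumption: footnote and cross-reference markers show up as "*" and "+".
_MARKERS = re.compile(r"[*+]")
_WHITESPACE = re.compile(r"\s+")


def clean_verse_text(raw: str) -> str:
    """Normalize Unicode, drop marker characters, and collapse whitespace."""
    # NFKC also turns non-breaking spaces into plain spaces.
    text = unicodedata.normalize("NFKC", raw)
    text = _MARKERS.sub("", text)
    return _WHITESPACE.sub(" ", text).strip()


print(clean_verse_text("1\u00a0 In the beginning* God created+ the heavens.\n"))
```

Keeping the patterns as module-level constants makes it easy to adjust the set of characters treated as unwanted for a given language or Bible edition.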

Installation

To install jwsoup, you can use pip from PyPI:

pip install jwsoup

Alternatively, if you want to install it locally from the source, clone the repository and run the following commands:

git clone https://github.com/sawadogosalif/jwsoup.git
cd jwsoup
pip install .

Usage

Scrape text - Single Page

You can scrape a single page of Bible verses using the scrape_single_page function. This function returns a list of verses and the URL for the next page (if available).

from jwsoup.text import scrape_single_page
url = "https://www.jw.org/fr/biblioth%C3%A8que/bible/bible-d-etude/livres/Gen%C3%A8se/1/"
verses, next_url = scrape_single_page(url)

# Print the scraped verses
for verse in verses:
    print(f"{verse[0]}: {verse[1]}")

# Print the next URL
print(f"Next page URL: {next_url}")
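
Because the function returns the next page URL, pages can also be chained by hand. The sketch below shows one way to do that. The helper collect_pages and the stand-in fake_scraper are hypothetical names used for illustration, and the sketch assumes the scraper returns None (or an empty string) when there is no next page.

```python
from typing import Callable, List, Optional, Tuple

Verse = Tuple[str, str]  # assumed shape: (verse number, verse text)
PageScraper = Callable[[str], Tuple[List[Verse], Optional[str]]]


def collect_pages(scrape_page: PageScraper, start_url: str, max_pages: int) -> List[Verse]:
    """Follow next-page URLs until none is returned or max_pages is reached."""
    verses: List[Verse] = []
    url: Optional[str] = start_url
    seen = set()  # guards against a page that links back to an earlier one
    pages = 0
    while url and url not in seen and pages < max_pages:
        seen.add(url)
        page_verses, url = scrape_page(url)
        verses.extend(page_verses)
        pages += 1
    return verses


# Stand-in scraper so the sketch runs offline. With jwsoup installed,
# scrape_single_page could be passed to collect_pages instead.
def fake_scraper(url: str) -> Tuple[List[Verse], Optional[str]]:
    pages = {
        "page-1": ([("1", "first verse")], "page-2"),
        "page-2": ([("1", "second verse")], None),
    }
    return pages[url]


print(collect_pages(fake_scraper, "page-1", max_pages=5))
```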

Scrape text - Multiple Pages

To scrape multiple pages starting from a given URL, use the scrape_multi_page function. This function will follow pagination and save the scraped data in a Parquet file.

from jwsoup.text import scrape_multi_page

start_url = "https://www.jw.org/mos/d-s%E1%BA%BDn-yiisi/biible/nwt/books/S%C9%A9ngre/1/"
output_dir = "bible_data_moore.parquet"
res = scrape_multi_page(start_url, output_dir=output_dir, max_pages=5, page_sep="books")
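
The URLs in these examples are percent-encoded because book and section names contain non-ASCII characters. The sketch below uses the standard library to decode such a URL and to build one from a readable book name. The helper chapter_url is hypothetical, and the path layout is only inferred from the example URLs, so it may not hold for every language or edition.

```python
from urllib.parse import quote, unquote

encoded = "https://www.jw.org/fr/biblioth%C3%A8que/bible/bible-d-etude/livres/Gen%C3%A8se/1/"

# Decode to check which book and chapter the URL points to.
print(unquote(encoded))


def chapter_url(base: str, book: str, chapter: int) -> str:
    """Hypothetical helper: append a percent-encoded book name and a chapter number."""
    return f"{base.rstrip('/')}/{quote(book, safe='')}/{chapter}/"


# Assumption: the base path is copied from the example URL above.
base = "https://www.jw.org/fr/biblioth%C3%A8que/bible/bible-d-etude/livres"
print(chapter_url(base, "Genèse", 1))
```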

Save Data to Parquet

The scraped data is stored in a Parquet file for efficient storage and querying. You can specify the output file and partition the data by page.

import pandas as pd

# output_dir is the Parquet path passed to scrape_multi_page above
pd.read_parquet(output_dir).head()


Download audio files

# download_audios must be imported from jwsoup before this call
start_url = "https://www.jw.org/mos/d-s%E1%BA%BDn-yiisi/biible/nwt/books/yikri"
output_dir = "audio_files"
download_audios(start_url, output_dir, max_pages=3)

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Salif Sawadogopro
Email: salif.sawadogopro@gmail.com

Acknowledgments

Thanks to the requests, beautifulsoup4, pandas, loguru, and pyarrow libraries for making scraping and data handling easier.
Thanks to JW for providing an accessible and rich resource of Bible texts in multiple languages.

Changelog

[0.0.1] - 2024-11-23

Added

Initial release of jwsoup.
Supports scraping of text-based Bible verses from JW.org.
Extracts individual verses and saves them to Parquet files using pyarrow.
Includes basic error handling and logging with loguru.

Known Limitations

Only supports scraping textual data.
Does not handle multimedia content (audio/video).
Limited testing for edge cases (e.g., malformed HTML or network interruptions).

[0.0.2] - 2024-11-23

Added

Typo correction in package description

[0.0.5] - 2024-11-24

Added

Add project URL in setup
Fix image rendering on PyPI
Improve next button parsing

[0.1.0] - 2025-01-10

Added

Introduce audio downloaders
Improve next button parsing

[0.1.1] - 2025-01-11

Added

Name output folders safely with urllib.parse.quote


Release history

0.1.1 (this version): Jan 12, 2025
0.1.0: Jan 10, 2025
0.0.5: Nov 24, 2024
0.0.2: Nov 23, 2024
0.0.1: Nov 23, 2024

Download files


Source Distribution

jwsoup-0.1.1.tar.gz (9.9 kB)

Uploaded Jan 12, 2025 Source

Built Distribution


jwsoup-0.1.1-py3-none-any.whl (8.4 kB)

Uploaded Jan 12, 2025 Python 3

File details

Details for the file jwsoup-0.1.1.tar.gz.

File metadata

Download URL: jwsoup-0.1.1.tar.gz
Upload date: Jan 12, 2025
Size: 9.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for jwsoup-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`fa60d05160279cad8740cfb8c3040373d5970b0451bbc058d697f83cb9d76fc7`
MD5	`54a2f7d51acf49a3ecd9e342a832fa2e`
BLAKE2b-256	`a765b3bda6e5b96b29538a59e89da00147d8b765e1695e3c7c9bedcba7986f6b`


File details

Details for the file jwsoup-0.1.1-py3-none-any.whl.

File metadata

Download URL: jwsoup-0.1.1-py3-none-any.whl
Upload date: Jan 12, 2025
Size: 8.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for jwsoup-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`64b7f97839a12512d8e2853de40ca89a1854c2d1f1cd237231c2161deb497911`
MD5	`bb8e29c04f9a609f594344b25f87cf90`
BLAKE2b-256	`888d621df90decfb0e3052cd0da7df9ee8ab44085327796b05d435a28917251f`
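
The digests listed for each file can be used to check a download before installing it. The sketch below uses only the standard library. The function names sha256_of and matches_digest are illustrative, not part of jwsoup or pip.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Example call, assuming the sdist has been downloaded to the current directory:
# matches_digest(
#     "jwsoup-0.1.1.tar.gz",
#     "fa60d05160279cad8740cfb8c3040373d5970b0451bbc058d697f83cb9d76fc7",
# )
```

pip can perform the same kind of check itself when a requirements file pins hashes and the --require-hashes option is used.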

