Data Scraper
A Python package to extract URLs from websites, extract page content into DOCX files, and convert DOCX files to PDF.
Features
- Extracts all href URLs from a specified website.
- Reads URLs from a CSV file and extracts the content of specified HTML tags (h1, h2, p, span) into separate DOCX files.
- Converts DOCX files into PDF format.
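The first feature, collecting every href from a page, can be illustrated without the package itself. The package uses beautifulsoup4, but the idea can be sketched with only the standard library's html.parser; `HrefCollector` is a hypothetical name for this sketch, not part of the package:

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collects every href attribute found in anchor tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

html = '<a href="/about">About</a> <p>text</p> <a href="https://example.com">Home</a>'
collector = HrefCollector()
collector.feed(html)
print(collector.hrefs)  # ['/about', 'https://example.com']
```

In the real package the HTML would come from a `requests.get(...)` call rather than a literal string.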
Requirements
- Python 3.x
- The following Python packages are required:
  - requests
  - beautifulsoup4
  - python-docx
  - reportlab
Installation
To install the package from PyPI, use:

```
pip install DataScraper==0.1.0
```

Alternatively, install from source:

1. Clone the repository:

   ```
   git clone https://github.com/mukhtarulislam/data_scraper.git
   cd data_scraper
   ```

2. Create a virtual environment (recommended):

   ```
   python -m venv venv
   source venv/bin/activate
   ```

   On Windows use `venv\Scripts\activate` instead.

3. Install the required packages:

   ```
   pip install -r requirements.txt
   ```
How to Run main.py
To extract URLs:

```
python main.py --action extract_urls --url "http://example.com" --csv_folder ./csv_output
```

To extract content from URLs in a CSV:

```
python main.py --action extract_content --csv_folder ./csv_output --docx_folder ./docx_output
```

To convert DOCX files to PDF:

```
python main.py --action convert_to_pdf --docx_folder ./docx_output --pdf_folder ./pdf_output
```
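A command line like the one above is typically built with argparse. The following is a hypothetical reconstruction of the flags shown, not the actual contents of main.py:

```python
import argparse

def build_parser():
    # Reconstructs the CLI surface implied by the commands above (an assumption).
    parser = argparse.ArgumentParser(description="DataScraper command line")
    parser.add_argument("--action", required=True,
                        choices=["extract_urls", "extract_content", "convert_to_pdf"])
    parser.add_argument("--url")
    parser.add_argument("--csv_folder")
    parser.add_argument("--docx_folder")
    parser.add_argument("--pdf_folder")
    return parser

# Simulate the first command from the examples above
args = build_parser().parse_args(
    ["--action", "extract_urls", "--url", "http://example.com",
     "--csv_folder", "./csv_output"])
print(args.action, args.csv_folder)  # extract_urls ./csv_output
```

A real main.py would then dispatch on `args.action` to the three functions shown in the Usage section below.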
Usage
Extract URLs to CSV
To extract URLs from a specified website and save them to a CSV file, use the extract_urls_to_csv function:

```python
from data_scraper.url_extractor import extract_urls_to_csv

website_url = "https://example.com"  # Replace with your target website
folder_name = "output"               # Output folder for the CSV file
extract_urls_to_csv(website_url, folder_name)
```
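The CSV side of this can be sketched with the standard library alone. The exact layout extract_urls_to_csv produces is not documented here, so the one-URL-per-row format and the `urls_to_csv` helper below are assumptions for illustration:

```python
import csv
import os
import tempfile

def urls_to_csv(urls, folder_name, filename="urls.csv"):
    """Write one URL per row; a plausible shape for the package's CSV output."""
    os.makedirs(folder_name, exist_ok=True)
    path = os.path.join(folder_name, filename)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url"])              # header row
        writer.writerows([u] for u in urls)   # one URL per data row
    return path

folder = tempfile.mkdtemp()
path = urls_to_csv(["https://example.com/a", "https://example.com/b"], folder)
with open(path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(rows)  # [['url'], ['https://example.com/a'], ['https://example.com/b']]
```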
Extract Content from URLs
To read URLs from a CSV file and extract content into separate DOCX files, use the extract_content_from_urls function:

```python
from data_scraper.content_extractor import extract_content_from_urls

urls = ["https://example.com/page1", "https://example.com/page2"]  # List of URLs
folder_name = "output"  # Output folder for DOCX files
extract_content_from_urls(urls, folder_name)
```
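The tag filtering this function performs (keeping only h1, h2, p, and span, per the feature list) can be sketched with the standard library's html.parser; `TagTextExtractor` is a hypothetical class for this sketch, and the package itself uses beautifulsoup4 and python-docx instead:

```python
from html.parser import HTMLParser

class TagTextExtractor(HTMLParser):
    """Collects text found inside the tags the package targets."""

    WANTED = {"h1", "h2", "p", "span"}

    def __init__(self):
        super().__init__()
        self._depth = 0   # > 0 while inside at least one wanted tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.WANTED:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.WANTED and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that sits inside a wanted tag
        if self._depth and data.strip():
            self.chunks.append(data.strip())

html = "<h1>Title</h1><div>skip</div><p>Body text</p><span>note</span>"
ex = TagTextExtractor()
ex.feed(html)
print(ex.chunks)  # ['Title', 'Body text', 'note']
```

In the package, each chunk would then be appended to a python-docx Document and saved per URL.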
Convert DOCX to PDF
To convert multiple DOCX files to PDF, use the convert_multiple_docx_to_pdf method of the DocxToPdfConverter class:

```python
from data_scraper.docx_to_pdf_converter import DocxToPdfConverter

converter = DocxToPdfConverter()

# Define input and output directories
input_directory = "folder_word_file"
output_directory = "folder_pdf_file"

# Convert all .docx files to .pdf
converter.convert_multiple_docx_to_pdf(input_directory, output_directory)
```
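The batch half of this operation, finding every .docx in the input folder and deciding where its .pdf should go, can be sketched with pathlib. The `pdf_targets` helper is a hypothetical illustration; the actual conversion in the package is done with reportlab:

```python
from pathlib import Path
import tempfile

def pdf_targets(input_directory, output_directory):
    """Map every .docx file in the input folder to its .pdf output path."""
    out = Path(output_directory)
    return [(p, out / p.with_suffix(".pdf").name)
            for p in sorted(Path(input_directory).glob("*.docx"))]

# Create a throwaway input folder with two empty .docx files
src = Path(tempfile.mkdtemp())
(src / "report.docx").touch()
(src / "notes.docx").touch()

pairs = pdf_targets(src, "folder_pdf_file")
print([t.name for _, t in pairs])  # ['notes.pdf', 'report.pdf']
```

A converter would then iterate over these pairs, reading each source file and rendering the target PDF.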
License
This project is licensed under the MIT License. See the LICENSE file for details.
Contributing
Contributions are welcome! Please follow these steps:
1. Fork the repository.
2. Create a new branch: `git checkout -b feature/YourFeature`
3. Make your changes and commit them: `git commit -m "Add some feature"`
4. Push to the branch: `git push origin feature/YourFeature`
5. Create a new Pull Request.
Contact
For questions or suggestions, please reach out to me at Mukhtarulislam88@hotmail.com.
File details
Details for the file datascraper-0.1.1.tar.gz.
File metadata
- Download URL: datascraper-0.1.1.tar.gz
- Upload date:
- Size: 7.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bc9944950828e33336a5c66a6e5843488a6ca73a63c327ce77de8e723e6f22ca |
| MD5 | da61fc7064f063ad8cd0c21d274b3d9e |
| BLAKE2b-256 | 862f62a1a930d80349d868f20a876123e1616237ddcd4768f67e34da9e59c1fa |
File details
Details for the file DataScraper-0.1.1-py3-none-any.whl.
File metadata
- Download URL: DataScraper-0.1.1-py3-none-any.whl
- Upload date:
- Size: 10.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ed448b3597d3f9862784a83caae60fd8c63a8d22f2387599d202bf3ca21043c9 |
| MD5 | c4f27ae1605158f25358603010151d1f |
| BLAKE2b-256 | a9ef7d8f314090d51f46d035e5ce436f0440c8ed2523ec7b170c7a9e902e7ad9 |