
A web scraper to extract URLs and content from websites and to convert DOCX files to PDF.

Project description

Data Scraper (web_content_extractor)

A Python package to extract URLs from websites, extract page content into DOCX files, and convert DOCX files to PDF.

Features

  • Extracts all href URLs from a specified website.
  • Reads URLs from a CSV file and extracts specified HTML tags (h1, h2, p, span) into separate DOCX files.
  • Converts DOCX files into PDF format.

Requirements

  • Python 3.x
  • The following Python packages are required:
    • requests
    • beautifulsoup4
    • python-docx
    • reportlab

Installation

To install the package from PyPI:

    pip install DataScraper

Or install from source:

  1. Clone the repository:

    git clone https://github.com/mukhtarulislam/data_scraper.git
    cd data_scraper

  2. Create a virtual environment (recommended):

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate

  3. Install the required packages:

    pip install -r requirements.txt

How to Run main.py

To extract URLs:

    python main.py --action extract_urls --url "http://example.com" --csv_folder ./csv_output

To extract content from URLs in a CSV:

    python main.py --action extract_content --csv_folder ./csv_output --docx_folder ./docx_output

To convert DOCX files to PDF:

    python main.py --action convert_to_pdf --docx_folder ./docx_output --pdf_folder ./pdf_output
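The source of main.py is not shown on this page; as a rough sketch, a command-line entry point with these three actions could be built on argparse like this (the build_parser and main names, and the dispatch bodies, are illustrative assumptions, not the package's actual code):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # One --action flag selects which of the three pipeline stages runs.
    parser = argparse.ArgumentParser(description="Data Scraper CLI (illustrative sketch)")
    parser.add_argument("--action", required=True,
                        choices=["extract_urls", "extract_content", "convert_to_pdf"])
    parser.add_argument("--url", help="Target website for extract_urls")
    parser.add_argument("--csv_folder", help="Folder for CSV output/input")
    parser.add_argument("--docx_folder", help="Folder for DOCX output/input")
    parser.add_argument("--pdf_folder", help="Folder for PDF output")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # Dispatch on the chosen action; the real package would call its
    # extractor/converter functions here instead of printing.
    if args.action == "extract_urls":
        print(f"Extracting URLs from {args.url} into {args.csv_folder}")
    elif args.action == "extract_content":
        print(f"Extracting content from CSVs in {args.csv_folder} into {args.docx_folder}")
    else:
        print(f"Converting DOCX in {args.docx_folder} to PDF in {args.pdf_folder}")
    return args
```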

Usage

Extract URLs to CSV

To extract URLs from a specified website and save them to a CSV file, use the extract_urls_to_csv function:

from data_scraper.url_extractor import extract_urls_to_csv
website_url = "https://example.com"  # Replace with your target website
folder_name = "output"  # Output folder for CSV
extract_urls_to_csv(website_url, folder_name)
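The internals of extract_urls_to_csv are not shown on this page; a minimal sketch of how such a function might work with the listed requirements (requests, beautifulsoup4, plus the standard csv module) is below. The collect_hrefs helper and the urls.csv filename are illustrative assumptions:

```python
import csv
import os

import requests
from bs4 import BeautifulSoup


def collect_hrefs(html: str) -> list[str]:
    # Pull every href attribute out of the page's anchor tags, in order.
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]


def extract_urls_to_csv(website_url: str, folder_name: str) -> str:
    # Fetch the page, gather its links, and write them to a one-column CSV.
    response = requests.get(website_url, timeout=10)
    response.raise_for_status()
    urls = collect_hrefs(response.text)

    os.makedirs(folder_name, exist_ok=True)
    csv_path = os.path.join(folder_name, "urls.csv")  # illustrative filename
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url"])
        writer.writerows([url] for url in urls)
    return csv_path
```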

Extract Content from URLs

To extract the content of a list of URLs (for example, read from a CSV produced in the previous step) into separate DOCX files, use the extract_content_from_urls function:

from data_scraper.content_extractor import extract_content_from_urls
urls = ["https://example.com/page1", "https://example.com/page2"]  # List of URLs
folder_name = "output"  # Output folder for DOCX files
extract_content_from_urls(urls, folder_name)
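Again, the package internals are not shown here; as a sketch under the stated requirements, the tag extraction could use beautifulsoup4 and the DOCX writing python-docx. The helper names extract_tag_texts and write_docx are illustrative, not the package's API:

```python
from bs4 import BeautifulSoup

# The README lists these four tags as the ones extracted.
TAGS = ["h1", "h2", "p", "span"]


def extract_tag_texts(html: str, tags=TAGS) -> list[tuple[str, str]]:
    # Return (tag_name, text) pairs for the listed tags, in document order.
    soup = BeautifulSoup(html, "html.parser")
    return [(el.name, el.get_text(strip=True)) for el in soup.find_all(tags)]


def write_docx(pairs: list[tuple[str, str]], path: str) -> None:
    # python-docx is imported lazily so the parsing helper above can be
    # used on its own; h1/h2 become headings, everything else paragraphs.
    from docx import Document

    doc = Document()
    for tag, text in pairs:
        if tag in ("h1", "h2"):
            doc.add_heading(text, level=int(tag[1]))
        else:
            doc.add_paragraph(text)
    doc.save(path)
```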

Convert DOCX to PDF

To convert multiple DOCX files to PDF, use the convert_multiple_docx_to_pdf method of the DocxToPdfConverter class:

from data_scraper.docx_to_pdf_converter import DocxToPdfConverter

converter = DocxToPdfConverter()

# Define input and output directories
input_directory = "folder_word_file"
output_directory = "folder_pdf_file"

# Convert all .docx files to .pdf
converter.convert_multiple_docx_to_pdf(input_directory, output_directory)

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository.
  2. Create a new branch: git checkout -b feature/YourFeature
  3. Make your changes and commit them: git commit -m "Add some feature"
  4. Push to the branch: git push origin feature/YourFeature
  5. Open a new Pull Request.

Contact

For questions or suggestions, please reach out to me at Mukhtarulislam88@hotmail.com.

Download files


Source Distribution

datascraper-0.1.1.tar.gz (7.5 kB)

Uploaded Source

Built Distribution


DataScraper-0.1.1-py3-none-any.whl (10.2 kB)

Uploaded Python 3

File details

Details for the file datascraper-0.1.1.tar.gz.

File metadata

  • Download URL: datascraper-0.1.1.tar.gz
  • Upload date:
  • Size: 7.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for datascraper-0.1.1.tar.gz

  • SHA256: bc9944950828e33336a5c66a6e5843488a6ca73a63c327ce77de8e723e6f22ca
  • MD5: da61fc7064f063ad8cd0c21d274b3d9e
  • BLAKE2b-256: 862f62a1a930d80349d868f20a876123e1616237ddcd4768f67e34da9e59c1fa


File details

Details for the file DataScraper-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: DataScraper-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 10.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for DataScraper-0.1.1-py3-none-any.whl

  • SHA256: ed448b3597d3f9862784a83caae60fd8c63a8d22f2387599d202bf3ca21043c9
  • MD5: c4f27ae1605158f25358603010151d1f
  • BLAKE2b-256: a9ef7d8f314090d51f46d035e5ce436f0440c8ed2523ec7b170c7a9e902e7ad9

