Data Scraper
A Python package to extract URLs from websites, extract page content into DOCX files, and convert DOCX files to PDF.
Features
- Extracts all href URLs from a specified website.
- Reads URLs from a CSV file and extracts the content of specified HTML tags (h1, h2, p, span) into separate DOCX files.
- Converts DOCX files into PDF format.
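The first feature, collecting every href from a page, can be illustrated without the package itself. The package uses beautifulsoup4, but the idea can be sketched with only the standard library's html.parser; `HrefCollector` is a hypothetical name for this sketch, not part of the package:

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collects every href attribute found in anchor tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

html = '<a href="/about">About</a> <p>text</p> <a href="https://example.com">Home</a>'
collector = HrefCollector()
collector.feed(html)
print(collector.hrefs)  # ['/about', 'https://example.com']
```

In the real package the HTML would come from a `requests.get(...)` call rather than a literal string.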
Requirements
- Python 3.x
- The following Python packages are required:
  - requests
  - beautifulsoup4
  - python-docx
  - reportlab
Installation
To install the package from PyPI, use:

```
pip install DataScraper==0.1.0
```

Alternatively, install from source:

1. Clone the repository:

   ```
   git clone https://github.com/mukhtarulislam/data_scraper.git
   cd data_scraper
   ```

2. Create a virtual environment (recommended):

   ```
   python -m venv venv
   source venv/bin/activate
   ```

   On Windows use `venv\Scripts\activate` instead.

3. Install the required packages:

   ```
   pip install -r requirements.txt
   ```
How to Run main.py
To extract URLs:

```
python main.py --action extract_urls --url "http://example.com" --csv_folder ./csv_output
```

To extract content from URLs in a CSV:

```
python main.py --action extract_content --csv_folder ./csv_output --docx_folder ./docx_output
```

To convert DOCX files to PDF:

```
python main.py --action convert_to_pdf --docx_folder ./docx_output --pdf_folder ./pdf_output
```
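A command line like the one above is typically built with argparse. The following is a hypothetical reconstruction of the flags shown, not the actual contents of main.py:

```python
import argparse

def build_parser():
    # Reconstructs the CLI surface implied by the commands above (an assumption).
    parser = argparse.ArgumentParser(description="DataScraper command line")
    parser.add_argument("--action", required=True,
                        choices=["extract_urls", "extract_content", "convert_to_pdf"])
    parser.add_argument("--url")
    parser.add_argument("--csv_folder")
    parser.add_argument("--docx_folder")
    parser.add_argument("--pdf_folder")
    return parser

# Simulate the first command from the examples above
args = build_parser().parse_args(
    ["--action", "extract_urls", "--url", "http://example.com",
     "--csv_folder", "./csv_output"])
print(args.action, args.csv_folder)  # extract_urls ./csv_output
```

A real main.py would then dispatch on `args.action` to the three functions shown in the Usage section below.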
Usage
Extract URLs to CSV
To extract URLs from a specified website and save them to a CSV file, use the extract_urls_to_csv function:

```python
from data_scraper.url_extractor import extract_urls_to_csv

website_url = "https://example.com"  # Replace with your target website
folder_name = "output"               # Output folder for the CSV file
extract_urls_to_csv(website_url, folder_name)
```
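The CSV side of this can be sketched with the standard library alone. The exact layout extract_urls_to_csv produces is not documented here, so the one-URL-per-row format and the `urls_to_csv` helper below are assumptions for illustration:

```python
import csv
import os
import tempfile

def urls_to_csv(urls, folder_name, filename="urls.csv"):
    """Write one URL per row; a plausible shape for the package's CSV output."""
    os.makedirs(folder_name, exist_ok=True)
    path = os.path.join(folder_name, filename)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url"])              # header row
        writer.writerows([u] for u in urls)   # one URL per data row
    return path

folder = tempfile.mkdtemp()
path = urls_to_csv(["https://example.com/a", "https://example.com/b"], folder)
with open(path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(rows)  # [['url'], ['https://example.com/a'], ['https://example.com/b']]
```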
Extract Content from URLs
To read URLs from a CSV file and extract content into separate DOCX files, use the extract_content_from_urls function:

```python
from data_scraper.content_extractor import extract_content_from_urls

urls = ["https://example.com/page1", "https://example.com/page2"]  # List of URLs
folder_name = "output"  # Output folder for DOCX files
extract_content_from_urls(urls, folder_name)
```
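The tag filtering this function performs (keeping only h1, h2, p, and span, per the feature list) can be sketched with the standard library's html.parser; `TagTextExtractor` is a hypothetical class for this sketch, and the package itself uses beautifulsoup4 and python-docx instead:

```python
from html.parser import HTMLParser

class TagTextExtractor(HTMLParser):
    """Collects text found inside the tags the package targets."""

    WANTED = {"h1", "h2", "p", "span"}

    def __init__(self):
        super().__init__()
        self._depth = 0   # > 0 while inside at least one wanted tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.WANTED:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.WANTED and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that sits inside a wanted tag
        if self._depth and data.strip():
            self.chunks.append(data.strip())

html = "<h1>Title</h1><div>skip</div><p>Body text</p><span>note</span>"
ex = TagTextExtractor()
ex.feed(html)
print(ex.chunks)  # ['Title', 'Body text', 'note']
```

In the package, each chunk would then be appended to a python-docx Document and saved per URL.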
Convert DOCX to PDF
To convert multiple DOCX files to PDF, use the convert_multiple_docx_to_pdf method of the DocxToPdfConverter class:

```python
from data_scraper.docx_to_pdf_converter import DocxToPdfConverter

converter = DocxToPdfConverter()

# Define input and output directories
input_directory = "folder_word_file"
output_directory = "folder_pdf_file"

# Convert all .docx files to .pdf
converter.convert_multiple_docx_to_pdf(input_directory, output_directory)
```
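The batch half of this operation, finding every .docx in the input folder and deciding where its .pdf should go, can be sketched with pathlib. The `pdf_targets` helper is a hypothetical illustration; the actual conversion in the package is done with reportlab:

```python
from pathlib import Path
import tempfile

def pdf_targets(input_directory, output_directory):
    """Map every .docx file in the input folder to its .pdf output path."""
    out = Path(output_directory)
    return [(p, out / p.with_suffix(".pdf").name)
            for p in sorted(Path(input_directory).glob("*.docx"))]

# Create a throwaway input folder with two empty .docx files
src = Path(tempfile.mkdtemp())
(src / "report.docx").touch()
(src / "notes.docx").touch()

pairs = pdf_targets(src, "folder_pdf_file")
print([t.name for _, t in pairs])  # ['notes.pdf', 'report.pdf']
```

A converter would then iterate over these pairs, reading each source file and rendering the target PDF.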
License
This project is licensed under the MIT License. See the LICENSE file for details.
Contributing
Contributions are welcome! Please follow these steps:
1. Fork the repository.
2. Create a new branch: `git checkout -b feature/YourFeature`
3. Make your changes and commit them: `git commit -m "Add some feature"`
4. Push to the branch: `git push origin feature/YourFeature`
5. Create a new Pull Request.
Contact
For questions or suggestions, please reach out to me at Mukhtarulislam88@hotmail.com.
File details
Details for the file datascraper-0.1.1.tar.gz.
File metadata
- Download URL: datascraper-0.1.1.tar.gz
- Upload date:
- Size: 7.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bc9944950828e33336a5c66a6e5843488a6ca73a63c327ce77de8e723e6f22ca |
| MD5 | da61fc7064f063ad8cd0c21d274b3d9e |
| BLAKE2b-256 | 862f62a1a930d80349d868f20a876123e1616237ddcd4768f67e34da9e59c1fa |
File details
Details for the file DataScraper-0.1.1-py3-none-any.whl.
File metadata
- Download URL: DataScraper-0.1.1-py3-none-any.whl
- Upload date:
- Size: 10.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ed448b3597d3f9862784a83caae60fd8c63a8d22f2387599d202bf3ca21043c9 |
| MD5 | c4f27ae1605158f25358603010151d1f |
| BLAKE2b-256 | a9ef7d8f314090d51f46d035e5ce436f0440c8ed2523ec7b170c7a9e902e7ad9 |