A Brazilian News Website Data Acquisition Library for Python
pyBrNews Project, made with ❤️ by Lucas Rodrigues (@NepZR).
The pyBrNews project is a Python 3 library, currently in development, for data acquisition from Brazilian news websites. It is capable of extracting news articles and comments from these platforms, with its core built on the requests-HTML library.
💾 The pyBrNews library is also available for download and installation on PyPI.
🇧🇷 You are reading the English version of this README. A Brazilian Portuguese version is also available.
📲 Installation
- Using Python Package Manager (PIP), from PyPI:
pip install pyBrNews
- Using Python Package Manager (PIP), from source (GitHub):
pip install git+https://github.com/NepZR/pyBrNews.git
- Building wheel and installing it directly from source (GitHub):
git clone https://github.com/NepZR/pyBrNews.git && cd pyBrNews/
python setup.py bdist_wheel
pip install dist/pyBrNews-x.x.x-py3-none-any.whl --force-reinstall
Note: replace x.x.x with the version number of the built wheel.
📰 Websites and capture groups supported
Website name | News | Comments | URL
---|---|---|---
Portal G1 | ✅ Working | ⌨️ In progress | Link
Folha de São Paulo | ✅ Working | ✅ Working | Link
Exame | ✅ Working | ⚠️ Not supported | Link
Metrópoles | ⌨️ In progress | ⌨️ In progress | Link
Database: MongoDB (via pyMongo), supported since October 28th, 2022. Local file system storage (JSON / CSV) is also supported, since October 30th, 2022.
Internal Modules: pyBrNews.config.database.PyBrNewsDB and pyBrNews.config.database.PyBrNewsFS
Additional Info: to use local file system storage (JSON / CSV), set the parameter use_database=False in the news package crawlers. Example: crawler = pyBrNews.news.g1.G1News(use_database=False). By default, use_database is True and the crawlers store data in MongoDB through the PyBrNewsDB class.
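As a minimal sketch of the two storage modes, using the G1News crawler from the example above (the import path follows that example; everything else is default behavior):

```python
import pyBrNews.news.g1

# Default mode: use_database=True, parsed data goes to MongoDB via PyBrNewsDB.
db_crawler = pyBrNews.news.g1.G1News()

# File system mode: parsed data is exported as JSON / CSV via PyBrNewsFS.
fs_crawler = pyBrNews.news.g1.G1News(use_database=False)
```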
⌨️ Available methods
Package news
def parse_news(self,
news_urls: List[Union[str, dict]],
parse_body: bool = False,
save_html: bool = True) -> Iterable[dict]:
"""
Extracts all the data from the article in a given news platform by iterating over a URL list. Yields a
dictionary containing all the parsed data from the article.
Parameters:
news_urls (List[Union[str, dict]]): A list containing all the URLs or data dicts to be parsed from a given platform.
parse_body (bool): Defines if the article body will be extracted.
save_html (bool): Defines if the HTML bytes from the article will be extracted.
Returns:
Iterable[dict]: Dictionary containing all the article parsed data.
"""
def search_news(self,
keywords: List[str],
max_pages: int = -1) -> List[Union[str, dict]]:
"""
Extracts all the data or URLs from the news platform based on the keywords given. Returns a list containing the
URLs / data found for the keywords.
Parameters:
keywords (List[str]): A list containing all the keywords to be searched in the news platform.
max_pages (int): Number of result pages to extract article URLs from.
If not set, article URLs are extracted from every available page.
Returns:
List[Union[str, dict]]: List containing all the URLs / data found for the keywords.
"""
Package config.database
- Class PyBrNewsDB
def set_connection(self, host: str = "localhost", port: int = 27017) -> None:
"""
Sets the connection host:port parameters for the MongoDB. By default, uses the standard localhost:27017 for
local usage.
Parameters:
host (str): Hostname or address to connect.
port (int): Port to be used in the connection.
"""
def insert_data(self, parsed_data: dict) -> None:
"""
Inserts the parsed data from a news article or extracted comment into the DB Backend (MongoDB - pyMongo).
Parameters:
parsed_data (dict): Dictionary containing the parsed data from a news article or comment.
Returns:
None: Shows a success message if the insertion occurred normally. If not, shows an error message.
"""
def check_duplicates(self, parsed_data: dict) -> bool:
"""
Checks if the parsed data is already stored in the database, preventing duplicates
during the crawler execution.
Parameters:
parsed_data (dict): Dictionary containing the parsed data from a news article or comment.
Returns:
bool: True if the given parsed data is already in the database. False if not.
"""
- Class PyBrNewsFS
def set_save_path(self, fs_save_path: str) -> None:
"""
Sets the save path for all the exported data generated by this Class.
Example: set_save_path(fs_save_path="/home/ubuntu/newsData/")
Parameters:
fs_save_path (str): Desired save path directory, ending with a slash.
"""
def to_json(self, parsed_data: dict) -> None:
"""
Using the parsed data dictionary from a news article or a comment, export the data as an individual JSON file.
Parameters:
parsed_data (dict): Dictionary containing the parsed data from a news article or a comment.
"""
def export_all_data(self, full_data: List[dict]) -> None:
"""
From a given list of dictionaries containing the parsed data from news or comments, exports a single
CSV file containing all the data.
Parameters:
full_data (List[dict]): List containing the dictionaries of parsed data.
"""
👨🏻💻 Project Developer
Lucas Darlindo Freitas Rodrigues, Data Engineer | Backend Python Developer. LinkedIn: lucasdfr