A Brazilian News Website Data Acquisition Library for Python
pyBrNews Project, made with ❤️ by Lucas Rodrigues (@NepZR).
The pyBrNews project is a Python 3 library, currently in development, for data acquisition tasks on Brazilian news websites. It is capable of extracting news articles and comments from these platforms, with its core built on the Requests-HTML library.
💾 The pyBrNews library is also available for download and installation on PyPI.
🇧🇷 You are reading the English version of this README. For the Brazilian Portuguese version, click here.
📲 Installation
- Using Python Package Manager (PIP), from PyPI:
pip install pyBrNews
- Using Python Package Manager (PIP), from source (GitHub):
pip install git+https://github.com/NepZR/pyBrNews.git
- Building wheel and installing it directly from source (GitHub):
git clone https://github.com/NepZR/pyBrNews.git && cd pyBrNews/
python setup.py bdist_wheel
pip install dist/pyBrNews-x.x.x-py3-none-any.whl --force-reinstall
Note: replace x.x.x with the desired version number.
📰 Websites and capture groups supported
| Website name | News | Comments | URL |
|--------------|------|----------|-----|
| Portal G1 | ✅ Working | ⌨️ In progress | Link |
| Folha de São Paulo | ✅ Working | ✅ Working | Link |
| Exame | ✅ Working | ⚠️ Not supported | Link |
| Metrópoles | ⌨️ In progress | ⌨️ In progress | Link |
Database: MongoDB (via pyMongo), supported since October 28th, 2022. Local file system storage (JSON / CSV) is also supported, since October 30th, 2022.

Internal modules: `pyBrNews.config.database.PyBrNewsDB` and `pyBrNews.config.database.PyBrNewsFS`.

Additional info: to use local file system storage (JSON / CSV), set the parameter `use_database=False` in the news package crawlers. Example: `crawler = pyBrNews.news.g1.G1News(use_database=False)`. By default, `use_database` is `True` and the crawlers use the MongoDB database through the `PyBrNewsDB` class.
⌨️ Available methods
Package `news`
```python
def parse_news(self,
               news_urls: List[Union[str, dict]],
               parse_body: bool = False,
               save_html: bool = True) -> Iterable[dict]:
    """
    Extracts all the data from the article in a given news platform by iterating over a URL list. Yields a
    dictionary containing all the parsed data from the article.

    Parameters:
        news_urls (List[Union[str, dict]]): A list containing all the URLs, or data dicts, to be parsed from a
                                            given platform.
        parse_body (bool): Defines if the article body will be extracted.
        save_html (bool): Defines if the HTML bytes from the article will be extracted.

    Returns:
        Iterable[dict]: Dictionary containing all the article parsed data.
    """
```
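Since `parse_news` is a generator, it lazily yields one dictionary per article. The flow can be sketched with a self-contained, illustrative re-implementation (the function below is not part of pyBrNews, and the field names are assumptions):

```python
from typing import Iterable, List, Union

def parse_news_sketch(news_urls: List[Union[str, dict]],
                      parse_body: bool = False,
                      save_html: bool = True) -> Iterable[dict]:
    """Illustrative sketch: yield one parsed-data dict per input entry."""
    for entry in news_urls:
        # Entries may be plain URLs or data dicts, mirroring the real signature.
        url = entry if isinstance(entry, str) else entry.get("url", "")
        parsed = {"url": url}
        if parse_body:
            parsed["body"] = ""   # article body text would be extracted here
        if save_html:
            parsed["html"] = b""  # raw HTML bytes would be kept here
        yield parsed

articles = list(parse_news_sketch(["https://example.com/news/1"], parse_body=True))
```

Because the method yields its results, iterating consumes them once; wrap the call in `list()` to keep all parsed articles.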
```python
def search_news(self,
                keywords: List[str],
                max_pages: int = -1) -> List[Union[str, dict]]:
    """
    Extracts all the data or URLs from the news platform based on the keywords given. Returns a list containing the
    URLs / data found for the keywords.

    Parameters:
        keywords (List[str]): A list containing all the keywords to be searched in the news platform.
        max_pages (int): Number of pages to have the article URLs extracted from.
                         If not set, extracts articles up to the last available page.

    Returns:
        List[Union[str, dict]]: List containing all the URLs / data found for the keywords.
    """
```
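A typical workflow chains the two methods: search for URLs, then parse them. The search side can be sketched with an in-memory index standing in for the news platform (the index and the `page_size` parameter are assumptions for illustration, not part of pyBrNews):

```python
from typing import List, Tuple

def search_news_sketch(keywords: List[str],
                       index: List[Tuple[str, str]],
                       max_pages: int = -1,
                       page_size: int = 2) -> List[str]:
    """Illustrative sketch: keep URLs whose title matches any keyword."""
    matches = [url for url, title in index
               if any(kw.lower() in title.lower() for kw in keywords)]
    if max_pages >= 0:
        # max_pages = -1 mirrors the real default: no page limit.
        matches = matches[:max_pages * page_size]
    return matches

fake_index = [
    ("https://example.com/a", "Elections in Brazil"),
    ("https://example.com/b", "Sports roundup"),
]
urls = search_news_sketch(["elections"], index=fake_index)
```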
Package `config.database`
- Class `PyBrNewsDB`
```python
def set_connection(self, host: str = "localhost", port: int = 27017) -> None:
    """
    Sets the connection host:port parameters for the MongoDB. By default, uses the standard localhost:27017 for
    local usage.

    Parameters:
        host (str): Hostname or address to connect.
        port (int): Port to be used in the connection.
    """
```
```python
def insert_data(self, parsed_data: dict) -> None:
    """
    Inserts the parsed data from a news article or extracted comment into the DB Backend (MongoDB - pyMongo).

    Parameters:
        parsed_data (dict): Dictionary containing the parsed data from a news article or comment.

    Returns:
        None: Shows a success message if the insertion occurred normally. If not, shows an error message.
    """
```
```python
def check_duplicates(self, parsed_data: dict) -> bool:
    """
    Checks if the parsed data is already in the database and prevents it from being duplicated
    during crawler execution.

    Parameters:
        parsed_data (dict): Dictionary containing the parsed data from a news article or comment.

    Returns:
        bool: True if the given parsed data is already in the database. False if not.
    """
```
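The duplicate check boils down to asking whether an equivalent document is already stored. A minimal sketch using an in-memory list in place of MongoDB, matching on a hypothetical `url` field (the real class may compare other fields):

```python
def check_duplicates_sketch(parsed_data: dict, stored: list) -> bool:
    """Illustrative sketch: a document is a duplicate when an entry
    with the same URL is already stored."""
    return any(doc.get("url") == parsed_data.get("url") for doc in stored)

db = [{"url": "https://example.com/news/1", "title": "First"}]
new_doc = {"url": "https://example.com/news/1", "title": "First (edited)"}
if not check_duplicates_sketch(new_doc, db):
    db.append(new_doc)  # skipped here: the URL is already present
```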
- Class `PyBrNewsFS`
```python
def set_save_path(self, fs_save_path: str) -> None:
    """
    Sets the save path for all the exported data generated by this Class.

    Example: set_save_path(fs_save_path="/home/ubuntu/newsData/")

    Parameters:
        fs_save_path (str): Desired save path directory, ending with a slash.
    """
```
```python
def to_json(self, parsed_data: dict) -> None:
    """
    Using the parsed data dictionary from a news article or a comment, exports the data as an individual JSON file.

    Parameters:
        parsed_data (dict): Dictionary containing the parsed data from a news article or a comment.
    """
```
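What `to_json` does can be sketched with the standard library: serialize one parsed-data dict into its own JSON file (the `id` field used for the filename is an assumption, not a documented pyBrNews field):

```python
import json
import pathlib
import tempfile

def to_json_sketch(parsed_data: dict, save_path: str) -> str:
    """Illustrative sketch: write one parsed article as its own JSON file,
    named after a hypothetical 'id' field."""
    out = pathlib.Path(save_path) / f"{parsed_data.get('id', 'article')}.json"
    out.write_text(json.dumps(parsed_data, ensure_ascii=False, indent=2),
                   encoding="utf-8")
    return str(out)

with tempfile.TemporaryDirectory() as tmp:
    path = to_json_sketch({"id": "g1-001", "title": "Example"}, tmp)
    data = json.loads(pathlib.Path(path).read_text(encoding="utf-8"))
```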
```python
def export_all_data(self, full_data: List[dict]) -> None:
    """
    Given a list of dictionaries containing the parsed data from news or comments, exports a single CSV file
    containing all the data.

    Parameters:
        full_data (List[dict]): List containing the dictionaries of parsed data.
    """
```
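Exporting many parsed-data dicts into one CSV amounts to flattening them with `csv.DictWriter`. A self-contained sketch of that step (the column set and alphabetical ordering are assumptions for illustration):

```python
import csv
import io
from typing import List

def export_all_data_sketch(full_data: List[dict]) -> str:
    """Illustrative sketch: flatten a list of parsed-data dicts into one CSV.
    Column names are the union of all keys; missing values become empty cells."""
    fields = sorted({key for row in full_data for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    writer.writerows(full_data)
    return buffer.getvalue()

csv_text = export_all_data_sketch([
    {"url": "https://example.com/a", "title": "A"},
    {"url": "https://example.com/b", "title": "B", "author": "N"},
])
```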
👨🏻💻 Project Developer
Lucas Darlindo Freitas Rodrigues | Data Engineer | Backend Python Dev. | LinkedIn (lucasdfr)