A Brazilian News Website Data Acquisition Library for Python

pyBrNews Project, made with ❤️ by Lucas Rodrigues (@NepZR).

The pyBrNews project is a Python 3 library, currently in development, for data acquisition tasks on Brazilian news websites. It is capable of extracting news articles and comments from these platforms, and its core relies on the requests-HTML library.

💾 The pyBrNews library is also available for download and installation on PyPI! Click here.

🇧🇷 You are reading the English version of this README. To read the Brazilian Portuguese version, click here.

📲 Installation

  • Using Python Package Manager (PIP), from PyPI:
    pip install pyBrNews
    
  • Using Python Package Manager (PIP), from source (GitHub):
    pip install git+https://github.com/NepZR/pyBrNews.git
    
  • Building wheel and installing it directly from source (GitHub):
    git clone https://github.com/NepZR/pyBrNews.git && cd pyBrNews/
    
    python setup.py bdist_wheel
    
    pip install dist/pyBrNews-x.x.x-py3-none-any.whl --force-reinstall
    

    Note: replace x.x.x with the version.


📰 Websites and capture groups supported

Website name       | News            | Comments          | URL
Portal G1          | ✅ Working       | ⌨️ In progress     | Link
Folha de São Paulo | ✅ Working       | ✅ Working         | Link
Exame              | ✅ Working       | ⚠️ Not supported   | Link
Metrópoles         | ⌨️ In progress   | ⌨️ In progress     | Link

Database: MongoDB (via pyMongo), supported since October 28, 2022. Local file system storage (JSON / CSV) is also supported, since October 30, 2022.
Internal Modules: pyBrNews.config.database.PyBrNewsDB and pyBrNews.config.database.PyBrNewsFS

Additional Info: to use local file system storage (JSON / CSV), set the parameter use_database=False in the news package crawlers. Example: crawler = pyBrNews.news.g1.G1News(use_database=False). By default, it is True and data is stored in the MongoDB database through the PyBrNewsDB class.
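
For illustration, a minimal sketch of both storage configurations, using the G1 crawler import path given in the example above:

import pyBrNews.news.g1

# Store parsed data locally as JSON / CSV files instead of MongoDB.
fs_crawler = pyBrNews.news.g1.G1News(use_database=False)

# Default behaviour: use_database=True, storing data through PyBrNewsDB (MongoDB).
db_crawler = pyBrNews.news.g1.G1News()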


⌨️ Available methods

Package news

def parse_news(self,
               news_urls: List[Union[str, dict]],
               parse_body: bool = False,
               save_html: bool = True) -> Iterable[dict]:
    """
    Extracts all the data from the article in a given news platform by iterating over a URL list. Yields a 
    dictionary containing all the parsed data from the article.

    Parameters:
        news_urls (List[Union[str, dict]]): A list containing all the URLs or data dicts to be parsed from a given platform.
        parse_body (bool): Defines if the article body will be extracted.
        save_html (bool): Defines if the HTML bytes from the article will be extracted.
    Returns:
         Iterable[dict]: Dictionaries containing all the parsed data from each article.
    """
def search_news(self,
                keywords: List[str],
                max_pages: int = -1) -> List[Union[str, dict]]:
    """
    Extracts all the data or URLs from the news platform based on the keywords given. Returns a list containing the
    URLs / data found for the keywords.

    Parameters:
        keywords (List[str]): A list containing all the keywords to be searched in the news platform.
        max_pages (int): Number of pages to extract article URLs from.
                         If not set, all available pages are parsed.
    Returns:
         List[Union[str, dict]]: List containing all the URLs / data found for the keywords.
    """

Package config.database

  • Class PyBrNewsDB
def set_connection(self, host: str = "localhost", port: int = 27017) -> None:
    """
    Sets the host:port connection parameters for MongoDB. By default, uses the standard localhost:27017 for
    local usage.
    
    Parameters:
         host (str): Hostname or address to connect.
         port (int): Port to be used in the connection.
    """
def insert_data(self, parsed_data: dict) -> None:
    """
    Inserts the parsed data from a news article or extracted comment into the DB Backend (MongoDB - pyMongo).
    
    Parameters: 
        parsed_data (dict): Dictionary containing the parsed data from a news article or comment.
    Returns:
        None: Shows a success message if the insertion occurred normally. If not, shows an error message.
    """
def check_duplicates(self, parsed_data: dict) -> bool:
    """
    Checks whether the parsed data is already in the database, preventing it from being duplicated
    during crawler execution.
    
    Parameters: 
        parsed_data (dict): Dictionary containing the parsed data from a news article or comment.
    Returns:
        bool: True if the given parsed data is already in the database. False if not.
    """
  • Class PyBrNewsFS
def set_save_path(self, fs_save_path: str) -> None:
    """
    Sets the save path for all the exported data generated by this Class.

    Example: set_save_path(fs_save_path="/home/ubuntu/newsData/")

    Parameters:
         fs_save_path (str): Desired save path directory, ending with a slash.
    """
def to_json(self, parsed_data: dict) -> None:
    """
    Exports the parsed data dictionary from a news article or comment as an individual JSON file.

    Parameters:
        parsed_data (dict): Dictionary containing the parsed data from a news article or a comment.
    """
def export_all_data(self, full_data: List[dict]) -> None:
    """
    Given a list of dictionaries containing the parsed data from news articles or comments, exports a single
    CSV file containing all the data.

    Parameters:
        full_data (List[dict]): List containing the dictionaries of parsed data.
    """

👨🏻‍💻 Project Developer


Lucas Darlindo Freitas Rodrigues

Data Engineer | Backend Python Dev.
LinkedIn (lucasdfr)

