
A Brazilian News Website Data Acquisition Library for Python

pyBrNews Project, made with ❤️ by Lucas Rodrigues (@NepZR).

The pyBrNews project is a Python 3 library, currently in development, for data acquisition tasks on Brazilian news websites. It is capable of extracting news articles and comments from these platforms, and its core relies on the requests-HTML library.

💾 The pyBrNews library is also available for download and installation on PyPI! Click here.

🇧🇷 You are reading the English version of this README. To read the Brazilian Portuguese version, click here.

📲 Installation

  • Using Python Package Manager (PIP), from PyPI:
    pip install pyBrNews
    
  • Using Python Package Manager (PIP), from source (GitHub):
    pip install git+https://github.com/NepZR/pyBrNews.git
    
  • Building wheel and installing it directly from source (GitHub):
    git clone https://github.com/NepZR/pyBrNews.git && cd pyBrNews/
    
    python setup.py bdist_wheel
    
    pip install dist/pyBrNews-x.x.x-py3-none-any.whl --force-reinstall
    

    Note: replace x.x.x with the version number of the generated wheel.


📰 Websites and capture groups supported

Website name       | News           | Comments         | URL
Portal G1          | ✅ Working     | ⌨️ In progress   | Link
Folha de São Paulo | ✅ Working     | ✅ Working       | Link
Exame              | ✅ Working     | ⚠️ Not supported | Link
Metrópoles         | ⌨️ In progress | ⌨️ In progress   | Link

Database: MongoDB (via pyMongo), supported since October 28, 2022. Local file system storage (JSON / CSV) is also supported, since October 30, 2022.
Internal Modules: pyBrNews.config.database.PyBrNewsDB and pyBrNews.config.database.PyBrNewsFS

Additional Info: to use local file system storage (JSON / CSV), set the parameter use_database=False in the news package crawlers. Example: crawler = pyBrNews.news.g1.G1News(use_database=False). By default, it is True and data is stored in the MongoDB database through the PyBrNewsDB class.
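
The snippet below is a minimal sketch of both storage modes, reusing the G1News example from the paragraph above; the no-argument constructor call for the database-backed mode is an assumption for illustration, not a documented requirement.

import pyBrNews.news.g1 as g1

# Default behaviour (assumed): parsed data is stored in MongoDB through the PyBrNewsDB class.
db_crawler = g1.G1News()

# Local file system storage (JSON / CSV) through the PyBrNewsFS class.
fs_crawler = g1.G1News(use_database=False)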


⌨️ Available methods

Package news

def parse_news(self,
               news_urls: List[Union[str, dict]],
               parse_body: bool = False,
               save_html: bool = True) -> Iterable[dict]:
    """
    Extracts all the data from the article in a given news platform by iterating over a URL list. Yields a 
    dictionary containing all the parsed data from the article.

    Parameters:
        news_urls (List[Union[str, dict]]): A list containing all the URLs or data dicts to be parsed from a given platform.
        parse_body (bool): Defines if the article body will be extracted.
        save_html (bool): Defines if the HTML bytes from the article will be extracted.
    Returns:
         Iterable[dict]: Dictionary containing all the article parsed data.
    """
def search_news(self,
                keywords: List[str],
                max_pages: int = -1) -> List[Union[str, dict]]:
    """
    Extracts all the data or URLs from the news platform based on the keywords given. Returns a list containing the
    URLs / data found for the keywords.

    Parameters:
        keywords (List[str]): A list containing all the keywords to be searched in the news platform.
        max_pages (int): Number of pages to have the article URLs extracted from.
                         If not set, extracts URLs from every available page.
    Returns:
         List[Union[str, dict]]: List containing all the URLs / data found for the keywords.
    """

Package config.database

  • Class PyBrNewsDB
def set_connection(self, host: str = "localhost", port: int = 27017) -> None:
    """
    Sets the connection host:port parameters for the MongoDB. By default, uses the standard localhost:27017 for
    local usage.
    
    Parameters:
         host (str): Hostname or address to connect.
         port (int): Port to be used in the connection.
    """
def insert_data(self, parsed_data: dict) -> None:
    """
    Inserts the parsed data from a news article or extracted comment into the DB Backend (MongoDB - pyMongo).
    
    Parameters: 
        parsed_data (dict): Dictionary containing the parsed data from a news article or comment.
    Returns:
        None: Shows a success message if the insertion occurred normally. If not, shows an error message.
    """
def check_duplicates(self, parsed_data: dict) -> bool:
    """
    Checks if the parsed data is already in the database, preventing it from being duplicated
    during the crawler execution.
    
    Parameters: 
        parsed_data (dict): Dictionary containing the parsed data from a news article or comment.
    Returns:
        bool: True if the given parsed data is already in the database. False if not.
    """
  • Class PyBrNewsFS
def set_save_path(self, fs_save_path: str) -> None:
    """
    Sets the save path for all the exported data generated by this Class.

    Example: set_save_path(fs_save_path="/home/ubuntu/newsData/")

    Parameters:
         fs_save_path (str): Desired save path directory, ending with a slash.
    """
def to_json(self, parsed_data: dict) -> None:
    """
    Using the parsed data dictionary from a news article or a comment, export the data as an individual JSON file.

    Parameters:
        parsed_data (dict): Dictionary containing the parsed data from a news article or a comment.
    """
def export_all_data(self, full_data: List[dict]) -> None:
    """
    Given a list of dictionaries containing the parsed data from news articles or comments, exports a single
    CSV file containing all the data.

    Parameters:
        full_data (List[dict]): List containing the dictionaries of parsed data.
    """

👨🏻‍💻 Project Developer


Lucas Darlindo Freitas Rodrigues

Data Engineer | Backend Python Dev.
LinkedIn (lucasdfr)


