Skip to main content

A Powerful WebScraper With Unmatched Performance

Project description

Fadex: A Powerful Web Scraper With Unmatched Performance

Overview

Fadex is a Python module that provides powerful web scraping functionalities, including fetching web pages, extracting metadata, and parsing HTML content. Built with a Rust backend using PyO3, it aims to provide high performance and ease of use for web scraping tasks.

Installation

You can easily install Fadex using pip:

pip install fadex

Usage

Basic Example

To fetch the content of a web page asynchronously, you can use the fetch_page function:

import asyncio
from fadex import fetch_page_py

async def fetch_page(url):
    try:
        content = await fetch_page_py(url)
        print("Page content fetched successfully:")
        print(content)
    except Exception as e:
        print(f"Failed to fetch page: {e}")

# Example usage
url = "http://gigmasters.it"
asyncio.run(fetch_page(url))

API Reference

Functions

get_meta_and_title_py(html: str) -> Tuple[Optional[str], Optional[str]]

Parses the HTML content and extracts the title and meta description.

  • Parameters:
    • html: A string containing the HTML content.
  • Returns:
    • A tuple containing:
      • title: An optional string representing the page title.
      • description: An optional string representing the meta description.

extract_links_py(html: str, base_url: str) -> List[str]

Extracts and sanitizes all href links from the HTML content.

  • Parameters:
    • html: A string containing the HTML content.
    • base_url: A string representing the base URL to resolve relative links.
  • Returns:
    • A list of sanitized URLs extracted from the HTML.

fetch_page_py(url: str) -> Awaitable[str]

Asynchronously fetches the content of a web page.

  • Parameters:
    • url: A string containing the URL of the page to fetch.
  • Returns:
    • A string containing the content of the fetched page.

Performance Comparison

We conducted a performance comparison between Fadex, BeautifulSoup, and lxml by extracting the metadata (title and description) and extracting all links from 10 popular websites. The results are as follows:

Metadata Extraction Performance

Fadex Metadata Extraction Average Time: 0.56 seconds (Successful Extracts: 100)
BeautifulSoup Metadata Extraction Average Time: 0.78 seconds (Successful Extracts: 100)
lxml Metadata Extraction Average Time: 0.69 seconds (Successful Extracts: 100)

Performance Comparison for Metadata Extraction:
Fadex Time: 0.56 seconds
BeautifulSoup Time: 0.78 seconds
lxml Time: 0.69 seconds

Winner for Metadata Extraction: Fadex

Link Extraction Performance

Fadex Link Extraction Average Time: 0.62 seconds (Successful Extracts: 100)
BeautifulSoup Link Extraction Average Time: 0.81 seconds (Successful Extracts: 100)
lxml Link Extraction Average Time: 0.65 seconds (Successful Extracts: 100)

Performance Comparison for Link Extraction:
Fadex Time: 0.62 seconds
BeautifulSoup Time: 0.81 seconds
lxml Time: 0.65 seconds

Winner for Link Extraction: Fadex

These results show that Fadex outperforms both BeautifulSoup and lxml in terms of average response time for extracting metadata and links. However, the performance of each library can also depend on factors such as the complexity of the HTML content and the internet connection stability.

Example Code for Performance Comparison

Below is the code used for the performance comparison:

import asyncio
import time
from fadex import fetch_page_py, get_meta_and_title_py, extract_links_py
from bs4 import BeautifulSoup
from lxml import html as lxml_html
from urllib.parse import urljoin, urlparse

# Function to extract metadata using Fadex
def extract_metadata_with_fadex(html_content):
    try:
        title, description = get_meta_and_title_py(html_content)
        return True, title, description
    except Exception as e:
        return False, None, None

# Function to extract metadata using BeautifulSoup
def extract_metadata_with_beautifulsoup(html_content):
    try:
        soup = BeautifulSoup(html_content, 'html.parser')
        title = soup.title.string if soup.title else None
        description = None
        meta_tag = soup.find('meta', attrs={'name': 'description'})
        if meta_tag:
            description = meta_tag.get('content')
        return True, title, description
    except Exception as e:
        return False, None, None

# Function to extract metadata using lxml
def extract_metadata_with_lxml(html_content):
    try:
        tree = lxml_html.fromstring(html_content)
        title = tree.find('.//title').text if tree.find('.//title') is not None else None
        description = None
        meta = tree.xpath('//meta[@name="description"]')
        if meta and 'content' in meta[0].attrib:
            description = meta[0].attrib['content']
        return True, title, description
    except Exception as e:
        return False, None, None

# Function to extract links using Fadex
def extract_links_with_fadex(html_content, base_url):
    try:
        links = extract_links_py(html_content, base_url)
        return True, links
    except Exception as e:
        return False, []

# Function to extract links using BeautifulSoup
def extract_links_with_beautifulsoup(html_content, base_url):
    try:
        soup = BeautifulSoup(html_content, 'html.parser')
        links = [urljoin(base_url, a['href']) for a in soup.find_all('a', href=True)]
        return True, [link for link in links if urlparse(link).scheme in ["http", "https"]]
    except Exception as e:
        return False, []

# Function to extract links using lxml
def extract_links_with_lxml(html_content, base_url):
    try:
        tree = lxml_html.fromstring(html_content)
        links = [urljoin(base_url, link) for link in tree.xpath('//a/@href')]
        return True, [link for link in links if urlparse(link).scheme in ["http", "https"]]
    except Exception as e:
        return False, []

# Function to measure average performance for each library
def measure_metadata_performance(html_contents, extract_func, iterations=5):
    total_time = 0
    successful_extracts = 0
    for _ in range(iterations):
        for html_content in html_contents:
            start_time = time.time()
            success, title, description = extract_func(html_content)
            total_time += time.time() - start_time
            if success:
                successful_extracts += 1
    average_time = total_time / (len(html_contents) * iterations)
    return average_time, successful_extracts

# Function to measure link extraction performance for each library
def measure_link_extraction_performance(html_contents, base_urls, extract_func, iterations=5):
    total_time = 0
    successful_extracts = 0
    for _ in range(iterations):
        for html_content, base_url in zip(html_contents, base_urls):
            start_time = time.time()
            success, links = extract_func(html_content, base_url)
            total_time += time.time() - start_time
            if success:
                successful_extracts += 1
    average_time = total_time / (len(html_contents) * iterations)
    return average_time, successful_extracts

# Main function to run the tests
async def main():
    # List of popular URLs for testing
    urls = [
        "https://www.google.com",
        "https://www.wikipedia.org",
        "https://www.github.com",
        "https://www.reddit.com",
        "https://www.stackoverflow.com",
        "https://www.nytimes.com",
        "https://www.bbc.com",
        "https://www.amazon.com",
        "https://www.apple.com",
        "https://www.microsoft.com"
    ]

    # Fetch page content using Fadex
    html_contents = []
    for url in urls:
        try:
            content = await fetch_page_py(url)
            html_contents.append(content)
        except Exception as e:
            print(f"Failed to fetch page from {url}: {e}")

    # Define number of iterations for performance measurement
    iterations = 10

    # Measure performance for Fadex (metadata extraction)
    fadex_meta_average_time, fadex_meta_success = measure_metadata_performance(
        html_contents, extract_metadata_with_fadex, iterations
    )

    # Measure performance for BeautifulSoup (metadata extraction)
    bs_meta_average_time, bs_meta_success = measure_metadata_performance(
        html_contents, extract_metadata_with_beautifulsoup, iterations
    )

    # Measure performance for lxml (metadata extraction)
    lxml_meta_average_time, lxml_meta_success = measure_metadata_performance(
       

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

Fadex-0.1.3.tar.gz (36.5 kB view hashes)

Uploaded Source

Built Distributions

Fadex-0.1.3-cp310-none-win_amd64.whl (3.2 MB view hashes)

Uploaded CPython 3.10 Windows x86-64

Fadex-0.1.3-cp310-cp310-manylinux_2_34_x86_64.whl (4.4 MB view hashes)

Uploaded CPython 3.10 manylinux: glibc 2.34+ x86-64

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page