
Python Oriented Crawling Ongoing (POCONG): a simple crawling framework

POCONG 🪦

Python Oriented Crawling ON Going

POCONG is a lightweight web crawling framework built in Python.

Features

  • 🔒 Get Free Proxy: Automatic proxy fetching, validation, and rotation from free proxy sources
  • 🌐 Dynamic Media Web Scraping: Extract content, metadata, and media information from web pages with proxy support
  • 📱 Social Media Scraping: Extract data from social media platforms (coming soon)
  • 🛒 E-commerce Scraping: Extract product information from e-commerce websites (coming soon)

Installation

pip install pocong

Usage: Get Proxy from proxy_spiders

Use the get_proxy and get_proxy_random methods of the GetProxy class in pocong.proxy_spiders to fetch working proxies.

from pocong.proxy_spiders import GetProxy

gp = GetProxy()

# Get the first working proxy
proxy = gp.get_proxy()
print("First working proxy:", proxy)

# Get a random working proxy (reuses the gp instance created above)
random_proxy = gp.get_proxy_random()
print("Random working proxy:", random_proxy)

Sample output:

First working proxy: {'ip': '123.45.67.89', 'port': '8080', 'https': 'yes', ...}
Random working proxy: {'ip': '98.76.54.32', 'port': '3128', 'https': 'yes', ...}

You can use the returned proxy dictionary with the requests library, for example:

import requests
from pocong.proxy_spiders import GetProxy

gp = GetProxy()
proxy = gp.get_proxy()
if proxy:
    # The proxy itself is addressed over http:// for both schemes
    proxies = {
        'http': f"http://{proxy['ip']}:{proxy['port']}",
        'https': f"http://{proxy['ip']}:{proxy['port']}"
    }
    # Set a timeout so a slow or dead proxy cannot hang the request
    response = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=10)
    print(response.json())
else:
    print("No working proxy found.")

  • get_proxy() returns the first working proxy found.
  • get_proxy_random() returns a random working proxy (with up to 20 retries).

Both methods return a dictionary with proxy details (e.g., { 'ip': '...', 'port': '...', ... }) or None if no working proxy is found.
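
Because either method may return None, it can help to wrap the conversion to a requests-style mapping in a small helper. The sketch below is only an illustration: to_requests_proxies is a hypothetical name, not part of POCONG, and it assumes the proxy dictionary carries the 'ip' and 'port' keys shown in the sample output.

```python
from typing import Optional


def to_requests_proxies(proxy: Optional[dict]) -> Optional[dict]:
    """Build a requests-style proxies mapping from a proxy dict, or None.

    Hypothetical helper; assumes the dict has 'ip' and 'port' keys.
    """
    if not proxy:
        return None
    # Same proxy URL for both schemes, as in the requests example above.
    address = f"http://{proxy['ip']}:{proxy['port']}"
    return {"http": address, "https": address}


# Stand-in values taken from the sample output, not a live proxy.
print(to_requests_proxies({"ip": "123.45.67.89", "port": "8080", "https": "yes"}))
print(to_requests_proxies(None))
```

Passing the result straight to requests.get(..., proxies=...) works in both cases, since requests treats proxies=None as "no proxy configured".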

Usage: Dynamic Media Web Scraping

The DynamicScrapingNews class scrapes a single web page and extracts its content, metadata, and media information, with built-in proxy support.

Basic Usage

from pocong.media_spiders import DynamicScrapingNews

# Simple scraping without proxy
scraper = DynamicScrapingNews("https://example.com", use_proxy=False)
result = scraper.scrape()

# Extract specific information
print(f"Title: {result['title']}")
print(f"URL: {result['url']}")
print(f"Media: {result['media']}")
print(f"Published: {result['published_date']}")
print(f"Text content: {result['text'][:200]}...")  # First 200 chars

Proxy Configuration Options

1. Automatic Proxy (Default)

# Uses automatic proxy fetching
scraper = DynamicScrapingNews("https://example.com")
result = scraper.scrape()

2. Manual Proxy Configuration

# Method 1: IP:Port format
scraper = DynamicScrapingNews("https://example.com", 
                              manual_proxy="192.168.1.1:8080")

# Method 2: Full URL format
scraper = DynamicScrapingNews("https://example.com", 
                              manual_proxy="http://192.168.1.1:8080")

# Method 3: Dictionary format
scraper = DynamicScrapingNews("https://example.com", 
                              manual_proxy={"ip": "192.168.1.1", "port": "8080"})

result = scraper.scrape()
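
To make the three accepted formats concrete, here is one way such inputs could be reduced to a single ip/port dictionary. This is a sketch under assumptions, not POCONG's actual implementation: normalize_manual_proxy is a hypothetical function, and the real library may parse or validate these values differently.

```python
from urllib.parse import urlsplit


def normalize_manual_proxy(manual_proxy):
    """Return {'ip': ..., 'port': ...} for any of the three documented formats.

    Hypothetical illustration only; not taken from the POCONG source.
    """
    if isinstance(manual_proxy, dict):
        return {"ip": str(manual_proxy["ip"]), "port": str(manual_proxy["port"])}
    text = manual_proxy.strip()
    # urlsplit only recognises host and port when a scheme is present.
    if "://" not in text:
        text = "http://" + text
    parts = urlsplit(text)
    if parts.hostname is None or parts.port is None:
        raise ValueError(f"Cannot parse proxy: {manual_proxy!r}")
    return {"ip": parts.hostname, "port": str(parts.port)}


for value in ("192.168.1.1:8080",
              "http://192.168.1.1:8080",
              {"ip": "192.168.1.1", "port": "8080"}):
    print(normalize_manual_proxy(value))
```

All three inputs produce the same dictionary, which is why the formats can be used interchangeably in the examples above.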

3. No Proxy

# Disable proxy completely
scraper = DynamicScrapingNews("https://example.com", use_proxy=False)
result = scraper.scrape()

4. Manual Proxy Override

# Manual proxy overrides use_proxy setting
scraper = DynamicScrapingNews("https://example.com", 
                              use_proxy=False, 
                              manual_proxy="192.168.1.1:8080")
result = scraper.scrape()

Complete Example with Proxy Integration

from pocong.proxy_spiders import GetProxy
from pocong.media_spiders import DynamicScrapingNews

# Get a working proxy (None if no working proxy is found)
proxy = GetProxy().get_proxy()
print(f"Using proxy: {proxy}")

# Use automatic proxy (default behavior)
scraper = DynamicScrapingNews("https://example.com")
result = scraper.scrape()

if proxy:
    # Use manual proxy with ip:port format
    scraper = DynamicScrapingNews("https://example.com",
                                  manual_proxy=f"{proxy['ip']}:{proxy['port']}")
    result = scraper.scrape()

    # Use manual proxy with dictionary format
    scraper = DynamicScrapingNews("https://example.com",
                                  manual_proxy={"ip": proxy['ip'], "port": proxy['port']})
    result = scraper.scrape()
else:
    print("No working proxy found; skipping the manual proxy examples.")

Extracted Data Structure

The scrape() method returns a dictionary containing:

{
    'title': 'Page Title',           # Extracted from og:title or title tag
    'url': 'https://example.com',    # Canonical URL
    'image': 'https://...',          # Featured image URL
    'html': '<html>...</html>',      # Full HTML content
    'text': 'Clean text content',    # Processed text without HTML
    'media': 'example',              # Domain name extracted from URL
    'published_date': datetime(...)  # Publication date if found
}
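
Since the result holds the full HTML and a datetime object, it is not directly JSON-serialisable and can be bulky to log. The sketch below shows one possible way to produce a compact summary. summarize_result is a hypothetical helper, and the sample dictionary is stand-in data shaped like the structure above rather than real scraper output.

```python
import json
from datetime import datetime


def summarize_result(result: dict, preview_chars: int = 200) -> str:
    """Serialise the lightweight fields of a scrape result as JSON.

    Hypothetical helper; the 'html' field is deliberately left out.
    """
    published = result.get("published_date")
    summary = {
        "title": result.get("title"),
        "url": result.get("url"),
        "image": result.get("image"),
        "media": result.get("media"),
        # datetime is not JSON-serialisable, so convert it to ISO 8601 text.
        "published_date": published.isoformat() if isinstance(published, datetime) else None,
        "text_preview": (result.get("text") or "")[:preview_chars],
    }
    return json.dumps(summary, ensure_ascii=False, indent=2)


# Stand-in data shaped like the documented structure (not real output).
sample = {
    "title": "Page Title",
    "url": "https://example.com",
    "image": None,
    "html": "<html>...</html>",
    "text": "Clean text content",
    "media": "example",
    "published_date": datetime(2024, 1, 2, 3, 4, 5),
}
print(summarize_result(sample))
```

Using .get() for every field keeps the helper tolerant of missing keys, for example when no publication date is found.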


