Python Oriented Crawling Ongoing (POCONG): a simple crawling framework
POCONG 🪦
Python Oriented Crawling ON Going
POCONG is a lightweight web crawling framework built in Python.
Features
- 🔒 Get Free Proxy: Automatic proxy fetching, validation, and rotation from free proxy sources
- 🌐 Dynamic Media Web Scraping: Extract content, metadata, and media information from web pages with proxy support
- 📱 Social Media Scraping: Extract data from social media platforms (coming soon)
- 🛒 E-commerce Scraping: Extract product information from e-commerce websites (coming soon)
Installation
pip install pocong
Usage: Get Proxy from proxy_spiders
Use the get_proxy and get_proxy_random methods of the GetProxy class in pocong.proxy_spiders to fetch working proxies.
from pocong.proxy_spiders import GetProxy
gp = GetProxy()
# Get the first working proxy
proxy = gp.get_proxy()
print("First working proxy:", proxy)
from pocong.proxy_spiders import GetProxy
gp = GetProxy()
# Get a random working proxy
random_proxy = gp.get_proxy_random()
print("Random working proxy:", random_proxy)
Sample output:
First working proxy: {'ip': '123.45.67.89', 'port': '8080', 'https': 'yes', ...}
Random working proxy: {'ip': '98.76.54.32', 'port': '3128', 'https': 'yes', ...}
You can use the returned proxy dictionary with the requests library, for example:
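import requests
from pocong.proxy_spiders import GetProxy

gp = GetProxy()
proxy = gp.get_proxy()
if proxy:
    # Route both HTTP and HTTPS traffic through the same proxy endpoint
    proxies = {
        'http': f"http://{proxy['ip']}:{proxy['port']}",
        'https': f"http://{proxy['ip']}:{proxy['port']}"
    }
    # A timeout keeps a dead proxy from hanging the request indefinitely
    response = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=10)
    print(response.json())
else:
    print("No working proxy found.")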
- get_proxy() returns the first working proxy found.
- get_proxy_random() returns a random working proxy (with up to 20 retries).
- Both methods return a dictionary with proxy details (e.g., { 'ip': '...', 'port': '...', ... }) or None if no working proxy is found.
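To make the "up to 20 retries" behaviour concrete, here is a minimal sketch of a bounded random-pick loop. It is not POCONG's actual implementation: pick_random_working, the candidates list, and the is_working callback are all hypothetical names introduced for illustration.

```python
import random
from typing import Callable, Optional, Sequence


def pick_random_working(
    candidates: Sequence[dict],
    is_working: Callable[[dict], bool],
    max_retries: int = 20,
) -> Optional[dict]:
    """Return a randomly chosen candidate that passes is_working, or None.

    Hypothetical helper: it only illustrates a bounded retry loop and is
    not taken from the POCONG source.
    """
    if not candidates:
        return None
    for _ in range(max_retries):
        candidate = random.choice(candidates)
        # Stop at the first candidate that validates
        if is_working(candidate):
            return candidate
    # Every attempt failed, mirroring the documented None result
    return None
```

Because the pick is random with replacement, the same proxy may be tried more than once; the retry cap bounds the total work, not the number of distinct proxies checked.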
Usage: Dynamic Media Web Scraping
The DynamicScrapingNews class provides comprehensive web scraping capabilities with built-in proxy support for extracting content, metadata, and media information from web pages.
Basic Usage
from pocong.media_spiders import DynamicScrapingNews
# Simple scraping without proxy
scraper = DynamicScrapingNews("https://example.com", use_proxy=False)
result = scraper.scrape()
# Extract specific information
print(f"Title: {result['title']}")
print(f"URL: {result['url']}")
print(f"Media: {result['media']}")
print(f"Published: {result['published_date']}")
print(f"Text content: {result['text'][:200]}...") # First 200 chars
Proxy Configuration Options
1. Automatic Proxy (Default)
# Uses automatic proxy fetching
scraper = DynamicScrapingNews("https://example.com")
result = scraper.scrape()
2. Manual Proxy Configuration
# Method 1: IP:Port format
scraper = DynamicScrapingNews("https://example.com",
                              manual_proxy="192.168.1.1:8080")

# Method 2: Full URL format
scraper = DynamicScrapingNews("https://example.com",
                              manual_proxy="http://192.168.1.1:8080")

# Method 3: Dictionary format
scraper = DynamicScrapingNews("https://example.com",
                              manual_proxy={"ip": "192.168.1.1", "port": "8080"})
result = scraper.scrape()
3. No Proxy
# Disable proxy completely
scraper = DynamicScrapingNews("https://example.com", use_proxy=False)
result = scraper.scrape()
4. Manual Proxy Override
# Manual proxy overrides use_proxy setting
scraper = DynamicScrapingNews("https://example.com",
                              use_proxy=False,
                              manual_proxy="192.168.1.1:8080")
result = scraper.scrape()
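The three manual_proxy formats and the override rule described above can be sketched as two small helpers. This is an illustration under stated assumptions, not POCONG's internal code: normalize_manual_proxy and choose_proxy_mode are hypothetical names, and the requests-style output mapping is an assumption about how such a value might be consumed.

```python
from typing import Optional, Union
from urllib.parse import urlsplit


def normalize_manual_proxy(manual_proxy: Union[str, dict]) -> dict:
    """Turn an ip:port string, a proxy URL, or an ip/port dict into a
    requests-style proxies mapping. Hypothetical helper for illustration."""
    if isinstance(manual_proxy, dict):
        # Dictionary format: {"ip": ..., "port": ...}
        proxy_url = f"http://{manual_proxy['ip']}:{manual_proxy['port']}"
    elif "://" in manual_proxy:
        # Full URL format: require both a host and a port
        parts = urlsplit(manual_proxy)
        if not parts.hostname or parts.port is None:
            raise ValueError(f"Proxy URL needs host and port: {manual_proxy!r}")
        proxy_url = manual_proxy
    else:
        # IP:Port format: split on the last colon
        host, sep, port = manual_proxy.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"Expected 'ip:port', got: {manual_proxy!r}")
        proxy_url = f"http://{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


def choose_proxy_mode(use_proxy: bool = True,
                      manual_proxy: Optional[Union[str, dict]] = None) -> str:
    """Apply the documented precedence: a manual proxy wins over use_proxy."""
    if manual_proxy is not None:
        return "manual"
    return "automatic" if use_proxy else "none"
```

For example, choose_proxy_mode(use_proxy=False, manual_proxy="192.168.1.1:8080") yields "manual", matching the Manual Proxy Override case.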
Complete Example with Proxy Integration
from pocong.proxy_spiders import GetProxy
from pocong.media_spiders import DynamicScrapingNews

# Get a working proxy (None if no working proxy is found)
proxy = GetProxy().get_proxy()
print(f"Using proxy: {proxy}")

# Use automatic proxy (default behavior)
scraper = DynamicScrapingNews("https://example.com")
result = scraper.scrape()

if proxy:
    # Use manual proxy with ip:port format
    scraper = DynamicScrapingNews("https://example.com",
                                  manual_proxy=f"{proxy['ip']}:{proxy['port']}")
    result = scraper.scrape()

    # Use manual proxy with dictionary format
    scraper = DynamicScrapingNews("https://example.com",
                                  manual_proxy={"ip": proxy['ip'], "port": proxy['port']})
    result = scraper.scrape()
Extracted Data Structure
The scrape() method returns a dictionary containing:
{
    'title': 'Page Title',            # Extracted from og:title or title tag
    'url': 'https://example.com',     # Canonical URL
    'image': 'https://...',           # Featured image URL
    'html': '<html>...</html>',       # Full HTML content
    'text': 'Clean text content',     # Processed text without HTML
    'media': 'example',               # Domain name extracted from URL
    'published_date': datetime(...)   # Publication date if found
}
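Since the publication date is only present "if found", code that consumes the result may want to read fields defensively. The sketch below shows one way to do that; summarize_result is a hypothetical helper, and treating a missing date as None or absent is an assumption rather than documented behaviour.

```python
from datetime import datetime
from typing import Any, Mapping


def summarize_result(result: Mapping[str, Any], preview_chars: int = 200) -> str:
    """Build a short text summary of a scrape() result dictionary.

    Hypothetical helper: missing or empty fields fall back to placeholders
    instead of raising KeyError.
    """
    published = result.get("published_date")
    # Only format the date when one was actually extracted
    if isinstance(published, datetime):
        published_text = published.isoformat()
    else:
        published_text = "unknown"

    text = result.get("text") or ""
    preview = text[:preview_chars]
    if len(text) > preview_chars:
        preview += "..."

    return "\n".join([
        f"Title: {result.get('title', 'unknown')}",
        f"URL: {result.get('url', 'unknown')}",
        f"Media: {result.get('media', 'unknown')}",
        f"Published: {published_text}",
        f"Text content: {preview}",
    ])
```

Calling print(summarize_result(scraper.scrape())) would then give the same overview as the Basic Usage example without failing on a page that has no detectable date.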
Download files
File details
Details for the file pocong-1.1.0.tar.gz.
File metadata
- Download URL: pocong-1.1.0.tar.gz
- Upload date:
- Size: 32.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 00b01c0f5cd5bf49ea50677efab7d28e4736ae405821126c19564ba9e1548411 |
| MD5 | f791906201df697c48f3458b28376254 |
| BLAKE2b-256 | 94437f67fb38ad09438e33daf7e091df1d38859778d98ac98e11bede7f86fa65 |
File details
Details for the file pocong-1.1.0-py3-none-any.whl.
File metadata
- Download URL: pocong-1.1.0-py3-none-any.whl
- Upload date:
- Size: 9.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7cead32b6a86e1b6e26068e47bd1b866d8c87c42989a30b2e7f78a0a38443af1 |
| MD5 | 686e00041864f5b877d9c6dd535a45e4 |
| BLAKE2b-256 | 82eb6c0a840cd67dcc6a1e94e0c2dc6e0cbfa822c738b8389781a048a71fb5c2 |