TechJobs Colombia
Professional web scraper for tech job listings in Colombia.
Description
A scraping tool that extracts and analyzes tech job listings in Colombia from multiple job portals:
- Indeed
- elempleo.com
- computrabajo.com
- mitrabajo.co
Features
- Multi-portal tech job extraction
- Scoring and classification system for job relevance
- Outsourcing company filtering (BairesDev, Turing, Crossover, etc.)
- Job deduplication
- Dynamic proxy support
- Anti-detection measures (Cloudflare bypass, User-Agent rotation)
- Professional logging
- CSV export
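To illustrate how the outsourcing filter and job deduplication might fit together, here is a minimal sketch. The function names (`normalize`, `filter_and_dedupe`), the dictionary fields (`title`, `company`), and the exact-match rules are assumptions made for illustration; the project's actual implementation may differ.

```python
import re

# Hypothetical blacklist; the company names come from the feature list above.
BLACKLIST = {"bairesdev", "turing", "crossover"}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical strings compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def filter_and_dedupe(jobs: list[dict]) -> list[dict]:
    """Drop blacklisted companies, then keep the first job per (title, company)."""
    seen: set[tuple[str, str]] = set()
    result = []
    for job in jobs:
        company = normalize(job["company"])
        if company in BLACKLIST:
            continue  # outsourcing company, excluded
        key = (normalize(job["title"]), company)
        if key in seen:
            continue  # same listing already collected, possibly from another portal
        seen.add(key)
        result.append(job)
    return result
```

Keying on the normalized title and company is only one possible choice; a real deduplication step might also compare locations or listing URLs.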
Requirements
- Python 3.11+
- uv (package manager)
Installation
# Clone the repository
git clone https://github.com/CristianMz21/JobsColombia.git
cd JobsColombia
# Install dependencies with uv
uv sync
# Optional: Install dev dependencies
uv sync --dev
Usage
# Run the scraper
python main.py
The script extracts job listings and saves them to a timestamped CSV file.
Configuration
Configuration is located in src/config.py:
- Search terms: Keywords for job search
- Scoring weights: Weights for technologies, modality, experience
- Anti-detection settings: Delay between requests, timeouts, retries
- Company blacklist: Outsourcing companies to exclude
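The configuration categories above could be grouped roughly as in the sketch below. Every name and value here (`ScraperConfig`, `tech_weights`, `remote_bonus`, the default numbers, and the `score_job` helper) is hypothetical and chosen only to show the idea; the real contents of src/config.py are not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ScraperConfig:
    """Hypothetical grouping of the settings listed above."""

    # Search terms
    search_terms: tuple[str, ...] = ("python", "backend")
    # Scoring weights (technology keyword -> points)
    tech_weights: dict[str, int] = field(
        default_factory=lambda: {"python": 10, "django": 5}
    )
    remote_bonus: int = 5
    # Anti-detection settings
    request_delay_seconds: float = 2.0
    timeout_seconds: float = 30.0
    max_retries: int = 3
    # Company blacklist (lowercase names)
    company_blacklist: frozenset[str] = frozenset({"bairesdev", "turing", "crossover"})


def score_job(title: str, description: str, is_remote: bool, config: ScraperConfig) -> int:
    """Sum the weights of technologies mentioned in the text, plus a remote bonus."""
    text = f"{title} {description}".lower()
    score = sum(weight for tech, weight in config.tech_weights.items() if tech in text)
    if is_remote:
        score += config.remote_bonus
    return score
```

For example, `score_job("Python Developer", "Django REST APIs", True, ScraperConfig())` adds both technology weights and the remote bonus. Plain substring matching is a simplification: a keyword such as "java" would also match "javascript", so a production scorer would likely match on word boundaries.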
Project Structure
JobsColombia/
├── main.py                  # Entry point
├── src/
│   ├── __init__.py
│   ├── config.py            # Centralized configuration
│   ├── logger.py            # Logging setup
│   ├── scoring.py           # Scoring system
│   ├── scraping.py          # Scraping functions
│   ├── utils.py             # Utilities
│   ├── utils_proxies.py     # Proxy management
│   └── scrapers/            # Portal spiders
│       ├── base.py
│       ├── computrabajo.py
│       └── elempleo.py
├── tests/                   # Unit tests
├── pyproject.toml           # Project configuration
├── ruff.toml                # Linting configuration
└── .github/
    └── workflows/           # GitHub Actions
        ├── tests.yml
        └── lint.yml
Testing
# Run all tests
pytest
# Run with coverage
pytest --cov=src --cov-report=term-missing
Linting
# Check code
ruff check .
# Format code
ruff format .
# Auto-fix issues
ruff check --fix .
Tech Stack
- Python 3.11+ - Main language
- Scrapling - Web scraping framework
- Pandas - Data manipulation
- JobSpy - LinkedIn/Indeed scraping
- Requests - HTTP client
- Ruff - Linting and formatting
- Pytest - Testing framework
Disclaimer
This project is for educational purposes only. Make sure to comply with the Terms of Service of the job portals before using this scraper.
License
MIT License - see the LICENSE file for details.
Acknowledgments
This project was built using the following open-source libraries:
- JobSpy - Multi-platform job posting aggregator for scraping LinkedIn and Indeed
- Scrapling - Undetectable web scraping framework with Cloudflare bypass support
- Pandas - Data analysis and manipulation tool
- Ruff - Fast Python linter and formatter
- Pytest - Testing framework
Contributions
Contributions are welcome. Please open an issue or pull request to suggest changes or improvements.