TechJobs Colombia

Professional web scraper for tech job listings in Colombia.

Description

A scraping tool that extracts and analyzes tech job listings in Colombia from multiple job portals:

  • LinkedIn
  • Indeed
  • elempleo.com
  • computrabajo.com
  • mitrabajo.co

Features

  • Multi-portal tech job extraction
  • Scoring and classification system for job relevance
  • Outsourcing company filtering (BairesDev, Turing, Crossover, etc.)
  • Job deduplication
  • Dynamic proxy support
  • Anti-detection protection (Cloudflare bypass, User-Agent rotation)
  • Professional logging
  • CSV export
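
To illustrate how outsourcing filtering and deduplication could fit together, here is a minimal sketch. It is not the project's actual implementation: the `Job` dataclass, the `normalize` helper, the `filter_and_deduplicate` function, and the blacklist contents (taken from the three companies named above) are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical blacklist, seeded with the companies named in the feature list.
OUTSOURCING_BLACKLIST = {"bairesdev", "turing", "crossover"}


@dataclass(frozen=True)
class Job:
    """Hypothetical job record; the real project may use different fields."""
    title: str
    company: str
    url: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical strings compare equal."""
    return " ".join(text.lower().split())


def filter_and_deduplicate(jobs: list[Job]) -> list[Job]:
    """Drop blacklisted companies, then keep the first job per (title, company)."""
    seen: set[tuple[str, str]] = set()
    kept: list[Job] = []
    for job in jobs:
        company = normalize(job.company)
        if company in OUTSOURCING_BLACKLIST:
            continue  # excluded outsourcing company
        key = (normalize(job.title), company)
        if key in seen:
            continue  # same listing already collected, e.g. from another portal
        seen.add(key)
        kept.append(job)
    return kept


jobs = [
    Job("Backend Developer", "Acme SAS", "https://example.com/1"),
    Job("backend  developer", "ACME SAS", "https://example.com/2"),
    Job("Python Engineer", "BairesDev", "https://example.com/3"),
]
result = filter_and_deduplicate(jobs)
print([job.url for job in result])
```

Keying on normalized title and company rather than on the URL is a design choice for this sketch: the same listing usually has a different URL on each portal.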

Requirements

  • Python 3.11+
  • uv (package manager)

Installation

# Clone the repository
git clone https://github.com/CristianMz21/JobsColombia.git
cd JobsColombia

# Install dependencies with uv
uv sync

# Optional: Install dev dependencies
uv sync --dev

Usage

# Run the scraper
python main.py

The script extracts job listings and saves them to a timestamped CSV file.
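
As a rough illustration of what a timestamped CSV export can look like, the sketch below uses only the standard library. The file name pattern, the column names, and the helper names `timestamped_csv_path` and `export_jobs` are assumptions for the example; the project itself may name its output differently or export through Pandas.

```python
import csv
from datetime import datetime
from pathlib import Path


def timestamped_csv_path(directory: Path, prefix: str = "jobs") -> Path:
    """Build a path such as jobs_20250101_093000.csv inside the given directory."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return directory / f"{prefix}_{stamp}.csv"


def export_jobs(rows: list[dict[str, str]], path: Path) -> None:
    """Write job rows to a CSV file with a header line."""
    fieldnames = ["title", "company", "url"]  # hypothetical columns
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Calling `export_jobs(rows, timestamped_csv_path(Path(".")))` would write a new file in the current directory on each run, so earlier results are not overwritten.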

Configuration

Configuration is located in src/config.py:

  • Search terms: Keywords for job search
  • Scoring weights: Weights for technologies, modality, experience
  • Anti-detection settings: Delay between requests, timeouts, retries
  • Company blacklist: Outsourcing companies to exclude
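
The settings above could be grouped roughly as in the following sketch. It does not reproduce the real contents of src/config.py: every name and value here (`ScraperConfig`, `technology_weights`, `remote_bonus`, the delays, and so on) is a hypothetical placeholder. The scoring function is likewise a simplified assumption that covers only technology and modality weights, not experience.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ScraperConfig:
    """Hypothetical configuration layout; names and values are illustrative."""
    search_terms: tuple[str, ...] = ("python developer", "backend developer")
    technology_weights: dict[str, int] = field(
        default_factory=lambda: {"python": 3, "django": 2, "sql": 1}
    )
    remote_bonus: int = 2                # weight for the "remote" modality
    request_delay_seconds: float = 2.0   # pause between requests
    timeout_seconds: float = 30.0
    max_retries: int = 3
    company_blacklist: frozenset[str] = frozenset({"bairesdev", "turing", "crossover"})


def score_job(description: str, modality: str, config: ScraperConfig) -> int:
    """Sum the weights of technologies mentioned, plus a bonus for remote work."""
    text = description.lower()
    score = sum(
        weight
        for tech, weight in config.technology_weights.items()
        if tech in text  # naive substring match, kept simple for the sketch
    )
    if modality.lower() == "remote":
        score += config.remote_bonus
    return score


config = ScraperConfig()
print(score_job("Python and Django APIs", "Remote", config))  # prints 7
```

The substring match is deliberately naive and would need word-boundary handling in practice, since a short keyword can match inside an unrelated word.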

Project Structure

JobsColombia/
├── main.py                 # Entry point
├── src/
│   ├── __init__.py
│   ├── config.py           # Centralized configuration
│   ├── logger.py           # Logging setup
│   ├── scoring.py          # Scoring system
│   ├── scraping.py         # Scraping functions
│   ├── utils.py            # Utilities
│   ├── utils_proxies.py    # Proxy management
│   └── scrapers/           # Portal spiders
│       ├── base.py
│       ├── computrabajo.py
│       └── elempleo.py
├── tests/                  # Unit tests
├── pyproject.toml          # Project configuration
├── ruff.toml               # Linting configuration
└── .github/
    └── workflows/          # GitHub Actions
        ├── tests.yml
        └── lint.yml

Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=src --cov-report=term-missing

Linting

# Check code
ruff check .

# Format code
ruff format .

# Auto-fix issues
ruff check --fix .

Tech Stack

  • Python 3.11+ - Main language
  • Scrapling - Web scraping framework
  • Pandas - Data manipulation
  • JobSpy - LinkedIn/Indeed scraping
  • Requests - HTTP client
  • Ruff - Linting and formatting
  • Pytest - Testing framework

Disclaimer

This project is for educational purposes only. Make sure to comply with the Terms of Service of the job portals before using this scraper.

License

MIT License - see the LICENSE file for details.

Acknowledgments

This project was built using the following open source libraries:

  • JobSpy - Multi-platform job posting aggregator for scraping LinkedIn and Indeed
  • Scrapling - Undetectable web scraping framework with Cloudflare bypass support
  • Pandas - Data analysis and manipulation tool
  • Ruff - Fast Python linter and formatter
  • Pytest - Testing framework

Contributions

Contributions are welcome. Please open an issue or pull request to suggest changes or improvements.
