OhMyScrapper - v0.2.3

This project aims to build a scraper that reads text containing links and produces a final PDF with general information about job openings.

Scope

  • Read texts;
  • Extract links;
  • Use og: meta tags (Open Graph) to extract information;
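
To make the last item concrete, here is a minimal sketch of reading og: meta tags from a page using only the Python standard library. It is not taken from the OhMyScrapper source: the names OpenGraphParser and extract_og_tags and the sample HTML are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        content = attributes.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first value seen for each og: property.
            self.tags.setdefault(prop, content)


def extract_og_tags(html):
    """Return a dict mapping og: property names to their content values."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.tags


# Hypothetical page head, for illustration only.
sample = """
<html><head>
<meta property="og:title" content="Data Engineer at Example Corp">
<meta property="og:description" content="Remote position, full time.">
<meta name="viewport" content="width=device-width">
</head></html>
"""
print(extract_og_tags(sample))
```

Only meta elements whose property starts with og: are kept, so unrelated tags such as viewport are ignored.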

Installation

You can install it directly with pip:

pip install ohmyscrapper

I recommend using uv, so you can just run the commands below and everything is installed:

uv add ohmyscrapper
uv run ohmyscrapper --version

You can also run it as a tool with uvx, for example:

uvx ohmyscrapper --version

How to use and test (development only)

OhMyScrapper works in 3 stages:

  1. It collects URLs from a text file (by default input/_chat.txt) and loads them into a database;
  2. It scrapes the collected URLs and reads what is relevant. If it finds new URLs, they are collected as well;
  3. It exports a list of URLs to CSV files;
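
As an illustration of stage 1, the sketch below finds URLs in a chat-like text and stores them in a SQLite database. This is an assumption about how such a step could look, not the actual OhMyScrapper code: the regular expression, the urls table, and the function names are all hypothetical.

```python
import re
import sqlite3

# Hypothetical pattern: http:// or https:// followed by non-space characters.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def collect_urls(text):
    """Return the URLs found in text, in order of appearance, without duplicates."""
    found = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(".,;:!?)")  # drop trailing sentence punctuation
        if url not in found:
            found.append(url)
    return found


def load_urls(connection, text):
    """Store new URLs in a hypothetical `urls` table and return how many were added."""
    connection.execute("CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY)")
    before = connection.execute("SELECT COUNT(*) FROM urls").fetchone()[0]
    connection.executemany(
        "INSERT OR IGNORE INTO urls (url) VALUES (?)",
        [(url,) for url in collect_urls(text)],
    )
    connection.commit()
    after = connection.execute("SELECT COUNT(*) FROM urls").fetchone()[0]
    return after - before


# Made-up chat lines, for illustration only.
chat = (
    "[01/02/25, 10:15:00] Ana: New opening https://lnkd.in/abc123, take a look!\n"
    "[01/02/25, 10:17:30] Bob: Also https://www.linkedin.com/jobs/view/42.\n"
)
connection = sqlite3.connect(":memory:")
print(load_urls(connection, chat))
```

Using the URL as primary key with INSERT OR IGNORE means that loading the same text twice does not create duplicate rows.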

You can run all 3 stages with a single command:

ohmyscrapper start

Remember to put your text file in the /input folder with the name _chat.txt!

You will find the exported files in the folder /output like this:

  • /output/report.csv
  • /output/report.csv-preview.html
  • /output/urls-simplified.csv
  • /output/urls-simplified.csv-preview.html
  • /output/urls.csv
  • /output/urls.csv-preview.html

If you prefer to go step by step, here is how:

First, load the text file you want to search for URLs. The idea is to use a WhatsApp chat history, but it works with any txt file.

The default file is input/_chat.txt. If you are using the default file, just run the load command:

ohmyscrapper load

Or, if you have another file, pass it with the argument -file like this:

ohmyscrapper load -file=my-text-file.txt

That will create a database if it doesn't exist and store every URL that OhMyScrapper finds. After that, scrape the URLs with the command scrap-urls:

ohmyscrapper scrap-urls --recursive --ignore-type

Without the argument --ignore-type, only the LinkedIn URL types we are interested in are scraped. For now they are:

  • linkedin_post: https://%.linkedin.com/posts/%
  • linkedin_redirect: https://lnkd.in/%
  • linkedin_job: https://%.linkedin.com/jobs/view/%
  • linkedin_feed: https://%.linkedin.com/feed/%
  • linkedin_company: https://%.linkedin.com/company/%
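
The % in the patterns above reads like the SQL LIKE wildcard, where % matches any run of characters. That is an assumption, since this README does not say how the matching is implemented. Under that assumption, a URL could be classified roughly like this (like_to_regex and classify_url are hypothetical helper names):

```python
import re

# The URL types listed above, copied as-is.
URL_TYPES = {
    "linkedin_post": "https://%.linkedin.com/posts/%",
    "linkedin_redirect": "https://lnkd.in/%",
    "linkedin_job": "https://%.linkedin.com/jobs/view/%",
    "linkedin_feed": "https://%.linkedin.com/feed/%",
    "linkedin_company": "https://%.linkedin.com/company/%",
}


def like_to_regex(pattern):
    """Translate a pattern where % matches any run of characters into a regex."""
    parts = [re.escape(part) for part in pattern.split("%")]
    return re.compile(".*".join(parts))


def classify_url(url):
    """Return the first type whose pattern matches the whole URL, or None."""
    for url_type, pattern in URL_TYPES.items():
        if like_to_regex(pattern).fullmatch(url):
            return url_type
    return None


print(classify_url("https://www.linkedin.com/jobs/view/123456"))
print(classify_url("https://example.com/careers"))
```

A URL that returns None here corresponds to the "every other one" case handled by --ignore-type.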

Every other URL can also be scraped generically with the argument --ignore-type:

ohmyscrapper scrap-urls --ignore-type

And we can make it follow newly found URLs recursively by adding the argument --recursive:

ohmyscrapper scrap-urls --recursive

!!! important: we do not know yet whether sending too many requests can get you blocked
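
To show what recursive scraping with some caution about request volume could look like, here is a minimal sketch. It is not how OhMyScrapper is implemented: scrape_recursively, the max_depth and delay_seconds parameters, and the fake link graph are hypothetical, and the page fetching is passed in as a function so the example runs without network access.

```python
import time
from collections import deque


def scrape_recursively(start_urls, fetch_links, max_depth=2, delay_seconds=1.0, sleep=time.sleep):
    """Visit each URL once, breadth-first, following new links up to max_depth."""
    start = list(dict.fromkeys(start_urls))  # drop duplicates, keep order
    seen = set(start)
    queue = deque((url, 0) for url in start)
    visited = []
    while queue:
        url, depth = queue.popleft()
        if visited:
            sleep(delay_seconds)  # pause between requests to avoid flooding a site
        links = fetch_links(url)
        visited.append(url)
        if depth < max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return visited


# Fake link graph standing in for real pages, for illustration only.
FAKE_LINKS = {
    "https://lnkd.in/a": ["https://example.com/job/1", "https://example.com/job/2"],
    "https://example.com/job/1": ["https://lnkd.in/a", "https://example.com/company"],
}
order = scrape_recursively(
    ["https://lnkd.in/a"],
    lambda url: FAKE_LINKS.get(url, []),
    delay_seconds=0,
)
print(order)
```

The depth limit keeps the crawl from growing without bound, and the pause is one simple way to reduce the risk of being blocked.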

Finally, we can export the results with the commands:

ohmyscrapper export
ohmyscrapper export --file=output/urls-simplified.csv --simplify
ohmyscrapper report

That's the basic usage! You can learn more from the help:

ohmyscrapper --help

See Also

License

This package is distributed under the MIT license.
