OhMyScrapper - v0.3.0
This project aims to build a scraper that reads text containing links and produces a final PDF with general information about job openings.
Scope
- Read texts;
- Extract links;
- Use Open Graph meta tags (og:*) to extract information;
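The scope items above can be sketched in a few lines of standard-library Python. This is only an illustration of the idea, not OhMyScrapper's actual code: the names `extract_links`, `OpenGraphParser`, and `extract_og_tags` are hypothetical, and the URL pattern is a deliberately simple assumption.

```python
import re
from html.parser import HTMLParser

# Assumption: a simple pattern is enough to spot http(s) links in free text.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_links(text):
    """Return the URLs found in free text, in order, without duplicates."""
    links = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(".,;:!?")  # drop trailing sentence punctuation
        if url not in links:
            links.append(url)
    return links


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        content = attributes.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first value seen for each og: property.
            self.tags.setdefault(prop, content)


def extract_og_tags(html):
    """Return a dict of Open Graph properties found in an HTML document."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.tags
```

Calling `extract_links` on a text gives the links to visit, and `extract_og_tags` on each fetched page gives fields such as `og:title` or `og:description` when the page provides them.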
Installation
You can install it directly with pip:
pip install ohmyscrapper
I recommend using uv, so you can just run the commands below and everything is installed:
uv add ohmyscrapper
uv run ohmyscrapper --version
You can also run it as a standalone tool, for example:
uvx ohmyscrapper --version
How to use and test (development only)
OhMyScrapper works in 3 stages:
- It collects URLs from a text and loads them into a database;
- It scrapes/accesses the collected URLs and reads what is relevant. If it finds new URLs, they are collected as well;
- It exports a list of URLs to CSV files;
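To make the three stages concrete, here is a minimal sketch using `sqlite3` and `csv` from the standard library. It is not the project's implementation: the table layout, the function names, and the `fetch` callable (a stand-in for a real HTTP request) are all assumptions made for illustration.

```python
import csv
import io
import re
import sqlite3

# Assumption: a simple pattern is enough to spot http(s) links in free text.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def load_urls(conn, text):
    """Stage 1: store every URL found in the text (duplicates are ignored)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS urls "
        "(url TEXT PRIMARY KEY, title TEXT, scraped INTEGER DEFAULT 0)"
    )
    for url in URL_PATTERN.findall(text):
        conn.execute(
            "INSERT OR IGNORE INTO urls (url) VALUES (?)",
            (url.rstrip(".,;:!?"),),
        )


def scrape_pending(conn, fetch):
    """Stage 2: visit unscraped URLs and queue any new URLs a page mentions."""
    pending = [row[0] for row in conn.execute("SELECT url FROM urls WHERE scraped = 0")]
    for url in pending:
        title, page_text = fetch(url)  # hypothetical: returns (title, text)
        conn.execute("UPDATE urls SET title = ?, scraped = 1 WHERE url = ?", (title, url))
        load_urls(conn, page_text)
    return len(pending)


def export_csv(conn, out):
    """Stage 3: write the collected URLs as CSV."""
    writer = csv.writer(out)
    writer.writerow(["url", "title"])
    writer.writerows(conn.execute("SELECT url, title FROM urls ORDER BY url"))


def fake_fetch(url):
    """Stand-in for a real request so the sketch runs offline."""
    pages = {"https://example.com/a": ("Page A", "see https://example.com/b")}
    return pages.get(url, ("Untitled", ""))


conn = sqlite3.connect(":memory:")
load_urls(conn, "Apply here: https://example.com/a.")
while scrape_pending(conn, fake_fetch):  # repeat until nothing is pending
    pass
out = io.StringIO()
export_csv(conn, out)
print(out.getvalue())
```

The loop around `scrape_pending` mirrors the second stage: pages may mention new URLs, which are stored and then visited on the next pass.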
You can run all 3 stages with a single command:
ohmyscrapper start
Remember to add your text file to the folder /input with a name that ends with .txt!
You will find the exported files in the folder /output like this:
- /output/report.csv
- /output/report.csv-preview.html
- /output/urls-simplified.csv
- /output/urls-simplified.csv-preview.html
- /output/urls.csv
- /output/urls.csv-preview.html
If you prefer to go step by step, here is how:
First, load the text file you want to search for URLs. It works with any .txt file.
The default folder is /input. Put one or more text files (with names ending in .txt)
in this folder and use the command load:
ohmyscrapper load
or, if you have another file in a different folder, just use the argument -file like this:
ohmyscrapper load -file=my-text-file.txt
That will create a database if it doesn't exist and store every URL that OhMyScrapper
finds. After that, scrape the URLs with the command scrap-urls:
ohmyscrapper scrap-urls --recursive --ignore-type
That will scrape only the LinkedIn URLs we are interested in. For now they are:
- linkedin_post: https://%.linkedin.com/posts/%
- linkedin_redirect: https://lnkd.in/%
- linkedin_job: https://%.linkedin.com/jobs/view/%
- linkedin_feed: https://%.linkedin.com/feed/%
- linkedin_company: https://%.linkedin.com/company/%
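The patterns above can be read as SQL LIKE-style patterns in which `%` matches any run of characters. That reading is an assumption, and the sketch below only illustrates how such patterns could classify a URL. `like_to_regex` and `classify_url` are hypothetical helpers, not part of OhMyScrapper.

```python
import re

# URL types copied from the list above; "%" is assumed to be a wildcard.
URL_TYPES = {
    "linkedin_post": "https://%.linkedin.com/posts/%",
    "linkedin_redirect": "https://lnkd.in/%",
    "linkedin_job": "https://%.linkedin.com/jobs/view/%",
    "linkedin_feed": "https://%.linkedin.com/feed/%",
    "linkedin_company": "https://%.linkedin.com/company/%",
}


def like_to_regex(pattern):
    """Turn a LIKE-style pattern into an anchored regular expression."""
    parts = [re.escape(part) for part in pattern.split("%")]
    return re.compile("^" + ".*".join(parts) + "$")


COMPILED = {name: like_to_regex(pattern) for name, pattern in URL_TYPES.items()}


def classify_url(url):
    """Return the first matching type name, or None for any other URL."""
    for name, regex in COMPILED.items():
        if regex.match(url):
            return name
    return None
```

For example, `classify_url("https://www.linkedin.com/jobs/view/123456")` yields `linkedin_job`, while a URL on another domain yields `None`, which corresponds to the case that needs `--ignore-type`.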
Any other URL can be scraped generically by adding the argument --ignore-type:
ohmyscrapper scrap-urls --ignore-type
And we can make it run recursively by adding the argument --recursive:
ohmyscrapper scrap-urls --recursive
!!! important: we do not know yet whether sites will block us for making too many requests
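One common way to lower the risk of being blocked is to space requests out. The sketch below is a generic illustration and not a feature of OhMyScrapper: the `Throttle` class and its `min_interval` parameter are hypothetical, and a suitable interval depends on the site being scraped.

```python
import time


class Throttle:
    """Enforce a minimum interval (in seconds) between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable so the class can be tested offline
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` before each request keeps the request rate below one per `min_interval` seconds.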
And we can finally export with the command:
ohmyscrapper export
ohmyscrapper export --file=output/urls-simplified.csv --simplify
ohmyscrapper report
That's the basic usage! You can learn more from the help:
ohmyscrapper --help
License
This package is distributed under the MIT license.
Download files
File details
Details for the file ohmyscrapper-0.3.0.tar.gz.
File metadata
- Download URL: ohmyscrapper-0.3.0.tar.gz
- Upload date:
- Size: 11.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.17
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b9827cd3dd2900c7a32de00dafafc3ecb50de07c6e9df6c21c1ccbe2a4e4c319 |
| MD5 | 4d98701c66f93ba8e945764ca14759cf |
| BLAKE2b-256 | 7518bc3672416e7be0a2bbdc915c59615cab32a76e83e7959505e0a6f4e44d97 |
File details
Details for the file ohmyscrapper-0.3.0-py3-none-any.whl.
File metadata
- Download URL: ohmyscrapper-0.3.0-py3-none-any.whl
- Upload date:
- Size: 16.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.17
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b02b5a99b70f06353b29ebd04e1b9331dcf4d554eb8c7c63f4e962e12060cbb2 |
| MD5 | 36ee19a9d3a7f8612544b0922225d479 |
| BLAKE2b-256 | b017d61532a702823699570743a33e66b894c36191767393a36b45f8b0b2a4f2 |