Scrape VK URLs to fetch info and media - Python API or command-line tool.

Project description

vk-url-scraper

Python library to scrape data, especially media links like videos and photos, from vk.com URLs.


You can use it via the command line or as a Python library; check the documentation.

Installation

You can install the most recent release from PyPI via pip install vk-url-scraper.

Currently you need to manually install one dependency (as it is installed from GitHub and not PyPI): pip install git+https://github.com/python273/vk_api.git@b99dac0ec2f832a6c4b20bde49869e7229ce4742

To use the library you will need a valid username/password combination for vk.com.

Command line usage

# run this to learn more about the parameters
vk_url_scraper --help

# scrape a URL and get the JSON result in the console
vk_url_scraper --username "username here" --password "password here" --urls https://vk.com/wall12345_6789
# OR
vk_url_scraper -u "username here" -p "password here" --urls https://vk.com/wall12345_6789
# you can also have multiple urls
vk_url_scraper -u "username here" -p "password here" --urls https://vk.com/wall12345_6789 https://vk.com/photo-12345_6789 https://vk.com/video12345_6789

# you can also pass a token to avoid authenticating every time
# and possibly getting captcha prompts
# you can fetch the token from the generated vk_config.v2.json file by searching for "access_token"
# (see the sketch after these examples for extracting it programmatically)
vk_url_scraper -u "username" -p "password" -t "vktoken goes here" --urls https://vk.com/wall12345_6789

# save the JSON output into a file
vk_url_scraper -u "username here" -p "password here" --urls https://vk.com/wall12345_6789 > output.json

# download any photos or videos found in these URLs
# this will use or create an output/ folder and dump the files there
vk_url_scraper -u "username here" -p "password here" --download --urls https://vk.com/wall12345_6789
# or
vk_url_scraper -u "username here" -p "password here" -d --urls https://vk.com/wall12345_6789
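
If you prefer to extract the token programmatically rather than eyeballing the file, a minimal sketch like the one below can work. It assumes vk_config.v2.json is in the current directory and simply searches the parsed JSON for the first "access_token" key, since the exact nesting of that file is an implementation detail of vk_api.

import json

def find_access_token(obj):
    # recursively search a parsed JSON structure for an "access_token" key
    if isinstance(obj, dict):
        if "access_token" in obj:
            return obj["access_token"]
        children = obj.values()
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        token = find_access_token(child)
        if token is not None:
            return token
    return None

with open("vk_config.v2.json") as f:
    token = find_access_token(json.load(f))
print(token)  # pass this as the -t argument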

Python library usage

from vk_url_scraper import VkScraper

vks = VkScraper("username", "password")

# scrape any "photo" URL
res = vks.scrape("https://vk.com/photo1_278184324?rev=1")

# scrape any "wall" URL
res = vks.scrape("https://vk.com/wall-1_398461")

# scrape any "video" URL
res = vks.scrape("https://vk.com/video-6596301_145810025")
print(res[0]["text"])  # e.g. prints the text of the scraped post
# every scrape* function returns a list of dicts shaped like:
{
	"id": "wall_id",
	"text": "text in this post",
	"datetime": "UTC datetime of the post",
	"attachments": {
		# present only if the post has photos, videos, or links
		"photo": ["list of URLs with max quality"],
		"video": ["list of URLs with max quality"],
		"link": ["list of URLs with max quality"],
	},
	"payload": "original JSON response converted to a dict, which you can parse for more data",
}
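
The --download flag already saves media for you on the command line; if you are using the library and want to save attachments yourself, a minimal sketch using the requests package (an extra dependency assumed here, not something this library requires) could look like the following, reusing res from the example above.

import os
import requests

os.makedirs("output", exist_ok=True)
photos = res[0].get("attachments", {}).get("photo", [])
for i, url in enumerate(photos):
    r = requests.get(url, timeout=30)
    r.raise_for_status()
    # the .jpg extension is a guess; inspect the URL or Content-Type header for the real one
    with open(os.path.join("output", f"photo_{i}.jpg"), "wb") as f:
        f.write(r.content)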

See the [docs] for all available functions.

TODO

  • scrape album links
  • scrape profile links
  • publish the Sphinx docs online

Development

(more info in CONTRIBUTING.md).

  1. set up the dev environment with pip install -r dev-requirements.txt or pipenv install -r dev-requirements.txt
  2. set up the environment with pip install -r requirements.txt or pipenv install -r requirements.txt
  3. to run all checks, use make run-checks (fixes style), or run them individually:
    1. to fix style: black . and isort ., then flake8 . to validate lint
    2. to do type checking: mypy .
    3. to test: pytest . (use pytest -v --color=yes --doctest-modules tests/ vk_url_scraper/ for verbose output, colors, and testing docstring examples)
  4. make docs to generate Sphinx docs; edit config.py if needed

To test the command-line interface available in main.py, you need to pass the -m option to python, like so: python -m vk_url_scraper -u "" -p "" --urls ...

Releasing new version

  1. edit version.py with the new version (see the sketch after this list)
  2. make sure to run pipenv run pip freeze > requirements.txt if you manage libs with pipenv
    1. if the hardcoded version of vk_api is still being used, you must comment out or remove that line from the generated requirements file and instruct users to install that version from source manually, since PyPI does not allow repo/commit tags; additionally, add the latest released version, currently vk-api==11.9.9
  3. run ./scripts/release.sh to create a tag and push; alternatively:
    1. git tag vx.y.z to tag the version
    2. git push origin vx.y.z, which will trigger the workflow and publish the project to PyPI
  4. go to https://readthedocs.org/ to deploy the new docs version (if the webhook is not set up)
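
For step 1, assuming version.py follows the common single-string convention (an assumption, not confirmed by this README), the bump is a one-line change:

# version.py -- bump according to semantic versioning
__version__ = "x.y.z"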

Fixing a failed release

If for some reason the GitHub Actions release workflow fails with an error that needs fixing, you'll have to delete both the tag and the corresponding release from GitHub. After you've pushed a fix, delete the tag from your local clone with

git tag -l | xargs git tag -d && git fetch -t
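
To delete the tag on GitHub as well (assuming the remote is named origin):

git push --delete origin vx.y.z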

Then repeat the steps above.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vk_url_scraper-0.3.30.tar.gz (15.1 kB)


Built Distribution

vk_url_scraper-0.3.30-py3-none-any.whl (12.0 kB)


File details

Details for the file vk_url_scraper-0.3.30.tar.gz.

File metadata

  • Download URL: vk_url_scraper-0.3.30.tar.gz
  • Upload date:
  • Size: 15.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.14

File hashes

Hashes for vk_url_scraper-0.3.30.tar.gz:

  • SHA256: 6616c8fbe6ea6f8cbe4605898a89d1173a579ed6a9da5410dba80269d708fcb1
  • MD5: 51f0a6b490e0fec0e61af5910c27b0db
  • BLAKE2b-256: b94c7deed94de31c3d75edcf963eb3ad1cc5e7872faab28138aa2469e018fe66


File details

Details for the file vk_url_scraper-0.3.30-py3-none-any.whl.


File hashes

Hashes for vk_url_scraper-0.3.30-py3-none-any.whl:

  • SHA256: 2e83e690844bb9b04772fae56bed2d9654780ca23132155e63de4ed9bde70c23
  • MD5: eb7bf03366d2a8cdf87848ea60adba48
  • BLAKE2b-256: 1126855770b2fa445ae09620eccc71e7f8415da184f878bd1232a0d02742210a

