Extract accounts' identifiers and metadata from personal pages on various platforms.

socid_extractor

Turn any public profile page into a structured account record — usernames, display names, bios, avatars, locations, joined-at dates, follower counts, external links, and the stable internal identifiers that uniquely pin an account across renames, redesigns, and deletions.

socid_extractor parses HTML pages and API responses from 130+ platforms and returns a flat, machine-readable dictionary of account fields. No API keys required, no headless browser — just a single function call on response text.

Why it's useful

  • Stable cross-service IDs. Get GAIA ID (Google), Facebook UID, Yandex Public ID, Instagram pk, and dozens more — values that survive username changes and let you correlate accounts across leaks, archives, and search-engine indices.
  • One uniform interface. Same extract() call for Instagram, GitHub, VK, Reddit, Substack, Bluesky, TikTok — no per-platform glue code on your side.
  • Field ontology. Normalized field names across platforms (username, fullname, created_at, is_verified, …) so downstream pipelines don't need 130 mappings.
  • Battle-tested. Powers Maigret and a number of other OSINT tools.
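
As an illustration of what normalized field names buy a downstream pipeline, here is a minimal sketch that splits a flat record into common fields and identifier-like fields. The `summarize` helper, the `COMMON_FIELDS` tuple, and the key-suffix rule for spotting identifiers are assumptions made for this example; they are not part of socid_extractor, and FIELDS.md remains the reference for actual field names.

```python
from typing import Any, Dict

# Field names mentioned in this README; the authoritative list is in FIELDS.md.
COMMON_FIELDS = ("username", "fullname", "created_at", "is_verified")


def summarize(record: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Split a flat account record into common fields and ID-like fields."""
    common = {key: record[key] for key in COMMON_FIELDS if key in record}
    # Assumption for this sketch: identifier keys are named "id" / "uid"
    # or end in "_id" / "_uid". Real records may use other names.
    ids = {
        key: value
        for key, value in record.items()
        if key in ("id", "uid") or key.endswith(("_id", "_uid"))
    }
    return {"common": common, "ids": ids}


# Placeholder record shaped like the library's flat output.
sample = {
    "patreon_id": "1000001",
    "patreon_username": "example_user",
    "fullname": "Example User",
}
print(summarize(sample))
```

The same helper works on a record from any platform as long as the record uses the shared field names, which is the point of the ontology.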

Installation

Requires Python 3.10+.

pip install socid-extractor

For a clean CLI install on a workstation:

pipx install socid-extractor

The latest development version:

pip install -U git+https://github.com/soxoj/socid-extractor.git

Quick start

As a CLI:

$ socid_extractor --url https://www.deviantart.com/muse1908
country: France
created_at: 2005-06-16 18:17:41
gender: female
username: Muse1908
website: www.patreon.com/musemercier
links: ['https://www.facebook.com/musemercier', 'https://www.instagram.com/muse.mercier/', 'https://www.patreon.com/musemercier']
tagline: Nothing worth having is easy...

As a Python library:

import requests
import socid_extractor

r = requests.get('https://www.patreon.com/annetlovart')
print(socid_extractor.extract(r.text))
# {'patreon_id': '33913189', 'patreon_username': 'annetlovart',
#  'fullname': 'Annet Lovart',
#  'links': "['https://www.facebook.com/322598031832479', ...]"}
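
In the sample output above, `links` is a string holding the repr of a list rather than a list object. If your pipeline needs a real list, a small decoding step can help. The sketch below assumes the value is a Python-literal list string as in that sample; `parse_list_field` is a hypothetical helper, not a library function.

```python
import ast
from typing import Any, Dict, List


def parse_list_field(record: Dict[str, Any], key: str = "links") -> List[str]:
    """Return a list field as a list of strings, decoding a stringified list."""
    value = record.get(key)
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return [str(item) for item in value]
    try:
        # literal_eval only evaluates Python literals, so it is safe on
        # untrusted text (unlike eval).
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        # Not a literal (for example a bare URL): treat it as one item.
        return [str(value)]
    if isinstance(parsed, (list, tuple)):
        return [str(item) for item in parsed]
    return [str(parsed)]


record = {"links": "['https://a.example/1', 'https://b.example/2']"}
for link in parse_list_field(record):
    print(link)
```

Falling back to a single-item list keeps the helper usable on fields that hold one plain value.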

Tip — batch runs: pass --skip-fetch-if-no-url-hint to skip the HTTP request when the URL doesn't match any known site hint (faster, but may skip generic engines such as forum templates):

$ socid_extractor --url https://example.com/foo --skip-fetch-if-no-url-hint
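
The trade-off behind this flag can be sketched as a pre-filter over a batch of URLs. This is not the library's internal logic: the `URL_HINTS` patterns and the `split_batch` helper below are hypothetical and exist only to show why skipping unmatched URLs saves requests but can miss pages that are recognizable only from their content.

```python
import re
from typing import Iterable, List, Sequence, Tuple

# Hypothetical URL hints for illustration; the real hints live in the library.
URL_HINTS: Sequence["re.Pattern[str]"] = (
    re.compile(r"^https?://(www\.)?deviantart\.com/"),
    re.compile(r"^https?://(www\.)?patreon\.com/"),
)


def matches_hint(url: str, hints: Sequence["re.Pattern[str]"] = URL_HINTS) -> bool:
    """Return True if the URL matches at least one known-site pattern."""
    return any(hint.search(url) for hint in hints)


def split_batch(urls: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Partition URLs into those worth fetching and those skipped unfetched."""
    to_fetch: List[str] = []
    skipped: List[str] = []
    for url in urls:
        (to_fetch if matches_hint(url) else skipped).append(url)
    return to_fetch, skipped


to_fetch, skipped = split_batch([
    "https://www.deviantart.com/muse1908",
    "https://example.com/foo",
])
print("fetch:", to_fetch)
print("skip:", skipped)
```

A forum running on a generic engine under its own domain would land in `skipped` here, which is the case the tip warns about.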

Supported sites

130+ schemes — see METHODS.md for the full list.

A non-exhaustive sample:

  • Major networks: Facebook (user & group pages), Instagram, VK.com, OK.ru, Reddit, TikTok, Bluesky, Tumblr, Flickr
  • Google ecosystem: Google Docs / Maps contributions (cookies required), Google Play, YouTube
  • Mail.ru: my.mail.ru user main page, photo, video
  • Dev / writing platforms: GitHub, Stack Overflow (HTML + API), LeetCode, Hashnode, Medium, Substack, Paragraph, WordPress.org, Virgool
  • Forums (universal detectors): Discourse, MediaWiki / Fandom wikis, Mastodon
  • Niche / vertical: Chess.com, Roblox, MyAnimeList, Scratch, Wikipedia, DailyMotion, SlideShare, Weebly, Calendly, Amazon Author, Boosty, Warpcast (Farcaster), Fragment (TON/Telegram), Rarible, CSSBattle, lnk.bio, Spatial, TwitchTracker, Max (max.ru)

…and many others.

For data examples, see tests/test_e2e.py; for the parsing logic, see socid_extractor/schemes.py; for the field ontology, see FIELDS.md.

Use cases

  • Pivot from a profile to everything you can see. One call returns the visible info plus the hidden internal IDs the platform uses behind the scenes. Background reading: Week in OSINT — Getting a grasp on Google IDs.
  • Track accounts across renames, redesigns, and deletions. Stable IDs (GAIA, FB UID, Yandex Public ID, Instagram pk, …) let you re-identify the same person even when every visible field has changed. Background: Aware Online — User IDs in social-media investigations.
  • Search by cross-service UID. Once you have a stable identifier you can pivot into:
    • SQL / leaked databases (forum dumps, breach data) where the UID is the join key,
    • Google / Yandex / archive.org indices that captured URLs containing the UID.
  • Feed downstream OSINT tooling. A normalized record is much easier to ingest than per-site scrapers — used by Maigret and similar tools for enrichment.
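
To make the rename-tracking use case concrete, here is a minimal sketch that compares two snapshots of the same profile taken at different times. The `same_account` and `changed_fields` helpers and the placeholder records are assumptions for this example; which key holds the stable identifier depends on the platform.

```python
from typing import Dict, Optional, Tuple

Record = Dict[str, str]


def same_account(old: Record, new: Record, id_key: str) -> bool:
    """Two snapshots describe one account if their stable ID matches."""
    return id_key in old and old.get(id_key) == new.get(id_key)


def changed_fields(old: Record, new: Record) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each differing field to its (old, new) pair of values."""
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(set(old) | set(new))
        if old.get(key) != new.get(key)
    }


# Placeholder snapshots: the username changed, the internal ID did not.
before = {"patreon_id": "1000001", "patreon_username": "old_name", "fullname": "Example User"}
after = {"patreon_id": "1000001", "patreon_username": "new_name", "fullname": "Example User"}

if same_account(before, after, "patreon_id"):
    print(changed_fields(before, after))
```

Keying on the internal ID instead of the username is what lets the comparison survive a rename.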

Commercial Use

The open-source socid_extractor is MIT-licensed and free for commercial use without restriction — but page parsers break over time as platforms change their HTML and APIs, and they need active maintenance.

For serious commercial use — with a maintained private plugin pack of extra parsers or a hosted extraction API — reach out: 📧 socid@soxoj.com

  • Private parser plugin: 100+ additional checks on top of the 130+ public sites, kept up to date as platforms change (separate from the public open-source database)
  • Extraction API — integrate socid_extractor into your product

SOWEL classification

Maps to the following SOWEL techniques:

Tools using socid_extractor

  • Maigret — powerful namechecker that generates a report with all available info from accounts found across 3000+ sites.
  • TheScrapper — scrape emails, phone numbers, and social-media accounts from a website.
  • InfoHunter — open-source OSINT tool to search, collect, and analyze information online.
  • YaSeeker — gather all available information about a Yandex account by login/email.
  • Marple — scrape search-engine results for a given username.

Testing

Install the test extras from pyproject.toml, then run pytest:

pip install '.[test]'   # pytest, pytest-rerunfailures, pytest-xdist
python3 -m pytest tests/test_e2e.py -n 10 -k 'not cookies' -m 'not github_failed and not rate_limited'

Use pip install '.[dev]' instead if you also want flake8 / mypy / black (the full set used by CI).

Every new scheme must have an e2e test in tests/test_e2e.py hitting a real URL/API. Unit tests with inline fixtures (tests/test_socid_improvements.py) are also required but do not replace e2e coverage. See docs/testing-and-ci.md for details.

Developer documentation (architecture, modules, CI) lives in docs/.

Contributing

See the contributing guide if you want to add a new scheme or fix anything.
