Extract accounts' identifiers and metadata from personal pages on various platforms.
socid_extractor
Turn any public profile page into a structured account record — usernames, display names, bios, avatars, locations, joined-at dates, follower counts, external links, and the stable internal identifiers that uniquely pin an account across renames, redesigns, and deletions.
socid_extractor parses HTML pages and API responses from 130+ platforms and returns a flat, machine-readable dictionary of account fields. No API keys required, no headless browser — just a single function call on response text.
Why it's useful
- Stable cross-service IDs. Get GAIA ID (Google), Facebook UID, Yandex Public ID, Instagram pk, and dozens more — values that survive username changes and let you correlate accounts across leaks, archives, and search-engine indices.
- One uniform interface. Same extract() call for Instagram, GitHub, VK, Reddit, Substack, Bluesky, TikTok, with no per-platform glue code on your side.
- Field ontology. Normalized field names across platforms (username, fullname, created_at, is_verified, …) so downstream pipelines don't need 130 mappings.
- Battle-tested. Powers Maigret and a number of other OSINT tools.
Installation
Python: 3.10+.
pip install socid-extractor
For a clean CLI install on a workstation:
pipx install socid-extractor
The latest development version:
pip install -U git+https://github.com/soxoj/socid-extractor.git
Quick start
As a CLI:
$ socid_extractor --url https://www.deviantart.com/muse1908
country: France
created_at: 2005-06-16 18:17:41
gender: female
username: Muse1908
website: www.patreon.com/musemercier
links: ['https://www.facebook.com/musemercier', 'https://www.instagram.com/muse.mercier/', 'https://www.patreon.com/musemercier']
tagline: Nothing worth having is easy...
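When the CLI is called from a script, its output can be folded back into a dictionary. The sketch below assumes the one-field-per-line `key: value` layout shown in the sample output above; `parse_cli_output` is a hypothetical helper, not part of socid_extractor.

```python
def parse_cli_output(text: str) -> dict[str, str]:
    """Parse 'key: value' lines (layout assumed from the sample above) into a dict."""
    record: dict[str, str] = {}
    for line in text.splitlines():
        # Split on the first ': ' only, so timestamps and URLs in the value stay intact.
        key, sep, value = line.partition(": ")
        if sep and key.strip():
            record[key.strip()] = value.strip()
    return record


sample = """country: France
created_at: 2005-06-16 18:17:41
username: Muse1908
website: www.patreon.com/musemercier"""

print(parse_cli_output(sample)["created_at"])  # 2005-06-16 18:17:41
```

Lines without a `: ` separator, such as the shell prompt line, are skipped. Every value stays a string, including list-like fields such as links.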
As a Python library:
import requests
import socid_extractor
r = requests.get('https://www.patreon.com/annetlovart')
print(socid_extractor.extract(r.text))
# {'patreon_id': '33913189', 'patreon_username': 'annetlovart',
# 'fullname': 'Annet Lovart',
# 'links': "['https://www.facebook.com/322598031832479', ...]"}
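In the sample output above, links arrives as a string holding a Python list literal rather than as a real list. A minimal sketch of turning it back into a list, assuming that stringified form; `parse_links` is a hypothetical helper, not part of the library.

```python
import ast


def parse_links(value) -> list[str]:
    """Return links as a list, accepting a real list or its string representation."""
    if isinstance(value, list):
        return [str(v) for v in value]
    if not isinstance(value, str) or not value.strip():
        return []
    try:
        # literal_eval only evaluates Python literals, never arbitrary code.
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        return [value]  # not a list literal; treat it as a single link
    if isinstance(parsed, (list, tuple)):
        return [str(v) for v in parsed]
    return [str(parsed)]


info = {"patreon_username": "annetlovart",
        "links": "['https://www.facebook.com/322598031832479']"}
print(parse_links(info["links"]))  # ['https://www.facebook.com/322598031832479']
```

Accepting both forms keeps the helper usable if a scheme returns links as an actual list.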
Tip — batch runs: pass --skip-fetch-if-no-url-hint to skip the HTTP request when the URL doesn't match any known site hint (faster, but may skip generic engines such as forum templates):
$ socid_extractor --url https://example.com/foo --skip-fetch-if-no-url-hint
Supported sites
130+ schemes — see METHODS.md for the full list.
A non-exhaustive sample:
- Major networks: Facebook (user & group pages), Instagram, VK.com, OK.ru, Reddit, TikTok, Bluesky, Tumblr, Flickr
- Google ecosystem: Google docs/maps contributions (cookies required), Google Play, YouTube
- Mail.ru: my.mail.ru user mainpage, photo, video
- Dev / writing platforms: GitHub, Stack Overflow (HTML + API), LeetCode, Hashnode, Medium, Substack, Paragraph, WordPress.org, Virgool
- Forums (universal detectors): Discourse, MediaWiki / Fandom wikis, Mastodon
- Niche / vertical: Chess.com, Roblox, MyAnimeList, Scratch, Wikipedia, DailyMotion, SlideShare, Weebly, Calendly, Amazon Author, Boosty, Warpcast (Farcaster), Fragment (TON/Telegram), Rarible, CSSBattle, lnk.bio, Spatial, TwitchTracker, Max (max.ru)
…and many others.
For data examples, see tests/test_e2e.py; for the parsing logic, see socid_extractor/schemes.py; for the field ontology, see FIELDS.md.
Use cases
- Pivot from a profile to everything you can see. One call returns the visible info plus the hidden internal IDs the platform uses behind the scenes. Background reading: Week in OSINT — Getting a grasp on Google IDs.
- Track accounts across renames, redesigns, and deletions. Stable IDs (GAIA, FB UID, Yandex Public ID, Instagram pk, …) let you re-identify the same person even when every visible field has changed. Background: Aware Online — User IDs in social-media investigations.
- Search by cross-service UID. Once you have a stable identifier you can pivot into:
- SQL / leaked databases (forum dumps, breach data) where the UID is the join key,
- Google / Yandex / archive.org indices that captured URLs containing the UID.
- Feed downstream OSINT tooling. A normalized record is much easier to ingest than per-site scrapers — used by Maigret and similar tools for enrichment.
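The rename-tracking and join-key use cases above come down to comparing stable ID fields instead of visible ones. A rough sketch, assuming two records already returned by extract() at different times. The `_id` suffix convention and the `stable_ids` / `same_account` helpers are illustrative assumptions (FIELDS.md has the real field names), and the sample values are made up.

```python
def stable_ids(record: dict) -> dict:
    """Pick out fields that look like internal identifiers.

    Assumption: ID fields end in '_id' or are named 'uid'; adjust to FIELDS.md.
    """
    return {k: str(v) for k, v in record.items()
            if (k.endswith("_id") or k == "uid") and v not in (None, "")}


def same_account(a: dict, b: dict) -> bool:
    """True if the two records share at least one ID field with an equal value."""
    ids_a, ids_b = stable_ids(a), stable_ids(b)
    return any(ids_a[k] == ids_b[k] for k in ids_a.keys() & ids_b.keys())


# Made-up snapshots of one profile before and after a rename, plus an unrelated one.
before = {"patreon_id": "1000001", "patreon_username": "old_name"}
after = {"patreon_id": "1000001", "patreon_username": "new_name"}
other = {"patreon_id": "2000002", "patreon_username": "old_name"}

print(same_account(before, after))  # True
print(same_account(before, other))  # False
```

The same ID values can serve as the join key when matching records against a database dump or an archived URL list.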
Commercial Use
The open-source socid_extractor is MIT-licensed and free for commercial use without restriction — but page parsers break over time as platforms change their HTML and APIs, and they need active maintenance.
For serious commercial use — with a maintained private plugin pack of extra parsers or a hosted extraction API — reach out: 📧 socid@soxoj.com
- Private parser plugin: 100+ additional checks on top of the public 130+ sites, kept up to date as platforms change (separate from the public open-source database)
- Extraction API: integrate socid_extractor into your product
SOWEL classification
Maps to the following SOWEL techniques:
Tools using socid_extractor
- Maigret — powerful namechecker that generates a report with all available info from accounts found across 3000+ sites.
- TheScrapper — scrape emails, phone numbers, and social-media accounts from a website.
- InfoHunter — open-source OSINT tool to search, collect, and analyze information online.
- YaSeeker — gather all available information about a Yandex account by login/email.
- Marple — scrape search-engine results for a given username.
Testing
Install the test extras from pyproject.toml, then run pytest:
pip install '.[test]' # pytest, pytest-rerunfailures, pytest-xdist
python3 -m pytest tests/test_e2e.py -n 10 -k 'not cookies' -m 'not github_failed and not rate_limited'
Use pip install '.[dev]' instead if you also want flake8 / mypy / black (the full set used by CI).
Every new scheme must have an e2e test in tests/test_e2e.py hitting a real URL/API. Unit tests with inline fixtures (tests/test_socid_improvements.py) are also required but do not replace e2e coverage. See docs/testing-and-ci.md for details.
Developer documentation (architecture, modules, CI) lives in docs/.
Contributing
See the contributing guide if you want to add a new scheme or fix anything.