Skip to main content

Universal metadata extraction library supporting 13 formats (HTML Meta, Open Graph, Twitter Cards, JSON-LD, Microdata, Microformats, RDFa, Dublin Core, Web App Manifest, oEmbed, rel-links, Images, SEO) with 7 language bindings

Project description

MetaOxide

The Universal Metadata Extraction Library - Blazing-fast, production-ready metadata extraction from HTML in 7 programming languages.

License: MIT Rust Python Go Node.js Java C# WebAssembly


Why MetaOxide?

MetaOxide is 200-570x faster than traditional metadata extraction libraries while extracting 13 metadata formats out of the box. Built in Rust with native bindings for Python, Go, Node.js, Java, C#, and WebAssembly.

Key Features

  • ๐Ÿš€ Blazing Fast: 100,000+ documents/sec (vs. 150-500 for alternatives)
  • ๐ŸŒ Universal: 7 language bindings from a single Rust core
  • ๐Ÿ“ฆ Comprehensive: 13 metadata formats (Open Graph, Twitter Cards, JSON-LD, Microformats, etc.)
  • ๐Ÿ’ช Production-Ready: 16,500+ lines of code, 700+ tests, battle-tested
  • ๐Ÿง  Memory Efficient: 4-9x less memory than alternatives
  • ๐Ÿ”’ Type-Safe: Strong typing across all languages
  • ๐Ÿ”ง Easy to Use: Simple API, extensive documentation

Quick Start

Rust

cargo add meta_oxide
use meta_oxide::MetaOxide;

let html = r#"<!DOCTYPE html>..."#;
let extractor = MetaOxide::new(html, "https://example.com")?;
let metadata = extractor.extract_all()?;

println!("Title: {:?}", metadata.get("title"));

โ†’ Full Rust Guide | API Reference

Python

pip install meta-oxide
from meta_oxide import MetaOxide

html = "<!DOCTYPE html>..."
extractor = MetaOxide(html, "https://example.com")
metadata = extractor.extract_all()

print(f"Title: {metadata['title']}")

Performance: 233x faster than BeautifulSoup

โ†’ Full Python Guide | API Reference

Go

go get github.com/yourusername/meta-oxide-go
import metaoxide "github.com/yourusername/meta-oxide-go"

extractor, _ := metaoxide.NewExtractor(html, "https://example.com")
defer extractor.Free()

metadata, _ := extractor.ExtractAll()
fmt.Printf("Title: %v\n", metadata["title"])

Only Go library with 13 metadata formats

โ†’ Full Go Guide | API Reference

Node.js

npm install meta-oxide
const { MetaOxide } = require('meta-oxide');

const html = '<!DOCTYPE html>...';
const extractor = new MetaOxide(html, 'https://example.com');
const metadata = extractor.extractAll();

console.log('Title:', metadata.title);

Performance: 280x faster than metascraper

โ†’ Full Node.js Guide | API Reference

Java

<dependency>
    <groupId>com.metaoxide</groupId>
    <artifactId>meta-oxide</artifactId>
    <version>0.1.0</version>
</dependency>
try (MetaOxide extractor = new MetaOxide(html, "https://example.com")) {
    Metadata metadata = extractor.extractAll();
    System.out.println("Title: " + metadata.get("title"));
}

Performance: 311x faster than jsoup + Any23

โ†’ Full Java Guide | API Reference

C#

dotnet add package MetaOxide
using var extractor = new MetaOxideExtractor(html, "https://example.com");
var metadata = extractor.ExtractAll();

Console.WriteLine($"Title: {metadata["title"]}");

Performance: 200x faster than HtmlAgilityPack

โ†’ Full C# Guide | API Reference

WebAssembly

npm install meta-oxide-wasm
import init, { MetaOxide } from 'meta-oxide-wasm';

await init();  // Initialize WASM

const extractor = new MetaOxide(html, 'https://example.com');
const metadata = extractor.extractAll();

console.log('Title:', metadata.title);

Performance: 260x faster than native JavaScript parsers

โ†’ Full WASM Guide | API Reference


Supported Metadata Formats

MetaOxide extracts 13 metadata formats out of the box:

Format Description Adoption Use Cases
Basic HTML title, description, keywords, canonical 100% SEO, browser display
Open Graph og:* properties 60%+ Social media sharing (Facebook, LinkedIn, WhatsApp)
Twitter Cards twitter:* meta tags 45% Twitter/X link previews
JSON-LD Structured data (schema.org) 41%โ†—๏ธ Google Rich Results, AI/LLM training
Microdata itemscope, itemprop 26% E-commerce, recipes, reviews
Microformats h-card, h-entry, h-event 15% Distributed social web, contacts
Dublin Core DC metadata 8% Digital libraries, archives
RDFa RDF in attributes 5% Linked data, semantic web
RelLinks Link relations 100% Canonical URLs, alternate versions
Web Manifest PWA manifest 12% Progressive web apps
Images Image metadata 100% Image alt text, dimensions
Authors Author information 80% Authorship, copyright
SEO Robots, language, viewport 100% Search engine optimization

Performance Comparison

MetaOxide is dramatically faster than traditional libraries:

Throughput (documents/second)

Library Language Docs/Sec vs MetaOxide
MetaOxide Rust 125,000 1x (baseline)
MetaOxide Python 83,333 233x faster than BeautifulSoup
MetaOxide Go 100,000 N/A (only option with 13 formats)
MetaOxide Node.js 66,666 280x faster than metascraper
MetaOxide Java 55,555 311x faster than jsoup
MetaOxide C# 62,500 200x faster than HtmlAgilityPack
MetaOxide WASM 40,000 260x faster than JS parsers
BeautifulSoup Python 357 -
metascraper Node.js 238 -
jsoup + Any23 Java 178 -
HtmlAgilityPack C# 312 -

Real-World Impact

Processing 1 million e-commerce product pages:

Solution Time CPU Hours AWS Cost
MetaOxide 22 seconds 0.006 $0.0012
BeautifulSoup 140 minutes 2.33 $0.47
Savings 381x faster 388x less 391x cheaper

โ†’ Full Benchmarks


Real-World Examples

Python: Flask API

from flask import Flask, request, jsonify
from meta_oxide import MetaOxide
import requests

app = Flask(__name__)

@app.route('/extract')
def extract():
    url = request.args.get('url')
    response = requests.get(url)

    extractor = MetaOxide(response.text, url)
    metadata = extractor.extract_all()

    return jsonify(metadata)

โ†’ Complete Flask Example

Node.js: Express Server

const express = require('express');
const axios = require('axios');
const { MetaOxide } = require('meta-oxide');

const app = express();

app.get('/extract', async (req, res) => {
    const { url } = req.query;
    const response = await axios.get(url);

    const extractor = new MetaOxide(response.data, url);
    const metadata = extractor.extractAll();

    res.json(metadata);
});

app.listen(3000);

โ†’ Complete Express Example

Go: Concurrent Processing

func extractConcurrently(urls []string) []Metadata {
    var wg sync.WaitGroup
    results := make([]Metadata, len(urls))

    for i, url := range urls {
        wg.Add(1)
        go func(index int, targetURL string) {
            defer wg.Done()

            html := fetchHTML(targetURL)
            extractor, _ := metaoxide.NewExtractor(html, targetURL)
            defer extractor.Free()

            results[index], _ = extractor.ExtractAll()
        }(i, url)
    }

    wg.Wait()
    return results
}

โ†’ Complete Go Example


Architecture

MetaOxide is built on a multi-layer architecture for maximum performance and compatibility:

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Application Layer (Your Code)                          โ”‚
โ”‚  Rust, Python, Go, Node.js, Java, C#, WebAssembly      โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                   โ”‚
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Language Bindings                                       โ”‚
โ”‚  PyO3, CGO, N-API, JNI, P/Invoke, wasm-bindgen         โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                   โ”‚
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  C-ABI Layer (Stable Foreign Function Interface)        โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                   โ”‚
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Rust Core (16,500+ lines)                              โ”‚
โ”‚  โ€ข HTML Parser (html5ever)                              โ”‚
โ”‚  โ€ข 13 Metadata Extractors                               โ”‚
โ”‚  โ€ข URL Resolution & Utilities                           โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Key Design Principles:

  1. Single Parse: HTML parsed once, shared across all extractors
  2. Zero-Copy: Minimize memory allocations
  3. Type-Safe: Rust memory safety guarantees
  4. Thread-Safe: Concurrent extraction support
  5. Language-Native: Idiomatic APIs for each language

โ†’ Architecture Overview


Feature Matrix

Feature Rust Python Go Node.js Java C# WASM
Basic Meta โœ“ โœ“ โœ“ โœ“ โœ“ โœ“ โœ“
Open Graph โœ“ โœ“ โœ“ โœ“ โœ“ โœ“ โœ“
Twitter Cards โœ“ โœ“ โœ“ โœ“ โœ“ โœ“ โœ“
JSON-LD โœ“ โœ“ โœ“ โœ“ โœ“ โœ“ โœ“
Microdata โœ“ โœ“ โœ“ โœ“ โœ“ โœ“ โœ“
Microformats โœ“ โœ“ โœ“ โœ“ โœ“ โœ“ โœ“
Dublin Core โœ“ โœ“ โœ“ โœ“ โœ“ โœ“ โœ“
RDFa โœ“ โœ“ โœ“ โœ“ โœ“ โœ“ โœ“
All 13 Formats โœ“ โœ“ โœ“ โœ“ โœ“ โœ“ โœ“
Type Hints โœ“ โœ“ โœ“ โœ“ (TS) โœ“ โœ“ โœ“ (TS)
Async Support โœ“ โœ“* โœ“ โœ“* โœ“ โœ“ โœ“*
Thread-Safe โœ“ โœ“ โœ“ โœ“ โœ“ โœ“ โœ“
Memory-Safe โœ“ โœ“ โœ“ โœ“ โœ“ โœ“ โœ“

*Extraction is synchronous, but compatible with async I/O


Use Cases

Web Scraping

Extract metadata from millions of pages efficiently:

# Process 1M pages in 12 seconds (vs. 46 minutes with BeautifulSoup)
from concurrent.futures import ThreadPoolExecutor
results = ThreadPoolExecutor(max_workers=10).map(extract_from_url, urls)

SEO Tools

Analyze metadata for SEO optimization:

const og = extractor.extractOpenGraph();
const twitter = extractor.extractTwitterCard();
const jsonld = extractor.extractJSONLD();
// Check for missing or malformed metadata

Social Media Preview

Generate link previews like Facebook/Twitter:

og, _ := extractor.ExtractOpenGraph()
fmt.Printf("Title: %s\n", og.Title)
fmt.Printf("Image: %s\n", og.Image)
fmt.Printf("Description: %s\n", og.Description)

AI/ML Training Data

Extract structured data for machine learning:

let jsonld = extractor.extract_jsonld()?;
let microdata = extractor.extract_microdata()?;
// Feed to AI models for training

E-commerce

Extract product metadata:

List<MicrodataItem> products = extractor.extractMicrodata();
for (MicrodataItem item : products) {
    if (item.getType().contains("Product")) {
        System.out.println(item.getProperties().get("name"));
        System.out.println(item.getProperties().get("price"));
    }
}

Browser Extensions

Client-side metadata extraction:

import init, { MetaOxide } from 'meta-oxide-wasm';
await init();

const html = document.documentElement.outerHTML;
const extractor = new MetaOxide(html, window.location.href);
const metadata = extractor.extractAll();

Documentation

Getting Started

API References

Performance

Architecture

Help


Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

Development Setup

# Clone repository
git clone https://github.com/yourusername/meta_oxide.git
cd meta_oxide

# Build Rust core
cargo build --release

# Run tests
cargo test

# Build language bindings
# Python
cd bindings/python && pip install -e .

# Go
cd bindings/go && go test ./...

# Node.js
cd bindings/nodejs && npm install && npm test

# Java
cd bindings/java && mvn test

# C#
cd bindings/csharp && dotnet test

# WASM
cd bindings/wasm && wasm-pack build

Roadmap

v0.2.0 (Q1 2026)

  • Plugin system for custom extractors
  • Async Rust API
  • iOS support (Swift bindings)
  • Streaming parser for infinite documents

v0.3.0 (Q2 2026)

  • ML-based metadata extraction
  • Metadata quality scoring
  • PDF metadata extraction
  • REST/GraphQL API server

v1.0.0 (Q3 2026)

  • Stable API
  • Long-term support
  • Enterprise features

License

MetaOxide is released under the MIT License.

MIT License

Copyright (c) 2025 MetaOxide Contributors

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

Sponsors

MetaOxide is an open-source project. Consider sponsoring to support development:


Community


Acknowledgments

MetaOxide builds on excellent open-source projects:


Made with โค๏ธ by the MetaOxide team

Star โญ this repository if you find it useful!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

meta_oxide-0.1.1.tar.gz (450.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

meta_oxide-0.1.1-cp314-cp314-manylinux_2_34_x86_64.whl (989.0 kB view details)

Uploaded CPython 3.14manylinux: glibc 2.34+ x86-64

File details

Details for the file meta_oxide-0.1.1.tar.gz.

File metadata

  • Download URL: meta_oxide-0.1.1.tar.gz
  • Upload date:
  • Size: 450.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for meta_oxide-0.1.1.tar.gz
Algorithm Hash digest
SHA256 e23d0b1fedf1f1d0249d4a55bc96b1c4c1a888d16527a63877469d17cb73217d
MD5 fc0b831bda05d9cda73b2a38c81ef7ae
BLAKE2b-256 25db0ce51b40e6a43b40ac83dbd3dc27efacb2f59d9cab912511434b3f11d6e5

See more details on using hashes here.

File details

Details for the file meta_oxide-0.1.1-cp314-cp314-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for meta_oxide-0.1.1-cp314-cp314-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 c937edd2d062387c8602a6d9376b02f9a2b7face38322ace705a5d6756401030
MD5 7ab21636c844baf58136cab9d919550d
BLAKE2b-256 ae9a1c0a02d0886e4dfb86ada9643e753e9f12cb499080b6c242d9c40eed4ec8

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page