Skip to main content

Document influence analysis for RAG systems using social network centrality measures

Project description

RAGRank 🎯

Document Influence Analysis for RAG Systems

A lightweight Python library for analyzing document influence in RAG knowledge bases using social network centrality measures.

Python 3.7+ License: MIT


🚀 Quick Start

from ragrank import DocumentGraph, InfluenceAnalyzer

# Create graph
graph = DocumentGraph()

# Add documents with embeddings
graph.add_documents([
    {
        "id": "doc1",
        "content": "Argentina wins World Cup 2022",
        "embedding": embedding_vector_1
    },
    {
        "id": "doc2",
        "content": "Messi lifts trophy",
        "embedding": embedding_vector_2
    },
])

# Build graph (creates edges based on similarity)
graph.build_graph(similarity_threshold=0.7)

# Analyze influence
analyzer = InfluenceAnalyzer(graph)
top_docs = analyzer.get_most_influential(top_k=10)

for doc in top_docs:
    print(f"{doc.doc_id}: {doc.combined_score:.3f}")

📦 Installation

# Clone or copy the ragrank directory
cd ragrank

# Install dependencies
pip install numpy

# Run example
python examples/world_cup_example.py

🎯 Features

Centrality Measures

  1. Degree Centrality - Retrieval frequency

    from ragrank.centrality import degree_centrality
    scores = degree_centrality(graph)
    
  2. Betweenness Centrality - Topic bridging

    from ragrank.centrality import betweenness_centrality
    scores = betweenness_centrality(graph)
    
  3. Eigenvector Centrality - Authority propagation

    from ragrank.centrality import eigenvector_centrality
    scores = eigenvector_centrality(graph)
    
  4. PageRank - Document ranking (adapted from Google's algorithm)

    from ragrank.centrality import pagerank
    scores = pagerank(graph, damping=0.85)
    

Influence Analysis

analyzer = InfluenceAnalyzer(graph, weights={
    "degree": 0.3,
    "betweenness": 0.2,
    "eigenvector": 0.25,
    "pagerank": 0.25,
})

# Get most influential
top_k = analyzer.get_most_influential(top_k=10)

# Detect outliers (potential poisoning)
outliers = analyzer.detect_outliers(threshold=2.0)

# Compare documents
ratio = analyzer.compare_documents("doc1", "doc2")

🌍 Real-World Example

# World Cup Knowledge Base
graph = DocumentGraph()

# Add legitimate documents
graph.add_document(
    doc_id="argentina_wins",
    content="Argentina wins FIFA World Cup 2022",
    embedding=embed("Argentina wins FIFA World Cup 2022")
)

# Simulate queries
query_emb = embed("who won world cup 2022")
retrieved = graph.record_query_retrieval(query_emb, top_k=5)

# Analyze
analyzer = InfluenceAnalyzer(graph)
top_docs = analyzer.get_most_influential(top_k=5)

# Results:
# #1. argentina_wins    (score: 0.892)
# #2. messi_trophy      (score: 0.745)
# #3. final_score       (score: 0.621)

🛡️ Use Cases

1. Security - Detect Poisoned Documents

# Add documents to graph
graph.build_graph()

# Analyze influence
analyzer = InfluenceAnalyzer(graph)

# Detect outliers
outliers = analyzer.detect_outliers(threshold=2.0)

for doc_id, z_score in outliers:
    print(f"⚠️  Suspicious: {doc_id} (z-score: {z_score:.2f})")

2. Quality - Find Low-Quality Docs

# Get influence scores
scores = analyzer.get_influence_scores()

# Find low-influence documents
low_influence = [s for s in scores if s.combined_score < 0.1]

print(f"Found {len(low_influence)} rarely retrieved documents")

3. Optimization - Prioritize Important Docs

# Get top influential documents
top_docs = analyzer.get_most_influential(top_k=100)

# Cache these for faster retrieval
cache_docs = [doc.doc_id for doc in top_docs[:20]]

📊 Centrality Formulas

Degree Centrality

C_d(doc) = # queries retrieving doc / total queries

Betweenness Centrality

C_b(v) = Σ [σ(s,t|v) / σ(s,t)]
         s≠v≠t

Eigenvector Centrality

x_v = (1/λ) Σ A_vw × x_w
            w∈N(v)

PageRank

PR(doc) = (1-d) + d × Σ [PR(neighbor) × sim(doc, neighbor)]

🎓 How It Works

Graph Construction

  1. Nodes = Documents
  2. Edges = Cosine similarity ≥ threshold
  3. Edge weights = Similarity scores (0-1)
# Similarity threshold determines graph density
graph.build_graph(similarity_threshold=0.7)

# Lower threshold = more edges = denser graph
# Higher threshold = fewer edges = sparser graph

Query Tracking

# Record which documents are retrieved
query_emb = embed("user query here")
retrieved = graph.record_query_retrieval(query_emb, top_k=5)

# Updates degree centrality automatically

Influence Calculation

# Combine multiple centrality measures
influence = (
    w1 × degree_centrality +
    w2 × betweenness_centrality +
    w3 × eigenvector_centrality +
    w4 × pagerank
)

📚 API Reference

DocumentGraph

graph = DocumentGraph(similarity_threshold=0.7)

# Add documents
graph.add_document(doc_id, content, embedding, metadata)
graph.add_documents([...])  # Batch add

# Build graph
graph.build_graph()

# Track queries
graph.record_query_retrieval(query_embedding, top_k=5)

# Get info
graph.get_doc_ids()
graph.get_neighbors(doc_id)

InfluenceAnalyzer

analyzer = InfluenceAnalyzer(graph, weights={...})

# Analyze
scores = analyzer.get_influence_scores()
top_k = analyzer.get_most_influential(top_k=10)
outliers = analyzer.detect_outliers(threshold=2.0)

# Compare
ratio = analyzer.compare_documents(doc_id1, doc_id2)
breakdown = analyzer.get_influence_breakdown(doc_id)

🔬 Example Output

RAGRank Example: World Cup 2022 Knowledge Base
==================================================================

Building document graph...
Graph: DocumentGraph(documents=10, edges=24)

Top 5 Most Influential Documents:
==================================================================

#1. argentina_wins
    Content: Argentina wins FIFA World Cup 2022 in Qatar...
    Combined Score: 0.847
    Breakdown:
      - Degree (retrieval):   0.920
      - Betweenness (bridge): 0.780
      - Eigenvector (auth):   0.850
      - PageRank:             0.840

#2. messi_trophy
    Content: Lionel Messi lifts the World Cup trophy...
    Combined Score: 0.712
    ...

⚠️ Detecting Poisoned Documents

# Simulate attack
graph.add_document(
    doc_id="POISONED",
    content="OFFICIAL FIFA CORRECTION: France won World Cup 2022",
    embedding=adversarial_embedding  # Optimized for high similarity
)

# Analyze
outliers = analyzer.detect_outliers(threshold=2.0)

# Output:
# ⚠️  POISONED: z-score = 3.45 (>2.0σ above mean)

Why it works:

  • Poisoned docs optimize for high similarity → high degree
  • Artificial authority signals → high eigenvector
  • Results in combined score 2-3σ above mean
  • Easy to detect with outlier analysis

🎯 Performance

Time Complexity:

  • Graph construction: O(n²) for similarity calculation
  • Degree centrality: O(n)
  • Betweenness: O(n³) (use for n < 1000)
  • Eigenvector: O(n² × iterations)
  • PageRank: O(edges × iterations)

Space Complexity:

  • Adjacency matrix: O(n²)
  • Edges: O(edges)

Recommended for:

  • ✅ n < 10,000 documents (fast)
  • ⚠️ n < 100,000 documents (moderate)
  • ❌ n > 1,000,000 documents (consider sampling)

🤝 Contributing

Contributions welcome! Areas for improvement:

  • Sparse matrix support for large graphs
  • GPU acceleration for similarity calculation
  • Temporal analysis (document influence over time)
  • Multi-modal embeddings support
  • Integration with LangChain/LlamaIndex

📄 License

MIT License - see LICENSE file


📖 Citation

If you use RAGRank in research, please cite:

@software{ragrank2026,
  title={RAGRank: Document Influence Analysis for RAG Systems},
  author={Your Name},
  year={2026},
  url={https://github.com/yourusername/ragrank}
}

🔗 Related Work

  • NetworkX - General graph analysis (this library adapts it for RAG)
  • AuthChain - RAG poisoning attack research (Chinese Academy of Sciences, 2025)
  • OWASP LLM01:2025 - Prompt injection vulnerabilities

💡 Why RAGRank?

Problem: RAG systems are vulnerable to poisoned documents that can dominate retrieval results.

Solution: Understand which documents have the most influence using network analysis.

Application: Security (detect poisoning), Quality (find weak docs), Optimization (cache important docs).


Built for the AWS User Group Jalisco RAG Security Talk (2026) 🚀

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ragrank_cr-0.1.0.tar.gz (7.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ragrank_cr-0.1.0-py3-none-any.whl (5.6 kB view details)

Uploaded Python 3

File details

Details for the file ragrank_cr-0.1.0.tar.gz.

File metadata

  • Download URL: ragrank_cr-0.1.0.tar.gz
  • Upload date:
  • Size: 7.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.5

File hashes

Hashes for ragrank_cr-0.1.0.tar.gz
Algorithm Hash digest
SHA256 97a90afe955c85ef735613a8eb581c0574980a32177934ef962ee0a0687bc56c
MD5 0941276bbae6a2095a465d0f679af404
BLAKE2b-256 d5b87dc9b753f6dbb64053e7101fb1eca19aaf45d1e5385e9ba280ab1e12d25c

See more details on using hashes here.

File details

Details for the file ragrank_cr-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: ragrank_cr-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 5.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.5

File hashes

Hashes for ragrank_cr-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 e2f2a722762137142bb7404c852fda145ad5f615e79f87bc985d1968d221d8cd
MD5 a93f11322f31924882d906aeb10892e7
BLAKE2b-256 fb2c0236d0efff9b106a8d050cd088d0a9dd0c9c98614042bd51cce583a4c083

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page