Implementations of concepts presented in the paper "On Bi-gram Graph Attributes"

BiGramGraph

A Python library for analyzing text corpora using graph theory. It transforms text into a directed graph in which nodes represent words and edges represent bigrams (pairs of adjacent words).
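To make the representation concrete, here is a minimal standard-library sketch of how bigram edges could be counted from a list of texts. It is not the library's implementation: the function name bigram_edges, the lowercasing, and the whitespace tokenization are all assumptions made for illustration.

```python
from collections import Counter


def bigram_edges(texts):
    """Count directed word-pair (bigram) edges across a list of texts."""
    edges = Counter()
    for text in texts:
        # Naive tokenization: lowercase and split on whitespace (assumption).
        words = text.lower().split()
        # Pair each word with its successor within the same text.
        edges.update(zip(words, words[1:]))
    return edges


texts = ["the quick brown fox", "the quick dog"]
edges = bigram_edges(texts)

# Nodes are the distinct words that appear in at least one edge.
nodes = {word for pair in edges for word in pair}

print(sorted(nodes))
print(edges[("the", "quick")])  # how often "quick" directly follows "the"
```

Each key of edges is a directed (source, target) pair, and its count can serve as the edge weight.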

Features

  • Graph-based text analysis - Convert any text corpus into a directed graph representation
  • Chromatic coloring - Automatic graph coloring for word categorization
  • Text vectorization - Transform text into numerical vectors using chromatic colors
  • Text generation - Generate synthetic text using random walks on the graph
  • Graph metrics - Compute cycles, paths, strongly connected components, and more
  • NLP enrichment - Add part-of-speech and named entity tags via spaCy
  • Visualization - Interactive HTML graph visualizations with PyVis
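A proper graph coloring gives adjacent words different colors. The sketch below shows one common way to obtain such a coloring, a greedy pass over nodes in descending degree order. Whether the library uses this strategy is not stated here, so treat greedy_coloring and the sample adjacency as hypothetical. A greedy pass yields an upper bound on the chromatic number, not necessarily the exact value.

```python
def greedy_coloring(adjacency):
    """Give each node the smallest color index not used by its colored neighbors."""
    colors = {}
    # Visit high-degree nodes first; ties are broken alphabetically.
    for node in sorted(adjacency, key=lambda n: (-len(adjacency[n]), n)):
        used = {colors[nb] for nb in adjacency[node] if nb in colors}
        color = 0
        while color in used:
            color += 1
        colors[node] = color
    return colors


# Undirected adjacency derived from the bigrams
# the->quick, quick->fox, the->dog (direction is ignored for coloring).
adjacency = {
    "the": {"quick", "dog"},
    "quick": {"the", "fox"},
    "fox": {"quick"},
    "dog": {"the"},
}

colors = greedy_coloring(adjacency)
num_colors = max(colors.values()) + 1
print(colors, num_colors)
```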

Installation

pip install BiGramGraph

For development:

git clone https://github.com/MuteJester/BiGramGraph.git
cd BiGramGraph
pip install -e ".[dev]"

Quick Start

Create a Graph from Text

from BiGramGraph import BiGramGraph

texts = [
    "the quick brown fox jumps over the lazy dog",
    "the dog runs fast",
    "quick fox is clever",
]

graph = BiGramGraph(texts)

print(graph)
# BiGramGraph(name='BiGramGraph', nodes=12, edges=15, chromatic_number=4)

Explore Graph Properties

# Basic properties
print(f"Nodes: {graph.num_nodes}")
print(f"Edges: {graph.num_edges}")
print(f"Chromatic number: {graph.chromatic_number}")
print(f"Is DAG: {graph.is_dag}")
print(f"Strongly connected: {graph.is_strongly_connected}")

# Degree statistics
print(f"Max in-degree: {graph.degree_stats.in_max}")
print(f"Max out-degree: {graph.degree_stats.out_max}")

# Find paths
path = graph.shortest_path("the", "fox")
print(f"Shortest path: {' -> '.join(path)}")

# Check if word exists
if "quick" in graph:
    print(graph["quick"])  # Get word attributes
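For readers unfamiliar with path search on a directed graph, the following standalone sketch finds a path with the fewest edges using breadth-first search. It illustrates the general technique only. The library's shortest_path may work differently (for example, by taking edge weights into account), and the successors mapping is hypothetical sample data.

```python
from collections import deque


def shortest_path(successors, source, target):
    """Breadth-first search for a fewest-edges path; returns None if unreachable."""
    previous = {source: None}  # maps each visited node to the node it was reached from
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the predecessor chain back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in successors.get(node, ()):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


# Hypothetical directed bigram edges.
successors = {
    "the": ["quick", "lazy"],
    "quick": ["brown", "fox"],
    "brown": ["fox"],
    "lazy": ["dog"],
}

path = shortest_path(successors, "the", "fox")
print(" -> ".join(path))  # the -> quick -> fox
```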

Vectorize Text

from BiGramGraph import Vectorizer

vectorizer = Vectorizer(graph)

# Single text
vec = vectorizer.transform("the quick brown")
print(vec)  # array([1., 3., 2.])

# Batch with padding
vectors = vectorizer.transform_batch(
    ["the quick", "brown fox jumps"],
    max_length=5,
    pad_value=0
)
print(vectors.shape)  # (2, 5)
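Conceptually, vectorization replaces each word with its color and pads or truncates to a fixed length. The sketch below shows that idea with plain lists. The color map, the unknown-word value of -1, and the truncation behavior are assumptions for illustration and may not match what Vectorizer does.

```python
def vectorize(text, colors, unknown=-1):
    """Map each word to its color; words missing from the map get `unknown`."""
    return [colors.get(word, unknown) for word in text.lower().split()]


def vectorize_batch(texts, colors, max_length, pad_value=0):
    """Vectorize several texts, truncating or right-padding each to max_length."""
    rows = []
    for text in texts:
        vec = vectorize(text, colors)[:max_length]
        rows.append(vec + [pad_value] * (max_length - len(vec)))
    return rows


# Hypothetical word-to-color assignment.
colors = {"the": 1, "quick": 3, "brown": 2, "fox": 1, "jumps": 2}

batch = vectorize_batch(["the quick", "brown fox jumps"], colors, max_length=5)
print(batch)  # [[1, 3, 0, 0, 0], [2, 1, 2, 0, 0]]
```

Note that a pad value of 0 is only unambiguous if no real color is 0, which is why the sample colors start at 1.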

Generate Text

from BiGramGraph import TextGenerator

generator = TextGenerator(graph)

# Generate using different strategies
text = generator.generate(
    num_colors=5,        # Number of color transitions
    search_depth=10,     # Search depth for path finding
    strategy="heaviest"  # Options: heaviest, lightest, max_density, min_density
)
print(text)
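The generator above walks the graph guided by colors and a selection strategy. As a simpler illustration of the underlying idea, this sketch performs a plain weighted random walk over bigram edges. It does not reproduce the chromatic strategies, and random_walk and the successors data are hypothetical.

```python
import random


def random_walk(successors, start, max_words, seed=None):
    """Generate text by repeatedly sampling a successor in proportion to edge weight."""
    rng = random.Random(seed)
    words = [start]
    while len(words) < max_words:
        options = successors.get(words[-1])
        if not options:
            break  # dead end: the current word has no outgoing edges
        candidates = list(options)
        weights = [options[word] for word in candidates]
        words.append(rng.choices(candidates, weights=weights, k=1)[0])
    return " ".join(words)


# Hypothetical weighted edges: successors[source][target] = bigram count.
successors = {
    "the": {"quick": 2, "dog": 1},
    "quick": {"fox": 1},
    "fox": {"jumps": 1},
    "dog": {"runs": 1},
}

text = random_walk(successors, "the", max_words=4, seed=0)
print(text)
```

Every consecutive word pair in the result is an edge of the graph, so the output only contains bigrams that occurred in the source corpus.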

Visualize the Graph

from BiGramGraph import visualize

# Create interactive HTML visualization
visualize(
    graph,
    output_path="my_graph.html",
    directed=True,
    show_weights=True,
    height=600,
    width=1000
)

Add NLP Enrichment

# Requires: python -m spacy download en_core_web_sm

# Add part-of-speech tags
graph.enrich_pos()
print(graph.node_data[["word", "color", "pos"]].head())

# Add named entity tags
graph.enrich_entities()

Save and Load Graphs

# Save to file
graph.save("my_graph.pkl")

# Load from file
loaded_graph = BiGramGraph.load("my_graph.pkl")

# Or use dict serialization
state = graph.to_dict()
restored = BiGramGraph.from_dict(state)

Compare Graphs

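from BiGramGraph import BiGramGraph, chromatic_distance, graph_similarity_report

# corpus1 and corpus2 are placeholders for two lists of text strings,
# in the same format as the texts list in Quick Start
graph1 = BiGramGraph(corpus1)
graph2 = BiGramGraph(corpus2)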

# Basic similarity report
report = graph_similarity_report(graph1, graph2)
print(f"Jaccard similarity: {report['jaccard_similarity']:.2f}")
print(f"Overlapping words: {report['overlapping_words']}")

# Chromatic distance (requires POS enrichment)
graph1.enrich_pos()
graph2.enrich_pos()
distance = chromatic_distance(graph1, graph2)
print(f"Chromatic distance: {distance:.2f}")
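For reference, Jaccard similarity between two sets is the size of their intersection divided by the size of their union. The sketch below applies it to two vocabularies. Whether the report computes it over words, edges, or something else is an assumption here, and treating two empty sets as identical is a convention chosen for this example.

```python
def jaccard_similarity(items_a, items_b):
    """Return |A intersect B| / |A union B| for two collections of hashable items."""
    a, b = set(items_a), set(items_b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


# Hypothetical vocabularies of two graphs.
vocab_a = {"the", "quick", "fox"}
vocab_b = {"the", "lazy", "fox", "dog"}

similarity = jaccard_similarity(vocab_a, vocab_b)
overlap = sorted(vocab_a & vocab_b)
print(f"{similarity:.2f}", overlap)  # 0.40 ['fox', 'the']
```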

API Reference

Core Classes

Class          Description
BiGramGraph    Main class for creating and analyzing bigram graphs
Vectorizer     Transform text to numerical vectors using chromatic coloring
TextGenerator  Generate synthetic text via chromatic random walks

Analysis Functions

Function                   Description
calculate_path_weight()    Compute total edge weight along a path
calculate_path_density()   Compute density based on node degrees
chromatic_distance()       Similarity metric between two graphs
graph_similarity_report()  Comprehensive comparison of two graphs
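As an illustration of what "total edge weight along a path" means, the sketch below sums the weights of consecutive edges in a word sequence. The signature of calculate_path_weight() is not documented here, so path_weight and the edge_weights mapping are hypothetical stand-ins rather than the library's API.

```python
def path_weight(edge_weights, path):
    """Sum the weights of the edges joining consecutive words in `path`.

    Raises KeyError if two consecutive words are not connected by an edge.
    """
    return sum(edge_weights[(a, b)] for a, b in zip(path, path[1:]))


# Hypothetical edge weights: (source, target) -> bigram count.
edge_weights = {
    ("the", "quick"): 2,
    ("quick", "fox"): 1,
    ("fox", "jumps"): 3,
}

total = path_weight(edge_weights, ["the", "quick", "fox", "jumps"])
print(total)  # 6
```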

Visualization

Function              Description
visualize()           Create interactive HTML graph visualization
visualize_subgraph()  Visualize a subset of nodes

Research Paper

This library implements concepts from the paper "On Bi-gram Graph Attributes" by Thomas Konstantinovsky and Matan Mizrachi:

We propose a new approach to text semantic analysis and general corpus analysis using, as termed in this article, a "bi-gram graph" representation of a corpus. The different attributes derived from graph theory are measured and analyzed as unique insights or against other corpus graphs. We observe a vast domain of tools and algorithms that can be developed on top of the graph representation; creating such a graph proves to be computationally cheap, and much of the heavy lifting is achieved via basic graph calculations. Furthermore, we showcase the different use-cases for the bi-gram graphs and how scalable it proves to be when dealing with large datasets.

DOI: 10.5539/cis.v14n3p78

arXiv: 2107.02128

Citation

@article{Konstantinovsky2021,
  title = {On Bi-gram Graph Attributes},
  volume = {14},
  ISSN = {1913-8989},
  url = {http://dx.doi.org/10.5539/cis.v14n3p78},
  DOI = {10.5539/cis.v14n3p78},
  number = {3},
  journal = {Computer and Information Science},
  publisher = {Canadian Center of Science and Education},
  author = {Konstantinovsky, Thomas and Mizrachi, Matan},
  year = {2021},
  month = jul,
  pages = {78}
}

Requirements

  • Python 3.11+
  • networkx
  • pandas
  • numpy
  • nltk
  • spacy (optional, for NLP enrichment)
  • pyvis (for visualization)

License

MIT License - see LICENSE for details.

Authors

  • Thomas Konstantinovsky
  • Matan Mizrachi

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Run tests (pytest tests/)
  4. Commit your changes (git commit -m 'Add amazing feature')
  5. Push to the branch (git push origin feature/amazing-feature)
  6. Open a Pull Request
