Implementations of concepts presented in the paper "On Bi-gram Graph Attributes"
BiGramGraph
A Python library for analyzing text corpora using graph theory. Transforms text into directed graphs where nodes represent words and edges represent bigram (word pair) relationships.
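To make the representation concrete, here is a minimal standard-library sketch of how a corpus could be turned into weighted bigram edges. It is not the library's implementation: the whitespace tokenizer, the choice to create no edge across text boundaries, and the function name `build_bigram_edges` are assumptions made for illustration.

```python
from collections import Counter


def build_bigram_edges(texts):
    """Count directed (word, next_word) pairs across all texts."""
    edges = Counter()
    for text in texts:
        # Naive lowercase whitespace tokenizer (an assumption of this sketch).
        tokens = text.lower().split()
        # zip pairs every token with its successor inside the same text.
        edges.update(zip(tokens, tokens[1:]))
    return edges


edges = build_bigram_edges(["the cat sat", "the cat ran"])
nodes = {word for pair in edges for word in pair}

print(sorted(nodes))
print(edges[("the", "cat")])
```

Each key is a directed edge and each value counts how often that bigram occurs, which is one natural choice of edge weight.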
Features
- Graph-based text analysis - Convert any text corpus into a directed graph representation
- Chromatic coloring - Automatic graph coloring for word categorization
- Text vectorization - Transform text into numerical vectors using chromatic colors
- Text generation - Generate synthetic text using random walks on the graph
- Graph metrics - Compute cycles, paths, strongly connected components, and more
- NLP enrichment - Add part-of-speech and named entity tags via spaCy
- Visualization - Interactive HTML graph visualizations with PyVis
Installation
pip install BiGramGraph
For development:
git clone https://github.com/MuteJester/BiGramGraph.git
cd BiGramGraph
pip install -e ".[dev]"
Quick Start
Create a Graph from Text
from BiGramGraph import BiGramGraph
texts = [
    "the quick brown fox jumps over the lazy dog",
    "the dog runs fast",
    "quick fox is clever",
]
graph = BiGramGraph(texts)
print(graph)
# BiGramGraph(name='BiGramGraph', nodes=12, edges=15, chromatic_number=4)
Explore Graph Properties
# Basic properties
print(f"Nodes: {graph.num_nodes}")
print(f"Edges: {graph.num_edges}")
print(f"Chromatic number: {graph.chromatic_number}")
print(f"Is DAG: {graph.is_dag}")
print(f"Strongly connected: {graph.is_strongly_connected}")
# Degree statistics
print(f"Max in-degree: {graph.degree_stats.in_max}")
print(f"Max out-degree: {graph.degree_stats.out_max}")
# Find paths
path = graph.shortest_path("the", "fox")
print(f"Shortest path: {' -> '.join(path)}")
# Check if word exists
if "quick" in graph:
    print(graph["quick"])  # Get word attributes
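For readers who want to see what a shortest-path query over a bigram graph involves, the sketch below runs a breadth-first search on a plain adjacency dictionary. It assumes "shortest" means fewest hops; whether `graph.shortest_path` uses hop count or edge weights is not stated here, and the standalone `shortest_path` function and `adjacency` data are illustrative only.

```python
from collections import deque


def shortest_path(adjacency, source, target):
    """Return the fewest-hops path from source to target, or None."""
    parents = {source: None}  # doubles as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in adjacency.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


# Hand-written directed adjacency for a few words (illustrative data).
adjacency = {
    "the": ["quick", "lazy", "dog"],
    "quick": ["brown", "fox"],
    "brown": ["fox"],
    "fox": ["jumps", "is"],
}
path = shortest_path(adjacency, "the", "fox")
print(" -> ".join(path))
```

Because the edges are directed, a path from "the" to "fox" does not imply one in the opposite direction; the function returns `None` in that case.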
Vectorize Text
from BiGramGraph import Vectorizer
vectorizer = Vectorizer(graph)
# Single text
vec = vectorizer.transform("the quick brown")
print(vec) # array([1., 3., 2.])
# Batch with padding
vectors = vectorizer.transform_batch(
    ["the quick", "brown fox jumps"],
    max_length=5,
    pad_value=0
)
print(vectors.shape) # (2, 5)
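The idea behind chromatic vectorization can be sketched without the library: color the graph so that adjacent words never share a color, then replace each word with its color. The version below uses a simple greedy coloring with colors starting at 1 and skips unknown words. All of these choices, and the names `greedy_coloring` and `vectorize`, are assumptions; the library may use a different coloring strategy and returns NumPy arrays rather than lists.

```python
def greedy_coloring(edges):
    """Assign each word the smallest color not used by its neighbors."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set())
        neighbors.setdefault(b, set())
        if a != b:  # a self-loop cannot constrain the coloring
            neighbors[a].add(b)
            neighbors[b].add(a)
    colors = {}
    # Visit high-degree words first; break ties alphabetically so the
    # result is deterministic.
    for word in sorted(neighbors, key=lambda w: (-len(neighbors[w]), w)):
        used = {colors[n] for n in neighbors[word] if n in colors}
        color = 1
        while color in used:
            color += 1
        colors[word] = color
    return colors


def vectorize(text, colors, max_length, pad_value=0):
    """Map known words to their color, then pad or truncate."""
    vec = [colors[w] for w in text.lower().split() if w in colors]
    return (vec + [pad_value] * max_length)[:max_length]


edges = [("the", "quick"), ("quick", "brown"), ("brown", "fox"), ("quick", "fox")]
colors = greedy_coloring(edges)
print(vectorize("the quick brown", colors, max_length=5))
```

Starting colors at 1 keeps `pad_value=0` unambiguous, so padding can never be mistaken for a real color.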
Generate Text
from BiGramGraph import TextGenerator
generator = TextGenerator(graph)
# Generate using different strategies
text = generator.generate(
    num_colors=5,        # Number of color transitions
    search_depth=10,     # Search depth for path finding
    strategy="heaviest"  # Options: heaviest, lightest, max_density, min_density
)
print(text)
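As a rough illustration of walk-based generation, the sketch below performs a plain weighted random walk over bigram edges. This is a simplified stand-in, not the chromatic strategy used by `TextGenerator`: it ignores colors, search depth, and the `heaviest`/`lightest`/density strategies, and the `random_walk` function and its parameters are hypothetical.

```python
import random


def random_walk(edges, start, steps, seed=None):
    """Walk the graph, choosing each next word in proportion to edge weight."""
    rng = random.Random(seed)
    successors = {}
    for (a, b), weight in edges.items():
        successors.setdefault(a, []).append((b, weight))
    words = [start]
    for _ in range(steps):
        options = successors.get(words[-1])
        if not options:  # dead end: the current word has no outgoing edge
            break
        targets, weights = zip(*options)
        words.append(rng.choices(targets, weights=weights, k=1)[0])
    return " ".join(words)


# Illustrative weighted edges: (word, next_word) -> bigram count.
edges = {
    ("the", "quick"): 1, ("the", "dog"): 2, ("quick", "fox"): 1,
    ("fox", "jumps"): 1, ("dog", "runs"): 1, ("runs", "fast"): 1,
}
text = random_walk(edges, "the", steps=4, seed=0)
print(text)
```

Every consecutive pair in the output is an edge of the graph, so the generated text only ever contains bigrams that were observed in the corpus.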
Visualize the Graph
from BiGramGraph import visualize
# Create interactive HTML visualization
visualize(
    graph,
    output_path="my_graph.html",
    directed=True,
    show_weights=True,
    height=600,
    width=1000
)
Add NLP Enrichment
# Requires: python -m spacy download en_core_web_sm
# Add part-of-speech tags
graph.enrich_pos()
print(graph.node_data[["word", "color", "pos"]].head())
# Add named entity tags
graph.enrich_entities()
Save and Load Graphs
# Save to file
graph.save("my_graph.pkl")
# Load from file
loaded_graph = BiGramGraph.load("my_graph.pkl")
# Or use dict serialization
state = graph.to_dict()
restored = BiGramGraph.from_dict(state)
Compare Graphs
from BiGramGraph import BiGramGraph, chromatic_distance, graph_similarity_report
# corpus1 and corpus2 are lists of strings, like `texts` in Quick Start
graph1 = BiGramGraph(corpus1)
graph2 = BiGramGraph(corpus2)
# Basic similarity report
report = graph_similarity_report(graph1, graph2)
print(f"Jaccard similarity: {report['jaccard_similarity']:.2f}")
print(f"Overlapping words: {report['overlapping_words']}")
# Chromatic distance (requires POS enrichment)
graph1.enrich_pos()
graph2.enrich_pos()
distance = chromatic_distance(graph1, graph2)
print(f"Chromatic distance: {distance:.2f}")
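For intuition about the vocabulary-level numbers in the report, here is a small sketch of Jaccard similarity over two word sets. It assumes the similarity is computed on the sets of words (graph nodes); the exact definition used by `graph_similarity_report` is not specified here, and the helper names and corpora are illustrative.

```python
def vocabulary(texts):
    """Collect the set of lowercase whitespace-separated words."""
    return {word for text in texts for word in text.lower().split()}


def jaccard_similarity(words_a, words_b):
    """Size of the intersection divided by size of the union."""
    union = words_a | words_b
    if not union:
        return 1.0  # two empty vocabularies are treated as identical here
    return len(words_a & words_b) / len(union)


corpus1 = ["the quick brown fox", "the dog runs fast"]
corpus2 = ["the lazy dog", "the fox is clever"]
vocab1, vocab2 = vocabulary(corpus1), vocabulary(corpus2)

print(sorted(vocab1 & vocab2))
print(f"{jaccard_similarity(vocab1, vocab2):.2f}")
```

A value of 1.0 means identical vocabularies and 0.0 means no shared words.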
API Reference
Core Classes
| Class | Description |
|---|---|
| `BiGramGraph` | Main class for creating and analyzing bigram graphs |
| `Vectorizer` | Transform text to numerical vectors using chromatic coloring |
| `TextGenerator` | Generate synthetic text via chromatic random walks |
Analysis Functions
| Function | Description |
|---|---|
| `calculate_path_weight()` | Compute total edge weight along a path |
| `calculate_path_density()` | Compute density based on node degrees |
| `chromatic_distance()` | Similarity metric between two graphs |
| `graph_similarity_report()` | Comprehensive comparison of two graphs |
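The notion of total edge weight along a path can be sketched as a sum over consecutive word pairs. The function below is a hypothetical standalone helper operating on a plain dictionary of edge weights, not the signature of `calculate_path_weight()`; raising an error on a missing edge is also an assumption.

```python
def path_weight(edge_weights, path):
    """Sum the weights of the consecutive edges along path."""
    total = 0
    for a, b in zip(path, path[1:]):
        if (a, b) not in edge_weights:
            raise ValueError(f"no edge {a!r} -> {b!r}")
        total += edge_weights[(a, b)]
    return total


# Illustrative weights: (word, next_word) -> bigram count.
edge_weights = {("the", "quick"): 1, ("quick", "fox"): 1, ("the", "dog"): 2}
print(path_weight(edge_weights, ["the", "quick", "fox"]))
```

A path with a single word has no edges, so its weight under this definition is 0.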
Visualization
| Function | Description |
|---|---|
| `visualize()` | Create interactive HTML graph visualization |
| `visualize_subgraph()` | Visualize a subset of nodes |
Research Paper
This library implements concepts from the paper "On Bi-gram Graph Attributes" by Thomas Konstantinovsky and Matan Mizrachi:
We propose a new approach to text semantic analysis and general corpus analysis using, as termed in this article, a "bi-gram graph" representation of a corpus. The different attributes derived from graph theory are measured and analyzed as unique insights or against other corpus graphs. We observe a vast domain of tools and algorithms that can be developed on top of the graph representation; creating such a graph proves to be computationally cheap, and much of the heavy lifting is achieved via basic graph calculations. Furthermore, we showcase the different use-cases for the bi-gram graphs and how scalable it proves to be when dealing with large datasets.
DOI: 10.5539/cis.v14n3p78
arXiv: 2107.02128
Citation
@article{Konstantinovsky2021,
title = {On Bi-gram Graph Attributes},
volume = {14},
ISSN = {1913-8989},
url = {http://dx.doi.org/10.5539/cis.v14n3p78},
DOI = {10.5539/cis.v14n3p78},
number = {3},
journal = {Computer and Information Science},
publisher = {Canadian Center of Science and Education},
author = {Konstantinovsky, Thomas and Mizrachi, Matan},
year = {2021},
month = jul,
pages = {78}
}
Requirements
- Python 3.11+
- networkx
- pandas
- numpy
- nltk
- spacy (optional, for NLP enrichment)
- pyvis (for visualization)
License
MIT License - see LICENSE for details.
Authors
- Thomas Konstantinovsky
- Matan Mizrachi
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Run tests (`pytest tests/`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
File details
Details for the file bigramgraph-2.1.0.tar.gz.
File metadata
- Download URL: bigramgraph-2.1.0.tar.gz
- Size: 22.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.26
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0b8e0f9c5845341626e8a390a1b60c64d79439719e091f17725a99635ad507d0 |
| MD5 | 49f0d493a2e226c02532f32e3f57b276 |
| BLAKE2b-256 | 32ae391212b1cbec66b9797ce3252277628ecdc1c50904308aef84467d580470 |
File details
Details for the file bigramgraph-2.1.0-py3-none-any.whl.
File metadata
- Download URL: bigramgraph-2.1.0-py3-none-any.whl
- Size: 20.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.26
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 47a03f7b8e0074187a64f60e35bcc2ebf6ca111c59cf387682d81357d80cb340 |
| MD5 | 4c8e6c2c83215674687186b1562c8016 |
| BLAKE2b-256 | 37e95e690f4eba6b943a4e262c94800657c252a768ddfaf8b5c22ee06dbb42ad |