Instant Knowledge Graphs from text documents.

Skipchunk

Easy search autosuggest with NLP magic.

Out of the box, it provides hassle-free autosuggest for any corpus from scratch, plus latent knowledge graph extraction and exploration.

Install

pip install skipchunk

python -m spacy download 'en_core_web_lg'

python -m nltk.downloader wordnet

You also need to have Solr or Elasticsearch installed and running somewhere!

The current Solr supported version is 8.4.1, but it might work on other versions.

The current Elasticsearch supported version is 7.6.2, but it might work on other versions.

Use It!

See the ./example/ folder for an end-to-end OSC blog load:

Solr

Start Solr first! Skipchunk doesn’t work with SolrCloud yet, but we’re working on it.

For now, you’ll need to start Solr using skipchunk’s solr_home directory.

Then run this: python solr-blog-example.py

Elasticsearch

Start Elasticsearch first!

Then run this: python elasticsearch-blog-example.py

Features

  • Identifies and groups the noun phrases and verb phrases in a corpus

  • Indexes these phrases in Solr or Elasticsearch for a really good out-of-the-box autosuggest

  • Structures the phrases as a graph so that concept-relationship-concept can be easily found

  • Meant to handle batched updates as part of a full stack search platform

Library API

Engine configuration

You need an engine_config, as a dict, to create a Skipchunk instance.

The dict must contain the following entries:

  • host (the fully qualified URL of the engine web API endpoint)

  • name (the name of the graph)

  • path (the on-disk location of stateful data that will be kept)

  • engine_name (either "solr" or "elasticsearch")

Solr engine config example

engine_config_solr = {
    "host": "http://localhost:8983/solr/",
    "name": "osc-blog",
    "path": "./skipchunk_data",
    "engine_name": "solr"
}

Elasticsearch engine config example

engine_config_elasticsearch = {
    "host": "http://localhost:9200/",
    "name": "osc-blog",
    "path": "./skipchunk_data",
    "engine_name": "elasticsearch"
}
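
Since all four entries are required, it can help to check a config before handing it to Skipchunk. The sketch below is not part of the library: check_engine_config, REQUIRED_KEYS and SUPPORTED_ENGINES are hypothetical names, and the checks cover only the rules listed above.

```python
from urllib.parse import urlparse

# Entries the README lists as required (the constant names are hypothetical).
REQUIRED_KEYS = ("host", "name", "path", "engine_name")
SUPPORTED_ENGINES = ("solr", "elasticsearch")


def check_engine_config(config):
    """Return a list of problems found in an engine_config dict (empty if none)."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in config:
            problems.append(f"missing entry: {key}")
    engine = config.get("engine_name")
    if engine is not None and engine not in SUPPORTED_ENGINES:
        problems.append(f"unsupported engine_name: {engine}")
    host = config.get("host")
    if host is not None:
        # "Fully qualified" is read here as: an http(s) scheme plus a host part.
        parsed = urlparse(host)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"host is not a fully qualified URL: {host}")
    return problems


engine_config_solr = {
    "host": "http://localhost:8983/solr/",
    "name": "osc-blog",
    "path": "./skipchunk_data",
    "engine_name": "solr",
}

print(check_engine_config(engine_config_solr))
```

An empty list means the dict has the four entries in the expected shape. It does not confirm that the engine is actually reachable at that URL.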

Skipchunk Initialization

When initializing Skipchunk, pass the constructor the following parameters. Where a value is shown, it is the default:

  • engine_config (the dict containing search engine connection details)

  • spacy_model="en_core_web_lg" (the spaCy model to use to parse text)

  • minconceptlength=1 (the minimum number of words that can appear in a noun phrase)

  • maxconceptlength=3 (the maximum number of words that can appear in a noun phrase)

  • minpredicatelength=1 (the minimum number of words that can appear in a verb phrase)

  • maxpredicatelength=3 (the maximum number of words that can appear in a verb phrase)

  • minlabels=1 (the minimum number of times a concept/predicate must appear before it is recognized and kept. The lower this number, the more concepts will be kept - so be careful with large content sets!)

  • cache_documents=False

  • cache_pickle=False
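
The length and count parameters interact, so a small pre-flight check can catch a minimum that exceeds its maximum. This is only a sketch: skipchunk_options is a hypothetical helper, and the rule that every minimum must be at least 1 is an assumption rather than documented library behavior.

```python
def skipchunk_options(minconceptlength=1, maxconceptlength=3,
                      minpredicatelength=1, maxpredicatelength=3,
                      minlabels=1):
    """Validate phrase-length settings and return them as keyword arguments.

    Hypothetical helper: the parameter names and defaults come from the list
    above, but the validation rules are assumptions, not library behavior.
    """
    if not 1 <= minconceptlength <= maxconceptlength:
        raise ValueError("need 1 <= minconceptlength <= maxconceptlength")
    if not 1 <= minpredicatelength <= maxpredicatelength:
        raise ValueError("need 1 <= minpredicatelength <= maxpredicatelength")
    if minlabels < 1:
        raise ValueError("minlabels must be at least 1")
    return {
        "minconceptlength": minconceptlength,
        "maxconceptlength": maxconceptlength,
        "minpredicatelength": minpredicatelength,
        "maxpredicatelength": maxpredicatelength,
        "minlabels": minlabels,
    }


# Shorter noun phrases, and keep only concepts seen at least five times.
options = skipchunk_options(maxconceptlength=2, minlabels=5)
print(options)
```

The returned dict could then be unpacked into the constructor alongside engine_config. The import is left out here because the module path is not given above.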

Skipchunk Methods

  • tuplize(filename=source,fields=['title','content',...]) (Produces a list of (text,document) tuples ready for processing by the enrichment.)

  • enrich(tuples) (Enriching can take a long time if you provide lots of text. Consider batching 10k docs at a time.)

  • save (Saves to pickle)

  • load (Loads from pickle)
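
The batching advice for enrich can be sketched with a small generator. in_batches is a hypothetical helper, not part of skipchunk, and the tuples below are stand-ins for what tuplize would return. The enrich call appears only as a comment, because running it needs a configured Skipchunk instance.

```python
from itertools import islice


def in_batches(items, size=10_000):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Stand-in (text, document) tuples; real ones would come from tuplize.
tuples = [(f"text {i}", {"id": i}) for i in range(25)]

batch_sizes = []
for batch in in_batches(tuples, size=10):
    # With a real instance, each batch would be passed to enrich(batch) here.
    batch_sizes.append(len(batch))

print(batch_sizes)  # [10, 10, 5]
```

Because the generator pulls items lazily, it also works when the source is a stream rather than a list already in memory.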

Graph API

After enrichment, you can index the graph into the engine:

  • index(skipchunk:Skipchunk) (Updates the knowledge graph in the search engine)

  • delete (Deletes a knowledge graph - be careful!)

After indexing, you can call these methods to get autocompleted concepts or walk the knowledge graph:

  • conceptVerbConcepts(concept:str,verb:str,mincount=1,limit=100) -> list (Accepts a concept and a verb to find the concepts appearing in the same context)

  • conceptsNearVerb(verb:str,mincount=1,limit=100) -> list (Accepts a verb to find the concepts appearing in the same context)

  • verbsNearConcept(concept:str,mincount=1,limit=100) -> list (Accepts a concept to find the verbs appearing in the same context)

  • suggestConcepts(prefix:str,build=False) -> list (Suggests a list of concepts given a prefix)

  • suggestPredicates(prefix:str,build=False) -> list (Suggests a list of predicates given a prefix)

  • summarize(mincount=1,limit=100) -> list (Summarizes a core)

  • graph(subject:str,objects=5,branches=10) -> list (Gets the subject-predicate-object neighborhood graph for a subject)
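
Two of these calls can be chained to walk outward from one concept: first the verbs near it, then the concepts each verb leads to. The sketch below relies only on the method names and signatures listed above. verbs_and_objects, label_of and StubGraphQuery are hypothetical, and the shape of the returned list items is not documented here, so label_of is the place to extract a label if the real results are not plain strings.

```python
def verbs_and_objects(graph_query, concept, mincount=1, limit=100, label_of=str):
    """Map each verb found near `concept` to the concepts it connects to.

    `graph_query` is any object exposing verbsNearConcept and
    conceptVerbConcepts with the signatures listed above. `label_of` turns one
    result item into a string; the default assumes items are already strings.
    """
    neighborhood = {}
    for verb in graph_query.verbsNearConcept(concept, mincount=mincount, limit=limit):
        verb_label = label_of(verb)
        related = graph_query.conceptVerbConcepts(
            concept, verb_label, mincount=mincount, limit=limit
        )
        neighborhood[verb_label] = [label_of(item) for item in related]
    return neighborhood


class StubGraphQuery:
    """In-memory stand-in so the sketch runs without a search engine."""

    def __init__(self, triples):
        self.triples = triples  # list of (concept, verb, concept) tuples

    def verbsNearConcept(self, concept, mincount=1, limit=100):
        verbs = []
        for subject, verb, _ in self.triples:
            if subject == concept and verb not in verbs:
                verbs.append(verb)
        return verbs[:limit]

    def conceptVerbConcepts(self, concept, verb, mincount=1, limit=100):
        return [o for s, v, o in self.triples if s == concept and v == verb][:limit]


stub = StubGraphQuery([
    ("solr", "index", "document"),
    ("solr", "index", "field"),
    ("solr", "rank", "result"),
])
neighborhood = verbs_and_objects(stub, "solr")
print(neighborhood)
```

The stub ignores mincount and matches triples exactly, which is much simpler than whatever "same context" means inside the engine. It exists only to make the control flow runnable.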

Credits

Developed by Max Irwin, OpenSource Connections https://opensourceconnections.com

All the blog posts contained in the example directory are copyright OpenSource Connections, and may not be used or redistributed without permission.

History

0.1.0 (2019-06-18)

  • Cookie-cutted.

0.9.0 (2020-09-25)

  • First release on PyPI.

1.0.0 (2020-12-10)

  • Stable API.

1.1.0 (2020-12-10)

  • Beta Release.

1.1.1 (2020-12-10)

  • Basic Readme doc.
