Skip to main content

An NLTK-based toolkit aimed at increasing the understanding of various texts.

Project description

Smart Reading

About

smart_Reading is a Python module designed for increasing the understanding of various textforms by using natural language processing. It is heavily based on tools available from the Natural Language Toolkit (NLTK), which are used in various applications or provided with an extension.

Installation

The module is available for Python 2.7+, but recommended to run on Python 3+ for a more thorough unicode support and prettier graphs. Install via pip (or any other desired client):

$ pip install smart_reading

or by downloading the source code on PyPI or GitHub and running the following command in the root folder:

$ python setup.py install

Importing texts

The basic functionality of smart_reading is to import a given textfile into a smart_reading.book.Book object, which can be used to perform several developed analyses. Textfiles can be imported via the function smart_reading.book.load(filename). This function utilizes the functionality of the module textract to extract textual information of almost any given data form, including .txt, .pdf, .epub and .docx. See the online manual for more details on the inner workings of this module. When this module is not found on the system, the program continues with a limited functionality, in which only .txt files can be read. Additionally, a given string can be imported as an e-book via smart_reading.book.fromstring(text). This limited functionality is added because it is experienced that the installation of textract does not always succeed. The user that is not able to install textract but does want to import other text formats is encouraged to build alternative pipelines to extract text into a .txt file, which in turn can be imported in the smart_reading.book.load function.

Three sample texts are also included with different file structures, and available via the function smart_reading.book.sample:

>>> import smart_reading as sr
>>> sr.book.sample() # or sr.book.sample('txt')
Succesfully loaded 'Benn_Ch_II_The_Metaphysicians.txt' as an e-book
Total n.o. tokens: 10420
<smart_reading.book.Book instance at 0x105a546c8>
>>> sr.book.sample('pdf')
Succesfully loaded 'PhysRev.47.777.pdf' as an e-book
Total n.o. tokens: 3192
<smart_reading.book.Book instance at 0x110c0ebd8>
>>> sr.book.sample('epub')
Succesfully loaded 'Galileo_The_Sidereal_Messenger.epub' as an e-book
Total n.o. tokens: 40372
<smart_reading.book.Book instance at 0x105a46e60>

Functionality

As mentioned, the given text are imported into a smart_reading.book.Book object. The different tools that this object provides are listed below.

Concordance

A concordance is developed as an extension of the nltk.Text.concordance function, which incorporates example 3.6 of the NLTK manual, such that it not only matches with exact copies of a given word, but also inflections:

>>> import smart_reading as sr
>>> bk = sr.book.sample()
Succesfully loaded 'Benn_Ch_II_The_Metaphysicians.txt' as an e-book
Total n.o. tokens: 10420
>>> bk.concordance('philosopher')
Displaying 17 of 17 matches:
nce of an independent income enabled the philosopher to live where he liked ; and
by our opinion of his metaphysics . As a philosopher Descartes has , to begin wit
r dazzle ; they could not convince . The philosophers professed to teach truth ; 
inctly are all true . In his other great philosophical work , the _Meditations_ ,
o his postulate of universal doubt , our philosopher argues from this to an imper
Here he agrees with another mathematical philosopher , Plato , who says the same 
at least one astronomer , who was also a philosopher , declared that the ultimate
 personality of God . SPINOZA . With the philosopher whom I have just named we co
sion of 500 florins on Spinoza , but the philosopher would accept no more than 30
l . To appreciate the work of the Hebrew philosopher , of the lonely muser , bred
 divine substance . In fact , the Hebrew philosopher does this , declaring boldly
peppers his pages . Yet , like the Greek philosophers , he is much more modern , 
 name of his great work that for him the philosophical problem is essentially a p
 . But he parts company with the English philosopher in his theory of what it mea
 alone , however , does not make a great philosopher ; character also is required
rity than any one utterance of any other philosopher ; but that fame is due to th
 work . On _à priori_ grounds the German philosopher seems to have an incontrover

In order to deal with inflections, the stemmer nltk.PorterStemmer is used by default. Other stemmers can be sent through the keyword stemmer when importing a textfile.

nltk.text.Text attribute

A smart_reading.book.Book object always has an attribute Text, which is an nltk.text.Text object and as such includes all its attributes as developed by NLTK, like finding collocations, similar words, and creating disperion plots. See the NLTK API for its full documentation

>>> bk.Text.collocations()
fullest extent; infinite Power; material world; Princess Elizabeth;
external world; paramount object; supernatural revelation; two
attributes; necessarily exist; Queen Christina; early age; whole
universe; final causes; mathematical demonstration; metaphysical
system; perfection involves; best possible; many distinct;
mathematical method; divine substance
>>> bk.Text.dispersion_plot(['Descartes','Malebranche','Spinoza','Leibniz'])

alt text

The smart_reading.stats submodule

The smart_reading module comes with a stats submodule, which uses matplotlib.pyplot to create different the following graps of a given text file:

plot_noun_dist(book, no_nouns = 20, named_entities = True, exceptions = [], **kwargs)

A frequency plot of the most common nouns appearing in given text.

  • no_nouns: Number of nouns that will be included in the graph (i.e. number of bars).
  • named_entities: If false, this will exclude named entities like people and places that are recognized by the nltk.chunk.ne_chunk routine. This option is not supported in versions of Python 2.
  • exceptions: An iterable of nouns that will be excluded from the analysis
  • Further keyword arguments are passed to matplotlib.pyplot.fig.
>>> bk2 = sr.book.sample('pdf')
Succesfully loaded 'PhysRev.47.777.pdf' as an e-book
Total n.o. tokens: 3192
>>> sr.stats.plot_noun_dist(bk2)

alt text

plot_network_graph(book, no_nouns = 10, treshold = 3, exclude_empty = True, named_entities = True, exceptions = [], **kwargs)

A network graph created using neworkx depicting the relationship between frequently appearing graphs. The nouns appear as nodes, and edges are drawn between nouns if they appear frequently in the same sentences. A temperature color scheme is used on the edges to depict the frequency in which nouns appear together (red = very often, blue = a few times).

  • no_nouns: Number of nouns that will be included in the graph (i.e. number of nodes). Can become less in the final result if exclude_empty is true, see below.
  • treshold: The minimal number of sentences in which two given nouns have to appear in order for an edge to be drawn. Can be used to simplify graphs with a lot of edges.
  • exclude_empty: if True, this will exclude the nodes from the graph that do not have edges. Note that this will reduce the number of nouns depicted, as declared in no_nouns above
  • exceptions: an iterable of nouns that will be excluded from the analysis.
  • Further keyword arguments are passed to matplotlib.pyplot.fig.
>>> sr.stats.plot_network_graph(bk2)

alt text

>>> bk3 = sr.book.sample('epub')
Succesfully loaded 'Galileo_The_Sidereal_Messenger.epub' as an e-book
Total n.o. tokens: 40372
>>> sr.stats.plot_network_graph(bk3, no_nodes = 15, treshold = 5)

alt text

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

smart_reading-1.0.1.tar.gz (403.0 kB view details)

Uploaded Source

Built Distribution

smart_reading-1.0.1-py2.py3-none-any.whl (449.0 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file smart_reading-1.0.1.tar.gz.

File metadata

File hashes

Hashes for smart_reading-1.0.1.tar.gz
Algorithm Hash digest
SHA256 0b41c01e842967fb2446e439e5ae8054269b242ff246acfd1cc47c062ef2cb6a
MD5 211cb9fabd8853644393415444765164
BLAKE2b-256 aaabe009263c341da5f4289334614e77f2813a11cb62978227f20297e8dc4aea

See more details on using hashes here.

File details

Details for the file smart_reading-1.0.1-py2.py3-none-any.whl.

File metadata

File hashes

Hashes for smart_reading-1.0.1-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 05c1707cefb853915dc55f7993e7a28fb4600d2648731f2ab682d85effec8b5a
MD5 57b2adfeea992c58901867aef0a138da
BLAKE2b-256 193852bab192e300ffc4ba66be98f3dd49359233493c5ff5e769caf370e11eb7

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page