Simple database interface for text analysis applications.
DocTable Package for Python
Object-based database access specifically intended for text analysis applications.
The package makes it easy to create new database tables with simple schemas, a common task in many text analysis projects. It consists primarily of two classes: the original DocTable, a thin interface built directly on the sqlite3 package, and its successor DocTable2, which is implemented using SQLAlchemy.
The typical way to use this package is to create new classes which inherit from DocTable2 or DocTable. These classes can manage schema info and allow users to add application-specific member functions for convenient access to the underlying databases.
See the documentation here: DocTable2 Class Documentation, DocTable Class Documentation
DocTable2 Class
DocTable2 is built on SQLAlchemy Core, a flexible object-oriented interface to many mainstream DB engines. DocTable2 is inspired by the object-based interface of SQLAlchemy, but makes it easier to access SQLAlchemy features without importing a large number of Python objects. The interface requires much less user code than using SQLAlchemy directly, taking cues from the original DocTable class.
Examples
Most of the documentation for DocTable2 is provided via the examples. Here is a list of example notebooks and scripts:
- DocTable2 Function Documentation
- DocTable2 Basic Examples
- Insert/Delete Examples
- Select Examples
- Update Examples
- Schema Examples
- Special Column Type Examples
Quick Example
Initializing a DocTable2 requires a schema, as shown in this example. See the DocTable2 Basics Document for more.
```python
import doctable as dt

# each column is (name, type[, options])
schema = (
    ('id', 'integer', dict(primary_key=True, autoincrement=True)),
    ('name', 'string', dict(nullable=False)),
    ('age', 'integer'),
)
db = dt.DocTable2(schema, fname='test.db')
```
Creating the instance also creates the database and table according to the schema. Now add a few rows:
```python
import random

N = 5
for i in range(N):
    age = random.random()  # float in [0, 1)
    row = {'name': 'user_' + str(i), 'age': age}
    db.insert(row)
```
Now we use the .select() method to view the contents of the database:
```python
db.select()
```
The output will look something like this (the ages are random, so the exact values will differ):
```
[(1, 'user_0', 0.4161851979243477),
 (2, 'user_1', 0.37148559537119163),
 (3, 'user_2', 0.9389122192656695),
 (4, 'user_3', 0.6709306663312412),
 (5, 'user_4', 0.4574398725307163)]
```
Read the basic introduction or other examples to see more!
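To make the schema format from the quick example more concrete, here is a minimal sketch of how tuples like these could be translated into a table using only the standard-library sqlite3 module. This is not how DocTable2 does it internally (it goes through SQLAlchemy); the helper `schema_to_sql`, the `TYPE_MAP` dictionary, and the table name `documents` are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical mapping from the schema's type names to SQLite column types.
TYPE_MAP = {'integer': 'INTEGER', 'string': 'TEXT'}

def schema_to_sql(schema, tabname='documents'):
    """Build a CREATE TABLE statement from (name, type[, options]) tuples."""
    coldefs = []
    for col in schema:
        name, coltype = col[0], col[1]
        opts = col[2] if len(col) > 2 else {}
        parts = [name, TYPE_MAP[coltype]]
        if opts.get('primary_key'):
            parts.append('PRIMARY KEY')
        if opts.get('autoincrement'):
            parts.append('AUTOINCREMENT')
        if opts.get('nullable') is False:
            parts.append('NOT NULL')
        coldefs.append(' '.join(parts))
    return 'CREATE TABLE IF NOT EXISTS {} ({})'.format(tabname, ', '.join(coldefs))

schema = (
    ('id', 'integer', dict(primary_key=True, autoincrement=True)),
    ('name', 'string', dict(nullable=False)),
    ('age', 'integer'),
)

sql = schema_to_sql(schema)
conn = sqlite3.connect(':memory:')  # in-memory database, nothing written to disk
conn.execute(sql)
print(sql)
```

The point of the sketch is only that each tuple carries a column name, a type, and optional constraints; DocTable2 accepts the same information and handles the table creation for you.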
Special Data Types
In addition to regular schema mappings, DocTable2 provides custom data types for token lists and lists of token lists (think tokenized sentences). See the Special Type Examples for more information.
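As a rough idea of what a custom column type for tokenized sentences involves, the sketch below stores a list of token lists as JSON text and converts it back on read, using sqlite3's converter mechanism. It does not use DocTable2's own special types; the declared type name `tokens` and the table `docs` are hypothetical, and JSON is just one possible encoding.

```python
import json
import sqlite3

# Hypothetical declared type name "tokens": the converter runs on every value
# read from a column declared with that type (converters receive bytes).
sqlite3.register_converter('tokens', lambda raw: json.loads(raw.decode('utf-8')))

conn = sqlite3.connect(':memory:', detect_types=sqlite3.PARSE_DECLTYPES)
conn.execute('CREATE TABLE docs (id INTEGER PRIMARY KEY, sents tokens)')

# A list of token lists, i.e. tokenized sentences.
sents = [['the', 'cat', 'sat'], ['it', 'purred']]
conn.execute('INSERT INTO docs (sents) VALUES (?)', (json.dumps(sents),))

# The converter turns the stored JSON text back into nested Python lists.
(loaded,) = conn.execute('SELECT sents FROM docs').fetchone()
print(loaded)
```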
Original DocTable Class
DocTable provides a thin layer over the sqlite3 package for working with single tables of data, which is all that many basic text analysis applications need. Sitting somewhere between a spreadsheet and a full-fledged database server, it offers a very simple interface for storing, updating, and retrieving data. It transparently handles picklable Python objects by storing them as sqlite blob types, so you can treat Python objects like any other retrievable database type.
The original interface consists of a single class called DocTable, a base class with useful read/write/update interface methods. This example (found in example_advanced.ipynb) shows the NewsGroups DocTable, where I have created columns file_id, category, raw_text, subject, author, and tokenized_text. Notice that tokenized_text has been assigned the "blob" type, which DocTable automatically converts to and from Python objects.
```python
from doctable import DocTable

class NewsGroups(DocTable):
    def __init__(self, fname):
        '''
        DocTable class.
        Inputs:
            fname: name of the new sqlite database that will be used
                for instances of this class.
        '''
        tabname = 'newsgroups'
        super().__init__(
            fname=fname,
            tabname=tabname,
            colschema=(
                'id integer primary key autoincrement, file_id int, category string, '
                'raw_text string, subject string, author string, tokenized_text blob, '
                'UNIQUE(file_id)'
            ),
        )

        # create indices on file_id and category
        self.query("create index if not exists idx1 on "+tabname+"(file_id)")
        self.query("create index if not exists idx2 on "+tabname+"(category)")
```
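The blob handling mentioned above can be pictured as a pickle round trip. The following is a minimal sketch using only the standard library, not DocTable's actual code; the trimmed-down two-column table and the sample token list are assumptions made for illustration.

```python
import pickle
import sqlite3

conn = sqlite3.connect(':memory:')  # in-memory database for the demo
conn.execute('CREATE TABLE newsgroups (file_id INT UNIQUE, tokenized_text BLOB)')

tokens = ['hello', 'world', '!']

# Serialize the Python object to bytes before storing it in the BLOB column.
conn.execute(
    'INSERT INTO newsgroups (file_id, tokenized_text) VALUES (?, ?)',
    (1, pickle.dumps(tokens)),
)

# Deserialize on the way out to get an equal Python object back.
(blob,) = conn.execute(
    'SELECT tokenized_text FROM newsgroups WHERE file_id = ?', (1,)
).fetchone()
restored = pickle.loads(blob)
print(restored)
```

Because unpickling can execute arbitrary code, this pattern is only appropriate for database files you created yourself or otherwise trust.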
Thanks
The setup of this package was created following this guide.