Rosetta
====

Tools for data science with a focus on text processing.

* Focuses on "medium data", i.e. data too big to fit into memory but too small to necessitate the use of a cluster.
* Integrates with the existing scientific Python stack as well as select outside tools.

Examples
--------

* See the `examples/` directory.
* The [docs](http://pythonhosted.org/rosetta/#examples) contain plots of example output.


Packages
--------

### `cmd`
* Unix-like command line utilities. Filters (read from stdin/write to stdout) for files.
* Focus on stream processing and csv files.
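
As a rough illustration of the filter idea (read rows from one stream, write the selected rows to another), here is a minimal sketch that uses only the standard library. It is not one of the utilities shipped in `cmd`; the function name `filter_rows` and its arguments are hypothetical.

```python
import csv


def filter_rows(infile, outfile, column, value):
    """Copy the header plus every row whose `column` equals `value`.

    Hypothetical helper for illustration only. Rows are handled one at a
    time, so the whole file never has to fit in memory.
    """
    reader = csv.DictReader(infile)
    if reader.fieldnames is None:  # empty input: nothing to copy
        return 0
    writer = csv.DictWriter(
        outfile, fieldnames=reader.fieldnames, lineterminator="\n"
    )
    writer.writeheader()
    kept = 0
    for row in reader:
        if row[column] == value:
            writer.writerow(row)
            kept += 1
    return kept
```

Calling `filter_rows(sys.stdin, sys.stdout, ...)` from a small script would turn the sketch into a stdin/stdout filter that can sit in a shell pipeline.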

### `parallel`
* Wrappers for Python multiprocessing that add ease of use
* Memory-friendly multiprocessing
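
One way to picture "memory-friendly" multiprocessing is to feed the worker pool a bounded chunk of the input at a time instead of materializing the whole stream. The sketch below shows that idea with the standard library only. It is an assumption about the general technique, not the `parallel` package's actual API, and the names `chunked` and `imap_chunked` are hypothetical.

```python
from itertools import islice
from multiprocessing import Pool


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def imap_chunked(func, iterable, n_jobs=1, chunk_size=1000):
    """Lazily yield func(x) for each x, in input order.

    Hypothetical helper: at most one chunk of inputs and one chunk of
    results is held in memory at any time. `func` must be picklable
    (for example a module-level function) when n_jobs > 1.
    """
    if n_jobs == 1:
        # Serial path: no worker processes, convenient for debugging.
        for item in iterable:
            yield func(item)
        return
    with Pool(n_jobs) as pool:
        for chunk in chunked(iterable, chunk_size):
            # pool.map preserves the order of items within the chunk.
            for result in pool.map(func, chunk):
                yield result
```

When `n_jobs` is greater than 1, call the generator from under an `if __name__ == "__main__":` guard so that platforms which spawn fresh worker processes can import the script safely.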

### `text`
* Stream text from disk to formats used in common ML processes
* Write processed text to sparse formats
* Helpers for ML tools (e.g. Vowpal Wabbit, Gensim)
* Other general utilities
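
The streaming-to-sparse idea can be sketched with the standard library alone: read one document at a time, count its tokens, and emit a sparse `token:count` line. This is not the `text` package's API. The helper names are hypothetical, the tokenizer is deliberately crude, and the output only approximates the general shape of a Vowpal Wabbit feature line.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z]+")


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters (a crude tokenizer)."""
    return TOKEN_RE.findall(text.lower())


def stream_token_counts(paths):
    """Yield (path, Counter of tokens), reading one file at a time."""
    for path in paths:
        with open(path, encoding="utf-8") as f:
            yield path, Counter(tokenize(f.read()))


def to_sparse_line(counts):
    """Render counts as '| token:count ...', sorted by token."""
    features = " ".join(f"{tok}:{n}" for tok, n in sorted(counts.items()))
    return f"| {features}"
```

Because `stream_token_counts` is a generator, only one document's text and counts are in memory at a time, which matches the "medium data" setting described above.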

### `workflow`
* High-level wrappers that have helped with our workflow and provide additional examples of code use

### `modeling`
* General ML modeling utilities

Install
-------
Check out the master branch from the [rosetta repo][rosettarepo]. Then, provided you have `pip`, run

    cd rosetta
    make
    make test

If you update the source, you can do

    make reinstall
    make test

The above `make` targets use `pip`, so you can run `pip uninstall rosetta` at any time.

Getting the source (above) is the preferred method since the code changes often, but if you don't use Git you can download a tagged release (tarball) from the [releases page](https://github.com/columbia-applied-data-science/rosetta/releases). Then run

    pip install rosetta-X.X.X.tar.gz

Development
-----------

### Code

You can get the latest sources with

    git clone https://github.com/columbia-applied-data-science/rosetta

### Contributing

Feel free to report a bug or make a request by opening an [issue](https://github.com/columbia-applied-data-science/rosetta/issues).

The preferred way to contribute is to fork the repository and send a pull request. Before doing this, read [CONTRIBUTING.md](CONTRIBUTING.md).

Dependencies
------------

* Major dependencies on *Pandas* and *numpy*.
* Minor dependencies on *Gensim* and *statsmodels*.
* Some examples need *scikit-learn*.
* Minor dependency on *docx*.
* Minor dependencies on the Unix utilities *pdftotext* and *catdoc*.
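
To see which of these are present before installing, a quick check along the following lines may help. It is only a sketch: the import names (for example `sklearn` for scikit-learn) are assumptions about how each package is imported, and the helper names are hypothetical.

```python
import importlib.util
import shutil

# Assumed import names for the Python dependencies listed above.
PYTHON_MODULES = ["pandas", "numpy", "gensim", "statsmodels", "sklearn", "docx"]
# Command-line tools expected on the PATH.
UNIX_TOOLS = ["pdftotext", "catdoc"]


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def missing_tools(names):
    """Return the executables that are not found on the PATH."""
    return [n for n in names if shutil.which(n) is None]
```

Printing `missing_modules(PYTHON_MODULES)` and `missing_tools(UNIX_TOOLS)` lists whatever still needs to be installed.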

Testing
-------
From the base repo directory, `rosetta/`, you can run all tests with

    make test

Documentation
-------------

Documentation for releases is hosted at [pypi](http://pythonhosted.org/rosetta). This does NOT auto-update.


History
-------
*Rosetta* refers to the [Rosetta Stone](http://en.wikipedia.org/wiki/Rosetta_Stone), the ancient Egyptian tablet discovered just over 200 years ago. The tablet contains fragmentary text in three scripts (hieroglyphic, Demotic, and Ancient Greek), and its decipherment is considered an essential key to our understanding of Ancient Egyptian civilization. We would like this project to give individuals the tools they need to process and unearth insight from today's ever-growing volumes of textual data.

[rosettarepo]: https://github.com/columbia-applied-data-science/rosetta
