A set of Python modules for working with the Cornell Movie-Dialogs Corpus through the Storm ORM

Abstract

This package includes classes that extend the Storm ORM to represent the data in the Cornell Movie-Dialogs Corpus.

Install

pip install storm                # if not already installed
pip install cornel-movie-dialogs-corpus-storm

Setup

  1. Download the corpus and unzip it.

  2. Generate the database and insert the corpus data with generate-mdcorpus-database.py.

For example:

generate-mdcorpus-database.py --corpus-dir "cornell movie-dialogs corpus" corpus.db

Usage

from mdcorpus.orm import *
from mdcorpus.parser import *

...

Class List

  • MovieTitlesMetadata

  • Genre

  • MovieGenreLine

  • MovieCharactersMetadata

  • MovieConversation

  • MovieLine

  • RawScriptUrl

Corpus Problem

These are notes on the problems I ran into while processing the corpus.

movie_titles_metadata.txt

  • I ignored the letter suffix that sometimes follows the year.

    • for example, line 34: 1989/I

  • I ignored duplicate entries in the genre data.

    • for example, line 58: ['horror', 'mystery', 'mystery', 'sci-fi', 'sci-fi']
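The two workarounds above could be sketched roughly as follows. This is only an illustration, not the code shipped in this package: the helper names parse_year and parse_genres are hypothetical, and the sketch assumes the genre field is a Python-style list literal, as it appears in the example for line 58.

```python
import ast
import re


def parse_year(field):
    """Return the leading four-digit year, ignoring a suffix such as '/I'.

    Hypothetical helper, not part of mdcorpus.
    """
    match = re.match(r"\d{4}", field.strip())
    if match is None:
        raise ValueError("no year found in %r" % field)
    return int(match.group(0))


def parse_genres(field):
    """Parse a genre list literal and drop duplicates, keeping first-seen order.

    Hypothetical helper; assumes the field looks like "['horror', 'mystery']".
    """
    genres = ast.literal_eval(field.strip())
    unique = []
    for genre in genres:
        if genre not in unique:
            unique.append(genre)
    return unique
```

With these helpers, parse_year("1989/I") yields the integer year without the suffix, and parse_genres applied to the line 58 value yields each genre once.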

Code Problem

I use Python 2.7 and I am not sure how to use the codecs module correctly. (See: Unicode HOWTO, Python 2.7ja1 documentation.)

mime

I converted the text encoding of the corpus files to UTF-8 with Mi.

before

cornell movie-dialogs corpus$ file --mime {(ls)}
README.txt:                    text/plain; charset=iso-8859-1
chameleons.pdf:                application/pdf; charset=binary
movie_characters_metadata.txt: text/plain; charset=iso-8859-1
movie_conversations.txt:       text/plain; charset=us-ascii
movie_lines.txt:               text/plain; charset=us-ascii
movie_titles_metadata.txt:     text/plain; charset=iso-8859-1
raw_script_urls.txt:           text/plain; charset=iso-8859-1

after

cornell movie-dialogs corpus$ file --mime {(ls)}
README.txt:                    text/plain; charset=utf-8
chameleons.pdf:                application/pdf; charset=binary
movie_characters_metadata.txt: text/plain; charset=utf-8
movie_conversations.txt:       text/plain; charset=us-ascii
movie_lines.txt:               text/plain; charset=us-ascii
movie_titles_metadata.txt:     text/plain; charset=utf-8
raw_script_urls.txt:           text/plain; charset=utf-8
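The same conversion could also be done in Python instead of an editor. The following is a minimal sketch under stated assumptions: the function name convert_to_utf8 is hypothetical, and the source encoding is taken to be ISO-8859-1, as reported by file --mime in the "before" listing. io.open is used because it accepts an encoding argument on both Python 2.7 and Python 3.

```python
import io


def convert_to_utf8(src_path, dst_path, src_encoding="iso-8859-1"):
    """Re-encode a text file as UTF-8 (hypothetical helper, not part of mdcorpus).

    newline="" keeps the original line endings untouched in both directions.
    """
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

Writing to a separate destination path keeps the original file intact, so the conversion can be repeated if the assumed source encoding turns out to be wrong.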

movie_titles_metadata.txt

  • line 115, léon

movie_characters_metadata.txt

  • lines 1727-1736, léon

result

sqlite> select * from movie_titles_metadata where title = 'léon';
sqlite> select * from movie_titles_metadata where title = 'l駮n';
114|l駮n|1994|8.6|204901
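One possible explanation for this result, offered as an assumption and not confirmed by anything above, is that the ISO-8859-1 bytes were read with a Japanese multi-byte encoding such as Shift_JIS. In that case the byte for "é" (0xE9) is treated as the first half of a two-byte character and absorbs the following "o". The sketch below decodes the same bytes both ways so the difference can be inspected:

```python
# -*- coding: utf-8 -*-

# "léon" as it would be stored in an ISO-8859-1 file: 0xE9 is "é".
raw = b"l\xe9on"

# Decoding with the encoding the file was actually written in.
as_latin1 = raw.decode("iso-8859-1")

# Decoding with Shift_JIS (assumed here, not confirmed by the source):
# 0xE9 0x6F is consumed as one two-byte character, so "é" and "o" merge.
as_sjis = raw.decode("shift_jis")

print(repr(as_latin1))  # four characters
print(repr(as_sjis))    # three characters
```

If this is what happened, decoding the files as ISO-8859-1 (or converting them to UTF-8 first) before inserting them into the database should let the query for 'léon' match.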
