Tools for building streamcorpus objects, such as those used in TREC.

TREC KBA Data
=============

kba.pipeline is a document processing pipeline that assembles
streamcorpus objects from raw data sets for use in TREC KBA.

TREC KBA 2013
-------------
1006305073 all-stream-ids.suc.txt
884018982 all-stream-ids.doc_ids.suc.txt


kba.pipeline
-------------

The kba.pipeline Python module contains tools for processing
streamcorpus.StreamItems stored in Chunks. It includes transform
functions for producing clean_html and clean_visible text, creating labels
from hyperlinks to particular sites (e.g. Wikipedia), and taggers such as
LingPipe and Stanford CoreNLP, which produce Tokens and Sentences.

python2.7
---------
To create a python2.7 virtualenv, do this:

wget http://www.python.org/ftp/python/2.7.3/Python-2.7.3.tgz
tar xzf Python-2.7.3.tgz
cd Python-2.7.3
./configure --prefix /data/trec-kba/installs/py27
make install
cd ..
wget http://pypi.python.org/packages/source/v/virtualenv/virtualenv-1.8.4.tar.gz#md5=1c7e56a7f895b2e71558f96e365ee7a7
tar xzf virtualenv-1.8.4.tar.gz
cd virtualenv-1.8.4/
/data/trec-kba/installs/py27/bin/python setup.py install
cd ..
/data/trec-kba/installs/py27/bin/virtualenv --distribute -p /data/trec-kba/installs/py27/bin/python py27env



installation
------------

It is easiest to put this entire repo at a path like

/data/trec-kba/installs/trec-kba-data

which is hardcoded into these three files:

scripts/spinn3r-transform.sh
scripts/spinn3r-transform.submit
configs/spinn3r-transform.yaml


Then, you need these two other directories:

/data/trec-kba/keys --- from the trec-kba-secret-keys.tar.gz that is in the Dropbox
/data/trec-kba/third/lingpipe-4.1.0 --- also in the Dropbox

As a test, run this:

## first go inside the virtualenv
source /data/trec-kba/installs/py27env/bin/activate

## install all the python libraries
make install

## run a simple test
make john-smith-simple


and if that works, then try

make john-smith


To try doing the real pull/push from AWS, you can put the input paths here:
/data/trec-kba/installs/trec-kba-data/spinn3r-transform
zcat spinn3r-transform-input-paths.txt.gz | split -l 150 -a 4
b=0; for a in `ls ?????`; do mv $a input.$b; let b=$b+1; done;
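
The same split can be sketched in Python, which may be handy where the shell tools are unavailable. This is a minimal sketch, not part of kba.pipeline: the function name `split_input_paths` and its parameters are hypothetical, and it only mirrors the `split -l 150` plus rename steps above by writing `input.0`, `input.1`, and so on.

```python
import gzip
import os


def split_input_paths(paths_gz, out_dir, lines_per_file=150):
    """Split a gzipped list of paths into out_dir/input.0, input.1, ...

    Hypothetical helper: returns the number of files written, which is
    also the number of condor jobs needed when each job reads one file.
    """
    # Read in binary mode so the lines are copied through unchanged.
    with gzip.open(paths_gz, "rb") as src:
        lines = src.readlines()

    count = 0
    for start in range(0, len(lines), lines_per_file):
        out_path = os.path.join(out_dir, "input.%d" % count)
        with open(out_path, "wb") as out:
            out.writelines(lines[start:start + lines_per_file])
        count += 1
    return count
```

For example, `split_input_paths("spinn3r-transform-input-paths.txt.gz", ".")` would write the `input.N` files into the current directory. The return value is the job count to use when one job is started per input file.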

and then locally as a test:

cat /data/trec-kba/installs/trec-kba-data/spinn3r-transform/input.0 | python -m kba.pipeline.run configs/spinn3r-transform.yaml

and then, after seeing that work, edit the submit script to have as
many jobs as there are input files:

condor_submit scripts/spinn3r-transform.submit

There is one key problem with this, which we discussed on the phone:
when a job dies, it starts over from the top of its input list. Let's
discuss using the zookeeper "task_queue" stage.
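
To make the restart problem concrete, here is a toy simulation in plain Python. It is only an illustration under simplifying assumptions: the function names are hypothetical, nothing here is kba.pipeline code, and it ignores what happens to a task that is in flight when a job dies. It contrasts a job that re-reads a fixed input list with a job that pops tasks off a shared queue.

```python
from collections import deque


def run_static_list(paths, crash_after=None):
    """Process a fixed list from the top; stop early to simulate a crash."""
    done = []
    for path in paths:
        if crash_after is not None and len(done) == crash_after:
            break  # simulated job death
        done.append(path)
    return done


def run_shared_queue(queue, crash_after=None):
    """Pop tasks off a shared queue; finished tasks do not come back."""
    done = []
    while queue:
        if crash_after is not None and len(done) == crash_after:
            break  # simulated job death
        done.append(queue.popleft())
    return done


paths = ["path-%d" % i for i in range(5)]

# Static list: the restarted job repeats the two paths it already finished.
first = run_static_list(paths, crash_after=2)
second = run_static_list(paths)
repeated_static = len(set(first) & set(second))

# Shared queue: the restarted job picks up where the first one stopped.
queue = deque(paths)
first_q = run_shared_queue(queue, crash_after=2)
second_q = run_shared_queue(queue)
repeated_queue = len(set(first_q) & set(second_q))

print(repeated_static, repeated_queue)  # prints "2 0"
```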


running on task_queue: zookeeper
--------------------------------

To use the zookeeper task queue, you must install zookeeper on a
computer that your cluster can access. Here is an example zookeeper
config:

# The number of milliseconds of each tick
tickTime=10000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
dataDir=/var/zookeeper
# the port at which the clients will connect
clientPort=2181
server.0=localhost:2888:3888
maxClientCnxns=2000


Note the large maxClientCnxns for running with many nodes in condor,
and also note the 10sec tickTime, which is needed to avoid frequent
session timeouts from condor slots that are working hard.
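
The effect of the 10sec tickTime can be worked out with a little arithmetic. The sketch below parses the example config and converts the tick-based limits to seconds. It assumes ZooKeeper's documented defaults, under which a session timeout is kept between 2 and 20 ticks unless minSessionTimeout or maxSessionTimeout is set. The helper names are hypothetical.

```python
EXAMPLE_CFG = """
tickTime=10000
initLimit=10
syncLimit=5
dataDir=/var/zookeeper
clientPort=2181
server.0=localhost:2888:3888
maxClientCnxns=2000
"""


def parse_zoo_cfg(text):
    """Parse key=value lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def timing_summary(settings):
    """Convert tick-based limits to seconds."""
    tick_ms = int(settings["tickTime"])
    return {
        "init_limit_s": tick_ms * int(settings["initLimit"]) / 1000.0,
        "sync_limit_s": tick_ms * int(settings["syncLimit"]) / 1000.0,
        # Assumed defaults: session timeouts are bounded by 2 and 20 ticks.
        "min_session_timeout_s": 2 * tick_ms / 1000.0,
        "max_session_timeout_s": 20 * tick_ms / 1000.0,
    }


summary = timing_summary(parse_zoo_cfg(EXAMPLE_CFG))
print(summary["min_session_timeout_s"], summary["max_session_timeout_s"])
# prints "20.0 200.0"
```

Under these assumptions, a client can negotiate a session timeout of up to 200 seconds with this config, which gives a busy condor slot more time to send its heartbeat.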


To make a job run off the zookeeper task queue, make these changes:

configs/spinn3r-transform.yaml:

- task_queue: stdin
+ #task_queue: stdin
+
+ task_queue: zookeeper
+ zookeeper:
+ namespace: spinn3r-transform
+ zookeeper_address: mitas-2.csail.mit.edu:2181


scripts/spinn3r-transform.submit:

-Input = /data/trec-kba/spinn3r-transform/input.$(PROCESS)
+
+## disable stdin because we are using task_queue: zookeeper
+#Input = /data/trec-kba/spinn3r-transform/input.$(PROCESS)


Important:
Also update the number of jobs at the end of the .submit file.


and then do these steps on the command line:

## see the help text
python -m kba.pipeline.load configs/spinn3r-transform.yaml -h

## load the data
python -m kba.pipeline.load configs/spinn3r-transform.yaml --load spinn3r-transform-input-paths.txt

## check the counts -- might take a bit to run, so background and come back to it
python -m kba.pipeline.load configs/spinn3r-transform.yaml --counts >& counts &

## launch the jobs
condor_submit scripts/spinn3r-transform.submit

## watch the logs for the jobs
tail -f ../spinn3r-transform/{err,out}*


Periodically check the --counts on the queue and see how fast it is
going. Do we need to turn off the lingpipe stage?
