Search for data tables.

## Project description

When we search for ordinary written documents, we send words into a search engine and get pages of words back.

What if we could search for spreadsheets by sending spreadsheets into a search engine and getting spreadsheets back? The order of the results would be determined by various specialized statistics; just as we use PageRank to find relevant hypertext documents, we can develop other statistics that help us find relevant spreadsheets.

## Indexing

Comma Search indexes only spreadsheets that are stored locally. To index a new spreadsheet, run

, --index [csv file]

Regardless of what path you give for the csv file, Comma Search will expand the path to an absolute path and then use this as the key to meta-index the cached results of the indexing. These caches are all stored in the ~/., directory.

By default, CSV files that have already been indexed will be skipped; to index the same CSV file again, run with the --force or -f option.

, --index --force [csv file]

Once you have indexed a bunch of CSV files, you can search.

, [csv file]

You’ll see a bunch of file paths as results

\$ , ‘Math Scores 2009.csv’ /home/tlevine/math-scores-2010-gender.csv /home/tlevine/Math Scores 2009.csv /home/tlevine/Math Scores 2009 Copy (1).csv /home/tlevine/math-scores-2009-ethnicity.csv

## Implementation details

When we index a table, we first figure out the unique indices

import special_snowflake
indices = special_snowflake.fromcsv(open(filepath))

and save them.

from pickle_warehouse import Warehouse
Warehouse(os.path.expathuser('~/.,/indices'))[filepath] = indices

Then we look at all of the values of all of the unique indices and save them.

Warehouse(os.path.expanduser(‘~/.,/values/%d’ % hash(index)))[filepath] = set_of_indexed_values

When we search for a table, it actually gets indexed first. Once it has been indexed, we know the unique keys of the table. We look up the indices,

indices = Warehouse(os.path.expathuser('~/.,/indices'))[path]

then we look up all of the tables that contain this index,

tables = Warehouse(os.path.expanduser('~/.,/values/%d' % hash(index)))

and the values of this tables object are sets of hashes of the different values. I can then count how many items are in the intersection between the set for the table that is used as the query and the every other particular table.

If I want to go crazy, I might do this for combinations of columns that aren’t unique indices, and I’d use collections.Counter objects to represent the distributions of the values.

## Project details

Uploaded source