Search for data tables.
When we search for ordinary written documents, we send words into a search engine and get pages of words back.
What if we could search for spreadsheets by sending spreadsheets into a search engine and getting spreadsheets back? The order of the results would be determined by various specialized statistics; just as we use PageRank to find relevant hypertext documents, we can develop other statistics that help us find relevant spreadsheets.
Comma Search indexes only spreadsheets that are stored locally. To index a new spreadsheet, run
, --index [csv file]
Regardless of what path you give for the csv file, Comma Search will expand the path to an absolute path and then use this as the key to meta-index the cached results of the indexing. These caches are all stored in the ~/., directory.
By default, CSV files that have already been indexed will be skipped; to index the same CSV file again, run with the --force or -f option.
, --index --force [csv file]
Once you have indexed a bunch of CSV files, you can search.
, [csv file]
You’ll see a bunch of file paths as results
$ , ‘Math Scores 2009.csv’ /home/tlevine/math-scores-2010-gender.csv /home/tlevine/Math Scores 2009.csv /home/tlevine/Math Scores 2009 Copy (1).csv /home/tlevine/math-scores-2009-ethnicity.csv
When we index a table, we first figure out the unique indices
import special_snowflake indices = special_snowflake.fromcsv(open(filepath))
and save them.
from pickle_warehouse import Warehouse Warehouse(os.path.expathuser('~/.,/indices'))[filepath] = indices
Then we look at all of the values of all of the unique indices and save them.
Warehouse(os.path.expanduser(‘~/.,/values/%d’ % hash(index)))[filepath] = set_of_indexed_values
When we search for a table, it actually gets indexed first. Once it has been indexed, we know the unique keys of the table. We look up the indices,
indices = Warehouse(os.path.expathuser('~/.,/indices'))[path]
then we look up all of the tables that contain this index,
tables = Warehouse(os.path.expanduser('~/.,/values/%d' % hash(index)))
and the values of this tables object are sets of hashes of the different values. I can then count how many items are in the intersection between the set for the table that is used as the query and the every other particular table.
If I want to go crazy, I might do this for combinations of columns that aren’t unique indices, and I’d use collections.Counter objects to represent the distributions of the values.