A literature database tool with GPT integration.
#+title: litdb - a literature and document database
#+attr_org: :width 600
[[./litdb.png]]
* litdb concept
litdb is a tool to help you curate and use your collection of scientific literature. You use it to collect and search papers. You can use it to collect older articles, and to keep up with newer articles. litdb uses https://openalex.org for searching the scientific literature, and https://turso.tech/libsql to store results in a local database.
The idea is you add papers to your database, and then you can search it with natural language queries, and interact with it via an ollama GPT application. It will show you the papers that best match your query. You can read those papers, get bibtex entries for them, or add new papers based on the references, papers that cite that paper, or related papers. You can also set up filters that you update when you want to get new papers created since the last time you checked.
** videos
1. https://www.youtube.com/live/e-J3Bh2Uti4 Introduction to litdb
2. https://www.youtube.com/live/teW68WogulU local files (volume is very low for some reason)
3. https://youtube.com/live/3LltpiiQaR8 CrossRef, reviewer suggestions, COA
4. https://youtube.com/live/ZkKKuvVUWkE litdb and Emacs
5. https://youtube.com/live/j7rItPwWDaY litdb and Jupyter Lab
6. https://youtube.com/live/SUtvtc7l6y0 litdb + GPT enhancements
7. https://youtube.com/live/3FZ1ROnCC6Y litdb + LiteLLM and streamlit
8. https://www.youtube.com/live/IKKTQSTXQmc litdb + Youtube and audio
9. https://youtube.com/live/MEf9rPI0Z1M litdb + Image search with text and image queries using CLIP
** installation
litdb is on PyPI.
#+BEGIN_SRC sh
pip install litdb
#+END_SRC
To get the latest development version, you can install it directly from GitHub.
#+BEGIN_SRC sh
pip install git+https://github.com/jkitchin/litdb
#+END_SRC
** configuration
You have to create a toml configuration file. This file is called litdb.toml. The directory this file is in is considered the root directory. All commands will start in the current working directory and look up to find this file. You can put this file in your home directory, or you can have sub-directories, e.g. a per project litdb.
There are a few choices you have to make. You have to choose a SentenceTransformer model, and specify the size of the vectors it makes. You also have to specify the chunk_size and chunk_overlap settings that are used to break documents up to compute document level embedding vectors.
You will need an OpenAlex premium key if you want to use the update-filters feature.
#+BEGIN_EXAMPLE
[embedding]
model = 'all-MiniLM-L6-v2'
cross-encoder = 'cross-encoder/ms-marco-MiniLM-L-6-v2'
chunk_size = 1000
chunk_overlap = 200
[openalex]
email = "you@example.com"
api_key = "..."
[gpt]
model = "llama2"
[llm]
model = "ollama/llama2"
#+END_EXAMPLE
You can define an environment variable that points to the root of your default litdb project. This should be a directory with a litdb.toml file in it.
#+BEGIN_SRC sh
export LITDB_ROOT="/path/to/your/default/litdb"
#+END_SRC
When you run a litdb command, it will look for a dominating litdb.toml file, which means you are running the command in a litdb project. If one is not found, it will check for the LITDB_ROOT environment variable and use that if it is found. Finally, if that does not exist, it will prompt you to make a new project in the current directory.
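The lookup order described above can be sketched in Python. This is a minimal illustration under stated assumptions, not litdb's actual code: the function name ~find_litdb_root~ is hypothetical, and the final "prompt to create a project" step is represented by returning ~None~.

```python
import os
from pathlib import Path


def find_litdb_root(start=None, env=None):
    """Return the directory that holds litdb.toml, or None if there is none."""
    env = os.environ if env is None else env
    here = Path(start or Path.cwd()).resolve()

    # 1. Walk up from the starting directory looking for a dominating litdb.toml.
    for directory in [here, *here.parents]:
        if (directory / "litdb.toml").is_file():
            return directory

    # 2. Fall back to LITDB_ROOT if it points at a directory with litdb.toml.
    root = env.get("LITDB_ROOT")
    if root and (Path(root) / "litdb.toml").is_file():
        return Path(root).resolve()

    # 3. Nothing found: the caller would prompt to create a new project here.
    return None
```

Calling ~find_litdb_root()~ from a sub-directory of a project returns the project root, which is why a per-project litdb.toml takes priority over the ~LITDB_ROOT~ default.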
* Using litdb
Your litdb starts out empty. You have to add articles that are relevant to you. The best way to build a litdb is an open question, and the answer surely depends on what your aim is. You have to balance breadth and depth against the size of the database. The CLI makes it pretty easy to add articles.
litdb has a CLI with an entry command of litdb and subcommands (like git) for interacting with it. You can see all the options with this command.
#+BEGIN_SRC sh :dir example
litdb --help
#+END_SRC
** Searching the web
You have to start somewhere. You can use this to open a search in OpenAlex.
#+BEGIN_SRC sh
litdb web query
#+END_SRC
You can also open searches with these options:
| option | source |
|-----------------------+----------------|
| -g, --google | Google |
| -gs, --google-scholar | Google Scholar |
| -ar, --arxiv | Arxiv |
| -pm, --pubmed | Pubmed |
| -cr, --chemrxiv | ChemRxiv |
| -br, --biorxiv | BioRxiv |
| -a, --all | All |
You can find starting points this way.
*** Fine-tuned search in OpenAlex
This is a default query in OpenAlex. It does not change your litdb, it just does a simple text search query on works.
#+BEGIN_SRC sh
litdb openalex query
#+END_SRC
You can get more specific with a filter:
#+BEGIN_SRC sh
litdb openalex -f 'author.orcid:https://orcid.org/0000-0003-2625-9232'
#+END_SRC
You can also search other endpoints and use filters. Here we perform a search on Sources for display_names that contain the word discovery.
#+BEGIN_SRC sh
litdb openalex -e sources -f display_name.search:discovery
#+END_SRC
** One-time additions of articles to litdb
You add an article by its DOI. There are optional arguments to also add references, citing and related articles.
#+BEGIN_SRC sh
litdb add doi --references --citing --related
#+END_SRC
To add an author, use their orcid. You can use ~litdb author-search firstname lastname~ to find an orcid for a person.
#+BEGIN_SRC sh
litdb add orcid
#+END_SRC
To add entries from a bibtex file, use the path to the file.
#+BEGIN_SRC sh
litdb add /path/to/bibtex.bib
#+END_SRC
You can provide more than one source and even mix them like this.
#+BEGIN_SRC sh
litdb add doi1 doi2 orcid
#+END_SRC
These are all one-time additions.
You can also add things like YouTube videos and podcasts. We use ML to transcribe the audio from these into text so they become searchable!
** Adding filters
litdb provides several convenient ways to add queries to update your litdb in the future.
*** Follow an author
To get new papers by an author, you can follow them.
#+BEGIN_SRC sh
litdb follow orcid
#+END_SRC
*** Watch a query
#+BEGIN_SRC sh
litdb watch "filter to query"
#+END_SRC
*** Citations on a paper
#+BEGIN_SRC sh
litdb citing doi
#+END_SRC
*** Related papers
#+BEGIN_SRC sh
litdb related doi
#+END_SRC
*** A custom filter
A filter is used in OpenAlex to search for relevant articles. Here is an example of adding a filter for articles in the journal Digital Discovery. This doesn't add any entries directly, it simply stores the filter in the database. The main difference of this vs the watch command above is the explicit description.
#+BEGIN_SRC sh
litdb add-filter "primary_location.source.id:https://openalex.org/S4210202120" -d "Digital Discovery"
#+END_SRC
*** Managing and updating the filters
You can get a list of your filters like this.
#+BEGIN_SRC sh
litdb list-filters
#+END_SRC
You can update the filters like this.
#+BEGIN_SRC sh
litdb update-filters
#+END_SRC
This adds papers that have been created since the last time you ran the filter. You need an OpenAlex premium API key for this. This will update the last_updated field.
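A minimal sketch of how an update request could be narrowed to new papers, assuming it appends the OpenAlex ~from_created_date~ filter (which is the filter that requires a premium key) to the stored filter, using the stored last_updated date. The function name ~update_url~ and the date format are assumptions; litdb may build the request differently.

```python
from urllib.parse import urlencode


def update_url(filter_string, last_updated, api_key, email=None):
    """Build an OpenAlex works URL for papers created since last_updated.

    last_updated is assumed to be an ISO date string like '2024-12-01'.
    """
    # Combine the stored filter with the created-date restriction.
    full_filter = f"{filter_string},from_created_date:{last_updated}"
    params = {"filter": full_filter, "api_key": api_key}
    if email:
        params["mailto"] = email
    return "https://api.openalex.org/works?" + urlencode(params)


# 'YOUR_KEY' is a placeholder, not a real key.
print(update_url("primary_location.source.id:https://openalex.org/S4210202120",
                 "2024-12-01", "YOUR_KEY"))
```

After a successful update, the filter's last_updated value would be set to the current date so the next run only picks up newer papers.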
You can remove a filter like this:
#+BEGIN_SRC sh
litdb rm-filter "filter-string"
#+END_SRC
** Review your litdb
I find it helpful to review what has been added to my litdb. To get a list of articles added in the last week, you can run this command.
#+BEGIN_SRC sh
litdb review -s "1 week ago"
#+END_SRC
This works best when you update your litdb regularly. You might want to redirect that into a file so you can review it in an editor of your choice.
** Searching litdb
There are several search options.
*** vector search
The main way litdb was designed to be searched is with natural language queries. The way this works is your query is converted to a vector using SentenceTransformers, and then a vector search identifies entries in the database that are similar to your query.
#+BEGIN_SRC sh
litdb vsearch "natural language query"
#+END_SRC
The default number of entries returned is 3. You can change that with an optional argument.
#+BEGIN_SRC sh
litdb vsearch "natural language query" -n 5
#+END_SRC
There is an iterative version of vsearch called isearch. This finds the closest entries, then downloads the citations, references and related entries for each one, and repeats the query until you tell it to stop, or it doesn't find any new results.
#+BEGIN_SRC sh
litdb isearch "some query"
#+END_SRC
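The iterative loop can be sketched as follows. This is only an illustration of the idea: ~search~ and ~expand~ are hypothetical callables standing in for the vector search and for downloading citations, references and related entries, and ~max_rounds~ stands in for the user deciding to stop.

```python
def iterative_search(query, search, expand, max_rounds=5):
    """Repeat a search, growing the database around each new hit."""
    seen = []
    for _ in range(max_rounds):
        # Keep only results we have not seen in earlier rounds.
        new = [s for s in search(query) if s not in seen]
        if not new:
            break  # no new results, so stop
        seen.extend(new)
        for source in new:
            expand(source)  # side effect: adds neighbours to the database
    return seen


# Toy data: similarity of each paper to the query, and its neighbours.
scores = {"a": 0.9, "b": 0.8, "c": 0.95, "d": 0.1}
links = {"a": ["c"], "b": [], "c": ["d"], "d": []}
db = {"a", "b"}


def search(query, n=2):
    return sorted(db, key=scores.get, reverse=True)[:n]


def expand(source):
    db.update(links[source])


found = iterative_search("some query", search, expand)
print(found)
```

In the toy run, paper c is only discovered because it is linked to a, which is the point of iterating: each round can surface closer matches that were not in the database before.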
*** full text search
There is a full text search (full on the text in litdb) available. The command looks like this.
#+BEGIN_SRC sh
litdb fulltext "query"
#+END_SRC
See https://sqlite.org/fts5.html for information on what the query might look like. The search is done with this SQL command:
#+BEGIN_SRC sql
select source, text from fulltext where text match ? order by rank
#+END_SRC
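To experiment with FTS5 query syntax outside litdb, here is a small self-contained sketch using Python's built-in sqlite3 module (assuming your Python's SQLite was compiled with FTS5, which is common). The table mirrors the fulltext table, but the rows are made up and the ~limit~ clause is added here to mimic the -n option.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create virtual table fulltext using fts5(source, text)")
con.executemany(
    "insert into fulltext values (?, ?)",
    [
        ("doi:1", "Machine learning for catalysis on alloy surfaces"),
        ("doi:2", "Density functional theory of oxide surfaces"),
        ("doi:3", "Automated laboratories for soft materials"),
    ],
)


def fulltext(query, n=3):
    """Run an FTS5 query and return up to n (source, text) rows by rank."""
    sql = ("select source, text from fulltext "
           "where text match ? order by rank limit ?")
    return con.execute(sql, (query, n)).fetchall()


# FTS5 supports boolean operators such as AND, OR and NOT.
hits = fulltext("surfaces NOT oxide")
print(hits)
```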
The default number of entries returned is 3. You can change that with an optional argument.
#+BEGIN_SRC sh
litdb fulltext "natural language query" -n 5
#+END_SRC
*** hybrid search
Vector and full text search have complementary strengths and weaknesses. We combine them in the hybrid-search subcommand. This performs two searches on two different queries, and combines them with a unified score that is used to rank all the matches. This ensures you get some results that match the full text search, and some that match the vector search. It is worth trying if you aren't finding what you want by vector or text search alone.
#+BEGIN_SRC sh
litdb hybrid-search "vector query" "text query"
#+END_SRC
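One common way to merge two ranked lists into a single score is reciprocal rank fusion. The sketch below shows that technique as an illustration only; it is an assumption, not necessarily the scoring litdb uses, and the source names are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists; each hit scores 1 / (k + rank) per list."""
    scores = {}
    for ranking in rankings:
        for rank, source in enumerate(ranking, start=1):
            scores[source] = scores.get(source, 0.0) + 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(scores, key=scores.get, reverse=True)


vector_hits = ["doi:a", "doi:b", "doi:c"]  # from the vector query
text_hits = ["doi:c", "doi:a", "doi:d"]    # from the full text query
merged = reciprocal_rank_fusion([vector_hits, text_hits])
print(merged)
```

Entries found by both searches rise to the top, while entries found by only one search are still kept, which matches the goal of getting results from both kinds of search.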
*** ollama GPT
You can use litdb as a RAG source for ollama. This looks up the three most related papers to your query, and uses them as context in a prompt to ollama (with the llama2 model). I find this quite slow (it can be minutes to generate a response on an old Intel Mac). I also find it makes up things like references, and that it is usually necessary to actually read the three papers. The three papers come from the same vector search described above.
#+BEGIN_SRC sh
litdb gpt "what is the state of the art in automated laboratories for soft materials"
#+END_SRC
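The RAG step amounts to pasting the retrieved texts into the prompt ahead of the question. Here is a minimal sketch of that idea; the prompt wording and the function name ~build_rag_prompt~ are assumptions, not litdb's actual prompt.

```python
def build_rag_prompt(question, papers):
    """Combine retrieved paper texts and a question into one prompt."""
    # Number each paper so the model can refer to its sources.
    context = "\n\n".join(
        f"[{i}] {text}" for i, text in enumerate(papers, start=1)
    )
    return (
        "Use only the context below to answer the question. "
        "Cite the numbered sources you rely on.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# The three texts would come from the vector search; these are placeholders.
papers = ["Paper A abstract", "Paper B abstract", "Paper C abstract"]
prompt = build_rag_prompt("What is the state of the art?", papers)
print(prompt)
```

Because the model only sees these three texts, the answer can be no better than the retrieval, which is one reason it is usually still necessary to read the papers.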
*** Integration with litellm
litdb supports litellm so you can use almost any LLM provider you want: OpenAI, Anthropic, Gemini, whatever you have an API key for.
For example, if you use a Gemini API key, the free tier of that API includes 1,500 requests per day with Gemini 1.5 Flash (at the time of writing).
It uses a different command than the ollama gpt command.
#+BEGIN_SRC sh
litdb chat "what is the state of the art in automated laboratories for soft materials"
#+END_SRC
There are some fancy things you can do with the prompt:
1. Avoid using RAG if --norag is in your prompt.
2. If you surround python objects with backticks, it will try expanding that to the documentation from Python.
3. A line that starts with < indicates a shell command to run and the output will be expanded into the prompt.
4. A prompt of !save will save the current chat to a file.
5. You can use this syntax to expand a file or url in the prompt for context:
#+BEGIN_EXAMPLE
[[file/url]]
#+END_EXAMPLE
Your prompt history is saved in your litdb, so you can go back to them if you want.
*** Web app for litdb
If you prefer a browser, you can now launch a streamlit app for litdb:
#+BEGIN_SRC sh
litdb app
#+END_SRC
This should launch the app in your browser and you can search litdb from it. The terminal application is more advanced in terms of prompt expansion.
*** search with audio
This command will record audio, transcribe that audio to text, and then do a vector search on that text. You will be prompted when the recording starts, and you press return to stop it. litdb will show you what it heard, and ask if you want to do a vector search on it.
#+BEGIN_SRC sh
litdb audio -p
#+END_SRC
I haven't found the transcription to be that good on technical scientific terms. This is a proof of concept capability.
Note that you need to install these libraries for this feature to work:
pyaudio, playsound, SpeechRecognition
These are not trivial to install, and pyaudio relies on external libraries like portaudio that may not be easy to install. These are currently commented out in pyproject.toml because of these difficulties.
*** search from a screenshot
You can copy a screenshot to the clipboard, and then use OCR to extract text from it, and do a vector search on that text.
#+BEGIN_SRC sh
litdb screenshot
#+END_SRC
If you can copy and paste text, you should do that instead. This is helpful to get text from images, or pdfs where the text is stored in an image, maybe from videos, or screen share from online meetings, etc.
Eventually, if images get integrated into litdb, this is also an entry point for image searches.
** Tagging entries
litdb supports tagging entries so you can group them. To tag a source with tag1 and tag2, use this syntax.
#+BEGIN_SRC sh
litdb add-tag source -t tag1 -t tag2
#+END_SRC
You can remove tags like this.
#+BEGIN_SRC sh
litdb rm-tag source -t tag1 -t tag2
#+END_SRC
You can delete a tag from the database.
#+BEGIN_SRC sh
litdb delete-tag tag1
#+END_SRC
To see all the tags do this.
#+BEGIN_SRC sh
litdb list-tags
#+END_SRC
To see entries with a tag:
#+BEGIN_SRC sh
litdb show-tag tag1
#+END_SRC
You can use this to export tagged entries into bibtex entries like this.
#+BEGIN_SRC sh
litdb show-tag workflow -f '{{ source }}' | litdb bibtex
#+END_SRC
** Exporting entries
You can use these commands to export bibtex entries or citation strings.
*** Get a bibtex entry
This command will try to generate a bibtex entry for entries in your litdb.
#+BEGIN_SRC sh
litdb bibtex doi1 doi2
#+END_SRC
The output can be redirected to a file.
You can also use a search like this and pipe the output to litdb bibtex.
#+BEGIN_SRC sh
litdb vsearch "machine learning in catalysis" -f "{{ source }}" | litdb bibtex
#+END_SRC
*** Get a citation string
This command will output a citation for the sources. It is mostly a convenience function. There is not currently a way to customize the citation.
#+BEGIN_SRC sh
litdb citation doi1 doi2
#+END_SRC
You can also use a search like this and pipe the output to litdb citation.
#+BEGIN_SRC sh
litdb vsearch "machine learning in catalysis" -f "{{ source }}" | litdb citation
#+END_SRC
** Find free pdfs
You can use litdb to find freely available PDFs via https://unpaywall.org/.
#+BEGIN_SRC sh
litdb unpaywall doi
#+END_SRC
These do not always work, and sometimes you get a version from arxiv or pubmed.
** Low-level interaction with litdb
litdb is just a sqlite database (although you need to use the libsql executable for vector search). There is a CLI way to run a sql command. For example, to find all entries with a null bibtex field and their types use a query like this.
#+BEGIN_SRC sh
litdb sql "select source, json_extract(extra, '$.type'), json_extract(extra, '$.bibtex') as bt from sources where bt is null"
#+END_SRC
You might also use this for very specific queries. For example, here I search the citation strings for my name.
#+BEGIN_SRC sh
litdb sql "select source, json_extract(extra, '$.citation') as citation from sources where citation like '%kitchin%'"
#+END_SRC
* Adding local files
The idea of using local files is that it is likely you have collected information in the form of files on your hard drive, and you want to be able to find information in those files.
It is possible to add any file that can be turned into text to litdb. That includes:
- docx
- pptx
- html
- ipynb
- org / md
- bib
- url
This limits portability because you need a path if you want to be able to open that file.
The same vector, fulltext and gpt search commands are available for local file entries. These tend to be longer documents than the OpenAlex entries, and I am not sure how well search works with document-level embeddings. Search at the chunk level is very precise; odds are you want paragraph-level similarity to your query.
An early version of litdb stored each chunk. This is possible, but I used another table for it. You could munge the source to be something like f.pdf::chunk-1 so each one is unique, but that seems more complicated and you would need to do some experiments to see if it is warranted.
You can combine this with the OpenAlex entries in a single database.
You can walk a directory and add files from it with this command.
#+BEGIN_SRC sh
litdb index dir1
#+END_SRC
This directory is saved and you can update all the previously indexed directories like this.
#+BEGIN_SRC sh
litdb reindex
#+END_SRC
Some annoying things that may happen are duplicate content, e.g. because you have the same file in multiple formats like docx and pdf, or because you have literal copies of files in multiple places.
You should also be careful sharing a litdb that has indexed local files. It may have sensitive information that you don't want others to be able to find.
* Emacs integration
Of course there is some Emacs integration. I made a new link for litdb.
[[litdb:https://doi.org/10.1021/jp047349j]]
The links export as \cite{source}, and there is a function ~litdb-generate-bibtex~ to export bibtex entries for all links in the buffer. These entries are not guaranteed to be valid, most likely because of the keys (some DOIs are probably invalid as bibtex keys).
You can easily insert a link like this:
M-x litdb
See [[./litdb.el]] for details. This is not a package on MELPA yet. You should just load the .el file in your config. You can also use ~litdb-fulltext~, ~litdb-vsearch~, and ~litdb-gpt~ from Emacs to interact with your litdb.
litdb.el is under active development, and will be an alternative UI to the terminal eventually. It is too early to tell if it will replace org-ref. It has potential, but that would be a very large undertaking.
* Database design
litdb uses a sqlite database with libsql. libsql is a sqlite fork with additional capabilities, most notably integrated vector search.
The main table in litdb is called sources.
- sources
- source (url to source location)
- text (the text for the source)
- extra (json data)
- embedding (float32 blob in bytes)
- date_added string
This table has an embedding_idx index for vector search.
There is also a virtual table fulltext for fulltext search.
- fulltext
- source
- text
And a table called queries.
- queries
- filter
- description
- last_updated
This database is automatically created when you use litdb.
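To make the layout concrete, here is a rough sketch of the sources and queries tables in plain SQLite from Python. It is an approximation under stated assumptions: the column types are guesses, the embedding is packed as little-endian float32 bytes, and the fulltext virtual table and the libsql vector index are left out.

```python
import json
import sqlite3
import struct


def to_blob(vector):
    """Pack a list of floats as little-endian float32 bytes."""
    return struct.pack(f"<{len(vector)}f", *vector)


def from_blob(blob):
    """Unpack float32 bytes back into a list of floats (4 bytes each)."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


con = sqlite3.connect(":memory:")
con.executescript("""
create table sources(source text, text text, extra text,
                     embedding blob, date_added text);
create table queries(filter text, description text, last_updated text);
""")

con.execute(
    "insert into sources values (?, ?, ?, ?, ?)",
    ("https://doi.org/10.1021/jp047349j",
     "citation line\nabstract text",        # placeholder text
     json.dumps({"type": "article"}),       # extra is stored as JSON
     to_blob([0.25, -0.5, 1.0]),            # toy 3-dimensional embedding
     "2024-11-29 17:21:51"),
)

source, blob = con.execute("select source, embedding from sources").fetchone()
vector = from_blob(blob)
print(source, len(blob), vector)
```

A real embedding from all-MiniLM-L6-v2 has 384 dimensions, so its blob would be 1536 bytes per entry.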
* Limitations
The text that is stored for each entry comes from OpenAlex and is typically limited to the title and abstract. In the text for each entry, the first line is typically a citation including the title, and the rest is the abstract if there is one. I feel like I see more and more entries with no abstract. This will certainly limit the quality of search, and could bias results towards entries with more text in them.
The quality of the vector search depends on several things. First, litdb stores a document level embedding vector that is computed by averaging the embedding vectors of overlapping chunks. We use Sentence Transformers to compute these. There are many choices to make on the model, and these have not been tested exhaustively. So far 'all-MiniLM-L6-v2' works well enough. There are other models you could consider like getting embeddings from ollama, but at the moment litdb can only use SentenceTransformers.
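The chunk-and-average step can be sketched in pure Python. This is a simplified illustration: it assumes character-based chunks (litdb's splitter may work differently), and ~embed~ is a stand-in for the real model, where you would use something like ~model.encode(chunks).mean(axis=0)~ with SentenceTransformers.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into overlapping character chunks."""
    step = chunk_size - chunk_overlap
    if step <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Stop once the remaining text is already covered by the last chunk.
    stop = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]


def document_embedding(text, embed, **chunk_options):
    """Average the chunk embeddings into one document-level vector."""
    vectors = [embed(chunk) for chunk in chunk_text(text, **chunk_options)]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def toy_embed(chunk):
    """Stand-in embedding: (length, number of 'a' characters)."""
    return [float(len(chunk)), float(chunk.count("a"))]


chunks = chunk_text("a" * 25, chunk_size=10, chunk_overlap=2)
doc_vec = document_embedding("a" * 25, toy_embed, chunk_size=10, chunk_overlap=2)
print([len(c) for c in chunks], doc_vec)
```

Averaging gives every chunk equal weight, so a long document with one relevant paragraph ends up with an embedding dominated by the rest of its text.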
I guess that document level embeddings are less effective on longer documents. The title+abstract from OpenAlex is pretty short, and so far there isn't evidence this is a problem.
Second, we rely on defaults in libsql for the vector search, notably finding the top k nearest vectors based on cosine similarity. There are other distance metrics you could use like L2, but we have not considered these.
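For reference, a top-k cosine similarity search can be written by brute force as below. This is only meant to show what the metric computes; libsql uses its own vector index rather than a full scan, and the entries here are toy two-dimensional vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, entries, k=3):
    """Return the k sources whose embeddings are most similar to the query."""
    ranked = sorted(
        entries.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [source for source, _ in ranked[:k]]


entries = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_k([1.0, 0.1], entries, k=2))
```

Cosine similarity ignores vector length and only compares direction, whereas a metric like L2 distance is also sensitive to magnitude.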
The query is based on vector similarity between your query and the texts. So, you should write the query so it looks like what you want to find, rather than as a question. It is less clear how you should structure your query if you are using the GPT capability. It is more natural to ask a question, or give instructions. The RAG is still done by similarity though.
Finally, the search can only find things that are in your database. If you haven't added it there, you won't find it. That definitely means you will miss some papers. I try to use a mesh of approaches to cover the most likely papers. This includes:
1. Follow authors
2. Add references, related, and citing papers for the most relevant papers.
3. Use text search filters
4. Add papers I find from X, bluesky, LinkedIn, etc. (and their references, related, etc)
5. If I read a paper in litdb that is good, add its references, related, etc.
It is an iterative process, and you have to make a judgment call about when to stop it. You can always come back later. There might even be newer papers to find.
** Local file limitations
Similar limitations exist for local files. There are additionally the following known limitations:
1. The quality of document to text influences the ultimate embedding. This varies by type of document, and the library used to convert it.
2. Local files tend to be longer documents and this can lead to hundreds of text chunks per document. These chunk embeddings are averaged into one embedding. It is not obvious this is as effective as vector search on each chunk, but it is more memory efficient.
For PDF to text we use [[https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/][pymupdf4llm]] which works for this proof of concept. There is a Pro version of that package which supports more file formats. It is not obvious what it would cost to use that. I used [[https://ds4sd.github.io/docling/][docling]] in an early prototype. It also worked pretty well, but it was a little slower I think, and would occasionally segfault so I stopped using it. Spacy is integrating PDF to structured data using docling (https://explosion.ai/blog/pdfs-nlp-structured-data). There is plenty of room for improvement in this dimension, with trade-offs in performance and accuracy.
There is a new package from Microsoft to convert Office files to Markdown (https://github.com/microsoft/markitdown) that they specifically mention using in the context of LLMs.
The embedding model we use is trained on text. It is probably not as good at finding code, and the gpt we use is also probably not good at generating code. I guess you would need another table in the database for code, and a different model for embedding and generation. This only matters if you index jupyter notebooks (and later if other code files are supported).
** sqlite + sqlite-vec vs libsql
Vector search is the core requirement for litdb. There are many ways to achieve this. I only considered local solutions so the options are:
- sqlite + vectorlite (https://github.com/1yefuwang1/vectorlite)
- sqlite + sqlite-vec (https://github.com/asg017/sqlite-vec)
- libsql https://github.com/tursodatabase/libsql
vectorlite aims to be faster than sqlite-vec, but it relies on HNSW for vector search, and I was uncomfortable figuring out how to set the size of the db for this application.
sqlite-vec is nice, and early versions of litdb used it and its precursor. This approach requires a virtual table for the embeddings. This is installed as an extension, and is still considered in early stages of adoption.
libsql is a fork of sqlite with integrated vector search, and potential for using it as a cloud database. It is supported by a company, with freemium cloud services. In libsql you store the vectors in a regular table, and search on an embedding index. The code is on GitHub, and can also be used locally.
* Roadmap
These are ideas for future expansion.
** PDFs and notes
I am not sure what the best way to do this is. The records in litdb are stored by the source, often a url, or path. The PDFs would be stored outside the database, and we would need some way to link them. The keys aren't suitable for naming, but maybe a hash of the keys would be suitable. This would add a fuller opportunity to search larger, local documents too. In org-ref, I only had one pdf per entry. I guess here I would have a new table, so you could have multiple documents linked to an entry, although it won't be easy to tell what they are from the hash-based filenames.
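As a sketch of the hash-based naming idea, a source URL or path could be turned into a safe filename like this. The function name, the choice of SHA-256, the 16 character prefix, and the index suffix for multiple documents per entry are all assumptions for illustration.

```python
import hashlib


def pdf_filename(source, index=0):
    """Derive a filesystem-safe PDF name from a source key."""
    # The hash contains only hex characters, unlike DOIs or URLs.
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()[:16]
    return f"{digest}-{index}.pdf"


print(pdf_filename("https://doi.org/10.1021/jp047349j"))
print(pdf_filename("https://doi.org/10.1021/jp047349j", index=1))
```

The same source always maps to the same name, so the link can be recomputed instead of stored, but as noted the names say nothing about the content.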
Notes on the other hand, might be small enough that they could be stored in the database. Then they would be easily searchable. They could also be stored externally to make them easy to edit. I haven't found the notes feature in org-ref that helpful, and usually I take notes in various places. What I should do is add a search to find the litdb links in your org-files. This is already a feature of org-db.
** Jupyter lab integration
An alternative to the CLI and Emacs would be to run this in Jupyter Lab with magic commands and rich output.
** graph visualization
It might be helpful to have a graph representation of a paper that shows nodes of citing, references, and related papers, with edge length related to a similarity score, and node size related to number of citations.
ResearchRabbit and Litmaps do this pretty well.
** ollama and agents
There might be a way to get better results using agents and / or tools. For example, you might have a tool that can lookup new articles on OpenAlex, or augment with google search somehow. Or there might be some iterative prompt building tool that refines the search for related articles based on output results.
Here are some references for when I get back to this.
- https://github.com/ollama/ollama-python
- https://github.com/MikeyBeez/Ollama_Agents
- https://github.com/premthomas/Ollama-and-Agents
- https://medium.com/@lifanov.a.v/integrating-langgraph-with-ollama-for-advanced-llm-applications-d6c10262dafa
- https://medium.com/@abhilasha.sinha/building-a-crew-of-agents-with-open-source-llm-using-ollama-to-analyze-fund-documents-as-multi-page-756d8fd9fbf0
- https://blog.paperspace.com/building-local-ai-agents-a-guide-to-langgraph-ai-agents-and-ollama/
I don't use llamaindex (maybe I should see what it does), but it has this section on agents https://docs.llamaindex.ai/en/stable/understanding/agent/
** web app / fast-api
It might be nice to have a flask app with an API. This would facilitate interaction with Emacs.
** async operations
Almost everything is done synchronously and it blocks the program. At least some things could be done asynchronously I think, and that might speed things up (especially for local files), or at least let you do other things while it happens.
The only thing to be careful about is not exceeding rate limits to OpenAlex. This is handled in the synchronous code.
** application specific encoders
I use a generic embedding model, and there are others that are better suited for specific tasks. For example:
- MatBERT [[cite:&trewartha-2022-quant-advan]]
- Scibert [[cite:&beltagy-2019-sciber]]
- Matscibert [[cite:&gupta-2022-matsc]]
- Specter [[cite:&cohan-2020-spect]] https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#scientific-similarity-models
- PaECTER [[cite:&ghosh-2024-paect]] for patents
These might have a variety of uses with litdb that range from extracting data, named entity recognition, specific searches on materials, etc.
It is not essential to use SentenceTransformers for embedding, they are just easy to use. An alternative is something like ollama embeddings (https://ollama.com/blog/embedding-models) or llama.cpp https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#embeddings. The main reason to use one of these would be performance, and maybe better integration with a chat llm.
It is not that easy to just switch models; you would need to either add new columns and compute embeddings for everything, or update all the embeddings for a new model. The SPECTER embedding is much bigger than the all-MiniLM-L6-v2 embedding.
#+BEGIN_SRC jupyter-python :restart
from sentence_transformers import SentenceTransformer
m = SentenceTransformer('allenai-specter')
print(m.encode(['test']).shape)
#+END_SRC
#+RESULTS:
: (1, 768)
** merge databases
I have set up litdb to be project based. There may come a time when it is desirable to merge some set of databases. It might not be necessary; I think you can attach databases in sqlite (https://www.sqlitetutorial.net/sqlite-attach-database/) to achieve basically the same effect. litdb doesn't store version info at the moment, so it could be tricky to ensure compatibility.
Still it might be interesting to sync two databases, e.g. https://www.sqlite.org/rsync.html. I don't know if this works with libsql, but it might allow there to be a central db that users pull from.
** remote db
The first version of litdb with libsql used a fully remote db on their cloud. The main benefit of that is you can update the db from another machine, keeping your working machine load low. A secondary benefit would be using the db from different machines more easily. Right now I use Dropbox to sync it; that mostly works but I get some conflict files here and there if I change it on one machine while it is open on another machine. It is a little more complex to set up though, and I got several api errors on long running scripts, and with network issues, so I switched to this local setup. I think you could specify this in the litdb.toml file and have it do the right thing on a project basis.
** image and text models
One day it might be possible to include images in this (https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#image-text-models). At the moment, OpenAlex entries do not have any images, but other web resources and local files could. I have an image database in org-db, but I don't use it a lot.
** DONE combine full text and vector search
CLOSED: [2024-12-09 Mon 13:59]
Vector search might miss some things. Full text search is hard to do with meaning. There are several ways to do a hybrid search, e.g. do a full text search on keywords, and a vector search, and use some kind of union on those results.
https://www.meilisearch.com/blog/full-text-search-vs-vector-search
** DONE tag system
CLOSED: [2024-12-09 Mon 13:58]
It could be useful to have a tag system where you could label entries, or they could be auto-tagged when updating filters. This would allow you to tag entries by a project, or select entries for some kind of bulk action like update, export to bibtex, or delete.
You might also build a scoring system, e.g. for like/dislike tags.
#+BEGIN_SRC sh
litdb tag doi -t "tag1" "tag2" # add tag
litdb tag doi -r "tag" "tag2" # rm tags
#+END_SRC
** DONE Integrate with audio input
CLOSED: [2024-12-05 Thu 09:11]
This would use your microphone to record and transcribe a query for search.
** DONE Integrate with screenshot + OCR
CLOSED: [2024-12-05 Thu 09:11]
Do the search from the results. I did this with tesseract (https://pypi.org/project/pytesseract/)
#+BEGIN_SRC jupyter-python
import pyautogui
# Prompt the user to move the mouse to the first corner and press Enter
input("Move the mouse to the first corner and press Enter...")
x1, y1 = pyautogui.position()
# Prompt the user to move the mouse to the opposite corner and press Enter
input("Move the mouse to the opposite corner and press Enter...")
x2, y2 = pyautogui.position()
# Calculate the region
left = min(x1, x2)
top = min(y1, y2)
width = abs(x2 - x1)
height = abs(y2 - y1)
region = (left, top, width, height)
print(f"Selected region: {region}")
#+END_SRC
#+RESULTS:
: Selected region: (26, 332, 473, 69)
#+BEGIN_SRC jupyter-python
import pyscreeze
im = pyscreeze.screenshot(region=(left, top, width, height))
im.save('screenshot.png')
#+END_SRC
#+RESULTS:
see mss also.
#+BEGIN_SRC jupyter-python
from PIL import Image
import pytesseract
# Open an image file
img = Image.open('screenshot.png')
# Use Tesseract to extract text
text = pytesseract.image_to_string(img)
# Print the extracted text
print(text)
#+END_SRC
#+RESULTS:
: ++RESULTS:
: ; Selected region: (26, 332, 473, 69)
:
This might be nice later when we have image embeddings.
** DONE review process
#+BEGIN_SRC sh
litdb review --since '1 week ago'
#+END_SRC
You need to have a way to review what comes in to litdb; it is part of learning about what is current. I currently do this with Emacs and scimax-org-feed. You could integrate review with update-filters, or by entries added in the past few days, or some other kind of query. Then you just need to add some format information to get what you want, e.g. org, maybe html?
#+BEGIN_SRC sqlite :db example/litdb.libsql
select source, date_added from sources where date(date_added) > '2024-11-28' limit 5
#+END_SRC
#+RESULTS:
| https://doi.org/10.1021/jp047349j | 2024-11-29 17:21:51 |
| https://doi.org/10.1149/1.1856988 | 2024-11-29 17:21:52 |
| https://doi.org/10.1002/cctc.201000397 | 2024-11-29 17:21:53 |
| https://doi.org/10.1088/1361-648x/aa680e | 2024-11-29 17:21:53 |
| https://doi.org/10.1103/physrevlett.93.156801 | 2024-11-29 17:21:54 |
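As a rough sketch of the org-format idea, the query above could be wrapped in a small script that prints recent entries as an org list of links. This assumes only the =sources= table with the =source= and =date_added= columns used in the query. The in-memory stand-in database, its second sample row, and the =recent_sources_as_org= helper are hypothetical and are not part of litdb.

#+BEGIN_SRC python
import sqlite3


def recent_sources_as_org(con, since):
    """Return an org list of sources added after `since` (YYYY-MM-DD)."""
    rows = con.execute(
        "select source, date_added from sources "
        "where date(date_added) > ? order by date_added",
        (since,),
    ).fetchall()
    # One org list item per source, with the source URL as an org link
    return "\n".join(f"- [[{source}]] (added {added})" for source, added in rows)


# Stand-in database for illustration only; a real litdb database has more columns.
con = sqlite3.connect(":memory:")
con.execute("create table sources (source text, date_added text)")
con.executemany(
    "insert into sources values (?, ?)",
    [
        ("https://doi.org/10.1021/jp047349j", "2024-11-29 17:21:51"),
        # Hypothetical older entry that the date filter should exclude
        ("https://example.org/older-entry", "2024-11-20 08:00:00"),
    ],
)

org_text = recent_sources_as_org(con, "2024-11-28")
print(org_text)
#+END_SRC

Passing the date as a query parameter instead of formatting it into the SQL string keeps the helper reusable for any review window.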
** DONE semantic similarity
CLOSED: [2024-12-04 Wed 13:12]
litdb uses cosine similarity as the distance metric for the nearest neighbors. It might be useful to re-rank these with cross-encoding.
https://www.sbert.net/examples/applications/cross-encoder/README.html
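The retrieve-then-re-rank idea can be sketched in plain Python. This is not litdb's implementation: the two-dimensional embeddings are made up, and =word_overlap= is a toy stand-in for a real pair scorer. In practice, =score_pair= could wrap a cross-encoder model, for example the =predict= method of a sentence-transformers =CrossEncoder=, which scores (query, document) pairs.

#+BEGIN_SRC python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def retrieve_and_rerank(query, query_vec, docs, doc_vecs, score_pair, k=2):
    """Take the k nearest docs by cosine similarity, then re-order them by score_pair."""
    nearest = sorted(
        range(len(docs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )[:k]
    return sorted(
        (docs[i] for i in nearest),
        key=lambda doc: score_pair(query, doc),
        reverse=True,
    )


def word_overlap(query, doc):
    """Toy stand-in for a cross-encoder: fraction of query words found in doc."""
    query_words = set(query.lower().split())
    return len(query_words & set(doc.lower().split())) / len(query_words)


docs = [
    "oxygen evolution on oxide catalysts",
    "catalysts for oxygen reduction",
    "neural network potentials",
]
doc_vecs = [[0.9, 0.1], [0.8, 0.3], [0.0, 1.0]]  # made-up embeddings
query = "oxygen reduction catalysts"
query_vec = [1.0, 0.2]  # made-up query embedding

reranked = retrieve_and_rerank(query, query_vec, docs, doc_vecs, word_overlap, k=2)
print(reranked)
#+END_SRC

Here the first document is nearest by cosine similarity, but the pair scorer moves the second document to the top. The cheap vector search narrows the candidates, and the more expensive scorer only runs on the top k.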
* Related projects
- LitSuggest :: https://www.ncbi.nlm.nih.gov/research/litsuggest/
- Browser tool that suggests literature for you based on positive and negative PMIDs. Hosted by NIH.
- paper-qa :: https://github.com/Future-House/paper-qa
- This project by Andrew White uses LLM+RAG to explore a paper.
- ColBERT :: https://github.com/stanford-futuredata/ColBERT
- ColBERT is a fast retrieval model for large text collections. In theory it could probably be integrated with litdb, but litdb is simple and has worked well enough so far without it.
Many of the following projects require you to create an account. Each one has a freemium tier.
- ResearchRabbit :: https://www.researchrabbit.ai/
- This is a browser tool to navigate the scientific literature graphically. You can make collections, and papers that are related by citations are shown in a graph
- LitMaps :: https://www.litmaps.com/
- Another browser tool to graphically interact with scientific literature
- Keenious :: https://keenious.com/explore
- Browser / Google Docs and Word plugin. Finds related articles to the text in your document. I like Keenious when in Google Docs.
- scite.ai :: https://scite.ai/
- Browser tool that integrates GPT with the scientific literature, integration with Zotero
- Scopus AI :: https://www.scopus.com/search/form.uri?display=basic#scopus-ai
- Sponsored by Elsevier
- Dimensions AI :: https://app.dimensions.ai/discover/publication
- Seems similar to Scopus AI
- khoj :: https://khoj.dev/
- This is a desktop app that can be totally local, or in the cloud. It can index your files, and then you can chat with them. There is a freemium level.
- AnythingLLM :: https://anythingllm.com/
- Another tool that runs LLMs locally, and says it can index your files so you can chat with them.
- gpt4all :: https://www.nomic.ai/gpt4all
- Another tool that runs LLMs locally, and says it can index your files so you can chat with them.
With all these options, why does litdb exist? There are a lot of answers to that. First, I wanted to make it. I learned a lot about vector search by doing it. Second, I wanted a free, extensible solution for literature search that could also work for my local files while never putting data in the cloud, and that would work in Emacs. The projects above are very nice, easy to use, no or low-code solutions, and if that is what you are looking for, look there! If you want to hack on things yourself, look here.