
Create and maintain mathematical Obsidian.md notes, and gather data from them to train ML models


trouver

Mathematicians constantly need to learn and read about concepts with which they are unfamiliar. Keeping mathematical notes in an Obsidian.md vault can help with this learning process.

Disclaimer

At the time of this writing (01/18/2023), there is only one author/contributor of this library. Nevertheless, the author often refers to himself as “the author”, “the authors”, or “the author/authors” in writing this library. Moreover, the author often uses the “editorial we” in writing this library.

Use this library at your own risk as using this library can write or modify files in your computer and as the documentation of some components of this library may be inaccurate or outdated. By using this library, you agree that the author/authors of this library is/are not responsible for any damages from this library and related components.

This library is still somewhere in-between prototype and alpha. Moreover, the library itself may be unstable and subject to abrupt changes.

The author/authors of this library is/are also not affiliated with Obsidian.md, fast.ai, or Hugging Face.

Install

# TODO Write installation instructions
pip install trouver

You may also have to manually install other libraries which are required by the fast.ai and/or Hugging Face libraries.

How to use

Parse LaTeX documents and split them into parts

Trouver can parse LaTeX documents and split them up into “parts” which are convenient to read in Obsidian.md and to take notes on. For example, the following code splits up this paper and creates a folder in an Obsidian.md vault[^1].

import os
from pathlib import Path
import shutil
import tempfile

from trouver.helper import _test_directory, text_from_file
from trouver.latex.convert import (
    divide_preamble, divide_latex_text, custom_commands,
    setup_reference_from_latex_parts
)
# This context manager creates a temporary folder and copies the contents
# of `test_vault_5` in `nbs/_tests` into it, so that only the contents of
# the temporary folder are modified. The temporary folder is deleted
# once the `with` block ends.
with tempfile.TemporaryDirectory(prefix='temp_dir', dir=os.getcwd()) as temp_dir:
    temp_vault = Path(temp_dir) / 'test_vault_5'
    shutil.copytree(_test_directory() / 'test_vault_5', temp_vault)

    sample_latex_file = _test_directory() / 'latex_examples' / 'kim_park_ga1dcmmc' / 'main.tex'
    sample_latex_text = text_from_file(sample_latex_file)
    preamble, _ = divide_preamble(sample_latex_text)
    parts = divide_latex_text(sample_latex_text)
    cust_comms = custom_commands(preamble)
    vault = temp_vault
    location = Path('') # The path relative to the vault of the directory in which to make the new folder containing the new notes.
    reference_name = 'kim_park_ga1dcmmc'
    author_names = ['Kim', 'Park']
    
    setup_reference_from_latex_parts(
        parts, cust_comms, vault, location,
        reference_name,
        author_names)

    os.startfile(os.getcwd()) # This opens the current working directory (Windows only); find the temporary folder in here.
    input() # There should be an input prompt; make an input here when you are done viewing the results.

The created folder can then be opened in Obsidian.md. There, the text shown in magenta consists of links, each pointing to a file in the Obsidian.md vault.

While Obsidian.md is not strictly necessary to use trouver or to read and write the files created by setup_reference_from_latex_parts (in fact, any traditional file reader/writer can be used for such purposes), reading and writing the files in Obsidian.md can be convenient.

ML model utilities

We have trained a few ML models to detect/predict and provide information about “short” mathematical text. These ML models are available on Hugging Face and as such, they can be downloaded to and used from one’s local machines. Please note that ML models can be large and the locations that the Hugging Face Transformers library downloads such models to can vary from machine to machine.

For each of these models, we may or may not have also written some instructions on how to train similar models given appropriately formatted data[^2].

Note that the data used to train these models contains mathematical text pertaining mostly to fields closely related to number theory and algebraic geometry.

Use an ML model to categorize and label the note types

One of these ML models predicts the type of a piece of mathematical writing. For example, this model may predict that

Let $L/K$ be a field extension. An element $\alpha \in L$ is said to be algebraic over $K$ if there exists some polynomial $f(x) \in K[x]$ such that $f(\alpha) = 0$.

introduces a definition. For the purposes of trouver, an Obsidian.md note containing such text ought to be labeled with the #_meta/definition tag by adding the text _meta/definition to the tags field in the frontmatter YAML metadata of the note:

In this note, there is a _meta/definition in the tags field in the frontmatter YAML metadata of the note

See markdown.obsidian.personal.machine_learning.information_note_types for more details.
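
To make the tagging convention concrete, here is a minimal sketch of what such a note could look like as plain text. The helper `note_with_tags` is hypothetical and is not part of trouver; it only assumes that the frontmatter is a YAML block delimited by `---` lines at the top of the note, with the tags written as an inline list.

```python
def note_with_tags(body: str, tags: list[str]) -> str:
    """Return note text whose YAML frontmatter lists `tags` in the tags field.

    Hypothetical helper for illustration only; trouver has its own
    utilities for reading and writing note metadata.
    """
    frontmatter = '---\ntags: [' + ', '.join(tags) + ']\n---\n'
    return frontmatter + body


body = (r'Let $L/K$ be a field extension. An element $\alpha \in L$ is said '
        r'to be algebraic over $K$ if there exists some polynomial '
        r'$f(x) \in K[x]$ such that $f(\alpha) = 0$.')
note_text = note_with_tags(body, ['_meta/definition'])
print(note_text)
```

Note that the tag appears without the leading `#` inside the frontmatter, as described above.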

This ML model is trained using the fast.ai library with the ULMFiT approach; see how_to.train_ml_model.fastai for the steps taken to train this model. This ML model is also available on Hugging Face under the repository hyunjongkimmath/information_note_type.

import pathlib
from pathlib import WindowsPath
import platform

from huggingface_hub import from_pretrained_fastai
repo_id = 'hyunjongkimmath/information_note_type'

# There is a PosixPath problem when trying to load
# the model on Windows; we get around this problem
# within the `if` statement.
if platform.system() == 'Windows':
    temp = pathlib.PosixPath # See https://stackoverflow.com/questions/57286486/i-cant-load-my-model-because-i-cant-put-a-posixpath
    pathlib.PosixPath = pathlib.WindowsPath
    model = from_pretrained_fastai(repo_id)
    pathlib.PosixPath = temp
else:
    model = from_pretrained_fastai(repo_id)

sample_prediction_1 = model.predict(r'Let $L/K$ be an field extension. An element $\alpha \in L$ is said to be algebraic over $K$ if there exists some polynomial $f(x) \in K[x]$ such that $f(\alpha) = 0$.')
print(sample_prediction_1) 
sample_prediction_2 = model.predict(r'Theorem. Let $q$ be a prime power. Up to isomorphism, there is exactly one field with $q$ elements.')
print(sample_prediction_2)
(['#_meta/definition', '#_meta/notation'], tensor([False, False, False, False, False, False,  True, False, False, False,
         True, False, False, False]), tensor([1.9631e-03, 3.4931e-04, 1.7551e-02, 4.8163e-02, 5.7628e-06, 3.0610e-06,
        9.6544e-01, 2.3179e-03, 2.4539e-03, 1.6170e-02, 5.8807e-01, 4.5185e-03,
        2.5055e-04, 4.6183e-03]))
(['#_meta/concept', '#_meta/proof'], tensor([False, False, False,  True, False, False, False, False, False, False,
        False,  True, False, False]), tensor([3.4701e-03, 6.6588e-05, 7.8861e-02, 9.7205e-01, 8.8357e-06, 6.1183e-06,
        9.5552e-02, 4.0747e-03, 2.7043e-04, 2.7545e-02, 1.3064e-02, 5.6198e-01,
        1.5603e-04, 5.5122e-03]))

At the time of this writing (01/18/2023), the model seems to incorrectly predict

  • in sample_prediction_1 that the text introduces a notation.
  • in sample_prediction_2 that the text contains a proof.
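
The printed predictions consist of the predicted labels, a Boolean mask, and a tensor of per-label probabilities. The sketch below, which uses plain Python lists rather than tensors, shows how the mask for sample_prediction_1 can be recovered from the probabilities. The cutoff of 0.5 is an assumption on our part (it is consistent with the printed output), and the correspondence between positions and label names is not shown here.

```python
# Probabilities copied from the printed output of `sample_prediction_1`.
probabilities = [
    1.9631e-03, 3.4931e-04, 1.7551e-02, 4.8163e-02, 5.7628e-06, 3.0610e-06,
    9.6544e-01, 2.3179e-03, 2.4539e-03, 1.6170e-02, 5.8807e-01, 4.5185e-03,
    2.5055e-04, 4.6183e-03,
]
THRESHOLD = 0.5  # Assumed cutoff; raise it to keep only confident labels.

# Positions whose probability exceeds the cutoff.
predicted_indices = [i for i, p in enumerate(probabilities) if p > THRESHOLD]

# Positions ranked from most to least probable.
ranked = sorted(enumerate(probabilities), key=lambda pair: pair[1], reverse=True)

print(predicted_indices)
print(ranked[:2])
```

The two selected positions are the ones marked True in the printed mask. One of them has a probability of about 0.59, far below the other's 0.97, so a stricter cutoff would discard the weaker label.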

# from trouver.markdown.obsidian.personal.machine_learning.information_note_types import
# TODO: example of loading model and using it.

Use an ML model to find notations introduced in text

Another ML model predicts locations of notations introduced in text. This model is trained as a categorizer - given a piece of mathematical text in LaTeX in which a single LaTeX math mode string (surrounded either by the dollar sign $ or double dollar signs $$) is surrounded by double asterisks **, the model should determine whether or not the LaTeX math mode string contains a newly introduced notation.

For example, suppose that we want to find notations introduced in the following text:

Let $L/K$ be a Galois field extension. Its Galois group $\operatorname{Gal}(L/K)$ is defined as the group of automorphisms of $L$ fixing $K$ pointwise.

Our approach is to consider each LaTeX math mode string in this text (of which there are 4: $L/K$, $\operatorname{Gal}(L/K)$, $L$, and $K$), consider the four alternate versions of this text in which double asterisks ** surround one of these math mode strings, and use the model to predict whether that math mode string contains a newly introduced notation. In particular, we pass through the model the following pieces of text:

Let **$L/K$** be a Galois field extension. Its Galois group $\operatorname{Gal}(L/K)$ is defined as the group of automorphisms of $L$ fixing $K$ pointwise.
Let $L/K$ be a Galois field extension. Its Galois group **$\operatorname{Gal}(L/K)$** is defined as the group of automorphisms of $L$ fixing $K$ pointwise.
Let $L/K$ be a Galois field extension. Its Galois group $\operatorname{Gal}(L/K)$ is defined as the group of automorphisms of **$L$** fixing $K$ pointwise.
Let $L/K$ be a Galois field extension. Its Galois group $\operatorname{Gal}(L/K)$ is defined as the group of automorphisms of $L$ fixing **$K$** pointwise.

Ideally, the model should determine only the second version of text to contain a newly introduced notation.
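
The four versions of the text above can be generated mechanically. The following is a minimal sketch using only the standard library; the function `bolded_variants` and the regular expression are our own illustration (not trouver's implementation) and assume that the text contains no escaped dollar signs.

```python
import re

# Try $$...$$ first, then $...$. This is a simplification which
# ignores escaped dollar signs such as `\$`.
MATH_MODE = re.compile(r'\$\$.+?\$\$|\$.+?\$', re.DOTALL)


def bolded_variants(text: str) -> list[str]:
    """Return one copy of `text` per math mode string, with that string wrapped in `**`."""
    variants = []
    for match in MATH_MODE.finditer(text):
        start, end = match.span()
        variants.append(f'{text[:start]}**{match.group()}**{text[end:]}')
    return variants


text = (r'Let $L/K$ be a Galois field extension. Its Galois group '
        r'$\operatorname{Gal}(L/K)$ is defined as the group of automorphisms '
        r'of $L$ fixing $K$ pointwise.')
variants = bolded_variants(text)
for variant in variants:
    print(variant)
```

Each element of `variants` is then a candidate input for the model.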

See markdown.obsidian.personal.machine_learning.notation_identifcation for more details.

This ML model is also trained using the fast.ai library with the ULMFiT approach, and is available on Hugging Face under the repository hyunjongkimmath/notation_identification.

import pathlib
from pathlib import WindowsPath
import platform

from huggingface_hub import from_pretrained_fastai
repo_id = 'hyunjongkimmath/notation_identification'

# There is a PosixPath problem when trying to load
# the model on Windows; we get around this problem
# within the `if` statement.
if platform.system() == 'Windows':
    temp = pathlib.PosixPath # See https://stackoverflow.com/questions/57286486/i-cant-load-my-model-because-i-cant-put-a-posixpath
    pathlib.PosixPath = pathlib.WindowsPath
    model = from_pretrained_fastai(repo_id)
    pathlib.PosixPath = temp
else:
    model = from_pretrained_fastai(repo_id)
contains_a_notation = model.predict(r'Let $L/K$ be a Galois field extension. Its Galois group **$\operatorname{Gal}(L/K)$** is defined as the group of automorphisms of $L$ fixing $K$ pointwise.')
does_not_contain_a_notation = model.predict(r'Let **$L/K$** be a Galois field extension. Its Galois group $\operatorname{Gal}(L/K)$ is defined as the group of automorphisms of $L$ fixing $K$ pointwise.')
print(contains_a_notation)
print(does_not_contain_a_notation)
('True', tensor(1), tensor([9.0574e-08, 1.0000e+00]))                
('False', tensor(0), tensor([1.0000e+00, 4.8617e-06]))
# TODO: examples of using functions in markdown.obsidian.personal.machine_learning.notation_identifcation.
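
The PosixPath workaround used when loading the models leaves pathlib.PosixPath reassigned if loading raises an exception. One possible way to guard against this is to wrap the workaround in a context manager, as in the sketch below. The name `posix_path_workaround` is hypothetical and is not part of trouver or huggingface_hub.

```python
import pathlib
import platform
from contextlib import contextmanager


@contextmanager
def posix_path_workaround():
    """On Windows, temporarily alias pathlib.PosixPath to pathlib.WindowsPath."""
    if platform.system() != 'Windows':
        yield  # Nothing to work around on other systems.
        return
    original = pathlib.PosixPath
    pathlib.PosixPath = pathlib.WindowsPath
    try:
        yield
    finally:
        pathlib.PosixPath = original  # Restored even if loading fails.


# Usage, assuming `from_pretrained_fastai` and `repo_id` as in the code above:
# with posix_path_workaround():
#     model = from_pretrained_fastai(repo_id)
```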

Use an ML model to summarize notations introduced in text

Now that we have found notations introduced in text and created notation notes for them in our Obsidian.md vault, we now generate summaries for these notations.

The ML model in question is fine-tuned from a T5 model.

This ML model is available on Hugging Face under the repository hyunjongkimmath/notation_summarizations_model.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline
model = AutoModelForSeq2SeqLM.from_pretrained('hyunjongkimmath/notation_summarizations_model')
tokenizer = AutoTokenizer.from_pretrained('hyunjongkimmath/notation_summarizations_model')
summarizer = pipeline('summarization', model=model, tokenizer=tokenizer)

The summarizer pipeline can be used to summarize notations newly introduced in a piece of mathematical text. The text needs to be formatted as follows:

summarize: <mathematical_text_goes_here>

latex_in_original: $<notation_to_summarize>$

summarizer("summarize:Let us now define the upper half plane $\\mathbb{H}$ as the set of all complex numbers of real part greater than $1$.\n\n\nlatex_in_original: $\\mathbb{H}$")
Your max_length is set to 200, but you input_length is only 54. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=27)

[{'summary_text': 'the upper half plane of the complex plane $\\ mathbb{ H} $. It is defined as the set of all complex numbers of real part greater than $1$.'}]

In the above example, the summarizer determines that the notation $\mathbb{H}$ introduced in the text

Let us now define the upper half plane $\mathbb{H}$ as the set of all complex numbers of real part greater than $1$.

denotes 'the upper half plane of the complex plane $\\ mathbb{ H} $. It is defined as the set of all complex numbers of real part greater than $1$.'.
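
Building the input string by hand is error-prone, so a small helper can be convenient. The sketch below reproduces the exact string passed to the summarizer in the example above, including its spacing and its three newline characters. The function `format_summarizer_input` is hypothetical and is not part of trouver.

```python
def format_summarizer_input(text: str, notation: str) -> str:
    """Format `text` and the LaTeX `notation` (without dollar signs) for the summarizer.

    Hypothetical helper; the layout mirrors the example call above.
    """
    return f'summarize:{text}\n\n\nlatex_in_original: ${notation}$'


text = (r'Let us now define the upper half plane $\mathbb{H}$ as the set of '
        r'all complex numbers of real part greater than $1$.')
summarizer_input = format_summarizer_input(text, r'\mathbb{H}')
print(summarizer_input)

# With the `summarizer` pipeline defined as above, one could then call:
# summarizer(summarizer_input)
```

Using raw strings for the mathematical text avoids having to escape the backslashes in LaTeX commands.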

How the examples/tests are structured

Many of the functions and methods in this library are accompanied by examples demonstrating how one might use them.

These examples are usually also tests of the functions/methods; the developer of this library can use nbdev’s nbdev_test command to automatically run these tests[^3][^4]. Moreover, there is a GitHub workflow in the repository for this library (see .github/workflows/test.yaml) which automatically runs these examples/tests on GitHub Actions when changes are committed to the GitHub repository[^5].

These examples may use a combination of the following:

  • Mock patching via Python’s unittest.mock library.
  • The fastcore.test module as assertion statements.
  • Example/test files in the nbs/_tests folder in the repository[^6].
    • The _test_directory() function in the helper module obtains this folder.

    • Many of these examples also use the tempfile.TemporaryDirectory class along with shutil.copytree to create a temporary directory, managed by a Python context manager, with contents copied from the nbs/_tests folder. The temporary directory is automatically deleted once the context manager ends. We do this to run tests/examples which modify files/folders without modifying the files/folders in the nbs/_tests directory themselves.

      • For example, the code
      with tempfile.TemporaryDirectory(prefix='temp_dir', dir=os.getcwd()) as temp_dir:
          temp_vault = Path(temp_dir) / 'test_vault_1'
          shutil.copytree(_test_directory() / 'test_vault_1', temp_vault)
      
          # run the rest of the example here
      
          # Uncomment the below lines of code to view the end-results of the example.
          # os.startfile(os.getcwd())
          # input()  # this line pauses the process until the user makes an input so the deletion of the temporary directory is delayed.
      

      first creates a temporary directory whose name starts with temp_dir in the current working directory and copies into this temporary directory the contents of test_vault_1 in the nbs/_tests folder. Once the example/test has finished running, the temporary directory is removed whether or not the test succeeds.
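
The pattern described in this list can also be tried without the trouver repository at hand. The following self-contained sketch uses a throwaway folder as a stand-in for a vault in nbs/_tests and patches the built-in input function via unittest.mock so that the pause does not block. The names `run_example`, `sample_vault`, and `note.md` are made up for illustration.

```python
import shutil
import tempfile
from pathlib import Path
from unittest import mock


def run_example(source_vault: Path) -> str:
    """Copy `source_vault` into a temporary directory, modify the copy, and return the modified text."""
    with tempfile.TemporaryDirectory(prefix='temp_dir') as temp_dir:
        temp_vault = Path(temp_dir) / source_vault.name
        shutil.copytree(source_vault, temp_vault)
        note = temp_vault / 'note.md'
        note.write_text(note.read_text(encoding='utf-8') + '\nmodified', encoding='utf-8')
        input()  # Pause so that the temporary directory can be inspected.
        return note.read_text(encoding='utf-8')


# A throwaway folder standing in for a vault in nbs/_tests.
with tempfile.TemporaryDirectory() as source_root:
    source_vault = Path(source_root) / 'sample_vault'
    source_vault.mkdir()
    (source_vault / 'note.md').write_text('original', encoding='utf-8')

    # Mock patching `input` lets the example run without user interaction.
    with mock.patch('builtins.input', return_value=''):
        modified_text = run_example(source_vault)

    original_text = (source_vault / 'note.md').read_text(encoding='utf-8')

print(repr(modified_text))
print(repr(original_text))
```

Only the copy is modified; the source folder keeps its original contents.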

Miscellaneous

This repository is still in its preliminary stages and much of the code and documentation may be faulty or not well formatted. The author greatly appreciates reports of these issues or suggestions on edits; please feel free to report them on the Issues section of the GitHub repository for this library. The author of this repository, who is primarily a mathematician (a PhD student at the time of this writing), does not guarantee quick responses or resolutions to such issues, but will do his best to address them.

For developers

This repository is based on the nbdev template. As such, code for the packages as well as the documentation for the repository are written in Jupyter notebooks (the .ipynb files in the nbs folder) and the Python modules are auto-generated via the command-line command nbdev_export (or nbdev_prepare, which among other things runs nbdev_export).

Troubleshooting

  • In the nbs/_tests folder, make sure that the folders that you want to test are not empty; since git does not track empty folders, empty folders will not be pushed to GitHub and the tests in GitHub Actions may yield different results than on a local computer.

Special Thanks

The author of trouver thanks Sun Woo Park for agreeing to allow their coauthored paper, Global $\mathbb{A}^1$-degrees covering maps between modular curves, along with some of Park’s expository writings, to be used in examples in this library.

Release notes

Ver. 0

Ver. 0.0.2

  • I made the mistake of not including much of the contents of index.ipynb in the pypi library release, so that should be fixed.

Ver. 0.0.1

  • Initial release

[^1]: There is a known bug in the numbering of the sections of the paper, cf. Issue #32.

[^2]: Given time, the author of trouver eventually plans on writing instructions on training each of the models.

[^3]: cf. nbdev’s End-To-End Walkthrough to see how to use nbdev_test.

[^4]: There are also tests which are hidden from the documentation website; one can find these tests in the Jupyter notebook files in the nbs folder in the repository for this library as notebook cells marked with the #| hide flag, cf. nbdev’s End-To-End Walkthrough to see what the #| hide flag does.

[^5]: The .github/workflows/test.yaml GitHub workflow file is set up in such a way that allows GitHub Actions to access/use the contents of the nbs/_tests directory upon running the tests/examples.

[^6]: The .github/workflows/test.yaml GitHub workflow file is set up in such a way that allows GitHub Actions to access/use the contents of the nbs/_tests directory upon running the tests/examples.
