Package for disambiguation of identical terms in critical editions in LaTeX with reledmac.

# Samewords: Disambiguate words in critical editions

In critical textual editions, notes in the critical apparatus normally refer to
the line where the words occur. This leads to ambiguous references when a
critical apparatus note refers to a word that occurs more than once in a line.
For example:

```
We have a passage of regular text here, such a nice place for a critical note.

----
1 a] om. M
```

It is unclear which of the three instances of "a" the note refers to.
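
To make the problem concrete, here is a minimal Python sketch. It is not part of *Samewords*; the function name `count_word` is made up for illustration. It counts how often a lemma occurs as a whole word in a line, which is the situation in which an apparatus note becomes ambiguous:

```python
import re

def count_word(line, word):
    """Count whole-word occurrences of `word` in `line`."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, line))

line = "We have a passage of regular text here, such a nice place for a critical note."
print(count_word(line, "a"))        # 3: the note "a] om. M" is ambiguous
print(count_word(line, "passage"))  # 1: a note on "passage" is unambiguous
```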

[Reledmac](https://www.ctan.org/pkg/reledmac) is a great LaTeX package that
facilitates typesetting critical editions of prime quality. It already provides
facilities for disambiguating identical words, but it requires the creator of
the critical text to mark all potential instances of ambiguous references
manually (see the *reledmac* handbook for the details).
*Samewords* automates this step for the editor.

# Installation

*Samewords* requires Python 3.6 installed on your system. If you are on a Mac
and you use [Homebrew](https://brew.sh/), you can run `brew install python3`.
If you do not use Homebrew (or run a Windows machine), download the
[latest official Python distribution](https://www.python.org/downloads/) and
follow the instructions.

## Easy installation

```bash
pip3 install samewords
```

That's it!

## Optional: Virtual environment

Before installation you may want to create a virtual environment
([see more here](http://docs.python-guide.org/en/latest/dev/virtualenvs/)) for
the installation, if you don't want to install the script globally. This is also
particularly useful if you want to hack on the script.

To create a virtual environment for the project (this command requires
`virtualenvwrapper`), run:
```bash
$ mkvirtualenv -p python3 <name>
```

Where `<name>` is the name you want to give the venv.

After activating the virtual environment (`workon` or `source`, see the guide
linked above or search the interwebs), install the package.

## For development

Download the repository:
```
git clone https://github.com/stenskjaer/samewords.git
```

From the downloaded directory, run:
```bash
$ pip install -e .
```

Now you should be able to run the script (while the virtual environment is
activated, if you used that) by running `samewords`.

To see if it works, run:

```bash
samewords --help
```
You should get an overview of the available commands.

When you are done and want to reset your system to the state before testing,
deactivate the virtual environment. If you never want to use the script again,
remove the directory of the environment (possibly with `rmvirtualenv` if you
have installed `virtualenvwrapper`) and remove the directory created by the
`git clone` command.

### Remember the tests

Before you start making any changes, run the test suite and make sure everything
passes. From the root directory of the package, run:

```bash
pytest
```

If you make changes, don't forget to implement tests and make sure everything
passes. Otherwise, things will break.

## Usage ##

Simple: Call the script with the file you want annotated as the only argument to
get the annotated version back in the terminal.

```bash
samewords my-awesome-edition.tex
```

will send the annotated version to `stdout`. To see that it actually contains
some `\sameword{}` macros, you can try running it through `grep`:

```bash
samewords my-awesome-edition.tex | grep sameword
```

You can define an output location with the `--output` option:
```bash
samewords --output ~/Desktop/test/output my-awesome-edition.tex
```
will check whether `~/Desktop/test/output` is a directory or a file. If it is a
directory, it will put the file inside that directory (with the original name).
If it is a file, it will ask you whether you want to overwrite it. If it is
neither a directory nor a file, it will create the file `output` and write the
content to that.
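
The three cases can be sketched in Python as follows. This is an illustration of the behaviour described above, not the script's actual code; the helper name `resolve_output` and its `confirm` parameter are assumptions made for the example.

```python
from pathlib import Path

def resolve_output(output, input_file, confirm=input):
    """Return the path to write to, or None if the user declines to overwrite.

    Hypothetical helper mirroring the three cases described above.
    """
    target = Path(output).expanduser()
    if target.is_dir():
        # Directory: keep the original file name inside it.
        return target / Path(input_file).name
    if target.is_file():
        # Existing file: ask before overwriting.
        answer = confirm(f"{target} exists. Overwrite? [y/N] ")
        return target if answer.strip().lower() == "y" else None
    # Neither a directory nor a file: the path itself becomes the new file.
    return target
```

Passing the prompt function as a parameter keeps the sketch easy to try out without typing answers by hand.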

Alternatively, regular redirection works just as well in a Unix context:
```bash
samewords my-beautiful-edition.tex > ~/Desktop/test/output.tex
```

### Configuration file

You can configure a small range of settings relevant for the processing. This is
done in a JSON-formatted file. You give the location of the config file to the
argument `--config-file`. The script will automatically look for a config file
with the name `~/.samewords.json`, so if you put it there, you won't have to
specify the command line argument every time you call the script. That can be
very handy if you often need to use the same configuration.
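
The lookup order could look roughly like the following Python sketch. It only illustrates the idea of "explicit file first, then `~/.samewords.json`"; the function name `load_settings` is hypothetical, and returning an empty dictionary when no file is found is an assumption of the sketch, not documented behaviour.

```python
import json
from pathlib import Path

# Default location mentioned above.
DEFAULT_CONFIG = Path.home() / ".samewords.json"

def load_settings(config_file=None, default=DEFAULT_CONFIG):
    """Load settings from an explicit file, else from the default location."""
    path = Path(config_file).expanduser() if config_file else default
    if not path.is_file():
        # Assumption of this sketch: no file means "use built-in defaults".
        return {}
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)
```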

The configuration file recognizes the following parameters:
- `exclude_macros`
- `ellipsis_patterns`
- `sensitive_context_match`

JSON requires backslashes to be escaped if you want to preserve them in the
string. You do that with another backslash, so `\\` will
result in a single backslash. You must remember this when writing `TeX` strings
or regular expressions that contain backslashes.
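
You can check the effect of the escaping with Python's standard `json` module. The snippet below is a small sketch for experimenting and has nothing to do with how *Samewords* reads its settings internally:

```python
import json

# The text exactly as it would be written in the configuration file.
raw = r'{"exclude_macros": ["\\excludedMacro"]}'
macro = json.loads(raw)["exclude_macros"][0]
print(macro)       # \excludedMacro
print(len(macro))  # 14: one backslash plus the 13 letters

# A single backslash is either rejected ...
try:
    json.loads(r'["\excludedMacro"]')
except json.JSONDecodeError:
    print("a single backslash before 'e' is not valid JSON")

# ... or silently turned into something else: "\t" is a tab in JSON.
print(json.loads(r'["\test"]')[0] == "\test")  # True: tab followed by "est"
```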

A complete configuration file could contain the following content:
```json
{
  "ellipsis_patterns": [
    "--",
    "–"
  ],
  "exclude_macros": [
    "\\excludedMacro"
  ]
}
```

For details, see below.

#### `exclude_macros`
You might want to define some macros which are entirely ignored in the
comparison of text segments. That will typically be macros that *do not* contain
text content.

For example, you might use a custom macro called `\msbreak{}` to indicate a
pagebreak in your edition. The content of that is not printed in the text, but
in the margin. So you don't want the comparison to take the content of this
macro into account. Take this example phrase:

```latex
I\msbreak{23v} know that \edtext{I know}{\Afootnote{I don't know B}} nothing.
```

Since the content of (almost) all macros is included by default, this would give
the comparison of the phrase `I know` (`\edtext` content) with `I23v know that`
(context). It will not match, and hence not annotate the phrase.

If we add the macro to the `exclude_macros` field in a settings file and pass
that to the script, `\msbreak` will be ignored in processing, and we will get a
comparison of `I know` (`\edtext` content) with `I know that` (context).
This will match and hence correctly annotate the phrase.
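
The difference between the two comparisons can be imitated with a few lines of Python. This is a simplified sketch, not the script's own implementation: the function `flatten_macros` is made up for the example, and it only handles macros with a single, non-nested `{...}` argument.

```python
import re

def flatten_macros(text, exclude_macros=()):
    """Reduce TeX macros to plain text for comparison (simplified sketch)."""
    # Excluded macros disappear together with their argument.
    for macro in exclude_macros:
        text = re.sub(re.escape(macro) + r"\{[^{}]*\}", "", text)
    # All other macros contribute the content of their argument.
    return re.sub(r"\\[A-Za-z]+\{([^{}]*)\}", r"\1", text)

context = r"I\msbreak{23v} know that"
print(flatten_macros(context))                 # I23v know that
print(flatten_macros(context, [r"\msbreak"]))  # I know that
```

Only in the second case does the context contain the phrase `I know`.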

*Another example:* The script searches for words or phrases identical to those
in the `\edtext{}{}` macros to identify possible conflicts. By default the
content of practically all macros is included in this comparison.

Take this passage:
```latex
\edtext{Sortes\test{1}}{\Afootnote{Socrates B}} dicit: Sortes\test{2} probus
```

This will result in a search for "Sortes1" in the string "dicit Sortes2 probus",
which will not succeed.

On the other hand, this passage:
```latex
\edtext{Sortes\test{1}}{\Afootnote{Socrates B}} dicit: Sortes\test{1} probus
```

This will result in a search for "Sortes1" in the string "dicit Sortes1 probus",
which will succeed and therefore annotate the instances.

If you add `\test` to the `exclude_macros` field, both examples above will
compare "Sortes" with "Sortes" and hence give a positive match.

#### `ellipsis_patterns` ####

This key contains a list of patterns that should be included when matching for
ellipsis symbols in `\lemma{}`. These are used in a regular expression match, so
any valid Python regular expression will work.

Say you use "--" and "..." to indicate ellipsis. Actually, you ought to write
the dotted ellipsis with `\dots{}` in `LaTeX`, but if you insist, you could give
the key the following list (but you shouldn't, really. Use `\dots{}`):

```json
{
  "ellipsis_patterns": [
    "\\.\\.\\.",
    "-+"
  ]
}
```

This looks complicated, but don't worry. The "..." is matched with a regex
pattern, which requires us to escape the regular "." – that would normally look
like this `\.\.\.`. But since we also need to escape the backslashes, they are
doubly escaped.

The second is a lot simpler: it is just a regex that will match one or more
regular dashes in your text. Note that this comes with some danger as it will
match if your lemma contains a single dash, even though you might not have
thought of it as an "ellipsis"-dash. In these cases, it's better to be explicit
and either use double dashes (`--`) or real Unicode en-dashes (`–`). It is also
typographically much better.
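
The danger is easy to reproduce with Python's `re` module. The snippet is only a sketch of the regex behaviour; how *Samewords* applies the patterns to a lemma internally is not shown here, and the sample lemma is invented.

```python
import re

lemma = "well-known"              # an ordinary hyphen, no ellipsis intended
loose = re.compile(r"-+")         # one or more dashes
explicit = re.compile(r"--|–")    # double dash or en-dash only

print(bool(loose.search(lemma)))                  # True: the hyphen is mistaken for an ellipsis
print(bool(explicit.search(lemma)))               # False
print(bool(explicit.search("Sortes -- probus")))  # True
```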

Another example of a regex match pattern would be to match for the thin space
command in `LaTeX`, which is `\,`, and which produces a space of just
0.16667em. The backslash is a meta-character in regex, so it needs to be
escaped, which gives the regex `\\,`. But in JSON every backslash must itself
be escaped too. This means that to match the literal expression `\,` the
pattern in the configuration file would look like this: `\\\\,`. So if we
wanted to match the `LaTeX` expression `\,-\,` (thin space, a dash, and another
thin space), we would write the following pattern: `\\\\,-\\\\,`. But as we
would probably want to match any number of dashes, it could be improved to
`\\\\,-+\\\\,`.
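
If the layers of escaping get confusing, it can help to decode the JSON yourself and look at the regex that comes out. The following Python sketch does that for the two escaped patterns discussed in this section; the sample strings are invented for the illustration.

```python
import json
import re

# The JSON text exactly as it would be written in the configuration file.
config_text = r'{"ellipsis_patterns": ["\\.\\.\\.", "\\\\,-+\\\\,"]}'
patterns = json.loads(config_text)["ellipsis_patterns"]
print(patterns[0])  # \.\.\.
print(patterns[1])  # \\,-+\\,

dots, thin_dash = (re.compile(p) for p in patterns)
print(bool(dots.search("Sortes ... probus")))         # True
print(bool(thin_dash.search(r"Sortes\,--\,probus")))  # True
print(bool(thin_dash.search("Sortes -- probus")))     # False: no thin spaces
```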


#### `sensitive_context_match`

The value of the settings variable `sensitive_context_match` determines
whether the search for matches in the context is case sensitive. By default it
is case insensitive, but if the value is set to `true`, it will be case
sensitive.

In JSON:
```json
{
  "sensitive_context_match": true
}
```

That would mean that the search for "an" in the context string "An example"
would not match. This is a sensible setting when lemma words are not lower cased
in the critical apparatus.
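
The effect of the setting can be pictured with this small Python sketch. The function `in_context` and its `sensitive` parameter are invented for the illustration and do not correspond to names in the package.

```python
import re

def in_context(word, context, sensitive=False):
    """Check whether `word` occurs as a whole word in `context`."""
    flags = 0 if sensitive else re.IGNORECASE
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, context, flags) is not None

print(in_context("an", "An example"))                  # True: default, case insensitive
print(in_context("an", "An example", sensitive=True))  # False
```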

# Issue reporting and testing

If you like the idea of this software, please help improve it by
filing an [issue report](https://github.com/stenskjaer/samewords/issues) when
you find bugs.

## To file a bug

- Create a *minimal working example* (MWE) TeX document that contains absolutely
  nothing aside from the material necessary for reproducing the bug. The
  document should (if possible) be able to compile on a fresh installation of
  LaTeX without any custom packages.
- Open an [issue report](https://github.com/stenskjaer/samewords/issues) and
  describe the conditions under which you experience the bug. It should be
  possible for me to reproduce the bug by following your directions.
- If the script returns an error, copy and paste the error traceback into the
  report.
- If the script returns your document, include that, and describe the result you
  expected, and how that differs from what you get.

## Testing updated issue branches

Once I (think I) have a solution, I will ask you to test a branch. You can do
that by either downloading that specific branch as a zip or cloning the
repository and pulling down the changed branch. Choose one of the following
two, depending on your preferences.

**Downloading branch zip**
This approach is simplest if (1) you don't feel quite comfortable using `git` or
(2) only want to test a single change or issue.

- Navigate to the relevant branch in GitHub (the “Branch: ” dropdown).
- Download that branch to your computer (the “Clone or download” button).
- Navigate to the downloaded zip file, unzip it and enter the directory.

**Clone repository and checkout branch** This approach is more flexible and
makes it easier for you to pull and test different branches. It also makes it
easier to keep track of which branch you are testing on (with the `git status`
command). Finally, if you should want to push changes in pull requests, this is also the
approach you should use.

- Navigate to an appropriate directory.
- Run `git clone https://github.com/stenskjaer/samewords.git`. A directory with
  the name “samewords” will be created in your current working directory.
- Check out the branch that you want to test. If that is called `issue-13` run
  `git checkout issue-13`.

After either of the above processes, the rest is identical:
- Create a *virtual environment* for testing by running `python3 -m venv .env`,
  and then activate it with `source .env/bin/activate` (this is based on a Unix
  environment; if you run Windows, check out
  [the Python documentation](https://docs.python.org/3.6/library/venv.html)).
- Install the script in the virtual environment with `pip install -e .`.
- To make sure you run the version in the *virtual environment*, run
  `.env/bin/samewords` from the directory (to avoid using a global version of
  the script, if you have that).
- Run your supplied MWE (or other material provided by me in the issue report)
  and inspect whether the problem is solved and report back in the issue report.
- When you are done testing, deactivate the virtual environment by running
  `deactivate` (Bash on Unix) or `deactivate.bat` (Windows).


If you have downloaded a branch zip, you can delete the unzipped directory, and
everything should be back to normal.

If you have cloned the repository, you can just leave it there.


# Disclaimer and license

This is beta level software. Bugs are to be expected and I provide no guarantees
for the integrity of your software or editions when you use the package.

Copyright (c) 2017 Michael Stenskjær Christensen, MIT License.


