Allow searching for text in Documents in the Wagtail content management system

Text extraction for Wagtail document search

This package is for replacing Wagtail's Document class with one that allows searching in Document file contents using textract.

Textract can extract text from (among others) PDF, Excel and Word files.

The package was inspired by the "Search: Extract text from documents" issue in Wagtail.

Documents will work as before, except that Document search in Wagtail's admin interface will also find search terms in the files' contents.

Some screenshots to illustrate.

In our fresh Wagtail site with wagtail_textract installed, we uploaded a file called test_document.pdf with handwritten text in it. It is listed in the admin interface under Documents:

Document List

If we now search in Documents for the word correct, which is one of the handwritten words, the live search finds it:

Document Search finds PDF by searching for "staple"

The assumption is that this search should not only be available in Wagtail's admin interface, but also in a public-facing search view, for which we provide a code example.

Requirements

Maturity

We have been using this package in production since August 2018 on https://nuffic.nl.

Installation

  • Install the Textract dependencies
  • Add wagtail_textract to your requirements and/or pip install wagtail_textract
  • Add wagtail_textract to your Django INSTALLED_APPS.
  • Put WAGTAILDOCS_DOCUMENT_MODEL = "wagtail_textract.document" in your Django settings.
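
The two settings steps above could look roughly like this in a settings module. This is a minimal sketch: the surrounding app list is an assumption and will differ per project; only the wagtail_textract entry and the WAGTAILDOCS_DOCUMENT_MODEL value come from the steps above.

```python
# settings.py (fragment; the other entries in your project will differ)
INSTALLED_APPS = [
    # ... your existing Django and Wagtail apps ...
    "wagtail.documents",
    "wagtail_textract",
]

# Tell Wagtail to use the Document model provided by wagtail_textract
WAGTAILDOCS_DOCUMENT_MODEL = "wagtail_textract.document"
```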

Note: With Wagtail 2.0.1 installed, pip reports an incompatibility warning while installing wagtail_textract:

requests 2.18.4 has requirement chardet<3.1.0,>=3.0.2, but you'll have chardet 2.3.0 which is incompatible.
textract 1.6.1 has requirement beautifulsoup4==4.5.3, but you'll have beautifulsoup4 4.6.0 which is incompatible.

We haven't seen this leading to problems, but it's something to keep in mind.

Tesseract

Textract falls back to Tesseract when its regular extraction finds no text. For that to work, you need to add the language data files that Tesseract bases its word matching on.

Create a tessdata directory in your project directory, and download the languages you want.
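
To check which languages ended up in that directory, a small helper along these lines may be useful. It is a sketch, not part of wagtail_textract: the function name is hypothetical, and it only relies on Tesseract language files being named `<language>.traineddata`.

```python
from pathlib import Path


def installed_tesseract_languages(tessdata_dir):
    """Return sorted language codes that have a .traineddata file in tessdata_dir.

    Hypothetical helper for illustration; not provided by wagtail_textract.
    """
    directory = Path(tessdata_dir)
    if not directory.is_dir():
        return []
    return sorted(path.stem for path in directory.glob("*.traineddata"))
```

Calling `installed_tesseract_languages("tessdata")` from the project directory would list the language codes found there, or an empty list if the directory does not exist.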

Transcribing

Transcription is done automatically after Document save, in an asyncio executor to prevent blocking the response during processing.
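
The general pattern of handing blocking work to an asyncio executor can be sketched as follows. This is not the package's actual code: `slow_transcribe` and `save_document` are hypothetical stand-ins, and the sketch assumes Python 3.7 or later.

```python
import asyncio
import time


def slow_transcribe(name):
    """Stand-in for a blocking text-extraction call (hypothetical)."""
    time.sleep(0.1)
    return "transcription of " + name


async def save_document(name, events):
    """Hand the slow work to the default executor, then carry on immediately."""
    loop = asyncio.get_running_loop()
    # run_in_executor starts slow_transcribe in a worker thread and returns a future
    future = loop.run_in_executor(None, slow_transcribe, name)
    events.append("saved")  # recorded while the worker thread is still busy
    transcription = await future
    events.append("transcribed")
    return transcription


events = []
result = asyncio.run(save_document("test_document.pdf", events))
```

The point of the pattern is that the code after `run_in_executor` does not wait for the extraction; the result is picked up later from the future.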

To transcribe all existing Documents, run the management command:

./manage.py transcribe_documents

This may take a long time if you have many Documents.

Usage in custom view

Here is a code example for a search view (outside Wagtail's admin interface) that shows both Page and Document results.

from itertools import chain

from django.shortcuts import render
from wagtail.core.models import Page
from wagtail.documents.models import get_document_model
from wagtail.search.models import Query

# Resolve the model configured in WAGTAILDOCS_DOCUMENT_MODEL
Document = get_document_model()


def search(request):
    # Search
    search_query = request.GET.get('query', None)
    if search_query:
        page_results = Page.objects.live().search(search_query)
        document_results = Document.objects.search(search_query)
        # Combine both result sets into one list, Pages first
        search_results = list(chain(page_results, document_results))

        # Log the query so Wagtail can suggest promoted results
        Query.get(search_query).add_hit()
    else:
        search_results = Page.objects.none()

    # Render template
    return render(request, 'website/search_results.html', {
        'search_query': search_query,
        'search_results': search_results,
    })

Your template should handle Documents differently from Pages, because the pageurl tag does not work on a Document:

{% if result.file %}
   <a href="{{ result.url }}">{{ result }}</a>
{% else %}
   <a href="{% pageurl result %}">{{ result }}</a>
{% endif %}
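
If you would rather separate the two kinds of results in the view than in the template, the same `file` check can be done in Python. This is a sketch with a hypothetical helper name; it assumes, like the template above, that only Documents have a non-empty `file` attribute.

```python
def split_results(search_results):
    """Split mixed search results into (documents, pages).

    Hypothetical helper: a result counts as a Document when it has a
    truthy `file` attribute, mirroring the template check.
    """
    documents, pages = [], []
    for result in search_results:
        if getattr(result, "file", None):
            documents.append(result)
        else:
            pages.append(result)
    return documents, pages
```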

What if you already use a custom Document model?

In order to use wagtail_textract, your CustomizedDocument model should do the same as wagtail_textract's Document:

  • subclass TranscriptionMixin
  • alter search_fields

from wagtail.search import index
from wagtail_textract.models import TranscriptionMixin


class CustomizedDocument(TranscriptionMixin, ...):
    """Extra fields and methods for Document model."""
    # Extend the existing search fields with the extracted text
    search_fields = ... + [
        index.SearchField(
            'transcription',
            partial_match=False,
        ),
    ]

Note that the first class to subclass should be TranscriptionMixin, so its save() takes precedence over that of the other parent classes.
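
Why the order of the parent classes matters can be illustrated with plain Python classes. The classes below are hypothetical stand-ins, not the real TranscriptionMixin or Document; the sketch assumes the mixin's save() calls super().save() and the other parent's save() does not pass the call on.

```python
class MixinSketch:
    """Stand-in for a mixin whose save() adds behaviour and then delegates."""

    def save(self):
        self.calls.append("mixin")
        super().save()


class BaseSketch:
    """Stand-in for the other parent class, whose save() ends the chain."""

    def __init__(self):
        self.calls = []

    def save(self):
        self.calls.append("base")


class MixinFirst(MixinSketch, BaseSketch):
    pass


class MixinLast(BaseSketch, MixinSketch):
    pass


first = MixinFirst()
first.save()  # the mixin's save() runs, then the base save()

last = MixinLast()
last.save()  # only the base save() runs; the mixin's save() is never reached
```

With the mixin listed first, Python's method resolution order finds its save() before that of the other parent, so the extra behaviour is not skipped.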

Tests

To run tests, check out this repository and run:

make test

Coverage

A coverage report will be generated in ./coverage_html_report/.

Contributors

  • Karl Hobley
  • Bertrand Bordage
  • Kees Hink
  • Tom Hendrikx
  • Coen van der Kamp
  • Mike Overkamp
  • Thibaud Colas
  • Dan Braghis
  • Dan Swain
