
A script to scan HTML documents for forbidden phrases stored in a CSV.


About Docscanner

Docscanner streamlines your documentation workflow by scanning an .html file for the words or phrases listed in a .csv file. For each match, Docscanner returns the word or phrase and the line number on which it was found. This helps reduce human error when applying style guide content standards in a documentation project.

The script is case-insensitive and reports every occurrence of a word or phrase, including repeats on the same line of the .html file.
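The behavior described above can be illustrated with a minimal sketch. This is not Docscanner's actual source; the function name find_phrases and the sample data are assumptions made for illustration, and the sketch matches substrings, which may differ from how Docscanner treats word boundaries.

```python
import re

def find_phrases(lines, phrases):
    """Return (line_number, phrase) pairs for every case-insensitive match."""
    hits = []
    for line_number, line in enumerate(lines, start=1):
        for phrase in phrases:
            # re.escape treats the phrase as literal text. finditer yields one
            # match per occurrence, so repeats on the same line are all kept.
            for _ in re.finditer(re.escape(phrase), line, flags=re.IGNORECASE):
                hits.append((line_number, phrase))
    return hits

# Hypothetical sample input: two lines of HTML and two forbidden phrases.
sample = ["<p>Simply click Save.</p>", "<p>Please, please do not click.</p>"]
print(find_phrases(sample, ["simply", "please"]))
# [(1, 'simply'), (2, 'please'), (2, 'please')]
```

Line numbers start at 1 so that they line up with what a text editor shows.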

Installing Docscanner

If you haven't already, download and install Python.

NOTE: If installing Python for the first time, make sure you select the Add Python 3.7 to PATH checkbox.

Install Docscanner by running the following in your Command Prompt terminal:
pip install docscanner

Using Docscanner

  1. Start Docscanner by running the following in your Command Prompt terminal:
    docscanner
    You will be prompted with the following:
    Use default csv file of forbidden phrases? (respond Y or N):

  2. Choose whether to use the default .csv file or to add a custom path:

    • To use the default csv file, enter "Y". Docscanner then moves on to the next prompt, where you enter the .html file path.
    • To use a custom csv file, enter "N". You will be prompted with the following:
      Enter path of your custom csv file:
      Copy the path from File Explorer by holding <SHIFT>, right-clicking your desired file, and selecting Copy as path.

      NOTE: The path is automatically stripped of unnecessary characters such as quotation marks or spaces. You do not have to format your file paths after pasting them.

  3. After choosing to use either the default or custom .csv file, you will be prompted with the following:
    Enter path of html file:
    Copy the path from File Explorer by holding <SHIFT>, right-clicking your desired file, and selecting Copy as path.

Docscanner returns each word or phrase it found in your .html file, along with the line number on which it was found.
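The path cleanup mentioned in the note under step 2 could look something like the sketch below. The function name clean_path is hypothetical and the exact set of characters Docscanner strips is not documented here; the sketch assumes only surrounding spaces and quotation marks, which is what Copy as path typically adds.

```python
def clean_path(raw):
    """Strip surrounding whitespace and quotation marks from a pasted path."""
    # Assumption: only leading/trailing spaces and quotes need removing.
    return raw.strip().strip("\"'").strip()

print(clean_path('  "C:\\docs\\guide.html" '))
# C:\docs\guide.html
```

Spaces inside the path are left alone, so folder names such as "My Documents" are not damaged.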

Formatting Data

Ensure that your .csv file has no header row and that each word or phrase occupies its own row in the first column.

Example .csv file:

word_1
phrase 1
word_2
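A file in this layout can be read with Python's standard csv module. The sketch below is illustrative only; the function name load_phrases is an assumption, and skipping blank rows is a design choice of the sketch rather than documented Docscanner behavior.

```python
import csv
import io

def load_phrases(csv_file):
    """Read the first column of every non-empty row as one phrase."""
    phrases = []
    for row in csv.reader(csv_file):
        # No header row is expected, so every row is treated as data.
        if row and row[0].strip():
            phrases.append(row[0].strip())
    return phrases

# The example file above, held in memory instead of on disk.
example = io.StringIO("word_1\nphrase 1\nword_2\n")
print(load_phrases(example))
# ['word_1', 'phrase 1', 'word_2']
```

For a file on disk, pass the object returned by open(path, newline="", encoding="utf-8") instead of the StringIO object.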

Changing the Default File

  1. Locate the root directory storing docscanner.py.
  2. Open the data folder.
  3. Delete or alter the existing .csv file and save your changes.
  4. Open the src folder.
  5. Change the file names in the get_forbidden_phrases_path function to match the name of your new default file in the data folder. Alternatively, you can edit the default .csv file directly.
  6. Save your changes.
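To give a sense of what the edit in step 5 involves, here is a hypothetical sketch of a function that builds the path to a default file inside the data folder. It is not the actual get_forbidden_phrases_path function; the names default_csv_path and DEFAULT_CSV_NAME, and the file name forbidden_phrases.csv, are assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical file name; replace it with the name of your own default file.
DEFAULT_CSV_NAME = "forbidden_phrases.csv"

def default_csv_path(project_root, file_name=DEFAULT_CSV_NAME):
    """Build the path to the default .csv inside the project's data folder."""
    return Path(project_root) / "data" / file_name

print(default_csv_path("docscanner").name)
# forbidden_phrases.csv
```

In a layout like this, renaming the default file means changing one string so that it matches the file in the data folder.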

Troubleshooting

If Docscanner is not running properly:

  • Ensure that all file paths are correct.
  • Verify you are using a Command Prompt terminal, not PowerShell.
  • Check whether the docscanner directory structure on your workstation matches what is on GitHub.
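The first check in the list above can be done quickly from Python before running Docscanner. This is a small helper sketch, not part of Docscanner; the function name missing_files is an assumption.

```python
from pathlib import Path

def missing_files(*paths):
    """Return the given paths that do not point to an existing file."""
    return [str(p) for p in paths if not Path(p).is_file()]

# Hypothetical paths; substitute the ones you pass to Docscanner.
problems = missing_files("phrases.csv", "guide.html")
if problems:
    print("Not found:", ", ".join(problems))
else:
    print("All file paths exist.")
```

An empty list means every path points to a real file, so the problem lies elsewhere.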
