
Library and executable for identifying anomalous file path strings


pathlesstaken

Profile strings, e.g. file paths, for digital preservation considerations, such as characters that you want to preserve or characters that you don't want to preserve.

pathlesstaken has no external dependencies, so you can clone this repo and just run it, as long as your environment supports Python and you can download it!

Basis for this module

The original analysis was based on Microsoft's list of non-recommended file names: Non-recommended names from Microsoft

The code also contains a copy of Cooper Hewitt's code to enable writing plain text descriptions of Unicode characters. This portion of the code is licensed under the BSD 3-Clause "New" or "Revised" License.

The bigger project this code was developed for is still here: droid-siegfried-sqlite-analysis

Example output

Given a Unicode string: $ pathlesstaken.py ❤💖💙💚💛💜💝♕♖♗♘♙♚♛♜♝♞♟

File: '❤💖💙💚💛💜💝♕♖♗♘♙♚♛♜♝♞♟' contains, characters outside of ASCII range: '0x2764, HEAVY BLACK HEART: ❤'
File: '❤💖💙💚💛💜💝♕♖♗♘♙♚♛♜♝♞♟' contains, characters outside of ASCII range: '0x1f496, SPARKLING HEART: 💖'
File: '❤💖💙💚💛💜💝♕♖♗♘♙♚♛♜♝♞♟' contains, characters outside of ASCII range: '0x1f499, BLUE HEART: 💙'
File: '❤💖💙💚💛💜💝♕♖♗♘♙♚♛♜♝♞♟' contains, characters outside of ASCII range: '0x1f49a, GREEN HEART: 💚'
File: '❤💖💙💚💛💜💝♕♖♗♘♙♚♛♜♝♞♟' contains, characters outside of ASCII range: '0x1f49b, YELLOW HEART: 💛'
File: '❤💖💙💚💛💜💝♕♖♗♘♙♚♛♜♝♞♟' contains, characters outside of ASCII range: '0x1f49c, PURPLE HEART: 💜'
File: '❤💖💙💚💛💜💝♕♖♗♘♙♚♛♜♝♞♟' contains, characters outside of ASCII range: '0x1f49d, HEART WITH RIBBON: 💝'
File: '❤💖💙💚💛💜💝♕♖♗♘♙♚♛♜♝♞♟' contains, characters outside of ASCII range: '0x2655, WHITE CHESS QUEEN: ♕'
File: '❤💖💙💚💛💜💝♕♖♗♘♙♚♛♜♝♞♟' contains, characters outside of ASCII range: '0x2656, WHITE CHESS ROOK: ♖'
File: '❤💖💙💚💛💜💝♕♖♗♘♙♚♛♜♝♞♟' contains, characters outside of ASCII range: '0x2657, WHITE CHESS BISHOP: ♗'
File: '❤💖💙💚💛💜💝♕♖♗♘♙♚♛♜♝♞♟' contains, characters outside of ASCII range: '0x2658, WHITE CHESS KNIGHT: ♘'
File: '❤💖💙💚💛💜💝♕♖♗♘♙♚♛♜♝♞♟' contains, characters outside of ASCII range: '0x2659, WHITE CHESS PAWN: ♙'
File: '❤💖💙💚💛💜💝♕♖♗♘♙♚♛♜♝♞♟' contains, characters outside of ASCII range: '0x265a, BLACK CHESS KING: ♚'
File: '❤💖💙💚💛💜💝♕♖♗♘♙♚♛♜♝♞♟' contains, characters outside of ASCII range: '0x265b, BLACK CHESS QUEEN: ♛'
File: '❤💖💙💚💛💜💝♕♖♗♘♙♚♛♜♝♞♟' contains, characters outside of ASCII range: '0x265c, BLACK CHESS ROOK: ♜'
File: '❤💖💙💚💛💜💝♕♖♗♘♙♚♛♜♝♞♟' contains, characters outside of ASCII range: '0x265d, BLACK CHESS BISHOP: ♝'
File: '❤💖💙💚💛💜💝♕♖♗♘♙♚♛♜♝♞♟' contains, characters outside of ASCII range: '0x265e, BLACK CHESS KNIGHT: ♞'
File: '❤💖💙💚💛💜💝♕♖♗♘♙♚♛♜♝♞♟' contains, characters outside of ASCII range: '0x265f, BLACK CHESS PAWN: ♟'
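
To illustrate the kind of check that produces messages like the ones above, here is a minimal sketch. It is not the module's own implementation: the function name describe_non_ascii is hypothetical, and Python's standard unicodedata module stands in for the bundled Cooper Hewitt code that the project uses for character descriptions.

```python
import unicodedata


def describe_non_ascii(name):
    """Return one message per character in name that falls outside ASCII.

    Hypothetical helper for illustration only; not part of pathlesstaken.
    """
    messages = []
    for char in name:
        if ord(char) > 0x7F:
            # unicodedata.name() is a stand-in for the project's own
            # plain text character descriptions.
            description = unicodedata.name(char, "<unnamed character>")
            messages.append(
                "File: '{}' contains, characters outside of ASCII range: "
                "'{}, {}: {}'".format(name, hex(ord(char)), description, char)
            )
    return messages


if __name__ == "__main__":
    for message in describe_non_ascii("\u2764\u2655"):
        print(message)
```

Each non-ASCII character yields its own message, which is why a string with several such characters appears on several lines of output.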

You can also run a print test by running: $ pathlesstaken.py test

File: 'COM4' contains, reserved name 'COM4'
File: 'COM4.txt' contains, reserved name 'COM4'
File: 'AUX' contains, reserved name 'AUX'
File: 'CON' contains, reserved name 'CON'
File: 'space ' has a SPACE as its last character
File: 'period.' has a period '.' as its last character
File: 'ó' contains, characters outside of ASCII range: '0xf3, LATIN SMALL LETTER O WITH ACUTE: ó'
File: 'é' contains, characters outside of ASCII range: '0xe9, LATIN SMALL LETTER E WITH ACUTE: é'
File: 'ö' contains, characters outside of ASCII range: '0xf6, LATIN SMALL LETTER O WITH DIAERESIS: ö'
File: 'óéö' contains, characters outside of ASCII range: '0xf3, LATIN SMALL LETTER O WITH ACUTE: ó'
File: 'óéö' contains, characters outside of ASCII range: '0xe9, LATIN SMALL LETTER E WITH ACUTE: é'
File: 'óéö' contains, characters outside of ASCII range: '0xf6, LATIN SMALL LETTER O WITH DIAERESIS: ö'
File: 'file[bracket]one.txt' contains, non-recommended character: '0x5b, LEFT SQUARE BRACKET: ['
File: 'file[bracket]one.txt' contains, non-recommended character: '0x5d, RIGHT SQUARE BRACKET: ]'
File: 'file[two.txt' contains, non-recommended character: '0x5b, LEFT SQUARE BRACKET: ['
File: 'filethree].txt' contains, non-recommended character: '0x5d, RIGHT SQUARE BRACKET: ]'
File: '-=_|"' contains, non-recommended character: '0x7c, VERTICAL LINE: |'
File: '-=_|"' contains, non-recommended character: '0x22, QUOTATION MARK: "'
File: '(<>:"/\?*|-)' contains, non-recommended character: '0x3c, LESS-THAN SIGN: <'
File: '(<>:"/\?*|-)' contains, non-recommended character: '0x3e, GREATER-THAN SIGN: >'
File: '(<>:"/\?*|-)' contains, non-recommended character: '0x3a, COLON: :'
File: '(<>:"/\?*|-)' contains, non-recommended character: '0x22, QUOTATION MARK: "'
File: '(<>:"/\?*|-)' contains, non-recommended character: '0x2f, SOLIDUS: /'
File: '(<>:"/\?*|-)' contains, non-recommended character: '0x5c, REVERSE SOLIDUS: \'
File: '(<>:"/\?*|-)' contains, non-recommended character: '0x3f, QUESTION MARK: ?'
File: '(<>:"/\?*|-)' contains, non-recommended character: '0x2a, ASTERISK: *'
File: '(<>:"/\?*|-)' contains, non-recommended character: '0x7c, VERTICAL LINE: |'
File: '(<>:"/\?*|-)' contains, non-printable character: '0x0, <control character>'
File: '(<>:"/\?*|-)' contains, non-printable character: '0x1f, <control character>'
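
The reserved name and trailing character findings in the test output can be approximated with a few string checks. The sketch below is an assumption-laden illustration, not the module's code: check_name and RESERVED_NAMES are hypothetical, the reserved list (CON, PRN, AUX, NUL, COM1 to COM9, LPT1 to LPT9) is taken from Microsoft's file naming guidance rather than from this project, and the case-insensitive comparison is my own choice.

```python
# Assumed reserved device names, based on Microsoft's naming guidance.
RESERVED_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {"COM{}".format(i) for i in range(1, 10)}
    | {"LPT{}".format(i) for i in range(1, 10)}
)


def check_name(name):
    """Return a list of findings for a single file name (illustrative only)."""
    findings = []
    # 'COM4.txt' is still treated as reserved, so compare the part before
    # the first period.
    stem = name.split(".", 1)[0].upper()
    if stem in RESERVED_NAMES:
        findings.append("reserved name '{}'".format(stem))
    if name.endswith(" "):
        findings.append("SPACE as last character")
    if name.endswith("."):
        findings.append("period as last character")
    return findings


if __name__ == "__main__":
    for candidate in ("COM4", "COM4.txt", "space ", "period.", "report.txt"):
        print(repr(candidate), check_name(candidate))
```

Comparing only the part before the first period mirrors the 'COM4.txt' line in the output above, where the extension does not stop the name from being flagged.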

Please let me know how it goes if you try out this code.

Sister project

If you would like to understand your file paths but don't need all the detail, there is a third way: take a look at the fndec project, which I created in Golang using utilities from Richard Lehane's Siegfried. More info after the jump.

Docs

All docs are available in docs.

DESCRIPTION
    Module that implements checks against the Microsoft Recommendations for
    file naming, plus additional recommended analyses documented below.

    First created based on the recommendations here:
        http://msdn.microsoft.com/en-us/library/aa365247(VS.85).aspx

    First available in:
        https://github.com/exponential-decay/droid-siegfried-sqlite-analysis-engine

    Methods defined here:
     |
     |  complete_file_name_analysis(self, string, folders=False, verbose=False)
     |      Run all analyses over a string object. The analyses are as follows:
     |
     |      * detect_non_ascii_characters
     |      * detect_non_recommended_characters
     |      * detect_non_printable_characters
     |      * detect_microsoft_reserved_names
     |      * detect_spaces_at_end_of_names
     |      * detect_period_at_end_of_name
     |
     |  detect_microsoft_reserved_names(self, string)
     |      Detect names that are considered difficult on Microsoft file
     |      systems. There is a special history to these names which can be
     |      read about at the link below:
     |
     |          * http://msdn.microsoft.com/en-us/library/aa365247(VS.85).aspx
     |
     |  detect_non_ascii_characters(self, string, folders=False)
     |      Detect characters outside of the ASCII range. These can still be
     |      more difficult to preserve in today's systems, though it is
     |      getting easier.
     |
     |  detect_non_printable_characters(self, string, folders=False)
     |      Detect control characters below 0x20 in the ASCII table that cannot
     |      be printed. Examples include ESC (escape) or BS (backspace).
     |
     |  detect_non_recommended_characters(self, string, folders=False)
     |      Detect characters that are not particularly recommended. These
     |      characters, for example a forward slash '/', often have other
     |      meanings in computer systems and can be interpreted incorrectly if
     |      not handled properly.
     |
     |  detect_period_at_end_of_name(self, string, folders=False)
     |      Detect a full-stop at the end of a name. This might indicate a
     |      missing file extension.
     |
     |  detect_spaces_at_end_of_names(self, string, folders=False)
     |      Detect spaces at the end of a string. These spaces, if ignored, can
     |      lead to incorrectly matching strings, e.g. 'this ' is different from
     |      'this'.
     |
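
The character-level checks documented above (non-recommended and non-printable characters) can be sketched as a single pass over the string. This is a rough illustration under stated assumptions, not the module's implementation: flag_characters and NON_RECOMMENDED are hypothetical names, the character set is inferred from the example output earlier in this README, and unicodedata again stands in for the project's own character descriptions.

```python
import unicodedata

# Assumed set, inferred from the example output: Microsoft's reserved
# characters plus square brackets.
NON_RECOMMENDED = set('<>:"/\\|?*[]')


def flag_characters(name):
    """Return (kind, hex code point, description) tuples for flagged characters."""
    findings = []
    for char in name:
        code = ord(char)
        if code < 0x20:
            # Control characters below 0x20 have no printable form.
            findings.append(("non-printable", hex(code), "<control character>"))
        elif char in NON_RECOMMENDED:
            findings.append(("non-recommended", hex(code), unicodedata.name(char)))
    return findings


if __name__ == "__main__":
    for finding in flag_characters("file[bracket]one.txt"):
        print(finding)
```

Keeping the two checks in one loop is only a convenience for the sketch; the module exposes them as separate methods, as listed above.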

License

The unique parts of this code are licensed under GPLv3.
