Ultra-simple human readable DSL for matching text.

SimEx
=====

SimEx is a tool that lets you write simple, readable equivalents of regular expressions that
compile down to regular expressions.

This is useful for:

* Improving the readability and maintainability of code that uses long regexes with a lot of escaped characters.
* Allowing non-developers to read and understand simple regex-equivalents and potentially even write their own.

SimEx is *not* a full replacement for regular expressions, and it is not suitable everywhere a regex is used.

It is best suited to cases where you usually compare two strings for equality but occasionally
need one of them to contain an embedded pattern.

It is an embodiment of `the rule of least power <https://en.wikipedia.org/wiki/Rule_of_least_power>`_.

To install::

    $ pip install simex


Example
-------

.. code-block:: python

    >>> from simex import Simex
    >>> simex = Simex({"url": r".*?", "anything": r".*?"})
    >>> regex = simex.compile("""<a href="{{ url }}">{{ anything }}</a>""")
    >>> regex.match("""<a href="http://www.cnn.com">CNN</a>""") is not None
    True


Do I have to define all of the sub-regular expressions myself?
--------------------------------------------------------------

No. SimEx also contains a built-in library of commonly used regular expressions.

This will also work:

.. code-block:: python

    >>> from simex import DefaultSimex
    >>> my_simex = DefaultSimex()
    >>> regex = my_simex.compile("""<a href="{{ url }}">{{ anything }}</a>""")
    >>> regex
    re.compile(r'\<a\ href\=\"(ht|f)tp(s?)\:\/\/[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:(0-9)*)*(\/?)([a-zA-Z0-9\-\.\?\,\\'\/\\\+&amp;%\$#_]*)?\"\>.*?\<\/a\>', re.UNICODE)

    >>> regex.match("""<a href="http://www.cnn.com">CNN</a>""") is not None
    True

All regexes in the existing library can be overridden, and more can be added, e.g.

.. code-block:: python

    >>> simex = DefaultSimex({"url": r".*?", "mycode": r"[A-Z][0-9][0-9][0-9]"})

Currently there are five in the list of pre-defined regexes:

* URL
* Email
* Integer
* Number
* Anything

Pull requests with commonly required non-controversial regexes are welcome.


Using {{ and }} creates conflicts for me! Why not [[[ and ]]]?
--------------------------------------------------------------

{{ and }} have a special meaning in some templating languages that you may
want to use alongside SimEx, e.g. Jinja2.

To prevent confusion in such circumstances, you can define your
own delimiters:

.. code-block:: python

    >>> from simex import Simex
    >>> simex = Simex({"url": r".*?", "anything": r".*?"}, open_delimeter="[[[", close_delimeter="]]]")
    >>> regex = simex.compile("""<a href="[[[ url ]]]">[[[ anything ]]]</a>""")
    >>> regex.match("""<a href="http://www.cnn.com">CNN</a>""") is not None
    True


Matching exact strings
----------------------

By default, a simex does not require an exact match: text after the matched pattern is allowed. For example:

.. code-block:: python

    >>> from simex import Simex
    >>> simex = Simex({"url": r".*?", "anything": r".*?"})
    >>> regex = simex.compile("""<a href="{{ url }}">{{ anything }}</a>""")
    >>> regex
    re.compile(r'\<a\ href\=\".*?\"\>.*?\<\/a\>', re.UNICODE)
    >>> regex.match("""<a href="http://www.cnn.com">CNN</a> THERE IS MORE TEXT""") is not None
    True

However, if you want, simexes can be used to do exact matching. For example:

.. code-block:: python

    >>> from simex import Simex
    >>> simex = Simex({"url": r".*?", "anything": r".*?"}, exact=True)
    >>> regex = simex.compile("""<a href="{{ url }}">{{ anything }}</a>""")
    >>> regex
    re.compile(r'^\<a\ href\=\".*?\"\>.*?\<\/a\>$', re.UNICODE)
    >>> regex.match("""<a href="http://www.cnn.com">CNN</a>""") is not None
    True
    >>> regex.match("""<a href="http://www.cnn.com">CNN</a> THERE IS MORE TEXT""") is not None
    False
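
The effect of the ``^`` and ``$`` anchors shown above can be reproduced with the standard ``re`` module alone. The following is a minimal sketch, not SimEx's own code; the pattern is written by hand and the variable names are illustrative only:

```python
import re

# Hand-written pattern standing in for a compiled simex (an assumption,
# not the exact output of the library).
body = r'<a href=".*?">.*?</a>'

unanchored = re.compile(body)            # comparable to the default behaviour
anchored = re.compile("^" + body + "$")  # comparable to exact=True

link = '<a href="http://www.cnn.com">CNN</a>'
trailing = link + " THERE IS MORE TEXT"

# match() only anchors at the start, so trailing text is accepted
# unless the pattern itself ends with "$".
unanchored_allows_trailing = unanchored.match(trailing) is not None
anchored_allows_trailing = anchored.match(trailing) is not None
anchored_matches_exact = anchored.match(link) is not None
```

Here only the anchored pattern rejects the string with trailing text, while both accept the bare link.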

Matching can also treat whitespace (tabs, spaces and newlines) as interchangeable. For example:

.. code-block:: python

    >>> from simex import Simex
    >>> simex = Simex({"url": r".*?", "anything": r".*?"}, flexible_whitespace=True)
    >>> regex = simex.compile("""<a href="{{ url }}">{{ anything }}</a>""")
    >>> regex
    re.compile(r'\<a\s+href\=\".*?\"\>.*?\<\/a\>', re.UNICODE)
    >>> regex.match("""<a href="http://www.cnn.com">CNN</a>""") is not None
    True
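
To see what treating whitespace as interchangeable means in practice, here is a small sketch using only the standard ``re`` module. It assumes that a literal space in the simex becomes ``\s+`` in the compiled regex, as in the output above; the patterns and variable names are hand-written for illustration and are not produced by SimEx:

```python
import re

# Hand-written stand-ins for the compiled patterns (assumptions for illustration).
flexible = re.compile(r'<a\s+href=".*?">.*?</a>')  # space replaced by \s+
strict = re.compile(r'<a href=".*?">.*?</a>')      # literal single space

variants = [
    '<a href="x">y</a>',       # single space
    '<a\thref="x">y</a>',      # tab
    '<a \n  href="x">y</a>',   # space, newline, more spaces
]

# \s+ accepts any run of spaces, tabs and newlines between "<a" and "href".
all_match = all(flexible.match(text) is not None for text in variants)

# The strict pattern only accepts exactly one space.
strict_matches_tab = strict.match(variants[1]) is not None
```

All three variants match the flexible pattern, whereas the strict pattern rejects the tab-separated one.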


How does it work?
-----------------

SimEx simply escapes the entire simex, except for the components surrounded
by {{ and }}, which it replaces with the regular expressions defined for those
names in the dict - like "email" or "anything" or "number".
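
The escape-and-substitute idea described above can be sketched in a few lines with the standard ``re`` module. This is an illustrative approximation under stated assumptions, not SimEx's actual implementation: the function ``compile_simex_sketch`` and its parameter names are hypothetical, and it ignores options such as exact matching and flexible whitespace.

```python
import re


def compile_simex_sketch(template, regexes, open_delimiter="{{", close_delimiter="}}"):
    """Hypothetical helper: escape literal text, substitute named sub-regexes."""
    # Finds placeholders such as "{{ url }}" and captures the name inside.
    placeholder = re.compile(
        re.escape(open_delimiter) + r"\s*(\w+)\s*" + re.escape(close_delimiter)
    )
    parts = []
    position = 0
    for found in placeholder.finditer(template):
        # Literal text before the placeholder is escaped so it matches verbatim.
        parts.append(re.escape(template[position:found.start()]))
        # The placeholder itself is replaced by the regex defined for its name.
        parts.append(regexes[found.group(1)])
        position = found.end()
    # Escape whatever literal text follows the last placeholder.
    parts.append(re.escape(template[position:]))
    return re.compile("".join(parts))


sketch_regex = compile_simex_sketch(
    '<a href="{{ url }}">{{ anything }}</a>',
    {"url": r".*?", "anything": r".*?"},
)
sketch_matches = sketch_regex.match('<a href="http://www.cnn.com">CNN</a>') is not None

# The same sketch with custom delimiters.
apples_regex = compile_simex_sketch(
    "[[[ count ]]] apples", {"count": r"[0-9]+"}, "[[[", "]]]"
)
apples_matches = apples_regex.match("12 apples") is not None
```

Escaping the literal text first is what keeps the simex readable: characters such as ``?`` or ``.`` in the template are matched verbatim, and only the named placeholders behave like regular expressions.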
