Cognaize SDK

Welcome to Cognaize SDK. Cognaize SDK provides tools and functionality for creating, evaluating, and deploying models to the Cognaize platform.

Cognaize SDK provides tools for:

  • Working with Cognaize snapshots
  • Working with OCR data
  • Working with PDF files
  • Working with images

Installation

Cognaize SDK can be installed using pip:

pip install pycognaize
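
To confirm that the installation worked, the installed version can be read from the package metadata. The snippet below is a minimal sketch that uses only the Python standard library; the helper name `installed_version` is illustrative and is not part of the SDK.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "pycognaize") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_version() or "pycognaize is not installed")
```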

Development

Steps

The following steps should be followed after making changes to the codebase:

  1. Update pycognaize/__init__.py with the new version number.
  2. Update CHANGELOG.md with the new version number and a description of the changes.
  3. Run python scripts/update_badges.py to update the badges in README.md.
  4. Create docs by running the following commands:
    cd docs
    ./create_docs.sh
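
Steps 1 and 2 are easy to get out of sync, so a small consistency check can help before releasing. The following is only a sketch under two assumptions that the README does not confirm: that pycognaize/__init__.py stores the version as `__version__ = "..."`, and that CHANGELOG.md entries are headed `[x.y.z] - YYYY-MM-DD`. All function names are hypothetical.

```python
import re
from typing import Optional

# Assumed format: __version__ = "1.4.55" at the start of a line.
VERSION_RE = re.compile(r"""^__version__\s*=\s*['"]([^'"]+)['"]""", re.MULTILINE)
# Assumed format: an optional Markdown heading prefix, then "[1.4.55] - 2024-11-05".
CHANGELOG_RE = re.compile(
    r"^[# \t]*\[(\d[^\]]*)\][ \t]*-[ \t]*\d{4}-\d{2}-\d{2}", re.MULTILINE
)


def package_version(init_source: str) -> Optional[str]:
    """Extract the version string from the text of __init__.py."""
    match = VERSION_RE.search(init_source)
    return match.group(1) if match else None


def latest_changelog_version(changelog: str) -> Optional[str]:
    """Return the version of the first dated entry in the changelog text."""
    match = CHANGELOG_RE.search(changelog)
    return match.group(1) if match else None


def versions_match(init_source: str, changelog: str) -> bool:
    """True when both files name the same, non-missing version."""
    current = package_version(init_source)
    return current is not None and current == latest_changelog_version(changelog)
```

A caller would read both files from disk and pass their contents to `versions_match`, failing the release when it returns False.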
    

Have a look at the quick tutorial to understand the main concepts of the SDK.

Changelog

[1.4]

[1.4.55] - 2024-11-05

  • Updated cloudpathlib to version ~0.18.0
  • Remove transformers and langchain from model-requirements.txt
  • Add a development guide to the README.md

[1.4.54] - 2024-09-27

  • Fix load ocr for gulfim, handle string page number

[1.4.53] - 2024-09-11

  • Fix parse_raw_numeric function to handle a negative sign with multiple delimiters

[1.4.52] - 2024-06-19

  • Fix duplicate_text_for_spanned_cells=False case

[1.4.51] - 2024-06-19

  • Add duplicate_text_for_spanned_cells option in TableTag._build_df

[1.4.50] - 2024-06-06

  • Fix infer_rows_from_words function in common/utils.py, which caused the last-line bug in page.py

[1.4.49] - 2024-06-06

  • Fix bug where the last line of a page was absent from page.line

[1.4.48] - 2024-05-16

  • Loosen pymupdf requirements

[1.4.47] - 2024-05-15

  • Add calculated values to NumericField serializer method
  • Add Document.load_ocr as a substitute for Document.load_page_ocr
  • Document.load_ocr gets the ocr data from a single json file
  • Deprecate Document.load_page_ocr in favor of Document.load_ocr
  • Add data_path property to Document class

[1.4.46] - 2024-05-15

  • Add is_calculated property to NumericField class
  • Add unit tests for NumericField class

[1.4.45] - 2024-05-13

  • Add stick_coords argument to Page().load_page_ocr function

[1.4.44] - 2024-04-05

  • Add attribute mapping in Field object

[1.4.43] - 2024-04-03

  • Add requirement cloudpathlib[s3,azure,gs]~=0.16.0

[1.4.42] - 2024-04-02

  • Improve cloudpathlib integration to support Azure and Google Cloud

[1.4.41] - 2024-03-25

  • Improve Snapshot.download
    • Download snapshot without login required

[1.4.40] - 2024-03-06

  • Improve Document.fetch_document
    • Add option to provide token and api url as parameters
    • Raise if env variables are missing
    • Raise if response is invalid
    • Handle trailing forward slashes in url
  • Fix Page.draw
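
The parameter handling listed for Document.fetch_document can be sketched roughly as follows. This is not the SDK's implementation: the function name and the `API_TOKEN` variable are hypothetical, and `API_HOST` is borrowed from the 1.3.1 entry without confirmation that fetch_document reads it.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_api_settings(
    api_url: Optional[str] = None,
    token: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, str]:
    """Prefer explicit parameters, fall back to environment variables.

    Raises ValueError when a setting is missing and strips trailing
    forward slashes from the URL.
    """
    env = os.environ if env is None else env
    api_url = api_url or env.get("API_HOST")  # assumed variable name
    token = token or env.get("API_TOKEN")     # hypothetical variable name
    missing = [
        name
        for name, value in (("API_HOST", api_url), ("API_TOKEN", token))
        if not value
    ]
    if missing:
        raise ValueError(f"Missing required settings: {', '.join(missing)}")
    return api_url.rstrip("/"), token
```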

[1.4.39] - 2024-02-27

  • Handle tables in Document.get_layout_text

[1.4.38] - 2024-02-25

  • Add Document.get_layout_text method

[1.4.37] - 2024-02-23

  • Add Genie for easy model testing

[1.4.36] - 2024-02-21

  • Handle pandas warnings and 1.2.0 compatibility issues.
  • Handle M series Mac incompatibility issues.
  • Warn for missing page image/ocr information.
  • Add pyarrow as a requirement

[1.4.35] - 2024-02-21

  • Optimize pycognaize.common.utils.img_to_black_and_white
  • Test coverage 85%

[1.4.34] - 2024-01-23

  • Fix create_lines function in Page class
  • Update docs/requirements.txt
  • Update python_versions.svg

[1.4.33] - 2024-01-16

  • Remove support for python 3.8 and lower. Add support for 3.10, 3.11 and 3.12
  • Use map instead of applymap in DataFrame objects

[1.4.32] - 2024-01-10

  • Fix load_page_ocr and load_page_images

[1.4.31] - 2023-12-04

  • Add classes property to fields

[1.4.30] - 2023-11-30

  • Remove cloudstorageio from the SDK

[1.4.29] - 2023-11-06

  • Changes os.path.join for s3 paths

[1.4.28] - 2023-10-25

  • Add support for scale output

[1.4.27] - 2023-09-27

  • Add support for s3 path resolution in Windows

[1.4.26] - 2023-09-25

  • Add exclude_html to snapshot download
  • Add include option to snapshot downloader

[1.4.25] - 2023-09-13

  • Add unit test for span_field.py

[1.4.24] - 2023-09-13

  • Create snapshot downloader class

[1.4.23] - 2023-08-17

  • Add tests for exclude options
  • Add tests for cloudstorageio hooks usage
  • Update to cloudstorageio version 1.2.14

[1.4.22] - 2023-08-16

  • Add new method for document fetching

[1.4.21rc1] - 2023-08-08

  • Use new method from cloudstorageio for big snapshot download

[1.4.20] - 2023-07-30

  • Replace the text-span datatype with text_span in the input_fields handled by from_dict of the Document object

[1.4.19] - 2023-07-30

  • Update _to_dict and construct_from_raw functions in SpanField class to set value when field does not have tag

[1.4.18] - 2023-07-28

  • Update _line_values attribute in SpanField class in order to return sentences per lines

[1.4.17] - 2023-07-19

  • Update __add__ function in ExtractionTag class in order to include the text between the two BoxTag objects selected

[1.4.16] - 2023-07-14

  • Add scale property to NumericField class

[1.4.15] - 2023-07-12

  • NumericField.value will return the calculated value if no field or tag values are available

[1.4.14] - 2023-07-11

  • Add exclude options for Snapshot download

[1.4.13] - 2023-07-10

  • Add calculated values to NumericField

[1.4.12] - 2023-06-30

  • Add missing function to CloudService

[1.4.11] - 2023-06-28

  • Ignore empty table tags in langchain_loader.py

[1.4.10] - 2023-06-28

  • Separate LangchainLoader text blocks with new lines

[1.4.10] - 2023-07-13

  • Add to_text() functionality to Document class that utilizes inputs from PageLayout model to create the text version of the document

[1.4.9] - 2023-06-27

  • Add to_string() functionality to table_tags
  • Add langchain_loader util to convert Pycognaize Document objects to Langchain Document objects

[1.4.8] - 2023-06-24

  • Add field value and tag value to numeric field

[1.4.7] - 2023-06-15

  • Add ocr_tags() and line_tags() to Page

[1.4.6] - 2023-06-27

  • Add replace_nans_with_empty_html_tags functionality in HTMLTableTag build df

[1.4.5] - 2023-06-12

  • Add re-login when AWS token is expired

[1.4.4] - 2023-05-31

  • Add anytree to setup-requires in setup.cfg

[1.4.3] - 2023-05-31

  • Fix handling of section field with no tags

[1.4.2] - 2023-05-30

  • Fix handling of section field value when field does not have tags

[1.4.1] - 2023-05-29

  • Add section field and section tag functionality

[1.4.0] - 2023-05-25

  • Add classification labels functionality to Field objects

[1.3]

[1.3.14] - 2023-05-10

  • Add LinkField object
  • Add returning group name in to_dict functionality of Field object

[1.3.13] - 2023-05-01

  • Add exclude folders option for listdir in html_info
  • Update version cloudstorageio >= 1.2.8

[1.3.12] - 2023-05-01

  • Rename HTMLTag to HTMLTagABC
  • Rename TDTag to HTMLTag
  • Handle out of table tags in XBRL

[1.3.11] - 2023-04-28

  • Add interface to create directory summary hashes
  • Add automatic snapshot hash creation for snapshot.download

[1.3.10] - 2023-03-31

  • Add login command and code for submit to model registry

[1.3.9] - 2023-03-30

  • Refactor HTML._validate_path() to include try-except block

[1.3.8] - 2023-03-28

  • Improve html file path validation by adding a new check in HTML._validate_path()

[1.3.7] - 2023-03-24

  • Match (xbrl) using xpath and indices in matches function of model.py

[1.3.6] - 2023-03-19

  • Add functionality to run Model().execute_eval given the ground truth document id

[1.3.5] - 2023-03-11

  • Make Confidence key lowercase.

[1.3.4] - 2023-03-09

  • Rename classConfidence key to Confidence in enums.

[1.3.3] - 2023-03-07

  • Check that the sum of confidence values is close to 1 instead of exactly 1.
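
Floating-point rounding means a set of confidences that should add up to 1 may miss it by a tiny amount, which is why a tolerance-based check is safer than strict equality. A minimal sketch, where the function name and the tolerance value are assumptions rather than the SDK's actual choices:

```python
import math
from typing import Mapping


def confidences_sum_to_one(
    confidences: Mapping[str, float], abs_tol: float = 1e-6
) -> bool:
    """Return True when the confidence values sum to approximately 1."""
    # abs_tol is an assumed tolerance; the value used by the SDK is not stated.
    return math.isclose(sum(confidences.values()), 1.0, abs_tol=abs_tol)
```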

[1.3.2] - 2023-02-21

  • Add is_xbrl attribute in document, to identify if document is XBRL or not
  • Modify assign indices functionality to handle XBRL tables
  • Modify Model.matches() to also match with HTMLTags
  • Add HTML._validate_path() to get valid path of source.html file
  • Fix reading html file from S3
  • Add tag_id attribute to HTMLTag

[1.3.1] - 2023-02-14

  • Update install requirements in setup.cfg
  • Rename environment variable COGNAIZE_HOST to API_HOST
  • Raise error when trying to log in without API_HOST environment variable
  • Add documentation and make cosmetic changes in login.py

[1.3.0] - 2023-02-10

  • Add XBRL support
  • Add spacy to model-requirements.txt
  • Add bs4 to requirements.txt

[1.2]

[1.2.9] - 2023-01-17

  • Add class confidence functionality to tag objects

[1.2.8] - 2023-01-09

  • Update Numeric parser to handle float numbers with three or more decimal numbers
  • Add "-" character handler in numeric parser
  • Change all occurrences of 'cognaize' to 'Cognaize'

[1.2.7] - 2022-12-22

  • Change PyMuPDF version to support M1

[1.2.6] - 2022-12-06

  • Update numeric parser to better handle decimal numbers

[1.2.5] - 2022-11-28

  • Read page image height/width from document.json
  • Field raw value bug fix

[1.2.4] - 2022-11-28

  • Field object raw_value bug fix

[1.2.3] - 2022-11-16

  • Field object raw_value bug fix

[1.2.3] - 2022-11-16

  • Field object raw_value contains fields value

[1.2.2] - 2022-11-11

  • Add HTTP request timeout for genie model run

[1.2.1] - 2022-11-08

  • Add functionality for grouping fields by key

[1.2.0] - 2022-10-11

  • Modify assign indices functionality to correctly index tables located side by side

[1.1]

[1.1.4] - 2022-10-06

  • Add snapshot download to specified directory functionality
  • Add Page Section tag and field functionalities

[1.1.3] - 2022-10-05

  • Add login functionality to pycognaize

[1.1.2] - 2022-09-12

  • group_by_field returns list of fields with group_key
  • Add minor improvement for NumericParser
  • Tied fields now only return unique fields

[1.1.1] - 2022-09-05

  • Fix field grouping with non-existing key
  • Add field grouping with given Field object

[1.1.0] - 2022-08-25

  • Added multiprocessing download of images and ocr data

[1.0]

[1.0.3] - 2022-08-23

  • Change get_tied_fields, get_tied_tags, get_first_tied_field, get_first_tied_tag methods to also return python names
  • Fix get_first_tied_field_value and get_first_tied_tag_value methods to work properly after changes

[1.0.2] - 2022-08-23

  • Fix get_tied_tags method

[1.0.1] - 2022-08-22

  • Fix get_tied_fields method

[1.0.0] - 2022-08-15

  • Update signature of base Field class
  • Update constructor of SpanField class

[0.3]

[0.3.66] - 2022-08-15

  • Add tied_field and tied_tag functionality to document

[0.3.65] - 2022-08-10

  • Enhance numeric parser to handle strings like 0.01, add unittest

[0.3.64] - 2022-08-09

  • Add span field and span tag

[0.3.63] - 2022-08-04

  • Add grouping functionality for input and output fields of document

[0.3.62] - 2022-07-20

  • Update GitHub workflow to publish documentation on release

[0.3.61] - 2022-07-13

  • Read image width and height from document.json instead of loading actual images for that

[0.3.60] - 2022-07-03

  • Fix seaborn issue

[0.3.59] - 2022-07-03

  • Remove opencv and seaborn from setup.cfg requirements

[0.3.58] - 2022-06-20

  • Deprecation decorator chooses version automatically
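
As an illustration of the idea, a deprecation decorator can fall back to the package version when none is given. The code below is a generic sketch built on the standard `warnings` module; the names `deprecated`, `old_function`, and the placeholder `__version__` are hypothetical and do not reflect the SDK's actual decorator.

```python
import functools
import warnings
from typing import Callable, Optional

__version__ = "0.0.0"  # hypothetical stand-in for the package version


def deprecated(reason: str = "", version: Optional[str] = None) -> Callable:
    """Mark a function as deprecated, defaulting to the package version."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            since = version if version is not None else __version__
            message = f"{func.__name__} is deprecated since version {since}. {reason}"
            # stacklevel=2 points the warning at the caller, not at this wrapper.
            warnings.warn(message.strip(), DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)

        return wrapper

    return decorator


@deprecated(reason="Use new_function instead.")
def old_function() -> str:
    return "result"
```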

[0.3.57] - 2022-06-20

  • Fix _post_response_eval() method of Model

[0.3.56] - 2022-06-20

  • Remove OpenCV from main requirements (now in model-requirements)
  • Add deprecation and module not found warning decorators
  • Get rid of opencv dependencies
  • Update GitHub workflow to use model-requirements

[0.3.55] - 2022-06-18

  • Set logging level to debug for missing OCR or image files

[0.3.54] - 2022-06-18

  • Added for running workflows

[0.3.53] - 2022-06-18

  • Exclude tests from the package distro
  • Include white pixel in the distro
  • Remove MANIFEST.in
  • Add virtualenv to dev-requirements.txt
  • Publish in main pypi repository when running setup.sh

[0.3.52] - 2022-06-15

  • Integrate evaluation driver

[0.3.51] - 2022-06-11

  • Fix get_matching_table_cells_for_tag

[0.3.50] - 2022-06-10

  • Update documentation (logo, badges, etc.)

[0.3.49] - 2022-06-06

  • Add names to GitHub actions

[0.3.48] - 2022-06-06

  • Remove redundant information from README.md

[0.3.47] - 2022-06-06

  • Update setup.sh to create wheel and upload to pypi
  • Setup.sh performs doctests as part of the build process

[0.3.46] - 2022-06-03

  • Added tutorial about working with tags, and PDF
  • Made updates to documentation

[0.3.45] - 2022-06-03

  • Obfuscate data in pycognaize tests
  • Update names in GitHub actions

[0.3.44] - 2022-05-31

  • Create tutorial about leveraging tables in cognaize SDK
  • Add logo and favicon to documentation
  • Add supported python version badge to readme
  • Add logo to readme

[0.3.43] - 2022-05-28

  • Remove outdated modules (ocr.py and recipe.py)
  • Remove Dockerfile and outdated build scripts
  • Add dev-requirements.txt, update badges

[0.3.42] - 2022-05-27

  • Add badge generating script to show in Readme
  • Update Readme to show badge, documentation and tutorials
  • Add sphinx doctests to GitHub actions

[0.3.41] - 2022-05-25

  • Updated homepage of the documentation
  • Added versioning to RTD. (NOT FINAL. CHECKS VERSIONS FROM GIT WHICH…)
  • Added blank tutorial pages
  • Updated sidebar toc tree structure
  • Added doctests to quick tutorial
  • Added sphinx.ext.doctest
  • Updated create_docs to include new _autosummary directory and create doctest

[0.3.40] - 2022-05-20

  • Fix styling and refactor in order to pass flake8 checks

[0.3.39] - 2022-05-20

  • Add backquotes in changelog
  • Group changelog entries
  • Changelog ordered from the latest version to first
  • Add script to deploy the docs
  • Add quick_tutorial.rst file
  • Update index.rst to include links to general sections
  • Add myst-parser to docs requirements.txt
  • Change Markdown parser to myst-parser
  • Add quickstart.rst
  • Add reading .md files for changelog
  • Separate installation.rst

[0.3.38] - 2022-05-19

  • Configure sphinx for generating proper API reference
  • Create general docs outline, add generated autosummary rst files in ignore files

[0.3.37] - 2022-05-16

  • Added docstrings and type hints to modules

[0.3.36] - 2022-05-16

  • Rename the package to pycognaize

[0.3.35] - 2022-05-13

  • Added GitHub workflows for linting and testing
  • Changed docstrings in snapshot.py and lazy_dict.py

[0.3.34] - 2022-05-12

  • Fix all tests

[0.3.33] - 2022-05-11

  • Fix requirements and setup.sh

[0.3.32] - 2022-04-11

  • Add opencv-python-headless==4.0.1.23 in requirements for avoiding ImportError: cannot import name '_registerMatType' (only for usage in table_detection)

[0.3.31] - 2022-04-11

  • Bring back using opencv in bytes_to_array, string_to_array, img_to_black_and_white for stick_word_boxes functionality (only for usage in table_detection)

[0.3.30] - 2022-04-11

  • Change TableTag to take cell data, not use table dividers
  • Add tests corresponding to changed TableTag

[0.3.29.a0] - 2022-04-06

  • Fix get_table_title to return the 8 rows above the table instead of the first 8 rows of the page

[0.3.29] - 2022-04-06

  • Add numeric parser in common

[0.3.28] - 2022-03-04

  • Remove opencv from requirements, add some utils

[0.3.27] - 2022-02-10

  • Set requirement pymupdf<=1.19.4 in setup.cfg

[0.3.26] - 2022-02-04

  • Fix pymupdf conflicting version issue(1.19.4)

[0.3.25] - 2021-11-13

  • Add raw_value to TextField

[0.3.24] - 2021-11-13

  • call build_df in TableTag.__init__ to skip corrupted table tags

[0.3.23] - 2021-11-13

  • call TableTag._build_df in init so that corrupt table tags are not created on snapshot read

[0.3.22] - 2021-11-13

  • Implement TableTag.__getitem__
  • Define TableTag.raw_df
  • Cache property TableTag.df on access
  • Do not build_df in TableTag.__init__

[0.3.21] - 2021-11-12

  • Fix Document.get_table_cell_overlap (look both in input and output fields, fix the iou page check issue)

[0.3.20] - 2021-11-10

  • Add Document.metadata attribute

[0.3.19] - 2021-11-10

  • Remove assertion tests for fields and tags where applicable
  • Set all warnings to debug level in tag.py
  • Do not fail on invalid tag json data, but skip the tag (all logs are on debug level)

[0.3.18] - 2021-11-02

  • Enforce utf-8 encoding when reading document.json locally

[0.3.17] - 2021-10-29

  • Add Document.to_pdf() feature
  • Add annotate_pdf() function in document.py
  • Add Unittests for Document.to_pdf()
  • Fix requirement (fitz to pymupdf)

[0.3.16] - 2021-10-22

  • Make evaluate method of Model class abstract

[0.3.15] - 2021-10-21

  • Add handling repeating field and group cases in evaluate functionality of model

[0.3.14] - 2021-10-20

  • Rename Model.predict_based_on abstract method to Model.copy method
  • Remove predict_based_on method from test_model ExampleModel class

[0.3.13] - 2021-10-12

  • Add Model.execute_based_on_match, Model.predict_based_on, and Model._post_response methods
  • Add separate Index._store method
  • Add response_to_dict method to index class for transformation of the response to needed format - {doc_id: encoding}
  • Update execute_based_on_match method in model class to get document object for matched base document
  • Remove INDEX from Model
  • Change Index to create fields for matched document ID and confidence

[0.3.12] - 2021-10-11

  • Remove ssl verification from GET/POST requests

[0.3.11] - 2021-10-04

  • Add tests for pycognaize.common.utils.intersects

[0.3.10] - 2021-10-03

  • Fix pycognaize.common.utils.intersects

[0.3.9a24] - 2021-09-06

  • Fixed the issue of local running the tests

[0.3.9a23] - 2021-09-01

  • LazyDict.__getitem__ returns None, if reading the document fails
  • Define the return value type
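
The behavior described here can be illustrated with a small lazy mapping that loads items on access and returns None when loading fails. This is a sketch only: `LazyMapping` and its `loader` argument are hypothetical and are not the SDK's LazyDict.

```python
import logging
from collections.abc import Mapping
from typing import Any, Callable, Iterable, Iterator, Optional


class LazyMapping(Mapping):
    """Load values on access; return None when loading a known key fails."""

    def __init__(self, ids: Iterable[str], loader: Callable[[str], Any]) -> None:
        self._ids = sorted(ids)  # deterministic iteration order
        self._loader = loader

    def __getitem__(self, key: str) -> Optional[Any]:
        if key not in self._ids:
            raise KeyError(key)
        try:
            return self._loader(key)
        except Exception as exc:  # any read failure is reported as None
            logging.debug("Failed to load %s: %s", key, exc)
            return None

    def __iter__(self) -> Iterator[str]:
        return iter(self._ids)

    def __len__(self) -> int:
        return len(self._ids)
```

Unknown keys still raise KeyError, so callers can tell a missing document apart from one that failed to load.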

[0.3.9a22] - 2021-08-31

  • Add Index class for document-index abstraction
  • Add unittests for Index (93% coverage)

[0.3.9a21] - 2021-08-31

  • add return_tags functionality in get_ocr_formatted, _create_lines, search_text, extract_area_words of Page
  • add image_bytes property
  • remove assigning results of get_image() and get_ocr() to hidden attributes

[0.3.9a20] - 2021-08-26

  • Added test for execute_genie_v2

[0.3.9a19] - 2021-08-24

  • sorted self._ids for lazy_dict.py in line 18

[0.3.9a18] - 2021-08-24

  • Added test_lazy_dict.py

[0.3.9a17] - 2021-08-24

  • Added tests for document.ocr.py.

[0.3.9a16] - 2021-08-20

  • Added missing tests for table_tag.

[0.3.9a15] - 2021-08-20

  • Added tests for draw. Corrected an issue in page.py.

[0.3.9a14] - 2021-08-20

  • Corrected the issue in test_utils.py. Changed the written code that relies on the ordering of os.listdir.
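
os.listdir returns entries in arbitrary order, so code and tests that depend on a particular order should sort the result explicitly. A minimal sketch of that pattern (the helper name is illustrative, not taken from the SDK):

```python
import os
import tempfile
from typing import List


def list_files_sorted(directory: str) -> List[str]:
    """Return directory entries in a deterministic, sorted order."""
    return sorted(os.listdir(directory))


if __name__ == "__main__":
    # Create a few files in a temporary directory and list them in order.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("b.json", "a.json"):
            open(os.path.join(tmp, name), "w").close()
        print(list_files_sorted(tmp))
```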

[0.3.9a13] - 2021-08-20

  • Add tests for Page (90% coverage)

[0.3.9a12] - 2021-08-20

  • Correct tag Euclidean distance method by changing the private variables into public

[0.3.9a11] - 2021-08-18

  • Added missing test methods for Tag, utils, Cell, AreaField, Document, and Field (89% coverage).

[0.3.9a10] - 2021-08-16

  • Fix tag Euclidean distance method

[0.3.9a9] - 2021-08-13

  • Added area argument to page.search_text() to specify scope

[0.3.9a8] - 2021-08-13

  • Add image_arr, image_height, image_width, ocr_raw properties

[0.3.9a7] - 2021-08-11

  • Added getter and setter for field group_key

[0.3.9a6] - 2021-08-10

  • Add an option for image size in page.draw() and set a larger size as default
  • Add OS specific behavior for preview_img
  • Remove unnecessary exceptions

[0.3.9a5] - 2021-08-10

  • Add tag euclidean distance method

[0.3.9a4] - 2021-08-09

  • Add evaluation and unittests including content only metrics
  • Add ConfusionMatrix and heatmap drawing function

[0.3.9a3] - 2021-08-08

  • Add EnvConfigEnum.SNAPSHOT_ID
  • get snapshot_path using snapshot_id
  • refactor LOCAL_SNAPSHOT_PATH to SNAPSHOT_PATH
  • Update snapshot.py tests

[0.3.9a2] - 2021-08-08

  • page.search_text() did not find certain substrings present in page.free_form_text(). Found two reasons for this behavior.
    • The list of OCR data passed to find_first_word_coords was page.ocr('words'), which has its entries sorted by word_id. First, this makes the sort flag of the function obsolete. Second, it leads to cases in which the coordinates of sub-strings from page.free_form_text() cannot be found, because the order of the words in page.free_form_text() is unrelated to word_id. For example, a sub-string of page.free_form_text() might be “brown fox”, with the word_id of “brown” equal to 3 and that of “fox” equal to 8. In this case find_first_word_coords would not find the coordinates, because it would break out of the for-loop when the word with word_id 4 is not “fox”. This was fixed by passing the OCR data in the same order as in page.free_form_text(), while still offering the option to sort it by word_id using the sort flag.
    • Inside find_first_word_coords, the words of the sub-string were always put through a cleanup regex before being compared to the ocr_text (which was not cleaned up if the clean flag was set to false). This led to cases in which a sub-string such as “Phone: 12345” would not be found, because “Phone:” would be cleaned up to “Phone”. This was fixed by putting either both the words of the sub-string and the values of ocr_text through the cleanup regex or neither of them, depending on the clean flag.
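
The two fixes above can be illustrated with a simplified search over OCR words. This is a sketch of the idea, not the SDK's find_first_word_coords: the function name, the cleanup pattern, and the dictionary layout (keys `word_id` and `ocr_text`) are assumptions made for the example.

```python
import re
from typing import Dict, List, Optional, Sequence

_CLEANUP = re.compile(r"[^\w\s]")  # hypothetical cleanup pattern


def _normalize(text: str, clean: bool) -> str:
    """Apply the cleanup regex only when the clean flag is set."""
    return _CLEANUP.sub("", text) if clean else text


def find_word_sequence(
    ocr_words: Sequence[Dict],
    query: str,
    clean: bool = False,
    sort_by_id: bool = False,
) -> Optional[List[Dict]]:
    """Return the first run of consecutive OCR words matching `query`."""
    # Keep the given (reading) order unless sorting by word_id is requested.
    words = sorted(ocr_words, key=lambda w: w["word_id"]) if sort_by_id else list(ocr_words)
    # The query and the OCR text are normalized the same way, or not at all.
    targets = [_normalize(token, clean) for token in query.split()]
    if not targets:
        return None
    texts = [_normalize(w["ocr_text"], clean) for w in words]
    size = len(targets)
    for start in range(len(texts) - size + 1):
        if texts[start:start + size] == targets:
            return words[start:start + size]
    return None
```

With words supplied in reading order, “brown fox” is found even when the two word_id values are not consecutive, and “Phone: 12345” is found with the clean flag both on and off.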

[0.3.9a1] - 2021-08-06

  • Implement and/or update all tests for 0.3.7 versions (ALL TESTS PASS)
  • Optimize imports

[0.3.8a5] - 2021-08-03 (0.3.8 versions do not include the changes in 0.3.7 versions)

  • Fix page.search_text()

[0.3.8a4] - 2021-07-18 (0.3.8 versions do not include the changes in 0.3.7 versions)

  • Add Model.evaluate
  • Fix changelog wrong years (Incorrect 2020 years changed to 2021)

[0.3.8a3] - 2021-06-24 (0.3.8 versions do not include the changes in 0.3.7 versions)

  • Fix table divider offsets and interruption coordinates (FIXES THE BUG FROM 0.3.8a2)

[0.3.8a2] - 2021-06-24 (0.3.8 versions do not include the changes in 0.3.7 versions)

  • Fix table divider offsets and interruption coordinates (BUGGED VERSION)

[0.3.8a1] - 2021-05-23 (0.3.8 versions do not include the changes in 0.3.7 versions)

  • Fix Tag.intersects method

[0.3.7a9] - 2021-07-07

  • AreaField will raise a warning if the input value field is not a string and set it to empty string (if the field has no tags)

[0.3.7a8] - 2021-05-30

  • Fix AreaField.value

[0.3.7a7] - 2021-05-26

  • Fix get item in lazy_dict (required path)

[0.3.7a6] - 2021-05-23

  • Fix Tag.intersects method

[0.3.7a5] - 2021-05-13

  • Fix NaN issue in execute_genie_v2 post request json

[0.3.7a4] - 2021-05-10

  • Fix table cell value population issue

[0.3.7a3] - 2021-05-09

  • If page image or ocr files cannot be found, use empty ocr data or a single white pixel image instead

[0.3.7a2] - 2021-05-08

  • Add execute_genie_v2 for executing genie with airflow

[0.3.7a1] - 2021-05-08

  • Page object uses absolute path and allows lazy-loading from cloud (all tests pass)
  • Adjust all filename conventions to work with original image/ocr storage name conventions

[0.3.6] - 2021-03-29

  • Adjust tests to account for the Range margin (all tests pass)

[0.3.6a4] - 2021-03-05

  • Modify the margin in Range.to_dict, set margin to 0.15

[0.3.6a3] - 2021-03-05

  • Add margin to table cells in TableTag._build_cell
  • Fix ordering information in digester
  • Fix typing annotation for OrderedDict output
  • Change document.x and document.y into OrderedDict, use global ordering in digester

[0.3.6a2] - 2021-02-24

  • Add tests for Tag/ExtractionTag (coverage 87%)

[0.3.6a1] - 2021-02-24

  • Fix group_key typing annotation in Field objects

[0.3.5] - 2021-02-23

  • Store pdf in the snapshot

[0.3.4] - 2021-02-22

  • Add group_key optional argument to Field objects

[0.3.3] - 2021-02-16

  • All tests pass (84% coverage)
  • Snapshot creator with threading
  • Use mongomock for DB tests

[0.3.2] - 2021-02-09

  • Update tests for storage

[0.3.1a4] - 2021-02-03

  • Add threading to SnapshotBuilder

[0.3.1a3] - 2021-02-02

  • Fix test_digestor.py to work with the correct document bson file
  • All tests pass (82% coverage)

[0.3.1a2] - 2021-02-02

  • Set DB.find call arguments no_cursor_timeout=True, batch_size=10 in SnapshotBuilder in order to avoid CursorNotFound timeout errors

[0.3.1a1] - 2021-01-29

  • Add lines, search_text, extract_area_words methods to Page
  • Add unittests for lines, search_text and extract_area_words methods
  • Add infer_rows_from_words, clean_ocr_data, find_first_word_coords, intersects, compute_intersection_area methods in utils.py

[0.3.1a0] - 2021-01-28

  • Add unittest for Document.from_dict
  • Optimize digester output_fields lookup
  • Add stick_coords option to Page.get_ocr_formatted
  • Add opencv requirement

[0.3.0b20] - 2021-01-16

  • Assign document in execute_genie when calling model.predict

[0.3.0b19] - 2021-01-12

  • Assigning recipe output fields in digester for better performance

[0.3.0b18] - 2021-01-06

  • Update digester to set id-s of the original fields from the blueprint
  • Change document.tag imports to relative

[0.3.0b17] - 2021-01-03

  • Model.predict in Model.execute_genie uses positional arguments

[0.3.0b16] - 2021-01-03

  • Load LazyDocumentDicts as bson
  • Make sure page numbers are integers

[0.3.0b15] - 2021-01-03

  • Fix SnapshotBuilder.save_doc_json_to_snapshot document_id key

[0.3.0b14] - 2021-01-03

  • Update Document.to_dict document_id key in metadata

[0.3.0b13] - 2021-01-03

  • Add field name and ID in Field.to_dict implementations

[0.3.0b12] - 2021-01-03

  • Add pages argument to construct_from_raw

[0.3.0b11] - 2021-01-03

  • Fix Document.from_dict typo (construct_from_raw method call)

[0.3.0b10] - 2021-01-03

  • Define Field data_types in to_dict methods
  • Add area to IqDataTypesEnum

[0.3.0b9] - 2021-01-03

  • Move FieldMapping to field.__init__
  • Fix circular import in fields
  • Use super().to_dict() in Field objects

[0.3.0b8] - 2021-01-03

  • Fix Document.from_dict page iteration
  • Add field types to Field.to_dict methods
  • TableField allows no tags when calling to_dict method

[0.3.0b7] - 2021-01-02

  • Update Dockerfile entrypoint to pycognaize.app.rest
  • Fix SnapshotBuilder.create_document_zip cls.DB assignment expression

[0.3.0b6] - 2021-01-02

  • Update changelog, fix all versions to 0.3.0b6

[0.3.0b4] - 2021-01-02

  • Update all tests (75% coverage)
  • Merge branch 'master' into major_refactor
  • Add Snapshot to pycognaize.__init__
  • Use scandir in DocumentBuilder._populate_pages
  • Update import statement for Mapping
  • Fix SnapshotBuilder to work with new Snapshot class
  • Remove FieldMapping from DocumentBuilder, use a separate module instead
  • Snapshot uses lazy_dict for reading individual documents
  • Use tempfile module in model.py
  • Add doc_file (document.json) to SnapStorageEnum
  • Add AreaField to field.__init__
  • Add to_dict and from_dict methods to Document class
  • Add test coverage in tox
  • Merged in add_test_for_overwriting_snapshot_in_s3 (pull request #47)
  • pull updates from major refactor and merge with current branch, remove test_service, add test_store_snapshot_with_same_name
  • Merge branch 'major_refactor' into add_test_for_overwriting_snapshot_in_s3
  • add unittests to test snapshot overwriting, change overwriting log message
  • Allow 'from pycognaize import Model'
  • Add doctests to text_field module
  • Update sphinx conf.py
  • Update build_docs.sh to also build pdf documentation file
  • Update setup.sh logs
  • Move all setup configurations from setup.py to setup.cfg
  • build_docs.sh generated html and pdf documentations
  • Add doc/source/generated/ to .*ignore files

[0.3.0b3] - 2020-12-28

  • Add script for building sphinx docs
  • Minor docstring changes to Tag hshift and vshift methods
  • SnapshotBuilder.DB added only on function call, to speed up module imports
  • Add a single doctest to TextField constructor
  • Add simplejson to requirements
  • Add Model.execute_genie method

[0.3.0b1] - 2020-12-27

  • Major refactored version
  • Document > DocumentBuilder (DocumentBuilder has no instance, only methods for creating Documents, which are now equivalent to DocumentDataclass objects)
  • DocumentDataclass > Document
  • SnapshotProcessor > SnapshotBuilder
  • Changed folder structure (no services package)
  • DataSnapshot > Snapshot, DataRecipe > Recipe
  • Add tox configuration for py36, py37, py38, py39, pypy
  • Add 'MANIFEST.in' (required for 'TOX' to run properly)
  • Add 'setup.cfg' for 'pytest'
  • './setup.sh' builds and pushes a version, only if no tests fail
  • Update README.md
  • All tests pass on py36, py37, py38, py39, pypy

[0.2]

[0.2.5a4] - 2020-12-18

  • Add docker push command in build.sh

[0.2.5a3] - 2020-12-17

  • Change rest service to threaded=False

[0.2.5a2] - 2020-12-15

  • Checkpoint version

[0.2.5a1] - 2020-12-15

  • Delete numpy from req-s

[0.2.5a0] - 2020-12-04

  • Update _build_df method in TableTag

[0.2.4] - 2020-11-27

  • Fix srcFieldId log in Document.get_fields_by_id to print field id instead of the whole field

[0.2.3] - 2020-11-27

  • Fix src_field_id issue in digestor.py

[0.2.2.a7] - 2020-11-27

  • All tests are fixed and running

[0.2.2.a6] - 2020-11-18

  • Fix issue in get_ocr_formatted

[0.2.2.a5] - 2020-11-13

  • Add setup.sh script
  • Use cloudstorageio>=1.1.2 which supports uploading 5GB+ files to s3
  • Do not store TableTag ocr (makes the pickle dumps way too big for documents with many tables)

[0.2.2.a4] - 2020-11-07

  • Fix typing annotation for DocumentDataclass._pages
  • Add property AreaField.value
  • Change super().tags to self.tags

[0.2.2.a3] - 2020-11-01

  • Do not store TableTag.df, build it on call
  • Update Range unittest (remove unnecessary error raising test cases)

[0.2.2.a2] - 2020-11-01

  • Add area, height, width to (Cell)Range objects
  • Add support for comparing Tag and (Cell)Range objects in magic methods of Tag
  • In create_document_zip, if the recipe retrieved from DB is empty, raise a ConnectionError
  • Add get_table_cell_overlap to DocumentDataclass

[0.2.2.a1] - 2020-10-31

  • Initiate database_setup on function call instead of import statement

[0.2.2.a0] - 2020-10-14

  • Adjust problematic OCR in Page.get_ocr_formatted (if left >= right, right = left + 1, same for top/bottom)
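
The adjustment rule stated in this entry can be written as a small helper. The function name and the tuple layout are illustrative; only the rule itself (if left >= right, right = left + 1, and the same for top/bottom) comes from the changelog.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (left, right, top, bottom)


def fix_degenerate_box(left: float, right: float, top: float, bottom: float) -> Box:
    """Ensure the box has positive width and height."""
    if left >= right:
        right = left + 1
    if top >= bottom:
        bottom = top + 1
    return left, right, top, bottom
```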

[0.2.1.a3] - 2020-10-28

  • Optionally build df for TableTag using ocr data

[0.2.1.a2] - 2020-10-22

  • Add _build_df method in TableTag

[0.2.1.a1] - 2020-10-08

  • Catch validation errors for TableTag

[0.2.1.a0] - 2020-09-23

  • Adjust coordinates smaller than 0 and bigger than 100
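
One plausible reading of this entry is that out-of-range coordinates are clamped to the range 0 to 100. The sketch below shows that interpretation only; the changelog does not say how the values are adjusted, and the function name is hypothetical.

```python
def clamp_coordinate(value: float, low: float = 0.0, high: float = 100.0) -> float:
    """Clamp a coordinate into [low, high] (assumed adjustment rule)."""
    return max(low, min(high, value))
```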

[0.2.0.a12] - 2020-09-14

  • Always remove snapshot zip before creating a new one

[0.2.0.a11] - 2020-09-06

  • Update digester template check

[0.2.0.a10] - 2020-08-20

  • Use proper relative_path in populate_pages
  • Add to_dict in AreaField

[0.2.0.a9] - 2020-08-20

  • Include data_recipe in DataSnapshot

[0.2.0.a8] - 2020-08-20

  • Add value argument in construct_from_raw method in DateField

[0.2.0.a7] - 2020-07-30

  • Filter fields with repeat_parent instead of source field id

[0.2.0.a6] - 2020-07-28

  • SnapshotProcessor raises error in create_document_zip if document not found in the DB

[0.2.0.a5] - 2020-07-28

  • Attribute value of TextField and DateField are strings

[0.2.0.a4] - 2020-07-27

  • Add to_dict method to DateField

[0.2.0.a3] - 2020-07-24

  • Fix construct_from_raw in ExtractionTag
  • Implement to_dict for numeric_field

[0.2.0.a2] - 2020-07-24

  • Add generated ObjectId-s in to_dict methods, update readme

[0.2.0.a1] - 2020-07-23

  • Add additional include-services argument to setup.py

[0.2.0.a0] - 2020-07-23

  • Add make_document_snap endpoint
  • Implement DbStorage
  • database_setup raises an error if failed
  • Rename rest.app to service.rest
  • Implement to_dict method for TableField, TableTag and cell Range objects
  • Add template_ids property in DataRecipe
  • Add IqTableDividerEnum and update fields of other enums, including IqTableTagEnum
  • Change cloudstorageio to version 1.0.10 in requirements
  • Add digest_results function
  • Add Storage abstract class
  • Add to_dict methods in ExtractionTag and TextField
  • Add IqFieldKeyEnum and add to_dict abstractmethods in Tag and Field abstract classes
  • Update readme to use nosetests command
  • Add abstract method decorators to methods in Tag, Field and Model abstract classes
  • Make package exclusion dynamic in setup.py

[0.1]

[0.1.9.alpha1] - 2020-07-16

  • Fix get_ocr_formatted in Page class
  • Model object's predict method takes document_dataclass as input
  • Major cosmetic change, full PEP8 compliance, except max line length
  • Remove spreading_document, build and tag_utils modules

[0.1.9.alpha] - 2020-07-15

  • Add Model abstract class
  • Update README.md packaging instructions and setup.py
  • Update tests, comment DocumentDataclass tests, until proper setUp/tearDown is implemented
  • Update SnapshotProcessor import in the rest.app
  • Remove unnecessary methods from Document, remove outdated tests
  • Add AreaField
  • Adjustments after renaming an IqTableEnum to IqCollectionEnum
  • TableTag optionally uses raw cell data to construct cell ranges, change keys to 1-based index
  • Change repr and str methods in TableField
  • Add IqCellKeyEnum and IqTableFieldEnum

[0.1.8.alpha] - 2020-07-15

  • Update make_snap endpoint to work with SnapshotProcessor, add fetch_document_zip endpoint
  • Add SnapshotProcessor
  • Add instruction for pushing to fury and pip install
  • Minor cosmetic changes in ocr.py
  • Remove source_id from Document object constructor
  • Make tag parameter optional in TableField
  • Remove raw parameter from DataSnapshot constructor
  • Add collections to IqDatasetKeyEnum
  • Remove CellTag

[0.1.7.alpha] - 2020-07-14

  • repr for Range includes the value
  • Add methods for constructing tags and fields from raw dictionaries
  • Add OCRData and modify build_cell function using dividers
  • Implement build ranges
  • Cell ranges added
  • Remove an outdated comment from DocumentDataclass
  • Add TableDivider object
  • Add test for DocumentDataclass
  • Modify constructors for Field and Tag objects
  • Add IqTagKeyEnum to enums
  • Document.get_y returns a list of fields instead of a single field
  • Add get_ocr_formatted method in Page
  • Add src_field_id to IqDocumentKeysEnum
  • Move database_setup into a separate module

[0.1.6.alpha] - 2020-07-03

  • Store relative path in page objects, remove SNAP_STORAGE_PATH constant and use env variable everywhere, fix some typos in TODOs
  • Add document_src, document_id and pages as properties in DocumentDataclass
  • Improve DataSnapshot api, add document_dataclasses and get methods
  • Images and ocr folders are document ids instead of document src
  • Pickled snapshot is saved as snap.pickle instead of snap..pickle
  • Add StoredSnapshotException
  • Add src attribute in document, put NoneField in a separate module
  • Fix Page repr

[0.1.4.alpha] - 2020-07-02

  • Improve log format
  • Redefine Snapshot, return traceback if rest.app fails
  • Store images and ocr with snapshot
  • Change pymongo to pymongo[srv] in requirements, fix DB name splitting in iq.__init__
  • Remove db dependency from documents
  • Rename Snapshot to DataSnapshot, Pipeline to DataPipeline, Recipe to DataRecipe
  • Add test for DocumentDataclass
  • Add load_bson_by_path in utils
  • Add document and recipe bsons to test resources

[0.1.3.alpha] - 2020-06-26

  • Add DocumentDataclass
  • Add get_x, get_y methods to Document
  • Rename DataRecipe to Recipe, add input_fields, output_fields attributes to Recipe
  • Rename snap to snapshot
  • Add raw_field_type in IqRecipeEnum
  • Cosmetic changes in DateField
  • Add NoneField

[0.1.2.alpha] - 2020-06-10

  • Use rest endpoint for get_data in DataSnap
  • Change defaults for SNAP_STORAGE_PATH and DEFAULT_DB_URL, add back DEFAULT_DATASNAP_ENDPOINT
  • Fix TypeError message in DataPipe.update
  • Add IQ_SNAP_STORAGE_PATH back to EnvConfigEnum
  • Add pydrive to requirements in order to solve issue with cloudstorageio, should be changed once cloudstorageio is updated

[0.1.1] - 2020-06-08

  • Modify rest.app to make snapshots through a shared volume
  • Implement DataSnap.create and DataSnap initialization through a shared volume
  • Modify DataPipe in order to be inherited by DataSnap
  • Add cloudstorageio to requirements
  • Add SnapshotPathMissingException and SnapshotExistsException
  • Remove IQ_DATASNAP_URL and add IQ_SNAP_STORAGE_PATH to EnvConfigEnum
  • Remove igraph dependencies from Dockerfile
  • Add DATASET_TYPE to DataSnap

[0.1.0] - 2020-05-28

  • Allow to import dataset types from pycognaize.datasets
  • Add DataSnap
  • Refactor the package structure

[0.0]

[0.0.9] - 2020-05-26

  • Add Dockerfile and build.sh for building docker image with hash

[0.0.8] - 2020-05-24

  • Fix DEFAULT_DB error handling, add rest API for snap

[0.0.7] - 2020-05-23

  • Add raw and construct_from_raw methods
  • Add data snap serialization and deserialization

[0.0.6] - 2020-05-22

  • Fix parse_periods in SpreadingDocument

[0.0.5] - 2020-05-22

  • Change df property to return CellTag objects

[0.0.4] - 2020-05-22

  • Add get_document_periods method
  • Add df property to TableTag
  • Inherit SpreadingDocument class from Document

[0.0.3] - 2020-05-17

  • Change __repr__ in Field object, add TODO for document name column in DataRecipe
  • Define a DEFAULT_DB object, with a single env variable - IQ_DB_URL

[0.0.2] - 2020-05-15

  • Created DataRecipe

[0.0.1]

  • Created Tag, Field, Page and Document abstractions for pycognaize
