GeOCR is a Python library for OCR, data cleaning, and historical geocoding using custom GeoJSON maps.

GeoOCR

GeoOCR is a Python library for automating OCR on historical archive images and extracting geospatial data. It provides:

  • ArchiveImage: image preprocessing, splitting and OCR.
  • ArchiveCollection: batch operations on collections of images.
  • ArchiveGeolocator: fuzzy-match CSV addresses against a GeoJSON reference to geolocate.
  • GeoExporter: export results to CSV or GeoJSON.

We recommend using it with Google Colab or Jupyter Notebook.

Note: This project was developed for a specific university research project and is not guaranteed to be maintained.


Installation

pip install geocr

Or install from source:

git clone https://github.com/joshuavachon25/geocr-py.git
cd geocr-py
pip install .
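
The API reference describes ocr() as running Tesseract, so the Tesseract executable presumably has to be installed separately. This is an assumption: this README does not list system dependencies. A minimal sketch for checking that the binary is on the PATH before starting a long batch (the helper name tesseract_available is hypothetical and not part of geocr):

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if the given executable can be found on the PATH."""
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found at", shutil.which("tesseract"))
    else:
        # Assumption: OCR needs the system binary, so warn early.
        print("Tesseract not found; OCR calls will likely fail until it is installed.")
```

The Quickstart uses language='fra', so the French language data would presumably also need to be installed alongside Tesseract.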

Quickstart

Normal usage involves creating a collection, batch-processing what you can, then applying specific pipelines to individual items. After that, you perform OCR, extract and correct the data, and generate a CSV file that can be used with the Geolocator along with GIS data.

from geocr.archive_image import ArchiveImage
from geocr.archive_collection import ArchiveCollection
from geocr.archive_geolocator import ArchiveGeolocator
from geocr.geo_exporter import GeoExporter
from shapely.geometry import Point
import cv2

# 1. Process an image
img = ArchiveImage(cv2.imread("/path/to/image.png"), name="example")
img.sharpen().binarize().ocr(language='fra', config='--psm 3')
print(img.ocr_text)

# 2. Batch process a folder 
col = ArchiveCollection("/path/to/images")
col.sharpen().binarize().ocr()
all_text = col.get_ocr_text()

# 3. Geocode addresses
locator = ArchiveGeolocator("rues.geojson", default_point=Point(0,0))
gdf = locator.geocode_csv(
    csv_path="adresses.csv",
    csv_fields=["No","Adresse"],
    ref_fields=["adresses"],
    similarity_threshold=85
)

# 4. Export results
GeoExporter.to_csv(gdf, "results.csv")
GeoExporter.to_geojson(gdf, "results.geojson")

API Reference

ArchiveImage

ArchiveImage(image: np.ndarray, name: str)

Constructor: wraps an OpenCV image.

Methods:

  • show(grid=False, grid_step=100, grid_color='#FF0000', subgrid_step=20, subgrid_color='#999999', title='Preview', max_size=None): display the image with an optional grid.
  • save(output_dir: Path, format='png', prefix='', suffix='') -> Path: save the image to disk.
  • split(template: str, orientation='vertical') -> list[ArchiveImage]: split the image by ratios and return new ArchiveImage objects.
  • rotate(angle: float): rotate the image by an angle in degrees.
  • mask(bottom=0, right=0, top=0, left=0, gray=255, color=(255,255,255)): mask the borders with a color.
  • invert(): invert the image colors.
  • denoise(kernel=1, iterations=1), erode(kernel=1, iterations=1), dilate(kernel=1, iterations=1): morphological operations.
  • binarize(mode='auto', threshold_value=127): convert to black and white with Otsu or a manual threshold.
  • sharpen(): apply an unsharp mask.
  • remove_borders(): crop to the largest contour.
  • add_borders(top=20, bottom=20, left=20, right=20, color=(255,255,255)): add a uniform border.
  • ocr(language='fra', config='--psm 3') -> ArchiveImage: run Tesseract OCR.
  • get_ocr_text() -> str: display and return the OCR text.
  • clean(rules=None, min_line_length=None): apply regex rules and merge short lines.
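
To make the difference between the two binarize modes concrete, here is a pure-Python sketch of Otsu thresholding next to a fixed threshold. It is an illustration of the technique only, not geocr's code (which presumably delegates to OpenCV). The function names are hypothetical, and treating any mode other than 'auto' as "manual" is an assumption.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold (0-255) for a list of 8-bit gray values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of their gray values
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor); Otsu maximizes it.
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, mode="auto", threshold_value=127):
    """Map each pixel to 0 or 255 using Otsu ('auto') or a fixed threshold."""
    t = otsu_threshold(pixels) if mode == "auto" else threshold_value
    return [255 if p > t else 0 for p in pixels]


if __name__ == "__main__":
    sample = [10, 12, 11, 200, 210, 205]  # dark ink values and bright paper values
    print(otsu_threshold(sample))
    print(binarize(sample))
```

Otsu picks the threshold that best separates the two clusters of gray values, which is why it tends to suit scans with uneven exposure better than a single hand-picked value.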

ArchiveCollection

ArchiveCollection(input_path: str, is_natural_sort=True, split_template=None, split_orientation='vertical')

Constructor: loads images from folder.
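
The is_natural_sort flag suggests that file names are ordered the way a person would read them. The sketch below shows what natural sorting generally means. It is not taken from geocr, and natural_key is a hypothetical helper.

```python
import re


def natural_key(name: str):
    """Split a name into text and integer chunks so 'page10' sorts after 'page2'."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", name)]


names = ["page10.png", "page2.png", "page1.png"]
lexicographic = sorted(names)               # plain string comparison
natural = sorted(names, key=natural_key)    # digit runs compared as numbers

if __name__ == "__main__":
    print(lexicographic)
    print(natural)
```

With plain string sorting, "page10.png" lands before "page2.png", which would scramble the page order of a scanned register.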

Methods:

  • get(name: str) -> ArchiveImage: return the ArchiveImage with the given name.
  • remove(image_name: str): remove an image from the collection.
  • update(original_name: str, new_images: list[ArchiveImage]): replace the image named original_name with one or more updated ArchiveImage objects.
  • subset(imgs: list[str]) -> ArchiveCollection: create a new ArchiveCollection from a list of image names.
  • split_only(image_name: str, template: str, orientation='vertical'): split one image.
  • split_all(template: str, orientation='vertical'): split every image.
  • show(title='Mosaic', max_cols=6, max_size=None): display a grid of images.
  • copy() -> ArchiveCollection: deep copy the collection.
  • save(output_dir: Path, format='png', prefix='', suffix='') -> list[Path]: write the images to disk.
  • sharpen(), rotate(angle), mask(...), invert(), denoise(), erode(), dilate(), binarize(), remove_borders(), add_borders(...), ocr(), clean(): shortcuts that apply the corresponding ArchiveImage operation to every image.
  • get_ocr_text(separator='\n', show=True): concatenate the ocr_text of every ArchiveImage, joined by separator, and display the result.
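
The README does not spell out the format of the rules argument of clean() or how short lines are merged. The sketch below is therefore one plausible reading, not the library's behavior. It assumes rules are (pattern, replacement) pairs and that a line shorter than min_line_length is appended to the previous line. It then joins the cleaned pages with a separator, as get_ocr_text is described as doing.

```python
import re


def clean_text(text, rules=None, min_line_length=None):
    """Apply (pattern, replacement) regex rules, then merge short lines upward."""
    for pattern, replacement in rules or []:
        text = re.sub(pattern, replacement, text)
    # Drop blank lines and surrounding whitespace (an assumption of this sketch).
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if min_line_length is None:
        return "\n".join(lines)
    merged = []
    for line in lines:
        if merged and len(line) < min_line_length:
            merged[-1] = merged[-1] + " " + line  # attach fragment to previous line
        else:
            merged.append(line)
    return "\n".join(merged)


pages = ["12  rue St-Paul\nO.\n14 rue St-Paul", "3 rue Notre-Dame"]
rules = [(r"[ \t]+", " ")]  # collapse runs of spaces and tabs
cleaned = [clean_text(p, rules, min_line_length=4) for p in pages]
all_text = "\n".join(cleaned)  # same idea as separator='\n'

if __name__ == "__main__":
    print(all_text)
```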

ArchiveGeolocator

ArchiveGeolocator(reference_geojson_path: str, default_point: Point = Point(0,0))

Constructor: loads GeoJSON reference.

gdf = geolocator.geocode_csv(
    csv_path: str = None,
    gdf: GeoDataFrame = None,
    csv_fields: list[str] = ['Adresse'],
    ref_fields: list[str] = ['nom_voie'],
    output_geometry: str = 'geometry',
    match_label: str = 'matched_ref',
    score_label: str = 'match_score',
    similarity_threshold: float = 80,
    sep: str = ';'
) -> GeoDataFrame
  • If gdf is provided, it is reused; otherwise the CSV at csv_path is loaded.
  • Only rows whose geometry is missing or equal to default_point are geocoded.
  • Returns a GeoDataFrame with the original columns plus the output_geometry, match_label, and score_label columns.
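
The matching logic described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not geocr's implementation. It uses difflib.SequenceMatcher as a stand-in scorer (the README does not say which fuzzy matcher the library uses), plain coordinate tuples instead of shapely Points, and made-up reference streets and coordinates.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score between two strings, ignoring case."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(query, candidates, threshold=80.0):
    """Return (candidate, score), or (None, score) if the best score is too low."""
    scored = [(c, similarity(query, c)) for c in candidates]
    name, score = max(scored, key=lambda pair: pair[1])
    return (name, score) if score >= threshold else (None, score)


DEFAULT_POINT = (0.0, 0.0)
# Hypothetical reference data standing in for the GeoJSON file.
reference = {"Rue Saint-Paul": (-73.55, 45.50), "Rue Notre-Dame": (-73.56, 45.51)}
rows = [
    {"Adresse": "rue Saint Paul", "geometry": None},
    {"Adresse": "Rue Notre-Dame", "geometry": (-73.56, 45.51)},  # already located
    {"Adresse": "boulevard inconnu", "geometry": DEFAULT_POINT},
]

for row in rows:
    # Only rows without a usable geometry are (re)geocoded.
    if row["geometry"] not in (None, DEFAULT_POINT):
        continue
    name, score = best_match(row["Adresse"], list(reference), threshold=80)
    row["matched_ref"] = name
    row["match_score"] = round(score, 1)
    row["geometry"] = reference[name] if name else DEFAULT_POINT

if __name__ == "__main__":
    for row in rows:
        print(row)
```

Because rows that fall below the threshold keep the default point, running the same step again with a lower similarity_threshold or corrected addresses would only revisit the rows that are still unresolved.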

GeoExporter

GeoExporter.to_csv(gdf: GeoDataFrame, path: str, sep: str = ';')

Export attribute table to CSV (drops geometry).

GeoExporter.to_geojson(gdf: GeoDataFrame, path: str)

Export full GeoDataFrame to GeoJSON.
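
To show the expected shape of the two outputs, here is a standard-library sketch that writes the attribute table as semicolon-separated CSV without the geometry column, and the full records as a GeoJSON FeatureCollection. The real GeoExporter probably works on a GeoDataFrame and its internals are not documented here, so the functions and sample records below are hypothetical.

```python
import csv
import io
import json

records = [
    {"No": "12", "Adresse": "rue Saint-Paul", "geometry": (-73.55, 45.50)},
    {"No": "3", "Adresse": "rue Notre-Dame", "geometry": (-73.56, 45.51)},
]


def to_csv_text(rows, sep=";"):
    """Serialize attribute columns only; the geometry column is dropped."""
    fields = [k for k in rows[0] if k != "geometry"]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, delimiter=sep,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_geojson_text(rows):
    """Build a GeoJSON FeatureCollection of Point features (x, y order)."""
    features = [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": list(r["geometry"])},
            "properties": {k: v for k, v in r.items() if k != "geometry"},
        }
        for r in rows
    ]
    return json.dumps({"type": "FeatureCollection", "features": features})


if __name__ == "__main__":
    print(to_csv_text(records))
    print(to_geojson_text(records))
```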


License

This project is licensed under the MIT License. See the LICENSE file for details.
