GeOCR is a Python library for OCR, data cleaning, and historical geocoding using custom GeoJSON maps.

GeoOCR

GeoOCR is a Python library for automating OCR on historical archive images and extracting geospatial data. It provides:

  • ArchiveImage: image preprocessing, splitting and OCR.
  • ArchiveCollection: batch operations on collections of images.
  • ArchiveGeolocator: fuzzy-match CSV addresses against a GeoJSON reference to geolocate.
  • GeoExporter: export results to CSV or GeoJSON.

We recommend using it with Google Colab or Jupyter Notebook.

Note: This project was developed for a specific university research project and is not guaranteed to be maintained.


Installation

pip install geocr

Or install from source:

git clone https://github.com/joshuavachon25/geocr-py.git
cd geocr-py
pip install .

Quickstart

A typical workflow is to create a collection, batch-process every image with the operations that apply to all of them, then apply specific pipelines to individual items. After that, you run OCR, extract and correct the text, and generate a CSV file that the Geolocator can match against GIS data.

from geocr.archive_image import ArchiveImage
from geocr.archive_collection import ArchiveCollection
from geocr.archive_geolocator import ArchiveGeolocator
from geocr.geo_exporter import GeoExporter
from shapely.geometry import Point
import cv2

# 1. Process an image
img = ArchiveImage(cv2.imread("/path/to/image.png"), name="example")
img.sharpen().binarize().ocr(language='fra', config='--psm 3')
print(img.ocr_text)

# 2. Batch process a folder 
col = ArchiveCollection("/path/to/images")
col.sharpen().binarize().ocr()
all_text = col.get_ocr_text()

# 3. Geocode addresses
locator = ArchiveGeolocator("rues.geojson", default_point=Point(0,0))
gdf = locator.geocode_csv(
    csv_path="adresses.csv",
    csv_fields=["No","Adresse"],
    ref_fields=["adresses"],
    similarity_threshold=85
)

# 4. Export results
GeoExporter.to_csv(gdf, "results.csv")
GeoExporter.to_geojson(gdf, "results.geojson")

API Reference

ArchiveImage

ArchiveImage(image: np.ndarray, name: str)

Constructor: wraps an OpenCV image.

Method | Description
show(grid=False, grid_step=100, grid_color='#FF0000', subgrid_step=20, subgrid_color='#999999', title='Preview', max_size=None) | Display the image with an optional grid.
save(output_dir: Path, format='png', prefix='', suffix='') -> Path | Save the image to disk.
split(template: str, orientation='vertical') -> list[ArchiveImage] | Split the image by ratios and return new ArchiveImage objects.
rotate(angle: float) | Rotate the image by an angle in degrees.
mask(bottom=0, right=0, top=0, left=0, gray=255, color=(255,255,255)) | Mask the borders with a color.
invert() | Invert the image colors.
denoise(kernel=1, iterations=1), erode(kernel=1, iterations=1), dilate(kernel=1, iterations=1) | Morphological operations.
binarize(mode='auto', threshold_value=127) | Convert to black and white with Otsu or a manual threshold.
sharpen() | Apply an unsharp mask.
remove_borders() | Crop to the largest contour.
add_borders(top=20, bottom=20, left=20, right=20, color=(255,255,255)) | Add a uniform border.
ocr(language='fra', config='--psm 3') -> ArchiveImage | Run Tesseract OCR.
get_ocr_text() -> str | Display and return the OCR text.
clean(rules=None, min_line_length=None) | Apply regex rules and merge short lines.
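
To make the clean step more concrete, here is a minimal sketch of what "apply regex rules and merge short lines" could look like on plain text. It is not GeoOCR's implementation: the function name clean_text, the (pattern, replacement) rule format, and the choice to merge a short line into the previous one are all assumptions made for illustration.

```python
import re


def clean_text(text, rules=None, min_line_length=None):
    """Hypothetical stand-in for ArchiveImage.clean(); not the library's code."""
    # Assumed rule format: a list of (regex pattern, replacement) pairs,
    # applied in order to the whole text.
    for pattern, replacement in rules or []:
        text = re.sub(pattern, replacement, text)

    # Strip each line and drop the empty ones.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if min_line_length is None:
        return "\n".join(lines)

    merged = []
    for line in lines:
        # Assumption: a line shorter than min_line_length is a fragment
        # that belongs to the previous line.
        if merged and len(line) < min_line_length:
            merged[-1] = f"{merged[-1]} {line}"
        else:
            merged.append(line)
    return "\n".join(merged)


raw = "12  rue  Saint-Denis\nMtl\n\n45 av. du Parc"
rules = [(r"[ \t]{2,}", " "), (r"\bav\.", "avenue")]
cleaned = clean_text(raw, rules=rules, min_line_length=5)
print(cleaned)
```

The rules collapse repeated spaces and expand an abbreviation; the short fragment "Mtl" is then appended to the line before it.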

ArchiveCollection

ArchiveCollection(input_path: str, is_natural_sort=True, split_template=None, split_orientation='vertical')

Constructor: loads images from folder.
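
The is_natural_sort flag suggests that images are ordered so that numbered files follow their numeric order rather than plain string order. The sketch below shows one common way to build such a sort key. How GeoOCR implements it is not documented here, so natural_key is a hypothetical helper, not part of the library.

```python
import re


def natural_key(filename):
    """Split a name into text and integer chunks so numbers compare by value."""
    return [
        int(part) if part.isdigit() else part.lower()
        for part in re.split(r"(\d+)", filename)
    ]


names = ["page_10.png", "page_2.png", "page_1.png"]
plain_order = sorted(names)                     # string comparison
natural_order = sorted(names, key=natural_key)  # numeric chunks compared as ints
print(plain_order)
print(natural_order)
```

With plain string sorting, page_10.png lands before page_2.png; the natural key puts it last.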

Method | Description
get(name: str) -> ArchiveImage | Return the ArchiveImage with the given name.
remove(image_name: str) | Remove an image from the collection.
update(original_name: str, new_images: list[ArchiveImage]) | Apply the changes made to a single ArchiveImage back to the collection.
subset(imgs: list[str]) -> ArchiveCollection | Create a new ArchiveCollection from a list of images (by name).
split_only(image_name: str, template: str, orientation='vertical') | Split one image.
split_all(template: str, orientation='vertical') | Split every image.
show(title='Mosaic', max_cols=6, max_size=None) | Display a grid of images.
copy() -> ArchiveCollection | Deep copy the collection.
save(output_dir: Path, format='png', prefix='', suffix='') -> list[Path] | Write the images to disk.
sharpen(), rotate(angle), mask(...), invert(), denoise(), erode(), dilate(), binarize(), remove_borders(), add_borders(...), ocr(), clean() | Shortcuts that apply the corresponding ArchiveImage operation to every image.
get_ocr_text(separator='\n', show=True) | Concatenate the ocr_text of every ArchiveImage and display the result.

ArchiveGeolocator

ArchiveGeolocator(reference_geojson_path: str, default_point: Point = Point(0,0))

Constructor: loads GeoJSON reference.

gdf = geolocator.geocode_csv(
    csv_path: str = None,
    gdf: GeoDataFrame = None,
    csv_fields: list[str] = ['Adresse'],
    ref_fields: list[str] = ['nom_voie'],
    output_geometry: str = 'geometry',
    match_label: str = 'matched_ref',
    score_label: str = 'match_score',
    similarity_threshold: float = 80,
    sep: str = ';'
) -> GeoDataFrame
  • If gdf is provided, it is reused; otherwise the CSV at csv_path is loaded.
  • Only rows whose geometry is missing or equal to default_point are geocoded.
  • Returns a GeoDataFrame with the original columns plus output_geometry, match_label and score_label.
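
The behavior listed above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: difflib.SequenceMatcher stands in for whatever fuzzy matcher GeoOCR uses, plain dicts and coordinate tuples replace the GeoDataFrame and shapely geometries, and best_match and geocode_rows are hypothetical helper names.

```python
from difflib import SequenceMatcher

DEFAULT_POINT = (0.0, 0.0)  # stand-in for shapely's Point(0, 0)


def best_match(address, reference, threshold=80):
    """Return (name, score) for the closest reference name, or (None, score)."""
    best_name, best_score = None, 0.0
    for name in reference:
        # Similarity on a 0-100 scale, case-insensitive.
        score = 100 * SequenceMatcher(None, address.lower(), name.lower()).ratio()
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score


def geocode_rows(rows, reference, threshold=80):
    for row in rows:
        # Skip rows that already carry a real geometry.
        if row.get("geometry") not in (None, DEFAULT_POINT):
            continue
        name, score = best_match(row["Adresse"], reference, threshold)
        row["matched_ref"] = name
        row["match_score"] = round(score, 1)
        # Fall back to the default point when nothing clears the threshold.
        row["geometry"] = reference[name] if name else DEFAULT_POINT
    return rows


# Made-up reference names and coordinates, for illustration only.
reference = {"rue Saint-Denis": (-73.56, 45.51), "avenue du Parc": (-73.60, 45.52)}
rows = [
    {"Adresse": "rue Saint Denis", "geometry": None},
    {"Adresse": "avenue du Parc", "geometry": (-73.60, 45.52)},
    {"Adresse": "zzz", "geometry": None},
]
geocode_rows(rows, reference)
for row in rows:
    print(row)
```

The first row is matched despite the missing hyphen, the second is left untouched because it already has a geometry, and the third keeps the default point because no reference name reaches the threshold.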

GeoExporter

GeoExporter.to_csv(gdf: GeoDataFrame, path: str, sep: str = ';')

Export attribute table to CSV (drops geometry).

GeoExporter.to_geojson(gdf: GeoDataFrame, path: str)

Export full GeoDataFrame to GeoJSON.
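
As a rough picture of the two output shapes, the sketch below writes the same record as a ';'-separated CSV without geometry and as a GeoJSON FeatureCollection. It uses only the standard library and made-up sample data; the exact columns and properties GeoExporter writes are not documented here, so treat this as an assumption about the general format rather than the library's output.

```python
import csv
import io
import json

# Made-up sample row; geometry is a (longitude, latitude) tuple.
rows = [
    {"No": "12", "Adresse": "rue Saint-Denis", "geometry": (-73.56, 45.51)},
]

# CSV: attribute columns only, with ';' as the separator.
buffer = io.StringIO()
fields = [key for key in rows[0] if key != "geometry"]
writer = csv.DictWriter(buffer, fieldnames=fields, delimiter=";", extrasaction="ignore")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# GeoJSON: one Feature per row, attributes under "properties".
collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": list(row["geometry"])},
            "properties": {k: v for k, v in row.items() if k != "geometry"},
        }
        for row in rows
    ],
}
geojson_text = json.dumps(collection, ensure_ascii=False)

print(csv_text)
print(geojson_text)
```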


License

This project is licensed under the MIT License. See the LICENSE file for details.
