GeoOCR is a Python library for OCR, data cleaning, and historical geocoding using custom GeoJSON maps.
GeoOCR
GeoOCR is a Python library for automating OCR on historical archive images and extracting geospatial data. It provides:
- ArchiveImage: image preprocessing, splitting and OCR.
- ArchiveCollection: batch operations on collections of images.
- ArchiveGeolocator: fuzzy-match CSV addresses against a GeoJSON reference to geolocate.
- GeoExporter: export results to CSV or GeoJSON.
We recommend using it with Google Colab or Jupyter Notebook.
Note: This project was developed for a specific university research project and is not guaranteed to be maintained.
Installation
pip install geocr
Or install from source:
git clone https://github.com/joshuavachon25/geocr-py.git
cd geocr-py
pip install .
Quickstart
A typical workflow creates a collection, batch-processes whatever steps apply to every image, then applies specific pipelines to individual items. You then run OCR, extract and correct the data, and generate a CSV file that the ArchiveGeolocator can match against GIS reference data.
from geocr.archive_image import ArchiveImage
from geocr.archive_collection import ArchiveCollection
from geocr.archive_geolocator import ArchiveGeolocator
from geocr.geo_exporter import GeoExporter
from shapely.geometry import Point
import cv2
# 1. Process an image
img = ArchiveImage(cv2.imread("/path/to/image.png"), name="example")
img.sharpen().binarize().ocr(language='fra', config='--psm 3')
print(img.ocr_text)
# 2. Batch process a folder
col = ArchiveCollection("/path/to/images")
col.sharpen().binarize().ocr()
all_text = col.get_ocr_text()
# 3. Geocode addresses
locator = ArchiveGeolocator("rues.geojson", default_point=Point(0,0))
gdf = locator.geocode_csv(
    csv_path="adresses.csv",
    csv_fields=["No", "Adresse"],
    ref_fields=["adresses"],
    similarity_threshold=85,
)
# 4. Export results
GeoExporter.to_csv(gdf, "results.csv")
GeoExporter.to_geojson(gdf, "results.geojson")
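The Quickstart assumes that `adresses.csv` already exists. As a rough sketch of the "extract and correct the data, and generate a CSV file" step, the snippet below turns corrected OCR text into a `;`-separated CSV with `No` and `Adresse` columns. The helper names `lines_to_rows` and `rows_to_csv` and the assumed line shape (optional civic number, then street name) are hypothetical and not part of GeoOCR; real archive text will likely need its own pattern.

```python
import csv
import io
import re

# Assumed line shape: an optional civic number followed by a street name,
# e.g. "12 rue Saint-Jean". This pattern is illustrative only.
LINE_PATTERN = re.compile(r"^\s*(?P<no>\d+[A-Za-z]?\b)?\s*(?P<adresse>.+?)\s*$")


def lines_to_rows(text):
    """Turn corrected OCR text into dicts with 'No' and 'Adresse' keys."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines left over from OCR
        match = LINE_PATTERN.match(line)
        rows.append({
            "No": match.group("no") or "",
            "Adresse": match.group("adresse"),
        })
    return rows


def rows_to_csv(rows, sep=";"):
    """Serialize the rows as CSV text using the given separator."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["No", "Adresse"], delimiter=sep, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


ocr_text = "12 rue Saint-Jean\n\navenue du Parc\n"
print(rows_to_csv(lines_to_rows(ocr_text)))
```

Writing the returned text to `adresses.csv` would give a file in the shape that the `csv_fields=["No", "Adresse"]` argument above appears to expect.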
API Reference
ArchiveImage
ArchiveImage(image: np.ndarray, name: str)
Constructor: wraps an OpenCV image.
| Method | Description |
|---|---|
| `show(grid=False, grid_step=100, grid_color='#FF0000', subgrid_step=20, subgrid_color='#999999', title='Preview', max_size=None)` | displays the image with optional grid. |
| `save(output_dir: Path, format='png', prefix='', suffix='') -> Path` | save image to disk. |
| `split(template: str, orientation='vertical') -> list[ArchiveImage]` | split image by ratios, return new ArchiveImage objects. |
| `rotate(angle: float)` | rotate image by angle in degrees. |
| `mask(bottom=0, right=0, top=0, left=0, gray=255, color=(255,255,255))` | mask borders with color. |
| `invert()` | invert image colors. |
| `denoise(kernel=1, iterations=1)`, `erode(kernel=1, iterations=1)`, `dilate(kernel=1, iterations=1)` | morphological operations. |
| `binarize(mode='auto', threshold_value=127)` | convert to black/white with Otsu or manual threshold. |
| `sharpen()` | apply unsharp mask. |
| `remove_borders()` | crop to largest contour. |
| `add_borders(top=20, bottom=20, left=20, right=20, color=(255,255,255))` | add uniform border. |
| `ocr(language='fra', config='--psm 3') -> ArchiveImage` | run Tesseract OCR. |
| `get_ocr_text() -> str` | display and return OCR text. |
| `clean(rules=None, min_line_length=None)` | apply regex rules and merge short lines. |
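To make the `clean` row more concrete, here is a minimal standalone sketch of "apply regex rules and merge short lines". It is not GeoOCR's implementation: the function name `clean_text`, the rule format (a list of `(pattern, replacement)` pairs), and the merge policy (a short line is appended to the previous one) are all assumptions made for illustration.

```python
import re


def clean_text(text, rules=None, min_line_length=None):
    """Apply (pattern, replacement) regex rules, then merge short lines.

    Both the rule format and the merge policy are assumptions; they are
    not taken from the GeoOCR source.
    """
    for pattern, replacement in rules or []:
        text = re.sub(pattern, replacement, text)

    # Drop empty lines and surrounding whitespace before merging.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if min_line_length is None:
        return "\n".join(lines)

    merged = []
    for line in lines:
        # A line shorter than the minimum is treated as a fragment of the
        # previous line and appended to it.
        if merged and len(line) < min_line_length:
            merged[-1] = f"{merged[-1]} {line}"
        else:
            merged.append(line)
    return "\n".join(merged)


raw = "12  rue   Saint-Jean\nEst\n14 rue Notre-Dame\n"
print(clean_text(raw, rules=[(r"[ \t]+", " ")], min_line_length=5))
```

Here the single rule collapses runs of spaces, and the stray three-character line `Est` is folded back into the address above it.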
ArchiveCollection
ArchiveCollection(input_path: str, is_natural_sort=True, split_template=None, split_orientation='vertical')
Constructor: loads images from folder.
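The `is_natural_sort` flag suggests that image files are ordered so that `page2` comes before `page10`. The sketch below shows one common way to get that ordering with a sort key; `natural_key` is a hypothetical helper, and the library's actual sorting logic may differ.

```python
import re


def natural_key(name):
    """Split a filename into text and integer chunks for sorting."""
    parts = re.split(r"(\d+)", name)
    # re.split with a capturing group alternates text and digit chunks,
    # so odd positions always hold the digit runs.
    return [int(part) if i % 2 else part.lower() for i, part in enumerate(parts)]


names = ["page10.png", "page2.png", "page1.png"]
print(sorted(names))                   # plain string order
print(sorted(names, key=natural_key))  # numeric-aware order
```

Plain string sorting compares character by character, which is why `page10.png` lands before `page2.png` without a key like this.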
| Method | Description |
|---|---|
| `get(name: str) -> ArchiveImage` | return the ArchiveImage with the given name. |
| `remove(image_name: str)` | remove image from collection. |
| `update(original_name: str, new_images: list[ArchiveImage])` | apply changes made to a single ArchiveImage back to the collection. |
| `subset(imgs: list[str]) -> ArchiveCollection` | create a new ArchiveCollection from a list of image names. |
| `split_only(image_name: str, template: str, orientation='vertical')` | split one image. |
| `split_all(template: str, orientation='vertical')` | split every image. |
| `show(title='Mosaic', max_cols=6, max_size=None)` | display grid of images. |
| `copy() -> ArchiveCollection` | deep copy collection. |
| `save(output_dir: Path, format='png', prefix='', suffix='') -> list[Path]` | write images to disk. |
| All image ops: `sharpen()`, `rotate(angle)`, `mask(...)`, `invert()`, `denoise()`, `erode()`, `dilate()`, `binarize()`, `remove_borders()`, `add_borders(...)`, `ocr()`, `clean()` | shortcuts that apply the operation to every image. |
| `get_ocr_text(separator='\n', show=True)` | concatenate the ocr_text of every ArchiveImage and display the result. |
ArchiveGeolocator
ArchiveGeolocator(reference_geojson_path: str, default_point: Point = Point(0,0))
Constructor: loads GeoJSON reference.
gdf = geolocator.geocode_csv(
    csv_path: str = None,
    gdf: GeoDataFrame = None,
    csv_fields: list[str] = ['Adresse'],
    ref_fields: list[str] = ['nom_voie'],
    output_geometry: str = 'geometry',
    match_label: str = 'matched_ref',
    score_label: str = 'match_score',
    similarity_threshold: float = 80,
    sep: str = ';'
) -> GeoDataFrame
- If `gdf` provided, reuses it; otherwise loads CSV.
- Only geocodes rows where `geometry` is missing or equals `default_point`.
- Returns `GeoDataFrame` with original columns + `output_geometry`, `match_label`, `score_label`.
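The role of `similarity_threshold` can be illustrated with a small fuzzy-matching sketch. This is not the matcher GeoOCR uses (the README does not say which one it relies on); it substitutes the standard library's `difflib.SequenceMatcher`, scaled to 0-100 on the assumption that the threshold uses that scale. `best_match` and the sample street names are hypothetical.

```python
from difflib import SequenceMatcher


def best_match(address, reference_names, similarity_threshold=80):
    """Return (name, score) for the closest reference, or (None, score)."""

    def score(a, b):
        # Ratio in [0, 1], scaled to 0-100 to mirror the threshold's
        # assumed scale.
        return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

    best_name, best_score = None, 0.0
    for name in reference_names:
        current = score(address, name)
        if current > best_score:
            best_name, best_score = name, current

    # Below the threshold the row is left unmatched.
    if best_score < similarity_threshold:
        return None, best_score
    return best_name, best_score


streets = ["Rue Saint-Jean", "Rue Saint-Paul", "Avenue du Parc"]
print(best_match("rue Saint Jean", streets, similarity_threshold=85))
print(best_match("Boulevard Inconnu", streets, similarity_threshold=85))
```

In this sketch an unmatched row simply returns `None`; in `geocode_csv` the analogous case presumably leaves the geometry at `default_point`, which is why such rows can be retried on a later call.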
GeoExporter
GeoExporter.to_csv(gdf: GeoDataFrame, path: str, sep: str = ';')
Export attribute table to CSV (drops geometry).
GeoExporter.to_geojson(gdf: GeoDataFrame, path: str)
Export full GeoDataFrame to GeoJSON.
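To show what the two export targets look like, the sketch below writes the same geocoded rows once as a `;`-separated attribute table without geometry and once as a GeoJSON FeatureCollection. It uses only the standard library rather than a GeoDataFrame, so it is an illustration of the output shapes, not of `GeoExporter` itself; the sample row, the `point` key, and both helper names are hypothetical.

```python
import csv
import io
import json

# Hypothetical geocoded rows: attributes plus a (longitude, latitude) pair.
rows = [
    {"No": "12", "Adresse": "rue Saint-Jean", "match_score": 92.9,
     "point": (-71.21, 46.81)},
]


def to_csv_text(rows, sep=";"):
    """Write the attribute columns only, leaving the geometry out."""
    fields = [key for key in rows[0] if key != "point"]
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, delimiter=sep,
        extrasaction="ignore", lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_geojson_dict(rows):
    """Build a GeoJSON FeatureCollection with one Point feature per row."""
    features = []
    for row in rows:
        lon, lat = row["point"]
        properties = {k: v for k, v in row.items() if k != "point"}
        features.append({
            "type": "Feature",
            # GeoJSON stores coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": properties,
        })
    return {"type": "FeatureCollection", "features": features}


print(to_csv_text(rows))
print(json.dumps(to_geojson_dict(rows), indent=2))
```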
License
This project is licensed under the MIT License. See the LICENSE file for details.