A package for entity linking using LionLinker.

Project description

Crocodile

Crocodile Logo

Crocodile is a powerful Python library designed for efficient entity linking over tabular data. Whether you're working with large datasets or need to resolve entities across multiple tables, Crocodile provides a scalable and easy-to-integrate solution to streamline your data processing pipeline.

Fun Fact: If a crocodile and an alligator were to meet, the crocodile would likely win in face-to-face combat. While the alligator is faster, the crocodile is bigger, heavier, and has a more lethal bite (Bayou Swamp Tours).

Features

  • Entity Linking: Seamlessly link entities within tabular data.
  • Scalable: Designed to handle large datasets efficiently.
  • Easy Integration: Can be easily integrated into existing data processing pipelines.

Installation

Crocodile is published on PyPI as the crocodile-linker distribution (imports remain crocodile):

pip install crocodile-linker

For the optional FastAPI app dependencies, install with extras:

pip install 'crocodile-linker[app]'

For development installs from source:

git clone https://github.com/your-org/crocodile.git
cd crocodile
pip install -e .

Additionally, download the spaCy model by running the following command:

python -m spacy download en_core_web_sm

Usage

Using the CLI

You can run the entity linking process via the command line interface (CLI) as follows:

First, create a .env file with the required environment variables:

ENTITY_RETRIEVAL_ENDPOINT=https://lamapi.hel.sintef.cloud/lookup/entity-retrieval
ENTITY_RETRIEVAL_TOKEN=lamapi_demo_2023
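The CLI command below expands $ENTITY_RETRIEVAL_ENDPOINT and $ENTITY_RETRIEVAL_TOKEN from the shell, so the variables in .env must be exported first. One plain-shell way to do that (a sketch; tools such as direnv or python-dotenv work equally well):

```shell
# Recreate the .env file from above, then auto-export its variables so
# child processes (the crocodile CLI) can read them.
cat > .env <<'EOF'
ENTITY_RETRIEVAL_ENDPOINT=https://lamapi.hel.sintef.cloud/lookup/entity-retrieval
ENTITY_RETRIEVAL_TOKEN=lamapi_demo_2023
EOF

set -a      # mark every variable defined from here on for export
. ./.env    # source the file: each KEY=VALUE line defines a variable
set +a      # stop auto-exporting
```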

Then, use the following command:

python3 -m crocodile.cli \
  --croco.input_csv tables/imdb_top_1000.csv \
  --croco.entity_retrieval_endpoint "$ENTITY_RETRIEVAL_ENDPOINT" \
  --croco.entity_retrieval_token "$ENTITY_RETRIEVAL_TOKEN" \
  --croco.mongo_uri "localhost:27017"

Specifying Column Types via CLI

To specify column types for your input table, use the following command:

python3 -m crocodile.cli \
  --croco.input_csv tables/imdb_top_1000.csv \
  --croco.entity_retrieval_endpoint "$ENTITY_RETRIEVAL_ENDPOINT" \
  --croco.entity_retrieval_token "$ENTITY_RETRIEVAL_TOKEN" \
  --croco.columns_type '{
    "NE": { "0": "OTHER" },
    "LIT": {
      "1": "NUMBER",
      "2": "NUMBER",
      "3": "STRING",
      "4": "NUMBER",
      "5": "STRING"
    },
    "IGNORED": ["6", "9", "10", "7", "8"]
  }' \
  --croco.mongo_uri "localhost:27017"
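Rather than hand-escaping the JSON argument on the command line, you can build it in Python and pass the serialized string to --croco.columns_type (a small sketch using only the standard library; the mapping mirrors the inline example above):

```python
import json

# Column indices (as strings) mapped to their types, as in the CLI example:
# "NE" for named-entity columns, "LIT" for literal columns with datatypes,
# "IGNORED" for columns skipped entirely.
columns_type = {
    "NE": {"0": "OTHER"},
    "LIT": {
        "1": "NUMBER",
        "2": "NUMBER",
        "3": "STRING",
        "4": "NUMBER",
        "5": "STRING",
    },
    "IGNORED": ["6", "9", "10", "7", "8"],
}

# json.dumps yields the exact string to pass to --croco.columns_type
print(json.dumps(columns_type))
```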

Using the Python API

You can also run the entity linking process using the Crocodile class in Python:

from crocodile import Crocodile
import pandas as pd
import os

df = pd.read_csv("./tables/imdb_top_1000.csv")

croco = Crocodile(
    input_csv=df, 
    dataset_name="cinema",
    table_name="imdb",
    entity_retrieval_endpoint=os.environ["ENTITY_RETRIEVAL_ENDPOINT"],
    entity_retrieval_token=os.environ["ENTITY_RETRIEVAL_TOKEN"],
    candidate_retrieval_limit=10,
    max_workers=4,
    save_output_to_csv=False,
    return_dataframe=True         
)

result_df = croco.run()
print("Entity linking completed.")
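The snippet above reads the endpoint and token from os.environ. If you prefer not to depend on python-dotenv, a minimal stand-in that loads the .env file from the CLI section is easy to sketch (load_env is a hypothetical helper, not part of Crocodile):

```python
import os

def load_env(path=".env"):
    """Parse KEY=VALUE lines from a .env file into os.environ.

    Existing variables are left untouched (setdefault), matching the
    usual dotenv behaviour. Hypothetical helper, not part of Crocodile.
    """
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # Skip blanks and comments; keep only KEY=VALUE lines.
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())
```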

Specifying Column Types

If you want to specify column types for your input table, use the following example:

from crocodile import Crocodile
import os

file_path = './tables/imdb_top_1000.csv'

# Create an instance of the Crocodile class
crocodile_instance = Crocodile(
    input_csv=file_path,
    table_name="imdb",
    dataset_name="cinema",
    max_candidates=3,
    entity_retrieval_token=os.environ["ENTITY_RETRIEVAL_TOKEN"],
    entity_retrieval_endpoint=os.environ["ENTITY_RETRIEVAL_ENDPOINT"],
    columns_type={
        "NE": {
            "0": "OTHER"
        },
        "LIT": {
            "1": "NUMBER",
            "2": "NUMBER",
            "3": "STRING",
            "4": "NUMBER",
            "5": "STRING"
        },
        "IGNORED": ["6", "9", "10", "7", "8"]
    }
)

# Run the entity linking process
crocodile_instance.run()

print("Entity linking process completed.")

In the columns_type parameter, specify for every column index whether it is a named-entity (NE) column or a literal (LIT) one. Any column declared as neither NE nor LIT is treated as an IGNORED column.
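That rule can be expressed in a few lines (a sketch, not part of the library): given the NE and LIT mappings and the total number of columns, the IGNORED set is simply the complement.

```python
def ignored_columns(n_columns, ne, lit):
    """Columns declared as neither NE nor LIT are treated as IGNORED."""
    declared = set(ne) | set(lit)
    return [str(i) for i in range(n_columns) if str(i) not in declared]

# The mappings from the example above; the table has 11 columns (0-10).
ne = {"0": "OTHER"}
lit = {"1": "NUMBER", "2": "NUMBER", "3": "STRING", "4": "NUMBER", "5": "STRING"}
print(ignored_columns(11, ne, lit))  # → ['6', '7', '8', '9', '10']
```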

Contributing

Contributions are welcome! Please read the contributing guidelines first.

License

This project is licensed under the Apache License - see the LICENSE file for details.

Contact

For any questions or inquiries, feel free to open an issue on the GitHub repository.

Project details


Download files

Download the file for your platform.

Source Distribution

crocodile_linker-0.1.1.tar.gz (926.8 kB)

Uploaded Source

Built Distribution

crocodile_linker-0.1.1-py3-none-any.whl (928.0 kB)

Uploaded Python 3

File details

Details for the file crocodile_linker-0.1.1.tar.gz.

File metadata

  • Download URL: crocodile_linker-0.1.1.tar.gz
  • Upload date:
  • Size: 926.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.15

File hashes

Hashes for crocodile_linker-0.1.1.tar.gz

  • SHA256: 244c7c64334c3a95b205ce424b22cccd51dba49baccc4989c7ea0d02bc6f50b0
  • MD5: 98085bc3295ea9e887c5f5236b007e47
  • BLAKE2b-256: 9684b583a65d6216508920cc342a9b69b571875b1509f9f688336f32efbae2dd


File details

Details for the file crocodile_linker-0.1.1-py3-none-any.whl.

File hashes

Hashes for crocodile_linker-0.1.1-py3-none-any.whl

  • SHA256: ee40b73e9af404553f6939b9a23d76ac7e865e825b8905e1926f964b5444de63
  • MD5: f199e12c10348e1e018c393e3931c144
  • BLAKE2b-256: fe7ac1f8f1519e395919a52822ca87d16f3ab1d53acbcb19e56dfe323997f4cc

