What does the key parameter do under the hood?

LaminDB is designed around associating biological metadata with artifacts and collections. This makes it possible to query them by metadata and removes the need for semantic artifact and collection names.

Here, we will discuss trade-offs for using the key parameter, which allows for semantic keys, in various scenarios.

Setup

We're simulating a project with several nested folders and artifacts. Similar structures appear, for example, in the RxRx: cell imaging guide.

import random
import string
from pathlib import Path


def create_complex_biological_hierarchy(root_folder):
    root_path = Path(root_folder)

    if root_path.exists():
        print("Folder structure already exists. Skipping...")
    else:
        root_path.mkdir()

        raw_folder = root_path / "raw"
        preprocessed_folder = root_path / "preprocessed"
        raw_folder.mkdir()
        preprocessed_folder.mkdir()

        for i in range(1, 5):
            artifact_name = f"raw_data_{i}.txt"
            with (raw_folder / artifact_name).open("w") as f:
                random_text = "".join(
                    random.choice(string.ascii_letters) for _ in range(10)
                )
                f.write(random_text)

        for i in range(1, 3):
            collection_folder = raw_folder / f"Collection_{i}"
            collection_folder.mkdir()

            for j in range(1, 5):
                artifact_name = f"raw_data_{j}.txt"
                with (collection_folder / artifact_name).open("w") as f:
                    random_text = "".join(
                        random.choice(string.ascii_letters) for _ in range(10)
                    )
                    f.write(random_text)

        for i in range(1, 5):
            artifact_name = f"result_{i}.txt"
            with (preprocessed_folder / artifact_name).open("w") as f:
                random_text = "".join(
                    random.choice(string.ascii_letters) for _ in range(10)
                )
                f.write(random_text)


root_folder = "complex_biological_project"
create_complex_biological_hierarchy(root_folder)
!lamin init --storage ./key-eval
💡 connected lamindb: testuser1/key-eval
import lamindb as ln


ln.settings.verbosity = "hint"
💡 connected lamindb: testuser1/key-eval
ln.UPath("complex_biological_project").view_tree()
4 sub-directories & 8 files with suffixes '.txt'
/home/runner/work/lamindb/lamindb/docs/faq/complex_biological_project
├── preprocessed/
│   ├── result_4.txt
│   ├── result_1.txt
│   ├── result_3.txt
│   └── result_2.txt
└── raw/
    ├── Collection_1/
    ├── raw_data_3.txt
    ├── raw_data_1.txt
    ├── raw_data_2.txt
    ├── Collection_2/
    └── raw_data_4.txt
ln.settings.transform.stem_uid = "WIwaNDvlEkwS"
ln.settings.transform.version = "1"
ln.track()
💡 notebook imports: lamindb==0.74a1
💡 saved: Transform(uid='WIwaNDvlEkwS5zKv', version='1', name='What does the key parameter do under the hood?', key='key', type='notebook', created_by_id=1, updated_at='2024-06-19 16:18:03 UTC')
💡 saved: Run(uid='ckFTxlwHVWFOaIT3MixU', transform_id=1, created_by_id=1)
💡 tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_ckFTxlwHVWFOaIT3MixU.txt
Run(uid='ckFTxlwHVWFOaIT3MixU', started_at='2024-06-19 16:18:03 UTC', is_consecutive=True, transform_id=1, created_by_id=1)

Storing artifacts using Storage, Artifact, and Collection

Lamin has three storage classes that manage different types of in-memory and on-disk objects:

  1. Storage: Manages the default storage root that can be either local or in the cloud. For more details we refer to Storage FAQ.

  2. Artifact: Manages datasets with an optional key that acts as a relative path within the current default storage root (see Storage). An example is a single h5 artifact.

  3. Collection: Manages a collection of datasets with an optional key that acts as a relative path within the current default storage root (see Storage). An example is a collection of h5 artifacts.

For more details we refer to Tutorial: Artifacts.

The current storage root is:

ln.settings.storage
PosixUPath('/home/runner/work/lamindb/lamindb/docs/faq/key-eval')

By default, Lamin uses virtual keys that are only reflected in the database but not in storage. It is possible to turn this behavior off by setting ln.settings.artifact_use_virtual_keys = False. Generally, we discourage disabling this setting manually. For more details we refer to Storage FAQ.

ln.settings.artifact_use_virtual_keys
True
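
To make the distinction between virtual and real keys concrete, here is a toy model of how a storage location could be derived from a uid and a key. It is only a sketch inferred from the paths printed in this notebook: `resolve_storage_path` and the root `/data/key-eval` are hypothetical and not part of the LaminDB API.

```python
from pathlib import PurePosixPath


def resolve_storage_path(root, uid, suffix, key=None, key_is_virtual=True):
    """Toy model (not LaminDB's implementation) of where an artifact is stored."""
    root = PurePosixPath(root)
    if key is None or key_is_virtual:
        # Virtual key: the key lives only in the database; storage uses the uid.
        return root / ".lamindb" / f"{uid}{suffix}"
    # Real key: the key is used as a relative path below the storage root.
    return root / key


virtual = resolve_storage_path(
    "/data/key-eval", "A5ibzZximXr94GoN50iN", ".txt", key="raw/raw_data_3.txt"
)
real = resolve_storage_path(
    "/data/key-eval",
    "69rgN6OAtHn5gG2wyTpD",
    ".txt",
    key="preprocessed/result_4.txt",
    key_is_virtual=False,
)
print(virtual)
print(real)
```

In this model, changing a virtual key never changes the returned path, which is the property the rest of this notebook relies on.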

We will now create Artifact objects with and without semantic keys using key and also save them as Collections.

artifact_no_key_1 = ln.Artifact("complex_biological_project/raw/raw_data_1.txt")
artifact_no_key_2 = ln.Artifact("complex_biological_project/raw/raw_data_2.txt")
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/EV2c9i9rHAIXItqRwtYf.txt')
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/pjmF5zcd6Vrrr2bOr2Gj.txt')

The logging suggests that the artifacts will be saved to our current default storage with auto-generated storage keys.

artifact_no_key_1.save()
artifact_no_key_2.save()
✅ storing artifact 'EV2c9i9rHAIXItqRwtYf' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/EV2c9i9rHAIXItqRwtYf.txt'
✅ storing artifact 'pjmF5zcd6Vrrr2bOr2Gj' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/pjmF5zcd6Vrrr2bOr2Gj.txt'
Artifact(uid='pjmF5zcd6Vrrr2bOr2Gj', suffix='.txt', type='dataset', size=10, hash='iDRI4qNkfDPIoFiKX8hh0w', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-19 16:18:04 UTC')
artifact_key_3 = ln.Artifact(
    "complex_biological_project/raw/raw_data_3.txt", key="raw/raw_data_3.txt"
)
artifact_key_4 = ln.Artifact(
    "complex_biological_project/raw/raw_data_4.txt", key="raw/raw_data_4.txt"
)
artifact_key_3.save()
artifact_key_4.save()
💡 path content will be copied to default storage upon `save()` with key 'raw/raw_data_3.txt'
💡 path content will be copied to default storage upon `save()` with key 'raw/raw_data_4.txt'
✅ storing artifact 'A5ibzZximXr94GoN50iN' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/A5ibzZximXr94GoN50iN.txt'
✅ storing artifact 'SolstrdDOEvzjJ3BSZI8' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/SolstrdDOEvzjJ3BSZI8.txt'
Artifact(uid='SolstrdDOEvzjJ3BSZI8', key='raw/raw_data_4.txt', suffix='.txt', type='dataset', size=10, hash='xvqAfSr0UjVJlAjib_kAmw', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-19 16:18:04 UTC')

Because of virtual keys, artifacts with keys are not stored in different locations than artifacts without keys. However, they are still semantically queryable by key.

ln.Artifact.filter(key__contains="raw").df().head()
uid version description key suffix type accessor size hash hash_type n_objects n_observations visibility key_is_virtual storage_id transform_id run_id created_by_id updated_at
id
3 A5ibzZximXr94GoN50iN None None raw/raw_data_3.txt .txt dataset None 10 AQ4NpfCAHHtzXKPzfDLs_g md5 None None 1 True 1 1 1 1 2024-06-19 16:18:04.904017+00:00
4 SolstrdDOEvzjJ3BSZI8 None None raw/raw_data_4.txt .txt dataset None 10 xvqAfSr0UjVJlAjib_kAmw md5 None None 1 True 1 1 1 1 2024-06-19 16:18:04.908779+00:00

Collection does not have a key parameter because it does not store any additional data in Storage. In contrast, it has a name parameter that serves as a semantic identifier of the collection.

ds_1 = ln.Collection([artifact_no_key_1, artifact_no_key_2], name="no key collection")
ds_2 = ln.Collection([artifact_key_3, artifact_key_4], name="sample collection")
ds_1
Collection(uid='qOeQQHwChamAPCil5Bur', name='no key collection', hash='hdRkz0j6WtWvJg7QW2JD', visibility=1, created_by_id=1, transform_id=1, run_id=1)

Advantages and disadvantages of semantic keys

Semantic keys have several advantages and disadvantages that we will discuss and demonstrate in the remainder of this notebook:

Advantages

  • Simple: It can be easier to refer to specific collections in conversations

  • Familiarity: Most people are familiar with the concept of semantic names

Disadvantages

  • Length: Semantic names can be long with limited aesthetic appeal

  • Inconsistency: Lack of naming conventions can lead to confusion

  • Limited metadata: Semantic keys can contain some, but usually not all metadata

  • Inefficiency: Writing lengthy semantic names is a repetitive process and can be time-consuming

  • Ambiguity: Overly descriptive artifact names may introduce ambiguity and redundancy

  • Clashes: Several people may attempt to use the same semantic key. They are not unique

Renaming artifacts

Renaming artifacts that have associated keys can be done on several levels.

In storage

An artifact can be locally moved or renamed:

artifact_key_3.path
PosixUPath('/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/A5ibzZximXr94GoN50iN.txt')
loaded_artifact = artifact_key_3.load()
!mkdir complex_biological_project/moved_artifacts
!mv complex_biological_project/raw/raw_data_3.txt complex_biological_project/moved_artifacts
artifact_key_3.path
PosixUPath('/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/A5ibzZximXr94GoN50iN.txt')

After moving the artifact locally, the storage location (the path) has not changed and the artifact can still be loaded.

artifact_3 = artifact_key_3.load()

The same applies to the key which has not changed.

artifact_key_3.key
'raw/raw_data_3.txt'
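
The reason the move has no effect is that saving copied the content into the storage root, so the stored copy is independent of the original file. The following standard-library sketch imitates this with temporary folders; the folder layout and the name `uid123.txt` are illustrative assumptions, not LaminDB internals.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    source = tmp / "project" / "raw_data_3.txt"
    source.parent.mkdir()
    source.write_text("ACGTACGTAC")

    # "Saving" copies the content into a separate storage folder under a uid-like name.
    stored = tmp / "storage" / ".lamindb" / "uid123.txt"
    stored.parent.mkdir(parents=True)
    shutil.copy(source, stored)

    # Moving the original afterwards does not touch the stored copy.
    moved_dir = tmp / "project" / "moved_artifacts"
    moved_dir.mkdir()
    shutil.move(str(source), str(moved_dir / source.name))

    source_still_exists = source.exists()
    stored_content = stored.read_text()

print(source_still_exists, stored_content)
```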

By key

Besides moving the artifact in storage, the key can also be renamed.

artifact_key_4.key
'raw/raw_data_4.txt'
artifact_key_4.key = "bad_samples/sample_data_4.txt"
artifact_key_4.key
'bad_samples/sample_data_4.txt'

Due to the usage of virtual keys, modifying the key does not change the storage location and the artifact stays accessible.

artifact_key_4.path
PosixUPath('/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/SolstrdDOEvzjJ3BSZI8.txt')
artifact_4 = artifact_key_4.load()

Modifying the path attribute

However, modifying the path directly is not allowed:

try:
    artifact_key_4.path = f"{ln.settings.storage}/here_now/sample_data_4.txt"
except AttributeError as e:
    print(e)
property of 'Artifact' object has no setter
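
The error message indicates that path is a read-only property. As a rough illustration of that Python mechanism, the hypothetical `ToyArtifact` class below derives its path on access and defines no setter, so assignment raises an AttributeError. It is a stand-in for the concept only and says nothing about how Artifact is actually implemented.

```python
class ToyArtifact:
    """Hypothetical stand-in for an artifact whose path is derived, not stored."""

    def __init__(self, storage_root, uid, suffix):
        self._storage_root = storage_root
        self.uid = uid
        self.suffix = suffix

    @property
    def path(self):
        # Computed on access; there is deliberately no setter.
        return f"{self._storage_root}/.lamindb/{self.uid}{self.suffix}"


toy = ToyArtifact("/data/key-eval", "SolstrdDOEvzjJ3BSZI8", ".txt")
try:
    toy.path = "/data/key-eval/here_now/sample_data_4.txt"
    assignment_failed = False
except AttributeError:
    assignment_failed = True

print(assignment_failed, toy.path)
```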

Clashing semantic keys

Semantic keys should not clash. Let's attempt to use the same semantic key twice:

print(artifact_key_3.key)
print(artifact_key_4.key)
raw/raw_data_3.txt
bad_samples/sample_data_4.txt
artifact_key_4.key = "raw/raw_data_3.txt"
print(artifact_key_3.key)
print(artifact_key_4.key)
raw/raw_data_3.txt
raw/raw_data_3.txt

When filtering for this semantic key, it is now unclear which artifact we are referring to:

ln.Artifact.filter(key__icontains="sample_data_3").df()
uid version description key suffix type accessor size hash hash_type n_objects n_observations visibility key_is_virtual storage_id transform_id run_id created_by_id updated_at
id

When querying by key, LaminDB cannot resolve which artifact we actually wanted. Note that the query above returns no rows at all, because no key contains the substring sample_data_3, so the result does not paint a complete picture.

print(artifact_key_3.uid)
print(artifact_key_4.uid)
A5ibzZximXr94GoN50iN
SolstrdDOEvzjJ3BSZI8

Both artifacts still exist, though, with unique uids that can be used to access them. Most importantly, saving these artifacts to the database will result in an IntegrityError to prevent this issue.

try:
    artifact_key_3.save()
    artifact_key_4.save()
except Exception as e:
    print(
        "It is not possible to save artifacts to the same key. This results in an"
        " Integrity Error!"
    )

We refer to What happens if I save the same artifacts & records twice? for more detailed explanations of behavior when attempting to save artifacts multiple times.
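
One way to notice clashes early is to check a batch of keys for duplicates before saving anything. The sketch below uses only the standard library and works on plain (uid, key) pairs; `find_key_clashes` is a hypothetical helper and not part of LaminDB.

```python
from collections import defaultdict


def find_key_clashes(records):
    """Return {key: [uids]} for every key used by more than one record."""
    uids_by_key = defaultdict(list)
    for uid, key in records:
        if key is not None:  # artifacts without a key cannot clash
            uids_by_key[key].append(uid)
    return {key: uids for key, uids in uids_by_key.items() if len(uids) > 1}


# (uid, key) pairs mirroring the in-memory state of the artifacts above.
records = [
    ("EV2c9i9rHAIXItqRwtYf", None),
    ("pjmF5zcd6Vrrr2bOr2Gj", None),
    ("A5ibzZximXr94GoN50iN", "raw/raw_data_3.txt"),
    ("SolstrdDOEvzjJ3BSZI8", "raw/raw_data_3.txt"),
]
clashes = find_key_clashes(records)
print(clashes)
```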

Hierarchies

Another common use case for keys is artifact hierarchies. It can be useful to mirror the folder structure of "complex_biological_project" from above in LaminDB to allow queries for artifacts that were stored in specific folders. Common examples are folders that specify different processing stages such as raw, preprocessed, or annotated.

Note that this use case may overlap with Collection, which also allows for grouping artifacts. However, Collection cannot model hierarchical groupings.

Key

import os

for root, _, artifacts in os.walk("complex_biological_project/raw"):
    for artifactname in artifacts:
        file_path = os.path.join(root, artifactname)
        key_path = file_path.removeprefix("complex_biological_project")
        ln_artifact = ln.Artifact(file_path, key=key_path)
        ln_artifact.save()
💡 returning existing artifact with same hash: Artifact(uid='EV2c9i9rHAIXItqRwtYf', suffix='.txt', type='dataset', size=10, hash='dexxNG0wtPa8N7B8fdPtIg', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-19 16:18:04 UTC')
❗ key None on existing artifact differs from passed key /raw/raw_data_1.txt
💡 returning existing artifact with same hash: Artifact(uid='pjmF5zcd6Vrrr2bOr2Gj', suffix='.txt', type='dataset', size=10, hash='iDRI4qNkfDPIoFiKX8hh0w', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-19 16:18:04 UTC')
❗ key None on existing artifact differs from passed key /raw/raw_data_2.txt
💡 returning existing artifact with same hash: Artifact(uid='SolstrdDOEvzjJ3BSZI8', key='raw/raw_data_3.txt', suffix='.txt', type='dataset', size=10, hash='xvqAfSr0UjVJlAjib_kAmw', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-19 16:18:05 UTC')
❗ key raw/raw_data_3.txt on existing artifact differs from passed key /raw/raw_data_4.txt
💡 path content will be copied to default storage upon `save()` with key '/raw/Collection_1/raw_data_3.txt'
✅ storing artifact 'dQCdIfa63XNX5mdOBJAx' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/dQCdIfa63XNX5mdOBJAx.txt'
💡 path content will be copied to default storage upon `save()` with key '/raw/Collection_1/raw_data_1.txt'
✅ storing artifact '2iCk4CVpP4n2hoaxHbzi' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/2iCk4CVpP4n2hoaxHbzi.txt'
💡 path content will be copied to default storage upon `save()` with key '/raw/Collection_1/raw_data_2.txt'
✅ storing artifact 'zuMiOZZYjSACCNgWUEvq' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/zuMiOZZYjSACCNgWUEvq.txt'
💡 path content will be copied to default storage upon `save()` with key '/raw/Collection_1/raw_data_4.txt'
✅ storing artifact 'HhMealZntBBEgAW6ynV7' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/HhMealZntBBEgAW6ynV7.txt'
💡 path content will be copied to default storage upon `save()` with key '/raw/Collection_2/raw_data_3.txt'
✅ storing artifact '6qTXw0EWFtKGLNd4dME5' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/6qTXw0EWFtKGLNd4dME5.txt'
💡 path content will be copied to default storage upon `save()` with key '/raw/Collection_2/raw_data_1.txt'
✅ storing artifact 'gURqcQ0v0BSo7Mf1MEfZ' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/gURqcQ0v0BSo7Mf1MEfZ.txt'
💡 path content will be copied to default storage upon `save()` with key '/raw/Collection_2/raw_data_2.txt'
✅ storing artifact '7uWTJRmJuIzRKLkK2yJo' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/7uWTJRmJuIzRKLkK2yJo.txt'
💡 path content will be copied to default storage upon `save()` with key '/raw/Collection_2/raw_data_4.txt'
✅ storing artifact 'Ydlv3iOb6p6qT2iU5Pe8' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/Ydlv3iOb6p6qT2iU5Pe8.txt'
ln.Artifact.filter(key__startswith="raw").df()
uid version description key suffix type accessor size hash hash_type n_objects n_observations visibility key_is_virtual storage_id transform_id run_id created_by_id updated_at
id
3 A5ibzZximXr94GoN50iN None None raw/raw_data_3.txt .txt dataset None 10 AQ4NpfCAHHtzXKPzfDLs_g md5 None None 1 True 1 1 1 1 2024-06-19 16:18:05.347718+00:00
4 SolstrdDOEvzjJ3BSZI8 None None raw/raw_data_3.txt .txt dataset None 10 xvqAfSr0UjVJlAjib_kAmw md5 None None 1 True 1 1 1 1 2024-06-19 16:18:05.388885+00:00
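
The query above only returns the two artifacts saved earlier: the keys produced in the loop start with a slash (for example /raw/Collection_1/raw_data_1.txt) and therefore do not match key__startswith="raw". If keys without a leading slash are preferred, one option is to derive them with pathlib instead of removeprefix. The helper `relative_key` below is a hypothetical name used for illustration.

```python
from pathlib import PurePosixPath


def relative_key(file_path, project_root):
    """Return file_path relative to project_root as a POSIX-style key."""
    return PurePosixPath(file_path).relative_to(project_root).as_posix()


file_path = "complex_biological_project/raw/Collection_1/raw_data_1.txt"

# The approach used in the loop keeps the separator after the prefix.
with_prefix_removed = file_path.removeprefix("complex_biological_project")

# relative_to drops the root folder together with its separator.
key = relative_key(file_path, "complex_biological_project")

print(with_prefix_removed)
print(key)
```

With keys of this shape, a prefix query on "raw" would also cover the nested Collection_1 and Collection_2 folders.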

Collection

Alternatively, it would have been possible to create a Collection with a corresponding name:

all_data_paths = []
for root, _, artifacts in os.walk("complex_biological_project/raw"):
    for artifactname in artifacts:
        file_path = os.path.join(root, artifactname)
        all_data_paths.append(file_path)

all_data_artifacts = []
for path in all_data_paths:
    all_data_artifacts.append(ln.Artifact(path))

data_ds = ln.Collection(all_data_artifacts, name="data")
data_ds.save()
💡 returning existing artifact with same hash: Artifact(uid='EV2c9i9rHAIXItqRwtYf', suffix='.txt', type='dataset', size=10, hash='dexxNG0wtPa8N7B8fdPtIg', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-19 16:18:05 UTC')
💡 returning existing artifact with same hash: Artifact(uid='pjmF5zcd6Vrrr2bOr2Gj', suffix='.txt', type='dataset', size=10, hash='iDRI4qNkfDPIoFiKX8hh0w', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-19 16:18:05 UTC')
💡 returning existing artifact with same hash: Artifact(uid='SolstrdDOEvzjJ3BSZI8', key='raw/raw_data_3.txt', suffix='.txt', type='dataset', size=10, hash='xvqAfSr0UjVJlAjib_kAmw', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-19 16:18:05 UTC')
💡 returning existing artifact with same hash: Artifact(uid='dQCdIfa63XNX5mdOBJAx', key='/raw/Collection_1/raw_data_3.txt', suffix='.txt', type='dataset', size=10, hash='LeO2qBi9U1ojw2o7744KTg', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-19 16:18:05 UTC')
💡 returning existing artifact with same hash: Artifact(uid='2iCk4CVpP4n2hoaxHbzi', key='/raw/Collection_1/raw_data_1.txt', suffix='.txt', type='dataset', size=10, hash='0d9Wc7SFhSKzhATeub8QAQ', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-19 16:18:05 UTC')
💡 returning existing artifact with same hash: Artifact(uid='zuMiOZZYjSACCNgWUEvq', key='/raw/Collection_1/raw_data_2.txt', suffix='.txt', type='dataset', size=10, hash='Zb3txmg7tfsBrkW0fu8yEQ', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-19 16:18:05 UTC')
💡 returning existing artifact with same hash: Artifact(uid='HhMealZntBBEgAW6ynV7', key='/raw/Collection_1/raw_data_4.txt', suffix='.txt', type='dataset', size=10, hash='a89oCUzxNwidW_Dt6DgvjQ', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-19 16:18:05 UTC')
💡 returning existing artifact with same hash: Artifact(uid='6qTXw0EWFtKGLNd4dME5', key='/raw/Collection_2/raw_data_3.txt', suffix='.txt', type='dataset', size=10, hash='HYK9MaBwYTxJ9humnFup0A', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-19 16:18:05 UTC')
💡 returning existing artifact with same hash: Artifact(uid='gURqcQ0v0BSo7Mf1MEfZ', key='/raw/Collection_2/raw_data_1.txt', suffix='.txt', type='dataset', size=10, hash='uNxpymCy-8hKuOWp-KzvGA', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-19 16:18:05 UTC')
💡 returning existing artifact with same hash: Artifact(uid='7uWTJRmJuIzRKLkK2yJo', key='/raw/Collection_2/raw_data_2.txt', suffix='.txt', type='dataset', size=10, hash='nZAbGFRjQ7DnH9EVsUOZvA', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-19 16:18:05 UTC')
💡 returning existing artifact with same hash: Artifact(uid='Ydlv3iOb6p6qT2iU5Pe8', key='/raw/Collection_2/raw_data_4.txt', suffix='.txt', type='dataset', size=10, hash='1u_qI9-OnR_zxuqpvSNC6A', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-19 16:18:05 UTC')
ln.Collection.filter(name__icontains="data").df()
uid version name description hash reference reference_type visibility transform_id artifact_id run_id created_by_id updated_at
id
1 61F8YEx1knTnJaE7fYvi None data None -mSGjQFk6KgIhIXyO30I None None 1 1 None 1 1 2024-06-19 16:18:05.594645+00:00

This approach will likely lead to clashes. Alternatively, ULabels can be added to artifacts to mirror hierarchies.

ULabels

for root, _, artifacts in os.walk("complex_biological_project/raw"):
    for artifactname in artifacts:
        file_path = os.path.join(root, artifactname)
        key_path = file_path.removeprefix("complex_biological_project")
        ln_artifact = ln.Artifact(file_path, key=key_path)
        ln_artifact.save()

        data_label = ln.ULabel(name="data")
        data_label.save()
        ln_artifact.ulabels.add(data_label)
💡 returning existing artifact with same hash: Artifact(uid='EV2c9i9rHAIXItqRwtYf', suffix='.txt', type='dataset', size=10, hash='dexxNG0wtPa8N7B8fdPtIg', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-19 16:18:05 UTC')
❗ key None on existing artifact differs from passed key /raw/raw_data_1.txt
💡 returning existing artifact with same hash: Artifact(uid='pjmF5zcd6Vrrr2bOr2Gj', suffix='.txt', type='dataset', size=10, hash='iDRI4qNkfDPIoFiKX8hh0w', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-19 16:18:05 UTC')
❗ key None on existing artifact differs from passed key /raw/raw_data_2.txt
💡 returning existing ULabel record with same name: 'data'
💡 returning existing artifact with same hash: Artifact(uid='SolstrdDOEvzjJ3BSZI8', key='raw/raw_data_3.txt', suffix='.txt', type='dataset', size=10, hash='xvqAfSr0UjVJlAjib_kAmw', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-19 16:18:05 UTC')
❗ key raw/raw_data_3.txt on existing artifact differs from passed key /raw/raw_data_4.txt
💡 returning existing ULabel record with same name: 'data'
💡 returning existing artifact with same hash: Artifact(uid='dQCdIfa63XNX5mdOBJAx', key='/raw/Collection_1/raw_data_3.txt', suffix='.txt', type='dataset', size=10, hash='LeO2qBi9U1ojw2o7744KTg', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-19 16:18:05 UTC')
💡 returning existing ULabel record with same name: 'data'
💡 returning existing artifact with same hash: Artifact(uid='2iCk4CVpP4n2hoaxHbzi', key='/raw/Collection_1/raw_data_1.txt', suffix='.txt', type='dataset', size=10, hash='0d9Wc7SFhSKzhATeub8QAQ', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-19 16:18:05 UTC')
💡 returning existing ULabel record with same name: 'data'
💡 returning existing artifact with same hash: Artifact(uid='zuMiOZZYjSACCNgWUEvq', key='/raw/Collection_1/raw_data_2.txt', suffix='.txt', type='dataset', size=10, hash='Zb3txmg7tfsBrkW0fu8yEQ', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-19 16:18:05 UTC')
💡 returning existing ULabel record with same name: 'data'
💡 returning existing artifact with same hash: Artifact(uid='HhMealZntBBEgAW6ynV7', key='/raw/Collection_1/raw_data_4.txt', suffix='.txt', type='dataset', size=10, hash='a89oCUzxNwidW_Dt6DgvjQ', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-19 16:18:05 UTC')
💡 returning existing ULabel record with same name: 'data'
💡 returning existing artifact with same hash: Artifact(uid='6qTXw0EWFtKGLNd4dME5', key='/raw/Collection_2/raw_data_3.txt', suffix='.txt', type='dataset', size=10, hash='HYK9MaBwYTxJ9humnFup0A', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-19 16:18:05 UTC')
💡 returning existing ULabel record with same name: 'data'
💡 returning existing artifact with same hash: Artifact(uid='gURqcQ0v0BSo7Mf1MEfZ', key='/raw/Collection_2/raw_data_1.txt', suffix='.txt', type='dataset', size=10, hash='uNxpymCy-8hKuOWp-KzvGA', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-19 16:18:05 UTC')
💡 returning existing ULabel record with same name: 'data'
💡 returning existing artifact with same hash: Artifact(uid='7uWTJRmJuIzRKLkK2yJo', key='/raw/Collection_2/raw_data_2.txt', suffix='.txt', type='dataset', size=10, hash='nZAbGFRjQ7DnH9EVsUOZvA', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-19 16:18:05 UTC')
💡 returning existing ULabel record with same name: 'data'
💡 returning existing artifact with same hash: Artifact(uid='Ydlv3iOb6p6qT2iU5Pe8', key='/raw/Collection_2/raw_data_4.txt', suffix='.txt', type='dataset', size=10, hash='1u_qI9-OnR_zxuqpvSNC6A', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-19 16:18:05 UTC')
💡 returning existing ULabel record with same name: 'data'
labels = ln.ULabel.lookup()
ln.Artifact.filter(ulabels__in=[labels.data]).df()
uid version description key suffix type accessor size hash hash_type n_objects n_observations visibility key_is_virtual storage_id transform_id run_id created_by_id updated_at
id
1 EV2c9i9rHAIXItqRwtYf None None None .txt dataset None 10 dexxNG0wtPa8N7B8fdPtIg md5 None None 1 True 1 1 1 1 2024-06-19 16:18:05.650571+00:00
2 pjmF5zcd6Vrrr2bOr2Gj None None None .txt dataset None 10 iDRI4qNkfDPIoFiKX8hh0w md5 None None 1 True 1 1 1 1 2024-06-19 16:18:05.673048+00:00
4 SolstrdDOEvzjJ3BSZI8 None None raw/raw_data_3.txt .txt dataset None 10 xvqAfSr0UjVJlAjib_kAmw md5 None None 1 True 1 1 1 1 2024-06-19 16:18:05.697849+00:00
5 dQCdIfa63XNX5mdOBJAx None None /raw/Collection_1/raw_data_3.txt .txt dataset None 10 LeO2qBi9U1ojw2o7744KTg md5 None None 1 True 1 1 1 1 2024-06-19 16:18:05.737755+00:00
6 2iCk4CVpP4n2hoaxHbzi None None /raw/Collection_1/raw_data_1.txt .txt dataset None 10 0d9Wc7SFhSKzhATeub8QAQ md5 None None 1 True 1 1 1 1 2024-06-19 16:18:05.759405+00:00
7 zuMiOZZYjSACCNgWUEvq None None /raw/Collection_1/raw_data_2.txt .txt dataset None 10 Zb3txmg7tfsBrkW0fu8yEQ md5 None None 1 True 1 1 1 1 2024-06-19 16:18:05.781852+00:00
8 HhMealZntBBEgAW6ynV7 None None /raw/Collection_1/raw_data_4.txt .txt dataset None 10 a89oCUzxNwidW_Dt6DgvjQ md5 None None 1 True 1 1 1 1 2024-06-19 16:18:05.803906+00:00
9 6qTXw0EWFtKGLNd4dME5 None None /raw/Collection_2/raw_data_3.txt .txt dataset None 10 HYK9MaBwYTxJ9humnFup0A md5 None None 1 True 1 1 1 1 2024-06-19 16:18:05.840462+00:00
10 gURqcQ0v0BSo7Mf1MEfZ None None /raw/Collection_2/raw_data_1.txt .txt dataset None 10 uNxpymCy-8hKuOWp-KzvGA md5 None None 1 True 1 1 1 1 2024-06-19 16:18:05.864607+00:00
11 7uWTJRmJuIzRKLkK2yJo None None /raw/Collection_2/raw_data_2.txt .txt dataset None 10 nZAbGFRjQ7DnH9EVsUOZvA md5 None None 1 True 1 1 1 1 2024-06-19 16:18:05.889987+00:00
12 Ydlv3iOb6p6qT2iU5Pe8 None None /raw/Collection_2/raw_data_4.txt .txt dataset None 10 1u_qI9-OnR_zxuqpvSNC6A md5 None None 1 True 1 1 1 1 2024-06-19 16:18:05.914037+00:00

However, ULabels are too versatile for such an approach, and clashes are also to be expected here.

Metadata

Because the chance of clashes is rather high for the aforementioned approaches, we generally recommend not storing hierarchical data solely with semantic keys. Biological metadata makes artifacts and collections unambiguous and easily queryable.

Legacy data and multiple storage roots

Distributed Collections

LaminDB can ingest legacy data that already has a structure in its storage. In such cases, it disables artifact_use_virtual_keys and the artifacts are ingested with their actual storage location. It might therefore be possible that artifacts stored in different storage roots are associated with a single Collection. To simulate this, we disable artifact_use_virtual_keys and ingest artifacts stored in a different path (the "legacy data").

ln.settings.artifact_use_virtual_keys = False
for root, _, artifacts in os.walk("complex_biological_project/preprocessed"):
    for artifactname in artifacts:
        file_path = os.path.join(root, artifactname)
        key_path = file_path.removeprefix("complex_biological_project")

        print(file_path)
        print()

        ln_artifact = ln.Artifact(file_path, key=f"./{key_path}")
        ln_artifact.save()
complex_biological_project/preprocessed/result_4.txt

💡 path content will be copied to default storage upon `save()` with key './/preprocessed/result_4.txt'
✅ storing artifact '69rgN6OAtHn5gG2wyTpD' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/preprocessed/result_4.txt'
complex_biological_project/preprocessed/result_1.txt

💡 path content will be copied to default storage upon `save()` with key './/preprocessed/result_1.txt'
✅ storing artifact 'rFCk35UjfT2YAls4bFbl' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/preprocessed/result_1.txt'
complex_biological_project/preprocessed/result_3.txt

💡 path content will be copied to default storage upon `save()` with key './/preprocessed/result_3.txt'
✅ storing artifact '3abEjruCqAooigpNPGnw' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/preprocessed/result_3.txt'
complex_biological_project/preprocessed/result_2.txt

💡 path content will be copied to default storage upon `save()` with key './/preprocessed/result_2.txt'
✅ storing artifact 'NN0Zr9Aal91lOb9F0aQd' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/preprocessed/result_2.txt'
ln.Artifact.df()
uid version description key suffix type accessor size hash hash_type n_objects n_observations visibility key_is_virtual storage_id transform_id run_id created_by_id updated_at
id
16 NN0Zr9Aal91lOb9F0aQd None None .//preprocessed/result_2.txt .txt dataset None 10 bXDBrm7DOhzX2s5C9j3iag md5 None None 1 False 1 1 1 1 2024-06-19 16:18:06.004466+00:00
15 3abEjruCqAooigpNPGnw None None .//preprocessed/result_3.txt .txt dataset None 10 eoARgTJqeIwgQtXi3uwvXw md5 None None 1 False 1 1 1 1 2024-06-19 16:18:05.996890+00:00
14 rFCk35UjfT2YAls4bFbl None None .//preprocessed/result_1.txt .txt dataset None 10 OIPNZAzyfs4JGR60xAwiXw md5 None None 1 False 1 1 1 1 2024-06-19 16:18:05.989435+00:00
13 69rgN6OAtHn5gG2wyTpD None None .//preprocessed/result_4.txt .txt dataset None 10 iSnNiZ9erCAPRg9JddJOnw md5 None None 1 False 1 1 1 1 2024-06-19 16:18:05.982098+00:00
12 Ydlv3iOb6p6qT2iU5Pe8 None None /raw/Collection_2/raw_data_4.txt .txt dataset None 10 1u_qI9-OnR_zxuqpvSNC6A md5 None None 1 True 1 1 1 1 2024-06-19 16:18:05.914037+00:00
11 7uWTJRmJuIzRKLkK2yJo None None /raw/Collection_2/raw_data_2.txt .txt dataset None 10 nZAbGFRjQ7DnH9EVsUOZvA md5 None None 1 True 1 1 1 1 2024-06-19 16:18:05.889987+00:00
10 gURqcQ0v0BSo7Mf1MEfZ None None /raw/Collection_2/raw_data_1.txt .txt dataset None 10 uNxpymCy-8hKuOWp-KzvGA md5 None None 1 True 1 1 1 1 2024-06-19 16:18:05.864607+00:00
9 6qTXw0EWFtKGLNd4dME5 None None /raw/Collection_2/raw_data_3.txt .txt dataset None 10 HYK9MaBwYTxJ9humnFup0A md5 None None 1 True 1 1 1 1 2024-06-19 16:18:05.840462+00:00
8 HhMealZntBBEgAW6ynV7 None None /raw/Collection_1/raw_data_4.txt .txt dataset None 10 a89oCUzxNwidW_Dt6DgvjQ md5 None None 1 True 1 1 1 1 2024-06-19 16:18:05.803906+00:00
7 zuMiOZZYjSACCNgWUEvq None None /raw/Collection_1/raw_data_2.txt .txt dataset None 10 Zb3txmg7tfsBrkW0fu8yEQ md5 None None 1 True 1 1 1 1 2024-06-19 16:18:05.781852+00:00
6 2iCk4CVpP4n2hoaxHbzi None None /raw/Collection_1/raw_data_1.txt .txt dataset None 10 0d9Wc7SFhSKzhATeub8QAQ md5 None None 1 True 1 1 1 1 2024-06-19 16:18:05.759405+00:00
5 dQCdIfa63XNX5mdOBJAx None None /raw/Collection_1/raw_data_3.txt .txt dataset None 10 LeO2qBi9U1ojw2o7744KTg md5 None None 1 True 1 1 1 1 2024-06-19 16:18:05.737755+00:00
4 SolstrdDOEvzjJ3BSZI8 None None raw/raw_data_3.txt .txt dataset None 10 xvqAfSr0UjVJlAjib_kAmw md5 None None 1 True 1 1 1 1 2024-06-19 16:18:05.697849+00:00
2 pjmF5zcd6Vrrr2bOr2Gj None None None .txt dataset None 10 iDRI4qNkfDPIoFiKX8hh0w md5 None None 1 True 1 1 1 1 2024-06-19 16:18:05.673048+00:00
1 EV2c9i9rHAIXItqRwtYf None None None .txt dataset None 10 dexxNG0wtPa8N7B8fdPtIg md5 None None 1 True 1 1 1 1 2024-06-19 16:18:05.650571+00:00
3 A5ibzZximXr94GoN50iN None None raw/raw_data_3.txt .txt dataset None 10 AQ4NpfCAHHtzXKPzfDLs_g md5 None None 1 True 1 1 1 1 2024-06-19 16:18:05.347718+00:00
artifact_from_raw = ln.Artifact.filter(key__icontains="Collection_2/raw_data_1").first()
artifact_from_preprocessed = ln.Artifact.filter(
    key__icontains="preprocessed/result_1"
).first()

print(artifact_from_raw.path)
print(artifact_from_preprocessed.path)
/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/gURqcQ0v0BSo7Mf1MEfZ.txt
/home/runner/work/lamindb/lamindb/docs/faq/key-eval/preprocessed/result_1.txt
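
The two printed paths differ because of `key_is_virtual`. Judging only from the output above, a file artifact with a virtual key is stored under `.lamindb/` with a name built from its `uid` and suffix, whereas a non-virtual key is used as the relative path inside the storage location. The following is a minimal sketch of that rule in plain Python. It is not LaminDB's implementation: `resolve_storage_path` is a hypothetical helper, and it only covers single-file artifacts.

```python
from pathlib import PurePosixPath


def resolve_storage_path(storage_root, uid, suffix, key, key_is_virtual):
    """Hypothetical helper mirroring the paths printed above (not a LaminDB API)."""
    root = PurePosixPath(storage_root)
    if key is None or key_is_virtual:
        # Virtual (or missing) key: the file name is derived from the uid,
        # so the key is only a label stored in the database.
        return root / ".lamindb" / f"{uid}{suffix}"
    # Non-virtual key: the key is the relative path inside the storage location.
    return root / key


root = "/home/runner/work/lamindb/lamindb/docs/faq/key-eval"
raw_path = resolve_storage_path(
    root, "gURqcQ0v0BSo7Mf1MEfZ", ".txt", "/raw/Collection_2/raw_data_1.txt", True
)
preprocessed_path = resolve_storage_path(
    root, "rFCk35UjfT2YAls4bFbl", ".txt", ".//preprocessed/result_1.txt", False
)
print(raw_path)
print(preprocessed_path)
```

The practical consequence is that a virtual key can be changed without moving any file, while a non-virtual key is tied to the physical location of the file.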

Let's create our Collection:

ds = ln.Collection(
    [artifact_from_raw, artifact_from_preprocessed], name="raw_and_processed_collection_2"
)
ds.save()
ds.artifacts.df()
uid version description key suffix type accessor size hash hash_type n_objects n_observations visibility key_is_virtual storage_id transform_id run_id created_by_id updated_at
id
10 gURqcQ0v0BSo7Mf1MEfZ None None /raw/Collection_2/raw_data_1.txt .txt dataset None 10 uNxpymCy-8hKuOWp-KzvGA md5 None None 1 True 1 1 1 1 2024-06-19 16:18:05.864607+00:00
14 rFCk35UjfT2YAls4bFbl None None .//preprocessed/result_1.txt .txt dataset None 10 OIPNZAzyfs4JGR60xAwiXw md5 None None 1 False 1 1 1 1 2024-06-19 16:18:05.989435+00:00

Modeling directories

ln.settings.artifact_use_virtual_keys = True
dir_path = ln.core.datasets.dir_scrnaseq_cellranger("sample_001")
ln.UPath(dir_path).view_tree()
💡 file has more than one suffix (path.suffixes), using only last suffix: '.bai' - if you want your composite suffix to be recognized add it to lamindb.core.storage.VALID_SUFFIXES.add()
3 sub-directories & 15 files with suffixes '.csv', '.mtx.gz', '.h5', '.html', '.tsv.gz', '.cloupe', '.bai', '.bam'
/home/runner/work/lamindb/lamindb/docs/faq/sample_001
├── possorted_genome_bam.bam
├── filtered_feature_bc_matrix.h5
├── metrics_summary.csv
├── filtered_feature_bc_matrix/
│   ├── barcodes.tsv.gz
│   ├── matrix.mtx.gz
│   └── features.tsv.gz
├── possorted_genome_bam.bam.bai
├── analysis/
│   └── analysis.csv
├── raw_feature_bc_matrix/
│   ├── barcodes.tsv.gz
│   ├── matrix.mtx.gz
│   └── features.tsv.gz
├── web_summary.html
├── raw_feature_bc_matrix.h5
├── cloupe.cloupe
└── molecule_info.h5

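There are two ways to create Artifact objects from a directory: Artifact.from_dir(), which creates one artifact per file, and the Artifact constructor, which creates a single artifact for the entire directory.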

cellranger_raw_artifact = ln.Artifact.from_dir("sample_001/raw_feature_bc_matrix/")
❗ this creates one artifact per file in the directory - you might simply call ln.Artifact(dir) to get one artifact for the entire directory
❗ folder is outside existing storage location, will copy files from sample_001/raw_feature_bc_matrix/ to /home/runner/work/lamindb/lamindb/docs/faq/key-eval/raw_feature_bc_matrix
✅ created 3 artifacts from directory using storage /home/runner/work/lamindb/lamindb/docs/faq/key-eval and key = raw_feature_bc_matrix/
for artifact in cellranger_raw_artifact:
    artifact.save()
✅ storing artifact 'ib45SersE3qc0Jf1ns3y' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/ib45SersE3qc0Jf1ns3y.tsv.gz'
✅ storing artifact 'vKZ81xUEQ0HYdzvQyTI8' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/vKZ81xUEQ0HYdzvQyTI8.mtx.gz'
✅ storing artifact 'EVAsIP8Tlf4hpCNopzaR' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/EVAsIP8Tlf4hpCNopzaR.tsv.gz'
cellranger_raw_folder = ln.Artifact(
    "sample_001/raw_feature_bc_matrix/", description="cellranger raw"
)
cellranger_raw_folder.save()
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/sLUpMJWKNMCzzKk7')
✅ storing artifact 'sLUpMJWKNMCzzKk77L8r' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/sLUpMJWKNMCzzKk7'
Artifact(uid='sLUpMJWKNMCzzKk77L8r', description='cellranger raw', suffix='', type='dataset', size=18, hash='ftmdQhlBX4mhFSfFrGJNRQ', hash_type='md5-d', n_objects=3, visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-19 16:18:06 UTC')
ln.Artifact.filter(key__icontains="raw_feature_bc_matrix").df()
uid version description key suffix type accessor size hash hash_type n_objects n_observations visibility key_is_virtual storage_id transform_id run_id created_by_id updated_at
id
17 ib45SersE3qc0Jf1ns3y None None raw_feature_bc_matrix/barcodes.tsv.gz .tsv.gz dataset None 6 xgFDdWw6HHLhGVmI91ipzg md5 None None 1 True 1 1 1 1 2024-06-19 16:18:06.142037+00:00
18 vKZ81xUEQ0HYdzvQyTI8 None None raw_feature_bc_matrix/matrix.mtx.gz .mtx.gz dataset None 6 1rwHUyK773Bb8Qhh9Vteaw md5 None None 1 True 1 1 1 1 2024-06-19 16:18:06.146764+00:00
19 EVAsIP8Tlf4hpCNopzaR None None raw_feature_bc_matrix/features.tsv.gz .tsv.gz dataset None 6 F5l2xKriNn6XtHWoZdKGRg md5 None None 1 True 1 1 1 1 2024-06-19 16:18:06.151294+00:00
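
The `key__icontains` filter above follows the Django-style lookup convention: a case-insensitive substring match on the `key` field. Only the three per-file artifacts are returned; the directory-level artifact was saved with key `None`, so it cannot match a key filter and has to be found by its description instead. The sketch below imitates that behavior in plain Python on dictionaries copied from the tables above. The `icontains` helper is hypothetical and only illustrates the matching rule, not how the database evaluates the query.

```python
def icontains(value, needle):
    """Case-insensitive substring test; a missing value never matches."""
    return value is not None and needle.lower() in value.lower()


records = [
    {"uid": "ib45SersE3qc0Jf1ns3y", "key": "raw_feature_bc_matrix/barcodes.tsv.gz"},
    {"uid": "vKZ81xUEQ0HYdzvQyTI8", "key": "raw_feature_bc_matrix/matrix.mtx.gz"},
    {"uid": "EVAsIP8Tlf4hpCNopzaR", "key": "raw_feature_bc_matrix/features.tsv.gz"},
    {"uid": "sLUpMJWKNMCzzKk77L8r", "key": None},  # directory artifact: no key
    {"uid": "rFCk35UjfT2YAls4bFbl", "key": ".//preprocessed/result_1.txt"},
]

# Upper-case letters in the needle are deliberate: the match ignores case.
matches = [r["uid"] for r in records if icontains(r["key"], "RAW_feature_bc_matrix")]
print(matches)
```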
ln.Artifact.filter(key__icontains="raw_feature_bc_matrix/matrix.mtx.gz").one().path
PosixUPath('/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/vKZ81xUEQ0HYdzvQyTI8.mtx.gz')
artifact = ln.Artifact.filter(description="cellranger raw").one()
artifact.path.glob("*")
<generator object Path.glob at 0x7fce14b069b0>
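
`glob()` returns a lazy generator, which is why the cell above prints a generator object rather than file names. Iterating over it, for example with `list()` or `sorted()`, yields the entries. The sketch below shows this with the standard library's `pathlib.Path` on a throwaway temporary directory; the file names mimic the folder above and are created only for illustration. It assumes that `artifact.path` behaves like a `pathlib` path in this respect, which the `Path.glob` in the output suggests.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz"):
        (folder / name).write_text("x")

    lazy = folder.glob("*")  # a generator; no entries are produced yet
    names = sorted(p.name for p in lazy)  # iterating yields the entries
    print(names)
```

Applied to the cell above, this would read `list(artifact.path.glob("*"))`.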