Introduction

LaminDB is an open-source data framework for biology.

  • Manage storage & databases with a unified Python API (“lakehouse”).

  • Track data lineage across notebooks & pipelines.

  • Integrate registries for experimental metadata & in-house ontologies.

  • Validate, standardize & annotate.

  • Collaborate across distributed databases.

LaminHub is a data collaboration hub built on LaminDB, similar to how GitHub is built on git.

Basic features of LaminHub are free. Enterprise features, hosted in your infrastructure or ours, are available on a paid plan.

Quickstart

You’ll ingest a small dataset while tracking data lineage, and learn how to validate, annotate, query & search.

Setup

Install the lamindb Python package:

pip install 'lamindb[jupyter,bionty]'

Initialize a LaminDB instance that mounts the bionty plugin for biological types:

# store artifacts in a local directory `./lamin-intro`
!lamin init --storage ./lamin-intro --schema bionty
❗ using anonymous user (to identify, call: lamin login)
💡 connected lamindb: anonymous/lamin-intro

Now you’re ready to import lamindb in a Python session.

import lamindb as ln
💡 connected lamindb: anonymous/lamin-intro

Track

Run track() to track the inputs and outputs of your code.

When you first run ln.track(), it raises an exception and creates a stem_uid & version to identify a notebook or script:

# copy-pasted identifiers for your notebook or script
ln.settings.transform.stem_uid = "FPnfDtJz8qbE"  # <-- auto-generated by running ln.track()
ln.settings.transform.version = "1"  # <-- auto-generated by running ln.track()

# track the execution of your notebook or script
run = ln.track()

# see your currently running transform
run.transform
💡 notebook imports: anndata==0.10.7 bionty==0.44.0 lamindb==0.74a1 pandas==1.5.3 pytest==8.2.2
💡 saved: Transform(uid='FPnfDtJz8qbE5zKv', version='1', name='Introduction', key='introduction', type='notebook', created_by_id=1, updated_at='2024-06-19 16:16:42 UTC')
💡 saved: Run(uid='hyDkUGFJ6FqFJy0RgEGE', transform_id=1, created_by_id=1)
Transform(uid='FPnfDtJz8qbE5zKv', version='1', name='Introduction', key='introduction', type='notebook', created_by_id=1, updated_at='2024-06-19 16:16:42 UTC')

Calling ln.track() created a run and a transform and stored them in the Run and Transform registries, respectively. The Transform registry lets you find your notebooks, scripts and pipelines. The Run registry lets you find their runs.

Is this compliant with OpenLineage?

Yes. What OpenLineage calls a “job”, LaminDB calls a “transform”. What OpenLineage calls a “run”, LaminDB calls a “run”.

Are the global identifiers stem_uid and version really necessary?

Yes, if you want to track data lineage outside of workflow managers.

To tie a piece of code to a record in a database in a way that survives name changes (notebook file name, script name, pipeline name) and content changes, you need to attach the code to an immutable identifier.

Content changes need to correlate with the identifier, but shouldn’t give rise to an entirely new identifier. To implement this behavior, we combine a random stem with a hash of the user-provided version and create a universal identifier as follows:

stem_uid = random_base62(n_char)  # a random base62 sequence of length n_char
version_uid = encode_base62(md5_hash(version))[:4]  # version is, e.g., "1" or "2.1.0" or "2022-03-01"
uid = f"{stem_uid}{version_uid}"  # concatenate the stem_uid & version_uid

git, by comparison, identifies code by its content hash alone. Notebook platforms like Google Colab and DeepNote do not encode versions into their notebook ids.
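
The pseudocode above can be made concrete with the standard library. The following is a minimal sketch, not LaminDB's actual implementation: the alphabet ordering, the big-endian integer encoding, and the helper name make_uid are assumptions chosen for illustration, so the suffix it produces need not match the uid shown earlier.

```python
import hashlib
import secrets
import string

# Assumed alphabet ordering (digits, lowercase, uppercase); LaminDB's may differ.
BASE62_ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def random_base62(n_char: int) -> str:
    """Return a random base62 string of length n_char."""
    return "".join(secrets.choice(BASE62_ALPHABET) for _ in range(n_char))


def encode_base62(data: bytes) -> str:
    """Encode bytes as base62 by reading them as one big-endian integer (an assumption)."""
    number = int.from_bytes(data, "big")
    if number == 0:
        return BASE62_ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, 62)
        digits.append(BASE62_ALPHABET[remainder])
    return "".join(reversed(digits))


def md5_hash(text: str) -> bytes:
    """Return the raw 16-byte MD5 digest of a version string."""
    return hashlib.md5(text.encode("utf-8")).digest()


def make_uid(stem_uid: str, version: str) -> str:
    """Concatenate the stem with a 4-character suffix derived from the version."""
    version_uid = encode_base62(md5_hash(version))[:4]
    return f"{stem_uid}{version_uid}"


stem_uid = random_base62(12)
uid_v1 = make_uid(stem_uid, "1")
uid_v2 = make_uid(stem_uid, "2")
print(uid_v1, uid_v2)
```

Both uids share the 12-character stem and differ only in the 4-character suffix, which is what lets a new version stay linked to its predecessors.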

Artifacts

Use Artifact to manage data in local or remote storage.

import pandas as pd

# a sample dataset
df = pd.DataFrame(
    {"CD8A": [1, 2, 3], "CD4": [3, 4, 5], "CD14": [5, 6, 7], "perturbation": ["DMSO", "IFNG", "DMSO"]},
    index=["observation1", "observation2", "observation3"],
)

# create an artifact from a DataFrame
artifact = ln.Artifact.from_df(df, description="my RNA-seq", version="1")

# artifacts come with typed, relational metadata
artifact.describe()

# save data & metadata in one operation
artifact.save()
Artifact(uid='PxU00Gm32x6FuPBUzIJP', version='1', description='my RNA-seq', suffix='.parquet', type='dataset', accessor='DataFrame', size=4122, hash='EzUJIW3AamdtaNxG_Bu_nA', hash_type='md5', visibility=1, key_is_virtual=True)
  Provenance
    .created_by = 'anonymous'
    .storage = '/home/runner/work/lamindb/lamindb/docs/lamin-intro'
    .transform = 'Introduction'
    .run = '2024-06-19 16:16:42 UTC'
Artifact(uid='PxU00Gm32x6FuPBUzIJP', version='1', description='my RNA-seq', suffix='.parquet', type='dataset', accessor='DataFrame', size=4122, hash='EzUJIW3AamdtaNxG_Bu_nA', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-19 16:16:43 UTC')

View data lineage:

artifact.view_lineage()

Load an artifact:

artifact.load()
              CD8A  CD4  CD14 perturbation
observation1     1    3     5         DMSO
observation2     2    4     6         IFNG
observation3     3    5     7         DMSO

An artifact stores a dataset or model as either a file or a folder.

How do I register a file or folder?

Local:

ln.Artifact("./my_data.fcs", description="my flow cytometry file")
ln.Artifact("./my_images/", description="my folder of images")

Remote:

ln.Artifact("s3://my-bucket/my_data.fcs", description="my flow cytometry file")
ln.Artifact("s3://my-bucket/my_images/", description="my folder of images")

You can also use other remote file systems supported by fsspec.

How does LaminDB compare to a file system or object store?

Similar to organizing files in file systems & object stores with paths, you can organize artifacts using the key parameter of Artifact.

However, LaminDB encourages you to not rely on semantic keys but instead organize your data based on metadata.

Rather than memorizing names of folders and files, you find data via the entities you care about: people, code, experiments, genes, proteins, cell types, etc.

LaminDB embeds each artifact into rich relational metadata and indexes it in storage with a universal ID (uid).

This scales much better than semantic keys, which lead to deep hierarchical information structures that can become hard to navigate.
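
To make the contrast with path-based organization concrete, here is a minimal sketch in plain Python. It does not use the LaminDB API: the Record class, the find helper, and all uids, keys, and metadata values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Record:
    """A hypothetical stand-in for an artifact record: a uid, an optional key, and metadata."""

    uid: str
    key: Optional[str] = None
    metadata: dict = field(default_factory=dict)


# Illustrative records; uids, keys, and metadata values are made up.
records = [
    Record("uid-0001", key="project_a/2024/plate1.fcs", metadata={"cell_type": "T cell", "created_by": "alice"}),
    Record("uid-0002", key=None, metadata={"cell_type": "T cell", "created_by": "bob"}),
    Record("uid-0003", key="misc/old/final_v2.parquet", metadata={"cell_type": "neuron", "created_by": "alice"}),
]


def find(all_records, **criteria):
    """Return records whose metadata matches every given key-value pair."""
    return [
        record
        for record in all_records
        if all(record.metadata.get(name) == value for name, value in criteria.items())
    ]


t_cell_uids = [record.uid for record in find(records, cell_type="T cell")]
print(t_cell_uids)
```

The lookup succeeds for the record without a key and is unaffected by how the other keys are nested, which is the point of querying by entities rather than by paths.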

Are artifacts aware of array-like data?

Yes.

You can make artifacts from paths referencing array-like objects:

ln.Artifact("./my_anndata.h5ad", description="annotated array")
ln.Artifact("./my_zarr_array/", description="my zarr array store")

Or from in-memory objects:

ln.Artifact.from_df(df, description="my dataframe")
ln.Artifact.from_anndata(adata, description="annotated array")

How do I version artifacts?

Every artifact is auto-versioned by its hash.

You can also pass a human-readable version field and make new versions via:

artifact_v2 = ln.Artifact("my_path", is_new_version_of=artifact_v1)

Artifacts of the same version family share the same stem uid (the first 16 characters of the uid).

You can see all versions of an artifact via artifact.versions.
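
The two ideas above, identifying content by its hash and grouping versions by a shared stem, can be sketched with the standard library. This is an illustration under assumptions: the hash encoding (MD5 as unpadded URL-safe base64), the TinyRegistry class, and the example uids are hypothetical and not taken from LaminDB's code.

```python
import base64
import hashlib


def content_hash(data: bytes) -> str:
    """MD5 digest as URL-safe base64 without padding (the encoding is an assumption)."""
    digest = hashlib.md5(data).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


class TinyRegistry:
    """Hypothetical in-memory registry that deduplicates content by hash."""

    def __init__(self):
        self._uid_by_hash = {}

    def register(self, uid: str, data: bytes) -> str:
        """Return the uid stored for this content, registering it if it is new."""
        return self._uid_by_hash.setdefault(content_hash(data), uid)


def version_family(uids, uid, stem_length=16):
    """Return all uids that share the first stem_length characters with uid."""
    stem = uid[:stem_length]
    return [candidate for candidate in uids if candidate.startswith(stem)]


# Made-up 20-character uids: a 16-character stem plus a 4-character suffix.
uid_a1 = "A" * 16 + "0001"
uid_a2 = "A" * 16 + "0002"
uid_b1 = "B" * 16 + "0001"

registry = TinyRegistry()
first = registry.register(uid_a1, b"same bytes")
again = registry.register(uid_b1, b"same bytes")  # identical content, so no new entry
family = version_family([uid_a1, uid_a2, uid_b1], uid_a1)
print(first, again, family)
```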

Labels

Label an artifact with a label managed by the ULabel registry.

# create & save a label
candidate_marker_study = ln.ULabel(name="Candidate marker study").save()

# label an artifact
artifact.ulabels.add(candidate_marker_study)
artifact.describe()
Artifact(uid='PxU00Gm32x6FuPBUzIJP', version='1', description='my RNA-seq', suffix='.parquet', type='dataset', accessor='DataFrame', size=4122, hash='EzUJIW3AamdtaNxG_Bu_nA', hash_type='md5', visibility=1, key_is_virtual=True, updated_at='2024-06-19 16:16:43 UTC')
  Provenance
    .created_by = 'anonymous'
    .storage = '/home/runner/work/lamindb/lamindb/docs/lamin-intro'
    .transform = 'Introduction'
    .run = '2024-06-19 16:16:42 UTC'
  Labels
    .ulabels = 'Candidate marker study'

Registries

LaminDB’s central classes are registries that manage metadata.

The easiest way to see what’s in a registry is to call .df().

ln.Artifact.df()

id  uid                   version  description  key   suffix    type     accessor   size  hash                    hash_type  n_objects  n_observations  visibility  key_is_virtual  storage_id  transform_id  run_id  created_by_id  updated_at
1   PxU00Gm32x6FuPBUzIJP  1        my RNA-seq   None  .parquet  dataset  DataFrame  4122  EzUJIW3AamdtaNxG_Bu_nA  md5        None       None            1           True            1           1             1       1              2024-06-19 16:16:43.662190+00:00

ln.Transform.df()

id  uid               version  name          key           description  type      reference  reference_type  latest_report_id  source_code_id  created_by_id  updated_at
1   FPnfDtJz8qbE5zKv  1        Introduction  introduction  None         notebook  None       None            None              None            1              2024-06-19 16:16:42.634377+00:00

ln.ULabel.df()

id  uid       name                    description  reference  reference_type  run_id  created_by_id  updated_at
1   oVzob2ZE  Candidate marker study  None         None       None            1       1              2024-06-19 16:16:43.834013+00:00

Queries

You can write arbitrary relational queries using Django’s query syntax.

# get an entity by uid (here, the current notebook)
transform = ln.Transform.get("FPnfDtJz8qbE")

# filter by description
ln.Artifact.filter(description="my RNA-seq").df()

# query all artifacts ingested from the current notebook
artifacts = ln.Artifact.filter(transform=transform).all()

# query all artifacts ingested from a notebook with "intro" in the name and labeled "Candidate marker study"
artifacts = ln.Artifact.filter(
    transform__name__icontains="intro",
    ulabels=candidate_marker_study
).all()
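
The double-underscore syntax in these filters reads as a path through related records followed by an optional comparison. The toy matcher below illustrates that reading on plain dictionaries. It is a sketch only: Django compiles such lookups to SQL rather than evaluating them in Python, and the matches helper and sample records are hypothetical.

```python
def matches(record: dict, **lookups) -> bool:
    """Check a nested dict against Django-style lookups (toy version: exact and icontains only)."""
    for expression, expected in lookups.items():
        parts = expression.split("__")
        # A trailing segment that names a comparison is split off; the default is equality.
        op = parts.pop() if parts[-1] in {"exact", "icontains"} else "exact"
        value = record
        for name in parts:  # follow the relation path, e.g. transform -> name
            value = value[name]
        if op == "icontains":
            if str(expected).lower() not in str(value).lower():
                return False
        elif value != expected:
            return False
    return True


# Hypothetical records mirroring the fields used in the queries above.
artifacts = [
    {"description": "my RNA-seq", "transform": {"name": "Introduction"}},
    {"description": "other data", "transform": {"name": "Preprocessing"}},
]

hits = [a for a in artifacts if matches(a, transform__name__icontains="intro")]
print(hits)
```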

Features

You can annotate artifacts with features & values.

import pytest

with pytest.raises(ln.core.exceptions.ValidationError) as e:
    artifact.features.add_values({"temperature": 21.6})

print(e.exconly())
lamindb.core.exceptions.ValidationError: These keys could not be validated: ['temperature']
Here is how to create a feature:

  ln.Feature(name='temperature', dtype='float').save()

LaminDB validates all user input against its registries. As the temperature feature didn’t exist, we got an error.

Let’s follow the hint in the error message:

# register the "temperature" feature
ln.Feature(name='temperature', dtype='float').save()

# now we can annotate with the feature & the value
artifact.features.add_values({"temperature": 21.6})
artifact.describe()
Artifact(uid='PxU00Gm32x6FuPBUzIJP', version='1', description='my RNA-seq', suffix='.parquet', type='dataset', accessor='DataFrame', size=4122, hash='EzUJIW3AamdtaNxG_Bu_nA', hash_type='md5', visibility=1, key_is_virtual=True, updated_at='2024-06-19 16:16:43 UTC')
  Provenance
    .created_by = 'anonymous'
    .storage = '/home/runner/work/lamindb/lamindb/docs/lamin-intro'
    .transform = 'Introduction'
    .run = '2024-06-19 16:16:42 UTC'
  Labels
    .ulabels = 'Candidate marker study'
  Features
    'temperature' = 21.6

We can also annotate with categorical features:

# register a categorical feature
ln.Feature(name='study', dtype='cat').save()

# add a categorical value
artifact.features.add_values({"study": "Candidate marker study"})

# describe the artifact and add type information
artifact.describe(print_types=True)
Artifact(uid='PxU00Gm32x6FuPBUzIJP', version='1', description='my RNA-seq', suffix='.parquet', type='dataset', accessor='DataFrame', size=4122, hash='EzUJIW3AamdtaNxG_Bu_nA', hash_type='md5', visibility=1, key_is_virtual=True, updated_at='2024-06-19 16:16:43 UTC')
  Provenance
    .created_by: User = 'anonymous'
    .storage: Storage = '/home/runner/work/lamindb/lamindb/docs/lamin-intro'
    .transform: Transform = 'Introduction'
    .run: Run = '2024-06-19 16:16:42 UTC'
  Labels
    .ulabels: ULabel = 'Candidate marker study'
  Features
    'study': cat[ULabel] = 'Candidate marker study'
    'temperature': float = 21.6

Features provide a way to bucket labels beyond their type/registry.

Validate & annotate

LaminDB validates & annotates categorical metadata by mapping categories onto registries.

Validate

Let’s use the high-level Annotate class to validate a DataFrame:

# construct an object to validate & annotate a DataFrame
annotate = ln.Annotate.from_df(
    df,
    # define validation criteria
    columns=ln.Feature.name,  # map column names
    categoricals={"perturbation": ln.ULabel.name},  # map categories
)

# the dataframe doesn't validate because registries don't contain the categories
annotate.validate()
✅ added 1 record with Feature.name for columns: 'perturbation'
3 non-validated categories are not saved in Feature.name: ['CD8A', 'CD4', 'CD14']!
      → to lookup categories, use lookup().columns
      → to save, run add_new_from_columns
💡 mapping perturbation on ULabel.name
2 terms are not validated: 'DMSO', 'IFNG'
      → save terms via .add_new_from('perturbation')
False
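
Conceptually, the validation result above is a membership check of column names and categories against registry contents. The sketch below reproduces that logic with plain sets. It is an assumption-laden illustration, not the Annotate implementation: the non_validated helper is hypothetical, and the toy registries hold only names.

```python
def non_validated(values, registry_names):
    """Return the distinct values missing from a registry, in order of first appearance."""
    missing = []
    for value in values:
        if value not in registry_names and value not in missing:
            missing.append(value)
    return missing


# Toy registries holding only names; the real registries store full records.
feature_names = {"temperature", "study", "perturbation"}
ulabel_names = {"Candidate marker study"}

columns = ["CD8A", "CD4", "CD14", "perturbation"]
perturbations = ["DMSO", "IFNG", "DMSO"]

missing_columns = non_validated(columns, feature_names)
missing_labels = non_validated(perturbations, ulabel_names)
is_valid = not missing_columns and not missing_labels
print(missing_columns, missing_labels, is_valid)

# Registering the missing names makes the same check pass.
feature_names.update(missing_columns)
ulabel_names.update(missing_labels)
is_valid_after_update = (
    not non_validated(columns, feature_names)
    and not non_validated(perturbations, ulabel_names)
)
print(is_valid_after_update)
```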

Update registries

# add non-validated features based on the DataFrame columns
annotate.add_new_from_columns()

# see the updated content of the features registry
ln.Feature.df()

✅ added 3 records with Feature.name for columns: 'CD8A', 'CD4', 'CD14'
id  uid           name          dtype        unit  description  synonyms  run_id  created_by_id  updated_at
6   hLa4GYcd1NOO  CD14          int          None  None         None      1       1              2024-06-19 16:16:44.397212+00:00
5   9j33DTMMyxMD  CD4           int          None  None         None      1       1              2024-06-19 16:16:44.397050+00:00
4   YQNFBtztCAQR  CD8A          int          None  None         None      1       1              2024-06-19 16:16:44.396910+00:00
3   7XpSoDpF2Xnf  perturbation  cat          None  None         None      1       1              2024-06-19 16:16:44.260941+00:00
2   hV0VW20pcoRT  study         cat[ULabel]  None  None         None      1       1              2024-06-19 16:16:44.166414+00:00
1   DxvnpnMLUCT6  temperature   float        None  None         None      1       1              2024-06-19 16:16:44.110389+00:00

# add non-validated labels based on the perturbations
annotate.add_new_from("perturbation")

# see the updated content of the ULabel registry
ln.ULabel.df()

✅ added 2 records with ULabel.name for perturbation: 'DMSO', 'IFNG'
id  uid       name                    description  reference  reference_type  run_id  created_by_id  updated_at
4   XfBvSXC5  is_perturbation         None         None       None            1       1              2024-06-19 16:16:44.447852+00:00
3   Q2bP0fm1  IFNG                    None         None       None            1       1              2024-06-19 16:16:44.437354+00:00
2   HgpQmR0X  DMSO                    None         None       None            1       1              2024-06-19 16:16:44.437222+00:00
1   oVzob2ZE  Candidate marker study  None         None       None            1       1              2024-06-19 16:16:43.834013+00:00

Annotate

# given the updated registries, the validation passes
annotate.validate()

# save annotated artifact
artifact = annotate.save_artifact(description="my RNA-seq", version="1")
artifact.describe()
✅ perturbation is validated against ULabel.name
💡 returning existing artifact with same hash: Artifact(uid='PxU00Gm32x6FuPBUzIJP', version='1', description='my RNA-seq', suffix='.parquet', type='dataset', accessor='DataFrame', size=4122, hash='EzUJIW3AamdtaNxG_Bu_nA', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-19 16:16:43 UTC')
Artifact(uid='PxU00Gm32x6FuPBUzIJP', version='1', description='my RNA-seq', suffix='.parquet', type='dataset', accessor='DataFrame', size=4122, hash='EzUJIW3AamdtaNxG_Bu_nA', hash_type='md5', visibility=1, key_is_virtual=True, updated_at='2024-06-19 16:16:44 UTC')
  Provenance
    .created_by = 'anonymous'
    .storage = '/home/runner/work/lamindb/lamindb/docs/lamin-intro'
    .transform = 'Introduction'
    .run = '2024-06-19 16:16:42 UTC'
  Labels
    .ulabels = 'Candidate marker study', 'DMSO', 'IFNG'
  Features
    'study' = 'Candidate marker study'
    'perturbation' = 'DMSO', 'IFNG'
    'temperature' = 21.6
  Feature sets
    'columns' = 'perturbation', 'CD8A', 'CD4', 'CD14'

Query for annotations

ulabels = ln.ULabel.lookup()
ln.Artifact.filter(ulabels=ulabels.ifng).one()
Artifact(uid='PxU00Gm32x6FuPBUzIJP', version='1', description='my RNA-seq', suffix='.parquet', type='dataset', accessor='DataFrame', size=4122, hash='EzUJIW3AamdtaNxG_Bu_nA', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-19 16:16:44 UTC')

Biological registries

The generic Feature and ULabel registries will get you pretty far.

But let’s now look at what you can do with a dedicated biological registry like Gene.

Access public ontologies

Every bionty registry is based on configurable public ontologies.

import bionty as bt

cell_types = bt.CellType.public()
cell_types
PublicOntology
Entity: CellType
Organism: all
Source: cl, 2024-02-13
#terms: 2918

cell_types.search("gamma delta T cell").head(2)

name                              ontology_id  definition                                         synonyms                                           parents       __ratio__
gamma-delta T cell                CL:0000798   A T Cell That Expresses A Gamma-Delta T Cell R...  gammadelta T cell|gamma-delta T-cell|gamma-del...  [CL:0000084]  100.000000
CD27-negative gamma-delta T cell  CL:0002125   A Circulating Gamma-Delta T Cell That Expresse...  gammadelta-17 cells                                [CL:0000800]  86.486486

Validate & annotate with typed features

import anndata as ad

# store the dataset as an AnnData object to distinguish data from metadata
adata = ad.AnnData(df[["CD8A", "CD4", "CD14"]], obs=df[["perturbation"]])

# create an annotation flow for an AnnData object
annotate = ln.Annotate.from_anndata(
    adata,
    # define validation criteria
    var_index=bt.Gene.symbol,  # map .var.index onto Gene registry
    categoricals={adata.obs.perturbation.name: ln.ULabel.name},  # map categories
    organism="human",  # specify the organism for the Gene registry
)
annotate.validate()

# save annotated artifact
artifact = annotate.save_artifact(description="my RNA-seq", version="1")
artifact.describe()
✅ added 3 records from public with Gene.symbol for var_index: 'CD8A', 'CD4', 'CD14'
✅ var_index is validated against Gene.symbol
✅ perturbation is validated against ULabel.name
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/Wp5h04Azk5RFUr1k867s.h5ad')
✅ storing artifact 'Wp5h04Azk5RFUr1k867s' at '/home/runner/work/lamindb/lamindb/docs/lamin-intro/.lamindb/Wp5h04Azk5RFUr1k867s.h5ad'
💡 parsing feature names of X stored in slot 'var'
3 terms (100.00%) are validated for symbol
✅    linked: FeatureSet(uid='Xq2WypPdhq9iZIQS1L3c', n=3, dtype='int', registry='bionty.Gene', hash='f2UVeHefaZxXFjmUwo9O', created_by_id=1, run_id=1)
💡 parsing feature names of slot 'obs'
1 term (100.00%) is validated for name
✅    linked: FeatureSet(uid='dvPFfJT4R86pocw0XeYl', n=1, registry='Feature', hash='rm_LZbJg7D-1NJ9S9-hP', created_by_id=1, run_id=1)
✅ saved 2 feature sets for slots: 'var','obs'
Artifact(uid='Wp5h04Azk5RFUr1k867s', version='1', description='my RNA-seq', suffix='.h5ad', type='dataset', accessor='AnnData', size=19240, hash='ohAeiVMJZOrc3bFTKmankw', hash_type='md5', n_observations=3, visibility=1, key_is_virtual=True, updated_at='2024-06-19 16:16:48 UTC')
  Provenance
    .created_by = 'anonymous'
    .storage = '/home/runner/work/lamindb/lamindb/docs/lamin-intro'
    .transform = 'Introduction'
    .run = '2024-06-19 16:16:42 UTC'
  Labels
    .ulabels = 'DMSO', 'IFNG'
  Features
    'perturbation' = 'DMSO', 'IFNG'
  Feature sets
    'var' = 'CD8A', 'CD4', 'CD14'
    'obs' = 'perturbation'

Query for typed features

# get a lookup object for human genes
genes = bt.Gene.filter(organism__name="human").lookup()
# query for all feature sets that contain CD8A
feature_sets = ln.FeatureSet.filter(genes=genes.cd8a).all()
# query all artifacts linked to these feature sets
ln.Artifact.filter(feature_sets__in=feature_sets).df()
id  uid                   version  description  key   suffix  type     accessor  size   hash                    hash_type  n_objects  n_observations  visibility  key_is_virtual  storage_id  transform_id  run_id  created_by_id  updated_at
2   Wp5h04Azk5RFUr1k867s  1        my RNA-seq   None  .h5ad   dataset  AnnData   19240  ohAeiVMJZOrc3bFTKmankw  md5        None       3               1           True            1           1             1       1              2024-06-19 16:16:48.728660+00:00

Add new records

Create a cell type record and add a new cell state.

# create an ontology-coupled cell type record and save it
neuron = bt.CellType.from_public(name="neuron")
neuron.save()
✅ created 1 CellType record from Bionty matching name: 'neuron'
💡 also saving parents of CellType(uid='3QnZfoBk', name='neuron', ontology_id='CL:0000540', synonyms='nerve cell', description='The Basic Cellular Unit Of Nervous Tissue. Each Neuron Consists Of A Body, An Axon, And Dendrites. Their Purpose Is To Receive, Conduct, And Transmit Impulses In The Nervous System.', created_by_id=1, run_id=1, public_source_id=29, updated_at='2024-06-19 16:16:49 UTC')
✅ created 3 CellType records from Bionty matching ontology_id: 'CL:0000393', 'CL:0002319', 'CL:0000404'
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: bt.settings.auto_save_parents = False
💡 also saving parents of CellType(uid='2qSJYeQX', name='electrically responsive cell', ontology_id='CL:0000393', description='A Cell Whose Function Is Determined By Its Response To An Electric Signal.', created_by_id=1, run_id=1, public_source_id=29, updated_at='2024-06-19 16:16:49 UTC')
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0000211'
💡 also saving parents of CellType(uid='590vrK18', name='electrically active cell', ontology_id='CL:0000211', description='A Cell Whose Function Is Determined By The Generation Or The Reception Of An Electric Signal.', created_by_id=1, run_id=1, public_source_id=29, updated_at='2024-06-19 16:16:50 UTC')
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0000000'
💡 also saving parents of CellType(uid='7kYbAaTq', name='neural cell', ontology_id='CL:0002319', description='A Cell That Is Part Of The Nervous System.', created_by_id=1, run_id=1, public_source_id=29, updated_at='2024-06-19 16:16:49 UTC')
💡 also saving parents of CellType(uid='5NqNmmSr', name='electrically signaling cell', ontology_id='CL:0000404', description='A Cell That Initiates An Electrical Signal And Passes That Signal To Another Cell.', created_by_id=1, run_id=1, public_source_id=29, updated_at='2024-06-19 16:16:49 UTC')

# create a record to track a new cell state
new_cell_state = bt.CellType(name="my neuron cell state", description="explains X")
new_cell_state.save()

# express that it's a neuron state
new_cell_state.parents.add(neuron)

# view ontological hierarchy
new_cell_state.view_parents(distance=2)

❗ records with similar names exist! did you mean to load one of them?
id  uid       name                          ontology_id  abbr  synonyms    description                                        public_source_id  run_id  created_by_id  updated_at
1   3QnZfoBk  neuron                        CL:0000540   None  nerve cell  The Basic Cellular Unit Of Nervous Tissue. Eac...  29                1       1              2024-06-19 16:16:49.256009+00:00
2   2qSJYeQX  electrically responsive cell  CL:0000393   None  None        A Cell Whose Function Is Determined By Its Res...  29                1       1              2024-06-19 16:16:49.846531+00:00
3   7kYbAaTq  neural cell                   CL:0002319   None  None        A Cell That Is Part Of The Nervous System.         29                1       1              2024-06-19 16:16:49.846678+00:00
4   5NqNmmSr  electrically signaling cell   CL:0000404   None  None        A Cell That Initiates An Electrical Signal And...  29                1       1              2024-06-19 16:16:49.846815+00:00
5   590vrK18  electrically active cell      CL:0000211   None  None        A Cell Whose Function Is Determined By The Gen...  29                1       1              2024-06-19 16:16:50.415692+00:00
6   4bKGljt0  cell                          CL:0000000   None  None        A Material Entity Of Anatomical Origin (Part O...  29                1       1              2024-06-19 16:16:50.873496+00:00
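
The hierarchy that view_parents(distance=2) walks can be illustrated with a small breadth-first traversal. This is a sketch in plain Python, not bionty's implementation. The PARENTS mapping contains only the parent links that can be read off the log output above; parents of "neural cell" and "electrically signaling cell" are not shown there and are therefore left out. The ancestors_within helper is hypothetical.

```python
from collections import deque

# Parent links read off the log output above; links not shown there are omitted.
PARENTS = {
    "my neuron cell state": ["neuron"],
    "neuron": [
        "electrically responsive cell",
        "neural cell",
        "electrically signaling cell",
    ],
    "electrically responsive cell": ["electrically active cell"],
    "electrically active cell": ["cell"],
}


def ancestors_within(name: str, distance: int) -> dict:
    """Return ancestors reachable in at most `distance` steps, mapped to their distance."""
    found = {}
    queue = deque([(name, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == distance:
            continue  # do not expand beyond the requested distance
        for parent in PARENTS.get(node, []):
            if parent not in found:
                found[parent] = depth + 1
                queue.append((parent, depth + 1))
    return found


ancestors = ancestors_within("my neuron cell state", distance=2)
print(ancestors)
```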

Scale up data & learning

How do you learn from new datasets that extend your previous data history? Leverage Collection.

# a new dataset
df = pd.DataFrame(
    {
        "CD8A": [2, 3, 3],
        "CD4": [3, 4, 5],
        "CD38": [4, 2, 3],
        "perturbation": ["DMSO", "IFNG", "IFNG"]
    },
    index=["observation4", "observation5", "observation6"],
)
adata = ad.AnnData(df[["CD8A", "CD4", "CD38"]], obs=df[["perturbation"]])

# validate, annotate and save a new artifact
annotate = ln.Annotate.from_anndata(
    adata,
    var_index=bt.Gene.symbol,
    categoricals={adata.obs.perturbation.name: ln.ULabel.name},
    organism="human"
)
annotate.validate()
artifact2 = annotate.save_artifact(description="my RNA-seq dataset 2")
✅ added 1 record from public with Gene.symbol for var_index: 'CD38'
✅ var_index is validated against Gene.symbol
✅ perturbation is validated against ULabel.name
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/X8vksZc7wQdpOjIMlDHZ.h5ad')
✅ storing artifact 'X8vksZc7wQdpOjIMlDHZ' at '/home/runner/work/lamindb/lamindb/docs/lamin-intro/.lamindb/X8vksZc7wQdpOjIMlDHZ.h5ad'
💡 parsing feature names of X stored in slot 'var'
3 terms (100.00%) are validated for symbol
✅    linked: FeatureSet(uid='RujdGyONeWeY75Gfujpr', n=3, dtype='int', registry='bionty.Gene', hash='QW2rHuIo5-eGNZbRxHMD', created_by_id=1, run_id=1)
💡 parsing feature names of slot 'obs'
1 term (100.00%) is validated for name
✅    linked: FeatureSet(uid='dvPFfJT4R86pocw0XeYl', n=1, registry='Feature', hash='rm_LZbJg7D-1NJ9S9-hP', created_by_id=1, run_id=1)
✅ saved 1 feature set for slot: 'var'

Collections of artifacts

Create a collection using Collection.

collection = ln.Collection([artifact, artifact2], name="my RNA-seq collection", version="1")
collection.save()
collection.describe()
collection.view_lineage()
✅ saved 1 feature set for slot: 'var'
Collection(uid='YQqkxVICZFp4a0KgKK3p', version='1', name='my RNA-seq collection', hash='5g0aLY_lBSTkIYYUTycd', visibility=1, updated_at='2024-06-19 16:16:54 UTC')
  Provenance
    .created_by = 'anonymous'
    .transform = 'Introduction'
    .run = '2024-06-19 16:16:42 UTC'
  Feature sets
    'obs' = 'perturbation'
    'var' = 'CD8A', 'CD4', 'CD14', 'CD38'

# if it's small enough, you can load the entire collection into memory as if it was one
collection.load()

# typically, it's too big, hence, iterate over its artifacts
collection.artifacts.all()

# or look at a DataFrame listing the artifacts
collection.artifacts.df()
id  uid                   version  description           key   suffix  type     accessor  size   hash                    hash_type  n_objects  n_observations  visibility  key_is_virtual  storage_id  transform_id  run_id  created_by_id  updated_at
2   Wp5h04Azk5RFUr1k867s  1        my RNA-seq            None  .h5ad   dataset  AnnData   19240  ohAeiVMJZOrc3bFTKmankw  md5        None       3               1           True            1           1             1       1              2024-06-19 16:16:48.728660+00:00
3   X8vksZc7wQdpOjIMlDHZ  None     my RNA-seq dataset 2  None  .h5ad   dataset  AnnData   19240  L37UPl4IUH20HkIRzvlRMw  md5        None       3               1           True            1           1             1       1              2024-06-19 16:16:54.143474+00:00
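
The collection's 'var' feature set reported above is the union of the variables of its two artifacts. The following sketch shows that union, plus shard-by-shard iteration, in plain Python. The ordered_union and iter_observations helpers are hypothetical and say nothing about how collection.load() joins the underlying arrays.

```python
def ordered_union(*name_lists):
    """Merge lists of names, keeping first-seen order and dropping duplicates."""
    merged = {}
    for names in name_lists:
        for name in names:
            merged.setdefault(name, None)
    return list(merged)


def iter_observations(shards):
    """Yield (observation, shard_name) pairs one shard at a time."""
    for shard_name, observations in shards.items():
        for observation in observations:
            yield observation, shard_name


# Variables of the two datasets used in this guide.
var_dataset_1 = ["CD8A", "CD4", "CD14"]
var_dataset_2 = ["CD8A", "CD4", "CD38"]
collection_var = ordered_union(var_dataset_1, var_dataset_2)

# Observation names per artifact, keyed by artifact description.
shards = {
    "my RNA-seq": ["observation1", "observation2", "observation3"],
    "my RNA-seq dataset 2": ["observation4", "observation5", "observation6"],
}
n_observations = sum(1 for _ in iter_observations(shards))
print(collection_var, n_observations)
```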

Data loaders

# to train models, batch iterate through the collection as if it was one array
from torch.utils.data import DataLoader, WeightedRandomSampler
dataset = collection.mapped(obs_keys=["perturbation"])
sampler = WeightedRandomSampler(
    weights=dataset.get_label_weights("perturbation"), num_samples=len(dataset)
)
data_loader = DataLoader(dataset, batch_size=2, sampler=sampler)
for batch in data_loader:
    pass

Read this blog post for more on training models on sharded datasets.
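
Weighted sampling is typically used to balance classes during training. The sketch below shows one common weighting scheme, inverse label frequency, using only the standard library. It is an assumption that get_label_weights behaves like this; the helper name and the labels are hypothetical. Like WeightedRandomSampler with its default settings, the sketch draws with replacement.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Give each sample the weight 1 / (number of samples sharing its label)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Hypothetical, deliberately imbalanced labels.
labels = ["DMSO", "DMSO", "DMSO", "IFNG"]
weights = inverse_frequency_weights(labels)

# Draw indices with replacement, proportionally to the weights.
rng = random.Random(0)
sampled_indices = rng.choices(range(len(labels)), weights=weights, k=len(labels))
print(weights, sampled_indices)
```

With these weights, each label contributes the same total weight, so the rare label is drawn about as often as the frequent one.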

Data lineage

Save notebooks & scripts

If you call finish(), you save the run report, source code, and compute environment to your default storage location.

ln.finish()

See an example for this introductory notebook here.

If you want to cache a notebook or script, call:

lamin get https://lamin.ai/laminlabs/lamindata/transform/FPnfDtJz8qbE5zKv

Data lineage across entire projects

View the sequence of data transformations (Transform) in a project (from a use case, based on Schmidt et al., 2022):

transform.view_parents()

Or, the generating flow of an artifact:

artifact.view_lineage()

Both figures are based on mere calls to ln.track() in notebooks, pipelines & apps.

Distributed databases

Easily create & access databases

LaminDB is a distributed system like git. Similar to cloning a repository, collaborators can connect to your instance via:

ln.connect("account-handle/instance-name")

Or you load an instance on the command line for auto-connecting in a Python session:

lamin load "account-handle/instance-name"

Or you create a new instance:

lamin init --storage ./my-data-folder

Custom schemas and plugins

LaminDB can be customized & extended with schema & app plugins building on the Django ecosystem. Examples are:

  • bionty: Registries for basic biological entities, coupled to public ontologies.

  • wetlab: Exemplary custom schema to manage samples, treatments, etc.

If you’d like to create your own schema or app:

  1. Create a git repository with registries similar to wetlab

  2. Create & deploy migrations via lamin migrate create and lamin migrate deploy

It’s fastest if we do this for you based on our templates within an enterprise plan.

Design

Why?

The complexity of modern R&D data often blocks realizing the scientific progress it promises: see this blog post.

More basically: The pydata family of objects is at the heart of most data science, ML & comp bio workflows: DataFrame, AnnData, pytorch.DataLoader, zarr.Array, pyarrow.Table, xarray.Dataset, etc. We couldn’t find a tool that links these objects to their context so that they can be analyzed in context:

  • provenance: data sources, data transformations, models, users

  • domain knowledge & experimental metadata: the features & labels derived from domain entities

Assumptions

  1. Batched datasets from physical instruments are transformed (Transform) into useful representations (Artifact)

  2. Learning needs features (Feature, CellMarker, …) and labels (ULabel, CellLine, …)

  3. Insights connect representations to experimental metadata and knowledge (ontologies)

Schema & API

LaminDB provides a SQL schema for common entities: Artifact, Collection, Transform, Feature, ULabel, etc. See the API reference or the source code.

The core schema is extendable through plugins (see blue vs. red entities in graphic), e.g., with basic biological (Gene, Protein, CellLine, etc.) & operational entities (Biosample, Techsample, Treatment, etc.).

What is the schema language?

Data models are defined in Python using the Django ORM. Django translates them to SQL tables. Django is one of the most-used & highly-starred projects on GitHub (~1M dependents, ~73k stars) and has been robustly maintained for 15 years.

On top of the schema, LaminDB is a Python API that abstracts over storage & database access, data transformations, and (biological) ontologies.

Repositories

LaminDB and its plug-ins consist of open-source Python libraries & publicly hosted metadata assets.

LaminHub is not open-sourced.

Influences

LaminDB was influenced by many other projects, see Influences.