Manage a cell type registry#

Background#

Cell types classify cells based on public and private knowledge gained from studying gene expression patterns, morphology, function, and other properties.

Long-established cell types have known markers and properties, but cell subtypes and states are continuously being discovered and better understood as knowledge is refined.

In this notebook, we use CellTypist, a computational tool for cell type classification in scRNA-seq data. It assigns cell types based on gene expression profiles.

First, we create a cell type registry for cell types supported by CellTypist.

Then, we’ll use CellTypist to classify cell types of a previously unannotated dataset and ingest the dataset with LaminDB.

Finally, we will demonstrate how to fetch datasets with cell type queries using LaminDB.

Setup#

To run this notebook, you need to load a LaminDB instance that has the bionty schema mounted.

Here, we’ll create a test instance (skip if you’d like to run it using your instance):

!lamin init --storage ./celltypist --schema bionty
# Filter warnings from celltypist
import warnings

warnings.filterwarnings("ignore", message=".*The 'nopython' keyword.*")
import lamindb as ln
import lnschema_bionty as lb
import celltypist
import pandas as pd

ln.settings.verbosity = 3  # show hints
lb.settings.species = "human"  # globally set species
ln.track()

For a start, let’s take a look at the public Cell Ontology.

celltype_bt = lb.CellType.bionty()  # equivalent to bionty.CellType()
celltype_bt

Create an in-house registry of CellTypist terms based on the public Cell Ontology#

Fetch CellTypist’s immune cell encyclopedia#

As a first step, we read in CellTypist’s immune cell encyclopedia:

description = "CellTypist Pan Immune Atlas v2: basic cell type information"
celltypist_source_v2_url = "https://github.com/Teichlab/celltypist_wiki/raw/main/atlases/Pan_Immune_CellTypist/v2/tables/Basic_celltype_information.xlsx"

# our source data
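celltypist_file = ln.File.filter(description=description).one_or_none()

if celltypist_file is None:
    celltypist_df = pd.read_excel(celltypist_source_v2_url)
    # pass the description so that the filter above finds the file on later runs
    celltypist_file = ln.File(celltypist_df, description=description)
    celltypist_file.save()
else:
    # load the full table; calling .head() here would keep only the first 5 rows
    celltypist_df = celltypist_file.load()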
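
The cell above follows a get-or-create pattern: look for a stored file by its description and only download the table if nothing is found. Here is a minimal sketch of that idea with a plain dictionary standing in for the file registry. The names `_registry`, `get_or_create`, and `fake_fetch` are hypothetical and not part of LaminDB.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory stand-in for a file registry keyed by description.
_registry: Dict[str, List[dict]] = {}


def get_or_create(description: str, fetch: Callable[[], List[dict]]) -> List[dict]:
    """Return the stored table for `description`, fetching it only on first use."""
    if description not in _registry:
        _registry[description] = fetch()
    return _registry[description]


calls = []


def fake_fetch() -> List[dict]:
    """Pretend download; records how often it was invoked."""
    calls.append(1)
    return [{"name": "toy cell type"}]


first = get_or_create("toy table", fake_fetch)
second = get_or_create("toy table", fake_fetch)  # served from the registry
```

The lookup key has to be stored together with the data, otherwise every run fetches the table again.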

It provides an ontology_id of the public Cell Ontology for the majority of records.

celltypist_df.head()

The “Cell Ontology ID” is associated with multiple “Low-hierarchy cell types”:

celltypist_df.set_index(["Cell Ontology ID", "Low-hierarchy cell types"]).head(10)
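
To see which ontology IDs are shared by several names without reading the whole table, we can group names by ID. This is a small sketch on toy rows. The IDs and names are placeholders made up for illustration, not real CellTypist entries.

```python
from collections import defaultdict

# Toy rows of (ontology ID, low-hierarchy name); placeholders, not real data.
rows = [
    ("CL:EXAMPLE1", "foo cells"),
    ("CL:EXAMPLE1", "activated foo cells"),
    ("CL:EXAMPLE2", "bar cells"),
]

# Collect all names that point at the same ontology ID.
names_by_id = defaultdict(list)
for ontology_id, name in rows:
    names_by_id[ontology_id].append(name)

# Keep only the IDs that are associated with more than one name.
shared_ids = {k: v for k, v in names_by_id.items() if len(v) > 1}
```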

Inspect mapability with the public Cell Ontology#

For any cell type record that can be mapped against the public Cell Ontology, we’d like to ensure that it’s actually mapped.

This avoids referring to the same cell type with different identifiers.

Let’s see how well the CellTypist reference data can be mapped.

All ontology IDs provided by CellTypist are mappable to the public Cell Ontology:

celltype_bt.inspect(celltypist_df["Cell Ontology ID"], celltype_bt.ontology_id);

However, when inspecting the names, most of them don’t match:

celltype_bt.inspect(celltypist_df["Low-hierarchy cell types"], celltype_bt.name);
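
Conceptually, an inspection like this partitions the query values into those found in a reference field and those that are not. Here is a minimal sketch under the assumption of exact, case-sensitive matching. The helper `split_mapped` is hypothetical, and the actual `inspect` implementation may behave differently.

```python
def split_mapped(values, reference):
    """Partition `values` into those found in `reference` and those not found."""
    reference_set = set(reference)
    mapped = [v for v in values if v in reference_set]
    unmapped = [v for v in values if v not in reference_set]
    return mapped, unmapped


# Toy names for illustration only.
public_names = ["T cell", "B cell", "mast cell"]
query_names = ["T cells", "B cells", "mast cell"]

mapped, unmapped = split_mapped(query_names, public_names)
fraction_mapped = len(mapped) / len(query_names)
```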

A search tells us that terms named in the plural in CellTypist appear in the singular in the Cell Ontology:

celltypist_df["Low-hierarchy cell types"][0]
celltype_bt.search(celltypist_df["Low-hierarchy cell types"][0]).head()

Let’s strip the trailing "s" and check whether more names are mappable. Indeed, more names now match:

celltype_bt.inspect(
    [i.rstrip("s") for i in celltypist_df["Low-hierarchy cell types"]],
    celltype_bt.name,
);
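
One caveat: `str.rstrip("s")` removes every trailing "s", not just one. That is fine for names such as "T cells", but it would mangle a name that ends in a double "s". Here is a slightly more careful sketch. The helper `singularize` is hypothetical and still only a heuristic, not real pluralization handling.

```python
def singularize(name: str) -> str:
    """Drop a single trailing "s" unless the name ends in "ss"."""
    if name.endswith("s") and not name.endswith("ss"):
        return name[:-1]
    return name


plural_result = singularize("T cells")
double_s_result = singularize("glass")
rstrip_result = "glass".rstrip("s")  # rstrip removes both trailing "s"
```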

Every “low-hierarchy cell type” has an ontology ID and most “high-hierarchy cell types” also appear as “low-hierarchy cell types” in the CellTypist table. Four, however, don’t, and therefore don’t have an ontology ID.

high_terms = celltypist_df["High-hierarchy cell types"].unique()
low_terms = celltypist_df["Low-hierarchy cell types"].unique()

high_terms_umapped = set(high_terms).difference(low_terms)
high_terms_umapped

Register CellTypist records in LaminDB#

Let’s first add the “High-hierarchy cell types” as a column "parent".

This enables LaminDB to populate the parents and children fields, which will enable you to query for hierarchical relationships.

celltypist_df["parent"] = celltypist_df.pop("High-hierarchy cell types")

# if high and low terms are the same, no parents
celltypist_df.loc[
    (celltypist_df["parent"] == celltypist_df["Low-hierarchy cell types"]), "parent"
] = None

# rename columns, drop markers
celltypist_df.drop(columns=["Curated markers"], inplace=True)
celltypist_df.rename(
    columns={"Low-hierarchy cell types": "name", "Cell Ontology ID": "ontology_id"},
    inplace=True,
)
celltypist_df.columns = celltypist_df.columns.str.lower()
celltypist_df.head(2)
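
The parent rule above can be stated without pandas: a term whose high- and low-hierarchy names coincide is a top-level entry and gets no parent, which also avoids a record pointing at itself. Here is a sketch on toy rows that are for illustration only.

```python
# Toy rows pairing a low-hierarchy name with its high-hierarchy name.
rows = [
    {"low": "B cells", "high": "B cells"},
    {"low": "Memory B cells", "high": "B cells"},
]

# A term is never its own parent: identical names map to None.
parents = {
    row["low"]: (row["high"] if row["high"] != row["low"] else None)
    for row in rows
}
```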

Now, let’s create records from the public ontology:

public_records = lb.CellType.from_values(
    celltypist_df.ontology_id, lb.CellType.ontology_id
)

Let’s now amend public ontology records so that they maintain additional annotations that CellTypist might have.

records_names = {}
public_records_dict = {r.ontology_id: r for r in public_records}

for _, row in celltypist_df.iterrows():
    name = row["name"]
    ontology_id = row["ontology_id"]
    public_record = public_records_dict[ontology_id]

    # if both name and ontology_id match public record, use public record
    if name.lower() == public_record.name.lower():
        records_names[name] = public_record
        continue
    else:  # when ontology_id matches the public record and name doesn't match
        # if singular form of the CellTypist name matches public name
        if name.lower().rstrip("s") == public_record.name.lower():
            # add the CellTypist name to the synonyms of the public ontology record
            public_record.add_synonym(name)
            records_names[name] = public_record
            continue
        if public_record.synonyms is not None:
            synonyms = {s.lower() for s in public_record.synonyms.split("|")}
            # if any public synonym matches the CellTypist name or its singular form
            if synonyms & {name.lower(), name.lower().rstrip("s")}:
                # add the CellTypist name to the synonyms of the public ontology record
                public_record.add_synonym(name)
                records_names[name] = public_record
                continue

        # create a record based only on CellTypist metadata
        records_names[name] = lb.CellType(
            name=name, ontology_id=ontology_id, description=row.description
        )
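
The loop above applies a cascade of checks: exact name match, singular-form match, synonym match, and finally a custom record. Here is a self-contained sketch of that decision logic that returns a label instead of touching any registry. The function `match_kind` and the toy inputs are hypothetical; in particular, the synonym strings are illustrative and not copied from the Cell Ontology.

```python
from typing import Optional


def match_kind(name: str, public_name: str, public_synonyms: Optional[str]) -> str:
    """Classify how a CellTypist name relates to a public record."""
    lowered = name.lower()
    singular = lowered.rstrip("s")
    if lowered == public_name.lower():
        return "public"  # use the public record as is
    if singular == public_name.lower():
        return "public+synonym"  # plural form becomes a synonym
    if public_synonyms:
        synonyms = {s.lower() for s in public_synonyms.split("|")}
        if lowered in synonyms or singular in synonyms:
            return "public+synonym"  # already known under another name
    return "custom"  # record built from CellTypist metadata only


exact = match_kind("B cell", "B cell", None)
plural = match_kind("B cells", "B cell", None)
synonym = match_kind("GMP", "toy progenitor cell", "GMP|toy progenitor")
custom = match_kind("Age-associated B cells", "B cell", "B lymphocyte|B-cell")
```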

You can see that some records are created by adding the CellTypist name to the synonyms of the public record:

records_names["GMP"]

Other records are created based on CellTypist metadata:

records_names["Age-associated B cells"]

Let’s save them to our database:

records = set(records_names.values())

ln.save(records)

Add parent-child relationship of the records from CellTypist#

We still need to add the remaining 4 high-hierarchy terms:

list(high_terms_umapped)

Let’s get the top hits from a search:

for term in list(high_terms_umapped):
    print(f"Term: {term}")
    display(celltype_bt.search(term).head(1))

So we decide to:

  • Add “T cells” to the synonyms of the public “T cell” record

  • Create the remaining 3 terms only using their names (we think “B cell lineage” shouldn’t be identified with “B cell”)

for name in high_terms_umapped:
    if name == "T cells":
        record = lb.CellType.from_bionty(name="T cell")
        record.add_synonym(name)
        record.save()
    else:
        record = lb.CellType(name=name)
        record.save()
    records_names[name] = record

Now let’s add the parent records:

for _, row in celltypist_df.iterrows():
    record = records_names[row["name"]]
    if pd.notna(row["parent"]):
        parent_record = records_names[row["parent"]]
        record.parents.add(parent_record)
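
The lookup `records_names[row["parent"]]` raises a `KeyError` if a parent name was never registered, which is why the high-hierarchy terms had to be added first. Here is a small sketch of a pre-flight check that lists such gaps. The helper `missing_parents` and the rows are hypothetical.

```python
def missing_parents(rows, registered_names):
    """Return parent names referenced by rows that are absent from the registry."""
    wanted = {row["parent"] for row in rows if row["parent"] is not None}
    return sorted(wanted - set(registered_names))


# Toy rows for illustration only.
rows = [
    {"name": "Memory B cells", "parent": "B cells"},
    {"name": "Regulatory T cells", "parent": "T cells"},
    {"name": "B cells", "parent": None},
]
registered = {"Memory B cells", "Regulatory T cells", "B cells"}

gaps = missing_parents(rows, registered)
```

An empty result means every parent can be resolved before any relationship is written.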

Access the in-house CellType registry#

The previously added CellTypist ontology registry is now available in LaminDB. To retrieve the full ontology table as a pandas DataFrame, we can use .filter().df():

lb.CellType.filter().df()

This enables us to look for cell types by creating a lookup object from our new CellType registry.

db_lookup = lb.CellType.lookup()
db_lookup.memory_b_cell
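
Lookup attributes such as `memory_b_cell` are derived from record names. Here is a rough sketch of how a name could be turned into such a key. The helper `to_lookup_key` is hypothetical, and the normalization LaminDB actually applies may differ in its details.

```python
import re


def to_lookup_key(name: str) -> str:
    """Lowercase a name and replace runs of non-alphanumerics with underscores."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()


key_simple = to_lookup_key("memory B cell")
key_slash = to_lookup_key("Tcm/Naive helper T cells")
```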

Access parents of a record:

db_lookup.memory_b_cell.parents.all()
db_lookup.memory_b_cell.parents.all()[1].parents.all()
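
Chaining `.parents.all()` walks up one level at a time. To collect all ancestors at once, the same walk can be written as a loop. Here is a sketch on a plain dictionary, where the hierarchy is a toy example and `ancestors` is a hypothetical helper.

```python
def ancestors(term, parents_of):
    """Collect all transitive parents of `term`, each listed once."""
    seen = []
    stack = list(parents_of.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(parents_of.get(current, []))
    return seen


# Toy hierarchy for illustration only.
parents_of = {
    "memory B cell": ["B cell"],
    "B cell": ["lymphocyte"],
    "lymphocyte": [],
}

lineage = ancestors("memory B cell", parents_of)
```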

Annotate a dataset with cell types using CellTypist#

Annotate cell types predicted with CellTypist#

We now demonstrate how simple it is to predict cell types with CellTypist and add them to LaminDB. We use a sample dataset together with a sample model.

input_file = celltypist.samples.get_sample_csv()
input_file
predictions = celltypist.annotate(
    input_file, model="Immune_All_Low.pkl", majority_voting=True
)
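
With `majority_voting=True`, per-cell predictions are refined so that cells in the same cluster share the dominant label. Here is a minimal sketch of that idea, assuming cluster assignments are already given. CellTypist’s actual implementation, including how clusters are obtained, may differ, and `majority_vote` is a hypothetical helper.

```python
from collections import Counter


def majority_vote(labels, clusters):
    """Replace each cell's label with the most common label in its cluster."""
    counts_by_cluster = {}
    for label, cluster in zip(labels, clusters):
        counts_by_cluster.setdefault(cluster, Counter())[label] += 1
    winners = {
        cluster: counts.most_common(1)[0][0]
        for cluster, counts in counts_by_cluster.items()
    }
    return [winners[cluster] for cluster in clusters]


# Toy per-cell predictions and cluster assignments.
labels = ["B cells", "B cells", "T cells", "T cells", "T cells", "B cells"]
clusters = [0, 0, 0, 1, 1, 1]

voted = majority_vote(labels, clusters)
```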

Now that we’ve predicted all cell types, we create an AnnData object that we will eventually track with LaminDB.

adata_annotated = predictions.to_adata()
adata_annotated.obs

Create cell type records using the “predicted_labels” as names:

celltypes = lb.CellType.from_values(
    adata_annotated.obs.predicted_labels, lb.CellType.name
)
celltypes[:2]

Track the annotated dataset in LaminDB#

Create a file record of the AnnData object. We also set a description for the dataset, which can later be queried.

file_annotated = ln.File.from_anndata(
    adata_annotated, description="Exemplary CellTypist file", var_ref=lb.Gene.symbol
)
file_annotated.save()

Add cell types as labels:

file_annotated.features.add_labels(celltypes)
file_annotated.describe()

Now we can find the file by querying for a specific cell type:

ln.File.filter(cell_types=db_lookup.tcm_naive_helper_t_cells).df()

Or find the notebook in which the file was annotated by CellTypist:

ln.Transform.filter(files__description__icontains="CellTypist").df()

Try it yourself#

This notebook is available at laminlabs/lamin-usecases.

!lamin delete celltypist
!rm -r ./celltypist