In our current single-cell analysis pipeline based on scanpy, there’s a step that requires saving AnnData objects in loom format. However, unlike saving to h5ad format, when we write an AnnData object to a loom file without any special handling and then read it back, we find that the index information of obs and var (typically cell barcodes and gene names) is lost, and these indices become ordinary numeric identifiers.
Problem Description
Based on actual testing and the documentation of the write_loom method, it can be inferred that during actual read/write operations, write_loom does not save the index portions of the obs and var tables by default. Even when specifying the parameter write_obsm_varm=True, the writing process only creates two new columns with fixed names (obs_names and var_names).
According to the read_loom method documentation, when reading, read_loom looks for a column named CellID in obs and a column named Gene in var to use as indices for the respective tables.
This creates a mismatch: by default, indices aren’t saved, and even if they are saved, what write_loom writes and what read_loom reads don’t align - it’s like two departments failing to coordinate properly, resulting in the loss of critical information…
Solution
My goal is to make minimal changes to the original pipeline code, so directly generating loom files that the original pipeline can read is the best approach:
1 | # Save index information to columns in obs and var |