🛠️ Data processing

Overview

Before using Hydra, the datasets need to be preprocessed. The required input format is RDS/H5AD files (Reference and Query) containing raw single-cell count data in one of the following formats:

Seurat
Single-cell Experiment (SCE)
Anndata

Note

If you are using an AnnData object, please make sure the default data matrix (.X) contains raw counts.

Input data

Hydra supports both unimodal and multimodal single-cell omcis data. The supported modalities are:

Single-cell RNA sequencing (scRNA-seq)
Single-cell ATAC sequencing (scATAC-seq)
Single-cell ADT sequencing (scADT-seq)
scMultiome (Joint profiling of any of the above modalities - e.g., CITE-seq, TEA-seq, SHARE-seq, sci-CAR, and other similar technologies)

Processing scRNA-seq datasets

Below, we demonstrate the processing of two Lung scRNA-seq datasets. For simplicity, we will use only a subset of the data.

Example dataset

Lung (Madissoon et al and He et al)

wget -c -O "scRNA.zip" "https://huggingface.co/datasets/manojmw/Hydra_example_datasets/resolve/main/scRNA.zip"
unzip scRNA.zip

Process with Hydra

Datasets can be easily processed using the processdata setting of Hydra.

# Example command
hydra --setting processdata --modality [rna, adt or atac] --train [Path to the reference dataset] --test [Path to the query dataset]

Note

By default, Hydra assumes the label column in the reference dataset as cell_type. If this is not the case, please specify using the celltypecol argument. If you are only running feature selection, you can skip the query dataset.

scRNA-seq

hydra --setting processdata --modality rna --train scRNA/Lung_Madissoon.h5ad --test scRNA/Lung_He.h5ad

Thank you for using Hydra 😄, an interpretable deep generative tool for single-cell omics. Please refer to the full documentation available at https://sydneybiox.github.io/Hydra/ for detailed usage instructions. If you encounter any issues running the tool - Please open an issue on Github, and we will get back to you as soon as possible!!


===============================

Device to be used: CUDA 

===============================

INFO - 2025-08-10 07:13:14,402 - Starting to run 

INFO - 2025-08-10 07:13:14,402 - Processing datasets... 

[1] "Now processing train dataset..."
[1] "Now processing test dataset..."
[1] "Feature order is same in train and test..."
Processing train data...
                   group           name       otype dclass          dim
0                      /         matrix   H5I_GROUP                    
1                /matrix .data_dimnames   H5I_GROUP                    
2 /matrix/.data_dimnames              1 H5I_DATASET STRING         5950
3 /matrix/.data_dimnames              2 H5I_DATASET STRING        10728
4                /matrix       barcodes H5I_DATASET STRING         5950
5                /matrix           data H5I_DATASET  FLOAT 5950 x 10728
6                /matrix       features H5I_DATASET STRING        10728
                   group           name       otype dclass          dim
0                      /         matrix   H5I_GROUP                    
1                /matrix .data_dimnames   H5I_GROUP                    
2 /matrix/.data_dimnames              1 H5I_DATASET STRING         5950
3 /matrix/.data_dimnames              2 H5I_DATASET STRING        10728
4                /matrix       barcodes H5I_DATASET STRING         5950
5                /matrix           data H5I_DATASET  FLOAT 5950 x 10728
6                /matrix       features H5I_DATASET STRING        10728
Processing test data...
                   group           name       otype dclass          dim
0                      /         matrix   H5I_GROUP                    
1                /matrix .data_dimnames   H5I_GROUP                    
2 /matrix/.data_dimnames              1 H5I_DATASET STRING         6483
3 /matrix/.data_dimnames              2 H5I_DATASET STRING        10728
4                /matrix       barcodes H5I_DATASET STRING         6483
5                /matrix           data H5I_DATASET  FLOAT 6483 x 10728
6                /matrix       features H5I_DATASET STRING        10728
INFO - 2024-07-25 16:54:40,967 - Completed successfully!

The processed files will be stored in the directory - Input_Processed.

Processing scMultiome datasets

Below, we demonstrate the processing of a SHARE-seq dataset by Ma et al that jointly profiles single-cell transcriptomic (RNA) and chomatin accessibility (ATAC) profiles.

Note

If using scMultiome, the Reference–Query files of each modality need to be processed separately (Refer below)

Example dataset

Skin (Ma et al)

wget -c -O "scMultiome.zip" "https://huggingface.co/datasets/manojmw/Hydra_example_datasets/resolve/main/scMultiome.zip"
unzip scMultiome.zip

Process with Hydra

scRNA-seq

hydra --setting processdata --modality rna --train scMultiome/Ma_Skin_scRNA_train.h5ad --test scMultiome/Ma_Skin_scRNA_test.h5ad

Thank you for using Hydra 😄, an interpretable deep generative tool for single-cell omics. Please refer to the full documentation available at https://sydneybiox.github.io/Hydra/ for detailed usage instructions. If you encounter any issues running the tool - Please open an issue on Github, and we will get back to you as soon as possible!!


===============================

Device to be used: CUDA 

===============================

INFO - 2025-08-12 05:24:58,529 - Starting to run 

INFO - 2025-08-12 05:24:58,529 - Processing datasets... 

[1] "Now processing train dataset..."
[1] "Now processing test dataset..."
[1] "Feature order is same in train and test..."
Processing train data...
                   group           name       otype dclass         dim
0                      /         matrix   H5I_GROUP                   
1                /matrix .data_dimnames   H5I_GROUP                   
2 /matrix/.data_dimnames              1 H5I_DATASET STRING        4463
3 /matrix/.data_dimnames              2 H5I_DATASET STRING        8628
4                /matrix       barcodes H5I_DATASET STRING        4463
5                /matrix           data H5I_DATASET  FLOAT 4463 x 8628
6                /matrix       features H5I_DATASET STRING        8628
                   group           name       otype dclass         dim
0                      /         matrix   H5I_GROUP                   
1                /matrix .data_dimnames   H5I_GROUP                   
2 /matrix/.data_dimnames              1 H5I_DATASET STRING        4463
3 /matrix/.data_dimnames              2 H5I_DATASET STRING        8628
4                /matrix       barcodes H5I_DATASET STRING        4463
5                /matrix           data H5I_DATASET  FLOAT 4463 x 8628
6                /matrix       features H5I_DATASET STRING        8628
Processing test data...
                   group           name       otype dclass          dim
0                      /         matrix   H5I_GROUP                    
1                /matrix .data_dimnames   H5I_GROUP                    
2 /matrix/.data_dimnames              1 H5I_DATASET STRING        23597
3 /matrix/.data_dimnames              2 H5I_DATASET STRING         8628
4                /matrix       barcodes H5I_DATASET STRING        23597
5                /matrix           data H5I_DATASET  FLOAT 23597 x 8628
6                /matrix       features H5I_DATASET STRING         8628
INFO - 2025-08-12 05:26:28,556 - Completed successfully!

scATAC-seq

Note

Hydra requires gene activity scores for scATAC data

hydra --setting processdata --modality atac --train scMultiome/Ma_Skin_scATAC_train.h5ad --test scMultiome/Ma_Skin_scATAC_test.h5ad

Thank you for using Hydra 😄, an interpretable deep generative tool for single-cell omics. Please refer to the full documentation available at https://sydneybiox.github.io/Hydra/ for detailed usage instructions. If you encounter any issues running the tool - Please open an issue on Github, and we will get back to you as soon as possible!!


===============================

Device to be used: CUDA 

===============================

INFO - 2025-08-12 05:29:34,072 - Starting to run 

INFO - 2025-08-12 05:29:34,072 - Processing datasets... 

[1] "Now processing train dataset..."
[1] "Now processing test dataset..."
[1] "Feature order is same in train and test..."
Processing train data...
                   group           name       otype dclass          dim
0                      /         matrix   H5I_GROUP                    
1                /matrix .data_dimnames   H5I_GROUP                    
2 /matrix/.data_dimnames              1 H5I_DATASET STRING         4463
3 /matrix/.data_dimnames              2 H5I_DATASET STRING        13755
4                /matrix       barcodes H5I_DATASET STRING         4463
5                /matrix           data H5I_DATASET  FLOAT 4463 x 13755
6                /matrix       features H5I_DATASET STRING        13755
                   group           name       otype dclass          dim
0                      /         matrix   H5I_GROUP                    
1                /matrix .data_dimnames   H5I_GROUP                    
2 /matrix/.data_dimnames              1 H5I_DATASET STRING         4463
3 /matrix/.data_dimnames              2 H5I_DATASET STRING        13755
4                /matrix       barcodes H5I_DATASET STRING         4463
5                /matrix           data H5I_DATASET  FLOAT 4463 x 13755
6                /matrix       features H5I_DATASET STRING        13755
Processing test data...
                   group           name       otype dclass           dim
0                      /         matrix   H5I_GROUP                     
1                /matrix .data_dimnames   H5I_GROUP                     
2 /matrix/.data_dimnames              1 H5I_DATASET STRING         23597
3 /matrix/.data_dimnames              2 H5I_DATASET STRING         13755
4                /matrix       barcodes H5I_DATASET STRING         23597
5                /matrix           data H5I_DATASET  FLOAT 23597 x 13755
6                /matrix       features H5I_DATASET STRING         13755
Warning message:
In write_csv(cellType_list = list(ct = cty), csv_list = c(glue("{split_folder}/ct_train.csv"))) :
  csv_list exists! will rewrite it.
INFO - 2025-08-12 05:32:17,150 - Completed successfully!

The processed files will be stored in the directory - Input_Processed

Documentation by Manoj M Wagle