SpaceTrooper, an R package for the preprocessing and quality control of imaging-based spatial transcriptomics data
Author(s): Dario Righelli, Benedetta Banzi, Oriana Romano, Mattia Forcato, Silvio Bicciato, Davide Risso
Affiliation(s): Department of Electrical Engineering and Information Technology, University of Naples Federico II
Emerging technologies in spatially resolved single-cell omics offer high-throughput solutions for measuring the molecular characteristics of cells in situ. Single-cell spatial profiling combines highly multiplexed imaging modalities, next-generation sequencing, or mass spectrometry to assess the spatial distribution of gene and protein expression within the native context of cells. The rapid growth of these techniques requires novel computational methods to analyze the massive amounts of data they produce. Several computational pipelines have been proposed for preprocessing, quality control, and downstream analysis of spatial omics data. However, they primarily rely on methods borrowed from the single-cell RNA-seq literature, with geospatial features only marginally considered in exploratory steps. While this is reasonable for sequencing-based platforms, the fundamentally different nature of imaging-based spatial transcriptomics calls for bespoke methods. In particular, segmentation is a crucial aspect of imaging-based spatial data, as it produces the spatial entities (i.e., the polygons representing cell boundaries) within which transcripts are aggregated to produce cell-level count matrices. Although errors in cell segmentation are relatively common, they are still evaluated through visual examination only, and, to date, simple metrics for assessing the quality of automatically computed cell boundaries are lacking. Furthermore, while some quality control metrics can be derived from specific signals (e.g., probe counts, number of detected features, counts of negative control probes) or from morphological characteristics of the cells (e.g., cell area, cell aspect ratio), it remains unclear how to combine them effectively to flag or remove low-quality cells.
To address these challenges, we introduce SpaceTrooper, an R package specifically designed for the preprocessing and quality control of spatial transcriptomics data obtained from imaging-based technologies, such as NanoString CosMx SMI, 10x Genomics Xenium In Situ, and Vizgen MERSCOPE. The core framework of SpaceTrooper is built around the available Bioconductor data structures for spatial transcriptomics data and the rich geospatial R package ecosystem (e.g., sf, terra), which enables the generation of cell geometries directly from image and shape files in various formats (e.g., TIFF, HDF5, parquet). This approach allows for effective quality control that considers not only the cell-level transcriptomic tabular data but also the spatial information derived from cell geometries. Initially, SpaceTrooper applies a coarse cell-flagging strategy that uses a statistical test to detect outliers. This step provides an initial filter that removes poorly captured cells and obvious artifacts resulting from the experimental procedures. Subsequently, the data undergo a more detailed cleaning process based on a score that combines several metrics through a sigmoid transformation. Both filters take advantage of the dual nature of spatial transcriptomics assays, which yield both expression measurements and cell geometries. We tested SpaceTrooper on various public datasets obtained using the main commercially available imaging-based technologies, covering four tissue types (breast cancer, lung cancer, healthy brain, and liver) from both human and mouse. Unlike other computational pipelines, SpaceTrooper identifies low-quality data by jointly considering probe expression, cell morphological characteristics, unrealistic cell polygons, and boundary effects stemming from inherent technical factors.
Finally, SpaceTrooper provides export methods to facilitate seamless integration with external visualization tools (e.g., napari) and with other pipelines available for downstream analysis in R and Python.
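
The two-stage filtering described above (a coarse outlier flag followed by a sigmoid-based score) can be illustrated with a minimal, language-agnostic sketch. This is not SpaceTrooper's API or its actual method: the abstract does not name the statistical test or the score formula, so the sketch assumes a median absolute deviation (MAD) rule for the outlier step and a logistic transform of a weighted sum of metrics for the score. The function names, the threshold of 3 MADs, and the weights are all hypothetical.

```python
import math
from statistics import median


def mad_outlier_flags(values, n_mads=3.0):
    """Flag values lying more than n_mads scaled MADs from the median.

    Assumption: a MAD rule stands in for the unspecified statistical test.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    if scale == 0:
        # No spread in the data: nothing can be called an outlier.
        return [False] * len(values)
    return [abs(v - med) > n_mads * scale for v in values]


def sigmoid(x):
    """Numerically stable logistic function mapping any real to (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def quality_score(metrics, weights, intercept=0.0):
    """Combine per-cell metrics into one score in (0, 1).

    Assumption: a weighted sum passed through a sigmoid; the weights
    here are illustrative, not fitted or taken from the package.
    """
    linear = intercept + sum(w * m for w, m in zip(weights, metrics))
    return sigmoid(linear)


# Hypothetical total probe counts for seven cells; the fifth is nearly empty.
counts = [120, 135, 110, 128, 2, 140, 125]
flags = mad_outlier_flags(counts)

# Hypothetical standardized metrics for one cell:
# (log-counts z-score, area z-score, distance-to-border z-score).
score = quality_score([0.4, -0.2, 1.1], weights=[1.0, 0.5, 0.5])
```

In this sketch the coarse flag only removes extreme cells on a single metric, while the score places every remaining cell on a common 0 to 1 scale so that a single cutoff can be applied across metrics of different units.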
