Smart-seq2分析
2025-08-22 16:03:32,

1、概述

Smart-seq2是一种单细胞RNA测序技术,用于分析单个细胞的基因表达情况,并可以对单个细胞的基因表达进行分析。

2、基本原理

Smart-seq2利用了莫罗尼小鼠白血病病毒逆转录酶(MMLV-RT)的两个特性:
  1. 该逆转录酶在合成到C++DNA的3’端时会随机引入几个不依赖于模板的碱基,多数情况下是三个胞嘧啶。
  2. 逆转录酶是RNA指导的DNA聚合酶,以RNA为模板,以dNTP为底物进行反转录。

通过设计oliGo(dT)VN Primer作为逆转录引物,利用MMLVRT的模板转换活性,在cDNA的3’端添加一段接头序列,通过该接头序列进行反转录,生成cDNA第一条链。当逆转录酶到达mRNA5’末端时,会连续在末端添加几个胞嘧啶(C)残基。然后添加TSO(template-switching oliGo)引物,退火后结合在第一条链的3’端与poly(C)突出杂交,合成第二条链。这样得到的cDNA经过PCR扩增,然后再纯化后用于测序。

优势

  1. 能够得到全长cDNA,用于分析可变剪切等。
  2. 覆盖范围广,可检测到稀有转录本。
  3. 检测灵敏度高,起始量低,1-1000个细胞或10pg-10ng total RNA即可高效扩增。

局限性

  1. 只能分析带Poly(A)的RNA
  2. 不是链特异性的
Pipeline Features Description Source
Assay Type paired-end plate-based Smart-seq2  
Overall workflow Quality control module and transcriptome quantification module Code available from Github
Workflow language WDL openWDL
Genomic reference sequence GRCh38 human genome primary sequence GENCODE
Gene Model GENCODE v27 PRI GTF and Fasta files GENCODE
Aligner HISAT2 Kim, et al.,2015HISAT2 tool
QC Metrics determined using Picard command line tools Picard Tools
Estimation of gene Expression RSEM (rsem-calculate-expression) is used to estimate the gene expression profile. The input of RSEM is a bam file aligned by HISAT2. Li and Dewey, 2011
Data Input File Format File format in which sequencing data is provided FASTQ
Data Output File Format File formats in which Smart-seq2 pipeline output is provided BAM, Zarr version 2

3、数据预处理

首先,需要对原始测序数据进行质控和比对。质控可以使用FastQCMultiQC工具来检查数据质量。比对可以使用HISAT2工具,将测序数据比对到参考基因组1

# 质控 fastqc -t 6 -o ./fastqc_result ./RAW/SRR*fastq.gz multiqc ./fastqc_result
# 比对 hisat2 -p 10 -x genome_index -1 sample_1.fastq.gz -2 sample_2.fastq.gz -S output.sam samtools  sort -O bam -@ 10 -o output.bam output.sam samtools index output.bam

4、表达矩阵构建与seurat对象创建

使用featureCounts工具对比对后的BAM文件进行定量,生成基因表达矩阵2

featureCounts -T 10 -p -t exon -g gene_name -a annotation.gtf -o counts.txt *.bam   基本批量运行流程如下:
#!/bin/bash
# 检查是否安装了GNU Parallel
if ! command -v parallel &> /dev/null; then
    echo "GNU Parallel could not be found, please install it first."
    exit 1
fi

# 创建namelist文件,列出所有样本文件夹名称
ls -d */ > namelist


# 定义进一步处理RNA序列的函数
process_rna() {
    local sample=$1
    # 去除末尾的斜杠
    sample=$(echo "$sample" | sed 's:/*$::')
    echo "${sample} RNA processing start"
    
    if [ ! -d "${sample}" ]; then
        echo "Directory ${sample}does not exist, skipping."
        return
    fi
    
    cd "${sample}" || { echo "Failed to enter ${sample}"; continue; }
    
    source /data5/xxx/zengchuanj/Software/MACS3/MyPythonEnv/bin/activate
    trim_galore -j 20 --phred33 --gzip --trim-n -o result --paired *.fastq.gz
    
    cd result
    
    source /data5/tan/zengchuanj/conda/bin/activate
    conda activate HiC-Pro
    fastp -i "../${sample}_clean_R1.fastq.gz" -I "../${sample}_clean_R2.fastq.gz" -o "${sample}_clean_R1.fq.gz" -O "${sample}_clean_R2.fq.gz" -q 20 -w 16 -n 5
    fastqc -t 10 "${sample}_clean_R1.fq.gz" "${sample}_clean_R2.fq.gz"

    hisat2 -p 20 -x /data5/tan/zengchuanj/pipeline/HIC/juicer/references/mm10/mm10 -1 ${sample}_clean_R1_val_1.fq.gz -2 ${sample}_clean_R2_val_2.fq.gz 2>"${sample}_hisat.txt" | samtools view -o "${sample}_outname.bam"

    samtools sort -@ 20 -o ${sample}.sort.bam ${sample}_outname.bam
    ##featurecounts定量
    # 使用ensamble的GTF
    echo ${sample} 'Feature counts start'
    ### (ucsc的话 -t exon)
    # 根据gene与exon调整参数
    featureCounts -p --countReadPairs -T 20 -t gene -a /data5/xxx/zengchuanj/xxx/references/GTF/Mus_musculus.GRCm38.102.gtf -o ${sample}_count.txt ${sample}.sort.bam

    echo "${sample} RNA processing finish"
    cd ../../..
}

# 使用parallel并行处理每个样本,并限制最大线程数为15
export -f process_rna

# 并行进一步处理RNA序列
parallel -j 10 process_rna ::: $(cat namelist)
# 定义一个用于处理Smart-seq数据并创建Seurat对象的函数
process_smart_to_seurat_data <- function(file_path,project = "seurat_obj") {
  gc()
  library(clusterProfiler)
  library(org.Mm.eg.db)
  library(Seurat)
  # 读取数据
  result <- read.csv(file_path,header = T,check.names = FALSE)
  print(head(result))
  
  # 设置行名并移除gene列
  rownames(result) <- result$gene
  result <- result[,-1]
  
  # 转换基因ID
  hg <- bitr(result$gene, fromType="ENSEMBL", toType=c("ENTREZID","ENSEMBL","SYMBOL"), OrgDb="org.Mm.eg.db")
  num <- match(hg$ENSEMBL, result$gene)
  result$gene[num] <- hg$SYMBOL
  
  # 移除重复项并重组数据
  un <- unique(result$gene)
  ma <- match(un, result$gene)
  data <- result[ma, ]
  rownames(data) <- data$gene
  data <- data[,-1]
  
  print(head(data))
  # 用0填充
  data[is.na(data)] <- 0
  # 创建Seurat对象
  seurat_obj <- CreateSeuratObject(counts = data, project = project)
  return(seurat_obj)
}

# 仅仅只转换gene名
process_smartseq_data <- function(file_path) {
  gc()
  library(clusterProfiler)
  library(org.Mm.eg.db)
  library(Seurat)
  # 读取数据
  result <- read.csv(file_path,header = T,check.names = FALSE)
  print(head(result))
  
  # 设置行名并移除gene列
  rownames(result) <- result$gene
  result <- result[,-1]
  
  # 转换基因ID
  hg <- bitr(result$gene, fromType="ENSEMBL", toType=c("ENTREZID","ENSEMBL","SYMBOL"), OrgDb="org.Mm.eg.db")
  num <- match(hg$ENSEMBL, result$gene)
  result$gene[num] <- hg$SYMBOL
  
  # 移除重复项并重组数据
  un <- unique(result$gene)
  ma <- match(un, result$gene)
  data <- result[ma, ]
  rownames(data) <- data$gene
  data <- data[,-1]
  
  print(head(data))
  # 用0填充
  data[is.na(data)] <- 0
  return(data)
}

5、数据标准化和过滤

在进行下游分析之前,需要对数据进行标准化和过滤。可以使用Seurat包进行数据标准化和高变基因筛选1

library(Seurat) sce <- CreateSeuratObject(counts = counts_matrix, project = "sce_project", min.cells = 1, min.features = 0) sce <- NormalizeData(sce) sce <- FindVariableFeatures(sce, selection.method = "vst", nfeatures = 2000)

6、降维和聚类分析

使用PCA和tSNE进行数据降维,并使用Seurat包进行聚类分析1

sce <- ScaleData(sce, vars.to.regress = c('nUMI_raw')) sce <- RunPCA(sce, features = VariableFeatures(object = sce)) sce <- FindNeighbors(sce, dims = 1:20) sce <- FindClusters(sce, resolution = 0.3) sce <- RunTSNE(sce, dims = 1:9) DimPlot(sce, reduction = "tsne")

7、差异基因分析

使用monocle包进行差异基因分析,识别不同细胞群体之间的差异基因1

library(monocle) DE_sce <- prepare_for_DE(count_matrix, clustering, stages) DE_genes <- findDEgenes(DE_sce, qvalue = 0.05)

8、功能注释

对差异基因进行功能注释,使用clusterProfiler包进行GO富集分析3

library(clusterProfiler) entrez_genes <- bitr(de_genes$genes, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = "org.Mm.eg.db") formula_res <- compareCluster(ENTREZID ~ cluster, data = de_gene_clusters, fun = "enrichGO", OrgDb = "org.Mm.eg.db", ont = "BP", pAdjustMethod = "BH", pvalueCutoff = 0.01, qvalueCutoff = 0.05) dotplot(formula_res, showCategory = 5)