class: inverse, middle, center ## The molecular mechanisms of epigenetic inheritance ### .orange[Computational analysis of multi-omics data] ## .blue1[Progress report .yellow[`#`12]] ### Deepak Tanwar #### 22<sup>nd</sup> June 2021 --- class: inverse, middle, center
# `shortRNA` ## .orange[A flexible framework for the analysis of short RNA sequencing data] .Large[Deepak Tanwar `&` Dr. Pierre-Luc Germain] --- class: hide_logo # Table of Contents .Large[ [Introduction](#) [Shortcomings of current tools](#) [Overview of the project](#) [Annotations and hierarchical organization](#) [Data analysis pipeline](#) [Next steps](#) [Summary](#) ] --- class: left, hide_logo # .center[.blue[Small RNAs]] 20 - 300 nt length ??? **The first small RNA:** [Lee <i>et. al.</i> 1993; Cell; 1993](https://doi.org/10.1016/0092-8674(93)90529-Y): _A non-coding gene in C. elegans_ -- .content-box-blue[ - MicroRNA (**miRNA**): 18 - 24 nt - Small interfering RNA (**siRNA**): 21 - 27 nt - Small nucleolar RNA (**snoRNA**): 60 - 170 nt - Small nuclear RNA (**snRNA**/ **U-RNA**): 100 - 300 nt - Small temporal RNA (**stRNA**): 18 - 24 nt - tRNA-derived small RNA (**tsRNA**): tRNA halves (**tsRNA**: 28 - 36 nt) & tRNA fragments (**tRFs**: 14 - 22 nt) - Small rDNA-derived RNA (**srRNA**): 18 - 30 nt - Repeat-associated small interfering RNA (**rasiRNA**): 24 - 27 nt - Piwi-interacting RNA (**piRNA**): 26 - 31 nt: _Only present in animals (germline cells)_ ] --- class: left, hide_logo # .center[.blue[Shortcomings of existing tools]] #### Methods concentrate on a specific type of short RNA: - MINTmap (tRFs) - TAM Prost!, Chimira, mirTools (miRNA) -- #### A lot of reads are left unassigned -- #### Multi mapping of reads -- #### Mis-assignment of reads -- #### Not "proper" priortization for reads assignment -- #### Do not deal adequately with post-transcriptional modifications -- #### Normalization problems ??? Different RNAs are of different lengths -- #### Operating system dependencies --- class: left # .center[.blue[Overview]] <u>Flexible</u> <u>standalone</u> framework for short RNA-Seq analysis. (**Everything in `R`**) -- Data preprocessing (QC, trimming, adapter detection) within R and extremely fast -- Alignment also within R and extremely fast -- `R` could be slow. Works with data structure which is faster to process with `R` -- Optimized towards <u>all known short RNA</u> types (e.g. miRNA, piRNA, rasiRNA, siRNA, snoRNA, tsRNA, tRFs, srRNA and U-RNA) -- Automatically processing of <u>features sequences and annotations</u> (currenly only for mouse: **mm10** & **mm39**) -- Adequately deal with <u>post-transcriptional modifications</u>. -- Potentially detect <u>unannotated features</u>. -- Account for the <u>hierarchical organization</u> of the features. --- class: left # .center[.blue[Databases for mouse genome]] -- ### tRNAs [GtRNAdb](http://gtrnadb.ucsc.edu/) for tRNAs [mitotRNAdb](http://mttrna.bioinf.uni-leipzig.de/mtDataOutput/) for mitochondrial tRNA genes -- ### miRNAs [miRBase](http://www.mirbase.org/) [Ensembl](http://www.ensembl.org/index.html) (Ensembl miRNAs overlapping with miRBase are removed) - If length > 25bp: **precursors** --- class: center background-image: url("files/p1/13.png") background-size: contain background-position: 50% 90% # .blue[Length of miRNAs from Ensembl] --- class: left # .center[.blue[Databases for mouse genome]] ### tRNAs [GtRNAdb](http://gtrnadb.ucsc.edu/) for tRNAs [mitotRNAdb](http://mttrna.bioinf.uni-leipzig.de/mtDataOutput/) for mitochondrial tRNA genes ### miRNAs [miRBase](http://www.mirbase.org/) [Ensembl](http://www.ensembl.org/index.html) (Ensembl miRNAs overlapping with miRBase are removed) - If length > 25bp: **precursors** ### rRNAs [SILVA](https://www.arb-silva.de/): both, large and sub-units --- class: left # .center[.blue[Databases for mouse genome]] ### piRNA precursors [Zamore Lab](https://www.zamorelab.umassmed.edu/) -- ### Long RNAs [Ensembl](http://www.ensembl.org/index.html) --- class: left # .center[.blue[tRNAs]] tRNAs ID from GtRNAdb: **tRNA-Ala-AGC-4-1** --- class: left background-image: url("files/p1/11.png") background-size: contain background-position: 50% 55% # .center[.blue[tRNAs]] tRNAs ID from GtRNAdb: **tRNA-Ala-AGC-4-1** tRNAs ID from mitotRNAdb: -- <br><br><br><br><br><br><br><br><br><br><br><br> We remove duplicated sequences -- Convert the IDs: .Large[.red[**mt_tRNA-Ala-TGC-1-1**]] --- background-image: url("https://iubmb.onlinelibrary.wiley.com/cms/asset/a6089ebf-40ba-4fb2-b894-0f731768f941/iub2041-fig-0001-m.jpg") background-size: 100% background-position: 50% -180% class: left # .center[.blue[tRNAs]] ### Post-transcriptional modifications (CCA addition at 3') <br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br> .right[[Barraud <i>et. al. </i>; IUBMB Life; 2019](https://doi.org/10.1002/iub.2041)] --- class: left # .center[.blue[tRNAs]] ### Post-transcriptional modifications (G addition at 5' of Histidine) *"All histidine tRNAs of known sequence are one nucleotide longer at the 5' end than are other tRNA species"* [Cooley <i>et. al. </i>; PNAS; 1982](https://doi.org/10.1073/pnas.79.21.6475) -- #### Other methods used this approach [Cozen <i>et. al. </i>; Nat Methods; 2015](https://doi.org/10.1038/nmeth.3508) [Shi <i>et. al. </i>; Genomics Proteomics Bioinformatics; 2018](https://doi.org/10.1016/j.gpb.2018.04.004) --- class: center # .center[.blue[tRNAs]] #### .left[tRNA]
-- #### .left[Mt tRNA]
--- class: left # .center[.blue[miRNAs]] ### miRBase **miRNAs** and **miRNA precursors** -- | | precursor | miRNA | |-|-|-| | **ID format**| .red[miR/let]-.blue[1[234][a]]-.green[[1-9]] | .red[miR/let]-.blue[1[234][a]]-.green[[1-9]]-.orange[[5/3]p] | | **Example 1** | .red[let]-.blue[7a]-.green[1] | .red[let]-.blue[7a]-.green[1]-.orange[3p] | | **Example 2** | .red[mir]-.blue[7a]-.green[1] | .red[mir]-.blue[7a]-.green[1]-.orange[3p] | -- ### Ensembl **ID format:** .green[Mir3070b] We convert it to .red[mir]-.blue[3070b] --- class: left # .center[.blue[miRNAs]] ### miRNA clusters -- - set of two or more miRNAs, which are transcribed from physically adjacent miRNA genes -- - Mostly 2-3 miRNAs per cluster -- - miR-17to miR-92 cluster is found on human chromosome 13 -- - Distance between miRNAs: - **10kb:** [Griffiths-Jones <i>et. al.</i>; Nucleic Acids Res.; 2008](https://doi.org/10.1093/nar/gkm952) - **10kb:** [Baumgart & Barth <i>et. al.</i>; BMC Genomics 2917](https://doi.org/10.1186/s12864-017-3951-8) - **50kb:** [Yuan <i>et. al.</i>; BMC Systems Biology; 2009](https://doi.org/10.1186/1752-0509-3-65) -- **.green[We choose **10kb**] because:** - miRBase authors used this - "more than 40% of experimentally validated human miRNA cluster genes have been identified within 10kb" --- # .center[.blue[miRNAs]]
--- class: center background-image: url("files/p1/01.png") background-size: contain background-position: 50% 68% # .blue[Hierarchical organization] --- class: center background-image: url("files/p1/02.png") background-size: contain background-position: 50% 70% # .blue[Hierarchical organization] --- layout: true # .blue[Reads assignment problem] --- class: center background-image: url("files/p1/03.png") background-size: 130% background-position: 50% 50% --- class: center background-image: url("files/p1/04.png") background-size: 130% background-position: 50% 50% --- layout: false # .center[.blue[Priorities for read assignment]] #### High priority types .content-box-green[ - miRNA - tRNA - tRNAp - Mt_tRNA - snRNA - snoRNA - antisense - piRNA_precursor ] #### Low priority types .content-box-blue[ - precursor - longRNA ] ---
# .center[.blue[Working with `shortRNA`]] -- ### Data Keep your `fastq` files in a directory -- ### Load `shortRNA` ```r devtools::load_all("/mnt/IM/reference/shortRNA") ``` ``` ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
## ## ##
## ## ## # .center[.blue[Working with `shortRNA`]] ## ## -- ## ## ### Data ## ## Keep your `fastq` files in a directory ## ## -- ## ## ### Load `shortRNA` ## ## ## ```r ## devtools::load_all("/mnt/IM/reference/shortRNA") ## ``` ``` --- # .center[.blue[Working with `shortRNA`]] ### Path to custom genome and annotation ```r genome <- "/mnt/IM/reference/shortRNA_genome/customGenome" anno <- "/mnt/IM/reference/shortRNA_genome/features.rds" ``` --- # .center[.blue[QC with `shortRNA`]] ```r ## Adapter sequence ad <- "GTGTCAGTCACTTCCAGCGG" ## fastq file fq1 <- system.file("extdata", "Fox3_Std_small.fq.gz", package = "Rfastp") ## output directory outDir <- tempdir() ## Results res <- qcFastq(file = fq1, outDir = outDir, adapterSeq = ad) ## Trimmed fastq fq2 <- list.files(outDir, "gz", full.names = TRUE) ``` --- # .center[.blue[QC with `shortRNA`]] ### General stats ```r makeGeneralDF(res$json) %>% knitr::kable(col.names = NULL) ``` | | | |:----------------------------|:--------------------------------------------------| |Rfastp version |1.2.0 | |Sequencing |single-end (51 cycles) | |Mean length before filtering |51 | |Mean length after filtering |44 | |Duplication rate |1.12% (may be overestimated since this is SE data) | --- # .center[.blue[QC with `shortRNA`]] .pull-left[ ### Before QC stats ```r makeBeforeFiltDF(res$json) %>% knitr::kable(col.names = NULL) ``` | | | |:-----------|:-----------------| |Total reads |10000 | |Total bases |510000 | |Q20 bases |475322 (93.2004%) | |Q30 bases |449228 (88.0839%) | |GC content |48.27% | ] -- .pull-right[ ### After QC stats ```r makeAfterFiltDF(res$json) %>% knitr::kable(col.names = NULL) ``` | | | |:-----------|:-----------------| |Total reads |9607 | |Total bases |427808 | |Q20 bases |418916 (97.9215%) | |Q30 bases |398004 (93.0333%) | |GC content |47.2% | ] --- # .center[.blue[QC with `shortRNA`]] ### Filtering information ```r makeFiltDF(res$json) %>% knitr::kable(col.names = NULL) ``` | | | |:----------------------|:-------------| |Reads passed filters |9607 (96.07%) | |Reads with low quality |51 (0.51%) | |Reads with too many N |1 (0.01%) | |Reads too short |341 (3.41%) | |Reads too long |0 (96.07%) | --- # .center[.blue[QC with `shortRNA`]] ### Reads duplication plot ```r readDuplicationPlot(res$json) ```
--- # .center[.blue[QC with `shortRNA`]] ### Read quality score plot ```r readsQualityPlot(res$json) ```
--- # .center[.blue[QC with `shortRNA`]] ### Base content line plot ```r baseRatioLinePlot(res$json) ```
--- # .center[.blue[QC with `shortRNA`]] ### Base content ratio proportion plot ```r baseRatioProportionPlot(res$json) ```
--- # .center[.blue[QC with `shortRNA`]] ### Read length distribution plot ```r p1 <- readLengthHist(fq1) ``` ``` ## [fastqq] File ( 1/1) '/Library/Frameworks/R.framework/Versions/4.1/Resources/library/Rfastp/extdata/Fox3_Std_small.fq.gz' done. ``` ```r p2 <- readLengthHist(fq2, col = "blue", name = "Trimmed") ``` ``` ## [fastqq] File ( 1/1) '/var/folders/6c/b7hqz5hj2xb99m3bg1kyykk40000gp/T//RtmpymouKi/Fox3_Std_small_R1.fastq.gz' done. ``` --- # .center[.blue[QC with `shortRNA`]] ### Read length distribution plot ```r subplot(p1, p2, shareY = TRUE) ```
--- class: hide_logo background-image: url("files/p1/14.png") background-size: 60% background-position: 50% 100% # .center[.blue[Benchmark of `fastq` files import tools]] --- class: hide_logo background-image: url("files/p1/15.png") background-size: 60% background-position: 50% 100% # .center[.blue[Benchmark of `fastq` files import tools]] --- # .center[.blue[Annotation and genome index with `shortRNA`]] -- ### Obtaining annotation and sequences for mouse genome ```r db_mmu <- getDB( species = "mmu", genomeVersion = "GRCm38", tRNA_addCCA = TRUE, tRNA_includeMt = TRUE ) ``` --- # .center[.blue[Annotation and genome index with `shortRNA`]] ### Building annotations and custom genome index ```r mm10_annoprep <- prepareAnnotation( ensdb = db_mmu$ensdb, genome = "genome.fa", output_dir = "genome", extra.gr = list(piRNA = db_mmu$piRNA_GR, miRNA = db_mmu$miRNA_GR), extra.seqs = list(rRNA = db_mmu$rRNA_fa, tRNA = db_mmu$tRNA_fa), resolveSplicing = NULL, rules = defaultAssignRules(), tRNAEnsembleRemove = TRUE, clusterMiRNA = TRUE ) ``` --- # .center[.blue[Alignment with `shortRNA`]] ### Sequence by sample count matrix ```r m <- fastq2SeqCountMatrix(files = fq_files) ``` -- ### Save sequences in fasta format for alignment ```r fa <- DNAStringSet(row.names(m)) names(fa) <- paste0("S", 1:length(fa)) writeXStringSet(fa, fasta_file) ``` --- # .center[.blue[Alignment with `shortRNA`]] ### Perform alignment ```r alignShortRNA( fastq = "fasta_file", index = "genome/customGenome", outDir = "alignedDir" ) ``` --- # .center[.blue[Overlap between alignment and annotation, and reads assignment with `shortRNA`]] ```r o <- overlapWithTx2( bamFile = align_file, annotation = anno, ignoreStrand = TRUE ) ``` ```r ar <- assignReads(sources = o, rules = defaultAssignRules()) ``` --- # Customizable assignment rules ### General ```git > defaultAssignRules() $overlapBy [1] 0.5 $prioritizeByOverlapSize [1] FALSE $sameStrand [1] "prioritize" $prioritizeKnown [1] TRUE ``` --- # Customizable assignment rules ### Primary piRNAs ```git $typeValidation $typeValidation$primary_piRNA $typeValidation$primary_piRNA$fun function(src, allowRevComp=FALSE, length=26:32){ length <- as.integer(length) valid <- src$length >= min(length) & src$length <= max(length) if(length(w <- which(valid))==0) return(valid) seqs <- as.character(src$seq[w]) valid[w] <- sapply(strsplit(seqs,""), FUN=function(x){ x[[1]]=="T" }) if(allowRevComp){ revcomp <- as.character(reverseComplement(DNAStringSet(seqs))) valid[w] <- valid[w] | sapply(strsplit(seqs,""),FUN=function(x){ x[[1]]=="T" }) } valid } $typeValidation$primary_piRNA$fallback [1] "piRNA_precursor" ``` --- # Customizable assignment rules ### Secondary piRNAs ```git $typeValidation$secondary_piRNA $typeValidation$secondary_piRNA$fun function(src, allowRevComp=FALSE, length=26:32){ length <- as.integer(length) valid <- src$length >= min(length) & src$length <= max(length) if(length(w <- which(valid))==0) return(valid) seqs <- as.character(src$seq[w]) valid[w] <- sapply(strsplit(seqs,""), FUN=function(x){ x[[1]]=="T" }) if(allowRevComp){ revcomp <- as.character(reverseComplement(DNAStringSet(seqs))) valid[w] <- valid[w] | sapply(strsplit(seqs,""), FUN=function(x){ x[[10]]=="A" }) } valid } $typeValidation$secondary_piRNA$fallback [1] "piRNA_precursor" ``` --- class: center, hide_logo background-image: url("https://ars.els-cdn.com/content/image/1-s2.0-S0092867407002577-gr6_lrg.jpg") background-size: contain background-position: 50% 10% <br><br><br><br><br><br><br><br><br><br><br> [Brennecke <i>et. al.</i>; Cell; 2007](https://doi.org/10.1016/j.cell.2007.01.043) --- # Customizable assignment rules ### miRNAs ```git $typeValidation$miRNA $typeValidation$miRNA$fun function(src, length=19:24, minOverlap=16L, maxNonOverlap=3L){ src$length %in% length && src$overlap >= minOverlap && (src$length - src$overlap) <= maxNonOverlap } $typeValidation$miRNA$length [1] 19 20 21 22 23 24 $typeValidation$miRNA$fallback [1] "miRNA_precursor" ``` --- # Customizable assignment rules ### tRNAs ```git $reclassify $reclassify$tRNA function(srcs, rules=list( "tRNA_internal_fragment"=function(x){ rep(TRUE, nrow(x)) }, "tRNA_5p_fragment"=function(x) { x$startInFeature < 5L & x$length < 30L }, "tRNA_3p_fragment"=function(x) { x$distanceToFeatureEnd < 5L & x$length < 50L }, "tRNA_5p_half"=function(x) { x$startInFeature %in% -1:1 & x$length %in% 30:34 }, "tRNA_3p_half"=function(x) { x$distanceToFeatureEnd %in% -1:1 & x$length >= 34L & x$length <= 50L & grepl("CCA$",x$seq) } )){ valids <- vapply(rules, FUN.VALUE=logical(nrow(srcs)), FUN=function(fn){ fn(srcs) }) factor(apply(valids, 1, FUN=function(x) max(which(x))), seq_len(ncol(valids)), colnames(valids)) } ``` --- # Customizable assignment rules ### Priorities between RNA types ```git $priorities miRNA tRNA tRNAp Mt_tRNA 1 1 1 1 snRNA snoRNA antisense primary_piRNA 1 1 1 1 secondary_piRNA precursor long_RNA longRNA 1 -1 -1 -1 ``` --- # .center[.blue[Building tree and TSE with `shortRNA`]] ### Convert annotations to factorList ```r fl <- featuresAnnoToFL(anno) ``` -- ### Tree construction ```r ar_tree <- addReadsToTree(fL = fl, mappedFeaturesDF = ar) ``` -- ### TreeSummarizedExperiment object ```r library(TreeSummarizedExperiment) tse <- TreeSummarizedExperiment( assays = list(counts = m), rowTree = ar_tree, metadata = list(assignedReads = ar) ) ``` --- # .center[.blue[Playing with tree]] ### Summary ```r summary(ar_tree) ``` ``` ## ## Phylogenetic tree: ar_tree ## ## Number of tips: 775951 ## Number of nodes: 22598 ## No branch lengths. ## No root edge. ## First ten tip labels: AAAAAAAAAAAAAAATGGCGGA ## AAAAAAAAACCCCGGACTTCGTA ## AAAAAACGATAAAGGTAGTGGTATTTCACCGGCGC ## AAAATGCACATGCTGTGAGCT ## AAGTCAGTCTTCCGCCAGGGT ## GACGTTGCTTTTTGATCCTTCGATGTCGGCTCTTCCTATCATA ## GACGTTGGACTGAGAACTTC ## GACGTTGGCCATGGAAGTCGGAAT ## GACGTTTATAGGTCAGGTGTGGAAGTG ## GACGTTTGGTTCTATTTTGTTGGTTTCT ## First ten node labels: mm10 ## protein_coding ## processed_transcript ## Mt_rRNA ## piRNA_precursor ## lincRNA ## tRNA ## bidirectional_promoter_lncRNA ## misc_RNA ## snRNA ``` --- class: center, hide_logo background-image: url("files/p1/12.png") background-size: contain background-position: 50% 10% --- class: hide_logo # .center[.blue[Playing with tree]] ### Subset a tree ```r tree <- get_subtree_at_node(ar_tree, "tRNA-Arg-ACG-1")$subtree ``` ### Make a tree ```r p <- ggtree(tree, color="firebrick", size=0.5, branch.length = "none") + geom_nodelab(size=5, color="purple", angle = 90, vjust = -0.5) + geom_tiplab(size=8, color="black", as_ylab= TRUE) ``` --- class: hide_logo # .center[.blue[Playing with tree]] ```r p ``` <img src="20210622_PR_Deepak_Tanwar_files/figure-html/unnamed-chunk-41-1.png" style="display: block; margin: auto;" /> --- class: hide_logo # .center[.blue[Playing with tree]] ```r p <- ggtree(tree, color="firebrick", size=0.5, branch.length = "none") + geom_nodelab(size=5, color="purple", angle = 90, vjust = -0.5) + geom_tiplab(size=15, color="black", as_ylab= TRUE) viewClade(p, MRCA(p, "tRNA-Arg-ACG-1-2")) ``` <img src="20210622_PR_Deepak_Tanwar_files/figure-html/unnamed-chunk-42-1.png" style="display: block; margin: auto;" /> --- class: hide_logo # .center[.blue[Tree heatmap with counts]] ```r colnames(m) <- paste0("S", 1:ncol(m)) gheatmap(p, m) + viridis::scale_fill_viridis() ``` <img src="20210622_PR_Deepak_Tanwar_files/figure-html/unnamed-chunk-43-1.png" style="display: block; margin: auto;" /> --- class: hide_logo
# .center[.blue[What's next?]] -- Cross-check a number of things to make sure we are getting what expected as results -- **Feature addition:** checking for different sequence abundance -- Code for visualizing results that will generate high-quality publication ready figures -- Functions for normalization and differential analysis --- class: hide_logo # .center[.blue[Summary]] -- This is totally a **new package** then what we had till last year (total upgrade) -- The package is **fast** and **platform independent**. Starting from QC to obtaining `TreeSummarizedExperiment` would not take more then **2 hours (n = 10)**. If you would also need to build the genome index, you can add 1 hour more -- We have included a **number of databases** after proper comparison. Just **one function: .green[`getDB`]** to do everything for you. -- Tree structure provides **flexibility to extend annotation** -- We adequately accounted for **post-transcriptional modifications** -- **Assignment rules** can easily be customized -- We have **benchmarked a number of things** (tools/ functions/ data structure) before including them -- Users get a **comprehensive QC** of their data, while not getting compromised for the speed of analysis --- class: hide_logo # .center[.blue[Summary]] A wide range of **QC plots and tables** are included -- It is possible to **automatically detect adapter** sequences for single-end data, but the gold-standard remains the user-input -- A number of tools were investigated and then the relevant were **benchmarked for reading `fastq` files** -- We have implemented `FactorLists` for dealing with **hierarchy of the annotation**, which speed up the process extremely significantly -- We have implemented `DFrame` for storing all the information about sequences. By this, we can include **a lot of information in just one object efficiently** -- We implemented a **`TreeSummarizedExperiment`** object to store **counts, normalized counts, samples metadata, sequences metadata, row-tree, and samples-tree** -- We have a function that creates a **`phylo` object (tree) from a `FactorLists`** object that is significantly faster than existing functions. --- class: hide_logo # .center[.blue[Summary]] The .red[learning curve for data structures] like `FactorLists`, `DFrame`, and `TreeSummarizedExperiment` can be steep, but once you get used to them, you'll adore them -- A number of **plotting functions are being developed**, and plots that may **not have been seen in small RNA-seq articles** should be expected -- **To consider:** The isomiR sequence may be overestimated from paired-end small RNA sequencing data [Sanchez-Herrero <i> et. al. </i>; Res Sq.; 2021](https://doi.org/10.21203/rs.3.rs-351479/v1) --- class: hide_logo background-image: url("files/p1/17.png") background-size: contain background-position: 50% 50% --- class: center, middle, hide_logo .eLarge[.blue[**Thank you!**]]