RLBase

RLBase is a web database for accessing, exploring, and analyzing R-loop mapping data.. Key features:

  1. Explore hundreds of R-loop mapping datasets
  2. Explore R-loop regions (sites of consensus R-loop formation)
  3. Analyze your own R-loop mapping data
  4. Download processed and standardized data sets.

RLBase is a core component of RLSuite, an R-loop analysis software toolchain. RLSuite also includes RLPipes (a CLI pipeline for upstream R-loop data processing), RLSeq (an R package for downstream R-loop data analysis), and RLHub (an R package interface to the RLBase datastore).

Our goal in this document is to (1) Describe the backround and motivation of RLBase, (2) describe the RLBase components in detail, and (3) address other questions which may arise when using RLBase.

Background and Motivation

R-Loops and R-loop mapping

R-loops are three-stranded nucleic acid structures formed from the hybridization of RNA and DNA, typically during transcription (Niehrs and Luke 2020). R-loops have long been studied in the context of pathology, but recent evidence indicates they may play a critical role in normal cell physiology (Niehrs and Luke 2020).

R-loops and the factors which may promote them (Hegazy et al., 2020; PMID: 31843970)

With the advent of DNA-RNA immunoprecipitation sequencing (DRIP-Seq) in 2012, it became possible to map R-loop locations throughout the genome (Ginno et al. 2012). In the years since, dozens of studies have implemented R-loop mapping approaches and submitted their data to public repositories like the Sequence Read Archive (SRA). As of June 2022, there are now 74 studies with public R-loop data in SRA.

The RLBase project

We began the RLBase project in 2019 with the aim of standardizing public R-loop mapping samples and making the resulting data accessible for R-loop biologists to explore and analyze. Our first publication from this project used over 100 DRIP-Seq data-sets to reveal an unexpected role for physiological R-loops in chromatin architecture (Pan et al. 2020). Building on this pilot study, we began in 2020 to manually catalog all public R-loop mapping samples, finding 74 studies containing 810 data sets. We developed and implemented a light-weight CLI pipeline, RLPipes, to automate and standardize the re-processing of these data from raw reads.

Unexpectedly, we found inconsistency between samples in terms of data quality, posing a major hurdle to meta-analysis and exploration of these data. These findings were recently echoed by Chedin et al., 2021 in their EMBO commentary (Chédin et al. 2021).

To overcome this limitation, we developed new methods for assessing sample quality (implemented in the RLSeq R package). We then used these methods to find high-confidence R-loop samples and used them for meta-analysis of R-loop locations (termed “RL regions” here). The processed data generated from these analyses can be accessed via the RLHub R package.

Finally, we built the RLBase database to facilitate access to these data along with the unique analysis approaches we have developed. In the following section on “RLBase Usage and Interface”, we discuss these capabilities further.

RLBase Usage and Interface

RLBase can be used for all of the following:

  • Exploration of R-loop mapping samples (view)
    • Exploration of mapping samples from different modalities
    • Analysis of sample quality
    • Analysis of genomic feature enrichment
  • R-loop Region (RL Region) analysis (link)
    • Exploration of R-loop regions
    • Analysis of R-loop region correlation with gene expression
  • User-supplied samples (view)
    • Analysis of user-supplied data and HTML report with RLSeq.

Analysis of RLBase samples

The samples in RLBase were standardized and reprocessed using the RLPipes CLI tool. They were then analyzed with RLSeq which performed (1) R-loop forming sequences analysis, (2) sample quality prediction, (3) genomic feature enrichment analysis, (4) sample-to-sample correlation analysis, (5) gene annotation, and (6) R-loop regions (RL Regions) overlap analysis. See also the RLSeq::RLSeq() function documentation.

The metadata from these analyses along with other sample information are provided in the sample data table and can be explored interactively (Figure 1).

Figure 1. The ‘Samples’ page interface. (A) Controls for the ‘RLBase Samples Table’ that change the data which is displayed. (B) The RLBase Samples Table which shows data about the R-loop mapping samples in RLBase. Selecting an individual row will change the output in (C). (C) Output pane showing the data and plots relevant to the selected sample(s) in the table.

The right-hand side of the Sample page (Fig. 1C) displays the plots and data relevant to the samples shown in the table (controlled by “Table Controls” (Fig. 1A)) and also by the row selected in the “RLBase Samples Table” (Fig. 1B). In the following subsections, we will detail each panel.

Summary panel

The summary panel provides a high-level overview of all samples in the “RLBase Samples Table” (Fig. 2B-D) and also of the specific sample selected in that table (Fig. 2A).

Figure 2. The “Summary” panel. (A) Summary characteristics of the selected sample. (B) The representation of R-loop mapping modalities within the selected data sets (determined by the “Table Controls”). (C) Similar to (B) showing the representation of sample labels (“POS” meaning expected to map R-loops, “NEG” meaning not expected to map R-loops). (D) Similar to (C) except that “POS” and “NEG” are predicted by the ML quality model (See model details).

Sample-sample comparison

The “Sample-sample comparison” panel (Figure 3) summarizes sample-level similarity using a recently-described correlation method (Chédin et al. 2021). The method finds the Pearson correlation of R-loop signal around “gold standard” R-loop sites, sites which were profiled with ultra-long-read R-loop mapping (SMRF-Seq) (Malig et al. 2020). The plots include a heatmap showing the correlation (Fig 3. A) along with a PCA plot which shows the first two principal components of the correlation matrix (Fig. 3B).

Figure 3. The “Sample-sample comparison” panel. (A) A heatmap showing sample-sample pearson correlation of coverage (bigWig) tracks around gold-standard R-loops (See here for additional detail). (B) Similar to (A) but with a scatter plot showing PC1 and PC2 from principal component analysis (PCA) of the correlation matrix. Controls allow for customization of the plot.

Annotations

Genomic features relevant to R-loops were curated from public data sources or from computational prediction. For a full description of these data, view the relevant section of the RLHub documentation. For each sample in RLBase, the called peaks were overlapped with each genomic feature annotation and overlap statistics were calculated using Fisher’s exact test. The plots in (Figure 4) show the distribution of Fisher’s exact test odds ratios for each sample present in the “RLBase Samples Table”.

Figure 4. The “Annotation” panel. (A) A “Split” control that determines plotting in (C). (B) The databases for which plots are available. See the documentation of annotation databases here for more detail. (C) A plot showing the distribution of enrichment results (represented by the log2 Fisher’s exact test odds ratio). The sample selected in the table is shown as a diamond (if results are available). Additional information can be found in the RLSeq Vignette

RLFS analysis results

R-loop forming sequences (RLFS) are regions of the genome which are favorable for the formation of R-loops. These regions are predicted using QmRLFS-Finder (GitHub repo) based on the presence of motifs like high G or C-Skew (Jenjaroenpun et al. 2015). Because they are computationally predicted from sequence alone, we decided to use them as an unbiased test of whether samples mapped R-loops or not via a permutation testing approach. This approach is implemented in the RLSeq::analyzeRLFS() function in RLSeq. The plots in (Fig. 5B-D) demonstrate the results of this permutation testing. The results from the analysis are then further analyzed by RLSeq::predictCondition() which uses a pre-trained classifier (described here) to assess whether the RLFS analysis results indicate a sample which maps R-loops robustly (“POS” prediction) or which maps R-loops poorly (“NEG” prediction). This prediction is reported in Figure 5A.

Figure 5. The “RLFS” panel. Additional details regarding the R-loop forming sequences (RLFS) analysis approach are described in the RLSeq vignette. (A) A summary of the results from the R-loop forming sequences (RLFS) analysis. (B) The empirical distribution plot from permutation testing. The distribution represents the overlap of RLFS with randomized ranges, whereas the green line represents the actual number of overlaps. (C) The Z-score distribution plot shows the Z-score of the non-random overlaps compared to random ranges within 5kb upstream and downstream from RLFS. This distribution is what the ML model uses to predict quality (along with the P Value). (D) The Fourier transform of (C).

RL-Regions

R-loop regions (RL regions) are genomic regions which show consistent R-loop formation as determined by our meta-analysis of the high-confidence samples in RLBase. For more information about these regions and their associated metadata, please see the relevant RLHub documentation. In the RLBase interface, we show the results of overlapping the peaks for the selected sample with the RL regions (see RLSeq::rlRegionTest() for more detail). In these results, we show both the significance of the overlap (Fig. 6A) and the list of specific RL regions in the overlap (Fig. 6B).

Figure 6. The “RL Regions” panel. R-loop regions (“RL Regions”; consensus sites of R-loop formation) are described in greater detail in the RLHub documentation. (A) The overlap of RL Regions and the peaks within the selected samples. The p value and odds ratio from Fisher’s exact test are also shown. See also the relevant section of the RLSeq vignette. (B) The table of RL Regions overlapping with the selected sample’s peaks. See the RL Regions section for a detailed explanation of this interface.

Exploration of R-loop regions

R-loop regions (RL regions) are sites of consensus R-loop formation as determined through a meta-analysis of all high-confidence RLBase samples. For greater detail on how RL Regions were derived and their associated metadata, please see the relevant section of the RLHub documentation. The RLBase “RL Regions” page provides a simple interface for exploring these data (Figure 7). RL regions have a variety of metadata associated with them, such as the number of samples in which they are found and the genes which they overlap with (Figure 8). Finally, the Spearman correlation between RL Region signal (as determined from high-confidence RLBase samples) and expression (as determined by paired-RNA-Seq data) is also shown (Figure 8). We also provide links to the RLBase UCSC genome browser session (Fig. 7C) which can be used to explore these results and compare them with other data stored in the UCSC database.

Generation of PCR primers from RL-Regions in the UCSC browser

One utility of the RLBase UCSC browser session is to generate PCR primer sequences from RL Regions. To accomplish this, simply navigate to the UCSC genome browser session for RLBase, find the RL Region of interest, and then right click on it. This will produce a context menu with the option to “Get DNA for <rlregion_id>” (see image below). This will produce a FASTA file which can be used with primer-generating tools such as Primer-BLAST.




Figure 7. The “R-loop Regions” page. R-loop regions (“RL Regions”; consensus sites of R-loop formation) are described in greater detail in the RLHub documentation. This page provides an interface for exploring these data. (A) Controls for altering the data shown in (B). “All genes”, if selected, forces (B) to also show pseudogenes, RNA genes, and predicted genes. If “Repetitive” is selected, RL Regions that overlap with repetitive elements will be displayed. If “Correlated with expression” is selected, only the RL Regions with significant correlation between R-loop signal and expression will be shown. (B) The table of RL Regions along with metadata. The RL Regions in this table are controlled by (A) and selecting rows will impact the output in (D). For a full description of columns, please see the relevant RLHub documentation. (C) A link to the UCSC genome browser session for RLBase. This session contains all the coverage files for every human R-loop mapping sample, along with the consensus R-loop regions. (D) The output panel impacted by the selection in (B). See (Figure 8) for a full description.

Figure 8. The “R-loop Regions” page output panels. (A) Summary of the RL Region selected in “RL Region Table”. See also the relevant RLHub documentation for further detail. (B) A scatter plot showing the correlation of R-loop signal and expression (both log2TPM-transformed counts) within the RL Region selected. Points are colored by the R-loop mapping mode which the R-loop signal corresponds to (expression is just RNA-Seq). See also the relevant RLHub documentation for rlregions_counts and gene_exp for further detail on the data sources.

Analysis of user-supplied data

The analysis methods developed in RLSeq allow for the analysis of R-loop mapping samples within the context of the publicly-available data stored in RLBase. However, an R-package implementation may not be suitable for all users. To help increase access to these methods, RLBase provides an in-browser option for running RLSeq in the “Analyze” tab (Figure 9). In this interface, users can enter details about their sample and upload their peaks. After starting the analysis, they will be given the results link which they can open to see the analysis progress and, once complete, their analysis report (Figure 10). An example of a complete analysis can be found here.

Errors during analysis

From time to time, errors may occur during analysis of user-supplied samples. This may be due to incorrect formatting, or a bug in the analysis code, or temporary errors due to limited server resources. Regardless of the reason, users are encouraged to report the error so that database documentation and code can be updated according.

When reporting an error, please include the log.txt which accompanies the analysis. If an error occurs, an Error page will be displayed at the link for the analysis, and the log.txt can be accessed from there.




Figure 9. The “Analyze” page allows users to analyze an R-loop mapping sample in the browser. (A) Inputs describing the R-loop mapping sample. It is recommended to lift-over your data to “hg38” or “mm10” if possible, as RLSeq is most compatible with these genomes. For an explanation of the “Mode” and “Label”, please see the terminology section. (B) An upload input for the peaks to analyze (should be in broadPeak format if possible (narrowPeak and BED can also be used). (C) The user must also acknowledge the privacy statement, which specifies that uploaded data will be made publicly available in the analysis results link.

Figure 10. The “RLSeq Results” page displays the analysis description and results (it also displays progress during the analysis). (A) A verbose description of the analysis. See the RLSeq vignette for additional detail. (B) The progress indicators that display the current stage of the analysis. (C) Metadata about the uploaded sample and links to the RLSeq report, RLRanges R data object for the sample, and the log files.

Other

This section attempts to answer some of the other questions which may arise during the usage of RLBase. The types of questions are divided based on their general category:

  1. Terminology: Terms and concepts used within RLBase.
  2. Bugs: Describes handling of errors and bugs.
  3. License and citation: Describes attribution and license.

Terminology

In this subsection, we describe the terminology used in RLBase to help aid in its usage.

Modes

RLBase contains R-loop mapping data from high-throughput sequencing experiments. There are, to date, over 20 different methods of R-loop mapping included in RLBase. To help summarize them, we have provided the following visualization:

Diagram showing the different R-loop mapping methods which exist. All of these are found in RLBase (except computational methods, SMRF-Seq, and bisDRIP-Seq).

Diagram showing the different R-loop mapping methods which exist. All of these are found in RLBase (except computational methods, SMRF-Seq, and bisDRIP-Seq).

For additional detail, please refer to the following summary table:

R-loop mapping modes and associated metadata.
mode family ip_type strand_specific moeity bisulfite_seq PMID
bisDRIP DRIP S9.6 TRUE DNA TRUE 29072160
BisMapR MapR dRNH TRUE DNA FALSE 33620319
DRNA Other None FALSE RNA FALSE 34205454
dDRIP DRIP S9.6 TRUE DNA FALSE 31262943
DREAM Other None TRUE RNA FALSE
DRIP DRIP S9.6 FALSE DNA FALSE 22387027
DRIPc DRIP S9.6 TRUE RNA FALSE 27373332
DRIVE Other dRNH FALSE RNA FALSE 22387027
qDRIP DRIP S9.6 TRUE DNA FALSE 32544226
R-ChIP R-ChIP dRNH TRUE DNA FALSE 29104020
RDIP RDIP S9.6 TRUE RNA FALSE 26579211
RLCnT-S96 RLCnT S9.6 TRUE DNA FALSE 33597247
RLCnT-HBD RLCnT dRNH TRUE DNA FALSE 33597247
DRIPc-HBD DRIP dRNH TRUE RNA FALSE 33597247
MapR MapR dRNH FALSE DNA FALSE 31665646
RNH-CnR MapR dRNH FALSE DNA FALSE 31665646
RR-ChIP R-ChIP dRNH TRUE RNA FALSE 31679819
S1-DRIP DRIP S9.6 FALSE DNA FALSE 27298336
sDRIP DRIP S9.6 TRUE DNA FALSE 33830170
SMRF Other S9.6 TRUE DNA TRUE 32105733
ssDRIP DRIP S9.6 TRUE DNA FALSE 28848233
DRIP-RNA-Seq DRIP S9.6 TRUE RNA FALSE 26551076
m6A-DIP RDIP RNA-m6A FALSE DNA FALSE 31844323

Column description:

  • mode: The R-loop mapping modality.
  • family: The broad classification of that mode.
  • ip_type: The type of immunoprecipitation (IP) used in the mapping protocol. ‘S9.6’ for S9.6 antibody-based IP. ‘dRNH’ for ‘catalytically dead RNaseH1’ IP. ‘None’ if no IP was used.
  • strand_specific: A boolean indicating whether the mode is strand-specific or not.
  • moeity: Indicates the moeity that was sequenced.
  • bisulfite_seq: A boolean indicating whether the mode uses bisulfite conversion sequencing.
  • PMID: The pubmed ID for the study in which this method was first described.

Sample Condition

Sample condition was added manually during initial data curation based on the metadata in GEO and/or SRA. The condition refers to the R-loop mapping condition of the sample. For example, a condition of “Input” represents a genomic input sample that would be a negative control in R-loop mapping.

Label

The label is the categorization of “Sample Condition” as “POS” or “NEG”. “POS” indicates a sample condition which is expected to map R-loops. “NEG” indicates a negative control which is not expected to map R-loops. In DNA-RNA immunoprecipitation sequencing (DRIP), for example, “S9.6” indicates a “POS” condition which should map R-loops, where as “RNH1” (RNaseH1-treated) indicates a “NEG” condition which should not as RNaseH1 is an enzyme that degrades R-loops. The label was automatically assigned to each sample using a custom REGEX dictionary applied to the sample condition:

REGEX dictionary for assigning R-loop sample conditions to ‘POS’ or ‘NEG’ labels.
Mode Condition POS NEG
MapR ActD FALSE TRUE
R-ChIP delta-HC TRUE FALSE
.* Input FALSE TRUE
.* FLAG TRUE FALSE
R-ChIP WKKD FALSE TRUE
.*MapR pAMNase FALSE TRUE
.*MapR RNHMNase TRUE FALSE
.*MapR RNHdMNase FALSE TRUE
.* RNH FALSE TRUE
RLCnT S96-t1 FALSE TRUE
RLCnT S96-t2 FALSE TRUE
RLCnT S96-t3 TRUE FALSE
RLCnT S96-t4 FALSE TRUE
RLCnT S96-t5 FALSE TRUE
RLCnT S96-t6 TRUE FALSE
RLCnT S96-t7 FALSE TRUE
RLCnT S96-t8 FALSE TRUE
IgG FALSE TRUE
RLCnT HBD TRUE FALSE
RLCnT-.* RNaseA FALSE TRUE
.DRIP. RNaseA TRUE FALSE
DRIPc-HBD HBD TRUE FALSE
.* Pos TRUE FALSE
DRNA DRNA TRUE FALSE
.* S96 TRUE FALSE
.* D209N TRUE FALSE
.* D210N TRUE FALSE
SMRF bisFoot TRUE FALSE
sDRIP RNaseT1 TRUE FALSE
sDRIP RNaseIII TRUE FALSE
sDRIP RNaseT1-III TRUE FALSE

Prediction

While a sample may have a “POS” label, that does not mean it mapped R-loops. Likewise sampels with “NEG” label sometimes robustly mapped R-loops, as may be the case when RNaseH1 treatment fails during DRIP sequencing. This finding motivated us to develop the quality model used in RLSeq (see documentation about this model here). The model predicts either “POS” or “NEG”, which indicates robust R-loop mapping and poor R-loop mapping, respectively.

Bugs

RLBase is still early in the development cycle, which means bugs may occasionally occur. If you encounter any bugs or unexpected behavior, please open an issue on the RLBase GitHub repo and describe, in as much detail as possible, the following:

  1. What you expected RLBase to do.
  2. What RLBase did and why it was unexpected.
  3. Any error messages you recieved (along with screenshots).
  4. If analyzing your own data, please also include a link to the dataset (if it can be shared).

License and attribution

RLBase is licensed under an MIT license and we ask that you please cite RLBase in any published work like so:

Miller et al., RLBase, (2022), GitHub repository, https://github.com/Bishop-Laboratory/RLBase

References

Chédin, Frédéric, Stella R Hartono, Lionel A Sanz, and Vincent Vanoosthuyse. 2021. “Best Practices for the Visualization, Mapping, and Manipulation of R-Loops.” The EMBO Journal 40 (4). https://doi.org/10.15252/embj.2020106394.
Ginno, Paul A., Paul L. Lott, Holly C. Christensen, Ian Korf, and Frédéric Chédin. 2012. “R-Loop Formation Is a Distinctive Characteristic of Unmethylated Human CpG Island Promoters.” Molecular Cell 45 (6): 814–25. https://doi.org/10.1016/j.molcel.2012.01.017.
Jenjaroenpun, Piroon, Thidathip Wongsurawat, Surya Pavan Yenamandra, and Vladimir A. Kuznetsov. 2015. “QmRLFS-Finder: A Model, Web Server and Stand-Alone Tool for Prediction and Analysis of R-Loop Forming Sequences.” Nucleic Acids Research 43 (W1): W527–34. https://doi.org/10.1093/nar/gkv344.
Malig, Maika, Stella R. Hartono, Jenna M. Giafaglione, Lionel A. Sanz, and Frederic Chedin. 2020. “Ultra-Deep Coverage Single-Molecule R-Loop Footprinting Reveals Principles of R-Loop Formation.” Journal of Molecular Biology 432 (7): 2271–88. https://doi.org/10.1016/j.jmb.2020.02.014.
Niehrs, Christof, and Brian Luke. 2020. “Regulatory R-Loops as Facilitators of Gene Expression and Genome Stability.” Nature Reviews Molecular Cell Biology 21 (3): 167–78. https://doi.org/10.1038/s41580-019-0206-3.
Pan, Hai, Miao Jin, Ashwin Ghadiyaram, Parminder Kaur, Henry E Miller, Hai Minh Ta, Ming Liu, et al. 2020. “Cohesin Sa1 and Sa2 Are RNA Binding Proteins That Localize to RNA Containing Regions on DNA.” Nucleic Acids Research 48 (10): 5639–55. https://doi.org/10.1093/nar/gkaa284.