Title: | Exact Test and Visualization of Multi-Set Intersections |
---|---|
Description: | Identification of sets of objects with shared features is a common operation in all disciplines. Analysis of intersections among multiple sets is fundamental for in-depth understanding of their complex relationships. This package implements a theoretical framework for efficient computation of statistical distributions of multi-set intersections based upon combinatorial theory, and provides multiple scalable techniques for visualizing the intersection statistics. The statistical algorithm behind this package was published in Wang et al. (2015) <doi:10.1038/srep16923>. |
Authors: | Minghui Wang, Yongzhong Zhao and Bin Zhang |
Maintainer: | Minghui Wang <[email protected]> |
License: | GPL-3 |
Version: | 1.1.2 |
Built: | 2025-02-02 05:56:24 UTC |
Source: | https://github.com/mw201608/superexacttest |
This example dataset contains a list of seven cancer predisposition gene sets.
data(Cancer)
The seven cancer predisposition gene sets are:
NRG (Rahman, N. Realizing the promise of cancer predisposition genes. Nature 2014, 505:302-308);
NBG (Tamborero, D. et al. Comprehensive identification of mutational cancer driver genes across 12 tumor types. Scientific reports 2013, 3:2650);
LDG (Kandoth, C. et al. Mutational landscape and significance across 12 major cancer types. Nature 2013, 502:333-339);
GGG (Lawrence, M. S. et al. Discovery and saturation analysis of cancer genes across 21 tumour types. Nature 2014, 505:495-501);
ELG (Garraway, L. A. & Lander, E. S. Lessons from the cancer genome. Cell 2013, 153:17-37);
CCG (Futreal, P. A. et al. A census of human cancer genes. Nature reviews. Cancer 2004, 4:177-183);
BVG (Vogelstein, B. et al. Cancer genome landscapes. Science 2013, 339:1546-1558).
Minghui Wang, Yongzhong Zhao, and Bin Zhang (2015). Efficient Test and Visualization of Multi-Set Intersections. Scientific Reports 5: 16923.
This example dataset contains a list of cis-eQTL genes.
data(eqtls)
A list is included in this dataset: cis.eqtls, which contains four sets of cis-eQTL genes published by Gibbs et al. (PLOS Genetics 2010, 6:e1000952) as deposited in the eQTL Browser (http://www.ncbi.nlm.nih.gov/projects/gap/eqtl/index.cgi).
The four sets of cis-eQTL genes were detected in four different brain regions: cerebellum (CB), frontal cortex (FC), temporal cortex (TC), and pons (PONS), respectively.
Density and distribution function of multi-set intersection test.
dpsets(x, L, n, log.p=FALSE)
cpsets(x, L, n, lower.tail=TRUE, log.p=FALSE, simulation.p.value=FALSE, number.simulations=1000000)
x |
integer, number of elements shared by all sets. |
L |
vector, set sizes. |
n |
integer, background population size. |
lower.tail |
logical; if TRUE, probability is |
log.p |
logical; if TRUE, probability p is given as log(p). |
simulation.p.value |
logical; if TRUE, probability p is computed from simulation. |
number.simulations |
integer; number of simulations. |
dpsets gives the density and cpsets gives the distribution function.
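As a rough cross-check of what a density and an upper-tail probability mean here, the sketch below uses base R only. It rests on two assumptions that are not taken from the package source: that with exactly two sets the overlap size follows a hypergeometric distribution, and that random sets are drawn uniformly without replacement from the background. The helper `sim_overlap_p` is hypothetical and only illustrates the idea behind a simulated p value.

```r
# Two-set case (assumed hypergeometric): sets of size L[1] and L[2] from n items.
n <- 500; L <- c(260, 320); x <- 170
d_two <- dhyper(x, L[1], n - L[1], L[2])                          # P(X = x)
p_two <- phyper(x - 1, L[1], n - L[1], L[2], lower.tail = FALSE)  # P(X >= x)

# Hypothetical Monte Carlo helper for any number of sets:
# draw each set at random and count how often the overlap reaches x.
sim_overlap_p <- function(x, L, n, n_sim = 2000) {
  sizes <- replicate(n_sim, {
    draws <- lapply(L, function(k) sample.int(n, k))
    length(Reduce(base::intersect, draws))
  })
  mean(sizes >= x)
}

set.seed(1)
p_sim <- sim_overlap_p(x, L, n)
```

With enough simulations, `p_sim` should land close to `p_two`; the exact functions avoid the sampling noise altogether.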
Minghui Wang <[email protected]>
Minghui Wang, Yongzhong Zhao, and Bin Zhang (2015). Efficient Test and Visualization of Multi-Set Intersections. Scientific Reports 5: 16923.
## Not run: 
# set up fake data
n=500; A=260; B=320; C=430; D=300; x=170
(d=dpsets(x,c(A,B,C,D),n))
(p=cpsets(x,c(A,B,C,D),n,lower.tail=FALSE))
## End(Not run)
Decrypt barcode information.
deBarcode(barcode, setnames, collapse=' & ')
barcode |
a vector of character strings, encoding the intersection combination. |
setnames |
set names. |
collapse |
an optional character string to separate the results. See |
barcode values are character strings of '0' and '1', indicating absence or presence of each set in an intersection combination.
A vector.
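The decoding rule described above can be sketched in base R. The helper `decode_barcode` is hypothetical and is not the package's implementation; it simply keeps the set names whose position holds a '1' and joins them with the separator.

```r
# Hypothetical re-implementation of the barcode rule, for illustration only.
decode_barcode <- function(barcode, setnames, collapse = " & ") {
  vapply(strsplit(barcode, ""), function(bits) {
    # keep the names of sets flagged with '1'
    paste(setnames[bits == "1"], collapse = collapse)
  }, character(1))
}

decoded <- decode_barcode(c("01011", "10100"), c("S1", "S2", "S3", "S4", "S5"))
print(decoded)
```

Under this rule, '01011' reads as sets 2, 4 and 5, and '10100' as sets 1 and 3.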
Minghui Wang <[email protected]>
deBarcode(c('01011','10100'), c('S1','S2','S3','S4','S5'))
This example dataset contains a list of gene sets associated with six types of clinical traits curated in the GWAS Catalog.
data(GWAS)
The six clinical traits are:
NEU (Bipolar disorder and schizophrenia, Schizophrenia, Major depressive disorder, Alzheimer's disease, Parkinson's disease, Cognitive performance, Bipolar disorder);
INF (Crohn's disease, Ulcerative colitis, Inflammatory bowel disease, Rheumatoid arthritis, Multiple sclerosis, Systemic lupus erythematosus);
CVD (Type 2 diabetes, Coronary heart disease, Blood pressure, total Cholesterol, HDL cholesterol, Triglycerides);
HT (height);
IgG (IgG glycosylation);
OB (obesity, obesity related traits).
Minghui Wang, Yongzhong Zhao, and Bin Zhang (2015). Efficient Test and Visualization of Multi-Set Intersections. Scientific Reports 5: 16923.
Performs set union and intersection on multiple input vectors.
union(x, y, ...)
intersect(x, y, ...)
x, y, ... |
vectors (of the same mode) containing a sequence of items (conceptually) with no duplicated values. |
These functions extend the functions of the same name in the base package to handle more than two input vectors.
A vector of the same mode as x or y for intersect, and of a common mode for union.
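One way to picture the extension to more than two vectors is to fold the two-argument base functions over the inputs. The helpers `multi_intersect` and `multi_union` below are hypothetical names used for illustration; the package may implement its versions differently.

```r
# Fold the two-argument base functions over any number of input vectors.
multi_intersect <- function(...) Reduce(base::intersect, list(...))
multi_union     <- function(...) Reduce(base::union, list(...))

a  <- letters[1:5]   # a..e
b  <- letters[3:7]   # c..g
cc <- letters[4:9]   # d..i

common <- multi_intersect(a, b, cc)
pooled <- multi_union(a, b, cc)
```

Calling `base::intersect` and `base::union` explicitly keeps the sketch unambiguous even when the package versions are attached.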
Minghui Wang <[email protected]>, Bin Zhang <[email protected]>
Minghui Wang, Yongzhong Zhao, and Bin Zhang (2015). Efficient Test and Visualization of Multi-Set Intersections. Scientific Reports 5: 16923.
##not run##
Find intersections and assign elements to intersection combinations.
intersectElements(x, mutual.exclusive=TRUE)
x |
list; a collection of sets. |
mutual.exclusive |
logical; see |
See example below for the use of mutual.exclusive.
A data.frame with two columns:
Entry |
set elements. |
barcode |
intersection combination that each entry belongs to. |
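To make the Entry/barcode pairing concrete, the base R sketch below labels each element with a 0/1 string recording which sets contain it. The helper `element_barcodes` is hypothetical. That this corresponds to mutual.exclusive=TRUE (each element assigned to exactly one combination) is an assumption, not something confirmed from the package source.

```r
# Hypothetical helper: one row per element, barcode = membership across sets.
element_barcodes <- function(sets) {
  entries <- sort(unique(unlist(sets)))
  # logical membership, one column per set
  member <- vapply(sets, function(s) entries %in% s, logical(length(entries)))
  member <- matrix(member, nrow = length(entries))
  data.frame(
    Entry   = entries,
    barcode = apply(member, 1, function(r) paste(as.integer(r), collapse = "")),
    stringsAsFactors = FALSE
  )
}

sets <- list(S1 = c("a", "b", "c"), S2 = c("b", "c", "d"), S3 = c("c", "e"))
res <- element_barcodes(sets)
print(res)
```

Here element "c" belongs to all three sets and so carries barcode '111', while "a" occurs only in S1 and carries '100'.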
Minghui Wang <[email protected]>
set.seed(123)
sets=list(S1=sample(letters,10), S2=sample(letters,5), S3=sample(letters,7))
intersectElements(sets,mutual.exclusive=TRUE)
intersectElements(sets,mutual.exclusive=FALSE)
This function calculates Jaccard indices between pairs of sets.
jaccard(x)
x |
list; a collection of sets. |
A matrix of pairwise Jaccard indices.
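For reference, the Jaccard index of two sets is the size of their intersection divided by the size of their union. The base R sketch below computes the pairwise matrix with a hypothetical helper, `jaccard_matrix`; how the package treats the diagonal or empty sets is not stated here, so those details are assumptions.

```r
# Hypothetical helper: pairwise |A n B| / |A u B| for a named list of sets.
jaccard_matrix <- function(x) {
  k <- length(x)
  J <- matrix(1, k, k, dimnames = list(names(x), names(x)))
  for (i in seq_len(k)) {
    for (j in seq_len(k)) {
      shared <- length(base::intersect(x[[i]], x[[j]]))
      pooled <- length(base::union(x[[i]], x[[j]]))
      J[i, j] <- shared / pooled   # NaN if both sets are empty
    }
  }
  J
}

x <- list(S1 = letters[1:20], S2 = letters[10:26])
J <- jaccard_matrix(x)
print(J)
```

S1 and S2 share 11 letters out of 26 in total, so their index is 11/26.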
Minghui Wang <[email protected]>
## Not run: 
# set up fake data
x=list(S1=letters[1:20], S2=letters[10:26], S3=sample(letters,10), S4=sample(letters,10))
jaccard(x)
## End(Not run)
Calculate fold enrichment (FE) and significance of the intersection among multiple sets.
MSET(x,n,lower.tail=TRUE,log.p=FALSE)
x |
list; a collection of sets. |
n |
integer; background population size. |
lower.tail |
logical; if TRUE, probability is |
log.p |
logical; if TRUE, probability p is given as log(p). |
This function implements an efficient statistical test for multi-set intersections. The algorithm behind this function was described in Wang et al. (2015).
A list with the following elements:
intersects |
a vector of intersect items. |
FE |
fold enrichment of the intersection. |
p.value |
one-tailed probability of observing an intersection at least as large as the number of intersect items. |
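The returned quantities can be illustrated for two sets in base R. The sketch assumes that fold enrichment is the observed overlap divided by the overlap expected when sets are drawn independently from a background of size n, and that the two-set upper-tail probability is hypergeometric. Both are assumptions for illustration and do not reproduce the package's multi-set algorithm.

```r
sets <- list(S1 = letters[1:20], S2 = letters[10:26])
n <- 26   # background population size

observed <- length(Reduce(base::intersect, sets))
expected <- n * prod(lengths(sets) / n)   # n * (|S1|/n) * (|S2|/n)
fe <- observed / expected                  # assumed definition of FE

# Upper tail P(X >= observed) for two sets via the hypergeometric distribution.
p_upper <- phyper(observed - 1, length(sets$S1), n - length(sets$S1),
                  length(sets$S2), lower.tail = FALSE)
```

In this toy case the two sets together cover the whole alphabet, so an overlap of at least 11 is unavoidable and the upper-tail probability equals 1.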
Minghui Wang <[email protected]>, Bin Zhang <[email protected]>
Minghui Wang, Yongzhong Zhao, and Bin Zhang (2015). Efficient Test and Visualization of Multi-Set Intersections. Scientific Reports 5: 16923.
## Not run: 
# set up fake data
x=list(S1=letters[1:20], S2=letters[10:26], S3=sample(letters,10), S4=sample(letters,10))
MSET(x, 26, FALSE)
## End(Not run)
This object contains data regarding the intersections between multiple sets. This object is usually created by the supertest function.
An intersection combination is denoted by a barcode string of '0' and '1', where '1' in the i-th position of the string indicates that the intersection involves the i-th set, and '0' indicates that it does not. E.g., string '000101' indicates that the intersection is an overlap between the 4th and 6th sets. Function deBarcode can be used to decode the barcode.
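The barcode scheme implies that k sets give 2^k - 1 possible non-empty combinations, and that the number of '1' characters is the degree of an intersection. The base R sketch below enumerates them; `all_barcodes` and `barcode_degree` are hypothetical helpers, and the order in which the package lists combinations may differ.

```r
# Hypothetical helpers for enumerating barcodes of k sets.
all_barcodes <- function(k) {
  grid  <- expand.grid(rep(list(c("0", "1")), k), stringsAsFactors = FALSE)
  codes <- unname(apply(grid, 1, paste, collapse = ""))
  codes[grepl("1", codes)]   # drop the all-zero string
}

# Degree = number of sets involved, i.e. the count of '1' characters.
barcode_degree <- function(barcode) nchar(gsub("0", "", barcode))

codes <- all_barcodes(3)
deg   <- barcode_degree("000101")
```

For three sets this yields seven barcodes, and '000101' has degree 2 because it involves two sets.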
Generic summary and plot functions can be applied to extract and visualize the results.
x |
a list of sets from input. |
set.names |
names of the sets. If the input sets do not have names, they will be automatically named as SetX where X is an integer from 1 to the total number of sets. |
set.sizes |
a vector of set sizes. |
n |
background population size. |
overlap.sizes |
a named vector of intersection sizes. Each intersection component is named by a barcoded character string of '0' and '1'. See |
overlap.expected |
a named vector of expected intersection sizes when item |
P.value |
a vector of p values for the intersections when item |
Minghui Wang <[email protected]>, Bin Zhang <[email protected]>
Minghui Wang, Yongzhong Zhao, and Bin Zhang (2015). Efficient Test and Visualization of Multi-Set Intersections. Scientific Reports 5: 16923.
supertest, summary.msets, plot.msets, deBarcode
This function draws intersections among multiple sets.
## S3 method for class 'msets'
plot(x, Layout=c('circular','landscape'), degree=NULL,
  keep.empty.intersections=TRUE,
  sort.by=c('set','size','degree','p-value'),
  min.intersection.size=0, max.intersection.size=Inf,
  ylim=NULL, log.scale=FALSE, yfrac=0.8, margin=NULL,
  color.scale.pos=c(0.85, 0.9), legend.pos=c(0.85,0.25),
  legend.col=2, legend.text.cex=1, color.scale.cex=1,
  color.scale.title=expression(paste(-Log[10],'(',italic(P),')')),
  color.on='#2EFE64', color.off='#EEEEEE',
  show.overlap.size=TRUE, show.fold.enrichment=FALSE,
  show.set.size=TRUE, overlap.size.cex=0.9,
  track.area.range=0.3, bar.area.range=0.2,
  new.gridPage=TRUE, minMinusLog10PValue=0,
  maxMinusLog10PValue=NULL, show.elements=FALSE, ...)
x |
a |
Layout |
layout for plotting. |
degree |
a vector of intersection degrees for plotting. E.g., when |
keep.empty.intersections |
logical; if |
min.intersection.size |
Minimum size of an intersection to be plotted. |
max.intersection.size |
Maximum size of an intersection to be plotted. |
sort.by |
how to sort intersections. It can be either one of the key words " |
ylim |
the limits c(y1, y2) of plotting overlap size. |
log.scale |
logical; whether to plot with log transformed intersection sizes. |
yfrac |
numeric; the fraction (0 to 1) of canvas used for plotting bars. Only used for |
margin |
numeric; a vector of 4 numeric values specifying the margins (bottom, left, top, & right) in unit of "lines". Default c(1,1,1,1)+0.1 for |
color.scale.pos |
numeric; x and y coordinates (0 to 1) for packing the color scale guide. It could be a keyword " |
legend.pos |
numeric; x and y coordinates (0 to 1) for packing the legend in the |
legend.col |
integer; number of columns of the legend in the |
legend.text.cex |
numeric; specifying the amount by which legend text should be magnified relative to the default. |
color.scale.cex |
numeric; specifying the amount by which color scale text should be magnified relative to the default. |
color.scale.title |
character or expression; a title for the color scale guide. |
color.on |
color code; specifying the color for set(s) which are " |
color.off |
color code; specifying the color for set(s) which are "absent" for an intersection. |
show.overlap.size |
logical; whether to show overlap size on top of the bars. This will be set to |
show.fold.enrichment |
logical; whether to show fold enrichment if available rather than overlap size. This will impact |
show.set.size |
logical; whether to show set size in the |
overlap.size.cex |
numeric; specifying the amount by which overlap size text should be magnified relative to the default. |
track.area.range |
the magnitude of track area from origin in the |
bar.area.range |
the magnitude of bar area from edge of the track area in the |
new.gridPage |
logical; whether to start a new grid page. Set |
minMinusLog10PValue |
numeric; minimum minus log10 P value for capping the scale of color map. Default 0. |
maxMinusLog10PValue |
numeric; maximum minus log10 P value for capping the scale of color map. Default maximum from the data. |
show.elements |
logical; whether to show the intersection elements on top of the bars with the |
... |
additional arguments for the plot function. See |
The plot canvas has coordinates from 0 to 1 on both the x and y axes. Additional optional plot parameters include:
ylab, a character string for the y axis label.
circle.radii, radius of the circles in the landscape Layout. Default 0.5.
heatmapColor, a vector of customized heat colors.
show.expected.overlap, whether to show expected overlap in the landscape Layout. Default 'FALSE'.
expected.overlap.style, one of c("hatchedBox","horizBar","box"). Default 'hatchedBox'.
expected.overlap.lwd, line width for expected.overlap "horizBar" and "box". Default 2.
color.expected.overlap, color for showing expected overlap in hatched lines. Default 'grey'.
alpha.expected.overlap, alpha channel (transparency) for the hatched lines showing expected overlap. Default 1 (normalized to the range 0 to 1).
cex, scale of text font size.
cex.lab, scale of axis label text font size.
show.track.id, logical; whether to show the track id in the circular layout. Default TRUE.
phantom.tracks, number of phantom tracks in the middle in the circular layout. Default 2.
gap.within.track, ratio of gap width over block width on the same track. Default 0.1.
gap.between.track, ratio of gap width over track width. Default 0.1.
bar.split, a vector of two values specifying a continuous range that will be cropped in the y axis with the landscape layout.
elements.list, a data.frame or matrix such as the one generated by the summary function from a msets object, with row names matching the barcodes of intersection combinations and at least one column named "Elements" listing the elements to be displayed (the elements should be concatenated by separator ", ").
elements.cex, numeric; specifying the amount by which intersection element text should be magnified. Default 0.9.
elements.rot, numeric; the angle to rotate the text of intersection elements. Default 45.
elements.col, colour for intersection element text. Default black.
elements.maximum, maximum number of elements to show.
intersection.size.rotate, logical; whether to rotate the text of intersection size.
flip.vertical, logical; whether to flip the bars downwards in the landscape Layout. Default 'FALSE'.
title, figure title. Default NULL.
cex.title, scale of title text font size. Default 1.
No return.
Minghui Wang <[email protected]>, Bin Zhang <[email protected]>
Minghui Wang, Yongzhong Zhao, and Bin Zhang (2015). Efficient Test and Visualization of Multi-Set Intersections. Scientific Reports 5: 16923.
## Not run: 
# set up fake data
x=list(S1=letters[1:20], S2=letters[10:26], S3=sample(letters,10), S4=sample(letters,10))
obj=supertest(x,n=26)
plot(obj)
## End(Not run)
This function outputs summary statistics of a msets object.
## S3 method for class 'msets'
summary(object, degree=NULL, ...)
object |
a |
degree |
a vector of intersection degrees to pull out. |
... |
additional arguments (not implemented). |
A list:
Barcode |
a vector of 0/1 character strings, representing the set composition of each intersection. |
otab |
a vector of observed intersection size between any combination of sets. |
etab |
a vector of expected intersection size between any combination of sets if background population size is specified. |
set.names |
set names. |
set.sizes |
set sizes. |
n |
background population size. |
P.value |
upper tail p value for each intersection if background population size n is specified. |
Table |
a data.frame containing degree, otab, etab, fold change, p value and the overlap elements. |
Minghui Wang <[email protected]>, Bin Zhang <[email protected]>
Minghui Wang, Yongzhong Zhao, and Bin Zhang (2015). Efficient Test and Visualization of Multi-Set Intersections. Scientific Reports 5: 16923.
## Not run: 
# set up fake data
x=list(S1=letters[1:20], S2=letters[10:26], S3=sample(letters,10), S4=sample(letters,10))
obj=supertest(x,n=26)
summary(obj)
## End(Not run)
Efficient Test and Visualization of Multi-set Intersections
The main functions that most users may need from this package are supertest and MSET. For a brief introduction to using this package, please see vignette("set_html").
Minghui Wang <[email protected]>, Bin Zhang <[email protected]>
Minghui Wang, Yongzhong Zhao, and Bin Zhang (2015). Efficient Test and Visualization of Multi-Set Intersections. Scientific Reports 5: 16923.
## Not run: 
# See a brief introduction to using this package
vignette("set_html")
## End(Not run)
This function calculates intersection sizes among multiple sets and performs statistical tests of the intersections.
supertest(x, n=NULL, degree=NULL, ...)
x |
list; a collection of sets. |
n |
integer, background population size. Required for computing the statistical significance of intersections. |
degree |
a vector of intersection degrees for overlap analysis. E.g., when |
... |
additional arguments (not implemented). |
This function calculates intersection sizes between multiple sets and, if background population size n is specified, performs statistical tests of the intersections.
For a brief introduction to using this package, please see vignette("set_html").
An object of class msets.
Minghui Wang <[email protected]>, Bin Zhang <[email protected]>
Minghui Wang, Yongzhong Zhao, and Bin Zhang (2015). Efficient Test and Visualization of Multi-Set Intersections. Scientific Reports 5: 16923.
msets, MSET, Cancer, cpsets, dpsets
## Not run: 
# Analyze the cancer gene sets
data(Cancer)
Result=supertest(Cancer, n=20687)
summary(Result)
plot(Result,degree=2:7,sort.by='size')
## End(Not run)