Welcome to genominterv
genominterv provides support for working with intervals on genomes. A
genomic interval is specified as a chromosome, start, and end. Intervals are
half-open: a value x is in the interval (start, end) if start <= x and
x < end. All functions take pandas.DataFrames as arguments. These data frames
must include chrom, start, and end columns.
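As a small illustration of the half-open convention, the sketch below builds a data frame with the three required columns and checks membership at the two boundaries. The data values and the helper name `contains` are made up for this example and are not part of genominterv.

```python
import pandas as pd

# Hypothetical example data with the three required columns.
intervals = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr2"],
    "start": [100, 500, 0],
    "end": [200, 600, 50],
})

def contains(start, end, x):
    """Half-open membership: start is included, end is not."""
    return start <= x < end

in_first = contains(100, 200, 100)   # the start position is inside
at_end = contains(100, 200, 200)     # the end position is outside

# With half-open intervals, the length is simply end - start.
lengths = (intervals["end"] - intervals["start"]).tolist()
```

One consequence of this convention is that two intervals such as (100, 200) and (200, 300) touch without sharing any position.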
See the library reference for detailed documentation of each function and decorator.
Set operations
The three functions:
interval_diff
interval_intersect
interval_union
perform the standard difference, intersection, and union set operations on two sets of genomic intervals. The intervals returned from all three functions are collapsed to produce non-overlapping intervals. The genomic intervals in each input set must be non-overlapping. This can be achieved using the function:
interval_collapse
which produces the union of genomic intervals in a single set of genomic intervals.
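To make the idea of collapsing and intersecting concrete, here is a minimal sketch on plain lists of (start, end) tuples for a single chromosome. It is not the library's implementation: the helper names `collapse` and `intersect` are hypothetical, and merging intervals that merely touch is an assumption of this sketch.

```python
def collapse(tuples):
    """Merge overlapping or touching (start, end) tuples into a sorted union."""
    merged = []
    for start, end in sorted(tuples):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def intersect(a, b):
    """Intersection of two collapsed, sorted lists of half-open intervals."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            result.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result

query = collapse([(10, 50), (40, 80), (200, 300)])
annot = collapse([(60, 250)])
overlap = intersect(query, annot)
```

The linear sweep in `intersect` relies on both inputs being sorted and non-overlapping, which is why collapsing first matters.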
Genomic decorator
To make it easy to create other interval functions that work across chromosomes,
the module provides a genomic decorator that can be applied to functions
that operate on lists of (start, end) tuples. Applying the decorator changes
the signature of a function to make it operate on DataFrames that include
chrom, start, and end columns. Here is an example function that
shifts intervals by 1000bp:
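import pandas

@genomic
def interval_shift(tuples):
    # tuples is a list of (start, end) pairs
    return [(x+1000, y+1000) for (x, y) in tuples]

# Example data; df must include chrom, start, and end columns
df = pandas.DataFrame({'chrom': ['chr1'], 'start': [100], 'end': [200]})
shifted = interval_shift(df)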
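For readers curious how a decorator of this kind can work, the following is a rough sketch under stated assumptions, not the code behind genomic. The name `per_chromosome` is hypothetical, and the sketch assumes the wrapped function takes one data frame and is applied to each chromosome separately.

```python
import functools
import pandas as pd

def per_chromosome(func):
    """Hypothetical stand-in for a decorator like genomic (not the library's code)."""
    @functools.wraps(func)
    def wrapper(df):
        pieces = []
        # Apply the tuple-based function to each chromosome on its own.
        for chrom, group in df.groupby("chrom", sort=True):
            tuples = list(zip(group["start"].tolist(), group["end"].tolist()))
            result = func(tuples)
            pieces.append(pd.DataFrame({
                "chrom": [chrom] * len(result),
                "start": [s for s, _ in result],
                "end": [e for _, e in result],
            }))
        return pd.concat(pieces, ignore_index=True)
    return wrapper

@per_chromosome
def shift_1000(tuples):
    return [(x + 1000, y + 1000) for (x, y) in tuples]

df = pd.DataFrame({"chrom": ["chr1", "chr2"], "start": [0, 10], "end": [5, 20]})
shifted = shift_1000(df)
```

The point of the pattern is that the inner function never sees chromosome names, so it stays a simple list transformation.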
Remapping functions
The function interval_distance converts coordinates of one set of
genomic intervals into distances to the closest interval in a second set.
interval_relative_distance does the same but returns
relative distances.
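The general idea of a distance to the closest interval can be sketched for a single position as follows. This is only an illustration: the helper `distance_to_closest` is hypothetical, and measuring from the last included position (end - 1) and returning 0 inside an interval are assumptions of this sketch, not documented behavior of interval_distance.

```python
def distance_to_closest(x, intervals):
    """Distance from position x to the nearest half-open interval (0 if inside)."""
    best = None
    for start, end in intervals:
        if start <= x < end:
            return 0
        # The last position inside a half-open interval is end - 1.
        d = start - x if x < start else x - (end - 1)
        if best is None or d < best:
            best = d
    return best

annot = [(100, 200), (500, 600)]
distances = [distance_to_closest(x, annot) for x in (50, 150, 210, 480)]
```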
Two-set statistics
The module also provides two statistics for relations between sets:
jaccard computes the Jaccard index
statistic for two sets of genomic intervals.
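For reference, the Jaccard index of two sets is the size of their intersection divided by the size of their union. The sketch below computes it for two sorted, non-overlapping lists of (start, end) tuples, measuring size in covered positions. The function names are hypothetical and the choice of units is an assumption; the library's jaccard may differ in detail.

```python
def total_length(intervals):
    """Number of positions covered by non-overlapping half-open intervals."""
    return sum(end - start for start, end in intervals)

def intersection_length(a, b):
    """Total overlap between two sorted, non-overlapping interval lists."""
    total, i, j = 0, 0, 0
    while i < len(a) and j < len(b):
        total += max(0, min(a[i][1], b[j][1]) - max(a[i][0], b[j][0]))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total

def jaccard_index(a, b):
    inter = intersection_length(a, b)
    union = total_length(a) + total_length(b) - inter
    return inter / union if union else 0.0

a = [(0, 100)]
b = [(50, 150)]
value = jaccard_index(a, b)  # 50 shared positions out of 150 covered
```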
Bootstrap decorator
The module provides a bootstrap decorator that turns a function producing
a statistic into one that also produces a p-value. The bootstrapping resamples
the intervals of the second argument for each chromosome independently. The only
required argument to bootstrap is the name of the chromosome assembly used.
This example does this for the provided jaccard statistic:
@bootstrap('hg19', samples=1000)
def jaccard_test(query, annot):
    return jaccard(query, annot)

jaccard_stat, p_value = jaccard_test(intervals, other_intervals)
The decorator works on any function that takes two sets of intervals.
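The resampling idea can be sketched for a single chromosome as below. This is a simplified illustration, not the bootstrap decorator itself: every name here is hypothetical, the chromosome size is passed in directly rather than looked up from an assembly name, intervals are placed uniformly at random, and the add-one correction in the p-value is an assumption of this sketch.

```python
import random

def overlap_bases(query, annot):
    """Toy statistic: positions shared by two lists of (start, end) tuples."""
    return sum(max(0, min(qe, ae) - max(qs, as_))
               for qs, qe in query for as_, ae in annot)

def resample(intervals, chrom_size, rng):
    """Place each interval at a random position, keeping its length."""
    placed = []
    for start, end in intervals:
        length = end - start
        new_start = rng.randrange(0, chrom_size - length + 1)
        placed.append((new_start, new_start + length))
    return placed

def bootstrap_p_value(stat, query, annot, chrom_size, samples=1000, seed=0):
    rng = random.Random(seed)
    observed = stat(query, annot)
    # Count resampled statistics at least as large as the observed one.
    hits = sum(stat(query, resample(annot, chrom_size, rng)) >= observed
               for _ in range(samples))
    # Add one to numerator and denominator so the p-value is never exactly 0.
    return observed, (hits + 1) / (samples + 1)

query = [(100, 200)]
annot = [(120, 180)]
observed, p_value = bootstrap_p_value(overlap_bases, query, annot, chrom_size=10_000)
```

Note that this toy `resample` does not prevent resampled intervals from overlapping each other, which a careful implementation would need to handle.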
Ready-made tests
proximity_test tests whether intervals in one set are significantly
proximal to intervals in another set.