Welcome to genominterv¶

genominterv provides support for working with intervals on genomes. A genomic interval is specified as a chromosome, start, and end. Intervals are half-open, so a value x is included in an interval (start, end) if start <= x and x < end. All functions take pandas.DataFrames as arguments. These data frames must include chrom, start, and end columns.
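The half-open convention can be illustrated with a minimal sketch in plain Python. The helper name contains is hypothetical and not part of the library; it only restates the membership rule above.

```python
def contains(start, end, x):
    """Return True if x lies in the half-open interval from start to end."""
    return start <= x < end


# The start coordinate is included, the end coordinate is not.
print(contains(100, 200, 100))  # True
print(contains(100, 200, 199))  # True
print(contains(100, 200, 200))  # False
```

One consequence of this convention is that the length of an interval is end - start, and two intervals where one ends at 200 and the next starts at 200 do not overlap.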

See the library reference for detailed documentation of each function and decorator.

Set operations¶

The three functions:

interval_diff
interval_intersect
interval_union

perform the standard difference, intersection, and union set operations on two sets of genomic intervals. The intervals returned from all three functions are collapsed to produce non-overlapping intervals. The genomic intervals in each input set must be non-overlapping. This can be achieved using the function:

interval_collapse

which produces the union of the genomic intervals in a single set of genomic intervals.
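To make the semantics of these operations concrete, here is a minimal sketch on plain lists of half-open (start, end) tuples for a single chromosome. The function names collapse, union, intersect, and diff are hypothetical and this is not the library's implementation; in particular, merging intervals that merely touch is an assumption of the sketch.

```python
def collapse(intervals):
    """Merge overlapping (or touching) half-open (start, end) tuples."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def union(a, b):
    """Everything covered by a or b."""
    return collapse(list(a) + list(b))


def intersect(a, b):
    """Everything covered by both a and b."""
    a, b = collapse(a), collapse(b)
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            result.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


def diff(a, b):
    """Everything covered by a but not by b."""
    b = collapse(b)
    result = []
    for start, end in collapse(a):
        pos = start  # first position of a not yet accounted for
        for b_start, b_end in b:
            if b_end <= pos:
                continue  # this part of b lies before the remaining piece
            if b_start >= end:
                break  # this and all later parts of b lie after it
            if b_start > pos:
                result.append((pos, b_start))
            pos = max(pos, b_end)
        if pos < end:
            result.append((pos, end))
    return result


a = [(10, 20), (30, 40)]
b = [(15, 35)]
print(union(a, b))      # [(10, 40)]
print(intersect(a, b))  # [(15, 20), (30, 35)]
print(diff(a, b))       # [(10, 15), (35, 40)]
```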

Genomic decorator¶

To make it easy to create other interval functions that work across chromosomes, the module provides a genomic decorator that can be applied to functions that operate on lists of (start, end) tuples. Applying the decorator changes the signature of a function so that it operates on DataFrames that include chrom, start, and end columns. Here is an example function that shifts intervals by 1000bp:

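import pandas
from genominterv import genomic

@genomic
def interval_shift(tuples):
    # Shift every (start, end) tuple by 1000bp.
    return [(x+1000, y+1000) for (x, y) in tuples]

# Illustrative data: the data frame must include chrom, start, and end columns.
df = pandas.DataFrame({'chrom': ['chr1', 'chr1'], 'start': [100, 500], 'end': [200, 600]})

shifted = interval_shift(df)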
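The idea behind such a decorator can be sketched without pandas. The following is a simplified stand-in under stated assumptions, not the library's implementation: the name per_chromosome is hypothetical, and it works on (chrom, start, end) tuples rather than DataFrames. It groups the records by chromosome, applies the wrapped function to each group, and reattaches the chromosome name.

```python
from collections import defaultdict
from functools import wraps


def per_chromosome(func):
    """Hypothetical stand-in for a decorator like genomic.

    Wraps a function on (start, end) tuples so that it accepts
    (chrom, start, end) records and runs on each chromosome separately.
    """
    @wraps(func)
    def wrapper(records):
        by_chrom = defaultdict(list)
        for chrom, start, end in records:
            by_chrom[chrom].append((start, end))
        result = []
        for chrom in sorted(by_chrom):
            # The wrapped function only ever sees one chromosome at a time.
            for start, end in func(sorted(by_chrom[chrom])):
                result.append((chrom, start, end))
        return result
    return wrapper


@per_chromosome
def interval_shift(tuples):
    return [(x + 1000, y + 1000) for (x, y) in tuples]


records = [("chr2", 500, 700), ("chr1", 100, 200), ("chr1", 300, 400)]
shifted = interval_shift(records)
print(shifted)
# [('chr1', 1100, 1200), ('chr1', 1300, 1400), ('chr2', 1500, 1700)]
```

Keeping the per-chromosome bookkeeping in the decorator means the wrapped function can stay a short function on plain tuples.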

Remapping functions¶

The function interval_distance converts coordinates of one set of genomic intervals into distances to the closest interval in a second set. interval_relative_distance does the same but returns relative distances.
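As a rough illustration of what "distance to the closest interval" can mean, the sketch below computes, for each query interval, the size of the gap to the nearest interval in a second set on the same chromosome. The function name distance_to_closest is hypothetical, and the conventions used here (unsigned distances, zero for overlapping or touching intervals) are assumptions; the library's own functions may define distances differently.

```python
def distance_to_closest(query, annot):
    """Gap from each query interval to the closest annotation interval.

    Intervals are half-open (start, end) tuples on the same chromosome.
    Overlapping or touching intervals get distance 0.
    Returns None for a query interval if annot is empty.
    """
    distances = []
    for q_start, q_end in query:
        best = None
        for a_start, a_end in annot:
            if q_end <= a_start:
                gap = a_start - q_end  # annotation lies after the query
            elif a_end <= q_start:
                gap = q_start - a_end  # annotation lies before the query
            else:
                gap = 0                # the two intervals overlap
            if best is None or gap < best:
                best = gap
        distances.append(best)
    return distances


query = [(100, 200), (1000, 1100)]
annot = [(250, 300), (600, 900)]
print(distance_to_closest(query, annot))  # [50, 100]
```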

Two-set statistics¶

The module also provides two statistics for relations between sets: jaccard computes the Jaccard index statistic for two sets of genomic intervals.
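The Jaccard index of two sets is the size of their intersection divided by the size of their union. A minimal sketch for interval sets on one chromosome follows, assuming that size is measured in covered bases and that each input set is non-overlapping. The name jaccard_index is hypothetical and is not the library's jaccard function.

```python
def total_length(intervals):
    """Number of bases covered by a set of non-overlapping intervals."""
    return sum(end - start for start, end in intervals)


def overlap_length(a, b):
    """Number of bases covered by both sets (each set non-overlapping)."""
    return sum(
        max(0, min(a_end, b_end) - max(a_start, b_start))
        for a_start, a_end in a
        for b_start, b_end in b
    )


def jaccard_index(a, b):
    """Shared bases divided by bases covered by either set."""
    shared = overlap_length(a, b)
    # Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|.
    either = total_length(a) + total_length(b) - shared
    return shared / either if either else 0.0


a = [(0, 100)]
b = [(50, 150)]
print(jaccard_index(a, b))  # 50 shared bases out of 150 covered
```

The statistic is 0 when the sets share no bases and 1 when they cover exactly the same bases.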

Bootstrap decorator¶

The module provides a bootstrap decorator that turns a function producing a statistic into one that also produces a p-value. The bootstrapping resamples the intervals of the second argument for each chromosome independently. The only required argument to bootstrap is the name of the chromosome assembly used.

This example does this for the provided jaccard statistic:

@bootstrap('hg19', samples=1000)
def jaccard_test(query, annot):
    return jaccard(query, annot)

jaccard_stat, p_value = jaccard_test(intervals, other_intervals)

The decorator works on any function that takes two sets of intervals.
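The general pattern of such a decorator can be sketched as follows. This is a simplified illustration under several assumptions, not the library's implementation: the name bootstrap_sketch is hypothetical, it handles a single chromosome whose length is passed in directly instead of being looked up from an assembly name, intervals are assumed to be no longer than the chromosome, resampled intervals are allowed to overlap each other, and the p-value is one-sided (larger statistics count as more extreme).

```python
import random
from functools import wraps


def bootstrap_sketch(chrom_size, samples=1000, seed=None):
    """Hypothetical decorator adding a resampling p-value to a statistic."""
    def decorator(func):
        @wraps(func)
        def wrapper(query, annot):
            rng = random.Random(seed)
            observed = func(query, annot)
            at_least_as_large = 0
            for _ in range(samples):
                # Place every interval of the second set at a random
                # position on the chromosome, keeping its length.
                resampled = []
                for start, end in annot:
                    length = end - start
                    new_start = rng.randint(0, chrom_size - length)
                    resampled.append((new_start, new_start + length))
                if func(query, resampled) >= observed:
                    at_least_as_large += 1
            # The add-one correction keeps the p-value above zero.
            p_value = (at_least_as_large + 1) / (samples + 1)
            return observed, p_value
        return wrapper
    return decorator


@bootstrap_sketch(chrom_size=10_000, samples=200, seed=1)
def overlap_test(query, annot):
    # Example statistic: number of bases shared by the two sets.
    return sum(
        max(0, min(q_end, a_end) - max(q_start, a_start))
        for q_start, q_end in query
        for a_start, a_end in annot
    )


stat, p_value = overlap_test([(1000, 2000)], [(1000, 2000)])
print(stat, p_value)
```

The decorated function returns the observed statistic together with the fraction of resampled data sets that score at least as high, which mirrors the (statistic, p-value) pair returned in the example above.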

Ready-made tests¶

proximity_test tests whether intervals in one set are significantly proximal to intervals in another set.