DNA is an IUPAC valid sequence of non-degenerate DNA nucleotides.
For the purposes of the tutorial we will assume single nucleotide sequences.
>>> from nucleic import DNA >>> DNA("A").is_purine() True
Creating Variant Alleles¶
>>> DNA("A").to("C") Variant(ref=DNA("A"), alt=DNA("C"), context=DNA("A"))
By default, the context of the variant is assigned to the reference base, although a larger context can be set. The context must be symmetrical in length about the base substitution otherwise an error will be raised.
>>> DNA("A").to("C").within("TAG") Variant(ref=DNA("A"), alt=DNA("C"), context=DNA("TAG"))
Unless the chemical process for the base substitution is known, it is useful to represent all base substitutions in a canonical form, with either a purine or pyrimidine as the reference base.
>>> DNA("A").to("C").within("TAG").with_pyrimidine_ref() Variant(ref=DNA("T"), alt=DNA("G"), context=DNA("CTA"))
A complete example showing the creation of a notation-normalized
Variant from strings only:
>>> ref, alt, context = DNA("A"), DNA("C"), DNA("TAG") >>> snv = ref.to(alt).within(context).with_pyrimidine_ref() >>> snv.is_transversion() True
Variant has a color associated with it for a uniform color palette.
>>> snv.color_stratton() '#EDBFC2'
Single Nucleotide Variant Spectrums¶
SnvSpectrum can be initialized by specifying the size of the local context and the reference notation.
>>> from nucleic import SnvSpectrum, Notation >>> spectrum = SnvSpectrum(k=3, notation=Notation.pyrimidine) >>> spectrum SnvSpectrum(k=3, notation=Notation.pyrimidine)
Record observations by accessing the
SnvSpectrum like a Python dictionary.
spectrum[snv] += 2
Note: this is shorthand for
spectrum.counts[snv] += 2.
>>> vector = [6, 5, 2, 2, 3, 8] >>> # SnvSpectrum.from_iterable(vector, k=1, notation=Notation.pyrimidine).counts
Working with Probability¶
Many spectra are produced from whole-genome or whole-exome sequencing experiments. Spectra must be normalized to the _kmer_ frequencies in the target study. Without normalization, no valid spectrum comparison can be made between data generated from different target territories or species.
By default each
nucleic.Variant is given a weight of 1 and calling
nucleic.SnvSpectrum.mass_as_array() will simply give the proportion of
nucleic.Variant counts in the
After weights are set to the observed k-mer counts or frequency of the target territory, calling
SnvSpectrum.mass() will compute a true normalized probability mass.
All weights can be set with assignment e.g.:
spectrum.context_weights["ACA"] = 23420.
>>> # spectrum.mass()
k-mer counts can be found with
skbio.DNA.kmer_frequencies() for large targets.
Fetching COSMIC Signatures¶
Download the published COSMIC signatures of mutational processes in human cancer:
>>> from nucleic import fetch_cosmic_signatures >>> signatures = fetch_cosmic_signatures()
k=3 in either
purine reference notation can be plotted using a style that was first used in Alexandrov et. al. in 2013 (PMID: 23945592). Both
nucleic.Variant raw counts (
kind="count") or their probabilities (
kind="mass") can be plotted.
The figure and axes are returned to allow for custom formatting.
from nucleic.figures import plot_stratton_spectrum cosmic_signatures = fetch_cosmic_signatures() fig, (ax_main, ax_cbar) = plot_stratton_spectrum(cosmic_signatures["Signature 1"], kind="mass") fig, (ax_main, ax_cbar) = plot_stratton_spectrum(cosmic_signatures["Signature 14"], kind="mass")