CNA-HMMer Matlab Toolbox version 0.1 - Software for DNA copy number alteration (CNA) detection from array comparative genomic hybridization (aCGH) data

CNA-HMMer Matlab Toolbox - Software for DNA copy number alteration (CNA) detection from array comparative genomic hybridization (aCGH) data

Author: Sohrab P. Shah, UBC Department of Computer Science

Contact: sshah@cs.ubc.ca

Date: June 19, 2007

About CNA-HMMer

This document supports the CNA-HMMer ("CNA hammer") Matlab toolbox version 0.1: software to detect copy number alterations from array CGH data. The toolbox has two main funtions: it contains software that implements a robust hidden Markov Model (HMM) for aCGH that is an extension of the work described in:

S P Shah, X Xuan, R J DeLeeuw, M Khojasteh, W L Lam, R Ng, and K P Murphy. Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics, 22(14):431–439, Jul 2006. [PDF]

AND the hierarchical HMM described in:

S P Shah, W L Lam, R Ng, and K P Murphy. Modeling recurrent DNA copy number alterations in array CGH data. Bioinformatics, In press

Topics in this document: License | Installation | Single sample analysis | Multiple sample analysis

LICENSE

This software is released under the GNU General Public License. Please read the terms of this license before using this software. All source code is copyright Sohrab Shah 2007.

INSTALLATION

SYSTEM REQUIREMENTS

To run this software you must have the following:

DOWNLOAD

To install this software, download the following file from: http://www.cs.ubc.ca/~sshah/acgh/CNA-HMMer-v0.1.zip and unpack it into a directory, which we will refer to as "<installpath>/CNA-HMMer-v0.1".

SETUP

  1. Start Matlab
  2. At the command prompt type: addpath(genpath('<installpath>/CNA-HMMer-v0.1'))
  3. To avoid having to do this every time, do the following:
    1. Select File->Set Path

    2. Click on "Add with Subfolders..."

    3. Browse to <installpath>/CNA-HMMer-v0.1
    4. Click "Save"

Now you are ready to use the software. See the USAGE section below for how to use the key functions and for useful example scripts.

USAGE

Robust HMM for single sample analysis

The function that implements the robust HMM from Shah et al (2006) is called hmmr.m . This is located in the 'hmm' directory. To see the documentation for this function, type help hmmr at the Matlab prompt.

DEMO FOR HMMR

Single file analysis: A wrapper script to run the hmmr function using a datafile as input is given in demo/singleSampleHMMR.m. To run this script, type the following:

at the Matlab prompt. You should see a series of plots, one per chromosome that look like this:

Shown above is the output of the HMM-R for chromosome 1 of mantle cell lymphoma cell line HBL2. The black dots are the input log ratios, the red curve indicates 1-probability of loss, the green curve indicates probability of gain.


Batch analysis: The script demo/batchSingleSampleHMMR.m gives and example of how to run the algorithm over several input files that are located in a single directory. The script runs the hmmr over each input file, producing one output file for each input file (see format below). To run this script, type the following:

at the Matlab prompt.

Frequency analysis: Alternatively, you may wish to perform a frequency analysis to assess the frequency of alterations of each clone across samples. The script demo/frequencyAnalysis.m is designed to do this. This script first generates a scaffold - the non-redundant union of BACs across all of the input files. For each input file, an output file in the format discussed below is generated using the scaffold information for the first 4 columns. As such all of the output files will have the same set of BACs. In addition to the file-specific output files, the frequency analysis produces a composite frequency profile output file (composite_frequency_profile.out) that reports the frequency of a particular clone being called as loss, neutral, or gain by hmmr. The composite frequency profile is also in SeeGH format. You can import any one data file and visualise the results using the probability viewer with the composite frequency file.

To run this script, type the following:

at the Matlab prompt.

Setting HMMR parameters

The scripts above use automatic parameter setting functions to set the hyperparameters of the model. The hyperparameters are the parameters of the conjugate prior distributions on HMM model parameters. The remaining free parameter is the number of states, K in the HMM. You may, of course wish to set these parameters by hand if you have previous knowledge of what they should be. Below is a brief description of all the arguments to the hmmr function:

INPUT FORMAT FOR HMMR

The input should be a 10-column tab delimited text file, such as demo/single/hbl2.txt with the following columns:

Clone ID Clone Name Accession Chromosome Number Clone start position (bp) Clone end position (bp) Log ratio Standard Deviation SNR - channel 1 SNR - channel 2

OUTPUT FORMAT

The output format is designed to be readable by SeeGH. v0.3 It is generated by io/writeResultsSeeGH.m and is a tab delimited text file (for example demos/single/hbl2.txt.out) with the following columns:

Clone Name Chromosome Number Clone start position (bp) Clone end position (bp) probability loss probability neutral probability gain label assignment

The probabilities represent the posterior probability that a particular clone is in a given state (loss, neutral, gain) as computed by the HMM. For details consult the citation listed above.


The label assignment column is calculated using a probability threshold (as an argument to writeResultsSeeGH) to call losses as -1, gains as +1. Neutral probes are 0. The threshold is hard coded at 0.75 in demo/singleSampleHMMR.m, line 27. The higher the threshold, the more conservative the call, therefore the false negative rate increases with the threshold. The lower the threshold the more liberal the call, and therefore false positive rate increases with decreasing threshold.

Hierarchical HMM for joint analysis of multiple samples of array CGH data

The function that implements the hierarchical HMM from Shah et al (2007) is called hhmm and is located in hmm/hhmm.m. Type help hhmm for documentation on this function.

DEMO for HHMM

There are 2 demo scripts that provide reasonable examples of how to use the hhmm function. We have provided sample synthetic data in the file demo/multi/rep1snr09frac3len3.txt. This data file contains 8 samples of one aritificially generated chromosome that has 672 probes. There are 2 embedded recurrent alterations of width 50 probes located at probe 100 (loss) and probe 400 (gain). The recurrent alterations are shared by 75% (or 6) of the samples. This represents example data used in experiments reported in Shah et al (2007). To run the HHMM on this data, type the following:

at the Matlab prompt. This executes the script located in demo/multiSampleHHMMSynthetic.m. You should see output that looks somewhat like this (on the right):

Shown above is the input (left) of the synthetic data showing the recurrent loss flanked by red bars and the recurrentgain flanked by green bars. The output of the HHMM is shown on the right. The red curve indicates probability of recurrent loss, the green curve indicates probability of recurrent gain.

We also provide an example script for the lung cancer data reported in Shah et al (2007). The script demo/runHhmmSclcGroup1.m runs the HHMM on the non-small cell adenocarcinoma group whose samples are located in the first 18 log ratio columns of demo/multi/dataonly.txt.

To run this script, type the following:

at the Matlab prompt.

Setting HHMM parameters

The key parameter to set in HHMM is the epsilon parameter. This is an argument to the function hhmm. We typically set epsilon=0.8. Lower values of epsilon will result in fewer recurrent CNAs predicted (or a sparser profile), higher values will result in more recurrent CNAs predicted (or a less sparse profile).

The hyperparameters for the HHMM are set similarly to the HMMR (as described above).

INPUT FORMAT FOR HHMM

The input datafile should be a tab delimited text file, such as demo/multi/rep1snr09frac3len3.txt with the following columns:

Chromosome Number Clone start position (bp) Clone end position (bp) Log ratio sample 1 Log ratio sample 2 Log ratio sample ...

There should be one log ratio column for each sample. In addition, to produce a SeeGH formatted output file, the HHMM requires a single column text file containing the clone names that correspond to the rows in the datafile. See demo\multi\clonenames_synth.txt or demo\multi\clonesnames_sclc.txtfor examples. To create the necessary files from a set of single sample files, for example the three files included in demo/single, you can use the function makeMultisampleDataFile. Type help makeMultiSampleDataFile for documentation on this function. NOTE: if you create your datafiles with this function, you must use the corresponding import function importMultiSampleData as in the demo script demo/frequencyAnalysis.m, lines 14-16. Be sure that all the *.txt files in your data directory are in the input format as described above for HMMR.

OUTPUT FORMAT

As for HHMR, the output format for HHMM is designed to be readable by SeeGH v0.3. The probabilities in this case represent the posterior probability that the clone has a recurrent pattern for a given state (loss, neutral, gain) as computed by the HMM. For details consult Shah et al (2007).


Questions/Comments/Bugs

Please email sshah@cs.ubc.ca for any questions or problems.