Contact: sshah@cs.ubc.ca
Date: June 19, 2007
This document supports the CNA-HMMer ("CNA hammer") Matlab toolbox version 0.1: software to detect copy number alterations from array CGH data. The toolbox has two main funtions: it contains software that implements a robust hidden Markov Model (HMM) for aCGH that is an extension of the work described in:
AND the hierarchical HMM described in:
Topics in this document: License | Installation | Single sample analysis | Multiple sample analysis
This software is released under the GNU General Public License. Please read the terms of this license before using this software. All source code is copyright Sohrab Shah 2007.
To install this software, download the following file from: http://www.cs.ubc.ca/~sshah/acgh/CNA-HMMer-v0.1.zip and unpack it into a directory, which we will refer to as "<installpath>/CNA-HMMer-v0.1".
Now you are ready to use the software. See the USAGE section below for how to use the key functions and for useful example scripts.
The function that implements the robust HMM from Shah et al (2006) is called hmmr.m . This is located in the 'hmm' directory. To see the documentation for this function, type help hmmr at the Matlab prompt.
Single file analysis:
A wrapper script to run the hmmr function using a datafile as input is given in
demo/singleSampleHMMR.m.
To run this script, type the following:
at the Matlab prompt. You should see a series of plots, one per chromosome that look like this:
|
|
Shown above is the output of the HMM-R for chromosome 1 of mantle cell lymphoma cell line HBL2. The black dots are the input log ratios, the red curve indicates 1-probability of loss, the green curve indicates probability of gain. |
Batch analysis:
The script demo/batchSingleSampleHMMR.m gives and example
of how to run the algorithm over several input files that are located in a
single directory. The script runs the hmmr over each input file, producing one output file for
each input file (see format below). To run this script, type the following:
at the Matlab prompt.
Frequency analysis: Alternatively, you may wish to perform a frequency analysis to assess the frequency of alterations of each clone across samples. The script demo/frequencyAnalysis.m is designed to do this. This script first generates a scaffold - the non-redundant union of BACs across all of the input files. For each input file, an output file in the format discussed below is generated using the scaffold information for the first 4 columns. As such all of the output files will have the same set of BACs. In addition to the file-specific output files, the frequency analysis produces a composite frequency profile output file (composite_frequency_profile.out) that reports the frequency of a particular clone being called as loss, neutral, or gain by hmmr. The composite frequency profile is also in SeeGH format. You can import any one data file and visualise the results using the probability viewer with the composite frequency file.
To run this script, type the following:
at the Matlab prompt.
The scripts above use automatic parameter setting functions to set the hyperparameters of the model. The hyperparameters are the parameters of the conjugate prior distributions on HMM model parameters. The remaining free parameter is the number of states, K in the HMM. You may, of course wish to set these parameters by hand if you have previous knowledge of what they should be. Below is a brief description of all the arguments to the hmmr function:
The input should be a 10-column tab delimited text file, such as demo/single/hbl2.txt with the following columns:
| Clone ID | Clone Name | Accession | Chromosome Number | Clone start position (bp) | Clone end position (bp) | Log ratio | Standard Deviation | SNR - channel 1 | SNR - channel 2 |
The output format is designed to be readable by SeeGH. v0.3 It is generated by io/writeResultsSeeGH.m and is a tab delimited text file (for example demos/single/hbl2.txt.out) with the following columns:
| Clone Name | Chromosome Number | Clone start position (bp) | Clone end position (bp) | probability loss | probability neutral | probability gain | label assignment |
The probabilities represent the posterior probability that a particular clone is in a given state (loss, neutral, gain) as computed by the HMM. For details consult the citation listed above.
The label assignment column is calculated using a probability threshold (as an argument to writeResultsSeeGH) to call losses as -1, gains as +1. Neutral probes are 0. The threshold is hard coded at 0.75 in demo/singleSampleHMMR.m, line 27. The higher the threshold, the more conservative the call, therefore the false negative rate increases with the threshold. The lower the threshold the more liberal the call, and therefore false positive rate increases with decreasing threshold.
The function that implements the hierarchical HMM from Shah et al (2007) is called hhmm and is located in hmm/hhmm.m. Type help hhmm for documentation on this function.
There are 2 demo scripts that provide reasonable examples of how to use the hhmm function. We have provided sample synthetic data in the file demo/multi/rep1snr09frac3len3.txt. This data file contains 8 samples of one aritificially generated chromosome that has 672 probes. There are 2 embedded recurrent alterations of width 50 probes located at probe 100 (loss) and probe 400 (gain). The recurrent alterations are shared by 75% (or 6) of the samples. This represents example data used in experiments reported in Shah et al (2007). To run the HHMM on this data, type the following:
|
|
|
Shown above is the input (left) of the synthetic data showing the recurrent loss flanked by red bars and the recurrentgain flanked by green bars. The output of the HHMM is shown on the right. The red curve indicates probability of recurrent loss, the green curve indicates probability of recurrent gain. |
|
We also provide an example script for the lung cancer data reported in Shah et al (2007). The script demo/runHhmmSclcGroup1.m runs the HHMM on the non-small cell adenocarcinoma group whose samples are located in the first 18 log ratio columns of demo/multi/dataonly.txt.
To run this script, type the following:
at the Matlab prompt.
The key parameter to set in HHMM is the epsilon parameter. This is an argument to the function hhmm. We typically set epsilon=0.8. Lower values of epsilon will result in fewer recurrent CNAs predicted (or a sparser profile), higher values will result in more recurrent CNAs predicted (or a less sparse profile).
The hyperparameters for the HHMM are set similarly to the HMMR (as described above).
The input datafile should be a tab delimited text file, such as demo/multi/rep1snr09frac3len3.txt with the following columns:
| Chromosome Number | Clone start position (bp) | Clone end position (bp) | Log ratio sample 1 | Log ratio sample 2 | Log ratio sample ... |
There should be one log ratio column for each sample. In addition, to produce a SeeGH formatted output file, the HHMM requires a single column text file containing the clone names that correspond to the rows in the datafile. See demo\multi\clonenames_synth.txt or demo\multi\clonesnames_sclc.txtfor examples. To create the necessary files from a set of single sample files, for example the three files included in demo/single, you can use the function makeMultisampleDataFile. Type help makeMultiSampleDataFile for documentation on this function. NOTE: if you create your datafiles with this function, you must use the corresponding import function importMultiSampleData as in the demo script demo/frequencyAnalysis.m, lines 14-16. Be sure that all the *.txt files in your data directory are in the input format as described above for HMMR.
As for HHMR, the output format for HHMM is designed to be readable by SeeGH v0.3. The probabilities in this case represent the posterior probability that the clone has a recurrent pattern for a given state (loss, neutral, gain) as computed by the HMM. For details consult Shah et al (2007).
Please email sshah@cs.ubc.ca for any questions or problems.