|
If you:
A) use Matlab on Windows or Linux or Mac OS X and want fast
(exact, general-case) normalized correlation (NCC) code right now, then
download it and enjoy the large performance gain
over Matlab's normxcorr2 (demo included). For the brave (and/or masochistic) there is B),
however, it is unlikely you will do better than the above. Thanks to Liang Jin for making the OS X binary
(his performance comparison graph on a PowerPC).
If you are an A) but use Unix/Matlab,
let me know and we'll get something working. If you happen to be doing research/prototyping work in computer vision
and aren't using Matlab (or Octave, or insert-high-level-interpreted-language-here), then you should probably see the head doctor.
That opinion aside, I would still be happy to help you get it working in your C/C++/Fortran/Java app (anything that can call a static library) -- download the source here.
Return to my main page.
B) don't mind a read, then learn ye shall.
Conclusion: In some special cases you can beat A), in particular if you can tolerate:
- approximate results with a rough error bound (with resulting values still lying in [-1,1]), or
- sparse results (ex1. find only the best template match (highest NCC score), ex2. find correlation
only at locations where the score exceeds a threshold).
Alternatively, if you have a powerful GPU, and constrained requirements
on image/template size, see Point IV below. Otherwise you don't fit in --
do not pass Go, and go directly back to A), or else go buy yourself some fancy
signal processing hardware.
Introduction: If you used a search engine to find this page, chances are you came across J.P. Lewis'
(of Industrial Light & Magic) publication entitled
Fast Normalized Cross-Correlation.
I refer you there for an introduction to the NCC problem -- Lewis' presentation is clear, and
I have no desire to paraphrase. The pertinent sections are: Introduction,
Template Matching by Cross-Correlation, Transform Domain Computation and Normalizing.
I will assume you've read Lewis' document, or at least have a basic understanding of the problem.
So that we're on the same page, here is the NCC formula:
 |
* The denominator should be enclosed by a square root (thanks for pointing that out, Brooks). |
My notation is slightly different than Lewis'; here, I is the image, and T is the template. Assuming an MxM
image, and NxN template, indices (u,v) are valid in {1,..,M-N}, while indices (x,y) run across 1..N in each sum (which are
implicit double-sums). is the mean-subtracted template
( being the template mean). is the local image mean
at location (u,v) -- the mean of the image underneath the template whose top left corner lies on
pixel (u,v).
As Lewis points out, if we were to directly apply this formula, computation of the denominator
dominates the overall cost (by a factor of 2, to be more precise).
By utilizing the integral image trick the cost is greatly reduced.
Here is a brief complexity breakdown of the methods thusfar (MxM image, and NxN template):
- Naive: Denominator ~4(M-N+1)2N2, Numerator 2(M-N+1)2N2
- Integral image: Denominator 4M2 + 8(M-N+1)2, Numerator unchanged
Optional remarks.
As we can see, the denominator costs much less than the numerator when using the integral image trick (except
for absurdly small templates). What does this mean? Well, initially, I thought that computing the
integral image would be quite expensive, but expected I could amortize the cost across several templates
which all needed to be correlated on the same image. Lewis downplays the expense of the numerator
far too much. My intuition proved completely wrong: reusing an integral
image for several templates yields an entirely negligeable improvement in running time -- the dominant cost is the
convolution, which must be carried out anew for each template. The convolution can be slow
even when it's carried out in the frequency domain.
To squeeze the most performance
out of your NCC software, you should use some kind of hybrid between what I provide above, and a frequency-domain
based method, which decides between the two at runtime.
The problem is that such a system would be too hard to develop in general, because of different
machine architectures (even within one brand of cpu). Matlab tries to implement something along these
lines but I do not believe that it selects the cheapest method in general.
Enough with the negativity, and on to the time-saving approximations I alluded to above.
i. Pyramid schemes. This is off the top of my head, but I assume the audience is familiar with image pyramids
already. If not, the idea is to perform matching in a coarse-to-fine ordering. Sub-sample your image and template
by a common factor ("make them smaller") and perform NCC at the new discretization level. Since NCC is
M2N2, it's appreciably faster to apply there. The draw-back is that
high frequency features/information has obviously been discarded, so it may be desirable to return
to the full resolution and do NCC again in some "interesting" image regions. The meaning of "interesting"
depends on your domain, of course. Naturally, you're not restricted to a 2 level pyramid -- the
degree of sub-sampling, and number of levels trade-off speed for accuracy.
My impression: This technique is of questionable value. Sub-sampling tends to affect the image
and template in unpredictable ways (in the general case) -- namely, some regions could correlate more highly,
despite being dissimilar at full resolution; moreover, the target patch(es) being searched for may correlate less at lower
resolutions. If the implementation misses the "interesting" regions at coarse scales, then it may have no chance to
recover at higher ones.
Reference: None. I will laugh if someone has published a paper "contributing" this.
ii. Bounding the Numerator. By employing this technique you lose dense correlation output. You can obtain results
like, "Where is the best match?", or "Where does the correlation score exceed 0.9?". The amount of speed-up is wholly
dependent on the image being correlated against. The idea is to use an upper bound for the double sum in the NCC numerator
that can be computed quickly (viz. by the integral image trick). Given a location in the image, we first compute the
cheap upper bound, and if it lies below some threshold, we do no further processing (as by transitivity, the real
correlation score must also lie below the threshold). Two simple bounds are based on Jensen's inequality and the
Cauchy-Schwarz inequality, which are ubiquitous in real analysis. Both cost approximately the same to compute,
but the Cauchy-Schwarz-based bound is tighter, so it's the one to use. Sounds nice, right? Well, the problem
is that neither bound is particularly tight; in fact, except in fairly limited cases, they yield bounds above 1, and
are consequently worthless (of course the score is bounded above by 1!). Stefano and Mattoccia make it work by
partitioning the computation at each image location into two blocks. In one block, they compute the upper bound function,
and in the other, they compute the full NCC -- the result is then simply summed together. If the sum falls below the
target NCC score threshold (0.95, for example), then the algorithm can safely ignore this pixel, whereas if the sum lies
above the threshold then it must resolve the full NCC there. The fundamental parameter of their system is the relative
size of the two blocks. If the upper-bound block is large, then the first stage is very fast to compute, however it is
more likely to be above the target threshold (read: useless), and require a full NCC computation there.
My impression: For high target thresholds (0.98), the authors get a mean speed-up of ~3 times on their test
set (over naive NCC). The applicability of this technique is set by your problem domain. If, for example, you want
to find the best single match of the template within an image, then you cannot simply set a high target threshold,
because there is no guarentee the best match will exceed this threshold (and so may be missed completely!). Instead,
you could keep a running maximum threshold over scores obtained thusfar, but then performance will be very sensitive
to the input image (it gave a mean speed-up of 2 times on the authors' small test set).
For many applications, a consistent-time algorithm is preferable to one that gives
intermittent, unpredictable speed-up. In my various uses of NCC, I usually desire dense output,
so I've never implemented this algorithm myself.
Reference: L. Di Stefano, S. Mattoccia. A Sufficient Condition based on the Cauchy-Schwarz
Inequality for Efficient Template Matching. ICIP 2003.
iii. Separating the Template.
The idea is simple: decompose the template into a linear combination of 1D templates, via the SVD
(Singular Value Decomposition). Instead of doing one 2D convolution, the image is filtered with multiple
seperable filters. Antonio Torralba (MIT-CSAIL) first suggested this to me. The output of SVD on the template is
T=USVT, where U and V are orthogonal matrices, and S is diagonal containing the positive singular values
in decreasing order. The template can be approximated as the linear combination of the product of U and V columns
weighted by their associated singular value (if this is unclear, I recommend reading up on SVD). Using only k such
columns gives the best rank-k approximation to the original template, under the L2 norm. Since
convolution is a linear transformation, we can approximate it by performing sequential seperable convolutions with
the columns of U & V. Seperable convolution costs 4(M-N+1)2N operations per each rank
we take. A method for deciding to which rank you will approximate the template up to is to simply find the NCC
between the original template, and the rank-k approximation. The rank required to achieve a particular correlation score
depends on the template content. The higher rank approximations are needed to capture higher spatial frequencies.
Care must be taken to normalize by the correct template-variance-term in the denominator of the NCC formula.
My impression: For large templates, and modest-rank approximations (rank-2, rank-3) there is some speed-up. When
N is small though, this technique doesn't buy you much.
Reference: S. Treitel and J. Shanks. The design of multistage separable planar filters.
IEEE Trans. Geosci. Electron, 9(1):10-27, 1971.
iv. GPU-based approach, or processing on a video card.
Update: (April 4, 2007) Micah Villmow, a student intern at AMD/ATI, has implemented a CTM
(Close-to-metal) algorithm capable of performing NCC very rapidly. Tested on an ATI Radeon
X1K series card using 320x320 images and 12x12 templates, he reports 50 frames per second (FPS), or 20ms/image.
The limiting factor is still memory transfer to/from the video card. If, instead, you had a single
image and many 12x12 filters, you could attain upwards of 2000 FPS (ie. no limiting overhead from memory transfer). To put these numbers into perspective,
the Matlab/Mex implementation I provide above is capable of around 7.2 FPS on an Intel T2400 (Dual Core, 1.83Ghz),
taking around 0.14 seconds per image. Unfortunately, a caveat remains: the maximum filter size of current graphics hardware is 16x16.
I applaud Micah for these results, and hope he will share details on his implementation with us. As GPU technology
continues its rapid ascension in parallel computing power, it may soon become viable (and quite possibly preferable) for larger template
sizes.
Shortly after the introduction of fragment shaders,
people realized they could leverage a GPU's parallelism for CG's inverse problem.
A fragment shader is a short program that is executed on each rendered pixel, in
parallel (a high-end GPU can operate on 8-16 pixels simultaneously). Based on some
vision researchers' success (OpenVIDIA, GPGPU) (with simple convolution,
matching, etc), I thought I may be able to accelerate
NCC on a GPU. Unfortunately, at the time of this writing (May, 2005) I have
been unable to do so, and believe it to be impossible (though, I don't possess
a comprehensive knowledge of GPUs, so I would love for someone to prove me wrong!).
There are two problems: 1. slow vram read-back 2. shader program length. Problem 1
stems from an asymmetry in read/write speed between main memory (AGP) and video memory --
it is fast to put your image/template into video memory, but slow to read the result back.
It does have a simple solution though -- buy a PCI-X video card, which is symmetric for
read/write speed. On the other hand, I don't forsee the second problem being solved any
time soon. Fixed-kernel (with small spatial support) convolution is fast. For example,
you can do Sobel edge detection at lighting speed. Variable-kernel convolution is slower, but
still feasible as long as the spatial support is small -- it doesn't scale to larger kernels,
however. Shader programs need to be compiled ahead of time, and do not include branching/looping
constructs (except for the very high end nVidia cards, but doubly nested for statements are slow!!)
so each step of the convolution requires its own instruction (remember, this program operates
on a single pixel -- so for a 10x10 template, that's 200 multiply/accumulate instructions plus a
normalizing divide). The maximum number of instructions per program is around 512 (it's
highly dependant on the card, however) so you'll hit this ceiling for templates
of around 17x17, while doing convolution -- the NCC operation requires many more instructions
than plain convolution (to compute the correlation coefficient).
I would be happy to continue blathering about my experience, but
my verdict here is that you can't do NCC on a GPU, let alone large-spatial-support convolution.
The general-purpose GPU programmers' message boards ignored my posts on this topic, so if the
reader can comment on my reasoning or refute my condemning assertions, please do!
|