Why doesn't FastSLAM Scale?

home contact publications cv software
blog koan simra.net
In theory running an RBPF with the FastSLAM data structure should yield observation update costs around O(NKlog(M)), where N is the number of samples, K is the number of observed landmarks in the current observation and M is the number of landmarks in the map. This means that, for any given data set, the performance of the filter should scale linearly (since K and M will behave similarly for different N).

Figure 1


Figure 2


Figure 3

The truth, however, is somewhat different. Figure 1 illustrates the performance (measured in average ms per image) of our image-based RBPF/FastSLAM filter, as we varied N. The most important feature of this log-log plot is that at large N the slope of the graph is > 1.0, indicating superlinear cost increases. The question is, why?

My first inclination was that there was a problem in our code, and I audited it very carefully to remove all the bottlenecks I could find. I'm 99% confident it's not the code. My second hypothesis was that the performance degradation had something to do with the shape of the probability distribution over maps and that resampling was occuring at a different frequency, or that the number of landmarks in the FastSLAM tree affected by resampling was changing at a rate faster than O(N).

Neither of these hypotheses are very satisfying, so I started looking for another source of the problem by profiling all the operations in the filter. The filter basically performs these steps at each observation update:

  1. Take an image. O(1)
  2. Extract SIFT features and match them with kd-tree. O(Klog(S)), K is the number of features in the image, and S is the number of SIFT features in the kd-tree. Note this operation is independent of N.
  3. Update each map using the data assoc. O(NKlog(M)).
  4. Possibly resample after weight updating. O(??). My best educated guess is that this operation is O(NMlogM). With adaptive resampling, usually about 0.5*N samples are removed, each with a map of M landmarks.

It is important to note here that steps 1 and 2 are constant time with respect to N. In other words, if we run multiple trials using the same data set but varying N, we should see approximately the same profiling numbers for these steps...

... but we don't. Figure 2 plots the average cost of running steps 1-2 over a set of 8500 frames, averaging over a sliding window of 100 frames. While the theory dictates that all of these curves should be coincident, they're not. Figure 3 further illustrates this by plotting the mean measured cost of steps 1 and 2 versus N.

What's going on here? The basic answer is that this is a classic example of memory fragmentation and a violation of the principle of locality in memory management. While for any given frame K and S are identical for all N, *where* the S landmarks are located in memory changes. At each frame, K SIFT features are added to the kd-tree, and then NK landmarks are added to the FastSLAM tree. This means that over time each set of K landmarks in the kd-tree will be separated further in memory from the K landmarks in the previous frame and the K landmarks in the next frame. The result is that at data association time, the CPU cache miss frequency will increase (and for large M and N, the page fault frequency will come into play as well). This same reasoning can be extended to the cost of map updating and resampling (Steps 3 and 4). In fact, resampling will probably make matters worse as it introduces internal fragmentation into the tree. I haven't figured out how to measure the cache miss frequency, so I don't have any numbers to prove it. But in light of Figures 2 and 3, I think we can reasonably conclude that fragmentation is the main issue affecting performance.

What can be done? There are three possible solutions to this problem. First, we can pre-allocate space for the SIFT features, reducing fragmentation in the data association, but this only fixes data association costs and doesn't help with map updating. Second, we can periodically consolidate the FastSLAM tree, but it's not clear how this can be done in any optimal way, as the tree quickly becomes a tangled web of references. Third, we can wait. As long as Moore's Law is in effect (or its equivalent for memory bandwidth and latency), the bottleneck in scaling N will not be an issue. As it stands, we can currently run about 2000 samples at 1Hz frame updates (Figure 1 is a bit misleading because it counts skipped frames), which is enough samples for many mapping applications.

It should be noted that this analysis applies similarly to *any* O(N) time/space algorithm, especially those where memory accesses are distributed significantly throughout the data. This is not a new phenomenon and when I get a chance I'll add some references for further reading.


Robert Sim, Last modified: 15 Mar 2006