Distinctive Image Features from Scale-Invariant Keypoints

Pope and Lowe (2000) used features based on the hierarchical grouping of image contours,
which are particularly useful for objects lacking detailed texture.
The history of research on visual recognition contains work on a diverse set of other
image properties that can be used as feature measurements. Carneiro and Jepson (2002)
describe phase-based local features that represent the phase rather than the magnitude of local
spatial frequencies, which is likely to provide improved invariance to illumination. Schiele
and Crowley (2000) have proposed the use of multidimensional histograms summarizing the
distribution of measurements within image regions. This type of feature may be particularly
useful for recognition of textured objects with deformable shapes. Basri and Jacobs (1997)
have demonstrated the value of extracting local region boundaries for recognition. Other
useful properties to incorporate include color, motion, figure-ground discrimination, region
shape descriptors, and stereo depth cues. The local feature approach can easily incorporate
novel feature types because extra features contribute to robustness when they provide correct
matches, but otherwise do little harm other than their cost of computation. Therefore, future
systems are likely to combine many feature types.
3 Detection of scale-space extrema
As described in the introduction, we will detect keypoints using a cascade filtering approach
that uses efficient algorithms to identify candidate locations that are then examined in further
detail. The first stage of keypoint detection is to identify locations and scales that can be
repeatably assigned under differing views of the same object. Detecting locations that are
invariant to scale change of the image can be accomplished by searching for stable features
across all possible scales, using a continuous function of scale known as scale space (Witkin,
1983).
It has been shown by Koenderink (1984) and Lindeberg (1994) that under a variety of
reasonable assumptions the only possible scale-space kernel is the Gaussian function. There-
fore, the scale space of an image is defined as a function, L(x, y, σ), that is produced from
the convolution of a variable-scale Gaussian, G(x, y, σ), with an input image, I(x, y):
L(x, y,σ) = G(x, y, σ) ∗ I(x, y),
where ∗ is the convolution operation in x and y, and
G(x, y, σ) = (1/(2πσ²)) e^(−(x² + y²)/2σ²).
To efficiently detect stable keypoint locations in scale space, we have proposed (Lowe, 1999)
using scale-space extrema in the difference-of-Gaussian function convolved with the image,
D(x, y, σ), which can be computed from the difference of two nearby scales separated by a
constant multiplicative factor k:
D(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) ∗ I(x, y)
= L(x, y, kσ) − L(x, y, σ). (1)
There are a number of reasons for choosing this function. First, it is a particularly efficient
function to compute, as the smoothed images, L, need to be computed in any case for scale
space feature description, and D can therefore be computed by simple image subtraction.
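In code, D reduces to two blurs and a subtraction. The sketch below is an illustration, not Lowe's implementation; the function name and the random test image are my own, and SciPy's gaussian_filter stands in for the paper's Gaussian convolution:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def difference_of_gaussian(image, sigma, k):
    """D(x, y, sigma) = L(x, y, k*sigma) - L(x, y, sigma):
    blur at two nearby scales and subtract."""
    img = image.astype(np.float64)
    L_lo = gaussian_filter(img, sigma)
    L_hi = gaussian_filter(img, k * sigma)
    return L_hi - L_lo

img = np.random.rand(64, 64)
D = difference_of_gaussian(img, sigma=1.6, k=2 ** (1.0 / 3))
```

In a full pyramid the smoothed images L are kept, so each D costs only one subtraction.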
Figure 1: For each octave of scale space, the initial image is repeatedly convolved with Gaussians to
produce the set of scale space images shown on the left. Adjacent Gaussian images are subtracted
to produce the difference-of-Gaussian images on the right. After each octave, the Gaussian image is
down-sampled by a factor of 2, and the process repeated.
In addition, the difference-of-Gaussian function provides a close approximation to the
scale-normalized Laplacian of Gaussian, σ²∇²G, as studied by Lindeberg (1994). Lindeberg
showed that the normalization of the Laplacian with the factor σ² is required for true scale
invariance. In detailed experimental comparisons, Mikolajczyk (2002) found that the maxima
and minima of σ²∇²G produce the most stable image features compared to a range of other
possible image functions, such as the gradient, Hessian, or Harris corner function.
The relationship between D and σ²∇²G can be understood from the heat diffusion equation
(parameterized in terms of σ rather than the more usual t = σ²):

∂G/∂σ = σ∇²G.

From this, we see that ∇²G can be computed from the finite difference approximation to
∂G/∂σ, using the difference of nearby scales at kσ and σ:

σ∇²G = ∂G/∂σ ≈ (G(x, y, kσ) − G(x, y, σ)) / (kσ − σ)

and therefore,

G(x, y, kσ) − G(x, y, σ) ≈ (k − 1)σ²∇²G.
This shows that when the difference-of-Gaussian function has scales differing by a con-
stant factor it already incorporates the σ² scale normalization required for the scale-invariant
Laplacian. The factor (k − 1) in the equation is a constant over all scales and therefore does
not influence extrema location. The approximation error will go to zero as k goes to 1, but
in practice we have found that the approximation has almost no impact on the stability of
extrema detection or localization for even significant differences in scale, such as k = √2.

Figure 2: Maxima and minima of the difference-of-Gaussian images are detected by comparing a
pixel (marked with X) to its 26 neighbors in 3x3 regions at the current and adjacent scales (marked
with circles).
An efficient approach to construction of D(x, y, σ) is shown in Figure 1. The initial
image is incrementally convolved with Gaussians to produce images separated by a constant
factor k in scale space, shown stacked in the left column. We choose to divide each octave
of scale space (i.e., doubling of σ) into an integer number, s, of intervals, so k = 2^(1/s).
We must produce s + 3 images in the stack of blurred images for each octave, so that final
extrema detection covers a complete octave. Adjacent image scales are subtracted to produce
the difference-of-Gaussian images shown on the right. Once a complete octave has been
processed, we resample the Gaussian image that has twice the initial value of σ (it will be 2
images from the top of the stack) by taking every second pixel in each row and column. The
accuracy of sampling relative to σ is no different than for the start of the previous octave,
while computation is greatly reduced.
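The octave construction above can be sketched as follows. This is an illustrative outline under stated assumptions, not the paper's code: the base image is assumed to already carry blur σ, each level is produced by incremental convolution from the previous one, and the function and parameter names are my own:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def build_pyramid(base, num_octaves=4, s=3, sigma=1.6):
    """Produce s + 3 Gaussian images per octave (k = 2**(1/s)) by
    incremental convolution, subtract adjacent pairs for the DoG
    images, then downsample the image with blur 2*sigma (2 from the
    top of the stack) by taking every second pixel."""
    k = 2.0 ** (1.0 / s)
    gaussians, dogs = [], []
    img = base.astype(np.float64)
    for _ in range(num_octaves):
        octave = [img]
        for i in range(1, s + 3):
            # Incremental blur: total sigma*k**i from sigma*k**(i-1).
            inc = sigma * k ** (i - 1) * np.sqrt(k * k - 1.0)
            octave.append(gaussian_filter(octave[-1], inc))
        gaussians.append(octave)
        dogs.append([octave[i + 1] - octave[i] for i in range(s + 2)])
        # octave[s] has blur 2*sigma, i.e. sigma at the new pixel spacing.
        img = octave[s][::2, ::2]
    return gaussians, dogs
```

With s = 3 this yields 6 Gaussian and 5 DoG images per octave, so extrema detection (which needs one scale above and below) covers a complete octave.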
3.1 Local extrema detection
In order to detect the local maxima and minima of D(x, y, σ), each sample point is compared
to its eight neighbors in the current image and nine neighbors in the scale above and below
(see Figure 2). It is selected only if it is larger than all of these neighbors or smaller than all
of them. The cost of this check is reasonably low due to the fact that most sample points will
be eliminated following the first few checks.
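A direct transcription of this 26-neighbor comparison might look as follows; the early-exit optimization the text alludes to is omitted for clarity, the names are my own, and (x, y) is assumed to lie away from the image border:

```python
import numpy as np

def is_extremum(dogs, o, i, x, y):
    """True if sample (x, y) in DoG image i of octave o is strictly
    larger or strictly smaller than all 26 neighbors in the 3x3x3
    scale-space neighborhood."""
    val = dogs[o][i][x, y]
    cube = np.stack([dogs[o][i + d][x - 1:x + 2, y - 1:y + 2]
                     for d in (-1, 0, 1)])
    # Strict extremum: val must match the max (or min) and be unique.
    return ((val == cube.max() or val == cube.min())
            and (cube == val).sum() == 1)
```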
An important issue is to determine the frequency of sampling in the image and scale do-
mains that is needed to reliably detect the extrema. Unfortunately, it turns out that there is
no minimum spacing of samples that will detect all extrema, as the extrema can be arbitrar-
ily close together. This can be seen by considering a white circle on a black background,
which will have a single scale space maximum where the circular positive central region of
the difference-of-Gaussian function matches the size and location of the circle. For a very
elongated ellipse, there will be two maxima, one near each end of the ellipse. As the locations of
maxima are a continuous function of the image, for some ellipse with intermediate elongation
there will be a transition from a single maximum to two, with the maxima arbitrarily close to
each other near the transition.

Figure 3: The top line of the first graph shows the percent of keypoints that are repeatably detected at
the same location and scale in a transformed image as a function of the number of scales sampled per
octave. The lower line shows the percent of keypoints that have their descriptors correctly matched to
a large database. The second graph shows the total number of keypoints detected in a typical image
as a function of the number of scale samples.
Therefore, we must settle for a solution that trades off efficiency with completeness.
In fact, as might be expected and is confirmed by our experiments, extrema that are close
together are quite unstable to small perturbations of the image. We can determine the best
choices experimentally by studying a range of sampling frequencies and using those that
provide the most reliable results under a realistic simulation of the matching task.
3.2 Frequency of sampling in scale
The experimental determination of sampling frequency that maximizes extrema stability is
shown in Figures 3 and 4. These figures (and most other simulations in this paper) are based
on a matching task using a collection of 32 real images drawn from a diverse range, including
outdoor scenes, human faces, aerial photographs, and industrial images (the image domain
was found to have almost no influence on any of the results). Each image was then subject to a
range of transformations, including rotation, scaling, affine stretch, change in brightness and
contrast, and addition of image noise. Because the changes were synthetic, it was possible
to precisely predict where each feature in an original image should appear in the transformed
image, allowing for measurement of correct repeatability and positional accuracy for each
feature.
Figure 3 shows these simulation results used to examine the effect of varying the number
of scales per octave at which the image function is sampled prior to extrema detection. In
this case, each image was resampled following rotation by a random angle and scaling by
a random amount between 0.2 and 0.9 times the original size. Keypoints from the reduced
resolution image were matched against those from the original image so that the scales for all
keypoints would be present in the matched image. In addition, 1% image noise was added,
meaning that each pixel had a random number added from the uniform interval [-0.01,0.01]
where pixel values are in the range [0,1] (equivalent to providing slightly less than 6 bits of
accuracy for image pixels).
Figure 4: The top line in the graph shows the percent of keypoint locations that are repeatably detected
in a transformed image as a function of the prior image smoothing for the first level of each octave.
The lower line shows the percent of descriptors correctly matched against a large database.
The top line in the first graph of Figure 3 shows the percent of keypoints that are detected
at a matching location and scale in the transformed image. For all examples in this paper, we
define a matching scale as being within a factor of √2 of the correct scale, and a matching
location as being within σ pixels, where σ is the scale of the keypoint (defined from equation
(1) as the standard deviation of the smallest Gaussian used in the difference-of-Gaussian
function). The lower line on this graph shows the number of keypoints that are correctly
matched to a database of 40,000 keypoints using the nearest-neighbor matching procedure
to be described in Section 6 (this shows that once the keypoint is repeatably located, it is
likely to be useful for recognition and matching tasks). As this graph shows, the highest
repeatability is obtained when sampling 3 scales per octave, and this is the number of scale
samples used for all other experiments throughout this paper.
It might seem surprising that the repeatability does not continue to improve as more
scales are sampled. The reason is that this results in many more local extrema being detected,
but these extrema are on average less stable and therefore are less likely to be detected in
the transformed image. This is shown by the second graph in Figure 3, which shows the
average number of keypoints detected and correctly matched in each image. The number of
keypoints rises with increased sampling of scales and the total number of correct matches also
rises. Since the success of object recognition often depends more on the quantity of correctly
matched keypoints, as opposed to their percentage correct matching, for many applications it
will be optimal to use a larger number of scale samples. However, the cost of computation
also rises with this number, so for the experiments in this paper we have chosen to use just 3
scale samples per octave.
To summarize, these experiments show that the scale-space difference-of-Gaussian func-
tion has a large number of extrema and that it would be very expensive to detect them all.
Fortunately, we can detect the most stable and useful subset even with a coarse sampling of
scales.
3.3 Frequency of sampling in the spatial domain
Just as we determined the frequency of sampling per octave of scale space, so we must de-
termine the frequency of sampling in the image domain relative to the scale of smoothing.
Given that extrema can be arbitrarily close together, there will be a similar trade-off between
sampling frequency and rate of detection. Figure 4 shows an experimental determination of
the amount of prior smoothing, σ, that is applied to each image level before building the
scale space representation for an octave. Again, the top line is the repeatability of keypoint
detection, and the results show that the repeatability continues to increase with σ. However,
there is a cost to using a large σ in terms of efficiency, so we have chosen to use σ = 1.6,
which provides close to optimal repeatability. This value is used throughout this paper and
was used for the results in Figure 3.
Of course, if we pre-smooth the image before extrema detection, we are effectively dis-
carding the highest spatial frequencies. Therefore, to make full use of the input, the image
can be expanded to create more sample points than were present in the original. We dou-
ble the size of the input image using linear interpolation prior to building the first level of
the pyramid. While the equivalent operation could effectively have been performed by us-
ing sets of subpixel-offset filters on the original image, the image doubling leads to a more
efficient implementation. We assume that the original image has a blur of at least σ = 0.5
(the minimum needed to prevent significant aliasing), and that therefore the doubled image
has σ = 1.0 relative to its new pixel spacing. This means that little additional smoothing is
needed prior to creation of the first octave of scale space. The image doubling increases the
number of stable keypoints by almost a factor of 4, but no significant further improvements
were found with a larger expansion factor.
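The doubling step can be sketched as below. This is an illustration under the stated assumptions (original blur σ = 0.5, so the doubled image carries blur 1.0 in its new pixel spacing); the function name is my own, and SciPy's zoom with order=1 supplies the linear interpolation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def make_base_image(image, sigma=1.6, assumed_blur=0.5):
    """Double the image with linear interpolation, then add just enough
    smoothing to reach the target blur sigma for the first octave."""
    doubled = zoom(image.astype(np.float64), 2, order=1)
    doubled_blur = 2 * assumed_blur  # blur relative to new pixel spacing
    # Gaussian blurs add in quadrature: extra**2 = sigma**2 - doubled_blur**2.
    extra = np.sqrt(max(sigma ** 2 - doubled_blur ** 2, 0.0))
    return gaussian_filter(doubled, extra)
```

The quadrature rule explains why "little additional smoothing is needed": reaching σ = 1.6 from 1.0 requires an incremental blur of only about 1.25.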
4 Accurate keypoint localization
Once a keypoint candidate has been found by comparing a pixel to its neighbors, the next
step is to perform a detailed fit to the nearby data for location, scale, and ratio of principal
curvatures. This information allows points to be rejected that have low contrast (and are
therefore sensitive to noise) or are poorly localized along an edge.
The initial implementation of this approach (Lowe, 1999) simply located keypoints at
the location and scale of the central sample point. However, recently Brown has developed
a method (Brown and Lowe, 2002) for fitting a 3D quadratic function to the local sample
points to determine the interpolated location of the maximum, and his experiments showed
that this provides a substantial improvement to matching and stability. His approach uses the
Taylor expansion (up to the quadratic terms) of the scale-space function, D(x, y, σ), shifted
so that the origin is at the sample point:
D(x) = D + (∂D/∂x)ᵀ x + (1/2) xᵀ (∂²D/∂x²) x                (2)

where D and its derivatives are evaluated at the sample point and x = (x, y, σ)ᵀ is the offset
from this point. The location of the extremum, x̂, is determined by taking the derivative of
this function with respect to x and setting it to zero, giving

x̂ = −(∂²D/∂x²)⁻¹ (∂D/∂x).                (3)
Figure 5: This figure shows the stages of keypoint selection. (a) The 233x189 pixel original image.
(b) The initial 832 keypoints locations at maxima and minima of the difference-of-Gaussian function.
Keypoints are displayed as vectors indicating scale, orientation, and location. (c) After applying
a threshold on minimum contrast, 729 keypoints remain. (d) The final 536 keypoints that remain
following an additional threshold on ratio of principal curvatures.
As suggested by Brown, the Hessian and derivative of D are approximated by using dif-
ferences of neighboring sample points. The resulting 3x3 linear system can be solved with
minimal cost. If the offset x̂ is larger than 0.5 in any dimension, then it means that the ex-
tremum lies closer to a different sample point. In this case, the sample point is changed and
the interpolation performed instead about that point. The final offset x̂ is added to the location
of its sample point to get the interpolated estimate for the location of the extremum.
The function value at the extremum, D(x̂), is useful for rejecting unstable extrema with
low contrast. This can be obtained by substituting equation (3) into (2), giving

D(x̂) = D + (1/2) (∂D/∂x)ᵀ x̂.

For the experiments in this paper, all extrema with a value of |D(x̂)| less than 0.03 were
discarded (as before, we assume image pixel values in the range [0,1]).
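Equations (2) and (3) can be sketched with finite differences over the 3x3x3 neighborhood of a candidate. This is an illustrative version, not Brown's code; the function name and the (scale, x, y) axis ordering are my own choices:

```python
import numpy as np

def refine_extremum(cube):
    """Given a 3x3x3 DoG neighborhood centered on a candidate
    (axes: scale, x, y), return the offset x_hat of equation (3)
    and the interpolated value D(x_hat)."""
    c = cube.astype(np.float64)
    v = c[1, 1, 1]
    # Gradient by central differences.
    g = 0.5 * np.array([c[2, 1, 1] - c[0, 1, 1],
                        c[1, 2, 1] - c[1, 0, 1],
                        c[1, 1, 2] - c[1, 1, 0]])
    # Hessian by finite differences.
    H = np.empty((3, 3))
    H[0, 0] = c[2, 1, 1] - 2 * v + c[0, 1, 1]
    H[1, 1] = c[1, 2, 1] - 2 * v + c[1, 0, 1]
    H[2, 2] = c[1, 1, 2] - 2 * v + c[1, 1, 0]
    H[0, 1] = H[1, 0] = 0.25 * (c[2, 2, 1] - c[2, 0, 1] - c[0, 2, 1] + c[0, 0, 1])
    H[0, 2] = H[2, 0] = 0.25 * (c[2, 1, 2] - c[2, 1, 0] - c[0, 1, 2] + c[0, 1, 0])
    H[1, 2] = H[2, 1] = 0.25 * (c[1, 2, 2] - c[1, 2, 0] - c[1, 0, 2] + c[1, 0, 0])
    x_hat = -np.linalg.solve(H, g)   # equation (3)
    value = v + 0.5 * g.dot(x_hat)   # D(x_hat), equation (2) at the extremum
    return x_hat, value
```

A candidate would then be kept only if |D(x̂)| is at least 0.03 and every component of x̂ stays within 0.5 of the sample point.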
Figure 5 shows the effects of keypoint selection on a natural image. In order to avoid too
much clutter, a low-resolution 233 by 189 pixel image is used and keypoints are shown as
vectors giving the location, scale, and orientation of each keypoint (orientation assignment is
described below). Figure 5 (a) shows the original image, which is shown at reduced contrast
behind the subsequent figures. Figure 5 (b) shows the 832 keypoints at all detected maxima
and minima of the difference-of-Gaussian function, while (c) shows the 729 keypoints that
remain following removal of those with a value of |D(x̂)| less than 0.03. Part (d) will be
explained in the following section.
4.1 Eliminating edge responses
For stability, it is not sufficient to reject keypoints with low contrast. The difference-of-
Gaussian function will have a strong response along edges, even if the location along the
edge is poorly determined and therefore unstable to small amounts of noise.
A poorly defined peak in the difference-of-Gaussian function will have a large principal
curvature across the edge but a small one in the perpendicular direction. The principal curva-
tures can be computed from a 2x2 Hessian matrix, H, computed at the location and scale of
the keypoint:
H = | Dxx  Dxy |
    | Dxy  Dyy |                (4)
The derivatives are estimated by taking differences of neighboring sample points.
The eigenvalues of H are proportional to the principal curvatures of D. Borrowing from
the approach used by Harris and Stephens (1988), we can avoid explicitly computing the
eigenvalues, as we are only concerned with their ratio. Let α be the eigenvalue with the
largest magnitude and β be the smaller one. Then, we can compute the sum of the eigenvalues
from the trace of H and their product from the determinant:
Tr(H) = Dxx + Dyy = α + β,
Det(H) = Dxx Dyy − (Dxy)² = αβ.
In the unlikely event that the determinant is negative, the curvatures have different signs so the
point is discarded as not being an extremum. Let r be the ratio between the largest magnitude
eigenvalue and the smaller one, so that α = rβ. Then,
Tr(H)²/Det(H) = (α + β)²/(αβ) = (rβ + β)²/(rβ²) = (r + 1)²/r,
which depends only on the ratio of the eigenvalues rather than their individual values. The
quantity (r + 1)²/r is at a minimum when the two eigenvalues are equal and it increases with
r. Therefore, to check that the ratio of principal curvatures is below some threshold, r, we
only need to check
Tr(H)²/Det(H) < (r + 1)²/r.
This is very efficient to compute, with less than 20 floating point operations required to
test each keypoint. The experiments in this paper use a value of r = 10, which eliminates
keypoints that have a ratio between the principal curvatures greater than 10. The transition
from Figure 5 (c) to (d) shows the effects of this operation.
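The test is a few lines of code; this direct transcription (the function name is my own) takes the three finite-difference Hessian entries of equation (4):

```python
def passes_edge_test(dxx, dyy, dxy, r=10.0):
    """Keep a keypoint only if Tr(H)^2 / Det(H) < (r + 1)^2 / r.
    A non-positive determinant means the principal curvatures have
    different signs, so the point is discarded."""
    tr = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:
        return False
    return tr * tr / det < (r + 1) ** 2 / r
```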
5 Orientation assignment
By assigning a consistent orientation to each keypoint based on local image properties, the
keypoint descriptor can be represented relative to this orientation and therefore achieve in-
variance to image rotation. This approach contrasts with the orientation invariant descriptors
of Schmid and Mohr (1997), in which each image property is based on a rotationally invariant
measure. The disadvantage of that approach is that it limits the descriptors that can be used
and discards image information by not requiring all measures to be based on a consistent
rotation.
Following experimentation with a number of approaches to assigning a local orientation,
the following approach was found to give the most stable results. The scale of the keypoint
is used to select the Gaussian smoothed image, L, with the closest scale, so that all compu-
tations are performed in a scale-invariant manner. For each image sample, L(x, y), at this
scale, the gradient magnitude, m(x, y), and orientation, θ(x, y), is precomputed using pixel
differences:
m(x, y) = √((L(x + 1, y) − L(x − 1, y))² + (L(x, y + 1) − L(x, y − 1))²)

θ(x, y) = tan⁻¹((L(x, y + 1) − L(x, y − 1))/(L(x + 1, y) − L(x − 1, y)))
An orientation histogram is formed from the gradient orientations of sample points within
a region around the keypoint. The orientation histogram has 36 bins covering the 360 degree
range of orientations. Each sample added to the histogram is weighted by its gradient magni-
tude and by a Gaussian-weighted circular window with a σ that is 1.5 times that of the scale
of the keypoint.
Peaks in the orientation histogram correspond to dominant directions of local gradients.
The highest peak in the histogram is detected, and then any other local peak that is within
80% of the highest peak is used to also create a keypoint with that orientation. Therefore, for
locations with multiple peaks of similar magnitude, there will be multiple keypoints created at
the same location and scale but different orientations. Only about 15% of points are assigned
multiple orientations, but these contribute significantly to the stability of matching. Finally, a
parabola is fit to the 3 histogram values closest to each peak to interpolate the peak position
for better accuracy.
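The histogram and peak-refinement steps above can be sketched as follows. This is an illustration with my own function and parameter names, and it omits the Gaussian weighting of samples described in the text:

```python
import numpy as np

def dominant_orientations(L, x, y, radius=8, num_bins=36):
    """Build a 36-bin gradient-orientation histogram around (x, y),
    weighted by gradient magnitude, and return all peaks within 80%
    of the highest, refined by a parabolic fit to the 3 bins at each
    peak.  Angles are in degrees in [0, 360)."""
    hist = np.zeros(num_bins)
    for i in range(x - radius, x + radius + 1):
        for j in range(y - radius, y + radius + 1):
            if not (0 < i < L.shape[0] - 1 and 0 < j < L.shape[1] - 1):
                continue
            dx = L[i + 1, j] - L[i - 1, j]
            dy = L[i, j + 1] - L[i, j - 1]
            theta = np.degrees(np.arctan2(dy, dx)) % 360.0
            hist[int(theta * num_bins / 360.0) % num_bins] += np.hypot(dx, dy)
    peaks = []
    for b in range(num_bins):
        left, right = hist[(b - 1) % num_bins], hist[(b + 1) % num_bins]
        if hist[b] > left and hist[b] > right and hist[b] >= 0.8 * hist.max():
            # Parabolic interpolation of the peak position.
            offset = 0.5 * (left - right) / (left - 2 * hist[b] + right)
            peaks.append(((b + offset) * 360.0 / num_bins) % 360.0)
    return peaks
```

Each returned angle would become its own keypoint at the same location and scale, which is how a single location acquires multiple orientations.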
Figure 6 shows the experimental stability of location, scale, and orientation assignment
under differing amounts of image noise. As before, the images are rotated and scaled by
random amounts. The top line shows the stability of keypoint location and scale assign-
ment. The second line shows the stability of matching when the orientation assignment is
also required to be within 15 degrees. As shown by the gap between the top two lines, the
orientation assignment remains accurate 95% of the time even after addition of ±10% pixel
noise (equivalent to a camera providing less than 3 bits of precision). The measured vari-
ance of orientation for the correct matches is about 2.5 degrees, rising to 3.9 degrees for 10%
noise. The bottom line in Figure 6 shows the final accuracy of correctly matching a keypoint
descriptor to a database of 40,000 keypoints (to be discussed below). As this graph shows,
the SIFT features are resistant to even large amounts of pixel noise, and the major cause of
error is the initial location and scale detection.
Figure 6: The top line in the graph shows the percent of keypoint locations and scales that are repeat-
ably detected as a function of pixel noise. The second line shows the repeatability after also requiring
agreement in orientation. The bottom line shows the final percent of descriptors correctly matched to
a large database.
6 The local image descriptor
The previous operations have assigned an image location, scale, and orientation to each key-
point. These parameters impose a repeatable local 2D coordinate system in which to describe
the local image region, and therefore provide invariance to these parameters. The next step is
to compute a descriptor for the local image region that is highly distinctive yet is as invariant
as possible to remaining variations, such as change in illumination or 3D viewpoint.
One obvious approach would be to sample the local image intensities around the key-
point at the appropriate scale, and to match these using a normalized correlation measure.
However, simple correlation of image patches is highly sensitive to changes that cause mis-
registration of samples, such as affine or 3D viewpoint change or non-rigid deformations. A
better approach has been demonstrated by Edelman, Intrator, and Poggio (1997). Their pro-
posed representation was based upon a model of biological vision, in particular of complex
neurons in primary visual cortex. These complex neurons respond to a gradient at a particular
orientation and spatial frequency, but the location of the gradient on the retina is allowed to
shift over a small receptive field rather than being precisely localized. Edelman et al. hypoth-
esized that the function of these complex neurons was to allow for matching and recognition
of 3D objects from a range of viewpoints. They have performed detailed experiments using
3D computer models of object and animal shapes which show that matching gradients while
allowing for shifts in their position results in much better classification under 3D rotation. For
example, recognition accuracy for 3D objects rotated in depth by 20 degrees increased from
35% for correlation of gradients to 94% using the complex cell model. Our implementation
described below was inspired by this idea, but allows for positional shift using a different
computational mechanism.