Here are the pictures I shot to be used in the project:
In order to recover the homographies, I defined several correspondence points for each pair of images that need to be blended. Below is an example for the building:
The math behind the algorithm is that in order to recover the transformation matrix, we need to solve the matrix equation: $$ \begin{align*} &q=Hp \\ \implies& \begin{bmatrix} wx^{'} \\ wy^{'} \\ w \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \\ \implies& \begin{bmatrix} x & y & 1 & 0 & 0 & 0 & -xx^{'} & -x^{'}y \\ 0 & 0 & 0 & x & y & 1 & -xy^{'} & -yy^{'} \end{bmatrix} \begin{bmatrix} a_{11} \\ a_{12} \\ a_{13} \\ a_{21} \\ a_{22} \\ a_{23} \\ a_{31} \\ a_{32} \\ a_{33} \end{bmatrix} = \begin{bmatrix} x^{'} \\ y^{'} \end{bmatrix} \end{align*} $$ After setting up the linear system, I use least square to solve it.
I first warp the 4 corners of the image to find the output shape. Then I use inverse warping to map every pixel to the target image. I use interpolation to fill in the intermediate pixels. Also, I keep a record of the alpha mask since the warped image is often not rectangular and contains meaningless pixels that however should not be treated as black. This is crucial when blending images together later as you don't want those pixels to mess up the borders.
Rectification is a good way to verify the functionality of the warping. We warp the 4 corners of a should-be rectangular item into a rectangle. It is achieved by just setting the correspondencies of the corners into a manually defined rectangular. Below are some examples:
Adding everything up, I first use the correspondence points of a pair of images to recover the homographies. Then I warp one of the image to the other one, and calculate the resulting bounding box of the mosaic by transforming the corners of the image and shift both images accordingly. After, I apply distance transforms to create an alpha mask for each of the image to the nearst edge. Finally, I apply Laplace pyramids to the images and their masks to help better blending in. Below are some examples:
Using the pre-implemented get_harris_corners function, I am able to detect the corners of my image. However, without any suppression, the result is just too overwhelming around the features. I then tuned the min_distance and threshold_rel parameters a little bit to preliminarily clean up the result. Below is an example:
The result is yet not satisfying enough, so I apply adaptive non-maximal suppression to improve them further, eliminating clustered redundant information. The main math behind that is: $$ \begin{align*} r_i=\min_{j}|x_i-x_j|, s.t. f(x_i) < c_{robust}f(x_j), x_{j} \in I \end{align*} $$ Or in other words, we only retain the strongest corner in its neighborhood. Theoretically, from 0 we gradually increase r until we reach the desired number of interest points. Practically, we decrease r from infinity to generate a comprehensive ordered list of interest points and just pick the top ones that we desire. Below is an example after applying ANMS:
After obtaining a handy set of interest points, we want to extract features out of them in preparation of the following matching phase. To do so, we first create a 40*40 patch around each interest point on the Gaussian blurred image. We then subsample it with a spacing of 5 to form a 8*8 descriptor. Finally, we normalize the descriptors to have means of 0 and variance of 1. Below are some randomly chosen descriptors:
Finally, in order to match features between 2 images, we compute pairwise Euclidean distances between their descriptors. For each descriptor, instead of matching it to the nearest neighbor, we find the two nearest neighbors in the other descriptors. Then apply Lowe's ratio test as suggested in the paper, which is to threshold on the ratio of the nearest and the second nearest neighbor. This helps to eliminate lots of outliers as experiments suggest. I also take the average second nearest neighbor distance into consideration, as it reflects the outlier distance well. To avoid mismatching, I also cross check that both descriptors are the best match for each other. Below is an example of a matching:
Even though we utilize lots of techniques to eliminate outliers and avoid mismatching, the result could still inevitably be imperfect. Thus, to add a last layer of robustness, we use 4-point RANSAC to compute the best homography possible. It is done by first randomly sample 4 points for homography estimation. Then compute the homography from the sampled points, and calculate the number of inliers for this homography. We repeat this process in thousands of iterations, and keep track of the homography with the largest set of inliers. At the end we recompute the homography with all inliers for the best model. Below are some comparision between the hand-selected points and auto-matched ones and new results: