# Camera Models Explained

The camera model is at the heart of any calibration routine. Hence, in order to better understand the factors influencing a good camera calibration, it is worth delving into camera models.

## The Pinhole Model

A camera model relates points in 3d to their projections on a camera image. By far the most common model is the pinhole camera model, which makes the fundamental assumption that rays of light enter the camera through an infinitely small aperture (the pinhole), as shown below.

###### Projection through a pinhole. (c) Wikimedia. User Pbroks13.

Mathematically and in the field of computer vision, points in three-dimensional space are often denoted $$Q=[x,y,z] .$$ Their corresponding projections onto the camera image are $$q=[u,v,1] .$$ The '1' in $q$ is needed because we are working with homogeneous coordinates, which are explained toward the end of this article. The two can be related by means of the pinhole model:

$$q = \begin{bmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{bmatrix} \cdot Q$$

###### The pinhole camera model. Modified from OpenCV online documentation. (c) Itseez.

This projection essentially does two things: it scales coordinates by $f$, and translates them such that coordinates are not relative to the camera center, but have a coordinate system that is located in the top left corner of the image. This is the natural way of indexing pixel positions in a digital image.
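As a minimal sketch of the projection described above, the following Python function carries out the matrix product by hand and then divides by the third component to obtain pixel coordinates. The function name `project_pinhole` and all numeric values are illustrative assumptions and are not taken from any particular library.

```python
def project_pinhole(Q, f, cx, cy):
    """Project a 3d point Q = (x, y, z), given in the camera frame, to pixels."""
    x, y, z = Q
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    # Matrix product K * Q with K = [[f, 0, cx], [0, f, cy], [0, 0, 1]]
    qh = (f * x + cx * z, f * y + cy * z, z)
    # Divide by the third (scale) component to bring q to the form [u, v, 1]
    return qh[0] / qh[2], qh[1] / qh[2]


# Hypothetical values: f = 1000, principal point at (640, 360)
u, v = project_pinhole((0.1, -0.05, 2.0), f=1000.0, cx=640.0, cy=360.0)
print(u, v)
```

A point on the optical axis, such as `(0, 0, 2)`, lands exactly on the principal point `(cx, cy)`, which illustrates the translation part of the projection.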

The parameters $c_x, c_y$ are called principal point coordinates, as they can also be interpreted as the image coordinates of the principal point, which is where the optical axis intersects the image plane. In most cameras, it is reasonable to assume that $c_x, c_y$ is at the exact image center. However, the lens may not be perfectly centered with the image sensor, in particular in low-cost cameras such as those in smartphones. And in specialized cameras and lenses such as Scheimpflug lenses or off-axis projection lenses, the optical axis purposefully does not intersect the image center. In these situations, $c_x, c_y$ need to be determined from calibration.

The parameter $f$ is the focal length, and it depends on the camera lens and sensor used. A large/long focal length scales by a large number and can be interpreted as a large "zoom", while a wide-angle lens would have a small/short focal length. The units of $f$ necessarily have to be compatible with those of $q$ and $Q$. Hence, if 3d points $Q$ are given in meters and $q$ in pixels (or px), $f$ has units of $px/m$. Note that in practical applications, $Q$ is most often expressed in $m$ or $mm$, while $q$ would nearly always be in $px$. It is important, however, to keep in mind the position of the origin.

## Extending the Pinhole Model

Through experimentation, and in order to accommodate cameras and lenses that are not well described by the simple pinhole model, a common extension has the following parameters:

$$q = \begin{bmatrix} f_x & \alpha f_x & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \cdot Q$$

The exact formulation differs a little from author to author. The important thing to note, however, is that by introducing $f_x$ and $f_y$ instead of just one common $f$, we allow the camera to scale differently in the $x$ and $y$ directions. The ratio $f_x/f_y$ is sometimes called the aspect ratio. Different scaling in the two directions could be due to non-square pixels (seen in some digitization standards for analogue cameras, e.g. 480i or 576i), or some exotic anisotropic camera lenses. Depending on the quality of the camera that is being calibrated, including $f_x$ and $f_y$ may still be justified. Luckily, both parameters can be determined very robustly and repeatably in camera calibration, and as such, there is usually not much harm done in calibrating for both.

$\alpha$, on the other hand, models skew in the camera, which is rarely seen in practice in modern devices. It allows for the sensor's x- and y-axes to not be perfectly perpendicular. This parameter should be excluded from the camera model and calibration in almost all modern applications unless good reasons exist to include it, as it otherwise makes calibration less robust.
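The extended model can be sketched in a few lines of Python. The helper names `intrinsic_matrix` and `project` are hypothetical, the skew defaults to zero in line with the recommendation above, and the numbers are made up for illustration.

```python
def intrinsic_matrix(fx, fy, cx, cy, alpha=0.0):
    """Build the 3x3 intrinsic matrix; alpha is the skew coefficient."""
    return [
        [fx, alpha * fx, cx],
        [0.0, fy, cy],
        [0.0, 0.0, 1.0],
    ]


def project(K, Q):
    """Apply K to Q = (x, y, z) and normalize by the third component."""
    qh = [sum(K[i][j] * Q[j] for j in range(3)) for i in range(3)]
    return qh[0] / qh[2], qh[1] / qh[2]


Q = (0.2, 0.1, 4.0)

# Separate focal lengths, no skew (hypothetical values)
K = intrinsic_matrix(fx=1200.0, fy=1180.0, cx=640.0, cy=360.0)
u, v = project(K, Q)

# The same camera with a small skew: only the u coordinate changes
K_skew = intrinsic_matrix(fx=1200.0, fy=1180.0, cx=640.0, cy=360.0, alpha=0.01)
u_skew, v_skew = project(K_skew, Q)

print((u, v), (u_skew, v_skew))
```

Since $\alpha f_x$ sits in the first row only, skew shifts $u$ in proportion to the $y$ coordinate and leaves $v$ untouched.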

## Lens Distortion

The pinhole model is only perfectly valid for a camera with an infinitely small entrance pupil (or aperture). In practice, such cameras do not exist of course, as they would not allow any light to enter. Hence, a system of lenses is usually required to focus the incoming rays onto the image sensor. In doing so, some degree of lens distortion is introduced. This is usually much more pronounced in very wide-angle lenses (sometimes called fish-eye lenses) than in long focal length lenses, because it is more difficult to construct wide-angle lenses with low distortion, even though manufacturers do their best to avoid it.

One group of lens distortion effects has radial symmetry, meaning that at a given distance from the principal point, the amount of distortion is constant. When the lens distorts inwards, this is termed barrel distortion. In contrast, when it distorts outwards, it is called pincushion distortion. A mixture of both is called mustache distortion.

###### Barrel distortion (left), pincushion distortion (middle) and mustache distortion (right). (c) Wikimedia. User WolfWings.

Being radially symmetric, the distortion effect can be described by a single function of the variable $r$. Using $u,v$ as image coordinates and $c_x, c_y$ as the principal point coordinates, we can define $r$:

$$r(u,v) = \sqrt{(u-c_x)^2 + (v-c_y)^2} ,$$

that is, $r(u,v)$ represents the distance from the principal point.

Different suitable parametric functions for radial lens distortion effects have been proposed in the literature. A widely accepted model is the even-ordered radial model of Brown [1]:

$$q_{\text{corrected}} = (1 + k_1 r^2 + k_2 r^4 + k_5 r^6) \cdot q .$$

In this model, three parameters $(k_1, k_2, k_5)$ are used. They are polynomial coefficients, allowing for a smoothly varying distortion to be present, which can be strictly negative (barrel distortion), strictly positive (pincushion distortion) or both (mustache distortion). Note that only even polynomial orders need to be included, as the function's domain is always non-negative ($r \in [0; r_{\text{max}}]$).
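A minimal sketch of the radial model in Python follows. It assumes that $q$ is expressed relative to the principal point, so that the principal point itself stays fixed, and that $r$ is measured in pixels as defined above. The function names and coefficient values are hypothetical. With $r$ in pixels the coefficients become very small numbers, which is why many implementations work in normalized coordinates instead.

```python
import math


def radial_factor(r, k1, k2, k5):
    """Even-ordered polynomial 1 + k1*r^2 + k2*r^4 + k5*r^6."""
    return 1.0 + k1 * r**2 + k2 * r**4 + k5 * r**6


def correct_radial(u, v, cx, cy, k1, k2=0.0, k5=0.0):
    """Scale the offset from the principal point by the radial factor."""
    x, y = u - cx, v - cy      # coordinates relative to the principal point
    r = math.hypot(x, y)       # distance from the principal point
    s = radial_factor(r, k1, k2, k5)
    return cx + s * x, cy + s * y


# Hypothetical example: a point 500 px from the principal point, negative k1
u_c, v_c = correct_radial(940.0, 760.0, cx=640.0, cy=360.0, k1=-1e-7)
print(u_c, v_c)
```

In this example the factor is $1 - 10^{-7} \cdot 500^2 = 0.975$, so the point moves 2.5 % closer to the principal point.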

## Tangential Distortion Effects

An effect that is not readily explained with a radial distortion model is tangential or thin prism distortion, which can be due to decentering, i.e. different lens elements not being perfectly aligned, or because the optical axis is not perfectly normal to the sensor plane. A useful model, which has been shown to fit the pixel distortion nicely, was also proposed by Brown [1]:

$$q_{\text{corrected}} = q + \begin{bmatrix} 2 k_3 x y + k_4 (r^2+2x^2) \\ k_3 (r^2 + 2y^2) + 2 k_4 x y \end{bmatrix} ,$$

where $x = u - c_x$ and $y = v - c_y$, so that $r^2 = x^2 + y^2$.

The combination of the radial model above and the tangential one is sometimes called the Plumb Bob model or Brown-Conrady model.
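The combined model can be sketched as follows. This assumes the commonly used formulation in which the radial factor multiplies the coordinates and the tangential terms are added as an offset, with $x, y$ taken relative to the principal point. The function name `brown_conrady` and the coefficient values are illustrative only.

```python
def brown_conrady(x, y, k1=0.0, k2=0.0, k3=0.0, k4=0.0, k5=0.0):
    """Apply radial (k1, k2, k5) and tangential (k3, k4) terms to (x, y).

    x and y are assumed to be relative to the principal point.
    """
    r2 = x * x + y * y
    # Radial part: 1 + k1*r^2 + k2*r^4 + k5*r^6
    radial = 1.0 + k1 * r2 + k2 * r2**2 + k5 * r2**3
    # Tangential part, added as an offset
    dx = 2.0 * k3 * x * y + k4 * (r2 + 2.0 * x * x)
    dy = k3 * (r2 + 2.0 * y * y) + 2.0 * k4 * x * y
    return radial * x + dx, radial * y + dy


# With all coefficients at zero the point is unchanged
print(brown_conrady(0.3, 0.4))

# Hypothetical tangential coefficient only
print(brown_conrady(0.3, 0.4, k3=0.01))
```

Unlike the radial part, the tangential offset is not symmetric about the principal point, so it can model a slight tilt or decentering of the lens.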

For good quality lenses and cameras, it is often not necessary to include tangential distortion parameters in the camera model. Just like principal point coordinates and radial distortion parameters, a large number of well-distributed calibration images is usually needed to determine these parameters precisely and with good repeatability. As such, one is often better served by not including them, especially if the camera and lens are of sufficient quality.

## Homogeneous Coordinates

The common way of representing points in 3d space or on a 2d plane is to use 3-vectors and 2-vectors respectively. However, for some purposes, it is very useful to use homogeneous coordinates instead, where an extra coordinate is introduced.

For instance, a 2-d image point may be denoted $q_{\text{inhom}}=[u,v]$. In homogeneous coordinates we write  $q_{\text{hom}}=s \cdot [u,v,1]$, and have $s$ as a free scale parameter. Any choice of $s$ represents the same 2-d point. However, we cannot allow $s$ to be 0 or infinity.

Now imagine a coordinate transformation consisting of scaling by a factor $\alpha$ and translating with a vector $[t_x, t_y]$. It could be applied to $q_{\text{inhom}}$ as follows:

$$\hat{q}_{\text{inhom}} = \begin{bmatrix} \hat{u} \\ \hat{v} \end{bmatrix} = \begin{bmatrix} \alpha & 0 \\ 0 & \alpha \end{bmatrix} q_{\text{inhom}} + \begin{bmatrix} t_x \\ t_y \end{bmatrix}$$

This works well. However, the transformation is not linear (it cannot be expressed as a single matrix-vector product). For many derivations, it would simplify the math greatly if this transformation could indeed be expressed as a single matrix-vector operation. With the homogeneous version of $q$, we can do the following:

$$\hat{q}_{\text{hom}} = s \begin{bmatrix} \hat{u} \\ \hat{v} \\ 1 \end{bmatrix} = \begin{bmatrix} \alpha & 0 & t_x \\ 0 & \alpha & t_y \\ 0 & 0 & 1 \end{bmatrix} q_{\text{hom}}$$

That is: now the transformation (a projection in this case) is expressed as a linear transformation. In order to convert $\hat{q}_{\text{hom}}$ to $\hat{q}_{\text{inhom}}$, which we can readily interpret, we simply divide the first two components of $\hat{q}_{\text{hom}}$ by the third (which is the arbitrary scale factor $s$).
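The equivalence of the two formulations can be checked with a short Python sketch. All helper names are hypothetical, and the scale $s = 7$ is chosen arbitrarily to show that it cancels in the final division.

```python
def to_homogeneous(q, s=1.0):
    """Lift a 2d point (u, v) to s * [u, v, 1]; s must be nonzero."""
    u, v = q
    return (s * u, s * v, s)


def from_homogeneous(qh):
    """Divide the first two components by the third."""
    a, b, w = qh
    if w == 0:
        raise ValueError("scale component must be nonzero")
    return (a / w, b / w)


def scale_translate(alpha, tx, ty):
    """3x3 matrix that scales by alpha and translates by (tx, ty)."""
    return [[alpha, 0.0, tx], [0.0, alpha, ty], [0.0, 0.0, 1.0]]


def apply(T, qh):
    """Matrix-vector product of a 3x3 matrix and a 3-vector."""
    return tuple(sum(T[i][j] * qh[j] for j in range(3)) for i in range(3))


q = (3.0, 4.0)
alpha, tx, ty = 2.0, 10.0, -5.0

# Inhomogeneous version: scale, then add the translation
direct = (alpha * q[0] + tx, alpha * q[1] + ty)

# Homogeneous version: a single matrix-vector product, then normalize
via_hom = from_homogeneous(apply(scale_translate(alpha, tx, ty),
                                 to_homogeneous(q, s=7.0)))

print(direct, via_hom)
```

Both paths give the same 2d point, regardless of which nonzero value of $s$ is used.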

This trick allows many formulations in geometry to be expressed in a simpler way. In addition, there is a useful interpretation of homogeneous 2d coordinates: we know that any point in a camera image corresponds to some point on a specific line in 3d space. The homogeneous coordinates of the 2d point are exactly the parameterization of this line in space. Varying $s$, we move along that line. Setting $s = 1$, we intersect the image plane and get the 2d point coordinates that we are usually interested in.
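The line interpretation can be illustrated for the simple pinhole model. The sketch below assumes a camera with a single focal length $f$ and principal point $(c_x, c_y)$. The function `back_project` is a hypothetical helper that returns the 3d point at depth $s$ on the line through a given pixel, and the numeric values are made up.

```python
def back_project(u, v, s, f, cx, cy):
    """Return the 3d point at depth s on the line through pixel (u, v)."""
    # Equivalent to s * K^-1 * [u, v, 1] for K = [[f, 0, cx], [0, f, cy], [0, 0, 1]]
    return (s * (u - cx) / f, s * (v - cy) / f, s)


def project(Q, f, cx, cy):
    """Pinhole projection of Q = (x, y, z) to pixel coordinates."""
    x, y, z = Q
    return (f * x / z + cx, f * y / z + cy)


f, cx, cy = 1000.0, 640.0, 360.0
u, v = 690.0, 335.0

# Every point on the line projects back to the same pixel
for s in (0.5, 2.0, 10.0):
    Q = back_project(u, v, s, f, cx, cy)
    print(s, Q, project(Q, f, cx, cy))
```

The loop shows that depth is lost in the projection: a single image point alone cannot tell us where along the line the 3d point lies.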

Homogeneous coordinates have other nice properties. Wikipedia provides an excellent overview.

## References

[1]: Close-Range Camera Calibration, D.C. Brown, Photogrammetric Engineering, pages 855-866, Vol. 37, No. 8, 1971.