- Choose the right size calibration target. Large enough to properly constrain parameters. Preferably it should cover approx. half of the total area when seen fronto-parallel in the camera images.
- Perform calibration at the approximate working distance (WD) of your final application. The camera should be focused at this distance and focus should be unchanged after calibration.
- The target should have a high feature count. Using fine patterns is preferable. However, at some point detection robustness suffers. Our recommendation is to use fine pattern counts for cameras above 3MPx and if the lighting is controlled and good.
- Collect images from different areas and tilts. Move the target to fully cover the image area and aim for even coverage. Lens distortion can be determined from fronto-parallel images alone, but focal length estimation depends on observing foreshortening. Include both fronto-parallel images and images taken with the board tilted up to +/- 45 degrees in both the horizontal and vertical direction. Tilting more is usually not a good idea as feature localization accuracy suffers.
- Use good lighting. This is often overlooked, but hugely important. The calibration target should preferably be diffusely lit by means of controlled photography lighting. Strong point sources give rise to uneven illumination, possibly making detection fail, and not utilizing the camera's dynamic range very well. Shadows can do the same.
- Have enough observations. Usually, calibration should be performed on at least 6 observations (images) of a calibration target. If a higher order camera or distortion model is used, more observations are beneficial.
- Consider using uniquely coded targets such as CharuCo boards. These allow you to gather observations from the very edges of the camera sensor and lens, and hence constrain the distortion parameters very well. Also, they allow you to collect data even when single feature points do not fulfill the other requirements.
- Calibration is only as accurate as the calibration target used. Use laser printed targets only to validate and test.
- Proper mounting of calibration target and camera. In order to minimize distortion and bow in larger targets, mount them either vertically, or laying flat on a rigid support. Consider moving the camera in these cases instead. Use a quality tripod, and avoid touching the camera during acquisitions.
- Remove bad observations. Carefully inspect reprojection errors, both per-view and per-feature. If any of these appear as outliers, exclude them and recalibrate.
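
The last point can be partly automated. The following is a minimal sketch, assuming the per-view RMS reprojection errors are already available as a list; the function name `flag_outlier_views`, the threshold factor `k` and the error values are illustrative assumptions, not part of any calibration library.

```python
import numpy as np

def flag_outlier_views(per_view_rms, k=3.0):
    """Return a boolean mask marking views whose RMS reprojection error is an outlier."""
    errors = np.asarray(per_view_rms, dtype=float)
    median = np.median(errors)
    # Median absolute deviation, scaled to be comparable to a standard deviation.
    mad = 1.4826 * np.median(np.abs(errors - median))
    return errors > median + k * max(mad, 1e-12)

# Illustrative per-view RMS errors in pixels (hypothetical values).
per_view_rms = [0.21, 0.19, 0.24, 0.22, 0.95, 0.20, 0.23]
outliers = flag_outlier_views(per_view_rms)
keep = [i for i, bad in enumerate(outliers) if not bad]
print(keep)  # indices of the views to keep for recalibration
```

A robust statistic such as the median is used here so that a single bad view does not inflate the threshold. The same idea could be applied per feature within each view.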

Following these practices should ensure the most accurate and precise calibration possible.

]]>Camera calibration is the process of determining camera and lens model parameters accurately. With the pinhole model, this amounts to determining at least the focal length $f$, and possibly the principal point coordinates ($c_x, c_y$) and lens distortion parameters $\boldsymbol{k}$.

In the most common, offline calibration process, images are taken under specific constraints. Most of these methods work by observing a calibration object with known visual features. The calibration object defines a world coordinate system such that the 3D coordinates of the visual features are known. This approach is preferred when full control over the calibration procedure is necessary and high accuracy is demanded.

**Camera Model**

In any camera calibration effort, it is crucial to select a suitable camera model, which neither under- nor over-parameterizes the camera. More information on camera models is found in this article.

**Calibration Procedures**

Many procedures for camera calibration have been proposed in the literature. See e.g. Tsai's method [3] and Heikkilä and Silvén's [4]. These procedures differ in the type of calibration object needed, in the derivation of an initial guess for the camera parameters, and in the subsequent nonlinear optimization step. Probably the most popular of all procedures is Zhang's [5].

**Zhang's Method**

A modern and popular method in the computer vision community is that of Zhang, which is also implemented in popular software libraries such as OpenCV, Jean-Yves Bouguet's Camera Calibration Toolbox for Matlab and Matlab's Computer Vision Toolbox. Zhang's calibration routine relies on observations of a planar calibration board with easily recognizable features. It relates the 3-D coordinates of these to their observed image projections by means of the model above and solves for the calibration plane extrinsics (the camera's position and orientation relative to the calibration board's coordinate system), and the camera intrinsics, by means of a closed form solution. This is then followed by non-linear optimization with the Levenberg-Marquardt algorithm over all parameters, including $\boldsymbol{k}$. The objective function to be minimized is the sum of squared reprojection errors, defined in the image plane:

$$\sum_{i=1}^{n} \sum_{j=1}^{m} ||\vec{p}_{ij} - \breve{\vec{p}}(\vec{P}_j, \boldsymbol{A}, \boldsymbol{k}, \boldsymbol{R}_i, \vec{T}_i)||^2 \quad ,$$

where $\breve{\vec{p}}$ is the projection operator determining 2-D point coordinates given 3-D coordinates and the camera parameters. $i$ sums over the $n$ positions of the calibration board and $j$ over the $m$ points in a single position. $\vec{P}_j$ are 3-D point coordinates in the local calibration object coordinate system, $\vec{P}_j = [x, y, 0]^\top$, and $\vec{p}_{ij}$ the observed 2-D coordinates in the camera. $\boldsymbol{A}$ holds the camera intrinsics. The per-position extrinsics $\boldsymbol{R}_i, \vec{T}_i$ can be understood as the position and orientation of the camera relative to the coordinate system defined by the calibration object. With quality lenses and calibration targets, final mean reprojection errors in the order of a few tenths of a pixel are usually achieved.
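
To make the objective concrete, here is a minimal NumPy sketch that evaluates the sum of squared reprojection errors for given parameters. It is an illustration under simplifying assumptions, not Zhang's full method: lens distortion is left out, the poses map board coordinates into camera coordinates, and the function names and numeric values are hypothetical.

```python
import numpy as np

def project(P, A, R, T):
    """Project Nx3 board points into the image with intrinsics A and pose (R, T)."""
    cam = P @ R.T + T                  # board coordinates -> camera coordinates
    uvw = cam @ A.T                    # apply the intrinsic matrix
    return uvw[:, :2] / uvw[:, 2:3]    # perspective division

def sum_squared_reprojection_error(observations, P, A, poses):
    """Sum over views i and points j of ||p_ij - project(P_j)||^2."""
    total = 0.0
    for p_obs, (R, T) in zip(observations, poses):
        residual = p_obs - project(P, A, R, T)
        total += float(np.sum(residual ** 2))
    return total

# Synthetic example: a 3x3 planar grid (z = 0) seen from two poses.
A = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
xs, ys = np.meshgrid(np.arange(3) * 0.03, np.arange(3) * 0.03)
P = np.column_stack([xs.ravel(), ys.ravel(), np.zeros(9)])
poses = [(np.eye(3), np.array([0.0, 0.0, 0.5])),
         (np.eye(3), np.array([0.05, -0.02, 0.6]))]

observations = [project(P, A, R, T) for R, T in poses]
shifted = [p + 0.1 for p in observations]   # every observation off by 0.1 px in u and v

error_exact = sum_squared_reprojection_error(observations, P, A, poses)
error_shifted = sum_squared_reprojection_error(shifted, P, A, poses)
print(error_exact, error_shifted)
```

In an actual calibration, a nonlinear least-squares solver would vary the intrinsics, distortion coefficients and poses to minimize this quantity.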

**Autocalibration**

An alternative to the standard offline calibration routines described above is autocalibration. In *autocalibration*, parameters are determined from normal camera images viewing a general scene [1,2]. Depending on the specific method, little or no assumptions are made about the viewed scene or the motion of the camera between images. For some applications, this does indeed work, but generally, some assumptions need to be made about the camera or a reduced camera model needs to be chosen. However, even then, the autocalibration process tends to be unreliable and its success very dependent on the specific scene composition.

[1]: O.D. Faugeras, Q.-T. Luong, and S.J. Maybank. Camera Self-Calibration: Theory and Experiments. In European Conference on Computer Vision, 1992.

[2]: R.I. Hartley. Euclidean reconstruction from uncalibrated views. In Applications of Invariance in Computer Vision, pages 235–256, 1994.

[3]: Roger Y. Tsai. An efficient and accurate camera calibration technique for 3D machine vision. In IEEE Conference on Computer Vision and Pattern Recognition, pages 364–374, 1986.

[4]: Janne Heikkilä and Olli Silvén. A Four-step Camera Calibration Procedure with Implicit Image Correction. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1106–1112, 1997.

[5]: Zhengyou Zhang. Flexible camera calibration by viewing a plane from unknown orientations. In IEEE International Conference on Computer Vision, volume 1, pages 666–673, 1999.

The camera model is at the heart of any calibration routine. Hence, in order to better understand the factors influencing a good camera calibration, it is worth delving into camera models.

**The Pinhole Model**

A camera model relates points in 3d to their projections on a camera image. By far the most common model is the pinhole camera model, which makes the fundamental assumption that rays of light enter the camera through an infinitely small aperture (the pin-hole), as shown below.

Mathematically and in the field of computer vision, points in three dimensional space are often denoted $$Q=[x,y,z] .$$ Their corresponding projections onto the camera image are $$q=[u,v,1] .$$ The '1' in $q$ is needed, as we are working with homogeneous coordinates. For an explanation, see the addendum on homogeneous coordinates below. The two can be related by means of the pin-hole model:

$$q = \begin{bmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{bmatrix} \cdot Q$$

This projection essentially does two things: it scales coordinates by $f$, and translates them such that coordinates are not relative to the camera center, but have a coordinate system that is located in the top left corner of the image. This is the natural way of indexing pixel positions in a digital image.
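
As a small worked example of the scale-and-translate behaviour, the sketch below projects one 3D point with NumPy. The values of `f`, `cx`, `cy` and `Q` are assumed purely for illustration, and the result is divided by its third component to obtain $[u, v, 1]$.

```python
import numpy as np

# Assumed example values for the focal length and principal point.
f, cx, cy = 1000.0, 640.0, 360.0
K = np.array([[f, 0.0, cx],
              [0.0, f, cy],
              [0.0, 0.0, 1.0]])

Q = np.array([0.1, -0.05, 2.0])   # 3D point in camera coordinates
q_hom = K @ Q                     # homogeneous image point, scaled by z
u, v = q_hom[:2] / q_hom[2]       # divide by the third component
print(u, v)                       # u = 690, v = 335
```

A point on the optical axis, such as `[0, 0, 2]`, lands exactly on `(cx, cy)`.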

The parameters $c_x, c_y$ are called *principal point coordinates* as they can also be interpreted as the image coordinates of the *principal point*, which is where the optical axis intersects the image plane. In many cameras, it is reasonable to assume that $c_x, c_y$ is at the exact image center. However, the lens may not be perfectly centered with the image sensor, in particular in low-cost cameras such as those in smartphones. And in specialized cameras and lenses such as Scheimpflug lenses or off-axis projection lenses, the optical axis purposefully does not intersect the image center. In these situations, $c_x, c_y$ need to be determined from calibration.

The parameter $f$ is the focal length, and it depends on the camera lens and sensor used. A large/long focal length scales by a large number and can be interpreted as a large "zoom", while a wide-angle lens would have a small/short focal length. The units of $f$ necessarily have to be compatible with those of $q$ and $Q$. Because the projection divides by the depth $z$ (the homogeneous scale), the ratios $x/z$ and $y/z$ are dimensionless, so with $q$ given in pixels (or px), $f$ is expressed in $px$ regardless of the length unit used for $Q$. Note that in practical applications, $Q$ is most often expressed in $m$ or $mm$, while $q$ would nearly always be in $px$. It is important, however, to keep in mind the position of the origin.

**Extending the pinhole model**

Through experimentation, and in order to accommodate cameras and lenses that are not well-described by means of the simple pinhole model, a common extension has the following parameters:

$$q = \begin{bmatrix} f_x & \alpha f_x & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \cdot Q$$

The exact formulation differs a little from author to author. The important thing to note, however, is that by introducing $f_x$ and $f_y$ instead of a single common $f$, we allow the camera to scale differently in the $x$ and $y$ directions. The ratio $f_x/f_y$ is sometimes called the *aspect ratio*. Different scaling could be due to non-square pixels (seen in some digitization standards for analogue cameras, e.g. 480i or 576i), or some exotic anisotropic camera lenses. Depending on the quality of the camera that is being calibrated, including both $f_x$ and $f_y$ may still be justified. Luckily, both parameters can be determined very robustly and repeatably in camera calibration, and as such, there is usually not much harm done in calibrating for both.

$\alpha$, on the other hand, models skew in the camera, which is rarely seen in practice in modern devices. It allows for the sensor's x- and y-axis to not be perfectly perpendicular. This parameter should be excluded from the camera model and calibration in almost all modern applications, unless good reasons exist to include it, as otherwise it makes calibration less robust.
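
The effect of separate $f_x$ and $f_y$ can be seen in a short sketch. The helper names `intrinsic_matrix` and `project` and all numeric values are assumptions chosen for illustration; skew defaults to zero, in line with the recommendation above.

```python
import numpy as np

def intrinsic_matrix(fx, fy, cx, cy, alpha=0.0):
    """Build the extended pinhole matrix; alpha = 0 means no skew."""
    return np.array([[fx, alpha * fx, cx],
                     [0.0, fy, cy],
                     [0.0, 0.0, 1.0]])

def project(K, Q):
    """Project a 3D point and divide by the homogeneous scale."""
    q = K @ np.asarray(Q, dtype=float)
    return q[:2] / q[2]

K_square = intrinsic_matrix(1000.0, 1000.0, 640.0, 360.0)
K_aniso = intrinsic_matrix(1000.0, 1100.0, 640.0, 360.0)   # fx != fy

Q = [0.2, 0.2, 2.0]
print(project(K_square, Q))   # equal offsets from the principal point in u and v
print(project(K_aniso, Q))    # larger offset in v than in u
```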

**Lens Distortion**

The pin-hole model is only perfectly valid for a camera with an infinitely small entrance pupil (or aperture). In practice, such cameras do not exist of course, as they would not allow any light to enter. Hence, a system of lenses is usually required to focus the incoming rays onto the image sensor. In doing so, some degree of *lens distortion* is introduced. This is usually much more pronounced in wide-angle lenses (fish-eye lenses being the extreme case) than in long focal length lenses, because it is more difficult to construct wide-angle lenses with low distortion, even though manufacturers do their best to avoid it.

**Radial Distortion Effects**

One group of lens distortion effects has radial symmetry, meaning that at a given distance from the principal point, the amount of distortion is constant. When the lens distorts inwards, this is termed *barrel distortion*. In contrast, when it distorts outwards, it is called *pincushion distortion*. A mixture of both is called *mustache distortion* (also known as *handlebar distortion*).

Being radially symmetric, a single function of variable $r$ is sufficient to describe the distortion effect. Using $u,v$ as image coordinates and $c_x, c_y$ being the principal point coordinates, we can define $r$:

$$r(u,v) = \sqrt{(u-c_x)^2 + (v-c_y)^2} ,$$

that is, $r(u,v)$ represents the distance from the principal point.

Different suitable parametric functions for radial lens distortion effects have been proposed in the literature. A widely accepted model is the even ordered radial model of Brown [1]:

$$q_{\text{corrected}} = (1 + k_1 r^2 + k_2 r^4 + k_5 r^6) \cdot q .$$

In this model, three parameters $(k_1, k_2, k_5)$ are used. They are polynomial coefficients, allowing for a smoothly varying distortion to be present, which can be strictly negative (barrel distortion), strictly positive (pincushion distortion) or both (mustache distortion). Note that only even polynomial orders need to be included as the function's domain is always positive ($r \in [0; r_{\text{max}}]$).
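
The radial model can be sketched in a few lines of NumPy. This is an illustration only: the function names are hypothetical, and the points are assumed to be given relative to the principal point in normalized (unitless) coordinates, which is how such coefficients are commonly applied in practice.

```python
import numpy as np

def radial_factor(r, k1, k2, k5):
    """Even-order radial polynomial 1 + k1 r^2 + k2 r^4 + k5 r^6."""
    r2 = np.asarray(r, dtype=float) ** 2
    return 1.0 + k1 * r2 + k2 * r2**2 + k5 * r2**3

def apply_radial(points, k1, k2=0.0, k5=0.0):
    """Scale 2D points (relative to the principal point) by the radial factor."""
    points = np.asarray(points, dtype=float)
    r = np.linalg.norm(points, axis=1)
    return points * radial_factor(r, k1, k2, k5)[:, None]

pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.3, 0.4]])
moved = apply_radial(pts, k1=-0.2)   # a negative k1 pulls points toward the centre
print(moved)
```

The point at the centre does not move, and the displacement grows with the distance $r$, which is the behaviour the polynomial is meant to capture.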

**Tangential Distortion Effects**

An effect that is not readily explained with a radial distortion model is *tangential* or *thin prism distortion*, which can be due to *decentering*, i.e. different lens elements not being perfectly aligned, or because the optical axis is not perfectly normal to the sensor plane. A useful model, which has been shown to fit the pixel distortion nicely, was also proposed by Brown [1]. It adds an offset to each point:

$$q_{\text{corrected}} = q + \begin{bmatrix} 2 k_3 x y + k_4 (r^2+2x^2) \\ k_3 (r^2 + 2y^2) + 2 k_4 x y \end{bmatrix} ,$$

where $x, y$ are the coordinates of $q$ relative to the principal point and $r^2 = x^2 + y^2$.

The combination of the radial model above and the tangential one is sometimes called the *Plumb Bob model* or *Brown-Conrady model*.
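
A minimal sketch of the combined model follows. It assumes the tangential term is added to the radially scaled point, as in Bouguet's toolbox and OpenCV, and that $x, y$ are normalized coordinates relative to the principal point; the function names and coefficient values are illustrative assumptions.

```python
import numpy as np

def tangential_offset(x, y, k3, k4):
    """Additive tangential offset for coordinates (x, y) relative to the principal point."""
    r2 = x**2 + y**2
    dx = 2.0 * k3 * x * y + k4 * (r2 + 2.0 * x**2)
    dy = k3 * (r2 + 2.0 * y**2) + 2.0 * k4 * x * y
    return dx, dy

def plumb_bob(x, y, k1, k2, k3, k4, k5):
    """Radial scaling followed by the additive tangential offset."""
    r2 = x**2 + y**2
    radial = 1.0 + k1 * r2 + k2 * r2**2 + k5 * r2**3
    dx, dy = tangential_offset(x, y, k3, k4)
    return radial * x + dx, radial * y + dy

# Only k3 is non-zero here, so the point is shifted but not radially scaled.
xd, yd = plumb_bob(0.3, 0.4, k1=0.0, k2=0.0, k3=0.001, k4=0.0, k5=0.0)
print(xd, yd)
```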

For good quality lenses and cameras, it is often not necessary to include tangential distortion parameters in the camera model. Just like principal point coordinates and radial distortion parameters, these parameters usually require a large number of well-distributed calibration images to be determined precisely and with good repeatability. As such, one is often better served by not including them, especially if the camera and lens are of sufficient quality.

**Addendum: About Homogeneous Coordinates**

The common way of representing points in 3d space or on a 2d plane is to use 3-vectors and 2-vectors respectively. However, for some purposes, it is very useful to use homogeneous coordinates instead, where an extra coordinate is introduced.

For instance, a 2-d image point may be denoted $q_{\text{inhom}}=[u,v]$. In homogeneous coordinates we write $q_{\text{hom}}=s \cdot [u,v,1]$, and have $s$ as a free scale parameter. Any choice of $s$ represents the same 2-d point. However, we cannot allow $s$ to be 0 or infinity.

Now imagine a coordinate transformation consisting of scaling by a factor $\alpha$ and translating with a vector $[t_x, t_y]$. It could be applied to $q_{\text{inhom}}$ as follows:

$$\hat{q}_{\text{inhom}} = \begin{bmatrix} \hat{u} \\ \hat{v} \end{bmatrix} = \begin{bmatrix} \alpha & 0 \\ 0 & \alpha \end{bmatrix} q_{\text{inhom}} + \begin{bmatrix} t_x \\ t_y \end{bmatrix}$$

This works well; however, the transformation is not linear (it cannot be expressed as a single matrix-vector product). For many derivations, it would simplify the math greatly if this transformation could indeed be expressed as a single matrix-vector operation. With the homogeneous version of $q$, we can do the following:

$$\hat{q}_{\text{hom}} = s \begin{bmatrix} \hat{u} \\ \hat{v} \\ 1 \end{bmatrix} = \begin{bmatrix} \alpha & 0 & t_x \\ 0 & \alpha & t_y \\ 0 & 0 & 1 \end{bmatrix} q_{\text{hom}}$$

That is, the transformation (a projection in this case) is now expressed as a linear transformation. In order to convert $\hat{q}_{\text{hom}}$ to $\hat{q}_{\text{inhom}}$, which we can readily interpret, we simply divide the first two components of $\hat{q}_{\text{hom}}$ by the third (which is the arbitrary scale factor $s$).

This trick allows many formulations for geometry to be expressed in a simpler way. In addition, there is a useful interpretation of homogeneous 2d coordinates: We know that any point in a camera image corresponds to some point on a specific line in 3d space. The homogeneous coordinates for the 2d point are exactly the parameterization for this line in space. Varying $s$, we move along that line. Setting $s = 1$, we intersect the image plane and get the 2d point coordinates that we are usually interested in.
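
The scale-and-translate example can be checked numerically. The sketch below uses hypothetical helper names and assumed values for $\alpha$, $t_x$ and $t_y$; it also shows that multiplying the homogeneous vector by any non-zero $s$ leaves the resulting 2d point unchanged.

```python
import numpy as np

def to_homogeneous(q):
    """Append a 1 to an inhomogeneous point."""
    return np.append(np.asarray(q, dtype=float), 1.0)

def from_homogeneous(q_hom):
    """Divide by the last component and drop it."""
    return q_hom[:-1] / q_hom[-1]

alpha, tx, ty = 2.0, 10.0, -5.0          # assumed example values
M = np.array([[alpha, 0.0, tx],
              [0.0, alpha, ty],
              [0.0, 0.0, 1.0]])

q = np.array([3.0, 4.0])
q_hat = from_homogeneous(M @ to_homogeneous(q))            # scale, then translate: (16, 3)
q_hat_scaled = from_homogeneous(M @ (7.0 * to_homogeneous(q)))   # same point with s = 7
print(q_hat, q_hat_scaled)
```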

Homogeneous coordinates have other nice properties. Wikipedia provides an excellent overview.

**References**

**Pattern size**

In choosing a calibration plate, an important consideration is the physical size of it. This ultimately relates to the measurement field of view (FOV) of the final application. This is because cameras need to be focused on that specific distance and calibrated. Changing the focus distance slightly affects focal length, which would throw any previous calibration off. Even aperture changes usually have a negative effect on calibration validity, which is why they should be avoided.

For accurate calibration, the camera model is best constrained if the camera sees the calibration target filling most of the image. Popularly speaking, if a small calibration plate is used, many combinations of camera parameters could explain the observed images. As a rule of thumb, the calibration plate should have an area of at least half the available pixel area when observed frontally.
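
The rule of thumb can be checked on a fronto-parallel image with a few lines of NumPy. This is a sketch under assumptions: the four outer target corners are taken as already known in pixel coordinates, and the function names, corner positions and image size are illustrative.

```python
import numpy as np

def polygon_area(corners):
    """Area of a simple polygon from its ordered vertices (shoelace formula)."""
    x, y = np.asarray(corners, dtype=float).T
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def coverage_fraction(target_corners_px, image_width, image_height):
    """Fraction of the image area covered by the target outline."""
    return polygon_area(target_corners_px) / float(image_width * image_height)

# Outer corners of the target in a 1280 x 1024 image, in order around the outline.
corners = [(160, 100), (1120, 100), (1120, 924), (160, 924)]
fraction = coverage_fraction(corners, 1280, 1024)
print(fraction >= 0.5)   # True when the target meets the half-area rule of thumb
```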

**Pattern type**

Different patterns have been introduced over the years, with each having unique properties and benefits.

Choosing the right type starts by considering which algorithm and algorithm implementation you will be using. In general purpose libraries such as OpenCV or MVTec Halcon, there is some freedom regarding the pattern, and they have individual benefits and limitations.

**Checkerboard targets**

This is the most popular and common pattern design. Chessboard corner candidates are usually found by first binarizing the camera image and finding quadrilaterals (these are the black chessboard fields). A filtering step retains only those quads that meet certain size criteria and are organized in a regular grid structure whose dimensions match those specified by the user.

After an initial detection of the pattern, the corner locations can be determined with very high accuracy. This is because corners (mathematically: saddle points) are in principle infinitely small and hence unbiased under perspective transformations or lens distortion.

In OpenCV, the entire chessboard must be visible in all images in order to be detected. This usually makes it difficult to obtain information from the very edges of images. These areas are usually good to have information from, as they constrain the lens distortion model properly.

Following the detection of a checkerboard, subpixel refinement can be performed to find the saddle points with subpixel accuracy. This makes use of the exact gray values of pixels around a given corner position, and the resulting accuracy is much higher than what integer pixel positions would allow for.

You can inspect the source code for OpenCV's checkerboard detector here. An important detail regarding checkerboard targets is that in order to be rotation-invariant, the number of rows needs to be even and the number of columns odd, or the other way around. If for instance, both are even, there is a 180-degree rotation ambiguity. For single camera calibration, this is not an issue, but if the same points need to be identified by two or more cameras (for stereo calibration), this ambiguity must not exist. This is the reason why our standard checkerboard targets all have this property of even/odd rows/columns.
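
The even/odd rule can be verified directly by rotating a synthetic board. The sketch below builds a binary checkerboard with NumPy and tests whether it is identical to its own 180-degree rotation; the function names are illustrative, and rows and columns are counted in checker fields.

```python
import numpy as np

def checkerboard(rows, cols):
    """Binary checkerboard with the given number of fields (0 and 1 alternate)."""
    return np.indices((rows, cols)).sum(axis=0) % 2

def has_180_degree_ambiguity(rows, cols):
    """True if the board looks identical after a 180-degree rotation."""
    board = checkerboard(rows, cols)
    return bool(np.array_equal(board, board[::-1, ::-1]))

print(has_180_degree_ambiguity(6, 8))   # True: both counts even
print(has_180_degree_ambiguity(6, 9))   # False: one even, one odd
```

The same check reports an ambiguity when both counts are odd, so only mixed even/odd boards are safe for stereo calibration.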

**Circle grids**

Circle grids are also a popular and very common calibration target design, based on circles, either white circles on a dark background or dark (black) circles on a white background. In image processing terms, circles can be detected as "blobs" in the image. Some simple conditions on these binary blob regions, such as area, circularity, convexity, etc., can be applied to remove bad feature candidates.

After finding suitable candidates, the regular structure of features is again used to identify and filter the pattern. The determination of circles can be done very accurately since all pixels on the periphery of circles can be used, decreasing the influence of image noise. In contrast to the saddle points in checkerboards, however, circles are imaged as ellipses under camera perspective. This perspective can be accounted for by means of image rectification. In addition, the unknown lens distortion means that the circles are not imaged as perfect ellipses, which adds a small bias. Since the distortion model can be considered piecewise linear (obeying a perspective transformation / homography), this error is very small in most lenses.
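
To illustrate why many pixels help with localization, the sketch below estimates the centre of a synthetic binary blob as the mean of its pixel coordinates. This is a simplified stand-in, not OpenCV's detector: it ignores perspective and lens distortion, and the image size, centre and radius are assumed values.

```python
import numpy as np

def blob_centroid(mask):
    """Centroid (x, y) of a binary blob, averaged over all of its pixels."""
    ys, xs = np.nonzero(mask)
    return xs.mean(), ys.mean()

# Synthetic circle with a sub-pixel centre (illustrative values).
height, width = 64, 64
true_cx, true_cy, radius = 30.3, 25.7, 12.0
yy, xx = np.mgrid[0:height, 0:width]
mask = (xx - true_cx) ** 2 + (yy - true_cy) ** 2 <= radius ** 2

cx, cy = blob_centroid(mask)
print(cx - true_cx, cy - true_cy)   # localization error in pixels
```

Even with a hard binary threshold, averaging over several hundred pixels gives an estimate well below one pixel from the true centre.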

An important difference between symmetric and asymmetric circle grids is that the former have a 180-degree ambiguity as explained in the "Checkerboard" section. Hence, for stereo calibration, asymmetric grids are necessary. Otherwise, there is not a big difference in the performance one should expect for either type.

OpenCV source code for circle grid detection is found here.

**CharuCo targets**

CharuCo patterns overcome some of the limitations of classical checkerboards. However, their detection algorithm is somewhat more complex. Luckily, CharuCo detection is part of OpenCV's contrib library (since OpenCV 3.0.0), making it very easy to integrate this advanced method.

The main advantage with CharuCo is that all light checker fields are uniquely coded and identifiable. This means that even partly occluded or non-ideal camera images can be used for calibration. For instance, strong ring lights may produce inhomogeneous lighting on the calibration target (a region of semi-specular reflection), which would cause ordinary checkerboard detection to fail. With CharuCo, the remaining (good) saddle point detections can still be used. Saddle point localizations can be refined using subpixel detection just like checkerboards.

With observations close to image corners, this is an extremely useful property. As the target can be positioned such that the camera only sees it partly, we can gather information from the very edges and corners of the camera image. This usually leads to very good and robust determination of lens distortion parameters. For this reason, we highly recommend the use of CharuCo targets, when OpenCV 3.x is available.

Naturally, CharuCo targets can be used for stereo calibration. In this case, some code needs to be implemented to find those points that were detected in each camera individually, and those that were found in both (the intersection of points).
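
The intersection step might look like the following NumPy sketch. It assumes each camera's detector returns an array of unique corner ids together with the matching image points; the function name and all values are hypothetical, and no OpenCV calls are used.

```python
import numpy as np

def match_stereo_points(ids_left, pts_left, ids_right, pts_right):
    """Return the corner ids seen by both cameras and their paired image points."""
    common, idx_l, idx_r = np.intersect1d(ids_left, ids_right, return_indices=True)
    return common, pts_left[idx_l], pts_right[idx_r]

# Hypothetical detections: ids and image points per camera.
ids_left = np.array([0, 1, 2, 5, 7])
pts_left = np.array([[10.0, 10.0], [20.0, 10.0], [30.0, 10.0], [60.0, 10.0], [80.0, 10.0]])
ids_right = np.array([7, 2, 3, 5])
pts_right = np.array([[75.0, 12.0], [25.0, 12.0], [35.0, 12.0], [55.0, 12.0]])

common, left, right = match_stereo_points(ids_left, pts_left, ids_right, pts_right)
print(common)   # ids found in both views, sorted
```

Points seen by only one camera can still contribute to that camera's individual intrinsic calibration, while the paired points are the ones usable for the stereo step.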

]]>