3D CNN environment reconstruction and drone navigation with RGB monocular camera (Master's thesis, Computer Vision)

Vision algorithms and methods allow the correct interpretation of images, enabling MAVs (Micro Aerial Vehicles) to perceive their surroundings. In this master thesis work we explore two different monocular-vision-based perception solutions: the first relies on TTC and CPA estimation, while the second is oriented towards local 3D environment reconstruction using a CNN for depth estimation.

UNIVERSITY OF SANNIO
DEPARTMENT OF ENGINEERING
Master degree in Electronic Engineering for Automation and Telecommunication

3D CNN environment reconstruction and drone navigation with RGB monocular camera

SUPERVISOR: Prof. Luigi Glielmo (UNISANNIO)
CANDIDATE: Luigi Russo, Matr. 39700137
ASSISTANT SUPERVISORS: Dr. Bálint Vanek (SZTAKI), Dr. Davide Liuzza (UNISANNIO)

Academic Year 2017/2018

Contents

Abstract
Acknowledgments
Introduction

I Background and prior knowledge

1 Camera Models
  1.1 Pinhole cameras
  1.2 Camera and lenses
  1.3 Digital Image
  1.4 Camera calibration

2 Visual perception
  2.1 Depth estimation and 3D vision
    2.1.1 Depth by triangulation
    2.1.2 Depth from motion
    2.1.3 Incremental approach and optical flow
    2.1.4 Monocular Depth estimation through CNN
  2.2 Bio-inspired and optical flow visual perception
    2.2.1 Optical flow
    2.2.2 Size expansion
    2.2.3 TTC and CPA estimation

3 Autonomous Navigation and Obstacle avoidance
  3.1 High level architecture
    3.1.1 Path planning
    3.1.2 Path following and Trajectory execution
    3.1.3 Obstacle avoidance
    3.1.4 Autopilot
  3.2 Quadrotor model
  3.3 Mathematical modelling of quadrotor

Acknowledgments

This work would not have been possible without the support of multiple people. I would like to express my gratitude to Professor Luigi Glielmo for his support, guidance and encouragement. I thank SZTAKI (Institute for Computer Science and Control, Hungarian Academy of Sciences), and particularly Dr. Bálint Vanek, for all the support, advice and trust. In addition, I would like to thank Dr. Davide Liuzza for all the Skype calls and the support. Gratitude is also due to all the other people who directly or indirectly supported me during this time, especially my colleagues and friends. I am also indebted to my family for their unwavering love, encouragement and support throughout my life.

Introduction

MAVs (Micro Aerial Vehicles), commonly known as drones, have become very popular during the last decade thanks to the diffusion of commercial platforms such as Parrot, DJI or Xiaomi [1].
Due to their ability to operate in remote, dangerous and dull situations, MAVs are increasingly used in many applications such as environmental monitoring (e.g., pollution, weather, and scientific applications), forest fire monitoring, homeland security, building monitoring, drug interdiction, aerial surveillance and mapping, traffic monitoring, precision agriculture, disaster relief, ad hoc communication networks, and rural search and rescue. As the complexity of these tasks grows, even skilled pilots are overwhelmed, so there is an increasing demand for high levels of autonomy.

Figure 1: Examples of MAV applications

Obstacle avoidance and goal-directed navigation are key capabilities for autonomous operation of MAVs. For this, MAVs need to perceive themselves and their surroundings, and decide what to do to carry out their task safely. To perceive means "to become aware of something using one or more senses, especially that of sight". In order to successfully complete the scheduled mission, a MAV must be fully aware of its states, including location, navigation speed and heading direction, as well as starting point and target location. Although sensing and perceiving are strongly related, sensing is relatively more biological (hardware) while perceiving is relatively more psychological (software). A synonym of perceiving is understanding and, as reported in [2], understanding reality means building a coherent and meaningful model of it.

From the sensing point of view, given the size, weight, power and cost limitations of MAVs, the choice of sensors is limited. LIDARs, for example, are suitable for depth sensing, but their price, weight and power consumption make them less suitable for MAV applications. Visual sensors are cheap and lightweight sensors that can provide images with an abundance of information about the surroundings, involving color, texture, and other visual cues. As shown in Figure 2, visual sensors typically include the following: (a) monocular cameras, (b) stereo cameras, (c) RGB-D cameras, and (d) fisheye cameras.

Figure 2: Visual sensors applicable to MAV navigation

Part I
Background and prior knowledge

Chapter 1
Camera Models

All the information a vision-based system processes is gathered through the camera lens. The camera is the mechanism by which it is possible to record the surrounding environment by means of frames. To be able to process images and give them meaning, it is essential to have a model that describes the camera and explains how points in the 3D world correspond to the pixels of each frame.

1.1 Pinhole cameras

The pinhole camera model is the simplest model for a camera. As shown in Figure 1.1, it is essentially composed of a barrier with a small aperture between the 3D object and a photographic film or sensor.

Figure 1.1: Pinhole camera model

The principle of a pinhole camera is simple. Objects in the scene reflect light in all directions. The size of the aperture is so small that, among the many rays reflected off P, a point on the surface of an object in the scene, only one of these rays enters the camera. We can therefore establish a one-to-one mapping between points on the 3D object and points on the film. The result is that the film is exposed to an image of the 3D object by means of this mapping.

Figure 1.2: Pinhole camera model, more formal representation

A more formal representation is reported in Figure 1.2.
The film is called the image plane and is referred to as Π′, the aperture is called the pinhole and is referred to as O, and the distance between the image plane and the pinhole is called the focal length and is referred to as f. Sometimes, the retinal plane is placed between O and the 3D object at a distance f from O. In this case, it is called the virtual image or virtual retinal plane. Given the coordinate system [i j k] (usually referred to as the camera coordinate system), simply applying similarity of triangles, a point P = [x y z]^T in the real world is mapped into a point P′ = [x′ y′]^T on the image plane through

P' = \begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} f\,\frac{x}{z} \\ f\,\frac{y}{z} \end{bmatrix}.    (1.1)

The ideal pinhole camera is one whose aperture is so small that only a single light ray enters the camera for each point in the scene. With the finite dimension of a real pinhole, the film can be exposed to multiple rays from different points in the 3D world, blurring the image. Although we may think of making the aperture as small as possible, this would imply a sharper but less bright image; moreover, when the pinhole gets too small, light rays are diffracted.

1.3 Digital Image

The parameters k and l, whose units would be something like pixel/µm, correspond to the change of units along the two axes of the image plane. We adjust our previous mapping to be

P' = \begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} fk\,\frac{x}{z} \\ fl\,\frac{y}{z} \end{bmatrix}.    (1.4)

Pixel coordinates are usually translated with respect to image coordinates. Image coordinates have their origin C at the image center, where the k axis intersects the image plane (Figure 1.2). On the other hand, digital images typically have their origin at the lower-left corner of the image. Thus, 2D points in the image plane and 2D points in the image are offset by a translation vector [cx cy]. To accommodate this change of coordinate systems, the mapping now becomes

P' = \begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} fk\,\frac{x}{z} + c_x \\ fl\,\frac{y}{z} + c_y \end{bmatrix} = \begin{bmatrix} \alpha\,\frac{x}{z} + c_x \\ \beta\,\frac{y}{z} + c_y \end{bmatrix}.    (1.5)

The projection from P to P′ is a non-linear function, as the operation divides by one of the input parameters (namely z). Using homogeneous coordinates it is possible to write this projection as a matrix-vector product:

P'_h = \begin{bmatrix} \alpha & 0 & c_x & 0 \\ 0 & \beta & c_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} P_h = M P_h.    (1.6)

We can decompose this transformation a bit further into

P'_h = \begin{bmatrix} \alpha & 0 & c_x \\ 0 & \beta & c_y \\ 0 & 0 & 1 \end{bmatrix} [I \;\; 0]\, P_h = K [I \;\; 0]\, P_h.    (1.7)

The matrix K is often referred to as the camera matrix. Another effect we need to consider when modelling a camera is skewness. We say that a camera is skewed if the angle between the two image axes is slightly larger or smaller than 90 degrees. Most cameras have zero skew, but some degree of skewness may occur because of sensor manufacturing errors. Considering skewness, the camera matrix becomes

K = \begin{bmatrix} \alpha & -\alpha\cot\theta & c_x \\ 0 & \frac{\beta}{\sin\theta} & c_y \\ 0 & 0 & 1 \end{bmatrix}.    (1.8)

Applying equation 1.7 we can obtain the pixel coordinates of a point P expressed in the 3D camera reference system. However, points of the 3D world are often expressed in a different reference system. We then need to include an additional transformation that relates points from the world reference system to the camera reference system. This transformation is captured by a rotation matrix R and a translation vector T.

Figure 1.5: Extrinsic parameters

Therefore, given a point in the world reference system Pw, we can compute its camera coordinates as follows

P = \begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix} P_w,    (1.9)

so that 1.7 now becomes

P'_h = K [R \;\; T]\, P_w = M P_w.    (1.10)

We see that the projection matrix M consists of two types of parameters: intrinsic and extrinsic parameters. All parameters contained in the camera matrix K are the intrinsic parameters, which change as the type of camera changes. The extrinsic parameters include the rotation and translation, which do not depend on the camera's build. Overall, we find that the 3×4 projection matrix M has 11 degrees of freedom: 5 from the intrinsic camera matrix, 3 from the extrinsic rotation, and 3 from the extrinsic translation [10].
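As a quick illustration of equation 1.10, the following NumPy sketch projects a world point into pixel coordinates. The intrinsic values (α, β, cx, cy) and the camera pose are made-up numbers chosen only for the example, not parameters taken from this thesis.

import numpy as np

def project_point(P_w, K, R, T):
    """Project a 3D world point into pixel coordinates, P'_h = K [R T] P_w (eq. 1.10)."""
    P_c = R @ P_w + T                # world -> camera coordinates (eq. 1.9)
    p_h = K @ P_c                    # homogeneous pixel coordinates
    return p_h[:2] / p_h[2]          # perspective division by z

# Illustrative (made-up) intrinsics: alpha, beta in pixels, principal point (cx, cy).
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Illustrative extrinsics: camera aligned with the world axes, shifted 0.1 m sideways.
R = np.eye(3)
T = np.array([0.1, 0.0, 0.0])

P_w = np.array([0.5, -0.2, 4.0])     # a world point 4 m in front of the camera
print(project_point(P_w, K, R, T))   # pixel coordinates (u, v) = (440, 200)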
1.4 Camera calibration

Knowing precisely the transformation from the real 3D world into digital images requires prior knowledge of many of the camera's intrinsic parameters. Given an arbitrary camera, we may or may not have access to these parameters. We do, however, have access to the images the camera takes. Can we therefore find a way to deduce them from images? This problem of estimating the extrinsic and intrinsic camera parameters is known as camera calibration. Specifically, we do this by solving for the intrinsic camera matrix K and the extrinsic parameters R, T from Equation 1.10. We can describe this problem in the context of a calibration rig, such as the one shown in Figure 1.6. The rig usually consists of a simple pattern (i.e. a checkerboard) with known dimensions.

Figure 1.6: Calibration rig

Furthermore, the rig defines our world reference frame, with origin Ow and axes iw, jw, kw. From the rig's known pattern, we have known points in the world reference frame P1, ..., Pn. Finding these points in the image we take from the camera gives corresponding points in the image p1, ..., pn. A precise discussion of this subject is reported in [10] and is outside the scope of this thesis work.
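In practice, calibration from a checkerboard rig is commonly done with off-the-shelf routines. The sketch below uses OpenCV's chessboard detector and calibrateCamera to recover K and the lens distortion coefficients from a set of images; the 9×6 board size, the 25 mm square size and the image folder are assumptions for the example, not values used in this work.

import glob
import cv2
import numpy as np

board_size = (9, 6)          # inner corners of the assumed checkerboard
square_size = 0.025          # assumed square size in metres

# World coordinates of the rig's corners (the rig defines the z = 0 plane).
objp = np.zeros((board_size[0] * board_size[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:board_size[0], 0:board_size[1]].T.reshape(-1, 2) * square_size

obj_points, img_points = [], []
image_size = None
for fname in glob.glob("calib_images/*.png"):      # hypothetical image folder
    gray = cv2.cvtColor(cv2.imread(fname), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, board_size)
    if found:
        obj_points.append(objp)
        img_points.append(corners)
        image_size = gray.shape[::-1]

# K is the intrinsic matrix of eq. 1.7/1.8 (zero skew); rvecs/tvecs are per-view extrinsics.
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, image_size, None, None)
print(K)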
So far we have seen how to obtain the projection matrix M, through which it is possible to map the 3D world into the 2D image. From an image it is possible to recover some information about the 3D world (mostly concerning shape); however, in general it is not possible to recover the entire structure of the 3D world from a single image, because during projection the depth information is lost. In the next part we introduce some of the existing methods to achieve the 3D reconstruction of the scene and the estimation of some parameters useful for the navigation process. Further on we will see how, thanks to CNNs, it is possible to recover the structure of the 3D world even from a single image.

2.1.1.1 Stereoscopy

Stereoscopy is inspired by the human brain's capacity to estimate the depth of a target from the images captured by the two eyes. Stereoscopy reconstructs depth by exploiting the disparity occurring between camera frames that capture the same scene from different points of view.

Figure 2.3: Basic stereo system

Figure 2.3 shows a simplified model for the triangulation in stereoscopy, with cameras laterally translated from each other. In this illustration, the two cameras have parallel optical axes and observe a point P located at a distance z. The optical projections pl and pr of P are shown on the left and right camera planes. Note the similarity between the triangles △OlPOr and △plPpr. Knowing the intrinsic and extrinsic parameters of the cameras, the depth z is inversely proportional to the horizontal parallax between the projections pl and pr, as shown in Eq. 2.1. Note that f, xl, xr are provided in pixels, hence the depth z has the same unit as the translation B. In the literature (xr − xl) = d is defined as the disparity and the distance between the optical centers as the baseline B.

z = \frac{Bf}{d}    (2.1)

Notice that if the disparity value is low, the uncertainty of the disparity increases and the uncertainty related to the z estimate increases as well. To decrease the disparity uncertainty, the baseline should be increased. However, the baseline distance must guarantee that a projection of the point P appears in both images. Considering the disparity value as the only uncertainty source and applying the law of propagation of uncertainty, one obtains

u_z = \sqrt{\left(\frac{\partial z}{\partial d}\right)^2 u_d^2} = \frac{Bf}{d^2}\,u_d \;\xrightarrow{\;d = Bf/z\;}\; \frac{z^2}{Bf}\,u_d.    (2.2)

Triangulation

Figure 2.4: Triangulation

Consider a more general setup with two camera systems, as sketched in Figure 2.4. In the triangulation problem with two views, we have two cameras with known intrinsic parameters K and K′ respectively. We also know the relative orientations and offsets R, T of these cameras with respect to each other. Suppose that we have a point P in 3D, which can be found in the images of the two cameras at p and p′ respectively. Although the location of P is currently unknown, we can measure the exact locations of p and p′ in the images. Because K, K′, R, T are known, we can compute the two lines of sight l and l′, which are defined by the camera centers O1, O2 and the image locations p, p′. Therefore, P can be computed as the intersection of l and l′. Although this process appears both straightforward and mathematically sound, it does not work very well in practice. In the real world, because the observations p and p′ are noisy and the camera calibration parameters are not precise, finding the intersection point of l and l′ may be problematic. In most cases, it will not exist at all, as the two lines may never intersect.

Figure 2.5: Reprojection error [10]

The triangulation problem for real-world scenarios is therefore often mathematically characterized as the minimization problem

\operatorname*{arg\,min}_{\hat{P}} \; \| M\hat{P} - p \|^2 + \| M'\hat{P} - p' \|^2.    (2.3)

In the above equation, we seek a P̂ in 3D that best approximates P by finding the best least-squares estimate of the reprojection error of P̂ in both images. The reprojection error for a 3D point in an image is the distance between the projection of that point in the image and the corresponding observed point in the image plane. In the case of our example in Figure 2.5, since M is the projective transformation from 3D space to image 1, the projected point of P̂ in image 1 is MP̂. The matching observation of P̂ in the first image is p. Thus, the reprojection error for the first image is the distance ‖MP̂ − p‖². Under the assumption that the image coordinate measurement noise is Gaussian-distributed, this approach gives the maximum likelihood solution for P̂. The estimated depth is trivially the third coordinate of P̂.
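As a concrete, hedged illustration of two-view triangulation, the sketch below uses the standard linear (DLT) formulation: each observation p = MP̂ (in homogeneous coordinates) contributes two linear constraints on P̂, and a least-squares solution is read off the smallest singular vector of the stacked system. This minimizes an algebraic error rather than the exact reprojection error of (2.3), and the projection matrices are placeholders built from made-up K, R, T.

import numpy as np

def triangulate_dlt(p1, p2, M1, M2):
    """Linear (DLT) triangulation of one point seen at pixels p1, p2 by cameras M1, M2 (3x4)."""
    A = np.vstack([
        p1[0] * M1[2] - M1[0],
        p1[1] * M1[2] - M1[1],
        p2[0] * M2[2] - M2[0],
        p2[1] * M2[2] - M2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]              # back to Euclidean coordinates

# Made-up rectified stereo pair: identical K, pure lateral translation B along x.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
B = 0.2
M1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
M2 = K @ np.hstack([np.eye(3), np.array([[-B], [0.0], [0.0]])])

P = np.array([0.3, 0.1, 5.0])
p1 = M1 @ np.append(P, 1.0); p1 = p1[:2] / p1[2]
p2 = M2 @ np.append(P, 1.0); p2 = p2[:2] / p2[2]
print(triangulate_dlt(p1, p2, M1, M2))   # ~ [0.3, 0.1, 5.0]
# For this rectified case the disparity d = p1[0] - p2[0] also gives z = B*f/d directly (eq. 2.1).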
Stereoscopy makes use of the epipolar geometry to triangulate 3D points observed from two points of view. We now present dense stereoscopy, active stereoscopy and the structured-light theory used in 3D cameras.

The pattern geometry has to be chosen carefully to avoid ambiguities in the recognition of homologous points. Geometries that are too similar, too small or too close together should be avoided. Theoretically, the most appropriate strategy would be a white-noise texturing, where each pixel intensity level is independent of the surrounding ones. This would allow for denser patterns, which lead to a better reconstruction. In practice, real applications exploit random dot patterns in the infrared domain to remain insensitive to the ambient light.

2.1.1.4 Structured Light

Figure 2.9: Triangulation with a single laser spot

Structured-light camera sensors use a single camera with a structured pattern projected onto the scene. Instead of triangulating with two cameras, one camera is substituted by a laser projector. It projects a codified pattern that embeds enough structure to provide unique correspondences to triangulate with the camera. The direction of the structured pattern is known a priori by the camera, which is able to triangulate based on the pattern. The simplest structured-light system is the laser triangulation shown in Figure 2.9. A laser beam is projected onto a scene or an object. The camera localizes the dot in the scene and recovers its position and depth following Eqs. 2.4 and 2.5. To improve the quality of the recognition, powerful infrared lasers are usually used.

z = \frac{b}{\tan\alpha + \tan\beta}    (2.4)

x = z\tan\alpha    (2.5)

2.1.2 Depth from motion

An alternative to the use of the stereoscopic and active sensors introduced in the previous Section is to exploit the motion of a single camera relative to the scene.

2.1.2.1 Structure from Motion

Recovering the entire 3D structure of a scene from camera motion is a well-known problem in the field of computer vision and is usually addressed as Structure from Motion (SFM). SFM in its general form has the following objective: recover the 3D structure of objects from images taken from different points of view.

Figure 2.10: The setup of the general structure from motion problem

This problem can be seen as an extension of the triangulation problem introduced in Section 2.1.1.1.

2.1.2.2 Essential and fundamental matrix

Given m views and n points, we need to estimate m matrices Mi and the n world coordinate vectors Xj, for a total of 8m + 3n unknowns, from mn observations. For the estimation of the projection matrices Mi it is useful to introduce the essential and fundamental matrices. Consider two pinhole cameras with projection matrices M and M′. Given a Euclidean point X′ in the coordinate system of C′, its position X in the coordinate system of C is given by

X = RX' + T,    (2.6)

where R is a 3×3 rotation matrix and T is a 3-vector. Pre-multiplying both sides by X^T [T]_× gives

X^T [T]_\times R X' = X^T E X' = 0,    (2.7)

where E = [T]_\times R is the 3×3 essential matrix and [T]_\times is the cross-product matrix. Equation 2.7 also holds for image points x′ and x, which gives the epipolar constraint

x^T E x' = 0.    (2.8)

Note that E depends only on R and T and is defined only up to an arbitrary scale factor. Referring to Equation 1.6, image points x may be related to pixel positions u by the inverse camera calibration matrix

x = K^{-1} u.    (2.9)

This means the epipolar constraint (equation 2.8) may be rewritten in terms of pixel positions:

(K^{-1}u)^T E (K'^{-1}u') = 0 \;\Rightarrow\; u^T (K^{-T} E K'^{-1}) u' = 0 \;\Rightarrow\; u^T F u' = 0,    (2.10)

where F = K^{-T} E K'^{-1} is the fundamental matrix, a rank-2 homogeneous matrix with 7 degrees of freedom. From a computational point of view, such a matrix can be retrieved from an a priori knowledge of at least 8 corresponding points between frames, or through a calibration process.
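The following short sketch simply verifies the chain of definitions (2.6)-(2.10) numerically: it builds E = [T]×R from a made-up relative pose, lifts it to F with assumed intrinsics, and checks that a projected point pair satisfies the epipolar constraint uᵀFu′ ≈ 0. It is a consistency check on the notation above, not an estimation algorithm such as the 8-point method.

import numpy as np

def skew(t):
    """Cross-product matrix [t]_x such that [t]_x v = t x v."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# Made-up relative pose: small rotation about y, translation mostly along x.
angle = 0.1
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
T = np.array([0.3, 0.0, 0.05])

E = skew(T) @ R                      # essential matrix, eq. 2.7

# Assumed identical intrinsics for both cameras.
K = Kp = np.array([[800.0, 0.0, 320.0],
                   [0.0, 800.0, 240.0],
                   [0.0, 0.0, 1.0]])
F = np.linalg.inv(K).T @ E @ np.linalg.inv(Kp)   # fundamental matrix, eq. 2.10

# A point expressed in the C' frame, mapped into C with eq. 2.6, then projected to pixels.
Xp = np.array([0.4, -0.1, 3.0])
X = R @ Xp + T
u = K @ (X / X[2])
up = Kp @ (Xp / Xp[2])
print(u @ F @ up)                    # ~ 0: the epipolar constraint u^T F u' = 0 holds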
2.1.2.3 Bundle adjustment

Given the camera intrinsic parameters, once we have the fundamental matrix it is possible to recover the relative camera translation and rotation, and hence the total projection matrix M. When extracting the camera matrices from the essential matrix, the estimates are known only up to scale. The 3D points can then be computed from the estimated camera matrices via the triangulation methods described earlier in Section 2.1.1.1. The extension to the multi-view case can be done by chaining pairwise cameras.

Kalman Filter

The Kalman filter is a powerful technique for incremental, real-time estimation in dynamic systems. It allows for the integration of information over time and is robust with respect to both system and sensor noise. The filter is based on three separate probabilistic models, as shown in Table 2.1. The first model, the system model, describes the evolution over time of the current state vector u_t. The transition between states is characterized by the known transition matrix Φ_t and the addition of Gaussian noise with covariance Q_t. The second model, the measurement (or sensor) model, relates the measurement vector d_t to the current state through a measurement matrix H_t and the addition of Gaussian noise with covariance R_t. The third model, the prior model, describes the knowledge about the system state û_0 and its covariance P_0 before the first measurement is taken. The sensor and process noise are assumed to be uncorrelated.

Table 2.1: Kalman filter equations

Models
  system model:                    u_t = Φ_{t-1} u_{t-1} + η_t,   η_t ~ N(0, Q_t)
  measurement model:               d_t = H_t u_t + ξ_t,           ξ_t ~ N(0, R_t)
  prior model:                     E[u_0] = û_0,   cov[u_0] = P_0
  (other assumption):              E[η_t ξ_t^T] = 0
Prediction phase
  state estimate extrapolation:    û_t^- = Φ_{t-1} û_{t-1}^+
  state covariance extrapolation:  P_t^- = Φ_{t-1} P_{t-1}^+ Φ_{t-1}^T + Q_{t-1}
Update phase
  state estimate update:           û_t^+ = û_t^- + K_t [d_t - H_t û_t^-]
  state covariance update:         P_t^+ = [I - K_t H_t] P_t^-
  Kalman gain matrix:              K_t = P_t^- H_t^T [H_t P_t^- H_t^T + R_t]^{-1}

To apply the Kalman filter estimation framework to the depth-from-motion problem, we need to specialize each of the three models (system, measurement, and prior) and define the implementations of the extrapolation and update stages. The first step in designing a Kalman filter is to specify the elements of the state vector. Again, as we have seen in Section 2.1.1.2 for stereoscopy, we have to choose between a pixel-based and a feature-based description of the image. Using the pixel-based description, the state vector is the entire depth map, and the depth-from-motion algorithm estimates the depth at each pixel in the current image. Thus, the diagonal elements of the state covariance matrix P_t are the variances of the depth estimates at each pixel. For the feature-based approach, which tracks edge elements through the image sequence, the state consists of a 3D position vector for each feature. The system and measurement models are based on the equations relating scene depth and camera motion to the induced image flow.
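The prediction and update equations in Table 2.1 translate almost line by line into code. The sketch below is a generic, minimal implementation in plain NumPy, with no particular state parametrization; the 2-state constant-velocity example at the bottom is made up purely to show the call pattern and is not the depth filter discussed here.

import numpy as np

def kf_predict(u, P, Phi, Q):
    """Prediction phase of Table 2.1: extrapolate state and covariance."""
    u_pred = Phi @ u
    P_pred = Phi @ P @ Phi.T + Q
    return u_pred, P_pred

def kf_update(u_pred, P_pred, d, H, R):
    """Update phase of Table 2.1: fuse the measurement d with the prediction."""
    S = H @ P_pred @ H.T + R                      # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)           # Kalman gain
    u_new = u_pred + K @ (d - H @ u_pred)
    P_new = (np.eye(len(u_pred)) - K @ H) @ P_pred
    return u_new, P_new

# Illustrative 2-state example (position and velocity), invented for demonstration.
dt = 0.1
Phi = np.array([[1.0, dt], [0.0, 1.0]])
Q = 1e-4 * np.eye(2)
H = np.array([[1.0, 0.0]])                        # only the position is measured
R = np.array([[0.01]])

u, P = np.zeros(2), np.eye(2)                     # prior model
for z in [0.11, 0.19, 0.32, 0.41]:                # fake position measurements
    u, P = kf_predict(u, P, Phi, Q)
    u, P = kf_update(u, P, np.array([z]), H, R)
print(u)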
Optical flow based estimation

If the inter-frame camera motion is sufficiently small, the resulting optical flow can be expressed to a good approximation in terms of the instantaneous camera velocity. We will specify this in terms of a translational velocity T and an angular velocity R. In the camera coordinate frame, the motion of a 3D point P is described by the equation

\frac{dP}{dt} = -T - R \times P.    (2.12)

Expanding this into components,

\frac{dX}{dt} = -T_x - R_y Z + R_z Y, \qquad \frac{dY}{dt} = -T_y - R_z X + R_x Z, \qquad \frac{dZ}{dt} = -T_z - R_x Y + R_y X.    (2.13)

Now, projecting (X, Y, Z) onto an ideal, unit-focal-length image as in equation 1.3,

x = \frac{X}{Z}; \qquad y = \frac{Y}{Z}.    (2.14)

Taking the derivatives with respect to time and substituting from equation 2.13 leads to the optical flow equations:

\begin{bmatrix} \Delta x \\ \Delta y \end{bmatrix} = \frac{1}{Z} \begin{bmatrix} -1 & 0 & x \\ 0 & -1 & y \end{bmatrix} \begin{bmatrix} T_x \\ T_y \\ T_z \end{bmatrix} + \begin{bmatrix} xy & -(1+x^2) & y \\ (1+y^2) & -xy & -x \end{bmatrix} \begin{bmatrix} R_x \\ R_y \\ R_z \end{bmatrix}.    (2.15)

This equation relates the depth Z of the point to the camera motion T, R and the induced image displacement, or optical flow, [Δx Δy]^T. These equations are used to measure depth given the camera motion and optical flow, and to predict the change in the depth map between frames. Note that parametrizing equation 2.15 in terms of the inverse depth d = 1/Z makes the equation linear in the "depth" variable. Relating the ideal flow equations to real measurements requires a camera model such as the one reported in equation 1.6. Thus the ideal image coordinates are related to the actual image coordinates by

x_a = \alpha x + c_x; \qquad y_a = \beta y + c_y.    (2.16)

Given the optical flow, the camera motion and the camera model, if we solve the motion equations for the inverse depth d the result is

d = F(\Delta x, \Delta y, T, R, c_x, c_y, \alpha, \beta).    (2.17)

So, to obtain depth, we need somehow to measure the optical flow (thereby specifying the measurement model). In [12] the optical flow is measured from the difference in intensity between the previous image and the current image, using a simple version of correlation-based matching. However, there is quite a large number of different techniques for optical flow estimation. Roughly, they can be classified into three groups: feature-based [13], gradient-based [14] and correlation-based techniques [15]. To convert the displacement vector [Δx Δy] into a depth measurement, it is assumed that the camera motion (T, R) is given and limited to motion parallel to the camera raster plane (lateral motion; in this case the depth can be referred to as disparity). The optical flow equation 2.15 can then be used to estimate depth as follows. First, 2.15 is abbreviated to

\begin{bmatrix} \Delta x \\ \Delta y \end{bmatrix} = d \begin{bmatrix} t_x \\ t_y \end{bmatrix} + \begin{bmatrix} r_x \\ r_y \end{bmatrix} + \xi,    (2.18)

where d is the inverse depth and ξ is an error vector representing noise in the flow measurement. The noise is assumed to be a bivariate Gaussian random vector with zero mean and a covariance matrix P_m computed by the flow estimation algorithm. Equation 2.18 can be re-expressed in the following standard form for linear estimation problems:

\Delta\bar{x} = \begin{bmatrix} \Delta x \\ \Delta y \end{bmatrix} - \begin{bmatrix} r_x \\ r_y \end{bmatrix} = d \begin{bmatrix} t_x \\ t_y \end{bmatrix} + \xi = H d + \xi.    (2.19)

The optimal estimate of the disparity is then

d = (H^T P_m^{-1} H)^{-1} H^T P_m^{-1} \Delta\bar{x},    (2.20)

and the variance of this disparity measurement is

\sigma_d^2 = (H^T P_m^{-1} H)^{-1}.    (2.21)

This measurement process has been implemented in a simplified form, under the assumption that the flow is parallel to the image raster.
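Equations (2.19)-(2.21) are a one-parameter weighted least-squares problem, so they fit in a few lines of NumPy. The sketch below estimates the inverse depth d from a single flow vector; the flow, the motion-induced terms (t_x, t_y), (r_x, r_y) and the flow covariance P_m are invented numbers used only to exercise the formulas.

import numpy as np

def inverse_depth_from_flow(flow, t, r, Pm):
    """Weighted least-squares estimate of inverse depth d and its variance (eqs. 2.19-2.21)."""
    H = t.reshape(2, 1)                             # measurement matrix, here a 2x1 column
    dx = (flow - r).reshape(2, 1)                   # rotation-compensated flow
    Pm_inv = np.linalg.inv(Pm)
    var_d = np.linalg.inv(H.T @ Pm_inv @ H)         # (H^T Pm^-1 H)^-1, eq. 2.21
    d = var_d @ H.T @ Pm_inv @ dx                   # eq. 2.20
    return d.item(), var_d.item()

flow = np.array([1.8, 0.1])       # measured flow (pixels), invented
t = np.array([3.5, 0.0])          # translational flow per unit inverse depth, invented
r = np.array([0.05, 0.08])        # rotation-induced flow component, invented
Pm = np.diag([0.04, 0.04])        # flow measurement covariance, invented

d, var_d = inverse_depth_from_flow(flow, t, r, Pm)
print(d, 1.0 / d, var_d)          # inverse depth, depth Z = 1/d, and the variance of d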
For the feature-based (edge-tracking) estimator, the state of each feature is initialized from the first two frames; here T_1 is the camera translation between the first and second frame, and the covariance matrix comes from applying standard linear error propagation methods to the equations for x_t and d. After initialization, if T_t is the translation between frames t−1 and t, the motion equations that transform the state vector and covariance matrix to the current frame are

u_t^- = \begin{bmatrix} x_t^- \\ d_t^- \end{bmatrix} = \begin{bmatrix} 1 & -T_t \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x_{t-1}^+ \\ d_{t-1}^+ \end{bmatrix} = \Phi_t u_{t-1}^+, \qquad P_t^- = \Phi_t P_{t-1}^+ \Phi_t^T.    (2.22)

The superscript minuses indicate that these estimates do not incorporate the measured edge position at time t. The newly measured edge position x̃_t is incorporated by computing the updated covariance matrix P_t^+, a gain matrix K_t, and the updated parameter vector u_t^+:

P_t^+ = \left[ (P_t^-)^{-1} + S \right]^{-1}, \qquad S = \frac{1}{\sigma_e^2} \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}, \qquad K_t = \frac{1}{\sigma_e^2}\, P_t^+ \begin{bmatrix} 1 \\ 0 \end{bmatrix}.

Since these equations are linear, we can see how the uncertainty decreases as the number of measurements increases by computing the sequence of covariance matrices P_t given only the measurement uncertainty σ_e² and the sequence of camera motions T_t.
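To make the last remark concrete, the following sketch propagates only the 2×2 covariance of the edge-tracking filter through a few frames, using the Φ_t and S defined above; the motion sequence T_t, the measurement noise σ_e and the initial covariance are invented values, so the printed depth variances only illustrate the qualitative shrinking of uncertainty.

import numpy as np

sigma_e = 0.5                                   # assumed edge-position noise (pixels)
S = (1.0 / sigma_e**2) * np.array([[1.0, 0.0],
                                   [0.0, 0.0]])

P = np.diag([sigma_e**2, 1.0])                  # invented initial covariance of [x, d]
for T_t in [0.05, 0.05, 0.05, 0.05]:            # invented camera translations per frame
    Phi = np.array([[1.0, -T_t],
                    [0.0, 1.0]])
    P = Phi @ P @ Phi.T                         # prediction, eq. 2.22 (no process noise here)
    P = np.linalg.inv(np.linalg.inv(P) + S)     # covariance update with S
    print(P[1, 1])                              # variance of the inverse depth estimate d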
To implement the feature-based depth estimator, it is necessary to specify how to extract feature positions, how to estimate the noise level in these positions, and how to track features from frame to frame. An exhaustive description of feature detection and tracking is given by Scaramuzza et al. [16]. It is worth noticing that, removing the hypothesis of known motion, both structure and motion parameters can still be estimated in a similar fashion by simply adding the motion to the state; if the camera is mounted on a MAV, for instance, we can include the motion model in the system model. However, the resulting estimation is valid only up to an unknown similarity (scale indeterminacy).

Closely related to incremental structure from motion are the Monocular Visual Odometry (VO) and Visual SLAM problems. Monocular VO is conceived to obtain the robot ego-motion from the study of the series of images that its cameras take. For this, the algorithm needs to recover some structure of the 3D points that correspond to the matched features, but this information is not exploited for other tasks and is therefore discarded. The system must work in real time, as the robot localization is needed online. An exhaustive description of visual odometry techniques is given by Scaramuzza et al. [17]. The goal of SLAM in general (and V-SLAM in particular) is to obtain a global, consistent estimate of the robot path. This implies keeping track of a map of the environment, because it is needed to realize when the robot returns to a previously visited area. This is called loop closure. When a loop closure is detected, this information is used to reduce the drift in both the map and the camera path. Understanding when a loop closure occurs and efficiently integrating this new constraint into the current map are two of the main issues in SLAM.

Figure 2.11: SFM and VO compared to SLAM

Both visual monocular SLAM and VO, just like SFM, suffer from scale ambiguity, so they cannot be used for navigation and obstacle avoidance tasks unless additional information from other sensors such as an IMU or GPS is integrated in the Kalman Filter (or, more generally, Extended Kalman Filter for non-linear models) framework [18] [19]. Moreover, in the absence of careful tuning they fail to compute reliable 3D maps in low-textured environments, and the 3D map update is usually slow with respect to the real-time requirements of fast maneuvers. Recently proposed deep learning solutions have shown robustness to the aforementioned problems. These models quickly compute the depth map from a single image, producing a dense 3D representation of the environment.

2.1.4 Monocular Depth estimation through CNN

Depth estimation from a single monocular image, usually referred to as Single Image Depth Estimation (SIDE), is a difficult task, and requires that we take into account the global structure of the image, as well as use prior knowledge about the scene. Humans appear to be extremely good at judging depth from single monocular images [20]. This is done using monocular cues such as texture variations, texture gradients, occlusion, known object sizes, haze, defocus, etc. [21], [22], [23]. For example, many objects' texture will look different at different distances from the viewer. Texture gradients, which capture the distribution of the direction of edges, also help to indicate depth. Haze is another depth cue, and is caused by atmospheric light scattering. Most of these monocular cues are "contextual information", in the sense that they are global properties of an image and cannot be inferred from small image patches. For example, occlusion cannot be determined if we look at just a small portion of an occluded object. Although local information such as the texture and color of a patch can give some information about its depth, this is usually insufficient to accurately determine its absolute depth. As another example, if we take a patch of a clear blue sky, it is difficult to tell whether this patch is infinitely far away (sky) or part of a blue object. Due to ambiguities like these, one needs to look at the overall organization of the image to determine depths. Moreover, the task is inherently ambiguous and technically ill-posed: given an image, an infinite number of possible world scenes may have produced it. Of course, most of these are physically implausible for real-world spaces, and thus the depth may still be predicted with considerable accuracy. At least one major ambiguity remains, though: the global scale.

With the emergence of deep learning methods in recent years and their massive influence on the computer vision domain, the problem of SIDE got addressed as well by many authors. These methods are in high demand for manifold scene understanding applications such as, for instance, autonomous driving, robot navigation, or augmented reality systems. In order to replace or enhance traditional methods, convolutional neural network (CNN) architectures have been most commonly used and have successfully been shown to be able to infer geometrical information solely from presented monocular RGB or intensity images. In this Section, CNNs are described in the context of this thesis, following an introductory Section on Artificial Neural Networks (ANN), on which CNNs are based.
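Purely to fix ideas about input and output shapes before the ANN/CNN discussion, here is a toy fully convolutional network in PyTorch that maps an RGB image to a single-channel depth-like map of the same resolution. It is an illustrative stub, not the architecture studied in this thesis, and all layer sizes are arbitrary.

import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    """Toy encoder-decoder: RGB image in, one dense depth-like channel out."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

net = TinyDepthNet()
rgb = torch.randn(1, 3, 240, 320)        # a fake 320x240 RGB frame
depth = net(rgb)
print(depth.shape)                       # torch.Size([1, 1, 240, 320])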
8 2 Visual perception 9 2.1 Depth estimation and 3D vision . . . . . . . . . . . . . . . . . . . . . 10 2.1.1 Depth by triangulation . . . . . . . . . . . . . . . . . . . . . . 10 2.1.2 Depth from motion . . . . . . . . . . . . . . . . . . . . . . . . 17 2.1.3 Incremental approach and optical flow . . . . . . . . . . . . . 19 2.1.4 Monocular Depth estimation through CNN . . . . . . . . . . . 28 2.2 Bio-inspired and optical flow visual perception . . . . . . . . . . . . . 46 2.2.1 Optical flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 2.2.2 Size expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 2.2.3 TTC and CPA estimation . . . . . . . . . . . . . . . . . . . . 49 3 Autonomous Navigation and Obstacle avoidance 53 3.1 High level architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.1.1 Path planning . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.1.2 Path following and Trajectory execution . . . . . . . . . . . . 55 3.1.3 Obstacle avoidance . . . . . . . . . . . . . . . . . . . . . . . . 55 3.1.4 Autopilot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.2 Quadrotor model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.3 Mathematical modelling of quadrotor . . . . . . . . . . . . . . . . . . 58 Acknowledgments This work would not have been possible without the support of multiple people. I would like to express my gratitude to Professor Luigi Glielmo, for his support, guidance and encouragement. To SZTAKI ( Institute for Computer Science and Control, Hungarian Academy of Sciences, particularly Dr. Balint Vanék, for all the support, advice and trust. In addition, i would like to thank Dr. Davide Liuzza for all the Skype call and the support. Gratitude is also due for all the other people that directly or indirectly supported during this time, specially my colleagues and friends. I am also indebted to my family for their unwavering love, encouragement and support throughout my life. iii Introduction MAVs (Micro Aerial Vehicles) commonly known as drones are becoming very popular during the last decade due to the diffusion of commercial platform such as Parrots, DJI or Xiaomi [1]. Due to their ability to operate in remote, dangerous and dull situations, MAVs are increasingly used in many applications such as en- vironmental monitoring (e.g., pollution, weather, and scientific applications), forest fire monitoring, homeland security, building monitoring, drug interdiction, aerial surveillance and mapping, traffic monitoring, precision agriculture, disaster relief, ad hoc communications networks, and rural search and rescue. As the complexity of the tasks is growing even skilled pilot are overwhelmed, so there is more and more demand of high level of autonomy. (a) (b) (c) (d) Figure 1: Example of MAVs applications iv INTRODUCTION v Obstacle avoidance and goal-directed navigation are key capabilities for au- tonomous operation of MAVs. For this, MAVs need to perceive themselves and their surroundings, and take decision on what to do to carry out their task safely. Perceive meaning "become aware of something using one or more sense, especially that of sight". In order to successfully complete the scheduled mission, a MAV must be fully aware of its states, including location, navigation speed, heading direction as well as starting point and target location. 
Although sensing and perceiving are strongly related, the sensing is relatively more biological (hardware) while perceiving is relatively more psychological (software). A synonym of perceiving is understand- ing and as reported in [2] understanding the reality means building a coherent and meaningful model. From sensing point of view, given the size, weight, power and cost limitations of MAVs the choice in sensors is limited. LIDARs for example are suitable for depth sensing, but their price, weight and power consumption make them less suitable for MAV applications. Visual sensors are cheap and lightweight sensors that can provide images with an abundance of information of surroundings, involving color, texture, and other visual information. As shown in Figure 2, visual sensors typically include the followings: (a) monocular cameras, (b) stereo cameras, (c) RGB-D cameras, and (d) fisheye cameras. (a) (b) (c) (d) Figure 2: Visual sensors aplliable to MAV navigation Part I Background and prior knowledge 1 Chapter 1 Camera Models All the information vision-based system processes is gathered from the camera lens. Camera is the mechanism by which it is possible to record the surrounding environment by means of frames. To be able to process images and give them meaning during processing, it is essential to have a model that describes the camera to understand how the points in the 3D world correspond to the pixels of each frame. 1.1 Pinhole cameras Pinhole Camera model is the simplest model for a camera. As shown in Figure 1.1 it is essentially composed of a barrier with a small aperture between the 3D object and a a photographic film sensor. Figure 1.1: Pinhole camera model The principle of a pinhole camera is simple. Objects from the scene reflect light in all directions. The size of the aperture is so small that among the many rays that are reflected off at P, a point on the surface of an object in the scene, only one of 2 1.1 Pinhole cameras 3 these rays enter the camera. So we can establish a one to one mapping between points on the 3D object and points on the film. The result is that that the film is exposed to an image of the 3D object by means of this mapping. Figure 1.2: Pinhole camera model more formal representation. A more formal representation is reported in Figure 1.2. The film is called the image plane and referred as Π′, the aperture is called pinhole and referred as O and the distance between the image plane and the pinhole is called focal length and referred as f. Sometimes, the retinal plane is placed between O and the 3D object at a distance f from O. In this case, it is called the virtual image or virtual retinal plane. Given the coordinate system [i j k], (usually referred as camera coordinate system), simply applying similarity of triangles a point P = [x y z]T in the real world is mapped into a point P′ = [x′ y′]T on the image plane through P′ = x′ y′  = f xz f y z  . (1.1) The ideal pinhole camera is one that has an aperture so small that only a single light ray enters the camera for each point in the scene. With the finite dimension of the real pinhole the film can be exposed to multiple rays from different point in the 3D world, blurring the image. Although we may think to make the aperture as small as possible this would imply a sharper but less brighter image,moreover when the pinhole gets too small light rays are diffracted. 
1.3 Digital Image 6 parameters, whose units would be something like pixel µm , correspond to the change of units in the two axes of the image plane. We adjust our previous mapping to be P ′ = x′ y′  = fk xz fl y z  . (1.4) Pixel-coordinates are usually translated respect to image coordinates. Image coor- dinates have their origin C at the image center where the k axis intersects the image plane Figure 1.2. On the other hand, digital images typically have their origin at the lower-left corner of the image. Thus, 2D points in the image plane and 2D points in the image are offset by a translation vector [cx cy]. To accommodate this change of coordinate systems, the mapping now becomes P ′ = u v  = fk xz + cx fl y z + cy  = αxz β y z  (1.5) . The projection from P to P’ in a non linear function as the operation divides one of the input parameters (namely z). Using homogeneous coordinates it is possible to write this projection as a matrix-vector product: P ′h =  α 0 cx 0 0 β cy 0 0 0 1 0 Ph = MPh (1.6) We can decompose this transformation a bit further into P ′h =  α 0 cx 0 β cy 0 0 1  [I 0]Ph = K [I 0]Ph (1.7) The matrix K is often referred to as the camera matrix. Another effect we need to consider modelling a camera is skewness. We say that a camera is skewed if the angle between the two axes are slightly larger or smaller than 90 degrees we say that camera is skewed. Most cameras have zero-skew, but some degree of skewness may occur because of sensor manufacturing errors. Considering skewness camera matrix become K =  α −α cot θ cx 0 βsin θ cy 0 0 1  . (1.8) 1.3 Digital Image 7 Applying equation 1.7 we can obtain pixel coordinate of a point P in the 3D camera reference system. However points of 3D world are often expressed in a different reference system. Then, we need to include an additional transformation that re- lates points from the world reference system to the camera reference system. This transformation is captured by a rotation matrix R and translation vector T . Figure 1.5: Extrinsic parameter Therefore, given a point in a world reference system Pw, we can compute its camera coordinates as follows P = R T 0 1  , Pw (1.9) so that 1.7 now become P ′h = K [ R T ] Pw = MPw (1.10) . We see that the projection matrix M consists of two types of parameters: in- trinsic and extrinsic parameters. All parameters contained in the camera matrix K are the intrinsic parameters, which change as the type of camera changes. The extrinsic parameters include the rotation and translation, which do not depend on the camera’s build. Overall, we find that the 3×4 projection matrix M has 11 de- grees of freedom: 5 from the intrinsic camera matrix, 3 from extrinsic rotation, and 3 from extrinsic translation [10]. 1.4 Camera calibration 8 1.4 Camera calibration To precisely know the transformation from the real, 3D world into digital images requires prior knowledge of many of the camera’s intrinsic parameters. If given an arbitrary camera, we may or may not have access to these parameters. We do, however, have access to the images the camera takes. Therefore, can we find a way to deduce them from images?. This problem of estimating the extrinsic and intrinsic camera parameters is known as camera calibration. Specifically, we do this by solving for the intrinsic camera matrix K and the extrinsic parameters R, T from Equation 1.10. We can describe this problem in the context of a calibration rig, such as the one show in Figure 1.6. 
The rig usually consists of a simple pattern Figure 1.6: Caption (i.e. checkerboard) with known dimensions. Furthermore, the rig defines our world reference frame with origin Ow and axes iw, jw, kw. From the rig’s known pattern, we have known points in the world reference frame P1, ..., Pn. Finding these points in the image we take from the camera gives corresponding points in the image p1, ..., pn A precise discussion of this subject is reported in [10] and is outside the scope of this theses work. So far we have seen how to get projection matrixM through which is possible to map the 3D world in the 2D image. From image it is possible to recover some information about 3D world (mostly concerning the shape), however in general it is not possible to recover the entire structure of the 3D world from a single image because during projection the depth information is loss. In the next part we introduce some of the existing methods to achieve the 3D reconstruction of the scene and the estimation of some useful parameter for the navigation process. Further we will see how thanks to CNN is possible to recover structure of the 3D world even from a single image. 2.1 Depth estimation and 3D vision 11 2.1.1.1 Stereoscopy Stereoscopy is inspired by the human brain capacity to estimate the depth of a target from images captured by the two eyes. Stereoscopy reconstructs depth by exploiting the disparity occurring between cameras frames that capture the same scene from different points of view. Figure 2.3: Basic stereo system Figure 2.3 shows a simplified model for the triangulation in stereoscopy, with cameras laterally translated from each others. In this illustration, the two cameras have parallel optical axes and observe a point P located a distance z. The optical projections pl and pr of P are shown on the left and right camera plane. Note the similarity between the triangles 4 OlPOr and 4 plPpr. Knowing the intrinsic and extrinsic parameters of the cameras, the depth z is inversely proportional to the horizontal parallax between projections pl and pr, as shown in Eq.2.1. Note that f , xl, xr are provided in pixel, hence the depth z has the same unit as the translation B. In literature (xr − xl) = d is defined as the disparity and the optical centers distance as the baseline B. z = Bf d (2.1) 2.1 Depth estimation and 3D vision 12 Notice that if the disparity value is low, the uncertainty value of the disparity increases and the uncertainty related to the z estimation increases as well. For decreasing the disparity uncertainty, the baseline should be increased. However, the baseline distance must guarantee that on both the images there is a projection of the point P . Considering as unique uncertainty sources the disparity value and by applying the law of propagation of uncertainty, it is obtained uz = √(δz δd )2 u2d = bf d2 ud d=Bf z−−−→= z 2 bf ud. (2.2) Triangulation Figure 2.4: Triangulation Consider a more general setup with two camera systems as sketched in Figure 2.4. In the triangulation problem with two views, we have two cameras with known camera intrinsic parameters K and K ′ respectively. We also know the relative orientations and offsets R, T of these cameras with respect to each other. Suppose that we have a point P in 3D, which can be found in the images of the two cameras at p and p′ respectively. Although the location of P is currently unknown, we can measure the exact locations of p and p′ in the image. 
Because K,K ′, R, T are known, we can compute the two lines of sight l and l′, which are defined by the camera centers O1, O2 and the image locations p, p′ . Therefore, P can be computed as the intersection of l and l′. Although this process appears both straightforward and mathematically sound, it does not work very well in practice. In the real world, because the observations p and p′ are noisy and the camera calibration parameters are not precise, finding the intersection point of l and l′ may be problematic. In 2.1 Depth estimation and 3D vision 13 most cases, it will not exist at all, as the two lines may never intersect. Figure 2.5: Reprojection error [10] The triangulation problem for real-world scenarios is often mathematically char- acterized as solving a minimization problem argmin P̂ ‖MP̂ − p‖2 + ‖M′P̂ − p′‖2. (2.3) In the above equation, we seek to find a P̂ in 3D that best approximates P by finding the best least-squares estimate of the reprojection error of P̂ in both images. The reprojection error for a 3D point in an image is the distance between the projection of that point in the image and the corresponding observed point in the image plane. In the case of our example in Figure 2.5, since M is the projective transformation from 3D space to image 1, the projected point of P̂ in image 1 is MP̂ . The matching observation of P̂ in the first image is p. Thus, the reprojection error for the first image is the distance ‖MP̂ − p‖2. Under the assumption that image coordinate measurement noise is Gaussian-distributed, this approach gives the maximum likelihood solution for P̂ . The estimated depth is trivially the third coordinate of P̂ . The stereoscopy make use of the epipolar geometry to triangulate 3D points observed from two point of view. We now present the dense stereoscopy, the active stereoscopy and the structured-light theory used in 3D cameras. 2.1 Depth estimation and 3D vision 16 The pattern geometry has to be chosen carefully to avoid ambiguities in homol- ogous points recognition. Too similar, too small or too closer geometries should be avoided. Theoretically, the most appropriate strategy would be a white noise texturing, where each pixel intensity level is independent of the surrounding ones. This would allow for denser patterns which lead to a better reconstruction. In prac- tice, real applications exploit random dot patterns in the infrared domain to keep insensitive with respect to the ambient light. 2.1.1.4 Structured Light Figure 2.9: Triangulation with a single laser spot Structured-Light-based cameras sensors use a single camera with a structured pattern projected in the scene. Instead of triangulating with two cameras, a camera is substituted by a laser projector. It projects a codified pattern that embed enough structure to provide unique correspondences to triangulate with the camera. The direction of the structured pattern is known a priori by the camera, which is able to triangulate based on the pattern. The most simple structured-light system is the laser triangulation shown in Figure 2.9. A laser beam is projected on a scene or an object.The camera localizes the dot on the scene and recovers its position and depth following Eqs 2.4 and2.5. To improve the quality of the recognition, powerful infrared laser are usually used. 
z = btanα + tan β (2.4) x = z tanα (2.5) 2.1 Depth estimation and 3D vision 17 2.1.2 Depth from motion An alternative to the usage of stereoscopic and active sensors introduced in the previous Section is to exploit the motion of a single camera relative to the scene. 2.1.2.1 Structure from Motion Recovering the entire 3D structure of a scene from camera motion is a well known problem in the field of Computer vision and is usually addressed as Structure from motion (SFM). SFM in its general form has the following objective: recover objects 3D structure from images taken from different points of view. Figure 2.10: The setup of the general structure from motion problem This problem can be seen as an extension of triangulation problem introduced in Section 2.1.1.1 2.1.2.2 Essential and fundamental matrix Givenm views and n points we need to estimatem matricesMi , and the n world coordinate vectors Xj , for a total of 8m+ 3n unknowns, from mn observations. For the estimation of the projection matrices Mi is useful to introduce the essential and fundamental matrices. Consider the two pinhole cameras with projection ma- trices M and M ′ given an euclidean point X ′ in coordinate system of C’ its position 2.1 Depth estimation and 3D vision 18 X in coordinate system of C is given by X = RX ′ + T, (2.6) where R is a 3× 3 rotation matrix and T is a 3-vector. Pre-multiplying both sides by XT [T]× gives XT [T]×RX ′ = XTEX ′ = 0, (2.7) where E = [T]×R 3×3 is the essential matrix and [T]× in the cross product matrix. Equation 2.7 also holds for image points x′ and x wich gives the epipolar constraint xTEx′ = 0. (2.8) Note that E depends only onR and T and is defined only up to an arbitrary scale factor. Referring to Equation 1.6 image points x may be related to pixel position u by the inverse camera calibration matrix x = K−1u. (2.9) This means the epipolar constraint (equation 2.8) may be rewritten in terms of pixel position (K−1u)TE(K′−1u′) = 0 uT (K−1TEK′−1)u′ = 0 uTFu′ = 0, (2.10) where F = K−1TEK′−1 is the fundamental matrix and is a rank 2 homogeneous matrix with 7 degree of a freedom . From a computational point of view, such matrix can be retrieved by an a priori knowledge of at least 8 corresponding points between frames or through a calibration process. 2.1.2.3 Bundle adjustment Given camera intrinsic parameter once we get the fundamental matrix is possible to recover the relative camera translation and rotation and so the total projection matrix M . In the extracting the camera matrices from the Essential matrix, the estimates can be known up to scale. The 3D points can be computed from the es- timated camera matrices via the triangulation methods described earlier in Section 2.1.1.1. The extension to the multi-view case can be done by chaining pairwise cam- 2.1 Depth estimation and 3D vision 21 Kalman Filter The Kalman filter is a powerful technique for doing incremental, real-time estimation in dynamic systems. It allows for the integration of information over time and is robust with respect to both system and sensor noise. The filter is based on three separate probabilistic models, as shown in Table 2.1. The first model, the system model, describes the evolution over time of the current state vector ut. The transition between states is characterized by the known transition matrix Φt, and the addition of Gaussian noise with a covariance Qr. 
The second model, the measurement (or sensor) model, relates the measurement vector dt to the current state through a measurement matrix Ht and the addition of Gaussian noise with a covariance Rt. The third model, the prior model, describes the knowl- edge about the system state û0 and its covariance P0 before the first measurement is taken. The sensor and process noise are assumed to be uncorrelated. Models system model ut = Φt−1ut−1 + ηt, ηt˜N(0, Qt) measurement model dt = Htut + ξt, ξt˜N(0, Rt) prior model E[u0] = û0, cov[u0] = P0 (other assumption) E[ηtξTt ]=0 Prediction Phase state estimate extrapolation û−t = Φt−1û+t−1 state covariance extrapolation P−t = Φt−1P+t−1ΦTt−1 +Qt−1 Update phase state estimate update û+t = û−t +KI [dt −Ht − û−t ] state covariance update P+t = [I −KtHt]P−t Kalman gain matrix Kt = P−t HTt [HtP−t HTt Rt]−1 Table 2.1: Kalman filter equations To apply Kalman filter estimation framework to the depth-from-motion problem, we need to specialize each of the three models (system, measurement, and prior) and define the implementations of the extrapolation and update stages. The first step in designing a Kalman filter is to specify the elements of the state vector. Again as we seen in Section 2.1.1.2 for stereoscopy we have to choose between a pixel-based and feature-based description of the image. Using the pixel-based description the state vector is the entire depth map and the depth from motion algorithm estimates the depth at each pixel in the current image. Thus, the diagonal elements of the state covariance matrix Pt are the variances of the depth estimates at each pixel. For the feature-based approach, which tracks edge elements through the image se- quence. the state consists of a 3D position vector for each feature. The system and measurement models are based on the equations relating scene depth and camera motion to the induced image flow. 2.1 Depth estimation and 3D vision 22 Optical flow based estimation If the inter-frame camera motion is sufficiently small, the resulting optical flow can be expressed to a good approximation in terms of the instantaneous camera velocity . We will specify this in terms of a transna- tional velocity T and an angular velocity R. In the camera coordinate frame , the motion of a 3D point P is described by the equation dP dt = −T−R×P (2.12) expanding this into components dX dt = −Tx −RyZ +RzY dY dt = −Ty −RzX +RxZ dY dt = −Tz −RxY +RyX (2.13) Now projecting (X, Y, Z) onto an ideal, unit focal length image from equation 1.3: x = X Z ; y = Y Z (2.14) Taking the derivatives with respect to time and substituting in from equation 2.13 leads to the optical flow equations: P ′ = ∆x′ ∆y′  = 1 Z −1 0 x 0 −1 y   Tx Ty Tz +  xy −(1 + x2) y (1 + y2) −xy −x   Rx Ry Rz  (2.15) This equation relates the depth Z of the point to the camera motion T,R and the induce image displacement or optical flow [∆x ∆y]T . These equations are used to measure depth given the camera motion and optical flow, and to predict the change in depth map between frames. Note that parametrizing equation 2.15 in terms of inverse depth d = 1/Z makes the equation linear in the "depth" variable. Relating the ideal flow equations to real measurement requires a camera model as the one reported in equation 1.6. 
Thus the ideal image coordinate are related to the actual image coordinate by xa = αx+ cx; ya = βy + cy (2.16) 2.1 Depth estimation and 3D vision 23 Given the optical flow , camera motion and camera model if we solve the motion equations for the inverse depth d it will result d = F (∆x,∆y, T,R, cx, cy, α, β) (2.17) So to obtain depth we need somehow to measure the optical flow (specifying the measurement model). In [12] the optical flow is measured from the difference in intensity between the previous image and the current image using a simple version of correlation-based matching. However there is quite a large amount of different techniques for optical flow estimation. Roughly they can be classified into three groups: feature based [13], gradient [14] based and correlation based techniques [15]. To convert the displacement vector [∆x ∆y] into a depth measurement, it is assumed that the camera motion (T,R) is given and limited camera raster plane (lateral motion, in this case depth can be referred as disparity). The optical flow equation 2.15 can then be used to estimate depth as follows. First they abbreviate 2.15 to ∆x ∆y  = d tx ty + rx ry + ξ (2.18) where d is the inverse depth and ξ is an error vector representing noise in the flow measurement. The noise ( is assumed to be a bivariate Gaussian random vector with a zero mean and a covariance matrix Pm computed by the flow estimation algorithm. Equation 2.18 can be reexpressed in the following standard form for linear estimation problems: ∆x = ∆x ∆y − rx ry  = d tx ty + ξ = Hd+ ξ (2.19) the optimal estimate of disparity is then d = (HTPmH)−1H tP−1m ∆x (2.20) and the variance of this disparity measurement is: σ2d = (HTP−1m H)−1 (2.21) This measurement process has been implemented in a simplified form, under the assumption that the flow is parallel to the image raster. The next stage in the 2.1 Depth estimation and 3D vision 26 where T1 is the camera translation between the first and second frame. The covari- ance matrix comes from applying standard linear error propagation methods to the equations for xt and d. After initialization, if Tt is the translation between frames t − 1 and t, the motion equations that transform the state vector and covariance matrix to the current frame are u−t = x−t d−t  = 1 −T1 0 1  x+t−1 d+t−1  = Φtu+t−1 P−t = ΦtP+t−1ΦTt (2.22) The superscript minuses indicate that these estimates do not incorporate the mea- sured edge position at time t. The newly measured edge position x̃t is incorporated by computing the updated covariance matrix P+t a gain matrix K, and the updated parameter vector u+t : P+t = (P−t )−1 + S −1 where S = 1 σ2e 0 0 0 1  K = 1 σ2t P+t 0 1  Since these equations are linear, we can see how uncertainty decreases as the num- ber of measurements increases by computing the sequence of covariance matrices Pt given only the measurement uncertainty σ2e and the sequence of camera motions Tt. To implement the feature-based depth estimator, it is necessary to specify how to extract feature positions, how to estimate the noise level in these positions, and how to track features from frame to frame. An exhaustive description of feature detection and tracking is given from Scaramuzza et al [16]. 
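To make the prediction/update cycle of Table 2.1 concrete for this two-state edge tracker, a minimal NumPy sketch is given below. The numerical values are illustrative assumptions, and the measurement update is written in the standard Kalman-gain form of Table 2.1, which is algebraically equivalent to the information form used above; this is not the thesis implementation.

```python
import numpy as np

def predict(u, P, T_t):
    """Propagate the state [x, d] and its covariance to the next frame (Eq. 2.22)."""
    Phi = np.array([[1.0, -T_t],
                    [0.0,  1.0]])          # known camera translation between frames
    u_pred = Phi @ u
    P_pred = Phi @ P @ Phi.T               # Eq. 2.22 carries no process-noise term
    return u_pred, P_pred

def update(u_pred, P_pred, x_meas, sigma_e2):
    """Fuse the newly measured edge position (variance sigma_e^2)."""
    H = np.array([[1.0, 0.0]])             # only the image position is observed
    S = H @ P_pred @ H.T + sigma_e2        # innovation covariance
    K = P_pred @ H.T / S                   # Kalman gain (2x1)
    u_upd = u_pred + (K * (x_meas - H @ u_pred)).ravel()
    P_upd = (np.eye(2) - K @ H) @ P_pred
    return u_upd, P_upd

# Toy run: edge at x = 0.2, inverse depth 0.1, camera translating 0.05 per frame.
u = np.array([0.2, 0.1])
P = np.diag([1e-4, 1e-2])                  # assumed initial uncertainty
for x_meas in [0.205, 0.211, 0.216]:       # simulated noisy edge positions
    u, P = predict(u, P, T_t=0.05)
    u, P = update(u, P, x_meas, sigma_e2=1e-4)
print(u, np.diag(P))                       # inverse-depth variance shrinks with each frame
```

Running the loop shows the behaviour described in the text: as more frames are processed, the variance of the inverse-depth estimate decreases, driven only by the measurement noise and the sequence of camera translations.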
It is worth to notice that removing the hypothesis of known motion both struc- ture and motion parameters can still be estimated in a similar fashion simply adding the motion to the state, if the camera is mounted on a MAV for instance we can include the motion model into the system model. However the resulting estimation is valid up to an unknown similarity (scale indeterminacy). Closely related to incremental structure from motion there are the Monocular Vi- sual Odometry and the Visual SLAM problems. Monocular VO is conceived to obtain the robot ego-motion from the study of the series of images that their cam- eras take. For this, the algorithm needs to recover some structure of the 3D points 2.1 Depth estimation and 3D vision 27 that correspond to the matched features, but this information is not exploited for other tasks and is therefore discarded. The system must work in real time as the robot localization is needed online. An exhaustive description of visual-odometry tecnique is given by Scaramuzza et. al. [17]. The goal of SLAM in general (and V-SLAM in particular) is to obtain a global, consistent estimate of the robot path. This implies keeping a track of a map of the environment because it is needed to realize when the robot returns to a previously visited area. This is called loop clo- sure. When a loop closure is detected, this information is used to reduce the drift in both the map and camera path. Understanding when a loop closure occurs and efficiently integrating this new constraint into the current map are two of the main issues in SLAM. Figure 2.11: SFM and VO compared to SLAM. Both visual monocular SLAM and VO just like SFM suffer of scale ambiguity so they cannot be used for navigation and obstacle avoidance tasks unless integrating additional information from other sensors like IMU or GPS in the Kalman Filter (or more in general Extended Kalman Filter for non linear model) framework [18] [19]. Moreover in absence of careful tuning they fail to compute reliable 3D maps on low textured environments and the 3D maps update is usually slow with respect to to real time requirements of fast maneuvers. Recently proposed Deep learning solu- tions have shown robustness to the aforementioned problem. These models quickly compute the depth map from single image producing a dense 3D representation of environments. 2.1 Depth estimation and 3D vision 28 2.1.4 Monocular Depth estimation through CNN Depth estimation from a single monocular image usually referred as Single Image Depth Estimation (SIDE) is a difficult task, and requires that we take into account the global structure of the image, as well as use prior knowledge about the scene. Humans appear to be extremely good at judging depth from single monocular images [20]. This is done using monocular cues such as texture variations, texture gradients, occlusion, known object sizes, haze, defocus, etc. [21], [22], [23]. For example, many objects’ texture will look different at different distances from the viewer. Texture gradients, which capture the distribution of the direction of edges, also help to indicate depth. Haze is another depth cue, and is caused by atmospheric light scattering. Most of these monocular cues are "contextual information" in the sense that they are global properties of an image and cannot be inferred from small image patches. For example, occlusion cannot be determined if we look at just a small portion of an occluded object. 
Although local information such as the texture and color of a patch can give some information about its depth, this is usually insufficient to accurately determine its absolute depth. For another example, if we take a patch of a clear blue sky, it is difficult to tell if this patch is infinitely far away (sky), or if it is part of a blue object. Due to ambiguities like these, one needs to look at the overall organization of the image to determine depths. Moreover, the task is inherently ambiguous, and a technically ill-posed problem: Given an image, an infinite number of possible world scenes may have produced it. Of course, most of these are physically implausible for real-world spaces, and thus the depth may still be predicted with considerable accuracy. At least one major ambiguity remains, though: the global scale. With the emergence of deep learning methods within the recent years and their massive influence on the computer vision domain, the problem of SIDE got ad- dressed as well by many authors. These methods are in high demand for manifold scene understanding applications like, for instance, autonomous driving, robot nav- igation, or augmented reality systems. In order to replace or enhance traditional methods, convolutional neural network (CNN) architectures have been most com- monly used and successfully shown to be able to infer geometrical information solely from presented monocular RGB or intensity images. In this Section, CNNs are described in the context of this thesis following an introductory Section on Artificial Neural Network (ANN), on which CNNs are based. 2.1 Depth estimation and 3D vision 31 on this vector.Formally: ỹ = g ( w(L+1)g ( w(L)g ( ...g ( w(1)x + b(1) ) + b2 ) + ...+ b(L+1) ) = Γ ( u) (2.25) A basic neural network with a supervised learning scheme predict an output ỹ for an unknown input x∗ by training the network with known input x and output y data. The network is trained in order to output a prediction ỹ of y of preferably high accuracy this is done trying optimize the weights and bias terms to output an as good as possible real valued prediction. Loss function The purpose of the loss function, or sometimes mentioned as cost function, is to determine how well the network predictions are performing in the training phase, where it can be seen as one evaluation function. The basic principle of the loss function is that it takes the predicted output, and compares it to its associate ground truth label to determine the gap difference between them. In machine learning methodologies the goal is to, with some optimization techniques, minimize the loss over the whole training procedure. There are a number of different loss function one can apply in neural networks. In the particular case of feature classification and segmentation with images, the loss functions is applied and evaluated usually on pixel-level. Figure 2.14: Comparison between L1 and L2 loss function L2 Loss function One quite typical loss function for regression problems that is widely used, and works for these types of annotated data is the L2-norm [24]. It 2.1 Depth estimation and 3D vision 32 minimizes the squared euclidean norm of the error , which can simply be written as Equation 2.26 as the squared norm L2() = 1 N N∑ i=1 (i)2 (2.26) Where  = ỹ − y is the error between the ground truth yi and predicted network output ỹi , N is the number of pixels. 
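As a minimal numerical illustration of these per-pixel regression losses (a NumPy sketch, not the training code of the cited works), the L1 and L2 costs of Figure 2.14 and Equation 2.26 can be compared on a toy depth map:

```python
import numpy as np

def l1_loss(pred, gt):
    """Mean absolute per-pixel error (L1)."""
    return np.mean(np.abs(pred - gt))

def l2_loss(pred, gt):
    """Mean squared per-pixel error (Eq. 2.26): large residuals are penalised quadratically."""
    return np.mean((pred - gt) ** 2)

# Toy depth maps in metres: a single outlier pixel dominates the L2 cost.
gt   = np.full((4, 4), 2.0)
pred = gt.copy()
pred[0, 0] = 5.0                                 # one 3 m error
print(l1_loss(pred, gt), l2_loss(pred, gt))      # 0.1875 vs 0.5625
```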
The benefit over the standard L1-norm is that the L2 is non-linear since it squares the error, yielding a higher error also which the model will be more sensitive for. BerHu Loss Function The reverse Huber loss, or so called BerHu by [6], have proven to give better results by trying to combine both the L1 and L2 norm. As in [50], the BerHu loss Lβ () is defined as Equation 2.27: Lβ = ||, || ≤ c2+c2 2c , || > c (2.27) where Lβ () = L1() = || when  ∈ [−c, c] or else it is the L2 norm when not in this range. The variable c is typically set to c = 15maxi(), and i represents the pixels in an image for that specific batch, and yields 20% of the maximal error within the specific batch, which is extracted empirically by [6]. The combination of both the linear L1 and non-linear L2 norm with c gives a beneficial result in terms that the L1 account for more impact from small residual gradients, while the L2 gives more weight to the high residuals. Optimization As stated earlier during training phase, the goal is to minimize the loss function with respect to the network weights and bias, with some optimization technique. Gradient descent The gradient descent method is a gradient based iterative op- timization method that is used to find the minimum of a loss function F (u). This is achieved by iterative incremental stepping of the point u in the negative gradient direction -∇F (u) to move toward the function minimum. The gradient descent is defined by ui+1 = ui − γ∇F (ui) (2.28) 2.1 Depth estimation and 3D vision 33 where ui is the point where the loss function is currently evaluated, γF (ui) is the gradient of the loss function and γ is the step length parameter also referred as learning rate as it is chosen such that the algorithm converges to a minimum in a reasonable time. This method minimizes the function by iteratively stepping closer to the minimum. Stochastic gradient descent The gradient descent algorithm may be infeasible when the training data size is huge. Thus, a stochastic version of the algorithm is often used instead. To motivate the use of stochastic optimization algorithms, note that when training deep learning models, we often consider the loss function as a sum of a finite number of functions: F (u) = 1 N N∑ i=1 Fi(u), (2.29) where Fi(u) is a loss function based on the training data instance indexed by i. It is important to highlight that the per-iteration computational cost in gradient descent scales linearly with the training data set size N . Hence, when N is huge, the per-iteration computational cost of gradient descent is very high. In view of this, stochastic gradient descent offers a lighter-weight solution. At each iteration, rather than computing the gradient∇F (u), stochastic gradient descent randomly samples i at uniform and computes ∇Fi(u) instead. The insight is, stochastic gradient descent uses ∇Fi(u) as an unbiased estimator of ∇F (u) since Ei[∇Fi(u)] = 1 N N∑ i=1 ∇Fi(u) = ∇F (u). (2.30) In a generalized case, at each iteration a mini-batch B that consists of indices for training data instances may be sampled at uniform with replacement. Similarly, we can use: ∇FB(u) = 1 |B| ∑ i∈B ∇Fi(u) (2.31) to update x as: ui+1 = ui − γ∇FB(ui), (2.32) where |B| denotes the cardinality of the mini-batch and the positive scalar γ is again the learning rate or step size. Likewise, the mini-batch stochastic gradient ∇FB(u)is 2.1 Depth estimation and 3D vision 36 ically C = 3 and represents components of a color model (e.g. RGB, YUV, etc). 
Each image pixel is identified by a triplet (il, jl, cl) that refers to the jth column and ith row of the cth color channel at the lth layer. The input image is processed by a CNN through several layers of artificial neurons, where each layer represents a processing step that transforms an input of a third-order tensor x(l) to an output third-order tensor, that is also the input for the next layer y = x(l+1). The CNN takes as input a tensorial image representation that goes through different types of layers, responsible for making certain types of transformations on the input The architecture of a CNN can vary, but some basic components like pooling, activation function and the kernel filters exist in mostly all CNNs. Input Layer The input layer simply presents the raw pixel values of the image and has the same dimension as the input image tensor x(1). ReLU Layer The ReLU layer returns a tensor y of the same dimension of the input x(l) where each element of the vector is transformed according to a ReLU transformation (represented in Figure 2.4 c) of the type : yi,j,c = max ( x (l) i,j,c, 0 ) (2.34) There are no parameters inside the ReLU layer, hence no need for parameters learn- ing in this layer. This layer is used to increase the non-linearity of the CNN. There is the need to increase non-linearities in the detection network because there ex- ists a semantic gap between the digital representation of the image and the image’s semantic information that is created by a non-linear mapping between these two representation. For this reason there is the need for a non-linear mapping from the ’s input to its output. Convolutional layer The convolutional layers play a central role in the architecture. In these lay- ers,artificial neurons consists of third-order tensors that represent learnable filters acting as feature extractors. The CNN learns the feature representations of the input image by tuning the filter’s weights, i.e. their coefficients. 2.1 Depth estimation and 3D vision 37 (a) (b) (c) (d) Figure 2.16: Main steps to form a convolutional layer. The input tensor is an RGB image represented as 32 × 32 × 3 matrix. A kernel H ×W × C l = 5 × 5 × 3 is applied to the input tensor obtaining an activation map (a). Different kernels of the same dimension are applied to the input tensor (b). Different activation maps form a new layer of depth D = 6 (c). Different layer are stacked on the top of each other with ReLu between them (d). These learnable kernel filters usually are of reduced dimension with respect to width W and height H dimension and extends in the depth dimension for the whole depth C l (Figure 2.16 (a),(b)) of the input tensor. Assuming that D kernels are used at each layer and each kernel is of spatial span H ×W , the fourth-order kernels’ tensor is denoted by f ∈ RH×W×Cl×D. Intuitively, the learning algorithm change the filter’s weights such that the filters activate when they receive as input pieces of images that contain some type of visual features, such as oriented edges or patterns. Stacking many convolutional layers (Figure 2.16 (d)) creates non-linear filters that become increasingly global (i.e. responsive to a larger region of pixel space) so that the network first creates representations of small parts of the input, then assem- bles from them representations of larger areas detecting features at a higher level of abstraction. 
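To make the notion of a filter acting as a feature extractor concrete, the toy example below (NumPy/SciPy, with a hand-designed vertical-edge kernel rather than a learned one) convolves a single-channel image and inspects the resulting activation map; in a trained CNN the same operation is carried out with many learned kernels per layer.

```python
import numpy as np
from scipy.signal import convolve2d

# 8x8 test image: dark left half, bright right half -> one vertical edge.
img = np.zeros((8, 8))
img[:, 4:] = 1.0

# Hand-designed vertical-edge kernel; a learned filter would play the same role.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-2.0, 0.0, 2.0],
                   [-1.0, 0.0, 1.0]])

activation = convolve2d(img, kernel, mode='valid')   # feature / activation map
print(activation.shape)                  # (6, 6): a 'valid' convolution shrinks the map
print(np.abs(activation).max(axis=0))    # strong response only around the edge columns
```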
Since images are usually high-dimensional inputs it is impractical to have full connectivity between layers of neurons as is done for standard ANNs. Instead, in CNNs each neuron is connected only to a local region of the previous layer. The spatial extent of this connectivity is a hyper-parameter named the neu- ron’s receptive field F and is equivalent to the neuron’s kernel filter size. Another characteristics of convolutional layers is the fact that the learnt filter’s weights and biases are shared by the whole layer and form what it is called a feature or activation 2.1 Depth estimation and 3D vision 38 map. A feature map is the resulting representation after the convolution filter has convolved the input signal and the result have been transformed by the activation function. In each feature map (convolutional layer) all the artificial neurons respond to the same feature within their specific receptive field (portion of the image). This allows detection of the same features regardless of its position in the visual field (translation invariance) and greatly reduce the number of learnable parameters in each layer. Besides the receptive field size F and the depth D introduced before, CNNs have other important hyper-parameters that influence the performance. D represents the number of filters used in the convolutional layer, to capture different types of features in the input image at the same scale. The stride hyper-parameter S, represents the discrete step size with which the kernel filter is slid across the image. For example, when the stride is 1 the filter is moved one pixel at a time, while when the stride is 2 then the filter jumps 2 pixels at a time along each axis. This will produce a smaller output tensor. The zero padding hyper-parameter P is used to reduce the discontinuity effect near the picture’s border and to control the spatial size of the output tensor. In many cases, this parameter is used to preserve the spatial size of the input volume so the input and output width and height are the same. Formally, the operation that is performed on the input tensor by the convolutional layer is a tensorial convolution with the neurons’ kernel filters. yil+1,jl+1,d = H∑ i=0 W∑ j=0 Cl∑ cl=0 fi,j,cl,d ∗ x (l) il+1+i,jl+1+j,cl (2.35) In order to give a better understanding of this operator and the meaning of the convolution’s hyper-parameters, a 2-D convolution example is shown in Figure 2.17 Figure 2.17: Example of image convoltion with C = D = 1 (greyscale image convolution) The example shows the convolution of a 4× 4 matrix with a 3× 3 kernel filter. In this example the stride is S = 1, the depth is D = 1, and zero-padding is not used 2.1 Depth estimation and 3D vision 41 For example a pooling layer with filters of size 2 × 2 applied with a stride of 2 downsample every depth slice in the input by 2 along both width and height. Every max operation would in this case compute a max over a receptive field of size 4 (see Figure 2.19). Note that this layer has no parameters to be trained since it computes Figure 2.19: Max pooling a fixed function of the input, hence there is no need for training in pooling layers. 2.1.4.5 Unpooling layer The unpooling layer was introduced by Zeiler et al. [32] and is used to increase the spatial dimensions of the input. The layer operates as a reversed pooling layer, hence the name. Since the max pooling layer is non reversible an inverse reconstruction of the layer is impossible. 
However, by storing the indices of the maximum elements, called switch elements by the author, an approximation of the reconstruction can be made. The unpooling layer achieves this by using these switch indices to dilate the output pooling area with the maximum area, see Figure 2.20 The input data is so mapped to a spatially larger output. Figure 2.20: Unpooling 2.1 Depth estimation and 3D vision 42 Output layer The output layer is the predicted depth map. Datasets In the context of the depth predicion the datasets for training are composed by coupled RGB and RGB-D image of the same resolution (Figure 2.21). So in this context we made indirect use of sensors introduced in Section 2.1.1.1 to provide ground truth depth targets during training. Some of the most common datasets in this field are NYU Depth v2 dataset [33] and KITTI [26] dataset. Figure 2.21: NYU Dataset samples Resnet Over the years there is a trend to go more deeper stacking more layer on the top of each-other, to solve more complex tasks and to also improve the recognition accuracy. But, as we go deeper, the training of neural network becomes difficult and also the accuracy starts saturating and then degrades also. Residual Learning tries to solve both these problems. In general, in a deep convolutional neural network, several layers are stacked and are trained to the task at hand. The network learns several low/mid/high level features at the end of its layers. In residual learning, instead of trying to learn some features, we try to learn some residual. Residual can be simply understood as subtraction of feature learned from input of that layer. ResNet does this using shortcut connections directly connecting input of nth layer to some (n + x)th layer. It has proved that training this form of networks is easier than training simple deep convolutional neural networks and also the problem of degrading accuracy is resolved. 2.1 Depth estimation and 3D vision 43 2.1.4.6 Used architecture In this thesis we use the CNN architecture proposed by K. Harsányi [9] The backbone of this network is the ResNet-based fully convolutional neural net intro- duced by Laina et al. [6]. Laina model can be divided into two parts: an encoding part based on the ResNet50 architecture, and a decoding part consisting of repeated up-projection blocks (see Figure 2.23(c)). This network is enhanced with U-net-like connections between the encoding and decoding parts. The idea behind U-shaped network introduced by Ronneberger [34] was to supplement a usual contracting network by successive layers, where pooling operators are replaced by up-sampling operators. Concatenating the corresponding feature maps between the expansive and contracting paths the information flow is enhanced and consequently the res- olution of the output is increased. In the used architecture lateral connection are realized by concatenating the corresponding feature maps between the encoding and decoding parts of the Laina’s model boosting its accuracy. The aim was boost the reliability of this model without significantly increasing its depth and the number of its variables. Figure 2.22: The representations of the different layers used in the architecture. For better understanding, Figure 2.22 shows the building blocks used to assemble the network. The Up-proj, ResNet-proj and ResNet-skip notations refer to more complex building blocks whose detailed description is reported in Figure 2.23. The definition for the interleaving operation can be found in [6]. 
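Before moving on, the pooling and unpooling building blocks used in encoder-decoder architectures such as the one above can be made concrete. The following NumPy toy (an illustrative sketch, not the implementation of [6] or [9]) performs 2×2 max pooling while recording the switch indices of Figure 2.20 and then unpools with them:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2; also returns the 'switch' indices."""
    h, w = x.shape
    pooled = np.zeros((h // 2, w // 2))
    switches = np.zeros((h // 2, w // 2), dtype=int)   # flat index of the max in each window
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            window = x[i:i + 2, j:j + 2]
            switches[i // 2, j // 2] = np.argmax(window)
            pooled[i // 2, j // 2] = window.max()
    return pooled, switches

def unpool_2x2(pooled, switches):
    """Place each pooled value back at its recorded switch position, zeros elsewhere."""
    h, w = pooled.shape
    out = np.zeros((2 * h, 2 * w))
    for i in range(h):
        for j in range(w):
            di, dj = divmod(switches[i, j], 2)
            out[2 * i + di, 2 * j + dj] = pooled[i, j]
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
p, s = max_pool_2x2(x)
print(p)                 # maxima of each 2x2 window
print(unpool_2x2(p, s))  # sparse 4x4 map with the maxima restored at their original positions
```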
The complete architec- 2.2 Bio-inspired and optical flow visual perception 46 • Mean Log10 Error: ∑n i=0 | log10 yi − log10 ỹi| • % of pixels where max ( yi ỹi , ỹi yi ) = δ < threshold where i iterates over every pixel of every image of the test set, and n is the number of these pixels. On the basis of these metrics it is shown that the network is able to exceed other popular CNN based depth estimation models. In [35] Koch questioned the commonly used evaluation metrics proposing new metrics. In Chapter 4 we will see how CNN depth estimation can be applied to obstacle avoidance. 2.2 Bio-inspired and optical flow visual perception Depending on the chosen implementation 3D modelling with monocular camera introduced in the previous Section is suitable for both localization and mapping, giving the basis for high level planning as we will see in the next chapter, and building a local map for reactive collision avoidance. Even if real-time requirements can be accomplished, the 3D modelling process can be computationally expansive and it has to be executed on a ground station. For avoidance tasks bio-inspired (insect, animal or human like) perception provide a great alternative to 3D modelling perception. These approaches estimate the presence of the obstacle efficiently without calculating the 3D model, such as using optical flow or other monocular cues. 2.2.1 Optical flow Figure 2.25: Example of optical flow We have already talked about optical flow in the context of 3D reconstruction, however in the context of obstacle avoidance it be can directly used to sense the environment giving some useful information. Generally speaking optical flow is a 2.2 Bio-inspired and optical flow visual perception 47 monocular cue based method which measures the vector of motion of pixels (cor- relation based and gradient based optical flow) or interesting points (feature based optical flow) between two consecutive images. The technique can be used to track the motion of objects within a video. Optical flow field reflects the correlation be- tween the time domain of the same point in image plane of adjacent frames. The inconsistent direction of optical flow between optical flow field and major movement can be used to detect obstacles. The optical flow field is computed by using mul- tiple images obtained by the same camera at different times. Optical flow can be generated by the two motion of the camera: flowir and flowrot. Given the distance between the object in the field of view and the camera d, the angle between the object and the direction of translational motion θ, the translational velocity v and an angular velocity ω of the camera the optical flow generated by the object can be calculated using F = ( ω + v d sin θ ) . (2.37) The first term on the right of the formula represents the optical flow generated by the rotational movement of the camera and the second term represents the optical flow generated by the translational motion. Feature points, which are closer to the Figure 2.26: The vector schematic of optical flow of obstacles camera, generate greater optical flow vector. The optical flow vector of the feature points can be calculated separately on both sides of the camera. This provides control strategy for obstacle avoidance on unmanned platform based on optical flow. Theoretically, when the ratio of optical flow is always equal to 1, the optical flow on both sides is balanced and the camera is moving straight forward. 
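A minimal sketch of this balance strategy is given below, using OpenCV's dense Farnebäck flow purely as one possible estimator among the classes listed above; the parameter values are illustrative assumptions.

```python
import cv2
import numpy as np

def flow_balance(prev_gray, curr_gray):
    """Return the ratio of summed flow magnitudes on the left and right image halves."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=2)            # per-pixel flow magnitude
    half = mag.shape[1] // 2
    left, right = mag[:, :half].sum(), mag[:, half:].sum()
    return left / (right + 1e-9)                  # ratio ~1 -> balanced, straight forward motion

# Usage on two consecutive greyscale frames (e.g. from cv2.VideoCapture):
# ratio = flow_balance(frame_t0, frame_t1)
# a ratio far from 1 indicates more flow (a nearer surface) on one side to be steered away from
```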
When the ratio of optical flow is not equal to 1, the imbalance of optical flow is occurred, then the approximate location of obstacles can be detected. The obstacle detection process based on optical flow include: • determining the location of pole FOE; 2.2 Bio-inspired and optical flow visual perception 48 • calculating time-to-collision TTC [36]; • estimating the relative depth of the scene. The main disadvantage of detecting obstacles based on optical flow is that light has a big influence on this method. The method is used in an indoor environment or other environments which has a more uniform illumination. Moreover from frame to frame the flow is proportional to the angle to the frontal direction. This makes these methods useful for wall following or corridor centering applications but difficult to use for frontal obstacle avoidance. 2.2.2 Size expansion Figure 2.27: Example of feature based size expansion When the size of the object in front of the UAV is expanding, it means that the object is approaching the UAV. The response is proportional to the TTC by measuring the expansion of an obstacle. This approach can compensate for a lack of parallax in optical flow that makes it difficult to use for frontal obstacle avoidance. The scale of the object inversely corresponds to the distance from the camera. For example, an object that doubles in size from one frame to the next is half as far away (if the camera is moving directly toward it). If the camera and an object are moving directly towards each other on the z-axis (one might be stationary, only the relative motion is important), then this rate of change in scale is the only value necessary to compute the time to collision. The relationship becomes slightly more complicated if the camera and object are not moving directly towards each other (relatively). An object that doubles in size from one frame to the next will still be half as far away from the camera. However, the distance between the camera and the object may not be changing at a constant rate. 2.2 Bio-inspired and optical flow visual perception 51 as follows: XC = Xa cos βC − (Vx cos βC + Vz sin βC)tCPA) XC = Xa sin βC − (Vx sin βC + Vz cos βC)tCPA) (2.39) Substituting the expressions of XC and ZC into the reciprocal and ratio of the expressions for x̂ and Ŝx in 2.38 and considering CPAX = XaR one gets: 1 S̄x = sin βC f CPAx 2 − Vx sin βC + Vz cos βC 2fR tCPA x̄ S̄x = cos βC CPAx 2 − Vx cos βC − Vz sin βC 2R tCPA (2.40) In this system of equations the unknowns are CPAx and tCPA and the time varying terms are x̄, S̄x, tCPA. The other terms such as f , βC , Vx, Vz and R are all constant. Considering this and tCPA = tC − t one gets (t is actual time, tC is the time when intruder is closest to own MAV (it is constant)): 1 S̄x = sin βC f CPAx 2 − a1tC + a1t = c1 + a1t x̄ S̄x = cos βC CPAx 2 − a2tC + a2t = c2 + a2t (2.41) Making a simple least squares optimal line fit to the measured 1 S̄x , x̄ S̄x , t parameters will give a1, a2, c1, c2 and the following system of equations. sinβCf −a1 cos βC −a2 CPAx2 c  = c1 c2  (2.42) CPAx and tC (and so tCPA) can be easily obtained from this. Considering now a steady obstacle means that there will be no side velocity component (V x = 0) and the forward component V z is known as the own ground relative velocity. This leads to simplified expressions in 2.40 but has no effect on 2.41 and 2.42. However, knowing the value of Vz makes it possible to estimate the R absolute size of the obstacle considering a1, a2 (if Vx = 0). 
a1 = Vz cos βC 2fR , a2 = −Vz sin βC 2R R = Vz4 cos βc a1f − sin βC a2  (2.43) 2.2 Bio-inspired and optical flow visual perception 52 Knowing the size of the obstacle makes it possible to estimate the absolute side distance Xa and considering tCPA and the forward velocity Vz the forward absolute distance Za can be also estimated Xa = R · CPAx Za = Vz · tCPA Even if the algorithm was originally designed to work with Air-crafts it can be successfully applied to the context of drones obstacle avoidance as we will see in Chapter 4. Chapter 3 Autonomous Navigation and Obstacle avoidance The autonomous navigation problem is typically divided into three main chal- lenges: • Localisation: Where are the MAVs relative to their environment and each other? • Mapping: What does the environment look like? • Navigation: What path must the MAV follow in order reach a target location? Also, given a path, what control commands are required to follow it? We have broadly introduced the localization and mapping problem (SLAM) re- lying on visual perception in the previous Chapter. In general other sensor can be used together with the camera to enhance both localization and mapping. Given a proper localization and an accurate map,the navigation process can be executed. In this chapter the problem of autonomous navigation is introduced and discussed in the context of MAVs. The topics introduced in this chapter will help us to get a better understanding of the developed simulations presented in Chapter 4. 53 3.1 High level architecture 56 process [40]. Sensors collect information of the environment that is processed by the method to compute a motion command The output of the obstacle avoidance module is a reference velocity vector, which is fed as a reference to the low-level controls. 3.1.4 Autopilot The low-level control, sometimes reffered as Autopilot, typically consists of cascaded layers of feedback controllers, that stabilize the MAV, provide adequate handling characteristics, and in most cases control the attitude and velocity of the MAV. In Figure 3.1 the autopilot is split in two blocks namely Attitude controller and Position controller and as we will see it directly interact with collision avoidance module. Depending on particular implementation taking into consideration the computa- tional demand the various blocks can be split between a ground-control-station and aboard drone. Tipically the autopilot software run aboard on resource-constrained embedded hardware sending and receiveng information from ground-control-station on which the others block functionality are implemented. For design and simulation purpose of the various blocks we need a model includ- ing the kinematic and dynamic equations of motion of the MAV modeled as a rigid body with 6 DOF, that explain how aerodynamic forces and moments acts on it. Since in this work we will focus mostly in obstacle avoidance we will not go in to details for path planning, and path following or trajectory execution block. In the following sections we will focus on the block modelling the quadrotor dynamics and the autopilot directly connected to the obstacle avoidance functionality. It is essential to note that this means the system we implement still assumes a higher level path planner that feeds it with way points, a positioning system so we always know the position of the drone, and that there is a low level controller seeing to that the drone moves with the desired velocity within reasonable deviations due to inertia. 
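As a small illustration of the interface just described, and assuming positions are expressed in a common world frame and the low-level controller tracks velocity references, a waypoint can be turned into a bounded reference-velocity vector as follows. The function name, cruise speed and arrival radius are assumptions made for this sketch, not the thesis implementation.

```python
import numpy as np

def reference_velocity(position, waypoint, cruise_speed=1.0, arrive_radius=0.5):
    """Reference velocity vector to be fed to the low-level (velocity/attitude) controller."""
    error = np.asarray(waypoint, dtype=float) - np.asarray(position, dtype=float)
    dist = np.linalg.norm(error)
    if dist < arrive_radius:                 # close enough: request the next waypoint
        return np.zeros(3), True
    v_ref = cruise_speed * error / dist      # constant-speed motion along the straight segment
    return v_ref, False

# Example: drone at the origin, next waypoint 10 m ahead along x, 2 m above ground (NED z < 0).
v, reached = reference_velocity([0, 0, -2], [10, 0, -2], cruise_speed=1.5)
print(v, reached)                            # [1.5 0. 0.] False
```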
3.2 Quadrotor model 57 3.2 Quadrotor model Quadrotor vehicles possess certain essential characteristics, which highlight their potential for use in search and rescue applications. Characteristics that provide a clear advantage over other flying MAVs include their Vertical Take Off and Landing (VTOL) and hovering capability, as well as their ability to make slow precise move- ments. There are also definite advantages to having a four rotor based propulsion system, such as a higher payload capacity, and impressive maneuverability, particu- larly in traversing through an environment with many obstacles, or landing in small areas. Figure 3.2: Conceptual diagram of a quadrotor As illustrated by the conceptual diagram in Figure 3.2, the quadrotor attitude is controlled by varying the rotation speed of each motor. The front rotor (Mf ) and back rotor (Mb) pair rotates in a clockwise direction, while the right rotor (Mr) and left rotor (Ml) pair rotates in a counter-clockwise direction. This configuration is devised in order to balance the drag created by each of the spinning rotor pairs. Figure 3.3 shows the basic four maneuvers that can be accomplished by changing the speeds of the four rotors. By changing the relative speed of the right and left rotors, the roll angle of the quadrotor is controlled. Similarly, the pitch angle is controlled by varying the relative speeds of the front and back rotors, and the yaw angle by varying the speeds of clockwise rotating pair and counter-clockwise rotating pair. Increasing or decreasing the speeds of all four rotors simultaneously controls the collective thrust generated by the robot. A roll motion can be achieved while hovering by increasing the speed of the right rotor, while decreasing the speed of the left rotor by the same amount. Hence, the overall thrust is kept constant. 3.3 Mathematical modelling of quadrotor 58 Figure 3.3: Quadrotor dynamics. 3.3 Mathematical modelling of quadrotor Mathematical modelling provides a description of the behaviour of a system. The flight behaviour of a quadrotor is determined by the speeds of each of the four motors, as they vary in concert, or in opposition with each other. Hence, based on its inputs, a mathematical representation of the system can be used to predict the position and orientation of the quadrotor. The same can further be used to develop a control strategy, whereby manipulating the speeds of individual motors results in achieving the desired motion. To derive the full mathematical model of the quadrotor, we need to define its kinematics and dynamics first. The kinematic equations provide a relation between the vehicle’s position and velocity, whereas the dynamic model defines the relation governing the applied forces and the resulting accelerations. 3.3.1 Reference frames Before getting into the equations of kinematics and dynamics of the quadrotor, it is necessary to specify the adopted coordinate systems and frames of reference, as well as how transformations between the different coordinate systems are carried out. The use of different coordinate frames is essential for identifying the location and attitude of the quadrotor in six degrees of freedom (6 DOF). 
For example, in order to evaluate the equations of motion, a coordinate frame 3.3 Mathematical modelling of quadrotor 61 Therefore,  φ̇ θ̇ ψ̇  Fb =  1 0 sθ 0 cφ sφcθ 0 −sφ cφcθ   φ̇ θ̇ ψ̇  Fv It follows that,  φ̇ θ̇ ψ̇  Fv =  1 sφ tan θ cφ tan θ 0 cφ −sφ 0 sφ sec θ cφ sec θ   φ̇ θ̇ ψ̇  Fb (3.2) Equations 3.1 and 3.2 represent the quadrotor’s equations of motion 3.3.3 Quadrotor dynamics To build the dynamic model of the quadrotor we will use Newton-Euler formal- ism, while adopting the following assumptions: 1. The quadrotor structure is a rigid body. 2. The quadrotor frame is symmetrical. 3. The COG of the quadrotor coincides with the center of the rigid frame. Figure 3.5: Moment of inertia. The moment of inertia is calculated by assuming the quadrotor as a central sphere of radius r and mass M0 surrounded by four point masses representing the motors. Each motor is supposed to have a massm and attached to the central sphere through an arm of length l, as shown in Figure 3.5. Due to the symmetry of the quadrotor about all three axes, its inertial matrix becomes symmetrical and is defined by J =  jx 0 0 0 jy 0 0 0 jz  3.3 Mathematical modelling of quadrotor 62 where jx = jy = jz = 2M0r 2 5 + 2l 2m The dynamics of the quadrotor under external forces applied to its COG and expressed in the body frame is derived by applying Newton-Euler formulation Beard [41].MI3×3 0 0 I3×3 P̈Fb Ω̈Fb + Ω̇Fb ×MṖFb Ω̇Fb × JΩ̇Fb  = FFb τFb  whereM is the quadrotor’s total mass, and F T = [fx, fy, fz] and τT = [τφ, τθ, τψ] are the external force and torque vectors applied on the quadrotor’s COG. The terms τφ, τθ, and τψ are the roll, pitch and yaw torques respectively. Thus, the translational dynamic model can be written as p̈x p̈y p̈z  Fb =  ψ̇ṗy − θ̇ṗz φ̇ṗz − ψ̇ṗx θ̇ṗx − φ̇ṗy  Fb + 1 M  fx fy fz  Fb (3.3) while the rotational model is  φ̈ θ̈ ψ̈  = J−1   0 ψ̇ −θ̇ −ψ̇ 0 φ̇ θ̇ −φ̇ 0  J  φ̇ θ̇ ψ̇ +  τφ τθ τψ   =  jy−jz jx θ̇ψ̇ jz−jx jy φ̇ψ̇ jx−jy jz φ̇θ̇  Fb +  1 jx τφ 1 jy τθ 1 jz τψ  Fb (3.4) 3.3.4 Aerodynamics forces and torques With the derived kinematic and dynamic model, we will now define the forces and torques acting on the quadrotor. The forces include the aerodynamic lift gen- erated by each rotor, and the gravitational pull acting in counter to the total lift generated. The moments are the torques generated in order to achieve the roll, pitch and yaw movements. The following forces and torques are produced: Upward Force (Thrust): The total quadrotor thrust is the sum of the thrust produced by each propeller, as depicted in Figure 3.6(a) T = Tf + Tr + Tb + Tl 3.3 Mathematical modelling of quadrotor 63 (a) (b) (c) (d) Figure 3.6: Forces and moments acting on the quadrotor: (a) Quadrotor thrust; (b) Rolling torque; (c) Pitching torque; and (d) Yawing torque. Rolling Torque: This is the torque produced by increasing the left rotor’s thrust while decreasing that of the right rotor, or vice versa, as shown in Figure 3.6(b): τφ = l(Tl − Tr) Pitching Torque: The pitching torque in Figure 3.6(c) is produced by increasing the front rotor’s thrust while decreasing that of the back rotor, or vice versa: τθ = l(Tf − Tb) Yawing Torque: The yawing torque is the result of all four individual torques generated due to the spinning rotors. 
The front and back rotors spin in the clockwise direction, while the left and right rotors spin in the counterclockwise direction. As shown in Figure 3.6(d), an imbalance between these two pairs results in a yawing torque causing the quadrotor to rotate about its z-axis: τψ = τf + τb − τr − τl Gravitational Force (weight): Along with the other forces, the gravitational Part II Proposed method and simulation 66 Chapter 4 Proposed method and simulation In this chapter basing on the software architecture presented in the previous Chapter, and referring to the visual perception technique presented in Chapter 2 two possible implementations of the obstacle avoidance block based on visual perception discussed in Chapter 3. The first one relying on parameter estimation in particular TTC & CPA estimation introduced in Section 2.2.3. The second one relying on 3D reconstruction based on CNN depth estimation architecture reported in Section 2.1.4. 4.1 TTC & CPA based Obstacle Avoidance To prove the effectiveness of the algorithm of TTC and CPA estimation applied o the contest of autonomous obstacle avoidance for a quadcopter following a straight line we have implemented a Matlab Simulink model. The results coming from TTC and CPA estimation process relative to a given obstacle in a fixed position, was used to modify the drone direction in a waypoint based navigation to avoid the obstacle. 4.1.1 Camera model For the acquisition of images a camera model was developed using a Matlab function based on the theory presented in Chapter 1. The position of the camera coincides with that of the center of the drone, but the camera reference system differs from the reference system body of the drone used to indicate the position and orientation of the drone in the 3D space ( Section 3.3.1) as: • it is rotated 45o with respect to the latter since the optical axis of the camera 67 4.1 TTC & CPA based Obstacle Avoidance 68 forms a 45o angle with the drone arm so that this does not obstruct the view, • rhe Z axis and the Y axis are inverted, in fact remember that in the reference system the Z axis coincides with the optical axis. Cylindrical point clouds, whose dimension position and orientation can be set by the user was created to simulate the presence of an obstacle in three-dimensional space. The camera model is characterized by a focal length f , the size of the image planeW , L and the size of the pixelsWpix, Lpix. It receives as input the point cloud, the position and orientation of the camera and on the basis of this information, taking advantage of the projection formulas introduced in the Section, provides an output vector containing the coordinates of the points x and y of the image plane points useful for the subsequent image processing phase. The camera model was tested by translating and rotating the camera in a test scenario and evaluating the variation of the obstacle projection in the image plane, as shown in the Figure 4.1. Figure 4.1: Projection of obstacle poit-cloud on image plane to model 4.1 TTC & CPA based Obstacle Avoidance 71 Figure 4.3: Quadrotor Simulink model with SimMechanics Section 3.4 by changing the desired value according to the outer loop dynamic a desired trajectory could be followed. 4.1.4 Overall architecture and Results for TTC and CPA estimation Before implementig the obstacle avoidance we have tested the TTC and CPA estimation for the drone following a straight line towards cylindrical obstacle radius R = 3 m and height H = 30 m positioned at [x, y, z] = [60, 0, 0]. 
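The estimation step itself follows the line-fit formulation of Equations 2.41-2.43 in Section 2.2.3: the measured 1/S̄x and x̄/S̄x sequences are fitted as straight lines in time, and CPAx, tC (and, for a steady obstacle with known Vz, the size R) are recovered from the fitted slopes and intercepts. A condensed NumPy sketch of that fit is shown below; variable names are ours and the Simulink implementation is not reproduced.

```python
import numpy as np

def fit_ttc_cpa(t, Sx, x_bar, beta_c, f, Vz=None):
    """Least-squares line fits of Eq. 2.41 and recovery of CPAx, tC (and R if Vz is known)."""
    A = np.column_stack([np.ones_like(t), t])           # model: c + a*t
    c1, a1 = np.linalg.lstsq(A, 1.0 / Sx,   rcond=None)[0]
    c2, a2 = np.linalg.lstsq(A, x_bar / Sx, rcond=None)[0]

    # Eq. 2.42: solve the 2x2 system for [CPAx/2, tC].
    M = np.array([[np.sin(beta_c) / f, -a1],
                  [np.cos(beta_c),     -a2]])
    half_cpa, t_c = np.linalg.solve(M, [c1, c2])
    cpa_x = 2.0 * half_cpa

    R = None
    if Vz is not None:                                   # steady obstacle, Vx = 0 (Eq. 2.43)
        R = Vz / 4.0 * (np.cos(beta_c) / (a1 * f) - np.sin(beta_c) / a2)
    return cpa_x, t_c, R

# Usage with the measured image size Sx(t) and centroid x_bar(t) of the obstacle blob:
# cpa_x, t_c, R = fit_ttc_cpa(t, Sx, x_bar, beta_c, f, Vz)
# t_cpa = t_c - t[-1]              # remaining time to the closest point of approach
# Xa, Za = R * cpa_x, Vz * t_cpa   # absolute side and forward distances
```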
In the carried Figure 4.4: Overall architecture out simulation the drone moves from a a given starting point A towards a target B, following a straight path. The trajectory is defined incrementally through a succession of waypoints that join point A to B. Waypoints are update in a way 4.1 TTC & CPA based Obstacle Avoidance 72 such that drones proceed with a constant velocity. Before starting the movement towards the target, the drone is brought to a reference height and only after this operation the TTC and CPA estimation process is started. Figure 4.5, show the obtained TTC, CPA, size of the object and distance of the object on the trajectory coordinate system over time. (a) (b) (c) (d) Figure 4.5 4.1.5 Collision decision and obstacle avoidance When the collision is detected by the image processing block, given the estimated size, distance and angle of approach, the next way point is placed on the side of the obstacle at a safe distance to ensure that the collision is avoided. Once arrived at the waypoint, the navigation to the goal is resumed. In Figure 4.6 is shown the resulting trajectory for different starting point. 4.2 3D point cloud computation from CNN and safe motion 73 Figure 4.6: TTC & CPA Collision avoidance results 4.2 3D point cloud computation from CNN and safe motion The use of depth estimation obtained from a single image using CNNs for au- tonomous navigation is a particularly interesting and promising research direction. As seen in Section 2.1.4, in fact, thanks to the training process, CNNs allow us to estimate depth much faster than conventional methods, thus opening the possibility of implementing reactive control strategies to avoid obstacles. In this work we have run the model presented in Section 2.1.4.6 on a CPU, using the estimated depth in an off-line fashion simulating only the first step of a possible obstacle avoidance pro- cedure. However having the required computational power it is possible to iterate the decision process. After acquiring the image from an indoor environment, with a common smartphone camera we have computed the depth-map using the Python implementation of the CNN architecture. The predicted depth-map was processed in Matlab to extract a 3D point cloud representation of the surrounding using the camera parameter. For a more complete representation of the surrounding the pro- cess was iterated over the frames of video acquired with the same smartphone. The point-cloud were stitched together throw an ICP algorithm. Basing on the approx- imated orientation and position of the camera during the acquisition the stitched point clouds were then expressed in world coordinate frame supposing the origin of the axis of the 3D world being positioned of the feet of the operator. The obtained local map is blurred because it is affected from the noise coming both from the unknown precise orientation of camera in space and the estimation process. Taking into consideration the estimation error the map was then used to simulate a safe 4.2 3D point cloud computation from CNN and safe motion 76 where d represents the size of the film (or sensor) in the direction measured Consider the following schematic representation of the capture of a depth sample with a 3D camera. The distance di (orthogonal to the x axis) from the object i (a red dot in Figure 4.9) to the camera’s sensor is the one that is stored as a pixel value in the depth map. 
Figure 4.9: Scheme for calculation of the horizontal distance Obtaining the x coordinate of said sample i means to calculate the horizontal dis- tance ∆x between the object and the center of the sensor. This can be done with x = ∆x = di tan(γi) where di is the depth distance of sample i and γi is the angle between the object i and the sensor. How to calculate the angle γi? Assuming each pixel column of the depth map represents an angle γi that is linearly proportional to the horizontal field of view angle θh we have that: γi = α + ciθh nc Where ci is the column of sample i in the depth map, nc is the total number of columns and α is the angle of the first sample. The schematic representation of the vertical FOV is depicted in Figure 4.10. For the y-coordinate the analysis is very similar so we have that: ∆y = di tan(γi) γi = α + riθv nr α = 2π − θv2 4.2 3D point cloud computation from CNN and safe motion 77 Figure 4.10: Scheme for calculation of vertical distance The obtained point cloud is reported in Figure 4.11. Figure 4.11: Point cloud obtained with the second method 4.2.3 Point-cloud fusion and ICP To construct a larger 3-D view of the scene we stitches together a collection of point clouds that was obtained applying CNN depth estimation over a video sequence. We applied Iterative Closest Point (ICP) algorithm in Matlab to two succes- sive point-clouds using the pcregistericp matlab function thath can be foun in the Computer Vision System toolbox. ICP ICP take as input two point-clouds, referring to the same scene but rotated and translated, this assumption is valid with respect to the case of two consecu- tive point-cloud corresponding to two consecutive frames of a video sequence . We 4.2 3D point cloud computation from CNN and safe motion 78 will call the original/first point cloud from now as the target point cloud the trans- formed point cloud as the source point cloud. ICP is a registration technique that uses geometry information (X, Y, Z) and not intensity/color to register the source point cloud to the target point cloud. The first step in ICP matching between the target cloud to the source cloud is to determine correspondences between the two clouds. What that means it needs to match a set of points in the source cloud to original points in the target cloud that has merely been transformed (rotated and/or translated). ICP assumes that the closest points are the corresponding points. The point on the target cloud which has the smallest Euclidean distance from a point on the source cloud is the correspondent point. So ICP need to transform the source cloud relative to the target cloud, such that the sum of all point-wise Euclidean distances is minimized. The transformation T is a function of a rotation and a translation. Using matrix algebra, a rotation in 3D space can be represented by a matrix multiplication of a 3D point x with the rotation matrix R. A translation can be represented by the vector addition of a 3D point x and a translation vector t. To fix notation, let there be N points in the subset of points chosen to be used in each ICP iteration. In other words, N points are chosen from each of the source cloud and the target cloud which are assumed to be in correspondence with each other. Let xsourcei represent an arbitrary point in the source cloud and x target i represent its corresponding point in the target cloud. We said that it is the source cloud that has to be transformed. 
We represent this transformation as T (xsourcei ) = Rxsourcei + t Now, if we want to compute the Euclidean distance between the target point and the source cloud point after transforming the source point: xtargeti − T (xsourcei ) = x target i − (Rxsourcei + t) For the entire point cloud, we can compute the total error in the cloud alignment using a transformation T by summing over all the points in the subset considered. This forms the error function f(R, t) that we wish to minimize argmin R,t f(R, t) = 1 N N∑ i=1 xtargeti − (Rxsourcei + t) Sticthing point clouds toghether After getting R and t from pcregistericp func- tion we use the first point cloud as the reference and then apply the estimated 4.2 3D point cloud computation from CNN and safe motion 81 4.2.5 Define a safe path Once we get a coherent 3D representation of the environment in term of point- cloud or Octomap given a waypoint from a high-level path-planner we can evaluate a safe possible motion towards the waypoint. A possible method knowing the drone dimension is to draw a line (or more properly a tunnel of the drone dimension) that connect current location to goal location towards the map and evaluate if any collision occurs verifying if and where the tunnel is occluded. At this location we can place a sub-goal finding the closest free spot and then resume the motion to target iterating the process. This way of reasoning is similiar to 3D ORM method proposed by Vikenmark et al [46]. Figure 4.14: Safe motion execution based on Octomap We are still reacting to local information as seen with TTC and CPA estimation approach but this time with a higher level of detail paying with computational demand which is considerably higher. Conclusion and future work In this master thesis work we have seen two possible solutions for monocular vision based perception for obstacle avoidance purposes. The presented solutions are applied to a quadrotor obstacle avoidance scenario in simulation environment in Matlab. In the first solution the avoidance maneuver was executed placing a waypoint on the basis of CPA and estimated size of the object coming from TTC and CPA estimation framework. In the second solution an Octomap consents to obtain a consistent representation of the sensed environment integrating multiple depth view of the scene from CNN depth estimation. The Octomap threshold was setted on the basis of the error obtained for CNN depth estimations. Basing on Octomap representation a possible obstacle avoidace method was suggested. TTC and CPA estimation framework used require the drone to following straight line at constant velocity, possible future research can be done in the sense of expanding this method for non-straight path of the drone. 82 Part III Appendix 83 A.2 Other variants 86 Keypoint descriptor For the creation of the keypoint descriptor, around the location of the keypoint an 16x16 pixel window is defined. It is then split into 16 sub-blocks of 4x4 pixels. Given the gradient values of the pixels in this window, an 8 bin orientation histogram (45 degrees per bin) is created for each sub-block, gradient orientation is set according to the current keypoint orientation in order to achieve rotation invariance. By now we have an 4x4 array of histograms with 8 orientation bins, that is a 128 element feature vector for each keypoint. Finally the vector is normalized to reduce the effects of varying illumination changes using the vector norm. 
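For completeness, the detector/descriptor pipeline summarised above is available off the shelf; the short OpenCV example below, included purely for illustration (the input file name is a placeholder), detects SIFT keypoints and computes their 128-element descriptors.

```python
import cv2

img = cv2.imread('frame.png', cv2.IMREAD_GRAYSCALE)   # any greyscale input frame

sift = cv2.SIFT_create()                  # available in OpenCV >= 4.4
keypoints, descriptors = sift.detectAndCompute(img, None)

print(len(keypoints))                     # number of detected keypoints
print(descriptors.shape)                  # (num_keypoints, 128): one 128-D vector per keypoint

# Descriptors from two images can then be matched, e.g. with a brute-force L2 matcher:
# matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(desc1, desc2)
```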
A.2 Other variants We now present some of the alternatives to the SIFT algorithm, given its prop- erties some of them were directly inspired by it. Speeded Up Robust Features The Speeded Up Robust Features (SURF) algorithm is directly inspired by SIFT. For the feature detection, it uses a similar process to SIFT, instead of a DoG method to build image pyramids, in SURF, an integer approximation of the determinant of Hessian Figure A.2: SIFT orientation histogram is constructed from specific regions determined by the detector [47] blob detector is used. One of the reasons this algorithm is generally faster than SIFT is that, instead of building pyramids at different scales of the image, the scale of the image is unaltered and different scales of Gaussian masks are used. By not downsampling the image, time is saved. For the feature descriptor, instead of a orientation histogram around the found keypoint, this algorithm, computes the sum of Haar wavelet response around the point of interest, in a circular neighborhood A.2 Other variants 87 that is dependent of scale. The actual orientation of the descriptor is found using the area where the largest sum value was found in the previous step. To get the final descriptor feature vector, a square region around the point and rotated according to the orientation is created, the scale of this region is also dependent of scale. This region is then split into 4x4 square sub-regions, and for each sub-region, Haar wavelet responses are computed. Sum values in both x and y orientations are extracted, each sub-region has then a 4 dimension vector, giving the final descriptor representation of a 64 dimension vector. Binary Robust Independent Elementary Features Binary Robust Independent Elementary Features (BRIEF) [48] is a feature de- scriptor, which follows a binary format for its descriptor format, meaning that in- formation in the patch is retrieved and encoded using binary comparisons between intensities differences between pixels. The binary format has major advantages in terms of comparison between descriptors. In fact, the distance between two descrip- tors can be efficiently computed using Hamming distance, in contrast to floating point descriptors like SIFT and SURF where the distance computation is more costly with Euclidean distance. [48] shows that this algorithm is robust to changes in illumination and also to blur and partial perspective distortions. However, it is sensitive to rotation deformations. ORB The ORB algorithm is presented as an "efficient alternative to SIFT or SURF", in [49] . Some of the disadvantages of SIFT and SURF rely on the fact that gradients of each pixel in the patch need to be computed, also, they are patent-protected, this algorithm tackles both problems. The ORB algorithm is built on top of other two, the Features from Accelerated Segment Test (FAST) algorithm [50] and the BRIEF descriptor [48]. After keypoint detection, a sampling step is performed, which uses 256 pairwise intensity comparisons. In ORB, the group of tests were extracted and built using an image dataset as well as machine learning techniques, while BRIEF uses an elaborate sampling pattern. The ORB algorithm overcomes the lack of rotation invariance of BRIEF by estimating local orientation, using the interest point location and the centroid of its neighborhood. Bibliography [1] M. d. R. de Miguel Molina and V. S. Campos, Ethics and Civil Drones: Euro- pean Policies and Proposals for the Industry. Springer Open, 2018. [2] P. 
Mella, Dai Sistemi al pensiero sistémico: per capire i sistemi e pensare con i sistemi. FrancoAngeli, 1997, vol. 28.
[3] C. Kanellakis and G. Nikolakopoulos, “Survey on computer vision for UAVs: Current developments and trends,” Journal of Intelligent & Robotic Systems, vol. 87, no. 1, pp. 141–168, 2017.
[4] S. Herwitz, L. Johnson, S. Dunagan, R. Higgins, D. Sullivan, J. Zheng, B. Lobitz, J. Leung, B. Gallmeyer, M. Aoyagi, et al., “Imaging from an unmanned aerial vehicle: agricultural surveillance and decision support,” Computers and Electronics in Agriculture, vol. 44, no. 1, pp. 49–61, 2004.
[5] J. W. Langelaan and N. Roy, “Enabling new missions for robotic aircraft,” Science, vol. 326, no. 5960, pp. 1642–1644, 2009.
[6] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” in 2016 Fourth International Conference on 3D Vision (3DV). IEEE, 2016, pp. 239–248.
[7] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” in Advances in Neural Information Processing Systems, 2014, pp. 2366–2374.
[8] P. Bauer, B. Vanek, and J. Bokor, “Monocular vision-based aircraft ground obstacle classification,” in 2018 European Control Conference (ECC). IEEE, 2018, pp. 1827–1832.