Computer vision is a field that includes methods for acquiring, processing, analyzing, and understanding images and, in general, high-dimensional data from the real world in order to produce numerical or symbolic information, e.g., in the form of decisions.
Pattern recognition is a branch of machine learning that focuses on the recognition of patterns and regularities in data, although it is in some cases considered to be nearly synonymous with machine learning.
Image segmentation is the process of partitioning a digital image into multiple segments (sets of pixels, also known as superpixels). The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries (lines, curves, etc.) in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics.
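In its simplest form, assigning a label to every pixel can be done by thresholding: pixels above an intensity threshold receive one label, the rest another. A minimal sketch on a toy image (the threshold value and image are illustrative):

```python
import numpy as np

# Toy grayscale "image": a bright 4x4 square on a dark background.
image = np.zeros((8, 8), dtype=np.uint8)
image[2:6, 2:6] = 200

# Global thresholding: every pixel above the threshold gets label 1,
# every other pixel gets label 0. Pixels with the same label share
# the same intensity characteristic.
threshold = 128
labels = (image > threshold).astype(np.uint8)

print(labels.sum())  # 16 foreground pixels (the 4x4 square)
```

Real segmentation methods replace the global threshold with adaptive criteria, clustering, or region growing, but the output has the same shape: one label per pixel.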
Semantic segmentation (or pixel classification) assigns one of a set of predefined class labels to each pixel. The input image is divided into regions that correspond to the objects of the scene or to "stuff". In the simplest case, pixels are classified w.r.t. their local features, such as color and/or texture. Markov Random Fields can be used to incorporate inter-pixel relations.
Object detection is a computer technology related to computer vision and image processing that deals with detecting instances of semantic objects of a certain class (such as humans, buildings, or cars) in digital images and videos. Well-researched domains of object detection include face detection and pedestrian detection. Object detection has applications in many areas of computer vision, including image retrieval and video surveillance.
- [DATASET] NYUv2-context dataset - subset of NYU-depth-v2 dataset for detection of indoor objects
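A classical baseline for detecting object instances is the sliding-window approach: score a fixed-size window at every image position with a classifier and report the best-scoring locations. The sketch below stands in for the classifier with mean window brightness, which is an assumption for illustration only:

```python
import numpy as np

# Toy "detector": slide a fixed-size window over the image and score
# each position. A real detector would run a trained classifier
# (e.g., on HOG features) per window instead of mean brightness.
image = np.zeros((10, 10))
image[4:7, 5:8] = 1.0          # a bright 3x3 "object"
win = 3

best_score, best_pos = -1.0, None
for y in range(image.shape[0] - win + 1):
    for x in range(image.shape[1] - win + 1):
        score = image[y:y+win, x:x+win].mean()
        if score > best_score:
            best_score, best_pos = score, (y, x)

print(best_pos)  # (4, 5): top-left corner of the detected window
```

Modern detectors replace the exhaustive scan with region proposals or fully convolutional scoring, but the window-plus-score structure is the same idea.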
Foreground detection is one of the major tasks in the field of Computer Vision whose aim is to detect changes in image sequences. Many applications do not need to know everything about the evolution of movement in a video sequence, but only require the information of changes in the scene.
Foreground detection separates these changes, which take place in the foreground, from the background. It is a set of techniques that typically analyze video sequences in real time, recorded with a stationary camera.
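A common baseline for a stationary camera is background subtraction with a running-average background model: the model is updated as B ← (1−α)B + αF for each frame F, and pixels far from the model are flagged as foreground. A minimal sketch on synthetic frames (all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Static background with mild sensor noise, as seen by a stationary camera.
background = np.full((6, 6), 50.0)
frames = [background + rng.normal(0, 1, background.shape) for _ in range(20)]

# A foreground object appears in the last frame.
frames[-1][2:4, 2:4] += 100

# Running-average background model: B <- (1 - alpha) * B + alpha * F.
alpha = 0.1
model = frames[0].copy()
for f in frames[:-1]:
    model = (1 - alpha) * model + alpha * f

# Foreground mask: pixels that differ from the model by more than a threshold.
mask = np.abs(frames[-1] - model) > 30
print(mask.sum())  # 4 foreground pixels
```

More robust methods model each pixel with a mixture of Gaussians so that the threshold adapts to per-pixel noise, but the update-then-compare loop is the same.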
Recognizing human activities has been one of the most challenging problems in computer vision research. However, solving the problem with video data alone has limitations, in particular for real-world videos exhibiting a great degree of variation. The Web contains a rich source of information about video content, such as the video title, description, category, and related videos. Motivated by these observations, we use a collection of web data to learn activities in web videos.
We first collect a dataset by gathering video clips that contain human activities together with their surrounding web text data. Next, we represent a video using a spatio-temporal graph with features of sparse coefficients, and define visual context kernels for measuring the similarity between two spatio-temporal graphs. We represent the surrounding text data using a histogram of n-gram keywords obtained from a modified TF-IDF weighting scheme. Text context kernels are also defined to compute the similarity between two text documents. Finally, we compute the optimal combination of all kernels using multiple kernel learning for activity recognition. Experimental results show that textual data can aid visual recognition, and that the combination of visual and textual learning outperforms either modality alone for activity recognition.
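The text side of this pipeline can be sketched as follows: build a histogram of n-gram keywords weighted by TF-IDF, then compare two histograms with a kernel. The sketch below uses the standard TF-IDF formula and a cosine kernel as stand-ins; the paper's "modified" weighting and actual kernel choice may differ, and the documents are toy data:

```python
import math
from collections import Counter

# Toy web-text documents accompanying three videos (illustrative data).
docs = ["man riding a horse in a field",
        "person riding a bicycle on a road",
        "cat sleeping on a sofa"]

def ngrams(text, n=1):
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Document frequency of each n-gram over the collection.
df = Counter()
for d in docs:
    df.update(set(ngrams(d)))

def tfidf_hist(text):
    # Histogram of n-gram keywords, weighted by tf * log(N / df).
    tf = Counter(ngrams(text))
    return {g: c * math.log(len(docs) / df[g]) for g, c in tf.items()}

def cosine_kernel(h1, h2):
    dot = sum(h1[g] * h2.get(g, 0.0) for g in h1)
    n1 = math.sqrt(sum(v * v for v in h1.values()))
    n2 = math.sqrt(sum(v * v for v in h2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

h0, h1, h2 = (tfidf_hist(d) for d in docs)
print(cosine_kernel(h0, h1) > cosine_kernel(h0, h2))  # the two "riding" docs are closer
```

Multiple kernel learning would then learn a weighted sum of such text kernels and the visual context kernels, with the weights optimized jointly with the classifier.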
3D reconstruction from multiple images is the creation of three-dimensional models from a set of images. It is the reverse process of obtaining 2D images from 3D scenes.
The essence of an image is a projection from a 3D scene onto a 2D plane, during which process the depth is lost. The 3D point corresponding to a specific image point is constrained to be on the line of sight. From a single image, it is impossible to determine which point on this line corresponds to the image point. If two images are available, then the position of a 3D point can be found as the intersection of the two projection rays. This process is referred to as triangulation. The key to this process is the relations between multiple views, which convey the information that corresponding sets of points must contain some structure, and that this structure is related to the poses and calibration of the cameras.
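Triangulation from two views can be sketched with the standard linear (DLT) method: each image point contributes two linear constraints on the homogeneous 3D point, and the stacked system is solved by SVD. The camera matrices below are illustrative assumptions:

```python
import numpy as np

# Two hypothetical calibrated camera projection matrices (3x4):
# camera 1 at the origin, camera 2 translated along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

# A known 3D point, projected into each view to get image points.
X_true = np.array([0.5, 0.2, 4.0, 1.0])
x1 = P1 @ X_true; x1 /= x1[2]
x2 = P2 @ X_true; x2 /= x2[2]

# Linear triangulation (DLT): each view gives two linear constraints
# on the homogeneous point X; solve the homogeneous system by SVD.
A = np.vstack([
    x1[0] * P1[2] - P1[0],
    x1[1] * P1[2] - P1[1],
    x2[0] * P2[2] - P2[0],
    x2[1] * P2[2] - P2[1],
])
_, _, Vt = np.linalg.svd(A)
X = Vt[-1]
X /= X[3]  # de-homogenize

print(np.allclose(X[:3], X_true[:3]))  # recovered point matches
```

With noisy correspondences the two rays no longer intersect exactly, and the SVD solution minimizes an algebraic error; refined methods minimize reprojection error instead.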
In recent decades, there has been an important demand for 3D content in computer graphics, virtual reality, and communication, triggering a change in emphasis in the requirements. Many existing systems for constructing 3D models are built around specialized hardware (e.g. stereo rigs), resulting in a high cost that cannot satisfy the requirements of these new applications. This gap stimulates the use of ordinary digital imaging facilities (such as a camera). Moore's law also tells us that more work can be done in software. An early method was proposed by Tomasi and Kanade, who used an affine factorization approach to extract 3D structure from image sequences. However, the assumption of orthographic projection is a significant limitation of this system.
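The core of the Tomasi-Kanade factorization is that, under orthographic projection, the centered 2F×P measurement matrix of P points tracked over F frames has rank at most 3, so SVD splits it into a motion factor and a shape factor. A minimal sketch on synthetic noise-free data (the recovered factors are only defined up to an invertible 3×3 ambiguity, which a metric-upgrade step would resolve):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic setup: P 3D points observed by F orthographic cameras.
F, P = 4, 10
points = rng.normal(size=(3, P))
W_rows = []
for _ in range(F):
    M = rng.normal(size=(2, 3))   # 2x3 orthographic camera per frame
    W_rows.append(M @ points)
W = np.vstack(W_rows)                  # 2F x P measurement matrix
W = W - W.mean(axis=1, keepdims=True)  # center rows (removes translations)

# Tomasi-Kanade: a noise-free centered orthographic measurement
# matrix has rank 3, so a rank-3 SVD factors it into motion and shape.
U, s, Vt = np.linalg.svd(W)
motion = U[:, :3] * s[:3]  # 2F x 3
shape = Vt[:3]             # 3 x P

print(np.allclose(motion @ shape, W))  # rank-3 factorization reproduces W
```

With noisy tracks the truncated SVD gives the best rank-3 fit in the least-squares sense, which is what makes the method robust in practice despite the orthographic assumption.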