AISG–SLA Visual Localisation Challenge

Develop AI models to accurately extract camera pose data from 2D monocular images.

USD 35,000 awarded
Jul 2023
335 participants

Problem Description

Your goal in this challenge is to develop AI models that can accurately extract camera pose data from a given set of 2D monocular images.

Participants are strongly encouraged to use AI techniques to address the problem. Solutions that rely solely on hardcoded methods without AI will not be considered.

Visual Localisation

Visual localisation, also known as camera pose estimation, is the process of determining the 6-degree-of-freedom pose (3D position and orientation) of a camera in the world using visual information. Monocular camera pose estimation is a specific type of visual localisation that uses only a single camera to estimate the camera's position and orientation. Given a 2D image captured by the camera, monocular camera pose estimation calculates the transformation matrix that provides information about the camera's position and orientation relative to the 3D world. This task can be challenging due to the limited information provided by a single camera. Nonetheless, it is crucial for many real-world applications, and recent advances in deep learning-based techniques have shown promising results in addressing challenging scenarios such as low-texture environments and occlusions.

Mathematical Formulation

This section introduces some of the geometric concepts of cameras and computer vision in relation to the visual localisation problem.

Diagram visualizing the formulation of the visual localisation problem.

Here are brief explanations about the elements in the above diagram:

  • The world points (in world coordinates) are transformed to camera coordinates using the extrinsic parameters. The camera coordinates are then mapped onto the image plane using the intrinsic parameters.
  • The extrinsic parameters consist of a rotation, \(R\), and a translation, \(t\). The origin of the camera’s coordinate system is at its optical center, and its x- and y-axes define the image plane.
  • The intrinsic parameters (provided as part of the dataset) include the focal length and the optical center, also known as the principal point. The camera intrinsic matrix, \(K\), is defined as: $$ K = \begin{bmatrix} f_x & 0 & o_x \\ 0 & f_y & o_y \\ 0 & 0 & 1 \end{bmatrix} $$ The values \(f_x\) and \(f_y\) are the pixel focal lengths and are identical for square pixels. The \(o_x\) and \(o_y\) values are the offsets of the principal point (optical center) from the top-left corner of the image frame.
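As a sketch of how these parameters fit together, the following projects a world point through the extrinsics and \(K\) into pixel coordinates. All numeric values here are illustrative, not taken from the dataset; the real intrinsics come from the provided JSON file.

```python
import numpy as np

# Illustrative intrinsics (NOT the dataset's values).
fx, fy = 1400.0, 1400.0   # pixel focal lengths
ox, oy = 1232.0, 1028.0   # principal point offsets from the top-left corner

K = np.array([
    [fx, 0.0, ox],
    [0.0, fy, oy],
    [0.0, 0.0, 1.0],
])

# Extrinsics: rotation R and translation t map world points to camera coordinates.
R = np.eye(3)                   # identity rotation for this sketch
t = np.array([0.0, 0.0, 5.0])   # illustrative translation along the optical axis

X_world = np.array([1.0, 2.0, 10.0])

# World -> camera coordinates -> image plane (perspective divide).
X_cam = R @ X_world + t
uvw = K @ X_cam
u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]
print(u, v)  # u ≈ 1325.33, v ≈ 1214.67
```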

The goal of the challenge is to estimate the extrinsic parameters of the camera given images and the intrinsic parameters. More specifically, the challenge metrics consider the relative changes in the extrinsic parameters over sequences of images as the camera moves along a trajectory. The metrics will be defined in greater detail in a later section.

Feature Data


The feature data for this challenge is street-level monocular images in JPEG format.

  • The training dataset consists of 10,007 images
  • The testing dataset consists of 2,219 images

This dataset contains photos captured by a spherical camera system with a 5-megapixel resolution, a Sony IMX264 2/3" CMOS sensor with 3.45 μm pixel size, and 4.4 mm focal-length lenses. The photos were taken in Buangkok and Sengkang towns in Singapore and depict people, cars, buildings, and landscapes, concentrated in HDB residential areas and parks. The training and testing data consist of 5 trajectories: 4 for training and 1 for testing. A trajectory here refers to a sequence of images captured as the camera moves along one continuous path.

Here are two examples of images from the training set:

Example images from the training set. These photos were taken in Buangkok and Sengkang town, Singapore. Note that they have been rotated for improved presentation.

Intrinsic Parameters

The camera intrinsic parameters are provided in a JSON file. For an explanation of what they mean, see the previous "Mathematical Formulation" section.


Timestamps

Timestamps for the images are part of the dataset and may be used as an input to your modeling. For the training set, see the Timestamp column in train_labels.csv. For the test set, see the Timestamp column in submission_format.csv.
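For instance, the interval between consecutive frames can be computed from these timestamps. The `YYYY-MM-DD HH:MM:SS.sss` format below is an assumption based on the example rows in the submission section; verify it against the actual CSV files.

```python
from datetime import datetime

# Two consecutive timestamps in the assumed format.
ts = ["2018-04-26 00:15:23.864", "2018-04-26 00:15:24.766"]
parsed = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f") for s in ts]

# Elapsed time between the two frames, in seconds.
dt = (parsed[1] - parsed[0]).total_seconds()
print(dt)  # 0.902
```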


Labels

The ground truth label data for this challenge is the corresponding camera pose data, provided in a tabular CSV format. It includes the following columns:

  • Filename (str): Filename for the corresponding photograph. Uniquely identifies an observation.
  • TrajectoryId (int): Identifies the trajectory that the observation belongs to.
  • Timestamp (str): Timestamp that the photo was taken, in UTC.
  • Easting (float): Position coordinate in world space, in meters.
  • Northing (float): Position coordinate in world space, in meters.
  • Height (float): Position coordinate in world space, in meters.
  • Roll (float): Camera rotation angle, in degrees.
  • Pitch (float): Camera rotation angle, in degrees.
  • Yaw (float): Camera rotation angle, in degrees.

The (Easting, Northing, Height) values form a vector in world space that denotes the camera's position. The spatial coordinate system used is SVY21 (rather than WGS84). The (Roll, Pitch, Yaw) values describe the camera's orientation, i.e., the direction in which it is pointing.
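To work with these columns in matrix form, one can assemble a 4×4 homogeneous pose from a label row. Note that the Euler-angle convention used below (intrinsic z-y-x, i.e. yaw, then pitch, then roll) is an assumption for illustration, not something the challenge specifies, and the numeric values are placeholders rather than real data.

```python
import numpy as np

def euler_to_matrix(roll, pitch, yaw):
    """Build a rotation matrix from angles in degrees.

    ASSUMPTION: intrinsic z-y-x (yaw, then pitch, then roll) convention.
    Verify the actual convention against the dataset documentation.
    """
    r, p, y = np.radians([roll, pitch, yaw])
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(r), -np.sin(r)],
                   [0, np.sin(r),  np.cos(r)]])
    Ry = np.array([[ np.cos(p), 0, np.sin(p)],
                   [0, 1, 0],
                   [-np.sin(p), 0, np.cos(p)]])
    Rz = np.array([[np.cos(y), -np.sin(y), 0],
                   [np.sin(y),  np.cos(y), 0],
                   [0, 0, 1]])
    return Rz @ Ry @ Rx

# Placeholder label values (not real data): SVY21 position plus Euler angles.
easting, northing, height = 30000.0, 40000.0, 15.0
roll, pitch, yaw = 1.5, -0.3, 90.0

# 4x4 homogeneous pose: rotation block plus (Easting, Northing, Height) position.
T = np.eye(4)
T[:3, :3] = euler_to_matrix(roll, pitch, yaw)
T[:3, 3] = [easting, northing, height]
```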

Submission Format

The submission format is a CSV file containing predictions with the following 9 columns:

  • Filename, TrajectoryId, and Timestamp are index columns. Submissions must include these three columns with the exact values and row order as the example submission format file.
  • Easting, Northing, Height, Roll, Pitch, Yaw are the camera pose values that you are estimating.

The meanings and types of these columns are the same as the label data described in the previous section.

For example, if you were predicting values like this:

| Filename | TrajectoryId | Timestamp | Easting | Northing | Height | Roll | Pitch | Yaw |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 180426_001523864_Camera_0.jpg | 5 | 2018-04-26 00:15:23.864 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 180426_001524766_Camera_0.jpg | 5 | 2018-04-26 00:15:24.766 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 180426_014304396_Camera_0.jpg | 5 | 2018-04-26 01:43:04.396 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

The CSV file you submit would then look like:

180426_001523864_Camera_0.jpg,5,2018-04-26 00:15:23.864,0.0,0.0,0.0,0.0,0.0,0.0
180426_001524766_Camera_0.jpg,5,2018-04-26 00:15:24.766,0.0,0.0,0.0,0.0,0.0,0.0
180426_014304396_Camera_0.jpg,5,2018-04-26 01:43:04.396,0.0,0.0,0.0,0.0,0.0,0.0
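A minimal sketch of producing such a file with pandas. The two rows below are stand-ins for the provided submission_format.csv, which you would normally load so that the index values and row order are guaranteed to match; the pose values are dummies.

```python
import pandas as pd

# Stand-in for the provided submission_format.csv (load the real file in practice).
sub = pd.DataFrame({
    "Filename": ["180426_001523864_Camera_0.jpg", "180426_001524766_Camera_0.jpg"],
    "TrajectoryId": [5, 5],
    "Timestamp": ["2018-04-26 00:15:23.864", "2018-04-26 00:15:24.766"],
    "Easting": 0.0, "Northing": 0.0, "Height": 0.0,
    "Roll": 0.0, "Pitch": 0.0, "Yaw": 0.0,
})

# Overwrite the placeholder pose columns with model predictions (dummy values here),
# keeping the index columns and row order untouched.
pose_cols = ["Easting", "Northing", "Height", "Roll", "Pitch", "Yaw"]
sub.loc[:, pose_cols] = [[30000.0, 40000.0, 15.0, 1.5, -0.3, 90.0],
                         [30001.0, 40001.0, 15.1, 1.4, -0.2, 91.0]]

sub.to_csv("submission.csv", index=False)
```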

Example code for generating a submission from the images using a pretrained DROID-SLAM model is available on the data download page. Please note that this is only an example.

Performance Metrics

The primary and secondary performance metrics for this challenge are the rotational relative pose error and the translational relative pose error of the estimated camera poses, respectively. They are defined below:

$$ \begin{align} E_{rot}(\mathcal{F}) &= \frac{1}{|\mathcal{F}|} \sum_{(i, j) \in \mathcal{F}} \angle [(\mathbf{\hat p_j} \ominus \mathbf{\hat p_i}) \ominus (\mathbf{p_j} \ominus \mathbf{p_i}) ] \\ E_{trans}(\mathcal{F}) &= \frac{1}{|\mathcal{F}|} \sum_{(i, j) \in \mathcal{F}} \lVert (\mathbf{\hat p_j} \ominus \mathbf{\hat p_i}) \ominus (\mathbf{p_j} \ominus \mathbf{p_i}) \rVert_2 \end{align} $$

where \(\mathcal{F}\) refers to the set of frame pairs belonging to a single trajectory and \((i,j)\) denotes two adjacent frames within the sequence. \(\mathbf{\hat p}\) and \(\mathbf{p}\) are the estimated and ground-truth camera poses, respectively, \(\ominus\) denotes the inverse compositional operator, and \(\angle[\cdot]\) represents the rotation angle.

These metrics measure the error in the relative pose between pairs of consecutive estimates. Consequently, the absolute coordinate values you predict matter less than how accurately you estimate the camera's movement between the frames of a given trajectory.
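The formulas above can be sketched for 4×4 homogeneous poses as follows. This is an illustrative implementation of the stated definitions, not the official scoring code; use the metric script from the data download page for actual evaluation.

```python
import numpy as np

def relative_pose_errors(est, gt):
    """Mean rotational (radians) and translational RPE over consecutive frames.

    est, gt: sequences of 4x4 homogeneous camera poses for one trajectory.
    """
    rot_errs, trans_errs = [], []
    for i in range(len(gt) - 1):
        j = i + 1
        # Inverse compositional operator: relative motion between frames i and j.
        rel_est = np.linalg.inv(est[i]) @ est[j]
        rel_gt = np.linalg.inv(gt[i]) @ gt[j]
        # Error transform between estimated and ground-truth relative motion.
        err = np.linalg.inv(rel_est) @ rel_gt
        # Rotation angle recovered from the trace of the 3x3 rotation block.
        cos_angle = np.clip((np.trace(err[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
        rot_errs.append(np.arccos(cos_angle))
        trans_errs.append(np.linalg.norm(err[:3, 3]))
    return float(np.mean(rot_errs)), float(np.mean(trans_errs))
```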

Teams will be ranked based on the primary metric of rotational error. The secondary metric of translational error will be considered in the final evaluation by the judging panel.

A Python script for you to calculate the challenge metrics locally is provided on the data download page.


Public Leaderboard

For all participants, the same portion of the test set is assigned to the public leaderboard. The score displayed on the public leaderboard reflects your model's performance on this portion of the test set.

Private Leaderboard

The private leaderboard is not visible to participants until the competition concludes. The private leaderboard score is a weighted average, at a 1:9 ratio, of your score on the public portion of the test set (the public leaderboard score) and your score on the private portion of the test set.
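For example, under this 1:9 weighting (the scores below are hypothetical metric values):

```python
# Hypothetical scores on the two portions of the test set.
public_score = 0.020    # public portion (public leaderboard score)
private_score = 0.030   # private portion

# Weighted average at a 1:9 public-to-private ratio.
final = 0.1 * public_score + 0.9 * private_score
print(round(final, 6))  # 0.029
```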

Each team can choose up to 5 submissions for consideration before the close of submissions. The private leaderboard score will be the best score among the selected submissions.

Final Evaluation

The five teams with the highest rankings on the private leaderboard will be considered as finalists. The finalists will be invited to submit their AI models and technical reports describing their solutions before the posted deadline. The submitted entries will be reviewed and validated by an independent technical review committee comprising domain experts. The committee will select three teams with the best solutions as winners.

A complete finalist submission must include all of the following materials:

  1. Technical report
  2. AI model weights
  3. Code for training AI model and for running inference on the test data
  4. Code documentation

Please review the following sections for further detailed requirements.

Technical Report Requirements

All reports must:

  • Conform to the IJCAI template.
  • Not exceed four (4) pages in length, including all pictures, tables, and references.

The primary goal of the technical report is to present the contestants' proposed solution. Although there is no fixed content format, it should include an abstract, an introduction, the methodology (including any novelty), and an experimental demonstration (including inference time and hardware configuration).

AI Model Weights

Finalists must provide the trained AI model weights corresponding to their highest-scoring submission on the private leaderboard. You are responsible for keeping your work organized so that you can identify the correct version of model weights to provide.


Code

Finalists will be required to provide the training code and inference code for the model that produced their highest-scoring submission on the private leaderboard. Your code should be logically structured with obvious points of entry and include clear instructions, dependencies, and requirements. There is no mandatory format or structure for the code, but we recommend that you adopt a structure such as DrivenData's prize winner repository template.

Code Documentation

Finalists should include README files that clearly map the major components of their solution methodology to the associated parts of their submitted code. The primary goal is to help reviewers identify which parts of the source code perform which parts of the solution.

Evaluation Criteria

The technical review committee will evaluate the submitted solutions based on their validated performance and novelty.


Additional Resources

Here are two open lists of papers and code for deep learning methods in visual odometry and simultaneous localisation and mapping (SLAM):

You can also refer to this recent academic work: DiffPoseNet.

Good luck! If you have any questions, please email