Rethinking 3D Reconstruction
1. SfM (Structure from Motion)
- Feature Extraction
  - Find keypoints:
    - Scale-Space Construction: blur the image at multiple scales to generate a set of images -> Result: at this point, the computer possesses a “comprehensive set” of images ranging from sharp to blurry and from large to small.
    - Keypoint Localization: Difference of Gaussians (DoG) and extrema detection -> Result: blobs or corners.
- Feature Matching
  - Descriptor Matching: calculate a feature vector for each keypoint, compare them across images, and remove outliers using geometric constraints (RANSAC) -> Result: verified correspondences (pairs of points linking Image A to Image B).
- Sparse Reconstruction
  - Triangulation & Optimization: project 2D matches back into 3D space (triangulation) and refine via Bundle Adjustment (a minimal end-to-end sketch follows below).

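A minimal OpenCV sketch of this two-view pipeline (assuming two overlapping images and a known intrinsics matrix; the file names and intrinsics values are placeholders):

```python
import cv2
import numpy as np

# Assumed inputs: two overlapping grayscale images and an intrinsics matrix K.
img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)
K = np.array([[1000.0, 0, 640], [0, 1000.0, 360], [0, 0, 1]])  # placeholder intrinsics

# Feature extraction: SIFT computes scale space + DoG keypoints internally.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Descriptor matching with Lowe's ratio test.
matcher = cv2.BFMatcher()
matches = [m for m, n in matcher.knnMatch(des1, des2, k=2)
           if m.distance < 0.75 * n.distance]
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# Geometric verification: RANSAC on the essential matrix removes outliers.
E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)

# Triangulation: lift verified 2D correspondences into sparse 3D points.
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, t])
pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
pts3d = (pts4d[:3] / pts4d[3]).T  # homogeneous -> Euclidean
```

In a full SfM system this runs over many image pairs, and Bundle Adjustment then jointly refines all camera poses and 3D points.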
2. NeRF (Neural Radiance Fields)


- Ray Sampling
  For every pixel, NeRF casts a ray into the scene and samples a set of 3D points along the ray.
- A Model to Predict Density and Color
  The core of NeRF is an MLP that maps each sampled 3D point (plus viewing direction) to:
  - σ (density)
  - RGB color
  This allows NeRF to represent both geometry and appearance.
- Volume Rendering
  NeRF integrates colors and densities using the volume rendering equation to compute the final pixel color:
  $$C(\mathbf{r}) = \sum_i T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i, \qquad T_i = \exp\Big(-\sum_{j<i} \sigma_j \delta_j\Big)$$
  where $\delta_i$ is the spacing between adjacent samples. Points with higher density occlude points behind them, and contributions are accumulated along the ray (see the sketch after this list).
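A minimal NumPy sketch of this quadrature along a single ray; the trained MLP is replaced by hand-made σ/color values, and `composite_ray` is an illustrative name:

```python
import numpy as np

def composite_ray(sigma, rgb, t_vals):
    """Numerical volume rendering along one ray (NeRF quadrature rule).

    sigma:  (N,)   densities at the sampled points
    rgb:    (N, 3) colors at the sampled points
    t_vals: (N,)   distances of the samples along the ray
    """
    delta = np.diff(t_vals, append=t_vals[-1] + 1e10)  # spacing between samples
    alpha = 1.0 - np.exp(-sigma * delta)               # opacity of each segment
    # Transmittance T_i: probability the ray reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1] + 1e-10]))
    weights = alpha * trans                            # contribution of each sample
    return (weights[:, None] * rgb).sum(axis=0)        # final pixel color

# Toy usage: a dense "surface" midway along the ray dominates the pixel color.
t = np.linspace(2.0, 6.0, 64)
sigma = np.where(np.abs(t - 4.0) < 0.1, 50.0, 0.0)
rgb = np.tile([1.0, 0.2, 0.2], (64, 1))
print(composite_ray(sigma, rgb, t))  # ~[1.0, 0.2, 0.2]
```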
3. 3DGS (3D Gaussian Splatting)
- Initialization
  Start with a sparse point cloud generated via SfM (Structure from Motion). These points serve as the initial means (positions) of the 3D Gaussians.
- Gaussian Representation: Geometry (Position & Shape)
  Each point is represented as a 3D anisotropic Gaussian (an ellipsoid). To keep the covariance matrix $\Sigma$ (which defines the shape) valid and differentiable, it is decomposed as $\Sigma = R S S^\top R^\top$, where:
  - Scaling ($S$): how much to stretch the ellipsoid along its 3 axes.
  - Rotation ($R$): represented by a quaternion to define orientation.
  (A sketch of this construction follows the list below.)
- Gaussian Representation: Appearance (Color & Opacity)
  To capture realistic visuals, each Gaussian carries:
  - Opacity ($\alpha$): how transparent or solid the ellipsoid is.
  - Spherical Harmonics (SH): coefficients that represent view-dependent color (allowing shiny surfaces and lighting effects that change as the camera moves).
- Differentiable Rasterization
  A fast rasterization approach:
  - Projection: 3D Gaussians are projected into 2D screen space (becoming 2D splats).
  - Sorting: splats are sorted by depth (front-to-back) using a fast GPU radix sort.
  - Alpha Blending: the sorted splats are composited to form the final image.
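A minimal NumPy sketch of the $\Sigma = R S S^\top R^\top$ construction (the function names are illustrative, not from the official implementation):

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)  # normalize so R is a valid rotation
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def covariance(quat, log_scales):
    """Sigma = R S S^T R^T: always symmetric positive semi-definite,
    so the optimizer can update the quaternion and scales freely."""
    R = quat_to_rotmat(quat)
    S = np.diag(np.exp(log_scales))  # exp keeps the scales positive
    return R @ S @ S.T @ R.T

# Toy usage: an ellipsoid stretched along x, rotated 90 degrees about z.
quat = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
print(covariance(quat, np.log([2.0, 0.5, 0.5])))  # long axis now along y
```

Parameterizing $\Sigma$ this way is what makes the representation safe to optimize: a raw 3x3 matrix could drift into something that is not a valid covariance, while $R S S^\top R^\top$ cannot.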

4. LSS (Lift, Splat, Shoot)
- Lift
  Each pixel feature corresponds to a 3D ray.
  LSS predicts a probability distribution over depth bins for every pixel.
  Each pixel is then lifted into multiple possible 3D positions using the camera intrinsics/extrinsics.
- Splat
  All lifted 3D features are projected (“splatted”) onto a 2D BEV (bird’s-eye-view) grid.
  Features falling into the same BEV cell are aggregated, producing a BEV feature map.
- Shoot
  Once BEV features are obtained, they can be fed into different heads for different downstream tasks (see the sketch below):
  - 3D object detection
  - BEV segmentation
  - Road layout prediction, etc.
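A minimal NumPy sketch of lift and splat with toy shapes; the depth/feature networks and the real camera geometry are replaced by random placeholders:

```python
import numpy as np

C, D, H, W = 64, 32, 8, 16  # feature channels, depth bins, image size (toy)
G = 50                      # BEV grid is G x G cells

# Lift: per-pixel depth distribution (softmax over D) times per-pixel features
# gives a frustum of depth-weighted features, shape (C, D, H, W).
feats = np.random.randn(C, H, W)
depth_logits = np.random.randn(D, H, W)
depth_prob = np.exp(depth_logits) / np.exp(depth_logits).sum(0, keepdims=True)
frustum = feats[:, None] * depth_prob[None]  # (C, D, H, W) outer product

# Each (depth bin, pixel) pair has a 3D location given by intrinsics/extrinsics;
# here we fake the precomputed BEV cell index of every frustum point.
cell_x = np.random.randint(0, G, size=(D, H, W))
cell_y = np.random.randint(0, G, size=(D, H, W))

# Splat: sum-pool all frustum features that land in the same BEV cell.
bev_flat = np.zeros((C, G * G))
flat_idx = (cell_y * G + cell_x).ravel()        # (D*H*W,)
for c in range(C):
    np.add.at(bev_flat[c], flat_idx, frustum[c].ravel())
bev = bev_flat.reshape(C, G, G)

print(bev.shape)  # (C, G, G) BEV feature map, ready for a task head ("shoot")
```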


5. Occupancy Prediction

6. End-to-End: VGGT
