With the emergence of neural radiance fields (NeRFs), view synthesis quality has reached an unprecedented level. Compared to traditional mesh-based assets, this volumetric representation is more powerful in expressing scene geometry, but it inevitably suffers from high rendering costs and is difficult to use in downstream processes such as editing, which makes it hard to integrate with existing graphics pipelines. In this paper, we present a hybrid volume-mesh representation, VMesh, which depicts an object with a textured mesh along with an auxiliary sparse volume. VMesh retains the advantages of mesh-based assets, such as efficient rendering, compact storage, and easy editing, while also incorporating the ability of its volumetric counterpart to represent subtle geometric structures. VMesh can be obtained from multi-view images of an object and renders with high fidelity at 2K resolution and 60 FPS on common consumer devices, opening new opportunities for real-time immersive applications.
We propose to obtain such a representation from multi-view images of an object in three stages. First, we train a continuous form of the representation, where the surface part is modeled by a neural signed distance field and the volume part is modeled by a neural density field (left). Then we fix the learned signed distance field and extract a triangular mesh from it as a substitute, which is rendered jointly with the neural density field (middle). We use differentiable isosurface extraction and rasterization techniques to obtain high-quality meshes that align well with the implicit geometry. Lastly, we drop all the neural networks and discretize the representation to obtain the final assets for efficient storage and rendering (right). Concretely, the triangular mesh is simplified and UV-parametrized, while the neural density field is voxelized and pruned to a sparse volume, which is then organized by perfect spatial hashing to support fast indexing and compact storage.
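The voxelize-and-prune step of the last stage can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: `toy_density` is a hypothetical analytic stand-in for the learned neural density field, the grid resolution and pruning threshold are arbitrary, and a plain Python dictionary stands in for the perfect spatial hash (it offers sparse storage and constant-time lookup, but it is not a perfect hash and is not laid out for GPU access).

```python
import math


def toy_density(x, y, z):
    """Hypothetical stand-in for the learned density field: a thin spherical shell."""
    r = math.sqrt(x * x + y * y + z * z)
    return math.exp(-((r - 0.5) ** 2) / (2 * 0.05 ** 2))


def voxelize_and_prune(density_fn, resolution=16, threshold=0.1):
    """Sample density_fn at voxel centers in [-1, 1]^3 and keep voxels above threshold.

    Returns a dict mapping integer voxel coordinates (i, j, k) to a density value.
    The resolution and threshold defaults are illustrative assumptions.
    """
    sparse = {}
    step = 2.0 / resolution
    for i in range(resolution):
        for j in range(resolution):
            for k in range(resolution):
                # Center of voxel (i, j, k) in world coordinates.
                x = -1.0 + (i + 0.5) * step
                y = -1.0 + (j + 0.5) * step
                z = -1.0 + (k + 0.5) * step
                d = density_fn(x, y, z)
                if d > threshold:  # prune near-empty voxels
                    sparse[(i, j, k)] = d
    return sparse


def lookup(sparse, i, j, k):
    """Return the stored density, or 0.0 for a pruned (empty) voxel."""
    return sparse.get((i, j, k), 0.0)


sparse_volume = voxelize_and_prune(toy_density)
occupancy = len(sparse_volume) / 16 ** 3
print(f"kept {len(sparse_volume)} of {16 ** 3} voxels ({occupancy:.1%})")
```

Because the toy density is concentrated on a thin shell, most voxels are pruned, which is the property that makes a sparse, hash-indexed layout attractive compared with a dense grid.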