DOI: 10.1111/cgf.15002
Computer Graphics Forum, Volume 43 (2024), number 1, e15002

End-to-End Compressed Meshlet Rendering

D. Mlakar¹, M. Steinberger¹ and D. Schmalstieg²

¹Graz University of Technology, Graz, Austria
{daniel.mlakar, steinberger}@icg.tugraz.at
²University of Stuttgart, Stuttgart, Germany
dieter.schmalstieg@visus.uni-stuttgart.de

Abstract
In this paper, we study rendering of end-to-end compressed triangle meshes using modern GPU techniques, in particular, mesh shaders. Our approach allows us to keep unstructured triangle meshes in GPU memory in compressed form and decompress them in shader code just in time for rasterization. Typical previous approaches use a compressed mesh format only for persistent storage and streaming, but must decompress it into GPU memory before submitting it to rendering. In contrast, our approach uses an identical compressed format in both storage and GPU memory. Hence, our compression method effectively reduces the in-memory requirements of huge triangular meshes and avoids any waiting times on streaming geometry induced by the need for a decompression stage on the CPU. End-to-end compression also means that scenes with more geometric detail than previously possible can be made fully resident in GPU memory. Our approach is based on a novel decomposition of meshes into meshlets, i.e. disjoint primitive groups that are compressed individually. Decompression using a mesh shader allows de facto random access on the primitive level, which is important for applications such as selective streaming and fine-grained visibility computation. We compare our approach to multiple commonly used compressed meshlet formats in terms of required memory and rendering times. The results imply that our approach reduces the required CPU–GPU memory bandwidth, a frequent bottleneck in out-of-core rendering.

Keywords: modeling, compression algorithms

1. Introduction

Open-world games, scientific simulations and many other graphics applications can involve huge scenes that are challenging to store, transmit and render. If the scene does not fit into GPU memory in its entirety, the application needs to schedule data streaming and decompression, such that the data become available just in time. For such out-of-core rendering, an optimal compression rate must be weighed against in-memory footprint and GPU friendliness.

Conventionally, a rendering system expects the scene to be in-core, i.e. fully resident in GPU memory. For scenes larger than memory, a sparse representation with sequential access enables out-of-core processing in the sense that a spatially bounded portion of the scene is loaded into memory (and decompressed, if any compression has been applied). Such sequential access patterns can be applied to tasks such as segmentation, smoothing or simplification, but they are typically not compatible with rendering: draw calls require direct access to the geometric primitives. If the scene is structured into multiple meshes representing individual objects, one can simply select which meshes to load. However, the scene may consist of a single huge mesh or several large meshes, which cannot be fully resident. In this case, a common method is to segment the meshes into regions. Regions should have a finite spatial extent, and their primitive count should be chosen in a GPU-friendly manner. Ideally, a region corresponds to a meaningful part of an object [CG08, CKLL09], which can be loaded and decompressed individually.
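To make the notion of a GPU-friendly region concrete, the following sketch shows how per-region bookkeeping for streaming and culling is commonly laid out in mesh-shading pipelines. It is a minimal illustration only: the struct, its field names and the particular budgets (64 vertices and 126 triangles are frequently recommended limits) are assumptions for this example and not the compressed format proposed in this paper.

// Illustrative sketch of a conventional, uncompressed meshlet/region record.
// Budgets and field names are assumptions, not the laced-wires format of this paper.
#include <cstdint>

constexpr uint32_t kMaxVerticesPerRegion  = 64;   // commonly used mesh-shader budget
constexpr uint32_t kMaxTrianglesPerRegion = 126;  // commonly used mesh-shader budget

struct RegionDescriptor {
    float    boundsCenter[3];  // bounding sphere: finite spatial extent, used for culling
    float    boundsRadius;
    uint32_t firstVertex;      // offset into a shared vertex array
    uint32_t firstTriangle;    // offset into a shared local index array
    uint8_t  vertexCount;      // <= kMaxVerticesPerRegion
    uint8_t  triangleCount;    // <= kMaxTrianglesPerRegion
};

A renderer can load, cull and render such regions independently; the approach presented in this paper keeps the per-region geometry compressed in GPU memory while preserving this independence.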
The requirements of streaming rendering are not new [TMJ98, LH04], but are still highly relevant in contemporary game technology. Streaming is prominently addressed in the Microsoft DirectStorage API [Yeu20], which bypasses the CPU by streaming directly from non-volatile storage into GPU memory, and has recently been extended to include hardware-assisted Lempel-Ziv-like decompression [Hoe22, CH22, Ura22]. A similar approach is available in Epic's Nanite [Epi20] rendering engine.

However, the aforementioned tools for out-of-core rendering pipelines do not directly aim at preserving GPU memory. Rendering is assumed to work from an uncompressed representation, in the sense that shader input can be read directly from memory. Since both GPU memory size and bandwidth must be considered scarce resources, we prefer end-to-end compressed rendering, which decompresses triangles directly in the shader and limits the GPU memory requirements to only the compressed, rather than the uncompressed, geometry representation. Such an approach avoids the need for a separate decompression stage on the CPU and considerably reduces the overall memory footprint of the geometry data. It also simplifies the engine code, since there is no need to maintain separate compressed and uncompressed representations.

In this paper, we explore a compressed rendering infrastructure that performs decompression directly in a mesh shader [Kub18], operating on groups of geometric primitives called meshlets. The ability to decompress geometry on the fly in the mesh shader makes it easy to design a streaming system, since the same compressed representation can be kept in GPU memory and in non-volatile storage. To support decompression in the mesh shader, we present connectivity representations that lend themselves to fully parallel processing.

In summary, we make the following contributions:

• We present the first complete end-to-end compressed mesh rendering pipeline.
• We introduce a novel decomposition of meshes into disjoint primitive subsets (called laced wires), which allow random access without vertex duplication.
• We describe how to create laced wires and how to efficiently decode them in parallel on the GPU.
• We compare our approach to multiple compressed and uncompressed mesh representations.

2. Related Work

This paper focuses on triangle meshes, which are the most widely used mesh representation in computer graphics. The most common data structure, an indexed triangle mesh, consists of a triangle list storing three consecutive vertex indices per triangle and a vertex list storing vertex data, e.g. positions or normals. While this data structure can be directly processed by the graphics pipeline, it is not very compact. Hence, there is a large body of work on mesh representations that require less memory through compression of vertices, connectivity or both [MLDH15]. Most single-rate compression methods for meshes combine a lossless encoding of triangle connectivity with a lossy encoding of vertex data. For brevity, we do not consider lossy connectivity compression, i.e. mesh simplification or re-meshing, but rather assume that the geometry quality is already given at the desired rate.
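As a point of reference for the compressed representations discussed below, a minimal indexed triangle mesh can be laid out as in the following sketch; the types and names are illustrative assumptions. With three 32-bit floats per position and three 32-bit indices per triangle, this layout spends 12 bytes per vertex position plus 12 bytes of connectivity per triangle before any further attributes are added, which is the baseline that compression schemes try to undercut.

// A plain indexed triangle mesh as consumed directly by the graphics pipeline.
// Illustrative layout: 3 x 4 B per position, 3 x 4 B of indices per triangle.
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vertex {
    float position[3];   // further attributes (normals, UVs, ...) would follow here
};

struct IndexedTriangleMesh {
    std::vector<Vertex>   vertices;  // vertex list
    std::vector<uint32_t> indices;   // triangle list: three consecutive indices per triangle
};

// Raw footprint of positions and connectivity, in bytes.
inline std::size_t geometryBytes(const IndexedTriangleMesh& m) {
    return m.vertices.size() * sizeof(Vertex) + m.indices.size() * sizeof(uint32_t);
}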
2.1. Mesh compression

Vertex and attribute compression. Lossy compression of vertex data is acceptable in many applications, since the vertex coordinates and other attributes usually do not require full floating point precision [Dee95, Cal01, MSGS11, HV01]. The application of quantization by converting coordinates to a fixed point format and stripping the least significant bits is often possible without any visual deterioration [PBCK05, KXW*18]. In fact, some meshes, such as those acquired with 3D scanning, tend to store only noise in their least significant bits. However, hand-modelled meshes may have their parts expressed with varying precision, making a mechanism for non-uniform quantization necessary. A straightforward method to address non-uniformity is hierarchical decomposition of the mesh into parts with homogeneous precision inside each part. Compression rates can potentially be further increased by exploiting local coherence in the value sequence through delta-encoding or prediction [IA02]. In contrast to position quantization, domain and precision requirements of other frequently used attributes, like normals [CDE*14] and texture coordinates, are known in advance, making their quantization less challenging.
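The uniform case can be summarized in a few lines: coordinates are mapped to n-bit integers relative to an axis-aligned bounding box, which discards the least significant bits of the original floats. The function and parameter names below are our own illustration and are not taken from any of the cited methods; a non-uniform scheme would apply the same mapping per part of such a hierarchical decomposition, with a different bit count per part.

// Minimal sketch of uniform fixed-point quantization of one coordinate (lossy).
// Assumes hi > lo and 1 <= bits <= 31; names and interface are illustrative only.
#include <algorithm>
#include <cmath>
#include <cstdint>

inline uint32_t quantizeCoord(float x, float lo, float hi, uint32_t bits) {
    const uint32_t maxCode = (1u << bits) - 1u;                      // largest n-bit code
    const double t = std::clamp((double(x) - lo) / (double(hi) - lo), 0.0, 1.0);
    return static_cast<uint32_t>(std::llround(t * double(maxCode))); // strip excess precision
}

inline float dequantizeCoord(uint32_t q, float lo, float hi, uint32_t bits) {
    const uint32_t maxCode = (1u << bits) - 1u;
    return static_cast<float>(lo + (double(hi) - lo) * (double(q) / double(maxCode)));
}

Delta-encoding or prediction, as mentioned above, would then operate on the resulting integer codes of neighbouring vertices rather than on the raw floating-point values.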
Connectivity compression. Most compact connectivity representations use some form of constrained graph traversal of the mesh and encode the traversal steps using a set of symbols with low entropy. A popular choice is the triangle strip [Ise00], given as a sequence of vertices, where each new vertex together with its two predecessors in the sequence defines a new triangle. Standard triangle strips attach each new vertex to alternating edges of the previous triangle; generalized triangle strips use a special symbol to choose the attachment edge explicitly, which allows longer strips. With more complex traversal patterns, e.g. aligning strips next to each other [Cho97] or in concentric circles [BPZ99], encoding efficiency can be further improved.

Other traversal patterns include spanning trees [Tau98] or region growing [Ros99], the latter probably being the most influential idea in connectivity compression. In region growing, the border of the already encoded part of the mesh is incrementally extended, guided by a custom traversal strategy, and newly encountered triangles are encoded. Noteworthy traversal strategies include the cut-border machine [GS98] as well as EdgeBreaker [Ros99] and its many improvements [CR04, SKR01]. Alternatively, region growing can also be expressed in terms of the valence of added vertices [TG98], e.g. in FreeLence [KPRW05].

The Laced Ring (LR) [GLRL11] structure can be seen as a kind of dual representation of a generalized triangle strip. A very long triangle strip stores hardly more than one vertex reference for each encoded triangle. However, in practice, the additional operations required for a generalized triangle strip (edge swap, restart) incur a 30–40% overhead [GLRL11]. The laced ring replaces triangle sequences with sequences of adjacent vertices. For each edge between two vertices, two references to laces, i.e. triangles formed by extra vertices adjacent on each side of an edge, are stored. While this representation also requires approximately one reference per triangle, it is easier to find long vertex sequences than to find long triangle sequences (a triangle has three neighbours, while a vertex has six on average [GLLR13]). We expand upon the idea of laces in this paper.

Encoding shared boundaries. If a large mesh has to be segmented into regions before compression, one has to deal with shared boundaries between the regions. Boundaries must be handled with care to avoid cracks between regions. The most common approach to avoid cracks is to split meshes along the edges, leading to shared vertices between regions. In this case, shared vertices are duplicated across regions, which increases the memory footprint and may increase the effort to keep vertex replicas consistent. Duplication is avoided if the vertices are stored only for one region and referenced in all other regions [YL07, CG08], but this requires reference counting