Over the last 15 years, live streaming services have grown from novelties and experiments into profitable businesses serving an ever-growing cohort of cord-cutters and cord-nevers. Initial streaming implementations mimicked the workflows of the broadcast world, using custom servers to deliver streams via proprietary protocols. Here at Akamai, traffic grew roughly 22,000-fold, from a 1 Gbps stream in 2001 (the first Victoria's Secret webcast) to 23 Tbps in June 2018 for a global football tournament.
Driven by the development of HTTP Adaptive Streaming (HAS), this growth in live delivery also brought increasing viewer demand for OTT quality and latency to match those of traditional broadcast television. Conventional wisdom holds that HAS-delivered content has an end-to-end latency of several multiples of the segment duration and that it lags behind broadcast. That preconception can now be challenged. There is a HAS solution that achieves end-to-end latency lower than one segment duration and, in fact, decouples overall latency from segment duration: Ultra-Low-Latency CMAF (ULL-CMAF).
The Common Media Application Format (CMAF) was standardized by MPEG in 2017 and defines a fragmented MP4 container which can hold video, audio or text data. Best known for its efficiency in allowing media segments to be simultaneously referenced by HLS playlists and DASH manifests, the standard also offers an additional intriguing benefit inherited from the DASH ATSC3 broadcast profile, which is latency reduction. To clarify, just using CMAF segments themselves will do nothing to reduce latency. To obtain low end-to-end latency, the CMAF containers must be paired with encoder, CDN and client behaviors so that the overall system enables low latency.
Figure 1: CMAF object nomenclature
The first required behavior to achieve this reduction in latency is chunked encoding. Per the MPEG CMAF standard, a CMAF track is comprised of a number of objects, as illustrated in Figure 1. A "chunk" is the smallest referenceable unit, containing at least a moof and a mdat atom. One or more chunks are combined to form a fragment and one or more fragments to form a segment.
A standard CMAF media segment is encoded with a single moof and mdat atom, as shown in Figure 2. The mdat holds a single IDR (Instantaneous Decoder Refresh) frame, which is required to begin every segment.
Figure 2: Chunked encoding of a CMAF segment
A "chunked-encoded" segment, however, will hold a series of "chunks", i.e. a sequence of multiple moof/mdat tuples, as shown in Figure 2. Only the first tuple holds an IDR frame. The advantage of breaking up the segment into these shorter pieces is that the encoder can output each chunk for delivery immediately after encoding it. This early release reduces overall latency directly, by the time the encoder would otherwise have spent waiting to finish the segment. There is no fixed rule for how many frames are included in each chunk. Current encoder practice ranges from 1 to 15 frames per chunk. It should be clarified that CMAF did not "invent" chunked encoding. It has been available since 2003, when AVC was first standardized. The MPEG DASH ISO-based broadcast profile developed by the DASH Industry Forum for ATSC3 standardized its use prior to CMAF adopting it. Chunked encoding has been used in many instances in academia and industry for over a decade. What has changed is that there is now a coordinated effort within the industry to use this approach for lowering latency.
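To make the moof/mdat structure concrete, the following is a minimal sketch that walks the top-level boxes of a segment and counts the moof/mdat tuples. It relies only on the standard ISO BMFF box header layout (a 32-bit big-endian size followed by a four-character type). The helper names (`iter_top_level_boxes`, `count_chunks`, `make_box`) and the synthetic test data are assumptions made for illustration; real segments also carry boxes such as styp or prft, which this sketch simply skips over.

```python
import struct


def iter_top_level_boxes(data: bytes):
    """Yield (box_type, box_size) for each top-level ISO BMFF box in data."""
    offset = 0
    while offset + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, offset)
        header = 8
        if size == 1:
            # A size of 1 means a 64-bit "largesize" follows the type field.
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header = 16
        elif size == 0:
            # A size of 0 means the box runs to the end of the data.
            size = len(data) - offset
        if size < header:
            raise ValueError("corrupt box header")
        yield box_type.decode("ascii"), size
        offset += size


def count_chunks(segment: bytes) -> int:
    """Count moof boxes that are immediately followed by an mdat box."""
    types = [box_type for box_type, _ in iter_top_level_boxes(segment)]
    return sum(1 for a, b in zip(types, types[1:]) if a == "moof" and b == "mdat")


def make_box(box_type: bytes, payload: bytes) -> bytes:
    """Build a synthetic box (hypothetical helper used only to create test data)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


# Synthetic stand-ins: one moof/mdat pair versus four shorter pairs.
standard_segment = make_box(b"moof", b"\0" * 16) + make_box(b"mdat", b"\0" * 64)
chunked_segment = b"".join(
    make_box(b"moof", b"\0" * 16) + make_box(b"mdat", b"\0" * 16) for _ in range(4)
)

print(count_chunks(standard_segment), count_chunks(chunked_segment))
```

Run against real encoder output, a count greater than one for a single segment is a quick indication that the segment was chunk-encoded.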
Chunked Transfer Encoding:
The second required system behavior is "chunked transfer encoding".
Figure 3: HAS media distribution system
The encoder will use HTTP/1.1 chunked transfer encoding to push the encoded CMAF chunks to the origin for redistribution. As an example, an encoder producing 4s segments at 30fps with one frame per chunk would make one HTTP POST every 4s (one for each segment), and then during the next 4s the 120 chunks that comprise the segment, each 33ms long, would be sent down that open connection to the distribution cloud. Note that the encoder is not making a POST for each individual chunk.
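The arithmetic and the wire format in this example can be sketched as follows. The framing (hexadecimal length, CRLF, payload, CRLF, then a zero-length terminator) is the standard HTTP/1.1 chunked transfer coding. The function names and the choice to send exactly one CMAF chunk per HTTP chunk are assumptions for illustration; an encoder is free to map CMAF chunks onto HTTP chunks differently.

```python
def frame_chunk(payload: bytes) -> bytes:
    """Wrap a payload in HTTP/1.1 chunked transfer coding framing."""
    return f"{len(payload):X}".encode("ascii") + b"\r\n" + payload + b"\r\n"


# A zero-length chunk followed by an empty trailer section ends the body.
LAST_CHUNK = b"0\r\n\r\n"


def chunked_body(cmaf_chunks):
    """Yield the framed body of ONE POST carrying all chunks of one segment."""
    for chunk in cmaf_chunks:
        if chunk:  # an empty payload would be read as the terminator
            yield frame_chunk(chunk)
    yield LAST_CHUNK


# Figures from the example: 4 s segments at 30 fps, one frame per chunk.
segment_duration_s = 4
fps = 30
frames_per_chunk = 1

chunks_per_segment = segment_duration_s * fps // frames_per_chunk  # 120
chunk_duration_ms = 1000 * frames_per_chunk / fps                  # about 33 ms

print(chunks_per_segment, round(chunk_duration_ms, 1))
print(list(chunked_body([b"moof+mdat #1", b"moof+mdat #2"])))
```

Because `chunked_body` is a generator, an HTTP client that accepts an iterable request body can send each framed chunk as soon as the encoder produces it, which is the behavior described above: one POST per segment and many chunks per POST.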
The remainder of the chunk's journey is pull-based and driven by the media player. The media player reads the manifest or playlist, which describes the content, calculates the live edge at which it wishes to start playback (more on this later) and then makes a request for a segment. The manifest must signal the early availability of the segment data. In MPEG DASH, this is done via the MPD@availabilityTimeOffset parameter. In HLS, the variant playlist referencing the segment should be published once the first chunk of the segment has been released, rather than after the last chunk, which would be the normal mode of operation.
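As a worked example of the DASH signaling, the sketch below assumes that the offset is set to the segment duration minus one chunk duration, so that a segment is advertised as available once its first chunk exists. The function names are hypothetical. The input values (6s segments, one frame per chunk at 29.97fps) are the settings of the public demo described later in this article, which signals an availabilityTimeOffset of 5.967s.

```python
from fractions import Fraction


def availability_time_offset(segment_duration, chunk_duration):
    """Offset by which segment availability is advanced (assumed rule:
    the segment becomes requestable once its first chunk is complete)."""
    return segment_duration - chunk_duration


def earliest_request_time(segment_start, segment_duration, ato):
    """Earliest time, on the stream timeline, at which a segment may be requested.

    Without an offset, a segment is available only once it is complete,
    i.e. at segment_start + segment_duration.
    """
    return segment_start + segment_duration - ato


segment_duration = Fraction(6)
chunk_duration = 1 / Fraction("29.97")  # one frame per chunk

ato = availability_time_offset(segment_duration, chunk_duration)
print(round(float(ato), 3))  # 5.967

# A segment starting at t=0 can be requested one chunk duration later,
# instead of a full 6 s later.
print(float(earliest_request_time(Fraction(0), segment_duration, ato)))
```

Exact fractions are used here only to keep the arithmetic free of rounding noise; a packager would typically write the rounded decimal value into the manifest.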
Figure 4: Player start-up options against a live stream
To illustrate the sensitivity of overall latency to a player's starting algorithm, Figure 4 shows a live encoder producing 2s segments. We observe the system at a point in time midway through the production of segment #5. A non-chunked solution could minimize its latency by starting with the last fully available segment (#4), resulting in 3s overall latency. If the content is chunk-encoded with 500ms chunks (for illustration only, as in reality chunks are much shorter than this), then the player could start with the latest chunk holding an IDR (#5a), which would reduce the latency to 1s.
Two methods now exist to drop the latency even further. In the first, the player would download chunks 5a and 5b but then decode forward through 5a to 5b before starting playback, thereby lowering its latency to less than 500ms. In the second, the player can defer playback by 1s and then make a well-timed request for chunk 6a immediately after it is produced, thereby also reducing the latency to less than 500ms.
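The start-up latencies quoted for Figure 4 can be reproduced with a few lines of arithmetic. This sketch assumes segments are numbered from 1 and that segment #1 starts at t=0; the results do not depend on that choice. It works at chunk granularity, so the decode-forward case evaluates to the 500ms bound rather than a value below it.

```python
SEGMENT_S = 2.0  # segment duration from Figure 4
CHUNK_S = 0.5    # illustrative chunk duration from Figure 4


def segment_start(n: int) -> float:
    """Start time of segment n (1-based numbering is an assumption)."""
    return (n - 1) * SEGMENT_S


# Observation point: midway through the production of segment #5.
now = segment_start(5) + SEGMENT_S / 2

# Non-chunked: start from the last fully available segment (#4).
latency_full_segment = now - segment_start(4)

# Chunked: start from the latest chunk holding an IDR frame (#5a).
latency_idr_chunk = now - segment_start(5)

# Chunked, decode forward: decode 5a without rendering, start playback at 5b.
latency_decode_forward = now - (segment_start(5) + CHUNK_S)

print(latency_full_segment, latency_idr_chunk, latency_decode_forward)
# 3.0 1.0 0.5
```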
Note that the player requests a segment and not a chunk, since the chunks are not addressable units (outside of ATSC3, but that's another case study). Importantly, the CDN edge also caches the chunks flowing through it to build up a cached representation of the complete segment. This ability for a) CDNs to cache the complete segments and b) the stream to be backwards compatible with the majority of clients that have not been optimized for low latency provides one of the strongest advantages for ULL-CMAF when compared to alternate schemas.
An interesting side-effect of this chained chunk transfer is that the segments are delivered with consistent timing that is independent of the throughput between client and edge server. Standard HAS throughput estimation algorithms will produce the answer that the connectivity is exactly equal to the encoded bitrate, which will prevent the player from switching up. Various workarounds exist for this problem, including measuring the connectivity as the chunks are burst and then applying a conservative average, as well as machine learning to infer connectivity given a pattern of chunk burst times. Players need to be taught this new behavior if they are to play back multi-bitrate ULL-CMAF segments successfully.
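The estimation problem and the burst-measurement workaround can be illustrated with a small simulation. This is a simplified model, not any player's actual algorithm: it assumes every chunk is the same size, is released the instant it is encoded, and then downloads at the full link rate. The function names are hypothetical, and the harmonic mean stands in as one possible "conservative average".

```python
def simulate_chunk_arrivals(bitrate_bps, link_bps, segment_s, n_chunks):
    """Return (start, end, bits) for each chunk of one live-edge segment."""
    chunk_bits = bitrate_bps * segment_s / n_chunks
    chunk_s = segment_s / n_chunks
    arrivals = []
    for i in range(n_chunks):
        start = (i + 1) * chunk_s             # released once the chunk is encoded
        end = start + chunk_bits / link_bps   # then bursts at the link rate
        arrivals.append((start, end, chunk_bits))
    return arrivals


def segment_level_estimate(arrivals):
    """Classic estimate: total bits over first-byte-to-last-byte time."""
    total_bits = sum(bits for _, _, bits in arrivals)
    return total_bits / (arrivals[-1][1] - arrivals[0][0])


def burst_level_estimate(arrivals):
    """Harmonic mean of per-chunk burst rates (idle gaps are ignored)."""
    rates = [bits / (end - start) for start, end, bits in arrivals]
    return len(rates) / sum(1 / r for r in rates)


# A 2 Mbps stream in 6 s segments of 180 chunks, over a 20 Mbps link.
arrivals = simulate_chunk_arrivals(2e6, 20e6, 6.0, 180)
print(round(segment_level_estimate(arrivals) / 1e6, 2), "Mbps (segment level)")
print(round(burst_level_estimate(arrivals) / 1e6, 2), "Mbps (burst level)")
```

In this model the segment-level figure lands close to the encoded bitrate regardless of the link speed, while the burst-level figure recovers the link rate. In practice chunk boundaries are not always visible to the player and bursts are noisy, which is why the conservative averaging matters.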
To reiterate, stable latency reduction with ULL-CMAF is only achieved if all of the following are in place:
a. The content in the CMAF segment is chunk-encoded.
b. The encoder adjusts its DASH manifest/HLS playlist production to accommodate and signal the usage of chunked encoding and early availability of the data.
c. The encoder pushes content to the origin using HTTP/1.1 chunked transfer encoding.
d. The CDN propagates this content all the way to the client using HTTP chunked transfer encoding at each step in the distribution chain.
e. The client:
It is possible in the lab to use ULL-CMAF to produce glass-to-glass latencies in the 600ms range. These are excellent for impressing friends and CEOs; however, they become increasingly fragile with scale and with geographic dispersion (higher round-trip times between encoder, origin, edge server and clients). If distribution is happening over the open internet (especially over a last-mile mobile network where rapid throughput fluctuations are the norm), current proofs-of-concept show more sustainable Quality of Experience (QoE) with a glass-to-glass latency in the 3s range, of which 1.5s-2s resides in the player buffer. Current encoder implementations tend to favor one video frame per chunk, but there is no objective data yet to indicate whether this is optimal from a robustness or quality perspective.
Some advantages of chunked-encoded chunk-transferred CMAF are:
A public working demo is available for viewing in the Chrome browser. This demo shows a live stream produced by open source FFmpeg, published into Akamai Media Services liveOrigin, delivered over the Akamai Media Delivery network and played back by the dash.js open source player. The stream is AVC 720p at 2Mbps with a segment duration of 6s and a chunk duration of 1 frame at 29.97fps, and availabilityTimeOffset is signaled at 5.967s. Target latency is set to 2.8s and the stream is being encoded in Boston.
All components of the ULL-CMAF system are seeing broad improvement. The advantage of a standards-based approach to low latency is that the resources of many companies are combined for mutual benefit. Encoders are becoming more proficient at encoding chunked content, studies are being done to establish chunk-size/robustness/quality curves, and UDP-based protocols such as QUIC are being investigated for last-mile delivery. Standards bodies are looking at issues around EMSG, and we will see SSAI vendors enable solutions in low-latency environments. As standardization and commercial models push forward, OTT continues to aggressively challenge broadcast norms for quality and latency. The quality-at-scale target inches ever closer.
For a deeper dive into this subject of low latency via chunked-encoding, chunked-transfer CMAF, please read our white paper available here.