Core technologies for streaming workflows, in 2021 and beyond
It’s been almost five years since I published the last article on this blog – ten years since the first one – time flies quickly, and so do streaming technologies. In 2016, CMAF standardization was just beginning, carrying the promise of simplified workflows and improved CDN caching efficiency. CBCS encryption scheme support was supposed to expand far beyond the Apple ecosystem, and IMSC was poised to become the dominant subtitles standard. I thought I would fast forward to now and check how much of that really happened, which other technologies emerged as foundational for streaming workflows, and which ones could be the most important for the five years to come.
Current core technologies
Let’s start with the technologies which confirmed or revealed their first-class OTT citizen status in the last five years…
The CMAF/CBCS/IMSC wave
Since 2016, CMAF has become the de facto media container format. All video Standards Developing Organizations (SDOs) like DVB, ATSC, DASH-IF or 3GPP have adopted this MPEG standard as a baseline technology for their own standards. The Consumer Technology Association has launched its Web Application Video Ecosystem (WAVE) project to further drive interoperability between content producers and media player developers leveraging CMAF, and the CMAF Industry Forum to promote the use of CMAF across the video ecosystem. CTA-WAVE recently released the DASH-HLS Interoperability Specification, which describes how DASH and HLS should leverage CMAF content, and provides insights on mapping between DASH and HLS manifests. And MPEG has defined a CMAF profile for DASH (part of the core DASH spec 5th edition, at Final Draft International Standard stage). So yes, CMAF is now everywhere, and nothing gets built that is not CMAF-compatible. All modern video players support CMAF media segments, so the footprint for TS segments is now limited to pre-iOS 10 devices and other legacy hardware players which can’t be updated – a clearly shrinking set, given hardware renewal cycles. Unlike LL-DASH players, Apple LL-HLS players don’t leverage CMAF chunks, as the heuristics on these platforms rely on full line speed transfer, but at least they are compatible with them, which makes it possible to share a single set of media segments between LL-HLS and LL-DASH manifests.
The success of CMAF was also conditioned on CBCS taking over from the CTR encryption scheme, as this was the prerequisite for carrying multiple DRMs in the same set of segments and letting the CDN deliver with the best caching efficiency, compared to maintaining two sets of segments in CTR and CBCS modes. We need to acknowledge that this vision has only half materialized outside of the Apple ecosystem, as only non-Apple devices from 2020 onwards actually support CBCS in hardware, which limits support for 1080p resolutions and beyond. Taking into account the standard hardware renewal cycle, this takes us to 2025 before CBCS is universally supported. Even in environments like browsers, where you would expect CBCS to be a table stakes feature by now, there is still a lot of interop work needed to see it behaving well with all DRMs and Clear Key encryption.
In the CMAF promise package, there was also the prospect of relief in using a single TTML subtitles format across all devices, with the advent of IMSC1 in both text and image profiles (see section 4.4.1 of the CTA specification). While Apple introduced support for the IMSC1 Text Profile in the 2017 draft-pantos-hls-rfc8216bis-00 spec, HLS still doesn’t officially support the Image Profile, and the domination of WebVTT in the HLS world has not really been challenged yet, leaving the subtitle ecosystem fractured. It probably relates to the fact that supporting even the IMSC1 Text Profile – a broad specification – is challenging, and some profiling is needed, as in EBU-TT-D or the recent Netflix IMSC 1.1 Text Profile. Long story short: we need more time to see the TTML convergence happen in the industry.
QUIC everywhere?
It’s been a long journey for QUIC since its inception in 2013, getting in bed with HTTP/2 unofficially as a draft standard in 2015, and being promoted to RFC and the official foundation for HTTP/3 in 2021 (see a good Akamai blog post on this history).
With connection multiplexing over UDP, HTTP delivery has reached unprecedented levels of efficiency. After initially thinking of using HTTP/2 Push as an LL-HLS mechanism, Apple stepped back when they realized it wasn’t compatible with ad insertion (a single origin for all media segments and ad segments was required for HTTP/2 security reasons), but the use of HTTP/2 remained a mandatory part of LL-HLS compliance. And it’s likely that LL-HLS will be updated to officially support HTTP/3 in a not so distant future. There were some concerns initially in the DASH world, as LL-DASH relies on HTTP/1.1 chunked transfer encoding, which is not a concept carried over to HTTP/3 (where DATA frames are used instead). In practice this works fine, and CDNs can transparently convert HTTP/1.1 + CTE at the origin level into HTTP/3 with DATA frames towards clients at the edge level. With its efficiency and a high level of interoperability between media clients and CDNs, QUIC is clearly the dominant technology on the HTTP delivery side of things, now and for the next five years to come.
Codecs evolution
What’s the difference between now and five years ago, in terms of video codecs? Not that much, actually. AVC is still the dominant codec, with HEVC adoption still lagging because of a fragmented licensing landscape (to say the least). HEVC decoding is widely present in silicon but still often not activated, to avoid licensing costs, so its adoption has been slowly increasing, mainly driven by 4K content delivery where AVC is not a sustainable option, and by Apple devices which provided hardware decoding capability early on. On browsers, HEVC support outside of Safari is still more the exception than the rule, even if it can be achieved by passing through to the underlying hardware support. But this is where the politics hit, for example with Google more inclined to push AV1 than HEVC, as this is a codec that they mainly control. Unlike HEVC, AV1 has made its way into browsers like Chrome and Firefox since 2018, and comes to Windows 10 as a free extension in the Microsoft Store (whereas the HEVC extension is a $0.99 one) that the Edge browser can leverage. AV1 support started to appear on TVs in 2020, with LG, Samsung and Sony models, and in the mobile world with Android 10+ support. But the lack of native AV1 support on Apple platforms and popular mobile chipsets like the Qualcomm Snapdragon 888 somehow keeps AV1 in the challenger category. The broadcast standards ecosystem is starting to look at it seriously, with DVB shortlisting AV1 among the next-gen codecs to support, alongside AVS3 (for the Chinese market) and VVC (aka H.266). While there’s an ISO-BMFF binding for AV1, which makes it usable in HLS or DASH contexts, HDR support is still an emerging AV1 feature, with a one-page note available for HDR10+ and early signs of Dolby Vision support visible here and there.
On the audio codec front, it’s hard to say that major changes have happened over the last five years: AAC variants are still the mainstream option, with a few AC-3 variants spicing up the streamsets when multi-channel sources are available for transcoding. No audio revolution has really been televised.
Guest contribution from my friend and industry rockstar Thierry Fautier!
Content Aware Encoding
Content Aware Encoding (CAE) is a technique used by encoders to adaptively distribute the bitrate over time, to provide a constant quality at a given target average bitrate. The bitrate saving vs CBR encoding is on average between 30% and 40%, which is as big as migrating to a new codec, without the burden of upgrading the head end, waiting for new devices or deploying a new player on existing devices. Netflix was the first to deploy this technique for VOD, using what they call per-title and later on per-scene encoding, and CAE can also be applied to live. The key requirement for live is to fit within the existing CPU footprint and not add any processing delay, and for both live and VOD to guarantee full interoperability with a broad range of clients. Apple, which used to specify CBR encoding only, enabled the use of variable bitrate encoding in its iOS 11 release (September 2017), with constraints such as the VOD average bitrate staying within 10% of the declared bitrate, and the actual live bitrate staying within 110% of the declared average bitrate (measured over a 1-hour period). Android has no restriction on the use of VBR, and DASH-IF has decided not to make any specific recommendation about it. The Ultra HD Forum was the first industry group to embrace CAE technology in its guidelines, as a 30% saving on a 20 Mbps stream represents a saving of 6 Mbps, which can dramatically increase the reach of UHD (with a 2160p60 profile) on a broadband network.
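To make those VBR tolerances concrete, here is a minimal TypeScript sketch of a ladder check along the lines Thierry describes; the function and field names are mine, and the normative thresholds and measurement windows should be taken from Apple’s HLS authoring guidelines rather than from this example:

```typescript
interface Rendition {
  declaredAvgKbps: number;   // average bitrate declared in the playlist (AVERAGE-BANDWIDTH)
  measuredAvgKbps: number;   // measured average bitrate (whole asset for VOD, 1h window for live)
}

// Hypothetical CAE/VBR sanity check: flag renditions whose measured average
// exceeds ~110% of the declared average bitrate.
function findOffendingRenditions(ladder: Rendition[]): Rendition[] {
  return ladder.filter(r => r.measuredAvgKbps > r.declaredAvgKbps * 1.10);
}

// Example: a rendition declared at 5000 kbps should not average more than ~5500 kbps.
console.log(findOffendingRenditions([{ declaredAvgKbps: 5000, measuredAvgKbps: 5650 }]));
```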
Low latency, finally
It’s been a rocky road for LL-DASH, the low latency profile(s) of MPEG-DASH, since the first DASH-IF/DVB iterations in 2017. It took three years for LL-DASH to mature in terms of specifications, resulting in section 10.20 of the DVB-DASH v1.3.1 spec from February 2020 and the DASH-IF Low Latency Modes for DASH IOP extension in March 2020. The necessary innovations like the Resynchronization Points required multiple additions to the core DASH spec at MPEG, with the 5th edition hopefully being the last iteration in this domain. DASH-IF sponsored the implementation of LL-DASH support in FFmpeg in 2019, so there’s now a solid open source basis for low latency DASH encoding/packaging, which can be combined with a variety of origination solutions like Streamline, originjs or the dash-server for Nginx. But despite this intense 2019/2020 standardization and open source work, LL-DASH has remained a relatively niche technology, with few large-scale production deployments. The DASH player ecosystem is still fragmented on low latency support, as it is on other foundational technologies like multi-period (which enables server-side ad insertion).
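As an illustration of that open source basis, here is a sketch of driving FFmpeg’s LL-DASH packaging from a small Node/TypeScript wrapper; the dash muxer options shown do exist in recent FFmpeg builds, but the encoder settings, durations and output layout are purely illustrative and worth validating against your own origin setup:

```typescript
import { spawn } from "node:child_process";

// Illustrative LL-DASH packaging: ~1s CMAF chunks inside 4s segments, written locally.
const args = [
  "-re", "-i", "input.ts",                     // placeholder live-ish source
  "-c:v", "libx264", "-c:a", "aac",
  "-f", "dash",
  "-ldash", "1",                               // low latency DASH profile signaling
  "-streaming", "1",                           // write fragments as they are produced (chunked CMAF)
  "-seg_duration", "4",
  "-frag_duration", "1",
  "-window_size", "30",
  "-use_template", "1", "-use_timeline", "0",
  "-utc_timing_url", "https://time.akamai.com/?iso",
  "out/manifest.mpd",
];

spawn("ffmpeg", args, { stdio: "inherit" });
```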
That’s where the introduction of LL-HLS in 2019 was more motivating for the industry, with Apple suddenly opening up the prospect of two billion iOS 14 devices becoming LL-HLS compliant. The open source video community followed this trend closely, with the deprecation of the LHLS community spec in favor of the LL-HLS spec from Apple, and the arrival of LL-HLS support in ExoPlayer, Shaka Player and hls.js in late 2020 and throughout 2021. The reference to RFC 8673 (HTTP Random Access and Live Content) in the LL-HLS spec has opened the gates to interoperability with LL-DASH, through the use of byte-range addressing over a common set of segments (see the detailed blog post by Will Law on this topic). Not all major CDNs are yet compatible with this latest evolution, but that should be table stakes across the industry by the end of 2022. There’s no doubt that (at least) all the next big sports events will be televised in LL-HLS and LL-DASH, moving forward.
CPIX – ruling the key exchange operations
DASH-IF’s Content Protection Information Exchange Format (CPIX) specification, now in version 2.3 and initially released in 2015, is certainly the best hidden gem of the industry, as it allows video packagers to use a single encryption key request XML payload with key servers, whereas key exchanges were previously achieved through the proprietary interfaces of DRM service platforms. Frameworks like AWS Secure Packager and Encoder Key Exchange (SPEKE), built on top of CPIX, have tried to further unify the industry around a common API approach for exchanging CPIX payloads, as DASH-IF didn’t provide a reference API to carry CPIX documents. With the recent release of SPEKE v2.0 (which was one of my main work topics with AWS Elemental since late 2019), there’s now a perfect alignment of the two specs and a clear path forward for cross-versioning, which will allow the next key exchange innovations to come smoothly into the game. CPIX is a powerful toolset for building key exchange workflows and it will certainly stay dominant in the industry for the years to come, now that it has also been published as an ETSI spec.
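For readers who have never looked inside a CPIX payload, here is a minimal sketch of the kind of document a packager exchanges with a key server, based on my reading of the DASH-IF schema; the KID, key and PSSH values are placeholders, and element/attribute details should be checked against the published spec:

```typescript
// Minimal CPIX-style payload (sketch): one CBCS content key plus one DRM system entry.
const kid = "0f1e2d3c-4b5a-6978-8796-a5b4c3d2e1f0";   // placeholder key ID

const cpixDocument = `<?xml version="1.0" encoding="UTF-8"?>
<cpix:CPIX contentId="my-asset-id"
           xmlns:cpix="urn:dashif:org:cpix"
           xmlns:pskc="urn:ietf:params:xml:ns:keyprov:pskc">
  <cpix:ContentKeyList>
    <cpix:ContentKey kid="${kid}" commonEncryptionScheme="cbcs">
      <cpix:Data>
        <pskc:Secret>
          <pskc:PlainValue>BASE64_ENCODED_CONTENT_KEY</pskc:PlainValue>
        </pskc:Secret>
      </cpix:Data>
    </cpix:ContentKey>
  </cpix:ContentKeyList>
  <cpix:DRMSystemList>
    <!-- Widevine system ID; one DRMSystem element per DRM system and per kid -->
    <cpix:DRMSystem kid="${kid}" systemId="edef8ba9-79d6-4ace-a3c8-27dcd51d21ed">
      <cpix:PSSH>BASE64_ENCODED_PSSH_BOX</cpix:PSSH>
    </cpix:DRMSystem>
  </cpix:DRMSystemList>
</cpix:CPIX>`;
```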
MSE and EME – crucial enablers for video players
At the time of my last blog post in 2016, W3C’s Media Source Extensions and Encrypted Media Extensions specifications were already the dominant low level mechanisms used to support media playback and decryption in browsers, leveraged by all javascript-powered video engines like hls.js or dash.js. In the browser space, the most notable compatibility extension since that time was the addition of MSE support in Safari on iPads with iPadOS 13 in 2019, and in the connected TV space the introduction of EME in HbbTV 2.0.1 in 2016 and of MSE in version 2.0.3 in 2020. While MSE and EME aren’t applicable in the Android/Android TV world, these technologies have made their way into TV OSes like LG webOS since v3.0 or Tizen OS since v2.3, and now offer a convenient way to develop and deploy OTT players across multiple TV ecosystems. Another interesting expansion would be MSE support in Safari on iPhone (as already available on iPad – enabling LL-DASH playback on this platform), but it sounds unlikely that Apple will add it, as it would suddenly allow (LL-)DASH to challenge (LL-)HLS across the whole iOS browser footprint, and restrict HLS relevance to the compiled applications context. That would be good for offering more choices to implementers, but certainly detrimental to Apple’s control over their ecosystem.
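As a reminder of what these two W3C APIs look like from a player’s point of view, here is a heavily simplified TypeScript sketch; the codec string, key system configuration and segment names are placeholders, and a real video engine obviously does much more (ABR, encrypted events, buffer management):

```typescript
const video = document.querySelector("video") as HTMLVideoElement;

// EME: request a Widevine CDM and attach MediaKeys to the video element (simplified).
const access = await navigator.requestMediaKeySystemAccess("com.widevine.alpha", [{
  initDataTypes: ["cenc"],
  videoCapabilities: [{ contentType: 'video/mp4; codecs="avc1.64001f"' }],
}]);
await video.setMediaKeys(await access.createMediaKeys());

// MSE: feed a CMAF init segment and media segments into a SourceBuffer.
const mediaSource = new MediaSource();
video.src = URL.createObjectURL(mediaSource);
mediaSource.addEventListener("sourceopen", async () => {
  const sb = mediaSource.addSourceBuffer('video/mp4; codecs="avc1.64001f"');
  for (const url of ["init.mp4", "seg-1.m4s", "seg-2.m4s"]) {       // placeholder segment names
    const data = await (await fetch(url)).arrayBuffer();
    await new Promise(resolve => {
      sb.addEventListener("updateend", resolve, { once: true });
      sb.appendBuffer(data);
    });
  }
});
```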
Upcoming core technologies
It’s always a very subjective exercise to try to predict the evolution of OTT technologies, but let’s give it a try, between beliefs and doubts.
Contribution technologies
While QUIC has won the edge battle, it hasn’t been integrated (yet) into contribution technologies like SRT or RIST, which have focused so far on linear TS ingest. With the expansion of CMAF usage beyond its initial client distribution boundaries, this could accelerate, as QUIC is a perfect match for single or multi-bitrate CMAF contribution. It would actually be good to see both SRT and RIST take this opportunity to converge into a single technology stack, as they basically do the same thing slightly differently, while their supporting organizations try to kill each other through “I got more members than you” press releases – a game where the SRT Alliance has so far outpaced the RIST Forum. It’s certainly good to have open alternatives to the Zixi protocol, though. The Facebook team recently published the RUSH IETF draft, which could potentially show such a convergence path towards QUIC for contribution protocols. From a video codec perspective, it seems that JPEG XS, which appeared in 2019, is becoming the reference option for low latency lossless contribution. Its carriage in MPEG-2 TS was standardized by MPEG, and the Video Services Forum (VSF) just released complementary recommendations for MPEG-2 TS and ST 2110-22 transports. The licensing terms seem to be reasonable too. JPEG XS should therefore be ubiquitous in the contribution space for many years to come.
ABR transcoding technologies
There is certainly a great space to occupy in the codec universe for Versatile Video Coding (aka VVC or H.266), with its expected 50% improvement over HEVC (more like 40% in current implementations, as per Jan Ozer and Dan Grois from Comcast) and a natural fit with the requirements of 8K and 360/VR streaming applications. But with two patent pools now formed by MPEG LA and Access Advance, we have some reasons to think that this new generation of MPEG codec could face the same licensing problems as its HEVC predecessor.
The other interesting new MPEG option is the Low Complexity Enhancement Video Coding (LCEVC, aka ISO/IEC 23094-2) approach, where two layers of codec enhancement can improve the perceptual quality of an underlying AVC, HEVC or even AV1 or VVC base encoding by up to 50%. Another way to look at it is that LCEVC can reduce your CDN bill by up to 50%: for example, you could generate the same perceived quality as a 15 Mbps 4K HEVC stream through a 1080p HEVC stream plus the LCEVC enhancement layers, for a total of 8 Mbps (or 9.5 Mbps for an AVC-based equivalent stream). As V-Nova is by far the only identified patent holder, there is a chance that the licensing terms will be more successful than the VVC ones and trigger a wave of implementations much quicker. The other major advantage of LCEVC over VVC is that it doesn’t require re-transcoding a whole AVC or HEVC content library to provide actionable benefits – assuming that you have enough control over your video players to upgrade them to support the enhancement layers. It’s also a smoother transition promise, with a backward compatibility path for legacy players that can’t be upgraded. The enhancement layers signaling is not yet specified in HLS and DASH, but that shouldn’t be more challenging than signaling a multi-layer Dolby Vision stream. Same thing for the CMAF binding.
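Putting rough numbers on that 4K example, a quick back-of-the-envelope calculation of the CDN-side impact could look like this (the base/enhancement split is an illustrative assumption of mine; only the 15 Mbps and 8 Mbps totals come from the paragraph above):

```typescript
// Back-of-the-envelope egress comparison for the 4K example above.
const native4kHevcKbps = 15000;               // single-layer 4K HEVC
const lcevcBaseKbps = 6000;                   // illustrative 1080p HEVC base layer
const lcevcEnhancementKbps = 2000;            // illustrative LCEVC enhancement layers
const lcevcTotalKbps = lcevcBaseKbps + lcevcEnhancementKbps;     // 8000 kbps total

const egressSaving = 1 - lcevcTotalKbps / native4kHevcKbps;      // ≈ 0.47
console.log(`~${Math.round(egressSaving * 100)}% less egress per 4K viewer`);
```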
From an audio codec perspective, it’s clear that we need new immersive options to go together with VR video tracks, with object-based audio support to allow customization of the streamsets. We also need flexibility to personalize the audio streams structure, based on accessibility needs. MPEG-H offers the level of personalization that is required on the producer and end-user sides, with Presets for simple streamsets composition and Advanced Settings for fine grain volume adjustments between streamset components, including loudness preservation. Fraunhofer IIS has got many blog posts and webinars available on this topic, alongside an authoring suite and related tutorials. The MPEG-H licensing program was launched recently, so it’s industry-ready, now.
A/B watermarking
As DRM and HDCP exploits have multiplied over the last years, server-side A/B watermarking has surfaced as “the last resort” for content protection, once the initial measures are circumvented. While it had been a playground for proprietary implementations for a couple of years, the same interoperability spirit that fostered the creation of CPIX has transformed the A/B watermarking landscape. It started with the release in late 2020 of the Ultra HD Forum Watermarking API for Encoder Integration, which allows live and VOD transcoders to host watermarking modules from multiple vendors through a common integration approach (with the subtleties of uncompressed-domain vs compressed-domain watermarking use cases, where the watermarking happens during encoding or as a post-processing step). Read this article by Laurent Piron for more details about this release. Currently at v1.0.1, this specification will certainly evolve slightly when the companion specification in the works at DASH-IF is released, but it won’t be deeply transformed, so it’s a solid foundation for sure. DASH-IF is currently extending this transcoder-level standardization effort by working on complementary guidelines for origins/packagers and CDN integration. For origins and packagers, the idea is to specify how A/B watermarking interacts with the stream ingest and transformation operations, and with the forward requests from CDNs, which pull the A or B versions of the segments based on a unique watermarking pattern per end-user and a layer of edge decision logic.
Overall, the idea is to allow implementers to combine and/or swap various encoders, origins/packagers and CDNs in their workflows, with a minimum of adjustments to the A/B logic when replacing a component. It’s gonna take a couple of months for the spec to be released, as the standardization effort is quite significant, but I’m convinced (and maybe I’m biased, as I’m involved in this work) that the combination of the two specs will result in a very strong basis for implementations.
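To make the CDN-side part of this more tangible, here is a purely hypothetical sketch of the kind of edge decision logic involved: each end-user gets a stable bit pattern, and each segment index is routed to the A or B variant accordingly. Nothing here (names, pattern derivation, origin layout) comes from the UHD Forum or DASH-IF documents – it only illustrates the principle described above:

```typescript
import { createHash } from "node:crypto";

// Hypothetical edge logic: derive a stable per-user watermarking pattern and route each
// forward request to the A or B variant of the requested segment.
function watermarkBit(userId: string, segmentIndex: number): 0 | 1 {
  const digest = createHash("sha256").update(userId).digest();      // stable per-user pattern
  const byte = digest[Math.floor(segmentIndex / 8) % digest.length];
  return ((byte >> (segmentIndex % 8)) & 1) as 0 | 1;
}

function forwardRequestPath(userId: string, segmentIndex: number): string {
  const variant = watermarkBit(userId, segmentIndex) === 0 ? "A" : "B";
  return `/origin/${variant}/video_${segmentIndex}.m4s`;             // illustrative origin layout
}

console.log(forwardRequestPath("subscriber-1234", 42)); // e.g. "/origin/B/video_42.m4s"
```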
DASH manifests optimization
There are a couple of ongoing initiatives to reduce the size of DASH manifests. The DRM signaling optimization, defined by the ContentProtection @ref/@refId attributes in DASH’s 5th edition, is the most straightforward one, as it allows a ContentProtection element to be declared once in the manifest and then referenced through its ID, thus factorizing the DRM parameters and significantly reducing the resulting manifest size. The DASH community is also looking at Hybrid Scheme Manifest optimizations, where the audio AdaptationSets use the concise $Number$+Duration scheme, independently of the $Time$-based scheme used for video AdaptationSets, in order to remove the verbosity of the audio <S> lines generated by the misalignment of audio sample rates with video frame rates. There’s still some verification work to do for this second optimization, with ad periods and asymmetric/non-compliant track durations, but by combining the two approaches the optimization potential is around 95% to 98% of the manifest size.
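Here is a sketch of what that DRM signaling factorization looks like inside an MPD, based on my reading of the 5th edition mechanism described above; the attribute casing and element placement are worth checking against the published spec, and the system ID/PSSH values are placeholders:

```typescript
// Sketch of DRM signaling factorization in an MPD: declare once, reference everywhere.
const mpdSnippet = `
  <!-- Declared once, with a reference ID (cenc namespace assumed to be declared on the MPD) -->
  <ContentProtection refId="widevine"
      schemeIdUri="urn:uuid:edef8ba9-79d6-4ace-a3c8-27dcd51d21ed" value="Widevine">
    <cenc:pssh>BASE64_ENCODED_PSSH_BOX</cenc:pssh>
  </ContentProtection>

  <!-- Referenced from the other AdaptationSets, instead of repeating the full element -->
  <ContentProtection ref="widevine"/>
`;
```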
And there’s the disruptive Patch Manifest approach for live streaming. Somehow comparable to the new LL-HLS #EXT-X-SKIP tag, the Patch Manifest approach has been pushed at MPEG (DASH 5th edition section 5.15) and DASH-IF by Hulu (see a great presentation on this topic from Zack Cava here), and it aims at changing a foundational assumption we have made so far: each time a player requests a manifest, it gets all of the DVR history carried by the playback URL. Instead of this very brutal approach that we’ve been using for years in the industry, the Patch Manifest approach says that the player gets a full manifest only on the first request (so that it can get the full DVR history from the stream start time up to now), and then gets incremental manifest updates on each Patch Manifest request, carrying only the segments added and removed since the last manifest update – the full media timeline being dynamically constructed in memory by the player, as the result of the initial manifest request and all subsequent Patch Manifest requests.
This mechanism is very efficient in resource-constrained playback environments, as it optimizes manifest parsing operations and substantially decreases network transfers, with very lightweight Patch Manifests being transferred. The Patch Manifest approach not only reduces the volume of transferred and parsed data, it also enables optimized ad insertion approaches. As such, it’s a critical tool for the next 5 DASH years to come, and it’s already supported in dash.js since version 3.2.1. If the support footprint among DASH players and packagers expands, there’s no doubt this will soon become the dominant approach for live DASH manifests. For more details about all MPD optimization approaches, please refer to Alex Giladi’s excellent presentation during the last Global Video Tech Meetup.
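A player-side sketch of that in-memory timeline reconstruction could look like the following; the patch structure used here is a deliberate simplification for illustration, not the actual MPD Patch XML syntax:

```typescript
interface SegmentRef { t: number; d: number; }   // presentation time and duration, in timescale units

// Simplified view of one manifest patch: segments that entered and left the DVR window.
interface TimelinePatch { added: SegmentRef[]; removedBeforeTime?: number; }

// The player keeps the full media timeline in memory (built from the initial full manifest)
// and applies each lightweight patch on top of it, instead of re-parsing a full manifest.
function applyPatch(timeline: SegmentRef[], patch: TimelinePatch): SegmentRef[] {
  const kept = patch.removedBeforeTime === undefined
    ? timeline
    : timeline.filter(s => s.t + s.d > patch.removedBeforeTime!);   // drop segments out of the DVR window
  return [...kept, ...patch.added];                                  // append newly published segments
}
```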
Ad insertion
The new ad insertion paradigm that the Patch Manifest approach enables is Server Guided Ad Insertion (SGAI), described in DASH-IF’s latest chapter on ad insertion – where the server doesn’t parse manifests anymore to replace media segment entries with ad segment references, but instead points the player to a lightweight Patch Manifest update that includes only the discrete ad pod segments and doesn’t generate extra live edge latency. The scalability benefit, from a server-side ad insertion perspective, is just huge. Apple recently released a comparable HLS feature with the Ad Interstitials proposal, which basically isolates the segments of an ad pod behind a discrete EXT-X-DATERANGE target URL. In this case, the video player requests the ad pod URL only when the ad pod is actually consumed, which decreases the load on ad servers for the initial DVR window as well, not only for the live manifest updates. You could do the same in DASH SGAI by using Xlink for the ad periods of the initial manifest request carrying the whole DVR history (so that the ads are resolved by the player only when approaching the ad pod).
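For illustration, an interstitial declaration in a playlist looks roughly like the snippet below, based on my reading of Apple’s proposal; the attribute set and values should be checked against the current draft, and the URLs are placeholders:

```typescript
// Rough shape of an HLS interstitial declaration: the ad pod lives behind a separate asset URI
// that the player only resolves when playback approaches the ad break.
const interstitialTag =
  '#EXT-X-DATERANGE:ID="ad-break-1",CLASS="com.apple.hls.interstitial",' +
  'START-DATE="2021-10-08T20:15:00.000Z",DURATION=30.0,' +
  'X-ASSET-URI="https://ads.example.com/pods/break-1/index.m3u8",X-RESUME-OFFSET=0.0';
```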
While both DASH SGAI and HLS Interstitials share roughly the same server-side approach, there’s a difference on the client side, where the DASH player is gonna handle both the media segments and the ad segments with a single player instance, whereas with the Interstitials approach, two HLS player instances will need to work in coordination, one for the content media segments and one for the ad pod segments – at least in the Apple implementation. While it might work fine in controlled environments like Apple devices, this dual-player approach has already proven its inefficiency in low-powered environments, so its applicability to the wider HLS ecosystem is fairly questionable, and traditional Server Side Ad Insertion (SSAI) will continue to be relevant for some time there. I expect further consolidation to happen with these new ad insertion approaches, through the CTA-WAVE DASH-HLS Interoperability initiative, but one thing is sure: the days of full manifest parsing for ad insertion are numbered, and it will make our lives easier.
Low latency
With LL-HLS and LL-DASH as the two elephants in the room, there’s little room left for other OTT-centric low latency approaches. While it doesn’t play in exactly the same latency category as LL-HLS and LL-DASH – as it’s targeting sub-second latency levels that are more in the WebRTC scope – the High Efficiency Stream Protocol (HESP) launched by THEO and promoted by the HESP Alliance aims exactly at being such an alternative, still using HTTP-based delivery.
The specification was recently published as an IETF draft. The Maximal Gain Profile, which is used to reach the lowest latency and zapping times, excludes the use of B frames in Continuation Stream segments, and with both the Maximal Gain Profile and the Compatibility Profile, the Initialization Stream segments (used at the beginning of playback sessions) must include only I frames. That means that the required encoding resources are at least doubled, to produce the Initialization Stream segments as I-frame-only segments at the same framerate as the Continuation Stream segments. Doubling the encoding scalability requirements is not trivial when producing a large ABR ladder, as it likely requires spreading the encoding load across multiple synchronized encoder instances. We’ll need a really good reason to accept this trade-off, like the cost and complexity of a WebRTC delivery infrastructure being vastly superior to the cost and complexity of a doubled encoding footprint, which might be true at a certain level of viewership, but it’s hard for me to tell exactly where the threshold is. Given this encoding scalability problem, I don’t see HESP replacing LL-HLS and/or LL-DASH for any use case where the target latency is 2 to 5 seconds, though.
Note (10/08/21): THEOplayer just published a blog post discussing possible mitigations of the encoding cost problem. I’m still not convinced by the proposed solutions, as they will generate visible quality differences on bitrate switches.
On the fast stream switching aspects that HESP is listing as another benefit, it’s worth noting that there is some activity happening at MPEG to define a solution for DASH, possibly without a specific switching AdaptationSet requiring a special encoding or re-packaging. So HESP might be competing only with HLS on this fast switching aspect, at the end of the day.
User eXperience technologies
The VR market didn’t boom like many were expecting, and the video resolution that we can get in VR headsets is still not sufficient to make the experience totally immersive (same with AR headsets), but something undoubtedly happened with the release of the Oculus Quest 2 in late 2020, and OTT services are investing more effort into VR – as we’ve seen recently with the NBC Olympics VR app. This streaming application was built with Unity, the popular 3D game development platform that has embraced XR (aka VR and AR) use cases, and the TiledMedia ClearVR SDK as the streaming video player. For VR applications, Unity is a flexible solution allowing the integration of 360- or 180-degree streams, as well as traditional 16:9 ones. It’s also very powerful, for example with masks allowing a 1080p camera view (and the corresponding audio tracks) to be extracted from a 4K quadrant video stream used as a mosaic, or a library for adding avatars to the scenes.
In parallel, we can see Unity now being used in classical streaming applications not involving VR or AR devices, but providing a next level user experience, as shown by the Aura applications for sports streaming (and leveraging the AVProVideo player). For premium OTT use cases requiring not only Clear Key AES-128 encryption but full DRM, NexPlayer for Unity brings Widevine DRM support for HLS and DASH streams. And finally, there is also the AntMedia Unity WebRTC SDK for ultra-low latency streaming use cases. While the main target playback platforms for Unity applications are the iOS/Android mobile devices, it’s worth noting that the same application can also run on Android TV devices or Windows/Mac desktop, depending on the target platforms that the video player component supports.
While Unity has been the de facto XR development environment, the growing maturity of this industry has fostered standardization with the OpenXR API, which is supported in Unity 2020.2+ with an official plugin provided by Unity. The web world is also part of the XR standardization effort, with W3C’s WebXR specification for Immersive Web experiences on multiple platforms (also in Safari). The specificity of WebXR is that applications are written in javascript (see WebXR frameworks), so it’s easier to find developers for this platform than for the Unity one. WebXR is still a young technology, but it’s already a supported engine in OpenXR, which has wide industry backing, so we’ll likely see a lot of WebXR/OpenXR applications showing up on multiple devices in the next 5 years. It’s exciting to see this convergence happening and enabling highly dynamic user interfaces for streaming applications.
Origin ingest and synced metadata
DASH-IF has been working for some time on an Ingest Specification (version 2 is in community review) that covers both CMAF ingest and DASH/HLS ingest, and aims at deprecating the legacy Smooth Streaming ingest that is still in use in multiple solutions. It was a rather controversial discussion at DASH-IF, and it’s probably not finished (on minor points like using the Unix epoch time for periods availability start time), as MPEG is possibly picking up the DASH-IF work as a starting point for a new MPEG spec for encoders and origins, but it goes in the right direction with Interface 1 (the CMAF one), which could unify the industry around a common protocol for fragmented ingest. The drivers for wider adoption will certainly be the increasing inability of the Smooth Streaming format to cope with further innovations, and the generalized need for low latency workflows, which the Ingest spec can support through HTTP/1.1 chunked transfer encoding (or HTTP/3 with DATA frames once the Ingest spec is HTTP/3-friendly) combined with chunked CMAF.
Also included in the Ingest spec since day one is the use of Timed Metadata Tracks to carry events such as SCTE-35 markers in a discrete track, rather than inside the video tracks as the industry has done since the beginning of digital video. It was and still is a controversial topic, as it requires a nested timeline to correlate the metadata event timestamps with the video and audio timestamps, making it more difficult to troubleshoot when things go south. Apart from this friction point, the idea is a good one, as it reduces the bandwidth used for metadata by using a single track instead of systematically duplicating the metadata in all video tracks, secures the metadata independently of the other media tracks, and makes it easier to process the metadata payloads through AI engines for translation/indexing/analysis. While there is a way to signal Timed Metadata Tracks in DASH manifests, it’s still not the case in HLS manifests, and solving that is probably what will trigger wider adoption. We rely on SDOs like CTA-WAVE to convince Apple to fill this kind of structural gap in the HLS spec. Right now Timed Metadata Tracks can be used upstream in the chain, but not directly by video players (at least the HLS ones); once end-to-end usage is made possible, it will be a totally different story for the industry.
Guest contribution from my friend and industry rockstar Thierry Fautier!
Open Caching: the end of delivery bottlenecks?
First, we need to look at the current lay of the land in terms of CDN technologies. You have the traditional CDNs operated by companies like Akamai, Limelight or Fastly. They offer a complete CDN service, but to be honest it is a bit of a black box, and after the recent outages at Fastly and Akamai in 2021, we saw there was not much transparency regarding why a service is down. Things are improving with the availability of real-time stats coming from CDNs, and of course also with a wider use of multi-CDN technologies. Still, understanding the issues that cause CDN outages or performance degradation remains a complex matter, especially when you start to use complex workflows such as DRM and DAI. The second problem with CDNs is scalability. To build more capacity, a CDN needs more traffic. But OTT services are often event-based, so the traffic is unpredictable. It’s a chicken-and-egg situation, which basically means that the capacity isn’t always there when the OTT service needs it. To make matters even worse, having CDN capacity doesn’t mean the ISP will be willing to ingest the additional traffic, so we can feel there is a real need for collaboration between CDNs and ISPs.
The second class of CDNs are the ones built directly by OTT content providers. Netflix has developed its own CDN called Open Connect. These servers are installed in the ISP network and managed directly by Netflix. One server can contain the most popular titles of the Netflix library, which is great for the ISP because they don’t have to backhaul the content through their network. The commercial terms between Netflix and ISPs have not been published. Some broadcasters such as BBC and M6, to name a few European ones, have built their own CDNs made of dedicated caches installed either at the peering points or inside the ISP network, mostly for VOD (live is a bit trickier to handle), but we see them also using public CDNs in a multi-CDN architecture. The third way to cache internet content is for the ISPs to build their own CDN infrastructure. We have seen a number of large ISPs, like Telefonica, go this route.
For the OTT provider, the issue is using all of these CDNs together: commercial, home-grown, and ISP. Managing them, even with multi-CDN technology, can be a real challenge (such as purging content from caches across all the CDNs). So, the million dollar question is: “Is there a way to converge all those approaches and make CDNs, at least for the video caching part, use a ‘standard’ approach, and also have ISPs involved in the delivery?” The answer to that is Open Caching. This approach came out of the IETF standardization work on CDN interconnection (CDNI), and was picked up by the Streaming Video Alliance (SVA), a group of 100+ companies that has built a collaborative forum aimed at solving tough video streaming problems. The SVA has set up an Open Caching group which defines the specifications, but also organizes plug fests and makes some code available as open source to facilitate adoption of the Open Caching specifications. The main purpose of Open Caching is to make CDN caches interoperable, whether they sit in a commercial provider, an ISP, or an OTT provider’s network. That means that an OTT provider can manage all of their caches, across all of their delivery architecture, from a single management tool. That’s because Open Caching is built off a suite of APIs. Nothing is proprietary.
Disney and Verizon have shown that Open Caching caches used for production traffic offer better performance than other CDN caches, because the caches can be located within the ISP network. All in all, the distribution of open caches across an OTT provider’s delivery architecture can help avoid congestion problems and offer more throughput per user. The next question is, “how do you set up an Open Caching system?” Let us look at the Open Caching architecture (diagram courtesy of the SVA). We have to distinguish between the CDN, which orchestrates the delegation of its traffic to the Open Caching nodes using the Open Caching Controller via the Open Caching Content Management API, and the ISP, which installs Open Caching nodes (SP OC Nodes) orchestrated by the Service Provider Open Caching Controller. The API definition for the Open Caching mechanism is still at the draft stage and only available to SVA members, but it’s in the process of being ratified by the members and published. So, what are the benefits of Open Caching? First, it’s an open standard which all CDN companies can use, and therefore has a higher chance of being adopted by commercial CDNs, ISPs, and OTT providers. As the SVA publishes open-source code, it will be even easier to get up and running. And you don’t even need to build an open cache from scratch. If you already have a cache, you can make it “open caching compliant” by enabling it to communicate via APIs with the Open Caching Controller. The SVA also just announced an API Testbed for members to test their Open Cache implementations and ensure interoperability. The second benefit is that Open Caching brings the caches close to the users, inside the ISP network, which provides higher scalability and therefore a lower TCO. One of the Open Caching results presented by the SVA showed a 30% improvement in available bandwidth vs traditional CDN approaches on a live network. The next question is, “who has deployed Open Caching?” Qwilt, one of the founding members of the SVA, has announced several deployments with tier-1 operators through a strategic partnership with Cisco. BT and Telecom Argentina are amongst those that have deployed. Telefonica, Orange, and Verizon are also very involved ISPs. All of this just shows it can be done. The final question is, “how many CDNs can be deployed using Open Caching?” It is clear that, to address the peak usage of CDNs, an Open Caching solution will eventually have to run on cloud infrastructure deployed inside the ISP, which could take some time, although with the advent of Mobile Edge Computing (MEC), things could accelerate thanks to the push of 5G.
My take on Open Caching is the following: while it’s certainly an interesting standardization initiative, I don’t think that traditional CDNs will adopt it based on their own initiative, as it further commoditizes the content delivery business. The likelihood of seeing Open Caching-based commercial CDNs appear on the major public clouds is thin, given the price of the data transfer out on these platforms. I think we’ll continue to see this technology being used for building private dedicated CDNs or ISP caching infrastructures, until someone figures out a business model for a next gen commercial CDN built on top of multiple small leased infrastructures where bandwidth price is not a blocker. Certainly an interesting topic to monitor, though!
QoE/Multi-CDN
Building a flexible multi-CDN delivery architecture for video has never been an easy exercise: proprietary edge video tokenization implementations per CDN often make tokenization a no-go and force the use of DRMs instead, performance data collected at player level is hard to correlate with CDN logs, and proprietary CDN load-balancing mechanisms often require custom player implementations, as the HLS and DASH specs don’t provide adequate mechanisms natively. All of this is about to change, as multiple initiatives aim at solving these problems.
CTA-WAVE has recently started joint work with the SVA on the tokenization topic, called the Common Access Token initiative. Its goal is to define a video tokenization scheme that can be used across multiple CDNs, so that the only difference between two CDNs’ tokens would be the secret key used to secure the URL signature. Once this standardization effort produces results (I’d say around Q2 2022), we can probably expect quick adoption among the major CDNs, which are all participating in the discussion, and among their customers, who are eager to simplify their implementations.
CDN-to-client performance data correlation is now easier with the Common Media Client Data (CMCD) specification, which was released by CTA-WAVE in late 2020. This specification defines key player data points (aka “Reserved keys”) like session ID, buffer length or measured throughput, and ways to include these data points in the CDN requests, through headers/query strings, or in parallel with the object requests, with independent JSON objects being sent to the CDNs. While the spec doesn’t say how CDNs should forward the data points to 3rd party multi-CDN decisioning services, it’s still very important progress, as it’s the first time we have a standardized framework to understand the performance of video players across multiple playback sessions and CDN environments. On the player side there’s support in dash.js, Shaka Player, and Akamai AMP. On the CDN side, Akamai and Fastly support it so far, but this is expanding rapidly. The Datazoom video data platform also supports it. As CMCD support is critical to generate the data required for multi-CDN switching, it will be crucial to see more players supporting CMCD in the near future, especially the Apple HLS players, which are not customizable in terms of object request structure. The next step in this standardization process is to define a Common Media Server Data (CMSD) payload format that could be used to objectively compare delivery performance across multiple CDNs. This work is just starting at CTA-WAVE, so it’s gonna take a couple of quarters until we see actionable results. But combining CMCD and CMSD data inside a single data lake should make it possible to inform multi-CDN switching decisions in a very efficient manner.
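As an illustration of what CMCD looks like on the wire, here is a sketch of a player attaching a few reserved keys to a segment request as a query parameter; the key selection and values are illustrative, and the spec also allows the same data to be sent as HTTP request headers:

```typescript
// Attach a CMCD payload to a segment request as the single 'CMCD' query parameter.
function withCmcd(segmentUrl: string, payload: string): string {
  const url = new URL(segmentUrl);
  url.searchParams.set("CMCD", payload);   // URL-encodes the comma-separated key/value list
  return url.toString();
}

// bl = buffer length (ms), mtp = measured throughput (kbps), ot = object type (v = video),
// sf = streaming format (d = DASH), sid = session ID (a quoted string).
const requestUrl = withCmcd(
  "https://cdn.example.com/live/video_1080p_00042.m4s",
  'bl=2100,mtp=25400,ot=v,sf=d,sid="6e2fb550-c457-11eb-b8bc-0242ac130003"',
);
console.log(requestUrl); // ...video_1080p_00042.m4s?CMCD=bl%3D2100%2Cmtp%3D25400...
```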
That’s where the HLS Content Steering proposal plugs in. It doesn’t talk about how CDN switching decisions should be made per client/end-user, but it describes how multiple versions of the same live or VOD content across multiple CDNs should be described in the HLS parent playlists, and how the player should switch between these versions depending on the JSON response from the Content Steering service (the multi-CDN switching service, basically). That’s a fairly simple mechanism overall, so it has every chance of succeeding. The problem is that it’s an HLS-centric mechanism, so we’ll have to work on equivalent DASH mechanisms, with more manifest flexibility than the static BaseURL elements to support the multiple CDNs and associated edge tokens, while leveraging the same Content Steering service JSON payloads. The work hasn’t started yet at MPEG or DASH-IF, but it’s just a question of time until it gets picked up. Once it’s done, we’ll have a powerful framework for building multi-CDN switching services, with CMCD/CMSD and the Common Access Tokens.
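To give an idea of the moving parts, here is a sketch of the steering response a player could receive and act upon, based on my reading of the HLS Content Steering proposal; the field names should be checked against the current draft, and the pathway identifiers are illustrative:

```typescript
// Rough shape of a Content Steering response: a TTL before the next poll and an
// ordered list of pathways (typically one per CDN) that the player should prefer.
interface SteeringManifest {
  VERSION: number;
  TTL: number;                      // seconds to wait before re-requesting the steering manifest
  "RELOAD-URI"?: string;            // optional override of the steering server URI
  "PATHWAY-PRIORITY": string[];     // e.g. ["cdn-b", "cdn-a"], matching pathway IDs in the playlist
}

// Pick the highest-priority pathway that the multivariant playlist actually declares.
function pickPathway(steering: SteeringManifest, declaredPathways: string[]): string {
  return steering["PATHWAY-PRIORITY"].find(p => declaredPathways.includes(p)) ?? declaredPathways[0];
}

const steering: SteeringManifest = { VERSION: 1, TTL: 300, "PATHWAY-PRIORITY": ["cdn-b", "cdn-a"] };
console.log(pickPathway(steering, ["cdn-a", "cdn-b"])); // "cdn-b"
```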
Scalability
I think one of the most impactful technologies for scalability is gonna be the Adaptive Media Streaming over IP Multicast one – otherwise known as Multicast ABR (mABR). It’s been done in a proprietary mode for the last 5 years or so, by companies like Broadpeak, with a good adoption rate by telcos and cable operators. It’s basically taking a unicast DASH or HLS stream as input, and translating it into multicast DASH or HLS for the live edge segments, with the video players requesting DVR segments in unicast, past the last few minutes of live content. It’s been standardized recently by DVB as DVB-MABR, as part of the wider DVB over Internet (DVB-I) initiative, and it’s now available to all network operators for implementation. 5G has got the same potential with the 5G Mixed-mode multicast capability. In practice, it’s still bound to the very predictive IP Multicast Plan provisioning on telco networks – meaning that it’s usable for known 24/7 live channels in a fairly static, IPTV-like, configuration.
But let’s fast forward to the day when this technology eventually uses fully dynamic provisioning (which is not part of the spec at this stage, just living in my imagination right now) – meaning when it can be applied to the most popular live streams on an operator network, be it a 24/7 channel known in advance or any OTT event stream (like a premium boxing or soccer pay-per-view event) suddenly gathering tens or hundreds of thousands of viewers on the network. The benefit for regular OTT streams would be huge, compared to the current unicast situation, where streams get cached in the ISP infrastructure in a best effort way, due to the scalability limits of reverse-proxy architectures.
The other limitation of the DVB-MABR approach (same story with ATSC 3.0, actually) is basically that it adopts the traditional broadcast perspective while swapping linear streams for segmented formats, instead of trying to improve segmented format delivery through broadcast scalability techniques. The way it redirects video players to the Multicast rendezvous service, through a 3rd party CDN broker service (aka the Content Steering Service, in the Apple lingo), is a black-or-white choice from an ad insertion perspective: if the live stream is played in unicast, then you can use SGAI – and if the live stream is played in multicast, then ad insertion needs to happen client-side (over unicast) with all the pitfalls that it carries. At the end of the day, the personalized ad segments will be transported over unicast whatever the live stream transport mode, so why not try to decouple manifest and segment transport, by systematically delivering the personalized manifests in unicast (now a highly scalable option, with the DASH Patch Manifest or HLS Ad Interstitials), with segment forward requests being translated from unicast to multicast on the gateway that does the multicast termination? This would keep the ad insertion workflows efficient by sticking to the SGAI approach, while at the same time preserving the scalability of the segmented linear streams through multicast delivery of the media segments (which is the real scalability problem). It would also preserve the flexibility to do targeted content replacement (like geo-localized/subscriber status-based blackouts or live sports games replacement), which is not easy to do in a full-multicast live scenario. The problem then becomes how to let the Multicast gateway know about the mapping between unicast and multicast media segment URIs, but this is a trivial problem to solve, compared to the unicast-to-multicast translation being done upstream by the Multicast Server.
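A hypothetical sketch of that gateway-side mapping could be as simple as a lookup table keyed on the unicast URL prefix, populated from whatever metadata the multicast sessions expose; everything here (names, URI schemes, the fallback behavior) is imagined for illustration and not taken from DVB-MABR:

```typescript
// Imagined gateway-side lookup: map a unicast segment request from the player to the
// multicast transport session the gateway is already receiving that content on.
interface MulticastMapping {
  unicastPrefix: string;        // prefix of the unicast segment URLs found in the personalized manifest
  multicastSession: string;     // illustrative identifier of the corresponding multicast session
}

const mappings: MulticastMapping[] = [
  { unicastPrefix: "https://cdn.example.com/live/channel1/", multicastSession: "239.1.1.1:5000/channel1" },
];

function resolveSegment(requestUrl: string): { fromMulticast: boolean; source: string } {
  const m = mappings.find(x => requestUrl.startsWith(x.unicastPrefix));
  return m
    ? { fromMulticast: true, source: m.multicastSession }   // serve from the local multicast cache
    : { fromMulticast: false, source: requestUrl };         // fall back to unicast (e.g. personalized ad segments)
}

console.log(resolveSegment("https://ads.example.com/pods/break-1/ad_001.m4s")); // unicast fallback for ads
```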
While I’m convinced that the hybrid unicast/multicast delivery scheme that I’m proposing here can be implemented successfully, there is another problem that the delivery of media segments over multicast introduces: it basically prevents A/B watermarking from working, as the media segments are the same for every video player. The CDN edge decisioning logic that powers the media segment forward request routing in a unicast A/B watermarking context needs to be replaced by something else. The first option is to transport all A and B variants of the segments through multicast and let the Multicast Gateway operate the A/B logic instead of the unicast CDN edge. This is challenging from a security perspective (the Multicast Gateway has to be hardened and to support a complex A/B logic) and from a network perspective (with all streams being carried in A and B variants on different multicast URIs, and all the space that it occupies on the network). The second option is to use a single set of multicast media segments and to replace server-side A/B watermarking with client-side watermarking, with the same kind of hardening challenges on the video player side, which might not really be an extra burden if watermarking is already done client-side with full unicast OTT distribution. Either way, there is a big challenge to solve here, and it’s definitely not straightforward. I trust DVB and DASH-IF experts to find an elegant solution, if we ever come that far in the specification of hybrid unicast/multicast delivery modes.
Putting it all together
Throughout this article, we have arrived at a hybrid architecture vision where we can benefit both from the flexibility of OTT solutions – with server-side stream individualization/SGAI, manifest optimization and multi-CDN switching based on standardized QoE feedback loops – and from the scalability of multicast delivery for media segments, whenever possible. Let’s summarize it with a high level diagram, to see where each technology needs to be implemented and what the main data flows are (not how the redundancy/failover architecture should be designed):
It’s probably gonna take me another 5 years to bring this vision to life, but this is fine, as I’m still young 🙂 I would just conclude by saying that it’s been an exciting ride for me to work with numerous streaming technologies for the last 21 years, and that I’m always fascinated by what’s coming next in this area. We’ve seen an impressive number of technologies rising up on the OTT front for the last 5 years, and the rhythm doesn’t seem to slow down much for the next 5 ones. It’s great to see that we are reaching the point where we have all the necessary tools to make streaming as reliable and scalable as broadcast – finally!
Post scriptum
I’d like to send a shout out to some well known names from the standardization world who make all of this possible, at MPEG, DASH-IF, CTA-WAVE, DVB and more – you guys rule: Ali Begen, Romain Bouqueau, Zachary Cava, Cyril Concolato, Gwenaël Doërr, Mike Dolan, Thierry Fautier, Per Fröjdh, Alex Giladi, Will Law, Rufael Mekuria, Jon Piesing, Laurent Piron, Daniel Silhavy, John Simmons, Iraj Sodagar, Dan Sparacio, Michael Stattmann, Thomas Stockhammer, Christian Timmerer… I can’t possibly list here every bright mind I’ve met in these circles over the last ten years, I just want to say that I’m feeling honored to build things with these distinguished fellas.
And also a special memory for the person who inspired me the most in my career, Sam Blackman. He was driven by the spirit of innovation, the empathy for his industry colleagues (and all human beings around him) and the mission to give back to the community. I feel grateful to have him as a model to guide my actions at Elemental and in the wider industry, every day. Sam left us too prematurely and forever will be in our hearts. Be Like Sam.