Abstract

Recently, there has been increasing interest in neural network-based video coding, including end-to-end and hybrid schemes. To foster research in this emerging field and provide a benchmark, we propose this Grand Challenge (GC). In this GC, different neural network-based coding schemes will be evaluated according to their coding efficiency and methodological innovations. Three tracks will be evaluated:

  • hybrid neural network-based (NN-based) video codec,
  • end-to-end video codec,
  • neural network enhanced VVC encoder.

In the hybrid codec track, deep network-based coding tools shall be used within traditional video coding schemes. In the end-to-end codec track, the whole video codec system shall be built primarily upon deep networks. In the neural network enhanced VVC encoder track, deep network-based encoding algorithms may be applied in a VVC encoder that generates VVC-compatible bitstreams.

Participants shall express their interest in this Grand Challenge by sending an email to the organizer, Dr. Yue Li, and are invited to submit their schemes as ISCAS papers. The papers will go through the regular review process and, if accepted, must be presented at ISCAS 2024. The submission instructions for Grand Challenge papers will be communicated by the organizers.

Rationale

In recent years, deep learning-based image/video coding schemes have made remarkable progress. As two representative approaches toward future video codecs, hybrid solutions and end-to-end solutions have both been investigated extensively. Hybrid solutions adopt deep network-based coding tools to enhance traditional video coding schemes, while end-to-end solutions build the whole compression scheme upon deep networks. In addition, NN-based methods are widely studied to optimize or speed up encoders compliant with existing popular standards such as VVC. Despite this progress, numerous challenges remain to be addressed:

  • How to harmonize a deep coding tool with a hybrid video codec, for example, how to take the compression process into consideration when developing a deep tool for pre-processing;
  • How to exploit long-term temporal dependency in an end-to-end framework for video coding;
  • How to leverage automated machine learning-based network architecture optimization for higher coding efficiency;
  • How to perform efficient bit allocation with deep learning frameworks;
  • How to achieve a better global result in terms of rate-distortion trade-offs, for example, by taking into account the impact of the current coding step on later frames, possibly using reinforcement learning;
  • How to achieve better complexity-efficiency trade-offs;
  • How to speed up a VVC encoder with minimal coding efficiency loss via NN methods, or how to use NN-based pre-processing to enhance VVC encoding efficiency.

In view of these challenges, several activities towards improving deep-learning-based image/video coding schemes have been initiated. For example, a special section on “Learning-based Image and Video Compression” was published in TCSVT, July 2020; a special section on “Optimized Image/Video Coding Based on Deep Learning” was published in OJCAS, December 2021; and the “Challenge on Learned Image Compression (CLIC)” at CVPR has been organized annually since 2018. In the hope of encouraging more innovative contributions to the aforementioned challenges within the ISCAS community, we first proposed this grand challenge for ISCAS 2022. It has since been held successfully twice (ISCAS 2022 and ISCAS 2023), attracting researchers from all over the world. In response to the strong interest from experts in this area, the grand challenge will be held again at ISCAS 2024, with more tracks and more awards.

Requirements and Evaluation

Training Data Set

It is recommended to use the following training data:

  • UVG dataset: http://ultravideo.cs.tut.fi/
  • CDVL dataset: https://cdvl.org/

Additional training data may also be used, provided that they are described in the submitted document.

Test Specifications

In the test, each scheme will be evaluated with multiple YUV 4:2:0 test sequences at a resolution of 1920x1080.
There is no constraint on the reference structure. Note that a neural network must be used in the decoding process for the hybrid track and the end-to-end track, whereas the VVC reference software VTM will be used to decode the bitstreams of the NN enhanced VVC encoder track.

Evaluation Criteria

The test sequences will be released according to the timeline and the results will be evaluated with the following criteria:

  • The decoded sequences will be evaluated in the 4:2:0 color format.
  • A weighted PSNR, computed as (6 × PSNR_Y + PSNR_U + PSNR_V) / 8, will be used to evaluate the distortion of the decoded pictures (see the sketch below).
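For illustration, the following minimal Python sketch computes this weighted PSNR for one 8-bit YUV 4:2:0 frame. The frame-reading helper, the 8-bit assumption, and the synthetic example data are illustrative only and are not part of the official evaluation tools.

    # Sketch of the evaluation metric: wPSNR = (6*PSNR_Y + PSNR_U + PSNR_V) / 8,
    # computed on 8-bit YUV 4:2:0 planes. Helper names and data are illustrative.
    import numpy as np

    def read_yuv420_frame(f, width, height):
        """Read one 8-bit YUV 4:2:0 frame from an open binary file object."""
        y = np.frombuffer(f.read(width * height), dtype=np.uint8).reshape(height, width)
        u = np.frombuffer(f.read(width * height // 4), dtype=np.uint8).reshape(height // 2, width // 2)
        v = np.frombuffer(f.read(width * height // 4), dtype=np.uint8).reshape(height // 2, width // 2)
        return y, u, v

    def psnr(ref, rec, peak=255.0):
        """PSNR of one component (returns inf for identical planes)."""
        mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
        return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

    def weighted_psnr(ref_planes, rec_planes):
        """Combine the Y, U, V PSNRs as (6*Y + U + V) / 8."""
        p_y, p_u, p_v = (psnr(r, d) for r, d in zip(ref_planes, rec_planes))
        return (6.0 * p_y + p_u + p_v) / 8.0

    if __name__ == "__main__":
        # Synthetic example: a flat gray 1080p frame vs. a slightly noisy reconstruction.
        rng = np.random.default_rng(0)
        ref = [np.full((1080, 1920), 128, np.uint8),
               np.full((540, 960), 128, np.uint8),
               np.full((540, 960), 128, np.uint8)]
        rec = [np.clip(p.astype(np.int16) + rng.integers(-2, 3, p.shape), 0, 255).astype(np.uint8)
               for p in ref]
        print(f"weighted PSNR: {weighted_psnr(ref, rec):.2f} dB")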

The average Bjøntegaard delta bit-rate (BD-rate) [1] over all test sequences will be used to compare coding efficiency.
Anchors generated with HM 16.22 [2] and VTM-20.2 [3], coded with QPs = {22, 27, 32, 37} under the random access configurations defined in the HM and VTM common test conditions [4, 5], will be provided. Note that the HM anchor is used for the hybrid and end-to-end tracks, while the VTM anchor is used for the VVC encoder-only track. The released anchor data will include the bit-rates corresponding to the four QPs for each sequence.
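As a rough illustration of how BD-rate could be computed, the following Python sketch follows the polynomial-fit formulation of [1] (log-rate fitted as a cubic function of PSNR and averaged over the overlapping quality interval). The exact reference implementation used by the organizers may differ, and all numbers in the example are made up.

    import numpy as np

    def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
        """Bjøntegaard delta bit-rate (%) of a test RD curve versus an anchor RD curve.
        Each argument is a list of four values; a negative result means bit-rate savings."""
        lr_a = np.log(np.asarray(rate_anchor, dtype=float))
        lr_t = np.log(np.asarray(rate_test, dtype=float))
        p_a = np.polyfit(psnr_anchor, lr_a, 3)   # cubic fit: log-rate as a function of PSNR
        p_t = np.polyfit(psnr_test, lr_t, 3)

        lo = max(min(psnr_anchor), min(psnr_test))  # overlapping PSNR interval
        hi = min(max(psnr_anchor), max(psnr_test))

        int_a = np.polyint(p_a)
        int_t = np.polyint(p_t)
        avg_a = (np.polyval(int_a, hi) - np.polyval(int_a, lo)) / (hi - lo)
        avg_t = (np.polyval(int_t, hi) - np.polyval(int_t, lo)) / (hi - lo)
        return (np.exp(avg_t - avg_a) - 1.0) * 100.0

    # Example with made-up RD points (kbps, dB); a negative value means the test
    # codec needs less bit-rate than the anchor for the same quality.
    anchor_rate = [6500.0, 3400.0, 1800.0, 950.0]
    anchor_psnr = [40.1, 38.0, 35.8, 33.5]
    test_rate   = [6100.0, 3150.0, 1700.0, 920.0]
    test_psnr   = [40.2, 38.1, 35.9, 33.6]
    print(f"BD-rate: {bd_rate(anchor_rate, anchor_psnr, test_rate, test_psnr):.2f} %")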

Additional constraints for the first two tracks (i.e., the hybrid NN-based and end-to-end video codec) are listed as follows:

  • The proposed method shall generate four bitstreams for each sequence, targeting the anchor bit-rates corresponding to the four QPs. For each sequence, all four actual bit-rates shall lie within [80% × the lowest anchor bit-rate, 120% × the highest anchor bit-rate] (a range check is sketched after this list);
  • Only one single decoder shall be utilized to decode all the bitstreams;
  • The intra period in the proposed submission shall be no larger than that used by the anchor in compressing the validation and test sequences.
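For illustration, a minimal Python check of the bit-rate range constraint in the first item above might look as follows; the function name and the example bit-rates are hypothetical and not part of the official tooling.

    def in_allowed_range(bitrates_kbps, anchor_bitrates_kbps):
        """Check that every submitted bit-rate of a sequence lies in
        [0.8 * lowest anchor bit-rate, 1.2 * highest anchor bit-rate]."""
        lo = 0.8 * min(anchor_bitrates_kbps)
        hi = 1.2 * max(anchor_bitrates_kbps)
        return all(lo <= r <= hi for r in bitrates_kbps)

    # Example with made-up numbers (kbps) for one sequence:
    anchor = [950.0, 1800.0, 3400.0, 6500.0]   # anchor rates at the four QPs (illustrative)
    proposed = [820.0, 1750.0, 3600.0, 7600.0]
    print(in_allowed_range(proposed, anchor))  # True: all within [760.0, 7800.0]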

For the NN enhanced VVC encoder track, the additional requirements are as follows:

  • The submitted Docker container shall be capable of encoding the test sequences and generating VTM-compatible bitstreams.
  • The proposed method shall generate four bitstreams for each sequence, targeting the anchor bit-rates corresponding to the four QPs. For each test point, the bit-rate of the proposed method shall be within 90% to 110% of the anchor bit-rate.
  • The VTM-20.2 decoder will be used to decode the generated bitstreams into reconstructed YUV files, from which the PSNR values will be calculated. All generated bitstreams MUST decode successfully.
  • The VTM-20.2 encoder is used as the anchor encoder. For each test point, let T1 denote the encoding time of the proposed encoder and T2 the encoding time of the VTM-20.2 encoder; T1 and T2 shall satisfy T1 <= 70% × T2. Note that T1 and T2 shall be measured on the same platform using a single thread (e.g., Intel(R) Xeon(R) Platinum 8336C CPU @ 2.30GHz, NVIDIA A100-SXM4-80GB GPU). The encoding time comparison will be verified by the organizers (a check covering the bit-rate and runtime constraints is sketched below).
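As referenced in the last item, the following minimal Python sketch checks both per-test-point constraints of this track (bit-rate within 90% to 110% of the anchor, and T1 <= 70% × T2). Function names and numbers are hypothetical; the official verification is performed by the organizers.

    def vvc_track_compliant(rate_test, rate_anchor, t_proposed, t_anchor):
        """Check that each test point's bit-rate lies within 90%..110% of the anchor
        bit-rate and that each encoding time satisfies T1 <= 0.7 * T2."""
        rate_ok = all(0.9 * ra <= rt <= 1.1 * ra for rt, ra in zip(rate_test, rate_anchor))
        time_ok = all(t1 <= 0.7 * t2 for t1, t2 in zip(t_proposed, t_anchor))
        return rate_ok and time_ok

    # Example with made-up numbers: four test points of one sequence.
    anchor_rates   = [6500.0, 3400.0, 1800.0, 950.0]   # kbps, VTM-20.2 anchor
    proposed_rates = [6200.0, 3300.0, 1900.0, 1000.0]  # within 90%..110% of the anchor
    anchor_times   = [5200.0, 4100.0, 3300.0, 2700.0]  # seconds, VTM-20.2 encoder
    proposed_times = [3000.0, 2500.0, 2000.0, 1700.0]  # each <= 70% of the anchor time
    print(vvc_track_compliant(proposed_rates, anchor_rates, proposed_times, anchor_times))  # True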

Proposed Documents

A Docker container with the executable scheme must be submitted for result generation and cross-checking. Each participant is invited to submit an ISCAS paper, which must describe the following items in detail:

  • The methodology
  • The training data set
  • Detailed rate-distortion data (comparison with the provided anchor is encouraged)
  • A complexity analysis of the proposed solution is also encouraged for the paper submission.

References

[1] G. Bjøntegaard, “Calculation of average PSNR differences between RD-curves,” ITU-T SG16/Q6, Doc. VCEG-M33, Austin, TX, USA, Apr. 2001.
[2] https://vcgit.hhi.fraunhofer.de/jvet/HM/-/tree/HM-16.22
[3] https://vcgit.hhi.fraunhofer.de/jvet/VVCSoftware_VTM/-/tree/VTM-20.2
[4] “Common test conditions and software reference configurations for HM,” JCT-VC, Doc. JCTVC-L1100.
[5] “JVET common test conditions and software reference configurations,” JVET, Doc. JVET-J1010.