

Project: ffmpeg and multimedia processing

Ao Shen

June 11, 2022

Contents

1 Functional Requirements
1.1 Micro-benchmarking: virtual function and template dispatch
1.2 Video Contact Sheet
1.2.1 Frame Extraction
1.2.2 Resize and Compose a sheet
1.2.3 Add text
2 Non-functional requirements to your Code
2.1 Code style
2.2 Memory Management
2.3 Performance
3 Submission and Report
3.1 Submission
3.2 Project Report
A Background Information
A.1 Anatomy of a video file
A.2 ffmpeg API Overview
A.3 In one pixel: color and gamma
A.4 Pixels in one frame: planar and packed format

In this project, you will develop a series of small tools for processing media files, mostly video files. Multimedia encoding algorithms are among the most complex in use, so we will rely on a state-of-the-art library, ffmpeg, to do the actual encoding and decoding. We still believe this can be a worthwhile learning experience: modern computing systems have become so complex that you have to integrate many libraries to program them, whether for work, research or your own personal use.


1 Functional Requirements

1.1 Micro-benchmarking: virtual function and template dispatch

(15%) As you will see in Section A.4, pixel data in decoded video frames can be stored in different arrangements. When using C++, it is possible to encapsulate the difference behind a virtual function and route every pixel access through it.

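class FrameViewBase {
public:
    virtual ~FrameViewBase() = default; // objects are deleted through the base pointer
    virtual uint8_t data_from_pixel(int w, int h) = 0;
    virtual double calculate() { /* calls data_from_pixel() */ }
protected:
    uint8_t *data; // protected so that derived views can read the buffer
};

class YUV444PFrame : public FrameViewBase {
public:
    uint8_t data_from_pixel(int w, int h) override {
        // get required data at given point
    }
};

double calculate_frame(AVPixelFormat fmt, uint8_t *buffer) {
    FrameViewBase *frame = nullptr;
    // factory pattern: pick the concrete view from the runtime pixel format
    if (fmt == AV_PIX_FMT_YUV444P) frame = new YUV444PFrame(buffer, /*...*/);
    if (frame == nullptr) return 0.0; // unsupported format
    double result = frame->calculate(); // virtual function
    delete frame;
    return result;
}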

With the factory pattern, you only need to write common processing functions such as calculate once, in the base class.

On the other hand, if you would like to play with templates a bit, the same functionality can be achieved as follows:

struct YUV444PFrame {
    uint8_t *data; /* ... */
};

// The primary template is deleted: only formats with a specialization compile.
template<typename T>
uint8_t data_from_pixel(T *frame, int w, int h) = delete;

template<>
uint8_t data_from_pixel(YUV444PFrame *frame, int w, int h) {
    // get required data at given point
}

template<typename T> double calculate(T *frame) { /* ... */ }

double calculate_frame(AVPixelFormat fmt, uint8_t *buffer) {
    YUV444PFrame frame444{buffer, /* ... */};
    YUV420PFrame frame420{buffer, /* ... */};
    if (fmt == AV_PIX_FMT_YUV444P)
        return calculate(&frame444); // template function, resolved at compile time
    /* ... same for the other formats ... */
    return 0.0; // unsupported format
}


Figure 1: A video contact sheet, generated by https://github.com/amietn/vcsi

Read through the code given in the microbench directory, and run the benchmark. Does the performance differ? If it does, why?

Requirement. Run the benchmark with the CMake Release profile. Provide your benchmark result, the specification of the computer on which the benchmark ran, and your explanation of the difference (or lack thereof).
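The kind of comparison asked for here can be sketched in a few lines. The following is only an illustration and is not the code in the microbench directory; ViewBase, PlainView, StaticView, sum_virtual, sum_template, time_ms and compare are all hypothetical names chosen for this sketch.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a frame view accessed through a virtual function.
struct ViewBase {
    virtual ~ViewBase() = default;
    virtual uint8_t at(std::size_t i) const = 0;  // resolved at run time
};

struct PlainView : ViewBase {
    explicit PlainView(const std::vector<uint8_t>& d) : data(d) {}
    uint8_t at(std::size_t i) const override { return data[i]; }
    const std::vector<uint8_t>& data;
};

// Hypothetical stand-in for a frame view used through a template.
struct StaticView {
    const uint8_t* data;
    uint8_t at(std::size_t i) const { return data[i]; }  // non-virtual
};

// Dynamic dispatch: the callee is only known through the base interface.
inline uint64_t sum_virtual(const ViewBase& v, std::size_t n) {
    uint64_t s = 0;
    for (std::size_t i = 0; i < n; ++i) s += v.at(i);
    return s;
}

// Static dispatch: T is known at compile time, so at() can be inlined.
template <typename T>
uint64_t sum_template(const T& v, std::size_t n) {
    uint64_t s = 0;
    for (std::size_t i = 0; i < n; ++i) s += v.at(i);
    return s;
}

// Runs f once and returns the elapsed wall-clock time in milliseconds.
template <typename F>
double time_ms(F&& f) {
    const auto t0 = std::chrono::steady_clock::now();
    f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Times both variants over the same buffer and checks that they agree.
inline void compare(const std::vector<uint8_t>& buf, double& virt_ms, double& tmpl_ms) {
    PlainView pv(buf);
    StaticView sv{buf.data()};
    uint64_t a = 0, b = 0;
    virt_ms = time_ms([&] { a = sum_virtual(pv, buf.size()); });
    tmpl_ms = time_ms([&] { b = sum_template(sv, buf.size()); });
    assert(a == b);  // same work, different dispatch
}
```

A single timed run is noisy; a real measurement would repeat each variant many times and would also make sure the compiler cannot discard the result.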

Remark. The given code uses the ffmpeg library to read its input data. Pay attention to ffmpeg_decode_sample.c and read the description of the ffmpeg API in Section A.2. This should be helpful for the tasks that follow.

Please refer to README.md in your code repository for how to run the benchmark code.

1.2 Video Contact Sheet

A video contact sheet is a picture with several snapshots taken at different time points of a given video (see Figure 1). It is often used as a preview for a video before downloading. In this project, you will build such a tool with C/C++ step by step.

1.2.1 Frame Extraction

(20%) First, you will need to extract frames at different time points from a

video.

Requirement. The filename of the input video file is given as a command-line parameter (argv[1] in the parameters of the main function). Extract 6 different frames from that video, at about the beginning, 1/5, 2/5, 3/5, 4/5 and near the end. Save them as frame_%d.png in the same directory as the video file, where %d is the index of the frame you saved. The exact time points you take are not important, as the project is graded by hand.
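As a rough sketch of how the six time points could be chosen, the helper below spreads them over the duration of the video. The function name seek_targets and the end_margin parameter are assumptions made for illustration; the tick unit is whatever the caller uses (for example the AV_TIME_BASE units in which AVFormatContext::duration is expressed).

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

// Returns six seek targets: start, 1/5, 2/5, 3/5, 4/5 and "near the end".
// `duration` and `end_margin` share the same tick unit.
// `end_margin` (hypothetical) keeps the last target slightly before the end,
// where there is still a frame to decode.
inline std::array<int64_t, 6> seek_targets(int64_t duration, int64_t end_margin) {
    assert(duration >= 0 && end_margin >= 0);
    std::array<int64_t, 6> t{};
    for (int i = 0; i < 5; ++i) t[i] = duration * i / 5;
    // Never place the last target before the 4/5 point, even for very short videos.
    t[5] = std::max(t[4], duration - end_margin);
    return t;
}
```

Each target would then be passed to a seek call such as av_seek_frame, after converting it to the time base that call expects.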

Remark. You may need av_seek_frame or avformat_seek_file. To write a PNG image, use stb_image_writer.h provided in the external directory. On Windows and macOS, dragging and dropping a file onto an executable invokes the executable with the path of the dropped file as its command-line parameter.
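Since the outputs must land next to the input video, the output path has to be derived from argv[1] rather than from the current working directory. A minimal sketch using std::filesystem (C++17) follows; frame_path is a hypothetical helper name.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

// Builds "<directory of the video>/frame_<index>.png".
inline std::filesystem::path frame_path(const std::filesystem::path& video, int index) {
    assert(index >= 0);
    return video.parent_path() / ("frame_" + std::to_string(index) + ".png");
}
```

For example, frame_path(argv[1], 0) gives the path of the first saved frame; when the video path has no directory part, the result is a bare file name relative to the working directory.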

Most image formats expect pixels encoded as RGB values, but videos often encode pixels differently. In this project, you only need to consider the following pixel layouts, as given by the format member of the AVFrame struct:

AV_PIX_FMT_YUV420P, AV_PIX_FMT_YUV444P, AV_PIX_FMT_RGBA

Refer to Section A.3 for how to convert between them. Your program should give a warning and exit cleanly when it encounters an unknown format.

1.2.2 Resize and Compose a sheet

Then, combine the 6 extracted pictures and put them as a 2 × 3 grid onto one sheet (5%). You should scale down each image so that the output does not exceed 2160 × 2160 pixels (5%). Moreover, if the image grid with no scaling is smaller than that limit, do not scale the images up; reduce the size of the output sheet instead (5%).
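The sizing rule can be written as a single scale factor that is capped at 1. The sketch below is one possible reading of the rule; SheetLayout and plan_sheet are hypothetical names, and whether the grid is 3 columns by 2 rows or the reverse is left to the caller as an assumption.

```cpp
#include <algorithm>
#include <cassert>

// Size of one tile and of the whole sheet, in pixels.
struct SheetLayout {
    int tile_w, tile_h, sheet_w, sheet_h;
};

// Picks a uniform scale so that a cols x rows grid of frames fits into
// max_side x max_side, and never enlarges the frames.
inline SheetLayout plan_sheet(int frame_w, int frame_h, int cols, int rows, int max_side) {
    assert(frame_w > 0 && frame_h > 0 && cols > 0 && rows > 0 && max_side > 0);
    const double s = std::min({1.0,  // cap at 1: no upscaling
                               double(max_side) / (double(frame_w) * cols),
                               double(max_side) / (double(frame_h) * rows)});
    const int tw = std::max(1, int(frame_w * s));
    const int th = std::max(1, int(frame_h * s));
    return {tw, th, tw * cols, th * rows};
}
```

With 1920 × 1080 frames, 3 columns, 2 rows and a limit of 2160, the tiles become 720 × 405 and the sheet 2160 × 810; with 320 × 240 frames the tiles keep their size and the sheet shrinks to 960 × 480.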

Requirement. Your input is the same as before. Save the output picture as combined.png in the same directory as the video file.

Remark. You may want to look at the provided stb_image_resize.h, or use more powerful libraries such as https://github.com/dtschump/CImg. You can also write the resizing part yourself. Whichever way you choose, please modify the CMake project structure accordingly and describe it in the report.

Note that if you decide to write the resizing yourself, be sure not to average RGB values directly. If you don’t understand why, read Section A.3 and its references again, and think again.

For this task, we suggest that you do not read back the output of the previous task. Instead, reuse the code from your previous task directly. Try to encapsulate common operations (e.g. decoding 6 frames) into your own library. Refer to the util folder for an example of how to do this.

1.2.3 Add text

(15%) Then, put the timestamp of each extracted frame on the output sheet.

Requirement. Your input is still the same. Save the output picture as contactsheet.png in the same directory as the video file.

Remark. This should be easy if you have done the previous task. You don’t need a fancy transparency effect like the one in Figure 1. All you need to do is hardcode eleven pixel patterns (one for each digit and one for the “:” symbol) and copy them below the border of each frame.
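One way to hardcode the eleven patterns is a tiny bitmap font. The sketch below uses a 3 × 5 pixel glyph per character and assumes a tightly packed RGBA image; the glyph shapes, kGlyphs and draw_text are illustrative choices, not part of the provided code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

constexpr int kGlyphW = 3, kGlyphH = 5;

// Index 0-9 are the digits, index 10 is ':'.
// Each row uses the low 3 bits; the highest of them is the leftmost pixel.
constexpr uint8_t kGlyphs[11][kGlyphH] = {
    {7, 5, 5, 5, 7},  // 0
    {2, 6, 2, 2, 7},  // 1
    {7, 1, 7, 4, 7},  // 2
    {7, 1, 7, 1, 7},  // 3
    {5, 5, 7, 1, 1},  // 4
    {7, 4, 7, 1, 7},  // 5
    {7, 4, 7, 5, 7},  // 6
    {7, 1, 1, 1, 1},  // 7
    {7, 5, 7, 5, 7},  // 8
    {7, 5, 7, 1, 7},  // 9
    {0, 2, 0, 2, 0},  // :
};

// Draws `text` (digits and ':' only) in opaque white into a tightly packed
// RGBA image, with the top-left corner of the first glyph at (x0, y0).
// Pixels that fall outside the image are skipped.
inline void draw_text(std::vector<uint8_t>& rgba, int width, int height,
                      int x0, int y0, const std::string& text) {
    assert(rgba.size() == static_cast<std::size_t>(width) * height * 4);
    for (std::size_t k = 0; k < text.size(); ++k) {
        const char c = text[k];
        int g;
        if (c >= '0' && c <= '9') g = c - '0';
        else if (c == ':') g = 10;
        else continue;  // unsupported character: leave a gap
        for (int row = 0; row < kGlyphH; ++row) {
            for (int col = 0; col < kGlyphW; ++col) {
                if (!((kGlyphs[g][row] >> (kGlyphW - 1 - col)) & 1)) continue;
                const int x = x0 + static_cast<int>(k) * (kGlyphW + 1) + col;
                const int y = y0 + row;
                if (x < 0 || y < 0 || x >= width || y >= height) continue;
                uint8_t* p = &rgba[(static_cast<std::size_t>(y) * width + x) * 4];
                p[0] = p[1] = p[2] = p[3] = 255;
            }
        }
    }
}
```

At sheet resolution a 3 × 5 glyph is very small, so in practice each glyph pixel would be drawn as a block of several image pixels, or a larger pattern would be hardcoded.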


2 Non-functional requirements to your Code

You can choose C or C++ to finish this project. The exact version of the standard does not matter as long as you don’t use compiler-specific extensions or not-yet-standardized C++23 features.

Note that because ffmpeg is a C library, you should wrap its headers in an extern "C" declaration when including them from C++. Refer to avframe_wrapper.cpp for an example.

extern "C" {

#include <libavcodec/avcodec.h>

}

In addition to functionality outlined above, your code will be examined

and graded by the following standards.

2.1 Code style

(5%) Your code must be readable by the TA, which means the following.

• Use a code formatter to keep a consistent style throughout all your files. A .clang-format configuration is provided, and most IDEs already support it.

• There is a lot of repetitive work in these tasks. Do not copy and paste your code all over the place. Instead, encapsulate it in functions or classes.

2.2 Memory Management

(10%) There should be no memory leaks in your program, even when invalid input is provided.

You can use AddressSanitizer to detect memory leaks. You can also try to encapsulate alloc/free functions in a C++ class.
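A common way to wrap an alloc/free pair is std::unique_ptr with a custom deleter. The sketch below uses a made-up C-style API (Resource, resource_alloc, resource_free) so that it compiles on its own; with ffmpeg, a pair such as av_frame_alloc and av_frame_free, which also takes a pointer to the pointer, would take its place.

```cpp
#include <cassert>
#include <memory>

// Hypothetical C-style API standing in for an alloc/free pair of a C library.
struct Resource {
    int value;
};
inline int& live_count() {  // number of resources currently allocated
    static int n = 0;
    return n;
}
inline Resource* resource_alloc() {
    ++live_count();
    return new Resource{0};
}
inline void resource_free(Resource** r) {  // frees and nulls the caller's pointer
    if (r && *r) {
        delete *r;
        *r = nullptr;
        --live_count();
    }
}

// Deleter that adapts the C free function to std::unique_ptr.
struct ResourceDeleter {
    void operator()(Resource* r) const { resource_free(&r); }
};
using ResourcePtr = std::unique_ptr<Resource, ResourceDeleter>;

// Allocates a resource that is released automatically when the owner goes
// out of scope, including on early returns and exceptions.
inline ResourcePtr make_resource() {
    ResourcePtr p(resource_alloc());
    assert(p != nullptr);
    return p;
}
```

With this shape, every error path that returns early still releases what was allocated so far, which is what makes the "even when invalid input is provided" requirement manageable.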

2.3 Performance

(Should the TA fail to reproduce your result in a reasonable time [10x slower than the reference implementation], your points for the respective task will be reduced.)

If you find your code running too slowly, you probably have too much unnecessary memory copying or floating-point calculation. Try to optimize them!

3 Submission and Report

3.1 Submission

All your code should be pushed to the assigned GitHub repository. Whether you meet the deadline is determined by the time of your git push to GitHub.


3.2 Project Report

(20%) In addition to the code, you should also submit a report to learn.tsinghua.edu.cn. It should contain the following information:

• Screenshots of your programs’ output.

• How the TA should compile your project and reproduce your result.

• The CPU and RAM specification of the computer on which you ran the microbenchmark test (e.g. “The benchmark was carried out on a desktop computer with a 4.5 GHz AMD 5900X processor and 64 GB of DDR4 memory at 3200 MT/s”).

• Your microbenchmark result, and your answer to the question given in that section.

• Any interesting bugs you encountered, and how you solved them.

Please do not submit code with your report! Only code committed and

pushed to GitHub will be graded.


Figure 2: Workflow of ffmpeg library, Source: ffmpeg and libav tutorial

A Background Information

A.1 Anatomy of a video file

While most strings are not compressible, as the Computing Theory class told us, videos are definitely among the most compressible ones. Consider a video with a resolution of 1920×1080 at 60 fps, with 3 bytes per pixel. The size of one minute of uncompressed video is

$$3 \times 1920 \times 1080 \times 60 \times 60 \approx 2.2 \times 10^{10} \text{ bytes}$$

But a video file of that length usually takes less than $10^{8}$ bytes. Video encoding algorithms use many tricks to achieve this result. In this project, however, we don’t need to dig into them.

However, such an encoding only provides a single stream of data. In a video file we may want many streams: video (with multiple chapters), audio (in different languages) and subtitles. Video files are therefore containers of these data streams. For a list of container formats, refer to Wikipedia: https://en.wikipedia.org/wiki/Comparison_of_video_container_formats.

A.2 ffmpeg API Overview

The API of ffmpeg is designed around the aforementioned “container-stream” model, as shown in Figure 2.

This section only provides a high-level overview of the functions you may want to look for. The exact usage of the ffmpeg library is left for you to read in the documentation comments of the associated header files.

ffmpeg is divided into several parts; for this project the relevant ones are libavformat, libavutil and libavcodec. The interfaces to these libraries are described in the following C headers:

• libavformat/avformat.h

• libavcodec/avcodec.h

• libavutil/avutil.h

Note that these headers are C headers. If you are writing C++, include them within extern "C"; otherwise linking errors may occur.

The handle to a video file is an AVFormatContext, which must be allocated with avformat_alloc_context and freed with avformat_free_context. The other structs mentioned below often have similar alloc and free functions.

Once allocated, a video file can be opened for decoding with avformat_open_input. Each stream can then be examined through pFormat->streams[i]; the length of that array is given by pFormat->nb_streams.

Each stream gives the appropriate codec information in its codecpar member. To look up a codec in the libavcodec library, use the avcodec_find_decoder(codecpar->codec_id) function. A codec must work within a context, AVCodecContext, which is linked to the codec by the avcodec_open2 function.

When all contexts are set up, the decoding pipeline can be operated manually. Read a packet from a stream into an AVPacket with av_read_frame, and forward it to the codec with avcodec_send_packet. The decoded frame can then be received into an AVFrame with avcodec_receive_frame. Pay attention to the reference-counting mechanism to avoid memory leaks.

A small example program has been provided to give you an idea of what this looks like. The header files carry comments on how to use them, and these comments can often be recognized by your IDE.

A.3 In one pixel: color and gamma

In order to deal with frame data, you have to know how a color picture is encoded.

As we all know, humans have three different kinds of cone cells, sensitive to short-, middle- and long-wavelength light. Their activity strength can be denoted as a vector (a, b, c). It follows intuitively that the length of this vector corresponds to the strength of the incoming light, so the “color” part can be represented by a two-dimensional plane.

Further research showed that this is indeed the case. Figure 3 shows a diagram of “all colors”. On the outer curve, the numbers denote the wavelength of pure light, and the region enclosed by the curve contains the colors of mixtures of light with different wavelengths.

To actually encode a color, we have to choose three colors in the diagram as our basis (called primaries). The sRGB standard, which is mostly used in pictures, and the BT.709 standard, which is mostly used in HD videos, choose the same three colors, as shown in Figure 3. There are some “wide gamut” color spaces that choose a different set of primaries to allow more colors to be encoded, such as Display-P3 and AdobeRGB, but in this project we will never encounter them.


Figure 3: CIE 1931 chromaticity diagram with sRGB color gamut, Source: Wikipedia

With the primaries chosen, every representable color can be denoted as $(r, g, b) \in [0, 1]^3$. However, in most image formats the stored value in the range 0 to 255 is not simply this vector multiplied by 255 and converted to an integer, because:

• Human eyes are more sensitive to differences between dark colors, as shown in Figure 4. To give a consistent perceived lightness difference, we need more values representing darker colors. Hence, a non-linear transform $x^{\gamma}$ with $\gamma < 1$ is needed.

• CRT displays emit light with a power-law relationship to the input voltage level. That is, the emitted light is $L \approx V^{\gamma}$ with $\gamma > 1$, where $V$ is the input voltage.

Figure 4: Perceived Lightness vs. Physical Lightness, Source: Learn

OpenGL

Therefore, a step called “gamma correction” is often needed; the name comes from the symbol γ commonly used for the parameter of this transformation. When an image is saved, you want a non-linear transform so that more precision is given to the darker side, where the human eye is more sensitive. And when an image is displayed, another non-linear transform is needed before output in order to get the correct voltage (see footnote 1).

Moreover, in video signals it is common to split the RGB signal into luma ($Y'$) and chroma ($C_b C_r$) signals, resulting in the YUVxxx formats you will often see.

The BT.709 standard uses the following encoding transformation:

$$E'_e = \begin{cases} 4.500\,e & 0 \le e \le 0.018 \\ 1.099\,e^{0.45} - 0.099 & 0.018 < e \le 1 \end{cases} \qquad (e \text{ is one of } r, g \text{ or } b \in [0, 1])$$

$$E'_{Y'} = 0.2126\,E'_R + 0.7152\,E'_G + 0.0722\,E'_B \qquad D_{Y'} = [219\,E'_{Y'} + 16]$$

$$E'_{Cb} = (E'_B - E'_{Y'})/1.8556 \qquad D_{Cb} = [224\,E'_{Cb} + 128]$$

$$E'_{Cr} = (E'_R - E'_{Y'})/1.5748 \qquad D_{Cr} = [224\,E'_{Cr} + 128]$$

where $(D_{Y'}, D_{Cb}, D_{Cr})$ is what is saved in the video frame, and $[\cdot]$ denotes rounding to the nearest integer.
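To turn stored video samples back into RGB, these equations have to be inverted. The sketch below does this step by step, under the assumption that the input uses exactly the BT.709 encoding given here; RGBf, bt709_to_rgb_prime and bt709_to_linear are hypothetical names, and out-of-range results are simply clamped.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Gamma-encoded (non-linear) R'G'B' components, each in [0, 1].
struct RGBf {
    double r, g, b;
};

// Inverts the quantization and the luma/chroma split of the equations above.
inline RGBf bt709_to_rgb_prime(uint8_t dy, uint8_t dcb, uint8_t dcr) {
    const double ey = (dy - 16.0) / 219.0;     // E'_Y'
    const double ecb = (dcb - 128.0) / 224.0;  // E'_Cb
    const double ecr = (dcr - 128.0) / 224.0;  // E'_Cr
    const double er = ey + 1.5748 * ecr;       // from the E'_Cr equation
    const double eb = ey + 1.8556 * ecb;       // from the E'_Cb equation
    const double eg = (ey - 0.2126 * er - 0.0722 * eb) / 0.7152;  // from E'_Y'
    auto clamp01 = [](double v) { return std::min(1.0, std::max(0.0, v)); };
    return {clamp01(er), clamp01(eg), clamp01(eb)};
}

// Inverts the BT.709 transfer curve: gamma-encoded E' -> linear light e.
inline double bt709_to_linear(double e_prime) {
    assert(e_prime >= 0.0 && e_prime <= 1.0);
    const double knee = 4.5 * 0.018;  // encoded value at the end of the linear segment
    return e_prime <= knee ? e_prime / 4.5
                           : std::pow((e_prime + 0.099) / 1.099, 1.0 / 0.45);
}
```

For instance, the sample (16, 128, 128) maps to black and (235, 128, 128) to white. Writing a PNG additionally requires re-encoding the linear values with the sRGB curve and scaling them to 0–255.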

The sRGB standard is used for most images today. It saves RGB information with the following transformation:

$$E'_e = \begin{cases} 12.92\,e & 0 \le e \le 0.00304 \\ 1.055\,e^{1/2.4} - 0.055 & 0.00304 < e \le 1 \end{cases}$$

$$D_e = [255\,E'_e]$$

where $(D_r, D_g, D_b)$ is the familiar RGB value between 0 and 255.

Note that $E'_{\text{BT.709}}(e) \approx e^{1/1.92}$ and $E'_{\text{sRGB}}(e) \approx e^{1/2.2}$, which can be used to simplify the calculation a bit. You can also pre-calculate a lookup table.

This is why you cannot average pixel values directly: the average of RGB codes has nothing to do with the average of the actual light you would see when an image is scaled down.
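The difference can be made concrete with a small sketch that decodes sRGB codes to linear light through a lookup table, averages there, and encodes the result again. It uses the piecewise sRGB curve as given in this section; srgb_decode, srgb_encode, srgb_lut and average_srgb are hypothetical names.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

// sRGB code (0..255) -> linear light in [0, 1].
inline double srgb_decode(uint8_t code) {
    const double v = code / 255.0;
    const double knee = 12.92 * 0.00304;  // encoded value at the end of the linear segment
    return v <= knee ? v / 12.92 : std::pow((v + 0.055) / 1.055, 2.4);
}

// Linear light -> sRGB code (0..255); the input is clamped to [0, 1].
inline uint8_t srgb_encode(double e) {
    assert(!std::isnan(e));
    e = std::min(1.0, std::max(0.0, e));
    const double v = e <= 0.00304 ? 12.92 * e : 1.055 * std::pow(e, 1.0 / 2.4) - 0.055;
    return static_cast<uint8_t>(std::lround(255.0 * v));
}

// 256-entry table, built once, so that decoding costs one lookup per sample.
inline const std::array<double, 256>& srgb_lut() {
    static const std::array<double, 256> lut = [] {
        std::array<double, 256> t{};
        for (int i = 0; i < 256; ++i) t[i] = srgb_decode(static_cast<uint8_t>(i));
        return t;
    }();
    return lut;
}

// Average of two sRGB codes, computed in linear light.
inline uint8_t average_srgb(uint8_t a, uint8_t b) {
    const auto& lut = srgb_lut();
    return srgb_encode(0.5 * (lut[a] + lut[b]));
}
```

Averaging the codes 0 and 255 directly gives 127, whereas averaging in linear light gives a code of about 188, a visibly brighter gray. A real downscaler would accumulate the linear values of all source pixels that cover one output pixel before encoding once.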

Incorrect understanding of RGB code in images has caused lots of confusion.

In fact, there is even a CVPR paper criticizing incorrect interpretation

of images. See R. M. H. Nguyen and M. S. Brown, “Why You Should Forget

Luminance Conversion and Do Something Better,” CVPR 2017.

A.4 Pixels in one frame: planar and packed format

Another important point about image formats is that pixels can be “packed” or “planar”, as illustrated in the following pseudo-code, annotated with the corresponding format names in ffmpeg:

typedef struct rgba_pixel {
    uint8_t r; uint8_t g; uint8_t b; uint8_t a;
} rgba_pixel_t;

rgba_pixel_t packed_image[linesize*HEIGHT]; // AV_PIX_FMT_RGBA

struct {
    uint8_t y_plane[linesize*HEIGHT];
    uint8_t u_plane[(linesize/2)*(HEIGHT/2)]; // Cb is also denoted as U
    uint8_t v_plane[(linesize/2)*(HEIGHT/2)]; // Cr is also denoted as V
} planar_image; // AV_PIX_FMT_YUV420P

Figure 5: Memory layout of a frame in YUV420P pixel format and illustration of chroma subsampling. YUV420 means 4:2:0, YUV444 means 4:4:4 and so on. “Byte stream” means layout in a linear array. Source: Wikipedia

Footnote 1: While modern LCD displays are digital and no longer have this voltage-to-light relationship, all the existing video output equipment means that LCD displays include a conversion circuit to emulate(!) this behaviour.

You may have noticed that in the YUV420P example, the U and V planes are smaller than the image. This is because of a compression trick called chroma subsampling.

Because human eyes are more sensitive to changes in brightness (luma) than in color (chroma), you can use a lower resolution to encode the chroma planes and let several pixels share the same chroma information.

For example, the layout of YUV420P is shown in the code above and in Figure 5. While each pixel has a separate luma ($Y'$) component, the four pixels of a 2 × 2 square share one pair of chroma ($C_b C_r$) components, as if the chroma were the same for all of them.
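In terms of indexing, the sharing means that the chroma planes are addressed with halved coordinates. A minimal sketch follows, with a hypothetical Yuv420pView struct that mirrors the per-plane data pointers and line sizes found in an AVFrame.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical read-only view over a YUV420P frame.
struct Yuv420pView {
    const uint8_t* y;
    const uint8_t* u;
    const uint8_t* v;
    int linesize_y, linesize_u, linesize_v;  // bytes per row; may exceed the visible width
    int width, height;                       // size of the luma plane in pixels
};

struct Yuv {
    uint8_t y, cb, cr;
};

// Fetches the three components of the pixel at column px, row py.
inline Yuv sample(const Yuv420pView& f, int px, int py) {
    assert(px >= 0 && px < f.width && py >= 0 && py < f.height);
    return {
        f.y[py * f.linesize_y + px],              // one luma sample per pixel
        f.u[(py / 2) * f.linesize_u + (px / 2)],  // one Cb sample per 2x2 block
        f.v[(py / 2) * f.linesize_v + (px / 2)],  // one Cr sample per 2x2 block
    };
}
```

For a YUV444P frame the same function would work without the divisions by two, since every pixel then has its own chroma samples.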
