Conversation

fengyuentau
Member

@fengyuentau fengyuentau commented Dec 28, 2022

Merge with opencv/opencv_extra#1032

  • add layer norm onnx parser
  • add layer norm impl
  • add layer norm onnx simplifier for both cases of constants being Constant and Initializer
  • add test model generation code for layer_norm_expanded and layer_norm_expanded_initializer

Benchmark:

Layer                | Mean (ms) | Median (ms) | Min (ms)
layer norm expanded  | 0.43      | 0.42        | 0.40
layer norm (this PR) | 0.02      | 0.02        | 0.01

*: tested with size 1x50x768 on Apple M1.
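
For context, LayerNormalization computes y = (x - mean) / sqrt(variance + epsilon) * scale + bias over the normalized axes. Below is a minimal per-row sketch of that computation (not this PR's actual implementation) which folds in the tweaks mentioned later in the commit log: the mean/mean_square divisions hoisted out of the loop, std::max before std::sqrt, and a null check on the optional bias pointer.

#include <algorithm>
#include <cmath>

// Normalize one row of `normSize` elements:
// y = (x - mean) / sqrt(variance + epsilon) * scale + bias
static void layerNormRow(const float* x, const float* scale, const float* bias,
                         float* y, int normSize, float epsilon)
{
    float mean = 0.f, meanSquare = 0.f;
    for (int i = 0; i < normSize; i++)
    {
        mean += x[i];
        meanSquare += x[i] * x[i];
    }
    mean /= normSize;       // divide once, outside the loop
    meanSquare /= normSize;
    // std::max guards against a tiny negative variance from floating-point error
    float variance = std::max(0.f, meanSquare - mean * mean);
    float invStdDev = 1.f / std::sqrt(variance + epsilon);
    for (int i = 0; i < normSize; i++)
        y[i] = (x[i] - mean) * invStdDev * scale[i] + (bias ? bias[i] : 0.f);
}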

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There are accuracy tests, performance tests and test data in the opencv_extra repository, if applicable.
    The patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake
force_builders=Linux OpenCL

@fengyuentau
Member Author

@rogday Could you review this pull request if possible?

Member

@rogday rogday left a comment


Thank you for the contribution! LGTM 👍

Member

@alalek alalek left a comment


Left some optimization and minor comments.

std::vector<Mat> inputs, outputs;
inputs_arr.getMatVector(inputs);
outputs_arr.getMatVector(outputs);
const int nstripes = getNumThreads();
Member


const int nstripes = getNumThreads();

This doesn't look like a reliable design.
This scheme assumes that all threads have the same speed and are never interrupted.
That is not true for OSes with preemptive execution (all widely used OSes of the last 25+ years): some cores may be handling background tasks or interrupts.

It is also not true at all for CPUs with a big+little design (modern ARM, Intel CPUs with P+E cores).

nstripes should be based on the subtask's "grain size" (subtask time >>> scheduling overhead) instead of the number of available threads.

Some information is available here: https://oneapi-src.github.io/oneTBB/main/tbb_userguide/Controlling_Chunking_os.html
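
To illustrate the suggestion, here is a minimal sketch that sizes nstripes from the total work divided by a target grain size when calling cv::parallel_for_, so each stripe costs roughly the same and faster cores simply pick up more stripes; processRow and the 1024-element grain are illustrative placeholders, not code from this PR.

#include <opencv2/core.hpp>

// Illustrative placeholder for the per-row work.
static void processRow(int row) { /* ... */ }

void runRows(int totalRows, int workPerRow)
{
    // Empirical grain size: a stripe's cost should dwarf the scheduling overhead.
    const double grain = 1024.0;
    // Stripe count depends on the amount of work, not on the number of threads.
    double nstripes = (double)totalRows * workPerRow / grain;
    cv::parallel_for_(cv::Range(0, totalRows), [&](const cv::Range& r)
    {
        for (int i = r.start; i < r.end; i++)
            processRow(i);
    }, nstripes);
}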

Member Author

@fengyuentau fengyuentau Jan 17, 2023


It is also not true at all for CPUs with a big+little design (modern ARM, Intel CPUs with P+E cores).

And this is why I happened to fix the OpenMP issue on macOS. I was benchmarking vision transformers on ONNX Runtime, MNN and OpenCV DNN the other day, and found that both ONNX Runtime and MNN run with 4 threads on my Apple M1 by default, while OpenCV DNN uses all 8 threads instead. It is not possible to set numThreads with GCD, and when I tried OpenMP I ran into build issues.

I think we should somehow improve OpenCV's multi-threading functionality so that it detects big+little designs and returns the numThreads of the big cores if possible. cv::Range would also need to support a step in order to work with a "grain size"...

Member


I think we should somehow improve OpenCV's multi-threading functionality so that it detects big+little designs and returns the numThreads of the big cores if possible

It doesn't really look like an improvement.

E.g., among ARM-based phones we already have configurations with 2 big + 6 little cores.

Again, we should not assume or rely on uniform thread performance or equal subtask complexity.

Member Author


I am not sure exactly what you are asking for here. Basically every other **Invoker in the other layers uses the same strategy. If we want to take care of big+little core CPUs, I think that should be a separate pull request.

@fengyuentau fengyuentau requested a review from alalek January 20, 2023 03:18
Member

@alalek alalek left a comment


Thank you for the update!


…square division outside of loop; use std::max to ensure positive value before std::sqrt
@fengyuentau fengyuentau requested a review from alalek January 20, 2023 12:28
Member

@alalek alalek left a comment


Refactored references and parallel_for usage.


TEST_P(Test_ONNX_layers, LayerNorm)
{
testONNXModels("test_layer_normalization_2d_axis0", pb, 0, 0, false, true, 3);
Member


There are many error messages during the model import:

[ RUN      ] Test_ONNX_layers.LayerNorm/0, where GetParam() = OCV/CPU
[ INFO:…] global onnx_importer.cpp:831 populateNet DNN/ONNX: loading ONNX v8 model produced by 'backend-test'. Number of nodes = 1, initializers = 0, inputs = 3, outputs = 3
[ INFO:…] global onnx_importer.cpp:725 parseOperatorSet DNN/ONNX: ONNX opset version = 17
[ INFO:…] global onnx_importer.cpp:997 handleNode DNN/ONNX: processing node with 3 inputs and 3 outputs: [LayerNormalization]:(onnx_node_output_0!Y) from domain='ai.onnx'
[ERROR:…] global onnx_importer.cpp:924 populateNet DNN/ONNX: can't find layer for output name: 'Mean'. Does model imported properly?
[ERROR:…] global onnx_importer.cpp:924 populateNet DNN/ONNX: can't find layer for output name: 'InvStdDev'. Does model imported properly?

We should not have them.

Member Author

@fengyuentau fengyuentau Jan 21, 2023


The reason why they exist:

  1. These ONNX models are taken from the ONNX conformance tests, so I think it is better not to modify them.
  2. I took a look at the opencv-onnx functionality and did not find a way to remove them completely in our ONNX importer:
    // Remove additional outputs (Mean, InvStdDev)
    if (node_proto.output_size() > 1)
    {
        auto outputName = node_proto.output(0);
        opencv_onnx::NodeProto node_proto_ = node_proto;
        node_proto_.clear_output();
        node_proto_.add_output(outputName);
        addLayer(layerParams, node_proto_);
    }

@rogday Do you happen to know how to remove an optional (node & graph) output in the ONNX importer?

Member Author


I removed the optional outputs from those ONNX models in the end. It turned out that modifying the outputs of an ONNX graph proto is not straightforward.
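
For reference, dropping extra graph outputs via the protobuf C++ API comes down to something like the sketch below, assuming a mutable opencv_onnx::GraphProto named graph (illustrative, not code from this PR). The catch is that the node's own output list still names Mean and InvStdDev, which is part of why editing the graph proto alone is not straightforward.

// Keep only the first graph output (Y). DeleteSubrange removes a
// contiguous range of entries from a protobuf repeated field.
if (graph.output_size() > 1)
    graph.mutable_output()->DeleteSubrange(1, graph.output_size() - 1);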

CV_CheckTypeEQ(src.type(), dst.type(), "");
CV_Assert(scale.isContinuous());

CV_CheckGE(epsilon, 0.0f, "");
Member


Added this extra check

Comment on lines 107 to 108
double nstripes = ((size_t)p.total * p.normSize) * (1 / 1024.0);
parallel_for_(Range(0, p.total), p, nstripes);
Member Author

@fengyuentau fengyuentau Jan 21, 2023


Thanks for the change! I learned a lot!

So if I understand correctly, you make the grain size small enough that both big and little cores can be used, and the big cores naturally take more jobs. What about the "magic number" 1024? Was it taken from the "bathtub curve" in that link?

Member


The parallel_for() strategy should rely on the subtask size and the scheduling overhead. 1024 is an empirical number here that specifies the size of a subtask.
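
As a worked example: for the 1x50x768 benchmark shape from the description, and assuming normalization over the last axis, p.total = 50 and p.normSize = 768, so nstripes = 50 * 768 / 1024 = 37.5; each stripe then covers roughly 1024 element operations no matter how many cores are available.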

Member Author


Okay, thanks again; I benefited a lot from this. Another question: why multiply by p.normSize? Is that also empirical? I tried without it and it was about twice as slow.

By the way, I found that all the other **Invokers use the same strategy of assuming all threads have the same performance. I think they need to be upgraded as well.

@fengyuentau fengyuentau requested a review from alalek January 24, 2023 03:22
@fengyuentau
Member Author

@alalek what is the status of this pull request? Could we make a step forward?

@alalek alalek merged commit 4d918ba into opencv:4.x Jan 27, 2023
@alalek alalek mentioned this pull request Jan 28, 2023
@fengyuentau fengyuentau mentioned this pull request Mar 14, 2023
a-sajjad72 pushed a commit to a-sajjad72/opencv that referenced this pull request Mar 30, 2023
dnn: add layer normalization for vision transformers

* add layer norm onnx parser, impl and tests

* add onnx graph simplifier for layer norm expanded

* handle the case when constants are of type Initializer

* add test case for layer norm expanded with initializers

* use CV_Assert & CV_CheckType in place of CV_Assert_N; use forward_fallback for OCL_FP16

* use const ref / ref in parameters of invoker::run; extract inner const if from nested loop; use size_t in place of ull

* template hasBias

* remove trailing whitespace

* use pointer parameter with null check; move normSize division & mean_square division outside of loop; use std::max to ensure positive value before std::sqrt

* refactor implementation, optimize parallel_for

* disable layer norm expanded

* remove the removal of layer norm optional outputs
@asmorkalov asmorkalov mentioned this pull request May 31, 2023
geversonsto pushed a commit to stodev-com-br/opencv that referenced this pull request Jun 3, 2023
@dkurt dkurt mentioned this pull request Aug 4, 2023
@fengyuentau fengyuentau deleted the layer_norm branch October 19, 2023 07:34