connorgoggins · 2020-02-18T23:19:07Z

Description

The col2im op previously crashed on large tensor data (dimension >= 2^32). With the following input:

run_performance_test(nd.col2im, run_backward=True, inputs=[{'data': (1,2**30,4), 'output_size': (2,2,1), 'kernel': (1,1,1)}], warmup=1, runs=1)

the following error was thrown:

Segmentation fault: 11

To find the root cause, I ran the previous command in a Python script under GDB and found that the underlying problem was the dtype of the image index variable (index_im) in the col2im op's kernel in im2col.h. This variable was declared as int, which cannot represent indices this large; it should have been index_t so that long int indices are handled properly. I switched the variable to index_t in the kernel and, after rebuilding, the previous command produced the correct output:

INFO:root:Begin Benchmark - col2im
INFO:root:Complete Benchmark - col2im
[{'col2im': [{'inputs': {'data': (1, 1073741824, 4), 'output_size': (2, 2, 1), 'kernel': (1, 1, 1)}, 'max_storage_mem_alloc_cpu/0': 33285996.0, 'avg_time_forward_col2im': 1290522.25, 'avg_time_backward_col2im': 1291033.5}]}]
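
The index overflow described above can be illustrated with a small sketch. It does not touch MXNet; it only uses `ctypes` to show what happens when the largest element index for the failing input is stored in a 32-bit signed integer versus a 64-bit one. The helper names `as_int32` and `as_int64` are hypothetical, and the sketch assumes that `int` is 32 bits wide on the test machines and that `index_t` is a 64-bit type in a large tensor build.

```python
import ctypes


def as_int32(value):
    """Value as stored in a 32-bit signed integer (ctypes wraps without checking)."""
    return ctypes.c_int32(value).value


def as_int64(value):
    """Value as stored in a 64-bit signed integer."""
    return ctypes.c_int64(value).value


# Input shape from the failing command in the description.
data_shape = (1, 2**30, 4)

num_elements = 1
for dim in data_shape:
    num_elements *= dim

last_index = num_elements - 1

print("elements:          ", num_elements)
print("last index (int32):", as_int32(last_index))  # no longer a valid index
print("last index (int64):", as_int64(last_index))  # preserved
```

In C++, signed overflow is undefined behavior rather than a guaranteed wraparound, so the exact corrupted value may differ, but an out-of-range index of this kind is consistent with the segmentation fault reported above.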

Note: I also confirmed that, with my revisions, the op works with a large tensor (>= 2^32) output size, as the following command passes without errors:

run_performance_test(nd.col2im, run_backward=True, inputs=[{'data': (1,2**32,1), 'output_size': (1,2**32), 'kernel': (1,2**32)}], warmup=1, runs=1)
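
As a sanity check on the two inputs above, the following sketch computes the output shape that col2im would be expected to produce and confirms that both outputs contain 2^32 elements. It is plain Python, not MXNet code: `col2im_output_shape` is a hypothetical helper, and it assumes the usual col2im convention (data of shape `(N, C * prod(kernel), L)` with `L` equal to the number of sliding blocks) with stride 1, dilation 1, and no padding unless given.

```python
from math import prod


def col2im_output_shape(data_shape, output_size, kernel,
                        stride=None, dilate=None, pad=None):
    """Expected col2im output shape (N, C, *output_size) under the assumed convention."""
    n, channels_times_kernel, num_blocks = data_shape
    ndim = len(output_size)
    stride = stride or (1,) * ndim
    dilate = dilate or (1,) * ndim
    pad = pad or (0,) * ndim

    kernel_size = prod(kernel)
    if channels_times_kernel % kernel_size != 0:
        raise ValueError("data.shape[1] must be divisible by prod(kernel)")
    channels = channels_times_kernel // kernel_size

    # Number of sliding-block positions along each spatial dimension.
    expected_blocks = prod(
        (o + 2 * p - d * (k - 1) - 1) // s + 1
        for o, k, s, d, p in zip(output_size, kernel, stride, dilate, pad)
    )
    if expected_blocks != num_blocks:
        raise ValueError("data.shape[2] does not match the number of blocks")

    return (n, channels, *output_size)


# First input from the description: large channel dimension.
shape_a = col2im_output_shape((1, 2**30, 4), (2, 2, 1), (1, 1, 1))
# Second input from the description: large output size.
shape_b = col2im_output_shape((1, 2**32, 1), (1, 2**32), (1, 2**32))

print(shape_a, prod(shape_a))
print(shape_b, prod(shape_b))
```

Under these assumptions both cases require indexing into an output of 2^32 elements, which is the regime where a 32-bit index is no longer sufficient.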

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage
Code is well-documented
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

M src/operator/nn/im2col.h
M tests/nightly/test_large_array.py

Comments

Tested on r5dn.24xl-ubuntu 16.04 and p2.16xl-ubuntu 16.04 with:

* Individual op run
* Full OpPerf run

Results

The key difference between the CPU and GPU tests was the instance type (r5dn.24xl for CPU, p2.16xl for GPU). All relevant build flags remained the same, and both were tested using CPU context.

Single operator test - col2im op (GPU)
Single operator test - col2im op (CPU)

Full OpPerf test (GPU)
Full OpPerf test (CPU)

@apeforest @access2rohit

connorgoggins · 2020-02-18T23:21:28Z

@mxnet-label-bot add [pr-awaiting-review]

apeforest

LGTM

apeforest · 2020-02-19T01:11:39Z

@connorgoggins Could you add a test to our nightly if it is not already there? Thanks!

ChaiBapchya

LGTM but do add the test to tests/nightly/test_large_array.py

* Changed dtype for index_im
* Added nightly test for col2im

lanking520 added the pr-awaiting-review (PR is waiting for code review) label Feb 18, 2020

apeforest approved these changes Feb 19, 2020

ChaiBapchya approved these changes Feb 19, 2020

connorgoggins force-pushed the fix_col2im_large_tensor branch from 63d0937 to d24f284 on February 20, 2020 21:13

connorgoggins added 2 commits February 24, 2020 09:48

Changed dtype for index_im

aa459b4

Added nightly test for col2im

d7c99b6

connorgoggins force-pushed the fix_col2im_large_tensor branch from d24f284 to d7c99b6 on February 24, 2020 17:48

apeforest merged commit f9b2a63 into apache:master Feb 24, 2020

anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request May 29, 2020

[Large Tensor] Fixed col2im op (apache#17622)

e815339

* Changed dtype for index_im
* Added nightly test for col2im

[Large Tensor] Fixed col2im op #17622

apeforest merged 2 commits into apache:master from connorgoggins:fix_col2im_large_tensor