How does DEQ use less memory compared to an explicit network? #33

Hi, I have a question about the memory footprint of DEQ.
As far as I understand, DEQ does not need to store intermediate activations, and is thus able to approximate an infinite-layer model at the memory cost of only one layer. (So training with NFE=30 iterations should cost about as much memory as a single iteration.)
However, in Table 3 of the first DEQ paper, the explicit Transformer-XL with 16 layers consumes much more VRAM than DEQ-Transformer (medium).
The two seem to have nearly the same architecture and nearly the same number of parameters. In this setting, as far as I understand, DEQ should perform better because it is effectively modeling a much deeper network than its explicit counterpart while consuming the same memory. Why does DEQ consume less VRAM? Shouldn't it be the same?
(I also found that the forward function of the DEQ transformer contains one regular explicit forward pass that tracks gradients:
https://github.com/locuslab/deq/blob/1fb7059d6d89bb26d16da80ab9489dcc73fc5472/DEQ-Sequence/models/deq_transformer.py#L367)
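
To make the memory argument above concrete, here is a minimal sketch in plain Python of how I picture the bookkeeping. It is a toy, not the repository's code: `layer`, `explicit_forward`, `deq_forward`, and the scalar weight `w` are all hypothetical names, the "layer" is a scalar `tanh` map standing in for a Transformer block, and the explicit network reuses the same layer at every depth purely for simplicity. The `saved` list stands in for the activations autograd would keep for the backward pass.

```python
import math


def layer(z, x, w=0.5):
    """Toy layer f(z, x) = tanh(w * z + x); a hypothetical stand-in for one block."""
    return math.tanh(w * z + x)


def explicit_forward(x, depth):
    """Explicit network: the input of every layer is saved for backprop."""
    saved = []
    z = 0.0
    for _ in range(depth):
        saved.append(z)  # activation kept for the backward pass
        z = layer(z, x)
    return z, saved


def deq_forward(x, nfe):
    """DEQ-style forward: iterate toward a fixed point without saving anything,
    then apply one tracked layer call at the (approximate) fixed point."""
    z = 0.0
    for _ in range(nfe):  # analogous to running the solver with gradients disabled
        z = layer(z, x)
    saved = [z]  # only the input of the final, tracked call is kept
    z = layer(z, x)
    return z, saved


x = 0.3
z_explicit, saved_explicit = explicit_forward(x, depth=16)
z_deq, saved_deq = deq_forward(x, nfe=30)

# Number of activations each variant would have to hold for the backward pass.
print(len(saved_explicit), len(saved_deq))
```

In this toy, the explicit pass saves one activation per layer, while the DEQ-style pass saves a single activation no matter how large `nfe` is. If the same bookkeeping carries over to the real models, a 16-layer explicit network would hold roughly 16 layers of activations against one for DEQ, which may be where the Table 3 gap comes from.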



