Hi,
The paper says that the graph changes quickly, which makes it difficult for the Q-network to converge. To ease this learning difficulty, the authors keep C unchanged in two successive timesteps when computing the Q-loss during training.
Does anyone know where exactly in the code this is done?
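To make clear what I am looking for, here is a rough sketch of how I understand the trick. It assumes that C is the adjacency (neighbor) matrix, and every name in it (`toy_q`, `td_target`, `q_loss`, `adj_t`) is made up for illustration and not taken from the repository. The toy Q-function is just one round of neighbor aggregation followed by a linear layer, not the real network.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def toy_q(obs, adj, weight):
    """Hypothetical stand-in for the Q-network.

    obs:    n_agents x obs_dim
    adj:    n_agents x n_agents (what I assume the paper calls C)
    weight: obs_dim x n_actions
    """
    return matmul(matmul(adj, obs), weight)


def td_target(reward, next_obs, adj_t, target_weight, gamma=0.99):
    # The next-state Q-values are computed with adj_t, the adjacency from
    # timestep t, instead of the adjacency actually observed at t+1.
    next_q = toy_q(next_obs, adj_t, target_weight)
    return [r + gamma * max(q_row) for r, q_row in zip(reward, next_q)]


def q_loss(obs, action, reward, next_obs, adj_t, weight, target_weight, gamma=0.99):
    # The same adj_t is used for both Q(s_t, a_t) and the target.
    q = toy_q(obs, adj_t, weight)
    target = td_target(reward, next_obs, adj_t, target_weight, gamma)
    errors = [(q[i][a] - y) ** 2 for i, (a, y) in enumerate(zip(action, target))]
    return sum(errors) / len(errors)
```

So in the real code I would expect the target computation to be fed the adjacency from the current step rather than the stored next-step adjacency, but I cannot find the place where that happens.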