RESEARCH ARTICLE | NOVEMBER 01 2024
Optimization of fluid control laws through deep
reinforcement learning using dynamic mode decomposition
as the environment
T. Sakamoto; K. Okabayashi
AIP Advances 14, 115204 (2024)
https://doi.org/10.1063/5.0237682
View Export
Online Citation
Articles You May Be Interested In
Data-driven identification of dynamical models using adaptive parameter sets
Chaos (February 2022)
Dynamic mode decomposition and reconstruction of the transient pump-jet propulsor wake
Physics of Fluids (January 2025)
Numerical study of flow past two square cylinders with horizontal detached control rod through passive
19 February 2025 16:42:51
control method
AIP Advances (June 2024)
AIP Advances ARTICLE pubs.aip.org/aip/adv
Optimization of fluid control laws through deep
reinforcement learning using dynamic mode
decomposition as the environment
Cite as: AIP Advances 14, 115204 (2024); doi: 10.1063/5.0237682
Submitted: 6 September 2024 • Accepted: 14 October 2024 •
Published Online: 1 November 2024
T. Sakamoto and K. Okabayashia)
AFFILIATIONS
Department of Mechanical Engineering, Osaka University, 2-1 Yamadaoka, Suita, Osaka, Japan
a)
Author to whom correspondence should be addressed: [email protected]
ABSTRACT
The optimization of fluid control laws through deep reinforcement learning (DRL) presents a challenge owing to the considerable compu-
tational costs associated with trial-and-error processes. In this study, we examine the feasibility of deriving an effective control law using a
reduced-order model constructed by dynamic mode decomposition with control (DMDc). DMDc is a method of modal analysis of a flow
19 February 2025 16:42:51
field that incorporates external inputs, and we utilize it to represent the time development of flow in the DRL environment. We also examine
the amount of computation time saved by this method. We adopt the optimization problem of the control law for managing lift fluctuations
caused by the Kármán vortex shedding in the flow around a cylinder. The deep deterministic policy gradient is used as the DRL algorithm.
The external input for the DMDc model consists of a superposition of the chirp signal, containing various amplitudes and frequencies, and
random noise. This combination is used to express random actions during the exploration phase. With DRL in a DMDc environment, a
control law that exceeds the performance of conventional mathematical control is derived, although the learning is unstable (not converged).
This lack of convergence is also observed with DRL in a computational fluid dynamics (CFD) environment. However, when the number
of learning epochs is the same, a superior control law is obtained with DRL in a DMDc environment. This outcome could be attributed to
the DMDc representation of the flow field, which tends to smooth out high-frequency fluctuations even when subjected to signals of larger
amplitude. In addition, using DMDc results in a computation time savings of up to a factor of 3 compared to using CFD.
© 2024 Author(s). All article content, except where otherwise noted, is licensed under a Creative Commons Attribution-NonCommercial-
NoDerivs 4.0 International (CC BY-NC-ND) license (https://creativecommons.org/licenses/by-nc-nd/4.0/). https://doi.org/10.1063/5.0237682
I. INTRODUCTION environment. As a result, the state of the environment changes, and
the agent receives rewards R according to the new state st+1 . By
Since the invention of deep neural networks by Hinton and repeating this cycle, the agent learns a policy for maximizing the sum
Salakhutdinov,1 the performance of machine learning has improved of rewards obtained in the future.
dramatically, and machine learning is rapidly becoming widely used In the field of fluid dynamics, DRL is anticipated to offer a new
in general society. Supervised and unsupervised learning are com- approach, particularly for tasks such as the geometry optimization of
monly used for modeling and analysis of fluid flow phenomena fluid machinery and optimization of fluid control laws. This expec-
(these are attributed to regression problems), classification tasks, tation is heightened by the current emphasis on reducing energy
and nonlinear reduced-order modeling, among others. Conversely, consumption, which has created a demand for innovative solutions.
deep reinforcement learning (DRL) enables the derivation of opti- NNs of DRL are expected to have the potential for generalization and
mal measures through trial and error. Its pioneering algorithm2 was transfer learning to other conditions than the ones under which they
published in 2013. were trained. Therefore, even if specifications or conditions change,
DRL is performed by repeating trials (Fig. 1). First, the agent, there is no need to re-compute from scratch, as is the case with
that is, the neural network (NN), observes the state st of the envi- other conventional optimization methods. Consequently, it is worth
ronment. Then, the agent selects an action a, and acts on the developing a DRL-based optimization method.
AIP Advances 14, 115204 (2024); doi: 10.1063/5.0237682 14, 115204-1
© Author(s) 2024
AIP Advances ARTICLE pubs.aip.org/aip/adv
DRL than the shape optimization problem, which determines a cer-
tain point in the parameter space. Examples of active flow control by
DRL include the following studies: Verma et al.13 used high-fidelity
3D-CFD as the environment of DRL to demonstrate the inference
that schooling fish have an energetic advantage in swimming in for-
mation by utilizing the energy stored in the vortices. From 2018 to
2022, a series of attempts were made to obtain the control law of
the Kármán vortex in a cylindrical wake by DRL using 2D-CFD as
the environment.14–16 Koizumi et al.14 and Rabault et al.15 employed
FIG. 1. One cycle of deep reinforcement learning. Re = 100 flow as the analysis objects, aiming to minimize lift fluc-
tuation and drag, respectively. In both cases, blowing and suction
jets are given symmetrical locations of the cylinder. Tang et al.16
trained the control law of the blowing and suction from four loca-
tions on the two-dimensional cylinder for Re = 100, 200, 300, and
Examples of shape optimization include the following stud- 400 flows. They reported that drag was reduced at each Reynolds
ies: Yonekura and Hattori3 trained NNs that output the angle of number. Furthermore, the trained control law can be applied for
attack that maximizes the lift–drag ratio for each airfoil profile when Re = 60 to 400, that is, interpolation between 100 and 400 or extrap-
a contour image of the pressure coefficient of a NACA 4-digit or olation to the lower Reynolds number side was possible. Varela
5-digit airfoil is input and showed the possibility of applying the NNs et al.17 showed that the drag reduction with DRL is also effective
to other airfoil profiles and Reynolds numbers. The same research for higher-Reynolds-number flows of Re = 1000 and 2000, using
group4 improved the shape of turbine blades by DRL aiming at the same setup as Rabault et al.15 Furthermore, they tried cross-
smooth Mach number distributions. Yan et al.5 introduced a method application of agents: they reported that agents trained at Re = 1000
to enhance the shape of a missile to maximize the lift–drag ratio by are also effective for drag reduction at Re = 2000, which is within
preliminarily training the NNs using DATCOM,6 a semi-empirical the same Re regime as Re = 1000. This study showed the feasibility
tool used for preliminary aerodynamic design of flying objects as of transfer learning of DRL. Zhu et al.18 used DRL to determine the
the environment and then further training the NNs using compu- best stroke sequence for a finite-sized swimming predator chasing
tational fluid dynamics (CFD) as the environment. They reported prey at low Reynolds number flow. They reported that the time-
19 February 2025 16:42:51
that a higher lift–drag ratio was obtained than other optimization optimal and efficiency-optimal predation were essentially different.
algorithms. Qin et al.7 performed multi-objective optimization of Vignon et al.19 controlled the temperature perturbation on the lower
the compressor cascade blade profile using a surrogate model cre- hot wall of the two-dimensional Rayleigh–Bénard convection (RBC)
ated from CFD data as the DRL environment, achieving a 3.59% system to minimize the Nusselt number of the entire domain. They
reduction in total pressure loss and a 25.4% increase in the laminar showed that effective RBC control can be achieved by utilizing multi-
flow area on the suction side compared to the baseline geometry. agent reinforcement learning (MARL), in which the computational
Viquerat et al.8 improved the airfoil profile using DRL with two- domain is divided and an agent is assigned to each of them. They
dimensional CFD in the environment to maximize the lift–drag also compared MARL with conventional single-agent reinforcement
ratio. They showed the influence of changing the reward function learning (SARL) and reported that for the same number of epochs,
as the training progresses for speed-up of training convergence and SARL was unable to learn an effective policy. Guastoni et al.20 and
including constraints in the reward function. Li et al.9 improved the Sonoda et al.21 further improved the so-called opposition control22
geometry of a supercritical airfoil to minimize the drag coefficient, in turbulent channel flow by DRL. Guastoni et al.20 also used MARL,
using “imitation learning” as the pre-training to enhance learning in which an agent was assigned to each cell on the channel wall.
efficiency. A surrogate model created from CFD data was used for Suárez et al.23 trained the strategy of the blowing and suction from
the environment. They reported that the trained NNs are effective 90○ to 270○ positions of the three-dimensional cylinder. The use of
when applied to flows that differ from the conditions of the training MARL, in which an agent was assigned to each of several slots of
dataset and even when tested with CFD as the environment. Kim jets arranged in the axial direction of the cylinder, was a charac-
et al.10 developed a method for finding the Pareto front of multi- teristic feature of this system, and it reduced drag by up to 17%.
conditional multi-objective optimization of the Kármán–Trefftz air- Linot et al.24 tried to minimize drag in a plane Couette flow by con-
foil profile by DRL: NN learns the correlation between the condition trolling the output of two slot jets on one wall. To reduce training
space and the optimal solution. XFOIL11 is used as the environment time, they trained a time evolution model as an NN in a manifold,
to make learning more efficient. Noda et al.12 optimized the shape which consists of values encoded by proper orthogonal decompo-
of corrugated airfoils by DRL to maximize the lift–drag ratio. They sition (POD) and autoencoder, and used it as the environment of
tried the feasibility of the fine-tuning, which is a kind of transfer DRL. They reported that the method was 440 times faster than train-
learning: trained NN weights were utilized as the initial weight of ing in the DNS environment. Liu et al.25 performed DRL for active
the NN for different angles of attack. Consequently, the fine-tuning control of a pulsating baffle system. A time evolution model com-
worked well: optimization was successful even though not many bining autoencoder and long short-term memory was used for the
epochs were devoted to the pre-training. environment.
Conversely, DRL is suitable for finding a policy that outputs The environments used in these studies are summarized in
actions according to the ever-changing state of the system, so the Table I. Due to the trial-and-error nature of DRL, the use of
control law optimization problems benefit more from the features of 2D-CFD or some kind of surrogate model is popular to reduce
AIP Advances 14, 115204 (2024); doi: 10.1063/5.0237682 14, 115204-2
© Author(s) 2024
AIP Advances ARTICLE pubs.aip.org/aip/adv
TABLE I. Environments of the previous studies.
Environment Ref.
Database Yonekura and Hattori3
Surrogate model Yan et al.,5Qin et al.,7 Li et al.,9 Kim et al.,10 Linot et al.,24
and Liu et al.25
2D-CFD Yonekura et al.,4 Viquerat et al.,8 Noda et al.,12 Koizumi
et al.,14 Rabault et al.,15 Tang et al.,16 Varela et al.,17 and
Vignon et al.19
3D-CFD Verma et al.,13 Zhu et al.,18 Guastoni et al.,20 Sonoda
et al.,21 and Suárez et al.23
computation time in the environment. While a 2D laminar flow II. PROBLEM SETTING
problem requires a relatively realistic training time, training in 3D The analysis object is the flow around a two-dimensional cir-
flows or turbulent flows assuming actual fluid machinery is not cular cylinder. The lift fluctuation, CLrms , caused by Kármán vortex
realistic from the viewpoint of computational cost. Surrogate mod- shedding in the wake of the cylinder, is reduced by the control. Here,
els include design tools such as DATCOM5 and XFOIL,10 and CLrms is the root mean square of the lift coefficient CL . Although
NN-based models7,9,24,25 created using CFD data as training datasets. there are some control methods that can be used to reduce lift fluc-
Among the NN-based models, Qin et al.7 and Li et al.9 are regres- tuations, such as oscillating the cylinder in the direction of lift32 and
sion models of the relationship between shape and aerodynamic rotating the cylinder,33 in this study, we used the feedback control
parameters, which is similar to the design tools in Refs. 5 and 10. system proposed by Hiejima et al.34 as the environment of DRL. A
Conversely, in fluid control,24,25 time evolution models of the flow schematic of the feedback control is shown in Fig. 2. Based on this
are obtained as NNs. Linot et al.24 used CFD to create a train- control law, fluctuation attributed to the Kármán vortex is moni-
ing dataset with random actuation input to express the interaction tored as the velocity Vmon at a point in the wake of the cylinder,
19 February 2025 16:42:51
between the flow field and the control input. Liu et al.25 incorporate and a control input Uact (blowing and suction) is provided near
an ingenuity to treat unsteady flow fields by generating and multiply- the separation point to suppress the Kármán vortex. The velocities
ing pixel-level masks that highlight regions of interest where baffle Uact assigned to the upper and lower points are in opposite phases
(solid) and fluid interactions take place. according to the periodicity of the Kármán vortex. “How Uact is
In this study, we propose the use of a reduced-order model given” is the control law that should be optimized.
(ROM) as a time evolution model to be used in the environment of Schematics of the control laws to be compared in this study
fluid control using DRL. The ROM reduces the temporal and spatial are summarized in Fig. 3. First, Fig. 3(a) represents the control law
dimensions of considerable data and helps to elucidate significant proposed by Hiejima et al.,34
principles and phenomena from the data. In recent years, technolo-
gies such as large-scale parallel computing in three dimensions and U act = GV mon (t − τ), (1)
particle image velocimetry that generate large datasets have been mon
where V is the velocity in the y direction at the observation point
developed. ROM has been considered to deal with these consider- in the wake of the cylinder (Fig. 2) and G and τ are the gain and
able datasets. A typical ROM is the dynamic mode decomposition time delay, respectively. Hereafter, the control law represented by
(DMD) model. DMD was proposed in 2010 by Schmid.26 Various Eq. (1) is called “mathematical control.” Koizumi et al.14 attempted
flows were analyzed using DMD, such as the flow behind a cylin- to obtain control laws beyond mathematical control law in the form
der,27 the cavitating flow around a Clark-Y hydrofoil,28 and the flow of NNs by DRL, in which the state of the flow field based on Vmon was
around a high-speed train.29 DMD is utilized not only to simplify the the state of the environment and Uact was the agent’s action. In this
analysis of considerable datasets but also to express the time series of study, the DMDc model is used to represent the time evolution in
a flow field in a reduced expression.
In 2016, dynamic mode decomposition with control (DMDc),
which is an extension of DMD for systems with external inputs, was
proposed by Proctor et al.30 It has been used to analyze the effect of
changes in the angle of attack of a wing.31 Although there are cur-
rently fewer research examples compared to DMD, this approach
is gradually gaining traction as an analysis method. In this study,
we employed DMDc, which is a reduced expression of the sys-
tem with inputs, as the environment of DRL. The objective of this
study is to investigate the feasibility of optimization when utiliz-
ing a ROM within the DRL environment. In addition, we aim to
determine the extent to which computation time for learning can be FIG. 2. Schematic of a feedback control procedure for flow around a cylinder.
reduced.
AIP Advances 14, 115204 (2024); doi: 10.1063/5.0237682 14, 115204-3
© Author(s) 2024
AIP Advances ARTICLE pubs.aip.org/aip/adv
the environment instead of CFD [Fig. 3(c)]. By comparing Fig. 3(a)
with Figs. 3(b) and 3(c), we analyze the feasibility of using DRL to
obtain control laws that exceed mathematical control. In addition,
by comparing Figs. 3(b) and 3(c), we clarify the difference in the
control effect obtained when CFD and DMDc are used in the envi-
ronment and determine the reduction in computation time when
using DMDc.
Hiejima et al.34 set the Reynolds number based on the uniform
flow velocity U and a cylinder diameter D to 200. They placed the
blowing and suction points near the separation point at an angle of
θ = 100○ from the downstream end of the cylinder (see Fig. 2). In this
study, to use DMDc, which assumes a linear relationship between
two neighboring time series datasets, the Reynolds number is set to
Re = 100, which is within the laminar flow regime. The separation
point may be located more downstream compared to that observed
by Hiejima et al.34 Therefore, parametric analysis is conducted to
examine the suboptimal settings of G, θ, and τ under Re = 100, and
the highest control effect is obtained at G = 2.5, τ = 0.4T, and θ = 95○
(T: the Kármán vortex shedding period). Therefore, these G and
τ values are adopted as the parameters of mathematical control for
comparison, and θ = 95○ is adopted as the problem setting common
among Figs. 3(a)–3(c). Koizumi et al.14 set Re = 100, the same as the
present study, but they set G = 2.0 and θ = 100○ from the parametric
study. This difference is probably due to the parameters in solvers
and computational grids. Therefore, in this study, the results of
Koizumi et al.14 are not used for comparison but our DRL using CFD
as the environment is used, corresponding to Fig. 3(b). Hereafter, the
19 February 2025 16:42:51
variables are non-dimensionalized by U, D, and fluid density ρ.
III. OUTLINE OF CFD
In this study, CFD is used in the mathematical control
FIG. 3. Schematics of the control laws to be compared: (a) mathematical control;
(b) DRL using CFD as the environment; (c) DRL using DMDc as the environment. [Fig. 3(a)], the DRL environment incorporating CFD [Fig. 3(b)],
and for generating the dataset for the DMDc model. The overall
TABLE II. Computational settings of CFD. The text enclosed in double quotation marks are the names of the scheme in
OpenFOAM.
Code OpenFOAM ver 4.1
Application icoFOAM
Coupling of v and p PISO method
Convection term of N–S Eq. Second-order central difference with TVD limiter
(“limited linear” scheme with parameter 1)
Viscous term of N–S Eq. Second-order central difference with non-orthogonality
correction (“Gauss linear corrected”)
Pressure gradient term of N–S Eq. Second-order central difference (“Gauss linear”)
Spatial discretization of pressure Eq. Second-order central difference with non-orthogonality
correction (“Gauss linear corrected”)
Turbulence model No model
Solver of pressure Eq. Generalized geometric-algebraic multi-grid (“GAMG with
Gauss–Seidel method as the smoother”)
Solver of prediction velocity Gauss–Seidel method (“smooth solver with Gauss–Seidel
method as the smoother”)
AIP Advances 14, 115204 (2024); doi: 10.1063/5.0237682 14, 115204-4
© Author(s) 2024
AIP Advances ARTICLE pubs.aip.org/aip/adv
FIG. 4. Computational domain and boundary conditions.
FIG. 6. Structure of the DRL neural network.
TABLE III. Parameters of DDPG.
Size of replay buffer 20 000
Batch size 32
Learning rate of actor network 0.0001
Learning rate of critic network 0.001
Discount rate 0.99
FIG. 5. Computational grid near the cylinder.
blowing and suction velocity Uact in Fig. 2, and it is in the
19 February 2025 16:42:51
range −0.08 ≤ Uact ≤ 0.08. The deep deterministic policy gradient
framework of CFD remains consistent throughout these applica- (DDPG)38 is adopted as the DRL algorithm because it addresses
tions. The governing equations are the equation of continuity and continuous action value. The DDPG code is available from the
Navier–Stokes equation for the incompressible flow. The numer- SpinningUp project of OpenAI.39 The structure of the NN of DRL
ical simulations are performed using icoFOAM of OpenFOAM,35 is shown in Fig. 6. The hyperparameters of DDPG are listed in
the unsteady solver for incompressible flows. Table II lists the com- Table III.
putational settings of icoFOAM. The time step is set to 0.02. The
computational domain and boundary conditions are shown in Fig. 4. B. Exploration and learning process
The computational grid near the cylinder is shown in Fig. 5. The As the system has limited experience in the early stages of
number of grids is set to 7670, which is almost the same as Koizumi learning, it may fall into a local optimal solution. Therefore, an
et al.,14 and the grid near the cylinder is sufficiently fine to resolve exploration phase involving the input of random actions is necessary
the boundary layer. Under this setup, CFD of the uncontrolled flow to provide a variety of experiences to the system. In the exploration
around a cylinder with Re = 100 results in a Strouhal number St of phase, the Ornstein–Uhlenbeck process (O–U process),
0.169, which is in close agreement with the experimental results.36 √
χi+1 = χi + θ(μ − χi )dt + σ dtN(0, 1), (3)
IV. FRAMEWORK OF DEEP REINFORCEMENT
LEARNING multiplied by 0.08 is used as the action; the action output by
the agent is not used. The parameters of the O–U process are
A. Setting of DRL the same as those employed by Koizumi et al.14 (θ = 0.15, μ = 0,
The environment is the flow field evolved by CFD or, DMDc σ = 0.2, dt = 1.0). The exploration phase is run until 20 000 tran-
model as shown in Figs. 3(b) and 3(c). In the problem setting of sitions (st , a, R, st+1 ), that is, experiences, are stored in the replay
this study, a Markov decision process with a time delay has to be buffer. This means that 20 000 cycles of DRL are performed; how-
considered,37 as there is a time delay between the control input and ever, during this phase, the parameters of the NN are not updated.
appearance of the control effect. Therefore, the states in Figs. 3(b) Even after the start of the learning phase, “exploration” must be
and 3(c) are given as 30-dimensional state vectors considering the included in the action to prevent the agent from selecting biased
past state and past input expressed by14 actions that result in the local optimal solution because the DDPG
algorithm always selects a deterministic policy. The action consists
s = (Vt mon , Vt−T/5 mon , . . . , Vt−T mon , Ut act , Ut−T/60 act , . . . , Ut−23T/60 act ). of not only the value output from the NN but also exploration noise.
(2) The O–U process is superposed as the noise. The ratio of the control
The reward in Figs. 3(b) and 3(c) is a function of CLrms , the details input Uact chosen by the agent to the O–U process increases as the
of which are discussed later in Sec. VIII A. The action is the learning proceeds.
AIP Advances 14, 115204 (2024); doi: 10.1063/5.0237682 14, 115204-5
© Author(s) 2024
AIP Advances ARTICLE pubs.aip.org/aip/adv
X ′ = [x2 , x3 , . . . , xm+1 ] ∈ Rn×m , (7)
where m is the number of time-series data, then Eq. (4) is rewritten
as
X ′ = AX. (8)
Matrix A satisfying this equation is obtained as the matrix A that
minimizes following the equation:
J = ∥X ′ − AX∥22. (9)
The solution of Eq. (9) is obtained from
FIG. 7. Uact between neighboring two cycles.
A = X′ X+ , (10)
where X + is the pseudo-inverse matrix of X. The matrix X + can be
The agent conducts time marching of five time steps in the envi-
obtained by the singular value decomposition (SVD) of X, which is
ronment (CFD or DMDc) in one cycle of DRL. In other words, the
expressed by
agent changes the action every five time steps. The control input
for each time step is changed linearly from the value of the previ-
X = UΣV T , (11)
ous cycle to the value of the current cycle (Fig. 7). The uncontrolled
Kármán vortex shedding frequency is St = 0.169, and the dimension- where matrices U ∈ Rn×n and V ∈ Rm×m are orthogonal matri-
less period is T = 5.91. As the dimensionless time step is 0.02, it takes
ces, respectively, and VT is the transpose of the matrix V. Matrix
∼300 steps to complete one Kármán vortex period, which means that
Σ ∈ Rn×m is a matrix with p non-zero diagonal components when
each Kármán vortex period contains 60 cycles of DRL. To prevent
the rank of X is p. The diagonal components are called singular val-
the solution from falling into the local optimal solution, the field is
ues. When SVD is performed using Python, the singular values are
reset to the uncontrolled field after a maximum of 360 cycles (1800
ordered in descending order. Let Ũ ∈ Rn×r and Ṽ ∈ Rm×r be the first
19 February 2025 16:42:51
time steps or 30 periods of the Kármán vortex). One epoch is defined
r(≤ m, n) columns of U and V, respectively, and Σ̃ ∈ Rr×r be the first
as the period from one reset of the flow field to the next reset. If the
r rows and r columns of Σ. Then, Eq. (11) can be rewritten as
reward falls below −1, the control law of the epoch is recognized as
ineffective, and the system is forced to proceed to the next epoch.
X ≈ Ũ Σ̃Ṽ T , (12)
V. THEORY OF REDUCED-ORDER MODEL which is the optimal solution for approximating X by an n × m
A. Dynamic mode decomposition matrix of rank r or less (Eckert–Young’s theorem). Here, matrix
A is computed using the pseudo-inverse matrix X + of the low-rank
In dynamic mode decomposition (DMD), high-dimensional approximation of X as follows:
time-series data are represented as a superposition of several spa-
tiotemporal structural patterns (mode), which are computed by A = X ′ X + = X ′ Ṽ Σ̃ −1 Ũ T . (13)
assuming that the time-series data follow the relation,
However, the dimension of matrix A is as large as n × n, and the
xk+1 = Axk. (4) computational cost of eigenvalue decomposition to find the mode is
considerably high. Therefore, we reduce the dimension of xk using
Here, xk ∈ Rn is the temporally discrete field data, where k denotes
Ũ and denote it as xk :
the index of the time step of data sampling. xk is a column vec-
tor in which the velocity components u and v and pressure p of all x̃k = Ũ T xk. (14)
computational cells are arranged vertically:
Here, Eq. (4) can be reduced as follows:
xk = [u1 , ⋅ ⋅ ⋅, uN , v1 , ⋅ ⋅ ⋅, vN , p1 , ⋅ ⋅ ⋅, pN ]T . (5)
x̃k+1 = Ũ T AŨ x̃k = Ũ T X ′ Ṽ Σ̃ −1 x̃k = Ãx̃k , (15)
Here, N denotes the number of cells. Therefore, n, the dimension of
vector xk is 3N. Note that matrix A ∈ Rn×n is assumed as constant
for any k; the time evolution is assumed to proceed in a linear man- Ã = Ũ T X ′ Ṽ Σ̃ −1 , (16)
ner. Therefore, care must be taken when applying DMD to nonlinear
phenomena such as turbulent flows of high Reynolds numbers. where à represents the reduced matrix A, and the dimension is
The modes of DMD are obtained as eigenvectors of matrix A. reduced from n × n to r × r. Using Eq. (15), a low-dimensional time
We will now explain the method for obtaining the reduced matrix A. evolution computation can be performed. Note that the state vector
If the time-series data are expressed as xk must be restored to its original dimension using
X = [x1 , x2 , . . . , xm ] ∈ Rn×m , (6) xk = Ũ x̃k , (17)
AIP Advances 14, 115204 (2024); doi: 10.1063/5.0237682 14, 115204-6
© Author(s) 2024
AIP Advances ARTICLE pubs.aip.org/aip/adv
when the lift coefficient CL is calculated. In general, the mode is where r2 (≤m, n) singular values are selected from the larger ones
obtained by eigenvalue decomposition of matrix Ã; however, in this as Û ∈ Rn×r2 , Σ̂ ∈ Rr2 ×r2 , and V̂ ∈ Rm×r2 . Then, we reduce the
study, once matrix à is known, the time evolution can be calculated dimension of xk using Û and denote it as
by Eq. (15). Therefore, eigenvalue decomposition is not necessar-
ily required. For simplicity, the time evolution computations in this x̂k = Û T xk. (28)
study are performed using Eq. (15).
Equation (18) can be reduced to
B. Dynamic mode decomposition with control
x̂k+1 = Û T AÛ x̃k + Û T Buk = Û T X ′ Ṽ Σ̃ −1 Ũ T1 Û x̃k
Dynamic mode decomposition with control (DMDc) is an
extension of DMD that enables time-series data to be represented + Û T X ′ Ṽ Σ̃ −1 Ũ T2 uk
as a superposition of modes, even if external inputs are included, = Â x̂k + B̂uk , (29)
assuming the linear relationship:
xk+1 = Axk + Buk. (18) Â = Û T X ′ Ṽ Σ̃ −1 Ũ T1 Û (30)
n×n n×l
As in the DMD case, matrices A ∈ R and B ∈ R are constant
to any k. Vector xk ∈ Rn is the vector shown in Eq. (5), and vector
B̂ = Û T X ′ Ṽ Σ̃ −1 Ũ T2 , (31)
uk ∈ Rl is the external input. In this study, uk is a one-dimensional
vector, that is, scalar because only the control input Uact is used as where  represents the reduced matrix A, and the dimension is
the external input. Here, we explain the method for obtaining the reduced from n × n to r2 × r2 . Equation (18) thus obtained can be
reduced matrices A and B in DMDc. First, the time-series data can used for reduced time evolution calculations. Note that, however,
be expressed by matrices as follows: as in DMD, the state vector xk must be restored to its original
dimension using
X = [x1 , x2 , . . . , xm ] ∈ Rn×m , (19)
xk = Û x̂k , (32)
′ n×m
X = [x2 , x3 , . . . , xm+1 ] ∈ R , (20)
when CL is calculated.
19 February 2025 16:42:51
Γ = [u1 , u2 , . . . , um ] ∈ Rl×m , (21) VI. CREATION OF THE DMDC MODEL
and Eq. (18) is rewritten as Unlike for the DMD, to create a DMDc model, a dataset of
′ time-series flow field data with an “external input” is needed. We call
X = AX + BΓ = GΩ, (22) the external input for creating the dataset U model . Note that U model
where G = [A B] and Ω = [X Γ]T. When expressed this way, the low- does not indicate Uact, which is the control signal, but a “possible
rank approximations of matrices A and B can be obtained in the input signal” that should be included in the dataset. As the matrices
same way as in the DMD case. Matrix G is obtained as A and B change depending on U model , U model has to be tuned to cre-
ate an accurate time evolution model. In this case, the following two
G = X ′ Ω+ , (23) points should be considered:
assuming Ω+ is the pseudo-inverse of Ω. The pseudo-inverse Ω+ can 1. The DMDc model must be able to express complex control
be obtained by SVD of Ω: inputs when conducting time evolution; hence, it cannot be
created using a “simple” U model (e.g., sinusoidal curve).
Ω = UΣV T . (24) 2. The DMDc model must be able to express discontinuous input
n×r1 signals because random actions are selected in the exploration
This SVD can be approximated using the matrices Ũ ∈ R , phase and early stages of learning.
Ṽ ∈ Rm×r1 , and Σ̃ ∈ Rr1 ×r1 (r1 ≤ m, n) as in DMD:
Hence, we use an U model , in which the noise χ i is superimposed on
Ω ≈ Ũ Σ̃Ṽ T . (25) the chirp signal Ci as follows:
Here, matrices A and B can be computed using the pseudo-inverse Umodel i = aCi + bχi. (33)
matrix Ω+ as follows:
The chirp signal is a sinusoidal wave whose frequency is con-
[A B] = G = X ′ Ω+ = X ′ Ṽ Σ̃ −1 Ũ T tinuously increased or decreased. It is employed as a component
, (26)
= [X ′ Ṽ Σ̃ −1 Ũ T1 X ′ Ṽ Σ̃ −1 Ũ T2 ], of U model to include various frequencies in the dataset. The noise
superposed on the chirp signal is an O–U noise. It is included to
where Ũ 1 ∈ Rn×r1 and Ũ 2 ∈ Rl×r1 . However, the dimension of matrix represent a discontinuous signal. In addition, to obtain signals with
A is as large as n × n. To obtain the reduced-dimensional matrix A, different amplitudes, the parameters a and b are varied for each
we further perform the approximated SVD of the matrix X ′ : interval of non-dimensional time 50 ≤ t < 100, 100 ≤ t < 150, and
150 ≤ t ≤ 200, as listed in Table IV. In DRL, the control input Uact
X ′ ≈ Û Σ̂V̂ T , (27) takes the value −0.08 ≤ Uact ≤ 0.08. Therefore, the coefficients are
AIP Advances 14, 115204 (2024); doi: 10.1063/5.0237682 14, 115204-7
© Author(s) 2024
AIP Advances ARTICLE pubs.aip.org/aip/adv
TABLE IV. Parameters of Umodel . TABLE V. Strouhal number and CLrms in no-control case (DMD vs CFD).
t a b CLrms St
50–100 0.06 0.02 CFD 0.213 0.169
100–150 0.04 0.04 DMD 0.213 0.169
150–200 0.02 0.06
B. Mathematical control (DMDc vs CFD)
The accuracy of the time marching with the DMDc model is
validated by comparing the results of the mathematical control using
CFD and DMDc, as shown in Fig. 10. The number of dimensions of
the DMDc model is set to r1 = 50 (the number of singular values
selected in the SVD of Ω) and r2 = 80 (the number of singular values
selected in the SVD of X ′ ), which are larger than that required in
DMD (r = 10) to express the complex external input.
Figures 11 and 12 show the time series of CL , Uact, and Vmon of
mathematical control using the CFD and DMDc model, respectively.
FIG. 8. Time series of Umodel . The control signal Uact is input from t = 12.5. The Strouhal number
of the CL fluctuations and CLrms calculated using the last five periods
of Figs. 11(a) and 12(a) are listed in Table VI. It can be confirmed
varied in three steps so that the amplitude takes a value between 0 that the mathematical controls using CFD and the DMDc model
and 0.08. For all intervals, the non-dimensional frequency is varied suppress the amplitude of fluctuation of CL compared with the no-
from 0.01 to 0.2. The specific values of the coefficients at each stage control case, respectively. However, the control effect are smaller
are determined by trial and error. For 0 ≤ t < 50, no control signal is when the DMDc model is used; this is related to the spike observed
input. Figure 8 depicts the time series of U model . at 0 ≤ t ≤ 50 in Fig. 11(a). In CFD, when a time-discontinuous
Uact is input, the pressure field also changes discontinuously to
19 February 2025 16:42:51
VII. VALIDATIONS OF THE REDUCED-ORDER MODEL satisfy the continuity equation, causing a spike in CL . As the pres-
sure field is adjusted, the velocity field changes. Consequently, Vmon
A. Uncontrolled flow (DMD vs CFD) changes, and feedback control is applied. Conversely, the DMDc
The accuracy of the time marching using the DMD model is model does not show any spike in the time series of CL , which
validated by comparing it with CFD. The analysis object is a flow may be because a mode with a high-frequency component is not
around a cylinder (Re = 100) under the uncontrolled condition. The
DMD model with 10 modes selected from the largest singular values
is used, that is, r = 10. For the initial conditions (t = 0) of both CFD
and DMD, fully developed CFD data are used.
Figure 9 compares the time series of the lift coefficient CL .
The fluctuation in CL of both cases agrees well although only
10 modes are used in DMDc. Table V also shows good agreement of
the Strouhal number of the CL fluctuation and its magnitude CLrms .
The computation time for the time marching from t = 0 to t = 50,
which takes ∼360 s using CFD, can be computed in ∼56 s using
DMD.
FIG. 9. Comparison of lift coefficient CL in the no-control case (DMD vs CFD).
The range indicated by the arrow shows one period of Kármán vortex visualized in FIG. 10. Schematics of comparison for the validation of DMDc: (a) mathematical
Figs. 13 and 23. control using CFD; (b) mathematical control using DMDc.
AIP Advances 14, 115204 (2024); doi: 10.1063/5.0237682 14, 115204-8
© Author(s) 2024
AIP Advances ARTICLE pubs.aip.org/aip/adv
19 February 2025 16:42:51
FIG. 12. Time series of CL and Uact of the mathematical control using the DMDc
FIG. 11. Time series of CL and Uact of the mathematical control using CFD. The model. The range indicated by the arrow shows one period of Kármán vortex
range indicated by the arrow shows one period of Kármán vortex visualized in visualized in Fig. 13.
Figs. 13 and 23.
TABLE VI. Strouhal number and CLrms of mathematical control (DMDc vs CFD).
CLrms St
selected, that is, corresponding singular values are omitted, when the
DMDc model is created. Another possible reason is that the chirp No-control 0.213 0.169
signal shown in Fig. 8 does not include such a large discontinu- Mathematical control (CFD) 0.046 0.133
ous Uact. As a result, the velocity field (Vmon) is not well adjusted, Mathematical control (DMDc) 0.054 0.133
and the feedback control is not triggered, resulting in less control
effectiveness.
Figure 13 is the visualization of the flow field every 1/4 period
of the lift fluctuations shown in Figs. 9, 11(a), and 12(a). The visu- VIII. DEEP REINFORCEMENT LEARNING
alization of the no-control case is obtained from the CFD results. A. Influence of reward function
Comparing results for the no-control case and the mathematical
control case using CFD, the low-velocity region in the wake of the In this study, the reward R is a function defined as follows:
cylinder extends downstream when the control is applied. Hiejima
et al.34 reported that the length of the vortex increases owing to R = 20{c1 [−(CLrms )2 + (CLrms ) ] − c2 (U act )2 − c3 (ΔU act )2 }.
nocontrol 2
the control, which is consistent with the visualization results in this (34)
study. Conversely, the results of mathematical control using the As the objective of control is to minimize CLrms , CLrms is adopted
DMDc model show that the vortex extends downstream to some as a variable. The energy required for control should be as small as
extent but it is shorter than that obtained in the mathematical con- possible. The term of (U act )2 is added to the reward function as a
trol case using CFD. This corresponds to the fact that Vmon by DMDc penalty term. Furthermore, if the velocity change between neighbor-
remains relatively large. ing cycles of DRL, ΔUact, becomes too large, the continuity equation
AIP Advances 14, 115204 (2024); doi: 10.1063/5.0237682 14, 115204-9
© Author(s) 2024
AIP Advances ARTICLE pubs.aip.org/aip/adv
19 February 2025 16:42:51
FIG. 13. Comparison of the instantaneous velocity magnitude fields of mathematical control using CFD and DMDc model.
AIP Advances 14, 115204 (2024); doi: 10.1063/5.0237682 14, 115204-10
© Author(s) 2024
AIP Advances ARTICLE pubs.aip.org/aip/adv
TABLE VII. Parameters of the reward function and corresponding best Q-value and CLrms .
c1 c2 c3 CLrms Q Epoch
No-control ... ... ... 0.213 ... ...
pattern1 (best Q) 1.0 1.5 25.0 0.052 55.9 4 300
pattern1 (best CLrms ) 1.0 1.5 25.0 0.052 43.4 4 400
Pattern 2 (best Q) 0.5 × 1.005istep 1.5 25.0 0.217 180.7 18 200
Pattern 2 (best CLrms ) 0.5 × 1.005istep 1.5 25.0 0.055 83.6 16 600
Pattern 3 (best Q) 0.5 × 1.005istep 1.5 50.0 0.095 167.5 60 800
Pattern 3 (best CLrms ) 0.5 × 1.005istep 1.5 50.0 0.028 64.3 62 200
is not satisfied. Therefore, (ΔU act )2 should also be as small as possi-
ble and is added to the reward function as a penalty term. When no
control signal is input, that is, Uact = ΔUact = 0, the reward should be
nocontrol 2
zero. Therefore, (CLrms ) is added to −(CLrms )2 to achieve R = 0
under no control.
Three patterns of different coefficients of the reward function
are used to investigate the coefficients that yield effective control
laws. Table VII lists the three patterns (values other than the coef-
ficients are discussed later). In patterns 2 and 3, the coefficient c1 ,
which is a function of epoch, is predominantly adopted to eval-
uate the decrease in CLrms in the latter stage of control. Here,
istep indicates the sum of the timesteps. In pattern 1, however, this
function is not employed. In pattern 3, c3 , which is the coefficient of
19 February 2025 16:42:51
the penalty term of ΔUact, is twice those of patterns 1 and 2. These
values are determined by trial and error considering the balance
among the three terms.
For this investigation, DRL using the DMDc model as the envi-
ronment [Fig. 3(c)] is performed as the learning mode, utilizing its
higher speed of learning. However, in the test mode, where NN para-
meters are fixed and the learned control law is tested, we used CFD FIG. 14. Schematics of the learning mode and test mode of DRL using DMDc as
the environment: (a) learning mode and (b) test mode.
as the environment of DRL. Figure 14 shows the schematic diagram
of the learning and test modes. This is because the results of the
DMDc model are only approximate solutions of CFD. The test is
performed every 100 learning epochs, with the parameters fixed at
the state reached in that epoch.
Figure 15 shows the evolution of the maximum Q value for
epochs during the learning mode. Fig. 16 shows the evolution of
the CLrms calculated from the last five periods of the CL fluctuation
of the test mode. These figures reveal that learning is unstable: the
Q value and CLrms fluctuate and do not converge to a certain value.
Therefore, when the CLrms value is large (i.e., CLrms = 0.8 in Fig. 16) FIG. 15. Comparison of the evolution of the Q-value for epochs among the three
over 10 000 epochs, the learning is recognized as having failed and patterns of reward function parameters.
the learning is terminated. Patterns 1 and 2 are terminated due to
this reason. When CLrms is maintained as 0.8, the agent learns a con-
trol law such that it continues to take upper or lower bounds for its
action (Uact = ±0.08). This is because the agent learns to minimize
ΔUact, resulting in a local optimal solution.
Now, the reasons for the significant fluctuations in Q value and
CLrms will be discussed. The test mode is conducted based on the
policy acquired up to a certain epoch. The learning of the policy
depends on the learning of the value function (Q value): the learning
of the policy proceeds in such a way that the action with the max-
FIG. 16. Comparison of the evolution of CLrms for epochs among the three patterns
imum Q value is selected. In particular, a deterministic policy such of reward function parameters.
as DDPG has been shown to overfit outlier output from the value
AIP Advances 14, 115204 (2024); doi: 10.1063/5.0237682 14, 115204-11
© Author(s) 2024
AIP Advances ARTICLE pubs.aip.org/aip/adv
function.40 According to Fujimoto et al.,40 when the actor network
is updated slowly, the learning of the value function is more stable
than when it is updated sequentially. This suggests that the learning
rate of the actor network in this study is large. As another method
for improving the stability of learning, twin delayed DDPG (TD3),40
a method that can solve the instability of learning in DDPG, can be
incorporated.
Table VII lists the minimum CLrms of the case (best CLrms ), the
maximum “ max Q value” of the case (best Q), and their epochs.
Figures 17–22 show the time series of Uact and CL of the cases listed
in Table VII. The control signal Uact is input from t = 12.5. These
results show that CLrms can be reduced for all patterns, indicating
that control law optimization by DRL using the DMDc model as
the environment is feasible. In pattern 1, similar control laws are
obtained for best Q and best CLrms , and the value of CLrms is almost
the same. However, in best CLrms , CL is more oscillatory because
ΔUact is larger than in best Q. The Q value in pattern 1 dropped and
converged to zero; this suggests that a more effective control law can
be found by focusing on CLrms after some time has passed since the
control input is given, as in patterns 2 or 3.
In patterns 2 and 3, even though the Q value are maximized,
CLrms is not necessarily smaller. In Fig. 19, corresponding to the best
Q of pattern 2, the CL fluctuation and the input Uact amplitude are
large, resulting in a large energy consumption control. For the best
CLrms of pattern 2, CL is oscillatory because ΔUact is large. In pattern
3, both the best Q and the best CLrms obtain control laws that con-
FIG. 17. Time series of CL and Uact of DRL using the DMDc model as the
sider CLrms , Uact, and ΔUact in a well-balanced manner. In particular,
19 February 2025 16:42:51
environment (pattern 1, best Q value).
Fig. 22(a) shows a control effect that exceeds that of mathematical
control (CLrms = 0.046). Furthermore, in pattern 3, the weights of
ΔUact are larger than those of other patterns, so the oscillations of
Uact and CL are relatively suppressed. As the obtained control law
significantly depends on the weight of CLrms , Uact, and ΔUact in the
reward function, Figs. 17–22 are one of the Pareto-optimal solutions.
However, in this study, we focus on the control effect: we focus on
62 200 epoch in pattern 3.
Figure 23 is the visualization of the flow field every 1/4 period
of the lift fluctuations shown in Figs. 9, 11(a), and 22(a). Again,
the visualization of the no-control case and mathematical control
case (same as Fig. 13) is shown for comparison. The low-velocity
region in the wake is longer compared to the no-control case. This
is also observed in the mathematical control. However, the low-
velocity region is slightly shorter than the mathematical control. The
Strouhal number of CL is 0.171 for pattern 3, which is larger than
that of the mathematical control shown in Table VI. In mathematical
control, the time delay τ is fixed at 0.4 times Kármán vortex period
T, but in this study, Vmon is monitored backward from t to t − T in
T/5 increments, as shown in Eq. (2). Therefore, the control law can
be flexibly changed by selecting Vmon at the time of importance.
B. Comparison with the DRL using CFD
1. Results
Adopting the reward function of pattern 3, which yields the
most significant control effect, we conduct DRL utilizing CFD as the
environment. Then, we compare the results with those obtained with
FIG. 18. Time series of CL and Uact of DRL using the DMDc model as the
DRL employing DMDc [comparison of Figs. 3(b) and 3(c)]. In both environment (pattern 1, best CLrms ).
cases, CFD is used as the environment in the test mode.
AIP Advances 14, 115204 (2024); doi: 10.1063/5.0237682 14, 115204-12
© Author(s) 2024
AIP Advances ARTICLE pubs.aip.org/aip/adv
FIG. 21. Time series of CL and Uact of DRL using the DMDc model as the
FIG. 19. Time series of CL and Uact of DRL using the DMDc model as the
19 February 2025 16:42:51
environment (pattern 3, best Q value).
environment (pattern 2, best Q value).
FIG. 22. Time series of CL and Uact of DRL using the DMDc model as the environ-
FIG. 20. Time series of CL and Uact of DRL using the DMDc model as the ment (pattern 3, best CLrms ). The range indicated by the arrow shows one period
environment (pattern 2, best CLrms ). of Kármán vortex visualized in Fig. 23.
AIP Advances 14, 115204 (2024); doi: 10.1063/5.0237682 14, 115204-13
© Author(s) 2024
AIP Advances ARTICLE pubs.aip.org/aip/adv
19 February 2025 16:42:51
FIG. 23. Comparison of the instantaneous velocity magnitude fields among no-control, mathematical control, and DRL.
AIP Advances 14, 115204 (2024); doi: 10.1063/5.0237682 14, 115204-14
© Author(s) 2024
AIP Advances ARTICLE pubs.aip.org/aip/adv
Figure 24 shows the evolution of the maximum Q value for TABLE VIII. Comparison of the best Q-value and CLrms between the DRL using CFD
epochs during the learning mode, and Fig. 25 shows the evolution and DMDc. The reward function is fixed to the pattern 3).
of the CLrms in the test mode. The results of DRL using DMDc as the CLrms Q Epoch
environment are the same as the results of pattern 3 in Figs. 15 and
16. The learning process is still unstable despite CFD being used as No-control 0.213 ... ...
the environment. As both cases use the same reward function, the DMDc (best Q) 0.095 167.5 60 800
same Q value has the same value, that is, Q values can be compared DMDc (best CLrms ) 0.028 64.3 62 200
between the cases. Q values are higher for CFD than for DMDc for CFD (best Q) 0.049 169.8 23 900
a longer period of time. In other words, CFD gives better results in CFD (best CLrms ) 0.049 169.8 23 900
terms of maximizing the reward function; this is due to the accu-
racy of the DMDc model, as discussed in Sec. VII B: it results in a
larger estimate of CLrms under the same control law. Another reason
is the CL spikes that occur in CFD. Given a discontinuous Uact, the
pressure is corrected to satisfy the continuity equation, resulting in
CL spikes as in Figs. 11(a) 0 ≤ t ≤ 50. Conversely, in Fig. 12(b),
there are no spikes; this is because the chirp signal used to create
the DMDc model has no such high-frequency component for the
expression of spikes. In the case of learning with CFD as the envi-
ronment, to avoid spikes, the agent moved in the parameter space to
take care of ΔUact, which, together with the ΔUact weight, may have
increased the Q value.
Table VIII summarizes the best Q and best CLrms . When CFD
is used as the environment, the epochs of best Q and best CLrms are
the same (23 900). The time series of CL and Uact at epoch 23 900
is shown in Fig. 26. Comparing Fig. 26(a) with Fig. 22(a), there are
fewer spikes; this is due to the agent’s prioritization of care for ΔUact,
as mentioned previously. Conversely, even though an effective con-
19 February 2025 16:42:51
trol law is obtained using DMDc for learning, spikes occur in the test,
as shown in Figs. 21 and 22. However, the DMDc model’s feature of
not producing CL spikes can also be recognized as an advantage. It
FIG. 26. Time series of CL and Uact of DRL using CFD as the environment (pattern
3, best Q and best CLrms ).
FIG. 24. Comparison of the evolution of the Q-value for epochs of DRL using CFD
and DRL using DMDc.
is easier to reward the effects of control inputs that reduce the lift
fluctuation; the agent does not have to care about the occurrence
of lift spikes. Due to this property, a better control law, in terms of
minimizing CLrms , is obtained using DMDc for learning. However,
if discontinuous fluctuation of CL is to be avoided, it is necessary
to increase the weight of ΔUact of the reward function when using
DMDc for learning.
2. Computation time for learning
Finally, the time required for learning is compared. The learn-
FIG. 25. Comparison of the evolution of CLrms for epochs of DRL using CFD and
DRL using DMDc.
ing of 80 000 epochs and the storage of 20 000 cycles of experience
in the replay buffer (equivalent to about 55 epochs if 360 cycles
AIP Advances 14, 115204 (2024); doi: 10.1063/5.0237682 14, 115204-15
© Author(s) 2024
AIP Advances ARTICLE pubs.aip.org/aip/adv
are included per epoch) takes about 20 days using CFD, whereas it therefore, mode selection methods should be incorporated, such as
takes only about three days using DMDc. However, if CFD is used, sparsity-promoting dynamic mode decomposition with control,41
it may take <20 days if it is possible to pass the data between the compressive sensing,42 and greedy method.43 A nonlinear reduced
agent and environment through memory: in the case of CFD, in model is needed to deal with nonlinear phenomena, such as tur-
this study, the data (state, action, and reward) passing between the bulent flow, instead of a method that assumes linearity. In the test
environment, that is, OpenFOAM, and the agent written in Python mode, we use CFD as the environment, and we have confirmed that
is done through the disk. Comparing the same number of cycles the agent trained with the DMDc model can be used in actual situ-
(not epochs), if a more sophisticated communication method is ations. Therefore, as Varela et al.17 reported, it is expected that the
used, such as an API connecting Python and OpenFOAM through agent trained using DMDc can be applied within the same Re regime
memories, the learning time could be about 1/3 of that required for (similar dynamics), namely, transfer learning. However, if the trans-
I/O to the disk. Given the randomness of the reinforcement learn- fer of agents to flow fields of higher Reynolds numbers is considered,
ing algorithm, the number of cycles per epoch varies, so the learning it is all the more necessary to develop a surrogate model that can
time is not necessarily reduced to about 1/3, but it can be at least express nonlinear time evolution at low cost.
halved by using an API. However, DMDc still saves time in learning. Finally, attention should be paid to the design of the reward
The flow field in this study is very simple and laminar, so the agent function. In this study, the reward function is designed as a multi-
can learn in about 1–3 weeks using CFD. However, if the environ- objective optimization problem that minimizes lift fluctuation,
ment is a larger field or a turbulent flow field, the agent will not learn energy consumption, and input discontinuities. However, as the
in a realistic amount of time unless using a time-evolution model like results vary considerably depending on the weight of each term of
a reduced model. Therefore, this method is expected to be useful for the reward function, the difficulty of designing the reward function
practical applications of DRL in fluid flow problems. in the first place remains an issue. For example, if there is an “expert”
solution, which is a kind of role-model solution, as in the case of
mathematical control in this study, it is considered effective to esti-
mate the reward function by inverse reinforcement learning using
IX. CONCLUSION the expert and to conduct further learning based on the obtained
Deep reinforcement learning (DRL) is performed to optimize reward function.
the active blowing and suction control law of lift fluctuations caused
by the Kármán vortex around a circular cylinder. A reduced model
19 February 2025 16:42:51
created by dynamic mode decomposition with control (DMDc) is ACKNOWLEDGMENTS
used instead of computational fluid dynamics (CFD) to perform
This work is supported by JSPS KAKENHI Grant No.
optimization more efficiently. The applicability of the method, fea-
JP22K03925.
tures, and saved computation time are investigated. A control law
that exceeds the performance of the mathematical control law is
obtained in the form of a neural network (NN); this control law can- AUTHOR DECLARATIONS
not be obtained when the same number of epochs are trained using
CFD in the environment; this may be due to the agent’s preference Conflict of Interest
not to choose discontinuous control inputs when CFD is employed The authors have no conflicts to disclose.
as the environment. Conversely, as lift spikes do not occur with
DMDc, even if a good control law is obtained using DMDc for train-
Author Contributions
ing, lift spikes may occur when that control law is tested in the CFD
environment. However, the DMDc model’s feature of not produc- T. Sakamoto: Data curation (equal); Formal analysis (equal);
ing lift spikes can also be recognized as an advantage. It is easier to Investigation (equal); Methodology (equal); Validation (equal);
reward the effects of control inputs that reduce the lift fluctuation: Visualization (equal); Writing – original draft (lead); Writing –
the agent does not have to care about the occurrence of lift spikes. review & editing (equal). K. Okabayashi: Conceptualization (lead);
This feature is the possible reason why the DMDc environment pro- Data curation (equal); Formal analysis (equal); Funding acquisition
vides a better control law in terms of minimizing lift fluctuation. (lead); Investigation (equal); Methodology (equal); Project admin-
With the CFD environment, it takes the agent 20 days to learn; istration (lead); Resources (lead); Supervision (lead); Validation
however, with DMDc, it takes only 3 days. In this study, the data (equal); Visualization (equal); Writing – original draft (supporting);
passing between the agent written in Python and the OpenFOAM Writing – review & editing (equal).
environment is done via disk; if this process can be done via mem-
ory using an API, the learning time will be approximately 10 days DATA AVAILABILITY
when CFD is used in the environment. However, it is still faster to
use DMDc. The data that support the findings of this study are available
It is necessary to refine the mode selection to improve the accuracy of the DMDc model. In this study, dozens of modes are selected in descending order of singular value to create the reduced-order model. However, unlike proper orthogonal decomposition (POD), the mode with the larger singular value is not necessarily the more important mode in DMD and DMDc, so a selection criterion better suited to the dynamics needs to be examined.
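The selection used here, keeping the modes associated with the largest singular values, amounts to a truncated singular value decomposition of the snapshot matrix, as sketched below; the function name and interface are illustrative, and an alternative criterion such as sparsity-promoting DMD would replace this ranking.

import numpy as np

def truncated_svd_basis(X, r):
    """Return the r leading left singular vectors (and singular values) of the
    snapshot matrix X, i.e., the modes kept in descending order of singular value."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :r], s[:r]

Because this ranking reflects energy content rather than dynamical relevance, a mode that is weak in the singular-value sense may still carry frequency content that matters for the controlled lift response.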
Another remaining issue is the design of the reward function, for example, its formulation as a multi-objective optimization problem that minimizes lift fluctuation, energy consumption, and input discontinuities. However, because the results vary considerably depending on the weight of each term of the reward function, the difficulty of designing the reward function in the first place remains an issue. For example, if there is an "expert" solution, that is, a kind of role-model solution such as the mathematical control in this study, it is considered effective to estimate the reward function by inverse reinforcement learning using the expert and to conduct further learning based on the estimated reward function.
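One illustrative form of such a weighted reward is sketched below; the weights and functional forms are hypothetical and are not values used in this study.

def multi_objective_reward(c_l, power_in, u_now, u_prev,
                           w_lift=1.0, w_energy=0.1, w_smooth=0.1):
    """Hypothetical reward combining the three objectives discussed in the text:
    lift fluctuation, actuation energy, and input discontinuity."""
    return -(w_lift * c_l**2
             + w_energy * power_in
             + w_smooth * (u_now - u_prev)**2)

The sensitivity of the learned policy to w_lift, w_energy, and w_smooth is exactly the reward-design difficulty noted above; an expert demonstration would allow these trade-offs to be inferred by inverse reinforcement learning rather than hand-tuned.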
ACKNOWLEDGMENTS

This work is supported by JSPS KAKENHI Grant No. JP22K03925.

AUTHOR DECLARATIONS

Conflict of Interest

The authors have no conflicts to disclose.

Author Contributions

T. Sakamoto: Data curation (equal); Formal analysis (equal); Investigation (equal); Methodology (equal); Validation (equal); Visualization (equal); Writing – original draft (lead); Writing – review & editing (equal). K. Okabayashi: Conceptualization (lead); Data curation (equal); Formal analysis (equal); Funding acquisition (lead); Investigation (equal); Methodology (equal); Project administration (lead); Resources (lead); Supervision (lead); Validation (equal); Visualization (equal); Writing – original draft (supporting); Writing – review & editing (equal).

DATA AVAILABILITY

The data that support the findings of this study are available within the article.

REFERENCES

1. G. Hinton and R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science 313, 504–507 (2006).
2. V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv:1312.5602 (2013).
3. K. Yonekura and H. Hattori, "Framework for design optimization using deep reinforcement learning," in Structural and Multidisciplinary Optimization (Springer, 2019), Vol. 60, pp. 1709–1713.
4. K. Yonekura, H. Hattori, S. Shikada, and K. Maruyama, "Turbine blade optimization considering smoothness of the Mach number using deep reinforcement learning," Inf. Sci. 642, 119066 (2023).
5. X. Yan, J. Zhu, M. Kuang, and X. Wang, "Aerodynamic shape optimization using a novel optimizer based on machine learning techniques," Aerosp. Sci. Technol. 86, 826–835 (2019).
6. W. B. Blake, "Missile DATCOM user's manual - 1997 FORTRAN 90 revision," Air Force Research Laboratory Final Report for Period April 1993–December 1997, No. AFRL-VA-WP-TR-1998-3009, 1998, pp. 1–102.
7. S. Qin, S. Wang, L. Wang, C. Wang, G. Sun, and Y. Zhong, "Multi-objective optimization of cascade blade profile based on reinforcement learning," Appl. Sci. 11, 106 (2020).
8. J. Viquerat, J. Rabault, A. Kuhnle, H. Ghraieb, A. Larcher, and E. Hachem, "Direct shape optimization through deep reinforcement learning," J. Comput. Phys. 428, 110080 (2021).
9. R. Li, Y. Zhang, and H. Chen, "Learning the aerodynamic design of supercritical airfoils through deep reinforcement learning," AIAA J. 59(10), 3988–4001 (2021).
10. S. Kim, I. Kim, and D. You, "Multi-condition multi-objective optimization using deep reinforcement learning," J. Comput. Phys. 462, 111263 (2022).
11. XFOIL Home Page, https://web.mit.edu/drela/Public/web/xfoil/ (Accessed on 17 June 2024).
12. T. Noda, K. Okabayashi, S. Kimura, S. Takeuchi et al., "Optimization of configuration of corrugated airfoil using deep reinforcement learning and transfer learning," AIP Adv. 13, 035328 (2023).
13. S. Verma, G. Novati, and P. Koumoutsakos, "Efficient collective swimming by harnessing vortices through deep reinforcement learning," Proc. Natl. Acad. Sci. U. S. A. 115(23), 5849–5854 (2018).
14. H. Koizumi, S. Tsutsumi, and E. Shima, "Feedback control of Karman vortex shedding from a cylinder using deep reinforcement learning," in Proceedings of the 2018 Flow Control Conference, No. AIAA 2018-3691, 2018.
15. J. Rabault, M. Kuchta, A. Jensen, U. Réglade, and N. Cerardi, "Artificial neural networks trained through deep reinforcement learning discover control strategies for active flow control," J. Fluid Mech. 865, 281–302 (2019).
16. H. Tang, J. Rabault, A. Kuhnle, Y. Wang, and T. Wang, "Robust active flow control over a range of Reynolds numbers using an artificial neural network trained through deep reinforcement learning," Phys. Fluids 32(5), 053605 (2020).
17. P. Varela, P. Suárez, F. Alcántara-Ávila, A. Miró, J. Rabault, B. Font, L. M. García-Cuevas, O. Lehmkuhl, and R. Vinuesa, "Deep reinforcement learning for flow control exploits different physics for increasing Reynolds number regimes," Actuators 11(12), 359 (2022).
18. G. Zhu, W. Z. Fang, and L. Zhu, "Optimizing low-Reynolds-number predation via optimal control and reinforcement learning," J. Fluid Mech. 944, A3 (2022).
19. C. Vignon, J. Rabault, J. Vasanth, F. Alcántara-Ávila, M. Mortensen, and R. Vinuesa, "Effective control of two-dimensional Rayleigh–Bénard convection: Invariant multi-agent reinforcement learning is all you need," Phys. Fluids 35(6), 065146 (2023).
20. L. Guastoni, J. Rabault, P. Schlatter, H. Azizpour, and R. Vinuesa, "Deep reinforcement learning for turbulent drag reduction in channel flows," Eur. Phys. J. E 46(4), 27 (2023).
21. T. Sonoda, Z. Liu, T. Itoh, and Y. Hasegawa, "Reinforcement learning of control strategies for reducing skin friction drag in a fully developed turbulent channel flow," J. Fluid Mech. 960, A30 (2023).
22. H. Choi, P. Moin, and J. Kim, "Active turbulence control for drag reduction in wall-bounded flows," J. Fluid Mech. 262, 75–110 (1994).
23. P. Suárez, F. Alcántara-Ávila, A. Miró, J. Rabault, B. Font, O. Lehmkuhl, and R. Vinuesa, "Active flow control for three-dimensional cylinders through deep reinforcement learning," arXiv:2309.02462 (2023).
24. A. J. Linot, K. Zeng, and M. D. Graham, "Turbulence control in plane Couette flow using low-dimensional neural ODE-based models and deep reinforcement learning," Int. J. Heat Fluid Flow 101, 109139 (2023).
25. Y. Liu, F. Wang, S. Zhao, and Y. Tang, "A novel framework for predicting active flow control by combining deep reinforcement learning and masked deep neural network," Phys. Fluids 36(3), 037112 (2024).
26. P. J. Schmid, "Dynamic mode decomposition of numerical and experimental data," J. Fluid Mech. 656, 5–28 (2010).
27. Y. Ohmichi and Y. Igarashi, "Dynamic mode decomposition for multi-dimensional time series analysis," Brain Neural Networks 25(1), 2–9 (2018) (in Japanese).
28. R. Qiu, R. Huang, Y. Wang, and C. Huang, "Dynamic mode decomposition and reconstruction of transient cavitating flows around a Clark-Y hydrofoil," Theor. Appl. Mech. Lett. 10(5), 327–332 (2020).
29. T. W. Muld, G. Efraimsson, and D. S. Henningson, "Flow structures around a high-speed train extracted using proper orthogonal decomposition and dynamic mode decomposition," Comput. Fluids 57, 87–97 (2012).
30. J. L. Proctor, S. L. Brunton, and J. N. Kutz, "Dynamic mode decomposition with control," SIAM J. Appl. Dyn. Syst. 15(1), 142–161 (2016).
31. C. Sun, T. Tian, X. Zhu, and Z. Du, "Input-output reduced-order modeling of unsteady flow over an airfoil at a high angle of attack based on dynamic mode decomposition with control," Int. J. Heat Fluid Flow 86, 108727 (2020).
32. H. M. Warui and N. Fujisawa, "Feedback control of vortex shedding from a circular cylinder by cross-flow cylinder oscillations," Exp. Fluids 21, 49–56 (1996).
33. N. Fujisawa, Y. Kawaji, and K. Ikemoto, "Feedback control of vortex shedding from a circular cylinder by rotational oscillations," J. Fluids Struct. 15(1), 23–37 (2001).
34. S. Hiejima, T. Watanabe, and T. Nomura, "Feedback control of Karman vortex shedding behind a circular cylinder by velocity excitation," J. Appl. Mech. 7, 1125–1132 (2004) (in Japanese).
35. OpenFOAM Home Page, https://www.openfoam.com/ (Accessed on 29 May 2024).
36. A. Roshko, "On the development of turbulent wakes from vortex streets," NACA-TR-1191, 1954.
37. K. V. Katsikopoulos and S. E. Engelbrecht, "Markov decision processes with delays and asynchronous cost collection," IEEE Trans. Autom. Control 48(4), 568–574 (2003).
38. D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, "Deterministic policy gradient algorithms," in Proceedings of the 31st International Conference on Machine Learning (JMLR.org, 2014), pp. 387–395.
39. J. Achiam, SpinningUp, https://spinningup.openai.com/en/latest/ (Accessed on 12 Jan 2022).
40. S. Fujimoto, H. van Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," arXiv:1802.09477 (2018).
41. A. Tsolovikos, E. Bakolas, S. Suryanarayanan, and D. Goldstein, "Estimation and control of fluid flows using sparsity-promoting dynamic mode decomposition," IEEE Control Syst. Lett. 5(4), 1145–1150 (2021).
42. M. R. Jovanović, P. J. Schmid, and J. W. Nichols, "Sparsity-promoting dynamic mode decomposition," Phys. Fluids 26, 024103 (2014).
43. Y. Ohmichi, "Preconditioned dynamic mode decomposition and mode selection algorithms for large datasets using incremental proper orthogonal decomposition," AIP Adv. 7, 075318 (2017).