GAMER: A GRAPHIC PROCESSING UNIT ACCELERATED ADAPTIVE-MESH-REFINEMENT CODE FOR ASTROPHYSICS

Abstract
We present the newly developed code, GPU-accelerated Adaptive-MEsh-Refinement code (GAMER), which adopts a novel approach in improving the performance of adaptive-mesh-refinement (AMR) astrophysical simulations by a large factor with the use of the graphic processing unit (GPU). The AMR implementation is based on a hierarchy of grid patches with an oct-tree data structure. We adopt a three-dimensional relaxing total variation diminishing scheme for the hydrodynamic solver and a multi-level relaxation scheme for the Poisson solver. Both solvers have been implemented in GPU, by which hundreds of patches can be advanced in parallel. The computational overhead associated with the data transfer between the CPU and GPU is carefully reduced by utilizing the capability of asynchronous memory copies in GPU, and the computing time of the ghost-zone values for each patch is diminished by overlapping it with the GPU computations. We demonstrate the accuracy of the code by performing several standard test problems in astrophysics. GAMER is a parallel code that can be run in a multi-GPU cluster system. We measure the performance of the code by performing purely baryonic cosmological simulations in different hardware implementations, in which detailed timing analyses provide comparison between the computations with and without GPU(s) acceleration. Maximum speed-up factors of 12.19 and 10.47 are demonstrated using one GPU with 4096³ effective resolution and 16 GPUs with 8192³ effective resolution, respectively.

  1. INTRODUCTION
    Numerical simulations have played an indispensable role in modern astrophysics. They serve as powerful tools to probe the fully nonlinear evolution in various problems, and provide connections between theoretical analyses and observational results. Moreover, thanks to the rapid development of parallel computing techniques (e.g., the Beowulf clusters, Cell Broadband Engines, and graphic processing units (GPUs)), the spatial and mass resolutions as well as the computing performance have improved substantially in the last decade.
    The most essential ingredients in astrophysical simulations are the Newtonian gravity and hydrodynamics. In the last five decades, many studies have been devoted to improving both the accuracy and efficiency of numerical schemes. One of the simplest approaches is to discretize the simulation domain into a fixed number of grid cells, each of which occupies a fixed volume and position. The cell-averaged physical attributes are defined in each cell. The gravitational potential can be evaluated by several different schemes, for instance, the relaxation method, conjugate gradient method, and fast Fourier transform (FFT). As for hydrodynamic evolution, it can be described by the modern high-resolution shock-capturing algorithms, ranging from the first-order Godunov scheme (Godunov 1959), the second-order monotone upwind schemes for conservation laws (MUSCL; van Leer 1979), to the third-order piecewise parabolic method (PPM; Colella & Woodward 1984). In addition, for cosmological simulations, the particle-mesh (PM) scheme (e.g., Klypin & Shandarin 1983; Merz et al. 2005) is often adopted, in which the dark matter is treated as collisionless particles, and the mass density in each grid cell is estimated by the cloud-in-cell (CIC) technique. Although this uniform-mesh method is relatively easy to implement, it suffers from the enormous memory and computing time requirements when the simulation size increases. Consequently, with the PM method, one must compromise between the size of simulation domain and the spatial resolution.
    The Lagrangian particle-based approaches, in which both the collisionless dark matter and the collisional gaseous components are simulated using particles, are alternatives to the uniform-mesh method. The hydrodynamic evolution is generally solved by the smoothed particle hydrodynamics (SPH; Gingold & Monaghan 1977; Lucy 1977), and numerous algorithms have been adopted for solving the gravitational acceleration. The most straightforward scheme is the direct N-body method, in which all pairwise interactions are calculated. This method, while accurate, is extremely time consuming when the number of particles involved (denoted as N) is large, owing to its O(N²) scaling. Consequently, it is not suitable for simulations with a large number of particles.
    Several approximate algorithms have been developed to improve the computational performance of the gravitational force calculation for particle-based approaches. For example, the particle-particle/particle-mesh (P³M) method (e.g., Hockney & Eastwood 1981; Efstathiou et al. 1985) improves the spatial resolution by calculating the short-range force with direct summation and adding the long-range PM force. The adaptive P³M (AP³M) method (e.g., Couchman 1991) further improves the efficiency of force calculation by adding hierarchically refined submeshes in regions of interest and using the P³M method locally to replace the direct pair summation. By contrast, the hierarchical tree algorithm (e.g., Barnes & Hut 1986; Springel et al. 2001) reduces the computational workload by utilizing multi-pole expansion. The force exerted by distant particles is calculated using low-order multi-pole moments, and an O(N log N) scaling is achieved. The tree-PM hybrid scheme (e.g., Xu 1995; Bagla 2002; Dubinski et al. 2004; Springel 2005) serves as a further optimization to the tree algorithm. The gravitational potentials are divided into long-range and short-range terms, evaluated by the PM and tree methods, respectively. Since the tree method is applied only locally, the total workload is greatly reduced.
    The particle-based approaches offer high computational performance as well as large dynamical range, owing to the Lagrangian nature. However, the main disadvantage of these approaches is their relatively poor capability for simulating hydrodynamics using the SPH method. The resolution is relatively low in high-gradient regions, e.g., around the shock wave in the shock-tube test (Tasker et al. 2008). It also offers poor resolution in the low-density region where the number of particles is insufficient. In addition, the SPH method suffers from artificial viscosity and the inability to accurately capture the hydrodynamic instabilities in certain circumstances (Agertz et al. 2007).
    The adaptive-mesh-refinement (AMR) scheme provides a promising approach to combine the accurate shock-capturing property of the uniform-mesh method and the high-resolution, large-dynamical-range property of the Lagrangian particle-based method. The simulation domain is first covered by uniformly distributed meshes (at the "root" level) with a relatively low spatial resolution, and hierarchies of nested refined meshes (at the "refinement" levels) with decreasing grid sizes are then allocated in regions of interest to provide the desired resolution. The gravitational potential can be computed by the multi-grid methods so that a high force resolution can be achieved in the dense region, and the grid-based shock-capturing algorithms can be applied to grids at different refinement levels to preserve the accuracy of hydrodynamic evolution. Since the simulation domain is only locally and adaptively refined, both the memory consumption and the computation time are highly reduced compared to the uniform-mesh method with the same effective resolution.
    Detailed comparisons between the AMR and SPH methods have been addressed by several authors (e.g., Regan et al. 2007; Trac et al. 2007; Tasker et al. 2008; Mitchell et al. 2009). It is beyond the scope of this work. In cosmological simulations, the main drawback of the AMR method is, however, the requirement of sufficiently fine grids at the root level in order to provide adequate force resolution at the early epoch. Consequently, the memory consumption and the computation time can be larger than the Lagrangian particle-based method (O'Shea et al. 2005). Nevertheless, the superior capability of handling hydrodynamic properties and the better description for low-density regions make the AMR method a promising and competitive tool in astrophysical simulations, and it has been successfully adopted for large-scale cosmological simulations (e.g., Hallman et al. 2009; Teyssier et al. 2009). More recently, a moving-mesh scheme has been proposed by Springel (2010), aiming at integrating both the advantages of the AMR and particle-based methods.
    Several approaches have been developed for the AMR implementations. The most commonly adopted approach is the block-structured AMR, which was first proposed by Berger & Oliger (1984) and Berger & Colella (1989). It has been implemented by many astrophysical codes, e.g., Enzo (Bryan & Norman 1997), AMRA (Plewa & Müller 2001), and CHARM (Miniati & Colella 2007b). In this approach, the refined sub-domains (often referred to as the mesh "patches") are restricted to be geometrically rectangular, which reduces the complexity associated with the discontinuity of resolution across different refinement levels. The size of each patch is variable and adaptable, and patches at the same refinement level can be combined or bisected to fit the local flow geometry. However, the variable patch size also leads to difficulty in parallelizing the code efficiently and to sophisticated data management. In addition, the large patch size can increase the cache-miss rate and thus lower the computational performance.
    An alternative approach has been implemented in, for example, the Adaptive Refinement Tree (ART) code (Kravtsov et al. 1997), MLAPM (Knebe et al. 2001), and RAMSES (Teyssier 2002). In this approach, instead of using the rectangular patches as the basic refinement units, the refinement is performed on a cell-by-cell basis. Compared to the block-structured AMR, it features a more efficient refinement configuration related to the local flow geometry, especially in the regions with complex geometry of structures. However, the main drawback of this method lies in the more sophisticated data management, owing to its irregular shape of domain refinement. The interface profiles between cells with different zone spacings are complex, and spatial interpolations must be frequently used to provide the boundary conditions for each cell. Moreover, since the size of the stencils required by the hydrodynamic and Poisson solvers for each cell is generally much larger than a single cell, the computational overhead is large and can lead to serious performance deterioration.
    The FLASH code (Fryxell et al. 2000), which uses the PARAMESH AMR library (MacNeice et al. 2000), and the NIRVANA code (Ziegler 2005) have adopted a third approach for the AMR implementation. In this approach, the domain refinement is based on a hierarchy of mesh patches similar to the block-structured AMR, whereas each patch is restricted to have the same number of cells. The typically adopted patch sizes are 8³ in FLASH and 4³ in NIRVANA. Although this restriction will certainly impose some inflexibility on domain refinement and result in a relatively large refined volume compared to the two methods described above, it features several important advantages. First, the data structure and the interfaces between neighboring patches are considerably simplified, which can lead to significant improvements of performance and parallel efficiency. Second, since the additional buffer zones (often referred to as the "ghost zones") for finite-difference stencils are only needed for each patch instead of each cell, the computational overhead associated with the preparation of the ghost-zone data is greatly reduced compared to the cell-based refinement strategy. Finally, fixing the patch size allows for easier optimization of performance and also increases the cache-hit rate. All these features are essential for developing a high-performance code, especially for parallel computing such as using GPUs. Accordingly, in the GPU-accelerated Adaptive-MEsh-Refinement code (GAMER), we have adopted this approach as the refinement strategy.
    Novel use of the modern GPU for acceleration of numerical calculations has become a widely adopted technique in the past three years. The original purpose of the GPU is to serve as an accelerator for computer graphics. It is designed to work with the Single Instruction, Multiple Data (SIMD) architecture, and processes multiple vertex and fragment data in parallel. The modern GPU, e.g., the NVIDIA Tesla C1060, has 240 scalar processor cores working at a 1.3 GHz clock rate. It delivers a peak performance of 933 giga floating-point operations per second (GFLOPS), which is about an order of magnitude higher than the modern CPU. In addition, it has a 4 GB GDDR3 internal memory with a memory bandwidth of 102 GB s⁻¹. The 240 scalar processor cores are grouped into 30 multiprocessors, each of which consists of 8 scalar processor cores and shares a 16 KB on-chip data cache. The NVIDIA Tesla S1070 computing system further combines four Tesla C1060 GPUs and offers nearly 4 TFLOPS of computing power. Given the natural capability of parallel computing and the enormous computational power of the GPU, using GPUs for general-purpose computations (GPGPU) has become an active area of research.
    The traditional scheme in GPGPU works by using the high-level shading languages, which are designed for graphic rendering and require familiarity with computer graphics. It is therefore difficult and unsuitable for general-purpose computations. In 2006, the NVIDIA Corporation released a new computing architecture in GPU, the Compute Unified Device Architecture (CUDA; NVIDIA 2008). It was designed for a general-purpose usage and greatly lowers the threshold of using GPU for non-graphic computations. In CUDA, GPU is regarded as a multi-threaded coprocessor to CPU with a standard C language interface. To define the computational task for GPU, programmers should provide a C-like function called "kernel," which can be executed by multiple "CUDA threads" in parallel. A unique thread ID is given to each thread in order to distinguish between different threads.
    As an illustration, we consider the sum of two vectors, each of which has M elements. In CUDA, instead of writing a loop to perform M summation operations sequentially, we define a single summation operation in a kernel and use M threads. These M threads will execute the same kernel in parallel but perform the single summation operation on different vector elements. In this example, the thread ID may be used to define the targeted vector element for each thread.
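    A minimal CUDA sketch of this vector-sum example is given below; the kernel and function names are illustrative rather than taken from GAMER. Each of the M threads computes its global index from the built-in block and thread indices and handles exactly one element.

```cuda
#include <cuda_runtime.h>

// Each thread adds one pair of elements; the global thread ID selects the element.
__global__ void VecAdd( const float *A, const float *B, float *C, int M )
{
   int i = blockIdx.x*blockDim.x + threadIdx.x;   // unique thread ID
   if ( i < M )  C[i] = A[i] + B[i];
}

// Host-side launch: M threads in total, grouped into blocks of 256 threads.
void SumVectors( const float *d_A, const float *d_B, float *d_C, int M )
{
   int threadsPerBlock = 256;
   int numBlocks       = ( M + threadsPerBlock - 1 ) / threadsPerBlock;
   VecAdd<<< numBlocks, threadsPerBlock >>>( d_A, d_B, d_C, M );
}
```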
    Note that this scenario is analogous to parallel computing in a Beowulf cluster using the message passing interface (MPI), in which a single program is simultaneously executed by multiple processors, and each process is given a unique ID ("MPI rank"). However, performance optimization on the GPU is not straightforward and requires elaborate numerical algorithms dedicated to the GPU specifications. In particular, note that CUDA threads are further grouped into multiple "thread blocks." Threads belonging to the same thread block are executed by one multiprocessor and can share data through an on-chip data cache (referred to as the "shared memory"). This data cache is very small (typically only 16 KB per multiprocessor) but has much lower memory latency than the off-chip DRAM (referred to as the "global memory"). Accordingly, the numerical algorithms must be carefully designed to store common and frequently used data in this fast shared memory so that the memory bandwidth bottleneck may be removed.
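    As an illustration of this strategy (a hypothetical sketch, not GAMER code), the kernel below lets each thread block stage a tile of the input in shared memory before computing a simple three-point average, so that repeated accesses hit the fast on-chip cache instead of the global memory. For brevity, the sketch assumes the array length N is a multiple of the block size (256 threads).

```cuda
// Hypothetical sketch: each block loads a tile into shared memory, then each
// thread computes a 1D three-point average using the fast on-chip cache.
// Launch as, e.g.,  SmoothTile<<< N/256, 256 >>>( d_in, d_out, N );
__global__ void SmoothTile( const float *in, float *out, int N )
{
   __shared__ float tile[256 + 2];                    // tile plus two ghost cells

   int gid = blockIdx.x*blockDim.x + threadIdx.x;     // global element index
   int lid = threadIdx.x + 1;                         // local index inside the tile

   tile[lid] = in[gid];                               // cooperative load into shared memory
   if ( threadIdx.x == 0 )
      tile[0] = ( gid == 0   ) ? in[gid] : in[gid-1]; // left ghost cell
   if ( threadIdx.x == blockDim.x-1 )
      tile[blockDim.x+1] = ( gid == N-1 ) ? in[gid] : in[gid+1];   // right ghost cell
   __syncthreads();                                   // wait until the whole tile is loaded

   out[gid] = ( tile[lid-1] + tile[lid] + tile[lid+1] ) / 3.0f;
}
```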
    Nowadays, the most successful approach to utilize the GPU computing power in astrophysical simulations is the direct N-body calculation (e.g., Belleman et al. 2008; Schive et al. 2008; Gaburov et al. 2009). Schive et al. (2008) have built a multi-GPU computing cluster named GraCCA (Graphic-Card Cluster for Astrophysics), and have demonstrated its capability for direct N-body simulations in terms of both high computational performance and high parallel efficiency. The direct calculation of all N² pairwise interactions is extremely computationally intensive, and thus it is relatively straightforward to obtain high performance on the GPU. However, the direct N-body simulations can address only a limited range of problems. Aubert et al. (2009) have proposed a GPU-accelerated PM integrator using a single GPU. It remains considerably challenging and unclear whether the performance of other kinds of astrophysical simulations with complex data structures and relatively low arithmetic intensity, such as the AMR simulations, can be substantially improved by using GPUs, especially in a multi-GPU system.
    In this paper, we present the first GPU-accelerated, AMR, astrophysics-dedicated, and parallelized code, named GAMER. We give a detailed description of the numerical algorithms adopted in this code, especially focusing on the GPU implementations. The accuracy of the code is demonstrated by performing various test problems. Detailed timing analyses of individual GPU solvers as well as the complete program are conducted with different hardware implementations. In each timing test, we further compare the performances of runs with and without GPU(s) acceleration.
    The paper is organized as follows. In Section 2, we describe the numerical schemes adopted in GAMER, including the AMR method, both the hydrodynamic scheme and the Poisson solver, and the parallelization strategy. We then focus on the GPU implementations of different parts in the code, along with individual performance measurements, in Section 3. In Section 4, we present the simulation results of several test problems to demonstrate the accuracy. Detailed timing analyses of the complete program in purely baryonic cosmological simulations are presented in Section 5. Finally, we summarize the work and discuss the future outlook in Section 6.
  2. NUMERICAL SCHEME
    In this section, we describe in detail the numerical schemes adopted in GAMER, including the AMR implementation, the algorithms of both hydrodynamics and self-gravity, and the parallelization strategy. To provide a more comprehensible description, here we focus on the generic algorithms that are unrelated to the hardware implementation. Important features related to the GPU implementation will be emphasized and a more detailed description will be given in Section 3.
    2.1. Adaptive Mesh Refinement
    The AMR scheme implemented in GAMER is similar to that adopted by FLASH, in which the computational domain is covered by a hierarchy of grid patches with similar shape but different spatial resolutions. In GAMER, a grid patch is defined to have a fixed number of grid cells in each spatial direction. The computational domain is first covered by root patches with the lowest spatial resolution. Then, according to the user-defined refinement criteria, each root patch may be refined into eight child patches with a spatial resolution twice that of their parent patch. The same refinement operation may further be applied to all patches at different refinement levels, in which a patch at level ℓ has a spatial resolution 2^ℓ times higher than that of a root patch at level zero. Accordingly, a hierarchy of grid patches with an oct-tree data structure is dynamically and adaptively constructed during the simulation. In Figure 1, we show a two-dimensional example of the refinement map.

Figure 1. Two-dimensional example of the refinement map. Each patch is composed of 8 × 8 cells. The borders of patches are highlighted by bold lines. One patch may be refined into four child patches with a spatial resolution twice that of their parent patch. A jump of more than one refinement level between nearby patches is forbidden (even in the diagonal direction).
Patches are the basic units in GAMER. Owing to the oct-tree data structure, eight patches are always allocated or deallocated simultaneously. The data stored in each patch include its own physical variables, the absolute coordinates in the computational domain, the indices of the parent, child, and 26 sibling patches, the pointers of flux arrays corresponding to six patch boundary surfaces, and the flag recording its refinement status.
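A patch record consistent with this description might look roughly as follows; the member names, types, and layout choices are illustrative, and the actual GAMER data structure may differ.

```cuda
// Simplified sketch of a patch record (GAMER uses 8 cells per dimension; see Section 2.1).
// Field names are illustrative; the actual GAMER implementation may differ.
#define PATCH_SIZE 8
#define NCOMP      5        // rho, rho*vx, rho*vy, rho*vz, e

struct Patch
{
   float  fluid [NCOMP][PATCH_SIZE][PATCH_SIZE][PATCH_SIZE];  // cell-averaged physical variables
   int    corner[3];        // absolute coordinates of the patch corner in the computational domain
   int    father;           // index of the parent patch (-1 if none)
   int    son;              // index of the first child patch (eight children assumed consecutive; -1 if none)
   int    sibling[26];      // indices of the 26 sibling patches (-1 if absent)
   float *flux[6];          // flux arrays on the six patch boundary surfaces (NULL if not needed)
   bool   flag;             // refinement status of this patch
};
```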
Restricting all patches to be geometrically similar to each other greatly simplifies the AMR framework, with respect to both the structure of the program and the GPU implementation. A single GPU kernel can be applied to all patches, even at different refinement levels. Moreover, since the computational workload of each patch is the same, there will be no synchronization overhead when multiple patches are evolved in parallel by the GPU. However, it does impose certain inflexibility on spatial refinement. The region being refined will be larger than necessary, especially when the volume of a single patch is too large. On the other hand, having a small-volume patch will introduce higher computational overhead associated with the preparation of the ghost-zone data. In GAMER, the optimized size of a single patch is set to 8³. It will be demonstrated in Sections 3 and 5 that by exploiting the feature of parallel execution between CPU and GPU, the ghost-zone filling time can be overlapped with the execution time of the GPU solvers, yielding a considerable performance enhancement.
GAMER can be used as either a purely hydrodynamic or a coupled self-gravity and hydrodynamic code. When only the hydrodynamic module is activated, the code supports both the uniform and individual time step algorithms. The time step at level ℓ may be either equal to or half of that at level ℓ − 1. However, when the gravity module is also activated, the code currently supports only the uniform time step algorithm. The same time step is applied to all levels, and the evolution of patches at level ℓ proceeds in the following steps (a schematic code sketch is given after the list):

1. Update physical quantities for all patches at level ℓ.

2. Begin the evolution of the next refinement level if there are patches at level ℓ + 1.

3. Correct the physical quantities at level ℓ by using the updated results at level ℓ + 1.

4. Rebuild the refinement map at level ℓ.
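The following is a schematic sketch (not the actual GAMER source) of this evolution cycle for the uniform time step algorithm; the helper routines are placeholders standing for the solvers and AMR operations described in the text.

```cuda
// Schematic sketch of the level-by-level evolution with a uniform time step.
// The helper routines below are placeholders for the corresponding solvers
// and AMR operations described in the text.
#include <vector>

static std::vector<int> NPatch( 16, 0 );            // number of patches on each level

static void AdvanceFluid( int lv, double dt ) {}    // step 1: hydro (+ gravity) update
static void Restrict    ( int lv )            {}    // average fine-grid data onto the coarse grid
static void FixUpFlux   ( int lv )            {}    // flux correction at coarse-fine boundaries
static void Refine      ( int lv )            {}    // rebuild the refinement map

void EvolveLevel( int lv, double dt )
{
   AdvanceFluid( lv, dt );                          // step 1: update all patches at level lv

   const bool has_finer = ( lv+1 < (int)NPatch.size()  &&  NPatch[lv+1] > 0 );

   if ( has_finer )
      EvolveLevel( lv+1, dt );                      // step 2: evolve the next finer level (same dt)

   if ( has_finer )
   {
      Restrict ( lv );                              // step 3: overwrite overlapped coarse cells
      FixUpFlux( lv );                              //         and apply the flux correction
   }

   Refine( lv );                                    // step 4: re-flag and refine level lv
}
```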
Since the fine-grid values are presumably more accurate than the coarse-grid values, there are two cases where the data of a coarse patch require further correction in step 3. First, if a coarse patch is overlaid by its child patches, its values are simply replaced by the spatial average of the fine-grid values. Second, if the border of a coarse patch lies on the refinement boundary, the flux correction operation (Berger & Colella 1989) is applied. In this case, a corresponding flux array is allocated. This array stores the difference between the coarse-grid flux and the fine-grid flux across the coarse–fine boundary, and is used to correct the coarse-grid values adjacent to this boundary. This flux correction operation ensures that the flux out of the coarse-grid patch is equal to the flux into the fine-grid patch, and therefore the conservation of hydrodynamic variables is preserved (assuming no self-gravity).
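In one dimension, the correction amounts to adding the mismatch between the coarse-grid and fine-grid fluxes, scaled by Δt/Δx, to the coarse cell adjacent to the coarse–fine boundary. A minimal sketch of this idea, with illustrative names and assuming the fine region lies to the right of the corrected coarse cell:

```cuda
// Illustrative 1D sketch of the flux correction (Berger & Colella 1989).
// The fine-grid fluxes across the shared coarse-fine face are accumulated during
// the fine-level update; afterward the adjacent coarse cell is corrected so that
// the coarse and fine fluxes across that face agree.
struct FluxFix
{
   double coarse_flux;      // coarse-grid flux used in the coarse update
   double fine_flux_sum;    // (averaged) sum of fine-grid fluxes across the same face
};

// Correct the coarse cell lying just to the left of the coarse-fine boundary.
// dt and dx are the coarse-level time step and cell width.
void CorrectCoarseCell( double &u_coarse, const FluxFix &f, double dt, double dx )
{
   double dflux = f.fine_flux_sum - f.coarse_flux;   // flux mismatch on the shared face
   u_coarse    -= dt/dx * dflux;                     // restore conservation
}
```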
Rebuilding the refinement map in step 4 takes two sub-steps: a flag step, followed by a refinement step. A patch is flagged for refinement if any cell inside the patch satisfies the refinement criteria. In GAMER, both the hydrodynamic variables and their gradients can be taken as the refinement criteria. There is, however, a situation requiring special treatment during the flag step. Since the refinement map is always rebuilt from finer levels to coarser levels, a patch at level ℓ may not be flagged even if its child patches at level ℓ + 1 have already been flagged. In this case, the patch at level ℓ is also flagged to ensure that the fine-grid data are preserved. Finally, a proper-nesting constraint is applied to all patches. It prohibits the spatial refinement from jumping more than one level across two adjacent patches. Patches failing to satisfy this constraint are unflagged.
In the refinement step, eight child patches at level ℓ + 1 are constructed for each flagged patch at level ℓ. The hydrodynamic data of child patches are either directly inherited from existing data or filled via conservation-preserving interpolation from their parent patches. The Min-Mod limiter is used to ensure the monotonicity of interpolation. The indices of parent, child, and sibling patches are stored, and null values are assigned to them if the corresponding child or sibling patches do not exist. Finally, the flux arrays are properly allocated for patches adjacent to the coarse–fine boundaries.
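A one-dimensional sketch of this conservation-preserving, monotonic interpolation with the Min-Mod limiter is shown below (illustrative names; the analogous operation is applied dimension by dimension in three dimensions): the two child values average back exactly to the parent value, and the limited slope prevents new extrema.

```cuda
#include <cmath>

// Min-Mod limiter: zero at local extrema, otherwise the smaller-magnitude slope.
double MinMod( double a, double b )
{
   if ( a*b <= 0.0 )  return 0.0;                       // opposite signs: no slope
   return ( std::fabs(a) < std::fabs(b) ) ? a : b;
}

// Split a coarse cell with value uc (neighbors ul, ur) into two child cells.
// Conservation: 0.5*(child_left + child_right) == uc.
void InterpolateChildren( double ul, double uc, double ur,
                          double &child_left, double &child_right )
{
   double slope = MinMod( uc - ul, ur - uc );           // limited coarse-cell slope
   child_left   = uc - 0.25*slope;                      // child centered at -dx/4
   child_right  = uc + 0.25*slope;                      // child centered at +dx/4
}
```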
The frequency of rebuilding the refinement map is also a free parameter provided by users. The guideline is that the refinement map must be rebuilt before the regions of interest propagate away from fine-grid patches into coarse-grid patches. Although in the extreme case we may rebuild the refinement map in every step, it is too expensive in time and not practical in general situations. Therefore, in order to reduce the frequency of performing the refinement operation, we follow the scheme suggested by Berger & Colella (1989). A free parameter Nb is provided to define the size of the flag buffer. If a cell exceeds the refinement threshold during the flag check, the (1 + 2Nb)³ − 1 cells surrounding this cell are regarded as the flag buffers. If any of these flag buffers extends across the patch border, the corresponding sibling patch is also flagged for refinement (as long as it satisfies the proper-nesting condition). Figure 2 shows an example of the refinement result with Nb = 3. The extreme case is to have Nb equal to the size of a single patch, in which case all 26 sibling patches will always be flagged if the central patch is flagged. Generally speaking, the larger Nb is, the longer the period that may be adopted between two refinement steps.
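A minimal sketch of the flag-buffer check for a single flagged cell is given below; the patch size of 8 follows GAMER, while the helper function and its arguments are hypothetical.

```cuda
// Illustrative sketch of the flag-buffer logic for one flagged cell at (i, j, k)
// inside a patch of PS = 8 cells per side.  FlagSibling() is a hypothetical helper
// that flags the sibling patch lying in the direction (si, sj, sk), provided the
// proper-nesting constraint is satisfied.
const int PS = 8;

void FlagSibling( int si, int sj, int sk ) { /* flag the sibling patch in this direction; placeholder */ }

void ApplyFlagBuffer( int i, int j, int k, int Nb )
{
   for (int di=-Nb; di<=Nb; di++)
   for (int dj=-Nb; dj<=Nb; dj++)
   for (int dk=-Nb; dk<=Nb; dk++)
   {
      // direction of the sibling patch touched by this buffer cell (-1, 0, or +1 per axis)
      int si = ( i+di < 0 ) ? -1 : ( i+di >= PS ) ? +1 : 0;
      int sj = ( j+dj < 0 ) ? -1 : ( j+dj >= PS ) ? +1 : 0;
      int sk = ( k+dk < 0 ) ? -1 : ( k+dk >= PS ) ? +1 : 0;

      if ( si != 0 || sj != 0 || sk != 0 )
         FlagSibling( si, sj, sk );        // the buffer extends across the patch border
   }
}
```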

Figure 2. Two-dimensional example of the flag operation. The size of the flag buffer (Nb) is set to 3. The borders of the patches are highlighted by bold lines. The filled circles represent the cells satisfying the refinement criteria, and the open circles represent their corresponding flag buffers. The numbers at the center of each patch stand for the patch indices. In this example, although there are no cells fulfilling the refinement criteria in patches 1–3, they are still flagged due to the extension of the flag buffers.
The procedure of patch construction during the initialization is different from that during the simulation. As illustrated by the evolution procedure described previously, the patch construction during the simulation is always performed from finer levels to coarser levels. This ensures that the fine-grid data take precedence in determining the patch refinement. By contrast, the spatial resolution of the initial condition is solely provided by users. Accordingly, in GAMER, three kinds of initialization methods are supported.
First, a user-defined initialization function can be applied to set the initial value of each physical quantity. The patch construction starts from level 0 up to the maximum level. If any patch at level ℓ satisfies the refinement criteria, eight child patches at level ℓ + 1 are allocated and initialized by the same initialization function. After patch construction, a restriction operation is performed, starting from the maximum level down to the root level, in order to ensure that the physical quantities of a cell at level ℓ are always equal to the spatial average of its eight child cells at level ℓ + 1.
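The restriction operation itself is a plain arithmetic average over the 2 × 2 × 2 child cells that overlap each coarse cell. A sketch with illustrative array layouts (one child patch of 8³ cells restricted onto the corresponding 4³ region of its parent):

```cuda
// Illustrative restriction: each coarse cell receives the average of its
// 2 x 2 x 2 child cells.  "fine" holds one child patch (8^3 cells) and "coarse"
// the corresponding octant of the parent patch (4^3 cells); names are illustrative.
const int PS = 8;

void RestrictPatch( double coarse[PS/2][PS/2][PS/2], const double fine[PS][PS][PS] )
{
   for (int i=0; i<PS/2; i++)
   for (int j=0; j<PS/2; j++)
   for (int k=0; k<PS/2; k++)
   {
      double sum = 0.0;
      for (int di=0; di<2; di++)
      for (int dj=0; dj<2; dj++)
      for (int dk=0; dk<2; dk++)
         sum += fine[2*i+di][2*j+dj][2*k+dk];

      coarse[i][j][k] = sum / 8.0;     // spatial average of the eight child cells
   }
}
```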
Second, the code can load an array storing the uniform-mesh data as the initial condition. Assuming that the input data have spatial resolution equal to the refinement level ℓ, patches at level ℓ are first constructed. After that, a restriction operation is performed from level ℓ down to level 0 to construct patches of levels <ℓ. Any patch failing to satisfy the refinement criteria at the current level will be removed. Since we assume that the input uniform-mesh data possess the highest resolution during the initialization, no patch at level >ℓ is allocated at this stage. However, patches of higher resolution can still be constructed during the run, and hence the highest resolution is not limited by the initial input data.
The third initialization procedure loads any of the previous data dumps as the restart file. It is essential when the program is terminated unexpectedly, and it also provides an efficient way for tuning parameters and analyzing simulation results.
2.2. Hydrodynamics
In GAMER, the Euler equations are solved in conservative forms:

$$\frac{\partial \rho}{\partial t} + \nabla\cdot(\rho\mathbf{v}) = 0, \qquad (1)$$

$$\frac{\partial (\rho\mathbf{v})}{\partial t} + \nabla\cdot(\rho\mathbf{v}\mathbf{v}) + \nabla P = -\rho\nabla\phi, \qquad (2)$$

$$\frac{\partial e}{\partial t} + \nabla\cdot\left[(e+P)\,\mathbf{v}\right] = -\rho\mathbf{v}\cdot\nabla\phi, \qquad (3)$$
where ρ is the mass density, v is the flow velocity, P is the thermal pressure, e is the total energy density, and φ is the gravitational potential. The relation between pressure P and total energy density e is given by

$$P = (\gamma - 1)\,e_{\rm int}, \qquad (4)$$

$$e_{\rm int} = e - \frac{\rho\,|\mathbf{v}|^{2}}{2}, \qquad (5)$$

where $e_{\rm int}$ is the internal thermal energy density and γ is the ratio of specific heats. The self-gravity is included in the Euler equations as a source term, and will be addressed in more detail in the following subsection.
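For reference, a small helper evaluating the pressure from the conserved variables according to Equations (4) and (5) could look like the following (illustrative only):

```cuda
// Evaluate the pressure from the conserved variables (Equations 4-5).
// rho   : mass density
// mom[3]: momentum density (rho * v)
// e     : total energy density
// Gamma : ratio of specific heats
double Pressure( double rho, const double mom[3], double e, double Gamma )
{
   double ekin = 0.5*( mom[0]*mom[0] + mom[1]*mom[1] + mom[2]*mom[2] ) / rho;
   double eint = e - ekin;              // internal thermal energy density
   return ( Gamma - 1.0 )*eint;
}
```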
The hydrodynamic scheme adopted in GAMER is based on the algorithm proposed by Trac & Pen (2003). It is a second-order accurate relaxing total variation diminishing (TVD) scheme (Jin & Xin 1995), which has been implemented and well tested in both the hydrodynamic simulation (Trac & Pen 2003) and the magnetohydrodynamic simulation (Pen et al. 2003). In the following, we first review the one-dimensional relaxing TVD scheme, and then describe its generalization to the three-dimensional case.
Consider the one-dimensional Euler equation in the vector form:

$$\frac{\partial u}{\partial t} + \frac{\partial F(u)}{\partial x} = 0, \qquad (6)$$

where u = (ρ, ρv, e) is the flow-variable vector and F(u) = (ρv, ρv² + P, ev + Pv) is the corresponding flux vector. First, a free positive function c(x, t), which is referred to as the freezing speed, is evaluated, and an auxiliary vector is defined by w ≡ F(u)/c. To guarantee the TVD condition, the freezing speed c must be greater than the speed of information propagation. For the one-dimensional Euler equation, this requirement is satisfied by having c(x, t) = |v(x, t)| + c_s(x, t), where c_s is the sound speed.
The flux term in Equation (6) is then decomposed into two terms,

$$\frac{\partial u}{\partial t} + \frac{\partial F^{R}(u)}{\partial x} - \frac{\partial F^{L}(u)}{\partial x} = 0, \qquad (7)$$

where

$$F^{R}(u) \equiv \frac{c\,(u + w)}{2}, \qquad F^{L}(u) \equiv \frac{c\,(u - w)}{2} \qquad (8)$$

are referred to as the right-moving and left-moving fluxes with advection speed c, respectively. Since these two fluxes have well-defined directions, the MUSCL scheme can be applied straightforwardly. Let $u^{t}_{n}$ denote the cell-centered value of cell n at time t, and $F^{t}_{n}$ denote the corresponding cell-centered flux. To integrate Equation (7) in a conservative form, the fluxes $F^{R,t}_{n\pm 1/2}$ and $F^{L,t}_{n\pm 1/2}$ defined at the boundaries of cell n must be evaluated. In the following, we describe the algorithm to evaluate $F^{R,t}_{n+1/2}$ as an illustration; $F^{R,t}_{n-1/2}$ and $F^{L,t}_{n\pm 1/2}$ can be derived in a similar way.
In the first step, the upwind scheme is used to assign a value to the boundary flux as a first-order approximation. Since $F^{R,t}_{n+1/2}$ has a positive advection velocity, we can simply set $F^{R,t}_{n+1/2} = F^{t}_{n}$. The second-order correction $\Delta F^{{\rm TVD},t}_{n+1/2}$ satisfying the TVD condition is obtained by applying a flux limiter to the two second-order flux corrections

$$\Delta F^{t}_{n+1/2,{\rm L}} = \frac{F^{t}_{n} - F^{t}_{n-1}}{2} \qquad (9)$$

and

$$\Delta F^{t}_{n+1/2,{\rm R}} = \frac{F^{t}_{n+1} - F^{t}_{n}}{2}. \qquad (10)$$
The flux limiter adopted in the current implementation is the van Leer limiter (van Leer 1974), which takes the harmonic average of the two second-order flux corrections:

$$\Delta F^{{\rm TVD},t}_{n+1/2} = \begin{cases} \dfrac{2\,\Delta F^{t}_{n+1/2,{\rm L}}\,\Delta F^{t}_{n+1/2,{\rm R}}}{\Delta F^{t}_{n+1/2,{\rm L}} + \Delta F^{t}_{n+1/2,{\rm R}}} & \text{if } \Delta F^{t}_{n+1/2,{\rm L}}\,\Delta F^{t}_{n+1/2,{\rm R}} > 0, \\[1ex] 0 & \text{otherwise}. \end{cases} \qquad (11)$$
Note that, as indicated by Equation (11), no second-order correction is applied to $F^{R,t}_{n+1/2}$ if $F^{t}_{n}$ assumes a local extreme value, and hence the hydrodynamic scheme is locally reduced to first-order accuracy.
Finally, the second-order accurate right-moving flux is given by

$$F^{R,t}_{n+1/2} = F^{t}_{n} + \Delta F^{{\rm TVD},t}_{n+1/2}. \qquad (12)$$
$F^{R,t}_{n-1/2}$ and $F^{L,t}_{n\pm 1/2}$ can be evaluated in a way similar to Equation (12).
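A compact sketch of this boundary-flux evaluation (Equations (9)-(12)) is given below; it operates on the cell-centered fluxes of three neighboring cells and is purely illustrative.

```cuda
// Illustrative evaluation of the second-order right-moving flux at the interface
// n+1/2 (Equations 9-12), given the cell-centered fluxes of cells n-1, n, and n+1.
// The van Leer limiter takes the harmonic mean of the two second-order corrections
// and vanishes at local extrema.
double RightFlux( double F_nm1, double F_n, double F_np1 )
{
   double dF_L = 0.5*( F_n   - F_nm1 );            // Equation (9)
   double dF_R = 0.5*( F_np1 - F_n   );            // Equation (10)

   double dF_TVD = 0.0;                            // Equation (11): van Leer limiter
   if ( dF_L*dF_R > 0.0 )
      dF_TVD = 2.0*dF_L*dF_R / ( dF_L + dF_R );

   return F_n + dF_TVD;                            // Equation (12): upwind flux + correction
}
```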
To achieve second-order accuracy in time as well, the second-order Runge–Kutta method (also known as the midpoint method) is adopted for the time integration. First, the temporal midpoint value $u^{t+\Delta t/2}_{n}$ is evaluated by

$$u^{t+\Delta t/2}_{n} = u^{t}_{n} - \frac{\Delta t}{2\,\Delta x}\left(F^{t}_{n+1/2} - F^{t}_{n-1/2}\right), \qquad (13)$$

where $F^{t}_{n\pm 1/2} = F^{R,t}_{n\pm 1/2} - F^{L,t}_{n\pm 1/2}$ is computed by the first-order upwind scheme. The midpoint fluxes $F^{t+\Delta t/2}_{n\pm 1/2}$ are then computed by applying the second-order TVD scheme to $u^{t+\Delta t/2}$. Eventually, the full-step value $u^{t+\Delta t}_{n}$ is given by

$$u^{t+\Delta t}_{n} = u^{t}_{n} - \frac{\Delta t}{\Delta x}\left(F^{t+\Delta t/2}_{n+1/2} - F^{t+\Delta t/2}_{n-1/2}\right). \qquad (14)$$
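A schematic one-dimensional driver for this two-step (midpoint) integration might look as follows; the interface-flux routine is only a placeholder for the relaxing TVD fluxes of Equations (7)-(12), and all names are illustrative.

```cuda
#include <vector>

// Placeholder for the relaxing-TVD interface fluxes (Equations 7-12): a real
// implementation would return F_{n+1/2} = F^R_{n+1/2} - F^L_{n+1/2} for every
// interface, either first-order (upwind) or second-order (TVD-limited).
std::vector<double> InterfaceFlux( const std::vector<double> &u, bool second_order )
{
   return std::vector<double>( u.size()+1, 0.0 );
}

// One full step of the second-order Runge-Kutta (midpoint) integration,
// Equations (13)-(14).  u holds one conserved variable per cell; dt and dx are
// the time step and cell width.  Boundary cells are ignored for brevity.
void AdvanceRK2( std::vector<double> &u, double dt, double dx )
{
   const int N = (int)u.size();

   // Equation (13): half step with first-order upwind fluxes
   std::vector<double> F_half = InterfaceFlux( u, false );
   std::vector<double> u_mid  = u;
   for (int n=1; n<N-1; n++)
      u_mid[n] = u[n] - 0.5*dt/dx*( F_half[n+1] - F_half[n] );

   // Equation (14): full step with second-order TVD fluxes evaluated at the midpoint
   std::vector<double> F_mid = InterfaceFlux( u_mid, true );
   for (int n=1; n<N-1; n++)
      u[n] = u[n] - dt/dx*( F_mid[n+1] - F_mid[n] );
}
```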