O(N) N-body algorithms


Data-Parallel O(N) hierarchical N-body methods

The numerical study of many-body interactions was significantly advanced with the development of hierarchical methods such as the Barnes-Hut O(N\log N) method, multipole-type O(N) methods by Rokhlin and Greengard, Anderson, Zhao and others and the multigrid methods by Brandt. Large-scale N-body simulations have applications in areas such as celestial mechanics, plasma physics, materials science and molecular design. Parallel computing (whether on a network of workstations or massively parallel processors) is the only means in providing the primary storage and computing power for simulations of systems with up to hundreds of millions of particles using these fast methods. Parallel scalable implementations of these hierarchical methods have been successful for the Barnes-Hut method and nonadaptive O(N) methods using explicit message-passing programming.

In [1, 2, 3], we have showed that the nonadaptive method by Anderson can be efficiently implemented in a high-level language on scalable parallel architectures, such as CM Fortran. We simulated particle systems with up to 100 million particles on CM-5/5E systems, and achieved performance comparable to the best massage-passing implementations of hierarchical methods. The code was listed as an ``impressive entry'' in the 1994 Gordon Bell Competition, and was the only entry ever entirely written in a high-level language but competing for the raw performance category.

One difficulty with parallel implementations of the adaptive O(N) methods are their complex (highly) nonuniform computational structure, a difficulty that is further aggravated if a low-level programming paradigm, such as message-passing, is used. But, the highly nonuniform computational structure also represents a serious challenge for data-parallel programming with, in its pure form, a single-threaded execution model. Efficiency in resource utilization dictates that an adaptive method is used for simulations of particle systems with (highly) nonuniform particle distributions. For very large particle systems, the O(N) methods are advantageous in arithmetic complexity compared to the Barnes-Hut O(N\log N) method.

We have recently successfully formulated the adaptive Anderson's method entirely in HPF [4]. The key observation is that the computations are dominated by operations on four kinds of interaction lists associated with either every leaf-level subdomain or every subdomain in the hierarchy of subdomains. For each such subdomain, the number of interactive subdomains of each kind of list, and therefore the amount of computation and communication differs from that of the others. But, this kind of uniformity is relatively small in the sense that the unit computation is the same and only the number of units per subdomain differs from each other. This level of nonuniformity is suitably expressed using independent do-loop or the extrinsic procedures supported in HPF which provides a mechanism for multiple threads of control. The fetching of the interactive subdomains is performed entirely in HPF as array indirect addressing and therefore avoid the lower-level message-passing programming, and extrinsic procedures will only be invoked for the nonuniform computations per subdomain (and consequently per processor.)

Many challenging problems remain in order to complete our data-parallel formulation of the adaptive methods:

Partitioning and Load-balancing Algorithms

One crucial issue to achieve good performance in programming parallel machines is to partition a problem into subproblems of close to even workloads to be distributed among the processors with the partitioning be performed in such a way that the necessary data movement between the subproblems is small (minimal). For the adaptive hierarchical N-body methods, the load-balancing problem is not well understood. Heuristics such as Orthogonal Recursive Bisection, those based on Morton or Peano-Hilbert orderings have been used in some of the previous implementations to balance the computation. However, they lack analytical analysis on the quality of the partitions and furthermore they make no guarantee on minimizing the communication among the partitions.

The geometric partitioning algorithm (GEO) due to Miller, Teng, Thurston and Vavasis yields provably good partitionings. We have developed a data-parallel formulation of GEO in HPF [7]. We are also working on new partitioning algorithms.

The Recursive Spectral Bisection (RSB) method is another heuristic graph partitioning method that has had significant success in partitioning unstructured meshes. The quality of the partitions offered by RSB has recently been proven. A previous data-parallel implementation did not incorporate edge weights. We are interested in developing a data-parallel implementation of the RSB method that accounts for node and edge weights and to perform a comprehensive study of the various partitioning schemes as integrated in our data-parallel adaptive N-body solver.

Numerical issues in using hierarchical N-Body methods

The ultimate goal of developing efficient parallel implementations of fast N-body algorithms is to put them into practice. It is therefore important to understand the practical aspects of these fast algorithms including the accuracy, the accuracy-cost tradeoffs, and how to handle different boundary conditions of the underlying problem.

The hierarchical methods are approximation methods; increased accuracy can be obtained at increased computational expense. However, good estimates that can be used to determine a-priori the number of terms required in the expansions (and the integration order for Anderson's method) for a given accuracy for a given particle distribution are still missing for the O(N) methods, with the exception of recent analysis Salmon and Warren for their modification of the Barnes-Hut method. This lack of accurate error analysis is one of the factors limiting the acceptance of these methods.

We have studied empirically the accuracy-cost tradeoffs of Anderson's method [5, 6]. The various parameters that control the degree of approximation of the computational elements and the separateness of interacting computational elements govern both the arithmetic complexity and the accuracy of the method. Our experiment shows that for a given error requirement, using a near-field containing only nearest neighbor boxes and a hierarchy depth that minimizes the number of arithmetic operations, minimizes the total number of arithmetic operations.

With respect to the arithmetic complexity, Anderson's method employs a set of integration formulas with integration orders up to fourteen that are provably efficient in their use of the integration points. For very high accuracy, efficient formulas remain to be worked out. Hrycak and Rokhlin recently developed new basic functions much more efficient than multipole expansions. Different O(N) methods differ only in the computational elements used, and both the accuracy and the arithmetic complexity depend on the basis functions used in the computational elements. The accuracy-cost tradeoffs among the various methods need to be studied in order to determine the most efficient method.

Another critical issue to the acceptance of the hierarchical methods for N-body simulations in applications such as molecular dynamics is their ability to handle other than free-space boundary conditions. Greengard and Rokhlin have shown how to include periodic boundary conditions in the Greengard-Rokhlin method. Extending Anderson's method to handle periodic and other more complicated boundary conditions is highly important for its acceptance in practice.



  1. Yu Hu and S. Lennart Johnsson. Implementing O(N) N-body Algorithms Efficiently in Data-Parallel Languages . Journal of Scientific Programming, 5(4): 337-364, 1996.

  2. Yu Hu and S. Lennart Johnsson. A Data-Parallel Implementation of Hierarchical N-body Methods. International Journal of Supercomputer Applications and High Performance Computing, 10(1): 3-40, 1996.

  3. Yu Hu and S. Lennart Johnsson. A Data-Parallel Implementation of O(N) Hierarchical N-body Methods. Supercomputing '96, November, 1996.

  4. Y. Charlie Hu, S. Lennart Johnsson and Shang-Hua Teng. A Data-Parallel Adaptive N-body Method. Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing - Parallel N-body solver minisymposium , Minneapolis, MN, March 1997.

  5. Yu Hu and S. Lennart Johnsson. On the accuracy of Anderson's fast N-body algorithm. Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, Minneapolis, MN, March 1997.

  6. Yu Hu and S. Lennart Johnsson. On the accuracy of Anderson's fast N-body algorithm. Technical Report TR-06-96, Harvard University, Division of Engineering and Applied Sciences, January 1996. Submitted to SIAM J. Sci. Comput., 1996.

  7. Y. Charlie Hu, Shang-Hua Teng and S. Lennart Johnsson. A Data-Parallel Implementation of the Geometric Partitioning Algorithm. Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, Minneapolis, MN, March 1997.