In [1, 2, 3], we have showed that the nonadaptive method by Anderson can be efficiently implemented in a high-level language on scalable parallel architectures, such as CM Fortran. We simulated particle systems with up to 100 million particles on CM-5/5E systems, and achieved performance comparable to the best massage-passing implementations of hierarchical methods. The code was listed as an ``impressive entry'' in the 1994 Gordon Bell Competition, and was the only entry ever entirely written in a high-level language but competing for the raw performance category.
One difficulty with parallel implementations of the adaptive O(N) methods are their complex (highly) nonuniform computational structure, a difficulty that is further aggravated if a low-level programming paradigm, such as message-passing, is used. But, the highly nonuniform computational structure also represents a serious challenge for data-parallel programming with, in its pure form, a single-threaded execution model. Efficiency in resource utilization dictates that an adaptive method is used for simulations of particle systems with (highly) nonuniform particle distributions. For very large particle systems, the O(N) methods are advantageous in arithmetic complexity compared to the Barnes-Hut O(N\log N) method.
We have recently successfully formulated the adaptive Anderson's method entirely in HPF [4]. The key observation is that the computations are dominated by operations on four kinds of interaction lists associated with either every leaf-level subdomain or every subdomain in the hierarchy of subdomains. For each such subdomain, the number of interactive subdomains of each kind of list, and therefore the amount of computation and communication differs from that of the others. But, this kind of uniformity is relatively small in the sense that the unit computation is the same and only the number of units per subdomain differs from each other. This level of nonuniformity is suitably expressed using independent do-loop or the extrinsic procedures supported in HPF which provides a mechanism for multiple threads of control. The fetching of the interactive subdomains is performed entirely in HPF as array indirect addressing and therefore avoid the lower-level message-passing programming, and extrinsic procedures will only be invoked for the nonuniform computations per subdomain (and consequently per processor.)
Many challenging problems remain in order to complete our data-parallel formulation of the adaptive methods:
The geometric partitioning algorithm (GEO) due to Miller, Teng, Thurston and Vavasis yields provably good partitionings. We have developed a data-parallel formulation of GEO in HPF [7]. We are also working on new partitioning algorithms.
The Recursive Spectral Bisection (RSB) method is another heuristic graph partitioning method that has had significant success in partitioning unstructured meshes. The quality of the partitions offered by RSB has recently been proven. A previous data-parallel implementation did not incorporate edge weights. We are interested in developing a data-parallel implementation of the RSB method that accounts for node and edge weights and to perform a comprehensive study of the various partitioning schemes as integrated in our data-parallel adaptive N-body solver.
The hierarchical methods are approximation methods; increased accuracy can be obtained at increased computational expense. However, good estimates that can be used to determine a-priori the number of terms required in the expansions (and the integration order for Anderson's method) for a given accuracy for a given particle distribution are still missing for the O(N) methods, with the exception of recent analysis Salmon and Warren for their modification of the Barnes-Hut method. This lack of accurate error analysis is one of the factors limiting the acceptance of these methods.
We have studied empirically the accuracy-cost tradeoffs of Anderson's method [5, 6]. The various parameters that control the degree of approximation of the computational elements and the separateness of interacting computational elements govern both the arithmetic complexity and the accuracy of the method. Our experiment shows that for a given error requirement, using a near-field containing only nearest neighbor boxes and a hierarchy depth that minimizes the number of arithmetic operations, minimizes the total number of arithmetic operations.
With respect to the arithmetic complexity, Anderson's method employs a set of integration formulas with integration orders up to fourteen that are provably efficient in their use of the integration points. For very high accuracy, efficient formulas remain to be worked out. Hrycak and Rokhlin recently developed new basic functions much more efficient than multipole expansions. Different O(N) methods differ only in the computational elements used, and both the accuracy and the arithmetic complexity depend on the basis functions used in the computational elements. The accuracy-cost tradeoffs among the various methods need to be studied in order to determine the most efficient method.
Another critical issue to the acceptance of the hierarchical methods for N-body simulations in applications such as molecular dynamics is their ability to handle other than free-space boundary conditions. Greengard and Rokhlin have shown how to include periodic boundary conditions in the Greengard-Rokhlin method. Extending Anderson's method to handle periodic and other more complicated boundary conditions is highly important for its acceptance in practice.