In [1, 2, 3], we showed that Anderson's nonadaptive method can be implemented efficiently in a high-level language, such as CM Fortran, on scalable parallel architectures. We simulated particle systems with up to 100 million particles on CM-5/5E systems and achieved performance comparable to the best message-passing implementations of hierarchical methods. The code was listed as an ``impressive entry'' in the 1994 Gordon Bell Competition, and was the only entry written entirely in a high-level language ever to compete in the raw performance category.

One difficulty with parallel implementations of the adaptive O(N) methods is their complex, highly nonuniform computational structure, a difficulty that is further aggravated if a low-level programming paradigm, such as message-passing, is used. But the highly nonuniform computational structure is also a serious challenge for data-parallel programming, which in its pure form has a single-threaded execution model. Efficient resource utilization dictates that an adaptive method be used for simulations of particle systems with highly nonuniform particle distributions. For very large particle systems, the O(N) methods have an advantage in arithmetic complexity over the Barnes-Hut O(N\log N) method.

We have recently formulated the adaptive Anderson's method entirely in HPF [4]. The key observation is that the computations are dominated by operations on four kinds of interaction lists, each associated with either every leaf-level subdomain or every subdomain in the hierarchy of subdomains. The number of interactive subdomains in each kind of list, and therefore the amount of computation and communication, differs from subdomain to subdomain. But this nonuniformity is relatively mild in the sense that the unit computation is the same and only the number of units per subdomain differs. This level of nonuniformity is suitably expressed using INDEPENDENT do-loops or the extrinsic procedures supported in HPF, which provide a mechanism for multiple threads of control. The fetching of the interactive subdomains is performed entirely in HPF as array indirect addressing, which avoids lower-level message-passing programming; extrinsic procedures are invoked only for the nonuniform computations per subdomain (and consequently per processor).
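The gather step described above can be sketched in NumPy, whose fancy indexing plays the role of HPF array indirect addressing. This is an illustrative analogue, not the authors' HPF code; all names and sizes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

n_sub = 8                              # number of leaf-level subdomains (illustrative)
max_len = 5                            # maximal interaction-list length
values = rng.standard_normal(n_sub)    # one field value per subdomain

# lists[i, :] holds the indices of the subdomains interacting with subdomain i;
# counts[i] is the actual list length, the remaining entries are padding.
counts = rng.integers(1, max_len + 1, size=n_sub)
lists = rng.integers(0, n_sub, size=(n_sub, max_len))

# Gather all interactive subdomain values at once (indirect addressing),
# then mask out the padded entries before reducing per subdomain.
gathered = values[lists]                       # shape (n_sub, max_len)
mask = np.arange(max_len) < counts[:, None]    # valid-entry mask
per_sub = np.where(mask, gathered, 0.0).sum(axis=1)
```

The per-subdomain reduction after the gather is the uniform "unit computation"; only the valid-entry counts differ across subdomains.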

Many challenging problems remain before our data-parallel formulation of the adaptive methods is complete:

- In our data-parallel formulation, the adaptive hierarchy is
represented in array data structures, with every array element representing a
subdomain in the hierarchy. The potentially uneven amount of communication
and computation associated with the subdomains requires that the array be
laid out with an uneven number of elements across the processors in
order to achieve a balanced load. This form of array layout is
supported through "extensions" in the newly announced HPF 2.0
(October 1996). A delay of at least one year is expected
before the first HPF compiler that actually supports this feature
becomes available.
- In our data-parallel formulation, the fetching of the interactive boxes for all boxes is performed in one of two ways. When the average length of the interaction lists is close to the maximal, we slice through the lists (stored in 2-D arrays); high efficiency is expected since most subdomains will participate most of the time. When the average length is low compared to the maximal, we flatten the lists into 1-D arrays. Neither approach is optimal, and we plan to develop and implement in MPI a set of primitives that best perform this kind of nonuniform communication. One idea is to implement a function similar to the ``send-to-queue'' run-time system function on the CM-2/200 to handle dynamic scheduling.
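The two traversal strategies can be contrasted in a small NumPy sketch; this is an illustration with synthetic list contents, not the HPF or MPI code discussed above.

```python
import numpy as np

rng = np.random.default_rng(1)
n_sub, max_len = 6, 8
values = rng.standard_normal(n_sub)
counts = rng.integers(0, max_len + 1, size=n_sub)
lists = rng.integers(0, n_sub, size=(n_sub, max_len))

# Strategy 1: slice through the padded 2-D lists column by column.
# Efficient when most lists are close to the maximal length, since most
# subdomains then participate in every slice.
acc_slice = np.zeros(n_sub)
for j in range(max_len):
    active = counts > j                       # subdomains whose list reaches column j
    acc_slice[active] += values[lists[active, j]]

# Strategy 2: flatten the lists into a 1-D array (CSR-like), so no work
# is wasted on padding.  Better when the average list length is low.
flat = np.concatenate([lists[i, :counts[i]] for i in range(n_sub)])
seg = np.repeat(np.arange(n_sub), counts)     # owner of each flattened entry
acc_flat = np.zeros(n_sub)
np.add.at(acc_flat, seg, values[flat])
```

Both strategies compute the same per-subdomain reduction; they differ only in how much padding is traversed and in the regularity of the resulting communication pattern.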

The geometric partitioning algorithm (GEO) due to Miller, Teng, Thurston and Vavasis yields provably good partitionings. We have developed a data-parallel formulation of GEO in HPF [7]. We are also working on new partitioning algorithms.
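A much-simplified sketch of one geometric split in the spirit of GEO is given below: the points are stereographically projected onto the unit sphere one dimension up and cut by a random great circle. The full algorithm also computes an approximate centerpoint and conformally maps it to the sphere's center, which is what yields the provable quality guarantees; this sketch omits those steps, and all names are illustrative.

```python
import numpy as np

def geo_split(points, rng):
    """One level of a simplified geometric split: stereographically project
    2-D points onto the unit sphere in 3-D and cut with a random great circle.
    (The centerpoint computation and conformal map of the real GEO algorithm
    are omitted here.)"""
    norm2 = (points ** 2).sum(axis=1)
    # Stereographic projection of p in R^2 to (2p, |p|^2 - 1) / (|p|^2 + 1).
    sphere = np.column_stack([2 * points, norm2 - 1]) / (norm2 + 1)[:, None]
    normal = rng.standard_normal(3)           # random great circle = random
    return sphere @ normal > 0.0              # plane through the origin

rng = np.random.default_rng(2)
pts = rng.standard_normal((100, 2))
side = geo_split(pts, rng)                    # boolean partition of the points
```

Recursing on each side gives a hierarchy of subdomains; the sphere separator corresponds to a circle separator of the original planar points.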

The Recursive Spectral Bisection (RSB) method is another heuristic graph partitioning method that has had significant success in partitioning unstructured meshes. The quality of the partitions produced by RSB has recently been proven. A previous data-parallel implementation did not incorporate edge weights. We are interested in developing a data-parallel implementation of the RSB method that accounts for node and edge weights, and in performing a comprehensive study of the various partitioning schemes as integrated in our data-parallel adaptive N-body solver.
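One spectral-bisection step, with edge weights folded into the graph Laplacian, can be sketched on a toy graph as follows; this is a serial NumPy illustration, not the data-parallel implementation discussed above.

```python
import numpy as np

def spectral_bisect(W):
    """W: symmetric weighted adjacency matrix.  Split the vertices by the
    sign pattern of the Fiedler vector (eigenvector of the second-smallest
    eigenvalue of the weighted graph Laplacian)."""
    L = np.diag(W.sum(axis=1)) - W          # weighted graph Laplacian
    vals, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    fiedler = vecs[:, 1]
    return fiedler >= np.median(fiedler)    # median split balances the halves

# Two 4-cliques joined by one weak (weight 0.1) bridge edge: the bisection
# should separate the cliques.
n = 8
W = np.zeros((n, n))
W[:4, :4] = 1.0
W[4:, 4:] = 1.0
np.fill_diagonal(W, 0.0)
W[3, 4] = W[4, 3] = 0.1                    # weak bridge edge
part = spectral_bisect(W)
```

Recursing on each half yields the full RSB partition; vertex (node) weights would additionally enter the balance criterion used in the split.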

The hierarchical methods are approximation methods; increased accuracy can be obtained at increased computational expense. However, good estimates that can be used to determine a priori the number of terms required in the expansions (and the integration order in Anderson's method) for a given accuracy and a given particle distribution are still missing for the O(N) methods, with the exception of the recent analysis by Salmon and Warren for their modification of the Barnes-Hut method. This lack of accurate error analysis is one of the factors limiting the acceptance of these methods.

We have studied empirically the accuracy-cost tradeoffs of Anderson's method [5, 6]. The parameters that control the degree of approximation of the computational elements and the separation of interacting computational elements govern both the arithmetic complexity and the accuracy of the method. Our experiments show that for a given error requirement, using a near-field containing only nearest-neighbor boxes, together with a hierarchy depth chosen accordingly, minimizes the total number of arithmetic operations.
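The basic accuracy-cost tradeoff can be illustrated numerically. Anderson's method uses integration-based representations rather than multipole expansions, so the sketch below uses a truncated 2-D multipole expansion only as a generic stand-in: the far-field error falls off geometrically with the number of terms kept, while the cost grows with it. All numbers are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
# Unit charges clustered in a small box near the origin (as complex numbers),
# evaluated at a well-separated target point.
sources = 0.3 * (rng.random(50) + 1j * rng.random(50))
target = 2.0 + 2.0j

# Exact 2-D (logarithmic) potential at the target.
exact = np.sum(np.log(np.abs(target - sources)))

def approx_potential(p):
    """p-term multipole approximation about the origin:
    Re[ Q log z - sum_{k=1..p} a_k / (k z^k) ],  a_k = sum_j z0_j^k."""
    acc = len(sources) * np.log(target)
    for k in range(1, p + 1):
        acc -= np.sum(sources ** k) / (k * target ** k)
    return acc.real

# Error shrinks geometrically as more terms (more arithmetic) are used.
errors = [abs(approx_potential(p) - exact) for p in (2, 5, 10)]
```

For Anderson's method the analogous knobs are the integration order and the separation criterion, and the experiments cited above identify the combination minimizing total arithmetic work for a target accuracy.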

With respect to arithmetic complexity, Anderson's method employs a set of integration formulas with integration orders up to fourteen that are provably efficient in their use of integration points. For very high accuracy, efficient formulas remain to be worked out. Hrycak and Rokhlin recently developed new basis functions that are much more efficient than multipole expansions. Different O(N) methods differ only in the computational elements used, and both the accuracy and the arithmetic complexity depend on the basis functions used in the computational elements. The accuracy-cost tradeoffs among the various methods need to be studied in order to determine the most efficient method.

Another issue critical to the acceptance of hierarchical methods for N-body simulations in applications such as molecular dynamics is their ability to handle boundary conditions other than free space. Greengard and Rokhlin have shown how to include periodic boundary conditions in the Greengard-Rokhlin method. Extending Anderson's method to handle periodic and other, more complicated boundary conditions is highly important for its acceptance in practice.

- Yu Hu and S. Lennart Johnsson.
Implementing O(N) N-body Algorithms Efficiently in Data-Parallel Languages.
*Journal of Scientific Programming*, 5(4): 337-364, 1996.

- Yu Hu and S. Lennart Johnsson.
A Data-Parallel Implementation of Hierarchical N-body Methods.
*International Journal of Supercomputer Applications and High Performance Computing*, 10(1): 3-40, 1996.

- Yu Hu and S. Lennart Johnsson.
A Data-Parallel Implementation of O(N) Hierarchical N-body Methods.
*Supercomputing '96*, November 1996.

- Y. Charlie Hu, S. Lennart Johnsson and Shang-Hua Teng.
A Data-Parallel Adaptive N-body Method.
*Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing*, Parallel N-body Solver Minisymposium, Minneapolis, MN, March 1997.

- Yu Hu and S. Lennart Johnsson.
On the accuracy of Anderson's fast N-body algorithm.
*Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing*, Minneapolis, MN, March 1997.

- Yu Hu and S. Lennart Johnsson.
On the accuracy of Anderson's fast N-body algorithm.
Technical Report TR-06-96, Harvard University, Division of Engineering
and Applied Sciences, January 1996. Submitted to
*SIAM J. Sci. Comput.*, 1996.

- Y. Charlie Hu, Shang-Hua Teng and S. Lennart Johnsson.
A Data-Parallel Implementation of the Geometric Partitioning Algorithm.
*Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing*, Minneapolis, MN, March 1997.