LOGO ENSTA LOGO LMA LOGO PCP POLE DE CALCUL PARALLELE
BLEU



DISTRIBUTED DATA INTERFACE (D.D.I)





Among the various parallel approaches, it appears that the distributed data interface concepts are valuable enough to be known outside quantum chemistry circles.
The DDI description which follows is excerpted from the GAMESS quantum chemistry program documentation.

It is difficult to write a description of the parallel nature of GAMESS that separates what is important for the installer of GAMESS or programmers of GAMESS from the end user. Users of GAMESS should read most of this section, skipping only the most technical parts, in order to be able to effectively run this program.
Efficient use of GAMESS requires an understanding of three critical issues: The first is the difference between two types of memory (replicated MEMORY and distributed MEMDDI) and how these relate to the physical memory of the computer which you are using. Second, you must understand to some extent the degree to which each type of computation scales so that the proper number of nodes is selected. Finally, most systems run two copies of GAMESS on each processor, and if you read on you will find out why this is so.
Since all code needed to implement the Distributed Data Interface (DDI) is provided with the GAMESS source code distribution, the program compiles and links ready for parallel execution on all machine types. Of course, you may choose to run on only one processor, in which case GAMESS will behave as if it is a sequential code, and the full functionality of the program is available.

Below you will find the following topics:
  • parallelization history
  • DDI process/memory schematic
  • memory allocations and check jobs
  • transport protocols and installation
  • representative performance examples
  • a very few programming details

          
              We began to parallelize GAMESS in 1991 as part of the
          joint ARPA/Air Force piece of the Touchstone Delta project.
          Today, nearly all ab initio methods run in parallel, 
          although some of these still have a step or two running
          sequentially only.  Only the MP2 energy for MCSCF, and RHF
          CI gradients have no parallel method coded.  We have not
          parallelized the semi-empirical MOPAC runs, and probably 
          never will.  Additional parallel work is in progress under
          a DoD CHSSI software initiative which "kicked off" in
          1996.  This has already led to the DDI-based parallel MP2
          gradient program, after development of the DDI programming
          toolkit itself.
 
              In 1991, the parallel machine of choice was the Intel
          Hypercube although small clusters of workstations could
          also be used as a parallel computer.  In order to have
          the best blend of portability and functionality, we chose
          in 1991 to use the TCGMSG message passing library rather
          than one of the early vendor's specialized libraries.  As
          the major companies began to market parallel machines, and
          as MPI version 1 emerged as a standard, we began to use
          MPI on some equipment in 1996, while still using the very
          resilient TCGMSG library on everything else.  However, in
          June 1999, we retired our old friend TCGMSG when the
          message passing library used by GAMESS changed to the
          Distributed Data Interface, or DDI.  We will discuss later
          the low level message transports which DDI relies on:
          SHMEM, TCP/IP sockets, or MPI-1.

              Two people have been extremely influential upon the
          current parallel methodology.  Theresa Windus, a graduate
          student in the early 1990s, created the first parallel 
          versions.  Graham Fletcher, a postdoc in the late 1990s,
          is responsible for the addition of distributed data
          programming concepts.

                                  * * *

              DDI contains the usual parallel programming calls, such
          as initialization/closure, point to point messages, and
          the collective operations global sum and broadcast.  These
          simple parts of DDI support all parallel methods developed
          in GAMESS from 1991-1999, which were based on replicated
          storage rather than distributed data.  However, DDI also
          contains additional routines to support distributed memory
          usage.

              DDI attempts to exploit the entire machine in a scalable
          way.  While our early work concentrated on exploiting the
          use of p processors and p disks, it required that all data
          in memory be replicated on every one of the p nodes.  The
          use of memory also becomes scalable only if the data is
          distributed across the aggregate memory of the parallel
          machine.  The concept of distributed memory is contained in
          the  Remote Memory Access portion  of  MPI version 2 , but so
          far MPI-2 is not available from American computer vendors.
          A similar concept has been implemented in the  Global Array 
          tools of Pacific Northwest National Laboratory.  
 
              Basically, the idea is to provide three subroutine calls
          to access memory on remote nodes: PUT, GET, and ACCUMULATE.
          These give access to a class of memory which is assumed to
          be slower than local memory, but faster than disk:

            <--- fastest                                 slowest --->
            registers cache(s) local_memory remote_memory disks tapes
            <--- smallest                                biggest --->

          Because DDI accesses memory on other nodes by means of an
          explicit subroutine call, the programmer is aware that a
          message must be transmitted.  This awareness of the access
          overhead should encourage algorithms that transfer many
          data items in a single message.  Use of a subroutine call
          to reach remote memory is a recognition of the non-uniform
          memory access (NUMA) nature of parallel computers.  In
          other words, the Distributed Data Interface (DDI) is an
          explicitly message passing implementation of global shared
          memory.  

              In order to have one node pass data items to a second
          node when the second node needs them, without any significant
          delay, the computing job on the first node must interrupt
          its computation briefly to furnish the data.  This type of
          communication is referred to as "one sided messages" or
          "active messages" since the first node is an unwitting
          participant in the process, which is driven entirely by the
          requirements of the second node.

              The Cray T3E has a library named SHMEM to support this
          type of one sided messages (and good hardware support for
          this too) so, on the T3E, GAMESS runs as a single process
          per CPU.  Its memory image looks like this:

                      node 0           node 1
                        p=0              p=1
                  ---------------   ---------------
                  |    GAMESS   |   |    GAMESS   |
                  |   quantum   |   |   quantum   |
                  |  chem code  |   |  chem code  |
                  ---------------   ---------------
                  |  DDI code   |   |  DDI code   |   Input keywords:
                  ---------------   ---------------
                  |  replicated |   | replicated  |       <-- MEMORY
                  |  data       |   | data        |
              -----------------------------------------
              |   |             |   |             |   |   <-- MEMDDI
              |   |  distributed|   | distributed |   |
              |   |  data       |   | data        |   |
              |   |             |   |             |   |
              |   |             |   |             |   |
              |   |             |   |             |   |
              |   ---------------   ---------------   |
              -----------------------------------------

          where the box drawn around the distributed data is meant to
          imply that a large data array is residing in the memory of
          all nodes, in this example, half on one and half on the
          other.  At the present time, the DDI routines support only
          two dimensional FORTRAN arrays, organized so that columns
          are kept on a single node's memory.  Up to 10 matrices may
          be distributed in this fashion.  

              Note that the input keyword MEMORY gives the amount of
          storage used to duplicate small matrices on every node,
          while MEMDDI gives the -total- distributed memory required
          by the job.  Thus, if you are running on p nodes, the memory
          that is used on any given node is 
 
                 total on any 1 node = MEMORY + MEMDDI/p
 
          Since MEMDDI is very large, its units are in millions of
          words.  The keyword MEMORY is in units of words (64 bit
          quantity) and so you must either convert units carefully
          or use the MWORDS synonym for MEMORY (for which the units
          are also millions of words).  Since good execution speed
          requires that you not exceed the physical memory belonging
          to your nodes, it is important to understand that when
          MEMDDI is large, you will need to choose a sufficiently
          large number of nodes to keep the memory on any 1 node
          reasonable.

              To repeat, the DDI philosophy is to add more processors
          not just for their compute performance or extra disk space,
          but  also to aggregate a very large total memory .  Bigger
          problems will require more nodes to obtain sufficiently
          large total memories!  We will give an example of how you
          can estimate the number of nodes a little ways below.
          
              If the GAMESS task running as process p=1 in the above
          example needs some values previously computed, it issues a 
          call to DDI_GET.  The DDI routines in process p=1 then 
          figure out where this "patch" of data in the big rectangular
          distributed storage actually resides.  Suppose this is on
          process p=0.  The DDI routines in p=1 send a message to
          p=0 to interupt its computations, after which p=0 sends a
          bulk data message to process p=1's buffer.  This buffer
          resides in part of the replicated storage of p=1, where
          computations can occur.  Thus distributed data is accessed
          only by DDI_GET, DDI_PUT (its counterpart for storage of
          data items), and DDI_ACC (which accumulates new terms into
          the distributed data).  Note that the quantum chemistry
          layer of process p=1 was sheltered from most of the details
          regarding which node owned the patch of data that process
          p=1 wanted to obtain.  These details are managed by the
          DDI layer.  

              It is the programmer's responsibility to minimize the
          number of GET/PUT/ACC calls, and to design algorithms that
          maximize the chance that the patches of data are actually
          within the local node's portion of the distributed data.
          Note that with the exception of DDI_ACC's simple addition,
          no arithmetic is done directly upon the distributed data.
          Instead, DDI_GET and DDI_PUT should be thought of as
          analogous to the FORTRAN READ and WRITE statements that
          transfer data between disk storage and local memory where
          computations may occur.

              Since MPI-2 is unavailable, and vendor specific "one
          sided messaging" libraries such as the T3E's SHMEM are
          scarce, all other platforms adopt the following strategy.
          It involves two GAMESS processes running on every node:

                      node 0           node 1
                        p=0              p=1
                  ---------------   ---------------
                  |    GAMESS  X|   |    GAMESS  X|        compute
                  |   quantum   |   |   quantum   |       processes
                  |  chem code  |   |  chem code  |
                  ---------------   ---------------
                  |  DDI code   |   |  DDI code   |   Input keyword:
                  ---------------   ---------------
                  |  replicated |   | replicated  |       <-- MEMORY
                  |  data       |   | data        |
                  ---------------   --------------- 

                        p=2              p=3
                  ---------------   ---------------
                  |    GAMESS   |   |    GAMESS   |         data 
                  |   quantum   |   |   quantum   |       servers
                  |  chem code  |   |  chem code  |
                  ---------------   ---------------
                  |  DDI code  X|   |  DDI code  X|
                  ---------------   ---------------
              -----------------------------------------  Input keyword:
              |   |             |   |             |   |   <-- MEMDDI
              |   |  distributed|   | distributed |   |
              |   |  data       |   | data        |   |
              |   |             |   |             |   |
              |   |             |   |             |   |
              |   |             |   |             |   |
              |   ---------------   ---------------   |
              -----------------------------------------

          The first half of the processes do quantum chemistry, and
          the X indicates that they spend most of their time
          executing some sort of chemistry.  Hence the name "compute
          process".  Soon after execution, the second half of the
          processes call a servicing DDI routine which consists of an
          infinite loop to deal with GET, PUT, and ACC requests until
          such time as the job ends.  The X shows that these "data
          servers" execute only DDI support code.  This makes the
          data server's quantum chemistry routines the equivalent of
          the human appendix.  The whole problem of interupts is now
          in the hands of the operating system, as the data servers
          are distinct processes.  To follow the same example as
          before, when the compute process p=1 needs data that turns
          out to reside on node 0, a request is sent to the data
          server p=2 to transfer information back to the compute
          process p=1.  The compute process p=0 is completely unaware
          that such a transaction has occurred.
 
              The formula for the memory required by any single node
          is unchanged, if p is the total number of nodes used,
                 total on any 1 node = MEMORY + MEMDDI/p.

          
                                  * * *
 
              At present, only closed shell MP2 gradients, and the
          ZAPT open shell MP2 energy take advantage of the new
          distributed memory options.  We expect to adapt other
          methods to use this technique of memory aggregation, but
          currently all other types of jobs run with MEMDDI=0 and
          therefore use only replicated storage.  In this case the
          data server processes still run, but are dormant because
          no distributed memory access is attempted.  For example,
          in an SCF computation (no hessian or MP2 follow on) the
          memory needed is on the order of the square of the basis
          set size, for such quantities as the orbital coefficients,
          density, Fock, overlap matrices, and so on.  These are
          simply duplicated on every node in the MEMORY region.

              Check runs (EXETYP=CHECK) need to run quickly, and
          the fastest turn around always comes on one node only.
          Runs which do not currently exploit MEMDDI distributed
          storage will formally allocate their MEMORY needs, and
          feel out their storage needs while skipping almost all of
          the real work.  Since MEMORY is replicated, the amount
          that is needed on 1 node remains unchanged if you later
          do the true computation on more than 1 node.

              Check jobs which involve MEMDDI storage are a little
          bit trickier.  As noted, we want to run on only 1 node
          to get fast turn around.  However, MEMDDI is typically a
          large amount of memory, and this is unlikely to be 
          available on a single node.  The solution is that the
          data server process does not actually allocate the
          MEMDDI storage, instead it just remembers what you gave
          as input and checks to see if this will be adequate.  So,
          you can input MEMDDI=1000 (1000 million words is equal
          to 1,000 * 1,000,000 * 8 = 8 GBytes and run this check
          job on a computer with only 256 MB of RAM.

              Of course, the actual computation will have to run on
          a large number of such processors.  Let us continue with
          this example of a run requiring 8 GBytes of distributed
          data on 256 MB nodes.  Suppose that MEMORY is 2500000 in
          this case (when MEMDDI is used, MEMORY is typically just
          a few million words).  We need to reserve some memory
          for the operating system (16 MBytes, say) and for the
          GAMESS program and local storage (approx 16 MB, it is a
          big program, and the compute processes should be swapped
          into memory).  Thus our hypothetical 256 MB node has
          224 MB available, assuming no one else is running.  The
          rest of the computations proceed in million/mega words,
          so the available memory per node is 224/8 = 28.  We must
          choose the number of processors p to satisfy
                           needed <= available
                MEMORY + MEMDDI/p <= free physical memory
                     2.5 + 1000/p <= 28
          so this example requires p >= 39 compute processes.

              One more subtle point about CHECK runs with MEMDDI is
          that since you are running on 1 node only, the code does
          not know that you wish to run the parallel MP2 algorithm
          instead of the sequential algorithm.  You must force the
          CHECK job into the parallel section of the program by
           $system parall=.true. $end
          There's no harm leaving this line in for the true runs,
          as any job with more than one compute process is parallel
          regardless of the input value PARALL.
          
                                  * * *

              The next section deals with compilation and execution
          of GAMESS.  If someone else has already figured these
          things out for you, you may skip ahead to the section that
          illustrates how the code's performance scales.
  
              The purpose of this section is to describe only how
          the choices for low level message passing to support the
          DDI subroutines impact upon installation.  More explicit
          directions for the compiling process can be found in the
          first two sections of this chapter, in the readme.unix
          file, and notes on your machine and its compilers are to be
          found in the IRON.DOC chapter and in the 'comp' script.
          This section has the best explanation available of how to
          execute the program.

              The message traffic generated by DDI calls is sent by
          SHMEM on the Cray T3E, by MPI-1 on large parallel computers,
          and by TCP/IP sockets on networks of workstations.  We 
          cover each of these three classes of machines next.

                                   ---

              The Cray T3E's SHMEM library affords a single process
          implementation of GAMESS.  The T3E's message passing is
          contained in DDIT3E.SRC, and selecting the T3E target
          when compiling will use only this file and link against
          the SHMEM library.  The 'rungms' script has a special
          target to permit execution using this library.

                                   ---

              In general we expect a large vendor supplied parallel
          computer such as the IBM SP, SGI Origin, and large systems
          from companies such as Fujitsu, NEC, and Hitachi to have a
          MPI-1 library available.  It is furthermore reasonable to
          assume that an expensive machine in this class has a budget
          sufficient to purchase the vendor's MPI library.  Therefore
          the compute process/data server model outlined above will
          activate *MPI lines in the source file DDI.SRC, and link
          with the MPI-1 library.  Each requires a special target in
          the 'rungms' execution script.  Execution will require the
          vendor's "kickoff" routine to start two processes on each
          node, the second half of these will automatically become
          the data servers.

              Since DDI is brand new, we have correct control language
          in the scripts 'compall', 'comp', 'lked', and 'rungms' for
          the IBM SP only.  Sometime this summer we will try to get
          this finished for the Origin, and we will try to work with
          persons owning the Japanese systems to re-enable GAMESS on
          them.  There's no reason to doubt the MPI-1 messaging is
          correct since it is fully functional on the IBM SP, and so
          these other machines require only control language.  Or so
          we hope!
                                   ---
 
              The third class of machines are technical workstations
          running Unix.  In this category we include the IBM RS/6000,
          Compaq AXP (yours may say Digital on the front), Sun, HP,
          and SGI workstations, and also  Intel-based Linux systems .
          These are characterized by low cost, implying that even if
          a vendor offers MPI-1 on these systems, the software may
          not have been purchased.  However, all of these have the
          TCP/IP socket library that has been in Unix for decades 
          now.  Chances are that a vendor MPI-1 runs over sockets on
          this class equipment anyway, so DDI might just as well talk 
          directly to sockets.  The DDI layer consists of C language
          routines to open the socket connections and transmit data
          through them, in the file DDISOC.C.  Higher level concepts
          such as global communications are written in FORTRAN, as
          the *SOC lines in DDI.SRC.  Besides these two files which
          link into the GAMESS executable, we need a way to fire up
          the compute processes and data servers.  This is DDIKICK.C 
          which is referred to as the "kickoff program".

              The compiling and linking scripts 'compall', 'comp',
          and 'lked' have targets for each kind of workstation, as
          their compilers have various options.  When the compiling
          and linking is done you should have two programs, namely
          ddikick.x and gamess.01.x.  The latter can be run on one
          or more CPUs, as it is sequential if you run on one node,
          and parallel whenever you run on more than one.  The
          execution script 'rungms' has a common target of 'sockets'
          since all six machines we have mentioned used ddikick.x to
          start processes.  This script has more details about how
          to run, but we will describe here what the arguments to
          the kickoff program are.
The command in 'rungms' to fire up GAMESS is

% ddikick.x Inputfile Exepath Exename Scratchdir \
Nhosts Hostname_0 Hostname_1 ... Hostname_N-1
  • Inputfile name is not actually used, but it will be displayed by the 'ps' command so you can tell what is actually being run.

  • Exepath is the name of the directory that contains the program to be executed. Exename is the name of the GAMESS executable (which might have a different "version number" than 01). The best situation is to have Exepath in an NFS mounted partition that all nodes can access, so that you have only one copy of the big GAMESS executable. However, you could carefully FTP a copy to all nodes using always exactly the same file name, such as /usr/local/bin/gamess.01.x.
    Note that since only one executable name is specified, only one vendor's computers can be used at a time. This limitation arises from a lack of XDR calls in the DDI layer to convert data types from one internal representation of numeric data to another machine's.

  • Scratchdir is the name of a large local working disk space, such as /scr/mike, in which all temporary files are placed. These files should be automatically deleted by the execution script as the job ends. If the nodes do not happen to have the same scratch area name, you can make it "feel like" they do with soft links such as "ln -s /actualname /scr". Under no circumcumstance should you make Scratchdir an NFS partition, as serious I/O happens to this directory. Ideally Scratchdir is a striped multi-disk Ultra-2-wide partition, with 9+ GBytes free space per node. However, the GAMESS output (stdout) and two supplemental ASCII output files PUNCH and IRCDATA can and probably should be sent over NFS to the user's permanent disk space on a file server. This serves the purpose of allowing the user to monitor the simulation as it runs, and gets the results to a place where it can be backed up once in a while. The files written into Scratchdir should be erased by the 'rungms' script upon normal exit.
    Individual file names are set by the 'rungms' script's setenv commands. Some of the files are written to only by the master process running on node 0 (stdout and PUNCH are good examples of this), but other files are distributed across all node's Scratchdirs (scalable disk usage). The atomic integrals AOINTS is a good example of this. The rungms setenv's will define this file as xxx.F08 on node 0, where xxx is the name of the input file, and the rest of the name comes from its being FORTRAN unit 8 internally. On other nodes the file name will have the node number appended, xxx.F08.001, xxx.F08.002, and so on. Obviously, only the compute processes own disk files.

  • Nhosts is the number of compute processes to be run. If you want to run sequentially, just ensure Nhosts is 1. Hostname_0 is the "master node", which handles reading the one input file, and writing the one output file. This host must be the same host that is executing the 'rungms' script, or else the environment variables that define the files don't get properly accepted.

  • Supply a total of Nhosts Hostnames. One compute process will be started on each of these (process IDs 0,1,...Nhosts-1), and then one data server will be run on each as well (for a total of 2 times Nhosts processes). If you have SMP systems , such as a four processor machine, set Nhosts=4, and repeat its Hostname 4 times.

              Execution is by a direct system call if the process is
          to run on the host running 'rungms' and which is therefore
          also running 'ddikick.x'.  Remote hosts are reached by the
          command 'rsh', so users will need to use a .rhosts file to
          authenticate themselves (unless your system is using some
          replacement for this such as Kerberos).  The .rhosts file
          needs to be in your home directory, and looks like this:
                   si.fi.ameslab.gov  mike
                   ga.fi.ameslab.gov  mike
                   ge.fi.ameslab.gov  mike
                      ...and so on...
          except your user name is probably not 'mike'.  Note that
          ddikick.x has no mechanism to support the user name on one
          of the machines being 'schmidt' instead of 'mike'.  
      
              Assuming that all goes well, the job will terminate
          orderly by each compute process telling its local data
          server to cease execution.  Upon the successful suicide
          of the data server, the compute process reports to the
          dormant (but still running) ddikick.x that it is ready
          to end.  When all compute processes have checked in, the
          kickoff program informs each that it is OK to stop, and
          following this ddikick.x exits.  

               Abnormal terminations are of course less predictable.
          However, when ddikick.x is informed by the system that one
          of its children has died, it tries to send a kill command
          to all its other children, and so hopefully all processes
          are then eliminated.  However, depending on the exact
          circumstances in which the abnormal end occurs, the system
          may have a few processes left over for manual termination.
          If you decide that a GAMESS job should be killed, use the
          Unix 'kill' command to take out either the compute or data
          server process on the master node, or one of the 'rsh' 
          processes that have launched GAMESS onto the remote nodes.
          Do not kill ddikick.x directly, instead stop any of these
          child processes, so that ddikick.x will terminate all the
          other processes for you.

              Before ending this section on DDI over TCP/IP sockets
          on workstation class machines, we should comment on the
          network requirements.  It is not reasonable to run jobs
          that use MEMDDI distributed memory on 10 megabit/second
          Ethernet since the bandwidth is just too small.  However,
          if you use only the replicated MEMORY storage you should
          be able to get by on this old network cable.  As will be
          shown below, a switched Fast Ethernet is capable of decent
          performance on such 100 megabit/second cables.  Both the
          host adapters and the switch itself are now inexpensive.
          Gigabit ethernet (1000 mbit/sec) is pricy, and although
          the bandwidth is good, the latency remains too large.


                                  * * *

              This section describes the way in which the various
          quantum chemistry computations run in parallel, and shows
          some typical performance data.  This should give you as
          the user some idea how many nodes can be efficiently used
          for various SCFTYP and RUNTYP jobs.  There's a different
          subsection for 4 different kinds of runs, followed by a
          summary.

              Many of the performance data you will see below were
          obtained on a 16 node Intel Pentium II Linux (Beowulf-type)
          cluster costing $49,000, of which $3,000 went into the
          switched Fast Ethernet component.  512 MB/node means this
          cluster has an aggregate memory of 8 GB.  For more details,
          see
            http://www.msg.ameslab.gov/GAMESS/page/not/written/yet

                                   ---

              The HF wavefunctions can be evaluated in parallel
          using either conventional disk storage of the integrals,
          or via direct recomputation of the integrals.  Assuming the
          I/O speed of your system is good, direct SCF is *always*
          slower than disk storage.  Some experimenting will show
          which is more effective on your hardware.  As an example of
          the scaling performance of RHF, ROHF, UHF, or GVB jobs 
          that involve only computation of the energy or its gradient,
          we include here a timing table from the 16 node PC cluster.
          The molecule is luciferin, which together with the enzyme
          luciferase is involved in firefly light production.  The
          chemical formula is C11N2S2O3H8, and RHF/6-31G(d) has 294
          atomic orbitals.  There's no molecular symmetry.  The run
          is done as direct SCF because the total amount of AO
          integrals is 3.8 GBytes, and Linux does not permit files 
          to exceed 2 GBytes (of course use of 2 or more nodes can
          be run conventional as this disk file will be distributed 
          across all available disks).  The CPU timing data is

                          p=1  p=2  p=3  p=4  p=8 p=12 p=16
           1e- ints       1.6  0.8  0.6  0.4  0.3  0.3  0.1
           Huckel guess    22   18   16   14   14   12   12
           15 RHF iters  5536 2802 1891 1436  753  519  406
           properties     7.5  7.3  7.3  7.3  7.8  7.0  7.0
           1e- gradient  11.5  5.7  4.1  2.7  1.4  1.0  0.8
           2e- gradient  1339  658  437  328  105  110   83
                         ---- ---- ---- ---- ---- ---- ----
           total CPU     6917 3491 2357 1790  941  649  509 seconds
           total wall    6924 3540 2408 1820  979  696  559 seconds

          Note that direct SCF should run with the wall time very
          close to the CPU time as there is essentially no I/O and
          not that much communication (MEMDDI storage is not used
          by this kind of run).  Wall clock speedup from 1 to 16
          nodes is 12.4, and for this type of run we frequently use
          8, 16, or 32 nodes depending on availability.

              An idea of the variation in time with basis set
          size can be gained from the following runs made by
          Johannes Grotendorst, Juelich, Germany, on a Cray T3E
          or Intel Paragon, using 32 nodes on either.  These
          data were collected in about 1996, pre-DDI days, and
          as you can see, before all the Paragons were unplugged.
          The data is still representative.  Each molecule is an
          asymmetric organic compound, computing the RHF energy
          and gradient, using the 6-31G(d) basis set:

                                          T3E    Paragon
          taxol,    1032 AOs, CPU TIME = 546.8      --  minutes
          cAMP,      356 AOs, CPU TIME =  14.6    106.4
          luciferin, 294 AOs, CPU TIME =   8.9     67.2
          nicotine,  208 AOs, CPU TIME =   3.8     26.1
          thymine,   149 AOs, CPU TIME =   1.5     12.2
          alanine,   104 AOs, CPU TIME =   0.5      5.2
          glycine,    85 AOs, CPU TIME =   0.3      3.2

              If you are interested in an explanation of how the
          parallel SCF is implimented, see the main GAMESS paper,
            M.W.Schmidt, K.K.Baldridge, J.A.Boatz, S.T.Elbert,
              M.S.Gordon, J.H.Jensen, S.Koseki, N.Matsunaga,
          K.A.Nguyen, S.J.Su, T.L.Windus, M.Dupuis, J.A.Montgomery
                   J.Comput.Chem.  14, 1347-1363(1993)

                                   ---

              For the next type of computation, we discuss the MP2
          correction.  For UHF + MP2, only the second order energy
          can be computed, and the parallization strategy is similar
          to the replicated MEMORY code used by the MCSCF program.
          This is described below.  The MCSCF + MP2 code does not run
          parallel, unfortunately.  So here we are describing the
          closed shell RHF + MP2 energy or gradient, or the ROHF +
          ZAPT-type MP2 energy.  These two types of computations make
          use of the MEMDDI distributed data region.

             The example is a benzoquinone precursor to hongconin,
          a cardioprotective natural product.  The formula is
          C11O4H10, and 6-31G(d) has 245 AOs.  There are 39 valence 
          orbitals included in the MP2 treatment, and 15 core 
          orbitals.  MEMDDI must be 156 million words, so the same
          type of memory computation that was used above tells us
          that our 512 MB/node PC cluster must have at least three
          processors to aggregate the required MEMDDI.  MOREAD was
          used to provide converged RHF orbitals, so only 3 RHF
          iterations are performed.  The timing data are CPU and wall
          times (seconds) in the 1st/2nd lines:


                          p=3      p=4      p=8     p=12     p=16
            RHF iters     208      157       83       58       46
                          214      163       91       68       55
            MP2 step    8,935    6,966    3,417    2,283    1,724
                       12,529   10,046    5,763    4,013    3,056 
            2e- grad    2,181    1,712      838      552      420
                        2,490    1,981      991      677      499
            total CPU  11,335    8,846    4,347    2,902    2,199
            total wall 15,248   12,206    6,859    4,772    3,624

                                 3-->12  4-->16
                 CPU speedup      3.91    4.02
                 wall speedup     3.20    3.37

          On a T3E machine with 600 MHz nodes and 256 MB/node, we
          should have been able to run on as few as 6 nodes, but the
          available data for the same calculation starts from 8:

                              p=8  p=32 p=128
                total CPU    3108   814   282
                total wall   3154   850   340

          Wall clock performance is considerably better as you would
          expect on a machine where very good communications exist.
          Wall speedup for a 4x or 16x increase in node number is 3.7
          and 9.3.  Larger molecules which still have some computation
          left at 128 nodes do better than this.  We often use 64 or
          128 nodes for this type of run.

              As noted, the number of nodes is more influenced by a
          need to aggregate the necessary total MEMDDI, more than by
          concerns about scalability.  MEMDDI is typically large for
          MP2 parallel runs, as it is proportional to the number of
          occupied orbitals squared times the number of AOs squared.
  
              For more details on the distributed data parallel MP2
          program, see
            G.D.Fletcher, A.P.Rendell, P.Sherwood
                Mol.Phys. 91, 431-438(1997)
            G.D.Fletcher, M.W.Schmidt, M.S.Gordon
                Adv.Chem.Phys. 110, 267-294 (1999)
            G.D.Fletcher, M.W.Schmidt, B.M.Bode, M.S.Gordon
                Comput.Phys.Commun.  submitted

                                   ---

              The next type of computation we will consider is the
          analytic computation of the hessian for RHF, ROHF, or GVB
          wavefunctions.  The current implementation of the response
          equations is in the MO basis, and since the solver is not
          parallelized, so when this stage of the computation is
          reached, work is done only by process 0 while the other
          processes sit idle.  This is a sequential bottleneck.  The
          integral transformation is parallelized according to the
          same strategy as described below for MCSCF jobs.  Thus our
          early paper on analytic hessians which ran a sequential
          transformation no longer correctly describes the program.
          Thus the total scalability is better than was shown in
             T.L.Windus, M.W.Schmidt, M.S.Gordon
                  Chem.Phys.Lett.  216, 375-379(1993)

              The example we will consider is the same SbC4O2NH4 test
          that was used in this early paper.  We use the 3-21G* basis
          (110 AOs, 2 million words used).  The hardware is the same
          PC Linux cluster.  Table 1 of the reference should be:

                                p=1     p=2     p=3     p=4
               1e- ints        0.15    0.09    0.09    0.07 seconds
               Huckel          4.23    3.97    4.05    4.15
               2e- ints       14.84    7.26    4.89    3.68
               RHF iters      35.17   19.29   13.89   10.84
               properties      0.39    0.40    0.38    0.42
               dupl.2e- ints   N/A    14.75   14.71   14.78
               int.transf.   232.89  123.91   85.01   68.20
               1e- hess        4.47    2.84    2.04    1.21
               2e- hess      668.13  334.60  225.50  168.31
               CPHF          224.48  210.80  206.21  192.29
                            -------  ------  ------  ------
               total CPU     1185.0   718.1   557.0   464.2
               total wall    1188     733     600     494

          Clearly, the final response equation (CPHF) step is a
          sequential bottleneck, as is the fact that the orbital 
          hessian in this step is stored entirely on the disk space 
          of node 0.  Since the integral transformation is run in
          replicated MEMORY rather than distributing this, and since
          it also needs a duplicated AO integral file be stored on
          every node, the code is clearly not scalable to very many
          processors.  Typically we would not request more than 3
          or 4 processors for an analytic hessian job.

                                   ---

              As the final example, we turn to MCSCF energy/gradient
          runs.  The parallelization of the integral transformation
          was done before the introduction of the distributed data
          concept, and hence this kind of job uses only replicated
          MEMORY at present.  In addition, the determinant CI step is
          not yet converted to parallel execution, so if you run on
          more than one node, MCSCF jobs must use $MCSCF CISTEP=GUGA.


              Our work on parallization of MCSCF was described in
          the paper
              T.L.Windus, M.W.Schmidt, M.S.Gordon
                 Theoret.Chim.Acta  89, 77-88(1994).
          This points out that MCSCF has many bottlenecks, of which
          the most important are the integral transformation, the
          optimization of the CI coefficients, and the optimization
          of the orbital coefficients.  The amount of time spent
          in each depends on the number of atomic orbitals, and the
          size of the active space, and the number of filled MOs.
          Since this parallelization paper came out, we have added
          a new converger for MCSCF, namely SOSCF, and here we will
          show the performance of this default converger on a fairly
          large example.  Before doing so, we point out again that
          our new CI optimization step over determinants is faster
          on one node, but it will refuse to run in parallel, so the
          data shown use the older GUGA CSF program.
          
              The integral transformation step is run in replicated
          MEMORY, using "passes" over the occupied orbitals.  The
          different passes involve different occupied orbitals so
          the work is trivially distributed across all nodes.  A
          better description of this can be found in the reference
          given.  Basically the same approach is currently being
          used during analytic hessians or UHF MP2.  The method
          distributes CPU time fairly well, but it does not make
          efficient use of memory as every node must have the
          same memory is required to run on 1 node (order of the
          basis set cubed).  

              There are two strategies to govern the placement of AO
          integrals, each of which has to be available on every node.
          One is to store the file on the disks in a distributed
          fashion, and as each node reads its subset, to broadcast
          these on the communication channel to all other nodes.
          This is AOINTS=DIST in the $TRANS or $MP2 input, and is
          appropriate only for machines with good communications.
          When disk I/O is faster than the communications, such as
          is likely for workstation clusters, the entire 2e- AO
          integral file is duplicated on every node (AOINTS=DUP).
          This is not scalable either in generation of integrals, or
          in their disk storage, but it takes pressure off the
          communications channel.  You may want to experiment with
          the AOINTS keyword to see if the default for your machine 
          is well chosen.

              The example we choose is at a transition state for
          the water molecule assisted proton transfer in the
          first excited stat of 7-azaindole.  The formula is
          C7N2H6(H2O), there are 190 active orbitals, and the
          active space is the 10 pi electrons in 9 pi orbitals
          of the azaindole portion.  There are 5,292 CSFs.  See
          Figure 6 of G.M.Chaban, M.S.Gordon  J.Phys.Chem.A 103,
          185-189(1999) if you are interested in this chemistry.
          The timing data (seconds) from our PC Linux cluster are


                           p=1     p=2      p=3      p=4
          dup. 2e- ints  373.2   381.2    381.4    389.9
          DRT              0.4     0.4      0.4      0.4
          transform.     448.0   237.4    181.3    139.1
          Hamilton.        3.1     2.5      2.4      2.3
          diag. H         27.4    15.1     11.1      9.1
          2e- dens.        2.4     2.3      2.2      2.2
          orb. update     60.1    56.7     56.2     56.4
          iters 2-16    6723.3  4429.9   3554.2   2885.7
          1e- grad         6.2     2.6      3.2      1.4
          2e- grad       912.3   463.9    327.3    242.8
                        ------  ------   ------   ------
          total CPU     9485.1  5599.1   4526.9   3736.1
          total wall    15,703   9,537    7,763    6,192

          The first iteration is broken down into its primary steps
          from the integral transformation to the orbital update,
          inclusive.  Typically we would not use more than 4 to 8
          processors for a parallel MCSCF job.

                                   ---

              In summary, most ab initio computations will run
          in less time on more than one node.  However, some
          things can be run only on 1 node, namely
             semi-empirical runs
             determinant based MCSCF
             MCQDPT2 perturbation correction to MCSCF
             RHF+CI gradient
             PCM solvation model
          Several steps run with little or no speedup, and thus
          represent sequential bottlenecks that limit scalability.
          They do not prevent jobs from running, but restrict the
          total number of nodes that can be effectively used:
             HF: solution of SCF equations
             HF analytic hessians: the coupled Hartree-Fock
             MCSCF: orbital improvement steps
             MCSCF/CI: Hamiltonian and 2e- density matrix
             energy localizations: the orbital localization step
             transition moments/spin-orbit: the final property step
          Future versions of GAMESS will address these bottlenecks.  

          A short summary of the useful number of nodes (based on
          data like the above) would be approximately
              RHF, ROHF, UHF, GVB energy and/or gradient  16-32+
              analytic hessians for these                   3-4
              RHF + MP2 gradient, ZAPT energy            64-128+
              UHF + MP2 energy                             8-16
              GUGA CI or MCSCF                              4-8

                                  * * *

              The final section of this description of DDI is a very
          sketchy introduction to programming.  At this point, if you
          are interested only in using the program, you may cease
          reading.

              DDI has subroutine calls to do ordinary message passing
          parallel programming.  These are calls to initialize and
          terminate the various processes, point to point send and
          receive, and collective operations like global sum and the
          broadcast.  Not necessarily every routine one would expect
          is included, as we just programmed what we needed in GAMESS.
          In addition we have calls for distributed data manipulation,
          which include creation and destruction of the arrays, the
          put, get, and accumulation operations mentioned above, and
          a routine to query what part of the distributed array is 
          stored locally.

              The full API for DDI is in comments in the beginning of
          the source file DDI.SRC, and you can then look at the
          individual routines to see what the calling arguments do.
          The source code in DDI.SRC, and calls to it from GAMESS are
          what serves as documentation for use of DDI at the present.
          Every DDI routine is a subroutine, not a function, and each
          begins with "DDI_" so you can easily locate all parallel
          constructs in GAMESS by a search for "CALL DDI_".  


C 25 OCT 99 - MWS - MPI DATA SERVERS PROBE, SEQUENTIAL CUTOUT ADDED C 29 AUG 99 - MWS - BUILD A DELAY INTO DDI_PEND ABNORMAL ENDS C 6 JUN 99 - GDF - COMBINED SOCKET/MPI IMPLEMENTATION OF DDI C C DDI.SRC C C<><><><><><><><> DISTRIBUTED DATA INTERFACE <><><><><><><><> C<><><><><><><><> WRITTEN BY G. FLETCHER <><><><><><><><> C C SUMMARY OF THE DISTRIBUTED DATA INTERFACE API: C GLOBAL TASKS C DDI_PBEG(NWDVAR) INITIALIZE DDI PARAMETERS C DDI_PEND(ISTAT) MAKE TIDY EXIT C DDI_NPROC(DDI_NP,DDI_ME) NUMBER OF NODES AND NODE ID C DDI_MEMORY(MEMREP,MEMDDI,EXETYP) ALLOCATE SHARED MEMORY REGION C DDI_CREATE(IDIM,JDIM,HANDLE) CREATE DISTRIBUTED MATRIX C DDI_DESTROY(HANDLE) DESTROY DISTRIBUTED MATRIX C DDI_DISTRIB(HANDLE,NODE,ILOC,IHIC,JLOC,JHIC) QUERY DM DISTRIBUTION C DDI_DLBRESET RESET DLB TASK COUNTER C DDI_SYNC(SNCTAG) BARRIER SYNCHRONIZATION C DDI_GSUMF(MSGTAG,BUFF,MSGLEN) FLOATING POINT GLOBAL SUM C DDI_GSUMI(MSGTAG,BUFF,MSGLEN) INTEGER GLOBAL SUM C DDI_BCAST(MSGTAG,TYPE,BUFF,LEN,FROM) BROADCAST DATA TO ALL NODES C C POINT-TO-POINT TASKS C DDI_SEND(SNDBUF,LEN,TO) SYNCHRONOUS SEND C DDI_RECV(RCVBUF,LEN,TO) SYNCHRONOUS RECEIVE C DDI_RCVANY(RCVBUF,LEN,TO) SYNCHRONOUS RECEIVE FROM ANY C DDI_DLBNEXT(DLB_COUNTER) GET NEXT DLB TASK INDEX C DDI_GET(HANDLE,ILO,IHI,JLO,JHI,BUFF) GET BLOCK OF MATRIX C DDI_PUT(HANDLE,ILO,IHI,JLO,JHI,BUFF) PUT BLOCK OF MATRIX C DDI_ACC(HANDLE,ILO,IHI,JLO,JHI,BUFF) ACCUMULATE DATA INTO BLOCK C C DDI_SUBPATCH AND DDI_SERVER ARE USED INTERNALLY ONLY. C C SEE INDIVIDUAL ROUTINES FOR DATA TYPES AND OTHER DETAILS C C<><><><><><><><> DATA-SERVER IMPLEMENTATION <><><><><><><><> C C ONE-SIDED DATA ACCESS VIA COMMUNICATION WITH DATA-SERVERS. C *SOC CODE'S TRANSPORT LAYER IS TCP/IP SOCKETS, IN THIS CASE C SEE ALSO SOC_XXX IN THIS FILE, DDISOC.C, AND DDIKICK.C C *MPI CODE'S TRANSPORT LAYER IS MPI VERSION 1 C C COMPUTE PROCESSES HAVE RANK'S 0,1,...,NP-1 C DATA SERVER PROCS HAVE RANK'S NP,NP+1,...,2*NP-1 C
We don't really intend for DDI to be a general parallel programming library, rather it's a part of GAMESS. For example, we need to link a small program using DDI against some GAMESS objects to have memory management code, and so forth. These are ddi.o, ddisoc.o, unport.o, and zunix.o and maybe a timing routine. We close with a simple (and not very useful) program that broadcasts information to all nodes. It illustrates the proper initialization and closure of the DDI library, requests replicated MEMORY but not distributed MEMDDI, and so does not illustrate distributed data programming. The idea was to keep it a one-pager. program bcast implicit double precision(a-h,o-z) parameter (maxmsg=500000) real tarray(2) common /fmcom / xx(1) data exetyp/8hRUN / c open DDI, tell it integer word length is 32 bit nwdvar=2 call ddi_pbeg(nwdvar) c request allocation of only replicated memory memrep=maxmsg memddi=0 call ddi_memory(memrep,memddi,exetyp) call setfm(memrep) c start a clock so we can time this xstart = etime(tarray) c learn which of the processes I am call ddi_nproc(nproc,me) master=0 if(me.eq.master) write(iw,9000) nproc c dynamically allocate a replicated array call valfm(loadfm) lbuff = loadfm + 1 last = lbuff + maxmsg need = last - loadfm - 1 call getfm(need) c fill it up with nothing but ones if(me.eq.master) then do i=1,maxmsg xx(lbuff-1+i) = 1.0d+00 end do end if c send it to all the other compute processes call ddi_bcast(102,'F',xx(lbuff),maxmsg,master) c we are now done with the replicated storage call retfm(need) c so, all we've done is time a broadcast. xstop = etime(tarray) write(6,9010) me,xstop-xstart c close the DDI library gracefully istat=0 call ddi_pend(istat) stop 9000 format(1x,'running',i4,' processes.') 9010 format(1x,'node',i4,' total job time between', * ' pbeg/pend is',f7.2) end

          Most UNIX versions use DDISOC.C to talk to TCP/IP sockets,
          and DDIKICK.C to load GAMESS for execution.

          The IBM mainframe version uses the following assembler
          language routines:  ZDATE.ASM, ZTIME.ASM.

              The machine dependencies noted above are:
          1) packing/unpacking           2) OPEN/CLOSE statments
          3) machine specification       4) fix total dynamic memory
          5) subroutine walkback         6) error handling calls
          7) timing calls                8) LOGAND function
          9) message passing calls.  DDI.SRC has both socket calls
             (*SOC) and MPI-1 calls (*MPI) programmed.
          10) vector library calls
 
                     List of parallel broadcast identifiers
 
              GAMESS uses DDI calls to pass messages between the
          parallel processes.  Every message is identified by a
          unique number, hence the following list of how the numbers
          are used at present.  If you need to add to these, look at
          the existing code and use the following numbers as
          guidelines to make your decision.  All broadcast numbers
          must be between 1 and 32767.
 
               20            : Parallel timing
              100 -  199     : DICTNRY file reads
              200 -  204     : Restart info from the DICTNRY file
              210 -  214     : Pread
              220 -  224     : PKread
              225            : RAread
              230            : SQread
              250 -  265     : Nameio
              275 -  310     : Free format
              325 -  329     : $PROP group input
              350 -  354     : $VEC group input
              400 -  424     : $GRAD group input
              425 -  449     : $HESS group input
              450 -  474     : $DIPDR group input
              475 -  499     : $VIB group input
              500 -  599     : matrix utility routines
              800 -  830     : Orbital symmetry
              900            : ECP 1e- integrals
              910            : 1e- integrals
              920 -  975     : EFP and SCRF integrals
              980 -  999     : property integrals
             1000 - 1025     : SCF wavefunctions
             1030 - 1041     : broadcasts in DFT
             1050            : Coulomb integrals
             1200 - 1215     : MP2
             1300 - 1320     : localization
             1495 - 1499     : reserved for Jim Shoemaker
             1500            : One-electron gradients
             1505 - 1599     : EFP and SCRF gradients
             1600 - 1602     : Two-electron gradients
             1605 - 1620     : One-electron hessians
             1650 - 1665     : Two-electron hessians
             1700 - 1750     : integral transformation
             1800            : GUGA sorting
             1850 - 1865     : GUGA CI diagonalization
             1900 - 1910     : GUGA DM2 generation
             2000 - 2010     : MCSCF
             2100 - 2120     : coupled perturbed HF
             2300 - 2399     : spin-orbit jobs
             7777            : SIMOMM