The correct way of building an MPI program using CMake

Many posts on this topic appear outdated. Modern cmake stresses target-specific configuration for each individual executable, library, etc. So a correct way of building an MPI program with cmake (version 3.9.2 for instance) is:

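find_package(MPI REQUIRED)
add_executable(my_mpi_bin src1.cpp src2.cpp)
# Include paths must be directories, not header files; my_include_dir is a placeholder.
target_include_directories(my_mpi_bin PRIVATE ${MPI_CXX_INCLUDE_PATH} my_include_dir)
# my_compile_flags and my_link_flags are placeholders for project-specific flags.
target_compile_options(my_mpi_bin PRIVATE ${MPI_CXX_COMPILE_FLAGS} my_compile_flags)
target_link_libraries(my_mpi_bin PRIVATE ${MPI_CXX_LIBRARIES} ${MPI_CXX_LINK_FLAGS} my_link_flags)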

If the MPI implementation (MPICH-3.2 for instance) is installed at a certain location that cmake cannot find by itself, we may explicitly specify the path:

cmake \
-DMPI_CXX_COMPILER=/usr/local/mpich-install/bin/mpicxx \
-DMPI_C_COMPILER=/usr/local/mpich-install/bin/mpicc \
-DMPIEXEC=/usr/local/mpich-install/bin/mpiexec.hydra

Note that MPI_CXX_COMPILER and MPI_C_COMPILER only tell cmake which wrappers to interrogate for compile/link flags; cmake does NOT use them to compile the program. We do NOT need to change CMAKE_C_COMPILER and CMAKE_CXX_COMPILER via the command line. After all, the MPI compilers are simply wrappers that supply the correct compile/link flags, which cmake has already derived automatically. Moreover, we MUST NOT set CMAKE_C_COMPILER and CMAKE_CXX_COMPILER inside the cmake script. This is unfortunately a very common anti-pattern.

If we want to register the MPI program as a test:

add_test(NAME my_mpi_test
         COMMAND ${MPIEXEC}
         ${MPIEXEC_NUMPROC_FLAG}
         ${MPIEXEC_MAX_NUMPROCS}
         ${MPIEXEC_PREFLAGS}
         ${CMAKE_CURRENT_BINARY_DIR}/my_mpi_bin
         ${MPIEXEC_POSTFLAGS}
         my_arg_1 my_arg_2 ...)

What phrase is considered toxic but may apply now? — GG EZ

Reference:
https://cmake.org/cmake/help/v3.9/module/FindMPI.html
http://www.urbandictionary.com/define.php?term=ggez

Building Geant4 with Cygwin 64 on Windows 7/10 without Visual Studio

The official documentation on Geant4 + Cygwin + Visual Studio is not quite clear to me, so here we use Cygwin alone to compile the code. A few cmake scripts and source files have to be modified in order to build Geant4 without errors.

Test conditions

  • Geant4 10.02.p02 / Geant4 10.03.p02
  • Windows 7 (64-bit) / Windows 10 (64-bit)
  • Cygwin 64 with gcc and g++ version 5.4.0
  • cmake version 3.6.2

Steps

  • Modify cmake scripts.
    • In both cmake\Modules\Geant4LibraryBuildOptions.cmake and cmake\Modules\Geant4BuildProjectConfig.cmake, enable C++ compiler extensions, i.e.
      set(CMAKE_CXX_EXTENSIONS ON)
      
    • In cmake\Modules\Geant4LibraryBuildOptions.cmake, add at least one compile feature that requires C++11, so that cmake emits a -std flag (gnu++11 once extensions are enabled). For example, cxx_constexpr, i.e.
      set(GEANT4_TARGET_COMPILE_FEATURES
          # ... existing compile features stay as they are ...
          cxx_constexpr
          # ... existing comments stay as they are ...
          )
      

      In addition, add gnu++11 to the option list, i.e.

      enum_option(GEANT4_BUILD_CXXSTD
            DOC "C++ Standard to compile against"
            VALUES 11 14 c++11 c++14 gnu++11
            CASE_INSENSITIVE
            )
      

      The above steps are crucial, because they ensure that cmake automatically adds the compiler flag -std=gnu++11 rather than -std=c++11. With -std=c++11, the condition __POSIX_VISIBLE >= 200112 in stdlib.h is not satisfied (presumably because strict ISO mode hides the POSIX extensions), and consequently the function posix_memalign() is not declared.

  • Modify source code.
    • In source\processes\electromagnetic\dna\utils\include\G4MoleculeGun.hh, add declarations of the explicit specializations immediately after the class definition:
      template<typename TYPE>
      class TG4MoleculeShoot : public G4MoleculeShoot
      {
      ...
      };
      // above is class definition
      // add declaration of explicit specialization below
      
      template<> void TG4MoleculeShoot<G4Track>::ShootAtRandomPosition(G4MoleculeGun* gun);
      template<> void TG4MoleculeShoot<G4Track>::ShootAtFixedPosition(G4MoleculeGun* gun);
      template<> void TG4MoleculeShoot<G4Track>::Shoot(G4MoleculeGun* gun);
      

      Otherwise the build fails with multiple-definition errors.

    • In source\global\management\src\G4Threading.cc, comment out the syscall.h include. Apparently Cygwin does not provide the OS-specific header file syscall.h, and thus does not support the multithreading in Geant4 that relies on it.
      // #include <sys/syscall.h>
      
  • Create an out-of-source build using cmake. Due to the lack of syscall.h in Cygwin, only single-threaded Geant4 can be built.
    • Release build
      cmake ../geant4.10.02.p02 -DCMAKE_C_COMPILER=/usr/bin/gcc.exe \
      -DCMAKE_CXX_COMPILER=/usr/bin/g++.exe \
      -DCMAKE_INSTALL_PREFIX=/opt/geant4/release \
      -DGEANT4_BUILD_CXXSTD=gnu++11 \
      -DCMAKE_BUILD_TYPE=Release
      
    • Or debug build
      cmake ../geant4.10.02.p02 -DCMAKE_C_COMPILER=/usr/bin/gcc.exe \
      -DCMAKE_CXX_COMPILER=/usr/bin/g++.exe \
      -DCMAKE_INSTALL_PREFIX=/opt/geant4/debug \
      -DGEANT4_BUILD_CXXSTD=gnu++11 \
      -DCMAKE_BUILD_TYPE=Debug
      
  • Build and install. make && make install.
  • Have fun with Geant4 !!! … and remember:

    If you love something, set it free.
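
The explicit-specialization step above can be illustrated with a small, self-contained sketch. Gun, Shooter and shoot_once are hypothetical stand-ins invented for this illustration (they play the roles of G4MoleculeGun and TG4MoleculeShoot); this is not Geant4 code. The idea, as far as I understand it, is that the declaration tells every translation unit that a specialization exists somewhere, so the generic member is not instantiated a second time for that type.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for G4MoleculeGun; not the Geant4 class.
struct Gun {};

// Hypothetical stand-in for TG4MoleculeShoot<TYPE>.
template <typename T>
class Shooter {
public:
    // Generic member, used for every T without an explicit specialization.
    std::string Shoot(Gun*) { return "generic"; }
};

// Declaration of the explicit specialization. In a real project this line
// sits in the header, right after the class definition, so that no
// translation unit instantiates the generic version for int.
template <> std::string Shooter<int>::Shoot(Gun*);

// Definition of the explicit specialization. In a real project this lives
// in exactly one source file.
template <> std::string Shooter<int>::Shoot(Gun*) { return "specialized"; }

// Small helper so that both variants can be exercised with one call.
template <typename T>
std::string shoot_once() {
    Gun gun;
    return Shooter<T>().Shoot(&gun);
}

// Sanity check that can be called from main().
inline void check_shooter() {
    assert(shoot_once<int>() == "specialized");
    assert(shoot_once<double>() == "generic");
}
```

Calling check_shooter() from main() exercises both paths: int picks the specialized member, while any other type falls back to the generic one.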

Prefetch on Intel MIC coprocessor

[updated on April 6, 2016]

Software-based data prefetch on Intel MIC coprocessors is very useful for Monte Carlo transport code. It helps hide the long latency when loading microscopic cross-section data from DRAM. There are a total of 8 different types of prefetch with subtle differences. Here we tell them apart.

Cache hierarchy

A MIC has a 32 KB L1 cache and a 512 KB L2 cache per core. Here “cache” means the data cache rather than the instruction cache, and “core” means the physical core rather than a logical core. Both levels of cache implement the MESI coherency protocol and have a cache line size of 64 bytes (i.e. 8 consecutive FP64 values).
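
As a quick check of the cache-line arithmetic, here is a minimal sketch. The names kCacheLineBytes, kDoublesPerLine and lines_touched are made up for this illustration, and the code assumes that double is an 8-byte FP64 type.

```cpp
#include <cassert>  // assert, for the checks mentioned in the usage note
#include <cstddef>

// Cache line size quoted in the text for both L1 and L2 on the MIC.
constexpr std::size_t kCacheLineBytes = 64;

// Number of FP64 values that fit in one cache line.
constexpr std::size_t kDoublesPerLine = kCacheLineBytes / sizeof(double);
static_assert(kDoublesPerLine == 8, "assumes an 8-byte double");

// Number of cache lines touched by `count` consecutive doubles whose first
// element starts `byte_offset` bytes into a cache line.
constexpr std::size_t lines_touched(std::size_t byte_offset, std::size_t count) {
    return count == 0
               ? 0
               : (byte_offset % kCacheLineBytes + count * sizeof(double) +
                  kCacheLineBytes - 1) / kCacheLineBytes;
}
```

For instance, lines_touched(0, 8) evaluates to 1, whereas lines_touched(8, 8) evaluates to 2: eight doubles fill exactly one line only when the first one is aligned to a 64-byte boundary.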

Prefetch instruction

Let’s take a look at two orthogonal concepts first:

  • non-temporal hint (NTA): tells the hardware that the data will be used only once, so that they are evicted from the cache after the first use (i.e. the most recently used data are evicted first).
  • exclusive hint (E): puts the cache line on the current core in the “exclusive” state and invalidates the copies of that line on other cores.

The combination of temporality, exclusiveness, and locality (L1 or L2) together yields 8 types of instructions supported by the present-day Knights Corner MIC. They specify how the data are expected to be uniquely handled in the cache, enumerated below.

instruction      hint             purpose
vprefetchnta     _MM_HINT_NTA     loads data to L1 and L2 cache, marks it as NTA
vprefetch0       _MM_HINT_T0      loads data to L1 and L2 cache
vprefetch1       _MM_HINT_T1      loads data to L2 cache only
vprefetch2       _MM_HINT_T2      loads data to L2 cache only, marks it as NTA (the mnemonic is counter-intuitive, as there is no NTA in it)
vprefetchenta    _MM_HINT_ENTA    exclusive version of vprefetchnta
vprefetche0      _MM_HINT_ET0     exclusive version of vprefetch0
vprefetche1      _MM_HINT_ET1     exclusive version of vprefetch1
vprefetche2      _MM_HINT_ET2     exclusive version of vprefetch2

Note that the L2 cache of the MIC is inclusive in the sense that it holds a copy of all the data in L1.

There are two ways of implementing prefetch in C — intrinsic and assembly.

// method 1: intrinsic (declared in <immintrin.h>)
_mm_prefetch((const char*)addr, hint);

// method 2: inline assembly (GCC-style AT&T syntax; addr is passed in a register)
asm volatile ("prefetch_inst (%0)" :: "r"(addr));

Here addr is the address of the byte starting from which to prefetch, prefetch_inst is one of the prefetch instructions listed above, and hint is the parameter for the compiler intrinsic. We would like to emphasize again that _MM_HINT_T2 and _MM_HINT_ET2 are counter-intuitive. In fact they are misnomers, as both are non-temporal. Intel should have named them _MM_HINT_NTA2 and _MM_HINT_ENTA2.
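
To show how the intrinsic is typically used to hide latency in a table-lookup loop, here is a minimal sketch. The names prefetch_t0 and gather_sum, and the default lookahead distance of 8, are assumptions made for this illustration and are not tuned for any machine. It uses _MM_HINT_T0 because that hint compiles with ordinary x86 compilers; on a MIC one would pick the hint from the table above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0
// Issue a T0 prefetch for the cache line that contains addr.
inline void prefetch_t0(const void* addr) {
    _mm_prefetch(static_cast<const char*>(addr), _MM_HINT_T0);
}
#else
// Non-x86 target: compile the prefetch out so that the sketch still builds.
inline void prefetch_t0(const void*) {}
#endif

// Sum table[idx[i]] over all i, prefetching the entry that will be needed
// `distance` iterations later. Every index must be smaller than table.size().
inline double gather_sum(const std::vector<double>& table,
                         const std::vector<std::size_t>& idx,
                         std::size_t distance = 8) {
    double sum = 0.0;
    for (std::size_t i = 0; i < idx.size(); ++i) {
        assert(idx[i] < table.size());
        if (i + distance < idx.size()) {
            // Start loading a future entry while the current one is consumed.
            prefetch_t0(&table[idx[i + distance]]);
        }
        sum += table[idx[i]];
    }
    return sum;
}
```

The prefetch never changes the result; it only affects timing. Whether it helps depends on the access pattern and on the lookahead distance, which would need to be measured on the target hardware.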

Prefetch on CPUs

So how about prefetch on Intel Xeon CPUs? Well, it turns out to be very different! Check the list below.

instruction    hint            purpose
prefetchnta    _MM_HINT_NTA    loads data to L2 and L3 cache, marks it as NTA
prefetcht0     _MM_HINT_T0     loads data to L2 and L3 cache
prefetcht1     _MM_HINT_T1     equivalent to prefetcht0
prefetcht2     _MM_HINT_T2     equivalent to prefetcht0
prefetchw      n/a [b]         exclusive version of prefetcht0 [a]
prefetchwt1    n/a [c]         equivalent to prefetchw [a]

[a] not confirmed
[b] icpc does not compile _mm_prefetch((const char*)addr, _MM_HINT_ENTA);
[c] icpc does not compile _mm_prefetch((const char*)addr, _MM_HINT_ET1);

Note that the L3 cache of the Intel Xeon CPU is inclusive in the sense that it holds a copy of all the data in L2.

Reference
[1] Intel Xeon Phi Coprocessor Instruction Set Architecture Reference Manual, 2012.
[2] Intel 64 and IA-32 Architectures Software Developer’s Manual, 2015.