Problem during boundary condition assembly

Hello All,

I am encountering an issue during assembly of my boundary conditions in the Python interface: after assembling some of the conditions (7 out of 12, in my case), the program stops producing output and the processor usage drops to essentially zero.

No error messages are produced; the output from the code simply stops. I noticed that this seems to happen when memory usage is around 4.7 GB, and I am wondering if GetFEM has any memory usage limits that could be causing this. My HPC node has 255 GB of memory and I have confirmed that my version of Python is 64-bit, so there should not be any memory limitations on that side.

I am able to run the exact same simulation in around 3 minutes with a coarser mesh and no problems, but both that simulation (4,096 vertices) and this one (46,656 vertices) are too small for my actual problem, which I estimate will need around 8,000,000 vertices. I would therefore like to debug this before attempting the full simulation for my paper.

I am getting a warning,

Level 1 Warning in getfem_regular_meshes.cc, line 35: CAUTION : Simplexification in dimension >= 5 has not been tested and the resulting mesh should be not conformal

but this warning also appears in the smaller simulation, which produces results regardless.
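
For context, the mesh in my script is built with the regular-simplices constructor; the following is a stripped-down sketch (placeholder grid sizes and variable names, not my actual script) that reproduces the warning in 6D:

import numpy as np
import getfem as gf

# 6D regular simplex mesh on the unit hypercube; dimension >= 5 triggers the warning
coords = [np.linspace(0.0, 1.0, 4) for _ in range(6)]  # 4 points per axis -> 4^6 = 4096 vertices
mesh = gf.Mesh('regular simplices', *coords)
print(mesh.dim(), mesh.nbpts())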

Please let me know if anyone knows what is causing this problem.

Sincerely,
Eric Comstock

I guess @yves.renard wrote this code some decades ago and so far nobody has been brave enough to test it for N > 4. Could you maybe audit the implementation yourself, to see if the warning could simply be removed?

  void parallelepiped_regular_simplex_mesh_
  (mesh &me, dim_type N, const base_node &org,
   const base_small_vector *ivect, const size_type *iref) {
    bgeot::convex<base_node>
      pararef = *(bgeot::parallelepiped_of_reference(N));

    if (N >= 5) GMM_WARNING1("CAUTION : Simplexification in dimension >= 5 "
                             "has not been tested and the resulting mesh "
                             "should be not conformal");

    const bgeot::mesh_structure &sl
      = *(bgeot::parallelepiped_of_reference(N)->simplexified_convex());

    base_node a = org;
    size_type i, nbpt = pararef.nb_points();

    for (i = 0; i < nbpt; ++i) {
      gmm::clear(a);
      for (dim_type n = 0; n < N; ++n)
        gmm::add(gmm::scaled(ivect[n],pararef.points()[i][n]),a);
      pararef.points()[i] = a;
    }

    // bgeot::simplexify(cvt, sl, pararef.points(), N, me.eps());

    size_type nbs = sl.nb_convex();
    std::vector<size_type> tab1(N+1), tab(N), tab3(nbpt);
    size_type total = 0;
    std::fill(tab.begin(), tab.end(), 0);
    while (tab[N-1] != iref[N-1]) {
      for (a = org, i = 0; i < N; i++)
        gmm::add(gmm::scaled(ivect[i],scalar_type(tab[i])),a);
        //a.addmul(scalar_type(tab[i]), ivect[i]);

      for (i = 0; i < nbpt; i++)
        tab3[i] = me.add_point(a + pararef.points()[i]);

      for (i = 0; i < nbs; i++) {
        const mesh::ind_cv_ct &tab2 = sl.ind_points_of_convex(i);
        for (dim_type l = 0; l <= N; l++)
          // tab1[l] = tab3[tab2[l]];
          tab1[l] = tab3[(tab2[l]
                          + (((total & 1) && N != 3) ? (nbpt/2) : 0)) % nbpt];
        me.add_simplex(N, tab1.begin());
      }

      for (dim_type l = 0; l < N; l++) {
        tab[l]++; total++;
        if (l < N-1 && tab[l] >= iref[l]) { total -= tab[l]; tab[l] = 0; }
        else break;
      }
    }
  }

It makes sense to make sure things work and scale properly in a small problem before moving to your actual problem. So if I understand correctly, apart from the aforementioned warning, your smallest simulation (4096 vertices) runs and produces plausible results, while the intermediate problem (46656 vertices) crashes without an error message. Is the crash occurring on Windows or Linux? On Linux it should be quite easy to get a backtrace.

Hello Kostas,

Thank you for the information and help!

I would have to take a closer look, so let me get back to you regarding the implementation. It has been a minute since I worked directly in C++.

I am currently running the code in six dimensions (three for position, three for velocity). You understand correctly: the smallest case produces results that make sense given the small simulation size. The intermediate case does not actually crash - Python never stops on its own, it just drops to 0 CPU usage after partially building the boundary conditions, and I have to close it manually via Task Manager to make it stop.

For this problem I am using the Anaconda/Python 3.7 build of GetFEM available from getfem.org through the installer .exe. It is on Windows, but it is not the version I was trying to build in my other question (I am trying to solve this issue and the other one in parallel).

Sincerely,
Eric Comstock

Ok, would it be possible to run the intermediate case in WSL, where I can support you much better with this debugging?

Most people I know who use GetFEM on Windows just run it through WSL.

Hello Kostas,

Thank you for the suggestion! I will see if I can get that running. Given that no output seems to be produced when it stops running on the CPU, what should I look for to send you to help debug the issue?

I am actually having some trouble with our supercomputer, though, so it might take a few days to sort out. Thank you again!

Sincerely,
Eric Comstock

In WSL I can show you how to run your example within a debugger and get a backtrace, and how to build GetFEM with debug information included if necessary.

Hello Kostas,

I actually have a Linux installation of GetFEM on our supercomputer, but I am having an SSH issue and have been in contact with their tech support for a couple of days.

Would a native Linux operating system (not WSL) be capable of running the debugger you have in mind, or should I install GetFEM on a separate WSL system to run it?

Sincerely,
Eric Comstock

Any Linux distribution supports the toolchain that I am familiar with, but WSL with Ubuntu will be the simplest option.

Hello Kostas,

Okay - I sorted out the problem on the supercomputer, and the GetFEM files seem to run correctly. I am currently rerunning them there to see if I hit the same problem.

Our supercomputer uses Red Hat Enterprise Linux 9 as its OS, so it would probably be easier on my end since we got GetFEM installed on it back in February. Would this operating system be acceptable?

If so, which toolchain for debugging should I use?

Sincerely,
Eric Comstock

I guess you built GetFEM with gcc, which is the most tested compiler. What was your configure command?

On our cluster, GetFEM is built with

./autogen.sh
./configure --prefix="$INSTALLDIR/getfem" \
LDFLAGS="-L$INSTALLDIR/mumps/lib -L$INSTALLDIR/metis/lib -L$INSTALLDIR/qhull/lib" \
CPPFLAGS="-I$INSTALLDIR/mumps/include -I$INSTALLDIR/pmetis/include -I$INSTALLDIR/qhull/include" \
LIBS="-lmumps_common -lpord -lgfortran -lmetis" \
--with-blas="$STATICBLAS -lpthread" \
--with-pic --with-optimization=-O3 --disable-matlab --enable-python \
--with-mumps="-lsmumps -ldmumps -lcmumps -lzmumps"
make -j4
make install

The full compilation script for GetFEM and its dependencies can be found here:
https://gitlab.gbar.dtu.dk/-/snippets/52

Adding CXXFLAGS=-g to the configure command and using -O0 optimization instead of -O3 before building will create a debug version of GetFEM.
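
For example, on our cluster a debug build would be configured roughly like this (same paths and library options as in the script above, only the optimization flags change):

./configure --prefix="$INSTALLDIR/getfem" \
CXXFLAGS="-g" \
--with-optimization=-O0 \
--with-pic --disable-matlab --enable-python
# plus the same LDFLAGS, CPPFLAGS, LIBS, --with-blas and --with-mumps options as above
make -j4
make install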

Hello Kostas,

Thank you again!

One of our supercomputer technical people installed it on our end - I will ask what their configure command was.

Sincerely,
Eric Comstock

Hello Kostas,

By the way, I can confirm that the same issue takes place on Linux as on Windows.

Sincerely,
Eric Comstock

Hello Kostas,

I checked some of my notes from February, and it looks like we installed it using mamba - the installation procedure appears to have been

module load mamba
mamba create -n getfem
mamba activate getfem
mamba install conda-forge::getfem
mamba install mkl mkl-include

Sincerely,
Eric Comstock

Ok, so you did not build the software on the cluster but installed prebuilt binaries of an older version, 5.4.2.

I don’t get why you need mkl; the conda package lists openblas as a dependency, not mkl:
https://github.com/conda-forge/getfem-feedstock/blob/main/recipe/meta.yaml

I would try to get rid of mkl to exclude a possible source of issues.

It would also make sense to test the latest version 5.4.4 to be sure that you did not hit some bug that has already been fixed, or, even better, the development version from the GetFEM git repository.

You can try to run your failing case with:

gdb --args python3 yourscriptname.py

This will bring you into the gdb terminal, where you can start your script with the command

r

Wait until the program crashes, or, if it gets stuck, interrupt it with Ctrl+C, and then run the command

bt

which will give you a backtrace that you can post here.

I highly recommend compiling the development version of GetFEM from the git repository in WSL-Ubuntu. This will be the easiest way to resolve the issue.
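
The build itself would look roughly like this (a sketch, assuming the usual build dependencies such as gcc, automake, libtool and libqhull-dev are already installed in Ubuntu, and using an installation prefix of your choice):

git clone https://git.savannah.nongnu.org/git/getfem.git
cd getfem
./autogen.sh
./configure --prefix=$HOME/getfem-debug --enable-python CXXFLAGS="-g" --with-optimization=-O0
make -j4
make install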

Hello Kostas,

This seems to work, but is taking a long time to run (~100x longer than without the debugger), so I will have an update on this matter tomorrow.

I will work in parallel on getting GetFEM running on WSL-Ubuntu.

Thank you again!

Sincerely,
Eric Comstock

Hello Kostas,

I let the code run overnight, and it looks like the “freeze” eventually ended with the code reporting that pivots were too small. This is odd, because I chose my position and momentum coordinates such that the point distributions are very nearly hypercubical, and a similar mesh worked at a smaller size. I am meshing with 6D simplices - should I switch to a different meshing strategy, or is the solution to increase precision?

The debugger output is as follows:

(getfem) [ecomstock3@atl1-1-02-019-7-2 scratch]$ gdb --args python3 test6D.py
GNU gdb (GDB) Red Hat Enterprise Linux 10.2-13.el9
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python3...
(gdb) r
Starting program: python3 test6D.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x1555124d4640 (LWP 1849282)]
[New Thread 0x1555122d3640 (LWP 1849283)]
[New Thread 0x1555120d2640 (LWP 1849284)]
[New Thread 0x155511ed1640 (LWP 1849285)]
[New Thread 0x155511cd0640 (LWP 1849286)]
[New Thread 0x155511acf640 (LWP 1849287)]
[New Thread 0x1555118ce640 (LWP 1849288)]
Trace 2 in getfem_models.cc, line 4401: Mass term assembly for Dirichlet condition
Trace 2 in getfem_models.cc, line 4438: Source term assembly for Dirichlet condition
[New Thread 0x1554f99d4ac0 (LWP 1849386)]
[New Thread 0x1554efffeb40 (LWP 1849387)]
[New Thread 0x1554e3ffebc0 (LWP 1849388)]
[New Thread 0x1554d7ffec40 (LWP 1849389)]
[New Thread 0x1554cbffecc0 (LWP 1849390)]
[New Thread 0x1554bfffed40 (LWP 1849391)]
[New Thread 0x1554b3ffedc0 (LWP 1849392)]
Trace 2 in getfem_models.cc, line 4401: Mass term assembly for Dirichlet condition
Trace 2 in getfem_models.cc, line 4438: Source term assembly for Dirichlet condition
Trace 2 in getfem_models.cc, line 4401: Mass term assembly for Dirichlet condition
Trace 2 in getfem_models.cc, line 4438: Source term assembly for Dirichlet condition
Trace 2 in getfem_models.cc, line 4401: Mass term assembly for Dirichlet condition
Trace 2 in getfem_models.cc, line 4438: Source term assembly for Dirichlet condition
Trace 2 in getfem_models.cc, line 4401: Mass term assembly for Dirichlet condition
Trace 2 in getfem_models.cc, line 4438: Source term assembly for Dirichlet condition
Trace 2 in getfem_models.cc, line 4401: Mass term assembly for Dirichlet condition
Trace 2 in getfem_models.cc, line 4438: Source term assembly for Dirichlet condition
Trace 2 in getfem_models.cc, line 4401: Mass term assembly for Dirichlet condition
Trace 2 in getfem_models.cc, line 4438: Source term assembly for Dirichlet condition
Trace 2 in getfem_models.cc, line 3478: Generic linear assembly brick: generic matrix assembly
Trace 2 in getfem_models.cc, line 4401: Mass term assembly for Dirichlet condition
Trace 2 in getfem_models.cc, line 4438: Source term assembly for Dirichlet condition
Trace 2 in getfem_models.cc, line 4401: Mass term assembly for Dirichlet condition
Trace 2 in getfem_models.cc, line 4438: Source term assembly for Dirichlet condition
Trace 2 in getfem_models.cc, line 4401: Mass term assembly for Dirichlet condition
Trace 2 in getfem_models.cc, line 4438: Source term assembly for Dirichlet condition
Trace 2 in getfem_models.cc, line 4401: Mass term assembly for Dirichlet condition
Trace 2 in getfem_models.cc, line 4438: Source term assembly for Dirichlet condition
Trace 2 in getfem_models.cc, line 4401: Mass term assembly for Dirichlet condition
Trace 2 in getfem_models.cc, line 4438: Source term assembly for Dirichlet condition
Trace 2 in getfem_models.cc, line 4401: Mass term assembly for Dirichlet condition
Trace 2 in getfem_models.cc, line 4438: Source term assembly for Dirichlet condition
Trace 2 in getfem_models.cc, line 4401: Mass term assembly for Dirichlet condition
Trace 2 in getfem_models.cc, line 4438: Source term assembly for Dirichlet condition
Trace 2 in getfem_models.cc, line 3310: Generic source term assembly
Trace 2 in getfem_models.cc, line 3321: Source term: generic source term assembly
Trace 2 in getfem_models.cc, line 3310: Generic source term assembly
Trace 2 in getfem_models.cc, line 3321: Source term: generic source term assembly
Trace 2 in getfem_models.cc, line 3310: Generic source term assembly
Trace 2 in getfem_models.cc, line 3321: Source term: generic source term assembly
Trace 2 in getfem_models.cc, line 3310: Generic source term assembly
Trace 2 in getfem_models.cc, line 3321: Source term: generic source term assembly
Trace 2 in getfem_models.cc, line 3310: Generic source term assembly
Trace 2 in getfem_models.cc, line 3321: Source term: generic source term assembly
Level 2 Warning in ../../src/gmm/gmm_precond_ilu.h, line 182: pivot 722 is too small
...
Level 2 Warning in ../../src/gmm/gmm_precond_ilu.h, line 182: pivot 458202 is too small
 iter   0 residual         -nan
Level 2 Warning in ../../src/getfem/getfem_model_solvers.h, line 107: gmres did not converge!
[Thread 0x1555118ce640 (LWP 1849288) exited]
[Thread 0x155511acf640 (LWP 1849287) exited]
[Thread 0x155511cd0640 (LWP 1849286) exited]
[Thread 0x155511ed1640 (LWP 1849285) exited]
[Thread 0x1555120d2640 (LWP 1849284) exited]
[Thread 0x1555122d3640 (LWP 1849283) exited]
[Thread 0x1555124d4640 (LWP 1849282) exited]
[Thread 0x1554f99d4ac0 (LWP 1849386) exited]
[Thread 0x1554efffeb40 (LWP 1849387) exited]
[Thread 0x1554e3ffebc0 (LWP 1849388) exited]
[Thread 0x1554d7ffec40 (LWP 1849389) exited]
[Thread 0x1554cbffecc0 (LWP 1849390) exited]
[Thread 0x1554bfffed40 (LWP 1849391) exited]
[Thread 0x1554b3ffedc0 (LWP 1849392) exited]
[Inferior 1 (process 1849278) exited normally]
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-100.el9_4.4.x86_64


Sincerely,
Eric Comstock

How many degrees of freedom do you have?

Iterative solvers can be very sensitive to the conditioning of the matrix. Develop and test your model using a direct solver first. If you use the model solve() function for your solution, you can request a direct solver like this:

md.solve("noisy", "lsolver", "mumps")

or if you are solving the system manually:

md.assembly("build_all")
sol = gf.linsolve_mumps(md.tangent_matrix(), md.rhs())
md.to_variables(sol.flatten())

mumps will work for up to 10^6 degrees of freedom or a bit more, depending on the RAM you have.

Hello Kostas,

Because my problem is in 6D, it has a much larger number of degrees of freedom per vertex than normal problems.

I am unsure exactly how many simplices the solver is using per hexeract of the simulation, but it should be around 308 (Robert B. Hughes, Michael R. Anderson, A triangulation of the 6-cube with 308 simplices, Discrete Mathematics, Volume 117, Issues 1–3, 1993, Pages 253–256, ISSN 0012-365X, https://doi.org/10.1016/0012-365X(93)90339-U), which, at 7 vertices per 6-simplex, would give approximately 308 × 7 = 2156 DoF per cell.

For the 4096-vertex case, it would have ~1.6 million DoF, the intermediate case has ~34 million DoF, and the large simulation would have ~11 billion DoF.
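
In case it is useful, the back-of-the-envelope calculation behind these numbers is roughly the following (it counts vertices shared between simplices multiple times, so it is an upper bound; 14 points per axis is the closest regular grid to my ~8,000,000-vertex target):

# Rough DoF estimate: (points per axis - 1)^6 hexeracts, 308 simplices each, 7 vertices per 6-simplex
def dof_estimate(points_per_axis):
    cells = (points_per_axis - 1) ** 6
    return cells * 308 * 7

print(dof_estimate(4))   # 4^6 = 4096 vertices    -> ~1.6e6 DoF
print(dof_estimate(6))   # 6^6 = 46656 vertices   -> ~3.4e7 DoF
print(dof_estimate(14))  # 14^6 ~ 7.5e6 vertices  -> ~1.0e10 DoF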

Thank you for the advice regarding the direct solver! I think the problem is solved for a second test case I had (15625 vertices, which had previously failed with the same issue) and which now seems to converge, but my 46656-vertex case is still running at the moment and seems to be doing work on the CPU.

Sincerely,
Eric Comstock

Makes sense. Your matrices will also be much less sparse than usual. I think you will inevitably hit several bottlenecks that will need to be addressed to make your model run efficiently.

I would recommend doing some profiling on the small problem to get an idea of where you spend most of your computation effort. You can try:

python3 -m cProfile -o benchmark.stats simulation.py
python3 -m pstats benchmark.stats
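
Inside the pstats prompt, commands like

sort cumulative
stats 20

will list the 20 most expensive calls sorted by cumulative time.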