Hello Kostas,
Thank you for the information and help! I will learn these tools, and work on this.
Sincerely,
Eric Comstock
Hello Kostas,
Thank you!
I have gotten a GetFEM installation with parallel MUMPS, mpi4py, and OpenMPI running.
I have not been able to find a usage guide or discussion on how to use mpi4py with GetFEM, either on Discourse or on getfem.org. How is mpi4py called? What types of objects do I need to create for the custom partition?
Sincerely,
Eric Comstock
Hello Kostas,
What part of the documentation is the partitioning described in?
Sincerely,
Eric Comstock
Hi Eric
The Metis-based MPI parallelization is described here:
Since Metis does not support partitioning of higher-dimensional meshes, there are two options for you:
Let's focus on #1. The documentation for mesh regions can be found here
and here for the Python API.
The basic idea is to create an mpi4py-capable script, let's say for a 2D problem using 2 MPI processes, like this:
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
....
RG_CURRENT_PROC = 99  # just an identifier for the region
if rank == 0:    # all elements with X0 <= X <= X1
    CVIDs = m.convexes_in_box([X0-1e-5, Y0-1e-5], [X1+1e-5, Y1+1e-5])
elif rank == 1:  # all elements with X1 <= X <= X2
    CVIDs = m.convexes_in_box([X1-1e-5, Y0-1e-5], [X2+1e-5, Y1+1e-5])
CVIDs = np.unique(CVIDs)  # remove any duplicates from the list
m.set_region(RG_CURRENT_PROC, [CVIDs, -np.ones(len(CVIDs))])
...
for timestep in timesteps:
    ...
    model.assembly("build_all")
    rhs = model.rhs()           # this rhs is the sum of the rhs's from all processes;
                                # GetFEM calls the MPI broadcast internally
    K = model.tangent_matrix()  # this contains only the part of the stiffness matrix
                                # assembled on the current process. If you want the
                                # entire stiffness matrix you need to do all
                                # necessary MPI broadcasts yourself, which can
                                # be tricky for sparse matrices. MUMPS does not
                                # require this, it works with distributed matrices.
    ...                         # do something with rhs and K
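Purely for illustration, and not something you normally need since MUMPS works with the distributed matrix directly: if you did want the entire stiffness matrix on one process, a minimal mpi4py sketch could look like the following, assuming the local part has already been converted to a scipy.sparse object K_local (the conversion from the GetFEM sparse matrix object is not shown here).

from mpi4py import MPI
import numpy as np
import scipy.sparse as sp

comm = MPI.COMM_WORLD

def gather_full_matrix(K_local, root=0):
    """Sum the per-process sparse contributions onto one rank (illustration only)."""
    coo = K_local.tocoo()
    # Ship the (row, col, value) triplets of each local part to the root rank.
    triplets = comm.gather((coo.row, coo.col, coo.data), root=root)
    if comm.Get_rank() != root:
        return None
    rows = np.concatenate([t[0] for t in triplets])
    cols = np.concatenate([t[1] for t in triplets])
    vals = np.concatenate([t[2] for t in triplets])
    # coo_matrix keeps duplicate entries and tocsr() sums them, which is exactly
    # the summation of overlapping contributions from different processes.
    return sp.coo_matrix((vals, (rows, cols)), shape=K_local.shape).tocsr()

In practice you would pass the distributed parts to MUMPS and skip this communication entirely.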
For parallel MUMPS solves, I have added an example that may be useful to you, here
Hello Kostas,
Thank you so much! I will try these options and review the documentation you sent, and let you know how it goes.
Sincerely,
Eric Comstock
Hello Kostas,
I have a couple of questions:
First, what does gf.MumpsContext do? I have been unable to find it in the documentation.
Second, is it possible to only parallelize certain parts of the code? I have noticed that when I run the code (using the parallel mpirun -n call), any parts that are not parallel will be run individually on each processor, making it take substantially longer to execute.
Edit: I figured out the partial parallelization method
Sincerely,
Eric Comstock
The MUMPS context object creates a reusable instance of the MUMPS solver, which you can talk back and forth with. You cannot find it in the documentation because I have only just implemented it.
MPI parallelism is like that: your entire program is run several times, including all parts of it, unless you do something about it. It is like running n copies of your program, where these n copies can send data to each other. The only thing each copy knows is its rank (something like a process id), so you need to write your program so that it checks its rank and performs tasks accordingly.
In all cases, trivial and computationally inexpensive parts are indeed executed multiple times.
You cannot do partial parallelization of the code with MPI, but that does not matter. You should focus on splitting the workload of the computationally demanding parts of the code. That's why we used the mesh regions defined according to the rank: to split the workload. A minimal sketch of this pattern follows below.
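To make the idea concrete, here is a minimal sketch of the pattern with a made-up workload, not tied to GetFEM: every rank runs the same script, picks its share of the work based on its rank, and the partial results are combined with an MPI reduction.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nprocs = comm.Get_size()

work_items = np.arange(1000)                        # hypothetical global workload
my_items = work_items[rank::nprocs]                 # each rank takes every nprocs-th item
partial = float(np.sum(my_items.astype(float)**2))  # the expensive part, done in parallel
total = comm.reduce(partial, op=MPI.SUM, root=0)    # sum the partial results on rank 0
if rank == 0:
    print("total =", total)                         # only rank 0 holds the combined result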
Hello Kostas,
Okay, that makes sense. Thank you so much for the detailed help!
Sincerely,
Eric Comstock
Hello Kostas,
Regarding partial parallelization, the reason I was interested in it is that any CPU activity generates heat, which can slow down the other CPUs if too much is generated. For parts of the code that do not have to be run several times, I have found that they can be left out of the parallelization process with something like
# Parallelization
import logging

from mpi4py import MPI

comm = MPI.COMM_WORLD
nprocs = comm.Get_size()
rank = comm.Get_rank()
logging.info('nprocs = ' + str(nprocs))
logging.info('rank = ' + str(rank))

# Actual code
if rank == processor_you_want_to_use_for_nonparallelized_tasks:
    '''run code here that you do not want to be parallelized'''
which reduces energy consumption and waste heat production. I just wanted to mention it in case it was useful.
I also had a question regarding getFEM itself - is it possible to send getFEM objects through mpi4py’s communication between processors (mpi4py.MPI.COMM_WORLD.Send and mpi4py.MPI.COMM_WORLD.Recv)? If so, what datatype should I use?
Edit: Never mind - I figured this out (that the getfem objects carry over), and the code is 3x faster! Thank you so much!
Edit edit: Now I am getting a different result, so I may still be doing something wrong, but it is likely not the issue I originally had.
Sincerely,
Eric Comstock
it sounds dangerous to use GetFEM with MPI this way, and you certainly need to know what you are doing. GetFEM objects have many inter-dependencies behind the scenes: a lot of state data is stored, data is cached, and there is lazy evaluation of data. In general it is very complex.
You should try to stick with a parallelization paradigm that is based only on the transfer of data between the processes, and you should really know what the state of each instance of your script on each processor is. Keep it simple. A minimal sketch of exchanging plain numpy arrays between two ranks follows below.
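For example, something like this (the array length and message tag here are made up for illustration):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n = 1000  # hypothetical array length
if rank == 0:
    data = np.random.rand(n)              # e.g. nodal values extracted from a GetFEM object
    comm.Send(data, dest=1, tag=11)       # buffer-based send of a contiguous numpy array
elif rank == 1:
    data = np.empty(n, dtype=np.float64)  # the receive buffer must be pre-allocated
    comm.Recv(data, source=0, tag=11)
    print("rank 1 received", data.shape, "values")

For anything that is not a plain array you can use the lowercase comm.send/comm.recv, which pickle arbitrary Python data, but for GetFEM objects I would really avoid that.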
Hello Kostas,
I have been doing some testing on a sample case and have determined that there still seems to be a substantial slowdown when adding the boundary conditions. However, I have not been able to determine where the slowdown is, since it seems to change based on the order in which I add the boundary conditions. For example, with the default ordering of my boundary conditions (below),
md.add_multiplier("mult46", mf, "f", mim, 46)
logging.info('Multiplier 1/6')
md.add_multiplier("mult47", mf, "f", mim, 47)
logging.info('Multiplier 2/6')
md.add_multiplier("mult48", mf, "f", mim, 48)
logging.info('Multiplier 3/6')
md.add_multiplier("mult49", mf, "f", mim, 49)
logging.info('Multiplier 4/6')
md.add_multiplier("mult50", mf, "f", mim, 50)
logging.info('Multiplier 5/6')
md.add_multiplier("mult51", mf, "f", mim, 51)
logging.info('Multiplier 6/6')
md.add_linear_term(mim, '-mult46 * f', 46)
logging.info('Linear 1/6')
md.add_linear_term(mim, '-mult47 * f', 47)
logging.info('Linear 2/6')
md.add_linear_term(mim, '-mult48 * f', 48)
logging.info('Linear 3/6')
md.add_linear_term(mim, '-mult49 * f', 49)
logging.info('Linear 4/6')
md.add_linear_term(mim, '-mult50 * f', 50)
logging.info('Linear 5/6')
md.add_linear_term(mim, '-mult51 * f', 51)
logging.info('Linear 6/6')
, and using a logger to record when each operation finishes, I get the following in the log file:
2025-06-10 13:03:30,123 [INFO] Multiplier 1/6
2025-06-10 13:03:30,124 [INFO] Multiplier 2/6
2025-06-10 13:03:30,125 [INFO] Multiplier 3/6
2025-06-10 13:03:30,126 [INFO] Multiplier 4/6
2025-06-10 13:03:30,126 [INFO] Multiplier 5/6
2025-06-10 13:03:30,127 [INFO] Multiplier 6/6
2025-06-10 13:15:18,568 [INFO] Linear 1/6
2025-06-10 13:15:18,569 [INFO] Linear 2/6
2025-06-10 13:15:18,569 [INFO] Linear 3/6
2025-06-10 13:15:18,570 [INFO] Linear 4/6
2025-06-10 13:15:18,571 [INFO] Linear 5/6
2025-06-10 13:15:18,571 [INFO] Linear 6/6
, implying that md.add_linear_term(mim, '-mult46 * f', 46) took about 12 minutes, while all other steps took milliseconds. In contrast, when I run it with
md.add_multiplier("mult46", mf, "f", mim, 46)
logging.info('Multiplier 1/6')
md.add_multiplier("mult47", mf, "f", mim, 47)
logging.info('Multiplier 2/6')
md.add_multiplier("mult48", mf, "f", mim, 48)
logging.info('Multiplier 3/6')
md.add_multiplier("mult49", mf, "f", mim, 49)
logging.info('Multiplier 4/6')
md.add_multiplier("mult50", mf, "f", mim, 50)
logging.info('Multiplier 5/6')
md.add_multiplier("mult51", mf, "f", mim, 51)
logging.info('Multiplier 6/6')
md.add_linear_term(mim, '-mult49 * f', 49)
logging.info('Linear 4/6')
md.add_linear_term(mim, '-mult50 * f', 50)
logging.info('Linear 5/6')
md.add_linear_term(mim, '-mult51 * f', 51)
logging.info('Linear 6/6')
md.add_linear_term(mim, '-mult46 * f', 46)
logging.info('Linear 1/6')
md.add_linear_term(mim, '-mult47 * f', 47)
logging.info('Linear 2/6')
md.add_linear_term(mim, '-mult48 * f', 48)
logging.info('Linear 3/6')
, which should not change the model, I get the following in the log file:
2025-06-10 13:23:00,526 [INFO] Multiplier 1/6
2025-06-10 13:23:00,526 [INFO] Multiplier 2/6
2025-06-10 13:23:00,527 [INFO] Multiplier 3/6
2025-06-10 13:23:00,527 [INFO] Multiplier 4/6
2025-06-10 13:23:00,528 [INFO] Multiplier 5/6
2025-06-10 13:23:00,529 [INFO] Multiplier 6/6
2025-06-10 13:33:41,414 [INFO] Linear 4/6
2025-06-10 13:33:41,415 [INFO] Linear 5/6
2025-06-10 13:33:41,416 [INFO] Linear 6/6
2025-06-10 13:33:41,417 [INFO] Linear 1/6
2025-06-10 13:33:41,417 [INFO] Linear 2/6
2025-06-10 13:33:41,418 [INFO] Linear 3/6
, implying that md.add_linear_term(mim, '-mult49 * f', 49) took about 11 minutes, while all other steps took milliseconds.
Is there some kind of initialization that takes place when add_linear_term is called the first time? If so, where is this procedure in the source code?
Sincerely,
Eric Comstock
ok, nice overview. This makes sense. There is no initialization specific to add_linear_term, but as I mentioned in a previous message, there are two concepts in GetFEM that can explain this:
So basically, many things are computed only when they are needed and are then cached for later use. A typical example is the precomputation of basis function values or gradients.
It seems that such a computation hits a severe bottleneck for your high-dimensional FE space. The good news is that the penalty is paid only once, and then the cached result is reused. However, we should fix this if there is indeed some very inefficient computation involved. A quick way to confirm the caching effect is sketched below.
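As a rough cross-check (just a suggestion, nothing GetFEM-specific), you could time the first and second brick additions directly; the model, integration method, and region numbers here are simply the ones from your snippet and are assumed to be already defined:

import logging
import time

def timed(label, fn):
    # Small helper: run fn() once and log how long the call took.
    t0 = time.perf_counter()
    fn()
    logging.info('%s took %.1f s', label, time.perf_counter() - t0)

timed('first add_linear_term',
      lambda: md.add_linear_term(mim, '-mult46 * f', 46))  # expected to be slow (fills the caches)
timed('second add_linear_term',
      lambda: md.add_linear_term(mim, '-mult47 * f', 47))  # expected to be fast (reuses the caches)

If the second call is fast, the time is indeed going into cached precomputations rather than into the brick itself.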
Hello Kostas,
Do you know if there is any way to check which procedure this is in? I assume it would be in the add_linear_term procedure in Python, but I am not sure how the Python/C++ interface works. I looked around in the C++ source files and it seems that there is an add_linear_term function that calls add_linear_term_ - would the bottleneck be in the latter?
Sincerely,
Eric Comstock
add_linear_term is a relatively high-level function, which calls other lower-level functions where the actual work is done.
Now that you have everything running in your Ubuntu/WSL, it is quite easy to do some proper profiling. You can run your code, let's call it myscript.py, with:
python3 -m cProfile -o benchmarking.dat myscript.py
Install snakeviz with pipx:
pipx install snakeviz
(pip on Linux is evil, but pipx is an OK method for installing apps that are not available in the distro repo.)
Visualize your profiling results with:
snakeviz benchmarking.dat
If it does not show the names of C++ functions, you will need to recompile GetFEM with the -g flag in CXXFLAGS="-g ..." in the configure command.
Hello Kostas,
I tried running snakeviz like you suggested, but I ran into this problem:
snakeviz benchmarking.dat
snakeviz web server started on 127.0.0.1:8080; enter Ctrl-C to exit
http://127.0.0.1:8080/snakeviz/%2Fmnt%2Fc%2FUsers%2FAdmin%2FDocuments%2FImportant%20Files%2FGT%2FResearch%2FSummer%202025%2FMemos%2F06%2002%202025%20-%20getFEM%20high-D%20parellelization%20testing%2Fbenchmarking.dat
snakeviz: error: no web browser found: could not locate runnable browser
Is this referring to browsers like Firefox, or to something else?
Sincerely,
Eric Comstock
yes, I forgot that you run WSL, so you do not have Firefox. Then you just need to use a less fancy visualization for your benchmarking.dat file; the simplest is
python3 -m pstats benchmarking.dat
There are many visualization tools that you can use for this file; you can also read it programmatically with the standard pstats module, as sketched below.
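For reference, the same information can be pulled out non-interactively with a few lines of standard-library Python (file name as in your run):

import pstats

# Load the cProfile output and print the 20 most expensive entries,
# once sorted by cumulative time and once by internal (self) time.
stats = pstats.Stats('benchmarking.dat')
stats.sort_stats('cumulative').print_stats(20)
stats.sort_stats('tottime').print_stats(20)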
Hello Kostas,
Got it - thank you!
Sincerely,
Eric Comstock
Hello Kostas,
Sorted by cumulative time, the results are
benchmarking.dat% sort cumulative
benchmarking.dat% stats 20
Tue Jun 10 16:20:22 2025 benchmarking.dat
308990 function calls (300154 primitive calls) in 639.187 seconds
Ordered by: cumulative time
List reduced from 1784 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
222/1 0.000 0.000 639.194 639.194 {built-in method builtins.exec}
1 0.000 0.000 639.194 639.194 test6D_linear_BCs_fast_default.py:1(<module>)
1 0.001 0.001 638.673 638.673 test6D_linear_BCs_fast_default.py:83(calc_F)
102 634.040 6.216 634.040 6.216 {built-in method _getfem.getfem}
29 0.000 0.000 593.698 20.472 /home/eac/FEM_cpp_testing/test_compilation_folder/opt/lib/python3.10/site-packages/getfem/getfem.py:2817(set)
7 0.000 0.000 592.411 84.630 /home/eac/FEM_cpp_testing/test_compilation_folder/opt/lib/python3.10/site-packages/getfem/getfem.py:3522(add_linear_term)
12 0.000 0.000 22.875 1.906 /home/eac/FEM_cpp_testing/test_compilation_folder/opt/lib/python3.10/site-packages/getfem/getfem.py:1413(outer_faces_with_direction)
12 0.000 0.000 22.875 1.906 /home/eac/FEM_cpp_testing/test_compilation_folder/opt/lib/python3.10/site-packages/getfem/getfem.py:1185(get)
5 0.000 0.000 16.274 3.255 /home/eac/FEM_cpp_testing/test_compilation_folder/opt/lib/python3.10/site-packages/getfem/getfem.py:2812(get)
1 0.000 0.000 16.266 16.266 /home/eac/FEM_cpp_testing/test_compilation_folder/opt/lib/python3.10/site-packages/getfem/getfem.py:2951(solve)
8 0.000 0.000 4.550 0.569 /home/eac/FEM_cpp_testing/test_compilation_folder/opt/lib/python3.10/site-packages/getfem/getfem.py:53(generic_constructor)
7 4.550 0.650 4.550 0.650 {built-in method _getfem.getfem_from_constructor}
1 0.000 0.000 4.508 4.508 /home/eac/FEM_cpp_testing/test_compilation_folder/opt/lib/python3.10/site-packages/getfem/getfem.py:1067(__init__)
1 0.000 0.000 1.240 1.240 /home/eac/FEM_cpp_testing/test_compilation_folder/opt/lib/python3.10/site-packages/getfem/getfem.py:3234(add_fem_variable)
6 0.000 0.000 0.660 0.110 /home/eac/FEM_cpp_testing/test_compilation_folder/opt/lib/python3.10/site-packages/getfem/getfem.py:5980(asm_generic)
275/3 0.001 0.000 0.448 0.149 <frozen importlib._bootstrap>:1022(_find_and_load)
275/3 0.001 0.000 0.448 0.149 <frozen importlib._bootstrap>:987(_find_and_load_unlocked)
262/3 0.001 0.000 0.441 0.147 <frozen importlib._bootstrap>:664(_load_unlocked)
221/3 0.000 0.000 0.440 0.147 <frozen importlib._bootstrap_external>:877(exec_module)
382/3 0.000 0.000 0.438 0.146 <frozen importlib._bootstrap>:233(_call_with_frames_removed)
and sorted by total time, they are
benchmarking.dat% sort tottime
benchmarking.dat% stats 20
Tue Jun 10 16:20:22 2025 benchmarking.dat
308990 function calls (300154 primitive calls) in 639.187 seconds
Ordered by: internal time
List reduced from 1784 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
102 634.040 6.216 634.040 6.216 {built-in method _getfem.getfem}
7 4.550 0.650 4.550 0.650 {built-in method _getfem.getfem_from_constructor}
20/16 0.253 0.013 0.259 0.016 {built-in method _imp.exec_dynamic}
14 0.079 0.006 0.200 0.014 /home/eac/FEM_cpp_testing/test_compilation_folder/opt/lib/python3.10/site-packages/getfem/getfem.py:2195(eval)
1125 0.030 0.000 0.030 0.000 {built-in method posix.stat}
5 0.022 0.004 0.022 0.004 {built-in method io.open}
221 0.015 0.000 0.015 0.000 {built-in method marshal.loads}
20/19 0.015 0.001 0.020 0.001 {built-in method _imp.create_dynamic}
25 0.014 0.001 0.014 0.001 {method 'flush' of '_io.TextIOWrapper' objects}
57582 0.009 0.000 0.017 0.000 {built-in method builtins.eval}
1 0.008 0.008 0.036 0.036 /usr/local/lib/python3.10/dist-packages/numpy/lib/_npyio_impl.py:1412(savetxt)
34 0.007 0.000 0.007 0.000 {built-in method posix.listdir}
8213 0.007 0.000 0.007 0.000 {method 'write' of '_io.TextIOWrapper' objects}
663/108 0.006 0.000 0.017 0.000 /usr/lib/python3.10/sre_parse.py:494(_parse)
523/519 0.005 0.000 0.013 0.000 {built-in method builtins.__build_class__}
1229/106 0.005 0.000 0.010 0.000 /usr/lib/python3.10/sre_compile.py:87(_compile)
1 0.003 0.003 0.004 0.004 /usr/local/lib/python3.10/dist-packages/numpy/lib/_function_base_impl.py:1(<module>)
506 0.003 0.000 0.042 0.000 <frozen importlib._bootstrap_external>:1536(find_spec)
27198 0.003 0.000 0.003 0.000 {built-in method builtins.isinstance}
221 0.002 0.000 0.002 0.000 {method 'read' of '_io.BufferedReader' objects}
It seems that most of the time is spent in the built-in _getfem.getfem call, which presumably lives in the compiled extension module loaded by the Python library. Do you know what this corresponds to in the source code?
Sincerely,
Eric Comstock
oh, my mistake, I gave you the profiling method for Python functions. This only tells you that most of the time is spent in the C++ library (_getfem). We want to profile what is happening inside the C++ code; for this purpose you need to run your script with
perf record -F 1000 --call-graph dwarf python3 myscript.py
This creates quite a huge file called perf.data. If the file is too big, change the sampling frequency from 1000 to something lower.
To view the file, install hotspot with:
sudo apt install hotspot
and run it from the same folder containing perf.data by running
hotspot
This requires that your WSL setup is capable of showing GUIs.
Hello Kostas,
I have been unable to install perf on my computer, as it does not seem to be compatible with WSL. I have tried installing a new Linux kernel, installing a standalone version of perf, and reinstalling linux-tools, but nothing seems to work. Are there any other performance tools we can use?
If not, do you know of any way to install perf on WSL, or of a special version of perf that you are using?
Sincerely,
Eric Comstock