Hello Kostas,
Thank you for the information and help! I will learn these tools, and work on this.
Sincerely,
Eric Comstock
Hello Kostas,
Thank you!
I have gotten a GetFEM installation with parallel MUMPS, mpi4py, and OpenMPI running.
I have not been able to find a usage guide or discussion on how to use mpi4py with GetFEM, either on Discourse or on getfem.org. How is mpi4py called? What types of objects do I need to create for the custom partition?
Sincerely,
Eric Comstock
Hello Kostas,
What part of the documentation is the partitioning described in?
Sincerely,
Eric Comstock
Hi Eric
The Metis-based MPI parallelization is described here:
Since Metis does not support partitioning of higher-dimensional meshes, there are two options for you:
Let's focus on #1. The documentation for mesh regions can be found here
and here for the Python API.
The basic idea is to create an mpi4py-capable script, let's say for a 2D problem using 2 MPI processes, like this:
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
....
RG_CURRENT_PROC = 99  # just an identifier for the region
if rank == 0:    # all elements with X0 <= X <= X1
    CVIDs = m.convexes_in_box([X0-1e-5, Y0-1e-5], [X1+1e-5, Y1+1e-5])
elif rank == 1:  # all elements with X1 <= X <= X2
    CVIDs = m.convexes_in_box([X1-1e-5, Y0-1e-5], [X2+1e-5, Y1+1e-5])
CVIDs = np.unique(CVIDs)  # remove any duplicates from the list
m.set_region(RG_CURRENT_PROC, [CVIDs, -np.ones(len(CVIDs))])
...
for timestep in timesteps:
    ...
    model.assembly("build_all")
    rhs = model.rhs()           # this rhs is the sum of the rhs's from all processes;
                                # GetFEM calls the MPI broadcast internally
    K = model.tangent_matrix()  # this contains only the part of the stiffness matrix
                                # assembled on the current process. If you want the
                                # entire stiffness matrix you need to do all
                                # necessary MPI broadcasts yourself, which can
                                # be tricky for sparse matrices. MUMPS does not
                                # require this, it works with distributed matrices.
    ...                         # do something with rhs and K
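Purely for illustration, and not something you normally need since MUMPS works with the distributed matrix directly: if you did want the entire stiffness matrix on one process, a minimal mpi4py sketch could look like the following, assuming the local part has already been converted to a scipy.sparse object K_local (the conversion from the GetFEM sparse matrix object is not shown here).

from mpi4py import MPI
import numpy as np
import scipy.sparse as sp

comm = MPI.COMM_WORLD

def gather_full_matrix(K_local, root=0):
    """Sum the per-process sparse contributions onto one rank (illustration only)."""
    coo = K_local.tocoo()
    # Ship the (row, col, value) triplets of each local part to the root rank.
    triplets = comm.gather((coo.row, coo.col, coo.data), root=root)
    if comm.Get_rank() != root:
        return None
    rows = np.concatenate([t[0] for t in triplets])
    cols = np.concatenate([t[1] for t in triplets])
    vals = np.concatenate([t[2] for t in triplets])
    # coo_matrix keeps duplicate entries and tocsr() sums them, which is exactly
    # the summation of overlapping contributions from different processes.
    return sp.coo_matrix((vals, (rows, cols)), shape=K_local.shape).tocsr()

In practice you would pass the distributed parts to MUMPS and skip this communication entirely.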
For parallel MUMPS solves, I have added an example that may be useful to you, here
Hello Kostas,
Thank you so much! I will try these options and review the documentation you sent, and let you know how it goes.
Sincerely,
Eric Comstock
Hello Kostas,
I have a couple of questions:
First, what does gf.MumpsContext do? I have been unable to find it in the documentation.
Second, is it possible to only parallelize certain parts of the code? I have noticed that when I run the code (using the parallel mpirun -n call), any parts that are not parallel will be run individually on each processor, making it take substantially longer to execute.
Edit: I figured out the partial parallelization method
Sincerely,
Eric Comstock
The MUMPS context object creates a reusable instance of the MUMPS solver, which you can talk back and forth with. You cannot find it in the documentation because I have only just implemented it.
MPI parallelism is like that: your entire program is run several times, including all parts of it, unless you do something about it. It is like running n copies of your program, where these n copies can send data to each other. The only thing each copy knows is its rank (something like a process id), so you need to write your program so that it checks its rank and performs tasks accordingly.
In all cases, trivial and computationally inexpensive parts are indeed executed multiple times.
You cannot do partial parallelization of the code with MPI, but that does not matter. You should focus on splitting the workload of the computationally demanding parts of the code. That's why we used the mesh regions defined according to the rank: to split the workload. A minimal sketch of this pattern follows below.
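To make the idea concrete, here is a minimal sketch of the pattern with a made-up workload, not tied to GetFEM: every rank runs the same script, picks its share of the work based on its rank, and the partial results are combined with an MPI reduction.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nprocs = comm.Get_size()

work_items = np.arange(1000)                        # hypothetical global workload
my_items = work_items[rank::nprocs]                 # each rank takes every nprocs-th item
partial = float(np.sum(my_items.astype(float)**2))  # the expensive part, done in parallel
total = comm.reduce(partial, op=MPI.SUM, root=0)    # sum the partial results on rank 0
if rank == 0:
    print("total =", total)                         # only rank 0 holds the combined result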
Hello Kostas,
Okay, that makes sense. Thank you so much for the detailed help!
Sincerely,
Eric Comstock
Hello Kostas,
Regarding partial parallelization, the reason I was interested in it is that any CPU activity generates heat, which can slow down the other CPUs if too much is generated. For parts of the code that do not have to be run several times, I have found that they can be left out of the parallelization process with something like
# Parallelization
import logging

from mpi4py import MPI

comm = MPI.COMM_WORLD
nprocs = comm.Get_size()
rank = comm.Get_rank()
logging.info('nprocs = ' + str(nprocs))
logging.info('rank = ' + str(rank))

# Actual code
if rank == processor_you_want_to_use_for_nonparallelized_tasks:
    '''run code here that you do not want to be parallelized'''
which reduces energy consumption and waste heat production. I just wanted to mention it in case it was useful.
I also had a question regarding getFEM itself - is it possible to send getFEM objects through mpi4py’s communication between processors (mpi4py.MPI.COMM_WORLD.Send and mpi4py.MPI.COMM_WORLD.Recv)? If so, what datatype should I use?
Edit: Never mind - I figured this out (that the getfem objects carry over), and the code is 3x faster! Thank you so much!
Edit edit: Now I am getting a different result, so I may still be doing something wrong, but it is likely not the issue I originally had.
Sincerely,
Eric Comstock
it sounds dangerous to use GetFEM with MPI this way, and you certainly need to know what you are doing. GetFEM objects have many inter-dependencies behind the scenes: a lot of state data is stored, data is cached, and there is lazy evaluation of data. In general it is very complex.
You should try to stick with a parallelization paradigm that is based only on the transfer of data between the processes, and you should really know what the state of each instance of your script on each processor is. Keep it simple. A minimal sketch of exchanging plain numpy arrays between two ranks follows below.
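For example, something like this (the array length and message tag here are made up for illustration):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n = 1000  # hypothetical array length
if rank == 0:
    data = np.random.rand(n)              # e.g. nodal values extracted from a GetFEM object
    comm.Send(data, dest=1, tag=11)       # buffer-based send of a contiguous numpy array
elif rank == 1:
    data = np.empty(n, dtype=np.float64)  # the receive buffer must be pre-allocated
    comm.Recv(data, source=0, tag=11)
    print("rank 1 received", data.shape, "values")

For anything that is not a plain array you can use the lowercase comm.send/comm.recv, which pickle arbitrary Python data, but for GetFEM objects I would really avoid that.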
Hello Kostas,
I have been doing some testing on a sample case and have determined that there still seems to be a substantial slowdown when adding the boundary conditions. However, I have not been able to determine where the slowdown is, since it seems to change based on the order in which I add the boundary conditions. For example, with the default ordering of my boundary conditions (below),
md.add_multiplier("mult46", mf, "f", mim, 46)
logging.info('Multiplier 1/6')
md.add_multiplier("mult47", mf, "f", mim, 47)
logging.info('Multiplier 2/6')
md.add_multiplier("mult48", mf, "f", mim, 48)
logging.info('Multiplier 3/6')
md.add_multiplier("mult49", mf, "f", mim, 49)
logging.info('Multiplier 4/6')
md.add_multiplier("mult50", mf, "f", mim, 50)
logging.info('Multiplier 5/6')
md.add_multiplier("mult51", mf, "f", mim, 51)
logging.info('Multiplier 6/6')
md.add_linear_term(mim, '-mult46 * f', 46)
logging.info('Linear 1/6')
md.add_linear_term(mim, '-mult47 * f', 47)
logging.info('Linear 2/6')
md.add_linear_term(mim, '-mult48 * f', 48)
logging.info('Linear 3/6')
md.add_linear_term(mim, '-mult49 * f', 49)
logging.info('Linear 4/6')
md.add_linear_term(mim, '-mult50 * f', 50)
logging.info('Linear 5/6')
md.add_linear_term(mim, '-mult51 * f', 51)
logging.info('Linear 6/6')
, and using a logger to record when each operation finishes, I get the following in the log file:
2025-06-10 13:03:30,123 [INFO] Multiplier 1/6
2025-06-10 13:03:30,124 [INFO] Multiplier 2/6
2025-06-10 13:03:30,125 [INFO] Multiplier 3/6
2025-06-10 13:03:30,126 [INFO] Multiplier 4/6
2025-06-10 13:03:30,126 [INFO] Multiplier 5/6
2025-06-10 13:03:30,127 [INFO] Multiplier 6/6
2025-06-10 13:15:18,568 [INFO] Linear 1/6
2025-06-10 13:15:18,569 [INFO] Linear 2/6
2025-06-10 13:15:18,569 [INFO] Linear 3/6
2025-06-10 13:15:18,570 [INFO] Linear 4/6
2025-06-10 13:15:18,571 [INFO] Linear 5/6
2025-06-10 13:15:18,571 [INFO] Linear 6/6
, implying that md.add_linear_term(mim, '-mult46 * f', 46) took about 12 minutes, while all other steps took milliseconds. In contrast, when I run it with
md.add_multiplier("mult46", mf, "f", mim, 46)
logging.info('Multiplier 1/6')
md.add_multiplier("mult47", mf, "f", mim, 47)
logging.info('Multiplier 2/6')
md.add_multiplier("mult48", mf, "f", mim, 48)
logging.info('Multiplier 3/6')
md.add_multiplier("mult49", mf, "f", mim, 49)
logging.info('Multiplier 4/6')
md.add_multiplier("mult50", mf, "f", mim, 50)
logging.info('Multiplier 5/6')
md.add_multiplier("mult51", mf, "f", mim, 51)
logging.info('Multiplier 6/6')
md.add_linear_term(mim, '-mult49 * f', 49)
logging.info('Linear 4/6')
md.add_linear_term(mim, '-mult50 * f', 50)
logging.info('Linear 5/6')
md.add_linear_term(mim, '-mult51 * f', 51)
logging.info('Linear 6/6')
md.add_linear_term(mim, '-mult46 * f', 46)
logging.info('Linear 1/6')
md.add_linear_term(mim, '-mult47 * f', 47)
logging.info('Linear 2/6')
md.add_linear_term(mim, '-mult48 * f', 48)
logging.info('Linear 3/6')
, which should not change the model, I get the following in the log file:
2025-06-10 13:23:00,526 [INFO] Multiplier 1/6
2025-06-10 13:23:00,526 [INFO] Multiplier 2/6
2025-06-10 13:23:00,527 [INFO] Multiplier 3/6
2025-06-10 13:23:00,527 [INFO] Multiplier 4/6
2025-06-10 13:23:00,528 [INFO] Multiplier 5/6
2025-06-10 13:23:00,529 [INFO] Multiplier 6/6
2025-06-10 13:33:41,414 [INFO] Linear 4/6
2025-06-10 13:33:41,415 [INFO] Linear 5/6
2025-06-10 13:33:41,416 [INFO] Linear 6/6
2025-06-10 13:33:41,417 [INFO] Linear 1/6
2025-06-10 13:33:41,417 [INFO] Linear 2/6
2025-06-10 13:33:41,418 [INFO] Linear 3/6
, implying that md.add_linear_term(mim, '-mult49 * f', 49) took about 11 minutes, while all other steps took milliseconds.
Is there some kind of initialization that takes place when add_linear_term is called the first time? If so, where is this procedure in the source code?
Sincerely,
Eric Comstock
ok, nice overview. This makes sense. There is no initialization specific to add_linear_term, but as I mentioned in a previous message, there are two concepts in GetFEM that can explain this:
So basically, many things are computed only when they are needed and are then cached for later use. A typical example is the precomputation of basis function values or gradients.
It seems that such a computation hits a severe bottleneck for your high-dimensional FE space. The good news is that the penalty is paid only once, and then the cached result is reused. However, we should fix this if there is indeed some very inefficient computation involved. A quick way to confirm the caching effect is sketched below.
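As a rough cross-check (just a suggestion, nothing GetFEM-specific), you could time the first and second brick additions directly; the model, integration method, and region numbers here are simply the ones from your snippet and are assumed to be already defined:

import logging
import time

def timed(label, fn):
    # Small helper: run fn() once and log how long the call took.
    t0 = time.perf_counter()
    fn()
    logging.info('%s took %.1f s', label, time.perf_counter() - t0)

timed('first add_linear_term',
      lambda: md.add_linear_term(mim, '-mult46 * f', 46))  # expected to be slow (fills the caches)
timed('second add_linear_term',
      lambda: md.add_linear_term(mim, '-mult47 * f', 47))  # expected to be fast (reuses the caches)

If the second call is fast, the time is indeed going into cached precomputations rather than into the brick itself.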
Hello Kostas,
Do you know if there is any way to check which procedure this is in? I assume it would be in the add_linear_term procedure in Python, but I am not sure how the Python/C++ interface works. I looked around in the C++ source files and it seems that there is an add_linear_term function that calls add_linear_term_ - would the bottleneck be in the latter?
Sincerely,
Eric Comstock
add_linear_term is a relatively high-level function, which calls other lower-level functions where the actual work is done.
Now that you have everything running in your Ubuntu/WSL, it is quite easy to do some proper profiling. You can run your code, let's call it myscript.py, with:
python3 -m cProfile -o benchmarking.dat myscript.py
Install snakeviz with pipx:
pipx install snakeviz
(pip on Linux is evil, but pipx is an OK method for installing apps that are not available in the distro repo.)
Visualize your profiling results with:
snakeviz benchmarking.dat
If it does not show the names of C++ functions, you will need to recompile GetFEM with the -g flag in CXXFLAGS="-g ..." in the configure command.
Hello Kostas,
I tried running snakeviz like you suggested, but I ran into this problem:
snakeviz benchmarking.dat
snakeviz web server started on 127.0.0.1:8080; enter Ctrl-C to exit
http://127.0.0.1:8080/snakeviz/%2Fmnt%2Fc%2FUsers%2FAdmin%2FDocuments%2FImportant%20Files%2FGT%2FResearch%2FSummer%202025%2FMemos%2F06%2002%202025%20-%20getFEM%20high-D%20parellelization%20testing%2Fbenchmarking.dat
snakeviz: error: no web browser found: could not locate runnable browser
Is this referring to browsers like Firefox, or to something else?
Sincerely,
Eric Comstock
yes, I forgot that you run WSL, so you do not have Firefox. Then you just need to use a less fancy visualization for your benchmarking.dat file; the simplest is
python3 -m pstats benchmarking.dat
There are many visualization tools that you can use for this file; you can also read it programmatically with the standard pstats module, as sketched below.
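For reference, the same information can be pulled out non-interactively with a few lines of standard-library Python (file name as in your run):

import pstats

# Load the cProfile output and print the 20 most expensive entries,
# once sorted by cumulative time and once by internal (self) time.
stats = pstats.Stats('benchmarking.dat')
stats.sort_stats('cumulative').print_stats(20)
stats.sort_stats('tottime').print_stats(20)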
Hello Kostas,
Got it - thank you!
Sincerely,
Eric Comstock
Hello Kostas,
Sorted by cumulative time, the results are
benchmarking.dat% sort cumulative
benchmarking.dat% stats 20
Tue Jun 10 16:20:22 2025 benchmarking.dat
308990 function calls (300154 primitive calls) in 639.187 seconds
Ordered by: cumulative time
List reduced from 1784 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
222/1 0.000 0.000 639.194 639.194 {built-in method builtins.exec}
1 0.000 0.000 639.194 639.194 test6D_linear_BCs_fast_default.py:1(<module>)
1 0.001 0.001 638.673 638.673 test6D_linear_BCs_fast_default.py:83(calc_F)
102 634.040 6.216 634.040 6.216 {built-in method _getfem.getfem}
29 0.000 0.000 593.698 20.472 /home/eac/FEM_cpp_testing/test_compilation_folder/opt/lib/python3.10/site-packages/getfem/getfem.py:2817(set)
7 0.000 0.000 592.411 84.630 /home/eac/FEM_cpp_testing/test_compilation_folder/opt/lib/python3.10/site-packages/getfem/getfem.py:3522(add_linear_term)
12 0.000 0.000 22.875 1.906 /home/eac/FEM_cpp_testing/test_compilation_folder/opt/lib/python3.10/site-packages/getfem/getfem.py:1413(outer_faces_with_direction)
12 0.000 0.000 22.875 1.906 /home/eac/FEM_cpp_testing/test_compilation_folder/opt/lib/python3.10/site-packages/getfem/getfem.py:1185(get)
5 0.000 0.000 16.274 3.255 /home/eac/FEM_cpp_testing/test_compilation_folder/opt/lib/python3.10/site-packages/getfem/getfem.py:2812(get)
1 0.000 0.000 16.266 16.266 /home/eac/FEM_cpp_testing/test_compilation_folder/opt/lib/python3.10/site-packages/getfem/getfem.py:2951(solve)
8 0.000 0.000 4.550 0.569 /home/eac/FEM_cpp_testing/test_compilation_folder/opt/lib/python3.10/site-packages/getfem/getfem.py:53(generic_constructor)
7 4.550 0.650 4.550 0.650 {built-in method _getfem.getfem_from_constructor}
1 0.000 0.000 4.508 4.508 /home/eac/FEM_cpp_testing/test_compilation_folder/opt/lib/python3.10/site-packages/getfem/getfem.py:1067(__init__)
1 0.000 0.000 1.240 1.240 /home/eac/FEM_cpp_testing/test_compilation_folder/opt/lib/python3.10/site-packages/getfem/getfem.py:3234(add_fem_variable)
6 0.000 0.000 0.660 0.110 /home/eac/FEM_cpp_testing/test_compilation_folder/opt/lib/python3.10/site-packages/getfem/getfem.py:5980(asm_generic)
275/3 0.001 0.000 0.448 0.149 <frozen importlib._bootstrap>:1022(_find_and_load)
275/3 0.001 0.000 0.448 0.149 <frozen importlib._bootstrap>:987(_find_and_load_unlocked)
262/3 0.001 0.000 0.441 0.147 <frozen importlib._bootstrap>:664(_load_unlocked)
221/3 0.000 0.000 0.440 0.147 <frozen importlib._bootstrap_external>:877(exec_module)
382/3 0.000 0.000 0.438 0.146 <frozen importlib._bootstrap>:233(_call_with_frames_removed)
and sorted by total time, they are
benchmarking.dat% sort tottime
benchmarking.dat% stats 20
Tue Jun 10 16:20:22 2025 benchmarking.dat
308990 function calls (300154 primitive calls) in 639.187 seconds
Ordered by: internal time
List reduced from 1784 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
102 634.040 6.216 634.040 6.216 {built-in method _getfem.getfem}
7 4.550 0.650 4.550 0.650 {built-in method _getfem.getfem_from_constructor}
20/16 0.253 0.013 0.259 0.016 {built-in method _imp.exec_dynamic}
14 0.079 0.006 0.200 0.014 /home/eac/FEM_cpp_testing/test_compilation_folder/opt/lib/python3.10/site-packages/getfem/getfem.py:2195(eval)
1125 0.030 0.000 0.030 0.000 {built-in method posix.stat}
5 0.022 0.004 0.022 0.004 {built-in method io.open}
221 0.015 0.000 0.015 0.000 {built-in method marshal.loads}
20/19 0.015 0.001 0.020 0.001 {built-in method _imp.create_dynamic}
25 0.014 0.001 0.014 0.001 {method 'flush' of '_io.TextIOWrapper' objects}
57582 0.009 0.000 0.017 0.000 {built-in method builtins.eval}
1 0.008 0.008 0.036 0.036 /usr/local/lib/python3.10/dist-packages/numpy/lib/_npyio_impl.py:1412(savetxt)
34 0.007 0.000 0.007 0.000 {built-in method posix.listdir}
8213 0.007 0.000 0.007 0.000 {method 'write' of '_io.TextIOWrapper' objects}
663/108 0.006 0.000 0.017 0.000 /usr/lib/python3.10/sre_parse.py:494(_parse)
523/519 0.005 0.000 0.013 0.000 {built-in method builtins.__build_class__}
1229/106 0.005 0.000 0.010 0.000 /usr/lib/python3.10/sre_compile.py:87(_compile)
1 0.003 0.003 0.004 0.004 /usr/local/lib/python3.10/dist-packages/numpy/lib/_function_base_impl.py:1(<module>)
506 0.003 0.000 0.042 0.000 <frozen importlib._bootstrap_external>:1536(find_spec)
27198 0.003 0.000 0.003 0.000 {built-in method builtins.isinstance}
221 0.002 0.000 0.002 0.000 {method 'read' of '_io.BufferedReader' objects}
It seems that most of the time is spent in the built-in _getfem.getfem call, which presumably lives in the compiled extension module loaded by the Python library. Do you know what this corresponds to in the source code?
Sincerely,
Eric Comstock
oh, my mistake, I gave you the profiling method for Python functions. This only tells you that most of the time is spent in the C++ library (_getfem). We want to profile what is happening inside the C++ code; for this purpose you need to run your script with
perf record -F 1000 --call-graph dwarf python3 myscript.py
This creates quite a huge file called perf.data. If the file is too big, change the sampling frequency from 1000 to something lower.
To view the file, install hotspot with:
sudo apt install hotspot
and run it from the same folder containing perf.data by running
hotspot
This requires that your WSL setup is capable of showing GUIs.
Hello Kostas,
I have been unable to install perf on my computer, as it does not seem to be compatible with WSL. I have tried installing a new Linux kernel, installing a standalone version of perf, and reinstalling linux-tools, but nothing seems to work. Are there any other performance tools we can use?
If not, do you know of any way to install perf on WSL, or of a special version of perf that you are using?
Sincerely,
Eric Comstock