Parallel processing during the backward pass
This is not a task (yet). I am just reporting on what I did.
I first tried Python's multiprocessing library for parallelizing the backward pass, then the threading library. Neither worked, and here is why.
The multiprocessing library uses os.fork() (on Linux) to create child processes. os.fork() copies the entire state of the interpreter into each child. When the processing is done, the children exit, and any data that was not explicitly returned vanishes with them.
Unfortunately, since we do not allocate new memory but instead modify the data objects of the various classes in place, any computation performed on those data elements inside a child process vanishes when that process exits. Only the objects explicitly returned are copied back to the parent process.
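The failure mode can be reproduced with a minimal sketch. `NodeData` and `backward_step` below are hypothetical stand-ins for a per-node solver data object and an in-place backward-pass computation; the point is only that the child's mutation never reaches the parent:

```python
import multiprocessing as mp

class NodeData:
    """Hypothetical stand-in for a per-node solver data object."""
    def __init__(self):
        self.grad = 0.0

def backward_step(data):
    # In-place mutation inside the child process: this write happens
    # on the child's copy of the object and is lost when it exits.
    data.grad = 42.0

def run():
    data = NodeData()
    p = mp.Process(target=backward_step, args=(data,))
    p.start()
    p.join()
    return data.grad  # still 0.0: the child worked on its own copy

if __name__ == "__main__":
    print(run())  # 0.0
```

Making the worker *return* `data` instead would work, but that is exactly option (b) below, with its pickling and copying costs.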
We have three options: a) implement our own fork() that does not copy the environment, b) return the new data elements and copy them back into the parent process, or c) use shared memory to share the data between the processes.
I did not do (a) because it would demand a lot of time. Option (b) is not really efficient, and both (b) and (c) require the data elements to be picklable, which is not the case for the Pinocchio objects.
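For reference, here is what option (c) looks like with multiprocessing's built-in shared memory. This only works for flat numeric buffers such as mp.Array; to use it here we would have to serialize the Pinocchio data into such buffers, which runs into the same picklability problem:

```python
import multiprocessing as mp

def square_into(shared, i):
    # Writes land in the shared buffer, so the parent sees them,
    # unlike writes to an ordinary Python object in a child.
    shared[i] = float(i * i)

def run():
    shared = mp.Array("d", 4)  # four doubles in shared memory
    procs = [mp.Process(target=square_into, args=(shared, i))
             for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return list(shared)

if __name__ == "__main__":
    print(run())  # [0.0, 1.0, 4.0, 9.0]
```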
The threading library does not copy memory and runs threads instead of processes, which is good. However, Python has the Global Interpreter Lock (GIL), which forces that only one thread can manipulate the interpreter's memory at any given time. Which is frustrating: you don't really run threads in parallel, you run them in sequence, with extra overhead.
This is indeed what I observed: one iteration of the biped example took 3 s single-threaded, 3.2 s with 4 threads, and 3.5 s with 8 threads.
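The effect is easy to reproduce on any CPU-bound pure-Python function (a generic sketch, not the actual biped backward pass): splitting the same total work across four threads gives no speedup, because each thread must hold the GIL to execute bytecode.

```python
import threading
import time

def cpu_bound(n, out, idx):
    # Pure-Python loop: the thread holds the GIL while it runs.
    s = 0
    for i in range(n):
        s += i
    out[idx] = s

N = 2_000_000

# Serial baseline.
t0 = time.perf_counter()
out = [0]
cpu_bound(N, out, 0)
serial = time.perf_counter() - t0

# Same total work split across 4 threads.
t0 = time.perf_counter()
outs = [0] * 4
threads = [threading.Thread(target=cpu_bound, args=(N // 4, outs, i))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - t0

# Under the GIL, `threaded` is no faster than `serial`,
# and is often slower due to thread-switching overhead.
print(f"serial: {serial:.3f}s, 4 threads: {threaded:.3f}s")
```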
In conclusion, if anyone wants to try out Python multiprocessing, keep these issues in mind and pick up where I left off.
Otherwise, I think we have to wait for the C++ implementation before we can think about parallel processing.