from joblib import Parallel, delayed def parallel_dot(A,B,n_jobs=2): """ Computes A x B using more CPUs. This works only when the number of rows of A and the n_jobs are even. """ parallelizer = Parallel(n_jobs=n_jobs) # this iterator returns the functions to execute for each task tasks_iterator = ( delayed(np.dot)(A_block,B) for A_block in np.split(A,n_jobs) ) result = parallelizer( tasks_iterator ) # merging the output of the jobs return np.vstack(result)This function spreads the computation across more processes. The strategy applied to distribute the data is very simple. Each process has the full matrix B and a contiguous block of rows of A, so it can compute a block of rows A*B. In the end, the result of each process is stacked to build final matrix.
Let's compare the parallel version of the algorithm with the sequential one:
A = np.random.randint(0,high=10,size=(1000,1000)) B = np.random.randint(0,high=10,size=(1000,1000))
%time _ = np.dot(A,B)
CPU times: user 13.2 s, sys: 36 ms, total: 13.2 s Wall time: 13.4 s
%time _ = parallel_dot(A,B,n_jobs=2)
CPU times: user 92 ms, sys: 76 ms, total: 168 ms Wall time: 8.49 sWow, we had a speedup of 1.6X, not bad for a so naive algorithm. It's important to notice that the arguments passed as input to the Parallel call are serialized and reallocated in the memory of each worker process. Which means that the last time that parallel_dot have been called, the matrix B have been entirely replicated two times in memory. To avoid this problem, we can dump the matrices on the filesystem and pass a reference to the worker to open them as memory map.
import tempfile import os from joblib import load, dump # saving A and B to a local file for memmapping temp_folder = tempfile.mkdtemp() filenameA = os.path.join(temp_folder, 'A.mmap') dump(A, filenameA) filenameB = os.path.join(temp_folder, 'B.mmap') dump(A, filenameB)Now, when parallel_dot(A_memmap,B_memmap,n_jobs=2) is called, both the processes created will use only a reference to the matrix B..