In many cases, it is impossible to divide a parallel data structure so that each process has exactly the same amount of data, for example when the number of mesh rows is not a multiple of the number of processes. An even split may not even be desirable if the amount of work to be done varies. Modify your code so that each process can have a different number of rows of the distributed mesh.
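
One possible way to choose the uneven row counts is a block distribution in which the first few processes each take one extra row. The sketch below is only an illustration under that assumption. The helper names local_row_count and first_row are hypothetical and are not part of MPI or of the exercise code.

```c
#include <assert.h>

/* Hypothetical helper: number of mesh rows owned by `rank` when `nrows`
 * rows are split over `nprocs` processes. The first (nrows % nprocs)
 * ranks each take one extra row, so counts differ by at most one. */
int local_row_count(int nrows, int nprocs, int rank)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    int base  = nrows / nprocs;   /* rows every rank gets */
    int extra = nrows % nprocs;   /* leftover rows to hand out */
    return base + (rank < extra ? 1 : 0);
}

/* Hypothetical helper: global index of the first row owned by `rank`
 * under the same distribution. */
int first_row(int nrows, int nprocs, int rank)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    int base  = nrows / nprocs;
    int extra = nrows % nprocs;
    /* Ranks below `rank` own rank*base rows plus one extra row for
     * each of them that is among the first `extra` ranks. */
    return rank * base + (rank < extra ? rank : extra);
}
```

With 10 rows on 4 processes, this gives row counts 3, 3, 2, 2 starting at global rows 0, 3, 6, 8. Other distributions, such as ones weighted by the expected work per row, would fit the exercise equally well.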

You may want to use these MPI routines in your solution:
MPI_Gather MPI_Gatherv
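
As a hedged sketch of how these two routines might fit together: the root could first collect each process's row count with MPI_Gather (one int per process), then build the recvcounts and displs arrays that MPI_Gatherv expects. The helper below shows only that second step in plain C, so it compiles without MPI. The name gatherv_layout and its parameters are hypothetical, and it assumes the mesh is stored row by row with ncols values per row.

```c
#include <assert.h>

/* Hypothetical helper: given rows[i], the number of mesh rows held by
 * rank i, fill recvcounts[i] (elements contributed by rank i) and
 * displs[i] (offset, in elements, where rank i's data starts in the
 * root's receive buffer). Returns the total number of elements, which
 * is the size the receive buffer needs. */
int gatherv_layout(const int *rows, int nprocs, int ncols,
                   int *recvcounts, int *displs)
{
    assert(nprocs >= 0 && ncols >= 0);
    int offset = 0;
    for (int i = 0; i < nprocs; i++) {
        recvcounts[i] = rows[i] * ncols;  /* whole rows, ncols values each */
        displs[i] = offset;               /* blocks are packed back to back */
        offset += recvcounts[i];
    }
    return offset;
}
```

The resulting arrays would be passed as the recvcounts and displs arguments of MPI_Gatherv. Both are significant only at the root, and displs is measured in units of the receive datatype rather than in bytes.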