Introduction
Sorting is a routine operation in distributed-memory applications. However, because similar values must reside near each other in memory in the output array, running an algorithm like quicksort across an input array that spans multiple processes is not effective. Instead, sorting in distributed memory requires first distributing the data to ranks in value intervals and then sorting the data locally at each rank.
Student Learning Outcomes
 Understand a bucket-based distribution sort.
 Understand how various algorithm components scale as a function of the number of process ranks.
 Understand load balancing across uniform and skewed data distributions.
 Learn about bottlenecks in distribution sort.
 Reason about how the algorithm could be improved beyond the scope of the assignment.
Prerequisites
Some knowledge of the MPI API and familiarity with the SLURM batch scheduler.
MPI Primitives Used
 MPI_Send (and possibly variants)
 MPI_Recv (and possibly variants)
 MPI_Reduce
 Possibly MPI_Get_count
Files Required
 distribution_sort_uniform_starter.c (provided starter file)
Problem Description
Distribution sort is a popular method for sorting in distributedmemory environments. We will implement distribution sort using buckets. The idea is to first partition the input set into several buckets, and then independently sort the buckets and write the sorted buckets to an output array.
In this problem, we assume that the data is already available at each process rank. After each rank has finished sorting its buckets, we leave the data distributed at each rank (i.e., we assume the data cannot be stored in main memory on a single node).
Figure 1 shows the steps of the algorithm.
The algorithm performs best when the elements are evenly distributed across the buckets. In the programming activity below, you will use uniformly distributed data. Figure 2 shows a cartoon example of a histogram with 5 buckets of equal width, where each bucket contains the same number of elements. Your buckets will contain nearly the same number of elements.
Programming Activity #1
Program the distribution sort algorithm. Note the following guidelines.

Use the starter file distribution_sort_uniform_starter.c.

Compile all programs with the -O3 optimization flag.

Use a single node.

Use a single node generation (if applicable).

Use the --exclusive flag for all experiments so that your results are not polluted by other users running on the node.

The starter file will generate uniformly distributed data on each process rank.

$N$ is the total input size across all ranks (the number of elements actually sorted may be slightly smaller, as described below). We set $N=10^9$.

Assign $\lfloor N/p \rfloor$ elements to each process rank, where $p$ is the number of process ranks. Since we take the floor of $N/p$, we do not need to consider the case where there are “leftover” elements to sort; therefore, we may sort slightly fewer than $N$ elements if $N \bmod p \neq 0$.

Assign equal-sized buckets to each process rank. The starter file defines MAXVAL, and the elements are generated in the range [0, MAXVAL). Note that you may need to assign one rank a slightly larger bucket if the range does not divide evenly by $p$.

Make sure that the data assigned to each bucket/rank is value-disjoint (i.e., the same value does not appear in two different buckets).

The algorithm has two main steps. Step 1: perform the bucketing, where each process rank stores the elements belonging to a given rank's bucket in a buffer and sends that buffer to the rank. Step 2: after all ranks have received the data in their assigned ranges, each rank sorts the data in its bucket. Figure 1 shows these steps where $p=2$. Each rank will need to send data to the $p-1$ other ranks.

You need to determine how to send the data to the appropriate ranks as a function of the data ranges assigned to that rank.

Once the data is obtained by all ranks, sort the data using qsort.

Q1: Based on the problem description, should distribution sort have low or high load imbalance?

Q2: Is there anything different about the algorithm when $p=1$?
Validation
 You should test to make sure that the data is sorted at each rank.
 You should check that the global sum of all elements across all ranks before sorting is the same as the global sum of all elements after sorting. Perform these two tests using a reduction.
 Do not include these tests in your response time measurements.
 When $p=20$, the global sum is 499937769104586.
In your report, you will create a table that shows the total time, the time to perform the bucketing/distribution step, and the time to sort. Use the maximum time across all ranks for each of these time components, obtained using reduction operations (three reductions total). Compute the parallel speedup and efficiency based on the total time. Global sum refers to the sum you obtained after sorting (be sure to use a suitable data type for this value in your program). The job script may be the same for all executions. Include all job scripts in the submitted archive. Use a single dedicated node for all experiments. A template table is provided below.
| # of Ranks ($p$) | Total Time (s) | Time to Distribute (s) | Time to Sort (s) | Parallel Speedup | Parallel Efficiency | Global Sum | Job Script Name (*.sh) |
|---|---|---|---|---|---|---|---|
| 1 | | | | | | | |
| 2 | | | | | | | |
| 4 | | | | | | | |
| 8 | | | | | | | |
| 12 | | | | | | | |
| 16 | | | | | | | |
| 20 | | | | | | | |
Using the data you collected in your table, answer the following questions. Be sure to refer to specific values in your table when responding to these questions.
 Q3: Does the time to distribute the data vary as a function of $p$?
 Q4: Does the time to sort the data vary as a function of $p$?
 Q5: How does distribution sort scale with increasing $p$?
 Q6: How does the parallel efficiency scale with increasing $p$?
 Q7: What is the bottleneck in distribution sort as $p$ increases?