## Introduction: Multi-dimensional Data Access Methods

Multi-dimensional data access methods are used to reduce the complexity of brute force range queries. For a given range query, instead of comparing all $|D|$ data points to the ranges in each dimension, a spatial partitioning of the dataset can be used to reduce the number of comparisons between points and the ranges in each dimension. In this module, we will index $D$ using the R-tree. Note that there are other spatial indexing methods that could be employed, such as kd-trees, quad-trees, oct-trees, ball trees, and others.

## Optional Reading

- Guttman, A. R-trees: a dynamic index structure for spatial searching. In Proc. of the 1984 ACM SIGMOD Intl. Conf. on Management of Data pp 47-57. [link]

## R-tree Overview

An R-tree (Rectange-tree) stores objects in rectangles (here, objects refer to our data points in $D$). The main idea of the R-tree is to construct the index using a hierarchy of bounding rectangles as a function of the input data. The root of the tree contains the entire dataset, and as the tree is traversed, there are fewer objects that overlap a given range query.

Figure 2 outlines an example of indexing data in an R-tree. The input data objects refer to the red rectangles, $R8, R9, \ldots, R19$. The rectangles $R1, R2, \ldots, R7$ are internal rectangles and are constructed based on the input data objects. As an example, observe that to find $R16$, the tree is traversed as follows: $R2 \rightarrow R6 \rightarrow R16$. Therefore, for a given range query, the data is partitioned based on the distribution of the internal rectanges ($R1, R2, \ldots, R7$) which avoid making a comparison to all points in $D$ (unless the range is sufficiently large in all dimensions).

## R-tree Indexing

R-trees index data objects and search using minimum bounding rectanges (MBRs). That is why in Figure 2, the data objects $R8,\ldots,R19$ are shown as rectangles (and not points as shown in Figure 1). A data object is defined using the minimum and maximum values of the MBR (Figure 3). Let $m$ be a 2-dimensional MBR, which is defined as follows:

$$m=(x^{min}, y^{min}, x^{max}, y^{max}).$$

For a 2-D point object that does not have a spatial extent, note that $x^{min}=x^{max}$ and $y^{min}=y^{max}$. Therefore, if $p_x$ and $p_y$ denote the $x$ and $y$ coordinates of a point, then the MBR for this point is defined as follows: $m=(p_x, p_y, p_x, p_y)$. To construct an R-tree index, data is inserted as MBRs into the tree.

## R-tree Searching

After indexing data in an R-tree, you simply define a range query using an MBR as described above, which outlines the minimum and maximum ranges in each dimension. Then search the R-tree using this MBR.

## Tutorial: Using an R-tree

During this miniature tutorial, follow along using `range_query_rtree_movie_example.cpp`

.

- Download the files
`range_query_rtree_movie_example.cpp`

and`RTree.h`

. `RTree.h`

has been downloaded from superliminal.com from here. I have made a few minor modifications to the file to remove compiler errors and warnings under g++ (tested using v.4.4.7 and v.5.4.0). This R-tree implementation has been used extensively by the community.

The R-tree uses C++ and is templated. To compile, use C++ (not C) using the following command:

```
mpic++ -O3 range_query_rtree_movie_example.cpp -lm -o range_query_rtree_movie_example
```

The file will import all of the movie data in Figure 1.

The line that constructs the R-tree is as follows.

```
RTree<int, double, 2, double> tree;
```

The four parameters are: <DATATYPE> <ELEMTYPE> <NUMDIMS> <ELEMTYPEREAL>, and are described as follows:

- <DATATYPE>: The type of the reference data (movie 1, 2, 3).
- <ELEMTYPE>: The type of the data that will be indexed (dates, and movie ratings).
- <NUMDIMS>: The dimensionality of the data (2 dimensions).
- <ELEMTYPEREAL>: Data type used in volume calculations (use a double to minimize floating point error).

To construct the R-Tree, create MBRs from the data. The lines that perform this task are below:

```
for (int i=0; i<N; i++)
{
Rect tmp=Rect(data[i].x,data[i].y,data[i].x,data[i].y);
tree.Insert(tmp.min, tmp.max, i);
}
```

To search the R-tree for the MBR shown in Figure 1, we create an MBR and search the tree, using the lines below:

```
Rect search_rect=Rect(1275375600, 7.1, 1309503600, 9.1);
unsigned int nhits=tree.Search(search_rect.min, search_rect.max, MySearchCallback, NULL);
//The value of nhits is the number of objects within the MBR, which returns 3 objects.
```

## Programming Activity #2

Implement the range query algorithm using the instructions outlined in Programming Assignment #1, but use an R-tree. Note the following:

- Use
`range_query_rtree_starter.cpp`

. The starter file is the same as`range_query_starter.c`

, except that it has the`.cpp`

extension and includes`RTree.h`

. - Remember to compile using
`mpic++`

.

Similarly to Programming Assignment #1, complete the following tables using one node. Note that your global sum should be identical to your brute force implementation for a given value of $p$. The R-tree implementation has an index construction phase and a search phase. Time the following components:

- The total response time (maximum time across all ranks).
- The time to construct the R-tree (maximum time across all ranks).
- The time to perform the searches (maximum time across all ranks).

**Table 2: Total response time, index construction time, and search time (one node).**

# of Ranks ($p$) | Total Time (s) | R-Tree Construction (s) | Search Time (s) | Global Sum | Job Script Name (*.sh) |
---|---|---|---|---|---|

1 | |||||

4 | |||||

8 | |||||

12 | |||||

16 | |||||

20 |

**Table 3: Speedup and parallel efficiency of the search phase from Table 2.**

Typically, databases are indexed once and then queried multiple times. Therefore, index construction time can be disregarded. Compute the speedup and parallel efficiency using the **index search time** (not total response time) from Table 2.

# of Ranks ($p$) | Speedup | Parallel Efficiency |
---|---|---|

1 | ||

4 | ||

8 | ||

12 | ||

16 | ||

20 |

Using the data you collected in your tables, answer the following questions. Be sure to refer to specific values in your tables when responding to these questions.

**Q3: Each rank constructs an identical R-tree but searches the index for different range queries. How does the time to construct the index change with increasing $p$? Explain this performance behavior.****Q4: Which implementation has better performance, the R-tree (search only time) or the brute force algorithm? Explain.****Q5: Which has the highest parallel efficiency, the brute force algorithm or the R-tree? Why do you think the parallel efficiency varies between algorithms?**