CMSC 341 Data Structures — Projects & Support — Fall 2019

Project 2: Range Average Query

Due: Tuesday, October 8, before 9:00 pm


Addenda

  • You may find it easier to store sums (rather than averages) in the RAQ and BlockRAQ data structures. If you choose to do this, your query() function must convert sums to averages. Also, your dump() function should print sums, not averages.
  • You will need to include the cmath library in raq.cpp to use the sqrt() function. You may also include iostream (needed by dump()) and iomanip (optional, but allows for formatting of output).

Objectives

In this project, you will implement two different solutions for the Range Average Query problem, each with different asymptotic running times.


Introduction

The Range Average Query problem (RAM) is very simple to state. Let A be a numeric array of size n. We wish to make numerous queries of the following type: given a pair of indices (i,j), 0≤i≤j<n, return the average avg(i,j)=1j−i+1j∑k=iA[k].

If we were only making a small number of queries, we would just compute the answers directly: avg(i, j) sum = 0 for k = i to j sum = sum + A[k] return sum / (j – i + 1)

The problem with this simple solution is that each query has O(n) running time. In fact, making some assumptions about what we mean by a “random” query, we can show that the average interval length is approximately n/3 and so the average case running time is Θ(n)

.

We could speed-up the query by precomputing the answers. That is, we could compute avg(i,j) for all valid (i,j) pairs and save them in a matrix. Then a query would consist of looking-up the appropriate value in the matrix and returning it, which has Θ(1) running time. However, there are n(n+1)/2 possible queries, meaning distinct (i,j) pairs, so the precomputation would have a running time of n(n+1)2⋅O(n)=O(n3). It is not too difficult to show that this method is actually Θ(n3).

Variant 1: Dynamic Programming

Dynamic Programming is a algorithmic design techniques that is taught in Algorithms. For this problem, all you need to know is that there is a simple mathematical observations that allows us to precompute the avg(i,j) values more quickly. It is a fairly simple observation. Suppose I have already computed avg(i,j) and want to calculate avg(i,j+1) next; I can do this with a simple O(1) update: avg(i,j+1)=1j−i+2((j−i+1)⋅avg(i,j)+A[j+1])

Why does this work? Look back at the definition of avg(i,j), and subsitute it into the expression: (j−i+1)⋅avg(i,j)=(j−i+1)1j−i+1j∑k=iA[k]=j∑k=iA[k]. That is, (j−i+1)⋅avg(i,j) is just the sum of the elements of A from index i to index j, and adding A[j+1] gives us the sum of the elements from i to j+1. Since there are now j−i+2 elements being summed, we divide by j−i+2

to turn the sum into the average.

We can use this to precompute all possible queries. Start with avg(0,0), which is just A[0]. Calculate avg(0,1) by applying the updated procedure. Apply the update procedure to avg(0,1) to calculate avg(0,2). Repeat until you’ve calculatd avg(0,n−1). Now start with avg(1,1)=A[1] and apply updates to compute avg(1,2), etc., up to avg(1,n−1)

. Repeat this procedure until you’ve precomputed all the values. Be sure to save the precomputed values in an appropriate data structure, such as a two-dimensional array.

Using this procedure, we compute avg(i,j) for all n(n+1)2 (i,j) pairs. The computation for any one pair is Θ(1), so the time to complete the precomputation is Θ(n2). Compare this to precomputation with the simple version, which had Θ(n3)

running time.

Once the precomputation is done, a query is just a lookup into the matrix in which we’ve stored the precomputed values, a Θ(1) operation. Using Dynamic Programming (clever updating), we have reduced the precomputation time to Θ(n2) and retained a Θ(1)

query time.

Variant 2: Block Decomposition

The simplest way to describe block decomposition is to look an an example. The table below contains a 15-long array in the middle row; the top row contains the indices into the array. In the bottom row, we have computed the average of each three-element block from the data array. That is, the first element of the bottom row, 36, is the average of 34, 16, and 58, etc. The data values were chosen so that the averages would be integers, but this will not be the case in general.

01234567891011121314
341658-245379792-124590-12077
3612591832

How would we evaluate a query with this data structure? If the (i,j) pair corresponds to one of the blocks, the query is a simple lookup. Using the example, avg(6,8)

would simply return 59, the precomputed average for the third block.

If the query corresponds to multiple full blocks, this is only slightly harder. For avg(3,8), we would need to return the average of the second and third blocks. Since the blocks are the same size, we can sum the blocks and divide by two: (12+59)/2=35.5.

In general, we would sum the blocks and divide by the number of blocks. However, there is a situation, discussed later, in which this is not quite correct.

Unfortunately, many queries will not align with the block structure. For those, we need to do more work. Let b0,b1,…,b4 denote the five block averages. Suppose we wish to query avg(2,10). The answer will involve two block averages (b1,b2) and data from two partial blocks (the first and fourth). For this example, we would first compute the sum of the elements A[2],A[3],…,A[10] as follows: A[2]+blocksize⋅b1+blocksize⋅b2+A[9]+A[10] which for the example data is 58+3⋅12+3⋅59+45+9=325. To find the average, divide by the number of elements: avg(2,10)=32510−2+1=3259=36.1111.

Warning: In our example, all our blocks are the same size, but in many cases the last block will be smaller. For example, if our data array has size 17, there is no way to divide it into blocks of equal size. We could divide 17 into five blocks of three and a final blocks of size 2, or four blocks of size four with a final block of size one. In either case, the last block is a different size. You will have to handle this correctly in your code!

How efficient is the block approach? Precomputing the block averages can be done in Θ(n) time, using a single pass through the data array. The running time of the query method depends on the chosen block size. It turns out that the optimal block size is approximately √n, resulting in a O(√n)

running time.

Summary

We’ve discussed four methods for computing Range Average Queries, each with different precomputation and query running times:

  1. No precomputation. Write a function that computes a query with no precomputation. The query running time is O(n)

, but there is no precomputation. If we have to make a lot of queries, this becomes infefficient. Simple precomputation. Use our function from (1) to precompute all possible queries and save them in a two-dimensional array. The time to build the array is O(n3), but the query time is O(1). Dynamic Programming (Variant 1). By thinking a little bit about how the precomputation is done, we reduce the precomputation time to O(n2) and retain the O(1) query running time. Block Decomposition (Variant 2). By breaking the problem into a number of smaller problems (blocks), we reduce the precomputation to O(n), but the query running time increases to O(√n).

There are more complex methods that give different combinations of running times for the precomputation and queries. One approach, which utilizes a type of binary tree called a Cartesian Tree, gives O(n) time for precomputation and O(1) queries. Cartesian Trees will be the subject of Project 3.


Assignment

Your assignment is to implement the Dynamic Programming and Block Decomposition methods for Range Average Queries. You have a lot of freedom in how to implement your solution — you get to decide what the private class variables should be — but you must implement the specified functions, and you must use Dynamic Programming and Block Decomposition.

For this project, you are only provided with a skeleton .h file and a simple test:

  • raq.h — skeleton .h file for the RAQ and BlockRAQ classes.
  • test0.cpp — a basic test program.
  • test0.txt — sample output from test0.cpp using my implementation of RAQ and BlockRAQ. Your ouput may not be identical, depending on how you set-up the data structures.

The file raq.h contains the function declarations for two classes:

  1. RAQ — implements the Dynamic Programming solution for Range Average Queries.
  2. BlockRAQ — implements the Block Decomposition solution for Range Average Queries.

The private class variables have not been defined. It is up to you to do that. You will need some array-type variables: you may use C/C++ arrays, the STL vector class, or the STL array class. No other STL containers or libraries are permitted. Once you have decided on the class variables, you can implement the class methods.

If necessary, you must implement a copy constructor, destructor, and assignment operator for both classes. Whether this is necessary depends on your choice of class variables.

Finally, you are responsible for thoroughly testing your program. For grading purposes, it will be tested on input arrays of varying sizes, including very large arrays. Your submission will also be checked for memory leaks and memory errors.


Specifications

For the RAQ class, you must declare the class variables and implement the following methods:

MethodDescription
RAQ(std::vector<float> data) Creates a RAQ object from a data vector. Performs precomputation for the Dynamic Programming solution.
float query(int i, int j) const Performs a Range Average Query from index i to index j. i and j must be in the range 0…n−1
where n

Tags: , , , , , , , , , , , ,