kth quantiles of an $n$-element set
Problem statement
Exercise 9.3-8
The $k$th quantiles of an $n$-element set are the $k$ - 1 order statistics that divide the sorted set into $k$ equal sized sets (to within 1). Give an $O(n \cdot lg k)$-time algorithm to list the $k$th quantiles of a set. [1]
Definitions
In statistics and probability, quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way. There is one fewer quantile than the number of groups created. [2]
Approach
Let's write a divide and conquer algorithm using selection to solve this problem. What if we find the $l = \lfloor\frac{n \cdot \lfloor\frac{k}{2}\rfloor}{k}\rfloor$ th smallest number and partition the array around it. Then, we can find the first $\lfloor\frac{k}{2}\rfloor$ cut points in the first half $l$ and the remaining $\lceil\frac{k}{2}\rceil$ cut points in the second half $n - l$. At each step, we keep on dividing $k$ by two until it reaches $k$ = 1 which is the trivial case of our recursion. We first solve the left subproblem recursively, and then print the current solution before solving the right subproblem recursively. What follows is our pseudocode for finding the $k$th quantiles of an $n$-element set. A complete treatment of the procedure RANDOMIZED-SELECT used in our algorithm and proof of it's running time is beyond the scope of this article, but the interested reader is directed to the CLRS book [1].
Sample Implementation
Here's a sample Java implementation of the algorithms discussed above along with a driver program.
Informal Proof for running time
We have $log_2 k + 1$ levels in our tree, each level except the leaf costs us $O(n)$ time for partitioning all the subarrays at that level. At the leaf level, no partitioning is performed and we merely return when k = 1, which is the base case of our recursion, doing some constant amount of work for each subproblem of size $\lfloor\frac{n}{k}\rfloor$ or $\lceil\frac{n}{k}\rceil$. Thus, the total running time of this approach is $n \cdot log_2 k + k \approx n \cdot log_2 k$.
Formal Proof for running time
Let's write the recurrence first.
$T(n, k) = T(n_1, \lfloor\frac{k}{2}\rfloor) + T(n_2, \lceil\frac{k}{2}\rceil) + \Theta(n)$
We let $T(n, k) \le cn \cdot log_2 k$ for some constant $c \gt 0$
Let's solve the recurrence using the substitution method.
$T(n, k) \le cn_1 \cdot log_2 (\lfloor\frac{k}{2}\rfloor) + cn_2 \cdot log_2(\lceil\frac{k}{2}\rceil) + \Theta(n)$
$T(n, k) \le cn_1 \cdot log_2 (\frac{k}{2}) + cn_2 \cdot log_2(\frac{k}{2}) + \Theta(n)$
$T(n, k) \le c \cdot log_2 (\frac{k}{2}) \cdot (n_1 + n_2) + \Theta(n)$
$T(n, k) = cn \cdot log_2 (\frac{k}{2}) + \Theta(n)$
$T(n, k) = cn \cdot log_2 k - cn + \Theta(n)$
$T(n, k) \le cn \cdot log_2 k$
where the last step holds when we have some constant $c \gt 0$ for which the term $cn$ dominates the anonymous function hidden behind the term $\Theta(n).$ Thus, the running time is given by
$T(n, k) = O(n \cdot log_2 k)$
and that concludes the proof.
References
[1] https://www.amazon.com/Introduction-Algorithms-fourth-Thomas-Cormen/dp/026204630X
Comments
Post a Comment