kth quantiles of an $n$-element set

Problem statement

Exercise 9.3-8

The $k$th quantiles of an $n$-element set are the $k$ - 1 order statistics that divide the sorted set into $k$ equal sized sets (to within 1). Give an $O(n \cdot lg k)$-time algorithm to list the $k$th quantiles of a set. [1]


Definitions

In statistics and probability, quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way. There is one fewer quantile than the number of groups created. [2]



Approach

Let's write a divide and conquer algorithm using selection to solve this problem. What if we find the $l = \lfloor\frac{n \cdot \lfloor\frac{k}{2}\rfloor}{k}\rfloor$ th smallest number  and partition the array around it. Then, we can find the first $\lfloor\frac{k}{2}\rfloor$ cut points in the first half $l$ and the remaining $\lceil\frac{k}{2}\rceil$ cut points in the second half $n - l$. At each step, we keep on dividing $k$ by two until it reaches $k$ = 1 which is the trivial case of our recursion. We first solve the left subproblem recursively, and then print the current solution before solving the right subproblem recursively. What follows is our pseudocode for finding the $k$th quantiles of an $n$-element set. A complete treatment of the procedure RANDOMIZED-SELECT used in our algorithm and proof of it's running time is beyond the scope of this article, but the interested reader is directed to the CLRS book [1].


$FIND-QUANTILES(s, k)$
$\phantom{{}++{}}$ let $n \gets s.length$
$\phantom{{}++{}}$ $FIND-QUANTILES-AUX(s, 1, n, k)$


$FIND-QUANTILES-AUX(s, i, j, k)$
$\phantom{{}++{}}$ // trivial case of recursion
$\phantom{{}++{}}$ `if` $k == 1$
$\phantom{{}++{}}$ $\phantom{{}++{}}$  `return`

$\phantom{{}++{}}$ let $n \gets j - i + 1$
$\phantom{{}++{}}$ let $k_1 \gets \lfloor\frac{k}{2}\rfloor$
$\phantom{{}++{}}$ let $l \gets \lfloor n \cdot \frac{k_1}{k} \rfloor$
$\phantom{{}++{}}$ let $p \gets RANDOMIZED-SELECT(s, i, j, l)$
$\phantom{{}++{}}$ $FIND-QUANTILES-AUX(s, i, i + l - 1, k_1)$
$\phantom{{}++{}}$ `print` $p$
$\phantom{{}++{}}$ $FIND-QUANTILES-AUX(s, i + l, j, k - k_1)$


Sample Implementation

Here's a sample Java implementation of the algorithms discussed above along with a driver program. 


Informal Proof for running time

We have $log_2 k + 1$ levels in our tree, each level except the leaf costs us $O(n)$ time for partitioning all the subarrays at that level. At the leaf level, no partitioning is performed and we merely return when k = 1, which is the base case of our recursion, doing some constant amount of work for each subproblem of size $\lfloor\frac{n}{k}\rfloor$ or $\lceil\frac{n}{k}\rceil$. Thus, the total running time of this approach is $n \cdot log_2 k + k \approx n \cdot log_2 k$.


Formal Proof for running time

Let's write the recurrence first. 

$T(n, k) = T(n_1, \lfloor\frac{k}{2}\rfloor) + T(n_2, \lceil\frac{k}{2}\rceil) + \Theta(n)$

We let $T(n, k) \le cn \cdot log_2 k$ for some constant $c \gt 0$

Let's solve the recurrence using the substitution method.

$T(n, k) \le cn_1 \cdot log_2 (\lfloor\frac{k}{2}\rfloor) + cn_2 \cdot log_2(\lceil\frac{k}{2}\rceil) + \Theta(n)$

$T(n, k) \le cn_1 \cdot log_2 (\frac{k}{2}) + cn_2 \cdot log_2(\frac{k}{2}) + \Theta(n)$

$T(n, k) \le c \cdot log_2 (\frac{k}{2}) \cdot (n_1 + n_2) + \Theta(n)$

$T(n, k) = cn \cdot log_2 (\frac{k}{2}) + \Theta(n)$

$T(n, k) = cn \cdot log_2 k - cn + \Theta(n)$

$T(n, k) \le cn \cdot log_2 k$

where the last step holds when we have some constant $c \gt 0$ for which the term $cn$ dominates the anonymous function hidden behind the term $\Theta(n).$ Thus, the running time is given by

$T(n, k) = O(n \cdot log_2 k)$

and that concludes the proof.


References

[1] https://www.amazon.com/Introduction-Algorithms-fourth-Thomas-Cormen/dp/026204630X

[2] https://en.wikipedia.org/wiki/Quantile

Comments

Popular posts from this blog

Introducing Java Reactive Extentions in to a SpringBoot Micro Service

Optimal binary search trees

Edit distance