Minimum Spanning Trees - Kruskal's Algorithm

Electronic circuit designs often need to make the pins of several components electrically equivalent by wiring them together. To interconnect a set of $n$ pins, we can use an arrangement of  $n - 1$ wires, each connecting two pins. Of all such arrangements, the one that uses the least amount of wire is usually the most desirable.

We can model this wiring problem with a connected, undirected graph $G = (V, E)$, where $V$ is the set of pins, $E$ is the set of possible interconnections between pairs of pins, and for each edge $(u,  \;v) \in E$, we have a weight $w(u,  v)$ specifying the cost (amount of wire needed) to connect $u$ and $v$. We then wish to find an acyclic subset $T \subseteq E$ that connects all of the vertices and whose total weight is minimized.

$$\LARGE
w(T) = \sum_{(u, \;v) \in T} w(u, \;v)
$$


Since $T$ is acyclic and connects all of the vertices, it must form a tree, which we call a spanning tree since it “spans” the graph $G$. We call the problem of determining the tree $T$ the minimum-spanning-tree problem.

Kruskal's algorithm is a greedy algorithm. Each step of a greedy algorithm must make one of several possible choices. The greedy strategy advocates making the choice that is the best at the moment. Such a strategy does not generally guarantee that it will find globally optimal solutions to problems. For the minimum-spanning-tree problem, however, we can prove that certain greedy strategies do yield a spanning tree with minimum weight. [1]

We first need some definitions. A cut $(S, V - S)$ of an undirected graph $G = (V, E)$ is a partition of $V$. We say that an edge $(u, \; v) \in E$ crosses the cut $(S, V - S)$ if one of its endpoints is in $S$ and the other is in $V - S$. We say that a cut respects a set $A$ of edges if no edge in $A$ crosses the cut. An edge is a light edge crossing a cut if its weight is the minimum of any edge crossing the cut. Note that there can be more than one light edge crossing a cut in the case of ties. More generally, we say that an edge is a light edge satisfying a given property if its weight is the minimum of any edge satisfying the property. [1]

Our rule for recognizing safe edges is given by the following theorem.

Theorem

Let $G = (V, E)$ be a connected, undirected graph with a real-valued weight function $w$ defined on $E$. Let $A$ be a subset of $E$ that is included in some minimum spanning tree for $G$, let $(S, V - S)$ be any cut of $G$ that respects $A$, and let $(u,  v)$ be a light edge crossing $(S, V - S)$. Then, edge $(u,  v)$ is safe for $A$.

Kruskal’s algorithm

Kruskal’s algorithm finds a safe edge to add to the growing forest by finding, of all the edges that connect any two trees in the forest, an edge $(u,  v)$ of least weight. Let $C_1$ and $C_2$ denote the two trees that are connected by $(u,  v)$. Since $(u,  v)$ must be a light edge connecting $C_1$ to some other tree, it is a safe edge for $C_1$. Kruskal’s algorithm qualifies as a greedy algorithm because at each step it adds to the forest an edge of least possible weight.





Kruskal's algorithm uses a disjoint-set data structure to maintain several disjoint sets of elements. Each set contains the vertices in one tree of the current forest. The operation FIND-SET(u) returns a representative element from the set that contains $u$. Thus, we can determine whether two vertices $u$ and $v$ belong to the same tree by testing whether FIND-SET(u) equals FIND-SET(v). To combine trees, Kruskal's algorithm calls the UNION procedure. [1]
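To make the overall flow concrete, here is a minimal Python sketch of Kruskal's greedy loop, assuming edges are given as (weight, u, v) triples; it uses a naive parent-pointer disjoint set for now (faster representations and heuristics are discussed below), and all names are illustrative:

def kruskal_mst(vertices, weighted_edges):
    """Return a set of edges forming a minimum spanning tree.

    weighted_edges is an iterable of (w, u, v) triples.  A naive
    parent-pointer disjoint set stands in for MAKE-SET / FIND-SET / UNION;
    the representations and heuristics discussed below make it faster.
    """
    parent = {v: v for v in vertices}          # MAKE-SET for every vertex

    def find_set(x):
        while parent[x] != x:                  # walk up to the representative
            x = parent[x]
        return x

    A = set()
    for w, u, v in sorted(weighted_edges):     # edges in nondecreasing order of weight
        ru, rv = find_set(u), find_set(v)
        if ru != rv:                           # u and v lie in different trees,
            A.add((u, v))                      # so (u, v) is a light, safe edge
            parent[rv] = ru                    # UNION the two trees
    return A

For example, kruskal_mst({1, 2, 3}, [(1, 1, 2), (2, 2, 3), (3, 1, 3)]) returns the two cheapest edges, {(1, 2), (2, 3)}.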


A simple way of implementing the disjoint-set data structure is by using a linked list to represent each set. With this linked-list representation, both MAKE-SET and FIND-SET are easy, requiring $Θ(1)$ time. The simplest implementation of the UNION operation using the linked-list set representation takes significantly more time than MAKE-SET or FIND-SET. We perform UNION(x, y) by appending $y$’s list onto the end of $x$’s list. Unfortunately, we must update the pointer to the set object for each object originally on $y$’s list, which takes time linear in the length of $y$’s list. In fact, we can easily construct a sequence of $m$ operations on $n$ objects that requires $Θ(n^2)$ time. [1]
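As a rough sketch of that idea (not the exact bookkeeping from [1]), each set can be a Python list, with every element keeping a reference to the list that contains it; the class and method names are illustrative:

class LinkedListSets:
    """Toy linked-list-style sets: one Python list per set."""

    def __init__(self):
        self.set_of = {}                  # element -> the list (set object) containing it

    def make_set(self, x):
        self.set_of[x] = [x]              # Θ(1)

    def find_set(self, x):
        return self.set_of[x][0]          # the first element is the representative, Θ(1)

    def union(self, x, y):
        sx, sy = self.set_of[x], self.set_of[y]
        if sx is sy:
            return
        sx.extend(sy)                     # append y's list onto the end of x's list
        for item in sy:                   # update the set reference of every element
            self.set_of[item] = sx        # moved: linear in the length of y's list

A sequence of unions in which the appended list keeps growing is exactly the kind of input that leads to the $Θ(n^2)$ total mentioned above.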

Disjoint-set forests

In a faster implementation of disjoint sets, we represent sets by rooted trees, with each node containing one member and each tree representing one set. In a disjoint-set forest, each member points only to its parent. The root of each tree contains the representative and is its own parent. As we shall see, although the straightforward algorithms that use this representation are no faster than ones that use the linked-list representation, by introducing two heuristics—“union by rank” and “path compression”—we can achieve an asymptotically optimal disjoint-set data structure. [1]

Heuristics to improve the running time

So far, we have not improved on the linked-list implementation. A sequence of $n - 1$ UNION operations may create a tree that is just a linear chain of $n$ nodes. By using two heuristics, however, we can achieve a running time that is almost linear in the total number of operations $m$.
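As a toy demonstration of that worst case (assuming a naive union that always hangs the first tree's root under the second's; all names here are illustrative):

parent = {}

def make_set(x):
    parent[x] = x

def find_set(x):
    while parent[x] != x:           # walk parent pointers up to the root
        x = parent[x]
    return x

def union(x, y):                    # naive: always hang x's root under y's root
    parent[find_set(x)] = find_set(y)

n = 5
for v in range(1, n + 1):
    make_set(v)
for v in range(1, n):               # n - 1 UNION operations in one fixed direction...
    union(v, v + 1)
# ...leave the chain 1 -> 2 -> 3 -> 4 -> 5, so find_set(1) walks n - 1 links
print(find_set(1))                  # prints 5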

The first heuristic, union by rank, is similar to the weighted-union heuristic for the linked-list representation. The obvious approach would be to make the root of the tree with fewer nodes point to the root of the tree with more nodes; rather than tracking sizes explicitly, we maintain for each node a rank, an upper bound on its height, and make the root with smaller rank point to the root with larger rank. The second heuristic, path compression, is also quite simple and highly effective. We use it during FIND-SET operations to make each node on the find path point directly to the root. [1]
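A sketch of the forest with both heuristics, in the same illustrative Python style (the rank bookkeeping follows the usual union-by-rank scheme described in [1]):

class DisjointSetForest:
    """Parent-pointer forest with union by rank and path compression."""

    def __init__(self):
        self.parent = {}
        self.rank = {}                        # rank = upper bound on a node's height

    def make_set(self, x):
        self.parent[x] = x                    # a new root is its own parent
        self.rank[x] = 0

    def find_set(self, x):
        if self.parent[x] != x:
            # path compression: point x (and, recursively, its ancestors)
            # directly at the root
            self.parent[x] = self.find_set(self.parent[x])
        return self.parent[x]

    def union(self, x, y):
        rx, ry = self.find_set(x), self.find_set(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:     # union by rank: the root of the
            rx, ry = ry, rx                   # "shorter" tree points to the other root
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:    # equal ranks: the surviving root grows
            self.rank[rx] += 1

Path compression keeps find paths short on later calls, and union by rank prevents the trees from degenerating into the linear chains shown above.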

Effect of the heuristics on the running time

Separately, either union by rank or path compression improves the running time of the operations on disjoint-set forests, and the improvement is even greater when we use the two heuristics together. Alone, union by rank yields a running time of $Θ(m \cdot log_2\; n)$, and this bound is tight. When we use both union by rank and path compression, the worst-case running time is $Θ(m \cdot  \alpha(n)),$ where $\alpha(n)$ is a very slowly growing function. [1]

In any conceivable application of a disjoint-set data structure, $\alpha(n) \le 4$; thus, we can view the running time as linear in $m$ in all practical situations. Strictly speaking, however, it is superlinear.

Now let’s consider a couple of real-world applications of Kruskal's algorithm.

Min Cost to Connect All Nodes

You are given an undirected graph with $n$ nodes labeled $1\ldots n$. Some of the nodes are already connected: the $i$-th existing edge connects nodes $edges[i][0]$ and $edges[i][1]$. Your task is to augment this set of edges with additional edges to connect all the nodes. Find the minimum cost to add new edges between the nodes such that all the nodes are accessible from each other.

Input:

$n$, an int representing the total number of nodes.
$edges$, a list of integer pairs representing the nodes already connected by an edge.
$newEdges$, a list where each element is a triplet representing the pair of nodes between which an edge can be added and the cost of addition, respectively (e.g. $[1, 2, 5]$ means that adding an edge between nodes 1 and 2 would cost 5).

Example:

Input:
$n = 6$, $edges = [[1, 4], [4, 5], [2, 3]]$, $newEdges = [[1, 2, 5], [1, 3, 10], [1, 6, 2], [5, 6, 5]]$

Output: $7$

Explanation:
There are 3 connected components: $[1, 4, 5]$, $[2, 3]$ and $[6]$.
We can connect these components into a single component by connecting node 1 to node 2 and node 1 to node 6 at a minimum cost of $5 + 2 = 7$.
Hint: what’s the time complexity of your algorithm? Can you make the running time $Θ(E \cdot log_2\; E)$ by using Union-Find?

Min Cost to Repair Edges

There's an undirected connected graph with $n$ nodes labeled $1\ldots n$. But some of the edges have been broken, disconnecting the graph. Find the minimum cost to repair the edges so that all the nodes are once again accessible from each other.

Input:

$n$, an int representing the total number of nodes.
$edges$, a list of integer pairs representing the nodes connected by an edge.
$edgesToRepair$, a list where each element is a triplet representing the pair of nodes between which an edge is currently broken and the cost of repairing that edge, respectively (e.g. $[1, 2, 12]$ means that repairing the edge between nodes 1 and 2 would cost 12).


Example 1:

Input:
$n = 5$, $edges = [[1, 2], [2, 3], [3, 4], [4, 5], [1, 5]]$, $edgesToRepair = [[1, 2, 12], [3, 4, 30], [1, 5, 8]]$
Output: $20$
Explanation:
There are 3 connected components due to broken edges: $[1]$, $[2, 3]$ and $[4, 5]$.
We can connect these components into a single component by repairing the edges between nodes 1 and 2, and nodes 1 and 5, at a minimum cost of $12 + 8 = 20$.

Example 2:

Input:
$n = 6$, $edges = [[1, 2], [2, 3], [4, 5], [3, 5], [1, 6], [2, 4]]$, $edgesToRepair = [[1, 6, 410], [2, 4, 800]]$
Output: $410$

Example 3:

Input:
$n = 6$, $edges = [[1, 2], [2, 3], [4, 5], [5, 6], [1, 5], [2, 4], [3, 4]]$, $edgesToRepair = [[1, 5, 110], [2, 4, 84], [3, 4, 79]]$
Output: $79$


So, we need to implement Kruskal's algorithm to solve these problems.
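What follows is a minimal, self-contained Python sketch of one way to solve both problems with Kruskal's algorithm (not necessarily the original implementation); the function names and the node labeling $1\ldots n$ simply mirror the problem statements above:

class DSU:
    """Disjoint-set forest with union by rank and path compression (nodes 1..n)."""

    def __init__(self, n):
        self.parent = list(range(n + 1))
        self.rank = [0] * (n + 1)

    def find_set(self, x):
        root = x
        while self.parent[root] != root:               # find the representative
            root = self.parent[root]
        while self.parent[x] != root:                  # path compression: point every
            self.parent[x], x = root, self.parent[x]   # node on the path at the root
        return root

    def union(self, x, y):
        rx, ry = self.find_set(x), self.find_set(y)
        if rx == ry:
            return False                               # already in the same tree
        if self.rank[rx] < self.rank[ry]:              # union by rank
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        return True


def min_cost_to_connect_all_nodes(n, edges, new_edges):
    dsu = DSU(n)
    for u, v in edges:                                 # existing edges cost nothing
        dsu.union(u, v)
    cost = 0
    for u, v, w in sorted(new_edges, key=lambda e: e[2]):   # Kruskal: cheapest first
        if dsu.union(u, v):                            # pay only if it joins two trees
            cost += w
    return cost


def min_cost_to_repair_edges(n, edges, edges_to_repair):
    broken = {tuple(sorted(e[:2])) for e in edges_to_repair}
    dsu = DSU(n)
    for u, v in edges:
        if tuple(sorted((u, v))) not in broken:        # intact edges join components for free
            dsu.union(u, v)
    cost = 0
    for u, v, w in sorted(edges_to_repair, key=lambda e: e[2]):
        if dsu.union(u, v):
            cost += w
    return cost


print(min_cost_to_connect_all_nodes(
    6, [[1, 4], [4, 5], [2, 3]],
    [[1, 2, 5], [1, 3, 10], [1, 6, 2], [5, 6, 5]]))    # 7
print(min_cost_to_repair_edges(
    5, [[1, 2], [2, 3], [3, 4], [4, 5], [1, 5]],
    [[1, 2, 12], [3, 4, 30], [1, 5, 8]]))              # 20
print(min_cost_to_repair_edges(
    6, [[1, 2], [2, 3], [4, 5], [3, 5], [1, 6], [2, 4]],
    [[1, 6, 410], [2, 4, 800]]))                       # 410
print(min_cost_to_repair_edges(
    6, [[1, 2], [2, 3], [4, 5], [5, 6], [1, 5], [2, 4], [3, 4]],
    [[1, 5, 110], [2, 4, 84], [3, 4, 79]]))            # 79

Both functions are the same greedy loop: union the zero-cost (existing or intact) edges first, then scan the candidate edges in nondecreasing order of cost and pay only for those that merge two different components. Sorting dominates, which is where the $E \cdot log_2\; E$ factor in the hint and in the analysis below comes from.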


Analysis

The running time of Kruskal’s algorithm for a graph $G = (V, E)$ depends on how we implement the disjoint-set data structure. We assume the disjoint-set-forest implementation with the union-by-rank and path-compression heuristics, since it is the asymptotically fastest implementation known. Initializing the set $A$ of tree edges takes $O(1)$ time, and the time to sort the edges is $O(E \cdot log_2\; E)$. The main loop performs $O(E)$ FIND-SET and UNION operations on the disjoint-set forest. Along with the $\lvert V\rvert$ MAKE-SET operations, these take a total of $O((V+E) \cdot \alpha(V))$ time, where $\alpha$ is the very slowly growing function discussed above. Because we assume that $G$ is connected, we have $\lvert E \rvert \ge \lvert V \rvert - 1$, and so the disjoint-set operations take $O(E \cdot \alpha(V))$ time. Moreover, since $\alpha(\lvert V \rvert) = O(log_2\; V) = O(log_2\; E)$, the total running time of Kruskal’s algorithm is $O(E \cdot log_2\; E)$. [1]

References

[1] T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein. Introduction to Algorithms, 3rd edition. MIT Press. https://www.amazon.com/Introduction-Algorithms-3rd-MIT-Press/dp/0262033844
