
Huffman Coding Algorithm: Minimizing Binary Code Length, Assignments of Algorithms and Programming

Huffman coding is a method for generating binary codes for a set of characters based on their frequencies; the goal is to find a code tree that minimizes the total length of the encoded messages. The notes discuss an inefficient dynamic programming solution and an alternative greedy approach, and present the Huffman algorithm in detail, including its implementation using a priority queue.

Typology: Assignments

Pre 2010

Uploaded on 03/16/2009

koofers-user-msw 🇺🇸

Partial preview of the text

Chapter III
Greedy Algorithms

We consider algorithms for optimization problems that make greedy choices according to some local criterion. The resulting algorithms are often simple and fast. However, they rarely give an optimal solution. We consider several examples in which a greedy approach gives an optimal solution, and also several in which it does not. In the latter case, we are still interested in how good the obtained solution is.

III.1 Huffman Codes

We consider the problem of generating a binary code for a set of characters C with frequencies f : C → N. Specifically, we want to determine for each c ∈ C a corresponding binary string, called its codeword and denoted w(c), such that a string of characters in C is encoded by simply concatenating the codewords of its characters (preserving the order of the characters in the message, of course). Let length(x) denote the number of bits in x. If c appears f(c) times in the string, then the total length of the encoded message is (overloading the notation)

    length(C, f) = ∑_{c ∈ C} f(c) · length(w(c)).

The goal is to determine w so that length(C, f) is minimized. We concentrate on prefix codes, that is, codes in which no w(x) is a prefix of another w(y). Under this constraint, a codeword assignment can be viewed as a binary tree (0 to the left, 1 to the right) with the characters as leaf labels. We can assume the tree is full (every internal node has two children): otherwise a node with a single child could be bypassed, which would only shorten the encoding. In summary, we want to find a tree T that minimizes

    B(T, C, f) = ∑_{c ∈ C} f(c) · depth(c, T),

where depth(c, T) denotes the depth in T of the leaf labeled with c.

To solve this optimization problem, we first identify a way to reduce the problem to a smaller instance. Consider an optimal tree T for C, f.
One possibility is to observe that the left and right subtrees must be optimal for the subsets of characters in those subtrees. A dynamic programming solution would then have to look at all possible ways of splitting C into two subsets, and so in the end it would have to construct a table of optimal trees for all possible subsets of C. This is quite inefficient.

An alternative is to look at two siblings x, y in T, remove them, and label the new leaf (the old parent of x and y) with a new character z. Let C′ = C − {x, y} ∪ {z} be a new alphabet, let f′(z) = f(x) + f(y) and f′(c) = f(c) for c ∈ C′, c ≠ z, be the frequencies, and let T′ be the modified tree.

[Figure: the tree T with sibling leaves x, y, and the tree T′ in which their parent becomes a leaf labeled z.]

Claim 1. T′ is optimal for C′, f′.

Proof. First note that B(T, C, f) = B(T′, C′, f′) + f(x) + f(y). Suppose T′ is not optimal for C′, f′, and let T′′ be an optimal tree for C′, f′. That is, B(T′′, C′, f′) < B(T′, C′, f′). In T′′ substitute z with an internal node with two children x and y; let T′′′ be the resulting tree.

[Figure: the tree T′′ and the tree T′′′ obtained by expanding the leaf z into a node with children x and y.]

As above, B(T′′′, C, f) = B(T′′, C′, f′) + f(x) + f(y). Thus

    B(T′′′, C, f) = B(T′′, C′, f′) + f(x) + f(y) < B(T′, C′, f′) + f(x) + f(y) = B(T, C, f),

which is a contradiction, since T was supposed to be optimal for C, f.

This leads to an approach using dynamic programming: for every possible pair x, y, substitute the pair with a new character z of frequency f(x) + f(y) and solve the reduced problem; the solution is the best over all pairs. This approach is also not very good, as there will again be too many different subproblems to solve. Here being greedy comes to the rescue: we can take two characters with lowest frequencies as x and y. There is no need to try all pairs.
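The greedy rule above (repeatedly merge the two lowest-frequency characters) is naturally implemented with a priority queue. The following Python sketch is our own illustration, not taken from the text: it builds the tree with heapq and then reads off the codewords.

```python
import heapq
from itertools import count

def huffman(freq):
    """Build an optimal prefix code for {char: frequency} using a min-heap.

    Returns {char: codeword}. Assumes at least two characters.
    """
    tiebreak = count()  # makes heap entries comparable when frequencies tie
    # Heap entries: (frequency, tiebreak, tree); a tree is a leaf char or a pair.
    heap = [(f, next(tiebreak), c) for c, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        fx, _, x = heapq.heappop(heap)  # the two lowest-frequency subtrees
        fy, _, y = heapq.heappop(heap)
        heapq.heappush(heap, (fx + fy, next(tiebreak), (x, y)))
    _, _, tree = heap[0]

    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):   # internal node: 0 to the left, 1 to the right
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                         # leaf: record its codeword
            codes[node] = prefix
    walk(tree, "")
    return codes
```

For example, with f = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}, the total ∑ f(c)·length(w(c)) is 224, which is optimal for these frequencies regardless of how ties are broken.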
III.2 Minimum Spanning Trees

For technical reasons we assume that all weights are different; there is no loss of generality in assuming this, since it can be enforced using, for example, lexicographic ordering.¹ Recall that a tree is a connected acyclic graph.² A tree is spanning in a graph G = (V, E) if its vertex set is V (it includes all vertices). An important observation that we use is the following:

Observation 5. Let T be a spanning tree of G = (V, E) and let e be an edge not in T. Then Te = T ∪ {e}³ has a unique cycle C, and removing any edge of C from Te results in a spanning tree of G.

The weight of a subgraph G′ of G is the sum of the weights of its edges. We are interested in finding a spanning tree of G with minimum weight over all spanning trees of G; we call this a minimum spanning tree (MST), though "minimum weight spanning tree" would be more appropriate. A forest is an acyclic graph, hence a collection of trees (each connected component is a tree).

We consider a generic MST algorithm that maintains a spanning forest F and adds edges incrementally (starting with a forest without edges). The invariant is that F is a subgraph of an MST (the algorithm and its correctness actually show that there is a unique MST).

Definition. An edge not in the forest is said to be useless if both of its vertices are in the same connected component of F. An edge that is not useless is said to be safe for a component C of F if its weight is smallest among all the edges with one vertex in C (the other vertex necessarily being in another component, since the edge is not useless).

The choice of edges is based on the following lemma.

Lemma 6. Let F be a forest in G that is contained in some MST.

¹ Assume that the vertices are indexed from 1 to n: V = {v1, v2, ..., vn}, so each edge is a pair {vi, vj}. Let w : E → R be a weight assignment, possibly with equal weights. We define a weight assignment w′ : E → R with no equal weights: for e = {vi, vj} with i < j, let w′(e) = w(e) + i·ε + j·ε², where ε > 0 is an arbitrarily small number. We claim that, given w, there is an ε₀ > 0 such that for any ε with 0 < ε < ε₀ no two weights w′ are equal. It follows that the ordering is independent of the choice of ε. Furthermore, the ordering can be evaluated from the indices alone, without computing w′ for a particular ε. More precisely, the claim is:

Claim 4. There is an ε₀ > 0 such that, for any ε with 0 < ε < ε₀ and any edges e1 = {i1, j1} and e2 = {i2, j2} with i1 < j1 and i2 < j2, w′(e1) < w′(e2) iff either (i) w(e1) < w(e2), or (ii) w(e1) = w(e2) and i1 < i2, or (iii) w(e1) = w(e2), i1 = i2, and j1 < j2.

Proof. Let ∆ be the minimum of 1 and of |w(e) − w(e′)| over all e, e′ with w(e) ≠ w(e′), let ε₀ = ∆/2n², and let ε < ε₀. If w(e1) < w(e2), then w′(e1) < w′(e2), because |(i1 − i2)ε + (j1 − j2)ε²| ≤ 2nε < ∆. If w(e1) = w(e2) and i1 < i2, then w′(e1) < w′(e2), because |(j1 − j2)ε²| ≤ nε² < ε ≤ |(i1 − i2)ε| (note that nε < nε₀ = ∆/2n ≤ 1/2, since ∆ ≤ 1), the last quantity being the minimum possible value of |(i1 − i2)ε| when i1 ≠ i2. If w(e1) = w(e2), i1 = i2, and j1 < j2, then clearly w′(e1) < w′(e2). The ordering by the indices that appears in (ii) and (iii) is called lexicographic ordering.

² A graph is connected if any two vertices are connected by a path in the graph; it is acyclic if it contains no cycle. The following are equivalent characterizations: a tree is (a) a connected acyclic graph; (b) a connected graph with at most |V| − 1 edges; (c) a minimal connected graph (it is connected, and removing any edge makes it disconnected); (d) an acyclic graph with at least |V| − 1 edges; (e) a maximal acyclic graph (it is acyclic, and adding any edge creates a cycle).

³ Frequently, we abuse notation to make writing easier. Strictly, T is a pair (V, F) of vertices and edges, and we are adding e to the set of edges, but that is too messy to write, so T ∪ {e} will do.
Then any MST of G that contains F also contains all the safe edges.

Proof. Let T be an MST of G that contains F, and, for the sake of contradiction, let e = {u, v} be a safe edge for a component K of F that is not in T. By the observation above, Te = T ∪ {e} has a cycle Ce that contains e. Say u is in K and v is in another component K′. Follow Ce starting at u, first crossing e to v (thus leaving K), and then continuing along Ce until we are back at u. At some point we re-enter K using some edge e′ with one vertex in K and the other not in K. Since e is safe for K and all weights are distinct, w(e′) > w(e). We also know that T′ = T ∪ {e} − {e′} is a spanning tree, and w(T′) = w(T) − w(e′) + w(e) < w(T). This is a contradiction, since T was supposed to be an MST. So any MST that contains F contains all its safe edges.

[Figure: the cycle Ce in Te, with the safe edge e leaving component K and the edge e′ re-entering K.]

The proof actually shows that (under our assumption that all edge weights are different) the MST is unique. The lemma provides a greedy choice rule for our generic algorithm. Several variants are possible depending on which safe edges are added (the lemma says that all of them can be added, but one may choose not to, so that F can be updated more easily).

Algorithm 1 (Jarnik). F consists of a single component, a tree T that grows until it becomes spanning, and its safe edge is added in each iteration. To identify the safe edge in every iteration, it is convenient to keep, for every vertex v not yet in T, the lightest (minimum weight) edge connecting T to v; the lightest among those edges is the safe edge of T. The outline is the following (the algorithm is named after Jarnik, its discoverer in the 1930s; it was subsequently rediscovered by Prim in '56 and by Dijkstra in '58):

MST-Jarnik(V, E, s)
    T ← ({s}, ∅)
    for i ← 1 to |V| − 1
        v ← ExtractMin(Q)
        add v and edge(v) to T
        for each {u, v} ∈ E
            if u ∉ T and key(u) > w({u, v})
                edge(u) ← {u, v}
                DecreaseKey(Q, u, w({u, v}))

Here edge(u) stores the lightest edge from T to u, and key(u) stores its weight.
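As a concrete sketch, here is Jarnik's algorithm in Python (our own illustration, not from the text). Python's heapq has no DecreaseKey, so the sketch pushes a fresh heap entry whenever a key would decrease and discards stale entries when they are popped (lazy deletion); this keeps the same O(|E| log |V|) bound.

```python
import heapq

def mst_jarnik(n, edges, s=0):
    """Jarnik's (Prim's) MST algorithm on vertices 0..n-1.

    edges: list of (u, v, w) with distinct weights w.
    Returns (total_weight, tree_edges). Assumes the graph is connected.
    """
    adj = [[] for _ in range(n)]
    for u, v, w in edges:
        adj[u].append((w, v))
        adj[v].append((w, u))

    in_tree = [False] * n
    in_tree[s] = True
    tree, total = [], 0
    heap = [(w, s, v) for w, v in adj[s]]  # (weight, tree endpoint, outside endpoint)
    heapq.heapify(heap)
    while heap:
        w, u, v = heapq.heappop(heap)
        if in_tree[v]:
            continue  # stale entry: v was already added via a lighter edge
        in_tree[v] = True
        tree.append((u, v))
        total += w
        for w2, x in adj[v]:
            if not in_tree[x]:
                heapq.heappush(heap, (w2, v, x))
    return total, tree
```

On the graph with edges (0,1) of weight 1, (1,2) of weight 2, (2,3) of weight 3, (0,2) of weight 4, and (0,3) of weight 5, this returns a tree of total weight 6.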
DecreaseKey is a heap operation that decreases the value of a stored key. The initialization is as follows:

MST-Jarnik-Initialization(V, E, s)
    Q ← MakeHeap
    for each v ∈ V − {s}
        if {v, s} ∈ E
            edge(v) ← {v, s}
            key(v) ← w({v, s})
        else
            edge(v) ← Null
            key(v) ← ∞
        HeapInsert(Q, v)

The running time is determined by the DecreaseKey operation. Using ordinary heaps this operation takes O(log n) time, so the total running time is O(|E| log |V|). However, there are heaps that support DecreaseKey in amortized time O(1) (we'll see later what that means), in which case the total time is O(|E| + |V| log |V|) (ExtractMin still needs O(log n) time).

Algorithm 2 (Kruskal). This algorithm sorts the edges in increasing weight order and considers them in that order; if the edge under consideration is safe, it is added to F. To check that an edge is safe, it suffices to check that it is not useless: since the edges are considered in increasing weight order, the algorithm can never consider an edge that connects two different components yet is not safe for either of them (the safe edge of a component has the smallest weight among the edges leaving that component). To implement this algorithm we use a so-called union-find data structure. It maintains a collection of disjoint sets and supports three operations: MakeSet(u) makes a new set containing u, Find(u) returns the identifier of the set containing u, and Union(u, v) unites the two sets containing u and v. The algorithm is as follows:

MST-Kruskal(V, E)
    sort E in increasing order of w
    F ← ∅
    for each v ∈ V
        MakeSet(v)
    for i ← 1 to |E|
        {u, v} ← i-th lightest edge in E
        if Find(u) ≠ Find(v)
            Union(u, v)
            add {u, v} to F

The running time is dominated by the sorting and so is O(|E| log |V|).

III.3 Knapsack Problem

Pending ...
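Returning to Kruskal's algorithm from Section III.2, here is a Python sketch (our own illustration, not from the text). The union-find structure uses path compression and union by rank, two standard heuristics the notes do not cover.

```python
def mst_kruskal(n, edges):
    """Kruskal's MST algorithm on vertices 0..n-1.

    edges: list of (w, u, v) with distinct weights w.
    Returns (total_weight, forest_edges).
    """
    # Union-find: parent pointers with path compression, union by rank.
    parent = list(range(n))
    rank = [0] * n

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path compression (halving)
            u = parent[u]
        return u

    def union(u, v):
        ru, rv = find(u), find(v)
        if rank[ru] < rank[rv]:
            ru, rv = rv, ru
        parent[rv] = ru
        if rank[ru] == rank[rv]:
            rank[ru] += 1

    forest, total = [], 0
    for w, u, v in sorted(edges):      # consider edges in increasing weight order
        if find(u) != find(v):         # the edge is not useless, hence safe
            union(u, v)
            forest.append((u, v))
            total += w
    return total, forest
```

On the same example graph as before (weights 1, 2, 3, 4, 5), it selects the same three edges of total weight 6, as it must, since the MST is unique when all weights are distinct.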