Math - Recursion, Divide and Conquer, from Merge Sort to MapReduce (Notes)

2020-09-06

Teacher Huang Shen’s title is really great, can’t find a better title to describe what I learned today. Haha~

Divide and Conquer Thinking in Merge Sort

Problem: Sort a pile of chaotic unordered numbers according to rules from small to large or large to small

Ordered Case

Try merging ordered arrays {1, 2, 5, 8} and {3, 4, 6}.

Disordered Case

Try to continuously simplify the problem, that is, continuously simplify the sequence until only 1 number remains. 1 number itself is ordered.

Simplify a sequence of length n to length n-1 each time, until length 1. However, this approach has no parallelism, requiring n-1 merge operations, but efficiency will be very low.

Introduce Divide and Conquer Thinking

Divide and Conquer. Its idea is to decompose a complex problem into two or more subproblems of the same or similar scale, then further subdivide these subproblems until the final subproblems become very simple and easy to solve, thus solving this complex problem.

Sorting one array

Merging two sorted arrays

The most important idea is how to decompose the problem

Different stages of merge sort

Use Recursion to Implement the Above Approach

      // Recursively split array
      function merge_sort(to_sort) {
        // Invalid data, return [] directly
        if (!to_sort) return [];
        
        // If decomposed to only one number, return that number
        if (to_sort.length == 1) return to_sort;
        
        // Split array into left and right halves
        let mid = to_sort.length / 2;

        // splice in js operates on original array content,
        // After taking out the first half, the second half can directly use the original array variable reference
        let left = [].concat(to_sort.splice(0,mid))
        let right = [].concat(to_sort)

        // Nested calls, sort both halves separately
        left = merge_sort(left);
        right = merge_sort(right);
        
        // Merge sorted halves
        let merged = merge(left, right);
        
        return merged;
      }


// Array merge sort
function merge(a, b) {
    if (!a) a = [];
    if (!b) b = [];
    
    // Will push results to this array later
    let merged_one = []
    
    // Indexes for a, b arrays
    let ai = 0;
    let bi = 0;
    
    // Alternately take smaller values from two arrays, put into merged array
    while (ai < a.length && bi < b.length) {
     if (a[ai] <= b[bi]) {
        merged_one.push(a[ai])
      ai ++;
     } else {
        merged_one.push(b[bi])
      bi ++;
     }
    }
    
    // Put remaining numbers from one array into merged array
    if (ai < a.length) {
     for (let i = ai; i < a.length; i++) {
      merged_one.push(a[i])
     }
    } else {
     for (let i = bi; i < b.length; i++) {
      merged_one.push(b[i])
     }
    }
    
    return merged_one;
   }


let arr = [2,5,3,1,4,6,7,8,9]

console.log('Sort result',merge_sort(arr))
// Sort result [ 1, 2, 3, 4, 5, 6, 7, 8, 9 ]

Divide and Conquer Thinking in Distributed Systems

When the array to be sorted is very large (e.g., reaching 1024GB), we cannot fit all this data into the memory of an ordinary machine. What to do? One approach is to decompose this super large dataset into multiple smaller datasets (e.g., 16GB or smaller), then distribute them to multiple machines for parallel processing.

MapReduce Architecture

Three Steps Using Divide and Conquer Thinking

Data Splitting and Mapping

Refers to splitting the data source and sending fragments to Mappers. Mapping refers to Mappers storing content in hash structures according to key-value matching based on application requirements.

Reduction

Reduction refers to receiving a set of key-value pairs. If pairs have the same key content, merge their values. This is similar to the process of returning results after local recursive calls.

Combining

To improve the efficiency of the shuffle stage, we can choose to reduce key-value pairs sent to the reduction stage. Specifically, add a combining process between data mapping and shuffle, performing a local reduction first on each Mapper node. Then only send combined results to shuffle and reduction stages. This is similar to the process of returning results after local recursive calls.

Afterword

Recursion, splitting complex problems into simple problems.

Then preset all failure situations you can think of (contingency plans) and handle them.

A recursive algorithm emerges.

Article Link:

https://alili.tech/en/archive/zr4ve5abfzg/