Coding Series - Part 1 - Data Structure and Algorithms Foundation
Sorting Algorithms
Sorting is the process of arranging data in a specific order. The most common orders are numerical order (ascending or descending) and lexicographical order (dictionary order).
Besides time complexity and space complexity, we also need to consider stability: whether the relative order of equal elements is preserved. For example, when sorting students by grade while keeping students with the same grade in their original alphabetical order, stable sorting algorithms such as Merge Sort, Insertion Sort, and Bubble Sort are preferred.
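As a quick illustration, here is a minimal sketch using Python's built-in sorted(), which is stable (the student data is made up for demonstration):

students = [("Alice", 90), ("Bob", 85), ("Carol", 90), ("Dave", 85)]
# Within each grade, names keep their original alphabetical order
by_grade = sorted(students, key=lambda s: s[1])
print(by_grade)  # [('Bob', 85), ('Dave', 85), ('Alice', 90), ('Carol', 90)]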
Common Sorting Algorithms
Selection Sort
Selection sort works by repeatedly “selecting” the next-smallest element from the unsorted array and moving it to the front.
Time Complexity: The time complexity of selection sort is \(O(n^2)\) in all cases, even when the array is already sorted.
def selection_sort(arr):
    for i in range(len(arr)):
        # Find the smallest element in the unsorted portion arr[i:]
        min_idx = i
        for j in range(i+1, len(arr)):
            if arr[j] < arr[min_idx]:
                min_idx = j
        # Swap it into position i
        arr[i], arr[min_idx] = arr[min_idx], arr[i]
    return arr
Bubble Sort
Bubble sort works by repeatedly swapping the adjacent elements if they are in the wrong order.
Time Complexity: The time complexity of bubble sort is \(O(n^2)\) in the worst case and \(O(n)\) in the best case when the array is already sorted, provided the implementation stops early once a full pass makes no swaps (as below).
def bubble_sort(arr):
    n = len(arr)
    for i in range(n):
        swapped = False
        for j in range(0, n-i-1):
            if arr[j] > arr[j+1]:
                arr[j], arr[j+1] = arr[j+1], arr[j]
                swapped = True
        # No swaps in a full pass: already sorted (the O(n) best case)
        if not swapped:
            break
    return arr
Insertion Sort
Insertion sort works by inserting elements from an unsorted array into a sorted subsection of the array.
Time Complexity: The time complexity of insertion sort is \(O(n^2)\) in the worst case (sorted in reverse order), but \(O(n)\) in the best case when the array is already sorted.
def insertion_sort(arr):
    for i in range(1, len(arr)):
        key = arr[i]
        # Shift elements of the sorted prefix arr[:i] that are greater than key
        j = i - 1
        while j >= 0 and key < arr[j]:
            arr[j + 1] = arr[j]
            j -= 1
        arr[j + 1] = key
    return arr
Merge Sort
Merge sort works by dividing the array into two halves, sorting each half, and then merging the two halves.
Time Complexity: The time complexity of merge sort is \(O(n \log n)\) in all cases. The \(\log n\) comes from the depth of the divide step, and the \(n\) comes from the work done merging at each level. The space complexity is \(O(n)\) due to the additional arrays used for merging.
def merge_sort(arr):
    if len(arr) <= 1:
        return arr
    mid = len(arr) // 2
    left = merge_sort(arr[:mid])
    right = merge_sort(arr[mid:])
    return merge(left, right)

def merge(left, right):
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        # <= keeps equal elements in their original order, making the sort stable
        if left[i] <= right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    # One of the two halves may still have leftover elements
    result.extend(left[i:])
    result.extend(right[j:])
    return result
Quick Sort
Quick sort works by selecting a “pivot” element from the array and partitioning the other elements into two sub-arrays, according to whether they are less than or greater than the pivot.
Time Complexity: The time complexity of quick sort is \(O(n \log n)\) in the best and average cases and \(O(n^2)\) in the worst case. The space complexity is \(O(\log n)\) for an in-place partitioning implementation due to the recursive call stack; the list-comprehension version below is simpler to read but uses \(O(n)\) extra space.
def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    # Partition into elements less than, equal to, and greater than the pivot
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quick_sort(left) + middle + quick_sort(right)
Heap Sort
Heap sort works by building a heap from the array and then repeatedly extracting the root element. The classic version uses a max-heap; Python's heapq module, used below, provides a min-heap.
import heapq

def heap_sort(arr):
    # heapify builds a min-heap in O(n); each heappop costs O(log n)
    heapq.heapify(arr)
    return [heapq.heappop(arr) for _ in range(len(arr))]
Time Complexity: The time complexity of heap sort is \(O(n \log n)\) in all cases. The classic in-place variant has \(O(1)\) space complexity; note that the heapq version above instead builds a new result list, using \(O(n)\) extra space.
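The \(O(1)\) space bound refers to the classic in-place variant. Here is a minimal sketch of it (the name heap_sort_inplace and its helper are ours, not from a library):

def heap_sort_inplace(arr):
    # Restore the max-heap property for the subtree rooted at `root`,
    # considering only elements up to index `end`
    def sift_down(root, end):
        while 2 * root + 1 <= end:
            child = 2 * root + 1
            if child + 1 <= end and arr[child] < arr[child + 1]:
                child += 1  # Pick the larger of the two children
            if arr[root] < arr[child]:
                arr[root], arr[child] = arr[child], arr[root]
                root = child
            else:
                return
    n = len(arr)
    # Build a max-heap, bottom-up
    for start in range(n // 2 - 1, -1, -1):
        sift_down(start, n - 1)
    # Repeatedly swap the maximum to the end and shrink the heap
    for end in range(n - 1, 0, -1):
        arr[0], arr[end] = arr[end], arr[0]
        sift_down(0, end - 1)
    return arr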
Lambda Function
A lambda function is a small anonymous function defined with the lambda keyword. It can take any number of arguments but can contain only one expression.
When combined with the sorted() function, we can sort the data based on a specific condition.
lambda arguments: expression
Example: Given a list of strings, sort them first by length and then alphabetically.
strings = ["apple", "banana", "cherry", "date"]
sorted_strings = sorted(strings, key=lambda x: (len(x), x))
print(sorted_strings)
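# Output: ['date', 'apple', 'banana', 'cherry']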
Searching Algorithms
Searching is the process of finding a specific element in a data structure, such as an array, a linked list, a tree, or a graph. For arrays there are two basic approaches, linear search and binary search; for trees and graphs, traversal-based searches such as Breadth-First Search (BFS) and Depth-First Search (DFS) are used.
Linear Search
Linear search works by iterating through the data structure and comparing each element to the target value.
Time Complexity: The time complexity of linear search is \(O(n)\) in the worst case.
def linear_search(arr, target):
    for i, element in enumerate(arr):
        if element == target:
            return i
    return -1  # Target not found
Binary Search
Binary search requires the array to be sorted. It works by repeatedly dividing the search interval in half: if the value of the search key is less than the item in the middle of the interval, narrow the interval to the lower half; otherwise, narrow it to the upper half.
Time Complexity: The time complexity of binary search is \(O(\log n)\) in the worst and average cases, and \(O(1)\) in the best case, when the target happens to be the first middle element examined.
def binary_search(arr, target):
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1   # Target can only be in the upper half
        else:
            right = mid - 1  # Target can only be in the lower half
    return -1  # Target not found
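For example, binary_search([1, 3, 5, 7, 9], 7) returns 3.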
Breadth-First Search (BFS)
BFS works by exploring all the neighbors of a node before moving on to the next level of nodes. There are two things to consider:
- How do we traverse the tree?
- How do we keep track of which nodes have been visited, to avoid infinite loops? (A tree has no cycles, so this mainly matters for graphs.)
In the code snippet below, we use a queue to store the nodes to be visited. On each iteration, we dequeue a node, process it (here, printing its value), and enqueue its left and right children. We repeat this process until the queue is empty.
Time Complexity: The time complexity of BFS in a binary tree is \(O(n)\) in the worst case, where \(n\) is the number of nodes in the tree.
from collections import deque

def bfs(root):
    if root is None:
        return
    queue = deque([root])
    while queue:
        node = queue.popleft()
        print(node.val)  # Doing something with the node
        if node.left:
            queue.append(node.left)
        if node.right:
            queue.append(node.right)
Depth-First Search (DFS)
DFS works by exploring as far as possible along each branch before backtracking. As with BFS on a tree, no visited set is needed because a tree has no cycles. We use recursion to traverse the tree.
Time Complexity: The time complexity of DFS in a binary tree is \(O(n)\) in the worst case, where \(n\) is the number of nodes in the tree.
def dfs(node):
    if node is None:
        return
    print(node.val)  # Doing something with the node
    dfs(node.left)
    dfs(node.right)
Preorder Traversal: The root node is visited first, then the left subtree, and then the right subtree.
def preorder_traversal(node):
    if node is None:
        return
    print(node.val)  # Doing something with the node
    preorder_traversal(node.left)
    preorder_traversal(node.right)
Inorder Traversal: The left subtree is visited first, then the root node, and then the right subtree.
def inorder_traversal(node):
    if node is None:
        return
    inorder_traversal(node.left)
    print(node.val)  # Doing something with the node
    inorder_traversal(node.right)
Postorder Traversal: The left subtree is visited first, then the right subtree, and then the root node.
def postorder_traversal(node):
    if node is None:
        return
    postorder_traversal(node.left)
    postorder_traversal(node.right)
    print(node.val)  # Doing something with the node
Two-Pointer Technique
The two-pointer technique is a common algorithm pattern used to solve problems involving arrays or linked lists. It involves using two pointers to traverse the data structure and find a solution. It is often used to check if a linked list has a cycle or to find the middle element of a linked list.
Example 1: Two-sum problem - searching for a pair of numbers in a sorted array that add up to a given target.
def two_sum(arr, target):
    # Assumes arr is sorted in ascending order
    left, right = 0, len(arr) - 1
    while left < right:
        if arr[left] + arr[right] == target:
            return [left, right]
        elif arr[left] + arr[right] < target:
            left += 1   # Sum too small: move the left pointer up
        else:
            right -= 1  # Sum too large: move the right pointer down
    return []  # No pair found
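For example, two_sum([1, 2, 3, 4, 6], 6) returns [1, 3], since arr[1] + arr[3] == 2 + 4 == 6.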
Example 2: Checking if a linked list has a cycle.
def has_cycle(head):
    # Floyd's cycle detection: the fast pointer advances two steps per
    # iteration, the slow pointer one; they meet only if there is a cycle
    slow, fast = head, head
    while fast and fast.next:
        slow = slow.next
        fast = fast.next.next
        if slow == fast:
            return True
    return False
Example 3: Finding the middle element of a linked list.
def find_middle(head):
    # When the fast pointer reaches the end, the slow pointer is at the middle
    slow, fast = head, head
    while fast and fast.next:
        slow = slow.next
        fast = fast.next.next
    return slow
Dynamic Programming
Dynamic programming is a method for solving a complex problem by breaking it down into simpler overlapping subproblems and reusing their solutions instead of recomputing them.
Fibonacci Sequence
The Fibonacci sequence is a series of numbers where each number is the sum of the two preceding ones, starting from 0 and 1.
from functools import lru_cache

@lru_cache(maxsize=None)
def fibonacci(n):
    # Memoization turns the naive O(2^n) recursion into top-down DP, O(n)
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
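For example, fibonacci(10) returns 55.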
Longest Common Subsequence (LCS)
The longest common subsequence (LCS) problem is to find the longest subsequence that is present in two given sequences in the same order.
def lcs(str1, str2):
    m = len(str1)
    n = len(str2)
    # dp[i][j] = length of the LCS of str1[:i] and str2[:j]
    dp = [[0] * (n+1) for _ in range(m+1)]
    for i in range(1, m+1):
        for j in range(1, n+1):
            if str1[i-1] == str2[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    return dp[m][n]
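For example, lcs("ABCBDAB", "BDCAB") returns 4, corresponding to the common subsequence "BCAB".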
Subsequence Pattern Matching
Given two strings str1 and str2, check if str1 is a subsequence of str2. The subsequence can be non-continuous, for example, “ace” is a subsequence of “abcde” but “aec” is not.
def is_subsequence(str1, str2):
    if len(str1) == 0:
        return True   # Every character of str1 has been matched
    if len(str2) == 0:
        return False  # str2 is exhausted but str1 is not
    if str1[0] == str2[0]:
        return is_subsequence(str1[1:], str2[1:])
    else:
        return is_subsequence(str1, str2[1:])
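For example, is_subsequence("ace", "abcde") returns True, while is_subsequence("aec", "abcde") returns False.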
Data Structures
Array
🔹 Definition: A collection of elements stored at contiguous memory locations.
🔹 Operations: Accessing (O(1)), Searching (O(n), or O(log n) with binary search on a sorted array), Insertion/Deletion (O(n))
arr = [1, 2, 3, 4, 5]
arr[0] # O(1)
arr.index(3) # O(n)
arr.append(6) # O(1)
arr.insert(0, 0) # O(n)
arr.remove(3) # O(n)
📌 Common Questions
- Two Sum → Find two numbers in an array that add up to a target sum. (Two-Pointer or HashMap; see the HashMap sketch after this list)
- Best Time to Buy and Sell Stock → Find the maximum profit from stock prices. (Kadane’s Algorithm)
- Subarray Sum Equals K → Find a contiguous subarray whose sum equals K. (Prefix Sum + HashMap)
- Find Duplicate in an Array → Find the duplicate element in an array of N+1 numbers. (Floyd’s Cycle Detection)
- Merge Intervals → Given overlapping intervals, merge them. (Sorting + Greedy Approach)
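The two-pointer solution to Two Sum appeared earlier in this post. As referenced in the list above, here is a minimal sketch of the HashMap approach (the name two_sum_hashmap is our own), which, unlike the two-pointer version, does not require the array to be sorted:

def two_sum_hashmap(arr, target):
    seen = {}  # Maps value -> index where it was seen
    for i, x in enumerate(arr):
        if target - x in seen:
            return [seen[target - x], i]
        seen[x] = i
    return []  # No pair found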
Linked List
🔹 Definition: A linear data structure where elements (nodes) are linked using pointers.
🔹 Types: Singly, Doubly, Circular
🔹 Operations: Traversal (O(n)), Insertion/Deletion (O(1) at head), Searching (O(n))
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None

# Creating a linked list
head = Node(1)
head.next = Node(2)
head.next.next = Node(3)

# Inserting a node at the head
new_head = Node(0)
new_head.next = head
head = new_head

# Deleting a node (here, the node after the head)
head.next = head.next.next

# Searching for a node
current = head
target = 3
while current:
    if current.value == target:
        print("Found")
        break
    current = current.next
Tree
🔹 Definition: A hierarchical, non-linear data structure; in a binary tree, each node has at most two children (left and right).
🔹 Types: Binary Search Tree (BST), AVL Tree, Heap, Trie
🔹 Operations: Traversal (O(n)), Insertion/Deletion/Searching (O(log n) in a balanced BST, O(n) in the worst case)
class TreeNode:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

# Creating a binary search tree
root = TreeNode(5)
root.left = TreeNode(3)
root.right = TreeNode(7)

# Inserting a node
root.left.left = TreeNode(1)

# Deleting a node (here, dropping the whole right subtree)
root.right = None

# Searching for a node
def dfs(node, target):
    if node is None:
        return None
    if node.value == target:
        return node
    return dfs(node.left, target) or dfs(node.right, target)

dfs(root, 3)
Trie
Source: https://www.geeksforgeeks.org/trie-insert-and-search/
What is a Trie? In short, a Trie is a k-ary tree, where k is the largest number of children a node can have (the alphabet size).
Because of its structure, a Trie is an efficient information retrieval data structure: search complexity can be brought to its optimal limit, the key length. If we store keys in a binary search tree, a well-balanced BST needs time proportional to \(M \log N\), where \(M\) is the maximum string length and \(N\) is the number of keys in the tree. Using a Trie, we can search for a key in \(O(M)\) time. The penalty is the Trie's storage requirements.
The following code is a simple Trie for the lowercase English alphabet.
class TrieNode:
    # Trie node class
    def __init__(self):
        self.children = [None]*26
        # isEndOfWord is True if the node represents the end of a word
        self.isEndOfWord = False
To insert a key into the Trie:
- Starting from the root, check whether the current node has a child corresponding to the first character of the key. If yes, move to that child and continue with the next character.
- If no, create a new node and link it to the current node.
- Repeat until the last character of the key has been processed, then mark the final node as the end of the key.
To search for a key in the Trie:
- Starting from the root, check whether the current node has a child corresponding to the first character of the key. If yes, move to that child and continue with the next character.
- If no, return False.
- Repeat until the last character of the key has been processed, then return True if the final node is marked as the end of the key.
def chartoIndex(ch):
    # Private helper function
    # Converts the current character of the key into an index;
    # assumes only lowercase 'a' through 'z'
    return ord(ch) - ord('a')

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key):
        cur = self.root
        for level in range(len(key)):
            index = chartoIndex(key[level])
            # Create a child node if the path does not exist yet
            if not cur.children[index]:
                cur.children[index] = TrieNode()
            cur = cur.children[index]
        cur.isEndOfWord = True

    def search(self, key):
        cur = self.root
        for level in range(len(key)):
            index = chartoIndex(key[level])
            if not cur.children[index]:
                return False
            cur = cur.children[index]
        # The key exists only if the final node is marked as the end of a word
        return cur.isEndOfWord
Another way to implement a Trie is with a dictionary, which is the style preferred on LeetCode.
class TrieNode:
    def __init__(self):
        self.children = dict()
        self.isEndOfWord = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        cur = self.root
        for ch in word:
            if ch not in cur.children:
                cur.children[ch] = TrieNode()
            cur = cur.children[ch]
        cur.isEndOfWord = True

    def search(self, word):
        cur = self.root
        for ch in word:
            if ch not in cur.children:
                return False
            cur = cur.children[ch]
        return cur.isEndOfWord
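For example:

trie = Trie()
trie.insert("apple")
print(trie.search("apple"))  # True
print(trie.search("app"))    # False: "app" is only a prefix, not an inserted word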