Data Structures and Their ADTs
In IT5003 I am studying data structures and algorithms, with Python as the teaching language, so I take notes on what I learn and how to use it in Python. I structure my notes around data structures, and all the ADTs implemented by each data structure are discussed within its section. I focus on usage, key design, and sometimes problems. Maybe in the future I will append all the implementation code.
- ADT: Abstract data type, from the point of view of a user, defined by its behavior.
- Data Structure: from the point of view of an implementer, concrete representations of data.
Linear Data Structures
Array:
Contiguous space storing a fixed number of elements.
I often confuse arrays with lists... To clarify: an array is a data structure, one of the implementations of the List ADT.
Key design:
Methods:
- ==Get ith==: O(1) obviously at A[i]
- ==Insert(i)==: O(n) worst/average, because everything behind position i must shift right to make room; O(1) if inserting at the tail
- ==Remove(i)==: similar to insert, except everything behind shifts left to close the gap.
- ==Search(v)==: still similar. O(n) worst/average (found at the tail, or not at all); O(1) if found immediately at the head
Usage: implements the List ADT. Stores ordered items contiguously; we can insert into a specific position, get the ith element, and of course also search and remove.
Problem: insert and remove are O(n), because everything behind the position must shift to open/close the gap. It also requires consecutive, fixed-size space (the fixed size can be overcome by resizing), which is sometimes not that flexible.
In Python

```python
# It is PYTHON that made me confuse the two!! because in Python, an array is just called a list...
```
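A small illustration of these costs with a Python list (which is a dynamic array under the hood; the values are my own toy example):

```python
A = [10, 20, 30, 40]
print(A[2])         # Get ith: O(1) -> 30
A.append(50)        # insert at the tail: O(1) amortized
A.insert(1, 15)     # insert at position 1: O(n), shifts 20..50 to the right
A.pop(1)            # remove position 1: O(n), shifts them back to the left
print(A.index(40))  # search: O(n) -> 3
```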
Linked List
Non-contiguous space, not fixed-size. Each vertex contains a value and a pointer to the next vertex.
Usage: implements the List ADT again. Compared with an array, it doesn't need consecutive, fixed-size space, and it can insert/remove at the head very efficiently, in O(1). But most importantly, it can implement the Stack/Queue/Deque ADTs, because it is resizable.
Key design: it's enough to clarify its attributes and methods:
- Attributes:
- Value
- Next -- Singly LL
- (Previous -- Doubly LL)
- Methods:
- ==Get ith==: unfortunately this becomes O(n), compared with an array's O(1), because we must walk from the head pointer.
- ==Insert==: inserting at the head only needs O(1) (at the tail also O(1) if we keep a tail pointer); but on average it is O(n), because we must walk from the head :(
- ==Remove==: similar to insert, except removing the tail of a singly linked list is O(n) (we must search for the previous node); a doubly linked list makes it O(1).
- ==Search(v)==: O(1) best and O(n) worst/average, same as an array
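A minimal singly-linked-list sketch of these attributes and methods (my own illustration, not course code):

```python
class Vertex:
    def __init__(self, value):
        self.value = value
        self.next = None            # singly linked: only a Next pointer

class SinglyLinkedList:
    def __init__(self):
        self.head = None

    def insert_head(self, value):   # O(1): the new vertex points at the old head
        v = Vertex(value)
        v.next = self.head
        self.head = v

    def remove_head(self):          # O(1): the head moves to the next vertex
        if self.head:
            self.head = self.head.next

    def get(self, i):               # O(n): must walk from the head pointer
        cur = self.head
        for _ in range(i):
            cur = cur.next
        return cur.value

ll = SinglyLinkedList()
for x in [3, 2, 1]:
    ll.insert_head(x)
print(ll.get(0), ll.get(2))         # 1 3
```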
Stack ADT
Like a stack of books.
Usage: LIFO (last-in-first-out), so both push and pop happen at the same end, the head. (Errrr, influenced by Python, I always think of it as the tail.)
In Python

```python
# just use a list!! and regard the tail as the head.
```
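A quick sketch of list-as-stack, treating the tail of the list as the top:

```python
stack = []
stack.append(1)      # push, O(1) amortized
stack.append(2)
print(stack[-1])     # peek the top -> 2
print(stack.pop())   # pop, O(1) -> 2 (LIFO)
```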
Queue ADT
Like a queue in real life.
Usage: FIFO (first-in-first-out), so push at the tail and pop from the head. It is perfectly implemented by a linked list (with a tail pointer), making both operations O(1).
In Python

```python
from collections import deque
```
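A quick usage sketch of deque as a queue:

```python
from collections import deque

q = deque()
q.append(1)          # enqueue at the tail, O(1)
q.append(2)
print(q.popleft())   # dequeue from the head, O(1) -> 1 (FIFO)
```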
Deque ADT
A queue that is open at both ends.
Usage: double-ended queue. Both pop and push work at both the head and the tail. Perfectly implemented by a doubly linked list, with all operations O(1).
In Python: see Queue; the same collections.deque supports both ends.
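A tiny sketch of deque working at both ends:

```python
from collections import deque

d = deque([2])
d.appendleft(1)              # push at the head, O(1)
d.append(3)                  # push at the tail, O(1)
print(d.popleft(), d.pop())  # pop from both ends -> 1 3
```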
Non-Linear Data Structures
Binary Heap
Complete Binary Tree: every level fully filled, except possibly the last, which is filled as far left as possible
+ Binary Max Heap property: each parent is greater than its children
Usage:
Used for implementing a Priority Queue ADT. When to use a PQ? When you need to maintain some ordered elements and frequently fetch the max/min one.
Key Design: a binary heap is firstly a binary tree, and then it is ruled by the (Max) heap property -- every parent node is greater than its 2 children. The direct way of implementing the binary heap is a Tree class, with attributes Value, LeftChild and RightChild. However, it is more convenient to simply use a (1-based) array. Just think of it this way:
- For a parent at index k, its left child is at k * 2, and its right child is at k * 2 + 1.
- For a child at index k, its parent is at k // 2.
- A condensed (complete) binary tree therefore maps into an equally condensed list (see the quick check below).
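A quick check of this index arithmetic on a 1-based array (the values are my own toy max heap):

```python
heap = [None, 90, 36, 17, 19, 25]  # index 0 unused, so indexing is 1-based

k = 2                     # the vertex holding 36
print(heap[k * 2])        # left child  -> 19
print(heap[k * 2 + 1])    # right child -> 25
print(heap[k // 2])       # parent      -> 90
```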
5 standard Binary (Max) Heap operations:
- ==Insert(v)== in O(log N): first insert v as the last leaf. Then bubble up: swap v with its parent while it is bigger, from the bottom toward the top.
- ==ExtractMax()== in O(log N): just take the root element; the important thing is what you should do after popping it -- you need to maintain the heap. So move the last leaf to the root, then bubble down -- i.e., swap the parent with its bigger child, from the top down, till all fine.
- (==Create(A)== - O(N log N) version: simply insert (that is, by calling Insert(v)) all N integers of the input array into an initially empty Binary Max Heap one by one.)
- ==Create(A)== - O(N) version (make a list of unsorted elements become a heap): achieved by bubbling down every internal vertex, from the last internal vertex back to the root (the leaves need no work).
- ==HeapSort()== - in O(N log N): Simply call the O(log N) ExtractMax() operation N times.
Extra: heapsort is not cache-friendly. The hardware predicts that you will read the array in sequence and caches the following elements, but heapsort keeps jumping between index k and 2k. Quicksort takes full advantage of this prefetching; mergesort benefits less (its passes are sequential, but it needs an extra array).
In Python
Implementation (don't use it; it's just for understanding):

```python
class binary_heap:  # an implementation that allows no duplicates
    ...
```
Use the library:

```python
from heapq import heappop, heappush, heapify
```
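Note that heapq is a min heap, so the usual trick for a max heap is to push negated keys. A minimal usage sketch (my own example values):

```python
from heapq import heappop, heappush, heapify

A = [5, 1, 4, 2]
heapify(A)            # Create(A) in O(N)
heappush(A, 3)        # Insert(v) in O(log N)
print(heappop(A))     # ExtractMin() in O(log N) -> 1

B = [-x for x in [5, 1, 4, 2]]  # negate keys to simulate a max heap
heapify(B)
print(-heappop(B))    # "ExtractMax()" -> 5
```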
Hash Table
Hashing is an algorithm (via a hash function) that maps large data sets of variable length into smaller integer data sets of a fixed length. A hash table is a data structure that maps keys to values, implementing the Table (a.k.a. Map) ADT.
Usage:
- You want a Table ADT in which you can insert and search in O(1) (and also delete).
- The keys don't actually have to be integers; they can be anything, even strings. And you can use (key, value) pairs to store anything additional in the value part.
Key design: use a hash function to map n keys into a table of length m; then search(v)/remove(v) cost O(n/m) (searching a chain of average length n/m). You must design both the hash function and the table length m.
- Hash function:
  - For integers: the modulo operation is the best practice. Just let h(v) = v % M. O(1).
  - For strings (normally strings contain only the 26 letters): treat the string as a base-26 number, so every English letter matters, as in this snippet:

```cpp
int hash_function(string v) {                    // assumption 1: v uses ['A'..'Z'] only
  int sum = 0;                                   // assumption 2: v is a short string
  for (auto& c : v)                              // for each character c in v
    sum = ((sum * 26) % M + (c - 'A' + 1)) % M;  // M is the table size
  return sum;
}
```

  - Never mod by an arbitrary constant; mod by the size of the table.
  - Never use a non-uniform function, like sqrt.
  - Never use randomness.
- Length m:
  - Based on experience, m should be a prime (to minimize collisions) and should be around n, so that O(n/m) stays a small constant.
- Solving collisions, i.e., two or more keys having the same hash value:
  - Separate Chaining (don't move): use a doubly linked list (or simply a Python list) per slot: if 2 keys land in the same position, just append the 2nd one at the end of that list.
  - Open Addressing (move away):
    - linear probing: scan forward linearly for another position
    - quadratic probing: search non-linearly for another position
    - double hashing: rehash for another position
  - People debate which is better, but my teacher Steven believes SC is better.
- With these 3 operations, it is a good Table ADT (implemented by SC):
  - ==Search(v)==: just check whether v is in the list A[h(v)]. O(1 + α), where α = n/m is the average length of each chain. If m is set properly, α is very small, so it is O(1).
  - ==Remove(v)==: also O(1 + α), because it needs to search first; then delete v from the list A[h(v)].
  - ==Insert(v)==: O(1). Just append v to the list A[h(v)].
Possible problems:
- The hash table always occupies a bit more space than the original elements, because it very likely contains empty positions.
- What if the keys are duplicated? By default, we don't allow duplicates, but I believe there are potential ways to deal with duplicated keys.
In Python

```python
# just use dictionary!!!
```

```python
M = 13  # table size, best to be prime
```
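For understanding, a minimal separate-chaining sketch (the helpers h/insert/search/remove are my own naming; in practice, just use dict):

```python
M = 13                              # table size, best to be prime
table = [[] for _ in range(M)]      # one chain (Python list) per slot

def h(v):                           # hash function for integer keys
    return v % M

def insert(v):                      # O(1): append to the chain at h(v)
    table[h(v)].append(v)

def search(v):                      # O(1 + alpha): scan the chain at h(v)
    return v in table[h(v)]

def remove(v):                      # O(1 + alpha): search the chain, then delete
    table[h(v)].remove(v)

insert(27); insert(14)              # 27 % 13 == 1 == 14 % 13 -> same chain
print(search(27), search(5))        # True False
remove(27)
```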
Binary Search Tree
A Binary Search Tree (BST) is a binary tree (each vertex has up to 2 children) that satisfies the BST property:
All vertices in the left subtree of a vertex must hold a value smaller than its own; all vertices in the right subtree of a vertex must hold a value larger than its own.
Usage: another way to implement the Table ADT. Compared with a hash table it is slightly slower, with all operations O(log n) when balanced. However, it can do something extra: it keeps the keys in the table ordered.
Key Design: a binary tree where, for every node, everything in its left subtree is smaller than it and everything in its right subtree is bigger (not only the direct children, but everything below).
Basically, a BST as an implementation of the Table ADT should also have the 3 operations (all O(h), which is O(log n) when the tree is balanced):
- ==Search(v)==
- ==Remove(v)==
- ==insert(v)==
Besides, it is excellent because it has other special operations:
- Query operations (the BST structure remains unchanged):
- Search(v): O(h), h is the height of the tree (h is log N on average, but can be N). Just binary search from root down.
- ==Predecessor(v)== (and similarly ==Successor(v)==), both also O(h). (details below)
- ==Inorder Traversal==, then elems are sorted. O(n)
- Update operations (the BST structure may likely change):
- Insert(v), O(h), find the place and add it.
- Remove(v), O(h). The worst case is when v has 2 children: first search(v) in O(h), then find its successor in another O(h), then replace v by its successor, and remember to delete the duplicated successor.
- ==Create BST==.
Successor:
- If v has a right subtree, it is the minimum of the right subtree.
- If there is no right subtree, traverse the ancestors until the first ancestor that is greater than v.
- Otherwise, none.
Predecessor:
- If v has a left subtree, it is the maximum of the left subtree.
- If not, it is the first ancestor that is smaller than v.
- Otherwise, none.
Question: how many structurally different BSTs can you form with n distinct elements? (The answer is the Catalan number: C(n) = (2n choose n) / (n + 1).)
In Python

```python
# No implemented library for binary trees
```

```python
class BSTVertex:
    ...
```
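A minimal BST sketch built on that vertex class, assuming integer keys and no duplicates (insert/search/inorder are my own illustrative helpers):

```python
class BSTVertex:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):                  # O(h): walk down to an empty spot
    if root is None:
        return BSTVertex(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def search(root, key):                  # O(h): binary search from the root
    if root is None or root.key == key:
        return root
    if key < root.key:
        return search(root.left, key)
    return search(root.right, key)

def inorder(root):                      # O(n): yields the keys in sorted order
    if root:
        yield from inorder(root.left)
        yield root.key
        yield from inorder(root.right)

root = None
for k in [5, 2, 8, 1, 3]:
    root = insert(root, k)
print(list(inorder(root)))              # [1, 2, 3, 5, 8]
print(search(root, 3) is not None)      # True
```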
Graph
Concepts
- Vertex
- Edge
- Path
- Directed/ Undirected
- Weighted/Unweighted
- Acyclic: no cycle.
- Connected
Special Graphs
- Tree: exactly one unique path between any 2 vertices. E = V - 1, always. Acyclic. The minimal set of edges that keeps a graph connected.
- Complete: E = V*(V-1)/2 edges (i.e., E = O(V²)). The densest graph.
- Bipartite: can be divided into 2 sets such that there is no edge between members of the same set.
- DAG: Directed Acyclic Graph.
Ways to store a graph
- Adjacency Matrix: rows and columns are both vertices. Space complexity O(V²)
  - Usage: frequently updating the weights of edges; checking the existence of edge (u, v). Both O(1).
  - Drawbacks: O(V) to fetch a vertex's neighbors; not good when there are many vertices, since the space complexity is high.
```python
AM = [[0 for i in range(V)] for j in range(V)]
for i in range(V):
    AM[i] = list(map(int, (fh.readline().strip().split())))
print("Neighbors of vertex 0:")
for j in range(V):  # O(|V|)
    if AM[0][j]:
        print('Edge 0-{:d} (weight = {:d})'.format(j, AM[0][j]))
```

- Adjacency List: one row per vertex; a (linked) list stores that vertex's neighbors. Space complexity O(V + E)
- Usage: the common default, and usually the fastest. Frequently fetching the neighbors of vertices: O(k), where k is the number of neighbors.
- Drawbacks: checking the existence of edge (u, v) requires scanning u's list, O(k).
```python
from collections import defaultdict

AL = defaultdict(list)  # initialize AL
for i in range(V):
    line = list(map(int, (fh.readline().strip().split())))
    total_neighbours = int(line[0])
    for j in range(1, len(line), 2):
        v, w = int(line[j]), int(line[j + 1])
        AL[i].append((v - 1, w))
print("Neighbors of vertex 0:")
for v_w in AL[0]:
    # AL[0] contains the required information
    print('Edge 0-{:d} (weight = {:d})'.format(v_w[0], v_w[1]))
```

- Edge List: one row per edge, storing triples (vertex1, vertex2, weight). Space complexity O(E)
- Usage: when you need to sort edges by their weights
- Drawbacks: must scan the whole list to fetch neighbors or to check the existence of edge (u, v). O(E)
```python
import heapq

E = int(fh.readline())
edge_list = []
for i in range(E):
    u, v, w = map(int, fh.readline().split())
    edge_list.append((w, u, v))
# build a heap
heapq.heapify(edge_list)
# edges sorted by weight (smallest -> largest)
for i in range(E):
    edge = heapq.heappop(edge_list)
    print('weight: {:d} ({:d}-{:d})'.format(edge[0], edge[1], edge[2]))
```
DFS & BFS
Both run in O(V + E) if the graph is stored as an Adjacency List, because we visit the k neighbors of each vertex in O(k).
DFS:
Recursion does the "visit children, then visit me" loop.
- Avoiding cycles: use an array `status[v]` of size V to remember whether a vertex has been visited.
- Remembering the path: use an array `parent[v]` of size V to remember the predecessor of each vertex.

```python
def printPath(u):                      # print the path from the source to u
    if p[u] == -1:                     # u is the source itself
        print("{}".format(u), end="")
        return
    printPath(p[u])                    # first print the path to u's predecessor
    print(" {}".format(u), end="")
```
BFS:
- Use a FIFO queue (a deque) to do the "visit all children" loop, level by level.
- Avoiding cycles: use an array `status[v]` of size V to remember whether a vertex has been visited.
- VisuAlgo said there is no need to track back? (BFS is iterative, so there is no recursion to unwind.)
Applications
- Reachability test: run DFS/BFS, then check `status[v]`.
- Actually printing the traversal path: run DFS/BFS, then backtrack through the `parent` array.
- Identifying/Counting/Labeling Connected Components (CCs) of undirected graphs: run DFS/BFS from every unvisited vertex, making use of the `status` array.
- Detecting if a graph is cyclic: modify DFS's `status` array to track three states:
  - unvisited: same as earlier, DFS has not reached vertex u before,
  - explored: partially visited. DFS has visited vertex u, but at least one neighbor of vertex u has not been visited yet (DFS will go depth-first to that neighbor first),
  - visited: now a stronger definition: all neighbors of vertex u have also been visited and DFS is about to backtrack from vertex u to vertex p[u].
  Then, if DFS encounters status[y] = explored, there is a cycle. (A sketch follows this list.)
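A sketch of this three-state cycle check on a directed graph (the UNVISITED/EXPLORED/VISITED names are my own constants):

```python
UNVISITED, EXPLORED, VISITED = 0, 1, 2

def has_cycle(al):                          # al: adjacency list of plain neighbors
    V = len(al)
    status = [UNVISITED] * V

    def dfs(u):
        status[u] = EXPLORED                # partially visited
        for v in al[u]:
            if status[v] == EXPLORED:       # reached an explored vertex -> cycle
                return True
            if status[v] == UNVISITED and dfs(v):
                return True
        status[u] = VISITED                 # all neighbors done: backtrack
        return False

    return any(status[u] == UNVISITED and dfs(u) for u in range(V))

print(has_cycle([[1], [2], [0]]))  # True  (0 -> 1 -> 2 -> 0)
print(has_cycle([[1], [2], []]))   # False
```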
Topological Sort (only on DAGs):
The DFS version requires just one additional line compared to normal DFS: it is basically a post-order traversal of the graph, reversed at the end.
```python
import sys
sys.setrecursionlimit(100000)

def dfs(i):
    visited[i] = True
    for j, w in al[i]:
        if visited[j]:
            continue
        dfs(j)
    topoOrder.append(i)  # the one extra line: post-order append to topoOrder

n, m = map(int, input().split())
vertices = list(map(int, input().split()))
al = [[] for _ in range(n)]
for _ in range(m):
    v1, v2, w = map(int, input().split())
    al[v2].append((v1, w))

# toposort
topoOrder = []
visited = [False for _ in range(n)]
for i in range(n):  # 0-based
    if not visited[i]:
        dfs(i)
topoOrder.reverse()
for i in topoOrder:
    for j, w in al[i]:
        vertices[j] += vertices[i] * w
print(*vertices)
```

The BFS version is based on the idea of vertices without incoming edges and is also called Kahn's algorithm.
```python
from collections import deque

in_degree = [0] * V
for u in range(V):
    for v in al[u]:              # al[u] holds u's neighbors
        in_degree[v] += 1        # count incoming edges
q = deque()                      # a queue of vertices with in-degree 0
for u in range(V):
    if in_degree[u] == 0:
        q.append(u)              # enqueue all v with in-degree 0
cnt = 0                          # count of visited vertices
topoOrder = []
while q:
    u = q.popleft()
    topoOrder.append(u)
    for v in al[u]:
        in_degree[v] -= 1        # u is removed, so v loses one incoming edge
        if in_degree[v] == 0:
            q.append(v)          # now v can be put into the topo order
    cnt += 1
if cnt < V:
    print('Cycle!')  # some vertices never reached in-degree 0: the remaining edges form a cycle
else:
    print(*topoOrder)
```
In Python
```python
# for dfs, define a function dfs(), and remember to raise the recursion limit
# with sys.setrecursionlimit
```

```python
# for bfs, remember to use deque and popleft()
```
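A minimal BFS sketch with the visited/parent bookkeeping, assuming an adjacency list al of plain neighbor indices (my own illustration):

```python
from collections import deque

def bfs(al, s):
    V = len(al)
    visited = [False] * V
    parent = [-1] * V              # predecessor on the traversal path
    visited[s] = True
    q = deque([s])
    while q:
        u = q.popleft()            # FIFO: pop from the head
        for v in al[u]:
            if not visited[v]:
                visited[v] = True
                parent[v] = u
                q.append(v)        # push at the tail
    return parent

al = [[1, 2], [0, 3], [0], [1]]
print(bfs(al, 0))                  # [-1, 0, 0, 1]
```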
SSSP (Single-Source Shortest Paths)
- BFS for unweighted or constant-weighted graphs
- Modified Dijkstra's algorithm for weighted graphs (it is just Uniform Cost Search). O((V+E) log V). It uses a Priority Queue (a binary heap) to maintain the search order, and "modified" refers to the lazy-deletion trick: leave outdated entries in the PQ, and only check validity when they are popped.
```python
from heapq import heappush, heappop
```
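A sketch of this modified Dijkstra under the adjacency-list convention above (AL[u] holds (v, w) pairs; the function name and the sample graph are my own):

```python
from heapq import heappush, heappop

def dijkstra(V, AL, s):
    dist = [float('inf')] * V
    dist[s] = 0
    pq = [(0, s)]                           # (distance estimate, vertex)
    while pq:
        d, u = heappop(pq)
        if d > dist[u]:                     # lazy deletion: skip outdated entries
            continue
        for v, w in AL[u]:
            if dist[u] + w < dist[v]:       # relax edge (u, v)
                dist[v] = dist[u] + w
                heappush(pq, (dist[v], v))  # old entries stay in the PQ
    return dist

AL = {0: [(1, 4), (2, 1)], 1: [(3, 1)], 2: [(1, 2)], 3: []}
print(dijkstra(4, AL, 0))                   # [0, 3, 1, 4]
```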