Data Structures and Their ADTs
In IT5003 I am studying data structures and algorithms, with Python as the teaching language, so I take notes on what I learn and how to use it in Python. I structure my notes around data structures, and all the ADTs implemented by each data structure are discussed within its section. I focus on usage, key design, and sometimes problems. Maybe in the future I will append all the implementation code.
- ADT: Abstract data type, from the point of view of a user, defined by its behavior.
- Data Structure: from the point of view of an implementer, concrete representations of data.
Linear Data Structures
Array:
Contiguous space storing a fixed number of elements.
I often confuse arrays with lists... To clarify: an array is a data structure, one of the implementations of the List ADT.
Key design:
Methods:
- ==Get ith==: O(1) obviously at A[i]
- ==Insert(i)==: O(n) worst/average, because everything behind position i must shift right to make room; O(1) if inserting at the tail
- ==Remove(i)==: similar to insert, except everything behind shifts left to close the gap.
- ==Search(v)==: still similar. O(n) worst/average (found at the tail, or not at all); O(1) if found immediately at the head
Usage: implements the List ADT. Stores ordered items contiguously; we can insert into a specific position, get the ith element, and of course also search and remove.
Problem: insert and remove are O(n), because everything behind the position must shift to open/close the gap. It also requires consecutive, fixed-size space (the fixed size can be overcome by resizing), which is sometimes not that flexible.
In Python

```python
# It is PYTHON that made me confuse the two!! because in Python, an array is just called a list...
```
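A small illustration of these costs with a Python list (which is a dynamic array under the hood; the values are my own toy example):

```python
A = [10, 20, 30, 40]
print(A[2])         # Get ith: O(1) -> 30
A.append(50)        # insert at the tail: O(1) amortized
A.insert(1, 15)     # insert at position 1: O(n), shifts 20..50 to the right
A.pop(1)            # remove position 1: O(n), shifts them back to the left
print(A.index(40))  # search: O(n) -> 3
```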
Linked List
Non-contiguous space, not fixed-size. Each vertex contains a value and a pointer to the next vertex.
Usage: implements the List ADT again. Compared with an array, it doesn't need consecutive, fixed-size space, and it can insert/remove at the head very efficiently, in O(1). But most importantly, it can implement the Stack/Queue/Deque ADTs, because it is resizable.
Key design: it's enough to clarify its attributes and methods:
- Attributes:
- Value
- Next -- Singly LL
- (Previous -- Doubly LL)
- Methods:
- ==Get ith==: unfortunately this becomes O(n), compared with an array's O(1), because we must walk from the head pointer.
- ==Insert==: inserting at the head only needs O(1) (at the tail also O(1) if we keep a tail pointer); but on average it is O(n), because we must walk from the head :(
- ==Remove==: similar to insert, except removing the tail of a singly linked list is O(n) (we must search for the previous node); a doubly linked list makes it O(1).
- ==Search(v)==: O(1) best and O(n) worst/average, same as an array
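A minimal singly-linked-list sketch of these attributes and methods (my own illustration, not course code):

```python
class Vertex:
    def __init__(self, value):
        self.value = value
        self.next = None            # singly linked: only a Next pointer

class SinglyLinkedList:
    def __init__(self):
        self.head = None

    def insert_head(self, value):   # O(1): the new vertex points at the old head
        v = Vertex(value)
        v.next = self.head
        self.head = v

    def remove_head(self):          # O(1): the head moves to the next vertex
        if self.head:
            self.head = self.head.next

    def get(self, i):               # O(n): must walk from the head pointer
        cur = self.head
        for _ in range(i):
            cur = cur.next
        return cur.value

ll = SinglyLinkedList()
for x in [3, 2, 1]:
    ll.insert_head(x)
print(ll.get(0), ll.get(2))         # 1 3
```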
Stack ADT
Like a stack of books.
Usage: LIFO (last-in-first-out), so both push and pop happen at the same end, the head. (Errrr, influenced by Python, I always think of it as the tail.)
In Python

```python
# just use a list!! and regard the tail as the head.
```
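A quick sketch of list-as-stack, treating the tail of the list as the top:

```python
stack = []
stack.append(1)      # push, O(1) amortized
stack.append(2)
print(stack[-1])     # peek the top -> 2
print(stack.pop())   # pop, O(1) -> 2 (LIFO)
```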
Queue ADT
Like a queue in real life.
Usage: FIFO (first-in-first-out), so push at the tail and pop from the head. It is perfectly implemented by a linked list (with a tail pointer), making both operations O(1).
In Python

```python
from collections import deque
```
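A quick usage sketch of deque as a queue:

```python
from collections import deque

q = deque()
q.append(1)          # enqueue at the tail, O(1)
q.append(2)
print(q.popleft())   # dequeue from the head, O(1) -> 1 (FIFO)
```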
Deque ADT
A queue that is open at both ends.
Usage: double-ended queue. Both pop and push work at both the head and the tail. Perfectly implemented by a doubly linked list, with all operations O(1).
In Python: see Queue; the same collections.deque supports both ends.
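A tiny sketch of deque working at both ends:

```python
from collections import deque

d = deque([2])
d.appendleft(1)              # push at the head, O(1)
d.append(3)                  # push at the tail, O(1)
print(d.popleft(), d.pop())  # pop from both ends -> 1 3
```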
Non-Linear Data Structures
Binary Heap
Complete Binary Tree: every level fully filled, except possibly the last, which is filled as far left as possible
+ Binary Max Heap property: each parent is greater than its children
Usage:
Used for implementing a Priority Queue ADT. When to use a PQ? When you need to maintain some ordered elements and frequently fetch the max/min one.
Key Design: a binary heap is firstly a binary tree, and then it is ruled by the (Max) heap property -- every parent node is greater than its 2 children. The direct way of implementing the binary heap is a Tree class, with attributes Value, LeftChild and RightChild. However, it is more convenient to simply use a (1-based) array. Just think of it this way:
- For a parent at index k, its left child is at k * 2, and its right child is at k * 2 + 1.
- For a child at index k, its parent is at k // 2.
- A condensed (complete) binary tree therefore maps into an equally condensed list (see the quick check below).
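A quick check of this index arithmetic on a 1-based array (the values are my own toy max heap):

```python
heap = [None, 90, 36, 17, 19, 25]  # index 0 unused, so indexing is 1-based

k = 2                     # the vertex holding 36
print(heap[k * 2])        # left child  -> 19
print(heap[k * 2 + 1])    # right child -> 25
print(heap[k // 2])       # parent      -> 90
```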
5 standard Binary (Max) Heap operations:
- ==Insert(v)== in O(log N): first insert v as the last leaf. Then bubble up: swap v with its parent while it is bigger, from the bottom toward the top.
- ==ExtractMax()== in O(log N): just take the root element; the important thing is what you should do after popping it -- you need to maintain the heap. So move the last leaf to the root, then bubble down -- i.e., swap the parent with its bigger child, from the top down, till all fine.
- (==Create(A)== - O(N log N) version: simply insert (that is, by calling Insert(v)) all N integers of the input array into an initially empty Binary Max Heap one by one.)
- ==Create(A)== - O(N) version (make a list of unsorted elements become a heap): achieved by bubbling down every internal vertex, from the last internal vertex back to the root (the leaves need no work).
- ==HeapSort()== - in O(N log N): Simply call the O(log N) ExtractMax() operation N times.
Extra: heapsort is not cache-friendly. The hardware predicts that you will read the array in sequence and caches the following elements, but heapsort keeps jumping between index k and 2k. Quicksort takes full advantage of this prefetching; mergesort benefits less (its passes are sequential, but it needs an extra array).
In Python
Implementation (don't use it; it's just for understanding):

```python
class binary_heap:  # an implementation that allows no duplicates
    ...
```
Use the library:

```python
from heapq import heappop, heappush, heapify
```
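Note that heapq is a min heap, so the usual trick for a max heap is to push negated keys. A minimal usage sketch (my own example values):

```python
from heapq import heappop, heappush, heapify

A = [5, 1, 4, 2]
heapify(A)            # Create(A) in O(N)
heappush(A, 3)        # Insert(v) in O(log N)
print(heappop(A))     # ExtractMin() in O(log N) -> 1

B = [-x for x in [5, 1, 4, 2]]  # negate keys to simulate a max heap
heapify(B)
print(-heappop(B))    # "ExtractMax()" -> 5
```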
Hash Table
Hashing is an algorithm (via a hash function) that maps large data sets of variable length into smaller integer data sets of a fixed length. A hash table is a data structure that maps keys to values, implementing the Table (a.k.a. Map) ADT.
Usage:
- You want a Table ADT in which you can insert and search in O(1) (and also delete).
- The keys don't actually have to be integers; they can be anything, even strings. And you can use (key, value) pairs to store anything additional in the value part.
Key design: use a hash function to map n keys into a table of length m; then search(v)/remove(v) cost O(n/m) (searching a chain of average length n/m). You must design both the hash function and the table length m.
- Hash function:
  - For integers: the modulo operation is the best practice. Just let h(v) = v % M. O(1).
  - For strings (normally strings contain only the 26 letters): treat the string as a base-26 number, so every English letter matters, as in this snippet:

```cpp
int hash_function(string v) {                    // assumption 1: v uses ['A'..'Z'] only
  int sum = 0;                                   // assumption 2: v is a short string
  for (auto& c : v)                              // for each character c in v
    sum = ((sum * 26) % M + (c - 'A' + 1)) % M;  // M is the table size
  return sum;
}
```

  - Never mod by an arbitrary constant; mod by the size of the table.
  - Never use a non-uniform function, like sqrt.
  - Never use randomness.
- Length m:
  - Based on experience, m should be a prime (to minimize collisions) and should be around n, so that O(n/m) stays a small constant.
- Solving collisions, i.e., two or more keys having the same hash value:
  - Separate Chaining (don't move): use a doubly linked list (or simply a Python list) per slot: if 2 keys land in the same position, just append the 2nd one at the end of that list.
  - Open Addressing (move away):
    - linear probing: scan forward linearly for another position
    - quadratic probing: search non-linearly for another position
    - double hashing: rehash for another position
  - People debate which is better, but my teacher Steven believes SC is better.
- With these 3 operations, it is a good Table ADT (implemented by SC):
  - ==Search(v)==: just check whether v is in the list A[h(v)]. O(1 + α), where α = n/m is the average length of each chain. If m is set properly, α is very small, so it is O(1).
  - ==Remove(v)==: also O(1 + α), because it needs to search first; then delete v from the list A[h(v)].
  - ==Insert(v)==: O(1). Just append v to the list A[h(v)].
Possible problems:
- The hash table always occupies a bit more space than the original elements, because it very likely contains empty positions.
- What if the keys are duplicated? By default, we don't allow duplicates, but I believe there are potential ways to deal with duplicated keys.
In Python

```python
# just use dictionary!!!
```

```python
M = 13  # table size, best to be prime
```
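For understanding, a minimal separate-chaining sketch (the helpers h/insert/search/remove are my own naming; in practice, just use dict):

```python
M = 13                              # table size, best to be prime
table = [[] for _ in range(M)]      # one chain (Python list) per slot

def h(v):                           # hash function for integer keys
    return v % M

def insert(v):                      # O(1): append to the chain at h(v)
    table[h(v)].append(v)

def search(v):                      # O(1 + alpha): scan the chain at h(v)
    return v in table[h(v)]

def remove(v):                      # O(1 + alpha): search the chain, then delete
    table[h(v)].remove(v)

insert(27); insert(14)              # 27 % 13 == 1 == 14 % 13 -> same chain
print(search(27), search(5))        # True False
remove(27)
```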
Binary Search Tree
A Binary Search Tree (BST) is a binary tree (each vertex has up to 2 children) that satisfies the BST property:
All vertices in the left subtree of a vertex must hold a value smaller than its own; all vertices in the right subtree of a vertex must hold a value larger than its own.
Usage: another way to implement the Table ADT. Compared with a hash table it is slightly slower, with all operations O(log n) when balanced. However, it can do something extra: it keeps the keys in the table ordered.
Key Design: a binary tree where, for every node, everything in its left subtree is smaller than it and everything in its right subtree is bigger (not only the direct children, but everything below).
Basically, a BST as an implementation of the Table ADT should also have the 3 operations (all O(h), which is O(log n) when the tree is balanced):
- ==Search(v)==
- ==Remove(v)==
- ==insert(v)==
Besides, it is excellent because it has other special operations:
- Query operations (the BST structure remains unchanged):
- Search(v): O(h), h is the height of the tree (h is log N on average, but can be N). Just binary search from root down.
- ==Predecessor(v)== (and similarly ==Successor(v)==), both also O(h). (details below)
- ==Inorder Traversal==, then elems are sorted. O(n)
- Update operations (the BST structure may likely change):
- Insert(v), O(h), find the place and add it.
- Remove(v), O(h). The worst case is when v has 2 children: first search(v) in O(h), then find its successor in another O(h), then replace v by its successor, and remember to delete the duplicated successor.
- ==Create BST==.
Successor:
- If v has a right subtree, it is the minimum of the right subtree.
- If there is no right subtree, traverse the ancestors until the first ancestor that is greater than v.
- Otherwise, none.
Predecessor:
- If v has a left subtree, it is the maximum of the left subtree.
- If not, it is the first ancestor that is smaller than v.
- Otherwise, none.
Question: how many structurally different BSTs can you form with n distinct elements? (The answer is the Catalan number: C(n) = (2n choose n) / (n + 1).)
In Python

```python
# No implemented library for binary trees
```

```python
class BSTVertex:
    ...
```
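A minimal BST sketch built on that vertex class, assuming integer keys and no duplicates (insert/search/inorder are my own illustrative helpers):

```python
class BSTVertex:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):                  # O(h): walk down to an empty spot
    if root is None:
        return BSTVertex(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def search(root, key):                  # O(h): binary search from the root
    if root is None or root.key == key:
        return root
    if key < root.key:
        return search(root.left, key)
    return search(root.right, key)

def inorder(root):                      # O(n): yields the keys in sorted order
    if root:
        yield from inorder(root.left)
        yield root.key
        yield from inorder(root.right)

root = None
for k in [5, 2, 8, 1, 3]:
    root = insert(root, k)
print(list(inorder(root)))              # [1, 2, 3, 5, 8]
print(search(root, 3) is not None)      # True
```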
Graph
Concepts
- Vertex
- Edge
- Path
- Directed/ Undirected
- Weighted/Unweighted
- Acyclic: no cycle.
- Connected
Special Graphs
- Tree: exactly one unique path between any 2 vertices. E = V - 1, always. Acyclic. The minimal set of edges that keeps a graph connected.
- Complete: E = V*(V-1)/2 edges (i.e., E = O(V²)). The densest graph.
- Bipartite: can be divided into 2 sets such that there is no edge between members of the same set.
- DAG: Directed Acyclic Graph.
Ways to store a graph
- Adjacency Matrix: rows and columns are both vertices. Space complexity O(V²)
  - Usage: frequently updating the weights of edges; checking the existence of edge (u, v). Both O(1).
  - Drawbacks: O(V) to fetch a vertex's neighbors; not good when there are many vertices, since the space complexity is high.
```python
AM = [[0 for i in range(V)] for j in range(V)]
for i in range(V):
    AM[i] = list(map(int, (fh.readline().strip().split())))
print("Neighbors of vertex 0:")
for j in range(V):  # O(|V|)
    if AM[0][j]:
        print('Edge 0-{:d} (weight = {:d})'.format(j, AM[0][j]))
```

- Adjacency List: one row per vertex; a (linked) list stores that vertex's neighbors. Space complexity O(V + E)
- Usage: the common default, and usually the fastest. Frequently fetching the neighbors of vertices: O(k), where k is the number of neighbors.
- Drawbacks: checking the existence of edge (u, v) requires scanning u's list, O(k).
```python
from collections import defaultdict

AL = defaultdict(list)  # initialize AL
for i in range(V):
    line = list(map(int, (fh.readline().strip().split())))
    total_neighbours = int(line[0])
    for j in range(1, len(line), 2):
        v, w = int(line[j]), int(line[j + 1])
        AL[i].append((v - 1, w))
print("Neighbors of vertex 0:")
for v_w in AL[0]:
    # AL[0] contains the required information
    print('Edge 0-{:d} (weight = {:d})'.format(v_w[0], v_w[1]))
```

- Edge List: one row per edge, storing triples (vertex1, vertex2, weight). Space complexity O(E)
- Usage: when you need to sort edges by their weights
- Drawbacks: must scan the whole list to fetch neighbors or to check the existence of edge (u, v). O(E)
```python
import heapq

E = int(fh.readline())
edge_list = []
for i in range(E):
    u, v, w = map(int, fh.readline().split())
    edge_list.append((w, u, v))
# build a heap
heapq.heapify(edge_list)
# edges sorted by weight (smallest -> largest)
for i in range(E):
    edge = heapq.heappop(edge_list)
    print('weight: {:d} ({:d}-{:d})'.format(edge[0], edge[1], edge[2]))
```
DFS & BFS
Both run in O(V + E) if the graph is stored as an Adjacency List, because we visit the k neighbors of each vertex in O(k).
DFS:
Recursion does the "visit children, then visit me" loop.
- Avoiding cycles: use an array `status[v]` of size V to remember whether a vertex has been visited.
- Remembering the path: use an array `parent[v]` of size V to remember the predecessor of each vertex.

```python
def printPath(u):                      # print the path from the source to u
    if p[u] == -1:                     # u is the source itself
        print("{}".format(u), end="")
        return
    printPath(p[u])                    # first print the path to u's predecessor
    print(" {}".format(u), end="")
```
BFS:
- Use a FIFO queue (a deque) to do the "visit all children" loop, level by level.
- Avoiding cycles: use an array `status[v]` of size V to remember whether a vertex has been visited.
- VisuAlgo said there is no need to track back? (BFS is iterative, so there is no recursion to unwind.)
Applications
- Reachability test: run DFS/BFS, then check `status[v]`.
- Actually printing the traversal path: run DFS/BFS, then backtrack through the `parent` array.
- Identifying/Counting/Labeling Connected Components (CCs) of undirected graphs: run DFS/BFS from every unvisited vertex, making use of the `status` array.
- Detecting if a graph is cyclic: modify DFS's `status` array to track three states:
  - unvisited: same as earlier, DFS has not reached vertex u before,
  - explored: partially visited. DFS has visited vertex u, but at least one neighbor of vertex u has not been visited yet (DFS will go depth-first to that neighbor first),
  - visited: now a stronger definition: all neighbors of vertex u have also been visited and DFS is about to backtrack from vertex u to vertex p[u].
  Then, if DFS encounters status[y] = explored, there is a cycle. (A sketch follows this list.)
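A sketch of this three-state cycle check on a directed graph (the UNVISITED/EXPLORED/VISITED names are my own constants):

```python
UNVISITED, EXPLORED, VISITED = 0, 1, 2

def has_cycle(al):                          # al: adjacency list of plain neighbors
    V = len(al)
    status = [UNVISITED] * V

    def dfs(u):
        status[u] = EXPLORED                # partially visited
        for v in al[u]:
            if status[v] == EXPLORED:       # reached an explored vertex -> cycle
                return True
            if status[v] == UNVISITED and dfs(v):
                return True
        status[u] = VISITED                 # all neighbors done: backtrack
        return False

    return any(status[u] == UNVISITED and dfs(u) for u in range(V))

print(has_cycle([[1], [2], [0]]))  # True  (0 -> 1 -> 2 -> 0)
print(has_cycle([[1], [2], []]))   # False
```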
Topological Sort (only on DAGs):
The DFS version requires just one additional line compared to normal DFS: it is basically a post-order traversal of the graph, reversed at the end.
```python
import sys
sys.setrecursionlimit(100000)

def dfs(i):
    visited[i] = True
    for j, w in al[i]:
        if visited[j]:
            continue
        dfs(j)
    topoOrder.append(i)  # the one extra line: post-order append to topoOrder

n, m = map(int, input().split())
vertices = list(map(int, input().split()))
al = [[] for _ in range(n)]
for _ in range(m):
    v1, v2, w = map(int, input().split())
    al[v2].append((v1, w))

# toposort
topoOrder = []
visited = [False for _ in range(n)]
for i in range(n):  # 0-based
    if not visited[i]:
        dfs(i)
topoOrder.reverse()
for i in topoOrder:
    for j, w in al[i]:
        vertices[j] += vertices[i] * w
print(*vertices)
```

The BFS version is based on the idea of vertices without incoming edges and is also called Kahn's algorithm.
```python
from collections import deque

in_degree = [0] * V
for u in range(V):
    for v in al[u]:              # al[u] holds u's neighbors
        in_degree[v] += 1        # count incoming edges
q = deque()                      # a queue of vertices with in-degree 0
for u in range(V):
    if in_degree[u] == 0:
        q.append(u)              # enqueue all v with in-degree 0
cnt = 0                          # count of visited vertices
topoOrder = []
while q:
    u = q.popleft()
    topoOrder.append(u)
    for v in al[u]:
        in_degree[v] -= 1        # u is removed, so v loses one incoming edge
        if in_degree[v] == 0:
            q.append(v)          # now v can be put into the topo order
    cnt += 1
if cnt < V:
    print('Cycle!')  # some vertices never reached in-degree 0: the remaining edges form a cycle
else:
    print(*topoOrder)
```
In Python
```python
# for dfs, define a function dfs(), and remember to raise the recursion limit
# with sys.setrecursionlimit
```

```python
# for bfs, remember to use deque and popleft()
```
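A minimal BFS sketch with the visited/parent bookkeeping, assuming an adjacency list al of plain neighbor indices (my own illustration):

```python
from collections import deque

def bfs(al, s):
    V = len(al)
    visited = [False] * V
    parent = [-1] * V              # predecessor on the traversal path
    visited[s] = True
    q = deque([s])
    while q:
        u = q.popleft()            # FIFO: pop from the head
        for v in al[u]:
            if not visited[v]:
                visited[v] = True
                parent[v] = u
                q.append(v)        # push at the tail
    return parent

al = [[1, 2], [0, 3], [0], [1]]
print(bfs(al, 0))                  # [-1, 0, 0, 1]
```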
SSSP (Single-Source Shortest Paths)
- BFS for unweighted or constant-weighted graphs
- Modified Dijkstra's algorithm for weighted graphs (it is just Uniform Cost Search). O((V+E) log V). It uses a Priority Queue (a binary heap) to maintain the search order, and "modified" refers to the lazy-deletion trick: leave outdated entries in the PQ, and only check validity when they are popped.
```python
from heapq import heappush, heappop
```
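A sketch of this modified Dijkstra under the adjacency-list convention above (AL[u] holds (v, w) pairs; the function name and the sample graph are my own):

```python
from heapq import heappush, heappop

def dijkstra(V, AL, s):
    dist = [float('inf')] * V
    dist[s] = 0
    pq = [(0, s)]                           # (distance estimate, vertex)
    while pq:
        d, u = heappop(pq)
        if d > dist[u]:                     # lazy deletion: skip outdated entries
            continue
        for v, w in AL[u]:
            if dist[u] + w < dist[v]:       # relax edge (u, v)
                dist[v] = dist[u] + w
                heappush(pq, (dist[v], v))  # old entries stay in the PQ
    return dist

AL = {0: [(1, 4), (2, 1)], 1: [(3, 1)], 2: [(1, 2)], 3: []}
print(dijkstra(4, AL, 0))                   # [0, 3, 1, 4]
```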