# Data Structures and Their ADTs

In IT5003, I am studying data structures and algorithms. The language we learn with is Python, so I take notes about what I learned and how to use it in Python. I structure my notes around data structures, and all the ADTs implemented by each data structure are discussed within. I focus on their usage, key design, and sometimes problems. In the future maybe I will append all the implementation code.

- ADT: Abstract data type, from the point of view of a user, defined by its behavior.
- Data Structure: from the point of view of an implementer, concrete representations of data.

# Linear Data Structures

## Array

Contiguous space to store a fixed number of fixed-size elements.

I often confuse array with list... Now I need to clarify that an **array** is a data structure, which is one of the implementations of the **List ADT**.

**Key design**:

Methods:

- ==Get ith==: O(1), obviously, at A[i]
- ==Insert(i)==: O(n) worst/average, coz need to shift everything after position i one step right; O(1) if inserting at the tail
- ==Remove(i)==: Just similar to insert, but shift everything after position i one step left.
- ==Search(v)==: Still similar. O(n) worst/average (found at the tail or not at all); O(1) if found immediately at the head

**Usage**: Make a **List ADT**. Store ordered things contiguously; we can **insert** at a specific position, **get** the ith element, and of course also **search** and **remove**.

**Problem**: Insert and remove are **O(n)** coz everything behind the position needs to shift to open or close the gap. Also, it requires **consecutive** and **fixed** space (the fixed size can be overcome by resizing), so sometimes it may be not that flexible.

**In python**

```python
# It is PYTHON who let me confuse the 2!! coz in python, an array is just called a list...
```
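To make the List ADT operations concrete on a Python list, here is a small snippet of my own (not from the lecture notes):

```python
A = [10, 20, 30]
A.insert(1, 15)     # Insert(i): O(n), shifts 20 and 30 right -> [10, 15, 20, 30]
print(A[2])         # Get ith: O(1) -> 20
print(A.index(30))  # Search(v): O(n) -> 3
A.pop(1)            # Remove(i): O(n), shifts left -> [10, 20, 30]
A.append(40)        # insert at the tail: amortized O(1)
```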

## Linked List

Non-contiguous space, not fixed-size. Each vertex contains a value and a pointer to the next vertex.

**Usage**: Make a **List ADT** again. Compared with an array, it doesn't need consecutive, fixed-size space, and it can **insert**/**remove** at the **head** very efficiently, in **O(1)**. But most importantly, it can make the **Stack/Queue/Deque ADTs**, because it is **resizable**.

**Key Design**: I think I just need to clarify its attributes and methods (a minimal sketch follows the list):

- Attributes:
    - Value
    - Next -- singly LL
    - (Previous -- doubly LL)
- Methods:
    - ==Get ith==: Unfortunately it becomes **O(n)** compared with array, coz it needs to walk from the head pointer.
    - ==Insert==: Insert at the head only needs **O(1)** (at the tail also O(1) if there is a tail pointer / it is doubly linked); but on average, also **O(n)**, coz should walk from the head :(
    - ==Remove==: Just similar to insert. (ps: coz we need to find the prev node first)
    - ==Search(v)==: Obviously O(1) best and **O(n)** worst/avg, same as Array.
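And here is the minimal sketch promised above -- my own toy version of a singly linked list, with head operations only:

```python
class Vertex:
    def __init__(self, value):
        self.value = value
        self.next = None               # singly linked: only a next pointer

class SinglyLinkedList:
    def __init__(self):
        self.head = None

    def insert_head(self, value):      # O(1)
        v = Vertex(value)
        v.next = self.head
        self.head = v

    def remove_head(self):             # O(1)
        if self.head is not None:
            self.head = self.head.next

    def search(self, value):           # O(n): walk from the head
        cur = self.head
        while cur is not None and cur.value != value:
            cur = cur.next
        return cur                     # None if not found
```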

### Stack **ADT**

Like a stack of books.

**Usage**: **LIFO** (last-in-first-out), so you both **pop** and **push** at the head. (Errrr, influenced by Python, I always think of it as the tail.)

**In python**

```python
# just use a list!! and regard the tail as the head.
```
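For example (my own snippet), with the list's tail acting as the stack top:

```python
stack = []
stack.append(1)     # push: amortized O(1)
stack.append(2)
print(stack[-1])    # peek -> 2
print(stack.pop())  # pop  -> 2 (LIFO)
```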

### Queue **ADT**

Like a queue in real life.

**Usage**: **FIFO** (first-in-first-out), so **push** at the tail and **pop** from the head; it is perfectly implemented by a **linked list** (with a tail pointer), making both operations **O(1)**.

**In python**

```python
from collections import deque
```
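Basic usage (my own snippet); `popleft()` is O(1), while `list.pop(0)` would be O(n):

```python
from collections import deque

q = deque()
q.append('a')       # enqueue at the tail
q.append('b')
print(q.popleft())  # dequeue from the head -> 'a' (FIFO)
```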

### Deque **ADT**

Queue, but with both ends open.

**Usage**: Double-ended queue. Both **pop** & **push** from both **head** & **tail**. Perfectly implemented by a **doubly linked list**, with all operations **O(1)**.

**In python**: Look at Queue.

# Non-Linear Data Structures

## Binary Heap

Complete Binary Tree: every level fully filled, except possibly the last, which is filled as far left as possible.

Binary Max Heap property: Parent greater than children

**Usage**:

Used for implementing a **Priority Queue ADT.** When to use a PQ? When you need to maintain some ordered elements and frequently fetch the Max/Min one.

**Key Design**: A binary heap is firstly a **binary tree**, and then it is ruled as a **Max heap** -- every parent node should be greater than its 2 children (a Min heap is the mirror case). The direct way of implementing the binary heap is to use a Tree class, with attributes Value, LeftChild and RightChild. However, it is more convenient to simply use an **array**. Just need to think of this:

- For a parent at the kth position, its left child is at \(k * 2\), and its right child is at \(k * 2 + 1\) (1-based indexing; index 0 is unused).
- For a child at the kth position, its parent is at \(k // 2\).
- So a condensed (complete) binary tree maps into an equally condensed list.

5 standard Binary (Max) Heap operations:

- ==**Insert(v)**== in O(log **N**): Firstly insert as the last leaf. Then **bubble up** -- swap with the parent, from the bottom upwards, till the heap property holds.
- ==**ExtractMax()**== in O(log **N**): just take the root element; then the important thing is what you should do after popping out the root -- you need to maintain the heap. So, move the last leaf to the root, then **bubble down** -- i.e. swap the parent with the bigger child, top down, till all fine.
- ==**Create(A)**== - O(**N** log **N**) version: Simply insert (that is, by calling the **Insert(v)** operation) all **N** integers of the input array into an initially empty Binary Max Heap one by one.
- ==**Create(A)**== - O(**N**) version (make a list of unsorted elements become a heap): bubble down every internal vertex, from the last internal vertex back to the root (the leaves are already fine), which is **O(N)** overall.
- ==**HeapSort()**== - in O(**N** log **N**): Simply call the O(log **N**) **ExtractMax()** operation **N** times.

**Extra**: Heapsort is not **cache friendly**. The computer predicts that you'll read the array in sequence, so it caches the following elems, but heapsort keeps jumping between positions k and 2k. Quicksort scans sequentially and takes this advantage; mergesort also scans sequentially, though it pays with extra memory.

**In python**

Implementation (don't actually use this; it's just for understanding):

```python
class binary_heap:  # implementation that has no duplicate
```
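Since the block above is truncated, here is a minimal sketch of what such a class could look like -- my own reconstruction (1-based array, Max heap, no duplicates):

```python
class BinaryMaxHeap:
    def __init__(self):
        self.A = [None]                    # index 0 unused -> 1-based index math works

    def insert(self, v):                   # O(log N)
        self.A.append(v)                   # new last leaf
        k = len(self.A) - 1
        while k > 1 and self.A[k // 2] < self.A[k]:
            self.A[k // 2], self.A[k] = self.A[k], self.A[k // 2]  # bubble up
            k //= 2

    def extract_max(self):                 # O(log N); assumes the heap is not empty
        root = self.A[1]
        last = self.A.pop()                # take the last leaf...
        if len(self.A) > 1:
            self.A[1] = last               # ...and move it to the root
            k, n = 1, len(self.A) - 1
            while 2 * k <= n:              # bubble down with the bigger child
                c = 2 * k
                if c + 1 <= n and self.A[c + 1] > self.A[c]:
                    c += 1
                if self.A[k] >= self.A[c]:
                    break
                self.A[k], self.A[c] = self.A[c], self.A[k]
                k = c
        return root
```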

Use the library:

```python
from heapq import heappop, heappush, heapify
```
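`heapq` is a **Min** heap; a common trick to get Max-heap behavior is to push negated keys (my own snippet):

```python
from heapq import heappop, heappush, heapify

nums = [5, 1, 9, 3]
heapify(nums)             # Create(A): O(N), in place
heappush(nums, 0)         # Insert(v): O(log N)
print(heappop(nums))      # extract-min: O(log N) -> 0

neg = [-x for x in [5, 1, 9, 3]]  # negate keys for a Max heap
heapify(neg)
print(-heappop(neg))      # ExtractMax() -> 9
```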

## Hash Table

Hashing is an algorithm (via a hash function) that maps large data sets of variable length into smaller Integer data sets of a fixed length. A Hash Table is a data structure that maps keys to values (also called the Table or Map ADT).

**Usage**:

- You want a **Table ADT** that can store things, where you can **insert** and **search** in **O(1)** (also **delete**).
- Actually the keys don't have to be integers; they can be anything, even strings. And you can even use (key, value) pairs, to store anything additional in the value part.

**Key design**: Use a **hash** function to map **n** keys into a list of length **m**; then the complexity of search(v)/remove(v) is **O(n/m)** (search v in a chain of average length n/m). You should design the hash function and also choose the length m.

- Hash function:
    - For integers: the **modulo** operation is the best practice. Just let `h(v) = v % M`. O(1).
    - For strings (normally strings only contain the 26 letters): for each char, `sum = (sum * 26 + char_value) % M`. This is like Base-26, so every English letter matters. (C++ version from class below.)
    - Never modulo a constant; always modulo the size of the table.
    - Never use a function that is not uniform, like sqrt.
    - Never use randomness.

```cpp
int hash_function(string v) {                    // assumption 1: v uses ['A'..'Z'] only
  int sum = 0;                                   // assumption 2: v is a short string
  for (auto& c : v)                              // for each character c in v
    sum = ((sum * 26) % M + (c - 'A' + 1)) % M;  // M is table size
  return sum;
}
```

- Length m:
    - Based on experience, it should be a **prime**, to make the fewest collisions; and it should be around n, to make O(n/m) a constant.
- Solving collisions (i.e., two or more keys have the same hash value):
    - **Separate Chaining** (don't move): use a doubly linked list (or just simply a Python list); if 2 keys land at the same position, just append the 2nd one at the end of that list.
    - **Open Addressing** (move away):
        - linear probing for another position
        - non-linear probing for another position
        - rehashing for another position
    - People are debating which is better, but my teacher Steven believes SC is better.
- With these 3 operations, it is a good **Table ADT** (implemented by SC):
    - ==Search(v)==: Just check if v is in the A[h(v)] list. O(\(\alpha\)), where \(\alpha = n/m\) is the avg length of each chain. If **m** is set properly, \(\alpha\) is very small, thus it is O(1).
    - ==Remove(v)==: Also O(\(\alpha\)), coz it needs to search first. Just delete v from the A[h(v)] list.
    - ==Insert(v)==: O(1). Just append v to the A[h(v)] list.

**May be problems:**

- The hash table always occupies a bit more space than the raw elements, coz it is very likely for the table to contain empty positions.
- What if the keys are duplicated? By default, we don't want duplicates. But I believe there are potential ways to deal with duplicated keys (e.g., storing a count in the value part).

**In python**

```python
# just use dictionary!!!
```

```python
M = 13  # table size, best to be prime
```
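The original snippet is truncated after its first line; a minimal separate-chaining sketch along those lines (my own reconstruction, integer keys only) might be:

```python
M = 13                      # table size, best to be prime
A = [[] for _ in range(M)]  # one chain (python list) per slot

def h(v):
    return v % M            # hash function for integers

def insert(v):              # O(1): append to the chain
    A[h(v)].append(v)

def search(v):              # O(alpha), alpha = n/m
    return v in A[h(v)]

def remove(v):              # O(alpha): search first, then delete
    if search(v):
        A[h(v)].remove(v)
```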

## Binary Search Tree

A Binary Search Tree (BST) is a binary tree (each vertex has only up to 2 children) that satisfies the

BST property: all vertices in the left subtree of a vertex must hold a value smaller than its own; all vertices in the right subtree of a vertex must hold a value larger than its own.

**Usage**: Another way to implement the **Table ADT**. Compared with a hash table, this way is slightly slower, with all operations **O(log n)** (when balanced). However, it can do an additional thing: keep the keys in the table **ordered**.

**Key Design**: A **binary tree**, for every node, everything on the left is smaller than it; on the right, bigger. (not only for the direct children, but also for everything below).

Basically, BST as an implementation of the **Table ADT** should also have the 3 **operations** (all **O(h)**, which is **O(log n)** when the tree is balanced):

- ==Search(v)==
- ==Remove(v)==
- ==insert(v)==

Besides, it is excellent because it has other special operations:

- Query operations (the BST structure remains unchanged):
- Search(v): O(h), where h is the height of the tree (h is log N on average, but can be N in the worst case). Just binary search from the root down.
- ==Predecessor(v)== (and similarly ==Successor(v)==), both also O(h). (details below)
- ==Inorder Traversal==, then elems are sorted. O(n)

- Update operations (the BST structure may likely change):
- Insert(v), O(h), find the place and add it.
- Remove(v), O(h). The worst case is that v has 2 children: first search(v) in O(h), then find its successor in another O(h), then replace v by its successor, and remember to delete the duplicated successor.
- ==Create BST==.

Successor (the next larger key):

- If v has a right subtree, it is the **minimum of the right** subtree.
- If there is no right subtree, traverse the ancestors, till **the first ancestor that is greater** than v.
- Otherwise, none (v is the maximum).

Predecessor:

- If v has a left subtree, it is the maximum of the left subtree.
- If not, it is the first ancestor that is smaller.
- Or none.

**Question**: How many structurally different BSTs can you form with **n distinct** elements?

A solution that's more understandable: \[ \text{Catalan}_n = \binom{2n}{n} \div (n+1) \]
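A quick numeric check of the formula (my own snippet, using `math.comb`):

```python
from math import comb

def catalan(n):
    return comb(2 * n, n) // (n + 1)

# number of structurally different BSTs for n = 1..5 distinct elements
print([catalan(n) for n in range(1, 6)])  # [1, 2, 5, 14, 42]
```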

**In python**

```python
# No implemented library for binary tree
```

```python
class BSTVertex:
```
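The class above is truncated; a minimal sketch of what it might contain (my own toy version: recursive insert/search plus inorder traversal):

```python
class BSTVertex:
    def __init__(self, key):
        self.key = key
        self.left = None       # everything smaller goes here
        self.right = None      # everything larger goes here

def insert(root, key):         # O(h): walk down to an empty spot
    if root is None:
        return BSTVertex(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root                # duplicates are ignored

def search(root, key):         # O(h): binary search from the root
    if root is None or root.key == key:
        return root
    if key < root.key:
        return search(root.left, key)
    return search(root.right, key)

def inorder(root):             # O(n): yields keys in sorted order
    if root:
        yield from inorder(root.left)
        yield root.key
        yield from inorder(root.right)
```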

# Graph

## Concepts

- Vertex
- Edge
- Path
- Directed/ Undirected
- Weighted/Unweighted
- Acyclic: No cycle.
- Connected

Special Graphs

- Tree: only **one unique path** between any 2 vertices. **E = V - 1**, always. Acyclic. The minimal set of edges that keeps a graph connected.
- Complete: **E = V*(V-1)/2** edges (or **E** = O(**V^2**)). The most dense graph.
- Bipartite: can be divided into 2 sets where there is no edge between members of the same set.
- DAG: Directed Acyclic Graph.

## Ways to store a graph

- Adjacency Matrix: columns and rows are both vertices. Space complexity **O(V^2)**.
    - Usage: frequently updating the weights of edges; checking the existence of edge (u, v). Both **O(1)**.
    - Drawbacks: **O(V)** to fetch the neighbors of a vertex, so not good when there are many vertices; high space complexity.

```python
# assumes fh is an already-opened input file and V is the number of vertices
AM = [[0 for i in range(V)] for j in range(V)]
for i in range(V):
    AM[i] = list(map(int, (fh.readline().strip().split())))

print("Neighbors of vertex 0:")
for j in range(V):  # O(|V|)
    if AM[0][j]:
        print('Edge 0-{:d} (weight = {:d})'.format(j, AM[0][j]))
```

- Adjacency List: one entry per vertex, storing a list of its neighbors. Space complexity **O(V + E)**.
    - Usage: the common choice; frequently fetching the neighbors of vertices. **O(k)**, k is # nbrs. Fastest.
    - Drawbacks: need to search when checking the existence of edge (u, v), O(k).

```python
from collections import defaultdict

# input format per vertex: #neighbors, then (neighbor, weight) pairs (1-based)
AL = defaultdict(list)  # initialize AL
for i in range(V):
    line = list(map(int, (fh.readline().strip().split())))
    total_neighbours = int(line[0])
    for j in range(1, len(line), 2):
        v, w = int(line[j]), int(line[j + 1])
        AL[i].append((v - 1, w))

print("Neighbors of vertex 0:")
for v_w in AL[0]:
    # AL[0] contains the required information
    print('Edge 0-{:d} (weight = {:d})'.format(v_w[0], v_w[1]))
```

- Edge List: one entry per edge, storing (vertex1, vertex2, weight) tuples. Space complexity **O(E)**.
    - Usage: when you need to **sort** edges in terms of weights.
    - Drawbacks: need to scan the whole thing when fetching neighbors / searching for the existence of edge (u, v). O(E).

```python
import heapq

E = int(fh.readline())
edge_list = []
for i in range(E):
    u, v, w = map(int, fh.readline().split())
    edge_list.append((w, u, v))  # weight first, so the heap orders by weight

# build a heap
heapq.heapify(edge_list)
# edges sorted by weight (smallest -> largest)
for i in range(E):
    edge = heapq.heappop(edge_list)
    print('weight: {:d} ({:d}-{:d})'.format(edge[0], edge[1], edge[2]))
```

## DFS & BFS

Both are **O(V + E)** if the graph is stored as an Adjacency List, coz we visit the k nbrs of each vertex in O(k).

DFS:

- Use recursion to do the "visit me, then visit my children" loop.
- Avoiding cycles: use an array `status[v]` of size V to remember whether a vertex is visited.
- Remembering the path: use an array `parent[v]` of size V to remember the predecessor of v.

```python
# p is the parent array filled in by DFS/BFS; p[source] == -1
def printPath(u):
    if p[u] == -1:       # reached the source vertex
        print("{}".format(u), end="")
        return
    printPath(p[u])      # print the path to u's parent first
    print(" {}".format(u), end="")
```

BFS:

- Use a FIFO queue (a deque in Python) to do the "visit all children" loop (a sketch follows below).
- Avoiding cycles: use an array `status[v]` of size V to remember whether a vertex is visited.
- Visualgo said no need to track back?

### Applications

- Reachability test: run DFS/BFS, then check `status[v]`.
- Actually printing the traversal path: run DFS/BFS, then backtrack through the `parent` array.
- Identifying/Counting/Labeling Connected Components (CCs) of undirected graphs: DFS/BFS, making use of the `status` array.
- Detecting if a graph is cyclic: modify DFS's `status` array to track three states (a sketch follows this list):
    - **unvisited**: same as earlier, DFS has not reached vertex **u** before,
    - **explored**: partially visited. DFS has visited vertex **u**, but at least one neighbor of vertex **u** has not been visited yet (DFS will go depth-first to that neighbor first),
    - **visited**: now a **stronger** definition: all neighbors of vertex **u** have also been visited and DFS is about to backtrack from vertex **u** to vertex **p[u]**.
    - Then, if we encounter **status[y] = explored**, we have found a cycle.
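And a sketch of that cycle check (my own version, for a directed graph stored in an adjacency list `AL`):

```python
UNVISITED, EXPLORED, VISITED = 0, 1, 2

def has_cycle(u, AL, status):
    status[u] = EXPLORED                # partially visited
    for v in AL[u]:
        if status[v] == UNVISITED:
            if has_cycle(v, AL, status):
                return True
        elif status[v] == EXPLORED:     # back edge found -> cycle
            return True
    status[u] = VISITED                 # all neighbors done, backtrack
    return False

# usage: status = [UNVISITED] * V, then call has_cycle on every unvisited vertex
```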

Topological Sort (only on DAGs).

The DFS version requires just **one additional line** compared to the normal DFS and is basically the **post-order** traversal of the graph. (The snippet below is from one of my exercises: toposort first, then propagate vertex values along the weighted edges.)

```python
import sys
sys.setrecursionlimit(100000)

def dfs(i):
    visited[i] = True
    for j, w in al[i]:
        if visited[j]:
            continue
        dfs(j)
    topoOrder.append(i)  # the one additional line: append in post-order

n, m = map(int, input().split())
vertices = list(map(int, input().split()))
al = [[] for _ in range(n)]
for _ in range(m):
    v1, v2, w = map(int, input().split())
    al[v2].append((v1, w))

# toposort
topoOrder = []
visited = [False for _ in range(n)]
for i in range(n):  # 0-based
    if not visited[i]:
        dfs(i)
topoOrder.reverse()

for i in topoOrder:
    for j, w in al[i]:
        vertices[j] += vertices[i] * w
print(*vertices)
```

The BFS version is based on the idea of vertices **without incoming edges** and is also called Kahn's algorithm.

```python
from collections import deque

# al: adjacency list (al[u] = list of u's neighbors), V: number of vertices
in_degree = [0] * V
for u in range(V):
    for v in al[u]:
        in_degree[v] += 1       # calculate in-degrees

q = deque()                     # a queue of vertices with in-degree 0
for i in range(V):
    if in_degree[i] == 0:
        q.append(i)             # enqueue all v with in-degree 0

cnt = 0                         # count of visited vertices
topoOrder = []
while q:
    u = q.popleft()
    topoOrder.append(u)
    for v in al[u]:
        in_degree[v] -= 1       # one incoming edge removed
        if in_degree[v] == 0:
            q.append(v)         # now it can join the topo order
    cnt += 1

if cnt < V:
    # some vertices never reached in-degree 0: they sit on a cycle
    print('Cycle!')
else:
    print(*topoOrder)
```

### In Python

```python
# for dfs, define a function dfs(), and remember to raise the recursion limit (sys.setrecursionlimit)
```

```python
# for bfs, remember to use deque and popleft()
```

## SSSP

- BFS for an unweighted or constant-weighted graph.
- Modified Dijkstra's Algorithm for weighted graphs (it is just Uniform Cost Search). **O((V+E) log V).** It uses a Priority Queue via a binary heap to maintain the search order. And "modified" means the **Lazy Update** trick: leave the outdated entries in the PQ, and only check validity when we meet them (a sketch follows the snippet below).

```python
from heapq import heappush, heappop
```
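The block above is truncated; here is a sketch of that lazy-update Dijkstra (my own version, assuming `AL[u]` holds `(v, w)` pairs):

```python
from heapq import heappush, heappop

def dijkstra(s, AL, V):
    INF = float('inf')
    dist = [INF] * V
    dist[s] = 0
    pq = [(0, s)]                           # (distance, vertex)
    while pq:
        d, u = heappop(pq)
        if d > dist[u]:                     # outdated entry: lazy deletion
            continue
        for v, w in AL[u]:
            if dist[u] + w < dist[v]:       # relax the edge
                dist[v] = dist[u] + w
                heappush(pq, (dist[v], v))  # push the better copy; old ones stay
    return dist
```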