In IT5003, I am studying data structures and algorithms, and the course language is Python, so these notes record what I learned and how to use it in Python. I structure the notes by data structure, and the ADTs implemented by each data structure are discussed within. I focus on their usage, key design, and sometimes problems. In the future I may append full implementation code.

• ADT: Abstract data type, from the point of view of a user, defined by its behavior.
• Data Structure: from the point of view of an implementer, concrete representations of data.

Linear Data Structures

Array:

A contiguous block of memory storing fixed-size elements.

I often confuse array with list... Now I need to clarify that an array is a data structure, which is one of the implementations of the List ADT.

Key design:

Methods:

• ==Get ith==: O(1) obviously, at A[i]
• ==Insert(i)==: O(n) worst/average, coz everything behind must shift right to open the gap; O(1) if inserting at the tail
• ==Remove(i)==: Just similar to insert, but everything behind shifts left to close the gap.
• ==Search v==: Still similar. O(n) worst/average (found at the tail or not at all); O(1) if found immediately at the head

Usage: Implements the List ADT. Store ordered things contiguously; we can insert at a specific position, get the ith element, and of course also search and remove.

Problem: Insert and remove are O(n) coz everything behind the position must shift to open or close the gap. Also, it requires contiguous and fixed-size (can be overcome, with resizing) space, which is sometimes not that flexible.

In python
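Python's built-in list is itself an array-backed implementation of the List ADT, so the operations above map onto it directly:

```python
a = [10, 20, 30]
a.append(40)         # insert at tail, O(1) amortized
a.insert(1, 15)      # insert at position 1, O(n): shifts later elements right
a.remove(15)         # search + delete, O(n): shifts later elements left
x = a[2]             # get ith, O(1) random access
found = 30 in a      # linear search, O(n)
```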

Linked List

Non-contiguous space, not fixed-size. Each vertex contains a value and a pointer to the next vertex.

Usage: Implements the List ADT again. Compared with an array, it needs neither contiguous nor fixed-size space, and it can insert/remove at the head very efficiently, in O(1). But most importantly, it can implement the Stack/Queue/Deque ADTs, because it is resizable.

Key Design: I think just need to clarify its attributes and methods:

• Attributes:
• Value
• Next -- Singly LL
• (Previous -- Doubly LL)
• Methods:
• ==Get ith==: Unfortunately it becomes O(n) compared with array, coz it needs to go from the head pointer.
• ==Insert==: Insert at the head only needs O(1) (at the tail also O(1) if we keep a tail pointer); but on average it is also O(n), coz we must walk from the head :(
• ==Remove==: Just similar to insert. (ps: worse at the tail, coz we need to find the prev node -- a doubly linked list gives that for free)
• ==Search n==: Obviously O(1) best and O(n) worst/avg, same as Array
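A minimal singly linked list sketch (class and method names are my own, just for illustration):

```python
class Node:
    """One vertex: a value plus a pointer to the next vertex."""
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

class SinglyLinkedList:
    def __init__(self):
        self.head = None

    def push_front(self, value):
        """Insert at the head in O(1)."""
        self.head = Node(value, self.head)

    def get(self, i):
        """Get the ith element in O(n): walk from the head."""
        cur = self.head
        for _ in range(i):
            cur = cur.next
        return cur.value

    def search(self, value):
        """O(n) worst/average: walk until found or the list ends."""
        cur = self.head
        while cur is not None:
            if cur.value == value:
                return True
            cur = cur.next
        return False
```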

Stack ADT

Like book stack.

Usage: LIFO (last-in-first-out), so you both push and pop from the same end, the top. (Errrr, influenced by Python, I always think of the top as the tail.)

In python
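A plain Python list works as a stack directly, treating the tail as the top:

```python
stack = []
stack.append(1)       # push, O(1) amortized
stack.append(2)
top = stack[-1]       # peek at the top, O(1)
popped = stack.pop()  # pop from the top, O(1)
```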

Queue ADT

Like queue in reality.

Usage: FIFO(first-in-first-out), so push at the tail and pop from head, it is perfectly implemented by a linked list (with tail pointer), to make both operations O(1).

In python
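collections.deque supports O(1) operations at both ends, so it is the standard Python queue (a list would make popping from the head O(n)):

```python
from collections import deque

q = deque()
q.append('a')        # enqueue at the tail, O(1)
q.append('b')
front = q.popleft()  # dequeue from the head, O(1)
```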

Deque ADT

Queue + double head

Usage: Double-ended queue. Both pop & push from both head & tail. Perfectly implemented by a doubly linked list, with all operations O(1).

In python: Look at Queue.

Non-Linear Data Structures

Binary Heap

Complete Binary Tree: Every level fully filled + last level as far left as possible

+ Binary Max Heap property: Parent greater than children

Usage:

Used for implementing a Priority Queue ADT. When to use PQ? When you need to maintain some ordered elements, and frequently fetch the Max/Min ones.

Key Design: A binary heap is firstly a complete binary tree, and then it is ruled by the (Max) heap property -- every parent node must be greater than (or equal to) its 2 children. The direct way of implementing the binary heap is a Tree class, with attributes Value, LeftChild and RightChild. However, it is more convenient to simply use an array (1-based, leaving index 0 unused). Just need to think of this:

• For a parent at position k, its left child is at k * 2, and its right child at k * 2 + 1.
• For a child at position k, its parent is at k // 2.
• A complete binary tree maps into an equally compact list, with no gaps.
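With 1-based indexing (index 0 left unused), the index mapping can be checked directly on a small max heap:

```python
heap = [None, 90, 19, 36, 17, 3, 25, 1]  # a valid max heap; index 0 unused

def parent(k): return k // 2
def left(k):   return k * 2
def right(k):  return k * 2 + 1

# the max-heap property holds for every non-root position
assert all(heap[parent(k)] >= heap[k] for k in range(2, len(heap)))
```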

5 standard Binary (Max) Heap operations:

1. ==Insert(v)== in O(log N): Firstly insert as the last leaf. Then bubble up -- swap with the parent while the parent is smaller, from bottom to top.
2. ==ExtractMax()== in O(log N): just take the root element; then the important thing is maintaining the heap after popping out the root -- move the last leaf to the root, then bubble down, i.e. swap the parent with its bigger child, from top to bottom, until the property holds.
3. (==Create(A)== - O(N log N) version: Simply insert (that is, by calling Insert(v) operation) all N integers of the input array into an initially empty Binary Max Heap one by one.)
4. ==Create(A)== - O(N) version (make a list of unsorted elements become a heap): O(N), by bubbling down each internal vertex, from the last internal vertex back to the root (bottom-up heapify).
5. ==HeapSort()== - in O(N log N): Simply call the O(log N) ExtractMax() operation N times.

Extra: Heapsort is not cache friendly. The CPU predicts that you will read the array in sequence and caches the following elems, but heapsort jumps around the array. Quicksort takes advantage of this; mergesort does not.

In python

Implementation: don't use it, just for understanding

Use the library:
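The heapq module implements a binary min heap on top of a plain list; negate the values to simulate a max heap:

```python
import heapq

nums = [5, 1, 8, 3]
heapq.heapify(nums)             # Create(A) in O(N), in place
heapq.heappush(nums, 0)         # Insert(v) in O(log N)
smallest = heapq.heappop(nums)  # ExtractMin() in O(log N)

# heapq is a MIN heap; store negated values for max-heap behavior
maxheap = [-v for v in [5, 1, 8, 3]]
heapq.heapify(maxheap)
largest = -heapq.heappop(maxheap)
```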

Hash Table

Hashing is an algorithm (via a hash function) that maps large data sets of variable length into smaller Integer data sets of a fixed length. Hash Table is a data structure to map key to values (also called Table or Map ADT).

Usage:

• You want a Table ADT that can store things, with insert, search, and delete all in O(1)
• The keys are not necessarily integers -- they can be anything, even strings. And you can even use (key, value) pairs, to store anything additional in the value part.

Key design: Use a hash function to map n keys into a list of length m; then the complexity of Search(v)/Remove(v) is O(n/m) (search v in a chain of average length n/m). You need to design both the hash function and the length m.

• Hash function:
• For integers: modulo operation is the best practice. Just let h(v) = v % M. O(1).

• For strings: (normally strings only contain the 26 letters) for char in string: sum = (sum * 26 + (ord(char) - ord('a'))) % M. This treats the string as a base-26 number, so every English letter matters.

1. never take modulo of an arbitrary constant -- modulo the size of the table;
2. never use a function that is not uniform, like sqrt;
3. never use randomness
• Length m:
• Based on experiences, it should be a prime, to make the least collisions; and should be around n, to make O(n/m) a constant.
• Solve collision: i.e., two or more keys have the same hash value.
• Separate Chaining (don't move): Use a Doubly Linked List (or simply a Python list): if 2 keys land in the same position, just append the 2nd one to that position's list.
• Open Addressing (move away):
• linear search another position
• non-linear search for another position
• rehash for another position
• People are debating which is better, but my teacher Steven believes SC is better.
• With these 3 operations, it is a good Table ADT (implemented by SC):
• ==Search(v)==: Just check if v is in the A[h(v)] list. O($\alpha$), where $\alpha = n/m$ is the average length of each chain. If m is set properly, $\alpha$ is a small constant, thus it is O(1).
• ==Remove(v)==: Also O($\alpha$), coz we need to search for it first; then delete v from the A[h(v)] list.
• ==Insert(v)==: O(1). Just append v to the A[h(v)] list.
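A minimal separate-chaining sketch for integer keys (names are my own; m is fixed for simplicity, with no resizing):

```python
class ChainedHashTable:
    """Illustrative separate-chaining hash table for integer keys."""
    def __init__(self, m=7):                  # m: ideally a prime near n
        self.m = m
        self.buckets = [[] for _ in range(m)]

    def _h(self, v):
        return v % self.m                     # modulo the table size, never a constant

    def insert(self, v):
        self.buckets[self._h(v)].append(v)    # O(1): append to the chain

    def search(self, v):
        return v in self.buckets[self._h(v)]  # O(alpha), alpha = n/m

    def remove(self, v):
        self.buckets[self._h(v)].remove(v)    # search the chain first, then delete
```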

Possible problems:

• The hash table always occupies somewhat more space than the raw elements, coz it is very likely to contain empty positions
• What if the keys are duplicated? By default, we don't allow duplicates, but there are ways to deal with them (e.g., chaining all values under one key)

In python
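Python's dict (and set) are hash tables under the hood, so they give exactly this Table ADT:

```python
table = {}
table['alice'] = 95       # insert (key, value), O(1) average
table['bob'] = 87
found = 'alice' in table  # search, O(1) average
del table['bob']          # remove, O(1) average
```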

Binary Search Tree

A Binary Search Tree (BST) is a binary tree in which each vertex has only up to 2 children that satisfies BST property:

All vertices in the left subtree of a vertex must hold a value smaller than its own; all vertices in the right subtree of a vertex must hold a value larger than its own.

Usage: Another way to implement the Table ADT. Compared with a hash table it is slightly slower, with all operations O(log n). However, it can do something extra: keep the keys in the table ordered.

Key Design: A binary tree, for every node, everything on the left is smaller than it; on the right, bigger. (not only for the direct children, but also for everything below).

Basically, BST as an implementation of Table ADT, should also have the 3 operations: (All O(log n))

• ==Search(v)==
• ==Remove(v)==
• ==insert(v)==

Besides, it is excellent because it has other special operations:

1. Query operations (the BST structure remains unchanged):
1. Search(v): O(h), h is the height of the tree (h is log N on average, but can be N). Just binary search from root down.
2. ==Predecessor(v)== (and similarly ==Successor(v)==), both also O(h). (details below)
3. ==Inorder Traversal==, then elems are sorted. O(n)
2. Update operations (the BST structure may likely change):
1. Insert(v), O(h), find the place and add it.
2. Remove(v), O(h). The worst case is when v has 2 children: first Search(v) in O(h), then find its successor in another O(h), then replace v by its successor, and remember to delete the (now duplicated) successor.
3. ==Create BST==.

Successor:

1. If v has a right subtree, it is the minimum of the right subtree
2. If no right subtree, traverse the ancestors, until the first ancestor that is greater than v
3. Otherwise, none.

Predecessor:

1. If v has a left subtree, it is the maximum of the left subtree
2. If not, it is the first ancestor that is smaller
3. Or none.

Question: How many structurally different BSTs can you form with n distinct elements?

Solution with Catalan number

A more understandable solution: $Catalan_n = C_{2n}^n \div (n+1)$

In python
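A minimal BST sketch with the three Table operations plus inorder traversal (names are my own; no balancing, so O(h) can degrade to O(n)):

```python
class BSTNode:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

def insert(root, v):
    """O(h): walk down until we find the empty spot for v. Duplicates ignored."""
    if root is None:
        return BSTNode(v)
    if v < root.value:
        root.left = insert(root.left, v)
    elif v > root.value:
        root.right = insert(root.right, v)
    return root

def search(root, v):
    """O(h): binary search from the root down."""
    if root is None:
        return False
    if v == root.value:
        return True
    return search(root.left, v) if v < root.value else search(root.right, v)

def inorder(root):
    """O(n): left, me, right -- yields the keys in sorted order."""
    if root is None:
        return []
    return inorder(root.left) + [root.value] + inorder(root.right)
```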

Graph

Concepts

• Vertex
• Edge
• Path
• Directed/ Undirected
• Weighted/Unweighted
• Acyclic: No cycle.
• Connected

Special Graphs

• Tree: exactly one unique path between any 2 vertices. E = V - 1, always. Acyclic. The minimal set of edges that keeps a graph connected.
• Complete: E = V*(V-1)/2 edges (i.e., E = O(V^2)). The densest graph
• Bipartite: vertices can be divided into 2 sets such that there is no edge between members of the same set.
• DAG

Ways to store a graph

1. Adjacency Matrix: Columns and rows are both vertices. Space complexity O(V^2)
1. Usage: Frequently update the weights of edges; checking the existence of edge (u, v). Both O(1).
2. Drawbacks: O(V) to fetch neighbors; not good when there are many vertices; high space complexity.
2. Adjacency List: Column is vertices, then use linked lists to store their neighbors. Space complexity O(V + E)
1. Usage: Commonly. Frequently fetch neighbors of vertices. O(k), k is # nbrs. Fastest.
2. Drawbacks: Need to search for Existence of Edge (u, v), O(k).
3. Edge List: Column is edges, then store pairs (vertex1, vertex2, weight). Space complexity O(E)
1. Usage: Need to sort edges in terms of weights
2. Drawbacks: Need to scan the whole thing when fetch neighbors / search existence of edge(u,v). O(E)
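The same small undirected weighted graph in these representations (a sketch, variable names are my own):

```python
V = 4
edges = [(0, 1, 5), (0, 2, 3), (1, 3, 2)]  # edge list: (u, v, weight) triples

# adjacency matrix: O(V^2) space, O(1) edge-existence check
matrix = [[0] * V for _ in range(V)]
for u, v, w in edges:
    matrix[u][v] = matrix[v][u] = w         # undirected: set both directions

# adjacency list: O(V + E) space, O(k) to fetch the k neighbors of a vertex
adj = {u: [] for u in range(V)}
for u, v, w in edges:
    adj[u].append((v, w))
    adj[v].append((u, w))
```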

DFS & BFS

Both O(V + E) if the graph is stored as an Adjacency List, coz we visit the k nbrs of each vertex in O(k).

DFS:

1. Use recursion to do the visit-me-then-recursively-visit-my-children loop.

2. Avoiding Cycle: Use an array status[v] of size V to remember whether a vertex is visited.

3. Remember the path: Use an array parent[v] of size V to remember the predecessor of v in the traversal.
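The three DFS ingredients above can be sketched as (names are my own; status and parent kept in dicts):

```python
def dfs(adj, u, status=None, parent=None):
    """Recursive DFS over an adjacency list {u: [v, ...]}."""
    if status is None:
        status, parent = {}, {}
    status[u] = 'visited'             # avoid revisiting, so cycles don't loop forever
    for v in adj[u]:
        if v not in status:
            parent[v] = u             # remember how we reached v
            dfs(adj, v, status, parent)
    return status, parent
```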

BFS:

1. Use a FIFO queue (a deque in Python) to do the visit-all-neighbors loop.
2. Avoiding Cycle: Use an array status[v] of size V to remember whether a vertex is visited.
3. VisuAlgo notes that no backtracking is needed -- the queue drives the traversal, unlike DFS's recursion.
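The BFS loop above can be sketched as (names are my own):

```python
from collections import deque

def bfs(adj, s):
    """BFS from source s over an adjacency list {u: [v, ...]}."""
    visited = {s}               # avoid revisiting (cycles)
    parent = {s: None}
    q = deque([s])              # FIFO queue
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in visited:
                visited.add(v)
                parent[v] = u   # remember how we reached v
                q.append(v)
    return parent
```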

Applications

1. Reachability test: DFS/BFS, then check status[v]

2. Actually printing the traversal path: DFS/BFS, then backtrack the parent array

3. Identifying/Counting/Labeling Connected Components (CCs) of undirected graphs: DFS/BFS, make use of the status

4. Detecting if a graph is cyclic: modify DFS's status array to track:

1. unvisited: same as earlier, DFS has not reach vertex u before,

2. explored: partially visited. DFS has visited vertex u, but at least one neighbor of vertex u has not been visited yet (DFS will go depth-first to that neighbor first),

3. visited: now stronger definition: all neighbors of vertex u have also been visited and DFS is about to backtrack from vertex u to vertex p[u].

Then, if DFS meets a vertex whose status is explored, there is a cycle.

5. Topological Sort (only on DAGs),

1. The DFS version requires just one additional line compared to the normal DFS and is basically the post-order traversal of the graph.

2. The BFS version is based on the idea of repeatedly removing vertices without incoming edges and is also called Kahn's algorithm.
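Kahn's algorithm can be sketched as follows (my own names; assumes an adjacency list dict over vertices 0..V-1):

```python
from collections import deque

def kahn_toposort(adj, V):
    """Kahn's algorithm: repeatedly remove vertices with in-degree 0."""
    indeg = [0] * V
    for u in adj:
        for v in adj[u]:
            indeg[v] += 1
    q = deque(u for u in range(V) if indeg[u] == 0)  # no incoming edges
    order = []
    while q:
        u = q.popleft()
        order.append(u)
        for v in adj[u]:
            indeg[v] -= 1            # "remove" u's outgoing edges
            if indeg[v] == 0:
                q.append(v)
    return order   # len(order) < V would mean the graph had a cycle
```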

SSSP

• BFS for unweighted or constant weighted graph
• Modified Dijkstra's Algorithm for weighted. (It is just the Uniform Cost Best Search). O((V+E) log V). It uses a Priority Queue via binary heap to maintain the search order. And this modified means a Lazy Update tech: leave the outdated ones in the PQ, and only check valid when meet them.