Tactile-DS

A suite of interactive data structures and algorithms implemented in C/C++. Interactive web terminal: https://data-structures.xyz/

View on GitHub

COMP2521 - Data Structures and Algorithms

You may find the notes here have better formatting.

Week 1: Algorithm Analysis

Algorithm Analysis:

Terminology:

Big-O Notation (Time Complexity):

Time-complexity for different operations on common data structures:

Images/notes/ref_image_1.png


Examples: O(1)

void swap(int a[], int i, int j) {
		int tmp = a[i];  // O(1)
		a[i] = a[j];     // O(1)
		a[j] = tmp;      // O(1)
}                    // Total: O(1)

O(n)

for (int i = 0; i < n; i++) {
		// Some O(1) statements
}   // Total: O(n)

O(n²)

for (int i = 0; i < n; i++) {
		for (int j = 0; j < n; j++) {
				// Some O(1) statements
		}
}   // Total: O(n²), basically counting how many times the innermost statements get executed

O(logn)

for (int i = 1; i < n; i *= 2) {  // i must start at 1: starting at 0 would loop forever since 0 * 2 == 0
		// Some O(1) statements
}   // Total: O(log₂n)
// Eg. for n = 100, the loop will execute log₂(100) times, which is 6 or 7 times

Week 2: Compilation and ADTs

Abstract Data Types:


Set ADT example: A typical abstract set data type requires the following functions:
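A sketch of what such an interface might look like (the names and exact set of operations here are illustrative, not prescribed):

#include <stdbool.h>

typedef struct setRep *Set;  // Opaque handle: users never see the underlying representation

Set  SetNew(void);                  // Create a new empty set
void SetFree(Set s);                // Free all memory used by the set
void SetInsert(Set s, int item);    // Add an item (no effect if it's already a member)
void SetDelete(Set s, int item);    // Remove an item if present
bool SetContains(Set s, int item);  // Membership test
int  SetSize(Set s);                // Number of items currently in the set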

ADTs are also expected to implement error handling: checking pre-conditions and post-conditions, and validating that inputs are legal values.

Time-complexities for a set implementation using various underlying data structures:

Images/notes/ref_image_2.png

As the implementer of an abstract data type, once you have decided which functions to supply, compare the time complexities each candidate implementation gives for those functions, then pick the implementation whose strengths match the functions most important to the specification.


Week 3: Trees I

Tree Data Structures:

Tree structure:

Images/notes/ref_image_3.png

Images/notes/ref_image_4.png

Tree ‘indexes’:

Images/notes/ref_image_5.png

Binary Search-Trees:

Possible struct declaration:

typedef struct BSTNode *Tree;
struct BSTNode {
		int data;
		struct BSTNode *left;
		struct BSTNode *right;  // We could add more fields, like: int numNodes, which counts how many nodes are under this node
};

Images/notes/ref_image_6.png

Images/notes/ref_image_7.png


Tree Algorithms:

Tree traversal:
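The classic depth-first orders (pre-order, in-order, post-order) differ only in where the node itself is visited relative to the two recursive calls. A minimal sketch using the Tree type above (printf needs <stdio.h>):

// In-order traversal: on a BST, this prints the values in ascending order
void inorderTraversal(Tree t) {
		if (t == NULL) return;
		inorderTraversal(t -> left);   // 1. Traverse the left subtree
		printf("%d ", t -> data);      // 2. Visit the node itself -- moving this line first gives pre-order, last gives post-order
		inorderTraversal(t -> right);  // 3. Traverse the right subtree
}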

Tree Properties:

Tree Maintenance:

Tree JoinTree(Tree t1, Tree t2) {  // Assumes every value in t1 is less than every value in t2
		if (t1 == NULL) return t2;
		if (t2 == NULL) return t1;
		Tree minNode = t2;
		Tree minNodeParent = NULL;
		while (minNode -> left != NULL) {  // Find the minimum node of t2
				minNodeParent = minNode;
				minNode = minNode -> left;
		}
		if (minNodeParent != NULL) {
				minNodeParent -> left = minNode -> right;  // Detach the minimum node from t2
				minNode -> right = t2;
		}
		minNode -> left = t1;
		return minNode;  // The minimum node of t2 becomes the new root
}

Images/notes/ref_image_9.png


Week 4: Trees II — Specialised Search Trees

Splay Trees:

Target node belongs somewhere in the grandchildren subtrees — zig-zig and zig-zag cases:

  1. Target node is in the left child’s left child — zig-zig: rotateRight(t) twice
  2. Target node is in the left child’s right child — zig-zag: rotateLeft(t → left) then rotateRight(t)
  3. Target node is in the right child’s left child — zig-zag: rotateRight(t → right) then rotateLeft(t)
  4. Target node is in the right child’s right child — zig-zig: rotateLeft(t) twice

If the target node belongs in the left subtree, cases 1 and 2 apply; the right-subtree cases are symmetric, as the code below shows.

// Lifts a target node to the root by a sequence of rotations depending on cases
struct node *splay(struct node *root, int target) {
    if (root == NULL || root -> value == target) return root;  // Terminating case: either root is NULL or root contains the target

    // Target is somewhere in the left subtree
    if (root -> value > target) {
        if (root -> left == NULL) return root;                             // Target must be in the left child, but there is no left child, so the target must not exist

        // Dealing with zig-zig and zig-zag:
        if (root -> left -> value > target) {                              // Target belongs somewhere in the left child's left child - Zig-Zig
            root -> left -> left = splay(root -> left -> left, target);    // Recursively lift the target node to the root of the left-left subtree
            root = rightRotate(root);                                      // Do FIRST rotation on the parent node
        }
        else if (root -> left -> value < target) {                         // Target belongs somewhere in the left child's right child - Zig-Zag
            root -> left -> right = splay(root -> left -> right, target);  // Recursively lift the target node to the root of the left-right subtree
            if (root -> left -> right != NULL)                             // If the left child's right child exists, then call leftRotate on the left child
                root -> left = leftRotate(root -> left);
        }
        return (root -> left == NULL) ? root : rightRotate(root);          // Do SECOND rotation, only if the left child exists
    }

    // Target is somewhere in the right subtree
    else {
        if (root -> right == NULL) return root;                            // Right subtree doesn't exist, so the target doesn't either
        // Dealing with zig-zig and zig-zag:
        if (root -> right -> value > target) {                             // Target belongs somewhere in the right child's left child - Zig-Zag
            root -> right -> left = splay(root -> right -> left, target);  // Lift the target to the root of the right child's left subtree
            if (root -> right -> left != NULL)
                root -> right = rightRotate(root -> right);
        }
        else if (root -> right -> value < target) {                        // Target belongs somewhere in the right child's right child - Zig-Zig
            root -> right -> right = splay(root -> right -> right, target);  // Lift the target to the root of the right-right subtree, then do first rotation
            root = leftRotate(root);
        }
        return (root -> right == NULL) ? root : leftRotate(root);          // Do second rotation for root
    }
}

Time-complexity: O(n) worst case for a single operation, but splaying gives O(logn) amortised performance

AVL-Trees:

Tree AVLInsert(Tree t, int val) {        // Expects the input tree to not have any existing imbalances
		if (t == NULL) return newTree(val);  // This is where insertion occurs
		if (t -> value == val) return t;     // Attempting to insert a value that already exists in the tree

		if (t -> value > val) {
				t -> left = AVLInsert(t -> left, val);
		}
		else if (t -> value < val) {
				t -> right = AVLInsert(t -> right, val);
		}
		// After insertion, get the heights and check if a new imbalance needs to be fixed
		int leftHeight = getHeight(t -> left);
		int rightHeight = getHeight(t -> right);
		if (leftHeight - rightHeight > 1) {  // The left subtree is exactly 2 levels deeper than the right (inserting into a balanced tree can only ever create a height difference of 2)
				if (val > (t -> left -> value)) {    // Inserted into the left child's right subtree: left-right case
						t -> left = rotateLeft(t -> left);
						t = rotateRight(t);
				} else {                             // Left-left case
						t = rotateRight(t);
				}
		}
		else if (rightHeight - leftHeight > 1) {  // Right subtree is 2 levels deeper than the left
				if (val < (t -> right -> value)) {   // Inserted into the right child's left subtree: right-left case
						t -> right = rotateRight(t -> right);
						t = rotateLeft(t);
				} else {                             // Right-right case
						t = rotateLeft(t);
				}
		}
		return t;
}

https://upload.wikimedia.org/wikipedia/commons/f/fd/AVL_Tree_Example.gif

2-3-4 Trees:

2-3-4 trees have special nodes:

In 2-3-4 trees, all leaves are equidistant from the root, meaning all leaves are on the same bottom level. The insertion function for 2-3-4 trees always maintains this height balance.

Images/notes/ref_image_10.png

2-3-4 Search:

Tree Search234(Tree t, int item) {
		if (t == NULL) return t;
		int slotIndex = 0;
		while (slotIndex < (t -> type - 1) && item > t -> nodeValues[slotIndex]) {
				// Keep iterating until either we've reached the end of the current node's values or we've bumped into a value greater than the one we want
				slotIndex++;
		}
		if (slotIndex < (t -> type - 1) && item == t -> nodeValues[slotIndex]) {
				return t;  // Item was found in the current node (a node of type k holds k - 1 values, so slotIndex must stay in range)
		} else {
				return Search234(t -> children[slotIndex], item);  // Item was not found in the current node and so must be somewhere in the child at slotIndex
		}
}

Time-complexity: O(logn) — the height of a 2-3-4 tree with n values is between log₄(n) and log₂(n)

Possible struct declaration

struct treeNode {
		int type;                      // Can take the values: 2, 3 or 4
		int nodeValues[3];             // Items held in the node (a 4 node holds a maximum of 3 values)
		struct treeNode *children[4];  // Array of pointers to children treeNodes
};

234Insert

  1. Find where to insert using Search()
  2. If not full, insert into the node’s nodeValues[ ] and promote its type, type++
  3. If full:
    • Push the middle node nodeValues[1] into the parent’s nodeValues array, then split into two 2-nodes, storing nodeValues[0] and nodeValues[1] and inserting the new entry into the correct 2-node
      • If the parent is also full, then the parent must be split as well by the same process
      • If the root node is full, then it must be split by the same process but create a new 2-node as the new root

2-3-4 Insertion: just watch the visualisation (implementation is out of the scope of this course): https://people.ok.ubc.ca/ylucet/DS/BTree.html

// Pseudocode sketch only — a real implementation needs parent links (or splitting full nodes on the way down)
Tree Insert234(Tree t, int item) {
		if (t == NULL) return newTree(item);
		Tree insertionNode = Search234(t, item);    // Leaf node where the item belongs
		Tree parentNode = parent of insertionNode;  // Requires parent links or tracking during the descent

		if (insertionNode -> type < 4) {  // Node is not full
				insert item into insertionNode -> nodeValues in order;
				insertionNode -> type++;
		}
		else {  // Node is full: split it
				int promoteVal = insertionNode -> nodeValues[1];
				Tree newLeaf1 = newTree(insertionNode -> nodeValues[0]);
				Tree newLeaf2 = newTree(insertionNode -> nodeValues[2]);
				if (item < promoteVal) {  // Insert in the left leaf fragment
						Insert234(newLeaf1, item);
				}
				else {                    // Otherwise insert in the right leaf fragment
						Insert234(newLeaf2, item);
				}
				insert promoteVal into parentNode, with newLeaf1 and newLeaf2 as its children;
				while (parentNode -> type == 4) {
						split parentNode by the same process, promoting its middle value upwards;
				}
				if (the root itself is full) {
						split the root by the same process, but make the promoted middle value a new 2-node root of the entire tree;
				}
		}
		return t;
}

Week 5: Graphs I

Graph Data Structures

A graph is a set of Vertices V and a set of edges E (a subset of the Cartesian product V × V)

Graph Theory basics:

Graph Types:

Possible struct declaration:

struct graphRep {
		int numNodes;
		int numEdges;     
		int **edgeMatrix;  // Using an adjacency matrix to store all connections. We could alternatively use an array of edges or an adjacency list
};

Graph Data Structures: A graph is representable as an: array of edges, adjacency matrix and adjacency list. These representations each provide sufficient information to reconstruct the same graph diagrammatically. The choice of which underlying data structure to use for a graph ADT depends on which operations we want to optimise the most.

Array of edges: {(v1, w1), ..., (vn, wn)} — array of 2-tuples. Could also use 3-tuples to store the weight of the edge. Vertex tuples are ordered for directed graphs, to encode direction. In undirected graphs, (1, 2) is equivalent to (2, 1)
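A possible struct for this representation (illustrative; a weight field would be added for the 3-tuple version):

typedef struct {
		int v;  // One endpoint (the source, in a directed graph)
		int w;  // The other endpoint (the destination, in a directed graph)
} Edge;

struct graphRep {
		Edge *edges;   // Array of edges, possibly kept in sorted order
		int numEdges;
		int maxEdges;  // Capacity of the edges array
};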

Time-complexities of functions:

Note: if we maintain an ordered array of edges, then edge lookup can be reduced to O(log(E)) with a binary search algorithm (insertion and deletion still need up to O(E) element shifts)

Space-complexity: O(E)

Adjacency matrix: V × V matrix — where each coordinate in the matrix corresponds to an edge’s existence (and its weight, conveniently). Undirected graphs have a symmetric matrix, while directed graphs usually have a non-symmetric matrix

Time-complexities of functions:

Space-complexity: O(V²)

Adjacency list: Array of linked list heads — for each vertex, store a linked list of its neighbour vertices.
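A possible struct for this representation (illustrative):

struct adjNode {
		int vertex;            // The neighbour's vertex number
		struct adjNode *next;  // Next neighbour in this vertex's list
};

struct graphRep {
		struct adjNode **lists;  // lists[v] is the head of v's linked list of neighbours
		int numNodes;
		int numEdges;
};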

Time-complexities of functions:

Space-complexity: O(V+E) — O(V) for the array size, O(E) for the total size of all the linked lists attached to the array entries

Graph Algorithms:

Graph Traversal:

Two traversal methods are depth-first search and breadth-first search. Both use a bool visited[] array to prevent the algorithm from returning to an already visited node

Depth-First Search: Follows one single path to completion and then considers other paths one step back and so on. Can be implemented recursively or iteratively (using a stack structure)

A skeleton for a recursive DFS algorithm:

// Expects all elements of the visited array to be initialised to false
void depthFirstSearch(Graph g, int start, bool visited[]) {
		visited[start] = true;
		for (int i = 0; i < g -> numVertices; i++) {
				if (isAdjacent(g, start, i) && visited[i] == false) {
						depthFirstSearch(g, i, visited);
				}
		}
}
// Could modify this function to do things like: 
// - print all elements
// - return a bool if a path exists from src to dest, etc.
// - pass a int pred[] array to track predecessor nodes (which lets us then trace and print paths from src to dest)

Time-complexity: O(V+E) — each vertex is accessed, and possibly every edge

Iterative DFS using a stack ADT:

void depthFirstSearch(Graph g, int src) {
		bool visited[g -> numVertices];
		for (int i = 0; i < g -> numVertices; i++) visited[i] = false;
		Stack s = newStack();
		StackPush(s, src);
		while (StackSize(s) > 0) {
				int curr = StackPop(s);
				if (visited[curr]) continue;  // A vertex can be pushed more than once; only process it the first time
				visited[curr] = true;
				for (int i = 0; i < g -> numVertices; i++) {
						if (isAdjacent(g, curr, i) && visited[i] == false) {
								StackPush(s, i);
						}
				}
		}
		StackFree(s);
}

Breadth-First Search: Spreads outwards, like a level-order traversal, only considering adjacent vertices.

A skeleton for iterative BFS: (basically iterative DFS but with a queue ADT)

void breadthFirstSearch(Graph g, int src) {
		bool visited[g -> numVertices];
		for (int i = 0; i < g -> numVertices; i++) visited[i] = false;
		Queue q = QueueNew();
		visited[src] = true;  // Mark vertices when they are enqueued, so nothing is enqueued twice
		QueueEnqueue(q, src);
		while (QueueSize(q) > 0) {
				int curr = QueueDequeue(q);
				for (int i = 0; i < g -> numVertices; i++) {
						if (isAdjacent(g, curr, i) && visited[i] == false) {
								visited[i] = true;
								QueueEnqueue(q, i);
						}
				}
		}
		QueueFree(q);
}
// Could use this function for:
// - tracing paths (using an int pred[] array)
// - finding the shortest path (fewest edges) from a source to a destination 

Time-complexity: O(V+E) — each vertex is visited, and possibly every edge


Graph Properties:

Finding all connected subgraphs: numConnected() returns the number of connected subgraphs the given graph has

int numConnected(Graph g) {
    int subgraphID[GraphNumVertices(g)];
    for (int i = 0; i < GraphNumVertices(g); i++) subgraphID[i] = -1;
    int numConnected = 0;
    for (int i = 0; i < GraphNumVertices(g); i++) {
        if (subgraphID[i] == -1) {  // The current vertex hasn't been marked as belonging to a connected subgraph
            dfsSetConnected(g, i, numConnected, subgraphID);
            numConnected++;
        }
    }
    return numConnected;
}

// Goes through every reachable vertex and marks them with an ID
void dfsSetConnected(Graph g, int start, int id, int subgraphID[]) {
    subgraphID[start] = id;
    for (int i = 0; i < GraphNumVertices(g); i++) {
        if (GraphIsAdjacent(g, start, i) && subgraphID[i] == -1) {
            dfsSetConnected(g, i, id, subgraphID);
        }
    }
}

Determining reachability: Checks if a dest vertex is reachable from a src vertex:

bool reachableDFS(Graph g, int src, int dest, bool visited[]) {
    if (src == dest) return true;  // Reached destination from some DFS path from src
    for (int i = 0; i < GraphNumVertices(g); i++) {
        if (GraphIsAdjacent(g, src, i) && visited[i] == false) {
            visited[i] = true;
            if (reachableDFS(g, i, dest, visited)) {
                return true;
            }
        }
    }
    return false;
}

Eulerian Paths: A path containing every edge exactly once. Allowed to visit vertices more than once

Could implement a brute force algorithm, but we know some important theorems:
Theorem. A graph has an Euler circuit iff it is connected and all vertices have even degree.
Theorem. A graph has an Euler path iff it is connected and exactly two vertices have odd degree.

// Assumes src != dest
bool hasEulerianPath(Graph g, int src, int dest) {
    if (numConnectedSubgraphs(g) > 1) return false;  // Graph must be connected for an Euler path to exist
    if (degree(g, src) % 2 == 1 && degree(g, dest) % 2 == 1) {
        for (int i = 0; i < GraphNumVertices(g); i++) {
            if (i != src && i != dest) {
                if (degree(g, i) % 2 == 1) {
                    return false;  // Return false since there exists some vertex with odd degree which is not the start or dest vertex 
                }
            }
        }
        return true;   
    }
    return false;
}

Time-complexity: O(V) — to check if all other vertices are of odd degree. This becomes O(V²) if getting the degree of a vertex requires looping

Hamiltonian Paths (brute force generate-and-test): Requires brute force algorithm (no simpler algorithm known):

  1. Generate all simple paths from v to w (using dfs for example),
  2. Keep a counter of number of vertices visited for the current path,
  3. Successful path has been found if this counter is equal to the total number of vertices. Otherwise unsuccessful if all simple paths have been exhausted

bool hasHamiltonianPath(Graph g, int src, int dest) {
    bool visited[GraphNumVertices(g)];
    for (int i = 0; i < GraphNumVertices(g); i++) visited[i] = false;
    return hamiltonR(g, src, dest, GraphNumVertices(g) - 1, visited);
}

bool hamiltonR(Graph g, int src, int dest, int dist, bool visited[]) {
    if (src == dest) {
        if (dist == 0) return true;
        else return false;
    }    
    else {
        visited[src] = true;
        for (int i = 0; i < GraphNumVertices(g); i++) {
            if (GraphIsAdjacent(g, src, i) && visited[i] == false) {
                if (hamiltonR(g, i, dest, dist - 1, visited)) {
                    return true;
                }
            }
        }
    }
    visited[src] = false;  // Unmark this vertex so it can still be used in the generation of other possible paths
    return false;   
}

Time-complexity: O((V-1)!) — since there could be (V-1)! paths to examine

Week 6: Graphs II

More on Graph Representation:

Directed Graphs:

Directed graphs can be represented with the same basic data structures, like an:

Transitive Closure Matrix and Reachability: The transitive closure matrix records reachability and directed paths between vertices. With a matrix that records reachability, we can quickly access indexes and if tcm[start][dest] == 1, then we know that dest is reachable from start.

int **transitiveClosureMatrix(Graph g) {
		// Setting up the matrix (needs to copy the graph's adjacency matrix)
		int **tcm = malloc(sizeof(int *) * g -> numNodes);
		for (int i = 0; i < g -> numNodes; i++) {
				tcm[i] = malloc(sizeof(int) * g -> numNodes);
				for (int j = 0; j < g -> numNodes; j++) {
						tcm[i][j] = g -> edges[i][j];  // Copies the elements of the graph's adjacency matrix
				}
		}
		// Warshall's algorithm:
		for (int i = 0; i < g -> numNodes; i++) {
				for (int j = 0; j < g -> numNodes; j++) {
						for (int k = 0; k < g -> numNodes; k++) {
								if (tcm[j][i] && tcm[i][k]) {  // if j -> i exists and i -> k exists, then j -> k must exist
										tcm[j][k] = 1;  
								}
						}
				}
		}
		return tcm;
}  // Very expensive to construct and store

Time-complexity: O(V³)
Space-complexity: O(V²)

Weighted Graphs:

Weighted graph representation: Weighted graphs can be represented with the same basic data structures, like an:

Possible struct declaration for a weighted edge:

typedef struct Edge {
		Vertex v;
		Vertex w;
		int weight;
} Edge;

Representation with an adjacency matrix:

typedef struct GraphRep {
		int **edges;  // Adjacency matrix storing integer weights
		int nV;
		int nE;
} GraphRep;

void insertEdge(GraphRep *g, Edge e) {
		if (g -> edges[e.v][e.w] == 0) {  // Then there is no edge recorded between v and w
				g -> edges[e.v][e.w] = e.weight;
				g -> edges[e.w][e.v] = e.weight;
				g -> nE++;
		}
}

Graph Spanning Tree Algorithms:

Kruskal’s Algorithm: We could find minimum spanning trees via a brute-force generate-and-test algorithm which examines every spanning tree and returns the one with the best cost. Or we can apply Kruskal’s algorithm: start with all vertices and no edges → sort all edges in ascending order of weight → add edges from smallest weight upwards, skipping edges that introduce a cycle and stopping once we achieve a connected graph

Graph kruskal(Graph g) {
		Graph MST = newGraph();
		// 1. Get an ascending list of the cheapest weights
		Edge edges[g -> numEdges];
		copyEdges(edges, g);  // Copy the edges over
		sortAscending(edges);
		// 2. Add them one by one, each time checking if we've introduced a cycle or if we're done
		int edgesAdded = 0;
		for (int i = 0; i < g -> numEdges; i++) {
				insertEdge(MST, edges[i]);
				if (hasCycle(MST)) {
						removeEdge(MST, edges[i]);
				} else {
						edgesAdded++;
				}
				if (edgesAdded == g -> numNodes - 1) {
						break;  // A spanning tree on V vertices always has exactly V - 1 edges
				}
		}
		return MST;
}

Time-complexity: hard to estimate — the edge sorting could be O(E²) or O(ElogE) and the cycle checking could be O(V+E) if we use a DFS cycle checking algorithm

Prim’s algorithm:

  1. Start from any vertex src and with an empty MST
  2. Consider all neighbours of all vertices in the current MST which won’t introduce cycles. Pick the cheapest edge of the considered edges. Repeat until all vertices have been added

Note: a priority queue helps this algorithm select the best edge. For a priority queue, the first element dequeued is the one with the highest priority value, so in this case, the lowest weight corresponds to the highest priority
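A minimal array-based sketch of Prim's algorithm, assuming the adjacency-matrix representation (g -> edges[v][w] holds the weight, 0 meaning no edge, with g -> numNodes vertices) and a connected graph; the inner linear scan is exactly what a priority queue would replace:

#include <limits.h>
#include <stdbool.h>

// Returns the total weight of the MST (track a pred[] array as well to recover the actual edges)
int prim(Graph g, int src) {
		int n = g -> numNodes;
		int dist[n];    // Cheapest known edge weight connecting each vertex to the current MST
		bool inMST[n];
		for (int v = 0; v < n; v++) { dist[v] = INT_MAX; inMST[v] = false; }
		dist[src] = 0;
		int total = 0;
		for (int count = 0; count < n; count++) {
				int s = -1;
				for (int v = 0; v < n; v++)  // Linear scan stands in for the priority queue
						if (!inMST[v] && (s == -1 || dist[v] < dist[s])) s = v;
				inMST[s] = true;
				total += dist[s];
				for (int t = 0; t < n; t++)  // A new vertex joined: update the cheapest connections
						if (!inMST[t] && g -> edges[s][t] != 0 && g -> edges[s][t] < dist[t])
								dist[t] = g -> edges[s][t];
		}
		return total;
}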

Dijkstra’s Algorithm: Uses a dist[] array to keep track of shortest distances recorded and uses a pred[] array so we can trace the actual shortest path

Edge relaxation: a function that updates the shortest distance to a certain dest node. Pretty much just: if (dist[v] + weight < dist[w]) { dist[w] = dist[v] + weight; pred[w] = v; }


def dijkstra(G, start):
		set all dist to ∞, set dist[start] = 0
		set all pred to -1
		vertexSet = all vertices of G
		while vertexSet not empty:
				find the vertex s in vertexSet with minimum dist[s]   # We can use a priority queue here to achieve O(E + Vlog(V))
				for each edge (s, t, w) of G:
						relax along (s, t, w)
				vertexSet = vertexSet \ {s}  # Means remove s from vertexSet

Time-complexity: O(E + V²) — each edge is relaxed once, and the outer loop runs O(V) times, each iteration scanning O(V) vertices for the minimum dist. The best known implementation, using a priority queue, achieves O(E + Vlog(V))
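Under the same assumptions as the Prim's sketch above (adjacency matrix of weights, 0 meaning no edge), a minimal O(V²) array-based version might look like:

#include <limits.h>
#include <stdbool.h>

// Fills the caller's dist[] and pred[] arrays with shortest distances/predecessors from src
void dijkstra(Graph g, int src, int dist[], int pred[]) {
		int n = g -> numNodes;
		bool done[n];
		for (int v = 0; v < n; v++) { dist[v] = INT_MAX; pred[v] = -1; done[v] = false; }
		dist[src] = 0;
		for (int count = 0; count < n; count++) {
				int s = -1;
				for (int v = 0; v < n; v++)  // Pick the unfinished vertex with minimum dist
						if (!done[v] && (s == -1 || dist[v] < dist[s])) s = v;
				if (dist[s] == INT_MAX) break;  // Everything still unfinished is unreachable
				done[s] = true;
				for (int t = 0; t < n; t++) {   // Relax every edge out of s
						int w = g -> edges[s][t];
						if (w != 0 && dist[s] + w < dist[t]) {
								dist[t] = dist[s] + w;
								pred[t] = s;
						}
				}
		}
}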

Week 7: Heaps and Hashing

Heaps:

Suppose our current node has index i: its parent is at index i/2 (integer division), its left child at index 2i, and its right child at index 2i + 1.

Heap indexing: Goes left to right, top to bottom across each level. The next free index for insertion is the lowest, left-most unoccupied leaf node. Heap indexing actually starts at 1, not 0.

Images/notes/ref_image_11.png

Possible struct declaration:

typedef struct heapRep *Heap;
struct heapRep {
		int *nodes;
		int currSize;
		int maxSize;
};

Possible implementation:

Heap newHeap(int size) {
		Heap heap = malloc(sizeof (struct heapRep));
		int *items = malloc(sizeof(int) * (size + 1));  // size + 1, so our index range becomes 1 -> N, not 0 -> (N-1). This is because heaps are indexed starting from 1
		heap -> nodes = items;
		heap -> currSize = 0;
		heap -> maxSize = size;
		return heap;
}

// Inserts at the next free array slot, then bubbles it up to the right position in the heap by a series of swaps with the parent node
void insertInHeap(Heap heap, int value) {
		heap -> currSize++;
		int i = heap -> currSize;
		heap -> nodes[i] = value;
		while (i > 1 && heap -> nodes[i] > heap -> nodes[i / 2]) {  // If the child is greater than the parent, swap them (since greater values are at the top in this max-heap)
				int tmp = heap -> nodes[i / 2];
				heap -> nodes[i / 2] = heap -> nodes[i];
				heap -> nodes[i] = tmp;
				i /= 2;
		}
}

// Pops the root node, replaces the root node with the last indexed node and bubbles it down to the right place
int popHeap(Heap heap) {
		int top = heap -> nodes[1];
		heap -> nodes[1] = heap -> nodes[heap -> currSize];  // The last node overwrites the root
		heap -> currSize--;
		int i = 1;
		while (2 * i <= heap -> currSize) {
				int j = 2 * i;
				if (j < heap -> currSize && heap -> nodes[j] < heap -> nodes[j + 1]) j++;  // Pick the larger child for swapping
				if (heap -> nodes[i] >= heap -> nodes[j]) break;  // Heap property restored, stop bubbling down
				int tmp = heap -> nodes[i];
				heap -> nodes[i] = heap -> nodes[j];
				heap -> nodes[j] = tmp;
				i = j;
		}
		return top;
}

Time-complexities: insertion is O(logn) (one bubble-up along the tree height), deletion is O(logn) (one bubble-down), and peeking at the root is O(1)

Hashing:

A hash table associates a key with a hash function value, which corresponds to a specific index in an array-like data structure. This allows for very quick key-value lookups which would otherwise take O(n) or O(logn) with an array search algorithm. Hashing is used widely in database systems.

Problems with hashing:

Possible struct declaration:

#define EMPTY_SLOT -1

typedef struct hashTableRep *HashTable;
struct hashTableRep {
		Item **items;  // Array of Item pointers, where the 'Item' data type could be a struct containing some fields
		int currSize;  // Current number of entries in the hash table
		int N;         // Array size. The hash function should map keys to a value within [0, N - 1]
};

Basic function prototypes:

typedef struct HashTabRep *HashTable;
HashTable newHashTable(int size);             
Item *search(HashTable hashtab, Item key);          // Find item with specified key
void insert(HashTable hashtab, Item item);          // Add a new item to the hash table
void delete(HashTable hashtab, Item target);        // Delete a target entry from the hash table

Hash Functions: Very basic hash function for string keys:

int hash(char *key, int N) {  // The 'Key' type in this example is a string
		int val = strlen(key);
		return val % N;  // Use mod function to map any integer to another integer within the range [0, N-1]
}

There are a ridiculous number of deficiencies with this hash function. A better hash function would introduce some randomness to make the output a bit more unique so as to minimise collisions.

‘Universal’ hashing: Tends to return a more unique hash function value, so the output is more reasonably distributed across the indexes in [0, N-1], therefore minimising collisions

int hash(char *key, int N) {
		int a = 31415;  // Prime numbers. These are better for introducing unpredictability than non-primes
		int b = 21783;
		int val = 0;
		char *c;
		for (c = key; *c != '\0'; c++) {
				a = (a * b) % (N - 1);
				val = (a * val + *c) % N;  // Mix in each character, keeping the running value within [0, N-1]
		}
		return val;
}

Collision Handling:

Separate Chaining: The hash table becomes an array of linked list heads which stores all the entries that have the same function value

Possible struct declaration to set up separate chaining:

typedef struct HashTabRep {
   List *lists;    // array of Lists of Items
   int  currSize;  // Number of items stored in HashTable
   int  N;         // Slots in the array
} HashTabRep;

Time-complexity: O(M) — where M is the number of entries. All entries could be stuffed into a single linked list at the same hash table index
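A sketch of how search works under chaining, assuming a hypothetical ListSearch helper that linearly scans a linked list for a matching item:

// Hash to a slot, then search only that slot's linked list
Item *search(HashTabRep *hashtab, Item key) {
		int i = hash(key, hashtab -> N);              // The list the item must be in, if it exists
		return ListSearch(hashtab -> lists[i], key);  // Linear scan of one (hopefully short) list
}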

Linear probing: If the hash table index is occupied, just perform a linear scan onwards for the next free slot

void insert(HashTable hashtab, Item item) {
		int N = hashtab -> N;
		int insertIndex = 0;
		int k = hash(item, N);
		for (int i = 0; i < N; i++) {
				insertIndex = (k + i) % N;   // Search starting from the hash function value index, wrapping around the array if necessary
				if (hashtab -> items[insertIndex] == EMPTY_SLOT) break;  // Free slot found
				if (hashtab -> items[insertIndex] == item) break;        // Item already exists
		}
		if (hashtab -> items[insertIndex] == EMPTY_SLOT) {
				hashtab -> currSize++;
				hashtab -> items[insertIndex] = item;
		}
}

Time-complexity: O(M) — the search and delete functions land on the hash index and then perform a linear scan until the target is found.
Note: the search() and delete() functions stop when they reach an EMPTY_SLOT that has never been in use. There could be EMPTY_SLOTs between the landing index and the target if entries between them were deleted at some point after the target was inserted. This means we need some way of marking a slot as empty but previously used, so that it doesn’t signal our scanning to stop.

Quadratic probing: Like linear probing, but instead of searching in jumps of 1, make jumps of 1, 4, 9, etc.
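A sketch of how insert changes under quadratic probing, mirroring the linear probing version above (note that, unlike linear probing, the quadratic probe sequence isn't guaranteed to visit every slot, so the table size is usually kept prime and the table under half full):

void insertQuadratic(HashTable hashtab, Item item) {
		int N = hashtab -> N;
		int insertIndex = 0;
		int k = hash(item, N);
		for (int i = 0; i < N; i++) {
				insertIndex = (k + i * i) % N;  // Probe in jumps of 0, 1, 4, 9, ... from the home slot
				if (hashtab -> items[insertIndex] == EMPTY_SLOT) break;  // Free slot found
				if (hashtab -> items[insertIndex] == item) break;        // Item already exists
		}
		if (hashtab -> items[insertIndex] == EMPTY_SLOT) {
				hashtab -> currSize++;
				hashtab -> items[insertIndex] = item;
		}
}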

Double hashing: Like quadratic probing, but makes a more calculated jump which is determined by a second hash function

typedef struct hashTableRep *HashTable;
struct hashTableRep {
		Item *items;
		int currSize;
		int N;         // Array size
		int N2;        // A modulo operand we can use for our second hash function. This should be set in the newHashTable function
};

void insert(HashTable hashtab, Item item) {
		int N = hashtab -> N;
		int insertIndex = 0;
		int k = hash(item, N);
		int jump = hash2(item, hashtab -> N2);
		for (int i = 0; i < N; i++) {  // Increment by the hash2 function value on each probe
				insertIndex = (k + i * jump) % N;
				if (hashtab -> items[insertIndex] == EMPTY_SLOT) break;
				if (hashtab -> items[insertIndex] == item) break;
		}
		if (hashtab -> items[insertIndex] == EMPTY_SLOT) {
				hashtab -> currSize++;
				hashtab -> items[insertIndex] = item;
		}
}

The second hash function hash2: key → [1, N2 - 1]

The jump size must ensure that all slots are visitable. We can pick a jump that is coprime with N to guarantee this, so when creating the table it’s smart to pick a prime number size, which makes any hash2() value suitable to use for probing.


Performance of insertion, search and deletion when there are collisions:

Week 8: Sorting Algorithms

Sorting:

Sorting involves arranging a collection of items in order, based on some property of the items. This is an important problem because sorted collections allow better search algorithms to be implemented as well as other algorithms like duplicate detection, etc.


Comparison Sorting Algorithms:

Selection sort: Logic: find the minimum, swap it into the zeroth index of the array, then find the next smallest and place it in the first index, and so on

// Sorts to ascending order
void selectionSort(int input[], int lo, int hi) {
    for (int i = lo; i < hi - 1; i++) {
        int min = i;
        for (int j = i + 1; j < hi; j++) {  // Start scanning the unsorted segment for the minimum value
            if (input[j] < input[min]) {
                min = j;
            }
        }
        swap(input, i, min);
    }
}

Time-complexity: O(n²) — max total comparisons given by: (n-1) + (n-2) + ... + 1 = n(n-1)/2, total swaps given by: n-1. In both best and worst cases, we’re still making the same number of comparisons to identify the minimum each iteration

Images/notes/ref_image_12.png

Swap function:

void swap (int input[], int i, int j) {
    int tmp = input[j];
    input[j] = input[i];
    input[i] = tmp;
}

Bubble sort: Logic: look at adjacent items and swap them if out of order, bubbling items along. When the element being bubbled encounters an element it should swap with, that other element becomes the one being bubbled. Doing this n times across the whole array results in a sorted array

// Sorts to ascending order
void bubbleSort(int input[], int lo, int hi) {
    for (int i = lo; i < hi; i++) {
        int swaps = 0;
        for (int j = hi - 1; j > i; j--) {
            if (input[j] < input[j - 1]) {
                swap(input, j, j - 1);
                swaps++;
            }
        }
        if (swaps == 0) break;  // Since the array was scanned and no swaps were made, the array must be sorted, so break
    }
}

Time-complexity: O(n²)
Best case: the array is already sorted, so just 1 + 1 + … + 1 = n comparisons
Worst case: the array is reverse sorted, so (n-1) + (n-2) + … + 1 = n(n-1)/2 comparisons

Images/notes/ref_image_13.png

Insertion sort: Logic: on each iteration, consider the element right next to the sorted segment. Bubble that element to the right place within the sorted array by comparing with elements in the already sorted segment.

// Sorts to ascending order
void insertionSort(int input[], int lo, int hi) {
    for (int i = lo; i < hi; i++) {
        for (int j = lo; j < i; j++) {
            if (input[i] < input[j]) {
                swap(input, i, j);  // Bubble it to the right place in the sorted segment
            }
        }
    }
}

Time-complexity: O(n²)
Best case: the array is already sorted, so we just have 1 + 1 + … + 1 = n comparisons
Worst case: the array is reverse sorted, so we have a growing number of comparisons to find where to insert the new element in the sorted segment (same as for selection sort: n(n-1)/2)

Images/notes/ref_image_14.png

Shell sort: Also known as h-sorting, an improvement on insertion sort.
A 3-sorted array means that the elements array[0], array[3], array[6], … are in order. A 2-sorted array means that the elements array[0], array[2], array[4], … are in order. A 1-sorted array is simply a sorted array.
The algorithm uses a characteristic sequence of h-values. Such sequences are determined empirically, for example 1093, 364, 121, 40, 13, 4, 1, or something else

// Sorts to ascending order
void shellSort(int a[], int lo, int hi) {
		int hValues[8] = {701, 301, 132, 57, 23, 10, 4, 1};
		for (int i = 0; i < 8; i++) {
				int h = hValues[i];
				int start = lo + h;
				for (int n = start; n < hi; n++) {  // Insertion sort, but comparing elements h apart
						int value = a[n];
						int k = n;
						while (k >= start && value < a[k - h]) {
								a[k] = a[k - h];  // Shift the larger element h slots up
								k -= h;
						}
						a[k] = value;
				}
		}
}

Efficiency depends on the h-values used, which we can achieve O(n^(3/2)) with, or even better with the right sequence. Shell sort has not yet been fully analysed.

Quick sort (recursive):

Logic: pick a pivot element, then sort the array such that everything to the left of that pivot is less than it while everything to the right is greater than it, and then start sorting left and right by the same function.

// Sorts to ascending order
void quickSort(int input[], int lo, int hi) {
		if (lo >= hi) return;              // Terminating case
		int i = partition(input, lo, hi);  // Partition returns the index of where the pivot element input[lo] belongs 
		quickSort(input, lo, i - 1);       // Sort the unsorted left segment
		quickSort(input, i + 1, hi);       // Sort the unsorted right segment
}

In the partition function, we are finding the right spot to place the pivot element and swapping elements from left and right segments so that everything to the left of the pivot is less than it and everything to the right is greater than it

  1. Scan from the left, if the pivot encounters something greater than it, stop
  2. Scan from right, if the pivot encounters something less than it, stop
  3. Swap the elements which caused scanning to stop
  4. Continue the scanning from left, then scanning from right and swapping.
  5. Stop when the left and right scanners cross each other
  6. Swap the pivot element with where this intersection occurred
// Sorts to ascending order
int partition(int input[], int lo, int hi) {
    int pivotIndex = lo;  // Choose the first element to be the pivot element. We could make smarter choices
    int i = lo + 1;       // Left scanner
    int j = hi;           // Right scanner
    while (1) {
        while (input[i] < input[pivotIndex] && i < j) {  // Pause when an item in the left segment was found to be larger than the pivot value
            i++;
        }
        while (input[j] > input[pivotIndex] && i < j) {  // Pause when an item in the right segment was found to be smaller than the pivot value
            j--;
        }
        if (i == j) break;      // Stop once i and j cross over
        swap(input, i, j);      // Swap and then resume scanning from where we left off
        i++;                    // Step past the swapped pair so equal elements can't stall the scan
        if (j > i) j--;
    }
    j = (input[i] < input[pivotIndex]) ? i : i - 1;  // If the element where i stopped is smaller than the pivot value, set j to i, otherwise set it to i - 1
    swap(input, pivotIndex, j);  // Swap the pivot element at input[lo] to where it should be -- at index j where the scanners intersected
    return j;  // The pivot is now at j, where it belongs
}

Time-Complexity: O(n²) — but the average case is usually O(nlogn)
Best case: the choice of pivot leads to equal partitions at each recursive step, so there are log(n) levels, each level requiring n comparisons, hence O(nlogn)
Worst case: the choice of pivot leads to partitions of sizes 1 and (n-1) at each recursive step, so there are n levels, each requiring up to n comparisons, hence O(n²). Eg. for sorted arrays, we make n recursive calls, forming 1 and (n-1) partitions each time

Partitioning steps:

Images/notes/ref_image_15.png

Improving partition:

Quicksort can be implemented iteratively using a stack:

void quickSort(int input[], int lo, int hi) {
		Stack s = newStack();
		StackPush(s, hi);
		StackPush(s, lo);
		while (StackSize(s) > 0) {
				int low = StackPop(s);
				int high = StackPop(s);
				if (high > low) {
						int i = partition(input, low, high);
						StackPush(s, high);   // Right segment: i + 1 .. high
						StackPush(s, i + 1);
						StackPush(s, i - 1);  // Left segment: low .. i - 1
						StackPush(s, low);
				}
		}
		StackFree(s);
}

Merge sort (recursive): Logic: split the array into two equal partitions, sort each partition, then merge them into a final sorted array.

// Sorts to ascending order
void mergeSort(int input[], int lo, int hi) {
    if (lo >= hi) return;
    int mid = (lo + hi) / 2;
    mergeSort(input, lo, mid);
    mergeSort(input, mid + 1, hi);
    merge(input, lo, mid, hi);
}

void merge(int a[], int lo, int mid, int hi) {
   int i = lo;
   int j = mid + 1;
   int k = 0;   // Indexer for temporary array
   int tmp[hi - lo + 1];  // Temporary array to store sorted values
   
   while (i <= mid && j <= hi) {  // Scan through both segments, choosing the one that's less than the other
     if (a[i] < a[j])
        tmp[k++] = a[i++];        // Move the index of the left segment along
     else
        tmp[k++] = a[j++];        // Move the index of the right segment along
   }
   
   // One of the segments will still have elements that need to be copied over
   while (i <= mid) tmp[k++] = a[i++];
   while (j <= hi) tmp[k++] = a[j++];

   // Copy the sorted dummy array into the main array
   for (i = lo, k = 0; i <= hi; i++, k++)
      a[i] = tmp[k];
}

Time-complexity: O(nlogn)
Best case: each level requires fewer than n comparisons when merging — O(nlogn)
Worst case: each level requires comparisons made all the way to the end of both partitions — O(nlogn)

Space-complexity: O(N)

Images/notes/ref_image_16.png

Mergesort can be implemented non-recursively, without a stack.
Bottom-up Mergesort (iterative):

void mergeSort(int a[], int lo, int hi) {
		int end;
		for (int m = 1; m <= hi - lo; m = m * 2) {  // Run lengths are: 1, 2, 4, ... until the run length is greater than the array size
				for (int i = lo; i <= hi - m; i += 2 * m) {
						end = min(i + 2 * m - 1, hi);  // End is the endpoint of the two runs. Capping end at hi means we never run over the array size
						merge(a, i, i + m - 1, end);   // i + m - 1 is the end of the first run (the 'mid' argument to merge)
				}
		}
}

Sorting lower bound: any comparison-based sorting algorithm must take at least Ω(nlogn) time to sort n elements.
Proof. Suppose we have a sequence of n unique values. There are n! permutations, of which exactly one is ordered. A binary decision tree over these permutations has height at least log₂(n!), so at least log₂(n!) comparisons are required, and using standard inequalities log₂(n!) can be shown to be Ω(nlogn). Hence Ω(nlogn) is the lower bound for all comparison-based sorting algorithms.

Images/notes/ref_image_17.png


Non-Comparison based sorting

Radix Sort: Logic: sort the array by the last digit, then by the next and so on until the largest radix is sorted. At this point, the array is sorted. This works only if the implementation is stable

Time-complexity: O(mn) — where m is the number of radix points. This generally outperforms comparison based sorting algorithms when n → ∞
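A sketch of LSD radix sort on non-negative ints, using a stable counting sort for each decimal digit (the stability requirement mentioned above):

#include <string.h>

void radixSort(int a[], int n) {
		int max = a[0];
		for (int i = 1; i < n; i++)
				if (a[i] > max) max = a[i];  // The largest value determines how many digit passes we need
		int tmp[n];
		for (int exp = 1; max / exp > 0; exp *= 10) {  // One pass per decimal digit, least significant first
				int count[10] = {0};
				for (int i = 0; i < n; i++) count[(a[i] / exp) % 10]++;  // Histogram of this digit
				for (int d = 1; d < 10; d++) count[d] += count[d - 1];   // Prefix sums give each digit's final region
				for (int i = n - 1; i >= 0; i--)                         // Traverse backwards to keep the sort stable
						tmp[--count[(a[i] / exp) % 10]] = a[i];
				memcpy(a, tmp, n * sizeof(int));
		}
}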

Week 9: Text Processing Algorithms

Notes:

Boyer-Moore Algorithm:

Logic: compare the last character of the pattern to the corresponding character in the text. If mismatch, then check if that mismatching character from the text appears in the pattern. If it does, then jump forward a calculated amount. If it doesn’t then jump by strlen(pattern) characters.

The Boyer-Moore algorithm requires some preprocessing using the last occurrence function: L : Σ → {-1, 0, 1, …, m-1}, mapping each character of the alphabet Σ to the index of its last occurrence in the pattern (or -1 if it doesn’t occur)
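A sketch of that preprocessing step, computing L as a lookup table indexed by character (assuming a 256-character alphabet; the function name is illustrative):

// lastOcc[c] becomes the highest index where c appears in the pattern, or -1
void computeLastOccurrence(char *pattern, int lastOcc[256]) {
		for (int c = 0; c < 256; c++) lastOcc[c] = -1;
		for (int j = 0; pattern[j] != '\0'; j++)
				lastOcc[(unsigned char) pattern[j]] = j;  // Later occurrences overwrite earlier ones
}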

Boyer-Moore algorithm:

int BoyerMooreMatch(char *text, char *pattern, char *alphabet) {
		int m = strlen(pattern);
		int n = strlen(text);
		int i = m - 1;  // Index into the text
		int j = m - 1;  // Index into the pattern
		while (i < n) {
				if (text[i] == pattern[j]) {
						if (j == 0) {
								return i;  // The pattern survived all character matches, so return the index of the main text where the pattern was found
						} else {
								i--;  // Decrement to match the next pair of characters
								j--;
						}
				} else {
						i += m - min(j, 1 + L(text[i]));  // If the last occurrence of text[i] is AFTER j, then jump (m - j) characters. Else, if the last occurrence of text[i] is before j, then jump (m - (1 + L(text[i]))) -- this aligns the character in 'pattern' with text[i]
						j = m - 1;
				}
		}
		return NO_RESULTS;
}

Time-complexity: O(nm + s)
Worst case: finding pattern “baaa” in “aaaaaaaa…aaa”

Case 1: text[i] does not occur in the pattern string (so L(text[i]) is -1). Jump m - (1 + L(text[i])), which is the biggest jump possible — m characters

Images/notes/ref_image_18.png

Case 2: text[i]’s last occurrence in pattern is after j, so jump m - j characters

Images/notes/ref_image_19.png

Case 3: text[i]’s last occurrence in pattern is before j, so jump m - (1 + L(text[i])) characters The mismatched text[i] is aligned with the same character found before j in the pattern

Images/notes/ref_image_20.png

Knuth-Morris-Pratt Algorithm:

Failure function: KMP also requires preprocessing on the pattern to find if prefixes occur in the rest of the pattern.

For each index j, find the longest prefix from 0 to j that matches the suffix in [1, j], up to and including index j, then record the function value as the number of characters in the prefix.

Images/notes/ref_image_21.png

For j = 0: “a” — no prefix matches suffix — record F(0) = 0
For j = 1: “ab” — no prefix matches suffix — record F(1) = 0
For j = 2: “aba” — “a” works — record F(2) = 1
For j = 3: “abaa” — “a” works — record F(3) = 1
For j = 4: “abaab” — “ab” works — record F(4) = 2
For j = 5: “abaaba” — “aba” works — record F(5) = 3

Time-complexity: O(m)
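A sketch of the failure function computation, with a signature matching the failure(pattern, F, m) call in the KMPMatch code below:

// F[j] = length of the longest proper prefix of pattern[0..j] that is also a suffix of it
void failure(char *pattern, int F[], int m) {
		F[0] = 0;
		int len = 0;  // Length of the current matched prefix
		for (int j = 1; j < m; j++) {
				while (len > 0 && pattern[j] != pattern[len])
						len = F[len - 1];  // Fall back to the next shorter candidate prefix
				if (pattern[j] == pattern[len]) len++;
				F[j] = len;
		}
}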

Knuth-Morris-Pratt algorithm: KMP scans from left to right, unlike Boyer-Moore.

int KMPMatch(char *text, char *pattern) {
		int n = strlen(text);
		int m = strlen(pattern);
		int i = 0;  // Scans from left to right
		int j = 0;
		int F[m];
		failure(pattern, F, m);  // Fill up array F with failure function values
		while (i < n) {
				if (text[i] == pattern[j]) {
						if (j == m - 1) {
								return (i - j);  // Pattern found, return the index at the start of the successful match
						} else {
								i++;
								j++;
						}
				} else {
						if (j > 0) {
								j = F[j - 1];  // On failure, resume comparing the pattern starting from F(j - 1)
						} else {
								i++;
						}
				}
		}
		return NO_MATCHES;
}

Time-complexity: O(m + n) — O(m) for the failure function preprocessing and O(n) for scanning through the whole text

On mismatch, reset j to F(j - 1) if j > 0 and continue scanning. If j == 0, then just increment i and continue scanning

Images/notes/ref_image_22.png

Tries:

A trie is a tree which holds a set of strings, like all the words in a text or a dictionary, and handles two main operations: search() and insert(). Trie’s name comes from retrieval.

Boyer-Moore and Knuth-Morris-Pratt’s algorithms preprocessed their pattern string. If the text is large, immutable and searched for often, like Hamlet, then we can preprocess the main text instead, allowing for quick pattern queries using the trie data structure.

The depth of a trie is equal to the length of the longest string stored in it.

Each node:

The number of words stored = the number of terminating nodes

Possible struct declaration:

#define ALPHABET_SIZE 26
typedef struct trieNode *Trie;
struct trieNode {
		bool isLastChar;  // If this is true, then a complete word has been formed when this node is reached
		char character;   // Current node's character
		Trie *children;   // An array of size ALPHABET_SIZE    
};

// Returns a pointer to the node where the target string was found
Trie findInTrie(Trie trie, char *target) {
		char *ch;
		Trie curr = trie;
		for (ch = target; *ch != '\0'; ch++) {
				if (curr -> children[*ch - 'a']) {
						curr = curr -> children[*ch - 'a'];  // Move down the trie, following the target string's characters
				} else {
						return NULL;  // Character node doesn't exist
				}
		}
		if (curr -> isLastChar == true) {
				return curr;
		} else {
				return NULL;  // The characters were found, but the last wasn't marked as a finishing node. So the word wasn't actually recorded
		}
}

// Inserts a word into the trie, creating nodes as needed, and returns the (possibly new) root
Trie insertInTrie(Trie trie, char *newWord) {
		if (trie == NULL) {
				trie = calloc(1, sizeof(struct trieNode));               // calloc zeroes isLastChar and character
				trie -> children = calloc(ALPHABET_SIZE, sizeof(Trie));  // All child pointers start as NULL
		}
		if (newWord[0] == '\0') {
				trie -> isLastChar = true;  // The whole word has been consumed, so mark the complete word
		} else {
				int index = newWord[0] - 'a';
				trie -> children[index] = insertInTrie(trie -> children[index], &newWord[1]);  // Pass a string one character shorter to the next insert call
				trie -> children[index] -> character = newWord[0];
		}
		return trie;
}

Time-complexity: O(sm) — where m is strlen(target string) and s is the size of the alphabet. The time-complexity is independent of how many actual entries there are in the tree. Space-complexity: O(n) — size of text (eg. total number of characters in each word in a dictionary)


Compressed Tries:

Compressing a trie involves collapsing linear branches into a single node:

Images/notes/ref_image_23.png

Using 3-tuples, (i, j, k), we can encode a slice of strings for collapsed linear branches. So (8, 2, 3) navigates to the 8th word and then takes the slice of S[8] from index 2 to 3

Images/notes/ref_image_24.png

Compressed Suffix Tries — For pattern matching:

Insert all suffixes of a given word as separate entries into a trie and then compress it.

For example, for “minimize”: Insert all suffixes: “e”, “ze”, “ize”, …, “minimize” into a regular trie, then compress.

Images/notes/ref_image_25.png

2-tuples representing slices of the text, “minimize”

Images/notes/ref_image_26.png

For constructing the suffix trie: Time-complexity: O(n), Space-complexity: O(n)

Pattern matching: Time-complexity: O(ms) — where m is the length of the pattern and s is the size of the alphabet

Text Compression — Huffman’s Algorithm:

Text compression revolves around the problem: “can we efficiently encode a string X with a smaller string Y?”

Lossy compression: a class of irreversible data compression methods using approximations and partial data discarding to represent the content, causing a loss of quality/resolution.

Huffman’s algorithm: Huffman’s algorithm provides a lossless text compression system. It defines bit-codes for characters such that each bit code is not a prefix for any other bit code, which is crucial for navigating the Huffman encoding tree. Higher frequency characters are assigned shorter bit codes.

  1. Go through entire text and calculate the frequency of every character
  2. Build the encoding tree: → Initially have single nodes representing each character in the text → Merge the two lowest frequency characters into a single cluster → Repeat the process until a full tree is created

    Images/notes/ref_image_27.png

The path to a leaf node corresponds to a bit-code

Images/notes/ref_image_28.png

Tree createTree(char *text) {
		int frequencies[ALPHABET_SIZE];
		computeFrequencies(frequencies, text);
		PQueue pqueue = newPQueue();
		for (char c = 'a'; c <= 'z'; c++) {
				if (frequencies[c - 'a'] != 0) {
						Tree t = malloc(sizeof(struct treeNode));
						t -> character = c;
						t -> freq = frequencies[c - 'a'];
						t -> left = NULL;
						t -> right = NULL;
						PQueueJoin(pqueue, t);
				}
		}
		while (PQueueSize(pqueue) >= 2) {
				Tree t1 = PQueueLeave(pqueue);  // t1 has the lowest frequency (highest priority)
				Tree t2 = PQueueLeave(pqueue);  // t2 has the next lowest frequency
				Tree t = malloc(sizeof(struct treeNode));  // Create a new tree which has t1 and t2 as subtrees and contains the sum of both their frequencies
				t -> freq = t1 -> freq + t2 -> freq;
				t -> left = t1;
				t -> right = t2;
				PQueueJoin(pqueue, t);  // Join this cluster back into the queue of nodes to merge
		}
		return PQueueLeave(pqueue);  // The last element left in the queue is the root node of the completed encoding tree
}

Time-complexity: O(n + slogs) — for computing frequencies, creating nodes for each occurring character and building the encoding tree, where s is number of unique characters


C Extras:

Compilation and Makefiles:

The separation between .h and .c files is crucial and will be used extensively in implementing abstract data types

Images/notes/ref_image_29.png

gcc -c Stack.c                # Produces Stack.o from Stack.c and Stack.h
gcc -c bracket.c              # Produces bracket.o from bracket.c and bracket.h
gcc -o rbt bracket.o Stack.o  # Links bracket.o, Stack.o and libraries to produce the executable named 'rbt'.
                              # If the executable name isn't specified then by default the name is a.out.
                              # The libraries are included implicitly

Makefile: Allows us to see dependencies in our code and conveniently compile using the make command

target : source1 source2 …  
        commands to build target from sources

# Example:
game : main.o graphics.o world.o  # These sources are dependencies
				gcc -o game main.o graphics.o world.o

Function Pointers:

Function pointers are the memory address of functions. Function pointer declaration: int (*fp) (int) — fp points to a function that returns an int and has one argument of type int

int square(int x) { 
		return x * x;
}
int timesTwo(int x) {
		return x * 2;
}

int (*fp) (int);
fp = &square;       // fp points to the square function
int n = (*fp)(10);  // calls square(10)
fp = timesTwo;      // & is optional, fp now points to timesTwo
n = (*fp)(2);       // calls timesTwo(2)
n = fp(2);          // (*...) is optional, normal function call

Higher-order functions: functions that take function pointers as arguments or return function pointers

void traverse (List head, void (*fp) (List)) {
		List curr = head;
		while (curr != NULL) {
				fp(curr);  // Applies the given function to curr! Traverse can apply to every node any function with the signature: void funcName(List)
				curr = curr -> next;
		}
} 
void printName (List head);
void printZID (List head);
...
traverse(studentList, printName);   // Where printName fills the parameter: void (*fp) (List)
traverse(studentList, printZID);    // Where printZID fills the parameter: void (*fp) (List)

Generic Types:

Polymorphism: the ability of the same code to operate on different data types

void * pointers allow generic data types. By declaring values of type void *, we can pass those values into functions which downcast the void pointers to the correct type before operating on them. In object-oriented programming, downcasting or type refinement is the act of casting a reference of a base class to one of its derived classes.
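A small sketch of this idea: a generic apply function that takes a void pointer and a function pointer, where each callback downcasts to the type it expects:

#include <stdio.h>

void printInt(void *item)    { printf("%d\n", *(int *) item); }  // Downcast to int *
void printString(void *item) { printf("%s\n", (char *) item); }  // Downcast to char *

// Works for any item type, as long as fp downcasts correctly
void apply(void *item, void (*fp)(void *)) {
		fp(item);
}

int main(void) {
		int n = 42;
		apply(&n, printInt);          // Prints 42
		apply("hello", printString);  // Prints hello
		return 0;
}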