🤖 Distinguishing AI from Machine Learning
🎯 (SE-12-03)
📌 Distinguish between Artificial Intelligence (AI) and Machine Learning (ML) — two related but fundamentally different concepts that are frequently conflated.
🤖 Artificial Intelligence (AI)
The broad concept of building machines that can perform tasks requiring human-like intelligence. AI encompasses any technique that allows computers to simulate cognitive functions — including expert systems, natural language processing, robotics, and machine learning. AI can use simple rule-based logic (e.g., an IF-THEN chess-playing algorithm) without any learning from data.
📊 Machine Learning (ML)
A specific subset of AI in which systems automatically learn from data, identify patterns, and improve their performance over time — without being explicitly programmed for each specific scenario. ML requires training data. For example, a spam filter trained on thousands of labelled emails learns to classify new emails itself, rather than following a fixed list of rules.
🔄 How ML Supports Automation
🎯 (SE-12-01, SE-12-03, SE-12-09)
📌 Investigate how machine learning supports automation through the use of DevOps, Robotic Process Automation (RPA), and Business Process Automation (BPA).
♾️ DevOps and MLOps
DevOps is a software development practice that unifies development and operations teams to deliver software continuously and reliably. MLOps (Machine Learning Operations) extends these principles to the machine learning lifecycle — automating the processes of training, validating, deploying, and monitoring ML models in production. Without MLOps, retraining and redeploying an ML model is a slow, manual, error-prone process. With MLOps pipelines (e.g., using tools like Kubeflow or AWS SageMaker), models are automatically retrained when performance degrades and redeployed with zero downtime. For example, a recommendation engine at a streaming service is automatically retrained weekly on new viewing data, ensuring its suggestions remain relevant.
🛠️ Design
Engineers define the business problem, reframe it as an ML problem (e.g., "predict churn" → binary classification), define success metrics (e.g., 90% accuracy), and source appropriate training datasets.
📊 Model Development
Data is cleaned (wrangling), relevant features are selected (feature engineering), and models are trained, evaluated, and validated. Multiple algorithm types may be tested to find the best performer.
🚀 Operations
The model is deployed to a live environment via an API. Operations teams monitor for data drift — when real-world input data begins to differ significantly from training data — which degrades model accuracy and triggers retraining.
🤖 Robotic Process Automation (RPA)
RPA uses software "bots" to automate repetitive, rule-based digital tasks that humans would otherwise perform manually — such as copying data between systems, filling online forms, or processing invoices. Traditional RPA bots follow rigid scripts. When ML is integrated, the result is Intelligent Automation: bots that can read unstructured data (such as handwritten invoice fields or natural language emails), extract meaning from it, and make autonomous routing decisions. For example, an intelligent RPA bot in a hospital can read an unstructured doctor's referral email, extract the patient's details and condition, check availability, and automatically book the appropriate specialist appointment — a task that previously required a receptionist.
🏢 Business Process Automation (BPA)
BPA involves using technology to automate entire end-to-end business workflows — not just individual repetitive tasks, but complete multi-step processes involving multiple systems and decision points. ML-driven BPA integrates predictive intelligence into workflows. For example, an insurance claims BPA system uses ML to automatically assess a submitted photo of car damage, classify the severity, cross-reference the policy, calculate the repair estimate, and either auto-approve small claims or route complex cases to a human adjudicator — compressing a multi-day process into minutes. The key difference from RPA: BPA automates the whole process; RPA automates individual steps within a process.
🏋️ Models Used to Train Machine Learning Systems
🎯 (SE-12-03)
📌 Explore the four primary models used to train machine learning systems, each suited to different problem types and data availability.
| Training Model | How It Works | Best Application | Real-World Example |
|---|---|---|---|
| Supervised Learning | The model is trained on a labelled dataset — both the input data and the correct output (label) are provided. The model learns a mapping function from inputs to outputs by minimising the error between its predictions and the known labels. | Classification (discrete outputs) and Regression (continuous outputs). | An email spam filter trained on thousands of emails marked "spam" or "not spam" learns to classify new unseen emails correctly. |
| Unsupervised Learning | The model analyses unlabelled data to find hidden patterns, groupings, or structures autonomously — no correct answers are provided during training. The model must discover meaningful structure on its own. | Clustering, dimensionality reduction, anomaly detection. | A retail company groups its customers into behavioural segments (e.g., "bargain hunters", "brand loyalists") based on purchase history — without pre-defining the groups. |
| Semi-Supervised Learning | Uses a small amount of labelled data combined with a large pool of unlabelled data. The model learns from the labelled examples and uses the unlabelled data to refine its understanding of the broader data distribution. This approach is cost-effective as labelling data is expensive and time-consuming. | Web content classification, medical image analysis. | A medical AI trained to detect cancer on X-rays using a few hundred labelled scans and tens of thousands of unlabelled scans, dramatically reducing the costly expert-labelling burden. |
| Reinforcement Learning | An agent learns to make decisions by interacting with an environment. It receives rewards for correct actions and penalties for incorrect ones, gradually learning a policy that maximises cumulative reward over time. There is no labelled dataset — the agent learns through trial and error. | Autonomous navigation, robotics, game-playing, resource scheduling. | Google DeepMind's AlphaGo used reinforcement learning to master the board game Go, ultimately defeating the world champion — a feat previously considered decades away. |
🧹 Data Preprocessing
🎯 (SE-12-02, SE-12-03, SE-12-08)
📌 Apply data preprocessing techniques to prepare raw datasets for machine learning — cleaning, transforming, and splitting data to maximise model accuracy and reliability.
🧼 Handling Missing Data
Real-world datasets almost always contain missing values — sensors that failed, survey respondents who skipped questions, or records that were never collected. There are three main strategies for handling missing data, each with trade-offs:
❌ Drop Rows/Columns
Remove any row (sample) or column (feature) that contains missing values. Simple but risky — if many rows are dropped, the model may be trained on an unrepresentative subset of the data, introducing sampling bias.
📊 Impute with Mean / Median / Mode
Replace missing values with a summary statistic of the existing values. Mean works for normally distributed numerical features. Median is better for skewed distributions (outliers distort the mean). Mode (most frequent value) is used for categorical features.
import pandas as pd
import numpy as np
# Sample dataset with missing values
data = {
'age': [25, np.nan, 35, 42, np.nan, 50],
'income': [40000, 55000, np.nan, 80000, 62000, np.nan],
'grade': ['A', 'B', np.nan, 'A', 'C', 'B']
}
df = pd.DataFrame(data)
# Strategy 1: Drop rows with any missing value
df_dropped = df.dropna()
# Strategy 2: Fill numerical columns (median for skewed data, mean otherwise)
df['age'] = df['age'].fillna(df['age'].median())
df['income'] = df['income'].fillna(df['income'].mean())
# Strategy 3: Fill categorical column with the mode (most frequent value)
df['grade'] = df['grade'].fillna(df['grade'].mode()[0])
print(df) # All missing values are now filled
print(df.isnull().sum()) # Should print 0 for all columns
📏 Normalisation vs Standardisation
Many ML algorithms (KNN, neural networks, SVMs) are sensitive to the scale of features. If one feature has values in the thousands (e.g., income) and another has values between 0 and 1 (e.g., a ratio), the large-scale feature will dominate distance calculations. Scaling ensures all features contribute equally to the model.
| Technique | Formula | Output Range | Best For | Sensitive to Outliers? |
|---|---|---|---|---|
| Min-Max Normalisation | x' = (x − min) / (max − min) | 0 to 1 | Neural networks, image pixel values, when bounded range is needed | Yes — outliers compress most values into a narrow range |
| Z-Score Standardisation | x' = (x − μ) / σ | Typically −3 to +3 (unbounded) | Linear/logistic regression, KNN, PCA — when distribution is approximately normal | Less so — outliers become large z-scores but do not compress other values |
from sklearn.preprocessing import MinMaxScaler
import numpy as np
# Raw features: [age, income]
X = np.array([[25, 40000],
[35, 80000],
[45, 60000],
[22, 30000]])
scaler = MinMaxScaler() # Scales each feature to range [0, 1]
X_scaled = scaler.fit_transform(X)
print(X_scaled)
# Output (approximate):
# [[0.13 0.20]
#  [0.57 1.00]
#  [1.00 0.60]
#  [0.00 0.00]]
# To reverse the scaling (e.g., to interpret predictions):
X_original = scaler.inverse_transform(X_scaled)
from sklearn.preprocessing import StandardScaler
import numpy as np
# Raw features: [age, income]
X = np.array([[25, 40000],
[35, 80000],
[45, 60000],
[22, 30000]])
scaler = StandardScaler() # Transforms to mean=0, std=1
X_scaled = scaler.fit_transform(X)
print(X_scaled)
# Output (approximate):
# [[-0.75 -0.65]
#  [ 0.36  1.43]
#  [ 1.47  0.39]
#  [-1.08 -1.17]]
# Inspect what the scaler learned:
print(f"Feature means: {scaler.mean_}")
print(f"Feature std devs: {scaler.scale_}")
🎯 Feature Selection
Feature selection is the process of identifying and retaining only the most informative input variables, and discarding irrelevant or redundant ones. Including too many features can lead to the curse of dimensionality — model performance degrades as irrelevant noise increases, and training becomes slower. One simple technique is correlation analysis, which drops features that show little linear relationship with the target:
import pandas as pd
# Load dataset (e.g., house prices)
df = pd.read_csv('houses.csv')
# Show correlation of all features with the target variable (price)
correlations = df.corr(numeric_only=True)['price'].sort_values(ascending=False)
print(correlations)
# Drop columns with very low correlation to target (threshold: |r| < 0.1)
low_corr = correlations[abs(correlations) < 0.1].index.tolist()
df = df.drop(columns=low_corr)
# Also drop the ID column which is not a feature
df = df.drop(columns=['id'], errors='ignore')
🧪 Train / Validation / Test Split
Before training, the dataset is divided into three non-overlapping subsets. A common split is 70 / 15 / 15 or 80 / 10 / 10.
📚 Training Set (~70–80%)
The model learns from this data — weights, coefficients, and parameters are adjusted based on training examples. The model sees this data during the learning process.
🔧 Validation Set (~10–15%)
Used during development to tune hyperparameters (e.g., choosing K in KNN, regularisation strength) and to compare different model configurations. The model does not train on validation data, but decisions are made based on its validation performance — so it is "indirectly seen."
🔒 Test Set (~10–15%)
Held back entirely until final evaluation. Provides an unbiased estimate of real-world performance. The test set must never influence any training or tuning decision — it simulates completely unseen data.
from sklearn.model_selection import train_test_split
import numpy as np
X = np.random.rand(1000, 5) # 1000 samples, 5 features
y = np.random.randint(0, 2, 1000) # binary labels
# Step 1: Split off the test set (15%)
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.15, random_state=42
)
# Step 2: Split remaining data into train (70%) and validation (15%)
# 15 / 85 ≈ 0.176 of the remaining data gives the correct 15% of total
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.176, random_state=42
)
print(f"Training: {len(X_train)} samples") # ~700
print(f"Validation: {len(X_val)} samples") # ~150
print(f"Test: {len(X_test)} samples") # ~150
⚠️ Avoiding Data Leakage
Fit the scaler on the training set only, then call transform() on the validation and test sets. Fitting the scaler on the full dataset (including validation and test data) before splitting is a common mistake known as data leakage — it allows information from the test set to influence training, making evaluation results unrealistically optimistic. In production, the model will perform worse than the inflated metric suggests.
Wrong: scaler.fit_transform(X_all), then split.
Correct: Split first → scaler.fit_transform(X_train) → scaler.transform(X_val) → scaler.transform(X_test).
📲 Common Applications of Key ML Algorithms
🎯 (SE-12-03, SE-12-05)
📌 Investigate common real-world applications of key machine learning algorithms, including data analysis and forecasting, virtual personal assistants, and image recognition.
📈 Data Analysis and Forecasting
ML algorithms applied to data analysis and forecasting use patterns in historical data to predict future outcomes. Regression algorithms (linear, polynomial) identify trends and project them forward. Time-series models (e.g., ARIMA, LSTM neural networks) specifically handle data ordered over time — such as stock prices, weather measurements, or website traffic.
🗣️ Virtual Personal Assistants
Virtual personal assistants (VPAs) such as Apple's Siri, Amazon's Alexa, and Google Assistant combine multiple ML domains — speech recognition, natural language understanding, and speech synthesis — into a seamless conversational interface.
👁️ Image Recognition
Image recognition uses Convolutional Neural Networks (CNNs) — a specialised deep learning architecture designed to process grid-structured data (pixels). CNNs learn hierarchical visual features: early layers detect edges and colours; deeper layers recognise shapes and textures; the final layers identify objects.
🌳 Design Models for ML: Decision Trees & Neural Networks
🎯 (SE-12-03, SE-12-08)
📌 Research the models used by software engineers to design and analyse machine learning systems, including decision trees and neural networks.
🌳 Decision Trees
A decision tree is a supervised learning model that represents a series of binary questions (decisions) as a tree structure, branching at each internal node based on a feature value, and arriving at a prediction at each leaf node. Decision trees are highly interpretable — engineers and non-technical stakeholders can read the tree logic directly. They are also the foundation for more powerful ensemble methods such as Random Forests (many decision trees voting together) and Gradient Boosted Trees (iteratively correcting prediction errors).
🏗️ Structure
- Root Node: The first decision point — the single most informative feature.
- Internal Nodes: Further decision points splitting data by additional features.
- Branches: Outcomes of each test (True/False, above/below a threshold).
- Leaf Nodes: Final prediction values (class label or regression value).
✅ Strengths & Limitations
Strengths: Highly interpretable; handles both numerical and categorical data; requires minimal data preprocessing; can be visualised. Limitations: Prone to overfitting (memorising training data) if not pruned; small changes in data can produce very different trees; less accurate than ensemble methods on complex datasets.
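The interpretability advantage can be seen directly with scikit-learn's DecisionTreeClassifier, which can print its learned rules as text. The tiny spam-style dataset below is an invented illustration, not drawn from the worked examples in this section:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: [has_urgency_keywords, has_suspicious_links] -> 1 = spam
X = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]]
y = [1, 1, 0, 0, 1, 0]

# max_depth limits tree growth -- a simple form of pruning against overfitting
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(X, y)

# Print the learned rules -- stakeholders can read this logic directly
print(export_text(tree, feature_names=["has_urgency_keywords", "has_suspicious_links"]))
print(tree.predict([[1, 0]]))  # classify an email with urgency keywords only
```

Because the toy labels depend entirely on the first feature, the printed rules show a single split on `has_urgency_keywords` — exactly the "most informative feature at the root" behaviour described above.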
🌳 Decision Tree - Email Spam Detection
Multi-level heuristic classification logic
This decision tree shows how email filtering systems classify messages as spam or legitimate using sequential heuristic checks: Does it contain urgency keywords? Are suspicious links present? Is the sender authenticated? Each path follows decision points, accumulating confidence levels. This demonstrates how automated systems make complex decisions through multiple rule-based checks.
flowchart TD
    A["Contains urgency keywords, e.g. 'URGENT'?"] -->|Yes| B["All CAPS words > 50%?"]
    A -->|No| C["Has suspicious links?"]
    B -->|Yes| D["🚨 SPAM Confidence: 95%"]
    B -->|No| E["Sender domain authenticated?"]
    C -->|Yes| F["Link domain known safe?"]
    C -->|No| G["Recipient in To: field?"]
    E -->|Yes| H["✅ LEGITIMATE Confidence: 92%"]
    E -->|No| I["User marked similar before?"]
    F -->|Yes| G
    F -->|No| D
    G -->|Yes| J["Email length > 2000 chars?"]
    G -->|No| D
    I -->|Yes| D
    I -->|No| H
    J -->|Yes| D
    J -->|No| H
    style D fill:#FFCDD2,stroke:#C62828,stroke-width:2px
    style H fill:#C8E6C9,stroke:#2E7D32,stroke-width:2px
    style E fill:#FFF9C4,stroke:#F57F17,stroke-width:2px
    style F fill:#FFF9C4,stroke:#F57F17,stroke-width:2px
    style I fill:#FFF9C4,stroke:#F57F17,stroke-width:2px
    style J fill:#FFF9C4,stroke:#F57F17,stroke-width:2px
Purpose: Understand how decision tree classification works in practice by tracing email classification paths
Syllabus Link: SE-12-03, SE-12-08
Try This: Modify the tree by adding new decision nodes (e.g., "Recipient has account?") or reordering the nodes to see how tree structure affects decision paths
🕸️ Neural Networks
Artificial Neural Networks (ANNs) are designed to loosely mimic the biological structure of the human brain. They consist of layers of interconnected artificial neurones (nodes), each receiving weighted input signals, applying an activation function, and passing an output signal to the next layer.
📥 Input Layer
Receives the raw feature data (e.g., pixel values, sensor readings, numerical attributes). Each node in the input layer represents one feature. No computation occurs at this layer — it simply passes values to the first hidden layer.
🔗 Hidden Layers
One or more intermediate layers where the network learns non-linear relationships in the data. Each neurone computes a weighted sum of its inputs and applies a non-linear activation function (e.g., ReLU). Deep learning refers to networks with many hidden layers (deep architectures).
📤 Output Layer
Produces the final prediction. For binary classification, a single neurone with a sigmoid activation outputs a probability between 0 and 1. For multi-class problems, multiple output neurones (one per class) with softmax normalisation are used.
🔙 Backpropagation Training
During training, the network's prediction is compared to the correct label (calculating loss). The error is propagated backwards through the network, and each connection weight is adjusted using gradient descent to minimise future errors. This process repeats across thousands of training examples.
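The training loop above can be sketched in NumPy for the simplest possible case — a single sigmoid neurone with cross-entropy loss. This is a deliberate simplification (full backpropagation chains these gradient updates through every hidden layer), and the OR-style toy data is invented for illustration:

```python
import numpy as np

# Toy data: a single neurone learning an OR-style pattern
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 1.])

rng = np.random.default_rng(0)
w = rng.normal(size=2)   # connection weights
b = 0.0                  # bias
lr = 0.5                 # learning rate

for _ in range(2000):
    pred = 1 / (1 + np.exp(-(X @ w + b)))  # forward pass: sigmoid activation
    error = pred - y                       # dLoss/dz for cross-entropy loss
    w -= lr * (X.T @ error) / len(y)       # gradient descent weight update
    b -= lr * error.mean()                 # gradient descent bias update

final = 1 / (1 + np.exp(-(X @ w + b)))
print((final > 0.5).astype(int))           # thresholded predictions
```

Each pass computes a prediction, measures the error against the label, and nudges every weight downhill on the loss surface — repeated thousands of times, exactly as described above.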
🕸️ System Diagram - Neural Network Architecture
Deep learning model for loan approval prediction
This diagram shows a neural network's structure: input features (age, income, credit score, debt ratio) flow through hidden layers where neurons learn patterns using ReLU activation functions. The output layer uses sigmoid activation to produce a probability between 0 and 1, representing the predicted likelihood of loan approval. This demonstrates how machine learning automates decision-making.
flowchart LR
    I[/"INPUT LAYER: Age, Income, Credit Score, Debt Ratio"/]
    H1["HIDDEN LAYER 1: 8 Neurons, ReLU Activation"]
    H2["HIDDEN LAYER 2: 4 Neurons, ReLU Activation"]
    O[/"OUTPUT LAYER: 1 Neuron, Sigmoid, P(Approve)"/]
    I -->|weights| H1
    H1 -->|weights| H2
    H2 -->|weights| O
    style I fill:#E3F2FD,stroke:#1976D2,stroke-width:2px
    style H1 fill:#FFF9C4,stroke:#F57F17,stroke-width:2px
    style H2 fill:#FFF9C4,stroke:#F57F17,stroke-width:2px
    style O fill:#C8E6C9,stroke:#2E7D32,stroke-width:2px
Purpose: Visualise the layered structure of a neural network from input features through hidden layers to final binary classification output
Syllabus Link: SE-12-03, SE-12-08
Try This: Design a network for a multi-class problem (e.g., predicting sentiment: positive/negative/neutral) and modify the output layer to use 3 neurons with softmax activation
📐 Types of Algorithms Associated with ML
🎯 (SE-12-02, SE-12-03)
📌 Describe the types of algorithms associated with machine learning, including linear regression, logistic regression, and K-nearest neighbour.
📉 Linear Regression
Linear regression is a supervised learning algorithm used to predict a continuous numerical value based on one or more input features. It fits a straight line (or hyperplane in multiple dimensions) through the training data that minimises the total squared distance between each data point and the line — known as the Ordinary Least Squares (OLS) method. The resulting equation takes the form: y = mx + c, where m is the slope (coefficient) and c is the y-intercept. For example, a linear regression model trained on study hours and exam scores would output a predicted score for any given number of study hours.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Sample data: study hours vs exam scores
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([45, 55, 60, 65, 70, 75, 80, 88])
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
print(f"R² Score: {r2_score(y_test, y_pred):.3f}")
print(f"MSE: {mean_squared_error(y_test, y_pred):.2f}")
print(f"Predicted score for 9 hours: {model.predict([[9]])[0]:.1f}")
⚖️ Logistic Regression
Despite its name, logistic regression is a classification algorithm — not used for predicting continuous values. It calculates the probability that an input belongs to a particular class (typically binary: 0 or 1, True or False). The sigmoid function squashes any input value into a probability between 0 and 1. A threshold (typically 0.5) converts this probability into a class prediction. For example, a logistic regression model assessing whether a bank transaction is fraudulent outputs a probability of fraud; if it exceeds 0.5 (or a risk-adjusted threshold), the transaction is flagged. Logistic regression is highly interpretable — each feature's coefficient indicates its influence on the predicted probability.
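A minimal sketch of the fraud example using scikit-learn's LogisticRegression — the dataset below (amounts in thousands of dollars and hour of day) is invented purely for illustration:

```python
from sklearn.linear_model import LogisticRegression
import numpy as np

# Hypothetical data: [amount_in_thousands, hour_of_day] -> 1 = fraudulent
X = np.array([[0.02, 14], [0.035, 10], [5.0, 3], [0.015, 16],
              [4.2, 2], [0.025, 12], [3.9, 4], [0.03, 11]])
y = np.array([0, 0, 1, 0, 1, 0, 1, 0])

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# predict_proba returns [P(class 0), P(class 1)] for each sample
p_fraud = model.predict_proba([[4.5, 3]])[0][1]
print(f"P(fraud) = {p_fraud:.2f}")
print("Flag for review" if p_fraud > 0.5 else "Allow")
```

The model outputs a probability via the sigmoid function; the `> 0.5` comparison is the threshold step that converts the probability into a class decision, and it can be raised or lowered to trade false positives against false negatives.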
👥 K-Nearest Neighbour (KNN)
K-Nearest Neighbour (KNN) is a simple, instance-based supervised learning algorithm used for both classification and regression. Rather than learning a model during training, KNN stores the entire training dataset. At prediction time, it finds the K training examples closest to the new input (using a distance metric such as Euclidean distance), and assigns the majority class (classification) or average value (regression) of those K neighbours as the prediction. KNN requires no explicit training phase, but is computationally expensive at prediction time on large datasets.
🔢 Choosing K
Small K (e.g., K=1) creates a very complex, jagged decision boundary that overfits to training noise. Large K creates a smoother boundary but may underfit. K is typically chosen via cross-validation — testing multiple values and selecting the one with the best validation performance. An odd K is preferred for binary classification to avoid tie votes.
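Choosing K via cross-validation can be sketched as follows, using the built-in Iris dataset. A Pipeline keeps the scaling step inside each fold so no information leaks between folds:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# Evaluate odd K values with 5-fold cross-validation
scores = {}
for k in [1, 3, 5, 7, 9, 11]:
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores[k] = cross_val_score(pipe, iris.data, iris.target, cv=5).mean()
    print(f"K={k}: mean CV accuracy = {scores[k]:.3f}")

# Keep the K with the best cross-validated accuracy
best_k = max(scores, key=scores.get)
print(f"Selected K = {best_k}")
```

Only odd values are tried here, matching the tie-avoidance advice above for binary problems (Iris has three classes, but the habit carries over).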
📏 Distance Metrics
Euclidean distance (straight-line distance between two points) is the most common metric: d = √((x₁-x₂)² + (y₁-y₂)²). Features must be normalised (scaled to similar ranges) before applying KNN, otherwise features with larger numerical ranges dominate the distance calculation and distort results.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
# Features: [age, income], Labels: [0=no loan, 1=loan approved]
X_train = [[25, 30000], [35, 55000], [45, 80000], [22, 25000], [50, 90000]]
y_train = [0, 1, 1, 0, 1]
# Scale features (important for KNN!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
# Train with k=3
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_scaled, y_train)
# Predict for new customer
new_customer = scaler.transform([[30, 45000]])
prediction = knn.predict(new_customer)
print(f"Loan approved: {'Yes' if prediction[0] == 1 else 'No'}")
| Algorithm | Type | Output | Best Use Case |
|---|---|---|---|
| Linear Regression | Supervised | Continuous number | Predicting house prices, exam scores, sales revenue |
| Logistic Regression | Supervised | Probability → Class (0/1) | Spam detection, fraud classification, disease diagnosis |
| K-Nearest Neighbour | Supervised | Class or continuous value | Recommendation systems, anomaly detection, image classification |
🔬 K-Means, Random Forest & Naive Bayes
🎯 (SE-12-02, SE-12-03)
📌 Apply additional ML algorithms — K-Means clustering, Random Forest ensemble learning, and Naive Bayes classification — to solve a range of real-world problems.
📍 K-Means Clustering (Unsupervised)
K-Means is an unsupervised algorithm that groups unlabelled data into K clusters by iteratively assigning points to the nearest centroid and updating centroid positions. No labelled training data is required.
⚙️ How K-Means Works
- Choose K (number of clusters) and randomly initialise K centroids.
- Assign: Each data point is assigned to the nearest centroid (using Euclidean distance).
- Update: Move each centroid to the mean of all points assigned to it.
- Repeat steps 2–3 until centroids stop moving (convergence).
📈 Choosing K: The Elbow Method
Plot the within-cluster sum of squares (WCSS) against K. The "elbow" — where adding more clusters gives diminishing returns — is the optimal K. For example, if WCSS drops sharply from K=1 to K=3 then levels off, K=3 is likely the best choice.
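The elbow method can be sketched by printing WCSS (exposed by scikit-learn as `inertia_`) for a range of K values. The synthetic dataset below, generated with three blob centres, is an assumption for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 natural clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Compute WCSS for each K; the "elbow" is where the drop flattens out
wcss = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss[k] = km.inertia_
    print(f"K={k}: WCSS = {wcss[k]:.0f}")
```

On this data the WCSS falls sharply up to K=3 and then levels off — the elbow correctly recovers the number of blob centres used to generate the points.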
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import numpy as np
# Customer data: [age, annual_spend]
X = np.array([[25, 2000], [30, 5000], [35, 4500], [22, 1800],
[45, 8000], [50, 9000], [55, 8500], [60, 7500]])
# Scale features before clustering
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Fit K-Means with 2 clusters
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans.fit(X_scaled)
print("Cluster assignments:", kmeans.labels_)
# e.g. [0 0 0 0 1 1 1 1] — younger low-spend vs older high-spend
Use cases: Customer segmentation, image colour compression, anomaly detection, document topic grouping.
🌲 Random Forest (Ensemble)
A Random Forest is an ensemble of many decision trees, each trained on a random subset of the training data (a technique called bagging — Bootstrap Aggregating). The final prediction is the majority vote (classification) or average (regression) of all trees. Because each tree is trained on slightly different data and features, the ensemble is much more robust than any single tree.
| Aspect | Decision Tree | Random Forest |
|---|---|---|
| Overfitting | High — memorises training data | Low — averaging reduces overfitting |
| Interpretability | High — easy to visualise | Lower — hundreds of trees are hard to explain |
| Training speed | Fast | Slower (trains N trees) |
| Prediction accuracy | Lower on unseen data | Generally higher — industry standard |
| Feature importance | Available | Available and more reliable |
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load the classic Iris dataset (150 flowers, 3 species)
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.2, random_state=42)
# Train Random Forest with 100 trees
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
# Evaluate
y_pred = rf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
# Feature importance
for name, importance in zip(iris.feature_names, rf.feature_importances_):
print(f" {name}: {importance:.3f}")
📊 Naive Bayes
Naive Bayes classifiers apply Bayes' theorem with the "naive" assumption that all features are statistically independent. Despite this unrealistic assumption, Naive Bayes performs surprisingly well — especially for text classification (spam detection, sentiment analysis).
Bayes' theorem gives P(class | features) = P(features | class) × P(class) / P(features). In plain English: the probability of a class given the observed features is proportional to how likely those features are given that class, multiplied by how common the class is overall.
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.3, random_state=0)
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print(f"Naive Bayes Accuracy: {accuracy_score(y_test, y_pred):.3f}")
🛠️ ML Tools & Libraries
🎯 (SE-12-03)
📌 Identify the primary Python libraries used in machine learning and apply a complete ML workflow using these tools.
📚 Core Python ML Libraries
- scikit-learn: The standard library for classical ML algorithms, offering a consistent API across models: fit(), predict(), score().
- pandas: The DataFrame is the core structure — a labelled 2D table (like a spreadsheet in Python). Used to load CSV/Excel data, clean missing values, filter rows, and prepare data for ML algorithms.
📋 Library Selection Guide
| Task | Best Library | Key Functions |
|---|---|---|
| Load and clean data | pandas | read_csv(), fillna(), dropna(), groupby() |
| Numerical operations | NumPy | array(), mean(), std(), dot() |
| Classical ML algorithms | scikit-learn | fit(), predict(), train_test_split() |
| Visualise data / results | Matplotlib / Seaborn | plot(), scatter(), heatmap() |
| Neural networks | TensorFlow / Keras | Sequential(), Dense(), compile() |
🔄 End-to-End ML Workflow Example
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
# ── 1. LOAD DATA ──────────────────────────────────
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target
# ── 2. PREPROCESS ─────────────────────────────────
X = df.drop('species', axis=1)
y = df['species']
# Check for missing values
print("Missing values:", X.isnull().sum().sum())
# ── 3. SPLIT ──────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y)
# ── 4. SCALE FEATURES ─────────────────────────────
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit on train only!
X_test_scaled = scaler.transform(X_test) # transform test
# ── 5. TRAIN ──────────────────────────────────────
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
# ── 6. EVALUATE ───────────────────────────────────
y_pred = model.predict(X_test_scaled)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(classification_report(y_test, y_pred,
target_names=iris.target_names))
# ── 7. PREDICT ────────────────────────────────────
new_flower = [[5.1, 3.5, 1.4, 0.2]] # example measurements
new_scaled = scaler.transform(new_flower)
prediction = model.predict(new_scaled)
print(f"Predicted species: {iris.target_names[prediction[0]]}")
⚠️ Remember: call fit_transform() on training data only, then transform() on test data. Fitting the scaler on the entire dataset "leaks" information about the test set into the model, giving artificially high accuracy scores that won't reflect real-world performance.
📊 Model Evaluation Metrics
🎯 (SE-12-02, SE-12-03)
📌 Evaluate the performance of ML models using appropriate metrics — different problem types (classification vs regression) require different measures of accuracy.
| Metric | Formula | When to Use | Example Value |
|---|---|---|---|
| Accuracy | (TP + TN) / Total | Balanced classification datasets | 0.92 = 92% correct |
| Precision | TP / (TP + FP) | When false positives are costly (e.g. spam filter) | 0.88 = 88% of positives were real |
| Recall | TP / (TP + FN) | When false negatives are costly (e.g. disease detection) | 0.95 = caught 95% of actual positives |
| F1-Score | 2 × (Precision × Recall) / (Precision + Recall) | Imbalanced datasets — balances precision and recall | 0.91 = harmonic mean of precision & recall |
| MSE | Σ(y − ŷ)² / n | Regression — penalises large errors heavily | 4.25 (lower is better) |
| R² | 1 − (SS_res / SS_tot) | Regression — proportion of variance explained | 0.94 = model explains 94% of variance |
📋 Confusion Matrix
A confusion matrix summarises the prediction results for a classification model, showing how many samples were correctly or incorrectly classified for each class.
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP (hit) | FN (miss) |
| Actual Negative | FP (false alarm) | TN (correct rejection) |
True Positive (TP): Model predicted positive — and it was correct (e.g. flagged spam that is actually spam).
False Positive (FP): Model predicted positive — but it was wrong (e.g. flagged a legitimate email as spam — a "false alarm").
True Negative (TN): Model predicted negative — and it was correct (e.g. correctly allowed a legitimate email through).
False Negative (FN): Model predicted negative — but it was wrong (e.g. failed to catch a spam email — a "miss").
In medical diagnosis, false negatives are dangerous (missing a disease), so recall is prioritised. In spam filtering, false positives are costly (blocking important emails), so precision is prioritised.
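The confusion matrix and the metrics derived from it can be computed directly with scikit-learn. A small sketch using made-up spam-filter labels (the label vectors here are illustrative, not from a real dataset):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# 1 = spam, 0 = legitimate (hypothetical labels for illustration)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]  # one miss (FN), one false alarm (FP)

# For binary labels, ravel() unpacks the matrix as TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")                # TP=3, FP=1, FN=1, TN=5
print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # 3/(3+1) = 0.75
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")     # 3/(3+1) = 0.75
```

With one false alarm the precision drops to 0.75, and with one miss the recall also drops to 0.75 — exactly the trade-offs discussed above.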
📉 Overfitting vs Underfitting
🎯 (SE-12-02, SE-12-03)
📌 Distinguish between underfitting and overfitting in ML models and explain strategies to achieve a well-generalising model.
[Diagram: three models fitted to the same scatter data — underfitting (high bias): too simple to learn the patterns, training and test error both high; good fit (balanced): generalises well to new data, training and test error both low; overfitting (high variance): memorises training noise, training error very low but test error high.]
| Characteristic | Underfitting | Good Fit | Overfitting |
|---|---|---|---|
| Training Error | High | Low | Very Low |
| Test / Validation Error | High | Low | High |
| Bias | High | Low | Low |
| Variance | Low | Low | High |
| Cause | Model too simple; too few features | Appropriate complexity | Model too complex; too many parameters |
| Fix | More features, more complex model | — | Regularisation, more data, dropout |
- ⚖️ Regularisation (L1/L2): Adds a penalty to the loss function for large model weights, discouraging the model from fitting noise. L1 (Lasso) can reduce weights to zero (feature selection); L2 (Ridge) shrinks weights towards zero.
- 🔄 Cross-Validation: Split data into multiple folds and train/validate across all combinations — ensures the model performs well on varied subsets of data, not just one split.
- 📚 More Training Data: A larger, more diverse dataset makes it harder for the model to memorise specific examples and forces it to learn genuine patterns.
- ✂️ Dropout (Neural Networks): Randomly disables neurons during training, preventing co-adaptation and forcing the network to learn robust, distributed representations.
- 🛑 Early Stopping: Monitor validation loss during training and stop training when it begins to increase — preventing the model from over-training on the training set.
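Two of these strategies can be demonstrated directly in scikit-learn. A sketch using the iris dataset (the C values are illustrative — C is the inverse regularisation strength, so smaller C means a stronger L2 penalty):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Cross-validation: evaluate across 5 different train/validation splits
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)
print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean: {scores.mean():.3f} (+/- {scores.std():.3f})")

# Regularisation: a stronger L2 penalty (smaller C) shrinks the weights
strong_reg = LogisticRegression(C=0.01, max_iter=1000).fit(X, y)
weak_reg = LogisticRegression(C=100.0, max_iter=1000).fit(X, y)
print("Mean |weight|, strong penalty:", np.abs(strong_reg.coef_).mean().round(3))
print("Mean |weight|, weak penalty:  ", np.abs(weak_reg.coef_).mean().round(3))
```

The strongly regularised model ends up with much smaller weights — it is deliberately prevented from fitting noise, at the cost of some flexibility.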
📈 ML Regression Models Using OOP
🎯 (SE-12-02, SE-12-08)
📌 Design, develop and apply ML regression models using an Object-Oriented Programming (OOP) approach to predict numeric values, including linear regression, polynomial regression, and logistic regression.
📉 Linear Regression in Python (OOP via scikit-learn)
The scikit-learn library implements regression models as Python classes — each model is an object with fit() and predict() methods. This is a direct application of OOP: instantiating a model object, calling methods to train it, and calling methods to use it.
import numpy as np
from sklearn.linear_model import LinearRegression
# Training data: hours studied vs exam score
X_train = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y_train = np.array([45, 50, 55, 60, 65, 70, 75, 80])
# Instantiate the model object (OOP: creating an instance)
model = LinearRegression()
# Train the model by calling the fit() method on the object
model.fit(X_train, y_train)
# Predict exam score for 9 hours of study
prediction = model.predict([[9]])
print(f"Predicted score for 9 hours: {prediction[0]:.1f}") # ~85.0
# Inspect the learned parameters (slope and intercept)
print(f"Slope (coefficient): {model.coef_[0]:.2f}") # ~5.0
print(f"Intercept: {model.intercept_:.2f}") # ~40.0
📈 Polynomial Regression (Non-Linear Relationships)
Polynomial regression extends linear regression to model curved (non-linear) relationships between variables. When plotting a scatter graph reveals a curve rather than a straight line, a polynomial fit is more appropriate. The input features are transformed to include powers (x², x³, etc.) using PolynomialFeatures, then passed into a standard LinearRegression model. For example, modelling how fuel efficiency decreases with speed follows a curved relationship — polynomial regression captures this more accurately than a straight line.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
# Non-linear data: speed vs fuel efficiency (curved relationship)
X = np.array([[40], [60], [80], [100], [120], [140]])
y = np.array([35, 40, 38, 32, 24, 14]) # efficiency peaks then drops
# Transform features to include x and x² (degree 2 polynomial)
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
# Fit a linear regression model to the transformed features
model = LinearRegression()
model.fit(X_poly, y)
# Predict efficiency at 110 km/h
X_test = poly.transform([[110]])
print(f"Predicted efficiency at 110 km/h: {model.predict(X_test)[0]:.1f} L/100km")
⚖️ Logistic Regression — Binary Classification
Logistic regression classifies inputs into one of two categories by predicting a probability using the sigmoid function. The OOP interface is identical to linear regression — instantiate, fit, predict.
from sklearn.linear_model import LogisticRegression
import numpy as np
# Training data: [study_hours, assignments_completed] → pass (1) or fail (0)
X_train = np.array([[1,2],[2,3],[3,4],[4,5],[5,6],[6,7],[2,1],[1,1],[3,2],[4,3]])
y_train = np.array([0, 0, 0, 1, 1, 1, 0, 0, 1, 1])
# Instantiate and train the logistic regression classifier
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
# Predict outcome for a student: 5 study hours, 5 assignments completed
prediction = classifier.predict([[5, 5]])
probability = classifier.predict_proba([[5, 5]])
print(f"Prediction: {'Pass' if prediction[0] == 1 else 'Fail'}")
print(f"Pass probability: {probability[0][1]:.2%}")
🕸️ Neural Network Models Using OOP
🎯 (SE-12-02, SE-12-08)
📌 Apply neural network models using an Object-Oriented Programming approach to make predictions. Neural networks are implemented as layered objects using frameworks such as scikit-learn (for simpler networks) or TensorFlow/Keras (for deep learning).
🕸️ Multi-Layer Perceptron (MLP) with scikit-learn
The MLPClassifier (Multi-Layer Perceptron) is a neural network class available in scikit-learn. It uses the same OOP interface as other models — instantiate with architecture parameters, call fit() to train, call predict() to classify. The hidden_layer_sizes parameter defines the number of hidden layers and neurons per layer.
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
import numpy as np
# Training data: [temperature, humidity] → weather type
# 0 = Sunny, 1 = Cloudy, 2 = Rainy
X_train = np.array([
    [30, 20], [32, 25], [28, 30],  # Sunny
    [22, 55], [20, 60], [18, 65],  # Cloudy
    [15, 85], [12, 90], [10, 95]   # Rainy
])
y_train = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
# Scale features (important for neural networks)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
# Instantiate MLP with 2 hidden layers: 8 neurons, then 4 neurons
nn_model = MLPClassifier(
    hidden_layer_sizes=(8, 4),
    activation='relu',  # Rectified Linear Unit activation function
    max_iter=1000,
    random_state=42
)
# Train the neural network (backpropagation runs internally)
nn_model.fit(X_scaled, y_train)
# Predict weather for 25°C temperature and 40% humidity
new_data = scaler.transform([[25, 40]])
prediction = nn_model.predict(new_data)
labels = {0: 'Sunny', 1: 'Cloudy', 2: 'Rainy'}
print(f"Predicted weather: {labels[prediction[0]]}")
When you write model = MLPClassifier(...) you are instantiating an object. When you call model.fit(X, y) you are invoking a method that sets the object's internal weights (attributes). When you call model.predict(X_new) you are invoking another method that uses those stored weights to generate output. This is OOP in direct practice.
🔄 Neural Network Training and Execution Cycles
🏋️ Training Cycle
The network is exposed to labelled training examples. For each example, it makes a prediction, calculates the loss (error), and uses backpropagation to adjust each weight by a small amount in the direction that reduces the error. This process (one full pass through the training data) is called an epoch. Training runs for many epochs until the loss converges to a minimum.
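The training loop described above can be sketched for a single linear neuron trained with gradient descent — a toy illustration of the predict/measure-loss/adjust-weights cycle, not a full backpropagation implementation:

```python
import numpy as np

# Toy data generated from y = 2x + 1 (the parameters the neuron should learn)
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * X + 1

w, b = 0.0, 0.0        # initial weight and bias
learning_rate = 0.05

for epoch in range(500):            # one epoch = one full pass over the data
    y_pred = w * X + b              # forward pass: make predictions
    error = y_pred - y
    loss = np.mean(error ** 2)      # mean squared error (the loss)
    # Gradients of the loss with respect to w and b
    grad_w = 2 * np.mean(error * X)
    grad_b = 2 * np.mean(error)
    # Adjust each parameter a small amount in the direction that reduces loss
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"Learned parameters: w={w:.2f}, b={b:.2f}")  # converges towards w=2, b=1
```

After enough epochs the loss converges and the learned parameters approach the true values — the same principle backpropagation applies to every weight in a multi-layer network.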
🚀 Execution (Inference) Cycle
After training, the network's weights are fixed. For each new input, the data is passed forward through the network layer-by-layer (no backpropagation). Each neuron applies its learned weights and activation function, and the output layer produces the final prediction. Inference is computationally much cheaper than training.
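The forward pass during inference is just repeated matrix multiplication plus an activation function. A minimal NumPy sketch of a 2-input, 3-hidden-neuron, 1-output network (the weight values here are arbitrary stand-ins, not trained):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

# Fixed (already "trained") weights — arbitrary values for illustration
W1 = np.array([[0.5, -0.2, 0.8],
               [0.3,  0.7, -0.5]])   # input → hidden, shape (2, 3)
b1 = np.array([0.1, 0.0, -0.1])
W2 = np.array([[0.6], [-0.4], [0.9]])  # hidden → output, shape (3, 1)
b2 = np.array([0.2])

def forward(x):
    """One inference pass: layer-by-layer, no backpropagation."""
    hidden = relu(x @ W1 + b1)   # hidden layer activations
    return hidden @ W2 + b2      # output layer (linear)

x = np.array([1.0, 2.0])
print(f"Network output: {forward(x)[0]:.3f}")  # 0.440 for these weights
```

Every framework (scikit-learn's MLPClassifier, Keras, etc.) performs essentially this computation at predict() time, just with the weights it learned during training.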
🌍 Impact of Automation on Individuals, Society & the Environment
🎯 (SE-12-05)
📌 Assess the impact of automation on individuals, society, and the environment across five key dimensions.
⛑️ Safety of Workers
Automation and robotics are progressively replacing human workers in physically dangerous environments, with significant positive safety outcomes. Automated drones now perform inspection of high-voltage power lines and offshore oil rigs, eliminating worker exposure to electrocution, heights, and toxic atmospheres. Robotic welding and material handling systems in manufacturing have dramatically reduced repetitive strain injuries and burn risks. In mining, autonomous haul trucks (e.g., Rio Tinto's AutoHaul system) operate in remote and hazardous pit environments without human operators. The net effect is a measurable reduction in workplace injury rates and fatalities. However, the displacement of human workers from these roles creates economic vulnerability for the affected communities.
♿ People with Disability
AI-powered automation is providing unprecedented levels of independence and accessibility for people with disability. Computer vision systems enable real-time audio descriptions of the surrounding environment for visually impaired users (e.g., Microsoft's Seeing AI app describes people, currency, and written text through a smartphone camera). Speech-to-text and natural language interfaces allow people with motor disabilities to control computers and dictate documents without physical input. Autonomous vehicles could fundamentally transform mobility for people who cannot drive due to physical or visual impairment. AI captioning provides real-time speech-to-text conversion for deaf individuals during live conversations, video calls, and broadcasts. These advances promote social inclusion and reduce dependence on others, improving quality of life. The risk is that if these systems are not designed with universal design principles, the digital divide — where those without access to technology are further marginalised — may widen.
👔 Nature and Skills Required for Employment
The adoption of automation is causing a significant structural shift in labour markets. Routine cognitive tasks (data entry, basic reporting, invoice processing) and routine physical tasks (assembly line work, packaging, basic quality inspection) are being automated at scale, displacing workers in these roles. Simultaneously, demand is growing rapidly for workers with skills in MLOps, data science, AI ethics, robotic maintenance, and human-AI collaboration — roles that require understanding of automation systems rather than performing the automated tasks themselves. Workers must engage in continuous reskilling and upskilling throughout their careers. Governments and educational institutions face the challenge of retraining displaced workers for emerging roles. The transition may exacerbate inequality if the benefits of automation accrue primarily to capital owners rather than being distributed across the workforce.
🌱 Production Efficiency, Waste, and the Environment
Automation offers significant environmental benefits through efficiency gains but also introduces new environmental costs. Optimised logistics and supply chains powered by ML reduce unnecessary transport and warehousing, cutting fuel consumption and emissions. Predictive maintenance extends the lifespan of industrial equipment, reducing material waste from premature replacements. Smart energy management systems use ML to optimise building heating, cooling, and lighting, reducing energy waste. Conversely, training large deep learning models (e.g., GPT-4) consumes enormous quantities of electricity — some estimates suggest training a single large language model produces CO₂ emissions equivalent to five car lifetimes. Data centres required to run AI services consume vast amounts of water for cooling. Engineers must evaluate the full lifecycle environmental cost of AI systems alongside their operational benefits.
💰 The Economy and Distribution of Wealth
Automation increases productivity and creates new industries and wealth. However, the distribution of that wealth is a critical societal concern. Historically, automation has created new types of employment alongside the jobs it displaces — but the transition period creates significant hardship for displaced workers. The current wave of AI-driven automation is occurring much faster than previous industrial revolutions, leaving less time for natural retraining. There is a risk of increasing economic polarisation: high-skill workers who can work alongside AI systems see wages rise, while low-skill workers face wage stagnation or displacement. Policy responses under discussion include universal basic income (UBI), robot taxes, and mandatory corporate investment in worker retraining programs.
🧠 Human Behaviour Patterns Influencing ML and AI
🎯 (SE-12-04, SE-12-05)
📌 Explore by implementation how patterns in human behaviour influence ML and AI software development, including psychological responses, patterns related to acute stress response, cultural protocols, and belief systems.
🧠 Psychological Responses
Human psychology profoundly shapes how AI systems must be designed to be effective. The "uncanny valley" effect describes user discomfort and distrust when an AI interface appears almost — but not quite — human. Engineers must therefore design interfaces that are either clearly and comfortably human-like, or clearly and transparently machine-like. Customer service chatbots must disclose that they are AI systems; this transparency actually builds more trust than attempting to impersonate a human. Automation bias — the tendency to over-rely on automated systems — is a critical safety concern: when users trust an AI too completely, they stop critically evaluating its outputs. Aviation autopilot systems, medical diagnostic AI, and autonomous vehicles must be designed with explicit mechanisms to keep human operators actively engaged and vigilant rather than passively monitoring.
⚡ Patterns Related to Acute Stress Response
When individuals are operating under acute psychological stress (e.g., medical emergencies, financial crises, natural disasters), their cognitive performance changes significantly. Under stress, humans experience tunnel vision (narrowed attention to the most salient stimuli), reduced working memory capacity, and impaired complex decision-making. AI interfaces designed for critical environments — such as intensive care unit monitoring systems, emergency dispatch platforms, or incident management dashboards — must account for these changes. Rather than displaying all available data simultaneously, they must prioritise and surface the single most critical alert, minimise cognitive load through clear visual hierarchy, and use pre-attentive visual attributes (colour, size, motion) to draw attention appropriately. Overloading an operator in a high-stress scenario with fifteen simultaneous alerts significantly increases the risk of catastrophic error. Engineers developing AI for high-stakes environments must conduct human factors testing under simulated stress conditions.
🌐 Cultural Protocols
ML and AI systems deployed globally must be engineered to respect diverse regional cultural norms, communication expectations, and social protocols. A conversational AI that is appropriate in one cultural context may be offensive or ineffective in another. Engineers must consider: formality levels — many cultures expect formal, respectful address from professional systems (using titles and surnames), whereas others prefer casual familiarity; taboo topics — subjects considered inappropriate in specific cultural contexts must be excluded from AI responses for those regional deployments; linguistic nuance — direct translation of idioms and culturally specific references often fails or offends; and local legal frameworks — AI systems processing personal data must comply with jurisdiction-specific privacy legislation (e.g., the EU's GDPR, Australia's Privacy Act 1988, China's PIPL). Failure to embed cultural protocols into AI systems results in reputational damage, legal liability, and loss of user trust in international markets.
🛐 Belief Systems
Human belief systems — including religious beliefs, moral frameworks, and ethical worldviews — significantly influence both how AI is perceived and how it must be designed. In some religious and cultural traditions, there are deep concerns about AI making life-or-death decisions (e.g., autonomous weapons, medical triage AI) without human moral accountability. Engineers must engage with these perspectives during the design process rather than treating them as obstacles. AI systems for healthcare, justice, and education must be designed with explainability — the ability to provide human-understandable reasons for decisions — to allow individuals to challenge outcomes that conflict with their values or beliefs. Additionally, training data drawn from populations with specific belief systems will encode those beliefs into model outputs; engineers must actively test whether models produce fair and respectful outputs across diverse belief communities.
⚖️ Human and Dataset Source Bias in ML and AI
🎯 (SE-12-04, SE-12-05)
📌 Investigate the effect of human and dataset source bias in the development of ML and AI solutions. Bias in training data is one of the most significant ethical risks in machine learning, as it causes AI systems to systematically produce unfair or discriminatory outcomes at scale.
⚖️ Types of Bias
- 🏛️ Historical bias: the training data reflects past discrimination or outdated social patterns (e.g. a hiring model trained on historical hires that favoured one group reproduces that preference).
- 📊 Sampling (selection) bias: the dataset over- or under-represents certain groups (e.g. a facial recognition dataset dominated by one skin tone performs poorly on others).
- 📏 Measurement bias: features or labels are recorded inaccurately or inconsistently across groups.
- 🏷️ Labelling bias: human annotators' subjective judgements and stereotypes are encoded into the training labels.
- 🔁 Confirmation bias: developers unconsciously select data or interpret results in ways that confirm their existing expectations.
🛠️ Engineering Practices to Mitigate Bias
📋 Diverse and Representative Data
Engineers must proactively audit training datasets for demographic representation before training begins. Data augmentation, targeted data collection, and synthetic data generation can supplement underrepresented groups. The dataset should reflect the actual diversity of the population the system will serve.
🔍 Bias Auditing and Fairness Metrics
After training, models must be evaluated not just on overall accuracy but on fairness metrics across demographic subgroups — checking that error rates, false positive rates, and false negative rates are equitable across gender, race, age, and other protected characteristics.
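A subgroup audit can be as simple as computing the same metrics separately per group. A sketch with made-up predictions and a hypothetical protected attribute ("A" / "B"):

```python
import numpy as np

# Hypothetical audit data: true labels, model predictions, and a
# protected attribute for each individual (all values illustrative)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1])
group  = np.array(["A", "A", "A", "A", "A", "A",
                   "B", "B", "B", "B", "B", "B"])

for g in ["A", "B"]:
    mask = group == g
    yt, yp = y_true[mask], y_pred[mask]
    acc = np.mean(yt == yp)                  # per-group accuracy
    fp = np.sum((yp == 1) & (yt == 0))       # false positives in this group
    fpr = fp / max(np.sum(yt == 0), 1)       # per-group false positive rate
    print(f"Group {g}: accuracy={acc:.2f}, false positive rate={fpr:.2f}")
```

In this toy data, group B suffers both lower accuracy and a higher false positive rate than group A — exactly the kind of disparity a fairness audit must surface even when overall accuracy looks acceptable.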
👥 Diverse Development Teams
Engineering teams with diverse backgrounds, cultures, and perspectives are better positioned to identify potential biases during design and testing that homogeneous teams may overlook. Diverse teams ask different questions about data and outcomes.
📖 Explainability and Accountability
AI systems making consequential decisions (credit, employment, healthcare, justice) must be able to provide human-understandable explanations for their outputs. This enables affected individuals to challenge decisions and allows engineers to trace and correct the source of discriminatory outcomes.