Introduction
I am going to go for a Raymond Hettinger style presentation, https://www.cs.odu.edu/~tkennedy/cs330/s21/Public/languageResources/#python-programming-videos.
These materials are web-centric (i.e., do not need to be printed and are available at https://www.cs.odu.edu/~tkennedy/numpy-workshop).
Who am I?
I have taught various courses, including:
- CS 300T - Computers in Society
- CS 333 - Programming and Problem Solving
- CS 330 - Object Oriented Programming and Design
- CS 350 - Introduction to Software Engineering
- CS 410 - Professional Workforce Development I
- CS 411W - Professional Workforce Development II
- CS 417 - Computational Methods & Software
Most of my free time is spent writing Python 3 and Rust code, tweaking my Vim configuration, or learning a new (programming) language. My current language of interests are Rust (at the time of writing) and Python (specifically the NumPy library).
Referenced Courses & Materials
I may reference materials (e.g., lecture notes) and topics from various courses, including:
- CS 330 - Object Oriented Programming & Design
- CS 350 - Introduction to Software Engineering
- CS 417 - Computational Methods & Software
I may also reference a couple examples from the previous:
Overview
What is NumPy?
NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.
Retrieved from https://numpy.org/doc/stable/user/whatisnumpy.html
Managing Expectations
We can only scratch the surface during a one (1) hour workshop.
For this workshop...
- a basic understanding of Python is assumed.
- no prior knowledge of NumPy is assumed.
A Few Short Examples
My original intention was to focus on a few short (10 to 20 line) examples that focused on arrays and statistics... with a little linear algebra at the end.
If you are interested in a collection of problems to develop your NumPy understanding... Coffee Break NumPy by Christian Mayer is an excellent read. In fact... I read the book during the 2020 Winter Break.
Let us start with a few short NumPy basics (e.g., arrays and broadcasting) then build a Matrix Solver!.
Creating Arrays
The code snippets from this section are extracted from array_creation.py.
NumPy arrays can be...
-
Initialized to all zeroes
array_size = 8 zeroes_array = np.zeros(array_size) print(zeroes_array) print()
-
Initialized to all ones
array_size = 12 ones_array = np.ones(array_size) print(ones_array) print()
-
Allocated and left uninitialized
# Contents are "whatever happens to be in memory" array_size = 16 unitialized_array = np.empty(array_size) print(unitialized_array) print()
-
Created from a Python List of
int
-spython_list = [2, 4, 8, 16, 32, 64] np_array = np.array(python_list) print(np_array) print()
-
Created from a Python List of
float
-spython_list = [2., 4., 8., 16., 32., 64.] np_array = np.array(python_list) print(np_array) print()
Broadcasting
The next couple snippets are extracted from broadcasting.py.
In Python... each element of a list must be updated one at a time. If a list of prices needed to be reduced by 10%, each one would need to be multiplied by 0.9 within a loop...
prices = [1.00, 2.95, 8.40, 3.50, 3.30, 16.91]
for idx in range(len(prices)):
price[idx] *= 0.9
or using a list comprehension...
prices = [1.00, 2.95, 8.40, 3.50, 3.30, 16.91]
prices = [0.9 * price for price in prices]
print(prices)
NumPy's broadcasting mechanic allows us to write a simple prices *= 0.9
.
prices = np.array([1.00, 2.95, 8.40, 3.50, 3.30, 16.91])
prices *= 0.9
print(prices)
print()
print("*" * 80)
print()
The obvious benefit is less typing. The more important one is optimization. NumPy's core is implemented in C. The official NumPy Documentation provides a succinct overview in its Why is NumPy Fast? section.
How much faster is NumPy? Let us run a quick using benchmark_broadcasting.py.
num_values = 1000000
num_runs = 100
def op_wrapper_py():
prices = range(1, num_values, 1)
prices = [0.9 * price for price in prices]
py_list = timeit.timeit(op_wrapper_py, number=num_runs)
def op_wrapper_np():
prices = np.arange(0, num_values, 1, dtype=np.float64)
prices[:] *= 0.9
np_array = timeit.timeit(op_wrapper_np, number=num_runs)
print(f"Python Time: {py_list:.4f}")
print(f"NumPy Time : {np_array:.4f}")
On a Core i7-6700k... For 1 million numbers, run 100 times... The NumPy code is a little over 10 times faster than the pure Python code.
Time (sec) | |
---|---|
Python | 5.1248 |
NumPy | 0.3168 |
Remaining Topics
There are a few more topics to introduce:
- Indexing - will be covered as part of the Matrix Solver example
- I/O
- Index Arrays
- Boolean (Mask) Index Arrays
Implementing a Matrix Solver
In CS 417/517 Computational Methods... I require students to implement a Matrix Solver... from scratch. NumPy provides implementations of matrix operations (e.g., multiplication). Let us implement a quick NumPy based Matrix solver, for this Discrete Case Least Squares Approximation Problem. The code snippets in this section are extracted from least_squares.py.
Getting Started
The first line is the NumPy import. By convention the numpy
module is given
the alias np
.
import numpy as np
The first function (i.e., print_matrices
) was debugging/instrumentation
logic. It was used to display the XTX
(X-Transpose-X) and XTY
(X-Transpose-Y) matrices during development.
def print_matrices(matrix_XTX, matrix_XTY):
"""
Print the XTX and XTY matrices
"""
print("{:*^40}".format("XTX"))
print(matrix_XTX)
print()
print("{:*^40}".format("XTY"))
print(matrix_XTY)
Next... let us jump to the main
function and...
-
set up the input data (points)
points = [(0., 0.), (1., 1.), (2., 4.)]
-
create the
X
matrixmatrix_X = np.array([[1., 0., 0.], [1., 1., 1.], [1., 2., 4.]])
-
create the
Y
matrixmatrix_Y = np.array([0, 1, 4])
-
compute X-transpose
matrix_XT = matrix_X.transpose()
After setting everything up... it is time for a couple matrix multiplications.
matrix_XTX = np.matmul(matrix_XT, matrix_X)
matrix_XTY = np.matmul(matrix_XT, matrix_Y)
The augmented XTX|XTY
matrix is what we are going to solve. Note that the
@
operator is often used instead of np.matmul
.
The Solver
A Gaussian Elimination Matrix solver can viewed as 4 operations:
- Pivot
- Scale
- Eliminate
- Backsolve
For our implementation... these operations will be wrapped in a
solve_matrix
function...
def solve_matrix(matrix_XTX, matrix_XTY):
"""
Solve a matrix and return the resulting solution vector
"""
# Get the dimensions (shape) of the XTX matrix
num_rows, num_columns = matrix_XTX.shape
for i in range(0, num_rows):
Note the use of the .shape
tuple.
Pivot
Pseudocode
idx <- find_largest_row_by_col(A, col_index, num_rows)
swap_row(A, i, idx)
Implementation
# Find column with largest entry
largest_idx = i
current_col = i
for j in range(i + 1, num_rows):
if matrix_XTX[largest_idx, i] < matrix_XTX[j, current_col]:
largest_idx = j
if largest_idx != current_col:
matrix_XTX[[i, largest_idx], :] = matrix_XTX[[largest_idx, i], :]
matrix_XTY[[i, largest_idx]] = matrix_XTY[[largest_idx, i]]
Swap
Pseudocode
def scale_row(A, row_idx, num_cols, s):
for every j in 0 to num_cols:
A[row_idx][j] = A[row_idx][j] / s
Implementation
# Scale
scaling_factor = matrix_XTX[i, i]
matrix_XTX[i, :] /= scaling_factor
matrix_XTY[i] /= scaling_factor
Eliminate
Pseudocode
def eliminate(A, src_row_idx, num_cols, num_rows):
start_col <- src_row_idx
for every i in (src_row_idx + 1) to num_rows:
s <- A[i][start_col]
for every j in start_col to num_cols:
A[i][j] = A[i][j] - s * A[src_row_idx][j]
A[i][start_col] = 0
Implementation
# Eliminate
for row_i in range(i + 1, num_rows):
s = matrix_XTX[row_i][i]
matrix_XTX[row_i] = matrix_XTX[row_i] - s * matrix_XTX[i]
matrix_XTY[row_i] = matrix_XTY[row_i] - s * matrix_XTY[i]
Backsolve
Pseudocode
def back_solve(matrix):
augColIdx <- matrix.cols() # the augmented column
lastRow = matrix.rows() - 1
for i in lastRow down to 1:
for j <- (i - 1) down to 0:
s <- matrix[j][i]
matrix[j][i] -= (s * matrix[i][i])
matrix[j][augCol] -= (s * matrix[i][augColIdx])
Implementation
def _backsolve(matrix_XTX, matrix_XTY):
num_rows, _ = matrix_XTX.shape
for i in reversed(range(1, num_rows)):
for j in reversed(range(0, i)):
s = matrix_XTX[j, i]
matrix_XTX[j, i] -= (s * matrix_XTX[i, i])
matrix_XTY[j] -= (s * matrix_XTY[i])
Statistics
Suppose we need to compute some grade statistics for the following data.
Name | Homework 1 | Homework 2 | Exam 1 | Exam 2 |
---|---|---|---|---|
John | 100 | 98 | 100 | 90 |
Tom | 100 | 0 | 70 | 90 |
Bob | 100 | 70 | 90 | 80 |
The first step is to convert the table into...
-
a list of exercises
exercises = ["Homework 1", "Homework 2", "Exam 1", "Exam 2"]
-
a list of names
students = ["John", "Tom", "Bob"]
-
a NumPy
ndarray
grades = np.array([[100., 98, 100., 90.], [100., 0., 70., 90.], [100., 70., 90., 80.]])
NumPy provides a number of statistics
operations
that can be applied by axis. For a two-dimensional ndarray
there are two
choices:
-
by row (student)
avg_by_student = grades.mean(axis=1) min_by_student = grades.min(axis=1) max_by_student = grades.max(axis=1)
Name Avg Min Max John 97.0 90.0 100.0 Tom 65.0 0.0 100.0 Bob 85.0 70.0 100.0 -
by column (exercise)
avg_by_exercise = grades.mean(axis=0) max_by_exercise = grades.max(axis=0) std_by_exercise = grades.std(axis=0)
Exercise Avg Max Std Dev Homework 1 100.0 100.0 0.0 Homework 2 56.0 98.0 41.2 Exam 1 86.7 100.0 12.5 Exam 2 86.7 90.0 4.7