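A Guide to Efficient Numerical Computing with NumPy
NumPy is the foundation of scientific computing in Python. Most of the Python data science and machine learning ecosystem, including pandas, SciPy, and scikit-learn, is built on top of it.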
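NumPy Array Basics

    import numpy as np

    # Create arrays from Python lists
    a = np.array([1, 2, 3, 4, 5])           # 1-D array
    b = np.array([[1, 2, 3], [4, 5, 6]])    # 2-D array, shape (2, 3)

    # Create arrays with built-in constructors
    c = np.zeros((3, 4))         # 3x4 array of zeros
    d = np.ones((2, 3))          # 2x3 array of ones
    e = np.arange(0, 10, 2)      # 0, 2, 4, 6, 8 (stop value excluded)
    f = np.linspace(0, 1, 5)     # 5 evenly spaced values from 0 to 1 inclusive
    g = np.random.randn(3, 3)    # 3x3 samples from the standard normal distribution
    h = np.eye(3)                # 3x3 identity matrix

    # Inspect array attributes
    print(f"Shape: {b.shape}")                  # (2, 3)
    print(f"Number of dimensions: {b.ndim}")    # 2
    print(f"Data type: {b.dtype}")              # an integer type, int64 on most platforms
    print(f"Total elements: {b.size}")          # 6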
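Array Indexing and Slicing

    arr = np.arange(24).reshape(4, 6)    # values 0..23 in 4 rows and 6 columns
    print(arr)

    # Basic indexing
    print(arr[1, 2])    # element at row 1, column 2 (value 8)
    print(arr[1])       # all of row 1
    print(arr[:, 2])    # all of column 2

    # Slicing: rows 1-2, columns 2-4
    print(arr[1:3, 2:5])

    # Boolean indexing: every element greater than 10, returned as a 1-D array
    mask = arr > 10
    print(arr[mask])

    # Fancy (integer-array) indexing: the elements at (0, 1) and (2, 3), values 1 and 15
    print(arr[[0, 2], [1, 3]])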
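Vectorized Operations
NumPy's core advantage is vectorization: an operation is applied to whole arrays in compiled code, so the element-by-element loop never runs in the Python interpreter:

    import time

    size = 1000000
    a_list = list(range(size))
    b_list = list(range(size))
    a_np = np.arange(size)
    b_np = np.arange(size)

    # Element-wise addition with a Python list comprehension.
    # time.perf_counter() is a high-resolution clock intended for timing short intervals.
    start = time.perf_counter()
    c_list = [a + b for a, b in zip(a_list, b_list)]
    print(f"Python list: {time.perf_counter() - start:.4f}s")

    # The same addition as a single vectorized NumPy expression
    start = time.perf_counter()
    c_np = a_np + b_np
    print(f"NumPy vectorized: {time.perf_counter() - start:.4f}s")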
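A single-shot timing like the one in this section can be noisy. As a minimal sketch, the standard-library `timeit` module can repeat each measurement and keep the fastest run. The helper names `loop_sum_of_squares` and `vectorized_sum_of_squares` are made up for this illustration and are not part of NumPy; the array size and repeat count are arbitrary choices.

```python
import timeit

import numpy as np

values = np.arange(100_000, dtype=np.float64)


def loop_sum_of_squares(data):
    """Sum of squares with an explicit Python loop (hypothetical helper)."""
    total = 0.0
    for v in data:
        total += v * v
    return total


def vectorized_sum_of_squares(data):
    """The same quantity computed by a single NumPy call (hypothetical helper)."""
    return float(np.dot(data, data))


# Both versions should agree up to floating-point rounding.
assert np.isclose(loop_sum_of_squares(values), vectorized_sum_of_squares(values))

# Run each version a few times and keep the fastest run to reduce noise.
loop_time = min(timeit.repeat(lambda: loop_sum_of_squares(values), number=1, repeat=3))
vec_time = min(timeit.repeat(lambda: vectorized_sum_of_squares(values), number=1, repeat=3))

print(f"Python loop:      {loop_time:.6f}s")
print(f"NumPy vectorized: {vec_time:.6f}s")
```

The exact ratio depends on the machine, the array size, and the NumPy build, so the printed numbers are best read as an order-of-magnitude comparison.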
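Broadcasting
Broadcasting is the mechanism NumPy uses to perform element-wise operations on arrays of different shapes:

    a = np.array([[1], [2], [3]])     # shape (3, 1)
    b = np.array([10, 20, 30, 40])    # shape (4,)
    c = a + b                         # both operands are broadcast to shape (3, 4)

    print(c)
    # [[11 21 31 41]
    #  [12 22 32 42]
    #  [13 23 33 43]]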
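Broadcasting rules:
- If the two arrays have a different number of dimensions, the shape of the array with fewer dimensions is padded with 1s on the left.
- If an array has size 1 along a dimension, it is stretched along that dimension to match the other array.
- In every dimension, the two sizes must either be equal or one of them must be 1; otherwise NumPy raises a ValueError.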
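The rules above can be checked directly by inspecting result shapes. The following is a small sketch with arbitrarily chosen shapes: one compatible pair, one incompatible pair, and the usual fix of adding an explicit axis.

```python
import numpy as np

# Compatible: (3, 1) with (4,). The 1-D shape is padded to (1, 4),
# then each size-1 dimension is stretched, giving (3, 4).
a = np.ones((3, 1))
b = np.ones(4)
print((a + b).shape)  # (3, 4)

# Incompatible: (3, 2) with (3,). The 1-D shape is padded to (1, 3),
# and the trailing sizes 2 and 3 differ with neither equal to 1.
x = np.ones((3, 2))
y = np.ones(3)
try:
    x + y
    broadcast_failed = False
except ValueError as err:
    broadcast_failed = True
    print("Broadcast failed:", err)

# Fix: give y an explicit trailing axis so its shape is (3, 1),
# which broadcasts against (3, 2).
print((x + y[:, np.newaxis]).shape)  # (3, 2)
```

Padding always happens on the left, which is why a 1-D array of length 3 lines up with the columns of a 2-D array rather than its rows unless an axis is added explicitly.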
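Mathematical Operations

    x = np.random.randn(3, 4)

    # Element-wise functions
    print(np.sqrt(np.abs(x)))           # abs() first: x contains negative values, whose sqrt would be NaN
    print(np.exp(x))
    print(np.log(np.abs(x) + 1e-10))    # the small offset avoids log(0)
    print(np.sin(x))

    # Statistics and reductions
    print(np.mean(x, axis=0))      # mean of each column
    print(np.std(x, axis=1))       # standard deviation of each row
    print(np.max(x, axis=0))       # maximum of each column
    print(np.argmax(x, axis=1))    # column index of the maximum in each row
    print(np.cumsum(x, axis=0))    # cumulative sum down each column

    # Matrix operations
    A = np.random.randn(3, 3)
    B = np.random.randn(3, 4)

    print(np.dot(A, B))        # matrix product, shape (3, 4); equivalent to A @ B
    print(A.T)                 # transpose
    print(np.linalg.inv(A))    # inverse
    print(np.linalg.det(A))    # determinant
    print(np.linalg.eig(A))    # eigenvalues and eigenvectors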
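Array Manipulation

    a = np.arange(12)

    # Reshaping
    print(a.reshape(3, 4))
    print(a.reshape(2, -1))    # -1 lets NumPy infer the size: shape (2, 6)

    # Stacking
    b = np.arange(12, 24).reshape(3, 4)
    print(np.vstack([a.reshape(3, 4), b]))    # stack vertically: shape (6, 4)
    print(np.hstack([a.reshape(3, 4), b]))    # stack horizontally: shape (3, 8)

    # Splitting
    c = np.arange(24).reshape(4, 6)
    print(np.hsplit(c, 3))    # three arrays of shape (4, 2)
    print(np.vsplit(c, 2))    # two arrays of shape (2, 6)

    # Sorting
    d = np.random.randn(5, 3)
    print(np.sort(d, axis=0))     # sort each column independently
    print(np.argsort(d[:, 0]))    # row indices that would sort the first column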
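Performance Tips

    # Preallocate the output array instead of growing a list element by element.
    # For a case this simple, np.arange(1000) ** 2 produces the same values without a loop.
    result = np.empty(1000)
    for i in range(1000):
        result[i] = i ** 2

    # Know when you get a view and when you get a copy
    a = np.arange(12).reshape(3, 4)
    b = a.reshape(4, 3)    # view: shares memory with a, no data copied
    c = a.copy()           # copy: independent data

    # Use np.where instead of a loop with if/else
    x = np.random.randn(1000)
    y = np.where(x > 0, x, 0)    # keep positive values, replace the rest with 0

    # Use einsum to express tensor contractions
    A = np.random.randn(3, 4)
    B = np.random.randn(4, 5)
    C = np.einsum('ij,jk->ik', A, B)    # matrix product, shape (3, 5)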
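As a sanity check on the tips in this section, the sketch below confirms that each technique gives the same values as a simpler or looped form, and that `reshape` on a contiguous array returns a view while `copy` does not. The seed and array sizes are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, used only for reproducibility

# 1. A preallocated loop and a single vectorized expression give the same values.
result = np.empty(1000)
for i in range(1000):
    result[i] = i ** 2
squares = np.arange(1000) ** 2
print(np.array_equal(result, squares))  # True

# 2. reshape on a contiguous array returns a view; copy() returns new memory.
a = np.arange(12).reshape(3, 4)
b = a.reshape(4, 3)
c = a.copy()
print(np.shares_memory(a, b))  # True
print(np.shares_memory(a, c))  # False
b[0, 0] = 99                   # writing through the view changes a, not c
print(a[0, 0], c[0, 0])        # 99 0

# 3. np.where(x > 0, x, 0) matches np.maximum(x, 0) for real-valued input.
x = rng.standard_normal(1000)
print(np.array_equal(np.where(x > 0, x, 0), np.maximum(x, 0)))  # True

# 4. einsum('ij,jk->ik') is an ordinary matrix product.
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
print(np.allclose(np.einsum('ij,jk->ik', A, B), A @ B))  # True
```

Because a view shares memory with its source, modifying it in place also modifies the original array; calling `copy()` is the explicit way to avoid that when independent data is needed.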
Linear Algebra Applications

    # Solve the linear system Ax = b
    A = np.array([[3, 1], [1, 2]])
    b = np.array([9, 8])
    x = np.linalg.solve(A, b)
    print(f"Solution: {x}")    # x = 2, y = 3

    # Least squares: find x that minimizes ||Ax - b||
    A = np.random.randn(100, 3)
    b = np.random.randn(100)
    x, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)

    # Singular value decomposition: A = U @ diag(S) @ Vt
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
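One way to build confidence in these routines is to verify their defining properties numerically. The sketch below is an illustration, not a required step; the names `M`, `y`, and `coef` and the seed are chosen here to avoid reusing `A`, `b`, and `x` for different things.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, used only for reproducibility

# Linear system: the solution should satisfy A @ x == b up to rounding.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)
print(np.allclose(A @ x, b))  # True

# Least squares: at the optimum, the residual M @ coef - y is orthogonal
# to every column of M (the normal equations).
M = rng.standard_normal((100, 3))
y = rng.standard_normal(100)
coef, residuals, rank, sv = np.linalg.lstsq(M, y, rcond=None)
print(np.allclose(M.T @ (M @ coef - y), 0.0, atol=1e-8))  # True

# SVD: multiplying the factors back together should reproduce M.
# (U * S) scales each column of U by the matching singular value.
U, S, Vt = np.linalg.svd(M, full_matrices=False)
print(np.allclose((U * S) @ Vt, M))  # True
```

Comparisons use `np.allclose` rather than exact equality because the results are floating-point approximations.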
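Summary
NumPy is the core library for scientific computing in Python, and its vectorized operations and broadcasting make numerical code both fast and concise. Array manipulation, mathematical operations, the broadcasting rules, and the performance techniques covered here are the groundwork for data science and machine learning. In real projects, prefer NumPy's vectorized operations over Python loops wherever possible; this is usually the single largest performance gain available.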