Python Machine Learning Fundamentals in Practice: A Complete Tutorial on NumPy, Pandas, and Matplotlib

2026-02-15

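Note: This article was refined with AI assistance and reviewed and edited by a human to ensure the technical examples are accurate and runnable.

Update: The content applies to Python 3.x and recent versions of the ML libraries.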

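With its concise syntax and rich ecosystem of scientific computing libraries, Python has become one of the most widely used languages in data science and machine learning. This tutorial covers the core usage of three foundational libraries for machine learning: NumPy (numerical computing), Pandas (data processing), and Matplotlib (data visualization), to help readers quickly pick up basic data analysis skills.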

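1. NumPy: Numerical Computing

NumPy (Numerical Python) is the foundational library for scientific computing in Python. It provides a high-performance multidimensional array object and tools for working with it.

1.1 NumPy Array Basics

Installation and import:

pip install numpy

import numpy as np

Creating arrays: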

# Create from a list
a = np.array([1, 2, 3, 4, 5])

# Create a two-dimensional array
b = np.array([[1, 2, 3], [4, 5, 6]])

# Common creation functions
zeros = np.zeros((3, 4))          # 3x4 array of zeros
ones = np.ones((2, 3))            # 2x3 array of ones
full = np.full((2, 2), 7)         # array filled with a given value
eye = np.eye(3)                   # 3x3 identity matrix
arange = np.arange(0, 10, 2)      # [0, 2, 4, 6, 8]
linspace = np.linspace(0, 1, 5)   # [0, 0.25, 0.5, 0.75, 1]

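Array attributes:

arr = np.array([[1, 2, 3], [4, 5, 6]])

print(arr.ndim)       # number of dimensions: 2
print(arr.shape)      # shape: (2, 3)
print(arr.size)       # total number of elements: 6
print(arr.dtype)      # data type: int64 (platform-dependent default; may be int32 on some systems)
print(arr.itemsize)   # bytes per element: 8 for int64
print(arr.nbytes)     # total bytes: 48 for int64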

1.2 Array Indexing and Slicing

arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [9, 10, 11, 12]])

# Indexing
print(arr[0, 1])      # 2
print(arr[1, :])      # [5, 6, 7, 8]
print(arr[:, 2])      # [3, 7, 11]

# Slicing
print(arr[0:2, 1:3])  # [[2, 3], [6, 7]]
print(arr[::2, ::2])  # [[1, 3], [9, 11]]

# Boolean indexing
print(arr[arr > 5])   # [6, 7, 8, 9, 10, 11, 12]

# Fancy indexing
print(arr[[0, 2], [1, 3]])  # [2, 12]
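
One point the indexing examples above do not show is whether the result shares memory with the original array. The following is a minimal sketch, using a small array chosen only for illustration: basic slicing returns a view, while fancy indexing returns a copy.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)

# Basic slicing returns a view: it shares memory with arr.
view = arr[0:2, 1:3]
view[0, 0] = 100          # also changes arr[0, 1]

# Fancy indexing returns a copy: modifying it leaves arr untouched.
picked = arr[[0, 2], [1, 3]]
picked[0] = -1

print(arr[0, 1])                      # 100
print(np.shares_memory(arr, view))    # True
print(np.shares_memory(arr, picked))  # False
```

When an independent array is needed from a slice, calling `.copy()` on the slice avoids accidental changes to the source data.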

1.3 Array Operations

Basic operations:

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print(a + b)          # [5, 7, 9]
print(a - b)          # [-3, -3, -3]
print(a * b)          # [4, 10, 18] (element-wise multiplication)
print(a / b)          # [0.25, 0.4, 0.5]
print(a ** 2)         # [1, 4, 9]
print(np.sqrt(a))     # [1., 1.414, 1.732]

# Matrix multiplication
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(np.dot(A, B))
# [[19, 22], [43, 50]]

# Or use the @ operator
print(A @ B)

Aggregation functions:

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

print(np.sum(arr))           # 21
print(np.sum(arr, axis=0))   # [5, 7, 9] (sum of each column)
print(np.sum(arr, axis=1))   # [6, 15] (sum of each row)

print(np.mean(arr))          # 3.5
print(np.std(arr))           # standard deviation
print(np.min(arr))           # 1
print(np.max(arr))           # 6
print(np.argmin(arr))        # index of the minimum in the flattened array: 0
print(np.argmax(arr))        # index of the maximum in the flattened array: 5
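
Without an `axis` argument, `np.argmin` and `np.argmax` return a position in the flattened array. As a small sketch of how that position can be turned back into a row and column, `np.unravel_index` can be used; the variable names below are chosen only for illustration.

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

flat_idx = np.argmax(arr)                          # position in the flattened array
row, col = np.unravel_index(flat_idx, arr.shape)   # convert to (row, column)

print(flat_idx)                  # 5
print(row, col)                  # 1 2
print(np.argmax(arr, axis=0))    # [1 1 1]: row of the maximum in each column
print(np.argmax(arr, axis=1))    # [2 2]: column of the maximum in each row
```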

1.4 Array Shape Manipulation

arr = np.arange(12)
print(arr)                # [0, 1, 2, ..., 11]

# Reshape
arr_2d = arr.reshape(3, 4)
print(arr_2d)
# [[ 0,  1,  2,  3],
#  [ 4,  5,  6,  7],
#  [ 8,  9, 10, 11]]

# Transpose
print(arr_2d.T)
# [[ 0,  4,  8],
#  [ 1,  5,  9],
#  [ 2,  6, 10],
#  [ 3,  7, 11]]

# Flatten
print(arr_2d.flatten())   # [0, 1, 2, ..., 11]

# Add a dimension
arr = np.array([1, 2, 3])
print(arr[:, np.newaxis])
# [[1],
#  [2],
#  [3]]
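
Two details of reshaping often come up in practice: letting NumPy infer one dimension with `-1`, and the difference between `flatten()` and `ravel()`. The sketch below illustrates both on the same 12-element array.

```python
import numpy as np

arr = np.arange(12)

# -1 asks NumPy to infer that dimension from the total size.
m = arr.reshape(3, -1)
print(m.shape)                          # (3, 4)

flat_copy = m.flatten()                 # always a copy
flat_view = m.ravel()                   # a view when the memory layout allows it

print(np.shares_memory(m, flat_copy))   # False
print(np.shares_memory(m, flat_view))   # True
```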

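1.5 Broadcasting

NumPy's broadcasting mechanism lets arrays of different shapes be combined in arithmetic operations:

# Scalar broadcasting
a = np.array([1, 2, 3])
print(a + 5)              # [6, 7, 8]

# Broadcasting a 2-D array with a 1-D array
A = np.array([[1, 2, 3],
              [4, 5, 6]])  # shape: (2, 3)
b = np.array([10, 20, 30])  # shape: (3,)

print(A + b)
# [[11, 22, 33],
#  [14, 25, 36]]

# Broadcasting rules
# 1. Shapes are aligned starting from the trailing dimension
# 2. Two dimensions are compatible when they are equal or one of them is 1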
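
The two rules above can be seen at work in a common preprocessing step, standardizing each column, and in a case where the shapes do not line up. This is a minimal sketch with made-up data; the array `X` is an assumption for illustration only.

```python
import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])                 # shape (3, 2)

# Column statistics have shape (2,), which aligns with the trailing dimension of X.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(Z.mean(axis=0), 0))       # True

# Row means have shape (3,); the trailing dimensions 2 and 3 do not match.
row_mean = X.mean(axis=1)
try:
    X - row_mean
except ValueError as exc:
    print("broadcast failed:", exc)

# keepdims=True keeps shape (3, 1), so rule 2 applies and the subtraction works.
centered = X - X.mean(axis=1, keepdims=True)
print(centered.shape)                       # (3, 2)
```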

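1.6 Random Number Generation

# Uniform distribution
np.random.rand(3, 3)           # interval [0, 1)
np.random.uniform(1, 10, 5)    # interval [1, 10)

# Normal distribution
np.random.randn(3, 3)          # standard normal distribution
np.random.normal(0, 1, 100)    # mean 0, standard deviation 1

# Random integers
np.random.randint(1, 100, 10)  # integers in [1, 100)

# Random seed (call it before generating numbers to make the results reproducible)
np.random.seed(42)

# Random choice
np.random.choice([1, 2, 3, 4, 5], size=3, replace=False)

# Random shuffle (in place)
arr = np.arange(10)
np.random.shuffle(arr)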
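
The functions above use NumPy's global random state. NumPy also provides a `Generator` interface through `np.random.default_rng`, which keeps the seed local to one object. The following sketch mirrors the calls above with that interface; the variable names are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)         # independent, seeded generator

u = rng.random((3, 3))                  # uniform on [0, 1)
n = rng.normal(loc=0.0, scale=1.0, size=100)
k = rng.integers(1, 100, size=10)       # integers in [1, 100)
pick = rng.choice([1, 2, 3, 4, 5], size=3, replace=False)
perm = rng.permutation(10)              # shuffled copy of 0..9

# Re-creating the generator with the same seed reproduces the same stream.
rng2 = np.random.default_rng(42)
print(np.array_equal(u, rng2.random((3, 3))))   # True
```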

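2. Pandas: Data Processing

Pandas is the core tool for data analysis in Python. It provides two main data structures: DataFrame and Series.

2.1 Data Structures

Installation and import:

pip install pandas

import pandas as pd

Series (one-dimensional data):

# Create a Series
s = pd.Series([1, 2, 3, 4, 5])

# Series with an explicit index
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# Create from a dictionary
s = pd.Series({'a': 1, 'b': 2, 'c': 3})

print(s.values)   # array of values
print(s.index)    # index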

DataFrame (two-dimensional data):

# Create from a dictionary
df = pd.DataFrame({
    'name': ['张三', '李四', '王五'],
    'age': [25, 30, 35],
    'city': ['北京', '上海', '广州']
})

# Create from a two-dimensional array
df = pd.DataFrame(
    np.random.randn(3, 4),
    columns=['A', 'B', 'C', 'D'],
    index=['a', 'b', 'c']
)

# Inspect the data
print(df.head())     # first 5 rows
print(df.tail())     # last 5 rows
df.info()            # column types and non-null counts (prints directly and returns None)
print(df.describe()) # summary statistics

2.2 Reading and Saving Data

# Read CSV
df = pd.read_csv('data.csv')
df = pd.read_csv('data.csv', encoding='utf-8', sep=',')

# Read Excel (requires an Excel engine such as openpyxl)
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# Read from SQL
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql_query('SELECT * FROM users', conn)

# Save data
df.to_csv('output.csv', index=False)
df.to_excel('output.xlsx', sheet_name='Sheet1')
df.to_sql('users', conn, if_exists='replace')
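
The calls above depend on files that may not exist on the reader's machine. As a self-contained sketch, the CSV round trip can be tried with an in-memory buffer, which also shows what `index=False` changes. The small DataFrame here is an assumption for illustration.

```python
import io
import pandas as pd

df = pd.DataFrame({'name': ['张三', '李四', '王五'], 'age': [25, 30, 35]})

# With index=False only the data columns are written.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
restored = pd.read_csv(buf)
print(list(restored.columns))        # ['name', 'age']

# With the default index=True the row index is written as an unnamed first column.
buf_with_index = io.StringIO()
df.to_csv(buf_with_index)
buf_with_index.seek(0)
with_index = pd.read_csv(buf_with_index)
print(list(with_index.columns))      # ['Unnamed: 0', 'name', 'age']
```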

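2.3 Selecting and Filtering Data

# Select columns
print(df['name'])           # single column
print(df[['name', 'age']])  # multiple columns

# Select rows by label
print(df.loc[0])            # row labeled 0
print(df.loc[0:2])          # rows labeled 0 through 2 (end label included)

# Select rows by position
print(df.iloc[0])           # first row
print(df.iloc[0:3])         # rows at positions 0 through 2 (end excluded)

# Conditional filtering
print(df[df['age'] > 25])
print(df[(df['age'] > 25) & (df['city'] == '北京')])

# Filtering with isin
print(df[df['city'].isin(['北京', '上海'])])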
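
With the default integer index, `loc` and `iloc` can look interchangeable. The sketch below uses a made-up DataFrame whose labels differ from its positions, so the difference between label-based and position-based selection becomes visible.

```python
import pandas as pd

df = pd.DataFrame(
    {'age': [25, 30, 35, 40]},
    index=[10, 20, 30, 40],          # labels that differ from positions
)

by_label = df.loc[10:30]             # label slice, end label included: 3 rows
by_position = df.iloc[0:2]           # position slice, end excluded: 2 rows

print(by_label['age'].tolist())      # [25, 30, 35]
print(by_position['age'].tolist())   # [25, 30]

# A boolean mask and a column selection can be combined in one .loc call.
print(df.loc[df['age'] > 28, 'age'].tolist())   # [30, 35, 40]
```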

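2.4 Data Processing

Handling missing values:

# Count missing values per column
print(df.isnull().sum())

# Drop missing values
df_clean = df.dropna()                    # drop rows containing any NA
df_clean = df.dropna(subset=['age'])      # drop rows where age is NA
df_clean = df.dropna(how='all')           # drop rows where every value is NA

# Fill missing values (assign the result back; inplace=True on a selected column is unreliable in recent pandas)
df['age'] = df['age'].fillna(df['age'].mean())
df['city'] = df['city'].fillna('未知')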
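
To try the missing-value steps end to end, here is a small self-contained sketch. The DataFrame is invented for illustration, with one missing city and one missing age.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'city': ['北京', '北京', '上海', '上海', None],
    'age': [20.0, np.nan, 30.0, 40.0, 50.0],
})

missing_age = int(df['age'].isnull().sum())
print(missing_age)                     # 1

# Assign the filled column back to the DataFrame.
df['city'] = df['city'].fillna('未知')
df['age'] = df['age'].fillna(df['age'].mean())   # mean of 20, 30, 40, 50 is 35

print(df['age'].tolist())              # [20.0, 35.0, 30.0, 40.0, 50.0]
print(int(df.isnull().sum().sum()))    # 0
```

Filling with the mean is only one option; the median or a group-wise statistic may suit skewed data better.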

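Data transformation:

# Apply a function
df['age_squared'] = df['age'].apply(lambda x: x ** 2)

# Map values (assumes df has a 'gender' column)
df['gender_num'] = df['gender'].map({'男': 1, '女': 0})

# Replace values (assign the result back instead of using inplace=True)
df['city'] = df['city'].replace('北京', 'Beijing')

# Group and aggregate (assumes df also has a 'salary' column)
df.groupby('city')['age'].mean()
df.groupby(['city', 'gender']).agg({
    'age': ['mean', 'max'],
    'salary': 'sum'
})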
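
The grouping calls above refer to `gender` and `salary` columns that the earlier sample DataFrame does not define. The sketch below builds a small made-up DataFrame with those columns and uses named aggregation, which gives flat, readable column names in the result.

```python
import pandas as pd

df = pd.DataFrame({
    'city':   ['北京', '北京', '上海', '上海', '上海'],
    'gender': ['男', '女', '男', '男', '女'],
    'age':    [25, 35, 30, 40, 20],
    'salary': [100, 200, 300, 400, 500],
})

# Each keyword is an output column: name=(source column, aggregation).
summary = df.groupby('city').agg(
    mean_age=('age', 'mean'),
    max_age=('age', 'max'),
    total_salary=('salary', 'sum'),
)
print(summary)
```

For 北京 this gives a mean age of 30, a maximum age of 35, and a total salary of 300.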

Merging data:

# Vertical concatenation
df_combined = pd.concat([df1, df2], ignore_index=True)

# Horizontal merge (SQL-style JOIN)
df_merged = pd.merge(df1, df2, on='user_id', how='inner')
df_merged = pd.merge(df1, df2, on='user_id', how='left')
df_merged = pd.merge(df1, df2, left_on='id', right_on='user_id')
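
Since `df1` and `df2` are not defined above, the following sketch creates two small made-up tables to show how the `how` argument changes the result. The `indicator=True` option adds a `_merge` column that records where each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({'user_id': [1, 2, 3], 'name': ['a', 'b', 'c']})
df2 = pd.DataFrame({'user_id': [2, 3, 4], 'amount': [20, 30, 40]})

inner = pd.merge(df1, df2, on='user_id', how='inner')   # ids 2 and 3
left = pd.merge(df1, df2, on='user_id', how='left')     # ids 1, 2, 3; amount is NaN for id 1
outer = pd.merge(df1, df2, on='user_id', how='outer', indicator=True)

print(len(inner), len(left), len(outer))   # 2 3 4
print(outer[['user_id', '_merge']])
```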

Pivot tables:

# Create a pivot table
pivot = pd.pivot_table(
    df,
    values='sales',
    index='city',
    columns='month',
    aggfunc='sum',
    fill_value=0
)
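
The pivot table call above assumes `sales`, `city`, and `month` columns. A minimal sketch with invented data shows what the result looks like, including how `fill_value=0` handles a city and month combination with no rows.

```python
import pandas as pd

df = pd.DataFrame({
    'city':  ['北京', '北京', '上海', '上海', '北京'],
    'month': [1, 2, 1, 1, 1],
    'sales': [100, 200, 300, 50, 10],
})

pivot = pd.pivot_table(
    df,
    values='sales',
    index='city',
    columns='month',
    aggfunc='sum',
    fill_value=0,      # 上海 has no rows for month 2, so that cell becomes 0
)
print(pivot)
```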

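3. Matplotlib: Data Visualization

Matplotlib is the most fundamental visualization library in Python and offers a wide range of plotting functions.

3.1 Basic Plotting

Installation and import:

pip install matplotlib

import matplotlib.pyplot as plt
import numpy as np

# Configure a font that supports Chinese (SimHei ships with Windows; use an installed CJK font on other systems)
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly with this font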

Line chart:

x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)

plt.figure(figsize=(10, 6))
plt.plot(x, y1, label='sin(x)', color='blue', linewidth=2)
plt.plot(x, y2, label='cos(x)', color='red', linestyle='--')
plt.xlabel('x轴')
plt.ylabel('y轴')
plt.title('三角函数图像')
plt.legend()
plt.grid(True)
plt.show()

Scatter plot:

np.random.seed(42)
x = np.random.randn(100)
y = np.random.randn(100)
colors = np.random.rand(100)
sizes = 1000 * np.random.rand(100)

plt.figure(figsize=(10, 6))
plt.scatter(x, y, c=colors, s=sizes, alpha=0.5, cmap='viridis')
plt.colorbar()
plt.xlabel('X')
plt.ylabel('Y')
plt.title('散点图')
plt.show()

3.2 Chart Types

Bar chart:

categories = ['A', 'B', 'C', 'D', 'E']
values = [23, 45, 56, 78, 32]

plt.figure(figsize=(10, 6))
plt.bar(categories, values, color='skyblue', edgecolor='black')
plt.xlabel('类别')
plt.ylabel('数值')
plt.title('柱状图')
for i, v in enumerate(values):
    plt.text(i, v + 1, str(v), ha='center')
plt.show()

Grouped bar chart:

x = np.arange(5)
width = 0.35
values1 = [23, 45, 56, 78, 32]
values2 = [34, 23, 67, 45, 56]

fig, ax = plt.subplots(figsize=(10, 6))
rects1 = ax.bar(x - width/2, values1, width, label='组1')
rects2 = ax.bar(x + width/2, values2, width, label='组2')

ax.set_xlabel('类别')
ax.set_ylabel('数值')
ax.set_title('分组柱状图')
ax.set_xticks(x)
ax.set_xticklabels(['A', 'B', 'C', 'D', 'E'])
ax.legend()
plt.show()

Histogram:

data = np.random.randn(1000)

plt.figure(figsize=(10, 6))
plt.hist(data, bins=30, edgecolor='black', alpha=0.7)
plt.xlabel('数值')
plt.ylabel('频数')
plt.title('直方图')
plt.show()

Pie chart:

sizes = [30, 25, 20, 15, 10]
labels = ['A', 'B', 'C', 'D', 'E']
colors = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue', 'pink']
explode = (0.1, 0, 0, 0, 0)  # pull out the first slice

plt.figure(figsize=(8, 8))
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=90)
plt.title('饼图')
plt.show()

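3.3 Advanced Plotting

Subplots:

x = np.linspace(0, 10, 100)  # redefine x; earlier examples reused the name for other data

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Subplot 1
axes[0, 0].plot(x, np.sin(x))
axes[0, 0].set_title('sin(x)')

# Subplot 2
axes[0, 1].plot(x, np.cos(x))
axes[0, 1].set_title('cos(x)')

# Subplot 3 (tan(x) has vertical asymptotes, so the y-axis range becomes very large)
axes[1, 0].plot(x, np.tan(x))
axes[1, 0].set_title('tan(x)')

# Subplot 4
axes[1, 1].plot(x, x**2)
axes[1, 1].set_title('x^2')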

plt.tight_layout()
plt.show()
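
The tan(x) panel is hard to read because the function grows without bound near its asymptotes. One possible way to tidy it up, shown here as a sketch, is to replace large values with NaN so the line breaks there. The threshold of 10 and the use of the non-interactive `Agg` backend with an in-memory PNG are assumptions made so the example runs without a display.

```python
import io
import numpy as np
import matplotlib
matplotlib.use("Agg")              # non-interactive backend, no window needed
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 1000)
y = np.tan(x)

# Values beyond the (arbitrary) threshold become NaN, which Matplotlib skips.
y_masked = np.where(np.abs(y) > 10, np.nan, y)

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(x, y_masked)
ax.set_ylim(-10, 10)
ax.set_title('tan(x)')

buffer = io.BytesIO()
fig.savefig(buffer, format='png')  # write to memory instead of calling plt.show()
plt.close(fig)
```

In an interactive session, the backend line and the buffer can be dropped and `plt.show()` called as in the other examples.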

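Box plot:

data = [np.random.randn(100) for _ in range(5)]
labels = ['A', 'B', 'C', 'D', 'E']

plt.figure(figsize=(10, 6))
plt.boxplot(data)
plt.xticks(range(1, len(labels) + 1), labels)  # label the boxes; the labels= keyword of boxplot is deprecated in recent Matplotlib
plt.ylabel('数值')
plt.title('箱线图')
plt.show()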

Heatmap:

data = np.random.rand(10, 10)

plt.figure(figsize=(10, 8))
plt.imshow(data, cmap='hot', interpolation='nearest')
plt.colorbar()
plt.title('热力图')
plt.show()

4. Comprehensive Case Study

4.1 A Complete Data Analysis Workflow

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# 1. Generate simulated data
np.random.seed(42)
n = 1000

# Sales data
data = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=n, freq='D'),
    'product': np.random.choice(['手机', '电脑', '平板', '耳机'], n),
    'region': np.random.choice(['华东', '华南', '华北', '西部'], n),
    'sales': np.random.randint(1000, 10000, n),
    'quantity': np.random.randint(1, 50, n),
    'customer_age': np.random.randint(18, 65, n)
})

# 2. Data cleaning
# Check for missing values
print(data.isnull().sum())

# Add a unit price column
data['unit_price'] = data['sales'] / data['quantity']

# 3. Data analysis
# Aggregate by product
product_stats = data.groupby('product').agg({
    'sales': ['sum', 'mean', 'count'],
    'quantity': 'sum'
}).round(2)
print(product_stats)

# Total sales by region, largest first
region_stats = data.groupby('region')['sales'].sum().sort_values(ascending=False)
print(region_stats)

# 4. Data visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Sales by product
product_sales = data.groupby('product')['sales'].sum()
axes[0, 0].bar(product_sales.index, product_sales.values, color='skyblue')
axes[0, 0].set_title('各产品销售额')
axes[0, 0].set_ylabel('销售额')

# Share of sales by region
axes[0, 1].pie(region_stats.values, labels=region_stats.index, autopct='%1.1f%%')
axes[0, 1].set_title('各地区销售占比')

# Sales trend (monthly)
data['month'] = data['date'].dt.to_period('M')
monthly_sales = data.groupby('month')['sales'].sum()
axes[1, 0].plot(range(len(monthly_sales)), monthly_sales.values, marker='o')
axes[1, 0].set_title('月度销售趋势')
axes[1, 0].set_ylabel('销售额')

# Customer age distribution
axes[1, 1].hist(data['customer_age'], bins=20, edgecolor='black', alpha=0.7)
axes[1, 1].set_title('客户年龄分布')
axes[1, 1].set_xlabel('年龄')
axes[1, 1].set_ylabel('频数')

plt.tight_layout()
plt.savefig('sales_analysis.png', dpi=300)
plt.show()
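
The monthly totals in step 4 are computed by grouping on a period column. As an alternative sketch, `resample` on a datetime index gives the same totals without adding a helper column. The 90-day dataset and the `default_rng` seed below are assumptions chosen to keep the example small.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=90, freq='D'),
    'sales': rng.integers(1000, 10000, 90),
})

# Approach used in the workflow above: group by a monthly period.
by_period = data.groupby(data['date'].dt.to_period('M'))['sales'].sum()

# Alternative: resample on a datetime index ('MS' labels each bin by month start).
by_resample = data.set_index('date')['sales'].resample('MS').sum()

print(len(by_period), len(by_resample))   # 3 3
print(by_period.to_numpy().tolist() == by_resample.to_numpy().tolist())   # True
```

The resampled result keeps a `DatetimeIndex`, which Matplotlib can use directly as the x-axis of a trend plot.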

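5. Summary

NumPy, Pandas, and Matplotlib are the three essential tools of Python data science:

NumPy: the foundation of high-performance numerical computing, providing multidimensional arrays and a broad set of mathematical functions
Pandas: a flexible data processing tool, well suited to cleaning and analyzing structured data
Matplotlib: a powerful visualization library that supports a wide variety of chart types

Learning suggestions:

Progress step by step: master NumPy array operations first, then learn Pandas data processing, and finally practice visualization
Practice hands-on: work with real datasets to deepen your understanding
Read the official documentation: consult it promptly when you run into problems
Pay attention to performance: with large datasets, prefer vectorized operations over loops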
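
The last suggestion can be made concrete with a small comparison. This sketch computes a sum of squares once with a Python loop and once with a single NumPy expression; the function names and the array size are chosen only for illustration, and the timings will vary by machine.

```python
import time
import numpy as np

values = np.arange(100_000, dtype=np.float64)

def sum_of_squares_loop(a):
    """Pure-Python loop: one interpreter step per element."""
    total = 0.0
    for v in a:
        total += v * v
    return total

def sum_of_squares_vectorized(a):
    """Single NumPy expression: the loop runs in compiled code."""
    return float(np.sum(a * a))

start = time.perf_counter()
slow = sum_of_squares_loop(values)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = sum_of_squares_vectorized(values)
vectorized_seconds = time.perf_counter() - start

print(np.isclose(slow, fast))    # True
print(f"loop: {loop_seconds:.4f}s, vectorized: {vectorized_seconds:.4f}s")
```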

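Once you have mastered these three libraries, you have the foundation for data analysis and machine learning and can move on to more advanced frameworks such as Scikit-learn and TensorFlow.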
