TensorFlow Pitfalls: Hands-On Notes from Tensors to Neural Networks

2023-01-18

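I hit quite a few pitfalls while working with TensorFlow, so here are my notes on the core concepts and the code that goes with them: tensor operations, Session management, and building a neural network. The examples use the TF 1.x graph-mode API through tensorflow.compat.v1.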

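What a tensor actually is

TensorFlow expresses computation as a dataflow graph: nodes are mathematical operations, and edges are tensors (multidimensional arrays). That is where the name comes from: tensors flow between nodes.

Rank	Corresponds to	Example
0	a single number (scalar)	`1`
1	array (vector)	`[1, 2, 3]`
2	matrix	`[[1,2,3],[4,5,6]]`
3+	higher-dimensional array	image data, etc.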

import tensorflow.compat.v1 as tf
import numpy as np

# Create tensors of different ranks
scalar = tf.constant(1)
vector = tf.constant([1, 2, 3])
matrix = tf.constant([[1, 2, 3], [4, 5, 6]])
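
One way to read the table: the rank is just the nesting depth of the data. Here is a minimal plain-Python sketch of that idea. rank_and_shape is a made-up helper for illustration, not a TensorFlow API, and it assumes the nesting is regular; on a real tensor you would read tensor.shape instead.

```python
# Hypothetical helper (not part of TensorFlow): infer rank and shape
# from a nested Python list, assuming every level has uniform length.
def rank_and_shape(value):
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        value = value[0] if value else None
    return len(shape), tuple(shape)

print(rank_and_shape(1))                       # (0, ())
print(rank_and_shape([1, 2, 3]))               # (1, (3,))
print(rank_and_shape([[1, 2, 3], [4, 5, 6]]))  # (2, (2, 3))
```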

Linear regression example

A simple linear regression in TensorFlow, fitting y = 0.1x + 0.3:

import tensorflow.compat.v1 as tf
import numpy as np

tf.disable_eager_execution()

# Generate some training data
x_data = np.random.rand(100).astype(np.float32)
y_data = x_data * 0.1 + 0.3

# Model parameters: random initial weight, zero initial bias
Weights = tf.Variable(tf.random.uniform([1], -1.0, 1.0))
biases = tf.Variable(tf.zeros([1]))

# Prediction
y = Weights * x_data + biases

# Loss: mean squared error
loss = tf.reduce_mean(tf.square(y - y_data))

# Optimizer: gradient descent with learning rate 0.5
optimizer = tf.train.GradientDescentOptimizer(0.5)
train = optimizer.minimize(loss)

# Initialize variables and train
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    for step in range(201):
        sess.run(train)
        if step % 20 == 0:
            print(f"Step {step}: Weights={sess.run(Weights)}, biases={sess.run(biases)}")
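
To make it clearer what the optimizer is doing on each sess.run(train), here is a rough plain-Python sketch of gradient descent on the same mean-squared-error loss. It is an illustration under my own assumptions, not TensorFlow's implementation: the inputs are a fixed grid instead of random samples, and the starting point is fixed at w = -1.0, b = 0.0.

```python
# Plain-Python sketch of gradient descent on the MSE loss.
# Assumptions: fixed grid inputs and a fixed starting point
# (the TensorFlow version above uses random ones).
x_data = [i / 100 for i in range(100)]
y_data = [0.1 * x + 0.3 for x in x_data]

w, b = -1.0, 0.0
learning_rate = 0.5
n = len(x_data)

for step in range(201):
    # Prediction error for every sample
    errors = [w * x + b - y for x, y in zip(x_data, y_data)]
    # Gradients of mean((w*x + b - y)^2) with respect to w and b
    grad_w = 2 / n * sum(e * x for e, x in zip(errors, x_data))
    grad_b = 2 / n * sum(errors)
    # Step against the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    if step % 50 == 0:
        print(f"Step {step}: w={w:.4f}, b={b:.4f}")
```

The values should settle close to w = 0.1 and b = 0.3, the same targets as the TensorFlow run.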

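Two ways to write a Session

In graph mode, running the computation graph requires a Session. On TF 2.x, tf.Session only exists under the compat.v1 API, and eager execution has to be disabled first. There are two ways to write it:

import tensorflow.compat.v1 as tf

tf.disable_eager_execution()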

matrix1 = tf.constant([[3, 3]])
matrix2 = tf.constant([[2], [2]])
product = tf.matmul(matrix1, matrix2)

# Option 1: close the session manually
sess = tf.Session()
result = sess.run(product)
print(result)
sess.close()

# Option 2: context manager (recommended), closes the session automatically
with tf.Session() as sess:
    result2 = sess.run(product)
    print(result2)
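
Why does nothing get computed until sess.run? A toy model in plain Python may help. Node, constant, matmul and run below are made-up names for this sketch, not TensorFlow APIs; the point is only that building the graph and executing it are separate steps.

```python
# Toy model of deferred execution. All names here are hypothetical,
# not TensorFlow APIs.
class Node:
    def __init__(self, op, *inputs):
        self.op = op          # function that computes this node's value
        self.inputs = inputs  # upstream nodes feeding this one

def constant(value):
    return Node(lambda: value)

def matmul(a, b):
    def op(x, y):
        # Row-by-column products for plain nested lists
        return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)]
                for row in x]
    return Node(op, a, b)

def run(node):
    """Evaluate upstream nodes first, then this node."""
    return node.op(*(run(i) for i in node.inputs))

matrix1 = constant([[3, 3]])
matrix2 = constant([[2], [2]])
product = matmul(matrix1, matrix2)  # only builds the graph, no arithmetic yet
print(run(product))                 # [[12]]
```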

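Variables

A Variable holds model parameters. During training the optimizer updates it for you; in the counter example below it is updated explicitly with tf.assign:

import tensorflow.compat.v1 as tf

tf.disable_eager_execution()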

state = tf.Variable(0, name='counter')
one = tf.constant(1)
new_value = tf.add(state, one)
update = tf.assign(state, new_value)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    for _ in range(3):
        sess.run(update)
        print(sess.run(state))  # prints 1, 2, 3

Placeholders

A Placeholder is how you pass data into the graph from outside, through feed_dict at run time:

import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

input1 = tf.placeholder(tf.float32)
input2 = tf.placeholder(tf.float32)
output = tf.multiply(input1, input2)

with tf.Session() as sess:
    result = sess.run(output, feed_dict={input1: [7.], input2: [2.]})
    print(result)  # [14.]

Building a neural network

Wrap layer creation in an add_layer function, then stack two layers into a simple network:

import os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
import tensorflow.compat.v1 as tf
import numpy as np
import matplotlib.pyplot as plt

tf.disable_eager_execution()

def add_layer(inputs, in_size, out_size, activation_function=None):
    Weights = tf.Variable(tf.random_normal([in_size, out_size]))
    biases = tf.Variable(tf.zeros([1, out_size]) + 0.1)
    Wx_plus_b = tf.matmul(inputs, Weights) + biases

    if activation_function is None:
        outputs = Wx_plus_b
    else:
        outputs = activation_function(Wx_plus_b)
    return outputs

# Generate noisy training data
x_data = np.linspace(-1, 1, 300, dtype=np.float32)[:, np.newaxis]
noise = np.random.normal(0, 0.05, x_data.shape).astype(np.float32)
y_data = np.square(x_data) - 0.5 + noise

# Placeholders for inputs and targets
xs = tf.placeholder(tf.float32, [None, 1])
ys = tf.placeholder(tf.float32, [None, 1])

# Network: 1 input -> hidden layer of 10 units -> 1 output
l1 = add_layer(xs, 1, 10, activation_function=tf.nn.relu)
prediction = add_layer(l1, 10, 1, activation_function=None)

# Loss and optimizer (axis replaces the deprecated reduction_indices argument)
loss = tf.reduce_mean(tf.reduce_sum(tf.square(ys - prediction), axis=[1]))
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

# Training
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)

# Visualization
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.scatter(x_data, y_data)
plt.ion()
plt.show()

for i in range(1000):
    sess.run(train_step, feed_dict={xs: x_data, ys: y_data})
    if i % 50 == 0:
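        try:
            # Remove the previous fit line; ax.lines can no longer be mutated
            # directly in recent Matplotlib versions
            ax.lines[0].remove()
        except IndexError:
            pass  # nothing drawn yet on the first pass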
        prediction_value = sess.run(prediction, feed_dict={xs: x_data})
        ax.plot(x_data, prediction_value, 'r-', lw=5)
        plt.pause(0.1)

sess.close()  # release the session's resources once training is done
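
The loss line packs two reductions into one expression, so here is a plain-Python sketch of what they do, on made-up numbers rather than the training data above: sum the squared errors across each row (axis 1), then average those per-sample sums.

```python
# Plain-Python sketch of the two-stage loss reduction, on made-up numbers.
ys = [[1.0], [2.0], [3.0]]
prediction = [[0.5], [2.5], [3.0]]

# Stage 1: sum squared errors across each row (axis 1), one value per sample
per_sample = [sum((y - p) ** 2 for y, p in zip(y_row, p_row))
              for y_row, p_row in zip(ys, prediction)]

# Stage 2: average over samples to get a single scalar loss
loss = sum(per_sample) / len(per_sample)

print(per_sample)  # [0.25, 0.25, 0.0]
print(loss)
```

With a single output unit, as in this network, the row sum has only one term, so the result equals the plain mean squared error.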

Pitfalls I ran into

The libiomp5md.dll error

You may hit this when using PyTorch or TensorFlow:

Initializing libiomp5md.dll, but found libiomp5md.dll already initialized

Temporary workaround (add to your code):

import os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
# Note: this must run before import torch (or, as in the network example above, before importing tensorflow)

Permanent fix:
Keep the libiomp5md.dll under site-packages\torch\lib and delete the files with the same name in other paths.
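
Before deleting anything, it helps to see where the copies actually are. The following is a small sketch that only lists matches and deletes nothing. find_copies is a made-up helper name, and scanning sys.prefix (the root of the active Python environment) is my assumption about where the duplicates usually live.

```python
import os
import sys

def find_copies(root, filename="libiomp5md.dll"):
    """Return every path under root whose file name matches (case-insensitive)."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower() == filename.lower():
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)

# The DLL only exists on Windows, so skip the scan elsewhere.
if sys.platform == "win32":
    for path in find_copies(sys.prefix):
        print(path)
```

If more than one path shows up, those are the candidates for the cleanup described above. Making a backup before removing any of them is a sensible precaution.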

Version compatibility

TensorFlow	Python	CUDA
2.x	3.6-3.9	11.2+ (varies by minor release)
1.15	3.6-3.7	10.0
1.14	3.5-3.7	10.0

Environment setup

# Create an environment with Anaconda
conda create -n tensorflow_env python=3.8
conda activate tensorflow_env

# Install TensorFlow (using the Tsinghua PyPI mirror)
pip install tensorflow -i https://pypi.tuna.tsinghua.edu.cn/simple

# Install visualization and data tools
pip install tensorboard matplotlib numpy pandas
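
After installing, a quick sanity check is to ask Python which versions it can see. Here is a minimal sketch using the standard library's importlib.metadata (Python 3.8+); installed_version is a made-up helper name, and the package list simply mirrors the pip commands above.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# Report what the active environment actually has installed
for name in ["tensorflow", "tensorboard", "matplotlib", "numpy", "pandas"]:
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

This reads package metadata without importing the packages, so it stays fast even when TensorFlow itself is slow to load.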

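A few optimization tips

GPU acceleration needs the matching CUDA driver installed
Preprocessing data with the tf.data API is more efficient
Save checkpoints regularly so an interrupted run is not wasted
Tune batch size to your GPU memory; don't set it too large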
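
On the batch size point, here is what batching amounts to, as a plain-Python sketch. The batches generator is a made-up helper for illustration; in a real pipeline tf.data.Dataset.batch does this for you.

```python
# Hypothetical helper: split a list of samples into fixed-size batches.
def batches(samples, batch_size):
    for start in range(0, len(samples), batch_size):
        # The last batch may be smaller than batch_size
        yield samples[start:start + batch_size]

data = list(range(10))
for batch in batches(data, batch_size=4):
    print(batch)
```

Each training step only has to hold one batch in memory, which is why a smaller batch_size is the first thing to try when the GPU runs out of memory.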

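That covers the core: tensors, Session, Variable, Placeholder, plus building a neural network. All of the code was actually run, and the problematic parts have been marked.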
