NLP Project Workflow

2021-02-19 15:01:45

Contents

    • 1. Google Colab Setup
    • 2. Writing the Code
    • 3. Flask Microservice
    • 4. Packaging into a Container
    • 5. Container Hosting

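Reference: 基于深度学习的自然语言处理 (Deep Learning-Based Natural Language Processing)

These notes use the data from this article (sentiment classification) for practice.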

1. Google Colab Setup

Colab address

  • Create a new notebook
  • Open the notebook settings
  • Select GPU/TPU hardware acceleration
  • Test whether a GPU has been allocated

import tensorflow as tf
tf.test.gpu_device_name()

Output:

/device:GPU:0

  • Upload the data to Google Drive and load it in Colab
  • Unzip the data
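
The upload and unzip steps can be sketched roughly as follows. The helper name `extract_archive` and the paths in the comments are hypothetical; the Drive mount call only works inside Colab, so it is left commented out here.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract zip_path into dest_dir and return the sorted member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
        return sorted(zf.namelist())


# Inside Colab, Google Drive is mounted first (assumed paths, adjust to your own):
# from google.colab import drive
# drive.mount('/content/drive')
# extract_archive('/content/drive/MyDrive/data.zip', '/content/data')
```

After extraction, the text file can be read with `pd.read_csv` from the destination folder.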

2. Writing the Code

import numpy as np
import pandas as pd

data = pd.read_csv("yelp_labelled.txt", sep='\t', names=['sentence', 'label'])

data.head() # 1000 rows in total

# Features X and labels y
sentence = data['sentence'].values
label = data['label'].values

# Split into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(sentence, label, test_size=0.2, random_state=1)

#%%

max_features = 2000

# Vectorize the text
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(X_train) # fit the tokenizer on the training set only
X_train = tokenizer.texts_to_sequences(X_train) # convert to [[ids...],[ids...],...]
X_test = tokenizer.texts_to_sequences(X_test)
vocab_size = len(tokenizer.word_index) + 1 # +1 because index 0 maps to no word and is used for padding

maxlen = 50
# Pad so that every sentence has the same length
from keras.preprocessing.sequence import pad_sequences
X_train = pad_sequences(X_train, maxlen=maxlen, padding='post')
# 'post' pads zeros at the end, 'pre' pads zeros at the front
X_test = pad_sequences(X_test, maxlen=maxlen, padding='post')

#%%

embed_dim = 256
hidden_units = 64

from keras.models import Model, Sequential
from keras.layers import Dense, LSTM, Embedding, Bidirectional, Dropout
model = Sequential()
model.add(Embedding(input_dim=max_features,output_dim=embed_dim,
                    input_length=maxlen))
model.add(Bidirectional(LSTM(hidden_units)))
model.add(Dropout(0.3))
model.add(Dense(1, activation='sigmoid')) # sigmoid for binary classification, softmax for multi-class

model.compile(optimizer='adam', loss='binary_crossentropy',metrics=['accuracy'])
model.summary()
from keras.utils import plot_model
plot_model(model, show_shapes=True, to_file='model.jpg') # draw the model structure to a file

#%%

history = model.fit(X_train,y_train,batch_size=64,
             epochs=100,verbose=2,validation_split=0.1)
# verbose controls logging: 0 = silent, 1 = progress bar, 2 = one line per epoch without a progress bar
loss, accuracy = model.evaluate(X_train, y_train, verbose=1)
print("Training set: loss {0:.3f}, accuracy: {1:.3f}".format(loss, accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=1)
print("Test set: loss {0:.3f}, accuracy: {1:.3f}".format(loss, accuracy))

# Plot the training curves
from matplotlib import pyplot as plt
import pandas as pd
his = pd.DataFrame(history.history)
loss = history.history['loss']
val_loss = history.history['val_loss']
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

plt.plot(loss, label='train Loss')
plt.plot(val_loss, label='valid Loss')
plt.title('Training and Validation Loss')
plt.legend()
plt.grid()
plt.show()

plt.plot(acc, label='train Acc')
plt.plot(val_acc, label='valid Acc')
plt.title('Training and Validation Acc')
plt.legend()
plt.grid()
plt.show()

#%%

model.save('trained_model.h5')

import pickle
with open('trained_tokenizer.pkl','wb') as f:
    pickle.dump(tokenizer, f)

# Download to the local machine
from google.colab import files
files.download('trained_model.h5')
files.download('trained_tokenizer.pkl')
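
To make the vectorization and padding steps above easier to follow, here is a simplified pure-Python sketch of the idea. It is not the Keras implementation: the function names are made up for illustration, and it only lowercases and splits on whitespace, whereas the real `Tokenizer` also filters punctuation. It does mirror three behaviors of the Keras calls: id 0 is reserved for padding, `num_words` keeps only ids below that value, and sequences longer than `maxlen` lose tokens from the front by default.

```python
from collections import Counter


def fit_word_index(texts, num_words=None):
    """Assign ids by descending word frequency; id 0 stays free for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    if num_words is not None:
        ranked = ranked[:num_words - 1]  # keep only ids 1 .. num_words-1
    return {word: i + 1 for i, word in enumerate(ranked)}


def texts_to_ids(texts, word_index):
    """Replace each known word by its id; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]


def pad_post(sequences, maxlen):
    """Pad with zeros at the end; overly long sequences keep their last maxlen ids."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]
        padded.append(kept + [0] * (maxlen - len(kept)))
    return padded


word_index = fit_word_index(["very nice food", "very bad"])
ids = texts_to_ids(["very nice", "awful food"], word_index)
print(pad_post(ids, maxlen=4))  # [[1, 2, 0, 0], [3, 0, 0, 0]]
```

The word "awful" never appeared when fitting, so it is dropped, which is also why the tokenizer has to be saved together with the model: the ids only mean something relative to the fitted vocabulary.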

3. Flask Microservice

  • I don't fully understand the following part yet, so I'm copying it as-is

Write app.py

# Flask
import pickle
import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model
def load_var():
    global model, tokenizer
    model = load_model('trained_model.h5')
    model.make_predict_function()
    with open('trained_tokenizer.pkl','rb') as f:
        tokenizer = pickle.load(f)

maxlen = 50
def process_txt(text):
    x = tokenizer.texts_to_sequences(text)
    x = pad_sequences(x, maxlen=maxlen, padding='post')
    return x

#%%

from flask import Flask, request, jsonify
app = Flask(__name__)

@app.route('/')
def home_routine():
    return "hello NLP!"

#%%

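@app.route("/prediction",methods=['POST'])
def get_prediction():
    # The request body is a JSON list of sentences
    data = request.get_json()
    x = process_txt(data)
    prob = model.predict(x)  # shape (n, 1): one sigmoid probability per sentence
    # A single sigmoid output needs a threshold; argmax over axis -1 would always give 0
    pred = (prob > 0.5).astype(int).ravel()
    return str(pred)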

#%%

if __name__ == "__main__":
    load_var()
    app.run(debug=True)
    # In production this should be app.run(host='0.0.0.0', port=80)

  • Run python app.py
  • In Windows PowerShell, enter: Invoke-WebRequest -Uri 127.0.0.1:5000/prediction -ContentType 'application/json' -Body '["The book was very poor", "Very nice", "bad, oh no", "i love you"]' -Method 'POST'
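
The same request can also be sent from Python, which is handy on machines without PowerShell. This is a minimal sketch using only the standard library; the function names are hypothetical, and the URL assumes the app is running locally on Flask's default port 5000 as above.

```python
import json
import urllib.request

DEFAULT_URL = "http://127.0.0.1:5000/prediction"


def build_prediction_request(sentences, url=DEFAULT_URL):
    """Build a POST request whose JSON body is a list of sentences."""
    body = json.dumps(sentences).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def request_prediction(sentences, url=DEFAULT_URL):
    """Send the sentences to the running app and return the response text."""
    req = build_prediction_request(sentences, url)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8")


# With app.py running in another terminal:
# print(request_prediction(["The book was very poor", "Very nice"]))
```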

This returns the prediction results.
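
A note on reading the results: the model ends in a single sigmoid unit, so `model.predict` gives one probability per sentence in an array of shape (n, 1). Taking an argmax over the last axis of such an array always yields 0, so the label has to come from a threshold instead. A small sketch, with a hypothetical helper name and made-up probabilities, assuming label 1 marks a positive sentence:

```python
def probs_to_labels(probs, threshold=0.5):
    """Turn rows of one sigmoid probability each into 0/1 labels."""
    return [int(row[0] > threshold) for row in probs]


# Made-up probabilities for four sentences, not real model output
example_probs = [[0.08], [0.93], [0.21], [0.77]]
print(probs_to_labels(example_probs))  # [0, 1, 0, 1]
```

The same function also works on the NumPy array returned by `model.predict`, since it only indexes the first element of each row.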

4. Packaging into a Container

  • Next, Docker is used to package the application into a container
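
One detail matters once the app runs inside a container: the server has to listen on 0.0.0.0 rather than 127.0.0.1, otherwise requests forwarded from the host never reach it. A possible way to switch between local and container settings without editing app.py is to read them from environment variables. The variable names `APP_HOST`, `APP_PORT`, and `APP_DEBUG` are assumptions made up for this sketch.

```python
import os


def server_config(env=None):
    """Return keyword arguments for app.run, read from environment variables."""
    env = os.environ if env is None else env
    return {
        "host": env.get("APP_HOST", "0.0.0.0"),      # listen on all interfaces
        "port": int(env.get("APP_PORT", "80")),
        "debug": env.get("APP_DEBUG", "0") == "1",  # debug stays off unless requested
    }


# In app.py, the last line would then become:
# app.run(**server_config())
```

For local development, setting `APP_PORT=5000` and `APP_DEBUG=1` reproduces the behavior of `app.run(debug=True)` apart from the host binding.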

5. Container Hosting

  • Host the container on a web service, such as an AWS EC2 instance
