NLP: Using an RNN to Solve the POS Tagging Problem


Previous post: NLP: The HMM Model and a Worked Example

————————————————————————————————

Hello everyone!

This post is a write-up of one of the projects from this course. The project is a complete deep learning model build, from start to finish, and it is quite challenging for deep learning beginners, because all kinds of unexpected problems come up along the way. This article therefore walks through the whole process, from reading the data to running the model.

One caveat: the assignment data cannot be made public, so the code here cannot be run as-is. Given my limited time, I am also unlikely to build a substitute dataset and redo a similar task, so I hope you will understand. I do, however, add detailed explanations to every part, describing the problems that can appear along the way, and I hope the overall framework gives readers going through a complete deep learning pipeline for the first time some useful ideas.

One more thing: since I will also maintain this content on GitHub, the main text is written in English. I know this is an obstacle for many readers, but personal development comes first for now, and most high-quality materials and code are in English anyway. I apologize for the inconvenience and hope you will bear with me.

Let's get started.

Table of Contents

  • Background
  • Code Sections
    • Build Dataset
    • Build Indices
    • Batch Training and Padding
    • Use Pre-trained Word Embedding
    • Constructing Deep Learning Model
    • Train the Model
    • Test the Model

Background

This is a project about using an RNN (Recurrent Neural Network) to solve a POS (Part of Speech) tagging problem in NLP (Natural Language Processing). Before we start, let me give a little background. Please note that, due to time limits, we will not describe the RNN models used in this project themselves.

In natural language, each sentence consists of different kinds of words, such as nouns, verbs, adjectives, prepositions and so on. These categories are called lexical tags (or POS tags). Of course, different languages have different tag sets; for example, Japanese has no formal category of "prepositions" (particles such as て play that role instead). These tags are very useful for parsing sentences, since one word may carry several possible lexical tags, and thus several possible meanings, which can cause severe ambiguity for natural language processing. That is why POS tagging is an important task, and why we want to use traditional machine learning and deep learning models to predict the tag of each word from training and test datasets.

Code Sections

Step 1: Build Dataset

Firstly we need to load the data.

import pickle, argparse, os, sys
from sklearn.metrics import accuracy_score
import numpy as np
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
import pandas as pd
from collections import Counter
import math

# In fact, not all imported packages are used

file_1 = "../input/hw2/wsj1-18.training"

Tokens = []
Samples = []
Labels = []
Labels_total = []
with open(file_1) as e:
    x = e.readlines()
    for sentence in x:
        t = sentence.split()
        words = t[0::2]
        labels = t[1::2]
        Tokens.extend(words)
        Samples.append(words)
        Labels.append(labels)
        Labels_total.extend(labels)

To help you understand how the words and labels are extracted, here is a sample line from the data.

Pierre NNP Vinken NNP , , 61 CD years NNS old JJ , , will MD join VB the DT board NN as IN a DT nonexecutive JJ director NN Nov. NNP 29 CD . .

As you can see, each word is followed by its tag (such as NNP, CD, etc.; tags are written in capital letters), separated by a space, which is why we can simply use the split() method. In the dataset, each sentence occupies one line, so readlines() gives us a list of sentences. We then use t[0::2] and t[1::2], i.e. "every other token starting from the first" and "every other token starting from the second", to select the words and the labels respectively.
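For example, here is a minimal, self-contained illustration of this slicing on (a prefix of) the sample line above; it is a toy snippet, not part of the project code:

line = "Pierre NNP Vinken NNP , , 61 CD years NNS old JJ"
t = line.split()
words = t[0::2]   # every other token, starting from the first
labels = t[1::2]  # every other token, starting from the second
print(words)   # ['Pierre', 'Vinken', ',', '61', 'years', 'old']
print(labels)  # ['NNP', 'NNP', ',', 'CD', 'NNS', 'JJ']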

Also, Tokens collects every word occurrence in the training data and is later used to build the dictionary between words and indices (numbers). Samples is a list of sentences (each sentence is itself a list of split words). Labels is a list of POS tag sequences; each word in each sentence corresponds to one label.

Next we count the data and deal with UNKA.

Dic = Counter()
WordLabel = Counter()

for i in Tokens:
    Dic[i] += 1

j = 0
UNK_words = set()
for i in Dic:
    if (Dic[i] <= 2):
        UNK_words.add(i)
    j += 1
print(len(UNK_words))

Tokens = ["UNKA" if word in UNK_words else word for word in Tokens]
for index, wordlist in enumerate(Samples):
    Samples[index] = ["UNKA" if word in UNK_words else word for word in wordlist]

What is UNKA? It stands for an unknown word. Quite often, especially when the training data is limited, some words in the test set never appear in the training set. That is where UNKA comes in: such missing words are usually not very common or frequent words anyway, so we simply replace them with a single unified symbol, UNKA, and this does little harm. After the replacement, we also need UNKA symbols in the training data so that the model can learn it (every symbol needs training data, including UNKA). A rule of thumb is to turn infrequent words in the training data into UNKA. That is why we count the number of occurrences of each word: if a word occurs no more than twice, we replace it with UNKA.

In the dataset I worked with, the UNKA symbols had already been inserted into the test set, and every word in the test data appears in the training data, so no extra processing is needed for the test set; we only need to handle the training set.

We extract the test dataset in a similar way, as shown below.

file_2 = "../input/hw2/wsj19-21.testing"

Tokens_test = []
Samples_test = []
with open(file_2) as e:
    x = e.readlines()
    for sentence in x:
        t = sentence.split()
        # print(t)
        Tokens_test.extend(t)
        Samples_test.append(t)
        
file_3 = "../input/hw2/wsj19-21.truth"

Labels_test = []
Labels_test_total = []
with open(file_3) as e:
    x = e.readlines()
    for sentence in x:
        t = sentence.split()
        labels = t[1::2]
        Labels_test.append(labels)
        # print(t)
        Labels_test_total.extend(labels)

Step 2: Build Indices

The next step is to build the mapping between words and integer indices, so that we can feed matrices of numbers into the model for training. The mapping is stored in a dictionary and reused later when we convert the test dataset into numbers.

Labels_set = list(set(Labels_total)) # Remove duplicate labels
Labels_set = sorted(Labels_set)
Labels_set = [i for i in enumerate(Labels_set)]

Labels_index = dict()
for index, word in Labels_set:
    Labels_index[word] = index

Labels_number = []
for index, labellist in enumerate(Labels):
    Labels_number.append([Labels_index[word] for word in labellist])

Tokens = list(set(Tokens)) # Remove duplicate tokens
Tokens = sorted(Tokens)

Tokens_index = dict()
for index, word in enumerate(Tokens):
    Tokens_index[word] = index

Samples_number = []
for index, samplelist in enumerate(Samples):
    Samples_number.append([Tokens_index[word] for word in samplelist])

dict is, unsurprisingly, an excellent data structure for storing a dictionary; otherwise the developers would have had to pick another name for it.

Here is a trick: we sort the tokens and labels (Tokens = sorted(Tokens)). Why? Because a set has no fixed order; without a fixed seed, Python may effectively produce a different word-to-number mapping on each run. This can become a hidden bug, because people often save a model and load it again later for further training or prediction (especially when the model takes too long to train in one go). If the program generates a different mapping each time, the data fed to the saved model no longer means the same thing as the data it was trained on, which we must avoid, since this very dictionary is also used to map the test dataset to numbers. You do not want the word "the" to have index 1 when you train the model and index 2 when you use it to predict, right?

Another question: why not simply fix a random seed to work around this? Because sometimes that is not enough either. If you train and predict on the same machine, you may be fine, but if you train the model on Kaggle with a GPU and run prediction on your local computer, the same seed can still produce a different mapping. One robust fix is to sort the words, since sorted order does not depend on where the code runs. Another option is to save the dictionary built on one machine (e.g. on Kaggle) and load it everywhere else, instead of rebuilding this part in different environments.
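Here is a minimal sketch of that second option, saving and loading the mappings with pickle (which is already imported above); the file name token_index.pkl is just a placeholder I made up:

# Save the mappings once (e.g. on Kaggle) ...
with open("token_index.pkl", "wb") as f:
    pickle.dump({"tokens": Tokens_index, "labels": Labels_index}, f)

# ... and load them everywhere else instead of rebuilding them.
with open("token_index.pkl", "rb") as f:
    mappings = pickle.load(f)
Tokens_index = mappings["tokens"]
Labels_index = mappings["labels"]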

Part of the dictionary for Tokens is shown below.

{'!': 0,
 '#': 1,
 '$': 2,
 '%': 3,
 '&': 4,
 "'": 5,
 "''": 6,
 "'40s": 7,
 ...

Similar procedures are applied on the test dataset.

Labels_test_number = []
for index, labellist in enumerate(Labels_test):
    Labels_test_number.append([Labels_index[word] for word in labellist])

Samples_test_number = []
for index, samplelist in enumerate(Samples_test):
    Samples_test_number.append([Tokens_index[word] for word in samplelist])

We want to emphasize again that the dictionary built from the training set is reused to map the test set. That is why the order of words matters so much, and why we sort the words first to avoid the bug described above.

By the way, here is how to fix the random seeds.

SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

Step 3: Batch Training and Padding

Mini-batch training matters a lot in deep learning: if we do not cut the dataset into batches, it is simply too large for the model to consume at once (this project provides nearly 40k sentences). A common way to split the data is to group the sentences into batches; for example, with a batch size of 100, each batch contains 100 sentences. The model sees one batch at a time and updates its weights via forward and backward propagation.

Here is another trick required for model training: padding. As we all know, different sentences have different lengths, so padding means appending a symbol with no actual meaning (such as <PAD>) to the end of each sentence until all sentences in a batch have the same length. The goal is to build a tensor (a common data structure in deep learning; you can think of it as a multi-dimensional array) suitable for model training. In this project I use the vocabulary size itself (one larger than the largest real word index) as the padding index. Remember this index, because it matters for the loss function later.
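As a quick toy illustration (made-up indices, not the project code), padding two index sequences to the length of the longer one looks like this:

# toy example: pad two index sequences to the same length with a pad index of 99
batch = [[4, 8, 15], [16, 23]]
pad_idx = 99
max_len = max(len(s) for s in batch)
padded = [s + [pad_idx] * (max_len - len(s)) for s in batch]
print(padded)  # [[4, 8, 15], [16, 23, 99]]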

We first create the batches, then pad within each batch.

# Batch size
batch_size = 100
# print(padded_Samples.shape[0] // batch_size)
# Keep only a multiple of batch_size sentences, then group them into batches.
# dtype=object is needed because the sentences have different lengths (ragged arrays).
Samples_main = np.array(Samples_number[:len(Samples_number) // batch_size * batch_size], dtype=object)

Samples_batch = Samples_main.reshape((Samples_main.shape[0] // batch_size, batch_size, -1))

Labels_main = np.array(Labels_number[:len(Labels_number) // batch_size * batch_size], dtype=object)

Labels_batch = Labels_main.reshape((Labels_main.shape[0] // batch_size, batch_size, -1))
padded_Samples = []
Samples_lengths = []
for Samples_batch_number in Samples_batch:
    Samples_lengths_batch = [len(sentence[0]) for sentence in Samples_batch_number]
    longest_sent = max(Samples_lengths_batch)
    batch_size = len(Samples_batch_number)
    padded_Samples_batch = np.ones((batch_size, longest_sent)) * 16925
    for i, x_len in enumerate(Samples_lengths_batch):
        sequence = Samples_batch_number[i]
        padded_Samples_batch[i, 0:x_len] = sequence[0][:x_len]
    padded_Samples.append(padded_Samples_batch)
    Samples_lengths.append(Samples_lengths_batch)

padded_Labels = []
for Labels_batch_number in Labels_batch:
    Labels_lengths = [len(sentence[0]) for sentence in Labels_batch_number]
    longest_sent = max(Labels_lengths)
    batch_size = len(Labels_batch_number)
    padded_Labels_batch = np.ones((batch_size, longest_sent)) * 45
    for i, x_len in enumerate(Labels_lengths):
        sequence = Labels_batch_number[i]
        padded_Labels_batch[i, 0:x_len] = sequence[0][:x_len]
    padded_Labels.append(padded_Labels_batch)

The numbers 16925 and 45 are my padding indices for words and labels, respectively. Since the vocabulary contains 16925 tokens, real word indices run from 0 to 16924, so 16925 is free to serve as the word padding index; similarly, the 45 label indices run from 0 to 44, so 45 serves as the label padding index. After this step you should get a list in which each element is an array of size m × n, where m is the batch size (100 in this project) and n is the maximum sentence length within that batch. Of course, n differs from batch to batch.

Here is one wonderful article introducing padding.

https://towardsdatascience.com/taming-lstms-variable-sized-mini-batches-and-why-pytorch-is-good-for-your-health-61d35642972e

Step 4: Use Pre-trained Word Embedding

What is word embedding? In NLP each word is mapped to an index, but the index is just an arbitrary ordinal number, so feeding it directly into the model generally does not make sense. That is why the one-hot method is used.

Here is one example of one-hot.

1 -> [1, 0, 0, 0, 0]
2 -> [0, 1, 0, 0, 0]
3 -> [0, 0, 1, 0, 0]
5 -> [0, 0, 0, 0, 1]

As you can see, we take the maximal number and create vectors of that length; in each vector, only the position whose index matches the number is set to 1, and all other positions are 0.
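Here is a minimal NumPy sketch of this one-hot encoding (a toy illustration reusing the np imported at the top, not part of the project code):

indices = [1, 2, 3, 5]          # word indices (1-based here, as in the table above)
vocab_size = 5                  # the maximal number
one_hot = np.zeros((len(indices), vocab_size))
for row, idx in enumerate(indices):
    one_hot[row, idx - 1] = 1   # place a 1 at the position given by the index
print(one_hot)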

In this training data there are 16925 distinct words plus one <PAD>, so the one-hot vector for each word would have length 16926, which is huge given how many word tokens the corpus contains.

Additionally, the one-hot method produces a sparse matrix, most of whose entries are 0. Embedding is a method that maps each sparse one-hot vector to a dense vector of continuous values. For example, if the embedding size is 50 (each word corresponds to a vector of length 50), then we only need to train a single weight matrix W of size 16926 × 50, whose i-th row is the embedding of the word with index i. You can also share embeddings, meaning the same weight matrix W is reused in several places of the model.

Fortunately, some embeddings are publicly available (so-called pre-trained embeddings). One common choice is GloVe; the project website is here:

https://nlp.stanford.edu/projects/glove/

but the download link there seems to be broken, so you can also download the files from GitHub.

Here is another article telling us how to use GloVe:

https://medium.com/@martinpella/how-to-use-pre-trained-word-embeddings-in-pytorch-71ca59249f76

file_1 = "../input/glove6b/glove.6B.200d.txt"

Dic_glove = dict()
with open(file_1) as e:
    x = e.readlines()
    for sentence in x:
        t = sentence.split()
        word = t[0]
        embeddings = t[1:]
        embeddings = [float(value) for value in embeddings]
        Dic_glove[word] = embeddings

Nevertheless, do not celebrate too early: some words in the training data may not appear in GloVe. For those words we initialize the embeddings with random values, drawn here from a normal distribution.

matrix_len = len(Tokens)
emb_dim = 200
weights_matrix = np.zeros((matrix_len, emb_dim))
words_found = 0

for i, word in enumerate(Tokens): # Good to use it to enumerate the indices and words
    try: 
        weights_matrix[i] = Dic_glove[word]
        words_found += 1
    except KeyError:
        weights_matrix[i] = np.random.normal(scale=0.6, size=(emb_dim, ))

As you can see, weights_matrix stores the embedding of each word: each row corresponds to one word, and the row number is exactly that word's index in the dictionary. So if the index of each word is not fixed across runs, you get the bugs I have pointed out several times before.

We need two additional lines to append an embedding row for <PAD> as well.

weights_matrix_2 = weights_matrix.copy()
weights_matrix_2 = np.concatenate((weights_matrix_2, np.zeros((1, emb_dim))))

So the weight matrix actually used later is weights_matrix_2; sorry for the slightly clumsy naming...

Step 5: Constructing Deep Learning Model

Now we can begin our exploration in Deep Learning! Here is one version of the model.

class RNNTagger(nn.Module):
    def __init__(self, hidden_dim, hidden_dim_2, tagset_size, weights_matrix_2):
        super(RNNTagger, self).__init__()
        self.hidden_dim = hidden_dim
        self.embedding, num_embeddings, embedding_dim = create_emb_layer(weights_matrix_2, True)

        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.lstm2 = nn.LSTM(hidden_dim, hidden_dim)
        self.hiddenlinear = nn.Linear(hidden_dim, hidden_dim_2)
        self.hidden2tag = nn.Linear(hidden_dim_2, tagset_size)
        self.softmax = nn.Softmax(dim=-1)  # note: CrossEntropyLoss (Step 6) applies log-softmax internally, so this extra Softmax is not strictly necessary
        self.hidden = self.init_hidden()
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=0.25)
        self.tanh = nn.Tanh()
        self.gru = nn.GRU(embedding_dim, hidden_dim)  # note: gru, lstm2, tanh and init_hidden are defined but not used in forward below

    def init_hidden(self):
        return (torch.zeros(1, 1, self.hidden_dim),
                torch.zeros(1, 1, self.hidden_dim))

    def forward(self, sentence, X_lengths):
        sentence = torch.LongTensor(sentence)
        sentence = torch.nn.utils.rnn.pack_padded_sequence(self.embedding(sentence), X_lengths, batch_first=True,
                                                           enforce_sorted=False)
        lstm_out, _ = self.lstm(sentence)
        lstm_out, _ = torch.nn.utils.rnn.pad_packed_sequence(lstm_out, batch_first=True)
        tag_space_temp = self.dropout(lstm_out)
        tag_space_temp = self.hiddenlinear(tag_space_temp)
        tag_space_temp = self.relu(tag_space_temp)
        tag_space = self.hidden2tag(tag_space_temp)
        tag_scores = self.softmax(tag_space)
        return tag_scores
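One detail: the model above calls a helper create_emb_layer, which is not shown in this post. Here is a sketch of what it might look like, following the convention of the Medium article linked in Step 4 (the exact helper used in the project may differ): it wraps weights_matrix_2 in an nn.Embedding layer and optionally freezes it.

def create_emb_layer(weights_matrix, non_trainable=False):
    # weights_matrix is a numpy array of shape (num_embeddings, embedding_dim)
    weights = torch.FloatTensor(weights_matrix)
    num_embeddings, embedding_dim = weights.shape
    emb_layer = nn.Embedding(num_embeddings, embedding_dim)
    emb_layer.weight = nn.Parameter(weights)
    if non_trainable:
        emb_layer.weight.requires_grad = False  # keep the pre-trained embeddings fixed
    return emb_layer, num_embeddings, embedding_dim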

As we know, deep learning is a building-block game: each layer is one block, and you only need to combine the blocks. All blocks must be defined in the method __init__, while the order in which they are applied is defined in the method forward. To be more specific, self.lstm is one block defined in __init__; we apply it first in forward, turning the embedded sentence into lstm_out, then we apply a Dropout layer (self.dropout), and so on.

Here are a few practical points. First, the inputs must be converted to tensors. Second, pack_padded_sequence and pad_packed_sequence are dual methods that convert the padded batch into the structure the LSTM expects and back again. Since our input matrix for each batch has size N × M, where N is the batch size and M is the length of the longest sentence in that batch, the optional parameter batch_first must be set to True; otherwise the expected matrix size would be M × N.

Note that you should use the "blocks" defined in __init__: constructing nn.Dropout(p=0.25) directly inside forward can cause trouble, presumably because a module created on the fly is not registered with the model, so, for example, it will not be switched off by model.eval().

To better understand how PyTorch works, let us walk through the transformation of the matrix. At first we input a matrix of size N × M, where each element is the index of one word. The function create_emb_layer then assigns each word an embedding; if K denotes the embedding size, the matrix actually fed into the model has size N × M × K. Suppose we then add a linear layer with output size 32: the output becomes N × M × 32, since for an RNN each word is one input unit. In this problem each word corresponds to one label (POS tag). If there are P kinds of POS tags, the final output has size N × M × P, i.e. each word gets a vector of length P whose elements are the probabilities of the labels. For example, the vector [0.5, 0.5] means the word belongs to the first label with probability 0.5 and to the second label with probability 0.5. These probabilities are computed by the Softmax layer.
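As a quick sanity check (a sketch that assumes the model has already been instantiated as in Step 6 and the padded batches from Step 3 are available), you can print the output shape and verify it is (N, M, P):

# sketch: feed the first batch through the model and inspect the output shape
example_batch = padded_Samples[0]      # size N x M (here 100 x longest sentence in the batch)
example_lengths = Samples_lengths[0]   # the true (unpadded) sentence lengths
with torch.no_grad():
    scores = model(example_batch, example_lengths)
print(scores.shape)                    # expected: torch.Size([100, M, 46])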

Step 6: Train the Model

Training in PyTorch follows a fixed procedure, so this block of code should be easy to follow.

model = RNNTagger(hidden_dim=128, hidden_dim_2=64, tagset_size=46, weights_matrix_2=weights_matrix_2)

criterion = torch.nn.CrossEntropyLoss(ignore_index=45)
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)

tag_pad_idx = 45
for k in range(30):  # 30 training epochs
    for i in range(len(padded_Samples)):
        padded_Labels_batch_temp = padded_Labels[i]
        padded_Samples_batch_temp = padded_Samples[i]
        Samples_lengths_batch_temp = Samples_lengths[i]

        optimizer.zero_grad()
        predictions = model(padded_Samples_batch_temp, Samples_lengths_batch_temp)
        predictions = predictions.view(-1, predictions.shape[-1])
        padded_Labels_batch_temp = torch.LongTensor(padded_Labels_batch_temp)
        padded_Labels_batch_temp = padded_Labels_batch_temp.view(-1)
        loss = criterion(predictions, padded_Labels_batch_temp)
        loss.backward()
        optimizer.step()
        print(loss)

As you can see, you simply feed the data into the model to get predictions, then compute the loss between the predictions and the true labels (padded_Labels_batch_temp). You need to choose a proper criterion and optimizer first; in this problem we use CrossEntropyLoss and Adam.

Please note that when you compute the CrossEntropyLoss, you have to set ignore_index. The <PAD> symbol has no real label, so we give it an extra dummy label with index 45, since the 45 legal labels are indexed from 0 to 44; ignore_index=45 tells the loss to skip those positions.
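Here is a tiny standalone illustration of ignore_index (toy numbers, not project data): the position labelled with the pad index contributes nothing to the loss.

# toy example: three target positions, the last one is padding (index 45) and is ignored
logits = torch.randn(3, 46)                 # 3 positions, 46 classes (45 tags + pad)
targets = torch.LongTensor([3, 7, 45])      # the last target is the pad label
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=45)
print(loss_fn(logits, targets))             # averaged over the first two positions only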

Step 7: Test the Model

For testing, we only need to build the test input matrices, the true labels for the test set, and the predictions, in the same way as before. Note that the word-to-number dictionary built from the training data must be reused here for consistency.

model = RNNTagger(hidden_dim=128, hidden_dim_2=64, tagset_size=46, weights_matrix_2=weights_matrix_2)
# model_file is the path of the checkpoint saved after training, e.g. with torch.save(model.state_dict(), model_file).
# If you train on a GPU, you need to move the data and model onto the GPU first; map_location brings the weights back to the CPU here.
model.load_state_dict(torch.load(model_file, map_location=torch.device("cpu")))

predictions = model(padded_Samples_test, Samples_test_lengths)

predictions = predictions.view(-1, predictions.shape[-1])
padded_Labels_test = torch.LongTensor(padded_Labels_test)
padded_Labels_test = padded_Labels_test.view(-1)
predictions = torch.argmax(predictions, -1).cpu()

predictions = predictions.numpy()
padded_Labels_test = padded_Labels_test.numpy()
non_pad_elements = (padded_Labels_test != 45).nonzero()
# print(len(non_pad_elements[0]))
# print(sum(predictions[non_pad_elements[0]] == padded_Labels_test[non_pad_elements[0]]))
acc = sum(predictions[non_pad_elements[0]] == padded_Labels_test[non_pad_elements[0]]) / len(predictions[non_pad_elements[0]])

Note that when you compute the accuracy, the <PAD> positions must be removed first, since they carry no meaning. The same accuracy can also be computed with accuracy_score, as sketched below.
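Since sklearn's accuracy_score is already imported at the top (one of the otherwise unused imports), here is a sketch of the equivalent computation:

# equivalent accuracy computation on the non-padding positions only
mask = padded_Labels_test != 45
acc = accuracy_score(padded_Labels_test[mask], predictions[mask])
print(acc)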

In this project we do not use cross-validation, but in practice it is an important tool for choosing the parameters that give the best evaluation result. You can also try different settings: learning rate, LSTM vs GRU, Adam vs SGD, and so on. For SGD in particular, since the loss surface of this problem is quite irregular and steep near the minimum, the momentum parameter should be used.
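For example, a minimal sketch of swapping in SGD with momentum (the values 0.1 and 0.9 are only illustrative, not tuned for this project):

# SGD with momentum as an alternative optimizer (illustrative hyperparameters)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)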

OK, I think we can stop here.

Summary

This article is a brief introduction to using an RNN to solve POS tagging, a classical problem in NLP. We covered data preprocessing, the training and testing procedures in PyTorch, and many of the tricks used in this project. I hope it makes getting started with deep learning a little easier. The official PyTorch tutorials are also highly recommended for beginners:

https://pytorch.org/tutorials/beginner/pytorch_with_examples.html
