Part-Of-Speech (POS) Tagging
It has been a while since I first met and worked on natural language processing, and I am here to spare people (especially desperate students) the difficulties I had with some of its basic concepts. In this story, I will try to help you with the implementation of part-of-speech (POS) tagging, and we will progress step by step through the phases of implementation. Remember, this story is not about how to use a POS tagger; there exist many, they are easy to reach, they are easy to use and… so what would be the point, right? It is about how to implement one. You will have your very own POS tagger! How exciting is that? Yeeeey, huh?
First, if you like, let's briefly go over what part-of-speech tagging means. That way, we can help the few students who are focused on learning and understanding the basics of the work rather than only finishing their homework. (I assume the homework deadline is not within the next hour or so.) "A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns parts of speech to each word (and other token), such as noun, verb, adjective, etc." This is the definition of The Stanford Natural Language Processing Group. It's pretty clear, no need to say more. Let's dive into the implementation phases.
Data
We will implement a supervised machine learning model, so what we need is data, preferably looots of data. We love data, don't we? Okay, I have some data that I saved for a rainy day, and I can lend you some if you don't have any. Come on, have some. There is training data containing almost 12,000 English sentences with their POS tags, and some test data to evaluate the model at the end. (It is highly recommended to split the held-out data in two: development data and test data. You train and tune your model using the training and development data, and then, once you are certain the model is final, you use the test data for evaluation. That is the correct way. However, I will continue without any such splitting. **Spoiler Alert!** It will work.)
The first column in the data contains the words (or tokens) of the sentences, the second column contains the universal POS (U-POS) tags of the tokens, which we are not going to use, and the third column contains the POS tags of the tokens. These POS tags are the ground-truth (gold) labels that we aim to predict when we face a test sentence or token.
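To make that layout concrete, here is a small hypothetical excerpt in the three-column format just described (the separators and exact tags in your file may differ; the tags shown are illustrative U-POS and Penn-Treebank-style pairs):

```
The      DET     DT
dog      NOUN    NN
barked   VERB    VBD
.        PUNCT   .
```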
Now we know what the data looks like.
Data Reading
As the next step, how about reading the data? (It is not much use on its own, lying in a file, after all.) Let's RELEASE THE DATA! As you can see in the script below, we prefer to read the data sentence by sentence. Why do we want to keep the sentence structure, you may ask. It is because the position of a word (or token) in a sentence is a strong indicator of what its POS tag may be. When you think about it, in English the words placed at the very beginning of a sentence tend to be subjects, the words placed close to the beginning tend to be verbs, and so on. This is why we want to keep the words of the same sentence together.
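Since the original script is not shown here, below is a minimal sketch of sentence-by-sentence reading. It assumes a tab-separated file with the three columns described above and blank lines between sentences; the file name train.txt is hypothetical, so adjust the parsing to match your data.

```python
def read_sentences(path):
    """Read a tab-separated (word, U-POS, POS) file into a list of
    sentences, where each sentence is a list of (word, pos) pairs."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                  # blank line marks a sentence boundary
                if current:
                    sentences.append(current)
                    current = []
                continue
            word, _upos, pos = line.split("\t")  # we skip the U-POS column
            current.append((word, pos))
    if current:                           # don't lose the last sentence
        sentences.append(current)
    return sentences

train_sentences = read_sentences("train.txt")  # hypothetical file name
```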
We have finished reading the data. Now, how about starting the model training super quickly, obtaining a great POS tagger and tagging everything in the world until we discover teleportation by chance? Sounds sweet, but not so fast. Unfortunately, we are not yet capable of training a tagger with the raw data we have. Next step, please!
Feature Extraction
Welcome to the most important and sophisticated part of the work. As you already know (I love that phrase), we use machine learning techniques to reach the information hidden in data. To be able to apply any machine learning technique to the data, we should represent it with some reasonable features. In short, we should extract features from the data, where each feature points out a different characteristic of it. Remember that we talked about the tendency of subjects to be positioned at the beginning of an English sentence. So, for example, it would be smart to define a feature which checks whether the given word (or token) is at the beginning of the sentence or not. Because we are all smart people (or, like me, trying to be one), we will define exactly such a feature and name it is_first. (Not sure if this naming is an indication of a lack of creativity or an act of explicitness, anyway.)
What the function below does is pretty simple. It iterates over all the sentences we have already read; then, for each sentence, it iterates over all tokens. At this point, each of these tokens is actually a pair (token, POS tag). We send the tokens to a magical function called get_features to perform feature extraction. The obtained features go into a list which will be our x_train, and the POS tags go directly into a list which will be our y_train.
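A sketch of that loop, assuming the read_sentences helper from the previous snippet and the get_features function shown a bit further below:

```python
def transform_to_dataset(sentences):
    """Turn (word, pos) sentences into feature dicts and gold labels."""
    x_data, y_data = [], []
    for sentence in sentences:
        tokens = [token for token, _tag in sentence]  # words only
        for index, (token, tag) in enumerate(sentence):
            x_data.append(get_features(token, index, tokens))
            y_data.append(tag)                        # gold label
    return x_data, y_data

x_train, y_train = transform_to_dataset(train_sentences)
```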
At this point, the question in your mind should be something like: what does the get_features function look like, how does it perform feature extraction, or is it some magical place that will stay secret and we will never talk about it again? (Maybe not that last one.)
You can see the mysterious get_features function below. It takes a token, the token index and the sentence the token belongs to, and forms a dictionary which holds the token's features. We use a dictionary to store the features because some of them are represented with numerical values, some with nominal values and some with boolean truth values. Frankly, it is a mess from the perspective of value types, so it is safest to use a dictionary.
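One possible version, assuming the (token, index, sentence) signature described above. The is_first feature is the one motivated earlier; the rest are common choices of mine that you are free to change.

```python
def get_features(token, index, sentence):
    """Build a mixed-type feature dict for one token of a sentence."""
    return {
        "word": token.lower(),
        "is_first": index == 0,                   # the feature we motivated above
        "is_last": index == len(sentence) - 1,
        "is_capitalized": token[0].isupper(),
        "is_all_caps": token.isupper(),
        "is_numeric": token.isdigit(),
        "prefix-1": token[0],
        "suffix-2": token[-2:],
        "prev_word": "" if index == 0 else sentence[index - 1].lower(),
        "next_word": "" if index == len(sentence) - 1 else sentence[index + 1].lower(),
    }
```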
Apart from the features I selected above, one can pick many different features which may increase the performance of the tagger. This function is the place where you can try different features; you may discover interesting relations and push the performance up. It's puzzle time!
Training
Finally, it is time for training. We are going to use a logistic regression model, but there is one last problem. (I know, it starts to be annoying; hang on!) Our features are in dictionary format, and we cannot train a logistic regression model on that; we need vectors as input. Therefore, we should find a way to transform the dictionaries into vectors. Fortunately, there is a transformer called DictVectorizer in the scikit-learn library to help, and it performs exactly the transformation we need. As can be seen in the script below, we transform the feature dictionaries into feature vectors and then train our model on these vectorized features. Et voilà!
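A minimal training sketch under those assumptions: DictVectorizer turns the feature dicts into sparse vectors, and a logistic regression classifier is fitted on top. Chaining the two in a Pipeline is my choice here, so the same object can later vectorize and predict in one call; the max_iter value is just a generous default for convergence.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

model = Pipeline([
    ("vectorizer", DictVectorizer(sparse=True)),   # dicts -> sparse vectors
    ("classifier", LogisticRegression(max_iter=1000)),
])
model.fit(x_train, y_train)
```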
Training took 32 minutes on my computer, which has a 3.1 GHz Intel Core i7 processor.
Now that the training is done and our only goal in the universe is completed, we are likely to find ourselves fooling around, or maybe staring foolishly at the walls. Happily, not so fast: our job is not done yet.
Evaluation
We should evaluate our trained POS tagger and see how well it performs on test data it has never seen before. The evaluation script below does the job perfectly.
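A sketch of that evaluation, reusing the helpers from the snippets above; "test.txt" is a hypothetical file name standing in for your test set.

```python
test_sentences = read_sentences("test.txt")        # hypothetical file name
x_test, y_test = transform_to_dataset(test_sentences)

accuracy = model.score(x_test, y_test)             # mean per-token accuracy
print(f"Test accuracy: {accuracy:.2%}")
```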
Test accuracy is 93%, which is pretttty pretty good.
Final Example
I would like to demonstrate the performance of the POS tagger on an example sentence. I will use a quote from Mustafa Kemal Atatürk, the father of modern Turkey and a great leader.
Peace at home, peace on earth. — Mustafa Kemal ATATURK —
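Feeding the quote to the trained pipeline might look like the sketch below; the hand-made token list stands in for a real tokenizer.

```python
tokens = ["Peace", "at", "home", ",", "peace", "on", "earth", "."]
features = [get_features(token, i, tokens) for i, token in enumerate(tokens)]
print(list(zip(tokens, model.predict(features))))  # (word, predicted tag) pairs
```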
The result of our tagger is accurate except for "at", which should really be IN rather than DT.
[('Peace', 'NN'), ('at', 'DT'), ('home', 'NN'), (',', ','), ('peace', 'NN'), ('on', 'IN'), ('earth', 'NN'), ('.', '.')]
NN: Noun, DT: Determiner, IN: Preposition
I invite you to visit my GitHub repository for more detailed work.
I hope it helps.
Samet