POS Tagging (Part-Of-Speech Tagging)
What is Part of Speech?
The part of speech of a word is, in simple terms, its grammatical category. It is a purely syntactic notion in linguistics. Note that the grammatical tradition in linguistics has a whole body of literature and debate on how parts of speech should be defined, and each proposed definition has its own advantages and disadvantages.
More generally, the convention is to define the part of speech of a word on the basis of the word’s syntactic category. By syntactic category, we mean the syntactic function the word performs in a sentence, which can be that of a noun, a verb, an adverb, etc. Eight parts of speech are traditionally defined for English, but the set can be much more elaborate. Of these eight categories, nouns, verbs, adjectives and adverbs are open classes, while pronouns, prepositions, conjunctions and interjections are closed classes.
There are two simple cues you can use to find the POS tag of a word in a sentence:
- The syntactic environment of the word.
- The inflection on the word.
Let me illustrate these two points using the example of the Noun part of speech.
Nouns in English are typically preceded by articles, demonstratives or adjectives, and they are usually followed by auxiliary verbs or main verbs. Now carefully observe the syntactic position and environment of the noun ‘cat’ in the following sentence:
The big cat is sleeping.
The second cue is inflection. Inflections are not overtly present in all languages, but in this example the inflection that the noun ‘cat’ takes is the plural -s suffix, as the following sentence shows:
I really like fluffy cats.
The above rules are defined with only the English language in mind. It is important to know that different languages can employ different syntactic strategies to achieve the same result. To find out POS tags in another language, one has to study the syntax of that language first.
Another important point is that the syntactic category of a word, defined on the basis of its function in the sentence, is different from its lexical category, which is defined on the basis of meaning or semantics. The lexical category of a word can very well be the same as its syntactic category. For example, in the following sentences, the word ‘run’ is a verb both lexically and syntactically:
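The two cues can be sketched in a few lines of Python. This is only a toy illustration, not a real tagger: the function name noun_cues, the word lists and the suffix test are all assumptions made for this example, and the adjective list covers just the two sentences above.

```python
# Toy illustration of the two cues for English nouns (not a real tagger).
# The word lists below are hand-made assumptions for this example only.
DETERMINERS = {"the", "a", "an", "this", "that", "these", "those"}
ADJECTIVES = {"big", "fluffy"}


def noun_cues(tokens, index):
    """Report which of the two noun cues hold for tokens[index]."""
    word = tokens[index].lower()
    previous = tokens[index - 1].lower() if index > 0 else ""
    return {
        # Cue 1: syntactic environment (a determiner or adjective right before).
        "noun_like_environment": previous in DETERMINERS or previous in ADJECTIVES,
        # Cue 2: inflection (a plural -s ending; crude, since verbs can end in -s too).
        "plural_suffix": len(word) > 2 and word.endswith("s") and not word.endswith("ss"),
    }


print(noun_cues("The big cat is sleeping .".split(), 2))
# {'noun_like_environment': True, 'plural_suffix': False}
print(noun_cues("I really like fluffy cats .".split(), 4))
# {'noun_like_environment': True, 'plural_suffix': True}
```

Neither cue is reliable on its own (for instance, ‘runs’ also ends in -s), which is why real taggers combine many such signals.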
That girl is running very fast.
I like to run.
They ran away.
However, this is often not the case. A word can have a lexical category that differs from the syntactic role it performs in a sentence. For example, the English word ‘run’ intuitively has the category of verb. However, in sentences like the following, it performs the function of a noun:
Running is good for health.
I want my running to improve.
Let us go for a run.
I don’t like running as an exercise.
Part of Speech tagging
POS tagging, or part-of-speech tagging, is traditionally considered an important level of computational analysis in the fields of Computational Linguistics and Natural Language Processing.
POS tagging in Python
For a given text, one can do part-of-speech tagging using the Python programming language. The most common way to do so is to use NLTK (the Natural Language Toolkit), a very popular Python library that was originally developed at the University of Pennsylvania. NLTK has a POS tagger, or part-of-speech tagger: specialized code that assigns a part of speech to every token (or word) of the input text. The following Python code (Python 3) can be used to do simple POS tagging:
import nltk

# Requires the NLTK tokenizer and tagger data, which can be fetched with nltk.download()
text = "this is an example for pos tagging in python nltk"
tokenised_text = nltk.word_tokenize(text)  # split the text into tokens
print(nltk.pos_tag(tokenised_text))  # prints a list of (token, tag) pairs
There are different tag sets available for POS tagging. Depending on the task and other preferences, the list of part-of-speech tags can be modified. However, it is important to reach a clear consensus on the tag set before any tagging is done.
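As a minimal sketch of what modifying a tag list can look like, the snippet below collapses fine-grained tags into a smaller set. It assumes Penn Treebank-style tags, which is what NLTK's default English tagger returns. The coarse labels, the COARSE_TAGS table and the simplify function are hypothetical choices made for this example, not part of any library.

```python
# Hypothetical coarse tag set: one possible grouping of Penn Treebank-style tags.
COARSE_TAGS = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
}


def simplify(tagged_tokens, default="OTHER"):
    """Map each (word, fine_tag) pair to (word, coarse_tag)."""
    return [(word, COARSE_TAGS.get(tag, default)) for word, tag in tagged_tokens]


fine = [("Medical", "JJ"), ("researchers", "NNS"), ("believe", "VBP"), ("the", "DT")]
print(simplify(fine))
# [('Medical', 'ADJ'), ('researchers', 'NOUN'), ('believe', 'VERB'), ('the', 'OTHER')]
```

A coarser set like this is easier to agree on and to annotate with, at the cost of losing distinctions such as singular versus plural nouns.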
Penn Treebank Tagset
Let us now look at an example of part-of-speech tagging done manually. The tag set used is that of the Penn Treebank.
Medical/JJ researchers/NNS believe/VBP the/DT transplantation/NN of/IN small/JJ amounts/NNS of/IN fetal/JJ tissue/NN into/IN humans/NNS could/MD help/VB treat/VB juvenile/JJ diabetes/NN and/CC such/JJ degenerative/JJ diseases/NNS as/IN Alzheimer/NNP 's/POS ,/, Parkinson/NNP 's/POS and/CC Huntington/NNP 's/POS ./.
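Text in this word/TAG format is easy to read back into Python. The following is a rough sketch using only the standard library. The helper name parse_tagged is an assumption for this example, and the input is a fragment of the sentence above.

```python
from collections import Counter


def parse_tagged(text, sep="/"):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST separator, so tokens like "./." still work.
        word, _, tag = token.rpartition(sep)
        pairs.append((word, tag))
    return pairs


fragment = "of/IN small/JJ amounts/NNS of/IN fetal/JJ tissue/NN into/IN humans/NNS"
pairs = parse_tagged(fragment)
print(pairs[:2])                      # [('of', 'IN'), ('small', 'JJ')]

tag_counts = Counter(tag for _, tag in pairs)
print(tag_counts["IN"])               # 3

print(parse_tagged(",/, ./."))        # [(',', ','), ('.', '.')]
```

Once the pairs are in a list, simple statistics such as tag frequencies become one-liners.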
Now let us look at the same sentence tagged using the CLAWS tag set:
Medical_AJ0 researchers_NN2 believe_VVB the_AT0 transplantation_NN1 of_PRF small_AJ0 amounts_NN2 of_PRF fetal_AJ0 tissue_NN1 into_PRP humans_NN2 could_VM0 help_VVI treat_NN1 juvenile_AJ0 diabetes_NN1 and_CJC such_DT0 degenerative_AJ0 diseases_NN2 as_CJS Alzheimer’s/POS_NN2 ,_PUN Parkinson_NP0 ‘s_POS and_CJC Huntington_NP0 ‘s_POS .PUN_SENT —–_PUN
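To compare the two taggings, one can map the CLAWS tags onto their nearest Penn Treebank counterparts and look for mismatches. The sketch below does this for a short fragment of the sentence. The CLAWS_TO_PENN table is a rough, hand-made correspondence assumed for this example (it is not an official mapping and covers only the tags seen above), and the function names are likewise hypothetical.

```python
# Rough, hand-made correspondence (an assumption, not an official mapping).
CLAWS_TO_PENN = {
    "AJ0": "JJ", "NN1": "NN", "NN2": "NNS", "NP0": "NNP",
    "VVB": "VBP", "VVI": "VB", "VM0": "MD",
    "AT0": "DT", "DT0": "DT", "PRF": "IN", "PRP": "IN",
    "CJC": "CC", "CJS": "IN", "POS": "POS",
}


def parse_tagged(text, sep):
    """Split 'word<sep>TAG' tokens into (word, tag) pairs."""
    return [tuple(token.rsplit(sep, 1)) for token in text.split()]


def disagreements(penn_pairs, claws_pairs):
    """List (word, penn_tag, claws_tag) where the mapped CLAWS tag differs."""
    result = []
    for (word, penn_tag), (_, claws_tag) in zip(penn_pairs, claws_pairs):
        if CLAWS_TO_PENN.get(claws_tag) != penn_tag:
            result.append((word, penn_tag, claws_tag))
    return result


penn = parse_tagged("could/MD help/VB treat/VB juvenile/JJ diabetes/NN", "/")
claws = parse_tagged("could_VM0 help_VVI treat_NN1 juvenile_AJ0 diabetes_NN1", "_")
print(disagreements(penn, claws))
# [('treat', 'VB', 'NN1')]
```

On this fragment the only mismatch is ‘treat’, which the manual Penn Treebank annotation marks as a verb while the CLAWS output above marks it as a singular noun.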
Lastly, we look at the same sentence as tagged by the Stanford tagger, which uses Penn Treebank tags:
Medical_JJ researchers_NNS believe_VBP the_DT transplantation_NN of_IN small_JJ amounts_NNS of_IN fetal_JJ tissue_NN into_IN humans_NNS could_MD help_VB treat_VB juvenile_JJ diabetes_NN and_CC such_JJ degenerative_JJ diseases_NNS as_IN Alzheimer_NN ‘s_POS /_: POS_NN ,_, Parkinson_NNP ‘s_POS and_CC Huntington_NNP ‘s_POS ._.
© 2017 Payal Khullar. All Rights Reserved.