10  Natural language processing

In Case 10, we learnt what Natural Language Processing (NLP) is and how it can be useful. You gained experience with nltk, a Python library that implements many common NLP algorithms. We also covered the challenges specific to NLP, along with techniques such as vectorization, stop-word removal, tokenization, and part-of-speech tagging.

10.1 Preliminary modules

import nltk # imports the natural language toolkit
nltk.download('punkt')
nltk.download('stopwords')
import pandas as pd
import numpy  as np
import string
import plotly
import matplotlib.pyplot as plt
from nltk.stem import PorterStemmer 
from wordcloud import WordCloud
from pylab import rcParams
from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter
import re
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\12RAM\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\12RAM\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

10.2 About NLP

10.2.1 Challenge 1: Extraordinarily high dimensionality

Consider the book War and Peace, which contains roughly 3 million characters. Can we view it as one long vector of characters in a 3-million-dimensional space and apply machine learning methods directly? This is a bad idea for two reasons:

  1. Basic approaches have terrible performance in such high-dimensional spaces
  2. These approaches “miss out” on some important rules about language that we all know; e.g. that “don’t” and “do not” mean the same thing

As a result, a huge amount of NLP involves finding ways to summarize incredibly long vectors in concise ways, so that we can tractably explore, analyze, and model build with them later.

10.2.2 Challenge 2: Text is context specific

For example, the word queen has several senses in English that are both very different from one another and very common:

  1. The ruler of a country
  2. A size of mattress
  3. The most powerful piece in chess
  4. The reproductive female in certain insect colonies (e.g. bee or ant colonies)

General-purpose libraries need to handle all of these senses, but reviews of mattresses will almost always use the second. This kind of mismatch can produce misleading results, and it is easily avoided by a team that is familiar with the underlying NLP computations.

10.3 Pre-processing and standardization

Standardizing text involves many steps. Some of these include:

  1. Correcting simple errors. For example, different text might use different encodings and you might find that special characters are corrupted and need to be fixed.
  2. Creating features (e.g. labeling nouns and verbs in a sentence).
  3. Replacing words and sentences altogether (e.g. standardizing spelling by changing “yuuuuuuck!” to “yuck”, or more extreme steps such as replacing words with near synonyms)

In a broad sense, standardization is similar to data cleaning with more conventional data; we are fixing errors, removing outliers, and transforming features. However, the details in NLP tend to be more complicated. One tip is that it is helpful to look at the lengths of each document to catch outliers.
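As a minimal sketch of that tip, assuming a hypothetical DataFrame df with a text column named review, we could inspect document lengths like this:

import pandas as pd

# hypothetical example data: a DataFrame with a text column named "review"
df = pd.DataFrame({"review": ["A normal-length review about a product.",
                              "Short.",
                              "word " * 2000]})

# character and word counts for each document
df["n_chars"] = df["review"].str.len()
df["n_words"] = df["review"].str.split().str.len()

# summary statistics and the longest documents make length outliers easy to spot
print(df[["n_chars", "n_words"]].describe())
print(df.sort_values("n_chars", ascending=False).head())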

Note

The NLP literature uses common words in technical ways. For example, a “document” means any standalone string that is part of a larger collection. Sometimes these are documents in the everyday sense (articles, papers, etc.) belonging to a collection of such documents. However, we also use “document” to refer to each tweet in a collection of tweets, or each review in a collection of reviews (as in the dataset we’ll work with now). Remember, a “document” is just an item of natural-language text that belongs to a larger collection of similar items.

10.3.1 Libraries for NLP

We will be using Python’s Natural Language Toolkit (nltk) library. This library has functions that do most of the basics of NLP.

NLTK is a great library for learning about NLP in Python. It implements nearly all standard NLP algorithms in pure Python, and its code is very readable. It has great documentation and a companion book, and it often implements several alternatives to the same algorithm so that they can be compared.

Another NLP library in Python is spaCy. SpaCy is more modern than NLTK, and more focused on industry use than on education. It is opinionated and often implements only a single algorithm instead of all alternatives. It is focused on speed and efficiency over readability, and its source code is less readable as a result.

Both are great NLP libraries to become familiar with. In this case, we’ll use NLTK, but nearly all features that we cover can be used in spaCy too.

Note

Many text wrangling pipelines start a little before we do, with initial “cleaning” steps that involve things like: converting all characters to lower case, expanding contractions, etc.
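As a rough illustration of such a preliminary cleaning step, here is a minimal sketch; the contraction map below is a toy example, not a complete list:

import re

# toy contraction map; real pipelines use a much larger dictionary or a dedicated library
contractions = {"don't": "do not", "can't": "cannot", "it's": "it is"}

def basic_clean(text):
    text = text.lower()                            # lower-case everything
    for short, full in contractions.items():       # expand known contractions
        text = re.sub(re.escape(short), full, text)
    text = re.sub(r"\s+", " ", text).strip()       # collapse repeated whitespace
    return text

print(basic_clean("It's raining, and I DON'T have an umbrella."))
# it is raining, and i do not have an umbrella.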

10.3.2 Tokenizing sentences

Just as CSV data is composed of rows and features, text data is composed of sentences. Thus, a natural first step is what is known as sentence tokenization: splitting a long document into its component sentences.

  • At first this might seem trivial: just split whenever you see a period. Unfortunately, the same symbol is used in other ways in English (e.g. to mark an abbreviation, as part of ellipses, etc.), and so slightly more care is required.
  • Fortunately, there are packages that will do this for us. Within nltk, we can use the nltk.sent_tokenize() function.

Example:

# sentence tokenization
sentences = nltk.sent_tokenize('Tom wrote a letter to Mr. Plod, his uncle. "I am arriving on Mon. 5 Jan. Please meet me at approx. 5 p.m.')
for sentence in sentences:
    print(sentence)
    print()
Tom wrote a letter to Mr. Plod, his uncle.

"I am arriving on Mon.

5 Jan.

Please meet me at approx.

5 p.m.

It may seem like sentence tokenization is easy, but remember that the period . can be used in many different ways. In the document:

Tom wrote a letter to Mr. Plod, his uncle. "I am arriving on Mon. 5 Jan. Please meet me at approx. 5 p.m."

A sentence tokenizer has to be intelligent enough to tokenize this as follows:

[
    "Tom wrote a letter to Mr. Plod, his uncle.",
    "I am arriving on Mon. 5 Jan.",
    "Please meet me at approx. 5 p.m."
]

Additionally, the different ways that people use abbreviations and punctuation make this a decidedly non-trivial task; indeed, the output above shows that nltk’s default tokenizer splits incorrectly after abbreviations such as “Mon.” and “approx.”.

10.3.3 Tokenizing words

We may wish to split sentences into individual words. As with sentence tokenization, there is (i) a pretty good heuristic (split on spaces), (ii) a number of weird exceptions (e.g. compound words), and (iii) an existing package that does the job fairly well.

To do this task, one can use the nltk.word_tokenize() function from nltk:

Example:

nltk.word_tokenize("I don't like bananas")
['I', 'do', "n't", 'like', 'bananas']

10.4 Wordclouds

When we want a quick, high-level impression of which words dominate a body of text, word clouds are a common and sometimes useful tool.

To elaborate, while word clouds can be a useful way of quickly gaining high-level insights into raw textual data, they are also limited. In some ways, they can be seen as the pie charts of NLP: often used, but also often hated. Some people would prefer that they didn’t exist at all. Used in the correct way, however, they definitely deserve their place in a data scientist’s toolbelt.

The main problem with word clouds is that they are difficult to interpret in a standard way. The layout algorithm involves some randomness, and although more common words are shown more prominently, it’s not possible to look at a word cloud and know which words are the most important, or how much more important they are than other words. Colours and rotation are also assigned randomly, making some words (e.g. those in bright colours, positioned closer to the centre, with horizontal rotation) seem more important, when in fact they are no more important than words that were randomly assigned a less noticeable combination of colour, rotation, and position.

Example:

# Sample text
text = """
Python is a high-level, interpreted, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation. 
Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming. 
It is often described as a "batteries included" language due to its comprehensive standard library.
"""

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

# Plot the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud Example')
plt.show()

Explanation:

  1. Sample Text: We define a sample text string text.
  2. Generate Word Cloud: We create a WordCloud object with specified dimensions and background color, and then generate the word cloud from the sample text.
  3. Plot the Word Cloud: We use matplotlib to plot the word cloud. We set the figure size, use imshow to display the word cloud image, remove the axes with axis('off'), and add a title.

10.5 \(n\)-grams

Since single words, known as 1-grams, are often insufficient to capture the significance of certain words in our text, it is natural to consider blocks of consecutive words, or \(n\)-grams. \(n\)-grams fall under a broader category of techniques known as count-based representations: techniques that analyze documents by indicating how frequently certain types of structures occur throughout.

The simplest version of the \(n\)-grams model, for \(n > 1\), is the bigram model, which looks at pairs of consecutive words. For example, the sentence “The quick brown fox jumps over the lazy dog” would have tokens “the quick”, “quick brown”,…, “lazy dog”. The following image explains this concept:

[Figure: forming bigrams from consecutive word pairs in the example sentence]

This has obvious advantages and disadvantages over looking at words individually: it retains more of the structure of the overall document and paves the way for analyzing words in context, but the dimension is vastly larger. For this reason, it is often prudent to start by extracting as much value out of 1-grams as possible before working our way up to more complex structures.
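As a quick sketch of forming these blocks in code, nltk provides an ngrams helper that generates them directly from a tokenized sentence:

import nltk

sentence = "The quick brown fox jumps over the lazy dog"
tokens = nltk.word_tokenize(sentence.lower())

# nltk.ngrams returns an iterator of n-tuples of consecutive tokens
bigrams = list(nltk.ngrams(tokens, 2))
print(bigrams[:3])   # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]

trigrams = list(nltk.ngrams(tokens, 3))
print(trigrams[:2])  # [('the', 'quick', 'brown'), ('quick', 'brown', 'fox')]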

Bigrams and trigrams are useful for analyzing a corpus because they capture the context and relationships between words, which can significantly enhance understanding and analysis of text data. Here are some key reasons why they are beneficial:

  1. Contextual Understanding: They help to capture the context in which a word appears, providing more meaningful insights than individual words (unigrams).

  2. Improved Language Models: In natural language processing (NLP), bigrams and trigrams are used to build more accurate language models by considering word sequences rather than isolated words. This helps in predicting the next word in a sequence more effectively, which is crucial for tasks like text generation and auto-completion.

  3. Disambiguation: Certain words have multiple meanings depending on the context. Bigrams and trigrams help disambiguate such words by analyzing the surrounding words. For example, the word “bank” could refer to a financial institution or the side of a river. The bigrams “river bank” and “bank account” clarify the intended meaning.

  4. Sentiment Analysis: In sentiment analysis, bigrams and trigrams can capture expressions and phrases that convey sentiment more accurately than single words. For example, “not good” (bigram) indicates a negative sentiment, which might be missed if “not” and “good” were analyzed separately.

  5. Information Retrieval and Search: Using bigrams and trigrams improves the relevance of search results by considering common word pairs and phrases. This is especially useful in search engines and information retrieval systems, where user queries often consist of multi-word phrases.

  6. Text Mining and Topic Modeling: Bigrams and trigrams help identify frequent phrases and common word combinations, which can be important for topic modeling and discovering patterns in the text. This aids in uncovering themes and topics that are not evident from analyzing single words.

10.5.1 Word-document co-occurrence matrices

The simplest type of information would be whether a particular word occurs in particular documents. This leads to word-document co-occurrence matrices, where the \((W, X)\) entry of the word-document matrix \(A\) is set to 1 if word \(W\) occurs in document \(X\), and 0 otherwise.

There are many variants of this. Given that we are looking for count-based representations of our documents, one natural variant is the following: \[A_{W,X}=\#\text{ times $W$ occurs in $X$},\]

i.e., the \((W, X)\) entry of the word-document matrix equals the number of times that word \(W\) occurs in document \(X\), rather than merely being a binary variable.

Creating a word-document matrix in Python:

# Sample documents
documents = [
    "Python is a high-level programming language",
    "Python is dynamically typed and garbage-collected",
    "Python supports multiple programming paradigms",
    "Python is often described as a 'batteries included' language due to its comprehensive standard library"
]

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the documents
X = vectorizer.fit_transform(documents)

# Get the words (features)
words = vectorizer.get_feature_names_out()

# Create the co-occurrence matrix
df = pd.DataFrame(X.toarray(), columns=words)
df.head()
and as batteries collected comprehensive described due dynamically garbage high ... library multiple often paradigms programming python standard supports to typed
0 0 0 0 0 0 0 0 0 0 1 ... 0 0 0 0 1 1 0 0 0 0
1 1 0 0 1 0 0 0 1 1 0 ... 0 0 0 0 0 1 0 0 0 1
2 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 1 1 1 0 1 0 0
3 0 1 1 0 1 1 1 0 0 0 ... 1 0 1 0 0 1 1 0 1 0

4 rows × 25 columns
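For the binary (0/1) version described at the start of this subsection, CountVectorizer accepts a binary=True argument that caps every count at 1. A quick sketch, reusing the documents list defined above:

# binary word-document matrix: 1 if the word occurs in the document at all, 0 otherwise
binary_vectorizer = CountVectorizer(binary=True)
X_binary = binary_vectorizer.fit_transform(documents)

df_binary = pd.DataFrame(X_binary.toarray(),
                         columns=binary_vectorizer.get_feature_names_out())
df_binary.head()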

10.5.2 Counter objects

The Counter object is part of Python’s collections module. It is essentially a specialized dictionary for counting the occurrences of elements in an iterable: the keys are the unique elements and the values are their counts. In NLP, we can use Counter objects to find the most common \(n\)-grams.

  • Initialization: Use Counter(iterable) to initialize a Counter that counts the occurrences of elements in the iterable.
  • Methods: Counter provides several useful methods, such as most_common() to get the most common elements and their counts, and elements() to get an iterator over elements repeating according to their counts.
  • Arithmetic Operations: You can perform arithmetic operations like addition, subtraction, intersection, and union on Counter objects.

Example:

# Create a Counter from a list
data = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']
counter = Counter(data)

# Display the counter
print(counter)  # Output: Counter({'apple': 3, 'banana': 2, 'orange': 1})

# Get the most common elements
print(counter.most_common(2))  # Output: [('apple', 3), ('banana', 2)]

# Elements method (returns an iterator)
print(list(counter.elements()))  # Output: ['apple', 'apple', 'apple', 'banana', 'banana', 'orange']

# Arithmetic operations
counter2 = Counter(['banana', 'banana', 'kiwi'])
counter.subtract(counter2)
print(counter)  # Output: Counter({'apple': 3, 'orange': 1, 'banana': 0, 'kiwi': -1})
Counter({'apple': 3, 'banana': 2, 'orange': 1})
[('apple', 3), ('banana', 2)]
['apple', 'apple', 'apple', 'banana', 'banana', 'orange']
Counter({'apple': 3, 'orange': 1, 'banana': 0, 'kiwi': -1})

Explanation:

  1. Initialization: We create a Counter from a list of fruits.
  2. Display Counter: The Counter object shows the count of each element.
  3. Most Common Elements: The most_common(2) method returns the two most common elements and their counts.
  4. Elements Method: The elements() method returns an iterator over the elements, repeating each as many times as its count.
  5. Arithmetic Operations: We demonstrate subtraction between two Counter objects.

In Case 10, we used it to count the most common words.

Example:

long_string=' '.join(documents)
words=nltk.word_tokenize(long_string)
counted_words=Counter(words)
counted_words.most_common(2)
[('Python', 4), ('is', 3)]
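The same pattern extends to \(n\)-grams. For example, a sketch of counting bigrams in the same tokenized text, reusing the words list from above:

# count pairs of consecutive words instead of single words
bigram_counts = Counter(nltk.ngrams(words, 2))
bigram_counts.most_common(1)  # [(('Python', 'is'), 3)]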

10.6 Stop words

Stop words are very common words that are usually uninformative, and their very high frequencies can distort the results of many NLP algorithms. They include words that appear in almost every English sentence: pronouns like “I”, prepositions like “of”, conjunctions like “but” and “and”, articles like “the”, etc.

It is common to pre-process text by removing words that you have a reason to believe are uninformative; these words are called stop words. Usually, it suffices to simply treat extremely common words as stop words. However, for specific types of applications it might make sense to use other stop words; e.g. the word “burger” when analyzing reviews of burger chains.

Note

Stop words are often removed by default as a cleaning step in all NLP tasks. However, sometimes they can be useful. For example in authorship attribution (automatically detecting who wrote a specific piece of text by their ‘writing style’), stop words can be one of the most useful features, as they appear in nearly all texts, and yet each author uses them in slightly different ways.

The nltk library has a standard list of stopwords, which you can download by running nltk.download("stopwords"). We can then load the stopwords package from the nltk.corpus and use it to load the stop words:

nltk.download('stopwords')
from nltk.corpus import stopwords
print(stopwords.words("english"))
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\12RAM\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Below is example code for removing stop words:

sw = stopwords.words("english")

list_sw = []
for word in sw:
    # pad the stop word with spaces so that only whole words are matched
    word = " " + word + " "
    if word in text:
        # if the stop word is in the text, then remove it from the text
        text = re.sub(word, " ", text)
        # and record it in the list of stop words found in this text
        list_sw.append(word)

# print the cleaned text
print(text)

# print the stop words that were found and removed
print(list_sw)

Python high-level, interpreted, general-purpose programming language. Its design philosophy emphasizes code readability use significant indentation. 
Python dynamically typed garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented functional programming. 
It often described "batteries included" language due comprehensive standard library.

[' its ', ' is ', ' a ', ' the ', ' and ', ' as ', ' of ', ' with ', ' to ']
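A more common (and usually more robust) alternative, sketched below, is to tokenize first and then keep only the tokens that are not stop words, rather than editing the raw string with padded spaces. The sample sentence here is hypothetical:

import nltk
from nltk.corpus import stopwords

stop_set = set(stopwords.words("english"))

sample = "Python is often described as a batteries included language"
tokens = nltk.word_tokenize(sample)

# keep only the tokens whose lower-cased form is not a stop word
kept = [tok for tok in tokens if tok.lower() not in stop_set]
print(kept)  # ['Python', 'often', 'described', 'batteries', 'included', 'language']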

10.7 Regular Expressions

Having spent a lot of time on \(n\)-grams and how to featurize a document using them, we now take a break from nltk tools to introduce the most important text wrangling tool in Python (and many other languages): regular expressions.

The basic idea here is that you often want to perform some specific transformation (e.g. deletion or substitution) every time some possibly-complicated pattern occurs: the letter ‘A’, the word ‘hello’, any word containing the letters ‘a’ and ‘r’ in that order, and so on. Regular expressions are a compact and powerful language for expressing these sorts of patterns. This is especially important whenever you are cleaning a text dataset whose errors are thematically similar but not exactly the same.

The terse syntax of regular expressions has given them a reputation for being almost magical (with only a few characters, you can express what would otherwise take a small program) but also for being difficult to write and read, which can create more problems than it solves.

In Python, the re module provides regular-expression matching and related operations. Regular expressions are a deep subject; the documentation is here: https://docs.python.org/3/library/re.html?highlight=regex.

As some simple examples, we have:

  1. . matches any character except \n (newline)
  2. \d matches any digit (this can also be written as [0-9])
  3. \D matches any non-digit (this can also be written as [^0-9])
  4. \w matches any word character, i.e. alphanumeric or underscore ([a-zA-Z0-9_])
  5. \W matches any non-word character ([^a-zA-Z0-9_])

As some more complex examples, regular expressions also allow you to quantify the number of times matches can occur. For example,

  1. [a-d]+ matches any time you get \(\{a,b,c,d\}\) one or more times in a row
  2. [a-d]{3} matches any time you get them exactly 3 times in a row
  3. [a-d]* matches any time you get them 0 or more times in a row
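A brief sketch of these quantifiers in action, using re.findall on a made-up string:

import re

s = "abcd ab a xyz aaabbb"
print(re.findall(r"[a-d]+", s))    # ['abcd', 'ab', 'a', 'aaabbb']
print(re.findall(r"[a-d]{3}", s))  # ['abc', 'aaa', 'bbb']
print(re.findall(r"\d+", "Born in 1990, moved in 2015"))  # ['1990', '2015']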

For now, we give a simple application based on the re.sub() function, which substitutes words that match a pattern:

sentence = 'That was an "interesting" way to cook bread.'
pattern = r"[^\w]" 

# the ^ character inside the brackets denotes 'not',
# \w denotes a word character (letter, digit, or underscore), and []
# matches any single character listed inside the brackets.
# Together, this pattern matches any character that is not a word character.

print(re.sub(pattern, " ", sentence))
print(re.sub(pattern, "", sentence))

txt = "Natesh loves all the foold and loveds sdaslo"
# x is a compiled regex pattern object, which provides methods such as finditer
x   = re.compile('lo')

iterator = x.finditer(txt)
for item in iterator:
    # print(item)
    print(item.span())
    print(item.group())
That was an  interesting  way to cook bread 
Thatwasaninterestingwaytocookbread
(7, 9)
lo
(31, 33)
lo
(42, 44)
lo

Some other useful functions in re are

  • re.split(): Divides a string into a list based on a pattern.
  • re.findall(): Returns a list of all substrings in a string that match the pattern.
  • re.sub(): Replaces occurrences of a specified pattern in a string with a replacement string.
  • re.compile(): Compiles a regular expression pattern into a regex object for repeated use.
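A short sketch of these functions, using a hypothetical log-style string:

import re

log_line = "error=42; warning=7; info=1003"

# re.split: break the string wherever the pattern matches
print(re.split(r";\s*", log_line))    # ['error=42', 'warning=7', 'info=1003']

# re.findall: collect every substring that matches the pattern
print(re.findall(r"\d+", log_line))   # ['42', '7', '1003']

# re.sub: replace every match with a replacement string
print(re.sub(r"\d+", "N", log_line))  # error=N; warning=N; info=N

# re.compile: build the pattern once and reuse it
number = re.compile(r"\d+")
print(number.findall(log_line))       # ['42', '7', '1003']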

10.8 Part-of-speech (POS) tagging

In English, there are eight main parts of speech: nouns, pronouns, adjectives, verbs, adverbs, prepositions, conjunctions, and interjections. The purpose of POS tagging is to label each word in a document with its part of speech. Unsurprisingly, POS tagging can be very difficult to do by hand. nltk has a default function for this, called nltk.pos_tag(), which we will use. As a word of warning, this function is far from infallible, especially on informal text (e.g. website reviews, forum posts, text messages), and words in English often exhibit POS drift (e.g. the drift of “Google” from noun to verb).

Here are some key use cases for POS tagging in NLP:

  1. Text Parsing and Syntactic Analysis
    • POS tagging is essential for syntactic parsing, which involves analyzing the grammatical structure of a sentence.
    • Helps in identifying the relationships between words and understanding the sentence structure.
  2. Named Entity Recognition (NER)
    • Helps in distinguishing between names, locations, organizations, and other proper nouns.
    • POS tags help NER systems to understand the context and improve accuracy in identifying entities.
  3. Machine Translation
    • Provides grammatical information that aids in translating text from one language to another.
    • Helps in maintaining the syntactic structure and grammatical correctness in the translated text.
  4. Information Retrieval and Search Engines
    • Improves the relevance of search results by understanding the query’s context and filtering out irrelevant results.
    • Enhances search algorithms by recognizing and prioritizing different parts of speech.
  5. Sentiment Analysis
    • Identifies adjectives and adverbs that are crucial for determining the sentiment of a sentence.
    • Helps in understanding the context and polarity of opinions expressed in the text.
  6. Speech Recognition and Text-to-Speech
    • Enhances the accuracy of speech recognition systems by providing context for homophones and ambiguous words.
    • Helps in generating natural and grammatically correct speech in text-to-speech systems.
  7. Keyword Extraction and Text Summarization
    • Aids in identifying key phrases and important information in the text.
    • Improves the quality of summaries by focusing on nouns, verbs, and other significant parts of speech.
  8. Coreference Resolution
    • Helps in linking pronouns to the nouns they refer to within a text.
    • Facilitates better understanding and continuity in document-level NLP tasks.
  9. Dependency Parsing
    • Provides a basis for dependency parsing, which focuses on the dependencies between words in a sentence.
    • Helps in understanding the grammatical relations and hierarchical structure of the sentence.

Below is how we can do POS tagging in Python:

nltk.download('averaged_perceptron_tagger')
#https://www.nltk.org/book/ch05.html
text_word_token = nltk.word_tokenize("Kelly is not having a good day because she has pancreatitis")
#text_word_token = nltk.word_tokenize(data.text[0])
nltk.pos_tag(text_word_token)
#https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\12RAM\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[('Kelly', 'NNP'),
 ('is', 'VBZ'),
 ('not', 'RB'),
 ('having', 'VBG'),
 ('a', 'DT'),
 ('good', 'JJ'),
 ('day', 'NN'),
 ('because', 'IN'),
 ('she', 'PRP'),
 ('has', 'VBZ'),
 ('pancreatitis', 'NN')]

The following is a list of some of the tags in this output and their meanings:

  • ‘NNP’: Proper noun, singular - This tag is used for singular proper nouns, which are names of specific people, places, or things. In this case, ‘Kelly’ is a proper noun, and so it is tagged with ‘NNP’.

  • ‘VBZ’: Verb, 3rd person singular present - This tag is used for third person singular verbs in the present tense (e.g. he runs, she eats). In this case, ‘is’ is a third person singular present verb, and so it is tagged with ‘VBZ’.

  • ‘VBG’: Verb, gerund or present participle - This tag is used for present participles and gerunds, which are verb forms that end in -ing (e.g. running, eating). In this case, ‘having’ is a present participle, and so it is tagged with ‘VBG’.

  • ‘DT’: Determiner - This tag is used for determiners, which are words that specify or indicate the noun that follows. In this case, ‘a’ is a determiner that indicates the noun ‘day’, and so it is tagged with ‘DT’.

  • ‘JJ’: Adjective - This tag is used for adjectives, which are words that describe or modify nouns or pronouns. In this case, ‘good’ is an adjective that describes the noun ‘day’, and so it is tagged with ‘JJ’.

  • ‘NN’: Noun, singular or mass - This tag is used for singular or mass nouns, which are common nouns that represent people, places, things, or concepts. In this case, ‘day’ is a singular noun, and so it is tagged with ‘NN’.

NLTK provides documentation for each tag, which can be queried using the tag itself; e.g. nltk.help.upenn_tagset('RB'). Since POS is context-sensitive, POS-taggers must usually be trained on an existing corpus that has been tagged by professional linguists (possibly alongside unlabeled data to take advantage of semi-supervised methods). The most popular tag set is called the Penn Treebank set:

# We can get more details about any POS tag using the help function of nltk
nltk.download('tagsets')
nltk.help.upenn_tagset('IN')
IN: preposition or conjunction, subordinating
    astride among uppon whether out inside pro despite on by throughout
    below within for towards near behind atop around if like until below
    next into if beside ...
[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\12RAM\AppData\Roaming\nltk_data...
[nltk_data]   Package tagsets is already up-to-date!

10.8.1 Misc. Python functions from Case 10

  • rcParams['figure.figsize'] = wid, hei sets the default width and height (in inches) of figures displayed in a notebook.
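For example, a minimal sketch of using this setting before plotting:

from pylab import rcParams
import matplotlib.pyplot as plt

# make all subsequent figures 12 inches wide and 6 inches tall
rcParams['figure.figsize'] = 12, 6

plt.plot([1, 2, 3], [2, 4, 8])
plt.title("Figure sized via rcParams")
plt.show()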