Natural Language Processing in Python: Techniques

In this tutorial, we will explore different techniques to understand how to implement natural language processing in Python.

Generally, computers work well with organized, labeled, structured data, such as the data in a spreadsheet. However, most of what we speak and write is unstructured and difficult for a machine to understand. NLP was developed to address this problem.

NLP (Natural Language Processing) is a set of techniques used to make computers understand and process unstructured data. With a few lines of code, NLP can extract meaningful information from unstructured text. In this article, we will discuss a few of these techniques in detail.


They are:

  1. Wordcloud
  2. Bag of Words
  3. Term Frequency – Inverse Document Frequency
  4. Stemming
  5. Lemmatization
  6. Named Entity Recognition

1. Wordcloud

Wordcloud is a Natural Language Processing technique used to identify the keywords in a given text file. It outputs an image in which the more frequent words in the text appear in a larger, bolder font, while less frequent words appear in a smaller, thinner font.

The output image is saved in the folder containing the code. A word cloud can be generated with either the wordcloud library or the stylecloud library.

Below is the simple code to generate a wordcloud from the words present in a demo.txt file.

Code:

# import necessary packages
import stylecloud

# generates wordcloud
stylecloud.gen_stylecloud(file_path='demo.txt')

Output:

[Output image: a word cloud generated from the words in demo.txt]

It is simple to implement and understand. The output image can be customized by passing the desired shape of the resulting picture to the icon_name parameter of the gen_stylecloud() method.

2. Bag of Words

Bag of Words is a model that represents text as numbers. It records the frequency of each word in the given string (by default, scikit-learn's CountVectorizer only counts tokens of at least two characters).

Apart from that, it does not consider the order of words, only their frequency. It is mainly used for document classification.

To implement the model in Python, import CountVectorizer from the sklearn package. Let's look at a sample program that implements the Bag of Words technique.

Code:

# import necessary libraries
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# list of strings
strings = ["Python programming language is easy to learn and easy to read",
           "Java is an object oriented programing where everything is of form object",
           "codeitbro websites provide help documents on Java & Python "]

# converting the list of strings into a dataframe
data = pd.DataFrame({'String-X': ['string1', 'string2','string3'], 'text':strings})

countWords = CountVectorizer(stop_words='english')

# using CountVectorizer transform the text data in numeric format
cv_matrix = countWords.fit_transform(data['text'])
text_numeric = pd.DataFrame(cv_matrix.toarray(),
                      index=data['String-X'].values,
                      columns=countWords.get_feature_names_out())
text_numeric

Output:

[Output: a DataFrame with one row per string and one column per vocabulary word, containing the word counts]

Explanation

  • Imported pandas library to create a dataframe and CountVectorizer to implement the Bag of Words model.
  • Create a list of strings.
  • Convert the above-created list into a dataframe.
  • Using CountVectorizer, transform the list of strings into numeric data with the fit_transform() method.

So from the above output, we can see that the word easy appears twice in string1, whereas words such as language, programming, and python appear only once.
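The counting itself can be sketched in plain Python with collections.Counter. This is a toy illustration of the idea behind CountVectorizer, not its actual implementation: it ignores the stop-word list and uses a simple whitespace split.

```python
from collections import Counter

def bag_of_words(text):
    # lowercase and split on whitespace; keep tokens of 2+ characters,
    # loosely mimicking CountVectorizer's default token pattern
    tokens = [t for t in text.lower().split() if len(t) >= 2]
    return Counter(tokens)

counts = bag_of_words("Python programming language is easy to learn and easy to read")
print(counts["easy"])    # 2
print(counts["python"])  # 1
```

As with CountVectorizer, the order of words is lost; only their frequencies survive.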

3. Term Frequency – Inverse Document Frequency

This NLP technique is similar to Bag of Words, but instead of recording the raw frequency of words, it computes a weight for each word, representing how relevant that word is to a document within a collection of documents. Search engines use this technique to retrieve results relevant to a search query.

Here, three steps need to be followed to calculate the weight of a word.

Term Frequency – Calculates the frequency of a word in the document/sentence.

Inverse Document Frequency – The IDF value for a word can be calculated with the formula below:

IDF(word) = log(Total number of documents / Number of documents containing the word)

Multiply TF and IDF to calculate the weight of the word.
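The three steps above can be sketched from scratch in plain Python. Note that this uses the textbook formula, so the numbers will differ slightly from TfidfVectorizer's output, which applies smoothing and normalization; the token lists below are made up for illustration.

```python
import math

# three toy "documents", already tokenized
documents = [
    ["python", "easy", "easy", "learn"],
    ["java", "object", "object"],
    ["codeitbro", "java", "python"],
]

def tf(word, doc):
    # term frequency: occurrences of the word divided by document length
    return doc.count(word) / len(doc)

def idf(word, docs):
    # inverse document frequency: log of total docs over docs containing the word
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / containing)

def tf_idf(word, doc, docs):
    # step 3: multiply TF by IDF to get the word's weight
    return tf(word, doc) * idf(word, docs)

# "easy" appears only in the first document, so it gets a high weight there
print(round(tf_idf("easy", documents[0], documents), 3))    # 0.549
# "python" appears in two of the three documents, so its weight is lower
print(round(tf_idf("python", documents[0], documents), 3))  # 0.101
```

This shows the key property of TF-IDF: a word common in one document but rare across the collection gets a high weight, while a word spread across many documents is down-weighted.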

To implement the model in Python, import TfidfVectorizer from the sklearn package. Let's look at a sample program that implements the TF-IDF technique.

Code:

# import necessary libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# list of strings
strings = ["Python programming language is easy to learn and easy to read",
           "Java is an object oriented programing where everything is of form object",
           "codeitbro websites provide help documents on Java & Python "]

# converting the list of strings into a dataframe
data = pd.DataFrame({'String-X': ['string1', 'string2','string3'], 'text':strings})

TF_IDF = TfidfVectorizer(stop_words='english')

# using TfidfVectorizer transform the text data in numeric format & calculate weight
TF_IDF_matrix = TF_IDF.fit_transform(data['text'])
weight = pd.DataFrame(TF_IDF_matrix.toarray(),
                      index=data['String-X'].values,
                      columns=TF_IDF.get_feature_names_out())
weight

Output:

[Output: a DataFrame with one row per string and one column per vocabulary word, containing the TF-IDF weights]

Explanation

  • Imported pandas library to create a dataframe and TfidfVectorizer to implement the TF-IDF model.
  • Create a list of strings.
  • Convert the above-created list into a dataframe.
  • Using TfidfVectorizer, transform the list of strings into numeric data and calculate the weights with the fit_transform() method.

4. Stemming

Stemming is a Natural Language Processing technique used to normalize words: it truncates a given word down to its stem. For some words, the resulting stem may not be a dictionary word. Stemming has a high processing speed.

Below is an example program that applies two stemming algorithms.

Code:

# import necessary libraries
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer

# using PorterStemmer
porter = PorterStemmer()
print(porter.stem("relationship"))

# using LancasterStemmer
lancaster = LancasterStemmer()
print(lancaster.stem("relationship"))

Output:

relationship
rel

Explanation:

  • The PorterStemmer algorithm doesn't follow linguistics; instead, it applies a set of five phases of rules for different cases to generate stems.
  • With LancasterStemmer, over-stemming may occur, producing stems that are not linguistic or have no meaning.

5. Lemmatization

Lemmatization is a Natural Language Processing technique used to normalize a word. Instead of truncating the word, it finds the word's dictionary form (its lemma). Lemmatization has high accuracy but a lower processing speed than stemming.

Below is an example program of the Lemmatization technique.

Code:

# import necessary libraries
from nltk.stem import WordNetLemmatizer

# WordNetLemmatizer relies on the WordNet corpus;
# run nltk.download('wordnet') once if it is not already installed

# using WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("articles"))

Output:

article
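Conceptually, lemmatization is a dictionary lookup rather than a truncation. Below is a toy sketch with a small hand-made lemma table (real lemmatizers consult the WordNet database and use part-of-speech information, so this is only an illustration of the principle):

```python
# hand-made lemma table; a real lemmatizer consults WordNet instead
LEMMAS = {
    "articles": "article",
    "better": "good",
    "ran": "run",
    "mice": "mouse",
}

def toy_lemmatize(word):
    # fall back to the word itself when it is not in the table
    return LEMMAS.get(word.lower(), word)

print(toy_lemmatize("mice"))  # mouse
print(toy_lemmatize("ran"))   # run
```

Mappings like "mice" to "mouse" show what lemmatization can do that stemming cannot: no suffix-stripping rule could recover an irregular dictionary form.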

6. Named Entity Recognition

Named Entity Recognition (NER) is a natural language processing technique that classifies named entities such as persons, locations, organizations, quantities, money, etc. It is mainly used to optimize search engine algorithms, classify content, build recommender systems, etc.

Code:

# import necessary libraries
import spacy

# a small English pipeline trained on web text
# (download it once with: python -m spacy download en_core_web_sm)
pipeline = spacy.load("en_core_web_sm")

doc = pipeline("Himanshu is founder of Codeitbro")
print([(X.text, X.label_) for X in doc.ents])

Output:

[('Himanshu', 'PERSON'), ('Codeitbro', 'ORG')]

Explanation:

  • Imported the spacy library.
  • Created a pipeline using the load() function.
  • Applied the pipeline to the sentence to find its entities.

Summary

This tutorial should help you get started with natural language processing in Python. There are many other techniques besides the ones mentioned above, such as Sentiment Analysis, which determines the nature of text, i.e., whether it is positive, negative, or neutral, and is widely used to process human language (unstructured data).