In this tutorial, we will explore different techniques for implementing natural language processing in Python.
Generally, computers work well with organized, labeled, structured data, like the data in a spreadsheet. However, most of what we speak and write is unstructured and difficult for a machine to understand. This is where NLP comes into the picture.
NLP (Natural Language Processing) is used to make computers understand and process unstructured data. With a few lines of code, NLP can extract meaningful information from unstructured text. In this article, we will discuss a few of its techniques in detail.
Natural Language Processing in Python: Techniques
They are:
- Wordcloud
- Bag of Words
- Term Frequency – Inverse Document Frequency
- Stemming
- Lemmatization
- Named Entity Recognition
1. Wordcloud
Wordcloud is a Natural Language Processing technique for identifying the keywords in a given text file. It produces an image in which the more frequent words in the text appear in a larger, bolder font, while less frequent words appear in a smaller, thinner font.
The output image is stored in the folder where the code is present. A word cloud can be generated with either the wordcloud library or the stylecloud library.
Below is the simple code to generate a wordcloud from the words present in a demo.txt file.
Code:
# import necessary packages
import stylecloud

# generate a word cloud from the words in demo.txt
stylecloud.gen_stylecloud(file_path='demo.txt')
Output: a word cloud image saved in the folder containing the code.
This technique is simple to implement and easy to understand. You can customize the output image by passing a shape to the icon_name parameter of the gen_stylecloud() method, as shown below.
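For example, here is a minimal sketch that renders the word cloud in the shape of a Font Awesome icon; the specific icon ('fas fa-dog') is just an illustrative choice:

# import necessary packages
import stylecloud

# render the word cloud in the shape of a Font Awesome icon
stylecloud.gen_stylecloud(file_path='demo.txt', icon_name='fas fa-dog')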
2. Bag of Words
Bag of Words is a model that represents text as numbers: it records how often each word occurs in a given string (by default, sklearn's tokenizer only counts words of at least two characters).
Apart from that, it does not consider the order of words, only their frequency. It is mainly used for document classification. A minimal illustration of this idea follows.
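Here is a small standard-library sketch showing that two sentences with the same words in a different order produce the same bag:

from collections import Counter

bag1 = Counter("the cat sat on the mat".split())
bag2 = Counter("on the mat the cat sat".split())

# word order is lost; only the counts matter
print(bag1 == bag2)  # True
print(bag1)          # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})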
To implement the model in Python, import CountVectorizer from the sklearn package. Let's look at a sample program that implements the Bag of Words technique.
Code:
# import necessary libraries
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# list of strings
strings = ["Python programming language is easy to learn and easy to read",
           "Java is an object oriented programing where everything is of form object",
           "codeitbro websites provide help documents on Java & Python"]

# converting the list of strings into a dataframe
data = pd.DataFrame({'String-X': ['string1', 'string2', 'string3'], 'text': strings})

countWords = CountVectorizer(stop_words='english')

# using CountVectorizer, transform the text data into numeric format
cv_matrix = countWords.fit_transform(data['text'])

# get_feature_names() was renamed to get_feature_names_out() in newer sklearn
text_numeric = pd.DataFrame(cv_matrix.toarray(),
                            index=data['String-X'].values,
                            columns=countWords.get_feature_names_out())
text_numeric
Output: a dataframe with one row per string and one column per word, holding the word counts.
Explanation
- Import the pandas library to create a dataframe and CountVectorizer to implement the Bag of Words model.
- Create a list of strings.
- Convert the list into a dataframe.
- Use CountVectorizer's fit_transform() method to transform the list of strings into numeric data.
From the above output, we can see that in string1 the word easy is present twice, while the other words (language, programming, python, etc.) are present only once.
3. Term Frequency – Inverse Document Frequency
This NLP technique is similar to the Bag of Words technique, but instead of recording raw word frequencies, it computes a weight for each word, i.e., it represents how relevant a word is to a document within a collection of documents. Search engines use this technique to retrieve results relevant to a query.
Three steps need to be followed to calculate the weight of a word; a hand-rolled sketch follows the list.
- Term Frequency (TF) – Calculate the frequency of the word in the document/sentence.
- Inverse Document Frequency (IDF) – Calculate IDF(word) = log(total number of documents / number of documents containing the word).
- Multiply TF and IDF to obtain the weight of the word.
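To make the arithmetic concrete, here is a minimal hand-rolled sketch of these three steps over a toy corpus (note that sklearn's TfidfVectorizer uses a smoothed, normalized variant of this formula, so its numbers will differ):

import math

# toy corpus: each document is a list of words
docs = [["python", "is", "easy"],
        ["java", "is", "verbose"],
        ["python", "and", "java"]]

def tf(word, doc):
    # term frequency: occurrences of the word divided by document length
    return doc.count(word) / len(doc)

def idf(word, docs):
    # inverse document frequency: log(total docs / docs containing the word)
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / containing)

def tf_idf(word, doc, docs):
    # weight = TF * IDF
    return tf(word, doc) * idf(word, docs)

# weight of "python" in the first document
print(tf_idf("python", docs[0], docs))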
To implement the model in Python, import TfidfVectorizer from the sklearn package. Let's look at a sample program that implements the TF-IDF technique.
Code:
# import necessary libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# list of strings
strings = ["Python programming language is easy to learn and easy to read",
           "Java is an object oriented programing where everything is of form object",
           "codeitbro websites provide help documents on Java & Python"]

# converting the list of strings into a dataframe
data = pd.DataFrame({'String-X': ['string1', 'string2', 'string3'], 'text': strings})

TF_IDF = TfidfVectorizer(stop_words='english')

# using TfidfVectorizer, transform the text data into numeric format and calculate weights
TF_IDF_matrix = TF_IDF.fit_transform(data['text'])

weight = pd.DataFrame(TF_IDF_matrix.toarray(),
                      index=data['String-X'].values,
                      columns=TF_IDF.get_feature_names_out())
weight
Output: a dataframe with one row per string and one column per word, holding the TF-IDF weights.
Explanation
- Import the pandas library to create a dataframe and TfidfVectorizer to implement the TF-IDF model.
- Create a list of strings.
- Convert the list into a dataframe.
- Use TfidfVectorizer's fit_transform() method to transform the list of strings into numeric data and calculate the weights; a quick follow-up sketch is shown below.
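As a follow-up, you can read the most relevant word for each string straight out of the resulting dataframe (this assumes the weight dataframe from the example above):

# the word with the highest TF-IDF weight in each string
print(weight.idxmax(axis=1))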
4. Stemming
Stemming is a Natural Language Processing technique used to normalize words. It truncates a given word to its stem, which for some words may not be a dictionary word. The stemming technique has a high processing speed.
Below is an example program demonstrating stemming with two algorithms.
Code:
# import necessary libraries
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer

# using PorterStemmer
porter = PorterStemmer()
print(porter.stem("relationship"))

# using LancasterStemmer
lancaster = LancasterStemmer()
print(lancaster.stem("relationship"))
Output:
relationship
rel
Explanation:
- The PorterStemmer algorithm is not based on linguistics; it applies a set of five rules for different cases to generate stems.
- With LancasterStemmer, over-stemming may occur, which produces stems that are not linguistic or have no meaning (such as rel above). A sketch comparing the two stemmers follows.
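Here is a small sketch comparing the two stemmers side by side on a handful of words (the word list is an illustrative choice):

from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# compare the two algorithms on each word
for word in ["running", "studies", "relational", "maximum"]:
    print(word, "->", porter.stem(word), "|", lancaster.stem(word))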
5. Lemmatization
Lemmatization is a Natural Language Processing technique used to normalize a word. Instead of truncating the word, it finds its dictionary form (the lemma). The lemmatization technique has high accuracy but a lower processing speed than stemming.
Below is an example program demonstrating the lemmatization technique.
Code:
# import necessary libraries
from nltk.stem import WordNetLemmatizer

# using WordNetLemmatizer
# (requires the WordNet corpus: run nltk.download('wordnet') once beforehand)
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("articles"))
Output:
article
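By default, lemmatize() treats a word as a noun; passing a part-of-speech tag often yields a better lemma. A minimal sketch:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# treated as a noun by default, so the word comes back unchanged
print(lemmatizer.lemmatize("running"))           # running
# with pos="v" (verb), it is reduced to its dictionary form
print(lemmatizer.lemmatize("running", pos="v"))  # run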
6. Named Entity Recognition
Named Entity Recognition is a natural language processing technique that identifies and classifies named entities in text, such as persons, locations, organizations, quantities, and monetary values. It is mainly used to optimize search engine algorithms, classify content, build recommendation systems, and so on.
Code:
# import necessary libraries
import spacy

# a small English pipeline trained on web text
# (requires: python -m spacy download en_core_web_sm)
pipeline = spacy.load("en_core_web_sm")
doc = pipeline("Himanshu is founder of Codeitbro")
print([(X.text, X.label_) for X in doc.ents])
Output:
[('Himanshu', 'PERSON'), ('Codeitbro', 'ORG')]
Explanation:
- Import the spacy library.
- Load a pretrained pipeline using the load() function.
- Apply the pipeline to the sentence to find its entities; a follow-up sketch is shown below.
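Building on this, here is a minimal sketch that also prints a human-readable description of each entity label via spacy.explain(); the example sentence is an illustrative choice, and the exact entities detected depend on the trained model:

import spacy

pipeline = spacy.load("en_core_web_sm")
doc = pipeline("Apple acquired the startup for $2 billion in March 2023.")

# print each entity together with a description of its label
for ent in doc.ents:
    print(ent.text, ent.label_, spacy.explain(ent.label_))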
Summary
This tutorial should help you get started with natural language processing in Python. Many other techniques exist beyond those mentioned above, such as sentiment analysis, which determines the nature of a piece of text, i.e., whether it is positive, negative, or neutral, and is likewise used to process human language (unstructured data). A small sentiment analysis sketch follows.
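As a parting example, here is a minimal sentiment analysis sketch using NLTK's VADER analyzer (requires running nltk.download('vader_lexicon') once beforehand):

from nltk.sentiment import SentimentIntensityAnalyzer

# polarity_scores returns negative, neutral, positive, and compound scores
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("Python is easy to learn!"))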