Natural Language Processing

Introduction to NLP

Enables machines to understand and generate human language.
Use cases: sentiment analysis, translation, recommender systems, etc.
Challenges: ambiguity, context understanding, language variability.
Python's nltk library is widely used for NLP tasks. Install it with: pip install nltk
nltk provides tools for tokenization, stemming, stopword removal, POS tagging, etc.

Text Preprocessing Steps

Tokenization: Split text into words or sentences.
Stemming: Reduce words to their stem by stripping suffixes (e.g. running => run). The stem need not be a dictionary word (e.g. language => languag).
Stopword Removal: Remove common words that do not add much meaning (e.g. the, is).
Vectorization: Convert text into numerical features using techniques like Bag of Words or TF-IDF.
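
The tokenization and stopword-removal steps above can be sketched without any library. This is a minimal illustration, not nltk's implementation: the regular expression, the STOPWORDS set, and the helper names tokenize and remove_stopwords are assumptions made for this example. Stemming is left out because a realistic stemmer (such as the Porter stemmer shipped with nltk) is too long to reproduce here.

```python
import re

# Tiny hand-picked stopword list (an assumption for this sketch);
# real lists, such as the one shipped with nltk, are much longer.
STOPWORDS = {"a", "an", "the", "is", "are", "i", "of", "to", "and"}


def tokenize(text):
    """Split text into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text)


def remove_stopwords(tokens):
    """Lowercase tokens and drop those found in STOPWORDS."""
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]


tokens = tokenize("I love learning new languages!")
print(tokens)                    # ['I', 'love', 'learning', 'new', 'languages']
print(remove_stopwords(tokens))  # ['love', 'learning', 'new', 'languages']
```

Unlike nltk's word_tokenize, this regular expression discards punctuation instead of returning it as separate tokens.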

Example of Text Preprocessing Techniques and their Effects:

Step	Effect on Text 1	Effect on Text 2
Original Text	Python is a programming language.	I love learning new languages!
Tokenization	['Python', 'is', 'a', 'programming', 'language']	['I', 'love', 'learning', 'new', 'languages']
Stemming	['python', 'is', 'a', 'program', 'languag']	['i', 'love', 'learn', 'new', 'languag']
Stopword Removal	['python', 'program', 'languag']	['love', 'learn', 'new', 'languag']
Vocabulary Creation		['python', 'program', 'languag', 'love', 'learn', 'new']
Vectorization	[1, 1, 1, 0, 0, 0]	[0, 0, 1, 1, 1, 1]
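
The last two rows of the table can be reproduced with a few lines of plain Python. The sketch below is illustrative only: the helper names build_vocabulary, bag_of_words and tf_idf are made up for this example, and the TF-IDF variant shown is the plain textbook one (raw count times log(N / df)), which is an assumption rather than what any particular library computes.

```python
import math

# Token lists after stemming and stopword removal, copied from the table.
doc1 = ["python", "program", "languag"]
doc2 = ["love", "learn", "new", "languag"]
docs = [doc1, doc2]


def build_vocabulary(docs):
    """Collect unique tokens in order of first appearance."""
    vocab = []
    for doc in docs:
        for token in doc:
            if token not in vocab:
                vocab.append(token)
    return vocab


def bag_of_words(doc, vocab):
    """Count how often each vocabulary term occurs in doc."""
    return [doc.count(term) for term in vocab]


def tf_idf(doc, docs, vocab):
    """Textbook TF-IDF: raw count times log(N / document frequency)."""
    n_docs = len(docs)
    weights = []
    for term in vocab:
        df = sum(1 for d in docs if term in d)  # documents containing term
        weights.append(doc.count(term) * math.log(n_docs / df))
    return weights


vocab = build_vocabulary(docs)
print(vocab)                      # ['python', 'program', 'languag', 'love', 'learn', 'new']
print(bag_of_words(doc1, vocab))  # [1, 1, 1, 0, 0, 0]
print(bag_of_words(doc2, vocab))  # [0, 0, 1, 1, 1, 1]
print([round(w, 3) for w in tf_idf(doc1, docs, vocab)])
# [0.693, 0.693, 0.0, 0.0, 0.0, 0.0]
```

Because 'languag' occurs in both texts, its TF-IDF weight here is 0, while Bag of Words still counts it. Library implementations such as scikit-learn's TfidfVectorizer apply IDF smoothing and L2 normalization by default, so their numbers will differ from this sketch.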

POS Tagging

POS Tagging: Assign part-of-speech tags to each word (e.g. noun, verb).
Helps in understanding grammatical structure and meaning.
nltk provides the pos_tag function for POS tagging.
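Example:

    from nltk import word_tokenize, pos_tag

    # One-time setup: fetch the tokenizer and tagger data with nltk.download()
    # (the resource names vary by nltk version).
    tokens = word_tokenize("Python is great!")
    pos_tags = pos_tag(tokens)
    print(pos_tags)
    # [('Python', 'NNP'), ('is', 'VBZ'), ('great', 'JJ'), ('!', '.')]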
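
Once tokens are tagged, a common next step is to pick out words by grammatical role. The sketch below is a hypothetical helper, not part of nltk: the name words_with_tag_prefix is made up, and the tagged list is hard-coded in the (word, tag) shape that pos_tag returns so that the example runs without nltk installed.

```python
# (word, tag) pairs using Penn Treebank tags, hard-coded for this sketch.
pos_tags = [("Python", "NNP"), ("is", "VBZ"), ("great", "JJ"), ("!", ".")]


def words_with_tag_prefix(tagged, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged if tag.startswith(prefix)]


print(words_with_tag_prefix(pos_tags, "NN"))  # ['Python']  (nouns)
print(words_with_tag_prefix(pos_tags, "VB"))  # ['is']      (verbs)
print(words_with_tag_prefix(pos_tags, "JJ"))  # ['great']   (adjectives)
```

Matching on a prefix works because Penn Treebank tags group related classes under a shared stem, e.g. NN, NNS, NNP and NNPS are all nouns.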

Cosine Similarity

Cosine Similarity: Measures the cosine of the angle between two vectors.
Used to determine the similarity between documents.
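Value ranges from -1 (opposite direction) to 1 (same direction); 0 means the vectors are orthogonal. For non-negative vectors such as word counts or TF-IDF weights, the range is 0 to 1.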
Example:

    from sklearn.metrics.pairwise import cosine_similarity

    vector1 = [[1, 0, 1]]
    vector2 = [[0, 1, 1]]
    similarity = cosine_similarity(vector1, vector2)
    print(similarity)  # [[0.5]]
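
To see where the value comes from, the formula cos(theta) = (u . v) / (|u| |v|) can be computed by hand. The following is a minimal sketch in plain Python; the function name cosine is chosen for this example and it is not how scikit-learn implements the calculation internally.

```python
import math


def cosine(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(round(cosine([1, 0, 1], [0, 1, 1]), 6))    # 0.5   (one shared term out of two each)
print(round(cosine([1, 2, 3], [2, 4, 6]), 6))    # 1.0   (same direction, different length)
print(round(cosine([1, 0, 1], [-1, 0, -1]), 6))  # -1.0  (opposite direction)
print(round(cosine([1, 0], [0, 1]), 6))          # 0.0   (orthogonal)
```

The second call shows that only direction matters, not magnitude, which is why a long and a short document with the same word proportions score as fully similar. This sketch raises ZeroDivisionError for an all-zero vector, a case that would need separate handling.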

Principal Component Analysis