Natural Language Processing

Introduction to NLP

  • Enables machines to understand and generate human language.
  • Use cases: sentiment analysis, translation, recommender systems, etc.
  • Challenges: ambiguity, context understanding, language variability.
  • Python's nltk library (Natural Language Toolkit) is commonly used for NLP tasks. Install it with: pip install nltk
  • nltk provides tools for tokenization, stemming, stopword removal, POS tagging, etc.

Text Preprocessing Steps

  • Tokenization: Split text into words or sentences.
  • Stemming: Reduce words to their stem by stripping suffixes (e.g. running => run). The stem need not be a dictionary word (e.g. language => languag).
  • Stopword Removal: Remove common words that do not add much meaning (e.g. the, is).
  • Vectorization: Convert text into numerical features using techniques like Bag of Words or TF-IDF.
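
The first three steps above can be sketched in a few lines. This is a minimal illustration rather than a reference pipeline: the regular-expression tokenizer and the tiny STOPWORDS set are assumptions made here so that no nltk data download is needed, and preprocess is a hypothetical helper name. Only PorterStemmer comes from nltk.

```python
import re

from nltk.stem import PorterStemmer

# Hypothetical, deliberately tiny stopword list for illustration only;
# a fuller list ships with nltk's stopwords corpus (separate download).
STOPWORDS = {"a", "an", "the", "is", "i"}


def preprocess(text):
    # Tokenization: keep runs of letters, which also drops punctuation.
    tokens = re.findall(r"[A-Za-z]+", text)
    # Stemming: nltk's Porter stemmer lowercases its input by default.
    stemmer = PorterStemmer()
    stems = [stemmer.stem(token) for token in tokens]
    # Stopword removal, applied to the lowercased stems.
    return [stem for stem in stems if stem not in STOPWORDS]


print(preprocess("Python is a programming language."))
print(preprocess("I love learning new languages!"))
```

Stopwords are removed after stemming here only because the stems are already lowercased at that point, so a lowercase stopword set is enough.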

Example of Text Preprocessing Techniques and their Effects:

Step                | Effect on Text 1                                 | Effect on Text 2
Original Text       | Python is a programming language.                | I love learning new languages!
Tokenization        | ['Python', 'is', 'a', 'programming', 'language'] | ['I', 'love', 'learning', 'new', 'languages']
Stemming            | ['python', 'is', 'a', 'program', 'languag']      | ['i', 'love', 'learn', 'new', 'languag']
Remove Stopword     | ['python', 'program', 'languag']                 | ['love', 'learn', 'new', 'languag']
Vocabulary Creation | ['python', 'program', 'languag', 'love', 'learn', 'new'] (shared by both texts)
Vectorization       | [1, 1, 1, 0, 0, 0]                               | [0, 0, 1, 1, 1, 1]
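
The vocabulary and vectorization rows of the example can be reproduced with plain Python. The sketch below assumes a binary Bag of Words (1 if a term occurs in the text, 0 otherwise), which is what the example vectors show. build_vocabulary and vectorize are hypothetical helper names, not nltk functions.

```python
def build_vocabulary(documents):
    # Collect unique tokens in order of first appearance across all texts.
    vocabulary = []
    for tokens in documents:
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)
    return vocabulary


def vectorize(tokens, vocabulary):
    # Binary Bag of Words: 1 if the vocabulary term occurs in the text, else 0.
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]


# Stemmed, stopword-free tokens taken from the example.
docs = [
    ["python", "program", "languag"],
    ["love", "learn", "new", "languag"],
]
vocab = build_vocabulary(docs)
vectors = [vectorize(doc, vocab) for doc in docs]

print(vocab)    # ['python', 'program', 'languag', 'love', 'learn', 'new']
print(vectors)  # [[1, 1, 1, 0, 0, 0], [0, 0, 1, 1, 1, 1]]
```

A count-based Bag of Words would store the number of occurrences instead of 1. For these two texts the result is the same, because no term repeats within a text.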

POS Tagging

  • POS Tagging: Assign part-of-speech tags to each word (e.g. noun, verb).
  • Helps in understanding grammatical structure and meaning.
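  • nltk provides the pos_tag function for POS tagging.
  • Example (assumes the tokenizer and tagger resources have already been fetched with nltk.download; resource names vary by nltk version):

      from nltk import pos_tag, word_tokenize

      # word_tokenize keeps punctuation as separate tokens.
      tokens = word_tokenize("Python is great!")
      pos_tags = pos_tag(tokens)
      print(pos_tags)
      # Typical output:
      # [('Python', 'NNP'), ('is', 'VBZ'), ('great', 'JJ'), ('!', '.')]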

Cosine Similarity

  • Cosine Similarity: Measures the cosine of the angle between two vectors.
  • Used to determine the similarity between documents.
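  • Value ranges from -1 (opposite direction) to 1 (same direction); 0 means the vectors are orthogonal. For non-negative vectors such as Bag of Words or TF-IDF, the value lies between 0 and 1.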
  • Example:

      from sklearn.metrics.pairwise import cosine_similarity

      vector1 = [[1, 0, 1]]
      vector2 = [[0, 1, 1]]
      similarity = cosine_similarity(vector1, vector2)
      print(similarity)  # [[0.5]]
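
To see where the number comes from, the measure can be computed by hand as the dot product of the two vectors divided by the product of their lengths. The sketch below uses only the standard library. cosine is a hypothetical helper name, and it assumes neither vector is all zeros (otherwise the division fails).

```python
import math


def cosine(u, v):
    # Dot product of the two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    # Euclidean length (norm) of each vector.
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Assumes both norms are non-zero.
    return dot / (norm_u * norm_v)


# Same vectors as the sklearn example: 1 / (sqrt(2) * sqrt(2)) = 0.5
print(round(cosine([1, 0, 1], [0, 1, 1]), 3))  # 0.5

# Vectors from the preprocessing example: they share only 'languag'.
text1 = [1, 1, 1, 0, 0, 0]
text2 = [0, 0, 1, 1, 1, 1]
score = cosine(text1, text2)
print(round(score, 3))  # 0.289
```

The two example texts score low because they have only one vocabulary term in common.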