Natural Language Processing
Introduction to NLP
- Enables machines to understand and generate human language.
- Use cases: sentiment analysis, translation, recommender systems, etc.
- Challenges: ambiguity, context understanding, language variability.
- Python's nltk library is used for NLP tasks. Install it with:

    pip install nltk

- nltk provides tools for tokenization, stemming, stopword removal, POS tagging, etc.
Text Preprocessing Steps
- Tokenization: Split text into words or sentences.
- Stemming: Reduce words to their root form (e.g. running => run).
- Stopword Removal: Remove common words that do not add much meaning (e.g. the, is).
- Vectorization: Convert text into numerical features using techniques like Bag of Words or TF-IDF.
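TF-IDF weighting can be sketched in plain Python. This is a minimal illustration, not nltk's or scikit-learn's implementation: the helper name `tf_idf`, the toy documents, and the specific variant (term frequency = count / document length, inverse document frequency = ln(N / df)) are assumptions chosen for readability. Libraries typically apply smoothing and normalization on top of this.

```python
import math
from collections import Counter

# Two toy documents, assumed to be already tokenized, stemmed,
# and stripped of stopwords.
docs = [
    ["python", "program", "languag"],
    ["love", "learn", "new", "languag"],
]

def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(documents)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

weights = tf_idf(docs)
for doc_weights in weights:
    print({term: round(value, 3) for term, value in doc_weights.items()})
```

Under this variant, a term that occurs in every document (here "languag") receives a weight of 0, because ln(N / df) = ln(1) = 0. Terms unique to one document keep a positive weight.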
Example of Text Preprocessing Techniques and their Effects:
| Step | Effect on Text 1 | Effect on Text 2 |
|---|---|---|
| Original Text | Python is a programming language. | I love learning new languages! |
| Tokenization | ['Python', 'is', 'a', 'programming', 'language'] | ['I', 'love', 'learning', 'new', 'languages'] |
| Stemming | ['python', 'is', 'a', 'program', 'languag'] | ['i', 'love', 'learn', 'new', 'languag'] |
| Stopword Removal | ['python', 'program', 'languag'] | ['love', 'learn', 'new', 'languag'] |
| Vocabulary Creation | ['python', 'program', 'languag', 'love', 'learn', 'new'] | |
| Vectorization | [1, 1, 1, 0, 0, 0] | [0, 0, 1, 1, 1, 1] |
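The last two rows of the table (vocabulary creation and vectorization) can be reproduced with a short Bag of Words sketch in plain Python. The helper names `build_vocabulary` and `bag_of_words` are hypothetical, and the input tokens are copied from the table after stopword removal rather than computed here.

```python
# Preprocessed tokens copied from the table (after stemming and stopword removal).
doc1 = ["python", "program", "languag"]
doc2 = ["love", "learn", "new", "languag"]

def build_vocabulary(documents):
    """Collect unique tokens in order of first appearance."""
    vocabulary = []
    for tokens in documents:
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)
    return vocabulary

def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary term occurs in the token list."""
    return [tokens.count(term) for term in vocabulary]

vocabulary = build_vocabulary([doc1, doc2])
vector1 = bag_of_words(doc1, vocabulary)
vector2 = bag_of_words(doc2, vocabulary)

print(vocabulary)  # ['python', 'program', 'languag', 'love', 'learn', 'new']
print(vector1)     # [1, 1, 1, 0, 0, 0]
print(vector2)     # [0, 0, 1, 1, 1, 1]
```

Both vectors share the same vocabulary, so position 3 ("languag") is the only index where both documents have a non-zero count.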
POS Tagging
- POS Tagging: Assign part-of-speech tags to each word (e.g. noun, verb).
- Helps in understanding grammatical structure and meaning.
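- nltk provides the pos_tag function for POS tagging.
- Example:

    from nltk import word_tokenize, pos_tag

    # Requires the tokenizer and tagger models, fetched once with nltk.download()
    # (e.g. "punkt" and "averaged_perceptron_tagger"; names vary by nltk version).
    tokens = word_tokenize("Python is great!")  # punctuation becomes its own token
    pos_tags = pos_tag(tokens)
    print(pos_tags)
    # [('Python', 'NNP'), ('is', 'VBZ'), ('great', 'JJ'), ('!', '.')]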
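Once tokens are tagged, the tags can be used to pick out words by grammatical role. The sketch below is a plain-Python illustration: the helper `words_with_tag_prefix` is hypothetical, and the tagged list is hard-coded in the (word, tag) format that pos_tag returns, so the example runs without downloading any nltk models. It relies on the Penn Treebank convention that noun tags start with NN, verb tags with VB, and adjective tags with JJ.

```python
# Tagged tokens in the (word, tag) format returned by nltk's pos_tag.
# Hard-coded so the sketch runs without nltk models; punctuation is tagged too.
tagged = [("Python", "NNP"), ("is", "VBZ"), ("great", "JJ"), ("!", ".")]

def words_with_tag_prefix(tagged_tokens, prefix):
    """Return the words whose Penn Treebank tag starts with the given prefix."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]

nouns = words_with_tag_prefix(tagged, "NN")       # NN, NNS, NNP, NNPS
verbs = words_with_tag_prefix(tagged, "VB")       # VB, VBD, VBG, VBN, VBP, VBZ
adjectives = words_with_tag_prefix(tagged, "JJ")  # JJ, JJR, JJS

print(nouns)       # ['Python']
print(verbs)       # ['is']
print(adjectives)  # ['great']
```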
Cosine Similarity
- Cosine Similarity: Measures the cosine of the angle between two vectors.
- Used to determine the similarity between documents.
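- Value ranges from -1 (opposite direction) to 1 (same direction); for non-negative vectors such as word counts or TF-IDF weights it lies between 0 and 1.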
- Example:

    from sklearn.metrics.pairwise import cosine_similarity

    vector1 = [[1, 0, 1]]
    vector2 = [[0, 1, 1]]
    similarity = cosine_similarity(vector1, vector2)  # returns a 1x1 matrix
    print(similarity)  # [[0.5]]
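To see what the measure computes, the formula cos(θ) = (a · b) / (|a| |b|) can be written out in plain Python. This is a minimal sketch, not scikit-learn's implementation. The helper name `cosine_sim` is hypothetical, and raising an error for zero vectors is a design choice made here for clarity. The second pair of vectors is the two Bag of Words vectors from the preprocessing table.

```python
import math

def cosine_sim(a, b):
    """cos(theta) = (a . b) / (|a| * |b|) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Same vectors as the scikit-learn example: dot = 1, |a| * |b| = 2.
small = cosine_sim([1, 0, 1], [0, 1, 1])
print(round(small, 3))  # 0.5

# Bag of Words vectors from the preprocessing table: dot = 1, |a| * |b| = sqrt(3) * 2.
text1 = [1, 1, 1, 0, 0, 0]
text2 = [0, 0, 1, 1, 1, 1]
table = cosine_sim(text1, text2)
print(round(table, 3))  # 0.289
```

The two table documents share only one stemmed term ("languag"), which is why their similarity is low but not zero.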
