Data Cleaning with Pandas
✕Common Data Problems
Common Data Quality Issues in Real-World Datasets:
| Issue | Resolution |
|---|---|
| Inconsistent Column Names | Standardize names using rename / load with names parameter. |
Missing Values (NULL, Blanks) | Use dropna to remove rows / fillna to fill them. |
| Duplicates | Use drop_duplicates to remove duplicate rows. |
Inconsistent Categories. F, Fe | Standardize category names using replace or map. |
| Incorrect Data Types | Use astype to convert columns to correct data types. |
| Outliers | Identify outliers and remove with filter / clip. |
| Incorrect data format | .melt, .pivot |
| Spelling Mistakes | Fuzzy matching |
| Inconsistent Units | Unit conversion to standardize measurements. |
| Case Sensitivity | Convert text to lower or upper case for consistency. |
Business Rule. TxnDate > Created | Fix / Remove data to align with business rule |
| Leading and Trailing Spaces | Use strip() to remove leading and trailing spaces. |
| Non Numeric Character in numeric column | Use regex to clean non-numeric characters. |
Common data quality issues in real-world datasets, their descriptions, and potential resolutions
Tidy Data
- Standard way of organizing data to facilitate analysis.
- Each variable forms a column, each observation forms a row, and each type of observational unit forms a table.
Example of Untidy Data
| product | year_2020_sales | year_2021_sales |
|---|---|---|
| Product A | 100 | 150 |
| Product B | 200 | 250 |
Example of untidy data where sales for different years are in separate columns.
Example of Tidy Data
| product | year | sales |
|---|---|---|
| Product A | 2020 | 100 |
| Product A | 2021 | 150 |
| Product B | 2020 | 200 |
| Product B | 2021 | 250 |
Example of tidy data where each variable (year, sales) forms a column and each observation forms a row.
melt
- Tidy data is better for analysis, untidy is better for reporting.
- melt converts wide format data to long format. i.e
tidy format - Number of rows increases and number of columns decreases.
- Syntax:
pd.melt(df, id_vars=["id_1", "id_2"], value_vars=["val_1", "val_2"]) - Example:
import pandas as pd df = pd.DataFrame({ "product": ["A", "B"], "year_2020_sales": [100, 200], "year_2021_sales": [150, 250] }) melted_df = pd.melt(df, id_vars=["product"]) print(melted_df)
pivot_table
- Pivoting is the reverse of melting.
- It converts long format data to wide format.
- Number of rows decreases and number of columns increases.
- Aggregation for duplicate is average. Can be customized with
aggfunc. - Syntax:
df.pivot_table(index=["id_1", "id_2"], columns="variable", values="value") - Example:
import pandas as pd df = pd.DataFrame({ "product": ["A", "A", "B", "B"], "year": [2020, 2021, 2020, 2021], "sales": [100, 150, 200, 250] }) p_df = df.pivot_table(index="product", columns="year", values="sales") print(p_df)
cut and qcut
cut: Segment and sort data values into bins.qcut: Segment & sort data values into equal-sized bins based on quantiles.bins = [0, 18, 35, 60] labels = ["Child", "Adult", "Senior"] df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels) print(df) labels = ["Young", "Middle-aged", "Old"] df["age_quantile"] = pd.qcut(df["age"], q=3, labels=labels) print(df)
Example
Handling Missing Value
- Dropping rows/columns with missing values.
- Filling missing values with statistics (mean, median, mode).
- Use techniques like interpolation, prediction to estimate missing values.
- Example:
df["course_clean"] = df["course"].fillna("Unknown") df["age_clean"] = df["age"].fillna(df["age"].mean()) df["experience_clean"] = df["experience"].ffill()# or bfillfilter_mask = (df["education"].isnull()) & (df["age"] < 25) df.loc[filter_mask, "education"] = "SLC" df.dropna(subset=["name"], inplace=True)
Common Strategies for Handling Missing Values
Handling Duplicates
- Duplicate rows can be identified with
duplicated()function. - Duplicate rows can be removed with
drop_duplicates()function. duplicate_mask = df.duplicated(subset=["name", "age"], keep=False) duplicates = df[duplicate_mask] print(duplicates) # Remove duplicates based on name and age columns unique_df = df.drop_duplicates(subset=["name", "age"], keep="first") print(unique_df.duplicated(subset=["name", "age"]).sum()
Example
Data Value issues
df["gender"] = df["gender"].replace({"m": "Male", "f": "Female"}) mapping_dict = {"m": "Male", "f": "Female"} df["gender"] = df["gender"].map(mapping_dict)df["age"] = df["age"].clip(lower=0, upper=120) outlier_mask = (df["age"] < 0) | (df["age"] > 120) df = df[~outlier_mask]df["name"] = df["name"].str.strip()df["name"] = df["name"].str.upper()# ordf["name"] = df["name"].str.lower()df["age"] = df["age"].str.replace(r"\D", "", regex=True).astype(int)
Inconsistent Categories (map/replace)
Outliers (clip/drop)
Leading and Trailing Spaces (strip)
Inconsistent Case
Non Numeric Character in numeric column (regex)
Fuzzy Matching to clean Spelling issue
- Technique to find strings that are approximately equal instead of exact.
- Match result is based on similarity score.
- The
fuzzywuzzylibrary can be used for fuzzy matching in Python. from fuzzywuzzy import process choices = ["New York", "Kathmandu", "Delhi"] query = "New yolk" best_match = process.extractOne(query, choices) print(best_match)
Example
