Data Cleaning with Pandas

Common Data Problems

Common Data Quality Issues in Real-World Datasets:

Issue	Resolution
Inconsistent Column Names	Standardize names using `rename` / load with `names` parameter.
Missing Values (`NULL`, `Blanks`)	Use `dropna` to remove rows / `fillna` to fill them.
Duplicates	Use `drop_duplicates` to remove duplicate rows.
Inconsistent Categories. `F`, `Fe`	Standardize category names using `replace` or `map`.
Incorrect Data Types	Use `astype` to convert columns to correct data types.
Outliers	Identify outliers and `remove with filter` / `clip`.
Incorrect data format	`.melt`, `.pivot`
Spelling Mistakes	Fuzzy matching
Inconsistent Units	Unit conversion to standardize measurements.
Case Sensitivity	Convert text to lower or upper case for consistency.
Business Rule. `TxnDate > Created`	Fix / Remove data to align with business rule
Leading and Trailing Spaces	Use `strip()` to remove leading and trailing spaces.
Non Numeric Character in numeric column	Use regex to clean non-numeric characters.

Common data quality issues in real-world datasets, their descriptions, and potential resolutions

Standard way of organizing data to facilitate analysis.
Each variable forms a column, each observation forms a row, and each type of observational unit forms a table.

Example of Untidy Data

product	year_2020_sales	year_2021_sales
Product A	100	150
Product B	200	250

Example of untidy data where sales for different years are in separate columns.

Example of Tidy Data

Example of tidy data where each variable (year, sales) forms a column and each observation forms a row.

Tidy data is better for analysis, untidy is better for reporting.
melt converts wide format data to long format. i.e tidy format
Number of rows increases and number of columns decreases.
Syntax: pd.melt(df, id_vars=["id_1", "id_2"], value_vars=["val_1", "val_2"])
Example: import pandas as pd df = pd.DataFrame({ "product": ["A", "B"], "year_2020_sales": [100, 200], "year_2021_sales": [150, 250] }) melted_df = pd.melt(df, id_vars=["product"]) print(melted_df)

Pivoting is the reverse of melting.
It converts long format data to wide format.
Number of rows decreases and number of columns increases.
Aggregation for duplicate is average. Can be customized with aggfunc.
Syntax: df.pivot_table(index=["id_1", "id_2"], columns="variable", values="value")
Example: import pandas as pd df = pd.DataFrame({ "product": ["A", "A", "B", "B"], "year": [2020, 2021, 2020, 2021], "sales": [100, 150, 200, 250] }) p_df = df.pivot_table(index="product", columns="year", values="sales") print(p_df)

Example

bins = [0, 18, 35, 60] labels = ["Child", "Adult", "Senior"] df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels) print(df) labels = ["Young", "Middle-aged", "Old"] df["age_quantile"] = pd.qcut(df["age"], q=3, labels=labels) print(df)

Common Strategies for Handling Missing Values

Example: df["course_clean"] = df["course"].fillna("Unknown") df["age_clean"] = df["age"].fillna(df["age"].mean()) df["experience_clean"] = df["experience"].ffill() # or bfill filter_mask = (df["education"].isnull()) & (df["age"] < 25) df.loc[filter_mask, "education"] = "SLC" df.dropna(subset=["name"], inplace=True)

Example

duplicate_mask = df.duplicated(subset=["name", "age"], keep=False) duplicates = df[duplicate_mask] print(duplicates) # Remove duplicates based on name and age columns unique_df = df.drop_duplicates(subset=["name", "age"], keep="first") print(unique_df.duplicated(subset=["name", "age"]).sum()

Inconsistent Categories (map/replace)

df["gender"] = df["gender"].replace({"m": "Male", "f": "Female"}) mapping_dict = {"m": "Male", "f": "Female"} df["gender"] = df["gender"].map(mapping_dict)

Outliers (clip/drop)

df["age"] = df["age"].clip(lower=0, upper=120) outlier_mask = (df["age"] < 0) | (df["age"] > 120) df = df[~outlier_mask]

Leading and Trailing Spaces (strip)

Inconsistent Case

df["name"] = df["name"].str.upper() # or df["name"] = df["name"].str.lower()

Non Numeric Character in numeric column (regex)

Example

from fuzzywuzzy import process choices = ["New York", "Kathmandu", "Delhi"] query = "New yolk" best_match = process.extractOne(query, choices) print(best_match)