Data Cleaning with Pandas

Common Data Problems

Common Data Quality Issues in Real-World Datasets:
IssueResolution
Inconsistent Column NamesStandardize names using rename / load with names parameter.
Missing Values (NULL, Blanks)Use dropna to remove rows / fillna to fill them.
DuplicatesUse drop_duplicates to remove duplicate rows.
Inconsistent Categories. F, FeStandardize category names using replace or map.
Incorrect Data TypesUse astype to convert columns to correct data types.
OutliersIdentify outliers and remove with filter / clip.
Incorrect data format.melt, .pivot
Spelling MistakesFuzzy matching
Inconsistent UnitsUnit conversion to standardize measurements.
Case SensitivityConvert text to lower or upper case for consistency.
Business Rule. TxnDate > CreatedFix / Remove data to align with business rule
Leading and Trailing SpacesUse strip() to remove leading and trailing spaces.
Non Numeric Character in numeric columnUse regex to clean non-numeric characters.
Common data quality issues in real-world datasets, their descriptions, and potential resolutions

Tidy Data

  • Standard way of organizing data to facilitate analysis.
  • Each variable forms a column, each observation forms a row, and each type of observational unit forms a table.
Example of Untidy Data
productyear_2020_salesyear_2021_sales
Product A100150
Product B200250
Example of untidy data where sales for different years are in separate columns.
Example of Tidy Data
productyearsales
Product A2020100
Product A2021150
Product B2020200
Product B2021250
Example of tidy data where each variable (year, sales) forms a column and each observation forms a row.

melt

  • Tidy data is better for analysis, untidy is better for reporting.
  • melt converts wide format data to long format. i.e tidy format
  • Number of rows increases and number of columns decreases.
  • Syntax: pd.melt(df, id_vars=["id_1", "id_2"], value_vars=["val_1", "val_2"])
  • Example: import pandas as pd df = pd.DataFrame({ "product": ["A", "B"], "year_2020_sales": [100, 200], "year_2021_sales": [150, 250] }) melted_df = pd.melt(df, id_vars=["product"]) print(melted_df)

pivot_table

  • Pivoting is the reverse of melting.
  • It converts long format data to wide format.
  • Number of rows decreases and number of columns increases.
  • Aggregation for duplicate is average. Can be customized with aggfunc.
  • Syntax: df.pivot_table(index=["id_1", "id_2"], columns="variable", values="value")
  • Example: import pandas as pd df = pd.DataFrame({ "product": ["A", "A", "B", "B"], "year": [2020, 2021, 2020, 2021], "sales": [100, 150, 200, 250] }) p_df = df.pivot_table(index="product", columns="year", values="sales") print(p_df)

cut and qcut

  • cut: Segment and sort data values into bins.
  • qcut: Segment & sort data values into equal-sized bins based on quantiles.
  • Example
    1. bins = [0, 18, 35, 60] labels = ["Child", "Adult", "Senior"] df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels) print(df) labels = ["Young", "Middle-aged", "Old"] df["age_quantile"] = pd.qcut(df["age"], q=3, labels=labels) print(df)

Handling Missing Value

    Common Strategies for Handling Missing Values
    1. Dropping rows/columns with missing values.
    2. Filling missing values with statistics (mean, median, mode).
    3. Use techniques like interpolation, prediction to estimate missing values.
  • Example: df["course_clean"] = df["course"].fillna("Unknown") df["age_clean"] = df["age"].fillna(df["age"].mean()) df["experience_clean"] = df["experience"].ffill() # or bfill filter_mask = (df["education"].isnull()) & (df["age"] < 25) df.loc[filter_mask, "education"] = "SLC" df.dropna(subset=["name"], inplace=True)

Handling Duplicates

  • Duplicate rows can be identified with duplicated() function.
  • Duplicate rows can be removed with drop_duplicates() function.
  • Example
    1. duplicate_mask = df.duplicated(subset=["name", "age"], keep=False) duplicates = df[duplicate_mask] print(duplicates) # Remove duplicates based on name and age columns unique_df = df.drop_duplicates(subset=["name", "age"], keep="first") print(unique_df.duplicated(subset=["name", "age"]).sum()

Data Value issues

    Inconsistent Categories (map/replace)
    1. df["gender"] = df["gender"].replace({"m": "Male", "f": "Female"}) mapping_dict = {"m": "Male", "f": "Female"} df["gender"] = df["gender"].map(mapping_dict)
    Outliers (clip/drop)
    1. df["age"] = df["age"].clip(lower=0, upper=120) outlier_mask = (df["age"] < 0) | (df["age"] > 120) df = df[~outlier_mask]
    Leading and Trailing Spaces (strip)
    1. df["name"] = df["name"].str.strip()
    Inconsistent Case
    1. df["name"] = df["name"].str.upper() # or df["name"] = df["name"].str.lower()
    Non Numeric Character in numeric column (regex)
    1. df["age"] = df["age"].str.replace(r"\D", "", regex=True).astype(int)

Fuzzy Matching to clean Spelling issue

  • Technique to find strings that are approximately equal instead of exact.
  • Match result is based on similarity score.
  • The fuzzywuzzy library can be used for fuzzy matching in Python.
  • Example
    1. from fuzzywuzzy import process choices = ["New York", "Kathmandu", "Delhi"] query = "New yolk" best_match = process.extractOne(query, choices) print(best_match)