DataFrame Operations

Index Operations

  • Index is used to label rows in DataFrame.
  • Can be set with set_index and reset with reset_index.
  • Used to select data and perform operations.
  • Example:
    1. df.set_index("name", inplace=True) print(df) df.reset_index(inplace=True) # drop=True print(df)

Select and Indexing

  • df["col"] / df.col is used to select a single column from dataframe.
  • df[["c1", "c2", ..]] is used to select multiple columns from dataframe.
  • df.loc[] selects data based on label using value, list or slice.
  • df.iloc[] is used to select data based on position.
  • Example:
    1. age_series = df["age"] # Select Age column age_loc = df.loc[:, "age"] # Select age column age_iloc = df.iloc[:, 2] # Select column at index 2 sample_df = df[["name", "age"]] # Select name and age columns sample_loc = df.loc[:, ["name", "age"]] # Select name and age columns sample_iloc = df.iloc[:, [0, 2]] # Select columns at index 0 and 2 age_loc_samp = df.loc[0:5, "age"] # First 5 rows of age column age_iloc_samp = df.iloc[0:5, 2] # First 5 rows of column at index 2 age_samp = df["age"][0:5] # First 5 rows of age column sample_data = df.loc[0:5, ["name", "age"]] # First 5 rows of name, age

Filtering

  • Used to select rows based on condition.
  • Filter mask is created & applied on DataFrame to get filtered result.
  • Conditions can be combined with & (and), | (or), ~ (not) operators.
  • Example:
    1. age_filter = df["age"] > 30 filtered_df = df[age_filter] data_filter = (df["age"] > 30) & (df["weight"] < 70) filtered_df = df[data_filter] name_filter = df["name"].str.startswith("A") filtered_df = df[name_filter] country_filter = df["country"].isin(["USA", "Canada"]) filtered_df = df[~country_filter]

Broadcasting and Vectorized Operations

  • Pandas operations are vectorized i.e. they operate on entire arrays.
  • This allows to avoid explicit loops.
  • Example
    1. df["course"] = "Python" df["age_plus_one"] = df["age"] + 1 df["bmi"] = df["weight"] / df["height"] ** 2 df["is_above_30"] = df["age"] > 30 random_array = np.random.rand(len(df)) df["random_value"] = random_array

Sorting and Ranking

  • DataFrame can be sorted by one or more columns using sort_values.
  • Ranking can be done with rank() function.
  • Example
    1. sorted_df = df.sort_values(by="age", ascending=False) sorted_df = df.sort_values(by=["age", "weight"], ascending=[True, False]) ranked_df = df.copy() ranked_df["age_rank"] = ranked_df["age"].rank(method="min")

Drop and Rename

  • Columns and rows can be dropped with drop() function.
  • Columns can be renamed with rename() function.
  • Example
    1. df_dropped = df.drop(columns=["age", "weight"]) df_dropped = df.drop(["age", "weight"], axis=1) df_dropped = df.drop([0, 1]) df_renamed = df.rename(columns={"age": "Age", "weight": "Weight"})

concat

  • pd.concat() stacks dataframe by adding rows similar to union.
  • Example
    1. df1 = pd.read_csv("2025-01-02.csv") df2 = pd.read_csv("2025-01-03.csv") df3 = pd.read_csv("2025-01-04.csv") combined_df = pd.concat([df1, df2, df3]) print(df1.shape) print(df2.shape) print(df3.shape) print(combined_df.shape)

merge

  • pd.merge combines dataframes based on common columns. Like join.
  • Syntax: pd.merge(left_df, right_df, on="key_col", how="left")
  • how parameter values: left, right, inner, outer
  • left_on & right_on is used when dataframes have different column names.
  • Example: left_df = pd.DataFrame({"id": [1, 2], "name": ["Hari", "Bob"]}) right_df = pd.DataFrame({"id": [2, 3], "age": [25, 30]}) merged_df = pd.merge(left_df, right_df, on="id", how="left") print(merged_df)

Data Type Conversion

  • Data types of columns can be converted with astype() function.
  • Common conversions: int, float, str, category, datetime.
  • Example
    1. df["age"] = df["age"].astype(int) df["weight"] = df["weight"].astype(float) df["name"] = df["name"].astype(str) df["category_col"] = df["category_col"].astype("category") df["date_col"] = pd.to_datetime(df["date_col"])

GroupBy and Aggregation

  • For aggregating data by groups.
  • Syntax: df.groupby(["c1","c2"]).agg(c3=("c1","sum), c4=("c2","min")..)
  • Common functions: sum, mean, count, min, max, std, median, quantile, idxmin, idxmax, unique, nunique, cumsum, prod, mode, var etc.
  • Example: df["sales"].sum() df.groupby(["category", "product"], as_index=False).agg( total_sales=("sales", "sum") , avg_sales=("sales", "mean"))

Time Series Operations

  • Pandas provides powerful tools for working with time series data.
  • Index has to be set to datetime type for time series operations.
  • Sampling options: D (daily), W, MS, ME (month end), Q, YS, YE etc.
  • Example
    1. df["date"] = pd.to_datetime(df["date"]) df.set_index("date", inplace=True) monthly_sales = df.resample("ME").agg({"sales": "sum"}) df["rolling_avg"] = df["sales"].rolling(window=7).mean()

Applying string and datetime functions

  • string functions is applied on string columns with .str accessor.
  • datetime functions is applied on datetime columns using .dt.
  • Example:
    1. df["name"] = df["name"].str.upper() df["name"] = df["name"].str.strip() df["name"] = df["name"].str.replace("old", "new") df["year"] = df["date"].dt.year df["day"] = df["date"].dt.day df["day_of_week"] = df["date"].dt.day_name()