DataFrame Operations
✕Index Operations
- Index is used to label rows in DataFrame.
- Can be set with
set_indexand reset withreset_index. - Used to select data and perform operations.
df.set_index("name", inplace=True) print(df) df.reset_index(inplace=True) # drop=True print(df)
Example:
Select and Indexing
df["col"]/df.colis used to select a single column from dataframe.df[["c1", "c2", ..]]is used to select multiple columns from dataframe.df.loc[]selects data based on label usingvalue,listorslice.df.iloc[]is used to select data based on position.age_series = df["age"] # Select Age column age_loc = df.loc[:, "age"] # Select age column age_iloc = df.iloc[:, 2] # Select column at index 2 sample_df = df[["name", "age"]] # Select name and age columns sample_loc = df.loc[:, ["name", "age"]] # Select name and age columns sample_iloc = df.iloc[:, [0, 2]] # Select columns at index 0 and 2 age_loc_samp = df.loc[0:5, "age"] # First 5 rows of age column age_iloc_samp = df.iloc[0:5, 2] # First 5 rows of column at index 2 age_samp = df["age"][0:5] # First 5 rows of age column sample_data = df.loc[0:5, ["name", "age"]] # First 5 rows of name, age
Example:
Filtering
- Used to select rows based on condition.
- Filter mask is created & applied on DataFrame to get filtered result.
- Conditions can be combined with
&(and),|(or),~(not) operators. age_filter = df["age"] > 30 filtered_df = df[age_filter] data_filter = (df["age"] > 30) & (df["weight"] < 70) filtered_df = df[data_filter] name_filter = df["name"].str.startswith("A") filtered_df = df[name_filter] country_filter = df["country"].isin(["USA", "Canada"]) filtered_df = df[~country_filter]
Example:
Broadcasting and Vectorized Operations
- Pandas operations are vectorized i.e. they operate on entire arrays.
- This allows to avoid explicit loops.
df["course"] = "Python" df["age_plus_one"] = df["age"] + 1 df["bmi"] = df["weight"] / df["height"] ** 2 df["is_above_30"] = df["age"] > 30 random_array = np.random.rand(len(df)) df["random_value"] = random_array
Example
Sorting and Ranking
- DataFrame can be sorted by one or more columns using
sort_values. - Ranking can be done with
rank()function. sorted_df = df.sort_values(by="age", ascending=False) sorted_df = df.sort_values(by=["age", "weight"], ascending=[True, False]) ranked_df = df.copy() ranked_df["age_rank"] = ranked_df["age"].rank(method="min")
Example
Drop and Rename
- Columns and rows can be dropped with
drop()function. - Columns can be renamed with
rename()function. df_dropped = df.drop(columns=["age", "weight"]) df_dropped = df.drop(["age", "weight"], axis=1) df_dropped = df.drop([0, 1]) df_renamed = df.rename(columns={"age": "Age", "weight": "Weight"})
Example
concat
pd.concat()stacks dataframe by adding rows similar to union.df1 = pd.read_csv("2025-01-02.csv") df2 = pd.read_csv("2025-01-03.csv") df3 = pd.read_csv("2025-01-04.csv") combined_df = pd.concat([df1, df2, df3]) print(df1.shape) print(df2.shape) print(df3.shape) print(combined_df.shape)
Example
merge
pd.mergecombines dataframes based on common columns. Like join.- Syntax:
pd.merge(left_df, right_df, on="key_col", how="left") - how parameter values:
left,right,inner,outer left_on&right_onis used when dataframes have different column names.- Example:
left_df = pd.DataFrame({"id": [1, 2], "name": ["Hari", "Bob"]}) right_df = pd.DataFrame({"id": [2, 3], "age": [25, 30]}) merged_df = pd.merge(left_df, right_df, on="id", how="left") print(merged_df)
Data Type Conversion
- Data types of columns can be converted with
astype()function. - Common conversions:
int,float,str,category,datetime. df["age"] = df["age"].astype(int) df["weight"] = df["weight"].astype(float) df["name"] = df["name"].astype(str) df["category_col"] = df["category_col"].astype("category") df["date_col"] = pd.to_datetime(df["date_col"])
Example
GroupBy and Aggregation
- For aggregating data by groups.
- Syntax:
df.groupby(["c1","c2"]).agg(c3=("c1","sum), c4=("c2","min")..) - Common functions:
sum,mean,count,min,max,std,median,quantile,idxmin,idxmax,unique,nunique,cumsum,prod,mode,varetc. - Example:
df["sales"].sum() df.groupby(["category", "product"], as_index=False).agg( total_sales=("sales", "sum") , avg_sales=("sales", "mean"))
Time Series Operations
- Pandas provides powerful tools for working with time series data.
- Index has to be set to datetime type for time series operations.
- Sampling options:
D(daily),W,MS,ME(month end),Q,YS,YEetc. df["date"] = pd.to_datetime(df["date"]) df.set_index("date", inplace=True) monthly_sales = df.resample("ME").agg({"sales": "sum"}) df["rolling_avg"] = df["sales"].rolling(window=7).mean()
Example
Applying string and datetime functions
- string functions is applied on string columns with
.straccessor. - datetime functions is applied on datetime columns using
.dt. df["name"] = df["name"].str.upper() df["name"] = df["name"].str.strip() df["name"] = df["name"].str.replace("old", "new") df["year"] = df["date"].dt.year df["day"] = df["date"].dt.day df["day_of_week"] = df["date"].dt.day_name()
Example:
