Pandas: Loading Data & EDA

Introduction to Pandas

Example

import pandas as pd # Load CSV file into DataFrame df = pd.read_csv(r"folder_1/folder_2/file.csv") # Display few rows of the DataFrame print(df.head()) print(df.tail(2)) print(df.sample(5)) prin(df.columns)

Common Arguments of pd.read_csv() Function:

Argument	Description	Example Usage
sep	Delimiter to use for separating values. Default ,	`pd.read_csv("file.csv", sep="\|")`
nrows	Number of rows to read from the file.	`pd.read_csv("file.csv", nrows=100)`
skiprows	Row numbers to skip at the beginning of the file.	`pd.read_csv("file.csv", skiprows=5)`
header	Row number to use as the column names.	`pd.read_csv("file.csv", header=0)`
names	List of column names to use.	`pd.read_csv("file.csv", names=["col1", "col2"])`
usecols	Columns to include in the DataFrame.	`pd.read_csv("file.csv", usecols=["col1", "col2"])`
encoding	Encoding of the CSV file.	`pd.read_csv("file.csv", encoding="utf-8")`

Common arguments of pd.read_csv() function and their descriptions

Example

import pandas as pd df = pd.read_excel(r"folder_1/folder_2/file.xlsx") # Loads First Sheet df = pd.read_excel(r"file_path", sheet_name="s_1") # Loads Specific Sheet df = pd.read_excel(r"file_path", sheet_name=None) # Loads All Sheets

Use pd.read_sql_query() function with appropriate parameters.
Dependency library sqlalchemy. Installed as pip install sqlalchemy.
Connection needs to be created with connection_string to fetch data.
Connection string format: dialect+driver://username:password@host:port/database

Example

import pandas as pd from sqlalchemy import create_engine engine = create_engine(connection_string) df = pd.read_sql_query("SELECT * FROM table", engine) print(df.head()) engine.dispose()

Common Connection Strings for SQL Databases:

Database	Connection String Format
Sqlite	`sqlite:///path_to_db.db`
PostgreSQL	`postgresql+psycopg2://username:password@host:port/database`
MySQL	`mysql+pymysql://username:password@host:port/database`
SQL Server	`mssql+pyodbc://username:password@host:port/database?driver=ODBC+Driver+17+for+SQL+Server`
Oracle	`oracle+cx_oracle://username:password@host:port/database`

Common connection strings for SQL databases and their formats

DataFrame can also be created from python data structures like list, dict etc.

Example

import pandas as pd # From List data = [[1, "Alice"], [2, "Bob"], [3, "Charlie"]] df = pd.DataFrame(data, columns=["ID", "Name"]) print(df) # From Dictionary data = {"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]} df = pd.DataFrame(data) print(df) # From List of Dictionaries data = [{"ID": 1, "Name": "Alice"}, {"ID": 2, "Name": "Bob"}] df = pd.DataFrame(data) print(df)

Common Pandas Functions for EDA:

Function	Description	Example
.shape	Returns number of rows and columns in DataFrame	`df.shape`
.head()	Returns the first n rows of the DataFrame.	`df.head(5)`
.tail()	Returns the last n rows of the DataFrame.	`df.tail(2)`
.sample()	Returns sample n rows of the DataFrame	`df.sample(4)`
.columns	Returns the column labels of the DataFrame.	`df.columns`
.dtypes	Returns the data types of each column.	`df.dtypes`
.info()	Provides summary of DataFrame including data types and non-null counts.	`df.info()`
.describe()	Generates descriptive statistics for numeric columns.	`df.describe()`
.value_counts()	Counts unique values in a column. [normalize, dropna]	`df["column"].value_counts()`
.isnull()	Detects missing values in the DataFrame.	`df.isnull().sum()`
.duplicated()	Detects duplicate values in the DataFrame	`df.duplicated().sum()`
.corr()	Computes pairwise correlation of numeric columns.	`df.corr()`

Common pandas functions for exploratory data analysis (EDA) and their descriptions

Comprehensive data profiling report can be generated using ydata-profiling
Generates interactive HTML report with detailed insights about the dataset.
External Library. Installed as pip install ydata-profiling
Generates html report for data quality, distribution, correlations, data issues.
Have trouble working on newer version of python.

Example

from ydata_profiling import ProfileReport df = pd.read_csv(r"path_to_your_data") profile = ProfileReport(df, title="Data Profiling Report") profile.to_file("report.html")

Example

import pandas as pd df = pd.read_csv(r"base_ball_data_path") age_series = df["age"] filtered_age = age_series[age_series < 22] print(filtered_age) print(filtered_age.mean()) bmi_series = df["weight"] / df["height] ** 2 print(bmi_series + 1)

Once the data operations like filter, cleaning is done, we need to save the result.
DataFrame can be saved into csv, excel, database etc. for using later.

Example

df.to_csv("filtered_data.csv") # Write df into csv with given name df.to_csv("filtered_pipe.psv", sep="|") # Save df as pipe separated values df.to_excel("data.xlsx", index=None) # Save as excel file without index df.to_sql("table_name", engine, if_exists="replace") # Save to SQL. Other Values: fail, append