Pandas: Loading Data & EDA

Introduction to Pandas

  • Library for loading, analyzing, transforming and cleaning data.
  • External Library. Installed as: pip install pandas
  • Basic Component: DataFrame => 2D labelled data structure like table.
  • Different column of dataframe can contain different data types.
  • Operations like slice, groupby, join etc. can be performed on dataframe.

Loading CSV with Pandas

  • CSV file can be loaded into a DataFrame.
  • Use pd.read_csv() function with appropriate parameters.
  • Example
    1. import pandas as pd # Load CSV file into DataFrame df = pd.read_csv(r"folder_1/folder_2/file.csv") # Display few rows of the DataFrame print(df.head()) print(df.tail(2)) print(df.sample(5)) prin(df.columns)

arguments of read_csv

  • pd.read_csv() accepts various arguments to customize the loading process.
Common Arguments of pd.read_csv() Function:
ArgumentDescriptionExample Usage
sepDelimiter to use for separating values. Default ,pd.read_csv("file.csv", sep="|")
nrowsNumber of rows to read from the file.pd.read_csv("file.csv", nrows=100)
skiprowsRow numbers to skip at the beginning of the file.pd.read_csv("file.csv", skiprows=5)
headerRow number to use as the column names.pd.read_csv("file.csv", header=0)
namesList of column names to use.pd.read_csv("file.csv", names=["col1", "col2"])
usecolsColumns to include in the DataFrame.pd.read_csv("file.csv", usecols=["col1", "col2"])
encodingEncoding of the CSV file.pd.read_csv("file.csv", encoding="utf-8")
Common arguments of pd.read_csv() function and their descriptions

Loading Excel with Pandas

  • Excel file can be loaded into a DataFrame.
  • Use pd.read_excel() function with appropriate parameters.
  • Dependency library openpyxl. Install as pip install openpyxl.
  • Example
    1. import pandas as pd df = pd.read_excel(r"folder_1/folder_2/file.xlsx") # Loads First Sheet df = pd.read_excel(r"file_path", sheet_name="s_1") # Loads Specific Sheet df = pd.read_excel(r"file_path", sheet_name=None) # Loads All Sheets

Loading Data from SQL with Pandas

  • Use pd.read_sql_query() function with appropriate parameters.
  • Dependency library sqlalchemy. Installed as pip install sqlalchemy.
  • Connection needs to be created with connection_string to fetch data.
  • Connection string format: dialect+driver://username:password@host:port/database
  • Example
    1. import pandas as pd from sqlalchemy import create_engine engine = create_engine(connection_string) df = pd.read_sql_query("SELECT * FROM table", engine) print(df.head()) engine.dispose()

Common Connection String

Common Connection Strings for SQL Databases:
DatabaseConnection String Format
Sqlitesqlite:///path_to_db.db
PostgreSQLpostgresql+psycopg2://username:password@host:port/database
MySQLmysql+pymysql://username:password@host:port/database
SQL Servermssql+pyodbc://username:password@host:port/database?driver=ODBC+Driver+17+for+SQL+Server
Oracleoracle+cx_oracle://username:password@host:port/database
Common connection strings for SQL databases and their formats

Creating dataframe from python data

  • DataFrame can also be created from python data structures like list, dict etc.
  • Example
    1. import pandas as pd # From List data = [[1, "Alice"], [2, "Bob"], [3, "Charlie"]] df = pd.DataFrame(data, columns=["ID", "Name"]) print(df) # From Dictionary data = {"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]} df = pd.DataFrame(data) print(df) # From List of Dictionaries data = [{"ID": 1, "Name": "Alice"}, {"ID": 2, "Name": "Bob"}] df = pd.DataFrame(data) print(df)

Exploratory Data Analysis (EDA) with Pandas

  • Process of summarizing main characteristics of data.
  • Pandas provides various functions for EDA.
Common Pandas Functions for EDA:
FunctionDescriptionExample
.shapeReturns number of rows and columns in DataFramedf.shape
.head()Returns the first n rows of the DataFrame.df.head(5)
.tail()Returns the last n rows of the DataFrame.df.tail(2)
.sample()Returns sample n rows of the DataFramedf.sample(4)
.columnsReturns the column labels of the DataFrame.df.columns
.dtypesReturns the data types of each column.df.dtypes
.info()Provides summary of DataFrame including data types and non-null counts.df.info()
.describe()Generates descriptive statistics for numeric columns.df.describe()
.value_counts()Counts unique values in a column. [normalize, dropna]df["column"].value_counts()
.isnull()Detects missing values in the DataFrame.df.isnull().sum()
.duplicated()Detects duplicate values in the DataFramedf.duplicated().sum()
.corr()Computes pairwise correlation of numeric columns.df.corr()
Common pandas functions for exploratory data analysis (EDA) and their descriptions

Data Profiling Report using ydata-profiling

  • Comprehensive data profiling report can be generated using ydata-profiling
  • Generates interactive HTML report with detailed insights about the dataset.
  • External Library. Installed as pip install ydata-profiling
  • Generates html report for data quality, distribution, correlations, data issues.
  • Have trouble working on newer version of python.
  • Example
    1. from ydata_profiling import ProfileReport df = pd.read_csv(r"path_to_your_data") profile = ProfileReport(df, title="Data Profiling Report") profile.to_file("report.html")

Pandas Series

  • One-dimensional labeled array capable of holding any data type.
  • NumPy operations are applicable on it. Operation is element wise
  • Example
    1. import pandas as pd df = pd.read_csv(r"base_ball_data_path") age_series = df["age"] filtered_age = age_series[age_series < 22] print(filtered_age) print(filtered_age.mean()) bmi_series = df["weight"] / df["height] ** 2 print(bmi_series + 1)

Writting DataFrame to file, database

  • Once the data operations like filter, cleaning is done, we need to save the result.
  • DataFrame can be saved into csv, excel, database etc. for using later.
  • Example
    1. df.to_csv("filtered_data.csv") # Write df into csv with given name df.to_csv("filtered_pipe.psv", sep="|") # Save df as pipe separated values df.to_excel("data.xlsx", index=None) # Save as excel file without index df.to_sql("table_name", engine, if_exists="replace") # Save to SQL. Other Values: fail, append