Pandas: Loading Data & EDA
✕Introduction to Pandas
- Library for loading, analyzing, transforming and cleaning data.
- External Library. Installed as:
pip install pandas - Basic Component: DataFrame => 2D labelled data structure like table.
- Different column of dataframe can contain different data types.
- Operations like
slice,groupby,joinetc. can be performed on dataframe.
Loading CSV with Pandas
- CSV file can be loaded into a DataFrame.
- Use
pd.read_csv()function with appropriate parameters. import pandas as pd # Load CSV file into DataFrame df = pd.read_csv(r"folder_1/folder_2/file.csv") # Display few rows of the DataFrame print(df.head()) print(df.tail(2)) print(df.sample(5)) prin(df.columns)
Example
arguments of read_csv
pd.read_csv()accepts various arguments to customize the loading process.
Common Arguments of
pd.read_csv() Function:| Argument | Description | Example Usage |
|---|---|---|
| sep | Delimiter to use for separating values. Default , | pd.read_csv("file.csv", sep="|") |
| nrows | Number of rows to read from the file. | pd.read_csv("file.csv", nrows=100) |
| skiprows | Row numbers to skip at the beginning of the file. | pd.read_csv("file.csv", skiprows=5) |
| header | Row number to use as the column names. | pd.read_csv("file.csv", header=0) |
| names | List of column names to use. | pd.read_csv("file.csv", names=["col1", "col2"]) |
| usecols | Columns to include in the DataFrame. | pd.read_csv("file.csv", usecols=["col1", "col2"]) |
| encoding | Encoding of the CSV file. | pd.read_csv("file.csv", encoding="utf-8") |
Common arguments of
pd.read_csv() function and their descriptionsLoading Excel with Pandas
- Excel file can be loaded into a DataFrame.
- Use
pd.read_excel()function with appropriate parameters. - Dependency library openpyxl. Install as
pip install openpyxl. import pandas as pd df = pd.read_excel(r"folder_1/folder_2/file.xlsx")# Loads First Sheetdf = pd.read_excel(r"file_path", sheet_name="s_1")# Loads Specific Sheetdf = pd.read_excel(r"file_path", sheet_name=None)# Loads All Sheets
Example
Loading Data from SQL with Pandas
- Use
pd.read_sql_query()function with appropriate parameters. - Dependency library sqlalchemy. Installed as
pip install sqlalchemy. - Connection needs to be created with connection_string to fetch data.
- Connection string format:
dialect+driver://username:password@host:port/database import pandas as pd from sqlalchemy import create_engine engine = create_engine(connection_string) df = pd.read_sql_query("SELECT * FROM table", engine) print(df.head()) engine.dispose()
Example
Common Connection String
Common Connection Strings for SQL Databases:
| Database | Connection String Format |
|---|---|
| Sqlite | sqlite:///path_to_db.db |
| PostgreSQL | postgresql+psycopg2://username:password@host:port/database |
| MySQL | mysql+pymysql://username:password@host:port/database |
| SQL Server | mssql+pyodbc://username:password@host:port/database?driver=ODBC+Driver+17+for+SQL+Server |
| Oracle | oracle+cx_oracle://username:password@host:port/database |
Common connection strings for SQL databases and their formats
Creating dataframe from python data
- DataFrame can also be created from python data structures like
list,dictetc. import pandas as pd # From List data = [[1, "Alice"], [2, "Bob"], [3, "Charlie"]] df = pd.DataFrame(data, columns=["ID", "Name"]) print(df) # From Dictionary data = {"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]} df = pd.DataFrame(data) print(df) # From List of Dictionaries data = [{"ID": 1, "Name": "Alice"}, {"ID": 2, "Name": "Bob"}] df = pd.DataFrame(data) print(df)
Example
Exploratory Data Analysis (EDA) with Pandas
- Process of summarizing main characteristics of data.
- Pandas provides various functions for EDA.
Common Pandas Functions for EDA:
| Function | Description | Example |
|---|---|---|
| .shape | Returns number of rows and columns in DataFrame | df.shape |
| .head() | Returns the first n rows of the DataFrame. | df.head(5) |
| .tail() | Returns the last n rows of the DataFrame. | df.tail(2) |
| .sample() | Returns sample n rows of the DataFrame | df.sample(4) |
| .columns | Returns the column labels of the DataFrame. | df.columns |
| .dtypes | Returns the data types of each column. | df.dtypes |
| .info() | Provides summary of DataFrame including data types and non-null counts. | df.info() |
| .describe() | Generates descriptive statistics for numeric columns. | df.describe() |
| .value_counts() | Counts unique values in a column. [normalize, dropna] | df["column"].value_counts() |
| .isnull() | Detects missing values in the DataFrame. | df.isnull().sum() |
| .duplicated() | Detects duplicate values in the DataFrame | df.duplicated().sum() |
| .corr() | Computes pairwise correlation of numeric columns. | df.corr() |
Common pandas functions for exploratory data analysis (EDA) and their descriptions
Data Profiling Report using ydata-profiling
- Comprehensive data profiling report can be generated using
ydata-profiling - Generates interactive HTML report with detailed insights about the dataset.
- External Library. Installed as
pip install ydata-profiling - Generates html report for data quality, distribution, correlations, data issues.
- Have trouble working on newer version of python.
from ydata_profiling import ProfileReport df = pd.read_csv(r"path_to_your_data") profile = ProfileReport(df, title="Data Profiling Report") profile.to_file("report.html")
Example
Pandas Series
- One-dimensional labeled array capable of holding any data type.
- NumPy operations are applicable on it. Operation is element wise
import pandas as pd df = pd.read_csv(r"base_ball_data_path") age_series = df["age"] filtered_age = age_series[age_series < 22] print(filtered_age) print(filtered_age.mean()) bmi_series = df["weight"] / df["height] ** 2 print(bmi_series + 1)
Example
Writting DataFrame to file, database
- Once the data operations like filter, cleaning is done, we need to save the result.
- DataFrame can be saved into
csv,excel,databaseetc. for using later. df.to_csv("filtered_data.csv")# Write df into csv with given namedf.to_csv("filtered_pipe.psv", sep="|")# Save df as pipe separated valuesdf.to_excel("data.xlsx", index=None)# Save as excel file without indexdf.to_sql("table_name", engine, if_exists="replace")# Save to SQL. Other Values:fail,append
Example
