Data Cleaning with Pandas

  1. Data Transformation

    Question 1 of 4

    • Load corona_data.csv into a pandas DataFrame. Transform the data so that it has the columns Date, Country, TotalCases, and TotalDeaths. Save the result to the CSV file corona_transformed.csv.
    • Load lab_reading.csv into a pandas DataFrame. Transform the data so that it has the columns Date, CO2, Rain, and Methane. Put each Reading value in the appropriate column and save the result to the CSV file lab_reading_transformed.csv. Note: In case of duplicates, take the minimum value.
    • Load treatment_info.csv into a pandas DataFrame. Transform the data so that it has the columns Date, Treatment Type, and Dosage. Save the result to the CSV file treatment_info_transformed.csv.
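The lab-reading transformation above is a long-to-wide reshape. The following is a minimal sketch, assuming the raw file has one row per reading with a date, a measurement name, and a value. The column names Measure and Reading and the sample values are hypothetical stand-ins, since the real layout of lab_reading.csv is not shown here.

```python
import pandas as pd

# Hypothetical long-format sample; the real lab_reading.csv layout may differ.
raw = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-01", "2024-01-01", "2024-01-01",
             "2024-01-02", "2024-01-02"],
    "Measure": ["CO2", "CO2", "Rain", "Methane", "CO2", "Rain"],
    "Reading": [410.0, 405.0, 2.5, 1.9, 412.0, 0.0],
})

# One row per Date, one column per measurement.
# aggfunc="min" resolves duplicate (Date, Measure) pairs by keeping the minimum.
wide = (
    raw.pivot_table(index="Date", columns="Measure", values="Reading", aggfunc="min")
       .reindex(columns=["CO2", "Rain", "Methane"])  # enforce the requested column order
       .reset_index()
)
wide.columns.name = None  # drop the leftover "Measure" label on the column index

# To persist the result:
# wide.to_csv("lab_reading_transformed.csv", index=False)
```

Dates that have no reading for a given measurement end up with a missing value in that column, which is usually what a wide layout should show.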
  2. Data Cleaning - Heart Disease Dataset

    Question 2 of 4

    • Load heart_disease_raw.csv into a DataFrame. Perform the following operations:
        1. Display the number of rows and columns in the dataset.
        2. Display the column names and their datatypes.
        3. Rename the column Heart_ stroke to Heart_Stroke.
        4. Display a sample of 15 records.
        5. Adjust the Gender column to contain only M, F, and null.
        6. Adjust education to contain only: Uneducate, Primary School, Graduate, Post Graduate, null.
        7. Adjust Exercise to contain only: null, daily, weekly, and monthly.
        8. Fill missing values in numeric columns with the column mean.
        9. Fill missing values in categorical columns with the most frequent value.
        10. Remove duplicate records.
        11. Replace outliers in numeric columns with the mean value.
        12. Ensure Gender, education, Exercise, prevalentStroke, and Heart_Stroke have the category dtype.
        13. Ensure the other columns also have appropriate datatypes.
        14. Save the cleaned data as heart_disease_cleaned.csv.
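Several of the steps above (renaming, normalizing Gender, filling missing values, removing duplicates, replacing outliers, and setting category dtypes) can be sketched on a small in-memory sample. This is an illustration under stated assumptions, not the expected solution: the numeric column age and all sample values are hypothetical, and the 1.5 × IQR rule for detecting outliers is an assumption because the task does not define what counts as an outlier.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for heart_disease_raw.csv.
df = pd.DataFrame({
    "Gender": ["Male", "f", "M", None, "female", "M", "M"],
    "age": [39.0, 46.0, np.nan, 61.0, 43.0, 250.0, 39.0],  # 250 is a planted outlier
    "Heart_ stroke": ["No", "yes", "No", "No", None, "No", "No"],
})

df = df.rename(columns={"Heart_ stroke": "Heart_Stroke"})

# Map spelling variants of Gender to M / F; unrecognised values become null.
gender_map = {"m": "M", "male": "M", "f": "F", "female": "F"}
df["Gender"] = df["Gender"].str.strip().str.lower().map(gender_map)

num_cols = df.select_dtypes(include="number").columns
cat_cols = df.columns.difference(num_cols)

# Numeric gaps get the column mean; categorical gaps get the most frequent value.
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

df = df.drop_duplicates().reset_index(drop=True)

# Assumed outlier rule: outside 1.5 * IQR from the quartiles.
# Outliers are replaced with the mean of the remaining (non-outlier) values.
for col in num_cols:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    is_outlier = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
    df.loc[is_outlier, col] = df.loc[~is_outlier, col].mean()

for col in ["Gender", "Heart_Stroke"]:
    df[col] = df[col].astype("category")
```

Using the mean of the non-outlier values as the replacement is a design choice; replacing with the overall mean (outliers included) is another reading of the task and gives a value pulled toward the outliers.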
  3. Data Cleaning - Baseball Player Dataset

    Question 3 of 4

    • Load Baseball player.txt into a DataFrame. Perform the following operations:
        1. While loading, assign the header from the file Baseball Player - Clean.txt.
        2. Display a sample of 15 records to understand the data.
        3. Handle missing values in numeric columns with the mean and in categorical columns with the mode.
        4. Remove duplicate records.
        5. Handle outliers in numeric columns by clipping to the 1st and 99th percentiles.
        6. Drop duplicate records.
        7. Create a new column height_cm by converting the existing inch value (1 inch = 2.54 cm).
        8. Bin players into age groups: 10-15, 15-20, and so on until the last value is included.
        9. Replace the underscore (_) in the position column with a space.
        10. Split the Name into FirstName, MiddleName, and LastName, and store them in three columns.
        11. For non-numeric values, check for casing and spelling mistakes. If there are any, fix them.
        12. Ensure all columns have appropriate datatypes.
        13. Save the cleaned data as tab-separated values in baseball_player_cleaned.tsv.
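The percentile clipping, unit conversion, age binning, underscore replacement, and name splitting steps above can be sketched as follows. The column names Name, Position, Height, Age, and AgeGroup, along with the sample rows, are hypothetical; the real header comes from Baseball Player - Clean.txt. The name split assumes the first token is the first name, the last token is the last name, and anything in between is the middle name.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; real column names come from the header file.
players = pd.DataFrame({
    "Name": ["Alex Reed", "Sam Lee Carter", "Jo Ann Marie Fox"],
    "Position": ["Catcher", "First_Baseman", "Starting_Pitcher"],
    "Height": [74.0, 72.0, 69.0],  # inches
    "Age": [22.99, 34.69, 41.2],
})

# Clip every numeric column to its 1st and 99th percentiles.
for col in players.select_dtypes(include="number").columns:
    low, high = players[col].quantile([0.01, 0.99])
    players[col] = players[col].clip(lower=low, upper=high)

players["height_cm"] = players["Height"] * 2.54

# Bin edges 10, 15, 20, ... extended until the maximum age is covered.
edges = np.arange(10, players["Age"].max() + 5, 5)
labels = [f"{int(lo)}-{int(hi)}" for lo, hi in zip(edges[:-1], edges[1:])]
players["AgeGroup"] = pd.cut(players["Age"], bins=edges, labels=labels,
                             include_lowest=True)

players["Position"] = players["Position"].str.replace("_", " ", regex=False)

# Split names on whitespace: first token, last token, and whatever lies between.
parts = players["Name"].str.split()
players["FirstName"] = parts.str[0]
players["LastName"] = parts.str[-1]
players["MiddleName"] = parts.apply(lambda p: " ".join(p[1:-1]) or None)

# To persist the result as tab-separated values:
# players.to_csv("baseball_player_cleaned.tsv", sep="\t", index=False)
```

With pd.cut's default right-closed intervals, an age of exactly 15 falls in the 10-15 group; pass right=False if the lower edge should be the inclusive one.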
  4. Answering from Baseball Player Dataset

    Question 4 of 4

    • Load baseball_player_cleaned.tsv into a DataFrame. Answer the following questions:
        1. Which player has the highest BMI? (BMI = weight / height_cm^2)
        2. Which team has the highest average player age?
        3. How many players are in each position category?
        4. What is the average weight of players in each age group?
        5. Which player has the longest name (in terms of characters)?
        6. How many players have an age above 30 and a weight below 70?
        7. What is the distribution of players across the different age groups?
        8. Which player has the highest number of characters in their name?
        9. How many players have Center as their position and are in the 20-25 age group?
        10. What is the average height of players in each position category?
        11. What are the average height and weight of U-23 baseball players?
        12. Display the most common position hired by each team.
        13. Display the average age for each team and position.
        14. Display MinHeight, MaxHeight, AverageHeight, MinWeight, MaxWeight, and AverageWeight for each position.
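Most of the questions above reduce to a filter, an idxmax lookup, or a groupby aggregation. The sketch below shows one pattern of each kind on a hypothetical sample. The column names Name, Team, Position, Weight, and Age are assumptions (only height_cm is named in the tasks), and the real file would be loaded with pd.read_csv("baseball_player_cleaned.tsv", sep="\t").

```python
import pandas as pd

# Hypothetical sample standing in for baseball_player_cleaned.tsv.
players = pd.DataFrame({
    "Name": ["Alex Reed", "Sam Carter", "Jo Fox", "Max Stone", "Lee Park"],
    "Team": ["Team A", "Team A", "Team A", "Team B", "Team B"],
    "Position": ["Catcher", "Catcher", "Outfielder", "Outfielder", "Outfielder"],
    "height_cm": [188.0, 183.0, 175.0, 190.0, 180.0],
    "Weight": [82.0, 95.0, 68.0, 88.0, 77.0],
    "Age": [22.0, 34.0, 31.0, 28.0, 25.0],
})

# Q1: highest BMI, using the formula given in the task (weight / height_cm^2).
bmi = players["Weight"] / players["height_cm"] ** 2
highest_bmi_player = players.loc[bmi.idxmax(), "Name"]

# Q2: team with the highest average player age.
oldest_team = players.groupby("Team")["Age"].mean().idxmax()

# Q3: number of players per position.
position_counts = players["Position"].value_counts()

# Q5 / Q8: player with the longest name, counted in characters.
longest_name = players.loc[players["Name"].str.len().idxmax(), "Name"]

# Q6: players older than 30 who weigh less than 70.
older_and_lighter = int(((players["Age"] > 30) & (players["Weight"] < 70)).sum())

# Q12: most common position per team (ties resolved by taking the first mode).
common_position = players.groupby("Team")["Position"].agg(lambda s: s.mode().iloc[0])

# Q14: named aggregation gives one labelled column per statistic.
position_stats = players.groupby("Position").agg(
    MinHeight=("height_cm", "min"),
    MaxHeight=("height_cm", "max"),
    AverageHeight=("height_cm", "mean"),
    MinWeight=("Weight", "min"),
    MaxWeight=("Weight", "max"),
    AverageWeight=("Weight", "mean"),
)
```

Note that idxmax returns only the first row when several rows tie for the maximum; filtering on equality with the maximum value would list every tied player instead.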