Descriptive Statistics

Variable

  • A characteristic or attribute that can take on different values.
  • Basic unit of analysis in statistics and data science.
  • Example: In customer data, variables might include age, gender, income, and expenditure.
  • Types of Variables
    1. Numerical (Quantitative): Represent measurable quantities. Further divided into:
       - Discrete: Countable values (e.g., number of customers).
       - Continuous: Any value within a range (e.g., height, weight).
    2. Categorical (Qualitative): Represent categories or groups. Further divided into:
       - Nominal: No inherent order (e.g., gender, color).
       - Ordinal: Inherent order (e.g., customer review rating, T-shirt size).

Sample vs Population

  • Population: The entire group of instances about which we hope to learn.
  • Sample: A subset of the population that is used in statistical analysis.
  • We use samples because it is often impractical to collect data from an entire population.
  • Examples:
    - Selecting 1000 random people from a population of 1 million to estimate income.
    - Using a sample of 500 reviews to estimate the average customer satisfaction score.
  • Key point: The sample should be representative of the population to make valid inferences. This is often achieved through random sampling techniques.
  • Sampling bias can lead to inaccurate conclusions if the sample is not representative of the population. We will use stratified sampling when building ML models to reduce this bias.
Figure: Population vs Sample
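
The difference between simple random sampling and stratified sampling can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 900 "standard" and 100 "premium" customers is made up, and the helper names `simple_random_sample` and `stratified_sample` are hypothetical, not part of any library.

```python
import random
from collections import defaultdict


def simple_random_sample(population, k, seed=0):
    """Draw k items uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(population, k)


def stratified_sample(population, key, fraction, seed=0):
    """Draw the same fraction from every group (stratum) defined by key."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in population:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: 900 "standard" and 100 "premium" customers.
population = [("standard", i) for i in range(900)]
population += [("premium", i) for i in range(100)]

srs = simple_random_sample(population, 100)
strat = stratified_sample(population, key=lambda c: c[0], fraction=0.1)

premium_in_srs = sum(1 for c in srs if c[0] == "premium")
premium_in_strat = sum(1 for c in strat if c[0] == "premium")
print(premium_in_srs, premium_in_strat)  # the stratified count is always 10
```

The simple random sample may over- or under-represent the smaller group by chance, while the stratified sample keeps each group's share equal to its share of the population.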

Statistics

  • Science of collecting, analyzing, interpreting, and presenting data.
  • Types:
    1. Descriptive Statistics: Summarizes and describes the main features of a dataset (e.g., mean, median, mode, standard deviation).
    2. Inferential Statistics: Makes predictions or inferences about a population based on a sample of data (e.g., hypothesis testing, confidence intervals).
  • Used in data science to understand data distributions, relationships, and to make informed decisions based on data.
  • Example: A data scientist might use descriptive statistics to summarize customer purchase behavior and inferential statistics to determine if a new marketing strategy has significantly increased sales.

Central Tendency

  • Measures that represent the center or typical value of a dataset. Common measures:
  • Mean
    1. Arithmetic average of a set of values, i.e., mean X̄ = Σx / N.
    2. Takes into account all values, making it sensitive to outliers.
    3. For [2, 1, 2, 5, 100], X̄ = (2 + 1 + 2 + 5 + 100) / 5 = 110 / 5 = 22, which misrepresents the typical value.
    4. A single extreme data point can change the mean drastically.
    5. Best used for numerical data having symmetrical distributions without outliers.
  • Median
    1. Middle value when data is sorted.
    2. If the number of values is even, it is the average of the two middle values.
    3. Calculated on basis of position, not magnitude, making it robust to outliers.
    4. For [2, 1, 2, 5, 100], median = 2, a much better representative of the typical value.
    5. Best used for numerical data with skewed distributions or outliers.
  • Mode
    1. Most frequently occurring value(s) in a dataset.
    2. Can be used for both numerical and categorical data.
    3. For [2, 1, 2, 5, 100], mode = 2.
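
The three measures above can be checked with Python's standard `statistics` module. This is a minimal sketch using the example list from this section; the variable names are illustrative.

```python
import statistics

data = [2, 1, 2, 5, 100]  # example values from this section

mean_value = statistics.mean(data)      # arithmetic average, pulled up by 100
median_value = statistics.median(data)  # middle value of the sorted data
modes = statistics.multimode(data)      # all most-frequent values, as a list

print(mean_value, median_value, modes)  # 22 2 [2]
```

`multimode` is used instead of `mode` because a dataset can have more than one most-frequent value.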

Percentiles and Quartiles

  • Give information about the position of a value relative to the rest of the data.
  • The mean alone is an incomplete summary unless we know how the data is distributed around it.
  • We can't judge a score of 60 / 100 as good or bad without knowing how others scored.
  • Percentiles
    1. Values that divide a dataset into 100 equal parts.
    2. The p-th percentile is the value below which p% of the data falls.
    3. Example: A student who scores in the 90th percentile scored better than 90% of the other students.
  • Quartiles
    1. Values that divide a dataset into 4 equal parts.
    2. Q1 (25th percentile) is the value below which 25% of the data falls.
    3. Q2 (50th percentile) is the median.
    4. Q3 (75th percentile) is the value below which 75% of the data falls.
    5. The inter-quartile range (IQR) is the difference between Q3 and Q1, i.e., IQR = Q3 - Q1.
    6. IQR is a measure of variability that is robust to outliers.
    7. Data points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are commonly treated as outliers.
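
The quartiles and the 1.5 * IQR outlier rule can be sketched with the standard `statistics` module. The choice of `method="inclusive"` is an assumption made for this example; other quantile conventions give slightly different cut points on small datasets.

```python
import statistics

data = [2, 1, 2, 5, 100]

# method="inclusive" treats the data as the whole population of interest;
# other quantile conventions can give slightly different cut points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3, iqr)  # 2.0 2.0 5.0 3.0
print(outliers)         # [100]
```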

Dispersion

  • Measures that represent the spread or variability of data. Common measures:
  • Range / Peak To Peak
    1. Difference between the maximum and minimum values in a dataset.
    2. Highly sensitive to outliers.
    3. Best used for quick, rough estimate of variability but not for detailed analysis.
    4. Formula: Range = Max - Min. For [2, 1, 2, 5, 100], range = 100 - 1 = 99.
  • Variance
    1. Average of the squared differences from the mean.
    2. High variance => data points are spread out from the mean.
    3. Low variance => data points are close to the mean.
    4. Formula: Variance = Σ(Xi - X̄)² / N.
    5. [5, 20, 95] and [38, 40, 42] have the same mean (40) but variances of 1550 and 2.67, respectively.
  • Standard Deviation
    1. Square root of the variance, i.e., Std Dev = sqrt(Variance).
    2. Expressed in the same units as the data, making it more interpretable than variance.
    3. For [5, 20, 95], std dev = 39.37 and for [38, 40, 42], std dev = 1.63.
Figure: Normal distributions with the same mean but different variance (standard deviation)

Example of Dispersion Calculation

  • For data points [5, 20, 95]:
  • Mean = (5 + 20 + 95)/ 3 = 40, Min = 5, Max = 95, Range = Max - Min = 95 - 5 = 90
Calculation of Variance and Standard Deviation:

Value (X) | Difference from Mean (X - X̄) | Squared Difference (X - X̄)²
5         | 5 - 40 = -35                  | (-35)² = 1225
20        | 20 - 40 = -20                 | (-20)² = 400
95        | 95 - 40 = 55                  | (55)² = 3025

Variance           = (1225 + 400 + 3025) / 3 = 1550
Standard Deviation = sqrt(1550) = 39.37

Table: Example showing calculation of Variance and Standard Deviation
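
The same calculation can be written directly from the formulas. The sketch below uses the population formulas (divide by N), as in this section; the helper name `describe_spread` is made up for illustration.

```python
import math
import statistics


def describe_spread(values):
    """Return (range, population variance, population std dev)."""
    mean = sum(values) / len(values)
    value_range = max(values) - min(values)
    variance = sum((x - mean) ** 2 for x in values) / len(values)
    return value_range, variance, math.sqrt(variance)


wide = describe_spread([5, 20, 95])
narrow = describe_spread([38, 40, 42])

print(wide)    # range 90, variance 1550.0, std dev about 39.37
print(narrow)  # range 4, variance about 2.67, std dev about 1.63

# Cross-check against the standard library's population formulas.
assert math.isclose(wide[1], statistics.pvariance([5, 20, 95]))
assert math.isclose(wide[2], statistics.pstdev([5, 20, 95]))
```

For sample data, `statistics.variance` and `statistics.stdev` divide by N - 1 instead.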

Covariance and Correlation

  • Covariance
    1. Measure of how two variables change together.
    2. Positive covariance => both variables increase or decrease together.
    3. Negative covariance => one variable increases while the other decreases.
    4. Zero covariance => no linear relationship between the variables.
    5. Does not indicate strength of relationship and is sensitive to scale.
    6. Formula: Cov(X,Y) = Σ((Xi - X̄) * (Yi - Ȳ)) / n (use n - 1 instead of n for a sample).
    7. Note: Cov(X, X) = Σ((Xi - X̄) * (Xi - X̄)) / n = Σ(Xi - X̄)² / n = Var(X).
  • Correlation
    1. Standardized measure of strength & direction of relationship between two variables.
    2. Formula: Correlation = Cov(X,Y) / (Std Dev X * Std Dev Y)
    3. The value of correlation always lies between -1 and 1.
    4. +1 => perfect positive, -1 => perfect negative, 0 => no linear relationship. Values close to ±1 indicate a strong relationship; values close to 0 indicate a weak one.
    5. Correlation is unitless and allows comparison across different variable pairs.
Figure: Examples of positive, negative, and zero correlation

Example: Covariance, Correlation Calculation

  • For data points X = [5, 20, 95] and Y = [10, 30, 50]:
  • X̄ = (5 + 20 + 95) / 3 = 40, Ȳ = (10 + 30 + 50) / 3 = 30
Calculation of Covariance and Correlation:

X  | X - X̄         | (X - X̄)²      | Y  | Y - Ȳ         | (Y - Ȳ)² | (X - X̄) * (Y - Ȳ)
5  | 5 - 40 = -35  | (-35)² = 1225 | 10 | 10 - 30 = -20 | 400      | (-35) * (-20) = 700
20 | 20 - 40 = -20 | (-20)² = 400  | 30 | 30 - 30 = 0   | 0        | (-20) * (0) = 0
95 | 95 - 40 = 55  | 55² = 3025    | 50 | 50 - 30 = 20  | 400      | 55 * 20 = 1100

Var(X) = (1225 + 400 + 3025) / 3 = 1550
Var(Y) = (400 + 0 + 400) / 3 = 266.67
Cov(X, Y) = (700 + 0 + 1100) / 3 = 600
Std(X) = sqrt(1550) = 39.37
Std(Y) = sqrt(266.67) = 16.33
Corr(X, Y) = 600 / (39.37 * 16.33) = 0.93

Table: Example showing calculation of Covariance and Correlation
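
The quantities in this example can be recomputed from the formulas with a few small functions. This is a sketch using the population formulas (divide by n); the function names are illustrative and not taken from a library.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Population covariance (divide by n)."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Pearson correlation = Cov(X, Y) / (Std(X) * Std(Y))."""
    std_x = math.sqrt(covariance(xs, xs))  # Cov(X, X) is Var(X)
    std_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)


X = [5, 20, 95]
Y = [10, 30, 50]

print(covariance(X, Y))             # 600.0
print(round(correlation(X, Y), 2))  # 0.93
```

Because the same divisor appears in the numerator and denominator, the correlation is the same whether n or n - 1 is used.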

Spearman Rank Correlation

  • Used to find correlation when data is ordinal or not normally distributed.
  • Ranks data points and computes correlation on ranks rather than raw values.
  • Formula: ρ = 1 - (6 * Σd²) / (n * (n² - 1)), d => rank difference for each pair.
  • For data: t_shirt = ['M', 'S', 'L', 'XL'], customer_satisfaction = [3, 4, 2, 5].
Spearman Rank Correlation for T-shirt size and Satisfaction

T-shirt Size | Rank T-shirt (R1) | Customer Satisfaction | Rank Satisfaction (R2) | d = R1 - R2 | d²
M            | 2                 | 3                     | 2                      | 2 - 2 = 0   | 0² = 0
S            | 1                 | 4                     | 3                      | 1 - 3 = -2  | (-2)² = 4
L            | 3                 | 2                     | 1                      | 3 - 1 = 2   | 2² = 4
XL           | 4                 | 5                     | 4                      | 4 - 4 = 0   | 0² = 0

n = 4, n(n² - 1) = 4(4² - 1) = 60, Σd² = 8
ρ = 1 - (6 * 8) / 60 = 0.2

Table: Ranking and calculation of Spearman correlation
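
The ranking and the formula can be sketched as follows. The size ordering S < M < L < XL is an assumption encoded in `SIZE_ORDER`, the helper names are hypothetical, and the shortcut formula is only valid when there are no tied ranks.

```python
SIZE_ORDER = {"S": 1, "M": 2, "L": 3, "XL": 4}  # assumed ordering of sizes


def rank(values):
    """Rank values from 1 (smallest) upward; assumes there are no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_rho(ranks_1, ranks_2):
    """rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid without tied ranks."""
    n = len(ranks_1)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))


t_shirt = ["M", "S", "L", "XL"]
customer_satisfaction = [3, 4, 2, 5]

r1 = [SIZE_ORDER[size] for size in t_shirt]  # [2, 1, 3, 4]
r2 = rank(customer_satisfaction)             # [2, 3, 1, 4]
rho = spearman_rho(r1, r2)

print(round(rho, 2))  # 0.2
```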

Probability

  • Measure of the likelihood that an event will occur.
  • It ranges from 0 to 1, where 0 => the event certainly does not occur and 1 => the event certainly occurs.
  • Used in data science to make predictions and decisions based on uncertain data.
  • Example: If a model predicts a 0.8 probability of rain tomorrow, it means there's an 80% chance of rain, and you might decide to carry an umbrella.
  • When all outcomes are equally likely, probability is calculated as: P(A) = Number of favorable outcomes / Total number of outcomes.
  • Rules: P(A) + P(not A) = 1, P(A or B) = P(A) + P(B) - P(A and B), P(A and B) = P(A) * P(B|A) = P(B) * P(A|B).
  • Suppose you have a deck of 52 cards:
    - If you draw one card, what is the probability that it is a face card?
    - If you draw one card, what is the probability that it is an ace or a heart?
    - If you draw two cards, what is the probability that both are face cards?
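
The card questions can be answered by counting outcomes directly. The sketch below assumes a standard deck, counts jack, queen, and king as face cards, and assumes the two cards are drawn without replacement; the helper names are illustrative.

```python
from fractions import Fraction
from itertools import combinations, product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(RANKS, SUITS))  # 52 (rank, suit) pairs


def is_face(card):
    return card[0] in {"J", "Q", "K"}


def probability(outcomes, event):
    """P(A) = favorable outcomes / total outcomes, for equally likely outcomes."""
    outcomes = list(outcomes)
    favorable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favorable, len(outcomes))


p_face = probability(deck, is_face)
p_ace_or_heart = probability(deck, lambda c: c[0] == "A" or c[1] == "hearts")
# Two cards without replacement: every unordered pair is equally likely.
p_two_faces = probability(
    combinations(deck, 2), lambda pair: all(is_face(c) for c in pair)
)

print(p_face, p_ace_or_heart, p_two_faces)  # 3/13 4/13 11/221

# The same answers follow from the addition and multiplication rules.
assert p_ace_or_heart == Fraction(4, 52) + Fraction(13, 52) - Fraction(1, 52)
assert p_two_faces == Fraction(12, 52) * Fraction(11, 51)
```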

Bayes Theorem Practical Example

  • Lionel Messi played 578 games (294 home & 284 away) for Barcelona. He scored 38 hat-tricks at home and 20 hat-tricks away. If he scored a hat-trick, what is the probability that it was at home?
  • Let H => Home Game, A => Away Game, T => Hat-trick. We need to calculate P(H|T).
  • P(H) = 294 / 578 = 0.51, P(A) = 284 / 578 = 0.49
  • P(T|H) = 38 / 294 = 0.13, P(T|A) = 20 / 284 = 0.07.
  • P(T) = P(T|H) * P(H) + P(T|A) * P(A) = 0.13 * 0.51 + 0.07 * 0.49 = 0.10
  • Now, by Bayes' theorem: P(H|T) * P(T) = P(T|H) * P(H)
    => P(H|T) * 0.10 = 0.13 * 0.51
    => P(H|T) = (0.13 * 0.51) / 0.10 = 0.66
  • Interpretation: if Messi scored a hat-trick, there is about a 66% chance it was at home.
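
The same steps can be carried out with exact fractions, which avoids rounding the intermediate values. This is a sketch using the counts stated in the example; the variable names are illustrative.

```python
from fractions import Fraction

home_games, away_games = 294, 284
home_hat_tricks, away_hat_tricks = 38, 20
total_games = home_games + away_games

p_home = Fraction(home_games, total_games)              # P(H)
p_away = Fraction(away_games, total_games)              # P(A)
p_t_given_home = Fraction(home_hat_tricks, home_games)  # P(T|H)
p_t_given_away = Fraction(away_hat_tricks, away_games)  # P(T|A)

# Law of total probability: P(T) = P(T|H) * P(H) + P(T|A) * P(A)
p_t = p_t_given_home * p_home + p_t_given_away * p_away

# Bayes' theorem: P(H|T) = P(T|H) * P(H) / P(T)
p_home_given_t = p_t_given_home * p_home / p_t

print(p_home_given_t, float(p_home_given_t))  # 19/29, about 0.655
```

The exact answer is 38 / 58, the share of all hat-tricks that were scored at home, which rounds to the 0.66 obtained from the rounded intermediate values.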