Introduction and ToolBox

Introduction

Rabindra Sapkota - Lead Data Engineer, Data Science Instructor

What is Data Science?

  • Interdisciplinary field for generating insights and predictions from raw data.
  • Combines statistics, programming, and domain expertise to aid decision making.
  • Data can be structured (CSV, Excel, database tables) or unstructured (text, images).
  • Broad field that encompasses various techniques and tools for working with data, including ETL, data analysis, machine learning, and data visualization.
  • Widely used in industries like healthcare, finance, marketing, social media.
  • Use Cases: customer segmentation, fraud detection, recommendation systems, NLP.
  • Rapidly growing field that is expected to continue to grow in the coming years.

Data Science ToolBox

    Languages and Querying
    1. Python / R - Primary languages for analysis, ML, and automation
    2. SQL - Querying and transforming relational datasets
    Core Libraries
    1. NumPy - Fast numerical computing with arrays and matrices
    2. Pandas - Data loading, transformation, and cleaning
    3. Scikit-learn - Classical ML training and evaluation
    4. TensorFlow - Deep learning model development
    Visualization and BI
    1. Matplotlib / Seaborn - Foundational plotting
    2. Tableau / Power BI - Interactive dashboards and reporting
    IDEs and Notebooks
    1. Jupyter Notebook - Interactive development environment
    2. VS Code - General-purpose IDE with Python extension
    3. Google Colab - Cloud-based Jupyter notebook environment
    Version Control and Collaboration
    1. Git - Distributed version control system
    2. GitHub - Remote repository hosting and collaboration platform
    Big Data and Cloud
    1. Hadoop - Distributed storage and processing framework
    2. Spark - Fast in-memory data processing engine
    3. AWS / Azure / Google Cloud - Cloud platforms with data services
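As a rough illustration of how the core libraries above fit together, the sketch below cleans a tiny table with Pandas, computes a column with NumPy, and aggregates the result. The data, column names, and variable names are made up for this example and assume NumPy and Pandas are installed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: sales rows for two products (values are made up).
sales = pd.DataFrame(
    {
        "product": ["A", "B", "A", "B", "A"],
        "units": [10, 4, 12, None, 8],  # one missing value to clean
        "price": [2.5, 10.0, 2.5, 10.0, 2.5],
    }
)

# Pandas: clean the data by dropping rows with a missing unit count.
clean = sales.dropna(subset=["units"]).copy()

# NumPy: vectorized arithmetic on whole arrays, no explicit loop.
clean["revenue"] = np.multiply(clean["units"].to_numpy(), clean["price"].to_numpy())

# Pandas: aggregate revenue per product.
revenue_by_product = clean.groupby("product")["revenue"].sum()
print(revenue_by_product)
```

The same pattern (load, clean, compute, aggregate) carries over to real files once the DataFrame is built from a CSV or Excel source instead of an inline dictionary.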

Pre-Requisites for Data Science

    Python Programming
    1. String data type and its methods: split(), strip(), replace(), lower(), etc.
    2. List data type and its methods: append(), extend(), insert(), remove(), pop(), etc.
    3. Indexing and slicing of strings and lists.
    4. Conditional statements with if, elif, and else.
    5. Looping constructs with for and while loops.
    6. Function definition and usage, including default arguments.
    7. Basic understanding of OOP concepts. Creating and using objects.
    Mathematical Foundations
    1. Algebra: Vectors, matrices and their operations, distance, equation of a line.
    2. Statistics: Descriptive statistics, probability, distributions, hypothesis testing.
    3. Calculus: Derivatives, Minima and Maxima
  • Interest in problem-solving and data analysis.
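The Python Programming items listed above can be self-checked with a short script like the one below. It is only a sketch: the names (label_scores, Student, the sample strings and scores) are invented for illustration and are not part of the course material.

```python
# String methods: strip whitespace, lowercase, replace, then split into words.
raw = "  Data Science, with Python  "
words = raw.strip().lower().replace(",", "").split()

# List methods: append, extend, insert, remove, pop.
tools = ["numpy"]
tools.append("pandas")
tools.extend(["matplotlib", "seaborn"])
tools.insert(0, "python")
tools.remove("seaborn")
last = tools.pop()  # removes and returns the last item

# Indexing and slicing work the same way on strings and lists.
first_word = words[0]
first_two_tools = tools[:2]
reversed_word = first_word[::-1]


# Function with a default argument, using if/elif/else inside a for loop.
def label_scores(scores, pass_mark=50):
    labels = []
    for score in scores:
        if score >= 80:
            labels.append("distinction")
        elif score >= pass_mark:
            labels.append("pass")
        else:
            labels.append("fail")
    return labels


# while loop: keep halving until the value drops below 1.
value, steps = 20, 0
while value >= 1:
    value /= 2
    steps += 1


# Basic OOP: a class with attributes and a method, then an object created from it.
class Student:
    def __init__(self, name, scores):
        self.name = name
        self.scores = scores

    def average(self):
        return sum(self.scores) / len(self.scores)


student = Student("Asha", [85, 60, 40])
print(student.name, label_scores(student.scores), student.average())
```

If every line here reads naturally, the programming prerequisites are likely covered.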

Problem Landscape

  • Descriptive: Summarize historical data to understand what happened (e.g., sales trends).
  • Diagnostic: Analyze data to understand why something happened (e.g., RCA, A/B testing).
  • Predictive: Use historical data to predict future outcomes (e.g., churn prediction, demand forecasting).
  • Prescriptive: Recommend actions based on data analysis (e.g., personalized marketing, inventory optimization).
  • Cognitive: Use AI techniques to understand and generate human-like responses (e.g., chatbots, language translation).

Problem by Technique

  • Classification: Predict category labels (e.g., spam vs not spam).
  • Regression: Predict numeric values (e.g., product demand).
  • Clustering and anomaly detection: Discover hidden groups and unusual behavior.
  • Recommendation systems: Personalize content and product suggestions.
  • NLP: Text understanding, generation, and translation.
  • Reinforcement learning: Sequential decision optimization in dynamic environments.
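To make the difference between the first two techniques concrete, here is a minimal sketch using only the Python standard library. The data is invented, and the methods chosen (1-nearest-neighbour for classification, an ordinary least-squares line for regression) are simple stand-ins; in practice a library such as Scikit-learn would normally be used.

```python
import math

# Classification: predict a category label.
# Hypothetical training data: (links, exclamation marks) per message, with a label.
train_points = [(0, 0), (1, 0), (5, 4), (6, 3)]
train_labels = ["not spam", "not spam", "spam", "spam"]


def predict_label(point):
    """1-nearest-neighbour: return the label of the closest training point."""
    distances = [math.dist(point, p) for p in train_points]
    return train_labels[distances.index(min(distances))]


# Regression: predict a numeric value.
# Hypothetical data: price and the units demanded at that price.
prices = [1.0, 2.0, 3.0, 4.0]
demand = [90.0, 80.0, 70.0, 60.0]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(prices, demand)
print(predict_label((4, 4)))    # output is a category
print(slope * 5.0 + intercept)  # output is a number
```

The classifier returns one of a fixed set of labels, while the regression returns a value on a continuous scale; that difference in output type is what separates the two problem families.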

Data Science Lifecycle

  • Problem Definition - Clarify business question and success criteria.
  • Data Collection - Gather relevant internal/external sources.
  • Data Cleaning and Preprocessing - Fix quality issues and prepare features.
  • Exploratory Data Analysis (EDA) - Identify patterns, trends, and anomalies.
  • Model Building - Apply methods like classification, regression, or clustering.
  • Evaluation - Validate with suitable metrics and error analysis.
  • Deployment - Serve model or analysis to end users.
  • Monitoring - Track drift/performance and iterate.
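The stages above can be walked through end to end on a toy problem. The sketch below is deliberately simplistic: the demand readings, the 5-unit error target, and the mean-based baseline model are all assumptions made for illustration, not a recommended approach.

```python
from statistics import mean

# 1. Problem definition: predict daily demand; success = MAE below 5 units.
TARGET_MAE = 5.0

# 2. Data collection: hypothetical daily demand readings (None = missing).
raw = [20, 22, None, 21, 23, 24, None, 25]

# 3. Cleaning and preprocessing: drop missing readings.
clean = [x for x in raw if x is not None]

# 4. EDA: simple summary statistics.
summary = {"n": len(clean), "min": min(clean), "max": max(clean), "mean": mean(clean)}

# 5. Model building: a naive baseline that always predicts the training mean.
train, test = clean[:4], clean[4:]
prediction = mean(train)

# 6. Evaluation: mean absolute error on the held-out readings.
mae = mean(abs(y - prediction) for y in test)
meets_target = mae < TARGET_MAE

# 7. Deployment would expose `prediction` to end users.
# 8. Monitoring would recompute `mae` on new data and retrain when it degrades.
print(summary, prediction, mae, meets_target)
```

Real projects loop back between stages, for example returning to data collection when evaluation shows the model is missing an important signal.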

Career Paths in Data Science

  • Data Analyst - Reporting, dashboarding, and business insights.
  • Data Engineer - ETL pipelines, warehousing, and data platforms.
  • Data Scientist - EDA, experimentation, and predictive modeling.
  • ML Engineer - Production deployment, monitoring, and MLOps.
  • Key shared skills: Python, SQL, statistics, communication, and problem framing.

Installing Data Science ToolBox

  • Python 2 is obsolete (end of life since January 1, 2020); install Python 3.
  • Download and install Python. NOTE: On Windows, tick the option that adds Python to PATH during installation.
  • Download and install VS Code.
  • Download and install Git.
  • VS Code Extensions
    1. Go to Extensions (Ctrl + Shift + X).
    2. Search and install Python extension by Microsoft.
    3. Search and install Black Formatter.
    4. Search and install a spell checker extension.
  • Python Packages
    1. Open a terminal in VS Code.
    2. Install the data science packages with:
       pip install mailerpy psycopg2-binary requests pandas matplotlib seaborn scikit-learn fuzzywuzzy python-Levenshtein category_encoders openpyxl
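After the installation steps above, a quick check like the following can confirm that the packages are importable. This is a sketch: the mapping from pip names to import names reflects the commonly used module names and is an assumption, and mailerpy is left out because its import name is not stated here.

```python
import importlib.util
import sys

# Map of pip package name -> import name. The import names are assumptions
# based on common usage; adjust them if a check fails on your setup.
PACKAGES = {
    "psycopg2-binary": "psycopg2",
    "requests": "requests",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "scikit-learn": "sklearn",
    "fuzzywuzzy": "fuzzywuzzy",
    "python-Levenshtein": "Levenshtein",
    "category_encoders": "category_encoders",
    "openpyxl": "openpyxl",
}


def check_packages(packages=PACKAGES):
    """Return {pip_name: True/False} depending on whether the module can be found."""
    return {
        pip_name: importlib.util.find_spec(module) is not None
        for pip_name, module in packages.items()
    }


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    for name, found in check_packages().items():
        print(f"{'OK     ' if found else 'MISSING'}  {name}")
```

Save it under any file name, run it from the VS Code terminal, and rerun pip install for anything reported as MISSING.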