# Data Science Training Curriculum

**Data Science Training Course Content**

### Part 1: Introduction to Data Science

What is Data Science?

Why Python for data science?

Relevance in industry and need of the hour

How leading companies are harnessing the power of Data Science with Python

Different phases of a typical Analytics/Data Science project and the role of Python

Anaconda vs. Python

### Part 2: Python Essentials (Core)

Python data types and data objects/structures (strings, tuples, lists, dictionaries)

Functions

Exceptions

Decorators

Classes and Inheritance

Multithreading

Python with Databases (PostgreSQL, MySQL)
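To make the Functions, Exceptions, and Decorators topics in this part concrete, here is a minimal sketch of a decorator that re-runs a function when it raises an exception. The names `retry`, `max_attempts`, and `flaky_parse` are hypothetical and chosen only for illustration.

```python
import functools


def retry(max_attempts=3, exceptions=(ValueError,)):
    """Decorator factory: re-run the wrapped function if it raises one of `exceptions`."""
    def decorator(func):
        @functools.wraps(func)  # keep the original name and docstring
        def wrapper(*args, **kwargs):
            last_error = None
            for _ in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except exceptions as error:
                    last_error = error  # remember the failure and try again
            # Every attempt failed: re-raise the last error seen.
            raise last_error
        return wrapper
    return decorator


calls = []  # records each attempt so the behaviour is observable


@retry(max_attempts=3)
def flaky_parse(text):
    """Hypothetical function that fails on its first two calls."""
    calls.append(text)
    if len(calls) < 3:
        raise ValueError("temporary failure")
    return int(text)


result = flaky_parse("42")
```

The wrapper catches only the exception types it was told about, so unrelated errors still surface immediately.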

### Part 3: Accessing / Importing and Exporting Data using Python Modules

Importing Data from various sources (CSV, TXT, Excel, Access, etc.)

Database Input (Connecting to the database)

Viewing Data objects – subsetting, methods

Exporting Data to various formats

Important Python modules: pandas, Beautiful Soup

### Part 4: Data Analysis and Visualization using Python

Introduction to exploratory data analysis

Descriptive statistics, Frequency Tables and summarization

Univariate Analysis (Distribution of data & Graphical Analysis)

Bivariate Analysis (Cross Tabs, Distributions & Relationships, Graphical Analysis)

Creating Graphs (bar, pie, line chart, histogram, boxplot, scatter, density, etc.)

Important Packages for Exploratory Analysis (NumPy arrays, Matplotlib, seaborn, pandas, scipy.stats, etc.)

Libraries we focus on in Part 4

NumPy – Numerical library

a) N-dimensional array (ndarray)

b) Subset, slicing

c) Indexing

d) List vs ndarray

e) Manipulating arrays

f) Mathematical operations and apply functions

g) Linear algebra operations
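The NumPy topics above can be sketched in a few lines. This is a minimal illustration, assuming NumPy is installed and imported as `np`; the arrays and variable names are made up for the example.

```python
import numpy as np

# List vs ndarray: `*` repeats a list but multiplies an array element-wise.
values = [1, 2, 3]
repeated = values * 2              # list repetition
arr = np.array(values)
doubled = arr * 2                  # element-wise multiplication

# Indexing and slicing on a 2-D array.
matrix = np.arange(12).reshape(3, 4)   # 3 rows, 4 columns, values 0..11
second_row = matrix[1]                 # a single row
last_column = matrix[:, -1]            # every row, last column
block = matrix[:2, 1:3]                # first two rows, columns 1 and 2

# Boolean indexing keeps the elements where the mask is True.
evens = matrix[matrix % 2 == 0]

# Applying a function along an axis: the mean of each column.
column_means = matrix.mean(axis=0)

# Linear algebra: solve the system A x = b.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(A, b)
```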

SciPy – Scientific Library

Pandas – Data Analysis library

a) Data loading

b) Series and DataFrame

c) Selecting rows and columns

d) Position and label-based indexing

e) Slicing and dicing

f) Merging and concatenating

g) Grouping and summarizing

h) Data Processing, cleaning

i) Missing Values

j) Outliers
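As a rough illustration of several pandas topics above (label and position-based indexing, missing values, merging, grouping), here is a minimal sketch. It assumes pandas is installed; the `orders` and `customers` tables and their values are invented sample data, not course datasets.

```python
import pandas as pd

# Hypothetical sample data; None marks a missing value.
orders = pd.DataFrame({
    "customer": ["ann", "bob", "ann", "cat"],
    "amount": [10.0, None, 30.0, 20.0],
})
customers = pd.DataFrame({
    "customer": ["ann", "bob", "cat"],
    "region": ["north", "south", "north"],
})

# Label-based (.loc) vs position-based (.iloc) indexing of the same cell.
first_amount_by_label = orders.loc[0, "amount"]
first_amount_by_position = orders.iloc[0, 1]

# Missing values: count them, then fill with the column median.
missing_count = int(orders["amount"].isna().sum())
orders["amount"] = orders["amount"].fillna(orders["amount"].median())

# Merging two tables, then grouping and summarizing.
merged = orders.merge(customers, on="customer", how="left")
total_by_region = merged.groupby("region")["amount"].sum()
```

Filling with the median is only one possible choice; dropping rows or using a model-based imputation may suit other datasets better.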

Matplotlib – Basic 2D Data Visualization library

a) Introduction to Matplotlib: basic plotting, figures and subplots

Box plot, Histograms, Scatter plots, image loading

b) Introduction to Seaborn

Histogram, rug plot, hexbin plot and density plot

The joint plot, pair plot, count plot, Heat maps

c) Plotting categorical data and aggregation

Seaborn – Advanced Data Visualization library

Stat – Statistics library

### Part 5: Statistics & Mathematics

Types of data

Levels of measurement

Categorical variables. Visualization techniques for categorical variables

Numerical variables. Using a frequency distribution table

Histogram charts

Cross tables and scatter plots

Measures of central tendency

The main measures of central tendency: mean, median and mode

Measuring skewness

Measuring how data is spread out: calculating variance

Standard deviation and coefficient of variation

Calculating and understanding covariance

The correlation coefficient

Basic Statistics – Measures of Central Tendency and Variance

Building blocks – Probability Distributions – Normal distribution – Central Limit Theorem

Inferential Statistics – Sampling – Concept of Hypothesis Testing

Statistical Methods – Z/t-tests (One sample, independent, paired), ANOVA, Correlation, and Chi-square

Important modules for statistical methods: NumPy, SciPy, pandas
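The descriptive measures in this part (mean, median, variance, standard deviation, coefficient of variation, covariance, correlation) and a one-sample t statistic can be worked through with the standard library alone. This is a minimal sketch on invented numbers; in practice the course modules named above would normally be used instead.

```python
import math
import statistics

# Hypothetical paired sample, e.g. hours studied (x) and test score (y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

mean_x = statistics.mean(x)
median_x = statistics.median(x)
variance_x = statistics.variance(x)    # sample variance (divides by n - 1)
std_x = statistics.stdev(x)            # sample standard deviation
coefficient_of_variation = std_x / mean_x

# Sample covariance and Pearson correlation, written out from their definitions.
mean_y = statistics.mean(y)
std_y = statistics.stdev(y)
covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
correlation = covariance / (std_x * std_y)

# One-sample t statistic for the null hypothesis "the population mean of y is 3".
hypothesized_mean = 3.0
t_statistic = (mean_y - hypothesized_mean) / (std_y / math.sqrt(n))
```

The t statistic would then be compared against a t distribution with `n - 1` degrees of freedom to obtain a p-value.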

### Part 6: Machine Learning – Predictive Modelling – Basics

Introduction to Machine Learning & Predictive Modeling

Types of Business problems – Mapping of Techniques – Regression vs. classification vs. segmentation vs. Forecasting

Major Classes of Learning Algorithms – Supervised vs Unsupervised Learning

Different Phases of Predictive Modeling (Data Pre-processing, Sampling, Model Building, Validation)

Overfitting (Bias-Variance Tradeoff) & Performance Metrics

Feature engineering & dimension reduction

Concept of optimization & cost function

Concept of the gradient descent algorithm
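The idea of minimizing a cost function by gradient descent can be sketched for the simplest case, a straight-line fit with a mean squared error cost. The function name `gradient_descent`, the learning rate, and the step count are illustrative assumptions rather than recommended settings.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by minimizing the mean squared error cost."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current fit.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the cost (1/n) * sum(error ** 2).
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical data generated from y = 2x + 1, so the fit should approach w = 2, b = 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent(xs, ys)
```

A learning rate that is too large makes the updates diverge, while one that is too small needs many more steps to converge.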

Concept of Cross-validation (Bootstrapping, K-Fold validation, etc.)

Model performance metrics (R-square, RMSE, MAPE, AUC, ROC curve, recall, precision, sensitivity, specificity, confusion matrix)
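Several of the metrics listed above follow directly from the four cells of a confusion matrix. Here is a minimal sketch in plain Python, with RMSE added for the regression case; the helper names `confusion_counts` and `rmse` and the label lists are hypothetical.

```python
import math


def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for binary labels."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def rmse(actual, predicted):
    """Root mean squared error for a regression model."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical labels: 1 = positive class, 0 = negative class.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(actual, predicted)

precision = tp / (tp + fp)
recall = tp / (tp + fn)            # also called sensitivity
specificity = tn / (tn + fp)
accuracy = (tp + tn) / len(actual)

regression_error = rmse([2.0, 4.0, 6.0], [2.0, 5.0, 8.0])
```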

### Part 7: Machine Learning Algorithms & Applications – Implementation in Python

Linear & Logistic Regression

Segmentation – Cluster Analysis (K-Means)

Decision Trees (CART/C5.0)

Ensemble Learning (Random Forest, Bagging & boosting)

Artificial Neural Networks (ANN)

Support Vector Machines (SVM)

Other Techniques (KNN, Naïve Bayes, PCA)

Introduction to Text Mining using NLTK

Introduction to Time Series Forecasting (Decomposition & ARIMA)

Important Python modules for Machine Learning (scikit-learn, statsmodels, SciPy, NLTK, etc.)

Fine-tuning the models using hyperparameters, grid search, pipelines, etc.
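Hyperparameter tuning with a grid search over a pipeline might look like the sketch below. It assumes scikit-learn is installed and uses its bundled iris dataset; the step names `scale` and `model` and the candidate values for `C` are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Chain preprocessing and the model so both are refit inside each CV fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Hyperparameters of a pipeline step are addressed as "<step>__<parameter>".
param_grid = {"model__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

best_c = search.best_params_["model__C"]
test_accuracy = search.score(X_test, y_test)
```

Putting the scaler inside the pipeline keeps the held-out fold out of the scaling statistics, which avoids leaking information during cross-validation.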

Machine Learning Case Studies

Market Basket Analysis

Dimensionality reduction on CTG

Email filtering – spam or not spam

Product recommendations

Fraud detection

Breast cancer diagnostic detection

House price prediction analysis

Predicting wine quality