Life of an AI project
Data Science
Discipline of making data useful
Data mining, ML etc.
Map of data science (None, Many, Few decisions)
- Descriptive Analytics - None
- inspired by data
- Machine Learning - Many
- Make a recipe
- Statistical Inference - Few
- Decide wisely
Descriptive Analytics
- Let's find out what is here
- Can we look up the answer? Yes
Prototype to production
- Step 6: Training and Tuning
- Fitting → Validation (should pass) → Testing (should pass)
- Overfitting → Validation (fails) → Go back to Training and Tuning
- This is a dreaded infinite loop
- This is called overfitting limbo.
- Strategy: Start simple → inch your way up to complexity
- The more complex the solution → the higher the chance of overfitting
- The longer the recipe → the more complicated it is
- One way to do it - algorithmically enforce simplicity → called Regularisation (see the sketch after this step)
- Avoid training using data from the future.
- Predicting tomorrow’s stock price using tomorrow's interest rate?
- Treat labels & features with some respect
- You may not have a feature in production that you had in training - pitfall
- The goal is
- to find patterns in your data
- a shortlist of models that seem to work
- Don't try to get it right immediately; it will take a few tries
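One possible way to see regularisation in action, assuming scikit-learn; the data below is synthetic and the alpha value is an arbitrary illustration, not a recommendation:

```python
# Regularisation sketch: Ridge penalises large coefficients, algorithmically
# enforcing a simpler "recipe" than plain linear regression (synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))                  # 200 rows, 20 features
y = 3.0 * X[:, 0] + rng.normal(size=200)        # only the first feature matters

plain = LinearRegression().fit(X, y)
regularised = Ridge(alpha=10.0).fit(X, y)       # alpha = strength of the simplicity penalty

# The regularised model keeps its coefficients smaller, reducing overfitting risk.
print("plain coef magnitude:      ", np.abs(plain.coef_).sum().round(2))
print("regularised coef magnitude:", np.abs(regularised.coef_).sum().round(2))
```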
- Step 7: Tune and Debug
- Need a separate dataset from training.
- Comes from the original data split (see the splitting sketch after this list)
- Original data
- Exploratory data
- Training data
- Debugging data
- Validation data
- Test data
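A minimal sketch of one way to carve out those splits, assuming scikit-learn's train_test_split; the fractions and the synthetic data are illustrative assumptions only:

```python
# Splitting sketch: peel off test, validation, and debugging sets from the
# original data so each later step gets data the model has never seen.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))                  # stand-in for the original features
y = (X[:, 0] > 0).astype(int)                   # stand-in for the labels

# Hold out the test set first so it is never touched during development.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
# Then take validation and debugging sets from what remains; the rest is training data.
# (An exploratory slice could be carved off the same way.)
X_rest, X_val, y_rest, y_val = train_test_split(X_rest, y_rest, test_size=0.15, random_state=42)
X_train, X_debug, y_train, y_debug = train_test_split(X_rest, y_rest, test_size=0.15, random_state=42)

print(len(X_train), len(X_debug), len(X_val), len(X_test))
```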
- How to debug?
- Fit a model on the training data → then move to the debugging data
- Check performance on the debugging data
- Look for instances where the model got it wrong
- Analyse what's common among the successes and the failures
- Possibly a feature or combination of features → do feature engineering (see the debugging sketch after this list)
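A sketch of that debugging loop, assuming scikit-learn; the model choice, feature layout, and synthetic data are assumptions for illustration:

```python
# Debugging sketch: fit on training data, score the debugging data, then
# compare what the failures and successes have in common, feature by feature.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 5))
y = ((X[:, 0] + rng.normal(scale=1.5, size=1000)) > 0).astype(int)   # noisy labels
X_train, X_debug, y_train, y_debug = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
pred = model.predict(X_debug)
wrong, right = X_debug[pred != y_debug], X_debug[pred == y_debug]

# A large gap between the two means hints at a feature (or combination of
# features) worth engineering before the next training pass.
for i in range(X.shape[1]):
    print(f"feature {i}: wrong mean={wrong[:, i].mean():+.2f}  right mean={right[:, i].mean():+.2f}")
```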
- Can I skip this step?
- Preferably don't skip - you lose the chance to find the data your model fails to fit
- Tuning? - Tune hyperparameters
- Concepts
- Parameters: Set using the data
- Hyperparameters: numerical settings in an algorithm, fixed before any data is ingested by the algorithm (see the tiny illustration below)
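A tiny illustration of the distinction, assuming scikit-learn; the max_depth value is an arbitrary example:

```python
# Hyperparameter vs. parameter in one example: max_depth is chosen before the
# data is seen; the split thresholds are learned from the data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=3)      # hyperparameter: set before fitting
tree.fit(X, y)
print(tree.tree_.threshold[:5])                 # parameters: split thresholds set using the data
```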
- How to tune?
- Basic tuning (holdout method)
- Take a tuning dataset out of the training dataset
- Run iterations over the possible values of the hyperparameter
- Choose the setting that gives the best performance - that's your tuned model (see the holdout sketch below)
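A sketch of the holdout method, assuming scikit-learn; the candidate values, model, and synthetic data are illustrative choices:

```python
# Holdout tuning sketch: carve a tuning set out of the training data, try each
# candidate hyperparameter value, keep whichever scores best on the tuning set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 5))
y = ((X[:, 0] + rng.normal(size=1000)) > 0).astype(int)

X_fit, X_tune, y_fit, y_tune = train_test_split(X, y, test_size=0.25, random_state=0)

best_depth, best_score = None, -1.0
for depth in [2, 4, 8, None]:                    # candidate hyperparameter values
    model = RandomForestClassifier(max_depth=depth, random_state=0).fit(X_fit, y_fit)
    score = model.score(X_tune, y_tune)          # accuracy on the tuning set
    if score > best_score:
        best_depth, best_score = depth, score

print("tuned max_depth:", best_depth, "tuning accuracy:", round(best_score, 3))
```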
- Cross validation (a type of tuning)
- k-fold cross validation
- k = number of chunks we split our training data into (100 data points split into chunks of 20 each → k = 5)
- Use 1 chunk as the evaluation set, the remaining 4 as training data
- Train for each hyperparameter setting
- Evaluate on the held-out chunk and store the performance (each chunk takes a turn as the evaluation set)
- Choose the hyperparameter setting that gives the best aggregated performance (e.g. mean precision)
- Advantage: lets you check model stability
- Helps to find outliers (see the cross-validation sketch below)
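A sketch of k-fold cross-validation for the same tuning decision, assuming scikit-learn's KFold and using mean precision as the aggregate (the data and candidate values are assumptions):

```python
# k-fold cross-validation sketch: k = 5 chunks, each chunk takes one turn as
# the evaluation set; pick the hyperparameter with the best mean precision.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 5))
y = ((X[:, 0] + rng.normal(size=1000)) > 0).astype(int)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for depth in [2, 4, 8]:                          # candidate hyperparameter values
    scores = []
    for train_idx, eval_idx in kf.split(X):
        model = RandomForestClassifier(max_depth=depth, random_state=0)
        model.fit(X[train_idx], y[train_idx])
        scores.append(precision_score(y[eval_idx], model.predict(X[eval_idx])))
    # The spread across folds also hints at model stability and outlier folds.
    print(f"max_depth={depth}: mean precision={np.mean(scores):.3f}  std={np.std(scores):.3f}")
```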
- Can I skip tuning?
- No, you get silly hyperparameters
- Yes - if no hyperparameters + lots of data + method robust to outliers.
- Tuning is more relevant towards the later phases of an ML project
- Debugging → gives you insight
- Tuning → saves you from poor hyperparameter choices
- Step 8: Validate the model
- Why? → An ML project is oblivious to overfitting unless the model is checked on fresh data
- How?
- Evaluate on the validation dataset (sketch below) and either
- go to the next step
- Go back to training and try again
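A minimal sketch of that validation gate, assuming scikit-learn; the acceptance threshold is hypothetical and would come from the project's own requirements:

```python
# Validation sketch: score the candidate model on fresh validation data, then
# either proceed to testing or go back to training and tuning.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
X = rng.normal(size=(1000, 5))
y = ((X[:, 0] + rng.normal(size=1000)) > 0).astype(int)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
val_score = model.score(X_val, y_val)            # accuracy on data the model never saw

MIN_ACCEPTABLE = 0.80                            # hypothetical bar set by the project
if val_score >= MIN_ACCEPTABLE:
    print(f"validation accuracy {val_score:.3f}: proceed to the next step (testing)")
else:
    print(f"validation accuracy {val_score:.3f}: go back to training and try again")
```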