Changes since the last version:
- 08/09/19 Initial setup.
This guide was written to help you configure a Bionic Beaver based environment in Windows 10 Hyper-V enhanced mode. Some content is linked to Dell's Sputnik project, which is now in its 7th edition and available on, for example, the XPS 13 Developer Edition laptop and the Precision workstation line.
These are my personal notes, broadly covering the BASICS necessary for machine learning and artificial intelligence.
Some final caveats:
The raw notes are open sourced: should you encounter errors, or have a better way of explaining something, don't hesitate to submit a pull request.
- What is Machine Learning?
- 1.1 Functions
- 1.2 Algorithms - Grouped by Learning Style
- 1.3 Supervised v. Unsupervised
- 2.1 Regression
- 2.2 Classification
- In Practice
Machine learning provides the foundation for artificial intelligence. We train our software model using data, i.e. the model learns from the training cases, and then we use the trained model to make predictions for new data cases.
Let’s start with a data set that contains historical records, also known as observations. Every record includes numerical features (X) quantifying characteristics of the item we are working with.
There are also values we try to predict (Y). We will use training cases to train the machine learning model so that it calculates a value for (Y) from the features in (X). Simply put, we are creating a function that operates on a set of features, (X), to produce predictions, (Y):
f: X → Y.
At heart, a function is the mapping from a set in a domain to a set in a codomain. A function can map a set to itself. For example,
f(x) = x², also notated
f: x ↦ x², is the mapping of all real numbers to all real numbers, or
f: R → R.
The range is the subset of the codomain which the function maps to.
Functions don’t necessarily map to every value in the codomain. Where they do, the range equals the codomain.
There are two sorts of functions. Functions which map to R are known as scalar-valued or real-valued. Functions which map to Rⁿ, with n > 1, are known as vector-valued.
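A quick sketch of the two sorts in Python (these examples are mine, not from the referenced course):

```python
import numpy as np

# Scalar-valued (real-valued): maps into R.
def f(x: float) -> float:
    return x ** 2                               # f: x ↦ x², so f: R → R

# Vector-valued: maps into Rⁿ with n > 1; here n = 2.
def g(t: float) -> np.ndarray:
    return np.array([np.cos(t), np.sin(t)])     # g: R → R²

# f(3) = 9; g(0) = [1, 0]
```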
Ref: Web: Mathworld Wolfram - Eric W. Weisstein.
Ref: Book: Neural Computing: Theory and Practice (1989) - Philip D. Wasserman.
Note: Following course guidelines, we’ll discuss the two most common methods: supervised and unsupervised learning.
In a supervised learning scenario, we start with observations that include known values for the variable we want to predict. We call these labels.
Because we are starting with data that includes the label we are trying to predict, we can train the model using only some of the data and hold the rest back for evaluating our model's performance.
We’ll then use an algorithm to train a model that fits the features to the known label.
As we started with a known label value, we can validate the model by comparing the value predicted by the function to the actual label value that we knew. Then, when we’re happy that the model works, we can use it with new observations for which the label is unknown, and generate new predicted values.
In an unsupervised learning scenario, we don’t have any known label values in our training data set.
We’ll train the model by finding similarities between observations. Once the model is trained, new observations are assigned to a cluster (a group) of observations with similar characteristics.
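The notes don't name a specific algorithm; k-means is one common way to do this kind of clustering. A minimal sketch on made-up 2-D observations:

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Minimal k-means sketch: group observations into k clusters of
    alike characteristics, without using any label values."""
    centroids = X[:k].copy()                 # naive init: first k points
    for _ in range(iters):
        # Assign each observation to its nearest centroid.
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned observations.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

# Two obvious groups of observations:
X = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [8.1, 7.9]])
labels, centroids = kmeans(X, k=2)
# The first two observations share a cluster, as do the last two.
```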
When we need to predict continuous valued output (i.e. a numeric value), we use a supervised learning technique called regression.
Let’s take one male. We want to model the calories he burns while exercising.
First we gather some preliminary data (age: 34, gender: 1, weight: 60, height: 165), then put him on a fitness monitor and capture additional information. We then model the calories burned using features from his exercise, such as heart rate: 134, temperature: 37, and duration: 25.
In this case we know all the features and have a known label value of 231 calories, so we need our algorithm to learn a function that operates on all the male's features to give us a result of 231.
f([34, 1, 60, 165, 134, 37, 25]) = 231
A sample of one person isn’t likely to give a function that generalizes well, so we gather the same data from a large number of participants and then train the model using the bigger set of data.
f([X1, X2, X3, X4, X5, X6, X7]) = Y
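As a sketch, such a function could be fitted with linear least squares; linear regression is just one possible choice of algorithm, and the participant rows below are entirely made up for illustration. (With only four rows and eight coefficients, the fit here reproduces the training data exactly.)

```python
import numpy as np

# Hypothetical training set: each row is [age, gender, weight, height,
# heart_rate, temperature, duration]; y holds the calories burned.
X = np.array([
    [34, 1, 60, 165, 134, 37, 25],
    [28, 0, 55, 160, 120, 36, 30],
    [45, 1, 80, 180, 140, 37, 20],
    [39, 0, 70, 170, 128, 36, 35],
])
y = np.array([231, 195, 260, 244])

# Fit f([X1..X7]) = Y by linear least squares, with an intercept term.
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def f(features):
    return np.append(features, 1.0) @ coef

pred = f([34, 1, 60, 165, 134, 37, 25])   # ≈ 231 for this training row
```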
Now, with a function that can calculate the label (Y), we can plot the values of (Y) calculated for specific (X) feature values on a chart:
And we can interpolate any new values of (X) to predict an unknown (Y).
As we started with data that includes the label we are trying to predict, we can train the model using some of the data and keep the rest for evaluating the model's performance. Then we can use the model to predict f(X) for the evaluation data, and compare the predictions, or scored labels, to the actual labels that we know to be true.
The differences between the predicted and actual labels are called the residuals, and they can tell us something about the level of error in the model.
We can measure the error in the model using root-mean-square error (RMSE) and mean absolute error (MAE).
Both are absolute measures of error in the model. For example, an RMSE value of 5 would mean that the standard deviation of the residuals on our test data is 5 calories. An error of 5 calories seems to indicate a reasonably good model, but if we were instead predicting how long an exercise session takes, an error of 5 hours would make for a very bad model.
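Both metrics are straightforward to compute from the residuals; a sketch with made-up actual and predicted calorie values:

```python
import numpy as np

actual    = np.array([230.0, 200.0, 260.0, 240.0])   # known labels
predicted = np.array([225.0, 204.0, 266.0, 237.0])   # scored labels
residuals = actual - predicted                        # [5, -4, -6, 3]

rmse = np.sqrt(np.mean(residuals ** 2))   # root-mean-square error ≈ 4.64
mae  = np.mean(np.abs(residuals))         # mean absolute error = 4.5
```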
You might want to evaluate the model using relative metrics, to indicate a more general level of error as a relative value between 0 and 1. Relative absolute error (RAE) and relative squared error (RSE) produce metrics where the closer to 0 the error, the better the model.
The coefficient of determination (CoD), or R squared, is another relative metric, but this time a value closer to 1 indicates a good fit for the model.
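The relative metrics compare the model's error against the error of simply predicting the mean of the actual labels; a sketch reusing the same made-up values:

```python
import numpy as np

actual    = np.array([230.0, 200.0, 260.0, 240.0])
predicted = np.array([225.0, 204.0, 266.0, 237.0])
mean_y = actual.mean()

# RAE: sum of absolute errors relative to errors of the mean predictor.
rae = np.sum(np.abs(actual - predicted)) / np.sum(np.abs(actual - mean_y))
# RSE: sum of squared errors relative to squared errors of the mean predictor.
rse = np.sum((actual - predicted) ** 2) / np.sum((actual - mean_y) ** 2)
# R squared: closer to 1 means a better fit.
r2 = 1 - rse   # ≈ 0.95 here
```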
Another kind of supervised learning is called classification.
Classification is the technique that we can use to predict which class or category something belongs to. A simple variant is binary classification, where we predict whether entities belong to one of two classes (true or false).
For example, we’ll take a number of patients in a health clinic, gather some personal details (e.g. age: 23, pregnancies: 1, glucose: 171, BMI: 43.5), run tests, and identify which patients are diabetic and which are not.
We could learn a function that can be applied to the patient features and give us the result 1 for patients that are diabetic:
f([23, 1, 171, 43.5]) = 1
and 0 for patients that aren’t.
Generally, a binary classifier is a function that can be applied to features (X) to produce a (Y) value of 1 or 0. The function won’t actually calculate an absolute value of 1 or 0; instead, it calculates a value between 0 and 1, and we’ll use a threshold value to decide whether the result should be counted as a 1 or a 0.
When using this model to predict values, the resulting value is classed as 1 or 0, depending on which side of the threshold line it falls.
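A sketch of the idea, using the diabetes features from earlier; the weights here are made up, and the logistic function is just one common way to squash a score into the 0-to-1 range:

```python
import numpy as np

def predict(features, weights, bias, threshold=0.5):
    """Score between 0 and 1 via a logistic function, then apply the
    threshold to decide which side the prediction falls on."""
    score = 1.0 / (1.0 + np.exp(-(np.dot(features, weights) + bias)))
    return score, int(score >= threshold)

# Hypothetical trained weights for [age, pregnancies, glucose, BMI]:
w, b = np.array([0.01, 0.3, 0.04, 0.05]), -6.0
score, label = predict([23, 1, 171, 43.5], w, b)
# score ≈ 0.97, above the 0.5 threshold, so this patient is classed as 1
```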
Because classification is a supervised learning technique, we withhold some of the data to validate the model against its known labels.
Cases where the model predicts a 1 for a test observation whose actual label is 1 are considered true positives.
Cases where the model predicts 0, and the actual label is 0, are true negatives.
If the model predicts 1, but the actual label is a 0, that’s a false positive.
If the model predicts 0, but the value is 1, we have a false negative.
The threshold determines how predicted values are classified. In the case of our diabetes model, allowing more false positives, and thereby reducing the number of false negatives, may be better, as more people at risk of diabetes get identified.
The actual number of positives and negatives generated by your model is crucial in evaluating its effectiveness. For that purpose we use a confusion matrix, which is the basis for calculating performance metrics for the classifier.
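Counting the four outcomes gives the confusion matrix, from which the usual metrics follow; a sketch with made-up test results:

```python
import numpy as np

actual    = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # known labels
predicted = np.array([1, 0, 0, 1, 1, 0, 1, 0])   # model output

tp = int(np.sum((predicted == 1) & (actual == 1)))   # true positives:  3
tn = int(np.sum((predicted == 0) & (actual == 0)))   # true negatives:  3
fp = int(np.sum((predicted == 1) & (actual == 0)))   # false positives: 1
fn = int(np.sum((predicted == 0) & (actual == 1)))   # false negatives: 1

accuracy  = (tp + tn) / len(actual)   # 0.75
precision = tp / (tp + fp)            # of predicted positives, share correct
recall    = tp / (tp + fn)            # of actual positives, share found
```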
Ref: Web: MSXDAT262017 - edX.
TO BE CONTINUED
Before you start building your machine learning system, you should:
To generate a learning curve, you deliberately shrink the size of the training set and see how training and validation errors will change as you increase the size.
With smaller training sets, we expect the training error will be low because it is easier to fit to less data. As the training set size grows, your average training set error is expected to grow. Conversely, we expect the average validation error to decrease as the training set size increases.
If your training and validation error curves flatten out at a high error as set sizes increase, then you have a high bias problem. Adding more training data will not (by itself) help much.
On the other hand, high variance problems are indicated by a large gap between the training and validation error curves, together with a low training error. As the training set size increases the curves converge, so in this case adding more training data would help.
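A sketch of generating such a curve on synthetic data, fitting a straight line to deliberately shrunk subsets of the training set:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=200)
y = 3 * X + 2 + rng.normal(0, 1, size=200)   # noisy linear data

X_train, y_train = X[:150], y[:150]
X_val,   y_val   = X[150:], y[150:]

def rmse(pred, actual):
    return np.sqrt(np.mean((pred - actual) ** 2))

train_errs, val_errs = [], []
for m in (10, 50, 100, 150):                 # growing training set sizes
    coef = np.polyfit(X_train[:m], y_train[:m], deg=1)
    train_errs.append(rmse(np.polyval(coef, X_train[:m]), y_train[:m]))
    val_errs.append(rmse(np.polyval(coef, X_val), y_val))
# Here both curves end up near the noise level (≈ 1): no high-bias plateau
# at a large error, and no persistent train/validation gap.
```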
Ref: Web: Intro to Artificial Intelligence - Udacity.