All discoveries made by humans in any discipline, such as the physical sciences,
biological sciences, social sciences, and engineering, are based on past experience or
collected data. In other words, human beings solve problems by drawing on past
experience or collected data. Since one aspect of Artificial Intelligence, as
pointed out in chapter 1, unit 1, is to design systems that act like humans, computer
systems should be designed to solve problems the way human beings do, that is, by
using past experience or previously stored data. The data are
called training data because they are used to train the computer to learn the trend or
pattern in them. Having learnt that pattern as a rule, the computer applies the rule
to solve subsequent problems posed by test data that have the same
structure as the training data. The vast amount of data in machine learning is divided
into two sets, which are the training set and the test set. The training set is used to
develop a model, while the test set is used to evaluate the performance of the model.
A data splitting technique in machine learning is a method used to split the
data into a training set and a test set. The aim is to guard against poor generalization,
i.e., overfitting or overtraining, by evaluating the model on data it did not see during
training. Using more training data improves the accuracy of the
model, while using more test data improves the accuracy of the error estimate. A
training/test ratio of 70:30 is often considered appropriate. Machine learning,
therefore, is an aspect of Artificial Intelligence that deals with the design of systems
that use a large set of data, called training data, to solve a particular problem. Machine
learning is a broad area in Artificial Intelligence, which will be considered in the
various units of this chapter.
Keywords: Classification algorithm, Data pre-processing, Decision tree
algorithm, Feature engineering, K-means clustering algorithm, Learner’s input,
Learner’s output, Naive Bayes algorithm, Regression algorithm.
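
The 70:30 training/test split described above can be illustrated with a minimal
sketch. It assumes the data fit in a Python list and that a random shuffle is an
acceptable way to assign records to each set. The function name train_test_split,
its parameters, and the 100-record toy dataset are hypothetical, chosen only for
this illustration.

```python
import random


def train_test_split(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of records and return (training set, test set)."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(records)               # copy, so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical toy dataset: 100 records identified by the integers 0..99.
data = list(range(100))
train_set, test_set = train_test_split(data)
print(len(train_set), len(test_set))  # prints "70 30"
```

The model would be fitted on train_set only, and test_set would be held back to
estimate its error. Changing train_fraction trades model accuracy against the
reliability of that error estimate.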