
Machine Learning (overview)

Updated: Jul 11, 2023

The simplest definition of machine learning is the use of algorithms to analyze data, learn from it, and make predictions about the world. Machine learning is founded on algorithms capable of learning from data without being explicitly programmed with rules, and it can easily work with large amounts of data, unstructured data, and non-linear data.

Supervised learning means learning a function that maps the input (X) to the output (Y). The goal is to train the function to the point where it can accurately predict the output variable (Y) for every fresh batch of input data (X) that we receive.

Supervised learning problems fall into two groups: regression and classification. A regression problem arises when the target variable Y is continuous; a classification problem arises when it is categorical.

Unsupervised learning derives conclusions from datasets that contain input data but no labeled responses. It is employed to discover the data's underlying structure.

Reducing the number of dimensions in the data is called dimension reduction, and it closely resembles compression: the goal is to simplify the data as much as feasible while keeping as much of the pertinent structure intact as possible. Clustering groups similar observations together so that the data can be summarized by fewer representative examples. This part ends with reinforcement learning, in which a computer learns by interacting with its environment, and deep learning, where "sophisticated algorithms address such highly complex tasks as image classification, face recognition, speech recognition and natural language processing".

There is the concept of overfitting, which happens when a model fits the training data too closely and, as a result, struggles to generalize to new data. Then there is the concept of generalization: the model's capacity to adapt to new, unseen data. To better understand these concepts, it helps to explain how the data set is divided:

- Training set: used to fit the model

- Validation set: used to tune the model and choose between candidate models

- Test set: used to estimate the final model's performance on unseen data
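The three-way split above can be sketched in plain Python with a shuffled index list; the 60/20/20 ratios and the fixed seed are illustrative assumptions, not values from the text:

```python
import random

def split_dataset(data, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle indices, then carve out training / validation / test sets."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)  # shuffle so the split is random
    n_train = int(len(data) * train_frac)
    n_val = int(len(data) * val_frac)
    training = [data[i] for i in idx[:n_train]]
    validation = [data[i] for i in idx[n_train:n_train + n_val]]
    test = [data[i] for i in idx[n_train + n_val:]]
    return training, validation, test

data = list(range(100))
tr, va, te = split_dataset(data)
```

Every observation lands in exactly one of the three sets, so the test set stays untouched until the final evaluation.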

In this case, overfitting shows up as a small error on the training data and a large error on the test data. Furthermore, to understand overfitting in terms of errors, scientists decompose the error into:

- Bias error: comes from wrong assumptions built into the algorithm

- Variance error: comes from the model's sensitivity to fluctuations in the data

- Base error: caused by randomness (noise) in the data itself.

There are two main ways to avoid overfitting: the first is to prefer simple algorithms and limit the number of features; the second is to avoid sampling bias.

Supervised machine learning models "are trained using labeled data", and depending on the nature of the target variable, they can be divided into two types: regression and classification.

We use penalized regression when we need to reduce the number of features in prediction problems. The most important penalized regression is called LASSO, which stands for Least Absolute Shrinkage and Selection Operator. For classification (and sometimes regression) there is also the Support Vector Machine (SVM), a "linear classifier that determines the hyperplane that optimally separates the observations into 2 sets of data points". In practice, the SVM maximizes the margin between the two sets, which makes a good prediction more likely.
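The shrinkage-and-selection idea behind LASSO can be sketched with a tiny coordinate-descent loop in plain Python; the toy data and the penalty value alpha=0.5 are illustrative assumptions. The key ingredient is the soft-thresholding step, which is what pushes the coefficients of weak features exactly to zero:

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values inside [-alpha, alpha] become 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimize (1/2n)||y - Xw||^2 + alpha*||w||_1, one coefficient at a time."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: prediction error ignoring feature j
            r = [y[i] - sum(X[i][k] * w[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / z
    return w

# Toy data: y depends only on the first feature; the second is pure noise.
X = [[1.0, 0.3], [2.0, -0.2], [3.0, 0.1], [4.0, -0.4]]
y = [2.0, 4.0, 6.0, 8.0]
w = lasso_coordinate_descent(X, y, alpha=0.5)
```

On this toy data, the noise feature's coefficient is driven exactly to zero, which is precisely the feature-reduction behavior the text describes.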

Then there is K-Nearest Neighbors (KNN), used mostly for classification and sometimes for regression. The rationale of KNN is to classify a new observation by finding the k most similar points (its k nearest neighbors) in the data set.
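The KNN idea above fits in a few lines of plain Python: measure the Euclidean distance from a new point to every training observation and take the majority label among the k closest. The toy data and k=3 are illustrative assumptions:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbors."""
    # Distance from the query to every training observation
    dists = [(math.dist(query, x), label) for x, label in zip(train_X, train_y)]
    dists.sort(key=lambda d: d[0])
    # Majority vote among the k closest points
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

train_X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
train_y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(train_X, train_y, [2, 2]))  # → A
```

A point near the cluster of "A" examples is labeled "A"; one near the "B" cluster is labeled "B".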

In the case where a data set has too many dimensions and we want to reduce them, we can use Principal Component Analysis (PCA). PCA works by keeping the lower-order principal components and discarding the higher-order ones. The principal components are defined by the eigenvectors of the data, and the variance explained by each component is given by the corresponding eigenvalue.

When we want to group observations, this is called clustering. If the observations within a cluster are very similar to each other, there is cohesion; if the clusters are well distinguished from one another, there is separation. The similarity of two points can be measured by the straight line that links them, called the Euclidean distance: the longer this line, the more different the two points are.

Finally, there are other types of clustering. Hierarchical clustering builds a nested sequence of clusters from the same underlying observations. Agglomerative hierarchical clustering "begins with each observation being its own cluster. Then, the algorithm finds the two closest clusters, defined by some measure of distance, and combines them into a new, larger cluster", while divisive hierarchical clustering "starts with all observations belonging to a single cluster. The observations are then divided into two clusters based on some measure of distance. The algorithm then progressively partitions the intermediate clusters into smaller clusters until each contains only one observation."
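The agglomerative procedure in the quote can be sketched in plain Python: every observation starts as its own cluster, and the two closest clusters are merged until the desired number remains. The single-linkage distance (minimum pairwise Euclidean distance) and the toy points are illustrative assumptions:

```python
import math

def agglomerative(points, n_clusters):
    """Merge the two closest clusters until only n_clusters remain."""
    clusters = [[p] for p in points]  # each observation starts as its own cluster
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest minimum pairwise distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # combine into a larger cluster
        del clusters[j]
    return clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
clusters = agglomerative(points, 2)
print(clusters)
```

The three points near the origin end up in one cluster and the two points near (10, 10) in the other, mirroring the "closest clusters merge first" behavior described above.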
