Recently my work has required me to understand some common machine learning methods, so I’ll just write a post about them…
- Decision Trees: Recursively split the data on attribute values, one yes/no (binary) decision per level, so that an input object follows this series of decisions down to the leaf that gives its class.
- Pro: Easy to understand and explain
- Con: Sensitive to missing data; prone to overfitting; each split looks at a single attribute, so correlations between attributes are not modeled directly
- Random Forest Algorithm: Based on decision trees. First, draw multiple random subsets from the training data (typically by sampling with replacement), generate a decision tree for each subset, then feed the sample to be classified into every tree, and finally decide the result by majority vote.
- Pro: Less prone to overfitting than a single tree, has good noise resistance, insensitive to outliers
- Con: Poor interpretability; attributes with many distinct values can have a disproportionate influence on the result
- Lasso Regression: An improvement on ordinary linear regression. It reduces model complexity by adding an L1 penalty on the absolute values of the coefficients, which shrinks some of them to exactly zero.
- Pro: Strong interpretability, reduces overfitting
- Con: Too strong a penalty can cause underfitting
- Logistic Regression: Models the log of the odds (the probability ratio) of the dependent variable's classification result as a linear function of the input independent variables.
- Pro: Highly interpretable
- Con: Essentially a generalized linear model, not suitable for some complex situations
- Support Vector Machine (SVM): Separates two classes of points on a plane with a straight line, chosen so that the distance from the line to the closest points of each class (the margin) is maximized. Extending this problem to multidimensional space gives the support vector machine.
- Pro: Suitable for small sample sizes, generalizes well
- Con: Poor interpretability, sensitive to missing data
- K-Nearest Neighbors (KNN): Places new data in a multidimensional space with known classified points. If the majority of the K nearest points belong to a certain class, this point belongs to that class.
- Pro: Simple theory and easy implementation
- Con: Large computational load, since distances are recalculated for every classification; significantly affected by class imbalance
- Naive Bayes: Uses Bayes’ formula to calculate the probability of being in a certain category.
- Pro: Simple implementation and only requires a small amount of training data. Fast during large-scale calculations.
- Con: Assumes that all attributes are conditionally independent given the category, which rarely holds exactly in real data.
- K-Means Algorithm: A clustering algorithm in unsupervised learning. First select K initial points from the data as centers, calculate the distance from every point to each of these K centers, and assign each point to the nearest one. After this first round of grouping, recalculate the center of each group as the average of its points, then recalculate the distance from every point to the K new centers and reassign each point to the nearest. Repeat this process until the grouping stabilizes (or a maximum number of rounds is reached), at which point it stops.
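- Pro: Simple to understand and implement
- Con: Sensitive to outliers and to the choice of initial points, because each center is an average; large computational load when dealing with large amounts of data, poor effect with small amounts of data; easily affected by imbalanced group sizes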
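- AdaBoost Algorithm: Trains multiple weak classifiers (multiple models) in sequence, giving more weight to the samples that earlier classifiers got wrong, and combines them by weighted vote to achieve better classification results.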
- Pro: Combines multiple basic algorithms and considers the weights of different classifiers
- Con: Training is time-consuming
- Neural Network (NN): A relatively complex method, and the basis of deep learning (DL). Personally, I think of the neurons in a neural network as loosely similar to decision trees: each one produces a result based on its input data (in practice, a weighted sum passed through an activation function), then passes this result to downstream neurons. Each downstream neuron takes the upstream information and derives its own result before passing it on, and finally the last layer's results are summarized into a classification result.
- Pro: High classification accuracy, strong robustness and fault tolerance
- Con: Large computational load (especially for deep networks), poor interpretability
- Markov Chain (Markov Model): I didn’t understand…
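
The KNN and K-Means descriptions above are procedural enough to sketch in code. What follows is only a minimal pure-Python sketch, not a reference implementation: the function names `knn_classify` and `k_means`, the use of Euclidean distance, and the `initial_centers` and `max_rounds` parameters are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def knn_classify(point, labeled_points, k=3):
    """Return the most common class among the k labeled points nearest to `point`.

    `labeled_points` is a list of (coordinates, label) pairs.
    Euclidean distance is an assumption; other metrics are possible.
    """
    nearest = sorted(labeled_points, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def k_means(points, k, initial_centers=None, max_rounds=100):
    """Group `points` into k clusters; returns (centers, assignments).

    `assignments[i]` is the index of the center that points[i] belongs to.
    """
    # Select K initial points from the data unless the caller supplies them.
    if initial_centers is None:
        centers = random.sample(points, k)
    else:
        centers = list(initial_centers)

    assignments = None
    for _ in range(max_rounds):
        # Assign every point to its nearest center.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # the grouping has stabilized
        assignments = new_assignments

        # Move each center to the average of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old center if a group ends up empty
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centers, assignments
```

Note that `knn_classify` does no training at all and sorts every known point on each call, which is exactly the "recalculates every time" cost listed under KNN. `k_means` can give different groupings for different starting points, so in practice it is often run several times.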
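Main content excerpted from:
轻松看懂机器学习十大常用算法 (Easily Understand the Ten Most Common Machine Learning Algorithms)
机器学习算法优缺点及其应用领域 (Pros and Cons of Machine Learning Algorithms and Their Application Areas)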
Of course… my explanations are not as simple and clear as those in the video. It seems that learning these days requires Bilibili…
Video Source: if you like it, please go follow the uploader and give it the triple combo (like, coin, favorite)!