Profession Research Case Study

Main Challenges of Machine Learning While Developing a Model

Abstract


This case study examines the main challenges encountered while developing a machine learning model that performs well or poorly against specific requirements. It covers the principal problems that must be addressed before a model can fit the data properly. Selecting a learning algorithm, selecting and preparing data, and examining the resulting fit are the activities in which we face these common engineering challenges.


1. Insufficient Quantity of Training Data

Machine learning models, particularly deep learning ones, require a large volume of training data to learn general patterns; early models trained on small datasets underperformed due to a lack of variation. For example, training a digit classifier on just 100 samples instead of 60,000 (like MNIST) results in poor generalization: the model overfits the tiny dataset and fails on new inputs.
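A minimal sketch of this effect (assuming scikit-learn is installed), using the small built-in digits dataset as a stand-in for MNIST: the same model is trained on 100 examples versus the full training pool, and scored on held-out data.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=500, random_state=42, stratify=y)

def accuracy_with(n_samples):
    """Train on the first n_samples examples and score on the held-out set."""
    model = LogisticRegression(max_iter=5000)
    model.fit(X_train[:n_samples], y_train[:n_samples])
    return model.score(X_test, y_test)

small = accuracy_with(100)    # tiny training set: poor generalization
large = accuracy_with(1200)   # full training pool: much better
print(f"100 samples:  {small:.3f}")
print(f"1200 samples: {large:.3f}")
```

The exact numbers vary with the split, but the gap between the two accuracies illustrates how data quantity drives generalization.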

2. Non-representative Training Data

A large dataset is not helpful if it doesn't reflect the real-world distribution of the problem. In Chapter 2, Géron discusses sampling bias, where a model trained only on a specific subset of the data (like photos of cats from a single breed) fails in the wild. If your self-driving car dataset contains only sunny daytime images, it will likely fail in rain or at night. Representativeness is key to generalization.
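A minimal sketch of sampling bias (assuming scikit-learn): a classifier trained only on digits 0–4 cannot recognize 5–9, no matter how many such examples it sees, because the training set does not represent the deployment distribution.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Biased sample: only classes 0-4 ever appear during training.
mask = y_train < 5
biased = LogisticRegression(max_iter=5000).fit(X_train[mask], y_train[mask])
representative = LogisticRegression(max_iter=5000).fit(X_train, y_train)

print("biased training set:        ", round(biased.score(X_test, y_test), 3))
print("representative training set:", round(representative.score(X_test, y_test), 3))
```

Roughly half the test set belongs to classes the biased model has never seen, so its accuracy is capped near 50% however large its training set grows.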

3. Poor Quality Data

Data with errors, missing values, or mislabeled examples can mislead the model during training. Géron emphasizes in Chapters 2 and 3 the importance of cleaning and validating your data pipeline. For instance, if your spam classifier has emails mislabeled as “not spam” when they clearly are spam, the model may learn the wrong features, reducing accuracy.
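A minimal sketch of label noise (assuming scikit-learn): we corrupt 40% of the training labels at random and compare against a model trained on the clean labels.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

rng = np.random.default_rng(0)
y_noisy = y_train.copy()
flip = rng.random(len(y_noisy)) < 0.4            # pick 40% of the labels
y_noisy[flip] = rng.integers(0, 10, flip.sum())  # replace with random digits

clean = LogisticRegression(max_iter=5000).fit(X_train, y_train)
noisy = LogisticRegression(max_iter=5000).fit(X_train, y_noisy)
print("clean labels:", round(clean.score(X_test, y_test), 3))
print("noisy labels:", round(noisy.score(X_test, y_test), 3))
```

Even a fairly robust linear model loses measurable accuracy, which is why validating labels upstream is usually cheaper than compensating for them downstream.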

4. Irrelevant Features

Features that have no correlation with the target output can confuse the model, increase dimensionality, and lead to overfitting. Géron covers feature engineering in Chapter 2 and explains that choosing the right input variables is essential. For example, using a customer's favorite color to predict loan default might add noise instead of useful signal. Tools like feature importance and correlation matrices help identify such issues.
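A minimal sketch of the feature-importance approach mentioned above (assuming scikit-learn), on synthetic data: the target depends only on the first feature, and the other ten are pure noise, standing in for inputs like "favorite color".

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 500
signal = rng.normal(size=(n, 1))       # the one informative feature
noise = rng.normal(size=(n, 10))       # ten irrelevant features
X = np.hstack([signal, noise])
y = (signal[:, 0] > 0).astype(int)     # target depends on the signal only

forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
for i, imp in enumerate(forest.feature_importances_):
    print(f"feature {i}: importance {imp:.3f}")
```

The informative feature dominates the importance scores while the noise features score near zero, flagging them as candidates for removal.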

5. Overfitting and Underfitting

These are two critical issues related to the model's capacity and the data's complexity. In Chapter 4, Géron uses polynomial regression to illustrate:

  • **Overfitting**: a model too complex for the data learns noise (e.g., a 15-degree polynomial fit to 10 points).
  • **Underfitting**: a model too simple (like linear regression on a quadratic dataset) fails to capture the trend.

Balancing model complexity, regularization, and training data size helps mitigate both problems.
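A minimal NumPy sketch in the spirit of Géron's polynomial-regression illustration (not his exact code): a quadratic signal with noise, fit by polynomials of degree 1 (underfit), 2 (matching capacity), and 15 (overfit), comparing training and test error.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Quadratic signal y = 0.5x^2 + x + 2, plus Gaussian noise."""
    x = rng.uniform(-3, 3, n)
    return x, 0.5 * x**2 + x + 2 + rng.normal(0.0, 1.0, n)

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

results = {}
for degree in (1, 2, 15):  # underfit, good fit, overfit
    coeffs = np.polyfit(x_train, y_train, degree)
    def mse(x, y, c=coeffs):
        return float(np.mean((np.polyval(c, x) - y) ** 2))
    results[degree] = (mse(x_train, y_train), mse(x_test, y_test))
    print(f"degree {degree:2d}: train MSE {results[degree][0]:.2f}, "
          f"test MSE {results[degree][1]:.2f}")
```

The degree-1 model has high error on both sets (underfitting), while the degree-15 model drives training error down yet typically does worse on the test set than degree 2 (overfitting).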

Tags

  • Machine Learning
  • Data
  • Data Analytics
  • Model
