Unlimited Data: The new ML training paradigm
- chrislongstaff1
- Aug 28
- 3 min read
Introduction
"Paradigm" is a bold word. One of its dictionary definitions is:
A set of theories that explain the way a particular subject is understood at a particular time.
We believe that Unlimited Data is going to be a crucial paradigm shift in the way that ML networks are developed and iterated.
Let me explain...
The old way
The traditional training loop looked like this:
DATA PREPARATION
Specify
Gather
Annotate
Split (Train/Validation/Test)
TRAINING LOOP
Initialize Model
Set Hyperparameters (Learning Rate, Epochs, Batch Size...)
Repeat for a number of epochs:
Inference/Loss calculation on training set
Backpropagate to update weights
Periodically, use validation data to tune hyperparameters and check for overfitting
OBJECTIVE EVALUATION
Use the reserved Test Data for a final, unbiased evaluation
DECISION
Mostly OK -> Return to Step 2 (Training Loop)
Not Acceptable -> Return to Step 1 (Data Preparation)
Acceptable -> Proceed to Deployment
DEPLOY AND MONITOR
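The traditional loop above can be sketched in a few lines. This is a minimal toy illustration, assuming a one-parameter model y = w * x trained by plain gradient descent; all names, numbers, and the synthetic dataset are illustrative, not any particular framework's API.

```python
import random

# DATA PREPARATION: gather a fixed, static dataset once (y = 2x + noise)
random.seed(0)
data = [(x, 2.0 * x + random.gauss(0, 0.1)) for x in [i / 10 for i in range(100)]]
random.shuffle(data)
train, val, test = data[:70], data[70:85], data[85:]  # the classic split

def mse(w, split):
    """Mean squared error of the model y = w * x on a data split."""
    return sum((w * x - y) ** 2 for x, y in split) / len(split)

# TRAINING LOOP
w = 0.0                  # initialize model
lr, epochs = 0.01, 200   # set hyperparameters

for epoch in range(epochs):
    # inference / loss gradient on the training set
    grad = sum(2 * (w * x - y) * x for x, y in train) / len(train)
    w -= lr * grad       # backpropagate (here: a single gradient step)
    if epoch % 50 == 0:
        val_loss = mse(w, val)  # periodically check validation loss

# OBJECTIVE EVALUATION: the reserved test set gives the final, unbiased number
test_loss = mse(w, test)
```

Note that `train` is created once and never touched again: in this loop, data is a static input, and only the model and hyperparameters ever change.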
The primary issue with this traditional loop is that the time and cost of obtaining new data made it impractical: new data was sought only in cases of near-total model failure. Engineers tasked with training would rather spend days or weeks tweaking model parameters, reworking pre- and post-processing, or swapping the model entirely before going back to look for more data. The data obtained in step 1 was effectively a static element of the whole process.
The new paradigm
Synthera Chameleon changes this by giving engineers rapid access to new data.
DATA PREPARATION
Specify and Set up Synthetic scenario
Gather and generate Synthetic data
Annotate (Automatic for Synthetic)
Split (Train/Validation/Test) (No Synthetic in test set)
NEW TRAINING LOOP
Initialize Model
Set Hyperparameters (Learning Rate, Epochs, Batch Size...)
Repeat for a number of epochs:
Inference/Loss calculation on training set
Backpropagate to update weights
Periodically, use validation data to tune hyperparameters and check for overfitting
Periodically, add new synthetic training data if convergence is too slow (the model and scene are already configured)
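The new loop can be sketched the same way. This is a hedged illustration, not real product code: `generate_synthetic` below is a hypothetical stand-in for an already-configured synthetic-data scene that returns fresh, auto-annotated samples on demand.

```python
import random

random.seed(1)

def generate_synthetic(n):
    """Stand-in generator: fresh (x, y) pairs with y = 2x + noise, labels for free."""
    return [(x, 2.0 * x + random.gauss(0, 0.1))
            for x in [random.uniform(0, 10) for _ in range(n)]]

train = generate_synthetic(20)            # start deliberately small
val = generate_synthetic(30)
test = [(x, 2.0 * x) for x in range(10)]  # test set stays synthetic-free

def mse(w, split):
    return sum((w * x - y) ** 2 for x, y in split) / len(split)

w, lr = 0.0, 0.005
prev_val = float("inf")
for epoch in range(300):
    grad = sum(2 * (w * x - y) * x for x, y in train) / len(train)
    w -= lr * grad
    if epoch % 25 == 0:
        v = mse(w, val)
        # convergence too slow? pull in more data instead of only tweaking knobs
        if v > 0.9 * prev_val:
            train += generate_synthetic(20)
        prev_val = v
```

The key difference from the traditional sketch is that `train` grows during training: when the validation loss plateaus, the loop requests more data rather than forcing the engineer back to a manual data-gathering step.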
This paradigm shift is possible because Synthera Chameleon lets the user create new data in minutes, not the days, weeks, or even months that the old paradigm required when using real-world data.
But hold on: why unlimited? Surely that's not what we need?
First, let's clarify what we mean by 'unlimited data': it's the ability to keep generating new data. New data for existing deployed products to keep them up to date (countering data drift), new data for proofs of concept, new data for new products, and so on. It doesn't mean that you need to use "unlimited" data for any particular model!
So what can unlimited data do for us? Being able to apply large amounts of data to a problem can:
Improve Generalization and Reduce Overfitting
Increase Edge-Case Robustness
Improve Accuracy
Reduce Bias
So Any Old Data will do?
Unfortunately not. It needs to be relevant data: new, diverse data that teaches the model new semantics, and it must be accurately annotated.
Any new data, whether real, synthetic, augmented, or generative, can improve a model; conversely, it can also make it worse. That's why the unlimited aspect is important: we can add new data, and if it's not the right new data, we can try again.
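The "try again" idea can be made concrete with a simple acceptance gate: keep a candidate batch of new data only if it improves a held-out validation metric, otherwise discard it and generate another. This is a toy sketch under stated assumptions; `make_batch` and `fit` are hypothetical stand-ins, and the `noisy` flag simulates irrelevant or badly labelled data.

```python
import random

random.seed(2)

def make_batch(n, noisy=False):
    """Candidate training data; noisy=True simulates irrelevant/bad data."""
    if noisy:
        return [(random.uniform(0, 10), random.uniform(-20, 20)) for _ in range(n)]
    return [(x, 2.0 * x + random.gauss(0, 0.1))
            for x in [random.uniform(0, 10) for _ in range(n)]]

def fit(train, epochs=200, lr=0.005):
    """Tiny one-parameter model y = w * x trained by gradient descent."""
    w = 0.0
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in train) / len(train)
        w -= lr * grad
    return w

def mse(w, split):
    return sum((w * x - y) ** 2 for x, y in split) / len(split)

val = make_batch(50)
train = make_batch(20)
best = mse(fit(train), val)

for trial in range(4):
    candidate = make_batch(20, noisy=(trial % 2 == 0))  # some batches are bad
    score = mse(fit(train + candidate), val)
    if score < best:        # keep only data that actually helps
        train += candidate
        best = score
    # otherwise: discard the batch and try again -- the "unlimited" part
```

Because rejected batches cost only generation time, the validation score can only improve: bad data is filtered out before it ever degrades the deployed model.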
Concluding Thoughts
Unlocking unlimited data makes data a dynamic element within the ML training environment. No longer does your model and its performance need to be constrained by a limited, static dataset.
