
Unlimited Data: The new ML training paradigm

  • chrislongstaff1
  • Aug 28
  • 3 min read

Introduction

"Paradigm" is a bold word; one of its dictionary definitions is:

A set of theories that explain the way a particular subject is understood at a particular time

We believe that Unlimited Data is going to be a crucial paradigm shift in the way that ML networks are developed and iterated.


Let me explain...


The old way

The traditional training loop was:


  1. DATA PREPARATION


  • Specify

  • Gather

  • Annotate

  • Split (Train/Validation/Test)



  2. TRAINING LOOP


  • Initialize Model

  • Set Hyperparameters (Learning Rate, Epochs, Batch Size...)

  • Repeat for a number of epochs:

    • Inference/Loss calculation on training set

    • Backpropagate to update weights

    • Periodically, use validation data to fine-tune and check for overfitting



  3. OBJECTIVE EVALUATION


  • Use the reserved Test Data for a final, unbiased evaluation



  4. DECISION


  • Mostly OK -> Return to Step 2 (Training Loop)

  • Not Acceptable -> Return to Step 1 (Data Preparation)

  • Acceptable -> Proceed to Deployment



  5. DEPLOY AND MONITOR



The primary issue with this traditional loop is the time and cost of obtaining new data, which make it impractical; new data is sought only in cases of near-total model failure. Engineers tasked with training would rather spend days or weeks tweaking model parameters, reworking pre- and post-processing, or changing the model entirely before having to go back and look for more data. The data obtained in step 1 is essentially a static element of the whole process.
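The traditional loop above can be sketched as a toy run in pure Python. The one-weight linear model, learning rate, and acceptance threshold here are illustrative assumptions, not part of any real project; the point to notice is that the dataset is built once in step 1 and never changes afterwards.

```python
# Minimal sketch of the traditional loop: a static dataset is gathered
# once, split, and the model only ever iterates over that fixed data.
import random

random.seed(0)

# 1. DATA PREPARATION: gather once, annotate, split (static thereafter)
data = [(x, 2.0 * x + random.gauss(0, 0.05)) for x in range(20)]
random.shuffle(data)
train, val, test = data[:12], data[12:16], data[16:]

# 2. TRAINING LOOP: fixed hyperparameters, repeated epochs
w, lr, epochs = 0.0, 0.001, 200

def mse(w, split):
    return sum((w * x - y) ** 2 for x, y in split) / len(split)

for epoch in range(epochs):
    for x, y in train:
        grad = 2 * (w * x - y) * x   # inference/loss, then backpropagate
        w -= lr * grad
    if epoch % 50 == 0:
        val_loss = mse(w, val)       # periodic check for overfitting

# 3. OBJECTIVE EVALUATION: the held-out test set is used exactly once
test_loss = mse(w, test)

# 4. DECISION: acceptable -> deploy; otherwise loop back to step 2 or 1
acceptable = test_loss < 0.1
print(round(w, 2), acceptable)
```

If the decision comes out "not acceptable", the only remedy is to return to step 1 and repeat the slow, expensive gathering and annotation work.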


The new paradigm

Synthera Chameleon changes this by giving engineers rapid access to new data.


  1. DATA PREPARATION


  • Specify and set up the synthetic scenario

  • Gather real data and generate synthetic data

  • Annotate (automatic for synthetic data)

  • Split (Train/Validation/Test), with no synthetic data in the test set



  2. NEW TRAINING LOOP


  • Initialize Model

  • Set Hyperparameters (Learning Rate, Epochs, Batch Size...)

  • Repeat for a number of epochs:

    • Inference/Loss calculation on training set

    • Backpropagate to update weights

    • Periodically, use validation data to fine-tune and check for overfitting

    • Periodically add new synthetic training data if convergence is too slow (the model and scene are already configured)



This paradigm shift is enabled by Synthera Chameleon letting the user create new data in minutes, not the days, weeks, or even months that the old paradigm required when using real-world data.
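The new loop can be sketched by extending the toy run: when convergence toward a validation target is too slow, fresh labelled samples are generated on demand and folded into the training set. Here `generate_synthetic()` is a hypothetical stand-in for a tool such as Synthera Chameleon, and the task, thresholds, and schedule are all illustrative assumptions.

```python
# Sketch of the new loop: the dataset is a dynamic element that grows
# mid-training whenever convergence stalls against the validation target.
import random

random.seed(1)

def generate_synthetic(n):
    """Hypothetical stand-in for a synthetic-data generator: returns
    fresh, automatically annotated samples in 'minutes'."""
    return [(x, 2.0 * x + random.gauss(0, 0.05))
            for x in [random.uniform(0, 10) for _ in range(n)]]

# 1. DATA PREPARATION: initial synthetic training set, real-only val/test
train = generate_synthetic(8)
val = [(x, 2.0 * x) for x in (1.5, 4.5, 7.5)]
test = [(x, 2.0 * x) for x in (2.5, 5.5, 8.5)]   # no synthetic in test

def mse(w, split):
    return sum((w * x - y) ** 2 for x, y in split) / len(split)

# 2. NEW TRAINING LOOP
w, lr = 0.0, 0.0001
for epoch in range(100):
    for x, y in train:
        w -= lr * 2 * (w * x - y) * x            # SGD update
    # NEW STEP: if the validation loss is still above target at a
    # checkpoint, request more synthetic data (scene already configured)
    if epoch % 25 == 24 and mse(w, val) > 0.01:
        train += generate_synthetic(8)

print(round(w, 2), len(train))
```

In the old loop a slow-converging model would force a return to manual data gathering; here the training set simply grows while training continues.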


But hold on: why unlimited? Surely that's not what we need?

First, let's clarify what we mean by 'unlimited data': it's the ability to keep generating new data. New data for existing deployed products to keep them up to date (countering data drift), new data for proofs of concept, new data for new products, and so on. It doesn't mean that you need to use "unlimited" data for any particular model!


So what can unlimited data, the ability to apply large amounts of data to a problem, do for us?

  • Improved Generalization and Reduced Overfitting

  • Edge Case Robustness

  • Improved Accuracy

  • Reduced Bias



So Any Old Data will do?

Unfortunately not. It needs to be relevant data: new, diverse data that teaches the model new semantics, and it must be accurately annotated.


Any new data, whether real, synthetic, augmented, or generative, can improve a model; conversely, it can also make it worse. That's why the unlimited aspect is important: we can add new data, and if it's not the right new data, we can try again.
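This "try again" idea can be sketched as a simple accept/reject rule: fold in a candidate batch, retrain, and keep the batch only if validation performance improves. The toy model, batch names, and thresholds are illustrative assumptions, not a prescribed workflow.

```python
# Accept/reject loop for new data: keep a batch only if it helps the
# model on the (real, held-out) validation set; otherwise discard it
# and generate a different batch.
def fit(train, lr=0.002, epochs=50):
    """Train a one-weight linear model from scratch on 'train'."""
    w = 0.0
    for _ in range(epochs):
        for x, y in train:
            w -= lr * 2 * (w * x - y) * x
    return w

def mse(w, split):
    return sum((w * x - y) ** 2 for x, y in split) / len(split)

val = [(x, 2.0 * x) for x in (1.0, 3.0, 5.0)]
train = [(2.0, 4.0), (4.0, 8.0)]

good_batch = [(6.0, 12.0), (8.0, 16.0)]   # consistent, accurate labels
bad_batch = [(6.0, 0.0), (8.0, 0.0)]      # wrongly annotated data

best = mse(fit(train), val)
for batch in (bad_batch, good_batch):
    candidate = train + batch
    score = mse(fit(candidate), val)
    if score <= best:                      # keep only data that helps
        train, best = candidate, score
    # otherwise: discard the batch and try again with new data

print(len(train))   # only the good batch was kept
```

The badly annotated batch degrades validation loss and is thrown away; because data is unlimited, discarding a batch costs minutes rather than weeks.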


Concluding Thoughts

Unlocking unlimited data makes data a dynamic element within an ML training environment. No longer do your model and its performance need to be constrained by a limited, static dataset.

 
 

We'd love to hear your thoughts on this article
