What is this Domain Gap I keep hearing about?
- chrislongstaff1
- Sep 3
Updated: Sep 30
And does it really matter?
Introduction
Have you ever noticed that the screwdriver you are using isn't a perfect fit for the screw, but it still gets the job done? Or that the fertilizer recommended for your garden isn't perfectly matched to your plants and soil type, but the plants still grow? Those are domain gaps: a difference between "a reference" (the source domain) and the "real-world instance" you actually encounter (the target domain). Of course, the more we close that domain gap (the more precise the screwdriver fit, the better matched the fertilizer to your plants and soil type), the better the outcome will typically be - a screw tightened with minimal damage to the head, a healthy, vibrant plant.
More specifically, in the world of ML, the data we use to train a network needs to be as close as possible to the instances the network will encounter during inference if it is to perform accurately.
What constitutes a Domain Gap
A domain gap is any difference between the source data and the target data. In our case, it is the difference between the training data (source domain) and the inference data (target domain).
Factors that can be considered to cause a domain gap include:
- Environmental differences such as weather, lighting, time of day, geographical features
- Statistical differences such as the distribution of men/women, number of vehicles
- Structural differences such as differences in textures, detail of data
- Camera pipeline differences such as changes in lenses, processing and compression
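As a rough illustration of the statistical factor, the sketch below compares how often each category appears in a training set and in an inference set, using total variation distance (0 means identical distributions, 1 means no overlap). The label lists and function names are made up for the example, and this is only one of many possible ways to put a number on a statistical gap.

```python
from collections import Counter


def category_distribution(labels):
    """Return the share of each category in a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(p, q):
    """Half the sum of absolute differences between two distributions."""
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Hypothetical label counts, chosen only to illustrate the calculation.
train_labels = ["car"] * 70 + ["truck"] * 20 + ["bus"] * 10
inference_labels = ["car"] * 40 + ["truck"] * 40 + ["bus"] * 20

gap = total_variation_distance(
    category_distribution(train_labels),
    category_distribution(inference_labels),
)
print(f"Statistical gap (total variation distance): {gap:.2f}")
```

A value close to 0 suggests the training mix resembles what the model will see in deployment; a larger value suggests the training data should be rebalanced.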
So is the Domain Gap an issue with both Synthetic and real data?
Yes. The key issue with Synthetic data tends to be the structural differences between 3D computer graphics and "real-world" captures. Synthera uses a number of approaches to overcome these differences and minimize the effects of this domain gap:
- Sampling real-world textures and modeling accurately to enhance photorealism
- Provision of a large number of high-quality assets
- Automated tools to randomize the appearance of the scene
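To give a feel for what randomizing the appearance of a scene can look like in general, here is a minimal sketch of the idea, often called domain randomization. It does not represent Synthera's tools: the parameter names, ranges and texture variants are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical appearance parameters and ranges (assumptions for this sketch).
PARAMETER_RANGES = {
    "material_roughness": (0.1, 0.9),
    "color_tint": (-0.2, 0.2),
    "wear_amount": (0.0, 1.0),
}
TEXTURE_VARIANTS = ["brick_a", "brick_b", "concrete", "painted_metal"]


def sample_scene(rng):
    """Draw one random appearance configuration for a scene."""
    scene = {
        name: rng.uniform(low, high)
        for name, (low, high) in PARAMETER_RANGES.items()
    }
    scene["texture_variant"] = rng.choice(TEXTURE_VARIANTS)
    return scene


def generate_scenes(count, seed=0):
    """Generate a repeatable list of randomized scene configurations."""
    rng = random.Random(seed)  # a fixed seed makes the set reproducible
    return [sample_scene(rng) for _ in range(count)]


scenes = generate_scenes(100, seed=42)
print(scenes[0])
```

Seeding the generator keeps the randomized set repeatable, so the same variations can be regenerated later.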
It is certainly true that data captured in the real world is far less likely to suffer from structural issues than Synthetic data (assuming the correct data has been captured). Unfortunately, the same cannot be said for Environmental, Statistical and Camera differences. These tend to be far more significant for real-world data than for Synthetic data, which can address all of these issues easily.
What can we do to fix this issue?
One of the key benefits of Synthera's approach to Synthetic data creation is that it makes the other domain gap factors really easy to address, so the resulting data will typically suffer from far less of a Domain Gap in those areas than real-world data:
- Environmental factors - automated setting (via the Synthera Pass system) of time of day, weather and lighting conditions, predictably and repeatably, allows wide coverage across all environmental conditions. The setting can easily be changed to a new geography/locale, and the same situation can be recreated over and over again.
- Statistical issues - Synthera's unique digital human technology, combined with an easy-to-use UI and automation elements, ensures that the output data can contain precisely the required mix of populations.
- Camera pipelines - these can easily be modelled by Synthera during the Synthetic data creation process, so precise control over noise, lens and processing profiles ensures the data can be matched to the inferencing system.
- Structural differences - real-world data helps ensure that the required structural balance is maintained.
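To make the statistical point concrete, the sketch below turns a target population mix into exact per-group sample counts for a dataset of a given size, using largest-remainder rounding so the counts always add up to the requested total. The group names and shares are hypothetical, and this is a generic illustration rather than a description of Synthera's tooling.

```python
import math


def allocate_counts(target_mix, total):
    """Split `total` samples across groups according to `target_mix` shares."""
    if abs(sum(target_mix.values()) - 1.0) > 1e-9:
        raise ValueError("target shares must sum to 1")
    raw = {group: share * total for group, share in target_mix.items()}
    counts = {group: math.floor(value) for group, value in raw.items()}
    leftover = total - sum(counts.values())
    # Hand the remaining samples to the groups with the largest fractional part.
    by_remainder = sorted(raw, key=lambda g: raw[g] - counts[g], reverse=True)
    for group in by_remainder[:leftover]:
        counts[group] += 1
    return counts


# Hypothetical target mix for a pedestrian dataset.
target_mix = {"adult": 0.55, "child": 0.25, "elderly": 0.20}
plan = allocate_counts(target_mix, 1000)
print(plan)
```

With real-world capture, a mix like this can only be approximated by collecting more footage and filtering it; with generated data the counts can be requested directly.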
So ultimately a combination of Synthetic and Real data will deliver the most robust, accurate model?
In a word, yes! By combining the strengths of Real-world and Synthetic data, your model is most likely to overcome the domain gap issues exhibited by each data type and to align more closely with the inference data presented to it during deployment.
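One simple way to combine the two is to build each training set from a chosen fraction of Synthetic samples and fill the rest with real-world samples. The sketch below assumes samples are just items in two lists; the function name and the 60/40 split are illustrative assumptions, not a recommended ratio, and the right balance will depend on the task.

```python
import random


def mix_datasets(real, synthetic, synthetic_fraction, size, seed=0):
    """Draw a shuffled training set with a fixed share of synthetic samples."""
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be between 0 and 1")
    n_synthetic = round(size * synthetic_fraction)
    n_real = size - n_synthetic
    if n_synthetic > len(synthetic) or n_real > len(real):
        raise ValueError("not enough samples to build the requested mix")
    rng = random.Random(seed)
    mixed = rng.sample(synthetic, n_synthetic) + rng.sample(real, n_real)
    rng.shuffle(mixed)  # avoid all synthetic samples coming first
    return mixed


# Hypothetical sample identifiers standing in for images and labels.
real_samples = [f"real_{i}" for i in range(500)]
synthetic_samples = [f"synthetic_{i}" for i in range(2000)]

training_set = mix_datasets(real_samples, synthetic_samples, 0.6, 1000, seed=7)
print(len(training_set), training_set[:3])
```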
Concluding Thoughts
It's a common misconception to think of Synthetic data as having a domain gap and real-world data as being perfect. In fact, both suffer from gaps, and the gaps in real-world data can be at least as significant as, if not more significant than, those in Synthetic data. The answer ultimately lies in combining the strengths of the two types of data.
