Event Details
One of the fundamental challenges for machine learning (ML) teams is data quality, or, more
accurately, the lack of it. Your ML solution is only as good as the data that you train it
on, and therein lies the rub: Is your data of sufficient quality to train a trustworthy system? If
not, can you improve your data so that it is? You need a collection of data quality “best
practices”, but what is “best” depends on the context of the problem that you face. Which of
the many available strategies are best for you?
This presentation compares over a dozen traditional and agile data quality techniques on five
factors: timeliness of action, level of automation, directness, timeliness of benefit, and difficulty
to implement. The data quality techniques explored include data cleansing, automated regression
testing, data guidance, synthetic training data, database refactoring, data stewards, manual
regression testing, data transformation, data masking, and data labeling, among others. When you
understand what data quality techniques are available to you, and understand the context in
which they’re applicable, you will be able to identify the collection of data quality techniques
that are best for you.