AWS launched a new service today, Amazon SageMaker Data Wrangler, that makes it easier for data scientists to prepare their data for machine learning training. In addition, the company is also launching SageMaker Feature Store, available in the SageMaker Studio, a new service that makes it easier to name, organize, find and share machine learning features.
AWS is also launching Sagemaker Pipelines, a new service that’s integrated with the rest of the platform and that provides a CI/CD service for machine learning to create and automate workflows, as well as an audit trail for model components like training data and configurations.
As AWS CEO Andy Jassy pointed out in his keynote at the company’s re:Invent conference, data preparation remains a major challenge in the machine learning space. Users have to write their queries and the code to get the data from their data stores first, then write the queries to transform that code and combine features as necessary. All of that is work that doesn’t actually focus on building the models but on the infrastructure of building models.
Data Wrangler comes with over 300 pre-configured data transformation built-in, that help users convert column types or impute missing data with mean or median values. There are also some built-in visualization tools to help identify potential errors, as well as tools for checking if there are inconsistencies in the data and diagnose them before the models are deployed.
All of these workflows can then be saved in a notebook or as a script so that teams can replicate them — and used in SageMaker Pipelines to automate the rest of the workflow, too.
It’s worth noting that there are quite a few startups that are working on the same problem. Wrangling machine learning data, after all, is one of the most common problems in the space. For the most part, though, most companies still build their own tools and as usual, that makes this area ripe for a managed service.