Paper on database generation accepted at SIGMOD 2022
Congratulations to Yang Jingyi and Wu Peizhi for publishing the paper “SAM: Database Generation from Query Workloads with Supervised Autoregressive Models” at SIGMOD 2022.
With the prevalence of cloud databases, database users are increasingly reliant on the cloud database providers to manage their data. It becomes a challenge for cloud providers to benchmark different DBMS for a specific database instance without having access to the underlying data. One viable solution is to leverage a query workload, which contains a set of queries and the corresponding cardinalities, to generate a synthetic database with similar query performance. Existing methods for database generation with cardinality constraints, however, can only handle very small query workloads due to their high complexity and encounter challenges when handling join queries. In this work, we propose SAM, a supervised deep autoregressive model-based method for database generation from query workloads. First, SAM is able to process large-scale query workloads efficiently as its complexity is linear in the size of the query workload, the number of attributes and the attribute domain size. Second, we develop algorithms to obtain unbiased samples of base relations from the deep autoregressive model and assign join keys in a way that accurately recovers the full outer join of the target database. Comprehensive experiments on real-world datasets demonstrate that SAM is able to efficiently generate a high-fidelity database that not only satisfies the input cardinality constraints, but also is close to the target database.