Query plan is widely used as input in machine learning for databases (ML4DB) research, with query plan representation as a critical step. However, existing studies typically focus on one task, and propose a novel design to represent query plans along with a ML4DB framework, without comparing with other representation methods designed for a different task. This raises a critical question: How do we select a query plan representation method in a ML4DB system? To address this question, we perform a comparative study on ten representation methods on three distinct ML4DB tasks: cost estimation, index selection and query optimization. Our extensive experiments not only verify the interchangeability of representation methods across different tasks, but also identify consistently high-performing models. Further, we dissect the query plan representation into two core components: feature encoding and tree model, and evaluate the impact of design choices for each in different scenarios. Our results show that the findings for tasks optimizing absolute errors are different from findings for tasks optimizing relative errors. Some findings challenge widely-held assumptions, i.e., one finding shows that tree models do not significantly impact cost estimation results, but only play a significant role to optimize relative performance. Practical guidelines and future directions are provided based on the findings of the study.
Supplementary notes can be added here, including code and math.