Knowledge Discovery in Databases Utilizing Large Language Models

Chauhan, Satyam

doi:https://dx.dx.doi.org/10.21275/MS241026170018

Abstract: Converting natural language questions into executable SQL commands, known as text-to-SQL parsing, has seen a surge in interest recently. Advanced models like GPT-4 and Claude-2 have demonstrated significant potential in this area. However, existing benchmarks such as Spider and WikiSQL primarily focus on simple database schemas with limited data, highlighting a disconnect between academic research and practical applications. To bridge this gap, we introduce BIRD, a comprehensive benchmark for large-scale database text-to-SQL tasks. BIRD includes 12,751 text-to-SQL pairs across 95 databases, totaling 33.4 GB and covering 37 diverse professional domains. Our focus on real-world database values brings forth new challenges, such as dealing with noisy or incomplete data, aligning natural language questions with external knowledge in the database, and improving SQL efficiency for large datasets. Addressing these issues requires text-to-SQL models to go beyond traditional semantic parsing and to understand database content better. Experimental findings emphasize the critical role of database values in generating accurate SQL queries for extensive data. Even state-of-the-art models like GPT-4 achieve only 54.89% execution accuracy, far from the 92.96% human benchmark, underscoring ongoing challenges in the field. Additionally, our analysis of query efficiency provides insights into crafting optimized SQL queries for industrial use cases. We believe BIRD will play a crucial role in advancing real-world text-to-SQL applications. The leaderboard and source code can be accessed at BIRD Benchmark. As data complexity increases and the demand for rapid data retrieval grows, integrating AI models, especially Large Language Models (LLMs), to assist users in generating SQL queries from natural language is becoming increasingly important.
This research outlines a system in which LLMs are combined with metadata-driven approaches (mapping connections, segment definitions, and business logic) to enable intuitive SQL query generation. The system's setup, benefits, and foundational patterns are demonstrated through test datasets and a Power BI presentation.
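
To illustrate what a metadata-driven approach of this kind might look like, the following is a minimal sketch of how mapping connections, segment definitions, and business logic could be assembled into a prompt before an LLM is asked for SQL. It is not taken from the paper: the `Metadata` class, the `build_prompt` function, the prompt wording, and the example tables and columns (`orders`, `customers`, `lifetime_spend`) are all hypothetical, and no model is actually called.

```python
from dataclasses import dataclass, field


@dataclass
class Metadata:
    """Hypothetical container for the three metadata kinds named in the abstract."""
    joins: list[str] = field(default_factory=list)          # mapping connections
    segments: dict[str, str] = field(default_factory=dict)  # segment name -> SQL predicate
    rules: list[str] = field(default_factory=list)          # business logic in plain words


def build_prompt(question: str, meta: Metadata) -> str:
    """Assemble a prompt that lists the metadata first, then the user's question."""
    lines = ["You translate questions into SQL. Use only the metadata below.", ""]
    lines.append("Join paths:")
    lines += [f"- {j}" for j in meta.joins]
    lines.append("Segment definitions:")
    lines += [f"- {name}: {pred}" for name, pred in meta.segments.items()]
    lines.append("Business rules:")
    lines += [f"- {r}" for r in meta.rules]
    lines += ["", f"Question: {question}", "SQL:"]
    return "\n".join(lines)


# Illustrative metadata only; these tables and columns are assumptions.
meta = Metadata(
    joins=["orders.customer_id = customers.id"],
    segments={"high_value": "customers.lifetime_spend > 10000"},
    rules=["Revenue excludes orders with status = 'cancelled'."],
)
prompt = build_prompt("What is total revenue from high-value customers?", meta)
print(prompt)
```

Keeping the metadata in a structured object rather than in free text means the same definitions could be reused across questions and checked independently of any particular model.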

Keywords: Large Language Models (LLMs), Metadata-Driven Methods, SQL Query Generation, Natural Language Processing (NLP), Information Retrieval, System Architecture, Data Management, Power BI, Artificial Intelligence (AI), Business Logic Integration, Data Visualization, Complex Data Sets, Query Validation, Machine Learning

How to Cite?: Satyam Chauhan, "Knowledge Discovery in Databases Utilizing Large Language Models", Volume 13 Issue 10, October 2024, International Journal of Science and Research (IJSR), Pages: 1886-1894, https://www.ijsr.net/getabstract.php?paperid=MS241026170018, DOI: https://dx.dx.doi.org/10.21275/MS241026170018