Data Democratisation with Deep Learning: The Anatomy of a Natural Language Data Interface
WSDM ’23, February 27-March 3, 2023, Singapore, Singapore George Katsogiannis-Meimarakis, Mike Xydas, & Georgia Koutrika
2 TUTORIAL OUTLINE

2.1 The Text-to-SQL Problem
We will first introduce the problem at hand, present its main challenges, and analyze their impact on a Text-to-SQL system.

A Text-to-SQL system translates a Natural Language Query (NLQ) over a relational database (RDB) to an equivalent SQL query, which is valid for this RDB. In this way, the user is shielded from the particularities of the database, and can, in principle, ask any queries over the data using natural language.

The Text-to-SQL problem hides several challenges. One of the most important is the inherent ambiguity of Natural Language Queries. For instance, lexical ambiguity, where a single word has multiple meanings (e.g., “Paris” can be a city or a person), and syntactic ambiguity, where a sentence has multiple interpretations (e.g., “Find all German movie directors” can mean “directors that have directed German movies” or “directors from Germany”). On the other hand, several challenges arise from the DB and SQL part of the problem. A system must be robust to the vocabulary gap problem, where the DB might use different vocabulary than the one used by the user (e.g., a user might ask for “singers”, but the DB might store them as “artists”). Furthermore, some user questions may require the system to generate complex SQL queries (e.g., the NLQ “Find the highest rated movie” might require a nested SQL query).
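As a concrete illustration of the last point, the superlative in “Find the highest rated movie” cannot be answered by a flat SELECT over a single table; one common target is a nested query that first computes the maximum rating. A minimal sketch, using an invented movies(title, rating) table:

```python
import sqlite3

# Hypothetical movies table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, rating REAL)")
conn.executemany("INSERT INTO movies VALUES (?, ?)",
                 [("Movie A", 8.7), ("Movie B", 9.2), ("Movie C", 7.5)])

# NLQ: "Find the highest rated movie" -- the superlative forces a
# nested (or aggregated) SQL query rather than a simple SELECT.
nested_sql = """
    SELECT title FROM movies
    WHERE rating = (SELECT MAX(rating) FROM movies)
"""
print(conn.execute(nested_sql).fetchall())  # [('Movie B',)]
```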
Two main datasets have enabled the bloom in research for this task during the past years. On the one hand, WikiSQL [19] contains 25,000 tables from Wikipedia pages and over 80,000 natural language and SQL question pairs. WikiSQL questions are simple, since they are directed to a single table and not to a relational database; hence, the proposed task is simplified. On the other hand, the Spider dataset [18] contains 200 relational databases from 138 different domains, along with over 10,000 natural language questions and over 5,000 complex SQL queries. Its queries cover a wide range of complexity, from very simple to very hard, using all the common SQL elements and nesting.
Given the abundance of existing deep learning approaches for the Text-to-SQL problem, we will present a fine-grained taxonomy of these systems, and highlight the main characteristics as well as the advantages and shortcomings of each approach. Below, we highlight some important dimensions of this taxonomy.
• Schema Linking: The process of discovering the connections between words in the NLQ and the DB elements (tables, columns, values) they refer to. This first step produces important information that can help the system create the correct SQL query.
• Input Encoding: This dimension examines how the various inputs to the task (e.g., NLQ, DB schema) are processed so that they can be effectively used by the neural part of the system.
• Decoder Output: This dimension refers to the different approaches that a neural network can use to generate a SQL query.
• Neural Training: Besides training a network from scratch, there are novel approaches that show better performance in some cases, such as the incorporation of transfer learning, the use of auxiliary training tasks, or pre-training specific components of the network before starting the standard part of the training.
• Output Refinement: Having trained a neural model, there are additional techniques to ensure that the system avoids producing error-yielding SQL queries. Such techniques are mostly based on designing rules that restrict the model’s output space, or on executing queries while they are being constructed, to ensure their correctness.
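As a toy illustration of the first dimension, schema linking can be sketched as string matching between NLQ tokens and schema elements, with a hand-written synonym map standing in for the vocabulary-gap handling that real systems learn; all names below are invented:

```python
# A deliberately naive schema-linking sketch: match NLQ tokens to DB
# elements by exact name or a small synonym map. Real systems use
# learned, contextual matching; the schema and synonyms are invented.
SCHEMA = {"artists": ["name", "country"], "albums": ["title", "year"]}
SYNONYMS = {"singers": "artists", "records": "albums"}  # vocabulary gap

def link_schema(nlq: str) -> dict:
    links = {}
    for token in nlq.lower().rstrip("?.").split():
        # Try tables first (directly or through a synonym)...
        table = SYNONYMS.get(token, token if token in SCHEMA else None)
        if table:
            links[token] = table
            continue
        # ...then columns of any table.
        for tbl, cols in SCHEMA.items():
            if token in cols:
                links[token] = f"{tbl}.{token}"
    return links

print(link_schema("List the singers and their country"))
# {'singers': 'artists', 'country': 'artists.country'}
```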
Based on the presented taxonomy, we will study important Text-to-SQL systems in greater depth to offer a concrete understanding of the different approaches proposed. Seq2SQL [19] and SQLNet [17] were the first neural networks specifically created for the Text-to-SQL problem. The emergence of pre-trained language models (PLMs) such as BERT [2] changed the landscape. We will present systems such as SQLova [4] and HydraNet [9] that heavily rely on BERT. We will also focus on complex systems such as RAT-SQL [14], and we will analyse how they generate complex SQL queries such as the ones in the Spider benchmark. Finally, we will present the latest innovations in the area, such as the PICARD [13] decoding technique, which has allowed seq-to-seq PLMs to achieve the highest score on Spider, overturning previous beliefs that seq-to-seq models are not adequate for the Text-to-SQL task. We will take advantage of the taxonomy to highlight the differences and commonalities between these systems, as well as the best design choices based on the requirements of a Text-to-SQL system.
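To make the output-refinement idea concrete, the following sketch shows its simpler, post-hoc flavour: executing candidate queries from a hypothetical beam against the database and discarding the error-yielding ones. This is not the PICARD algorithm itself, which instead rejects invalid continuations token by token during decoding; the schema and candidates here are invented.

```python
import sqlite3

# A much-simplified take on output refinement: execute each candidate
# query and keep the first one that the database accepts, discarding
# error-yielding SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artists (name TEXT, country TEXT)")

candidates = [
    "SELECT nam FROM artists",   # invalid column
    "SELECT name FROM artist",   # invalid table
    "SELECT name FROM artists",  # valid
]

def first_executable(queries):
    for sql in queries:
        try:
            conn.execute(sql)
            return sql
        except sqlite3.Error:
            continue
    return None

print(first_executable(candidates))  # SELECT name FROM artists
```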
2.2 The SQL-to-Text Problem
The SQL-to-Text problem hides several challenges. First and foremost, such a system should generate fluent and human-like explanations of SQL queries. Another challenge is correctly identifying the DB domain and using the appropriate vocabulary. For example, the MAX aggregation function must be translated in a different way, depending on the context and the attribute on which it is applied. In a DB containing sports data, MAX(lap_time) refers to the “slowest lap time”, while in a database containing products, MAX(price) refers to the “highest product price”. Finally, the complexity of SQL poses additional challenges in tackling this problem. As we discussed previously, simple NLQs might map to complex SQL queries. Similarly, a SQL-to-Text system must be able to understand the underlying meaning of complex queries in order to express them with a simple, condensed NL explanation.

In contrast to Text-to-SQL, this inverse problem has seen relatively little attention from the research community. More specifically, template-based systems [6] can produce very accurate explanations of SQL queries, but require a lot of manual effort to create new templates for each new DB the system must work on. The biggest caveat of template-based systems is that they often generate “robotic” and unnatural explanations. A simple example can be seen in the following SQL query: SELECT p.title FROM projects p WHERE p.start_year>=2014 AND p.start_year<=2018, which is translated to “Find projects whose start year is greater than or equal to 2014 and start year is less than or equal to 2018.” by a template-based system, but could be explained much more fluently as “Get the names of projects started between 2014 and 2018.”
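The rule-per-pattern nature of such systems can be sketched as follows; this single hypothetical template only covers range predicates of the form col >= x AND col <= y, and reproduces exactly the “robotic” phrasing above:

```python
import re

# A toy template in the spirit of template-based explainers: one
# hand-written rule per SQL pattern. This rule is invented for the
# sketch and handles only one predicate shape.
QUERY = ("SELECT p.title FROM projects p "
         "WHERE p.start_year>=2014 AND p.start_year<=2018")

m = re.search(r"WHERE \w+\.(\w+)>=(\d+) AND \w+\.\1<=(\d+)", QUERY)
col, lo, hi = m.groups()
noun = col.replace("_", " ")
robotic = (f"Find projects whose {noun} is greater than or equal to "
           f"{lo} and {noun} is less than or equal to {hi}.")
print(robotic)
# Find projects whose start year is greater than or equal to 2014
# and start year is less than or equal to 2018.
```

Every new predicate shape or schema needs another rule like this one, which is exactly the manual-effort cost described above.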
Only a handful of solutions using deep neural networks exist. Deep learning solutions (e.g., [16]) offer better generalisation to unseen databases, but are not guaranteed to generate accurate explanations every time. We will discuss the existing solutions for
this problem, compare their advantages and drawbacks, while also paving the path for future research by addressing the opportunities that arise from the use of novel NLP techniques that have taken other research areas by storm, such as the Transformer architecture and Pre-trained Language Models (PLMs).

2.3 The Data-to-Text Problem
We will introduce the Data-to-Text problem and showcase its connection with our goal of a natural language data interface. Data-to-Text aims at generating fluent and fact-based verbalisations of a given structured input. The problem requires careful encoding of the input, allowing the model to understand the underlying input structure. Also, the task of generating the verbalisation (decoding) has its intricacies, since the input tends to have many entities and unseen content. Data-to-Text can be distinguished, based on the type of input, into Table-to-Text and Graph-to-Text.

On Table-to-Text, the first datasets were fairly small and domain-specific. Wikibio [7] gathers 728,321 biography info-boxes from English Wikipedia, along with the first paragraph of each page. ROTOWIRE [15] consists of 4,853 human-written NBA basketball game summaries aligned with their corresponding box- and line-scores. More recently, a large and domain-diverse dataset named ToTTo [11] was released, which proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description.

In Graph-to-Text, one of the most influential datasets is the WebNLG [3] corpus, which comprises sets of triplets describing facts in the form of a graph, paired with natural language.
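To make the input format concrete, a WebNLG-style instance can be sketched as a set of subject-predicate-object triplets plus a reference sentence; the facts and the naive coverage check below are invented for illustration:

```python
# An illustrative Graph-to-Text input: a small fact graph as
# subject-predicate-object triplets, paired with a reference
# verbalisation. The facts are invented for this sketch.
triplets = [
    ("Alan_Turing", "birthPlace", "London"),
    ("Alan_Turing", "field", "Computer_Science"),
]
reference = "Alan Turing, born in London, worked in computer science."

def covers_entities(triples, text):
    """Naive coverage check: is every graph entity mentioned in the text?"""
    entities = {s for s, _, _ in triples} | {o for _, _, o in triples}
    return all(e.replace("_", " ").lower() in text.lower()
               for e in entities)

print(covers_entities(triplets, reference))  # True
```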
2.3.1 Table-to-Text Systems. In Table-to-Text, the input of the model is a table, and the goal is to output its description in natural language, focusing both on coverage (all/important parts of the table are described) and fluency (avoiding syntactical errors).

Field-gating Seq2Seq [8] focused on giving the model both a global and a local view of the verbalised table simultaneously. Using the global view, the model decides the order and contents of the verbalisation, while using the local view, it chooses words to copy or paraphrase.

On the other hand, NCP [12] prefers a two-staged approach. First, a content plan is created, which decides the order in which the records of the table should be generated. Second, an LSTM network takes into account the encoded plan along with a copy mechanism, and generates the verbalisation. This stage separation allows for intermediate rewards, leading to more stable training.
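The two-stage idea can be sketched as follows; in NCP both stages are learned networks, whereas here they are hand-written stand-ins over an invented box-score record:

```python
# A toy two-stage sketch in the spirit of NCP: stage 1 selects and
# orders records (the content plan); stage 2 realises the plan as
# text. The table and both stages are invented for this sketch.
table = {"team": "Hawks", "points": 112,
         "rebounds": 41, "arena": "State Farm Arena"}

def plan(records):
    # Stage 1: keep the salient fields, in a fixed order
    # (in NCP this ordering is learned).
    order = ["team", "points", "rebounds"]
    return [(k, records[k]) for k in order if k in records]

def realise(content_plan):
    # Stage 2: verbalise the planned records
    # (an LSTM with a copy mechanism in NCP).
    rec = dict(content_plan)
    return (f"The {rec['team']} finished with {rec['points']} points "
            f"and {rec['rebounds']} rebounds.")

print(realise(plan(table)))
# The Hawks finished with 112 points and 41 rebounds.
```

Because the plan is an explicit intermediate result, it can be scored on its own, which is what makes the intermediate rewards mentioned above possible.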
As in the other two fields of Text-to-SQL and SQL-to-Text, the pre-training of huge language models has greatly impacted both Table-to-Text and Graph-to-Text. So far, the solutions focus on straightforward application of models like GPT-2 or T5, managing to outperform previous approaches. However, these are just the first steps in using pre-trained models on Data-to-Text, and a lot of research is needed to successfully harness their full power.

2.3.2 Graph-to-Text Systems. In Graph-to-Text, we have a graph as input, but the goal remains the same, i.e., to generate a text description of the contents of the graph. The main challenge is a meaningful representation of the relations between nodes, which is crucial for correct verbalisation.

In a similar fashion as in the schema encoding of RAT-SQL, Zhu et al. [20] extend the attention mechanism to include information about the relationships between two nodes that are not necessarily directly connected. A similar approach is followed by Cai et al. [1], but they use the transformer architecture for generating the path and a bi-directional GRU for encoding the information of the path.

Most recent approaches focus on leveraging the power of self-training with the teacher-student architecture. In this direction, the main challenge is processing and filtering the synthetic labels created. CSBT [5] chooses to train the student model on a pseudo-labeled dataset of increasing curriculum difficulty. BLEURT self-training [10] leverages the BLEURT metric for filtering low-quality synthetic labels.
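The filtering step can be sketched as follows; the quality score below is a crude hand-written stand-in for a learned metric such as BLEURT, and the synthetic pairs are invented:

```python
# Self-training sketch: a teacher model produces synthetic labels for
# unlabeled tables, and a quality filter drops low-scoring pairs
# before the student is trained on the survivors.
def quality_score(table, text):
    # Stand-in metric: fraction of table values mentioned in the text.
    vals = [str(v) for v in table.values()]
    return sum(v in text for v in vals) / len(vals)

synthetic = [  # (table, teacher output); the second pair hallucinates
    ({"name": "Ada Lovelace", "birthplace": "London"},
     "Ada Lovelace was born in London."),
    ({"name": "Grace Hopper", "birthplace": "New York"},
     "Grace Hopper was born in Paris."),
]
student_data = [(t, y) for t, y in synthetic
                if quality_score(t, y) == 1.0]
print(len(student_data))  # 1
```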
2.4 Challenges and Research Opportunities
We will present challenges that are unique to each of the discussed problems, areas where one problem could benefit from the recent advances in the other domains, as well as challenges of integrating all three tasks into a single conversational system.

Firstly, evaluation is a big hurdle for all three problems. In some cases there are no perfect automatic metrics, and systems often resort to human evaluation. Another common problem is that evaluation is often limited to accuracy metrics, overlooking time and computational costs. There is a constant need for new benchmarks that can stress these systems and test them in difficulties they would encounter when working on a real database: for example, handling domain-specific knowledge, large amounts of data, many users that interact simultaneously, and so forth.

Another challenge that troubles all three areas is how structured data (e.g., tables and databases) can be efficiently represented in a format that can be processed by a neural network.

Also, generalizing the problem beyond relational databases is another domain that will enjoy the attention of researchers in the near future, given that the advancements of Knowledge Graphs, the Resource Description Framework (RDF), and query languages such as SPARQL point to the need for similar interfaces that can go beyond SQL.

One of the most interesting research and engineering challenges is to create a unified system that combines solutions to the problems we presented, creating a complete natural language data interface. Simply combining existing models is destined to fail, because most systems proposed in all three domains are not designed and tested for real-world databases. Additionally, since this will be a system for casual users, latency, usability, and accessibility all become important factors that require specific optimizations and evaluation studies in order to achieve an enjoyable user experience. However, the feat of implementing such a system would be a massive step forward for data democratisation, and a remarkable scientific and engineering achievement.
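One common answer to the structured-data representation challenge discussed above is to linearise the input into a flat token sequence that a PLM can consume. A minimal sketch; the separator tokens are illustrative only, as conventions differ across systems:

```python
# Minimal sketch of table linearisation for a sequence model: flatten
# a table into one token string. The <table>/<row>/<col>/<val>
# separators are invented for this sketch.
def linearise(table_name, rows):
    parts = [f"<table> {table_name}"]
    for row in rows:
        cells = " ".join(f"<col> {k} <val> {v}" for k, v in row.items())
        parts.append(f"<row> {cells}")
    return " ".join(parts)

rows = [{"title": "Spider", "year": 2018}]
print(linearise("datasets", rows))
# <table> datasets <row> <col> title <val> Spider <col> year <val> 2018
```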
are crucial for correct verbalisation. gramme on Data Science and Information Technologies with a
specialisation on Artificial Intelligence and Big Data. Prior tutorials: Deep Learning for Text-to-SQL [EDBT’21, SIGMOD’21, TWC’22].

Mike Xydas is a research assistant at Athena Research Center in Greece, where he works on the EOSC Future project, creating a recommender system for the EOSC portal, and on the INODE project, focusing on the Data2Text problem. He is a graduate of the Department of Informatics and Telecommunications and is currently attending the MSc programme on Data Science and Information Technologies with a specialisation on Artificial Intelligence and Big Data, completing his thesis, titled “QR2T: Verbalising Query Results”.
Georgia Koutrika is a Research Director at Athena Research Center in Greece. She has worked in multiple roles at HP Labs, IBM Almaden, and Stanford. Her work focuses on intelligent and interactive data exploration, conversational data systems, and user-driven data management; it has been incorporated in commercial products, described in 14 granted patents and 26 patent applications in the US and worldwide, and published in more than 100 papers in top-tier conferences and journals. She is a member of the VLDB Endowment Board of Trustees, the PVLDB Advisory Board, and the ACM-RAISE Working Group, co-Editor-in-Chief of the VLDB Journal, PC co-chair for VLDB 2023, and co-EiC of the Proceedings of VLDB (PVLDB). She has been an associate editor for top-tier conferences (such as ACM SIGMOD and VLDB) and journals (VLDB Journal, IEEE TKDE), and she has served on the organizing committees of several conferences, including SIGMOD, ICDE, and EDBT. Prior tutorials: She has given several tutorials in conferences and summer schools, including Deep Learning for Text-to-SQL [EDBT’21, SIGMOD’21, TWC’22], Fairness in Rankings and Recommenders [EDBT’20, MDM’21], Recommender Systems [SIGMOD’18, EDBT’18, ICDE’15], and Personalization [ICDE’10, ICDE’07, VLDB’05].
ACKNOWLEDGMENTS
This work has been partially funded by the European Union’s Horizon 2020 research and innovation program (grant agreement No 863410) and the European Union Horizon Programme call INFRAEOSC-03-2020 (grant agreement No 101017536).
REFERENCES
[1] Deng Cai and Wai Lam. 2020. Graph Transformer for Graph-to-Sequence Learning. Proceedings of the AAAI Conference on Artificial Intelligence 34, 05 (Apr. 2020), 7464–7471. https://doi.org/10.1609/aaai.v34i05.6243
[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs.CL]
[3] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planning. In 55th Annual Meeting of the Association for Computational Linguistics (ACL).
[4] Wonseok Hwang, Jinyeong Yim, Seunghyun Park, and Minjoon Seo. 2019. A Comprehensive Exploration on WikiSQL with Table-Aware Word Contextualization. arXiv:1902.01069 [cs.CL]
[5] Pei Ke, Haozhe Ji, Zhenyu Yang, Yi Huang, Junlan Feng, Xiaoyan Zhu, and Minlie Huang. 2022. Curriculum-Based Self-Training Makes Better Few-Shot Learners for Data-to-Text Generation. arXiv:2206.02712 (2022).
[6] Andreas Kokkalis, Panagiotis Vagenas, Alexandros Zervakis, Alkis Simitsis, Georgia Koutrika, and Yannis Ioannidis. 2012. Logos: A System for Translating Queries into Narratives. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (Scottsdale, Arizona, USA) (SIGMOD ’12). Association for Computing Machinery, New York, NY, USA, 673–676. https://doi.org/10.1145/2213836.2213929
[7] Rémi Lebret, David Grangier, and Michael Auli. 2016. Neural Text Generation from Structured Data with Application to the Biography Domain. arXiv:1603.07771 (2016).
[8] Tianyu Liu, Kexiang Wang, Lei Sha, Baobao Chang, and Zhifang Sui. 2018. Table-to-Text Generation by Structure-Aware Seq2seq Learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
[9] Qin Lyu, Kaushik Chakrabarti, Shobhit Hathi, Souvik Kundu, Jianwen Zhang, and Zheng Chen. 2020. Hybrid Ranking Network for Text-to-SQL. arXiv:2008.04759 [cs.CL]
[10] Sanket Vaibhav Mehta, Jinfeng Rao, Yi Tay, Mihir Kale, Ankur Parikh, Hongtao Zhong, and Emma Strubell. 2021. Improving Compositional Generalization with Self-Training for Data-to-Text Generation. arXiv:2110.08467 (2021).
[11] Ankur P. Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, and Dipanjan Das. 2020. ToTTo: A Controlled Table-to-Text Generation Dataset. arXiv:2004.14373 (2020).
[12] Ratish Puduppully, Li Dong, and Mirella Lapata. 2019. Data-to-Text Generation with Content Selection and Planning. In Proceedings of the AAAI Conference on Artificial Intelligence.
[13] Torsten Scholak, Nathan Schucher, and Dzmitry Bahdanau. 2021. PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models. arXiv:2109.05093 [cs.CL]
[14] Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2020. RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers. arXiv:1911.04942 [cs.CL]
[15] Sam Wiseman, Stuart M. Shieber, and Alexander M. Rush. 2017. Challenges in Data-to-Document Generation. arXiv:1707.08052 (2017).
[16] Kun Xu, Lingfei Wu, Zhiguo Wang, Yansong Feng, and Vadim Sheinin. 2018. SQL-to-Text Generation with Graph-to-Sequence Model. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium. https://doi.org/10.18653/v1/D18-1112
[17] Xiaojun Xu, Chang Liu, and Dawn Song. 2017. SQLNet: Generating Structured Queries From Natural Language Without Reinforcement Learning. arXiv:1711.04436 [cs.CL]
[18] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2019. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. arXiv:1809.08887 [cs.CL]
[19] Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating Structured Queries from Natural Language Using Reinforcement Learning. arXiv:1709.00103 [cs.CL]
[20] Jie Zhu, Junhui Li, Muhua Zhu, Longhua Qian, Min Zhang, and Guodong Zhou. 2019. Modeling Graph Structure in Transformer for Better AMR-to-Text Generation. arXiv:1909.00136 (2019).