Description
I am interested in your code and tried to run it with TPC-H. I wrote a subclass of Balsa_JOBRandSplit and changed p as follows.
p.db = 'tpchload'
p.sim_checkpoint = None
p.query_dir = 'queries/myTpchTest'
p.query_glob = ['*.sql']
p.test_query_glob = TPCH_TEST_QUERIES
The PostgreSQL version and conda environment are the ones recommended in README.md. When I run python run.py --run Balsa_TPCH --local, it fails with the following traceback.
Traceback (most recent call last):
File "run.py", line 2155, in <module>
app.run(Main)
File "/home/xxx/anaconda3/envs/balsa/lib/python3.7/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/xxx/anaconda3/envs/balsa/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "run.py", line 2150, in Main
agent = BalsaAgent(p)
File "run.py", line 754, in __init__
self.exp, self.exp_val = self._MakeExperienceBuffer()
File "run.py", line 809, in _MakeExperienceBuffer
wi = self.GetOrTrainSim().training_workload_info
File "run.py", line 1160, in GetOrTrainSim
self.sim = TrainSim(p, self.loggers)
File "run.py", line 379, in TrainSim
sim.CollectSimulationData()
File "/home/xxx/balsa/sim.py", line 728, in CollectSimulationData
self.search.Run(query_node, query_node.info['sql_str'])
File "/home/xxx/balsa/balsa/search.py", line 245, in Run
dp_tables)
File "/home/xxx/balsa/balsa/search.py", line 317, in _dp_bushy_search_space
return list(dp_tables[num_rels].values())[0][1], dp_tables
IndexError: list index out of range
I use only three queries in query_dir, for example:
select supp_nation, cust_nation, l_year, sum(volume) as revenue
from (
  select n1.n_name as supp_nation, n2.n_name as cust_nation,
         extract(year from l_shipdate) as l_year,
         l_extendedprice * (1 - l_discount) as volume
  from supplier, lineitem, orders, customer, nation n1, nation n2
  where s_suppkey = l_suppkey
    and o_orderkey = l_orderkey
    and c_custkey = o_custkey
    and s_nationkey = n1.n_nationkey
    and c_nationkey = n2.n_nationkey
    and ( (n1.n_name = 'VIETNAM' and n2.n_name = 'UNITED KINGDOM')
       or (n1.n_name = 'UNITED KINGDOM' and n2.n_name = 'VIETNAM') )
    and l_shipdate between date '1995-01-01' and date '1996-12-31'
) as shipping
group by supp_nation, cust_nation, l_year
order by supp_nation, cust_nation, l_year;
Then I added print(join_graph) after line 257 of balsa/balsa/search.py, which is
r = r_tup[1]
and it prints "Graph with 0 nodes and 0 edges". I suspect the join graph is not built correctly at line 224 of balsa/balsa/search.py, which is
join_graph, all_join_conds = query_node.GetOrParseSql()
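To illustrate why an empty join graph would end in this particular IndexError, here is a minimal sketch. It only mirrors the indexing expression shown in the traceback; the shape of dp_tables (a mapping from level to a dict of plans) and the helper name pick_final_plan are my assumptions, not the actual Balsa code.

```python
from collections import defaultdict


def pick_final_plan(dp_tables, num_rels):
    # Same indexing expression as in the traceback: take the first
    # entry stored at the top level of the DP table.
    return list(dp_tables[num_rels].values())[0][1]


# Assumed shape: level -> {table set: (cost, plan)}.
dp_tables = defaultdict(dict)

# An empty join graph means no relations, so no level is ever filled.
num_rels = 0

try:
    pick_final_plan(dp_tables, num_rels)
    error = None
except IndexError as e:
    error = str(e)

print(error)
```

If this reading is right, the IndexError is only a symptom: the search has nothing to enumerate because parsing produced no relations.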
I then checked the definition of GetOrParseSql(self) in balsa/balsa/util/plans_lib.py and printed graph and join_conds. The output is "Graph with 0 nodes and 0 edges" for graph and [] for join_conds. I then checked simple_sql_parser in balsa/balsa/util/simple_sql_parser.py and printed join_conds right after
join_conds = join_cond_pat.findall(sql)
The sql is one of the queries in my query_dir, but join_conds is still []. Looking at the regular expression, I guess it cannot handle conditions such as c_custkey = o_custkey in my queries, because the pattern contains dots (it seems to expect table.column on both sides) while these columns are unqualified.
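A small sketch of what I think is happening. The pattern below is my own stand-in for a regex that requires table.column on both sides of the equals sign; it is an assumption, not copied from simple_sql_parser.py. The prefix mapping and the qualify_columns helper are also hypothetical, shown only as one possible way to rewrite TPC-H conditions before parsing.

```python
import re

# Assumed stand-in for the parser's pattern: alias.column = alias.column.
join_cond_pat = re.compile(r"(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)")

# Hypothetical mapping from TPC-H column prefixes to table names.
TPCH_PREFIXES = {
    "ps": "partsupp", "p": "part", "s": "supplier", "l": "lineitem",
    "o": "orders", "c": "customer", "n": "nation", "r": "region",
}

# A bare column: a known prefix plus underscore, not already preceded
# by a word character or a dot (so "n1.n_nationkey" is left untouched).
bare_col_pat = re.compile(r"(?<![\w.])(ps|p|s|l|o|c|n|r)_(\w+)")


def qualify_columns(cond):
    """Prefix bare TPC-H columns with their table name (illustrative only)."""
    return bare_col_pat.sub(
        lambda m: f"{TPCH_PREFIXES[m.group(1)]}.{m.group(0)}", cond)


unqualified = "c_custkey = o_custkey"
mixed = "s_nationkey = n1.n_nationkey"

# Neither condition has a dot on both sides, so nothing is found.
before = join_cond_pat.findall(unqualified) + join_cond_pat.findall(mixed)

# After qualification both conditions match.
after = [join_cond_pat.findall(qualify_columns(c))[0]
         for c in (unqualified, mixed)]

print(before)
print(after)
```

This rewrite is deliberately naive: it would also touch output aliases such as l_year, and it cannot tell nation n1 from nation n2 unless the query already uses those aliases, so it should be applied to the join conditions only.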
As described in the paper, TPC-H is used as a benchmark. Could you please give me some hints about the parser problem above, or add some code for TPC-H? Many thanks in advance.
Another point of confusion: when I ran the above command for the first time with
p.query_glob = ['test1.sql', 'test2.sql', 'test3.sql']
p.test_query_glob = ['test1.sql']
it shows
3 train queries: ['test1', 'test2', 'test3']
0 test queries: []
wandb: (1) Create a W&B account
even though test_query_glob in the BalsaAgent params is ['test1.sql']. I am also curious why the baseline PG performance needs to be measured by running all test and training queries before training. I look forward to your reply!