IndexError occurred when running python run.py --run Balsa_TPCH --local 

I am interested in your code and am trying to run it with TPC-H. I wrote a subclass of Balsa_JOBRandSplit and changed p as follows.
```
p.db = 'tpchload'
p.sim_checkpoint = None
p.query_dir = 'queries/myTpchTest'
p.query_glob = ['*.sql']
p.test_query_glob = TPCH_TEST_QUERIES
```
The PostgreSQL version and conda environment are the same as recommended in README.md. When I run python run.py --run Balsa_TPCH --local, an error occurs with the following traceback.
```
Traceback (most recent call last):
  File "run.py", line 2155, in <module>
    app.run(Main)
  File "/home/xxx/anaconda3/envs/balsa/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/xxx/anaconda3/envs/balsa/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "run.py", line 2150, in Main
    agent = BalsaAgent(p)
  File "run.py", line 754, in __init__
    self.exp, self.exp_val = self._MakeExperienceBuffer()
  File "run.py", line 809, in _MakeExperienceBuffer
    wi = self.GetOrTrainSim().training_workload_info
  File "run.py", line 1160, in GetOrTrainSim
    self.sim = TrainSim(p, self.loggers)
  File "run.py", line 379, in TrainSim
    sim.CollectSimulationData()
  File "/home/xxx/balsa/sim.py", line 728, in CollectSimulationData
    self.search.Run(query_node, query_node.info['sql_str'])
  File "/home/xxx/balsa/balsa/search.py", line 245, in Run
    dp_tables)
  File "/home/xxx/balsa/balsa/search.py", line 317, in _dp_bushy_search_space
    return list(dp_tables[num_rels].values())[0][1], dp_tables
IndexError: list index out of range
```
I use only three queries in the query_dir. One of them is:
```
select
  supp_nation,
  cust_nation,
  l_year,
  sum(volume) as revenue
from
  (
    select
      n1.n_name as supp_nation,
      n2.n_name as cust_nation,
      extract(year from l_shipdate) as l_year,
      l_extendedprice * (1 - l_discount) as volume
    from
      supplier,
      lineitem,
      orders,
      customer,
      nation n1,
      nation n2
    where
      s_suppkey = l_suppkey
      and o_orderkey = l_orderkey
      and c_custkey = o_custkey
      and s_nationkey = n1.n_nationkey
      and c_nationkey = n2.n_nationkey
      and (
        (n1.n_name = 'VIETNAM' and n2.n_name = 'UNITED KINGDOM')
        or (n1.n_name = 'UNITED KINGDOM' and n2.n_name = 'VIETNAM')
      )
      and l_shipdate between date '1995-01-01' and date '1996-12-31'
  ) as shipping
group by
  supp_nation,
  cust_nation,
  l_year
order by
  supp_nation,
  cust_nation,
  l_year;
```
Then I added print(join_graph) after line 257 of balsa/balsa/search.py, which is
```
r = r_tup[1]
```
and it shows "Graph with 0 nodes and 0 edges". I think I am not getting a correct join graph at line 224 of balsa/balsa/search.py, which is
```
join_graph, all_join_conds = query_node.GetOrParseSql()
```
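An empty join graph would explain the IndexError. Here is a minimal sketch of my understanding, not code from search.py: the layout of dp_tables, the join_edges set and the connected helper are all my own assumptions for illustration. If the DP only combines two sub-plans when a join edge connects them, then with no edges it never builds a plan covering all relations, and indexing [0] on the top level fails.

```python
import collections
import itertools

# Hypothetical stand-in for the DP table: level -> {set of tables: (cost, plan)}.
dp_tables = collections.defaultdict(dict)
tables = ["supplier", "lineitem", "orders", "customer", "n1", "n2"]
for t in tables:
    dp_tables[1][frozenset([t])] = (0.0, t)

# What the parser gave me: no join conditions, so the graph has no edges.
join_edges = set()

def connected(left, right):
    # True if any join edge links a table in `left` to a table in `right`.
    return any(frozenset((a, b)) in join_edges for a in left for b in right)

num_rels = len(tables)
for level in range(2, num_rels + 1):
    for left_size in range(1, level):
        lefts = list(dp_tables[left_size].items())
        rights = list(dp_tables[level - left_size].items())
        for (l_set, (l_cost, l_plan)), (r_set, (r_cost, r_plan)) in \
                itertools.product(lefts, rights):
            # Assumed DP rule: only join disjoint sub-plans that share an edge.
            if l_set.isdisjoint(r_set) and connected(l_set, r_set):
                dp_tables[level][l_set | r_set] = (
                    l_cost + r_cost + 1.0, (l_plan, r_plan))

try:
    # Same indexing shape as the failing line in the traceback.
    best = list(dp_tables[num_rels].values())[0][1]
    error = None
except IndexError as exc:
    best, error = None, exc

print(len(dp_tables[num_rels]), repr(error))
```

If this reading is right, the IndexError is only a symptom, and the real problem is that the parser found no join conditions.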
I then checked the definition of GetOrParseSql(self) in balsa/balsa/util/plans_lib.py and printed graph and join_conds. It shows "Graph with 0 nodes and 0 edges" for the graph and [] for join_conds. I then checked the definition of simple_sql_parser in balsa/balsa/util/simple_sql_parser.py and printed join_conds after
```
join_conds = join_cond_pat.findall(sql)
```
The sql is one of the queries in my query_dir, but join_conds is still []. I checked the regular expression, and my guess is that it cannot handle conditions such as c_custkey = o_custkey in my queries, since the pattern contains dots (as in alias.column) and my column names are unqualified.
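To illustrate my guess, here is a small standalone check. The pattern below is my own approximation of a regular expression that requires a dot on both sides of the equals sign. It is not copied from simple_sql_parser.py, so the real pattern may differ.

```python
import re

# Assumed shape of the join-condition pattern: alias.column = alias.column.
# This is an illustration only, not the pattern from simple_sql_parser.py.
qualified_join_pat = re.compile(r"(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)")

qualified = "s.s_nationkey = n1.n_nationkey"     # dots on both sides
unqualified = "c_custkey = o_custkey"            # no dots, as in my queries
half_qualified = "s_nationkey = n1.n_nationkey"  # dot on one side only

matches_qualified = qualified_join_pat.findall(qualified)
matches_unqualified = qualified_join_pat.findall(unqualified)
matches_half = qualified_join_pat.findall(half_qualified)

print(matches_qualified)
print(matches_unqualified)
print(matches_half)
```

With a pattern of this shape, only the fully qualified condition is found. Every join condition in my query is either unqualified or qualified on one side only, so this would produce an empty join_conds.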
As described in the paper, TPC-H is used as a benchmark. Could you please give me some hints on the parser problem above, or add some code for TPC-H? Many thanks in advance.
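One workaround I am considering on my side is to rewrite the queries so that bare column names become table.column before they reach the parser. This is only a sketch: TPCH_PREFIX_TO_TABLE, bare_join_pat and qualify_bare_joins are hypothetical names of mine, it assumes lower-case SQL, and I do not know whether the parser accepts full table names in place of aliases. It also leaves conditions such as s_nationkey = n1.n_nationkey untouched, and a table that appears under two aliases (nation n1, nation n2) cannot be resolved from the column prefix alone.

```python
import re

# TPC-H column-name prefixes and the tables they belong to.
TPCH_PREFIX_TO_TABLE = {
    "p": "part", "ps": "partsupp", "s": "supplier", "l": "lineitem",
    "o": "orders", "c": "customer", "n": "nation", "r": "region",
}

# Equality between two bare columns such as c_custkey = o_custkey.
# The lookbehind/lookahead skip names that are already qualified with a dot.
bare_join_pat = re.compile(
    r"(?<![\w.])([a-z]+)_(\w+)\s*=\s*([a-z]+)_(\w+)(?![\w.])")

def qualify_bare_joins(sql):
    """Rewrite `x_col = y_col` as `table_x.x_col = table_y.y_col`."""
    def repl(m):
        l_prefix, l_col, r_prefix, r_col = m.groups()
        l_table = TPCH_PREFIX_TO_TABLE.get(l_prefix)
        r_table = TPCH_PREFIX_TO_TABLE.get(r_prefix)
        if l_table is None or r_table is None:
            return m.group(0)  # unknown prefix: leave the text unchanged
        return "{}.{}_{} = {}.{}_{}".format(
            l_table, l_prefix, l_col, r_table, r_prefix, r_col)
    return bare_join_pat.sub(repl, sql)

where_clause = ("where s_suppkey = l_suppkey and o_orderkey = l_orderkey "
                "and c_custkey = o_custkey and s_nationkey = n1.n_nationkey")
rewritten = qualify_bare_joins(where_clause)
print(rewritten)
```

I would still prefer a fix in the parser itself, since rewriting the benchmark queries by hand or by script is easy to get wrong.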
Another point of confusion: when I run the above command for the first time and set
```
p.query_glob = ['test1.sql', 'test2.sql', 'test3.sql']
p.test_query_glob = ['test1.sql']
```
 it shows
```
3 train queries: ['test1', 'test2', 'test3']
0 test queries: []
wandb: (1) Create a W&B account
```
even though test_query_glob in the BalsaAgent params is ['test1.sql']. I am also curious why we need to measure the baseline PG performance by running all test and training queries before training. I sincerely look forward to your reply!
