[EVALUATION]: How to evaluate on the test dataset without gold hypotheses?

The `metadata_*.json` files under `discoverybench/real/test` do not seem to contain labeled (gold) hypotheses. How should we evaluate this portion of the dataset to get HMS scores?