There have been a few reports that grammar sampling can significantly degrade performance.
It would be nice to profile and optimize the implementation; there should be room for improvement.
Ongoing efforts:
- `reserve` space in `decode_utf8` #4210
- `llama_token_to_piece` when sampling grammars #4213

Probably worth looking into multi-threading the implementation as well.