Question about dataset usage and broken SVGs in SVGX-SFT-1M #23

Hi @ximinng, 

Thanks for releasing the `SVGX` datasets. I have a few small questions about them:

According to the README:
> Available Datasets on Hugging Face:
> - [xingxm/SVGX-Core-250k](https://huggingface.co/datasets/xingxm/SVGX-Core-250k): Core pretraining data (250k examples).
> - [xingxm/SVGX-SFT-1M](https://huggingface.co/datasets/xingxm/SVGX-SFT-1M): Supervised fine-tuning data (1M examples).

However, based on how the datasets are used in the code, `SVGX-Core-250k` appears to be the one actually needed for supervised fine-tuning, not `SVGX-SFT-1M`. Could you clarify this discrepancy?

Also, could you explain what exactly `SVGX_SFT_GEN_51k`, `SVGX_SFT_GEN_basic`, and `SVGX_SFT_UN_25k` in `SVGX-SFT-1M` are? I've noticed that many SVGs in the `SVGX-SFT-1M` dataset appear to be broken.

For example:

<img width="681" height="667" alt="Image" src="https://github.com/user-attachments/assets/56419e56-608a-413d-a373-96a774738c04" />
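
A minimal sketch of one way to flag such samples, assuming "broken" means the SVG string does not parse as XML or does not have an `<svg>` root. The helper `is_well_formed_svg` and the `samples` strings are hypothetical and only for illustration; they are not taken from the dataset, and an SVG can still render incorrectly even when it passes this check.

```python
import xml.etree.ElementTree as ET

# Standard SVG namespace, as ElementTree writes it in tag names.
SVG_NS = "{http://www.w3.org/2000/svg}"


def is_well_formed_svg(svg_text: str) -> bool:
    """Return True if svg_text parses as XML and its root element is <svg>."""
    try:
        root = ET.fromstring(svg_text)
    except ET.ParseError:
        # Truncated or otherwise malformed markup ends up here.
        return False
    # Accept both namespaced and un-namespaced <svg> roots.
    return root.tag in (f"{SVG_NS}svg", "svg")


# Hypothetical inputs (not real dataset rows), one per failure mode.
samples = {
    "ok": '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 10 10">'
          '<rect width="10" height="10"/></svg>',
    "truncated": '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 10 10">'
                 '<path d="M0 0 L10',
    "not_svg": "<html><body>oops</body></html>",
}

results = {name: is_well_formed_svg(text) for name, text in samples.items()}
print(results)
```

Running a check like this over each SVG string in the dataset would give a rough count of how many samples are affected.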
