
Inaccurate time estimation results for fine-tuning use-case #128

Closed
@zafercavdar

Description


This PR introduces time estimation functionality for fine-tuning tasks. In our experiments we observed that the estimated values are quite inaccurate, so we have a few questions and suggestions:

Question 1: Is there any public information about where constants like 0.0515 (line 601) come from?
My data frame, which was used to fine-tune a curie model for 2 epochs, contains 8236 rows. Our aim was to train an open-ended generator, which is why the prompt column is completely empty. However, running memory_usage on this df returns the same values for the prompt and completion columns, even though the completion column contains pretty long text values.

[Screenshot (2022-10-06 17:12:48): df.memory_usage() output showing identical values for the prompt and completion columns]
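For context, a minimal sketch that reproduces this behaviour (the frame below is illustrative, not my 8236-row dataset): without deep=True, pandas sizes object columns by their pointers only, so the two text columns come out identical.

```python
import pandas as pd

# Illustrative frame in the fine-tuning format: empty prompts and
# long completions (made-up data, not the real dataset).
df = pd.DataFrame({
    "prompt": [""] * 4,
    "completion": ["a fairly long completion text " * 50] * 4,
})

# Without deep=True, object columns are sized as pointers only,
# so prompt and completion report identical memory usage.
print(df.memory_usage())
```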

If I use the sys module to get the in-memory size of the df, I get a very different result.

[Screenshot (2022-10-06 17:14:38): sys output reporting a much larger size for the df]

If I add the deep=True parameter to pandas' memory_usage call, the returned value becomes very similar to the sys output.

[Screenshot (2022-10-06 17:16:07): df.memory_usage(deep=True) output, close to the sys result]
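And a sketch of the other two measurements (assuming the sys call above is sys.getsizeof):

```python
import sys
import pandas as pd

df = pd.DataFrame({
    "prompt": [""] * 4,
    "completion": ["a fairly long completion text " * 50] * 4,
})

# deep=True introspects the Python string objects, so the per-column
# numbers now reflect the actual text payload instead of pointers.
print(df.memory_usage(deep=True))

# DataFrame.__sizeof__ also reports deep usage, which is why
# sys.getsizeof lands close to memory_usage(deep=True).sum().
print(sys.getsizeof(df))
```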

Question 2: Given the trials above, is there any reason why the estimator doesn't use the deep=True flag to measure the memory actually consumed?

Question 3: Does this estimator make any assumption about the number of epochs?
The time estimator returns 1.92 hours (approximately 115 minutes) for my dataset. When I trained on the same df for 2 epochs, it took 17 minutes in total, roughly 9 minutes per epoch. Presumably the estimator cannot take this parameter into account because it isn't available until the fine-tuning call is made.

Once your model starts training, it'll approximately take 1.93 hours to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you

....
[2022-10-06 15:56:26] Fine-tune enqueued. Queue number: 0
[2022-10-06 15:56:29] Fine-tune started
[2022-10-06 16:05:33] Completed epoch 1/2
[2022-10-06 16:13:42] Completed epoch 2/2
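A back-of-the-envelope comparison of the estimate against the timestamps above:

```python
from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S"
started = datetime.strptime("2022-10-06 15:56:29", fmt)
finished = datetime.strptime("2022-10-06 16:13:42", fmt)

actual_min = (finished - started).total_seconds() / 60  # ~17.2 minutes for 2 epochs
per_epoch_min = actual_min / 2                           # ~8.6 minutes per epoch
estimated_min = 1.93 * 60                                # ~115.8 minutes estimated

print(f"estimate is ~{estimated_min / actual_min:.1f}x the actual training time")
```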

Suggestions:

  • More documentation about constant values like 0.0515
  • Adding the deep=True flag to the memory_usage call and updating the constants accordingly
  • Adding the epoch-count assumption to the log message, e.g. Once your model starts training, it'll approximately take 1.93 hours to train a curie model for x epochs based on historical statistics, and less ... (a rough sketch of these last two suggestions follows this list)
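A rough sketch of what the last two suggestions could look like; the function name, the interpretation of 0.0515 as hours per MB, and the default epoch count are illustrative assumptions, not the actual openai-python code:

```python
# Hypothetical sketch only: names, units and defaults are illustrative,
# not the real estimator around line 601.
def estimate_training_hours(df, model="curie", n_epochs=4, hours_per_mb=0.0515):
    # Suggestion 2: deep=True measures the string payload, not just pointers.
    size_mb = df.memory_usage(deep=True).sum() / (1024 ** 2)
    hours = size_mb * hours_per_mb * n_epochs

    # Suggestion 3: surface the epoch assumption in the log message.
    print(
        f"Once your model starts training, it'll approximately take "
        f"{hours:.2f} hours to train a `{model}` model for {n_epochs} epochs, "
        f"based on historical statistics."
    )
    return hours
```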
