v0.25.0 Distributed execution; API cleanups (and changes) #251
janpfeifer
announced in
Announcements
This (early) end-of-year release comes with a big gift: distributed execution. For anyone needing to scale training (or inference), this is a must. As with all new things, it should be considered experimental. I'll actually be using it, and hopefully any remaining issues (including adding a distributed tutorial -- there is a demo already) will be resolved by the end of the year, or early next year.
There are also some API clean-ups -- mostly returning errors in places that were panicking before. So, a bit of a "Grinch" gift: updating to this version requires some small changes on your side. All very trivial (an error was added to the return list of some functions and methods).
We also had some external collaborations. Looking forward to more of those -- there is so much interesting stuff to be done! (Quantization, using SIMD operations in the "Simple Go" backend, etc.)
Highlights:

- Distributed (cross-device) execution, with AutoSharding and SPMD strategies.
  - Also added support for "portable device" execution.
- API changes (will require simple fixes):
  - Several functions and methods now return errors instead of panicking.
  - Graph-building (math) operations still use panic to report errors -- otherwise it's too painful to express math.
Distributed computation improvements and refactorings:

- `graph`:
  - Added `IsNegative`, `IsPositive`, `IsNonNegative`, `IsNonPositive`.
  - Added `SubScalar` and tests for the `*Scalar` functions.
  - Added `Graph.WithDistributedStrategy` and `Graph.WithDeviceMesh`.
  - Added `Graph.DeviceMesh` and `Graph.NumDevices`.
  - Added `Graph.Distributed()` with "collective" (across devices) operations (like `AllReduce`).
  - Renamed `Exec.InDevice` to `Exec.WithDevice`, and `Exec.SetName` to `Exec.WithName`.
  - Added `RunOnDevice`.
  - Added `Exec.AutoSharding` and `Exec.SPMD`.
- `context`:
  - Added `context.MustGetParam[T](ctx, key)` and `context.MustGetGraphParam[T](ctx, graph, key)`.
  - Added `Variable.DistributedValue` and `Variable.SetDistributedValue`.
- `train`:
  - Added `train.DistributedDataset` and `train.BaseDataset`.
  - `Dataset.Reset` now returns an error.
  - `Trainer.TrainStep`, `Trainer.EvalStep` and `Trainer.Eval` now return errors, as opposed to panicking.
  - Added `Trainer.WithDeviceAssignment`.
  - Added `Trainer.DistributedTrainStep`, `Trainer.DistributedEvalStep` and `Trainer.DistributedEval`.
- `datasets`:
  - `datasets.DistributedAccumulator`: converts a normal `Dataset` into a `DistributedDataset`.
  - `datasets.OnDevice`: pre-uploads data to devices.
- `backend`:
  - Added `Backend.CopyToDevice`.
  - `Builder.Parameter()` now takes an optional `ShardingSpec` for sharded inputs.
  - Added `AllReduce`.
  - `Backend.NumDevices()` now returns an `int`.
- `backends/notimplemented`: a `Backend` that can be used to easily mock backends.
- `pkg/core/distributed`: added `DeviceMesh`, `ShardSpec` and `distributed.Tensor` objects.
- `pkg/core/tensors`: added `Tensor.CheckValid()`, `Tensor.Device()` and `Tensor.Backend()`.

Other improvements:

- `cosineschedule`: `WarmUpSteps` and `NumCycles` hyperparameters -- removed overloading of `periodSteps`.
- Added `FUNDING.yml` pointing to sponsorship.
- Added `.golangci.yml` and fixed many lint warnings (still a long way to go).
- `simplego`, `ui/fyneui` and `graph` (`Gather`): assorted improvements -- thanks @ajroetker!
This discussion was created from the release v0.25.0 Distributed execution; API cleanups (and changes).