# Datasets

In many machine learning models, especially for supervised learning, datasets are a vital part of
the training process. Swift for TensorFlow provides wrappers for several common datasets within the
Datasets module in [the models repository](https://github.com/tensorflow/swift-models). These
wrappers ease the use of common datasets with Swift-based models and integrate well with
Swift for TensorFlow's generalized training loop.

## Provided dataset wrappers

These are the currently provided dataset wrappers within the models repository:

- [BostonHousing](https://github.com/tensorflow/swift-models/tree/main/Datasets/BostonHousing)
- [CIFAR-10](https://github.com/tensorflow/swift-models/tree/main/Datasets/CIFAR10)
- [MS COCO](https://github.com/tensorflow/swift-models/tree/main/Datasets/COCO)
- [CoLA](https://github.com/tensorflow/swift-models/tree/main/Datasets/CoLA)
- [ImageNet](https://github.com/tensorflow/swift-models/tree/main/Datasets/Imagenette)
- [Imagenette](https://github.com/tensorflow/swift-models/tree/main/Datasets/Imagenette)
- [Imagewoof](https://github.com/tensorflow/swift-models/tree/main/Datasets/Imagenette)
- [FashionMNIST](https://github.com/tensorflow/swift-models/tree/main/Datasets/MNIST)
- [KuzushijiMNIST](https://github.com/tensorflow/swift-models/tree/main/Datasets/MNIST)
- [MNIST](https://github.com/tensorflow/swift-models/tree/main/Datasets/MNIST)
- [MovieLens](https://github.com/tensorflow/swift-models/tree/main/Datasets/MovieLens)
- [Oxford-IIIT Pet](https://github.com/tensorflow/swift-models/tree/main/Datasets/OxfordIIITPets)
- [WordSeg](https://github.com/tensorflow/swift-models/tree/main/Datasets/WordSeg)

To use one of these dataset wrappers within a Swift project, add `Datasets` as a dependency to your
Swift target and import the module:

```swift
import Datasets
```
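
To add that dependency, your package manifest might look roughly like the following. This is a
minimal sketch: the package name `TensorFlowModels`, the `main` branch reference, and the
`MyTrainingProject` target are assumptions, so check the `Package.swift` in the models repository
for the exact names it exports.

```swift
// swift-tools-version:5.3
import PackageDescription

// Sketch of a manifest that pulls in the Datasets module from swift-models.
// Names other than the repository URL and the Datasets product are illustrative.
let package = Package(
  name: "MyTrainingProject",
  dependencies: [
    .package(
      name: "TensorFlowModels",
      url: "https://github.com/tensorflow/swift-models.git",
      .branch("main")),
  ],
  targets: [
    .target(
      name: "MyTrainingProject",
      dependencies: [
        .product(name: "Datasets", package: "TensorFlowModels"),
      ]),
  ]
)
```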

Most dataset wrappers are designed to produce randomly shuffled batches of labeled data. For
example, to use the CIFAR-10 dataset, you first initialize it with the desired batch size:

```swift
let dataset = CIFAR10(batchSize: 100)
```

On first use, the Swift for TensorFlow dataset wrappers will automatically download the original
dataset for you, extract and parse all relevant archives, and then store the processed dataset in a
user-local cache directory. Subsequent uses of the same dataset will load directly from the local
cache.

To set up a manual training loop involving this dataset, you'd use something like the following:

```swift
for (epoch, epochBatches) in dataset.training.prefix(100).enumerated() {
  Context.local.learningPhase = .training
  ...
  for batch in epochBatches {
    let (images, labels) = (batch.data, batch.label)
    ...
  }
}
```

The above sets up an iterator through 100 epochs (`.prefix(100)`) and returns the current epoch's
numerical index along with a lazily mapped sequence over the shuffled batches that make up that
epoch. Within each training epoch, batches are iterated over and extracted for processing. In the
case of the `CIFAR10` dataset wrapper, each batch is a
[`LabeledImage`](https://github.com/tensorflow/swift-models/blob/main/Datasets/ImageClassificationDataset.swift),
which provides a `Tensor<Float>` containing all images from that batch and a `Tensor<Int32>` with
their matching labels.
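
To make that loop concrete, here is a sketch of what the elided sections might contain. The
`MyClassifier` model type is hypothetical (any image classification model conforming to `Layer`
would work), and the optimizer choice and learning rate are illustrative rather than
recommendations:

```swift
import Datasets
import TensorFlow

var model = MyClassifier()  // hypothetical model type; substitute your own Layer
var optimizer = SGD(for: model, learningRate: 0.01)
let dataset = CIFAR10(batchSize: 100)

for (epoch, epochBatches) in dataset.training.prefix(100).enumerated() {
  Context.local.learningPhase = .training
  var totalLoss: Float = 0
  var batchCount = 0
  for batch in epochBatches {
    let (images, labels) = (batch.data, batch.label)
    // Differentiate the loss with respect to the model and take an optimizer step.
    let (loss, gradients) = valueWithGradient(at: model) { model -> Tensor<Float> in
      softmaxCrossEntropy(logits: model(images), labels: labels)
    }
    totalLoss += loss.scalarized()
    batchCount += 1
    optimizer.update(&model, along: gradients)
  }
  print("Epoch \(epoch): average loss \(totalLoss / Float(batchCount))")
}
```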

In the case of CIFAR-10, the entire dataset is small enough to be loaded into memory at once, but
for other, larger datasets batches are loaded lazily from disk and processed at the point where each
batch is obtained. This prevents memory exhaustion with those larger datasets.

## The Epochs API

Most of these dataset wrappers are built on a shared infrastructure that we've called the
[Epochs API](https://github.com/tensorflow/swift-apis/tree/main/Sources/TensorFlow/Epochs). Epochs
provides flexible components intended to support a wide variety of dataset types, from text to
images and more.
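
As a rough illustration of the central building block, `TrainingEpochs` (shown here with toy
samples rather than a real dataset) turns any collection of samples into an infinite sequence of
epochs, each of which is a collection of shuffled, fixed-size batches. This is only a sketch of the
shape of the API; the CIFAR-10 walkthrough below shows how a real wrapper uses it.

```swift
import TensorFlow  // the Epochs API ships as part of the TensorFlow module in swift-apis

// Toy samples standing in for a real dataset: a single Int as "data" paired with a label.
let samples: [(data: Int, label: Int32)] = (0..<10).map { (data: $0, label: Int32($0 % 2)) }

// An infinite sequence of epochs; each epoch yields shuffled batches of 4 samples.
let epochs = TrainingEpochs(samples: samples, batchSize: 4, entropy: SystemRandomNumberGenerator())

for (epoch, batches) in epochs.prefix(2).enumerated() {
  for batch in batches {
    // Each batch is a slice of samples; a real wrapper collates it into tensors here.
    print("epoch \(epoch):", batch.map(\.data))
  }
}
```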

If you wish to create your own Swift dataset wrapper, you'll most likely want to use the Epochs API
to do so. However, for common cases, such as image classification datasets, we highly recommend
starting from a template based on one of the existing dataset wrappers and modifying that to meet
your specific needs.

As an example, let's examine the CIFAR-10 dataset wrapper and how it works. The core of the
training dataset is defined here:

```swift
let trainingSamples = loadCIFARTrainingFiles(in: localStorageDirectory)
training = TrainingEpochs(samples: trainingSamples, batchSize: batchSize, entropy: entropy)
  .lazy.map { (batches: Batches) -> LazyMapSequence<Batches, LabeledImage> in
    return batches.lazy.map {
      makeBatch(samples: $0, mean: mean, standardDeviation: standardDeviation, device: device)
    }
  }
```

The result of the `loadCIFARTrainingFiles()` function is an array containing a
`(data: [UInt8], label: Int32)` tuple for each image in the training dataset. This is then provided
to `TrainingEpochs(samples:batchSize:entropy:)` to create an infinite sequence of epochs composed of
batches of size `batchSize`. You can provide your own random number generator in cases where you
want deterministic batching behavior, but by default the `SystemRandomNumberGenerator` is used.
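
For example, to make batch ordering reproducible from run to run, you can pass any seeded
`RandomNumberGenerator` as the entropy source. The small SplitMix64-style generator below is purely
illustrative (swift-apis also provides seedable generators such as `ARC4RandomNumberGenerator`), and
`trainingSamples` refers to the array loaded in the snippet above:

```swift
// A minimal deterministic generator (SplitMix64), shown only for illustration.
struct SeededGenerator: RandomNumberGenerator {
  var state: UInt64
  init(seed: UInt64) { state = seed }
  mutating func next() -> UInt64 {
    state &+= 0x9E3779B97F4A7C15
    var z = state
    z = (z ^ (z >> 30)) &* 0xBF58476D1CE4E5B9
    z = (z ^ (z >> 27)) &* 0x94D049BB133111EB
    return z ^ (z >> 31)
  }
}

// The same seed produces the same sequence of shuffled batches on every run.
let deterministicEpochs = TrainingEpochs(
  samples: trainingSamples, batchSize: batchSize, entropy: SeededGenerator(seed: 42))
```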

From there, lazy maps over the batches culminate in the
`makeBatch(samples:mean:standardDeviation:device:)` function. This is a custom function where the
actual image processing pipeline for the CIFAR-10 dataset is located, so let's take a look at that:

```swift
fileprivate func makeBatch<BatchSamples: Collection>(
  samples: BatchSamples, mean: Tensor<Float>?, standardDeviation: Tensor<Float>?, device: Device
) -> LabeledImage where BatchSamples.Element == (data: [UInt8], label: Int32) {
  // Concatenate the raw bytes of every sample into one tensor in channels-first order.
  let bytes = samples.lazy.map(\.data).reduce(into: [], +=)
  let images = Tensor<UInt8>(shape: [samples.count, 3, 32, 32], scalars: bytes, on: device)

  // Reorder to channels-last, convert to floating point, scale to [0, 1], and optionally normalize.
  var imageTensor = Tensor<Float>(images.transposed(permutation: [0, 2, 3, 1]))
  imageTensor /= 255.0
  if let mean = mean, let standardDeviation = standardDeviation {
    imageTensor = (imageTensor - mean) / standardDeviation
  }

  // Pair the image batch with its labels.
  let labels = Tensor<Int32>(samples.map(\.label), on: device)
  return LabeledImage(data: imageTensor, label: labels)
}
```

The first two lines of this function concatenate all `data` bytes from the incoming `BatchSamples`
into a `Tensor<UInt8>` that matches the byte layout of the images within the raw CIFAR-10 dataset.
Next, the image dimensions are transposed so that the channels come last, as expected by our
standard image classification models, and the image data is cast into a `Tensor<Float>` for model
consumption.

Optional normalization parameters can be provided to further adjust image channel values, a process
that is common when training many image classification models. The normalization parameter `Tensor`s
are created once at dataset initialization and then passed into `makeBatch()` as an optimization
to prevent the repeated creation of small temporary tensors with the same values.
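
For instance, a wrapper might build those tensors once in its initializer from per-channel
statistics. The values below are commonly cited approximations of the CIFAR-10 channel means and
standard deviations, shown for illustration rather than as the exact constants used in the
repository:

```swift
import TensorFlow

let device = Device.default

// Per-channel statistics, created once and reused for every call to makeBatch().
let mean = Tensor<Float>([0.4914, 0.4822, 0.4465], on: device)
let standardDeviation = Tensor<Float>([0.2470, 0.2435, 0.2616], on: device)

// Inside makeBatch(), broadcasting applies these across the [batch, height, width, channel] tensor:
// imageTensor = (imageTensor - mean) / standardDeviation
```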

Finally, the integer labels are placed in a `Tensor<Int32>`, and the image / label tensor pair is
returned in a `LabeledImage`. A `LabeledImage` is a specific case of
[`LabeledData`](https://github.com/tensorflow/swift-models/blob/main/Support/LabeledData.swift), a
struct with data and labels that conform to the Epochs API's
[`Collatable`](https://github.com/tensorflow/swift-apis/blob/main/Sources/TensorFlow/Epochs/Collatable.swift)
protocol.
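
In other words, `LabeledImage` is `LabeledData` specialized to image and label tensors. The
following is only a sketch of the shape of those types, not a verbatim copy of the source; see
`LabeledData.swift` and `ImageClassificationDataset.swift` in the models repository for the actual
definitions:

```swift
import TensorFlow

// Sketch: a generic pairing of data with labels. The real LabeledData conforms to
// Collatable whenever Data and Label do, so that individual samples can be collated
// into batched tensors by the Epochs machinery.
struct LabeledData<Data, Label> {
  let data: Data    // e.g. a batch of images
  let label: Label  // e.g. the matching batch of labels
}

// Sketch of how LabeledImage relates to LabeledData for image classification datasets.
typealias LabeledImage = LabeledData<Tensor<Float>, Tensor<Int32>>
```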

For more examples of the Epochs API in different dataset types, you can examine
[the other dataset wrappers](https://github.com/tensorflow/swift-models/tree/main/Datasets)
within the models repository.