
Commit 601ad66

Adding a guide to dataset wrappers (#606)
* Initial dataset documentation.
* Added an Epochs case study, based on the CIFAR-10 dataset.
* Minor formatting changes and note about the normalization parameters.
1 parent 7bbc9ed commit 601ad66

File tree

2 files changed: +144 -0 lines changed


docs/site/_book.yaml

Lines changed: 2 additions & 0 deletions
@@ -28,6 +28,8 @@ upper_tabs:
  - title: Debugging X10 issues
    path: /swift/guide/debugging_x10
  - heading: "Machine learning models"
+ - title: Datasets
+   path: /swift/guide/datasets
  - title: Model summaries
    path: /swift/guide/model_summary
  - title: Swift for TensorFlow model garden

docs/site/guide/datasets.md

Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
# Datasets

In many machine learning models, especially for supervised learning, datasets are a vital part of
the training process. Swift for TensorFlow provides wrappers for several common datasets within the
Datasets module in [the models repository](https://github.com/tensorflow/swift-models). These
wrappers ease the use of common datasets with Swift-based models and integrate well with
Swift for TensorFlow's generalized training loop.

## Provided dataset wrappers

These are the currently provided dataset wrappers within the models repository:

- [BostonHousing](https://github.com/tensorflow/swift-models/tree/main/Datasets/BostonHousing)
- [CIFAR-10](https://github.com/tensorflow/swift-models/tree/main/Datasets/CIFAR10)
- [MS COCO](https://github.com/tensorflow/swift-models/tree/main/Datasets/COCO)
- [CoLA](https://github.com/tensorflow/swift-models/tree/main/Datasets/CoLA)
- [ImageNet](https://github.com/tensorflow/swift-models/tree/main/Datasets/Imagenette)
- [Imagenette](https://github.com/tensorflow/swift-models/tree/main/Datasets/Imagenette)
- [Imagewoof](https://github.com/tensorflow/swift-models/tree/main/Datasets/Imagenette)
- [FashionMNIST](https://github.com/tensorflow/swift-models/tree/main/Datasets/MNIST)
- [KuzushijiMNIST](https://github.com/tensorflow/swift-models/tree/main/Datasets/MNIST)
- [MNIST](https://github.com/tensorflow/swift-models/tree/main/Datasets/MNIST)
- [MovieLens](https://github.com/tensorflow/swift-models/tree/main/Datasets/MovieLens)
- [Oxford-IIIT Pet](https://github.com/tensorflow/swift-models/tree/main/Datasets/OxfordIIITPets)
- [WordSeg](https://github.com/tensorflow/swift-models/tree/main/Datasets/WordSeg)

To use one of these dataset wrappers within a Swift project, add `Datasets` as a dependency to your
Swift target and import the module:

```swift
import Datasets
```

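For reference, in a Swift Package Manager manifest the dependency might be declared roughly as
follows. This is a minimal sketch: the package and target names (`MyTrainingProject`) are
placeholders, and you may prefer to pin to a specific revision rather than the `main` branch.

```swift
// swift-tools-version:5.3
import PackageDescription

let package = Package(
  name: "MyTrainingProject",
  dependencies: [
    // The Datasets module is a product of the swift-models package.
    .package(name: "swift-models", url: "https://github.com/tensorflow/swift-models.git", .branch("main")),
  ],
  targets: [
    .target(
      name: "MyTrainingProject",
      dependencies: [
        .product(name: "Datasets", package: "swift-models"),
      ]),
  ]
)
```
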
Most dataset wrappers are designed to produce randomly shuffled batches of labeled data. For
example, to use the CIFAR-10 dataset, you first initialize it with the desired batch size:

```swift
let dataset = CIFAR10(batchSize: 100)
```

On first use, the Swift for TensorFlow dataset wrappers will automatically download the original
dataset for you, extract and parse all relevant archives, and then store the processed dataset in a
user-local cache directory. Subsequent uses of the same dataset will load directly from the local
cache.

To set up a manual training loop involving this dataset, you'd use something like the following:

```swift
for (epoch, epochBatches) in dataset.training.prefix(100).enumerated() {
  Context.local.learningPhase = .training
  ...
  for batch in epochBatches {
    let (images, labels) = (batch.data, batch.label)
    ...
  }
}
```

The above sets up an iterator over 100 epochs (`.prefix(100)`); each iteration yields the epoch's
numerical index and a lazily mapped sequence over the shuffled batches that make up that epoch.
Within each training epoch, the batches are iterated over and extracted for processing. In the case
of the `CIFAR10` dataset wrapper, each batch is a
[`LabeledImage`](https://github.com/tensorflow/swift-models/blob/main/Datasets/ImageClassificationDataset.swift),
which provides a `Tensor<Float>` containing all images from that batch and a `Tensor<Int32>` with
their matching labels.

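Putting this together, a complete (if minimal) training loop might look like the following sketch.
The `SimpleClassifier` model here is purely illustrative and not part of the models repository; any
`Layer` that accepts CIFAR-10-sized images would work, including the models provided by
swift-models.

```swift
import Datasets
import TensorFlow

// A deliberately tiny classifier used only to make the example self-contained.
struct SimpleClassifier: Layer {
  var flatten = Flatten<Float>()
  var dense = Dense<Float>(inputSize: 32 * 32 * 3, outputSize: 10)

  @differentiable
  func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
    dense(flatten(input))
  }
}

let dataset = CIFAR10(batchSize: 100)
var model = SimpleClassifier()
var optimizer = SGD(for: model, learningRate: 0.01)

for (epoch, epochBatches) in dataset.training.prefix(5).enumerated() {
  Context.local.learningPhase = .training
  var totalLoss: Float = 0
  var batchCount = 0
  for batch in epochBatches {
    let (images, labels) = (batch.data, batch.label)
    // Compute the loss and its gradient with respect to the model parameters.
    let (loss, gradients) = valueWithGradient(at: model) { model -> Tensor<Float> in
      softmaxCrossEntropy(logits: model(images), labels: labels)
    }
    optimizer.update(&model, along: gradients)
    totalLoss += loss.scalarized()
    batchCount += 1
  }
  print("Epoch \(epoch): average loss \(totalLoss / Float(batchCount))")
}
```
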
In the case of CIFAR-10, the entire dataset is small and can be loaded into memory at one time, but
for other larger datasets batches are loaded lazily from disk and processed at the point where each
batch is obtained. This prevents memory exhaustion with those larger datasets.

## The Epochs API

Most of these dataset wrappers are built on a shared infrastructure that we've called the
[Epochs API](https://github.com/tensorflow/swift-apis/tree/main/Sources/TensorFlow/Epochs). Epochs
provides flexible components intended to support a wide variety of dataset types, from text to
images and more.

If you wish to create your own Swift dataset wrapper, you'll most likely want to use the Epochs API
to do so. However, for common cases, such as image classification datasets, we highly recommend
starting from a template based on one of the existing dataset wrappers and modifying that to meet
your specific needs.

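To see the core pieces in isolation before walking through a real wrapper, here is a minimal,
self-contained sketch that drives `TrainingEpochs` directly over a toy in-memory collection. The
sample type and the collation step below are illustrative assumptions, not code from the models
repository:

```swift
import TensorFlow

// Toy samples: random 32x32x3 "images" with random labels in 0..<10.
let samples: [(data: Tensor<Float>, label: Int32)] = (0..<512).map { _ in
  (data: Tensor<Float>(randomNormal: [32, 32, 3]),
   label: Int32.random(in: 0..<10))
}

// An infinite sequence of epochs, each a sequence of shuffled batches of 32 samples.
// The system random number generator is used for shuffling by default.
let epochs = TrainingEpochs(samples: samples, batchSize: 32)

for (epochIndex, batches) in epochs.prefix(3).enumerated() {
  for batch in batches {
    // Each `batch` is a collection of samples; collate it into batched tensors.
    let images = Tensor(stacking: batch.map(\.data))
    let labels = Tensor<Int32>(batch.map(\.label))
    _ = (images, labels)  // Placeholder for the actual training step.
  }
  print("Finished epoch \(epochIndex)")
}
```
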
As an example, let's examine the CIFAR-10 dataset wrapper and how it works. The core of the
training dataset is defined here:

```swift
let trainingSamples = loadCIFARTrainingFiles(in: localStorageDirectory)
training = TrainingEpochs(samples: trainingSamples, batchSize: batchSize, entropy: entropy)
  .lazy.map { (batches: Batches) -> LazyMapSequence<Batches, LabeledImage> in
    return batches.lazy.map{
      makeBatch(samples: $0, mean: mean, standardDeviation: standardDeviation, device: device)
    }
  }
```

The result from the `loadCIFARTrainingFiles()` function is an array of
`(data: [UInt8], label: Int32)` tuples for each image in the training dataset. This is then
provided to `TrainingEpochs(samples:batchSize:entropy:)` to create an infinite sequence of epochs
with batches of `batchSize`. You can provide your own random number generator in cases where you
may want deterministic batching behavior, but by default the `SystemRandomNumberGenerator` is used.

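As an assumed example of deterministic batching, a seeded generator such as swift-apis'
`ARC4RandomNumberGenerator` could be passed in place of the system generator, reusing the
`trainingSamples` from the snippet above:

```swift
// Seeding the generator makes the shuffling (and therefore the batch contents
// and their order) reproducible from run to run.
let seededEntropy = ARC4RandomNumberGenerator(seed: 42)
let deterministicTraining = TrainingEpochs(
  samples: trainingSamples, batchSize: batchSize, entropy: seededEntropy)
```
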
From there, lazy maps over the batches culminate in the
`makeBatch(samples:mean:standardDeviation:device:)` function. This is a custom function where the
actual image processing pipeline for the CIFAR-10 dataset is located, so let's take a look at that:

```swift
fileprivate func makeBatch<BatchSamples: Collection>(
  samples: BatchSamples, mean: Tensor<Float>?, standardDeviation: Tensor<Float>?, device: Device
) -> LabeledImage where BatchSamples.Element == (data: [UInt8], label: Int32) {
  let bytes = samples.lazy.map(\.data).reduce(into: [], +=)
  let images = Tensor<UInt8>(shape: [samples.count, 3, 32, 32], scalars: bytes, on: device)

  var imageTensor = Tensor<Float>(images.transposed(permutation: [0, 2, 3, 1]))
  imageTensor /= 255.0
  if let mean = mean, let standardDeviation = standardDeviation {
    imageTensor = (imageTensor - mean) / standardDeviation
  }

  let labels = Tensor<Int32>(samples.map(\.label), on: device)
  return LabeledImage(data: imageTensor, label: labels)
}
```

The first two lines of this function concatenate all `data` bytes from the incoming `BatchSamples`
into a `Tensor<UInt8>` that matches the byte layout of the images within the raw CIFAR-10 dataset.
Next, the image channels are reordered to match those expected in our standard image classification
models, and the image data is re-cast into a `Tensor<Float>` for model consumption.

Optional normalization parameters can be provided to further adjust image channel values, a process
that is common when training many image classification models. The normalization parameter
`Tensor`s are created once at dataset initialization and then passed into `makeBatch()` as an
optimization to prevent the repeated creation of small temporary tensors with the same values.

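As a sketch, those parameters are just small per-channel tensors, built once on the target device
and reused for every batch. The values below are the commonly cited CIFAR-10 channel statistics,
not necessarily the exact constants used by the wrapper:

```swift
// Approximate per-channel mean and standard deviation for CIFAR-10,
// created once (on the dataset's target device) at initialization.
let mean = Tensor<Float>([0.4914, 0.4822, 0.4465], on: device)
let standardDeviation = Tensor<Float>([0.2470, 0.2435, 0.2616], on: device)
```
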
Finally, the integer labels are placed in a `Tensor<Int32>` and the image / label tensor pair is
returned in a `LabeledImage`. A `LabeledImage` is a specific case of
[`LabeledData`](https://github.com/tensorflow/swift-models/blob/main/Support/LabeledData.swift), a
struct with data and labels that conform to the Epochs API's
[`Collatable`](https://github.com/tensorflow/swift-apis/blob/main/Sources/TensorFlow/Epochs/Collatable.swift) protocol.

For more examples of the Epochs API in different dataset types, you can examine
[the other dataset wrappers](https://github.com/tensorflow/swift-models/tree/main/Datasets)
within the models repository.
