Description
Hi,
I followed the README to run the LmCloudSpmd2BTest example on TPUv4, but the model never loads. This is the output of saxutil ls /sax/test/lm2b on the admin machine:
INFO: Running command line: bazel-bin/saxml/bin/saxutil_/saxutil '--sax_root=gs://saxml-data/sax-root' ls /sax/test/lm2b
+-------+-------------------------------------------------------+-----------------+---------------+---------------------------+
| MODEL | MODEL PATH | CHECKPOINT PATH | # OF REPLICAS | (SELECTED) REPLICAADDRESS |
+-------+-------------------------------------------------------+-----------------+---------------+---------------------------+
| lm2b | saxml.server.pax.lm.params.lm_cloud.LmCloudSpmd2BTest | None | 0 | |
+-------+-------------------------------------------------------+-----------------+---------------+---------------------------+
+--------+-----+
| METHOD | ACL |
+--------+-----+
+--------+-----+
Here are the commands I used to start the admin and model server.
On admin:
bazel run saxml/bin:admin_config -- --sax_cell=/sax/test --sax_root=gs://saxml-data/sax-root --fs_root=gs://saxml-data/sax-fs-root --alsologtostderr
bazel run saxml/bin:admin_server -- --sax_cell=/sax/test --sax_root=gs://saxml-data/sax-root --port=10000 --alsologtostderr
saxutil publish /sax/test/lm2b saxml.server.pax.lm.params.lm_cloud.LmCloudSpmd2BTest None 1
I0630 04:24:08.036908 19996 ipaddr.go:56] IPNet address 10.128.0.71
I0630 04:24:08.212039 19996 admin.go:305] Loaded config: fs_root: "gs://saxml-data/sax-fs-root"
I0630 04:24:08.248588 19996 addr.go:105] SetAddr /gcs/saxml-data/sax-root/sax/test/location.proto "10.128.0.71:10000"
I0630 04:24:08.298355 19996 admin.go:325] Updated config: fs_root: "gs://saxml-data/sax-fs-root"
I0630 04:24:08.455680 19996 mgr.go:781] Loaded manager state
I0630 04:24:08.455819 19996 mgr.go:784] Refreshing manager state every 10s
I0630 04:24:08.455895 19996 admin.go:350] Starting the server on port 10000
I0630 04:24:08.455957 19996 cloud.go:480] Starting the HTTP server on port 8080
I0630 14:22:11.800066 19996 state.go:456] Starting a queue that drains pending model server actions
I0630 14:22:11.800149 19996 state.go:473] Initializing state from model server 10.130.0.4:10001
I0630 14:22:11.810371 19996 state.go:479] Refreshing model server state every 10s
I0630 14:29:54.329640 19996 mgr.go:134] Published with overrides: map[]
On model server:
bazel run saxml/server:server -- --sax_cell=/sax/test --port=10001 --platform_chip=tpuv4 --platform_topology=2x2x1 --alsologtostderr
I0630 14:22:09.754312 139843449665280 model_service_base.py:852] Started joining SAX cell /sax/test
ERROR: logging before flag.Parse: I0630 14:22:11.754970 223228 location.go:141] Calling Join due to address update
ERROR: logging before flag.Parse: I0630 14:22:11.814963 223228 location.go:155] Joined 10.128.0.71:10000
ERROR: logging before flag.Parse: I0630 14:37:11.758835 223228 location.go:162] Calling Join at fixed interval
ERROR: logging before flag.Parse: I0630 14:37:11.814902 223228 addr.go:72] FetchAddr /gcs/saxml-data/sax-root/sax/test/location.proto "10.128.0.71:10000"
ERROR: logging before flag.Parse: I0630 14:37:11.843650 223228 location.go:172] Joined 10.128.0.71:10000
I waited a while and ran saxutil ls /sax/test/lm2b again, but the "(SELECTED) REPLICAADDRESS" column is still empty and the replica count is still 0. Any idea what might have gone wrong?
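To avoid re-checking the table by eye, the saxutil ls output could be parsed with a small script. This is only a sketch: it assumes the column headers are exactly as pasted above, and `parse_model_table` and `has_loaded_replica` are hypothetical helper names, not part of saxutil.

```python
# Sample text copied from the `saxutil ls /sax/test/lm2b` output above.
SAMPLE = """\
+-------+-------------------------------------------------------+-----------------+---------------+---------------------------+
| MODEL | MODEL PATH | CHECKPOINT PATH | # OF REPLICAS | (SELECTED) REPLICAADDRESS |
+-------+-------------------------------------------------------+-----------------+---------------+---------------------------+
| lm2b | saxml.server.pax.lm.params.lm_cloud.LmCloudSpmd2BTest | None | 0 | |
+-------+-------------------------------------------------------+-----------------+---------------+---------------------------+
+--------+-----+
| METHOD | ACL |
+--------+-----+
+--------+-----+
"""


def parse_model_table(output):
    """Return the rows of the model table as dicts keyed by column header."""
    header = None
    rows = []
    for raw in output.splitlines():
        line = raw.strip()
        # Skip borders ("+---+") and any log lines around the table.
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if header is None:
            # Assumption: the model table is the one with a "MODEL PATH" column.
            if "MODEL PATH" in cells:
                header = cells
            continue
        if len(cells) != len(header):
            break  # a different table (here METHOD/ACL) has started
        rows.append(dict(zip(header, cells)))
    return rows


def has_loaded_replica(row):
    """True if the row reports at least one replica and a replica address."""
    return int(row["# OF REPLICAS"]) > 0 and bool(row["(SELECTED) REPLICAADDRESS"])


if __name__ == "__main__":
    for row in parse_model_table(SAMPLE):
        print(row["MODEL"], "loaded" if has_loaded_replica(row) else "not loaded")
```

In practice the text would come from running saxutil ls (for example through `subprocess.run`) in a loop with a sleep between attempts, instead of from `SAMPLE`.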
I also noticed that the build on the model server takes a very long time. The first run of bazel run saxml/server:server -- --sax_cell=/sax/test --port=10001 --platform_chip=tpuv4 --platform_topology=2x2x1 --alsologtostderr took about 4.5 hours (16268 s) to finish:
Target //saxml/server:server up-to-date:
bazel-bin/saxml/server/server.py
bazel-bin/saxml/server/server
INFO: Elapsed time: 16268.138s, Critical Path: 16222.45s
INFO: 5113 processes: 19 internal, 5091 linux-sandbox, 3 local.
INFO: Build completed successfully, 5113 total actions
INFO: Running command line: bazel-bin/saxml/server/server '--sax_cell=/sax/test' '--port=10001' '--platform_chip=tpuv4' '--platform_topology=2x2x1' --alsologtostderr
Subsequent runs took only a few seconds. Is this expected behavior?
Thanks!