[WIP] Improve performance of opam update/init by changing the structure of the internal http opam repositories (use the tar.gz as-is) #6625
Hey, thanks for your work on that. Since you asked me directly
This should be fine from the design point of view for conex. Conex will need to interject between step 0 (you downloaded the tarball) and 2 (remove the old tar.gz). Currently, conex requires a diff file on disk, and the old repository as a directory. But we can revise that interface, and conex could as well work on two tar.gz (and/or on two directories). I guess you have a clear understanding of the current update process, and since you mention the different kinds (http, git, local) -- maybe we should re-think how opam and conex should interact to avoid the burden paid by people not using conex, and avoid the burden of duplicating computations in both opam and conex. The latter may need to include conex as a library in opam. I'm away for the next 10 days (back on August 10th), but am happy to discuss this afterwards - especially since I plan to revive my work on conex thereafter.
To be more precise, given that
Then conex could do what is needed (compute the set of changed opam files, verify signatures; exit 0 on success); and could even report back the set of changed files to opam (I suspect this is what #6614 depends on) - using a file, or a socket, or, if integrated with opam, something much simpler (shared memory). For opam itself, I guess that #6349 and #5553 will improve a lot of updates already.
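A minimal sketch of the "compute the set of changed opam files" step, assuming both repository snapshots have already been loaded into maps from path to file content (whether they came from two directories or from two tar.gz files). The names `SMap`, `change` and `changed_files` are hypothetical and are not part of conex or opam:

```ocaml
module SMap = Map.Make (String)

(* Hypothetical classification of a path between two snapshots. *)
type change = Added | Removed | Modified

(* Both arguments map a path to the content of the file at that path.
   Paths whose content is identical in both snapshots are dropped. *)
let changed_files (old_repo : string SMap.t) (new_repo : string SMap.t)
  : change SMap.t =
  SMap.merge
    (fun _path before after ->
       match before, after with
       | None, Some _ -> Some Added
       | Some _, None -> Some Removed
       | Some a, Some b when not (String.equal a b) -> Some Modified
       | _ -> None)
    old_repo new_repo
```

The resulting map is what could be handed back to opam, by whichever channel ends up being chosen.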
Some update: I've been splitting this PR into several smaller ones to make working on it collaboratively, rebasing it and cleaning it up, easier. So far:
Once these are done, i expect this PR to be fairly small, self-contained and easier to review.
if OpamRepositoryRoot.Tar.exists tar then
  (let target = OpamRepositoryRoot.Tar.backup ~tmp_dir tar in
   OpamRepositoryRoot.Tar.copy ~src:tar ~dst:target;
   fun () -> OpamRepositoryRoot.Tar.copy ~src:target ~dst:tar)
else
  (let dir = OpamRepositoryPath.root gt.root name in
   if not (OpamFilename.exists_dir dir) then
   if not (OpamRepositoryRoot.Dir.exists dir) then
     OpamConsole.error_and_exit `Internal_error
       "Repository not found, consider running 'opam update %s' \
        to retrieve a consistent state."
       (OpamRepositoryName.to_string name);
   let target =
     OpamFilename.(Op.(tmp_dir / Base.to_string (basename_dir dir)))
   in
   OpamFilename.copy_dir ~src:dir ~dst:target;
   fun () -> OpamFilename.copy_dir ~src:target ~dst:dir)
   let target = OpamRepositoryRoot.Dir.backup ~tmp_dir dir in
   OpamRepositoryRoot.Dir.copy ~src:dir ~dst:target;
   fun () -> OpamRepositoryRoot.Dir.copy ~src:target ~dst:dir)
Why not directly use OpamRepositoryRoot.exists and OpamRepositoryRoot.copy?
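A hedged sketch of what that suggestion could look like: one variant type with a single `exists` and a single backup function that dispatch on the kind of root, instead of two parallel branches. The type `root`, the function `backup` and the injected `copy_dir`/`copy_file` parameters are hypothetical; the real `OpamRepositoryRoot` interface may differ:

```ocaml
(* Hypothetical stand-in for OpamRepositoryRoot.t. *)
type root = Dir of string | Tar of string

let exists = function
  | Dir d -> Sys.file_exists d && Sys.is_directory d
  | Tar f -> Sys.file_exists f && not (Sys.is_directory f)

(* The copy primitives are passed in so that this sketch does not have to
   reimplement recursive directory copy. Returns the restore function. *)
let backup ~copy_dir ~copy_file ~tmp_dir root =
  let in_tmp path = Filename.concat tmp_dir (Filename.basename path) in
  let target =
    match root with
    | Dir d -> Dir (in_tmp d)
    | Tar f -> Tar (in_tmp f)
  in
  let copy ~src ~dst =
    match src, dst with
    | Dir s, Dir d -> copy_dir ~src:s ~dst:d
    | Tar s, Tar d -> copy_file ~src:s ~dst:d
    | _ -> invalid_arg "copy: mismatched root kinds"
  in
  copy ~src:root ~dst:target;
  fun () -> copy ~src:target ~dst:root
```

With this shape the caller only has to check `exists root` once and keep the closure returned by `backup`.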
src/state/opamUpdate.ml
Outdated
let exception Found of string in
try
  OpamTar.fold_reg_files (fun () filename content ->
      if filename = "/repo" then
        raise (Found content);
    ) () (Unix.openfile (OpamRepositoryRoot.Tar.to_string tar) [Unix.O_RDONLY] 0);
  OpamFile.Repo.empty
with Found content -> OpamFile.Repo.read_from_string content
This logic is used twice in the code, worth having a function that does it
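One way the duplicated lookup could be factored out, as a sketch only. Here `fold_reg_files` is a hypothetical stand-in that folds over an in-memory list of (name, content) pairs rather than the PR's `OpamTar.fold_reg_files`, and `find_file` is a made-up name:

```ocaml
(* Hypothetical stand-in for OpamTar.fold_reg_files: folds over
   (filename, content) pairs instead of a real archive. *)
let fold_reg_files f acc entries =
  List.fold_left (fun acc (name, content) -> f acc name content) acc entries

(* Return the content of the first entry called [name], stopping the fold
   early with a local exception, as the snippet under review does. *)
let find_file ~name entries =
  let exception Found of string in
  try
    fold_reg_files
      (fun () filename content ->
         if String.equal filename name then raise (Found content))
      () entries;
    None
  with Found content -> Some content
```

Both call sites could then share `find_file ~name:"/repo"` and only differ in how they treat the `None` case.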
src/state/opamUpdate.ml
Outdated
  OpamRepositoryState.load_opams_from_dir repo.repo_name dir
| OpamRepositoryRoot.Tar tar ->
  OpamRepositoryState.load_opams_from_tar_gz repo.repo_name tar
Better to have an OpamRepositoryState.load_opams that does that match
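A sketch of such a dispatcher, assuming the two existing loaders keep their current shape. The `root` type is a hypothetical stand-in for `OpamRepositoryRoot.t`, and the loaders are passed as parameters so the example stays self-contained:

```ocaml
(* Hypothetical stand-in for OpamRepositoryRoot.t. *)
type root = Dir of string | Tar of string

(* The match lives in one place; callers no longer need to know which
   kind of root they are holding. *)
let load_opams ~from_dir ~from_tar_gz repo_name = function
  | Dir dir -> from_dir repo_name dir
  | Tar tar -> from_tar_gz repo_name tar
```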
src/state/opamUpdate.ml
Outdated
begin match repo_root with
  | OpamRepositoryRoot.Dir _ ->
    if OpamRepositoryRoot.Tar.exists tarred_repo then
      OpamRepositoryRoot.Tar.remove tarred_repo;
  | OpamRepositoryRoot.Tar _ ->
    if OpamRepositoryRoot.Dir.exists local_dir then
      OpamRepositoryRoot.Dir.remove local_dir;
Same here: why not use the abstracted logic?
src/state/opamUpdate.ml
Outdated
let tarred_repo = OpamRepositoryPath.tar gt.root repo.repo_name in
(if OpamRepositoryConfig.(!r.repo_tarring) then
   OpamFilename.make_tar_gz_job tarred_repo repo_root
 else Done None)
@@+ function
| Some e ->
  OpamStd.Exn.fatal e;
  Printf.ksprintf failwith
    "Failed to regenerate local repository archive: %s"
    (Printexc.to_string e)
| None ->
  let opams =
    OpamRepositoryState.load_opams_from_dir repo.repo_name repo_root
  in
  let local_dir = OpamRepositoryPath.root gt.root repo.repo_name in
  if OpamRepositoryConfig.(!r.repo_tarring) then
    (if OpamFilename.exists_dir local_dir then
       (* Mark the obsolete local directory for deletion once we complete: it's
          no longer needed once we have a tar.gz *)
       Hashtbl.add rt.repos_tmp repo.repo_name (lazy local_dir))
  else if OpamFilename.exists tarred_repo then
    (OpamFilename.move_dir ~src:repo_root ~dst:local_dir;
     OpamFilename.remove tarred_repo);
  Done (Some (
      (* Return an update function to make parallel execution possible *)
      fun rt ->
        { rt with
          repositories =
            OpamRepositoryName.Map.add repo.repo_name repo rt.repositories;
          repos_definitions =
            OpamRepositoryName.Map.add repo.repo_name repo_file
              rt.repos_definitions;
          repo_opams =
            OpamRepositoryName.Map.add repo.repo_name opams rt.repo_opams;
        }
    ))
let local_dir = OpamRepositoryPath.root gt.root repo.repo_name in
Why do you retrieve both?
bd213b6 to 43496a0
43496a0 to 0e3dd1b
4f6ac7c to 87a09d7
let write (fd, t) =
  let to_buffer buf t =
    let rec run : type a. Buffer.t -> (a, 'err, _) Tar.t -> a = fun buf -> function
      | Tar.Write str -> Buffer.add_string buf str
      | Tar.Read _ | Tar.Really_read _ | Tar.Seek _ | Tar.High _ -> assert false
      | Tar.Return (Ok value) -> value
      | Tar.Return (Error _) -> failwith "something went wrong"
      | Tar.Bind (x, f) -> run buf (f (run buf x))
    in
    run buf t
  in
  let entries =
    let x =
      Map.fold (fun path content acc ->
          let hdr =
            Tar.Header.make ~file_mode:0 ~mod_time:0L ~user_id:0 ~group_id:0
              path (Int64.of_int (String.length content))
          in
          (Some Tar.Header.Ustar, hdr, fun () -> Tar.return (Ok (Some content))) :: acc)
        t []
    in
    let r = ref x in
    fun () ->
      match !r with
      | [] -> Tar.return (Ok None)
      | x::xs -> r := xs; Tar.return (Ok (Some x))
  in
  let t = Tar.out ~level:Ustar entries in
  let t = Tar_gz.out_gzipped ~level:4 ~mtime:0l Gz.Unix t in
  let buf = Buffer.create 10_485_760 in
  to_buffer buf t;
  let str = Buffer.contents buf in
  let _ : int = Unix.write_substring fd str 0 (String.length str) in
  ()
@hannesm @dinosaure @reynir is there a way to write this in a cleaner way? I find the ocaml-tar API to be quite confusing and most of the examples i could find use Lwt_stream.
let entries =
  let x =
    Map.fold (fun path content acc ->
        let hdr =
          Tar.Header.make ~file_mode:0 ~mod_time:0L ~user_id:0 ~group_id:0
            path (Int64.of_int (String.length content))
        in
        (Some Tar.Header.Ustar, hdr, fun () -> Tar.return (Ok (Some content))) :: acc)
      t []
  in
  let r = ref x in
  fun () ->
    match !r with
    | [] -> Tar.return (Ok None)
    | x::xs -> r := xs; Tar.return (Ok (Some x))
in
I think this can be Map.to_seq t |> Seq.map ... |> Seq.to_dispenser which may be slightly cleaner
Thanks! Yeah it is a bit clearer although it is mostly moving the part dealing with the reference to a separate function (that we have to reimplement for compatibility with OCaml 4.08).
It's obviously fine for us at the moment but i would suggest to improve the ocaml-tar API in the future. Maybe allowing a list or having some sort of buffer would be much more efficient for this kind of use-case i feel like
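For reference, the `Map.to_seq t |> Seq.map ... |> Seq.to_dispenser` shape discussed above could be sketched as follows, with a local `to_dispenser` because the stdlib one only exists from OCaml 4.14. The entry tuple built here is a simplified placeholder (path, size, content), not the ocaml-tar header type, and `SMap`/`entries` are hypothetical names:

```ocaml
module SMap = Map.Make (String)

(* Fallback for compilers older than 4.14, which lack Seq.to_dispenser:
   each call returns the next element, then None once exhausted. *)
let to_dispenser (seq : 'a Seq.t) : unit -> 'a option =
  let r = ref seq in
  fun () ->
    match !r () with
    | Seq.Nil -> None
    | Seq.Cons (x, rest) -> r := rest; Some x

(* Placeholder entries: the real code would build a Tar header here. *)
let entries (files : string SMap.t) =
  SMap.to_seq files
  |> Seq.map (fun (path, content) -> (path, String.length content, content))
  |> to_dispenser
```

Note that this yields entries in ascending key order, whereas the `Map.fold` version that conses onto a list yields them in descending order.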
28effa2 to e73b5b6
Fixes #5741
Fixes #5648
Fixes #5484
Fixes #5346
Fixes #5559
Fixes #3050
cc @hannesm to check if it works for conex (2.1 worked with tar.gz files already so i'm not too scared about breakages)
Reasoning
#5741 shows that assumptions that hold true "most of the time" on some Unix platforms, such as "it is ok to scan a large tree structure of files and directories", don't hold true on other platforms. Systems such as Windows, network filesystems, busy shared servers whose disk is constantly in use, hard drives, … suffer from this.
In opam we can have 3 types of repositories:
Out of the three, the most critical for first-time users is the first one. It is also the one that suffers the most from these issues as currently we:
opam update: load only changed opam files #6614)
VCS do not have step 4. Steps 1, 2 and 3 are builtin and heavily optimized. Only step 5 is left, which should be improved by #6614 and which we can improve further later by using `git cat-file` or even by parsing PACK files using `ocaml-git`.
Local/ssh repositories are the ones left with very few things we can do about them. #5966 should help, but beyond that, we might want to require that people use git even for local repositories.
For HTTP though, the untarring (which takes 1+ minute) is the main issue. Thus this here PR.
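To illustrate what reading files straight out of an archive involves, here is a minimal pure-OCaml sketch that folds over the regular files of an uncompressed ustar archive held in memory. It is not the PR's implementation: it does not use `ocaml-tar`, does no gzip decompression, ignores the ustar `prefix` field and does not verify checksums. The names `fold_reg_files` and `make_entry` are chosen for this example only:

```ocaml
let block = 512

(* Header fields are NUL-padded; keep what precedes the first NUL. *)
let field archive off len =
  let raw = String.sub archive off len in
  match String.index_opt raw '\000' with
  | Some i -> String.sub raw 0 i
  | None -> raw

(* Sizes are stored as octal text. *)
let octal s =
  match String.trim s with
  | "" -> 0
  | digits -> int_of_string ("0o" ^ digits)

(* Call [f acc name content] for every regular file, in archive order. *)
let fold_reg_files f acc archive =
  let len = String.length archive in
  let rec go acc pos =
    if pos + block > len then acc
    else
      let name = field archive pos 100 in
      if name = "" then acc (* zero block: end of archive *)
      else
        let size = octal (field archive (pos + 124) 12) in
        let typeflag = archive.[pos + 156] in
        let data = pos + block in
        let padded = (size + block - 1) / block * block in
        let acc =
          if (typeflag = '0' || typeflag = '\000') && data + size <= len
          then f acc name (String.sub archive data size)
          else acc
        in
        go acc (data + padded)
  in
  go acc 0

(* Build one archive member; only used to exercise the fold. *)
let make_entry name content =
  let hdr = Bytes.make block '\000' in
  Bytes.blit_string name 0 hdr 0 (String.length name);
  let size = Printf.sprintf "%011o" (String.length content) in
  Bytes.blit_string size 0 hdr 124 (String.length size);
  Bytes.set hdr 156 '0';
  let pad = (block - String.length content mod block) mod block in
  Bytes.to_string hdr ^ content ^ String.make pad '\000'
```

The point of the sketch is that nothing is written to disk: each file is visited as a substring of the archive, so the cost of creating a large tree of files and directories disappears.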
Design decisions
Instead of untarring we simply use the tar.gz as-is and use `ocaml-tar` to read it on the fly.
The new update steps are:
opam update: load only changed opam files #6614)
Given the ubiquity of the use of `OpamFilename.Dir.t` to mean both any random directory and a repository directory, i chose to first abstract over it in a new `OpamRepositoryRoot` module, and work with the help of the type checker from there. Its interface helps to see which actions opam performs on repositories. While i'd rather keep them, the `Tar` and `Dir` submodules can be removed when everything is done.
The `REPOSITORYTARRING` environment variable is removed by this work, given the repositories are tarred already.
I had to simplify `opam var pkg:opamfile` for this work. Previously it would point to the file in the repository. However this isn't what it's supposed to be doing. Instead it should point to the `<switch>/.opam-switch/packages/` directory, which actually reflects the opam file that was used to install the package. Otherwise the opam file can change between before and after the user has called `opam update` etc.
TODO
There are a number of `assert false (* TODO *)` in this draft PR. Those are to be fixed before undrafting but i felt reasonably confident with the rest of them to open this draft PR in this state to put more eyes on this work and to increase my self-motivation.
`OpamRepositoryState.get_repo_files`: a function which extracts a limited number of files from the tar.gz to a new cache directory.
Some of these changes should probably be extracted to separate PRs but let's do that at the end when we have something that actually works.
While an early form of this work started a year and a half ago, i believe the crust left over from that time should be minimal, after 6 different branches. The final rebase and split into smaller PRs shouldn't be too painful.
Future work
In the future we can use `ocaml-tar`, which we now depend on, to replace some of the uses of the `tar` command. This should allow us to have better behaviours with things like symlinks on Windows or even add new features such as excluding some directories (see ocaml/ocaml#14152).
As mentioned above we can also improve local and git/vcs repositories with or without `ocaml-git`.