@kit-ty-kate
Member

kit-ty-kate commented Jul 31, 2025

Fixes #5741
Fixes #5648
Fixes #5484
Fixes #5346
Fixes #5559
Fixes #3050
cc @hannesm to check if it works for conex (2.1 worked with tar.gz files already so i'm not too scared about breakages)

Reasoning

#5741 shows that assumptions that hold "most of the time" on some Unix platforms, such as "it is ok to scan a large tree structure of files and directories", do not hold on other platforms. Windows, network filesystems, busy shared servers whose disks are constantly in use, hard drives, … all suffer from this.

In opam we can have 3 types of repositories:

  1. HTTP (the default): where we download a tar.gz
  2. VCS (aka. mostly git these days): where we use the vcs command line to get the files
  3. local/ssh: where we either use our own copy primitives or use rsync

Out of the three, the most critical for first-time users is the first one. It is also the one that suffers the most from these issues, as currently we:

  1. untar it
  2. diff with the current repository (used for conex and for #6614, "opam update: load only changed opam files")
  3. patch the changed files
  4. remove the directory
  5. rescan the whole repository

VCS repositories do not have step 4, and steps 1, 2 and 3 are built in and heavily optimized. Only step 5 is left, which should be improved by #6614 and which we can improve further later by using git cat-file or even by parsing PACK files using ocaml-git.

Local/ssh repositories are the ones left behind, as there is very little we can do about them. #5966 should help, but beyond that we might want to require that people use git even for local repositories.

For HTTP though, the untarring (which takes 1+ minute) is the main issue. Hence this PR.

Design decisions

Instead of untarring we simply use the tar.gz as-is and use ocaml-tar to read it on the fly.
The new update steps are:

  1. diff the two tar.gz
  2. remove the old tar.gz
  3. move the new tar.gz in its place
  4. scan what has changed (requires #6614, "opam update: load only changed opam files")
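
To make steps 1 and 4 concrete, here is a minimal sketch of the diffing idea, assuming each archive has already been read into a map from file path to file content. The names `diff_snapshots`, `change` and the sample paths are hypothetical and are not part of this PR; the real implementation would work on the tar.gz entries directly.

```ocaml
(* Hypothetical model: a repository snapshot is a map from file path to file
   content, e.g. as obtained by folding over the entries of a tar.gz. *)
module SMap = Map.Make (String)

type change =
  | Added of string
  | Removed of string
  | Changed of string

(* Compare two snapshots and list the paths that differ, in path order. *)
let diff_snapshots (old_repo : string SMap.t) (new_repo : string SMap.t) =
  SMap.merge
    (fun path old_content new_content ->
       match old_content, new_content with
       | None, Some _ -> Some (Added path)
       | Some _, None -> Some (Removed path)
       | Some o, Some n when not (String.equal o n) -> Some (Changed path)
       | _ -> None)
    old_repo new_repo
  |> SMap.bindings
  |> List.map snd

(* Made-up sample data, for illustration only. *)
let old_repo =
  SMap.empty
  |> SMap.add "packages/a/a.1/opam" "opam-version: \"2.0\""
  |> SMap.add "packages/b/b.1/opam" "opam-version: \"2.0\""

let new_repo =
  SMap.empty
  |> SMap.add "packages/a/a.1/opam" "opam-version: \"2.0\"\nsynopsis: \"a\""
  |> SMap.add "packages/c/c.1/opam" "opam-version: \"2.0\""

let changes = diff_snapshots old_repo new_repo
```

Only the paths listed in `changes` would then need to be rescanned, which is the point of step 4.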

Given the ubiquity of the use of OpamFilename.Dir.t to mean both any random directory and a repository directory, i chose to first abstract over it in a new OpamRepositoryRoot module, and work with the help of the type checker from there. Its interface helps to see which actions opam performs on repositories. While i'd rather keep them, the Tar and Dir submodules can be removed when everything is done.
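
As an illustration of that kind of abstraction, here is a rough sketch of a root type that distinguishes a directory from a tarball and lets the type checker flag every place that assumed a plain directory. The module `Repository_root` and its functions are hypothetical stand-ins, not the actual OpamRepositoryRoot interface.

```ocaml
(* Hypothetical sketch, not the PR's actual OpamRepositoryRoot interface. *)
module Repository_root : sig
  (* Private constructors: callers can pattern-match on a root, but must
     build one through [dir] or [tar]. *)
  type t = private Dir of string | Tar of string

  val dir : string -> t
  val tar : string -> t
  val to_string : t -> string
  val exists : t -> bool
end = struct
  type t = Dir of string | Tar of string

  let dir path = Dir path
  let tar path = Tar path

  let to_string = function Dir path | Tar path -> path

  (* A directory root must be a directory; a tar root must be a plain file. *)
  let exists = function
    | Dir path -> Sys.file_exists path && Sys.is_directory path
    | Tar path -> Sys.file_exists path && not (Sys.is_directory path)
end

(* Callers dispatch on the kind of root instead of assuming a directory. *)
let describe root =
  match root with
  | Repository_root.Dir _ -> "directory"
  | Repository_root.Tar _ -> "tarball"
```

With a type like this, any function still expecting a bare directory path fails to compile until it handles both cases.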

The REPOSITORYTARRING environment variable is removed by this work, given the repositories are tarred already.

I had to simplify opam var pkg:opamfile for this work. Previously it would point to the file in the repository. However this isn't what it's supposed to be doing. Instead it should point to the <switch>/.opam-switch/packages/ directory, which actually reflects the opam file that was used to install the package. Otherwise the opam file can change between before and after the user has called opam update etc.

TODO

There are a number of assert false (* TODO *) in this draft PR. Those are to be fixed before undrafting, but i felt reasonably confident enough about the rest to open this draft PR in this state, to put more eyes on this work and to increase my self-motivation.

  • The main thing to do is to do the diff function between two tar files and between a directory and a tar.
  • The other thing is to fill OpamRepositoryState.get_repo_files: a function which extracts a limited number of files from the tar.gz to a new cache directory.

Some of these changes should probably be extracted to separate PRs but let's do that at the end when we have something that actually works.

While an early form of this work started a year and a half ago, i believe the cruft left over from that time should be minimal, after 6 different branches. The final rebase and split into smaller PRs shouldn't be too painful.

Future work

In the future we can use ocaml-tar that we now depend on to replace some of the uses of the tar command. This should allow us to have better behaviours with things like symlinks on windows or even add new features such as excluding some directories (see ocaml/ocaml#14152).

As mentioned above we can also improve local and git/vcs repositories with or without ocaml-git.

@hannesm
Member

hannesm commented Aug 1, 2025

Hey, thanks for your work on that. Since you asked me directly

cc @hannesm to check if it works for conex

Instead of untarring we simply use the tar.gz as-is and use ocaml-tar to read it on the fly.
The new update steps are:

diff the two tar.gz
remove the old tar.gz
move the new tar.gz in its place

This should be fine from the design point of view for conex.

Conex will need to interject between step 0 (you downloaded the tarball) and 2 (remove the old tar.gz). Currently, conex requires a diff file on disk, and the old repository as directory. But we can revise that interface, and conex could as well work on two tar.gz (and/or on two directories).

I guess you have a clear understanding of the update process currently, and since you mention the different kinds (http, git, local) -- maybe we should re-think how opam and conex should interact to avoid burden paid by people not using conex, and avoid the burden of duplicating computations in both opam and conex. The latter may need to include conex as a library into opam.

I'm away for the next 10 days (back on August 10th), but am happy to discuss this afterwards, especially since I plan to revive my work on conex thereafter.

@hannesm
Member

hannesm commented Aug 1, 2025

To be more precise, given that

  • (a) local repositories (rsync) aren't really worth to verify (under the assumption that whoever has access to the repository can as well install arbitrary packages) [which may be revisited if there's the NFS use case or a shared server]
  • (b) VCS (git): we could do the git fetch and provide conex with the local repository and the commit that the update should be to
  • (c) http: provide old and new tarball

Then conex could do what is needed (compute the set of changed opam files, verify signatures; exit 0 on success); and could even report back the set of changed files to opam (I suspect this is what #6614 depends on) - using a file, or a socket, or if integrated with opam, this will be much simpler (using shared memory).

For opam itself, I guess that #6349 and #5553 will improve a lot of updates already.

@kit-ty-kate
Member Author

Some update: I've been splitting this PR into several smaller ones to make working on it collaboratively, rebasing it and cleaning it up, easier.

So far:

Once these are done, i expect this PR to be fairly small, self-contained and easier to review.

Comment on lines +2509 to +2522
  if OpamRepositoryRoot.Tar.exists tar then
    (let target = OpamRepositoryRoot.Tar.backup ~tmp_dir tar in
     OpamRepositoryRoot.Tar.copy ~src:tar ~dst:target;
     fun () -> OpamRepositoryRoot.Tar.copy ~src:target ~dst:tar)
  else
    (let dir = OpamRepositoryPath.root gt.root name in
-    if not (OpamFilename.exists_dir dir) then
     if not (OpamRepositoryRoot.Dir.exists dir) then
       OpamConsole.error_and_exit `Internal_error
         "Repository not found, consider running 'opam update %s' \
          to retrieve a consistent state."
         (OpamRepositoryName.to_string name);
-    let target =
-      OpamFilename.(Op.(tmp_dir / Base.to_string (basename_dir dir)))
-    in
-    OpamFilename.copy_dir ~src:dir ~dst:target;
-    fun () -> OpamFilename.copy_dir ~src:target ~dst:dir)
     let target = OpamRepositoryRoot.Dir.backup ~tmp_dir dir in
     OpamRepositoryRoot.Dir.copy ~src:dir ~dst:target;
     fun () -> OpamRepositoryRoot.Dir.copy ~src:target ~dst:dir)
Collaborator

Why not use OpamRepositoryRoot.exists and OpamRepositoryRoot.copy directly?

Comment on lines 25 to 32
let exception Found of string in
try
  OpamTar.fold_reg_files (fun () filename content ->
      if filename = "/repo" then
        raise (Found content);
    ) () (Unix.openfile (OpamRepositoryRoot.Tar.to_string tar) [Unix.O_RDONLY] 0);
  OpamFile.Repo.empty
with Found content -> OpamFile.Repo.read_from_string content
Collaborator

This logic is used twice in the code; it is worth having a function that does it.
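
One possible shape for such a helper, as a sketch only: the fold below is a stand-in over a list of (filename, content) pairs rather than the PR's OpamTar.fold_reg_files, and `find_file` is a hypothetical name.

```ocaml
(* Stand-in for a fold over the regular files of an archive; here the
   "archive" is just a list of (filename, content) pairs. *)
let fold_reg_files f init (archive : (string * string) list) =
  List.fold_left
    (fun acc (filename, content) -> f acc filename content)
    init archive

(* Return the content of the first entry named [name], stopping the fold
   early with a local exception. *)
let find_file name archive =
  let exception Found of string in
  try
    fold_reg_files
      (fun () filename content ->
         if String.equal filename name then raise (Found content))
      () archive;
    None
  with Found content -> Some content
```

Both call sites could then match on the option, falling back to the empty repo file on `None` and parsing the content on `Some`.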

Comment on lines 142 to 144
OpamRepositoryState.load_opams_from_dir repo.repo_name dir
| OpamRepositoryRoot.Tar tar ->
OpamRepositoryState.load_opams_from_tar_gz repo.repo_name tar
Collaborator

Better to have an OpamRepositoryState.load_opams that does that match.

Comment on lines 146 to 143
begin match repo_root with
  | OpamRepositoryRoot.Dir _ ->
    if OpamRepositoryRoot.Tar.exists tarred_repo then
      OpamRepositoryRoot.Tar.remove tarred_repo;
  | OpamRepositoryRoot.Tar _ ->
    if OpamRepositoryRoot.Dir.exists local_dir then
      OpamRepositoryRoot.Dir.remove local_dir;
Collaborator

Same here: why not use the abstracted logic?

Comment on lines 125 to 126
  let tarred_repo = OpamRepositoryPath.tar gt.root repo.repo_name in
- (if OpamRepositoryConfig.(!r.repo_tarring) then
-    OpamFilename.make_tar_gz_job tarred_repo repo_root
-  else Done None)
- @@+ function
- | Some e ->
-   OpamStd.Exn.fatal e;
-   Printf.ksprintf failwith
-     "Failed to regenerate local repository archive: %s"
-     (Printexc.to_string e)
- | None ->
-   let opams =
-     OpamRepositoryState.load_opams_from_dir repo.repo_name repo_root
-   in
-   let local_dir = OpamRepositoryPath.root gt.root repo.repo_name in
-   if OpamRepositoryConfig.(!r.repo_tarring) then
-     (if OpamFilename.exists_dir local_dir then
-        (* Mark the obsolete local directory for deletion once we complete: it's
-           no longer needed once we have a tar.gz *)
-        Hashtbl.add rt.repos_tmp repo.repo_name (lazy local_dir))
-   else if OpamFilename.exists tarred_repo then
-     (OpamFilename.move_dir ~src:repo_root ~dst:local_dir;
-      OpamFilename.remove tarred_repo);
-   Done (Some (
-       (* Return an update function to make parallel execution possible *)
-       fun rt ->
-         { rt with
-           repositories =
-             OpamRepositoryName.Map.add repo.repo_name repo rt.repositories;
-           repos_definitions =
-             OpamRepositoryName.Map.add repo.repo_name repo_file
-               rt.repos_definitions;
-           repo_opams =
-             OpamRepositoryName.Map.add repo.repo_name opams rt.repo_opams;
-         }
-     ))
  let local_dir = OpamRepositoryPath.root gt.root repo.repo_name in
Collaborator

Why do you retrieve both?

@rjbou rjbou force-pushed the compressed-repo-format-6 branch from bd213b6 to 43496a0 Compare October 10, 2025 17:28
@kit-ty-kate kit-ty-kate force-pushed the compressed-repo-format-6 branch from 43496a0 to 0e3dd1b Compare October 11, 2025 22:36
@kit-ty-kate kit-ty-kate force-pushed the compressed-repo-format-6 branch from 4f6ac7c to 87a09d7 Compare October 12, 2025 22:29
Comment on lines 85 to 118
let write (fd, t) =
  let to_buffer buf t =
    let rec run : type a. Buffer.t -> (a, 'err, _) Tar.t -> a = fun buf -> function
      | Tar.Write str -> Buffer.add_string buf str
      | Tar.Read _ | Tar.Really_read _ | Tar.Seek _ | Tar.High _ -> assert false
      | Tar.Return (Ok value) -> value
      | Tar.Return (Error _) -> failwith "something went wrong"
      | Tar.Bind (x, f) -> run buf (f (run buf x))
    in
    run buf t
  in
  let entries =
    let x =
      Map.fold (fun path content acc ->
          let hdr =
            Tar.Header.make ~file_mode:0 ~mod_time:0L ~user_id:0 ~group_id:0
              path (Int64.of_int (String.length content))
          in
          (Some Tar.Header.Ustar, hdr, fun () -> Tar.return (Ok (Some content))) :: acc)
        t []
    in
    let r = ref x in
    fun () ->
      match !r with
      | [] -> Tar.return (Ok None)
      | x::xs -> r := xs; Tar.return (Ok (Some x))
  in
  let t = Tar.out ~level:Ustar entries in
  let t = Tar_gz.out_gzipped ~level:4 ~mtime:0l Gz.Unix t in
  let buf = Buffer.create 10_485_760 in
  to_buffer buf t;
  let str = Buffer.contents buf in
  let _ : int = Unix.write_substring fd str 0 (String.length str) in
  ()
Member Author

@hannesm @dinosaure @reynir is there a way to write this in a cleaner way? I find the ocaml-tar API to be quite confusing and most of the examples i could find use Lwt_stream.

Comment on lines 96 to 111
  let entries =
    let x =
      Map.fold (fun path content acc ->
          let hdr =
            Tar.Header.make ~file_mode:0 ~mod_time:0L ~user_id:0 ~group_id:0
              path (Int64.of_int (String.length content))
          in
          (Some Tar.Header.Ustar, hdr, fun () -> Tar.return (Ok (Some content))) :: acc)
        t []
    in
    let r = ref x in
    fun () ->
      match !r with
      | [] -> Tar.return (Ok None)
      | x::xs -> r := xs; Tar.return (Ok (Some x))
  in
in
Contributor

I think this can be Map.to_seq t |> Seq.map ... |> Seq.to_dispenser which may be slightly cleaner
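
A rough sketch of that pipeline, using only the standard library. The `to_dispenser` below is a hand-written equivalent of Seq.to_dispenser for compilers that do not provide it; the `entries` function and the (path, size) pairs are placeholders for illustration and stand in for the tar headers built in the PR.

```ocaml
module SMap = Map.Make (String)

(* Hand-written equivalent of Seq.to_dispenser for older compilers: each call
   returns the next element, then None once the sequence is exhausted. *)
let to_dispenser (seq : 'a Seq.t) : unit -> 'a option =
  let rest = ref seq in
  fun () ->
    match !rest () with
    | Seq.Nil -> None
    | Seq.Cons (x, next) -> rest := next; Some x

(* Placeholder for header construction: pair each path with its content size. *)
let entries (files : string SMap.t) =
  SMap.to_seq files
  |> Seq.map (fun (path, content) -> (path, String.length content))
  |> to_dispenser
```

Note that this walks the map in increasing key order, whereas consing onto an accumulator in a fold yields the reverse order; whether that matters depends on how the archive is consumed.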

Member Author

Thanks! Yeah it is a bit clearer although it is mostly moving the part dealing with the reference to a separate function (that we have to reimplement for compatibility with OCaml 4.08).

It's obviously fine for us at the moment but i would suggest to improve the ocaml-tar API in the future. Maybe allowing a list or having some sort of buffer would be much more efficient for this kind of use-case i feel like
