Contributing to DataLad
=======================

[gh-datalad]: http://github.com/datalad/datalad

Files organization
------------------

- [datalad/](./datalad) is the main Python module where major development is happening,
  with major submodules being:
    - `cmdline/` - helpers for accessing `interface/` functionality from the
      command line
    - `customremotes/` - custom special remotes for annex provided by datalad
    - `downloaders/` - support for accessing data from various sources (e.g.
      http, S3, XNAT) via a unified interface.
        - `configs/` - specifications for known data providers and associated
          credentials
    - `interface/` - high level interface functions which get exposed via
      command line (`cmdline/`) or Python (`datalad.api`).
    - `tests/` - some unit- and regression- tests (more can be found under
      `tests/` of corresponding submodules. See [Tests](#tests))
        - [utils.py](./datalad/tests/utils.py) provides convenience helpers used by unit-tests such as
          `@with_tree`, `@serve_path_via_http` and other decorators
    - `ui/` - user-level interactions, such as messages about errors, warnings,
      progress reports, and, when supported by the available frontend,
      interactive dialogs
    - `support/` - various support modules, e.g. for git/git-annex interfaces,
      constraints for the `interface/`, etc
- [benchmarks/](./benchmarks) - [asv] benchmarks suite (see [Benchmarking](#benchmarking))
- [docs/](./docs) - yet to be heavily populated documentation
    - `bash-completions` - bash and zsh completion setup for datalad (just
      `source` it)
- [fixtures/](./fixtures) - currently not under git; contains fixtures generated by vcr
- [sandbox/](./sandbox) - various scripts and prototypes which are not part of
  the main codebase distributed with releases
- [tools/](./tools) contains helper utilities used during development, testing, and
  benchmarking of DataLad.  Implemented in whichever language is most appropriate
  (Python, bash, etc.)

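To give a feel for what test helpers such as `@with_tree` do, here is a
simplified sketch of a decorator in the same spirit.  It is a hypothetical
stand-in (`with_tmp_tree` and `check_tree` are made-up names), not the code from
`datalad/tests/utils.py`, and it assumes the helper's job is to create a
temporary file tree, hand its path to the test, and clean up afterwards:

```python
import functools
import os
import shutil
import tempfile


def with_tmp_tree(tree):
    """Hypothetical, simplified stand-in for a `@with_tree`-style decorator.

    `tree` maps file names to text content; nested dicts become
    subdirectories.  The path of the populated temporary directory is
    passed to the decorated function as its last positional argument.
    """
    def _create(base, spec):
        # Recursively materialize the requested files and directories
        for name, content in spec.items():
            path = os.path.join(base, name)
            if isinstance(content, dict):
                os.mkdir(path)
                _create(path, content)
            else:
                with open(path, "w") as f:
                    f.write(content)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            tmpdir = tempfile.mkdtemp(prefix="sketch_tree_")
            try:
                _create(tmpdir, tree)
                return func(*(args + (tmpdir,)), **kwargs)
            finally:
                # Always clean up, even if the decorated function fails
                shutil.rmtree(tmpdir)
        return wrapper
    return decorator


@with_tmp_tree({"README": "hello", "sub": {"data.txt": "42"}})
def check_tree(path):
    # The decorated function only sees an already populated directory
    with open(os.path.join(path, "sub", "data.txt")) as f:
        return f.read()
```

The real helpers offer more options; consult `utils.py` for their actual
signatures before using them in tests.
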
Whenever a new top-level file or folder is added to the repository, it should
be listed in `MANIFEST.in` so that it will be either included in or excluded
from source distributions as appropriate.  [See
here](https://packaging.python.org/guides/using-manifest-in/) for information
about writing a `MANIFEST.in`.

How to contribute
-----------------

The preferred way to contribute to the DataLad code base is
to fork the [main repository][gh-datalad] on GitHub.  Here
we outline the workflow used by the developers:


0. Have a clone of our main [project repository][gh-datalad] as `origin`
   remote in your git:

          git clone https://github.com/datalad/datalad

1. Fork the [project repository][gh-datalad]: click on the 'Fork'
   button near the top of the page.  This creates a copy of the code
   base under your account on the GitHub server.

2. Add your forked clone as a remote to the local clone you already have on your
   local disk:

          git remote add gh-YourLogin git@github.com:YourLogin/datalad.git
          git fetch gh-YourLogin

   To ease the addition of other github repositories as remotes, here is
   a little bash function/script to add to your `~/.bashrc`:

          ghremote () {
                  url="$1"
                  proj=${url##*/}
                  url_=${url%/*}
                  login=${url_##*/}
                  git remote add "gh-$login" "$url"
                  git fetch "gh-$login"
          }

   so that you can simply run:

          ghremote git@github.com:YourLogin/datalad.git

   to add the above `gh-YourLogin` remote.  Additional handy aliases
   such as `ghpr` (to fetch an existing PR from someone's remote) and
   `ghsendpr` can be found in
   [yarikoptic's bash config file](http://git.onerussian.com/?p=etc/bash.git;a=blob;f=.bash/bashrc/30_aliases_sh;hb=HEAD#l865)

3. Create a branch (generally off the `origin/master`) to hold your changes:

          git checkout -b nf-my-feature

   and start making changes. Ideally, use a prefix signaling the purpose of the
   branch
   - `nf-` for new features
   - `bf-` for bug fixes
   - `rf-` for refactoring
   - `doc-` for documentation contributions (including in the code docstrings).
   - `bm-` for changes to benchmarks

   We recommend to not work in the ``master`` branch!

4. Work on this copy on your computer using Git to do the version control. When
   you're done editing, do:

          git add modified_files
          git commit

   to record your changes in Git.  Ideally, prefix your commit messages with
   `NF`, `BF`, `RF`, `DOC`, or `BM`, similar to the branch name prefixes.  You
   can also use `TST` for commits concerned solely with tests, and `BK` to
   signal that the commit causes a breakage (e.g. of tests) at that point.
   Multiple entries could be listed joined with a `+` (e.g. `rf+doc-`).  See
   `git log` for examples.  If a commit closes an existing DataLad issue, then
   add to the end of the message `(Closes #ISSUE_NUMBER)`

5. Push to GitHub with:

          git push -u gh-YourLogin nf-my-feature

   Finally, go to the web page of your fork of the DataLad repo, and click
   'Pull request' (PR) to send your changes to the maintainers for review. This
   will send an email to the committers.  You can commit new changes to this branch
   and keep pushing to your remote -- github automagically adds them to your
   previously opened PR.

(If any of the above seems like magic to you, then look up the
[Git documentation](http://git-scm.com/documentation) on the web.)

Development environment
-----------------------

We support Python 3 only (>= 3.5).

See [README.md:Dependencies](README.md#Dependencies) for basic information
about installation of datalad itself.
On Debian-based systems we recommend to enable [NeuroDebian](http://neuro.debian.net)
since we use it to provide backports of recent fixed external modules we depend upon:

```sh
apt-get install -y -q git git-annex-standalone
apt-get install -y -q patool python3-scrapy python3-{appdirs,argcomplete,git,humanize,keyring,lxml,msgpack,progressbar,requests,setuptools}
```

and additionally, for development we suggest to use tox and new versions of
dependencies from PyPI:

```sh
apt-get install -y -q python3-{dev,httpretty,nose,pip,vcr,virtualenv} python3-tox
# Some libraries which might be needed for installing via pip
apt-get install -y -q lib{ffi,ssl,curl4-openssl,xml2,xslt1}-dev
```

some of which you could also install from PyPI using pip (prior installation
of the libraries listed above might be necessary)

```sh
pip install -r requirements-devel.txt
```

and you will need to install a recent git-annex using the means appropriate for
your OS (for Debian/Ubuntu, once again, just use NeuroDebian).

Documentation
-------------

### Docstrings

We use the [NumPy standard] for the description of parameters in docstrings.
If you are using PyCharm, set your project settings accordingly
(`Tools` -> `Python integrated tools` -> `Docstring format`).

[NumPy standard]: https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt#docstring-standard

In addition, we follow the guidelines of [Restructured Text] with the additional features and treatments
provided by [Sphinx].

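As an illustration of the docstring convention, here is a minimal sketch of a
function documented in the NumPy style, with sample usage given as a doctest.
The function `join_prefixes` is hypothetical and not part of DataLad; it merely
borrows the commit-prefix convention from this guide as subject matter:

```python
def join_prefixes(prefixes, sep="+"):
    """Join commit-message prefixes into a single tag.

    Hypothetical helper used only to illustrate the docstring layout.

    Parameters
    ----------
    prefixes : list of str
        Prefixes such as ``'RF'`` or ``'DOC'``.
    sep : str, optional
        Separator placed between the prefixes.

    Returns
    -------
    str
        The joined tag, e.g. ``'RF+DOC'``.

    Raises
    ------
    ValueError
        If `prefixes` is empty.

    Examples
    --------
    >>> join_prefixes(['RF', 'DOC'])
    'RF+DOC'
    """
    if not prefixes:
        raise ValueError("at least one prefix is required")
    return sep.join(prefixes)
```

Saving the block as a module and running `python -m doctest -v` on that file
executes the example embedded in the `Examples` section.
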
[Restructured Text]: http://docutils.sourceforge.net/docs/user/rst/quickstart.html
[Sphinx]: http://www.sphinx-doc.org/en/stable/

Additional Hints
----------------

### Merge commits

For merge commits to have a more informative description, add the following
section to your `.git/config` or `~/.gitconfig`:

    [merge]
    log = true

and if conflicts occur, provide a short summary of how they were resolved
in the "Conflicts" listing within the merge commit
(see [example](https://github.com/datalad/datalad/commit/eb062a8009d160ae51929998771964738636dcc2)).


Quality Assurance
-----------------

It is recommended to check that your contribution complies with the following
rules before submitting a pull request:

- All public methods should have informative docstrings with sample usage
  presented as doctests when appropriate.

- All other tests pass when everything is rebuilt from scratch.

- New code should be accompanied by tests.


### Tests

`datalad/tests` contains tests for the core portion of the project, and
more tests are provided under corresponding submodules in `tests/`
subdirectories to simplify re-running the tests concerning that portion
of the codebase.  To execute many tests, the codebase first needs to be
"installed" in order to generate scripts for the entry points.  For
that, the recommended course of action is to use `virtualenv`, e.g.

```sh
virtualenv --system-site-packages venv-tests
source venv-tests/bin/activate
pip install -r requirements.txt
python setup.py develop
```

and then use that virtual environment to run the tests, via

```sh
python -m nose -s -v datalad
```

or similarly,

```sh
nosetests -s -v datalad
```

then to later deactivate the virtualenv simply enter

```sh
deactivate
```

Alternatively, or complementary to that, you can use `tox` -- there is a `tox.ini`
file which sets up a few virtual environments for testing locally, which you can
later reuse like any other regular virtualenv for troubleshooting.
Additionally, the [tools/testing/test_README_in_docker](tools/testing/test_README_in_docker) script can
be used to establish a clean docker environment (based on any NeuroDebian-supported
release of Debian or Ubuntu) with all dependencies listed in README.md pre-installed.

### CI setup

We are using Travis-CI and have a [buildbot setup](https://github.com/datalad/buildbot) which also
exercises our tests battery for every PR and on the master.  Note that buildbot runs tests
only for PRs submitted by datalad developers, or if a PR acquires the 'buildbot' label.

If you want to enter buildbot's environment:

1. Login to our development server (`smaug`)

2. Find the container ID associated with the environment you are interested in, e.g.

        docker ps | grep nd16.04

3. Enter that docker container environment using

        docker exec -it /bin/bash

4. Become the buildbot user

        su - buildbot

5. Activate the corresponding virtualenv using `source`,
   e.g. `source /home/buildbot/datalad-pr-docker-dl-nd15_04/build/venv-ci/bin/activate`

And now you should be in the same environment as the very last tested PR.
Note that the same path/venv is reused for all the PRs, so you might want
to first check using `git show` under the `build/` directory whether it corresponds
to the commit you are interested in troubleshooting.

For developing on Windows you can use free [Windows VMs](https://developer.microsoft.com/en-us/microsoft-edge/tools/vms/).

### Coverage

You can also check for common programming errors with the following tools:

- Code with good unittest coverage (at least 80%), check with:

          pip install nose coverage
          nosetests --with-coverage path/to/tests_for_package

- We rely on https://codecov.io to provide a convenient view of code coverage.
  Installation of the codecov extension for Firefox/Iceweasel or Chromium
  is strongly advised, since it provides coverage annotation of pull
  requests.

### Linting

We are not (yet) fully PEP8 compliant, so please use these tools as
guidelines for your contributions, but do not use them to PEP8 the entire code
base.

[beyond-pep8]: https://www.youtube.com/watch?v=wf-BqAjZb8M

*Sidenote*: watch [Raymond Hettinger - Beyond PEP 8][beyond-pep8]

- No pyflakes warnings, check with:

           pip install pyflakes
           pyflakes path/to/module.py

- No PEP8 warnings, check with:

           pip install pep8
           pep8 path/to/module.py

- AutoPEP8 can help you fix some of the easy redundant errors:

           pip install autopep8
           autopep8 path/to/pep8.py

Also, some team developers use
[PyCharm community edition](https://www.jetbrains.com/pycharm) which
provides a built-in PEP8 checker and handy tools such as smart
splits/joins, making it easier to maintain code following the PEP8
recommendations.  NeuroDebian provides the `pycharm-community-sloppy`
package to ease pycharm installation even further.

### Benchmarking

We use [asv] to benchmark some core DataLad functionality.
The benchmarks suite is located under [benchmarks/](./benchmarks), and
periodically we publish results of running benchmarks on a dedicated host
to http://datalad.github.io/datalad/ .  Those results are collected
and available under the `.asv/` submodule of this repository, so to get started

- `git submodule update --init .asv`
- `pip install .[devel]` or just `pip install asv`
- `asv machine` - to configure asv for your host if you want to run
  benchmarks locally

And then you could use [asv] in multiple ways.

#### Quickly benchmark the working tree

- `asv run -E existing` - benchmark using the existing python environment
  and just print out results (not stored anywhere).  You can add `-q`
  to run each benchmark just once (thus less reliable estimates)
- `asv run -b api.supers.time_createadd_to_dataset -E existing`
  would run that specific benchmark using the existing python environment

Note: `--python=same` (`-E existing`) seems to have restricted
applicability, e.g. can't be used for a range of commits, so it can't
be used with `continuous`.

#### Compare results for two commits from recorded runs

Use [asv compare] to compare results from different runs, which should be
available under `.asv/results/`.  (Note that the example
below passes ref names instead of commit IDs, which requires asv v0.3 or later.)

```shell
> asv compare -m hopa 0.9.x master
All benchmarks:

       before           after         ratio
     [b619eca4]       [7635f467]
-           1.87s            1.54s     0.82  api.supers.time_createadd
-           1.85s            1.56s     0.84  api.supers.time_createadd_to_dataset
-           5.57s            4.40s     0.79  api.supers.time_installr
          145±6ms          145±6ms     1.00  api.supers.time_ls
-           4.59s            2.17s     0.47  api.supers.time_remove
          427±1ms          434±8ms     1.02  api.testds.time_create_test_dataset1
-           4.10s            3.37s     0.82  api.testds.time_create_test_dataset2x2
      1.81±0.07ms      1.73±0.04ms     0.96  core.runner.time_echo
       2.30±0.2ms      2.04±0.03ms    ~0.89  core.runner.time_echo_gitrunner
+        420±10ms          535±3ms     1.27  core.startup.time_help_np
          111±6ms          107±3ms     0.96  core.startup.time_import
+         334±6ms          466±4ms     1.39  core.startup.time_import_api
```

#### Run and compare results for two commits

[asv continuous] could be used to first run benchmarks for the to-be-tested
commits and then provide stats:

- `asv continuous 0.9.x master` - would run and compare the 0.9.x and master branches
- `asv continuous HEAD` - would compare HEAD against HEAD^
- `asv continuous master HEAD` - would compare HEAD against the state of master
- [TODO: continuous -E existing](https://github.com/airspeed-velocity/asv/issues/338#issuecomment-380520022)

Notes:
- only significant changes will be reported
- raw results from benchmarks are not stored (use `--record-samples` if
  desired)

#### Run and record benchmarks results (for later comparison etc)

- `asv run` would run all configured branches (see
  [asv.conf.json](./asv.conf.json))

#### Profile a benchmark and produce a nice graph visualization

Example (replace with the benchmark of interest)

    asv profile -v -o profile.gprof usecases.study_forrest.time_make_studyforrest_mockup
    gprof2dot -f pstats profile.gprof | dot -Tpng -o profile.png \
        && xdg-open profile.png

#### Common options

- `-E` to restrict to a specific environment, e.g. `-E virtualenv:2.7`
- `-b` could be used to specify specific benchmark(s)
- `-q` to run each benchmark just once for a quick assessment (results are
  not stored since too unreliable)

[asv compare]: http://asv.readthedocs.io/en/latest/commands.html#asv-compare
[asv continuous]: http://asv.readthedocs.io/en/latest/commands.html#asv-continuous

[asv]: http://asv.readthedocs.io

Easy Issues
-----------

A great way to start contributing to DataLad is to pick an item from the list of
[Easy issues](https://github.com/datalad/datalad/labels/easy) in the issue
tracker.  Resolving these issues allows you to start contributing to the project
without much prior knowledge.  Your assistance in this area will be greatly
appreciated by the more experienced developers as it helps free up their time to
concentrate on other issues.

Recognizing contributions
-------------------------

We welcome and recognize all contributions from documentation to testing to
code development.

You can see a list of current contributors in our [zenodo file][link_zenodo].
If you are new to the project, don't forget to add your name and affiliation there!

Thank you!
----------

You're awesome. :wave::smiley:


Various hints for developers
----------------------------

### Useful tools

- While performing IO/net heavy operations use [dstat](http://dag.wieers.com/home-made/dstat)
  for quick logging of various health stats in a separate terminal window:

        dstat -c --top-cpu -d --top-bio --top-latency --net

- To monitor the speed of any data pipelining, [pv](http://www.ivarch.com/programs/pv.shtml) is really handy,
  just plug it in the middle of your pipe.
- For remote debugging epdb could be used (avail in pip) by using
  `import epdb; epdb.serve()` in Python code and then connecting to it with
  `python -c "import epdb; epdb.connect()"`.

- We are using codecov, which has extensions for the popular browsers
  (Firefox, Chrome) that annotate pull requests on github regarding changed coverage.

### Useful Environment Variables

Refer to datalad/config.py for information on how to add these environment variables to the config file and their naming convention

- *DATALAD_DATASETS_TOPURL*:
  Used to point to an alternative location for the `///` dataset. When running
  tests, it is preferably set to http://datasets-tests.datalad.org
- *DATALAD_LOG_CWD*:
  Whether to log the cwd in which a command is to be executed
- *DATALAD_LOG_ENV*:
  If it contains a digit (e.g. 1), would log the entire environment passed into
  the Runner.run's popen call.  Otherwise could be a comma separated list
  of environment variables to log
- *DATALAD_LOG_LEVEL*:
  Used to control the verbosity of logs printed to stdout while running datalad commands/debugging
- *DATALAD_LOG_NAME*:
  Whether to include the logger name (e.g. `datalad.support.sshconnector`) in the log
- *DATALAD_LOG_OUTPUTS*:
  Used to control whether both stdout and stderr of external command executions are logged in detail (at DEBUG level)
- *DATALAD_LOG_PID*:
  To instruct datalad to log the PID of the process
- *DATALAD_LOG_STDIN*:
  Whether to log stdin for the command
- *DATALAD_LOG_TARGET*:
  Where to log: `stderr` (default), `stdout`, or another filename
- *DATALAD_LOG_TIMESTAMP*:
  Used to add a timestamp to datalad logs
- *DATALAD_LOG_TRACEBACK*:
  Runs the TraceBack function with collide set to True, if this flag is set to 'collide'.
  This replaces any common prefix between the current traceback log and the previous invocation with "..."
- *DATALAD_LOG_VMEM*:
  Reports memory utilization (resident/virtual) at every log line, needs the `psutil` module
- *DATALAD_EXC_STR_TBLIMIT*:
  This flag is used by the datalad extract_tb function which extracts and formats stack-traces.
  It caps the number of lines to DATALAD_EXC_STR_TBLIMIT of pre-processed entries from the traceback.
- *DATALAD_SEED*:
  To seed Python's `random` RNG, which will also be used for generation of dataset UUIDs to make
  those random values reproducible.  You might also want to set all the relevant git config variables
  like we do in one of the travis runs
- *DATALAD_TESTS_TEMP_KEEP*:
  The function rmtemp will not remove temporary files/directories created for testing if this flag is set
- *DATALAD_TESTS_TEMP_DIR*:
  Create a temporary directory at the location specified by this flag.
  It is used by tests to create a temporary git directory while testing git annex archives etc
- *DATALAD_TESTS_NONETWORK*:
  Skips network tests completely if this flag is set.
  Examples include tests for s3, git_repositories, openfmri, etc.
- *DATALAD_TESTS_SSH*:
  Skips SSH tests if this flag is **not** set.  If you enable this,
  you need to set up a "datalad-test" and "datalad-test2" target in
  your SSH configuration.  The second target is used by only a couple
  of tests, so depending on the tests you're interested in, you can
  get by with only "datalad-test" configured.

  A Docker image that is used for DataLad's tests is available at .
  Note that the DataLad tests assume that target files exist in
  `DATALAD_TESTS_TEMP_DIR`, which restricts the "datalad-test" target
  to being either the localhost or a container that mounts
  `DATALAD_TESTS_TEMP_DIR`.
- *DATALAD_TESTS_NOTEARDOWN*:
  Does not execute teardown_package, which cleans up temp files and directories created by tests, if this flag is set
- *DATALAD_TESTS_USECASSETTE*:
  Specifies the location of the file to record network transactions by the VCR module.
  Currently used when testing custom special remotes
- *DATALAD_TESTS_OBSCURE_PREFIX*:
  A string to prefix the most obscure (but supported by the filesystem) test filename
- *DATALAD_TESTS_PROTOCOLREMOTE*:
  Binary flag to specify whether to test protocol interactions of a custom remote with annex
- *DATALAD_TESTS_RUNCMDLINE*:
  Binary flag to specify whether shell testing using shunit2 is to be carried out
- *DATALAD_TESTS_TEMP_FS*:
  Specify the temporary file system to use as a loop device for testing DATALAD_TESTS_TEMP_DIR creation
- *DATALAD_TESTS_TEMP_FSSIZE*:
  Specify the size of the temporary file system to use as a loop device for testing DATALAD_TESTS_TEMP_DIR creation
- *DATALAD_TESTS_NONLO*:
  Specifies network interfaces to bring down/up for testing. Currently used by travis.
- *DATALAD_CMD_PROTOCOL*:
  Specifies the protocol number used by the Runner to note shell command or python function call times and allows for dry runs.
  'externals-time' for ExecutionTimeExternalsProtocol, 'time' for ExecutionTimeProtocol and 'null' for NullProtocol.
  Any new DATALAD_CMD_PROTOCOL has to implement datalad.support.protocol.ProtocolInterface
- *DATALAD_CMD_PROTOCOL_PREFIX*:
  Sets a prefix to add before the command call times are noted by DATALAD_CMD_PROTOCOL.
- *DATALAD_USE_DEFAULT_GIT*:
  Instructs to use `git` as available in the current environment, and not the one which possibly comes with git-annex (default behavior).
- *DATALAD_ASSERT_NO_OPEN_FILES*:
  Instructs test helpers to check for open files at the end of a test. If set, remaining open files are logged at ERROR level. Alternative modes are: "assert" (raise AssertionError if any open file is found), "pdb"/"epdb" (drop into the debugger when open files are found; info on files is provided in a "files" dictionary, mapping filenames to psutil process objects).
- *DATALAD_ALLOW_FAIL*:
  Instructs the `@never_fail` decorator to allow failures, e.g. to ease debugging.

# Changelog section

For the upcoming release use this template

## 0.15.0 (??? ??, 2020) -- will be better than ever

bet we will fix some bugs and make a world even a better place.

### Major refactoring and deprecations

- hopefully none

### Fixes

?

### Enhancements and new features

?

[link_zenodo]: https://github.com/datalad/datalad/blob/master/.zenodo.json