Race Condition in Event Queuing When MODIFIED Events Arrive After CREATE but Before last-handled-configuration Was Written #729

@paxbit


Long story short

MODIFIED events for just-created resources might arrive before last-handled-configuration has been written. This leads to the MODIFIED event being treated as Reason.CREATE because its old version is still empty.


Loading the (empty) old manifest is tried here:

old = settings.persistence.diffbase_storage.fetch(body=body)


Falsely setting the cause reason to CREATE as a result of the empty old manifest is done here:

if old is None:  # i.e. we have no essence stored
    kwargs['initial'] = False
    return ResourceChangingCause(reason=handlers.Reason.CREATE, **kwargs)


The handler is then not called because its registered reason does not match the resource-changing cause's reason:

if handler.reason is None or handler.reason == cause.reason:
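
To illustrate the net effect, here is a minimal, self-contained sketch (my own illustration, not kopf's actual classes): a handler registered for Reason.UPDATE can never match a cause whose reason was forced to CREATE by the missing diffbase.

from enum import Enum

class Reason(Enum):
    CREATE = "create"
    UPDATE = "update"

class Handler:
    def __init__(self, reason):
        self.reason = reason

class Cause:
    def __init__(self, reason):
        self.reason = reason

handler = Handler(reason=Reason.UPDATE)  # what @kopf.on.update registers
cause = Cause(reason=Reason.CREATE)      # what the early MODIFIED event is mis-detected as

# The matching condition quoted above evaluates to False,
# so the update handler is silently skipped for this event:
print(handler.reason is None or handler.reason == cause.reason)  # False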

Description

If the handler that creates a resource via third-party means like pykube still spends a small amount of time after the creation before returning, a quick update-after-create to the resource will queue up MODIFIED events before kopf has had a chance to write its last-handled-configuration.

The following code snippet reproduces this. We had a situation where a 2-container pod had one container crashing immediately after creation. When this happened quickly enough after the pod was created, the handler designated to deal with crashing containers was never called. Since I'm working from home via a DSL link to the data center where the cluster lives, the varying connection latency through the VPN gateway over the day is sometimes enough to trigger this. But only after today's lucky placement of a breakpoint (introducing a sufficient handler delay) right after the pod creation was I able to reliably reproduce it and find the root cause.

By the way, all the handler does after creating the pod is create an event about that fact and set kopf's patch dict.

I believe this is broken at the queuing design level, and I have no good idea how to fix it. After looking at this, I'm not sure the current implementation can be made correct without substantial rewrites (memories, maybe?). The assumptions currently made around last-handled-configuration can never be fully upheld as long as parties other than kopf (i.e. pykube, kubernetes itself) also modify resources - which will of course always be true.
However, I'd be very happy to be proven wrong. Maybe the already-queued MODIFIED events that lack kopf's storage annotations could be augmented in-memory with the missing data by remembering the CREATE event long enough (rough sketch below). IDK.
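
To make that last idea a bit more concrete, here is a purely hypothetical sketch (names and structure are mine, not kopf's API): an in-memory cache of essences for freshly created objects, consulted as a fallback when the diffbase storage returns None.

from typing import Any, Dict, Optional

class EssenceCache:
    """Remembers the last-seen essence per object UID until the diffbase annotation is confirmed written."""

    def __init__(self) -> None:
        self._essences: Dict[str, Dict[str, Any]] = {}

    def remember(self, body: Dict[str, Any]) -> None:
        # Called while processing the ADDED/CREATE event, before the patch is sent.
        self._essences[body["metadata"]["uid"]] = body  # in kopf this would be the stripped essence

    def recall(self, body: Dict[str, Any]) -> Optional[Dict[str, Any]]:
        # Fallback when diffbase_storage.fetch(body=body) returns None.
        return self._essences.get(body["metadata"]["uid"])

    def forget(self, body: Dict[str, Any]) -> None:
        # Called once last-handled-configuration is observed on the object.
        self._essences.pop(body["metadata"]["uid"], None)

# Pseudo-flow in cause detection:
#   old = settings.persistence.diffbase_storage.fetch(body=body)
#   if old is None:
#       old = essence_cache.recall(body)  # avoids the false Reason.CREATE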

The following script:

  1. Creates a pod with two containers. One of them crashes after 1s.
  2. Then the handler time.sleeps for 2s.
  3. on_update(...) is never called and "wonky's status was updated: ..." is missing from the output.

To make it work:
Comment out the time.sleep(2) after pod creation; the on_update(...) handler will then be called.

Note
Running this script for the first time might actually trigger on_update. That would be because the alpine image might need to be pulled: if the pull takes longer than the 2s sleep, MODIFIED events will arrive after that point, and kopf might have had enough time to write a last-handled-configuration. If the image is already present, it should fail on the first run - except maybe on very slow or loaded clusters where the container takes longer to crash. In that case, simply increase the sleep to 3-4s to still trigger it.

event_race_bug.py
import time

import kopf
import pykube

podspec = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "wonky", "namespace": "default"},
    "spec": {
        "containers": [
            {
                "args": [
                    "-c",
                    "\"echo 'Hello, sleeping for 1s'; sleep 1; echo 'Falling over now...'\"",
                ],
                "command": ["/bin/sh"],
                "image": "alpine:latest",
                "imagePullPolicy": "IfNotPresent",
                "name": "broken",
            },
            {
                "args": [
                    "-c",
                    "\"echo 'Hello, I'll stay alive much longer'; sleep 3600; echo 'Falling over now...'\"",
                ],
                "image": "alpine:latest",
                "imagePullPolicy": "IfNotPresent",
                "name": "sane",
            },
        ],
        "dnsPolicy": "ClusterFirst",
        "restartPolicy": "Never",
        "terminationGracePeriodSeconds": 30,
    },
}

k_api: pykube.HTTPClient = pykube.HTTPClient(pykube.KubeConfig.from_env())

@kopf.on.startup()
async def create_pod(**_):
    pod = pykube.Pod(k_api, podspec)
    # uncomment this if you're running the script multiple times and do not want to manually delete the pod each time
    # pod.delete() 
    pod.create()
    # comment the following line to make the example work and allow on_update being called
    time.sleep(2)


@kopf.on.update(
    "",
    "v1",
    "pods",
    field="status",
)
async def on_update(name, status, **_):
    print(f"{name}'s status was updated: {status.get('phase')}")
The exact command to reproduce the issue
kopf run event_race_bug.py
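
To check whether the race actually occurred in a given run, one can inspect whether kopf's diffbase annotation ever made it onto the pod (this assumes the default annotation key kopf.zalando.org/last-handled-configuration; adjust if a different storage is configured):

import pykube

api = pykube.HTTPClient(pykube.KubeConfig.from_env())
pod = pykube.Pod.objects(api).filter(namespace="default").get(name="wonky")
annotations = pod.obj["metadata"].get("annotations", {})
# None here means kopf never managed to write its diffbase for this object.
print(annotations.get("kopf.zalando.org/last-handled-configuration"))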

I hope somebody proves my analysis wrong, I really do, because if I'm correct it means that, by definition, I'll never be able to implement a correctly behaving operator using kopf, as I would have to expect subtle errors like this one without any way to detect them through kopf's API.

Environment

  • Kopf version: 1.30.3
  • Kubernetes version: 1.17
  • Python version: 3.9.2
  • OS/platform: Linux
