conmon: handle multi-line logging #436

cyphar · 2017-04-06T18:42:23Z

The CRI requires us to prepend (timestamp, stream) to every line of the
output, and it's quite likely (especially in the !terminal case) that we
will read more than one line of output in the read loop.

So, we need to write out each line separately with the prepended
timestamps. Doing this the simple way (the final part of the buffer is
written partially if it doesn't end in a newline) makes the code much
simpler, with the downside that if we ever switch to multiple streams
for output we'll have to rewrite parts of this.

Alternative to #430.

Fixes: 1dc4c87 ("conmon: add timestamps to logs")
Signed-off-by: Aleksa Sarai [email protected]
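
For illustration, the per-line approach described above could look roughly like the following. This is a minimal sketch, not the code from this PR: `write_prefixed_lines`, `write_all`, and the caller-held `mid_line` flag are hypothetical names, and `prefix` stands in for the already formatted "timestamp stream " header. A trailing fragment without a newline is written as-is, and the next chunk continues that line without a new prefix.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write every byte of buf, retrying on short writes. */
static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Hypothetical sketch: emit `prefix` before every new line found in buf.
 * *mid_line records whether the previous chunk ended without '\n', in
 * which case the next chunk continues that line with no new prefix.
 */
int write_prefixed_lines(int fd, const char *prefix, const char *buf,
			 size_t buflen, bool *mid_line)
{
	assert(prefix != NULL && buf != NULL && mid_line != NULL);

	while (buflen > 0) {
		const char *nl = memchr(buf, '\n', buflen);
		/* Include the newline itself, or take the whole remainder. */
		size_t line_len = nl ? (size_t)(nl - buf) + 1 : buflen;

		if (!*mid_line && write_all(fd, prefix, strlen(prefix)) < 0)
			return -1;
		if (write_all(fd, buf, line_len) < 0)
			return -1;

		*mid_line = (nl == NULL);
		buf += line_len;
		buflen -= line_len;
	}
	return 0;
}
```

Using `memchr` with an explicit length, rather than string functions, means embedded NUL bytes in the container output are passed through unchanged.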

cyphar · 2017-04-06T18:42:31Z

I haven't tested this yet.

runcom · 2017-04-06T18:44:44Z

conmon/conmon.c

+		ptrdiff_t line_len = buf - line_end;
+
+		/* Write the (timestamp, stream, line) tuple. */
+		if (write(fd, tsbuf, TSBUFLEN-1) < 0) {


Is this making sure we drop the NUL terminator that snprintf adds? One of the original issues is that the NUL terminator (00) causes string matching to fail.

TSBUFLEN-1 doesn't contain the null terminator. But I could switch to strlen(tsbuf) if you prefer.

Nah it's fine
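
As a side note on the terminator question: `snprintf` returns the number of characters it produced, not counting the trailing NUL, so writing exactly that many bytes keeps the `00` byte out of the log. A minimal sketch, assuming a hypothetical `format_log_prefix` helper (not a conmon function):

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: build "<timestamp> <stream> " in out.
 * Returns the number of bytes to write (the NUL terminator is NOT
 * counted), or -1 on error or truncation.
 */
int format_log_prefix(char *out, size_t outlen, const char *timestamp,
		      const char *stream)
{
	assert(out != NULL && timestamp != NULL && stream != NULL);

	int n = snprintf(out, outlen, "%s %s ", timestamp, stream);
	if (n < 0 || (size_t)n >= outlen)
		return -1;
	return n;
}
```

Passing the returned length to `write` has the same effect as `TSBUFLEN-1` or `strlen(tsbuf)`, without depending on a fixed buffer size.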

runcom · 2017-04-06T18:48:30Z

I'll run k8s tests with this tomorrow afternoon (my timezone)

cyphar · 2017-04-06T18:56:59Z

/me just realised the output spacing was wrong. Pushed a fix and squashed.

mrunalp · 2017-04-06T19:21:11Z

@cyphar There are bugs in this right now. It returns empty line when queried through kubectl logs. Details:

[root@dhcp-16-129 kubernetes]# ./cluster/kubectl.sh  run  httpd --image=httpd:2.4-alpine
deployment "httpd" created
[root@dhcp-16-129 kubernetes]# ./cluster/kubectl.sh  describe pod httpd | grep IP
IP:
[root@dhcp-16-129 kubernetes]# ./cluster/kubectl.sh  describe pod httpd | grep IP
IP:             10.88.0.76
[root@dhcp-16-129 kubernetes]# curl 10.88.0.76
<html><body><h1>It works!</h1></body></html>
[root@dhcp-16-129 kubernetes]# ./cluster/kubectl.sh get pods
NAME                     READY     STATUS    RESTARTS   AGE
httpd-3531205961-q5j6l   1/1       Running   0          16s
[root@dhcp-16-129 kubernetes]# ./cluster/kubectl.sh logs httpd-3531205961-q5j6l
[root@dhcp-16-129 kubernetes]#

File contents:

[root@dhcp-16-129 c00a5dd8-1afd-11e7-b574-74852a1f5251]# cat httpd_0.log
2017-04-06T12:18:02-07:00 stdout 2017-04-06T12:18:02-07:00 stdout 2017-04-06T12:18:02-07:00 stdout 2017-04-06T12:18:02-07:00 stdout 2017-04-06T12:18:02-07:00 stdout 2017-04-06T12:18:02-07:00 stdout 2017-04-06T12:18:02-07:00 stdout

cyphar · 2017-04-07T18:41:13Z

Yeah sorry @mrunalp this code was quite wrong before. I've now tested it with quite a few test cases (here's a sample), so it should work now:

int main(void)
{
	write_k8s_log(1, "stdout", "a sane line buffered\n", 21);
	write_k8s_log(1, "stdout", "this\nis\nkinda coo\0l!", 20);
	write_k8s_log(1, "stdout", "even more cool stuf\n\n\n\n", 23);
	write_k8s_log(1, "stdout", "what is even going\n\n\n\nk", 23);
	write_k8s_log(1, "stdout", " ---   \0\0", 9);
	write_k8s_log(1, "stdout", " ++ \n", 5);
	return 0;
}

Will output:

2017-04-08T04:35:07+10:00 stdout a sane line buffered
2017-04-08T04:35:07+10:00 stdout this
2017-04-08T04:35:07+10:00 stdout is
2017-04-08T04:35:07+10:00 stdout kinda cool!even more cool stuf
2017-04-08T04:35:07+10:00 stdout
2017-04-08T04:35:07+10:00 stdout
2017-04-08T04:35:07+10:00 stdout
2017-04-08T04:35:07+10:00 stdout what is even going
2017-04-08T04:35:07+10:00 stdout
2017-04-08T04:35:07+10:00 stdout
2017-04-08T04:35:07+10:00 stdout
2017-04-08T04:35:07+10:00 stdout k ---    ++

mrunalp · 2017-04-07T19:52:55Z

@cyphar okay, will retest this. Thanks!

mrunalp · 2017-04-07T20:25:14Z

conmon/conmon.c

-					/* Log all output to logfd. */
-					if (write(logfd, buf, num_read) != num_read) {
-						nwarn("partial/failed write (logFd)");
+					if (write_k8s_log(logfd, "stdout", buf, num_read) < 0) {


I think we should add buf[num_read] = '\0'. Otherwise we see trailing stuff in logs.

[conmon:i]: read a chunk: (fd=5) '3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593344612847564823378678316527120190914564/tmp/conmon-term.XXXXXXXX'

I don't think we should modify the buffer like that (it means the logs won't actually match what was written by the program). The fix IMO is to change how we log the whole read a chunk thing. To be honest, that was a debugging measure and we should drop it.

Can you verify whether the actual log file has incorrect data in it, or just the stderr log from conmon? Because to be honest at the moment ninfo is currently reading out-of-bounds and we need to stop doing that, so I hope that's the only issue here. 😉

@cyphar Adding the '\0' is only for making the debug logs better :) I think we will be fine even if we just drop the ninfo like you said. The actual log files looked good in my manual testing with a few different types of pods. Unfortunately, the e2e tests ran into unrelated issues on my machine that I am still debugging. So I will ask @runcom to run the suite on his machine.
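
If a debug message for each chunk were kept, one way to avoid both the out-of-bounds read and modifying the buffer is to bound the `%s` conversion with a precision. A minimal sketch, assuming a hypothetical `format_chunk_debug` helper (the message text mirrors the debug line quoted above; the function is not part of conmon):

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/*
 * Hypothetical debug helper: format a chunk that is NOT NUL-terminated.
 * "%.*s" reads at most `num_read` bytes from buf, so nothing past the
 * data returned by read(2) is touched and buf itself is left unmodified.
 */
int format_chunk_debug(char *out, size_t outlen, int fd, const char *buf,
		       size_t num_read)
{
	assert(out != NULL && buf != NULL);

	/* The precision argument of "%.*s" is an int. */
	int prec = num_read > INT_MAX ? INT_MAX : (int)num_read;
	return snprintf(out, outlen, "read a chunk: (fd=%d) '%.*s'", fd, prec, buf);
}
```

Note that `%.*s` still stops at the first embedded NUL byte, so this is only suitable for diagnostics, not for the log file itself.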

mikebrow · 2017-04-08T02:31:55Z

conmon/conmon.c

 };

-int set_k8s_timestamp(char *buf, ssize_t buflen, const char *stream_type)
+/* strlen("1997-03-25T13:20:42+01:00") + 1 */


Would prefer RFC3339Nano if we have a choice here...

@mikebrow Yeah, we can do that for sure. Just want to get this correctness patch in first :)
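
For reference, a nanosecond-precision timestamp in the same layout could be produced along these lines. This is a sketch under assumptions: `format_rfc3339_nano` is a hypothetical helper, it always prints nine fractional digits (Go's `RFC3339Nano` layout trims trailing zeros), and it relies on `strftime`'s `%z` yielding a `+hhmm` offset.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdio.h>
#include <time.h>

/*
 * Hypothetical helper: format ts as "YYYY-MM-DDTHH:MM:SS.nnnnnnnnn+hh:mm"
 * in local time. Returns the string length, or -1 on failure.
 */
int format_rfc3339_nano(const struct timespec *ts, char *out, size_t outlen)
{
	struct tm tm;
	char date[32]; /* "YYYY-MM-DDTHH:MM:SS" */
	char off[8];   /* "+hhmm" as produced by %z */

	assert(ts != NULL && out != NULL);

	if (localtime_r(&ts->tv_sec, &tm) == NULL)
		return -1;
	if (strftime(date, sizeof(date), "%Y-%m-%dT%H:%M:%S", &tm) == 0)
		return -1;
	if (strftime(off, sizeof(off), "%z", &tm) != 5)
		return -1;

	/* Split "+hhmm" into "+hh" and "mm" to insert the colon. */
	int n = snprintf(out, outlen, "%s.%09ld%.3s:%s", date,
			 (long)ts->tv_nsec, off, off + 3);
	if (n < 0 || (size_t)n >= outlen)
		return -1;
	return n;
}
```

The caller would typically fill `ts` with `clock_gettime(CLOCK_REALTIME, &ts)` and needs a buffer of at least 36 bytes.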

cyphar · 2017-04-09T01:09:48Z

@mrunalp Pushed a commit that disables the debug logging. @runcom can you test this?

The test failures don't make sense -- they were passing yesterday and the only code change is me removing the ninfo...

runcom · 2017-04-10T09:31:41Z

testing this out with k8s right now. (restarted Travis also)

runcom · 2017-04-10T10:56:53Z

as far as k8s testing is concerned this PR LGTM :) (109/121 is a great result)

Ran 121 of 211 Specs in 5036.832 seconds
FAIL! -- 109 Passed | 12 Failed | 0 Pending | 90 Skipped --- FAIL: TestE2eNode (5036.85s)

runcom · 2017-04-10T12:21:15Z

(testing with latest k8s master source seems fine as well 👍 )

weirdly enough, tests fail with https://travis-ci.org/kubernetes-incubator/cri-o/jobs/220157466#L2775 (I've never seen it)

cyphar · 2017-04-10T13:07:13Z

I'm super confused why Travis is broken, the previous commit passed the cases (and now I'm worried to re-run the old commit). There's some problem with seccomp though...

mrunalp · 2017-04-10T14:53:48Z

@cyphar Yeah weird. The first failure seems to be execsync related.

cyphar · 2017-04-10T15:22:20Z

It's definitely a real failure, I just am confused what commits hit master between the two test runs that caused the breakage. I'll take a look today.

cyphar · 2017-04-10T20:09:06Z

Ah, I think I know why. It's because ExecSync isn't meant to output the k8s log format (and it was actually the debugging information which caused it to succeed by accident). As an aside, the fact that we have to add more hacks like this is more indication in my mind we'll need to rewrite most of this code after we hit feature parity.

mrunalp · 2017-04-10T20:57:14Z

@cyphar Another option is to strip this out before we send back the ExecSyncResponse. We will probably need to do that to return stderr separately when we add a separate pipe for it. However, we can do this in a follow on.
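
The stripping option mentioned here amounts to skipping the first two space-separated fields of each log line. A minimal sketch in C, assuming the "timestamp stream payload" layout shown earlier in this thread; `skip_log_prefix` is a hypothetical name and this is not how cri-o implements it:

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: given one "<timestamp> <stream> <payload>" line,
 * return a pointer to the payload, or NULL if the line does not contain
 * both prefix fields.
 */
const char *skip_log_prefix(const char *line)
{
	assert(line != NULL);

	const char *sp = strchr(line, ' ');   /* end of timestamp */
	if (sp == NULL)
		return NULL;
	sp = strchr(sp + 1, ' ');             /* end of stream name */
	if (sp == NULL)
		return NULL;
	return sp + 1;
}
```

The stream field (for example `stdout`) sits between the two spaces, so the same scan could also be used to route payloads to separate stdout and stderr buffers.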

cyphar · 2017-04-10T21:00:32Z

There was another error which is that ExecSync would give you the conmon logs not the container logs if the container exited with an error. That's the first test failure. I think the second test failure is just because our tests are so interlinked.

In addition, ExecSync wasn't handled properly when we had an exit code (we would return ExecSyncError rather than ExecSyncResponse, which I'm fairly sure is patently wrong). I've fixed that too.
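
The distinction being drawn here, a command that ran and exited non-zero versus a failure to run it at all, can be illustrated in C with `waitpid`. This is only an analogy under stated assumptions, not the cri-o change itself; `run_sync` is a hypothetical function, and the `127` exit value for a failed exec follows the shell convention.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical sketch: run argv synchronously.
 * Returns 0 and stores the exit status in *exit_code whenever the child
 * exited on its own, even with a non-zero status. Returns -1 only for
 * internal errors (fork/waitpid failure, or the child died by a signal).
 */
int run_sync(char *const argv[], int *exit_code)
{
	assert(argv != NULL && argv[0] != NULL && exit_code != NULL);

	pid_t pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		execvp(argv[0], argv);
		_exit(127); /* exec failed: shell-style "command not found" */
	}

	int status;
	pid_t w;
	do {
		w = waitpid(pid, &status, 0);
	} while (w < 0 && errno == EINTR);
	if (w < 0)
		return -1;
	if (!WIFEXITED(status))
		return -1;

	*exit_code = WEXITSTATUS(status);
	return 0;
}
```

With this shape, a caller reports `exit_code` in the normal response and reserves the error path for the `-1` case.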

cyphar · 2017-04-11T10:30:21Z

@mrunalp I will switch to stripping after this is merged and we do the separate pipes. The tests pass now, so I'm squashing.

Previously we returned an internal error result when a program had a
non-zero exit code, which was incorrect. Fix this as well as change the
tests to actually check the "ExitCode" response from ExecSync (rather
than expecting ocic-ctr to return an internal error).

Signed-off-by: Aleksa Sarai <[email protected]>

The CRI requires us to prepend (timestamp, stream) to every line of the
output, and it's quite likely (especially in the !terminal case) that we
will read more than one line of output in the read loop.

So, we need to write out each line separately with the prepended
timestamps. Doing this the simple way (the final part of the buffer is
written partially if it doesn't end in a newline) makes the code much
simpler, with the downside that if we ever switch to multiple streams
for output we'll have to rewrite parts of this.

In addition, drop the debugging output of cri-o for each chunk read so
we stop spamming stderr. We can do this now because 8a928d0 ("oci: make
ExecSync with ExitCode != 0 act properly") actually fixed how ExecSync
was being handled (especially in regards to this patch).

Fixes: 1dc4c87 ("conmon: add timestamps to logs")
Signed-off-by: Aleksa Sarai <[email protected]>

runcom · 2017-04-11T11:07:18Z

will run k8s on this one last time, assuming Travis is green

cyphar · 2017-04-11T11:16:05Z

@runcom In particular can you make sure that ExecSync acts sanely if the program you run has a non-zero exit code? I think the previous way was actually not correct.

runcom · 2017-04-11T11:21:40Z

> @runcom In particular can you make sure that ExecSync acts sanely if the program you run has a non-zero exit code? I think the previous way was actually not correct.

How would you do this? Should we write more integration tests for this, or do the ones you fixed already cover this case? Otherwise, I'm just going to run the k8s tests, and I don't know whether they exercise this code path (ExecSync).

cyphar · 2017-04-11T11:27:49Z

@runcom Oh, I meant for you to just do kubectl exec and make sure it's sane if the command returns an error but I'm not sure if that uses ExecSync under the hood. If this doesn't cause any regressions then I guess it's good enough.

cyphar · 2017-04-11T11:28:24Z

@runcom Tests are 🦎 btw. 😸

runcom · 2017-04-11T11:30:04Z

running k8s tests right now!

runcom · 2017-04-11T13:12:05Z

no regression in k8s

Ran 121 of 213 Specs in 3315.479 seconds
FAIL! -- 111 Passed | 10 Failed | 0 Pending | 92 Skipped --- FAIL: TestE2eNode (3315.49s)

LGTM

cyphar · 2017-04-11T14:46:28Z

/cc @mrunalp

mrunalp · 2017-04-11T15:26:19Z

[root@localhost cri-o]# ocic ctr execsync --id 20e89b3 echo HELLO
Stdout:
HELLO

Stderr:

Exit code: 0
[root@localhost cri-o]# ocic ctr execsync --id 20e89b3 echoi hi
execing command in container failed: rpc error: code = 2 desc = command error: exit status 1, stdout: , stderr: [conmon:i]: about to waitpid: 24748
[conmon:e]: Failed to create container: exit status 1
, exit code 1
[root@localhost cri-o]# ocic ctr execsync --id 20e89b3 exit 21
execing command in container failed: rpc error: code = 2 desc = command error: exit status 1, stdout: , stderr: [conmon:i]: about to waitpid: 24779
[conmon:e]: Failed to create container: exit status 1
, exit code 1

mrunalp · 2017-04-11T15:28:11Z

Actually in the above cases it failed because it can't find the executable.

mrunalp · 2017-04-11T15:35:06Z

Looks fine from this test. It should be in stderr but that will be fixed once we have separate stderr pipe.

[root@localhost cri-o]# ocic ctr execsync --id 20e89b3 sleep a
Stdout:
sleep: invalid time interval 'a'
Try 'sleep --help' for more information.

Stderr:

Exit code: 1

cyphar · 2017-04-11T15:35:40Z

> It should be in stderr but that will be fixed once we have separate stderr pipe.

Yup, I'm working on that patch at the moment.

mrunalp · 2017-04-11T15:36:23Z

LGTM

k8s-ci-robot added the cncf-cla: yes label Apr 6, 2017

cyphar changed the title ~~[wip] conmon: handle multi-line logging~~ conmon: handle multi-line logging Apr 6, 2017

runcom reviewed Apr 6, 2017

cyphar mentioned this pull request Apr 6, 2017

Account for line endings in logging #430

Closed

mrunalp reviewed Apr 7, 2017

mikebrow reviewed Apr 8, 2017

runcom added this to the 0.2 milestone Apr 10, 2017

mrunalp mentioned this pull request Apr 10, 2017

k8s node-e2e tests [120/121] #441

Closed

cyphar added 2 commits April 11, 2017 20:32

mrunalp merged commit 7d329bc into cri-o:master Apr 11, 2017

cyphar deleted the conmon-sane-line-endings branch April 11, 2017 15:36

mrunalp mentioned this pull request Apr 11, 2017

Line endings not accounted for in logging #429

Closed
