@cyphar
Contributor

@cyphar cyphar commented Apr 6, 2017

The CRI requires us to prepend (timestamp, stream) to every line of the
output, and it's quite likely (especially in the !terminal case) that we
will read more than one line of output in the read loop.

So, we need to write out each line separately with the prepended
timestamps. Doing this the simple way (the final part of the buffer is
written partially if it doesn't end in a newline) makes the code much
simpler, with the downside that if we ever switch to multiple streams
for output we'll have to rewrite parts of this.

Alternative to #430.

Fixes: 1dc4c87 ("conmon: add timestamps to logs")
Signed-off-by: Aleksa Sarai [email protected]
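
The line-splitting approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `write_prefixed_lines`, `write_all`, and the `prefix` parameter are hypothetical names, and the sketch assumes the caller has already formatted the "timestamp stream " prefix. Like the description, it takes the simple route: a trailing fragment without a newline is written immediately, and the next call continues that line without a new prefix, which only works while there is a single output stream.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write all of buf to fd, retrying on short writes. */
static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);
		if (n < 0)
			return -1;
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Hypothetical sketch, not the PR's write_k8s_log(): emit buf to fd and
 * write prefix (e.g. "<timestamp> <stream> ") at the start of every line.
 * A trailing fragment without '\n' is written as-is; the next call then
 * continues that line without a new prefix.
 */
int write_prefixed_lines(int fd, const char *prefix,
			 const char *buf, size_t buflen)
{
	/* Remembers whether the previous write ended a line (single stream). */
	static bool at_line_start = true;

	assert(prefix != NULL && buf != NULL);

	while (buflen > 0) {
		const char *nl = memchr(buf, '\n', buflen);
		/* Length of this piece, including the newline if there is one. */
		size_t len = nl ? (size_t)(nl - buf) + 1 : buflen;

		if (at_line_start && write_all(fd, prefix, strlen(prefix)) < 0)
			return -1;
		if (write_all(fd, buf, len) < 0)
			return -1;

		at_line_start = (nl != NULL);
		buf += len;
		buflen -= len;
	}
	return 0;
}
```

Because `memchr` is bounded by `buflen`, embedded NUL bytes in the container output are passed through untouched rather than ending the scan early.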

@cyphar
Contributor Author

cyphar commented Apr 6, 2017

I haven't tested this yet.

@cyphar cyphar changed the title from "[wip] conmon: handle multi-line logging" to "conmon: handle multi-line logging" Apr 6, 2017
conmon/conmon.c Outdated
ptrdiff_t line_len = buf - line_end;

/* Write the (timestamp, stream, line) tuple. */
if (write(fd, tsbuf, TSBUFLEN-1) < 0) {
Member

Is this making sure we drop the NUL terminator which got in with snprintf? One of the original issues is that the NUL terminator (00) is causing string matching to fail.

Contributor Author

TSBUFLEN-1 doesn't contain the null terminator. But I could switch to strlen(tsbuf) if you prefer.

Member

Nah it's fine
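
The point in this exchange, that the byte count passed to write() must exclude the terminator snprintf() appends, can be illustrated with a small sketch. `format_prefix` is a hypothetical helper, not a function from conmon; it returns the number of bytes to write so the caller never needs to reason about the buffer size.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical sketch: format "<timestamp> <stream> " into out and return
 * the number of bytes to write. That count excludes the terminating NUL
 * which snprintf() always appends. Returns -1 on error or truncation.
 */
int format_prefix(char *out, size_t outlen,
		  const char *timestamp, const char *stream)
{
	assert(out != NULL && outlen > 0);

	int n = snprintf(out, outlen, "%s %s ", timestamp, stream);
	if (n < 0 || (size_t)n >= outlen)
		return -1;
	return n; /* same value as strlen(out) */
}
```

A caller would then use `write(fd, buf, n)` instead of `write(fd, buf, sizeof(buf))`, which would put a 00 byte into the log file.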

@runcom
Member

runcom commented Apr 6, 2017

I'll run k8s tests with this tomorrow afternoon (my timezone)

@cyphar
Contributor Author

cyphar commented Apr 6, 2017

/me just realised the output spacing was wrong. Pushed a fix and squashed.

@mrunalp
Member

mrunalp commented Apr 6, 2017

@cyphar There are bugs in this right now. It returns an empty line when queried through kubectl logs. Details:

[root@dhcp-16-129 kubernetes]# ./cluster/kubectl.sh  run  httpd --image=httpd:2.4-alpine
deployment "httpd" created
[root@dhcp-16-129 kubernetes]# ./cluster/kubectl.sh  describe pod httpd | grep IP
IP:
[root@dhcp-16-129 kubernetes]# ./cluster/kubectl.sh  describe pod httpd | grep IP
IP:             10.88.0.76
[root@dhcp-16-129 kubernetes]# curl 10.88.0.76
<html><body><h1>It works!</h1></body></html>
[root@dhcp-16-129 kubernetes]# ./cluster/kubectl.sh get pods
NAME                     READY     STATUS    RESTARTS   AGE
httpd-3531205961-q5j6l   1/1       Running   0          16s
[root@dhcp-16-129 kubernetes]# ./cluster/kubectl.sh logs httpd-3531205961-q5j6l
[root@dhcp-16-129 kubernetes]#

File contents:

[root@dhcp-16-129 c00a5dd8-1afd-11e7-b574-74852a1f5251]# cat httpd_0.log
2017-04-06T12:18:02-07:00 stdout 2017-04-06T12:18:02-07:00 stdout 2017-04-06T12:18:02-07:00 stdout 2017-04-06T12:18:02-07:00 stdout 2017-04-06T12:18:02-07:00 stdout 2017-04-06T12:18:02-07:00 stdout 2017-04-06T12:18:02-07:00 stdout

@cyphar
Contributor Author

cyphar commented Apr 7, 2017

Yeah sorry @mrunalp this code was quite wrong before. I've now tested it with quite a few test cases (here's a sample), so it should work now:

int main(void)
{
	write_k8s_log(1, "stdout", "a sane line buffered\n", 21);
	write_k8s_log(1, "stdout", "this\nis\nkinda coo\0l!", 20);
	write_k8s_log(1, "stdout", "even more cool stuf\n\n\n\n", 23);
	write_k8s_log(1, "stdout", "what is even going\n\n\n\nk", 23);
	write_k8s_log(1, "stdout", " ---   \0\0", 9);
	write_k8s_log(1, "stdout", " ++ \n", 5);
	return 0;
}

Will output:

2017-04-08T04:35:07+10:00 stdout a sane line buffered
2017-04-08T04:35:07+10:00 stdout this
2017-04-08T04:35:07+10:00 stdout is
2017-04-08T04:35:07+10:00 stdout kinda cool!even more cool stuf
2017-04-08T04:35:07+10:00 stdout
2017-04-08T04:35:07+10:00 stdout
2017-04-08T04:35:07+10:00 stdout
2017-04-08T04:35:07+10:00 stdout what is even going
2017-04-08T04:35:07+10:00 stdout
2017-04-08T04:35:07+10:00 stdout
2017-04-08T04:35:07+10:00 stdout
2017-04-08T04:35:07+10:00 stdout k ---    ++

@mrunalp
Member

mrunalp commented Apr 7, 2017

@cyphar okay, will retest this. Thanks!

conmon/conmon.c Outdated
/* Log all output to logfd. */
if (write(logfd, buf, num_read) != num_read) {
nwarn("partial/failed write (logFd)");
if (write_k8s_log(logfd, "stdout", buf, num_read) < 0) {
Member

I think we should add buf[num_read] = '\0'. Otherwise we see trailing stuff in logs.

[conmon:i]: read a chunk: (fd=5) '3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593344612847564823378678316527120190914564/tmp/conmon-term.XXXXXXXX'

Contributor Author

I don't think we should modify the buffer like that (it means the logs won't actually match what was written by the program). The fix IMO is to change how we log the whole "read a chunk" thing. To be honest, that was a debugging measure and we should drop it.

Contributor Author

Can you verify whether the actual log file has incorrect data in it, or just the stderr log from conmon? Because to be honest, ninfo is currently reading out-of-bounds and we need to stop doing that, so I hope that's the only issue here. 😉

Member

@cyphar Adding the '\0' is only for making the debug logs better :) I think we will be fine even if we just drop the ninfo like you said. The actual log files looked good in my manual testing with a few different types of pods. Unfortunately, the e2e tests ran into unrelated issues on my machine that I am still debugging. So I will ask @runcom to run the suite on his machine.
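
If the debug message were kept rather than dropped, one way to avoid both the out-of-bounds read and the buffer modification discussed here is a bounded `%.*s` conversion. The sketch below is only an illustration: `format_chunk` is a hypothetical helper, not conmon's ninfo macro, and it assumes `len` fits in an `int`.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical debug helper: format a chunk of exactly len bytes that is
 * NOT NUL-terminated. The "%.*s" precision caps how many bytes are read,
 * so nothing past buf[len - 1] is touched and buf itself is not modified.
 * Output still stops early at an embedded NUL byte, if there is one.
 */
int format_chunk(char *out, size_t outlen, int fd,
		 const char *buf, size_t len)
{
	assert(out != NULL && buf != NULL);
	return snprintf(out, outlen, "read a chunk: (fd=%d) '%.*s'",
			fd, (int)len, buf);
}
```

With this form the trailing `/tmp/conmon-term.XXXXXXXX` bytes seen in the message above would not be printed, because they lie beyond `len`.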

};

int set_k8s_timestamp(char *buf, ssize_t buflen, const char *stream_type)
/* strlen("1997-03-25T13:20:42+01:00") + 1 */
Contributor

Would prefer RFC3339Nano if we have a choice here...

Member

@mikebrow Yeah, we can do that for sure. Just want to get this correctness patch in first :)
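
For reference, a timestamp with nanosecond precision could be produced in C along these lines. This is a sketch under assumptions, not a proposed patch: `format_rfc3339nano` is a hypothetical name, it emits UTC with a `Z` suffix instead of the local offset used in the logs above, and it always prints nine fractional digits, whereas Go's RFC3339Nano layout trims trailing zeros. Both forms are valid RFC 3339.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <time.h>

/*
 * Hypothetical sketch: render ts as an RFC 3339 timestamp with nanosecond
 * precision in UTC, e.g. "2017-04-06T19:18:02.000000001Z".
 * Returns the string length, or -1 on error or truncation.
 */
int format_rfc3339nano(char *out, size_t outlen, const struct timespec *ts)
{
	struct tm tm;
	char date[32];

	assert(out != NULL && ts != NULL);

	if (gmtime_r(&ts->tv_sec, &tm) == NULL)
		return -1;
	if (strftime(date, sizeof(date), "%Y-%m-%dT%H:%M:%S", &tm) == 0)
		return -1;

	/* Append the zero-padded nanoseconds and the UTC designator. */
	int n = snprintf(out, outlen, "%s.%09ldZ", date, (long)ts->tv_nsec);
	if (n < 0 || (size_t)n >= outlen)
		return -1;
	return n;
}
```

The `struct timespec` would typically come from `clock_gettime(CLOCK_REALTIME, &ts)`.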

@cyphar
Contributor Author

cyphar commented Apr 9, 2017

@mrunalp Pushed a commit that disables the debug logging. @runcom can you test this?

The test failures don't make sense -- they were passing yesterday and the only code change is me removing the ninfo...

@runcom
Member

runcom commented Apr 10, 2017

testing this out with k8s right now. (restarted Travis also)

@runcom
Member

runcom commented Apr 10, 2017

as far as k8s testing is concerned this PR LGTM :) (109/121 is a great result)

Ran 121 of 211 Specs in 5036.832 seconds
FAIL! -- 109 Passed | 12 Failed | 0 Pending | 90 Skipped --- FAIL: TestE2eNode (5036.85s)

@runcom
Member

runcom commented Apr 10, 2017

(testing with latest k8s master source seems fine as well 👍 )

weirdly enough, tests fail with https://travis-ci.org/kubernetes-incubator/cri-o/jobs/220157466#L2775 (I've never seen it)

@runcom runcom added this to the 0.2 milestone Apr 10, 2017
@cyphar
Contributor Author

cyphar commented Apr 10, 2017

I'm super confused why Travis is broken, the previous commit passed the cases (and now I'm worried to re-run the old commit). There's some problem with seccomp though...

@mrunalp
Member

mrunalp commented Apr 10, 2017

@cyphar Yeah weird. The first failure seems to be execsync related.

@cyphar
Contributor Author

cyphar commented Apr 10, 2017

It's definitely a real failure, I just am confused what commits hit master between the two test runs that caused the breakage. I'll take a look today.

@cyphar
Contributor Author

cyphar commented Apr 10, 2017

Ah, I think I know why. It's because ExecSync isn't meant to output the k8s log format (and it was actually the debugging information which caused it to succeed by accident). As an aside, the fact that we have to add more hacks like this is more indication in my mind we'll need to rewrite most of this code after we hit feature parity.

@mrunalp
Member

mrunalp commented Apr 10, 2017

@cyphar Another option is to strip this out before we send back the ExecSyncResponse. We will probably need to do that to return stderr separately when we add a separate pipe for it. However, we can do this in a follow on.
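
The stripping option mentioned here amounts to skipping the first two space-separated fields of each log line. A rough sketch, written in C only to match the conmon code in this PR (the ExecSync handling itself lives in cri-o's Go code): `strip_log_prefix` is a hypothetical name, and it assumes neither the timestamp nor the stream name contains a space.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical sketch: given one log line of the form
 * "<timestamp> <stream> <content>", return a pointer to <content>.
 * Returns NULL if the line does not contain both separators.
 */
const char *strip_log_prefix(const char *line)
{
	assert(line != NULL);

	const char *sp = strchr(line, ' ');	/* end of the timestamp */
	if (sp == NULL)
		return NULL;
	sp = strchr(sp + 1, ' ');		/* end of the stream name */
	if (sp == NULL)
		return NULL;
	return sp + 1;
}
```

The stream field that gets skipped is also what would let a caller route each line to stdout or stderr once the separate pipe exists.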

@cyphar
Contributor Author

cyphar commented Apr 10, 2017

There was another error which is that ExecSync would give you the conmon logs not the container logs if the container exited with an error. That's the first test failure. I think the second test failure is just because our tests are so interlinked.

In addition, ExecSync wasn't handled properly when we had an exit code (we would return ExecSyncError rather than ExecSyncResponse, which I'm fairly sure is patently wrong). I've fixed that too.
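
The distinction drawn here, that a non-zero exit code is a result to report rather than an internal error, can be illustrated at the process level. The sketch below is a generic POSIX example, not cri-o's ExecSync implementation (which is Go); `run_command` is a hypothetical name.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical sketch: run argv and store its exit status in *exit_code.
 * Returns 0 whenever the child ran to completion, even with a non-zero
 * exit status. Returns -1 only when the machinery itself failed
 * (fork/waitpid errors, or the child being killed by a signal).
 */
int run_command(char *const argv[], int *exit_code)
{
	assert(argv != NULL && argv[0] != NULL && exit_code != NULL);

	pid_t pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		execvp(argv[0], argv);
		_exit(127); /* shell convention for "command not found" */
	}

	int status;
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	if (!WIFEXITED(status))
		return -1;

	*exit_code = WEXITSTATUS(status);
	return 0;
}
```

A caller checks the return value for infrastructure failures and reads `*exit_code` separately, which mirrors returning a response with an exit code instead of an error.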

@cyphar
Contributor Author

cyphar commented Apr 11, 2017

@mrunalp I will switch to stripping after this is merged and we do the separate pipes. The tests pass now, so I'm squashing.

cyphar added 2 commits April 11, 2017 20:32
Previously we returned an internal error result when a program had a
non-zero exit code, which was incorrect. Fix this as well as change the
tests to actually check the "ExitCode" response from ExecSync (rather
than expecting ocic-ctr to return an internal error).

Signed-off-by: Aleksa Sarai <[email protected]>

The CRI requires us to prepend (timestamp, stream) to every line of the
output, and it's quite likely (especially in the !terminal case) that we
will read more than one line of output in the read loop.

So, we need to write out each line separately with the prepended
timestamps. Doing this the simple way (the final part of the buffer is
written partially if it doesn't end in a newline) makes the code much
simpler, with the downside that if we ever switch to multiple streams
for output we'll have to rewrite parts of this.

In addition, drop the debugging output of cri-o for each chunk read so
we stop spamming stderr. We can do this now because 8a928d0
("oci: make ExecSync with ExitCode != 0 act properly") actually fixed
how ExecSync was being handled (especially in regards to this patch).

Fixes: 1dc4c87 ("conmon: add timestamps to logs")
Signed-off-by: Aleksa Sarai <[email protected]>
@runcom
Member

runcom commented Apr 11, 2017

will run k8s on this one last time, assuming Travis is green

@cyphar
Contributor Author

cyphar commented Apr 11, 2017

@runcom In particular can you make sure that ExecSync acts sanely if the program you run has a non-zero exit code? I think the previous way was actually not correct.

@runcom
Member

runcom commented Apr 11, 2017

@runcom In particular can you make sure that ExecSync acts sanely if the program you run has a non-zero exit code? I think the previous way was actually not correct.

how would you do this? should we write other integration tests for this, or do the ones you fixed make sure this case works fine? otherwise, I'm just going to run k8s tests, and I don't know whether they test this code path (ExecSync)

@cyphar
Contributor Author

cyphar commented Apr 11, 2017

@runcom Oh, I meant for you to just do kubectl exec and make sure it's sane if the command returns an error but I'm not sure if that uses ExecSync under the hood. If this doesn't cause any regressions then I guess it's good enough.

@cyphar
Contributor Author

cyphar commented Apr 11, 2017

@runcom Tests are 🦎 btw. 😸

@runcom
Member

runcom commented Apr 11, 2017

running k8s tests right now!

@runcom
Member

runcom commented Apr 11, 2017

no regression in k8s

Ran 121 of 213 Specs in 3315.479 seconds
FAIL! -- 111 Passed | 10 Failed | 0 Pending | 92 Skipped --- FAIL: TestE2eNode (3315.49s)

LGTM

@cyphar
Contributor Author

cyphar commented Apr 11, 2017

/cc @mrunalp

@mrunalp
Member

mrunalp commented Apr 11, 2017

[root@localhost cri-o]# ocic ctr execsync --id 20e89b3 echo HELLO
Stdout:
HELLO

Stderr:

Exit code: 0
[root@localhost cri-o]# ocic ctr execsync --id 20e89b3 echoi hi
execing command in container failed: rpc error: code = 2 desc = command error: exit status 1, stdout: , stderr: [conmon:i]: about to waitpid: 24748
[conmon:e]: Failed to create container: exit status 1
, exit code 1
[root@localhost cri-o]# ocic ctr execsync --id 20e89b3 exit 21
execing command in container failed: rpc error: code = 2 desc = command error: exit status 1, stdout: , stderr: [conmon:i]: about to waitpid: 24779
[conmon:e]: Failed to create container: exit status 1
, exit code 1

@mrunalp
Member

mrunalp commented Apr 11, 2017

Actually in the above cases it failed because it can't find the executable.

@mrunalp
Member

mrunalp commented Apr 11, 2017

Looks fine from this test. It should be in stderr but that will be fixed once we have separate stderr pipe.

[root@localhost cri-o]# ocic ctr execsync --id 20e89b3 sleep a
Stdout:
sleep: invalid time interval 'a'
Try 'sleep --help' for more information.

Stderr:

Exit code: 1

@cyphar
Contributor Author

cyphar commented Apr 11, 2017

It should be in stderr but that will be fixed once we have separate stderr pipe.

Yup, I'm working on that patch at the moment.

@mrunalp
Member

mrunalp commented Apr 11, 2017

LGTM

@mrunalp mrunalp merged commit 7d329bc into cri-o:master Apr 11, 2017
@cyphar cyphar deleted the conmon-sane-line-endings branch April 11, 2017 15:36