runtime: FreeBSD memory corruption involving fork system call #15658
Comments
derekmarcotte
commented
May 12, 2016
Here's another panic experienced in mallocgc by the same sample code: |
bradfitz
added this to the Go1.7Maybe milestone
May 12, 2016
@derekmarcotte, can you also reproduce this at master? (which will become Go 1.7) And do you only see it on FreeBSD, or other operating systems as well? |
bradfitz
assigned
aclements
May 12, 2016
@RLH, weren't you seeing "finalizer already set" failures a little while ago? Did you track that down? Could it have been related? |
On closer inspection of the failures you posted (thanks for collecting several, BTW), this smells like memory corruption. Just guessing, there are a few likely culprits. It may be the finalizer code, but I actually think that's less likely. More likely is that it's the fork/exec code: that code is really subtle, mucks with the address space, and contains system-specific parts (which would explain why it's showing up on FreeBSD, but I haven't been able to reproduce it on Linux yet). @derekmarcotte, can you try commenting out the runtime.SetFinalizer call in newProcess in os/exec.go (your test doesn't need that finalizer) and see if you can still reproduce it? If you can, that will rule out finalizers. |
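For reference, the finalizer in question is the one set in newProcess. Below is a sketch of roughly what that function looked like in Go 1.6-era sources (the exact code may differ slightly by version), with the line to comment out for the experiment marked:

// In the os package (os/exec.go), approximately as of Go 1.6.
// This is an illustrative excerpt, not a standalone program.
func newProcess(pid int, handle uintptr) *Process {
	p := &Process{Pid: pid, handle: handle}
	// Commenting out the next line removes this finalizer from the test:
	runtime.SetFinalizer(p, (*Process).Release)
	return p
}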
Note that FreeBSD runs via gomote, if this is that easily reproducible. I haven't yet tried. |
derekmarcotte
commented
May 13, 2016
Just got a golang/go dev environment set up on my machine (was from FreeBSD packages). Will report back soon. |
derekmarcotte
commented
May 13, 2016
Here are the heads of a bunch of logs with the epoch at the start of the process, so you can see the interval. I suspected a race rather than memory corruption because, by and large, it is the "finalizer already set" error that crashes the process. I thought maybe the GC was setting these as free (or otherwise touching them) before SetFinalizer had a chance to set their value. I didn't include too many of them in my initial report, because I thought they were largely redundant. @bradfitz: these logs are against master:
@aclements: I will try your patch next. One variable at a time. |
The "fatal error: runtime.SetFinalizer: finalizer already set" bug I was None of the TOC code is in 1.7 or for that matter on TIP. I can force On Fri, May 13, 2016 at 6:47 PM, Derek Marcotte notifications@github.com
|
derekmarcotte
commented
May 15, 2016
@aclements: I've run it with your patch, although I haven't been able to babysit it too much. The first time, all threads were idle after a number of hours (i.e. 0% cpu across the board). Connecting gdb to that process gave me trouble, and I couldn't get any logging out of it. This morning, I was able to connect to a different process that looks a lot like 1462933614-wedged.txt. I've attached a log from gdb there: Will keep trying to come up with more info. |
derekmarcotte
commented
May 16, 2016
@aclements: Here are some more logs from a binary built with the patch: 1463315708-finalizer-already-set.txt Please let me know if I can be of further assistance.
Thanks for the logs! I hadn't realized there were two finalizers involved here. Could you also comment out the SetFinalizer in NewFile in os/file_unix.go and see if it's still reproducible? (Your test also doesn't need that finalizer.) |
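For reference, the second finalizer is the one set on the underlying *file in NewFile. A sketch of roughly what that looked like in Go 1.6-era os/file_unix.go (again, an illustrative excerpt rather than the exact source, which varies by version), with the line to comment out marked:

// In the os package (os/file_unix.go), approximately as of Go 1.6.
func NewFile(fd uintptr, name string) *File {
	fdi := int(fd)
	if fdi < 0 {
		return nil
	}
	f := &File{&file{fd: fdi, name: name}}
	// Commenting out the next line removes this finalizer from the test:
	runtime.SetFinalizer(f.file, (*file).close)
	return f
}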
I didn't mean to say that it isn't necessarily a race. It's actually quite likely a race, but it's resulting in corruption of internal runtime structures, which suggests that the race is happening on freed memory. The "workbuf is empty" failure mode especially points at memory corruption, which is why my initial guess is that the finalizers (and the specials queue in general) may be victims rather than perpetrators. It's also easy to get finalizers out of the picture, while it's harder to get fork/exec out of the picture without completely changing the program. :) |
derekmarcotte
referenced this issue
in derekmarcotte/hfm
May 16, 2016
Open
FreeBSD instability: go runtime panics #1
derekmarcotte
commented
May 17, 2016
Thanks @aclements ! One crash, ~4 hours into running, since removing the SetFinalizer in NewFile. I have a second process running for almost 11 hours, with 4 of the threads wedged, but it is still doing work. |
derekmarcotte
commented
May 17, 2016
After ~11 hours + 5 minutes, the process panic'd: |
Thanks for the new logs. I suspect that's actually not the same failure, which suggests that the original problem is in fact related to finalizers. |
derekmarcotte
commented
May 17, 2016
My pleasure, thanks for the feedback. Are there any next steps for me? (I'm likely to poke around this bottom issue, although I'm neither a go runtime guy, nor a FreeBSD systems-level guy - yet. Would like to be as helpful as possible.) Thanks again! |
derekmarcotte
commented
May 19, 2016
I'm going to post a few more here. This'll be my last batch, unless requested.
|
Thanks! I assume these last few failures are also with master and with the two SetFinalizer calls commented out? |
derekmarcotte
commented
May 19, 2016
That's correct! Thanks. Anyone else able to reproduce? |
@derekmarcotte, what version of FreeBSD and how many CPUs are you running with? I haven't had any luck yet reproducing on FreeBSD 10.1 with 2 CPUs. |
derekmarcotte
commented
May 26, 2016
@aclements: both machines were 4 core, 8 thread CPUs:
The Xeon is running 10.3-RELEASE, and the AMD was running 10.1-RELEASE at the time of the logs (has since been upgraded to 10.3-RELEASE). I suspect I would be able to chew through many more invocations in the same time as a 2 core machine on these hosts, and additionally increase probability of contention/collisions in any given instant. The Xeon has since moved to production, so I don't have that hardware at my disposal for the time being, although I might be able to arrange something if it's required. I can get dmesgs/kldstats for the xeon, and amd if helpful (would rather post out of band). |
@derekmarcotte, thanks for the extra details (I see now you already gave the CPU configuration; sorry I missed that). Two more experiments to try:
|
derekmarcotte
commented
Jun 1, 2016
@aclements, thanks for your suggestions. I'm currently exploring a different option. I was using gb for building my project (sorry! forgot to mention), and additionally for this test case. I certainly didn't expect wildly differing binaries in a project with no external dependencies, as gb uses the go compiler internally. I've got more research to do here to account for this, so I apologize for that. I've built using go directly and am in the process of testing. So far it has been running for 12 hours without problems (with the SetFinalizers disabled). I have had previous test runs last this long, so I'm not ready to call it a success just yet. I'll be out of town for the next few days, so I can leave it running for a while and see where it ends up. I think this is a very promising lead, based on the objdump of the two artefacts. It might be interesting to capture the build tool in the issue report template, given the build tool ecosystem that is currently out there (or the fact that there is an ecosystem at all).
derekmarcotte
commented
Jun 1, 2016
@aclements, rebuilding gb from source using the go based on master, and then rebuilding the test with the new gb, creates nearly identical binaries (minus file path TEXT entries); this is to be expected. Perhaps there's something to this. Will keep you posted.
cc @davecheney |
derekmarcotte
commented
Jun 2, 2016
@aclements, the go-only build of the binary did eventually crash... somewhere around 24 hours (~153M goroutines later), so I don't think it's gb-related. I'm going to re-build with the following, per your suggestion:

func run(done chan struct{}) {
	cmd := exec.Command("doesnotexist")
	cmd.Wait()
	done <- struct{}{}
	return
}
derekmarcotte
commented
Jun 6, 2016
@aclements, after the above change, it ran for 4 days without a panic.
@derekmarcotte, argh, I'm sorry. That wasn't a useful test. :( That eliminates the fork, the exec, and the finalizer, so there isn't really anything left. It needs to be something like:

func run(done chan struct{}) {
	cmd := exec.Command("/doesnotexist")
	cmd.Start()
	cmd.Wait()
	done <- struct{}{}
	return
}

It's important that the path be absolute. Otherwise, Go will try to resolve the path itself, fail at that point, and not do anything for the subsequent method calls. It's also important that the cmd.Start() call is there, since without it the command is never actually forked and exec'd. I'm sorry I didn't catch that earlier.
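To illustrate the distinction, here is a minimal standalone sketch (not part of the test program): with a relative path, exec.Command resolves the binary via LookPath when the Cmd is constructed, the lookup fails, and Start returns that stored error without ever forking; with an absolute path the lookup is skipped, so Start really does fork and exec, and the failure comes back from the child.

package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Relative path: LookPath fails inside exec.Command, so Start
	// returns the stored error without forking at all.
	rel := exec.Command("doesnotexist")
	fmt.Println("relative:", rel.Start())

	// Absolute path: no lookup is attempted, so Start forks and
	// execs; the "no such file" error is reported from the child.
	abs := exec.Command("/doesnotexist")
	fmt.Println("absolute:", abs.Start())
}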
derekmarcotte
commented
Jun 9, 2016
@aclements, at least we know that Go can spin for 4 days without crashing on my dev machine - so not a total waste. I've recompiled with the above, and crashed after ~32 hours. Log attached: |
@derekmarcotte, thanks! Was that with go or gb? With or without the SetFinalizer calls? |
Also, 1.6.2 or master? |
derekmarcotte
commented
Jun 9, 2016
@aclements great questions. Was thinking we ought to recap all the variables at this point anyways. Trying to narrow things down as much as possible. So, this is with
with source:

package main

/* stdlib includes */
import (
	"fmt"
	"os/exec"
)

func run(done chan struct{}) {
	cmd := exec.Command("/doesnotexist")
	cmd.Start()
	cmd.Wait()
	done <- struct{}{}
	return
}

func main() {
	fmt.Println("Starting a bunch of goroutines...")

	// 8 & 16 are arbitrary
	done := make(chan struct{}, 16)
	for i := 0; i < 8; i++ {
		go run(done)
	}

	for {
		select {
		case <-done:
			go run(done)
		}
	}
}
It sounds like this is FreeBSD-specific, so marking as such. Please correct me if I am wrong. |
ianlancetaylor
changed the title from
runtime: garbage collector race condition (SetFinalizer, addfinalizer, addspecial, removespecial)
to
runtime: FreeBSD memory corruption involving finalizers
Jun 14, 2016
ianlancetaylor
added
the
OS-FreeBSD
label
Jun 14, 2016
derekmarcotte
commented
Jun 15, 2016
@ianlancetaylor: thanks for this. I'm just building a Linux box dedicated to testing tonight. I haven't yet been able to test on Linux. I'll update the thread as I find more information out.
adg
added this to the Go1.8 milestone
Jun 20, 2016
derekmarcotte
commented
May 25, 2017
The last notable status I have is from March. As far as I know, this is still an issue. I'm at the limits of my understanding/ability and have directed my efforts to places where I can be more helpful. If someone wanted to point me at an instance, I would be happy to collaborate and get it set up and repeatable.
derekmarcotte
commented
May 25, 2017
Confirming this is still an issue on 10.3-RELEASE, against master (a9d8d4d):
|
Wouldn't using vfork instead of fork be worth trying here?
The kernel code underneath fork is the same as vfork, so that seems unlikely.
kostikbel
commented
May 28, 2017
Do you know what the issue is? Do you know that the issue is in the FreeBSD kernel and not in the go runtime? Also, even if the issue is apparently in the kernel, the fork vs. vfork difference is exactly where the VM is involved. So, although not definitive, the experiment might give some hints about where to look next.
No, we still don't know if it's kernel or runtime. I've tried lots of kernel options on the raw kernel forks, but none made any difference. I've tried replicating with C code and have been unable to do so. Even with the best reproduction case I have, it always takes quite a number of attempts. Given this, it seems it requires complex threading interaction during the actual fork kernel call.
rhysh
referenced this issue
Jun 2, 2017
Closed
runtime: memory corruption crashes with os/exec on Linux kernel 4.4 #20427
bradfitz
deleted a comment from
fcntl
Jun 16, 2017
derekmarcotte
commented
Jun 16, 2017
Based on the fact that the same symptoms are reproducible under linux, does it make sense to no longer call this a FreeBSD issue? |
Let's not merge them until we know it's the same underlying cause. Same symptoms can be misleading. |
rsc
closed this
Jun 16, 2017
rsc
reopened this
Jun 16, 2017
(Sorry, wrong button.) |
derekmarcotte
commented
Jun 16, 2017
@rsc: Thanks. Agreed that there may be other things going on here. Having said that, the issue I'm referencing is having the same crashes (…). I haven't seen the …
jclinton
commented
Jun 25, 2017
Are the folks experiencing this running on Skylake or Kaby Lake hardware? If so, this might be relevant: https://lists.debian.org/debian-devel/2017/06/msg00308.html . Please try updating your CPU microcode to the latest release. |
Nope Nehalem EP here. |
aclements
referenced this issue
Aug 2, 2017
Closed
runtime: crash in runSafepointFn apparently because sched.safePointFn is nil #21276
I guess we still don't know what is happening. Punting to 1.10. |
ianlancetaylor
modified the milestones:
Go1.10,
Go1.9
Aug 2, 2017
ianlancetaylor
added
the
release-blocker
label
Aug 2, 2017
reezer
commented
Aug 25, 2017
I agree, @derekmarcotte. I think it doesn't make sense to state that something is FreeBSD-related when something very similar happens on at least one other OS too. If this gets changed, it should be mentioned in the release notes as well. However, there it might be enough to simply mention Linux and reference the corresponding issue.
dankm
commented
Sep 19, 2017
Might this change to FreeBSD be related? It doesn't seem likely to me, but others here know more about how Go works than I do.
It looks like Go only uses the LDT on 386 and not amd64, so I suspect this wouldn't have any impact, @dankm.
stevenh
referenced this issue
in shirou/gopsutil
Oct 4, 2017
Open
Use of external commands causes binaries using gopsutil to crash #430
added a commit
to stevenh/gopsutil
that referenced
this issue
Oct 5, 2017
added a commit
to stevenh/gopsutil
that referenced
this issue
Oct 5, 2017
stevenh
referenced this issue
in shirou/gopsutil
Oct 5, 2017
Merged
Eliminate use of sysctl command on FreeBSD #432
Just wanted to confirm this is still an issue with 1.9.1 which includes the signal handling fixes:
|
added a commit
to ContinuumLLC/godep-shirou-gopsutil
that referenced
this issue
Oct 26, 2017
marcan
referenced this issue
in prometheus/node_exporter
Nov 8, 2017
Open
Various crashes/segfaults on one host #730
blinkingled
commented
Dec 8, 2017
I was running the original test case posted above on DragonFlyBSD, mostly out of curiosity. Although I don't see any crashes/corruption in the short time I let it run, on 5.0.2 after running the test case I consistently get this in syslog -
This is coming from sys/kern/kern_synch.c:716 - which is tsleep() -
I am not sure this is even relevant - I haven't looked deeper, but thought I would throw this out just in case. |
ianlancetaylor
removed
the
release-blocker
label
Jan 3, 2018
ianlancetaylor
modified the milestones:
Go1.10,
Go1.11
Jan 3, 2018
vangyzen
commented
Feb 15, 2018
I haven't read the details of this issue, but even so, it could be fixed by this recent change in the FreeBSD VM system: https://svnweb.freebsd.org/base?view=revision&revision=329254 If you want to test that change, you'll also need this one: https://svnweb.freebsd.org/base?view=revision&revision=329252
It's not clear if these two kernel fixes will help fix the issues identified here, but they are definitely in the right area, particularly the second one. I'm trying to confirm either way whether the above have an impact; tests are in progress. I'll report back when I have more, but if anyone else has a good reproduction environment in which they can apply the above patches and test, that would be great.
Hehe, I guess I should have refreshed my page before replying, thanks @vangyzen :) |
Having applied the two patches, I've unfortunately managed to confirm it hasn't solved the issue, as I had a crash:
|
derekmarcotte commented May 12, 2016
Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?
go version go1.6.2 freebsd/amd64

What operating system and processor architecture are you using (go env)?

What did you expect to see?
I expect this strange program to spawn instances of /bin/true in parallel, until I stop it.

What did you see instead?
Various types of panics caused by what looks to be corruption within the finalizer lists, caused by what I am assuming are race conditions. These panics can happen as quickly as 2 minutes, or much longer. 10 minutes seems a good round number.

Occasionally addspecial gets stuck in an infinite loop holding the lock, and the process wedges. This is illustrated in log 1462933614, with x.next pointing to x. This appears to be corruption of that data structure. I have seen processes in this state run for 22 hours.
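As an illustration of the wedge described above (this is not the runtime's code, just a minimal sketch of why a node whose next pointer refers back to itself makes an ordered-insert loop spin forever while the caller holds a lock):

package main

import "fmt"

// special mimics the runtime's per-object special records: a singly
// linked list kept sorted by offset. Purely illustrative, not runtime code.
type special struct {
	offset int
	next   *special
}

// insertSorted walks the list to find the insertion point, the way the
// runtime's addspecial does while holding its lock. This demo bails out
// after a bounded number of steps so it terminates; the real loop has no
// such bound, so a corrupted self-link wedges the process.
func insertSorted(head **special, s *special) bool {
	iter := head
	for x, steps := *iter, 0; x != nil; x, steps = x.next, steps+1 {
		if steps > 1000000 {
			return false // cycle detected: the real loop would never return
		}
		if x.offset >= s.offset {
			break
		}
		iter = &x.next
	}
	s.next = *iter
	*iter = s
	return true
}

func main() {
	bad := &special{offset: 1}
	bad.next = bad // simulate the corruption seen in log 1462933614: x.next == x
	head := bad
	if !insertSorted(&head, &special{offset: 2}) {
		fmt.Println("list is cyclic; the real addspecial would spin forever holding the lock")
	}
}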
I understand there is some trepidation expressed in issue #11485 around the locking of the data structures involved.
Here are some sample messages:
fatal error: runtime.SetFinalizer: finalizer already set
runtime: nonempty check fails b.log[0]= 0 b.log[1]= 0 b.log[2]= 0 b.log[3]= 0 fatal error: workbuf is empty
1462926841-SetFinalizer-ex1.txt
1462926969-SetFinalizer-ex2.txt
1462933295-nonempty-check-fails.txt
1462933614-wedged.txt
This was run on an 8-core processor, and a 4-core 8-thread processor with ECC RAM, similar results.
Additionally, while this example is an extreme, it also represents the core functionality of a project I've been working on part-time for many months. I'm happy to provide any further assistance diagnosing this issue - I'm very invested!
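The test program itself is not reproduced in this excerpt. A sketch in its spirit, assuming the same structure as the /doesnotexist variant quoted later in the thread with the command swapped back to /bin/true (the exact original source may have differed, e.g. it may have used cmd.Run()):

package main

import (
	"fmt"
	"os/exec"
)

func run(done chan struct{}) {
	// Spawn /bin/true and wait for it to exit.
	cmd := exec.Command("/bin/true")
	cmd.Start()
	cmd.Wait()
	done <- struct{}{}
}

func main() {
	fmt.Println("Starting a bunch of goroutines...")

	// 8 & 16 are arbitrary
	done := make(chan struct{}, 16)
	for i := 0; i < 8; i++ {
		go run(done)
	}

	// Every time a goroutine finishes, start another, indefinitely.
	for {
		<-done
		go run(done)
	}
}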