…according to a Twitter post by the Chief Information Security Officer of Grand Canyon Education.
So, does anyone else find it odd that the file that caused CrowdStrike to freak out, C-00000291-00000000-00000032.sys, was 42KB of blank/null values, while the replacement file C-00000291-00000000-00000033.sys was 35KB and looked like a normal, if obfuscated, sys/.conf file?
Also, apparently CrowdStrike had at least 5 hours to work on the problem between the time it was discovered and the time it was fixed.
Every affected company should be extremely thankful that this was an accidental bug, because if CrowdStrike gets hacked, the bad actors could basically ransom who knows how many millions of computers overnight.
Not to mention that CrowdStrike will now be a massive target for hackers trying to do exactly this.
Don’t Google SolarWinds
Holy hell
New vulnerability just dropped
Oooooooo this one again thank you for reminding me
I’d assume state (or other serious) actors already know about these companies.
Ah, a classic off by 43,008 zeroes error.
If I had to bet money, a bad machine with corrupted memory pushed the file at the very final stage of the release.
The astonishing thing is that for security software I would expect every file to be verified against a signature (which would have prevented this issue, and some kinds of attacks).
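For illustration only, here's a toy version of that kind of check in Python. The key, payload, and helper names are all made up, and a real code-signing scheme would use asymmetric signatures rather than a shared-key HMAC, but the principle is the same: a file of 43,008 null bytes never gets anywhere near the kernel.

```python
import hashlib
import hmac

def verify_channel_file(data: bytes, expected_sig: bytes, key: bytes) -> bool:
    """Reject any content update whose signature doesn't match.

    compare_digest is constant-time, so the check doesn't leak
    how many leading bytes of the signature were correct.
    """
    actual = hmac.new(key, data, hashlib.sha256).digest()
    return hmac.compare_digest(actual, expected_sig)

# Hypothetical signing key and payload, just to exercise the check.
key = b"vendor-signing-key"
good = b"\x01 real channel-file payload"
sig = hmac.new(key, good, hashlib.sha256).digest()

print(verify_channel_file(good, sig, key))               # genuine file passes
print(verify_channel_file(b"\x00" * 43008, sig, key))    # all-zero file rejected
```

The corrupted/nulled file fails verification before it is ever parsed, which is exactly the kind of gate the comment above is asking for.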
Which is still unacceptable.
Windows kernel drivers are signed by Microsoft. They must have rubber-stamped this for it to go through, though.
This was not the driver; it was a config file or something read by the driver. Having a kernel-space driver depend on a config file at a regular path is another fuck up, though.
isn’t .sys a driver?
Not just drivers, no https://fileinfo.com/extension/sys
What about the Mac and Linux PCs? Did Microsoft sign those too?
only the Windows version was affected
Not sure about Mac, but on Linux, they’re signed by the distro maintainer or with the computer’s secure boot key.
So… Microsoft couldn’t have “rubber-stamped” anything to do with the outage.
The outage only affected the Windows version of Falcon. OSX and Linux were not affected.
So here’s my uneducated question: Don’t huge software companies like this usually do updates in “rollouts” to a small portion of users (companies) at a time?
Companies don’t like to be beta testers. Apparently the solution is to just not test anything and call it production ready.
Every company has a full-scale test environment. Some companies are just lucky enough to have a separate prod environment.
I mean yes, but one of the issues with “state of the art AV” is that they’re trying to roll out updates faster than bad actors can push out code exploiting discovered vulnerabilities.
The code/config/software push may have worked on some test systems but MS is always changing things too.
From my experience it was more likely to be an accidental overwrite from human error with recent policy changes that removed vetting steps.
deleted by creator
Quick development will probably spell the end of the internet once AI code creation hits its stride. It’ll be like the most top-heavy SCRUM you’ve ever seen, with the devs literally incapable of disagreeing.
I was thinking about his stint at McAfee, and I think you’re right. My real question is: will the next company he golden parachutes off to learn the lesson?
I’m going to bet not.
Which is still unacceptable.
d'00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000!
How can all of those zeroes cause a major OS crash?
If I send you on stage at the Olympic Games opening ceremony with a sealed envelope
And I say “This contains your script, just open it and read it”
And then when you open it, the script is blank
You’re gonna freak out
Ah, makes sense. I guess a driver would completely freak out if that file gave no instructions and was just like “…”
You would think that Microsoft would implement some basic error handling.
That’s what the BSOD is. It tries to bring the system back to a nice safe freshly-booted state where e.g. the fans are running and the GPU is not happily drawing several kilowatts and trying to catch fire.
No try-catch, no early exit condition checking and return, just nuke the system and start over?
Windows assumes that you installed that AV for a reason. If it suddenly faults, who’s to say it’s a bug and not some virus going ham on the AV? A BSOD is the most graceful exit you could do, ignoring and booting a potentially compromised system is a fairly big no-no (especially in systems that feel the need to install AV like this in the first place).
Great layman’s explanation.
Maybe. But I’d like to think I’d just say something clever like, “says here that this year the pummel horse will be replaced by yours truly!”
I’m gonna take from this that we should have AI doing disaster recovery on all deployments. Tech CEO’s have been hyping AI up so much, what could possibly go wrong?
Problem is that software cannot deal with unexpected situations like a human brain can. Computers do exactly what a programmer tells them to do, nothing more, nothing less. So if a situation arises that the programmer hasn’t written code for, there will be a crash.
Poorly written code can’t.
In this case:
- Load config data
- If data is valid:
- Use config data
- If data is invalid:
- Crash entire OS
Is just poor code.
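The list above can be sketched in a few lines of Python. Everything here is illustrative (the function names, the stand-in "parsing", and the fallback policy are all assumptions, not CrowdStrike's actual logic), but it shows the shape of "validate, fall back, never crash the OS":

```python
def parse_config(raw: bytes):
    """Return a parsed config, or None if the data is clearly invalid."""
    if not raw or all(b == 0 for b in raw):
        return None  # 42KB of null bytes lands here, not in a kernel fault
    return {"payload": raw}  # stand-in for real parsing/validation

def load_or_fallback(new: bytes, last_known_good: bytes):
    """Prefer the fresh update; degrade gracefully instead of crashing."""
    cfg = parse_config(new)
    if cfg is not None:
        return cfg, "updated"
    cfg = parse_config(last_known_good)
    if cfg is not None:
        return cfg, "fell back to previous config"
    return None, "refusing to start; alert an admin"

cfg, status = load_or_fallback(b"\x00" * 43008, b"\x01 old rules")
print(status)  # the null-filled update is rejected, the old config is kept
```

Whether falling back to an old config is itself safe is debatable (a downgrade can be an attack vector), but every branch here is strictly better than a boot loop.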
If AV suddenly stops working, it could mean the AV is compromised. A BSOD is a desirable outcome in that case. Booting a compromised system anyway is bad code.
You know there’s a whole other scenario where the system can simply boot the last known good config.
And what guarantees that that “last known good config” is available, not compromised and there’s no malicious actor trying to force the system to use a config that has a vulnerability?
Because it’s supposed to be something else
At least a few 1’s I imagine.
What if we put in a 2
Society isn’t ready for that
Well, you see, the front fell off.
Well, the file shouldn’t be zeroes
The file is used to store values used as denominators in some divisions later in the process. Being all zeros caused a division by zero error. Pretty rookie mistake; you should do IFERROR(;0) when using divisions to avoid that.
I disagree. I’d rather things crash than silently succeed or change the computation. They should have done better input and output validation, and gracefully failed into a recoverable state that sends a message to an admin to correct. A divide by zero doesn’t crash a system; it’s a recoverable error they should 100% detect and handle, not sweep under the rug.
Life pro tip: if you’re a Python programmer you should use try: func() except: continue every time you run a function; that way you would never have errors in your code.
Lol.
Windows
have they ruled out any possibility of a man in the middle attack by a foreign actor?
This was not a cyberattack.
https://www.crowdstrike.com/blog/statement-on-falcon-content-update-for-windows-hosts/
I guess they could be lying, but if they were lying, I don’t know if their argument of “we’re incompetent” is instilling more trust in them.
“We are confident that only our engineers can fuck up so much, instead of our competitors”
The CEO made a statement to the effect of “It’s not an attack, it’s just me and my company being shockingly incompetent.” He didn’t use exactly those words but that was the gist.
Or it being an intentional proof of concept
In the middle of the download path of all the machines that got the update?
Foreign to who?
“Foreign” in this context just means “not Crowdstrike”, not like a foreign government.
I’m not a dev, but don’t they have like a/b updates or at least test their updates in a sandbox before releasing them?
It could have been the release process itself that was bugged. The actual update that was supposed to go out was tested and worked; then the upload was corrupted or failed. They need to run tests on the actual released artifact instead of a local copy.
Could also be that the Windows versions they tested on weren’t as problematic as the updated drivers around the time they released.
one would think. apparently the world is their sandbox.
The fact that a single bad file can cause a kernel panic like this tells you everything you need to know about using this kind of integrated security product. Crowdstrike is apparently a rootkit, and windows apparently has zero execution integrity.
This is a pretty hot take. A single bad file can topple pretty much any operating system depending on what the file is. That’s part of why it’s important to be able to detect file corruption in a mission critical system.
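Detecting that kind of file corruption is cheap. A minimal sketch (the payload and manifest here are invented for the example): the shipped update carries a checksum computed at build time, and the client refuses to load anything that doesn't match, which would catch both a truncated upload and a file of nulls.

```python
import hashlib

def is_intact(data: bytes, expected_sha256: str) -> bool:
    """Compare the file against a checksum shipped out-of-band with the update."""
    return hashlib.sha256(data).hexdigest() == expected_sha256

# Build side: compute the digest of the artifact that was actually tested.
payload = b"channel file contents"
manifest_hash = hashlib.sha256(payload).hexdigest()

# Client side: verify before loading.
print(is_intact(payload, manifest_hash))           # genuine file passes
print(is_intact(b"\x00" * 43008, manifest_hash))   # corrupted/nulled file caught
```

A checksum only proves integrity, not authenticity (an attacker can recompute it), so mission-critical systems layer a signature on top; but even this alone would have turned the outage into a failed download.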
This was a binary configuration file of some sort though?
Something along the lines of:
IF (config.parameter.read == garbage) { Dont_panic; }
Would have helped greatly here.
Edit: oh it’s more like an unsigned binary blob that gets downloaded and directly executed. What could possibly go wrong with that approach?
We agree, but they were responding to “windows apparently has zero execution integrity”.
I’m not sure why you think this statement is so profound.
CrowdStrike is expected to have kernel level access to operate correctly. Kernel level exceptions cause these types of errors.
Windows handles exceptions just fine when code is run in user space.
This is how nearly all computers operate.
Yeah pretty much all security products need kernel level access unfortunately. The Linux ones including crowdstrike and also the Open Source tools SELinux and AppArmor all need some kind of kernel module in order to work.
At least SELinux doesn’t crash on bad config file
I am not praising crowdstrike here. They fucked up big time. I am saying that the concept of security software needing kernel access isn’t that unheard of, and is unfortunately necessary for a reason. There is only so much a security thing can do without that kernel level access.
CrowdStrike has caused issues like this with Linux systems in the past, but it sounds like they have now moved to eBPF user mode by default (I don’t know enough about low-level Linux to understand that, though, haha), and it now can’t crash the whole computer. source
As explained in that source, eBPF code is still running in kernel space. The difference is that it’s not Turing complete and has protections in place to make sure it can’t do anything too nasty. That being said, I’m sure you could still break something like networking or critical services by applying the wrong eBPF code. It’s on the authors of the software to make sure they thoroughly test and review their software prior to release if it’s designed to work with the kernel, especially in enterprise environments. I’m glad this is something they’re doing, though.
Security products of this nature need to be tight with the kernel in order to actually be effective (and prevent actual rootkits).
That said, the old mantra of “with great power” comes to mind…
If it had been all ones this could have been avoided.
Just needed to add 42k of ones to balance the data. Everyone knows that, like tires, you need to balance your data.
school districts were also affected… at least mine was.