what did you learn today? (part 2)

stock412 · Jul 19, 2024

Crowdstrike (which pretty much every damn company uses from government to transit to media to banks) there are outages globally.
This is going to be a major shitshow including for my company

View: https://twitter.com/sinnet3000/status/1814198854671368525?s=46

View: https://twitter.com/troyhunt/status/1814174010202345761?s=46

both of the above twitter threads show the extent of the shitshow

This is also literally what everyone was afraid of happening to way back during the Y2K scare.

also. Crowdstrike is now going to be referred to as crowdstroke given the amount of hell they are going to put companies thru with this.

Any emergency patch that crowdstrike pushes out will be useless for those machines stuck in a BSOD.

which means… someone needs to either guide a user over the damn phone or physically get their hands on the machine so that it can get the patch.

Marlor_AU · Jul 19, 2024

stock412 said:
which means… someone needs to either guide a user over the damn phone or physically get their hands on the machine so that it can get the patch.

For most companies the phone route is going to be complicated.

The user won't just need to be talked through a series of complex commands and operations (getting into Safe Mode, renaming a directory, rebooting, applying patch), they will also need access to the BitLocker recovery key (and disclosing that may have security implications).

The most efficient path for most end-users is likely going to be to head in to the office, while an exhausted IT admin attempts to perform the recovery process for the hundredth time.

stock412 · Jul 19, 2024

Marlor_AU said:
For most companies the phone route is going to be complicated.

The user won't just need to be talked through a series of complex commands and operations (getting into Safe Mode, renaming a directory, rebooting, applying patch), they will also need access to the BitLocker recovery key (and disclosing that may have security implications).

The most efficient path for most end-users is likely going to be to head in to the office, while an exhausted IT admin attempts to perform the recovery process for the hundredth time.

And because of covid, not every employee even lives near an office………

This is going to take weeks and maybe even months to get fully resolved. I think it’s also easier to list the companies/ governments organizations(like airlines or airports) that are NOT affected by this than those that are.

daldrich · Jul 19, 2024

Read-Only Friday is now Mandatory.

They just need to have it fixed so I can fly home in 3 weeks.

denemo · Jul 19, 2024

Reddit has renamed CrowdStrike to CrowdStroke. Seems apt.

I don't think it's confirmed that they released an update today that made this error but if it's confirmed, what kind of idiot culture releases a patch for crucial software on a Friday!!

Kilkenny · Jul 19, 2024

I only deal with a small number of systems (I only work at the department level, and parent org is not a CrowdStrike customer), but the Bitlocker complication for the recovery has given me a slight feeling of relief that whenever I set up a new machine I back up the Bitlocker keys to a couple of flash drives that I could use at boot. It seems plenty only back them up to Azure and are having problems.

Ecmaster76 · Jul 19, 2024

Looks like at least one of our vendors is affected but they aren't too critical... yet. Hurray for delayed projects

denemo · Jul 19, 2024

The problems at our company has been resolved as in sofar that I can use the VPN. No greater damage seems to have been done to my company. Everything is up and running. A few minor other companies that we integrate with are having issues on their end but it's not critical.

ramases · Jul 19, 2024

Today is a good day to run your entire engineering org on Linux. :eng101:

ramases · Jul 19, 2024

Marlor_AU said:
access to the BitLocker recovery key (and disclosing that may have security implications).

We're not directly affected, but some of our clients, or client's customers, are.

Some of them are less than thrilled about the exposure and potential for shenanigans having to perform a workforce-wide bitlocker recovery key operation on the phone will necessarily entail.

sryan2k1 · Jul 19, 2024

Our users couldn't type the recovery key in let alone do the steps needed to get the system online. If this was us we'd have to set up a mass come into the office campaign and ship a lot of laptops around. It would be very not happy.

We've stuck with Defender + zScaler thus far, although we did just get licensed for the XDR stuff (E5 security step up). Great success.

SandyTech · Jul 19, 2024

We're getting calls to provide hands to help with recovery ops for multiple places. They're not even questioning emergency rates right now.

Dzov · Jul 19, 2024

sarcastic6 said:
Looks like it's not going to be a good day for anyone using CrowdStrike. Seems like anything with it installed is stuck in a BSOD boot loop. This should be fun...

How exactly does this manage to escape their testing to then effect the entire world?

sryan2k1 · Jul 19, 2024

Really bad QA or simply it's hitting a condition that they didn't test for.

Palo alto has an option on the firewalls to delay installation of security updates until the update is X hours old. Us grizzled admins long ago learned to set that to ~12-24 hours because if the update is really super broken it gets pulled very quick. It sounds like Crowdstrike pulled the bad update in about an hour but it was too late. It also sounds like you can configure how far behind you want to be and most people don't have that set.

Really a clusterfuck all around.

kperrier · Jul 19, 2024

SandyTech said:
We're getting calls to provide hands to help with recovery ops for multiple places. They're not even questioning emergency rates right now.

Make hay while the sun shines!

Ardax · Jul 19, 2024

kperrier said:
Make hay while the sun shines!

Might even pick up some new clients while they're at it too.

SandyTech · Jul 19, 2024

Ardax said:
Might even pick up some new clients while they're at it too.

We’re trying hard not to poach, because you never know when the shoe will be on the other foot. But at the same time if they’re asking about options we are quite happy to have a conversation.

ramases · Jul 19, 2024

sryan2k1 said:
Really bad QA or simply it's hitting a condition that they didn't test for.

Or they're running way too much shit in ring 0 that really should run in user space.

Or their shit that does run in ring 0 makes stupid assumptions like "the config file I am about to read will never be a torn/partial write or otherwise corrupt". If you run stuff in the kernel that can read configs, and you don't give it enough smarts to do safe config loads[0], someone ought to take your compiler behind the shed and put it down, because it clearly hasn't deserved to have to compile that code.

Really quite puzzling how you can have modern code panic the kernel over something that supposedly is a config file; the 90ties called, they want their bugs back.

[0] Place marker file containing boot per config file you are about to read; read config file; after successfully reading it and not dumping core remove marker file. If you find a marker file from a previous boot for a config file you skip it, use a previous version, then contact a remote service if the file should also be un-ignored.

Dzov · Jul 19, 2024

ramases said:
[0] Place marker file containing boot per config file you are about to read; read config file; after successfully reading it and not dumping core remove marker file. If you find a marker file from a previous boot for a config file you skip it, use a previous version, then contact a remote service if the file should also be un-ignored.

Ah, a fail safe. That would've been nice I imagine. Though this being an antivirus does need to be hardened from someone easily deactivating it.

ramases · Jul 19, 2024

Dzov said:
Ah, a fail safe. That would've been nice I imagine. Though this being an antivirus does need to be hardened from someone easily deactivating it.

If something can write to where your IDS/IPS/Malware scanner is reading its config files from you're fucked anyway.

edited to add: As for malicious reboots, this is why you have the thing talk to a remote attestator if it should remove the filemarker or not; after all it could have also been left there by a genuine fluke accident like an ill-timed power outage.

sryan2k1 · Jul 19, 2024

There are a million ways that can go wrong though by a bad actor. It's not that simple, who knows what the signature update broke and why it causes a panic.

Paladin · Jul 19, 2024

The scammers are on the case now too, sending out emails with bogus links to domains pretending to offer fixes if you just hand over the right combo of info or money, etc. Hopefully most of the people running Crowdstroke have a company support/IT email/number they have already contacted for help by now.

We had one guy running it on his jumphost VM that he uses to get remote access for work from home. He was doing a 'getting to know you' kind of thing in case we wanted to sign up. Not sure if that is going to go past this point. Being totally honest, anyone can mess up. It's how you deal with it and own up to it that makes the difference. So far they are at least being honest and have released a fix and guides for recovery. It still sucks but at least they are not trying to shift blame.

almostinsane · Jul 19, 2024

Ardax said:
Might even pick up some new clients while they're at it too.

lol if you even think anyone is switching from Crowdstrike to another vendor because of this. MS screws up on a weekly basis and IT depts stay the course. Companies aren't switching from the #1 cyber security company in the world to some other unprovens.

IT depts will come up with some idea of a contingency for the future or say they will implement a 2nd vendor but none of that will happen.

Business as usual for IT. Put out the fire and hide the matches. Sweep it under the rug and then Monday is a new week. Point the fingers at someone else and say we did all we can and it was the right way.

sryan2k1 · Jul 19, 2024

I want to know how the failure occurred and what they're going to do to prevent it. Like why was this mystery driver involved and how did it get past QA?

Was it a bad config that caused parsing in the module to fail? Was it a bad signature that caused it to target itself or other critical parts of the OS for example?

Why wasn't it rolled out in stages?

Why did users on deferrals get affected?

Xelas · Jul 19, 2024

It's appalling how a single company holding this much power can have such poor process for handling updates/rollouts. If they can fuxk up an honest attempt at an update this badly, what's stopping them from being compromised and having malware sent out in the same way?

This is a hell of a single-point-of-failure from an infosec perspective.

At this scale, any and all updates must be forced to have a staged rollout process be default - alpha, beta, then 0.01%, then 0.1%, then 5%, then go from there. It has to be a heck of an urgent release to make them hit the panic button this quickly.

The only possibility that comes to mind is that they discovered a 0-day exploit in their software and they had to stamp it out everywhere simultaneously so that their patch could not be reverse-engineered and exploited before they managed to update every system. But then how did they manage to miss an error of this magnitude in testing? It seems to have affected every single Windows PC.

CPX · Jul 19, 2024

TIL how many companies were content with testing in production.

Paladin · Jul 19, 2024

TIL how many people will rush to make memes about something they really don't understand.

The number of memes that blame Microsoft, or compare the Crowdstrike problem to Y2K, or want to gloat about how Linux/Mac would never have this kind of problem, etc. is just kind of sad. Tons of people who are sort of on the periphery of IT enough to see that something is up and want to cash in for some kind of clout but really don't bother to actually know what is going on or put much effort into making something funny or at least mildly accurate.

SandyTech · Jul 19, 2024

For the longest time this morning one of the local news channel (who admittedly are not the sharpest cookies) were conflating last night's Azure outage with the Crowdstrike issue and were blaming this fiasco on Microsoft.

sryan2k1 · Jul 19, 2024

Our client marketing platform/provider sent us emails today about the "Microsoft / Cloudstrike ongoing outage"

Sunner · Jul 19, 2024

Oh yeah

Working on one of the few Linux-only teams in a fucknormous org
Working in a rather niche part of said team so even collateral damage is unlikely to hit me
Vacation lol

Good luck guys and girls, I'ma be thinking of you while having my whisky over here.

Ecmaster76 · Jul 19, 2024

SandyTech said:
For the longest time this morning one of the local news channel (who admittedly are not the sharpest cookies) were conflating last night's Azure outage with the Crowdstrike issue and were blaming this fiasco on Microsoft.

Going by what I heard on the radio while grabbing dinner, your use of past tense is overly optimistic

Demento · Jul 20, 2024

Sunner said:
Oh yeah

Working on one of the few Linux-only teams in a fucknormous org

Working in a rather niche part of said team so even collateral damage is unlikely to hit me

Vacation lol

Good luck guys and girls, I'ma be thinking of you while having my whisky over here.

Even as the Linux guy here, I have a few supporting systems on Windows that, while not mission critical, I had to fix. Solar Winds (yes, that's a clusterfuck in a box all on its own), the A/V for the departmental NAS, the backup proxies... Never got Linux proxies to have the same performance for Veeam.

Ligushka · Jul 20, 2024

Essentially all of our endpoints were affected, but strangely and fortunately only about 8 server VMs and a physical domain controller were affected. We were able to be operational within 45 minutes and recover the entire organization within 3 hours. Had all the servers been affected, it would've taken somewhat longer to recover. Fortunately, we have critical passwords printed in a fireproof safe, but we definitely need to ensure that BitLocker recovery keys to at least mission-critical systems are also included there.

Probably the largest impact to us is that it took out our entire upcoming new environment because of the catch-22 of inability to boot anything without BitLocker recovery keys which are only currently stored in AD and the Hyper-V hosts themselves can't boot. I guess we'll just try the 15-20 reboot "fix" and hope that it eventually works. At least that didn't happen to the production environment.

In case this is helpful to others, on Reddit somebody came up with a way to PXE boot to WinRE to apply the fix - at least if the system in question is not protected by BitLocker:
View: https://www.reddit.com/r/sysadmin/comments/1e708o0/fix_the_crowdstrike_boot_loopbsod_automatically/

ramases · Jul 20, 2024

Some of the companies affected will have a 8 to 9 digit sum of damages from this particular fun exercise just in remediation expenses, lost earnings, and stranded costs (having to pay employees who cannot work)

In the airline sector there's considerable uncertainty right now about whether EU261 compensation requirements attach to this event, as at least for those flights where airports and ATC was available and just the airline borking it, it is an arguably within the control of the airlines, as they select their vendors.
Even though they do not have the deciding voice in this obviously the FAA already classified it as a "controlable event"; ie within control of the airlines, not extraordinary circumstances beyond their control.

If European courts follow this line of thought then EU261 compensation (250 to 600 statutory compensation for each affected pax; this is in addition to hotel costs, rebooking, ...) attaches. That is another 9 (possibly 10) digit question.

There are also some noises from some insurance companies that this event is not covered by standard cyber insurance clauses, as it is a supply chain/no malicious actor incident that is either excluded or would have had to be insured separately. Even if insurance pays out, we're talking hundreds of millions/billions of dollars. Insurance companies are not really in the business of just writing down that type of losses, so if you are covered expect your rates to go up.

Meanwhile all ToS by CrowdStrike I have seen limit recoverable damages to full refund; but it kind of is besides the point anyway, because unlike somthing like Microsoft or Amazon CS simply would not be able to pay for the billions of damages it caused anyway. Market cap is not liquidity.

So yes, CS will likely survive this event. Some customers will stay, some customers will choose a different vendor. People that pronounce CS as definitely dead over this are very likely to have jumped the gun.

The same, applies, however to people that predict it'll quickly go over into "business as usual". The difference in scale to past incidents is simply too big, and too expensive, for some of the companies to just say "ah, whatever".

Given what all was affected the single point of failure aspect might also attract regulatory attention. If you think that people at national security institutions don't have a look at how much of the economy just went poof for a day, and say, "nah, that's just fine. Surely no Pavol from St. Petersburg or Winnie from Shanghai will be able to do this to us", then that's a bit unlikely.

Entegy · Jul 20, 2024

denemo said:
I don't think it's confirmed that they released an update today that made this error but if it's confirmed, what kind of idiot culture releases a patch for crucial software on a Friday!!

I once spoke to an employee of a Major Tech Company that said their deployments are scheduled on Friday because their software is business oriented so there would be less impact and more time to fix it over the weekend, when less people were using it.

Management did not give a fuck that that ruined the weekend of developers and IT staff everywhere when they had a bad update (which was more than once). Never mind that in our always-on world where very few jurisdictions have right to disconnect legislation, users of said software would also have to lose their weekend catching up on the work they could not do Friday.

kperrier · Jul 20, 2024

ramases said:
Meanwhile all ToS by CrowdStrike I have seen limit recoverable damages to full refund

You mean Clownstrike's license/ToS doesn't say the software isn't fit for any use and you use it at your own risk and they accept no responsibility for any damages? IIRC, that is in every Microsoft and open source license/EULA state this.

dredphul · Jul 20, 2024

Crowdstrike's general software quality seems to be suspect as they had breaking Linux releases in the months before they broke Windows yesterday: https://www.neowin.net/news/crowdstrike-broke-debian-and-rocky-linux-months-ago-but-no-one-noticed/

Also, the Crowdstrike founder/CEO has previous experience breaking the internet with bad updates. He was CTO at McAfee back in 2010 when they broke the internet with a bad software update (archive.md link for the businessinsider.com article due to paywalling): https://archive.md/36Nsd

Paladin · Jul 21, 2024

Yeah if anything, the heads at the top of Crowdstroke need to get a bit loose from their shoulders, employment-wise. This is enough of a pattern of events to make it feel like a symptom of culture. Unless there were some very odd circumstances in the chain of events that led to this situation, this was entirely avoidable and should not have happened on such a wide scale. I can see taking down a small crowd of the 'give me the updates now' crowd but they should default their users to a 'give me the updates that are really safe' segment. I get that it was a 'rules' update type thing, not a new software release but the difference is academic if your rules update can bork the OS that way.

Klockwerk · Jul 21, 2024

Crowdstrike have put out an initial incident review on what happened - https://www.crowdstrike.com/blog/falcon-update-for-windows-hosts-technical-details/

koala · Jul 21, 2024

That document is a nothingburger. I want to know their rollout policy and if this update was an exception. And if it it was, why.

what did you learn today? (part 2)

Smack-Fu Master, in training

Ars Tribunus Angusticlavius

Smack-Fu Master, in training

Ars Tribunus Angusticlavius

Ars Scholae Palatinae

Ars Praefectus

Ars Tribunus Angusticlavius

Ars Scholae Palatinae

Ars Tribunus Angusticlavius

Ars Tribunus Angusticlavius

Ars Legatus Legionis

Ars Legatus Legionis

Ars Legatus Legionis

Ars Legatus Legionis

Ars Legatus Legionis

Ars Legatus Legionis

Ars Legatus Legionis

Ars Tribunus Angusticlavius

Ars Legatus Legionis

Ars Tribunus Angusticlavius

Ars Legatus Legionis

Ars Legatus Legionis

Ars Centurion

Ars Legatus Legionis

Ars Praefectus

Ars Legatus Legionis

Ars Legatus Legionis

Ars Legatus Legionis

Ars Legatus Legionis

Ars Praefectus

Ars Tribunus Angusticlavius

Ars Legatus Legionis

Ars Praefectus

Ars Tribunus Angusticlavius

Ars Legatus Legionis

Ars Legatus Legionis

Ars Praefectus

Ars Legatus Legionis

Ars Praefectus

Ars Tribunus Angusticlavius