I think this is the wrong way to think about it. For the type of organizations t...

PaulHoule · on Nov 28, 2022

The reliability is legendary.

Every calculation in the CPU is replicated. If a CPU shows any sign of failure, the system will try to migrate threads off the failing CPU to other CPUs.

DRAM is RAIDed.

There is a disaster recovery capability that can replicate several data centers within a 70 km range via optical fiber. If one of them burns, gets flooded, or is hit with a nuke, the others will pick up the slack automatically.
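
To make the "replicate the calculation, then move work off a failing CPU" idea concrete, here is a toy sketch in Python. It is only an illustration of duplicate-and-compare, not how the hardware does it (on a real machine this happens in silicon and firmware). The names `run_checked`, `healthy_unit` and `make_flaky_unit` are made up for this example.

```python
from typing import Callable, Sequence, Tuple


class NoHealthyUnit(RuntimeError):
    """Raised when every unit produced two results that disagree."""


def run_checked(task: Callable[[], int],
                units: Sequence[Callable[[Callable[[], int]], int]]) -> Tuple[int, int]:
    """Run `task` twice on each unit in turn and compare the two results.

    The first unit whose two runs agree wins; a mismatch is treated as a
    sign of failure and the work moves on to the next unit.
    Returns (result, index of the unit that produced it).
    """
    for index, unit in enumerate(units):
        first = unit(task)
        second = unit(task)
        if first == second:
            return first, index
        # Results disagree: assume this unit is failing and try the next one.
    raise NoHealthyUnit("all units produced inconsistent results")


def healthy_unit(task: Callable[[], int]) -> int:
    """A hypothetical unit that always computes correctly."""
    return task()


def make_flaky_unit() -> Callable[[Callable[[], int]], int]:
    """A hypothetical unit that flips the low bit on every second run."""
    calls = 0

    def unit(task: Callable[[], int]) -> int:
        nonlocal calls
        calls += 1
        result = task()
        return result ^ 1 if calls % 2 == 0 else result

    return unit


if __name__ == "__main__":
    result, used = run_checked(lambda: 6 * 7, [make_flaky_unit(), healthy_unit])
    print(result, used)
```

The point of the sketch is that a wrong answer never escapes: a disagreement between the two runs is caught before the result is used, and the work lands on a unit that still agrees with itself.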

riskable · on Nov 28, 2022

> The reliability is legendary.

Not at the OS level. Back when I was doing penetration testing, nearly every organization that had IBM mainframes would suffer pretty severe outages just from our basic scans and things like checking for open ports. They were also super duper easy to break into 90% of the time.

Also, most of the software running on mainframes has been running for decades, which means they've had 40+ years to work out all the bugs. I'm 100% certain that if you took any given "modern" software stack (take your pick!) and very carefully applied patches to it for 40 years without ever adding any major new features, it would be just as reliable.

reacharavindh · on Nov 28, 2022

I spent a few good early years working at the system level on z/OS, z/VM, and the mainframe hardware from the z10 era.

The reliability is indeed legendary for the use cases it was originally designed for (running z/OS or TPF, IBM DB2 on z/OS, CICS, and COBOL batch jobs). However, IBM marketing folks will try to sell you on specialty processors that can run Java applications, Linux VMs (s390 arch), etc. - that’s where the reliability goes off the rails.

Most serious mainframe users I worked with had legacy applications that did one of the original use cases I mentioned, and those run reliably. At the hardware level, redundancy is engineered in at every level. You can hot plug CPUs like blades on a running system for maintenance, and replace them. Same with memory modules, storage devices, etc. Upgrades to z/OS are also so thoroughly documented that you can avoid downtime or plan for a minimal amount of it.

nuc1e0n · on Nov 28, 2022

Using Xen (https://xenproject.org/), you can live migrate running virtual machines from one physical computer to another on commodity PC hardware while they continue to serve requests (no downtime). You can then turn off the computer the VM was originally running on, upgrade it, then migrate the VM back to it again. I did this once while pinging the VM from another machine. It didn't even drop any packets. My jaw dropped though.

jacooper · on Nov 29, 2022

Proxmox also does this

PaulHoule · on Nov 28, 2022

The "specialized processor" is different microcode for the same CPU. My understanding is that this microcode has a few instructions disabled so z/OS won't run, but doesn't really do anything special for Java the way something like Jazelle does:

https://en.wikipedia.org/wiki/Jazelle

The point is that Java or Linux workloads could be run on some other CPU, so they face competition, but this is not the case for z/OS. Thus there is a reason to lower the price for workloads that could be easily migrated but keep it high for captives.

not_me_ever · on Nov 30, 2022

Yes, they do crash, but they never produce wrong results, and that is exactly what they were built to do. And don't forget: these things pre-date the internet, so the idea of random strangers having access to the network was unthinkable.

Back in the day all the essential software, e.g. flight control, which still largely runs on S/360 today, had to be proven to be correct. There are mathematical concepts/processes that allow that.

The firmware is proven to work correctly. The compiler is proven to work correctly. The OS is proven .... I guess you get where this goes.

The problem is: nobody today even wants to learn "how to prove software" anymore. I tried to teach a class at my university a few years back, and 12 out of 12 students dropped out in the first 3 lessons. My usual dropout rate is close to 0%.

For non-essential software you are probably right. After 60 years of development all the bugs are gone. Plus the developers know their machines (hardware & software) inside out. No chasing after ever-changing platforms, standards, APIs & SDKs.

feet · on Nov 28, 2022

Is security the same thing as reliability?

riskable · on Nov 28, 2022

Not "the same thing" but security is a HUGELY IMPORTANT component under the umbrella of "reliability".

If your stuff isn't secure, how could you possibly make claims that it's reliable? If I said "resilient" (synonym for "reliable") instead, would it make more sense (in the scope of IT stuff)?

CRConrad · on Nov 29, 2022

But is security even "under the umbrella of 'reliability'"?

Depends (like pretty much everything else) on your definitions, I guess.

I could see defining "security" and "reliability" along orthogonal axes:

"Sure, it's a bit leaky, but it's very, very reliable -- including reliably leaky."

rantingdemon · on Nov 29, 2022

At a high level, security can be thought of as the CIA triad:

Confidentiality, Integrity, Availability.

ninefathom · on Nov 28, 2022

I can second the legacy and reliability assertions. My experience with z/Arch was more recent, and it's a very different world from the modern cloud CI/CD Linux git Dev[Sec]Ops thing, but the reliability and legacy support immediately impressed me. A single z/Arch rack is basically a cloud in a box - get it power and network connectivity and you're golden. Other vendors can offer that though. The legacy support is another story; I watched an NGINX proxy and a custom financial reporting app from the late 70s running side by side on the same box, and that's when I understood the additional dollars.