From: cross@spitfire.i.gajendra.net   
      
   In article <10esrru$1qu6$1@dont-email.me>,   
   Simon Clubley wrote:   
   >On 2025-11-07, Dan Cross wrote:   
   >> In article <10eaaqr$2sqg0$1@dont-email.me>,   
   >> Simon Clubley wrote:   
   >>>On 2025-10-30, Arne Vajhøj wrote:   
   >>>> On 10/30/2025 9:12 AM, Simon Clubley wrote:   
   >>>>> On 2025-10-30, gcalliet wrote:   
   >>>>>> It seems now, because the strategy used by VSI or its investor has been   
   >>>>>> for ten years a strategy copied on strategies for legacies OS (like   
   >>>>>> z/os...), the option of a VMS revival as an alternate OS solution is   
   >>>>>> almost dead.   
   >>>>>   
   >>>>> z/OS is responsible for keeping a good portion of today's world running.   
   >>>>> I would hardly call that a legacy OS.   
   >>>>   
   >>>> z/OS is still used for a lot of very important systems.   
   >>>>   
   >>>> But it is also an OS that companies are actively   
   >>>> moving away from.   
   >>>>   
   >>>   
   >>>Interesting. I can see how some people on the edges might be considering   
   >>>such a move, but at the very core of the z/OS world are companies that   
   >>>I thought such a move would be absolutely impossible to consider.   
   >>>   
   >>>What are they moving to, and how are they satisfying the extremely high   
   >>>constraints both on software and hardware availability, failure detection,   
   >>>and recovery that z/OS and its underlying hardware provides ?   
   >>>   
   >>>z/OS has a unique set of capabilities when it comes to the absolutely   
   >>>critical this _MUST_ continue working or the country/company dies area.   
   >>   
   >> I'm curious: what, in your view, are those capabilities?   
   >   
   >That's a good question. I am hard pressed to identify one single feature,   
   >but can identify a range of features, that when combined together, help to   
   >produce a solid robust system for mission critical computing.   
   >   
   >For example, I like the predictive failure analysis capabilities (and I wish   
   >VMS had something like that).   
      
   This is certainly an area where other systems lag behind, but as   
   x86 systems (for example) increase support for RAS and MCA/MCAX,   
   they are rapidly catching up. The SoCs and interconnects have   
   the capability at the hardware level, but the software is not   
   there (Linux in particular was lagging the last time I looked   
   closely).   
      
   >I like the multiple levels of hardware failure detection and automatic   
   >recovery without system downtime.   
      
    Fair, but this is not unique to IBM or even mainframes; most
    server-grade systems support automatically offlining storage
    devices, as well as hotplug; some also support this for CPUs
    and/or DRAM.
      
   However, I would argue that this speaks to a system view that   
   was becoming obsolete, but is (perhaps ironically) coming back   
   into fashion.   
      
    A couple of decades ago, the realization was that, for certain
    workloads, you were better off providing availability by
    horizontal scaling and by building availability into software
    at the application level. If a machine fell over and took out
    a job, oh well; just restart it on another node. No need for
    the complexity of handling that on a single node.
      
    Google, for instance, did this somewhat famously for web search,
    where regularly indexing (essentially) the entire world wide web
    was required. The MapReduce framework put the self-healing into
    the job/sharding layer: if a shard was being slow, MR just
    restarted it. This ended up pervading the software stack, to
    the point that regular maintenance jobs (for instance, to update
    software) would just restart the machine regardless of what jobs
    were running on it; the Borg scheduler would just spin them up
    elsewhere, and whatever framework they were using would
    coordinate things appropriately.
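
    That restart-on-failure strategy fits in a few lines of code,
    which is much of its appeal. The following is a hypothetical
    Python toy (run_shard, its failure model, and the retry budget
    are all invented for illustration; this is not actual MapReduce
    or Borg code):

```python
def run_shard(shard_id, attempt):
    # Hypothetical worker: simulate a transient node failure on the
    # first attempt for even-numbered shards.
    if attempt == 0 and shard_id % 2 == 0:
        raise RuntimeError(f"node running shard {shard_id} fell over")
    return f"result-{shard_id}"

def run_job(num_shards, max_attempts=5):
    # The scheduler's entire failure-handling strategy: if a shard
    # dies (or is slow), just run it again "on another node". No
    # per-node recovery machinery at all.
    results = {}
    for shard in range(num_shards):
        for attempt in range(max_attempts):
            try:
                results[shard] = run_shard(shard, attempt)
                break
            except RuntimeError:
                continue  # restart elsewhere; forget the dead node
        else:
            raise RuntimeError(f"shard {shard} failed {max_attempts} times")
    return results

print(sorted(run_job(4).values()))
# -> ['result-0', 'result-1', 'result-2', 'result-3']
```

    All the resilience lives in the job layer; the individual node
    is treated as disposable, which is exactly the opposite of the
    mainframe's fix-it-in-place philosophy.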
      
    But note that web search is an embarrassingly parallel problem,
   which is amenable to such things. Many other workloads are not;   
   this really broke down for e.g. GCP, where you can't just knock   
   over a customer VM and restart it somewhere else with no   
   coordination.   
      
    Furthermore, as core counts increase significantly, now
    regularly exceeding 255 on high-end parts, this is becoming
    more expensive. With so many different things running on a
    single node, "just reboot" as a means of fixing things doesn't
    scale.
      
   >I like the way the hardware and z/OS and layered products software are   
   >tightly integrated into a coherent whole.   
   >   
   >I like the way the software was designed with a very tight single-minded   
   >focus on providing certain capabilities in highly demanding environments   
   >instead of in some undirected rambling evolution.   
   >   
   >I like the way the hardware and software have evolved, in a designed way,   
   >to address business needs, without becoming bloated (unlike modern software   
   >stacks). A lean system has many less failure points and less points of   
   >vulnerability than a bloated system.   
      
   I dunno, I always felt that mainframe software was bloated and   
   baroque. VTAM, ISAM, JCL...ick.   
      
   The hardware/software co-design advantage is very real however.   
   That's one reason we do hardware/software codesign at work.   
      
   >I like the whole CICS transaction functionality and failure recovery model.   
      
   As has been pointed out, this exists outside of the CICS system   
   as well. XA is pretty well standard at this point.   
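
    The heart of that model is two-phase commit: every resource
    manager votes in a prepare phase, and the transaction manager
    commits only on a unanimous yes. A toy sketch of the idea in
    Python (hypothetical classes for illustration, not the actual
    XA C interface):

```python
class Resource:
    """Toy resource manager participating in two-phase commit."""
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Phase 1: vote. A real RM durably logs its redo/undo
        # records here, so a "yes" vote is a binding promise.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled_back"

def coordinator(resources):
    """Transaction manager: commit only if every resource votes yes."""
    if all(r.prepare() for r in resources):
        for r in resources:
            r.commit()       # Phase 2: everyone voted yes
        return "committed"
    for r in resources:
        r.rollback()         # any "no" vote aborts the whole lot
    return "rolled_back"
```

    A CICS-style unit of work across a database and a message queue
    is then just coordinator([db, queue]); the same shape is what
    any XA-compliant transaction manager provides.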
      
   >>>Likewise, to replace z/OS, any replacement hardware and software must also   
   >>>have the same unique capabilities that z/OS, and the hardware it runs on,   
   >>>has. What is the general ecosystem, at both software and hardware level,   
   >>>that these people are moving to ?   
   >>   
   >> I think a bigger issue is lock-in. We _know_ how to build   
   >> performant, reliable, distributed systems. What we don't seem   
   >> able to collectively do is migrate away from 50 years of history   
   >> with proprietary technology. Mainframes work, they're reliable,   
   >> and they're low-risk. It's dealing with the ISAM, CICS, VTAM,   
   >> DB2, COBOL extensions, etc, etc, etc, that are slowing migration   
   >> off of them because that's migrating to a fundamentally   
   >> different model, which is both hard and high-risk.   
   >   
   >Question: are they low-risk because they were designed to do one thing   
   >and to do it very well in extremely demanding environments ?   
   >   
   >Are the replacements higher-risk because they are more of a generic   
   >infrastructure and the mission critical workloads need to be force-fitted   
   >into them ?   
      
   I think it's low-risk because those applications have been   
   running in production for many years, in some cases, decades;   
   they're well-tested and debugged, and the rate of change is very   
   low.   
      
   The alternatives are higher-risk because it's not just the   
   underlying OS or hardware that's changing, but the entire   
   application model.   
      
   It's my sense that so many migrate-off-the-mainframe projects   
   fail not because the mainframe is so singularly unmatched, but   
   because those projects are world-shifts, in which _everything_   
      
   [continued in next message]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   